Dataset:
indonesian-nlp/librivox-indonesia
The LibriVox Indonesia dataset consists of MP3 audio files and corresponding text files that we generated from the public-domain LibriVox audiobooks. We collected only audiobooks in languages spoken in Indonesia for this dataset. The original LibriVox audiobooks or sound files vary in duration from a few minutes to a few hours. Each audio file in the speech dataset now lasts from a few seconds to a maximum of 20 seconds.
We converted the audiobooks to speech datasets using forced alignment software that we developed. It supports multiple languages, including low-resource languages such as Acehnese, Balinese, and Minangkabau. It can also be used for other languages without additional model training.
The dataset currently consists of 8 hours of audio in 7 languages from Indonesia. We will add more languages and audio files as we collect them.
Acehnese, Balinese, Buginese, Indonesian, Minangkabau, Javanese, Sundanese
A typical data point comprises the path to the audio file and its sentence. Additional fields include reader and language.
{
  'path': 'librivox-indonesia/sundanese/universal-declaration-of-human-rights/human_rights_un_sun_brc_0000.mp3',
  'language': 'sun',
  'reader': '3174',
  'sentence': 'pernyataan umum ngeunaan hak hak asasi manusa sakabeh manusa',
  'audio': {
    'path': 'librivox-indonesia/sundanese/universal-declaration-of-human-rights/human_rights_un_sun_brc_0000.mp3',
    'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32),
    'sampling_rate': 44100
  },
}
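The 20-second limit on clip length can be checked from the `array` and `sampling_rate` keys of the `audio` entry, since the duration is the number of samples divided by the sampling rate. The following is a minimal sketch under stated assumptions: the helper `clip_duration_seconds`, the constant `MAX_CLIP_SECONDS`, and the synthetic silent clip are illustrative and not part of the dataset or its tooling.

```python
# Upper bound on clip length, taken from the dataset description above.
MAX_CLIP_SECONDS = 20.0


def clip_duration_seconds(audio: dict) -> float:
    """Return the clip length in seconds from a decoded `audio` dict.

    Hypothetical helper: it only assumes the `array` and `sampling_rate`
    keys shown in the example data point.
    """
    return len(audio["array"]) / audio["sampling_rate"]


# Synthetic stand-in for a decoded clip: 3 seconds of silence at 44.1 kHz.
example_audio = {
    "path": "example.mp3",  # placeholder path, not a real file in the dataset
    "array": [0.0] * (44100 * 3),
    "sampling_rate": 44100,
}

duration = clip_duration_seconds(example_audio)
within_limit = duration <= MAX_CLIP_SECONDS
print(f"{duration:.2f} s, within 20 s limit: {within_limit}")
# prints "3.00 s, within 20 s limit: True"
```

The same function should work on a real decoded sample, because a NumPy array also supports `len()`.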
path (string): The path to the audio file.
language (string): The language of the audio file.
reader (string): The reader ID in LibriVox.
sentence (string): The sentence the reader read from the book.
audio (dict): A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Note that when the audio column is accessed via dataset[0]["audio"], the audio file is automatically decoded and resampled to dataset.features["audio"].sampling_rate. Decoding and resampling a large number of audio files can take a significant amount of time, so it is important to query the sample index before the "audio" column: dataset[0]["audio"] should always be preferred over dataset["audio"][0].
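The cost difference between the two access orders can be illustrated with a toy model. The sketch below does not use the real `datasets` library: `LazyAudioDataset` is a hypothetical stand-in that decodes on access and counts decode calls, assuming, as described above, that indexing the whole "audio" column decodes every file.

```python
class LazyAudioDataset:
    """Toy stand-in (not the real `datasets.Dataset`) that counts decode calls."""

    def __init__(self, paths):
        self._paths = list(paths)
        self.decode_calls = 0

    def _decode(self, path):
        # Placeholder for MP3 decoding and resampling, the expensive step.
        self.decode_calls += 1
        return {"path": path, "array": [0.0, 0.0], "sampling_rate": 44100}

    def __len__(self):
        return len(self._paths)

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: decode only the requested file.
            path = self._paths[key]
            return {"path": path, "audio": self._decode(path)}
        if key == "audio":
            # Column access: decode every file before any indexing happens.
            return [self._decode(p) for p in self._paths]
        if key == "path":
            return list(self._paths)
        raise KeyError(key)


ds = LazyAudioDataset([f"clip_{i:04d}.mp3" for i in range(1000)])

_ = ds[0]["audio"]                   # row first, then column
row_first_calls = ds.decode_calls    # 1 decode

ds.decode_calls = 0
_ = ds["audio"][0]                   # column first, then row
column_first_calls = ds.decode_calls # 1000 decodes

print(row_first_calls, column_first_calls)  # prints "1 1000"
```

Both expressions return the same clip, but in this model the column-first form pays for decoding all 1000 files to obtain one.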
The speech material has only a train split.
Who are the source language producers? [Needs More Information]
Who are the annotators? [Needs More Information]
Public Domain, CC-0