Dataset: common_language
Task: audio classification
Multilinguality: multilingual
Size: 100K<n<1M
Language creators: crowdsourced
Annotation creators: crowdsourced
Source datasets: extended|common_voice
License: cc-by-4.0

This dataset is composed of speech recordings from languages that were carefully selected from the CommonVoice database. The total duration of the audio recordings is 45.1 hours (i.e., roughly 1 hour of material for each of the 45 languages). The dataset was extracted from CommonVoice to train language-id systems.
The baselines for language-id are available in the SpeechBrain toolkit (see recipes/CommonLanguage): https://github.com/speechbrain/speechbrain
List of included languages:
Arabic, Basque, Breton, Catalan, Chinese_China, Chinese_Hongkong, Chinese_Taiwan, Chuvash, Czech, Dhivehi, Dutch, English, Esperanto, Estonian, French, Frisian, Georgian, German, Greek, Hakha_Chin, Indonesian, Interlingua, Italian, Japanese, Kabyle, Kinyarwanda, Kyrgyz, Latvian, Maltese, Mongolian, Persian, Polish, Portuguese, Romanian, Romansh_Sursilvan, Russian, Sakha, Slovenian, Spanish, Swedish, Tamil, Tatar, Turkish, Ukranian, Welsh
A typical data point comprises the path to the audio file and its label, language. Additional fields include age, client_id, gender, and sentence.
    {
      'client_id': 'itln_trn_sp_175',
      'path': '/path/common_voice_kpd/Italian/train/itln_trn_sp_175/common_voice_it_18279446.wav',
      'audio': {
        'path': '/path/common_voice_kpd/Italian/train/itln_trn_sp_175/common_voice_it_18279446.wav',
        'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32),
        'sampling_rate': 48000
      },
      'sentence': 'Con gli studenti è leggermente simile.',
      'age': 'not_defined',
      'gender': 'not_defined',
      'language': 22
    }
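As a small illustration of how the audio entry can be used, the sketch below derives a clip's duration from its sample array and sampling rate. The function name `clip_duration_seconds` and the stand-in record are hypothetical; a plain Python list replaces the NumPy array so the snippet has no dependencies, and the stand-in path is a placeholder rather than a real file.

```python
def clip_duration_seconds(audio: dict) -> float:
    """Return the clip length in seconds: number of samples / sampling rate."""
    sampling_rate = audio["sampling_rate"]
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be positive")
    return len(audio["array"]) / sampling_rate


# Hypothetical stand-in for the 'audio' field of a data point:
# 96000 silent samples at the 48000 Hz rate shown in the example record.
example_audio = {
    "path": "example.wav",  # placeholder, not a file from the dataset
    "array": [0.0] * 96000,
    "sampling_rate": 48000,
}

print(clip_duration_seconds(example_audio))
```

The same division applies to a decoded record, since `len()` works on a one-dimensional NumPy array as well.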
client_id (string): An ID for the client (voice) that made the recording.
path (string): The path to the audio file.
language (ClassLabel): The language of the recording (see the list of included languages above).
sentence (string): The sentence the user was prompted to speak.
age (string): The age of the speaker.
gender (string): The gender of the speaker.
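Because language is stored as an integer class label, it can help to map it back to a name. The sketch below assumes that label ids are 0-based and follow the order of the language list in this card; that assumption is consistent with the example record, where an Italian clip carries the label 22, but it is not confirmed here for every entry. The helper names `id_to_language` and `language_to_id` are hypothetical, and the names are copied as listed above (the released label strings may differ in spelling).

```python
# Language names in the order given in this card.
# Assumption: ClassLabel ids are 0-based indices into this list.
LANGUAGES = [
    "Arabic", "Basque", "Breton", "Catalan", "Chinese_China",
    "Chinese_Hongkong", "Chinese_Taiwan", "Chuvash", "Czech", "Dhivehi",
    "Dutch", "English", "Esperanto", "Estonian", "French",
    "Frisian", "Georgian", "German", "Greek", "Hakha_Chin",
    "Indonesian", "Interlingua", "Italian", "Japanese", "Kabyle",
    "Kinyarwanda", "Kyrgyz", "Latvian", "Maltese", "Mongolian",
    "Persian", "Polish", "Portuguese", "Romanian", "Romansh_Sursilvan",
    "Russian", "Sakha", "Slovenian", "Spanish", "Swedish",
    "Tamil", "Tatar", "Turkish", "Ukranian", "Welsh",
]


def id_to_language(label_id: int) -> str:
    """Map an integer label to its language name."""
    if not 0 <= label_id < len(LANGUAGES):
        raise ValueError(f"label id out of range: {label_id}")
    return LANGUAGES[label_id]


def language_to_id(name: str) -> int:
    """Map a language name (as spelled in the list) to its integer label."""
    return LANGUAGES.index(name)


print(id_to_language(22))  # Italian, matching the example record
```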
The dataset is already balanced and split into train, dev (validation) and test sets.
Name | Train | Dev | Test |
---|---|---|---|
# of utterances | 177552 | 47104 | 47704 |
# unique speakers | 11189 | 1297 | 1322 |
Total duration, hr | 30.04 | 7.53 | 7.53 |
Min duration, sec | 0.86 | 0.98 | 0.89 |
Mean duration, sec | 4.87 | 4.61 | 4.55 |
Max duration, sec | 21.72 | 105.67 | 29.83 |
Duration per language, min | ~40 | ~10 | ~10 |
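
The duration figures in the table can be cross-checked with a little arithmetic. The sketch below sums the per-split totals and divides each by the 45 listed languages; the variable names are illustrative, and the only inputs are the "Total duration, hr" row and the language count.

```python
# Total duration per split in hours, taken from the table above.
SPLIT_HOURS = {"train": 30.04, "dev": 7.53, "test": 7.53}
NUM_LANGUAGES = 45  # number of entries in the list of included languages

# Overall duration across all splits, in hours.
total_hours = sum(SPLIT_HOURS.values())

# Average minutes per language in each split, assuming an even balance.
minutes_per_language = {
    split: hours * 60 / NUM_LANGUAGES for split, hours in SPLIT_HOURS.items()
}

print(f"total: {total_hours:.2f} h")
for split, minutes in minutes_per_language.items():
    print(f"{split}: {minutes:.1f} min per language")
```

The results line up with the stated 45.1 hours overall and with the approximate 40/10/10 minutes per language in the last row of the table.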
The dataset consists of recordings from people who have donated their voice online. You agree not to attempt to determine the identity of speakers in the Common Voice dataset.
The Mongolian and Ukrainian languages are spelled as "Mangolian" and "Ukranian" in this version of the dataset.
Ganesh Sinisetty; Pavlo Ruban; Oleksandr Dymov; Mirco Ravanelli
Creative Commons Attribution 4.0 International
    @dataset{ganesh_sinisetty_2021_5036977,
      author    = {Ganesh Sinisetty and Pavlo Ruban and Oleksandr Dymov and Mirco Ravanelli},
      title     = {CommonLanguage},
      month     = jun,
      year      = 2021,
      publisher = {Zenodo},
      version   = {0.1},
      doi       = {10.5281/zenodo.5036977},
      url       = {https://doi.org/10.5281/zenodo.5036977}
    }
Thanks to @anton-l for adding this dataset.