数据集:

anton-l/common_language

计算机处理:

multilingual

大小:

100K<n<1M

语言创建人:

crowdsourced

批注创建人:

crowdsourced
中文

Dataset Card for common_language

Dataset Summary

This dataset is composed of speech recordings from languages that were carefully selected from the CommonVoice database. The total duration of audio recordings is 45.1 hours (i.e., 1 hour of material for each language). The dataset has been extracted from CommonVoice to train language-id systems.

Supported Tasks and Leaderboards

The baselines for language-id are available in the SpeechBrain toolkit (see recipes/CommonLanguage): https://github.com/speechbrain/speechbrain

Languages

List of included language:

Arabic, Basque, Breton, Catalan, Chinese_China, Chinese_Hongkong, Chinese_Taiwan, Chuvash, Czech, Dhivehi, Dutch, English, Esperanto, Estonian, French, Frisian, Georgian, German, Greek, Hakha_Chin, Indonesian, Interlingua, Italian, Japanese, Kabyle, Kinyarwanda, Kyrgyz, Latvian, Maltese, Mongolian, Persian, Polish, Portuguese, Romanian, Romansh_Sursilvan, Russian, Sakha, Slovenian, Spanish, Swedish, Tamil, Tatar, Turkish, Ukranian, Welsh

Dataset Structure

Data Instances

A typical data point comprises the path to the audio file, and its label language . Additional fields include age , client_id , gender and sentence .

{
  'client_id': 'itln_trn_sp_175',
  'path': '/path/common_voice_kpd/Italian/train/itln_trn_sp_175/common_voice_it_18279446.wav',
  'sentence': 'Con gli studenti è leggermente simile.',
  'age': 'not_defined',
  'gender': 'not_defined',
  'language': 22
}

Data Fields

client_id ( string ): An id for which client (voice) made the recording

path ( string ): The path to the audio file

language ( ClassLabel ): The language of the recording (see the Languages section above)

sentence ( string ): The sentence the user was prompted to speak

age ( string ): The age of the speaker.

gender ( string ): The gender of the speaker

Data Splits

The dataset is already balanced and split into train, dev (validation) and test sets.

Name Train Dev Test
# of utterances 177552 47104 47704
# unique speakers 11189 1297 1322
Total duration, hr 30.04 7.53 7.53
Min duration, sec 0.86 0.98 0.89
Mean duration, sec 4.87 4.61 4.55
Max duration, sec 21.72 105.67 29.83
Duration per language, min ~40 ~10 ~10

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in the Common Voice dataset.

Considerations for Using the Data

Social Impact of Dataset

The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in the Common Voice dataset.

Discussion of Biases

More Information Needed

Other Known Limitations

The Mongolian and Ukrainian languages are spelled as "Mangolian" and "Ukranian" in this version of the dataset.

More Information Needed

Additional Information

Dataset Curators

More Information Needed

Licensing Information

More Information Needed

Citation Information

@dataset{ganesh_sinisetty_2021_5036977,
  author       = {Ganesh Sinisetty and
                  Pavlo Ruban and
                  Oleksandr Dymov and
                  Mirco Ravanelli},
  title        = {CommonLanguage},
  month        = jun,
  year         = 2021,
  publisher    = {Zenodo},
  version      = {0.1},
  doi          = {10.5281/zenodo.5036977},
  url          = {https://doi.org/10.5281/zenodo.5036977}
}

Contributions

Thanks to @anton-l for adding this dataset.