Dataset: MLCommons/ml_spoken_words
Tasks: audio classification
Multilinguality: multilingual
Size: 10M<n<100M
Language creators: other
Annotation creators: machine-generated
Source datasets: extended|common_voice
License: cc-by-4.0

Multilingual Spoken Words Corpus is a large and growing audio dataset of spoken words in 50 languages, collectively spoken by over 5 billion people. It is intended for academic research and commercial applications in keyword spotting and spoken term search, and is licensed under CC-BY 4.0. The dataset contains more than 340,000 keywords, totaling 23.4 million 1-second spoken examples (over 6,000 hours). Use cases range from voice-enabled consumer devices to call center automation. The dataset is generated by applying forced alignment to crowd-sourced sentence-level audio, which produces per-word timing estimates used for extraction. All alignments are included in the dataset.
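To make the extraction step more concrete, here is a minimal sketch of how a per-word timing estimate could be turned into a fixed 1-second clip. The function name `extract_keyword_clip`, the choice to center the clip on the word, and the zero-padding at the edges are all assumptions made for illustration; the corpus's actual extraction procedure is not described in this card.

```python
import numpy as np


def extract_keyword_clip(audio, sampling_rate, word_start_s, word_end_s, clip_s=1.0):
    """Cut a fixed-length clip centered on an aligned word.

    Hypothetical helper: centering and zero-padding are assumptions,
    not the documented behavior of the corpus pipeline.
    """
    clip_len = int(round(clip_s * sampling_rate))
    # Midpoint of the aligned word, in samples.
    center = int(round((word_start_s + word_end_s) / 2 * sampling_rate))
    start = center - clip_len // 2
    end = start + clip_len

    # Start from silence and copy in whatever part of the window exists.
    clip = np.zeros(clip_len, dtype=audio.dtype)
    src_start = max(start, 0)
    src_end = min(end, len(audio))
    if src_end > src_start:
        clip[src_start - start:src_end - start] = audio[src_start:src_end]
    return clip


# A 3-second sentence at 16 kHz with a word aligned to 0.1 s - 0.3 s.
sentence = np.arange(1, 48001, dtype=np.float32)
clip = extract_keyword_clip(sentence, 16000, 0.1, 0.3)
print(len(clip))  # 16000
```

Because the word sits near the start of the sentence, the first part of the clip is padding and the rest is copied from the sentence audio.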
Data is provided in two formats: wav (16 kHz) and opus (48 kHz). Default configurations are named "{lang}_{format}", so to load, for example, Tatar in wav format, do:
from datasets import load_dataset
ds = load_dataset("MLCommons/ml_spoken_words", "tt_wav")
To download multiple languages in a single dataset, pass a list of languages to the languages argument:
ds = load_dataset("MLCommons/ml_spoken_words", languages=["ar", "tt", "br"])
To download a specific format, pass it to the format argument (the default format is wav):
ds = load_dataset("MLCommons/ml_spoken_words", languages=["ar", "tt", "br"], format="opus")
Note that each distinct set of languages creates a new custom configuration, so examples are generated from scratch even if some of those languages were loaded before. The data itself is not redownloaded.
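The "{lang}_{format}" naming scheme described above can be captured in a small helper. This is only a sketch: `config_name` and `SUPPORTED_FORMATS` are hypothetical names introduced here, not part of the dataset's loading script.

```python
# The two formats named in this card; the tuple itself is an illustrative assumption.
SUPPORTED_FORMATS = ("wav", "opus")


def config_name(lang: str, fmt: str = "wav") -> str:
    """Build a default configuration name of the form "{lang}_{format}"."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unknown format {fmt!r}; expected one of {SUPPORTED_FORMATS}")
    return f"{lang}_{fmt}"


print(config_name("tt"))          # tt_wav
print(config_name("br", "opus"))  # br_opus

# Usage (requires the `datasets` package and network access):
# from datasets import load_dataset
# ds = load_dataset("MLCommons/ml_spoken_words", config_name("tt"))
```

Validating the format before building the name surfaces a typo such as "ogg" immediately, rather than as a missing-configuration error at load time.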
Keyword spotting, Spoken term search
The dataset is multilingual. To download several languages, pass a list of them to the languages argument:
ds = load_dataset("MLCommons/ml_spoken_words", languages=["ar", "tt", "br"])
The dataset contains data for the following languages:
Low-resourced (<10 hours):
Medium-resourced (>10 & <100 hours):
High-resourced (>100 hours):
{
  'file': 'абзар_common_voice_tt_17737010.opus',
  'is_valid': True,
  'language': 0,
  'speaker_id': '687025afd5ce033048472754c8d2cb1cf8a617e469866bbdb3746e2bb2194202094a715906f91feb1c546893a5d835347f4869e7def2e360ace6616fb4340e38',
  'gender': 0,
  'keyword': 'абзар',
  'audio': {
    'path': 'абзар_common_voice_tt_17737010.opus',
    'array': array([2.03458695e-34, 2.03458695e-34, 2.03458695e-34, ..., 2.03458695e-34, 2.03458695e-34, 2.03458695e-34]),
    'sampling_rate': 48000
  }
}
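As a rough illustration of how such an instance might be consumed, the sketch below computes the clip duration from the audio array and its sampling rate. The helper `clip_duration_seconds` is a hypothetical name, and the example dict uses a placeholder array of 48,000 zeros standing in for a decoded clip, not real data from the corpus.

```python
import numpy as np


def clip_duration_seconds(example):
    """Return the clip length in seconds: number of samples / sampling rate."""
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


# Placeholder instance with the same fields as above; the array is synthetic.
example = {
    "keyword": "абзар",
    "is_valid": True,
    "audio": {
        "array": np.zeros(48000, dtype=np.float32),
        "sampling_rate": 48000,
    },
}

print(clip_duration_seconds(example))  # 1.0
```

A check like this is a cheap way to confirm that opus examples (48 kHz) and wav examples (16 kHz) both decode to the expected 1-second length.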
The data for each language is split into train / validation / test parts.
[More Information Needed]
The data comes from the Common Voice dataset.
Who are the source language producers?
[More Information Needed]
[More Information Needed]
Who are the annotators?
[More Information Needed]
The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers.
[More Information Needed]
The dataset is licensed under CC-BY 4.0 and can be used for academic research and commercial applications in keyword spotting and spoken term search.
@inproceedings{mazumder2021multilingual,
  title={Multilingual Spoken Words Corpus},
  author={Mazumder, Mark and Chitlangia, Sharad and Banbury, Colby and Kang, Yiping and Ciro, Juan Manuel and Achorn, Keith and Galvez, Daniel and Sabini, Mark and Mattson, Peter and Kanter, David and others},
  booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)},
  year={2021}
}
Thanks to @polinaeterna for adding this dataset.