Dataset: multilingual_librispeech
Multilinguality: multilingual
Size: 100K<n<1M
Annotation creators: expert-generated
Source datasets: original
Preprint: arxiv:2012.03411
License: cc-by-4.0

Deprecated: This legacy dataset doesn't support streaming and is not updated. Use "facebook/multilingual_librispeech" instead.
The Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. It is derived from read audiobooks from LibriVox and covers 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, and Polish.
A typical data point comprises the path to the audio file, called file, and its transcription, called text. Some additional information about the speaker and the passage that contains the transcription is also provided.
{'chapter_id': 141231, 'file': '/home/patrick/.cache/huggingface/datasets/downloads/extracted/b7ded9969e09942ab65313e691e6fc2e12066192ee8527e21d634aca128afbe2/dev_clean/1272/141231/1272-141231-0000.flac', 'audio': {'path': '/home/patrick/.cache/huggingface/datasets/downloads/extracted/b7ded9969e09942ab65313e691e6fc2e12066192ee8527e21d634aca128afbe2/dev_clean/1272/141231/1272-141231-0000.flac', 'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32), 'sampling_rate': 16000}, 'id': '1272-141231-0000', 'speaker_id': 1272, 'text': 'A MAN SAID TO THE UNIVERSE SIR I EXIST'}
file: A path to the downloaded audio file in .flac format.
audio: A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. When the audio column is accessed, e.g. dataset[0]["audio"], the audio file is automatically decoded and resampled to dataset.features["audio"].sampling_rate. Decoding and resampling a large number of audio files can take a significant amount of time, so it is important to query the sample index before the "audio" column: dataset[0]["audio"] should always be preferred over dataset["audio"][0].
text: the transcription of the audio file.
id: unique id of the data sample.
speaker_id: unique id of the speaker. The same speaker id can be found for multiple data samples.
chapter_id: id of the audiobook chapter which includes the transcription.
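The access-order advice for the audio field can be illustrated without downloading anything. The sketch below is a toy stand-in, not the `datasets` implementation: `LazyAudioDataset` and its `decode_calls` counter are hypothetical names that only mimic the behavior described in the field list, where indexing a row decodes one file and indexing the whole column decodes every file.

```python
class LazyAudioDataset:
    """Toy model of a dataset whose "audio" column is decoded on access.

    Hypothetical class for illustration only; it is not part of the
    `datasets` library.
    """

    def __init__(self, paths):
        self.paths = list(paths)
        self.decode_calls = 0  # number of files "decoded" so far

    def _decode(self, path):
        # Stand-in for reading and resampling a .flac file.
        self.decode_calls += 1
        return {"path": path, "array": [0.0], "sampling_rate": 16000}

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: decode only the requested sample.
            return {"audio": self._decode(self.paths[key])}
        if key == "audio":
            # Column access: every file is decoded before indexing.
            return [self._decode(p) for p in self.paths]
        raise KeyError(key)


# Made-up file names; no audio is actually read.
paths = [f"sample-{i:04d}.flac" for i in range(1000)]

row_first = LazyAudioDataset(paths)
_ = row_first[0]["audio"]
print(row_first.decode_calls)  # a single file was decoded

column_first = LazyAudioDataset(paths)
_ = column_first["audio"][0]
print(column_first.decode_calls)  # every file was decoded
```

In this toy model the two expressions return the same sample, but the column-first form pays the decoding cost for the entire split before the index is applied.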
Language | Train | Train.9h | Train.1h | Dev | Test |
---|---|---|---|---|---|
german | 469942 | 2194 | 241 | 3469 | 3394 |
dutch | 374287 | 2153 | 234 | 3095 | 3075 |
french | 258213 | 2167 | 241 | 2416 | 2426 |
spanish | 220701 | 2110 | 233 | 2408 | 2385 |
italian | 59623 | 2173 | 240 | 1248 | 1262 |
portuguese | 37533 | 2116 | 236 | 826 | 871 |
polish | 25043 | 2173 | 238 | 512 | 520 |
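To make the relative sizes in the table easier to compare, the numbers can be tabulated and summed. The snippet below is a minimal sketch that copies the values from the table into a dictionary; `SPLIT_SIZES` and `train_share` are names made up for this example, and it assumes each entry is a count of samples in that split.

```python
# Split sizes copied from the table above.
# Column order: train, train.9h, train.1h, dev, test.
SPLIT_SIZES = {
    "german": (469942, 2194, 241, 3469, 3394),
    "dutch": (374287, 2153, 234, 3095, 3075),
    "french": (258213, 2167, 241, 2416, 2426),
    "spanish": (220701, 2110, 233, 2408, 2385),
    "italian": (59623, 2173, 240, 1248, 1262),
    "portuguese": (37533, 2116, 236, 826, 871),
    "polish": (25043, 2173, 238, 512, 520),
}

# Sum of the full "train" column across the listed languages.
total_train = sum(sizes[0] for sizes in SPLIT_SIZES.values())


def train_share(language):
    """Fraction of the summed train column that belongs to `language`."""
    return SPLIT_SIZES[language][0] / total_train


print(f"total train samples: {total_train}")
for language in SPLIT_SIZES:
    print(f"{language:<11}{train_share(language):6.1%}")
```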
Who are the source language producers? [Needs More Information]
Who are the annotators? [Needs More Information]
The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in this dataset.
Public Domain, Creative Commons Attribution 4.0 International Public License (CC-BY-4.0)
@article{Pratap2020MLSAL,
  title={MLS: A Large-Scale Multilingual Dataset for Speech Research},
  author={Vineel Pratap and Qiantong Xu and Anuroop Sriram and Gabriel Synnaeve and Ronan Collobert},
  journal={ArXiv},
  year={2020},
  volume={abs/2012.03411}
}
Thanks to @patrickvonplaten for adding this dataset.