Dataset: facebook/multilingual_librispeech
Tasks: Automatic Speech Recognition
Multilinguality: multilingual
Size: 100K<n<1M
Annotations creators: expert-generated
Source datasets: original
ArXiv: arxiv:2012.03411
License: cc-by-4.0

This is a streamable version of the Multilingual LibriSpeech (MLS) dataset. The data archives were restructured from the original OpenSLR archives to make them easier to stream.
The MLS dataset is a large multilingual corpus suitable for speech research. It is derived from read audiobooks from LibriVox and covers 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, and Polish.
The datasets library allows you to load and pre-process your dataset in pure Python, at scale. The dataset can be downloaded and prepared in one call to your local drive by using the load_dataset function.
For example, to download the German config, simply specify the corresponding language config name (i.e., "german" for German):
    from datasets import load_dataset

    mls = load_dataset("facebook/multilingual_librispeech", "german", split="train")
Using the datasets library, you can also stream the dataset on-the-fly by adding a streaming=True argument to the load_dataset function call. Loading a dataset in streaming mode loads individual samples of the dataset at a time, rather than downloading the entire dataset to disk.
    from datasets import load_dataset

    mls = load_dataset("facebook/multilingual_librispeech", "german", split="train", streaming=True)
    print(next(iter(mls)))
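Because a streamed split is a plain Python iterable, its first few samples can be inspected without downloading the whole split. The sketch below uses `itertools.islice` for this. `peek` is a hypothetical helper name, not part of the datasets library, and `fake_stream` is a made-up stand-in that only mimics the shape of the samples so the snippet runs offline.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def peek(stream: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Return at most the first n samples of an iterable dataset."""
    return list(islice(stream, n))


def fake_stream() -> Iterator[Dict[str, Any]]:
    """Stand-in for a streamed split: yields made-up samples forever."""
    i = 0
    while True:
        yield {"id": f"0_0_{i:06d}", "text": f"sample {i}"}
        i += 1


first_three = peek(fake_stream(), 3)
print([s["id"] for s in first_three])

# With the real dataset (requires network access):
# from datasets import load_dataset
# mls = load_dataset("facebook/multilingual_librispeech", "german",
#                    split="train", streaming=True)
# for sample in peek(mls, 3):
#     print(sample["text"])
```

Streamed datasets also provide a `take(n)` method that serves the same purpose; `islice` is used here only because it works on any iterable.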
Bonus: create a PyTorch DataLoader directly from your own datasets (local or streamed).
Local:
    from datasets import load_dataset
    from torch.utils.data import DataLoader
    from torch.utils.data.sampler import BatchSampler, RandomSampler

    mls = load_dataset("facebook/multilingual_librispeech", "german", split="train")
    # Draw random batches of 32 indices; the last, smaller batch is kept
    batch_sampler = BatchSampler(RandomSampler(mls), batch_size=32, drop_last=False)
    dataloader = DataLoader(mls, batch_sampler=batch_sampler)
Streaming:
    from datasets import load_dataset
    from torch.utils.data import DataLoader

    mls = load_dataset("facebook/multilingual_librispeech", "german", split="train", streaming=True)
    dataloader = DataLoader(mls, batch_size=32)
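The audio clips differ in length, so batching them with the default collation may fail when it tries to stack arrays of unequal size. Below is a minimal sketch of a padding collate function, assuming each sample carries an `audio` dictionary with an `array` entry and a `text` field. `pad_collate` is a hypothetical name, and plain Python lists stand in for tensors so the snippet runs without PyTorch.

```python
from typing import Any, Dict, List


def pad_collate(batch: List[Dict[str, Any]], pad_value: float = 0.0) -> Dict[str, Any]:
    """Pad every waveform in the batch to the length of the longest one."""
    waveforms = [list(sample["audio"]["array"]) for sample in batch]
    lengths = [len(w) for w in waveforms]
    max_len = max(lengths)
    padded = [w + [pad_value] * (max_len - len(w)) for w in waveforms]
    return {
        "audio": padded,      # equal-length rows, ready to be turned into a tensor
        "lengths": lengths,   # original lengths, useful for masking the padding
        "text": [sample["text"] for sample in batch],
    }


# Two made-up samples of different lengths
batch = [
    {"audio": {"array": [0.1, 0.2, 0.3], "sampling_rate": 16000}, "text": "a"},
    {"audio": {"array": [0.5], "sampling_rate": 16000}, "text": "b"},
]
out = pad_collate(batch)
print(out["audio"])
print(out["lengths"])
```

Such a function can be passed to the loader through its `collate_fn` argument, for example `DataLoader(mls, batch_size=32, collate_fn=pad_collate)`. In real training code the padded lists would typically be converted to tensors inside the function.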
To find out more about loading and preparing audio datasets, head over to hf.co/blog/audio-datasets.
Train your own CTC or Seq2Seq Automatic Speech Recognition models on Multilingual LibriSpeech with transformers - here.
A typical data point comprises the path to the audio file, called file, and its transcription, called text. Some additional information about the speaker and the passage that contains the transcription is also provided.
    {
        'file': '10900_6473_000030.flac',
        'audio': {
            'path': '10900_6473_000030.flac',
            'array': array([-1.52587891e-04, 6.10351562e-05, 0.00000000e+00, ..., 4.27246094e-04, 5.49316406e-04, 4.57763672e-04]),
            'sampling_rate': 16000
        },
        'text': 'więc czego chcecie odemnie spytałem wysłuchawszy tego zadziwiającego opowiadania broń nas stary człowieku broń zakrzyknęli równocześnie obaj posłowie\n',
        'speaker_id': 10900,
        'chapter_id': 6473,
        'id': '10900_6473_000030'
    }
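In the sample above, the id looks like the speaker id, the chapter id, and an utterance counter joined by underscores. That layout is inferred from this single example rather than stated in the card, so the parser below is only a sketch under that assumption. `parse_mls_id`, `MlsId`, and the field name `utterance_index` are hypothetical.

```python
from typing import NamedTuple


class MlsId(NamedTuple):
    speaker_id: int
    chapter_id: int
    utterance_index: int  # assumed meaning of the third component


def parse_mls_id(sample_id: str) -> MlsId:
    """Split an id of the assumed form '<speaker>_<chapter>_<index>'."""
    parts = sample_id.split("_")
    if len(parts) != 3:
        raise ValueError(f"unexpected id format: {sample_id!r}")
    speaker, chapter, index = (int(p) for p in parts)
    return MlsId(speaker, chapter, index)


parsed = parse_mls_id("10900_6473_000030")
print(parsed)
```

For the sample shown, the parsed speaker and chapter values match the `speaker_id` and `chapter_id` fields, which is what motivates the assumed layout.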
file: the filename of the audio file, in .flac format.
audio: A dictionary containing the audio filename, the decoded audio array, and the sampling rate. Note that when accessing the audio column with dataset[0]["audio"], the audio file is automatically decoded and resampled to dataset.features["audio"].sampling_rate. Decoding and resampling a large number of audio files can take a significant amount of time, so it is important to query the sample index before the "audio" column, i.e. dataset[0]["audio"] should always be preferred over dataset["audio"][0].
text: the transcription of the audio file.
id: unique id of the data sample.
speaker_id: unique id of the speaker. The same speaker id can be found for multiple data samples.
chapter_id: id of the audiobook chapter which includes the transcription.
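The audio dictionary carries everything needed to compute a clip's duration: the number of decoded samples divided by the sampling rate. A small sketch follows, in which `clip_duration_seconds` is a hypothetical helper and the sample is synthetic rather than taken from the dataset.

```python
from typing import Any, Dict


def clip_duration_seconds(sample: Dict[str, Any]) -> float:
    """Duration of one clip: number of samples / samples per second."""
    audio = sample["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


# Synthetic sample: 32000 zero-valued samples at 16 kHz
sample = {
    "audio": {"path": "example.flac", "array": [0.0] * 32000, "sampling_rate": 16000},
    "text": "made-up transcription",
}
print(clip_duration_seconds(sample))
```

Calling it as `clip_duration_seconds(dataset[0])` follows the indexing advice above, since only one audio file is decoded.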
Language | Train | Train.9h | Train.1h | Dev | Test |
---|---|---|---|---|---|
german | 469942 | 2194 | 241 | 3469 | 3394 |
dutch | 374287 | 2153 | 234 | 3095 | 3075 |
french | 258213 | 2167 | 241 | 2416 | 2426 |
spanish | 220701 | 2110 | 233 | 2408 | 2385 |
italian | 59623 | 2173 | 240 | 1248 | 1262 |
portuguese | 37533 | 2116 | 236 | 826 | 871 |
polish | 25043 | 2173 | 238 | 512 | 520 |
Who are the source language producers? [Needs More Information]
Who are the annotators? [Needs More Information]
The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in this dataset.
Public Domain, Creative Commons Attribution 4.0 International Public License (CC-BY-4.0)
    @article{Pratap2020MLSAL,
      title={MLS: A Large-Scale Multilingual Dataset for Speech Research},
      author={Vineel Pratap and Qiantong Xu and Anuroop Sriram and Gabriel Synnaeve and Ronan Collobert},
      journal={ArXiv},
      year={2020},
      volume={abs/2012.03411}
    }
Thanks to @patrickvonplaten and @polinaeterna for adding this dataset.