Dataset: LIUM/tedlium
Source datasets: original
Annotations creators: expert-generated
Language creators: expert-generated
Size categories: 10K<n<100K
Multilinguality: monolingual
Languages: en
Tasks: Automatic Speech Recognition

The TED-LIUM corpus consists of English-language TED talks with transcriptions, sampled at 16 kHz. The three releases of the corpus range from 118 to 452 hours of transcribed speech data.
    from datasets import load_dataset

    tedlium = load_dataset("LIUM/tedlium", "release1")  # for Release 1

    # see structure
    print(tedlium)

    # load audio sample on the fly
    audio_input = tedlium["train"][0]["audio"]  # first decoded audio sample
    transcription = tedlium["train"][0]["text"]  # first transcription
The audio and transcriptions are in English, as per the TED talks at http://www.ted.com.
    {'audio': {'path': '/home/sanchitgandhi/cache/downloads/extracted/6e3655f9e735ae3c467deed1df788e0dabd671c1f3e2e386e30aa3b571bd9761/TEDLIUM_release1/train/sph/PaulaScher_2008P.sph',
               'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32),
               'sampling_rate': 16000},
     'text': '{COUGH} but <sil> i was so {COUGH} utterly unqualified for(2) this project and {NOISE} so utterly ridiculous {SMACK} and ignored the brief {SMACK} <sil>',
     'speaker_id': 'PaulaScher_2008P',
     'gender': 'female',
     'file': '/home/sanchitgandhi/cache/downloads/extracted/6e3655f9e735ae3c467deed1df788e0dabd671c1f3e2e386e30aa3b571bd9761/TEDLIUM_release1/train/sph/PaulaScher_2008P.sph',
     'id': 'PaulaScher_2008P-1003.35-1011.16-<o,f0,female>'}
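The sample above carries non-speech markers in `text` (`{COUGH}`, `{NOISE}`, `{SMACK}`, `<sil>`), a pronunciation-variant suffix (`for(2)`), and what looks like timing information in `id`. The following is a minimal sketch of how such a record might be post-processed; it is not part of the dataset's loading script. The helper names `strip_markers` and `parse_segment_id` are hypothetical, and both the marker patterns and the `id` layout (`<talk>-<start>-<end>-<tags>`, with times in seconds) are assumptions inferred from this single example.

```python
import re

# Assumption: non-speech events appear as {UPPERCASE} tokens and silences as <sil>,
# as in the sample record; other records may use additional markers.
_MARKER = re.compile(r"\{[A-Z_]+\}|<sil>")
# Assumption: a "(N)" suffix on a word, e.g. "for(2)", flags a pronunciation variant.
_VARIANT = re.compile(r"(?<=\w)\(\d+\)")


def strip_markers(text: str) -> str:
    """Remove non-speech markers and variant suffixes, then collapse whitespace."""
    text = _MARKER.sub(" ", text)
    text = _VARIANT.sub("", text)
    return " ".join(text.split())


def parse_segment_id(segment_id: str) -> dict:
    """Split an id such as 'Talk-1003.35-1011.16-<o,f0,female>' into its parts."""
    # Split from the right so that hyphens inside the talk name are preserved.
    talk, start, end, tags = segment_id.rsplit("-", 3)
    return {
        "talk": talk,
        "start": float(start),
        "end": float(end),
        "tags": tags.strip("<>").split(","),
    }


sample_text = (
    "{COUGH} but <sil> i was so {COUGH} utterly unqualified for(2) this project "
    "and {NOISE} so utterly ridiculous {SMACK} and ignored the brief {SMACK} <sil>"
)
sample_id = "PaulaScher_2008P-1003.35-1011.16-<o,f0,female>"

print(strip_markers(sample_text))
# but i was so utterly unqualified for this project and so utterly ridiculous and ignored the brief

segment = parse_segment_id(sample_id)
print(round(segment["end"] - segment["start"], 2))  # 7.81
```

Whether the markers should be removed at all depends on the downstream task: they may be useful when modelling noise or silence, so this cleaning step is a design choice rather than a requirement.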
There are three releases of the TED-LIUM corpus, each increasing the amount of transcribed speech training data: 118 hours (Release 1), 207 hours (Release 2), and 452 hours (Release 3).
Release 1:
Release 2:
Release 3:
Release 3 contains two different corpus distributions:
Each release is split into a training, validation and test set:
Split | Release 1 | Release 2 | Release 3 |
---|---|---|---|
Train | 56,803 | 92,973 | 268,263 |
Validation | 591 | 591 | 591 |
Test | 1,469 | 1,469 | 1,469 |
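As a quick sanity check, the split sizes in the table can be combined with the hour counts quoted for each release. This sketch assumes that the 118, 207 and 452 hour figures refer to the training split (the text describes them as training data) and that the table counts are numbers of segments; the names `SPLITS`, `TRAIN_HOURS` and `summarize` are illustrative only.

```python
# Split sizes copied from the table (number of examples per split).
SPLITS = {
    "release1": {"train": 56_803, "validation": 591, "test": 1_469},
    "release2": {"train": 92_973, "validation": 591, "test": 1_469},
    "release3": {"train": 268_263, "validation": 591, "test": 1_469},
}

# Hours of transcribed speech per release, as quoted in the text.
# Assumption: these hours apply to the training split only.
TRAIN_HOURS = {"release1": 118, "release2": 207, "release3": 452}


def summarize(release: str) -> dict:
    """Return total size, train share and mean training-segment length."""
    sizes = SPLITS[release]
    total = sum(sizes.values())
    return {
        "total_examples": total,
        "train_fraction": sizes["train"] / total,
        "mean_train_segment_seconds": TRAIN_HOURS[release] * 3600 / sizes["train"],
    }


for name in SPLITS:
    s = summarize(name)
    print(
        f"{name}: {s['total_examples']} examples, "
        f"{s['train_fraction']:.1%} train, "
        f"~{s['mean_train_segment_seconds']:.1f} s per training segment"
    )
```

Because the validation and test sets are identical in size across releases, the training share grows with each release; the mean segment length is only a rough estimate under the assumption stated above.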
TED-LIUM was built during the International Workshop on Spoken Language Translation (IWSLT) 2011 Evaluation Campaign, an annual workshop focused on the automatic translation of public talks that included tracks for speech recognition, speech translation, text translation, and system combination.
The data was obtained from publicly available TED talks at http://www.ted.com. Proper alignments between the speech and the transcribed text were generated using an in-house speaker segmentation and clustering tool (LIUM_SpkDiarization). Speech disfluencies (e.g. repetitions, hesitations, false starts) were treated in the following way: repetitions were transcribed, hesitations were mapped to a specific filler word, and false starts were not taken into account. For full details on the data collection and processing, refer to the TED-LIUM paper.
Who are the source language producers?

TED Talks are influential videos from expert speakers on education, business, science, tech and creativity.
Who are the annotators?

[Needs More Information]
Licensed under Creative Commons BY-NC-ND 3.0 (http://creativecommons.org/licenses/by-nc-nd/3.0/deed.en).
Release 1:
    @inproceedings{rousseau2012tedlium,
      title={TED-LIUM: an Automatic Speech Recognition dedicated corpus},
      author={Rousseau, Anthony and Del{\'e}glise, Paul and Est{\`e}ve, Yannick},
      booktitle={Conference on Language Resources and Evaluation (LREC)},
      pages={125--129},
      year={2012}
    }
Release 2:
    @inproceedings{rousseau2014enhancing,
      title={Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks.},
      author={Rousseau, Anthony and Del{\'e}glise, Paul and Est{\`e}ve, Yannick and others},
      booktitle={LREC},
      pages={3935--3939},
      year={2014}
    }
Release 3:
    @inproceedings{hernandez2018ted,
      author="Hernandez, Fran{\c{c}}ois and Nguyen, Vincent and Ghannay, Sahar and Tomashenko, Natalia and Est{\`e}ve, Yannick",
      title="TED-LIUM 3: Twice as Much Data and Corpus Repartition for Experiments on Speaker Adaptation",
      booktitle="Speech and Computer",
      year="2018",
      publisher="Springer International Publishing",
      pages="198--208",
    }