Dataset:

LIUM/tedlium

Source datasets:

original

Annotations creators:

expert-generated

Language creators:

expert-generated

Size:

10K<n<100K

Multilinguality:

monolingual

Languages:

en

Dataset Card for tedlium

Dataset Summary

The TED-LIUM corpus consists of English-language TED talks with transcriptions, with audio sampled at 16 kHz. The three releases of the corpus range from 118 to 452 hours of transcribed speech data.

Example

from datasets import load_dataset

tedlium = load_dataset("LIUM/tedlium", "release1") # for Release 1

# see structure
print(tedlium)

# load audio sample on the fly
audio_input = tedlium["train"][0]["audio"]  # first decoded audio sample
transcription = tedlium["train"][0]["text"]  # first transcription

Supported Tasks and Leaderboards

  • automatic-speech-recognition: The dataset can be used to train a model for Automatic Speech Recognition (ASR). The model is presented with an audio file and asked to transcribe it to written text. The most common evaluation metric is the word error rate (WER). The task has an active leaderboard at https://paperswithcode.com/sota/speech-recognition-on-tedlium that ranks models by their WER.
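As a rough illustration of the metric, WER can be computed as the word-level edit distance between the reference and the predicted transcription, divided by the number of reference words. The sketch below is a minimal pure-Python version; the function name word_error_rate is chosen here for illustration, and leaderboard results are normally produced with an established evaluation toolkit and may apply text normalization that this sketch does not.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of three reference words.
    print(word_error_rate("the cat sat", "the dog sat"))
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.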

Languages

The audio and transcriptions are in English, as per the TED talks at http://www.ted.com.

Dataset Structure

Data Instances

{'audio': {'path': '/home/sanchitgandhi/cache/downloads/extracted/6e3655f9e735ae3c467deed1df788e0dabd671c1f3e2e386e30aa3b571bd9761/TEDLIUM_release1/train/sph/PaulaScher_2008P.sph', 
  'array': array([-0.00048828, -0.00018311, -0.00137329, ...,  0.00079346,
          0.00091553,  0.00085449], dtype=float32),
  'sampling_rate': 16000},
'text': '{COUGH} but <sil> i was so {COUGH} utterly unqualified for(2) this project and {NOISE} so utterly ridiculous {SMACK} and ignored the brief {SMACK} <sil>', 
'speaker_id': 'PaulaScher_2008P', 
'gender': 'female', 
'file': '/home/sanchitgandhi/cache/downloads/extracted/6e3655f9e735ae3c467deed1df788e0dabd671c1f3e2e386e30aa3b571bd9761/TEDLIUM_release1/train/sph/PaulaScher_2008P.sph', 
'id': 'PaulaScher_2008P-1003.35-1011.16-<o,f0,female>'}
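The text field in the instance above contains non-speech markers such as {COUGH}, {NOISE}, {SMACK} and <sil>, plus a numeric suffix in for(2). If plain word sequences are needed, for example before scoring, they can be stripped with a couple of regular expressions. This is only a sketch based on the single sample shown here: the marker inventory, the reading of "(2)" as a variant index, and the helper name clean_transcription are assumptions, not part of the dataset's tooling.

```python
import re

# Assumed marker shapes, inferred from the sample instance only:
#   {COUGH}, {NOISE}, {SMACK}  -> upper-case tag in curly braces
#   <sil>                      -> lower-case tag in angle brackets
_NON_SPEECH = re.compile(r"\{[A-Z]+\}|<[a-z]+>")
# Numeric suffix attached to a word, e.g. the "(2)" in "for(2)".
_NUMERIC_SUFFIX = re.compile(r"\(\d+\)")


def clean_transcription(text: str) -> str:
    """Remove non-speech markers and numeric suffixes, then collapse spaces."""
    text = _NON_SPEECH.sub(" ", text)
    text = _NUMERIC_SUFFIX.sub("", text)
    return " ".join(text.split())


if __name__ == "__main__":
    raw = (
        "{COUGH} but <sil> i was so {COUGH} utterly unqualified for(2) "
        "this project and {NOISE} so utterly ridiculous {SMACK} and "
        "ignored the brief {SMACK} <sil>"
    )
    print(clean_transcription(raw))
```

Whether to remove these markers depends on the use case; a model trained to predict them would need them left in place.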

Data Fields

  • audio: A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Note that when the audio column is accessed, e.g. dataset[0]["audio"], the audio file is automatically decoded and resampled to dataset.features["audio"].sampling_rate. Decoding and resampling a large number of audio files can take a significant amount of time, so it is important to query the sample index before the "audio" column: dataset[0]["audio"] should always be preferred over dataset["audio"][0].
  • file: A path to the downloaded audio file in .sph format.
  • text: the transcription of the audio file.
  • gender: the gender of the speaker. One of: male, female or N/A.
  • id: unique id of the data sample.
  • speaker_id: unique id of the speaker. The same speaker id can be found for multiple data samples.
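The id in the sample instance, 'PaulaScher_2008P-1003.35-1011.16-<o,f0,female>', appears to encode the talk name, the segment start and end times in seconds, and a bracketed tag list. The sketch below parses that layout and also derives a duration from an audio array length and its sampling rate. The layout is inferred from that one sample, and SegmentId, parse_segment_id and duration_from_samples are hypothetical helpers introduced here, not part of the datasets library.

```python
import re
from typing import NamedTuple, Tuple


class SegmentId(NamedTuple):
    talk: str
    start: float  # assumed to be seconds from the start of the talk
    end: float
    tags: Tuple[str, ...]


# Assumed layout: <talk>-<start>-<end>-<tag,tag,...>
_ID_PATTERN = re.compile(
    r"^(?P<talk>.+)-(?P<start>\d+(?:\.\d+)?)-(?P<end>\d+(?:\.\d+)?)"
    r"-<(?P<tags>[^>]*)>$"
)


def parse_segment_id(sample_id: str) -> SegmentId:
    """Split a sample id into its assumed components."""
    match = _ID_PATTERN.match(sample_id)
    if match is None:
        raise ValueError(f"unrecognized id layout: {sample_id!r}")
    return SegmentId(
        talk=match["talk"],
        start=float(match["start"]),
        end=float(match["end"]),
        tags=tuple(match["tags"].split(",")),
    )


def duration_from_samples(num_samples: int, sampling_rate: int = 16000) -> float:
    """Duration in seconds of an audio array with num_samples samples."""
    return num_samples / sampling_rate


if __name__ == "__main__":
    seg = parse_segment_id("PaulaScher_2008P-1003.35-1011.16-<o,f0,female>")
    print(seg.talk, seg.tags, round(seg.end - seg.start, 2))
```

For a loaded sample, len(sample["audio"]["array"]) and sample["audio"]["sampling_rate"] can be passed to duration_from_samples to cross-check the duration implied by the id.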

Data Splits

There are three releases of the TED-LIUM corpus, progressively increasing the amount of transcribed speech training data from 118 hours (Release 1), to 207 hours (Release 2), to 452 hours (Release 3).

Release 1:

  • Contains 118 hours of speech audio data.

Release 2:

  • 1495 audio talks and automatically aligned transcriptions.
  • Contains 207 hours of speech audio data.
  • Dictionary with pronunciations (159848 entries).
  • Selected monolingual data for language modeling from WMT12 publicly available corpora.
  • Homepage: https://www.openslr.org/19/

Release 3:

  • 2351 audio talks and automatically aligned transcriptions.
  • Contains 452 hours of speech audio data.
  • TED-LIUM 2 validation and test data: 19 TED talks with their corresponding manual transcriptions.
  • Dictionary with pronunciations (159848 entries), the same file as the one included in TED-LIUM 2.
  • Selected monolingual data for language modeling from WMT12 publicly available corpora: these files come from the TED-LIUM 2 release, but have been modified to produce a tokenization more relevant for the English language.
  • Homepage: https://www.openslr.org/51/

Release 3 contains two different corpus distributions:

  • The ‘legacy’ one, in which the dev and test datasets are the same as in TED-LIUM 2 (and TED-LIUM 1).
  • The ‘speaker adaptation’ one, specially designed for experiments on speaker adaptation.

Each release is split into a training, validation and test set:

Split        Release 1   Release 2   Release 3
Train           56,803      92,973     268,263
Validation         591         591         591
Test             1,469       1,469       1,469

Dataset Creation

Curation Rationale

TED-LIUM was built during the International Workshop on Spoken Language Translation (IWSLT) 2011 Evaluation Campaign, an annual workshop focused on the automatic translation of public talks that included tracks for speech recognition, speech translation, text translation, and system combination.

Source Data

Initial Data Collection and Normalization

The data was obtained from publicly available TED talks at http://www.ted.com. Proper alignments between the speech and the transcribed text were generated using an in-house speaker segmentation and clustering tool (LIUM_SpkDiarization). Speech disfluencies (e.g. repetitions, hesitations, false starts) were treated in the following way: repetitions were transcribed, hesitations were mapped to a specific filler word, and false starts were not taken into account. For full details on the data collection and processing, refer to the TED-LIUM paper.

Who are the source language producers?

TED Talks are influential videos from expert speakers on education, business, science, tech and creativity.

Annotations

Annotation process

[Needs More Information]

Who are the annotators?

[Needs More Information]

Personal and Sensitive Information

[Needs More Information]

Considerations for Using the Data

Social Impact of Dataset

[Needs More Information]

Discussion of Biases

[Needs More Information]

Other Known Limitations

[Needs More Information]

Additional Information

Dataset Curators

[Needs More Information]

Licensing Information

Licensed under Creative Commons BY-NC-ND 3.0 (http://creativecommons.org/licenses/by-nc-nd/3.0/deed.en).

Citation Information

Release 1:

@inproceedings{rousseau2012tedlium,
  title={TED-LIUM: an Automatic Speech Recognition dedicated corpus},
  author={Rousseau, Anthony and Del{\'e}glise, Paul and Est{\`e}ve, Yannick},
  booktitle={Conference on Language Resources and Evaluation (LREC)},
  pages={125--129},
  year={2012}
}

Release 2:

@inproceedings{rousseau2014enhancing,
  title={Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks.},
  author={Rousseau, Anthony and Del{\'e}glise, Paul and Est{\`e}ve, Yannick and others},
  booktitle={LREC},
  pages={3935--3939},
  year={2014}
}

Release 3:

@inproceedings{hernandez2018ted,
  author="Hernandez, Fran{\c{c}}ois
  and Nguyen, Vincent
  and Ghannay, Sahar
  and Tomashenko, Natalia
  and Est{\`e}ve, Yannick",
  title="TED-LIUM 3: Twice as Much Data and Corpus Repartition for Experiments on Speaker Adaptation",
  booktitle="Speech and Computer",
  year="2018",
  publisher="Springer International Publishing",
  pages="198--208",
}