Datasets:

facebook/voxpopuli

Tasks:

Automatic Speech Recognition

Languages:

Multilinguality:

multilingual

ArXiv:

arxiv:2101.00390

License:

cc0-1.0

other

Dataset Card for VoxPopuli

Dataset Summary

VoxPopuli is a large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. The raw data is collected from 2009-2020 European Parliament event recordings. We acknowledge the European Parliament for creating and sharing these materials. This implementation contains transcribed speech data for 16 languages. It also contains 29 hours of transcribed speech data of non-native English intended for research in ASR for accented speech (15 L2 accents).

Example usage

VoxPopuli contains labelled data for 16 languages. To load a specific language, pass its code as the config name:

from datasets import load_dataset

voxpopuli_croatian = load_dataset("facebook/voxpopuli", "hr")

To load all the languages in a single dataset, use the "multilang" config name:

voxpopuli_all = load_dataset("facebook/voxpopuli", "multilang")

To load a specific set of languages, use the "multilang" config name and pass a list of the required language codes to the languages parameter:

voxpopuli_slavic = load_dataset("facebook/voxpopuli", "multilang", languages=["hr", "sk", "sl", "cs", "pl"])

To load the accented English data, use the "en_accented" config name:

voxpopuli_accented = load_dataset("facebook/voxpopuli", "en_accented")

Note that the L2 English subset contains only a test split.

Supported Tasks and Leaderboards

automatic-speech-recognition: The dataset can be used to train a model for Automatic Speech Recognition (ASR). The model is presented with an audio file and asked to transcribe the audio file to written text. The most common evaluation metric is the word error rate (WER).
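As a rough illustration of the metric named above, the sketch below computes WER as the word-level Levenshtein distance divided by the number of reference words. It is a minimal pure-Python version for intuition, not the scoring code used by any particular leaderboard; the function names are hypothetical, and text normalization (casing, punctuation) is assumed to have been applied beforehand.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.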

The accented English subset can also be used for research in ASR for accented speech (15 L2 accents).

Languages

VoxPopuli contains labelled (transcribed) data for 16 languages:

Language	Code	Transcribed Hours	Transcribed Speakers	Transcribed Tokens
English	En	543	1313	4.8M
German	De	282	531	2.3M
French	Fr	211	534	2.1M
Spanish	Es	166	305	1.6M
Polish	Pl	111	282	802K
Italian	It	91	306	757K
Romanian	Ro	89	164	739K
Hungarian	Hu	63	143	431K
Czech	Cs	62	138	461K
Dutch	Nl	53	221	488K
Finnish	Fi	27	84	160K
Croatian	Hr	43	83	337K
Slovak	Sk	35	96	270K
Slovene	Sl	10	45	76K
Estonian	Et	3	29	18K
Lithuanian	Lt	2	21	10K
Total		1791	4295	15M

The transcribed accented-speech data covers 15 L2 accents:

Accent	Code	Transcribed Hours	Transcribed Speakers
Dutch	en_nl	3.52	45
German	en_de	3.52	84
Czech	en_cs	3.30	26
Polish	en_pl	3.23	33
French	en_fr	2.56	27
Hungarian	en_hu	2.33	23
Finnish	en_fi	2.18	20
Romanian	en_ro	1.85	27
Slovak	en_sk	1.46	17
Spanish	en_es	1.42	18
Italian	en_it	1.11	15
Estonian	en_et	1.08	6
Lithuanian	en_lt	0.65	7
Croatian	en_hr	0.42	9
Slovene	en_sl	0.25	7

Dataset Structure

Data Instances

{
  'audio_id': '20180206-0900-PLENARY-15-hr_20180206-16:10:06_5',
  'language': 11,  # "hr"
  'audio': {
    'path': '/home/polina/.cache/huggingface/datasets/downloads/extracted/44aedc80bb053f67f957a5f68e23509e9b181cc9e30c8030f110daaedf9c510e/train_part_0/20180206-0900-PLENARY-15-hr_20180206-16:10:06_5.wav',
    'array': array([-0.01434326, -0.01055908,  0.00106812, ...,  0.00646973], dtype=float32),
    'sampling_rate': 16000
  },
  'raw_text': '',
  'normalized_text': 'poast genitalnog sakaenja ena u europi tek je jedna od manifestacija takve tetne politike.',
  'gender': 'female',
  'speaker_id': '119431',
  'is_gold_transcript': True,
  'accent': 'None'
}

Data Fields

audio_id (string) - id of audio segment
language (datasets.ClassLabel) - numerical id of the language of the audio segment
audio (datasets.Audio) - a dictionary containing the path to the audio, the decoded audio array, and the sampling rate. In non-streaming mode (default), the path points to the locally extracted audio. In streaming mode, the path is the relative path of an audio inside its archive (as files are not downloaded and extracted locally).
raw_text (string) - original (orthographic) audio segment text
normalized_text (string) - normalized audio segment transcription
gender (string) - gender of speaker
speaker_id (string) - id of speaker
is_gold_transcript (bool) - [More Information Needed]
accent (string) - type of accent, for example "en_lt", if applicable, else "None".
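The fields above can be read from a single instance as sketched below. The `sample` dictionary is a hypothetical stand-in shaped like the example under Data Instances (its audio array is placeholder silence and its transcript is invented), and the helper names are illustrative rather than part of the `datasets` API.

```python
from typing import Optional

# Hypothetical instance mirroring the documented fields; values are placeholders.
sample = {
    "audio_id": "20180206-0900-PLENARY-15-hr_20180206-16:10:06_5",
    "language": 11,
    "audio": {"array": [0.0] * 32000, "sampling_rate": 16000},
    "raw_text": "",
    "normalized_text": "placeholder transcript",
    "gender": "female",
    "speaker_id": "119431",
    "is_gold_transcript": True,
    "accent": "None",
}


def duration_seconds(instance: dict) -> float:
    """Length of the decoded audio in seconds: samples / sampling rate."""
    audio = instance["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


def transcript(instance: dict) -> str:
    """Prefer the orthographic text; fall back to the normalized text if it is empty."""
    return instance["raw_text"] or instance["normalized_text"]


def accent(instance: dict) -> Optional[str]:
    """Map the string "None" used by the dataset to a real Python None."""
    value = instance["accent"]
    return None if value == "None" else value
```

The fallback in `transcript` reflects that `raw_text` can be an empty string, as in the example instance.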

Data Splits

All configs (languages) except accented English contain data in three splits: train, validation and test. The accented English config (en_accented) contains only a test split.

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

The raw data is collected from 2009-2020 European Parliament event recordings.

Initial Data Collection and Normalization

The VoxPopuli transcribed set comes from aligning the full-event source speech audio with the transcripts for plenary sessions. Official timestamps are available for locating speeches by speaker in the full session, but they are frequently inaccurate, resulting in truncation of the speech or mixture of fragments from the preceding or the succeeding speeches. To calibrate the original timestamps, we perform speaker diarization (SD) on the full-session audio using pyannote.audio (Bredin et al., 2020) and adopt the nearest SD timestamps (by L1 distance to the original ones) instead for segmentation. Full-session audios are segmented into speech paragraphs by speaker, each of which has a transcript available.
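The timestamp calibration described above can be pictured with the following sketch. It assumes, as a simplification not confirmed by the source, that each boundary of a speech is snapped independently to the closest diarization timestamp by absolute (L1) distance; the function names and the example numbers are hypothetical, and no pyannote.audio call is shown.

```python
from typing import List, Tuple


def snap_to_nearest(original: float, sd_timestamps: List[float]) -> float:
    """Return the diarization timestamp with the smallest L1 distance to `original`."""
    if not sd_timestamps:
        raise ValueError("at least one diarization timestamp is required")
    return min(sd_timestamps, key=lambda t: abs(t - original))


def calibrate_segment(
    start: float, end: float, sd_timestamps: List[float]
) -> Tuple[float, float]:
    """Snap both boundaries of an officially timestamped speech (assumed independent)."""
    return snap_to_nearest(start, sd_timestamps), snap_to_nearest(end, sd_timestamps)


# Hypothetical speaker-change times (seconds) from a diarization pass.
sd_times = [0.0, 12.4, 58.9, 61.2, 130.5]
calibrated = calibrate_segment(10.0, 60.0, sd_times)
```

With these made-up numbers, the official start of 10.0 s moves to 12.4 s and the official end of 60.0 s moves to 58.9 s.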

The speech paragraphs have an average duration of 197 seconds, which leads to significant memory usage and prevents efficient batching during model training. We hence further segment these paragraphs into utterances with a maximum duration of 20 seconds. We leverage speech recognition (ASR) systems to force-align speech paragraphs to the given transcripts. The ASR systems are TDS models (Hannun et al., 2019) trained with the ASG criterion (Collobert et al., 2016) on audio tracks from in-house deidentified video data.
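The source does not say how cut points are chosen once the paragraph is force-aligned, so the following is only one plausible sketch: given hypothetical word-level timestamps from the alignment, words are grouped greedily so that no utterance spans more than 20 seconds. The `Word` tuple layout and the function name are assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed layout of one force-aligned word: (token, start_sec, end_sec).
Word = Tuple[str, float, float]


def split_into_utterances(
    words: List[Word], max_duration: float = 20.0
) -> List[List[Word]]:
    """Greedily group aligned words into utterances no longer than `max_duration`.

    A single word longer than `max_duration` still forms its own utterance.
    """
    utterances: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        _, _, end = word
        # Close the current utterance if adding this word would exceed the limit.
        if current and end - current[0][1] > max_duration:
            utterances.append(current)
            current = []
        current.append(word)
    if current:
        utterances.append(current)
    return utterances


# Hypothetical paragraph: 45 one-second words spoken back to back.
paragraph = [(f"w{i}", float(i), float(i + 1)) for i in range(45)]
chunks = split_into_utterances(paragraph)
```

A production pipeline would more likely cut at pauses or sentence boundaries; the greedy rule here only demonstrates the duration constraint.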

The resulting utterance segments may have incorrect transcriptions due to incomplete raw transcripts or inaccurate ASR force-alignment. We use the predictions from the same ASR systems as references and filter the candidate segments by a maximum threshold of 20% character error rate (CER).
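A minimal sketch of this filtering step is shown below. It treats the ASR prediction as the reference, as the paragraph states, and keeps a segment when the character-level edit distance divided by the reference length is at most 0.20. Whether the original pipeline used an inclusive threshold, and how it normalized text before scoring, are assumptions; the function names are hypothetical.

```python
def char_edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (cost 1) or match (cost 0)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = character edit distance / number of reference characters."""
    if not reference:
        return 0.0 if not hypothesis else float("inf")
    return char_edit_distance(reference, hypothesis) / len(reference)


def keep_segment(transcript: str, asr_prediction: str, max_cer: float = 0.20) -> bool:
    """Keep the segment if its transcript is close enough to the ASR prediction.

    The ASR prediction plays the role of the reference; the inclusive
    comparison (<=) is an assumption.
    """
    return character_error_rate(asr_prediction, transcript) <= max_cer
```

For example, a transcript that differs from an 11-character prediction by one character has a CER of about 0.09 and is kept.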

Who are the source language producers?

Speakers are participants in European Parliament events; many of them are EU officials.

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

The gender distribution of speakers is imbalanced: the percentage of female speakers is below 50% for most languages, with a minimum of 15% for the Lithuanian data.
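One way to check this imbalance on a loaded split is sketched below. It counts each speaker once, using the speaker_id and gender fields, and assumes the gender labels are the strings "female" and "male" (only "female" appears in the example instance). The function name and the toy records are hypothetical.

```python
from typing import Dict, Iterable


def female_speaker_share(samples: Iterable[Dict[str, str]]) -> float:
    """Fraction of unique speakers whose gender label is "female"."""
    # Deduplicate by speaker so that prolific speakers are not over-counted.
    gender_by_speaker = {s["speaker_id"]: s["gender"] for s in samples}
    if not gender_by_speaker:
        raise ValueError("no samples given")
    n_female = sum(1 for g in gender_by_speaker.values() if g == "female")
    return n_female / len(gender_by_speaker)


# Toy records standing in for rows of one language config.
toy_rows = [
    {"speaker_id": "1", "gender": "female"},
    {"speaker_id": "1", "gender": "female"},
    {"speaker_id": "2", "gender": "male"},
    {"speaker_id": "3", "gender": "male"},
    {"speaker_id": "4", "gender": "male"},
]
share = female_speaker_share(toy_rows)
```

Counting by utterance or by speech duration instead of by speaker would give a different figure, so the unit of counting should be stated alongside any reported percentage.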

VoxPopuli includes all available speeches from the 2009-2020 EP events without any selection by topic or speaker. The speech content represents the standpoints of the speakers in the EP events, many of whom are EU officials.

Other Known Limitations

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

The dataset is distributed under the CC0 license; see also the European Parliament's legal notice for the raw data.

Citation Information

Please cite this paper:

@inproceedings{wang-etal-2021-voxpopuli,
    title = "{V}ox{P}opuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation",
    author = "Wang, Changhan  and
      Riviere, Morgane  and
      Lee, Ann  and
      Wu, Anne  and
      Talnikar, Chaitanya  and
      Haziza, Daniel  and
      Williamson, Mary  and
      Pino, Juan  and
      Dupoux, Emmanuel",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.80",
    pages = "993--1003",
}

Contributions

Thanks to @polinaeterna for adding this dataset.

Author:

facebook

Dataset size:

200.71 GB