Dataset:

bond005/sberdevices_golos_10h_crowd

Language:

ru

Multilinguality:

monolingual

Size:

10K<n<100K

Annotation creators:

expert-generated

Source datasets:

extended

ArXiv:

arxiv:2106.10161

License:

other

Dataset Card for sberdevices_golos_10h_crowd

Dataset Summary

SberDevices Golos is a corpus of approximately 1200 hours of 16 kHz Russian speech from the crowd (read speech) and farfield (communication with smart devices) domains, prepared by the SberDevices Team (Alexander Denisenko, Angelina Kovalenko, Fedor Minkin, and Nikolay Karpov). The data was collected on a crowd-sourcing platform and has been manually annotated.

The authors divide the whole dataset into train and test subsets. The training subset includes approximately 1000 hours. For experiments with a limited amount of data, the authors identified shorter training subsets: 100 hours, 10 hours, 1 hour, and 10 minutes.

This dataset is a simpler version of the above-mentioned Golos:

  • it includes the crowd domain only (without any recordings from the farfield domain);
  • the validation split is built on the 1-hour training subset;
  • the training split corresponds to the 10-hour training subset, excluding recordings from the 1-hour training subset;
  • the test split is the full original test split.
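
The splits described above can presumably be loaded with the Hugging Face `datasets` library. The following is a minimal sketch under assumptions: the repository identifier is taken from this card, while the lowercase split names `train`, `validation`, and `test` and the helper name `load_golos_crowd` are assumptions introduced for illustration.

```python
REPO_ID = "bond005/sberdevices_golos_10h_crowd"  # repository name from this card
SPLIT_NAMES = ("train", "validation", "test")    # assumed split identifiers


def load_golos_crowd(split: str):
    """Load one split of the dataset (hypothetical helper, not part of any library)."""
    if split not in SPLIT_NAMES:
        raise ValueError(f"unknown split {split!r}, expected one of {SPLIT_NAMES}")
    # Imported lazily so that defining the helper does not require `datasets`.
    from datasets import load_dataset
    return load_dataset(REPO_ID, split=split)


# Usage (downloads the data on the first call):
# validation = load_golos_crowd("validation")
# print(validation[0]["transcription"])
```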

Supported Tasks and Leaderboards

  • automatic-speech-recognition: The dataset can be used to train a model for Automatic Speech Recognition (ASR). The model is presented with an audio file and asked to transcribe it to written text. The most common evaluation metric is the word error rate (WER). The task has an active Hugging Face leaderboard, which can be found at https://huggingface.co/spaces/huggingface/hf-speech-bench. The leaderboard ranks models uploaded to the Hub based on their WER.
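
For reference, WER is commonly defined as the word-level edit distance (substitutions, deletions, and insertions) between the reference and the hypothesis, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition, not the leaderboard's own scoring code; the function name `word_error_rate` is an assumption, and no text normalization is applied.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "шестнадцатая часть сезона пять сериала лемони сникет тридцать три несчастья"
hypothesis = "шестнадцатая часть сезона пять сериала лимони сникет тридцать три несчастья"
print(word_error_rate(reference, hypothesis))  # one substitution in ten words: 0.1
```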

Languages

The audio is in Russian.

Dataset Structure

Data Instances

A typical data point comprises the audio data, called audio, and its transcription, called transcription. No additional information about the speaker or the passage containing the transcription is provided.

{'audio': {'path': None,
  'array': array([ 3.05175781e-05,  3.05175781e-05,  0.00000000e+00, ...,
                  -1.09863281e-03, -7.93457031e-04, -1.52587891e-04], dtype=float64),
  'sampling_rate': 16000},
 'transcription': 'шестнадцатая часть сезона пять сериала лемони сникет тридцать три несчастья'}
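
As a small illustration of this structure, the duration of a clip follows from the number of samples and the sampling rate. The helper name `audio_duration_seconds` and the synthetic sample below are assumptions made for the example; real instances carry a NumPy array in the `array` field, which supports `len()` in the same way.

```python
def audio_duration_seconds(sample: dict) -> float:
    """Return the clip length in seconds: number of samples / sampling rate."""
    audio = sample["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


# Synthetic stand-in for a data instance: 32000 silent samples at 16 kHz.
fake_sample = {
    "audio": {"path": None, "array": [0.0] * 32000, "sampling_rate": 16000},
    "transcription": "пример",
}
print(audio_duration_seconds(fake_sample))  # 2.0
```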

Data Fields

  • audio: A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Note that when accessing the audio column via dataset[0]["audio"], the audio file is automatically decoded and resampled to dataset.features["audio"].sampling_rate. Decoding and resampling a large number of audio files can take a significant amount of time, so it is important to query the sample index before the "audio" column: dataset[0]["audio"] should always be preferred over dataset["audio"][0].
  • transcription: the transcription of the audio file.
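
The difference between the two access orders can be illustrated with a toy model. The class below is a deliberately simplified stand-in written for this example, not the `datasets` implementation: it only counts how many files would be decoded when the row is indexed first versus when the whole column is requested first.

```python
class ToyAudioDataset:
    """Toy stand-in for a lazily decoded audio dataset (not the real `datasets` class)."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.decode_calls = 0  # how many files have been "decoded" so far

    def _decode(self, path):
        self.decode_calls += 1
        return {"path": path, "array": [0.0], "sampling_rate": 16000}

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: only the requested file is decoded.
            return {"audio": self._decode(self.paths[key])}
        if key == "audio":
            # Column access: every file is decoded before any indexing happens.
            return [self._decode(path) for path in self.paths]
        raise KeyError(key)


row_first = ToyAudioDataset(f"clip_{i}.wav" for i in range(1000))
_ = row_first[0]["audio"]

column_first = ToyAudioDataset(f"clip_{i}.wav" for i in range(1000))
_ = column_first["audio"][0]

print(row_first.decode_calls, column_first.decode_calls)  # 1 1000
```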

Data Splits

This dataset is a simpler version of the original Golos:

  • it includes the crowd domain only (without any recordings from the farfield domain);
  • the validation split is built on the 1-hour training subset;
  • the training split corresponds to the 10-hour training subset, excluding recordings from the 1-hour training subset;
  • the test split is the full original test split.

            Train    Validation    Test
examples    7993     793           9994
hours       8.9h     0.9h          11.2h
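
As a quick sanity check on these figures, the mean utterance length per split can be derived by dividing the total duration by the number of examples. The sketch below uses only the numbers from the table; because the hour totals are rounded to one decimal place, the results are approximate. The function name `mean_duration_seconds` is an assumption.

```python
# Split statistics copied from the table above.
SPLITS = {
    "train": {"examples": 7993, "hours": 8.9},
    "validation": {"examples": 793, "hours": 0.9},
    "test": {"examples": 9994, "hours": 11.2},
}


def mean_duration_seconds(examples: int, hours: float) -> float:
    """Average clip length in seconds, given a split's size and total hours."""
    return hours * 3600 / examples


for name, stats in SPLITS.items():
    print(f"{name}: {mean_duration_seconds(**stats):.2f} s")
```

Under these figures, the average utterance comes out at roughly 4 seconds in every split.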

Dataset Creation

Curation Rationale

[Needs More Information]

Source Data

Initial Data Collection and Normalization

[Needs More Information]

Who are the source language producers?

[Needs More Information]

Annotations

Annotation process

All recorded audio files were manually annotated on the crowd-sourcing platform.

Who are the annotators?

[Needs More Information]

Personal and Sensitive Information

The dataset consists of recordings of people who have donated their voices. You agree not to attempt to determine the identity of speakers in this dataset.

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[Needs More Information]

Additional Information

Dataset Curators

The dataset was initially created by Alexander Denisenko, Angelina Kovalenko, Fedor Minkin, and Nikolay Karpov.

Licensing Information

Public license with attribution and conditions reserved

Citation Information

@misc{karpov2021golos,
  author = {Karpov, Nikolay and Denisenko, Alexander and Minkin, Fedor},
  title = {Golos: Russian Dataset for Speech Research},
  publisher = {arXiv},
  year = {2021},
  url = {https://arxiv.org/abs/2106.10161}
}

Contributions

Thanks to @bond005 for adding this dataset.