Dataset:
bond005/sova_rudevices
Language:
ru
Multilinguality:
monolingual
Size:
10K<n<100K
Language creators:
crowdsourced
Annotations creators:
expert-generated
Source datasets:
extended
License:
cc-by-4.0
SOVA Dataset is a free public STT/ASR dataset. It consists of several parts, one of which is SOVA RuDevices. This part is an acoustic corpus of approximately 100 hours of 16 kHz Russian live speech with manual annotation, prepared by the SOVA.ai team.
The authors do not divide the dataset into training, validation, and test subsets, so I prepared this split myself. The training subset contains more than 82 hours, and the validation and test subsets contain approximately 6 hours each.
The audio is in Russian.
A typical data point comprises the audio data, stored in the field audio, and its transcription, stored in the field transcription. No additional information about the speaker or the passage that contains the transcription is provided.
{'audio': {'path': '/home/bond005/datasets/sova_rudevices/data/train/00003ec0-1257-42d1-b475-db1cd548092e.wav', 'array': array([ 0.00787354, 0.00735474, 0.00714111, ..., -0.00018311, -0.00015259, -0.00018311], dtype=float32), 'sampling_rate': 16000}, 'transcription': 'мне получше стало'}
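As a rough illustration of working with a data point of this shape, the sketch below computes a clip's duration from the array length and sampling_rate fields. The helper names load_split and duration_seconds are hypothetical, and the record passed to the helper is synthetic (one second of silence), not real data. Loading through datasets.load_dataset with the split names train, validation, and test is an assumption based on the usual Hugging Face conventions; it needs the datasets package and network access, so it is defined here but not called.

```python
def load_split(split="train"):
    """Hypothetical helper: load one split from the Hugging Face Hub.

    Assumes the `datasets` package is installed and that the split names
    are "train", "validation" and "test".
    """
    from datasets import load_dataset  # imported lazily; not needed by the helper below
    return load_dataset("bond005/sova_rudevices", split=split)


def duration_seconds(example):
    """Return the clip length in seconds: number of samples / sampling rate."""
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


# A synthetic record shaped like the example above (one second of silence, not real data).
fake_example = {
    "audio": {"path": "example.wav", "array": [0.0] * 16000, "sampling_rate": 16000},
    "transcription": "мне получше стало",
}

print(duration_seconds(fake_example))  # 1.0
```

With the real dataset, the same helper could presumably be applied to load_split("validation")[0], provided the audio field is decoded into the array and sampling_rate layout shown above.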
This dataset consists of three splits: training, validation, and test. The split follows the internal structure of SOVA RuDevices: the validation split is based on subdirectory 0, and the test split is based on subdirectory 1 of the original dataset. However, recordings of the same speaker may appear in more than one split, because speaker-disjoint splits are not guaranteed.
| | Train | Validation | Test |
|---|---|---|---|
| examples | 81607 | 5835 | 5799 |
| hours | 82.4h | 5.9h | 5.8h |
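The figures in the table can be cross-checked with a little arithmetic. The sketch below uses only the numbers from the table; the derived values (mean clip length and each split's share of examples) are computed here and are not stated by the dataset authors.

```python
# Split statistics copied from the table above.
splits = {
    "train": {"examples": 81607, "hours": 82.4},
    "validation": {"examples": 5835, "hours": 5.9},
    "test": {"examples": 5799, "hours": 5.8},
}

total_examples = sum(s["examples"] for s in splits.values())
total_hours = sum(s["hours"] for s in splits.values())

for name, s in splits.items():
    # Mean clip length in seconds: total seconds in the split / number of examples.
    mean_seconds = s["hours"] * 3600 / s["examples"]
    # Fraction of all examples that fall into this split.
    share = s["examples"] / total_examples
    print(f"{name}: {mean_seconds:.2f} s per example, {share:.1%} of examples")

print(f"total: {total_examples} examples, {total_hours:.1f} hours")
```

The totals come to 93241 examples and about 94.1 hours, which is consistent with the "approximately 100 hours" quoted for the corpus, and the mean clip length is roughly 3.6 seconds in every split.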
Who are the source language producers?
[Needs More Information]
All recorded audio files were manually annotated.
Who are the annotators?
[Needs More Information]
The dataset consists of recordings from people who have donated their voices. You agree not to attempt to determine the identity of speakers in this dataset.
The dataset was initially created by Egor Zubarev, Timofey Moskalets, and the SOVA.ai team.
@misc{sova2021rudevices,
  author = {Zubarev, Egor and Moskalets, Timofey and SOVA.ai},
  title = {SOVA RuDevices Dataset: free public STT/ASR dataset with manually annotated live speech},
  publisher = {GitHub},
  journal = {GitHub repository},
  year = {2021},
  howpublished = {\url{https://github.com/sovaai/sova-dataset}},
}
Thanks to @bond005 for adding this dataset.