数据集:
vivos
任务:
自动语音识别语言:
vi计算机处理:
monolingual大小:
10K<n<100K批注创建人:
expert-generated源数据集:
original许可:
cc-by-nc-sa-4.0VIVOS is a free Vietnamese speech corpus consisting of 15 hours of recording speech prepared for Vietnamese Automatic Speech Recognition task.
The corpus was prepared by AILAB, a computer science lab of VNUHCM - University of Science, with Prof. Vu Hai Quan is the head of.
We publish this corpus in hope to attract more scientists to solve Vietnamese speech recognition problems.
[Needs More Information]
Vietnamese
A typical data point comprises the path to the audio file, called path and its transcription, called sentence . Some additional information about the speaker and the passage which contains the transcription is provided.
{'speaker_id': 'VIVOSSPK01', 'path': '/home/admin/.cache/huggingface/datasets/downloads/extracted/b7ded9969e09942ab65313e691e6fc2e12066192ee8527e21d634aca128afbe2/vivos/train/waves/VIVOSSPK01/VIVOSSPK01_R001.wav', 'audio': {'path': '/home/admin/.cache/huggingface/datasets/downloads/extracted/b7ded9969e09942ab65313e691e6fc2e12066192ee8527e21d634aca128afbe2/vivos/train/waves/VIVOSSPK01/VIVOSSPK01_R001.wav', 'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32), 'sampling_rate': 16000}, 'sentence': 'KHÁCH SẠN'}
speaker_id: An id for which speaker (voice) made the recording
path: The path to the audio file
audio: A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Note that when accessing the audio column: dataset[0]["audio"] the audio file is automatically decoded and resampled to dataset.features["audio"].sampling_rate . Decoding and resampling of a large number of audio files might take a significant amount of time. Thus it is important to first query the sample index before the "audio" column, i.e. dataset[0]["audio"] should always be preferred over dataset["audio"][0] .
sentence: The sentence the user was prompted to speak
The speech material has been subdivided into portions for train and test.
Speech was recorded in a quiet environment with high quality microphone, speakers were asked to read one sentence at a time.
Train | Test | |
---|---|---|
Speakers | 46 | 19 |
Utterances | 11660 | 760 |
Duration | 14:55 | 00:45 |
Unique Syllables | 4617 | 1692 |
[Needs More Information]
[Needs More Information]
Who are the source language producers?[Needs More Information]
[Needs More Information]
Who are the annotators?[Needs More Information]
The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in this dataset.
[More Information Needed]
[More Information Needed]
Dataset provided for research purposes only. Please check dataset license for additional information.
The dataset was initially prepared by AILAB, a computer science lab of VNUHCM - University of Science.
Public Domain, Creative Commons Attribution NonCommercial ShareAlike v4.0 ( CC BY-NC-SA 4.0 )
@inproceedings{luong-vu-2016-non, title = "A non-expert {K}aldi recipe for {V}ietnamese Speech Recognition System", author = "Luong, Hieu-Thi and Vu, Hai-Quan", booktitle = "Proceedings of the Third International Workshop on Worldwide Language Service Infrastructure and Second Workshop on Open Infrastructures and Analysis Frameworks for Human Language Technologies ({WLSI}/{OIAF}4{HLT}2016)", month = dec, year = "2016", address = "Osaka, Japan", publisher = "The COLING 2016 Organizing Committee", url = "https://aclanthology.org/W16-5207", pages = "51--55", }
Thanks to @binh234 for adding this dataset.