Dataset:

vivos

Task:

Automatic Speech Recognition

Language:

Multilinguality:

monolingual

Size:

10K<n<100K

Language creators:

crowdsourced expert-generated

Annotation creators:

expert-generated

Source datasets:

original

License:

cc-by-nc-sa-4.0

Dataset Card for VIVOS

Dataset Summary

VIVOS is a free Vietnamese speech corpus consisting of 15 hours of recorded speech prepared for the Vietnamese automatic speech recognition task.

The corpus was prepared by AILAB, a computer science lab of VNUHCM - University of Science, headed by Prof. Vu Hai Quan.

We publish this corpus in the hope of attracting more scientists to work on Vietnamese speech recognition problems.

Supported Tasks and Leaderboards

[Needs More Information]

Languages

Vietnamese

Dataset Structure

Data Instances

A typical data point comprises the path to the audio file, called path, and its transcription, called sentence. The speaker who made the recording is identified by speaker_id.

{'speaker_id': 'VIVOSSPK01',
 'path': '/home/admin/.cache/huggingface/datasets/downloads/extracted/b7ded9969e09942ab65313e691e6fc2e12066192ee8527e21d634aca128afbe2/vivos/train/waves/VIVOSSPK01/VIVOSSPK01_R001.wav',
 'audio': {'path': '/home/admin/.cache/huggingface/datasets/downloads/extracted/b7ded9969e09942ab65313e691e6fc2e12066192ee8527e21d634aca128afbe2/vivos/train/waves/VIVOSSPK01/VIVOSSPK01_R001.wav',
           'array': array([-0.00048828, -0.00018311, -0.00137329, ...,  0.00079346, 0.00091553,  0.00085449], dtype=float32),
           'sampling_rate': 16000},
 'sentence': 'KHÁCH SẠN'}
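
As a minimal sketch of how the fields in the instance above relate, the length of a clip in seconds is the number of samples in 'array' divided by 'sampling_rate'. The helper name clip_duration_seconds and the synthetic 24000-sample array are assumptions made for illustration; they are not part of the dataset or of any library.

```python
def clip_duration_seconds(audio):
    """Return the clip length in seconds: number of samples / samples per second."""
    return len(audio["array"]) / audio["sampling_rate"]


# Hypothetical record shaped like the instance above. The 'array' here is a
# synthetic placeholder of 24000 zeros, not real VIVOS audio.
record = {
    "speaker_id": "VIVOSSPK01",
    "audio": {"array": [0.0] * 24000, "sampling_rate": 16000},
    "sentence": "KHÁCH SẠN",
}

duration = clip_duration_seconds(record["audio"])
print(duration)  # 24000 samples at 16000 Hz is 1.5 seconds
```

The same function works on a decoded NumPy array, since it only relies on len().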

Data Fields

speaker_id: The ID of the speaker (voice) who made the recording
path: The path to the audio file
audio: A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Note that when accessing the audio column with dataset[0]["audio"], the audio file is automatically decoded and resampled to dataset.features["audio"].sampling_rate. Decoding and resampling a large number of audio files can take a significant amount of time, so it is important to query the sample index before the "audio" column: dataset[0]["audio"] should always be preferred over dataset["audio"][0].
sentence: The sentence the speaker was prompted to read
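
The access-order advice for the audio field can be illustrated with a toy model. The class below is a hypothetical stand-in written for this card, not the real dataset object or any library API; it only counts how many files would be decoded under each access pattern, assuming decoding happens lazily on access.

```python
class ToyAudioDataset:
    """Hypothetical stand-in for a dataset that decodes audio lazily on access."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.decode_calls = 0  # how many files have been decoded so far

    def _decode(self, path):
        # Placeholder for the real (expensive) decoding and resampling work.
        self.decode_calls += 1
        return {"path": path, "array": [0.0, 0.0], "sampling_rate": 16000}

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: only the requested sample is decoded.
            path = self.paths[key]
            return {"path": path, "audio": self._decode(path)}
        if key == "audio":
            # Column access: every file is decoded before any indexing happens.
            return [self._decode(p) for p in self.paths]
        if key == "path":
            return list(self.paths)
        raise KeyError(key)


paths = [f"utt_{i:03d}.wav" for i in range(100)]  # made-up file names

row_first = ToyAudioDataset(paths)
_ = row_first[0]["audio"]          # index the row, then the column
row_first_decodes = row_first.decode_calls

column_first = ToyAudioDataset(paths)
_ = column_first["audio"][0]       # index the column, then the row
column_first_decodes = column_first.decode_calls

print(row_first_decodes, column_first_decodes)
```

In this toy model, row-first access decodes one file while column-first access decodes all 100, which is the cost difference the field description warns about.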

Data Splits

The speech material has been subdivided into portions for train and test.

Speech was recorded in a quiet environment with a high-quality microphone; speakers were asked to read one sentence at a time.

                   Train    Test
Speakers           46       19
Utterances         11660    760
Duration (hh:mm)   14:55    00:45
Unique Syllables   4617     1692
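
The split statistics above allow a quick sanity check of the average utterance length. The sketch below assumes the Duration values are given as hours:minutes, which is consistent with the stated corpus size of about 15 hours; the helper name hhmm_to_seconds is hypothetical.

```python
def hhmm_to_seconds(text):
    """Convert an 'hh:mm' string to seconds (assumes hours:minutes, not minutes:seconds)."""
    hours, minutes = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60


# Figures copied from the split table above.
splits = {
    "train": {"utterances": 11660, "duration": "14:55"},
    "test": {"utterances": 760, "duration": "00:45"},
}

# Mean utterance length per split, in seconds.
mean_seconds = {
    name: hhmm_to_seconds(stats["duration"]) / stats["utterances"]
    for name, stats in splits.items()
}

# Total recorded time across both splits, in hours.
total_hours = sum(hhmm_to_seconds(s["duration"]) for s in splits.values()) / 3600

for name, value in mean_seconds.items():
    print(f"{name}: {value:.2f} s per utterance")
print(f"total: {total_hours:.2f} h")
```

Under that assumption, utterances average a few seconds each, which matches a corpus of short read sentences.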

Dataset Creation

Curation Rationale

[Needs More Information]

Source Data

Initial Data Collection and Normalization

[Needs More Information]

Who are the source language producers?

[Needs More Information]

Annotations

Annotation process

[Needs More Information]

Who are the annotators?

[Needs More Information]

Personal and Sensitive Information

The dataset consists of recordings from people who have donated their voices online. You agree not to attempt to determine the identity of speakers in this dataset.

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

The dataset is provided for research purposes only. Please check the dataset license for additional information.

Additional Information

Dataset Curators

The dataset was initially prepared by AILAB, a computer science lab of VNUHCM - University of Science.

Licensing Information

Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

Citation Information

@inproceedings{luong-vu-2016-non,
    title = "A non-expert {K}aldi recipe for {V}ietnamese Speech Recognition System",
    author = "Luong, Hieu-Thi  and
      Vu, Hai-Quan",
    booktitle = "Proceedings of the Third International Workshop on Worldwide Language Service Infrastructure and Second Workshop on Open Infrastructures and Analysis Frameworks for Human Language Technologies ({WLSI}/{OIAF}4{HLT}2016)",
    month = dec,
    year = "2016",
    address = "Osaka, Japan",
    publisher = "The COLING 2016 Organizing Committee",
    url = "https://aclanthology.org/W16-5207",
    pages = "51--55",
}

Contributions

Thanks to @binh234 for adding this dataset.

Author:

Anonymous

Dataset size:

1.37 GB