Dataset:

jimregan/clarinpl_studio

Tasks:

task_categories:other

Automatic Speech Recognition

Languages:

Multilinguality:

monolingual

Size:

10K<n<100K

Annotations creators:

expert-generated

Source datasets:

original

Preprint:

arxiv:1706.00245

License:

other

Dataset Card for ClarinPL Studio Speech Corpus

Dataset Summary

The corpus consists of 317 speakers recorded in 554 sessions, where each session consists of 20 read sentences and 10 phonetically rich words. The size of the audio portion of the corpus amounts to around 56 hours, with transcriptions containing 356674 words from a vocabulary of size 46361.
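
As a rough sanity check on these figures, per-speaker and per-session averages can be derived from the totals quoted above. The sketch below uses only the numbers stated in this summary; the variable names are illustrative, and because the 56 hours is an approximate figure, the derived values are approximate as well.

```python
# Totals as stated in the dataset summary (the 56 hours is approximate).
speakers = 317
sessions = 554
audio_hours = 56
word_tokens = 356674
vocabulary_size = 46361

# Simple averages derived from the totals.
sessions_per_speaker = sessions / speakers
minutes_per_session = audio_hours * 60 / sessions
words_per_session = word_tokens / sessions
# Share of distinct word forms among all transcribed words.
type_token_ratio = vocabulary_size / word_tokens

print(f"sessions per speaker: {sessions_per_speaker:.2f}")
print(f"minutes per session:  {minutes_per_session:.2f}")
print(f"words per session:    {words_per_session:.1f}")
print(f"type/token ratio:     {type_token_ratio:.3f}")
```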

Supported Tasks and Leaderboards

[Needs More Information]

Languages

The audio is in Polish.

Dataset Structure

Data Instances

A typical data point comprises the path to the audio file, called file, and its transcription, called text. An example from the dataset is:

{'file': '/root/.cache/huggingface/datasets/downloads/extracted/333ddc746f2df1e1d19b44986992d4cbe28710fde81d533a220e755ee6c5c519/audio/SES0001/rich001.wav',
 'id': 'SES0001_rich001',
 'speaker_id': 'SPK0001',
 'text': 'drożdże dżip gwożdżenie ozimina wędzarz rdzeń wędzonka ingerować kładzenie jutrzenka'}
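
In the example above, the id appears to be the session directory name and the audio file name joined by an underscore. The following sketch checks that relationship for a single record. It is an assumption inferred from this one example, not a documented guarantee, and the helper names split_utterance_id and id_matches_path are hypothetical. The file value is shortened to the tail of the path shown above.

```python
from pathlib import PurePosixPath

# Record modelled on the example instance; "file" keeps only the tail of
# the cached path, since the cache prefix differs between machines.
record = {
    "file": "audio/SES0001/rich001.wav",
    "id": "SES0001_rich001",
    "speaker_id": "SPK0001",
}


def split_utterance_id(utterance_id):
    """Split an id such as 'SES0001_rich001' into (session, recording).

    Assumes the first underscore separates the two parts.
    """
    session, _, recording = utterance_id.partition("_")
    return session, recording


def id_matches_path(rec):
    """Return True if the id agrees with the directory and file name."""
    path = PurePosixPath(rec["file"])
    session, recording = split_utterance_id(rec["id"])
    return path.parent.name == session and path.stem == recording


print(split_utterance_id(record["id"]))
print(id_matches_path(record))
```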

Data Fields

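file: A path to the downloaded audio file, in .wav format.
id: An identifier for the recording; in the example above it combines the session directory and the file name (SES0001_rich001).
text: The transcription of the audio file.
speaker_id: The ID of the speaker of the audio.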
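
Because each record carries a speaker_id, records can be grouped by speaker, for example to check how much material each speaker contributes. The sketch below counts records per speaker over any iterable of dictionaries that have a speaker_id key. The sample records are made up for illustration; only the field names come from this card, and utterances_per_speaker is a hypothetical helper name.

```python
from collections import Counter


def utterances_per_speaker(records):
    """Count how many records each speaker_id contributes."""
    return Counter(rec["speaker_id"] for rec in records)


# Hypothetical records; the ids and speakers are illustrative only.
sample_records = [
    {"id": "SES0001_rich001", "speaker_id": "SPK0001"},
    {"id": "SES0001_sent001", "speaker_id": "SPK0001"},
    {"id": "SES0002_rich001", "speaker_id": "SPK0002"},
]

counts = utterances_per_speaker(sample_records)
for speaker, n in sorted(counts.items()):
    print(speaker, n)
```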

Data Splits

	Train	Test	Valid
dataset	11222	1362	1229
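
To see how the examples are distributed across the splits, the counts above can be turned into proportions. This is a minimal sketch using only the three numbers in the table; the dictionary keys are illustrative labels and may not match the split names used by a loader.

```python
# Example counts per split, as listed in the table.
split_sizes = {"train": 11222, "test": 1362, "valid": 1229}

total = sum(split_sizes.values())
# Fraction of all examples that falls into each split.
split_shares = {name: size / total for name, size in split_sizes.items()}

print(f"total examples: {total}")
for name, share in split_shares.items():
    print(f"{name}: {share:.1%}")
```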

Dataset Creation

Curation Rationale

The purpose of this segment of the project was to develop specific tools that would allow for automatic and semi-automatic processing of large quantities of acoustic speech data. Another purpose of the corpus was to serve as a reference for studies in phonetics and pronunciation.

Source Data

Initial Data Collection and Normalization

The corpus was recorded in a studio environment using two microphones: a high-quality studio microphone and a typical consumer audio headset.

Who are the source language producers?

[Needs More Information]

Annotations

Annotation process

[Needs More Information]

Who are the annotators?

[Needs More Information]

Personal and Sensitive Information

[Needs More Information]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[Needs More Information]

Additional Information

Dataset Curators

[Needs More Information]

Licensing Information

CLARIN PUB+BY+INF+NORED

Citation Information

@article{korvzinek2017polish,
  title={Polish read speech corpus for speech tools and services},
  author={Kor{\v{z}}inek, Danijel and Marasek, Krzysztof and Brocki, {\L}ukasz and Wo{\l}k, Krzysztof},
  journal={arXiv preprint arXiv:1706.00245},
  year={2017}
}

Contributions

[Needs More Information]

Author:

jimregan

Dataset size:

13.77 KB