Dataset: jimregan/clarinpl_studio
Languages: pl
Multilinguality: monolingual
Size: 10K<n<100K
Annotations creators: expert-generated
Source datasets: original
ArXiv: arxiv:1706.00245
License: other

The corpus consists of 317 speakers recorded in 554 sessions, where each session consists of 20 read sentences and 10 phonetically rich words. The audio portion of the corpus amounts to around 56 hours, with transcriptions containing 356674 words from a vocabulary of 46361 distinct words.
The audio is in Polish.
A typical data point comprises the path to the audio file, called 'file', and its transcription, called 'text', together with an utterance 'id' and a 'speaker_id'. An example from the dataset is:
{'file': '/root/.cache/huggingface/datasets/downloads/extracted/333ddc746f2df1e1d19b44986992d4cbe28710fde81d533a220e755ee6c5c519/audio/SES0001/rich001.wav', 'id': 'SES0001_rich001', 'speaker_id': 'SPK0001', 'text': 'drożdże dżip gwożdżenie ozimina wędzarz rdzeń wędzonka ingerować kładzenie jutrzenka'}
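The example above suggests that the 'id' field mirrors the last two components of the audio path (session directory plus file name). The sketch below illustrates that relationship under this assumption only; the helper names `split_utterance_id` and `id_from_path` are hypothetical and not part of the dataset or any library, and the cache prefix of the path is shortened for readability.

```python
from pathlib import PurePosixPath

# Record copied from the example above; the cache directory prefix of
# 'file' is shortened here for readability.
example = {
    "file": "audio/SES0001/rich001.wav",
    "id": "SES0001_rich001",
    "speaker_id": "SPK0001",
    "text": "drożdże dżip gwożdżenie ozimina wędzarz rdzeń wędzonka "
            "ingerować kładzenie jutrzenka",
}


def split_utterance_id(utterance_id: str) -> tuple[str, str]:
    """Split an id such as 'SES0001_rich001' into (session, utterance).

    Assumes the first underscore separates the two parts, as in the example.
    """
    session, _, utterance = utterance_id.partition("_")
    return session, utterance


def id_from_path(path: str) -> str:
    """Rebuild the id from the session directory and the file stem.

    Assumes the layout .../<session>/<utterance>.wav seen in the example.
    """
    p = PurePosixPath(path)
    return f"{p.parent.name}_{p.stem}"


session, utterance = split_utterance_id(example["id"])
word_count = len(example["text"].split())
print(session, utterance, word_count)
```

If the assumption holds for a given record, `id_from_path(record["file"])` equals `record["id"]`; a mismatch would indicate that the naming convention differs from the one shown in this single example.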
| | Train | Test | Valid |
|---|---|---|---|
| dataset | 11222 | 1362 | 1229 |
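As a quick check on the split sizes, the following sketch computes the total number of examples and each split's share of it. The counts are taken from the table; the variable names are illustrative only.

```python
# Split sizes as listed in the table.
split_sizes = {"train": 11222, "test": 1362, "valid": 1229}

# Total number of examples across all splits.
total = sum(split_sizes.values())

# Fraction of the whole dataset held by each split.
shares = {name: count / total for name, count in split_sizes.items()}

for name, count in split_sizes.items():
    print(f"{name}: {count} examples ({shares[name]:.1%} of {total})")
```

The three splits add up to 13813 examples, with roughly four fifths of them in the training split.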
The purpose of this segment of the project was to develop specific tools that would allow for automatic and semi-automatic processing of large quantities of acoustic speech data. Another purpose of the corpus was to serve as a reference for studies in phonetics and pronunciation.
The corpus was recorded in a studio environment using two microphones: a high-quality studio microphone and a typical consumer audio headset.
Who are the source language producers? [Needs More Information]
Who are the annotators? [Needs More Information]
@article{korvzinek2017polish,
  title={Polish read speech corpus for speech tools and services},
  author={Kor{\v{z}}inek, Danijel and Marasek, Krzysztof and Brocki, {\L}ukasz and Wo{\l}k, Krzysztof},
  journal={arXiv preprint arXiv:1706.00245},
  year={2017}
}