数据集:
collectivat/tv3_parla
子任务:
language-modeling语言:
ca计算机处理:
monolingual大小:
100K<n<1M语言创建人:
found批注创建人:
found源数据集:
original许可:
cc-by-nc-4.0This corpus includes 240 hours of Catalan speech from broadcast material. The details of segmentation, data processing and also model training are explained in Külebi, Öktem; 2018. The content is owned by Corporació Catalana de Mitjans Audiovisuals, SA (CCMA); we processed their material and hereby making it available under their terms of use.
This project was supported by the Softcatalà Association.
The dataset can be used for:
The dataset is in Catalan ( ca ).
{ 'path': 'tv3_0.3/wav/train/5662515_1492531876710/5662515_1492531876710_120.180_139.020.wav', 'audio': {'path': 'tv3_0.3/wav/train/5662515_1492531876710/5662515_1492531876710_120.180_139.020.wav', 'array': array([-0.01168823, 0.01229858, 0.02819824, ..., 0.015625 , 0.01525879, 0.0145874 ]), 'sampling_rate': 16000}, 'text': 'algunes montoneres que que et feien anar ben col·locat i el vent també hi jugava una mica de paper bufava vent de cantó alguns cops o de cul i el pelotón el vent el porta molt malament hi havia molts nervis' }
The dataset is split into "train" and "test".
train | test | |
---|---|---|
Number of examples | 159242 | 2220 |
[More Information Needed]
[More Information Needed]
Who are the source language producers?[More Information Needed]
[More Information Needed]
Who are the annotators?[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
Creative Commons Attribution-NonCommercial 4.0 International .
@inproceedings{kulebi18_iberspeech, author={Baybars Külebi and Alp Öktem}, title={{Building an Open Source Automatic Speech Recognition System for Catalan}}, year=2018, booktitle={Proc. IberSPEECH 2018}, pages={25--29}, doi={10.21437/IberSPEECH.2018-6} }
Thanks to @albertvillanova for adding this dataset.