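Dataset:
collectivat/tv3_parla
Subtasks:
language-modeling
Languages:
Catalan
Multilinguality:
monolingual
Size:
100K<n<1M
Language creators:
found
Annotation creators:
found
Source datasets:
original
License:
Creative Commons Attribution-NonCommercial 4.0 International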
This corpus includes 240 hours of Catalan speech from broadcast material. The details of segmentation, data processing, and model training are explained in Külebi, Öktem; 2018. The content is owned by Corporació Catalana de Mitjans Audiovisuals, SA (CCMA); we processed their material and hereby make it available under their terms of use.
This project was supported by the Softcatalà Association.
The dataset can be used for:
The dataset is in Catalan (ca).
{
  'path': 'tv3_0.3/wav/train/5662515_1492531876710/5662515_1492531876710_120.180_139.020.wav',
  'audio': {'path': 'tv3_0.3/wav/train/5662515_1492531876710/5662515_1492531876710_120.180_139.020.wav',
            'array': array([-0.01168823, 0.01229858, 0.02819824, ..., 0.015625,
                            0.01525879, 0.0145874]),
            'sampling_rate': 16000},
  'text': 'algunes montoneres que que et feien anar ben col·locat i el vent també hi jugava una mica de paper bufava vent de cantó alguns cops o de cul i el pelotón el vent el porta molt malament hi havia molts nervis'
}
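As a rough illustration of how an instance shaped like the one above might be inspected, the sketch below computes a clip's duration from `audio['array']` and `audio['sampling_rate']`. It also reads two numbers from the file name. Treating the last two underscore-separated fields of the file name as start and end offsets in seconds is an assumption based only on the example path, not something the card states. The helper names `segment_bounds` and `duration_seconds` are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Assumption: the file stem ends in "_<start>_<end>", both in seconds,
# as the example path "..._120.180_139.020.wav" suggests.
_SEGMENT_RE = re.compile(r"_(\d+\.\d+)_(\d+\.\d+)$")


def segment_bounds(path):
    """Return (start, end) parsed from the file name, or None if absent."""
    stem = PurePosixPath(path).stem  # drops the ".wav" suffix
    match = _SEGMENT_RE.search(stem)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def duration_seconds(example):
    """Duration of a clip computed from its samples and sampling rate."""
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"]
```

If the file-name assumption holds, the difference between the two parsed offsets should be close to the value returned by `duration_seconds` for the same instance, which gives a cheap consistency check on the segmentation.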
The dataset is split into "train" and "test".
| | train | test |
|---|---|---|
| Number of examples | 159242 | 2220 |
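A minimal sketch of loading a split and checking its size against the table above, assuming the Hugging Face `datasets` library is installed and the repository is reachable under the identifier `collectivat/tv3_parla`. Depending on the installed `datasets` version, extra arguments may be needed to load it. The helper names `load_tv3_parla` and `split_size_mismatches` are hypothetical.

```python
# Example counts copied from the split table in this card.
EXPECTED_EXAMPLES = {"train": 159242, "test": 2220}


def load_tv3_parla(split):
    """Load one split; needs the `datasets` package and network access."""
    # Imported lazily so the size check below also works offline.
    from datasets import load_dataset

    return load_dataset("collectivat/tv3_parla", split=split)


def split_size_mismatches(actual_sizes):
    """Return {split: (expected, actual)} for each split whose size differs."""
    return {
        split: (expected, actual_sizes.get(split))
        for split, expected in EXPECTED_EXAMPLES.items()
        if actual_sizes.get(split) != expected
    }


# Usage (downloads data, so it is left commented out):
# sizes = {s: len(load_tv3_parla(s)) for s in EXPECTED_EXAMPLES}
# print(split_size_mismatches(sizes))
```

An empty dictionary from `split_size_mismatches` means the loaded splits match the counts listed in the table.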
Who are the source language producers? [More Information Needed]
Who are the annotators? [More Information Needed]
Creative Commons Attribution-NonCommercial 4.0 International.
@inproceedings{kulebi18_iberspeech,
author={Baybars Külebi and Alp Öktem},
title={{Building an Open Source Automatic Speech Recognition System for Catalan}},
year=2018,
booktitle={Proc. IberSPEECH 2018},
pages={25--29},
doi={10.21437/IberSPEECH.2018-6}
}
Thanks to @albertvillanova for adding this dataset.