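Datasets: collectivat/tv3_parla
Sub-tasks: language-modeling
Languages: Catalan
Multilinguality: monolingual
Size: 100K<n<1M
Language creators: found
Annotation creators: found
Source datasets: original
License: Creative Commons Attribution-NonCommercial 4.0 International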
This corpus includes 240 hours of Catalan speech from broadcast material. Segmentation, data processing, and model training are described in Külebi and Öktem (2018). The content is owned by Corporació Catalana de Mitjans Audiovisuals, SA (CCMA); we processed their material and are hereby making it available under their terms of use.
This project was supported by the Softcatalà Association.
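The dataset can be used for language modeling and for training automatic speech recognition (ASR) models, as described in Külebi and Öktem (2018).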
The dataset is in Catalan (ca).
{
  'path': 'tv3_0.3/wav/train/5662515_1492531876710/5662515_1492531876710_120.180_139.020.wav',
  'audio': {'path': 'tv3_0.3/wav/train/5662515_1492531876710/5662515_1492531876710_120.180_139.020.wav',
   'array': array([-0.01168823,  0.01229858,  0.02819824, ...,  0.015625  ,
          0.01525879,  0.0145874 ]),
   'sampling_rate': 16000},
  'text': 'algunes montoneres que que et feien anar ben col·locat i el vent també hi jugava una mica de paper bufava vent de cantó alguns cops o de cul i el pelotón el vent el porta molt malament hi havia molts nervis'
}
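The record above can be inspected with a few lines of Python. The following is a minimal sketch, not part of the dataset's tooling: `segment_bounds` and `audio_duration` are hypothetical helper names, and reading the last two numeric fields of the file name as start and end offsets in seconds is an assumption based only on the example path.

```python
import re
from pathlib import PurePosixPath

# Assumption: file names end in "_<start>_<end>.wav", with both values in seconds.
SEGMENT_RE = re.compile(r"_(\d+\.\d+)_(\d+\.\d+)\.wav$")


def segment_bounds(path):
    """Parse the (start, end) offsets from a segment file name."""
    match = SEGMENT_RE.search(PurePosixPath(path).name)
    if match is None:
        raise ValueError(f"unrecognised segment file name: {path}")
    start, end = (float(group) for group in match.groups())
    return start, end


def audio_duration(record):
    """Duration in seconds implied by the decoded samples of one record."""
    audio = record["audio"]
    return len(audio["array"]) / audio["sampling_rate"]
```

If the file-name assumption holds, the difference between the two parsed offsets should be close to the value returned by `audio_duration` for the same record, which makes for a cheap consistency check.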
The dataset is split into "train" and "test".
| | train | test |
|---|---|---|
| Number of examples | 159242 | 2220 |
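For a rough sense of scale, the split sizes can be combined with the stated corpus length. This is only a back-of-the-envelope sketch: it assumes the 240 hours cover both splits together, which the card does not state, and the variable names are illustrative.

```python
# Split sizes as listed in the table above.
SPLIT_SIZES = {"train": 159242, "test": 2220}

# Assumption: the stated 240 hours span train and test combined.
TOTAL_HOURS = 240

total_examples = sum(SPLIT_SIZES.values())

# Share of all examples held out for testing.
test_share = SPLIT_SIZES["test"] / total_examples

# Mean segment length in seconds under the assumption above.
mean_segment_seconds = TOTAL_HOURS * 3600 / total_examples

print(f"examples: {total_examples}")
print(f"test share: {test_share:.2%}")
print(f"mean segment length: {mean_segment_seconds:.2f} s")
```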
Who are the source language producers? [More Information Needed]
Who are the annotators? [More Information Needed]
Creative Commons Attribution-NonCommercial 4.0 International.
@inproceedings{kulebi18_iberspeech,
  author={Baybars Külebi and Alp Öktem},
  title={{Building an Open Source Automatic Speech Recognition System for Catalan}},
  year=2018,
  booktitle={Proc. IberSPEECH 2018},
  pages={25--29},
  doi={10.21437/IberSPEECH.2018-6}
}
 Thanks to @albertvillanova for adding this dataset.