This is the ParlamentParla speech corpus for Catalan, prepared by Col·lectivaT. The audio segments were extracted from recordings of the plenary sessions of the Catalan Parliament (Parlament de Catalunya) that took place between 2007/07/11 and 2018/07/17. We aligned the transcriptions with the recordings and extracted the corpus. The content belongs to the Catalan Parliament, and the data is released in accordance with its terms of use.
Preparation of this corpus was partly supported by the Department of Culture of the Catalan autonomous government, and v2.0 was supported by the Barcelona Supercomputing Center, within the framework of Projecte AINA of the Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya.
As of v2.0 the corpus is separated into 211 hours of clean and 400 hours of other quality segments. Furthermore, each speech segment is tagged with its speaker and each speaker with their gender. The statistics are detailed in the readme file.
The dataset can be used for:
The dataset is in Catalan (ca-CA).
{
  'path': 'clean_train/c/c/ccca4790a55aba3e6bcf_63.88_74.06.wav',
  'audio': {
    'path': 'clean_train/c/c/ccca4790a55aba3e6bcf_63.88_74.06.wav',
    'array': array([-6.10351562e-05, -6.10351562e-05, -1.22070312e-04, ..., -1.22070312e-04, 0.00000000e+00, -3.05175781e-05]),
    'sampling_rate': 16000
  },
  'speaker_id': 167,
  'sentence': "alguns d'ells avui aquí presents un agraïment a aquells que mantenen viva la memòria aquest acte de reparació i dignitat és",
  'gender': 0,
  'duration': 10.18
}
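The segment filename in the example above appears to encode the start and end offsets, in seconds, within the source recording: 74.06 - 63.88 equals the `duration` value of 10.18. The sketch below assumes that this naming pattern holds for all segments, which the card does not state explicitly. The helper name `segment_offsets` is hypothetical.

```python
from pathlib import PurePosixPath


def segment_offsets(path: str) -> tuple[str, float, float]:
    """Split '<recording_id>_<start>_<end>.wav' into its three parts.

    Assumption: the two trailing underscore-separated fields are the
    start and end offsets in seconds (inferred from a single example).
    """
    stem = PurePosixPath(path).stem  # drop directories and the '.wav' suffix
    recording_id, start, end = stem.rsplit("_", 2)
    return recording_id, float(start), float(end)


example = {
    "path": "clean_train/c/c/ccca4790a55aba3e6bcf_63.88_74.06.wav",
    "duration": 10.18,
}

recording_id, start, end = segment_offsets(example["path"])
# Compare with a tolerance because the offsets are decimal floats.
matches = abs((end - start) - example["duration"]) < 0.005
```

If the assumption holds, the offsets make it possible to locate a segment in the original session recording.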
The dataset is split into "train", "validation" and "test".
The dataset was created by aligning the parliamentary session transcripts with the audiovisual content. For more detailed information, please consult the paper listed in the citation information.
We created this corpus to contribute to the development of language models in Catalan, a low-resource language.
The audio segments were extracted from recordings of the plenary sessions of the Catalan Parliament (Parlament de Catalunya) that took place between 2007/07/11 and 2018/07/17. The cleaning procedures are in the archived repository Long Audio Aligner.
Who are the source language producers?
The members of parliament of the legislatures between 2007/07/11 and 2018/07/17.
The dataset is unannotated.
Annotation process
[N/A]
Who are the annotators?
[N/A]
The initial content is publicly available. Furthermore, the identities of the parliamentary members are anonymized.
We hope this corpus contributes to the development of language models in Catalan, a low-resource language.
This dataset has a gender bias. However, since the speakers are tagged according to their gender, it is possible to create a balanced subcorpus.
Subcorpus | Gender | Duration (h) |
---|---|---|
other_test | F | 2.516 |
other_dev | F | 2.701 |
other_train | F | 109.68 |
other_test | M | 2.631 |
other_dev | M | 2.513 |
other_train | M | 280.196 |
other total | | 400.239 |
clean_test | F | 2.707 |
clean_dev | F | 2.576 |
clean_train | F | 77.905 |
clean_test | M | 2.516 |
clean_dev | M | 2.614 |
clean_train | M | 123.162 |
clean total | | 211.48 |
Total | | 611.719 |
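Because each segment carries `gender` and `duration` fields, one way to build a gender-balanced subcorpus is to cap every gender at the total duration of the smallest group. The following is a minimal sketch over plain dictionaries. The function name `balance_by_gender`, the greedy selection strategy, and the toy records are illustrative assumptions, not the authors' procedure. The card also does not say which integer code corresponds to which gender, so the codes are treated as opaque labels.

```python
import random
from collections import defaultdict


def balance_by_gender(segments, seed=0):
    """Return a subset in which each gender has roughly equal total duration.

    Each segment is a dict with 'gender' and 'duration' (seconds) keys,
    mirroring the data instance shown in this card.
    """
    by_gender = defaultdict(list)
    for seg in segments:
        by_gender[seg["gender"]].append(seg)

    # The smallest group sets the duration budget for every group.
    budget = min(sum(s["duration"] for s in group) for group in by_gender.values())

    rng = random.Random(seed)
    selected = []
    for group in by_gender.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        total = 0.0
        for seg in shuffled:
            # Skip segments that would push this gender over the budget.
            if total + seg["duration"] <= budget + 1e-9:
                selected.append(seg)
                total += seg["duration"]
    return selected


# Toy records, made up for illustration: code 0 has 30 s, code 1 has 12 s.
toy = [
    {"gender": 0, "duration": 10.0},
    {"gender": 0, "duration": 8.0},
    {"gender": 0, "duration": 12.0},
    {"gender": 1, "duration": 5.0},
    {"gender": 1, "duration": 7.0},
]
balanced = balance_by_gender(toy)
totals = defaultdict(float)
for seg in balanced:
    totals[seg["gender"]] += seg["duration"]
```

The greedy pass is simple but only approximates the budget for the larger group. Balancing per speaker as well as per gender would need an additional grouping on `speaker_id`.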
The text corpus belongs to the domain of Catalan politics.
Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)
This work was funded by the Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya within the framework of Projecte AINA.
Creative Commons Attribution 4.0 International.
@dataset{kulebi_baybars_2021_5541827,
  author    = {Külebi, Baybars},
  title     = {{ParlamentParla - Speech corpus of Catalan Parliamentary sessions}},
  month     = oct,
  year      = 2021,
  publisher = {Zenodo},
  version   = {v2.0},
  doi       = {10.5281/zenodo.5541827},
  url       = {https://doi.org/10.5281/zenodo.5541827}
}
For the paper:
@inproceedings{kulebi2022parlamentparla,
  title={ParlamentParla: A Speech Corpus of Catalan Parliamentary Sessions},
  author={K{\"u}lebi, Baybars and Armentano-Oller, Carme and Rodr{\'\i}guez-Penagos, Carlos and Villegas, Marta},
  booktitle={Workshop on Creating, Enriching and Using Parliamentary Corpora},
  volume={125},
  number={130},
  pages={125},
  year={2022}
}
Thanks to @albertvillanova for adding this dataset.