Datasets:
BSC-LT/tecla
Languages:
ca
If you use any of these resources (datasets or models) in your work, please cite our latest paper:
@inproceedings{armengol-estape-etal-2021-multilingual,
    title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
    author = "Armengol-Estap{\'e}, Jordi and Carrino, Casimiro Pio and Rodriguez-Penagos, Carlos and de Gibert Bonet, Ona and Armentano-Oller, Carme and Gonzalez-Agirre, Aitor and Melero, Maite and Villegas, Marta",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.437",
    doi = "10.18653/v1/2021.findings-acl.437",
    pages = "4933--4946",
}
https://doi.org/10.5281/zenodo.4627198
TeCla is a Catalan news corpus for thematic text classification tasks. It contains 153,265 articles classified under 30 different categories.
The source data was crawled from the ACN (Catalan News Agency) site, http://www.acn.cat, and is used under the CC-BY-NC-ND 4.0 licence. The dataset is released under the same licence and is intended exclusively for training machine learning models.
This dataset was developed by BSC TeMU as part of the AINA project and is part of the Catalan Language Understanding Benchmark (CLUB), as presented in:
Armengol-Estapé J., Carrino CP., Rodriguez-Penagos C., de Gibert Bonet O., Armentano-Oller C., Gonzalez-Agirre A., Melero M. and Villegas M., "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan". Findings of ACL 2021 (ACL-IJCNLP 2021).
Text classification, Language Model
CA - Catalan
Three JSON files, one for each split.
Each record follows a simple data model: the article text and its associated label, without further metadata.
{
  "version": "1.0",
  "data": [
    {
      "sentence": "L'editorial valenciana Media Vaca, Premi Nacional a la Millor Tasca Editorial Cultural del 2018. El jurat en destaca la cura \"exquisida\" del catàleg, la qualitat dels llibres i el \"respecte\" pels lectors. ACN Madrid.-L'editorial valenciana Media Vaca ha obtingut el Premi Nacional a la Millor Labor Editorial Cultural corresponent a l'any 2018 que atorga el Ministeri de Cultura i Esports. El guardó pretén distingir la tasca editorial d'una persona física o jurídica que hagi destacat per l'aportació a la vida cultural espanyola. El premi és de caràcter honorífic i no té dotació econòmica. En el cas de Media Vaca, fundada pel valencià Vicente Ferrer i la bilbaïna Begoña Lobo, el jurat n'ha destacat la cura \"exquisida\" del catàleg, la qualitat dels llibres i el \"respecte\" pels lectors i per la resta d'agents de la cadena del llibre. Media Vaca va publicar els primers llibres el desembre del 1998. El catàleg actual el componen 64 títols dividits en sis col·leccions, que barregen ficció i no ficció. Des del Ministeri de Cultura es destaca que la il·lustració té un pes \"fonamental\" als productes de l'editorial i que la majoria de projectes solen partir de propostes literàries i textos preexistents. L'editorial ha rebut quatre vegades el Bologna Ragazzi Award. És l'única editorial estatal que ha aconseguit el guardó que atorga la Fira del Llibre per a Nens de Bolonya, la més important del sector.",
      "label": "Lletres"
    },
    ...
  ]
}
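As an illustration of how a split file with the layout shown in the sample above might be read, here is a minimal Python sketch. The helper name `load_split` is hypothetical, and the code assumes every record carries exactly the `sentence` and `label` keys from the sample and that the file is valid UTF-8 JSON.

```python
import json
from pathlib import Path


def load_split(path):
    """Return a list of (sentence, label) pairs from one split file.

    Assumption: the file has a top-level "data" list whose records
    hold the "sentence" and "label" keys shown in the sample record.
    """
    with Path(path).open(encoding="utf-8") as handle:
        payload = json.load(handle)
    return [(record["sentence"], record["label"]) for record in payload["data"]]
```

Calling `load_split("train.json")` would then give pairs that can be passed directly to a tokenizer or a label encoder.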
'Societat', 'Política', 'Turisme', 'Salut', 'Economia', 'Successos', 'Partits', 'Educació', 'Policial', 'Medi ambient', 'Parlament', 'Empresa', 'Judicial', 'Unió Europea', 'Comerç', 'Cultura', 'Cinema', 'Govern', 'Lletres', 'Infraestructures', 'Música', 'Festa i cultura popular', 'Teatre', 'Mobilitat', 'Govern espanyol', 'Equipaments i patrimoni', 'Meteorologia', 'Treball', 'Trànsit', 'Món'
train.json: 122587 articles
Label | Number of articles | % of articles |
---|---|---|
Societat | 24975 | 20.37% |
Política | 18344 | 14.96% |
Partits | 10056 | 8.2% |
Successos | 7874 | 6.42% |
Judicial | 5788 | 4.72% |
Policial | 5557 | 4.53% |
Salut | 5430 | 4.43% |
Economia | 5032 | 4.1% |
Parlament | 4176 | 3.41% |
Medi_ambient | 3027 | 2.47% |
Música | 2872 | 2.34% |
Educació | 2757 | 2.25% |
Empresa | 2698 | 2.2% |
Cultura | 2495 | 2.04% |
Unió_Europea | 2064 | 1.68% |
Govern | 2039 | 1.66% |
Infraestructures | 1740 | 1.42% |
Treball | 1655 | 1.35% |
Mobilitat | 1624 | 1.32% |
Cinema | 1560 | 1.27% |
Teatre | 1492 | 1.22% |
Turisme | 1232 | 1.01% |
Equipaments_i_patrimoni | 1229 | 1.0% |
Lletres | 1180 | 0.96% |
Meteorologia | 1080 | 0.88% |
Comerç | 984 | 0.8% |
Govern_espanyol | 983 | 0.8% |
Món | 893 | 0.73% |
Festa_i_cultura_popular | 888 | 0.72% |
Trànsit | 863 | 0.7% |
dev.json and test.json: 15339 articles in each split
Label | Number of articles | % of articles |
---|---|---|
Societat | 3122 | 20.35% |
Política | 2294 | 14.96% |
Partits | 1257 | 8.19% |
Successos | 985 | 6.42% |
Judicial | 724 | 4.72% |
Policial | 695 | 4.53% |
Salut | 679 | 4.43% |
Economia | 630 | 4.11% |
Parlament | 523 | 3.41% |
Medi_ambient | 379 | 2.47% |
Música | 359 | 2.34% |
Educació | 345 | 2.25% |
Empresa | 338 | 2.2% |
Cultura | 312 | 2.03% |
Unió_Europea | 258 | 1.68% |
Govern | 256 | 1.67% |
Infraestructures | 218 | 1.42% |
Treball | 208 | 1.36% |
Mobilitat | 204 | 1.33% |
Cinema | 195 | 1.27% |
Teatre | 187 | 1.22% |
Turisme | 154 | 1.0% |
Equipaments_i_patrimoni | 154 | 1.0% |
Lletres | 148 | 0.96% |
Meteorologia | 135 | 0.88% |
Govern_espanyol | 124 | 0.81% |
Comerç | 123 | 0.8% |
Festa_i_cultura_popular | 112 | 0.73% |
Món | 112 | 0.73% |
Trànsit | 109 | 0.71% |
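The per-label counts and percentages in the tables above can be recomputed from any split. The following Python sketch shows one way to do it; the function name `label_distribution` is hypothetical, and it takes a plain list of label strings rather than assuming any particular loader.

```python
from collections import Counter


def label_distribution(labels):
    """Return (label, count, percentage) rows sorted by descending count."""
    counts = Counter(labels)
    total = sum(counts.values())
    # most_common() yields labels from most to least frequent;
    # percentages are rounded to two decimals, as in the tables.
    return [
        (label, count, round(100 * count / total, 2))
        for label, count in counts.most_common()
    ]
```

Because the class distribution is heavily skewed towards a few labels, a check like this is useful before choosing an evaluation metric or a sampling strategy.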
We crawled 219,586 articles from the Catalan News Agency (www.acn.cat) newswire archive, the latest dated October 11, 2020. We used the "subsection" category as the classification label, after excluding territorial labels (see the territorial_labels.txt file) and labels with fewer than 2000 occurrences. With these criteria, we compiled a total of 153,265 articles for this text classification dataset.
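The selection criteria described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' actual pipeline: the function name `select_records`, the `subsection` record key, and the choice to count occurrences over the full crawl are all hypothetical.

```python
from collections import Counter


def select_records(records, territorial_labels, min_count=2000):
    """Keep records whose label is non-territorial and frequent enough.

    Assumptions: each record is a dict with a "subsection" key, and
    label frequencies are counted over the full crawl before filtering.
    """
    counts = Counter(record["subsection"] for record in records)
    kept_labels = {
        label
        for label, count in counts.items()
        if label not in territorial_labels and count >= min_count
    }
    return [record for record in records if record["subsection"] in kept_labels]
```

The set of territorial labels would be read from territorial_labels.txt; its exact format is not described here, so the sketch takes it as an already-parsed set.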
The source data are articles crawled from the ACN (Catalan News Agency) site: www.acn.cat
Who are the source language producers?
The Catalan News Agency (CNA; in Catalan, Agència Catalana de Notícies, ACN) is a news agency owned by the Catalan government via the public corporation Intracatalònia, SA. It is one of the first digital news agencies created in Europe and has been operating since 1999 (source: https://en.wikipedia.org/wiki/Catalan_News_Agency).
Who are the annotators?
Editorial staff classified the articles under the different thematic sections, and we extracted these labels from the metadata.
Casimiro Pio Carrino, Carlos Rodríguez and Carme Armentano, from BSC-CNS
No personal or sensitive information is included.
Carlos Rodríguez-Penagos or Carme Armentano-Oller ( bsc-temu@bsc.es )
This work is licensed under an Attribution-NonCommercial-NoDerivatives 4.0 International License.