Dataset:
projecte-aina/tecla
TeCla (Text Classification) is a Catalan news corpus for thematic multi-class text classification tasks. The present version (2.0) contains 113,376 articles classified under a hierarchical class structure consisting of a coarse-grained and a fine-grained class. Each of the 4 coarse-grained classes accepts a subset of the fine-grained ones, of which there are 53 in total.
The previous version (1.0.1) can still be found at https://zenodo.org/record/4761505
This dataset was developed by BSC TeMU as part of Projecte AINA, to enrich the Catalan Language Understanding Benchmark (CLUB).
Text classification, Language Model
The dataset is in Catalan (ca-CA).
Three JSON files, one for each split.
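Each example contains the following 3 fields: 'sentence' (the article text), 'label1' (the coarse-grained class) and 'label2' (the fine-grained class):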
{
  "version": "2.0",
  "data": [
    {
      "sentence": "La setena edició del Festival Fantàstik inclourà les cintes 'Matar a dios' i 'Mandy' i un homenatge a 'Mi vecino Totoro'. Es projectaran 22 curtmetratges seleccionats d'entre més de 500 presentats a nivell internacional. El Centre Cultural de Granollers acull del 8 a l'11 de novembre la setena edició del Festival Fantàstik. El certamen, que s'allargarà un dia, arrencarà amb la projecció de la cinta de Caye Casas i Albert Pide 'Matar a Dios'. Els dos directors estaran presents en la inauguració de la cita. A més, els asssitents podran gaudir de 'Mandy', el darrer treball de Nicolas Cage. Altres llargmetratges seleccionats per aquest any són 'Aterrados' (2017), 'Revenge' (2017), 'A Mata Negra' (2018), 'Top Knot Detective' (2018) i 'La Gran Desfeta' (2018). A més, amb motiu del trentè aniversari de la pel·lícula 'El meu veí Totoro' es durà a terme l'exposició dedicada a aquest film '30 anys 30 artistes' comissariada per Jordi Pastor i Reinaldo Pereira. La mostra '30 anys 30 artistes' recull els treballs de trenta artistes d'estils diferents al voltant de la figura de Totoro i el seu director. Es podrà veure durant els dies de festival i es complementarà amb la projecció de la pel·lícula el diumenge 11 de novembre. Al llarg del festival també es projectaran els 22 curtmetratges prèviament seleccionats d'entre més de 500 presentats a nivell internacional. El millor tindrà una dotació de 1000 euros fruit de la unió de forces amb el Mercat Audiovisual de Catalunya.",
      "label1": "Cultura",
      "label2": "Cinema"
    },
    ...
  ]
}
Labels
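As a minimal sketch of how the two label levels relate, the snippet below parses a small in-memory document shaped like the example above and groups the fine-grained labels under their coarse-grained parent. The sample records and the helper name `label_hierarchy` are illustrative assumptions, not part of the dataset; in practice the JSON text would be read from one of the three split files.

```python
import json
from collections import defaultdict

# Tiny stand-in for one split file (assumption: the real files hold full articles).
raw = json.dumps({
    "version": "2.0",
    "data": [
        {"sentence": "Text A.", "label1": "Cultura", "label2": "Cinema"},
        {"sentence": "Text B.", "label1": "Política", "label2": "Unió Europea"},
        {"sentence": "Text C.", "label1": "Cultura", "label2": "Cinema"},
    ],
})


def label_hierarchy(json_text):
    """Map each coarse-grained label to the sorted fine-grained labels seen under it."""
    examples = json.loads(json_text)["data"]
    tree = defaultdict(set)
    for ex in examples:
        tree[ex["label1"]].add(ex["label2"])
    return {coarse: sorted(fine) for coarse, fine in tree.items()}


hierarchy = label_hierarchy(raw)
print(hierarchy)
```

Run over a full split, the same function should list every fine-grained class under exactly one coarse-grained class if the hierarchy is strict.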
Train, development and test splits were created in a stratified fashion, following a 0.8, 0.05 and 0.15 proportion, respectively. The sizes of each split are the following:
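A stratified split with these proportions could be sketched as follows. This is not the authors' script: the function name `stratified_split`, the choice of stratifying on the fine-grained label, the rounding rule and the fixed seed are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(examples, key, fractions=(0.8, 0.05, 0.15), seed=0):
    """Split examples into train/dev/test, keeping the proportions within each class."""
    by_class = defaultdict(list)
    for ex in examples:
        by_class[ex[key]].append(ex)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, dev, test = [], [], []
    for label in sorted(by_class):
        items = by_class[label]
        rng.shuffle(items)
        n_train = round(len(items) * fractions[0])
        n_dev = round(len(items) * fractions[1])
        train.extend(items[:n_train])
        dev.extend(items[n_train:n_train + n_dev])
        test.extend(items[n_train + n_dev:])  # remainder goes to test
    return train, dev, test


# Toy data: 100 examples in each of two fine-grained classes.
toy = [{"label2": lab, "id": i} for lab in ("Cinema", "Unió Europea") for i in range(100)]
train, dev, test = stratified_split(toy, key="label2")
print(len(train), len(dev), len(test))
```

Because the split is done per class, small classes keep their representation in the development and test sets instead of being lost to random sampling.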
We created this dataset to contribute to the development of language models in Catalan, a low-resource language.
The source data are crawled articles from the Catalan News Agency (Agència Catalana de Notícies, ACN) site.
We crawled 219,586 articles from the Catalan News Agency (Agència Catalana de Notícies; ACN) newswire archive, the latest dated October 11, 2020.
From the crawled data, we selected those articles whose 'section' and 'subsection' categories followed the expected codification combinations included in the ACN's style guide and whose 'section' met two requirements: it contained subsections and it was thematically defined (in contrast to geographically defined categories such as 'Món' and 'Unió Europea'). The articles originally belonging to the 'Unió Europea' section, which were related to political bodies of the European Union, were included in the 'Política' coarse-grained category (within a fine-grained category named 'Unió Europea') because of the close similarity between some of the original subsections of 'Política' and those of 'Unió Europea', both defined by the specific political body dealt with in the article.
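The selection rules can be read as a small filtering function. The sketch below is only one possible reading: the contents of `VALID_PAIRS` and the placeholder subsection name are hypothetical (the real combinations come from the ACN style guide), and checking the pair's validity before the 'Unió Europea' remapping is an assumption about the order of the steps.

```python
# Hypothetical excerpt of valid (section, subsection) pairs; the real
# combinations come from the ACN style guide and are not listed here.
VALID_PAIRS = {
    ("Cultura", "Cinema"),
    ("Unió Europea", "subsecció-exemple"),  # placeholder subsection name
    ("Món", "subsecció-exemple"),           # placeholder subsection name
}

# Sections defined by geography rather than theme are discarded.
GEOGRAPHIC_SECTIONS = {"Món"}


def assign_labels(section, subsection):
    """Return (coarse, fine) labels for an article, or None if it is filtered out."""
    if (section, subsection) not in VALID_PAIRS:
        return None
    if section == "Unió Europea":
        # Folded into 'Política' under a single fine-grained class.
        return ("Política", "Unió Europea")
    if section in GEOGRAPHIC_SECTIONS:
        return None
    return (section, subsection)


print(assign_labels("Cultura", "Cinema"))
print(assign_labels("Unió Europea", "subsecció-exemple"))
print(assign_labels("Món", "subsecció-exemple"))
```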
The text field in each example is a concatenation of the original title, subtitle and body of the article (before the concatenation, a final period was appended to both title and subtitle whenever they lacked one). The preprocessing of the texts was minimal and consisted of removing the pattern "ACN {location}.-" that preceded the body in each text, as well as the newlines originally used to divide the text into paragraphs.
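The preprocessing described above might look roughly like the following. The regular expression is a guess at the exact shape of the "ACN {location}.-" dateline, replacing newlines with single spaces is an assumption (the source only says they were removed), and the names `build_text` and `ensure_final_period` are hypothetical.

```python
import re

# Assumed shape of the dateline: "ACN <location>.-" at the start of the body.
DATELINE = re.compile(r"^\s*ACN\s+[^\n]*?\.-\s*")


def ensure_final_period(text):
    """Append a period unless the text is empty or already ends with one."""
    text = text.strip()
    if not text or text.endswith("."):
        return text
    return text + "."


def build_text(title, subtitle, body):
    """Concatenate title, subtitle and body into a single-line text field."""
    body = DATELINE.sub("", body, count=1)
    body = " ".join(body.split())  # turns newlines and whitespace runs into single spaces
    parts = [ensure_final_period(title), ensure_final_period(subtitle), body]
    return " ".join(p for p in parts if p)


example = build_text(
    "Títol de prova",
    "Subtítol de prova.",
    "ACN Granollers.-Primer paràgraf.\nSegon paràgraf.",
)
print(example)
```

The invented article above is only there to exercise the three steps: period insertion, dateline removal and paragraph joining.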
Who are the source language producers?
The Catalan News Agency (Agència Catalana de Notícies; ACN) is a news agency owned by the Catalan government via the public corporation Intracatalònia, SA. It is one of the first digital news agencies created in Europe and has been operating since 1999 (source: Wikipedia).
The crawled data contained the categories' annotations, which were then used to create this dataset with the mentioned criteria.
Who are the annotators?
Editorial staff classified the articles under the different thematic sections and subsections, and we extracted these from the metadata.
No personal or sensitive information is included.
We hope this dataset contributes to the development of language models in Catalan, a low-resource language.
[N/A]
[N/A]
Irene Baucells (irene.baucells@bsc.es), Casimiro Pio Carrino (casimiro.carrino@bsc.es), Carlos Rodríguez (carlos.rodriguez1@bsc.es) and Carme Armentano (carme.armentano@bsc.es), from BSC-CNS.
This work was funded by the Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya within the framework of Projecte AINA.
This work is licensed under an Attribution-NonCommercial-NoDerivatives 4.0 International License.