Dataset:
projecte-aina/ancora-ca-ner
This is a dataset for Named Entity Recognition (NER) in Catalan. It adapts the AnCora corpus for machine learning and language model evaluation.
The AnCora corpus is used under a CC-BY licence.
This dataset was developed by BSC TeMU as part of Projecte AINA to enrich the Catalan Language Understanding Benchmark (CLUB).
Named Entity Recognition, Language Model
The dataset is in Catalan (ca-CA).
Three two-column files, one for each split.
Fundació B-ORG
Privada I-ORG
Fira I-ORG
de I-ORG
Manresa I-ORG
ha O
fet O
un O
balanç O
de O
l' O
activitat O
del O
Palau B-LOC
Firal I-LOC
Every file has two columns, with the word form or punctuation symbol in the first one and the corresponding IOB tag in the second one.
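As an illustration of the two-column layout described above, here is a minimal Python sketch that reads token/tag lines and collapses the IOB tags into entity spans. It is not part of the dataset's tooling: the function names `read_sentences` and `extract_entities` are hypothetical, and the sketch assumes that columns are separated by whitespace and that sentences are separated by blank lines, neither of which is stated in this card.

```python
from typing import Iterable, List, Optional, Tuple

Token = Tuple[str, str]  # (word form, IOB tag)


def read_sentences(lines: Iterable[str]) -> List[List[Token]]:
    """Group two-column lines into sentences.

    Assumption: a blank line marks a sentence boundary.
    """
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last run of whitespace so the tag is the final field.
        word, tag = line.rsplit(None, 1)
        current.append((word, tag))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence: List[Token]) -> List[Tuple[str, str]]:
    """Collapse B-/I- tagged tokens into (entity text, entity type) pairs."""
    entities: List[Tuple[str, str]] = []
    words: List[str] = []
    label: Optional[str] = None
    for word, tag in sentence:
        # A B- tag, or an I- tag whose type differs from the open span,
        # starts a new entity.
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if tag == "O" or starts_new:
            if words:
                entities.append((" ".join(words), label))
            words, label = [], None
        if tag != "O":
            if not words:
                label = tag[2:]
            words.append(word)
    if words:
        entities.append((" ".join(words), label))
    return entities


# The example instance from this card, one token per line.
SAMPLE = """Fundació B-ORG
Privada I-ORG
Fira I-ORG
de I-ORG
Manresa I-ORG
ha O
fet O
un O
balanç O
de O
l' O
activitat O
del O
Palau B-LOC
Firal I-LOC
"""

sentences = read_sentences(SAMPLE.splitlines())
entities = extract_entities(sentences[0])
print(entities)
```

On the example instance this yields one ORG span ("Fundació Privada Fira de Manresa") and one LOC span ("Palau Firal"). A line with fewer than two fields would raise a `ValueError` in this sketch, so real files may need extra handling.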
We took the original train, dev and test splits from the UD version of the corpus.
We created this corpus to contribute to the development of language models in Catalan, a low-resource language.
AnCora consists of a Catalan corpus (AnCora-CA) and a Spanish corpus (AnCora-ES), each containing 500,000 tokens (some multi-word). The corpora are annotated for linguistic phenomena at different levels. The AnCora corpus is mainly based on newswire texts. For more information, refer to Taulé, M., M.A. Martí, M. Recasens (2009): "AnCora: Multilevel Annotated Corpora for Catalan and Spanish", Proceedings of 6th International Conference on Language Resources and Evaluation.
Who are the source language producers? The Catalan AnCora corpus is compiled from articles from the following news outlets: EFE, ACN, El Periodico.
We adapted the NER labels from the AnCora corpus to a token-per-line, multi-column format.
Who are the annotators? The original annotators of the AnCora corpus.
No personal or sensitive information is included.
We hope this corpus contributes to the development of language models in Catalan, a low-resource language.
[N/A]
[N/A]
Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)
This work was funded by the Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya within the framework of Projecte AINA.
This work is licensed under an Attribution 4.0 International License.
@inproceedings{armengol-estape-etal-2021-multilingual,
    title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
    author = "Armengol-Estap{\'e}, Jordi and
      Carrino, Casimiro Pio and
      Rodriguez-Penagos, Carlos and
      de Gibert Bonet, Ona and
      Armentano-Oller, Carme and
      Gonzalez-Agirre, Aitor and
      Melero, Maite and
      Villegas, Marta",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.437",
    doi = "10.18653/v1/2021.findings-acl.437",
    pages = "4933--4946",
}
[N/A]