数据集:
BSC-LT/ancora-ca-ner
语言:
caIf you use any of these resources (datasets or models) in your work, please cite our latest paper:
@inproceedings{armengol-estape-etal-2021-multilingual, title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan", author = "Armengol-Estap{\'e}, Jordi and Carrino, Casimiro Pio and Rodriguez-Penagos, Carlos and de Gibert Bonet, Ona and Armentano-Oller, Carme and Gonzalez-Agirre, Aitor and Melero, Maite and Villegas, Marta", booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021", month = aug, year = "2021", address = "Online", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2021.findings-acl.437", doi = "10.18653/v1/2021.findings-acl.437", pages = "4933--4946", }
https://doi.org/10.5281/zenodo.4529299
This is a dataset for Named Entity Recognition (NER) from Ancora corpus adapted for Machine Learning and Language Model evaluation purposes.
Since multiwords (including Named Entities) in the original Ancora corpus are aggregated as a single lexical item using underscores (e.g. "Ajuntament_de_Barcelona") we splitted them to align with word-per-line format, and added conventional Begin-Inside-Outside (IOB) tags to mark and classify Named Entities. We did not filter out the different categories of NEs from Ancora (weak and strong). We did 6 minor edits by hand.
AnCora corpus is used under [CC-by] ( https://creativecommons.org/licenses/by/4.0/ ) licence.
This dataset was developed by BSC TeMU as part of the AINA project, and to enrich the Catalan Language Understanding Benchmark (CLUB).
Named Entities Recognition, Language Model
CA- Catalan
three two-column files, one for each split.
Every file has two columns, with the word form or punctuation symbol in the first one and the corresponding IOB tag in the second one.
Fundació B-ORG Privada I-ORG Fira I-ORG de I-ORG Manresa I-ORG ha O fet O un O balanç O de O l' O activitat O del O Palau B-LOC Firal I-LOC
One for each sub-dataset for train, evaluation and test.
We adapted the NER labels from Ancora corpus to a word-per-line format. Since multiwords in the original Ancora corpus are aggregated as a single lexical item using underscores (e.g. "Ajuntament_de_Barcelona") we splitted them to align with this format, and added conventional Begin-Inside-Outside (IOB) tags to mark and classify Named Entities. We did not filter out the different categories of NEs from Ancora (weak and strong). We did 6 minor edits by hand.
AnCora consists of a Catalan corpus (AnCora-CA) and a Spanish corpus (AnCora-ES), each of them of 500,000 tokens (some multi-word). The corpora are annotated for linguistic phenomena at different levels. AnCora corpus is mainly based on newswire texts. For more information, refer to Taulé, M., M.A. Martí, M. Recasens (2009). “AnCora: Multilevel Annotated Corpora for Catalan and Spanish”, Proceedings of 6th International Conference on language Resources and Evaluation. http://www.lrec-conf.org/proceedings/lrec2008/pdf/35_paper.pdf
Who are the source language producers?Catalan Ancora corpus is compiled from articles from the following news outlets: EFE , ACN , El Periodico .
We adapted the NER labels from Ancora corpus to a token-per-line, multi-column format.
Who are the annotators?Original annotators from Ancora corpus.
Carlos Rodríguez and Carme Armentano, from BSC-CNS, did the conversion and curation.
No personal or sensitive information included.
[More Information Needed]
[More Information Needed]
[More Information Needed]
Carlos Rodríguez-Penagos or Carme Armentano-Oller ( bsc-temu@bsc.es )
This work is licensed under a Attribution 4.0 International License .