Dataset:
PlanTL-GOB-ES/CoNLL-NERC-es
CoNLL-NERC is the Spanish dataset of the CoNLL-2002 Shared Task (Tjong Kim Sang, 2002). The dataset is annotated with four types of named entities (persons, locations, organizations, and other miscellaneous entities) in the standard Beginning-Inside-Outside (BIO) format. The corpus consists of 8,324 training sentences with 19,400 named entities, 1,916 development sentences with 4,568 named entities, and 1,518 test sentences with 3,644 named entities.
We use this corpus as part of the EvalEs Spanish language benchmark.
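The BIO scheme mentioned in the description can be decoded into entity spans with a few lines of Python. The sketch below is illustrative only: the function name `bio_to_spans` is hypothetical, and treating a stray `I-` tag (one with no preceding `B-`) as the start of a new entity is an assumption, not something the corpus documentation specifies.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert BIO tags to (entity_type, start, end) spans; end is exclusive."""
    spans = []
    current = None  # entity type of the span being built, if any
    start = 0
    for i, tag in enumerate(tags):
        # The open span continues only on an I- tag of the same type.
        if tag.startswith("I-") and current == tag[2:]:
            continue
        # Any other tag closes the open span.
        if current is not None:
            spans.append((current, start, i))
            current = None
        # B- opens a new span; a stray I- is leniently treated the same way
        # (assumption made for this sketch).
        if tag.startswith("B-") or tag.startswith("I-"):
            current = tag[2:]
            start = i
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# First tokens of the sample sentence from this card.
tokens = ["El", "Abogado", "General", "del", "Estado", ",", "Daryl", "Williams", ","]
tags = ["O", "B-PER", "I-PER", "I-PER", "I-PER", "O", "B-PER", "I-PER", "O"]

spans = bio_to_spans(tags)
entities = [(label, " ".join(tokens[s:e])) for label, s, e in spans]
print(entities)
```

Counting entities per split with a function like this is one way to sanity-check the figures quoted above against a local copy of the files.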
Named Entity Recognition and Classification
The dataset is in Spanish (es-ES).
El DA O
Abogado NC B-PER
General AQ I-PER
del SP I-PER
Estado NC I-PER
, Fc O
Daryl VMI B-PER
Williams NC I-PER
, Fc O
subrayó VMI O
hoy RG O
la DA O
necesidad NC O
de SP O
tomar VMN O
medidas NC O
para SP O
proteger VMN O
al SP O
sistema NC O
judicial AQ O
australiano AQ O
frente RG O
a SP O
una DI O
página NC O
de SP O
internet NC O
que PR O
imposibilita VMI O
el DA O
cumplimiento NC O
de SP O
los DA O
principios NC O
básicos AQ O
de SP O
la DA O
Ley NC B-MISC
. Fp O
Every file has two columns, with the word form or punctuation symbol in the first one and the corresponding IOB tag in the second one. The different files are separated by an empty line.
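A minimal sketch of reading this column layout follows. It assumes whitespace-separated fields and that an empty line closes a block of tokens. Because the sample sentence carries three fields per token while the description mentions two, the sketch keeps the first field as the word and the last as the tag and ignores anything in between. The name `read_conll` and the shortened sample are hypothetical, chosen for illustration.

```python
from typing import Iterable, List, Tuple

Block = List[Tuple[str, str]]  # (word, tag) pairs


def read_conll(lines: Iterable[str]) -> List[Block]:
    """Group column-formatted lines into blocks separated by empty lines."""
    blocks: List[Block] = []
    current: Block = []
    for line in lines:
        line = line.strip()
        if not line:
            # An empty line ends the current block, if one is open.
            if current:
                blocks.append(current)
                current = []
            continue
        columns = line.split()
        # First column: word form or punctuation; last column: IOB tag.
        # Any middle column is skipped (assumption made for this sketch).
        current.append((columns[0], columns[-1]))
    if current:
        blocks.append(current)
    return blocks


# Shortened fragments of the sample sentence, split into two blocks
# purely to exercise the empty-line handling.
sample = """El DA O
Abogado NC B-PER
General AQ I-PER

Ley NC B-MISC
. Fp O
"""

parsed = read_conll(sample.splitlines())
print(parsed)
```

To read a local file, the same function can be given an open file handle, for example `read_conll(open(path, encoding="utf-8"))`. Here `path` is a placeholder, and the encoding should be checked against the actual files.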
The data is a collection of news wire articles made available by the Spanish EFE News Agency. The articles are from May 2000.
Initial Data Collection and Normalization
For more information, see the CoNLL-2002 Shared Task paper (Tjong Kim Sang, 2002).
Who are the source language producers?
For more information, see the CoNLL-2002 Shared Task paper (Tjong Kim Sang, 2002).
Who are the annotators?
The annotation was carried out by the TALP Research Center of the Technical University of Catalonia (UPC) and the Center of Language and Computation (CLiC) of the University of Barcelona (UB), and funded by the European Commission through the NAMIC project (IST-1999-12392).
This dataset contributes to the development of language models in Spanish.
The following paper must be cited when using this corpus:
Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 Shared Task: Language-Independent Named Entity Recognition. In COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002).