Dataset:

PlanTL-GOB-ES/UD_Spanish-AnCora

Task:

Token classification

Sub-task:

part-of-speech

Language:

Multilinguality:

monolingual

Language creators:

found

Annotation creators:

expert-generated

License:

cc-by-4.0

UD_Spanish-AnCora

Dataset Summary

This dataset is composed of the annotations from the AnCora corpus, projected on the Universal Dependencies treebank. We use the POS annotations of this corpus as part of the EvalEs Spanish language benchmark.

Supported Tasks and Leaderboards

POS tagging

Languages

The dataset is in Spanish (es-ES).

Dataset Structure

Data Instances

The dataset consists of three CoNLL-U (.conllu) files.

Annotations are encoded in plain text files (UTF-8, normalized to NFC, using only the LF character as line break, including an LF character at the end of file) with three types of lines:

Word lines containing the annotation of a word/token in 10 fields separated by single tab characters (see below).

Blank lines marking sentence boundaries.

Comment lines starting with hash (#).
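The format rules above can be checked mechanically. The following is a minimal sketch, not part of the dataset tooling: the function names `check_conllu_text` and `classify_line` are hypothetical, and the sample sentence is made up for illustration rather than copied from the corpus.

```python
import unicodedata
from typing import List


def check_conllu_text(text: str) -> List[str]:
    """Return a list of format problems found in the decoded file contents."""
    problems = []
    if not unicodedata.is_normalized("NFC", text):
        problems.append("text is not NFC-normalized")
    if "\r" in text:
        problems.append("found CR characters; only LF line breaks are allowed")
    if not text.endswith("\n"):
        problems.append("file does not end with an LF character")
    return problems


def classify_line(line: str) -> str:
    """Classify one line (without its trailing LF) as 'blank', 'comment' or 'word'."""
    if line == "":
        return "blank"
    if line.startswith("#"):
        return "comment"
    return "word"


# Illustrative two-token sentence (hypothetical, not taken from AnCora).
sample = (
    "# text = Hola .\n"
    "1\tHola\thola\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "2\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
    "\n"
)
kinds = [classify_line(line) for line in sample.split("\n")[:-1]]
```

Here `kinds` holds one label per line of the sample, and `check_conllu_text(sample)` returns an empty list when the text satisfies all three encoding rules.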

Data Fields

Word lines contain the following fields:

ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0).

FORM: Word form or punctuation symbol.

LEMMA: Lemma or stem of word form.

UPOS: Universal part-of-speech tag.

XPOS: Language-specific part-of-speech tag; underscore if not available.

FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.

HEAD: Head of the current word, which is either a value of ID or zero (0).

DEPREL: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.

DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.

MISC: Any other annotation.

From: https://universaldependencies.org
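As a rough illustration of the field layout, a word line can be split on tab characters and paired with the ten field names. This is a sketch under stated assumptions: `FIELDS`, `parse_word_line` and `id_kind` are hypothetical names introduced here, values are kept as strings, and the example line is invented rather than taken from the corpus.

```python
from typing import Dict

# The ten CoNLL-U fields, in the order listed above.
FIELDS = ("ID", "FORM", "LEMMA", "UPOS", "XPOS",
          "FEATS", "HEAD", "DEPREL", "DEPS", "MISC")


def parse_word_line(line: str) -> Dict[str, str]:
    """Split one word line into a field-name -> value mapping."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(
            f"expected {len(FIELDS)} tab-separated fields, got {len(values)}"
        )
    return dict(zip(FIELDS, values))


def id_kind(word_id: str) -> str:
    """Distinguish the three ID shapes described for the ID field."""
    if "-" in word_id:
        return "range"   # multiword token, e.g. "3-4"
    if "." in word_id:
        return "empty"   # empty node, e.g. "5.1" or "0.1"
    return "word"        # plain integer index


# Invented example line for illustration only.
row = parse_word_line("1\tHola\thola\tINTJ\t_\t_\t0\troot\t_\t_")
```

In this sketch `row["UPOS"]` gives the universal tag and `row["HEAD"]` the head index as a string; an underscore means the value is not available.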

Data Splits

es_ancora-ud-train.conllu
es_ancora-ud-dev.conllu
es_ancora-ud-test.conllu
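For POS tagging, each split file can be reduced to sentences of (FORM, UPOS) pairs. The sketch below is one possible reader, not the loader used by the dataset authors: `read_pos_sentences` is a hypothetical helper, it assumes the split file has already been downloaded locally, and it skips multiword-token ranges and empty nodes, which is a design choice made here so that only syntactic words are kept.

```python
from typing import Iterable, Iterator, List, Tuple


def read_pos_sentences(lines: Iterable[str]) -> Iterator[List[Tuple[str, str]]]:
    """Yield each sentence as a list of (FORM, UPOS) pairs."""
    sentence: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # comment line
        fields = line.split("\t")
        word_id, form, upos = fields[0], fields[1], fields[3]
        if "-" in word_id or "." in word_id:
            continue  # multiword-token range or empty node
        sentence.append((form, upos))
    if sentence:
        yield sentence  # file without a final blank line


# Usage with a locally downloaded split (path is an assumption):
# with open("es_ancora-ud-train.conllu", encoding="utf-8") as handle:
#     train = list(read_pos_sentences(handle))
```

Passing an open file handle works because the function accepts any iterable of lines; the same call applies to the dev and test files.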

Dataset Creation

Curation Rationale

[N/A]

Source Data

UD_Spanish-AnCora

Initial Data Collection and Normalization

The original annotation was done in a constituency framework as a part of the AnCora project at the University of Barcelona. It was converted to dependencies by the Universal Dependencies team and used in the CoNLL 2009 shared task. The CoNLL 2009 version was later converted to HamleDT and to Universal Dependencies.

For more information on the AnCora project, visit the AnCora site.

To learn about Universal Dependencies, visit https://universaldependencies.org

Who are the source language producers?

For more information on the AnCora corpus and its sources, visit the AnCora site.

Annotations

Annotation process

For more information on the first AnCora annotation, visit the AnCora site.

Who are the annotators?

For more information on the AnCora annotation team, visit the AnCora site.

Personal and Sensitive Information

No personal or sensitive information is included.

Considerations for Using the Data

Social Impact of Dataset

This dataset contributes to the development of language models in Spanish.

Discussion of Biases

[N/A]

Other Known Limitations

[N/A]

Additional Information

Dataset Curators

[N/A]

Licensing Information

This work is licensed under a CC Attribution 4.0 International License.

Citation Information

The following paper must be cited when using this corpus:

Taulé, M., M.A. Martí, M. Recasens (2008) 'Ancora: Multilevel Annotated Corpora for Catalan and Spanish', Proceedings of 6th International Conference on Language Resources and Evaluation. Marrakesh (Morocco).

To cite the Universal Dependencies project:

Rueter, J. (Creator), Erina, O. (Contributor), Klementeva, J. (Contributor), Ryabov, I. (Contributor), Tyers, F. M. (Contributor), Zeman, D. (Contributor), Nivre, J. (Creator) (15 Nov 2020). Universal Dependencies version 2.7 Erzya JR. Universal Dependencies Consortium.

Contributions

[N/A]

Author:

PlanTL-GOB-ES

Dataset size:

50.25 MB