Datasets:

PlanTL-GOB-ES/UD_Spanish-AnCora

Sub-tasks:

part-of-speech

Languages:

es

Multilinguality:

monolingual

Language creators:

found

Annotations creators:

expert-generated

License:

cc-by-4.0

UD_Spanish-AnCora

Dataset Summary

This dataset is composed of the annotations from the AnCora corpus, projected onto the Universal Dependencies treebank. We use the POS annotations of this corpus as part of the EvalEs Spanish language benchmark.

Supported Tasks and Leaderboards

POS tagging

Languages

The dataset is in Spanish (es-ES).

Dataset Structure

Data Instances

Three CoNLL-U (.conllu) files.

Annotations are encoded in plain text files (UTF-8, normalized to NFC, using only the LF character as line break, including an LF character at the end of file) with three types of lines:

  • Word lines containing the annotation of a word/token in 10 fields separated by single tab characters (see below).
  • Blank lines marking sentence boundaries.
  • Comment lines starting with hash (#).
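
The file conventions and the three line types described above can be checked mechanically. The following is a minimal sketch, not the official validator: the function names `check_text_conventions`, `line_kind`, and `iter_sentences` are hypothetical helpers introduced here for illustration.

```python
import unicodedata


def check_text_conventions(text):
    """Return a list of violated file conventions (empty if none)."""
    problems = []
    if "\r" in text:
        problems.append("contains CR characters; only LF is allowed")
    if not text.endswith("\n"):
        problems.append("missing LF at end of file")
    if unicodedata.normalize("NFC", text) != text:
        problems.append("not normalized to NFC")
    return problems


def line_kind(line):
    """Classify one line (without its trailing LF) as blank, comment, or word."""
    if line == "":
        return "blank"
    if line.startswith("#"):
        return "comment"
    return "word"


def iter_sentences(text):
    """Yield (comment_lines, word_lines) for each sentence in the text."""
    comments, words = [], []
    for line in text.split("\n"):
        kind = line_kind(line)
        if kind == "blank":
            # A blank line closes the current sentence, if there is one.
            if comments or words:
                yield comments, words
            comments, words = [], []
        elif kind == "comment":
            comments.append(line)
        else:
            words.append(line)
    # Tolerate a final sentence that is not followed by a blank line.
    if comments or words:
        yield comments, words
```

Calling `list(iter_sentences(text))` on the decoded contents of one of the files would give one `(comment_lines, word_lines)` pair per sentence, with the word lines still unsplit.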
Data Fields

Word lines contain the following fields:

  • ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0).
  • FORM: Word form or punctuation symbol.
  • LEMMA: Lemma or stem of word form.
  • UPOS: Universal part-of-speech tag.
  • XPOS: Language-specific part-of-speech tag; underscore if not available.
  • FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
  • HEAD: Head of the current word, which is either a value of ID or zero (0).
  • DEPREL: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
  • DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
  • MISC: Any other annotation.
Field descriptions are taken from https://universaldependencies.org
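
As an illustration of the field layout above, here is a minimal sketch that splits a word line into its 10 named fields and interprets a few of them. The helper names are hypothetical, and the `Name=Value` pairs joined by `|` in FEATS are an assumption taken from the CoNLL-U format documentation rather than from this card.

```python
FIELD_NAMES = ("ID", "FORM", "LEMMA", "UPOS", "XPOS",
               "FEATS", "HEAD", "DEPREL", "DEPS", "MISC")


def parse_word_line(line):
    """Split one word line into a dict of its 10 tab-separated fields."""
    values = line.split("\t")
    if len(values) != len(FIELD_NAMES):
        raise ValueError(f"expected 10 fields, got {len(values)}")
    return dict(zip(FIELD_NAMES, values))


def id_kind(token_id):
    """Classify an ID: 'range' (multiword token), 'empty' (decimal), or 'word'."""
    if "-" in token_id:
        return "range"
    if "." in token_id:
        return "empty"
    return "word"


def parse_feats(feats):
    """Turn a FEATS value into a dict; underscore means no features."""
    if feats == "_":
        return {}
    # Assumed layout: Name=Value pairs separated by '|'.
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def root_is_consistent(row):
    """Check the rule 'DEPREL is root iff HEAD is 0' for one parsed row."""
    return (row["HEAD"] == "0") == (row["DEPREL"] == "root")
```

Keeping every field as a string avoids losing information: ID can be a range or a decimal, and HEAD can be an underscore on multiword-token lines, so converting them to integers eagerly would fail on valid input.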

Data Splits

  • es_ancora-ud-train.conllu
  • es_ancora-ud-dev.conllu
  • es_ancora-ud-test.conllu
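
Since the benchmark uses the POS annotations, one plausible way to consume these files is to read each split into sentences of (FORM, UPOS) pairs. The sketch below assumes the three files sit together in a local directory; `SPLIT_FILES`, `read_pos_sentences`, and `load_splits` are hypothetical names, and skipping multiword-token ranges and empty nodes is a design choice made here, not something this card prescribes.

```python
from pathlib import Path

# File names as listed under Data Splits.
SPLIT_FILES = {
    "train": "es_ancora-ud-train.conllu",
    "dev": "es_ancora-ud-dev.conllu",
    "test": "es_ancora-ud-test.conllu",
}


def read_pos_sentences(path):
    """Return a list of sentences, each a list of (FORM, UPOS) pairs."""
    sentences, current = [], []
    with open(path, encoding="utf-8", newline="\n") as handle:
        for raw in handle:
            line = raw.rstrip("\n")
            if not line:
                # Blank line: sentence boundary.
                if current:
                    sentences.append(current)
                current = []
            elif line.startswith("#"):
                continue  # comment line
            else:
                fields = line.split("\t")
                token_id = fields[0]
                if "-" in token_id or "." in token_id:
                    continue  # skip multiword ranges and empty nodes
                current.append((fields[1], fields[3]))
    if current:
        sentences.append(current)
    return sentences


def load_splits(data_dir):
    """Read every split file that exists under data_dir."""
    data_dir = Path(data_dir)
    return {
        name: read_pos_sentences(data_dir / filename)
        for name, filename in SPLIT_FILES.items()
        if (data_dir / filename).exists()
    }
```

With this layout, `load_splits("path/to/files")["train"]` would be a list of tagged sentences ready to feed a POS tagger; the directory path is a placeholder.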

Dataset Creation

Curation Rationale

[N/A]

Source Data

UD_Spanish-AnCora

Initial Data Collection and Normalization

The original annotation was done in a constituency framework as part of the AnCora project at the University of Barcelona. It was converted to dependencies and used in the CoNLL 2009 shared task. The CoNLL 2009 version was later converted to HamleDT and to Universal Dependencies.

For more information on the AnCora project, visit the AnCora site.

To learn about Universal Dependencies, visit https://universaldependencies.org

Who are the source language producers?

For more information on the AnCora corpus and its sources, visit the AnCora site.

Annotations

Annotation process

For more information on the first AnCora annotation, visit the AnCora site.

Who are the annotators?

For more information on the AnCora annotation team, visit the AnCora site.

Personal and Sensitive Information

No personal or sensitive information included.

Considerations for Using the Data

Social Impact of Dataset

This dataset contributes to the development of language models in Spanish.

Discussion of Biases

[N/A]

Other Known Limitations

[N/A]

Additional Information

Dataset Curators

[N/A]

Licensing Information

This work is licensed under a CC Attribution 4.0 International License.

Citation Information

The following paper must be cited when using this corpus:

Taulé, M., M.A. Martí, M. Recasens (2008) 'AnCora: Multilevel Annotated Corpora for Catalan and Spanish', Proceedings of the 6th International Conference on Language Resources and Evaluation. Marrakesh (Morocco).

To cite the Universal Dependencies project:

Rueter, J. (Creator), Erina, O. (Contributor), Klementeva, J. (Contributor), Ryabov, I. (Contributor), Tyers, F. M. (Contributor), Zeman, D. (Contributor), Nivre, J. (Creator) (15 Nov 2020). Universal Dependencies version 2.7 Erzya JR. Universal Dependencies Consortium.

Contributions

[N/A]