Dataset:

DFKI-SLT/multitacred

Task:

Text classification

Subtask:

multi-class-classification

Languages:

Size:

100K<n<1M

Language creators:

found

Annotation creators:

crowdsourced expert-generated

Source dataset:

DFKI-NLP/tacred

Preprint:

arxiv:2305.04582

Other:

relation extraction relation+extraction

License:

other

Dataset Card for "MultiTACRED"

Dataset Summary

MultiTACRED is a multilingual version of the large-scale TAC Relation Extraction Dataset (TACRED). It covers 12 typologically diverse languages from 9 language families, and was created by the Speech & Language Technology group of DFKI by machine-translating the instances of the original TACRED dataset and automatically projecting their entity annotations. For details of the original TACRED data collection and annotation process, see the Stanford paper. Translations are syntactically validated by checking the correctness of the XML tag markup. Any translations with an invalid tag structure, e.g. missing or invalid head or tail tag pairs, are discarded (on average, 2.3% of the instances).

Languages covered are: Arabic, Chinese, Finnish, French, German, Hindi, Hungarian, Japanese, Polish, Russian, Spanish, Turkish. The intended use is supervised relation classification, and the intended audience is researchers.

Please see our ACL paper for full details.

NOTE: This DatasetReader supports a reduced version of the original TACRED JSON format with the following changes:

Removed fields: stanford_pos, stanford_ner, stanford_head, stanford_deprel, docid. The motivation is to support additional languages, for which these fields were not required or available. The reader expects a language-specific configuration that specifies the variant (original, revisited or retacred) and the language (as a two-letter ISO code).

The DatasetReader changes the offsets of the following fields, to conform with standard Python usage (see _generate_examples()):

subj_end to subj_end + 1 (make end offset exclusive)
obj_end to obj_end + 1 (make end offset exclusive)
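
The offset change can be illustrated with a minimal sketch. It assumes that the raw files store inclusive end offsets, as in the original TACRED JSON format. The helper name `to_exclusive_offsets` and the toy sentence are hypothetical and are not part of the dataset reader.

```python
def to_exclusive_offsets(raw):
    """Return a copy of a raw instance whose end offsets are exclusive."""
    converted = dict(raw)  # shallow copy, the raw instance is left untouched
    converted["subj_end"] = raw["subj_end"] + 1
    converted["obj_end"] = raw["obj_end"] + 1
    return converted


# Made-up toy instance with inclusive end offsets (assumption for illustration).
raw_instance = {
    "token": ["ABC", "wurde", "von", "Tom", "Thabane", "gegründet", "."],
    "subj_start": 0, "subj_end": 0,
    "obj_start": 3, "obj_end": 4,
}

instance = to_exclusive_offsets(raw_instance)

# With exclusive ends, plain Python slicing returns the full mention.
subject = instance["token"][instance["subj_start"]:instance["subj_end"]]
obj = instance["token"][instance["obj_start"]:instance["obj_end"]]
print(subject, obj)
```

With exclusive end offsets, the length of a mention is simply `end - start`, which matches the usual Python slicing convention.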

NOTE 2: The MultiTACRED dataset offers an additional 'split', namely the backtranslated test data (translated to a target language and then back to English). To access this split, use dataset['backtranslated_test'].

You can find the TACRED dataset reader for the English version of the dataset at https://huggingface.co/datasets/DFKI-SLT/tacred .

Supported Tasks and Leaderboards

Tasks: Relation Classification
Leaderboards: https://paperswithcode.com/sota/relation-extraction-on-multitacred

Languages

The languages in the dataset are Arabic, German, English, Spanish, Finnish, French, Hindi, Hungarian, Japanese, Polish, Russian, Turkish, and Chinese. All languages except English are machine-translated using either DeepL's or Google's translation APIs.

Dataset Structure

Data Instances

Size of downloaded dataset files: 15.4 KB (TACRED-Revisited), 3.7 MB (Re-TACRED)
Size of the generated dataset: 1.7 GB (all languages, all versions)
Total amount of disk used: 1.7 GB (all languages, all versions)

An example of 'train' looks as follows:

{
  "id": "61b3a5c8c9a882dcfcd2", 
  "token": ["Tom", "Thabane", "trat", "im", "Oktober", "letzten", "Jahres", "zurück", ",", "um", "die", "All", "Basotho", "Convention", "-LRB-", "ABC", "-RRB-", "zu", "gründen", ",", "die", "mit", "17", "Abgeordneten", "das", "Wort", "ergriff", ",", "woraufhin", "der", "konstitutionelle", "Monarch", "König", "Letsie", "III.", "das", "Parlament", "auflöste", "und", "Neuwahlen", "ansetzte", "."], 
  "relation": "org:founded_by", 
  "subj_start": 11, 
  "subj_end": 13, 
  "obj_start": 0, 
  "obj_end": 1, 
  "subj_type": "ORGANIZATION", 
  "obj_type": "PERSON"
}

Data Fields

The data fields are the same among all splits.

id : the instance id of this sentence, a string feature.
token : the list of tokens of this sentence, a list of string features.
relation : the relation label of this instance, a string classification label.
subj_start : the 0-based index of the start token of the relation subject mention, an int feature.
subj_end : the 0-based index of the end token of the relation subject mention, exclusive, an int feature.
subj_type : the NER type of the subject mention, among the types used in the Stanford NER system, a string feature.
obj_start : the 0-based index of the start token of the relation object mention, an int feature.
obj_end : the 0-based index of the end token of the relation object mention, exclusive, an int feature.
obj_type : the NER type of the object mention, among 23 fine-grained types used in the Stanford NER system, a string feature.
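
As a rough sketch of how these fields fit together, the snippet below wraps the subject and object mentions of one instance in typed markers, a common input format for relation classifiers. It assumes exclusive end offsets as described in the field list. The function name `mark_entities`, the marker strings, and the toy instance are hypothetical choices for illustration, not something the dataset prescribes.

```python
def mark_entities(instance):
    """Return the sentence as a string with typed entity markers inserted."""
    marked = []
    for i, tok in enumerate(instance["token"]):
        # Opening markers go before the first token of each mention.
        if i == instance["subj_start"]:
            marked.append(f"[SUBJ-{instance['subj_type']}]")
        if i == instance["obj_start"]:
            marked.append(f"[OBJ-{instance['obj_type']}]")
        marked.append(tok)
        # End offsets are exclusive, so the last mention token is end - 1.
        if i == instance["subj_end"] - 1:
            marked.append("[/SUBJ]")
        if i == instance["obj_end"] - 1:
            marked.append("[/OBJ]")
    return " ".join(marked)


# Made-up toy instance using the documented field names.
example = {
    "token": ["ABC", "wurde", "von", "Tom", "Thabane", "gegründet", "."],
    "relation": "org:founded_by",
    "subj_start": 0, "subj_end": 1, "subj_type": "ORGANIZATION",
    "obj_start": 3, "obj_end": 5, "obj_type": "PERSON",
}

marked_text = mark_entities(example)
print(marked_text)
```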

Data Splits

To minimize dataset bias, TACRED is stratified across years in which the TAC KBP challenge was run. Per-language statistics for the splits differ because not all instances could be translated with the subject and object entity markup still intact; such instances were discarded.

Language	Train	Dev	Test	Backtranslated Test	Translation Engine
en	68,124	22,631	15,509	-	-
ar	67,736	22,502	15,425	15,425	Google
de	67,253	22,343	15,282	15,079	DeepL
es	65,247	21,697	14,908	14,688	DeepL
fi	66,751	22,268	15,083	14,462	DeepL
fr	66,856	22,298	15,237	15,088	DeepL
hi	67,751	22,511	15,440	15,440	Google
hu	67,766	22,519	15,436	15,436	Google
ja	61,571	20,290	13,701	12,913	DeepL
pl	68,124	22,631	15,509	15,509	Google
ru	66,413	21,998	14,995	14,703	DeepL
tr	67,749	22,510	15,429	15,429	Google
zh	65,260	21,538	14,694	14,021	DeepL
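
The share of training instances lost per language can be derived from the table by comparing each count with the English one. The sketch below does this for three languages. The counts are copied from the table above, while the function name `discarded_share` is a hypothetical helper.

```python
# Train split sizes copied from the table above.
EN_TRAIN = 68124
train_counts = {"de": 67253, "ja": 61571, "pl": 68124}


def discarded_share(count, reference=EN_TRAIN):
    """Fraction of reference instances missing from a translated split."""
    return 1 - count / reference


shares = {lang: discarded_share(n) for lang, n in train_counts.items()}

for lang, share in shares.items():
    print(f"{lang}: {share:.2%} of English train instances discarded")
```

The same calculation applies to the Dev and Test columns. For the Backtranslated Test column, the translated Test count of the same language is the more natural reference.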

Dataset Creation

Curation Rationale

To enable more research on multilingual Relation Extraction, we generate translations of the TAC relation extraction dataset using DeepL and Google Translate.

Source Data

Initial Data Collection and Normalization

The instances of this dataset are sentences from the original TACRED dataset, which in turn are sampled from the corpus used in the yearly TAC Knowledge Base Population (TAC KBP) challenges.

Who are the source language producers?

Newswire and web texts collected for the TAC Knowledge Base Population (TAC KBP) challenges.

Annotations

Annotation process

See the Stanford paper, the TACRED Revisited paper, and the Re-TACRED paper, plus their appendices, for details on the original annotation process. The translated versions do not change the original labels.

Translations were tokenized with language-specific spaCy models (spaCy 3.1, 'core_news/web_sm' models), or with Trankit (Trankit 1.1.0) when no spaCy model was available for a given language (Hungarian, Turkish, Arabic, Hindi).

Who are the annotators?

The original TACRED dataset was annotated by crowd workers; see the TACRED paper.

Personal and Sensitive Information

The authors of the original TACRED dataset have not stated measures that prevent collecting sensitive or offensive text. Therefore, we do not rule out the possible risk of sensitive/offensive content in the translated data.

Considerations for Using the Data

Social Impact of Dataset

not applicable

Discussion of Biases

The dataset is drawn from web and newswire text, and thus reflects any biases of these original texts, as well as biases introduced by the MT models.

Other Known Limitations

not applicable

Additional Information

Dataset Curators

The dataset was created by members of the DFKI SLT team: Leonhard Hennig, Philippe Thomas, Sebastian Möller, Gabriel Kressin.

Licensing Information

To respect the copyright of the underlying TACRED dataset, MultiTACRED is released via the Linguistic Data Consortium (LDC License). You can download MultiTACRED from the LDC MultiTACRED webpage. Access is free for LDC members; otherwise, an access fee of $25 applies.

Citation Information

The original dataset:

@inproceedings{zhang2017tacred,
  author = {Zhang, Yuhao and Zhong, Victor and Chen, Danqi and Angeli, Gabor and Manning, Christopher D.},
  booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017)},
  title = {Position-aware Attention and Supervised Data Improve Slot Filling},
  url = {https://nlp.stanford.edu/pubs/zhang2017tacred.pdf},
  pages = {35--45},
  year = {2017}
}

For the revised version, please also cite:

@inproceedings{alt-etal-2020-tacred,
    title = "{TACRED} Revisited: A Thorough Evaluation of the {TACRED} Relation Extraction Task",
    author = "Alt, Christoph  and
      Gabryszak, Aleksandra  and
      Hennig, Leonhard",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.142",
    doi = "10.18653/v1/2020.acl-main.142",
    pages = "1558--1569",
}

For the Re-TACRED version, please also cite:

@inproceedings{DBLP:conf/aaai/StoicaPP21,
  author       = {George Stoica and
                  Emmanouil Antonios Platanios and
                  Barnab{\'{a}}s P{\'{o}}czos},
  title        = {Re-TACRED: Addressing Shortcomings of the {TACRED} Dataset},
  booktitle    = {Thirty-Fifth {AAAI} Conference on Artificial Intelligence, {AAAI}
                  2021, Thirty-Third Conference on Innovative Applications of Artificial
                  Intelligence, {IAAI} 2021, The Eleventh Symposium on Educational Advances
                  in Artificial Intelligence, {EAAI} 2021, Virtual Event, February 2-9,
                  2021},
  pages        = {13843--13850},
  publisher    = {{AAAI} Press},
  year         = {2021},
  url          = {https://ojs.aaai.org/index.php/AAAI/article/view/17631},
}

Contributions

Thanks to @leonhardhennig for adding this dataset.

Author:

DFKI-SLT

Dataset size:

148.54 KB