Dataset:
DFKI-SLT/multitacred
MultiTACRED is a multilingual version of the large-scale TAC Relation Extraction Dataset (TACRED). It covers 12 typologically diverse languages from 9 language families, and was created by the Speech & Language Technology group of DFKI by machine-translating the instances of the original TACRED dataset and automatically projecting their entity annotations. For details of the original TACRED data collection and annotation process, see the Stanford paper. Translations are syntactically validated by checking the correctness of the XML tag markup. Any translation with an invalid tag structure, e.g. missing or invalid head or tail tag pairs, is discarded (on average, 2.3% of the instances).
The languages covered are Arabic, Chinese, Finnish, French, German, Hindi, Hungarian, Japanese, Polish, Russian, Spanish, and Turkish. The intended use is supervised relation classification, and the intended audience is researchers.
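The tag-structure check described above can be pictured with a small sketch. It is not the authors' validation code: the tag names `H` and `T` are hypothetical placeholders for the head and tail markup, and the rule applied (exactly one non-empty, non-crossing pair per entity) is an assumption about what "invalid tag structure" covers.

```python
import re

# Hypothetical tag names; the dataset card only says that head and tail
# entities are wrapped in XML-style tag pairs.
HEAD_TAG = "H"
TAIL_TAG = "T"


def has_valid_entity_markup(text, tags=(HEAD_TAG, TAIL_TAG)):
    """Return True if every tag forms exactly one non-empty pair and pairs do not cross."""
    spans = []
    for tag in tags:
        # Position right after the opening tag and right before the closing tag.
        opens = [m.end() for m in re.finditer(re.escape(f"<{tag}>"), text)]
        closes = [m.start() for m in re.finditer(re.escape(f"</{tag}>"), text)]
        if len(opens) != 1 or len(closes) != 1:
            return False  # missing or duplicated tag
        start, end = opens[0], closes[0]
        if end <= start or not text[start:end].strip():
            return False  # closing tag precedes opening tag, or empty entity
        spans.append((start, end))
    spans.sort()
    # Entity spans must not overlap or cross each other.
    return all(prev_end <= next_start
               for (_, prev_end), (next_start, _) in zip(spans, spans[1:]))


translations = [
    "<T>Tom Thabane</T> gründete die <H>All Basotho Convention</H>.",
    "<H>ABC</H> gegründet von <T>Tom Thabane",  # tail closing tag lost
]
kept = [t for t in translations if has_valid_entity_markup(t)]
```

Here only the first translation is kept; the second would be discarded because its tail tag pair is incomplete.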
Please see our ACL paper for full details.
NOTE: This DatasetReader supports a reduced version of the original TACRED JSON format with the following changes:
The DatasetReader changes the offsets of the following fields, to conform with standard Python usage (see _generate_examples()):
NOTE 2: The MultiTACRED dataset offers an additional 'split', namely the backtranslated test data (translated to a target language and then back to English). To access this split, use dataset['backtranslated_test'].
You can find the TACRED dataset reader for the English version of the dataset at https://huggingface.co/datasets/DFKI-SLT/tacred.
The languages in the dataset are Arabic, German, English, Spanish, Finnish, French, Hindi, Hungarian, Japanese, Polish, Russian, Turkish, and Chinese. All languages except English are machine-translated using either DeepL's or Google's translation APIs.
An example of 'train' looks as follows:
{ "id": "61b3a5c8c9a882dcfcd2", "token": ["Tom", "Thabane", "trat", "im", "Oktober", "letzten", "Jahres", "zurück", ",", "um", "die", "All", "Basotho", "Convention", "-LRB-", "ABC", "-RRB-", "zu", "gründen", ",", "die", "mit", "17", "Abgeordneten", "das", "Wort", "ergriff", ",", "woraufhin", "der", "konstitutionelle", "Monarch", "König", "Letsie", "III.", "das", "Parlament", "auflöste", "und", "Neuwahlen", "ansetzte", "."], "relation": "org:founded_by", "subj_start": 11, "subj_end": 13, "obj_start": 0, "obj_end": 1, "subj_type": "ORGANIZATION", "obj_type": "PERSON" }
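As a minimal sketch of how the offset fields relate to the token list, the helper below recovers the subject and object strings from a record. In the example above, the offsets read most naturally as inclusive end indices (tokens 11 to 13 spell "All Basotho Convention"). Records returned by the DatasetReader may use exclusive ends instead, so the `end_inclusive` flag is an assumption to check against your own data. The function name `entity_spans` is illustrative and not part of the dataset reader.

```python
# Record copied from the example above; the token list is truncated after
# the first clause for brevity, which does not affect the entity offsets.
example = {
    "id": "61b3a5c8c9a882dcfcd2",
    "token": ["Tom", "Thabane", "trat", "im", "Oktober", "letzten", "Jahres",
              "zurück", ",", "um", "die", "All", "Basotho", "Convention",
              "-LRB-", "ABC", "-RRB-", "zu", "gründen"],
    "relation": "org:founded_by",
    "subj_start": 11, "subj_end": 13,
    "obj_start": 0, "obj_end": 1,
    "subj_type": "ORGANIZATION", "obj_type": "PERSON",
}


def entity_spans(record, end_inclusive=True):
    """Return (subject_text, object_text) for one relation instance.

    end_inclusive=True treats *_end as the index of the last entity token;
    set it to False if your records use Python-style exclusive end offsets.
    """
    offset = 1 if end_inclusive else 0
    tokens = record["token"]
    subj = tokens[record["subj_start"]:record["subj_end"] + offset]
    obj = tokens[record["obj_start"]:record["obj_end"] + offset]
    return " ".join(subj), " ".join(obj)


subject, obj = entity_spans(example)
print(f"{subject} --{example['relation']}--> {obj}")
```

If the extracted subject or object looks one token too long on your data, the records most likely carry exclusive end offsets and `end_inclusive=False` is the right setting.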
The data fields are the same among all splits.
To minimize dataset bias, TACRED is stratified across the years in which the TAC KBP challenge was run. Language statistics for the splits differ because not all instances could be translated with the subject and object entity markup still intact; such instances were discarded.
Language | Train | Dev | Test | Backtranslated Test | Translation Engine |
---|---|---|---|---|---|
en | 68,124 | 22,631 | 15,509 | - | - |
ar | 67,736 | 22,502 | 15,425 | 15,425 | Google |
de | 67,253 | 22,343 | 15,282 | 15,079 | DeepL |
es | 65,247 | 21,697 | 14,908 | 14,688 | DeepL |
fi | 66,751 | 22,268 | 15,083 | 14,462 | DeepL |
fr | 66,856 | 22,298 | 15,237 | 15,088 | DeepL |
hi | 67,751 | 22,511 | 15,440 | 15,440 | Google |
hu | 67,766 | 22,519 | 15,436 | 15,436 | Google |
ja | 61,571 | 20,290 | 13,701 | 12,913 | DeepL |
pl | 68,124 | 22,631 | 15,509 | 15,509 | Google |
ru | 66,413 | 21,998 | 14,995 | 14,703 | DeepL |
tr | 67,749 | 22,510 | 15,429 | 15,429 | Google |
zh | 65,260 | 21,538 | 14,694 | 14,021 | DeepL |
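To make the effect of discarding invalid translations concrete, the following sketch computes, per language, the share of English training instances that survive translation. The counts are copied from the Train column of the table above. The helper name `retained_percent` is illustrative.

```python
# Train-split sizes copied from the table above.
train_counts = {
    "en": 68124, "ar": 67736, "de": 67253, "es": 65247, "fi": 66751,
    "fr": 66856, "hi": 67751, "hu": 67766, "ja": 61571, "pl": 68124,
    "ru": 66413, "tr": 67749, "zh": 65260,
}


def retained_percent(counts, reference="en"):
    """Percentage of the reference split retained for each other language."""
    ref = counts[reference]
    return {
        lang: round(100 * n / ref, 2)
        for lang, n in counts.items()
        if lang != reference
    }


retained = retained_percent(train_counts)
# List languages from most to least affected by discarded translations.
for lang, pct in sorted(retained.items(), key=lambda item: item[1]):
    print(f"{lang}: {pct:.2f}% of English train instances retained")
```

By this measure Japanese loses the most training instances (about 90.4% retained), while Polish retains all of them.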
To enable more research on multilingual Relation Extraction, we generate translations of the TAC relation extraction dataset using DeepL and Google Translate.
The instances of this dataset are sentences from the original TACRED dataset, which in turn are sampled from the corpus used in the yearly TAC Knowledge Base Population (TAC KBP) challenges.
Who are the source language producers?
Newswire and web texts collected for the TAC Knowledge Base Population (TAC KBP) challenges.
See the Stanford paper, the TACRED Revisited paper, and the Re-TACRED paper, plus their appendices, for details on the original annotation process. The translated versions do not change the original labels.
Translations were tokenized with language-specific spaCy models (spaCy 3.1, 'core_news/web_sm' models) or Trankit (Trankit 1.1.0) when there was no spaCy model for a given language (Hungarian, Turkish, Arabic, Hindi).
Who are the annotators?
The original TACRED dataset was annotated by crowd workers; see the TACRED paper.
The authors of the original TACRED dataset have not stated measures that prevent collecting sensitive or offensive text. Therefore, we do not rule out the possible risk of sensitive/offensive content in the translated data.
not applicable
The dataset is drawn from web and newswire text, and thus reflects any biases of these original texts, as well as biases introduced by the MT models.
not applicable
The dataset was created by members of the DFKI SLT team: Leonhard Hennig, Philippe Thomas, Sebastian Möller, Gabriel Kressin
To respect the copyright of the underlying TACRED dataset, MultiTACRED is released via the Linguistic Data Consortium (LDC License). You can download MultiTACRED from the LDC MultiTACRED webpage. Access is free for LDC members; non-members pay an access fee of $25.
The original dataset:
@inproceedings{zhang2017tacred, author = {Zhang, Yuhao and Zhong, Victor and Chen, Danqi and Angeli, Gabor and Manning, Christopher D.}, booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017)}, title = {Position-aware Attention and Supervised Data Improve Slot Filling}, url = {https://nlp.stanford.edu/pubs/zhang2017tacred.pdf}, pages = {35--45}, year = {2017} }
For the revised version, please also cite:
@inproceedings{alt-etal-2020-tacred, title = "{TACRED} Revisited: A Thorough Evaluation of the {TACRED} Relation Extraction Task", author = "Alt, Christoph and Gabryszak, Aleksandra and Hennig, Leonhard", booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics", month = jul, year = "2020", address = "Online", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/2020.acl-main.142", doi = "10.18653/v1/2020.acl-main.142", pages = "1558--1569", }
For the Re-TACRED version, please also cite:
@inproceedings{DBLP:conf/aaai/StoicaPP21, author = {George Stoica and Emmanouil Antonios Platanios and Barnab{\'{a}}s P{\'{o}}czos}, title = {Re-TACRED: Addressing Shortcomings of the {TACRED} Dataset}, booktitle = {Thirty-Fifth {AAAI} Conference on Artificial Intelligence, {AAAI} 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, {IAAI} 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, {EAAI} 2021, Virtual Event, February 2-9, 2021}, pages = {13843--13850}, publisher = {{AAAI} Press}, year = {2021}, url = {https://ojs.aaai.org/index.php/AAAI/article/view/17631}, }
Thanks to @leonhardhennig for adding this dataset.