Dataset: DFKI-SLT/tacred
Task: text classification
Language: en
Multilinguality: monolingual
Size: 100K<n<1M
Language creators: found
Source datasets: extended|other
Preprint: arxiv:2104.08398
License: other

The TAC Relation Extraction Dataset (TACRED) is a large-scale relation extraction dataset with 106,264 examples built over newswire and web text from the corpus used in the yearly TAC Knowledge Base Population (TAC KBP) challenges. Examples in TACRED cover 41 relation types as used in the TAC KBP challenges (e.g., per:schools_attended and org:members) or are labeled as no_relation if no defined relation is held. These examples are created by combining available human annotations from the TAC KBP challenges and crowdsourcing. Please see Stanford's EMNLP paper, or their EMNLP slides, for full details.
Note:
This repository provides all three versions of the dataset as BuilderConfigs: 'original', 'revisited', and 're-tacred'. Simply set the `name` parameter in the `load_dataset` method to choose a specific version, as in the sketch below. The original TACRED is loaded by default.
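A minimal loading sketch; the `data_dir` path is an assumption here and should point at your local copy of the LDC-licensed TACRED download (see the licensing note further down):

```python
from datasets import load_dataset

# Pick one of the three BuilderConfigs: "original", "revisited", or "re-tacred".
# data_dir is assumed to point at the TACRED data obtained from the LDC.
dataset = load_dataset("DFKI-SLT/tacred", name="re-tacred", data_dir="path/to/tacred")

print(dataset["train"][0]["relation"])
```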
The language in the dataset is English.
An example of 'train' looks as follows:
{ "id": "61b3a5c8c9a882dcfcd2", "docid": "AFP_ENG_20070218.0019.LDC2009T13", "relation": "org:founded_by", "token": ["Tom", "Thabane", "resigned", "in", "October", "last", "year", "to", "form", "the", "All", "Basotho", "Convention", "-LRB-", "ABC", "-RRB-", ",", "crossing", "the", "floor", "with", "17", "members", "of", "parliament", ",", "causing", "constitutional", "monarch", "King", "Letsie", "III", "to", "dissolve", "parliament", "and", "call", "the", "snap", "election", "."], "subj_start": 10, "subj_end": 13, "obj_start": 0, "obj_end": 2, "subj_type": "ORGANIZATION", "obj_type": "PERSON", "stanford_pos": ["NNP", "NNP", "VBD", "IN", "NNP", "JJ", "NN", "TO", "VB", "DT", "DT", "NNP", "NNP", "-LRB-", "NNP", "-RRB-", ",", "VBG", "DT", "NN", "IN", "CD", "NNS", "IN", "NN", ",", "VBG", "JJ", "NN", "NNP", "NNP", "NNP", "TO", "VB", "NN", "CC", "VB", "DT", "NN", "NN", "."], "stanford_ner": ["PERSON", "PERSON", "O", "O", "DATE", "DATE", "DATE", "O", "O", "O", "O", "O", "O", "O", "ORGANIZATION", "O", "O", "O", "O", "O", "O", "NUMBER", "O", "O", "O", "O", "O", "O", "O", "O", "PERSON", "PERSON", "O", "O", "O", "O", "O", "O", "O", "O", "O"], "stanford_head": [2, 3, 0, 5, 3, 7, 3, 9, 3, 13, 13, 13, 9, 15, 13, 15, 3, 3, 20, 18, 23, 23, 18, 25, 23, 3, 3, 32, 32, 32, 32, 27, 34, 27, 34, 34, 34, 40, 40, 37, 3], "stanford_deprel": ["compound", "nsubj", "ROOT", "case", "nmod", "amod", "nmod:tmod", "mark", "xcomp", "det", "compound", "compound", "dobj", "punct", "appos", "punct", "punct", "xcomp", "det", "dobj", "case", "nummod", "nmod", "case", "nmod", "punct", "xcomp", "amod", "compound", "compound", "compound", "dobj", "mark", "xcomp", "dobj", "cc", "conj", "det", "compound", "dobj", "punct"] }
The data fields are the same among all splits.
To minimize dataset bias, TACRED is stratified across the years in which the TAC KBP challenge was run:
| | Train | Dev | Test |
|---|---|---|---|
| TACRED | 68,124 (TAC KBP 2009-2012) | 22,631 (TAC KBP 2013) | 15,509 (TAC KBP 2014) |
| Re-TACRED | 58,465 (TAC KBP 2009-2012) | 19,584 (TAC KBP 2013) | 13,418 (TAC KBP 2014) |
[More Information Needed]
[More Information Needed]
Who are the source language producers?
[More Information Needed]
See the Stanford paper and the TACRED Revisited paper, plus their appendices.
To ensure that models trained on TACRED are not biased towards predicting false positives on real-world text, all sampled sentences where no relation was found between the mention pairs were fully annotated to be negative examples. As a result, 79.5% of the examples are labeled as no_relation.
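A quick way to verify this skew on a loaded split (a sketch, reusing the hypothetical `data_dir` from the loading example above; the `ClassLabel` handling is an assumption about how the `relation` field may be encoded):

```python
from collections import Counter

from datasets import load_dataset

tacred = load_dataset("DFKI-SLT/tacred", name="original", data_dir="path/to/tacred")
train = tacred["train"]

labels = train["relation"]
# If "relation" is a ClassLabel feature, map integer ids back to label names.
if hasattr(train.features["relation"], "int2str"):
    labels = [train.features["relation"].int2str(i) for i in labels]

counts = Counter(labels)
share = counts["no_relation"] / sum(counts.values())
print(f"no_relation: {share:.1%}")  # expected around 79.5% across the full dataset
```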
Who are the annotators?
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
To respect the copyright of the underlying TAC KBP corpus, TACRED is released via the Linguistic Data Consortium (LDC License). You can download TACRED from the LDC TACRED webpage. If you are an LDC member, access is free; otherwise, an access fee of $25 is required.
The original dataset:
```bibtex
@inproceedings{zhang2017tacred,
  author    = {Zhang, Yuhao and Zhong, Victor and Chen, Danqi and Angeli, Gabor and Manning, Christopher D.},
  booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017)},
  title     = {Position-aware Attention and Supervised Data Improve Slot Filling},
  url       = {https://nlp.stanford.edu/pubs/zhang2017tacred.pdf},
  pages     = {35--45},
  year      = {2017}
}
```
For the revised version ("revisited"), please also cite:
```bibtex
@inproceedings{alt-etal-2020-tacred,
  title     = "{TACRED} Revisited: A Thorough Evaluation of the {TACRED} Relation Extraction Task",
  author    = "Alt, Christoph and Gabryszak, Aleksandra and Hennig, Leonhard",
  booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
  month     = jul,
  year      = "2020",
  address   = "Online",
  publisher = "Association for Computational Linguistics",
  url       = "https://www.aclweb.org/anthology/2020.acl-main.142",
  doi       = "10.18653/v1/2020.acl-main.142",
  pages     = "1558--1569",
}
```
For the relabeled version ("re-tacred"), please also cite:
```bibtex
@inproceedings{DBLP:conf/aaai/StoicaPP21,
  author    = {George Stoica and Emmanouil Antonios Platanios and Barnab{\'{a}}s P{\'{o}}czos},
  title     = {Re-TACRED: Addressing Shortcomings of the {TACRED} Dataset},
  booktitle = {Thirty-Fifth {AAAI} Conference on Artificial Intelligence, {AAAI} 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, {IAAI} 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, {EAAI} 2021, Virtual Event, February 2-9, 2021},
  pages     = {13843--13850},
  publisher = {{AAAI} Press},
  year      = {2021},
  url       = {https://ojs.aaai.org/index.php/AAAI/article/view/17631},
}
```