Dataset: DFKI-SLT/tacred
Task: text classification
Language: en
Multilinguality: monolingual
Size: 100K<n<1M
Language creators: found
Source datasets: extended|other
Preprint: arxiv:2104.08398
License: other

The TAC Relation Extraction Dataset (TACRED) is a large-scale relation extraction dataset with 106,264 examples built over newswire and web text from the corpus used in the yearly TAC Knowledge Base Population (TAC KBP) challenges. Examples in TACRED cover 41 relation types as used in the TAC KBP challenges (e.g., per:schools_attended and org:members) or are labeled as no_relation if no defined relation is held. These examples are created by combining available human annotations from the TAC KBP challenges and crowdsourcing. Please see Stanford's EMNLP paper, or their EMNLP slides, for full details.
Note:
This repository provides all three versions of the dataset as BuilderConfigs: 'original', 'revisited', and 're-tacred'. Simply set the `name` parameter in the `load_dataset` method to choose a specific version, as in the sketch below. The original TACRED is loaded by default.
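A minimal loading sketch; the `data_dir` path is an assumption here and should point at your local copy of the LDC-licensed TACRED download (see the licensing note further down):

```python
from datasets import load_dataset

# Pick one of the three BuilderConfigs: "original", "revisited", or "re-tacred".
# data_dir is assumed to point at the TACRED data obtained from the LDC.
dataset = load_dataset("DFKI-SLT/tacred", name="re-tacred", data_dir="path/to/tacred")

print(dataset["train"][0]["relation"])
```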
The language in the dataset is English.
An example of 'train' looks as follows:
{ "id": "61b3a5c8c9a882dcfcd2", "docid": "AFP_ENG_20070218.0019.LDC2009T13", "relation": "org:founded_by", "token": ["Tom", "Thabane", "resigned", "in", "October", "last", "year", "to", "form", "the", "All", "Basotho", "Convention", "-LRB-", "ABC", "-RRB-", ",", "crossing", "the", "floor", "with", "17", "members", "of", "parliament", ",", "causing", "constitutional", "monarch", "King", "Letsie", "III", "to", "dissolve", "parliament", "and", "call", "the", "snap", "election", "."], "subj_start": 10, "subj_end": 13, "obj_start": 0, "obj_end": 2, "subj_type": "ORGANIZATION", "obj_type": "PERSON", "stanford_pos": ["NNP", "NNP", "VBD", "IN", "NNP", "JJ", "NN", "TO", "VB", "DT", "DT", "NNP", "NNP", "-LRB-", "NNP", "-RRB-", ",", "VBG", "DT", "NN", "IN", "CD", "NNS", "IN", "NN", ",", "VBG", "JJ", "NN", "NNP", "NNP", "NNP", "TO", "VB", "NN", "CC", "VB", "DT", "NN", "NN", "."], "stanford_ner": ["PERSON", "PERSON", "O", "O", "DATE", "DATE", "DATE", "O", "O", "O", "O", "O", "O", "O", "ORGANIZATION", "O", "O", "O", "O", "O", "O", "NUMBER", "O", "O", "O", "O", "O", "O", "O", "O", "PERSON", "PERSON", "O", "O", "O", "O", "O", "O", "O", "O", "O"], "stanford_head": [2, 3, 0, 5, 3, 7, 3, 9, 3, 13, 13, 13, 9, 15, 13, 15, 3, 3, 20, 18, 23, 23, 18, 25, 23, 3, 3, 32, 32, 32, 32, 27, 34, 27, 34, 34, 34, 40, 40, 37, 3], "stanford_deprel": ["compound", "nsubj", "ROOT", "case", "nmod", "amod", "nmod:tmod", "mark", "xcomp", "det", "compound", "compound", "dobj", "punct", "appos", "punct", "punct", "xcomp", "det", "dobj", "case", "nummod", "nmod", "case", "nmod", "punct", "xcomp", "amod", "compound", "compound", "compound", "dobj", "mark", "xcomp", "dobj", "cc", "conj", "det", "compound", "dobj", "punct"] }
The data fields are the same among all splits.
To minimize dataset bias, TACRED is stratified across the years in which the TAC KBP challenge was run:
| | Train | Dev | Test |
|---|---|---|---|
| TACRED | 68,124 (TAC KBP 2009-2012) | 22,631 (TAC KBP 2013) | 15,509 (TAC KBP 2014) |
| Re-TACRED | 58,465 (TAC KBP 2009-2012) | 19,584 (TAC KBP 2013) | 13,418 (TAC KBP 2014) |
[More Information Needed]
[More Information Needed]
Who are the source language producers?
[More Information Needed]
See the Stanford paper and the TACRED Revisited paper, plus their appendices.
To ensure that models trained on TACRED are not biased towards predicting false positives on real-world text, all sampled sentences where no relation was found between the mention pairs were fully annotated to be negative examples. As a result, 79.5% of the examples are labeled as no_relation.
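A quick way to verify this skew on a loaded split (a sketch, reusing the hypothetical `data_dir` from the loading example above; the `ClassLabel` handling is an assumption about how the `relation` field may be encoded):

```python
from collections import Counter

from datasets import load_dataset

tacred = load_dataset("DFKI-SLT/tacred", name="original", data_dir="path/to/tacred")
train = tacred["train"]

labels = train["relation"]
# If "relation" is a ClassLabel feature, map integer ids back to label names.
if hasattr(train.features["relation"], "int2str"):
    labels = [train.features["relation"].int2str(i) for i in labels]

counts = Counter(labels)
share = counts["no_relation"] / sum(counts.values())
print(f"no_relation: {share:.1%}")  # expected around 79.5% across the full dataset
```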
Who are the annotators?
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
To respect the copyright of the underlying TAC KBP corpus, TACRED is released via the Linguistic Data Consortium (LDC License). You can download TACRED from the LDC TACRED webpage. If you are an LDC member, access is free; otherwise, an access fee of $25 is required.
The original dataset:
```bibtex
@inproceedings{zhang2017tacred,
  author    = {Zhang, Yuhao and Zhong, Victor and Chen, Danqi and Angeli, Gabor and Manning, Christopher D.},
  booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017)},
  title     = {Position-aware Attention and Supervised Data Improve Slot Filling},
  url       = {https://nlp.stanford.edu/pubs/zhang2017tacred.pdf},
  pages     = {35--45},
  year      = {2017}
}
```
For the revised version ("revisited"), please also cite:
```bibtex
@inproceedings{alt-etal-2020-tacred,
  title     = "{TACRED} Revisited: A Thorough Evaluation of the {TACRED} Relation Extraction Task",
  author    = "Alt, Christoph and Gabryszak, Aleksandra and Hennig, Leonhard",
  booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
  month     = jul,
  year      = "2020",
  address   = "Online",
  publisher = "Association for Computational Linguistics",
  url       = "https://www.aclweb.org/anthology/2020.acl-main.142",
  doi       = "10.18653/v1/2020.acl-main.142",
  pages     = "1558--1569",
}
```
For the relabeled version ("re-tacred"), please also cite:
```bibtex
@inproceedings{DBLP:conf/aaai/StoicaPP21,
  author    = {George Stoica and Emmanouil Antonios Platanios and Barnab{\'{a}}s P{\'{o}}czos},
  title     = {Re-TACRED: Addressing Shortcomings of the {TACRED} Dataset},
  booktitle = {Thirty-Fifth {AAAI} Conference on Artificial Intelligence, {AAAI} 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, {IAAI} 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, {EAAI} 2021, Virtual Event, February 2-9, 2021},
  pages     = {13843--13850},
  publisher = {{AAAI} Press},
  year      = {2021},
  url       = {https://ojs.aaai.org/index.php/AAAI/article/view/17631},
}
```