Dataset: docred
Tasks: text retrieval
Languages: en
Multilinguality: monolingual
Size: 100K<n<1M
Language creators: crowdsourced
Annotation creators: expert-generated
Source datasets: original
ArXiv: arxiv:1906.06127
License: mit

Multiple entities in a document generally exhibit complex inter-sentence relations, and cannot be well handled by existing relation extraction (RE) methods that typically focus on extracting intra-sentence relations for single entity pairs. In order to accelerate the research on document-level RE, we introduce DocRED, a new dataset constructed from Wikipedia and Wikidata with three features:

- DocRED annotates both named entities and relations, and is the largest human-annotated dataset for document-level RE from plain text.
- DocRED requires reading multiple sentences in a document to extract entities and infer their relations by synthesizing all information of the document.
- Along with the human-annotated data, we also offer large-scale distantly supervised data, which enables DocRED to be adopted for both supervised and weakly supervised scenarios.

An example of 'train_annotated' looks as follows.
{
  "labels": {
    "evidence": [[0]],
    "head": [0],
    "relation_id": ["P1"],
    "relation_text": ["is_a"],
    "tail": [0]
  },
  "sents": [
    ["This", "is", "a", "sentence"],
    ["This", "is", "another", "sentence"]
  ],
  "title": "Title of the document",
  "vertexSet": [
    [
      {"name": "sentence", "pos": [3], "sent_id": 0, "type": "NN"},
      {"name": "sentence", "pos": [3], "sent_id": 1, "type": "NN"}
    ],
    [
      {"name": "This", "pos": [0], "sent_id": 0, "type": "NN"}
    ]
  ]
}
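
To make the record layout easier to read, here is a minimal sketch that turns the `labels` of the sample above into readable triples. It assumes that `head` and `tail` are indices into `vertexSet` (one entry per entity, each holding that entity's mentions) and that each `evidence` entry lists indices into `sents`; the card does not state this explicitly. The helper name `decode_relations` is hypothetical and is not part of the dataset or any library.

```python
import json

# The sample record shown above, reproduced as JSON so the sketch is self-contained.
record = json.loads("""
{
  "labels": {
    "evidence": [[0]],
    "head": [0],
    "relation_id": ["P1"],
    "relation_text": ["is_a"],
    "tail": [0]
  },
  "sents": [["This", "is", "a", "sentence"], ["This", "is", "another", "sentence"]],
  "title": "Title of the document",
  "vertexSet": [
    [{"name": "sentence", "pos": [3], "sent_id": 0, "type": "NN"},
     {"name": "sentence", "pos": [3], "sent_id": 1, "type": "NN"}],
    [{"name": "This", "pos": [0], "sent_id": 0, "type": "NN"}]
  ]
}
""")


def decode_relations(record):
    """Resolve each labelled relation to entity names and evidence sentences.

    Assumption: head/tail index into vertexSet, evidence indexes into sents.
    """
    labels = record["labels"]
    triples = []
    # The label fields are parallel lists: position i in each list describes relation i.
    for head, tail, rel_id, rel_text, evidence in zip(
        labels["head"],
        labels["tail"],
        labels["relation_id"],
        labels["relation_text"],
        labels["evidence"],
    ):
        head_mentions = record["vertexSet"][head]
        tail_mentions = record["vertexSet"][tail]
        triples.append(
            {
                # Use the first mention's surface form as the entity name.
                "head": head_mentions[0]["name"],
                "relation": (rel_id, rel_text),
                "tail": tail_mentions[0]["name"],
                # Sentences (as plain strings) that support this relation.
                "evidence": [" ".join(record["sents"][i]) for i in evidence],
                # Every sentence in which each entity is mentioned.
                "head_sent_ids": sorted({m["sent_id"] for m in head_mentions}),
                "tail_sent_ids": sorted({m["sent_id"] for m in tail_mentions}),
            }
        )
    return triples


triples = decode_relations(record)
for triple in triples:
    print(triple)
```

Keeping the sentence ids of all mentions alongside each triple makes it straightforward to check whether a relation can be read off a single sentence or needs several, which is the document-level setting the description emphasizes.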
The data fields are the same among all splits.
name | train_annotated | train_distant | validation | test |
---|---|---|---|---|
default | 3053 | 101873 | 998 | 1000 |
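
As a quick sanity check on the table, the following sketch sums the split sizes and confirms that the total falls inside the `100K<n<1M` size bucket given in the metadata. The dictionary simply restates the numbers from the table; the variable names are illustrative.

```python
# Split sizes copied from the table above (number of documents per split).
split_sizes = {
    "train_annotated": 3053,
    "train_distant": 101873,
    "validation": 998,
    "test": 1000,
}

total = sum(split_sizes.values())

# Fraction of all documents that come from the distantly supervised split.
distant_share = split_sizes["train_distant"] / total

print(f"total documents: {total}")
print(f"share in train_distant: {distant_share:.1%}")

# The metadata lists the size category as 100K<n<1M.
in_size_bucket = 100_000 < total < 1_000_000
print(f"within 100K<n<1M: {in_size_bucket}")
```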
@inproceedings{yao-etal-2019-docred,
    title = "{D}oc{RED}: A Large-Scale Document-Level Relation Extraction Dataset",
    author = "Yao, Yuan and Ye, Deming and Li, Peng and Han, Xu and Lin, Yankai and Liu, Zhenghao and Liu, Zhiyuan and Huang, Lixin and Zhou, Jie and Sun, Maosong",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/P19-1074",
    doi = "10.18653/v1/P19-1074",
    pages = "764--777",
}
Thanks to @ghomasHudson, @thomwolf, @lhoestq for adding this dataset.