Dataset: docred
Tasks:
Languages:
Multilinguality: monolingual
Size: 100K<n<1M
Language creators: crowdsourced
Annotation creators: expert-generated
Source datasets: original
Preprint: arxiv:1906.06127
License:
Multiple entities in a document generally exhibit complex inter-sentence relations, and cannot be well handled by existing relation extraction (RE) methods that typically focus on extracting intra-sentence relations for single entity pairs. In order to accelerate the research on document-level RE, we introduce DocRED, a new dataset constructed from Wikipedia and Wikidata with three features:

- DocRED annotates both named entities and relations, and is the largest human-annotated dataset for document-level RE from plain text.
- DocRED requires reading multiple sentences in a document to extract entities and infer their relations by synthesizing all information of the document.
- Along with the human-annotated data, we also offer large-scale distantly supervised data, which enables DocRED to be adopted for both supervised and weakly supervised scenarios.
An example of 'train_annotated' looks as follows.
{
    "labels": {
        "evidence": [[0]],
        "head": [0],
        "relation_id": ["P1"],
        "relation_text": ["is_a"],
        "tail": [0]
    },
    "sents": [["This", "is", "a", "sentence"], ["This", "is", "another", "sentence"]],
    "title": "Title of the document",
    "vertexSet": [[{
        "name": "sentence",
        "pos": [3],
        "sent_id": 0,
        "type": "NN"
    }, {
        "name": "sentence",
        "pos": [3],
        "sent_id": 1,
        "type": "NN"
    }], [{
        "name": "This",
        "pos": [0],
        "sent_id": 0,
        "type": "NN"
    }]]
}
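The record above stores relations as parallel lists, which can be hard to read at a glance. The following is a minimal sketch of how one might turn them into readable triples. It assumes that `head` and `tail` are indices into `vertexSet` and that each `evidence` entry lists sentence indices into `sents`; the helper names `entity_name` and `decode_triples` are hypothetical and not part of the dataset or any library.

```python
# The sample record from above, copied as a Python literal.
record = {
    "labels": {
        "evidence": [[0]],
        "head": [0],
        "relation_id": ["P1"],
        "relation_text": ["is_a"],
        "tail": [0],
    },
    "sents": [["This", "is", "a", "sentence"],
              ["This", "is", "another", "sentence"]],
    "title": "Title of the document",
    "vertexSet": [
        [{"name": "sentence", "pos": [3], "sent_id": 0, "type": "NN"},
         {"name": "sentence", "pos": [3], "sent_id": 1, "type": "NN"}],
        [{"name": "This", "pos": [0], "sent_id": 0, "type": "NN"}],
    ],
}


def entity_name(vertex_set, index):
    """Return the name of the first mention of the entity at `index`."""
    return vertex_set[index][0]["name"]


def decode_triples(doc):
    """Zip the parallel label lists into one dict per relation.

    Assumption: head/tail index into vertexSet, and evidence holds
    sentence indices into sents.
    """
    labels = doc["labels"]
    triples = []
    for head, tail, rel_id, rel_text, evidence in zip(
        labels["head"],
        labels["tail"],
        labels["relation_id"],
        labels["relation_text"],
        labels["evidence"],
    ):
        triples.append({
            "head": entity_name(doc["vertexSet"], head),
            "relation_id": rel_id,
            "relation_text": rel_text,
            "tail": entity_name(doc["vertexSet"], tail),
            # Join the tokens of each supporting sentence for display.
            "evidence": [" ".join(doc["sents"][i]) for i in evidence],
        })
    return triples


for triple in decode_triples(record):
    print(triple)
```

Each entity in `vertexSet` is a list of mentions, so the sketch simply takes the first mention's name as the display label; a real pipeline might prefer to keep all mentions.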
The data fields are the same among all splits.
| name | train_annotated | train_distant | validation | test |
|---|---|---|---|---|
| default | 3053 | 101873 | 998 | 1000 |
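As a quick sanity check on the table, the split sizes can be summed and compared against the `100K<n<1M` size category listed in the metadata. This is only a sketch: it assumes the size category refers to the total number of documents across all splits, and the variable names are illustrative.

```python
# Split sizes copied from the table above.
split_sizes = {
    "train_annotated": 3053,
    "train_distant": 101873,
    "validation": 998,
    "test": 1000,
}

total = sum(split_sizes.values())
print(f"total documents: {total}")  # 106924

# Share of each split relative to the whole dataset.
for name, size in split_sizes.items():
    print(f"{name}: {size / total:.2%}")

# Assumption: the size category counts documents over all splits.
in_size_category = 100_000 < total < 1_000_000
print(f"within 100K<n<1M: {in_size_category}")
```

The distantly supervised split accounts for the large majority of documents, which matches the description of it as large-scale compared with the human-annotated data.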
@inproceedings{yao-etal-2019-docred,
title = "{D}oc{RED}: A Large-Scale Document-Level Relation Extraction Dataset",
author = "Yao, Yuan and
Ye, Deming and
Li, Peng and
Han, Xu and
Lin, Yankai and
Liu, Zhenghao and
Liu, Zhiyuan and
Huang, Lixin and
Zhou, Jie and
Sun, Maosong",
booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2019",
address = "Florence, Italy",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/P19-1074",
doi = "10.18653/v1/P19-1074",
pages = "764--777",
}
Thanks to @ghomasHudson, @thomwolf, and @lhoestq for adding this dataset.