Dataset: hotpot_qa
Tasks: question-answering
Languages: en
Multilinguality: monolingual
Size: 100K<n<1M
Language creators: found
Annotations creators: crowdsourced
Source datasets: original
ArXiv: arxiv:1809.09600
Other: multi-hop
License: cc-by-sa-4.0

HotpotQA is a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain the predictions; (4) we offer a new type of factoid comparison questions to test QA systems’ ability to extract relevant facts and perform necessary comparison.
distractor
An example of 'validation' looks as follows.
{ "answer": "This is the answer", "context": { "sentences": [["Sent 1"], ["Sent 21", "Sent 22"]], "title": ["Title1", "Title 2"] }, "id": "000001", "level": "medium", "question": "What is the answer?", "supporting_facts": { "sent_id": [0, 1, 3], "title": ["Title of para 1", "Title of para 2", "Title of para 3"] }, "type": "comparison" }
fullwiki
An example of 'train' looks as follows.
{ "answer": "This is the answer", "context": { "sentences": [["Sent 1"], ["Sent 2"]], "title": ["Title1", "Title 2"] }, "id": "000001", "level": "hard", "question": "What is the answer?", "supporting_facts": { "sent_id": [0, 1, 3], "title": ["Title of para 1", "Title of para 2", "Title of para 3"] }, "type": "bridge" }
The data fields are the same among all splits.
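The relationship between `supporting_facts` and `context` can be illustrated with a minimal sketch. It assumes that each `supporting_facts` title names a paragraph in `context["title"]` and that the paired `sent_id` indexes into that paragraph's sentence list; the helper name `supporting_sentences` and the sample record are hypothetical and are not part of the dataset or any library. Pairs that do not resolve (an unknown title or an out-of-range index, as in the placeholder examples above) are skipped.

```python
from typing import Dict, List, Tuple


def supporting_sentences(example: Dict) -> List[Tuple[str, int, str]]:
    """Return (title, sent_id, sentence) for each supporting fact that resolves."""
    context = example["context"]
    # Map each paragraph title to its list of sentences.
    paragraphs = dict(zip(context["title"], context["sentences"]))

    facts = example["supporting_facts"]
    resolved = []
    for title, sent_id in zip(facts["title"], facts["sent_id"]):
        sentences = paragraphs.get(title)
        # Skip facts whose title or sentence index is absent from the context.
        if sentences is None or not 0 <= sent_id < len(sentences):
            continue
        resolved.append((title, sent_id, sentences[sent_id]))
    return resolved


# Hypothetical record laid out like the examples above (not real data).
record = {
    "id": "000001",
    "question": "What is the answer?",
    "answer": "This is the answer",
    "type": "comparison",
    "level": "medium",
    "context": {
        "title": ["Title 1", "Title 2"],
        "sentences": [["Sent 1"], ["Sent 21", "Sent 22"]],
    },
    "supporting_facts": {
        "title": ["Title 1", "Title 2", "Title 3"],
        "sent_id": [0, 1, 3],
    },
}

for title, sent_id, sentence in supporting_sentences(record):
    print(f"{title}[{sent_id}]: {sentence}")
```

Building the title-to-sentences dictionary once keeps each lookup constant-time, which may matter when a record carries many context paragraphs.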
distractor
Configuration | train | validation |
---|---|---|
distractor | 90447 | 7405 |
fullwiki
Configuration | train | validation | test |
---|---|---|---|
fullwiki | 90447 | 7405 | 7405 |
HotpotQA is distributed under a CC BY-SA 4.0 License.
@inproceedings{yang2018hotpotqa,
  title={{HotpotQA}: A Dataset for Diverse, Explainable Multi-hop Question Answering},
  author={Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William W. and Salakhutdinov, Ruslan and Manning, Christopher D.},
  booktitle={Conference on Empirical Methods in Natural Language Processing ({EMNLP})},
  year={2018}
}
Thanks to @albertvillanova and @ghomasHudson for adding this dataset.