Dataset: cmrc2018
Tasks: question answering
Sub-tasks: extractive-qa
Languages: zh
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: crowdsourced
Annotation creators: crowdsourced
Source datasets: original
License: cc-by-sa-4.0

A span-extraction dataset for Chinese machine reading comprehension, created to add language diversity in this area. The dataset is composed of nearly 20,000 real questions annotated on Wikipedia paragraphs by human experts. We also annotated a challenge set containing questions that require comprehensive understanding and multi-sentence inference across the context.
An example of 'validation' looks as follows.
This example was too long and was cropped:
{
    "answers": {
        "answer_start": [11, 11],
        "text": ["光荣和ω-force", "光荣和ω-force"]
    },
    "context": "\"《战国无双3》()是由光荣和ω-force开发的战国无双系列的正统第三续作。本作以三大故事为主轴,分别是以武田信玄等人为主的《关东三国志》,织田信长等人为主的《战国三杰》,石田三成等人为主的《关原的年轻武者》,丰富游戏内的剧情。此部份专门介绍角色,欲知武...",
    "id": "DEV_0_QUERY_0",
    "question": "《战国无双3》是由哪两个公司合作开发的?"
}
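The record above suggests that each `answer_start` value is a character offset into `context`, and that the matching entry in `text` is the answer span starting there. The following is a minimal sketch of that relationship, not code from the dataset authors. The helper name `extract_span` is hypothetical, the context is shortened to its first sentence, and the leading escaped quote shown in the cropped example is assumed to be a display artifact and is left out.

```python
def extract_span(context: str, start: int, answer: str) -> str:
    """Return the slice of `context` that begins at `start` and has the answer's length."""
    return context[start:start + len(answer)]


# Record modeled on the cropped validation example; the context is shortened
# to its first sentence, and the leading quote is dropped (an assumption).
example = {
    "id": "DEV_0_QUERY_0",
    "context": "《战国无双3》()是由光荣和ω-force开发的战国无双系列的正统第三续作。",
    "question": "《战国无双3》是由哪两个公司合作开发的?",
    "answers": {
        "answer_start": [11, 11],
        "text": ["光荣和ω-force", "光荣和ω-force"],
    },
}

# Recover one span per annotated answer and compare it with the stored text.
spans = [
    extract_span(example["context"], start, answer)
    for start, answer in zip(
        example["answers"]["answer_start"], example["answers"]["text"]
    )
]
all_match = spans == example["answers"]["text"]
```

A check like this can be useful as a sanity test before training an extractive model, since any offset that does not reproduce its answer text would point to an encoding or preprocessing mismatch.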
The data fields are the same among all splits.
name | train | validation | test |
---|---|---|---|
default | 10142 | 3219 | 1002 |
@inproceedings{cui-emnlp2019-cmrc2018,
    title = "A Span-Extraction Dataset for {C}hinese Machine Reading Comprehension",
    author = "Cui, Yiming and Liu, Ting and Che, Wanxiang and Xiao, Li and Chen, Zhipeng and Ma, Wentao and Wang, Shijin and Hu, Guoping",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1600",
    doi = "10.18653/v1/D19-1600",
    pages = "5886--5891",
}
Thanks to @patrickvonplaten, @mariamabarham, @lewtun, and @thomwolf for adding this dataset.