Dataset:

cmrc2018

Tasks:

question-answering

Sub-tasks:

extractive-qa

Languages:

zh

Multilinguality:

monolingual

Size:

10K<n<100K

Language creators:

crowdsourced

Annotations creators:

crowdsourced

Source datasets:

original

Dataset Card for "cmrc2018"

Dataset Summary

A span-extraction dataset for Chinese machine reading comprehension, created to add language diversity to this area. The dataset is composed of nearly 20,000 real questions annotated on Wikipedia paragraphs by human experts. We also annotated a challenge set containing questions that require comprehensive understanding and multi-sentence inference across the context.

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure

Data Instances

default
  • Size of downloaded dataset files: 11.50 MB
  • Size of the generated dataset: 22.31 MB
  • Total amount of disk used: 33.83 MB

An example from the 'validation' split looks as follows (cropped because the full example is too long):

{
    "answers": {
        "answer_start": [11, 11],
        "text": ["光荣和ω-force", "光荣和ω-force"]
    },
    "context": "\"《战国无双3》()是由光荣和ω-force开发的战国无双系列的正统第三续作。本作以三大故事为主轴,分别是以武田信玄等人为主的《关东三国志》,织田信长等人为主的《战国三杰》,石田三成等人为主的《关原的年轻武者》,丰富游戏内的剧情。此部份专门介绍角色,欲知武...",
    "id": "DEV_0_QUERY_0",
    "question": "《战国无双3》是由哪两个公司合作开发的?"
}
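The `answer_start` values in the example can be read as character offsets into `context`, so each answer text should be recoverable by slicing the context. The following is a minimal sketch of that check, not part of any official tooling. The helper name `misaligned_answers` is hypothetical, and the record is a shortened copy of the example above whose context is assumed to begin at `《` (the leading escaped quote of the cropped example is left out).

```python
def misaligned_answers(record):
    """Return (text, start) pairs whose text is not found at `start` in the context."""
    context = record["context"]
    answers = record["answers"]
    bad = []
    for text, start in zip(answers["text"], answers["answer_start"]):
        # Slice the context at the claimed character offset and compare.
        if context[start:start + len(text)] != text:
            bad.append((text, start))
    return bad


# Shortened copy of the validation example (context truncated for brevity;
# assumed to start at the opening title mark).
record = {
    "id": "DEV_0_QUERY_0",
    "context": "《战国无双3》()是由光荣和ω-force开发的战国无双系列的正统第三续作。",
    "question": "《战国无双3》是由哪两个公司合作开发的?",
    "answers": {
        "answer_start": [11, 11],
        "text": ["光荣和ω-force", "光荣和ω-force"],
    },
}

print(misaligned_answers(record))
```

An empty list means every answer span lines up with its offset. A non-empty result would point to offsets computed against a differently normalized context.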

Data Fields

The data fields are the same among all splits.

default
  • id : a string feature.
  • context : a string feature.
  • question : a string feature.
  • answers : a dictionary feature containing:
    • text : a string feature.
    • answer_start : an int32 feature.
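The field list above can be turned into a simple structural check. This is a hedged sketch only: the function name `validate_record` is hypothetical, and it assumes that `text` and `answer_start` arrive as parallel lists, as they do in the validation example shown earlier.

```python
def validate_record(record):
    """Raise ValueError if `record` does not match the documented fields."""
    # Top-level string fields.
    for key in ("id", "context", "question"):
        if not isinstance(record.get(key), str):
            raise ValueError(f"{key!r} must be a string")

    answers = record.get("answers")
    if not isinstance(answers, dict):
        raise ValueError("'answers' must be a dictionary")

    texts = answers.get("text")
    starts = answers.get("answer_start")
    # Assumption: both sub-fields are lists with one entry per reference answer.
    if not isinstance(texts, list) or not all(isinstance(t, str) for t in texts):
        raise ValueError("'answers.text' must be a list of strings")
    if not isinstance(starts, list) or not all(isinstance(s, int) for s in starts):
        raise ValueError("'answers.answer_start' must be a list of integers")
    if len(texts) != len(starts):
        raise ValueError("'text' and 'answer_start' must have the same length")
    return True


example = {
    "id": "DEV_0_QUERY_0",
    "context": "《战国无双3》()是由光荣和ω-force开发的战国无双系列的正统第三续作。",
    "question": "《战国无双3》是由哪两个公司合作开发的?",
    "answers": {"answer_start": [11, 11], "text": ["光荣和ω-force", "光荣和ω-force"]},
}
print(validate_record(example))
```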

Data Splits

name      train   validation   test
default   10142   3219         1002
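The split sizes above can serve as a sanity check after loading. The sketch below is an illustration under stated assumptions: `split_size_mismatches` and `EXPECTED_SPLIT_SIZES` are hypothetical names, and the input is any mapping from split name to a sized object. An object returned by the Hugging Face `datasets` library's `load_dataset` should fit that shape, but the hub identifier to pass is an assumption and is not shown here.

```python
# Sizes copied from the split table above.
EXPECTED_SPLIT_SIZES = {"train": 10142, "validation": 3219, "test": 1002}


def split_size_mismatches(splits, expected=EXPECTED_SPLIT_SIZES):
    """Return {split: (expected, actual)} for every split whose size differs."""
    mismatches = {}
    for name, size in expected.items():
        # A missing split is reported with an actual size of None.
        actual = len(splits[name]) if name in splits else None
        if actual != size:
            mismatches[name] = (size, actual)
    return mismatches


# Stand-in for a loaded dataset so the sketch runs without network access.
fake_splits = {name: range(n) for name, n in EXPECTED_SPLIT_SIZES.items()}
print(split_size_mismatches(fake_splits))
print(sum(EXPECTED_SPLIT_SIZES.values()))
```

An empty dictionary means all three splits match the table.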

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

More Information Needed

Additional Information

Dataset Curators

More Information Needed

Licensing Information

More Information Needed

Citation Information

@inproceedings{cui-emnlp2019-cmrc2018,
    title = "A Span-Extraction Dataset for {C}hinese Machine Reading Comprehension",
    author = "Cui, Yiming  and
      Liu, Ting  and
      Che, Wanxiang  and
      Xiao, Li  and
      Chen, Zhipeng  and
      Ma, Wentao  and
      Wang, Shijin  and
      Hu, Guoping",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1600",
    doi = "10.18653/v1/D19-1600",
    pages = "5886--5891",
}

Contributions

Thanks to @patrickvonplaten, @mariamabarham, @lewtun, and @thomwolf for adding this dataset.