Datasets:

race

Languages:

en

Multilinguality:

monolingual

Size Categories:

10K<n<100K

Language Creators:

found

Annotations Creators:

expert-generated

Source Datasets:

original

ArXiv:

arxiv:1704.04683

License:

other

Dataset Card for "race"

Dataset Summary

RACE is a large-scale reading comprehension dataset with more than 28,000 passages and nearly 100,000 questions. It was collected from English examinations in China that are designed for middle school and high school students. The dataset can serve as training and test sets for machine comprehension.

Supported Tasks and Leaderboards

More Information Needed

Languages

The passages and questions are in English (en).

Dataset Structure

Data Instances

all
  • Size of downloaded dataset files: 25.44 MB
  • Size of the generated dataset: 174.73 MB
  • Total amount of disk used: 200.17 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "answer": "A",
    "article": "\"Schoolgirls have been wearing such short skirts at Paget High School in Branston that they've been ordered to wear trousers ins...",
    "example_id": "high132.txt",
    "options": ["short skirts give people the impression of sexualisation", "short skirts are too expensive for parents to afford", "the headmaster doesn't like girls wearing short skirts", "the girls wearing short skirts will be at the risk of being laughed at"],
    "question": "The girls at Paget High School are not allowed to wear skirts in that    _  ."
}
high
  • Size of downloaded dataset files: 25.44 MB
  • Size of the generated dataset: 140.12 MB
  • Total amount of disk used: 165.56 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "answer": "A",
    "article": "\"Schoolgirls have been wearing such short skirts at Paget High School in Branston that they've been ordered to wear trousers ins...",
    "example_id": "high132.txt",
    "options": ["short skirts give people the impression of sexualisation", "short skirts are too expensive for parents to afford", "the headmaster doesn't like girls wearing short skirts", "the girls wearing short skirts will be at the risk of being laughed at"],
    "question": "The girls at Paget High School are not allowed to wear skirts in that    _  ."
}
middle
  • Size of downloaded dataset files: 25.44 MB
  • Size of the generated dataset: 34.61 MB
  • Total amount of disk used: 60.05 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "answer": "B",
    "article": "\"There is not enough oil in the world now. As time goes by, it becomes less and less, so what are we going to do when it runs ou...",
    "example_id": "middle3.txt",
    "options": ["There is more petroleum than we can use now.", "Trees are needed for some other things besides making gas.", "We got electricity from ocean tides in the old days.", "Gas wasn't used to run cars in the Second World War."],
    "question": "According to the passage, which of the following statements is TRUE?"
}
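
As a rough illustration of how an instance like the ones above might be consumed, the sketch below maps the answer letter to the text of the chosen option. It assumes that the letters "A", "B", ... index the options list in order, which the card does not state explicitly. The helper names answer_index and answer_text are hypothetical. The article string is kept cropped exactly as shown above.

```python
# Instance copied from the 'middle' example above. The article was cropped
# in the card, so it is left cropped here rather than reconstructed.
example = {
    "answer": "B",
    "article": "\"There is not enough oil in the world now. As time goes by, it becomes less and less, so what are we going to do when it runs ou...",
    "example_id": "middle3.txt",
    "options": [
        "There is more petroleum than we can use now.",
        "Trees are needed for some other things besides making gas.",
        "We got electricity from ocean tides in the old days.",
        "Gas wasn't used to run cars in the Second World War.",
    ],
    "question": "According to the passage, which of the following statements is TRUE?",
}


def answer_index(letter: str) -> int:
    """Convert an answer letter to a zero-based option index.

    Assumption (not stated in the card): 'A' -> 0, 'B' -> 1, and so on.
    """
    if len(letter) != 1 or not ("A" <= letter <= "Z"):
        raise ValueError(f"unexpected answer label: {letter!r}")
    return ord(letter) - ord("A")


def answer_text(instance: dict) -> str:
    """Return the option string selected by the instance's answer letter."""
    index = answer_index(instance["answer"])
    options = instance["options"]
    if index >= len(options):
        raise ValueError("answer letter points past the end of the options list")
    return options[index]


if __name__ == "__main__":
    print(answer_text(example))
```

Under that assumption, the answer "B" selects the second option of the 'middle' example.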

Data Fields

The data fields are the same across all configurations and splits.

all
  • example_id : a string feature.
  • article : a string feature.
  • answer : a string feature.
  • question : a string feature.
  • options : a list of string features.
high
  • example_id : a string feature.
  • article : a string feature.
  • answer : a string feature.
  • question : a string feature.
  • options : a list of string features.
middle
  • example_id : a string feature.
  • article : a string feature.
  • answer : a string feature.
  • question : a string feature.
  • options : a list of string features.
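
The field list above can be turned into a small sanity check. The following is only a sketch: the names EXPECTED_FIELDS and schema_problems are hypothetical, and the toy instance is invented for illustration, not taken from the dataset.

```python
from typing import Dict, List

# Field names and Python types corresponding to the list above:
# four string features and one list of strings.
EXPECTED_FIELDS: Dict[str, type] = {
    "example_id": str,
    "article": str,
    "answer": str,
    "question": str,
    "options": list,
}


def schema_problems(instance: dict) -> List[str]:
    """Return human-readable schema problems; an empty list means none found."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in instance:
            problems.append(f"missing field: {name}")
        elif not isinstance(instance[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    options = instance.get("options")
    if isinstance(options, list) and not all(isinstance(o, str) for o in options):
        problems.append("options should contain only strings")
    return problems


# Hypothetical toy instance, made up for this sketch.
toy = {
    "example_id": "toy1.txt",
    "article": "A short passage.",
    "answer": "A",
    "question": "What is this?",
    "options": ["A passage", "A poem", "A song", "A table"],
}

if __name__ == "__main__":
    print(schema_problems(toy))
```

A check like this could be run over a few loaded examples before training, to catch a mismatched configuration or a changed schema early.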

Data Splits

name     train   validation   test
all      87866   4887         4934
high     62445   3451         3498
middle   25421   1436         1436
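
The split sizes above are arithmetically consistent with "all" being the combination of "high" and "middle", although the card does not say so directly. The sketch below checks that arithmetic and adds a hypothetical load_race helper. The helper assumes the Hugging Face datasets package is installed, that the hub identifier matches this card's title ("race"), and that network access is available. None of this is confirmed by the card, and the identifier may differ on the current hub.

```python
# Split sizes copied from the table above.
SPLIT_SIZES = {
    "all": {"train": 87866, "validation": 4887, "test": 4934},
    "high": {"train": 62445, "validation": 3451, "test": 3498},
    "middle": {"train": 25421, "validation": 1436, "test": 1436},
}


def total_examples(config: str) -> int:
    """Sum the train, validation and test sizes of one configuration."""
    return sum(SPLIT_SIZES[config].values())


def all_matches_high_plus_middle() -> bool:
    """Check that each 'all' split size equals the 'high' + 'middle' sizes."""
    return all(
        SPLIT_SIZES["all"][split]
        == SPLIT_SIZES["high"][split] + SPLIT_SIZES["middle"][split]
        for split in SPLIT_SIZES["all"]
    )


def load_race(config: str = "all", split: str = "train"):
    """Hypothetical loader sketch; not executed by this module.

    Assumptions: the `datasets` package is installed and the hub
    identifier is "race", as in this card's title.
    """
    from datasets import load_dataset  # imported lazily: optional dependency

    return load_dataset("race", config, split=split)


if __name__ == "__main__":
    print(all_matches_high_plus_middle())
    print(total_examples("all"))
```

The total for "all" comes to 97,687 examples, which agrees with the "nearly 100,000 questions" in the summary.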

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

More Information Needed

Additional Information

Dataset Curators

More Information Needed

Licensing Information

http://www.cs.cmu.edu/~glai1/data/race/

  • RACE dataset is available for non-commercial research purpose only.

  • All passages are obtained from the Internet which is not property of Carnegie Mellon University. We are not responsible for the content nor the meaning of these passages.

  • You agree not to reproduce, duplicate, copy, sell, trade, resell or exploit for any commercial purpose, any portion of the contexts and any portion of derived data.

  • We reserve the right to terminate your access to the RACE dataset at any time.

Citation Information

    @inproceedings{lai-etal-2017-race,
        title = "{RACE}: Large-scale {R}e{A}ding Comprehension Dataset From Examinations",
        author = "Lai, Guokun  and
          Xie, Qizhe  and
          Liu, Hanxiao  and
          Yang, Yiming  and
          Hovy, Eduard",
        booktitle = "Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing",
        month = sep,
        year = "2017",
        address = "Copenhagen, Denmark",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/D17-1082",
        doi = "10.18653/v1/D17-1082",
        pages = "785--794",
    }
    

Contributions

Thanks to @abarbosa94, @patrickvonplaten, @lewtun, @thomwolf, @mariamabarham for adding this dataset.