Datasets:

cbt

Languages:

en

Multilinguality:

monolingual

Language Creators:

found

Annotations Creators:

machine-generated

Source Datasets:

original

ArXiv:

arxiv:1511.02301

License:

gfdl

Dataset Card for CBT

Dataset Summary

The Children’s Book Test (CBT) is designed to measure directly how well language models can exploit wider linguistic context. The CBT is built from books that are freely available.

Besides a raw configuration holding the full text of the books, this dataset contains four question configurations, one per answer type:

  • V : where the answers to the questions are verbs.
  • P : where the answers to the questions are prepositions.
  • NE : where the answers to the questions are named entities.
  • CN : where the answers to the questions are common nouns.
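A minimal sketch of how one of these configurations might be loaded, assuming the Hugging Face `datasets` library is installed and that the dataset is published on the Hub under the id `cbt` (taken from the metadata above; the id may differ on a mirror or in newer Hub layouts). The `CONFIGS` tuple and the `load_cbt` helper are illustrative names, not part of the library.

```python
# Configuration names listed on this card (the raw config holds whole books).
CONFIGS = ("raw", "V", "P", "NE", "CN")


def load_cbt(config, split="validation", dataset_id="cbt"):
    """Load one CBT configuration (hypothetical helper; needs network access).

    `dataset_id` is an assumption based on the card metadata.
    """
    if config not in CONFIGS:
        raise ValueError(f"unknown CBT config: {config!r}")
    # Imported lazily so the name check above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(dataset_id, config, split=split)
```

For example, `load_cbt("CN", split="test")` would request the test split of the common-noun configuration.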

Supported Tasks and Leaderboards

[More Information Needed]

Languages

The data is in English, as written by authors such as Lucy Maud Montgomery, Charles Dickens, and Andrew Lang in story books for children.

Dataset Structure

Data Instances

An instance from the V config:

{'answer': 'said', 'options': ['christening', 'existed', 'hear', 'knows', 'read', 'remarked', 'said', 'sitting', 'talking', 'wearing'], 'question': "`` They are very kind old ladies in their way , '' XXXXX the king ; `` and were nice to me when I was a boy . ''", 'sentences': ['This vexed the king even more than the queen , who was very clever and learned , and who had hated dolls when she was a child .', 'However , she , too in spite of all the books she read and all the pictures she painted , would have been glad enough to be the mother of a little prince .', 'The king was anxious to consult the fairies , but the queen would not hear of such a thing .', 'She did not believe in fairies : she said that they had never existed ; and that she maintained , though The History of the Royal Family was full of chapters about nothing else .', 'Well , at long and at last they had a little boy , who was generally regarded as the finest baby that had ever been seen .', 'Even her majesty herself remarked that , though she could never believe all the courtiers told her , yet he certainly was a fine child -- a very fine child .', 'Now , the time drew near for the christening party , and the king and queen were sitting at breakfast in their summer parlour talking over it .', 'It was a splendid room , hung with portraits of the royal ancestors .', 'There was Cinderella , the grandmother of the reigning monarch , with her little foot in her glass slipper thrust out before her .', 'There was the Marquis de Carabas , who , as everyone knows , was raised to the throne as prince consort after his marriage with the daughter of the king of the period .', 'On the arm of the throne was seated his celebrated cat , wearing boots .', 'There , too , was a portrait of a beautiful lady , sound asleep : this was Madame La Belle au Bois-dormant , also an ancestress of the royal family .', 'Many other pictures of celebrated persons were hanging on the walls .', "`` You have asked all the right people , my dear ? ''", 'said the king .', "`` Everyone who should be asked , '' answered the queen .", "`` People are so touchy on these occasions , '' said his majesty .", "`` You have not forgotten any of our aunts ? ''", "`` No ; the old cats ! ''", "replied the queen ; for the king 's aunts were old-fashioned , and did not approve of her , and she knew it ."]}
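To illustrate how such an instance might be consumed, the sketch below fills the `XXXXX` placeholder with a candidate and applies a naive heuristic that picks the option occurring most often in the context. The heuristic is only an illustration, not a method endorsed by this card; the helper names `fill_blank` and `most_frequent_option` are hypothetical, and the `example` record keeps only three of the twenty context sentences from the instance above for brevity.

```python
from collections import Counter
from typing import Sequence

BLANK = "XXXXX"  # placeholder as it appears in the example instance


def fill_blank(question: str, candidate: str) -> str:
    """Return the query sentence with the placeholder replaced by `candidate`."""
    return question.replace(BLANK, candidate, 1)


def most_frequent_option(sentences: Sequence[str], options: Sequence[str]) -> str:
    """Pick the option seen most often in the context (ties: first listed)."""
    # The text is already tokenised, so splitting on whitespace is enough.
    counts = Counter(token for s in sentences for token in s.split())
    return max(options, key=lambda option: counts[option])


# Abridged copy of the V instance: only three context sentences are kept.
example = {
    "answer": "said",
    "options": ["christening", "existed", "hear", "knows", "read",
                "remarked", "said", "sitting", "talking", "wearing"],
    "question": "`` They are very kind old ladies in their way , '' XXXXX "
                "the king ; `` and were nice to me when I was a boy . ''",
    "sentences": [
        "`` You have asked all the right people , my dear ? ''",
        "said the king .",
        "`` People are so touchy on these occasions , '' said his majesty .",
    ],
}

guess = most_frequent_option(example["sentences"], example["options"])
print(fill_blank(example["question"], guess))
print("correct:", guess == example["answer"])
```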

Data Fields

For the raw config, the data fields are:

  • title : a string feature containing the title of the book present in the dataset.
  • content : a string feature containing the content of the book present in the dataset.

For all other configs, the data fields are:

  • sentences : a list of string features containing 20 sentences from a book.
  • question : a string feature containing a question with the blank marked as XXXXX, which is to be filled with one of the options.
  • answer : a string feature containing the answer.
  • options : a list of string features containing the options for the question.
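The field descriptions for the question configs can be turned into a small sanity check. The sketch below is an assumption-laden illustration: `check_question_record` is a hypothetical helper, the `XXXXX` placeholder is taken from the example instance, and the rule that the answer is one of the options is inferred from the description of the task rather than guaranteed by the loader.

```python
BLANK = "XXXXX"  # placeholder as it appears in the example instance
CONTEXT_SIZE = 20  # number of context sentences per question


def check_question_record(record):
    """Return a list of problems; an empty list means the record looks well-formed."""
    problems = []
    for field in ("sentences", "question", "answer", "options"):
        if field not in record:
            problems.append(f"missing field: {field}")
    if problems:
        return problems
    if len(record["sentences"]) != CONTEXT_SIZE:
        problems.append(f"expected {CONTEXT_SIZE} sentences, got {len(record['sentences'])}")
    if BLANK not in record["question"]:
        problems.append("question has no blank placeholder")
    if record["answer"] not in record["options"]:
        problems.append("answer is not among the options")
    return problems


# Synthetic record used only to exercise the checks.
good = {
    "sentences": [f"sentence {i} ." for i in range(20)],
    "question": "XXXXX the king .",
    "answer": "said",
    "options": ["said", "replied"],
}
print(check_question_record(good))
```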

Data Splits

The splits and corresponding sizes are:

config   train    test   validation
raw         98       5            5
V       105825    2500         2000
P       334030    2500         2000
CN      120769    2500         2000
NE      108719    2500         2000

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

Children's Book Authors

Annotations

Annotation process

From the homepage:

After allocating books to either training, validation or test sets, we formed example ‘questions’ from chapters in the book by enumerating 21 consecutive sentences. In each question, the first 20 sentences form the context, and a word is removed from the 21st sentence, which becomes the query. Models must identify the answer word among a selection of 10 candidate answers appearing in the context sentences and the query. For finer-grained analyses, we evaluated four classes of question by removing distinct types of word: Named Entities, (Common) Nouns, Verbs and Prepositions.
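The construction described in this passage can be sketched as follows. This is a simplified illustration, not the authors' pipeline: `make_question` is a hypothetical helper, and the choice of which word to remove and which candidates to offer (done by word class in the original work) is left to the caller.

```python
from typing import Dict, Sequence

BLANK = "XXXXX"  # placeholder as it appears in the example instance
CONTEXT_SIZE = 20  # the first 20 sentences form the context


def make_question(window: Sequence[str], answer_index: int,
                  candidates: Sequence[str]) -> Dict[str, object]:
    """Turn 21 consecutive tokenised sentences into one cloze question.

    `answer_index` is the position of the token to remove from the 21st
    sentence; `candidates` is a caller-supplied answer set (an assumption,
    since candidate selection is not specified here).
    """
    if len(window) != CONTEXT_SIZE + 1:
        raise ValueError("expected exactly 21 sentences")
    tokens = window[-1].split()
    answer = tokens[answer_index]
    if answer not in candidates:
        raise ValueError("the removed word must be among the candidates")
    tokens[answer_index] = BLANK  # the 21st sentence becomes the query
    return {
        "sentences": list(window[:CONTEXT_SIZE]),
        "question": " ".join(tokens),
        "answer": answer,
        "options": list(candidates),
    }


# Synthetic window used only to show the shape of the output.
window = [f"sentence {i} ." for i in range(20)] + ["said the king ."]
q = make_question(window, 0, ["said", "replied"])
print(q["question"])
```

The returned dictionary uses the same four field names as the question configs, so it can be passed to any code written against those records.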

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

GNU Free Documentation License v1.3

Citation Information

@misc{hill2016goldilocks,
      title={The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations}, 
      author={Felix Hill and Antoine Bordes and Sumit Chopra and Jason Weston},
      year={2016},
      eprint={1511.02301},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributions

Thanks to @gchhablani for adding this dataset.