Datasets:

wiki_movies

Tasks:

Question Answering

Languages:

en

Multilinguality:

monolingual

Size Categories:

100K<n<1M

Language Creators:

crowdsourced

Annotations Creators:

crowdsourced

Source Datasets:

original

ArXiv:

arxiv:1606.03126

License:

cc-by-3.0

Dataset Card for WikiMovies

Dataset Summary

The WikiMovies dataset consists of roughly 100k (templated) questions over 75k entities, based on questions with answers in the Open Movie Database (OMDb). It is the QA part of the Movie Dialog dataset.

Supported Tasks and Leaderboards

  • Question Answering

Languages

The text in the dataset is written in English.

Dataset Structure

Data Instances

The raw data consists of question-answer pairs separated by a tab. Here are three examples:

1 what does Grégoire Colin appear in?	Before the Rain
1 Joe Thomas appears in which movies?	The Inbetweeners Movie, The Inbetweeners 2
1 what films did Michelle Trachtenberg star in?	Inspector Gadget, Black Christmas, Ice Princess, Harriet the Spy, The Scribbler

The purpose of the leading 1 on each line is unclear; it has been removed in the Dataset object.
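As a rough illustration of the raw format described above, the following sketch parses one line into a question/answer record. The function name `parse_raw_line` is hypothetical, and the code assumes every line has a single leading numeric token, a space, the question, a tab, and then the answer text; it is not the loader actually used to build the Dataset object.

```python
def parse_raw_line(line: str) -> dict:
    """Parse one raw WikiMovies line into a question/answer record.

    Assumed layout (taken from the examples above, not from a spec):
    "<number> <question>\t<answer text>"
    """
    # Split on the first tab, which separates the question from the answer.
    left, answer = line.rstrip("\n").split("\t", 1)
    # Drop the leading numeric token (the "1" in the examples above).
    _, question = left.split(" ", 1)
    return {"question": question, "answer": answer}


example = "1 what does Grégoire Colin appear in?\tBefore the Rain"
record = parse_raw_line(example)
print(record)
```

Lines that do not contain a tab would raise a `ValueError` here, which may be preferable to silently skipping malformed input.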

Data Fields

Here is an example of the raw data as ingested by Datasets:

{
'answer': 'Before the Rain', 
'question': 'what does Grégoire Colin appear in?'
}

  • answer: a string containing the answer to the corresponding question.
  • question: a string containing the relevant question.
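Because the answer field is a single string, questions with several correct movies carry them as one comma-separated value (as in the raw examples earlier in this card). A minimal sketch for turning that string into a list follows; the helper name `split_answers` is hypothetical, and splitting on ", " is an assumption that would mis-split any title that itself contains a comma followed by a space.

```python
from typing import List


def split_answers(answer: str) -> List[str]:
    """Split a comma-separated answer string into individual answers.

    Assumption: answers are joined with ", " and no single answer
    contains that delimiter.
    """
    return [part.strip() for part in answer.split(", ") if part.strip()]


titles = split_answers("The Inbetweeners Movie, The Inbetweeners 2")
print(titles)
```

A single-answer string such as 'Before the Rain' comes back as a one-element list, so downstream code can treat both cases uniformly.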

Data Splits

The data is split into train, test, and dev sets. The split sizes are as follows:

wiki-entities_qa_* n examples
train.txt 96185
dev.txt 10000
test.txt 9952
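To put the split sizes above in proportion, the short sketch below tallies the total and each split's share. The counts are copied from the table; the variable and function names are illustrative only.

```python
# Example counts per split, copied from the table above.
split_sizes = {"train": 96185, "dev": 10000, "test": 9952}


def split_fractions(sizes: dict) -> dict:
    """Return each split's share of the total number of examples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


total_examples = sum(split_sizes.values())
fractions = split_fractions(split_sizes)

print(f"total examples: {total_examples}")
for name, share in fractions.items():
    print(f"{name}: {share:.1%}")
```

The total comes to 116,137 question-answer pairs, with the large majority in the training split.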

Dataset Creation

Curation Rationale

WikiMovies was built with the following goals in mind: (i) machine learning techniques should have ample training examples for learning; and (ii) one can easily analyze the performance of different representations of knowledge and break down the results by question type. The dataset can be downloaded from http://fb.ai/babi

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

@misc{miller2016keyvalue,
      title={Key-Value Memory Networks for Directly Reading Documents},
      author={Alexander Miller and Adam Fisch and Jesse Dodge and Amir-Hossein Karimi and Antoine Bordes and Jason Weston},
      year={2016},
      eprint={1606.03126},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributions

Thanks to @aclifton314 for adding this dataset.