Dataset: narrativeqa_manual
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: found
Annotations creators: crowdsourced
Source datasets: original
Preprint: arxiv:1712.07040
License: apache-2.0
Language: en
Tasks: Text2Text Generation
Sub-tasks: abstractive-qa

NarrativeQA Manual is an English-language dataset of stories and corresponding questions designed to test reading comprehension, especially on long documents.

THIS DATASET REQUIRES A MANUALLY DOWNLOADED FILE! The script in the original repository downloads the stories from their original URLs every time it runs, and those links are sometimes broken or invalid. You therefore need to download the stories manually, using the script provided by the authors (https://github.com/deepmind/narrativeqa/blob/master/download_stories.sh). Running the shell script creates a folder named "tmp" in the root directory and downloads the stories there. Pass this folder to the loader to load the dataset: datasets.load_dataset("narrativeqa_manual", data_dir="<path/to/folder>").
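The loading step can be sketched as follows, assuming the stories have already been downloaded into a local folder. The helper names `check_story_dir` and `load_narrativeqa_manual` are illustrative and not part of the dataset's loader; the only call taken from this card is `datasets.load_dataset("narrativeqa_manual", data_dir=...)`. Whether that call works may depend on the installed version of the `datasets` library.

```python
from pathlib import Path


def check_story_dir(data_dir):
    """Return the folder as a Path, or raise if it is missing or empty."""
    path = Path(data_dir).expanduser()
    if not path.is_dir():
        raise FileNotFoundError(f"Story folder not found: {path}")
    if not any(path.iterdir()):
        raise ValueError(f"Story folder is empty: {path}")
    return path


def load_narrativeqa_manual(data_dir):
    """Load the dataset from a manually downloaded story folder (sketch)."""
    # Imported lazily so the folder check can be used without the library.
    import datasets

    path = check_story_dir(data_dir)
    # This is the call given in the dataset card.
    return datasets.load_dataset("narrativeqa_manual", data_dir=str(path))
```

Calling `load_narrativeqa_manual("<path/to/folder>")` fails early with a readable error if the download step was skipped, rather than partway through loading.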
The dataset is used to test reading comprehension. The paper proposes two tasks, "summaries only" and "stories only", which differ in whether the human-generated summary or the full story text is used to answer the question.
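As a rough illustration of the two settings, the sketch below picks the text a model is allowed to read for each task. The task labels `"summaries_only"` and `"stories_only"` and the function name `select_context` are hypothetical; the field names follow the example record shown in this card.

```python
def select_context(example, task):
    """Return the text a model may read for the given task setting."""
    document = example["document"]
    if task == "summaries_only":
        # Answer from the human-written summary.
        return document["summary"]["text"]
    if task == "stories_only":
        # Answer from the full book or movie script.
        return document["text"]
    raise ValueError(f"Unknown task: {task!r}")


# Toy record with only the fields this function needs.
example = {
    "document": {
        "summary": {"text": "Joe Bloggs begins his journey exploring..."},
        "text": "MOVIE screenplay by John Doe\nSCENE 1...",
    },
    "question": {"text": "Where does Joe Bloggs live?"},
}

print(select_context(example, "summaries_only"))
```

The question and the reference answers are the same in both settings; only the context changes.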
English
A typical data point consists of a question-answer pair along with a summary or story that can be used to answer the question. Additional information, such as the URL, word count, and Wikipedia page, is also provided.
A typical example looks like this:
    {
      "document": {
        "id": "23jncj2n3534563110",
        "kind": "movie",
        "url": "https://www.imsdb.com/Movie%20Scripts/Name%20of%20Movie.html",
        "file_size": 80473,
        "word_count": 41000,
        "start": "MOVIE screenplay by",
        "end": ". THE END",
        "summary": {
          "text": "Joe Bloggs begins his journey exploring...",
          "tokens": ["Joe", "Bloggs", "begins", "his", "journey", "exploring",...],
          "url": "http://en.wikipedia.org/wiki/Name_of_Movie",
          "title": "Name of Movie (film)"
        },
        "text": "MOVIE screenplay by John Doe\nSCENE 1..."
      },
      "question": {
        "text": "Where does Joe Bloggs live?",
        "tokens": ["Where", "does", "Joe", "Bloggs", "live", "?"]
      },
      "answers": [
        {"text": "At home", "tokens": ["At", "home"]},
        {"text": "His house", "tokens": ["His", "house"]}
      ]
    }
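For evaluation it is often convenient to flatten a record like this into one question with its list of reference answers. The sketch below assumes the nested layout of the example record; the function name `to_qa_row` and the output keys are illustrative choices, not part of the dataset.

```python
def to_qa_row(example):
    """Flatten one record into a question and its reference answers."""
    return {
        "story_id": example["document"]["id"],
        "kind": example["document"]["kind"],
        "question": example["question"]["text"],
        # Each question comes with several reference answers.
        "references": [answer["text"] for answer in example["answers"]],
    }


# Toy record that mirrors the fields of the example record.
record = {
    "document": {"id": "23jncj2n3534563110", "kind": "movie"},
    "question": {
        "text": "Where does Joe Bloggs live?",
        "tokens": ["Where", "does", "Joe", "Bloggs", "live", "?"],
    },
    "answers": [
        {"text": "At home", "tokens": ["At", "home"]},
        {"text": "His house", "tokens": ["His", "house"]},
    ],
}

row = to_qa_row(record)
print(row["question"], row["references"])
```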
The data is split into training, validation, and test sets by story, so the same story cannot appear in more than one split. The number of question-answer pairs in each split is:
| Train | Valid | Test |
|---|---|---|
| 32747 | 3461 | 10557 |
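The split sizes and the story-level separation can be checked with a few lines of Python. This is a minimal sketch: the counts are copied from the table, while the helper names `split_shares` and `shared_story_ids` are hypothetical, and the second helper assumes each example carries its story identifier under `example["document"]["id"]`, as in the example record.

```python
SPLIT_SIZES = {"train": 32747, "validation": 3461, "test": 10557}


def split_shares(sizes):
    """Return each split's share of the total, in percent."""
    total = sum(sizes.values())
    return {name: 100.0 * count / total for name, count in sizes.items()}


def shared_story_ids(splits):
    """Return story ids that occur in more than one split (ideally none)."""
    first_seen = {}
    shared = set()
    for split_name, examples in splits.items():
        for example in examples:
            story_id = example["document"]["id"]
            # Remember the first split in which this story was seen.
            if first_seen.setdefault(story_id, split_name) != split_name:
                shared.add(story_id)
    return shared


shares = split_shares(SPLIT_SIZES)
for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
```

The three splits sum to 46765 examples, of which about 70.0% are training, 7.4% validation, and 22.6% test. On a correctly split dataset, `shared_story_ids` returns an empty set.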
Stories and movie scripts were downloaded from Project Gutenberg and a range of movie script repositories (mainly imsdb).
Who are the source language producers?

The language producers are the authors of the stories and scripts, as well as the Amazon Mechanical Turk workers who wrote the questions.
Amazon Mechanical Turk workers were provided with human-written summaries of the stories, both to make the annotation tractable and to lead annotators towards asking non-localized questions. Stories were matched with plot summaries from Wikipedia using titles, and the matches were verified with help from human annotators. The annotators were asked to determine whether both the story and the summary refer to a movie or a book (as some books are made into movies), or whether they are the same part in a series produced in the same year.

Annotators on Amazon Mechanical Turk were instructed to write 10 question-answer pairs each, based solely on a given summary. They were instructed to imagine that they were writing questions to test students who had read the full stories but not the summaries. We required questions that are specific enough, given the length and complexity of the narratives, and asked for a diverse set of questions about characters, events, why this happened, and so on. Annotators were encouraged to use their own words, and we prevented them from copying. We asked for answers that are grammatical, complete sentences, and explicitly allowed short answers (one word, a few-word phrase, or a short sentence), as we think that answering with a full sentence is frequently perceived as artificial when asking about factual information. Annotators were asked to avoid extra, unnecessary information in the question or the answer, and to avoid yes/no questions or questions about the author or the actors.
Who are the annotators?

Amazon Mechanical Turk workers.
None
The dataset is released under an Apache-2.0 license.
    @article{narrativeqa,
      author = {Tom\'a\v s Ko\v cisk\'y and Jonathan Schwarz and Phil Blunsom and Chris Dyer and Karl Moritz Hermann and G\'abor Melis and Edward Grefenstette},
      title = {The {NarrativeQA} Reading Comprehension Challenge},
      journal = {Transactions of the Association for Computational Linguistics},
      url = {https://TBD},
      volume = {TBD},
      year = {2018},
      pages = {TBD},
    }
Thanks to @rsanjaykamath for adding this dataset.