Dataset: lambada
Tasks: Text2Text Generation
Languages: en
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: found
Annotations creators: expert-generated
Source datasets: extended|bookcorpus
License: cc-by-4.0
LAMBADA evaluates the capabilities of computational models for text understanding by means of a word prediction task. LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word. To succeed on LAMBADA, computational models cannot simply rely on local context, but must be able to keep track of information in the broader discourse.
The LAMBADA dataset is extracted from BookCorpus and consists of 10'022 passages, divided into 4'869 development and 5'153 test passages. The training data for language models to be tested on LAMBADA include the full text of 2'662 novels (disjoint from those in dev+test), comprising 203 million words.
Long range dependency evaluated as (last) word prediction
The text in the dataset is in English. The associated BCP-47 code is en.
A data point is a text sequence (passage) including the context, the target sentence (the last one) and the target word. For each passage in the dev and the test splits, the word to be guessed is the last one.
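As a rough illustration of this layout, the sketch below splits a passage into its context and target word and scores a predictor by exact match on the last word. The function names `split_passage` and `last_word_accuracy` are hypothetical helpers, not part of any dataset loader, and the sketch assumes passages are whitespace-tokenized with at least two tokens, as in the example instance.

```python
from typing import Callable, Iterable, Tuple


def split_passage(text: str) -> Tuple[str, str]:
    """Split a whitespace-tokenized passage into (context, target word).

    Assumption: the passage has at least two tokens and the target
    word is the final token.
    """
    context, target = text.strip().rsplit(" ", 1)
    return context, target


def last_word_accuracy(
    passages: Iterable[str], predict: Callable[[str], str]
) -> float:
    """Fraction of passages whose last word is predicted exactly."""
    total = 0
    correct = 0
    for passage in passages:
        context, target = split_passage(passage)
        total += 1
        if predict(context) == target:
            correct += 1
    return correct / total if total else 0.0


passage = (
    "`` let 's go make some grub , '' said bob as he turned to danny . "
    "danny did n't keep his stoic expression , but with a look of "
    "irritation got up and left the room with bob"
)
context, target = split_passage(passage)
print(target)  # the word a model has to predict
```

Here `predict` stands in for any model wrapper that maps a context string to a single predicted word; exact string match is one simple scoring choice and may differ from how a particular paper reports results.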
The training data include the full text of 2'662 novels (disjoint from those in dev+test), comprising more than 200M words. The text comes from the same domain as the dev+test passages but is not filtered in any way.
Each training instance has a category field indicating which sub-category the book was extracted from. This field is not given for the dev and test splits.
An example looks like this:
{"category": "Mystery", "text": "bob could have been called in at this point , but he was n't miffed at his exclusion at all . he was relieved at not being brought into this initial discussion with central command . `` let 's go make some grub , '' said bob as he turned to danny . danny did n't keep his stoic expression , but with a look of irritation got up and left the room with bob"}
The dataset aims at evaluating the ability of language models to hold long-term contextual memories. Instances are extracted from books because they display long-term dependencies. In particular, the data are curated such that the target words are easy to guess by human subjects when they can look at the whole passage they come from, but nearly impossible if only the last sentence is considered.
The corpus was deduplicated, and potentially offensive material was filtered out with a stop word list.
Who are the source language producers?
The passages are extracted from novels in BookCorpus.
The authors required two consecutive subjects (paid crowdsourcers) to exactly match the missing word based on the whole passage (comprising the context and the target sentence), and made sure that no subject (out of ten) was able to provide it based on local context only, even when given 3 guesses.
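The selection rule described above can be sketched as a small predicate. This is only an illustration under stated assumptions: `passes_filter` and its argument layout are hypothetical, guesses are compared by exact string match, and the actual crowdsourcing pipeline is not reproduced here.

```python
from typing import Sequence


def passes_filter(
    target: str,
    full_passage_guesses: Sequence[str],
    local_context_guesses: Sequence[Sequence[str]],
) -> bool:
    """Return True if a passage satisfies the selection rule.

    full_passage_guesses: one guess per subject who saw the whole
        passage, in order; the first two must both match the target.
    local_context_guesses: one list per subject who saw only the last
        sentence (up to ten subjects), each with up to three guesses;
        none of them may contain the target.
    """
    both_match = len(full_passage_guesses) >= 2 and all(
        guess == target for guess in full_passage_guesses[:2]
    )
    nobody_local = all(
        target not in guesses[:3] for guesses in local_context_guesses
    )
    return both_match and nobody_local


# Hypothetical guesses: two full-passage subjects agree on the target,
# and ten local-context subjects all miss it.
keep = passes_filter(
    "bob",
    ["bob", "bob"],
    [["danny", "him", "them"]] * 10,
)
print(keep)
```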
Who are the annotators?
The text is self-annotated but was curated by asking (paid) crowdsourcers to guess the last word.
The dataset is released under the CC BY 4.0 (Creative Commons Attribution 4.0 International) license.
@InProceedings{paperno-EtAl:2016:P16-1,
  author    = {Paperno, Denis and Kruszewski, Germ\'{a}n and Lazaridou, Angeliki and Pham, Ngoc Quan and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fernandez, Raquel},
  title     = {The {LAMBADA} dataset: Word prediction requiring a broad discourse context},
  booktitle = {Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month     = {August},
  year      = {2016},
  address   = {Berlin, Germany},
  publisher = {Association for Computational Linguistics},
  pages     = {1525--1534},
  url       = {http://www.aclweb.org/anthology/P16-1144}
}
Thanks to @VictorSanh for adding this dataset.