Dataset: eli5_category
License: unknown
Source datasets: extended|eli5
Annotations creators: found
Size: 100K<n<1M
Language creators: found
Multilinguality: monolingual
Language: en
Tasks: text-to-text generation

The ELI5-Category dataset is a smaller but newer and categorized version of the original ELI5 dataset. It is an English-language dataset of questions and answers gathered from the r/explainlikeimfive subreddit, where users ask factual questions requiring paragraph-length or longer answers. After 2017, a tagging system was introduced to this subreddit so that questions can be categorized into different topics according to their tags. Because the training and validation sets are built from questions in different topics, the dataset is expected to alleviate the train/validation overlap issue in the original ELI5 dataset.
The text in the dataset is in English, as spoken by Reddit users on the r/explainlikeimfive subreddit. The associated BCP-47 code is en.
The structure of this dataset is very similar to that of the original ELI5 dataset. A typical data point comprises a question, with a title containing the main question and a selftext that sometimes elaborates on it, and a list of answers from the forum sorted by the scores they obtained. Additionally, the URLs in each of the text fields have been extracted to respective lists and replaced by generic tokens in the text. Unlike the original ELI5 dataset, each data point also has a category field. There are 11 common values of category in this dataset: Biology, Chemistry, Culture, Earth Science, Economics, Engineering, Mathematics, Other, Physics, Psychology, and Technology. A special category, Repost, indicates that the same question has been asked before.
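The URL handling described above can be illustrated with a minimal sketch. The regular expression and the `_URL_n_` token format are assumptions made for illustration; the text does not specify the pattern or the exact token the authors used.

```python
import re

# Assumption: a simple pattern for http(s) URLs; the authors' pattern is not given.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(text):
    """Replace each URL in `text` with a placeholder token.

    Returns the rewritten text and the list of extracted URLs.
    """
    urls = []

    def _replace(match):
        url = match.group(0)
        if url not in urls:
            urls.append(url)
        # Hypothetical token format: the index of the URL in the extracted list.
        return f"_URL_{urls.index(url)}_"

    return URL_PATTERN.sub(_replace, text), urls


cleaned, urls = extract_urls(
    "See https://example.com/a and https://example.com/b for details."
)
```

Here `cleaned` holds the text with placeholders and `urls` plays the role of a field such as `title_urls` or `selftext_urls`.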
An example from the ELI5-Category set looks as follows:
{'q_id': '5lcm18', 'title': 'Why do old games running on new hardware still have technical issues ?', 'selftext': 'I am playing some mega man games on my Xbox One and experience slowdown when there are a lot of enemies on screen . but the Xbox One is significantly more powerful than the NES , so why is there still slowdown on this hardware ?', 'category': 'Engineering', 'subreddit': 'explainlikeimfive', 'answers': {'a_id': ['dbuo48e', 'dbusfve'], 'text': ["The XBox is emulating NES hardware and running the emulation at a set speed . If it ran it at as fast as possible , then it would be several times faster than the original NES game and would be unplayable . I ca n't speak for Mega Man exactly , but older games tended to run on a cycle locked to the screen refresh which was a fixed 60Hz or 50Hz . There was only one piece of hardware they ran on , so there was no need to adjust for different hardware speeds .", "In that case , it 's probably on purpose - they want to emulate the experience as closely as possible , even including the slowdown and sprite flickering . Some emulators let you turn it off , but it 's usually turned on by default . In other cases , like if you 're trying to emulate PS2 games on your PC , the game might just run really slow in general . Even though your PC is way more powerful than a PS2 , it has to \" translate \" from PS2 language to PC language in realtime , which is much more difficult than running PS2 code on the PS2 itself ."], 'score': [13, 3], 'text_urls': [[],[]]}, 'title_urls': {'url': []}, 'selftext_urls': {'url': []}}
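Because the answers are stored as parallel lists (`a_id`, `text`, `score`), reading one answer means indexing all three lists together. The following is a minimal sketch; `top_answer` is a hypothetical helper, and the record is an abbreviated copy of the example above with the answer texts truncated for brevity.

```python
def top_answer(record):
    """Return (a_id, text, score) for the highest-scoring answer of a record."""
    answers = record["answers"]
    # Zip the parallel lists so each answer becomes one (id, text, score) tuple.
    triples = zip(answers["a_id"], answers["text"], answers["score"])
    return max(triples, key=lambda triple: triple[2])


# Abbreviated record using the same field names as the example above.
record = {
    "q_id": "5lcm18",
    "category": "Engineering",
    "answers": {
        "a_id": ["dbuo48e", "dbusfve"],
        "text": ["The XBox is emulating NES hardware ...", "In that case ..."],
        "score": [13, 3],
    },
}

best_id, best_text, best_score = top_answer(record)
```

Using `max` rather than taking the first element avoids relying on the stored sort order, at the cost of a linear scan.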
In order to avoid having duplicate questions across sets, three non-overlapping subsets of categories are used for the training, validation, and test sets. In addition, a special validation set contains all the questions in the Repost category. A valid retriever-generator model should perform consistently on both validation sets. The final split sizes are as follows:
Category | Train | Valid | Valid2 | Test
---|---|---|---|---
Biology | 32769 | | |
Chemistry | 6633 | | |
Culture | | 5446 | |
Earth Science | 677 | | |
Economics | 5901 | | |
Engineering | | | | 5411
Mathematics | 1912 | | |
Other | 19312 | | |
Physics | 10196 | | |
Psychology | 338 | | |
Technology | 14034 | | |
Repost | | | 2375 |
Total | 91772 | 5446 | 2375 | 5411
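The category-based split can be sketched as a lookup from category to split. The assignment below is inferred from the totals in the table (the Valid total equals the Culture count, the Test total equals the Engineering count, and the Valid2 total equals the Repost count); the split labels and the `split_for` helper are illustrative names, not part of the dataset.

```python
# Per-category question counts, copied from the table above.
CATEGORY_COUNTS = {
    "Biology": 32769, "Chemistry": 6633, "Culture": 5446,
    "Earth Science": 677, "Economics": 5901, "Engineering": 5411,
    "Mathematics": 1912, "Other": 19312, "Physics": 10196,
    "Psychology": 338, "Technology": 14034, "Repost": 2375,
}

# Assumption: categories with a dedicated split, inferred from the table totals.
SPECIAL_SPLITS = {"Culture": "valid", "Repost": "valid2", "Engineering": "test"}


def split_for(category):
    """Return the split a question belongs to, based only on its category."""
    if category not in CATEGORY_COUNTS:
        raise ValueError(f"unknown category: {category!r}")
    return SPECIAL_SPLITS.get(category, "train")


# Recompute the split sizes from the per-category counts.
split_sizes = {}
for category, count in CATEGORY_COUNTS.items():
    split = split_for(category)
    split_sizes[split] = split_sizes.get(split, 0) + count
```

Summing the counts this way reproduces the Total row, which is a quick consistency check on the inferred assignment.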
ELI5-Category was built to provide a testbed for machines to learn how to answer more complex questions, which requires them to find and combine information in a coherent manner. The dataset was built by gathering questions that were asked by community members of three subreddits, including r/explainlikeimfive, along with the answers that were provided by other users. The rules of the subreddit make this data particularly well suited to training a model for abstractive question answering: the questions need to seek an objective explanation about well-established facts, and the answers provided need to be understandable to a layperson without any particular domain knowledge.
The data was obtained by filtering submissions and comments from the subreddits of interest from the XML dumps of the Reddit forum hosted on Pushshift.io.
In order to further improve the quality of the selected examples, only questions with a score of at least 2 and at least one answer with a score of at least 2 were selected for the dataset. The dataset questions and answers span a period from January 2017 to June 2021.
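The score filter described above amounts to a simple predicate. This is a minimal sketch, assuming the raw submission carries its own score and the scores of its answers; the function name and parameters are hypothetical.

```python
MIN_SCORE = 2  # threshold stated in the text for both questions and answers


def keep_question(question_score, answer_scores, min_score=MIN_SCORE):
    """Keep a question if it and at least one of its answers reach min_score."""
    has_good_answer = any(score >= min_score for score in answer_scores)
    return question_score >= min_score and has_good_answer


kept = keep_question(5, [1, 3])      # question and one answer pass
dropped = keep_question(2, [1, 1])   # no answer reaches the threshold
```

A question with no answers is dropped as well, since `any` over an empty list is false.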
Who are the source language producers?
The language producers are users of the r/explainlikeimfive subreddit between 2017 and 2021. No further demographic information was available from the data source.
The dataset contains the category as an additional annotation for the topics of questions.
Annotation process
The dataset is auto-annotated by the tags of posts in the Reddit forum.
Who are the annotators?
The annotators are users/administrators of the r/explainlikeimfive subreddit between 2017 and 2021. No further demographic information was available from the data source.
The authors removed the speaker IDs from the Pushshift.io dumps but did not otherwise anonymize the data. Some questions and answers are about contemporary public figures or individuals who appeared in the news.
The dataset has a similar social impact to the original ELI5 dataset; see the "Social Impact of Dataset" section of its dataset card.
The dataset has similar bias considerations to the original ELI5 dataset; see the "Discussion of Biases" section of its dataset card.
The dataset has similar limitations to the original ELI5 dataset; see the "Other Known Limitations" section of its dataset card.
The dataset was initially created by Jingsong Gao, Qinren Zhou, and Rui Qiu during a course project for ANLY 580: NLP for Data Analytics at Georgetown University.
The licensing status of the dataset hinges on the legal status of the Pushshift.io data, which is unclear.
@inproceedings{eli5-category,
  author = {Jingsong Gao and Qinren Zhou and Rui Qiu},
  title  = {{ELI5-Category:} A categorized open-domain QA dataset},
  year   = {2021}
}
Thanks to @jingshenSN2, @QinrenZhou, and @rexarski for adding this dataset.