数据集:
eli5
任务:
文生文语言:
en计算机处理:
monolingual大小:
100K<n<1M语言创建人:
found批注创建人:
no-annotation源数据集:
original许可:
license:unknownThe ELI5 dataset is an English-language dataset of questions and answers gathered from three subreddits where users ask factual questions requiring paragraph-length or longer answers. The dataset was created to support the task of open-domain long form abstractive question answering, and covers questions about general topics in its r/explainlikeimfive subset, science in it r/askscience subset, and History in its r/AskHistorians subset.
The text in the dataset is in English, as spoken by Reddit users on the r/explainlikeimfive , r/askscience , and r/AskHistorians subreddits. The associated BCP-47 code is en .
A typical data point comprises a question, with a title containing the main question and a selftext which sometimes elaborates on it, and a list of answers from the forum sorted by the number of upvotes they obtained. Additionally, the URLs in each of the text fields have been extracted to respective lists and replaced by generic tokens in the text.
An example from the ELI5 test set looks as follows:
{'q_id': '8houtx', 'title': 'Why does water heated to room temperature feel colder than the air around it?', 'selftext': '', 'document': '', 'subreddit': 'explainlikeimfive', 'answers': {'a_id': ['dylcnfk', 'dylcj49'], 'text': ["Water transfers heat more efficiently than air. When something feels cold it's because heat is being transferred from your skin to whatever you're touching. Since water absorbs the heat more readily than air, it feels colder.", "Air isn't as good at transferring heat compared to something like water or steel (sit on a room temperature steel bench vs. a room temperature wooden bench, and the steel one will feel more cold).\n\nWhen you feel cold, what you're feeling is heat being transferred out of you. If there is no breeze, you feel a certain way. If there's a breeze, you will get colder faster (because the moving air is pulling the heat away from you), and if you get into water, its quite good at pulling heat from you. Get out of the water and have a breeze blow on you while you're wet, all of the water starts evaporating, pulling even more heat from you."], 'score': [5, 2]}, 'title_urls': {'url': []}, 'selftext_urls': {'url': []}, 'answers_urls': {'url': []}}
The data is split into a training, validation and test set for each of the three subreddits. In order to avoid having duplicate questions in across sets, the title field of each of the questions were ranked by their tf-idf match to their nearest neighbor and the ones with the smallest value were used in the test and validation sets. The final split sizes are as follow:
Train | Valid | Test | |
---|---|---|---|
r/explainlikeimfive examples | 272634 | 9812 | 24512 |
r/askscience examples | 131778 | 2281 | 4462 |
r/AskHistorians examples | 98525 | 4901 | 9764 |
ELI5 was built to provide a testbed for machines to learn how to answer more complex questions, which requires them to find and combine information in a coherent manner. The dataset was built by gathering questions that were asked by community members of three subreddits, including r/explainlikeimfive , along with the answers that were provided by other users. The rules of the subreddit make this data particularly well suited to training a model for abstractive question answering: the questions need to seek an objective explanation about well established facts, and the answers provided need to be understandable to a layperson without any particular knowledge domain.
The data was obtained by filtering submissions and comments from the subreddits of interest from the XML dumps of the Reddit forum hosted on Pushshift.io .
In order to further improve the quality of the selected examples, only questions with a score of at least 2 and at least one answer with a score of at least 2 were selected for the dataset. The dataset questions and answers span a period form August 2012 to August 2019.
Who are the source language producers?The language producers are users of the r/explainlikeimfive , r/askscience , and r/AskHistorians subreddits between 2012 and 2019. No further demographic information was available from the data source.
The dataset does not contain any additional annotations.
Annotation process[N/A]
Who are the annotators?[N/A]
The authors removed the speaker IDs from the Pushshift.io dumps but did not otherwise anonymize the data. Some of the questions and answers are about contemporary public figures or individuals who appeared in the news.
The purpose of this dataset is to help develop better question answering systems.
A system that succeeds at the supported task would be able to provide a coherent answer to even complex questions requiring a multi-step explanation, which is beyond the ability of even the larger existing models. The task is also thought as a test-bed for retrieval model which can show the users which source text was used in generating the answer and allow them to confirm the information provided to them.
It should be noted however that the provided answers were written by Reddit users, an information which may be lost if models trained on it are deployed in down-stream applications and presented to users without context. The specific biases this may introduce are discussed in the next section.
While Reddit hosts a number of thriving communities with high quality discussions, it is also widely known to have corners where sexism, hate, and harassment are significant issues. See for example the recent post from Reddit founder u/spez outlining some of the ways he thinks the website's historical policies have been responsible for this problem, Adrienne Massanari's 2015 article on GamerGate and follow-up works, or a 2019 Wired article on misogyny on Reddit .
While there has been some recent work in the NLP community on de-biasing models (e.g. Black is to Criminal as Caucasian is to Police: Detecting and Removing Multiclass Bias in Word Embeddings for word embeddings trained specifically on Reddit data), this problem is far from solved, and the likelihood that a trained model might learn the biases present in the data remains a significant concern.
We still note some encouraging signs for all of these communities: r/explainlikeimfive and r/askscience have similar structures and purposes, and r/askscience was found in 2015 to show medium supportiveness and very low toxicity when compared to other subreddits (see a hackerfall post , thecut.com write-up and supporting data ). Meanwhile, the r/AskHistorians rules mention that the admins will not tolerate " racism, sexism, or any other forms of bigotry ". However, further analysis of whether and to what extent these rules reduce toxicity is still needed.
We also note that given the audience of the Reddit website which is more broadly used in the US and Europe, the answers will likely present a Western perspectives, which is particularly important to note when dealing with historical topics.
The answers provided in the dataset are represent the opinion of Reddit users. While these communities strive to be helpful, they should not be considered to represent a ground truth.
The dataset was initially created by Angela Fan, Ethan Perez, Yacine Jernite, Jason Weston, Michael Auli, and David Grangier, during work done at Facebook AI Research (FAIR).
The licensing status of the dataset hinges on the legal status of the Pushshift.io data which is unclear.
@inproceedings{eli5_lfqa, author = {Angela Fan and Yacine Jernite and Ethan Perez and David Grangier and Jason Weston and Michael Auli}, editor = {Anna Korhonen and David R. Traum and Llu{\'{\i}}s M{\`{a}}rquez}, title = {{ELI5:} Long Form Question Answering}, booktitle = {Proceedings of the 57th Conference of the Association for Computational Linguistics, {ACL} 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers}, pages = {3558--3567}, publisher = {Association for Computational Linguistics}, year = {2019}, url = {https://doi.org/10.18653/v1/p19-1346}, doi = {10.18653/v1/p19-1346} }
Thanks to @lewtun , @lhoestq , @mariamabarham , @thomwolf , @yjernite for adding this dataset.