数据集:
adversarial_qa
任务:
问答语言:
en计算机处理:
monolingual大小:
10K<n<100K语言创建人:
found批注创建人:
crowdsourced源数据集:
original许可:
cc-by-sa-4.0We have created three new Reading Comprehension datasets constructed using an adversarial model-in-the-loop.
We use three different models; BiDAF (Seo et al., 2016), BERTLarge (Devlin et al., 2018), and RoBERTaLarge (Liu et al., 2019) in the annotation loop and construct three datasets; D(BiDAF), D(BERT), and D(RoBERTa), each with 10,000 training examples, 1,000 validation, and 1,000 test examples.
The adversarial human annotation paradigm ensures that these datasets consist of questions that current state-of-the-art models (at least the ones used as adversaries in the annotation loop) find challenging. The three AdversarialQA round 1 datasets provide a training and evaluation resource for such methods.
extractive-qa : The dataset can be used to train a model for Extractive Question Answering, which consists in selecting the answer to a question from a passage. Success on this task is typically measured by achieving a high word-overlap F1 score . The RoBERTa-Large model trained on all the data combined with SQuAD currently achieves 64.35% F1. This task has an active leaderboard and is available as round 1 of the QA task on Dynabench and ranks models based on F1 score.
The text in the dataset is in English. The associated BCP-47 code is en .
Data is provided in the same format as SQuAD 1.1. An example is shown below:
{ "data": [ { "title": "Oxygen", "paragraphs": [ { "context": "Among the most important classes of organic compounds that contain oxygen are (where \"R\" is an organic group): alcohols (R-OH); ethers (R-O-R); ketones (R-CO-R); aldehydes (R-CO-H); carboxylic acids (R-COOH); esters (R-COO-R); acid anhydrides (R-CO-O-CO-R); and amides (R-C(O)-NR2). There are many important organic solvents that contain oxygen, including: acetone, methanol, ethanol, isopropanol, furan, THF, diethyl ether, dioxane, ethyl acetate, DMF, DMSO, acetic acid, and formic acid. Acetone ((CH3)2CO) and phenol (C6H5OH) are used as feeder materials in the synthesis of many different substances. Other important organic compounds that contain oxygen are: glycerol, formaldehyde, glutaraldehyde, citric acid, acetic anhydride, and acetamide. Epoxides are ethers in which the oxygen atom is part of a ring of three atoms.", "qas": [ { "id": "22bbe104aa72aa9b511dd53237deb11afa14d6e3", "question": "In addition to having oxygen, what do alcohols, ethers and esters have in common, according to the article?", "answers": [ { "answer_start": 36, "text": "organic compounds" } ] }, { "id": "4240a8e708c703796347a3702cf1463eed05584a", "question": "What letter does the abbreviation for acid anhydrides both begin and end in?", "answers": [ { "answer_start": 244, "text": "R" } ] }, { "id": "0681a0a5ec852ec6920d6a30f7ef65dced493366", "question": "Which of the organic compounds, in the article, contains nitrogen?", "answers": [ { "answer_start": 262, "text": "amides" } ] }, { "id": "2990efe1a56ccf81938fa5e18104f7d3803069fb", "question": "Which of the important classes of organic compounds, in the article, has a number in its abbreviation?", "answers": [ { "answer_start": 262, "text": "amides" } ] } ] } ] } ] }
Note that no answers are provided in the test set. Indeed, this dataset is part of the DynaBench benchmark, for which you can submit your predictions on the website .
The dataset is composed of three different datasets constructed using different models in the loop: BiDAF, BERT-Large, and RoBERTa-Large. Each of these has 10,000 training examples, 1,000 validation examples, and 1,000 test examples for a total of 30,000/3,000/3,000 train/validation/test examples.
This dataset was collected to provide a more challenging and diverse Reading Comprehension dataset to state-of-the-art models.
The source passages are from Wikipedia and are the same as those used in SQuAD v1.1 .
Who are the source language producers?The source language produces are Wikipedia editors for the passages, and human annotators on Mechanical Turk for the questions.
The dataset is collected through an adversarial human annotation process which pairs a human annotator and a reading comprehension model in an interactive setting. The human is presented with a passage for which they write a question and highlight the correct answer. The model then tries to answer the question, and, if it fails to answer correctly, the human wins. Otherwise, the human modifies or re-writes their question until the successfully fool the model.
Who are the annotators?The annotators are from Amazon Mechanical Turk, geographically restricted the the USA, UK and Canada, having previously successfully completed at least 1,000 HITs, and having a HIT approval rate greater than 98%. Crowdworkers undergo intensive training and qualification prior to annotation.
No annotator identifying details are provided.
The purpose of this dataset is to help develop better question answering systems.
A system that succeeds at the supported task would be able to provide an accurate extractive answer from a short passage. This dataset is to be seen as a test bed for questions which contemporary state-of-the-art models struggle to answer correctly, thus often requiring more complex comprehension abilities than say detecting phrases explicitly mentioned in the passage with high overlap to the question.
It should be noted, however, that the the source passages are both domain-restricted and linguistically specific, and that provided questions and answers do not constitute any particular social application.
The dataset may exhibit various biases in terms of the source passage selection, annotated questions and answers, as well as algorithmic biases resulting from the adversarial annotation protocol.
N/a
This dataset was initially created by Max Bartolo, Alastair Roberts, Johannes Welbl, Sebastian Riedel, and Pontus Stenetorp, during work carried out at University College London (UCL).
This dataset is distributed under CC BY-SA 3.0 .
@article{bartolo2020beat, author = {Bartolo, Max and Roberts, Alastair and Welbl, Johannes and Riedel, Sebastian and Stenetorp, Pontus}, title = {Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension}, journal = {Transactions of the Association for Computational Linguistics}, volume = {8}, number = {}, pages = {662-678}, year = {2020}, doi = {10.1162/tacl\_a\_00338}, URL = { https://doi.org/10.1162/tacl_a_00338 }, eprint = { https://doi.org/10.1162/tacl_a_00338 }, abstract = { Innovations in annotation methodology have been a catalyst for Reading Comprehension (RC) datasets and models. One recent trend to challenge current RC models is to involve a model in the annotation process: Humans create questions adversarially, such that the model fails to answer them correctly. In this work we investigate this annotation methodology and apply it in three different settings, collecting a total of 36,000 samples with progressively stronger models in the annotation loop. This allows us to explore questions such as the reproducibility of the adversarial effect, transfer from data collected with varying model-in-the-loop strengths, and generalization to data collected without a model. We find that training on adversarially collected samples leads to strong generalization to non-adversarially collected datasets, yet with progressive performance deterioration with increasingly stronger models-in-the-loop. Furthermore, we find that stronger models can still learn from datasets collected with substantially weaker models-in-the-loop. When trained on data collected with a BiDAF model in the loop, RoBERTa achieves 39.9F1 on questions that it cannot answer when trained on SQuAD—only marginally lower than when trained on data collected using RoBERTa itself (41.0F1). } }
Thanks to @maxbartolo for adding this dataset.