Dataset: piqa
Tasks: question answering
Sub-tasks: multiple-choice-qa
Languages: en
Multilinguality: monolingual
Size: 10K<n<100K
Annotation creators: crowdsourced
Source datasets: original
License: unknown
To apply eyeshadow without a brush, should I use a cotton swab or a toothpick? Questions requiring this kind of physical commonsense pose a challenge to state-of-the-art natural language understanding systems. PIQA (Physical Interaction: Question Answering) introduces the task of physical commonsense reasoning along with a corresponding benchmark dataset.
Physical commonsense knowledge is a major challenge on the road to true AI-completeness, including robots that interact with the world and understand natural language.
PIQA focuses on everyday situations with a preference for atypical solutions. The dataset is inspired by instructables.com, which provides users with instructions on how to build, craft, bake, or manipulate objects using everyday materials.
The underlying task is formulated as multiple-choice question answering: given a question q and two possible solutions s1 and s2, a model or a human must choose the more appropriate solution. Exactly one of the two is correct.
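The two-way choice described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: `choose_solution`, `accuracy`, and `length_score` are hypothetical names, the scorer is a deliberately naive stand-in for a real model, and the two toy examples are made up for illustration rather than taken from the dataset.

```python
from typing import Callable, Dict, List

Scorer = Callable[[str, str], float]

def choose_solution(goal: str, sol1: str, sol2: str, score: Scorer) -> int:
    """Return 0 if sol1 scores at least as high as sol2 for the goal, else 1."""
    return 0 if score(goal, sol1) >= score(goal, sol2) else 1

def accuracy(examples: List[Dict], score: Scorer) -> float:
    """Fraction of examples where the chosen index matches the gold label."""
    hits = sum(
        choose_solution(ex["goal"], ex["sol1"], ex["sol2"], score) == ex["label"]
        for ex in examples
    )
    return hits / len(examples)

def length_score(goal: str, solution: str) -> float:
    # Hypothetical stand-in for a model: simply prefers shorter solutions.
    return -len(solution)

# Toy examples written for this sketch (NOT from PIQA).
toy_examples = [
    {"goal": "open a jar", "sol1": "twist the lid",
     "sol2": "hit the jar with a hammer until it opens", "label": 0},
    {"goal": "cool a drink quickly", "sol1": "put it in the oven",
     "sol2": "put it in ice water with salt", "label": 1},
]

print(accuracy(toy_examples, length_score))  # 0.5: the naive scorer gets one of two right
```

Any real system would replace `length_score` with a model that scores a goal and solution pair; the selection and accuracy logic stays the same.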
The text in the dataset is in English. The associated BCP-47 code is en.
An example looks like this:
{
  "goal": "How do I ready a guinea pig cage for it's new occupants?",
  "sol1": "Provide the guinea pig with a cage full of a few inches of bedding made of ripped paper strips, you will also need to supply it with a water bottle and a food dish.",
  "sol2": "Provide the guinea pig with a cage full of a few inches of bedding made of ripped jeans material, you will also need to supply it with a water bottle and a food dish.",
  "label": 0
}
Note that the test set contains no labels. Predictions need to be submitted to the leaderboard.
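As a small illustration of how such a record might be consumed, the sketch below returns the correct solution text for a labeled example and `None` for an unlabeled one. It assumes that label 0 refers to sol1 and label 1 to sol2, which matches the example above. How unlabeled test rows are actually encoded is not stated here, so treating any label other than 0 or 1 (for instance `None` or -1) as "no label" is an assumption, and `correct_solution` is a hypothetical helper name.

```python
from typing import Optional

def correct_solution(example: dict) -> Optional[str]:
    """Return the text of the correct solution, or None if the row is unlabeled."""
    label = example.get("label")
    if label not in (0, 1):
        # Assumption: unlabeled test rows carry a missing or out-of-range label.
        return None
    return example["sol1"] if label == 0 else example["sol2"]

example = {
    "goal": "How do I ready a guinea pig cage for it's new occupants?",
    "sol1": "Provide the guinea pig with a cage full of a few inches of bedding "
            "made of ripped paper strips, you will also need to supply it with "
            "a water bottle and a food dish.",
    "sol2": "Provide the guinea pig with a cage full of a few inches of bedding "
            "made of ripped jeans material, you will also need to supply it with "
            "a water bottle and a food dish.",
    "label": 0,
}

# Hypothetical unlabeled row, as might come from the test split.
unlabeled = {"goal": example["goal"], "sol1": example["sol1"],
             "sol2": example["sol2"], "label": -1}

print(correct_solution(example))
print(correct_solution(unlabeled))
```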
The dataset contains roughly 16,000 examples for training, 2,000 for development, and 3,000 for testing.
The goal of the dataset is to construct a resource that requires concrete physical reasoning.
The authors provide annotators with a prompt derived from instructables.com. The Instructables website is a crowdsourced collection of instructions for doing everything from cooking to car repair. In most cases, users provide images or videos detailing each step and a list of tools that will be required. Most goals are simultaneously rare and unsurprising. While an annotator is unlikely to have built a UV-fluorescent steampunk lamp or made a backpack out of duct tape, it is not surprising that someone interested in home crafting would create these, nor will the tools and materials be unfamiliar to the average person. First, using these examples as the seed for annotation helps remind annotators about the less prototypical uses of everyday objects. Second, and equally important, instructions build on one another. This means that any QA pair inspired by an instructable is more likely to explicitly state assumptions about what preconditions need to be met to start the task and what postconditions define success.
Annotators were asked to glance at the instructions of an instructable and either extract two component tasks from it or use it as inspiration to construct them. They would then articulate the goal (often centered on atypical materials) and how to achieve it. In addition, annotators were asked to provide a permutation of their own solution that makes it invalid (the negative solution), often subtly.
Initial Data Collection and Normalization
During validation, examples with low agreement were removed from the data.
The dataset is further cleaned with the AFLite algorithm introduced in (Sakaguchi et al. 2020; Sap et al. 2019), which is an improvement on adversarial filtering (Zellers et al. 2018). This step removes stylistic artifacts and trivial examples, which have been shown to artificially inflate model performance on previous NLI benchmarks.
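The general idea behind this kind of filtering can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `aflite_sketch` and the parameters `n_models`, `train_size`, `cutoff`, and `tau` are hypothetical, a ridge-regression classifier stands in for whatever linear model was actually used, and a synthetic feature matrix stands in for precomputed example embeddings. Each round trains several linear classifiers on random subsets, scores every held-out instance by how often it is predicted correctly, and removes the most predictable instances.

```python
import numpy as np

def aflite_sketch(X, y, n_models=20, train_size=50, cutoff=30, tau=0.75, seed=0):
    """Return indices of instances kept after a simplified AFLite-style filter."""
    rng = np.random.default_rng(seed)
    keep = np.arange(len(y))
    while len(keep) > train_size:
        correct = np.zeros(len(keep))
        counted = np.zeros(len(keep))
        for _ in range(n_models):
            perm = rng.permutation(len(keep))
            tr, ho = perm[:train_size], perm[train_size:]
            # Stand-in linear classifier: ridge regression onto +/-1 targets.
            A = np.hstack([X[keep[tr]], np.ones((train_size, 1))])
            t = 2.0 * y[keep[tr]] - 1.0
            w = np.linalg.solve(A.T @ A + 1e-3 * np.eye(A.shape[1]), A.T @ t)
            B = np.hstack([X[keep[ho]], np.ones((len(ho), 1))])
            pred = (B @ w > 0).astype(int)
            correct[ho] += pred == y[keep[ho]]
            counted[ho] += 1
        # Predictability score: share of held-out predictions that were correct.
        score = np.divide(correct, counted, out=np.zeros_like(correct),
                          where=counted > 0)
        candidates = np.flatnonzero(score >= tau)
        # Remove at most `cutoff` of the most predictable instances.
        top = candidates[np.argsort(-score[candidates], kind="stable")[:cutoff]]
        keep = np.delete(keep, top)
        if len(top) < cutoff:
            break
    return keep

# Synthetic data (an assumption for illustration): the first 100 rows leak the
# label through feature 0, mimicking an annotation artifact.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 5))
X[:100, 0] = (2 * y[:100] - 1) * 3.0

kept = aflite_sketch(X, y)
print(len(kept), "of", len(y), "instances kept")
```

On the synthetic data, instances whose label is trivially recoverable from a single feature are the ones most likely to be filtered out, which mirrors the goal of removing examples a model could solve from surface artifacts alone.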
Who are the source language producers?
[More Information Needed]
By construction, annotations are obtained when crowdworkers complete the prompt.
Who are the annotators?
Paid crowdworkers
Licensing Information
Unknown
@inproceedings{Bisk2020,
  author = {Yonatan Bisk and Rowan Zellers and Ronan Le Bras and Jianfeng Gao and Yejin Choi},
  title = {PIQA: Reasoning about Physical Commonsense in Natural Language},
  booktitle = {Thirty-Fourth AAAI Conference on Artificial Intelligence},
  year = {2020},
}
Thanks to @VictorSanh for adding this dataset.