Datasets:

GEM/common_gen

Languages:

en

Multilinguality:

unknown

Language Creators:

unknown

Annotations Creators:

none

Source Datasets:

original

Other:

reasoning

License:

mit

Dataset Card for GEM/common_gen

Link to Main Data Card

You can find the main data card on the GEM Website.

Dataset Summary

CommonGen is an English text generation task that explicitly tests machines for the ability of generative commonsense reasoning. Given a set of common concepts, the task is to generate a coherent sentence describing an everyday scenario using these concepts. CommonGen is challenging because it inherently requires 1) relational reasoning using background commonsense knowledge, and 2) compositional generalization ability to work on unseen concept combinations. The dataset, constructed through a combination of crowd-sourcing from AMT and existing caption corpora, consists of roughly 35k concept-sets and 77k sentences in total across the train, dev, and test splits. Note that the CommonGen test set is private and requires submission to the external leaderboard.

You can load the dataset via:

import datasets
data = datasets.load_dataset('GEM/common_gen')

The data loader can be found here.

website

link

paper

Link

authors

Bill Yuchen Lin (USC), Wangchunshu Zhou (USC), Ming Shen (USC), Pei Zhou (USC), Chandra Bhagavatula (AllenAI), Yejin Choi (AllenAI + UW), Xiang Ren (USC)

Dataset Overview

Where to find the Data and its Documentation

Webpage

link

Download

Link

Paper

Link

BibTex
@inproceedings{lin-etal-2020-commongen,
    title = "{C}ommon{G}en: A Constrained Text Generation Challenge for Generative Commonsense Reasoning",
    author = "Lin, Bill Yuchen  and
      Zhou, Wangchunshu  and
      Shen, Ming  and
      Zhou, Pei  and
      Bhagavatula, Chandra  and
      Choi, Yejin  and
      Ren, Xiang",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.findings-emnlp.165",
    pages = "1823--1840",
}
Contact Name

Bill Yuchen Lin

Contact Email

yuchen.lin@usc.edu

Has a Leaderboard?

yes

Leaderboard Link

Link

Leaderboard Details

The model outputs are evaluated against the crowdsourced references, and ranked by SPICE score. The leaderboard also reports BLEU-4 and CIDEr scores.

Languages and Intended Use

Multilingual?

no

Covered Dialects

No information is provided on regional restrictions and we thus assume that the covered dialects are those spoken by raters on Mechanical Turk.

Covered Languages

English

Whose Language?

The concepts were extracted from multiple English image captioning datasets and the data was collected via Amazon Mechanical Turk. No information on regional restrictions is provided.

License

mit: MIT License

Intended Use

CommonGen is a constrained text generation task, associated with a benchmark dataset, to explicitly test machines for the ability of generative commonsense reasoning.

Primary Task

Reasoning

Communicative Goal

The speaker is required to produce a coherent sentence which mentions all of the source concepts, and which describes a likely situation that could be captured in a picture or video.

Credit

Curation Organization Type(s)

academic, independent

Curation Organization(s)

The dataset was curated by a joint team of researchers from the University of Southern California and Allen Institute for Artificial Intelligence.

Dataset Creators

Bill Yuchen Lin (USC), Wangchunshu Zhou (USC), Ming Shen (USC), Pei Zhou (USC), Chandra Bhagavatula (AllenAI), Yejin Choi (AllenAI + UW), Xiang Ren (USC)

Funding

The research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), the DARPA MCS program, and NSF SMA 18-29268.

Who added the Dataset to GEM?

Yacine Jernite created the initial data card. It was later extended by Simon Mille. Sebastian Gehrmann migrated it to the GEMv2 format.

Dataset Structure

Data Fields

A data instance has the following fields:

  • concepts: a list of strings denoting the concepts the system should write about. It has 3 to 5 items and constitutes the input of the task.
  • target: a sentence string mentioning all of the above-mentioned concepts. It constitutes the desired output of the task.
Example Instance
[
  {
    "concepts": ['ski', 'mountain', 'skier'],
    "target": 'Skier skis down the mountain',
  },
  {
    "concepts": ['ski', 'mountain', 'skier'],
    "target": 'Three skiers are skiing on a snowy mountain.',
  },
]
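
Because the same concept set can appear in several instances with different targets, it can be convenient to group the targets by concept set before training or scoring. The following is a minimal sketch, assuming instances are plain dicts with the `concepts` and `target` fields described above; `group_references` is a hypothetical helper name and is not part of the GEM data loader.

```python
from collections import defaultdict

# The two instances from the example above, written as plain Python dicts.
instances = [
    {"concepts": ["ski", "mountain", "skier"],
     "target": "Skier skis down the mountain"},
    {"concepts": ["ski", "mountain", "skier"],
     "target": "Three skiers are skiing on a snowy mountain."},
]

def group_references(rows):
    """Map each concept set (order-insensitive) to all of its target sentences."""
    grouped = defaultdict(list)
    for row in rows:
        # frozenset makes the key hashable and independent of concept order.
        key = frozenset(row["concepts"])
        grouped[key].append(row["target"])
    return dict(grouped)

references = group_references(instances)
for concept_set, targets in references.items():
    print(sorted(concept_set), "->", len(targets), "target(s)")
```

Using a `frozenset` as the key is a design choice for this sketch: it treats the input as an unordered set, which matches the task description but discards the original concept order.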
Data Splits

Each example in the dataset consists of a set of 3 to 5 concepts, each denoted by a single noun, verb, or adjective (the input), and a sentence using these concepts (the output). The dataset provides several such sentences for each concept set.

                           Train     Dev      Test
Total concept-sets         32,651    993      1,497
Total sentences            67,389    4,018    6,042
Average sentence length    10.54     11.55    13.34
Splitting Criteria

The dev and test sets were created by sampling concept sets of size 4 or 5 (and as many of size 3 for the dev set) present in the source captioning datasets and having crowd-workers write reference sentences using these concepts.

Conversely, the training set has more concept sets of size 3 than of size 4 and 5, and uses the original captions from the source datasets as references.

The authors also ensured that the training, dev and test sets have different combinations of unique concepts to ensure compositionality (details in Table 1).
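
One way to inspect this property on local data is to compare the concept sets of two splits and list any that occur in both. The sketch below is only an illustration: the toy rows are made up and are not taken from the dataset, and `shared_concept_sets` is a hypothetical helper name.

```python
def concept_set_keys(rows):
    """Collect the order-insensitive concept sets that occur in a split."""
    return {frozenset(row["concepts"]) for row in rows}

def shared_concept_sets(rows_a, rows_b):
    """Return the concept sets that occur in both splits."""
    return concept_set_keys(rows_a) & concept_set_keys(rows_b)

# Made-up rows for illustration only (not real CommonGen data).
toy_train = [
    {"concepts": ["dog", "run", "park"]},
    {"concepts": ["ski", "mountain", "skier"]},
]
toy_dev = [
    {"concepts": ["park", "dog", "run"]},          # same set, different order
    {"concepts": ["cat", "sleep", "sofa", "sun"]},
]

overlap = shared_concept_sets(toy_train, toy_dev)
print(len(overlap), "concept set(s) shared between the toy splits")
```

An empty result for a pair of real splits would indicate that no exact concept combination is repeated across them; the check says nothing about overlap of individual concepts.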

Dataset in GEM

Rationale for Inclusion in GEM

Why is the Dataset in GEM?

CommonGen is a medium-sized corpus with a unique reasoning challenge and interesting evaluation possibilities.

Similar Datasets

no

Ability that the Dataset measures

Commonsense reasoning

GEM-Specific Curation

Modified for GEM?

yes

GEM Modifications

other

Modification Details

4 challenge sets for CommonGen were added to the GEM evaluation suite.

Additional Splits?

yes

Split Information
  • Data Shift
  • We created subsets of the training and development sets of ~500 randomly selected inputs each.

  • Transformations
  • We applied input scrambling on a subset of 500 randomly selected test instances; the order of the concepts was randomly reassigned.

  • Subpopulations
  • We created subpopulations based on input length, i.e., the number of concepts in each test input. By comparing inputs of different lengths, we can see to what extent systems are able to handle different input sizes.

    Concept number    Frequency (English)
    4                 747
    5                 750
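
The scrambling transformation and the length-based subpopulations described above can be approximated locally. This is a minimal sketch under stated assumptions, not the code used to build the GEM challenge sets: the function names, the seed, and the toy rows are hypothetical, and the actual sampling of 500 test instances is not reproduced here.

```python
import random
from collections import defaultdict

def scramble_concepts(rows, seed=0):
    """Return copies of the rows with the order of the concepts shuffled."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    scrambled = []
    for row in rows:
        concepts = list(row["concepts"])  # copy, so the input is not mutated
        rng.shuffle(concepts)
        scrambled.append({**row, "concepts": concepts})
    return scrambled

def split_by_input_length(rows):
    """Bucket rows by the number of concepts in the input."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[len(row["concepts"])].append(row)
    return dict(buckets)

# Made-up rows for illustration only (not real CommonGen data).
toy_test = [
    {"concepts": ["dog", "ball", "catch", "park"]},
    {"concepts": ["cook", "pan", "egg", "stove", "kitchen"]},
    {"concepts": ["boat", "river", "row", "man"]},
]

scrambled_rows = scramble_concepts(toy_test)
subpopulations = split_by_input_length(toy_test)
print({size: len(rows) for size, rows in subpopulations.items()})
```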
    Split Motivation

    Generalization and Robustness

    Getting Started with the Task

    Pointers to Resources

    Previous Results

    Measured Model Abilities

    Commonsense Reasoning

    Metrics

    Other: Other Metrics, BLEU, ROUGE, METEOR

    Other Metrics
    • SPICE: An evaluation metric for image captioning that is defined over scene graphs
    • CIDEr: An n-gram overlap metric based on cosine similarity between the TF-IDF weighted n-gram counts
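
To make the idea behind a TF-IDF weighted n-gram cosine similarity concrete, here is a simplified sketch for a single n-gram order and a single reference. It is an illustration of the general technique only and is not the official CIDEr implementation, which differs in details such as combining several n-gram orders and multiple references. All function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def document_frequencies(corpus_tokens, n):
    """Count in how many corpus sentences each n-gram occurs."""
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(ngrams(tokens, n)))
    return doc_freq

def tfidf_vector(tokens, n, doc_freq, num_docs):
    """Build a sparse TF-IDF vector over the n-grams of one sentence."""
    counts = Counter(ngrams(tokens, n))
    total = sum(counts.values())
    vector = {}
    for gram, count in counts.items():
        df = max(doc_freq.get(gram, 0), 1)  # treat unseen n-grams as seen once
        vector[gram] = (count / total) * math.log(num_docs / df)
    return vector

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(gram, 0.0) for gram, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def tfidf_ngram_similarity(candidate, reference, corpus, n=1):
    """Score a candidate against one reference, with IDF taken from `corpus`."""
    corpus_tokens = [sentence.lower().split() for sentence in corpus]
    doc_freq = document_frequencies(corpus_tokens, n)
    num_docs = len(corpus_tokens)
    cand_vec = tfidf_vector(candidate.lower().split(), n, doc_freq, num_docs)
    ref_vec = tfidf_vector(reference.lower().split(), n, doc_freq, num_docs)
    return cosine(cand_vec, ref_vec)

# Made-up sentences for illustration only.
toy_corpus = [
    "a skier skis down the mountain",
    "a dog runs in the park",
    "a man rides a horse",
]
candidate = "skier skis down the mountain"
score_match = tfidf_ngram_similarity(candidate, toy_corpus[0], toy_corpus)
score_other = tfidf_ngram_similarity(candidate, toy_corpus[1], toy_corpus)
print(score_match > score_other)
```

In this toy setup the word "a" occurs in every corpus sentence, so its IDF weight is zero and it does not affect the score; this is the down-weighting of uninformative n-grams that TF-IDF is meant to provide.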
    Proposed Evaluation

    The main metrics are captioning metrics since the original concept lists were extracted from captioning datasets. In addition, a human subject study was conducted in which five graduate students were asked to rank the "commonsense plausibility" of two models at a time.

    Previous results available?

    yes

    Other Evaluation Approaches

    The currently best performing model KFCNet (https://aclanthology.org/2021.findings-emnlp.249/) uses the same automatic evaluation but does not conduct any human evaluation.

    Relevant Previous Results

    The most relevant results can be seen on the leaderboard.

    Dataset Curation

    Original Curation

    Original Curation Rationale

    The dataset creators selected sets of concepts that appeared in image and video captions (as identified by a POS tagger) to ensure that a likely real-world scenario including the set could be imagined and constructed. Section 3.1 of the paper describes a sampling scheme which encourages diversity of sets while selecting common concepts.

    Communicative Goal

    The speaker is required to produce a coherent sentence which mentions all of the source concepts, and which describes a likely situation that could be captured in a picture or video.

    Sourced from Different Sources

    yes

    Source Details

    Language Data

    How was Language Data Obtained?

    Crowdsourced

    Where was it crowdsourced?

    Amazon Mechanical Turk

    Language Producers

    The training data consists of concept sets and captions from the source datasets. The concept sets are the sets of labels of the images or videos, selected with a heuristic to maximize diversity while ensuring that they represent likely scenarios.

    The dev and test set sentences were created by Amazon Mechanical Turk crowd workers. The workers were shown an example generation and a set of 4 or 5 concept names along with their part-of-speech and asked to write:

  • One sentence mentioning all of the concepts
  • A rationale explaining how the sentence connects the concepts

    A screenshot of the interface is provided in Figure 7 of the Appendix.

    Topics Covered

    Information was not provided.

    Data Validation

    validated by data curator

    Was Data Filtered?

    algorithmically

    Filter Criteria

    During data collection, workers whose rationales were too short, whose sentences failed to cover the input concepts well, or whose output had a high perplexity under a GPT-2 model were disqualified from the pool and replaced with newcomers.
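
The three signals named above can be combined into a simple check. The sketch below is hypothetical throughout: the thresholds, field names, and the prefix-based coverage heuristic are assumptions made for illustration, the perplexity is assumed to have been computed beforehand with a language model, and the card does not say how the signals were aggregated per worker.

```python
import re

def concept_coverage(concepts, sentence):
    """Fraction of concepts matched by a crude prefix test (no real lemmatization)."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    covered = sum(
        any(token.startswith(concept.lower()) for token in tokens)
        for concept in concepts
    )
    return covered / len(concepts)

def should_flag(submission,
                min_rationale_words=5,   # hypothetical threshold
                min_coverage=1.0,        # hypothetical threshold
                max_perplexity=200.0):   # hypothetical threshold
    """Return True if a submission trips any of the three filter signals."""
    rationale_too_short = len(submission["rationale"].split()) < min_rationale_words
    poor_coverage = (
        concept_coverage(submission["concepts"], submission["sentence"]) < min_coverage
    )
    high_perplexity = submission["perplexity"] > max_perplexity
    return rationale_too_short or poor_coverage or high_perplexity

# Made-up submissions; the perplexity values are placeholders, not model outputs.
good = {
    "concepts": ["ski", "mountain", "skier"],
    "sentence": "Three skiers are skiing on a snowy mountain.",
    "rationale": "Skiers ski on mountains covered in snow.",
    "perplexity": 35.0,
}
bad = {
    "concepts": ["ski", "mountain", "skier"],
    "sentence": "A dog runs.",
    "rationale": "ok",
    "perplexity": 35.0,
}
print(should_flag(good), should_flag(bad))
```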

    Structured Annotations

    Additional Annotations?

    none

    Annotation Service?

    no

    Consent

    Any Consent Policy?

    no

    Justification for Using the Data

    The data was sourced from Mechanical Turk, which means that raters were aware that their annotations may be publicly released for research purposes.

    Private Identifying Information (PII)

    Contains PII?

    no PII

    Justification for no PII

    The concepts are restricted to verbs, adjectives, and common nouns, and no personal information is given in the captions.

    Maintenance

    Any Maintenance Plan?

    no

    Broader Social Context

    Previous Work on the Social Impact of the Dataset

    Usage of Models based on the Data

    no

    Impact on Under-Served Communities

    Addresses needs of underserved Communities?

    no

    Discussion of Biases

    Any Documented Social Biases?

    no

    Are the Language Producers Representative of the Language?

    The dataset is created using data from image captioning systems and might inherit some of the social biases represented therein (see e.g. Tang et al. 2020).

    Another related concern is the exposure bias introduced by the initial selection of pictures and video, which are likely to over-represent situations that are common in the US at the expense of other parts of the world (Flickr, for example, is a US-based company founded in Canada). For more discussion of the potential impacts of exposure bias, see e.g. The Social Impact of Natural Language Processing.

    Considerations for Using the Data

    PII Risks and Liability

    Potential PII Risk

    The concepts are restricted to verbs, adjectives, and common nouns, and no personal information is given in the captions.

    Licenses

    Copyright Restrictions on the Dataset

    open license - commercial use allowed

    Copyright Restrictions on the Language Data

    open license - commercial use allowed

    Known Technical Limitations

    Technical Limitations

    The dataset is in English, a language with an abundance of existing resources.

    The use of GPT-2 to validate development and test sentences might be cause for similar concern, but we do note that the authors only use the model to discount very high perplexity sequences, which is less likely to surface those biases.

    The language in the development and test set is crowdsourced, which means that it was written by workers whose main goal was speed. This is likely to impact the quality and variety of the targets. The population of crowdsource workers is also not distributed identically to the base population of the locations the workers come from, which may lead to a different representation of situations or underlying expectations of what these situations are.

    Unsuited Applications

    Due to the overrepresentation of US situations, the system may not work for users across the world. Moreover, only limited information on the dataset quality is provided, and the system may fail as a result of unknown issues.

    Discouraged Use Cases

    Any system needs to be evaluated on a broader set of unseen concepts than provided in the dataset. Since the references for the test set are private, it is not known how well findings generalize beyond the collection methodology.