Dataset:

GEM/common_gen

Tasks:

task_categories:other

Languages:

Multilinguality:

unknown

Size:

size_categories:unknown

Language creators:

unknown

Annotations creators:

none

Source datasets:

original

ArXiv:

arxiv:1911.03705 arxiv:1910.13461 arxiv:2009.12677

Other:

reasoning

License:

mit

Dataset Card for GEM/common_gen

Link to Main Data Card

You can find the main data card on the GEM Website.

Dataset Summary

CommonGen is an English text generation task to explicitly test machines for the ability of generative commonsense reasoning. Given a set of common concepts, the task is to generate a coherent sentence describing an everyday scenario using these concepts. CommonGen is challenging because it inherently requires 1) relational reasoning using background commonsense knowledge, and 2) compositional generalization ability to work on unseen concept combinations. The dataset, constructed through a combination of crowd-sourcing from AMT and existing caption corpora, consists of 30k concept-sets and 50k sentences in total. Note that the CommonGen test set is private and requires submission to the external leaderboard.

You can load the dataset via:

import datasets
data = datasets.load_dataset('GEM/common_gen')

The data loader can be found here.

website

link

paper

Link

authors

Bill Yuchen Lin (USC), Wangchunshu Zhou (USC), Ming Shen (USC), Pei Zhou (USC), Chandra Bhagavatula (AllenAI), Yejin Choi (AllenAI + UW), Xiang Ren (USC)

Dataset Overview

Where to find the Data and its Documentation

Webpage

link

Download

Link

Paper

Link

BibTex

@inproceedings{lin-etal-2020-commongen,
    title = "{C}ommon{G}en: A Constrained Text Generation Challenge for Generative Commonsense Reasoning",
    author = "Lin, Bill Yuchen  and
      Zhou, Wangchunshu  and
      Shen, Ming  and
      Zhou, Pei  and
      Bhagavatula, Chandra  and
      Choi, Yejin  and
      Ren, Xiang",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.findings-emnlp.165",
    pages = "1823--1840",
}

Contact Name

Bill Yuchen Lin

Contact Email

yuchen.lin@usc.edu

Has a Leaderboard?

yes

Leaderboard Link

Link

Leaderboard Details

The model outputs are evaluated against the crowdsourced references, and ranked by SPICE score. The leaderboard also reports BLEU-4 and CIDEr scores.

Languages and Intended Use

Multilingual?

Covered Dialects

No information is provided on regional restrictions and we thus assume that the covered dialects are those spoken by raters on Mechanical Turk.

Covered Languages

English

Whose Language?

The concepts were extracted from multiple English image captioning datasets and the data was collected via Amazon Mechanical Turk. No information on regional restrictions is provided.

License

mit: MIT License

Intended Use

CommonGen is a constrained text generation task, associated with a benchmark dataset, to explicitly test machines for the ability of generative commonsense reasoning.

Primary Task

Reasoning

Communicative Goal

The speaker is required to produce a coherent sentence which mentions all of the source concepts, and which describes a likely situation that could be captured in a picture or video.

Credit

Curation Organization Type(s)

academic, independent

Curation Organization(s)

The dataset was curated by a joint team of researchers from the University of Southern California and Allen Institute for Artificial Intelligence.

Dataset Creators

Bill Yuchen Lin (USC), Wangchunshu Zhou (USC), Ming Shen (USC), Pei Zhou (USC), Chandra Bhagavatula (AllenAI), Yejin Choi (AllenAI + UW), Xiang Ren (USC)

Funding

The research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), the DARPA MCS program, and NSF SMA 18-29268.

Who added the Dataset to GEM?

Yacine Jernite created the initial data card. It was later extended by Simon Mille. Sebastian Gehrmann migrated it to the GEMv2 format.

Dataset Structure

Data Fields

A data instance has the following fields:

concepts: a list of 3 to 5 strings denoting the concepts the system should write about. This is the input of the task.
target: a sentence string mentioning all of the above concepts. This is the desired output of the task.

Example Instance

[
  {
    "concepts": ['ski', 'mountain', 'skier'],
    "target": 'Skier skis down the mountain',
  },
  {
    "concepts": ['ski', 'mountain', 'skier'],
    "target": 'Three skiers are skiing on a snowy mountain.',
  },
]

Data Splits

Each example in the dataset consists of a set of 3 to 5 concepts, each denoted by a single noun, verb, or adjective (the input), and a sentence using these concepts (the output). The dataset provides several such sentences for each concept set.
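Since a concept set is repeated once per reference sentence, it can be convenient to group the references by concept set before evaluation. The following is a minimal sketch, assuming instances are dictionaries with the `concepts` and `target` fields described above; the helper name `group_references` is hypothetical and not part of the GEM data loader.

```python
from collections import defaultdict


def group_references(instances):
    """Map each concept set to the list of target sentences written for it."""
    grouped = defaultdict(list)
    for instance in instances:
        # A frozenset is hashable and ignores the order of the concepts.
        key = frozenset(instance["concepts"])
        grouped[key].append(instance["target"])
    return dict(grouped)


# The two instances from the example above share one concept set.
examples = [
    {"concepts": ["ski", "mountain", "skier"],
     "target": "Skier skis down the mountain"},
    {"concepts": ["ski", "mountain", "skier"],
     "target": "Three skiers are skiing on a snowy mountain."},
]
references = group_references(examples)
```

Keying on a `frozenset` is a design choice for this sketch: it treats two inputs with the same concepts in a different order as the same concept set.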

Statistic	Train	Dev	Test
Total concept-sets	32,651	993	1,497
Total sentences	67,389	4,018	6,042
Average sentence length	10.54	11.55	13.34

Splitting Criteria

The dev and test sets were created by sampling sets of concepts of size 4 or 5 (and as many of size 3 for the dev set) present in the source captioning datasets and having crowd-workers write reference sentences using these concepts.

Conversely, the training set has more concept sets of size 3 than of size 4 and 5, and uses the original captions from the source datasets as references.

The authors also ensured that the training, dev, and test sets contain different combinations of unique concepts, so that the evaluation tests compositional generalization (details in Table 1).
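One simple way to sanity-check this property on loaded data is to look for concept sets that occur in more than one split. This is a sketch under assumptions: it only checks whole concept sets, which is one possible reading of the criterion, and the function name `concept_set_overlap` and the toy data are made up for illustration.

```python
def concept_set_overlap(split_a, split_b):
    """Return the concept sets (as frozensets) that occur in both splits."""
    sets_a = {frozenset(ex["concepts"]) for ex in split_a}
    sets_b = {frozenset(ex["concepts"]) for ex in split_b}
    return sets_a & sets_b


# Toy data for illustration only, not taken from the dataset.
train = [{"concepts": ["ski", "mountain", "skier"],
          "target": "Skier skis down the mountain"}]
dev = [{"concepts": ["dog", "frisbee", "catch", "throw"],
        "target": "A dog catches the frisbee that a boy throws."}]

overlap = concept_set_overlap(train, dev)
```

An empty result means no concept set is shared between the two splits.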

Dataset in GEM

Rationale for Inclusion in GEM

Why is the Dataset in GEM?

CommonGen is a medium-sized corpus with a unique reasoning challenge and interesting evaluation possibilities.

Similar Datasets

Ability that the Dataset measures

Commonsense reasoning

GEM-Specific Curation

Modified for GEM?

yes

GEM Modifications

other

Modification Details

4 challenge sets for CommonGen were added to the GEM evaluation suite.

Additional Splits?

yes

Split Information

Data Shift

We created subsets of the training and development sets of ~500 randomly selected inputs each.

Transformations

We applied input scrambling on a subset of 500 randomly selected test instances; the order of the concepts was randomly reassigned.
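The scrambling step can be sketched as follows. This is an illustration rather than the script used to build the GEM challenge set: the fixed seed, the helper name `scramble_concepts`, and the toy inputs are assumptions.

```python
import random


def scramble_concepts(instance, rng):
    """Return a copy of the instance with its concepts in a random order."""
    shuffled = list(instance["concepts"])  # copy so the original stays intact
    rng.shuffle(shuffled)
    return {**instance, "concepts": shuffled}


# Toy inputs for illustration only; real test inputs come from the data loader.
test_instances = [
    {"concepts": ["dog", "frisbee", "catch", "throw"]},
    {"concepts": ["ski", "mountain", "skier", "snow", "slope"]},
    {"concepts": ["horse", "ride", "field", "man"]},
]

rng = random.Random(0)  # assumed seed, used only to make the sketch repeatable
subset = rng.sample(test_instances, k=min(500, len(test_instances)))
scrambled = [scramble_concepts(ex, rng) for ex in subset]
```

A random shuffle can occasionally return the original order, so a small fraction of "scrambled" inputs may be identical to their source instances.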

Subpopulations

We created subpopulations based on input length, that is, the number of concepts in each test input. By comparing results on inputs of different lengths, we can see to what extent systems are able to handle different input sizes.

Number of concepts	Frequency (English)
4	747
5	750
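Bucketing inputs by their number of concepts can be sketched as below. The function name `split_by_concept_count` and the toy inputs are hypothetical; applied to the real test inputs, the buckets would be expected to match the frequencies in the table above.

```python
from collections import defaultdict


def split_by_concept_count(instances):
    """Group instances into buckets keyed by the number of input concepts."""
    buckets = defaultdict(list)
    for instance in instances:
        buckets[len(instance["concepts"])].append(instance)
    return dict(buckets)


# Toy inputs for illustration only.
toy_inputs = [
    {"concepts": ["dog", "frisbee", "catch", "throw"]},
    {"concepts": ["horse", "ride", "field", "man"]},
    {"concepts": ["ski", "mountain", "skier", "snow", "slope"]},
]
buckets = split_by_concept_count(toy_inputs)
sizes = {count: len(items) for count, items in buckets.items()}
```

Scoring each bucket separately then shows whether a system degrades as the number of concepts grows.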

Split Motivation

Generalization and Robustness

Getting Started with the Task

Pointers to Resources

Two variants of BART, Knowledge Graph augmented-BART and Enhanced Knowledge Injection Model for Commonsense Generation, hold the top two spots on the leaderboard, followed by a fine-tuned T5 model.
The following script shows how to download and load the data, fine-tune, and evaluate a model using the ROUGE, BLEU, and METEOR metrics: GEM sample script.

Previous Results

Measured Model Abilities

Commonsense Reasoning

Metrics

Other: Other Metrics, BLEU, ROUGE, METEOR

Other Metrics

SPICE: An evaluation metric for image captioning that is defined over scene graphs
CIDEr: An n-gram overlap metric based on cosine similarity between the TF-IDF-weighted n-gram counts
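To make the CIDEr description concrete, the sketch below computes a TF-IDF-weighted cosine similarity for a single n-gram order. It is a simplified illustration and not the official CIDEr implementation: it uses whitespace tokenization, computes document frequencies over a flat list of sentences, and omits the averaging over several n-gram orders and the refinements of later variants. The function names and the toy corpus are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def tfidf_vector(counts, doc_freq, num_docs):
    """Weight n-gram term frequencies by inverse document frequency."""
    total = sum(counts.values())
    vector = {}
    for gram, count in counts.items():
        # N-grams unseen in the corpus are treated as occurring in one document.
        idf = math.log(num_docs / max(1, doc_freq.get(gram, 0)))
        vector[gram] = (count / total) * idf
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(gram, 0.0) for gram, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def simple_cider_n(candidate, references, corpus, n=1):
    """Average TF-IDF cosine similarity between a candidate and its references."""
    docs = [ngrams(doc.lower().split(), n) for doc in corpus]
    # Iterating a Counter yields each distinct n-gram once per document.
    doc_freq = Counter(gram for doc in docs for gram in doc)
    cand = tfidf_vector(ngrams(candidate.lower().split(), n), doc_freq, len(docs))
    scores = [
        cosine(cand, tfidf_vector(ngrams(ref.lower().split(), n), doc_freq, len(docs)))
        for ref in references
    ]
    return sum(scores) / len(scores)


# Toy corpus for illustration only.
corpus = [
    "a skier skis down the mountain",
    "a dog catches a frisbee",
    "a man rides a horse",
]
same = simple_cider_n(corpus[0], [corpus[0]], corpus)
different = simple_cider_n(corpus[1], [corpus[0]], corpus)
```

In this toy corpus the word "a" occurs in every sentence, so its IDF weight is zero and it contributes nothing to the similarity; this is the effect that TF-IDF weighting is meant to have on uninformative n-grams.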

Proposed Evaluation

The main metrics are captioning metrics since the original concept lists were extracted from captioning datasets. In addition, a human-subject study was conducted in which five graduate students were asked to rank the "commonsense plausibility" of two models at a time.

Previous results available?

yes

Other Evaluation Approaches

The currently best-performing model, KFCNet (https://aclanthology.org/2021.findings-emnlp.249/), uses the same automatic evaluation but does not conduct any human evaluation.

Relevant Previous Results

The most relevant results can be seen on the leaderboard.

Dataset Curation

Original Curation

Original Curation Rationale

The dataset creators selected sets of concepts that appeared in image and video captions (as identified by a POS tagger) to ensure that a likely real-world scenario including the set could be imagined and constructed. Section 3.1 of the paper describes a sampling scheme which encourages diversity of sets while selecting common concepts.

Communicative Goal

The speaker is required to produce a coherent sentence which mentions all of the source concepts, and which describes a likely situation that could be captured in a picture or video.

Sourced from Different Sources

yes

Source Details

Flickr30k
MSCOCO
Conceptual Captions
Video captioning datasets:
- LSMDC
- ActivityNet
- VaTeX

Language Data

How was Language Data Obtained?

Crowdsourced

Where was it crowdsourced?

Amazon Mechanical Turk

Language Producers

The training data consists of concept sets and captions for the source datasets. The concept sets are the sets of labels of the images or videos, selected with a heuristic to maximize diversity while ensuring that they represent likely scenarios.

The dev and test set sentences were created by Amazon Mechanical Turk crowd workers. The workers were shown an example generation and a set of 4 or 5 concept names along with their part-of-speech and asked to write:

One sentence mentioning all of the concepts

A rationale explaining how the sentence connects the concepts

A screenshot of the interface is provided in Figure 7 of the Appendix.

Topics Covered

Information was not provided.

Data Validation

validated by data curator

Was Data Filtered?

algorithmically

Filter Criteria

During data collection, workers were disqualified from the pool and replaced with newcomers if their rationales were too short, if their sentences did not cover the input concepts well, or if their output had a high perplexity under a GPT-2 model.
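The three criteria can be combined into a single check, sketched below. This is an illustration and not the authors' filtering code: the thresholds, the prefix-based coverage test, and the function name `passes_quality_checks` are assumptions, and the GPT-2 perplexity is taken as a precomputed number rather than computed here. The card does not state how coverage was actually measured.

```python
def passes_quality_checks(sentence, rationale, concepts, perplexity,
                          min_rationale_words=5, min_coverage=1.0,
                          max_perplexity=300.0):
    """Apply the three filter criteria; all thresholds are hypothetical."""
    words = sentence.lower().split()
    # Crude coverage: a concept counts as covered if some word starts with it,
    # so "ski" is covered by "skis". A lemmatizer would be more accurate.
    covered = sum(
        any(word.startswith(concept.lower()) for word in words)
        for concept in concepts
    )
    coverage = covered / len(concepts)
    return (
        len(rationale.split()) >= min_rationale_words
        and coverage >= min_coverage
        and perplexity <= max_perplexity
    )


concepts = ["ski", "mountain", "skier"]
sentence = "Skier skis down the mountain"
rationale = "A skier is a person who skis on mountains"

accepted = passes_quality_checks(sentence, rationale, concepts, perplexity=50.0)
too_short = passes_quality_checks(sentence, "ok", concepts, perplexity=50.0)
implausible = passes_quality_checks(sentence, rationale, concepts, perplexity=1000.0)
```

In this sketch a single failed criterion rejects the submission, which mirrors the description above; how the original study aggregated failures per worker is not specified in this card.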

Structured Annotations

Additional Annotations?

none

Annotation Service?

Consent

Any Consent Policy?

Justification for Using the Data

The data was sourced from Mechanical Turk which means that raters were aware that their annotations may be publicly released for research purposes.

Private Identifying Information (PII)

Contains PII?

no PII

Justification for no PII

The concepts are restricted to verbs, adjectives, and common nouns, and no personal information is given in the captions.

Maintenance

Any Maintenance Plan?

Broader Social Context

Previous Work on the Social Impact of the Dataset

Usage of Models based on the Data

Impact on Under-Served Communities

Addresses needs of underserved Communities?

Discussion of Biases

Any Documented Social Biases?

Are the Language Producers Representative of the Language?

The dataset is created using data from image captioning systems and might inherit some of the social biases represented therein (see, e.g., Tang et al. 2020).

Another related concern is the exposure bias introduced by the initial selection of pictures and video, which are likely to over-represent situations that are common in the US at the expense of other parts of the world (Flickr, for example, is a US-based company founded in Canada). For more discussion of the potential impacts of exposure bias, see, e.g., The Social Impact of Natural Language Processing.

Considerations for Using the Data

PII Risks and Liability

Potential PII Risk

The concepts are restricted to verbs, adjectives, and common nouns, and no personal information is given in the captions.

Licenses

open license - commercial use allowed

Known Technical Limitations

Technical Limitations

The dataset is in English, a language with an abundance of existing resources.

The use of GPT-2 to validate development and test sentences might be cause for similar concern, but we do note that the authors only use the model to discount very high perplexity sequences, which is less likely to surface those biases.

The language in the development and test sets is crowdsourced, which means that it was written by workers whose main goal was speed. This is likely to impact the quality and variety of the targets. The population of crowdsource workers is also not distributed identically to the base population of the locations the workers come from, which may lead to a different representation of situations or of the underlying expectations of what these situations are.

Unsuited Applications

Due to the overrepresentation of US situations, the system may not work for users across the world. Moreover, only limited information on the dataset quality is provided, and the system may fail as a result of unknown issues.

Discouraged Use Cases

Any system needs to be evaluated on a broader set of unseen concepts than provided in the dataset. Since the references for the test set are private, it is not known how well findings generalize beyond the collection methodology.

Author:

GEM

Dataset size:

115.96 KB