You can find the main data card on the GEM Website.
CommonGen is an English text generation task to explicitly test machines for the ability of generative commonsense reasoning. Given a set of common concepts, the task is to generate a coherent sentence describing an everyday scenario using these concepts. CommonGen is challenging because it inherently requires 1) relational reasoning using background commonsense knowledge, and 2) compositional generalization ability to work on unseen concept combinations. The dataset, constructed through a combination of crowd-sourcing from AMT and existing caption corpora, consists of 30k concept-sets and 50k sentences in total. Note that the CommonGen test set is private and requires submission to the external leaderboard.
You can load the dataset via:
import datasets
data = datasets.load_dataset('GEM/common_gen')
The data loader can be found here.
Authors: Bill Yuchen Lin (USC), Wangchunshu Zhou (USC), Ming Shen (USC), Pei Zhou (USC), Chandra Bhagavatula (AllenAI), Yejin Choi (AllenAI + UW), Xiang Ren (USC)
@inproceedings{lin-etal-2020-commongen,
    title = "{C}ommon{G}en: A Constrained Text Generation Challenge for Generative Commonsense Reasoning",
    author = "Lin, Bill Yuchen and Zhou, Wangchunshu and Shen, Ming and Zhou, Pei and Bhagavatula, Chandra and Choi, Yejin and Ren, Xiang",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.findings-emnlp.165",
    pages = "1823--1840",
}
Contact Name: Bill Yuchen Lin
Contact Email: yuchen.lin@usc.edu
Has a Leaderboard? yes
Leaderboard Link
Leaderboard Details: The model outputs are evaluated against the crowdsourced references, and ranked by SPICE score. The leaderboard also reports BLEU-4 and CIDEr scores.
no
Covered Dialects: No information is provided on regional restrictions and we thus assume that the covered dialects are those spoken by raters on Mechanical Turk.
Covered Languages: English
Whose Language? The concepts were extracted from multiple English image captioning datasets and the data was collected via Amazon Mechanical Turk. No information on regional restrictions is provided.
License: mit: MIT License
Intended Use: CommonGen is a constrained text generation task, associated with a benchmark dataset, to explicitly test machines for the ability of generative commonsense reasoning.
Primary Task: Reasoning
Communicative Goal: The speaker is required to produce a coherent sentence which mentions all of the source concepts, and which describes a likely situation that could be captured in a picture or video.
academic, independent
Curation Organization(s): The dataset was curated by a joint team of researchers from the University of Southern California and Allen Institute for Artificial Intelligence.
Dataset Creators: Bill Yuchen Lin (USC), Wangchunshu Zhou (USC), Ming Shen (USC), Pei Zhou (USC), Chandra Bhagavatula (AllenAI), Yejin Choi (AllenAI + UW), Xiang Ren (USC)
Funding: The research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), the DARPA MCS program, and NSF SMA 18-29268.
Who added the Dataset to GEM? Yacine Jernite created the initial data card. It was later extended by Simon Mille. Sebastian Gehrmann migrated it to the GEMv2 format.
A data instance has the following fields, "concepts" (the input concept set) and "target" (one reference sentence), as in this example:
[
  {
    "concepts": ['ski', 'mountain', 'skier'],
    "target": 'Skier skis down the mountain',
  },
  {
    "concepts": ['ski', 'mountain', 'skier'],
    "target": 'Three skiers are skiing on a snowy mountain.',
  },
]
Data Splits
Each example in the dataset consists of a set of 3 to 5 concepts, each denoted by a single noun, verb, or adjective (the input), and a sentence using these concepts (the output). The dataset provides several such sentences for each concept set.
| | Train | Dev | Test |
|---|---|---|---|
| Total concept-sets | 32,651 | 993 | 1,497 |
| Total sentences | 67,389 | 4,018 | 6,042 |
| Average sentence length | 10.54 | 11.55 | 13.34 |
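As a rough illustration of how statistics like those in the table could be derived from instances with the "concepts" and "target" fields, here is a minimal sketch. It assumes that sentence length is counted in whitespace-separated tokens, which the card does not specify, and the helper name `split_statistics` is hypothetical.

```python
from collections import defaultdict


def split_statistics(instances):
    """Count concept sets and sentences, and compute the mean sentence length."""
    # Group reference sentences by their order-insensitive concept set.
    by_concepts = defaultdict(list)
    for inst in instances:
        by_concepts[frozenset(inst["concepts"])].append(inst["target"])

    sentences = [s for refs in by_concepts.values() for s in refs]
    # Assumption: length is measured in whitespace-separated tokens.
    total_tokens = sum(len(s.split()) for s in sentences)
    mean_len = total_tokens / len(sentences) if sentences else 0.0
    return {
        "concept_sets": len(by_concepts),
        "sentences": len(sentences),
        "mean_sentence_length": mean_len,
    }


# The two example instances shown in this data card.
example = [
    {"concepts": ["ski", "mountain", "skier"],
     "target": "Skier skis down the mountain"},
    {"concepts": ["ski", "mountain", "skier"],
     "target": "Three skiers are skiing on a snowy mountain."},
]
stats = split_statistics(example)
```

Both example instances share one concept set, so they count as a single concept set with two sentences.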
The dev and test sets were created by sampling sets of concepts of size 4 or 5 (and as many of size 3 for the dev set) present in the source captioning datasets and having crowd-workers write reference sentences using these concepts.
Conversely, the training set has more concept sets of size 3 than of size 4 and 5, and uses the original captions from the source datasets as references.
The authors also ensured that the training, dev, and test sets contain different combinations of unique concepts, so that systems are tested on compositional generalization (details in Table 1).
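The claim that the splits use different concept combinations can be checked along the following lines. This is only a sketch: the helper name `shared_concept_sets` and the toy instances are hypothetical, and concept sets are compared regardless of order, which is an assumption rather than something the card states.

```python
def shared_concept_sets(split_a, split_b):
    """Return the concept sets (ignoring order) that occur in both splits."""
    sets_a = {frozenset(inst["concepts"]) for inst in split_a}
    sets_b = {frozenset(inst["concepts"]) for inst in split_b}
    return sets_a & sets_b


# Hypothetical toy splits, not taken from the dataset.
toy_train = [
    {"concepts": ["ski", "mountain", "skier"]},
    {"concepts": ["dog", "frisbee", "catch"]},
]
toy_dev = [
    {"concepts": ["dog", "frisbee", "catch", "throw"]},
]
toy_leaky_dev = [
    {"concepts": ["catch", "dog", "frisbee"]},
]

clean_overlap = shared_concept_sets(toy_train, toy_dev)
leaky_overlap = shared_concept_sets(toy_train, toy_leaky_dev)
```

An empty result means no concept combination is shared between the two splits. Individual concepts may still recur, as "dog" does in the toy data.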
CommonGen is a medium-sized corpus with a unique reasoning challenge and interesting evaluation possibilities.
Similar Datasets: no
Ability that the Dataset measures: Commonsense reasoning
yes
GEM Modifications: other
Modification Details: 4 challenge sets for CommonGen were added to the GEM evaluation suite.
Additional Splits? yes
Split Information: We created subsets of the training and development sets of ~500 randomly selected inputs each.
We applied input scrambling on a subset of 500 randomly selected test instances; the order of the concepts was randomly reassigned.
We created a subpopulation based on input length, taking into account the number of concepts in the input test structures. By comparing inputs of different lengths, we can see to what extent systems are able to handle different input sizes.
Concept number | Frequency English |
---|---|
4 | 747 |
5 | 750 |
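The scrambling and input-length challenge sets described above could be built roughly as follows. This is a sketch under assumptions, not the GEM implementation: the function names, the seed handling, and the toy inputs are all hypothetical.

```python
import random
from collections import defaultdict


def scramble_inputs(instances, k=500, seed=0):
    """Sample up to k instances and randomly reorder each one's concepts."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    sampled = rng.sample(instances, min(k, len(instances)))
    scrambled = []
    for inst in sampled:
        concepts = list(inst["concepts"])  # copy, so the original is untouched
        rng.shuffle(concepts)
        scrambled.append({**inst, "concepts": concepts})
    return scrambled


def by_input_length(instances):
    """Group instances by the number of concepts in the input."""
    groups = defaultdict(list)
    for inst in instances:
        groups[len(inst["concepts"])].append(inst)
    return dict(groups)


# Hypothetical toy inputs, not taken from the dataset.
toy_test = [
    {"concepts": ["dog", "frisbee", "catch", "throw"]},
    {"concepts": ["apple", "tree", "pick", "ladder", "basket"]},
    {"concepts": ["guitar", "stage", "play", "crowd"]},
]
scrambled = scramble_inputs(toy_test, k=2, seed=13)
length_groups = by_input_length(toy_test)
```

Scrambling changes only the order of the concepts, so a system that treats its input as a set should be unaffected by it.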
Generalization and Robustness
Commonsense Reasoning
Metrics: Other: Other Metrics, BLEU, ROUGE, METEOR
Other Metrics: The main metrics are captioning metrics since the original concept lists were extracted from captioning datasets. A human subject study was conducted in which five graduate students were asked to rank the "commonsense plausibility" of two models at a time.
Previous results available? yes
Other Evaluation Approaches: The currently best performing model KFCNet (https://aclanthology.org/2021.findings-emnlp.249/) uses the same automatic evaluation but does not conduct any human evaluation.
Relevant Previous Results: The most relevant results can be seen on the leaderboard.
The dataset creators selected sets of concepts that appeared in image and video captions (as identified by a POS tagger) to ensure that a likely real-world scenario including the set could be imagined and constructed. Section 3.1 of the paper describes a sampling scheme which encourages diversity of sets while selecting common concepts.
Communicative Goal: The speaker is required to produce a coherent sentence which mentions all of the source concepts, and which describes a likely situation that could be captured in a picture or video.
Sourced from Different Sources: yes
Source Details: Crowdsourced
Where was it crowdsourced? Amazon Mechanical Turk
Language Producers: The training data consists of concept sets and captions for the source datasets. The concept sets are the sets of labels of the images or videos, selected with a heuristic to maximize diversity while ensuring that they represent likely scenarios.
The dev and test set sentences were created by Amazon Mechanical Turk crowd workers. The workers were shown an example generation and a set of 4 or 5 concept names along with their part-of-speech and asked to write:
A screenshot of the interface is provided in Figure 7 of the Appendix.
Topics Covered: Information was not provided.
Data Validation: validated by data curator
Was Data Filtered? algorithmically
Filter Criteria: During the data collection, workers who provided rationales that were too short, whose sentences failed to cover the input well, or whose output had a high perplexity under a GPT-2 model were disqualified from the pool and replaced with newcomers.
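To make the coverage criterion concrete, the sketch below scores how many input concepts a sentence mentions. It is not the authors' filter: the function name `concept_coverage` is hypothetical, and prefix matching on lowercased tokens is a crude stand-in for proper lemmatization, so it will miss irregular forms (for example "ran" for "run") and can over-match.

```python
import re


def concept_coverage(concepts, sentence):
    """Fraction of concepts matched by some token in the sentence.

    Assumption: a concept counts as covered when any lowercased token
    starts with it, a naive substitute for lemmatization.
    """
    tokens = re.findall(r"[a-z]+", sentence.lower())
    covered = sum(
        any(tok.startswith(concept.lower()) for tok in tokens)
        for concept in concepts
    )
    return covered / len(concepts)


full = concept_coverage(["ski", "mountain", "skier"],
                        "Skier skis down the mountain")
# Hypothetical sentence that omits the concept "throw".
partial = concept_coverage(["dog", "frisbee", "catch", "throw"],
                           "A dog catches a frisbee.")
```

A threshold on this score could then flag sentences with poor coverage. Where to set it is a design choice the card does not specify.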
none
Annotation Service? no
no
Justification for Using the Data: The data was sourced from Mechanical Turk which means that raters were aware that their annotations may be publicly released for research purposes.
no PII
Justification for no PII: The concepts are restricted to verbs, adjectives, and common nouns, and no personal information is given in the captions.
no
no
no
no
Are the Language Producers Representative of the Language? The dataset is created using data from image captioning systems and might inherit some of the social biases represented therein (see e.g. Tang et al. 2020).
Another related concern is the exposure bias introduced by the initial selection of pictures and video, which are likely to over-represent situations that are common in the US at the expense of other parts of the world (Flickr, for example, is a US-based company founded in Canada). For more discussion of the potential impacts of exposure bias, see e.g. The Social Impact of Natural Language Processing.
The concepts are restricted to verbs, adjectives, and common nouns, and no personal information is given in the captions.
open license - commercial use allowed
Copyright Restrictions on the Language Data: open license - commercial use allowed
The dataset is in English, a language with an abundance of existing resources.
The use of GPT-2 to validate development and test sentences might be cause for similar concern, but we do note that the authors only use the model to discount very high perplexity sequences, which is less likely to surface those biases.
The language in the development and test set is crowdsourced, which means that it was written by workers whose main goal was speed. This is likely to impact the quality and variety of the targets. The population of crowdsourced workers also does not have the same distribution as the base population of the locations the workers come from, which may lead to different representation of situations or underlying expectations of what these situations are.
Unsuited Applications: Due to the overrepresentation of US situations, the system may not work for users across the world. Moreover, only limited information on the dataset quality is provided and the system may fail as a result of unknown issues.
Discouraged Use Cases: Any system needs to be evaluated on a broader set of unseen concepts than provided in the dataset. Since the references for the test set are private, it is not known how well findings generalize beyond the collection methodology.