Dataset: common_gen
Tasks: Text2Text Generation
Languages: en
Multilinguality: monolingual
Size: 10K<n<100K
Annotations creators: crowdsourced
Source datasets: original
ArXiv: arxiv:1911.03705
Other: concepts-to-text
License: mit

CommonGen is a constrained text generation task, associated with a benchmark dataset, that explicitly tests machines for the ability of generative commonsense reasoning. Given a set of common concepts, the task is to generate a coherent sentence describing an everyday scenario using these concepts.
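The constraint described above (the sentence must use the given concepts) can be illustrated with a rough coverage check. This is only a minimal sketch: the helper names `covered_concepts` and `covers_all` are hypothetical, and the prefix-matching rule is a naive stand-in for proper lemmatization, not the benchmark's official evaluation. The concepts and sentence are taken from the 'train' example in this card.

```python
import string


def covered_concepts(concepts, sentence):
    """Map each concept to True if some word in the sentence starts with it.

    Assumption: prefix matching is a crude approximation of inflection
    handling (e.g. "ski" matches "skiing"); it is not an official metric.
    """
    tokens = [tok.strip(string.punctuation).lower() for tok in sentence.split()]
    return {
        concept: any(tok.startswith(concept.lower()) for tok in tokens)
        for concept in concepts
    }


def covers_all(concepts, sentence):
    """Return True only if every concept is matched by at least one word."""
    return all(covered_concepts(concepts, sentence).values())


concepts = ["ski", "mountain", "skier"]
sentence = "Three skiers are skiing on a snowy mountain."
print(covered_concepts(concepts, sentence))
print(covers_all(concepts, sentence))
```

Prefix matching will produce false positives (for example, "ski" also matches "skirt"), so a real evaluation would more likely compare lemmas.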
CommonGen is challenging because it inherently requires 1) relational reasoning using background commonsense knowledge, and 2) compositional generalization ability to work on unseen concept combinations. Our dataset, constructed through a combination of crowd-sourcing from AMT and existing caption corpora, consists of 30k concept-sets and 50k sentences in total.
An example from the 'train' split looks as follows.
{ "concept_set_idx": 0, "concepts": ["ski", "mountain", "skier"], "target": "Three skiers are skiing on a snowy mountain." }
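As a rough illustration of how such a record might be consumed, the sketch below parses the example and turns it into a (source, target) pair for a text-to-text model. The function name `to_seq2seq_pair` and the choice to join concepts with a single space are assumptions made for this example, not something the dataset prescribes.

```python
import json

# The 'train' example from this card, as a JSON string.
raw = (
    '{ "concept_set_idx": 0, "concepts": ["ski", "mountain", "skier"], '
    '"target": "Three skiers are skiing on a snowy mountain." }'
)


def to_seq2seq_pair(record, sep=" "):
    """Validate the three fields and return (source_text, target_text).

    Assumption: the model input is the concepts joined by `sep`;
    other input formats are equally plausible.
    """
    if not isinstance(record["concept_set_idx"], int):
        raise TypeError("concept_set_idx should be an integer")
    if not all(isinstance(c, str) for c in record["concepts"]):
        raise TypeError("concepts should be a list of strings")
    if not isinstance(record["target"], str):
        raise TypeError("target should be a string")
    return sep.join(record["concepts"]), record["target"]


record = json.loads(raw)
source, target = to_seq2seq_pair(record)
print(source)
print(target)
```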
The data fields are the same across all splits: concept_set_idx (an integer), concepts (a list of strings), and target (a string).
name | train | validation | test |
---|---|---|---|
default | 67389 | 4018 | 1497 |
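To put the split sizes in proportion, the following sketch computes the total and each split's share from the numbers in the table. It assumes each count is a number of examples in that split; the variable names are illustrative only.

```python
# Split sizes copied from the table above.
split_sizes = {"train": 67389, "validation": 4018, "test": 1497}

total = sum(split_sizes.values())
shares = {name: count / total for name, count in split_sizes.items()}

print(f"total      {total:>6}")
for name, count in split_sizes.items():
    # Show each split's count and its fraction of the total as a percentage.
    print(f"{name:<10} {count:>6} {shares[name]:6.1%}")
```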
The dataset is licensed under the MIT License.
@inproceedings{lin-etal-2020-commongen,
    title = "{C}ommon{G}en: A Constrained Text Generation Challenge for Generative Commonsense Reasoning",
    author = "Lin, Bill Yuchen and Zhou, Wangchunshu and Shen, Ming and Zhou, Pei and Bhagavatula, Chandra and Choi, Yejin and Ren, Xiang",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.findings-emnlp.165",
    doi = "10.18653/v1/2020.findings-emnlp.165",
    pages = "1823--1840"
}
Thanks to @JetRunner, @yuchenlin, @thomwolf, @lhoestq for adding this dataset.