Dataset: numer_sense
Sub-tasks: slot-filling
Languages: en
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: crowdsourced
Annotations creators: expert-generated
Source datasets: extended|other
Preprint: arxiv:2005.00683
License: mit

NumerSense is a new numerical commonsense reasoning probing task, with a diagnostic dataset consisting of 3,145 masked-word-prediction probes. The general idea is to mask numbers between 0 and 10 in sentences mined from a commonsense corpus and evaluate whether a language model can correctly predict the masked value.
The dataset supports the task of slot-filling, specifically as an evaluation of numerical common sense. The dataset webpage hosts a leaderboard with benchmarks for GPT-2, RoBERTa, BERT, and human performance. Leaderboards are provided for both the core set and the adversarial set discussed below.
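As a minimal sketch of how such a probe might be scored, the snippet below computes a top-k hit rate over a handful of probes. The `hit_at_k` function, the `toy_ranker` stand-in for a language model, the candidate vocabulary, and the second and third probes are all illustrative assumptions and are not taken from the dataset or its official evaluation script.

```python
from typing import Callable, Dict, List

# Assumed answer vocabulary: number words for 0 to 10.
NUMBER_WORDS = ["zero", "one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"]

Ranker = Callable[[str, List[str]], List[str]]


def hit_at_k(probes: List[Dict[str, str]], rank_fn: Ranker, k: int = 1) -> float:
    """Return the fraction of probes whose target is among the top-k candidates."""
    if not probes:
        return 0.0
    hits = 0
    for probe in probes:
        ranked = rank_fn(probe["sentence"], NUMBER_WORDS)
        if probe["target"] in ranked[:k]:
            hits += 1
    return hits / len(probes)


def toy_ranker(sentence: str, candidates: List[str]) -> List[str]:
    """Hypothetical stand-in for a model: always ranks 'two' first, then 'four'."""
    preferred = ["two", "four"]
    return preferred + [c for c in candidates if c not in preferred]


probes = [
    {"sentence": "Black bears are about <mask> metres tall.", "target": "two"},
    # The two probes below are made up for illustration only.
    {"sentence": "A bicycle has <mask> wheels.", "target": "two"},
    {"sentence": "A car usually has <mask> wheels.", "target": "four"},
]

print(hit_at_k(probes, toy_ranker, k=1))
print(hit_at_k(probes, toy_ranker, k=2))
```

In practice `toy_ranker` would be replaced by a function that asks a masked language model to score each candidate word in the `<mask>` position.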
This dataset is in English.
Each instance consists of a sentence with a masked numerical value between 0 and 10 and (in the train set) a target. Example from the training set:
sentence: Black bears are about <mask> metres tall.
target: two
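The instance above can be modelled as a small record type. The sketch below is only an illustration: the `Probe` class, its `fill` method, and the convention of an empty `target` when no gold answer is available are assumptions, not part of the dataset's own tooling.

```python
from dataclasses import dataclass
from typing import Optional

MASK_TOKEN = "<mask>"


@dataclass(frozen=True)
class Probe:
    """One instance: a masked sentence and, when available, its target word."""
    sentence: str
    target: str = ""  # assumed to be empty when no gold answer is provided

    def fill(self, word: Optional[str] = None) -> str:
        """Return the sentence with the mask replaced by `word` (default: the target)."""
        word = self.target if word is None else word
        if not word:
            raise ValueError("no word available to fill the mask")
        if self.sentence.count(MASK_TOKEN) != 1:
            raise ValueError("expected exactly one mask token in the sentence")
        return self.sentence.replace(MASK_TOKEN, word)


probe = Probe(sentence="Black bears are about <mask> metres tall.", target="two")
print(probe.fill())       # fills the mask with the gold target
print(probe.fill("ten"))  # fills the mask with an arbitrary candidate
```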
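Each instance in the training set consists of two fields: sentence, the sentence with a number replaced by the <mask> token, and target, the masked number written as a word.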
The dataset includes the following pre-defined data splits:
The purpose of this dataset is "to study whether PTLMs capture numerical commonsense knowledge, i.e., commonsense knowledge that provides an understanding of the numeric relation between entities." This work is motivated by prior research exploring whether language models possess commonsense knowledge.
The dataset is an extension of the Open Mind Common Sense corpus. A query was performed to find sentences containing numbers between 0 and 12, after which the resulting sentences were manually checked for inaccuracies, typos, and the expression of commonsense knowledge. The numerical values were then masked.
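The query-and-mask step described above could look roughly like the following sketch. It is not the authors' pipeline: the regular expression, the `make_probes` helper, and the choice to emit one probe per number word are assumptions, and the manual inspection rounds are not reproduced here.

```python
import re
from typing import Dict, List

# Number words for 0 to 12, matching the range of the query described in the text.
NUMBER_WORDS = ["zero", "one", "two", "three", "four", "five", "six",
                "seven", "eight", "nine", "ten", "eleven", "twelve"]
NUMBER_PATTERN = re.compile(r"\b(" + "|".join(NUMBER_WORDS) + r")\b", re.IGNORECASE)


def make_probes(sentences: List[str]) -> List[Dict[str, str]]:
    """Emit one masked probe for every number word found in each sentence."""
    probes = []
    for sentence in sentences:
        for match in NUMBER_PATTERN.finditer(sentence):
            masked = sentence[:match.start()] + "<mask>" + sentence[match.end():]
            probes.append({"sentence": masked, "target": match.group(1).lower()})
    return probes


corpus = [
    "Black bears are about two metres tall.",
    "The sky is blue.",  # contains no number word, so it yields no probe
]
probes = make_probes(corpus)
print(probes)
```

The word-boundary anchors keep the pattern from matching number words embedded in longer words, such as "one" in "someone".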
Who are the source language producers?
The Open Mind Common Sense corpus, from which this dataset is sourced, is a crowdsourced dataset maintained by the MIT Media Lab.
No annotations are present in this dataset beyond the target values automatically sourced from the masked sentences, as discussed above.
Who are the annotators?
The curation and inspection were done in two rounds by graduate students.
[More Information Needed]
The motivation of measuring a model's ability to associate numerical values with real-world concepts appears relatively innocuous. However, as discussed in the following section, the source dataset may well have biases encoded from crowdworkers, particularly in terms of factoid coverage. A model's ability to perform well on this benchmark should therefore not be considered evidence that it is more unbiased or objective than a human performing similar tasks.
[More Information Needed]
This dataset is sourced from a crowdsourced commonsense knowledge base. While the information contained in the graph is generally considered to be of high quality, its coverage is considered to be very low as a representation of all possible commonsense knowledge. The representation of certain factoids may also be skewed by the demographics of the crowdworkers. As one possible example, the term "homophobia" is connected with "Islam" in the ConceptNet knowledge base, but not with any other religion or group, possibly due to the biases of crowdworkers contributing to the project.
[More Information Needed]
This dataset was collected by Bill Yuchen Lin, Seyeon Lee, Rahul Khanna, and Xiang Ren, Computer Science researchers at the University of Southern California.
The data is hosted in a GitHub repository under the MIT License.
@inproceedings{lin2020numersense,
  title={Birds have four legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-trained Language Models},
  author={Bill Yuchen Lin and Seyeon Lee and Rahul Khanna and Xiang Ren},
  booktitle={Proceedings of EMNLP},
  year={2020},
  note={to appear}
}
Thanks to @joeddav for adding this dataset.