Dataset:

inverse-scaling/NeQA


NeQA: Can Large Language Models Understand Negation in Multi-choice Questions? (Zhengping Zhou and Yuhui Zhang)

General description

This task takes an existing multiple-choice dataset and negates a part of each question to see if language models are sensitive to negation. The authors find that smaller language models display approximately random performance, whereas the performance of larger models becomes significantly worse than random.

Language models failing to follow instructions in the prompt could be a serious issue that only becomes apparent on a task once models are capable enough to perform non-randomly on it.

Example

The following are multiple choice questions (with answers) about common sense.

Question: If a cat has a body temp that is below average, it isn't in

A. danger

B. safe ranges

Answer:

(where the model should choose B.)

Submission details

Task description

Negation is a common linguistic phenomenon that can completely alter the semantics of a sentence by changing just a few words. This task evaluates whether language models can understand negation, an important step towards true natural language understanding. Specifically, we focus on negation in open-book multi-choice questions, given its wide range of applications and the simplicity of evaluation. We collect a multi-choice question answering dataset, NeQA, that includes questions with negations. When negation is introduced into a question, the original correct answer becomes wrong, and the wrong answer becomes correct. We use accuracy to examine whether the model can understand the negation in a question and select the correct answer in its presence. We observe a clear inverse scaling trend on GPT-3, demonstrating that larger language models can answer more complex questions but fail at the final step of understanding negation.
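
To make the evaluation concrete, here is a minimal sketch of how accuracy could be computed for this prompt format by scoring each answer letter as a continuation and picking the higher-probability one. GPT-2 via Hugging Face transformers stands in for the GPT-3 models actually evaluated, and the helper name choose_answer is illustrative rather than part of the submission's code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 stands in here for the GPT-3 model sizes evaluated in the submission.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def choose_answer(prompt: str, options=(" A", " B")) -> str:
    """Score each answer letter as a continuation of the prompt and
    return the higher-probability one (assumes each option is one token)."""
    scores = []
    for opt in options:
        ids = tokenizer(prompt + opt, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        # log-probability assigned to the final token (the answer letter)
        logprob = torch.log_softmax(logits[0, -2], dim=-1)[ids[0, -1]]
        scores.append(logprob.item())
    return "AB"[scores.index(max(scores))]

prompt = (
    "The following are multiple choice questions (with answers) about common sense.\n\n"
    "Question: If a cat has a body temp that is below average, it isn't in\n"
    "A. danger\n"
    "B. safe ranges\n"
    "Answer:"
)
print(choose_answer(prompt))
# Accuracy is then the fraction of questions for which the chosen letter
# matches the (post-negation) gold label.
```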

Dataset generation procedure

The dataset is created by applying rules to transform questions from a publicly available multiple-choice question answering dataset, OpenBookQA. We use a simple rule: filter for questions containing "is" and add "not" after it. For each negated question, we sample one of the originally incorrect answers to serve as the new correct answer and treat the original correct answer as incorrect. We randomly sample 300 questions and balance the label distribution (50% labeled "A" and 50% labeled "B", since each question has two choices).
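
The following is a minimal sketch of that transformation, assuming OpenBookQA-style fields (a question stem, a correct answer, and a list of distractors); the helper names and the simple shuffling used for label balance are illustrative rather than the authors' released code.

```python
import random

random.seed(0)

def negate_question(question: str) -> str:
    """Insert "not" after the first "is" in the question stem.
    This mirrors the simple rule described above; questions without
    "is" are assumed to be filtered out beforehand."""
    tokens = question.split()
    for i, tok in enumerate(tokens):
        if tok.lower() == "is":
            return " ".join(tokens[: i + 1] + ["not"] + tokens[i + 1:])
    return question

def build_neqa_example(question: str, correct: str, distractors: list[str]) -> dict:
    """Negate the question, then swap answer roles: a sampled originally
    incorrect answer becomes the new correct answer, and the original
    correct answer becomes the distractor."""
    new_correct = random.choice(distractors)
    choices = [correct, new_correct]
    random.shuffle(choices)  # roughly balances how often the gold label is "A" vs "B"
    return {
        "question": negate_question(question),
        "choices": choices,
        "answer": "AB"[choices.index(new_correct)],
    }
```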

Why do you expect to see inverse scaling?

For open-book question answering, larger language models usually achieve better accuracy because more factual and commonsense knowledge is stored in the model parameters and can be used as a knowledge base to answer these questions without context. A higher accuracy rate means a lower chance of choosing the wrong answer. Can we turn the originally wrong answer into the correct one? A simple solution is to negate the original question. If the model cannot understand negation, it will still predict the same answer and will therefore exhibit an inverse scaling trend. We expect that the model cannot understand negation because negation introduces only a small perturbation to the model input, and it is difficult for the model to recognize that this small perturbation leads to completely different semantics.

Why is the task important?

This task is important because it demonstrates that current language models cannot understand negation, a very common linguistic phenomenon and a real-world challenge to natural language understanding.

Why is the task novel or surprising?

To the best of our knowledge, no prior work shows that negation can cause inverse scaling. This finding should be surprising to the community: large language models show an incredible variety of emergent capabilities, yet still fail to understand negation, a fundamental concept in language.

Results

Inverse Scaling Prize: Round 1 Winners announcement