数据集:

allenai/qasper

任务:

问答

语言:

en

计算机处理:

monolingual

大小:

10K<n<100K

语言创建人:

expert-generated

批注创建人:

expert-generated

源数据集:

extended|s2orc

预印本库:

arxiv:2105.03011

许可:

cc-by-4.0
中文

Dataset Card for Qasper

Dataset Summary

QASPER is a dataset for question answering on scientific research papers. It consists of 5,049 questions over 1,585 Natural Language Processing papers. Each question is written by an NLP practitioner who read only the title and abstract of the corresponding paper, and the question seeks information present in the full text. The questions are then answered by a separate set of NLP practitioners who also provide supporting evidence to answers.

Supported Tasks and Leaderboards

  • question-answering : The dataset can be used to train a model for Question Answering. Success on this task is typically measured by achieving a high F1 score . The official baseline model currently achieves 33.63 Token F1 score & uses Longformer . This task has an active leaderboard which can be found here

  • evidence-selection : The dataset can be used to train a model for Evidence Selection. Success on this task is typically measured by achieving a high F1 score . The official baseline model currently achieves 39.37 F1 score & uses Longformer . This task has an active leaderboard which can be found here

Languages

English, as it is used in research papers.

Dataset Structure

Data Instances

A typical instance in the dataset:

{
  'id': "Paper ID (string)",
  'title': "Paper Title",
  'abstract': "paper abstract ...",
  'full_text': {
      'paragraphs':[["section1_paragraph1_text","section1_paragraph2_text",...],["section2_paragraph1_text","section2_paragraph2_text",...]],
      'section_name':["section1_title","section2_title"],...},
  'qas': {
  'answers':[{
      'annotation_id': ["q1_answer1_annotation_id","q1_answer2_annotation_id"]
      'answer': [{
          'unanswerable':False,
          'extractive_spans':["q1_answer1_extractive_span1","q1_answer1_extractive_span2"],
          'yes_no':False,
          'free_form_answer':"q1_answer1",
          'evidence':["q1_answer1_evidence1","q1_answer1_evidence2",..],
          'highlighted_evidence':["q1_answer1_highlighted_evidence1","q1_answer1_highlighted_evidence2",..]
          },
          {
          'unanswerable':False,
          'extractive_spans':["q1_answer2_extractive_span1","q1_answer2_extractive_span2"],
          'yes_no':False,
          'free_form_answer':"q1_answer2",
          'evidence':["q1_answer2_evidence1","q1_answer2_evidence2",..],
          'highlighted_evidence':["q1_answer2_highlighted_evidence1","q1_answer2_highlighted_evidence2",..]
          }],
      'worker_id':["q1_answer1_worker_id","q1_answer2_worker_id"]
      },{...["question2's answers"]..},{...["question3's answers"]..}],
  'question':["question1","question2","question3"...],
  'question_id':["question1_id","question2_id","question3_id"...],
  'question_writer':["question1_writer_id","question2_writer_id","question3_writer_id"...],
  'nlp_background':["question1_writer_nlp_background","question2_writer_nlp_background",...],
  'topic_background':["question1_writer_topic_background","question2_writer_topic_background",...],
  'paper_read': ["question1_writer_paper_read_status","question2_writer_paper_read_status",...],
  'search_query':["question1_search_query","question2_search_query","question3_search_query"...],
  }
}

Data Fields

The following is an excerpt from the dataset README:

Within "qas", some fields should be obvious. Here is some explanation about the others:

Fields specific to questions:
  • "nlp_background" shows the experience the question writer had. The values can be "zero" (no experience), "two" (0 - 2 years of experience), "five" (2 - 5 years of experience), and "infinity" (> 5 years of experience). The field may be empty as well, indicating the writer has chosen not to share this information.

  • "topic_background" shows how familiar the question writer was with the topic of the paper. The values are "unfamiliar", "familiar", "research" (meaning that the topic is the research area of the writer), or null.

  • "paper_read", when specified shows whether the questionwriter has read the paper.

  • "search_query", if not empty, is the query the question writer used to find the abstract of the paper from a large pool of abstracts we made available to them.

Fields specific to answers

Unanswerable answers have "unanswerable" set to true. The remaining answers have exactly one of the following fields being non-empty.

  • "extractive_spans" are spans in the paper which serve as the answer.
  • "free_form_answer" is a written out answer.
  • "yes_no" is true iff the answer is Yes, and false iff the answer is No.

"evidence" is the set of paragraphs, figures or tables used to arrive at the answer. Tables or figures start with the string "FLOAT SELECTED"

"highlighted_evidence" is the set of sentences the answer providers selected as evidence if they chose textual evidence. The text in the "evidence" field is a mapping from these sentences to the paragraph level. That is, if you see textual evidence in the "evidence" field, it is guaranteed to be entire paragraphs, while that is not the case with "highlighted_evidence".

Data Splits

Train Valid
Number of papers 888 281
Number of questions 2593 1005
Number of answers 2675 1764

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

NLP papers: The full text of the papers is extracted from S2ORC (Lo et al., 2020)

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

[More Information Needed]

Annotation process

[More Information Needed]

Who are the annotators?

"The annotators are NLP practitioners, not expert researchers, and it is likely that an expert would score higher"

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

Crowdsourced NLP practitioners

Licensing Information

CC BY 4.0

Citation Information

@inproceedings{Dasigi2021ADO,
  title={A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers},
  author={Pradeep Dasigi and Kyle Lo and Iz Beltagy and Arman Cohan and Noah A. Smith and Matt Gardner},
  year={2021}
}

Contributions

Thanks to @cceyda for adding this dataset.