Dataset:

ruanchaves/faquad-nli

Tasks:

Question Answering

Sub-tasks:

extractive-qa

Languages:

pt

Multilinguality:

monolingual

Size categories:

n<1K

Language creators:

found

Annotations creators:

expert-generated

Source datasets:

extended|wikipedia

License:

cc-by-4.0

Dataset Card for FaQuAD-NLI

Dataset Summary

FaQuAD is a Portuguese reading comprehension dataset that follows the format of the Stanford Question Answering Dataset (SQuAD). It is a pioneering Portuguese dataset in adopting the challenging SQuAD format. The dataset aims to address the large volume of questions sent by academics whose answers can already be found in publicly available institutional documents of the Brazilian higher education system. It consists of 900 questions about 249 reading passages taken from 18 official documents of a computer science college at a Brazilian federal university and from 21 Wikipedia articles related to the Brazilian higher education system.

FaQuAD-NLI is a modified version of the FaQuAD dataset that repurposes the question answering task as a textual entailment task between a question and its possible answers.
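To make the repurposing concrete, the sketch below shows one way a question with several candidate answers could be turned into labeled question/answer pairs. This card does not describe how FaQuAD-NLI was actually constructed, so the pairing strategy, the helper name `to_entailment_pairs`, and the toy sample are all assumptions made purely for illustration.

```python
from typing import Dict, List


def to_entailment_pairs(
    question: str, candidates: List[str], gold: str
) -> List[Dict[str, object]]:
    """Pair one question with every candidate answer.

    Hypothetical helper: a candidate is labeled 1 (suitable) when it
    equals the gold answer and 0 (unsuitable) otherwise.
    """
    return [
        {"question": question, "answer": candidate, "label": int(candidate == gold)}
        for candidate in candidates
    ]


# Toy sample, invented for illustration and not taken from the dataset.
pairs = to_entailment_pairs(
    question="Quantos créditos são exigidos?",
    candidates=["São exigidos 200 créditos.", "O curso dura quatro anos."],
    gold="São exigidos 200 créditos.",
)
for pair in pairs:
    print(pair["label"], pair["answer"])
```

Each resulting pair carries the same `question`, `answer`, and `label` keys that appear in the data fields of this card, which is what lets a binary classifier stand in for an extractive reader.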

Supported Tasks and Leaderboards

  • question_answering : The dataset can be used to train a model for question-answering tasks in the domain of Brazilian higher education institutions.
  • textual_entailment : FaQuAD-NLI can be used to train a model for textual entailment tasks, where answers in Q&A pairs are classified as either suitable or unsuitable.
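Because the textual entailment task reduces to a binary decision (suitable versus unsuitable), a model trained on it can be scored with standard binary classification metrics. The following is a minimal sketch of such a scorer; the card names no official metric or leaderboard, so the choice of accuracy, precision, recall, and F1, as well as the function name `binary_scores`, are assumptions.

```python
from typing import Dict, Sequence


def binary_scores(gold: Sequence[int], predicted: Sequence[int]) -> Dict[str, float]:
    """Score predictions where 1 means suitable and 0 means unsuitable."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and of equal length")

    true_pos = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    false_pos = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    false_neg = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)

    # Fall back to 0.0 when a denominator is zero (no positives predicted or present).
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    return {
        "accuracy": correct / len(gold),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels, invented for illustration.
scores = binary_scores(gold=[1, 0, 1, 0], predicted=[1, 0, 0, 0])
print(scores)
```

Reporting F1 alongside accuracy is a common precaution when the two classes are not equally frequent; whether that applies here depends on the label distribution, which this card does not state.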

Languages

This dataset is in Brazilian Portuguese.

Dataset Structure

Data Fields

  • document_index : an integer representing the index of the document.
  • document_title : a string containing the title of the document.
  • paragraph_index : an integer representing the index of the paragraph within the document.
  • question : a string containing the question related to the paragraph.
  • answer : a string containing the answer related to the question.
  • label : an integer (0 or 1) indicating whether the answer is suitable (1) or unsuitable (0) for the question.
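The fields listed above can be written down as a typed record, which makes it easy to check rows before feeding them to a model. This is a minimal sketch using only the standard library; the names `FaquadNliRow` and `validate_row` and the sample row are hypothetical and are not part of the dataset or of any library.

```python
from typing import Any, Dict, TypedDict


class FaquadNliRow(TypedDict):
    document_index: int
    document_title: str
    paragraph_index: int
    question: str
    answer: str
    label: int  # 1 = suitable answer, 0 = unsuitable answer


EXPECTED_TYPES = {
    "document_index": int,
    "document_title": str,
    "paragraph_index": int,
    "question": str,
    "answer": str,
    "label": int,
}


def validate_row(row: Dict[str, Any]) -> FaquadNliRow:
    """Check that a row has every documented field with the documented type."""
    for name, expected in EXPECTED_TYPES.items():
        if name not in row:
            raise KeyError(f"missing field: {name}")
        if not isinstance(row[name], expected):
            raise TypeError(f"{name} should be {expected.__name__}")
    if row["label"] not in (0, 1):
        raise ValueError("label must be 0 or 1")
    # Keep only the documented fields.
    return FaquadNliRow(**{name: row[name] for name in EXPECTED_TYPES})


# Sample row, invented for illustration and not taken from the dataset.
sample = {
    "document_index": 0,
    "document_title": "Regulamento do curso",
    "paragraph_index": 2,
    "question": "Quantos créditos são exigidos?",
    "answer": "São exigidos 200 créditos.",
    "label": 1,
}
validated = validate_row(sample)
print(validated["question"], "->", validated["label"])
```

In practice the rows would presumably come from the Hugging Face `datasets` library, for example via `load_dataset("ruanchaves/faquad-nli")`, rather than from a hand-written dictionary.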

Data Splits

The dataset is split into three subsets: train, validation, and test. The splits were constructed so that question and answer pairs belonging to the same document never appear in more than one split.

Split        Instances
Train        3128
Validation   731
Test         650
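The split sizes and the document-level separation described above can both be checked with a few lines of code. The sketch below computes each split's share of the total from the reported counts and defines a hypothetical helper, `shared_documents`, that reports any `document_index` found in more than one split. It assumes that `document_index` identifies a document consistently across splits, which the card implies but does not state outright; the toy rows are invented.

```python
from typing import Dict, Iterable, Mapping, Set

# Instance counts as reported in this card.
SPLIT_SIZES = {"train": 3128, "validation": 731, "test": 650}

total = sum(SPLIT_SIZES.values())
for name, size in SPLIT_SIZES.items():
    print(f"{name}: {size} instances ({size / total:.1%} of {total})")


def shared_documents(
    splits: Mapping[str, Iterable[Mapping[str, object]]]
) -> Dict[object, Set[str]]:
    """Return every document_index that occurs in more than one split."""
    seen: Dict[object, Set[str]] = {}
    for split_name, rows in splits.items():
        for row in rows:
            seen.setdefault(row["document_index"], set()).add(split_name)
    return {doc: names for doc, names in seen.items() if len(names) > 1}


# Toy splits, invented for illustration.
clean = {
    "train": [{"document_index": 0}, {"document_index": 1}],
    "validation": [{"document_index": 2}],
    "test": [{"document_index": 3}],
}
leaky = {
    "train": [{"document_index": 7}],
    "test": [{"document_index": 7}],
}
print(shared_documents(clean))
print(shared_documents(leaky))
```

An empty result for the real splits would be consistent with the claim that no document contributes pairs to more than one split.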

Contributions

Thanks to @ruanchaves for adding this dataset.