Dataset: AmazonScience/mintaka
Task: question-answering
Sub-task: open-domain-qa
Size: 100K<n<1M
Language creators: found
Annotation creators: expert-generated
Source datasets: original
License: cc-by-4.0

Mintaka is a complex, natural, and multilingual question answering (QA) dataset composed of 20,000 question-answer pairs elicited from MTurk workers and annotated with Wikidata question and answer entities. Full details on the Mintaka dataset can be found in our paper: https://aclanthology.org/2022.coling-1.138/
To build Mintaka, we explicitly collected questions in 8 complexity types (ordinal, intersection, count, superlative, yesno, comparative, multihop, and difference), as well as generic questions.
Mintaka is one of the first large-scale, complex, natural, and multilingual datasets that can be used for end-to-end question-answering models.
The dataset can be used to train a model for question answering. To ensure comparability, please refer to our evaluation script here: https://github.com/amazon-science/mintaka#evaluation
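For reported results, please use the evaluation script linked above. Purely as an illustration of the basic idea, the sketch below computes a naive exact-match score over the answerText field; it is a hypothetical helper, not the official Mintaka metric.

```python
# Hypothetical, simplified scorer -- NOT the official Mintaka evaluation script.
# It only compares predicted answer strings to the gold `answerText` field.
def exact_match_accuracy(predictions, references):
    """Fraction of predictions matching the reference answer text exactly
    (case- and whitespace-insensitive)."""
    normalize = lambda s: " ".join(s.lower().split())
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

print(exact_match_accuracy(["mount lucania"], ["Mount Lucania"]))  # 1.0
```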
All questions were written in English and translated into 8 additional languages: Arabic, French, German, Hindi, Italian, Japanese, Portuguese, and Spanish.
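A minimal loading sketch using the Hugging Face `datasets` library is shown below. Whether languages are exposed as separate configurations or through the `lang` field of a single configuration should be checked on the dataset page; the filter on `lang` here is an assumption based on the schema described in this card.

```python
from collections import Counter
from datasets import load_dataset

# Load the dataset from the Hub (per-language configurations may also exist;
# this sketch assumes the default configuration and the documented `lang` field).
mintaka = load_dataset("AmazonScience/mintaka")

# Count training questions per language.
print(Counter(mintaka["train"]["lang"]))

# Keep only the Japanese translations, if present in the loaded configuration.
ja_train = mintaka["train"].filter(lambda ex: ex["lang"] == "ja")
print(ja_train[0] if len(ja_train) else "no Japanese rows in this configuration")
```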
An example of 'train' looks as follows.
{ "id": "a9011ddf", "lang": "en", "question": "What is the seventh tallest mountain in North America?", "answerText": "Mount Lucania", "category": "geography", "complexityType": "ordinal", "questionEntity": [ { "name": "Q49", "entityType": "entity", "label": "North America", "mention": "North America", "span": [40, 53] }, { "name": 7, "entityType": "ordinal", "mention": "seventh", "span": [12, 19] } ], "answerEntity": [ { "name": "Q1153188", "label": "Mount Lucania", } ], }
The data fields are the same among all splits.
id : a unique ID for the given sample.
lang : the language of the question.
question : the original question elicited in the corresponding language.
answerText : the original answer text elicited in English.
category : the category of the question. Options are: geography, movies, history, books, politics, music, videogames, or sports.
complexityType : the complexity type of the question. Options are: ordinal, intersection, count, superlative, yesno, comparative, multihop, difference, or generic.
questionEntity : a list of annotated question entities identified by crowd workers.
{ "name": The Wikidata Q-code or numerical value of the entity "entityType": The type of the entity. Options are: entity, cardinal, ordinal, date, time, percent, quantity, or money "label": The label of the Wikidata Q-code "mention": The entity as it appears in the English question text. Will be empty for non-English samples. "span": The start and end characters of the mention in the English question text. Will be empty for non-English samples. }
answerEntity : a list of annotated answer entities identified by crowd workers.
{ "name": The Wikidata Q-code or numerical value of the entity "label": The label of the Wikidata Q-code }
For each language, the data is split into train (14,000 samples), dev (2,000 samples), and test (4,000 samples) sets.
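The snippet below is a quick way to confirm these split sizes. It assumes the standard Hugging Face split names; in particular, treating "validation" as the name of the dev split is an assumption to verify on the dataset page.

```python
from datasets import load_dataset

mintaka = load_dataset("AmazonScience/mintaka")

# The card's train/dev/test correspond to 14,000 / 2,000 / 4,000 rows per
# language; "validation" as the name of the dev split is an assumption.
for split, per_language in [("train", 14_000), ("validation", 2_000), ("test", 4_000)]:
    languages = set(mintaka[split]["lang"])
    print(f"{split}: {len(mintaka[split])} rows across {len(languages)} languages "
          f"(expected {per_language} per language)")
```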
The corpus is free of personal or sensitive information.
The dataset was created by Amazon Alexa AI.
This project is licensed under the CC-BY-4.0 License.
Please cite the following paper when using this dataset.
@inproceedings{sen-etal-2022-mintaka,
    title = "Mintaka: A Complex, Natural, and Multilingual Dataset for End-to-End Question Answering",
    author = "Sen, Priyanka and Aji, Alham Fikri and Saffari, Amir",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.138",
    pages = "1604--1619"
}
Thanks to @afaji for adding this dataset.