Dataset: mkqa
Tasks: question-answering
Sub-tasks: open-domain-qa
Size: 10K<n<100K
Language creators: found
Annotation creators: crowdsourced
Preprint: arxiv:2007.15207
License: cc-by-3.0

MKQA contains 10,000 queries sampled from the Google Natural Questions dataset.
For each query we collect new passage-independent answers. These queries and answers are then human-translated into 25 non-English languages.
Supported tasks: question-answering
Language code | Language name |
---|---|
ar | Arabic |
da | Danish |
de | German |
en | English |
es | Spanish |
fi | Finnish |
fr | French |
he | Hebrew |
hu | Hungarian |
it | Italian |
ja | Japanese |
ko | Korean |
km | Khmer |
ms | Malay |
nl | Dutch |
no | Norwegian |
pl | Polish |
pt | Portuguese |
ru | Russian |
sv | Swedish |
th | Thai |
tr | Turkish |
vi | Vietnamese |
zh_cn | Chinese (Simplified) |
zh_hk | Chinese (Hong Kong) |
zh_tw | Chinese (Traditional) |
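For programmatic use, the codes in the table above can be kept in a single list. This is a minimal sketch: `MKQA_LANGUAGES` and `NON_ENGLISH` are hypothetical names introduced here for illustration, not constants shipped with the dataset.

```python
# Language codes copied from the table above.
# MKQA_LANGUAGES is a hypothetical name used only in this sketch.
MKQA_LANGUAGES = [
    "ar", "da", "de", "en", "es", "fi", "fr", "he", "hu", "it",
    "ja", "ko", "km", "ms", "nl", "no", "pl", "pt", "ru", "sv",
    "th", "tr", "vi", "zh_cn", "zh_hk", "zh_tw",
]

# English is the source language; every other code is a translation target.
NON_ENGLISH = [code for code in MKQA_LANGUAGES if code != "en"]

print(len(MKQA_LANGUAGES), len(NON_ENGLISH))  # 26 25
```

The two counts line up with the description: 26 languages in total, 25 of them non-English.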
An example from the dataset looks as follows:
    {
        'example_id': 563260143484355911,
        'queries': {
            'en': "who sings i hear you knocking but you can't come in",
            'ru': "кто поет i hear you knocking but you can't come in",
            'ja': '「 I hear you knocking」は誰が歌っていますか',
            'zh_cn': "《i hear you knocking but you can't come in》是谁演唱的",
            ...
        },
        'query': "who sings i hear you knocking but you can't come in",
        'answers': {
            'en': [{'type': 'entity', 'entity': 'Q545186', 'text': 'Dave Edmunds', 'aliases': []}],
            'ru': [{'type': 'entity', 'entity': 'Q545186', 'text': 'Эдмундс, Дэйв', 'aliases': ['Эдмундс', 'Дэйв Эдмундс', 'Эдмундс Дэйв', 'Dave Edmunds']}],
            'ja': [{'type': 'entity', 'entity': 'Q545186', 'text': 'デイヴ・エドモンズ', 'aliases': ['デーブ・エドモンズ', 'デイブ・エドモンズ']}],
            'zh_cn': [{'type': 'entity', 'text': '戴维·埃德蒙兹 ', 'entity': 'Q545186'}],
            ...
        },
    }
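The record above can be navigated with ordinary dictionary access. The following is a minimal sketch that uses a trimmed copy of that record (two languages only); `query_and_answers` is a hypothetical helper written for this illustration and is not part of any MKQA tooling.

```python
# A trimmed copy of the record shown above (only two languages kept).
example = {
    "example_id": 563260143484355911,
    "queries": {
        "en": "who sings i hear you knocking but you can't come in",
        "ja": "「 I hear you knocking」は誰が歌っていますか",
    },
    "query": "who sings i hear you knocking but you can't come in",
    "answers": {
        "en": [{"type": "entity", "entity": "Q545186",
                "text": "Dave Edmunds", "aliases": []}],
        "ja": [{"type": "entity", "entity": "Q545186",
                "text": "デイヴ・エドモンズ",
                "aliases": ["デーブ・エドモンズ", "デイブ・エドモンズ"]}],
    },
}


def query_and_answers(record, lang):
    """Return the query string and the list of answer dicts for one language."""
    return record["queries"][lang], record["answers"][lang]


query, answers = query_and_answers(example, "ja")
print(query)
for answer in answers:
    print(answer["type"], answer["entity"], answer["text"])
```

Both `queries` and `answers` are keyed by language code, so the same lookup works for any language present in a record.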
Each example in the dataset contains the unique Natural Questions example_id, the original English query, and then queries and answers in 26 languages. Each answer is labelled with an answer type. The breakdown is:
Answer Type | Occurrence |
---|---|
entity | 4221 |
long_answer | 1815 |
unanswerable | 1427 |
date | 1174 |
number | 485 |
number_with_unit | 394 |
short_phrase | 346 |
binary | 138 |
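As a quick sanity check on the table above, the counts can be summed and converted to shares. This sketch only re-uses the numbers listed in the table; `ANSWER_TYPE_COUNTS` is a hypothetical name introduced for the illustration.

```python
# Counts copied from the answer type table above.
ANSWER_TYPE_COUNTS = {
    "entity": 4221,
    "long_answer": 1815,
    "unanswerable": 1427,
    "date": 1174,
    "number": 485,
    "number_with_unit": 394,
    "short_phrase": 346,
    "binary": 138,
}

total = sum(ANSWER_TYPE_COUNTS.values())
shares = {name: count / total for name, count in ANSWER_TYPE_COUNTS.items()}

print(total)  # 10000
for name, share in shares.items():
    print(f"{name:<18}{share:6.1%}")
```

The total of 10,000 matches the number of queries in the dataset, and entity answers make up roughly 42% of them.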
For each language, there can be more than one acceptable textual answer, in order to capture a variety of possible valid answers.
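One way to use the multiple acceptable answers is to gather each answer's `text` and `aliases` into a set and compare a prediction against it. The sketch below is an assumption about how a simple exact-match check could look, not the official MKQA evaluation; `acceptable_strings`, `normalize`, and `is_exact_match` are hypothetical helpers, and the whitespace and case normalization is an illustrative choice.

```python
def acceptable_strings(answers):
    """Collect every text and alias from a list of MKQA-style answer dicts."""
    strings = set()
    for answer in answers:
        if answer.get("text"):
            strings.add(answer["text"])
        # Some records omit 'aliases'; treat that as having no aliases.
        strings.update(answer.get("aliases") or [])
    return strings


def normalize(text):
    """Illustrative normalization: trim, collapse whitespace, casefold."""
    return " ".join(text.split()).casefold()


def is_exact_match(prediction, answers):
    """True if the prediction equals any acceptable string after normalization."""
    gold = {normalize(s) for s in acceptable_strings(answers)}
    return normalize(prediction) in gold


# Russian answer taken from the example record in this card.
ru_answers = [{
    "type": "entity",
    "entity": "Q545186",
    "text": "Эдмундс, Дэйв",
    "aliases": ["Эдмундс", "Дэйв Эдмундс", "Эдмундс Дэйв", "Dave Edmunds"],
}]

print(is_exact_match("dave edmunds", ru_answers))   # True
print(is_exact_match("Elvis Presley", ru_answers))  # False
```

Treating aliases as equally valid means a prediction in Latin script can still match a Russian-language answer when that spelling is listed as an alias.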
Detailed explanation of fields taken from here
When the entity field is not available, it is set to an empty string ''. When the aliases field is not available, it is set to an empty list [].
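The defaulting convention above can be mirrored when working with raw answer dicts that omit these keys, such as the zh_cn answer in the example record. This is a minimal sketch; `with_defaults` is a hypothetical helper, not part of the dataset's loading code.

```python
def with_defaults(answer):
    """Return a copy of an answer dict with 'entity' and 'aliases' always present.

    Mirrors the convention described above: a missing entity becomes ''
    and missing aliases become [].
    """
    filled = dict(answer)  # shallow copy so the input is not mutated
    filled.setdefault("entity", "")
    filled.setdefault("aliases", [])
    return filled


# zh_cn answer from the example record; it has no 'aliases' key.
raw = {"type": "entity", "text": "戴维·埃德蒙兹 ", "entity": "Q545186"}
print(with_defaults(raw)["aliases"])  # []

# Hypothetical minimal record with neither field, for illustration only.
print(with_defaults({"type": "number"})["entity"] == "")  # True
```

With both keys guaranteed, downstream code can read `answer["entity"]` and iterate over `answer["aliases"]` without special-casing missing fields.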
Source data: Google Natural Questions dataset
    @misc{mkqa,
        title = {MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering},
        author = {Shayne Longpre and Yi Lu and Joachim Daiber},
        year = {2020},
        URL = {https://arxiv.org/pdf/2007.15207.pdf}
    }
Thanks to @cceyda for adding this dataset.