Dataset: mkqa

Tasks: question-answering

Sub-tasks: open-domain-qa

Size: 10K<n<100K

Language creators: found

Annotation creators: crowdsourced

Preprint: arxiv:2007.15207

License: cc-by-3.0

Dataset Card for MKQA: Multilingual Knowledge Questions & Answers

Dataset Summary

MKQA contains 10,000 queries sampled from the Google Natural Questions dataset.

For each query, we collect new passage-independent answers. These queries and answers are then human-translated into 25 non-English languages.

Supported Tasks and Leaderboards

question-answering

Languages

Language code Language name
ar Arabic
da Danish
de German
en English
es Spanish
fi Finnish
fr French
he Hebrew
hu Hungarian
it Italian
ja Japanese
ko Korean
km Khmer
ms Malay
nl Dutch
no Norwegian
pl Polish
pt Portuguese
ru Russian
sv Swedish
th Thai
tr Turkish
vi Vietnamese
zh_cn Chinese (Simplified)
zh_hk Chinese (Hong Kong)
zh_tw Chinese (Traditional)

Dataset Structure

Data Instances

An example from the dataset looks as follows:

{
 'example_id': 563260143484355911,
 'queries': {
  'en': "who sings i hear you knocking but you can't come in",
  'ru': "кто поет i hear you knocking but you can't come in",
  'ja': '「 I hear you knocking」は誰が歌っていますか',
  'zh_cn': "《i hear you knocking but you can't come in》是谁演唱的",
  ...
 },
 'query': "who sings i hear you knocking but you can't come in",
 'answers': {'en': [{'type': 'entity',
    'entity': 'Q545186',
    'text': 'Dave Edmunds',
    'aliases': []}],
  'ru': [{'type': 'entity',
    'entity': 'Q545186',
    'text': 'Эдмундс, Дэйв',
    'aliases': ['Эдмундс', 'Дэйв Эдмундс', 'Эдмундс Дэйв', 'Dave Edmunds']}],
  'ja': [{'type': 'entity',
    'entity': 'Q545186',
    'text': 'デイヴ・エドモンズ',
    'aliases': ['デーブ・エドモンズ', 'デイブ・エドモンズ']}],
  'zh_cn': [{'type': 'entity', 'text': '戴维·埃德蒙兹 ', 'entity': 'Q545186'}],
  ...
  },
}

Data Fields

Each example in the dataset contains the unique Natural Questions example_id, the original English query, and then queries and answers in 26 languages. Each answer is labelled with an answer type. The breakdown is:

Answer Type Occurrence
entity 4221
long_answer 1815
unanswerable 1427
date 1174
number 485
number_with_unit 394
short_phrase 346
binary 138

For each language, there can be more than one acceptable textual answer, in order to capture a variety of possible valid answers.
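
Because several answers may be acceptable, a prediction is usually compared against all of them. The sketch below shows one way to do this, assuming records shaped like the example under Data Instances. The helper names (`acceptable_answers`, `normalize`, `is_match`) and the normalization rule are hypothetical and are not the official MKQA evaluation script.

```python
def acceptable_answers(example, lang):
    """Collect every accepted answer string for one language."""
    accepted = []
    for answer in example["answers"].get(lang, []):
        accepted.append(answer["text"])
        # 'aliases' may be missing in some records, so default to an empty list.
        accepted.extend(answer.get("aliases", []))
    return accepted


def normalize(text):
    """Hypothetical normalization: trim, collapse whitespace, casefold."""
    return " ".join(text.split()).casefold()


def is_match(prediction, example, lang):
    """True if the prediction equals any accepted answer after normalization."""
    targets = {normalize(a) for a in acceptable_answers(example, lang)}
    return normalize(prediction) in targets


# A trimmed copy of the record shown under Data Instances.
example = {
    "answers": {
        "en": [{"type": "entity", "entity": "Q545186",
                "text": "Dave Edmunds", "aliases": []}],
        "ja": [{"type": "entity", "entity": "Q545186",
                "text": "デイヴ・エドモンズ",
                "aliases": ["デーブ・エドモンズ", "デイブ・エドモンズ"]}],
    }
}

print(is_match("dave edmunds", example, "en"))        # True
print(is_match("デイブ・エドモンズ", example, "ja"))  # True (matches an alias)
print(is_match("Elvis Presley", example, "en"))       # False
```

Exact string matching after light normalization is a deliberately simple choice here; a real evaluation may need language-specific tokenization or partial-credit scoring.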

Detailed explanation of fields taken from here

When the entity field is not available, it is set to an empty string ''. When the aliases field is not available, it is set to an empty list [].
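
The default-value convention above can be applied to raw answer dictionaries that omit these keys. The following is a minimal sketch, assuming each answer is a plain Python dict; `fill_defaults` and `has_entity` are hypothetical helper names, and the second sample record is invented purely for illustration.

```python
def fill_defaults(answer):
    """Return a copy of one answer dict with 'entity' and 'aliases' always present."""
    filled = dict(answer)
    filled.setdefault("entity", "")    # missing entity -> empty string
    filled.setdefault("aliases", [])   # missing aliases -> empty list
    return filled


def has_entity(answer):
    """True if the answer carries a non-empty entity identifier."""
    return fill_defaults(answer)["entity"] != ""


# An answer without 'aliases', as in the zh_cn entry of the example record.
zh_answer = {"type": "entity", "text": "戴维·埃德蒙兹", "entity": "Q545186"}

# A hypothetical answer with neither 'entity' nor 'aliases' (illustration only).
phrase_answer = {"type": "short_phrase", "text": "an illustrative phrase"}

print(fill_defaults(zh_answer)["aliases"])
print(has_entity(zh_answer), has_entity(phrase_answer))
```

Filling the defaults on a copy leaves the original record unchanged, so downstream code can rely on both keys without mutating the loaded data.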

Data Splits

  • Train: 10000

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Google Natural Questions dataset

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

CC BY-SA 3.0

Citation Information

@misc{mkqa,
    title = {MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering},
    author = {Shayne Longpre and Yi Lu and Joachim Daiber},
    year = {2020},
    URL = {https://arxiv.org/pdf/2007.15207.pdf}
}

Contributions

Thanks to @cceyda for adding this dataset.