Dataset: xor_tydi_qa
Tasks: question-answering
Sub-tasks: open-domain-qa
Multilinguality: multilingual
Size: 10K<n<100K
Annotations creators: crowdsourced
ArXiv: arxiv:2010.11856
License: mit

XOR-TyDi QA brings together, for the first time, information-seeking questions, open-retrieval QA, and multilingual QA to create a multilingual open-retrieval QA dataset that enables cross-lingual answer retrieval. It consists of questions written by information-seeking native speakers in 7 typologically diverse languages and answer annotations that are retrieved from multilingual document collections.
There are three sub-tasks: XOR-Retrieve, XOR-EnglishSpan, and XOR-Full.
XOR-Retrieve: XOR-Retrieve is a cross-lingual retrieval task where a question is written in a target language (e.g., Japanese) and a system is required to retrieve English paragraphs that answer the question. The dataset can be used to train a model for cross-lingual retrieval. Success on this task is typically measured by R@5kt and R@2kt, the recall computed as the fraction of questions for which the minimal answer is contained in the top 5,000 or 2,000 tokens selected. This task has an active leaderboard, which can be found at leaderboard url
XOR-EnglishSpan: XOR-EnglishSpan is a cross-lingual retrieval task where a question is written in a target language (e.g., Japanese) and a system is required to output a short answer in English. The dataset can be used to train a model for cross-lingual retrieval. Success on this task is typically measured by F1 and EM. This task has an active leaderboard, which can be found at leaderboard url
XOR-Full: XOR-Full is a cross-lingual retrieval task where a question is written in a target language (e.g., Japanese) and a system is required to output a short answer in that same target language. Success on this task is typically measured by F1, EM, and BLEU. This task has an active leaderboard, which can be found at leaderboard url
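The metrics named above can be sketched in a few lines. This is only an illustrative approximation, not the official evaluation script: the function names, the normalization (lowercasing, stripping ASCII punctuation), the whitespace tokenization, and the substring test for "answer contained in the top k tokens" are all assumptions made for this example.

```python
import string
from collections import Counter
from typing import List, Sequence


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: Sequence[str]) -> float:
    """EM: 1.0 if the normalized prediction equals any normalized gold answer."""
    pred = normalize(prediction)
    return float(any(pred == normalize(gold) for gold in gold_answers))


def token_f1(prediction: str, gold_answers: Sequence[str]) -> float:
    """F1: best token-overlap F1 between the prediction and any gold answer."""
    pred_tokens = normalize(prediction).split()
    best = 0.0
    for gold in gold_answers:
        gold_tokens = normalize(gold).split()
        overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(gold_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


def recall_at_k_tokens(
    ranked_paragraphs: Sequence[List[str]],
    gold_answers: Sequence[List[str]],
    k: int,
) -> float:
    """R@kt: fraction of questions whose answer appears in the first k retrieved tokens.

    ranked_paragraphs[i] is the ranked paragraph list for question i;
    gold_answers[i] is the list of acceptable answers for question i.
    """
    hits = 0
    for paragraphs, answers in zip(ranked_paragraphs, gold_answers):
        # Concatenate paragraphs in rank order and keep only the first k tokens.
        window = " ".join(" ".join(paragraphs).split()[:k])
        if any(answer in window for answer in answers):
            hits += 1
    return hits / len(ranked_paragraphs)
```

With this sketch, R@5kt and R@2kt correspond to calling `recall_at_k_tokens` with `k=5000` and `k=2000`. Whitespace tokenization is a poor fit for languages such as Japanese, so a real evaluation would need a language-appropriate tokenizer.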
The text in the dataset is available in 7 languages: Arabic (ar), Bengali (bn), Finnish (fi), Japanese (ja), Korean (ko), Russian (ru), and Telugu (te).
A typical data point comprises a question, its answer, the language of the question text, and the split to which it belongs.
{ "id": "-3979399588609321314", "question": "Сколько детей было у Наполео́на I Бонапа́рта?", "answers": ["сын"], "lang": "ru", "split": "train" }
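A record of this shape can be read with the standard library alone. The sketch below is illustrative: the `XorExample` class and `parse_example` helper are hypothetical names introduced here, and the check against the seven language codes is an assumption based on the language list above, not part of any official loader.

```python
import json
from dataclasses import dataclass
from typing import List

# Language codes listed in this card (used here only for a sanity check).
LANGS = {"ar", "bn", "fi", "ja", "ko", "ru", "te"}


@dataclass
class XorExample:
    id: str
    question: str
    answers: List[str]
    lang: str
    split: str


def parse_example(line: str) -> XorExample:
    """Parse one JSON record into an XorExample, rejecting unknown language codes."""
    record = json.loads(line)
    if record["lang"] not in LANGS:
        raise ValueError(f"unexpected language code: {record['lang']}")
    return XorExample(
        id=str(record["id"]),
        question=record["question"],
        answers=list(record["answers"]),
        lang=record["lang"],
        split=record["split"],
    )


raw = (
    '{"id": "-3979399588609321314", '
    '"question": "Сколько детей было у Наполео́на I Бонапа́рта?", '
    '"answers": ["сын"], "lang": "ru", "split": "train"}'
)
example = parse_example(raw)
print(example.lang, example.answers)
```

Note that `answers` is a list, so a question may carry more than one acceptable answer.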
The data is split into a training, validation and test set for each of the two configurations.
| | train | validation | test |
|---|---|---|---|
| XOR Retrieve | 15250 | 2113 | 2501 |
| XOR Full | 61360 | 3179 | 8177 |
This task framework reflects real-world scenarios well, in which a QA system uses multilingual document collections and answers questions asked by users with diverse linguistic and cultural backgrounds. Despite the common assumption that we can find answers in the target language, web resources in non-English languages are largely limited compared to English (information scarcity), or the contents are biased towards their own cultures (information asymmetry). To address these issues, XOR-TyDi QA (Asai et al., 2020) provides a benchmark for developing a multilingual QA system that finds answers in multiple languages.
The annotation pipeline consists of four steps: 1) collection of realistic questions that require cross-lingual references, by annotating questions from TyDi QA without a same-language answer; 2) question translation from a target language to the pivot language of English, where the missing information may exist; 3) answer span selection in the pivot language given a set of candidate documents; 4) answer verification and translation from the pivot language back to the original language.
Initial Data Collection and Normalization: [More Information Needed]
Who are the source language producers? The dataset is created by extending the TyDiQA dataset and translating the questions into other languages. The answers are obtained by crowdsourcing the questions to Mechanical Turk workers.
The English questions from TyDiQA are translated into other languages. The languages are chosen based on the availability of Wikipedia data and the availability of translators.
Who are the annotators? The translations are carried out using the professional translation service Gengo (https://gengo.com), and the answers are annotated by Mechanical Turk workers.
The dataset is created from Wikipedia content, and the QA task requires preserving the named entities, so all Wikipedia named entities are preserved in the data. Not much information has been provided about masking sensitive information.
The people associated with the creation of the dataset are Akari Asai, Jungo Kasai, Jonathan H. Clark, Kenton Lee, Eunsol Choi, and Hannaneh Hajishirzi.
XOR-TyDi QA is distributed under the CC BY-SA 4.0 license.
@article{xorqa, title = {XOR QA: Cross-lingual Open-Retrieval Question Answering}, author = {Akari Asai and Jungo Kasai and Jonathan H. Clark and Kenton Lee and Eunsol Choi and Hannaneh Hajishirzi}, year = {2020} }
Thanks to @sumanthd17 for adding this dataset.