Dataset Card for "mlqa"
Dataset Summary
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances per language (12K in English) in SQuAD format in seven languages: English,
Arabic, German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel: on average, each QA instance
is parallel across 4 different languages.
Supported Tasks and Leaderboards
More Information Needed
Languages
MLQA contains QA instances in 7 languages: English, Arabic, German, Spanish, Hindi, Vietnamese and Simplified Chinese.
Dataset Structure
Data Instances
mlqa-translate-test.ar
- Size of downloaded dataset files: 10.08 MB
- Size of the generated dataset: 5.48 MB
- Total amount of disk used: 15.56 MB
An example of 'test' looks as follows.
mlqa-translate-test.de
- Size of downloaded dataset files: 10.08 MB
- Size of the generated dataset: 3.88 MB
- Total amount of disk used: 13.96 MB
An example of 'test' looks as follows.
mlqa-translate-test.es
- Size of downloaded dataset files: 10.08 MB
- Size of the generated dataset: 3.92 MB
- Total amount of disk used: 13.99 MB
An example of 'test' looks as follows.
mlqa-translate-test.hi
- Size of downloaded dataset files: 10.08 MB
- Size of the generated dataset: 4.61 MB
- Total amount of disk used: 14.68 MB
An example of 'test' looks as follows.
mlqa-translate-test.vi
- Size of downloaded dataset files: 10.08 MB
- Size of the generated dataset: 6.00 MB
- Total amount of disk used: 16.07 MB
An example of 'test' looks as follows.
Data Fields
The data fields are the same among all splits.
mlqa-translate-test.ar
- context: a string feature.
- question: a string feature.
- answers: a dictionary feature containing:
  - answer_start: an int32 feature.
  - text: a string feature.
- id: a string feature.
mlqa-translate-test.de
- context: a string feature.
- question: a string feature.
- answers: a dictionary feature containing:
  - answer_start: an int32 feature.
  - text: a string feature.
- id: a string feature.
mlqa-translate-test.es
- context: a string feature.
- question: a string feature.
- answers: a dictionary feature containing:
  - answer_start: an int32 feature.
  - text: a string feature.
- id: a string feature.
mlqa-translate-test.hi
- context: a string feature.
- question: a string feature.
- answers: a dictionary feature containing:
  - answer_start: an int32 feature.
  - text: a string feature.
- id: a string feature.
mlqa-translate-test.vi
- context: a string feature.
- question: a string feature.
- answers: a dictionary feature containing:
  - answer_start: an int32 feature.
  - text: a string feature.
- id: a string feature.
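As a minimal sketch of how these fields fit together, the snippet below builds a hypothetical record (invented text, not taken from MLQA) and checks that each answer can be recovered from the context. It assumes that answers holds parallel lists of answer_start and text values and that answer_start is a character offset into context, as in SQuAD-style data. Neither detail is stated in this card, so verify both against real rows. The names Record, Answers and misaligned_answers are illustrative only.

```python
from typing import List, TypedDict


class Answers(TypedDict):
    # Assumption: parallel lists, one entry per reference answer.
    answer_start: List[int]
    text: List[str]


class Record(TypedDict):
    context: str
    question: str
    answers: Answers
    id: str


def misaligned_answers(record: Record) -> List[int]:
    """Return indices of answers whose text is not found at answer_start."""
    context = record["context"]
    starts = record["answers"]["answer_start"]
    texts = record["answers"]["text"]
    bad = []
    for i, (start, text) in enumerate(zip(starts, texts)):
        # Assumption: answer_start is a character offset into context.
        if context[start:start + len(text)] != text:
            bad.append(i)
    return bad


# Hypothetical record for illustration only; not an actual MLQA row.
example: Record = {
    "context": "The Nile flows north through Egypt.",
    "question": "Through which country does the Nile flow?",
    "answers": {"answer_start": [29], "text": ["Egypt"]},
    "id": "hypothetical-0001",
}

print(misaligned_answers(example))
```

An empty list means every answer span matches its offset. Running the same check over a real configuration is a quick way to confirm whether the offset assumption holds there.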
Data Splits
| name | test |
| mlqa-translate-test.ar | 5335 |
| mlqa-translate-test.de | 4517 |
| mlqa-translate-test.es | 5253 |
| mlqa-translate-test.hi | 4918 |
| mlqa-translate-test.vi | 5495 |
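The configurations in the table above could be loaded along the following lines. This is a sketch under assumptions: it relies on the third-party Hugging Face datasets library, needs network access, and uses "mlqa" as the Hub identifier because that is the name in this card's title. The identifier may be namespaced differently on the Hub, and recent library versions may need extra arguments for script-based datasets. The helper names and the EXPECTED_TEST_ROWS mapping are illustrative; the row counts are copied from the table.

```python
# Row counts for the test split, copied from the Data Splits table.
EXPECTED_TEST_ROWS = {
    "ar": 5335,
    "de": 4517,
    "es": 5253,
    "hi": 4918,
    "vi": 5495,
}


def config_name(lang: str) -> str:
    """Build the configuration name used in this card for a language code."""
    return f"mlqa-translate-test.{lang}"


def load_translate_test(lang: str):
    """Load one translate-test configuration (requires network access)."""
    # Imported lazily so the rest of the module works without the library.
    from datasets import load_dataset

    # Assumption: "mlqa" is the Hub identifier; adjust if it differs.
    dataset = load_dataset("mlqa", config_name(lang), split="test")
    expected = EXPECTED_TEST_ROWS[lang]
    if dataset.num_rows != expected:
        raise ValueError(
            f"{config_name(lang)}: got {dataset.num_rows} rows, expected {expected}"
        )
    return dataset


# Example call (not executed here because it downloads data):
# arabic_test = load_translate_test("ar")
```

Checking the row count against the table at load time gives an early warning if the hosted data and this card drift apart.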
Dataset Creation
Curation Rationale
More Information Needed
Source Data
Initial Data Collection and Normalization
More Information Needed
Who are the source language producers?
More Information Needed
Annotations
Annotation process
More Information Needed
Who are the annotators?
More Information Needed
Personal and Sensitive Information
More Information Needed
Considerations for Using the Data
Social Impact of Dataset
More Information Needed
Discussion of Biases
More Information Needed
Other Known Limitations
More Information Needed
Additional Information
Dataset Curators
More Information Needed
Licensing Information
More Information Needed
Citation Information
@article{lewis2019mlqa,
  title = {MLQA: Evaluating Cross-lingual Extractive Question Answering},
  author = {Lewis, Patrick and Oguz, Barlas and Rinott, Ruty and Riedel, Sebastian and Schwenk, Holger},
  journal = {arXiv preprint arXiv:1910.07475},
  year = 2019,
  eid = {arXiv: 1910.07475}
}
Contributions
Thanks to @patrickvonplaten, @M-Salti, @lewtun, @thomwolf, @mariamabarham, @lhoestq for adding this dataset.