Dataset: reasoning_bg
Tasks: question answering
Sub-tasks: multiple-choice-qa
Languages: bg
Multilinguality: monolingual
Size: n<1K
Language creators: found
Annotation creators: found
Source datasets: original
Preprint: arxiv:1908.01519
License: apache-2.0
Recently, reading comprehension models have achieved near-human performance on large-scale datasets such as SQuAD, CoQA, MS MARCO, and RACE. This is largely due to the release of pre-trained contextualized representations such as BERT and ELMo, which can be fine-tuned for the target task. Despite those advances and the creation of more challenging datasets, most of the work is still done for English. Here, we study the effectiveness of multilingual BERT fine-tuned on large-scale English datasets for reading comprehension (e.g., RACE), and we apply it to Bulgarian multiple-choice reading comprehension. We propose a new dataset containing 2,221 questions from matriculation exams for twelfth grade in various subjects (history, biology, geography, and philosophy), and 412 additional questions from online quizzes in history. While the quiz authors gave no relevant context, we incorporate knowledge from Wikipedia, retrieving documents matching the combination of question + each answer option.
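The retrieval step described above pairs the question with each answer option. A minimal sketch of how such queries might be assembled is shown here; the function name `build_queries` is hypothetical, and the card does not specify the retrieval engine or the exact query format, so plain string concatenation is an assumption.

```python
def build_queries(question, options):
    """Build one retrieval query per answer option (question + option).

    Assumption: queries are simple concatenations; the actual system
    may weight or tokenize the two parts differently.
    """
    return [f"{question} {option}".strip() for option in options]


question = "Самостоятелно съществуващи живи системи са:"
options = ["вирусите", "тъканите", "митохондриите", "едноклетъчните организми"]

queries = build_queries(question, options)
for q in queries:
    print(q)
```

Each query would then be sent to a document index (for example, one built over Bulgarian Wikipedia), and the retrieved passages would serve as context for the corresponding option.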
[Needs More Information]
Bulgarian
A typical data point comprises a question, its candidate answers (four in most domains), and the correct answer.
{
  "id": "21181dda96414fd9b7a5e336ad84b45d",
  "qid": 1,
  "question": "Самостоятелно съществуващи живи системи са:",
  "answers": [
    "вирусите",
    "тъканите",
    "митохондриите",
    "едноклетъчните организми"
  ],
  "correct": "едноклетъчните организми",
  "url": "http://zamatura.eu/files/dzi/biologiq/2010/matura-biologiq-2010.pdf"
}
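Because `correct` stores the answer text rather than a position, a classifier usually needs it converted to an index. The following is a minimal sketch using a record with the fields shown above; the helper name `correct_index` is hypothetical and not part of the dataset.

```python
import json

# A record with the same fields as the example instance.
raw = """
{
  "id": "21181dda96414fd9b7a5e336ad84b45d",
  "qid": 1,
  "question": "Самостоятелно съществуващи живи системи са:",
  "answers": ["вирусите", "тъканите", "митохондриите", "едноклетъчните организми"],
  "correct": "едноклетъчните организми",
  "url": "http://zamatura.eu/files/dzi/biologiq/2010/matura-biologiq-2010.pdf"
}
"""
record = json.loads(raw)


def correct_index(rec):
    """Return the position of the correct answer within rec["answers"].

    Raises ValueError if the correct answer is not among the options.
    """
    return rec["answers"].index(rec["correct"])


label = correct_index(record)
print(label, record["answers"][label])
```

The `ValueError` from `list.index` doubles as a sanity check: a record whose `correct` string does not exactly match one of its options will fail loudly instead of producing a wrong label.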
The dataset covers the following domains:
Domain | #QA-pairs | #Choices | Len Question | Len Options | Vocab Size |
---|---|---|---|---|---|
12th Grade Matriculation Exam | |||||
Biology | 437 | 4 | 10.44 | 2.64 | 2,414 (12,922) |
Philosophy | 630 | 4 | 8.91 | 2.94 | 3,636 (20,392) |
Geography | 612 | 4 | 12.83 | 2.47 | 3,239 (17,668) |
History | 542 | 4 | 23.74 | 3.64 | 5,466 (20,456) |
Online History Quizzes | |||||
Bulgarian History | 229 | 4 | 14.05 | 2.80 | 2,287 (10,620) |
PzHistory | 183 | 3 | 38.89 | 2.44 | 1,261 (7,518) |
Total | 2,633 | 3.93 | 15.67 | 2.89 | 13,329 (56,104) |
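The Total row can be cross-checked against the per-domain rows. The sketch below recomputes the question counts and the average number of choices from the figures in the table; the variable names are illustrative only.

```python
# (domain, number of QA pairs, choices per question), copied from the table.
matriculation = [
    ("Biology", 437, 4),
    ("Philosophy", 630, 4),
    ("Geography", 612, 4),
    ("History", 542, 4),
]
quizzes = [
    ("Bulgarian History", 229, 4),
    ("PzHistory", 183, 3),
]

matriculation_total = sum(n for _, n, _ in matriculation)
quiz_total = sum(n for _, n, _ in quizzes)
total = matriculation_total + quiz_total

# Average choices per question, weighted by the number of questions.
avg_choices = sum(n * c for _, n, c in matriculation + quizzes) / total

print(matriculation_total, quiz_total, total)  # 2221 412 2633
print(round(avg_choices, 2))                   # 3.93
```

The results agree with the 2,221 exam questions and 412 quiz questions reported in the abstract, and with the 3.93 average in the Total row.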
The dataset has been curated from matriculation exams and online quizzes. The questions cover a wide variety of topics in biology, philosophy, geography, and history.
Data has been sourced from the matriculation exams and online quizzes.
Who are the source language producers?
[Needs More Information]
[Needs More Information]
Who are the annotators?
[Needs More Information]
[Needs More Information]
@article{hardalov2019beyond,
  title={Beyond english-only reading comprehension: Experiments in zero-shot multilingual transfer for bulgarian},
  author={Hardalov, Momchil and Koychev, Ivan and Nakov, Preslav},
  journal={arXiv preprint arXiv:1908.01519},
  year={2019}
}
Thanks to @saradhix for adding this dataset.