数据集:
juletxara/tydiqa_xtreme
许可:
apache-2.0预印本库:
arxiv:2003.11080源数据集:
extended|wikipedia批注创建人:
crowdsourced语言创建人:
crowdsourced计算机处理:
multilingual子任务:
extractive-qa任务:
问答TyDi QA is a question answering dataset covering 11 typologically diverse languages with 204K question-answer pairs. The languages of TyDi QA are diverse with regard to their typology -- the set of linguistic features that each language expresses -- such that we expect models performing well on this set to generalize across a large number of the languages in the world. It contains language phenomena that would not be found in English-only corpora. To provide a realistic information-seeking task and avoid priming effects, questions are written by people who want to know the answer, but don’t know the answer yet, (unlike SQuAD and its descendents) and the data is collected directly in each language without the use of translation (unlike MLQA and XQuAD).
We also include "translate-train" and "translate-test" splits for each non-English languages from XTREME (Hu et al., 2020). These splits are the automatic translations from English to each target language used in the XTREME paper [ https://arxiv.org/abs/2003.11080] . The "translate-train" split purposefully ignores the non-English TyDiQA-GoldP training data to simulate the transfer learning scenario where original-language data is not available and system builders must rely on labeled English data plus existing machine translation systems.
An example of 'validation' looks as follows.
This example was too long and was cropped: { "annotations": { "minimal_answers_end_byte": [-1, -1, -1], "minimal_answers_start_byte": [-1, -1, -1], "passage_answer_candidate_index": [-1, -1, -1], "yes_no_answer": ["NONE", "NONE", "NONE"] }, "document_plaintext": "\"\\nรองศาสตราจารย์[1] หม่อมราชวงศ์สุขุมพันธุ์ บริพัตร (22 กันยายน 2495 -) ผู้ว่าราชการกรุงเทพมหานครคนที่ 15 อดีตรองหัวหน้าพรรคปร...", "document_title": "หม่อมราชวงศ์สุขุมพันธุ์ บริพัตร", "document_url": "\"https://th.wikipedia.org/wiki/%E0%B8%AB%E0%B8%A1%E0%B9%88%E0%B8%AD%E0%B8%A1%E0%B8%A3%E0%B8%B2%E0%B8%8A%E0%B8%A7%E0%B8%87%E0%B8%...", "language": "thai", "passage_answer_candidates": "{\"plaintext_end_byte\": [494, 1779, 2931, 3904, 4506, 5588, 6383, 7122, 8224, 9375, 10473, 12563, 15134, 17765, 19863, 21902, 229...", "question_text": "\"หม่อมราชวงศ์สุขุมพันธุ์ บริพัตร เรียนจบจากที่ไหน ?\"..." }secondary_task
An example of 'validation' looks as follows.
This example was too long and was cropped: { "answers": { "answer_start": [394], "text": ["بطولتين"] }, "context": "\"أقيمت البطولة 21 مرة، شارك في النهائيات 78 دولة، وعدد الفرق التي فازت بالبطولة حتى الآن 8 فرق، ويعد المنتخب البرازيلي الأكثر تت...", "id": "arabic-2387335860751143628-1", "question": "\"كم عدد مرات فوز الأوروغواي ببطولة كاس العالم لكرو القدم؟\"...", "title": "قائمة نهائيات كأس العالم" }
The data fields are the same among all splits.
primary_taskname | train | validation |
---|---|---|
primary_task | 166916 | 18670 |
secondary_task | 49881 | 5077 |
@article{tydiqa, title = {TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages}, author = {Jonathan H. Clark and Eunsol Choi and Michael Collins and Dan Garrette and Tom Kwiatkowski and Vitaly Nikolaev and Jennimaria Palomaki} year = {2020}, journal = {Transactions of the Association for Computational Linguistics} }
@inproceedings{ruder-etal-2021-xtreme, title = "{XTREME}-{R}: Towards More Challenging and Nuanced Multilingual Evaluation", author = "Ruder, Sebastian and Constant, Noah and Botha, Jan and Siddhant, Aditya and Firat, Orhan and Fu, Jinlan and Liu, Pengfei and Hu, Junjie and Garrette, Dan and Neubig, Graham and Johnson, Melvin", booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing", month = nov, year = "2021", address = "Online and Punta Cana, Dominican Republic", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2021.emnlp-main.802", doi = "10.18653/v1/2021.emnlp-main.802", pages = "10215--10245", } }