Dataset:

khalidalt/tydiqa-goldp

Tasks:

question-answering

Sub-tasks:

extractive-qa

Multilinguality:

multilingual

Language creators:

crowdsourced

Annotations creators:

crowdsourced

Source datasets:

extended|wikipedia

License:

apache-2.0

Dataset Card for "tydiqa"

Dataset Summary

TyDi QA is a question answering dataset covering 11 typologically diverse languages with 204K question-answer pairs. The languages of TyDi QA are diverse with regard to their typology (the set of linguistic features that each language expresses), such that we expect models performing well on this set to generalize across a large number of the languages in the world. It contains language phenomena that would not be found in English-only corpora. To provide a realistic information-seeking task and avoid priming effects, questions are written by people who want to know the answer but don't know it yet (unlike SQuAD and its descendants), and the data is collected directly in each language without the use of translation (unlike MLQA and XQuAD).

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure

Data Instances

primary_task
  • Size of downloaded dataset files: 1863.37 MB
  • Size of the generated dataset: 5757.59 MB
  • Total amount of disk used: 7620.96 MB

An example of 'validation' looks as follows.

This example was too long and was cropped:

{
    "annotations": {
        "minimal_answers_end_byte": [-1, -1, -1],
        "minimal_answers_start_byte": [-1, -1, -1],
        "passage_answer_candidate_index": [-1, -1, -1],
        "yes_no_answer": ["NONE", "NONE", "NONE"]
    },
    "document_plaintext": "\"\\nรองศาสตราจารย์[1] หม่อมราชวงศ์สุขุมพันธุ์ บริพัตร  (22 กันยายน 2495 -) ผู้ว่าราชการกรุงเทพมหานครคนที่ 15 อดีตรองหัวหน้าพรรคปร...",
    "document_title": "หม่อมราชวงศ์สุขุมพันธุ์ บริพัตร",
    "document_url": "\"https://th.wikipedia.org/wiki/%E0%B8%AB%E0%B8%A1%E0%B9%88%E0%B8%AD%E0%B8%A1%E0%B8%A3%E0%B8%B2%E0%B8%8A%E0%B8%A7%E0%B8%87%E0%B8%...",
    "language": "thai",
    "passage_answer_candidates": "{\"plaintext_end_byte\": [494, 1779, 2931, 3904, 4506, 5588, 6383, 7122, 8224, 9375, 10473, 12563, 15134, 17765, 19863, 21902, 229...",
    "question_text": "\"หม่อมราชวงศ์สุขุมพันธุ์ บริพัตร เรียนจบจากที่ไหน ?\"..."
}
secondary_task
  • Size of downloaded dataset files: 1863.37 MB
  • Size of the generated dataset: 55.34 MB
  • Total amount of disk used: 1918.71 MB

An example of 'validation' looks as follows.

This example was too long and was cropped:

{
    "answers": {
        "answer_start": [394],
        "text": ["بطولتين"]
    },
    "context": "\"أقيمت البطولة 21 مرة، شارك في النهائيات 78 دولة، وعدد الفرق التي فازت بالبطولة حتى الآن 8 فرق، ويعد المنتخب البرازيلي الأكثر تت...",
    "id": "arabic-2387335860751143628-1",
    "question": "\"كم عدد مرات فوز الأوروغواي ببطولة كاس العالم لكرو القدم؟\"...",
    "title": "قائمة نهائيات كأس العالم"
}
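
The `answer_start` value in a secondary_task record can be read as an offset into `context`. The sketch below assumes SQuAD-style character offsets, which this card does not confirm, and it uses a small made-up record rather than real dataset content. The helper names `language_from_id` and `check_answer_spans` are hypothetical.

```python
from typing import Dict, List


def language_from_id(example_id: str) -> str:
    """Return the language prefix of an id such as 'arabic-2387335860751143628-1'."""
    return example_id.split("-", 1)[0]


def check_answer_spans(example: Dict) -> List[bool]:
    """For each answer, report whether the context slice at answer_start equals the text.

    Assumes `answer_start` is a character offset (SQuAD convention).
    """
    context = example["context"]
    answers = example["answers"]
    results = []
    for text, start in zip(answers["text"], answers["answer_start"]):
        results.append(context[start:start + len(text)] == text)
    return results


# Synthetic record with the same shape as the example above (not real data).
example = {
    "id": "english-0000000000000000000-0",
    "title": "Example",
    "context": "The tournament has been held 21 times.",
    "question": "How many times has the tournament been held?",
    "answers": {"text": ["21 times"], "answer_start": [29]},
}

print(language_from_id(example["id"]))
print(check_answer_spans(example))
```

A check like this is a cheap way to confirm the offset convention on a few real records before training an extractive model on them.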

Data Fields

The data fields are the same among all splits.

primary_task
  • passage_answer_candidates : a dictionary feature containing:
    • plaintext_start_byte : an int32 feature.
    • plaintext_end_byte : an int32 feature.
  • question_text : a string feature.
  • document_title : a string feature.
  • language : a string feature.
  • annotations : a dictionary feature containing:
    • passage_answer_candidate_index : an int32 feature.
    • minimal_answers_start_byte : an int32 feature.
    • minimal_answers_end_byte : an int32 feature.
    • yes_no_answer : a string feature.
  • document_plaintext : a string feature.
  • document_url : a string feature.
secondary_task
  • id : a string feature.
  • title : a string feature.
  • context : a string feature.
  • question : a string feature.
  • answers : a dictionary feature containing:
    • text : a string feature.
    • answer_start : an int32 feature.
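
The `_byte` suffix on the primary_task offsets suggests that they index into an encoded byte string rather than into Python characters. The sketch below assumes the offsets refer to the UTF-8 encoding of `document_plaintext` and that -1 marks a missing answer, as in the cropped example above. Both points are assumptions, `slice_by_bytes` is a hypothetical helper, and the document is made up.

```python
from typing import Optional


def slice_by_bytes(text: str, start_byte: int, end_byte: int) -> Optional[str]:
    """Return the substring covered by [start_byte, end_byte) in UTF-8 bytes.

    Returns None when either offset is -1 (assumed to mean "no answer").
    """
    if start_byte < 0 or end_byte < 0:
        return None
    raw = text.encode("utf-8")
    return raw[start_byte:end_byte].decode("utf-8")


# Synthetic document (not real data): two passages separated by a newline.
document_plaintext = "สวัสดี world\nsecond passage"
first = "สวัสดี world".encode("utf-8")
candidates = {
    "plaintext_start_byte": [0, len(first) + 1],
    "plaintext_end_byte": [len(first), len(document_plaintext.encode("utf-8"))],
}

# Recover each candidate passage from its byte range.
passages = [
    slice_by_bytes(document_plaintext, s, e)
    for s, e in zip(candidates["plaintext_start_byte"], candidates["plaintext_end_byte"])
]
print(passages)
```

If the UTF-8 assumption holds, slicing the `str` directly with these offsets would return the wrong span for any text containing non-ASCII characters, so encoding first is the safer default.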

Data Splits

name            train   validation
primary_task    166916  18670
secondary_task  49881   5077

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

More Information Needed

Additional Information

Dataset Curators

More Information Needed

Licensing Information

More Information Needed

Citation Information

@article{tydiqa,
title   = {TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages},
author  = {Jonathan H. Clark and Eunsol Choi and Michael Collins and Dan Garrette and Tom Kwiatkowski and Vitaly Nikolaev and Jennimaria Palomaki},
year    = {2020},
journal = {Transactions of the Association for Computational Linguistics}
}




@inproceedings{ruder-etal-2021-xtreme,
    title = "{XTREME}-{R}: Towards More Challenging and Nuanced Multilingual Evaluation",
    author = "Ruder, Sebastian  and
      Constant, Noah  and
      Botha, Jan  and
      Siddhant, Aditya  and
      Firat, Orhan  and
      Fu, Jinlan  and
      Liu, Pengfei  and
      Hu, Junjie  and
      Garrette, Dan  and
      Neubig, Graham  and
      Johnson, Melvin",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.802",
    doi = "10.18653/v1/2021.emnlp-main.802",
    pages = "10215--10245",
}