Dataset: SkelterLabsInc/JaQuAD
Tasks: Question Answering
Sub-tasks: extractive-qa
Languages: ja
Multilinguality: monolingual
Size: 10K<n<100K
Annotations creators: crowdsourced
Source datasets: original
ArXiv: arxiv:2202.01764
License: cc-by-sa-3.0

Japanese Question Answering Dataset (JaQuAD), released in 2022, is a human-annotated dataset created for Japanese Machine Reading Comprehension. JaQuAD was developed to provide a SQuAD-like QA dataset in Japanese. It contains 39,696 question-answer pairs. Questions and answers are manually curated by human annotators, and contexts are collected from Japanese Wikipedia articles. Fine-tuning BERT-Japanese on JaQuAD achieves an F1 score of 78.92% and an exact match (EM) score of 63.38%.
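The two reported metrics can be illustrated with a minimal sketch. The card does not say how predictions and references are tokenized for F1, so the version below assumes character-level overlap, a common choice for Japanese text without whitespace; the function names `exact_match` and `char_f1` are hypothetical and this is not necessarily the evaluation script used by the authors.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the two answers are identical after stripping whitespace."""
    return float(prediction.strip() == reference.strip())


def char_f1(prediction: str, reference: str) -> float:
    """Character-overlap F1 (assumption: characters stand in for tokens)."""
    pred_chars = [c for c in prediction if not c.isspace()]
    ref_chars = [c for c in reference if not c.isspace()]
    if not pred_chars or not ref_chars:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_chars == ref_chars)
    # Multiset intersection counts each shared character at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_chars) & Counter(ref_chars)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(ref_chars)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("イタセンパラ", "イタセンパラ"))
    print(char_f1("タナゴ", "イタセンパラ"))
```

Dataset-level scores would then be the mean of these per-question values, expressed as percentages.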
Japanese (ja)
An example of 'validation':
{
    "id": "de-001-00-000",
    "title": "イタセンパラ",
    "context": "イタセンパラ(板鮮腹、Acheilognathuslongipinnis)は、コイ科のタナゴ亜科タナゴ属に分類される淡水>魚の一種。\n別名はビワタナゴ(琵琶鱮、琵琶鰱)。",
    "question": "ビワタナゴの正式名称は何?",
    "question_type": "Multiple sentence reasoning",
    "answers": {
        "text": "イタセンパラ",
        "answer_start": 0,
        "answer_type": "Object",
    },
},
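As a sketch of how the `answers` fields relate to `context`, the snippet below recovers the answer span from the example record. It assumes, as in SQuAD, that `answer_start` is a character offset into `context`; the example above is consistent with that reading, but the card does not state it explicitly. The helper name `answer_span` is hypothetical.

```python
# Subset of the example record shown above (only the fields needed here).
record = {
    "context": (
        "イタセンパラ(板鮮腹、Acheilognathuslongipinnis)は、"
        "コイ科のタナゴ亜科タナゴ属に分類される淡水>魚の一種。\n"
        "別名はビワタナゴ(琵琶鱮、琵琶鰱)。"
    ),
    "answers": {"text": "イタセンパラ", "answer_start": 0},
}


def answer_span(context: str, answers: dict) -> str:
    """Slice the answer out of the context using the character offset."""
    start = answers["answer_start"]
    end = start + len(answers["text"])
    return context[start:end]


extracted = answer_span(record["context"], record["answers"])
# For an extractive QA record, the slice should reproduce the answer text.
is_consistent = extracted == record["answers"]["text"]
print(extracted, is_consistent)
```

A check like this is a cheap sanity test to run over every record before fine-tuning, since a misaligned offset silently produces wrong training labels.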
JaQuAD consists of three sets: train, validation, and test. They were created from disjoint sets of Wikipedia articles. The test set has not been publicly released yet. The following table shows statistics for each set.
Set | Number of Articles | Number of Contexts | Number of Questions |
---|---|---|---|
Train | 691 | 9713 | 31748 |
Validation | 101 | 1431 | 3939 |
Test | 109 | 1479 | 4009 |
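The figures in the table can be cross-checked against the total number of question-answer pairs stated in the description. The following is a minimal sketch that hard-codes the table values; the dictionary layout and variable names are illustrative only.

```python
# Values copied from the statistics table above.
splits = {
    "train": {"articles": 691, "contexts": 9713, "questions": 31748},
    "validation": {"articles": 101, "contexts": 1431, "questions": 3939},
    "test": {"articles": 109, "contexts": 1479, "questions": 4009},
}

# Totals across all three sets.
totals = {
    key: sum(stats[key] for stats in splits.values())
    for key in ("articles", "contexts", "questions")
}

# Average number of questions written per context in each set.
questions_per_context = {
    name: stats["questions"] / stats["contexts"] for name, stats in splits.items()
}

print(totals)
for name, ratio in questions_per_context.items():
    print(f"{name}: {ratio:.2f} questions per context")
```

The summed question count is 39,696, which matches the number of question-answer pairs reported for the dataset as a whole.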
The JaQuAD dataset was created by Skelter Labs to provide a SQuAD-like QA dataset in Japanese. Questions are original and based on Japanese Wikipedia articles.
The articles used for the contexts are from Japanese Wikipedia. 88.7% of the articles are from the curated list of high-quality Japanese Wikipedia articles, e.g., featured articles and good articles.
Wikipedia articles were scraped and divided into one or more paragraphs, which serve as contexts. Annotations (questions and answer spans) were written by fluent Japanese speakers, including native and non-native speakers. Annotators were given a context and asked to generate non-trivial questions about information in the context.
No personal or sensitive information is included in this dataset. The dataset annotators have manually verified this.
Users should note that the articles are sampled from Wikipedia and are not representative of all Wikipedia articles.
The social biases of this dataset have not yet been investigated. Articles and questions have been selected for quality and diversity.
The JaQuAD dataset has limitations as follows:
This dataset is not yet complete. If you find any errors in JaQuAD, please contact us.
Skelter Labs: https://skelterlabs.com/
The JaQuAD dataset is licensed under the CC BY-SA 3.0 license.
@misc{so2022jaquad,
      title={{JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension}},
      author={ByungHoon So and Kyuhong Byun and Kyungwon Kang and Seongjin Cho},
      year={2022},
      eprint={2202.01764},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
This work was supported by the TPU Research Cloud (TRC) program. For training models, we used Cloud TPUs provided by TRC. We also thank the annotators who created JaQuAD.