Datasets:

squad_kor_v2

Tasks:

question-answering

Sub-tasks:

extractive-qa

Languages:

ko

Multilinguality:

monolingual

Size categories:

10K<n<100K

Language creators:

found

Annotations creators:

crowdsourced

Dataset Card for KorQuAD v2.1

Dataset Summary

KorQuAD 2.0 is a Korean question answering dataset with more than 100,000 question-answer pairs in total. It differs from KorQuAD 1.0, the standard Korean Q&A dataset, in three major ways. First, each document is a whole Wikipedia page, not just one or two paragraphs. Second, because the documents also contain tables and lists, a model must understand document structure expressed in HTML tags. Finally, an answer can be a long span of text covering not only words or phrases but also paragraphs, tables, and lists.
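Because contexts and long answers carry HTML markup, it can help to view them as plain text during inspection. The following is a minimal sketch using only Python's standard-library `html.parser`; the `TextExtractor` class, the `html_to_text` helper, and the toy table are illustrative assumptions and are not part of the dataset or its official tooling.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text chunks, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the text content of an HTML fragment, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Toy HTML table written for this example (not copied from the dataset).
sample = (
    "<table><tr><th>후보</th><th>득표</th></tr>"
    "<tr><td>심규언</td><td>20,890 표</td></tr></table>"
)
print(html_to_text(sample))
```

Flattening the markup this way discards the table and list structure, so it is suited to quick inspection rather than to building model inputs that need that structure.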

Supported Tasks and Leaderboards

question-answering

Languages

Korean

Dataset Structure

The data follows the standard SQuAD format, with exactly one answer per question.

Data Instances

An example from the data set looks as follows:

{'answer': {'answer_start': 3873,
  'html_answer_start': 16093,
  'text': '20,890 표'},
 'context': '<!DOCTYPE html>\n<html>\n<head>\n<meta>\n<title>심규언 - 위키백과, 우리 모두의 백과사전</title>\n\n\n<link>\n.....[omitted]',
 'id': '36615',
 'question': '심규언은 17대 지방 선거에서 몇 표를 득표하였는가?',
 'raw_html': '<!DOCTYPE html>\n<html c ...[omitted]',
 'title': '심규언',
 'url': 'https://ko.wikipedia.org/wiki/심규언'}
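A quick sanity check on a record is to confirm that the answer text actually appears at the stated offset. The sketch below assumes that `answer_start` is a character offset into `context`, which the card does not state explicitly; `answer_span_matches` and the toy record are hypothetical names introduced for illustration.

```python
def answer_span_matches(example: dict) -> bool:
    """Return True if the answer text is found at answer_start in context.

    Assumption: answer_start is a character offset into example["context"].
    """
    answer = example["answer"]
    start = answer["answer_start"]
    end = start + len(answer["text"])
    return example["context"][start:end] == answer["text"]


# Toy record with the same keys as the instance above (not real dataset content).
toy_context = "<p>득표수: 20,890 표</p>"
toy_text = "20,890 표"
toy = {
    "context": toy_context,
    "answer": {
        "text": toy_text,
        "answer_start": toy_context.index(toy_text),  # offset computed, not hardcoded
    },
}

print(answer_span_matches(toy))
```

If the same assumption holds for `html_answer_start` and `raw_html`, the check could be repeated against those two fields.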

Data Fields

{'id': Value(dtype='string', id=None),
 'title': Value(dtype='string', id=None),
 'context': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None),
 'answer': {'text': Value(dtype='string', id=None),
  'answer_start': Value(dtype='int32', id=None),
  'html_answer_start': Value(dtype='int32', id=None)},
 'url': Value(dtype='string', id=None),
 'raw_html': Value(dtype='string', id=None)}
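Tooling written for the Hugging Face `squad` dataset typically expects an `answers` field holding parallel lists, whereas the schema above has a single `answer` dict. The sketch below shows one possible conversion; `to_squad_answers` is a hypothetical helper, and dropping `html_answer_start` is a simplification made here, not something the dataset prescribes.

```python
def to_squad_answers(example: dict) -> dict:
    """Return a copy of the record with 'answer' replaced by SQuAD-style 'answers'."""
    answer = example["answer"]
    converted = {key: value for key, value in example.items() if key != "answer"}
    converted["answers"] = {
        "text": [answer["text"]],
        "answer_start": [answer["answer_start"]],
        # html_answer_start is intentionally dropped in this sketch.
    }
    return converted


# Toy record using the field names listed above (values are made up).
record = {
    "id": "0",
    "title": "toy",
    "context": "abc def",
    "question": "?",
    "answer": {"text": "def", "answer_start": 4, "html_answer_start": 10},
    "url": "",
    "raw_html": "",
}
print(to_squad_answers(record)["answers"])
```

The function leaves the input record unchanged and returns a new dict, so it can be applied record by record without side effects.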

Data Splits

  • Train: 83486
  • Validation: 10165

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Wikipedia

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

CC BY-ND 2.0 KR

Citation Information

@article{NODE09353166,
    author={Youngmin Kim and Seungyoung Lim and Hyunjeong Lee and Soyoon Park and Myungji Kim},
    title={{KorQuAD 2.0: Korean QA Dataset for Web Document Machine Comprehension}},
    booktitle={{Journal of KIISE 제47권 제6호}},
    journal={{Journal of KIISE}},
    volume={{47}},
    issue={{6}},
    publisher={The Korean Institute of Information Scientists and Engineers},
    year={2020},
    ISSN={{2383-630X}},
    pages={577-586},
    url={http://www.dbpia.co.kr/journal/articleDetail?nodeId=NODE09353166}}

Contributions

Thanks to @cceyda for adding this dataset.