Dataset:
squad_kor_v2
Task:
question-answering
Sub-tasks:
extractive-qa
Language:
ko
Multilinguality:
monolingual
Size:
10K<n<100K
Language creators:
found
Annotation creators:
crowdsourced
License:
cc-by-nd-4.0
KorQuAD 2.0 is a Korean question answering dataset consisting of 100,000+ question-answer pairs. It differs from KorQuAD 1.0, the standard Korean question answering dataset, in three major ways. First, each document is a whole Wikipedia page, not just one or two paragraphs. Second, because the documents also contain tables and lists, a model must understand documents structured with HTML tags. Finally, an answer can be a long text spanning not only words or phrases but also paragraphs, tables, and lists.
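Because answers may sit inside tables, a reader of the data usually has to walk the HTML structure, not just flat text. The following is a minimal sketch using only the Python standard library; the class `TableCellExtractor`, the helper `extract_table_rows`, and the sample HTML are hypothetical illustrations and are not part of the dataset or its tooling.

```python
from html.parser import HTMLParser


class TableCellExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside an open cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_table_rows(html):
    """Return table rows found in `html` as lists of cell strings."""
    parser = TableCellExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical snippet, not taken from the dataset.
sample_html = (
    "<table>"
    "<tr><th>Candidate</th><th>Votes</th></tr>"
    "<tr><td>A</td><td>20,890</td></tr>"
    "</table>"
)
rows = extract_table_rows(sample_html)
```

This sketch ignores nested tables and malformed markup; real Wikipedia pages would likely need a more robust parser.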
question-answering
Korean
Follows the standard SQuAD format. There is only 1 answer per question.
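Tools written for SQuAD-style data often expect a list-valued `answers` field, whereas each record here carries a single `answer` dictionary. The sketch below shows one way to bridge the two, assuming the list-valued layout (`answers` with `text` and `answer_start` lists); the function name `to_squad_answers` and the sample record are hypothetical.

```python
def to_squad_answers(record):
    """Return a copy of `record` with the single `answer` dict
    rewritten as a list-valued `answers` dict (assumed target layout)."""
    answer = record["answer"]
    converted = {key: value for key, value in record.items() if key != "answer"}
    converted["answers"] = {
        "text": [answer["text"]],
        "answer_start": [answer["answer_start"]],
    }
    return converted


# Hypothetical record with the same field names as the dataset.
sample = {
    "id": "0",
    "question": "How many votes?",
    "context": "Votes: 20,890",
    "answer": {"text": "20,890", "answer_start": 7, "html_answer_start": 7},
}
converted = to_squad_answers(sample)
```

Note that `html_answer_start` is dropped in this conversion; keep it alongside if the raw HTML offsets are needed.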
An example from the data set looks as follows:
{'answer': {'answer_start': 3873, 'html_answer_start': 16093, 'text': '20,890 표'}, 'context': '<!DOCTYPE html>\n<html>\n<head>\n<meta>\n<title>심규언 - 위키백과, 우리 모두의 백과사전</title>\n\n\n<link>\n.....[omitted]', 'id': '36615', 'question': '심규언은 17대 지방 선거에서 몇 표를 득표하였는가?', 'raw_html': '<!DOCTYPE html>\n<html c ...[omitted]', 'title': '심규언', 'url': 'https://ko.wikipedia.org/wiki/심규언'}
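A quick sanity check on a record like the one above is to confirm that the answer text can be recovered from the context by slicing. This sketch assumes `answer_start` is a character offset into `context` (the usual SQuAD convention, not confirmed here); the function `answer_span_matches` and the sample record are hypothetical.

```python
def answer_span_matches(record):
    """Check that slicing `context` at `answer_start` yields the answer text.

    Assumption: `answer_start` is a character offset into `context`.
    """
    answer = record["answer"]
    start = answer["answer_start"]
    end = start + len(answer["text"])
    return record["context"][start:end] == answer["text"]


# Hypothetical record; the real contexts are full Wikipedia pages.
context = "<p>득표수는 20,890 표이다.</p>"
text = "20,890 표"
good_record = {
    "context": context,
    "answer": {"text": text, "answer_start": context.index(text)},
}
# Same record with the offset shifted by one character.
bad_record = {
    "context": context,
    "answer": {"text": text, "answer_start": context.index(text) + 1},
}
```

Under the same assumption, an analogous check could be run with `html_answer_start` against `raw_html`.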
{'id': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None), 'context': Value(dtype='string', id=None), 'question': Value(dtype='string', id=None), 'answer': {'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None), 'html_answer_start': Value(dtype='int32', id=None)}, 'url': Value(dtype='string', id=None), 'raw_html': Value(dtype='string', id=None)}
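The field listing above can be turned into a lightweight check that a plain Python record has the expected shape. This is a sketch using only built-in types, with `string` mapped to `str` and `int32` mapped to `int`; the names `EXPECTED_FIELDS`, `EXPECTED_ANSWER_FIELDS`, and `schema_problems` are hypothetical.

```python
# Top-level string fields from the schema above.
EXPECTED_FIELDS = {
    "id": str,
    "title": str,
    "context": str,
    "question": str,
    "url": str,
    "raw_html": str,
}
# Fields nested under `answer`.
EXPECTED_ANSWER_FIELDS = {
    "text": str,
    "answer_start": int,
    "html_answer_start": int,
}


def schema_problems(record):
    """Return the names of fields that are missing or have the wrong type."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if not isinstance(record.get(name), expected_type):
            problems.append(name)
    answer = record.get("answer")
    if not isinstance(answer, dict):
        problems.append("answer")
        return problems
    for name, expected_type in EXPECTED_ANSWER_FIELDS.items():
        if not isinstance(answer.get(name), expected_type):
            problems.append(f"answer.{name}")
    return problems


# Hypothetical, heavily shortened record for illustration.
valid_record = {
    "id": "1",
    "title": "t",
    "context": "c",
    "question": "q",
    "url": "u",
    "raw_html": "h",
    "answer": {"text": "a", "answer_start": 0, "html_answer_start": 0},
}
```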
Wikipedia
Initial Data Collection and Normalization: [More Information Needed]
Who are the source language producers? [More Information Needed]
Who are the annotators? [More Information Needed]
@article{NODE09353166,
  author={Youngmin Kim and Seungyoung Lim and Hyunjeong Lee and Soyoon Park and Myungji Kim},
  title={{KorQuAD 2.0: Korean QA Dataset for Web Document Machine Comprehension}},
  booktitle={{Journal of KIISE 제47권 제6호}},
  journal={{Journal of KIISE}},
  volume={{47}},
  issue={{6}},
  publisher={The Korean Institute of Information Scientists and Engineers},
  year={2020},
  ISSN={{2383-630X}},
  pages={577-586},
  url={http://www.dbpia.co.kr/journal/articleDetail?nodeId=NODE09353166}
}
Thanks to @cceyda for adding this dataset.