Dataset: kor_sae
Task: text classification
Language: ko
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: expert-generated
Annotation creators: expert-generated
Source datasets: original
License: cc-by-sa-4.0

The Structured Argument Extraction for Korean dataset is a set of question-argument and command-argument pairs, each with a question type label and a negativeness label. Agents like Alexa or Siri often encounter utterances in which the user's objective is not stated clearly. The goal of this dataset is to extract the intent argument of a given utterance pair that lacks a clear directive. This may yield a more robust agent capable of parsing more non-canonical forms of speech.
The text in the dataset is in Korean and the associated BCP-47 code is ko-KR.
An example data instance contains a question or command pair and its label:
{ "intent_pair1": "내일 오후 다섯시 조별과제 일정 추가해줘", "intent_pair2": "내일 오후 다섯시 조별과제 일정 추가하기", "label": 4 }
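As a minimal sketch, an instance like the one above could be parsed and checked as follows. The `IntentPair` class and `parse_instance` helper are hypothetical names introduced for illustration and are not part of the dataset or any library. The comments also assume that `intent_pair1` holds the original utterance and `intent_pair2` the extracted intent argument, which is how the example reads.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class IntentPair:
    """One record, mirroring the three fields of the example instance."""
    intent_pair1: str  # assumed: the original question or command
    intent_pair2: str  # assumed: the extracted intent argument
    label: int         # integer class label


def parse_instance(raw: str) -> IntentPair:
    """Parse a JSON-encoded record and check that the label is an integer."""
    record = json.loads(raw)
    instance = IntentPair(
        intent_pair1=record["intent_pair1"],
        intent_pair2=record["intent_pair2"],
        label=record["label"],
    )
    if not isinstance(instance.label, int):
        raise TypeError("label must be an integer")
    return instance


# The example instance, written as valid JSON (fields separated by commas).
RAW_EXAMPLE = """
{
  "intent_pair1": "내일 오후 다섯시 조별과제 일정 추가해줘",
  "intent_pair2": "내일 오후 다섯시 조별과제 일정 추가하기",
  "label": 4
}
"""

example = parse_instance(RAW_EXAMPLE)
print(example.intent_pair1, "->", example.intent_pair2, f"(label {example.label})")
```

A record that is missing one of the three fields raises `KeyError`, which makes malformed rows easy to spot before training.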
The corpus contains 30,837 examples.
The Structured Argument Extraction for Korean dataset was curated to help train models to extract intent arguments from utterances that lack a clear objective or that use non-canonical forms of speech. This is especially helpful in Korean: in English, the who, what, where, when, and why usually come at the beginning of the sentence, but this is not necessarily the case in Korean. For low-resource languages, the lack of such data can be a bottleneck for comprehension performance.
The corpus was taken from the one constructed by Cho et al., a Korean single-utterance corpus for identifying directives and non-directives that contains a wide variety of non-canonical directives.
Who are the source language producers?
Korean speakers are the source language producers.
Utterances were categorized as question or command arguments and then further classified according to their intent argument.
Who are the annotators?
The annotation was done by three native Korean speakers with a background in computational linguistics.
The dataset is curated by Won Ik Cho, Young Ki Moon, Sangwhan Moon, Seok Min Kim and Nam Soo Kim.
The dataset is licensed under CC BY-SA 4.0.
@article{cho2019machines,
  title={Machines Getting with the Program: Understanding Intent Arguments of Non-Canonical Directives},
  author={Cho, Won Ik and Moon, Young Ki and Moon, Sangwhan and Kim, Seok Min and Kim, Nam Soo},
  journal={arXiv preprint arXiv:1912.00342},
  year={2019}
}
Thanks to @stevhliu for adding this dataset.