Datasets:
kor_3i4k
ArXiv:
arxiv:1811.04231
License:
cc-by-4.0
Source Datasets:
original
Language Creators:
expert-generated
Annotations Creators:
expert-generated
Tasks:
text-classification
Languages:
ko
Multilinguality:
monolingual
Size:
10K<n<100K
The 3i4K dataset is a set of frequently used Korean words (corpus provided by the Seoul National University Speech Language Processing Lab) and manually created questions/commands containing short utterances. The goal is to identify the speaker intention of a spoken utterance based on its transcript and, in cases where the transcript alone is not enough, with the help of auxiliary acoustic features. The classification system decides whether the utterance is a fragment, statement, question, command, rhetorical question, rhetorical command, or an intonation-dependent utterance. This is important because in head-final languages like Korean, intonation plays a significant role in identifying the speaker's intention.
The text in the dataset is in Korean, and the associated BCP-47 code is ko-KR.
An example data instance contains a short utterance and its label:
{ "label": 3, "text": "선수잖아 이 케이스 저 케이스 많을 거 아냐 선배라고 뭐 하나 인생에 도움도 안주는데 내가 이렇게 진지하게 나올 때 제대로 한번 조언 좀 해줘보지" }
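The instance above can be decoded into a readable label with a few lines of Python. This is a minimal sketch, not part of the dataset tooling: the `LABEL_NAMES` list and the `decode_instance` helper are hypothetical, and the mapping assumes that the integer ids follow the order in which the seven categories are listed in the description, which this card does not confirm.

```python
import json

# Assumption: label ids follow the order in which the seven categories
# are listed in the dataset description. The card does not confirm this.
LABEL_NAMES = [
    "fragment",
    "statement",
    "question",
    "command",
    "rhetorical question",
    "rhetorical command",
    "intonation-dependent utterance",
]


def decode_instance(raw: str) -> dict:
    """Parse one JSON record and attach a human-readable label name."""
    record = json.loads(raw)
    label_id = record["label"]
    if not 0 <= label_id < len(LABEL_NAMES):
        raise ValueError(f"unexpected label id: {label_id}")
    # Return a copy of the record with an extra "label_name" field.
    return {**record, "label_name": LABEL_NAMES[label_id]}


# The example instance from this card, as a JSON string.
example = (
    '{ "label": 3, "text": "선수잖아 이 케이스 저 케이스 많을 거 아냐 '
    '선배라고 뭐 하나 인생에 도움도 안주는데 내가 이렇게 진지하게 나올 때 '
    '제대로 한번 조언 좀 해줘보지" }'
)

if __name__ == "__main__":
    print(decode_instance(example)["label_name"])
```

Under the assumed ordering, label 3 would correspond to "command"; check the mapping against the dataset's own label definitions before relying on it.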
The data is split into a training set comprised of 55134 examples and a test set of 6121 examples.
For head-final languages like Korean, intonation can be a determining factor in identifying the speaker's intention. The purpose of this dataset is to determine whether an utterance is a fragment, statement, question, command, or a rhetorical question/command, taking into account the intonation-dependency that arises from head-finality. This is expected to improve language understanding of spoken Korean utterances and can be beneficial for speech-to-text applications.
The corpus combines material provided by the Seoul National University Speech Language Processing Lab, a set of frequently used words from the National Institute of Korean Language, and manually created commands and questions. The utterances cover topics like weather, transportation and stocks. 20k lines were randomly selected.
Who are the source language producers?
Korean speakers produced the commands and questions.
Utterances were classified into seven categories. The annotators were given clear instructions in the form of annotation guidelines. The resulting inter-annotator agreement was 0.85, and the final label was decided by majority voting.
Who are the annotators?
The annotation was completed by three Seoul Korean L1 speakers.
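The majority-voting step with three annotators can be sketched as follows. This is an illustrative example only: the `majority_vote` function is hypothetical, and returning `None` when no label has a strict majority is an assumption, since the card does not say how such cases were resolved.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(labels: Sequence[int]) -> Optional[int]:
    """Return the label chosen by a strict majority of annotators.

    Returns None when no label is chosen by more than half of the
    annotators (for example, three annotators picking three different
    labels). How the original annotation handled such cases is not
    stated, so None here is only a placeholder.
    """
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


if __name__ == "__main__":
    # Two of three annotators agree on label 3.
    print(majority_vote([3, 3, 1]))
    # All three annotators disagree, so there is no majority.
    print(majority_vote([0, 1, 2]))
```

With seven categories and three annotators, a three-way disagreement is possible, which is why the sketch makes the no-majority case explicit instead of picking a label arbitrarily.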
[More Information Needed]
The dataset is curated by Won Ik Cho, Hyeon Seung Lee, Ji Won Yoon, Seok Min Kim and Nam Soo Kim.
The dataset is licensed under CC BY-SA 4.0.
@article{cho2018speech,
  title={Speech Intention Understanding in a Head-final Language: A Disambiguation Utilizing Intonation-dependency},
  author={Cho, Won Ik and Lee, Hyeon Seung and Yoon, Ji Won and Kim, Seok Min and Kim, Nam Soo},
  journal={arXiv preprint arXiv:1811.04231},
  year={2018}
}
Thanks to @stevhliu for adding this dataset.