Datasets:

kor_sarcasm

Other:

sarcasm-detection

License:

mit

Source datasets:

original

Annotations creators:

expert-generated

Language creators:

found

Size:

1K<n<10K

Multilinguality:

monolingual

Languages:

Korean

Tasks:

text-classification

Dataset Card for Korean Sarcasm Detection

Dataset Summary

The Korean Sarcasm Dataset was created to detect sarcasm in text, which can significantly alter the original meaning of a sentence. A total of 9319 tweets were collected from Twitter and labeled as sarcasm or not_sarcasm. These tweets were gathered by querying for: 역설, 아무말, 운수좋은날, 笑, 뭐래 아닙니다, 그럴리없다, 어그로, irony sarcastic, and sarcasm. The dataset was pre-processed by removing the keyword hashtags, URLs, and user mentions to maintain anonymity.

Supported Tasks and Leaderboards

sarcasm_detection: The dataset can be used to train a model to detect sarcastic tweets. For example, a BERT model can be given a tweet in Korean and asked to determine whether it is sarcastic.

Languages

The text in the dataset is in Korean, and the associated BCP-47 code is ko-KR.

Dataset Structure

Data Instances

An example data instance contains a Korean tweet and a label indicating whether it is sarcastic: 1 maps to sarcasm and 0 maps to no sarcasm.

{
  "tokens": "[ 수도권 노선 아이템 ] 17 . 신분당선의 #딸기 : 그의 이미지 컬러 혹은 머리 색에서 유래한 아이템이다 . #메트로라이프",
  "label": 0
}

Data Fields

tokens: contains the text of the tweet
label: indicates whether the text is sarcastic (1: sarcasm, 0: no sarcasm)
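
The two fields above can be wrapped in a small typed record when reading the data. The following is a minimal sketch, not part of the dataset itself: the names `SarcasmExample`, `parse_example`, and the string label names in `LABEL_NAMES` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Label ids as documented in this card: 1 = sarcasm, 0 = no sarcasm.
# The string names are illustrative assumptions, not dataset values.
LABEL_NAMES = {0: "no_sarcasm", 1: "sarcasm"}


@dataclass(frozen=True)
class SarcasmExample:
    """One record holding the two fields described in this card."""
    tokens: str  # text of the tweet
    label: int   # 1 = sarcasm, 0 = no sarcasm

    @property
    def label_name(self) -> str:
        return LABEL_NAMES[self.label]


def parse_example(record: dict) -> SarcasmExample:
    """Validate a raw record and wrap it in a SarcasmExample."""
    tokens = record["tokens"]
    label = record["label"]
    if not isinstance(tokens, str):
        raise TypeError("tokens must be a string")
    if label not in LABEL_NAMES:
        raise ValueError(f"unexpected label: {label!r}")
    return SarcasmExample(tokens=tokens, label=int(label))


# The instance shown under Data Instances.
example = parse_example({
    "tokens": "[ 수도권 노선 아이템 ] 17 . 신분당선의 #딸기 : 그의 이미지 컬러 혹은 머리 색에서 유래한 아이템이다 . #메트로라이프",
    "label": 0,
})
print(example.label_name)
```

Rejecting labels outside 0 and 1 early makes a corrupted or mis-parsed row fail loudly instead of silently skewing training.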

Data Splits

The data is split into a training set comprised of 9018 tweets and a test set of 301 tweets.
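
As a quick sanity check, the split sizes can be compared against the total of 9319 tweets reported in the Dataset Summary. This is only a sketch: `SPLIT_SIZES` and `split_fractions` are hypothetical names, and the split keys `train` and `test` are assumed here rather than taken from the card.

```python
# Split sizes as stated in this card; the key names are assumptions.
SPLIT_SIZES = {"train": 9018, "test": 301}
TOTAL_TWEETS = 9319  # total reported in the Dataset Summary


def split_fractions(sizes: dict) -> dict:
    """Return each split's share of the combined number of examples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


# The two splits should account for every collected tweet.
assert sum(SPLIT_SIZES.values()) == TOTAL_TWEETS

fractions = split_fractions(SPLIT_SIZES)
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
```

The test split is roughly 3% of the data, so metrics computed on it may vary noticeably between runs.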

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

The dataset was created by gathering HTML data from Twitter. Tweets were retrieved by querying for hashtags that include sarcasm and variants of it. The data was preprocessed by removing the keyword hashtags, URLs, and user mentions to preserve anonymity.
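
The preprocessing described above could look roughly like the sketch below. The curators' actual script is not given in this card, so the regular expressions, the function name `clean_tweet`, and the sample inputs are all assumptions made for illustration.

```python
import re
from typing import Iterable

# Assumed patterns; the curators' actual rules are not documented here.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str, query_keywords: Iterable[str]) -> str:
    """Remove URLs, user mentions, and the hashtags used as search queries."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    for keyword in query_keywords:
        # Drop only the exact keyword hashtag; other hashtags are kept.
        pattern = r"#" + re.escape(keyword) + r"(?!\w)"
        text = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    # Collapse the gaps left behind by the removals.
    return WHITESPACE_RE.sub(" ", text).strip()


# Made-up input, not taken from the dataset.
cleaned = clean_tweet("@user 오늘도 야근 #역설 https://t.co/abc 좋다", ["역설"])
print(cleaned)
```

Removing only the query hashtags, rather than every hashtag, matches the example instance in this card, which still contains hashtags such as #딸기.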

Who are the source language producers?

The source language producers are Korean Twitter users.

Annotations

Annotation process

Tweets were labeled 1 for sarcasm and 0 for no sarcasm.

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

User mentions in tweets were removed to keep the mentioned users anonymous.

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

This dataset was curated by Dionne Kim.

Licensing Information

This dataset is licensed under the MIT License.

Citation Information

@misc{kim2019kocasm,
  author = {Kim, Jiwon and Cho, Won Ik},
  title = {Kocasm: Korean Automatic Sarcasm Detection},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/SpellOnYou/korean-sarcasm}}
}

Contributions

Thanks to @stevhliu for adding this dataset.

Author:

Anonymous

Dataset size:

10.4 KB