数据集:
jeanlee/kmhas_korean_hate_speech
任务:
文本分类语言:
ko计算机处理:
monolingual大小:
100K<n<1M语言创建人:
found批注创建人:
crowdsourced源数据集:
original预印本库:
arxiv:2208.10684许可:
cc-by-sa-4.0The Korean Multi-label Hate Speech Dataset, K-MHaS , consists of 109,692 utterances from Korean online news comments, labelled with 8 fine-grained hate speech classes (labels: Politics , Origin , Physical , Age , Gender , Religion , Race , Profanity ) or Not Hate Speech class. Each utterance provides from a single to four labels that can handles Korean language patterns effectively. For more details, please refer to our paper about K-MHaS , published at COLING 2022.
Hate Speech Detection
For the multi-label classification, a Hate Speech class from the binary classification, is broken down into eight classes, associated with the hate speech category. In order to reflect the social and historical context, we select the eight hate speech classes. For example, the Politics class is chosen, due to a significant influence on the style of Korean hate speech.
Korean
The dataset is provided with train/validation/test set in the txt format. Each instance is a news comment with a corresponding one or more hate speech classes (labels: Politics , Origin , Physical , Age , Gender , Religion , Race , Profanity ) or Not Hate Speech class. The label numbers matching in both English and Korean is in the data fields section.
{'text':'수꼴틀딱시키들이 다 디져야 나라가 똑바로 될것같다..답이 없는 종자들ㅠ' 'label': [2, 3, 4] }
In our repository, we provide splitted datasets that have 78,977(train) / 8,776 (validation) / 21,939 (test) samples, preserving the class proportion.
We propose K-MHaS, a large size Korean multi-label hate speech detection dataset that represents Korean language patterns effectively. Most datasets in hate speech research are annotated using a single label classification of particular aspects, even though the subjectivity of hate speech cannot be explained with a mutually exclusive annotation scheme. We propose a multi-label hate speech annotation scheme that allows overlapping labels associated with the subjectivity and the intersectionality of hate speech.
Our dataset is based on the Korean online news comments available on Kaggle and Github. The unlabeled raw data was collected between January 2018 and June 2020. Please see the details in our paper K-MHaS published at COLING2020.
Who are the source language producers?The language producers are users who left the comments on the Korean online news platform between 2018 and 2020.
We begin with the common categories of hate speech found in literature and match the keywords for each category. After the preliminary round, we investigate the results to merge or remove labels in order to provide the most representative subtype labels of hate speech contextual to the cultural background. Our annotation instructions explain a twolayered annotation to (a) distinguish hate and not hate speech, and (b) the categories of hate speech. Annotators are requested to consider given keywords or alternatives of each category within social, cultural, and historical circumstances. For more details, please refer to the paper K-MHaS .
Who are the annotators?Five native speakers were recruited for manual annotation in both the preliminary and main rounds.
This datasets contains examples of hateful language, however, has no personal information.
We propose K-MHaS, a new large-sized dataset for Korean hate speech detection with a multi-label annotation scheme. We provided extensive baseline experiment results, presenting the usability of a dataset to detect Korean language patterns in hate speech.
All annotators were recruited from a crowdsourcing platform. They were informed about hate speech before handling the data. Our instructions allowed them to feel free to leave if they were uncomfortable with the content. With respect to the potential risks, we note that the subjectivity of human annotation would impact on the quality of the dataset.
[More Information Needed]
This dataset is curated by Taejun Lim, Heejun Lee and Bogeun Jo.
Creative Commons Attribution-ShareAlike 4.0 International (cc-by-sa-4.0).
@inproceedings{lee-etal-2022-k, title = "K-{MH}a{S}: A Multi-label Hate Speech Detection Dataset in {K}orean Online News Comment", author = "Lee, Jean and Lim, Taejun and Lee, Heejun and Jo, Bogeun and Kim, Yangsok and Yoon, Heegeun and Han, Soyeon Caren", booktitle = "Proceedings of the 29th International Conference on Computational Linguistics", month = oct, year = "2022", address = "Gyeongju, Republic of Korea", publisher = "International Committee on Computational Linguistics", url = "https://aclanthology.org/2022.coling-1.311", pages = "3530--3538", abstract = "Online hate speech detection has become an important issue due to the growth of online content, but resources in languages other than English are extremely limited. We introduce K-MHaS, a new multi-label dataset for hate speech detection that effectively handles Korean language patterns. The dataset consists of 109k utterances from news comments and provides a multi-label classification using 1 to 4 labels, and handles subjectivity and intersectionality. We evaluate strong baselines on K-MHaS. KR-BERT with a sub-character tokenizer outperforms others, recognizing decomposed characters in each hate speech class.", }
The contributors of the work are: