Dataset: kor_hate
Size: 1K<n<10K
Language creators: found
Source datasets: original
ArXiv: arxiv:2005.12503
License: cc-by-sa-4.0
Multilinguality: monolingual
Language: ko
Task: text classification
The Korean HateSpeech Dataset is a dataset of 8367 human-labeled entertainment news comments from a popular Korean news aggregation platform. Each comment is labeled along three dimensions: social bias (labels: gender, others, none), hate speech (labels: hate, offensive, none), and gender bias (labels: True, False). The dataset was created to support the identification of toxic comments on online platforms where users can remain anonymous.
The text in the dataset is in Korean, and the associated BCP-47 code is ko-KR.
An example data instance contains a comments field holding the text of the news comment, followed by a label for each of the following fields: contain_gender_bias, bias and hate.
{'comments': '설마 ㅈ 현정 작가 아니지??', 'contain_gender_bias': 'True', 'bias': 'gender', 'hate': 'hate'}
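As a rough illustration of the instance layout described above, the following sketch checks a record against the label sets listed in this card. It assumes labels arrive as plain strings, as in the example instance; a loaded copy of the dataset may encode them differently (for instance as integer class ids). The names `ALLOWED_LABELS` and `validate_instance` are hypothetical helpers, not part of any library.

```python
# Label sets taken from the dataset description in this card.
ALLOWED_LABELS = {
    "contain_gender_bias": {"True", "False"},
    "bias": {"gender", "others", "none"},
    "hate": {"hate", "offensive", "none"},
}


def validate_instance(instance: dict) -> list:
    """Return a list of human-readable problems; an empty list means the record looks valid."""
    problems = []
    text = instance.get("comments")
    if not isinstance(text, str) or not text.strip():
        problems.append("comments must be a non-empty string")
    for field, allowed in ALLOWED_LABELS.items():
        value = instance.get(field)
        if value not in allowed:
            problems.append(f"{field}={value!r} is not one of {sorted(allowed)}")
    return problems


example = {
    "comments": "설마 ㅈ 현정 작가 아니지??",
    "contain_gender_bias": "True",
    "bias": "gender",
    "hate": "hate",
}
print(validate_instance(example))  # []
```

A check like this can be useful before training, since a misspelled or missing label would otherwise silently become a new class.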
The data is split into a training set and a development (test) set. It contains 8367 annotated comments: 7896 in the training set and 471 in the test set.
The dataset was created to provide the first human-labeled Korean corpus for toxic speech detection, drawn from a Korean online entertainment news aggregator. Recently, two young Korean celebrities suffered a series of tragic incidents, which led two major Korean web portals to close the comments sections on their platforms. However, this serves only as a temporary solution, and the fundamental issue has not yet been solved. This dataset aims to improve Korean hate speech detection.
A total of 10.4 million comments were collected from an online Korean entertainment news aggregator between Jan. 1, 2018 and Feb. 29, 2020. From these, 1,580 articles were drawn using stratified sampling, and for each article the top 20 comments were extracted, ranked by the Wilson score of their downvotes. Duplicate comments, single-token comments, and comments with more than 100 characters (which could convey various opinions) were removed. From the remaining comments, 10K were randomly chosen for annotation.
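The selection and filtering steps above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the card does not give the exact Wilson score formula, so the sketch uses the lower bound of the 95% Wilson interval on the downvote ratio; it treats a "token" as a whitespace-separated word; and it keeps the first copy of a duplicated comment. The record keys `text`, `upvotes` and `downvotes` are hypothetical.

```python
import math


def wilson_lower_bound(downvotes: int, total_votes: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for the downvote proportion."""
    if total_votes == 0:
        return 0.0
    p = downvotes / total_votes
    z2 = z * z
    centre = p + z2 / (2 * total_votes)
    margin = z * math.sqrt((p * (1 - p) + z2 / (4 * total_votes)) / total_votes)
    return (centre - margin) / (1 + z2 / total_votes)


def top_comments(comments: list, k: int = 20) -> list:
    """Return the k comments of one article with the highest downvote Wilson score."""
    def score(comment: dict) -> float:
        total = comment["upvotes"] + comment["downvotes"]
        return wilson_lower_bound(comment["downvotes"], total)
    return sorted(comments, key=score, reverse=True)[:k]


def filter_comments(texts: list, max_chars: int = 100) -> list:
    """Drop duplicates, single-token comments, and comments longer than max_chars."""
    seen = set()
    kept = []
    for text in texts:
        cleaned = text.strip()
        if cleaned in seen:
            continue  # assumption: keep only the first copy of a duplicate
        seen.add(cleaned)
        if len(cleaned.split()) <= 1:
            continue  # assumption: tokens are whitespace-separated words
        if len(cleaned) > max_chars:
            continue
        kept.append(cleaned)
    return kept
```

Ranking by the interval's lower bound rather than the raw downvote ratio favors comments with many votes: a comment with 90 downvotes out of 100 ranks above one with 9 out of 10, even though both have the same ratio.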
Who are the source language producers?
The language producers are users of the Korean online news platform between 2018 and 2020.
Each comment was assigned to three random annotators, and the final label was decided by majority vote. Annotators were allowed to skip comments they found too ambiguous. See Appendix A in the paper for more detailed guidelines.
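A majority decision over three annotators can be sketched as below. The card does not say how skipped comments or three-way disagreements were handled, so this sketch makes two assumptions: a skip is represented as `None`, and a label is accepted only if more than half of the assigned annotators chose it. The function name `majority_label` is hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(votes: Sequence[Optional[str]]) -> Optional[str]:
    """Return the label chosen by a strict majority of annotators, or None if there is none."""
    cast = [vote for vote in votes if vote is not None]  # drop skipped annotations
    if not cast:
        return None
    label, count = Counter(cast).most_common(1)[0]
    # Require a strict majority of all assigned annotators, including those who skipped.
    return label if count > len(votes) / 2 else None


print(majority_label(["hate", "hate", "none"]))       # hate
print(majority_label(["hate", "offensive", "none"]))  # None
```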
Who are the annotators?
Annotation was performed by 32 annotators, consisting of 29 annotators from the crowdsourcing platform DeepNatural AI and three NLP researchers.
The purpose of this dataset is to tackle the social issue of users creating toxic comments on online platforms. This dataset aims to improve detection of toxic comments online.
This dataset is curated by Jihyung Moon, Won Ik Cho and Junbum Lee.
@inproceedings{moon-et-al-2020-beep,
    title = "{BEEP}! {K}orean Corpus of Online News Comments for Toxic Speech Detection",
    author = "Moon, Jihyung and Cho, Won Ik and Lee, Junbum",
    booktitle = "Proceedings of the Eighth International Workshop on Natural Language Processing for Social Media",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.socialnlp-1.4",
    pages = "25--31",
    abstract = "Toxic comments in online platforms are an unavoidable social issue under the cloak of anonymity. Hate speech detection has been actively done for languages such as English, German, or Italian, where manually labeled corpus has been released. In this work, we first present 9.4K manually labeled entertainment news comments for identifying Korean toxic speech, collected from a widely used online news platform in Korea. The comments are annotated regarding social bias and hate speech since both aspects are correlated. The inter-annotator agreement Krippendorff{'}s alpha score is 0.492 and 0.496, respectively. We provide benchmarks using CharCNN, BiLSTM, and BERT, where BERT achieves the highest score on all tasks. The models generally display better performance on bias identification, since the hate speech detection is a more subjective issue. Additionally, when BERT is trained with bias label for hate speech detection, the prediction score increases, implying that bias and hate are intertwined. We make our dataset publicly available and open competitions with the corpus and benchmarks.",
}
Thanks to @stevhliu for adding this dataset.