Dataset: hate_speech_filipino
Tasks: Text Classification
Sub-tasks: sentiment-analysis
Languages: tl
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: crowdsourced
Annotations creators: machine-generated
License: unknown

Contains 10k tweets (training set) labeled as hate speech or non-hate speech, released with 4,232 validation and 4,232 test samples. The tweets were collected during the 2016 Philippine presidential election.
[More Information Needed]
The dataset is primarily in Filipino, with some English words that are commonly used in the Filipino vernacular.
Sample data:
{ "text": "Taas ni Mar Roxas ah. KULTONG DILAW NGA NAMAN", "label": 1 }
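As a rough illustration of the record layout shown in the sample, the sketch below parses one JSON record and maps its integer label to a readable name. The `LABEL_NAMES` mapping and the `parse_record` helper are hypothetical: the card does not state which integer corresponds to which class, so treating 1 as hate speech and 0 as non-hate speech is an assumption.

```python
import json
from typing import Tuple

# Assumed mapping, not confirmed by the card: 1 = hate speech, 0 = non-hate speech.
LABEL_NAMES = {0: "non-hate", 1: "hate"}


def parse_record(line: str) -> Tuple[str, str]:
    """Parse one JSON record and return (text, label name)."""
    record = json.loads(line)
    return record["text"], LABEL_NAMES[record["label"]]


sample = '{ "text": "Taas ni Mar Roxas ah. KULTONG DILAW NGA NAMAN", "label": 1 }'
text, label_name = parse_record(sample)
print(label_name, "|", text)
```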
[More Information Needed]
[More Information Needed]
This study seeks to contribute to the filling of this gap through the development of a model that can automate hate speech detection and classification in Philippine election-related tweets. The role of the microblogging site Twitter as a platform for the expression of support and hate during the 2016 Philippine presidential election has been supported in news reports and systematic studies. Thus, the particular question addressed in this paper is: Can existing techniques in language processing and machine learning be applied to detect hate speech in the Philippine election context?
The dataset used in this study was a subset of the corpus of 1,696,613 tweets crawled by Andrade et al. and posted from November 2015 to May 2016, during the campaign period for the Philippine presidential election. The tweets were selected based on the presence of candidate names (e.g., Binay, Duterte, Poe, Roxas, and Santiago) and election-related hashtags (e.g., #Halalan2016, #Eleksyon2016, and #PiliPinas2016).
Data preprocessing was performed to prepare the tweets for feature extraction and classification. It consisted of the following steps: data de-identification, uniform resource locator (URL) removal, special character processing, normalization, hashtag processing, and tokenization.
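The preprocessing steps listed above could be sketched roughly as follows. This is a minimal illustration rather than the authors' pipeline: the regular expressions, the `@user` placeholder, lowercasing as the normalization step, and whitespace tokenization are all assumptions, since the card does not describe how each step was implemented.

```python
import re
from typing import List

# All patterns below are illustrative assumptions, not the authors' rules.
MENTION_RE = re.compile(r"@\w+")                  # user handles
URL_RE = re.compile(r"https?://\S+|www\.\S+")     # web links
SPECIAL_RE = re.compile(r"[^\w\s#@]")             # punctuation except '#' and '@'
WHITESPACE_RE = re.compile(r"\s+")
HASHTAG_RE = re.compile(r"#(\w+)")


def preprocess(tweet: str) -> List[str]:
    """Apply the six steps in the order the card lists them."""
    text = MENTION_RE.sub("@user", tweet)              # 1. de-identification
    text = URL_RE.sub(" ", text)                       # 2. URL removal
    text = SPECIAL_RE.sub(" ", text)                   # 3. special character processing
    text = WHITESPACE_RE.sub(" ", text.lower()).strip()  # 4. normalization (assumed: lowercase)
    text = HASHTAG_RE.sub(r"\1", text)                 # 5. hashtag processing: drop '#', keep word
    return text.split()                                # 6. tokenization on whitespace


tokens = preprocess("@juan KULTONG DILAW nga naman!!! https://t.co/abc123 #Halalan2016")
print(tokens)
```

Keeping `#` and `@` out of the special-character pattern is a deliberate choice in this sketch, so that hashtags survive until the later hashtag-processing step and the mention placeholder stays recognizable.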
[More Information Needed]
Who are the source language producers?
[More Information Needed]
[More Information Needed]
Who are the annotators?
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
Jan Christian Cruz
[More Information Needed]
@article{Cabasag-2019-hate-speech,
  title={Hate speech in Philippine election-related tweets: Automatic detection and classification using natural language processing.},
  author={Neil Vicente Cabasag and Vicente Raphael Chan and Sean Christian Lim and Mark Edward Gonzales and Charibeth Cheng},
  journal={Philippine Computing Journal},
  volume={XIV},
  number={1},
  month={August},
  year={2019}
}
Thanks to @anaerobeth for adding this dataset.