Dataset: hate_speech_pl
Task: text classification
Language: pl
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: found
Annotation creators: expert-generated
Source datasets: original
License: cc-by-nc-sa-3.0
The dataset was created to analyze the possibility of automating the recognition of hate speech in Polish. It was collected from Polish forums and represents various types and degrees of offensive language directed at minorities.
The original dataset is provided as an export of MySQL tables, which makes it hard to load. For that reason, it was converted to CSV and published in a GitHub repository.
The text is in Polish, collected from public forums, and retains its original HTML formatting.
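Because the text keeps its HTML markup, a consumer may want to strip tags before tokenization. The following is a minimal sketch using only the Python standard library; the helper name `strip_html` is hypothetical and not part of the dataset or any loader, and whether stripping is appropriate depends on the task (markup such as font colour might carry signal).

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment: str) -> str:
    """Return the fragment's text content with runs of whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Sample text taken from the example instance in this card.
sample = ' <font color="blue"> Niemiec</font> mówi co innego'
print(strip_html(sample))  # Niemiec mówi co innego
```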
The dataset consists of three collections, originally provided as separate MySQL tables. Here they are represented as three CSV files.
{ 'id': 1, 'text_id': 121713, 'annotator_id': 1, 'minority_id': 72, 'negative_emotions': false, 'call_to_action': false, 'source_of_knowledge': 2, 'irony_sarcasm': false, 'topic': 18, 'text': ' <font color=\"blue\"> Niemiec</font> mówi co innego', 'rating': 0 }
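The example instance can be mirrored by a small typed record, which makes the field names explicit when working with the data in code. This is only a sketch: the class name `AnnotatedPost` is hypothetical, and the field types are inferred from the single example above rather than from an official schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedPost:
    """One annotated instance; types are inferred from the example only."""

    id: int
    text_id: int
    annotator_id: int
    minority_id: int
    negative_emotions: bool
    call_to_action: bool
    source_of_knowledge: int
    irony_sarcasm: bool
    topic: int
    text: str
    rating: int


# The example instance from this card, written as a Python dict.
example = {
    "id": 1,
    "text_id": 121713,
    "annotator_id": 1,
    "minority_id": 72,
    "negative_emotions": False,
    "call_to_action": False,
    "source_of_knowledge": 2,
    "irony_sarcasm": False,
    "topic": 18,
    "text": ' <font color="blue"> Niemiec</font> mówi co innego',
    "rating": 0,
}

# Unpacking raises TypeError if a key is missing or unexpected.
post = AnnotatedPost(**example)
print(post.text_id, post.rating)  # 121713 0
```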
The dataset was not originally split at all.
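Since no split is provided, users have to create their own. The sketch below assumes that rows are dicts with a `text_id` key, as in the example instance, and that the same `text_id` may appear in several rows (for instance, once per annotator); that second point is an assumption, not something this card states. Splitting by `text_id` rather than by row keeps all rows for one text on the same side. The function name and default values are illustrative.

```python
import random


def split_by_text_id(rows, test_fraction=0.2, seed=0):
    """Split rows into (train, test) so that no text_id spans both parts."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    # Sort first so the shuffle is reproducible for a given seed.
    text_ids = sorted({row["text_id"] for row in rows})
    random.Random(seed).shuffle(text_ids)
    n_test = round(len(text_ids) * test_fraction)
    test_ids = set(text_ids[:n_test])
    train = [row for row in rows if row["text_id"] not in test_ids]
    test = [row for row in rows if row["text_id"] in test_ids]
    return train, test


# Toy data: 10 texts, each annotated twice.
rows = [{"text_id": i // 2, "annotator_id": i % 2} for i in range(20)]
train, test = split_by_text_id(rows)
print(len(train), len(test))  # 16 4
```

Fixing the seed makes the split reproducible, which matters when results are compared across experiments.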
[More Information Needed]
The dataset was collected from the public forums.
[More Information Needed]
Initial Data Collection and Normalization
[More Information Needed]
Who are the source language producers?
[More Information Needed]
[More Information Needed]
Annotation process
[More Information Needed]
Who are the annotators?
[More Information Needed]
The dataset doesn't contain any personal or sensitive information.
Automated hate speech recognition is the main beneficial outcome of using the dataset.
The dataset contains only negative posts and may therefore not be representative of the language as a whole.
The dataset is provided for research purposes only. Please check the dataset license for additional information.
The dataset was created by Marek Troszyński and Aleksander Wawer during work done at IPI PAN.
According to Metashare, the dataset is licensed under CC-BY-NC-SA, but the version is not specified.
@article{troszynski2017czy,
  title={Czy komputer rozpozna hejtera? Wykorzystanie uczenia maszynowego (ML) w jako{\'s}ciowej analizie danych},
  author={Troszy{\'n}ski, Marek and Wawer, Aleksandra},
  journal={Przegl{\k{a}}d Socjologii Jako{\'s}ciowej},
  volume={13},
  number={2},
  pages={62--80},
  year={2017},
  publisher={Uniwersytet {\L}{\'o}dzki, Wydzia{\l} Ekonomiczno-Socjologiczny, Katedra Socjologii~…}
}
Thanks to @kacperlukawski for adding this dataset.