Dataset:
Overfit-GM/turkish-toxic-language
This text dataset is a collection of Turkish texts that have been merged from various existing offensive language datasets found online. The dataset contains a total of 77,800 instances, each labeled as either offensive or not offensive.
To ensure the dataset's completeness, we utilized multiple transformer models to augment the dataset with pseudo labels. The resulting dataset is designed to be a comprehensive resource for Turkish offensive language detection.
The dataset is provided in CSV format. For more details on the merged source datasets, please refer to the reference section.
To load the dataset with the Hugging Face `datasets` library, use the snippet below:

    from datasets import load_dataset

    # If the dataset is gated/private, make sure you have run `huggingface-cli login` first.
    dataset = load_dataset("Overfit-GM/turkish-toxic-language")
Dataset Information | |
---|---|
Number of instances | 77,800 |
Target label distribution | |
OTHER | 37,663 |
PROFANITY | 18,252 |
INSULT | 10,777 |
RACIST | 10,163 |
SEXIST | 945 |
Number of offensive instances | 40,137 |
Number of non-offensive instances | 37,663 |
Data source distribution | |
Jigsaw Multilingual Toxic Comments | 35,624 |
Turkish Offensive Language Detection Dataset | 39,551 |
Turkish Cyberbullying Dataset | 2,525 |
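
The label counts in the table can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: it copies the per-label counts from the table and treats `OTHER` as the non-offensive class, which matches the two totals reported above. The helper names `binary_split` and `count_labels`, and the column name `target`, are assumptions made for this example and are not confirmed by the dataset card.

```python
from collections import Counter

# Per-label counts copied from the "Target label distribution" rows of the table.
TARGET_COUNTS = {
    "OTHER": 37_663,
    "PROFANITY": 18_252,
    "INSULT": 10_777,
    "RACIST": 10_163,
    "SEXIST": 945,
}


def binary_split(target_counts, non_offensive_label="OTHER"):
    """Collapse fine-grained labels into (offensive, non_offensive) totals.

    Assumption: every label other than `non_offensive_label` is offensive.
    """
    non_offensive = target_counts.get(non_offensive_label, 0)
    offensive = sum(target_counts.values()) - non_offensive
    return offensive, non_offensive


def count_labels(rows, column="target"):
    """Count label occurrences in an iterable of dict-like rows.

    `column="target"` is an assumed column name; adjust it to the real schema.
    """
    return Counter(row[column] for row in rows)


total = sum(TARGET_COUNTS.values())
offensive, non_offensive = binary_split(TARGET_COUNTS)
print(total, offensive, non_offensive)
```

If the loaded dataset exposes a `train` split and a `target` column (both assumptions here), `count_labels(dataset["train"])` should reproduce the same per-label counts directly from the data.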