Dataset:

Overfit-GM/turkish-toxic-language

Language:

tr

Size:

10K<n<100K

License:

apache-2.0

Turkish Texts for Toxic Language Detection

Dataset Description

Dataset Summary

This dataset is a collection of Turkish texts merged from several existing offensive language datasets available online. It contains 77,800 instances, each labeled as either offensive or not offensive.

To ensure the dataset's completeness, we used multiple transformer models to augment it with pseudo labels. The resulting dataset is intended as a comprehensive resource for Turkish offensive language detection.

The dataset is provided in CSV format. For more details on the merged datasets, please refer to the Source Data & References section.

Loading Dataset

To load the dataset with the Hugging Face datasets library, use the snippet below:

from datasets import load_dataset

# If the dataset is gated/private, make sure you have run huggingface-cli login
dataset = load_dataset("Overfit-GM/turkish-toxic-language")

Dataset Structure

Dataset Information
Number of instances: 77,800

Target label distribution:
    OTHER: 37,663
    PROFANITY: 18,252
    INSULT: 10,777
    RACIST: 10,163
    SEXIST: 945

Number of offensive instances: 40,137
Number of non-offensive instances: 37,663

Data source distribution:
    Jigsaw Multilingual Toxic Comments: 35,624
    Turkish Offensive Language Detection Dataset: 39,551
    Turkish Cyberbullying Dataset: 2,525
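
The relationship between the fine-grained target labels and the offensive / non-offensive counts in the Dataset Information table can be checked with a short sketch. It uses only the counts listed in the table and does not download the dataset. It assumes that OTHER is the non-offensive class, which the matching count of 37,663 suggests. The helper name to_binary and the inverse-frequency class weights are illustrative additions, not part of the dataset or its published tooling.

```python
# Counts copied from the Dataset Information table.
label_counts = {
    "OTHER": 37_663,
    "PROFANITY": 18_252,
    "INSULT": 10_777,
    "RACIST": 10_163,
    "SEXIST": 945,
}


def to_binary(label: str) -> int:
    """Map a fine-grained label to 1 (offensive) or 0 (not offensive).

    Assumption: OTHER is the only non-offensive label.
    """
    return 0 if label == "OTHER" else 1


total = sum(label_counts.values())
offensive = sum(n for label, n in label_counts.items() if to_binary(label) == 1)
non_offensive = total - offensive

# Inverse-frequency class weights: a common heuristic for imbalanced
# classes, shown here for illustration only.
num_classes = len(label_counts)
class_weights = {
    label: total / (num_classes * n) for label, n in label_counts.items()
}

print(f"total={total} offensive={offensive} non_offensive={non_offensive}")
for label, n in label_counts.items():
    print(f"{label:<10} share={n / total:.2%} weight={class_weights[label]:.2f}")
```

The binary split is close to balanced, while the fine-grained labels are not: SEXIST has far fewer examples than any other class, so a multi-class model may benefit from class weighting or resampling.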

Source Data & References