Dataset: offenseval2020_tr
Task: text classification
Language: tr
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: found
Annotation creators: found
Source datasets: original
License: cc-by-2.0
The file offenseval-tr-training-v1.tsv contains 31,756 annotated tweets.
The file offenseval-annotation.txt contains a short summary of the annotation guidelines.
Twitter user mentions have been replaced by @USER, and URLs have been replaced by URL.
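The released files are already anonymized, but new tweets need the same substitution before they can be compared with this data. The following is a minimal sketch of such a step; the two regular expressions and the `anonymize` helper are assumptions made for illustration, not the preprocessing the dataset authors used.

```python
import re

# Hypothetical patterns: the original preprocessing rules are not documented here.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")  # \w is Unicode-aware, so Turkish letters match


def anonymize(text: str) -> str:
    """Replace URLs with the token URL and user mentions with @USER."""
    # URLs go first so that an '@' inside a link is not treated as a mention.
    text = URL_RE.sub("URL", text)
    return MENTION_RE.sub("@USER", text)


example = anonymize("@ali buna bak https://t.co/abc123")
```

The function is idempotent on text that already uses the placeholders, since `@USER` is itself matched and replaced by `@USER`. It is deliberately simple: an e-mail address, for instance, would also be caught by the mention pattern.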
Each instance carries at most one label, for sub-task A (offensive language identification).
The dataset was published in the paper cited at the end of this card.
The tweets are in Turkish.
A binary dataset with Not Offensive (NOT) and Offensive (OFF) tweets.
Instances are included in TSV format as follows:
ID INSTANCE SUBA
The column names in the file are the following:
id tweet subtask_a
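The layout above can be read with Python's standard `csv` module. This is a sketch under two assumptions: that the file starts with a header row holding the column names listed above, and that fields are separated by tabs. The `read_instances` helper and the two sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import Counter


def read_instances(handle):
    """Return (id, tweet, label) tuples from a tab-separated file object."""
    # QUOTE_NONE keeps quotation marks inside tweets as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row["id"], row["tweet"], row["subtask_a"]) for row in reader]


# Invented sample in the assumed layout, used instead of the real file.
SAMPLE = (
    "id\ttweet\tsubtask_a\n"
    "1\t@USER merhaba\tNOT\n"
    '2\t@USER "ornek" URL\tOFF\n'
)

rows = read_instances(io.StringIO(SAMPLE))
label_counts = Counter(label for _, _, label in rows)
```

For the real data, the same function can be given `open("offenseval-tr-training-v1.tsv", encoding="utf-8", newline="")` in place of the in-memory sample.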
The labels used in the annotation are listed below.
Task and Labels
(A) Sub-task A: Offensive language identification
In our annotation, we label a post as offensive (OFF) if it contains any form of non-acceptable language (profanity) or a targeted offense, which can be veiled or direct.
train | test |
---|---|
31756 | 3528 |
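As a quick check on the split sizes in the table, the short sketch below computes each split's share of the total. The variable names are illustrative only.

```python
# Split sizes as given in the table above.
SPLIT_SIZES = {"train": 31756, "test": 3528}

total = sum(SPLIT_SIZES.values())
# Fraction of all instances that falls into each split.
shares = {name: count / total for name, count in SPLIT_SIZES.items()}
```

The result is close to a 90/10 division between training and test data.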
Initial Data Collection and Normalization
[More Information Needed]
Who are the source language producers?
The tweets come from Twitter.
Annotation process
We describe the labels above in a “flat” manner. However, the annotation process we follow is hierarchical. The following QA pairs give a more flowchart-like procedure to follow.
The annotations are distributed under the terms of the Creative Commons Attribution License (CC-BY). Please cite the following paper if you use this resource.
@inproceedings{coltekin2020lrec,
  author = {\c{C}\"{o}ltekin, \c{C}a\u{g}r{\i}},
  year = {2020},
  title = {A Corpus of Turkish Offensive Language on Social Media},
  booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference},
  pages = {6174--6184},
  address = {Marseille, France},
  url = {https://www.aclweb.org/anthology/2020.lrec-1.758},
}
Thanks to @yavuzKomecoglu for adding this dataset.