Dataset:
strombergnlp/offenseval_2020
OffensEval 2020 features a multilingual dataset covering five languages: Arabic, Danish, English, Greek, and Turkish.
The annotation follows the hierarchical tagset proposed in the Offensive Language Identification Dataset (OLID) and used in OffensEval 2019. This taxonomy breaks offensive content down into three sub-tasks that take the type and target of the offensive content into account: Sub-task A, offensive language identification; Sub-task B, automatic categorization of offense types; and Sub-task C, offense target identification.
The English training data is omitted and must be obtained separately (see https://zenodo.org/record/3950379#.XxZ-aFVKipp).
The source datasets come from:
Five languages are covered, given here as BCP-47 codes: ar, da, en, gr, tr.
There are five named configs, one per language: ar, da, en, gr, tr.
The English training data is absent: it consists of 9M tweets that must be rehydrated separately. See https://zenodo.org/record/3950379#.XxZ-aFVKipp
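As a minimal sketch, one language config might be loaded as follows, assuming the Hugging Face `datasets` library is installed and the Hub is reachable. The helper name `load_offenseval` and the `CONFIGS` tuple are hypothetical conveniences, not part of the dataset; the config names are the ones listed in this card.

```python
# Config names as listed in this card (one per language).
CONFIGS = ("ar", "da", "en", "gr", "tr")


def load_offenseval(config: str):
    """Load one language config of strombergnlp/offenseval_2020.

    Hypothetical helper: assumes the `datasets` library is installed and
    that network access to the Hugging Face Hub is available.
    """
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    # Imported lazily so the config check works without the library installed.
    from datasets import load_dataset

    return load_dataset("strombergnlp/offenseval_2020", config)


# Example usage (requires network access):
# danish = load_offenseval("da")
# print(danish["train"][0])
```

Depending on the installed version of the library, datasets that ship a loading script may need additional arguments or may not load at all, so it is worth checking the library documentation for the version in use. Note also that the `en` config provides no training examples, as described above.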
An example of 'train' looks as follows.
{ 'id': '0', 'text': 'PLACEHOLDER TEXT', 'subtask_a': 1, }
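A record of this shape could be handled as in the sketch below. The `Instance` type, the `describe` helper, and in particular the label mapping are assumptions: this card does not state what the integer in `subtask_a` means, and the mapping here follows the OLID convention of NOT and OFF for sub-task A.

```python
from typing import TypedDict


class Instance(TypedDict):
    """Shape of one example, mirroring the fields shown in this card."""
    id: str
    text: str
    subtask_a: int


# ASSUMPTION: 0 = NOT (not offensive), 1 = OFF (offensive), as in OLID sub-task A.
# Verify against the dataset's own feature metadata before relying on this.
LABEL_NAMES = {0: "NOT", 1: "OFF"}


def describe(instance: Instance) -> str:
    """Return a tab-separated summary: id, label name, text."""
    label = LABEL_NAMES[instance["subtask_a"]]
    return f"{instance['id']}\t{label}\t{instance['text']}"


example: Instance = {"id": "0", "text": "PLACEHOLDER TEXT", "subtask_a": 1}
print(describe(example))
```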
name | train | test |
---|---|---|
ar | 7839 | 1827 |
da | 2961 | 329 |
en | 0 | 3887 |
gr | 8743 | 1544 |
tr | 31277 | 3515 |
Collecting data for abusive language classification. Each dataset has a different rationale.
Varies per language dataset
Who are the source language producers? Social media users
Varies per language dataset
Who are the annotators? Varies per language dataset; native speakers
The data was public at the time of collection. No PII removal has been performed.
The data definitely contains abusive language. The data could be used to develop and propagate offensive language against every target group involved, for example through ableism, racism, sexism, ageism, and so on.
The datasets are curated by the authors of each sub-part's paper.
This data is available and distributed under the Creative Commons Attribution license, CC-BY 4.0.
@inproceedings{zampieri-etal-2020-semeval,
    title = "{S}em{E}val-2020 Task 12: Multilingual Offensive Language Identification in Social Media ({O}ffens{E}val 2020)",
    author = {Zampieri, Marcos and Nakov, Preslav and Rosenthal, Sara and Atanasova, Pepa and Karadzhov, Georgi and Mubarak, Hamdy and Derczynski, Leon and Pitenis, Zeses and {\c{C}}{\"o}ltekin, {\c{C}}a{\u{g}}r{\i}},
    booktitle = "Proceedings of the Fourteenth Workshop on Semantic Evaluation",
    month = dec,
    year = "2020",
    address = "Barcelona (online)",
    publisher = "International Committee for Computational Linguistics",
    url = "https://aclanthology.org/2020.semeval-1.188",
    doi = "10.18653/v1/2020.semeval-1.188",
    pages = "1425--1447",
    abstract = "We present the results and the main findings of SemEval-2020 Task 12 on Multilingual Offensive Language Identification in Social Media (OffensEval-2020). The task included three subtasks corresponding to the hierarchical taxonomy of the OLID schema from OffensEval-2019, and it was offered in five languages: Arabic, Danish, English, Greek, and Turkish. OffensEval-2020 was one of the most popular tasks at SemEval-2020, attracting a large number of participants across all subtasks and languages: a total of 528 teams signed up to participate in the task, 145 teams submitted official runs on the test data, and 70 teams submitted system description papers.",
}
Author-added dataset @leondz