A novel Romanian-language dataset for offensive language detection, with manually annotated offensive labels, built from comments on a local Romanian sports news website (gsp.ro). The result is 12,445 annotated messages.
An example of 'train' looks as follows.
{ 'id': 5, 'text':'PLACEHOLDER TEXT', 'label': 'OTHER' }
Split | Total | Other | Profanity | Insult | Abuse |
Train | 9953 | 3656 | 1293 | 2236 | 2768 |
Test | 2492 | 916 | 324 | 559 | 693 |
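As a quick consistency check on the counts above, the following minimal sketch sums the per-label counts for each split and derives the label shares. The dictionary layout and the helper name `label_shares` are illustrative assumptions, not part of the dataset's tooling; only the numbers and label names come from this card.

```python
# Per-label counts copied from the table above.
# The dict layout and function name are hypothetical, for illustration only.
counts = {
    "train": {"OTHER": 3656, "PROFANITY": 1293, "INSULT": 2236, "ABUSE": 2768},
    "test": {"OTHER": 916, "PROFANITY": 324, "INSULT": 559, "ABUSE": 693},
}


def label_shares(split_counts):
    """Return each label's fraction of the split's total message count."""
    total = sum(split_counts.values())
    return {label: n / total for label, n in split_counts.items()}


# Totals per split and overall; these should match the stated split sizes.
totals = {split: sum(c.values()) for split, c in counts.items()}
grand_total = sum(totals.values())

for split, split_counts in counts.items():
    shares = label_shares(split_counts)
    summary = ", ".join(f"{label} {share:.1%}" for label, share in shares.items())
    print(f"{split} ({totals[split]} messages): {summary}")
```

The per-label counts add up to the stated split sizes (9953 and 2492), and the two splits together give the 12,445 annotated messages.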
Collecting data for abusive language classification for Romanian Language.
For the labeling of texts we loosely base our definitions on the GermEval 2019 task for detecting offensive language in German tweets (Struß et al., 2019).
Data source: Comments on articles in Gazeta Sporturilor (gsp.ro) between 2011 and 2020
Selection for annotation: we select comments from a pool of specific articles, based on the number of comments on each article. The number of comments per article has the following distribution:
mean: 183.82, std: 334.71, min: 1, 25%: 20, 50% (median): 58, 75%: 179, max: 2151
Based on this, we select only comments from articles that have between 20 and 50 comments. We also remove comments containing URLs or three consecutive asterisks (***), since these were mostly censored by editors or by automatic profanity detection algorithms.
Additionally, in order to have meaningful messages for annotation, we select only messages with a length between 50 and 500 characters.
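The selection rules above can be sketched as a single predicate. This is a minimal illustration, not the authors' actual code: the function name `keep_comment`, the URL pattern, and the choice of inclusive bounds are all assumptions, since the card does not specify them.

```python
import re

# Assumed URL pattern; the card does not say how URLs were detected.
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)


def keep_comment(text, article_comment_count):
    """Return True if a comment passes the selection rules described above.

    Bounds are treated as inclusive, which is an assumption.
    """
    # Only articles with between 20 and 50 comments are considered.
    if not 20 <= article_comment_count <= 50:
        return False
    # Drop comments with URLs or three consecutive asterisks (likely censored).
    if URL_RE.search(text) or "***" in text:
        return False
    # Keep only messages between 50 and 500 characters long.
    return 50 <= len(text) <= 500


# Hypothetical usage with made-up (text, comment count) pairs.
candidates = [("x" * 80, 35), ("too short", 35), ("y" * 80, 400)]
selected = [text for text, n in candidates if keep_comment(text, n)]
```

Keeping the three rules in one function makes it easy to adjust a threshold, or the URL heuristic, without touching the rest of a collection pipeline.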
Sports News Articles comments
Initial Data Collection and Normalization
Who are the source language producers? Sports news article readers
OTHER: Label used for non-offensive texts.
PROFANITY: This is the "lighter" form of abusive language. We use this label when profane words are used without a direct intent to offend a target, or without ascribing negative qualities to a target. Some messages in this class may even have a positive sentiment and use swear words for emphasis. Messages containing profane words that are not directed at a specific group or person are labeled PROFANITY.
Self-censored messages in which swear words have some letters hidden, or deliberate misspellings of swear words clearly intended to circumvent profanity detectors, are also treated as PROFANITY.
INSULT: The message clearly aims to offend someone by ascribing negatively evaluated qualities or deficiencies, or by labeling a person or a group of persons as unworthy or unvalued. Insults imply disrespect and contempt directed at a target.
ABUSE: This label marks messages containing the stronger form of offensive and abusive language. This type of language ascribes to the target a social identity that is judged negatively by the majority of society, or is at least perceived as a mostly negatively judged identity. Shameful, unworthy or morally unacceptable identities fall into this category. In contrast to insults, instances of abusive language require that the target of judgment is seen as a representative of a group and is ascribed negative qualities that are taken to be universal, omnipresent and unchangeable characteristics of the group.
Additionally, dehumanizing language targeting a person or group is also classified as ABUSE.
Who are the annotators? Native speakers
The data was public at the time of collection. PII removal has been performed.
The data definitely contains abusive language. The data could be used to develop and propagate offensive language against every target group involved, i.e. ableism, racism, sexism, ageism, and so on.
This data is available and distributed under the Apache-2.0 license.