Datasets:

readerbench/ro-offense

Languages:

ro

Multilinguality:

monolingual

Size Categories:

1K<n<10K

Language Creators:

found

Annotations Creators:

expert-generated

License:

apache-2.0

Dataset Card for "RO-Offense-Sequences"

Dataset Description

Dataset Summary

A novel Romanian-language dataset for offensive language detection, with manually annotated offensive labels for comments collected from a local Romanian sports news website (gsp.ro). The dataset contains 12,445 annotated messages.

Languages

Romanian

Dataset Structure

Data Instances

An example of 'train' looks as follows.

{
  'id': 5,
  'text':'PLACEHOLDER TEXT',
  'label': 'OTHER'
}

Data Fields

  • id : The unique comment ID, corresponding to the ID in RO Offense
  • text : full comment text
  • label : the type of offensive message (OTHER, PROFANITY, INSULT, ABUSE)
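
As a minimal sketch of how the three fields might be consumed, the snippet below validates a record and maps its label string to an integer id. The label names come from the field list above; the integer ordering and the helper name `encode_record` are assumptions made for illustration, not part of the dataset.

```python
# Label names are taken from the dataset card; the integer ids are an
# illustrative assumption (the card does not define a numeric encoding).
LABELS = ["OTHER", "PROFANITY", "INSULT", "ABUSE"]
LABEL2ID = {name: i for i, name in enumerate(LABELS)}
ID2LABEL = {i: name for name, i in LABEL2ID.items()}


def encode_record(record):
    """Check that a record has the expected fields and return (id, text, label_id)."""
    missing = {"id", "text", "label"} - record.keys()
    if missing:
        raise KeyError(f"missing fields: {sorted(missing)}")
    label = record["label"]
    if label not in LABEL2ID:
        raise ValueError(f"unknown label: {label!r}")
    return record["id"], record["text"], LABEL2ID[label]


# The example instance shown in "Data Instances".
example = {"id": 5, "text": "PLACEHOLDER TEXT", "label": "OTHER"}
encoded = encode_record(example)
```

Keeping the mapping in one place makes it easy to swap in a different encoding if a downstream model expects one.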

Data Splits

Split  Total  Other  Profanity  Insult  Abuse
Train  9953   3656   1293       2236    2768
Test   2492   916    324        559     693
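
The per-class counts above can be checked against the reported split totals, and turned into class shares, with a short script. This is only a sketch: the numbers are copied from the split counts, while the variable and function names are hypothetical.

```python
# Per-class counts copied from the split table in this card.
SPLITS = {
    "train": {"OTHER": 3656, "PROFANITY": 1293, "INSULT": 2236, "ABUSE": 2768},
    "test": {"OTHER": 916, "PROFANITY": 324, "INSULT": 559, "ABUSE": 693},
}
# Split totals as reported in the same table.
REPORTED_TOTALS = {"train": 9953, "test": 2492}


def class_shares(counts):
    """Return each label's fraction of the split."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


for split, counts in SPLITS.items():
    # The per-class counts should add up to the reported split size.
    assert sum(counts.values()) == REPORTED_TOTALS[split]
    shares = class_shares(counts)
    print(split, {label: f"{share:.1%}" for label, share in shares.items()})
```

The two splits together account for the 12,445 annotated messages mentioned in the summary.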

Dataset Creation

Curation Rationale

The dataset was collected to support abusive language classification for the Romanian language.

For the labeling of texts, we loosely base our definitions on the GermEval 2019 task for detecting offensive language in German tweets (Struß et al., 2019).

Data source: Comments on articles in Gazeta Sporturilor (gsp.ro) between 2011 and 2020

Selection for annotation: we select comments from a pool of specific articles based on the number of comments on the article. The number of comments per article has the following distribution:

mean        183.820923
std         334.707177
min           1.000000
25%          20.000000
50%          58.000000
75%         179.000000
max        2151.000000

Based on this, we select only comments from articles having between 20 and 50 comments. We also remove comments containing URLs or three consecutive asterisks (***), since these were mostly censored by editors or automatic profanity detection algorithms.

Additionally, in order to have meaningful messages for annotation, we select only messages with a length between 50 and 500 characters.
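
The selection rules above can be sketched as a single predicate. This is an illustration under stated assumptions, not the authors' actual pipeline: the URL pattern, the inclusive bounds, and the function name `keep_comment` are all hypothetical choices.

```python
import re

# Assumption: a comment "contains a URL" if it includes http(s):// or www.
# The card does not say how URLs were detected.
URL_RE = re.compile(r"https?://|www\.", re.IGNORECASE)


def keep_comment(text, article_comment_count):
    """Return True if a comment passes the selection rules described above.

    Bounds are treated as inclusive, which is an assumption.
    """
    # Only articles with between 20 and 50 comments are considered.
    if not 20 <= article_comment_count <= 50:
        return False
    # Drop comments with URLs or three consecutive asterisks (censored text).
    if URL_RE.search(text) or "***" in text:
        return False
    # Keep only messages between 50 and 500 characters long.
    return 50 <= len(text) <= 500


comments = [
    ("a" * 60, 30),                    # passes every rule
    ("a" * 60, 100),                   # article has too many comments
    ("see www.example.ro " + "a" * 60, 30),  # contains a URL
    ("a" * 40 + "***" + "a" * 20, 30),  # contains censored text
    ("too short", 30),                 # under 50 characters
]
selected = [text for text, n in comments if keep_comment(text, n)]
```

Putting the cheap article-level check first avoids running the regular expression on comments that would be discarded anyway.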

Source Data

Sports News Articles comments

Initial Data Collection and Normalization

Who are the source language producers?

Sports News Article readers

Annotations

  • Andrei Paraschiv
  • Irina Maria Sandu

Annotation process

OTHER

Label used for non-offensive texts.

PROFANITY

This is the "lighter" form of abusive language. We use this label when profane words are used without a direct intent to offend a target and without ascribing negative qualities to a target. Some messages in this class may even have a positive sentiment and use swearwords for emphasis. Messages containing profane words that are not directed at a specific group or person are labeled PROFANITY.

Also, self-censored messages in which swearwords have some letters hidden, or deliberate misspellings of swearwords that are clearly intended to circumvent profanity detectors, are treated as PROFANITY.

INSULT

The message clearly aims to offend someone by ascribing negatively evaluated qualities or deficiencies, or by labeling a person or a group of persons as unworthy or unvalued. Insults imply disrespect and contempt directed towards a target.

ABUSE

This label marks messages containing the stronger form of offensive and abusive language. This type of language ascribes to the target a social identity that is judged negatively by the majority of society, or is at least perceived as a mostly negatively judged identity. Shameful, unworthy or morally unacceptable identities fall into this category. In contrast to insults, instances of abusive language require that the target of judgment is seen as a representative of a group and is ascribed negative qualities that are taken to be universal, omnipresent and unchangeable characteristics of the group.

Additionally, dehumanizing language targeting a person or group is also classified as ABUSE.

Who are the annotators?

Native speakers

Personal and Sensitive Information

The data was public at the time of collection. PII removal has been performed.

Considerations for Using the Data

Social Impact of Dataset

The data contains abusive language. It could be misused to develop and propagate offensive language against every target group involved, including ableism, racism, sexism, ageism, and so on.

Discussion of Biases

Other Known Limitations

Additional Information

Dataset Curators

Licensing Information

This data is available and distributed under the Apache-2.0 license.

Citation Information

tbd

Contributions