Datasets:

hate_speech18

Languages:

en

Multilinguality:

monolingual

Size categories:

10K<n<100K

Language creators:

found

Annotations creators:

found

Source datasets:

original

Dataset Card for hate_speech18

Dataset Summary

These files contain text extracted from Stormfront, a white supremacist forum. A random set of forum posts was sampled from several subforums and split into sentences. Each sentence was then manually labelled as containing hate speech or not, following a set of annotation guidelines.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

English

Dataset Structure

Data Instances

[More Information Needed]

Data Fields

  • text: the sentence itself
  • user_id: identifier that makes it possible to rebuild the conversation the sentence belongs to
  • subforum_id: identifier that makes it possible to rebuild the conversation the sentence belongs to
  • num_contexts: number of previous posts the annotator had to read before deciding on the category of the sentence
  • label: one of hate, noHate, relation (the sentence does not contain hate speech on its own, but the combination of several sentences in the post does), or idk/skip (the sentence is not written in English, or does not contain enough information to be classified as hate or noHate)
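As a rough illustration of how the fields above fit together, the following Python sketch models one record and shows one possible way to reduce the four labels to a binary hate/noHate subset. The field types (integers for the identifiers and for num_contexts, plain strings for labels), the `Sentence` class, and the helper names are assumptions made for this example; they are not part of the dataset or of any library, and the actual on-disk encoding of the label may differ.

```python
from collections import Counter
from dataclasses import dataclass

# The four label values listed in the field description above.
LABELS = ("hate", "noHate", "relation", "idk/skip")


@dataclass(frozen=True)
class Sentence:
    """Hypothetical in-memory view of one record (types are assumed)."""
    text: str
    user_id: int
    subforum_id: int
    num_contexts: int
    label: str

    def __post_init__(self):
        # Reject anything outside the documented label set.
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")


def binary_subset(rows):
    """Keep only sentences labelled hate or noHate.

    Dropping relation and idk/skip is one possible choice for a
    binary classification setup, not something the card prescribes.
    """
    return [r for r in rows if r.label in ("hate", "noHate")]


def label_counts(rows):
    """Count how many sentences carry each label."""
    return Counter(r.label for r in rows)


if __name__ == "__main__":
    # Placeholder records, not real dataset content.
    sample = [
        Sentence("placeholder sentence A", 1, 10, 0, "noHate"),
        Sentence("placeholder sentence B", 2, 10, 1, "hate"),
        Sentence("placeholder sentence C", 2, 11, 0, "relation"),
    ]
    print(label_counts(sample))
    print(len(binary_subset(sample)))
```

Whether to discard relation sentences or merge them into another class depends on the task; the sketch simply makes that decision explicit in one function so it is easy to change.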

Data Splits

[More Information Needed]

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

@inproceedings{gibert2018hate,
    title = "{Hate Speech Dataset from a White Supremacy Forum}",
    author = "de Gibert, Ona  and
      Perez, Naiara  and
      Garc{\'\i}a-Pablos, Aitor  and
      Cuadros, Montse",
    booktitle = "Proceedings of the 2nd Workshop on Abusive Language Online ({ALW}2)",
    month = oct,
    year = "2018",
    address = "Brussels, Belgium",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W18-5102",
    doi = "10.18653/v1/W18-5102",
    pages = "11--20",
}

Contributions

Thanks to @czabo for adding this dataset.