Dataset: told-br
Tasks: text-classification
Languages: pt
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: crowdsourced
Annotation creators: crowdsourced
Source datasets: original
Preprint: arxiv:2010.04543
License: cc-by-sa-4.0

ToLD-Br is the biggest dataset for toxic tweets in Brazilian Portuguese, crowdsourced by 42 annotators selected from a pool of 129 volunteers. Annotators were selected to form a plural group in terms of demographics (ethnicity, sexual orientation, age, gender). Each tweet was labeled by three annotators in 6 possible categories: LGBTQ+phobia, Xenophobia, Obscene, Insult, Misogyny and Racism.
- text-classification-other-hate-speech-detection: The dataset can be used to train a model for hate speech detection, either using its multi-label classes or by grouping them into a binary Hate vs. Non-Hate class. A BERT model can be fine-tuned to perform this task and achieve a 0.75 F1-score on the binary version.
The text in the dataset is in Brazilian Portuguese, as written by Twitter users. The associated BCP-47 code is pt-BR.
ToLD-Br has two versions: binary and multilabel.
Multilabel: A data point consists of the tweet text (string) followed by 6 categories (homophobia, obscene, insult, racism, misogyny and xenophobia), each with a value from 0 to 3 giving the number of annotators who voted for that class.
An example from multilabel ToLD-Br looks as follows:
{'text': '@user bandido dissimulado. esse sérgio moro é uma espécie de mal carater com ditadura e pitadas de atraso', 'homophobia': 0, 'obscene': 0, 'insult': 2, 'racism': 0, 'misogyny': 0, 'xenophobia': 0}
Binary: A data point consists of the tweet text (string) followed by a binary class "toxic" with values 0 or 1.
An example from binary ToLD-Br looks as follows:
{'text': '@user bandido dissimulado. esse sérgio moro é uma espécie de mal carater com ditadura e pitadas de atraso', 'toxic': 1}
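The two examples above show the same tweet in both versions, which suggests how the multilabel votes can be grouped into a single binary class. The following is a minimal sketch of one such grouping, not necessarily the rule used by the dataset authors: the helper name `to_binary` and the `min_votes` threshold are assumptions introduced here for illustration.

```python
# Category fields of the multilabel version, as listed in this card.
LABELS = ("homophobia", "obscene", "insult", "racism", "misogyny", "xenophobia")


def to_binary(example, min_votes=1):
    """Return 1 if any category received at least `min_votes` votes, else 0.

    `min_votes` is a hypothetical threshold; this card does not state
    which threshold was used to build the binary version.
    """
    return int(any(example[label] >= min_votes for label in LABELS))


# The multilabel example from this card.
example = {
    "text": "@user bandido dissimulado. esse sérgio moro é uma espécie de "
            "mal carater com ditadura e pitadas de atraso",
    "homophobia": 0,
    "obscene": 0,
    "insult": 2,
    "racism": 0,
    "misogyny": 0,
    "xenophobia": 0,
}

print(to_binary(example))               # 1, matching 'toxic': 1 above
print(to_binary(example, min_votes=3))  # 0, no class has unanimous votes
```

A lower `min_votes` favours recall (any single annotator vote marks the tweet as toxic), while a higher value keeps only tweets on which annotators agree.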
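Multilabel: text (string); homophobia, obscene, insult, racism, misogyny and xenophobia (each an integer from 0 to 3, the number of annotator votes for that class).
Binary: text (string); toxic (0 or 1).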
Multilabel: The entire dataset consists of 21,000 examples.
Binary: The train set consists of 16,800 examples, the validation set of 2,100 examples and the test set of 2,100 examples.
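The binary split sizes correspond to an 80/10/10 partition of the 21,000 examples. Since the multilabel version ships as a single set, a split with the same proportions can be sketched as below. This is only an illustration: the function name `split_indices`, the plain random shuffle and the `seed` are assumptions, and this card does not say whether the original split was random or stratified.

```python
import random


def split_indices(n, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle the indices 0..n-1 and cut them into train/validation/test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # local RNG, reproducible for a given seed
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]      # whatever remains goes to test
    return train, val, test


train, val, test = split_indices(21_000)
print(len(train), len(val), len(test))  # 16800 2100 2100
```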
Despite Portuguese being the 5th most spoken language in the world and Brazil being the country with the 4th most unique users, Brazilian Portuguese was underrepresented in the hate-speech detection task. Only two other datasets were available, one of them in European Portuguese. ToLD-Br is 4x bigger than both of these datasets combined, and neither of them had multiple annotators per instance. In addition, this work proposes a plural and diverse group of annotators, carefully selected to avoid inserting bias into the annotation.
Data was collected over 15 days in August 2019 using GATE Cloud's Tweet Collector. Ten million tweets were collected using two methods: a keyword-based method and a user-mention method. The first method collected tweets mentioning the following keywords:
viado,veado,viadinho,veadinho,viadao,veadao,bicha,bixa,bichinha,bixinha,bichona,bixona,baitola,sapatão,sapatao,traveco,bambi,biba,boiola,marica,gayzão,gayzao,flor,florzinha,vagabundo,vagaba,desgraçada,desgraçado,desgracado,arrombado,arrombada,foder,fuder,fudido,fodido,cú,cu,pinto,pau,pal,caralho,caraio,carai,pica,cacete,rola,porra,escroto,buceta,fdp,pqp,vsf,tnc,vtnc,puto,putinho,acéfalo,acefalo,burro,idiota,trouxa,estúpido,estupido,estúpida,canalha,demente,retardado,retardada,verme,maldito,maldita,ridículo,ridiculo,ridícula,ridicula,morfético,morfetico,morfética,morfetica,lazarento,lazarenta,lixo,mongolóide,mongoloide,mongol,asqueroso,asquerosa,cretino,cretina,babaca,pilantra,neguinho,neguinha,pretinho,pretinha,escurinho,escurinha,pretinha,pretinho,crioulo,criolo,crioula,criola,macaco,macaca,gorila,puta,vagabunda,vagaba,mulherzinha,piranha,feminazi,putinha,piriguete,vaca,putinha,bahiano,baiano,baianagem,xingling,xing ling,xing-ling,carioca,paulista,sulista,mineiro,gringo
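To illustrate what keyword-based selection involves, here is a minimal sketch of whole-word, case-insensitive keyword matching. It is not the matching logic of the Tweet Collector, which this card does not describe; the function names are hypothetical, and `KEYWORDS` is only a small subset of the list above.

```python
import re

# Small illustrative subset of the keyword list above.
KEYWORDS = ["idiota", "burro", "trouxa", "xing ling", "ridículo"]


def build_keyword_pattern(keywords):
    """Compile one regex that matches any keyword as a whole word or phrase."""
    # Longest first, so multi-word phrases are tried before shorter keywords.
    ordered = sorted(set(keywords), key=len, reverse=True)
    alternation = "|".join(re.escape(keyword) for keyword in ordered)
    # The lookarounds prevent matches inside longer words (e.g. "burrocracia").
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)


def matched_keywords(text, pattern):
    """Return the sorted, lower-cased set of keywords found in `text`."""
    return sorted({match.group(0).lower() for match in pattern.finditer(text)})


pattern = build_keyword_pattern(KEYWORDS)
print(matched_keywords("Que IDIOTA, comprou um xing ling", pattern))
# ['idiota', 'xing ling']
print(matched_keywords("quanta burrocracia", pattern))
# []
```

Note that the list contains both accented and unaccented spellings (for example ridículo and ridiculo), so a matcher of this kind does not need to normalize accents itself.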
The list of most followed Brazilian Twitter accounts can be found here.

Who are the source language producers?

The language producers are Twitter users from Brazil, speakers of Portuguese.
A form was published at the Federal University of São Carlos asking for volunteers to annotate our dataset. 129 people volunteered and 42 were selected according to their demographics in order to create a diverse and plural annotation group. Guidelines were produced and presented to the annotators. The entire process was done asynchronously because of the Covid-19 pandemic. The tool used was Google Sheets. Annotators were grouped into 14 teams of three annotators each. Each group annotated a respective file containing 1500 tweets. Annotators didn't have contact with each other, nor did they know that other annotators were labelling the same tweets as they were.
Who are the annotators?

Annotators were people from the Federal University of São Carlos' Facebook group. Their demographics are described below:

Gender | Count |
---|---|
Male | 18 |
Female | 24 |

Sexual Orientation | Count |
---|---|
Heterosexual | 22 |
Bisexual | 12 |
Homosexual | 5 |
Pansexual | 3 |

Ethnicity | Count |
---|---|
White | 25 |
Brown | 9 |
Black | 5 |
Asian | 2 |
Non-Declared | 1 |

Ages range from 18 to 37 years old.
Annotators were paid R$50 (about US$10) to label 1500 examples each.
The dataset contains sensitive content, including homophobic, obscene, insulting, racist, misogynistic and xenophobic language.
Tweets were anonymized by replacing user mentions with a @user tag.
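A minimal sketch of this kind of mention replacement, useful for applying the same normalization to new tweets before inference. The regular expression and the function name `anonymize_mentions` are assumptions for illustration, not the authors' preprocessing code.

```python
import re

# "@" followed by handle characters, but not when "@" is preceded by a
# word character, so that e-mail addresses are left untouched.
MENTION_RE = re.compile(r"(?<!\w)@\w+", re.ASCII)


def anonymize_mentions(text):
    """Replace every user mention in `text` with the generic @user tag."""
    return MENTION_RE.sub("@user", text)


print(anonymize_mentions("@joao_97 olha isso @Maria!"))
# @user olha isso @user!
print(anonymize_mentions("contato: nome@example.com"))
# contato: nome@example.com
```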
The purpose of this dataset is to help develop better hate speech detection systems.
A system that succeeds at this task would be able to identify hate speech tweets associated with the classes available in the dataset.
An effort was made to reduce annotation bias by selecting annotators with a diverse demographic background. In terms of data collection, by using keywords and user mentions, we are introducing some bias to the data, restricting our scope to the list of keywords and users we created.
Because of the massive data skew for the multilabel classes, it is extremely hard to train a robust model for this version of the dataset. We advise using it for analysis and experimentation only. The binary version of the dataset is robust enough to train a classifier with up to 76% F1-score.
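One way to inspect this skew before training is to compute, for each class, the share of examples that received at least a given number of votes. The sketch below runs on a few made-up rows, not on actual dataset records; `positive_rates`, `make_row` and the `min_votes` threshold are hypothetical names introduced for illustration.

```python
# Category fields of the multilabel version, as listed in this card.
LABELS = ("homophobia", "obscene", "insult", "racism", "misogyny", "xenophobia")


def positive_rates(rows, min_votes=1):
    """Return, per class, the fraction of rows with at least `min_votes` votes."""
    rows = list(rows)
    if not rows:
        return {}
    return {
        label: sum(row[label] >= min_votes for row in rows) / len(rows)
        for label in LABELS
    }


def make_row(**votes):
    """Build a toy row with 0 votes for every class not given explicitly."""
    return {label: votes.get(label, 0) for label in LABELS}


# Toy rows for illustration only.
toy_rows = [
    make_row(obscene=3, insult=1),
    make_row(),
    make_row(obscene=2),
    make_row(insult=2, misogyny=1),
]

for label, rate in positive_rates(toy_rows).items():
    print(f"{label}: {rate:.2f}")
```

Classes with a rate close to zero have very few positive examples, which is what makes the multilabel version hard to learn from.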
The dataset was created by João Augusto Leite and Diego Furtado Silva, both from the Federal University of São Carlos (BR), and Carolina Scarton and Kalina Bontcheva, both from the University of Sheffield (UK).
ToLD-Br is licensed under the Creative Commons BY-SA 4.0 license.
@article{DBLP:journals/corr/abs-2010-04543,
  author     = {Joao Augusto Leite and Diego F. Silva and Kalina Bontcheva and Carolina Scarton},
  title      = {Toxic Language Detection in Social Media for Brazilian Portuguese: New Dataset and Multilingual Analysis},
  journal    = {CoRR},
  volume     = {abs/2010.04543},
  year       = {2020},
  url        = {https://arxiv.org/abs/2010.04543},
  eprinttype = {arXiv},
  eprint     = {2010.04543},
  timestamp  = {Tue, 15 Dec 2020 16:10:16 +0100},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2010-04543.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
Thanks to @JAugusto97 for adding this dataset.