数据集:
hope_edi
A Hope Speech dataset for Equality, Diversity and Inclusion (HopeEDI) containing user-generated comments from the social media platform YouTube with 28,451, 20,198 and 10,705 comments in English, Tamil and Malayalam, respectively, manually labelled as containing hope speech or not. To our knowledge, this is the first research of its kind to annotate hope speech for equality, diversity and inclusion in a multilingual setting.
To identify hope speech in the comments/posts in social media.
English, Tamil and Malayalam
An example from the English dataset looks as follows:
text | label |
---|---|
all lives matter .without that we never have peace so to me forever all lives matter. | Hope_speech |
I think it's cool that you give people a voice to speak out with here on this channel. | Hope_speech |
An example from the Tamil dataset looks as follows:
text | label |
---|---|
Idha solla ivalo naala | Non_hope_speech |
இன்று தேசிய பெண் குழந்தைகள் தினம்.. பெண் குழந்தைகளை போற்றுவோம்..அவர்களை பாதுகாப்போம்... | Hope_speech |
An example from the Malayalam dataset looks as follows:
text | label |
---|---|
ഇത്രെയും കഷ്ടപ്പെട്ട് വളർത്തിയ ആ അമ്മയുടെ മുഖം കണ്ടപ്പോൾ കണ്ണ് നിറഞ്ഞു പോയി | Hope_speech |
snehikunavar aanayalum pennayalum onnichu jeevikatte..aareyum compel cheythitallalooo..parasparamulla ishtathodeyalle...avarum jeevikatte..?? | Hope_speech |
English
Tamil
Malayalam
train | validation | |
---|---|---|
English | 22762 | 2843 |
Tamil | 16160 | 2018 |
Malayalam | 8564 | 1070 |
Hope is considered significant for the well-being, recuperation and restoration of human life by health professionals. Hate speech or offensive language detection dataset is not available for code-mixed Tamil and code-mixed Malayalam, and it does not take into account LGBTIQ, women in STEM and other minorities. Thus, we cannot use existing hate speech or offensive language detection datasets to detect hope or non-hope for EDI of minorities.
For English, we collected data on recent topics of EDI, including women in STEM, LGBTIQ issues, COVID-19, Black Lives Matters, United Kingdom (UK) versus China, United States of America (USA) versus China and Australia versus China from YouTube video comments. The data was collected from videos of people from English-speaking countries, such as Australia, Canada, the Republic of Ireland, United Kingdom, the United States of America and New Zealand.
For Tamil and Malayalam, we collected data from India on the recent topics regarding LGBTIQ issues, COVID-19, women in STEM, the Indo-China war and Dravidian affairs.
Who are the source language producers?Youtube users
We created Google forms to collect annotations from annotators. Each form contained a maximum of 100 comments, and each page contained a maximum of 10 comments to maintain the quality of annotation. We collected information on the gender, educational background and the medium of schooling of the annotator to know the diversity of the annotator and avoid bias. We educated annotators by providing them with YouTube videos on EDI. A minimum of three annotators annotated each form.
Who are the annotators?For English language comments, annotators were from Australia, the Republic of Ireland, the United Kingdom and the United States of America. For Tamil, we were able to get annotations from both people from the state of Tamil Nadu of India and from Sri Lanka. Most of the annotators were graduate or post-graduate students.
Social media data is highly sensitive, and even more so when it is related to the minority population, such as the LGBTIQ community or women. We have taken full consideration to minimise the risk associated with individual identity in the data by removing personal information from dataset, such as names but not celebrity names. However, to study EDI, we needed to keep information relating to the following characteristics; racial, gender, sexual orientation, ethnic origin and philosophical beliefs. Annotators were only shown anonymised posts and agreed to make no attempts to contact the comment creator. The dataset will only be made available for research purpose to the researcher who agree to follow ethical guidelines
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
This work is licensed under a Creative Commons Attribution 4.0 International Licence
@inproceedings{chakravarthi-2020-hopeedi, title = "{H}ope{EDI}: A Multilingual Hope Speech Detection Dataset for Equality, Diversity, and Inclusion", author = "Chakravarthi, Bharathi Raja", booktitle = "Proceedings of the Third Workshop on Computational Modeling of People's Opinions, Personality, and Emotion's in Social Media", month = dec, year = "2020", address = "Barcelona, Spain (Online)", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/2020.peoples-1.5", pages = "41--53", abstract = "Over the past few years, systems have been developed to control online content and eliminate abusive, offensive or hate speech content. However, people in power sometimes misuse this form of censorship to obstruct the democratic right of freedom of speech. Therefore, it is imperative that research should take a positive reinforcement approach towards online content that is encouraging, positive and supportive contents. Until now, most studies have focused on solving this problem of negativity in the English language, though the problem is much more than just harmful content. Furthermore, it is multilingual as well. Thus, we have constructed a Hope Speech dataset for Equality, Diversity and Inclusion (HopeEDI) containing user-generated comments from the social media platform YouTube with 28,451, 20,198 and 10,705 comments in English, Tamil and Malayalam, respectively, manually labelled as containing hope speech or not. To our knowledge, this is the first research of its kind to annotate hope speech for equality, diversity and inclusion in a multilingual setting. We determined that the inter-annotator agreement of our dataset using Krippendorff{'}s alpha. Further, we created several baselines to benchmark the resulting dataset and the results have been expressed using precision, recall and F1-score. The dataset is publicly available for the research community. We hope that this resource will spur further research on encouraging inclusive and responsive speech that reinforces positiveness.", }
Thanks to @jamespaultg for adding this dataset.