Dataset:
Djacon/ru_goemotions
The RuGoEmotions dataset contains 34k Reddit comments labeled for 9 emotion categories (joy, interest, surprise, sadness, anger, disgust, fear, guilt, and neutral). The dataset comes with predefined train/val/test splits.
This dataset is intended for multi-class, multi-label emotion classification.
The data is in Russian.
Each instance is a Reddit comment with one or more emotion annotations (or neutral).
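Because a comment can carry more than one emotion, its labels are commonly represented as a multi-hot vector with one position per category. The following is a minimal sketch of that encoding, not the dataset's own loader. The index order of `EMOTIONS` simply follows the list given above and is an assumption, and the helper names `to_multi_hot` and `from_multi_hot` are hypothetical.

```python
# Emotion categories as listed in the dataset description.
# ASSUMPTION: the index order is illustrative only; check the dataset's
# own label mapping before relying on it.
EMOTIONS = [
    "joy", "interest", "surprise", "sadness", "anger",
    "disgust", "fear", "guilt", "neutral",
]
LABEL_TO_INDEX = {name: i for i, name in enumerate(EMOTIONS)}


def to_multi_hot(labels):
    """Turn a list of emotion names into a 0/1 vector with one slot per category."""
    vector = [0] * len(EMOTIONS)
    for name in labels:
        vector[LABEL_TO_INDEX[name]] = 1
    return vector


def from_multi_hot(vector):
    """Recover the emotion names from a 0/1 vector."""
    return [EMOTIONS[i] for i, flag in enumerate(vector) if flag]


# Hypothetical annotation: one comment labeled with two emotions.
example = to_multi_hot(["joy", "surprise"])
print(example)                  # [1, 0, 1, 0, 0, 0, 0, 0, 0]
print(from_multi_hot(example))  # ['joy', 'surprise']
```

A vector of this shape is what a multi-label classifier would typically be trained to predict, with an independent yes/no decision per category rather than a single winning class.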
The configuration includes:
The simplified data includes a set of train/val/test splits with 26.9k, 3.29k, and 3.37k examples respectively.
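As a quick sanity check on the figures above, the split sizes can be converted into fractions of the whole. This sketch uses the rounded counts quoted in the text, so the totals are approximate. The split keys (`train`, `val`, `test`) mirror the wording here and may not match the names used in the published dataset.

```python
# Rounded split sizes as quoted in the text (26.9k, 3.29k, 3.37k).
# ASSUMPTION: the keys are illustrative; the published split names may differ.
SPLIT_SIZES = {"train": 26_900, "val": 3_290, "test": 3_370}


def split_fractions(sizes):
    """Return each split's share of the total number of examples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


total = sum(SPLIT_SIZES.values())
fractions = split_fractions(SPLIT_SIZES)

for name, fraction in fractions.items():
    print(f"{name}: {SPLIT_SIZES[name]} examples ({fraction:.1%})")
print(f"total: {total} examples")
```

With these rounded counts the total comes to about 33.6k, consistent with the 34k figure in the description, and the splits work out to roughly 80/10/10.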
From the paper abstract:
Understanding emotion expressed in language has a wide range of applications, from building empathetic chatbots to detecting harmful online behavior. Advancement in this area can be improved using large-scale datasets with a fine-grained typology, adaptable to multiple downstream tasks.
Data was collected from Reddit comments via a variety of automated methods discussed in Section 3.1 of the paper.
Who are the source language producers?
English-speaking Reddit users.
Annotations were produced by 3 English-speaking crowdworkers in India.
This dataset includes the original usernames of the Reddit users who posted each comment. Although Reddit usernames are typically dissociated from personal real-world identities, this is not always the case. It may therefore be possible in some cases to discover the identities of the individuals who created this content.
Emotion detection is a worthwhile problem which can potentially lead to improvements such as better human/computer interaction. However, emotion detection algorithms (particularly in computer vision) have been abused in some cases to make erroneous inferences in human monitoring and assessment applications such as hiring decisions, insurance pricing, and student attentiveness (see this article).
From the authors' GitHub page:
Potential biases in the data include: Inherent biases in Reddit and user base biases, the offensive/vulgar word lists used for data filtering, inherent or unconscious bias in assessment of offensive identity labels, annotators were all native English speakers from India. All these likely affect labelling, precision, and recall for a trained model. Anyone using this dataset should be aware of these limitations of the dataset.
Researchers at Amazon Alexa, Google Research, and Stanford. See the author list.
The GitHub repository that houses this dataset is released under the Apache License 2.0.
@inproceedings{demszky2020goemotions,
  author    = {Demszky, Dorottya and Movshovitz-Attias, Dana and Ko, Jeongwoo and Cowen, Alan and Nemade, Gaurav and Ravi, Sujith},
  booktitle = {58th Annual Meeting of the Association for Computational Linguistics (ACL)},
  title     = {{GoEmotions: A Dataset of Fine-Grained Emotions}},
  year      = {2020}
}
Thanks to @joeddav for adding this dataset.