Dataset: go_emotions
Tasks: text-classification
Languages: en
Multilinguality: monolingual
Language creators: found
Annotation creators: crowdsourced
Source datasets: original
ArXiv: arxiv:2005.00547
Other: emotion
License: apache-2.0

The GoEmotions dataset contains 58k carefully curated Reddit comments labeled for 27 emotion categories or Neutral. Both the raw data and a smaller, simplified version of the dataset with predefined train/val/test splits are included.
This dataset is intended for multi-class, multi-label emotion classification.
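Because one comment can carry several emotions at once, a common way to prepare targets for multi-label training is a fixed-length 0/1 vector. The following is a minimal sketch, not the authors' code: it assumes label ids run from 0 to 27 (27 emotions plus Neutral), and the helper name `to_multi_hot` is made up for illustration.

```python
from typing import List

# Assumption: 28 classes in total (27 emotion categories plus Neutral).
NUM_LABELS = 28


def to_multi_hot(label_ids: List[int], num_labels: int = NUM_LABELS) -> List[float]:
    """Turn a list of label ids into a fixed-length 0/1 target vector."""
    target = [0.0] * num_labels
    for label_id in label_ids:
        if not 0 <= label_id < num_labels:
            raise ValueError(f"label id {label_id} is outside 0..{num_labels - 1}")
        target[label_id] = 1.0
    return target


# A comment annotated with two emotions at once (ids chosen arbitrarily).
example_target = to_multi_hot([3, 17])
print(sum(example_target))  # number of active labels
```

A vector like this pairs naturally with a per-label sigmoid and binary cross-entropy loss, rather than a single softmax over all classes.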
The data is in English.
Each instance is a Reddit comment with a corresponding ID and one or more emotion annotations (or neutral).
The simplified configuration includes:
In addition to the above, the raw data includes:
In the raw data, labels are listed as their own columns with binary 0/1 entries rather than a list of ids as in the simplified data.
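The difference between the two layouts can be bridged with a small conversion. The sketch below is illustrative only: the three-label list, the column names, and the helper `binary_columns_to_ids` are hypothetical stand-ins, and the real dataset defines its own label order.

```python
from typing import Dict, List, Sequence

# Hypothetical label order, for illustration only.
LABEL_NAMES: Sequence[str] = ("joy", "anger", "neutral")


def binary_columns_to_ids(
    row: Dict[str, object], label_names: Sequence[str] = LABEL_NAMES
) -> List[int]:
    """Collect the indices of all label columns that are set to 1."""
    return [idx for idx, name in enumerate(label_names) if row.get(name) == 1]


# A raw-style row: one 0/1 column per label.
raw_style_row = {"text": "That made my day!", "joy": 1, "anger": 0, "neutral": 0}

# The simplified-style equivalent: a list of label ids.
simplified_labels = binary_columns_to_ids(raw_style_row)
print(simplified_labels)  # [0]
```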
The simplified data includes a set of train/val/test splits with 43,410, 5,426, and 5,427 examples respectively.
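As a quick sanity check on these counts, the split proportions can be computed directly. This is plain arithmetic on the numbers stated above; the dictionary keys are just labels chosen for this sketch.

```python
# Split sizes as stated for the simplified configuration.
split_sizes = {"train": 43_410, "val": 5_426, "test": 5_427}

total = sum(split_sizes.values())
fractions = {name: size / total for name, size in split_sizes.items()}

for name, frac in fractions.items():
    print(f"{name}: {frac:.1%}")
print("total:", total)
```

The splits come out at roughly 80/10/10, and the total of 54,263 examples is smaller than the 58k comments in the raw data.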
From the paper abstract:
Understanding emotion expressed in language has a wide range of applications, from building empathetic chatbots to detecting harmful online behavior. Advancement in this area can be improved using large-scale datasets with a fine-grained typology, adaptable to multiple downstream tasks.
Data was collected from Reddit comments via a variety of automated methods discussed in Section 3.1 of the paper.
Who are the source language producers?
English-speaking Reddit users.
[More Information Needed]
Who are the annotators?
Annotations were produced by 3 English-speaking crowdworkers in India.
This dataset includes the original usernames of the Reddit users who posted each comment. Although Reddit usernames are typically dissociated from personal real-world identities, this is not always the case. It may therefore be possible, in some cases, to discover the identities of the individuals who created this content.
Emotion detection is a worthwhile problem that can potentially lead to improvements such as better human/computer interaction. However, emotion detection algorithms (particularly in computer vision) have in some cases been abused to make erroneous inferences in human monitoring and assessment applications such as hiring decisions, insurance pricing, and student attentiveness (see this article).
From the authors' github page:
Potential biases in the data include: Inherent biases in Reddit and user base biases, the offensive/vulgar word lists used for data filtering, inherent or unconscious bias in assessment of offensive identity labels, annotators were all native English speakers from India. All these likely affect labelling, precision, and recall for a trained model. Anyone using this dataset should be aware of these limitations of the dataset.
[More Information Needed]
Researchers at Amazon Alexa, Google Research, and Stanford. See the author list.
The GitHub repository which houses this dataset has an Apache License 2.0.
@inproceedings{demszky2020goemotions,
  author = {Demszky, Dorottya and Movshovitz-Attias, Dana and Ko, Jeongwoo and Cowen, Alan and Nemade, Gaurav and Ravi, Sujith},
  booktitle = {58th Annual Meeting of the Association for Computational Linguistics (ACL)},
  title = {{GoEmotions: A Dataset of Fine-Grained Emotions}},
  year = {2020}
}
Thanks to @joeddav for adding this dataset.