Dataset: sem_eval_2018_task_1
Tasks: text classification
Multilinguality: multilingual
Size: 1K<n<10K
Language creators: found
Annotation creators: crowdsourced
Source datasets: original
License: unknown

Tasks: We present an array of tasks where systems have to automatically determine the intensity of emotions (E) and intensity of sentiment (aka valence V) of the tweeters from their tweets. (The term tweeter refers to the person who has posted the tweet.) We also include a multi-label emotion classification task for tweets. For each task, we provide separate training and test datasets for English, Arabic, and Spanish tweets. The individual tasks are described below:
EI-reg (an emotion intensity regression task): Given a tweet and an emotion E, determine the intensity of E that best represents the mental state of the tweeter—a real-valued score between 0 (least E) and 1 (most E). Separate datasets are provided for anger, fear, joy, and sadness.
EI-oc (an emotion intensity ordinal classification task): Given a tweet and an emotion E, classify the tweet into one of four ordinal classes of intensity of E that best represents the mental state of the tweeter. Separate datasets are provided for anger, fear, joy, and sadness.
V-reg (a sentiment intensity regression task): Given a tweet, determine the intensity of sentiment or valence (V) that best represents the mental state of the tweeter—a real-valued score between 0 (most negative) and 1 (most positive).
V-oc (a sentiment analysis, ordinal classification, task): Given a tweet, classify it into one of seven ordinal classes, corresponding to various levels of positive and negative sentiment intensity, that best represents the mental state of the tweeter.
E-c (an emotion classification task): Given a tweet, classify it as 'neutral or no emotion' or as one, or more, of eleven given emotions that best represent the mental state of the tweeter.

Here, E refers to emotion, EI to emotion intensity, V to valence or sentiment intensity, reg to regression, oc to ordinal classification, and c to classification.
Together, these tasks encompass various emotion and sentiment analysis tasks. You are free to participate in any number of tasks and on any of the datasets.
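As a concrete example of how one of these tasks is scored: E-c systems were ranked by multi-label (Jaccard) accuracy in the official evaluation. A minimal scikit-learn sketch, using toy matrices that are illustrative only and not taken from the dataset:

```python
import numpy as np
from sklearn.metrics import jaccard_score

# Toy gold and predicted multi-hot matrices over the eleven E-c emotions
# (illustrative values only, not taken from the dataset).
y_true = np.array([[1, 0, 1] + [0] * 8,
                   [0, 1, 0] + [0] * 8])
y_pred = np.array([[1, 0, 0] + [0] * 8,
                   [0, 1, 0] + [0] * 8])

# Sample-averaged Jaccard similarity, i.e. multi-label accuracy:
# per tweet, |gold ∩ predicted| / |gold ∪ predicted|, averaged over tweets.
print(jaccard_score(y_true, y_pred, average="samples"))  # 0.75
```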
Currently, only subtask 5 (E-c) is available on the Hugging Face Dataset Hub.
English, Arabic and Spanish
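A minimal sketch of loading one config with the `datasets` library; config names follow the `subtask5.<language>` pattern shown below:

```python
from datasets import load_dataset

# Load the English config of subtask 5 (E-c); the other available configs
# are "subtask5.arabic" and "subtask5.spanish".
dataset = load_dataset("sem_eval_2018_task_1", "subtask5.english")

print(dataset)              # DatasetDict with train/validation/test splits
print(dataset["train"][0])  # one instance, like the example below
```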
An example from the subtask5.english config is:
{'ID': '2017-En-21441', 'Tweet': "“Worry is a down payment on a problem you may never have'. \xa0Joyce Meyer. #motivation #leadership #worry", 'anger': False, 'anticipation': True, 'disgust': False, 'fear': False, 'joy': False, 'love': False, 'optimism': True, 'pessimism': False, 'sadness': False, 'surprise': False, 'trust': True}
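For multi-label training it is common to collapse the eleven boolean columns into a single multi-hot vector. A sketch, where the `EMOTIONS` ordering is an assumed convention matching the field names above, not something mandated by the dataset:

```python
# The eleven emotion fields from the instance above; this ordering is an
# assumed convention for illustration, not mandated by the dataset.
EMOTIONS = ["anger", "anticipation", "disgust", "fear", "joy", "love",
            "optimism", "pessimism", "sadness", "surprise", "trust"]

def to_multi_hot(example):
    """Collapse the eleven boolean columns into one multi-hot label list."""
    example["labels"] = [float(example[e]) for e in EMOTIONS]
    return example

# Usage sketch: dataset = dataset.map(to_multi_hot)
```

Note that on the unlabeled test split this produces all-zero vectors, since all labels there are set to False (see below).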
For any config of subtask 5:
Note that the test set has no labels, and therefore all labels are set to False.
| | train | validation | test |
|---|---|---|---|
| English | 6,838 | 886 | 3,259 |
| Arabic | 2,278 | 585 | 1,518 |
| Spanish | 3,561 | 679 | 2,854 |
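The split sizes above can be checked directly once the configs are loaded; a small sketch:

```python
from datasets import load_dataset

# Print the split sizes from the table above for each language config.
for lang in ["english", "arabic", "spanish"]:
    ds = load_dataset("sem_eval_2018_task_1", f"subtask5.{lang}")
    print(lang, {split: ds[split].num_rows for split in ds})
# Expected for english: {'train': 6838, 'validation': 886, 'test': 3259}
```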
Initial Data Collection and Normalization: Tweets.

Who are the source language producers? Twitter users.
We presented one tweet at a time to the annotators and asked which of the following options best described the emotional state of the tweeter:

- anger (also includes annoyance, rage)
- anticipation (also includes interest, vigilance)
- disgust (also includes disinterest, dislike, loathing)
- fear (also includes apprehension, anxiety, terror)
- joy (also includes serenity, ecstasy)
- love (also includes affection)
- optimism (also includes hopefulness, confidence)
- pessimism (also includes cynicism, no confidence)
- sadness (also includes pensiveness, grief)
- surprise (also includes distraction, amazement)
- trust (also includes acceptance, liking, admiration)
- neutral or no emotion

Example tweets were provided in advance with examples of suitable responses. On the Figure Eight task settings, we specified that we needed annotations from seven people for each tweet. However, because of the way the gold tweets were set up, they were annotated by more than seven people. The median number of annotations was still seven. In total, 303 people annotated between 10 and 4,670 tweets each. A total of 174,356 responses were obtained.
Mohammad, S., Bravo-Marquez, F., Salameh, M., & Kiritchenko, S. (2018). SemEval-2018 task 1: Affect in tweets. Proceedings of the 12th International Workshop on Semantic Evaluation, 1–17. https://doi.org/10.18653/v1/S18-1001
Who are the annotators? Crowdworkers on Figure Eight.
Dataset curators: Saif M. Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko.
Licensing: see the official Terms and Conditions.
@InProceedings{SemEval2018Task1,
  author    = {Mohammad, Saif M. and Bravo-Marquez, Felipe and Salameh, Mohammad and Kiritchenko, Svetlana},
  title     = {SemEval-2018 {T}ask 1: {A}ffect in Tweets},
  booktitle = {Proceedings of International Workshop on Semantic Evaluation (SemEval-2018)},
  address   = {New Orleans, LA, USA},
  year      = {2018}
}
Thanks to @maxpel for adding this dataset.