Dataset: cedr
Task: text classification
Language: ru
Multilinguality: monolingual
Size: 1K<n<10K
Language creators: found
Annotation creators: crowdsourced
Source datasets: original
License: apache-2.0

The Corpus for Emotions Detecting in Russian-language text sentences of different social sources (CEDR) contains 9410 comments labeled for 5 emotion categories (joy, sadness, surprise, fear, and anger).
Here are 2 dataset configurations:
Dataset with predefined train/test splits.
This dataset is intended for multi-label emotion classification.
The data is in Russian.
Each instance is a text sentence in Russian from several sources with one or more emotion annotations (or no emotion at all).
An example instance from the dataset is shown below:
{
  'text': 'Забавно как люди в возрасте удивляются входящим звонкам на мобильник)',
  'labels': [0],
  'source': 'twitter',
  'sentences': [
    [
      {'forma': 'Забавно', 'lemma': 'Забавно'},
      {'forma': 'как', 'lemma': 'как'},
      {'forma': 'люди', 'lemma': 'человек'},
      {'forma': 'в', 'lemma': 'в'},
      {'forma': 'возрасте', 'lemma': 'возраст'},
      {'forma': 'удивляются', 'lemma': 'удивляться'},
      {'forma': 'входящим', 'lemma': 'входить'},
      {'forma': 'звонкам', 'lemma': 'звонок'},
      {'forma': 'на', 'lemma': 'на'},
      {'forma': 'мобильник', 'lemma': 'мобильник'},
      {'forma': ')', 'lemma': ')'}
    ]
  ]
}
Emotion label codes: {0: "joy", 1: "sadness", 2: "surprise", 3: "fear", 4: "anger"}
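Because an instance can carry several labels or none, it can help to convert the label codes either to names or to a fixed-length 0/1 vector. The following is a minimal sketch that uses only the mapping given above; the helper names `decode_labels` and `to_multi_hot` are illustrative and not part of the dataset.

```python
# Label codes exactly as listed in the dataset card.
ID2LABEL = {0: "joy", 1: "sadness", 2: "surprise", 3: "fear", 4: "anger"}


def decode_labels(label_ids):
    """Map a list of label codes to emotion names (empty list = no emotion)."""
    return [ID2LABEL[i] for i in label_ids]


def to_multi_hot(label_ids, num_labels=len(ID2LABEL)):
    """Encode label codes as a 0/1 vector, a common target format for multi-label training."""
    vector = [0] * num_labels
    for i in label_ids:
        vector[i] = 1
    return vector


print(decode_labels([0]))    # labels of the example instance above
print(to_multi_hot([0, 2]))  # an instance tagged with joy and surprise
print(to_multi_hot([]))      # an instance with no emotion
```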
The main configuration includes:
In addition to the above, the raw data includes:
The dataset includes train and test splits with 7528 and 1882 examples, respectively.
The dataset consists of sentences in Russian from several sources (blogs, microblogs, news), which makes it possible to develop methods for analysing various types of texts. The crowdsourcing-based methodology used to build the dataset can also be applied to expand the number of examples and thereby improve the accuracy of supervised classifiers.
Data was collected from several sources: posts of the LiveJournal social network, texts of the online news agency Lenta.ru, and Twitter microblog posts.
Only sentences that contained marker words from the dictionary of the emotive vocabulary of the Russian language were selected. The authors manually formed a list of marker words for each emotion by choosing words from different categories of the dictionary.
In total, 3069 sentences were selected from LiveJournal posts, 2851 sentences from Lenta.ru, and 3490 sentences from Twitter. After selection, sentences were offered to annotators for labeling.
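The marker-word selection step can be sketched roughly as follows. This is an illustration only: the marker words below are hypothetical placeholders (the authors' actual lists are not reproduced in this card), and matching on lowercased lemmas is an assumption. The token format follows the 'sentences' field shown in the example instance.

```python
# HYPOTHETICAL marker words for illustration; not the authors' actual lists.
MARKERS = {
    "surprise": {"удивляться"},
    "joy": {"радость"},
}


def matched_emotions(tokens, markers=MARKERS):
    """Return emotions whose marker words occur among the token lemmas.

    tokens: list of {'forma': ..., 'lemma': ...} dicts.
    Assumption: matching is done on lowercased lemmas.
    """
    lemmas = {t["lemma"].lower() for t in tokens}
    return sorted(e for e, words in markers.items() if lemmas & words)


def is_candidate(tokens, markers=MARKERS):
    """A sentence is kept for annotation if it contains at least one marker word."""
    return bool(matched_emotions(tokens, markers))


tokens = [
    {"forma": "удивляются", "lemma": "удивляться"},
    {"forma": "звонкам", "lemma": "звонок"},
]
print(matched_emotions(tokens), is_candidate(tokens))
```

Note that a marker word only makes a sentence a candidate; the final labels come from the annotators.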
Who are the source language producers?
Russian-speaking LiveJournal and Twitter users, and authors of news articles on the site lenta.ru.
Annotating sentences with labels of their emotions was performed with the help of a crowdsourcing platform.
The annotators were asked: “What emotions did the author express in the sentence?” They were allowed to assign any number of the following emotion labels: "joy", "sadness", "anger", "fear", and "surprise".
If the accuracy of an annotator on the control sentences (including the trial run) became less than 70%, or if the accuracy was less than 66% over the last six control samples, the annotator was dismissed.
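The dismissal rule can be expressed as a small check. This is a sketch under assumptions, not the platform's implementation: the function name `should_dismiss` is hypothetical, and the "last six" rule is assumed to apply only once at least six control answers exist.

```python
def should_dismiss(control_results, overall_threshold=0.70,
                   recent_threshold=0.66, window=6):
    """Decide whether an annotator should be dismissed.

    control_results: chronological list of booleans, True when the annotator
    answered a control sentence (including the trial run) correctly.
    """
    if not control_results:
        return False  # nothing to judge yet

    # Rule 1: overall accuracy on control sentences below 70%.
    overall = sum(control_results) / len(control_results)
    if overall < overall_threshold:
        return True

    # Rule 2: accuracy below 66% over the last six control samples.
    # Assumption: only checked when a full window is available.
    if len(control_results) >= window:
        recent = control_results[-window:]
        if sum(recent) / window < recent_threshold:
            return True

    return False


history = [True] * 20 + [False, False, False, True, True, True]
print(should_dismiss(history))
```

With a window of six, four correct answers (about 66.7%) pass the second rule, while three correct answers (50%) do not.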
Sentences were split into tasks and assigned to annotators so that each sentence was annotated at least three times. A specific emotion label was assigned to a sentence if it was chosen by more than half of the annotators.
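The aggregation rule is a per-emotion strict majority vote. A minimal sketch follows; the function name `aggregate` and the input format (one set of emotion names per annotator) are assumptions made for illustration.

```python
from collections import Counter

EMOTIONS = ["joy", "sadness", "surprise", "fear", "anger"]


def aggregate(annotations):
    """Return emotions chosen by more than half of the annotators.

    annotations: list with one collection of emotion names per annotator;
    an empty collection means the annotator marked no emotion.
    """
    n = len(annotations)
    # Count each emotion at most once per annotator.
    counts = Counter(label for ann in annotations for label in set(ann))
    # Strict majority: count > n / 2, written without floats.
    return [e for e in EMOTIONS if counts[e] * 2 > n]


print(aggregate([{"joy"}, {"joy", "surprise"}, set()]))
print(aggregate([{"fear"}, {"anger"}, {"sadness"}]))
```

Because the majority is strict, a sentence on which the annotators disagree completely ends up with no emotion label, which matches the "no emotion at all" case mentioned earlier.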
Who are the annotators?
Only users from the top 30% of best-performing active users (by the platform’s internal rating) who spoke Russian and were over 18 years old were admitted to the annotation process. Moreover, before a platform user could work as an annotator, they completed a training task and then had to label 25 trial samples with more than 80% agreement with the annotation that the authors had performed themselves.
The text of the sentences may contain profanity.
Researchers at the AI technology lab of NRC "Kurchatov Institute". See the author list.
The GitHub repository which houses this dataset has an Apache License 2.0.
If you have found our results helpful in your work, feel free to cite our publication. This is an updated version of the dataset, the collection and preparation of which is described here:
@article{sboev2021data,
  title={Data-Driven Model for Emotion Detection in Russian Texts},
  author={Sboev, Alexander and Naumov, Aleksandr and Rybka, Roman},
  journal={Procedia Computer Science},
  volume={190},
  pages={637--642},
  year={2021},
  publisher={Elsevier}
}
Thanks to @naumov-al for adding this dataset.