Dataset: ruanchaves/reli-sa

Dataset Card for ReLi-SA

Dataset Summary

ReLi is a dataset created by Cláudia Freitas within the framework of the project "Semantic Annotators based on Active Learning" at PUC-Rio. It consists of 1,600 book reviews manually annotated for the presence of opinions on the reviewed book and for the polarity of those opinions. The dataset contains reviews in Brazilian Portuguese of books written by seven authors: Stephenie Meyer, Thalita Rebouças, Sidney Sheldon, Jorge Amado, George Orwell, José Saramago, and J.D. Salinger. The language of the reviews ranges from highly informal, with slang, abbreviations, neologisms, and emoticons, to more formal, with a more elaborate vocabulary.

ReLi-SA is an adaptation of the original ReLi dataset for the sentiment analysis task. We attribute a sentiment polarity to each sentence according to the sentiment annotations of its individual tokens.
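The token-to-sentence mapping can be sketched as follows. This is a minimal illustration, not the script used to build ReLi-SA: the function name `sentence_polarity` and the token tag strings are hypothetical, and only the `mixed` rule (both positive and negative tokens present) is stated in this card. The handling of the other three cases is an assumption.

```python
from typing import Iterable


def sentence_polarity(token_polarities: Iterable[str]) -> str:
    """Collapse token-level polarity tags into one sentence-level label.

    Assumes each token carries one of the (hypothetical) tags
    "positive", "negative", or "neutral".
    """
    tags = set(token_polarities)
    has_positive = "positive" in tags
    has_negative = "negative" in tags

    # Rule stated in the card: both polarities present -> mixed.
    if has_positive and has_negative:
        return "mixed"
    # Assumed: a single polarity present gives that polarity.
    if has_positive:
        return "positive"
    if has_negative:
        return "negative"
    # Assumed: no polar tokens at all -> neutral.
    return "neutral"
```

For example, `sentence_polarity(["neutral", "positive", "negative"])` returns `"mixed"` under these assumptions.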

Supported Tasks and Leaderboards

  • sentiment-analysis: The dataset can be used to train a sentiment analysis model, which classifies the sentiment expressed in a sentence as positive, negative, neutral, or mixed. Performance on this task is typically measured by the F1 score.
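The card does not say which F1 variant is meant. As one possible reading, the sketch below computes macro-averaged F1 over the four labels in plain Python. The function name `macro_f1` is hypothetical, and scoring a class with no gold or predicted instances as 0.0 is an assumption.

```python
from collections import Counter

LABELS = ("positive", "negative", "neutral", "mixed")


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-class F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted this class, but it was wrong
            fn[gold] += 1  # missed the gold class

    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the class never occurs.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Macro averaging weights all four classes equally, which may matter here if some labels are much rarer than others. A weighted or micro average would be an equally valid choice.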

Languages

This dataset is in Brazilian Portuguese.

Dataset Structure

Data Instances

{
  'source': 'ReLi-Orwell.txt',
  'title': 'False',
  'book': '1984',
  'review_id': '0',
  'score': 5.0,
  'sentence_id': 102583,
  'unique_review_id': 'ReLi-Orwell_1984_0',
  'sentence': ' Um ótimo livro , além de ser um ótimo alerta para uma potencial distopia , em contraponto a utopia tão sonhada por os homens de o medievo e início de a modernidade .',
  'label': 'positive'
}

Data Fields

  • source: The source file of the review.
  • title: A boolean field indicating whether the sentence is a review title (True) or not (False).
  • book: The book that the review is about.
  • review_id: The review ID within the source file.
  • score: The score the review attributes to the book.
  • sentence_id: The sequential ID of the sentence (can be used to sort the sentences within a review).
  • unique_review_id: A unique ID for the review a sentence belongs to.
  • sentence: The sentence for which the label indicates the sentiment.
  • label: The sentiment label: positive, neutral, negative, or mixed. A sentence is labeled mixed if it contains both positive and negative polarity tokens.
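Since each row is a single sentence, full reviews can be reassembled by grouping on unique_review_id and sorting on sentence_id. The sketch below assumes rows are available as plain dictionaries with the fields listed above. The helper name `rebuild_reviews` and the sample rows are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict


def rebuild_reviews(rows):
    """Map each unique_review_id to its sentences joined in sentence_id order."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["unique_review_id"]].append(row)

    reviews = {}
    for review_id, sentences in grouped.items():
        ordered = sorted(sentences, key=lambda r: r["sentence_id"])
        reviews[review_id] = " ".join(s["sentence"].strip() for s in ordered)
    return reviews


# Made-up rows, deliberately out of order.
sample_rows = [
    {"unique_review_id": "review_A", "sentence_id": 2, "sentence": " Segunda frase ."},
    {"unique_review_id": "review_B", "sentence_id": 5, "sentence": " Outra resenha ."},
    {"unique_review_id": "review_A", "sentence_id": 1, "sentence": " Primeira frase ."},
]
reviews = rebuild_reviews(sample_rows)
```

Title sentences are kept in the output here. Filtering on the title field first would drop them.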

Data Splits

The dataset is divided into three splits:

            train    validation    test
Instances   7,875    1,348         3,288

The splits are constructed so that reviews of books by a given author appear in only one split.
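The author separation can be checked with a small helper. This sketch assumes that each source value (for example 'ReLi-Orwell.txt') corresponds to exactly one author, which the sample instance suggests but the card does not state. The function name `overlapping_sources` and the toy splits are hypothetical.

```python
def overlapping_sources(splits):
    """Return {source: set of split names} for sources found in more than one split.

    `splits` maps a split name to an iterable of rows (dicts with a "source" key).
    """
    seen = {}
    for split_name, rows in splits.items():
        for source in {row["source"] for row in rows}:
            seen.setdefault(source, set()).add(split_name)
    return {source: names for source, names in seen.items() if len(names) > 1}


# Toy data with hypothetical file names: "a.txt" leaks from train into test.
toy_splits = {
    "train": [{"source": "a.txt"}, {"source": "b.txt"}],
    "validation": [{"source": "c.txt"}],
    "test": [{"source": "a.txt"}],
}
leaks = overlapping_sources(toy_splits)
```

An empty result on the real splits would be consistent with the author separation described above.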

Additional Information

Citation Information

If you use this dataset in your work, please cite the following publication:

@incollection{freitas2014sparkling,
  title={Sparkling Vampire... lol! Annotating Opinions in a Book Review Corpus},
  author={Freitas, Cl{\'a}udia and Motta, Eduardo and Milidi{\'u}, Ruy Luiz and C{\'e}sar, Juliana},
  booktitle={New Language Technologies and Linguistic Research: A Two-Way Road},
  editor={Alu{\'\i}sio, Sandra and Tagnin, Stella E. O.},
  year={2014},
  publisher={Cambridge Scholars Publishing},
  pages={128--146}
}

Contributions

Thanks to @ruanchaves for adding this dataset.