Datasets: laroseda
Tasks: text classification
Languages: ro
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: found
Annotation creators: found
Source datasets: original
License: cc-by-4.0

LaRoSeDa - A Large Romanian Sentiment Data Set. LaRoSeDa contains 15,000 reviews written in Romanian, of which 7,500 are positive and 7,500 are negative. Each sample has one of four star ratings: 1 or 2 for reviews considered to be of negative polarity, and 4 or 5 for the positive ones. The 15,000 samples, each labelled with its star rating, are split into train and test subsets of 12,000 and 3,000 samples, respectively.
LiRo Benchmark and Leaderboard
The text dataset is in Romanian (ro).
Below is an example of a sample from LaRoSeDa:
{
  "index": "9675",
  "title": "Nu recomand",
  "content": "probleme cu localizarea, mari...",
  "starRating": 1
}
where "9675" is the sample index, followed by the title of the review, review content and then the star rating given by the user.
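The rating-to-polarity rule stated in the dataset description (1 or 2 stars is negative, 4 or 5 stars is positive) can be applied to a sample like the one above. The following is a minimal sketch, not the authors' code: the helper name `polarity_from_stars` and the label strings `"negative"` and `"positive"` are assumptions made for illustration.

```python
import json


def polarity_from_stars(star_rating: int) -> str:
    """Map a LaRoSeDa star rating to a binary polarity label.

    Hypothetical helper: the label strings are chosen for illustration.
    """
    if star_rating in (1, 2):
        return "negative"
    if star_rating in (4, 5):
        return "positive"
    # Per the description, samples carry only ratings 1, 2, 4 or 5.
    raise ValueError(f"unexpected star rating: {star_rating}")


# The sample shown above, serialized as JSON (content is truncated in the card).
raw = (
    '{"index": "9675", "title": "Nu recomand", '
    '"content": "probleme cu localizarea, mari...", "starRating": 1}'
)
sample = json.loads(raw)
label = polarity_from_stars(sample["starRating"])
print(sample["index"], label)
```

Raising on any other rating keeps unexpected values visible instead of silently assigning them a label.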
The train/test split contains 12,000/3,000 samples tagged with the star rating assigned to each sample in the dataset.
The samples are preprocessed to eliminate named entities. This is required to prevent classifiers from basing their decisions on features that are not related to the topics. For example, named entities that refer to politicians' or football players' names can provide clues about the topic. For more details, please read the paper.
For the data collection, one of the largest Romanian e-commerce platforms was targeted. Along with the textual content of each review, the associated star rating was also collected in order to automatically assign labels to the collected text samples.
Who are the source language producers?
The original text comes from one of the largest e-commerce platforms in Romania.
As mentioned above, LaRoSeDa is composed of product reviews from one of the largest e-commerce websites in Romania. The resulting samples are automatically tagged with the star rating assigned by the users.
Who are the annotators?
N/A
The textual data collected for LaRoSeDa consists of product reviews freely available on the Internet. To the best of the authors' knowledge, the collected text contains no personal or sensitive information that needed to be considered.
This dataset is part of an effort to encourage text classification research in languages other than English. Such work increases the accessibility of natural language technology to more regions and cultures. In the past three years there has been growing interest in studying Romanian from a Computational Linguistics perspective. However, we are far from having enough datasets and resources in this particular language.
We note that most of the negative reviews (5,561) are rated with one star. Similarly, most of the positive reviews (6,238) are rated with five stars. Hence, the corpus is highly polarized.
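To make the polarization concrete, the counts above can be turned into shares of each class. This is a small arithmetic sketch that assumes the 5,561 and 6,238 figures are counted over the full corpus of 7,500 negative and 7,500 positive reviews; the two-star and four-star counts are derived by subtraction and are not stated in the card.

```python
# Figures taken from the dataset card.
NEGATIVE_TOTAL = 7_500
POSITIVE_TOTAL = 7_500
ONE_STAR = 5_561
FIVE_STAR = 6_238


def share(part: int, total: int) -> float:
    """Return part as a percentage of total."""
    return 100.0 * part / total


one_star_share = share(ONE_STAR, NEGATIVE_TOTAL)
five_star_share = share(FIVE_STAR, POSITIVE_TOTAL)

# Derived by subtraction (assumption: the counts cover the whole corpus).
two_star = NEGATIVE_TOTAL - ONE_STAR
four_star = POSITIVE_TOTAL - FIVE_STAR

print(f"1-star share of negatives: {one_star_share:.1f}%")
print(f"5-star share of positives: {five_star_share:.1f}%")
print(f"derived 2-star count: {two_star}, derived 4-star count: {four_star}")
```

Under that assumption, roughly three quarters of the negative reviews carry one star and more than four fifths of the positive reviews carry five stars.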
The star rating might not always reflect the polarity of the text. We thus acknowledge that the automatic labeling process is not optimal, i.e. some labels might be noisy.
Published and managed by Anca Tache, Mihaela Gaman and Radu Tudor Ionescu.
CC BY-SA 4.0 License
@article{tache2101clustering,
  title={Clustering Word Embeddings with Self-Organizing Maps. Application on LaRoSeDa -- A Large Romanian Sentiment Data Set},
  author={Anca Maria Tache and Mihaela Gaman and Radu Tudor Ionescu},
  journal={ArXiv},
  year={2021}
}
Thanks to @MihaelaGaman for adding this dataset.