Dataset: swedish_reviews
Task: text classification
Language: sv
Multilinguality: monolingual
Size: 100K<n<1M
Language creators: found
Annotation creators: found
Source datasets: original
License: unknown

The dataset is scraped from various Swedish websites that host reviews. It consists of 103,482 samples split between train, valid and test. It is a sample of the full dataset, balanced to the size of the minority class (negative). The original data dump was heavily skewed toward positive samples, with a 95/5 ratio.
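The card does not say how the balancing was done. As a minimal sketch, assuming plain random undersampling of the majority class, it could look like the following. The helper name `balance_to_minority`, the toy records, and the label encoding (0 for negative, 1 for positive) are illustrative assumptions, not the author's actual pipeline.

```python
import random
from collections import defaultdict


def balance_to_minority(samples, seed=0):
    """Randomly downsample every class to the size of the smallest class.

    Hypothetical helper: the dataset card does not document the real procedure.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample["label"]].append(sample)

    # The minority class sets the target size for all classes.
    target = min(len(group) for group in by_label.values())

    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Toy dump with a 95/5 skew; the texts are made-up placeholders.
raw_dump = [{"text": f"review {i}", "label": 1} for i in range(95)]
raw_dump += [{"text": f"review {i}", "label": 0} for i in range(95, 100)]

balanced = balance_to_minority(raw_dump)
print(len(balanced))
```

With a fixed seed the selection is reproducible, which matters if the balanced sample is later split into train, valid and test.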
This dataset can be used to evaluate sentiment classification on Swedish.
The text in the dataset is in Swedish.
What a sample looks like:
{ 'text': 'Jag tycker huggingface är ett grymt project!', 'label': 1, }
The data is split into training, validation and test sets. The final split sizes are as follows:
Train | Valid | Test |
---|---|---|
62089 | 20696 | 20697 |
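As a quick sanity check, the split sizes in the table can be summed and converted to fractions. This short sketch uses only the numbers from the table; the variable names are illustrative.

```python
# Split sizes copied from the table above.
split_sizes = {"train": 62089, "valid": 20696, "test": 20697}

total = sum(split_sizes.values())
fractions = {name: size / total for name, size in split_sizes.items()}

for name, size in split_sizes.items():
    print(f"{name}: {size} ({fractions[name]:.1%})")
print(f"total: {total}")
```

The total is 103,482, which matches the sample count stated in the description, and the proportions come out at roughly 60/20/20.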
[More Information Needed]
Various Swedish websites with product reviews.
Initial Data Collection and Normalization
Who are the source language producers? Swedish
[More Information Needed]
Annotation process: Automatically annotated from user ratings on a 1-5 scale, where ratings of 1-2 are considered negative and 4-5 positive; ratings of 3 are skipped because they tend to be neutral.
Who are the annotators? The users who have been using the products.
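The rating-to-label rule in the annotation process can be sketched as a small function. The integer encoding is an assumption: the sample record shows `'label': 1` for a positive review, so 0 is taken to mean negative here. The function name `rating_to_label` is hypothetical.

```python
from typing import Optional

# Assumed encoding: the sample record uses 1 for a positive review.
NEGATIVE, POSITIVE = 0, 1


def rating_to_label(rating: int) -> Optional[int]:
    """Map a 1-5 star rating to a sentiment label, or None if it is skipped."""
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError(f"rating must be an integer from 1 to 5, got {rating!r}")
    if rating <= 2:
        return NEGATIVE
    if rating >= 4:
        return POSITIVE
    return None  # a rating of 3 is treated as neutral and dropped


# Made-up ratings for illustration; neutral ones are filtered out.
ratings = [5, 1, 3, 4, 2]
labels = [label for label in map(rating_to_label, ratings) if label is not None]
print(labels)
```

Returning `None` for a rating of 3 keeps the skip decision explicit, so the caller decides how to filter those reviews.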
[More Information Needed]
The corpus was scraped by @timpal0l.
Research only.
No paper exists currently.
Thanks to @timpal0l for adding this dataset.