Dataset:

swedish_reviews

Task:

Text classification

Subtask:

sentiment-classification

Language:

Multilinguality:

monolingual

Size:

100K<n<1M

Language creators:

found

Annotation creators:

found

Source datasets:

original

License:

license:unknown

Dataset Card for Swedish Reviews

Dataset Summary

The dataset was scraped from various Swedish websites that host reviews. It consists of 103,482 samples split between train, valid and test sets. It is a sample of the full data dump, downsampled so that the classes are balanced to the size of the minority class (negative). The original data dump was heavily skewed toward positive samples, with a 95/5 ratio.
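The card does not say how the balancing was done. As a minimal sketch, assuming simple random downsampling of each class to the minority-class size, the idea could look like the following. The function name `balance_to_minority`, the seed, and the toy data are all hypothetical and are not taken from the original scraping pipeline.

```python
import random
from collections import Counter


def balance_to_minority(samples, seed=0):
    """Randomly downsample every class to the size of the smallest class."""
    rng = random.Random(seed)
    # Group the samples by their label.
    by_label = {}
    for sample in samples:
        by_label.setdefault(sample["label"], []).append(sample)
    # The smallest class determines how many samples each class keeps.
    minority_size = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, minority_size))
    rng.shuffle(balanced)
    return balanced


# Toy data (made up) with a 95/5 positive/negative skew.
skewed = [{"text": f"review {i}", "label": 1} for i in range(95)]
skewed += [{"text": f"review {i}", "label": 0} for i in range(95, 100)]

balanced = balance_to_minority(skewed)
label_counts = Counter(sample["label"] for sample in balanced)
print(label_counts)
```

Downsampling discards most of the majority class, which is consistent with the card describing the published data as a sample of the full dump.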

Supported Tasks and Leaderboards

This dataset can be used to evaluate sentiment classification models on Swedish text.
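The card names no official metric or leaderboard. As a hedged sketch, assuming plain accuracy is used as the evaluation metric, a scoring helper might look like this. The `accuracy` function and the toy labels are hypothetical.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy gold labels (made up), balanced between the two classes.
gold = [1, 0, 1, 0]

# A trivial baseline that always predicts "positive" (label 1).
always_positive = [1] * len(gold)
baseline_accuracy = accuracy(always_positive, gold)
print(baseline_accuracy)
```

On class-balanced data a constant predictor scores about 0.5, which makes accuracy a reasonable first metric here. The card does not state whether each individual split is balanced.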

Languages

The text in the dataset is in Swedish.

Dataset Structure

Data Instances

What a sample looks like:

{
 'text': 'Jag tycker huggingface är ett grymt project!',
 'label': 1,
}

Data Fields

text: The review text in which the sentiment is expressed.
label: An int representing the sentiment: 0 for negative and 1 for positive.
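To make the field description concrete, here is a small sketch that checks a record against this schema and maps the integer label to a name. `LABEL_NAMES` and `is_valid_sample` are hypothetical helpers and not part of the dataset or any library.

```python
# Label ids as described in the field list: 0 = negative, 1 = positive.
LABEL_NAMES = {0: "negative", 1: "positive"}


def is_valid_sample(sample):
    """Return True if the record has a string text and a 0/1 integer label."""
    return (
        isinstance(sample, dict)
        and isinstance(sample.get("text"), str)
        and type(sample.get("label")) is int  # rejects bools and missing labels
        and sample["label"] in LABEL_NAMES
    )


# The example instance from the card.
sample = {"text": "Jag tycker huggingface är ett grymt project!", "label": 1}

sample_is_valid = is_valid_sample(sample)
sample_label_name = LABEL_NAMES[sample["label"]]
print(sample_is_valid, sample_label_name)
```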

Data Splits

The data is split into training, validation and test sets. The final split sizes are as follows:

Train	Valid	Test
62089	20696	20697
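As a quick arithmetic check, the split sizes in the table can be summed and expressed as fractions of the whole. The sizes are copied from the table; the variable names are illustrative only.

```python
# Split sizes copied from the table above.
split_sizes = {"train": 62089, "valid": 20696, "test": 20697}

total = sum(split_sizes.values())
fractions = {name: size / total for name, size in split_sizes.items()}

for name, size in split_sizes.items():
    print(f"{name:>5}: {size:6d} ({fractions[name]:.1%})")
print(f"total: {total}")
```

The total comes to 103,482, matching the sample count in the summary, and the proportions work out to roughly 60/20/20.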

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Various Swedish websites with product reviews.

Initial Data Collection and Normalization

Who are the source language producers?

Swedish

Annotations

[More Information Needed]

Annotation process

Labels were assigned automatically from the users' review ratings on a 1-5 scale: ratings of 1-2 are treated as negative and ratings of 4-5 as positive. Ratings of 3 are skipped, as they tend to be neutral.
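The labeling rule can be sketched as a small function. This is an illustration of the rule as described here, not the curator's actual code. The function name `rating_to_label` and the example reviews are made up.

```python
from typing import Optional


def rating_to_label(rating: int) -> Optional[int]:
    """Map a 1-5 star rating to 0 (negative), 1 (positive), or None (skipped)."""
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError(f"rating must be an integer from 1 to 5, got {rating!r}")
    if rating <= 2:
        return 0  # ratings 1-2: negative
    if rating >= 4:
        return 1  # ratings 4-5: positive
    return None  # rating 3: treated as neutral and dropped


# Hypothetical raw (text, rating) pairs, for illustration only.
raw_reviews = [("Usel kvalitet.", 1), ("Helt okej.", 3), ("Fantastisk!", 5)]

labeled = [
    {"text": text, "label": rating_to_label(rating)}
    for text, rating in raw_reviews
    if rating_to_label(rating) is not None
]
print(labeled)
```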

Who are the annotators?

The users of the products, who wrote the reviews and gave the ratings.

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

[More Information Needed]

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

[More Information Needed]

Dataset Curators

The corpus was scraped by @timpal0l.

Licensing Information

Research only.

Citation Information

No paper exists currently.

Contributions

Thanks to @timpal0l for adding this dataset.

Author:

Anonymous

Dataset size:

9.51 KB