Dataset:
leey4n/KR3
Korean sentiment classification dataset
Label 0 stands for a negative review, 1 for a positive review, and 2 for an ambiguous review. Note that reviews with rating 2 are not intended to be used directly for supervised learning (classification); they are included for additional pre-training or other uses. In other words, this dataset is basically a binary sentiment classification task with labels 0 and 1.
The code for crawling and preprocessing the dataset, along with the experiments on KR3, is in the GitHub repo. The dataset is also available as a Kaggle dataset.
from datasets import load_dataset

kr3 = load_dataset("leey4n/KR3", name='kr3', split='train')
# The original file didn't include this column. Suspect it's a Hugging Face issue.
kr3 = kr3.remove_columns(['__index_level_0__'])

# Drop reviews with the ambiguous label.
kr3_binary = kr3.filter(lambda example: example['Rating'] != 2)
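Since the rating-2 reviews are meant for pre-training rather than classification, it can be handy to keep them in a separate pool instead of discarding them. The following is a minimal sketch of that idea, not code from the KR3 repo. It works on any iterable of dict-like rows, and uses made-up toy rows so it runs without downloading anything. The helper names `split_by_rating` and `label_counts`, and the `Review` column name in the toy rows, are assumptions for illustration; only the `Rating` column and the 0/1/2 meanings come from the description above.

```python
from collections import Counter

# Label meanings as described in the dataset card.
LABEL_NAMES = {0: "negative", 1: "positive", 2: "ambiguous"}


def split_by_rating(rows, rating_key="Rating"):
    """Separate binary-labelled rows (0 or 1) from ambiguous rows (2).

    Hypothetical helper: `rows` is any iterable of dict-like examples.
    """
    binary, ambiguous = [], []
    for row in rows:
        if row[rating_key] == 2:
            ambiguous.append(row)
        else:
            binary.append(row)
    return binary, ambiguous


def label_counts(rows, rating_key="Rating"):
    """Count examples per label name (hypothetical helper)."""
    return Counter(LABEL_NAMES[row[rating_key]] for row in rows)


# Toy rows standing in for KR3 examples. The text is placeholder data and
# the "Review" column name is an assumption, not taken from the card.
toy_rows = [
    {"Rating": 1, "Review": "placeholder positive review"},
    {"Rating": 0, "Review": "placeholder negative review"},
    {"Rating": 2, "Review": "placeholder ambiguous review"},
    {"Rating": 1, "Review": "another placeholder positive review"},
]

binary_rows, ambiguous_rows = split_by_rating(toy_rows)
counts = label_counts(toy_rows)
print(counts)
```

Here `binary_rows` would feed a 0/1 classifier, while `ambiguous_rows` could be added to an unlabeled pre-training corpus.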
CC BY-NC-SA 4.0
We concluded that the non-commercial use and release of KR3 fall within the scope of fair use (공정 이용) as stated in the Korean Copyright Act (저작권법). We further clarify that we did not agree to the terms of service of any website that might prohibit web crawling; in other words, all web crawling was carried out without logging in to the websites. Nevertheless, feel free to contact any of the contributors if you notice any legal issues.
(Alphabetical order)
This work was done as part of the 4th cohort (4기) of DIYA. Compute resources needed for the work were provided by DIYA and surromind.ai.