Dataset: dbrd
Language: nl
Multilinguality: monolingual
Size: 100K<n<1M
Language creators: found
Annotation creators: found
Source datasets: original
Preprint: arxiv:1910.00896
License: cc-by-nc-sa-4.0

The DBRD (pronounced dee-bird) dataset contains over 110k book reviews, of which 22k have associated binary sentiment polarity labels. It is intended as a benchmark for sentiment classification in Dutch and was created because annotated Dutch datasets suitable for this task were lacking.
Non-Dutch reviews were filtered out using langdetect, so all reviews should be in Dutch (nl). They are written by reviewers on Hebban, a Dutch website for book reviews.
The dataset contains three subsets: train, test, and unsupervised. The train and test sets contain labels, while the unsupervised set doesn't (the label value is -1 for each instance in unsupervised). Here's an example of a positive review, indicated with a label value of 1.
{ 'label': 1, 'text': 'Super om te lezen hoe haar leven is vergaan.\nBijzonder dat ze zo openhartig is geweest.' }
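As a minimal sketch of how the -1 label value can be used, the snippet below separates labeled from unlabeled instances. Apart from the first text, which is taken from the example above, the records are made-up placeholders, and `split_by_supervision` is a hypothetical helper, not part of any DBRD tooling.

```python
# Hypothetical records in the same {'label', 'text'} shape as the example above.
records = [
    {"label": 1, "text": "Super om te lezen hoe haar leven is vergaan."},
    {"label": 0, "text": "placeholder negative review"},
    {"label": -1, "text": "placeholder unlabeled review"},
]


def split_by_supervision(rows):
    """Separate rows with a polarity label (0 or 1) from unlabeled rows (-1)."""
    labeled = [r for r in rows if r["label"] in (0, 1)]
    unlabeled = [r for r in rows if r["label"] == -1]
    return labeled, unlabeled


labeled, unlabeled = split_by_supervision(records)
print(len(labeled), len(unlabeled))
```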
The train and test sets were constructed by extracting all non-neutral reviews because we want to assign either a positive or negative polarity label to each instance. Furthermore, the positive (pos) and negative (neg) labels were balanced in both train and test sets. The remainder was added to the unsupervised set.
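The balancing step described above could look roughly like the sketch below. This is an illustration under assumptions, not the authors' actual preprocessing: the function name `balance_and_split`, the random downsampling of the larger class, the `test_fraction` parameter, and the seed are all hypothetical choices.

```python
import random


def balance_and_split(rows, test_fraction=0.1, seed=0):
    """Build balanced train/test sets; the surplus goes to an unlabeled pool.

    Assumption: balancing is done by randomly downsampling the larger class.
    """
    rng = random.Random(seed)
    pos = [r for r in rows if r["label"] == 1]
    neg = [r for r in rows if r["label"] == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)

    n = min(len(pos), len(neg))        # per-class size after balancing
    n_test = int(n * test_fraction)    # per-class size of the test set

    test = pos[:n_test] + neg[:n_test]
    train = pos[n_test:n] + neg[n_test:n]
    # Surplus reviews lose their label (-1) and join the unsupervised pool.
    leftover = [{**r, "label": -1} for r in pos[n:] + neg[n:]]
    return train, test, leftover


# Toy data: 6 positive and 4 negative placeholder reviews.
rows = [{"label": 1, "text": f"pos {i}"} for i in range(6)]
rows += [{"label": 0, "text": f"neg {i}"} for i in range(4)]
train, test, leftover = balance_and_split(rows, test_fraction=0.25)
print(len(train), len(test), len(leftover))
```

Both returned labeled sets contain equally many positive and negative reviews by construction, which is the property the text describes.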
| | Train | Test | Unsupervised |
|---|---|---|---|
| No. of texts | 20028 | 2224 | 96264 |
| % of total | 16.9% | 1.9% | 81.2% |
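The percentages in the table follow directly from the subset sizes. A short check, using only the counts given above (the variable names are illustrative):

```python
# Subset sizes as listed in the table above.
sizes = {"train": 20028, "test": 2224, "unsupervised": 96264}

total = sum(sizes.values())
# Share of each subset as a percentage of all texts, rounded to one decimal.
shares = {name: round(100 * n / total, 1) for name, n in sizes.items()}

print(total)
print(shares)
```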
This dataset was created due to a lack of annotated Dutch text that is suitable for sentiment classification. Non-Dutch texts were therefore removed, but other than that, no curation was done.
The book reviews were taken from Hebban, a Dutch platform for book reviews.
Initial Data Collection and Normalization
The source code of the scraper and the preprocessing steps can be found in the DBRD GitHub repository.
Who are the source language producers?
The reviews are written by users of Hebban and are of varying quality. Some are short, others long, and many contain spelling mistakes and other errors.
Each book review was accompanied by a rating of 1 to 5 stars. The annotations are produced by mapping the user-provided ratings to either a positive or a negative label: 1- and 2-star ratings are given the negative label 0, and 4- and 5-star ratings the positive label 1. Reviews with a rating of 3 stars are considered neutral; they are left out of the train and test sets and added to the unsupervised set.
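The rating-to-label mapping described above can be written as a small function. This is a sketch rather than the code used to build the dataset; the name `rating_to_label` is hypothetical, and returning -1 for 3-star reviews is an assumption that follows the label value used in the unsupervised set.

```python
def rating_to_label(stars: int) -> int:
    """Map a 1-5 star rating to a sentiment label.

    1-2 stars -> 0 (negative), 4-5 stars -> 1 (positive),
    3 stars   -> -1 (neutral, treated as unlabeled).
    """
    if stars not in (1, 2, 3, 4, 5):
        raise ValueError(f"expected a rating from 1 to 5, got {stars!r}")
    if stars <= 2:
        return 0
    if stars >= 4:
        return 1
    return -1


labels = [rating_to_label(s) for s in (1, 2, 3, 4, 5)]
print(labels)
```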
Annotation process
Users of Hebban were unaware that their reviews would be used in the creation of this dataset.
Who are the annotators?
The annotators are the Hebban users who wrote the book reviews associated with the annotation. Anyone can register on Hebban and it's impossible to know the demographics of this group.
The book reviews and ratings are publicly available on Hebban and no personal or otherwise sensitive information is contained in this dataset.
While predicting the sentiment of book reviews is not that interesting in itself, the value of this dataset lies in its use for benchmarking models. The dataset contains challenges that are common to user-written text on the internet, such as spelling mistakes and other errors. It is therefore useful for validating how models perform on real-world text. Such datasets are abundant for English but harder to find for Dutch, which makes this one a valuable resource for ML tasks in that language.
Reviews on Hebban are usually written in Dutch, but some have been written in English and possibly in other languages. While we've done our best to filter out non-Dutch texts, it's hard to do this without errors. For example, some reviews mix multiple languages, and these might slip through. Also be aware that some texts contain commercial or promotional messages, which makes them different from ordinary reviews and may influence your models. This doesn't pose a major issue in most cases, but it is worth keeping in mind.
This dataset was created by Benjamin van der Burgh, who was working at Leiden Institute of Advanced Computer Science (LIACS) at the time.
The dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Please use the following citation when making use of this dataset in your work.
@article{DBLP:journals/corr/abs-1910-00896,
  author        = {Benjamin van der Burgh and Suzan Verberne},
  title         = {The merits of Universal Language Model Fine-tuning for Small Datasets - a case with Dutch book reviews},
  journal       = {CoRR},
  volume        = {abs/1910.00896},
  year          = {2019},
  url           = {http://arxiv.org/abs/1910.00896},
  archivePrefix = {arXiv},
  eprint        = {1910.00896},
  timestamp     = {Fri, 04 Oct 2019 12:28:06 +0200},
  biburl        = {https://dblp.org/rec/journals/corr/abs-1910-00896.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}
Thanks to @benjaminvdb for adding this dataset.