Dataset: gnad10
Task: text classification
Subtask: topic-classification
Language: de
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: found
Annotation creators: crowdsourced
License: cc-by-nc-sa-4.0

The 10k German News Article Dataset (10kGNAD) consists of 10273 German-language news articles from the online Austrian newspaper website DER Standard. Each news article has been classified into one of 9 categories by professional forum moderators employed by the newspaper. The dataset is extended from the original One Million Posts Corpus. It was created to support topic classification in German, because a classifier that is effective on an English dataset may not be as effective on a German dataset, given German's richer inflection and longer compound words. The dataset can also serve as a benchmark for German topic classification.
This dataset can be used to train a model, such as BERT, for topic classification on German news articles. There are 9 possible categories.
The text is in German and comes from an online Austrian newspaper website. The BCP-47 code for German is de-DE.
An example data instance contains a German news article (title and article text are concatenated) and its corresponding topic category.
{'text': 'Die Gewerkschaft GPA-djp lanciert den "All-in-Rechner" und findet, dass die Vertragsform auf die Führungsebene beschränkt gehört. Wien – Die Gewerkschaft GPA-djp sieht Handlungsbedarf bei sogenannten All-in-Verträgen.', 'label': 'Wirtschaft'}
The data is split into a training set consisting of 9245 articles and a test set consisting of 1028 articles.
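As a quick sanity check on the split, the short sketch below recomputes the total and the train/test proportions from the two sizes stated above. The constant names are illustrative only and are not part of the dataset.

```python
# Split sizes as stated in this card.
TRAIN_SIZE = 9245
TEST_SIZE = 1028

# The two splits should add up to the 10273 articles in the dataset.
total = TRAIN_SIZE + TEST_SIZE

# Share of each split, as a fraction of all articles.
train_fraction = TRAIN_SIZE / total
test_fraction = TEST_SIZE / total

print(f"total={total}, train={train_fraction:.1%}, test={test_fraction:.1%}")
```

The split is therefore roughly 90/10, and the total matches the article count given in the summary.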
The dataset was created to support topic classification in the German language. English text classification datasets are common (for example, AG News and 20 Newsgroups), but German datasets are less common. A classifier trained on an English dataset may not work as well on German text because of grammatical differences. Thus there is a need for a German dataset for effectively assessing model performance.
The 10k German News Article Dataset is extended from the One Million Posts Corpus. 10273 German news articles were collected from this larger corpus. In the One Million Posts Corpus, each article has a topic path like Newsroom/Wirtschaft/Wirtschaftpolitik/Finanzmaerkte/Griechenlandkrise. The 10kGNAD uses the second part of the topic path as the topic label. Article title and text are concatenated into one text, and author names are removed so that a classifier cannot simply key on authors who write frequently on a particular topic.
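The label derivation and concatenation described above can be sketched as follows. This is a minimal illustration, not the curator's actual preprocessing code: the function names, the single-space separator between title and body, and the placeholder title and body strings are all assumptions, and author-name removal is left out because the card does not say how it was done.

```python
def label_from_topic_path(topic_path: str) -> str:
    """Return the second component of a '/'-separated topic path."""
    parts = topic_path.split("/")
    if len(parts) < 2:
        raise ValueError(f"topic path has no second component: {topic_path!r}")
    return parts[1]


def build_text(title: str, body: str) -> str:
    """Concatenate title and body (separator is an assumption: one space)."""
    return f"{title} {body}".strip()


def make_instance(title: str, body: str, topic_path: str) -> dict:
    """Build one {'text', 'label'} record in the shape shown in this card."""
    return {
        "text": build_text(title, body),
        "label": label_from_topic_path(topic_path),
    }


# Topic path taken from the example in the text above.
path = "Newsroom/Wirtschaft/Wirtschaftpolitik/Finanzmaerkte/Griechenlandkrise"
# Placeholder title and body, used only to show the record shape.
instance = make_instance("Titel", "Artikeltext", path)
print(instance)
```

For the example path, the derived label is `Wirtschaft`, which matches the label in the sample instance.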
Who are the source language producers?
The language producers are the authors of the Austrian newspaper website DER Standard.
Who are the annotators?
[More Information Needed]
This dataset was curated by Timo Block.
This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license.
Please consider citing the authors of the "One Million Posts Corpus" if you use the dataset:
@InProceedings{Schabus2017,
  Author    = {Dietmar Schabus and Marcin Skowron and Martin Trapp},
  Title     = {One Million Posts: A Data Set of German Online Discussions},
  Booktitle = {Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)},
  Pages     = {1241--1244},
  Year      = {2017},
  Address   = {Tokyo, Japan},
  Doi       = {10.1145/3077136.3080711},
  Month     = aug
}
Thanks to @stevhliu for adding this dataset.