Datasets: dbpedia_14
Tasks: text-classification
Sub-tasks: topic-classification
Languages: en
Multilinguality: monolingual
Size: 100K<n<1M
Language creators: crowdsourced
Annotations creators: machine-generated
Source datasets: original
License: cc-by-sa-3.0
The DBpedia ontology classification dataset is constructed by picking 14 non-overlapping classes from DBpedia 2014. They are listed in classes.txt. From each of these 14 ontology classes, we randomly choose 40,000 training samples and 5,000 testing samples. The training set therefore contains 560,000 samples in total and the test set 70,000. The dataset has 3 columns (the same for the train and test splits): class index (1 to 14), title, and content. The title and content are enclosed in double quotes ("), and any internal double quote is escaped by doubling it (""). There are no newlines in the title or content.
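The column layout and quoting rules described above match what Python's standard `csv` module handles by default. The following is a minimal sketch under that assumption; the two sample rows and the helper name `parse_rows` are hypothetical and are not taken from the dataset files.

```python
import csv
import io

# Hypothetical sample in the described format: class index, title, content.
# Title and content are wrapped in double quotes; an internal double quote
# is written as two double quotes.
SAMPLE = (
    '1,"Example Co","Example Co is a company known for its ""flagship"" product."\n'
    '14,"Example Book","Example Book is a novel."\n'
)


def parse_rows(text):
    """Parse CSV text into (class_index, title, content) tuples."""
    reader = csv.reader(io.StringIO(text), quotechar='"', doublequote=True)
    return [(int(index), title, content) for index, title, content in reader]


rows = parse_rows(SAMPLE)
for class_index, title, content in rows:
    print(class_index, title, content)
```

Because titles and content contain no newlines, each physical line of a file in this format should correspond to exactly one record, so the same reader could be pointed at an open file object instead of the in-memory string.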
Although DBpedia is a multilingual knowledge base, the DBpedia14 extract contains mainly English data; other languages may appear (e.g., a film whose title is originally not English).
A typical data point comprises a title, the content, and the corresponding label.
An example from the DBpedia test set looks as follows:
{ 'title':'', 'content':" TY KU /taɪkuː/ is an American alcoholic beverage company that specializes in sake and other spirits. The privately-held company was founded in 2004 and is headquartered in New York City New York. While based in New York TY KU's beverages are made in Japan through a joint venture with two sake breweries. Since 2011 TY KU's growth has extended its products into all 50 states.", 'label':0 }
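The example above carries `'label': 0`, whereas the raw files describe a class index from 1 to 14. A plausible reading is that the label is the class index shifted to start at zero. The sketch below encodes that reading as an assumption; the helper `row_to_record` and its sample input are hypothetical.

```python
NUM_CLASSES = 14


def row_to_record(class_index, title, content):
    """Build a record dict, assuming label = class index - 1 (0-based)."""
    if not 1 <= class_index <= NUM_CLASSES:
        raise ValueError(f"class index must be in 1..{NUM_CLASSES}, got {class_index}")
    return {"title": title, "content": content, "label": class_index - 1}


# Hypothetical input row, not taken from the dataset.
record = row_to_record(1, "Example Co", " Example Co is a company.")
print(record)
```

Validating the index range up front makes an off-by-one mistake fail loudly instead of silently producing a label outside 0 to 13.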
The data is split into a training and test set. For each of the 14 classes we have 40,000 training samples and 5,000 testing samples. Therefore, the total size of the training dataset is 560,000 and testing dataset 70,000.
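The split sizes follow directly from the per-class counts. The sketch below reproduces that arithmetic and adds a small balance check that could be run over a list of labels; the helper `is_balanced` is a hypothetical illustration, not part of any dataset tooling.

```python
from collections import Counter

NUM_CLASSES = 14
TRAIN_PER_CLASS = 40_000
TEST_PER_CLASS = 5_000

# Totals stated in the text: 14 classes times the per-class sample counts.
train_total = NUM_CLASSES * TRAIN_PER_CLASS
test_total = NUM_CLASSES * TEST_PER_CLASS


def is_balanced(labels, per_class, num_classes=NUM_CLASSES):
    """Return True if exactly num_classes labels occur, each per_class times."""
    counts = Counter(labels)
    return len(counts) == num_classes and all(
        count == per_class for count in counts.values()
    )


print(train_total, test_total)
```

Calling `is_balanced(train_labels, TRAIN_PER_CLASS)` on the label column of a loaded training split (where `train_labels` is whatever list of labels you have read) would confirm the stated class balance.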
The DBpedia ontology classification dataset was constructed by Xiang Zhang ( xiang.zhang@nyu.edu ) and is licensed under the terms of the Creative Commons Attribution-ShareAlike License and the GNU Free Documentation License. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).
The DBpedia ontology classification dataset is licensed under the terms of the Creative Commons Attribution-ShareAlike License and the GNU Free Documentation License.
Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).
Lehmann, Jens, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann et al. "DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia." Semantic Web 6, no. 2 (2015): 167-195.
Thanks to @hfawaz for adding this dataset.