Dataset: newsgroup
Tasks:
Languages:
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: found
Annotation creators: found
Source datasets: original
License:
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his paper "NewsWeeder: Learning to Filter Netnews", though he does not explicitly mention this collection. The 20 Newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
The 18828 version, which the configurations listed here are named after, does not include cross-posts and includes only the "From" and "Subject" headers.
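To make the header restriction concrete, here is a minimal sketch of reducing a raw post to its "From" and "Subject" headers plus the body. It is not the preprocessing that produced this dataset; the sample post, the addresses, and the helper name `keep_from_and_subject` are all hypothetical, and only Python's standard `email` module is used.

```python
from email import message_from_string

# Hypothetical raw Usenet post; every header value here is made up.
RAW_POST = """\
From: alice@example.com (Alice)
Subject: Re: Ray tracing question
Newsgroups: comp.graphics
Organization: Example University
Lines: 2

Has anyone tried this renderer?
Thanks in advance.
"""

# The only headers we keep, in output order.
KEPT_HEADERS = ("From", "Subject")


def keep_from_and_subject(raw: str) -> str:
    """Return the post with every header except From and Subject dropped."""
    msg = message_from_string(raw)
    header_lines = [
        f"{name}: {msg[name]}" for name in KEPT_HEADERS if msg[name] is not None
    ]
    body = msg.get_payload()  # plain string for a non-multipart message
    return "\n".join(header_lines) + "\n\n" + body


if __name__ == "__main__":
    print(keep_from_and_subject(RAW_POST))
```

The `Newsgroups`, `Organization`, and `Lines` headers disappear from the result, while the body text is left untouched.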
An example of 'train' looks as follows.
18828_comp.graphics
An example of 'train' looks as follows.

18828_comp.os.ms-windows.misc
An example of 'train' looks as follows.

18828_comp.sys.ibm.pc.hardware
An example of 'train' looks as follows.

18828_comp.sys.mac.hardware
An example of 'train' looks as follows.
The data fields are the same among all splits.
| name | train |
|---|---|
| 18828_alt.atheism | 799 |
| 18828_comp.graphics | 973 |
| 18828_comp.os.ms-windows.misc | 985 |
| 18828_comp.sys.ibm.pc.hardware | 982 |
| 18828_comp.sys.mac.hardware | 961 |
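The configuration names in the table can be handled programmatically. The sketch below copies the five train sizes listed above, splits each name at the first underscore, and totals the examples. Reading the `18828` prefix as a version tag and the remainder as the newsgroup name is an assumption based on the naming pattern, and `split_config_name` is a hypothetical helper, not part of any library.

```python
# Train split sizes copied from the table above (only the five listed configs).
TRAIN_SIZES = {
    "18828_alt.atheism": 799,
    "18828_comp.graphics": 973,
    "18828_comp.os.ms-windows.misc": 985,
    "18828_comp.sys.ibm.pc.hardware": 982,
    "18828_comp.sys.mac.hardware": 961,
}


def split_config_name(config: str) -> tuple[str, str]:
    """Split a config name at the first underscore.

    Assumption: the part before the underscore is a version tag and the
    part after it is the newsgroup name.
    """
    version, _, newsgroup = config.partition("_")
    return version, newsgroup


# Total number of train examples across the listed configs.
total = sum(TRAIN_SIZES.values())

if __name__ == "__main__":
    for config, size in TRAIN_SIZES.items():
        version, newsgroup = split_config_name(config)
        print(f"{newsgroup:28s} version={version} train={size}")
    print("total:", total)
```

The total covers only the five configurations shown in the table, not the full collection of 20 newsgroups.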
@incollection{LANG1995331,
title = {NewsWeeder: Learning to Filter Netnews},
editor = {Armand Prieditis and Stuart Russell},
booktitle = {Machine Learning Proceedings 1995},
publisher = {Morgan Kaufmann},
address = {San Francisco (CA)},
pages = {331-339},
year = {1995},
isbn = {978-1-55860-377-6},
doi = {https://doi.org/10.1016/B978-1-55860-377-6.50048-7},
url = {https://www.sciencedirect.com/science/article/pii/B9781558603776500487},
author = {Ken Lang},
}
Thanks to @mariamabarham, @thomwolf, and @lhoestq for adding this dataset.