Dataset: newsgroup
Tasks: text classification
Language: en
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: found
Annotations creators: found
Source datasets: original
License: unknown

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his paper "NewsWeeder: Learning to Filter Netnews", though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
The 18828 version does not include cross-posts and includes only the "From" and "Subject" headers.
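
Since only the "From" and "Subject" headers are kept, a document can be split into those two headers and its body with the standard library. The following is a minimal sketch, assuming each document is laid out as header lines, a blank line, then the body; the sample text and the `split_headers` helper are hypothetical and not taken from the dataset.

```python
from email import message_from_string

# Hypothetical document, invented for illustration; not a real dataset record.
SAMPLE = (
    "From: jane@example.com (Jane Doe)\n"
    "Subject: Re: rendering question\n"
    "\n"
    "Has anyone tried the new ray tracer?\n"
)


def split_headers(text):
    """Return (sender, subject, body).

    Assumes RFC 822-style header lines followed by a blank line and the body.
    """
    msg = message_from_string(text)
    return msg["From"], msg["Subject"], msg.get_payload().strip()


sender, subject, body = split_headers(SAMPLE)
print(sender)
print(subject)
print(body)
```

If a record does not follow this layout, `msg["From"]` and `msg["Subject"]` come back as `None`, so it is worth checking for that before relying on the parsed values.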
An example of 'train' looks as follows.
18828_comp.graphics
An example of 'train' looks as follows.

18828_comp.os.ms-windows.misc
An example of 'train' looks as follows.

18828_comp.sys.ibm.pc.hardware
An example of 'train' looks as follows.

18828_comp.sys.mac.hardware
An example of 'train' looks as follows.
The data fields are the same among all splits.

name | train |
---|---|
18828_alt.atheism | 799 |
18828_comp.graphics | 973 |
18828_comp.os.ms-windows.misc | 985 |
18828_comp.sys.ibm.pc.hardware | 982 |
18828_comp.sys.mac.hardware | 961 |
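
The split sizes in the table can be sanity-checked with a few lines of Python. The sketch below copies the five counts from the table and computes each configuration's share of the combined total. The `load_config` helper is an assumption: it presumes the Hugging Face `datasets` library is installed and that the dataset is reachable under the Hub identifier `newsgroup`, which may differ depending on the library version. It is defined but not called here.

```python
# Train-split sizes copied from the table above (five of the configurations).
TRAIN_SIZES = {
    "18828_alt.atheism": 799,
    "18828_comp.graphics": 973,
    "18828_comp.os.ms-windows.misc": 985,
    "18828_comp.sys.ibm.pc.hardware": 982,
    "18828_comp.sys.mac.hardware": 961,
}


def load_config(config_name):
    """Load the 'train' split of one configuration.

    Assumption: the `datasets` library is installed and the dataset is
    available under the Hub id "newsgroup".
    """
    from datasets import load_dataset  # lazy import; the rest runs without it

    return load_dataset("newsgroup", config_name, split="train")


# Combined size of the listed configurations and each one's share of it.
total = sum(TRAIN_SIZES.values())
shares = {name: count / total for name, count in TRAIN_SIZES.items()}

for name, share in shares.items():
    print(f"{name}: {TRAIN_SIZES[name]} documents ({share:.1%})")
print(f"total: {total}")
```

A call such as `load_config("18828_comp.graphics")` would then be expected to return a split whose length matches the corresponding table row.
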
@incollection{LANG1995331,
  title = {NewsWeeder: Learning to Filter Netnews},
  editor = {Armand Prieditis and Stuart Russell},
  booktitle = {Machine Learning Proceedings 1995},
  publisher = {Morgan Kaufmann},
  address = {San Francisco (CA)},
  pages = {331-339},
  year = {1995},
  isbn = {978-1-55860-377-6},
  doi = {https://doi.org/10.1016/B978-1-55860-377-6.50048-7},
  url = {https://www.sciencedirect.com/science/article/pii/B9781558603776500487},
  author = {Ken Lang},
}
Thanks to @mariamabarham, @thomwolf, @lhoestq for adding this dataset.