Dataset: arxiv_dataset
Language: en
Multilinguality: monolingual
Size category: 1M<n<10M
Language creators: expert-generated
Annotation creators: no-annotation
Source datasets: original
Preprint: arxiv:1905.00075
License: cc0-1.0

A dataset of 1.7 million arXiv articles for applications such as trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction, and semantic search interfaces.
[More Information Needed]
The language supported is English.
This dataset is a mirror of the original arXiv data. Because the full dataset is rather large (1.1 TB and growing), this dataset provides only a metadata file in JSON format. An example record is given below:
{'id': '0704.0002',
 'submitter': 'Louis Theran',
 'authors': 'Ileana Streinu and Louis Theran',
 'title': 'Sparsity-certifying Graph Decompositions',
 'comments': 'To appear in Graphs and Combinatorics',
 'journal-ref': None,
 'doi': None,
 'report-no': None,
 'categories': 'math.CO cs.CG',
 'license': 'http://arxiv.org/licenses/nonexclusive-distrib/1.0/',
 'abstract': ' We describe a new algorithm, the $(k,\\ell)$-pebble game with colors, and use\nit obtain a characterization of the family of $(k,\\ell)$-sparse graphs and\nalgorithmic solutions to a family of problems concerning tree decompositions of\ngraphs. Special instances of sparse graphs appear in rigidity theory and have\nreceived increased attention in recent years. In particular, our colored\npebbles generalize and strengthen the previous results of Lee and Streinu and\ngive a new proof of the Tutte-Nash-Williams characterization of arboricity. We\nalso present a new decomposition that certifies sparsity based on the\n$(k,\\ell)$-pebble game with colors. Our work also exposes connections between\npebble game algorithms and previous sparse graph algorithms by Gabow, Gabow and\nWestermann and Hendrickson.\n',
 'update_date': '2008-12-13'}
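Since the metadata file is newline-delimited JSON (one record per line), it can be streamed without loading everything into memory. The following is a minimal sketch, assuming the snapshot has already been downloaded locally; the file name `arxiv-metadata-oai-snapshot.json` is the name used by the Kaggle export and is an assumption here.

```python
import json
from collections import Counter

# Assumed local path to the downloaded metadata snapshot (name may differ).
METADATA_PATH = "arxiv-metadata-oai-snapshot.json"

def iter_articles(path):
    """Yield one article metadata record (a dict) per line of the JSON-lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Example: count primary categories without holding the whole file in memory.
primary_counts = Counter()
for record in iter_articles(METADATA_PATH):
    # 'categories' is a space-separated string, e.g. 'math.CO cs.CG';
    # the first entry is treated here as the primary category.
    primary_counts[record["categories"].split()[0]] += 1

print(primary_counts.most_common(10))
```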
The data was not split; the metadata file is provided as a single set.
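Because no official splits are provided, any train/validation/test partition has to be created by the user. Below is a minimal sketch using scikit-learn's `train_test_split`; the 90/5/5 ratio and the fixed seed are illustrative choices, not part of the dataset, and `iter_articles` refers to the helper sketched above.

```python
from sklearn.model_selection import train_test_split

# Illustrative split only: the ratios and the random seed are arbitrary choices.
records = list(iter_articles(METADATA_PATH))  # helper from the loading sketch above

train, rest = train_test_split(records, test_size=0.10, random_state=42)
validation, test = train_test_split(rest, test_size=0.50, random_state=42)

print(len(train), len(validation), len(test))
```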
For nearly 30 years, arXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus offers significant, and sometimes overwhelming, depth. In these times of unique global challenges, efficient extraction of insights from data is essential. To help make arXiv more accessible, a free, open pipeline to the machine-readable arXiv dataset is provided on Kaggle: a repository of 1.7 million articles with relevant features such as article titles, authors, categories, abstracts, full-text PDFs, and more. The goal is to empower new use cases and enable richer machine learning techniques that combine multi-modal features for applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction, and semantic search interfaces.
This data is based on arXiv papers. [More Information Needed]
Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]
This dataset contains no annotations.
Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
The original data is maintained by arXiv.
The data is released under the Creative Commons CC0 1.0 Universal Public Domain Dedication.
@misc{clement2019arxiv,
  title={On the Use of ArXiv as a Dataset},
  author={Colin B. Clement and Matthew Bierbaum and Kevin P. O'Keeffe and Alexander A. Alemi},
  year={2019},
  eprint={1905.00075},
  archivePrefix={arXiv},
  primaryClass={cs.IR}
}
Thanks to @tanmoyio for adding this dataset.