Dataset:

CShorten/ML-ArXiv-Papers

License:

afl-3.0

This dataset contains the subset of ArXiv papers tagged "cs.LG", the tag indicating that a paper is about Machine Learning.

The core dataset is filtered from the full ArXiv dataset hosted on Kaggle: https://www.kaggle.com/datasets/Cornell-University/arxiv . The original dataset contains roughly 2 million papers. This dataset contains roughly 100,000 papers following the category filtering.

The dataset is maintained with requests to the ArXiv API.

The current iteration of the dataset only contains the title and abstract of the paper.
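The category filtering described above can be sketched roughly as follows. This is a minimal illustration, not the maintainers' actual pipeline: it assumes the Kaggle snapshot is a JSON Lines file in which each record has a space-separated `categories` string plus `title` and `abstract` fields, and the function name `filter_ml_papers` is hypothetical.

```python
import json
from typing import Iterable, Iterator

# Assumption: the tag used for filtering, as stated in the dataset description.
TARGET_CATEGORY = "cs.LG"


def filter_ml_papers(
    lines: Iterable[str], category: str = TARGET_CATEGORY
) -> Iterator[dict]:
    """Yield {"title", "abstract"} for records carrying the given category.

    Assumes each input line is one JSON object and that "categories" is a
    space-separated string such as "cs.LG stat.ML".
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Split on whitespace so that "cs.LG" only matches as a whole tag.
        if category in record.get("categories", "").split():
            yield {"title": record["title"], "abstract": record["abstract"]}
```

Because the function accepts any iterable of lines, an open file handle on the snapshot can be passed to it directly, so the roughly 2 million records never need to be held in memory at once.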

The ArXiv dataset contains additional features that we may look to include in future releases. The top two features on the roadmap for integration, authors and update_date, are listed first:

  • authors
  • update_date
  • Submitter
  • Comments
  • Journal-ref
  • doi
  • report-no
  • categories
  • license
  • versions
  • authors_parsed