We introduce a dataset with a pure chemistry focus, built by compiling a list of academic chemistry journals that publish Open-Access articles. For each journal, we downloaded full-text article PDFs from its Open-Access portion, either through available APIs or by scraping the content with the Selenium Chrome WebDriver. Each PDF was processed with Grobid via a locally installed client to extract free-text paragraphs together with their sections.
The table below shows the journals from which Open Access articles were sourced, as well as the number of papers processed.
For sources that also contain papers from other disciplines (e.g. PubMed), we kept only papers whose provided topic was Chemistry.
Source | # of Articles |
---|---|
Beilstein | 1,829 |
Chem Cell | 546 |
ChemRxiv | 12,231 |
Chemistry Open | 398 |
Nature Communications Chemistry | 572 |
PubMed Author Manuscript | 57,680 |
PubMed Open Access | 29,540 |
Royal Society of Chemistry (RSC) | 9,334 |
Scientific Reports - Nature | 6,826 |
The text in the dataset is in English.
Column | Description |
---|---|
uuid | Unique Identifier for the Example |
title | Title of the Article |
article_source | Open-Access source journal (see the source table above) |
abstract | Abstract (reference summary) |
sections | Full-text sections from the main body of paper (<!> indicates section boundaries) |
headers | Corresponding section headers for sections field (<!> delimited) |
source_toks | Aggregate number of tokens across sections |
target_toks | Number of tokens in the abstract |
compression | Ratio of source_toks to target_toks |
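
As a minimal sketch of how the delimited fields might be unpacked, the snippet below splits `sections` and `headers` on the `<!>` marker, pairs them up, and recomputes the compression ratio. The function names are hypothetical, and the assumption that both fields always split into the same number of parts is ours rather than something the dataset guarantees.

```python
from typing import List, Tuple

DELIMITER = "<!>"  # section boundary marker described in the column table


def split_delimited(field: str) -> List[str]:
    """Split a <!>-delimited field and trim surrounding whitespace."""
    return [part.strip() for part in field.split(DELIMITER)]


def pair_headers_with_sections(headers: str, sections: str) -> List[Tuple[str, str]]:
    """Align each header with its section text (assumes equal counts)."""
    header_list = split_delimited(headers)
    section_list = split_delimited(sections)
    if len(header_list) != len(section_list):
        raise ValueError("headers and sections have different lengths")
    return list(zip(header_list, section_list))


def compression_ratio(source_toks: int, target_toks: int) -> float:
    """Ratio of source_toks to target_toks, as in the compression column."""
    if target_toks <= 0:
        raise ValueError("target_toks must be positive")
    return source_toks / target_toks
```

For example, `pair_headers_with_sections("Introduction<!>Methods", "Text A.<!>Text B.")` yields one `(header, section)` tuple per section.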
Please refer to load_chemistry() in https://github.com/griff4692/calibrating-summaries/blob/master/preprocess/preprocess.py for pre-processing the data as a summarization dataset. The inputs are the sections and headers fields, and the target is the abstract field.
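
The input/target framing above could look roughly like the following. This is an illustrative sketch only, not the code in `load_chemistry()`: the function name `build_summarization_example`, the output keys `source` and `target`, and the choice to place each header on its own line above its section are all assumptions.

```python
from typing import Dict


def build_summarization_example(record: Dict[str, str], delimiter: str = "<!>") -> Dict[str, str]:
    """Turn one dataset row into a source/target pair (hypothetical layout)."""
    headers = record["headers"].split(delimiter)
    sections = record["sections"].split(delimiter)

    # Interleave each header with its section body; zip() silently stops at
    # the shorter list, so mismatched rows lose their trailing parts.
    parts = []
    for header, body in zip(headers, sections):
        parts.append(f"{header.strip()}\n{body.strip()}")

    return {
        "source": "\n\n".join(parts),         # model input
        "target": record["abstract"].strip(),  # reference summary
    }
```

A row whose headers are `Introduction<!>Results` would produce a source with two header-plus-body blocks separated by a blank line, with the abstract as the target.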
Split | Count |
---|---|
train | 115,956 |
validation | 1,000 |
test | 2,000 |
@article{adams2023desired,
  title={What are the Desired Characteristics of Calibration Sets? Identifying Correlates on Long Form Scientific Summarization},
  author={Adams, Griffin and Nguyen, Bichlien H and Smith, Jake and Xia, Yingce and Xie, Shufang and Ostropolets, Anna and Deb, Budhaditya and Chen, Yuan-Jyue and Naumann, Tristan and Elhadad, No{\'e}mie},
  journal={arXiv preprint arXiv:2305.07615},
  year={2023}
}