This dataset contains more than 16,000 unique pairs of protein sequences and ligand SMILES, and the coordinates of their complexes.
SMILES are assumed to be tokenized with the regex from P. Schwaller.
Every (x, y, z) ligand coordinate maps onto a SMILES token and is NaN if the token does not represent an atom.
Every receptor coordinate is the C-alpha coordinate of the corresponding residue in the protein sequence.
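
The token-to-coordinate alignment for ligands can be sketched as follows. This is a minimal illustration, not the code used to build the dataset: the pattern is the commonly cited form of the Schwaller SMILES regex, the helper names (tokenize_smiles, is_atom_token, align_coords) are hypothetical, the atom test is a simplifying assumption, and the coordinates are made-up numbers.

```python
import math
import re

# Commonly cited form of the SMILES tokenization regex attributed to
# P. Schwaller; assumed here, check it against the tokenizer you actually use.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into a list of tokens."""
    return SMILES_TOKEN_PATTERN.findall(smiles)


def is_atom_token(token):
    """Assumption: bracket expressions and purely alphabetic tokens are atoms."""
    return token.startswith("[") or token.isalpha()


def align_coords(tokens, atom_coords):
    """Return one (x, y, z) tuple per token, NaN for non-atom tokens."""
    n_atoms = sum(is_atom_token(t) for t in tokens)
    if n_atoms != len(atom_coords):
        raise ValueError("number of atom tokens and coordinates differ")
    nan_row = (math.nan, math.nan, math.nan)
    atoms = iter(atom_coords)
    return [next(atoms) if is_atom_token(t) else nan_row for t in tokens]


# Acetic acid with invented coordinates, one per atom in SMILES order.
tokens = tokenize_smiles("CC(=O)O")
coords = align_coords(
    tokens,
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.2, 1.2, 0.0), (2.2, -1.2, 0.0)],
)
for token, xyz in zip(tokens, coords):
    print(token, xyz)
```

Keeping one row per token, rather than one row per atom, means the coordinate array has the same length as the token sequence, so a position in one indexes directly into the other.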
The dataset can be used to fine-tune a language model. All data comes from PDBbind-cn.
Load a train/validation split using:

    from datasets import load_dataset
    train = load_dataset("jglaser/pdbbind_complexes", split='train[:90%]')
    validation = load_dataset("jglaser/pdbbind_complexes", split='train[90%:]')
To perform the preprocessing manually, first obtain the data sets from PDBbind-cn.
Register for an account at https://www.pdbbind.org.cn/, confirm the validation email, then log in and download them.
Extract those files into pdbbind/data.
Run the script pdbbind.py in a compute job on an MPI-enabled cluster (e.g., mpirun -n 64 pdbbind.py).
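
The internals of pdbbind.py are not reproduced here. As a general illustration of why such a job benefits from many MPI ranks, the sketch below shows one common way to divide a list of complexes among ranks. The function name shard_for_rank and the round-robin scheme are assumptions, not a description of what the script does, and the identifiers are placeholders.

```python
def shard_for_rank(items, rank, size):
    """Return the items that one rank would process under round-robin sharding.

    In an MPI program, rank and size would typically come from the
    communicator (for example mpi4py's COMM_WORLD.Get_rank() and
    COMM_WORLD.Get_size()); here they are plain integers.
    """
    if not 0 <= rank < size:
        raise ValueError("rank must satisfy 0 <= rank < size")
    return items[rank::size]


# Ten placeholder complex identifiers split across four ranks.
complex_ids = [f"complex_{i}" for i in range(10)]
for rank in range(4):
    print(rank, shard_for_rank(complex_ids, rank, 4))
```

Every item is assigned to exactly one rank, so the ranks can work independently and their results can be gathered at the end.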