This dataset contains 1.9M unique pairs of protein sequences and ligand SMILES with experimentally determined binding affinities. It can be used for fine-tuning a language model.
The data comes from the following sources: BindingDB, PDBbind, Binding MOAD, and BioLiP.
Load a train/validation split using

    from datasets import load_dataset
    train = load_dataset("jglaser/binding_affinity", split='train[:90%]')
    validation = load_dataset("jglaser/binding_affinity", split='train[90%:]')
Optionally, datasets with certain protein sequences removed are available. These can be used to test the predictive power for specific proteins even when these are not part of the training data.
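The idea behind such held-out sets can be sketched in plain Python. This is only an illustration, not the procedure used to build the published variants; the record layout, the field name `seq`, and the toy values are assumptions made for the example.

```python
from typing import Dict, Iterable, List, Set, Tuple

Record = Dict[str, object]


def split_by_protein(
    records: Iterable[Record],
    held_out: Set[str],
    seq_key: str = "seq",  # hypothetical column name for the protein sequence
) -> Tuple[List[Record], List[Record]]:
    """Send every record whose protein sequence is held out to the test set."""
    train: List[Record] = []
    test: List[Record] = []
    for rec in records:
        if rec[seq_key] in held_out:
            test.append(rec)
        else:
            train.append(rec)
    return train, test


# Toy records with made-up values, for illustration only.
records = [
    {"seq": "MKTAYIAK", "smiles": "CCO", "affinity": 5.2},
    {"seq": "MKTAYIAK", "smiles": "c1ccccc1", "affinity": 6.1},
    {"seq": "GSHMLEDP", "smiles": "CCO", "affinity": 4.8},
]
train_records, test_records = split_by_protein(records, {"MKTAYIAK"})
```

Because the split is made on the sequence rather than on individual rows, no held-out protein leaks into the training set, which is what makes the test meaningful for unseen proteins.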
Loading the data manually
The file data/all.parquet contains the preprocessed data. To extract it, you need to download and install git LFS support (https://git-lfs.github.com/).
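Without git LFS, a clone typically contains a small text pointer in place of the real file. The following sketch checks which of the two is on disk by looking at the first bytes; the helper names `classify_header` and `classify_file` are made up for this example.

```python
# A git LFS pointer file starts with this line; a Parquet file starts with "PAR1".
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"
PARQUET_MAGIC = b"PAR1"


def classify_header(head: bytes) -> str:
    """Classify a file from its leading bytes."""
    if head.startswith(LFS_POINTER_PREFIX):
        return "lfs-pointer"
    if head.startswith(PARQUET_MAGIC):
        return "parquet"
    return "unknown"


def classify_file(path: str) -> str:
    """Read the first bytes of `path` and classify them."""
    with open(path, "rb") as fh:
        return classify_header(fh.read(64))


# Usage (assumes the repository has been cloned into the current directory):
# print(classify_file("data/all.parquet"))
```

If the result is `lfs-pointer`, running `git lfs pull` in the repository after installing git LFS should fetch the actual data.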
To perform the preprocessing manually, download the data sets from each source as described in the steps that follow.
In bindingdb, download the database as tab-separated values (https://bindingdb.org > Download > BindingDB_All_2021m4.tsv.zip) and extract the zip archive into bindingdb/data.
Run the steps in bindingdb.ipynb
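The extraction step for BindingDB can also be scripted. This is a minimal sketch using only the standard library; the function name `extract_archive` is hypothetical, and the names of the files inside the archive are not assumed.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_archive(zip_path, dest_dir) -> List[Path]:
    """Extract every member of `zip_path` into `dest_dir` and return their paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create bindingdb/data if missing
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        names = archive.namelist()
    return [dest / name for name in names]


# Usage, assuming the zip file was downloaded into the bindingdb directory:
# extracted = extract_archive("bindingdb/BindingDB_All_2021m4.tsv.zip", "bindingdb/data")
# print(extracted)
```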
Register for an account at https://www.pdbbind.org.cn/, confirm the validation email, then log in and download
Extract those files in pdbbind/data
Run the script pdbbind.py in a compute job on an MPI-enabled cluster (e.g., mpirun -n 64 pdbbind.py).
Perform the steps in the notebook pdbbind.ipynb
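The MPI scripts are run with many ranks because the structure files can be processed independently. How pdbbind.py actually distributes its work is not shown here; the sketch below illustrates one common pattern (round-robin assignment by rank) and is an assumption, not a description of the script. The function name `chunk_for_rank` and the toy entry names are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk_for_rank(items: Sequence[T], rank: int, size: int) -> List[T]:
    """Return the items assigned to `rank` when `size` ranks share the work round-robin."""
    if not 0 <= rank < size:
        raise ValueError("rank must satisfy 0 <= rank < size")
    return list(items[rank::size])


# In a real MPI job, rank and size would come from the MPI runtime, e.g. with
# mpi4py (an assumption about the environment):
#   from mpi4py import MPI
#   comm = MPI.COMM_WORLD
#   rank, size = comm.Get_rank(), comm.Get_size()
# Here four ranks are simulated in a single process with placeholder entry names.
entries = [f"entry_{i}" for i in range(10)]
per_rank = [chunk_for_rank(entries, r, 4) for r in range(4)]
```

Each rank then processes only its own chunk, and the per-rank results are gathered afterwards, for example in the accompanying notebook.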
Go to https://bindingmoad.org and download the files every.csv (All of Binding MOAD, Binding Data) and the non-redundant biounits (nr_bind.zip). Place those files in binding_moad and extract them there.
Run the script moad.py in a compute job on an MPI-enabled cluster (e.g., mpirun -n 64 moad.py).
Perform the steps in the notebook moad.ipynb
Download from https://zhanglab.ccmb.med.umich.edu/BioLiP/ the files
The following steps are optional; they do not result in additional binding affinity data.
Download the script
Update the 2013 database to its current state
perl download_all-sets.pl
Run the script biolip.py in a compute job on an MPI-enabled cluster (e.g., mpirun -n 64 biolip.py).
Perform the steps in the notebook biolip.ipynb
Run the steps in the notebook combine_dbs.ipynb
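Combining the databases has to yield unique (protein sequence, ligand SMILES) pairs. The sketch below shows one way this could be done; it is not taken from combine_dbs.ipynb. Averaging the affinities of duplicate pairs and comparing SMILES as plain strings are both assumptions of this example, and the toy rows are made up.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# (protein sequence, ligand SMILES, affinity); the layout is an assumption.
Row = Tuple[str, str, float]


def combine_unique(sources: Iterable[Iterable[Row]]) -> List[Row]:
    """Merge rows from several sources, keeping one row per (sequence, SMILES) pair."""
    grouped: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for rows in sources:
        for seq, smiles, affinity in rows:
            grouped[(seq, smiles)].append(affinity)
    # Assumption: duplicate measurements of the same pair are averaged.
    return [(seq, smiles, mean(vals)) for (seq, smiles), vals in grouped.items()]


# Toy rows with made-up values, standing in for two of the source databases.
source_a = [("MKTAYIAK", "CCO", 6.0)]
source_b = [("MKTAYIAK", "CCO", 7.0), ("GSHMLEDP", "c1ccccc1", 5.0)]
combined = combine_unique([source_a, source_b])
```

In practice the SMILES strings would likely need to be canonicalized before comparison, since the same molecule can be written in several ways.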