ZINC is a publicly available database that aggregates commercially available and annotated compounds. ZINC provides downloadable 2D and 3D versions as well as a website that enables rapid molecule lookup and analog search. ZINC has grown from fewer than 1 million compounds in 2005 to nearly 2 billion now. This dataset includes ~1B molecules in total. We have filtered out any compounds that could not be converted from SMILES to SELFIES representations.
The dataset is split 80/10/10 into train/valid/test sets. The split is random and made at the file level, so the number of molecules in each split only roughly matches these percentages.
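A file-level random split like the one described above can be sketched as follows. This is a minimal illustration, not the script used to build this dataset: the function name `split_files`, the seed, and the file names are all hypothetical.

```python
import random


def split_files(paths, seed=0, fractions=(0.8, 0.1, 0.1)):
    """Randomly assign whole files to train/valid/test (hypothetical helper)."""
    shuffled = sorted(paths)               # sort first so the result depends only on the seed
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    n_train = round(len(shuffled) * fractions[0])
    n_valid = round(len(shuffled) * fractions[1])
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],  # the remainder goes to test
    }


# Hypothetical file names, for illustration only.
files = [f"zinc_part_{i:03d}.csv" for i in range(20)]
splits = split_files(files)
```

Because whole files are assigned to a split, the molecule-level proportions depend on how many molecules each file holds, which is why they only approximate 80/10/10.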
Initial data was released at https://zinc20.docking.org/. We downloaded it, added a selfies field, and filtered out all molecules that could not be converted to SELFIES representations.
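The add-a-field-and-filter step can be sketched as below. This is an illustration under assumptions, not the actual processing code: the helper `add_selfies_field`, the `smiles` key, and the stand-in `toy_encode` are hypothetical, and the converter is passed in as a function so the sketch runs without third-party packages.

```python
def add_selfies_field(records, encode):
    """Keep only records whose SMILES string converts, adding a 'selfies' key.

    `encode` is any callable mapping a SMILES string to a SELFIES string and
    raising an exception when the conversion fails.
    """
    kept = []
    for record in records:
        try:
            encoded = encode(record["smiles"])  # 'smiles' key is an assumed field name
        except Exception:
            continue  # drop molecules that cannot be converted
        kept.append({**record, "selfies": encoded})
    return kept


def toy_encode(smiles):
    """Stand-in converter for the demo only; NOT a real SELFIES encoder."""
    if "*" in smiles:
        raise ValueError("unsupported token")
    return "".join(f"[{ch}]" for ch in smiles)


rows = [{"smiles": "CCO"}, {"smiles": "C*"}]
filtered = add_selfies_field(rows, toy_encode)
```

With the `selfies` package installed, its `selfies.encoder` function could presumably be passed as `encode` in place of the toy converter. Whether this dataset was built that way is an assumption.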
@article{Irwin2020,
  doi = {10.1021/acs.jcim.0c00675},
  url = {https://doi.org/10.1021/acs.jcim.0c00675},
  year = {2020},
  month = oct,
  publisher = {American Chemical Society ({ACS})},
  volume = {60},
  number = {12},
  pages = {6065--6073},
  author = {John J. Irwin and Khanh G. Tang and Jennifer Young and Chinzorig Dandarchuluun and Benjamin R. Wong and Munkhzul Khurelbaatar and Yurii S. Moroz and John Mayfield and Roger A. Sayle},
  title = {{ZINC}20{\textemdash}A Free Ultralarge-Scale Chemical Database for Ligand Discovery},
  journal = {Journal of Chemical Information and Modeling}
}
This dataset was curated and added by @zanussbaum.