Abstract
Bayesian optimization (BO) has been widely used to optimize drug potency towards a protein target by searching through fixed libraries of molecules and selecting compounds guided by machine learning (ML) predicted activities. We find that by pre-computing docking scores for the full library, and by using estimated binding energies and 3D descriptors from docking as features in the ML model, BO can be significantly accelerated relative to using traditional 2D features such as molecular fingerprints. Additionally, we observed that a docking-based initialization scheme is often superior to the commonly used diversity-based or random initialization. Using docking-based features and docking-based initialization required on average 24% (up to 77%) fewer data points to find the most active compound and gave on average 32% (up to 159%) improvement in enrichment factors, relative to a typical BO approach. We applied our method to 14 ChEMBL data sets and to 4 more challenging LIT-PCBA data sets, which have low hit rates and high molecular diversity. Our approach combines the generality of structure-based virtual screening (SBVS) with the inference power of ML ligand-based virtual screening (LBVS) to offer a more data-efficient hybrid approach.
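As an illustration of two ideas named in the abstract, the following minimal sketch shows one way a docking-based initial batch could be chosen and how an enrichment factor could be computed for it. The abstract does not specify either procedure, so everything here is an assumption: the function names `docking_init` and `enrichment_factor` are hypothetical, lower (more negative) docking scores are assumed to indicate stronger predicted binding, the enrichment factor follows the common definition (hit rate among selected compounds divided by hit rate in the whole library), and the toy scores and hit labels are invented for the example.

```python
def docking_init(docking_scores, n_init):
    """Return zero-based indices of the n_init best-scoring compounds.

    Assumption: a lower (more negative) docking score means stronger
    predicted binding, so the library is sorted in ascending order.
    """
    order = sorted(range(len(docking_scores)), key=lambda i: docking_scores[i])
    return order[:n_init]


def enrichment_factor(selected, is_hit):
    """Hit rate among the selected compounds divided by the library hit rate."""
    library_rate = sum(is_hit) / len(is_hit)
    if not selected or library_rate == 0:
        raise ValueError("need at least one selected compound and one hit in the library")
    selected_rate = sum(is_hit[i] for i in selected) / len(selected)
    return selected_rate / library_rate


# Toy library of 8 compounds (invented values, for illustration only).
docking_scores = [-9.1, -5.0, -8.7, -4.2, -7.9, -3.3, -6.5, -5.8]
is_hit = [True, False, True, False, False, False, True, False]

initial_batch = docking_init(docking_scores, n_init=2)
ef = enrichment_factor(initial_batch, is_hit)
print(initial_batch, round(ef, 3))
```

In a full BO run, the initial batch would be used to train the first ML model, and later batches would be chosen by an acquisition function. Those steps are omitted from this sketch.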
Supplementary materials
Title
Supplementary Information: Optimizing Drug Activity Using Docking-Informed Machine Learning
Description
This supplementary information document contains information on methods for data collection and pre-processing, details of software used, methods for molecular docking and machine learning, statistical tests, and results of additional experiments (including ablation studies and studies of different batch sizes and different experimental budgets). It also includes figures depicting the correlation between docking scores and experimental activities, docked ligand poses, plots of projected descriptors for each data set, bar charts of molecular similarity scores, optimization trajectories and box-plots of results for different search/optimization algorithms.
Supplementary weblinks
Title
Repository associated with this work
Description
This repository contains code and data that were produced during this work. It includes a README file and YAML files that can be followed to reproduce the Python environments used in this work (on x86-64 Linux operating systems). Code is available as Python (.py) files and Jupyter notebooks (.ipynb) and includes scripts used to: collect and pre-process data, generate molecular descriptors, perform molecular docking, run machine learning experiments and process results. Data files (.csv) containing processed ligand activity values extracted from ChEMBL or LIT-PCBA, compounds represented as SMILES strings, docking scores and molecular descriptors are available. Results files (.csv) containing summary metrics and zero-based indices of compounds selected from the data sets by the different search algorithms are also available. Open-source software dependencies are included where needed in this repository.