Optimizing Drug Activity Using Docking-Informed Machine Learning

02 September 2025, Version 2
This content is an early or alternative research output and has not been peer-reviewed by Cambridge University Press at the time of posting.

Abstract

Bayesian optimization (BO) is widely used to optimize drug potency against a protein target by searching fixed libraries of molecules and selecting compounds guided by activities predicted with machine learning (ML). We find that pre-computing docking scores for the full library, and using the estimated binding energies and 3D descriptors from docking as features in the ML model, significantly accelerates BO relative to traditional 2D features such as molecular fingerprints. We also observe that a docking-based initialization scheme is often superior to the commonly used diversity-based or random initialization. Relative to a typical BO approach, using docking-based features together with docking-based initialization required on average 24% (up to 77%) fewer data points to find the most active compound and gave on average a 32% (up to 159%) improvement in enrichment factors. We applied our method to 14 ChEMBL data sets and to 4 more challenging LIT-PCBA data sets with low hit rates and high molecular diversity. Our approach combines the generality of structure-based virtual screening (SBVS) with the inference power of ML ligand-based virtual screening (LBVS) to offer a more data-efficient hybrid approach.
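
As a rough illustration of two ideas named in the abstract, the sketch below picks an initial batch by docking score and computes an enrichment factor for that batch. It is not the authors' implementation. The function names (`docking_init`, `random_init`, `enrichment_factor`), the toy scores, and the activity labels are hypothetical. The sketch assumes that lower docking scores are more favorable and uses the common definition of enrichment factor (hit rate in the selected set divided by hit rate in the full library), which may differ from the definition used in the paper.

```python
import random


def docking_init(docking_scores, batch_size):
    """Return indices of the batch_size compounds with the lowest docking scores.

    Assumption: lower (more negative) scores mean stronger predicted binding.
    """
    ranked = sorted(range(len(docking_scores)), key=lambda i: docking_scores[i])
    return ranked[:batch_size]


def random_init(n_library, batch_size, seed=0):
    """Baseline: return batch_size distinct indices drawn uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(range(n_library), batch_size)


def enrichment_factor(selected, is_active):
    """Hit rate among the selected compounds divided by the library hit rate."""
    total_hits = sum(is_active)
    if not selected or total_hits == 0:
        raise ValueError("need a non-empty selection and at least one active")
    selected_hits = sum(is_active[i] for i in selected)
    return (selected_hits / len(selected)) / (total_hits / len(is_active))


# Toy library of 10 compounds (invented values, for illustration only).
docking_scores = [-9.1, -5.0, -8.7, -4.2, -6.3, -9.5, -3.8, -7.9, -5.5, -4.9]
is_active = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]

init = docking_init(docking_scores, batch_size=4)
ef = enrichment_factor(init, is_active)
print("initial batch:", init)
print("enrichment factor:", round(ef, 2))
```

In this toy library the docking-ranked batch contains three of the three actives in four picks, so its hit rate is 0.75 against a library hit rate of 0.3. In a full BO loop, a batch like this would seed the ML model, which would then propose later batches.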

Keywords

Machine Learning
Virtual Screening
Molecular Docking
Cheminformatics
Bayesian Optimization
Active Learning

Supplementary materials

Supplementary Information: Optimizing Drug Activity Using Docking-Informed Machine Learning
This supplementary information document contains information on methods for data collection and pre-processing, details of software used, methods for molecular docking and machine learning, statistical tests, and results of additional experiments (including ablation studies and studies of different batch sizes and different experimental budgets). It also includes figures depicting the correlation between docking scores and experimental activities, docked ligand poses, plots of projected descriptors for each data set, bar charts of molecular similarity scores, optimization trajectories and box-plots of results for different search/optimization algorithms.
