Optimizing Drug Activity Using Docking-Informed Machine Learning

James Proudfoot; Toby Lewis-Atwell; Matthew Grayson

doi:10.26434/chemrxiv-2025-vg35z-v2

Theoretical and Computational Chemistry

02 September 2025, Version 2

Working Paper

This content is an early or alternative research output and has not been peer-reviewed by Cambridge University Press at the time of posting.

Abstract

Bayesian optimization (BO) has been widely used to optimize drug potency against a protein target by searching fixed libraries of molecules and selecting compounds guided by activities predicted by machine learning (ML). We find that pre-computing docking scores for the full library, and using estimated binding energies and 3D descriptors from docking as features in the ML model, significantly accelerates BO relative to traditional 2D features such as molecular fingerprints. Additionally, we observe that a docking-based initialization scheme is often superior to the commonly used diversity-based or random initialization. Relative to a typical BO approach, docking-based features combined with docking-based initialization required on average 24% (up to 77%) fewer data points to find the most active compound and gave on average a 32% (up to 159%) improvement in enrichment factors. We applied our method to 14 ChEMBL data sets and to 4 additional, highly challenging LIT-PCBA data sets with low hit rates and high molecular diversity. Our approach combines the generality of structure-based virtual screening (SBVS) with the inference power of ML ligand-based virtual screening (LBVS) to offer a more data-efficient hybrid approach.
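The workflow summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian-process surrogate, the upper-confidence-bound acquisition rule, the "lower docking score is better" convention, the enrichment-factor definition (hit rate among selected compounds divided by the library-wide hit rate for the top 10% most active compounds), and all function names and the synthetic library are assumptions made for illustration only.

```python
import numpy as np


def rbf_kernel(A, B, length_scale):
    """Squared-exponential kernel between the rows of A and the rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)


def gp_posterior(X_train, y_train, X_pool, length_scale, noise=1e-2):
    """Posterior mean and standard deviation of a GP fitted to standardized targets."""
    mu, sd = y_train.mean(), y_train.std() + 1e-9
    z = (y_train - mu) / sd
    K = rbf_kernel(X_train, X_train, length_scale) + noise * np.eye(len(X_train))
    Ks = rbf_kernel(X_pool, X_train, length_scale)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, z))
    v = np.linalg.solve(L, Ks.T)
    var = np.maximum(1.0 - (v**2).sum(0), 1e-12)
    return mu + sd * (Ks @ alpha), sd * np.sqrt(var)


def docking_init(docking_scores, n_init):
    """Start from the best-docked compounds (assumption: lower score = better)."""
    return np.argsort(docking_scores)[:n_init].tolist()


def bo_search(features, activity, docking_scores,
              n_init=5, batch=5, budget=25, kappa=1.0):
    """Pool-based BO over a fixed library; returns indices in selection order."""
    X = (features - features.mean(0)) / (features.std(0) + 1e-9)
    length_scale = np.sqrt(X.shape[1])  # heuristic choice, not tuned
    selected = docking_init(docking_scores, n_init)
    while len(selected) < budget:
        pool = np.setdiff1d(np.arange(len(X)), selected)
        mean, std = gp_posterior(X[selected], activity[selected],
                                 X[pool], length_scale)
        ucb = mean + kappa * std  # favour high predicted activity and uncertainty
        take = min(batch, budget - len(selected))
        selected.extend(pool[np.argsort(-ucb)[:take]].tolist())
    return selected


def enrichment_factor(selected, activity, top_fraction=0.1):
    """Hit rate among selected compounds divided by the library-wide hit rate."""
    n_top = max(1, int(round(top_fraction * len(activity))))
    top = set(np.argsort(-activity)[:n_top].tolist())
    hits = sum(1 for i in selected if i in top)
    return (hits / len(selected)) / (n_top / len(activity))


def make_toy_library(n=120, seed=0):
    """Synthetic stand-in for a library: docking score plus three descriptors."""
    rng = np.random.default_rng(seed)
    docking = rng.normal(size=n)
    descriptors = rng.normal(size=(n, 3))
    activity = -docking + 0.5 * descriptors[:, 0] + 0.1 * rng.normal(size=n)
    features = np.column_stack([docking, descriptors])
    return features, activity, docking


features, activity, docking = make_toy_library()
picked = bo_search(features, activity, docking)
print("enrichment factor:", enrichment_factor(picked, activity))
```

In this sketch the docking score plays both roles described in the abstract: it is a column of the feature matrix seen by the surrogate model, and it ranks the library to choose the initial batch. Replacing `docking_init` with a random draw gives a baseline against which the docking-based start can be compared on the same toy data.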

Keywords

Bayesian Optimization

Active Learning

Supplementary materials

Title: Supplementary Information: Optimizing Drug Activity Using Docking-Informed Machine Learning

Description: This supplementary information document contains information on methods for data collection and pre-processing, details of software used, methods for molecular docking and machine learning, statistical tests, and results of additional experiments (including ablation studies and studies of different batch sizes and different experimental budgets). It also includes figures depicting the correlation between docking scores and experimental activities, docked ligand poses, plots of projected descriptors for each data set, bar charts of molecular similarity scores, optimization trajectories and box-plots of results for different search/optimization algorithms.

Supplementary weblinks

Title: Repository associated with this work

Description: This repository contains code and data that were produced during this work. It includes a README file and YAML files that can be followed to reproduce the Python environments used in this work (on x86-64 Linux operating systems). Code is available as Python (.py) files and Jupyter notebooks (.ipynb) and includes scripts used to: collect and pre-process data, generate molecular descriptors, perform molecular docking, run machine learning experiments and process results. Data files (.csv) containing processed ligand activity values extracted from ChEMBL or LIT-PCBA, compounds represented as SMILES strings, docking scores and molecular descriptors are available. Results files (.csv) containing summary metrics and zero-based indices of compounds selected from the data sets by the different search algorithms are also available. Open-source software dependencies are included where needed in this repository.

Version History

Sep 02, 2025 Version 2

Aug 31, 2025 Version 1

Version Notes

Corrected funding body name to "UK Research and Innovation" from "Engineering Physical Sciences Research Council".

Metrics

Views: 1,129

Downloads: 507

License

The content is available under CC BY 4.0

Funding

UK Research and Innovation

EP/S023437/1

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
