Causal Descriptors in QSAR: Deconfounding High-Dimensional Molecular Features via Double Machine Learning and Hypothesis Testing

Yingkai Liu

doi:10.26434/chemrxiv-2025-nc74b

Biological and Medicinal Chemistry

30 October 2025, Version 1

Working Paper

This content is an early or alternative research output and has not been peer-reviewed by Cambridge University Press at the time of posting.

Abstract

Quantitative Structure-Activity Relationship (QSAR) modeling is a pillar of computational drug discovery. However, standard machine learning (ML) models are often confounded by the high-dimensional, strongly correlated nature of molecular descriptors. A model may identify a "bulk" property (e.g., molecular weight) as highly predictive when it is in fact merely a proxy for a true, specific pharmacophore (e.g., a hydrogen bond donor). Such purely correlational insight can misdirect costly synthesis efforts. We propose a statistical framework to move from correlational QSAR to causal QSAR. Our approach uses Double/Debiased Machine Learning (DML) to estimate the unconfounded causal effect of each molecular descriptor on biological activity, treating the other p-1 descriptors as potential confounders. We then apply the Benjamini-Hochberg procedure to the resulting p hypothesis tests, one per descriptor, to control the False Discovery Rate (FDR) in this high-dimensional setting. We validate the framework in a simulation study that explicitly models the high-correlation and confounding structures endemic to chemoinformatics. We show that baseline models (Lasso, Random Forest) are easily misled, consistently ranking non-causal but confounded "bulk" descriptors as highly important. In contrast, our DML + FDR framework "sees through" the confounding: it correctly identifies the true causal descriptors and rejects the spurious ones while maintaining the target FDR. This causal inference framework provides a robust method for "deconfounding" the molecular descriptor space. By identifying features with a statistically significant causal link to activity, it can provide medicinal chemists with more reliable, interpretable, and actionable hypotheses for rational drug design.
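The two-stage idea in the abstract (a DML effect estimate per descriptor, followed by Benjamini-Hochberg across all p tests) can be illustrated with a minimal Python sketch. This is not the author's implementation: the partialling-out form of DML, the closed-form ridge regression used as the nuisance learner, the five-fold cross-fitting, the toy data, and all function names (`ridge_fit_predict`, `dml_effect`, `benjamini_hochberg`) are assumptions made for illustration only.

```python
import math
import numpy as np

def ridge_fit_predict(X_tr, y_tr, X_te, alpha=1.0):
    """Closed-form ridge regression; a stand-in for any ML nuisance learner."""
    mu_x, mu_y = X_tr.mean(axis=0), y_tr.mean()
    Xc = X_tr - mu_x
    beta = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(Xc.shape[1]),
                           Xc.T @ (y_tr - mu_y))
    return (X_te - mu_x) @ beta + mu_y

def dml_effect(X, y, j, n_folds=5, seed=0):
    """Cross-fitted partialling-out estimate of descriptor j's effect on y."""
    n = X.shape[0]
    d, W = X[:, j], np.delete(X, j, axis=1)  # treatment and the p-1 controls
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    res_y, res_d = np.empty(n), np.empty(n)
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        # Out-of-fold residuals of the outcome and of the treatment given W
        res_y[test] = y[test] - ridge_fit_predict(W[train], y[train], W[test])
        res_d[test] = d[test] - ridge_fit_predict(W[train], d[train], W[test])
    theta = (res_d @ res_y) / (res_d @ res_d)  # residual-on-residual slope
    eps = res_y - theta * res_d
    se = math.sqrt(np.sum(res_d**2 * eps**2)) / (res_d @ res_d)  # robust SE
    p_value = math.erfc(abs(theta / se) / math.sqrt(2))  # two-sided normal test
    return theta, p_value

def benjamini_hochberg(pvals, q=0.05):
    """Boolean mask of hypotheses rejected at FDR level q (BH step-up)."""
    pvals = np.asarray(pvals)
    m = pvals.size
    order = np.argsort(pvals)
    passed = pvals[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])  # largest rank meeting its threshold
        reject[order[:k + 1]] = True
    return reject

# Toy data (hypothetical): descriptor 0 is causal, descriptor 1 is a
# correlated "bulk" proxy with no effect of its own, the rest are noise.
rng = np.random.default_rng(42)
n, p = 500, 10
X = rng.normal(size=(n, p))
X[:, 1] = 0.8 * X[:, 0] + 0.6 * rng.normal(size=n)
y = 2.0 * X[:, 0] + rng.normal(size=n)

results = [dml_effect(X, y, j) for j in range(p)]
thetas = np.array([t for t, _ in results])
pvals = np.array([pv for _, pv in results])
rejected = benjamini_hochberg(pvals, q=0.05)
print("estimated effects:", np.round(thetas, 2))
print("rejected at FDR 0.05:", np.nonzero(rejected)[0])
```

In this toy setup the proxy descriptor is marginally correlated with activity, but once the remaining descriptors are partialled out its residual carries little signal, which is the behavior the abstract describes. A nonlinear learner could replace the ridge step without changing the rest of the sketch.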

Keywords

Double Machine Learning

High-Dimensional Statistics

False Discovery Rate

Metrics

Views: 578

Downloads: 139

License

The content is available under CC BY 4.0

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) declare that they have sought and gained approval from the relevant ethics committee/IRB for this research and its publication.
