Causal Descriptors in QSAR: Deconfounding High-Dimensional Molecular Features via Double Machine Learning and Hypothesis Testing

30 October 2025, Version 1
This content is an early or alternative research output and has not been peer-reviewed by Cambridge University Press at the time of posting.

Abstract

Quantitative Structure-Activity Relationship (QSAR) modeling is a pillar of computational drug discovery. However, standard machine learning (ML) models are often confounded because molecular descriptors are high-dimensional and strongly correlated with one another. A model may identify a "bulk" property (e.g., molecular weight) as highly predictive when it is in fact merely a proxy for a true, specific pharmacophore (e.g., a hydrogen bond donor). Such purely correlational findings can misdirect costly synthesis efforts. We propose a statistical framework to move from correlational QSAR to causal QSAR. Our approach uses Double/Debiased Machine Learning (DML) to estimate the deconfounded causal effect of each of the p molecular descriptors on biological activity, treating the other p-1 descriptors as potential confounders. We then apply the Benjamini-Hochberg procedure to the resulting p hypothesis tests, one per estimated effect, to control the False Discovery Rate (FDR). We validate this framework using a simulation study that explicitly models the high-correlation and confounding structures endemic to chemoinformatics. We show that baseline models (Lasso, Random Forest) are easily misled, consistently ranking non-causal but confounded "bulk" descriptors as highly important. In contrast, our DML + FDR framework successfully "sees through" the confounding: it correctly identifies the true causal descriptors and rejects the spurious ones while maintaining the target FDR. This causal inference framework provides a robust method for "deconfounding" the molecular descriptor space. By identifying features with a statistically significant causal link to activity, it can provide medicinal chemists with more reliable, interpretable, and actionable hypotheses for rational drug design.
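
The multiple-testing step described in the abstract can be illustrated with a minimal sketch. It assumes the DML stage has already produced one effect estimate and one standard error per descriptor and that the estimates are approximately normal; the function names, the example numbers, and the choice of alpha = 0.05 are hypothetical and are not taken from the paper. The DML estimation itself is not shown.

```python
from statistics import NormalDist


def two_sided_p_values(estimates, std_errors):
    """Wald-type two-sided p-values for H0: effect_j = 0.

    Assumes each estimate is approximately normal around the true effect.
    """
    norm = NormalDist()
    return [2.0 * norm.cdf(-abs(est / se)) for est, se in zip(estimates, std_errors)]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, True where the null hypothesis is rejected
    at FDR level alpha.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= k * alpha / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical per-descriptor effect estimates and standard errors,
# standing in for the output of a DML stage (illustrative values only).
theta_hat = [0.80, 0.02, -0.55, 0.10, 0.01]
std_err = [0.10, 0.10, 0.12, 0.09, 0.11]

p_vals = two_sided_p_values(theta_hat, std_err)
selected = benjamini_hochberg(p_vals, alpha=0.05)
print(selected)
```

With these illustrative inputs, only the first and third descriptors are flagged. Because the procedure is step-up, a hypothesis can be rejected even when its own p-value exceeds its rank threshold, provided a larger p-value further down the ordering meets its threshold.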

Keywords

Causal Inference
QSAR
Chemoinformatics
Drug Discovery
Double Machine Learning
High-Dimensional Statistics
False Discovery Rate
