RETRACTED: Harnessing Graph Learning for Surfactant Chemistry: PharmHGT, GCN, and GAT in LogCMC Prediction

28 July 2025, Version 1
This content is an early or alternative research output and has not been peer-reviewed by Cambridge University Press at the time of posting.

Abstract

Accurately predicting the critical micelle concentration (CMC) of surfactants is crucial for optimizing their use across various industries, including pharmaceuticals, detergents, and emulsions. In this study, we evaluate the effectiveness of graph-based machine learning models—specifically graph convolutional networks (GCNs), graph attention networks (GATs), and a graph-transformer model called PharmHGT—for predicting LogCMC values. Additionally, we complement these ML approaches with molecular dynamics (MD) simulations to calculate solvation free energies and provide fundamental thermodynamic insights into surfactant behavior. Our findings offer insights into the relative strengths of these approaches, emphasizing the potential advantages of transformer-based architectures like PharmHGT in representing molecular graphs more effectively than traditional graph neural networks. To better capture the unique molecular features of surfactants, we developed a dedicated surfactant detection module. This module identifies hydrophilic head groups, such as anionic (e.g., sulfates and carboxylates) and cationic (e.g., quaternary ammonium) functional groups, based on atomic properties like formal charge and chemical environment. It also detects hydrophobic tail groups by locating continuous carbon chains of at least four atoms using a depth-first search (DFS) algorithm to identify hydrocarbon fragments. Additionally, the module classifies surfactants into nonionic, anionic, cationic, or zwitterionic categories, depending on the presence of positive and negative charges or combinations of polar and nonpolar regions. The integration of ML predictions and MD-derived thermodynamic insights underscores the importance of combining data-driven and physics-based approaches for robust and interpretable prediction of surfactant properties. 
This framework advances the accuracy of CMC (or, more precisely, log CMC) prediction and supports the rational design of surfactants for specific applications.
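
The surfactant detection module described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the molecule representation (parallel lists of element symbols and formal charges plus a list of heavy-atom bonds) and the function names `carbon_fragments`, `find_tails`, and `classify` are assumptions made here for illustration. The sketch also assumes that a "continuous carbon chain" can be read as a connected fragment of carbon atoms, that counterions have been stripped beforehand, and that classification uses formal charge only, not the chemical environment the abstract also mentions.

```python
from collections import defaultdict

MIN_TAIL_ATOMS = 4  # chain-length threshold stated in the abstract


def build_adjacency(bonds):
    """Build an undirected adjacency map from (atom_i, atom_j) index pairs."""
    adj = defaultdict(set)
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def carbon_fragments(elements, bonds):
    """Collect connected carbon-only fragments with an iterative DFS."""
    adj = build_adjacency(bonds)
    seen = set()
    fragments = []
    for start, symbol in enumerate(elements):
        if symbol != "C" or start in seen:
            continue
        stack, fragment = [start], []
        seen.add(start)
        while stack:
            atom = stack.pop()
            fragment.append(atom)
            for nbr in adj[atom]:
                # Only walk through carbon atoms; heteroatoms end the fragment.
                if elements[nbr] == "C" and nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        fragments.append(sorted(fragment))
    return fragments


def find_tails(elements, bonds):
    """Keep carbon fragments large enough to count as hydrophobic tails."""
    return [f for f in carbon_fragments(elements, bonds)
            if len(f) >= MIN_TAIL_ATOMS]


def classify(charges):
    """Assign a surfactant class from the signs of the formal charges."""
    has_pos = any(q > 0 for q in charges)
    has_neg = any(q < 0 for q in charges)
    if has_pos and has_neg:
        return "zwitterionic"
    if has_neg:
        return "anionic"
    if has_pos:
        return "cationic"
    return "nonionic"


# Hypothetical input: the dodecyl sulfate anion, heavy atoms only.
# Atoms 0-11 are the carbon tail, 12 is the ester oxygen, 13 is sulfur,
# and 14-16 are the remaining sulfate oxygens (16 carries the charge).
elements = ["C"] * 12 + ["O", "S", "O", "O", "O"]
charges = [0] * 16 + [-1]
bonds = [(i, i + 1) for i in range(13)] + [(13, 14), (13, 15), (13, 16)]

tails = find_tails(elements, bonds)
label = classify(charges)
```

In this example `tails` holds a single twelve-carbon fragment and `label` is "anionic". A production version would more likely operate on a cheminformatics toolkit's molecule object and would need explicit handling of counterions and of polar but uncharged head groups.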

Keywords

CMC
machine learning
molecular dynamics
surfactants

Supplementary materials

Title
Supporting Information for: Harnessing Graph Learning for Surfactant Chemistry: PharmHGT, GCN, and GAT in LogCMC Prediction
Description
Complete list of 25 physicochemical molecular descriptors used for clustering analysis with detailed descriptions organized by functional categories (Table S1); detailed SHAP feature importance analysis plots for Data1 showing cluster-specific contributions of molecular descriptors (Figure S1); comprehensive SHAP feature importance analysis for Data2 with surfactant type-specific patterns (Figure S2); FASTMAN molecular clustering analysis using PaCMAP dimensionality reduction combined with HDBSCAN clustering revealing five distinct chemical space regions (Figure S3); self-organizing map (SOM) analysis displaying overall molecular property topology and individual cluster distributions (Figure S4); SHAP contribution analysis comparing molecular descriptor patterns between clusters 1 and 3 (Figure S5); cluster-specific molecular descriptor component planes showing aromatic ring content and fractional CSP3 character variations with molecule positions on SOM grid (Figure S6); detailed methodology for surfactant-specific feature importance analysis including atom-level importance calculation and structural group analysis procedures (docx).
