A round-robin exercise for the precise prediction of aqueous solubility of organic chemicals using chemometric, machine learning, and stacking ensemble of deep learning models

02 January 2026, Version 1
This content is an early or alternative research output and has not been peer-reviewed by Cambridge University Press at the time of posting.

Abstract

Aqueous solubility is an important property for assessing the druggability and ecotoxicological effects of molecules. Successful drug candidates need optimal aqueous solubility to achieve good bioavailability in target tissues. Reliable predictive models are highly useful for screening molecules in a short period of time. In the present study, we conducted a round-robin exercise using a large, curated dataset of over 6000 compounds to predict aqueous solubility quantitatively. The six participating groups used an array of machine learning and deep learning algorithms to develop models with strong robustness and external predictive performance. All the models underwent rigorous leave-one-out and 10-fold cross-validation. The diversity of training sets and descriptor types used by the different groups allowed the mechanistic basis to be explored and the contributing features to be identified efficiently. The best-performing model was selected using the statistical Sum of Ranking Differences (SRD) approach, considering performance on the training, cross-validation, and test sets, as well as the difference in performance between the training and test sets. Additionally, a curated, true external set was screened by the six models. Here, the best-performing model was selected using a consensus ranking strategy based on Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and the external coefficient of determination (R²_Ext). In both approaches, i.e., the inherent model performance in terms of training, test, and cross-validation statistics, and the ability of the model to predict true external data, the Stacking Ensemble of Deep q-RASPR model emerged as the winner. This model showed predictive performance comparable to that of the previously reported model, whose dataset apparently lacked a proper curation workflow and contained a significant number of duplicates and mixtures, which can inflate model statistics. The feature contributions reported by the different groups identified useful structural and physicochemical aspects, which can help synthetic chemists to optimize molecules.
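The consensus ranking on the true external set can be illustrated with a minimal sketch. The abstract does not state how the three metrics were combined, so the mean-of-ranks aggregation below is an assumption, as are the function names, the toy data, and the choice to compute R²_Ext against the mean of the external observed values. It is not the authors' implementation.

```python
import math


def external_metrics(y_true, y_pred):
    """Return MAE, RMSE and R2 for one model on an external set.

    Assumption: R2 = 1 - SS_res / SS_tot, with SS_tot taken around the
    mean of the external observed values.
    """
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "MAE": sum(abs(r) for r in residuals) / n,
        "RMSE": math.sqrt(ss_res / n),
        "R2": 1.0 - ss_res / ss_tot,
    }


def rank(values, higher_is_better=False):
    """Rank values so that 1 is best; tied values share the average rank."""
    order = sorted(values, reverse=higher_is_better)
    return [order.index(v) + 1 + (order.count(v) - 1) / 2 for v in values]


def consensus_ranking(y_true, predictions):
    """Order models by their mean rank over MAE, RMSE and R2 (assumed rule)."""
    names = list(predictions)
    metrics = {name: external_metrics(y_true, predictions[name]) for name in names}
    per_metric_ranks = [
        rank([metrics[name]["MAE"] for name in names]),
        rank([metrics[name]["RMSE"] for name in names]),
        rank([metrics[name]["R2"] for name in names], higher_is_better=True),
    ]
    mean_rank = {
        name: sum(ranks[i] for ranks in per_metric_ranks) / len(per_metric_ranks)
        for i, name in enumerate(names)
    }
    return sorted(mean_rank.items(), key=lambda item: item[1]), metrics


# Hypothetical solubility values (log units) and three hypothetical models.
toy_observed = [-1.0, -2.0, -3.0, -4.0]
toy_predictions = {
    "model_a": [-1.1, -2.1, -2.9, -4.1],
    "model_b": [-1.5, -2.5, -3.5, -3.0],
    "model_c": [-1.0, -2.0, -3.0, -4.0],
}

ranking, metrics = consensus_ranking(toy_observed, toy_predictions)
for name, score in ranking:
    print(name, round(score, 2), {k: round(v, 3) for k, v in metrics[name].items()})
```

A mean of ranks weights the three metrics equally and ignores how large the gaps between models are; other aggregation rules could give a different order when models are close.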

Keywords

Solubility
Deep Learning
Machine Learning
q-RASPR
QSPR
Round Robin

Supplementary materials

Supplementary Files

Supplementary Information SI-0 – contains the modeling dataset
Supplementary Information SI-1 – contains all the necessary files for the model developed by Group A
Supplementary Information SI-2 – contains all the necessary files for the model developed by Group B
Supplementary Information SI-3 – contains all the necessary files for the model developed by Group C
Supplementary Information SI-4 – contains all the necessary files for the model developed by Group D
Supplementary Information SI-5 – contains all the necessary files for the model developed by Group E
Supplementary Information SI-6 – contains all the necessary files for the model developed by Group F
Supplementary Information SI-7 – contains the SRD output file
Supplementary Information SI-8 – contains prediction values for the true external set
