Abstract
Aqueous solubility is an important property for assessing the druggability and ecotoxicological effects of molecules. Successful drug candidates should have optimal aqueous solubility to improve bioavailability to target tissues. Reliable predictive models are therefore highly useful for screening molecules in a short period of time. In the present study, we conducted a round-robin exercise using a large, curated dataset of over 6000 compounds to predict aqueous solubility quantitatively. The six participating groups used an array of Machine Learning and Deep Learning algorithms to develop models with strong robustness and external predictive performance. All the models underwent rigorous Leave-One-Out and 10-fold cross-validation. The diversity of training sets and descriptor types used by the different groups made it possible to explore the mechanistic basis of solubility and to identify the contributing features efficiently. The best-performing model was selected using the statistical Sum of Ranking Differences (SRD) approach, considering performance on the training, cross-validation, and test sets, as well as the performance difference between the training and test sets. Additionally, a curated, true external set was screened by the six models. Here, the best-performing model was selected using a consensus ranking strategy based on Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R²_Ext. In both approaches, i.e., the inherent model performance in terms of training, test, and cross-validation statistics, and the ability of the model to predict true external data, the Stacking Ensemble of Deep q-RASPR model emerged as the winner. This model's predictive performance was comparable to that of the previously reported model, whose dataset apparently lacked a proper curation workflow and contained a significant number of duplicates and mixtures, which can inflate model statistics.
Insights from the feature contributions reported by the different groups identified useful structural and physicochemical features, which can help synthetic chemists optimize molecules.
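As an illustration of the consensus ranking idea mentioned above, the following is a minimal sketch, not the procedure used in the study. It assumes that each model is ranked separately on MAE, RMSE, and an external R², and that the ranks are then averaged. The function names, the toy values, and the R² definition (computed against the mean of the external set itself) are all assumptions made for this example; the study may define R²_Ext and the consensus rule differently.

```python
import math

def external_metrics(y_true, y_pred):
    """Return MAE, RMSE and R2 for one model on an external set.

    Assumption: R2 uses the mean of the external set itself as reference.
    """
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": math.sqrt(ss_res / n),
        "R2": 1.0 - ss_res / ss_tot,
    }

def consensus_rank(metrics_by_model):
    """Rank models per metric (1 = best) and sort them by mean rank.

    Ties are not treated specially in this sketch; tied models simply
    keep their input order.
    """
    lower_is_better = {"MAE": True, "RMSE": True, "R2": False}
    rank_sums = {name: 0 for name in metrics_by_model}
    for metric, ascending in lower_is_better.items():
        ordered = sorted(
            metrics_by_model,
            key=lambda name: metrics_by_model[name][metric],
            reverse=not ascending,
        )
        for rank, name in enumerate(ordered, start=1):
            rank_sums[name] += rank
    mean_ranks = {k: v / len(lower_is_better) for k, v in rank_sums.items()}
    return sorted(mean_ranks.items(), key=lambda item: item[1])

# Toy log-solubility values, invented purely for illustration.
y_true = [-2.0, -3.5, -1.0, -4.2]
predictions = {
    "model_1": [-2.1, -3.4, -1.2, -4.0],
    "model_2": [-2.5, -3.0, -0.2, -4.9],
}
metrics = {name: external_metrics(y_true, p) for name, p in predictions.items()}
ranking = consensus_rank(metrics)
print(ranking)
```

The first entry of `ranking` is the model with the lowest mean rank across the three metrics. Averaging ranks rather than raw metric values keeps metrics with different scales from dominating the comparison.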
Supplementary materials
Title
Supplementary Files
Description
Supplementary Information SI-0 – contains the modeling dataset
Supplementary Information SI-1 – contains all the necessary files for the model developed by Group A
Supplementary Information SI-2 – contains all the necessary files for the model developed by Group B
Supplementary Information SI-3 – contains all the necessary files for the model developed by Group C
Supplementary Information SI-4 – contains all the necessary files for the model developed by Group D
Supplementary Information SI-5 – contains all the necessary files for the model developed by Group E
Supplementary Information SI-6 – contains all the necessary files for the model developed by Group F
Supplementary Information SI-7 – contains the SRD output file
Supplementary Information SI-8 – contains prediction values for the true external set
Supplementary weblinks
Title
DTC Laboratory Supplementary website
Description
The Read-Across tool, the RASAR descriptor calculator tool, and the chemical structure curation tool are freely available from the DTC Laboratory Supplementary website.