Abstract
Here, we develop machine learning (ML) models to predict catalytic performance in rhodium-catalyzed propene hydroformylation and to identify new ligands and reaction conditions for enhanced yield and selectivity. A curated dataset of 215 experimentally obtained data points involving the [Rh(acac)(CO)₂] precatalyst and 49 phosphine ligands was used. Molecular, electronic, and reaction descriptors were constructed using cheminformatic representations and dimensionality reduction. Among the algorithms tested, XGBoost performed best, achieving a root-mean-square error (RMSE) of 9.85% for iso-selectivity under leave-one-ligand-out cross-validation. Feature importance analysis revealed that ligand and solvent descriptors most strongly influence selectivity, whereas turnover number (TON) proved harder to predict. A bootstrapped ensemble of XGBoost models integrated with a genetic algorithm enabled exploration of a vast ligand–condition space, yielding 58 candidate ligands predicted to achieve TON ≥ 330 and Iso(%) ≥ 70. This study demonstrates that interpretable ML models can complement mechanistic understanding and accelerate catalyst design for small-alkene hydroformylation.
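The leave-one-ligand-out protocol mentioned in the abstract can be illustrated with a minimal sketch. All data points for one ligand are held out in turn, a model is fitted on the remaining ligands, and the RMSE is computed over the held-out predictions. Everything here is an assumption made for illustration: the function names, the toy records, and the pooling of errors across folds are hypothetical, and a trivial training-mean predictor stands in for XGBoost so the example runs with the standard library only. It is not the authors' implementation.

```python
import math
from collections import defaultdict


def leave_one_ligand_out_rmse(records, fit, predict):
    """Pooled RMSE when each ligand's data points are held out in turn.

    records: list of (ligand_id, features, target) tuples (hypothetical layout).
    fit:     callable taking a list of (features, target) and returning a model.
    predict: callable taking (model, features) and returning a prediction.
    """
    by_ligand = defaultdict(list)
    for ligand, x, y in records:
        by_ligand[ligand].append((x, y))

    squared_errors = []
    for held_out in by_ligand:
        # Train only on data from the other ligands.
        train = [
            (x, y)
            for ligand, rows in by_ligand.items()
            if ligand != held_out
            for x, y in rows
        ]
        model = fit(train)
        # Score every data point of the held-out ligand.
        for x, y in by_ligand[held_out]:
            squared_errors.append((predict(model, x) - y) ** 2)

    # Assumption: errors are pooled over all folds before taking the root.
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Stand-in model (NOT XGBoost): always predicts the training-set mean.
def fit_mean(train):
    return sum(y for _, y in train) / len(train)


def predict_mean(model, x):
    return model


# Toy records: (ligand id, descriptor vector, iso-selectivity in %).
toy = [
    ("L1", [0.1], 40.0),
    ("L1", [0.2], 44.0),
    ("L2", [0.5], 60.0),
    ("L3", [0.9], 48.0),
]

rmse = leave_one_ligand_out_rmse(toy, fit_mean, predict_mean)
print(f"leave-one-ligand-out RMSE: {rmse:.2f}")
```

Grouping the folds by ligand, rather than splitting data points at random, means the score reflects performance on ligands the model has never seen, which is the situation faced when proposing new ligands. Any regressor exposing fit and predict steps could be substituted for the stand-in.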
Supplementary materials
ESI: Details of the structures of the ligands in the dataset and of the proposed ligands.
Supplementary weblinks
GitHub link: All code and data used in the analyses.