Abstract
AI-driven molecular generation faces a "generation-synthesis gap": most computationally designed molecules cannot be synthesized in laboratories, which limits applications of AI-assisted drug design (AIDD). Current approaches to assessing synthetic accessibility (SA) include computer-aided synthesis planning (CASP) tools, which perform retrosynthetic searches, and machine-learning-based SA prediction models, which provide rapid scoring. CASP tools are computationally expensive for high-throughput screening, while existing SA prediction models may lack chemical synthesis logic or exhibit variable performance across different chemical spaces. We developed SynFrag, an SA prediction model that uses fragment assembly autoregressive generation to learn stepwise molecular construction patterns. Self-supervised pretraining on millions of unlabeled molecules enables the model to learn dynamic fragment assembly patterns that go beyond fragment occurrence statistics or reaction step annotations. This approach captures connectivity relationships relevant to synthesis difficulty cliffs, where minor structural changes substantially alter SA. Evaluation on public benchmarks, clinical drugs with intermediates, and AI-generated molecules shows consistent performance across diverse chemical spaces. The model produces sub-second predictions, and its attention mechanisms correspond to key reactive sites. SynFrag provides computational efficiency suitable for large-scale screening while maintaining interpretability for detailed SA assessment in drug discovery workflows. An online platform is available at https://synfrag.simm.ac.cn. Code and data are available at https://github.com/simmzx/SynFrag.
Supplementary materials
Title
Supporting Information of SynFrag
Description
Atom and bond features used in SynFrag; hyperparameter search configuration for SynFrag fine-tuning; heatmap of the fingerprint similarities between the ES and HS compounds in the three public test sets; heatmap of AUROC on TS3 for different hyperparameter search configurations; and comparison of DFS-assembled sequences with ground-truth DFS sequences.