SynFrag: Synthetic Accessibility Predictor based on Fragment Assembly Generation in Drug Discovery

14 October 2025, Version 1
This content is an early or alternative research output and has not been peer-reviewed by Cambridge University Press at the time of posting.

Abstract

AI-driven molecular generation encounters a "generation-synthesis gap": most computationally designed molecules cannot be synthesized in laboratories, limiting AI-assisted drug design (AIDD) applications. Current approaches to assess synthetic accessibility (SA) include computer-aided synthesis planning (CASP) tools that perform retrosynthetic searches, and machine learning-based SA prediction models that provide rapid scoring. CASP tools are computationally expensive for high-throughput screening, while existing SA prediction models may lack chemical synthesis logic or exhibit variable performance across different chemical spaces. We developed SynFrag, an SA prediction model using fragment assembly autoregressive generation to learn stepwise molecular construction patterns. Self-supervised pretraining on millions of unlabeled molecules enables learning of dynamic fragment assembly patterns beyond fragment occurrence statistics or reaction step annotations. This approach captures connectivity relationships relevant to synthesis difficulty cliffs, where minor structural changes substantially alter SA. Evaluation across public benchmarks, clinical drugs with intermediates, and AI-generated molecules shows consistent performance across diverse chemical spaces. The model produces sub-second predictions with attention mechanisms corresponding to key reactive sites. SynFrag provides computational efficiency suitable for large-scale screening while maintaining interpretability for detailed SA assessment in drug discovery workflows. Online platform: https://synfrag.simm.ac.cn. Code and data available: https://github.com/simmzx/SynFrag.
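
The idea of learning stepwise construction through fragment assembly can be illustrated with a small sketch. It is not the SynFrag implementation. It assumes that a molecule has already been cut into fragments and that their connectivity is given as an adjacency mapping; the fragment names, `dfs_assembly_order`, and `autoregressive_pairs` are hypothetical and chosen only for illustration. The sketch orders fragments by depth-first traversal (the supplementary materials mention DFS sequences) and turns that order into (partial assembly, next fragment) pairs of the kind an autoregressive model could be trained on.

```python
from typing import Dict, List, Tuple


def dfs_assembly_order(adjacency: Dict[str, List[str]], root: str) -> List[str]:
    """Return fragments in depth-first order starting from `root`.

    `adjacency` is a hypothetical fragment-connectivity graph; it is not
    produced by SynFrag itself.
    """
    order: List[str] = []
    seen = set()
    stack = [root]
    while stack:
        frag = stack.pop()
        if frag in seen:
            continue
        seen.add(frag)
        order.append(frag)
        # Push neighbours in reverse so they are visited in their listed order.
        for neighbour in reversed(adjacency.get(frag, [])):
            if neighbour not in seen:
                stack.append(neighbour)
    return order


def autoregressive_pairs(order: List[str]) -> List[Tuple[List[str], str]]:
    """Build (fragments assembled so far, next fragment) training pairs."""
    return [(order[:i], order[i]) for i in range(1, len(order))]


# Toy fragment graph (illustrative names, not taken from the paper).
fragment_graph = {
    "benzene": ["amide", "fluoro"],
    "amide": ["benzene", "piperidine"],
    "piperidine": ["amide"],
    "fluoro": ["benzene"],
}

order = dfs_assembly_order(fragment_graph, "benzene")
pairs = autoregressive_pairs(order)

for prefix, target in pairs:
    print(prefix, "->", target)
```

Each pair asks the model to predict the next fragment given what has been assembled so far, which is one way to expose connectivity patterns rather than fragment occurrence counts alone.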

Keywords

Synthetic Accessibility
Drug Discovery
Cheminformatics
Graph Neural Networks
Self-Supervised Learning
Fragment Assembly Generation
Pretrain-Finetune
Machine Learning
Attention Mechanism

Supplementary materials

Supporting Information of SynFrag
Atom and bond features used in SynFrag; hyperparameter search configuration for SynFrag fine-tuning; heatmap of fingerprint similarities between the ES and HS compounds in the three public test sets; heatmap of AUROC on TS3 for different hyperparameter search configurations; and comparison of DFS-assembled sequences with ground-truth DFS sequences.
