SynFrag: Synthetic Accessibility Predictor based on Fragment Assembly Generation in Drug Discovery

Xiang Zhang; Jia Liu; Bufan Xu; Zihan Zhang; Zifu Huang; Kaixian Chen; Dingyan Wang; Xutong Li

doi:10.26434/chemrxiv-2025-33251

Theoretical and Computational Chemistry

14 October 2025, Version 1

Working Paper

This content is an early or alternative research output and has not been peer-reviewed by Cambridge University Press at the time of posting.

Abstract

AI-driven molecular generation encounters a "generation-synthesis gap": most computationally designed molecules cannot be synthesized in laboratories, limiting AI-assisted drug design (AIDD) applications. Current approaches to assess synthetic accessibility (SA) include computer-aided synthesis planning (CASP) tools that perform retrosynthetic searches, and machine learning-based SA prediction models that provide rapid scoring. CASP tools are computationally expensive for high-throughput screening, while existing SA prediction models may lack chemical synthesis logic or exhibit variable performance across different chemical spaces. We developed SynFrag, an SA prediction model using fragment assembly autoregressive generation to learn stepwise molecular construction patterns. Self-supervised pretraining on millions of unlabeled molecules enables learning of dynamic fragment assembly patterns beyond fragment occurrence statistics or reaction step annotations. This approach captures connectivity relationships relevant to synthesis difficulty cliffs, where minor structural changes substantially alter SA. Evaluation across public benchmarks, clinical drugs with intermediates, and AI-generated molecules shows consistent performance across diverse chemical spaces. The model produces sub-second predictions with attention mechanisms corresponding to key reactive sites. SynFrag provides computational efficiency suitable for large-scale screening while maintaining interpretability for detailed SA assessment in drug discovery workflows. Online platform: https://synfrag.simm.ac.cn. Code and data available: https://github.com/simmzx/SynFrag.
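As a rough illustration of what "fragment assembly autoregressive generation" could look like at the data level, the sketch below orders the fragments of a molecule by depth-first search (DFS) and turns that order into stepwise "given the fragments so far, predict the next fragment and where it attaches" examples. It is a minimal sketch under stated assumptions, not the SynFrag implementation: the fragment graph `toy_graph`, the fragment labels, and the functions `dfs_assembly_order` and `autoregressive_steps` are all hypothetical names introduced here, and the real model's fragmentation scheme, features, and training objective are described in the paper and repository rather than reproduced.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical fragment graph for a toy molecule: nodes are fragment labels,
# edges mean "these two fragments are bonded". This is illustrative data only.
toy_graph: Dict[str, List[str]] = {
    "benzene": ["amide", "fluoro"],
    "amide": ["benzene", "piperidine"],
    "piperidine": ["amide"],
    "fluoro": ["benzene"],
}


def dfs_assembly_order(
    adjacency: Dict[str, List[str]], root: str
) -> List[Tuple[str, Optional[str]]]:
    """Return (fragment, parent) pairs in depth-first order from `root`.

    The parent is the already-placed fragment that the new one attaches to;
    the root has parent None.
    """
    order: List[Tuple[str, Optional[str]]] = []
    visited = set()
    stack: List[Tuple[str, Optional[str]]] = [(root, None)]
    while stack:
        fragment, parent = stack.pop()
        if fragment in visited:
            continue  # stale stack entry for a fragment placed via a deeper path
        visited.add(fragment)
        order.append((fragment, parent))
        # Push neighbours in reverse so they are visited in their listed order.
        for neighbour in reversed(adjacency[fragment]):
            if neighbour not in visited:
                stack.append((neighbour, fragment))
    return order


def autoregressive_steps(
    order: List[Tuple[str, Optional[str]]]
) -> List[Tuple[List[str], str, Optional[str]]]:
    """Turn an assembly order into (prefix, next_fragment, attach_to) examples."""
    steps = []
    for i in range(1, len(order)):
        prefix = [fragment for fragment, _ in order[:i]]
        next_fragment, attach_to = order[i]
        steps.append((prefix, next_fragment, attach_to))
    return steps


if __name__ == "__main__":
    assembly = dfs_assembly_order(toy_graph, "benzene")
    for prefix, next_fragment, attach_to in autoregressive_steps(assembly):
        print(f"{prefix} -> add {next_fragment} onto {attach_to}")
```

Each step conditions only on the fragments already placed, which is the sense in which such a scheme is autoregressive; a learned model would replace the lookup of the next fragment with a prediction.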

Keywords

Synthetic Accessibility

Drug Discovery

Cheminformatics

Graph Neural Networks

Self-Supervised Learning

Fragment Assembly Generation

Pretrain-Finetune

Machine Learning

Attention Mechanism

Supplementary materials

Title: Supporting Information of SynFrag

Description: Atom and bond features used in SynFrag; hyperparameter search configuration for SynFrag fine-tuning; heatmap of the fingerprint similarities between the ES and HS compounds in 3 public test sets; heatmap of AUROC on TS3 for different hyperparameter search configurations; and comparison of DFS-assembled sequences and ground-truth DFS sequences.

Version History

Oct 14, 2025 Version 1

Metrics

Views: 1,842

Downloads: 446

Citations

License

The content is available under CC BY 4.0

Funding

Strategic Priority Research Program of the Chinese Academy of Sciences: XDB0830000, China

National Natural Science Foundation of China: 82204278, China

National Key Research and Development Program of China: 2023YFC2305904, China

National Key Research and Development Program of China: 2022YFC3400504, China

Lingang Laboratory: LGL-8888

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
