Data augmentation in a triple transformer loop retrosynthesis model

Yves Grandjean; David Kreutter; Jean-Louis Reymond

doi:10.26434/chemrxiv-2024-r3x05-v3

Theoretical and Computational Chemistry

21 December 2025, Version 3

Working Paper

This content is an early or alternative research output and has not been peer-reviewed by Cambridge University Press at the time of posting.

Abstract

Reactions in the US Patent and Trademark Office (USPTO) dataset are biased towards a few over-represented reaction types, which potentially limits their usefulness for computer-assisted synthesis planning (CASP). Herein we propose a data augmentation approach to generate a balanced dataset of fictive reactions. First, we apply retrosynthesis templates to template-matched USPTO molecules used as products (P) to obtain starting materials (SM). We then use transformer T2 from our recently reported triple transformer loop (TTL) retrosynthesis model to predict reagents (R). Finally, we validate each resulting fictive reaction by requiring a high-confidence, correct prediction of P by transformer T3*, which is trained to predict P from R and SM* with tagged reacting atoms. We generate up to 5,000 reactions per template, resulting in a template-equilibrated dataset of 27.5 million fictive reactions covering the chemical space of the original USPTO dataset. We demonstrate that a TTL trained on these fictive reactions outperforms a TTL trained on USPTO reactions only.
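The three steps in the abstract (template application, reagent prediction, round-trip validation) can be sketched as a simple filtering loop. This is only an illustrative sketch, not the authors' code: the function name `generate_fictive_reactions`, the three callables standing in for the template engine, T2 and T3*, the inputs passed to the T2 stand-in, and the `min_confidence` threshold of 0.95 are all assumptions made for the example. Only the cap of 5,000 reactions per template comes from the abstract.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# Hypothetical signatures; molecules are plain SMILES-like strings here.
ApplyTemplate = Callable[[str, str], Optional[str]]       # (template, P) -> SM, or None if no match
PredictReagents = Callable[[str, str], str]               # stand-in for T2: (SM, P) -> R
PredictProduct = Callable[[str, str], Tuple[str, float]]  # stand-in for T3*: (SM, R) -> (P, confidence)

Reaction = Tuple[str, str, str]  # (SM, R, P)


def generate_fictive_reactions(
    templates: Sequence[str],
    products: Sequence[str],
    apply_template: ApplyTemplate,
    predict_reagents: PredictReagents,
    predict_product: PredictProduct,
    max_per_template: int = 5000,   # cap stated in the abstract
    min_confidence: float = 0.95,   # assumed threshold, not from the source
) -> Dict[str, List[Reaction]]:
    """Collect validated fictive reactions, at most max_per_template per template."""
    kept: Dict[str, List[Reaction]] = {}
    for template in templates:
        accepted: List[Reaction] = []
        for product in products:
            if len(accepted) >= max_per_template:
                break  # template quota reached, keeps the dataset balanced
            # Step 1: retrosynthesis template applied to the product gives SM.
            starting_materials = apply_template(template, product)
            if starting_materials is None:
                continue  # template does not match this product
            # Step 2: reagent prediction (role played by T2 in the paper).
            reagents = predict_reagents(starting_materials, product)
            # Step 3: forward validation (role played by T3* in the paper).
            predicted, confidence = predict_product(starting_materials, reagents)
            if predicted == product and confidence >= min_confidence:
                accepted.append((starting_materials, reagents, product))
        kept[template] = accepted
    return kept
```

Capping each template independently is what makes the output template-equilibrated: frequent templates stop contributing once their quota is filled, while rare ones contribute every validated match.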

Supplementary weblinks

enrichment framework: code to run the fictive reaction generation

fictive reaction dataset: 27.5 million fictive reactions generated in this work

TMAP of reaction data: HTML files of the interactive TMAPs shown in Figure 3

Version History

Dec 21, 2025 Version 3

Oct 17, 2025 Version 2

Aug 01, 2024 Version 1

Version Notes

We have improved the method section as well as the GitHub and Zenodo repositories, and modified the figures to be clearer.

Metrics

Views: 1,731

Downloads: 745

License

The content is available under CC BY-NC-ND 4.0

Funding

Swiss National Science Foundation

200020_207976

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
