Data augmentation in a triple transformer loop retrosynthesis model

21 December 2025, Version 3
This content is an early or alternative research output and has not been peer-reviewed by Cambridge University Press at the time of posting.

Abstract

Reactions in the US Patent and Trademark Office (USPTO) dataset are biased towards a few over-represented reaction types, which potentially limits their usefulness for computer-assisted synthesis planning (CASP). Herein we propose a data augmentation approach to generate a balanced dataset of fictive reactions. First, we apply retrosynthesis templates to template-matched USPTO molecules used as products (P) to obtain starting materials (SM). We then use transformer T2 from our recently reported triple transformer loop (TTL) retrosynthesis model to predict reagents (R). Finally, we validate the resulting fictive reaction by requiring a correct, high-confidence prediction by transformer T3*, which is trained to predict P from R and SM* with tagged reacting atoms. We generate up to 5,000 reactions per template, resulting in a template-equilibrated dataset of 27.5 million fictive reactions covering the chemical space of the original USPTO dataset. We demonstrate that a TTL trained on these fictive reactions outperforms a TTL trained on USPTO reactions only.
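
The generate-and-validate loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the three callables `apply_template`, `predict_reagents` and `predict_product` are hypothetical stand-ins for the template engine, T2 and T3*, the inputs given to T2 are assumed to be SM and P, and the `min_confidence` threshold is a placeholder because the abstract does not state a value. Only the cap of 5,000 reactions per template comes from the text.

```python
from typing import Callable, Dict, Iterable, List, NamedTuple, Optional, Tuple


class FictiveReaction(NamedTuple):
    template_id: str
    starting_materials: str
    reagents: str
    product: str


def generate_balanced(
    products: Iterable[str],
    templates: Iterable[str],
    apply_template: Callable[[str, str], Optional[str]],  # hypothetical: (template, P) -> SM or None
    predict_reagents: Callable[[str, str], str],          # hypothetical T2 stand-in: (SM, P) -> R
    predict_product: Callable[[str, str], Tuple[str, float]],  # hypothetical T3* stand-in: (SM, R) -> (P, confidence)
    max_per_template: int = 5000,   # cap stated in the abstract
    min_confidence: float = 0.95,   # assumed value, not given in the abstract
) -> Dict[str, List[FictiveReaction]]:
    """Collect validated fictive reactions, capped per template."""
    products = list(products)
    kept: Dict[str, List[FictiveReaction]] = {}
    for template_id in templates:
        accepted: List[FictiveReaction] = []
        for product in products:
            if len(accepted) >= max_per_template:
                break  # this template has reached its quota
            sm = apply_template(template_id, product)
            if sm is None:
                continue  # template does not match this product
            reagents = predict_reagents(sm, product)
            predicted, confidence = predict_product(sm, reagents)
            # Keep the reaction only if the forward model recovers the
            # original product with sufficient confidence.
            if predicted == product and confidence >= min_confidence:
                accepted.append(FictiveReaction(template_id, sm, reagents, product))
        kept[template_id] = accepted
    return kept
```

Capping each template at the same quota is what equilibrates the dataset: frequent templates stop contributing once the cap is reached, while rare templates contribute every reaction that passes validation.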

Keywords

retrosynthesis
transformer models
synthesis planning
data augmentation
