Abstract
We present a sequence-to-sequence machine learning model for predicting the IUPAC name of a
chemical from its standard International Chemical Identifier (InChI). The model uses two stacks
of transformers in an encoder-decoder architecture, a setup similar to the neural networks used in
state-of-the-art machine translation. Unlike neural machine translation, which usually tokenizes
input and output into words or sub-words, our model processes the InChI and predicts the
IUPAC name character by character. The model was trained on a dataset of 10 million
InChI/IUPAC name pairs freely downloaded from the National Library of Medicine’s online
PubChem service. Training took five days on a Tesla K80 GPU, and the model achieved test-set
accuracies of 95% (character-level) and 91% (whole name). The model performed particularly
well on organics, with the exception of macrocycles. The predictions were less accurate for
inorganic compounds, with a character-level accuracy of 71%. This gap can be explained by inherent
limitations of InChI in representing inorganics, as well as by the low share of inorganic compounds
(1.4%) in the training data.
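
To make the character-by-character processing and the two reported metrics concrete, the following is a minimal Python sketch, not the authors' implementation. The helper names (`build_vocab`, `encode`, `whole_name_accuracy`, `character_accuracy`) are hypothetical, and the character-level accuracy shown here is only one plausible definition (position-wise matches divided by the longer of the two strings); the abstract does not specify the exact formula.

```python
from typing import Dict, List, Sequence


def build_vocab(strings: Sequence[str]) -> Dict[str, int]:
    """Map every distinct character to an integer id (0 is left free for padding)."""
    chars = sorted(set("".join(strings)))
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(text: str, vocab: Dict[str, int]) -> List[int]:
    """Tokenize a string one character at a time, with no word or sub-word units."""
    return [vocab[ch] for ch in text]


def decode(ids: Sequence[int], vocab: Dict[str, int]) -> str:
    """Invert encode(): turn a sequence of ids back into a string."""
    inverse = {i: ch for ch, i in vocab.items()}
    return "".join(inverse[i] for i in ids)


def whole_name_accuracy(preds: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of predicted names that match the reference name exactly."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)


def character_accuracy(preds: Sequence[str], targets: Sequence[str]) -> float:
    """Assumed definition: position-wise character matches over the longer length."""
    correct = 0
    total = 0
    for p, t in zip(preds, targets):
        correct += sum(a == b for a, b in zip(p, t))
        total += max(len(p), len(t))
    return correct / total


# Illustrative InChI / IUPAC name pairs (methane and ethanol).
pairs = [
    ("InChI=1S/CH4/h1H4", "methane"),
    ("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "ethanol"),
]
inchi_vocab = build_vocab([inchi for inchi, _ in pairs])
name_vocab = build_vocab([name for _, name in pairs])
encoded_inchi = encode(pairs[0][0], inchi_vocab)  # one id per InChI character
```

Under this assumed definition, a single wrong character makes the whole-name match fail while costing only one position in the character-level score, which is consistent with whole-name accuracy being the lower of the two reported figures.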


