ChemBERTa-3: An Open Source Training Framework for Chemical Foundation Models

04 August 2025, Version 2
This content is an early or alternative research output and has not been peer-reviewed by Cambridge University Press at the time of posting.

Abstract

The rapid advancement of machine learning in computational chemistry has opened new doors for designing molecules, predicting molecular properties, and discovering novel materials. However, building scalable and robust models for molecular property prediction remains a significant challenge due to the vast size and complexity of chemical space. In this paper, we introduce ChemBERTa-3, an open-source framework for training and fine-tuning large-scale chemical foundation models. We explore the potential of multiple model architectures by evaluating their performance across various molecular datasets from the MoleculeNet suite. Our experiments demonstrate that pre-training on the expansive ZINC20 dataset yields models that perform well on both classification and regression tasks, providing valuable insights for drug discovery and materials science. For scalability, we leveraged both AWS-based Ray deployments and on-premises high-performance computing clusters to supply the processing power required to train on billions of molecules. In support of reproducible and extensible science, we have open-sourced all ChemBERTa-3 models.

Keywords

Molecular Property Prediction
Chemical Foundation Models
Open Source Models
Reproducible Computational Chemistry
DeepChem
ChemBERTa
