Abstract
The rapid advancement of machine learning in computational chemistry has opened new doors for designing molecules, predicting molecular properties, and discovering novel materials. However, building scalable and robust models for molecular property prediction remains a significant challenge due to the vast size and complexity of chemical space. In this paper, we introduce ChemBERTa-3, an open-source framework for training and fine-tuning large-scale chemical foundation models. We explore the potential of multiple model architectures by evaluating their performance across molecular datasets from the MoleculeNet suite. Our experiments demonstrate that pre-training on the expansive ZINC20 dataset yields models that perform well on both classification and regression tasks, providing valuable insights into drug discovery and materials science. For scalability, we leverage both AWS-based Ray deployments and on-premise high-performance computing clusters to provide the processing power required to train on billions of molecules. In support of reproducible and extensible science, we have open-sourced all ChemBERTa-3 models.
Supplementary weblinks

ChemBERTa-3 Repo
This repository contains the source code, configuration files, and training pipelines for ChemBERTa-3. It supports pretraining and fine-tuning of transformer-based models on large-scale chemical datasets. The framework is built for scalability using distributed computing (Ray, PyTorch DDP) and includes utilities for dataset preprocessing, model evaluation on standard benchmarks, and experiment tracking.
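To illustrate the kind of fine-tuning workflow the repository supports, the following is a minimal sketch that fine-tunes a ChemBERTa-style checkpoint on a toy regression task using the Hugging Face transformers API. The checkpoint name (a public ChemBERTa v1 model) and the SMILES strings and labels are placeholders standing in for an actual ChemBERTa-3 release and a MoleculeNet dataset; this is not the official ChemBERTa-3 training pipeline.

```python
# Minimal sketch: fine-tuning a ChemBERTa-style model for molecular property
# regression. Checkpoint and data are placeholders, not ChemBERTa-3 artifacts.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Placeholder checkpoint: a public ChemBERTa v1 model; swap in a ChemBERTa-3
# checkpoint once released.
checkpoint = "seyonec/ChemBERTa-zinc-base-v1"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=1, problem_type="regression"
)

# Toy SMILES strings and property values standing in for a MoleculeNet task.
smiles = ["CCO", "c1ccccc1", "CC(=O)O"]
labels = torch.tensor([-0.24, 1.69, -0.17])

# Tokenize SMILES into padded input IDs and attention masks.
enc = tokenizer(smiles, padding=True, truncation=True, return_tensors="pt")

class ToyDataset(torch.utils.data.Dataset):
    """Wraps the tokenized SMILES and labels for the Trainer."""
    def __len__(self):
        return len(smiles)
    def __getitem__(self, i):
        item = {k: v[i] for k, v in enc.items()}
        item["labels"] = labels[i]
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="chemberta_finetune",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=ToyDataset(),
)
trainer.train()
```

With `problem_type="regression"` and `num_labels=1`, the model applies a mean-squared-error loss over a single regression head; for MoleculeNet classification tasks one would instead set `num_labels` to the number of classes and drop the `problem_type` argument.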