MEHC-Curation: A Python Framework for High-Quality Molecular Dataset Curation

19 November 2025, Version 1
This content is an early or alternative research output and has not been peer-reviewed by Cambridge University Press at the time of posting.

Abstract

High-quality molecular datasets are vital for reliable Quantitative Structure-Activity Relationship (QSAR) modeling and drug discovery. However, many molecular databases contain errors, such as invalid structures and duplicate entries, that compromise model performance and reproducibility. Current curation tools require substantial domain expertise and involve complex procedures, which creates a barrier for newcomers and non-experts. To address this, we developed MEHC-curation, a user-friendly Python framework that simplifies molecular dataset curation. The tool lets users curate SMILES strings in a single straightforward operation rather than through an intricate multi-step process. Built on established protocols, it employs a three-stage pipeline (validation, cleaning, normalization) with integrated duplicate removal and error tracking. The framework was validated on fifteen diverse benchmark datasets covering both classification and regression tasks. The results showed that proper curation significantly improves dataset composition and model performance across various machine learning algorithms. In addition, performance analysis revealed high computational efficiency, supported by parallel processing. MEHC-curation integrates easily into drug discovery and QSAR workflows and delivers high-quality results without requiring specialized expertise.
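The three-stage flow described in the abstract (validation, cleaning, normalization, then duplicate removal with error tracking) can be sketched roughly as follows. This is a minimal illustration of the pipeline shape only, not the MEHC-curation API: every name here (`CurationReport`, `validate`, `clean`, `normalize`, `curate`) is hypothetical, and the three stage functions are toy string-level stand-ins. A real implementation would presumably parse and canonicalize structures with a cheminformatics toolkit rather than inspect raw strings.

```python
from dataclasses import dataclass, field


@dataclass
class CurationReport:
    """Hypothetical container for results and tracked errors."""
    kept: list = field(default_factory=list)      # curated SMILES strings
    rejected: list = field(default_factory=list)  # (original, stage, reason)


def validate(smiles):
    """Toy validity check: non-empty with balanced brackets.

    Assumption: stands in for real SMILES parsing, which this is NOT.
    """
    if not smiles:
        raise ValueError("empty string")
    for open_ch, close_ch in ("()", "[]"):
        depth = 0
        for ch in smiles:
            depth += (ch == open_ch) - (ch == close_ch)
            if depth < 0:
                raise ValueError(f"unbalanced {close_ch}")
        if depth != 0:
            raise ValueError(f"unbalanced {open_ch}")
    return smiles


def clean(smiles):
    """Toy cleaning: keep the longest dot-separated fragment.

    Assumption: string length is a crude proxy for fragment size.
    """
    return max(smiles.split("."), key=len)


def normalize(smiles):
    """Toy normalization: strip whitespace (stand-in for canonicalization)."""
    return smiles.strip()


# Stage order follows the abstract: validation, cleaning, normalization.
STAGES = (("validation", validate), ("cleaning", clean), ("normalization", normalize))


def curate(smiles_list):
    """Run every record through all stages, then drop duplicates."""
    report = CurationReport()
    seen = set()
    for original in smiles_list:
        current = original
        try:
            for stage_name, stage in STAGES:
                current = stage(current)
        except ValueError as err:
            # Error tracking: remember which stage rejected the record and why.
            report.rejected.append((original, stage_name, str(err)))
            continue
        if current in seen:
            report.rejected.append((original, "deduplication", "duplicate"))
            continue
        seen.add(current)
        report.kept.append(current)
    return report


if __name__ == "__main__":
    result = curate(["CCO", "CC(=O)[O-].[Na+]", "C(C", " CCO"])
    print(result.kept)
    print(result.rejected)
```

Recording the rejecting stage alongside each discarded record is one simple way to make curation auditable; comparing normalized strings for duplicate detection only works in practice when normalization yields a canonical form, which the toy `normalize` above does not guarantee.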

Keywords

cheminformatics
QSAR modeling
molecular dataset curation
SMILES validation and normalization
duplicate detection

Supplementary materials

Title: Data distribution of all datasets before and after curation
Description: Distributions of six representative molecular descriptors across all datasets before and after curation: (A) NumHAcceptors, (B) NumHDonors, (C) QED, (D) MolWt, (E) TPSA, (F) MolLogP.
