Abstract
High-quality molecular datasets are vital for reliable Quantitative Structure-Activity Relationship (QSAR) modeling and drug discovery. However, many molecular databases contain errors, such as invalid structures and duplicate entries, that compromise model performance and reproducibility. Current curation tools require substantial domain expertise and involve complex procedures, which creates a barrier for newcomers and non-experts. To address this, we developed MEHC-curation, a user-friendly Python framework that simplifies molecular dataset curation. The tool lets users curate SMILES strings with minimal effort, turning an intricate process into a straightforward operation. Built on established protocols, it employs a three-stage pipeline (validation, cleaning, normalization) with integrated duplicate removal and error tracking. We validated the framework on fifteen diverse benchmark datasets covering both classification and regression tasks. The results show that proper curation significantly improves dataset composition and model performance across various machine learning algorithms. In addition, performance analysis shows high computational efficiency, supported by parallel processing. MEHC-curation integrates easily into drug discovery and QSAR workflows and delivers high-quality results without requiring specialized expertise.
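To make the pipeline structure concrete, the following is a minimal, standard-library-only sketch of a three-stage curation flow (validation, cleaning, normalization) with duplicate removal and error tracking. It is not the MEHC-curation API: the names `validate`, `clean`, `normalize`, `curate`, and `CurationReport` are hypothetical, and the per-stage checks are deliberately toy string operations standing in for real cheminformatics routines.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional

# A stage maps a SMILES string to a (possibly rewritten) string,
# or returns None to reject the record.
Stage = Callable[[str], Optional[str]]


def validate(smiles: str) -> Optional[str]:
    """Toy validity check (assumption): non-empty with balanced parentheses.
    A real tool would parse the SMILES with a cheminformatics library."""
    s = smiles.strip()
    if not s:
        return None
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return None
    return s if depth == 0 else None


def clean(smiles: str) -> Optional[str]:
    """Toy cleaning (assumption): keep the longest dot-separated fragment,
    a crude stand-in for salt or counter-ion removal."""
    return max(smiles.split("."), key=len)


def normalize(smiles: str) -> Optional[str]:
    """Toy normalization (assumption): rewrite one nitro-group notation
    to its charge-separated form by plain string replacement."""
    return smiles.replace("N(=O)=O", "[N+](=O)[O-]")


@dataclass
class CurationReport:
    kept: list[str] = field(default_factory=list)
    # Each rejection is (input index, original string, reason).
    rejected: list[tuple[int, str, str]] = field(default_factory=list)


def curate(smiles_list: list[str]) -> CurationReport:
    """Run every record through the three stages, then drop duplicates."""
    stages: list[tuple[str, Stage]] = [
        ("validation", validate),
        ("cleaning", clean),
        ("normalization", normalize),
    ]
    report = CurationReport()
    seen: set[str] = set()
    for index, raw in enumerate(smiles_list):
        current: Optional[str] = raw
        for name, stage in stages:
            current = stage(current)
            if current is None:
                report.rejected.append((index, raw, f"failed {name}"))
                break
        else:
            # Duplicates are detected on the fully processed string.
            if current in seen:
                report.rejected.append((index, raw, "duplicate"))
            else:
                seen.add(current)
                report.kept.append(current)
    return report


if __name__ == "__main__":
    data = ["CCO", "CCO.Cl", "C(C", "", "CN(=O)=O", "C[N+](=O)[O-]"]
    result = curate(data)
    print(result.kept)
    for entry in result.rejected:
        print(entry)
```

Running duplicate detection only after normalization is the design point worth noting: two inputs that differ as raw strings can collapse to the same record once cleaned and normalized, so deduplicating earlier would miss them.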
Supplementary materials
Title
Data distribution of all datasets before and after curation
Description
Data distribution of all datasets before and after curation with six representative molecular descriptors: (A) NumHAcceptors, (B) NumHDonors, (C) QED, (D) MolWt, (E) TPSA, (F) MolLogP.