Objective: To evaluate the ability of large language models (LLMs) with targeted feedback to classify medications as antimicrobial or non-antimicrobial, and the implications for antimicrobial stewardship.
Design: Cross-sectional evaluation using a two-phase process: initial unguided classification and feedback-informed reclassification.
Setting: Medication-level analysis of health system prescribing data.
Participants: A data set of 7,239 unique medication entries from health systems in the Collaboration to Harmonize Antimicrobial Registry Measures (CHARM) project.
Methods: Four LLMs (ChatGPT-3.5, Copilot GPT-4o, Claude Sonnet 4, and Gemini 2.5 Flash) classified all entries against a manual reference standard. Models then received feedback on 20% of misclassified cases for reclassification. Metrics included accuracy, macro F1-score (95% confidence intervals via bootstrap resampling), positive predictive value, negative predictive value, processing time, and error reduction rate (ERR). McNemar's test assessed accuracy changes between phases and differences among models.
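The core metrics above can be sketched in Python as follows. This is a minimal illustration, not the authors' reported implementation: the function names are hypothetical, and the specific variants chosen (percentile bootstrap for the macro F1 confidence interval, exact binomial form of McNemar's test on paired per-entry correctness) are assumptions consistent with standard practice.

```python
import random
from math import comb

def macro_f1(y_true, y_pred, labels=("antimicrobial", "non-antimicrobial")):
    # Macro F1: unweighted mean of per-class F1 scores.
    f1s = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def bootstrap_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    # Percentile bootstrap CI: resample entries with replacement,
    # recompute macro F1, and take the alpha/2 and 1 - alpha/2 quantiles.
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(macro_f1([y_true[i] for i in idx],
                              [y_pred[i] for i in idx]))
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

def mcnemar_exact(correct_a, correct_b):
    # Exact (binomial) McNemar test on paired correctness indicators,
    # e.g. phase 1 vs phase 2 for the same entries.
    b = sum(a and not bb for a, bb in zip(correct_a, correct_b))  # A right, B wrong
    c = sum(bb and not a for a, bb in zip(correct_a, correct_b))  # B right, A wrong
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def error_reduction_rate(correct_before, correct_after):
    # ERR: fraction of baseline errors eliminated after feedback.
    e0 = sum(not c for c in correct_before)
    e1 = sum(not c for c in correct_after)
    return (e0 - e1) / e0 if e0 else 0.0
```

Only the discordant pairs (entries one phase got right and the other wrong) contribute to McNemar's test, which is why it is the appropriate paired comparison here rather than two independent accuracy estimates.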
Results: Baseline accuracy varied among LLMs. Postfeedback, all accuracies improved significantly (P < .001): Gemini 2.5 Flash (99.6%), Claude Sonnet 4 (99.4%), ChatGPT-3.5 (81.0%), and Copilot (79.7%). Gemini achieved the highest macro F1-score (98.9%; 95% CI, 98.4–99.3) and ERR (69.2%). Processing times were fastest for Copilot (42 s), followed by ChatGPT-3.5 (47 s), Gemini (1,080 s), and Claude (3,475 s); manual classification of the full data set was estimated to require 18 hours without an LLM. Misclassification was most common among antiseptics, antiparasitics, and drugs containing antimicrobial components used for noninfectious indications (eg, sulfasalazine).
Conclusions: Top-performing LLMs achieved accuracy levels suitable for automating initial antimicrobial classification in stewardship workflows. Performance variability underscores the need for careful model selection and continued human oversight in clinical applications.
Summary: Four LLMs were evaluated for antimicrobial classification using 7,239 medications. Claude Sonnet 4 and Gemini 2.5 Flash achieved >99% accuracy, while ChatGPT-3.5 and Copilot showed substantial limitations. Top performers could automate stewardship workflows with appropriate oversight.