Hidden failure modes of large language models in healthcare-associated infection surveillance: a structured evaluation using NHSN definitions

Published online by Cambridge University Press: 06 April 2026

Mamdooh Alzyood*
Affiliation:
Faculty of Health Science, and Technology, School of Psychology, Social Work and Public Health, Oxford Brookes University Faculty of Health and Life Sciences, Oxford, UK
Alfred Veldhuis
Affiliation:
Faculty of Health Science, and Technology, School of Psychology, Social Work and Public Health, Oxford Brookes University Faculty of Health and Life Sciences, Oxford, UK
Hayley Stevenson
Affiliation:
Warwickshire College Group/ WCG: Royal Leamington Spa College, UK
Samina Sheikh
Affiliation:
London Borough of Sutton, UK
*Corresponding author: Mamdooh Alzyood; Email: malzyood@brookes.ac.uk

Abstract

Background:

Large language models (LLMs) are increasingly explored for healthcare-associated infection (HAI) surveillance, but their reliability in applying formal National Healthcare Safety Network (NHSN) definitions is not well characterized. This study evaluates GPT-5.1 Thinking’s accuracy and rationales in classifying NHSN-defined infections.

Methods:

Seventy synthesized case vignettes containing complete, organized clinical data representing five NHSN infection types, including complex edge cases, were assessed using 2025 NHSN surveillance definitions. GPT-5.1 Thinking classified cases under three prompting strategies: standard, structured, and constrained. Quantitative accuracy metrics and qualitative inductive content analysis of rationales and failure modes were performed.
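The quantitative step described above amounts to comparing each model output with its gold-standard label and tallying accuracy by prompting strategy or by infection type. The following is a minimal sketch of that tally, not the authors' code: the record fields (`strategy`, `hai`, `gold`, `predicted`), the label strings, and the four toy records are all hypothetical and are not study data.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: fraction correct}, grouping records by the field `key`.

    Each record is a dict with 'gold' and 'predicted' labels plus grouping
    fields. All field names here are illustrative assumptions.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        correct[group] += int(record["predicted"] == record["gold"])
    return {group: correct[group] / totals[group] for group in totals}


# Toy records invented for illustration only (NOT taken from the study).
records = [
    {"strategy": "standard", "hai": "CLABSI", "gold": "meets", "predicted": "meets"},
    {"strategy": "standard", "hai": "CLABSI", "gold": "does_not_meet", "predicted": "meets"},
    {"strategy": "constrained", "hai": "CLABSI", "gold": "does_not_meet", "predicted": "does_not_meet"},
    {"strategy": "constrained", "hai": "SSI", "gold": "meets", "predicted": "meets"},
]

by_strategy = accuracy_by_group(records, "strategy")
by_hai = accuracy_by_group(records, "hai")
print(by_strategy)
print(by_hai)
```

Grouping by a field name keeps one function usable for both the per-strategy and per-infection-type summaries; the misclassified records would then be the natural input to the qualitative content analysis.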

Results:

Overall accuracy across 210 classifications improved from 78.6% (standard prompt) to 88.6% (structured) and 95.7% (constrained). Performance was highest for infections with clear anatomical or radiographic criteria (surgical site infections [SSI], ventilator-associated pneumonia [VAP]) and lowest for infections involving complex exclusion rules (central line-associated bloodstream infection [CLABSI], Clostridioides difficile infection [CDI]). Constrained prompting enhanced adherence to NHSN rules but did not eliminate errors in hierarchical exclusions. Content analysis identified three recurrent failure categories: prioritization of clinical plausibility over surveillance logic, failure to apply quantitative and temporal thresholds, and errors in hierarchical source attribution.
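The reported percentages can be checked against the stated design. Assuming each prompting strategy was applied to all 70 vignettes (70 × 3 = 210 classifications), the sketch below searches for the correct-classification counts that round to each reported accuracy. The counts it finds are inferred from the percentages rather than reported in the text above, and the helper name `counts_matching` is hypothetical.

```python
def counts_matching(pct, n):
    """Return every count k in 0..n whose accuracy rounds to pct (one decimal)."""
    return [k for k in range(n + 1) if round(100 * k / n, 1) == pct]


N_VIGNETTES = 70  # vignettes per prompting strategy, as stated in Methods

# Accuracies as reported in Results, in percent.
reported = {"standard": 78.6, "structured": 88.6, "constrained": 95.7}

# Inferred counts: an assumption derived from the percentages, not source data.
implied = {name: counts_matching(pct, N_VIGNETTES) for name, pct in reported.items()}
for name, counts in implied.items():
    print(name, counts)

# Pooled accuracy across all 210 classifications, under the same assumption.
pooled_correct = sum(counts[0] for counts in implied.values())
pooled_pct = round(100 * pooled_correct / (3 * N_VIGNETTES), 1)
print(pooled_correct, pooled_pct)
```

Because one vignette is worth about 1.4 percentage points at n = 70, each reported figure maps to a single count (55, 62, and 67 correct), so the move from standard to constrained prompting corresponds to 12 additional correct classifications under this assumption.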

Conclusion:

GPT-5.1 Thinking shows potential to support infection surveillance under strict constraints but exhibits systematic limitations, including overreliance on clinical intuition and difficulty with complex exclusion pathways. Currently, LLMs are unsuitable for autonomous NHSN classification but may serve as supervised decision-support tools with robust human oversight. Further development is needed to improve LLMs’ ability to integrate surveillance definitions with the complex situational characteristics critical to effective HAI surveillance, and fully autonomous deployment would require further validation. These findings are based on synthetic data, which may differ from real-world clinical data in ways likely to overestimate the accuracy of these tools.

Information

Type
Original Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press on behalf of The Society for Healthcare Epidemiology of America
Figure 1. Stepwise evaluation workflow showing case input, prompting strategies, and classification output. Each vignette was evaluated under standard, structured, and constrained prompts. Outputs were compared with gold-standard NHSN 2025 classifications to generate quantitative accuracy metrics. Misclassified vignettes were subject to inductive content analysis. HAI, healthcare-associated infection; NHSN, National Healthcare Safety Network; CLABSI, central line-associated bloodstream infection; CAUTI, catheter-associated urinary tract infection; CDI, Clostridioides difficile infection; SSI, surgical site infection; VAP, ventilator-associated pneumonia.

Table 1. Descriptive accuracy across HAI categories

Supplementary material

Alzyood et al. supplementary material 1 (File, 21.7 KB)
Alzyood et al. supplementary material 2 (File, 80.9 KB)