Performance of an artificial intelligence model for evaluation of unnecessary central lines, Northern California 2025

Jenna M. Wick; Apoorva Bhaskara; Wajeeha Tariq; Sean Lau; Mindy M. Sampson; Jorge L. Salinas

doi:10.1017/ice.2026.10462

Published online by Cambridge University Press: 27 April 2026

Jenna M. Wick*: Stanford University School of Medicine, USA
Apoorva Bhaskara: Stanford University School of Medicine, USA
Wajeeha Tariq: Stanford University School of Medicine, USA
Sean Lau: Stanford Health Care, USA
Mindy M. Sampson: Stanford University School of Medicine, USA
Jorge L. Salinas: Stanford University School of Medicine, USA
*Corresponding author: Jenna M. Wick; Email: jwick@stanford.edu

Abstract

We used a large language model integrated in the electronic health record to evaluate unnecessary central lines. It had a 16% sensitivity and 99% specificity for detecting unnecessary lines. Although it missed many unnecessary lines, the high specificity suggests potential as a tool where human review is not feasible.

Information

Type: Concise Communication
Information: Infection Control & Hospital Epidemiology, First View, pp. 1–3

DOI: https://doi.org/10.1017/ice.2026.10462
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright: © The Author(s), 2026. Published by Cambridge University Press on behalf of The Society for Healthcare Epidemiology of America

Background

Central lines are frequently used in inpatient care, often for extended durations. However, they carry the risk of central line-associated bloodstream infection (CLABSI). CLABSIs extend hospital stays, increase mortality,^{1} and can increase the cost of admission by 25,000–55,000 US dollars.^{2} Removal of unnecessary central lines (those not essential for current medical care) is a key component of CLABSI prevention.^{3} One method to achieve this is prospective audit and feedback, which is effective but resource intensive and rarely systematically evaluated.^{4}

Large language models (LLMs) are systems capable of rapidly analyzing substantial amounts of data to perform high-level tasks. LLMs have the potential for high-volume healthcare data extraction and report generation.^{5} These characteristics make LLMs a promising tool in the field of infection prevention. However, LLMs can present misinformation (e.g., “hallucinations”) and struggle with the nuances of clinical medicine.^{6} To date, there has been little evaluation of LLMs with real patient scenarios in infection prevention. We used a secure LLM to evaluate unnecessary central lines in a large academic medical center.

Methods

Study setting and design: Stanford Health Care is a 700-bed academic medical center in Northern California. Recently, a secure LLM tool using gpt-4.1-mini (ChatEHR) was integrated within the electronic health record (EHR). We conducted a prospective study from October to November 2025.

Reference standard (expert review): The vascular access team conducts daily active central line necessity assessments within three general medicine units. These assessments are performed by manual EHR chart review using criteria based on published literature (Supplement 1). The vascular access team performs audit and feedback by discussing removal of identified unnecessary lines with providers. For units without routine assessments, expert reviewers applied the same criteria by manual EHR chart review, but audit and feedback was not performed in these cases.

Index test (ChatEHR): Using a standardized prompt (Supplement 2), ChatEHR was asked to determine whether each central line was unnecessary on that calendar day based on predefined criteria. For each case, the model received EHR data from the preceding seven days, including clinical notes, laboratory results, vital signs, medications, and orders. The model was run with a temperature of 0.1 and deployed within Stanford Health Care’s secure Microsoft Azure environment. ChatEHR did not perform dynamic retrieval of selected records at inference time. Rather, available patient data within the specified time window were passed directly to the LLM; when those data exceeded the allowable context length, they were partitioned into smaller chunks and analyzed separately. The same prompt, sources, and input data window were applied to all cases.
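The partitioning step described above can be illustrated with a minimal sketch. This is not the ChatEHR implementation: the function name `partition_records`, the character-based budget `max_chars`, and the greedy packing strategy are all assumptions made for illustration, and the article does not state how per-chunk outputs were combined.

```python
from typing import List


def partition_records(records: List[str], max_chars: int) -> List[List[str]]:
    """Greedily pack records, in order, into chunks of at most max_chars characters.

    Assumption of this sketch: a single record longer than max_chars is kept
    whole in its own chunk rather than being split.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks: List[List[str]] = []
    current: List[str] = []
    current_len = 0
    for record in records:
        # Close the current chunk if adding this record would exceed the budget.
        if current and current_len + len(record) > max_chars:
            chunks.append(current)
            current = []
            current_len = 0
        current.append(record)
        current_len += len(record)
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    # Hypothetical records standing in for notes, labs, and orders.
    demo = ["aaaa", "bbbb", "cc", "dddddddddd"]
    print(partition_records(demo, max_chars=8))
```

A real deployment would more likely budget by model tokens than by characters; characters are used here only to keep the sketch dependency-free.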

Analysis: We compared ChatEHR classification of unnecessary versus necessary to the expert review on a per-line per-day basis. We calculated sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) with 95% confidence intervals.
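As an illustration of this analysis, the sketch below computes the four metrics from a 2×2 table. The article does not state which confidence interval method was used; the Wilson score interval here is an assumption, as are the function names and the example counts, which are hypothetical and not study data.

```python
import math
from typing import Dict, Tuple


def wilson_interval(successes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def diagnostic_metrics(tp: int, fn: int, fp: int, tn: int) -> Dict[str, Tuple[float, Tuple[float, float]]]:
    """Return point estimate and Wilson interval for each metric.

    Here "positive" means the line was classified as unnecessary.
    """
    pairs = {
        "sensitivity": (tp, tp + fn),  # unnecessary lines correctly flagged
        "specificity": (tn, tn + fp),  # necessary lines correctly left alone
        "ppv": (tp, tp + fp),          # flagged lines that were truly unnecessary
        "npv": (tn, tn + fn),          # unflagged lines that were truly necessary
    }
    return {name: (k / n, wilson_interval(k, n)) for name, (k, n) in pairs.items()}


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    for name, (estimate, (low, high)) in diagnostic_metrics(tp=40, fn=10, fp=5, tn=45).items():
        print(f"{name}: {estimate:.3f} (95% CI {low:.3f} to {high:.3f})")
```

The Wilson interval is chosen in this sketch because it behaves reasonably for proportions near 0 or 1 and for small denominators; an exact (Clopper-Pearson) interval would be an equally plausible choice.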

Results

We evaluated 271 central lines (111 peripherally inserted central catheters [PICCs], 82 implanted ports, 65 tunneled central lines, 12 non-tunneled central lines, and 1 pulmonary artery catheter). The reference standard classified 31 (11%) as unnecessary and 240 (89%) as necessary. ChatEHR identified 5 of 31 unnecessary lines (sensitivity 16.1%) and correctly classified 237 of 240 necessary lines (specificity 98.8%). The PPV was 62.5% and the NPV was 90.1% (Table 1). Discordant classifications occurred in 29 cases: ChatEHR classified 26 unnecessary lines as necessary (false negatives) and 3 necessary lines as unnecessary (false positives). Among discordant cases, 76% (n = 22) were PICCs, 10% (n = 3) were non-tunneled central lines, 7% (n = 2) were tunneled central lines, and 7% (n = 2) were implanted ports. Reasons for discordance were not mutually exclusive, because ChatEHR frequently cited multiple reasons in the same response. They clustered into four categories: (1) a discontinued infusate still cited as active (22 cases), (2) difficult IV access cited when criteria were not met (9 cases), (3) statements inconsistent with chart data (8 cases), and (4) clinical complexity requiring contextual judgment (5 cases) (Table 2; detailed examples in Supplement 3).
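The reported point estimates follow directly from the counts in this paragraph. Taking "unnecessary" as the positive class, so that TP = 5, FN = 26, FP = 3, and TN = 237, the arithmetic can be written out as:

```latex
\begin{align*}
\text{Sensitivity} &= \frac{TP}{TP + FN} = \frac{5}{5 + 26} = \frac{5}{31} \approx 16.1\% \\
\text{Specificity} &= \frac{TN}{TN + FP} = \frac{237}{237 + 3} = \frac{237}{240} \approx 98.8\% \\
\text{PPV} &= \frac{TP}{TP + FP} = \frac{5}{5 + 3} = \frac{5}{8} = 62.5\% \\
\text{NPV} &= \frac{TN}{TN + FN} = \frac{237}{237 + 26} = \frac{237}{263} \approx 90.1\%
\end{align*}
```

The four cells sum to 5 + 26 + 3 + 237 = 271, matching the number of lines evaluated.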

Table 1. Diagnostic performance of ChatEHR for identifying unnecessary central lines

Table 2. Discordant cases

Discussion

To our knowledge, this is the first study using an LLM integrated within the EHR to assess unnecessary central lines. Our evaluation used actual patient data in real time. We found that the LLM quickly reviewed large volumes of clinical information and achieved high specificity for unnecessary central lines. ChatEHR could be deployed as a surveillance method to detect unnecessary central lines in settings where human review is not feasible.

ChatEHR had a specificity of 99% for detecting unnecessary central lines and thus rarely misclassified a necessary line as unnecessary. However, its sensitivity was 16%, so it missed most unnecessary central lines. As in other evaluations of LLMs in infection prevention (detection of catheter-associated urinary tract infection [CAUTI] and CLABSI, and assessment of blood culture appropriateness), there was an imbalance between sensitivity and specificity.^{7–9} This was a limitation of ChatEHR and may also apply to other general-purpose LLMs in infection prevention. The most frequent reason for discordance was ChatEHR’s failure to use the current clinical scenario, and the second most frequent was failure to apply the detailed criteria in the prompt. Sensitivity may improve with additional model training to reflect the most recent data and to adhere strictly to the requirements in the prompt. Incorrect information about how a medication was administered was another flaw; having the LLM repeat the task to check itself might improve accuracy. With enough training, performance in nuanced clinical cases may also improve, although this is a daunting feat given the complexity of medical practice.
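One possible reading of "repeat the task to check itself" is to run the same classification several times and keep the majority answer. The sketch below illustrates that idea only; it was not evaluated in this study, and the `classify` callable, the label strings, and the `needs_review` fallback for ties are hypothetical.

```python
from collections import Counter
from typing import Callable


def majority_label(classify: Callable[[str], str], case_text: str, runs: int = 3) -> str:
    """Call a classifier `runs` times on the same input and return the most common label.

    If the top two labels are tied, return "needs_review" so the case is
    routed to a human instead of being decided arbitrarily.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    labels = [classify(case_text) for _ in range(runs)]
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "needs_review"
    return ranked[0][0]


if __name__ == "__main__":
    # Stub standing in for an LLM call; it returns canned answers in sequence.
    canned = iter(["necessary", "unnecessary", "unnecessary"])
    print(majority_label(lambda _text: next(canned), "hypothetical case summary", runs=3))
```

Repeated runs can only help when the model's answers vary between runs; at a low temperature such as the 0.1 used here, the runs may largely agree, so any gain from this approach is uncertain.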

Our study has several limitations. It was conducted at a single center using ChatEHR, which may differ from LLMs available at other institutions. Different prompt parameters will yield variable results, and although we considered one week of data appropriate for the current clinical situation, results may differ if a different time frame is used. Additionally, there is a subjective component to the determination of unnecessary central lines, and our criteria may vary from those used at other institutions. Despite these limitations, our study adds to the evolving understanding of the rapidly progressing field of AI in medicine.

Ideally, clinical teams would review central line necessity daily; however, changing clinical scenarios and competing priorities often necessitate a secondary auditing process to support device-necessity review. Secondary audits are challenging for healthcare systems because the available options are often resource intensive, and manual reviews may not be realistic for individual units or for institutions as a whole. Although LLMs cannot identify all unnecessary central lines, they could increase detection in areas where manual review is not possible. LLMs can review many more patients than is feasible with the current practice of manual review and may become a valuable tool for device review that ultimately aids in reducing healthcare-associated infections.

Supplementary material

The supplementary material for this article can be found at https://doi.org/10.1017/ice.2026.10462.

Acknowledgements

We thank the Stanford University Center for Digital Health for funding this study. The authors report no conflicts of interest related to this work.

References

1. Elangovan, S, Lo, JJ, Xie, Y, Mitchell, B, Graves, N, Cai, Y. Impact of central-line-associated bloodstream infections and catheter-related bloodstream infections: a systematic review and meta-analysis. J Hosp Infect 2024;152:126–137. doi: 10.1016/j.jhin.2024.08.002.

2. Yu, KC, Jung, M, Ai, C. Characteristics, costs, and outcomes associated with central-line-associated bloodstream infection and hospital-onset bacteremia and fungemia in US hospitals. Infect Control Hosp Epidemiol 2023;44:1920–1926. doi: 10.1017/ice.2023.132.

3. Buetti, N, Marschall, J, Drees, M, et al. Strategies to prevent central line-associated bloodstream infections in acute-care hospitals: 2022 update. Infect Control Hosp Epidemiol 2022;43:553–569. doi: 10.1017/ice.2022.87.

4. Beville, ASM, Heipel, D, Vanhoozer, G, Bailey, P. Reducing central line associated bloodstream infections (CLABSIs) by reducing central line days. Curr Infect Dis Rep 2021;23:23. doi: 10.1007/s11908-021-00767-w.

5. Fahim, YA, Hasani, IW, Kabba, S, Ragab, WM. Artificial intelligence in healthcare and medicine: clinical applications, therapeutic advances, and future perspectives. Eur J Med Res 2025;30:848. doi: 10.1186/s40001-025-03196-w.

6. Bedi, S, Liu, Y, Orr-Ewing, L, et al. Testing and evaluation of health care applications of large language models: a systematic review. JAMA 2025;333:319–328. doi: 10.1001/jama.2024.21700.

7. Rodriguez-Nava, G, Egoryan, G, Goodman, KE, Morgan, DJ, Salinas, JL. Performance of a large language model for identifying central line-associated bloodstream infections (CLABSI) using real clinical notes. Infect Control Hosp Epidemiol 2024;46:1–4. doi: 10.1017/ice.2024.164.

8. Alshanqeeti, S, Coffey, K, Mcdermott, K, et al. Comparing generative artificial intelligence vs. experts for detection of catheter-associated urinary tract infection (CAUTI). Clin Infect Dis 2025;9:ciaf486. doi: 10.1093/cid/ciaf486.

9. Rodriguez-Nava, G, Keyes, T, Ambers, N, et al. Using secure artificial intelligence agents integrated within the electronic medical record for the evaluation of blood culture appropriateness-Northern California, 2025. Infect Control Hosp Epidemiol 2025;11:1–4. doi: 10.1017/ice.2025.10349.

Wick et al. supplementary material

DOI: https://doi.org/10.1017/ice.2026.10462.sm001
