
A head-to-head comparison of the accuracy of commercially available large language models for infection prevention and control inquiries, 2024

Published online by Cambridge University Press: 12 December 2024

Oluchi J. Abosi*, University of Iowa Health Care, Iowa City, IA, USA
Takaaki Kobayashi, University of Iowa Health Care, Iowa City, IA, USA
Natalie Ross, University of Iowa Health Care, Iowa City, IA, USA
Alexandra Trannel, University of Iowa Health Care, Iowa City, IA, USA
Guillermo Rodriguez Nava, Stanford University, Stanford, CA, USA
Jorge L. Salinas, Stanford University, Stanford, CA, USA
Karen Brust, University of Iowa Health Care, Iowa City, IA, USA

*Corresponding author: Oluchi J. Abosi; Email: oluchi-abosi@uiowa.edu

Abstract

We investigated the accuracy and completeness of four large language model (LLM) artificial intelligence tools. Most LLMs provided acceptable answers to commonly asked infection prevention questions (accuracy 98.9%, completeness 94.6%). The use of LLMs to supplement infection prevention consults should be further explored.

Type: Concise Communication
License: This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright © The Author(s), 2024. Published by Cambridge University Press on behalf of The Society for Healthcare Epidemiology of America

Table 1. Percentage of overall acceptable accuracy and completeness scores across large language models' responses to infection prevention questions, without and with CDC statementsᵃ,ᵇ


Figure 1. Heatmap of acceptable accuracy and completeness score percentages by category across large language models in response to infection prevention questions, without and with CDC statementsᵃ,ᵇ. LLM, large language model.
ᵃ The accuracy scale was a 5-point Likert scale (1, completely incorrect; 2, more incorrect than correct; 3, more correct than incorrect but missing some major elements; 4, more correct than incorrect but missing some minor elements; 5, completely correct). The completeness scale was a 6-point Likert scale (1, addresses no aspect of the question, and the answer is not within the topic queried; 2, addresses no aspect of the question, and the answer is within the topic queried; 3, addresses some aspects of the question, but significant parts are missing or incomplete; 4, addresses most aspects of the question but is missing small details; 5, addresses all aspects of the question without additional information; 6, addresses all aspects of the question and provides additional information beyond what was expected).
ᵇ Responses with scores ≥3 were deemed accurate. Responses with scores ≥4 were deemed complete.
ᶜ Without CDC statements: the AI tool search was not limited to CDC-only references. With CDC statements: the prompt limited the AI tool search to CDC-only references.
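The cutoffs in the figure notes (accuracy ≥3 on the 5-point scale, completeness ≥4 on the 6-point scale) turn each Likert rating into an acceptable/not acceptable call, and the reported percentages are then shares of acceptable responses. The sketch below illustrates that arithmetic under the assumption that each response receives one integer score per scale and that the percentage is simply acceptable responses divided by all scored responses. The function name `percent_acceptable` and the ratings are hypothetical; this is not the authors' analysis code and the numbers are not study data.

```python
from typing import Sequence

# Cutoffs from the figure notes: accuracy >= 3 (5-point scale) and
# completeness >= 4 (6-point scale) count as acceptable.
ACCURACY_CUTOFF = 3
COMPLETENESS_CUTOFF = 4


def percent_acceptable(scores: Sequence[int], cutoff: int, max_score: int) -> float:
    """Hypothetical helper: percentage of Likert scores at or above `cutoff`."""
    if not scores:
        raise ValueError("no scores to summarize")
    for s in scores:
        # Reject ratings that fall outside the scale (1..max_score).
        if not 1 <= s <= max_score:
            raise ValueError(f"score {s} outside 1..{max_score}")
    acceptable = sum(1 for s in scores if s >= cutoff)
    return 100.0 * acceptable / len(scores)


# Hypothetical ratings for a single model; these are NOT the study's data.
accuracy_scores = [5, 4, 5, 3, 2, 5, 4, 5]
completeness_scores = [6, 5, 4, 3, 5, 4, 6, 2]

acc_pct = percent_acceptable(accuracy_scores, ACCURACY_CUTOFF, max_score=5)
comp_pct = percent_acceptable(completeness_scores, COMPLETENESS_CUTOFF, max_score=6)
print(f"accuracy acceptable: {acc_pct:.1f}%")          # accuracy acceptable: 87.5%
print(f"completeness acceptable: {comp_pct:.1f}%")     # completeness acceptable: 75.0%
```

Note that a score of 3 counts as accurate but not as complete, which is why the same response can pass one criterion and fail the other.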

Supplementary material: Abosi et al. supplementary material (File, 99.4 KB)