Identifying incarceration status in the electronic health record using large language models in emergency department settings

Published online by Cambridge University Press: 11 March 2024

Thomas Huang
Affiliation:
Department of Emergency Medicine, Yale School of Medicine, New Haven, CT, USA
Vimig Socrates
Affiliation:
Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, USA; Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
Aidan Gilson
Affiliation:
Department of Emergency Medicine, Yale School of Medicine, New Haven, CT, USA; Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, USA
Conrad Safranek
Affiliation:
Department of Emergency Medicine, Yale School of Medicine, New Haven, CT, USA
Ling Chi
Affiliation:
Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, USA
Emily A. Wang
Affiliation:
Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, USA; SEICHE Center for Health and Justice, Yale School of Medicine, New Haven, CT, USA; Department of Medicine, Yale School of Medicine, New Haven, CT, USA
Lisa B. Puglisi
Affiliation:
Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, USA; SEICHE Center for Health and Justice, Yale School of Medicine, New Haven, CT, USA; Department of Medicine, Yale School of Medicine, New Haven, CT, USA
Cynthia Brandt
Affiliation:
Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, USA
R. Andrew Taylor*
Affiliation:
Department of Emergency Medicine, Yale School of Medicine, New Haven, CT, USA; Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, USA
Karen Wang
Affiliation:
Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, USA; SEICHE Center for Health and Justice, Yale School of Medicine, New Haven, CT, USA; Department of Medicine, Yale School of Medicine, New Haven, CT, USA; Equity Research and Innovation Center, Yale School of Medicine, Yale University, New Haven, CT, USA
*Corresponding author: R. A. Taylor, MD, MHS; Email: richard.taylor@yale.edu

Abstract

Background:

Incarceration is a significant social determinant of health, contributing to high morbidity, mortality, and racialized health inequities. However, incarceration status is largely invisible to health services research due to inadequate clinical electronic health record (EHR) capture. This study aims to develop, train, and validate natural language processing (NLP) techniques to more effectively identify incarceration status in the EHR.

Methods:

The study population consisted of adult patients (≥ 18 years old) who presented to the emergency department between June 2013 and August 2021. Notes in the EHR database were filtered for specific incarceration-related terms, and a random selection of 1,000 of the matching notes was annotated for incarceration and further stratified into specific statuses of prior history, recent, and current incarceration. For NLP model development, 80% of the notes were used to train the Longformer-based and RoBERTa algorithms. The remaining 20% of the notes underwent analysis with GPT-4.
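
The note-selection steps described above (keyword filter, random sample of 1,000 notes, 80/20 split) can be sketched roughly as follows. This is only an illustration under stated assumptions: the search terms in `INCARCERATION_TERMS`, the helper names, and the use of a simple unstratified random split are hypothetical and are not taken from the study, which does not list its terms or splitting procedure here.

```python
import random

# Hypothetical search terms; the study's actual term list is not given in the abstract.
INCARCERATION_TERMS = ("incarcerat", "prison", "jail")


def filter_notes(notes, terms=INCARCERATION_TERMS):
    """Keep notes whose text contains at least one term (case-insensitive substring match)."""
    return [note for note in notes if any(term in note.lower() for term in terms)]


def sample_and_split(notes, n_sample=1000, train_frac=0.8, seed=0):
    """Draw a random sample without replacement, then split it into train and held-out parts.

    Assumes a simple random split; the study may have split differently.
    """
    rng = random.Random(seed)
    sampled = rng.sample(notes, min(n_sample, len(notes)))  # already in random order
    cut = int(round(train_frac * len(sampled)))
    return sampled[:cut], sampled[cut:]


if __name__ == "__main__":
    # Synthetic stand-in notes, for illustration only.
    corpus = [f"Patient {i} was released from jail last week." for i in range(1500)]
    corpus += [f"Patient {i} presents with chest pain." for i in range(500)]
    candidates = filter_notes(corpus)
    train_notes, heldout_notes = sample_and_split(candidates)
    print(len(candidates), len(train_notes), len(heldout_notes))
```

In this sketch the held-out 20% plays the role of the notes that were analyzed with GPT-4, while the 80% portion would go to fine-tuning the encoder models.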

Results:

There were 849 unique patients across 989 visits in the 1,000 annotated notes. Manual annotation revealed that 559 of 1,000 notes (55.9%) contained evidence of incarceration history. ICD-10 codes (sensitivity: 4.8%, specificity: 99.1%, F1-score: 0.09) demonstrated inferior performance to RoBERTa NLP (sensitivity: 78.6%, specificity: 73.3%, F1-score: 0.79), Longformer NLP (sensitivity: 94.6%, specificity: 87.5%, F1-score: 0.93), and GPT-4 (sensitivity: 100%, specificity: 61.1%, F1-score: 0.86).
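
To make the relationship between the three reported metrics concrete, the sketch below recomputes an approximate F1-score from sensitivity, specificity, and prevalence. It assumes, purely for illustration, that the evaluation notes had the same prevalence of incarceration history as the full annotated set (55.9%); the abstract does not state the evaluation-set prevalence or the confusion-matrix counts, so the outputs are a rough consistency check and not a reproduction of the study's calculations. The function name is hypothetical.

```python
def f1_from_rates(sensitivity, specificity, prevalence):
    """Approximate F1 from sensitivity, specificity, and prevalence.

    Works with proportions of the whole sample:
      TP = sensitivity * prevalence
      FN = (1 - sensitivity) * prevalence
      FP = (1 - specificity) * (1 - prevalence)
    and F1 = 2*TP / (2*TP + FP + FN).
    """
    tp = sensitivity * prevalence
    fn = (1.0 - sensitivity) * prevalence
    fp = (1.0 - specificity) * (1.0 - prevalence)
    return 2.0 * tp / (2.0 * tp + fp + fn)


# Sensitivity and specificity as reported in the abstract.
REPORTED = {
    "ICD-10": (0.048, 0.991),
    "RoBERTa": (0.786, 0.733),
    "Longformer": (0.946, 0.875),
    "GPT-4": (1.000, 0.611),
}

# Assumption: prevalence in the evaluation notes equals the overall 559/1,000.
ASSUMED_PREVALENCE = 559 / 1000

if __name__ == "__main__":
    for name, (sens, spec) in REPORTED.items():
        print(f"{name}: F1 ~ {f1_from_rates(sens, spec, ASSUMED_PREVALENCE):.2f}")
```

Under that assumption the recomputed values fall within about 0.01 of the reported F1-scores. The sketch also shows why ICD-10 codes score so poorly on F1 despite 99.1% specificity: with 4.8% sensitivity, nearly every positive note is a false negative.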

Conclusions:

Our advanced NLP models demonstrate a high degree of accuracy in identifying incarceration status from clinical notes. Further research is needed to explore their scaled implementation in population health initiatives and assess their potential to mitigate health disparities through tailored system interventions.

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press on behalf of Association for Clinical and Translational Science

Figure 1. Graphic overview of note selection and NLP training process.

Table 1. Incarceration history and annotation labels and their definitions

Figure 2. Identification of incarceration status prompt and pipeline into GPT-4 through the Azure OpenAI Service.

Figure 3. ICD-10 code vs. manual annotation.

Figure 4. Longformer, RoBERTa, and GPT-4 predicted label vs. true label by manual annotation for any history of incarceration.

Figure 5. RoBERTa, Clinical-Longformer, and GPT-4 performance on multilabel task (prior history of incarceration, recent incarceration, and current incarceration).

Figure 6. Shapley visualization of the Clinical-Longformer model correctly identifying and misidentifying any history of incarceration.

Supplementary material: File

Huang et al. supplementary material