
Detection of suicidality from medical text using privacy-preserving large language models

Published online by Cambridge University Press:  05 November 2024

Isabella Catharina Wiest
Affiliation:
Else Kroener Fresenius Center for Digital Health, Technical University Dresden, Dresden, Germany; and Department of Medicine II, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
Falk Gerrik Verhees
Affiliation:
Department of Psychiatry and Psychotherapy, Carl Gustav Carus University Hospital, Technical University Dresden, Dresden, Germany
Dyke Ferber
Affiliation:
Else Kroener Fresenius Center for Digital Health, Technical University Dresden, Dresden, Germany; National Center for Tumor Diseases (NCT), Heidelberg University Hospital, Heidelberg, Germany; and Department of Medical Oncology, Heidelberg University Hospital, Heidelberg, Germany
Jiefu Zhu
Affiliation:
Else Kroener Fresenius Center for Digital Health, Technical University Dresden, Dresden, Germany
Michael Bauer
Affiliation:
Department of Psychiatry and Psychotherapy, Carl Gustav Carus University Hospital, Technical University Dresden, Dresden, Germany
Ute Lewitzka
Affiliation:
Department of Psychiatry and Psychotherapy, Carl Gustav Carus University Hospital, Technical University Dresden, Dresden, Germany
Andrea Pfennig
Affiliation:
Department of Psychiatry and Psychotherapy, Carl Gustav Carus University Hospital, Technical University Dresden, Dresden, Germany
Pavol Mikolas
Affiliation:
Department of Psychiatry and Psychotherapy, Carl Gustav Carus University Hospital, Technical University Dresden, Dresden, Germany
Jakob Nikolas Kather*
Affiliation:
Else Kroener Fresenius Center for Digital Health, Technical University Dresden, Dresden, Germany; National Center for Tumor Diseases (NCT), Heidelberg University Hospital, Heidelberg, Germany; Department of Medical Oncology, Heidelberg University Hospital, Heidelberg, Germany; and Department of Medicine I, University Hospital Dresden, Dresden, Germany
*Correspondence: Jakob Nikolas Kather. Email: jakob_nikolas.kather@tu-dresden.de

Abstract

Background

Attempts to use artificial intelligence (AI) in psychiatric disorders show moderate success, highlighting the potential of incorporating information from clinical assessments to improve the models. This study focuses on using large language models (LLMs) to detect suicide risk from medical text in psychiatric care.

Aims

To extract information about suicidality status from the admission notes in electronic health records (EHRs) using privacy-sensitive, locally hosted LLMs, specifically evaluating the efficacy of Llama-2 models.

Method

We compared the performance of several variants of the open-source LLM Llama-2 in extracting suicidality status from 100 psychiatric reports against a ground truth defined by human experts, assessing accuracy, sensitivity, specificity and F1 score across different prompting strategies.
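As a minimal sketch of how the four reported metrics follow from paired labels, the snippet below computes accuracy, sensitivity, specificity and F1 score from a list of expert labels and a list of model predictions. The function name `binary_metrics`, the `BinaryMetrics` container and the toy labels are illustrative assumptions, not the authors' code or data.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    sensitivity: float
    specificity: float
    f1: float


def binary_metrics(ground_truth, predictions):
    """Compute metrics from paired boolean labels (True = suicidality present)."""
    pairs = list(zip(ground_truth, predictions))
    tp = sum(1 for gt, pred in pairs if gt and pred)
    tn = sum(1 for gt, pred in pairs if not gt and not pred)
    fp = sum(1 for gt, pred in pairs if not gt and pred)
    fn = sum(1 for gt, pred in pairs if gt and not pred)

    def ratio(num, den):
        # Return NaN instead of raising when a class is absent.
        return num / den if den else float("nan")

    return BinaryMetrics(
        accuracy=ratio(tp + tn, len(pairs)),
        sensitivity=ratio(tp, tp + fn),      # true positive rate
        specificity=ratio(tn, tn + fp),      # true negative rate
        f1=ratio(2 * tp, 2 * tp + fp + fn),  # harmonic mean of precision and recall
    )


# Toy labels for illustration only; these are not the study data.
toy_gt = [True, True, True, False, False, False, False, True]
toy_pred = [True, True, False, False, False, True, False, True]
toy_metrics = binary_metrics(toy_gt, toy_pred)
```

With these toy labels there are three true positives, three true negatives, one false positive and one false negative, so all four metrics come out at 0.75.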

Results

A German fine-tuned Llama-2 model showed the highest accuracy (87.5%), sensitivity (83.0%) and specificity (91.8%) in identifying suicidality, with significant improvements in sensitivity and specificity across various prompt designs.

Conclusions

The study demonstrates the capability of LLMs, particularly Llama-2, in accurately extracting information on suicidality from psychiatric records while preserving data privacy. This suggests their application in surveillance systems for psychiatric emergencies and in the clinical management of suicidality, where they could support systematic quality control and research.

Information

Type
Feature
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
Copyright © The Author(s), 2024. Published by Cambridge University Press on behalf of Royal College of Psychiatrists
Table 1 Patient characteristics (n = 100)

Fig. 1 Experimental setup. (a) The information extraction pipeline. The psychiatry reports (n = 100) were transferred to a CSV table. Our pipeline then iterates over all reports with the predefined prompt and outputs a JavaScript Object Notation (JSON) file with all large language model (LLM) outputs (PRED). The relevant classes (suicidality present: yes or no) were then extracted from the LLM output, which was more verbose in some cases. These outputs were then transferred to a pandas dataframe and automatically compared to the expert-based ground truth (GT). (b) The initial prompting strategy. One prompt and one report were given to the model at the same time. Every prompt contained a system prompt with general instructions and a specific question about the report (Instruction). (c) The chain-of-thought approach: the psychiatry report with our prompt was fed into the LLM, which generated a first output. With a second prompt and a predefined answering grammar, the model was fed its own output and constrained to generate a specific JSON-based output structure. This final output then underwent performance analysis. Icon source: Midjourney.
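The extraction and comparison steps in panel (a) could look roughly like the sketch below, which maps a possibly verbose model answer to a yes/no class and pairs it with the expert label. The JSON key `suicidality`, the yes/no matching rule, the function names and the example outputs are all hypothetical; the caption does not specify how the authors parsed the output, and plain dictionaries stand in here for the pandas dataframe they mention.

```python
import json
import re


def extract_label(raw_output):
    """Map a possibly verbose model answer to True/False, or None if unclear.

    Assumption: the answer is either a JSON object with a hypothetical
    "suicidality" key, or free text containing a standalone "yes" or "no".
    The first standalone "yes"/"no" found decides the class.
    """
    try:
        parsed = json.loads(raw_output)
        if isinstance(parsed, dict):
            raw_output = str(parsed.get("suicidality", ""))
    except json.JSONDecodeError:
        pass  # not JSON, fall back to free-text matching

    match = re.search(r"\b(yes|no)\b", raw_output.lower())
    if match is None:
        return None
    return match.group(1) == "yes"


def compare_to_ground_truth(raw_outputs, ground_truth):
    """Return one row per report pairing the extracted prediction with GT."""
    rows = []
    for report_id, (raw, gt) in enumerate(zip(raw_outputs, ground_truth)):
        pred = extract_label(raw)
        rows.append(
            {"report": report_id, "pred": pred, "gt": gt, "correct": pred == gt}
        )
    return rows


# Invented outputs for illustration; not taken from the study.
toy_outputs = [
    '{"suicidality": "yes"}',
    "No. The report does not mention suicidal thoughts.",
    "Yes, the admission note describes suicidal ideation.",
    "The report is unclear.",
]
toy_rows = compare_to_ground_truth(toy_outputs, [True, False, False, True])
```

Answers that contain neither class are returned as `None` rather than guessed, so they count as incorrect and remain easy to find for manual review.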

Fig. 2 Performance of the German-language fine-tuned Llama-2 model. (a) Sensitivity and specificity for five different prompting strategies. With P0, the model was simply asked whether suicidality was present in the report; P1, P2 and P3 provided one, two or three examples to the model. P4 applied a chain-of-thought approach, where the model was asked twice, with the first model output as input for the second run. (b) Confusion matrix representing the performance of the large language model (LLM) in indicating the presence of suicidality based on the examined admission notes (n = 100), with a sensitivity of 83% and a specificity of 92% for P3, a prompt that included three examples. (c) Bar chart showing the balanced accuracies for all models and prompt engineering attempts. Error bars show the 95% confidence interval of the bootstrapped samples.
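One common way to obtain error bars like those in panel (c) is a percentile bootstrap over the reports, sketched below for balanced accuracy (the mean of sensitivity and specificity). The caption only states that the intervals come from bootstrapped samples; the percentile method, the number of resamples, the seed and the function names are assumptions made for illustration, and the toy labels are not the study data.

```python
import random


def balanced_accuracy(ground_truth, predictions):
    """Mean of sensitivity and specificity for boolean labels."""
    tp = sum(1 for gt, p in zip(ground_truth, predictions) if gt and p)
    tn = sum(1 for gt, p in zip(ground_truth, predictions) if not gt and not p)
    positives = sum(1 for gt in ground_truth if gt)
    negatives = len(ground_truth) - positives
    if positives == 0 or negatives == 0:
        return None  # undefined when one class is absent from the sample
    return 0.5 * (tp / positives + tn / negatives)


def bootstrap_ci(ground_truth, predictions, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for balanced accuracy.

    Reports are resampled with replacement; resamples that contain only one
    class are skipped because balanced accuracy is undefined for them.
    """
    rng = random.Random(seed)
    n = len(ground_truth)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        score = balanced_accuracy(
            [ground_truth[i] for i in idx], [predictions[i] for i in idx]
        )
        if score is not None:
            scores.append(score)
    scores.sort()
    lower = scores[int((alpha / 2) * len(scores))]
    upper = scores[min(len(scores) - 1, int((1 - alpha / 2) * len(scores)))]
    return lower, upper


# Toy labels for illustration only: 10 positive and 10 negative reports.
toy_gt = [True] * 10 + [False] * 10
toy_pred = [True] * 8 + [False] * 2 + [False] * 9 + [True] * 1
toy_point = balanced_accuracy(toy_gt, toy_pred)
toy_lower, toy_upper = bootstrap_ci(toy_gt, toy_pred)
```

For the toy labels the point estimate is (0.8 + 0.9) / 2 = 0.85; the interval bounds depend on the seed and the number of resamples.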

Table 2 Performance metrics of the three tested large language models (‘Emgerman’, ‘Sauerkraut’, ‘English’) with the five prompt variations (P0–P4)

Supplementary material: File

Wiest et al. supplementary material

