Promises and pitfalls of large language models in psychiatric diagnosis and knowledge tasks

Published online by Cambridge University Press: 29 April 2025

Chang-Bae Bang
Affiliation:
Department of Psychiatry, Yonsei University College of Medicine, Seoul, Republic of Korea; and Institute of Behavioral Science in Medicine, Yonsei University College of Medicine, Seoul, Republic of Korea
Young-Chul Jung
Affiliation:
Department of Psychiatry, Yonsei University College of Medicine, Seoul, Republic of Korea; and Institute of Behavioral Science in Medicine, Yonsei University College of Medicine, Seoul, Republic of Korea
Seng Chan You
Affiliation:
Department of Biomedical Systems Informatics, Yonsei University College of Medicine, Seoul, Republic of Korea; and Institute for Innovation in Digital Healthcare, Yonsei University, Seoul, Republic of Korea
Kyungsang Kim
Affiliation:
Center for Advanced Medical Computing and Analysis, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts, USA
Byung-Hoon Kim
Affiliation:
Department of Psychiatry, Yonsei University College of Medicine, Seoul, Republic of Korea; Institute of Behavioral Science in Medicine, Yonsei University College of Medicine, Seoul, Republic of Korea; Department of Biomedical Systems Informatics, Yonsei University College of Medicine, Seoul, Republic of Korea; Institute for Innovation in Digital Healthcare, Yonsei University, Seoul, Republic of Korea; and Center for Advanced Medical Computing and Analysis, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts, USA
Email: egyptdj@yonsei.ac.kr

Type
Letter
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of Royal College of Psychiatrists
Fig. 1 Bar plots of the task performances using mean scores and error bars with 95% confidence intervals. Dashed vertical lines separate the task performance of the residents (left) and the large language models (right). Asterisks indicate a significant (P < 0.05) difference compared with GPT-4 results from the Mann–Whitney U-test.

Table 1 Performance, similarity to GPT-4, and comorbidity error rate of residents before and after GPT-4 guidance

Supplementary material: File

Bang et al. supplementary material

