
Combining domain knowledge with large language models for predicting suicide risk in online counselling services

Published online by Cambridge University Press:  24 April 2026

Daniel Izmaylov
Affiliation: Ben-Gurion University of the Negev, Israel

Avi Segal
Affiliation: Ben-Gurion University of the Negev, Israel

Kobi Gal*
Affiliation: Ben-Gurion University of the Negev, Israel; University of Edinburgh, UK

Meytal Grimland
Affiliation: Ruppin Academic Center, Israel

Yossi Levi-Belz
Affiliation: Ruppin Academic Center, Israel

Joy Benatov
Affiliation: University of Haifa, Israel

Yael Levy
Affiliation: Sahar Online Mental Support

*Corresponding author: Kobi Gal; Email: kgal@ed.ac.uk

Abstract

Online counselling services have seen increased use in recent years, providing critical emergency mental health support. Conversations between help-seekers and counsellors are typically long, complex, and highly varied. The lack of domain-specific models, especially in low-resource languages, poses a significant challenge for the automatic detection of suicide risk in online chat services for mental health support. To address this challenge, our approach adapts a general-purpose large language model (LLM) to the suicide risk prediction task, employing a two-stage classification architecture to deal with sparse and imbalanced data. It extends the state of the art by (1) incorporating psychological theory into model training and (2) capturing key aspects of conversation structure in counselling sessions. We evaluate the proposed model against state-of-the-art LLMs for suicide detection on thousands of Hebrew-language conversations from a leading national online counselling service in Israel. Results show that the proposed model outperforms existing state-of-the-art approaches in detecting suicide risk, as measured by metrics from the relevant literature. Moreover, it outperforms other approaches even in the early stages of a conversation, which is crucial for real-time detection in practice. We also discuss the ethical implications of integrating LLMs into counselling services. The contributions of this work are (1) extending existing LLM architectures to incorporate domain-specific information; (2) evaluating LLM technologies in the context of socially relevant problems; and (3) introducing novel LLM tools for resource-constrained languages.
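The two-stage idea described above can be illustrated with a minimal sketch: a first stage flags whether a conversation carries suicide risk at all, and only flagged conversations are passed to a second stage that grades severity (the GSR/IMSR class labels below follow the figure captions). All names, thresholds, and the keyword-based stand-in scorers here are hypothetical; the paper's actual stages are fine-tuned LLM classifiers (SR-BERT and a severity model), not these stubs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TwoStageClassifier:
    """Route a conversation through risk detection, then severity grading."""
    detect_risk: Callable[[str], float]   # stage 1: returns estimated P(risk)
    grade_severity: Callable[[str], str]  # stage 2: e.g. "GSR" or "IMSR"
    risk_threshold: float = 0.5

    def predict(self, conversation: str) -> str:
        # Stage 1: cheap/binary gate handles the heavily imbalanced
        # negative class; most conversations stop here.
        if self.detect_risk(conversation) < self.risk_threshold:
            return "no-risk"
        # Stage 2: finer-grained severity label, only for flagged cases.
        return self.grade_severity(conversation)


# Toy keyword scorers standing in for the fine-tuned LLMs (hypothetical).
RISK_WORDS = {"hopeless", "hurt", "end"}


def toy_detector(text: str) -> float:
    hits = sum(word in text.lower() for word in RISK_WORDS)
    return min(1.0, hits / 2)


def toy_severity(text: str) -> str:
    return "IMSR" if "end" in text.lower() else "GSR"


model = TwoStageClassifier(toy_detector, toy_severity)
print(model.predict("I had a nice day"))          # -> no-risk
print(model.predict("I feel hopeless and hurt"))  # -> GSR
```

Splitting detection from severity grading lets each stage be trained on a better-balanced subset of the data, which is the motivation the abstract gives for the two-stage design.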

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press
Figure 1. A (fictitious) example of a conversation between a help-seeker and a counsellor.

Table 1. General statistics for the Sahar corpus.

Figure 2. Model architecture. (a) SR-BERT base architecture, encoding conversation and speaker roles. (b) Pre-training procedure on four self-supervised tasks, including psychological knowledge learning using the SRF lexicon. (c) Fine-tuning procedure learning to predict suicide risk (SR).

Table 2. SR prediction results of compared models. Bold highlights the highest value.

Figure 3. Precision-recall curve comparing the results of SR-BERT with SSK and SR-BERT without SSK.

Figure 4. False omission rate of different models.

Figure 5. Classification results for early detection of top-performing SR detection approaches.

Figure 6. Classification results for early detection on the explicit-less-conversations benchmarks.

Table 3. SR-BERT performance by overlap with the SRF lexicon (top) and number of role changes (bottom).

Table 4. SR-BERT performance for gender subgroups.

Table 5. Prediction results of compared models. Bold highlights the highest value.

Figure 7. Two-stage model for suicide severity classification.

Figure 8. Flat model baseline for suicide severity classification.

Figure 9. Comparing models' confusion matrices.

Table 6. Prediction results of compared models. Bold highlights the highest value.

Figure 10. Confusion matrix for the data-augmented model.

Figure 11. Classification results on the GSR class for early detection.

Table 7. Two-stage model performance for gender subgroups.

Figure 12. Classification results on the IMSR class for early detection.

Table 8. Illustrative excerpts from conversations where SR-BERT and Ensemble SI-BERT disagree.