
A large language model based data generation framework to improve mild cognitive impairment detection sensitivity

Published online by Cambridge University Press:  26 March 2025

Yang Han
Affiliation:
Department of Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam, Hong Kong
Jacqueline C.K. Lam*
Affiliation:
Department of Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam, Hong Kong
Victor O.K. Li*
Affiliation:
Department of Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam, Hong Kong
Lawrence Y.L. Cheung
Affiliation:
Department of Linguistics and Modern Languages, The Chinese University of Hong Kong, Shatin, Hong Kong
*
Corresponding authors: Jacqueline C.K. Lam and Victor O.K. Li; Emails: jcklam@eee.hku.hk; vli@eee.hku.hk

Abstract

Recent studies of AI-driven, speech-based Alzheimer’s disease (AD) detection have achieved remarkable success in detecting AD dementia through the analysis of audio and text data. However, detecting AD at the early stage of mild cognitive impairment (MCI) remains challenging, owing to insufficient training data and imbalanced diagnostic labels. Motivated by recent advances in Generative AI (GAI) and Large Language Models (LLMs), we propose an LLM-based data generation framework that leverages the prior knowledge encoded in LLMs to generate new data samples. Our framework introduces two novel data generation strategies, cross-lingual and counterfactual data generation, which facilitate out-of-distribution learning over new data samples and reduce biases in MCI label prediction caused by the systematic underrepresentation of MCI subjects in the AD speech dataset. The results demonstrate that our proposed framework significantly improves MCI detection sensitivity and F1-score, on average by up to 38% and 31%, respectively. Furthermore, we identify the key speech markers predicting MCI before and after LLM-based data generation, enhancing our understanding of how the novel data generation approach reduces MCI label prediction biases and shedding new light on speech-based MCI detection under low-data-resource constraints. Our methodology offers a generalized data generation framework for improving downstream prediction tasks in cases where limited and/or imbalanced data present significant challenges to AI-driven health decision-making. Future studies can incorporate more datasets and exploit more acoustic features for speech-based MCI detection.
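The reported gains are in sensitivity (recall on the MCI class) and F1-score. As a minimal illustration of how these two metrics are derived from confusion-matrix counts (not the authors’ code; the function and variable names here are ours):

```python
def sensitivity_f1(tp, fp, fn):
    """Compute sensitivity (recall) and F1 for the positive (MCI) class.

    tp: MCI samples correctly flagged; fp: controls flagged as MCI;
    fn: MCI samples the classifier missed.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if (precision + sensitivity) else 0.0)
    return sensitivity, f1

# Toy example: 12 of 20 MCI samples detected, with 4 false alarms
sens, f1 = sensitivity_f1(tp=12, fp=4, fn=8)
print(round(sens, 2), round(f1, 2))  # 0.6 0.67
```

Because sensitivity ignores true negatives, it directly rewards recovering the underrepresented MCI class, which is why it is the headline metric when labels are imbalanced.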

Information

Type
Research Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Figure 1. Overall framework.

Table 1. Data descriptive statistics

Table 2. Data generation statistics across different strategies and modelsa

Table 3. Evaluation of different data generation strategies in improving MCI detection sensitivity and F1-score based on five-fold cross-validationa

Table 4. Evaluation of different data generation combinations in improving MCI detection sensitivity and F1-score based on five-fold cross-validationa

Figure 2. The top 10 speech markers predicting MCI compared to normal control based on the baseline model trained on the original data. Each circle in the plot represents one sample. The color of the circle indicates the speech marker’s TF-IDF value (see the color bar on the right). The higher the TF-IDF value, the darker the red color. The lower the TF-IDF value, the lighter the blue color. The x-axis represents the SHAP value (i.e., the feature importance score). A higher positive value indicates a higher contribution to the prediction of the positive label (i.e., MCI). A lower negative value indicates a higher contribution to the prediction of the negative label (i.e., normal control).

Figure 3. The top 10 speech markers predicting MCI compared to normal control based on the best model using the original data and the counterfactual data generation. Each circle in the plot represents one sample. The color of the circle indicates the speech marker’s TF-IDF value (see the color bar on the right). The higher the TF-IDF value, the darker the red color. The lower the TF-IDF value, the lighter the blue color. The x-axis represents the SHAP value (i.e., the feature importance score). A higher positive value indicates a higher contribution to the prediction of the positive label (i.e., MCI). A lower negative value indicates a higher contribution to the prediction of the negative label (i.e., normal control).

Figure 4. The top 10 speech markers predicting MCI compared to normal control based on the best model using the original data and all data generation. Each circle in the plot represents one sample. The color of the circle indicates the speech marker’s TF-IDF value (see the color bar on the right). The higher the TF-IDF value, the darker the red color. The lower the TF-IDF value, the lighter the blue color. The x-axis represents the SHAP value (i.e., the feature importance score). A higher positive value indicates a higher contribution to the prediction of the positive label (i.e., MCI). A lower negative value indicates a higher contribution to the prediction of the negative label (i.e., normal control).
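The figure captions above plot each speech marker’s SHAP importance against its TF-IDF value. As a minimal sketch of how TF-IDF weights are computed over a toy transcript corpus (using the smoothed IDF common in standard implementations such as scikit-learn’s default; the corpus and tokenization here are illustrative, not the study’s):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return per-document TF-IDF weights for tokenized transcripts.

    tf = raw count / document length; idf = log((1+N)/(1+df)) + 1
    (smoothed), so a term appearing in every document still gets idf 1.
    """
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (c / total) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, c in counts.items()
        })
    return weights

# Toy transcripts: filler words concentrated in one speaker get higher weight
docs = [["the", "boy", "uh", "uh", "cookie"],
        ["the", "girl", "cookie", "jar"]]
w = tf_idf(docs)
```

A marker frequent in one transcript but rare across the corpus (e.g., a repeated filler) receives a high TF-IDF value, which is why darker (high-TF-IDF) points in the SHAP plots can be read as heavier use of that marker by that speaker.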

Supplementary material: File

Han et al. supplementary material (File, 60.6 KB)