
Heterogeneous pathways to depressive and anxiety disorders: A cluster-based predictive study in a nationwide longitudinal cohort

Published online by Cambridge University Press:  14 May 2026

Chong Chen*
Affiliation:
Division of Neuropsychiatry, Department of Neuroscience, Yamaguchi University Graduate School of Medicine, Ube, Yamaguchi, Japan
Yoshiyuki Asai
Affiliation:
Department of Systems Bioinformatics, Yamaguchi University Graduate School of Medicine, Ube, Japan
Yasuhiro Mochizuki
Affiliation:
Center for Data Science, Waseda University, Shinjuku-ku, Tokyo, Japan
Kosuke Hagiwara
Affiliation:
Division of Neuropsychiatry, Department of Neuroscience, Yamaguchi University Graduate School of Medicine, Ube, Yamaguchi, Japan
Ryo Okubo
Affiliation:
Department of Psychiatry, Hokkaido University Graduate School of Medicine, Sapporo, Hokkaido, Japan
Shin Nakagawa
Affiliation:
Division of Neuropsychiatry, Department of Neuroscience, Yamaguchi University Graduate School of Medicine, Ube, Yamaguchi, Japan
Takahiro Tabuchi
Affiliation:
Division of Epidemiology, School of Public Health, Graduate School of Medicine, Tohoku University, Sendai, Miyagi, Japan
*Corresponding author: Chong Chen; Email: cchen@yamaguchi-u.ac.jp

Abstract

Background

Early prediction of depressive and anxiety disorders is challenging due to substantial heterogeneity in risk pathways. Conventional machine-learning models trained on aggregated populations may obscure subgroup-specific mechanisms and limit interpretability for prevention. We evaluated whether a hybrid unsupervised–supervised framework can identify meaningful subgroups and yield more interpretable risk prediction.

Methods

We analyzed cohort data of 15,897 Japanese adults who completed baseline (August–September 2020) and 6-month follow-up (February–March 2021) surveys and did not screen positive for depressive and anxiety disorders at baseline (K6 score < 13). Using 169 baseline demographic, psychosocial, lifestyle, and behavioral variables, we performed hierarchical clustering to derive data-driven subgroups. Within each cluster, we trained Random Forest models to predict incident screened depressive and anxiety disorders at follow-up (K6 ≥ 13) and interpreted predictors using SHapley Additive exPlanations (SHAP).
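
The cluster-then-predict idea described in the Methods can be illustrated with a minimal sketch. It is not the authors' code: the synthetic data, the helper names `make_synthetic` and `cluster_then_predict`, the 70/30 split, the minimum-case threshold, and the Random Forest settings are all assumptions made for illustration. SHAP interpretation is omitted so that the example depends only on NumPy and scikit-learn.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def make_synthetic(n_per_group=200, seed=0):
    """Toy stand-in for survey data: three shifted groups, six features."""
    rng = np.random.default_rng(seed)
    blocks = []
    for shift in (-6.0, 0.0, 6.0):
        block = rng.normal(size=(n_per_group, 6))
        block[:, 1:] += shift  # features 1..5 carry the group structure
        blocks.append(block)
    X = np.vstack(blocks)
    # Hypothetical binary outcome driven by feature 0 plus noise.
    y = (X[:, 0] + 0.5 * rng.normal(size=len(X)) > 0.8).astype(int)
    return X, y


def cluster_then_predict(X, y, n_clusters=5, min_cases=10, seed=0):
    """Ward clustering on standardized features, then one model per cluster.

    Returns the cluster labels and a dict {cluster: held-out ROC AUC}.
    Clusters with fewer than `min_cases` in either class are skipped.
    """
    X_std = StandardScaler().fit_transform(X)
    labels = AgglomerativeClustering(
        n_clusters=n_clusters, linkage="ward"
    ).fit_predict(X_std)

    aucs = {}
    for c in range(n_clusters):
        mask = labels == c
        Xc, yc = X_std[mask], y[mask]
        if np.bincount(yc, minlength=2).min() < min_cases:
            continue  # too few cases or non-cases to split and evaluate
        X_tr, X_te, y_tr, y_te = train_test_split(
            Xc, yc, test_size=0.3, stratify=yc, random_state=seed
        )
        model = RandomForestClassifier(
            n_estimators=200, class_weight="balanced", random_state=seed
        ).fit(X_tr, y_tr)
        aucs[c] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    return labels, aucs


if __name__ == "__main__":
    X, y = make_synthetic()
    labels, aucs = cluster_then_predict(X, y, n_clusters=3)
    for c, auc in sorted(aucs.items()):
        print(f"cluster {c}: n={np.sum(labels == c)}, held-out AUC={auc:.3f}")
```

Fitting one model per cluster, rather than a single global model, is what allows predictor rankings to differ between subgroups; the cost is a smaller training set for each model.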

Results

The overall 6-month incidence was 6.23%. A five-cluster solution revealed two high-risk subgroups: an older-adult profile with poor quality of life (12.9% incidence) and a working-parent profile characterized by work–family overload (29.8% incidence). Compared with a global model trained on the full sample, the cluster-then-predict framework showed broadly similar overall performance but performed better in the highest-risk subgroup and revealed more differentiated predictor profiles. Loneliness, health-related quality of life, happiness, and personality traits predominated in clusters with moderate adversity, whereas lifestyle disruption (sleep, diet, and irregular routines) characterized the high-risk late-life subgroup, and alcohol dependence and work–family burden characterized the high-risk working-parent subgroup.

Conclusions

Addressing risk-factor heterogeneity before prediction may enable more interpretable, context-tailored prevention strategies.

Information

Type
Original Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press

Figure 1. Analytical workflow of the study. Participants who did not screen positive for depressive or anxiety disorders (K6 < 13) at baseline were analyzed using 169 predictors. After preprocessing and standardization, hierarchical clustering (Ward) was performed, and the optimal k was selected based on composite internal metrics. UMAP visualizations illustrate subgroup structure. Cluster characteristics were then examined using Random Forests and SHAP values, and separate predictive models were trained within each cluster to evaluate risk factors for incident disorders. Icons from Flaticon.

Figure 2. Overview of cluster validity, structure, and the incidence of depressive and anxiety disorders across subgroups. (a) Mean normalized cluster validity score (composite of within-cluster sum of squares, Silhouette, Calinski–Harabasz, and Davies–Bouldin indices) across candidate numbers of clusters (k = 2–20). The vertical dashed line indicates the optimal solution (k = 5). Consistent with the composite criterion, the Davies–Bouldin index also independently favored k = 5. (b) Hierarchical clustering dendrogram based on Ward’s method, with the horizontal dashed line indicating the cut level corresponding to the five-cluster solution. (c) Two-dimensional Uniform Manifold Approximation and Projection (UMAP) embedding colored by cluster ID, with kernel density contours outlining high-density regions within each cluster. (d) Three-dimensional UMAP embedding of the same clusters. (e) Cluster-wise incidence rates of screened depressive and anxiety disorders at follow-up (T2), with bars colored by cluster, sample sizes displayed at the base of each bar, and percentages shown above. The horizontal dashed line indicates the mean incidence across all participants.
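
One way to build a composite validity score like the one in panel (a) is sketched below. The caption does not give the normalization used, so the scheme here is an assumption: each index is min-max scaled across the candidate k, the two lower-is-better indices (within-cluster sum of squares and Davies–Bouldin) are inverted, and the four are averaged. The function names are hypothetical.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)


def wcss(X, labels):
    """Within-cluster sum of squared distances to each cluster centroid."""
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total


def minmax(values):
    """Scale a 1-D sequence to [0, 1]; a constant sequence maps to zeros."""
    v = np.asarray(values, dtype=float)
    span = v.max() - v.min()
    return np.zeros_like(v) if span == 0 else (v - v.min()) / span


def composite_validity(X, ks):
    """Return {k: mean normalized validity score} for Ward clustering."""
    ks = list(ks)
    rows = []
    for k in ks:
        labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
        rows.append((
            wcss(X, labels),
            silhouette_score(X, labels),
            calinski_harabasz_score(X, labels),
            davies_bouldin_score(X, labels),
        ))
    w, sil, ch, db = (np.array(col) for col in zip(*rows))
    # Invert WCSS and Davies-Bouldin so that higher is better for all four.
    score = np.mean([1 - minmax(w), minmax(sil), minmax(ch), 1 - minmax(db)], axis=0)
    return dict(zip(ks, score.tolist()))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centers = np.array([[0, 0], [20, 0], [0, 20], [20, 20]], dtype=float)
    X = np.vstack([c + rng.normal(size=(50, 2)) for c in centers])
    scores = composite_validity(X, range(2, 9))
    print(max(scores, key=scores.get), scores)
```

Because within-cluster sum of squares always decreases as k grows, its inverted term on its own favors the largest k; averaging it with the other three indices is what produces an interior optimum in this sketch.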

Figure 3. Distribution of SHAP-identified differentiating features across clusters. The 40 most globally important features (ranked by mean absolute SHAP values) were grouped into five domains: demographic, work, health, family, and lifestyle. To improve readability, only key differentiating features are shown for the work and family domains. For continuous and ordinal variables, violin plots show the distribution of feature values by cluster, including median and interquartile ranges. For binary and one-hot-encoded variables, stacked bar charts show the proportion of participants with values 0 (translucent segment) and 1 (solid segment) in each cluster, with bar color indicating cluster ID. Distributions of all 40 features in the original SHAP rank order are provided in Supplementary Figure S3.

Figure 4. Cluster-specific SHAP feature importance for predicting incident depressive and anxiety disorders at follow-up. Beeswarm plots show SHAP (SHapley Additive exPlanations) value distributions for the top 20 features contributing to the Random Forest models within each cluster. Each point represents an individual participant, and its horizontal position indicates the feature’s marginal contribution to higher (positive SHAP) or lower (negative SHAP) predicted risk within that cluster. Color indicates feature values (high in red, low in blue). Features are ranked vertically by their overall impact, with those at the top contributing most strongly. Panels (a)–(e) correspond to Clusters 1–5.

Figure 5. Cross-cluster comparison of mean absolute SHAP importance for predicting the incidence of depressive and anxiety disorders at follow-up. The heatmap displays the union of the top 20 features from each cluster-specific Random Forest model, with rows ordered by the maximum mean |SHAP| observed across clusters. Columns correspond to clusters, and cell color indicates the mean absolute SHAP value for that feature within that cluster (warmer colors = greater importance). White cells indicate near-zero importance (|mean SHAP| below 0.001). Numbers within colored cells denote the within-cluster rank of that feature’s mean |SHAP| (1 = most important) in the corresponding cluster.
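
The table behind a heatmap of this kind can be assembled as in the sketch below, which assumes the per-cluster mean |SHAP| values have already been computed. The function name `cross_cluster_table`, the feature names, and the numbers in the example are hypothetical; the 0.001 floor is applied here to the mean |SHAP| value.

```python
def cross_cluster_table(importance, top_n=20, floor=0.001):
    """Build the union-of-top-features table for a cross-cluster heatmap.

    importance: {cluster: {feature: mean absolute SHAP}}.
    Returns {feature: {cluster: (value, within-cluster rank) or None}},
    with features ordered by their maximum value across clusters.
    None marks a cell whose value falls below `floor` (a blank cell).
    """
    # Within-cluster rank of every feature (1 = most important).
    ranks = {}
    for cluster, imp in importance.items():
        ordered = sorted(imp, key=imp.get, reverse=True)
        ranks[cluster] = {feat: i + 1 for i, feat in enumerate(ordered)}

    # Union of each cluster's top-N features.
    union = set()
    for cluster in importance:
        union.update(f for f, r in ranks[cluster].items() if r <= top_n)

    # Order rows by the largest importance seen in any cluster.
    def max_importance(feat):
        return max(imp.get(feat, 0.0) for imp in importance.values())

    table = {}
    for feat in sorted(union, key=max_importance, reverse=True):
        table[feat] = {}
        for cluster, imp in importance.items():
            value = imp.get(feat, 0.0)
            table[feat][cluster] = None if value < floor else (value, ranks[cluster][feat])
    return table


if __name__ == "__main__":
    toy = {
        1: {"loneliness": 0.05, "sleep": 0.01, "alcohol": 0.0005},
        2: {"loneliness": 0.02, "sleep": 0.002, "alcohol": 0.08},
    }
    for feat, cells in cross_cluster_table(toy, top_n=2).items():
        print(feat, cells)
```

Taking the union rather than the intersection keeps features that matter in only one cluster, which is where the subgroup-specific predictors show up.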

Supplementary material: File

Chen et al. supplementary material