Predicting the naturalistic course in anxiety disorders using clinical and biological markers: a machine learning approach

Wicher A. Bokma; Paul Zhutovsky; Erik J. Giltay; Robert A. Schoevers; Brenda W.J.H. Penninx; Anton L.J.M. van Balkom; Neeltje M. Batelaan; Guido A. van Wingen

doi:10.1017/S0033291720001658

Published online by Cambridge University Press: 11 June 2020

Wicher A. Bokma*: Affiliation:
Department of Psychiatry, Amsterdam UMC, Vrije Universiteit, Amsterdam Public Health research institute, The Netherlands GGZ inGeest Specialized Mental Health Care, Amsterdam, The Netherlands
Paul Zhutovsky: Affiliation:
Department of Psychiatry, Amsterdam UMC, Location AMC, University of Amsterdam, Amsterdam Neuroscience, Amsterdam, The Netherlands
Erik J. Giltay: Affiliation:
Department of Psychiatry, Leiden University Medical Center (LUMC), Leiden, The Netherlands
Robert A. Schoevers: Affiliation:
Department of Psychiatry, University Medical Center Groningen, Groningen, The Netherlands
Brenda W.J.H. Penninx: Affiliation:
Department of Psychiatry, Amsterdam UMC, Vrije Universiteit, Amsterdam Public Health research institute, The Netherlands GGZ inGeest Specialized Mental Health Care, Amsterdam, The Netherlands
Anton L.J.M. van Balkom: Affiliation:
Department of Psychiatry, Amsterdam UMC, Vrije Universiteit, Amsterdam Public Health research institute, The Netherlands GGZ inGeest Specialized Mental Health Care, Amsterdam, The Netherlands
Neeltje M. Batelaan: Affiliation:
Department of Psychiatry, Amsterdam UMC, Vrije Universiteit, Amsterdam Public Health research institute, The Netherlands GGZ inGeest Specialized Mental Health Care, Amsterdam, The Netherlands
Guido A. van Wingen: Affiliation:
Department of Psychiatry, Amsterdam UMC, Location AMC, University of Amsterdam, Amsterdam Neuroscience, Amsterdam, The Netherlands
*: Author for correspondence: Wicher A. Bokma, E-mail: wicherbokma@gmail.com

Abstract

Background

Disease trajectories of patients with anxiety disorders are highly diverse and approximately 60% remain chronically ill. The ability to predict disease course in individual patients would enable personalized management of these patients. This study aimed to predict recovery from anxiety disorders within 2 years applying a machine learning approach.

Methods

In total, 887 patients with anxiety disorders (panic disorder, generalized anxiety disorder, agoraphobia, or social phobia) were selected from a naturalistic cohort study. A wide array of baseline predictors (N = 569) from five domains (clinical, psychological, sociodemographic, biological, lifestyle) were used to predict recovery from anxiety disorders and recovery from all common mental disorders (CMDs: anxiety disorders, major depressive disorder, dysthymia, or alcohol dependency) at 2-year follow-up using random forest classifiers (RFCs).

Results

At follow-up, 484 patients (54.6%) had recovered from anxiety disorders. RFCs achieved a cross-validated area-under-the-receiver-operator-characteristic-curve (AUC) of 0.67 when using the combination of all predictor domains (sensitivity: 62.0%, specificity: 62.8%) for predicting recovery from anxiety disorders. Classification of recovery from CMDs yielded an AUC of 0.70 (sensitivity: 64.6%, specificity: 62.3%) when using all domains. In both cases, the clinical domain alone provided comparable performance. Feature analysis showed that prediction of recovery from anxiety disorders was primarily driven by anxiety features, whereas recovery from CMDs was primarily driven by depression features.

Conclusions

The current study showed moderate performance in predicting recovery from anxiety disorders over a 2-year follow-up for individual patients and indicates that anxiety features are most indicative for anxiety improvement and depression features for improvement in general.

Keywords

agoraphobia; anxiety disorders; classification; generalized anxiety disorder; machine learning; panic disorder; random forest classification; social phobia

Information

Type: Original Article
Information: Psychological Medicine, Volume 52, Issue 1, January 2022, pp. 57-67

DOI: https://doi.org/10.1017/S0033291720001658
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright: Copyright © The Author(s), 2020. Published by Cambridge University Press

Introduction

Anxiety disorders are characterized by highly heterogeneous clinical course trajectories. After 2 years, the prognosis varies across disorders, with remission rates of 72.5% for panic disorder without agoraphobia, 69.7% for generalized anxiety disorder, 53.5% for social phobia and 52.7% for panic disorder with agoraphobia (Hendriks, Spijker, Licht, Beekman, & Penninx, 2013). Remitted patients experience a relatively benign course with moderate remaining symptom severity, disability and a low subjective need for care (Batelaan, Rhebergen, Spinhoven, van Balkom, & Penninx, 2014; Spinhoven et al., 2016; van Beljouw, Verhaak, Cuijpers, van Marwijk, & Penninx, 2010). However, around 60% of patients have persistent symptoms, relapses, or chronic disease up to 6 years after the diagnosis (Batelaan et al., 2014; Spinhoven et al., 2016). Disease course in these patients is often characterized by substantial levels of disability. Predicting long-term disease course can be seen as an important step towards personalized medicine (Steyerberg, 2009). This would make targeted treatment efforts viable, in which treatments are tailored towards the individual risk for a poor disease outcome (McGorry, Ratheesh, & O'Donoghue, 2018). However, in anxiety disorders, there is a lack of robust course predictors. For instance, different DSM anxiety disorder diagnoses were shown to be poorly predictive of subsequent course (Batelaan et al., 2014). In current clinical practice, in the absence of valid risk prediction models, course prediction relies solely on clinicians' opinions, which show poor accuracy (Randall, Sareen, Chateau, & Bolton, 2019).

A possible explanation for the lack of accuracy in course prediction in anxiety disorders is the complex, multicausal aetiology of anxiety disorders. Univariable and multivariable analyses of predictors of disease course showed low levels of explained variance (Bokma, Batelaan, Hoogendoorn, Penninx, & van Balkom, 2020). Furthermore, inference is typically done at the group level, which does not allow generalizable statements about the single individual. Multivariable machine learning (ML) methods provide a possible solution for this problem, as they are well-suited for solving problems with high numbers of predictors in complex, multicausal disorders (Iniesta, Stahl, & McGuffin, 2016). The use of ML in the field of psychiatry may have great potential for the prediction of disease course trajectories (Hahn, Nierenberg, & Whitfield-Gabrieli, 2017). Prediction of the disease course can be regarded as a ‘classification’ problem, which can be solved using supervised algorithms (Deo, 2015). Such algorithms are trained on patients with known predictor and outcome variables to derive a function that can be applied to unseen patients to predict their outcome based on the values of their predictor variables. In anxiety disorders, supervised algorithms have been applied a few times cross-sectionally, to relate predictors from various domains to current disease status (Woo, Chang, Lindquist, & Wager, 2017), or to predict short-term treatment effects (Lueken & Hahn, 2016). To the best of our knowledge, however, no studies have applied supervised ML algorithms to predict the disease course in anxiety disorders.

The aim of this study was to predict long-term anxiety disorder course, using an ML approach applied to clinical, psychological, biological, sociodemographic and lifestyle baseline data. Specifically, we investigated the utility of a random forest classifier (RFC) (Breiman, 2001) to predict clinical course in patients with any baseline anxiety disorder. Our main outcome was recovery from anxiety disorders at 2-year follow-up. As a secondary outcome, recovery from all common mental disorders (CMDs) at 2-year follow-up was used. CMDs include anxiety disorders, but also depressive disorders and substance use disorders, as these disorders often co-occur, show diagnostic instability over time (Hovenkamp-Hermelink et al., 2016; Lamers et al., 2011; Scholten et al., 2016; Verduijn et al., 2017), and recovery from one but not the other does not index a major improvement in health. Finally, we assessed which predictor domains contributed most to disease course predictions. We hypothesized that RFCs using a wide array of baseline data from different domains would yield adequate 2-year recovery predictions for both outcomes. Furthermore, we hypothesized that the combination of the five domains would yield the best predictions.

Methods

Study sample

The participants in this study were selected from the multi-site Netherlands Study of Depression and Anxiety (NESDA), an ongoing naturalistic cohort study into the course of depression and anxiety. The baseline sample consists of 2981 participants who were recruited from the community, primary care and specialized mental health care centres. All participants had a lifetime or current depressive disorder or anxiety disorder diagnosis (n = 2329, 78.1%) or were healthy controls (n = 652, 21.9%). NESDA allowed for the presence of comorbid psychiatric disorders, with the exception of psychotic disorders, obsessive-compulsive disorder, post-traumatic stress disorder, bipolar disorders, or severe substance use disorders. Insufficient proficiency in the Dutch language was a further exclusion criterion. Baseline data collection was performed in 2004–2007 and was followed by 1-year, 2-year, 4-year, 6-year, and 9-year follow-up measurements. Full descriptions of the design of NESDA were published previously (Penninx et al., 2008). The study protocol was approved by the Ethical Review Board of all participating institutes and written informed consent was obtained from all participants.

For the purpose of this study, patients with current (6-month) panic disorder (PD, with or without agoraphobia), generalized anxiety disorder (GAD) or social anxiety disorder (SAD) diagnoses at baseline were selected (n = 1206). In our sample, psychiatric comorbidity was allowed. The diagnosis was established according to DSM-IV criteria with the Composite International Diagnostic Interview (CIDI, version 2.1) (American Psychiatric Association, 2000; Wittchen, 1994; World Health Organization, 1998). Of these patients, 212 were excluded due to missing diagnostic information at 2-year follow-up. A further 107 patients were removed because more than 20% of their baseline predictor variables were missing. This yielded a final sample of 887 anxiety disorder patients with sufficient data available. Excluded patients showed comparable symptom severity at baseline: mean anxiety severity (Beck's Anxiety Inventory; BAI) was 20.35 ± 11.74 v. 18.30 ± 10.48, t = 1.81, p = 0.07, and mean depression severity (Inventory of Depressive Symptomatology-Self Report; IDS-SR) was 30.71 ± 12.65 v. 29.39 ± 12.65, t = 0.97, p = 0.33. Excluded patients were younger (mean age: 38.25 ± 12.05 v. 41.92 ± 12.20 years, t = 4.62, p < 0.001) and had fewer years of education on average (11.03 ± 3.15 v. 11.88 ± 3.35, t = 3.97, p < 0.001), consistent with differences across the whole NESDA sample (Lamers et al., 2012). Gender did not differ between excluded and included patients (% female in excluded sample 68.2%, in included sample 66.8%, χ² = 0.22, p = 0.64).

Investigated classifications

Two distinct classification tasks predicting outcomes at 2-year follow-up were performed. Both were binary classification tasks predicting (1) recovery from anxiety disorders or (2) recovery from all CMDs. Anxiety disorders were defined as either PD, agoraphobia, GAD, or SAD. Recovery from anxiety disorders was deemed present if no anxiety disorder diagnoses persisted at follow-up. These diagnoses referred to all follow-up anxiety disorders, not only the index disorder(s). Anxiety disorders, dysthymia, major depressive disorder (MDD) and alcohol dependency are sometimes collectively referred to as CMDs (Ormel et al., 2013; Vollebergh et al., 2001). For the purpose of this study, we defined recovery from all CMDs as the absence of any anxiety disorder, MDD, dysthymia or alcohol dependency diagnosis at follow-up. Assessment of CMDs is relevant for three reasons: population-based studies show that depressive disorders and alcohol dependency are the most commonly occurring comorbidities in anxiety disorders (Alonso & Lépine, 2007; Judd et al., 1998; Wittchen, Kessler, Pfister, & Lieb, 2000); rates of diagnostic instability across anxiety disorders, depressive disorders and alcohol dependency are high (Gustavson et al., 2018; Hovenkamp-Hermelink et al., 2016; Scholten et al., 2016); and recovery from one but not the other does not imply a major improvement in health. We assessed recovery from anxiety disorders as a primary outcome measure and recovery from all CMDs as a secondary outcome measure. These two outcome measures describe recovery from a narrow and a broad perspective (Verduijn et al., 2017).
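The two outcome definitions can be illustrated with a minimal sketch. The diagnosis labels, the patient identifiers and the `recovered` helper are hypothetical names chosen for illustration; they are not NESDA variable names.

```python
# Disorder sets as defined in the text (labels are illustrative strings).
ANXIETY = {"PD", "agoraphobia", "GAD", "SAD"}
CMD = ANXIETY | {"MDD", "dysthymia", "alcohol dependency"}

def recovered(followup_diagnoses, disorder_set):
    """True if none of the follow-up diagnoses belongs to disorder_set."""
    return not (set(followup_diagnoses) & disorder_set)

# Hypothetical follow-up diagnoses for three patients.
followup = {"p1": [], "p2": ["MDD"], "p3": ["GAD", "MDD"]}

# (recovered from anxiety disorders, recovered from all CMDs)
labels = {pid: (recovered(dx, ANXIETY), recovered(dx, CMD))
          for pid, dx in followup.items()}
```

In this toy example, patient `p2` counts as recovered under the narrow definition but not under the broad one, which is the distinction the two outcomes are meant to capture.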

Baseline predictor variables

At baseline, a wide array of putative predictors from five domains (clinical, psychological, sociodemographic, biological and lifestyle) was selected, yielding a total of 651 variables. In our analyses, only information at the individual item level was used. Total summary scores for questionnaires were not calculated, as these would be correlated with the individual items. The exception was the NEO Five-Factor Inventory (NEO-FFI), as its domains (e.g. neuroticism) are of specific clinical relevance. Items were excluded if more than 20% of patients were missing the corresponding item. This resulted in the inclusion of 569 predictors at baseline (see Table 1). If a variable did not apply to a patient, it was re-coded as a new category for ordinal or nominal variables or as 0 for continuous variables (all continuous variables were positive). This encoding made it possible to retain the variable for classification by assigning a value that does not occur naturally, signalling that the variable did not apply to this patient. All remaining missing values were imputed using median/mode imputation calculated on the training set (see below) to obtain a full data set. No variable had more than 10% missing values before imputation was applied. Additional information about measurement instruments, variable scoring and collection can be found in the Supplementary Methods. We investigated the predictive capability of all domains individually and the combination of all five domains.
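The key point of training-set imputation is that fill values are learned from the training rows only and then reused unchanged on the test rows, so no test information leaks into the model. The following is a minimal sketch of that idea using only the Python standard library; the helper names, the use of `None` for missing values and the toy data are assumptions made for illustration and do not reproduce the authors' pipeline.

```python
from statistics import median, mode

def fit_fill_values(train_rows, categorical):
    """Learn one fill value per column from the training rows only:
    the mode for categorical columns, the median for continuous ones."""
    fills = []
    for j in range(len(train_rows[0])):
        observed = [row[j] for row in train_rows if row[j] is not None]
        fills.append(mode(observed) if j in categorical else median(observed))
    return fills

def impute(rows, fills):
    """Replace missing entries (None) with the previously learned fill values."""
    return [[fills[j] if value is None else value
             for j, value in enumerate(row)] for row in rows]

# Toy data: column 0 is continuous, column 1 is categorical.
train = [[1.0, "a"], [3.0, "b"], [None, "b"], [10.0, None]]
test = [[None, None], [2.0, "a"]]

fills = fit_fill_values(train, categorical={1})
train_full = impute(train, fills)
test_full = impute(test, fills)  # test rows reuse the training-set statistics
```

Within cross-validation, `fit_fill_values` would be called once per training fold.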

Table 1. Included baseline predictor variables across the five predictor domains

Machine learning algorithm

RFCs (Breiman, 2001) were used in all analyses. RFCs have been shown to perform well on many different machine learning problems (Fernández-Delgado, Cernadas, Barro, & Amorim, 2014), specifically in biomedical sciences (Olson, Cava, Mustahsan, Varik, & Moore, 2018). An RFC is built as an ensemble of many decision trees (Breiman, Friedman, Olshen, & Stone, 1984) which themselves are trained by considering random subsamples of variables and patients for each tree. Such a procedure leads to improved and robust prediction performance in comparison to individual trees (Breiman, 2001). Details on hyperparameters used in the analysis can be found in the Supplementary Methods. All analyses were implemented using the scikit-learn (version 0.20.2) (Pedregosa et al., 2011) and imbalanced-learn toolboxes (version 0.4.3) (Lemaître, Nogueira, & Aridas, 2017) in the Python programming language (version 3.7.2).
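As an illustration of the ensemble described above, the following sketch fits a scikit-learn random forest on synthetic data. The data, the number of trees and all other settings are assumptions chosen for the example; the hyperparameters actually used are reported only in the Supplementary Methods and are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a patients-by-predictors matrix (not NESDA data).
X, y = make_classification(n_samples=300, n_features=30, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

# bootstrap=True resamples patients for each tree; max_features="sqrt"
# considers a random subset of variables at each split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0)
forest.fit(X_train, y_train)

# Predicted probability of the positive class for unseen patients.
proba = forest.predict_proba(X_test)[:, 1]
```

The predicted probabilities, rather than hard class labels, are what an AUC-based evaluation would be computed from.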

Evaluation

To evaluate the performance of our classifiers, 10-times-repeated 10-fold cross-validation was applied. In this procedure, the data set is repeatedly (n = 100) divided into disjoint training (90% of data) and test (10% of data) sets; the RFC is fit only on the training data and evaluated on the independent test data. The final performance is obtained as an average across all test set evaluations. We measured performance as area-under-the-receiver-operator-curve (AUC). In addition, we calculated sensitivity, specificity, balanced accuracy (the average of sensitivity and specificity) and positive/negative predictive values. To further validate our classification performance, label-permutation tests (n = 1000) of average AUC values were performed (Ojala & Garriga, 2010). The obtained p values were Bonferroni-corrected across the five individual domains and the combination of all domains, and alpha was set to 0.05.
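The evaluation scheme can be sketched with scikit-learn's model-selection utilities. This is a scaled-down illustration on synthetic data, not the authors' code: stratified folds, the forest size, and the reduced permutation settings (20 permutations on a single 5-fold split instead of 1000 permutations of the repeated cross-validation average) are assumptions made to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (RepeatedStratifiedKFold, StratifiedKFold,
                                     cross_val_score, permutation_test_score)

# Synthetic stand-in data (not NESDA data).
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)
clf = RandomForestClassifier(n_estimators=25, random_state=0)

# 10-times-repeated 10-fold cross-validation: 100 train/test splits,
# each scored by AUC on the held-out 10%.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
auc_scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
mean_auc = auc_scores.mean()

# Label-permutation test: refit on shuffled labels to build a null
# distribution of the cross-validated AUC.
_, perm_scores, p_value = permutation_test_score(
    clf, X, y, cv=StratifiedKFold(n_splits=5), scoring="roc_auc",
    n_permutations=20, random_state=0)

# Bonferroni correction across six tests (five domains plus their combination).
p_bonferroni = min(1.0, p_value * 6)
```

With only 20 permutations the smallest attainable p value is 1/21, so a realistic run needs far more permutations for a Bonferroni-corrected result to fall below 0.05.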

To systematically compare the performance of different predictor domains, patients were distributed in exactly the same way for each of the classifications, i.e. the training and test sets of any cross-validation iteration included the same patients for each predictor domain. This allowed the calculation of normalized average differences in AUC scores across cross-validation iterations for each pair of predictor domains (including the combination of all domains). Non-parametric sign-flipping tests (n = 10 000) were then employed to derive p values, which were Bonferroni-corrected for 30 comparisons with alpha set to 0.05.
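A sign-flipping test on paired AUC differences can be sketched as follows. Because the normalization applied to the differences is not specified in this section, the sketch uses the plain mean difference as the test statistic and a two-sided comparison; both choices, the function names and the toy differences are assumptions.

```python
import random

def sign_flip_pvalue(diffs, n_flips=10_000, seed=0):
    """Two-sided sign-flipping test for a mean paired difference of zero."""
    rng = random.Random(seed)
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    hits = 0
    for _ in range(n_flips):
        # Under the null hypothesis, the sign of each paired difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed:
            hits += 1
    # Add one to numerator and denominator to count the observed statistic itself.
    return (hits + 1) / (n_flips + 1)

def bonferroni(p, n_tests):
    """Bonferroni-adjusted p value, capped at 1."""
    return min(1.0, p * n_tests)

# Hypothetical per-iteration AUC differences between two domains (100 iterations).
strong = [0.10] * 50 + [0.12] * 50   # one domain consistently better
null = [0.05, -0.05] * 50            # no systematic difference

p_strong = sign_flip_pvalue(strong, n_flips=2000)
p_null = sign_flip_pvalue(null, n_flips=2000)
p_strong_corrected = bonferroni(p_strong, 30)
```

The paired design matters here: the differences are only meaningful because every domain was evaluated on identical train/test splits.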

Variable importance

In addition to their strong classification performance, RFCs allow the importance of each variable for the classification task to be quantified (Breiman, 2001). However, the standard calculation of variable importance has been shown to be biased (Strobl, Boulesteix, Zeileis, & Hothorn, 2007) and a permutation-based variable importance scheme has been suggested instead (Altmann, Toloşi, Sander, & Lengauer, 2010; Hapfelmeier & Ulm, 2013; Strobl et al., 2007). Following this approach, we calculated p values for each variable by permuting (n = 1000) every variable separately. The computed p values were then corrected according to the false discovery rate (FDR) (Benjamini & Hochberg, 2000) and significance was set to 0.05. Given that variable importance was calculated in every cross-validation iteration, important variables were defined as variables which were significant under FDR in at least 50% of all cross-validation iterations. This very stringent procedure for identifying important variables was employed to calculate valid variable importance information specific to the classification task. Variable importance was only investigated for the classifications using the data from the combination of all domains. In addition, we investigated differences in the average rankings of important variables between the two classification tasks. A detailed description of this approach can be found in the Supplementary Methods.
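The selection rule described above (FDR correction within each cross-validation iteration, then a 50% consistency threshold across iterations) can be sketched in plain Python. The sketch starts from already computed per-variable p values and uses the standard Benjamini-Hochberg step-up adjustment; the permutation scheme that produces the p values is described in the Supplementary Methods and is not reproduced here. Function names, the `<=` comparison against alpha and the toy p values are assumptions.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def consistently_important(p_by_iteration, alpha=0.05, min_fraction=0.5):
    """Flag variables that are FDR-significant in at least min_fraction
    of the cross-validation iterations."""
    n_iter = len(p_by_iteration)
    counts = [0] * len(p_by_iteration[0])
    for p_values in p_by_iteration:
        for j, q in enumerate(benjamini_hochberg(p_values)):
            if q <= alpha:
                counts[j] += 1
    return [count / n_iter >= min_fraction for count in counts]

# Hypothetical p values: 4 cross-validation iterations x 3 variables.
p_by_iteration = [
    [0.001, 0.04, 0.5],
    [0.002, 0.30, 0.6],
    [0.20, 0.01, 0.7],
    [0.001, 0.02, 0.9],
]
important = consistently_important(p_by_iteration)
```

In this toy example the first variable is significant in three of four iterations, the second in exactly half, and the third in none, so only the first two are flagged.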

Results

At 2-year follow-up, 484 patients (54.6%) recovered from anxiety disorders, and 362 patients (40.8%) did not have any CMD. Baseline clinical, psychological, sociodemographic, biological and lifestyle variables are provided for patients with and without anxiety disorders at follow-up (Table 2) and for patients with and without CMD at follow-up (online Supplementary Table 1). Various clinical and psychological variables showed differences between the two groups. By contrast, biological and lifestyle status did not differ between the two groups.

Table 2. Baseline characteristics of anxiety disorder sample, group comparisons between patients who had no anxiety disorder (n = 484) at 2-year follow-up and patients who did (n = 403)

PD, panic disorder; SAD, social anxiety disorder; GAD, generalized anxiety disorder; MDD, major depressive disorder; FQ, Fear Questionnaire; PSWQ, Penn State Worry Questionnaire; SSI, Suicidal Ideation Scale; 4DSQ, Four-Dimensional Symptom Questionnaire; IDS-SR, Inventory of Depressive Symptomatology-SR; ISR, Insomnia Rating Scale; BAI, Beck's Anxiety Inventory; LCI, life chart interview; NEO-FFI, NEO Five-Factor Inventory; LEIDS, Leiden Index of Depression Sensitivity; ASI, Anxiety Sensitivity Index; BMI, Body Mass Index; CRP, c-reactive protein; IL-6, interleukin-6; TNF-α, tumour necrosis factor-α; BDNF, Brain-Derived Neurotrophic Factor. p values shown in bold are <0.05.

^a Childhood life events (<16 years of age) were parental divorce, being placed in a juvenile prison, raised in a foster family, placed in a child home, death of a parent.

^b Childhood trauma included emotional neglect, psychological abuse, physical abuse, and sexual abuse.

^c As measured with the AUDIT. Scores above 8 are reflective of hazardous drinking, scores at 13 or higher (females) and 15 or higher (males) are indicative of probable alcohol dependency.

Recovery from anxiety disorders

Classification performance

Results of our evaluation of the RFC when predicting recovery from anxiety disorders are reported in Table 3 and Fig. 1a. AUC values for the predictor domains ranged from 0.49 to 0.67, with significant (p_Bonferroni < 0.05) AUC values obtained for the clinical (0.67) and psychological (0.65) domains, as well as for the combination of all domains (0.67). Classification accuracies were low to moderate, with the highest accuracy achieved by the combination of all domains (62.4%) with a sensitivity of 62.0% and specificity of 62.8%. In addition, we investigated the performance of the RFC for subgroups of patients who had any comorbidity (MDD, dysthymia, or alcohol dependency; n = 252 recovered, n = 248 persistent) at baseline and for patients who did not (n = 232 recovered, n = 155 persistent). For that, the RFC trained on all data domains and all patients of the training set was evaluated within the two subgroups on the test set separately. The RFC obtained an average AUC of 0.64 within the no-comorbidity group and an AUC of 0.68 within the comorbidity group, showing slightly better performance for predictions within the comorbidity group.
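The reported accuracy of 62.4% is consistent with balanced accuracy as defined in the Evaluation section, i.e. the mean of sensitivity (62.0%) and specificity (62.8%). The sketch below shows how the reported metric types follow from a confusion matrix; the counts are hypothetical, and treating recovery as the positive class is an assumption.

```python
def classification_metrics(tp, fn, tn, fp):
    """Metrics derived from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recovered patients correctly identified
    specificity = tn / (tn + fp)   # persistent patients correctly identified
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "ppv": tp / (tp + fp),     # positive predictive value
        "npv": tn / (tn + fn),     # negative predictive value
    }

# Hypothetical counts for one test fold (not taken from the study).
example = classification_metrics(tp=30, fn=20, tn=28, fp=12)

# Consistency check on the reported percentages for the combined domains.
reported_balanced_accuracy = (62.0 + 62.8) / 2
```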

Table 3. Evaluation of the 2-year recovery from anxiety disorders classification [mean (s.d.)]

AUC, area-under-receiver-operator-curve; PPV, positive predictive value; NPV, negative predictive value; *p_Bonferroni < 0.05.

p values shown in bold are <0.05.

Fig. 1. Classification performance of random forest classifiers. Performance is quantified by area-under-the-receiver-operator-curve (AUC) values calculated for each test set of all cross-validation iterations and is shown in box-and-whisker plots for all data domains. (a) Performance of the recovery from anxiety disorders prediction. (b) Performance of the recovery from all common mental disorders prediction. Asterisks mark a significant classification performance according to label-permutation tests (n = 1000) and Bonferroni correction for six tests. The dashed line indicates chance-level performance.

Domain comparisons

When comparing different domains according to their AUC, a clear ordering was observed: the clinical domain outperformed every other domain except for the combination of all domains (p_Bonferroni < 0.05), the psychological domain outperformed the sociodemographic, biological, and lifestyle domains (p_Bonferroni < 0.05), the sociodemographic domain outperformed the biological and lifestyle domains (p_Bonferroni < 0.05), and the biological domain outperformed the lifestyle domain (p_Bonferroni < 0.05). The combination of all domains was better than any domain except for the clinical domain (p_Bonferroni < 0.05).