Anxiety disorders are characterized by highly heterogeneous clinical course trajectories. After 2 years, the prognosis varies across disorders with remittance rates of 72.5% for panic disorder without agoraphobia, 69.7% for generalized anxiety disorder, 53.5% for social phobia and 52.7% for panic disorder with agoraphobia (Hendriks, Spijker, Licht, Beekman, & Penninx, Reference Hendriks, Spijker, Licht, Beekman and Penninx2013). Remitted patients experience a relatively benign course with moderate remaining symptom severity, disability and a low subjective need for care (Batelaan, Rhebergen, Spinhoven, van Balkom, & Penninx, Reference Batelaan, Rhebergen, Spinhoven, van Balkom and Penninx2014; Spinhoven et al., Reference Spinhoven, Batelaan, Rhebergen, van Balkom, Schoevers and Penninx2016; van Beljouw, Verhaak, Cuijpers, van Marwijk, & Penninx, Reference van Beljouw, Verhaak, Cuijpers, van Marwijk and Penninx2010). However, around 60% of patients have persistent symptoms, relapses, or chronic disease up to 6 years after the diagnosis (Batelaan et al., Reference Batelaan, Rhebergen, Spinhoven, van Balkom and Penninx2014; Spinhoven et al., Reference Spinhoven, Batelaan, Rhebergen, van Balkom, Schoevers and Penninx2016). Disease course in these patients is often characterized by substantial levels of disability. Predicting long-term disease course can be seen as an important step towards personalized medicine (Steyerberg, Reference Steyerberg2009). This would make targeted treatment efforts viable, in which treatments are tailored towards the individual risk for a poor disease outcome (McGorry, Ratheesh, & O'Donoghue, Reference McGorry, Ratheesh and O'Donoghue2018). However, in anxiety disorders, there is a lack of robust course predictors. For instance, different DSM anxiety disorder diagnoses were shown to be poorly predictive of subsequent course (Batelaan et al., Reference Batelaan, Rhebergen, Spinhoven, van Balkom and Penninx2014). In current clinical practice, in the absence of valid risk prediction models, course prediction relies solely on clinician's opinions, which show poor accuracy (Randall, Sareen, Chateau, & Bolton, Reference Randall, Sareen, Chateau and Bolton2019).
Several clinical, psychological, biological, sociodemographic and lifestyle markers are related to the disease course. For instance, higher baseline severity of anxiety symptoms, presence of somatic or psychiatric comorbidity, and higher levels of disability are linked to worse outcomes at 1-year (van Beljouw et al., Reference van Beljouw, Verhaak, Cuijpers, van Marwijk and Penninx2010), 2-year (Batelaan et al., Reference Batelaan, Rhebergen, Spinhoven, van Balkom and Penninx2014; Hendriks et al., Reference Hendriks, Spijker, Licht, Beekman and Penninx2013; Scholten et al., Reference Scholten, Batelaan, van Balkom, Penninx, Smit and Van Oppen2013), 6-year (Spinhoven et al., Reference Spinhoven, Batelaan, Rhebergen, van Balkom, Schoevers and Penninx2016), and 12-year follow-up (Bruce et al., Reference Bruce, Yonkers, Otto, Eisen, Weisberg, Pagano and Keller2005). Contrastingly, some authors suggest the same factors lead to better initial treatment results (Baldwin & Tiwari, Reference Baldwin and Tiwari2009; Rodriguez et al., Reference Rodriguez, Weisberg, Pagano, Bruce, Spencer, Culpepper and Keller2006). Also, a chronic duration of anxiety was linked to worse outcomes in most studies (Batelaan et al., Reference Batelaan, Rhebergen, Spinhoven, van Balkom and Penninx2014; Hendriks et al., Reference Hendriks, Spijker, Licht, Beekman and Penninx2013; Scholten et al., Reference Scholten, Batelaan, van Balkom, Penninx, Smit and Van Oppen2013; Spinhoven et al., Reference Spinhoven, Batelaan, Rhebergen, van Balkom, Schoevers and Penninx2016), while not showing any effect on disease course in another study (Nay, Brown, & Roberson-Nay, Reference Nay, Brown and Roberson-Nay2013). Most studies showed that a younger age at onset was associated with a chronic course (Batelaan et al., Reference Batelaan, Rhebergen, Spinhoven, van Balkom and Penninx2014; Beesdo-Baum et al., Reference Beesdo-Baum, Knappe, Fehm, Höfler, Lieb, Hofmann and Wittchen2012; Rodriguez et al., Reference Rodriguez, Weisberg, Pagano, Bruce, Spencer, Culpepper and Keller2006), while others showed no such age effect (Nay et al., Reference Nay, Brown and Roberson-Nay2013; Scholten et al., Reference Scholten, Batelaan, van Balkom, Penninx, Smit and Van Oppen2013). Inconsistent findings are likely due to methodological differences between studies. Other factors possibly related to worse disease course were duration of untreated illness (Baldwin & Tiwari, Reference Baldwin and Tiwari2009), the use of anti-anxiety medication (Bruce et al., Reference Bruce, Yonkers, Otto, Eisen, Weisberg, Pagano and Keller2005; Scholten et al., Reference Scholten, Batelaan, van Balkom, Penninx, Smit and Van Oppen2013), and presence of childhood trauma (Asselmann & Beesdo-Baum, Reference Asselmann and Beesdo-Baum2015; Batelaan et al., Reference Batelaan, Rhebergen, Spinhoven, van Balkom and Penninx2014; Scholten et al., Reference Scholten, Batelaan, van Balkom, Penninx, Smit and Van Oppen2013). Psychological factors that negatively impact anxiety disorder disease course up till 6-year follow-up included high neuroticism (Asselmann & Beesdo-Baum, Reference Asselmann and Beesdo-Baum2015; Scholten et al., Reference Scholten, Batelaan, van Balkom, Penninx, Smit and Van Oppen2013; Spinhoven et al., Reference Spinhoven, Batelaan, Rhebergen, van Balkom, Schoevers and Penninx2016), low extraversion (Spinhoven et al., Reference Spinhoven, Batelaan, Rhebergen, van Balkom, Schoevers and Penninx2016), high anxiety sensitivity (Asselmann & Beesdo-Baum, Reference Asselmann and Beesdo-Baum2015; Scholten et al., Reference Scholten, Batelaan, van Balkom, Penninx, Smit and Van Oppen2013), high levels of worrying (Spinhoven et al., Reference Spinhoven, Batelaan, Rhebergen, van Balkom, Schoevers and Penninx2016), and low mastery (Asselmann & Beesdo-Baum, Reference Asselmann and Beesdo-Baum2015; Scholten et al., Reference Scholten, Batelaan, van Balkom, Penninx, Smit and Van Oppen2013). Only a few studies linked biological parameters to disease course in anxiety disorders: C-reactive protein (CRP) levels were longitudinally associated with anxiety symptoms (Copeland, Shanahan, Worthman, Angold, & Costello, Reference Copeland, Shanahan, Worthman, Angold and Costello2012), increasing cortisol levels were linked to higher 6-month anxiety severity in girls (Schiefelbein & Susman, Reference Schiefelbein and Susman2006), and lower Brain-Derived Neurotropic Factor (BDNF) levels were found in patients with a poor response to treatment (Kobayashi et al., Reference Kobayashi, Shimizu, Hashimoto, Mitsumori, Koike, Okamura and Iyo2005). However, most research into biological parameters for anxiety disorders was done cross-sectionally, showing that anxiety disorder status is linked to higher CRP-levels (Copeland et al., Reference Copeland, Shanahan, Worthman, Angold and Costello2012; Pitsavos et al., Reference Pitsavos, Panagiotakos, Papageorgiou, Tsetsekou, Soldatos and Stefanadis2006; Vogelzangs, Beekman, De Jonge, & Penninx, Reference Vogelzangs, Beekman, De Jonge and Penninx2013), higher metabolic syndrome markers (Carroll et al., Reference Carroll, Phillips, Thomas, Gale, Deary and Batty2009; Kahl et al., Reference Kahl, Schweiger, Correll, Müller, Busch, Bauer and Schwarz2015; Perez-Cornago, Ramírez, Zulet, & Martinez, Reference Perez-Cornago, Ramírez, Zulet and Martinez2014), higher tumour necrosis factor-α (TNF-α) levels (Hoge et al., Reference Hoge, Brandstetter, Moshier, Pollack, Wong and Simon2009; Pitsavos et al., Reference Pitsavos, Panagiotakos, Papageorgiou, Tsetsekou, Soldatos and Stefanadis2006), and lower BDNF levels (Molendijk et al., Reference Molendijk, Bus, Spinhoven, Penninx, Prickaerts, Oude Voshaar and Elzinga2012). Inconsistently, anxiety symptoms were linked to both higher (Zoccola, Dickerson, & Yim, Reference Zoccola, Dickerson and Yim2011) and lower (O ’Donovan et al., Reference O ’Donovan, Hughes, Slavich, Lynch, Cronin, O ’Farrelly and Malone2010) cortisol, as well as higher (Hoge et al. Reference Hoge, Brandstetter, Moshier, Pollack, Wong and Simon2009; O ’Donovan et al. Reference O ’Donovan, Hughes, Slavich, Lynch, Cronin, O ’Farrelly and Malone2010; Pitsavos et al. Reference Pitsavos, Panagiotakos, Papageorgiou, Tsetsekou, Soldatos and Stefanadis2006) and lower (Vogelzangs et al. Reference Vogelzangs, Beekman, De Jonge and Penninx2013) interleukin-6 (IL-6) measurements. Finally, sociodemographic and lifestyle factors such as education years (van Beljouw et al., Reference van Beljouw, Verhaak, Cuijpers, van Marwijk and Penninx2010), age (Asselmann & Beesdo-Baum, Reference Asselmann and Beesdo-Baum2015; Catarino et al., Reference Catarino, Bateup, Tablan, Innes, Freer, Richards and Blackwell2018), partner status (Asselmann & Beesdo-Baum, Reference Asselmann and Beesdo-Baum2015; Batelaan et al., Reference Batelaan, Rhebergen, Spinhoven, van Balkom and Penninx2014), social support (van Beljouw et al., Reference van Beljouw, Verhaak, Cuijpers, van Marwijk and Penninx2010), smoking status (Bruce et al., Reference Bruce, Yonkers, Otto, Eisen, Weisberg, Pagano and Keller2005), nicotine dependency (Nay et al., Reference Nay, Brown and Roberson-Nay2013), current financial problems (Nay et al., Reference Nay, Brown and Roberson-Nay2013), employment status (van Beljouw et al., Reference van Beljouw, Verhaak, Cuijpers, van Marwijk and Penninx2010), and income (van Beljouw et al., Reference van Beljouw, Verhaak, Cuijpers, van Marwijk and Penninx2010) were associated with anxiety disorder disease course. In spite of these many variables that predict disease course at the group level, it is not known whether this translates to accurate predictions for individual patients. Currently, no encompassing model exists with sufficient sensitivity and specificity in disease course prediction to be feasible for use at the level of the individual patient.
A possible explanation for the lack of accuracy in course prediction in anxiety disorders is the complex, multicausal aetiology of anxiety disorders. Univariable and multivariable analyses of predictors of disease course showed low levels of explained variance (Bokma, Batelaan, Hoogendoorn, Penninx, & van Balkom, Reference Bokma, Batelaan, Hoogendoorn, Penninx and van Balkom2020). Furthermore, the inference is typically done on the group-level which does not allow for generalizable statements for the single individual. Multivariable machine learning (ML) methods provide a possible solution for this problem, as they are well-suited for solving problems with high numbers of predictors in complex, multicausal disorders (Iniesta, Stahl, & McGuffin, Reference Iniesta, Stahl and McGuffin2016). The use of ML in the field of psychiatry may have great potential for its application in the prediction of disease course trajectories (Hahn, Nierenberg, & Whitfield-Gabrieli, Reference Hahn, Nierenberg and Whitfield-Gabrieli2017). Prediction of the disease course can be regarded as a ‘classification’ problem, which can be solved using supervised algorithms (Deo, Reference Deo2015). In these, algorithms are trained on patients with known predictor and outcome variables to derive a function that can be applied to unseen patients to predict their outcome based on the values of their predictor variables. In anxiety disorders, supervised algorithms were applied a few times cross-sectionally, to relate predictors from various domains to current disease status (Woo, Chang, Lindquist, & Wager, Reference Woo, Chang, Lindquist and Wager2017) or to predict short-term treatment effects (Lueken & Hahn, Reference Lueken and Hahn2016). To our best knowledge, however, no studies applied supervised ML algorithms to predict the disease course in anxiety disorders.
The aim of this study was to predict long-term anxiety disorder course, using an ML approach applied to clinical, psychological, biological, sociodemographic and lifestyle baseline data. Specifically, we investigated the utility of a random forest classifier (RFC) (Breiman, Reference Breiman2001) to predict clinical course in patients with any baseline anxiety disorder. Our main outcome was recovery from anxiety disorders at 2-year follow-up. As secondary outcome recovery from all common mental disorders (CMDs) at 2-year follow-up was used. CMDs include anxiety disorders, but also depressive disorders and substance use disorders as these disorders often co-occur, show diagnostic instability over time (Hovenkamp-Hermelink et al., Reference Hovenkamp-Hermelink, Riese, Van Der Veen, Batelaan, Penninx and Schoevers2016; Lamers et al., Reference Lamers, van Oppen, Comijs, Smit, Spinhoven, van Balkom and Penninx2011; Scholten et al., Reference Scholten, Batelaan, Penninx, Balkom, Van Smit, Schoevers and Van Oppen2016; Verduijn et al., Reference Verduijn, Verhoeven, Milaneschi, Schoevers, van Hemert, Beekman and Penninx2017), and recovery from one but not the other does not index a major improvement in health. Finally, we assessed which predictor domains contributed most to disease course predictions. We hypothesized that RFCs using a wide array of baseline data from different domains would yield adequate 2-year recovery predictions for both outcomes. Furthermore, we hypothesized that the combination of the five domains would yield the best predictions.
The participants in this study were selected from the multi-site Netherlands Study of Depression and Anxiety (NESDA), an ongoing naturalistic cohort study into the course of depression and anxiety. The baseline sample consists of 2981 participants who were recruited from the community, primary care and specialized mental health care centres. All participants had a lifetime or current depressive disorder or anxiety disorder diagnosis (n = 2329, 78.1%) or were healthy controls (n = 652, 21.9%). NESDA allowed for the presence of comorbid psychiatric disorders, with the exception of psychotic disorders, obsessive-compulsive disorder, post-traumatic stress disorder, bipolar disorders, or severe substance use disorders. Exclusion criterion consisted of insufficient proficiency of the Dutch language. Baseline data collection was performed in 2004–2007 and was followed by 1-year, 2-year, 4-year, 6-year, and 9-year follow-up measurements. Full descriptions of the design of NESDA were published previously (Penninx et al., Reference Penninx, Beekman, Smit, Zitman, Nolen, Spinhoven and Van Dyck2008). The study protocol was approved by the Ethical Review Board of all participating institutes and written informed consent was obtained from all participants.
For the purpose of this study, patients with current (6-month) panic disorder (PD, with or without agoraphobia), generalized anxiety disorder (GAD) or social anxiety disorder (SAD) diagnoses at baseline were selected (n = 1206). In our sample, psychiatric comorbidity was allowed. The diagnosis was established according to DSM-IV criteria with the Composite International Diagnostic Interview (CIDI, version 2.1) (American Psychiatric Association, 2000; Wittchen, Reference Wittchen1994; World Health Organization, 1998). From these patients, 212 were excluded due to missing diagnostic information at 2-years follow-up. A further 107 patients were removed due to having more than 20% missing variables across predictor variables at baseline. This yielded a final sample of 887 anxiety disorder patients with sufficient data available. Excluded patients showed comparable symptom severity at baseline – mean anxiety severity (Beck's Anxiety Inventory; BAI): 20.35 ± 11.74 v. 18.30 ± 10.48, t = 1.81, p = 0.07; mean depression severity (Inventory of Depressive Symptomatology-Self Report; IDS-SR): 30.71 ± 12.65 v. 29.39 ± 12.65, t = 0.97, p = 0.33. Excluded patients were younger (mean age: 38.25 ± 12.05 v. 41.92 ± 12.20 years, t = 4.62, p < 0.001), and had a lower mean number of education years: 11.03 ± 3.15 v. 11.88 ± 3.35, t = 3.97, p < 0.001, consistent with differences across the whole NESDA sample (Lamers et al., Reference Lamers, Hoogendoorn, Smit, van Dyck, Zitman, Nolen and Penninx2012). Gender did not differ between excluded and included patients (% female in excluded sample 68.2%, in included sample 66.8%, χ2 = 0.22, p = 0.64).
Two distinct classification tasks predicting outcomes at 2-year follow-up were performed. Both were binary classification tasks predicting (1) recovery from anxiety disorders or (2) recovery from all CMDs. Anxiety disorders were defined as either PD, agoraphobia, GAD, or SAD. Recovery from anxiety disorders was deemed present if no anxiety disorder diagnoses persisted at follow-up. These diagnoses referred to all follow-up anxiety disorders, not only the index disorder(s). Anxiety disorders, dysthymia, major depressive disorder (MDD) and alcohol dependency are sometimes collectively referred to as CMDs (Ormel et al., Reference Ormel, Jeronimus, Kotov, Riese, Bos, Hankin and Oldehinkel2013; Vollebergh et al., Reference Vollebergh, Iedema, Bijl, de Graaf, Smit and Ormel2001). For the purpose of this study, we defined recovery from all CMDs if at follow-up no anxiety disorders, MDD, dysthymia or alcohol dependency diagnoses were present. Assessment of CMDs is relevant as it is evident from population-based studies that depressive disorders and alcohol dependency are the most commonly occurring comorbidities in anxiety disorders (Alonso & Lépine, Reference Alonso and Lépine2007; Judd et al., Reference Judd, Kessler, Paulus, Zeller, Wittchen and Kunovac1998; Wittchen, Kessler, Pfister, & Lieb, Reference Wittchen, Kessler, Pfister and Lieb2000), rates of diagnostic instability across anxiety disorders, depressive disorders and alcohol dependency are high (Gustavson et al., Reference Gustavson, Knudsen, Nesvåg, Knudsen, Vollset and Reichborn-Kjennerud2018; Hovenkamp-Hermelink et al., Reference Hovenkamp-Hermelink, Riese, Van Der Veen, Batelaan, Penninx and Schoevers2016; Scholten et al., Reference Scholten, Batelaan, Penninx, Balkom, Van Smit, Schoevers and Van Oppen2016) and recovery from one but not the other does not imply a major improvement in health. We assessed recovery from anxiety disorders as a primary outcome measure and recovery from all CMDs as a secondary outcome measure. These two outcome measures describe recovery in a narrow and a broad perspective (Verduijn et al., Reference Verduijn, Verhoeven, Milaneschi, Schoevers, van Hemert, Beekman and Penninx2017).
Baseline predictor variables
At baseline, a wide array of putative predictors from five domains (clinical, psychological, sociodemographic, biological and lifestyle) were selected, yielding a total of 651 variables. In our analyses, only information at the individual item level was used. Total summary scores for questionnaires were not calculated, as these would be correlated to the individual items. The exception was the NEO Five-Factor Inventory (NEO-FFI), as its domains (e.g. neuroticism) are of specific clinical relevance. Items were excluded if more than 20% of patients were missing the corresponding item. This resulted in the inclusion of 569 predictors at baseline (see Table 1). If a variable did not apply for a patient, it was re-coded as a new category for ordinal or nominal variables or as 0 for continuous variables (all continuous variables were positive). Such an encoding allowed to maintain the variable for classification and encoded it with a not naturally occurring value implying that this variable did not apply for this patient. All additional missing variables were imputed using median/mode imputation calculated on the training set (see below) to obtain a full data set. No variable had more than 10% missing values before imputation was applied. Additional information about measurement instruments, variable scoring and collection can be found in the Supplementary Methods. We investigated the predictive capability of all domains individually and the combination of all five domains.
Machine learning algorithm
RFCs (Breiman, Reference Breiman2001) were used in all analyses. RFCs have been shown to perform well on many different machine learning problems (Fernández-Delgado, Cernadas, Barro, & Amorim, Reference Fernández-Delgado, Cernadas, Barro and Amorim2014), specifically in biomedical sciences (Olson, Cava, Mustahsan, Varik, & Moore, Reference Olson, Cava, Mustahsan, Varik and Moore2018). An RFC is built as an ensemble of many decision trees (Breiman, Friedman, Olshen, & Stone, Reference Breiman, Friedman, Olshen and Stone1984) which themselves are trained by considering random subsamples of variables and patients for each tree. Such a procedure leads to improved and robust prediction performance in comparison to individual trees (Breiman, Reference Breiman2001). Details on hyperparameters used in the analysis can be found in the Supplementary Methods. All analyses were implemented using the scikit-learn (version 0.20.2) (Pedregosa et al., Reference Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel and Duchesnay2011) and imbalanced-learn toolboxes (version 0.4.3) (Lemaître, Nogueira, & Aridas, Reference Lemaître, Nogueira and Aridas2017) in the Python programming language (version 3.7.2).
To evaluate the performance of our classifiers 10-times-repeated-10-fold-cross-validation was applied. In this procedure, the data set is repeatedly (n = 100) divided into disjoint training (90% of data) and test (10% of data) sets and the RFC is only fit on the training data and evaluated on the independent test data. The final performance is obtained as an average across all test set evaluations. We measured performance as area-under-the-receiver-operator-curve (AUC). In addition, we calculated sensitivity, specificity, balanced accuracy – average between sensitivity and specificity – and positive/negative predictive values. To further validate our classification performance label-permutation tests (n = 1000) of average AUC values were performed (Ojala & Garriga, Reference Ojala and Garriga2010). The obtained p values were Bonferroni-corrected across five individual and one combination of all domains and alpha was set to 0.05.
To systematically compare the performance of different predictor domains patients were distributed in exactly the same way for each of the classifications, i.e. the train and test set of any cross-validation iteration included the same patients for each predictor domain. This allowed the calculation of normalized average differences in AUC scores across cross-validation iterations for each pair of predictor domains (including the combination of all domains). Non-parametric sign-flipping tests (n = 10 000) were then employed to derive p values which were Bonferroni-corrected for 30 comparisons with alpha set to 0.05.
In addition to its strong classification performance RFCs allow to quantify the importance of each variable towards the classification task (Breiman, Reference Breiman2001). However, the standard calculation of variable importance has been shown to be biased (Strobl, Boulesteix, Zeileis, & Hothorn, Reference Strobl, Boulesteix, Zeileis and Hothorn2007) and a permutation-based variable importance scheme has been suggested instead (Altmann, Toloşi, Sander, & Lengauer, Reference Altmann, Toloşi, Sander and Lengauer2010; Hapfelmeier & Ulm, Reference Hapfelmeier and Ulm2013; Strobl et al., Reference Strobl, Boulesteix, Zeileis and Hothorn2007). Following this approach, we calculated p values for each variable by permuting (n = 1000) every variable separately. The computed p values were then corrected according to the false discovery rate (FDR) (Benjamini & Hochberg, Reference Benjamini and Hochberg2000) and significance was set to 0.05. Given that variable importance was calculated every cross-validation iteration, important variables were defined as variables which were consistently significant under FDR for at least 50% of all cross-validation iterations. This very stringent procedure for identifying important variables was employed to calculate valid variable importance information specific to the classification task. Variable importance were only investigated for the classifications using the data from the combination of all domains. In addition, we investigated differences in the average rankings of important variables between the two classification tasks. A detailed description of this approach can be found in the Supplementary Methods.
At 2-year follow-up, 484 patients (54.6%) recovered from anxiety disorders, and 362 patients (40.8%) did not have any CMD. Baseline clinical, psychological, sociodemographic, biological and lifestyle variables are provided for patients with and without anxiety disorders at follow-up (Table 2) and for patients with and without CMD at follow-up (online Supplementary Table 1). Various clinical and psychological variables showed differences between the two groups. By contrast, biological and lifestyle status did not differ between the two groups.
PD, panic disorder; SAD, social anxiety disorder; GAD, generalized anxiety disorder; MDD, major depressive disorder; FQ, Fear Questionnaire; PSWQ, Penn State Worry Questionnaire; SSI, Suicidal Ideation Scale; 4DSQ, Four-Dimensional Symptom Questionnaire; IDS-SR, Inventory of Depressive Symptomatology-SR; ISR, Insomnia Rating Scale; BAI, Beck's Anxiety Inventory; LCI, life chart interview; NEO-FFI, NEO Five-Factor Inventory; LEIDS, Leiden Index of Depression Sensitivity; ASI, Anxiety Sensitivity Index; BMI, Body Mass Index; CRP, c-reactive protein; IL-6, interleukin-6; TNF-α, tumour necrosis factor-α; BDNF, Brain-Derived Neurotrophic Factor. p values shown in bold are <0.05.
a Childhood life events (<16 years of age) were parental divorce, being placed in a juvenile prison, raised in a foster family, placed in a child home, death of a parent.
b Childhood trauma included emotional neglect, psychological abuse, physical abuse, and sexual abuse.
c As measured with the AUDIT. Scores above 8 are reflective of hazardous drinking, scores at 13 or higher (females) and 15 or higher (males) are indicative of probable alcohol dependency.
Recovery from anxiety disorders
Results of our evaluation of the RFC when predicting recovery from anxiety disorders are reported in Table 3 and Fig. 1A. AUC values for the predictor domains ranged from 0.49 to 0.67 with significant (pBonferroni < 0.05) AUC values obtained for the clinical (0.67), and psychological (0.65) domains, as well as for the combination of all domains (0.67). Classification accuracies were small to moderate with the highest accuracy achieved by the combination of all domains (62.4%) with a sensitivity of 62.0% and specificity of 62.8%. In addition, we investigated the performance of the RFC for subgroups of patients who had any comorbidity (MDD, dysthymia, or alcohol dependency, n = 252 recovered, n = 248 persistent) at baseline and for patients who did not ( n = 232 recovered, n = 155 persistent). For that, the RFC trained on all data domains and all patients of the training set was evaluated within the two subgroups on the test set separately. The RFC obtained an average AUC of 0.64 within the no-comorbidity group and an AUC of 0.68 within the comorbidity group showing slightly increased performance for predictions within the comorbidity group.
AUC, area-under-receiver-operator-curve; PPV, positive predictive value; NPV, negative predictive value; *pBonferroni < 0.05.
p values shown in bold are <0.05.
When comparing different domains according to their AUC a clear ordering was observed: The clinical domain outperformed every other domain except for the combination of all domains (pBonferroni < 0.05), the psychological domain outperformed the sociodemographic, biological, and lifestyle domains (pBonferroni < 0.05), the sociodemographic domain outperformed the biological and lifestyle domains (pBonferroni < 0.05), and the biological domain outperformed the lifestyle domain (pBonferroni < 0.05). The combination of all domains was better than any domain except for the clinical domain (pBonferroni < 0.05).
Consistently selected significant variables (N = 17) identified through a permutation-based variable importance calculation of the RFC are reported in online Supplementary Table 2. Only variables from the clinical and psychological domain were selected. These variables were derived from different measurement instruments (BAI, IDS-SR, Fear Questionnaire (FQ), NEO-FFI, WHO-Disability Assessment (WHO-DAS), Four-Dimensional Symptom Questionnaire (4DSQ), Mastery scale) but all referred to characteristic anxiety symptoms, with an emphasis on anxious arousal items.
Recovery from all common mental disorders
Results of the second classification procedure predicting recovery from CMDs are reported in Table 4 and Fig. 1B. AUC values ranged from 0.53 to 0.70 with significant (pBonferroni < 0.05) AUC values obtained for the clinical (0.70), psychological (0.67), and sociodemographic domain (0.65) as well as the combination of all domains (0.70). The highest accuracy was achieved by the combination of all domains (63.4%) with a sensitivity of 64.6% and a specificity of 62.3%. As in the case of the prediction of the recovery from anxiety disorders, we investigated the performance of the RFC for subgroups of patients who had (n = 164 recovered, n = 336 persistent) or did not (n = 198 recovered, n = 189 persistent) have any comorbidities at baseline. For that, the RFC trained on the combintation of all domains and all patients of the training set was evaluated within the two subgroups on the test set separately. The RFC obtained an AUC of 0.62 within the no-comorbidity group and an AUC of 0.73 within the comorbidity group. As in the case of the prediction of recovery from anxiety disorders the RFC was showing better performance for patients with comorbidities at baseline.
AUC, area-under-receiver-operator-curve; PPV, positive predictive value; NPV, negative predictive value; *pBonferroni < 0.05.
p values shown in bold are <0.05.
The best performing domains for this classification were the same as in the recovery from anxiety disorders classification. The clinical domain and the combination of all domains did not differ in their performance but outperformed any other domain during the classification. The order for the performance of the other domains was the same as with the recovery from anxiety disorders classification.
48 variables were identified as being consistently selected significant variables contributing to the classification (online Supplementary Table 3). In this classification, selected variables included a larger set of measures related to mood disorders and not only anxiety symptomatology. With one exception (sociodemographic) all variables were again selected from the clinical or psychological domain.
Difference in important variables between prediction analyses
Variables which were more (or less) important in the prediction of recovery from anxiety disorders than the prediction of all CMDs are reported in online Supplementary Table 4. These results confirmed the importance of anxiety-related variables for the prediction of recovery from anxiety, and the importance of depression-related variables for the prediction of recovery from all CMDs.
We replicated the classification of recovery from anxiety disorders at 2-year follow-up in a transfer learning setting: in such an approach we utilized the labels indicating recovery of CMDs during the training of the RFC classifier (training set) but subsequently evaluated its performance on the test set using the recovery from anxiety disorder labels. The result of this analysis can be seen in online Supplementary Table 5. Utilizing the transfer learning approach led to improved performance in predicting anxiety disorder recovery (AUC = 0.71 v. AUC = 0.67 for both training and testing on anxiety disorder recovery labels using either only the clinical or the combination of all domains). The increased performance was observed due to an increase in sensitivity of the classification for correctly identifying recovered anxiety patients. For all individual domains and the combination of them, sensitivity increased by 7.6 ± 1.9 when training on the CMDs labels first. Specificity only decreased slightly (mean decrease: 2.7 ± 0.8) which led to the improved overall performance.
One of the most important goals in personalized medicine is providing individual disease course predictions. Our results show that individual prediction of 2-year course in anxiety disorders is possible using various predictors but it is only moderately successful. The main outcome measure was recovery from anxiety disorders and our predictions reached a balanced accuracy of 62.4% with an AUC of 0.67. The current performance by itself does not warrant implementation of our models in routine psychiatric care as it would yield too many false positives/negatives. However, predictive properties of clinician opinion in predicting disease course in anxiety disorders are not available and therefore it remains unclear which predictive performance threshold is needed for a statistical model to surpass clinician opinion and become an improvement over current routine care.
Our study yielded two models with comparable accuracy for predicting 2-year anxiety disorder course: one consisting of predictors from all five domains and one consisting of predictors only from the clinical domain. Biological, lifestyle, and sociodemographic predictors did not contribute significantly to course prediction. This is surprising as these domains were previously shown to be related to anxiety disorder aetiology. Our results thereby suggest that the underlying aetiology is of less importance to course prediction after the development of threshold disorders and that after anxiety disorders have developed, phenotypical characteristics have more impact on subsequent disease course. This is evident from the individual features that contributed most to the classification. All of these features reflected symptoms, psychological states or traits associated with the emotions of fear and anxiety, such as the presence of ‘phobic symptoms’, difficulty ‘walking alone in a busy street’ or ‘dealing with people you don't know’, ‘feeling tense’, ‘not liking to be where the action is’, and ‘feeling faint or lightheaded’. A previous NESDA study that aimed to predict the naturalistic course in depression showed similar performance to the current study when 2-year follow-up MDD diagnosis was correctly classified with an AUC of 0.66 and balanced accuracy of 62% (Dinga et al., Reference Dinga, Marquand, Veltman, Beekman, Schoevers, van Hemert and Schmaal2018). In this study, clinical features were most important as well, though the nature of those items was related to depression.
As anxiety disorders and other psychiatric disorders frequently co-occur and show diagnostic instability over time, a secondary outcome was assessed. This broad perspective model was trained on recovery from all CMDs and showed marginally higher accuracy (63.4%) and AUC (0.70) in comparison with the main narrow perspective outcome. Like in the narrow perspective, omitting all domains except the clinical domain did not lead to a significant loss of predictive power (accuracy = 62.2% and AUC = 0.70). The individual features that were most consistently chosen during the classification again were almost exclusively from the clinical and psychological domains. Symptoms, psychological traits, and psychological states associated with depression and worrying contributed most to the classification. For instance: ‘feeling down’, ‘feeling sad’, having ‘a desire to die’, ‘suffering from worry’, ‘feeling tense’, and ‘having little control about the things that happen’. This suggests that predictions for recovery from all CMDs were largely driven by co-occurring depressive symptoms. Our decision to investigate the CMDs classification was also supported by the results of the additional transfer analysis which showed improved performance (accuracy = 63.3% and AUC = 0.71 for the combination of all domains data) when using the recovery from all CMDs labelling during training and the recovery from anxiety labels during model evaluation. This analysis showed that patients suffering from any mental disorder at 2-year follow-up – anxiety or not – constituted a more homogenous group while patients who fully recovered were more easily identified than patients only recovering from anxiety disorders (but having an additional CMD instead). This suggests that applying a broad perspective in future attempts in clinical prediction is more feasible for anxiety disorders.
Previous ML studies in anxiety disorders were invariably small in sample size and most focused on predicting immediate treatment response using neuroimaging data (Ball, Stein, Ramsawh, Campbell-Sills, & Paulus, Reference Ball, Stein, Ramsawh, Campbell-Sills and Paulus2014; Doehrmann et al., Reference Doehrmann, Ghosh, Polli, Reynolds, Horn, Keshavan and Gabrieli2013; Hahn et al., Reference Hahn, Kircher, Straube, Wittchen, Konrad, Ströhle and Lueken2014; Pantazatos, Talati, Schneier, & Hirsch, Reference Pantazatos, Talati, Schneier and Hirsch2014; Whitfield-Gabrieli et al., Reference Whitfield-Gabrieli, Ghosh, Nieto-Castanon, Saygin, Doehrmann, Chai and Gabrieli2016). Some studies used clinical, biological and/or neuroimaging data to distinguish between different types of anxiety disorders and healthy controls (Carpenter, Sprechmann, Calderbank, Sapiro, & Egger, Reference Carpenter, Sprechmann, Calderbank, Sapiro and Egger2016; Frick et al., Reference Frick, Gingnell, Marquand, Howner, Fischer, Kristiansson and Furmark2014; Hilbert, Lueken, Muehlhan, & Beesdo-Baum, Reference Hilbert, Lueken, Muehlhan and Beesdo-Baum2017; Pantazatos et al., Reference Pantazatos, Talati, Schneier and Hirsch2014). To the best of our knowledge, this is the first study into individual long-term course prediction in anxiety disorders. A strength of this study is the use of a large dataset with a high number of variables from a variety of predictor domains, most of which were previously related to disease course at the group level. In addition, using RFCs allowed for combining large numbers of predictors into an overall model and allowed the identification of the most contributing predictors, providing insight into the possible processes involved with recovery in anxiety disorders.
In spite of the wide array of predictors, the current study showed only moderate accuracy. This has a number of explanations. First, NESDA is a naturalistic cohort study in which the exposure to environmental stressors and treatment regimens varied across patients during the 2-year follow-up period. These different exposures will have impacted the 2-year outcomes. Furthermore, different data types might improve predictive accuracy. For instance, previous ML studies showed the strong potential of neuroimaging data to predict treatment response in anxiety disorders (Ball et al., Reference Ball, Stein, Ramsawh, Campbell-Sills and Paulus2014; Doehrmann et al., Reference Doehrmann, Ghosh, Polli, Reynolds, Horn, Keshavan and Gabrieli2013; Hahn et al., Reference Hahn, Kircher, Straube, Wittchen, Konrad, Ströhle and Lueken2014; Pantazatos et al., Reference Pantazatos, Talati, Schneier and Hirsch2014; Whitfield-Gabrieli et al., Reference Whitfield-Gabrieli, Ghosh, Nieto-Castanon, Saygin, Doehrmann, Chai and Gabrieli2016), sometimes exceeding predictions made using clinical data (Ball et al., Reference Ball, Stein, Ramsawh, Campbell-Sills and Paulus2014; Doehrmann et al., Reference Doehrmann, Ghosh, Polli, Reynolds, Horn, Keshavan and Gabrieli2013). Our study did not encompass neuroimaging data, as these were only available in a subset of NESDA participants (Janssen, Mourão-Miranda, & Schnack, Reference Janssen, Mourão-Miranda and Schnack2018). Other examples include gait analysis (Zhao et al., Reference Zhao, Zhang, Wang, Wang, Li, Zhu and Xiang2019), actigraphy (Merikangas et al., Reference Merikangas, Swendsen, Hickie, Cui, Shou, Merikangas and Zipunnikov2019), or social media data (Reece & Danforth, Reference Reece and Danforth2017). Additionally, more frequent data collection might improve predictive accuracy (Kubben, Dumontier, & Dekker, Reference Kubben, Dumontier and Dekker2019), which has now been implemented in the most recent wave of NESDA (Difrancesco et al., Reference Difrancesco, Lamers, Riese, Merikangas, Beekman, Hemert and Penninx2019). However, it is worth noting that our analyses showed that using a large set of variables from various domains (either combined or independently) did not outperform the clinical domain alone. Finally, future studies could explore differences in predictive performance across different patient subgroups, by analyzing separate patient groups consisting of different anxiety disorders, or groups with different comorbidity patterns separately.
Clinical care for anxiety disorders would benefit greatly from improved course prediction as it would pave the way for targeted treatments. The current study showed moderate accuracy in predicting recovery from anxiety disorders over a 2-year follow-up for individual patients. Items from the clinical and psychological domain were the most contributing predictors, while biological, lifestyle, and sociodemographic predictors were contributing less. The limited performance while using a wide array of predictors does not justify application in routine clinical care. The results from our study can, however, be used as a benchmark for future studies, with future studies likely resulting in further enhancements of the predictive properties. It has long been argued that statistical modelling will exceed clinician opinion in prediction problems (Ayres, Reference Ayres2007; Meehl, Reference Meehl1954), with clinician interpretation of statistical models likely yielding the best predictive power (Kuhn & Johnson, Reference Kuhn and Johnson2013). As a result, statistical models will increasingly become an addition to clinician opinion. Eventually, targeted treatment regimens and secondary prevention strategies will become more feasible if predictive models further evolve. This study provides an important first step towards valid long-term ML-based predictions in anxiety disorders.
The supplementary material for this article can be found at https://doi.org/10.1017/S0033291720001658.
This study was supported by the Netherlands Organization for Scientific Research (NWO/ZonMW Vidi 016.156.318) and the AMC Research Council (150622). The infrastructure for the NESDA study (www.nesda.nl) is funded through the Geestkracht program of the Netherlands Organisation for Health Research and Development (ZonMw, grant number 10-000-1002) and financial contributions by participating universities and mental health care organizations (VU University Medical Center, GGZ inGeest, Leiden University Medical Center, Leiden University, GGZ Rivierduinen, University Medical Center Groningen, University of Groningen, Lentis, GGZ Friesland, GGZ Drenthe, Rob Giel Onderzoekscentrum).
Conflict of interest
Dr Penninx reports grants from Dutch Ministry of Health/NWO, research funds from Janssen pharmaceuticals, and research funds from Boehringer-Ingelheim, during the conduct of the study. The other authors do not have potential conflicts of interest.
The authors assert that all procedures contributing to this work comply with the ethical standards of the relevant national and institutional committees on human experimentation and with the Helsinki Declaration of 1975, as revised in 2008.