Introduction
Major depressive disorder (MDD) has high prevalence and high impairment (GBD 2019 Diseases and Injuries Collaborators, 2020). The two primary first-line MDD treatments are psychotherapy and antidepressant medication (ADM; Qaseem, Barry, & Kansagara, Reference Qaseem, Barry and Kansagara2016). ADM is the more common treatment despite most patients preferring psychotherapy (McHugh, Whitton, Peckham, Welge, & Otto, Reference McHugh, Whitton, Peckham, Welge and Otto2013) due to lower cost and wider availability (Hockenberry, Joski, Yarbrough, & Druss, Reference Hockenberry, Joski, Yarbrough and Druss2019). But some MDD patients do not respond to ADMs (Cipriani et al., Reference Cipriani, Furukawa, Salanti, Chaimani, Atkinson, Ogawa and Geddes2018; Kazdin et al., Reference Kazdin, Wu, Hwang, Puac-Polanco, Sampson, Al-Hamzawi and Kessler2021; Little, Reference Little2009) but do to psychotherapy or an ADM-psychotherapy combination. However, the latter treatments often are provided only after months of unsuccessful ADM treatment (Day et al., Reference Day, Shah, Taylor, Marwood, Nortey, Harvey and Strawbridge2021). A meaningful proportion of patients drop out before receiving other treatments (Larson et al., Reference Larson, Nemoianu, Lawrence, Troup, Gionfriddo, Pousti and Touya2021). A strategy to predict likelihood of responding before initiating ADM treatment could be of value.
Many multivariable models have been developed, typically using machine learning (ML) methods (Chekroud et al., Reference Chekroud, Bondar, Delgadillo, Doherty, Wasil, Fokkema and Choi2021; Ermers, Hagoort, & Scheepers, Reference Ermers, Hagoort and Scheepers2020; Lee et al., Reference Lee, Ragguett, Mansur, Boutilier, Rosenblat, Trevizol and McIntyre2018), to predict depression treatment response. Most such models can be faulted, though, for (i) low external validity because of restriction to clinical trial samples; (ii) focus on biomarkers infeasible to use in routine clinical practice; (iii) including many fewer predictors than documented in the literature; or (iv) suboptimal analytic methods.
The current report presents results of a study designed to address these problems by analyzing an observational sample of patients recruited near beginning ADM treatment and administered an extensive baseline battery of self-report questions to assess predictors of ADM treatment response found in previous studies. The patients were followed for 3 months to assess treatment response. The data were analyzed using a state-of-the-art stacked generalization ML method.
Materials and methods
Sample
As detailed elsewhere (Puac-Polanco et al., Reference Puac-Polanco, Leung, Bossarte, Bryant, Keusch, Liu and Kessler2021), a probability sample of patients beginning MDD outpatient treatment was selected from Veterans Health Administration (VHA) electronic health records (EHRs) December 2018–June 2020. Inclusion criteria were: (i) beginning first outpatient MDD treatment in the past year; and (ii) receiving ADM prescription and/or psychotherapy referral. Exclusions were: (i) 12-month suicide attempt; (ii) lifetime diagnoses of bipolar disorder, nonaffective psychosis, dementia, intellectual disability, autism, Tourette's disorder, stereotyped movement disorder, or borderline intellectual functioning; (iii) lifetime prescriptions of mood stabilizers or antipsychotic medications (online Supplementary Table S1). The exclusion of 12-month suicide attempts was made because such patients in VHA are placed on a high-risk list that leads to intensive case management, making the experiences of these patients unrepresentative of the more general patient population.
Recruitment letters were sent to 55 106 provisionally eligible patients the day after their first outpatient visit. The letter described the study purposes and the requirements of self-report web- or phone-based baseline assessment taking 45 min and at 3 months follow-up taking 20 min, with compensation of $50 and $25, respectively, for the two assessments. A team member then attempted to contact each patient over the next week (three call attempts). Of the 17 000 reached, 6298 agreed to participate and 4164 completed the baseline assessment (online Supplementary Fig. S1). At baseline, 809 respondents had received an ADM prescription without referral to psychotherapy and were otherwise eligible. The 660 of these 809 who completed the 3-month assessment are the focus of this report. The protocol was approved by the Institutional Review Board of Syracuse VA Medical Center, Syracuse, New York.
Measures
Treatment response
Two-week depressive symptoms were assessed with the 16-item Quick Inventory of Depression Symptomatology Self-Report (QIDS-SR; Rush et al., Reference Rush, Trivedi, Ibrahim, Carmody, Arnow, Klein and Keller2003). A modified version of the Sheehan Disability Scale (SDS; Leon, Olfson, Portera, Farber, & Sheehan, Reference Leon, Olfson, Portera, Farber and Sheehan1997) was used to assess role impairment by asking patients how much depression interfered with the ability to work, participate in family and home life, or participate in social activities in the past 2 weeks on a 0–10 visual analog scale with response options of not at all (0), mildly (1–3), moderately (4–6), markedly (7–9), and extremely (10) (Cronbach's α = 0.85).
Treatment response was defined as either (i) a 3-month QIDS-SR score no more than half its baseline value or (ii) a baseline SDS score of 4–10 in any role impairment domain along with a 3-month SDS score of 0–3 in all such domains. A similar composite definition of ADM treatment response was used in previous research (Huang et al., Reference Huang, Wang, Chen, Zhang, Yuan, Yue and Fang2018; Wang et al., Reference Wang, You, Wang, Xu, Bai, Xie and Hu2018; Zilcha-Mano et al., Reference Zilcha-Mano, Wang, Wajsbrot, Boucher, Fine and Rutherford2021).
Predictors
Numerous recent reports have carried out reviews or meta-analyses of research on baseline predictors of response to individual types of depression treatment (e.g. Furukawa et al., Reference Furukawa, Suganuma, Ostinelli, Andersson, Beevers, Shumake and Cuijpers2021; Noma et al., Reference Noma, Furukawa, Maruo, Imai, Shinohara, Tanaka and Cipriani2019) or treatment in general pooled across multiple treatment types (e.g. Buckman et al., Reference Buckman, Cohen, O'Driscoll, Fried, Saunders, Ambler and Pilling2021a, Reference Buckman, Saunders, Cohen, Barnett, Clarke, Ambler and Pilling2021b, Reference Buckman, Saunders, O'Driscoll, Cohen, Stott, Ambler and Pilling2021c, Reference Buckman, Saunders, Arundell, Oshinowo, Cohen, O'Driscoll and Pilling2022), which are referred to collectively as prognostic predictors. Other reviews have examined baseline variables that interact significantly with treatment type to predict outcomes (Maj et al., Reference Maj, Stein, Parker, Zimmerman, Fava, De Hert and Wittchen2020; Perlman et al., Reference Perlman, Benrimoh, Israel, Rollins, Brown, Tunteng and Berlim2019; Perna, Alciati, Daccò, Grassi, & Caldirola, Reference Perna, Alciati, Daccò, Grassi and Caldirola2020), which are referred to as prescriptive predictors. Predictors from all important domains of either prescriptive or prognostic predictors were included in our baseline questionnaire or abstracted from EHRs or government small-area geospatial databases linked to patient residential addresses. Included here were six domains involving the episodes (symptom frequency, severity, subtypes, clinical staging, psychiatric comorbidities, functioning, and quality of life), two others involving stressors (early environmental exposures, recent environmental stressors), and three involving personality/cognition (personality scales, neurocognition, dysfunctional cognitive schemas). A separate domain of ‘protective/resilience factors’ assessed patient psychological characteristics (e.g. coping styles, self-reported psychological resilience) and environmental resources (e.g. access to supportive social relationships; access to material resources). Two other domains included information about comorbid physical disorders and family history of psychopathology.
We also included information about socio-demographics and treatment characteristics associated in previous research with differential depression treatment response (Constantino, Vîslă, Coyne, & Boswell, Reference Constantino, Vîslă, Coyne and Boswell2018; Kraus, Kadriu, Lanzenberger, Zarate, & Kasper, Reference Kraus, Kadriu, Lanzenberger, Zarate and Kasper2019). Treatment characteristics included self-reports about current expectations, which were assessed as of the time of baseline rather than asking patients to recall their expectations prior to making their initial treatment contact. This time frame is relevant because, as noted above, the baseline assessment was carried out only after treatment started. Treatment characteristics also included patient self-reports about past treatment experiences and EHR data on treatment histories and ADM types prescribed in the current treatment. The latter were classified as norepinephrine-dopamine reuptake inhibitors (NDRI), serotonin antagonist reuptake inhibitors (SARI), serotonin modulator and stimulators, serotonin-norepinephrine reuptake inhibitors (SNRI), selective serotonin reuptake inhibitors (SSRI), tricyclic antidepressants, and tetracyclic antidepressants. We also included a dummy variable for ADMs suggested as most effective in controlled trials (i.e. escitalopram, mirtazapine, paroxetine, sertraline, venlafaxine) (Cipriani et al., Reference Cipriani, Furukawa, Salanti, Chaimani, Atkinson, Ogawa and Geddes2018; Kazdin et al., Reference Kazdin, Wu, Hwang, Puac-Polanco, Sampson, Al-Hamzawi and Kessler2021; Little, Reference Little2009). Other dummy variables were included for typical combinations of ADMs with baseline symptoms (e.g. trazodone with sleep disturbance, duloxetine with severe physical pain). We also recorded whether the treatment provider was the patient's regular primary care physician, someone else in the same primary care office, or someone at a mental health specialty clinic.
Categorical variables were coded as dummy indicators. Quantitative variables were standardized to a mean of 0 and variance of 1 and discretized into quintiles to create stabilized predictors and nested dichotomies. These transformations resulted in 2768 potential predictors (online Supplementary Tables S2–S4). Item-level missingness was handled by single imputation carried out in the total sample before defining separate training and test samples, with missing values imputed to the mode for dichotomous and categorical variables and to the mean for ordinal and interval variables.
Analysis methods
The R program sbw (Zubizarreta, Li, Allouah, & Greifer, Reference Zubizarreta, Li, Allouah and Greifer2021) was used to make weighting adjustments for: (i) discrepancies in baseline EHR variables between eligible VHA patients and the 809 baseline respondents and (ii) discrepancies in baseline survey variables between the 660 3-month follow-up respondents and nonrespondents (Zubizarreta, Reference Zubizarreta2015).
The Super Learner (SL) stacked generalization ML method (Polley, LeDell, Kennedy, Lendle, & van der Laan, Reference Polley, LeDell, Kennedy, Lendle and van der Laan2021) was used to develop a prediction model in the weighted sample of 3-month respondents. SL generates predictions from a weighted combination of conventional and flexible ML algorithms in an ensemble. Our SL specification used 10-fold cross-validation (10F-CV) to generate a weighted composite that performs at least as well in expectation as the best algorithm in the ensemble (Polley, Rose, & van der Laan, Reference Polley, Rose, van der Laan, van der Laan and Rose2011). The appeal of stacked generalization over single algorithms is improved predictive accuracy by virtue of combining results across algorithms that include a wide range of functional forms (Polley et al., Reference Polley, Rose, van der Laan, van der Laan and Rose2011). Consistent with recommendations (LeDell, van der Laan, & Petersen, Reference LeDell, van der Laan and Petersen2016), a diverse set of algorithms was included in the SL ensemble (online Supplementary Table S5). Some prior computational psychiatric studies have used similar stacked generalization procedures (Karrer et al., Reference Karrer, Bassett, Derntl, Gruber, Aleman, Jardri and Bzdok2019; Ziobrowski et al., Reference Ziobrowski, Kennedy, Ustun, House, Beaudoin, An and van Rooij2021a).
We estimated the SL model in a stratified (by the outcome variable) random 70% training sample (n = 462) and validated it in the remaining 30% test sample (n = 198). Prediction strength, defined as area under the receiver operating characteristic curve [AUC (ROC)], was compared across a wide range of hyperparameter settings for each algorithm in the 10F-CV sample (online Supplementary Table S5). Predictors were selected independently in each 10F-CV fold with a range of constraints on predictor number using lasso regression (Park & Casella, Reference Park and Casella2008) for linear models and Bayesian additive regression trees (Chipman, George, & McCulloch, Reference Chipman, George and McCulloch2010) for nonadditive models. Comparisons of AUC (ROC) estimated in the full training sample and 10F-CV allowed determination of how much each learner (i.e. combination of number of allowed predictors and hyperparameter values for a given algorithm) was overfitting and CV prediction strength. A subset of learners with balance between these two criteria was selected for the final SL ensemble. Once the final SL model was estimated, 10F-CV was used for model calibration in the 10F-CV sample based on isotonic regression (Lindhiem, Petersen, Mentch, & Youngstrom, Reference Lindhiem, Petersen, Mentch and Youngstrom2020).
Models were assessed in the test sample by how well predicted probability of treatment response ranked patients on observed response (i.e. discrimination). The AUC (ROC) and the AUC of the precision recall curve [AUC (PRC)] were compared for the SL and a simpler benchmark lasso regression model whose penalty parameter was selected via internal cross-validation, both estimated in the training sample and applied to the test sample. Operating characteristics in the test sample were then inspected across quantiles of predicted probability of response. Operating characteristics included conditional and cumulative sensitivity (SN; the proportion of all patients responding to treatment who were in the quantile) and positive predictive value (PPV; the prevalence of treatment response in the decile). A locally estimated scatterplot smoothed calibration curve (Austin & Steyerberg, Reference Austin and Steyerberg2014) with 0.75 bandwidth was used to quantify model calibration in the test sample using the integrated calibration index (ICI) and expected calibration error (ECE) (Austin & Steyerberg, Reference Austin and Steyerberg2019). Model fairness (Yuan, Kumar, Ahmad, & Teredesai, Reference Yuan, Kumar, Ahmad and Teredesai2021) was evaluated by examining variation in the association of predicted probability of response with observed response across socio-demographic subgroups related to health disparities (age, sex, race/ethnicity, education) using robust Poisson regression models (Zou, Reference Zou2004).
Predictor importance was examined using the model-agnostic kernel Shapley Additive Explanations (SHAP) method (Lundberg & Lee, Reference Lundberg and Lee2017), which generates a predicted difference in outcome score for each patient based on changing one and only one predictor at a time from its observed score to the mean across all logically possible permutations of other predictors. The mean of this ‘SHAP value’ for a given predictor across all patients is 0. However, the mean absolute SHAP value provides useful information about the average importance of the predictor. A bee swarm plot of the association between the individual-level SHAP value and the observed score for a given predictor was used to describe dominant direction of association. Mean absolute SHAP values can also be aggregated across subsets of predictors by summing SHAP values across the predictors at the individual level and then calculating the mean of the absolute value of this sum. Such aggregate scores estimate the expected change in prevalence of treatment response if all predictors for all patients changed from their observed values to the mean values.
SAS statistical software, version 9.4 (SAS Institute Inc, 2013) was used for data management, estimating prevalence of treatment response, and calculating SN, PPV, and AUC. R, version 4.0.5 (R Core Team, 2021) was used to estimate the SL model and SHAP values.
Reporting
We followed the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) guidelines (Collins, Reitsma, Altman, & Moons, Reference Collins, Reitsma, Altman and Moons2015) in presenting results.
Results
Sample characteristics and treatment response
Baseline QIDS-SR scores were transformed to approximate Hamilton Rating Scale of Depression (HRSD) categories using published transformation rules (Table 3 in Rush et al., Reference Rush, Trivedi, Ibrahim, Carmody, Arnow, Klein and Keller2003) to give a sense of the baseline symptom severity distribution. In total, 30.1% of patients were classified as having baseline mild depression, 35.6% moderate, 21.4% severe, and 12.9% very severe (online Supplementary Table S6). Given that the baseline assessment was not administered until after the initiation of treatment, there is a possibility that these severities were lower than if assessments had been carried out prior to beginning treatment. However, the Pearson correlation between the baseline QIDS-SR score and number of days between beginning treatment and taking the baseline assessment (median = 21 days; inter-quartile range = 14–30 days) was nonsignificant (r = 0.013, p = 0.74).
The great majority (80.7%) of patients were prescribed a single ADM, most commonly SSRIs (57.0%), SNRIs (16.8%), NDRIs (15.7%), and SARIs (15.0%). The modal socio-demographic categories were between 35 and 49 years of age, male, non-Hispanic White, married, and living in a major metropolitan area. Except for age (p = 0.009), no statistically significant differences were found in baseline socio-demographics, clinical severity, or ADM classes between the weighted baseline sample and the doubly weighted analytic sample. In total, 35.7% of doubly weighted test sample patients responded to treatment as of the 3-month assessment.
Model performance
The SL test sample AUC (ROC) was 0.66 compared to 0.62 for the benchmark lasso model (Fig. 1). SL had better SN than lasso for all values of specificity (SP) above 0.25. SL had much higher PPV than the lasso model for SN below 0.10 and somewhat higher PPV than lasso across most of the remaining SN range (Fig. 2). SL test sample AUC (ROC) remained 0.66 when the analysis was limited to patients classified as having at least moderate baseline symptom severity compared to AUC (ROC) of 0.60 among patients classified as having mild baseline symptom severity.
Calibration based on the isotonic regression transformation was (ICI = 0.28 and ECE = 0.34). The SL model also had comparable fairness across subgroups defined by age, sex, race/ethnicity, and education (online Supplementary Table S7).
A monotonic gradient was found in the proportion of test sample patients that responded to treatment [i.e., PPV (s.e.)] across SL model quantiles defined in the training sample. These quantiles could be collapsed without meaningful loss of information into three groups of patients (Table 1). Among patients in the first group, those with high predicted probabilities of response, 45.7% (5.5) responded to treatment. In the intermediate predicted probably group, 34.5% (7.6) responded. In the low predicted probability group, 11.1% (4.9) responded. These predicted probabilities did not vary significantly between patients whose baseline symptom severity was mild v. more severe (Table 1). Although the thresholds to define these groups were for quintiles in the training sample, 50.4% of patients in the test sample fell into the high group, 30.1% the intermediate group, and 19.4% the low group.
ADM, antidepressant medication; PPV, positive predictive value (i.e. predicted proportion with treatment response); s.e., standard error; SN, sensitivity (i.e. proportion of all treatment responders).
a 80.9% among patients with mild baseline symptom severity v. 35.3% among other patients.
b 47.5% among patients with mild baseline symptom severity v. 45.7% among other patients.
c 13.7% in the subsample of patients with mild baseline symptom severity v. 38.3% among other patients.
d 30.3% among patients with mild baseline symptom severity v. 38.4% among other patients.
Predictor importance
|The 2768 predictors were highly redundant, as indicated by 750 (27.1%) of them having significant univariable associations with the outcome in the training sample at the 0.05 level but only 53 (1.9%) being selected by SL (Fig. 3). Forty-six of these 53 were patient self-reports, four EHR variables, and three geospatial variables. The aggregate mean absolute SHAP value across all these predictors was 4.3%. This means the probability of treatment response would have changed by an estimated average of 4.3% if each patient's scores on all selected predictors changed from observed to sample-wide mean values.
The most important predictors were features of the episode (10 of 53 predictors), with an aggregate mean absolute SHAP value of 3.5% (81% of the total). This included the most important predictor, overall depressive symptom frequency in the 2 weeks before treatment, in addition to two other important symptom measures, frequency of being happy or at peace (third most important) and anhedonia (reverse coded, 14th most important), along with five indicators of current or recent comorbidity (4th, 15th, 22nd, 34th, 44th). These predictors were for the most part associated with reduced probability of treatment response. However, SHAP value distributions (Fig. 4) show that some associations were nonmonotonic. For example, lower-than-average but not lowest overall depressive symptom frequency was associated with highest probability of treatment response.
The second most important predictor domain involved treatment characteristics (22 of 53 selected predictors), with an aggregate mean absolute SHAP value of 1.2% (25% of the total). Included were 10 indicators of positive treatment expectation/preference (e.g. 8th most important, expectation of having a good relationship with treatment provider), all positively associated with treatment response. Another seven treatment-related predictors involved current treatment (e.g. sixth most important, referral to a mental health specialist or psychologist carried out intake). None of the ADM types was among the important predictors. The remaining treatment-related predictors involved treatment history (e.g. 12th most important, past psychotherapy was not helpful), all positively associated with treatment response.
There were only two other important predictor domains: recent stressors and protective/resilience factors, with aggregate mean absolute SHAP values of 0.9% (26% of the total) and 0.7% (17% of the total), respectively. The most important stressors were financial (second) and high mortality rate due to drug overdose in the patient's county of residence (fifth), both negatively associated with treatment response. The protective/resilience factors included three indicators of psychological resilience and three of social support. As shown in the bee swarm plot, most of these predictors had nonmonotonic associations with the outcome due to patients with higher-than-average but not highest reported protective/resilience scores having highest probabilities of treatment response.
Discussion
The 35.7% 3-month ADM treatment response rate is comparable to previous VHA studies (Katz, Liebmann, Resnick, & Hoff, Reference Katz, Liebmann, Resnick and Hoff2021) but lower than most civilian studies (Cuijpers et al., Reference Cuijpers, Noma, Karyotaki, Vinkers, Cipriani and Furukawa2020), presumably reflecting the greater severity/complexity of depressed Veterans than civilians (Ziobrowski et al., Reference Ziobrowski, Leung, Bossarte, Bryant, Keusch, Liu and Kessler2021b). This highlights the potential importance of patients in the group with highest predicted probability of ADM response being more than four times as likely to respond as patients in the lowest group (45.7% v. 11.1%). Accurate discrimination of this sort is valuable as a first step in determining optimal treatments. However, multiple treatment-specific models need to be developed and combined to create a precision treatment rule for optimal assignment of patients across interventions (Kessler & Luedtke, Reference Kessler and Luedtke2021). For instance, psychotherapy or ADM plus psychotherapy might be prioritized for patients with low predicted probabilities of ADM treatment response, but only in the subset of patients with higher predicted probabilities of response in treatment-specific models for psychotherapy or combined therapy.
In this respect, our model might be compared to pharmacogenomic models used to determine pre-emptively whether ADMs are likely to be effective for individual patients (Greden et al., Reference Greden, Parikh, Rothschild, Thase, Dunlop, DeBattista and Dechairo2019). Our model performed at least as well as these pharmacogenomic models and could be implemented at a fraction of the cost of pharmacogenetic testing. The largest pharmacogenomic testing trial to date for ADM selection found that patients receiving test-congruent medications had a 12% higher probability of treatment response than patients receiving test-incongruent medications (29% v. 17%; Greden et al., Reference Greden, Parikh, Rothschild, Thase, Dunlop, DeBattista and Dechairo2019), whereas the differences we found were 34.6% (45.7% v. 11.1%) between our high and low groups, 11.2% (45.7% v. 34.5%) between our top and intermediate groups, and 23.4% (34.5% v. 11.1%) between our intermediate and low groups.
Caution is needed in interpreting results regarding predictor importance because predictor importance rankings can be very unstable when, as in our dataset, many predictors are highly correlated (Leeuwenberg et al., Reference Leeuwenberg, van Smeden, Langendijk, van der Schaaf, Mauer, Moons and Schuit2022). Several broad results about predictor importance are nonetheless noteworthy. The most striking is that baseline clinical characteristics of the episode were by far the most important predictors. This is consistent with a recent individual-level meta-analysis of over 6000 patients in primary care depression treatment across 12 trials, where baseline depression symptom severity was by far the single most important predictor of treatment response independent of treatment type (Buckman et al., Reference Buckman, Saunders, Cohen, Barnett, Clarke, Ambler and Pilling2021b). Other significant clinical predictors in that recent meta-analysis included duration of the depressive episode before beginning treatment, comorbid panic, and duration of comorbid anxiety. We found a different set of important clinical predictors, including two secondary depressive symptom factors and several measures of psychiatric comorbidity, but, as in the meta-analysis, these were all much less important than overall baseline depression symptom frequency.
The secondary symptom factors (absence of positive emotions, anhedonia) are both central aspects of melancholic depression. Evidence in previous studies has been mixed for melancholic depression being less responsive to treatment than others (Maj et al., Reference Maj, Stein, Parker, Zimmerman, Fava, De Hert and Wittchen2020). It is noteworthy that the baseline assessment included the other indicators of melancholia (i.e. deep feelings of despair, mood worse in the morning, early morning awakening, psychomotor changes, weight loss, excessive guilt), but we did not attempt to define this or any other theoretical (Benazzi, Reference Benazzi2006) or data-driven (Buckman et al., Reference Buckman, Cohen, O'Driscoll, Fried, Saunders, Ambler and Pilling2021a) MDD subtype beyond those that emerged in an exploratory factor analysis of symptoms in the baseline assessment. The nonadditive models in the SL ensemble would have been expected to detect interactions across these factors if a strong data-driven episode subtype existed. Nonetheless, it might be useful in future investigations to use unsupervised ML methods to explore the possibility of detecting such clusters.
The importance of treatment characteristics, the next most important predictor domain in our sample, was striking in two ways. First, ADM type was unrelated to treatment response. Second, multiple aspects of treatment history and current treatment expectations were important. Although the literature on treatment expectations is inconsistent in its measures and controls for prior experiences, our finding that both process and outcome expectations were important predictors is broadly consistent with previous studies (Laferton, Kube, Salzmann, Auer, & Shedden-Mora, Reference Laferton, Kube, Salzmann, Auer and Shedden-Mora2017). This is striking given that we controlled for and found significant associations of several measures of past treatment experiences that presumably underlie expectations. Taken together, these results argue for the potential value of shared decision-making and patient-centered care for depression (Rush & Thase, Reference Rush and Thase2018), for the potential value of expanding interventions to influence treatment expectations (Gruszka, Burger, & Jensen, Reference Gruszka, Burger and Jensen2019) and for the importance of including psychometrically sound and conceptually cohesive questions about treatment expectations and past treatment experiences in baseline patient assessments (e.g. Barth, Kern, Lüthi, & Witt, Reference Barth, Kern, Lüthi and Witt2019).
The finding that recent stressors were important is broadly consistent with evidence documenting effects of stressful life experiences on depression treatment response (Buckman et al., Reference Buckman, Saunders, Arundell, Oshinowo, Cohen, O'Driscoll and Pilling2022). The fact that financial stress was the second most important predictor was especially striking and is consistent with prior studies showing that unemployment and low household income are top predictors of low ADM treatment response (Lee et al., Reference Lee, Ragguett, Mansur, Boutilier, Rosenblat, Trevizol and McIntyre2018). The findings that baseline protective/resilience factors were important are also in line with much previous research (Buckman et al., Reference Buckman, Saunders, O'Driscoll, Cohen, Stott, Ambler and Pilling2021c; Laird, Lavretsky, St Cyr, & Siddarth, Reference Laird, Lavretsky, St Cyr and Siddarth2018). The fact that some of these associations were nonmonotonic is consistent with naturalistic evidence that moderate, compared to extremely low or high, levels of emotional reactivity to stress predict low future depression severity (Santee & Starr, Reference Santee and Starr2021) and that baseline self-reported resilience is sometimes significant in predicting depression treatment response only in interaction with other predictors (Choi et al., Reference Choi, Kim, Kang, Kim, Kang, Lee and Kim2021; Min, Lee, Lee, Lee, & Chae, Reference Min, Lee, Lee, Lee and Chae2012). These specifications might reflect the greater importance of protective/resilience factors in the subset of patients whose depressive episodes are triggered by stressful life experiences, which could be the subject of future investigation (Chromik, Reference Chromik, Ardito, Lanzilotti, Malizia, Petrie, Piccinno, Desolda and Inkpen2021).
Limitations
The study had several noteworthy limitations. Three of these involve external validity. First, the baseline response rate was low, although comparable to response rates in other VHA studies examining mental health outcomes (King, Beehler, Buchholz, Johnson, & Wray, Reference King, Beehler, Buchholz, Johnson and Wray2019; Stolzmann et al., Reference Stolzmann, Meterko, Miller, Belanger, Seibert and Bauer2019). However, as shown in a previous report (Puac-Polanco et al., Reference Puac-Polanco, Leung, Bossarte, Bryant, Keusch, Liu and Kessler2021), there are minimal differences between our responders and nonresponders on baseline administrative variables and equally modest differences in baseline self-reports between patients followed v. lost to follow-up, although response bias might nonetheless exist with respect to unmeasured variables. Second, we did not account for possible disruptions in care due to the COVID-19 pandemic, which involved 7.6% of study patients who completed assessments after February 2020. Third, although the model was validated in a separate test sample, it was not tested in an external validation sample. Nor is it clear whether findings would generalize to non-VHA patients.
A separate set of limitations involve design decisions that could have biased results. One of these is that study recruitment and assessment occurred only after the initial visit, during which time symptoms might have decreased, leading to distortion in our estimates of associations between baseline symptoms and treatment response. As reported above, the association between time between initiating treatment and completing the baseline assessment was unrelated to baseline QIDS-SR scores, somewhat reducing this concern, but it is nonetheless important that future replications and extensions of our work are carried out with baseline assessment administered before treatment selection is made. Another limitation that might have biased results was the use of a very large set of predictors, which could have resulted in over-fitting even though we used procedures to minimize this possibility.
A final set of noteworthy limitations involves the measures. The predictors excluded information about military experiences that might have led to the depression, and the outcomes were based on self-reports rather than clinical interviews.
Strengths
The study also had several strengths, including an observational sample with greater external validity than clinical trial samples, a rich baseline predictor set that included a wide range of variables found in previous research to be prognostic predictors of depression treatment response, and use of a rigorous ML method to develop the model.
Conclusions
Within the context of these limitations, we found that a model to predict ADM treatment response could be developed based largely on a battery of self-report questions along with some administrative variables from EHRs and geospatial databases. The model had modest overall prediction strength but nonetheless provided enough discrimination across three broad groups of patients to have potential value in informing depressed patients pre-emptively about their likelihood of responding to ADM as part of a patient-centered shared decision-making process. The model had good calibration and fairness with respect to key indicators of health disparities. Our findings would need to be replicated in a sample where the baseline assessment occurred before the beginning of treatment, the model streamlined, and parallel models built for predicted response to other types of treatment before results could be useful. In addition, parallel models combined across different treatments would be needed to determine best treatment options for particular patients (Kessler & Luedtke, Reference Kessler and Luedtke2021).
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S0033291722001982
Financial support
This research was supported by the Office of Mental Health Services and Suicide Prevention and Center of Excellence for Suicide Prevention (Bossarte), the National Institute of Mental Health of the National Institutes of Health (R01MH121478, Kessler), the United States Department of Veterans Affairs Health Services Research & Development Service Career Development Award (IK2 HX002867, Leung), the PCORI Project Program Award (ME-2019C1-16172, Zubizarreta), and the Advanced Fellowship from the VISN 4 Mental Illness Research, Education, & Clinical Center (MIRECC, Cui, Oslin). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health, the US Department of Veterans Affairs, or the United States Government.
Conflict of interest
In the past 3 years, Dr Kessler was a consultant for Datastat, Inc., Holmusk, RallyPoint Networks, Inc., and Sage Therapeutics. He has stock options in Mirah, PYM, and Roga Sciences. Dr Pigeon consulted for CurAegis Technologies and received clinical trial support from Pfizer, Inc. and Abbvie, Inc. Dr Zubizarreta consulted for Johnson & Johnson Real World Data Analytics. Dr Cipriani is supported by the National Institute for Health Research (NIHR) Oxford Cognitive Health Clinical Research Facility, by an NIHR Research Professorship (grant RP-2017-08-ST2-006), by the NIHR Oxford and Thames Valley Applied Research Collaboration and by the NIHR Oxford Health Biomedical Research Centre (grant BRC-1215-20005); he has also received research, educational, and consultancy fees from INCiPiT (Italian Network for Paediatric Trials), CARIPLO Foundation and Angelini Pharma. The views expressed are those of the authors and not necessarily those of the UK National Health Service, the NIHR, or the UK Department of Health. The remaining authors report no conflict of interest.
Ethical standards
All procedures contributing to this work comply with the ethical standards of the relevant national and institutional committees on human experimentation and with the Helsinki Declaration of 1975, as revised in 2008.