Trajectories of remitted psychotic depression: identification of predictors of worsening by machine learning

Background. Remitted psychotic depression (MDDPsy) has heterogeneity of outcome. The study's aims were to identify subgroups of persons with remitted MDDPsy with distinct trajectories of depression severity during continuation treatment and to detect predictors of membership to the worsening trajectory. Method. One hundred and twenty-six persons aged 18–85 years participated in a 36-week randomized placebo-controlled trial (RCT) that examined the clinical effects of continuing olanzapine once an episode of MDDPsy had remitted with sertraline plus olanzapine. Latent class mixed modeling was used to identify subgroups of participants with distinct trajectories of depression severity during the RCT. Machine learning was used to predict membership to the trajectories based on participant pre-trajectory characteristics. Results. Seventy-one (56.3%) participants belonged to a subgroup with a stable trajectory of depression scores and 55 (43.7%) belonged to a subgroup with a worsening trajectory. A random forest model with high prediction accuracy (AUC of 0.812) found that the strongest predictors of membership to the worsening subgroup were residual depression symptoms at onset of remission, followed by anxiety score at RCT baseline and age of onset of the first life-time depressive episode. In a logistic regression model that examined depression score at onset of remission as the only predictor variable, the AUC (0.778) was close to that of the machine learning model. Conclusions. Residual depression at onset of remission has high accuracy in predicting membership to worsening outcome of remitted MDDPsy. Research is needed to determine how best to optimize the outcome of MDDPsy with residual symptoms.


Introduction
Major depressive disorder with psychotic features (MDDPsy) has a worse long-term outcome than non-psychotic depression, with a higher rate of relapse and recurrence, more frequent psychiatric hospitalization, and poorer long-term function (Coryell et al. 1996; Jääskeläinen et al. 2018; Nietola et al. 2018). There is, however, heterogeneity of outcome, with some individuals progressing to full recovery, while others have a brittle or relapsing course despite adequate treatment.
To our knowledge, no study has examined trajectories of outcome of remitted MDDPsy. With respect to non-psychotic depression, Gueorguieva, Chekroud, and Krystal (2017) performed a post hoc analysis of data from double-blind discontinuation trials of fluoxetine or duloxetine v. placebo among individuals who had responded to acute treatment of MDD and identified a 'relapse' trajectory and two trajectories of stable depression scores. Female sex, shorter length of time with clinical response, and higher residual depression severity at discontinuation baseline increased the odds of belonging to the relapse trajectory.
STOP-PD II was a randomized clinical trial that examined the clinical outcomes of persons aged 18-85 years who had experienced 8 weeks of sustained remission or near-remission of MDDPsy when treated with sertraline plus olanzapine and were then randomized to 36 weeks of sertraline plus olanzapine or sertraline plus placebo (NCT01427608) (Flint et al. 2019). As hypothesized, participants randomized to sertraline plus placebo had a higher risk of relapse than those randomized to sertraline plus olanzapine. Nevertheless, 20% of individuals in the olanzapine group experienced a relapse, while 45% in the placebo group did not, indicating significant heterogeneity of outcome in relation to treatment assignment.
In order to better understand the heterogeneity of outcome of remitted psychotic depression, and factors that contribute, we analyzed data from STOP-PD II and had two aims. First, we sought to identify subgroups of participants with distinct trajectories of depressive symptoms during the randomized phase of STOP-PD II. Second, we used machine learning to detect characteristics of participants that predicted membership to the worsening trajectory of depressive symptoms during the randomized phase, regardless of assigned treatment. The identification of predictors of individuals with the worse outcome has the potential to inform personalized care. We used a machine learning approach because of its sensitivity and replicability, ability to detect complex non-linear patterns among predictors, and ability to examine many variables, even correlated ones, simultaneously (Chekroud et al. 2021).

Participants and study design
The design and methods of STOP-PD II have been previously described (Flint et al. 2013). The study was conducted at four medical centers (University Health Network, Toronto; University of Massachusetts Chan Medical School; University of Pittsburgh School of Medicine; and Weill Cornell Medicine) following approval by their Institutional Review Boards. Written informed consent was obtained from all participants or their substitute decision maker prior to the initiation of any research procedures.
The study had 3 phases: an acute phase lasting up to 12 weeks, an 8-week stabilization phase, and a 36-week randomized controlled trial (RCT). At the time of enrollment in the acute phase of the study, participants were aged between 18 and 85 years, met Structured Clinical Interview for DSM-IV-TR (SCID) (First, Spitzer, Gibbon, & Williams, 2001) criteria for a current major depressive episode with at least one associated delusion (with or without hallucinations), and had a 17-item Hamilton Depression Rating Scale (Ham-D 17) (Hamilton, 1960) total score ⩾21. The study's exclusion criteria included DSM-IV-TR criteria for: lifetime bipolar disorder, any other psychotic disorder, or intellectual disability; current body dysmorphic disorder or obsessive-compulsive disorder; substance abuse or dependence within the preceding 3 months; and dementia preceding the index episode of depression or a 26-item IQCODE (Jorm, 2004) mean score ⩾4 at acute phase baseline. Additional exclusions were type 1 diabetes mellitus; neurologic disease that might affect neuromuscular function; and unstable physical illness, although many of the study participants had stable chronic physical problems.
In the acute phase, participants received a combination of open-label sertraline (target dosage of 150-200 mg/day) plus open-label olanzapine (target dosage of 15-20 mg/day). The only other psychotropic medications allowed were 'as needed' lorazepam to a maximum dosage of 3 mg/day or 'as needed' benztropine to a maximum dosage of 2 mg/day. Participants entered the stabilization phase as soon as they met the study's criteria for remission or, failing that, met criteria for near-remission at Week 12 of the acute phase. Remission was defined as the absence of delusions and hallucinations and a Ham-D 17 total score ⩽10 for two consecutive weeks. Near-remission was defined as the absence of delusions and hallucinations, a Ham-D 17 score of 11-15 with ⩾50% reduction in baseline Ham-D 17 score, and rated as 'very much improved' or 'much improved' on the Clinical Global Impression (CGI) Scale (Guy, 1976). At the end of the 8-week stabilization phase, participants who still met remission or near-remission criteria following open-label treatment with sertraline plus olanzapine, and had a Mini-Mental State Examination (MMSE) (Folstein, Folstein, & McHugh, 1975) score ⩾24, were eligible for the RCT.
All participants continued to take open-label sertraline for the duration of the 36-week RCT. They were randomized under double-blind conditions to either continue olanzapine or switch from olanzapine to identically appearing placebo pills during a protocolized 4-week taper of olanzapine. Participants in the RCT were assessed weekly for the first 8 weeks and once every 4 weeks after that until they reached one of the three study end points: relapse (see Flint et al. 2019 for relapse criteria), study completion at RCT Week 36, or early termination.
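The remission and near-remission definitions above can be read as simple decision rules. The following is a minimal sketch of those rules, not the trial's actual procedure: the `Visit` record, its field names, and the reading that both remission conditions must hold at each of two consecutive weekly visits are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Visit:
    """Hypothetical record of one weekly acute-phase assessment."""
    hamd17: int           # Ham-D 17 total score
    psychotic: bool       # True if any delusion or hallucination is present
    cgi_improvement: str  # CGI improvement label, e.g. "much improved"


def is_remission(prev: Visit, curr: Visit) -> bool:
    """No psychosis and Ham-D 17 <= 10 at two consecutive weekly visits (assumed reading)."""
    return all(not v.psychotic and v.hamd17 <= 10 for v in (prev, curr))


def is_near_remission(baseline_hamd17: int, curr: Visit) -> bool:
    """No psychosis, Ham-D 17 of 11-15 with >= 50% reduction, and CGI much or very much improved."""
    reduction = (baseline_hamd17 - curr.hamd17) / baseline_hamd17
    return (
        not curr.psychotic
        and 11 <= curr.hamd17 <= 15
        and reduction >= 0.5
        and curr.cgi_improvement in {"very much improved", "much improved"}
    )
```

Under this sketch, a participant with a baseline score of 30 and a Week 12 score of 14 would meet the near-remission rule, whereas one with a baseline score of 24 would not, because a drop from 24 to 14 is less than a 50% reduction.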

Outcome
The outcome for the current analysis was depression severity, measured with the GRID version of the Ham-D 17 (Williams et al. 2008) at each assessment point of the RCT. There were up to 15 post-baseline assessments. (Psychosis was not considered for outcome, since sustained absence of delusions and hallucinations was an eligibility requirement for the RCT and re-emergence of psychosis during the RCT was, by virtue of being one of the criteria of relapse (Flint et al. 2019), a study end point.)

Predictor variables and associated measures
Potential predictors of depression severity trajectory membership were selected based on their previously reported association with outcome of MDD following acute treatment (Alexopoulos et al. 2000; Buckman et al. 2018; Burcusa & Iacono, 2007; Hardeveld, Spijker, De Graaf, Nolen, & Beekman, 2010; Klein, Holtman, Bockting, Heymans, & Burger, 2018; Wojnarowski, Firth, Finegan, & Delgadillo, 2019). In addition, the following variables were selected as possible predictors: sociodemographic variables; study site; dosages of sertraline and olanzapine at RCT baseline; and acute and stabilization phase measures of medication-associated parkinsonism and akathisia, given their overlap with psychomotor disturbance, which has been associated with risk of relapse of MDDPsy (Flint et al. 2021).
Of the clinical variables, depression severity was measured with the GRID version of the Ham-D 17 (Williams et al. 2008); delusion severity was measured with the delusion severity item of the Schedule for Affective Disorders and Schizophrenia (SADS) (Spitzer & Endicott, 1979); anxiety severity was measured with the anxiety subscale of the Hospital Anxiety and Depression Scale (HADS-A) (Zigmond & Snaith, 1983); severity of psychomotor disturbance was measured with the CORE instrument (Parker et al. 1993); clinical global impression was measured with the Clinical Global Impression Scale severity item (Guy, 1976); overall severity of illness at acute baseline was measured with the Brief Psychiatric Rating Scale (Overall & Gorham, 1962); lifetime medical burden was quantified by the Cumulative Illness Rating Scale for Geriatrics (CIRS-G) (Miller et al. 1992); treatment resistance during the index episode of psychotic depression was defined on the Antidepressant Treatment History Form (ATHF) (Oquendo et al. 2003) as an antidepressant plus antipsychotic combination rating score of 3 or higher or seven or more treatments of electroconvulsive therapy (Blumberger et al. 2011); global cognitive function was measured with the MMSE (Folstein et al. 1975); executive function was measured with the Delis-Kaplan Executive Function System (DKEFS) (Delis, Kaplan, & Kramer, 2001) Color Word Interference Condition 3 final weighted scaled score (a continuous measure of inhibition); information processing speed was measured with the Coding task from the Repeatable Battery for the Assessment of Neuropsychological Status (RBANS) (Randolph, 1998); delayed verbal recall was measured with the RBANS List Recall task; medication-associated parkinsonism was measured with the Simpson Angus Scale (Simpson & Angus, 1970); and akathisia was measured with the Barnes Akathisia Rating Scale (Barnes, 1989). Age of onset of the first lifetime episode of MDD, lifetime number of episodes of MDD, duration of the index episode of depression, and presence or absence of a suicide attempt during the index episode of depression were assessed at acute baseline using the SCID.
Psychological Medicine

Data analyses
Subgroups based on trajectories of depression scores during the RCT
Latent class mixed modeling (LCMM) was used to identify subgroups of participants with distinct longitudinal trajectories of depression severity during the 36-week RCT. A range of latent subgroups (K = 1 to 4) was considered. For each K-subgroup model, various shapes of the depression trajectory were considered, i.e. constant, linear, or non-linear (quadratic or cubic), and the final shape was chosen based on model fit statistics (Bayesian information criterion or BIC). Average posterior probabilities of group membership were used as a measure of internal reliability for each trajectory. The posterior probability of membership is computed using Bayes' rule and denotes the probability of an individual belonging to a certain trajectory class conditional on that individual's repeated measures of the outcome. Posterior probability values greater than 0.70-0.80 for each trajectory subgroup suggest greater homogeneity within a trajectory group than between trajectory groups.
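To make the posterior probability of class membership concrete, the sketch below applies Bayes' rule to one participant's repeated scores. It is a deliberately simplified illustration, not the LCMM fitted in the study: it assumes each class has a linear mean trajectory with independent normal errors and no random effects, and the class parameters in `example_classes` are illustrative values rather than estimates from the trial.

```python
import math


def normal_logpdf(y, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sd ** 2) - (y - mean) ** 2 / (2 * sd ** 2)


def posterior_membership(times, scores, classes):
    """Bayes' rule: P(class k | scores) is proportional to weight_k * likelihood_k(scores)."""
    log_joint = []
    for c in classes:
        loglik = sum(
            normal_logpdf(y, c["intercept"] + c["slope"] * t, c["sd"])
            for t, y in zip(times, scores)
        )
        log_joint.append(math.log(c["weight"]) + loglik)
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_joint)
    unnorm = [math.exp(v - m) for v in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Illustrative (hypothetical) two-class parameters: index 0 flat, index 1 rising.
example_classes = [
    {"weight": 0.56, "intercept": 4.0, "slope": 0.0, "sd": 2.0},
    {"weight": 0.44, "intercept": 8.0, "slope": 0.8, "sd": 3.0},
]
```

Averaging each participant's largest posterior probability within their assigned class gives the average posterior probability that the text uses as a measure of internal reliability.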

Predictors of membership of subgroups based on trajectories of depression score
To identify membership to the trajectories of depression severity during the RCT, a number of sociodemographic and clinical characteristics of participants were considered (Table 1). Specifically, three sets of predictors were considered: (1) predictors measured at baseline of the acute phase; (2) predictors measured at the RCT baseline (end of the stabilization phase); and (3) longitudinal features of predictor variables during the acute and stabilization phases. Severity of depression at RCT baseline was not included as a predictor because it was included in deriving the trajectories of depression severity; instead, we used Ham-D 17 total score at the end of the acute phase, when participants first met criteria for remission/near-remission (hereafter referred to as 'remission') of psychotic depression. The third group of predictors extracted longitudinal features from variables that had repeated measures during the acute and stabilization phases of the study, specifically delusion severity, clinical global impression scale severity, and medication-associated parkinsonism and akathisia. The longitudinal features considered were median, standard deviation (S.D.), change from acute baseline to RCT baseline, a binary indicator of monotonic increasing or decreasing trend during the acute and stabilization phase, and a Spearman's correlation with time. Any predictors at acute baseline or RCT baseline that had more than 30% missing values were not used in this analysis. Missing values of remaining predictors were imputed using the proximity measures of a random forest (Stekhoven & Buehlmann, 2012).
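The longitudinal features listed above can be computed from one participant's repeated measures of a single variable. The sketch below is one possible reading, using only the Python standard library. The function names are hypothetical, and treating the trend indicator as non-strict monotonicity (ties allowed) is an assumption, since the text does not specify how ties were handled.

```python
import statistics


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def longitudinal_features(times, scores):
    """Summaries of one repeated measure, ordered from acute baseline to RCT baseline."""
    diffs = [b - a for a, b in zip(scores, scores[1:])]
    monotonic = all(d >= 0 for d in diffs) or all(d <= 0 for d in diffs)
    return {
        "median": statistics.median(scores),
        "sd": statistics.stdev(scores),
        "change": scores[-1] - scores[0],
        "monotonic": int(monotonic),
        # Spearman's correlation is the Pearson correlation of the ranks.
        "spearman_time": pearson(ranks(times), ranks(scores)),
    }
```

Each variable with repeated measures would contribute one such set of features per participant to the predictor matrix.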
Initially, acute baseline and RCT baseline characteristics of the LCMM-identified latent trajectories were compared using chi-square or independent two-sample t tests as appropriate. Then, a machine learning algorithm, random forest, was used to predict membership to the groups of trajectories using the predictors mentioned above. Briefly, the random forest algorithm considers complex interactions between predictors using decision trees and averages predictions (by majority voting) over multiple decision trees in bootstrapped samples. Predictors were ranked on their relative 'importance' by quantifying the improvement in prediction error by each predictor using the Gini Impurity Index. We report the set of predictors that explains 70% of the cumulative reduction in the Gini Index.
Prediction accuracy was operationalized using the area under the receiver operating characteristics curve (AU-ROC) and was estimated by five-fold cross-validation (CV). The variability of the estimated AU-ROC is reported by computing the 95% bootstrapped confidence intervals of the AUC. Tuning parameters for random forest (number of trees and the number of randomly chosen predictors for a candidate split in each tree) were identified by embedding another five-fold CV within the outer CV.
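The Gini-based ranking can be illustrated in two steps: the impurity decrease produced by a single split, and the selection of the top-ranked predictors that together account for a chosen share of total importance. This is a minimal sketch of the general technique, not the study's software; the function names are hypothetical, and the rule of stopping at the first predictor that brings the cumulative share to at least 70% is an assumed reading of the reporting rule.

```python
def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted


def top_predictors(importance, share=0.70):
    """Smallest top-ranked set whose summed importance reaches `share` of the total."""
    total = sum(importance.values())
    selected, running = [], 0.0
    for name, value in sorted(importance.items(), key=lambda kv: kv[1], reverse=True):
        selected.append(name)
        running += value
        if running / total >= share:
            break
    return selected
```

In a random forest, a predictor's importance is the sum of such decreases over all splits that use it, averaged over trees. For example, with made-up importances `{"x1": 5, "x2": 3, "x3": 1, "x4": 1}`, the first two predictors account for 80% of the total and would be the ones reported.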
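The accuracy measure can be sketched as follows: the AUC is the probability that a randomly chosen member of the worsening group receives a higher predicted score than a randomly chosen member of the stable group, and a percentile bootstrap gives an interval around it. This is a minimal sketch under stated assumptions. It takes out-of-fold predicted scores as given and does not reproduce the cross-validation or tuning steps, and the percentile method, the number of resamples, and the function names are assumptions rather than details reported in the text.

```python
import random


def auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC, resampling participants with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

Labels are coded 1 for the worsening trajectory and 0 for the stable trajectory in this sketch. The pairwise computation is quadratic in sample size, which is acceptable for a sample of this size.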

Results
Of the 269 participants enrolled in the acute phase of STOP-PD II, 126 participated in the RCT. The CONSORT figure has been reported elsewhere (Flint et al. 2019).

Trajectories of depression scores during the RCT
A two-group trajectory model was chosen based on BIC, average posterior probability of group membership, sample size in each group, and clinical interpretability. This model was characterized by a stable trajectory (n = 71; 56.3% of participants) and a worsening trajectory (n = 55; 43.7% of participants) (Fig. 1). The average posterior probabilities for membership in the stable and worsening trajectories were 0.99 and 0.98, respectively, showing strong internal reliability of each trajectory group; that is, participants within a trajectory group were more homogeneous than between the two groups. The depression severity of the worsening trajectory had a linear increasing trend over the RCT (estimate of slope = 0.7765, 95% Bootstrap CI 0.5357-1.0175), while that of the stable trajectory did not change over time (estimate of slope = 0.1461, 95% Bootstrap CI −0.0291 to 0.3212). The group with the worsening trajectory had significantly higher mean (S.D.) depression severity at RCT baseline compared to the stable trajectory group (7.82 [3.31] v. 3.68 [2.60] respectively; t = 7.6, df = 100, p < 0.001). Rates of relapse were 70.9% (39/55) and 11.3% (8/71) in the worsening and stable trajectory groups, respectively.
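The reported test statistic and relapse rates can be checked from the summary figures given above. The sketch below assumes an unequal-variance (Welch) t test, which is an inference from the reported degrees of freedom rather than something the text states: a pooled-variance test on 55 and 71 participants would have 124 degrees of freedom, whereas the Welch-Satterthwaite approximation gives a value close to 100.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom from summary data."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Summary figures as reported in the text: worsening group first, then stable group.
t, df = welch_t(7.82, 3.31, 55, 3.68, 2.60, 71)

# Relapse rates from the reported counts.
relapse_worsening = 39 / 55
relapse_stable = 8 / 71
```

With these inputs the statistic is about 7.6 and the degrees of freedom about 100, in line with the reported values, and the two relapse proportions round to 70.9% and 11.3%.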

Predictors of worsening depression trajectory: machine learning model
At the acute phase baseline (when participants were depressed), individuals with a worsening trajectory had earlier age of onset of the first lifetime episode of MDD, greater number of lifetime depressive episodes, greater medical burden, and were more likely to speak English as their first language (Table 1). With respect to variables assessed at RCT baseline (when participants were in remission), the worsening trajectory group had higher scores on measures of anxiety, psychomotor disturbance, clinical global impression, medical burden, and akathisia (Table 1). Participants in the worsening trajectory group had a higher mean (S.D.) Ham-D 17 total score at onset of remission (i.e. at acute phase termination) than those with a stable trajectory (Table 1).
The random forest model had an AUC of 0.812 (95% CI 0.658-0.938) in predicting the worsening depression trajectory. Based on the mean decrease in Gini Impurity Index, the strongest predictors of worsening trajectory were depression score at onset of remission, followed by anxiety score at RCT baseline, and age of onset of first lifetime depressive episode (Fig. 2). Figure 2 shows
Samprit Banerjee et al.

Discussion
We identified two groups of persons with remitted MDDPsy based on trajectories of depression scores during continuation treatment with either sertraline plus olanzapine or sertraline plus placebo: those with a stable trajectory and those with a worsening trajectory. In the random forest model, the strongest predictor of membership in the worsening group was the severity of depression at onset of remission. The prediction accuracy of the machine learning model that included multiple predictor variables, including depression severity, was only marginally better than that of a logistic regression model that included depression severity alone (AUCs of 0.812 and 0.778, respectively). In contrast, a secondary random forest model that did not include depression severity was slightly less accurate (AUC of 0.772) than depression severity alone. These findings have clinical implications. First, they indicate that assessing the severity of MDDPsy in remission is important in predicting outcome. The mean (S.D.) Ham-D 17 score at the onset of remission was 4.6 (2.8) in the stable group and 7.9 (3.2) in the worsening group. Although both mean scores fall within the conventional range of remission of MDD in mid-life and older adults (Frank et al. 1991; Reynolds et al. 2006), our findings indicate that there is value in knowing where an individual lies within this range, since lower scores predict a better outcome.
Second, measuring severity of depression in remission may be sufficient to predict clinical course during continuation treatment, without the need to assess other prognostic variables. Although the random forest model that contained multiple predictors was slightly more accurate than depression score alone, the 3.4% difference in their AUC is of doubtful clinical significance. This may indicate that depression severity at remission is correlated with other predictors considered in the random forest model. Interestingly, Song et al. (2023) found that subsyndromal depression severity predicted future case-level depression with similar accuracy to that of a machine learning model that included a number of psychological and non-psychological variables (AUCs of 0.764 and 0.791, respectively). From the perspective of patient and clinician, the administration of one measure (depression severity) is less burdensome, and more likely to lead to uptake in clinical practice, than the need to administer several measures. Third, our findings raise an important question about state v. trait effects in the outcome of remitted MDDPsy. In some individuals, it is possible that a more vigorous and/or a different treatment could further lessen depression severity, or eliminate symptoms entirely, thus resulting in a stable outcome (state effect). However, other individuals may be unable to achieve an asymptomatic state regardless of what treatment they receive and remain at risk for an unfavorable course of depression. In this case, the inability to achieve full remission serves as a trait marker of a phenotype that may have a distinct neurobiological signature. Further research is needed to disentangle these effects, and identify individuals with residual depression who can benefit from more vigorous or different treatment and determine how they differ biologically from persons with a trait-determined course.
Anxiety score at randomization baseline, when participants were in remission, also emerged as a strong predictor of worse outcome trajectory. The association of anxiety, either as a residual symptom of depression or as a comorbid disorder, with worse long-term outcome of depression is well documented (Buckman et al. 2018; Hardeveld et al. 2010). In addition, psychomotor score at randomization baseline was a leading predictor of worse outcome. Although several factors have been implicated in risk of relapse and recurrence of MDD (Buckman et al. 2018; Hardeveld et al. 2010; Wojnarowski et al. 2019), our collective findings suggest that a focus on residual symptoms may be a fruitful area of inquiry to understand why some people with remitted MDDPsy have a poorer outcome than others.
There are limitations to the study. First, analyses were post hoc. Second, history of childhood trauma, personality traits, quality of interpersonal relationships, and life events and difficulties, which have been found to be relevant to relapse and recurrence of MDD (Buckman et al. 2018; Burcusa & Iacono, 2007; Perlman et al. 2019), were not assessed in STOP-PD II and therefore could not be included in the machine learning model. Third, the study design precluded a nuanced analysis of psychosis in the machine learning model. In order to be eligible for the RCT, participants had to experience full remission of delusions and hallucinations during acute treatment and, then, remain free of psychosis during the 8-week stabilization phase. Thus, in contrast to depression severity, there was minimal variability in SADS delusion and hallucination scores at the point of remission and during stabilization. Fourth, the trajectory of Ham-D scores in the RCT was influenced by randomized treatment, in that participants treated with sertraline plus placebo had greater risk of relapse (which would have influenced the trajectory of Ham-D scores) than participants treated with sertraline plus olanzapine. It is possible that different variables would have predicted membership of worsening trajectory if the sample had been restricted to persons treated with sertraline plus olanzapine. However, the sample size was not large enough to allow separate analyses of each treatment group. Fifth, our findings are limited to the drugs that were used in STOP-PD II. Trajectories and/or predictors may have been different had other drug combinations been used.
Strengths of the study include the well-characterized sample, the rigorous approach to measurement of variables, and the large and diverse number of variables available for the machine learning model (notwithstanding the absence of the specific variables noted above). Other strengths are the wide age range of participants in keeping with the broad age of onset of MDDPsy, the 36-week duration of the RCT, and the multiple points of measurement of the outcome variable.
To conclude, the strongest predictor of membership in the worsening subgroup was the severity of depression at onset of remission. The predictive accuracy of remission depression severity alone was comparable to that of a machine learning model that included multiple variables. This finding suggests that assessing the severity of depression at onset of remission may be sufficient to inform the outcome of depression during continuation treatment, without the need to measure other prognostic variables. Finally, there is a need for research to determine how to optimize the outcome of treated psychotic depression when residual symptoms persist.
Supplementary material. The supplementary material for this article can be found at https://doi.org/10.1017/S0033291723002945.

Figure 1. Latent Growth Mixture Model of estimated trajectories of depression severity (along with 95% bootstrapped confidence intervals) among participants in the randomized phase of STOP-PD II (N = 126).
Table 1 lists all variables. Online Supplementary Table 1 reports the time schedule of measurement of predictor variables.

Table 1. Characteristics of subgroups based on latent growth mixture model trajectories of Hamilton Depression Rating Scale total scores during the STOP-PD II randomized controlled trial (N = 126)
Abbreviations: BPRS, Brief Psychiatric Rating Scale; CGI, Clinical Global Impression; CIRS-G, Cumulative Illness Rating Scale for Geriatrics; CORE, the CORE measure of psychomotor disturbance; DKEFS, Delis-Kaplan Executive Function System; HADS, Hospital Anxiety and Depression Scale; HAM-D 17, 17-item Hamilton Depression Rating Scale; MMSE, Mini Mental State Examination; RBANS, Repeatable Battery for the Assessment of Neuropsychological Status; SADS, Schedule for Affective Disorders and Schizophrenia.
a 'Other Race' includes American Indian or Alaska Native, Asian, Native Hawaiian or Other Pacific Islander, and unknown or not reported.
b Treatment resistance defined as an antidepressant plus antipsychotic combination rating score of 3 or higher on the Antidepressant Treatment History Form or seven or more treatments of electroconvulsive therapy during the current episode of psychotic depression.
c DKEFS color-word interference test condition 3 final weighted score.