Predicting patients who will drop out of out-patient psychotherapy using machine learning algorithms

Background
About 30% of patients drop out of cognitive–behavioural therapy (CBT), which has implications for psychiatric and psychological treatment. Findings concerning drop out remain heterogeneous.

Aims
To compare different machine-learning algorithms using nested cross-validation, evaluate their benefit in naturalistic settings, and identify the best model as well as the most important variables.

Method
The data-set consisted of 2543 out-patients treated with CBT. Assessment took place before session one. Twenty-one algorithms and ensembles were compared. Two parameters (Brier score, area under the curve (AUC)) were used for evaluation.

Results
The best model was an ensemble that used Random Forest and nearest-neighbour modelling. During the training process, it was significantly better than generalised linear modelling (GLM) (Brier score: d = −2.93, 95% CI −3.95 to −1.90; AUC: d = 0.59, 95% CI 0.11 to 1.06). In the holdout sample, the ensemble correctly identified 63.4% of cases of patients dropping out, whereas the GLM identified only 46.2% correctly. The most important predictors were lower education, lower scores on the compulsive scale of the Personality Style and Disorder Inventory (PSSI), younger age, higher scores on the PSSI negativistic and PSSI antisocial scales, and higher scores on the Brief Symptom Inventory (BSI) additional scale (mean of the four additional items) and the BSI overall scale.

Conclusions
Machine learning improves drop-out predictions. However, not all algorithms are suited to naturalistic data-sets and binary events. Tree-based and boosted algorithms including a variable selection process seem well suited, whereas more advanced algorithms such as neural networks do not.

Cognitive-behavioural therapy (CBT) is an effective treatment for mental health problems. 1 However, approximately one in five patients drop out of treatment, 2 leading to many problems including a lack of adequate treatment. 3,4 Because of these negative consequences, identifying patients at a high risk of dropping out could lead to the development of clinical support tools that minimise the risk of drop out in individual patients. 5,6 However, findings from past studies examining CBT treatments have been heterogeneous with only younger age and lower education level being consistently associated with drop out. [7][8][9] Most studies used small samples and heterogeneous methods. An increase in statistical precision and large data-sets are therefore necessary to reliably identify patients at risk of dropping out of therapy.

Methodological developments
Over recent years, machine-learning approaches in particular have had a large impact on prediction modelling and on the most recent debate about the implementation of personalised or precision medicine concepts in mental health. 10,11 Machine learning has been applied in various prediction contexts, [12][13][14][15] taking advantage of its ability to capture non-linear relationships. 16 Nevertheless, machine learning does not always have an advantage over more traditional methods, 17 indicating that personalised medical care faces serious challenges that cannot be addressed through algorithmic complexity alone. 18 It remains unclear which machine-learning methods are most suited to data from an out-patient CBT setting and whether previous findings can be generalised to this context. 15,18 Further, to our knowledge, no study has investigated the use of machine-learning algorithms for the prediction of a binary event in a naturalistic setting. For this reason, we pursued two aims in this study.
(a) Various machine-learning algorithms will be systematically compared with regard to their personalised drop-out predictions and under routine care out-patient CBT conditions. (b) Findings from these comparisons will be used to generate a clinically useful drop-out prediction model that can be used in clinical practice before the first session has occurred.

Method

Patients and treatment
The analyses were based on a sample comprising 2543 patients treated at the University of Trier out-patient CBT clinic in Southwest Germany between 2007 and 2021. Patients were included when they had completed a battery of questionnaires at intake, had begun therapy after the diagnostic phase (i.e. completed at least three sessions) and completed (i.e. consensual termination) or dropped out of treatment (see Supplementary Materials 1 available at https://doi.org/10.1192/bjp.2022.17 for a flow chart of selected patients). Written informed consent was obtained from all patients. The authors assert that all procedures contributing to this work comply with the ethical standards of the relevant national and institutional committees on human experimentation and with the Helsinki Declaration of 1975, as revised in 2008. All procedures involving human patients were approved by the ethics committee of the University of Trier.
All patient data collected from 2007 to 2017 were used for the model-generating process (training sample) and the remaining data were used for testing purposes (holdout sample). Therapy took place once a week; the total number of sessions ranged from 3 to 113. When patients dropped out, the number of sessions was significantly lower than when they completed therapy (mean for those who dropped out 17.2 sessions; mean for those with completion 43.4 sessions; t(2541) = 33.46; P < 0.001; Cohen's d = 1.49). This held true for the training sample (mean for those who dropped out 17.5 sessions; mean for those with completion 44.3 sessions; t(2041) = 31.36; P < 0.001; Cohen's d = 1.51) and for the holdout sample (mean for those who dropped out 6.4 sessions; mean for those with completion 39.0 sessions; t(498) = 12.51; P < 0.001; Cohen's d = 1.21).
Diagnoses were based on the German version of the Structured Clinical Interview for Axis I DSM-IV Disorders-Patient Edition 19 and the International Diagnostic Checklist for Personality Disorders. 20 Interviews were conducted by intensively trained independent clinicians before actual therapy began. All sessions were videotaped for establishing diagnoses and supervision; interviews and diagnoses were discussed in expert consensus teams that included four senior clinicians. Final diagnoses were determined by consensual agreement of at least 75% of the team members. For an overview of patient characteristics and differences between the groups see Supplementary materials 2.
The mean scores on the short-form of the Outcome Questionnaire 21 and the Brief Symptom Inventory (BSI; 22 German translation of Derogatis 23 ) were 1.90 (s.d. = 0.56) and 1.30 (s.d. = 0.71), respectively, indicating a moderate-to-severe general level of distress.

Therapists
Patients were treated by 220 therapists (173 female, 40 male, 7 unknown) who participated in a 3-year (full-time) or 5-year (part-time) postgraduate training programme with a CBT focus. Each therapist had at least 1 year of clinical training before beginning to treat patients. On average, therapists treated 11.6 patients each (s.d. = 6.2, range 1-26). Each therapist received 1 h of group or individual supervision on a monthly basis. The session videos were used for supervision and research. Supervisors were senior clinicians with at least 5 years of clinical experience after completing training. In treatment, therapists scored a mean of 3.81 (s.d. = 0.89) on the 'overall adherence item' of the Inventory of Therapeutic Interventions and Skills, which has a range from 0 to 6. 24 Adherence can therefore be described as adequate.

Drop out
Drop out was assessed via clinical judgement at the end of treatment. When the patient and therapist agreed on a consensual end of therapy, the treatment was considered regularly completed. In contrast, when the patient stopped coming to therapy despite the therapist's appraisal that more sessions were necessary, the form of termination was considered a drop out. Examples of this operationalisation of drop out include the patient stopping coming to sessions and being unreachable by phone or email, or the patient telling the therapist that they would no longer be coming to therapy, despite the therapist's advice to continue.

Intake variables
A total of 77 variables measured at intake (i.e. before the first session) were included in the analyses. Table 1 shows all 77 variables as well as the mean differences. All variables were assessed via questionnaires.

Selection of machine-learning algorithms
In order to get an accurate picture of common algorithms used in sociological, scientific and medical contexts, we decided to use and compare only those algorithms that have already found application in the relevant literature. For this purpose, we particularly focused on the Stratified Medicine Approaches for Treatment Selection Mental Health Prediction Tournament at the 2019 Treatment Selection Idea Lab conference, in which 13 different research groups developed different prediction models using the same data-set (for a further review, see Cohen et al 32 ). Using the information from this tournament, as well as an examination of the literature provided by the tournament organiser, we selected a total of 21 algorithms for closer examination (see Table 2). As we aimed to compare different algorithms regardless of them being linear or non-linear, we included linear algorithms alongside the machine-learning algorithms, as suggested by Brownlee. 16

Data analytic strategy

Data preparation
All analyses were conducted using the free software environment R version 4.1.1. 33 No variables with more than 10% missing values were included in the analyses. We therefore had to exclude a total of five variables (total scores of the Patient Health Questionnaire (PHQ-9), Affective Style Questionnaire (ASQ), Generalised Anxiety Disorder Assessment (GAD-7), Ten Item Personality Measure (TIPI) and Work and Social Adjustment Scale (WSAS)). No patient was excluded from the analyses because of too many missing values. Variables with less than 10% missing values were imputed using a trained Random Forest in the R package missForest v1.4. 34 The imputations for the training and holdout samples were conducted separately from the cross-validation framework before the actual analyses.
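The study's imputation relies on R's missForest. Purely as an illustration, a conceptually similar iterative random-forest imputation can be sketched in Python with scikit-learn's experimental IterativeImputer; the data, dimensions and settings below are invented and do not reproduce the authors' R pipeline:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.05] = np.nan  # ~5% missing, below the 10% exclusion cut-off

# iterative imputation with a random-forest learner, analogous to missForest
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    max_iter=3,
    random_state=0,
)
X_imputed = imputer.fit_transform(X)
assert not np.isnan(X_imputed).any()
```

As in the paper, such an imputation would be run separately for the training and holdout samples, before any cross-validation.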
For model training, we used the R packages caret v6.0-90 35 and caretEnsemble v2.0.1. 36 These packages tune the hyperparameters of each algorithm to their optimal settings. To ensure a fair comparison of the algorithms, we did not change the packages' default settings. Table 2 shows the algorithms used and the different tuning parameters that were tested (for a further review see Kuhn 37 ). Identification of the best model was always based on the receiver operating characteristic curve: the model with the largest area under the curve (AUC) was considered the best model. All models predicted drop out as a binary event (drop out versus non-drop out).

Ranking and correlation of algorithms
First, we ranked all individual algorithms based on the two parameters (i.e. the Brier score (described below) and the AUC) and compared the correlations of the predictions of all algorithms during the model-building process using the corresponding function in the Caret package. For this purpose, we conducted a nested cross validation with 20 outer and 10 inner loops, following Brownlee's 38 recommendations. All continuous variables were centred separately for each outer cross-validation loop of the training and test sets. The subtrahend was always the respective variable's mean in the training data, to avoid data leakage and to ensure appropriate data preparation for the algorithms. 16 Drop out was dichotomised as 1 (drop out) and 0 (regular termination). Subsequently, each of the 21 machine-learning algorithms generated a drop-out prediction model based on each outer and inner cross-validation training set, which was then evaluated in the respective outer or inner cross-validation test set to minimise overfitting 39 and the influence of sample characteristics. For the inner cross validation, we also applied a sampling method (synthetic minority oversampling technique, SMOTE) 40 to address the problem of class imbalance. 41,42 SMOTE is a hybrid method combining up-sampling and down-sampling: it artificially generates new examples of the minority class using the nearest neighbours of these cases, and the majority class is also undersampled, leading to a more balanced data-set. Finally, a performance ranking based on the Brier score and the AUC was generated, as well as the correlation matrix for all algorithms.
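The nested structure described above (centre with training means only, rebalance the inner training data, tune in the inner loop, score in the outer loop) can be sketched generically. The following Python sketch is illustrative only: it uses scikit-learn rather than the authors' R/caret setup, random noise in place of the 77 intake variables, a hand-rolled SMOTE-style interpolation instead of the SMOTE implementation the authors used, and far fewer loops than the paper's 20 outer × 10 inner design:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.neighbors import NearestNeighbors

def smote_like(X, y, k=5, random_state=0):
    """Minimal SMOTE-style oversampling: interpolate between minority
    samples and their nearest minority neighbours until classes balance."""
    rng = np.random.default_rng(random_state)
    minority = X[y == 1]
    n_new = int((y == 0).sum() - (y == 1).sum())
    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
    _, idx = nn.kneighbors(minority)
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        j = idx[i][rng.integers(1, k + 1)]  # skip self at position 0
        lam = rng.random()
        synth.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.vstack([X, synth]), np.concatenate([y, np.ones(n_new, dtype=int)])

# toy data standing in for the intake variables; labels are random noise
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 8))
y = (rng.random(400) < 0.3).astype(int)  # ~30% "drop out", as in the paper

outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
aucs = []
for train_idx, test_idx in outer.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    mu = X_tr.mean(axis=0)                   # centre with training means only
    X_tr_bal, y_tr_bal = smote_like(X_tr - mu, y_tr)
    inner = GridSearchCV(                    # inner loop tunes hyperparameters
        RandomForestClassifier(n_estimators=50, random_state=0),
        {"max_depth": [3, None]}, cv=3, scoring="roc_auc",
    ).fit(X_tr_bal, y_tr_bal)
    p = inner.predict_proba(X[test_idx] - mu)[:, 1]
    aucs.append(roc_auc_score(y[test_idx], p))
print(round(float(np.mean(aucs)), 3))  # near chance here, as labels are noise
```

The key design point mirrored from the paper is that both the centring statistics and the SMOTE resampling are computed inside each training fold, never on the test data, so no information leaks across the split.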

Brier score
The Brier score ranges from 0 (best prediction) to 1 (worst prediction) and measures probabilistic predictions. 43 Thus, it takes the certainty of the prediction into account. In effect, it is the mean squared error of the forecast: BS = (1/N) Σ_{t=1}^{N} (f_t − o_t)², where N is the total number of observations, f_t is the predicted probability of the event (i.e. drop out) and o_t is the actual outcome (i.e. 0 or 1) of the event at instance t.
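As a quick numerical illustration (with invented forecasts, not study data), the score can be computed directly or with scikit-learn:

```python
from sklearn.metrics import brier_score_loss

# invented forecast probabilities of drop out and actual outcomes (1 = drop out)
f = [0.9, 0.2, 0.7, 0.1]
o = [1, 0, 0, 1]

bs = brier_score_loss(o, f)  # mean of (f_t - o_t)^2
print(round(bs, 4))  # → 0.3375
```

A confident correct forecast (0.9 for an actual drop out) contributes little to the score, while a confident wrong one (0.1 for an actual drop out) contributes heavily, which is exactly the certainty-weighting described above.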

AUC
The AUC uses the sensitivity and specificity of a prediction and ranges from 0 (worst prediction) to 1 (best prediction). Based on signal detection theory, 44 the AUC takes the base rate of the dependent variable into account.
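The AUC can equivalently be read as the probability that a randomly chosen drop-out case is ranked above a randomly chosen completer. A small invented example:

```python
from sklearn.metrics import roc_auc_score

o = [1, 0, 0, 1, 0]            # actual outcomes (1 = drop out)
f = [0.9, 0.2, 0.7, 0.6, 0.1]  # predicted drop-out probabilities

# 5 of the 6 (drop out, completer) pairs are ordered correctly: 5/6 ≈ 0.833
print(round(roc_auc_score(o, f), 3))  # → 0.833
```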

Ensembles
We used the ranking of single algorithms and the correlation matrix to generate ensembles. Ensembles can show better performance and greater robustness in certain contexts by reweighting the results of different algorithms, which can produce better overall results. 45 We decided to use five types of ensembles: the two best algorithms, the three best algorithms, the two least-correlated algorithms, the three least-correlated algorithms, and the best algorithm combined with its least-correlated algorithm. The idea behind merging algorithms with low correlations is that they probably assess different aspects of the data-set. 46 Therefore, it is possible that an ensemble of such algorithms improves the prediction significantly, even though one algorithm makes poor predictions on its own. These ensembles were merged either via a generalised linear modelling (GLM) algorithm or via the best algorithm across both parameters (i.e. Brier score and AUC) according to our ranking, using the stacking method. Again, we used Caret with its default settings to create an ensemble with the best parameters. In total, we generated ten ensembles (five types of ensemble × two ways of merging).
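The stacking idea (base learners whose out-of-fold predictions are merged by a GLM meta-learner) can be sketched generically in Python. This is not the authors' caretEnsemble code: the data are synthetic, and the Random Forest plus kNN pairing merely anticipates the combination reported in the Results:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# synthetic imbalanced data (~30% minority, mirroring the drop-out base rate)
X, y = make_classification(n_samples=500, weights=[0.7], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# two base learners merged via a logistic GLM meta-learner (stacking)
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(),
    cv=5,  # out-of-fold base predictions train the meta-learner
)
stack.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1])
print(round(auc, 3))
```

The cv argument matters: the meta-learner is trained on out-of-fold predictions of the base learners, so it learns how to weight them without simply memorising their training-set fit.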

Comparing ensembles and single algorithms
Next, we compared all ten ensembles and the five best single algorithms. Again, we used a nested cross validation as described above. However, this time we used a ten-fold inner cross validation with three repetitions. Repeating the cross validation leads to a more precise result, 47 so we conducted this procedure for a more adequate comparison.

Extending the procedure
In order to gain a more comprehensive picture, we repeated the entire procedure twice. For the first repetition, we used only the significant predictors from Zimmermann et al 8 (i.e. initial impairment, male gender, lower education status, more histrionic and less compulsive personality style, and negative treatment expectations). Thus, we evaluated how the prediction changed when only these relevant predictors were used. For the second repetition, we performed variable selection using an elastic net regularisation with the Caret package on the training set after each split. As we examined a large set of variables, we evaluated whether some models improve when preceded by variable selection. This was done 20 times, once for each training set of the outer loops inside the cross-validation framework, preventing data leakage from the test set. For this elastic net selection we did not use Caret to choose the optimal setting, but set alpha to 0.1 for the first analysis and then increased alpha in increments of 0.1 until 1 was reached. An alpha of 0 is equal to a ridge regression, whereas an alpha of 1 equals a least absolute shrinkage and selection operator (LASSO) regression. We defined lambda's range analogously to the alpha parameter; lambda defines the magnitude of the regression penalty. This resulted in 100 possible combinations of these two parameters (ten values for alpha × ten values for lambda) from which to identify the best-fitting model. Identification of the best model was always based on the AUC: the model with the highest value was considered the best model. At the end of the second repetition, we retained only the predictors that had predictive power in the best model.
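The alpha/lambda grid search for elastic net variable selection can be sketched in Python. In scikit-learn's logistic regression the mixing parameter is called l1_ratio (the paper's alpha) and the penalty strength is expressed as C = 1/lambda; the data are synthetic and the grid is reduced for brevity relative to the paper's 10 × 10 grid:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# synthetic stand-in: 20 candidate predictors, only 5 of them informative
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# l1_ratio = 0 would be ridge, l1_ratio = 1 is LASSO; C = 1/lambda
grid = GridSearchCV(
    LogisticRegression(penalty="elasticnet", solver="saga", max_iter=5000),
    {"l1_ratio": [0.1, 0.4, 0.7, 1.0], "C": [0.01, 0.1, 1.0, 10.0]},
    cv=5, scoring="roc_auc",
).fit(X, y)

# keep only predictors with non-zero coefficients in the best-fitting model
selected = np.flatnonzero(grid.best_estimator_.coef_[0] != 0.0)
print(len(selected), "of 20 predictors retained")
```

As in the paper, only the retained predictors would then be passed on to the subsequent model-building step, and the selection would be re-run inside each outer training fold to avoid leakage.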
Conducting this entire procedure three times (with all variables, with only the seven predictors, and with the variables that had predictive power in the preceding elastic net analysis) led to a total of 30 ensembles (10 ensembles × 3 procedures) and 15 single algorithms (5 single algorithms × 3 procedures). Each ensemble and single algorithm generated a model via a nested cross validation with 20 outer loops and 10 inner loops with three repetitions. The model with the best mean prediction scores across all cross validations and across both parameters was chosen. Generating 20 models via the outer cross validations resulted in one distribution of 20 Brier scores and one distribution of 20 AUC scores for each algorithm/ensemble. To quantitatively compare the differences and distributions, as well as the robustness against sampling artefacts, t-tests between the best and worst model, as well as between the best model and a single GLM, were conducted for each parameter. For a final test, we used the best ensemble/algorithm to generate a model from the whole training sample via a ten-fold cross validation with three repetitions. This model was then tested in the still unused and independent holdout sample to assess the generalisability of the model and to prevent overfitting.
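The comparison of two score distributions can be illustrated with a paired t-test and a paired-samples Cohen's d. The 20 outer-loop AUC values below are invented for illustration and are not the study's results:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# invented AUC scores from 20 outer cross-validation loops for two models
auc_best = rng.normal(0.66, 0.02, size=20)
auc_worst = rng.normal(0.55, 0.02, size=20)

t, p = stats.ttest_rel(auc_best, auc_worst)  # paired t-test across the loops
diff = auc_best - auc_worst
d = diff.mean() / diff.std(ddof=1)           # Cohen's d for paired samples
print(f"t({len(diff) - 1}) = {t:.2f}, P = {p:.2g}, d = {d:.2f}")
```

Pairing by outer loop is what makes the test appropriate here: both models were evaluated on the same 20 cross-validation splits, so the splits' shared sampling variation cancels out.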
Last, the holdout sample's confusion matrix was examined in order to assess the improvement of the prediction. To this end, each individual whose predicted risk of dropping out of therapy was higher than the mean drop-out rate of the training sample (i.e. 30.6%) was considered a predicted 'drop-out case'. Finally, the Caret package was used to determine the most important variables.
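The thresholding step can be made concrete with a small invented example: predicted risks above the training sample's 30.6% base rate are classified as drop-out cases, and the confusion matrix summarises the result (numbers below are illustrative, not the study's):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# invented outcomes (1 = drop out) and predicted drop-out risks
y_true = np.array([1, 1, 0, 0, 1, 0, 0, 0, 1, 0])
p_drop = np.array([0.45, 0.28, 0.10, 0.35, 0.60, 0.20, 0.31, 0.05, 0.50, 0.25])

# classify as a predicted drop-out case when the risk exceeds the
# training sample's mean drop-out rate, as in the paper
threshold = 0.306
y_pred = (p_drop > threshold).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # share of actual drop-out cases identified
print(tn, fp, fn, tp, round(sensitivity, 2))  # → 4 2 1 3 0.75
```

Using the base rate rather than the conventional 0.5 as the cut-off trades some false alarms for higher sensitivity, which is the relevant trade-off when the goal is to flag at-risk patients before therapy begins.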

Results
After the first step, the algorithm with the best predictions when using all variables was Stochastic Gradient Boosting. When only using predictors that showed predictive power in a preceding elastic net analysis, Random Forest was the best algorithm.
Boosted Classification Trees (ADA) made the best predictions when only the seven significant predictors were used (see Supplementary Materials 3 for an overview of all algorithms). Boosting and tree-based approaches in particular seemed to make the best predictions. Further, algorithms from different classes seemed to correlate the least with each other (see Supplementary Materials 4 for the low-correlating algorithms).
Next, by using the rankings (Supplementary Materials 3) and correlations (Supplementary Materials 4), we generated the ensembles as described above for the final analyses. Comparing the different algorithms and ensembles, the best model across both parameters was generated by an ensemble of the best machine-learning algorithm and its least-correlated algorithm (i.e. Random Forest and k-nearest neighbours (kNN)) that was merged via a GLM and had a preceding elastic net variable selection (Brier score 0.1983, AUC = 0.6581). Table 3 provides an overview of all algorithms and ensembles.
The distributions of each algorithm/ensemble revealed that the best ones hardly differed from each other (see Fig. 1). Nevertheless, some models seemed to make significantly worse predictions. For the Brier score, the pattern was very similar.
As a result of these distributions, we were able to compare model accuracy/robustness via t-tests. A paired one-sided t-test revealed a highly significant effect between the overall best and overall worst models concerning the AUC score (AUC best = 0.6581; AUC worst = 0.5465; t(19) = 8.30, P < 0.001, Cohen's d = 1.86, 95% CI 0.11 to 2.58). Comparing the overall best model with a single GLM likewise revealed significant differences on both parameters (Brier score: d = −2.93, 95% CI −3.95 to −1.90; AUC: d = 0.59, 95% CI 0.11 to 1.06). The main predictors of drop out that made a substantial contribution (i.e. relative importance >90%) to the model were lower education level, younger age, lower scores on the compulsive scale of the Personality Style and Disorder Inventory (PSSI), higher scores on the negativistic and antisocial scales of the PSSI, and higher scores on the additional scale of the BSI as well as a higher BSI total score (see Supplementary materials 7 for an overview of all variables; see Liaw and Wiener 48 for a description of how variable importance is calculated in a Random Forest model). The BSI additional scale is the mean of the four additional items not included in any of the dimension scores ('poor appetite', 'trouble falling asleep', 'thoughts of death and dying', 'feelings of guilt').
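The relative-importance ranking (rescaled so that the top variable scores 100%) can be sketched generically from a Random Forest's impurity-based importances. The data and variable names below are synthetic and hypothetical; this is an illustration of the mechanism, not the study's importance computation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in for the intake variables; names are hypothetical
X, y = make_classification(n_samples=400, n_features=8, n_informative=3,
                           random_state=0)
names = [f"intake_var_{i}" for i in range(8)]

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# impurity-based importances, rescaled so the top variable scores 100%
rel = 100.0 * rf.feature_importances_ / rf.feature_importances_.max()
for name, r in sorted(zip(names, rel), key=lambda t: -t[1]):
    if r > 90:  # the paper's cut-off for a "substantial contribution"
        print(f"{name}: {r:.1f}%")
```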

Main findings
The aim was to evaluate the use of different machine-learning algorithms in a naturalistic routine care setting by generating a predictive model to identify patients who are at risk of dropping out. Two different indices were used to gain a more comprehensive picture of the results. We selected 21 algorithms for our study and used nested cross validation to compare them. We used the best algorithms and the least-correlated algorithms to generate ensembles, which were also compared. The best model was an ensemble of the best algorithm and its least-correlated algorithm (i.e. Random Forest and kNN) that used only predictive variables and was merged via a GLM. Differences between the best ensemble and a single GLM, as well as the worst algorithm, were highly significant, independent of the examined parameters. When comparing the distributions of the best model and the GLM, a large effect size of up to d = −2.93 was found, indicating the superiority of the best model independent of the training sample used. The best model correctly identified 63.4% of all cases of patients dropping out in an independent holdout sample. Although this does not seem very precise at first, it must be acknowledged that this prediction was made before the first session of routine CBT and that a single GLM correctly identified only 46.2%. Therefore, this model is of high clinical value and is able to identify, before the first session has occurred, patients who tend to drop out of therapy. The largely identical values of the AUC and the Brier score in the holdout sample compared with the test sets in the modelling process indicate good stability and generalisability of the model.

Interpretation of our findings relating to identification of patients at risk for treatment discontinuation
The most important variables in the final prediction model also appear to differ significantly between individuals who dropped out and those who terminated consensually. Nevertheless, this is not true for all variables, suggesting that the model uses more than just the different mean values for prediction. Based on the relevant variables in the model, therapists should take time to build a complementary relationship with the patient and invest time in explaining how therapy can concretely help them. In particular, high levels of interpersonal variables that make it difficult to establish a functional therapeutic relationship (for example a negativistic or antisocial personality style) appear to increase the risk of drop out. It is important for clinicians to pay attention to the complementarity of the relationship in order to establish a good alliance. This is especially crucial in the first session, as this is where the first impression is made. Here, model predictions can be used to better prepare for potential interpersonal difficulties.
With regard to the BSI additional scale, it seems reasonable to first treat symptoms such as sleep problems, poor appetite, suicidal thoughts and feelings of guilt. Although general symptom burden or functionality do not play a particularly important role, symptoms that are very obvious to the patient (such as sleep problems, distressing suicidal thoughts) appear to be important indicators. It is obvious that patients hope for a quick improvement in these symptoms resulting from therapy, which has an important signal effect for therapists to focus on the treatment of these symptoms, especially at the beginning of therapy.
Interestingly, lower education and younger age also seem to increase the probability of individuals dropping out. Other studies have also identified these variables, 8 so these should be considered in therapy, even if they are invariant. Future studies should explore the underlying mechanisms of their influence on drop-out probability to better understand the effects and to improve future models. Nevertheless, using the information from our model, clinicians could generate a more precise case concept for the individual patient before the first session to help patients gain confidence in therapy, facilitate the establishment of a functional therapeutic relationship, and thus reduce the risk of patients dropping out. Therefore, the best model from our analyses could improve and further support measurement-based care with regard to drop-out prediction and prevention.

Fig. 1 Distribution of the 20 outer cross-validation models generated by each algorithm and ensemble, ranked from best to worst using the area under the curve. Each value was grand-mean centred; the horizontal line represents the total average of all models. The numbers on the graphs are the standard deviations.

Interpretation of our findings relating to the use of machine learning
Further, the results indicate that machine-learning algorithms/ensembles can have a true predictive advantage in naturalistic settings. However, this does not apply to all algorithms. Some produced significantly worse predictions, indicating that not all machine-learning algorithms/ensembles are suited to naturalistic settings. Our results revealed that ensembles consisting of low-correlating algorithms did not perform well unless they included a powerful algorithm that delivers good predictions on its own. The idea that low-correlating algorithms assess different aspects of the data-set and thus should perform better than ensembles of more similar algorithms did not hold true. Mayer 46 states that an optimal ensemble of low-correlating algorithms consists of those that perform similarly well on their own; this was not the case in our analyses, which could explain the worse prediction quality in our data. Furthermore, the assumption 'the more the better' also did not hold true: ensembles that used more algorithms did not automatically perform better. This finding is in line with previous argumentation that selecting algorithms to create an ensemble does not follow easy rules like 'the more the better', but is a research topic of its own. 49 In addition, when using a large set of variables, a variable selection procedure should be part of model generation, either by using an algorithm that includes a selection procedure or by a preceding variable selection. Our findings indicate that algorithms that had to handle many variables and did not include a variable selection procedure performed worse (for example linear discriminant analysis). This finding is in line with the existing literature stating that, in clinical settings, not every variable has predictive power for a certain outcome 50 and can thus weaken the power of the model.
Interestingly, tree-based and boosted algorithms seem to perform better compared with more advanced algorithms such as neural networks. This finding appeared consistently, independent of the examined parameters. Therefore, for this kind of naturalistic binary data, boosted linear algorithms and tree-based approaches such as random forest seem very well suited.

Limitations
Although this study has many strengths, several limitations must be mentioned. One reason for the poor performance of neural networks could be the data quality. Although naturalistic assessments include crucial predictive information, they are far from perfect and always contain measurement error. Although this topic is not new, 51 these errors can prevent the algorithms from assessing the relevant relationships. These suggestions are in line with the existing literature. 50,52 A solution to this problem could be the use of ecological momentary assessment data, which provide more accurate descriptions of within-person processes at a higher resolution. For future studies, it is of great interest whether complex algorithms such as neural networks are more suitable for such data and are thus able to improve predictions of drop out. Electrophysiological variables and neural imaging variables could also improve predictions, 53 but such assessments are expensive and time-consuming and therefore unlikely to be used in routine care. In addition, the amount of data could have played an important role. Complex algorithms that are able to assess high-order interactions need a lot of data. 16 Thus, the size of our data-set limits the evaluation of these algorithms. Future studies should try to generate even larger data-sets in order to evaluate the possible benefit of advanced algorithms.
Furthermore, it is possible that certain predictive variables were not collected. For example, we only collected whether patients were taking medication or not, regardless of what they were taking or for how long. Although this variable did not play a role in our model, it cannot be ruled out that more precise information could improve the model. The same applies to the variables that we had to exclude because of too many missing values (i.e. PHQ-9, ASQ, GAD-7, TIPI, WSAS). These variables contain important clinical information that could be important for prediction.
Moreover, we used only a small number of possible machine-learning algorithms. Although we used many models that have already been applied in psychological studies to create a representative picture, it cannot be ruled out that an even more suitable approach for this kind of data exists. Also, as mentioned above, the use of ensembles requires a profound understanding of this topic. For our own ensembles, we used the stacking method. However, there are other options for creating ensembles, such as bagging or boosting.
Although the model is well protected against overfitting by the use of repeated and nested cross validation as well as a separate holdout sample, the possibility of overfitting cannot be completely ruled out. Furthermore, our holdout sample is quite similar to the training sample, which limits the generalisability of the results. Nevertheless, it must be noted that there are differences between the samples, especially with regard to the diagnosis, which is why a certain degree of generalisability can be assumed. Nevertheless, holdout samples from other institutes should be used in future studies to more robustly test generalisability.
In addition, although this model helps to identify patients who are at risk of dropping out of therapy, it does not reveal the reasons for this increased risk. No causal conclusions can be drawn from this model, which is a limitation of our model and of machine learning in general. Nevertheless, the identified predictors provide first clues as to which risk factors may be relevant to drop out. Moreover, the identification of patients at risk of treatment discontinuation is the first step towards reducing the number of patients who drop out.

Implications
To our knowledge, this study is the first to use such a large naturalistic data-set to evaluate different machine-learning algorithms and ensembles to identify a useful drop-out prediction model. The current study compared several machine-learning methods in order to evaluate the benefit of machine learning in naturalistic contexts and to generate a model that has high clinical value for identifying drop-out risk at an individual level. The model identified over 60% of patients' type of therapy termination correctly. This study's findings highlight that it is possible to identify, before the first session has occurred, patients at risk of dropping out, and that machine-learning algorithms provide an important contribution to model generation. Tree-based and boosted algorithms that include a variable selection procedure (for example elastic net) seem especially suited to building prediction models for psychotherapy drop out.
Future research should further explore treatment data to improve prediction models and use them to develop strategies to reduce the risk of drop out. By implementing these models into clinical support systems, the number of individuals who drop out could be reduced, resulting in more effective therapy outcomes and less burden on patients and society.

Data availability
The data that support the findings of this study are available from the corresponding author, B.B., upon reasonable request.