Prediction models in first-episode psychosis: systematic review and critical appraisal

Background: People presenting with first-episode psychosis (FEP) have heterogeneous outcomes. More than 40% fail to achieve symptomatic remission. Accurate prediction of individual outcome in FEP could facilitate early intervention to change the clinical trajectory and improve prognosis.

Aims: We aimed to systematically review evidence for prediction models developed for predicting poor outcome in FEP.

Method: A protocol for this study was published on the International Prospective Register of Systematic Reviews (PROSPERO), registration number CRD42019156897. Following Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidance, we systematically searched six databases from inception to 28 January 2021. We used the Checklist for Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies (CHARMS) and the Prediction Model Risk of Bias Assessment Tool (PROBAST) to extract and appraise the outcome prediction models. We considered study characteristics, methodology and model performance.

Results: Thirteen studies reporting 31 prediction models across a range of clinical outcomes met criteria for inclusion. Eleven studies used logistic regression with clinical and sociodemographic predictor variables. Just two studies were found to be at low risk of bias. Methodological limitations identified included a lack of appropriate validation, small sample sizes, poor handling of missing data and inadequate reporting of calibration and discrimination measures. To date, no model has been applied to clinical practice.

Conclusions: Future prediction studies in psychosis should prioritise methodological rigour and external validation in larger samples. The potential of prediction modelling in FEP is yet to be realised.


Psychosis
Psychosis is a mental illness characterised by hallucinations, delusions and thought disorder. The median lifetime prevalence of psychosis is around 8 per 1000 of the global population. 1 Psychotic disorders, including schizophrenia, are among the top 20 leading causes of disability worldwide. 2 People with psychosis have heterogeneous outcomes: more than 40% fail to achieve symptomatic remission. 3 At present, clinicians struggle to predict long-term outcome in individuals with first-episode psychosis (FEP).

Prediction modelling
Prediction modelling has the potential to revolutionise medicine by predicting individual patient outcome. 4 Early identification of those with good and poor outcomes would allow for a more personalised approach to care, matching interventions and resources to those most in need. This is the basis of precision medicine. Risk prediction models have been successfully employed clinically in many areas of medicine; for example, the QRISK tool predicts cardiovascular risk in individual patients. 5 However, within psychiatry, precision medicine is not yet established in clinical practice. In FEP, precision medicine could enable rapid stratification and targeted intervention, thereby decreasing patient suffering and limiting treatment-associated risks such as medication side-effects and intrusive monitoring.
Salazar de Pablo et al recently undertook a broad systematic review of individualised prediction models in psychiatry. 6 They found clear evidence that precision psychiatry has developed into an important area of research, with the greatest number of prediction models focusing on outcomes in psychosis. However, the field is hindered by methodological flaws such as lack of validation. Further, there is a translation gap, with only one study considering implementation into clinical practice. Systematic guidance for the development, validation and presentation of prediction models is available. 7 Further, the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) statement sets standards for reporting. 8 Models that do not adhere to these guidelines result in unreliable predictions, which may cause more harm than good in guiding clinical decisions. 9 Salazar de Pablo et al's review was impressive in scope, but necessarily limited in detailed analysis of the specific models included. 6 Systematic reviews focusing on predicting the transition to psychosis 10,11 and relapse in psychosis have also been published. 12 In the present review, we focus on FEP, aiming to systematically review and critically appraise prediction models for the prediction of poor outcomes.

Method
The Prediction Model Risk of Bias Assessment Tool (PROBAST) assesses risk of bias across four domains: 'participants', 'predictors', 'outcome' and 'statistical analysis'. Signalling questions are answered 'yes', 'probably yes', 'no', 'probably no' or 'no information'. Answering 'yes' indicates a low risk of bias, whereas answering 'no' indicates a high risk of bias. A domain where all signalling questions are answered 'yes' or 'probably yes' indicates low risk of bias. Answering 'no' or 'probably no' flags the potential for bias, and reviewers should use their judgement to determine whether the issues identified have introduced bias. Applicability of included studies to the review question is also considered in PROBAST.
We reported our results according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 Statement (see Supplementary Material). 17

Results
Systematic review of the literature yielded 2353 records from database searches and 67 from additional sources. After removal of duplicates, 1543 records were screened. Of these, 82 full texts were reviewed, which resulted in 13 studies meeting criteria for inclusion in our qualitative synthesis (Fig. 1). [18][19][20][21][22][23][24][25][26][27][28][29][30] Study characteristics are summarised in Table 1. The 13 included studies, comprising a total of 19 different patient cohorts, reported 31 different prediction models. Dates of publication ranged from 2006 to 2021. Twelve studies (92%) recruited participants from Europe, with two studies (15%) also recruiting participants from Israel and one study (8%) from Singapore. Over two-thirds (n = 9) of studies were multicentre. Ten studies (77%) included participants from cohort studies, three studies (23%) included participants from randomised controlled trials and two studies (15%) included participants from case registries. Two studies (15%) included only out-patients, four (31%) included in-patients and out-patients, and the rest did not specify their setting. Cohort sample size ranged from 47 to 1663 patients. The average age of patients ranged from 21 to 28 years, and 49-77% of the cohorts were male. Where specified, the average duration of untreated psychosis ranged from 34 to 106 weeks. Ethnicity was reported in eight studies (62%), with the percentage of Black and minority ethnic patients in the cohorts ranging from 4 to >75%. FEP was defined as non-affective psychosis in the majority of patient cohorts, with a minority also including affective psychosis, and two cohorts also including drug-induced psychosis. All but one study (92%) considered solely sociodemographic and clinical predictors.
A wide range of outcomes were assessed across the 13 included studies, including symptom remission in five studies (38%), global functioning in five studies (38%), vocational functioning in three studies (23%), treatment resistance in two studies (15%), hospital readmission in two studies (15%) and quality of life in one study (8%). All of the outcomes were binary. The follow-up period of included studies ranged from 1 to 10 years.
Study prediction-modelling methodologies are outlined in Table 2. Nine (69%) studies pertained solely to model development, with the highest level of validation reported being apparent validity in four of the studies, internal validity in three of the studies and internal-external validity (via leave-one-site-out cross-validation) in two of the studies. The remaining four (31%) studies also included a validation cohort and reported external validity. High dimensionality was common across the study cohorts, with the majority having a very low events-per-variable (EPV) ratio and up to 258 candidate predictors considered. Some form of variable selection was used in the majority (62%) of studies. The number of events in the external validation cohort ranged from 23 to 173. All of the studies had missing data. Six studies (46%) used complete-case analysis, five (38%) studies used single imputation and the remaining two (15%) studies applied multiple imputation.
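As an illustration, the EPV ratio is simply the number of outcome events divided by the number of candidate predictor parameters. The sketch below uses the 258 candidate predictors reported above; the event count of 60 is a hypothetical figure chosen for the example, not taken from any included study:

```python
def events_per_variable(n_events: int, n_candidate_predictors: int) -> float:
    """Events-per-variable (EPV): outcome events divided by the number of
    candidate predictor parameters considered during model development."""
    return n_events / n_candidate_predictors

# Hypothetical illustration: 60 outcome events (assumed) against the 258
# candidate predictors reported as the maximum across the included studies.
epv = events_per_variable(60, 258)
print(round(epv, 2))  # → 0.23, far below the traditional EPV >= 10 rule of thumb
```

A cohort in this position would be at high risk of overfitting, which is why larger samples or fewer candidate predictors are repeatedly recommended.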
The most common modelling methodology was logistic regression fitted by maximum likelihood estimation, followed by logistic regression with regularisation. Only two studies used machine learning methods, both via support vector machines. Just over half of the studies (54%) did not use any variable shrinkage, and only three (23%) studies recalibrated their models based on validation to improve performance. The full model was presented in seven (54%) studies. Only two (15%) studies assessed clinical utility.
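The effect of regularisation (shrinkage) mentioned above can be illustrated with a minimal sketch, assuming a simple one-predictor model fitted by gradient descent on simulated data; all data, penalty strengths and settings here are hypothetical and do not reproduce any included study's method:

```python
import math
import random

def fit_logistic(xs, ys, l2=0.0, lr=0.1, n_iter=5000):
    """Fit a one-predictor logistic model by gradient descent.

    l2 > 0 adds a ridge penalty on the slope (not the intercept, as is
    conventional), shrinking the coefficient towards zero."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += (p - y) / n
            g1 += (p - y) * x / n
        g1 += l2 * b1  # ridge gradient applied to the slope only
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1

# Simulated toy data: outcome more likely at higher predictor values
random.seed(1)
xs = [random.gauss(0, 1) for _ in range(200)]
ys = [1 if random.random() < 1 / (1 + math.exp(-1.5 * x)) else 0 for x in xs]

_, slope_mle = fit_logistic(xs, ys, l2=0.0)   # maximum likelihood
_, slope_ridge = fit_logistic(xs, ys, l2=0.5)  # penalised (shrunk)
# The penalised slope is pulled towards zero relative to the MLE slope
```

Shrinkage deliberately trades a little bias for less variance, which tends to improve predictions in new patients, particularly when the EPV is low.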
Table 3 reports the performance of the best model per study outcome, grouped by method of validation to allow for appropriate comparisons. For the five studies (38%) reporting only apparent validity, two reported a measure of discrimination and only one considered calibration. For the seven (54%) studies reporting internal validation performance, four reported discrimination with a C-statistic ranging from 0.66 to 0.77, and four reported calibration. For the three (23%) studies reporting internal-external validation, only one study considered discrimination with a C-statistic, which ranged from 0.703 to 0.736 across each of its four models. None of the studies reporting internal-external validation considered any measure of calibration. All four (31%) studies reporting external validation considered model discrimination, with C-statistics ranging from 0.556 to 0.876. However, only two of these studies considered calibration. Table 3 also records any global performance metrics, including the Brier score and McFadden's pseudo-R², both of which incorporate aspects of discrimination and calibration. Various classification metrics were reported across the study models, but it is difficult to make meaningful comparisons between these alone, without considering the models' corresponding discrimination and calibration metrics, which were not universally reported.
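For readers less familiar with these metrics, the C-statistic and Brier score can be computed directly from predicted risks and observed outcomes. This is a minimal sketch with hypothetical predictions, not output from any included model:

```python
def c_statistic(y_true, y_prob):
    """Concordance (C) statistic: the probability that a randomly chosen
    patient with the outcome receives a higher predicted risk than a
    randomly chosen patient without it (ties count one half)."""
    events = [p for y, p in zip(y_true, y_prob) if y == 1]
    non_events = [p for y, p in zip(y_true, y_prob) if y == 0]
    pairs = concordant = 0.0
    for pe in events:
        for pn in non_events:
            pairs += 1
            if pe > pn:
                concordant += 1
            elif pe == pn:
                concordant += 0.5
    return concordant / pairs

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted risk and observed outcome;
    lower is better, and it reflects both discrimination and calibration."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

# Hypothetical worked example
y = [0, 0, 1, 1, 0, 1]
p = [0.1, 0.3, 0.6, 0.8, 0.4, 0.2]
print(c_statistic(y, p))  # → 7/9 ≈ 0.778
```

A C-statistic of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation, which is why the external validation range of 0.556 to 0.876 spans nearly useless to clinically promising.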
We applied the PROBAST tool to the 31 different prediction models across the 13 studies in our systematic review, and determined an overall risk-of-bias rating for each study, as summarised in Supplementary Table 1. The majority (85%) of studies had an overall 'high' risk of bias. In each of these studies, the risk of bias was rated 'high' in the analysis domain, with one study also having a 'high' risk of bias in the predictors domain. The main reasons for the 'high' risk of bias in the analysis domain were insufficient participant numbers and consequently low EPV, inappropriate methods of variable selection including via univariable analysis, a lack of appropriate validation with only apparent validation, an absence of reported measures of discrimination and calibration, and inappropriate handling of missing data by either complete-case analysis or single imputation. Two studies, Leighton et al 29 and Puntis et al, 30 were rated overall 'low' risk of bias. These studies considered symptom remission and psychiatric hospital readmission outcomes, respectively. Both studies externally validated their prediction model and considered its clinical utility. However, neither study considered the implementation of the prediction model into actual clinical practice. When we assessed the 13 included studies according to PROBAST applicability concerns, all of the studies were considered overall 'low' concern. This is indicative of the broad scope of our systematic review.

Discussion
Our systematic review identified 13 studies reporting 31 prognostic prediction models for the prediction of a wide range of clinical outcomes. The majority of models were developed via logistic regression. There were several methodological limitations identified, including a lack of appropriate validation, issues with handling missing data and a lack of reporting of calibration and discrimination measures. We identified two studies with models at low risk of bias as assessed with PROBAST, both of which externally validated their models.

Principal findings in context
Our systematic review found no consistent definition of FEP across the different cohorts used for developing and validating prediction models. The lack of an operational definition for FEP within clinical and research settings has previously been identified as a major barrier to progress. 31 The majority of cohorts in our systematic review included only individuals with non-affective psychosis, with a minority also including affective psychosis. In contrast, early intervention services typically do not distinguish between affective and non-affective psychosis in those that they accept onto their service. 32 As such, there may be issues with generalisability of prediction models developed in cohorts with solely non-affective psychosis to real-world clinical practice. A wide range of different outcomes were predicted by the FEP models, including symptom remission, global functioning, vocational functioning, treatment resistance, hospital readmission and quality-of-life outcomes. This reflects the fact that recovery from FEP is not readily distilled down to a single factor such as symptom remission. Meaningful recovery is represented by a constellation of multidimensional outcomes unique to each individual. 33 We should engage people with lived experience, to ensure that prediction models are welcomed and predict the outcomes most relevant to the people they are for.
All of the prediction models were developed in populations from high-income countries, and only three studies included participants from countries outside of Europe, an issue not unique to FEP research. Consequently, it is currently unknown how prediction models for FEP would generalise to low-income countries. Prediction models may have considerable benefit in low-income countries, where almost 80% of patients with FEP live, but where mental health support is often scarce. 34 Prediction models could help prioritise the appropriate utilisation of limited healthcare resources.
Only one study considered predictor variables other than clinical or sociodemographic factors. In this study, the additional predictors did not add significant value. 22 In recent years, substantial progress has been made in elucidating the pathophysiological mechanisms underpinning the development of psychosis. We now recognise important roles for genetic factors, neurodevelopmental factors, dopamine and glutamate. 35 Prediction model performance may be improved by the incorporation of these biologically relevant disease markers as predictor variables. However, the cost-benefit aspect of adding more expensive and less accessible disease markers must be carefully considered, especially if models are to be utilised in settings where resources are more limited.
Machine learning can be operationally defined as 'models that directly and automatically learn from data'. This is in contrast to regression models, which 'are based on theory and assumptions, and benefit from human intervention and subject knowledge for model specification'. 36 Just two studies used machine learning techniques for their modelling. 22,26 The rest of the studies used logistic regression. We were unable to make any comparison between the discrimination and calibration ability of the two studies that used machine learning and the other studies, because these metrics were not provided. However, a recent systematic review found no evidence of superior performance of clinical prediction models that use machine learning methods over logistic regression. 36 In any case, the distinction between regression models and machine learning has been viewed to be artificial. Instead, algorithms may exist 'along a continuum between fully human-guided to fully machine-guided data analysis'. 37 An alternative comparison may be between linear and non-linear classifiers. Only one study used a non-linear classifier, 26 but again we were unable to gain meaningful insights into its relative performance because appropriate metrics were not provided.
A principal finding from our systematic review is the presence of methodological limitations across the majority of studies. Steyerberg et al outline four key measures of predictive performance that should be assessed in any prediction-modelling study: two measures of calibration (the model intercept (A) and the calibration slope (B)), discrimination via a concordance statistic (C) and clinical usefulness with decision-curve analysis (D). 7 Model calibration is the level of agreement between the observed outcomes and the predictions. For example, if a model predicts a 5% risk of cancer, then the observed proportion should be five cancers per 100 such predictions. Discrimination is the ability of a model to distinguish between a patient with the outcome and one without. 7 Our review found that only seven studies (54%) reported discrimination and just five (38%) reported any measure of calibration. The remaining studies reported only classification metrics, such as accuracy or balanced accuracy. The problem with solely reporting classification metrics is that they vary both across models and across different probability thresholds for the same model. This renders comparison between models less meaningful. It is further argued that setting a classification threshold for a probability-generating model is premature. Rather, a clinician may choose to set different probability thresholds for the same prediction model, depending on the situation at hand, to optimise the balance between false positives and false negatives. For example, in the case of a model predicting cancer, a clinician may choose a lower probability threshold to offer a non-invasive screening test and a higher probability threshold to suggest an invasive and potentially harmful biopsy. Further, without any measure of model calibration, we are unable to assess whether the model can make unbiased estimates of outcome. 38

The final key step in assessing the performance of a prediction model is to determine its clinical usefulness: that is, can better decisions be made with the model than without? Decision-curve analysis considers the net benefit (the treatment-threshold-weighted sum of true- minus false-positive classifications) for a prediction model compared with the default strategies of treating all or no patients, across an entire range of treatment thresholds. 39 Only two studies (15%) included in our review considered whether the model was clinically useful. Without proper validation of the prediction models, the reported performances are likely to be overly optimistic. Four studies (31%) reported only apparent validity. Just four studies (31%) reported external validation, which is considered essential before applying a prediction model to clinical practice. 14 Altogether, just two studies (15%) had an overall 'low' risk of bias according to PROBAST, reflecting these methodological limitations. Neither study considered real-world implementation. To progress with implementation, impact studies are required. These would involve a cluster randomised trial comparing patient outcomes between a group with treatment informed by a clinical prediction model and a control group. 40 We are not aware of any such study having been carried out within the field of psychiatry. However, Salazar de Pablo et al suggest that PROBAST thresholds for considering a study to be at 'low' risk of bias may be too strict. 6 Indeed, in the field of machine learning, multiple imputation is frequently computationally infeasible, and single imputation may be viewed as sufficient. This is especially true in larger data-sets or in the presence of relatively few missing values. 41
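The net-benefit calculation underlying decision-curve analysis can be sketched directly from its definition; the cohort and predicted risks below are entirely hypothetical:

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of a model at treatment threshold p_t:
    (TP/n) - (FP/n) * (p_t / (1 - p_t)),
    i.e. true positives minus false positives weighted by the odds at p_t."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - fp / n * (threshold / (1 - threshold))

def net_benefit_treat_all(y_true, threshold):
    """Default comparator: treat everyone regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * (threshold / (1 - threshold))

# Hypothetical cohort of 10 patients with assumed predicted risks
y = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
p = [0.8, 0.2, 0.1, 0.6, 0.3, 0.05, 0.4, 0.7, 0.15, 0.25]
nb_model = net_benefit(y, p, 0.5)
nb_all = net_benefit_treat_all(y, 0.5)
# A clinically useful model should beat both 'treat all' (nb_all) and
# 'treat none' (net benefit 0) at thresholds of interest
```

In a full decision-curve analysis, this comparison is repeated across the whole range of plausible treatment thresholds rather than at a single point.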

Strengths and limitations
Our review had a number of strengths. We provide the first systematic overview of prediction-modelling studies for use in patients with FEP. We offer a detailed critique of the study characteristics, their methodologies and model performance metrics. Further, our review adheres to gold-standard guidance for extracting data from prediction models and for assessing bias, namely the CHARMS checklist and PROBAST.
There were several limitations. Our initial aim was to perform a meta-analysis of any prediction model that was validated across different settings and populations. However, no meta-analysis was possible because no single prediction model was validated more than once. In addition, as a consequence of poor reporting of discrimination and calibration performance across the studies, it was often difficult to make meaningful comparisons between the prediction models. The lack of consensus as to the most important outcome measure in FEP, with six different outcomes considered across only 13 included studies, further hindered efforts to draw meaningful comparisons between the included studies and their respective prediction models. Likewise, if more studies had considered the same outcome measures, this may have afforded the opportunity to validate existing prediction models rather than necessitating the creation of additional new models. All published prediction-modelling studies in FEP reported significant positive findings. It is possible that studies with negative findings were held back from publication, raising the possibility of publication bias. We originally intended to evaluate the overall certainty in the body of evidence by using the Grading of Recommendations Assessment, Development and Evaluation (GRADE) framework. 42 GRADE was originally designed for reviews of intervention studies and has not yet been adapted for use in systematic reviews of prediction models. Consequently, in its current form, we did not find GRADE to be a suitable tool for our review and decided not to use it. Future research should consider how to adapt GRADE for use in systematic reviews of prediction models.

Implications for future research
It is clear that there is a growing trend for the development of prediction models in FEP. 6 FEP is an illness that responds best to an early intervention paradigm. 43 Prediction models have the potential to optimise the allocation of time-critical interventions, like clozapine for treatment resistance. 44 However, several steps are necessary before meaningful implementation into real-world clinical practice. The field must prioritise external validation and replication of existing prediction models in larger sample sizes, to increase the EPV. This is best accomplished by an emphasis on data-sharing and open collaboration. Prediction studies should include FEP cohorts from low-income countries, where there is considerable potential for benefit by helping to prioritise limited resources to those most in need. Harmonisation of data collection across the field, both in terms of predictors and outcomes measured, would facilitate validation efforts. There should be greater consideration of biologically relevant and cognitive predictors based on our growing understanding of disease mechanisms, which could optimise prediction model performance. Finally, our review highlights considerable methodological pitfalls in much of the current literature. Future prediction-modelling studies should focus on methodological rigour with adherence to accepted best-practice guidance. 9,14,38 Our goal in psychiatry should be to develop an innovative approach to care by using prediction models. Application of these approaches in clinical practice would enable rapid and targeted intervention, thereby limiting treatment-associated risks and reducing patient suffering.