There is substantial variability in the efficacy of antipsychotic drugs in the treatment of psychotic disorders. On average, 20–30% of patients show inadequate response to treatment and fewer than 40% achieve symptom remission, with treatment discontinuation rates reaching approximately 70%. Reference Wold, Ottesen, Camilla, Johnsen, Lagerberg and Romm1 These factors contribute to relapses that often lead to rehospitalisation, further worsening the prognosis, negatively impacting patients’ quality of life and reducing life expectancy. Reference Lisoway, Chen, Zai, Tiwari and Kennedy2 Therefore, recovery, remission and relapse prevention in psychotic disorders have become key areas of interest in psychiatric research. Reference Fusar-Poli, Rutigliano, Stahl, Davies, Bonoldi and Reilly3
Nevertheless, the long-term recovery rates for individuals experiencing first-episode psychosis (FEP) are more favourable, varying between 10 and 35% depending on the definition of recovery and the length of the follow-up period. Reference Catalan, Richter, Salazar de Pablo, Vaquerizo-Serrano, Mancebo and Pedruzo4,Reference Peralta, García de Jalón, Moreno-Izco, Peralta, Janda and Sánchez-Torres5 Although the outcome trajectories of individual patients are difficult to predict, clinical development over the first years of treatment is critical for longer-term outcomes. Early identification of individuals at high risk of non-response at the initial clinical encounter may enable timely, personalised interventions, reduce the delay in treatment initiation and optimise resource utilisation by redirecting intensive resources towards those with more complex or prolonged recovery trajectories. This could ultimately improve patient outcomes and increase the likelihood of long-term recovery. Reference Albert and Weibell6
Previous studies have identified several predictors of poor clinical, functional and cognitive outcomes, such as poor premorbid adjustment, history of developmental disorders, greater baseline symptom severity and longer duration of untreated psychosis. Reference Lally, Ajnakina, Stubbs, Cullinane, Murphy and Gaughran7–Reference Santesteban-Echarri, Paino, Rice, González-Blanch, McGorry and Gleeson9 However, these findings are based on population-level associations and do not necessarily allow accurate prediction at the individual level.
In recent years, multivariable prediction models such as machine learning algorithms have been increasingly used to provide individualised risk estimates. These tools aim to support personalised psychiatry by stratifying the probability of poor outcomes at the time of first clinical contact. Several systematic reviews have catalogued such models for psychosis outcomes, many of which rely on baseline clinical variables. Reference Salazar de Pablo, Radua, Pereira, Bonoldi, Arienti and Besana10,Reference Meehan, Lewis, Fazel, Fusar-Poli, Steyerberg and Stahl11
Among these clinical prediction models in FEP, the most frequently predicted outcome is clinical remission. Reference Leighton, Krishnadas, Chung, Blair, Brown and Clark12–Reference Soldatos, Cearns, Nielsen, Kollias, Xenaki and Stefanatou14 However, there is increasing support for incorporating measures of psychosocial functioning and personal recovery to better capture the heterogeneity of FEP outcomes. Reference Peralta, García de Jalón, Moreno-Izco, Peralta, Janda and Sánchez-Torres5 In this context, clinical recovery – defined as sustained symptom remission alongside regained functional capacity – has been proposed as a more comprehensive and meaningful outcome. Reference Cuesta, Sánchez-Torres, Moreno-Izco, García de Jalón, Gil-Berrozpe and Zarzuela15,Reference Treen Calvo, Giménez-Donoso, Setién-Suero, Toll Privat, Crespo-Facorro and Ayesa Arriola16 Unlike traditional outcome measures that focus narrowly on symptom reduction, recovery reflects a holistic and person-centred approach that better captures the lived experience and long-term adaptation of individuals with schizophrenia. Recently, the Remission in Schizophrenia Working Group criteria for remission, in combination with functional measurements, have been used to define early clinical recovery (ECR) following an FEP as meeting the criteria for remission and regained functioning at 1-year follow-up. Reference Wold, Ottesen, Camilla, Johnsen, Lagerberg and Romm1
To date, most prediction models for FEP outcomes have focused primarily on sociodemographic and clinical variables, often overlooking cognitive functioning and biomarkers. Neurocognitive impairment is strongly associated with poor functional outcomes and reduced likelihood of recovery. Reference Treen Calvo, Giménez-Donoso, Setién-Suero, Toll Privat, Crespo-Facorro and Ayesa Arriola16 Polygenic risk scores (PRSs), which aggregate the effects of many genetic variants associated with specific traits or disorders, are among the most promising emerging biomarkers. The integration of PRS into clinical practice is one of the most exciting developments in the advancement of precision psychiatry. Reference Wray, Lin, Austin, McGrath, Hickie and Murray17
In the present study, we aimed to develop and externally validate a clinically applicable prediction model for ECR and evaluate its performance across different modelling approaches, including both regression-based and machine learning techniques. We used data from the two largest longitudinal FEP cohorts in Spain: the primeros episodios psicóticos (PEPs) cohort (from the study ‘Genotype–Phenotype Interaction and Environment. Application to a Predictive Model in First Psychotic Episodes’, N = 335) for model development and internal validation, and the Programa Asistencial a las Fases Iniciales de Psicosis (PAFIP) cohort (from the ‘Clinical Program on Early Phases of Psychosis’, N = 668) for external validation. We developed prediction models using different sets of predictors: clinical variables (including cognitive measures), genetic data (PRS) and a combination of both.
Method
The study was conducted in accordance with the protocol of the FarmaPRED study (approved by the Clinical Research Ethic Committee of Hospital Clinic Barcelona, approval no. HCB/2022/0079), Reference Mas, Julià, Cuesta, Crespo-Facorro, Vázquez-Bourgon and Spuch18 and is reported following the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD + AI) guidelines. Reference Collins, Moons, Dhiman, Riley, Beam and Calster19 A detailed description of protocol adherence is provided in Supplementary Table 1 available at https://doi.org/10.1192/bjp.2026.10581. The completed TRIPOD + AI reporting checklist can be found in Supplementary Table 2.
Participants
The development data-set comprised individuals from the PEPs cohort (N = 335), collected as part of the multicentre, naturalistic, longitudinal study ‘Genotype–Phenotype Interaction and Environment: Application to a Predictive Model in First Psychotic Episodes’. Reference Bernardo, Bioque, Parellada, Saiz Ruiz, Cuesta and Llerena20 Participants were recruited between April 2009 and April 2011 across 16 clinical centres throughout Spain. Inclusion criteria were as follows: age 7–35 years, psychotic symptoms <12 months, fluency in Spanish and signed informed consent. Exclusion criteria included intellectual disability (DSM-IV), history of head trauma with loss of consciousness and medical/neurological conditions affecting cognition. The study was approved by the Clinical Research Ethic Committee of Hospital Clinic Barcelona (approval no. HCB/2008/4232). All participants provided written informed consent.
The validation data-set consisted of individuals from the PAFIP cohort (N = 668), part of the programme ‘Clinical Programme on Early Phases of Psychosis’ Reference Crespo-Facorro, de la Foz, Ayesa-Arriola, Pérez-Iglesias, Mata and Suarez-Pinilla21 conducted at University Hospital Marqués de Valdecilla (Santander, Spain). Recruitment occurred between February 2001 and September 2018. Inclusion criteria included: residence in Cantabria, first episode of non-affective psychosis, age 15–60 years and a DSM-IV diagnosis within the schizophrenia spectrum (schizophrenia, schizophreniform, schizoaffective, brief psychotic disorder or psychosis not otherwise specified). Exclusion criteria were substance- or medication-induced psychosis, brain injury or neurological illness, intellectual disability and substance dependence (except nicotine). Diagnoses were confirmed using the Structured Clinical Interview for DSM-IV Axis I Disorders, conducted by a trained research psychiatrist. Informed consent was obtained from participants and their families. The study was approved by the Institutional Review Board of Cantabria (approval no. 2017/247), and complied with international research ethics guidelines.
A complete description of these cohorts can be found elsewhere. Reference Bernardo, Bioque, Parellada, Saiz Ruiz, Cuesta and Llerena20,Reference Crespo-Facorro, de la Foz, Ayesa-Arriola, Pérez-Iglesias, Mata and Suarez-Pinilla21
Early clinical recovery definition
The outcome for this study was ECR, defined as achieving both symptomatic and functional remission within 1 year of study inclusion according to Wold et al. Reference Wold, Ottesen, Camilla, Johnsen, Lagerberg and Romm1 Symptomatic remission was defined as a score of ≤3 (mild or less) on the following 8 Positive and Negative Syndrome Scale (PANSS) items: P1 (delusions), P2 (conceptual disorganisation), P3 (hallucinatory behaviour), N1 (blunted affect), N4 (social withdrawal), N6 (lack of spontaneity), G5 (mannerisms/posturing) and G9 (unusual thought content). Functional remission was defined as achieving a score of ≥61 on the Global Assessment of Functioning (GAF) scale. Participants who met both symptomatic and functional remission criteria at either the 3- or 6-month assessment were classified as ECR, provided that this recovery was confirmed as stable at 1 year. Participants not meeting these criteria were classified as non-ECR. No specific class imbalance correction methods were applied, because the outcome prevalence in the development cohort was 73.73% ECR v. 26.27% non-ECR.
The outcome in PAFIP was assessed at 3 months only, because the cohort schedule lacks a 6-month follow-up. Because PANSS scores were not available for all participants, symptomatic remission was defined using the following items: from the Scale for the Assessment of Positive Symptoms (SAPS), item 20 (Global Rating of Delusions); from the Scale for the Assessment of Negative Symptoms (SANS), item 9 (Poverty of Speech), item 13 (Global Rating of Alogia), item 18 (Recreational Interests and Activities) and item 22 (Global Rating of Anhedonia/Asociality); and from the Brief Psychiatric Rating Scale, item 4 (Conceptual Disorganisation), item 7 (Mannerisms and Posturing), item 12 (Hallucinatory Behaviour) and item 15 (Unusual Thought Content). Functional remission was defined based on the Disability Assessment Scale (DAS).
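The PANSS- and GAF-based rule used in the development cohort can be written out as a short function. The following Python sketch is illustrative only and is not the authors' R code; the data layout (a dictionary of visits keyed by follow-up month) and all function names are assumptions, as is the reading of 'stable at 1 year' as meeting both criteria again at the 12-month visit.

```python
# Illustrative sketch only; names and data layout are assumptions.
# The eight PANSS items used for symptomatic remission.
REMISSION_ITEMS = ("P1", "P2", "P3", "N1", "N4", "N6", "G5", "G9")


def symptomatic_remission(panss):
    """True if all eight remission items are scored <= 3 (mild or less)."""
    return all(panss[item] <= 3 for item in REMISSION_ITEMS)


def functional_remission(gaf):
    """True if the GAF score is >= 61."""
    return gaf >= 61


def meets_criteria(visit):
    """Both symptomatic and functional remission at a single visit."""
    return symptomatic_remission(visit["panss"]) and functional_remission(visit["gaf"])


def is_ecr(visits):
    """visits maps follow-up month (3, 6, 12) to {"panss": {...}, "gaf": int}.

    Assumption: 'confirmed as stable at 1 year' is read as also meeting
    both criteria at the 12-month visit.
    """
    early = any(m in visits and meets_criteria(visits[m]) for m in (3, 6))
    return early and 12 in visits and meets_criteria(visits[12])
```

Under these assumptions, a participant who first meets both criteria at 6 months and still meets them at 12 months is classified as ECR, whereas one who meets them at 3 months but not at 12 months is not.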
Candidate predictor variables
A total of 47 baseline variables (Table 1) and 87 genetic variables (Supplementary Table 3) were initially considered as potential predictors (i.e. prognostic factors) of ECR. All candidate predictors were selected based on their availability in the development data-set, and on previous research and consensus among 11 psychiatrists involved in the FarmaPRED study using the Delphi method. Reference Mas, Julià, Cuesta, Crespo-Facorro, Vázquez-Bourgon and Spuch18
Table 1 Summary of clinical characteristics of the development cohort, stratified by early clinical recovery

ECR, early clinical recovery; PAS, Premorbid Adjustment Scale; DUP, duration of untreated psychosis; DTP, duration of treated psychosis; PANSS, Positive and Negative Syndrome Scale; YOUNG, Young Mania Rating Scale; MADRS, Montgomery–Åsberg Depression Rating Scale; GEOPTE, GEOPTE Scale of Social Cognition for Psychosis; GAF, Global Assessment of Functioning; TMT, trail-making test; CPT, continuous performance test; WCST, Wisconsin card-sorting test. Values are mean (standard deviation) unless stated otherwise. Categorical variables are compared using Fisher’s exact test, with continuous variables compared using the Wilcoxon rank-sum test.
Candidate baseline predictors included sociodemographic variables (age, gender, ethnicity, socioeconomic status, immigration, urbanicity, toxic habits, Premorbid Adjustment Scale); environmental variables (family psychiatric history, traumatic experiences, winter birth); clinical variables (duration of untreated psychosis (DUP) and duration of treated psychosis (DTP), both in days; insight (PANSS item G12); PANSS; Young Mania Rating Scale; Montgomery–Åsberg Depression Rating Scale; GAF); and cognitive variables (cognitive reserve; animal fluency test; trail-making test (TMT); Conners continuous performance test; digit span and letter–number sequencing subtests of the Wechsler Adult Intelligence Scale (WAIS-III); Verbal learning test Spain Complutense for Adults; and Wisconsin card-sorting test (WCST-perseverative response)).
Genetic variables included PRSs related to psychopathology, cognitive function, toxic habits, personality traits and immunological traits. A complete description of the protocol can be found elsewhere. Reference Segura, Serna, Sugranyes, Baeza, Valli and Martínez-Serrano22 All PRSs were standardised.
To characterise and compare the baseline clinical profile of individuals by ECR status, we computed descriptive statistics for all candidate predictors. For continuous variables we reported the mean and standard deviation, and class comparisons between ECR and non-ECR were performed using the Wilcoxon rank-sum test. For categorical variables, absolute and relative frequencies were calculated, with class differences assessed using Fisher’s exact test. Statistical significance was set at P < 0.05. This analysis was conducted separately for the development and validation cohorts.
To ensure comparability between the development and validation cohorts, we applied systematic data harmonisation to address differences in measurement instruments and variable ranges. DUP was defined similarly in both cohorts as the time between symptom onset and antipsychotic initiation but, because extreme values were present in PAFIP, individuals with DUP >912 days were excluded. Insight was assessed using different instruments across cohorts, each measuring the same construct (PANSS item G12 in PEPs and Scale Unawareness of Mental Disorders in PAFIP). Scores were rescaled to a common range of 1–7 using linear interpolation. Functioning was measured with GAF (range 0–100) in PEPs, and with DAS in PAFIP. DAS scores were inverted and linearly interpolated to approximate GAF scores. Cognitive reserve was summarised in PEPs via principal component analysis (PCA) on estimated IQ (evaluated with the vocabulary subtest of WAIS), education (years) and premorbid adjustment (childhood and early adolescence). For PAFIP, variables were standardised using PEPs parameters and projected onto PCA loadings from the development cohort. Executive function was assessed using different instruments: WCST in PEPs and TMT-B in PAFIP. Scalar values were normalised to z-scores using values from control populations from the PEPs and PAFIP studies, respectively.
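The harmonisation steps in this paragraph reduce to three simple operations: linear rescaling between ranges (optionally inverted), z-scoring against a reference group, and projecting standardised variables onto fixed PCA loadings. The Python sketch below is a hedged illustration, not the authors' R implementation; the function names are hypothetical, the source ranges passed by a caller (for example, the range of the insight or DAS scale) are assumptions that would need to be taken from each instrument, and the projection assumes that the means, standard deviations and loadings all come from the cohort on which the PCA was fitted.

```python
# Illustrative sketch only; function names and argument layout are assumptions.

def rescale(x, src_min, src_max, dst_min, dst_max, invert=False):
    """Linearly map x from [src_min, src_max] onto [dst_min, dst_max].

    With invert=True the direction is flipped, so that a high source score
    maps to a low target score (e.g. a disability score onto a functioning scale).
    """
    if src_max == src_min:
        raise ValueError("source range must have non-zero width")
    frac = (x - src_min) / (src_max - src_min)
    if invert:
        frac = 1.0 - frac
    return dst_min + frac * (dst_max - dst_min)


def z_score(x, ref_mean, ref_sd):
    """Standardise x against a reference (e.g. control) mean and SD."""
    return (x - ref_mean) / ref_sd


def project_component(values, fit_means, fit_sds, fit_loadings):
    """Score on one principal component using parameters from the fitting cohort.

    Each variable is standardised with the fitting-cohort mean and SD and then
    weighted by the corresponding loading.
    """
    return sum(
        w * (v - m) / s
        for v, m, s, w in zip(values, fit_means, fit_sds, fit_loadings)
    )
```

For instance, `rescale(3, 1, 5, 1, 7)` maps the midpoint of a hypothetical 1–5 scale to the midpoint of the 1–7 range, and `rescale(0, 0, 5, 0, 100, invert=True)` maps the lowest score on a hypothetical 0–5 disability scale to 100 on a GAF-like scale.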
For reduction of dimensionality, avoidance of overfitting and selection of the most prognostically informative variables, we applied penalised logistic regression using the least absolute shrinkage and selection operator (LASSO). The model was trained on the development set using tenfold cross-validation, and the regularisation parameter (λ) was chosen according to the one-standard-error rule, selecting the most parsimonious model whose performance was within one standard error of the model with the highest cross-validated area under the receiver operating characteristic (ROC) curve (AUC). Distributions of the selected predictor variables across cohorts are visualised in Supplementary Fig. 1.
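The one-standard-error rule can be stated compactly once the cross-validated results are available. The sketch below assumes that a mean AUC and its standard error have already been computed for each candidate λ (in the study this was done in R with glmnet); the function name and the toy numbers are hypothetical and serve only to show the selection logic.

```python
# Illustrative sketch only; inputs are assumed to come from a prior
# cross-validation run, and the numbers in the usage note are made up.

def lambda_one_se(lambdas, mean_auc, se_auc):
    """Return the largest lambda whose mean CV AUC is within one SE of the best.

    A larger lambda means a stronger LASSO penalty and therefore a sparser,
    more parsimonious model.
    """
    best = max(range(len(lambdas)), key=lambda i: mean_auc[i])
    cutoff = mean_auc[best] - se_auc[best]
    eligible = [lam for lam, auc in zip(lambdas, mean_auc) if auc >= cutoff]
    return max(eligible)
```

With hypothetical inputs `lambdas = [0.01, 0.03, 0.07, 0.15]`, `mean_auc = [0.78, 0.80, 0.79, 0.70]` and `se_auc = [0.02, 0.02, 0.02, 0.03]`, the best AUC is at λ = 0.03, but λ = 0.07 is still within one standard error and is therefore the value selected.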
Missing data
Missing data in the development data-set were handled using multiple imputation by chained equations (MICE). Variables with more than 30% missingness were excluded from the analysis. The outcome variable (ECR) was defined prior to imputation, and none of the variables required to determine ECR were imputed. The imputation model assumed a missing-at-random mechanism and contained all other candidate predictors. A total of five imputed data-sets were generated, based on the proportion of incomplete cases among variables with missing values. Initial model fitting was performed on the first imputed data-set, followed by sensitivity analysis to evaluate the robustness and consistency of model estimates across the other imputed data-sets. No imputation procedures were applied to the external validation data-set. To assess potential selection bias arising from the complete-case approach in this cohort, we compared baseline characteristics between included and excluded participants.
Model development
Three predictive models were developed based on the type of prognostic factors (model type): (a) a clinical model, using predictors selected by LASSO from the 47 baseline clinical variables; (b) a genetic model, using LASSO-selected PRSs among the 87 available; and (c) a combined model, integrating selected clinical and genetic predictors. No additional transformations were applied beyond those described in the Candidate predictor variables section. For each model type, we applied four model-building approaches (algorithms): one classical statistical method (logistic regression) and three machine learning approaches (the naive Bayes classifier (NBC), gradient-boosting machine (GBM) and support vector machine (SVM)). To ensure terminological clarity, each trained model is referred to using the format (model type)-(algorithm) throughout the manuscript (e.g. clinical-logistic regression, genetic-GBM, combined-SVM).
Models were fitted using the entire development data-set (PEPs cohort). Internal validation was performed using bootstrap resampling to estimate optimism-adjusted performance metrics. Hyperparameters for the machine learning models were optimised using grid search and tenfold cross-validation, selecting the configuration that maximised prediction accuracy.
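Optimism-adjusted performance from bootstrap resampling is commonly computed as the apparent performance minus the average optimism, where the optimism of each bootstrap replicate is the performance of the bootstrap-fitted model on its own sample minus its performance on the original data. The Python sketch below illustrates that general procedure with a deliberately trivial one-parameter 'model'; it is an assumption that the study followed this exact variant, and `fit_cutoff`, `accuracy` and the toy data are hypothetical stand-ins for the real models and metrics.

```python
# Illustrative sketch only; not the authors' R code.
import random


def optimism_corrected(data, fit, score, n_boot=200, seed=0):
    """Bootstrap optimism correction.

    fit(data) returns a model; score(model, data) returns a performance value.
    """
    rng = random.Random(seed)
    apparent = score(fit(data), data)
    optimism = 0.0
    for _ in range(n_boot):
        boot = [rng.choice(data) for _ in data]   # resample with replacement
        model = fit(boot)                         # refit on the bootstrap sample
        optimism += score(model, boot) - score(model, data)
    return apparent - optimism / n_boot


def accuracy(cutoff, rows):
    """Share of rows (x, y) where the rule 'x >= cutoff' matches the label y."""
    return sum((x >= cutoff) == bool(y) for x, y in rows) / len(rows)


def fit_cutoff(rows):
    """Toy model: the observed x value that maximises accuracy as a cut-off."""
    candidates = sorted({x for x, _ in rows})
    return max(candidates, key=lambda c: accuracy(c, rows))


toy = [(0.2, 0), (0.4, 0), (0.5, 1), (0.7, 0), (0.9, 1), (1.1, 1)]
corrected = optimism_corrected(toy, fit_cutoff, accuracy)
```

The same wrapper would accept any fitting routine and any metric (for example AUC), which is why the fitting and scoring steps are passed in as functions.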
The models output the predicted probability of ECR for each patient. In all models, the predicted probabilities were converted into binary classifications (ECR versus non-ECR) using the optimal probability threshold determined via the Youden index, which maximises the sum of sensitivity and specificity.
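The Youden index is J = sensitivity + specificity − 1, and the optimal threshold is the cut-off that maximises J. A minimal Python sketch of that search is given below; it is an illustration rather than the authors' implementation, and it assumes that a patient is classified as ECR when the predicted probability is greater than or equal to the threshold and that candidate thresholds are the observed predicted probabilities.

```python
# Illustrative sketch only; the '>=' convention and the candidate
# thresholds are assumptions.

def youden_threshold(y_true, y_prob):
    """Return (threshold, J) maximising sensitivity + specificity - 1."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both outcome classes are required")
    best_t, best_j = None, float("-inf")
    for t in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:          # ties keep the lowest threshold
            best_t, best_j = t, j
    return best_t, best_j
```

Because the Youden index weights sensitivity and specificity equally, the resulting threshold need not be 0.5 and will generally depend on the outcome prevalence and on the model.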
Performance evaluation
Model performance was evaluated on the development data-set across three complementary dimensions: discrimination, calibration and clinical utility. External validation was performed on the validation data-set following the same evaluation procedures.
Discrimination was assessed using AUC. Confidence intervals (95%) for AUC were computed using 2000 stratified bootstrap resamples. In addition, standard classification metrics derived from the confusion matrix were calculated at the optimal threshold; these included accuracy, sensitivity and specificity. Calibration plots were generated to assess the alignment of observed proportion of ECR and predicted probabilities. The calibration curves were smoothed with locally estimated scatterplot smoothing.
To compare model performance across model types, we performed pairwise comparisons of AUC values using the bootstrap test for the difference in AUCs. Selection of the final model was based on a combination of discrimination, calibration and parsimony criteria.
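The discrimination analyses described above (AUC, stratified bootstrap confidence intervals and a bootstrap test for the difference between two AUCs) can be sketched in a few lines. The Python code below is a hedged illustration, not the authors' R code: it uses percentile intervals, and it assumes one common form of the bootstrap test in which the observed AUC difference is divided by the standard deviation of the bootstrap differences and compared with a standard normal distribution. All function names are hypothetical.

```python
# Illustrative sketch only; the percentile CI and the z-type bootstrap
# test are assumptions about the exact variant used.
import random
from statistics import NormalDist, stdev


def auc(y, s):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [v for v, label in zip(s, y) if label == 1]
    neg = [v for v, label in zip(s, y) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_indices(y, rng):
    """Resample indices with replacement within each outcome class."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    return [rng.choice(pos) for _ in pos] + [rng.choice(neg) for _ in neg]


def auc_ci(y, s, n_boot=2000, alpha=0.05, seed=0):
    """Percentile confidence interval for the AUC from stratified bootstrap resamples."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        idx = stratified_indices(y, rng)
        stats.append(auc([y[i] for i in idx], [s[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * (n_boot - 1))]
    hi = stats[int(round((1 - alpha / 2) * (n_boot - 1)))]
    return lo, hi


def auc_diff_test(y, s1, s2, n_boot=2000, seed=0):
    """Paired bootstrap test: returns (observed AUC difference, two-sided P value)."""
    rng = random.Random(seed)
    observed = auc(y, s1) - auc(y, s2)
    diffs = []
    for _ in range(n_boot):
        idx = stratified_indices(y, rng)   # same resample for both models
        yb = [y[i] for i in idx]
        diffs.append(auc(yb, [s1[i] for i in idx]) - auc(yb, [s2[i] for i in idx]))
    sd = stdev(diffs)
    if sd == 0:
        return observed, (1.0 if observed == 0 else 0.0)
    z = observed / sd
    return observed, 2 * (1 - NormalDist().cdf(abs(z)))
```

Stratifying the resampling by outcome keeps the number of ECR and non-ECR cases fixed in every replicate, so the AUC is always defined; using the same resampled indices for both models preserves the pairing between their predictions.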
To assess the clinical utility of the best-performing model, we applied decision curve analysis (DCA). Net benefit curves were plotted across a range of threshold probabilities, comparing the model strategy against the default strategies of ‘treat all’ and ‘treat none’. This method quantifies the trade-off between true positive identifications and potential harms due to false positives. In this context, predicted probabilities were inverted so that the event corresponded to non-ECR, which is the clinically undesirable outcome.
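The net benefit underlying a decision curve has a simple closed form: at threshold probability p_t, net benefit = TP/n − (FP/n) × p_t/(1 − p_t), with 'treat none' fixed at zero and 'treat all' obtained by flagging every patient. The Python sketch below illustrates this calculation with non-ECR as the event, as in the text; it is not the authors' code, and the function names are hypothetical.

```python
# Illustrative sketch only; function names are assumptions.

def non_ecr_risk(p_ecr):
    """Invert predicted ECR probabilities so that the event is non-ECR."""
    return [1.0 - p for p in p_ecr]


def net_benefit(y_event, risk, pt):
    """Net benefit of flagging patients whose predicted risk is >= pt."""
    n = len(y_event)
    tp = sum(1 for y, r in zip(y_event, risk) if r >= pt and y == 1)
    fp = sum(1 for y, r in zip(y_event, risk) if r >= pt and y == 0)
    return tp / n - fp / n * pt / (1.0 - pt)


def net_benefit_treat_all(y_event, pt):
    """Net benefit of flagging everyone; 'treat none' is always 0."""
    prev = sum(y_event) / len(y_event)
    return prev - (1.0 - prev) * pt / (1.0 - pt)
```

Plotting `net_benefit` against a grid of `pt` values, alongside `net_benefit_treat_all` and the zero line, gives the decision curve; the model is useful at a given threshold only where its curve lies above both default strategies.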
Finally, to enhance model interpretability, we computed Shapley additive explanations (SHAP) values for the best-performing model using the development cohort. These values quantify the contribution of each predictor to the model’s prediction at the individual level, and help identify which features most strongly influenced model decisions.
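For a logistic regression model, SHAP values have a closed form on the log-odds scale if the predictors are treated as independent: the contribution of predictor j for a given patient is β_j × (x_j − mean of x_j in the background data). The Python sketch below illustrates this special case; it is an assumption that it matches the SHAP variant and output scale used in the study, and the function names and any coefficients passed to them are hypothetical.

```python
# Illustrative sketch only; assumes independent predictors and values
# expressed on the log-odds scale.

def linear_shap(coefs, x, background_means):
    """Per-predictor contributions for one patient."""
    return [b * (xi - mu) for b, xi, mu in zip(coefs, x, background_means)]


def logit(intercept, coefs, x):
    """Linear predictor (log-odds) of the logistic regression model."""
    return intercept + sum(b * xi for b, xi in zip(coefs, x))


def mean_abs_shap(coefs, rows):
    """Mean absolute contribution of each predictor across all patients."""
    n = len(rows)
    means = [sum(r[j] for r in rows) / n for j in range(len(coefs))]
    totals = [0.0] * len(coefs)
    for r in rows:
        for j, phi in enumerate(linear_shap(coefs, r, means)):
            totals[j] += abs(phi)
    return [t / n for t in totals]
```

By construction, the contributions for one patient sum to the difference between that patient's log-odds and the log-odds at the background means, which is the additivity property that makes SHAP values interpretable at the individual level.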
Software and code
All analyses were conducted using R version 4.5.0 (R Foundation for Statistical Computing, Vienna, Austria; https://www.R-project.org/). Key packages included the following: mice Reference Buuren and Groothuis-Oudshoorn23 for MICE, glmnet Reference Friedman, Hastie and Tibshirani24 for LASSO, rms Reference Harrell25 for logistic regression modelling and caret Reference Kuhn26 for training and tuning of machine learning algorithms. The full analysis code is publicly available in the following GitHub repository: https://github.com/laurajuliamelis/ECR-prediction-models. Additionally, a user-friendly online tool has been developed to facilitate individual prediction of ECR using the final model; this tool is accessible at https://farmapred-pep.shinyapps.io/predictECR/. Due to privacy and ethical restrictions, individual-level patient data are not publicly available.
Results
Study cohorts
In PEPs, 236 out of 335 (70.4%) had complete data; of these, 174 (73.7%) achieved ECR during the first year. In PAFIP, 451 out of 668 (67.5%) had complete data and 272 (60.3%) met ECR criteria at 3 months. Supplementary Table 4 shows that no clinically meaningful differences were detected between included and excluded patients in the PEPs cohort. However, excluded patients from the PAFIP cohort showed longer DUP, worse functioning and lower insight, suggesting that they may represent more severe cases who did not complete follow-up. The flow of participants and final sample sizes included in each analysis are detailed in Fig. 1.

Fig. 1 Flowchart for model development and external validation study populations. PEPs, primeros episodios psicóticos; PAFIP, Programa Asistencial a las Fases Iniciales de Psicosis; ECR, early clinical recovery; PAS, Premorbid Adjustment Scale.
Table 1 summarises baseline clinical characteristics of the development cohort, overall and stratified by ECR status. Significant differences between ECR and non-ECR groups were observed in several variables. ECR patients had longer DTP (mean 62 v. 32 days, P = 0.034) and better insight (mean 2.7 v. 3.3, P = 0.020); their DUP was also shorter (mean 92 v. 141 days), although this difference did not reach the P < 0.05 threshold (P = 0.059). They also presented better premorbid adjustment scores in early adolescence (mean 7.8 v. 10.0, P = 0.009), late adolescence (mean 8.9 v. 11.0, P = 0.049) and adulthood (mean 18.4 v. 22.0, P = 0.029), alongside higher estimated premorbid IQ (mean 88.9 v. 85.7, P = 0.022). Cognitive markers differed markedly, with ECR individuals showing higher cognitive reserve composite scores (mean 0.2 v. −0.4, P = 0.002). In terms of functioning, ECR individuals exhibited significantly higher GAF scores (mean 53.4 v. 41.6, P < 0.001).
In the validation cohort (Table 2), patients who achieved ECR had significantly shorter DUP, better functioning scores, higher cognitive reserve and better insight compared with non-ECR patients (all P < 0.05). No significant differences were observed for DTP, executive function or any of the PRSs.
Table 2 Clinical characteristics and polygenic risk scores of the validation cohort (PAFIP), stratified by early clinical recovery

PAFIP, Programa Asistencial a las Fases Iniciales de Psicosis; ECR, early clinical recovery; DUP, duration of untreated psychosis; DTP, duration of treated psychosis; ASD, autism spectrum disorder; CP, cognitive performance; CPD, cigarettes per day; EA, educational attainment; IL-16, interleukin 16; MIF, macrophage migration inhibitory factor. Values are mean (standard deviation). Variables are compared using Student’s t-test or the Wilcoxon rank-sum test, as appropriate.
Model development
We generated five imputed data-sets using MICE. As shown in Supplementary Fig. 2, missingness was generally low for key variables and visual inspection of imputed versus observed distributions indicated plausible imputations with preserved data structure.
LASSO logistic regression was used to select the most predictive variables. For the clinical model (includes sociodemographic, clinical, functional and cognitive variables), the selected penalisation parameter (λ 1s.e. = 0.0717) retained 6 variables: DUP, DTP, functioning (GAF score), cognitive reserve, insight and executive function. For the genetic model, 7 PRSs were retained at λ 1s.e. = 0.0716: autism spectrum disorder, cognitive performance, cigarettes per day, educational attainment, IQ, macrophage migration inhibitory factor and interleukin 16. Final hyperparameter settings for each model type and algorithm are detailed in Supplementary Table 5.
Model evaluation
Figure 2 presents ROC curves and key performance metrics (accuracy, sensitivity, specificity) for each of the 12 models, combining 3 model types (clinical, genetic, combined) and 4 algorithms (logistic regression, NBC, GBM, SVM). All metrics were computed at the optimal classification threshold derived using the Youden index. When comparing optimism-corrected AUC (AUCc), we observed considerable overfitting in GBM models, with AUC reductions ranging from 0.10 to 0.14 (e.g. genetic-GBM, AUC = 0.92, AUCc = 0.79). NBC models also showed performance degradation following correction, with AUC–AUCc differences between 0.08 and 0.12. Logistic regression and SVM algorithms showed better robustness to overfitting. The best-performing model in terms of discrimination was combined-SVM (AUCc = 0.80), followed closely by combined-logistic regression (AUCc = 0.79) and clinical-logistic regression (AUCc = 0.73). Genetic-only models consistently yielded the lowest AUCs across all algorithms. Supplementary Fig. 3 displays AUC distributions across imputations for each model, confirming robustness to imputation variability.

Fig. 2 Discriminative performance of the 12 models in the development cohort. Each panel displays the receiver operating characteristic (ROC) curve for one algorithm (logistic regression, naive Bayes classifier, gradient-boosting machine and support vector machines), with separate curves for each model type (clinical, genetic, combined). Summary metrics (accuracy, sensitivity, specificity) are shown within each panel and were computed at the optimal probability threshold based on the Youden Index. AUC, area under the ROC curve; AUCc, optimism-corrected AUC, estimated via bootstrap resampling.
Figure 3 shows the smoothed calibration plots for the three logistic regression models. Both clinical-logistic regression (slope 1.00, 95% CI: 0.65–1.35) and combined-logistic regression (slope 1.00, 95% CI: 0.67–1.33) demonstrated generally good alignment of observed and predicted probabilities, with some over-prediction in those with predicted probabilities <0.3 in the clinical-logistic regression model. By contrast, the genetic-logistic regression model showed markedly poorer calibration, particularly at lower predicted probabilities, where sparse data contributed to unstable estimates and deviation from the ideal calibration line. Miscalibration was also prominent across models using other algorithms, as detailed in Supplementary Fig. 4.

Fig. 3 Calibration plots for logistic regression models in the development data-set, stratified by model type (clinical, genetic, combined). The diagonal line represents perfect calibration, with the locally estimated scatterplot smoothing curve showing the model’s calibration and 95% confidence interval in grey. The marginal bar plot indicates the distribution of patients with (1) and without (0) the observed outcome across deciles of predicted probability of early clinical recovery.
Among the logistic regression models, combined-logistic regression outperformed both clinical and genetic types in AUC (combined v. clinical: 0.856 v. 0.761, P = 0.06; combined v. genetic: 0.856 v. 0.729, P = 0.02), although the former comparison did not reach statistical significance.
External validation was assessed in the PAFIP cohort, and the results are shown in Supplementary Fig. 5. All logistic regression models showed reduced discrimination performance in the PAFIP cohort compared with PEPs, with clinical-logistic regression achieving the highest AUC (0.63). Calibration for these models is depicted in Supplementary Fig. 6.
Based on internal discrimination, calibration and parsimony criteria, we selected the clinical-logistic regression model for further interpretation and clinical translation.
Clinical-logistic regression model
To improve model interpretability, Fig. 4(a) displays SHAP values for the clinical-logistic regression model; functioning, executive function and DTP were the most influential predictors. Colour gradients indicate the direction of the effect: higher values of functioning (mean 0.065), executive performance (mean 0.058), DTP (mean 0.048) and cognitive reserve (mean 0.036) increased the likelihood of ECR, whereas higher DUP values (mean 0.042) and higher insight scores (mean 0.041; higher scores on PANSS item G12 indicate poorer insight) were associated with a decreased probability. Exponentiated coefficients and associated statistics for the clinical-logistic regression model are provided in Supplementary Table 6.

Fig. 4 (a) Shapley additive explanations (SHAP) summary plot for the clinical-logistic regression model, showing the contribution of each predictor to model output across all individuals. Each dot represents a patient, coloured by the actual value of the predictor variable (purple (dark blue in print version), high; yellow (light blue in print version), low). Predictors are ordered by their impact on model predictions. The horizontal bars on the right indicate each variable’s mean SHAP value, expressed as a percentage of the total contribution across all predictors. (b) Decision curve analysis for the clinical-logistic regression (LR) model, showing the net benefit of using the model to identify patients unlikely to experience early clinical recovery, across a range of threshold probabilities. The model’s net benefit curve (light blue) is compared against default strategies (‘treat all’ and ‘treat none’). DTP, duration of treated psychosis; DUP, duration of untreated psychosis.
Finally, Fig. 4(b) shows the DCA for the clinical-logistic regression model. This model yielded superior net benefit compared with ‘treat all’ or ‘treat none’ strategies across a wide range of threshold probabilities (particularly 0.1–0.8), supporting its potential clinical utility.
Discussion
In this study, we developed and validated predictive models for ECR in patients with FEP. Models were derived using both regression (logistic regression) and machine learning techniques in a large, naturalistic multicentre cohort (PEPs), and externally validated in an independent cohort (PAFIP). We tested various sets of predictors, including clinical and genetic variables. Overall, regression models using clinical and cognitive variables offered the best performance and generalisability.
The predictive performance of our best model (AUC = 0.76, AUCc = 0.73) is consistent with previous studies predicting remission outcomes (AUC range 0.70–0.74). Reference Leighton, Krishnadas, Chung, Blair, Brown and Clark12–Reference Soldatos, Cearns, Nielsen, Kollias, Xenaki and Stefanatou14 As observed in prior research involving multi-site prediction models, a drop in performance occurred when transitioning from the development to the validation data-set: the external validation performance in the PAFIP data-set decreased to an AUC of 0.63. This reduction may be attributable to heterogeneity between cohorts, including differences in inclusion/exclusion criteria, diagnostic distributions and geographic settings.
Notably, PEPs is a multicentre study spanning multiple regions in Spain that included both affective and non-affective psychoses and younger participants, and it used more inclusive recruitment criteria than PAFIP, which focused on non-affective psychosis and recent-onset cases from Cantabria. Differences in follow-up timing (6 versus 3 months), symptom/functioning measures (PANSS/GAF versus SAPS–SANS/DAS) and the tools used to assess insight and cognition introduced additional variability. Despite this, our models remained robust during external validation, performing significantly above chance levels. This supports their broader relevance across heterogeneous clinical populations.
Our findings characterise the subgroup of the FEP population with a better prognosis for ECR: those with good premorbid functioning, a favourable clinical profile (shorter DUP, longer DTP and better insight) and stronger cognitive functioning (better executive function and higher cognitive reserve). This profile is in agreement with substantial evidence in the literature Reference Catalan, Richter, Salazar de Pablo, Vaquerizo-Serrano, Mancebo and Pedruzo4,Reference Peralta, García de Jalón, Moreno-Izco, Peralta, Janda and Sánchez-Torres5,Reference Lally, Ajnakina, Stubbs, Cullinane, Murphy and Gaughran7 and with previous multivariate predictors of remission, Reference Leighton, Krishnadas, Chung, Blair, Brown and Clark12–Reference Soldatos, Cearns, Nielsen, Kollias, Xenaki and Stefanatou14 and highlights the central role of cognitive functioning in long-term outcomes. Reference Treen Calvo, Giménez-Donoso, Setién-Suero, Toll Privat, Crespo-Facorro and Ayesa Arriola16 Developing scalable methods to incorporate such measures into routine assessment while minimising clinical burden will be critical for the deployment of precision psychiatry in clinical practice. Reference Meehan, Lewis, Fazel, Fusar-Poli, Steyerberg and Stahl11
Considering the limitations of the present study when interpreting our findings, we suggest that individuals predicted to be at risk of non-ECR may benefit from more intensive and sustained early-intervention care. This could include optimised pharmacological management, increased frequency or duration of psychosocial interventions and targeted cognitive interventions when cognitive difficulties are prominent. Reference Fusar-Poli, Rutigliano, Stahl, Davies, Bonoldi and Reilly3 The intensity and content of care should be tailored to the domains in which the individual shows the greatest need, rather than following a uniform, one-size-fits-all approach. Further research will be needed to explore whether individuals predicted to have poor early recovery could be considered candidates for earlier clozapine intervention. Reference Butler, Stratford and Jauhar27 In addition, an ECR profile may support shared decision-making between psychiatrists and patients regarding dose reduction or treatment discontinuation, as desired by many patients. Reference Sommer, de Beer, Gangadin, de Haan, Veling and van Beveren28
Our findings have significant implications for the organisation of early-intervention services. The current practice in many settings is to offer all individuals with FEP the same fixed period of early-intervention services care, regardless of their clinical trajectory. Introducing flexibility in discharge timing could enable services to redirect intensive resources toward those with more complex or prolonged recovery trajectories – specifically patients who require extended or more intensive care – thereby enhancing overall service efficiency, a crucial consideration given the limited resources highlighted in economic evaluations of early-phase psychosis interventions. Reference Shields, Buck, Varese, Yung, Thompson and Husain29 Analogous to other areas of medicine, where prognostic models reliably stratify patients to guide the intensity and duration of treatment or monitoring, predictive approaches in FEP could inform personalised care pathways. By identifying those with low probability of early recovery or at higher risk of prolonged courses, services could allocate more intensive interventions and longer follow-up.
An important issue raised by the advancement of precision psychiatry concerns the caution required when communicating unfavourable prognostic information or low chances of ECR to individuals with FEP, in order to avoid fixed and pessimistic labels and to prevent unnecessary emotional distress, hopelessness and misunderstanding. Reference Huurman, Pijnenborg, Sportel, van Rijsbergen, Hasson-Ohayon and Boonstra30 Information about illness chronicity and relapse risk should be provided clearly, without implying that recovery is unlikely. This cautious approach is warranted given the substantial heterogeneity of outcomes in FEP, as well as the current limitations of predictive models, which are not yet sufficiently robust or precise to guide individual clinical decisions in isolation. Reference Lee, Leighton, Thomas, Gkoutos, Wood and Fenton31 Importantly, predictive information should be embedded within early-intervention services that emphasise modifiable environmental and clinical factors, including early engagement with treatment, reinforcement of substance use prevention strategies and sustained family support.
In terms of modelling approaches, whereas machine learning is often touted for its capacity to model complex relationships, our results showed no significant performance advantage over traditional regression techniques. The fact that machine learning did not outperform logistic regression here probably stems from the combination of our sample size which, although respectable for this clinical domain, may be below the threshold required for many complex algorithms to reliably extract signal from noise without overfitting, and the inherent heterogeneity of the phenotype. This echoes concerns raised in recent literature about overestimating the utility of machine learning in clinical prediction tasks, particularly with moderate-sized data-sets and low signal-to-noise ratios. Reference Vabalas, Gowen, Poliakoff and Casson32 In such settings, simpler, well-specified regression models can be equally or more performant, a finding consistent with recent comparative studies. Reference Christodoulou, Ma, Collins, Steyerberg, Verbakel and Van Calster33 We ultimately favoured regression models for their greater interpretability, facilitating transparency and clinical usability in contrast to the ‘black-box’ nature of many machine learning approaches. Reference Kundu34 Such transparency is crucial to enhancing clinician and patient engagement with clinical predictive models.
We also evaluated the added value of genetic information using PRS. Several recent studies have explored the integration of PRS with clinical data to improve outcome prediction, under the premise that genetic information could improve clinical models. However, the current discriminatory power of PRS in general populations remains limited. Reference Lewis and Vassos35 The PRS for cognitive performance was the most informative genetic predictor of recovery, consistent with prior findings linking cognitive PRS to relapse and functional outcomes. Reference Segura, Mezquida, Martínez-Pinteño, Gassó, Rodriguez and Moreno-Izco36,Reference Segura, Prohens, Gassó, Rodríguez, Garcia-Rizo and Moreno-Izco37 Nonetheless, although genetic-based models showed strong predictive performance during internal validation, they failed to generalise to the external cohort despite consistent methodologies.
This lack of generalisability may reflect either context-specific gene–environment interactions or phenotypic variability. Increasing evidence suggests that PRS portability is limited not only by ancestry but also by environmental, social and methodological factors. Reference Hui, Dudek, Kiryluk, Walunas, Kullo and Wei38 Advances in PRS technology – combining both common and rare variants, epigenetic factors and context-aware modelling – will be essential to realising its full clinical potential. Reference Zeng, Guo, Liu and Wong39
In addition to its limited generalisability, the genetic-logistic regression model exhibited poor calibration, particularly at the lower end of the predicted probability range. This is likely to have resulted from three interlinked factors: (a) a high baseline probability of recovery, driven by class imbalance in the derivation cohort; (b) weak genetic effects; and (c) the scarcity of individuals with an extreme high-risk genetic profile. These characteristics reflect the current limitations of PRS-only models for individual-level clinical prediction.
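One simple way to inspect calibration across the predicted probability range, including its lower end, is to compare the mean predicted probability with the observed event rate within probability bins. The sketch below is a minimal illustration under stated assumptions: the function name, the equal-width binning and the toy data are hypothetical choices, and it is not the set of calibration metrics reported in the study.

```python
def calibration_table(y_true, y_prob, n_bins=5):
    """Compare mean predicted probability with observed event rate per bin.

    Returns one row per non-empty equal-width bin:
    (bin_lower, bin_upper, n_patients, mean_predicted, observed_rate).
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        if not 0.0 <= p <= 1.0:
            raise ValueError("predicted probabilities must lie in [0, 1]")
        # A probability of exactly 1.0 is placed in the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((y, p))
    rows = []
    for i, members in enumerate(bins):
        if not members:
            continue
        mean_predicted = sum(p for _, p in members) / len(members)
        observed_rate = sum(y for y, _ in members) / len(members)
        rows.append((i / n_bins, (i + 1) / n_bins, len(members),
                     mean_predicted, observed_rate))
    return rows


# Hypothetical toy data, for illustration only.
toy_y = [0, 0, 1, 1, 1, 1]
toy_p = [0.1, 0.15, 0.5, 0.9, 0.95, 0.85]
for lower, upper, n, predicted, observed in calibration_table(toy_y, toy_p, n_bins=2):
    print(f"[{lower:.1f}, {upper:.1f})  n={n}  predicted={predicted:.3f}  observed={observed:.3f}")
```

Bins containing few patients, as would be expected at the low-probability end when most patients recover, give unstable observed rates, which is one reason calibration is hard to assess in that range.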
Among the strengths of this study is the use of large, well-characterised, real-world FEP cohorts with a rich array of clinical, cognitive and genetic data. We followed TRIPOD-AI guidelines, selected predictors based on a priori clinical knowledge, applied coefficient shrinkage and used bootstrapping techniques to reduce overfitting. We reported multiple calibration and performance metrics. Importantly, our models were externally validated using data from a second, independent cohort. Finally, to increase transparency and facilitate future validations, the full analysis code is publicly available on GitHub and a user-friendly online tool has been developed to facilitate individual prediction of ECR using the best-performing model.
Nonetheless, our study has some limitations. We included only self-reported Caucasian participants and did not account for ethnicity as a predictor, limiting the generalisability of our findings. Additionally, the clinical nature of first-episode psychosis cohorts inherently constrains the inclusion of extremely high-risk clinical profiles, such as markedly impaired functioning, very poor insight or psychosis of exceptionally long duration. Patients with more severe presentations are often more difficult to retain in longitudinal studies, as illustrated by the exclusion of participants in the PAFIP cohort. Consequently, the model had limited exposure to very low-recovery phenotypes, which may partly explain the slight deviation in calibration observed in this range. Treatment effects were also not controlled for, due to the naturalistic design and wide variability in treatments used across both cohorts. We assumed that participants had received standard clinical care. Our model focused on the prediction of early clinical recovery (3–6 months to 1 year), which does not necessarily capture medium- or long-term trajectories. A further limitation is the suboptimal calibration of some models (particularly the genetic model), especially in the lower range of predicted probabilities, largely driven by class imbalance and the limited presence of patients with extreme profiles. Finally, unlike clinical symptomatic remission, recovery is a broad and multidimensional construct, making it difficult to establish a uniform definition or standardised measurement criteria. Moreover, recovery status may fluctuate over time, reflecting the dynamic interaction among symptoms, environment and personal factors, which poses challenges for both its longitudinal assessment and comparison of outcomes across studies. Future models should consider standardised outcome definitions, harmonised tools across cohorts and treatment trajectories.
The early identification of individuals at risk of poor recovery may support proactive interventions, tailored treatment plans and resource optimisation. To be translated into practice, prediction tools require prospective validation, integration into clinical decision processes, co-design with end-users and appropriate clinical and regulatory approval. Ongoing efforts, including the FarmaPRED cohort, Reference Mas, Julià, Cuesta, Crespo-Facorro, Vázquez-Bourgon and Spuch18 will be valuable for refining and validating these models. Future work should evaluate dynamic prediction over time, incorporate treatment response and examine extended biomarker integration to improve predictive performance. Additionally, we recommend testing whether machine learning approaches provide greater benefit in predicting ECR in larger, multicentre consortia, where the signal-to-noise ratio may be sufficiently increased to reveal potential advantages over traditional regression methods.
In this study, we developed and validated a prediction model for early clinical recovery in patients with first-episode psychosis using a combination of clinical, cognitive and genetic predictors. The final logistic regression model, based on clinical variables, showed acceptable discrimination and calibration in the development data-set and retained moderate performance during external validation in an independent cohort. Despite methodological differences between cohorts and variation in outcome definitions, the model demonstrated a degree of generalisability, highlighting the potential of early data to inform prognostic stratification. Although the results are promising, further research is needed before implementation in clinical settings. The current model should be viewed as a proof of concept for the use of simple clinical indicators to anticipate early treatment response.
Supplementary material
The supplementary material is available online at https://doi.org/10.1192/bjp.2026.10581
Data availability
The data that support the findings of this study are available from the corresponding author, S.M., upon reasonable request.
Author contributions
The results presented here are part of a broader project, the FarmaPRED study. S.M. is the designer and coordinator of the FarmaPRED study, supervised the statistical analysis, performed interpretation of the results and wrote the first draft of the manuscript. L.J. and V.O.-G.F. performed the statistical analysis and wrote the first draft of the manuscript, and both authors contributed equally to this work. A.G.-S. and M.A. performed bioinformatic analysis of the genetic data. J.L.C. supervised the statistical analysis. C.M.D.-C., I.Z., A.L., A.R., R.A.-A., P.S.P., M.J.-R., M.G.R., C.C., J.U., A.I., C.R.-L.-A., C.S., A.M., A.G.-P., B.C.-F., A.C., M.J.C., S.A. and J.V.-B. participated in the recruitment and assessment of the sample. B.C.-F. and J.V.-B. are the coordinators of the PAFIP cohort. S.A. is the coordinator of the PEPs study. All authors contributed to the final draft of the manuscript.
Funding
This study (no. PMP21/00085) was funded by Instituto de Salud Carlos III and European Union Next Generation EU/Mecanismo para la Recuperación y la Resiliencia/Plan de Recuperación, Transformación y Resiliencia; CERCA Program; Catalan Government, the Secretariat of Universities and Research of the Department of Enterprise and Knowledge (no. 2021 SGR 00672); and Centro de Investigación Biomédica en Red de Salud Mental. The PEPs study was funded by Ministerio de Economía y Competitividad (nos PI08/0208, PI11/00325 and PI14/00612), Instituto de Salud Carlos III – Fondo Europeo de Desarrollo Regional. Union Europea. The PAFIP cohort was supported by Instituto de Investigación Sanitaria Valdecilla (grant nos INT/A20/04 and INT/A21/10) and Instituto de Salud Carlos III (grant nos PI020499, PI050427 and PI060507).
Declaration of interest
C.M.D.-C. has received honoraria and/or travel support from Angelini, Johnson & Johnson and Viatris. A.M. has served as a speaker for, and received financial support for attending meetings and travel from Otsuka, Angelini and Janssen Cilag. A.I. thanks the support of Consorcio Centro de Investigación Biomédica en Red (no. CB/07/09/0025), Instituto de Salud Carlos III, Ministerio de Ciencia e Innovación; the Madrid Regional Government (no. S2022/BMD-7216 (AGES 3-CM)) and European Union Structural Funds; and grant nos PI22/01183 and ICI21/00089, integrated into Plan Nacional de I+D+I and co-financed by ISCIII-Subdirección General de Evaluación and the Fondo Europeo de Desarrollo Regional (FEDER). She has received research support from, or served as speaker or advisor for, Alter, Casen Recordati, Janssen-Cilag, Lundbeck, Otsuka Pharmaceutical SA, Rovi and Viatris, with no financial or other relationship relevant to the subject of this article. C.d.l.C. has received financial support to attend scientific meetings from Esteve, Lundbeck, Rovi, Novartis, Casen Recordati and Janssen, outside the submitted work. She is supported by a grant from the Aragón Government (no. B15_R23). M.J.C. thanks the support of the Government of Navarra (grant nos 17/31 and 18/41) and Carlos III Health Institute (FEDER Funds) from the Spanish Ministry of Economy and Competitiveness (nos PI16/02148, PI19/01698, RD21/0009/0025 and RD24/0003/0015). He has also received financial support as a speaker on his research findings from Alianza Otsuka-Lundbeck and Rovi. All other authors report no biomedical financial interests or potential conflicts of interest.
Transparency declaration
The manuscript is an honest, accurate and transparent account of the study being reported; no important aspect of the study has been omitted, and any discrepancies from the study as planned have been explained.




