Predicting early clinical recovery in first-episode psychosis: development and external validation of a clinically interpretable multivariable model

Laura Julià; Victor Ortiz-García de la Foz; Àlex González-Segura; Maria Alemany; Josep Lluis Carrasco; Covadonga Martinez Díaz-Caneja; Iñaki Zorrilla; Antonio Lobo; Alexandra Roldán; Rosa Ayesa-Arriola; Paula Suárez-Pinilla; María Juncal-Ruiz; Marcos Gómez-Revuelta; Concepción de la Cámara; Judith Usall; Angela Ibañez; Cristina Romero-Lopez-Alberca; Carlos Spuch; Anna Manè; Ana González-Pinto; Benedicto Crespo-Facorro; Ana Catalan; Manuel J. Cuesta; Silvia Amoretti; Javier Vázquez-Bourgon; Sergi Mas

doi:10.1192/bjp.2026.10581

Published online by Cambridge University Press: 23 March 2026

Laura Julià: Affiliation:
Clínic Foundation – August Pi i Sunyer Biomedical Research Institute (IDIBAPS), Barcelona, Spain Department of Basic Clinical Practice, University of Barcelona, Barcelona, Spain
Victor Ortiz-García de la Foz: Affiliation:
Valdecilla Biomedical Research (IDIVAL), Santander, Spain
Àlex González-Segura: Affiliation:
Bipolar and Depressive Disorders Unit, Hospital Clínic of Barcelona, Barcelona, Spain
Maria Alemany: Affiliation:
Translational Psychiatry Group, Institute of Biomedicine of Seville (IBiS) – CSIC, Spain Network Centre for Biomedical Research in Mental Health (CIBERSAM), Carlos III Health Institute, Madrid, Spain
Josep Lluis Carrasco: Affiliation:
Department of Basic Clinical Practice, University of Barcelona, Barcelona, Spain
Covadonga Martinez Díaz-Caneja: Affiliation:
Network Centre for Biomedical Research in Mental Health (CIBERSAM), Carlos III Health Institute, Madrid, Spain Department of Child and Adolescent Psychiatry, Institute of Psychiatry and Mental Health, Gregorio Marañón General University Hospital, IiSGM, Madrid, Spain School of Medicine, Complutense University, Madrid, Spain
Iñaki Zorrilla: Affiliation:
Network Centre for Biomedical Research in Mental Health (CIBERSAM), Carlos III Health Institute, Madrid, Spain BIOARABA, Department Psychiatry, Álava University Hospital, Vitoria, Spain University of the Basque Country (UPV/EHU), Vitoria, Spain
Antonio Lobo: Affiliation:
Network Centre for Biomedical Research in Mental Health (CIBERSAM), Carlos III Health Institute, Madrid, Spain Aragon Health Research Institute (IIS-A), Zaragoza, Spain
Alexandra Roldán: Affiliation:
Network Centre for Biomedical Research in Mental Health (CIBERSAM), Carlos III Health Institute, Madrid, Spain Psychiatry Department, Hospital de la Santa Creu i Sant Pau, Sant Pau Biomedical Research Institute (IIB Sant Pau), Barcelona, Spain
Rosa Ayesa-Arriola: Affiliation:
Valdecilla Biomedical Research (IDIVAL), Santander, Spain Network Centre for Biomedical Research in Mental Health (CIBERSAM), Carlos III Health Institute, Madrid, Spain
Paula Suárez-Pinilla: Affiliation:
Valdecilla Biomedical Research (IDIVAL), Santander, Spain Marqués de Valdecilla University Hospital, Santander, Spain University of Cantabria, Santander, Spain
María Juncal-Ruiz: Affiliation:
Valdecilla Biomedical Research (IDIVAL), Santander, Spain Hospital Sierrallana, Torrelavega, Spain
Marcos Gómez-Revuelta: Affiliation:
Valdecilla Biomedical Research (IDIVAL), Santander, Spain Marqués de Valdecilla University Hospital, Santander, Spain University of Cantabria, Santander, Spain
Concepción de la Cámara: Affiliation:
Network Centre for Biomedical Research in Mental Health (CIBERSAM), Carlos III Health Institute, Madrid, Spain Department of Psychiatry, Zaragoza Clinical University Hospital (HCU-Z), Zaragoza, Spain
Judith Usall: Affiliation:
Sant Joan de Déu Research Institute, Sant Joan de Déu Healthcare Park, Sant Boi de Llobregat, Spain
Angela Ibañez: Affiliation:
Network Centre for Biomedical Research in Mental Health (CIBERSAM), Carlos III Health Institute, Madrid, Spain Department of Psychiatry, Ramón y Cajal University Hospital, University of Alcalá, Ramón y Cajal Health Research Institute (IRYCIS), Madrid, Spain
Cristina Romero-Lopez-Alberca: Affiliation:
Network Centre for Biomedical Research in Mental Health (CIBERSAM), Carlos III Health Institute, Madrid, Spain Department of Psychology, University of Cádiz, Cádiz, Spain
Carlos Spuch: Affiliation:
Network Centre for Biomedical Research in Mental Health (CIBERSAM), Carlos III Health Institute, Madrid, Spain Translational Neuroscience Research Group, Galicia Sur Health Research Institute (IIS-Galicia Sur), SERGAS-UVIGO, Vigo, Spain Addictions Primary Care Research Network (RIAPAD), ISCIII, Spain
Anna Manè: Affiliation:
Network Centre for Biomedical Research in Mental Health (CIBERSAM), Carlos III Health Institute, Madrid, Spain Sant Joan de Déu Research Institute, Sant Joan de Déu Healthcare Park, Sant Boi de Llobregat, Spain
Ana González-Pinto: Affiliation:
Network Centre for Biomedical Research in Mental Health (CIBERSAM), Carlos III Health Institute, Madrid, Spain BIOARABA, Department Psychiatry, Álava University Hospital, Vitoria, Spain University of the Basque Country (UPV/EHU), Vitoria, Spain
Benedicto Crespo-Facorro: Affiliation:
Translational Psychiatry Group, Institute of Biomedicine of Seville (IBiS) – CSIC, Spain Network Centre for Biomedical Research in Mental Health (CIBERSAM), Carlos III Health Institute, Madrid, Spain Hospital Universitario Virgen del Rocío, Sevilla, Spain
Ana Catalan: Affiliation:
Network Centre for Biomedical Research in Mental Health (CIBERSAM), Carlos III Health Institute, Madrid, Spain Department of Psychiatry, Biobizkaia Health Research Institute, Basurto University Hospital, Osakidetza (Basque Health Service), Integrated Healthcare Organisation Bilbao-Basurto, Bilbao, Spain University of the Basque Country (UPV/EHU), Barakaldo, Spain
Manuel J. Cuesta: Affiliation:
Department of Psychiatry, Navarra University Hospital, Pamplona, Spain Navarra Health Research Institute (IdiSNA), Pamplona, Spain
Silvia Amoretti: Affiliation:
Network Centre for Biomedical Research in Mental Health (CIBERSAM), Carlos III Health Institute, Madrid, Spain Psychiatry, Mental Health and Addictions Group, Vall d’Hebron Research Institute (VHIR), Accredited Health Research Institute – Vall d’Hebron University Hospital Research Institute (IR-HUVH), Barcelona, Spain
Javier Vázquez-Bourgon: Affiliation:
Valdecilla Biomedical Research (IDIVAL), Santander, Spain Network Centre for Biomedical Research in Mental Health (CIBERSAM), Carlos III Health Institute, Madrid, Spain Marqués de Valdecilla University Hospital, Santander, Spain University of Cantabria, Santander, Spain
Sergi Mas*: Affiliation:
Clínic Foundation – August Pi i Sunyer Biomedical Research Institute (IDIBAPS), Barcelona, Spain Department of Basic Clinical Practice, University of Barcelona, Barcelona, Spain Network Centre for Biomedical Research in Mental Health (CIBERSAM), Carlos III Health Institute, Madrid, Spain
*: Correspondence: Sergi Mas. Email: sergimash@ub.edu

Abstract

Background

Identifying patients with first-episode psychosis (FEP) who are unlikely to achieve early clinical recovery (ECR) is critical for personalised intervention and resource allocation. ECR – defined as the concurrent achievement of symptomatic and functional remission – represents a clinically meaningful outcome that captures both illness control and functional reintegration.

Aims

To develop and externally validate prediction models for ECR using clinical, cognitive and genetic data.

Method

We analysed two large, independent Spanish cohorts: the primeros episodios psicóticos cohort (N = 335), for model development and internal validation, and the Programa Asistencial a las Fases Iniciales de Psicosis cohort (N = 668), for external validation. Forty-seven baseline clinical and cognitive variables and 87 polygenic risk scores (PRSs) were examined. Predictors were selected using penalised logistic regression. Logistic regression and three machine learning algorithms were compared for discrimination, calibration and clinical utility.

Results

The best-performing model was a logistic regression using six routinely collected clinical and cognitive predictors (duration of untreated psychosis, days of treated psychosis, baseline functioning, insight, executive function and cognitive reserve), with an optimism-corrected area under the receiver operating characteristic curve of 0.73 in development and 0.63 in external validation. PRS models showed limited external generalisability and did not improve prediction. Machine learning algorithms offered no advantage over regression models.

Conclusions

A simple, interpretable logistic regression model based on routine clinical and cognitive variables can predict early recovery in FEP with acceptable generalisability. These findings support the use of transparent, clinically grounded models in early psychosis care and highlight the current limitations of genetic predictors for individualised treatment.

Keywords

First-episode psychosis; early clinical recovery; predictive modelling; machine learning; precision psychiatry

Information

Type: Original Article
Information: The British Journal of Psychiatry, First View, pp. 1–12

DOI: https://doi.org/10.1192/bjp.2026.10581
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright: © The Author(s), 2026. Published by Cambridge University Press on behalf of Royal College of Psychiatrists

There is substantial variability in the efficacy of antipsychotic drugs in the treatment of psychotic disorders. On average, 20–30% of patients show inadequate response to treatment and fewer than 40% achieve symptom remission, with treatment discontinuation rates reaching approximately 70%.^{Reference Wold, Ottesen, Camilla, Johnsen, Lagerberg and Romm1} These factors contribute to relapses that often lead to rehospitalisation, further worsening the prognosis, negatively impacting patients’ quality of life and reducing life expectancy.^{Reference Lisoway, Chen, Zai, Tiwari and Kennedy2} Therefore, recovery, remission and relapse prevention in psychotic disorders have become key areas of interest in psychiatric research.^{Reference Fusar-Poli, Rutigliano, Stahl, Davies, Bonoldi and Reilly3}

Nevertheless, the long-term recovery rates for individuals experiencing first-episode psychosis (FEP) are more favourable, varying between 10 and 35% depending on the definition of recovery and the length of the follow-up period.^{Reference Catalan, Richter, Salazar de Pablo, Vaquerizo-Serrano, Mancebo and Pedruzo4,Reference Peralta, García de Jalón, Moreno-Izco, Peralta, Janda and Sánchez-Torres5} Although the outcome trajectories of individual patients are difficult to predict, clinical development over the first years of treatment is critical for longer-term outcomes. Early identification of individuals at high risk of non-response at the initial clinical encounter may enable timely, personalised interventions, reduce the delay in treatment initiation and optimise resource utilisation, redirecting intensive resources towards those with more complex or prolonged recovery trajectories, ultimately improving patient outcomes and increasing the likelihood of long-term recovery.^{Reference Albert and Weibell6}

Previous studies have identified several predictors of poor clinical, functional and cognitive outcomes, such as poor premorbid adjustment, history of developmental disorders, greater baseline symptom severity and longer duration of untreated psychosis.^{Reference Lally, Ajnakina, Stubbs, Cullinane, Murphy and Gaughran7–Reference Santesteban-Echarri, Paino, Rice, González-Blanch, McGorry and Gleeson9} However, these findings are based on population-level associations and do not necessarily allow accurate prediction at the individual level.

In recent years, multivariable prediction models such as machine learning algorithms have been increasingly used to provide individualised risk estimates. These tools aim to support personalised psychiatry by stratifying the probability of poor outcomes at the time of first clinical contact. Several systematic reviews have catalogued such models for psychosis outcomes, many of which rely on baseline clinical variables.^{Reference Salazar de Pablo, Radua, Pereira, Bonoldi, Arienti and Besana10,Reference Meehan, Lewis, Fazel, Fusar-Poli, Steyerberg and Stahl11}

Among these clinical prediction models in FEP, the most frequently predicted outcome is clinical remission.^{Reference Leighton, Krishnadas, Chung, Blair, Brown and Clark12–Reference Soldatos, Cearns, Nielsen, Kollias, Xenaki and Stefanatou14} However, there is increasing support for incorporating measures of psychosocial functioning and personal recovery to better capture the heterogeneity of FEP outcomes.^{Reference Peralta, García de Jalón, Moreno-Izco, Peralta, Janda and Sánchez-Torres5} In this context, clinical recovery – defined as sustained symptom remission alongside regained functional capacity – has been proposed as a more comprehensive and meaningful outcome.^{Reference Cuesta, Sánchez-Torres, Moreno-Izco, García de Jalón, Gil-Berrozpe and Zarzuela15,Reference Treen Calvo, Giménez-Donoso, Setién-Suero, Toll Privat, Crespo-Facorro and Ayesa Arriola16} Unlike traditional outcome measures that focus narrowly on symptom reduction, recovery reflects a holistic and person-centred approach that better captures the lived experience and long-term adaptation of individuals with schizophrenia. Recently, the Remission in Schizophrenia Working Group criteria for remission, in combination with functional measurements, have been used to define early clinical recovery (ECR) following an FEP as meeting the criteria for remission and regained functioning at 1-year follow-up.^{Reference Wold, Ottesen, Camilla, Johnsen, Lagerberg and Romm1}

To date, most prediction models for FEP outcomes have focused primarily on sociodemographic and clinical variables, often overlooking cognitive functioning and biomarkers. Neurocognitive impairment is strongly associated with poor functional outcomes and reduced likelihood of recovery.^{Reference Treen Calvo, Giménez-Donoso, Setién-Suero, Toll Privat, Crespo-Facorro and Ayesa Arriola16} Polygenic risk scores (PRSs), which aggregate the effects of many genetic variants associated with specific traits or disorders, are among the most promising emerging biomarkers. The integration of PRS into clinical practice is one of the most exciting developments in the advancement of precision psychiatry.^{Reference Wray, Lin, Austin, McGrath, Hickie and Murray17}

In the present study, we aimed to develop and externally validate a clinically applicable prediction model for ECR and evaluate its performance across different modelling approaches, including both regression-based and machine learning techniques. We used data from the two largest longitudinal FEP cohorts in Spain: the primeros episodios psicóticos (PEPs) cohort (from the study ‘Genotype–Phenotype Interaction and Environment: Application to a Predictive Model in First Psychotic Episodes’, N = 335) for model development and internal validation, and the Programa Asistencial a las Fases Iniciales de Psicosis (PAFIP) cohort (from the ‘Clinical Programme on Early Phases of Psychosis’, N = 668) for external validation. We developed prediction models using different sets of predictors: clinical variables (including cognitive measures), genetic data (PRS) and a combination of both.

Method

The study was conducted in accordance with the protocol of the FarmaPRED study (approved by the Clinical Research Ethic Committee of Hospital Clinic Barcelona, approval no. HCB/2022/0079),^{Reference Mas, Julià, Cuesta, Crespo-Facorro, Vázquez-Bourgon and Spuch18} and is reported following the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD + AI) guidelines.^{Reference Collins, Moons, Dhiman, Riley, Beam and Calster19} A detailed description of protocol adherence is provided in Supplementary Table 1 available at https://doi.org/10.1192/bjp.2026.10581. The completed TRIPOD + AI reporting checklist can be found in Supplementary Table 2.

Participants

The development data-set comprised individuals from the PEPs cohort (N = 335), collected as part of the multicentre, naturalistic, longitudinal study ‘Genotype–Phenotype Interaction and Environment: Application to a Predictive Model in First Psychotic Episodes’.^{Reference Bernardo, Bioque, Parellada, Saiz Ruiz, Cuesta and Llerena20} Participants were recruited between April 2009 and April 2011 across 16 clinical centres throughout Spain. Inclusion criteria were as follows: age 7–35 years, psychotic symptoms <12 months, fluency in Spanish and signed informed consent. Exclusion criteria included intellectual disability (DSM-IV), history of head trauma with loss of consciousness and medical/neurological conditions affecting cognition. The study was approved by the Clinical Research Ethic Committee of Hospital Clinic Barcelona (approval no. HCB/2008/4232). All participants provided written informed consent.

The validation data-set consisted of individuals from the PAFIP cohort (N = 668), part of the programme ‘Clinical Programme on Early Phases of Psychosis’^{Reference Crespo-Facorro, de la Foz, Ayesa-Arriola, Pérez-Iglesias, Mata and Suarez-Pinilla21} conducted at University Hospital Marqués de Valdecilla (Santander, Spain). Recruitment occurred between February 2001 and September 2018. Inclusion criteria included: residence in Cantabria, first episode of non-affective psychosis, age 15–60 years and a DSM-IV diagnosis within the schizophrenia spectrum (schizophrenia, schizophreniform, schizoaffective, brief psychotic disorder or psychosis not otherwise specified). Exclusion criteria were substance- or medication-induced psychosis, brain injury or neurological illness, intellectual disability and substance dependence (except nicotine). Diagnoses were confirmed using the Structured Clinical Interview for DSM-IV Axis I Disorders, conducted by a trained research psychiatrist. Informed consent was obtained from participants and their families. The study was approved by the Institutional Review Board of Cantabria (approval no. 2017/247), and complied with international research ethics guidelines.

A complete description of these cohorts can be found elsewhere.^{Reference Bernardo, Bioque, Parellada, Saiz Ruiz, Cuesta and Llerena20,Reference Crespo-Facorro, de la Foz, Ayesa-Arriola, Pérez-Iglesias, Mata and Suarez-Pinilla21}

Early clinical recovery definition

The outcome for this study was ECR, defined as achieving both symptomatic and functional remission within 1 year of study inclusion according to Wold et al.^{Reference Wold, Ottesen, Camilla, Johnsen, Lagerberg and Romm1} Symptomatic remission was defined as a score of ≤3 (mild or less) on the following 8 Positive and Negative Syndrome Scale (PANSS) items: P1 (delusions), P2 (conceptual disorganisation), P3 (hallucinatory behaviour), N1 (blunted affect), N4 (social withdrawal), N6 (lack of spontaneity), G5 (mannerisms/posturing) and G9 (unusual thought content). Functional remission was defined as achieving a score of ≥61 on the Global Assessment of Functioning (GAF) scale. Participants who met both symptomatic and functional remission criteria at either the 3- or 6-month assessment were classified as ECR, provided that this recovery was confirmed as stable at 1 year. Participants not meeting these criteria were classified as non-ECR. No specific class imbalance correction methods were applied, because the outcome prevalence in the development cohort was 73.73% ECR v. 26.27% non-ECR.
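
For illustration only, the outcome rule described above can be written as a short Python sketch. This is not the authors' code (their analyses were run in R); the dictionary layout, the function names and the reading of 'confirmed as stable at 1 year' as 'both criteria still met at the 12-month visit' are assumptions made for the example.

```python
# Illustrative sketch only; data layout and function names are hypothetical.
# Thresholds follow the text: each of the 8 PANSS items <= 3, and GAF >= 61.
REMISSION_ITEMS = ("P1", "P2", "P3", "N1", "N4", "N6", "G5", "G9")


def symptomatic_remission(panss):
    """True if every remission item is scored 3 (mild) or lower."""
    return all(panss[item] <= 3 for item in REMISSION_ITEMS)


def functional_remission(gaf):
    """True if the GAF score is 61 or higher."""
    return gaf >= 61


def meets_both(visit):
    """True if a visit satisfies symptomatic and functional remission."""
    return symptomatic_remission(visit["panss"]) and functional_remission(visit["gaf"])


def is_ecr(visits):
    """visits maps month (3, 6, 12) to {"panss": {...}, "gaf": int}.

    ECR here means: both criteria met at month 3 or month 6, and both
    criteria still met at month 12 (an assumed reading of 'stable at 1 year').
    """
    early = any(m in visits and meets_both(visits[m]) for m in (3, 6))
    return early and 12 in visits and meets_both(visits[12])


# Example with made-up scores
remitted = {"panss": {item: 2 for item in REMISSION_ITEMS}, "gaf": 65}
print(is_ecr({3: remitted, 12: remitted}))
```

A participant with no 12-month visit is returned as non-ECR in this sketch; how such cases were actually handled is not specified here.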

The outcome in PAFIP was assessed at 3 months only, because the cohort schedule lacks a 6-month follow-up. Because PANSS scores were not available for all participants, symptomatic remission was defined using the following items: from the Scale for the Assessment of Positive Symptoms (SAPS), item 20 (Global Rating of Delusions); from the Scale for the Assessment of Negative Symptoms (SANS), item 9 (Poverty of Speech), item 13 (Global Rating of Alogia), item 18 (Recreational Interests and Activities) and item 22 (Global Rating of Anhedonia/Asociality); and from the Brief Psychiatric Rating Scale, item 4 (Conceptual Disorganisation), item 7 (Mannerisms and Posturing), item 12 (Hallucinatory Behaviour) and item 15 (Unusual Thought Content). Functional remission was defined based on the Disability Assessment Scale (DAS).

Candidate predictor variables

A total of 47 baseline variables (Table 1) and 87 genetic variables (Supplementary Table 3) were initially considered as potential predictors (i.e. prognostic factors) of ECR. All candidate predictors were selected based on their availability in the development data-set, and on previous research and consensus among 11 psychiatrists involved in the FarmaPRED study using the Delphi method.^{Reference Mas, Julià, Cuesta, Crespo-Facorro, Vázquez-Bourgon and Spuch18}

Table 1

Summary of clinical characteristics of the development cohort, stratified by early clinical recovery

ECR, early clinical recovery; PAS, Premorbid Adjustment Scale; DUP, duration of untreated psychosis; DTP, duration of treated psychosis; PANSS, Positive and Negative Syndrome Scale; YOUNG, Young Mania Rating Scale; MADRS, Montgomery–Åsberg Depression Rating Scale; GEOPTE, GEOPTE Scale of Social Cognition for Psychosis; GAF, Global Assessment of Functioning; TMT, trail-making test; CPT, continuous performance test; WCST, Wisconsin card-sorting test. Values are mean (standard deviation) unless stated otherwise. Categorical variables are compared using Fisher’s exact test, with continuous variables compared using the Wilcoxon rank-sum test.

Candidate baseline predictors included sociodemographic variables (age, gender, ethnicity, socioeconomic status, immigration, urbanicity, toxic habits, Premorbid Adjustment Scale); environmental variables (family psychiatric history, traumatic experiences, winter birth); clinical variables (duration of untreated psychosis (DUP) and duration of treated psychosis (DTP), both in days; insight (PANSS item G12); PANSS; Young Mania Rating Scale; Montgomery–Åsberg Depression Rating Scale; GAF); and cognitive variables (cognitive reserve; animal fluency test; trail-making test (TMT); Conners continuous performance test; digit span subtest and letter–number sequencing subtest of the Wechsler Adult Intelligence Scale (WAIS-III); Verbal learning test Spain Complutense for Adults; and Wisconsin card-sorting test (WCST-perseverative response)).

Genetic variables included PRSs related to psychopathology, cognitive function, toxic habits, personality traits and immunological traits. A complete description of the protocol can be found elsewhere.^{Reference Segura, Serna, Sugranyes, Baeza, Valli and Martínez-Serrano22} All PRSs were standardised.

To characterise and compare the baseline clinical profile of individuals by ECR status, we computed descriptive statistics for all candidate predictors. For continuous variables we reported the mean and standard deviation, and class comparisons between ECR and non-ECR were performed using the Wilcoxon rank-sum test. For categorical variables, absolute and relative frequencies were calculated, with class differences assessed using Fisher’s exact test. Statistical significance was set at P < 0.05. This analysis was conducted separately for the development and validation cohorts.

To ensure comparability between the development and validation cohorts, we applied systematic data harmonisation to address differences in measurement instruments and variable ranges. DUP was defined similarly in both cohorts as the time between symptom onset and antipsychotic initiation but, because extreme values were present in PAFIP, individuals with DUP >912 days were excluded. Insight was assessed using different instruments across cohorts, each measuring the same construct (PANSS item G12 in PEPs and the Scale to Assess Unawareness of Mental Disorder in PAFIP). Scores were rescaled to a common range of 1–7 using linear interpolation. Functioning was measured with GAF (range 0–100) in PEPs, and with DAS in PAFIP. DAS scores were inverted and linearly interpolated to approximate GAF scores. Cognitive reserve was summarised in PEPs via principal component analysis (PCA) on estimated IQ (evaluated with the vocabulary subtest of WAIS), education (years) and premorbid adjustment (childhood and early adolescence). For PAFIP, variables were standardised using PEPs parameters and projected onto the PCA loadings from the development cohort. Executive function was assessed using different instruments: WCST in PEPs and TMT-B in PAFIP. Scalar values were normalised to z-scores using values from control populations from the PEPs and PAFIP studies, respectively.
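
The harmonisation steps above reduce to three simple operations: a linear rescaling between instrument ranges (optionally inverted), a z-score against control values, and a projection of standardised variables onto a fixed set of loadings. The Python sketch below is illustrative only and is not the authors' R code. The source ranges used in the example calls (1–5 for the insight scale, 0–5 for DAS), the control mean and standard deviation, and the loadings are made-up assumptions, not values taken from the study.

```python
# Illustrative sketch only; all numeric inputs in the examples are hypothetical.

def rescale(value, src_min, src_max, dst_min, dst_max, invert=False):
    """Linearly map value from [src_min, src_max] to [dst_min, dst_max].

    With invert=True the direction of the scale is flipped first, so the
    source maximum maps to the destination minimum.
    """
    frac = (value - src_min) / (src_max - src_min)
    if invert:
        frac = 1.0 - frac
    return dst_min + frac * (dst_max - dst_min)


def z_score(value, control_mean, control_sd):
    """Standardise a raw score against a control population."""
    return (value - control_mean) / control_sd


def project(values, ref_means, ref_sds, loadings):
    """Standardise with reference parameters, then project onto fixed loadings."""
    return sum(w * (v - m) / s
               for v, m, s, w in zip(values, ref_means, ref_sds, loadings))


# Insight: assumed source range 1-5 mapped onto the PANSS G12 range 1-7
print(rescale(3, 1, 5, 1, 7))
# Functioning: assumed DAS range 0-5, inverted and mapped onto GAF 0-100
print(rescale(1, 0, 5, 0, 100, invert=True))
# Executive function: raw score against assumed control mean and SD
print(z_score(60.0, 50.0, 10.0))
```

Keeping the reference means, standard deviations and loadings fixed, rather than re-estimating them in the second cohort, is what makes the projected score comparable across cohorts.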

For reduction of dimensionality, avoidance of overfitting and selection of the most prognostically informative variables, we applied penalised logistic regression using the least absolute shrinkage and selection operator (LASSO). The model was trained on the development set using tenfold cross-validation, and the regularisation parameter (λ) was chosen according to the one-standard-error rule, selecting the most parsimonious model whose performance was within one standard error of the model with the highest cross-validated area under the receiver operating characteristic (ROC) curve (AUC). Distributions of the selected predictor variables across cohorts are visualised in Supplementary Fig. 1.
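
The one-standard-error rule can be made concrete with a minimal Python sketch. It assumes that the cross-validation has already produced, for each candidate λ, a mean AUC and its standard error; the numbers below are invented for the example, and the function name is hypothetical. It is not the authors' implementation, which relied on the glmnet package in R.

```python
# Illustrative sketch only; the cross-validation results are made-up numbers.

def one_se_lambda(cv_results):
    """Pick lambda by the one-standard-error rule.

    cv_results is a list of (lambda, mean_auc, se_auc) tuples. The function
    finds the best mean AUC, subtracts that model's standard error to get a
    cutoff, and returns the largest lambda (strongest penalty, hence the
    sparsest model) whose mean AUC is still at or above the cutoff.
    """
    best = max(cv_results, key=lambda r: r[1])
    cutoff = best[1] - best[2]
    eligible = [r for r in cv_results if r[1] >= cutoff]
    return max(eligible, key=lambda r: r[0])[0]


cv = [
    (0.001, 0.69, 0.02),
    (0.010, 0.72, 0.02),  # highest cross-validated AUC
    (0.050, 0.71, 0.02),  # within one SE of the best, and more penalised
    (0.100, 0.65, 0.02),
]
print(one_se_lambda(cv))  # 0.05
```

The rule trades a small, statistically indistinguishable loss in cross-validated AUC for a model with fewer non-zero coefficients.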

Missing data

Missing data in the development data-set were handled using multiple imputation by chained equations (MICE). Variables with more than 30% missingness were excluded from the analysis. The outcome variable (ECR) was defined prior to imputation, and none of the variables required to determine ECR were imputed. The imputation model assumed a missing-at-random mechanism and contained all other candidate predictors. A total of five imputed data-sets were generated, based on the proportion of incomplete cases among variables with missing values. Initial model fitting was performed on the first imputed data-set, followed by sensitivity analysis to evaluate the robustness and consistency of model estimates across the other imputed data-sets. No imputation procedures were applied to the external validation data-set. To assess potential selection bias arising from the complete-case approach in this cohort, we compared baseline characteristics between included and excluded participants.

Model development

Three predictive models were developed based on the type of prognostic factors (model type): (a) a clinical model, using predictors selected by LASSO from the 47 baseline clinical variables; (b) a genetic model, using LASSO-selected PRSs among the 87 available; and (c) a combined model, integrating selected clinical and genetic predictors. No additional transformations were applied beyond those described in the Candidate predictor variables section. For each model type, we applied four model-building approaches (algorithms): one classical statistical method (logistic regression) and three machine learning approaches, namely the naive Bayes classifier (NBC), gradient-boosting machine (GBM) and support vector machine (SVM). To ensure terminological clarity, each trained model is referred to using the format (model type)-(algorithm) throughout the manuscript (e.g. clinical-logistic regression, genetic-GBM, combined-SVM).

Models were fitted using the entire development data-set (PEPs cohort). Internal validation was performed using bootstrap resampling to estimate optimism-adjusted performance metrics. Hyperparameters for the machine learning models were optimised using grid search and tenfold cross-validation, selecting the configuration that maximised prediction accuracy.

The models output the predicted probability of ECR for each patient. In all models, the predicted probabilities were converted into binary classifications (ECR versus non-ECR) using the optimal probability threshold determined via the Youden index, which maximises the sum of sensitivity and specificity.
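
As a minimal illustration of threshold selection with the Youden index, J = sensitivity + specificity − 1, the Python sketch below scans every observed probability as a candidate cut-off and keeps the one with the largest J. The function name, the convention that a case is classified as ECR when its probability is at or above the threshold, and the toy data are assumptions for the example rather than details taken from the study code.

```python
# Illustrative sketch only; labels and probabilities are toy values.

def youden_threshold(y_true, y_prob):
    """Return (threshold, J) maximising J = sensitivity + specificity - 1.

    y_true holds 1 for ECR and 0 for non-ECR; a case is classified as ECR
    when its predicted probability is >= threshold. Ties in J are resolved
    in favour of the lowest threshold.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


print(youden_threshold([0, 0, 1, 1], [0.2, 0.4, 0.6, 0.8]))  # (0.6, 1.0)
```

Because J weights sensitivity and specificity equally, the resulting cut-off does not depend on outcome prevalence or on the relative cost of the two error types.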

Performance evaluation

Model performance was evaluated on the development data-set across three complementary dimensions: discrimination, calibration and clinical utility. External validation was performed on the validation data-set following the same evaluation procedures.

Discrimination was assessed using AUC. Confidence intervals (95%) for AUC were computed using 2000 stratified bootstrap resamples. In addition, standard classification metrics derived from the confusion matrix were calculated at the optimal threshold; these included accuracy, sensitivity and specificity. Calibration plots were generated to assess the alignment of observed proportion of ECR and predicted probabilities. The calibration curves were smoothed with locally estimated scatterplot smoothing.
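
The discrimination estimate and its interval can be sketched in plain Python. The AUC is computed as the proportion of (ECR, non-ECR) pairs in which the ECR case receives the higher predicted probability, with ties counted as one half; the confidence interval resamples ECR and non-ECR cases separately so that each bootstrap sample keeps the original class sizes. The percentile interval, the fixed seed and the function names are assumptions made for this example and are not taken from the authors' R code.

```python
import random

# Illustrative sketch only; not the authors' implementation.

def auc(y_true, y_prob):
    """AUC as the probability that a positive outranks a negative (ties = 0.5)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def stratified_bootstrap_ci(y_true, y_prob, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the AUC, resampling each class with replacement."""
    rng = random.Random(seed)
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    stats = []
    for _ in range(n_boot):
        bp = rng.choices(pos, k=len(pos))  # resample positives
        bn = rng.choices(neg, k=len(neg))  # resample negatives
        stats.append(auc([1] * len(bp) + [0] * len(bn), bp + bn))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
print(auc(y, p))  # 0.75
print(stratified_bootstrap_ci(y, p))
```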

To compare model performance across model types, we performed pairwise comparisons of AUC values using the bootstrap test for the difference in AUCs. Selection of the final model was based on a combination of discrimination, calibration and parsimony criteria.

To assess the clinical utility of the best-performing model, we applied decision curve analysis (DCA). Net benefit curves were plotted across a range of threshold probabilities, comparing the model strategy against the default strategies of ‘treat all’ and ‘treat none’. This method quantifies the trade-off between true positive identifications and potential harms due to false positives. In this context, predicted probabilities were inverted so that the event corresponded to non-ECR, which is the clinically undesirable outcome.
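
The net benefit underlying a decision curve follows the standard formula NB = TP/n − (FP/n) × pt/(1 − pt), where pt is the threshold probability. The Python sketch below applies it after inverting the ECR probabilities so that the event is non-ECR, as described above. The toy data and function names are assumptions for the example; this is not the authors' code.

```python
# Illustrative sketch only; outcomes and probabilities are toy values.

def net_benefit(y_event, p_event, threshold):
    """Net benefit of intervening when predicted event probability >= threshold."""
    n = len(y_event)
    tp = sum(1 for y, p in zip(y_event, p_event) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_event, p_event) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_event, threshold):
    """Net benefit of intervening in everyone; 'treat none' is 0 by definition."""
    prevalence = sum(y_event) / len(y_event)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


ecr = [1, 1, 1, 0]              # observed ECR status
p_ecr = [0.9, 0.8, 0.6, 0.3]    # predicted probability of ECR

# Invert so that the event is non-ECR, the clinically undesirable outcome
non_ecr = [1 - y for y in ecr]
p_non_ecr = [1.0 - p for p in p_ecr]

print(net_benefit(non_ecr, p_non_ecr, 0.5))      # 0.25
print(net_benefit_treat_all(non_ecr, 0.5))       # -0.5
```

Evaluating both functions over a grid of thresholds and plotting them against a horizontal line at zero ('treat none') yields the decision curve.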

Finally, to enhance model interpretability, we computed Shapley additive explanations (SHAP) values for the best-performing model using the development cohort. These values quantify the contribution of each predictor to the model’s prediction at the individual level, and help identify which features most strongly influenced model decisions.
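
For a logistic regression, SHAP values have a simple closed form if predictors are treated as independent: on the log-odds scale, the contribution of predictor j for a given patient is its coefficient multiplied by the distance of the patient's value from the cohort mean. The Python sketch below illustrates this special case only. The independence assumption, the log-odds scale, the coefficients and the data are all assumptions for the example and are not taken from the study; the authors' actual SHAP computation may differ.

```python
# Illustrative sketch only; coefficients and values are hypothetical.

def linear_shap(coefs, x, feature_means):
    """Per-predictor contributions on the log-odds scale.

    Assumes independent predictors, in which case the SHAP value of
    predictor j is coef_j * (x_j - mean_j).
    """
    return [b * (xi - m) for b, xi, m in zip(coefs, x, feature_means)]


def log_odds(intercept, coefs, x):
    """Linear predictor of a logistic regression."""
    return intercept + sum(b * xi for b, xi in zip(coefs, x))


intercept = -0.2
coefs = [0.5, -1.0]     # hypothetical coefficients for two predictors
means = [1.0, 0.0]      # hypothetical cohort means
patient = [2.0, 1.0]

contributions = linear_shap(coefs, patient, means)
print(contributions)  # [0.5, -1.0]

# Contributions add up to the patient's log-odds minus the log-odds at the means
difference = log_odds(intercept, coefs, patient) - log_odds(intercept, coefs, means)
print(abs(sum(contributions) - difference) < 1e-12)  # True
```

The additivity check in the last lines is the property that makes these values readable as a per-patient decomposition of the prediction.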

Software and code

All analyses were conducted using R version 4.5.0 (R Foundation for Statistical Computing, Vienna, Austria; https://www.R-project.org/). Key packages included the following: mice^{Reference Buuren and Groothuis-Oudshoorn23} for MICE, glmnet^{Reference Friedman, Hastie and Tibshirani24} for LASSO, rms^{Reference Harrell25} for logistic regression modelling and caret^{Reference Kuhn26} for training and tuning of machine learning algorithms. The full analysis code is publicly available in the following GitHub repository: https://github.com/laurajuliamelis/ECR-prediction-models. Additionally, a user-friendly online tool has been developed to facilitate individual prediction of ECR using the final model; this tool is accessible at https://farmapred-pep.shinyapps.io/predictECR/. Due to privacy and ethical restrictions, individual-level patient data are not publicly available.

Results

Study cohorts

In PEPs, 236 out of 335 (70.4%) had complete data; of these, 174 (73.7%) achieved ECR during the first year. In PAFIP, 451 out of 668 (67.5%) had complete data and 272 (60.3%) met ECR criteria at 3 months. Supplementary Table 4 shows that no clinically meaningful differences were detected between included and excluded patients in the PEPs cohort. However, excluded patients from the PAFIP cohort showed longer DUP, worse functioning and lower insight, suggesting that they may represent more severe cases who did not complete follow-up. The flow of participants and final sample sizes included in each analysis are detailed in Fig. 1.

Fig. 1

Flowchart for model development and external validation study populations. PEPs, primeros episodios psicóticos; PAFIP, Programa Asistencial a las Fases Iniciales de Psicosis; ECR, early clinical recovery; PAS, Premorbid Adjustment Scale.

Table 1 summarises baseline clinical characteristics of the development cohort, overall and stratified by ECR status. Differences between ECR and non-ECR groups were observed in several variables. ECR patients tended to have a shorter DUP, although this difference did not reach significance at the 0.05 level (mean 92 v. 141 days, P = 0.059), and had a significantly longer DTP (mean 62 v. 32 days, P = 0.034) and better insight (mean 2.7 v. 3.3, P = 0.020). They also presented better premorbid adjustment scores in early adolescence (mean 7.8 v. 10.0, P = 0.009), late adolescence (mean 8.9 v. 11.0, P = 0.049) and adulthood (mean 18.4 v. 22.0, P = 0.029), alongside higher estimated premorbid IQ (mean 88.9 v. 85.7, P = 0.022). Cognitive markers differed markedly, with ECR individuals showing higher cognitive reserve composite scores (mean 0.2 v. −0.4, P = 0.002). In terms of functioning, ECR individuals exhibited significantly higher GAF scores (mean 53.4 v. 41.6, P < 0.001).

In the validation cohort (Table 2), patients who achieved ECR had significantly shorter DUP, better functioning scores, higher cognitive reserve and better insight compared with non-ECR patients (all P < 0.05). No significant differences were observed for DTP, executive function or any of the PRSs.

Table 2

Clinical characteristics and polygenic risk scores of the validation cohort (PAFIP), stratified by early clinical recovery

PAFIP, Programa Asistencial a las Fases Iniciales de Psicosis; ECR, early clinical recovery; DUP, duration of untreated psychosis; DTP, duration of treated psychosis; ASD, autism spectrum disorder; CP, cognitive performance; CPD, cigarettes per day; EA, educational attainment; IL-16, interleukin 16; MIF, macrophage migration inhibitory factor. Values are mean (standard deviation). Variables are compared using Student’s t-test or the Wilcoxon rank-sum test, as appropriate.

Model development

We generated five imputed data-sets using MICE. As shown in Supplementary Fig. 2, missingness was generally low for key variables and visual inspection of imputed versus observed distributions indicated plausible imputations with preserved data structure.

LASSO logistic regression was used to select the most predictive variables. For the clinical model (which includes sociodemographic, clinical, functional and cognitive variables), the selected penalisation parameter (λ_1s.e. = 0.0717) retained 6 variables: DUP, DTP, functioning (GAF score), cognitive reserve, insight and executive function. For the genetic model, 7 PRSs were retained at λ_1s.e. = 0.0716: autism spectrum disorder, cognitive performance, cigarettes per day, educational attainment, IQ, macrophage migration inhibitory factor and interleukin 16. Final hyperparameter settings for each model type and algorithm are detailed in Supplementary Table 5.
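The subscript in λ_1s.e. refers to the one-standard-error rule: among the penalties tried in cross-validation, choose the largest one whose mean error lies within one standard error of the minimum, which favours a sparser model. A minimal Python sketch of that rule follows, assuming the cross-validation curve has already been computed; the function name and the numbers are hypothetical and do not reproduce the study's cross-validation results.

```python
def lambda_1se(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from a cross-validation curve.

    lambdas: candidate penalty values
    cv_mean: mean cross-validated error (e.g. deviance) at each penalty
    cv_se:   standard error of that mean at each penalty
    """
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    cutoff = cv_mean[i_min] + cv_se[i_min]
    # Largest penalty whose error is still within one s.e. of the minimum.
    eligible = [lam for lam, m in zip(lambdas, cv_mean) if m <= cutoff]
    return lambdas[i_min], max(eligible)


# Invented cross-validation curve, for illustration only.
lambdas = [0.20, 0.10, 0.07, 0.05, 0.02, 0.01]
cv_mean = [1.30, 1.18, 1.12, 1.10, 1.09, 1.11]
cv_se = [0.04, 0.04, 0.04, 0.04, 0.04, 0.04]

lam_min, lam_1se = lambda_1se(lambdas, cv_mean, cv_se)
```

In this toy curve the error is minimised at 0.02, but the rule selects 0.07, trading a small loss in cross-validated fit for stronger shrinkage and fewer retained predictors.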

Model evaluation

Figure 2 presents ROC curves and key performance metrics (accuracy, sensitivity, specificity) for each of the 12 models, combining 3 model types (clinical, genetic, combined) and 4 algorithms (logistic regression, NBC, GBM, SVM). All metrics were computed at the optimal classification threshold derived using the Youden index. When comparing apparent AUC with optimism-corrected AUC (AUC_c), we observed considerable overfitting in GBM models, with AUC reductions ranging from 0.10 to 0.14 (e.g. genetic-GBM, AUC = 0.92, AUC_c = 0.79). NBC models also showed performance degradation following correction, with AUC–AUC_c differences between 0.08 and 0.12. Logistic regression and SVM algorithms showed better robustness to overfitting. The best-performing model in terms of discrimination was combined-SVM (AUC_c = 0.80), followed closely by combined-logistic regression (AUC_c = 0.79) and clinical-logistic regression (AUC_c = 0.73). Genetic-only models consistently yielded the lowest AUCs across all algorithms. Supplementary Fig. 3 displays AUC distributions across imputations for each model, confirming robustness to imputation variability.
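To make the threshold selection concrete, the sketch below computes the AUC as a rank statistic and finds the cut-off that maximises the Youden index, J = sensitivity + specificity − 1. It is written in Python with invented data for illustration; it does not reproduce the authors' R code, and it does not include the bootstrap optimism correction.

```python
def auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def youden_threshold(y_true, scores):
    """Return (threshold, sensitivity, specificity) maximising the Youden index."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best = None
    for t in sorted(set(scores)):
        # Classify as positive when the score is at or above the candidate cut-off.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        tn = sum(1 for y, s in zip(y_true, scores) if s < t and y == 0)
        sens, spec = tp / n_pos, tn / n_neg
        j = sens + spec - 1.0
        if best is None or j > best[0]:
            best = (j, t, sens, spec)
    return best[1], best[2], best[3]


# Toy data, invented for illustration only (not study data).
ecr = [1, 1, 1, 0, 0, 1, 0, 1]
p_ecr = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.85]

auc_apparent = auc(ecr, p_ecr)
thr, sens, spec = youden_threshold(ecr, p_ecr)
```

Because the threshold is chosen on the same data used to report sensitivity and specificity, these metrics are optimistic in the same way as the apparent AUC, which is one reason an optimism-corrected AUC is reported alongside it.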

Fig. 2

Discriminative performance of the 12 models in the development cohort. Each panel displays the receiver operating characteristic (ROC) curve for one algorithm (logistic regression, naive Bayes classifier, gradient-boosting machine and support vector machines), with separate curves for each model type (clinical, genetic, combined). Summary metrics (accuracy, sensitivity, specificity) are shown within each panel and were computed at the optimal probability threshold based on the Youden Index. AUC, area under the ROC curve; AUC_c, optimism-corrected AUC, estimated via bootstrap resampling.

Figure 3 shows the smoothed calibration plots for the three logistic regression models. Both clinical-logistic regression (slope 1.00, 95% CI: 0.65–1.35) and combined-logistic regression (slope 1.00, 95% CI: 0.67–1.33) demonstrated generally good alignment of observed and predicted probabilities, with some over-prediction in those with predicted probabilities <0.3 in the clinical-logistic regression model. By contrast, the genetic-logistic regression model showed markedly poorer calibration, particularly at lower predicted probabilities, where sparse data contributed to unstable estimates and deviation from the ideal calibration line. Miscalibration was also prominent across models using other algorithms, as detailed in Supplementary Fig. 4.

Fig. 3

Calibration plots for logistic regression models in the development data-set, stratified by model type (clinical, genetic, combined). The diagonal line represents perfect calibration, with the locally estimated scatterplot smoothing curve showing the model’s calibration and 95% confidence interval in grey. The marginal bar plot indicates the distribution of patients with (1) and without (0) the observed outcome across deciles of predicted probability of early clinical recovery.

Among the logistic regression models, combined-logistic regression outperformed both clinical and genetic types in AUC (combined v. clinical: 0.856 v. 0.761, P = 0.06; combined v. genetic: 0.856 v. 0.729, P = 0.02), although the former comparison did not reach statistical significance.

External validation was performed in the PAFIP cohort, and the results are shown in Supplementary Fig. 5. All logistic regression models showed reduced discrimination in the PAFIP cohort compared with PEPs, with clinical-logistic regression achieving the highest AUC (0.63). Calibration for these models is depicted in Supplementary Fig. 6.

Based on internal discrimination, calibration and parsimony criteria, we selected the clinical-logistic regression model for further interpretation and clinical translation.

Clinical-logistic regression model

To improve model interpretability, Fig. 4(a) displays SHAP values for the clinical-logistic regression model; functioning, executive function and DTP were the most influential predictors. Colour gradients indicate the direction of the effect: higher values of functioning (mean 0.065), executive performance (mean 0.058), DTP (mean 0.048) and cognitive reserve (mean 0.036) increased the likelihood of ECR, whereas higher DUP values (mean 0.042) and insight (mean 0.041) were associated with a decreased probability. Exponentiated coefficients and associated statistics for the clinical-logistic regression model are provided in Supplementary Table 6.

Fig. 4

(a) Shapley additive explanations (SHAP) summary plot for the clinical-logistic regression model, showing the contribution of each predictor to model output across all individuals. Each dot represents a patient, coloured by the actual value of the predictor variable (purple (dark blue in print version), high; yellow (light blue in print version), low). Predictors are ordered by their impact on model predictions. The horizontal bars on the right indicate each variable’s mean SHAP value, expressed as a percentage of the total contribution across all predictors. (b) Decision curve analysis for the clinical-logistic regression (LR) model, showing the net benefit of using the model to identify patients unlikely to experience early clinical recovery, across a range of threshold probabilities. The model’s net benefit curve (light blue) is compared against default strategies (‘treat all’ and ‘treat none’). DTP, duration of treated psychosis; DUP, duration of untreated psychosis.

Finally, Fig. 4(b) shows the DCA for the clinical-logistic regression model. This model yielded superior net benefit compared with ‘treat all’ or ‘treat none’ strategies across a wide range of threshold probabilities (particularly 0.1–0.8), supporting its potential clinical utility.

Discussion

In this study, we developed and validated predictive models for ECR in patients with FEP. Models were derived using both regression (logistic regression) and machine learning techniques in a large, naturalistic multicentre cohort (PEPs), and externally validated in an independent cohort (PAFIP). We tested various sets of predictors, including clinical and genetic variables. Overall, regression models using clinical and cognitive variables offered the best performance and generalisability.

The predictive performance of our best model (AUC = 0.76, AUC_c = 0.73) is consistent with previous studies predicting remission outcomes (AUC range 0.70–0.74).^{Reference Leighton, Krishnadas, Chung, Blair, Brown and Clark12–Reference Soldatos, Cearns, Nielsen, Kollias, Xenaki and Stefanatou14} As observed in prior research involving multi-site prediction models, a drop in performance occurred when transitioning from the development to the validation data-set: the external validation performance in the PAFIP data-set decreased to an AUC of 0.63. This reduction may be attributable to heterogeneity between cohorts, including differences in inclusion/exclusion criteria, diagnostic distributions and geographic settings.

Notably, PEPs is a multicentre study conducted across multiple regions in Spain; it included both affective and non-affective psychoses, enrolled younger participants and used more inclusive recruitment criteria than PAFIP, which focused on non-affective psychosis and recent-onset cases from Cantabria. Differences in follow-up timing (6 versus 3 months), symptom/functioning measures (PANSS/GAF versus SAPS–SANS/DAS) and the tools used to assess insight and cognition introduced additional variability. Despite this, our models remained robust during external validation, performing significantly above chance levels. This supports their broader relevance across heterogeneous clinical populations.

Our findings allow characterisation of the FEP population with a better prognosis for ECR in terms of premorbid functioning (good premorbid adjustment), clinical domains (shorter DUP, longer DTP and better insight) and cognitive domains (stronger executive function and higher cognitive reserve). This profile is in agreement with substantial evidence in the literature^{Reference Catalan, Richter, Salazar de Pablo, Vaquerizo-Serrano, Mancebo and Pedruzo4,Reference Peralta, García de Jalón, Moreno-Izco, Peralta, Janda and Sánchez-Torres5,Reference Lally, Ajnakina, Stubbs, Cullinane, Murphy and Gaughran7} and is consistent with previous multivariate predictors of remission.^{Reference Leighton, Krishnadas, Chung, Blair, Brown and Clark12–Reference Soldatos, Cearns, Nielsen, Kollias, Xenaki and Stefanatou14} It also highlights the central role of cognitive functioning in long-term outcomes.^{Reference Treen Calvo, Giménez-Donoso, Setién-Suero, Toll Privat, Crespo-Facorro and Ayesa Arriola16} Developing scalable methods to incorporate such measures into routine care while minimising clinical burden will be critical for the deployment of precision psychiatry in clinical practice.^{Reference Meehan, Lewis, Fazel, Fusar-Poli, Steyerberg and Stahl11}

Considering the limitations of the present study when interpreting our findings, we suggest that individuals predicted to be at risk of non-ECR may benefit from more intensive and sustained early-intervention care. This could include optimised pharmacological management, increased frequency or duration of psychosocial interventions and targeted cognitive interventions when cognitive difficulties are prominent.^{Reference Fusar-Poli, Rutigliano, Stahl, Davies, Bonoldi and Reilly3} The intensity and content of care should be tailored to the domains in which the individual shows the greatest need, rather than following a uniform, one-size-fits-all approach. Further research will be needed to explore whether individuals predicted to have poor early recovery could be considered candidates for earlier clozapine intervention.^{Reference Butler, Stratford and Jauhar27} In addition, an ECR profile may support shared decision-making between psychiatrists and patients regarding dose reduction or treatment discontinuation, as desired by many patients.^{Reference Sommer, de Beer, Gangadin, de Haan, Veling and van Beveren28}

Our findings have significant implications for the organisation of early-intervention services. The current practice in many settings is to offer all individuals with FEP the same fixed period of early-intervention services care, regardless of their clinical trajectory. Introducing flexibility in discharge timing could enable services to redirect intensive resources toward those with more complex or prolonged recovery trajectories – specifically patients who require extended or more intensive care – thereby enhancing overall service efficiency, a crucial consideration given the limited resources highlighted in economic evaluations of early-phase psychosis interventions.^{Reference Shields, Buck, Varese, Yung, Thompson and Husain29} Analogous to other areas of medicine, where prognostic models reliably stratify patients to guide the intensity and duration of treatment or monitoring, predictive approaches in FEP could inform personalised care pathways. By identifying those with low probability of early recovery or at higher risk of prolonged courses, services could allocate more intensive interventions and longer follow-up.

An important issue raised by the advancement of precision psychiatry concerns the caution required when communicating unfavourable prognostic information or low chances of ECR to individuals with FEP, in order to avoid fixed and pessimistic labels and to prevent unnecessary emotional distress, hopelessness and misunderstanding.^{Reference Huurman, Pijnenborg, Sportel, van Rijsbergen, Hasson-Ohayon and Boonstra30} Information about illness chronicity and relapse risk should be provided clearly, without implying that recovery is unlikely. This cautious approach is warranted given the substantial heterogeneity of outcomes in FEP, as well as the current limitations of predictive models, which are not yet sufficiently robust or precise to guide individual clinical decisions in isolation.^{Reference Lee, Leighton, Thomas, Gkoutos, Wood and Fenton31} Importantly, predictive information should be embedded within early-intervention services that emphasise modifiable environmental and clinical factors, including early engagement with treatment, reinforcement of substance use prevention strategies and sustained family support.

In terms of modelling approaches, whereas machine learning is often touted for its capacity to model complex relationships, our results showed no significant performance advantage over traditional regression techniques. The fact that machine learning did not outperform logistic regression here probably stems from the combination of our sample size which, although respectable for this clinical domain, may be below the threshold required for many complex algorithms to reliably extract signal from noise without overfitting, and the inherent heterogeneity of the phenotype. This echoes concerns raised in recent literature about overestimating the utility of machine learning in clinical prediction tasks, particularly with moderate-sized data-sets and low signal-to-noise ratios.^{Reference Vabalas, Gowen, Poliakoff and Casson32} In such settings, simpler, well-specified regression models can be equally or more performant, a finding consistent with recent comparative studies.^{Reference Christodoulou, Ma, Collins, Steyerberg, Verbakel and Van Calster33} We ultimately favoured regression models for their greater interpretability, facilitating transparency and clinical usability in contrast to the ‘black-box’ nature of many machine learning approaches.^{Reference Kundu34} Such transparency is crucial to enhancing clinician and patient engagement with clinical predictive models.

We also evaluated the added value of genetic information using PRS. Several recent studies have explored the integration of PRS with clinical data to improve outcome prediction, under the premise that genetic information could improve clinical models. However, the current discriminatory power of PRS in general populations remains limited.^{Reference Lewis and Vassos35} The PRS for cognitive performance was the most informative genetic predictor of recovery, consistent with prior findings linking cognitive PRS to relapse and functional outcomes.^{Reference Segura, Mezquida, Martínez-Pinteño, Gassó, Rodriguez and Moreno-Izco36,Reference Segura, Prohens, Gassó, Rodríguez, Garcia-Rizo and Moreno-Izco37} Nonetheless, although genetic-based models showed strong predictive performance during internal validation, they failed to generalise to the external cohort despite consistent methodologies.

This lack of generalisability may reflect either context-specific gene–environment interactions or phenotypic variability. Increasing evidence suggests that PRS portability is limited not only by ancestry but also by environmental, social and methodological factors.^{Reference Hui, Dudek, Kiryluk, Walunas, Kullo and Wei38} Advances in PRS technology – combining both common and rare variants, epigenetic factors and context-aware modelling – will be essential to realising its full clinical potential.^{Reference Zeng, Guo, Liu and Wong39}

In addition to its limited generalisability, the genetic-logistic regression model exhibited poor calibration, particularly at the lower end of the predicted probability range. This is likely to have resulted from three interlinked factors: (a) a high baseline probability of recovery, driven by class imbalance in the derivation cohort; (b) weak genetic effects; and (c) the scarcity of individuals with an extreme high-risk genetic profile. These characteristics reflect the current limitations of PRS-only models for individual-level clinical prediction.

Among the strengths of this study is the use of large, well-characterised, real-world FEP cohorts with a rich array of clinical, cognitive and genetic data. We followed TRIPOD-AI guidelines, selected predictors based on a priori clinical knowledge, applied coefficient shrinkage and used bootstrapping techniques to reduce overfitting. We reported multiple calibration and performance metrics. Importantly, our models were externally validated using data from a second, independent cohort. Finally, to increase transparency and facilitate future validations, the full analysis code is publicly available in GitHub and a user-friendly online tool has been developed to facilitate individual prediction of ECR using the best-performing model.

Nonetheless, our study has some limitations. We included only self-reported Caucasian participants and did not account for ethnicity as a predictor, limiting the generalisability of our findings. Additionally, the clinical nature of first-episode psychosis cohorts inherently constrains the inclusion of extremely high-risk clinical profiles, such as markedly impaired functioning, very poor insight or psychosis of exceptionally long duration. Patients with more severe presentations are often more difficult to retain in longitudinal studies, as illustrated by the exclusion of participants in the PAFIP cohort. Consequently, the model had limited exposure to very low-recovery phenotypes, which may partly explain the slight deviation in calibration observed in this range. Treatment effects were also not controlled for, due to the naturalistic design and wide variability in treatments used across both cohorts. We assumed that participants had received standard clinical care. Our model focused on the prediction of early clinical recovery (3–6 months to 1 year), which does not necessarily capture medium- or long-term trajectories. A further limitation is the suboptimal calibration of some models (particularly the genetic model), especially in the lower range of predicted probabilities, largely driven by class imbalance and the limited presence of patients with extreme profiles. Finally, unlike clinical symptomatic remission, recovery is a broad and multidimensional construct, making it difficult to establish a uniform definition or standardised measurement criteria. Moreover, recovery status may fluctuate over time, reflecting the dynamic interaction among symptoms, environment and personal factors, which poses challenges for both its longitudinal assessment and comparison of outcomes across studies. Future models should consider standardised outcome definitions, harmonised tools across cohorts and treatment trajectories.

The early identification of individuals at risk of poor recovery may support proactive interventions, tailored treatment plans and resource optimisation. To be translated into practice, prediction tools require prospective validation, integration into clinical decision processes, co-design with end-users and appropriate clinical and regulatory approval. Ongoing efforts, including the FarmaPRED cohort,^{Reference Mas, Julià, Cuesta, Crespo-Facorro, Vázquez-Bourgon and Spuch18} will be valuable for refining and validating these models. Future work should evaluate dynamic prediction over time, incorporate treatment response and examine extended biomarker integration to improve predictive performance. Additionally, we recommend testing whether machine learning approaches provide greater benefit in predicting ECR in larger, multicentre consortia, where the signal-to-noise ratio may be sufficiently increased to reveal potential advantages over traditional regression methods.

In this study, we developed and validated a prediction model for early clinical recovery in patients with first-episode psychosis using a combination of clinical, cognitive and genetic predictors. The final logistic regression model, based on clinical variables, showed acceptable discrimination and calibration in the development data-set and retained moderate performance during external validation in an independent cohort. Despite methodological differences between cohorts and variation in outcome definitions, the model demonstrated a degree of generalisability, highlighting the potential of early data to inform prognostic stratification. Although the results are promising, further research is needed before implementation in clinical settings. The current model should be viewed as a proof of concept for the use of simple clinical indicators to anticipate early treatment response.

Supplementary material

The supplementary material is available online at https://doi.org/10.1192/bjp.2026.10581

Data availability

The data that support the findings of this study are available from the corresponding author, S.M., upon reasonable request.

Author contributions

The results presented here are part of a broader project, the FarmaPRED study. S.M. is the designer and coordinator of the FarmaPRED study, supervised the statistical analysis, performed interpretation of the results and wrote the first draft of the manuscript. L.J. and V.O.-G.F. performed the statistical analysis and wrote the first draft of the manuscript, and both authors contributed equally to this work. A.G.-S. and M.A. performed bioinformatic analysis of the genetic data. J.L.C. supervised the statistical analysis. C.M.D.-C., I.Z., A.L., A.R., R.A.-A., P.S.P., M.J.-R., M.G.R., C.C., J.U., A.I., C.R.-L.-A., C.S., A.M., A.G.-P., B.C.-F., A.C., M.J.C., S.A. and J.V.-B. participated in the recruitment and assessment of the sample. B.C.-F. and J.V.-B. are the coordinators of the PAFIP cohort. S.A. is the coordinator of the PEPs study. All authors contributed to the final draft of the manuscript.

Funding

This study (no. PMP21/00085) was funded by Instituto de Salud Carlos III and European Union Next Generation EU/Mecanismo para la Recuperación y la Resiliencia/Plan de Recuperación, Transformación y Resiliencia; CERCA Program; Catalan Government, the Secretariat of Universities and Research of the Department of Enterprise and Knowledge (no. 2021 SGR 00672); and Centro de Investigación Biomédica en Red de Salud Mental. The PEPs study was funded by Ministerio de Economía y Competitividad (nos PI08/0208, PI11/00325 and PI14/00612), Instituto de Salud Carlos III – Fondo Europeo de Desarrollo Regional. Union Europea. The PAFIP cohort was supported by Instituto de Investigación Sanitaria Valdecilla (grant nos INT/A20/04 and INT/A21/10) and Instituto de Salud Carlos III (grant nos PI020499, PI050427 and PI060507).

Declaration of interest

C.M.D.-C. has received honoraria and/or travel support from Angelini, Johnson & Johnson and Viatris. A.M. has served as a speaker for, and received financial support for attending meetings and travel from Otsuka, Angelini and Janssen Cilag. A.I. thanks the support of Consorcio Centro de Investigación Biomédica en Red (no. CB/07/09/0025), Instituto de Salud Carlos III, Ministerio de Ciencia e Innovación; the Madrid Regional Government (no. S2022/BMD-7216 (AGES 3-CM)) and European Union Structural Funds; and grant nos PI22/01183 and ICI21/00089, integrated into Plan Nacional de I+D+I and co-financed by ISCIII-Subdirección General de Evaluación and the Fondo Europeo de Desarrollo Regional (FEDER). She has received research support from, or served as speaker or advisor for, Alter, Casen Recordati, Janssen-Cilag, Lundbeck, Otsuka Pharmaceutical SA, Rovi and Viatris, with no financial or other relationship relevant to the subject of this article. C.d.l.C. has received financial support to attend scientific meetings from Esteve, Lundbeck, Rovi, Novartis, Casen Recordati and Janssen, outside the submitted work. She is supported by a grant from the Aragón Government (no. B15_R23). M.J.C. thanks the support of the Government of Navarra (grant nos 17/31 and 18/41) and Carlos III Health Institute (FEDER Funds) from the Spanish Ministry of Economy and Competitiveness (nos PI16/02148, PI19/01698, RD21/0009/0025 and RD24/0003/0015). He has also received financial support as a speaker on his research findings from Alianza Otsuka-Lundbeck and Rovi. All other authors report no biomedical financial interests or potential conflicts of interest.

Transparency declaration

The manuscript is an honest, accurate and transparent account of the study being reported; no important aspect of the study has been omitted, and any discrepancies from the study as planned have been explained.

Footnotes

These authors contributed equally.

Joint last authors.

References

Wold, KF, Ottesen, A, Camilla, BF, Johnsen, E, Lagerberg, TV, Romm, KL, et al. Early identification of treatment non-response in first-episode psychosis. Eur Psychiatry 2023; 66: e30. doi:10.1192/j.eurpsy.2023.15

Lisoway, AJ, Chen, CC, Zai, CC, Tiwari, AK, Kennedy, JL. Toward personalized medicine in schizophrenia: genetics and epigenetics of antipsychotic treatment. Schizophr Res 2021; 232: 112–24. doi:10.1016/j.schres.2021.05.010

Fusar-Poli, P, Rutigliano, G, Stahl, D, Davies, C, Bonoldi, I, Reilly, T, et al. Development and validation of a clinically based risk calculator for the transdiagnostic prediction of psychosis. JAMA Psychiatry 2017; 74: 493–500. doi:10.1001/jamapsychiatry.2017.0284

Catalan, A, Richter, A, Salazar de Pablo, G, Vaquerizo-Serrano, J, Mancebo, G, Pedruzo, B, et al. Proportion and predictors of remission and recovery in first-episode psychosis: systematic review and meta-analysis. Eur Psychiatry 2021; 64: e69. doi:10.1192/j.eurpsy.2021.2246

Peralta, V, García de Jalón, E, Moreno-Izco, L, Peralta, D, Janda, L, Sánchez-Torres, AM, et al. Long-term outcomes of first-admission psychosis: a naturalistic 21-year follow-up study of symptomatic, functional and personal recovery and their baseline predictors. Schizophr Bull 2022; 48: 631–42. doi:10.1093/schbul/sbab145

Albert, N, Weibell, MA. The outcome of early intervention in first episode psychosis. Int Rev Psychiatry 2019; 31: 413–24. doi:10.1080/09540261.2019.1643703

Lally, J, Ajnakina, O, Stubbs, B, Cullinane, M, Murphy, KC, Gaughran, F, et al. Remission and recovery from first-episode psychosis in adults: systematic review and meta-analysis of long-term outcome studies. Br J Psychiatry 2017; 211: 350–8. doi:10.1192/bjp.bp.117.201475

Jääskeläinen, E, Juola, P, Hirvonen, N, McGrath, JJ, Saha, S, Isohanni, M, et al. A systematic review and meta-analysis of recovery in schizophrenia. Schizophr Bull 2013; 39: 1296–306.

Santesteban-Echarri, O, Paino, M, Rice, S, González-Blanch, C, McGorry, P, Gleeson, J, et al. Predictors of functional recovery in first-episode psychosis: a systematic review and meta-analysis of longitudinal studies. Clin Psychol Rev 2017; 58: 59–75. doi:10.1016/j.cpr.2017.09.007

Salazar de Pablo, G, Radua, J, Pereira, J, Bonoldi, I, Arienti, V, Besana, F. Probability of transition to psychosis in individuals at clinical high risk. JAMA Psychiatry 2021; 78: 970–8. doi:10.1001/jamapsychiatry.2021.0830

Meehan, AJ, Lewis, SJ, Fazel, S, Fusar-Poli, P, Steyerberg, EW, Stahl, D, et al. Clinical prediction models in psychiatry: a systematic review of two decades of progress and challenges. Mol Psychiatry 2022; 27: 2700–8.10.1038/s41380-022-01528-4CrossRef Google Scholar PubMed

Leighton, SP, Krishnadas, R, Chung, K, Blair, A, Brown, S, Clark, S, et al. Predicting one-year outcome in first episode psychosis using machine learning. PLOS ONE 2019; 14: e0212846.10.1371/journal.pone.0212846CrossRef Google Scholar PubMed

Leighton, SP, Krishnadas, R, Upthegrove, R, Marwaha, S, Steyerberg, EW, Gkoutos, GV, et al. Development and validation of a nonremission risk prediction model in first-episode psychosis: an analysis of 2 longitudinal studies. Schizophr Bull Open 2021; 2: sgab041.10.1093/schizbullopen/sgab041CrossRef Google Scholar PubMed

Soldatos, RF, Cearns, M, Nielsen, MØ, Kollias, C, Xenaki, LA, Stefanatou, P, et al. Prediction of early symptom remission in two independent samples of first-episode psychosis patients using machine learning. Schizophr Bull 2022; 48: 122–33.10.1093/schbul/sbab107CrossRef Google Scholar PubMed

Cuesta, MJ, Sánchez-Torres, AM, Moreno-Izco, L, García de Jalón, E, Gil-Berrozpe, GJ, Zarzuela, A, et al. Neurocognitive correlates of the varied domains of outcomes at 20 year follow-up of first-episode psychosis. Psychiatry Res 2022; 318: 114933.10.1016/j.psychres.2022.114933CrossRef Google Scholar PubMed

Treen Calvo, D, Giménez-Donoso, S, Setién-Suero, E, Toll Privat, A, Crespo-Facorro, B, Ayesa Arriola, R. Targeting recovery in first episode psychosis: the importance of neurocognition and premorbid adjustment in a 3-year longitudinal study. Schizophr Res 2018; 195: 320–6.10.1016/j.schres.2017.08.032CrossRef Google Scholar

Wray, NR, Lin, T, Austin, J, McGrath, JJ, Hickie, IB, Murray, GK, et al. From basic science to clinical application of polygenic risk scores: a primer. JAMA Psychiatry 2021; 78: 101–9.10.1001/jamapsychiatry.2020.3049CrossRef Google Scholar PubMed

Mas, S, Julià, L, Cuesta, MJ, Crespo-Facorro, B, Vázquez-Bourgon, J, Spuch, C, et al. Applied pharmacogenetics to predict response to treatment of first psychotic episode: study protocol. Front Psychiatry 2025; 15: 1497565. doi:10.3389/fpsyt.2024.1497565

Collins, GS, Moons, KGM, Dhiman, P, Riley, RD, Beam, AL, Van Calster, B, et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 2024; 385: e078378. doi:10.1136/bmj-2023-078378

Bernardo, M, Bioque, M, Parellada, M, Saiz Ruiz, J, Cuesta, MJ, Llerena, A, et al. Criterios de valoración clínicos y de funcionamiento en un estudio de interacción gen-ambiente en primeros episodios psicóticos (PEPs). Rev Psiquiatr Salud Ment 2013; 6: 4–16. doi:10.1016/j.rpsm.2012.11.001

Crespo-Facorro, B, de la Foz, VO-G, Ayesa-Arriola, R, Pérez-Iglesias, R, Mata, I, Suarez-Pinilla, P, et al. Prediction of acute clinical response following a first episode of non affective psychosis: results of a cohort of 375 patients from the Spanish PAFIP study. Prog Neuropsychopharmacol Biol Psychiatry 2013; 44: 162–7. doi:10.1016/j.pnpbp.2013.02.009

Segura, AG, Serna, E, Sugranyes, G, Baeza, I, Valli, I, Martínez-Serrano, I, et al. Polygenic risk scores mediating functioning outcomes through cognitive and clinical features in youth at family risk and controls. Eur Neuropsychopharmacol 2024; 81: 28–37. doi:10.1016/j.euroneuro.2024.01.009

van Buuren, S, Groothuis-Oudshoorn, K. mice: Multivariate imputation by chained equations in R. J Stat Softw 2011; 45: 1–67. doi:10.18637/jss.v045.i03

Friedman, JH, Hastie, T, Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw 2010; 33: 1–22. doi:10.18637/jss.v033.i01

Harrell, FE. rms: Regression Modeling Strategies. The Comprehensive R Archive Network, 2025 (https://cran.r-project.org/web/packages/rms/index.html).

Kuhn, M. Building predictive models in R using the caret package. J Stat Softw 2008; 28: 1–26. doi:10.18637/jss.v028.i05

Butler, E, Stratford, M, Jauhar, S. Should clozapine be offered as a second-line antipsychotic? Lancet Psychiatry 2025; 12: 85–6. doi:10.1016/S2215-0366(24)00440-1

Sommer, IE, de Beer, F, Gangadin, S, de Haan, L, Veling, W, van Beveren, N, et al. Early dose reduction or discontinuation vs maintenance antipsychotics after first psychotic episode remission: a randomized clinical trial. JAMA Psychiatry 2026; 83: 68–73. doi:10.1001/jamapsychiatry.2025.2525

Shields, GE, Buck, D, Varese, F, Yung, AR, Thompson, A, Husain, N, et al. A review of economic evaluations of health care for people at risk of psychosis and for first-episode psychosis. BMC Psychiatry 2022; 22: 126. doi:10.1186/s12888-022-03769-7

Huurman, ME, Pijnenborg, GHM, Sportel, BE, van Rijsbergen, GD, Hasson-Ohayon, I, Boonstra, N. Communicating diagnoses to individuals with a first episode psychosis: a qualitative study of individuals perspectives. Front Psychiatry 2023; 14: 1098224. doi:10.3389/fpsyt.2023.1098224

Lee, R, Leighton, SP, Thomas, L, Gkoutos, GV, Wood, SJ, Fenton, SJH, et al. Prediction models in first-episode psychosis: systematic review and critical appraisal. Br J Psychiatry 2022; 220: 179–91. doi:10.1192/bjp.2021.219

Vabalas, A, Gowen, E, Poliakoff, E, Casson, AJ. Machine learning algorithm validation with a limited sample size. PLOS ONE 2019; 14: e0224365. doi:10.1371/journal.pone.0224365

Christodoulou, E, Ma, J, Collins, GS, Steyerberg, EW, Verbakel, JY, Van Calster, B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol 2019; 110: 12–22. doi:10.1016/j.jclinepi.2019.02.004

Kundu, S. AI in medicine must be explainable. Nat Med 2021; 27: 1328. doi:10.1038/s41591-021-01461-z

Lewis, CM, Vassos, E. Polygenic risk scores: from research tools to clinical instruments. Genome Med 2020; 12: 44. doi:10.1186/s13073-020-00742-5

Segura, A, Mezquida, G, Martínez-Pinteño, A, Gassó, P, Rodriguez, N, Moreno-Izco, L, et al. Link between cognitive polygenic risk scores and clinical progression after a first-psychotic episode. Psychol Med 2022; 53: 4634–47. doi:10.1017/S0033291722001544

Segura, A, Prohens, L, Gassó, P, Rodríguez, N, Garcia-Rizo, C, Moreno-Izco, L, et al. The polygenic basis of relapse after a first episode of schizophrenia. Eur Neuropsychopharmacol 2023; 75: 80–92. doi:10.1016/j.euroneuro.2023.06.003

Hui, D, Dudek, S, Kiryluk, K, Walunas, TL, Kullo, IJ, Wei, WQ, et al. Risk factors affecting polygenic score performance across diverse cohorts. eLife 2025; 12: RP88149. doi:10.7554/eLife.88149

Zeng, W, Guo, H, Liu, Q, Wong, WH. Improving polygenic prediction from whole-genome sequencing data by leveraging predicted epigenomic features. Proc Natl Acad Sci U S A 2025; 122: e2419202122. doi:10.1073/pnas.2419202122

Table 1 Summary of clinical characteristics of the development cohort, stratified by early clinical recovery

Fig. 1 Flowchart for model development and external validation study populations. PEPs, primeros episodios psicóticos; PAFIP, Programa Asistencial a las Fases Iniciales de Psicosis; ECR, early clinical recovery; PAS, Premorbid Adjustment Scale.

Table 2 Clinical characteristics and polygenic risk scores of the validation cohort (PAFIP), stratified by early clinical recovery

Fig. 2 Discriminative performance of the 12 models in the development cohort. Each panel displays the receiver operating characteristic (ROC) curve for one algorithm (logistic regression, naive Bayes classifier, gradient-boosting machine and support vector machines), with separate curves for each model type (clinical, genetic, combined). Summary metrics (accuracy, sensitivity, specificity) are shown within each panel and were computed at the optimal probability threshold based on the Youden Index. AUC, area under the ROC curve; AUCc, optimism-corrected AUC, estimated via bootstrap resampling.
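As an illustration of how a threshold based on the Youden Index can be obtained, the following minimal Python sketch scans candidate cut-offs and keeps the one that maximises J = sensitivity + specificity − 1. The function name `youden_threshold` and the toy data are assumptions made for illustration only; they are not taken from the study code or data, and ties are resolved here by keeping the lowest threshold.

```python
def youden_threshold(y_true, y_prob):
    """Return (threshold, sensitivity, specificity, J) maximising
    Youden's J = sensitivity + specificity - 1.

    y_true: observed outcomes coded 0/1; y_prob: predicted probabilities.
    A patient is classified as positive when y_prob >= threshold.
    """
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both outcome classes are required")

    best = None
    # Every distinct predicted probability is a candidate cut-off.
    for t in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        sens = tp / positives
        spec = tn / negatives
        j = sens + spec - 1
        # Strict inequality: on ties the lowest threshold is kept.
        if best is None or j > best[3]:
            best = (t, sens, spec, j)
    return best


# Hypothetical toy data (not from the study cohorts).
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
threshold, sens, spec, j = youden_threshold(y_true, y_prob)
print(threshold, sens, spec, j)
```

Accuracy, sensitivity and specificity reported at such a threshold depend on the sample used to choose it, so they are best read alongside the optimism-corrected AUC.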

Fig. 3 Calibration plots for logistic regression models in the development data-set, stratified by model type (clinical, genetic, combined). The diagonal line represents perfect calibration, with the locally estimated scatterplot smoothing curve showing the model’s calibration and 95% confidence interval in grey. The marginal bar plot indicates the distribution of patients with (1) and without (0) the observed outcome across deciles of predicted probability of early clinical recovery.

Fig. 4 (a) Shapley additive explanations (SHAP) summary plot for the clinical-logistic regression model, showing the contribution of each predictor to model output across all individuals. Each dot represents a patient, coloured by the actual value of the predictor variable (purple (dark blue in print version), high; yellow (light blue in print version), low). Predictors are ordered by their impact on model predictions. The horizontal bars on the right indicate each variable’s mean SHAP value, expressed as a percentage of the total contribution across all predictors. (b) Decision curve analysis for the clinical-logistic regression (LR) model, showing the net benefit of using the model to identify patients unlikely to experience early clinical recovery, across a range of threshold probabilities. The model’s net benefit curve (light blue) is compared against default strategies (‘treat all’ and ‘treat none’). DTP, duration of treated psychosis; DUP, duration of untreated psychosis.
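The net benefit plotted in a decision curve can be sketched with the standard definition, net benefit = TP/n − (FP/n) × pt/(1 − pt), where pt is the threshold probability. The Python sketch below is illustrative only: the function names, the coding of the event as 1 (here assumed to be absence of early clinical recovery) and the toy data are assumptions, not the authors' implementation.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of flagging patients whose predicted probability of the
    event is at or above `threshold` (standard decision-curve definition)."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # Odds of the threshold weight the harm of a false positive.
    weight = threshold / (1 - threshold)
    return tp / n - (fp / n) * weight


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of the default 'treat all' strategy at the same threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical toy data: 1 = event (assumed here: no early clinical recovery).
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

nb_model = net_benefit(y_true, y_prob, 0.5)
nb_all = net_benefit_treat_all(y_true, 0.5)
nb_none = 0.0  # 'treat none' has zero net benefit by definition
print(nb_model, nb_all, nb_none)
```

Evaluating these functions over a grid of thresholds gives the model, 'treat all' and 'treat none' curves; the model is useful at thresholds where its curve lies above both defaults.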

Julià et al. supplementary material
