Prediction of estimated risk for bipolar disorder using machine learning and structural MRI features

Background. Individuals with bipolar disorder are commonly diagnosed correctly only a decade after symptom onset. Machine learning techniques may aid early recognition and reduce the disease burden. As both individuals at risk and those with manifest disease display structural brain markers, structural magnetic resonance imaging may provide relevant classification features. Methods. Following a pre-registered protocol

Aggregation of big data from multiple centers, combined with machine learning, has enabled individualized predictions for diagnostics, prognosis, and therapy response (Dwyer, Falkai, & Koutsouleris, 2018). In the field of early recognition, psychosis risk has received the most attention (Koutsouleris et al., 2015, 2016, 2018). Prediction of transition to psychosis in high-risk subjects can be substantially improved using machine learning, achieving up to 85.5% accuracy when combined with clinicians' judgments (Koutsouleris et al., 2021). Disproportionately fewer machine learning studies have focused on the early recognition of bipolar disorder (Claude, Houenou, Duchesnay, & Favre, 2020).
Among neuroimaging data, structural magnetic resonance imaging (MRI) is especially suitable for diagnostic and prognostic analyses using machine learning techniques. Most psychiatric disorders have been associated with brain structural markers or alterations. Recent large-scale multicentric studies of major psychiatric disorders within the ENIGMA consortium showed that, along with schizophrenia, bipolar disorder ranks highest in cortical thinning among major conditions beginning in early- to mid-adulthood (Ching et al., 2020). Unlike major depression, attention-deficit hyperactivity disorder (ADHD), obsessive-compulsive disorder, or autism, both disorders seem to be associated with similar patterns of large-scale cortical thinning in frontal, temporal, and parietal regions with relatively high effect sizes. From a practical point of view, structural MRI (sMRI) requires relatively short scanning sequences and modest compliance, and displays high test-retest reliability (Hedges et al., 2022). Unlike genetic predisposition, which is a major risk factor for bipolar disorder with transition rates of 4.2-22.4% in first-degree relatives (Hafeman et al., 2017; Kerner, 2014; Post et al., 2018), the use of sMRI in assessing the risk for bipolar disorder has rarely been investigated.
Individuals at risk for bipolar disorder have been studied using two major approaches: family cohorts, i.e. first-degree relatives (Hajek et al., 2013), and help-seeking populations (Pfennig et al., 2020). The latter approach enables the study of a broader range of risk factors including specific subsyndromal manic or depressive symptoms, mood swings, changes in sleep and circadian rhythm, anxiety, ADHD, specific character traits, stressful life events, or substance use (Faedda et al., 2019; Leopold et al., 2012). For this purpose, and to facilitate risk recognition in help-seeking cohorts, several risk assessment tools have been developed, including the (extended) bipolar-at-risk criteria [BAR(S)] (Bechdolf et al., 2014; Fusar-Poli et al., 2018), the Bipolar Prodrome Symptom Interview and Scale (BPSS-P) (Correll et al., 2014), and the EPIbipolar interview (Leopold et al., 2012). It is a strength of our study that all three instruments are available for our cohort and that each was investigated as a dependent variable.
Several studies have explored the use of machine learning to classify diagnosed bipolar disorder (Hajek et al., 2015; Nunes et al., 2020) and individuals with high genetic risk for bipolar disorder (i.e. first-degree relatives). A review by Claude et al. (2020) identified five studies that aimed to classify persons with genetic risk using different modalities, achieving accuracies from 59.7% up to 83.21%. Among those, two studies used regional cortical volumes (Hajek et al., 2015; Lin et al., 2018) and two used functional MRI (Frangou, 2019; Mourão-Miranda et al., 2012; Roberts et al., 2017). To the best of our knowledge, no multicenter machine learning study has yet been conducted to classify risk scores for bipolar disorder that include, but are not limited to, genetic risk. Based on the data from the Early-BipoLife study (Pfennig et al., 2020), we aimed to train a machine learning classifier using 10-fold cross-validation to stratify help-seeking subjects by estimated risk using sMRI. In contrast to single-center studies, we also used the multicenter design to validate the classifier on test data from an 'unseen' study site through a leave-one-site-out cross-validation. Our results may provide a proof-of-concept for the utility of sMRI data for individualized risk prediction in subjects seeking help.

Pre-registration
We pre-registered our analyses at the Open Science Framework (https://osf.io/c4hfn).

Sample
The data were collected within the multicenter Early-BipoLife study (Pfennig et al., 2020; Ritter et al., 2016). Early-BipoLife is a multicenter, naturalistic, prospective-longitudinal observational cohort study of adolescents and young adults (age 15-35) at risk for bipolar disorder. Of the 10 participating German university and teaching hospitals with early detection centers/facilities for bipolar disorder, seven centers (Berlin, Bochum, Frankfurt, Hamburg, Dresden, Marburg, Tübingen) acquired MRI data. For this study, we accessed the baseline clinical and MRI data. For a detailed description of data collection procedures, see Pfennig et al. (2020). Briefly, of the total N = 1229 recruited adolescents and young adults at risk, N = 313 opted to receive MRI. In order to include all proposed risk factors for bipolar disorder, we recruited the participants via three recruitment pathways: N = 123 consulted early detection centers/facilities and screened positive for ⩾1 proposed risk factor for bipolar disorder (see online Supplementary note 1), N = 146 were young in- and outpatients with a depressive syndrome, and N = 44 had an established diagnosis of ADHD. In order to include older individuals who might have an unrecognized bipolar disorder (e.g. due to the presence of exclusively depressive episodes, but no full-blown mania or hypomania yet), we extended the age inclusion criterion beyond the typical age of onset based on available studies on time to diagnosis. For more details on inclusion/exclusion criteria, see online Supplementary note 1. The study was approved by the Ethics Committee of the Medical Faculty of the Technische Universität Dresden (No: EK290082014), as well as by the local ethics committees at each study site. We obtained written informed consent after providing comprehensive information about the study aims and procedures. Additionally, parents of adolescents gave their informed consent to their children's participation.

MRI acquisition, preprocessing and quality assessment
We acquired high-resolution structural T1-weighted images using Siemens Magnetom MR scanners at 6 sites (Trio, Skyra, Prisma) and a Philips Achieva scanner at 1 site. We standardized the pulse sequence parameters across all sites to the extent permitted by each platform. For a detailed description of the scanning protocol including the detail of MRI scanners, specific hardware configurations, and pulse sequence parameters, see Vogelbacher et al. (2021).
Prior to preprocessing, we performed the data acquisition and quality assessment according to the BipoLife study protocol. Briefly, we analyzed the MRI images using the MRIQC tool (Esteban et al., 2017). Two authors visually inspected the obtained reports of several metrics including a movement plot and a plot of the background noise. In this way, 23 subjects were excluded from further analysis due to strong movement (N = 18), ghosting (N = 1), or fold-over artifacts (N = 4).
We performed a standardized quality control of the cortical and subcortical segmentations and parcellations according to the established protocols of the ENIGMA working group (http://enigma.ini.usc.edu/protocols/imaging-protocols). This included a visual inspection of the segmented regions using the internal and external surface methods, as well as statistical outlier detection. The outliers were subjected to further visual inspection. Three subjects did not pass the quality control or displayed major segmentation errors and were discarded.

Risk assessment instruments
We assessed the risk for the development of bipolar disorder using three state-of-the-art assessment instruments: the Bipolar At-Risk (BAR) criteria (Bechdolf et al., 2014) and the extended BAR criteria (BARS; Fusar-Poli et al., 2018), the Bipolar Prodrome Symptom Scale (BPSS-P; Correll et al., 2014), and the Early Phase Inventory for bipolar disorders (EPIbipolar; Leopold et al., 2012). BAR(S) criteria comprise a set of subthreshold clinical and behavioral symptoms as well as genetic risk. A person is assessed as being at high risk if one or more risk syndromes are fulfilled: sub-threshold mania, sub-threshold depression, sub-threshold depression with genetic risk, mixed symptoms, or mood swings. BARS criteria showed adequate prognostic accuracy for conversion to bipolar disorder (conversion rate 18.5% in N = 27 participants) in a longitudinal cohort (Fusar-Poli et al., 2018). BPSS-P and EPIbipolar are semi-structured interviews. BPSS-P was developed based on the DSM-IV criteria for bipolar disorder and major depression and on established rating scales for these conditions. It combines these criteria into a mania symptom index, a depression symptom index, and a general symptom index, and it defines two at-risk states: attenuated mania symptom syndrome (AMSS) and genetic mania risk and deterioration syndrome (GMRDS). BPSS-P has good internal consistency, convergent validity, and inter-rater reliability (Correll et al., 2014). EPIbipolar contains elements from BPSS-P and additionally captures risk factors that have been identified through a systematic literature review, such as subsyndromal manic or depressive symptoms, mood swings, changes in sleep and circadian rhythm, anxiety, ADHD, specific character traits, stressful life events, or changing patterns of substance use (Leopold et al., 2012). It defines three risk categories: no-risk, low-risk, and high-risk.
For the purpose of this analysis, we pooled subjects from the low-risk and high-risk groups assessed by EPIbipolar, as these participants, unlike those from the no-risk group, displayed several clinically relevant risk factors or symptoms and are intended for targeted interventions in early recognition services. The term 'no-risk' group in EPIbipolar was originally established to describe the lack of need for a specialized clinical intervention in participants with only minor risk factors (Leopold et al., 2012). In research settings, this label might be misleading, as participants in the no-risk group might also display minor risk factors and are not to be confused with healthy controls. Of note, all recruited participants, even those who did not fulfill the criteria of any risk syndrome/group on any of the three risk instruments, displayed at least one known risk factor for bipolar disorder (see online Supplementary note 1). The final binary outcomes were as follows: any symptom syndrome/no symptom syndrome for BPSS-P; any risk group/no risk group for BARS; high-risk + low-risk groups/no-risk group for EPIbipolar (see also Table 1 for demographics). As we discarded subjects with missing data on the corresponding assessment tools, the sample sizes for the three risk assessment tools varied (N_BARS = 264, N_BPSS-P = 276, N_EPIbipolar = 273). For details on the risk assessment tools, see online Supplementary Table S1 and Pfennig et al. (2020). All three instruments/criteria sets were obtained from the respective authors and can be administered after appropriate training. The administration of the complete risk assessment battery takes 2-3 h.

Machine learning classification
In accordance with a previous study of subjects with diagnosed bipolar disorder by the ENIGMA consortium and to increase reproducibility, we used a linear support vector machine (SVM) classifier with the hyperparameter C = 1 for the primary analysis. We performed independent binary classifications for each risk instrument (BPSS-P, BARS, and EPIbipolar). Using the Scikit-learn 1.0 package for Python 3.8.3 (Pedregosa et al., 2011), we utilized two cross-validation methods: 10-fold and leave-one-site-out (i.e. data from one study center served as the test data, while the training dataset included the data from all other centers). In each fold, we standardized features in the training and testing sets separately by removing the mean and scaling to unit variance using the standard scaler (Scikit-learn 1.0 package, see above). We took the following measures to manage the imbalanced class distribution within the data: (A) we used a stratified cross-validation to ensure that the class ratio stayed approximately the same in all folds, and (B) we used random oversampling of the minority class (Chawla, Bowyer, Hall, & Kegelmeyer, 2002) in the training set, so that the class ratios in each fold were balanced. For the primary analysis, we used the 68 regional cortical thickness values as features and performed both cross-validation methods (10-fold and leave-one-site-out), i.e. we trained six models altogether. As the class ratios for all three risk instruments were imbalanced, we used the following two performance measures, which are commonly used for imbalanced classification problems: Cohen's κ (i.e. a measure of agreement between the predicted and the true labels, corrected for the agreement expected by chance given the class frequencies; <0 no agreement, 0-0.20 slight, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 substantial, and 0.81-1 almost perfect agreement) (Landis & Koch, 1977), and balanced accuracy [balanced accuracy = (sensitivity + specificity)/2]. Additionally, we report sensitivity and specificity.
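As a minimal sketch of how these performance measures relate to a 2 × 2 confusion matrix, the following Python function derives sensitivity, specificity, balanced accuracy, and Cohen's κ from raw counts. The function name and all counts are hypothetical and purely illustrative; they are not taken from the study data.

```python
def classification_metrics(tp, fn, fp, tn):
    """Derive imbalance-aware metrics from the four cells of a confusion matrix."""
    n = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)  # share of at-risk subjects detected
    specificity = tn / (tn + fp)  # share of not-at-risk subjects detected
    accuracy = (tp + tn) / n      # observed agreement
    # Agreement expected by chance, given the marginal class frequencies
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "kappa": kappa,
    }


# Hypothetical imbalanced sample: 50 subjects at risk, 200 not at risk
model = classification_metrics(tp=30, fn=20, fp=40, tn=160)
# A trivial classifier that labels every subject "not at risk"
trivial = classification_metrics(tp=0, fn=50, fp=0, tn=200)
```

In this toy example the trivial classifier reaches a plain accuracy of 0.8 but a balanced accuracy of 0.5 and a κ of 0, which illustrates why chance-corrected measures are preferable when classes are imbalanced.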
We do not report other common measures such as accuracy and area under the receiver operating characteristic curve, as these are not suitable for imbalanced data (He & Ma, 2013). As this was a population-based, observational study, the samples at each site were not balanced regarding participants at risk, and some were even smaller than the recommended size of N > 20 for a test set (Flint et al., 2021; see also online Supplementary Table S2). For this reason, we report the performance in both the 10-fold and the leave-one-site-out cross-validation. For risk assessment instruments achieving an above-chance prediction (i.e. Cohen's κ > 0 and lower bound of the confidence interval > 0), we assessed the possible effects of confounds using post-hoc statistical tests comparing the correctly and incorrectly classified subjects. This is a more valid approach to account for possible confounders than regressing out covariates prior to analysis, which would disrupt the train/test separation (Pereira, Mitchell, & Botvinick, 2009). We also report the post-hoc tests for the leave-one-site-out cross-validation for BPSS-P, where the lower bound of the confidence interval slightly crossed the zero boundary. Given the above-mentioned limitations of the leave-one-site-out cross-validation (low sample size of some sites, imbalanced classes), we consider both measures relevant. We accounted for the following confounds: age, sex, medication (yes/no), recruitment pathway (early recognition services/depression/ADHD), smoking status (never smoked/current smoker/past smoker), present cannabis use (no use/<1 per month/∼1 per month/2-9 per month/⩾10 per month), lifetime cannabis use (no use/<1 per month/∼1 per month/2-9 per month/⩾10 per month), site, and scanner type (for the list of sites and scanner types see above).
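The leave-one-site-out scheme and the random oversampling of the minority class within the training set can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names, the site labels A to C, and the toy labels are hypothetical, and the study itself relied on Scikit-learn rather than on this code.

```python
import random
from collections import Counter


def leave_one_site_out(sites):
    """Yield (held_out_site, train_indices, test_indices) for each distinct site."""
    for held_out in sorted(set(sites)):
        train = [i for i, s in enumerate(sites) if s != held_out]
        test = [i for i, s in enumerate(sites) if s == held_out]
        yield held_out, train, test


def oversample_minority(train_idx, labels, rng):
    """Resample smaller classes with replacement until all classes are equally frequent."""
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(idx) for idx in by_class.values())
    balanced = []
    for idx in by_class.values():
        balanced.extend(idx)
        balanced.extend(rng.choices(idx, k=target - len(idx)))
    return balanced


# Hypothetical toy data: 15 subjects from three sites, 1 = at risk, 0 = not at risk
sites = ["A"] * 6 + ["B"] * 5 + ["C"] * 4
labels = [1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0]

rng = random.Random(0)
splits = []
for held_out, train_idx, test_idx in leave_one_site_out(sites):
    # Only the training indices are oversampled; the held-out site stays untouched
    balanced_train = oversample_minority(train_idx, labels, rng)
    class_counts = Counter(labels[i] for i in balanced_train)
    splits.append((held_out, balanced_train, test_idx, class_counts))
```

Oversampling only after the split keeps duplicated minority-class subjects out of the test data, so the held-out site remains genuinely unseen.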
We estimated the magnitude of the contribution of brain regions to the SVM classification using SVM coefficients. Coefficients of a linear classifier can be interpreted as a relative measure of feature importance (Pereira et al., 2009) for the classification process. Note that this is not to say that a highly weighted feature necessarily contains a lot of information about the target class (Haufe et al., 2014). We used the freesurfer_statsurf_display library (https://chrisadamsonmcri.github.io/freesurfer_statsurf_display) to visualize the results.
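One way to summarize coefficients across cross-validation folds is to average their absolute values per feature and rank the features by that mean. The sketch below assumes that the coefficient vector of each fold's linear model is already available; the function name, region names, and values are hypothetical placeholders, not results.

```python
def mean_abs_coefficients(fold_coefs, feature_names):
    """Average |coefficient| per feature over folds and rank features in descending order."""
    n_folds = len(fold_coefs)
    means = [
        sum(abs(fold[j]) for fold in fold_coefs) / n_folds
        for j in range(len(feature_names))
    ]
    return sorted(zip(feature_names, means), key=lambda pair: pair[1], reverse=True)


# Hypothetical coefficients from three folds for three regions
feature_names = ["region_a", "region_b", "region_c"]
fold_coefs = [
    [0.5, -0.2, 0.1],
    [-0.3, 0.2, -0.1],
    [0.4, -0.2, 0.1],
]
ranking = mean_abs_coefficients(fold_coefs, feature_names)
```

Taking absolute values before averaging prevents coefficients with opposite signs in different folds from cancelling out. The caveat by Haufe et al. (2014) still applies: a large weight does not by itself show that a region carries information about the class.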

Secondary analyses
To investigate whether a lower feature/sample size ratio might improve classification performance, we selected 20 features based on the available literature from other relevant large-scale multicenter studies and included these in our pre-registration. We chose 20 features in order to approach the ratio of participants to features in the prior SVM study on bipolar disorder by Nunes et al. (2020), which reported having 20 times more participants than features. We selected those features from another large-scale ENIGMA study of bipolar disorder and healthy controls by Hibar et al. (2018), which identified a pattern of significant reductions of cortical thickness in frontal, temporal, and parietal regions in a pooled sample of 6503 participants (bipolar patients and healthy controls). We selected the 20 features displaying the highest effect sizes in that study (see online Supplementary note 2 for the list of features).
In order to better compare the performance of the SVM in our sample of help-seeking individuals at risk with its performance in patients with established disease reported by Nunes et al. (2020), we also performed the classification using the same feature set of 150 features, including 68 regional cortical thickness and 68 surface area values as well as volumes of 14 subcortical features plus the estimated total intracranial volume.
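The reasoning behind the reduced feature set can be made concrete with a short calculation of participants per feature. The snippet below is only a worked illustration: it uses the BPSS-P sample size (N = 276) and the sample size of Nunes et al. (2020) (N = 3020) as quoted elsewhere in this article, and the dictionary keys are descriptive labels chosen here.

```python
# (number of participants, number of features) per analysis
analyses = {
    "this study, 68 cortical thickness features": (276, 68),
    "this study, 20 selected features": (276, 20),
    "this study, 150 features": (276, 150),
    "Nunes et al. (2020), 150 features": (3020, 150),
}

# Participants per feature; higher values mean fewer parameters relative to the data
ratios = {name: n / p for name, (n, p) in analyses.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} participants per feature")
```

With 20 features the ratio rises from about 4 to about 14 participants per feature, which approaches, but does not reach, the ratio of roughly 20 in the reference study.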
Lastly, we investigated whether hyperparameter optimization using a nested cross-validation would improve the results. In each fold, we divided the train set into train and test subsets once more and ran multiple nested SVM classifications with different SVM regularization parameters C (1 × 10⁻⁵, 1 × 10⁻⁴, 1 × 10⁻³, 1 × 10⁻², 1 × 10⁻¹, 1, 10, and 100) (grid search method). We selected the best possible model according to the achieved balanced accuracy. Finally, we tested the selected model on the unseen test data from the primary loop.
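A nested cross-validation with a grid search over C might look roughly like the following Scikit-learn sketch. It is not the study code: the synthetic data from make_classification, the fold counts (5 outer and 3 inner folds, chosen to keep the example fast), the random seeds, and the omission of oversampling are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Grid of regularization parameters as listed in the text
C_GRID = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]

# Synthetic, imbalanced stand-in for the cortical thickness features
X, y = make_classification(
    n_samples=120, n_features=10, weights=[0.8, 0.2], random_state=0
)

outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

outer_scores, chosen_C = [], []
for train_idx, test_idx in outer.split(X, y):
    # Inner loop: choose C by balanced accuracy using the training data only
    pipeline = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    search = GridSearchCV(
        pipeline, {"svc__C": C_GRID}, scoring="balanced_accuracy", cv=inner
    )
    search.fit(X[train_idx], y[train_idx])
    chosen_C.append(search.best_params_["svc__C"])
    # Outer loop: evaluate the refitted best model on the unseen test fold
    y_pred = search.predict(X[test_idx])
    outer_scores.append(balanced_accuracy_score(y[test_idx], y_pred))
```

Because C is selected inside each outer training fold, the outer test fold never influences the choice of the hyperparameter, which is the point of the nested design.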

Demographics
For detailed demographics, see Table 1. The participants who fulfilled any risk syndrome according to BPSS-P did not differ from those not fulfilling any risk syndrome in any of the measured variables. The participants who fulfilled any risk syndrome according to BARS were more likely to take medication (χ² = 4.608, p = 0.032), to smoke (χ² = 6.008, p = 0.05), and to suffer from diagnosed depression, but less likely to suffer from ADHD (χ² = 23.149, p ≤ 0.001), and were more likely to have attended high school (χ² = 13.789, p = 0.032) than those not fulfilling any risk syndrome. The participants who fulfilled any risk syndrome according to EPIbipolar were more likely to be female (χ² = 3.894, p = 0.048), to take medication (χ² = 6.909, p = 0.009), to smoke (χ² = 6.036, p = 0.049), and to suffer from diagnosed depression, but less likely to suffer from ADHD (χ² = 23.149, p ≤ 0.001), than those not fulfilling any risk syndrome. The participants removed due to movement during the scan and quality control did not differ from those in the final dataset in the proportion of any of the risk syndromes: BPSS-P (df = 1, χ² = 0.004, p = 0.949), BARS (df = 1, χ² = 0.412, p = 0.521), and EPIbipolar (df = 2, χ² = 1.092, p = 0.579).
See Table 2 for the summary of performance measures.

Secondary analyses
Neither the literature-derived selection of 20 regional cortical thickness features nor the extended feature set including whole-brain regional surface area and volumes of subcortical regions yielded significantly higher accuracies, as the confidence intervals overlapped with those from the primary analysis (see Table 3 for the summary of classification metrics). For BPSS-P, the difference in performance measures between the 10-fold and the leave-one-site-out cross-validation was smaller when using the 20 regional cortical features than when using all regional cortical values, which suggests a non-significant trend toward better model validity with the 20 cortical features.

Hyperparameter optimization
Using hyperparameter optimization, we achieved Cohen's κ 0.   in the leave-one-site-out cross-validation. The most frequently selected C parameter was 100 (in 7 out of 10 folds for the 10-fold and in 4 out of 7 folds for the leave-one-site-out cross-validation).

SVM coefficients
The mean over folds of the absolute values of the SVM coefficients by feature (brain region) for BPSS-P, 10-fold cross-validation, and whole-brain regional cortical thickness features is depicted in Fig. 1. For the values of all coefficients, see online Supplementary Table S3.

Discussion
The linear SVM classifier detected individuals with increased estimated risk for bipolar disorder as defined by the BPSS-P interview with a Cohen's κ of 0.227/0.141 and a balanced accuracy of 63.1/56.2% (based on the 10-fold cross-validation in the pooled sample and the leave-one-site-out cross-validation, respectively). Precuneus, inferior frontal gyrus, and posterior cingulate cortex ranked among the highest contributing features according to SVM coefficients. SVM could not detect participants at risk as estimated by BARS or EPIbipolar. Our results suggest that young participants at risk of bipolar disorder according to the BPSS-P display distinct structural brain features that permit better-than-chance classification. Importantly, using both the pooled sample (10-fold cross-validation) and the leave-one-site-out cross-validation, we achieved accuracies comparable to the previous multicenter study by Nunes et al. (2020), which differentiated patients with manifest bipolar disorder from healthy controls with balanced accuracies of 65.23% (95% CI 63.47-67.00) and 58.67% (95% CI 56.70-60.63), respectively. Compared to their study, the 95% confidence intervals in our study were considerably wider, which was to be expected given that our sample was more than 10 times smaller (276 v. 3020 participants). Larger sample sizes tend to yield more stable performance (Nieuwenhuis et al., 2012). Post-hoc tests suggested an effect of present cannabis use on the classification using 10-fold cross-validation; however, regressing out present cannabis use did not impair the performance. Moreover, there was no such effect in the leave-one-site-out validation. Other demographic variables did not show an effect on the classification (see also online Supplementary note 3). As such, this would be consistent with the notion that differences in brain structure in bipolar disorder are not a result of the disorder but are a pre-morbid risk factor potentially related to genetics.
On the other hand, as the age of participants in our sample was higher than the typical age of onset of bipolar disorder, we might have included older participants with a yet undiagnosed bipolar disorder. Those participants would have possibly displayed more structural differences than participants before the age of onset, which might in turn have led to higher classification accuracies.
Unlike previous attempts to detect participants with genetic risk within family cohorts (Hajek et al., 2015), we estimated the individual risk state using state-of-the-art screening instruments, which better address the clinical realities of early recognition centers. Given the variable estimated transition rates of 4.2-22.4% in persons with known genetic risk (Hafeman et al., 2017; Kerner, 2014; Post et al., 2018), there is a need for a more differentiated risk assessment including state markers in order to provide targeted interventions. Moreover, most people seeking early recognition services do not have a genetic risk (it was present in 12.9%, or 15 out of 116, of those recruited via the early recognition pathway). BPSS-P provides a conservative risk assessment, selecting persons displaying an AMSS or a GMRDS. In total, 20.3% of the participants screened positive on one of these syndromes.
Surprisingly, SVM could not detect participants at risk estimated using EPIbipolar, although we detected significant differences in cortical thickness between the high-risk and no-risk individuals in our previous study (Mikolas et al., 2021). Given the similar sample size (previous study N = 263), we pooled the individuals in the high-risk and low-risk groups in order to allow for a binary classification. As a result, the no-risk group had only 32 participants, which might have had a negative influence on the learning phase. In a post hoc analysis (see online Supplementary note 4), a three-category classification using all three risk groups did not yield an above-chance classification. However, after removing the low-risk group, we obtained a balanced accuracy of 60.9/55.5% (10-fold/leave-one-site-out). This suggests that whereas in a hypothesis-driven region-of-interest analysis, EPIbipolar selected participants displaying significantly thinner cortex in the left pars opercularis (Mikolas et al., 2021), BPSS-P selected participants displaying widespread structural alterations that enable a more accurate single-subject classification using machine learning. Interestingly, in our above-mentioned previous study (Mikolas et al., 2021), the pars opercularis was not significantly thinner in participants scoring positive in BPSS-P; however, the low p value might have suggested a non-significant trend. Additionally, among the participants scoring positive on any risk criterion in both EPIbipolar and BARS, those with depression were more strongly represented than in BPSS-P.
Table 3. Results of the secondary classification for all three risk instruments using two different feature selection methods. CT 20, 20 selected cortical features; CT, all regional cortical values; SC, all subcortical volumes; SA, all regional surface area values.
As a result, more participants with unipolar depression might have been selected by EPIbipolar and BARS, which might have impeded the classification. Indeed, the cortical thickness differences in major depression seem to be less prominent than in bipolar disorder (Schmaal et al., 2017). Finally, unlike in BPSS-P, the participants who fulfilled any risk syndrome according to BARS or EPIbipolar differed from those not fulfilling any risk syndrome in several other demographic variables which might have confounding effects on cortical thickness, such as medication or smoking. The regions with the highest contribution toward the classification (i.e. with the highest values of SVM coefficients) partially overlapped with those contributing to the classification of patients with manifest bipolar disorder and healthy controls reported by Nunes et al. (2020). Of the 33 cortical thickness weights reported by Nunes et al., 69.7% had the same sign as in our study. Notably, the inferior frontal gyrus is a region structurally and functionally associated with the genetic risk for bipolar disorder (Hajek et al., 2013; Roberts et al., 2013, 2017). This suggests a consistent structural pattern in individuals at risk estimated by BPSS-P and in patients with manifest disease or genetic risk. A direct comparison of feature weights between Nunes et al. and our study is, however, to be viewed with caution because of the complex covariance structure within the feature set (Haufe et al., 2014), the difference in the number and type of features, and the limited number of training samples in our study. While multivariate machine learning techniques have the potential to optimize prediction accuracies, univariate between-group comparisons, as performed in our previous work (Mikolas et al., 2021), are more straightforward to interpret in terms of relative feature importance.
The achieved accuracy is not sufficient to suggest sMRI as a single risk assessment method. Even using the best performing model, 21.8% of the subjects who did not fulfill any risk criterion were classified as positive (type I error). Among the subjects at risk, 51.8% were classified as negative (type II error). Neither feature selection approaches nor hyperparameter optimization achieved a more accurate classification. Earlier machine learning neuroimaging studies reported accuracies well beyond the 80% boundary that roughly demarcates clinical utility (Radua & Carvalho, 2021). However, many earlier studies did not comply with recently established criteria, for example, by using insufficiently large samples [i.e. N < 130 (Nieuwenhuis et al., 2012)] or omitting validation samples (Radua & Carvalho, 2021). Studies that used validation samples generally reported lower accuracies.
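The two error rates quoted above can be converted back into sensitivity, specificity, and balanced accuracy with a few lines of arithmetic. The snippet assumes that both rates refer to the BPSS-P model evaluated with 10-fold cross-validation, which the text does not state explicitly.

```python
# Error rates as quoted in the text
false_positive_rate = 0.218  # not-at-risk subjects classified as positive
false_negative_rate = 0.518  # at-risk subjects classified as negative

specificity = 1 - false_positive_rate
sensitivity = 1 - false_negative_rate
balanced_accuracy = (sensitivity + specificity) / 2

print(f"sensitivity = {sensitivity:.3f}")
print(f"specificity = {specificity:.3f}")
print(f"balanced accuracy = {balanced_accuracy:.3f}")
```

The result of about 0.632 agrees with the reported balanced accuracy of 63.1% up to rounding; small differences of this kind can also arise when measures are averaged over folds.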
Differentiating between healthy, non-help-seeking persons and help-seeking persons with higher risk for bipolar disorder might lead to higher accuracies. For a potential clinical application, however, this might be misleading, as clinicians are required to make predictions for persons who already display symptoms and therefore seek help. Thus, our population-based sample of help-seeking individuals reflects the real clinical setting better than a comparison with a healthy control group. The very fact that we chose a conservative approach by comparing only help-seeking individuals and yet were able to obtain a clear above-chance prediction of the score in an established risk instrument with mere structural neuroimaging data demonstrates the potential of sMRI in risk stratification. A major advantage over functional MRI is that structural T1 images are part of any standard clinical exam and would not incur additional costs for specialized scanning protocols.
Figure 1. Magnitude of contribution of brain regions to SVM classification. The coefficients of a linear classifier can be interpreted as a relative measure of feature importance. The color represents the mean over folds of the absolute value of the SVM coefficients for each region. The classification was based on the BPSS-P risk instrument using regional cortical thickness values as features in a 10-fold cross-validation.
Overall, our results suggest that in order to achieve clinically meaningful predictions, future approaches using brain imaging should aim at integrating multimodal data, such as clinical data [e.g. body mass index (McWhinney et al., 2021)] or genetics, rather than focusing on brain structure only. An 'augmentation' of the clinical judgments of trained professionals by a machine learning-based algorithm might be a realistic scenario. Koutsouleris et al. (2021) showed in individuals with psychosis risk that, in a multimodal application, sMRI might contribute several percent to the overall prediction. As our study suggests, sMRI, especially cortical thickness, might contribute to the diagnostic performance of such algorithms aimed at estimating the risk for bipolar disorder.
An important limitation that needs to be addressed in future studies is the use of the estimated risk as the outcome. The concept of high risk for bipolar disorder is still in development (Keramatian, Chakrabarty, Saraf, & Yatham, 2021). Participants scoring positive on those risk criteria might benefit from a more intensive diagnostic and prevention process. However, in order to further individualize the risk prediction, larger, longitudinal studies with a sufficient number of participants who develop a first manic episode should be performed in the future.
Lower occurrence of ADHD in the high-risk group was due to the distribution of risk factors within the three different recruitment pathways. Although ADHD as a risk factor enabled the participants to enter the study through all three recruitment pathways, the risk factor ADHD was 'enriched' in the overall sample due to in-and outpatients entering the study through the 'ADHD' recruitment pathway. However, these participants displayed fewer additional risk factors, so that most did not fulfill the criteria for the higher risk groups.
An interesting research objective for future studies would be to include participants with borderline personality disorder, as these might be hard to clinically distinguish from bipolar disorder in its initial or at-risk state. Especially the question whether people that transition to different disorders also differ in brain structure would be highly relevant.
In summary, we show that machine learning techniques can detect brain structural alterations in young individuals at risk for bipolar disorder with a performance comparable to previous studies of patients with manifest disease and healthy controls. Whole-brain cortical thickness might be superior to other structural brain features in predicting the risk to develop bipolar disorder. Future studies should aim to improve the performance of predictive models for individuals at risk by using larger cohorts and multimodal data. Even more sophisticated machine learning methods or methods of feature extraction may contribute to clinically meaningful predictions. Our own study may contribute to this effort in the future (Böhle, Eitel, Weygandt, & Ritter, 2019).
Conflict of interest. K. Leopold has been a consultant and/or advisor to or has received honoraria from: Janssen/J&J, Lundbeck, Otsuka, Recordati, and ROVI. She has received grant support from Janssen/J&J and Otsuka. All other authors declared no conflict of interest.
Ethical standards. The authors assert that all procedures contributing to this work comply with the ethical standards of the relevant national and institutional committees on human experimentation and with the Helsinki Declaration of 1975, as revised in 2008.