Investigating data-driven biological subtypes of psychiatric disorders using specification-curve analysis

Published online by Cambridge University Press: 11 August 2020

Lian Beijers*
Affiliation:
Department of Psychiatry, University of Groningen, University Medical Center Groningen, Interdisciplinary Center Psychopathology and Emotion regulation (ICPE), Groningen, The Netherlands
Hanna M. van Loo
Affiliation:
Department of Psychiatry, University of Groningen, University Medical Center Groningen, Interdisciplinary Center Psychopathology and Emotion regulation (ICPE), Groningen, The Netherlands
Jan-Willem Romeijn
Affiliation:
Faculty of Philosophy, University of Groningen, Groningen, The Netherlands
Femke Lamers
Affiliation:
GGZ inGeest and Department of Psychiatry, Amsterdam Public Health Research Institute, VU University Medical Center, Amsterdam, The Netherlands
Robert A. Schoevers
Affiliation:
Department of Psychiatry, University of Groningen, University Medical Center Groningen, Interdisciplinary Center Psychopathology and Emotion regulation (ICPE), Groningen, The Netherlands; Department of Psychiatry, University of Groningen, University Medical Center Groningen, Research School of Behavioural and Cognitive Neurosciences, Groningen, The Netherlands
Klaas J. Wardenaar
Affiliation:
Department of Psychiatry, University of Groningen, University Medical Center Groningen, Interdisciplinary Center Psychopathology and Emotion regulation (ICPE), Groningen, The Netherlands
*Author for correspondence: Lian Beijers, E-mail: l.beijers@umcg.nl

Abstract

Background

Cluster analyses have become popular tools for data-driven classification in biological psychiatric research. However, these analyses are known to be sensitive to the chosen methods and/or modelling options, which may hamper generalizability and replicability of findings. To gain more insight into this problem, we used Specification-Curve Analysis (SCA) to investigate the influence of methodological variation on biomarker-based cluster-analysis results.

Methods

Proteomics data (31 biomarkers) from patients (n = 688) and healthy controls (n = 426) in the Netherlands Study of Depression and Anxiety were analysed. In the SCAs, the consistency of results was evaluated across 1200 k-means and hierarchical clustering analyses, each using a unique combination of clustering algorithm, fit index, and distance metric. Next, SCAs were run on simulated datasets with varying numbers of clusters and noise/outlier levels to evaluate the effect of data properties on SCA outcomes.
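The idea of running one clustering analysis per specification and collecting the estimated number of clusters (K) can be illustrated with a minimal, standard-library-only sketch. Everything named here is an assumption made for illustration and is not the study's implementation: the function names (`cluster`, `silhouette`, `run_sca`), the toy data `POINTS`, and the option lists. The sketch crosses only two centre-based algorithms (k-means and a k-medians variant, standing in for the study's k-means and hierarchical methods) with two distance metrics, and uses the mean silhouette width as its single fit index, giving 4 specifications rather than 1200.

```python
import itertools
import statistics


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical option lists; the study's actual options are not given here.
DISTANCES = {"euclidean": euclidean, "manhattan": manhattan}
CENTRES = {"k-means": statistics.fmean, "k-medians": statistics.median}


def cluster(points, k, dist, centre, n_iter=50):
    """Lloyd-style clustering with deterministic farthest-point initialisation."""
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(dist(p, c) for c in centres)))
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: dist(p, centres[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster became empty
                centres[j] = tuple(centre(col) for col in zip(*members))
    return labels


def silhouette(points, labels, dist):
    """Mean silhouette width; singleton clusters score 0."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if own is None or not by_cluster:
            scores.append(0.0)
            continue
        a = statistics.fmean(own)
        b = min(statistics.fmean(d) for d in by_cluster.values())
        scores.append((b - a) / (max(a, b) or 1.0))
    return statistics.fmean(scores)


def run_sca(points, k_range=range(2, 6)):
    """Return one row per specification, sorted by the estimated optimal K."""
    rows = []
    for (alg, centre), (metric, dist) in itertools.product(CENTRES.items(), DISTANCES.items()):
        best_k = max(
            k_range,
            key=lambda k: silhouette(points, cluster(points, k, dist, centre), dist),
        )
        rows.append({"algorithm": alg, "distance": metric, "best_k": best_k})
    return sorted(rows, key=lambda r: r["best_k"])


# Toy data: two well-separated groups of four points each.
POINTS = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2),
          (5.0, 5.1), (5.2, 5.0), (5.1, 5.3), (5.3, 5.2)]

for row in run_sca(POINTS):
    print(row)
```

Sorting the rows by `best_k` mirrors the layout of a descriptive specification curve: the estimates form the upper panel and the remaining columns record the analytic decisions behind each estimate. On clearly separated toy data such as `POINTS`, the specifications should agree; disagreement across rows is the signal that the result depends on the chosen specification.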

Results

The real-data SCA showed no robust patterns of biological clustering in either the major depressive disorder (MDD) dataset or the combined MDD/healthy-control dataset. In the simulations, the correct number of clusters was identified quite consistently across the 1200 model specifications, but correct identification became harder as the number of clusters and the noise level increased.
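How simulated datasets with a known number of clusters and a controllable outlier level might be generated can be sketched as follows. This is an illustrative assumption only, not the simulation design used in the study: the function name `simulate_clusters`, its parameters, the Gaussian cluster shape, the uniform outlier model, and the outlier fractions are all hypothetical. Only the cluster numbers K = 2, 5, and 10 are taken from the figure captions.

```python
import random


def simulate_clusters(k, n_per_cluster=50, n_features=2, spread=1.0,
                      separation=10.0, outlier_frac=0.0, seed=0):
    """Gaussian clusters with evenly spaced centres, plus uniform outliers.

    Returns (points, labels); outliers carry the label -1.
    """
    rng = random.Random(seed)
    points, labels = [], []
    for c in range(k):
        centre = [c * separation] * n_features  # centres lie on the diagonal
        for _ in range(n_per_cluster):
            points.append(tuple(rng.gauss(m, spread) for m in centre))
            labels.append(c)
    # Replace a random fraction of points with uniformly scattered outliers.
    n_out = int(round(outlier_frac * len(points)))
    for i in rng.sample(range(len(points)), n_out):
        points[i] = tuple(
            rng.uniform(-separation, separation * k) for _ in range(n_features)
        )
        labels[i] = -1
    return points, labels


# A small grid of simulated datasets: true K crossed with outlier fraction.
datasets = {
    (k, frac): simulate_clusters(k, n_per_cluster=20, outlier_frac=frac, seed=1)
    for k in (2, 5, 10)
    for frac in (0.0, 0.1, 0.2)  # hypothetical outlier levels
}
```

Because the true K is known for every simulated dataset, each specification's estimate can be scored as correct or incorrect, which is what makes it possible to say how consistently the correct number of clusters is recovered as K and the noise level grow.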

Conclusion

SCA can provide useful insights into the presence of clusters in biomarker data. However, SCA is likely to show inconsistent results in real-world biomarker datasets that are complex and contain considerable levels of noise. In such datasets, the number and nature of the observed clusters may depend strongly on the chosen model specification, precluding conclusions about the existence of biological clusters among psychiatric patients.

Information

Type
Original Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © The Author(s), 2020. Published by Cambridge University Press

Fig. 1. Flowchart of the complete analytical process, including real data preparation, data simulation and specification curve analysis.

Table 1. Biochemical analytes and associated biological processes

Fig. 2. Descriptive Specification Curve in the sample with MDD subjects only, with small clusters (⩽1% of subjects) removed. Each black dot in the top panel depicts an estimate of the optimal number of clusters (K) from a different specification; the dots vertically aligned in the lower panel indicate the analytic decisions behind those estimates. The green lines indicate the expected range of results at each position. N.B. this is not the expected range of the specific combination of options, but rather the range of the mth smallest K.

Table 2. Stability measures of models with different numbers of clusters (K) for the MDD dataset

Fig. 3. Specification curves based on simulated datasets with K = 2, with small clusters (⩽1% of subjects) removed. Each dot in the top panel depicts an estimate of the optimal number of clusters (K) from a different specification; the dots vertically aligned in the lower panel indicate the analytic decisions behind the estimates of the baseline analysis. N.B. the analytic decisions behind the other analyses are not presented here.

Fig. 4. Specification curves based on simulated datasets with K = 5, with small clusters (⩽1% of subjects) removed. Each dot in the top panel depicts an estimate of the optimal number of clusters (K) from a different specification; the dots vertically aligned in the lower panel indicate the analytic decisions behind the estimates of the baseline analysis. N.B. the analytic decisions behind the other analyses are not presented here.

Fig. 5. Specification curves based on simulated datasets with K = 10, with small clusters (⩽1% of subjects) removed. Each dot in the top panel depicts an estimate of the optimal number of clusters (K) from a different specification; the dots vertically aligned in the lower panel indicate the analytic decisions behind the estimates of the baseline analysis. N.B. the analytic decisions behind the other analyses are not presented here.

Supplementary material: PDF

Beijers et al. supplementary material
