Investigating data-driven biological subtypes of psychiatric disorders using specification-curve analysis

Background: Cluster analyses have become popular tools for data-driven classification in biological psychiatric research. However, these analyses are known to be sensitive to the chosen methods and/or modelling options, which may hamper the generalizability and replicability of findings. To gain more insight into this problem, we used Specification-Curve Analysis (SCA) to investigate the influence of methodological variation on biomarker-based cluster-analysis results.

Methods: Proteomics data (31 biomarkers) were used from patients (n = 688) and healthy controls (n = 426) in the Netherlands Study of Depression and Anxiety. In SCAs, consistency of results was evaluated across 1200 k-means and hierarchical clustering analyses, each with a unique combination of clustering algorithm, fit index, and distance metric. Next, SCAs were run in simulated datasets with varying cluster numbers and noise/outlier levels to evaluate the effect of data properties on SCA outcomes.

Results: The real-data SCA showed no robust patterns of biological clustering in either the major depressive disorder (MDD) dataset or a combined MDD/healthy-control dataset. The simulation results showed that the correct number of clusters could be identified quite consistently across the 1200 model specifications, but that correct cluster identification became harder as the number of clusters and the noise levels increased.

Conclusion: SCA can provide useful insights into the presence of clusters in biomarker data. However, SCA is likely to show inconsistent results in real-world biomarker datasets that are complex and contain considerable levels of noise. In such datasets, the number and nature of the observed clusters may depend strongly on the chosen model specification, precluding conclusions about the existence of biological clusters among psychiatric patients.
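The logic of an SCA over clustering specifications can be illustrated with a minimal sketch. This is not the pipeline used in the study: it assumes a toy grid of only two linkage methods and two distance metrics, a single fit index (mean silhouette width), a naive agglomerative algorithm, and a small synthetic dataset with three well-separated groups. All function and variable names (`agglomerate`, `silhouette`, `run_sca`, `selected_k`) are hypothetical.

```python
from collections import Counter
from itertools import product


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


# Toy specification grid (an assumption; the study used a far larger grid).
DISTANCES = {"euclidean": euclidean, "manhattan": manhattan}
LINKAGES = {"single": min, "complete": max}


def agglomerate(points, k, dist, link):
    """Naive agglomerative clustering: merge the closest pair until k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = link(dist(points[i], points[j])
                         for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters.pop(b)  # b > a, so index a is unaffected
        clusters[a].extend(merged)
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def silhouette(points, labels, dist):
    """Mean silhouette width; points in singleton clusters contribute 0."""
    total = 0.0
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if own is None or not by_cluster:
            continue
        a = sum(own) / len(own)
        b = min(sum(v) / len(v) for v in by_cluster.values())
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def run_sca(points, k_values=(2, 3, 4, 5)):
    """For every specification, record the K with the best fit-index value."""
    selected = {}
    for link_name, dist_name in product(LINKAGES, DISTANCES):
        link, dist = LINKAGES[link_name], DISTANCES[dist_name]
        scores = {k: silhouette(points, agglomerate(points, k, dist, link), dist)
                  for k in k_values}
        selected[(link_name, dist_name)] = max(scores, key=scores.get)
    return selected


# Synthetic data: three tight groups of four points each (illustrative only).
offsets = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5)]
centres = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + dx, cy + dy) for cx, cy in centres for dx, dy in offsets]

selected_k = run_sca(points)
for spec, k in selected_k.items():
    print(spec, "->", k)
print(Counter(selected_k.values()))  # tally of selected K across specifications
```

In this sketch, agreement across specifications on one value of K would count as a robust result, whereas a tally spread over several values of K would correspond to the specification-dependent pattern reported for the real data.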

In view of its inflammatory function in innate immunity and its ability to detect a class of ligands through a common structural motif, RAGE (receptor for advanced glycation endproducts) is often referred to as a pattern recognition receptor. The interaction between RAGE and its ligands is thought to result in pro-inflammatory gene activation. (4) Due to an enhanced level of RAGE ligands in diabetes or other chronic disorders, this receptor is hypothesised to have a causative effect in a range of inflammatory diseases such as diabetic complications, Alzheimer's disease and even some tumors. ENRAGE (extracellular newly identified RAGE) is such a ligand. 0 0 0

GROWTH-REGULATED ALPHA PROTEIN
IM Growth-regulated alpha protein (GROa) activates neutrophils. (5) It also plays a role in certain types of cancer, stimulating tumor-associated macrophages and cancer-associated fibroblasts. (6) Bot et al. (2015) observed lower levels of GROa in MDD patients, whereas another study found higher GROa levels in its discovery phase (not validated). (7) 0 0 0

INTERLEUKIN-12P40
IM Subunit beta of interleukin 12 (also known as interleukin-12p40) is a common subunit of interleukin 12 and interleukin 23. Interleukin 12 is a cytokine that acts on T and natural killer cells and has a broad array of biological activities. It is expressed by activated macrophages, serves as an essential inducer of Th1 cell development, and has been found to be important for sustaining a sufficient number of memory/effector Th1 cells to mediate long-term immunity.

Lactoylglutathione lyase (also known as glyoxalase I) catalyzes the first step in the glyoxalase system, a critical two-step detoxification system for methylglyoxal. Methylglyoxal is produced naturally as a byproduct of normal biochemistry but is highly toxic because of its chemical reactions with proteins, nucleic acids, and other cellular components. Experiments suggest that methylglyoxal is preferentially toxic to proliferating cells, such as those in cancer. (11) Glyoxalase I is an attractive target for the development of drugs to treat cancer and infections by some parasitic protozoa. (12)

CC,ST
The six members of the insulin-like growth factor-binding protein family (IGFBP 1-6) are important components of the IGF (insulin-like growth factor) axis. In this capacity, they serve to regulate the activity of both IGF-I and -II polypeptide growth factors. IGFBP-5 also has an important role in controlling cell survival, differentiation and apoptosis. (13) 2 0 2

CC,ST
The urokinase receptor (also known as urokinase-type plasminogen activator receptor, uPAR) was originally identified as a saturable binding site for urokinase on the cell surface. Besides its primary ligand urokinase, uPAR interacts with several other proteins, including vitronectin, the uPAR-associated protein (uPARAP), and the integrin family of membrane proteins. uPAR is part of the plasminogen activation system, which in the healthy body is involved in tissue reorganization events such as mammary gland involution and wound healing. Before tissue can be reorganized, the old tissue must be degraded. An important mechanism in this degradation is the proteolysis cascade initiated by the plasminogen activation system. uPAR binds urokinase and thus restricts plasminogen activation to the immediate vicinity of the cell membrane. (14)
RECEPTOR

IM
Carcinoembryonic antigen (CEA) describes a set of highly related glycoproteins involved in cell adhesion. CEA is normally produced in gastrointestinal tissue during fetal development, but the production stops before birth. Consequently, CEA is usually present at very low levels in the blood of healthy adults (about 20 ng/mL). However, the serum levels are raised in some types of cancer, which means that it can be used as a tumor marker in clinical tests. (25) Serum levels can also be elevated in heavy smokers.
is a small protein that stimulates the growth of new blood vessels through the process of angiogenesis. It is associated with cancer and neurological disease through angiogenesis and through activating gene expression that suppresses apoptosis.

VASCULAR ENDOTHELIAL GROWTH FACTOR
CC,ST Vascular endothelial growth factor (VEGF) is a signal protein produced by cells that stimulates the formation of blood vessels. VEGF is involved in both vasculogenesis (the de novo formation of the embryonic circulatory system) and angiogenesis (the growth of blood vessels from pre-existing vasculature). It is part of the system that restores the oxygen supply to tissues when blood circulation is inadequate, such as in hypoxic conditions. Serum concentrations of VEGF are high in bronchial asthma and diabetes mellitus. (26) 2 1 1

APOLIPOPROTEIN A4
T Apolipoproteins are proteins that bind lipids (oil-soluble substances such as fat and cholesterol) to form lipoproteins. They transport the lipids through the lymphatic and circulatory systems. Apolipoproteins also serve as enzyme cofactors, receptor ligands, and lipid transfer carriers that regulate the metabolism of lipoproteins and their uptake in tissues. Intestinal fat absorption dramatically increases the synthesis and secretion of apo A-IV.
ApoD has also been shown to be an important link in the transient interaction between HDL and LDL particles and between HDL particles and cells. ApoD is associated with neurological disorders and nerve injury, especially those related to the myelin sheath, and is elevated in patients with schizophrenia, bipolar disorder, and Alzheimer's disease. (27) 126 124 2

CC,ST
The fatty-acid-binding proteins (FABPs) are a family of transport proteins for fatty acids and other lipophilic substances such as eicosanoids and retinoids. These proteins are thought to facilitate the transfer of fatty acids between extra- and intracellular membranes. It has been implicated in heart disease and diabetes (28) as well as asthma (29).

CC,ST
Pancreatic polypeptide (PPP) is a polypeptide secreted by PP cells in the endocrine pancreas predominantly in the head of the pancreas. The function of PP is to self-regulate pancreatic secretion activities (endocrine and exocrine); it also has effects on hepatic glycogen levels and gastrointestinal secretions. Its secretion in humans is increased after a protein meal, fasting, exercise, and acute hypoglycemia and is decreased by somatostatin and intravenous glucose. Plasma PP has been shown to be reduced in conditions associated with increased food intake and elevated in anorexia nervosa. (30) 1 0 1

PM
The von Willebrand factor (vWF) is a large multimeric glycoprotein present in blood plasma and produced constitutively as ultra-large vWF in endothelium (in the Weibel-Palade bodies), megakaryocytes (α-granules of platelets), and subendothelial connective tissue. It is involved in hemostasis. Increased plasma levels in a large number of cardiovascular, neoplastic, and connective tissue diseases are presumed to arise from adverse changes to the endothelium, and may contribute to an increased risk of thrombosis.

x x 2 single binary dunn x x x x x 2 single binary frey x x x x x 3 single binary friedman x x x 1 single binary mcclain x x x x x 2 single binary ptbiserial x x x x x 2 single binary silhouette x x x x x 2 single binary tracew x x x 1 single binary trcovw x x x 1 single euclidean beale x 1 single euclidean ccc x 2 single manhattan beale x 1 single manhattan ccc x 1 single minkowski beale x 1 single minkowski

Supplementary Table S5. Comparing model results with and without removal of small (≤1%) clusters in the MDD dataset.
The rows show the number of clusters (K) per model after removal of small (≤1%) clusters, as used in the SCA plots. The columns show the number of clusters the same models had before removal of small (≤1%) clusters. The diagonal shows the number of models that contained only large (>1%) clusters.
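The cross-tabulation described in this caption can be sketched as follows. This is an illustrative reconstruction rather than the code used for the table: the helper names (`count_large_clusters`, `cross_tabulate`) and the three toy label vectors are assumptions, and a cluster is treated as small when it holds at most 1% of the observations.

```python
from collections import Counter


def count_large_clusters(labels, min_share=0.01):
    """Number of clusters holding more than min_share of all observations."""
    n = len(labels)
    sizes = Counter(labels)
    return sum(1 for size in sizes.values() if size / n > min_share)


def cross_tabulate(models, min_share=0.01):
    """Count models by (K after removal, K before removal) of small clusters."""
    table = Counter()
    for labels in models:
        k_before = len(set(labels))
        k_after = count_large_clusters(labels, min_share)
        table[(k_after, k_before)] += 1
    return table


# Three hypothetical model solutions for 200 observations each.
models = [
    [0] * 100 + [1] * 100,            # two large clusters
    [0] * 120 + [1] * 78 + [2] * 2,   # third cluster holds exactly 1%: small
    [0] * 199 + [1] * 1,              # second cluster holds 0.5%: small
]

table = cross_tabulate(models)
for (k_after, k_before), n_models in sorted(table.items()):
    print(f"K after = {k_after}, K before = {k_before}: {n_models} model(s)")
```

Models on the diagonal (K after equal to K before) are those with only large clusters; off-diagonal cells count models whose K dropped once small clusters were removed.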