
Using large biobanks for psychiatric genomic research: Consistency of clinical and genetic aspects of recorded depression across US states in the All of Us Research Program

Published online by Cambridge University Press:  23 January 2026

Katherine M. Keyes*
Affiliation:
Columbia University Department of Epidemiology, New York, USA
Catherine Gimbrone
Affiliation:
Columbia University Department of Epidemiology, USA
Caroline Rutherford
Affiliation:
Columbia University, USA
Yingzhe Zhang
Affiliation:
Harvard University Department of Epidemiology, USA
Karmel Choi
Affiliation:
Massachusetts General Hospital Department of Psychiatry, USA
Louisa Smith
Affiliation:
Northeastern University Bouve College of Health Sciences, USA
Philip Greenland
Affiliation:
Northwestern University Feinberg School of Medicine, USA
Jordan W. Smoller
Affiliation:
Harvard University Department of Epidemiology, USA
Maria Argos
Affiliation:
Boston University, USA
*Corresponding author: Katherine Keyes; Email: kmk2104@columbia.edu

Abstract

Background

Large biobanks offer unprecedented data for psychiatric genomic research, but concerns exist about representativeness and generalizability. This study examined depression prevalence and polygenic risk score (PRS) associations in the All of Us data to assess potential impacts of nonrepresentative sampling.

Methods

Depression prevalence and correlates were analyzed in two subsamples: those with self-reported personal medical history (PMH) data (N = 185,232 overall; N = 114,739 with genetic data) and those with electronic health record (EHR) data (N = 287,015 overall; N = 206,175 with genetic data). PRS weights were estimated across ancestry groups. Associations of PRS with depression were examined by state and ancestry.

Results

Depression prevalence varied across states in both PMH (16.7–35.9%) and EHR (0.2–45.8%) data. Concordance between PMH and EHR diagnoses was low (kappa: 0.29, 95% CI: 0.30–0.30). Overall, a one-standard-deviation increase in depression PRS was associated with lifetime depression based on PMH (odds ratio [OR] = 1.05, 95% confidence interval [CI]: 1.04–1.07) and EHR (OR = 1.05, 95% CI: 1.04–1.07). Results were generally consistent by ancestry, with the strongest signal for European ancestry (PMH: OR = 1.10, 95% CI: 1.08–1.12; EHR: OR = 1.07, 95% CI: 1.05–1.10). Associations between PRS and lifetime depression were largely consistent by state of residence in both subsamples, and the statistically significant associations varied minimally (ORs = 1.06–1.45).
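For readers less familiar with the concordance statistic reported above, the following is a minimal sketch of how Cohen's kappa is computed for two binary indicators, such as self-reported and EHR-recorded depression. The function name and the cell counts are hypothetical and chosen for illustration only; they are not taken from the study, and the authors' actual software and procedure are not described in this abstract.

```python
def cohens_kappa(both_yes, a_only, b_only, both_no):
    """Cohen's kappa for two binary ratings, given the four cells of a 2x2 table.

    both_yes: positive on both sources (e.g. PMH and EHR)
    a_only:   positive on source A only
    b_only:   positive on source B only
    both_no:  negative on both sources
    """
    n = both_yes + a_only + b_only + both_no
    if n == 0:
        raise ValueError("table is empty")

    # Observed agreement: share of participants classified the same way by both sources.
    observed = (both_yes + both_no) / n

    # Agreement expected by chance, from each source's marginal prevalence.
    a_yes = (both_yes + a_only) / n
    b_yes = (both_yes + b_only) / n
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)

    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")

    # Kappa rescales observed agreement so that 0 is chance level and 1 is perfect.
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(round(cohens_kappa(both_yes=20, a_only=5, b_only=10, both_no=15), 2))
```

Because kappa corrects for chance agreement, two sources can agree on most participants and still yield a low kappa when the condition is common or the marginal prevalences differ.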

Conclusions

Recorded depression prevalence by state in All of Us demonstrates a wide range, likely reflecting recruitment differences, EHR data completeness, and true geographic variation; yet PRS associations remained relatively stable. As studies like All of Us expand, accounting for sample composition and measurement approaches will be crucial for generating actionable findings.

Information

Type
Original Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press

Figure 1. Ascertained depression prevalence by sample* and state of residence. (A) PMH Sample (N = 185,232*). (B) EHR Sample (N = 287,015*).
*In the PMH sample, data are based on self-reported depression. In the EHR sample, data are based on available EHR records. Note that not all sites provided access to mental health EHR data; thus, the prevalence is based on what was submitted and may not reflect total depression prevalence in all possible EHR data.
Note: Participants in states in gray were excluded from analyses if the state did not enroll any participants, or if the number of enrolled participants was less than 500. States included in analyses: Alabama, Arizona, California, Colorado, Connecticut, Florida, Georgia, Iowa, Illinois, Indiana, Kansas, Louisiana, Massachusetts, Maryland, Michigan, Minnesota, Missouri, Mississippi, North Carolina, New Hampshire, New Jersey, New Mexico, New York, Ohio, Oregon, Pennsylvania, South Carolina, Tennessee, Texas, Virginia, Washington, and Wisconsin.


Table 1. Demographic distribution of samples by depression diagnosis


Figure 2. Forest plot of odds ratios and 95% confidence intervals for one standard deviation change in PRS with lifetime depression overall and stratified by state of residence in both the PMH (N = 108,928) and EHR (N = 192,667) genomic subsamples. (A) PMH genomic subsample. (B) EHR genomic subsample.
Notes: States and corresponding estimates in blue denote locations of All of Us enrollment centers (All of Us Research Program, 2024b). Models adjusted for 10 PCs.
*Statistical significance after false discovery rate correction.


Figure 3. Forest plot of odds ratios and 95% confidence intervals for one standard deviation change in PRS with self-reported lifetime depression by genetic ancestry and state of residence in the PMH genomic subsample.
Note: States and corresponding estimates in blue denote locations of All of Us enrollment centers (All of Us Research Program, 2024b). Models adjusted for 10 PCs. Models with sample sizes <500 overall and/or ≤5 for either response to a binary depression outcome were excluded from analyses in order to ensure statistical power and compliance with data dissemination policies. Estimates in subgroups with smaller sample sizes should be interpreted with caution.
*Statistical significance after false discovery rate correction.


Figure 4. Forest plot of odds ratios and 95% confidence intervals for one standard deviation change in PRS on diagnosed lifetime depression by genetic ancestry and state of residence in the EHR genomic subsample.
Note: States and corresponding estimates in blue denote locations of All of Us enrollment centers (All of Us Research Program, 2024b). Models adjusted for 10 PCs. Models with sample sizes <500 overall and/or ≤5 for either response to a binary depression outcome were excluded from analyses in order to ensure statistical power and compliance with data dissemination policies. Estimates in subgroups with smaller sample sizes should be interpreted with caution.
*Statistical significance after false discovery rate correction.

Supplementary material: File

Keyes et al. supplementary material

File 711.3 KB