
Genetic Similarity Clustering Using the UK Biobank as a Reference Dataset

Published online by Cambridge University Press: 28 April 2025

Ngoc-Quynh Le*
Affiliation:
Statistical Genetics Lab, QIMR Berghofer Medical Research Institute, Herston, Brisbane, QLD, Australia; Faculty of Medicine, University of Queensland, Brisbane, QLD, Australia
Puya Gharahkhani
Affiliation:
Statistical Genetics Lab, QIMR Berghofer Medical Research Institute, Herston, Brisbane, QLD, Australia
Stuart MacGregor
Affiliation:
Statistical Genetics Lab, QIMR Berghofer Medical Research Institute, Herston, Brisbane, QLD, Australia
*Corresponding author: Ngoc-Quynh Le; Email: NgocQuynh.Le@qimrberghofer.edu.au

Abstract

Incorporating genetic data from diverse populations is crucial for understanding genetic contributions to disease and for ensuring equity in healthcare. However, existing reference panels either capture a limited number of populations or have small sample sizes. We examine the UK Biobank's performance as a reference for clustering genetically similar individuals. By leveraging data from participants of diverse origins, we aim to improve population representation and mitigate the bias caused by the limited number of populations in other reference panels. We combined the UK Biobank's country-of-birth and ethnic-background data fields with genetic information to infer genetically similar population labels. A random forest model was then trained on genetic principal components to identify each individual's most genetically similar population. The model's performance was validated using the 1000 Genomes and CARTaGENE biobank data. We identified a more diverse set of reference populations than is present in datasets such as 1000 Genomes, covering 19 populations worldwide. Our model achieved medium to high precision and recall for most labeled populations, although lower rates were observed in closely related groups. For instance, we identified 519 people in CARTaGENE as most genetically similar to the Middle Eastern reference sample derived from the UK Biobank (there are no Middle Eastern samples in 1000 Genomes), yielding 81.1% precision and 97.0% recall relative to demographic-based information. This practical approach to clustering genetically similar individuals using existing biobank data may facilitate downstream analyses, such as genome-wide association studies or polygenic risk scores, in populations underrepresented in genetic studies.
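The per-population precision and recall reported above compare the model's assigned labels with demographic-based labels. The following is a minimal sketch of that calculation, not the authors' code: the function name `per_label_precision_recall` and the toy labels "A", "B" and "C" are hypothetical and carry no real data.

```python
from collections import Counter


def per_label_precision_recall(reference, predicted):
    """Return {label: (precision, recall)} for two equal-length label lists.

    reference: demographic-based labels (treated as the comparison standard)
    predicted: labels assigned by a genetic-similarity classifier
    A metric is None when its denominator is zero.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")

    ref_count = Counter(reference)    # individuals per reference label
    pred_count = Counter(predicted)   # individuals per predicted label
    true_pos = Counter(r for r, p in zip(reference, predicted) if r == p)

    scores = {}
    for label in sorted(set(ref_count) | set(pred_count)):
        # precision: of those assigned this label, the share that match
        precision = (true_pos[label] / pred_count[label]
                     if pred_count[label] else None)
        # recall: of those with this reference label, the share recovered
        recall = (true_pos[label] / ref_count[label]
                  if ref_count[label] else None)
        scores[label] = (precision, recall)
    return scores


# Toy, made-up labels purely for illustration (not study data).
reference = ["A", "A", "A", "B", "B", "C"]
predicted = ["A", "A", "B", "B", "B", "A"]
scores = per_label_precision_recall(reference, predicted)
```

In this toy input, label "C" is never predicted, so its precision is undefined (`None`) while its recall is zero. Closely related groups would show up in the same way, as individuals of one reference label being assigned to a neighbouring label.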

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of International Society for Twin Studies
Table 1. Training data for random forest model

Table 2. Genetic similarity clustering accuracy in CARTaGENE

Table 3. Genetic similarity clustering accuracy in 1000 Genomes

Supplementary material: File

Le et al. supplementary material (File, 90.3 KB)