Measuring Distances in High Dimensional Spaces

Why Average Group Vector Comparisons Exhibit Bias, And What to Do about it

Published online by Cambridge University Press: 22 January 2025

Breanna Green
Affiliation:
PhD candidate, Information Science, Cornell University, Ithaca, NY, USA
William Hobbs*
Affiliation:
Assistant Professor, Department of Psychology and Department of Government, Cornell University, Ithaca, NY, USA
Sofia Avila
Affiliation:
PhD student, Department of Sociology, Princeton University, Princeton, NJ, USA
Pedro L. Rodriguez
Affiliation:
Visiting Scholar, Center for Data Science, New York University, New York, NY, USA; and International Faculty, Instituto de Estudios Superiores de Administración (IESA), Caracas, Venezuela
Arthur Spirling
Affiliation:
Professor, Department of Politics, Princeton University, Princeton, NJ, USA
Brandon M. Stewart
Affiliation:
Associate Professor, Department of Sociology and Office of Population Research, Princeton University, Princeton, NJ, USA
*Corresponding author: William Hobbs; Email: hobbs@cornell.edu

Abstract

Analysts often seek to compare representations in high-dimensional space, e.g., embedding vectors of the same word across groups. We show that the distance measures calculated in such cases can exhibit considerable statistical bias that stems from uncertainty in the estimation of the elements of those vectors. This problem applies to Euclidean distance, cosine similarity, and other similar measures. After illustrating the severity of this problem for text-as-data applications, we provide and validate a bias correction for the squared Euclidean distance. The same correction also substantially reduces bias in ordinary Euclidean distance and cosine similarity estimates, but the corrected estimates for these measures are not quite unbiased and are (non-intuitively) bimodal when distances are close to zero. The estimators require obtaining the variance of the latent positions. We (will) implement the estimator in free software, and we offer recommendations for related work.
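
The source of the bias can be illustrated with a small simulation. What follows is a minimal sketch under stated assumptions, not the authors' implementation: it assumes each group's vectors are i.i.d. Gaussian draws, takes group means as the estimated positions, and assumes the correction for the squared Euclidean distance amounts to subtracting the summed estimated sampling variances of the two group means (which follows from the expected squared norm of a noisy estimate being the true squared norm plus the sum of the elementwise variances). The function names and every parameter value are hypothetical and chosen only for illustration.

```python
import random


def group_mean_and_var(rows):
    """Per-dimension mean, and the estimated sampling variance of that mean."""
    n = len(rows)
    k = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    var_of_mean = [
        sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1) / n
        for j in range(k)
    ]
    return means, var_of_mean


def squared_distances(group_a, group_b):
    """Return (naive, corrected) squared Euclidean distance between group means."""
    mean_a, var_a = group_mean_and_var(group_a)
    mean_b, var_b = group_mean_and_var(group_b)
    naive = sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))
    # Assumed correction: remove the expected contribution of estimation noise.
    corrected = naive - sum(var_a) - sum(var_b)
    return naive, corrected


def simulate(n_a, n_b, k=50, true_gap=0.1, reps=100, seed=0):
    """Average the naive and corrected estimates over repeated samples.

    The true squared distance between group means is k * true_gap ** 2.
    """
    rng = random.Random(seed)
    naive_sum = 0.0
    corrected_sum = 0.0
    for _ in range(reps):
        a = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n_a)]
        b = [[rng.gauss(true_gap, 1.0) for _ in range(k)] for _ in range(n_b)]
        naive, corrected = squared_distances(a, b)
        naive_sum += naive
        corrected_sum += corrected
    return naive_sum / reps, corrected_sum / reps


if __name__ == "__main__":
    true_value = 50 * 0.1 ** 2
    # Same total sample size, balanced versus imbalanced groups.
    for n_a, n_b in [(100, 100), (180, 20)]:
        naive, corrected = simulate(n_a, n_b)
        print(f"n=({n_a},{n_b}) true={true_value:.2f} "
              f"naive={naive:.2f} corrected={corrected:.2f}")
```

In this setup the naive estimate should sit well above the true value and further above it for the imbalanced split, while the corrected estimate should average close to the true value. Individual corrected estimates can be negative, which is one reason taking a square root to recover an ordinary Euclidean distance is less straightforward.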

Information

Type
Letter
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of The Society for Political Methodology

Figure 1 Smaller sample sizes and larger group imbalance both lead to increased estimate uncertainty, and so artificially inflate distance estimates. This can exaggerate majority-minority group differences relative to equally sized group differences.

Figure 2 This figure shows simulation results for the squared Euclidean norm divided by the number of dimensions (i.e., the square of the true $\beta$'s). The horizontal black lines represent the true Euclidean norm squared, divided by the number of dimensions (50). Points represent the averages of the simulations and intervals are the $2.5\%$ to $97.5\%$ quantiles of the sampling distributions. Small sample size and greater group imbalance increase estimation uncertainty (i.e., the standard error/variance of $\hat{\beta}$). The effect of greater $k$ can be determined by multiplying the y-axis scale by $k$.

Figure 3 Estimator performance on sub-samples of Twitter data set.

Supplementary material: File

Green et al. supplementary material

File 857.5 KB