Robust Estimation of Polychoric Correlation

Published online by Cambridge University Press: 01 December 2025

Max Welz*
Affiliation:
Department of Psychology, University of Zurich, Switzerland; Department of Econometrics, Erasmus University Rotterdam, The Netherlands
Patrick Mair
Affiliation:
Department of Psychology, Harvard University, USA
Andreas Alfons
Affiliation:
Department of Econometrics, Erasmus University Rotterdam, The Netherlands
*Corresponding author: Max Welz; Email: max.welz@uzh.ch

Abstract

Polychoric correlation is often an important building block in the analysis of rating data, particularly for structural equation models. However, the commonly employed maximum likelihood (ML) estimator is highly susceptible to misspecification of the polychoric correlation model, for instance, through violations of latent normality assumptions. We propose a novel estimator that is designed to be robust against partial misspecification of the polychoric model, that is, when the model is misspecified for an unknown fraction of observations, such as careless respondents. To this end, the estimator minimizes a robust loss function based on the divergence between observed frequencies and theoretical frequencies implied by the polychoric model. In contrast to existing literature, our estimator makes no assumption on the type or degree of model misspecification. It furthermore generalizes ML estimation, is consistent as well as asymptotically normally distributed, and comes at no additional computational cost. We demonstrate the robustness and practical usefulness of our estimator in simulation studies and an empirical application on a Big Five administration. In the latter, the polychoric correlation estimates of our estimator and ML differ substantially, which, after further inspection, is likely due to the presence of careless respondents that the estimator helps identify.

Information

Type
Theory and Methods
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of Psychometric Society

Figure 1 Simulated data with $K_X=K_Y=5$ response options where the polychoric model is misspecified with contamination fraction $\varepsilon =0.15$. Note: The gray dots represent random draws of $(\xi ,\eta )$ from the polychoric model with $\rho _{\star}=0.5$, whereas the orange dots represent draws from a contamination distribution that primarily inflates the cell $(x,y)=(5,1)$. The contamination distribution is bivariate normal with mean $(2.5,-2.5)^\top $, variances $(0.25, 0.25)^\top $, and zero correlation. The blue lines indicate the locations of the thresholds. In each cell, the numbers in parentheses denote the population probability of that cell under the true polychoric model.
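The data-generating process described in this caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' simulation code: the function name `simulate_contaminated`, the seed, and in particular the threshold values `(-1.5, -0.5, 0.5, 1.5)` are assumptions, since the caption does not report the thresholds used in the article.

```python
import random
from bisect import bisect_right
from collections import Counter


def simulate_contaminated(n, rho=0.5, eps=0.15,
                          thresholds=(-1.5, -0.5, 0.5, 1.5), seed=0):
    """Draw n ordinal pairs (x, y) from a partially misspecified polychoric model.

    With probability 1 - eps the latent pair (xi, eta) is standard bivariate
    normal with correlation rho; with probability eps it comes from the
    contamination distribution in the caption: independent normals with
    mean (2.5, -2.5) and variance 0.25 (standard deviation 0.5).
    The thresholds are HYPOTHETICAL; the caption does not state them.
    """
    rng = random.Random(seed)
    scale = (1.0 - rho ** 2) ** 0.5
    data = []
    for _ in range(n):
        if rng.random() < eps:
            # Contaminated draw, concentrated near large xi and small eta.
            xi = rng.gauss(2.5, 0.5)
            eta = rng.gauss(-2.5, 0.5)
        else:
            # Draw from the polychoric model: correlated standard normals.
            xi = rng.gauss(0.0, 1.0)
            eta = rho * xi + scale * rng.gauss(0.0, 1.0)
        # Discretize: category k means the latent value lies between
        # threshold k-1 and threshold k, giving categories 1..5.
        x = bisect_right(thresholds, xi) + 1
        y = bisect_right(thresholds, eta) + 1
        data.append((x, y))
    return data


sample = simulate_contaminated(20000)
freq = Counter(sample)
print("relative frequency of cell (5, 1):", freq[(5, 1)] / len(sample))
```

Under these assumed thresholds, nearly all contaminated draws fall into cell $(5,1)$, which is almost empty under the polychoric model itself. That kind of inflated cell is what the caption describes.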

Figure 2 Visualization of the robust discrepancy function $\varphi (z)$ in (5.3) for $c = 0.6$ (solid line) and the ML discrepancy function $\varphi ^{\mathrm {MLE}}(z) = (z+1) \log (z+1)$ (dotted line).
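To illustrate why $\varphi ^{\mathrm {MLE}}$ corresponds to ML estimation, the sketch below evaluates a discrepancy-based loss and compares it with the Kullback-Leibler divergence between observed and model-implied cell frequencies. The loss form $\sum \varphi (f/p - 1)\, p$ is an assumption made for illustration, in line with the abstract's description of a divergence between observed and theoretical frequencies. The function names are hypothetical. The robust $\varphi $ of (5.3) is not reproduced because its definition does not appear on this page.

```python
import math


def phi_mle(z):
    """ML discrepancy from the Figure 2 caption: (z + 1) * log(z + 1).

    The value at z = -1 is set to 0, using the convention 0 * log(0) = 0.
    """
    return 0.0 if z == -1.0 else (z + 1.0) * math.log(z + 1.0)


def discrepancy_loss(freqs, probs, phi=phi_mle):
    """ASSUMED loss form: sum over cells of phi(f / p - 1) * p.

    freqs are observed relative frequencies, probs are model-implied
    cell probabilities (all strictly positive).
    """
    return sum(phi(f / p - 1.0) * p for f, p in zip(freqs, probs))


def kl_divergence(freqs, probs):
    """Kullback-Leibler divergence sum f * log(f / p), skipping empty cells."""
    return sum(f * math.log(f / p) for f, p in zip(freqs, probs) if f > 0)


observed = [0.2, 0.5, 0.3, 0.0]
implied = [0.25, 0.25, 0.25, 0.25]
print(discrepancy_loss(observed, implied), kl_divergence(observed, implied))
```

Because $\varphi ^{\mathrm {MLE}}(f/p - 1)\, p = f \log (f/p)$, the two quantities agree cell by cell, and minimizing the Kullback-Leibler divergence over the model parameters is equivalent to maximizing the multinomial likelihood.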

Figure 3 The population estimand $\rho _0$ of the polychoric correlation coefficient for various contamination fractions $\varepsilon $ (x-axis) and tuning constants c (line colors), for the same contamination distribution as in Figure 1. Note: The ML estimand corresponds to $c=+\infty $. There are $K_X=K_Y=5$ response options and the true value corresponds to $\rho _{\star} = 0.5$ (dashed line).

Figure 4 Boxplot visualization of the bias of three estimators of the polychoric correlation coefficient, $\widehat {\rho }_N - \rho _{\star}$, for various contamination fractions in the misspecified polychoric model across 5,000 repetitions. Note: The estimators are the robust estimator with $c=0.6$ (left), the MLE (center), and the Pearson sample correlation (right). Diamonds represent the respective average bias. The dashed line denotes the value 0 and the dotted line the value $-\rho _{\star} = -0.5$; a bias at or below the dotted line indicates a sign flip in the correlation estimate.

Table 1 Results for the robust estimator with $c=0.6$, the MLE, and the Pearson sample correlation, for various contamination fractions across 5,000 simulated datasets

Table 2 Correlation matrix of $r=5$ latent variables as in Foldnes & Grønneberg (2020)

Figure 5 Absolute average bias (top) and confidence interval coverage (bottom) at nominal level 95% (dashed horizontal lines) of the robust estimator with $c=0.6$ (left) and the MLE (right) for each unique pairwise polychoric correlation coefficient in the true correlation matrix (Table 2), expressed as a function of the contamination fraction $\varepsilon $ (x-axis). Note: Results are aggregated over 5,000 repetitions.

Figure 6 Difference between absolute estimates for the polychoric correlation coefficient of the robust estimator with $c=0.6$ and the MLE for each item pair in the neuroticism scale, using the data of Arias et al. (2020). Note: The items are “calm” (N1_P), “angry” (N1_N), “relaxed” (N2_P), “tense” (N2_N), “at ease” (N3_P), “nervous” (N3_N), “not envious” (N4_P), “envious” (N4_N), “stable” (N5_P), “unstable” (N5_N), “contented” (N6_P), and “discontented” (N6_N). In the item names given in parentheses, items with the same identifier (the integer after the first “N”) are polar opposites, where a final “P” denotes the positive pole and a final “N” the negative pole. The individual estimates of each method are provided in Table F.2 in the Supplementary Material.

Table 3 Parameter estimates and standard error estimates ($\widehat {\text {SE}}$) for the correlation between the neuroticism adjective pair “not envious” and “envious” in the data of Arias et al. (2020), using the robust estimator with $c=0.6$, the MLE, and the Pearson sample correlation

Table 4 Empirical relative frequency (top), estimated response probability (center), and Pearson residual (PR) (bottom) of each response $(x,y)$ for the item pair “not envious” (X) and “envious” (Y) in the measurements of Arias et al. (2020) of the neuroticism scale

Figure 7 Estimates of the polychoric correlation between the items “not envious” and “envious” in the data of Arias et al. (2020) for various choices of the tuning constant c (x-axis). Note: The dashed vertical line marks the default value of $c=0.6$.

Figure 8 Bivariate density plots of the standard normal distribution (left) and Clayton copula with standard normal marginals (right), for population correlations 0.9 (top) and 0.3 (bottom).
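For readers who want to reproduce the kind of distribution shown in the right-hand panels, the sketch below samples from a Clayton copula with standard normal marginals using the standard conditional-inversion method. It is a sketch under stated assumptions, not the authors' code: the function names are hypothetical, and the copula parameter `theta` is a placeholder, because this page does not give the values that yield population correlations of 0.9 and 0.3.

```python
import random
from statistics import NormalDist


def _open_uniform(rng):
    """Uniform draw on the open interval (0, 1); random() alone can return 0.0."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def sample_clayton_normal(n, theta=2.0, seed=0):
    """Sample n pairs from a Clayton copula with standard normal marginals.

    theta > 0 is the Clayton dependence parameter (a HYPOTHETICAL default).
    Conditional inversion: given u and an independent uniform w, solve
    dC(u, v)/du = w for v, where C(u, v) = (u^-theta + v^-theta - 1)^(-1/theta).
    """
    if theta <= 0:
        raise ValueError("theta must be positive")
    rng = random.Random(seed)
    std_normal = NormalDist()
    tiny = 1e-12  # keeps v strictly inside (0, 1) for the quantile function
    draws = []
    for _ in range(n):
        u = _open_uniform(rng)
        w = _open_uniform(rng)
        v = ((w ** (-theta / (1.0 + theta)) - 1.0) * u ** (-theta) + 1.0) ** (-1.0 / theta)
        v = min(max(v, tiny), 1.0 - tiny)
        # Map the copula draws to standard normal marginals.
        draws.append((std_normal.inv_cdf(u), std_normal.inv_cdf(v)))
    return draws


pairs = sample_clayton_normal(1000, theta=2.0)
print(pairs[:3])
```

The Clayton copula has Kendall's tau equal to $\theta /(\theta +2)$ and concentrates dependence in the lower tail, so the resulting bivariate distribution is not normal even though both marginals are.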

Table 5 Results for the robust estimator with $c=0.6$ and the polychoric MLE across 5,000 simulated datasets under distributional misspecification via a Clayton copula with true population correlation $\rho _G\in \{0.9, 0.3\}$

Figure 9 Boxplot visualization of the bias of the robust estimator and the polychoric MLE, $\widehat {\rho }_N - \rho _G$, under distributional misspecification via a Clayton copula with correlation $\rho _G = 0.9$ (left) and $\rho _G = 0.3$ (right), across 5,000 repetitions. Note: Diamonds represent the respective average bias. The tuning constant of the robust estimator is set to $c=0.6$.

Supplementary material: File

Welz et al. supplementary material

File 3.5 MB