Bayesian Nonparametric Models for Multiple Raters: A General Statistical Framework

Published online by Cambridge University Press: 11 August 2025

Giuseppe Mignemi*
Affiliation:
Bocconi Institute for Data Science and Analytics, Bocconi University, Milan, Italy
Ioanna Manolopoulou
Affiliation:
Statistics Department, University College London, Bloomsbury, UK
*Corresponding author: Giuseppe Mignemi; Email: giuseppe.mignemi@unibocconi.it

Abstract

Rating procedures are crucial in many applied fields (e.g., educational, clinical, emergency). In these contexts, a rater (e.g., teacher, doctor) scores a subject (e.g., student, patient) on a rating scale. Given raters’ variability, several statistical methods have been proposed for assessing and improving the quality of ratings. The analysis and estimation of the Intraclass Correlation Coefficient (ICC) are major concerns in such cases. As evidenced by the literature, the ICC might differ across subgroups of raters and might be affected by contextual factors and subject heterogeneity. Model estimation in the presence of heterogeneity has been one of the recent challenges in this line of research. Consequently, several methods have been proposed to address this issue under a parametric multilevel modelling framework, in which strong distributional assumptions are made. We propose a more flexible model under the Bayesian nonparametric (BNP) framework, in which most of those assumptions are relaxed. By eliciting hierarchical discrete nonparametric priors, the model accommodates clusters among raters and subjects, naturally accounts for heterogeneity, and improves the accuracy of the estimates. We propose a general BNP heteroscedastic framework to analyze continuous and coarse rating data and possible latent differences among subjects and raters. The estimated densities are used to make inferences about the rating process and the quality of the ratings. By exploiting a stick-breaking representation of the discrete nonparametric priors, a general class of ICC indices can be derived for these models. Our method allows us to independently identify latent similarities between subjects and raters and can be applied in precise education to improve personalized teaching programs or interventions. Theoretical results about the ICC are provided together with computational strategies. Simulations and a real-world application are presented, and possible future directions are discussed.
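
As a rough illustration of the stick-breaking idea mentioned in the abstract, the following minimal sketch draws truncated stick-breaking weights for two discrete random measures (one for subjects, one for raters) and forms a simple variance-ratio reliability index from them. It is not the authors' model or their $ICC_A$ definition: the truncation level, the concentration parameter `alpha`, the Gaussian base measures, the fixed residual variance, and the function names `stick_breaking_weights` and `mixture_variance` are all assumptions made for illustration.

```python
import random


def stick_breaking_weights(alpha, truncation, rng):
    """Truncated stick-breaking weights with concentration alpha (assumed setup)."""
    weights = []
    remaining = 1.0  # length of the stick not yet assigned
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, alpha)  # break off a Beta(1, alpha) fraction
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)  # last atom absorbs the leftover mass
    return weights


def mixture_variance(weights, atoms):
    """Variance of a discrete distribution placing weights[k] on atoms[k]."""
    mean = sum(w * a for w, a in zip(weights, atoms))
    return sum(w * (a - mean) ** 2 for w, a in zip(weights, atoms))


rng = random.Random(0)

# Discrete random measure for subjects' true scores (hypothetical base measure N(0, 1)).
w_theta = stick_breaking_weights(alpha=1.0, truncation=20, rng=rng)
theta_atoms = [rng.gauss(0.0, 1.0) for _ in w_theta]

# Discrete random measure for raters' systematic bias (hypothetical base measure N(0, 0.5^2)).
w_tau = stick_breaking_weights(alpha=1.0, truncation=20, rng=rng)
tau_atoms = [rng.gauss(0.0, 0.5) for _ in w_tau]

var_theta = mixture_variance(w_theta, theta_atoms)
var_tau = mixture_variance(w_tau, tau_atoms)
var_eps = 0.25  # residual variance, fixed here purely for illustration

# Generic variance-ratio reliability index implied by this single draw.
icc = var_theta / (var_theta + var_tau + var_eps)
print(f"subject variance={var_theta:.3f}, rater variance={var_tau:.3f}, ICC={icc:.3f}")
```

Repeating the draw many times would give a distribution over the index, which is the kind of quantity a posterior over random measures induces; the paper itself should be consulted for the actual class of ICC indices and their derivation.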

Information

Type
Application and Case Studies - Original
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of Psychometric Society

Figure 1 Graphical representation of the dependencies implied by the model. The boxes indicate replicates, the four outer plates represent, respectively, subjects and raters, and the inner grey plate indicates the observed rating.

Figure 2 Illustrative examples of empirical $ICC_A$ and $\mathbf {E}[ICC]$ across independent datasets and under different reliability scenarios. The grey balls indicate the mean pairwise $ICC$ between each rater and the others; the black triangles represent the computed $ICC_A$.

Figure 3 Average estimated density across 10 independent datasets under different scenarios. The columns correspond to the cardinality $|\mathcal {R}_i| \in \{2,4\}$ (left and right, respectively); the rows correspond to the bimodal and multimodal scenarios (first and second row, respectively). The solid red lines indicate the true densities; the solid black line and the shaded grey area indicate, respectively, the point-wise mean and $95\%$ quantile-based credible intervals; the black dotted lines indicate the density implied by the BP model.

Figure 4 Top row: empirical distribution of the data (red solid line) and empirical distributions of replicated data (black solid lines) from the respective BNP and BP posterior distributions (left and right columns, respectively). Middle and bottom rows: test statistics computed on the data (red solid line) and histograms of those computed on replicated data.

Table 1 RMSE and MAE of individual parameters corresponding to the BP, BSP and BNP models.

Table 2 Standardized RMSE and standardized MAE of structural parameters corresponding to the BP, BSP and BNP models.

Figure 5 The empirical distribution of ratings and the frequency of students per teacher are shown on the left and right, respectively.

Table 3 The WAIC is reported for each of the fitted models: BNP, BP, and BSP; the pairwise WAIC difference ($\Delta WAIC$) between the best-fitting model and each of the others is also reported.

Table 4 Posterior means and $95\%$ quantile-based credible intervals of the estimated structural parameters of the BNP model.

Figure 6 The estimated densities of the subject’s true score $\theta $, rater’s systematic bias $\tau $ and the residual term $\epsilon $ are reported; the black solid lines and the shaded grey areas indicate the pointwise posterior mean and $95\%$ quantile-based credible intervals of the respective densities. The bottom-right panel shows the posterior distribution of the $ICC_A$; the black solid and dotted lines indicate, respectively, the $95\%$ credible interval and the posterior mean. The rugs at the margins of the first three panels indicate the clustering of individuals.

Table 5 RMSE and MAE of individual parameters across bimodal scenarios with coarsened ratings.

Table 6 The WAIC is reported for each of the fitted models: BNP, BP, and BSP; the pairwise WAIC difference ($\Delta WAIC$) between the best-fitting model and each of the others is also reported.

Table 7 Posterior means and $95\%$ quantile-based credible intervals of the estimated structural parameters of the BNP model.

Figure 7 The estimated densities of the subject’s true score $\theta $, rater’s systematic bias $\tau $ and the residual term $\epsilon $ are reported; the black solid lines and the shaded grey areas indicate the pointwise posterior mean and $95\%$ quantile-based credible intervals of the respective densities. The bottom-right panel shows the posterior distribution of the $ICC_A$; the black solid and dotted lines indicate, respectively, the $95\%$ credible interval and the posterior mean. The rugs at the margins of the first three panels indicate the clustering of individuals.

Figure C.1 Average estimated density across 10 independent datasets under the unimodal scenario. The columns correspond to the cardinality $|\mathcal {R}_i| \in \{2,4\}$ (left and right, respectively). The solid red lines indicate the true densities; the solid black line and the shaded grey area indicate, respectively, the point-wise mean and $95\%$ quantile-based credible intervals; the black dotted lines indicate the density implied by the BP model.

Figure C.2 First row: examples of posterior similarity matrices for pairwise subject and rater allocations (left and right columns, respectively). Second row: posterior similarity matrices for pairwise subject and rater allocations in the real data analyzed in Section 7.

Supplementary material: File

Mignemi and Manolopoulou supplementary material
File 5.4 MB