
Bayesian Nonparametric Models for Multiple Raters: A General Statistical Framework

Published online by Cambridge University Press:  11 August 2025

Giuseppe Mignemi*
Affiliation:
https://ror.org/05crjpb27 Bocconi Institute for Data Science and Analytics, Bocconi University, Milan, Italy
Ioanna Manolopoulou
Affiliation:
Statistics Department, https://ror.org/02jx3x895 University College London, Bloomsbury, UK
Corresponding author: Giuseppe Mignemi; Email: giuseppe.mignemi@unibocconi.it

Abstract

The rating procedure is crucial in many applied fields (e.g., educational, clinical, emergency). In these contexts, a rater (e.g., teacher, doctor) scores a subject (e.g., student, patient) on a rating scale. Given raters’ variability, several statistical methods have been proposed for assessing and improving the quality of ratings. The analysis and the estimate of the Intraclass Correlation Coefficient (ICC) are major concerns in such cases. As evidenced by the literature, the ICC might differ across different subgroups of raters and might be affected by contextual factors and subject heterogeneity. Model estimation in the presence of heterogeneity has been one of the recent challenges in this research line. Consequently, several methods have been proposed to address this issue under a parametric multilevel modelling framework, in which strong distributional assumptions are made. We propose a more flexible model under the Bayesian nonparametric (BNP) framework, in which most of those assumptions are relaxed. By eliciting hierarchical discrete nonparametric priors, the model accommodates clusters among raters and subjects, naturally accounts for heterogeneity, and improves estimates’ accuracy. We propose a general BNP heteroscedastic framework to analyze continuous and coarse rating data and possible latent differences among subjects and raters. The estimated densities are used to make inferences about the rating process and the quality of the ratings. By exploiting a stick-breaking representation of the discrete nonparametric priors, a general class of ICC indices might be derived for these models. Our method allows us to independently identify latent similarities between subjects and raters and can be applied in precise education to improve personalized teaching programs or interventions. Theoretical results about the ICC are provided together with computational strategies.
Simulations and a real-world application are presented, and possible future directions are discussed.

Information

Type
Application and Case Studies - Original
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of Psychometric Society

1 Introduction

The rating procedure is crucial in several applied scientific fields, such as educational assessment (Childs & Wooten, Reference Childs and Wooten2023; Chin et al., Reference Chin, Quinn, Dhaliwal and Lovison2020), psychological and medical diagnoses (D’lima et al., Reference D’lima, Taylor, Mitri, Harding, Lai and Manias2024; Królikowska et al., Reference Królikowska, Reichert, Karlsson, Mouton, Becker and Prill2023; Li et al., Reference Li, Cui, Li, Guo, Ke, Liu, Luo, Zheng and Leckman2022), emergency rescue (Albrecht et al., Reference Albrecht, Espejo, Riedel, Nissen, Banerjee, Conroy and Nickel2024; Lo et al., Reference Lo, Heinemann, Gray, Lindquist, Kocherginsky, Post and Dresden2021) or the grant review process (Cao et al., Reference Cao, Stokes and Zhang2010; Sattler et al., Reference Sattler, McKnight, Naney and Mathis2015). It implies that an observer, commonly called a rater (e.g., teacher, doctor), assesses some subject attribute or latent ability (e.g., student proficiency, patient severity) on a rating scale. Raters’ variability might pose reliability concerns and uncertainty about the quality of ratings (Bartoš & Martinková, Reference Bartoš and Martinková2024; Mignemi et al., Reference Mignemi, Calcagnì, Spoto and Manolopoulou2024; Ten Hove et al., Reference Ten Hove, Jorgensen and van der Ark2021). Several statistical methods have been proposed to address these issues; they aim to assess or improve the accuracy of ratings (Casabianca et al., Reference Casabianca, Lockwood and Mccaffrey2015; Gwet, Reference Gwet2008; Martinková et al., Reference Martinková, Bartoš and Brabec2023; McGraw & Wong, Reference McGraw and Wong1996; Nelson & Edwards, Reference Nelson and Edwards2015). Multilevel modelling serves as a natural statistical framework for rating data since subjects are either nested within raters or crossed with them (Ten Hove et al., Reference Ten Hove, Jorgensen and van der Ark2021).
These models (e.g., one-way or two-way ANOVA, hierarchical linear or generalized linear models) decompose the total variance of observed ratings according to different sources of variability, i.e., subjects and raters (see Martinková & Hladká, Reference Martinková and Hladká2023, chapter 4, for an overview). The observed rating is commonly broken down into different effects, for instance, the effect of the subject (i.e., true score, latent ability; Lord & Novick, Reference Lord and Novick1968), the effect of the rater (i.e., rater’s systematic bias) and a residual part (McGraw & Wong, Reference McGraw and Wong1996; Shrout & Fleiss, Reference Shrout and Fleiss1979). This allows us to jointly estimate the subject true score and the reliability of ratings, which is generally referred to as the proportion of total variance due to the subjects’ variability (McGraw & Wong, Reference McGraw and Wong1996; Werts et al., Reference Werts, Linn and Jöreskog1974).
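As a toy numerical illustration of this variance decomposition, the single-rater ICC can be computed directly from the variance components; the values and variable names below are illustrative, not taken from the article:

```python
# Hypothetical variance components of the two-way decomposition
# Y_ij = theta_i + tau_j + eps_ij (illustrative values).
var_subject = 4.0  # variability of subjects' true scores theta_i
var_rater = 1.0    # variability of raters' systematic biases tau_j
var_resid = 2.0    # residual variability eps_ij

# ICC for absolute agreement, single rater: the share of total variance
# attributable to subject differences (in the spirit of McGraw & Wong, 1996).
icc_agreement = var_subject / (var_subject + var_rater + var_resid)
print(round(icc_agreement, 3))  # 0.571
```

Under this reading, a low ICC can come either from heterogeneous raters (large rater variance) or from noisy individual ratings (large residual variance), which is why separating the two sources matters.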

Several methods have been proposed to analyze rating data under the Item Response Theory (IRT) framework, such as the Generalized Many Facet Rasch Models (GMFRMs; Linacre, Reference Linacre1989; Uto et al., Reference Uto, Tsuruta, Araki and Ueno2024; Uto & Ueno, Reference Uto and Ueno2020), the Hierarchical Raters Models (HRMs; DeCarlo et al., Reference DeCarlo, Kim and Johnson2011; Molenaar et al., Reference Molenaar, Uluman, Tavşancl and De Boeck2021; Nieto & Casabianca, Reference Nieto and Casabianca2019; Patz et al., Reference Patz, Junker, Johnson and Mariano2002) or the Generalized Hierarchical Raters Models (GHRMs; Muckle & Karabatsos, Reference Muckle and Karabatsos2009). These models jointly estimate the subject’s latent ability, rater effects (e.g., systematic bias and reliability), and item features (i.e., difficulty, discrimination). They typically rely on the assumption that subjects’ latent abilities are independent and identically distributed (i.i.d.) from a normal distribution. Other recent research lines concentrate on modeling and estimation issues in the presence of subjects’ and raters’ heterogeneity (Martinková et al., Reference Martinková, Bartoš and Brabec2023; Mutz et al., Reference Mutz, Bornmann and Daniel2012; Sattler et al., Reference Sattler, McKnight, Naney and Mathis2015; Ten Hove et al., Reference Ten Hove, Jorgensen and van der Ark2022). These works model systematic differences among subjects or raters to allow more accurate estimates and provide detailed information about the rating procedure. Individual subjects’ or raters’ characteristics may affect rating reliability, so that more flexible models result in separate reliability estimates (Martinková et al., Reference Martinková, Bartoš and Brabec2023).
Recent models have been proposed to address this issue under a parametric multilevel modeling framework (Erosheva et al., Reference Erosheva, Martinková and Lee2021; Martinková et al., Reference Martinková, Bartoš and Brabec2023; Martinkova et al., Reference Martinkova, Goldhaber and Erosheva2018; Mutz et al., Reference Mutz, Bornmann and Daniel2012) in which heterogeneity is addressed as a covariate-dependent difference among subjects, and subject- and rater-specific effects are assumed to be i.i.d. from a normal distribution. The normality assumption made under all the aforementioned models might be unrealistic under a highly heterogeneous scenario in which possible clusters among subjects or raters might be reasonably expected and the conditional density of the respective effects might be multimodal (Paganin et al., Reference Paganin, Paciorek, Wehrhahn, Rodríguez, Rabe-Hesketh and de Valpine2023; Verbeke & Lesaffre, Reference Verbeke and Lesaffre1996; Yang & Dunson, Reference Yang and Dunson2010). Such patterns have emerged from real data, showing that both the conditional densities of subjects’ latent ability (e.g., Uto et al., Reference Uto, Tsuruta, Araki and Ueno2024) and raters’ systematic bias (e.g., Muckle & Karabatsos, Reference Muckle and Karabatsos2009) might be multimodal and the normality assumption violated. In these cases, the data exhibit two levels of heterogeneity. The first, known as individual heterogeneity, captures the differences between individuals; the second, referred to as population heterogeneity, pertains to the differences between clusters. Although parametric mixture models might represent a suitable solution, the number of mixture components needs to be fixed. Models with different numbers of components have to be fitted, and model selection techniques are required to identify the optimal number of clusters (Bartholomew et al., Reference Bartholomew, Knott and Moustaki2011).

1.1 Our contributions

Our proposal aims to overcome these restrictions under a Bayesian nonparametric (BNP) model, which naturally accommodates subgroups among subjects and raters and allows less restrictive distributional assumptions on the respective effects (Ferguson, Reference Ferguson1973; Ghosal & van der Vaart, Reference Ghosal and van der Vaart2017; Hjort et al., Reference Hjort, Holmes, Müller and Walker2010). BNP inference has led to new developments and advances during the last decades in psychometrics (Cremaschi et al., Reference Cremaschi, De Iorio, Seng Chong, Broekman, Meaney and Kee2021; Karabatsos & Walker, Reference Karabatsos and Walker2009; Paganin et al., Reference Paganin, Paciorek, Wehrhahn, Rodríguez, Rabe-Hesketh and de Valpine2023; Roy et al., Reference Roy, Daniels and Roy2024; San Martín et al., Reference San Martín, Jara, Rolin and Mouchart2011; Tang et al., Reference Tang, Chow, Ibrahim and Zhu2017; Wang & Kingston, Reference Wang and Kingston2020; Yang & Dunson, Reference Yang and Dunson2010), but few contributions have been proposed for the analysis of rating data (DeYoreo & Kottas, Reference DeYoreo and Kottas2018; Kottas et al., Reference Kottas, Müller and Quintana2005; Mignemi et al., Reference Mignemi, Calcagnì, Spoto and Manolopoulou2024; Savitsky & Dalal, Reference Savitsky and Dalal2014). We provide a flexible statistical framework for rating models in which latent heterogeneity among subjects and raters is captured with the stochastic clustering induced by the Dirichlet Process Mixture (DPM) placed over their respective effects. Modelling subjects’ and raters’ effect parameters as an infinite mixture of some distribution family (e.g., Normal, Gamma) enables the model to account for possible multimodality without specifying the number of mixture components (De Iorio et al., Reference De Iorio, Favaro, Guglielmi and Ye2023; Yang & Dunson, Reference Yang and Dunson2010).
Although previous works have raised questions about the identifiability of the parameters in BNP IRT models (San Martín et al., Reference San Martín, Jara, Rolin and Mouchart2011), theoretical results by Pan et al. (Reference Pan, Shen, Davis-Stober and Hu2024) have recently shown that BNP IRT models (e.g., 1PL) are identifiable.

Under the general case of a two-way design (McGraw & Wong, Reference McGraw and Wong1996), we specify a measurement model for the subject latent ability (e.g., student proficiency) in which the rater’s systematic bias (i.e., severity) and reliability are consistently estimated. This makes our method more relevant for subject scoring purposes than the other BNP models proposed for the analysis of rating data (DeYoreo & Kottas, Reference DeYoreo and Kottas2018; Kottas et al., Reference Kottas, Müller and Quintana2005; Savitsky & Dalal, Reference Savitsky and Dalal2014). Our proposal may be suitable both for balanced (i.e., when all raters score each subject; Nelson & Edwards, Reference Nelson and Edwards2010, Reference Nelson and Edwards2015) and unbalanced designs (i.e., when a subset of raters scores each subject; Martinková et al., Reference Martinková, Bartoš and Brabec2023; Ten Hove et al., Reference Ten Hove, Jorgensen and van der Ark2022). Furthermore, we propose a semiparametric model as a nested version of the BNP model in which raters’ effects are i.i.d. from a unimodal distribution. Very small rater sample sizes may not reasonably be considered representative of the overall rater population, making the semiparametric specification a potentially more suitable choice.

The advantages of the proposed method are manifold. First, it relies on more relaxed distributional assumptions for the subjects’ and raters’ effects, allowing for density estimation using mixtures (Escobar & West, Reference Escobar and West1994; Ghosal et al., Reference Ghosal, Ghosh and Ramamoorthi1999) and preventing model misspecification issues (Antonelli et al., Reference Antonelli, Trippa and Haneuse2016; Walker & Gutiérrez-Peña, Reference Walker and Gutiérrez-Peña2007). As recently argued by Tang et al. (Reference Tang, Chow, Ibrahim and Zhu2017), BNP priors might be helpful in assessing the appropriateness of common parametric assumptions for psychometrics models and represent a solution under their violation (Antoniak, Reference Antoniak1974; Ferguson, Reference Ferguson1973). Second, it naturally enables independent clustering of subjects and raters, bringing more detailed information about their latent differences (De Iorio et al., Reference De Iorio, Favaro, Guglielmi and Ye2023; Mignemi et al., Reference Mignemi, Calcagnì, Spoto and Manolopoulou2024). This allows the joint analysis of individual and population heterogeneity of both subjects and raters. This aspect might be beneficial in the context of precise education (Coates, Reference Coates2025; Cook et al., Reference Cook, Kilgus and Burns2018), where information about individual and cluster differences might be used for implementing more personalized educational programs or interventions (Hart, Reference Hart2016; Henderson et al., Reference Henderson, Louis, Rosner and Varadhan2020). Third, exploiting a stick-breaking representation of the Dirichlet Process (DP) (Ghosal & van der Vaart, Reference Ghosal and van der Vaart2017; Ishwaran & James, Reference Ishwaran and James2001), a general class of ICC indices might be derived, and different indices might be computed according to distinct clusters of subjects or raters.
Fourth, it is readily extended to account for coarse or ordinal ratings (Goel & Thakor, Reference Goel and Thakor2015; Lockwood et al., Reference Lockwood, Castellano and Shear2018). Fifth, the general hierarchical formulation of our model allows comparisons with other methods and further extensions under unifying modelling frameworks (e.g., generalized linear latent and mixed model, GLLAMM Rabe-Hesketh & Skrondal, Reference Rabe-Hesketh and Skrondal2016). This facilitates a straightforward communication between different statistical fields and a wider application of the BNP method.

Model parameters are learned through full posterior sampling. Since most of the parameters in the model have conjugate prior distributions, full conditional Gibbs sampling is possible for most of the parameters (Ishwaran & James, Reference Ishwaran and James2001). Nonetheless, a few parameters do not have conjugate priors, and a derivative-matching technique is used to approximate their full conditionals (Miller, Reference Miller2019).

1.2 Outline of the article

The outline of the article is as follows: we present the general framework and introduce the model in Sections 2.1–2.3, respectively; different approximate ICC indices are derived in Section 3 and a reduced model for one-way designs is detailed in Section 4; prior elicitation and posterior sampling are discussed and presented in Section 5; simulations and real-world applications are illustrated, respectively, in Section 6 and Section 7; the model extension for coarse ratings is presented in Section 8, along with some numerical results from real and generated data. Advantages and limitations of the proposal are discussed in Section 9. Further BNP extensions, proofs for the ICC indices, and additional plots are given in the Appendices. Additional results on balanced designs in small sample sizes, technical details on out-of-sample predictive performance assessment, and posterior computation for this class of models are presented in the Supplementary Material. We provide an R package RatersBNP to facilitate direct usage of our method by researchers and practitioners. Code and Supplementary Material are available online through the link: https://osf.io/3yx4j/?view_only=98c600198a6b4807878989765118f97e.

2 BNP rating model

2.1 General framework

Several model specifications have been proposed for different data structures and designs (Gwet, Reference Gwet2008; Shrout & Fleiss, Reference Shrout and Fleiss1979; Ten Hove et al., Reference Ten Hove, Jorgensen and van der Ark2022). One-way designs are preferred when rater differences are typically considered as noise (Martinková et al., Reference Martinková, Bartoš and Brabec2023), whereas two-way designs are usually involved if the rater’s effect needs to be identified (Casabianca et al., Reference Casabianca, Lockwood and Mccaffrey2015; Mignemi et al., Reference Mignemi, Calcagnì, Spoto and Manolopoulou2024). Balanced designs require each subject to be rated by all the raters, while in an unbalanced design each subject is only rated by a generally small subset of them (Ten Hove et al., Reference Ten Hove, Jorgensen and van der Ark2021). Raters might be considered either fixed or random (i.e., drawn from the population) depending on the inference the researcher might be interested in (Koo & Li, Reference Koo and Li2016).

The unbalanced two-way design with random raters is considered as the general case for presenting our model. The reasons for this choice are both theoretical and practical. We aim to provide a comprehensive statistical framework for modeling the dependency of ratings on different categorical predictors (i.e., subjects’ and raters’ identities). This setting is a neat compromise between the one-way design, which implies only one categorical predictor (i.e., subject identity), and more complex dependency structures that involve more than two identities (i.e., several categorical predictors). However, an example of these extensions is given in Appendix A.2. Indeed, our proposal might be alternatively reduced or extended to be suitable for these different levels of complexity. The unbalanced design implies some sparsity in the co-occurrence between subjects and raters, and each subject is rated only by a small subset of raters (Papaspiliopoulos et al., Reference Papaspiliopoulos, Roberts and Zanella2019; Papaspiliopoulos et al., Reference Papaspiliopoulos, Stumpf-Fétizon and Zanella2023); as a consequence, each rater might score a different number of subjects. This makes the framework general and flexible; it might be seen as an extension of cross-classified models in which uncertainty is also modeled hierarchically. From a practical perspective, our choice is reasonable since many large studies and applications use unbalanced designs to distribute the workload across different raters (Ten Hove et al., Reference Ten Hove, Jorgensen and van der Ark2022).

2.2 Preliminaries on Bayesian nonparametric inference

In this subsection, we briefly review some basic preliminaries on BNP inference, providing a very general framework that is detailed in the sections below (refer to Ghosal & van der Vaart, Reference Ghosal and van der Vaart2017 and Hjort et al., Reference Hjort, Holmes, Müller and Walker2010 for exhaustive treatments).

Suppose $Y_1,\dots , Y_n$ are observations (e.g., ratings), with each $Y_i$ taking values in a complete and separable metric space $\mathbb {Y}$ . Let $\Pi $ denote a prior probability distribution on the set of all probability measures $\mathbf {P}_{\mathbb {Y}}$ such that

(1) $$ \begin{align} Y_i | p \overset{\mathrm{iid}}{\sim} p, \quad \;\; p \sim \Pi, \end{align} $$

for $i=1,\dots ,n$ . Here p is a random probability measure on $\mathbb {Y}$ and $\Pi $ is its probability distribution and might be interpreted as the prior distribution for Bayesian inference (De Blasi et al., Reference De Blasi, Favaro, Lijoi, Mena, Prünster and Ruggiero2015). The inferential problem is called parametric when $\Pi $ degenerates on a finite-dimensional subspace of $\mathbf {P}_{\mathbb {Y}}$ , and nonparametric when the support of $\Pi $ is infinite-dimensional (Hjort et al., Reference Hjort, Holmes, Müller and Walker2010, chapter 3). To the best of our knowledge, the vast majority of the contributions present in the rating models literature (Bartoš & Martinková, Reference Bartoš and Martinková2024; Casabianca et al., Reference Casabianca, Lockwood and Mccaffrey2015; Martinková et al., Reference Martinková, Bartoš and Brabec2023; Martinkova et al., Reference Martinkova, Goldhaber and Erosheva2018; Nelson & Edwards, Reference Nelson and Edwards2010, Reference Nelson and Edwards2015; Ten Hove et al., Reference Ten Hove, Jorgensen and van der Ark2021, Reference Ten Hove, Jorgensen and van der Ark2022; Zupanc & Štrumbelj, Reference Zupanc and Štrumbelj2018) are developed within a parametric framework making use of a prior that assigns probability one to a small subset of $\mathbf {P}_{\mathbb {Y}}$ . Mignemi et al. (Reference Mignemi, Calcagnì, Spoto and Manolopoulou2024) recently proposed a Bayesian semi-parametric (BSP) model for analyzing rating data; even though they relax the normality assumption for the rater effect (i.e., the systematic bias), normality is still assumed for the subject true score distribution. This strong prior assumption is overcome through a BNP approach (Ghosal & van der Vaart, Reference Ghosal and van der Vaart2017) in the present work.

Dirichlet processes. For the present proposal, we assume $\Pi $ to be a discrete nonparametric prior, namely a DP, which has been widely used in BNP psychometric research (Cremaschi et al., Reference Cremaschi, De Iorio, Seng Chong, Broekman, Meaney and Kee2021; Karabatsos & Walker, Reference Karabatsos and Walker2009; Paganin et al., Reference Paganin, Paciorek, Wehrhahn, Rodríguez, Rabe-Hesketh and de Valpine2023; Yang & Dunson, Reference Yang and Dunson2010). Given $\Pi = DP(\alpha P_0)$ , p is a random measure on $\mathbb {Y}$ following a DP with concentration parameter $\alpha>0$ and base measure $P_0$ . This implies that for every finite measurable partition $\{B_1,\dots ,B_k\}$ of $\mathbb {Y}$ , the joint distribution $(p(B_1),\dots ,p(B_k))$ follows a k-variate Dirichlet distribution with parameters $\alpha P_0(B_1),\dots ,\alpha P_0(B_k)$ :

(2) $$ \begin{align} (p(B_1),\dots,p(B_k)) \sim Dir(\alpha P_0(B_1),\dots,\alpha P_0(B_k)). \end{align} $$

The base measure $P_0$ is our prior guess at p as it is the prior expectation of the DP, i.e., $\mathbf {E}[p]=P_0$ . The parameter $\alpha $ (also termed precision parameter) controls the concentration of the prior for p about $P_0$ . In the limit of $\alpha \rightarrow \infty $ , the probability mass is spread out and p gets closer to $P_0$ ; on the contrary, as $\alpha \rightarrow 0$ , p moves farther from $P_0$ and concentrates at a point mass.
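The role of $\alpha$ can be visualized with a quick simulation. The sketch below draws truncated realizations of a DP via the stick-breaking construction of Section 2.4; the function names, truncation level, and choice of $P_0 = N(0,1)$ are illustrative, not taken from the article:

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_weights(alpha, n_atoms=1000):
    """Truncated stick-breaking weights of one draw p ~ DP(alpha * P0)."""
    v = rng.beta(1.0, alpha, size=n_atoms)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * remaining

def n_effective(weights, mass=0.99):
    """Number of atoms needed to carry `mass` of the total probability."""
    return int(np.searchsorted(np.cumsum(np.sort(weights)[::-1]), mass)) + 1

# Small alpha: p piles its mass on a handful of atoms (far from P0).
# Large alpha: mass spreads over many atoms and p concentrates around P0.
w_small, w_large = dp_weights(0.5), dp_weights(50.0)
print(n_effective(w_small), n_effective(w_large))
```

Plotting the resulting weighted atoms against the $N(0,1)$ density makes the same point graphically: the large-$\alpha$ draw visually tracks the base measure.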

Dirichlet process mixtures. Given the discrete nature of the DP, whenever $\mathbb {Y} = \mathbb {R}$ it is not a reasonable prior for the real-valued random variable Y. Nonetheless, it might be involved in density estimation through hierarchical mixture modeling (Ghosal & van der Vaart, Reference Ghosal and van der Vaart2017). Let $f(\cdot ;\tilde {\theta })$ denote a probability density function for $\tilde {\theta } \in \Theta \subseteq \mathbb {R}$ , we modify (1) such that for $i=1,\dots ,n$ :

(3) $$ \begin{align} Y_i | \tilde{\theta}_i \overset{\mathrm{ind}}{\sim} f(\cdot;\tilde{\theta}_i), \quad \tilde{\theta}_i|p \overset{\mathrm{iid}}{\sim} p, \quad \;\; p \sim DP(\alpha P_0). \end{align} $$

The realizations of the DP are almost surely (a.s.) discrete, which implies a positive probability that $\tilde {\theta }_i = \tilde {\theta }_{i'}$ , for $i \neq i'$ . Indeed, a random sample $(\tilde {\theta }_1,\dots , \tilde {\theta }_n)$ from p features $1\leq K_n \leq n$ unique values $(\tilde {\theta }^{\ast}_1,\dots , \tilde {\theta }^{\ast}_{K_n})$ and leads to a random partition of $\{1,\dots , n\}$ into $K_n$ blocks such that $ \tilde {\theta }_i \in (\tilde {\theta }^{\ast}_1,\dots , \tilde {\theta }^{\ast}_{K_n})$ for $i=1,\dots ,n$ . This naturally induces a mixture distribution for the observations $Y_1,\dots , Y_n$ with probability density:

(4) $$ \begin{align} f(Y) = \int f(Y;\tilde{\theta})p(d\tilde{\theta}). \end{align} $$

To provide some intuition, by using a DP as a prior for an unknown mixture distribution we mix parametric families nonparametrically (Gelman et al., Reference Gelman, Hwang and Vehtari2014). This model specification, introduced by Lo (Reference Lo1984) and termed DPM, provides a BNP framework to model rating data.
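A minimal simulation makes the induced ties concrete: sampling n values from a single (truncated) realization of $p \sim DP(\alpha P_0)$ yields far fewer than n unique values, i.e., a random partition into $K_n$ blocks. The setup below (seed, truncation, $P_0 = N(0,1)$) is an assumed illustration, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(1)

# One truncated realization of p ~ DP(alpha * P0) with P0 = N(0, 1).
alpha, n, trunc = 1.0, 100, 500
v = rng.beta(1.0, alpha, size=trunc)
weights = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
weights /= weights.sum()             # renormalize the truncated weights
atoms = rng.normal(0.0, 1.0, trunc)  # atom locations drawn i.i.d. from P0

# Sampling theta_1..theta_n from the discrete measure p produces ties:
theta = atoms[rng.choice(trunc, size=n, p=weights)]
k_n = len(np.unique(theta))
print(k_n, "unique values among", n, "draws")
```

For $\alpha = 1$ the expected number of blocks grows only logarithmically in n, so $K_n$ here is typically a handful rather than close to 100.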

2.3 Proposed model

Consider a subject $i = 1, \dots , I$ , whose attribute is independently scored by a finite random subset of raters $\mathcal {R}_i \subseteq \{1,\dots , J\}$ on a continuous rating scale. We assume that the observed rating $Y_{ij} \in \mathbb {R}$ depends independently on subject i and rater $j \in \mathcal {R}_i$ . The effect of the former is interpreted as i’s true score and is the rating procedure’s focus. We let the residual part, that is the difference between the true and the observed score, depend on rater j’s effects, i.e., systematic bias and reliability.

Modelling rating $Y_{ij}$ . We specify the following decomposition of rating $Y_{ij}$ :

(5) $$ \begin{align} Y_{ij} = \theta_i + \tau_j +\epsilon_{ij}, \quad i = 1,\dots,I; \;\; j \in \mathcal{R}_i. \end{align} $$

Here $\theta _i$ captures the subject i’s latent “true” score and $ \tau _j +\epsilon _{ij}$ is the difference between the observed and the true score, representing the error of rater j. The true score $\theta _i$ might be regarded as the latent consensus across the raters $\mathcal {R}_i$ on the attribute of subject i. We assume these terms to be mutually independent.

Modelling subject’s true score. For each subject $i = 1,\dots ,I$ we assume that the true score $\theta _i$ is independently distributed following a normal distribution with mean $\mu _i$ and variance $\omega ^2_i$ :

(6) $$ \begin{align} \theta_i|\mu_i,\omega^2_i \ \overset{\mathrm{ind}}{\sim} \ N(\mu_i,\omega^2_i). \end{align} $$

Here $\mu _i$ is the mean of subject i’s true score, $\omega ^2_i$ is its variability and we assume them to be independent. Conditional on the rater’s error, higher values of $\theta _i$ imply higher levels of the subjects’ attribute (e.g., higher student proficiency); on the contrary, lower values indicate poor levels of their attribute (e.g., poor student proficiency).

We specify a DP prior with precision parameter $\alpha _1$ and base measure $G_0$ for the pair $(\mu _i,\omega ^2_i)$ , $i=1,\dots ,I$ :

(7) $$ \begin{align} (\mu_i,1/\omega^2_i)|G \overset{\mathrm{iid}}{\sim} G, \quad \; G \sim DP(\alpha_1G_0). \end{align} $$

We choose $G_0=N(\mu _0, S_0) \times Ga(w_0, w_0/W_0)$ , where $\mu _0$ and $S_0$ are the mean and variance of the normal distribution and $w_0$ and $W_0$ are, respectively, the shape and the mean parameters of the gamma. We note that G is a.s. discrete with a non-zero probability of ties, such that different subjects will share the same values of $(\mu _i,1/\omega ^2_i)$ with a probability greater than zero, that is $P[(\mu _i,1/\omega ^2_i)=(\mu _{i'},1/\omega ^2_{i'})]>0$ , for $i \neq i'$ . This discreteness property naturally induces clustering across subjects and leads to a location-scale DPM prior for $\theta _i$ . That is, this formulation can capture clusters of subject abilities. Figure 1 shows the hierarchical dependence of subjects’ true scores.

Figure 1 Graphical representation of the dependencies implied by the model. The boxes indicate replicates, the four outer plates represent, respectively, subjects and raters, and the inner grey plate indicates the observed rating.
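A short generative sketch of this location-scale DPM shows how ties in $(\mu_i, 1/\omega^2_i)$ induce clusters of subjects; the hyperparameter values, seed, and truncation level below are illustrative assumptions, not the paper's defaults:

```python
import numpy as np

rng = np.random.default_rng(4)

# Truncated draw of G ~ DP(alpha_1 * G_0) with
# G_0 = N(mu_0, S_0) x Ga(w_0, w_0 / W_0); all values illustrative.
alpha_1, I, trunc = 1.0, 300, 200
mu_0, S_0, w_0, W_0 = 0.0, 4.0, 2.0, 1.0

v = rng.beta(1.0, alpha_1, size=trunc)
weights = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
weights /= weights.sum()
mu_atoms = rng.normal(mu_0, np.sqrt(S_0), trunc)
prec_atoms = rng.gamma(shape=w_0, scale=W_0 / w_0, size=trunc)  # mean W_0

# Ties in (mu_i, 1/omega_i^2) induce subject clusters;
# then theta_i | mu_i, omega_i^2 ~ N(mu_i, omega_i^2).
idx = rng.choice(trunc, size=I, p=weights)
theta = rng.normal(mu_atoms[idx], 1.0 / np.sqrt(prec_atoms[idx]))
print(len(np.unique(idx)), "subject clusters among", I, "subjects")
```

A histogram of the simulated `theta` values would typically be multimodal, which is exactly the kind of latent-ability density the parametric normal assumption cannot capture.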

Modelling rater’s bias and reliability. For each rater $j=1,\dots , J$ , who scores a subset of subjects $\mathcal {S}_j \subseteq \{1,\dots ,I\}: j \in \mathcal {R}_i$ , the difference between the observed rating $Y_{ij}$ and the subject’s true score $\theta _i$ , $i \in \mathcal {S}_j$ , is decomposed into the rater effects $\tau _j$ and $\epsilon _{ij}$ (5), assuming $\tau _j {\perp \!\!\! \perp } \epsilon _{ij}$ . We model $\tau _j$ to be normally distributed with mean $\eta _j$ and variance $\phi ^2_j$ :

(8) $$ \begin{align} \tau_j|\eta_j,\phi^2_j \overset{\mathrm{ind}}{\sim} N(\eta_j,\phi^2_j), \quad j=1,\dots,J. \end{align} $$

Here $\eta _j$ and $\phi ^2_j$ are the mean and the variance of the rater j’s effect $\tau _j$ . The effect $\tau _j$ captures j’s specific systematic bias, i.e., the mean difference between the observed rating $Y_{ij}$ and the subject’s true score $\theta _i$ , $i \in \mathcal {S}_j$ . Given two raters such that $\tau _j<\tau _{j'}$ , j is said to be stricter and is expected to give systematically smaller ratings than $j'$ on average.

The residual term $\epsilon _{ij}$ is assumed to be i.i.d. for $i \in \mathcal {S}_j$ following a normal distribution with zero mean and variance $\sigma ^2_j$ . We let this parameter vary across raters and assume $1/\sigma ^2_j$ follows a gamma distribution with shape and rate parameters $\gamma _j,\gamma _j/\beta _j$ , respectively:

(9) $$ \begin{align} \epsilon_{ij}|\sigma^2_j & \ \ \overset{\mathrm{iid}}{\sim}\ \ N(0, \sigma^2_j), \quad i \in \mathcal{S}_j, \end{align} $$
(10) $$ \begin{align} 1/\sigma^2_j|\gamma_j,\beta_j & \ \ \overset{\mathrm{ind}}{\sim} \ \ Ga(\gamma_j,\gamma_j/\beta_j), \quad j=1,\dots,J. \end{align} $$

Under this parametrization, $1/\sigma ^2_j$ is the rater j’s specific reliability with mean $\beta _j$ and $\gamma _j$ is the shape parameter. We prefer this parametrization for interpretability purposes, as it implies a simpler notation below. Conditional on subjects’ true score $\theta _i$ , $i \in \mathcal {S}_j$ , larger values of $\sigma ^2_j$ imply more variability across the ratings given by j and might be interpreted as a poorly consistent rating behaviour. On the contrary, smaller values of $\sigma ^2_j$ indicate less variability and higher consistency for j across subjects. As a consequence, the parameter $1/\sigma ^2_j$ might be equivalently referred to as the rater-specific precision.
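The interpretability of this shape/rate parametrization can be checked numerically: with shape $\gamma_j$ and rate $\gamma_j/\beta_j$, the reliability $1/\sigma^2_j$ has mean $\beta_j$. A Monte Carlo sketch with illustrative values (note that NumPy's gamma sampler takes scale = 1/rate):

```python
import numpy as np

rng = np.random.default_rng(2)

# Reliability 1/sigma_j^2 ~ Ga(gamma_j, gamma_j / beta_j) in (shape, rate)
# form, so its mean is (shape / rate) = beta_j. Values are illustrative.
gamma_j, beta_j = 3.0, 2.5
reliab = rng.gamma(shape=gamma_j, scale=beta_j / gamma_j, size=200_000)
print(reliab.mean())  # close to beta_j = 2.5
```

This is why $\beta_j$ can be read directly as rater j's expected reliability, while $\gamma_j$ only controls how dispersed the reliability is around that mean.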

We specify a DP prior with concentration parameter $\alpha _2$ and base measure $H_0$ for the four-dimensional vector $(\eta _j,1/\phi ^2_j,\gamma _j,1/\beta _j)$ , $j=1,\dots ,J$ :

(11) $$ \begin{align} (\eta_j,1/\phi^2_j,\gamma_j,1/\beta_j)|H \overset{\mathrm{iid}}{\sim} H, \quad \; H \sim DP(\alpha_2H_0). \end{align} $$

We assume mutual independence for the elements of the vector and choose $H_0=N(\eta _0, D_0) \times Ga(a_0, a_0/A_0) \times Ga(b_0, b_0/B_0) \times Ga(m_0, m_0/M_0)$ , where $\eta _0$ and $D_0$ are mean and scale parameters, respectively; $a_0, b_0, m_0$ are shape parameters and $A_0, B_0, M_0$ are mean parameters. This formulation induces a DPM prior for raters’ bias and reliability $\tau _j$ and $1/\sigma ^2_j$ . Figure 1 gives a graphical representation of the model. The independence assumption might be relaxed by employing a suitable multivariate base measure accounting for possible dependencies among the four elements of the vector. However, this implies a more complex specification, which is beyond the purpose of this work. Further constraints on raters’ systematic bias $\tau $ are needed for identifiability purposes, which are discussed in Section 2.5, after presenting the stick-breaking representation.

2.4 Stick-breaking representation

The random probability measures G and H are assigned discrete priors; as a consequence, they can be represented as weighted sums of point masses:

(12) $$ \begin{align} G &= \sum_{n \geq1} \pi_{1n} \delta_{\xi_n}, \end{align} $$
(13) $$ \begin{align} H &= \sum_{k\geq1} \pi_{2k} \delta_{\zeta_k}, \end{align} $$

where the weights $\{\pi _{1n}\}_{n=1}^{\infty }$ and $\{\pi _{2k}\}_{k=1}^{\infty }$ take values on the infinite probability simplex and $\delta _{x}(\cdot )$ stands for the Dirac measure and denotes a point mass at x. Note that we index the components of the infinite mixture (12) corresponding to the subjects with $n=1,\dots ,\infty $, whereas $k=1,\dots ,\infty $ is used for those corresponding to the raters (13). The random vectors $\xi _n=(\mu _n,\omega ^2_n)$, $n=1,\dots ,\infty $, are i.i.d. from the base measure $G_0$, $\zeta _k=(\eta _k,\phi ^2_k,\gamma _k,\beta _k)$, $k=1,\dots ,\infty $, are i.i.d. from the base measure $H_0$, and both sets of vectors are assumed to be independent of the corresponding weights. This makes clear why the expectations of the $DPs$ are $G_0$ and $H_0$, respectively, which are said to be our prior guesses at G and H (see Section 2.2).

This discreteness property of the $DP$ allows us to define G and H through the stick-breaking representation introduced by Sethuraman (Reference Sethuraman1994):

(14) $$ \begin{align} G = \sum_{n \geq1} \pi_{1n} \delta_{\xi_n}, \quad \pi_{1n} = V_{1n} \prod_{l<n}(1-V_{1l}), \quad V_{1n} \overset{\mathrm{iid}}{\sim} Beta(1,\alpha_1), \quad \xi_n \overset{\mathrm{iid}}{\sim} G_0, \end{align} $$

and

(15) $$ \begin{align} H = \sum_{k \geq1} \pi_{2k} \delta_{\zeta_k}, \quad \pi_{2k} = V_{2k} \prod_{l<k}(1-V_{2l}), \quad V_{2k} \overset{\mathrm{iid}}{\sim} Beta(1,\alpha_2), \quad \zeta_k \overset{\mathrm{iid}}{\sim} H_0. \end{align} $$

This construction of the $DP$ implies that, for each subject $i=1,\dots , I$ , $(\mu _i, \omega ^2_i)=\xi _n$ with probability $ \pi _{1n} = V_{1n} \prod _{l<n}(1-V_{1l})$ . Equivalently, for each rater $j=1,\dots ,J$ , the probability that $(\eta _j,\phi ^2_j,\gamma _j,\beta _j)=\zeta _k$ is given by $\pi _{2k} = V_{2k} \prod _{l<k}(1-V_{2l})$ .
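This construction can be simulated directly. Below is a minimal sketch, truncating the infinite sequences at R atoms for computability; all numerical values (the base measure $N(50, 10^2)$ for the atom means, the gamma prior on the precisions, $\alpha_1=1$, and the sample sizes) are arbitrary illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking_weights(alpha, R, rng):
    """Truncated stick-breaking weights: pi_n = V_n * prod_{l<n}(1 - V_l)."""
    V = rng.beta(1.0, alpha, size=R)
    V[-1] = 1.0  # truncation: the last stick absorbs the remaining mass
    pi = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    return pi / pi.sum()  # guard against floating-point drift

# G = sum_n pi_{1n} delta_{xi_n}, with xi_n = (mu_n, omega2_n) iid from G0.
# Illustrative G0: mu_n ~ N(50, 10^2); 1/omega2_n ~ Ga(shape 2, mean 4).
alpha1, R, I = 1.0, 50, 200
pi1 = stick_breaking_weights(alpha1, R, rng)
mu_n = rng.normal(50.0, 10.0, size=R)
omega2_n = 1.0 / rng.gamma(2.0, 4.0 / 2.0, size=R)  # NumPy uses (shape, scale)

# Each subject picks atom n with probability pi_{1n}; ties induce clusters.
c1 = rng.choice(R, size=I, p=pi1)
theta = rng.normal(mu_n[c1], np.sqrt(omega2_n[c1]))
```

With more subjects than retained atoms ($I > R$), ties among the $(\mu_i, \omega^2_i)$ are guaranteed, which is exactly the clustering mechanism the DP induces.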

Moments of student latent true score  $\theta _i$ . The mean and the variance of the subject’s true score $\theta _i$ , $i=1,\dots , I$ , under a $DP(\alpha _1G_0)$ prior are:

(16) $$ \begin{align} \mathbf{E}[\theta_i|G] = \mu_{G} =\sum_{n\geq 1} \pi_{1n} \mu_n, \quad \quad \mathbf{Var}[\theta_i|G] = \omega^2_{G} =\sum_{n\geq 1} \pi_{1n} (\mu_n^2 + \omega_n^2) - \mu^2_{G}, \end{align} $$

where $\mu _n$ and $\omega _n^2$ are the mean and the variance of $\theta _i$ for the nth component of the mixture. Here $\mu _{G}$ is the weighted average across components and captures the mean true score across subjects. The parameter $ \omega ^2_{G}$ is the conditional variance of the infinite mixture and indicates the variability of true scores across subjects.
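The finite-mixture moments in (16) can be verified against direct Monte Carlo draws from a (truncated) realization of G; the mixture draw below is illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# A truncated draw of G: weights pi and atoms (mu_n, omega2_n).
R = 30
V = rng.beta(1.0, 1.0, size=R)
V[-1] = 1.0
pi = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
pi = pi / pi.sum()
mu_n = rng.normal(50.0, 10.0, size=R)
omega2_n = rng.uniform(1.0, 5.0, size=R)

# Equation (16) computed as finite-mixture moments under the truncation.
mu_G = np.sum(pi * mu_n)
omega2_G = np.sum(pi * (mu_n**2 + omega2_n)) - mu_G**2

# Monte Carlo check: theta_i | G drawn directly from the mixture.
n_mc = 400_000
comp = rng.choice(R, size=n_mc, p=pi)
theta = rng.normal(mu_n[comp], np.sqrt(omega2_n[comp]))
```

The sample mean and variance of the simulated $\theta_i$ agree with $\mu_G$ and $\omega^2_G$ up to Monte Carlo error.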

Moments of raters’ bias $\tau _j$ . The mean and the variance of the rater’s bias $\tau _j$ , $j=1,\dots , J$ , under a $DP(\alpha _2H_0)$ prior are:

(17) $$ \begin{align} \mathbf{E}[\tau_j|H] = \eta_{H} =\sum_{k \geq 1} \pi_{2k} \eta_k, \quad \quad \mathbf{Var}[\tau_j|H] = \phi^2_{H} =\sum_{k \geq 1} \pi_{2k} (\eta_k^2 + \phi_k^2) - \eta^2_{H}, \end{align} $$

where $\eta _k$ and $\phi _k^2$ are the mean and the variance of $\tau _j$ for the kth component of the mixture. Here $\eta _{H}$ and $\phi ^2_{H}$ capture the mean and the variance of the systematic bias within the general population of raters.

Moments of raters’ reliability $1/\sigma ^2_j$ . The raters’ residual mean is fixed to zero by the model (9), that is, $\mathbf {E}[\epsilon ]=0$; the mean and variance of the raters’ reliability $1/\sigma ^2_j$ under a $DP(\alpha _2H_0)$ prior are:

(18) $$ \begin{align} \mathbf{E}[1/\sigma^2_j|H] = \beta_{H} =\sum_{k \geq 1} \pi_{2k} \beta_k \quad \quad \mathbf{Var}[1/\sigma^2_j|H] = \psi^2_{H} =\sum_{k \geq 1} \pi_{2k} (\beta_k^2 + \psi_k) - \beta^2_{H}, \end{align} $$

where $\beta _{H}$ captures raters’ weighted average reliability and $\psi ^2_{H}$ indicates the total reliability variance across them. Here $\beta _k$ and $\psi _k=\beta _k^2/\gamma _k$ are, respectively, the mean and the variance of $1/\sigma ^2_j$ for the kth component of the mixture.
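The component-wise moments $\beta_k$ and $\psi_k=\beta_k^2/\gamma_k$ follow from the $Ga(\gamma_k, \gamma_k/\beta_k)$ parametrization in (10) (shape $\gamma_k$, rate $\gamma_k/\beta_k$). A quick Monte Carlo check with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(2)

# Component k of the reliability mixture: 1/sigma_j^2 ~ Ga(gamma_k, gamma_k/beta_k),
# i.e. shape gamma_k and rate gamma_k/beta_k (NumPy scale = beta_k/gamma_k).
gamma_k, beta_k = 3.0, 2.0  # illustrative values
prec = rng.gamma(gamma_k, beta_k / gamma_k, size=500_000)

beta_theory = beta_k              # E[1/sigma^2] = beta_k
psi_theory = beta_k**2 / gamma_k  # Var[1/sigma^2] = psi_k = beta_k^2 / gamma_k
```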

Note that we model the two rater-specific features, i.e., bias and reliability, by placing the same $DP(\alpha _2H_0)$ prior. In other terms, $\tau _j$ and $1/\sigma ^2_j$ are two independent elements of the same vector drawn from H.

Finite stick-breaking approximation. The recursive generation defined in (14) and (15) implies a decreasing stochastic order of the weights $\{\pi _{1n}\}_{n=1}^{\infty }$ and $\{\pi _{2k}\}_{k=1}^{\infty }$ as the indices n and k grow. Considering the expectations $\mathbf {E}[V_{1n}]=1/(1+\alpha _1)$ and $\mathbf {E}[V_{2k}]=1/(1+\alpha _2)$, it is clear that the rates of decrease depend on the concentration parameters $\alpha _1$ and $\alpha _2$, respectively. Values of these parameters close to zero concentrate the mass on the first few atoms, with the remaining atoms assigned small probabilities; this is consistent with the general formulation of the $DP$ discussed in Section 2.2. Given this property of the weights, in practical applications the infinite sequences (12) and (13) are truncated at a sufficiently large value $R \in \mathbb {N}$:

(19) $$ \begin{align} G = \sum_{n =1}^{R} \pi_{1n} \delta_{\xi_n}, \quad \quad H = \sum_{k =1}^{R} \pi_{2k} \delta_{\zeta_k}. \end{align} $$

Following the finite stick-breaking approximation proposed by Ishwaran & James (Reference Ishwaran and James2001), we set $V_{1R}=V_{2R}=1$ and discard the terms $R+1,\dots ,\infty $ for G and H.

The moment formulas (16), (17), and (18) are readily modified according to the truncation and computed as finite mixture moments.
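The quality of the truncation can be assessed through the mass left uncovered by the first R atoms: since $\mathbf{E}[1-V]=\alpha/(1+\alpha)$, the expected leftover mass $\mathbf{E}[\prod_{l\leq R}(1-V_l)]$ decays geometrically as $(\alpha/(1+\alpha))^R$, which guides the choice of R. A numerical sketch with illustrative settings:

```python
import numpy as np

rng = np.random.default_rng(3)

def mean_leftover_mass(alpha, R, n_rep, rng):
    """Monte Carlo estimate of the stick mass E[prod_{l<=R}(1 - V_l)]
    left uncovered when the sequence is truncated at R atoms."""
    V = rng.beta(1.0, alpha, size=(n_rep, R))
    return np.prod(1.0 - V, axis=1).mean()

alpha = 1.0  # illustrative concentration parameter
approx_R10 = mean_leftover_mass(alpha, 10, 100_000, rng)
exact_R10 = (alpha / (1.0 + alpha))**10  # = 2^{-10}, about 1e-3
approx_R30 = mean_leftover_mass(alpha, 30, 100_000, rng)
```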

Nested versions. Semiparametric nested versions of the BNP model might be specified in which either G or H is degenerate on a single component, i.e., $R=1$ for one of them in the finite approximation; that is, subjects or raters are all clustered together. For instance, for very small values of J (i.e., the raters’ sample size), the raters might not reasonably be considered a representative sample of their population, and limited information is available for drawing inference about it. Under these scenarios, raters’ effects might be assumed to be i.i.d. from a normal distribution.

2.5 Semi-centered DPM

Hierarchical models (e.g., GLMMs, linear latent factor models) might suffer from identifiability issues, and constraints on the latent variable distributions are needed to consistently identify and interpret model parameters (Bartholomew et al., Reference Bartholomew, Knott and Moustaki2011; Gelman & Hill, Reference Gelman and Hill2006; Yang & Dunson, Reference Yang and Dunson2010). More specifically, under linear random effects models a standard procedure to achieve identifiability is to constrain the mean of the random effects to be zero (Agresti, Reference Agresti2015). We aim to impose the same mean constraint on our proposal, allowing straightforward and interpretable comparisons between the parametric and the nonparametric models. Similar to Yang & Dunson (Reference Yang and Dunson2010), we adopt a centered DPM prior such that the expected value of the rater systematic bias is fixed to zero, $\mathbf {E}[\tau _j]=0$, for $j=1,\dots ,J$.

Since the rating process focuses on the subjects’ true scores, it is more natural to centre the DPM for the raters’ effects and let the model estimate the mean of the true scores $\mu _G$. Given that the mean of the raters’ residual is fixed to zero in (9), the mean of the raters’ bias needs to be fixed. We adapt the centering procedure based on the parameter-expanded approach proposed by Yang et al. (Reference Yang, Dunson and Baird2010) and Yang & Dunson (Reference Yang and Dunson2010) to our proposal. We specify a semi-centered DPM (SC-DPM) involving an expansion of the raters’ systematic bias $\{\tau ^{\ast}_j\}_1^J$, such that their mean $\eta ^{\ast}_H=0$ a.s. The expanded parameters (8) can be expressed as

(20) $$ \begin{align} \tau^{\ast}_j = \tau_j - \eta_H, \quad \quad \tau_j|\eta_j,\phi^2_j \overset{\mathrm{ind}}{\sim} N(\eta_j,\phi^2_j), \quad \quad j=1,\dots, J, \end{align} $$

and the decomposition of rating $Y_{ij}$ (5) becomes

(21) $$ \begin{align} Y_{ij} = \theta_i + \tau^{\ast}_j +\epsilon_{ij}, \quad \quad i = 1,\dots,I; \;\; j \in \mathcal{R}_i. \end{align} $$

Given the location transformation in (20) the expectation of the expanded parameters is zero:

(22) $$ \begin{align} \mathbf{E}[\tau^{\ast}_j|H] = 0. \end{align} $$

It is worth noting that the centering only concerns the location of the systematic bias and not its scale, as is the case in the centered DPM introduced by Yang & Dunson (Reference Yang and Dunson2010); this explains the term “semi-centering” adopted here to avoid confusion. Accordingly, under the semiparametric specifications, only the location of the parametric distribution needs to be fixed; a zero-mean normal distribution might be a suitable choice.

3 BNP intra-class correlation coefficient

The intra-class correlation coefficient (ICC) is widely used in applied statistics to quantify the degree of association between nested observations (Agresti, Reference Agresti2015; Gelman et al., Reference Gelman, Hwang and Vehtari2014) and to get relevant information about the level of heterogeneity across different groups (Mulder & Fox, Reference Mulder and Fox2019). Indeed, it is commonly applied in psychometrics to assess the consistency of ratings given by different raters to the same subject (Erosheva et al., Reference Erosheva, Martinková and Lee2021; Martinková et al., Reference Martinková, Bartoš and Brabec2023; Nelson & Edwards, Reference Nelson and Edwards2010, Reference Nelson and Edwards2015; Ten Hove et al., Reference Ten Hove, Jorgensen and van der Ark2021, Reference Ten Hove, Jorgensen and van der Ark2022). We provide a within-subject correlation structure (for any subject and a given rater pair) $ICC_{j,j'}$ based on the BNP model presented in Section 2.3. This formulation relates to those proposed in the psychometric literature regarding the $ICC_1$ (e.g., Bradlow et al., Reference Bradlow, Wainer and Wang1999; De Boeck, Reference De Boeck2008; Erosheva et al., Reference Erosheva, Martinková and Lee2021; Fox & Glas, Reference Fox and Glas2001; Shrout & Fleiss, Reference Shrout and Fleiss1979; Werts et al., Reference Werts, Linn and Jöreskog1974), but does not rely on strong distributional assumptions and naturally accommodates both subject and rater sub-populations. We also propose a lower bound $ICC_A$ for the expected ICC, which might be used for inference about the general population of raters. An exact formula for the ICC suitable for reduced one-way designs is proposed in Section 4.1. We propose a BNP class of ICCs which generalize those recently proposed by Lin et al. (Reference Lin, Tuerlinckx and Vanbelle2025) and Ten Hove et al. (Reference Ten Hove, Jorgensen and van der Ark2021) under generalizability theory.

The paragraphs below provide preliminary information on computing the ICC under a parametric framework necessary to detail the BNP extension.

Parametric ICC. Under a parametric standard framework, i.e., equipping the parameters with finite-dimensional priors, the ICC is defined as the proportion of variance of the ratings due to the subjects’ true score:

(23) $$ \begin{align} ICC = \frac{\omega_i^2}{\omega_i^2+\phi_j^2+\sigma_j^2} = \frac{\omega^2}{\omega^2+\phi^2+\sigma^2}, \end{align} $$

assuming $\omega ^2_i=\omega ^2$, for $i=1,\dots , I$; $\phi ^2_j=\phi ^2$ and $\sigma ^2_j=\sigma ^2$, for $j=1,\dots , J$. Given two raters $j,j' \in \mathcal {R}_i, \;\; j \neq j'$ who rate the same subject i, the ICC is the correlation between the ratings $Y_{ij}$ and $Y_{ij'}$. Note that under this formulation $ICC \in [0,1]$; it cannot capture negative correlations. This index is also interpreted as the inter-rater reliability of a single rating and is also indicated by $IRR_1$ (see Erosheva et al., Reference Erosheva, Martinková and Lee2021 for further details). The homoscedasticity assumption may be relaxed, and the raters’ residual variance may be allowed to vary across raters according to (9) and (10), given $\gamma _j=\gamma $ and $\beta _j=\beta $ for $j=1,\dots , J$.

Given that $\sigma _j^2 \neq \sigma _{j'}^2$ for $j \neq j'$, it is possible to compute as many ICC indices as there are pairs of raters, i.e., $J(J-1)/2$. In such cases the resulting $ICC_{j,j'}$ is the conditional correlation between the ratings given to a random subject by raters j and $j'$, given the other parameters:

(24) $$ \begin{align} ICC_{j,j'} = \frac{\omega^2}{\sqrt{\omega^2+\phi^2+\sigma_j^2}\sqrt{\omega^2+\phi^2+\sigma_{j'}^2}}. \end{align} $$

A more general index accounting for all raters’ residual variances might be more useful in applications. Although the expected ICC, i.e., $\mathbf {E}[ICC|\omega ^2,\phi ^2]$, might represent a neat solution, it is not available in closed form, and the posterior mean taken over the MCMC might be prohibitive in large-scale assessments, since there are $J(J-1)/2$ ICC indices to compute at each iteration. An alternative index that might be readily computed is the ICC between two raters with average reliability. That is, we replace $\sigma _j^2$ with its expectation, i.e., $\mathbf {E}[\sigma ^2]$:

(25) $$ \begin{align} ICC_A = \frac{\omega^2}{\omega^2+\phi^2+\mathbf{E}[\sigma^2]}. \end{align} $$

It gives the correlation between the ratings given to the same random subject $i=1,\dots , I$ by two random raters $j,j' \in \mathcal {R}_i, \;\; j \neq j'$, satisfying $\sigma ^2_j=\sigma ^2_{j'}=\mathbf {E}[\sigma ^2]$; that is, the correlation between two ratings given to the same random student by two raters having an average reliability level. We note that the expected pairwise ICC and the pairwise ICC between two average-reliability raters are different quantities. Nonetheless, relying on a theoretical result given below, the $ICC_A$ carries information about the former.

Given that the rater’s reliability is assumed to follow a gamma distribution (10), its inverse follows an inverse gamma distribution $\sigma ^2_j|\gamma ,\beta \overset {\mathrm {ind}}{\sim } IGa(\gamma ,\gamma /\beta )$ for $j=1,\dots , J$, whose expected value is only defined for $\gamma>1$. In such cases we reparametrize (10):

(26) $$ \begin{align} 1/\sigma^2_j|\gamma,\beta \overset{\mathrm{iid}}{\sim} Ga\left(1+\gamma,\frac{1+\gamma}{\beta}\right), \quad j=1,\dots,J. \end{align} $$

This specification ensures the expectation of raters’ residual variance to be defined for any $\gamma>0$ and implies

(27) $$ \begin{align} \mathbf{E}[\sigma^2_j|\gamma,\beta]=\tilde{\sigma}= \frac{1+\gamma}{\beta \gamma}. \end{align} $$

It is the mean raters’ residual variance and its derivation is given in the Supplementary Material. The $ICC_A$ under the new parametrization is

(28) $$ \begin{align} ICC_A = \frac{\omega^2}{\omega^2+\phi^2+\tilde{\sigma}}. \end{align} $$
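The mean in (27) follows from standard inverse-gamma moments; as a brief sketch (the full derivation is given in the Supplementary Material): since $1/\sigma^2_j|\gamma,\beta \sim Ga(1+\gamma,(1+\gamma)/\beta)$, we have $\sigma^2_j|\gamma,\beta \sim IGa(1+\gamma,(1+\gamma)/\beta)$ and, using $\mathbf{E}[X]=b/(a-1)$ for $X \sim IGa(a,b)$ with $a>1$,

$$ \begin{align*} \mathbf{E}[\sigma^2_j|\gamma,\beta] = \frac{(1+\gamma)/\beta}{(1+\gamma)-1} = \frac{1+\gamma}{\beta\gamma}, \end{align*} $$

which is finite for every $\gamma>0$, as required.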

Figure 2 shows the difference between the empirical mean pairwise $ICC$ between each rater and the others (red solid line) and the computed $ICC_A$ (blue solid line) across independent datasets and different reliability scenarios. The difference between the two indices is consistently small, and it appears to narrow as the reliability level increases.

Figure 2 Illustrative examples of empirical $ICC_A$ and $\mathbf {E}[ICC]$ across independent datasets and under different reliability scenarios. The grey balls indicate the mean pairwise $ICC$ between each rater and the others; the black triangles represent the computed $ICC_A$ .
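The relationship between $ICC_A$ and the expected pairwise ICC can also be explored in a small simulation; a hedged sketch, where the values of $\omega^2$, $\phi^2$, $\gamma$, and $\beta$ are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)

omega2, phi2 = 4.0, 1.0  # true-score and bias variances (illustrative)
gamma, beta = 2.0, 1.0   # reliability parameters of (26) (illustrative)

# Reliabilities 1/sigma_j^2 ~ Ga(1 + gamma, rate (1 + gamma)/beta), as in (26);
# NumPy parametrizes the gamma by shape and scale = beta / (1 + gamma).
n_pairs = 200_000
sig2_j = 1.0 / rng.gamma(1.0 + gamma, beta / (1.0 + gamma), size=n_pairs)
sig2_jp = 1.0 / rng.gamma(1.0 + gamma, beta / (1.0 + gamma), size=n_pairs)

# Pairwise ICC (24) averaged over random rater pairs (~ E[ICC]) vs ICC_A (28).
icc_pair = omega2 / (np.sqrt(omega2 + phi2 + sig2_j) *
                     np.sqrt(omega2 + phi2 + sig2_jp))
sigma_tilde = (1.0 + gamma) / (beta * gamma)   # E[sigma^2], equation (27)
icc_A = omega2 / (omega2 + phi2 + sigma_tilde)
```

Because $x \mapsto (\omega^2+\phi^2+x)^{-1/2}$ is convex, Jensen's inequality makes the Monte Carlo mean of the pairwise ICC fall above $ICC_A$, while the gap stays modest.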

BNP ICC. The moments defined in (16), (17), and (18) account for heterogeneous populations of subjects and raters and can be used to compute a flexible ICC.

Proposition 1. Given a random subject $i=1,\dots ,I$ , independently scored by two random raters ${j,j' \in \mathcal {R}_i, j \neq j'}$ , the conditional correlation between the scores $Y_{ij}$ and $Y_{ij'}$ is

(29) $$ \begin{align} ICC_{j,j'} = Corr\left(Y_{ij}, Y_{ij'} |G,H, \sigma_j^2, \sigma_{j'}^2 \right) = \frac{\omega^2_G}{\sqrt{\omega^2_G +\phi^2_H +\sigma_j^2}\sqrt{\omega^2_G +\phi^2_H +\sigma_{j'}^2} }. \end{align} $$

The proof is reported in Appendix B. However, a more general index, not conditioned on specific raters’ parameters, might be more useful in practice. For this reason, we propose an $ICC_A$ index for this BNP class of models. To this aim, the variance of subjects’ true score $\omega ^2_G$ and the variance of raters’ systematic bias $\phi ^2_H$ can be directly plugged into the ICC formula. Since we have heteroscedasticity across raters, we need to take the expectation of the raters’ residual variance $\mathbf {E}[\sigma ^2|H]=\tilde {\sigma }_H$. Similarly to the parametric case above, we reparametrize (10) with:

(30) $$ \begin{align} 1/\sigma^2_j|\gamma,\beta \overset{\mathrm{ind}}{\sim} Ga\left(1+\gamma_j,\frac{1+\gamma_j}{\beta_j}\right), \quad j=1,\dots,J, \end{align} $$

and define:

(31) $$ \begin{align} \mathbf{E}[\sigma^2_j|G,H]=\mathbf{E}[\sigma^2_j|H] = \tilde{\sigma}_{H} =\sum_{k \geq 1} \pi_{2k} \tilde{\sigma}_k, \end{align} $$

where $\tilde {\sigma }_k = (1+\gamma _k)/(\beta _k \gamma _k)$ is the mean residual variance for the kth component of the infinite mixture. As a result, the $ICC_A$ for the BNP models might be computed as reported below.

Proposition 2. Given a random subject $i=1,\dots ,I$ , independently scored by two random raters ${j,j' \in \mathcal {R}_i, \; j \neq j'}$ , satisfying $\sigma _j^2=\sigma _{j'}^2=\tilde {\sigma }_H$ :

  1. (i) the conditional correlation between the ratings $Y_{ij}$ and $Y_{ij'}$ is

    (32) $$ \begin{align} ICC_A = Corr\left(Y_{ij}, Y_{ij'} |G,H, \sigma_j^2=\sigma_{j'}^2=\tilde{\sigma}_H \right) = \frac{\omega^2_G}{\omega^2_G +\phi^2_H +\tilde{\sigma}_H}; \end{align} $$
  2. (ii) the $ICC_A$ is the lower bound of the conditional expectation of the correlation between the ratings $Y_{ij}$ and $Y_{ij'}$ (ICC):

    (33) $$ \begin{align} ICC_A \leq \mathbf{E}[Corr\left(Y_{ij}, Y_{ij'}|G,H \right)] = \mathbf{E}[ICC|G,H]. \end{align} $$

The proofs are reported in Appendix B. The index therefore accounts for the heterogeneity of the two populations (subjects and raters). It reduces to the parametric $ICC_A$ (28) whenever $\omega ^2_n=\omega ^2$, for ${n=1,\dots , \infty }$; $\phi ^2_k=\phi ^2$ and $\tilde {\sigma }_k=\tilde {\sigma }$, for $k=1,\dots , \infty $; hence $ICC_A$ (32) is a generalization of its parametric version (28). The $ICC_A$ might reveal valuable information in inter-rater reliability or agreement analyses. For instance, when the ICC is used as an inter-rater reliability index (Erosheva et al., Reference Erosheva, Martinková and Lee2021; Martinková et al., Reference Martinková, Bartoš and Brabec2023; Ten Hove et al., Reference Ten Hove, Jorgensen and van der Ark2022), the $ICC_A$ is the lower bound of the expected inter-rater reliability of a single rating.
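To make the plug-in computation concrete, here is a small sketch under the truncated representation; the mixture draw and the values of $\omega^2_G$ and $\phi^2_H$ are illustrative stand-ins for quantities that would come from the MCMC:

```python
import numpy as np

rng = np.random.default_rng(5)

# A truncated draw of H's reliability part: weights pi2 and component
# parameters (gamma_k, beta_k) under parametrization (30).
R = 20
V = rng.beta(1.0, 1.0, size=R)
V[-1] = 1.0
pi2 = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
pi2 = pi2 / pi2.sum()
gamma_k = rng.gamma(2.0, 1.0, size=R) + 0.1  # kept strictly positive
beta_k = rng.gamma(2.0, 1.0, size=R) + 0.1

# Equation (31): mixture mean of the residual variance, then ICC_A (32).
sigma_tilde_k = (1.0 + gamma_k) / (beta_k * gamma_k)
sigma_tilde_H = np.sum(pi2 * sigma_tilde_k)

omega2_G, phi2_H = 4.0, 1.0  # illustrative values for (16) and (17)
icc_A = omega2_G / (omega2_G + phi2_H + sigma_tilde_H)

# Homogeneous components collapse to the parametric ICC_A (28):
# with gamma_k = 2 and beta_k = 1.5, sigma_tilde_k = 1 for every k.
sigma_tilde_flat = np.sum(pi2 * ((1.0 + 2.0) / (1.5 * 2.0)))
icc_A_flat = omega2_G / (omega2_G + phi2_H + sigma_tilde_flat)
```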

In this work, we mainly focus on the population level $ICC_A$ , but different ICC indices can be computed and compared under this framework by conditioning on different subjects or raters’ clusters.

We note that this class of ICCs generalizes those proposed by Lin et al. (Reference Lin, Tuerlinckx and Vanbelle2025) for multilevel data. As a consequence, Propositions 1 and 2 hold for standard parametric multilevel models (Ten Hove et al., Reference Ten Hove, Jorgensen and van der Ark2021, Reference Ten Hove, Jorgensen and van der Ark2022). The BNP ICCs we propose account for heteroscedasticity across raters and naturally accommodate multiple clusters of both subjects and raters, whereas the standard ICCs commonly do not encompass these heterogeneous aspects. More details on these aspects are given in Appendix A.2.

4 Reduced model for one-way designs

One-way designs are common when the raters’ identities are unknown and the systematic biases $\{\tau _j\}_1^J$ are not identifiable. This design might be seen as a limiting case in which each rater scores only one subject, i.e., $|\mathcal {S}_j|=1$.

Some blocks of the model in Section 2.3 simplify, as briefly presented below. Note that we model the subjects’ true score $\theta _i$ as in the main model, (6) and (7).

Modelling rating $Y_{ij}$ . We decompose the observed rating $Y_{ij}$ as

(34) $$ \begin{align} Y_{ij} = \theta_i + \epsilon_{ij}, \quad i = 1,\dots,I; \;\; j \in \mathcal{R}_i, \end{align} $$

Here $\epsilon _{ij}$ is the error of rater j in rating subject i, i.e., the difference between the observed score $Y_{ij}$ and the subject’s true score $\theta _i$.

Modelling raters’ error $\epsilon _{ij}$ . For each rating $Y_{ij}$ we assume that the rater’s error $\epsilon _{ij}$ is drawn independently from a normal distribution with mean $\eta _{ij}$ and variance $\phi ^2_{ij}$:

(35) $$ \begin{align} \epsilon_{ij}|\eta_{ij},\phi^2_{ij} \overset{\mathrm{ind}}{\sim} N(\eta_{ij},\phi^2_{ij}), \quad i = 1,\dots,I; \;\; j \in \mathcal{R}_i. \end{align} $$

We specify a DP prior with concentration parameter $\alpha _2$ and base measure $H_0$ for the two-dimensional vector $(\eta _{ij},\phi ^2_{ij})$, for $i=1,\dots ,I$ and $ j \in \mathcal {R}_i$:

(36) $$ \begin{align} (\eta_{ij},\phi^2_{ij})|H \overset{\mathrm{iid}}{\sim} H, \quad \; H \sim DP(\alpha_2 H_0). \end{align} $$

We assume $\eta _{ij}$ and $\phi ^2_{ij}$ to be independent and choose $H_0=N(\eta _0, D_0) \times IGa(a_0, A_0)$, where $\eta _0$ and $D_0$ are mean and scale parameters, respectively, and $a_0$ and $A_0$ are shape and scale parameters. This formulation induces a DPM prior for the raters’ error $\epsilon _{ij}$.

4.1 Identifiability and ICC

The moments of the error $\epsilon _{ij}$, $i=1,\dots ,I$ and $ j \in \mathcal {R}_i$, are, respectively:

(37) $$ \begin{align} \mathbf{E}[\epsilon_{ij}|H] = \eta_{H} =\sum_{k\geq1} \pi_{2k} \eta_k, \quad \quad \mathbf{Var}[\epsilon_{ij}|H] = \phi^2_{H} =\sum_{k\geq1} \pi_{2k} (\eta_k^2 + \phi_k^2) - \eta^2_{H}. \end{align} $$

The centering strategy detailed in Section 2.5 is used here, and an SC-DPM prior is placed over $\epsilon _{ij}$:

(38) $$ \begin{align} \epsilon^{\ast}_{ij} = \epsilon_{ij} - \eta_H, \quad \quad \epsilon_{ij}|\eta_{ij},\phi^2_{ij} \overset{\mathrm{ind}}{\sim} N(\eta_{ij},\phi^2_{ij}), \quad \quad i = 1,\dots,I; \;\; j \in \mathcal{R}_i. \end{align} $$

Under this parameter-expanded specification, the decomposition of rating $Y_{ij}$ (34) becomes

(39) $$ \begin{align} Y_{ij} = \theta_i +\epsilon^{\ast}_{ij}, \quad \quad \quad i = 1,\dots,I; \;\; j \in \mathcal{R}_i. \end{align} $$

Given the location transformation in (38), the expectation of the residuals is zero:

(40) $$ \begin{align} \mathbf{E}[\epsilon^{\ast}_{ij}|H] = 0. \end{align} $$

For one-way designs, the exact general ICC can be consistently estimated.

Proposition 3. Given a random subject i, $i=1,\dots ,I$ , independently scored by two random raters $j, j' \in \mathcal {R}_i$ , $j \neq j'$ , the conditional correlation between the ratings $Y_{ij}$ and $Y_{ij'}$ is

(41) $$ \begin{align} Corr\left(Y_{ij}, Y_{ij'} |G,H \right) = ICC = \frac{\omega^2_G}{\omega^2_G + \phi^2_H}. \end{align} $$

The proof is given in Appendix B. As in the main model, conditioning on different clusters of subjects or raters leads to different ICC formulations and allows comparisons among clusters.
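A quick simulation illustrates Proposition 3; a hedged sketch with G and H collapsed to single normal components for simplicity and arbitrary illustrative variances:

```python
import numpy as np

rng = np.random.default_rng(6)

# One-way design (39): each subject is rated twice by anonymous raters.
I = 200_000
omega2_G, phi2_H = 4.0, 1.0  # illustrative values

theta = rng.normal(0.0, np.sqrt(omega2_G), size=I)       # true scores
y1 = theta + rng.normal(0.0, np.sqrt(phi2_H), size=I)    # first rating
y2 = theta + rng.normal(0.0, np.sqrt(phi2_H), size=I)    # second rating

icc_theory = omega2_G / (omega2_G + phi2_H)  # Proposition 3: 4/5 = 0.8 here
icc_emp = np.corrcoef(y1, y2)[0, 1]
```

The empirical correlation between the two ratings of each subject matches $\omega^2_G/(\omega^2_G+\phi^2_H)$ up to Monte Carlo error.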

5 Posterior inference

The parameters of the DPs’ base measures (i.e., $G_0$, $H_0$) and the respective concentration parameters $\alpha _1$ and $\alpha _2$ have to be assigned either a value or a hyperprior to complete the model specification and conduct posterior inference. This section outlines our choices for the hyperpriors and the posterior computation. Several specifications may be considered for the DP parameters (Ghosal & van der Vaart, Reference Ghosal and van der Vaart2017; Hjort et al., Reference Hjort, Holmes, Müller and Walker2010), as they may be assigned a prior or fixed in advance. We place hyperpriors on these parameters and let the data inform them.

Under this model specification, the most natural choices for computing the posterior are conditional sampling schemes, such as blocked Gibbs sampling, which rely upon the approximate stick-breaking construction of the DP. They directly involve the prior in the sampling scheme, avoiding its marginalization and accommodating hyperpriors for the base measures (Ishwaran & James, Reference Ishwaran and James2001). They also come with further advantages, such as improved mixing, better interpretability of the mixture parameters (Gelman et al., Reference Gelman, Hwang and Vehtari2014; Hjort et al., Reference Hjort, Holmes, Müller and Walker2010) and the direct computation of the ICC. Indeed, by avoiding the marginalization of the prior, the moments (16), (17), and (18) can be readily computed and plugged into the ICC formula (32).

However, tailored considerations have to be made in practical applications based on specific data features.

5.1 Hyperprior specification

The elicitation of the concentration and base-measure parameters plays a role in controlling the posterior distribution over clusterings (Gelman et al., Reference Gelman, Hwang and Vehtari2014). Small values of the variance parameters of the base measures $G_0$ and $H_0$ favor the allocation of subjects and raters, respectively, to different clusters. On the contrary, larger values of the $G_0$ and $H_0$ variances favor the allocation of different subjects and raters, respectively, to the same cluster.

We improve model flexibility by placing priors on the base measures $G_0$ and $H_0$ and the concentration parameters $\alpha _1$ and $\alpha _2$, letting them be informed by the data. For the subjects’ true score base measure $G_0=N(\mu _0, S_0) \times Ga(w_0, w_0/W_0)$, the following hyperpriors are specified:

$$ \begin{align*} \mu_0 \sim N(\lambda_{\mu_0},\kappa^2_{\mu_0}), \quad S_0 \sim IGa(q_{S_0},Q_{S_0}), \quad w_0 \sim Ga(q_{w_0},Q_{w_0}), \quad W_0 \sim IGa(q_{W_0},Q_{W_0}). \end{align*} $$

We let $\lambda _{\mu _0}$ be the rating scale’s center value (e.g., $\lambda _{\mu _0}=50$ on a 1–100 rating scale), $\kappa ^2_{\mu _0}=100$, and the parameters $q_{w_0}, Q_{w_0},q_{W_0}, Q_{W_0}$ equal to 0.005. For the raters’ base measure $H_0=N(\eta _0, D_0) \times Ga(a_0, a_0/A_0) \times Ga(b_0, b_0/B_0) \times Ga(m_0, m_0/M_0)$, the following hyperpriors are specified:

$$ \begin{align*} \eta_0 \sim N(\lambda_{\eta_0},\kappa^2_{\eta_0}), \quad D_0 \sim IGa(q_{D_0},Q_{D_0}), \quad a_0 \sim Ga(q_{a_0},Q_{a_0}), \quad A_0 \sim IGa(q_{A_0},Q_{A_0}), \end{align*} $$
$$ \begin{align*} b_0 \sim Ga(q_{b_0},Q_{b_0}), \quad B_0 \sim IGa(q_{B_0},Q_{B_0}), \quad m_0 \sim Ga(q_{m_0},Q_{m_0}), \quad M_0 \sim IGa(q_{M_0},Q_{M_0}). \end{align*} $$

where $\lambda _{\eta _0}=0$, $\kappa ^2_{\eta _0}=100$, and the other hyperparameters are fixed to 0.005. The concentration parameters $\alpha _1$ and $\alpha _2$ are assumed to follow gamma distributions:

$$ \begin{align*} \alpha_1 \sim Ga(a_1,A_1) \quad \quad \alpha_2 \sim Ga(a_2, A_2). \end{align*} $$

where $a_1,A_1,a_2,A_2$ are fixed to 1. The values we fix for the hyperpriors’ parameters are very common in the literature and consistent with those proposed in many other studies on BNP models (e.g., Gelman et al., Reference Gelman, Hwang and Vehtari2014; Heinzl et al., Reference Heinzl, Fahrmeir and Kneib2012; Mignemi et al., Reference Mignemi, Calcagnì, Spoto and Manolopoulou2024; Paganin et al., Reference Paganin, Paciorek, Wehrhahn, Rodríguez, Rabe-Hesketh and de Valpine2023; Yang & Dunson, Reference Yang and Dunson2010).

5.2 Posterior computation

Since most of the parameters in the model have conjugate prior distributions, a blocked Gibbs sampling algorithm is used for posterior sampling (Ishwaran & James, Reference Ishwaran and James2001). No conjugate priors are available for the gamma shape parameters (e.g., $\gamma _k$, $k=1,\dots , R$, $a_0$, $b_0$), thus we approximate their full conditionals using a derivatives-matching (D-M) procedure, incorporated as an additional sampling step within the MCMC. This method has several advantages over other sampling schemes (e.g., adaptive rejection sampling or Metropolis–Hastings) in terms of efficiency, flexibility, and convergence properties (Miller, Reference Miller2019). We use the same D-M algorithm introduced by Miller (Reference Miller2019) to approximate the posteriors of the gamma shape parameters of the base measures, i.e., $w_0, a_0, b_0, m_0$, and a modified version for the parameters $\gamma _k$, $k=1,\dots , R$, since the parametrization (30) is adopted. We detail this adapted version of the D-M algorithm in the paragraph below and provide the complete Gibbs sampler in the Supplementary Material.

We introduce here the notation for the independent allocation of subjects and raters to the corresponding clusters. Let $c_{1i}$ denote the cluster allocation of subject $i=1,\dots , I$, with $c_{1i}=n$ whenever $\xi _i=\xi _n$, $n=1,\dots , R$. Given the finite stick-breaking approximation detailed in Section 2.4, R is the maximum number of clusters. We indicate with $\mathcal {C}_{1n}$ the set of all the subjects assigned to the nth cluster and with $N_{1n}=|\mathcal {C}_{1n}|$ its cardinality. Accordingly, let $c_{2j}$ denote the cluster allocation of rater $j=1,\dots , J$, such that $c_{2j}=k$ whenever $\zeta ^{\ast}_j=\zeta ^{\ast}_k$, $k=1,\dots , R$. The set of all the raters assigned to the kth cluster is denoted by $\mathcal {C}_{2k}$, with $N_{2k}=|\mathcal {C}_{2k}|$ being its cardinality.

Derivatives-matching procedure. Since no conjugate priors are available for the gamma’s shape parameters $\{\gamma _k\}_1^R$ , we involve, for each of these parameters, a D-M procedure to find a gamma distribution that approximates the full conditional distribution of these parameters, when their prior is also a gamma distribution (Miller, Reference Miller2019).

We aim to approximate $p(\gamma _k|\cdot )$ , i.e., the true full conditional density of $\gamma _k$ , $k=1,\dots , R$ , by finding $U_{1k}$ and $U_{2k}$ such that

(42) $$ \begin{align} p(\gamma_k|\cdot) \approx g(\gamma_k|U_{1k},U_{2k}), \quad k=1,\dots,R, \end{align} $$

where $g(\cdot )$ is a gamma density and $U_{1k}$ and $U_{2k}$ are shape and rate parameters, respectively. The algorithm finds $U_{1k}$ and $U_{2k}$ such that the first and second derivatives of the log densities of $p(\gamma _k|\cdot )$ and $g(\gamma _k|U_{1k}, U_{2k})$ match at a point $\gamma _{k}$. Miller (Reference Miller2019) suggests choosing $\gamma _{k}$ near the mean of $p(\gamma _k|\cdot )$ for computational convenience. The approximation is iteratively refined by matching derivatives at the mean of the current $g(\cdot )$, as shown in Algorithm 1. We adapt the algorithm to our proposal; more specifically, we consider the model involving the shape constraint introduced in equation (30). When this constraint is not imposed, the original algorithm by Miller (Reference Miller2019) may be used directly.

We denote by $X_{1k}$ and $X_{2k}$ the sufficient statistics for $\gamma _k$ corresponding to the kth raters’ mixture component. For the implementation of Algorithm 1 we set the convergence tolerance $\epsilon _0=10^{-8}$ and the maximum number of iterations $M=10$. Here $\psi (\cdot )$ and $\psi '(\cdot )$ are the digamma and trigamma functions, respectively.

The parameters $U_{1k}$ and $U_{2k}$ , returned by the algorithm, are used to update $\gamma _k \sim Ga(U_{1k},U_{2k})$ , $k=1,\dots ,R$ , through the MCMC sampling. The derivation of the algorithm is given in the Supplementary Material.
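To convey the mechanics, here is a generic sketch of the derivative-matching step. This is not the paper's Algorithm 1, which targets the full conditional of $\gamma_k$ through the sufficient statistics $X_{1k}$, $X_{2k}$ and the digamma and trigamma functions; only the iteration structure is the same, and the callables `h1`, `h2` are assumed inputs:

```python
import numpy as np

def dm_gamma_approx(h1, h2, x0, tol=1e-8, max_iter=10):
    """Approximate a density p on (0, inf) by Ga(A, B) by matching the first
    two derivatives of log p (passed as callables h1, h2) at the mean of the
    current gamma approximation, iterating as in Miller (2019)."""
    x = x0
    A, B = 1.0, 1.0
    for _ in range(max_iter):
        A = 1.0 - x**2 * h2(x)     # from  d2/dx2 log Ga(A, B) = -(A - 1)/x^2
        B = (A - 1.0) / x - h1(x)  # from  d/dx  log Ga(A, B) = (A - 1)/x - B
        x_new = A / B              # mean of the current approximation
        if abs(x_new - x) < tol * abs(x):
            x = x_new
            break
        x = x_new
    return A, B

# Sanity check: if the target is exactly Ga(3, 2), the matching is exact,
# since log p(x) = 2 log x - 2 x + const.
A, B = dm_gamma_approx(h1=lambda x: 2.0 / x - 2.0,
                       h2=lambda x: -2.0 / x**2,
                       x0=1.0)
```

When the target is itself a gamma density, one matching step already recovers the exact shape and rate, as the sanity check shows.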

5.3 Post-processing procedures

Semi-centered DPM processes. The sampling scheme detailed in the Supplementary Material provides draws under the noncentered DPM model. However, as discussed in Section 2.5, that model is not identifiable, and we need to post-process the MCMC samples to make inferences under the SC-DPM parameter-expanded model (Yang & Dunson, 2010). Since it is a semi-centered model that naturally constrains the raters’ systematic biases $\{\tau _j\}_1^J$ to have zero mean, a few location transformations are needed. After computing $\eta _H$ according to equation (17) for each iteration, the samples of $\mu _0,\mu _G,\{\theta _i\}_1^I$ , and $\{\tau _j\}_1^J$ are transformed:

$$ \begin{align*} \mu_0^{\ast} &= \mu_0 + \eta_H, \\ \mu_G^{\ast} & = \mu_G + \eta_H, \\ \theta_i^{\ast} & = \theta_i + \eta_H, \quad \quad \text{for} \;\; i=1,\dots,I; \\ \tau_j^{\ast} & = \tau_j - \eta_H, \quad \quad \text{for} \;\; j=1,\dots,J. \end{align*} $$

The first three are due to the location transformation of $\tau _j$ and have to be considered for inference purposes under the SC-DPM model.

Posterior densities and clusters point estimates. Each density equipped with a BNP prior might be monitored along the MCMC over a dense grid of equally spaced points (Gelman et al., 2014; Mignemi et al., 2024; Yang & Dunson, 2010). At each iteration, every grid point is evaluated under the mixture resulting from the finite stick-breaking approximation. At the end of the MCMC, the posterior mean and a credible interval can be computed at each grid point and, as a by-product, the pointwise posterior distribution of the density can be represented.
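As an illustration, the pointwise summary might be computed from stored stick-breaking draws as follows (a sketch: the `(weights, means, sds)` layout and the normal kernels are assumptions about how draws are stored):

```python
import numpy as np
from scipy.stats import norm

def density_summary(draws, grid):
    """draws: iterable of (weights, means, sds), one truncated stick-breaking
    mixture per MCMC iteration; returns pointwise posterior mean and 95% CI."""
    dens = np.array([
        (w[:, None] * norm.pdf(grid[None, :], m[:, None], s[:, None])).sum(axis=0)
        for w, m, s in draws
    ])
    return dens.mean(axis=0), np.quantile(dens, [0.025, 0.975], axis=0)
```

Plotting the mean curve with the two quantile curves as a shaded band reproduces the kind of density summary shown in Figures 3 and 6.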

The BNP model provides a posterior over the entire space of subjects’ and raters’ partitions, respectively. However, we can summarize these posteriors and determine point estimates of the clustering structures by minimizing the respective variation of information (VI) loss functions. We refer to Wade & Ghahramani (2018) and Meilă (2007) for further details on VI and point estimates of probabilistic clustering.
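For intuition, the VI between two hard partitions is the sum of the two conditional entropies, $VI(c, c') = H(c \mid c') + H(c' \mid c)$. A minimal self-contained version for a single pair of partitions (the package referenced later computes the posterior-expected loss over all MCMC draws):

```python
import numpy as np

def variation_of_information(c1, c2):
    """VI between two hard partitions encoded as label vectors."""
    c1, c2 = np.asarray(c1), np.asarray(c2)
    vi = 0.0
    for a in np.unique(c1):
        for b in np.unique(c2):
            p12 = np.mean((c1 == a) & (c2 == b))  # joint cluster proportion
            if p12 > 0:
                p1, p2 = np.mean(c1 == a), np.mean(c2 == b)
                vi -= p12 * (np.log(p12 / p1) + np.log(p12 / p2))
    return vi
```

VI is zero exactly when the two partitions coincide up to relabeling, which is why it is a natural loss for label-invariant mixture models.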

As with every parameter of the model, we use the posterior distribution of the subject-specific parameters for inference purposes. Point estimates of the subjects’ true scores $\{\theta _i\}_1^I$ , such as the posterior mean (i.e., expected a posteriori, EAP) or the maximum a posteriori (MAP), might be used as official evaluations (i.e., final grades), with the posterior credible intervals quantifying the uncertainty around those values. The $ICC_A$ index (32) can be computed at each iteration of the MCMC to obtain its posterior distribution, which might be used for inference purposes.

Computational details. In the present work, both for the simulations and the real data analysis, similarly to previous works (e.g., Heinzl et al., 2012; Paganin et al., 2023), the number of iterations is fixed at 80,000 (with a thinning factor of 60 due to memory constraints), discarding the first 20,000 as burn-in. We fix the maximum number of clusters at $R=25$ for both the subjects’ and the raters’ DPM priors (Gelman et al., 2014). The package mcclust.ext (Wade, 2015) is used for the point estimates of the clustering structures based on the VI loss functions. We graphically inspect trace plots for convergence and use the package coda for model diagnostics (De Iorio et al., 2023; Plummer et al., 2006). Convergence is also confirmed through multiple runs of the MCMC with different starting values.

6 Simulation study

We perform a simulation study to compare the performance of the proposed models (BNP and a nested version) with that of the standard parametric one, highlighting the strengths of our method. At the individual-specific level, the three models are evaluated on the accuracy of the estimates of the individual-specific parameters they provide (i.e., how close $\theta _i$ , $\tau _j$ , $\sigma ^2_j$ are to the respective true values). At the population level, we compare the estimated population distributions of the subjects’ and raters’ features and evaluate the predictive performance of the three methods across different scenarios.

BP model. The first model is the Bayesian parametric one (BP model), which can be considered a reduction of the BNP model in which all the subjects and all the raters are allocated to a single cluster, respectively, such that $\mu _i=\mu $ and $\omega _i^2=\omega ^2$ , for $i=1,\dots , I$ , and $\eta _j=\eta $ , $\phi _j^2=\phi ^2$ , $\gamma _j=\gamma $ and $\beta _j=\beta $ for $j=1,\dots , J$ . This model might be obtained by fixing the maximum number of clusters at $R=1$ .

BSP model. The second model is the Bayesian semiparametric one (BSP model), in which the normality assumption is relaxed for the subjects’ true scores: we model $\{\theta _i\}_1^I$ as detailed in Section 2.3, but model the raters’ effects $\{\tau _j, 1/\sigma _j^2\}_1^J$ as in the parametric model (i.e., they are all assigned to the same cluster). This implies $R=1$ only for the rater-related DPM. Since in this model both G and H are degenerate on a mixture with a single component, we refer to the structural parameters as $\mu _G=\mu $ , $\omega ^2_G=\omega ^2$ and $\phi ^2_H=\phi ^2$ .

BNP model. The third model is the BNP model presented in Section 2.3 in which the normality assumption is relaxed both for subjects and raters. Under this model, subjects and raters are allowed to be, respectively, assigned to different clusters.

Three data-generative processes are set up with different clustering structures for subjects and raters. The densities of the subjects’ true scores and the raters’ effects are either unimodal, bimodal, or multimodal. This allows us to assess the extent to which BNP priors might mitigate model misspecification, and whether the BNP model reduces to the parametric one when the latter is properly specified; this setup is consistent with other works on BNP modeling in psychometrics (Paganin et al., 2023).

We keep some features of the generated data similar to the real data set analyzed in Section 7 (e.g., sample size, rating scale, ratings per subject); they are also comparable with those of other works on rating models (Bartoš & Martinková, 2024; Martinková et al., 2023). Additional simulation results on small-sample applications of our proposal are presented in the Supplementary Material.

6.1 Setting

We generate subjects’ ratings on a continuous scale, $Y_{ij} \in (1,100)$ . The number of subjects $I=500$ and raters $J=100$ are fixed, whereas the number of ratings per subject and the true generative model vary across scenarios.

Generative scenarios. We manipulate the number of ratings per subject to be $|\mathcal {R}_i| \in \{2,4\}$ for $i=1,\dots ,I$ , since in many real contexts (e.g., education, peer review) it is common for subjects to be rated by only two or a few more independent raters (Zupanc & Štrumbelj, 2018).

Data are generated as specified by equations (5) and (9), together with one of the schemes below, according to the three different scenarios:

  • Unimodal: Under this scenario, subjects’ true score and raters’ effects densities are unimodal:

    $$ \begin{align*} \theta_i \overset{\mathrm{iid}}{\sim} N(50,50), \quad \quad (\tau_j, 1/\sigma_j^2) \overset{\mathrm{iid}}{\sim} N(0,25) \;Ga(10,10/0.15), \end{align*} $$
    for $i=1,\dots ,I$ and $j=1,\dots ,J$ . This corresponds to the standard BP model in which subjects’ true scores are assumed to be i.i.d. across subjects and raters’ effects are drawn jointly i.i.d. across raters.
  • Bimodal: In this scenario, both subjects’ and raters’ populations are composed, respectively, of two different clusters:

    $$ \begin{align*} \theta_i &\overset{\mathrm{iid}}{\sim} 0.7 \cdot N(39,50) + 0.3 \cdot N(75.6,30) , \\ (\tau_j, 1/\sigma_j^2) &\overset{\mathrm{iid}}{\sim} 0.5 \cdot N(-5,10) \;Ga(10,10/0.1) + 0.5 \cdot N(5,5) \;Ga(10,10/0.2), \end{align*} $$
    for $i=1,\dots ,I$ and $j=1,\dots ,J$ .
  • Multimodal: Under this scenario, both subjects and raters are assigned, respectively, to three clusters:

    $$ \begin{align*} \theta_i &\overset{\mathrm{iid}}{\sim} 0.2 \cdot N(35,50) + 0.2 \cdot N(45,20) + 0.6 \cdot N(56.6,20) , \\ (\tau_j, 1/\sigma_j^2) & \overset{\mathrm{iid}}{\sim} 0.4 \cdot SN(-5,3.162,-5) \;Ga(10,10/0.15) \\ &+ 0.4 \cdot N(0,10) \;Ga(10,10/0.10) \\ &+ 0.2 \cdot N(10,10) \;Ga(10,10/0.20), \end{align*} $$
    for $i=1,\dots ,I$ and $j=1,\dots ,J$ . Here $SN(\xi ,\omega ,\alpha )$ stands for the skew-normal distribution with location, scale and slant parameters, $\xi $ , $\omega $ and $\alpha $ , respectively.

These scenarios mimic three different levels of heterogeneity. From an interpretative point of view, in the first scenario, all the subjects’ true scores are concentrated around the center of the rating scale, and the raters are quite homogeneous in their severity and reliability. The heterogeneity of the subjects and the raters is only at the individual level, since they are not nested within clusters. Under the second scenario, we introduce heterogeneity at the population level, as both subjects and raters are assigned to different clusters, respectively. Here, we mimic the case in which subjects are clustered within two different levels of true score (e.g., low versus high proficiency level), and raters are either systematically slightly more lenient and reliable or more severe and less reliable. Under the third scenario, subjects and raters are assigned, respectively, to three poorly separated clusters. This results in a highly negatively skewed distribution for the subjects’ true scores and a multimodal distribution for the raters’ systematic bias. Figures 3 and 4, Figure C.1 in Appendix C, and Figures 1, 2, and 3 in the Supplementary Material show the respective true densities and the empirical distributions of the generated ratings.
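To make the generative scheme concrete, the bimodal scenario might be simulated roughly as follows (a sketch: we assume the second argument of each normal density above is a variance, that ratings follow the additive model $Y_{ij}=\theta_i+\tau_j+\epsilon_{ij}$ of equation (5), and that each subject's $|\mathcal{R}_i|=2$ raters are drawn uniformly at random):

```python
import numpy as np

rng = np.random.default_rng(2024)
I, J, n_raters = 500, 100, 2

# Subjects' true scores: 0.7 N(39, 50) + 0.3 N(75.6, 30)
low = rng.random(I) < 0.7
theta = np.where(low, rng.normal(39, np.sqrt(50), I), rng.normal(75.6, np.sqrt(30), I))

# Raters' effects: joint bias/precision mixture with two equally weighted clusters
lenient = rng.random(J) < 0.5
tau = np.where(lenient, rng.normal(-5, np.sqrt(10), J), rng.normal(5, np.sqrt(5), J))
prec = np.where(lenient, rng.gamma(10, 0.1 / 10, J),  # Ga(10, 10/0.1), mean 0.1
                rng.gamma(10, 0.2 / 10, J))           # Ga(10, 10/0.2), mean 0.2

# Each subject is scored by n_raters randomly chosen raters
ratings = [(i, j, theta[i] + tau[j] + rng.normal(0.0, 1.0 / np.sqrt(prec[j])))
           for i in range(I)
           for j in rng.choice(J, size=n_raters, replace=False)]
```

Swapping the mixture draws for those of the unimodal or multimodal schemes reproduces the other scenarios.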

Figure 3 Average estimated density across 10 independent datasets under different scenarios. The columns indicate the cardinality of $|\mathcal {R}_i|=\{2,4\}$ : left and right, respectively; the rows indicate bimodal or multimodal scenario: first and second row, respectively. The solid red lines indicate the true densities; the solid black line and the shaded grey area indicate, respectively, the point-wise mean and $95\%$ quantile-based credible intervals; the density implied by the BP model (black dotted lines).

Figure 4 Top row: empirical distribution of the data (red solid line) and empirical distribution of replicated data (black solid lines) from the respective BNP and BP posterior distributions (left and right columns, respectively). Middle and bottom row: Test statistics computed on the data (red solid line) and histograms of those computed on replicated data.

Ten independent data sets are generated under each of the six scenarios resulting from the $2 \times 3$ design; for each data set, the standard parametric (BP), the semiparametric (BSP), and the nonparametric (BNP) models are fitted.

Model recovery assessment. Parameter recovery performance is assessed through the Root Mean Square Error (RMSE) and the Mean Absolute Error (MAE), computed, respectively, as the root mean square difference and the mean absolute difference between the posterior mean and the true value of the parameters across data sets. For the subject- and rater-specific parameters, i.e., $\{\theta _i\}_1^I$ and $\{\tau _j, 1/\sigma ^2_j\}_1^J$ , RMSE and MAE are averaged both across individuals and across data sets.

For the sake of comparison across different scenarios, we report the standardized versions of both indices (S-RMSE, S-MAE) for the structural parameters. More precisely, those related to $\mu $ and $\mu _G$ are divided by the mean value of the rating scale, i.e., 50; those regarding $\omega ^2$ , $\omega ^2_G$ , $\phi ^2$ , $\phi ^2_H$ , $\tilde {\sigma }$ , $\tilde {\sigma }_H$ , and the $ICC_A$ are divided by their true values.
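Both indices and their standardized versions reduce to a few lines (a sketch; the `divisor` argument is an illustrative name playing the role of the scale mean or the true value described above):

```python
import numpy as np

def recovery_errors(post_means, truth, divisor=1.0):
    """RMSE and MAE between posterior means and true values,
    optionally standardized by `divisor` (e.g., 50 or the true value)."""
    d = (np.asarray(post_means) - np.asarray(truth)) / divisor
    return np.sqrt(np.mean(d ** 2)), np.mean(np.abs(d))
```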

The models’ performance in recovering the density distributions of the individual-specific parameters is evaluated through visual inspection. We give an example of how different densities might lead to very different conclusions about the data-generating process (Gelman et al., 2013; Paganin & de Valpine, 2024; Steinbakk & Storvik, 2009). Specifically, we draw new replications from the respective posterior predictive distributions and compare these samples to the original data. If the models capture relevant aspects of the data, the two should look similar, and replications should not deviate systematically from the data. We measure discrepancy in central asymmetry through the statistic $T_1(y,\mu _G)=|y_{.25}-\mu _G|-|y_{.75}-\mu _G|$ , where $y_{.25}$ and $y_{.75}$ are the first and third quartiles, and in the left tail weight through the statistic $T_2(y)=\min (y)$ .
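The two discrepancy statistics are straightforward to compute on observed or replicated ratings (a sketch):

```python
import numpy as np

def T1(y, mu):
    """Central asymmetry: |q25 - mu| - |q75 - mu|."""
    q25, q75 = np.quantile(y, [0.25, 0.75])
    return abs(q25 - mu) - abs(q75 - mu)

def T2(y):
    """Left tail weight: the sample minimum."""
    return np.min(y)
```

Each replicated data set drawn from the posterior predictive yields one value of each statistic; a systematic gap between the histogram of those values and the statistic computed on the data flags misfit.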

6.2 Results

Results from the simulation study suggest that our proposals (i.e., BSP and BNP) systematically improve the estimates of the individual-specific parameters across scenarios. The accuracy of these estimates is comparable across the three models under the unimodal scenarios, while the BSP and BNP models outperform, on average, the BP model under the bimodal and multimodal scenarios in both conditions $|\mathcal {R}_i|=2$ and $|\mathcal {R}_i|=4$ . As expected, the accuracy of the subject- and rater-specific parameters is higher in the condition with a larger number of raters per subject, $|\mathcal {R}_i| =4$ (Table 1). As indicated by the RMSE and MAE indices, on average, the estimates of the subject- and rater-specific parameters provided by all the models degrade from the unimodal to the multimodal scenario.

Table 1 RMSE and MAE of individual-specific parameters corresponding to the BP, BSP, and BNP models.

Note: The bold text indicates the most accurate estimates on average.

Table 2 Standardized RMSE and standardized MAE of structural parameters corresponding to the BP, BSP, and BNP models.

Note: The bold text indicates the most accurate estimates on average.

Regarding the population parameter estimates, all the models provide overall similar estimates (Table 2). We observe the largest improvement of the BSP and BNP over the parametric model under the bimodal scenarios concerning subjects’ true score variance $\omega _G^2$ and raters’ systematic bias variance $\phi _H^2$ . However, in these cases, the BP model provides better estimates of the expected residual variance $\tilde {\sigma }_H$ . As a result, these differences are not detectable in the $ICC_A$ estimates and we observe equal accuracy for this index across the three models.

Figure 3 gives some examples of the estimated true score densities under the bimodal and multimodal scenarios; those under the unimodal scenario are reported in Appendix C. The raters’ feature density plots are shown in the Supplementary Material. The BNP model consistently estimates the respective densities under all the considered scenarios. The most prominent improvement of our proposals over the parametric model is observed under the heterogeneous scenarios. Accurate estimates of the densities are also provided in the extreme case of $|\mathcal {R}_i|=2$ , that is, when each subject is rated by only two independent raters. Nonetheless, we note that the uncertainty about the densities is reduced when subjects are rated by a larger number of raters (i.e., $|\mathcal {R}_i|=4$ ). This reduction mostly concerns the subjects’ true score densities across all the scenarios. Our proposals capture the latent clustering structures of both subjects and raters, as displayed by the posterior similarity matrices in Figure C.2 in Appendix C. The entries of these matrices are the pairwise probabilities that two units (subjects or raters, respectively) are clustered together. The clustering structure implied by the generative process under the bimodal scenario is readily recognized by graphical inspection.

The BNP model effectively captures relevant latent aspects of the data, such as deviations from normality both in the center and in the tails of the distributions, across all the scenarios. As a by-product, the replications drawn from the posterior predictive distribution of the BNP model are remarkably more plausible than those generated under the BP model. As shown in Figure 4, the normality assumptions made in the latter model restrict the shapes of the distributions of subjects’ and raters’ features. As a result, when these assumptions are violated, any inference about the data-generating process might be misleading and unreliable. Replications under the BP model are far from the data both in the center and in the tails of the distribution, as suggested by the statistics $T_1(y,\mu )$ and $T_2(y)$ in Figure 4.

The improvement of our method over the parametric one is more prominent when the design is balanced (e.g., fully crossed designs) and the samples of subjects and raters are smaller. We present these results in the Supplementary Material.

7 Application on large-scale essay assessment

We analyze the Matura data set from Zupanc & Štrumbelj (2018) as an illustrative example. The data come from a large-scale essay assessment conducted by the National Examination Centre in upper secondary schools in Slovenia during the nationwide external examination. Each student received a holistic grade on a 1–50 rating scale from two independent teachers. We considered a random sample of $I=700$ students out of the 6,995 who were examined on the argumentative essay during the 2014 spring term. A sample of $J=152$ teachers was involved; they graded, on average, $9.21$ students each, with a minimum of 2 and a maximum of 21 (see Figure 5). The observed ratings ranged from $0$ to $50$ , with a mean of $29.35$ , a skewness of $-0.051$ , and a kurtosis of $3.148$ (see Figure 5). More details about the assessment procedure might be found in Zupanc & Štrumbelj (2018).

Figure 5 The empirical distribution of ratings and the frequency of students per teacher are reported at left and right, respectively.

Model comparison. The three different models detailed in Section 6, i.e., the parametric (BP), the semiparametric (BSP), and the nonparametric (BNP) model, were fitted to these data and compared on their out-of-sample prediction accuracy. The Watanabe–Akaike information criterion (WAIC) was used for this purpose.

This is a fully Bayesian approach to estimating the out-of-sample expectation, which relies on the computed log pointwise posterior predictive density together with a penalty term correcting for the effective number of parameters to prevent overfitting (Gelman et al., 2014). The respective WAIC formulas are provided in the Supplementary Material.
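Given the $S \times N$ matrix of pointwise log likelihoods saved along the MCMC, the criterion might be computed as follows (a sketch using the variance-based penalty of Gelman et al., 2014; some software reports WAIC on the opposite sign convention):

```python
import numpy as np
from scipy.special import logsumexp

def waic(log_lik):
    """log_lik: (S draws, N observations) pointwise log likelihood matrix."""
    S = log_lik.shape[0]
    lppd = np.sum(logsumexp(log_lik, axis=0) - np.log(S))  # log pointwise predictive density
    p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))        # effective number of parameters
    return -2.0 * (lppd - p_waic)
```

Under this sign convention, lower values indicate better expected out-of-sample predictive performance.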

7.1 Results

The total computational elapsed time for the BP, BSP, and BNP models was 180, 300, and 355 minutes, respectively. No convergence or mixing issues emerged from the graphical inspection of the MCMCs and the diagnostics from the coda package (Plummer et al., 2006); further details and examples of trace plots are given in the Supplementary Material. Table 3 reports the WAIC indices for each fitted model: the selection procedure indicates that the BNP model best fits the data and outperforms the others in predicting out-of-sample ratings. These results are consistent with the additional hold-out validation procedure presented in the Supplementary Material. Based on the model comparison procedure, we focus on the results from the BNP model.

Table 3 The WAIC is reported for each of the fitted models: BNP, BP, and BSP; the pairwise WAIC difference ( $\Delta WAIC$ ) between the model with the best fit and each other is reported

Table 4 Posterior mean and $95\%$ quantile-based credible intervals of the estimated structural parameters of the BNP model are reported

Figure 6 The estimated densities of the subject’s true score $\theta $ , rater’s systematic bias $\tau $ and the residual term $\epsilon $ are reported; the black solid lines and the shaded grey areas indicate the pointwise posterior mean and $95\%$ quantile-based credible intervals of the respective densities. The bottom-right panel shows the posterior distribution of the $ICC_A$ ; the black solid and dotted lines indicate, respectively, the $95\%$ credible interval and the posterior mean. The rugs at the margins of the first three panels indicate the clustering of individuals.

The posterior expectations of the student ability mean $\mu _G$ and variance $\omega ^2_G$ population parameters are $29.126$ and $32.702$ , respectively (Table 4). The respective narrow credible intervals suggest low uncertainty about these values. As expected from Antoniak (1974), the posterior values of the concentration parameters $\alpha _1$ and $\alpha _2$ are proportional to the respective sample sizes and larger for the former. Details of the posterior values of the base measures’ parameters are reported in the Supplementary Material. The posterior expectations of the raters’ systematic bias variance $\phi ^2_H$ and expected residual variance $\tilde {\sigma }_H$ are, respectively, $5.465$ and $13.913$ . The corresponding credible intervals suggest low uncertainty around these values (Table 4).

Figure 6 gives a graphical representation of the respective estimated densities. The multimodal distribution of student ability $\theta $ implies heterogeneity among student abilities and points to the presence of multiple sub-populations. The variance in ratings is broadly due to students’ ability, despite the variability of raters’ systematic bias and reliability. Regarding the clustering structure of subjects and raters, the posterior similarity matrix, reported in Figure C.2 in Appendix C, suggests the presence of some latent partition of the subjects, whereas no evidence of raters’ clusters emerged from the posterior. This is coherent with the clusters’ point estimate based on the variation of information (VI) loss function, which indicates four clusters for the subjects and one cluster for the raters. We render this result in Figure 6 through rugs of different colors at the margins of the density plots; these values indicate the posterior mean of each subject- and rater-specific parameter. It is worth noting that we observe one cluster of subjects whose proficiency level is remarkably lower than the others, and another cluster in which subjects’ performance is slightly superior to the others (Figure 6, upper-left; blue and brown rugs, respectively). These subjects might benefit from more personalized and specialized educational pathways. The posterior distribution of the $ICC_A$ , with mean and credible interval equal to $0.627$ and $(0.577, 0.672)$ , respectively, suggests moderate inter-rater reliability; Figure 6 shows the posterior distribution of this index. Since $ICC_A$ might be interpreted as the lower bound of the expected inter-rater reliability of a single rating, poor levels of reliability can be excluded (Koo & Li, 2016).
However, this result is coherent with the findings of the original study by Zupanc & Štrumbelj (2018), where raters’ variability and reliability had a substantial effect on ratings. Aggregating or averaging ratings over different teachers might mitigate inter-rater reliability issues (Erosheva et al., 2021).

8 Coarsened ratings extension

Ratings data might be arbitrarily coarsened into a small number of ordered categories (Goel & Thakor, 2015; Harbaugh & Rasmusen, 2018; Peeters, 2015; van Praag et al., 2025). As a result, continuous ratings that fall between two consecutive cut-offs are collapsed into the same ordered category, and fine-grained distinctions between individual scores are lost (Ho & Reardon, 2012; Reardon et al., 2017). In these cases the available ratings are ordinal, and the rating model proposed in Section 2.3 has to be modified accordingly.

We leverage the underlying response variable formulation to extend the model to the ordinal case and account for the data coarsening mechanism (Agresti, 2015; Albert & Chib, 1993; Bartholomew et al., 2011; Cao et al., 2010; Nelson & Edwards, 2015). Our proposal might be seen as a BNP extension of the heteroscedastic ordered probit (HETOP; Lockwood et al., 2018). We specify the cumulative distribution function of the standard normal, $\Phi (\cdot )$ , as a link function, which implies that we only need to modify equation (5). This extension readily adapts to the One-Way designs presented in Section 4.

We note that coarse and ordinal ratings might be rather different. In the first case, the categories are consecutive intervals of a continuous rating scale, which is not the case for ordinal ratings. Here, we propose the HETOP specification as a straightforward extension of the main model for coarsened ratings and leave formulations better tailored to ordinal data for future investigation.

8.1 Categorical modeling

Modeling rating $Y_{ij}$ . We assume that the observed ordinal rating $Y_{ij}\in \{1,\dots , K\} \subset \mathbb {N}$ is generated by an underlying unobserved normally distributed variable $Y_{ij}^{\ast}$ (Jöreskog & Moustaki, 2001) and that we observe $Y_{ij}=k$ if $\delta _{k-1}<Y_{ij}^{\ast}\leq \delta _k$ , where $\delta _0=-\infty < \delta _1 < \dots < \delta _{K-1} < \delta _K=+\infty $ are ordered thresholds over the underlying response variable distribution and are equal across raters. The underlying variable $Y_{ij}^{\ast}$ might be interpreted as a latent rating or as the original continuous rating before the coarsening procedure. The conditional probability that $Y_{ij}=k$ is

(43) $$ \begin{align} \mathbb{P}[Y_{ij}=k|\theta_i, \tau_j,\sigma_j,\delta_{k-1},\delta_{k}] = \Phi \left( \frac{\delta_{k} - \theta_i - \tau_j}{ \sigma_j} \right) - \Phi \left( \frac{\delta_{k-1} - \theta_i - \tau_j}{ \sigma_j} \right), \end{align} $$

for $i = 1,\dots ,I; \;\; j \in \mathcal {R}_i$ . Additional considerations on the interpretation of $\sigma _j$ under this formulation are given in the Supplementary Material.
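The category probabilities can be evaluated by differencing the normal CDF at consecutive thresholds (a sketch; the argument layout is illustrative):

```python
import numpy as np
from scipy.stats import norm

def category_probs(theta_i, tau_j, sigma_j, deltas):
    """P[Y_ij = k] for k = 1..K under the heteroscedastic ordered probit;
    deltas = (delta_0=-inf, delta_1, ..., delta_{K-1}, delta_K=+inf)."""
    z = (np.asarray(deltas) - theta_i - tau_j) / sigma_j
    return np.diff(norm.cdf(z))  # differences of consecutive CDF values
```

The probabilities sum to one by construction, since the CDF runs from 0 at $\delta_0=-\infty$ to 1 at $\delta_K=+\infty$.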

Identifiability issues. Under this parametrization, we need to impose additional constraints for identifiability purposes, since the underlying response variables’ means and variances are freely estimated (DeYoreo & Kottas, 2018; Kottas et al., 2005). Two thresholds (e.g., $\delta _1$ and $\delta _{K-1}$ , as proposed by Song et al., 2013) have to be fixed in advance, as is common in multi-group analysis (Lockwood et al., 2018). From a statistical perspective, we note that each rater might be seen as a group of observations (Papaspiliopoulos et al., 2023). Moreover, an SC-DPM prior has to be placed on the subjects’ true scores $\{\theta _i\}_1^I$ to fix their mean and resolve identifiability issues (Gelman et al., 2014). As a by-product, under the parameter-expanded specification, equation (43) becomes

(44) $$ \begin{align} \mathbb{P}[Y_{ij}=k|\theta_i^{\ast}, \tau_j^{\ast},\sigma_j,\delta_{k-1},\delta_{k}] = \Phi \left( \frac{\delta_{k} - \theta^{\ast}_i - \tau^{\ast}_j}{ \sigma_j} \right) - \Phi \left( \frac{\delta_{k-1} - \theta^{\ast}_i - \tau^{\ast}_j}{ \sigma_j} \right), \end{align} $$

for $i = 1,\dots ,I; \;\; j \in \mathcal {R}_i$ . Whenever $K=2$ , i.e., with a dichotomous rating scale, $\{\sigma _j\}_1^J$ cannot be identified and the $\sigma _j$ need to be fixed in advance, e.g., $\sigma _j=1$ , $j=1,\dots , J$ , which amounts to assuming the raters are equally reliable (Lockwood et al., 2018).

Generalized ICCs. Under this model specification, the ICCs computed according to Propositions 1 and 2 are generalized ICCs that indicate the polychoric correlation between two latent ratings (Jöreskog, 1994; Uebersax, 1993). For instance, Proposition 1 here implies:

(45) $$ \begin{align} ICC_{j,j'}^{\ast} = Corr\left(Y_{ij}^{\ast}, Y_{ij'}^{\ast} |G,H, \sigma_j^2, \sigma_{j'}^2 \right) = \frac{\omega^2_G}{\sqrt{\omega^2_G +\phi^2_H +\sigma_j^2}\sqrt{\omega^2_G +\phi^2_H +\sigma_{j'}^2} }, \end{align} $$

where $ICC_{j,j'}^{\ast}$ indicates the conditional pairwise polychoric correlation between the latent ratings given by raters $j \neq j'$ to subject i. Similar considerations extend to Propositions 2 and 3. As a by-product, the $ICC_A^{\ast}$ is the lower bound of the expected polychoric correlation between the latent ratings $Y_{ij}^{\ast}$ and $Y_{ij'}^{\ast}$ , with $j \neq j'$ :

(46) $$ \begin{align} ICC_A^{\ast} \leq \mathbf{E}[Corr\left(Y_{ij}^{\ast}, Y_{ij'}^{\ast}|G,H \right)] = \mathbf{E}[ICC^{\ast}|G,H]. \end{align} $$
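The pairwise index in equation (45) is a one-liner given posterior draws of the variance components (a sketch; evaluating it at each MCMC draw yields its posterior distribution):

```python
import math

def generalized_icc(omega2_G, phi2_H, sigma2_j, sigma2_jp):
    """Pairwise polychoric ICC between latent ratings by raters j and j',
    as in equation (45)."""
    a = math.sqrt(omega2_G + phi2_H + sigma2_j)
    b = math.sqrt(omega2_G + phi2_H + sigma2_jp)
    return omega2_G / (a * b)
```

When the two raters share the same residual variance, the index reduces to the usual one-way ratio of true score variance to total variance.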

8.2 Posterior computation

A data augmentation procedure can be used to simulate the underlying response variables (Albert & Chib, 1993). The underlying continuous ratings $Y_{ij}^{\ast}$ , $i=1,\dots , I$ , $j \in \mathcal {R}_i$ , are sampled:

$$\begin{align*} Y^{\ast}_{ij}| \cdot \overset{\mathrm{ind}}{\sim} N(\theta^{\ast}_i+\tau^{\ast}_j,\sigma^2_j) \times I(\delta_{k-1} < Y^{\ast}_{ij} \le \delta_k), \quad k=1,\dots,K. \end{align*}$$

Here $I(\cdot )$ is an indicator function. Following Albert & Chib (1993), the conditional posterior distributions of the $K-3$ freely estimated thresholds, e.g., $\delta _2,\dots ,\delta _{K-2}$ , are uniform on the respective intervals:

$$ \begin{align*} \delta_k|\cdot \overset{\mathrm{ind}}{\sim} U\left(\max\{\max\{Y_{ij}^{\ast}: Y_{ij}=k\}, \delta_{k-1}\},\; \min\{\min\{Y_{ij}^{\ast}: Y_{ij}=k+1\}, \delta_{k+1}\} \right), \end{align*} $$

where $U(\cdot )$ stands for the uniform distribution.
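One augmentation draw for a single rating might be sketched with scipy's truncated normal as follows (illustrative: we write the latent mean generically as $\theta_i + \tau_j$, matching the sign convention of equations (43)-(44)):

```python
import numpy as np
from scipy.stats import truncnorm

def sample_latent(k, theta_i, tau_j, sigma_j, deltas, rng):
    """Draw Y*_ij from a normal with mean theta_i + tau_j and sd sigma_j,
    truncated to (delta_{k-1}, delta_k], given observed category k (1-based);
    deltas = (-inf, delta_1, ..., delta_{K-1}, +inf)."""
    m = theta_i + tau_j
    a = (deltas[k - 1] - m) / sigma_j   # standardized lower truncation bound
    b = (deltas[k] - m) / sigma_j       # standardized upper truncation bound
    return truncnorm.rvs(a, b, loc=m, scale=sigma_j, random_state=rng)
```

Each free threshold $\delta_k$ is then refreshed with a uniform draw on the interval given above.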

All the other parameters are updated according to the posterior sampling scheme detailed in Section 3.1 of the Supplementary Material, and the post-processing transformation outlined in Section 5.3 needs to take into account the double-centering. After computing $\mu _G$ and $\eta _H$ according to equations (16) and (17) for each iteration, the samples of $\mu _0,\{\theta _i\}_1^I$ , and $\{\tau _j\}_1^J$ are computed as follows:

$$ \begin{align*} \mu_0^{\ast} &= \mu_0 -\mu_G + \eta_H, \\ \theta_i^{\ast} & = \theta_i - \mu_G + \eta_H, \quad \quad \text{for} \;\; i=1,\dots,I; \\ \tau_j^{\ast} & = \tau_j - \eta_H + \mu_G, \quad \quad \text{for} \;\; j=1,\dots,J. \end{align*} $$

8.3 Generated and real coarsened ratings analysis

In this section, we present the analysis of real and generated coarsened ratings and compare the results with those presented in Sections 6 and 7. For the real data, we deliberately coarsened the original continuous ratings analyzed in Section 7 into $K=4$ ordered categories according to the following cutoffs: $\delta _1=20$ , $\delta _2=30$ , $\delta _3=40$ . The fits of the BP, BSP, and BNP models to the data are compared according to the WAIC for ordered data discussed in the Supplementary Material.

We performed a simulation study to assess the accuracy of the BNP and the BP versions for ordered ratings. More specifically, the same data sets generated under the bimodal scenarios in Section 6 are coarsened and considered for this study. We coarsen these ratings into $K=4$ ordered categories according to three consecutive cutoffs: $\delta _1=35$ , $\delta _2=50$ , $\delta _3=75$ . The same parameter recovery assessment procedure detailed in Section 6 is used here.
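The coarsening of continuous ratings into $K=4$ ordered categories via fixed cutoffs can be reproduced in one line; the function name `coarsen` is hypothetical, and the convention $\delta_{k-1} < Y \le \delta_k$ matches the interval convention used for the augmented ratings.

```python
import numpy as np

def coarsen(y, cutoffs):
    """Map continuous ratings to ordered categories 1..K using the
    convention delta_{k-1} < y <= delta_k (K-1 finite cutoffs)."""
    return np.digitize(y, cutoffs, right=True) + 1

y = np.array([12.3, 36.0, 55.1, 80.7, 49.9])
coarsen(y, [35.0, 50.0, 75.0])  # -> array([1, 2, 3, 4, 2])
```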

Table 5 RMSE and MAE of individuals parameters across bimodal scenarios with coarsened ratings

Note: The bold text indicates the average, most accurate estimates.

Table 6 The WAIC is reported for each of the fitted models (BNP, BP, and BSP), together with the pairwise WAIC difference ( $\Delta WAIC$ ) between the best-fitting model and each of the others

In real contexts, the cutoffs of the coarsening procedure are generally known, since the continuous rating scale is deliberately broken down into a small number of consecutive intervals and raters are explicitly asked to coarsen their ratings accordingly (Peeters, Reference Peeters2015; van Praag et al., Reference van Praag, Hop and Greene2025). For example, on a 1–100 continuous scale, they might be asked to indicate which of the following intervals each subject’s score falls into: (1–25), (25–50), (50–75), or (75–100). On the contrary, when ratings are directly given on an ordinal scale, the categories’ labels are not necessarily associated with any continuous scale intervals (e.g., “poor”, “acceptable”, “good”, “very good”). In these scenarios, we consider the observed ordered ratings as coarsened representations of underlying continuous values according to some unknown consecutive cutoffs. In the first case, this coarsening process is factual; in the second, it is merely assumed. However, since in both cases at least two cutoffs need to be fixed for identification purposes, we fix $\delta _1$ and $\delta _3$ to their true values and let the model estimate $\delta _2$ , for both real and generated data.

Results. The total computational elapsed times for the BP, BSP, and BNP models were roughly similar to those of the previous sections. Upon graphical inspection of the MCMC chains and diagnostics, no convergence or mixing issues emerged for either the generated or the real data. Table 6 reports the WAIC indices for each fitted model and suggests that the BNP model provides the best fit to the data. Based on this model comparison, we focus on the results from the BNP model. As shown in Table 7, the estimates are equivalent to those obtained under the continuous BNP model presented in Section 7. The only notable difference concerns the point estimate of the subjects’ clustering structure (Figure 7): here, subjects are clustered into two (instead of four) groups.

Table 7 Posterior mean and $95\%$ quantile-based credible intervals of the estimated structural parameters of the BNP model are reported

Results from the generated data suggest that the BNP model provides more accurate estimates of subjects’ and raters’ specific parameters and outperforms the BP model. The only exception is the rater-specific reliability parameter $1/\sigma ^2$ under the scenario $|\mathcal {R}_i|=4$ , for which the BP model outperforms our proposal. Under the standard parametric model, there are only two population parameters $\gamma $ and $\beta $ (i.e., $(\gamma _j,\beta _j)=(\gamma ,\beta )$ for $j=1,\dots , J$ ) and, as a consequence, more information is available for their estimation. This might result in a faster accuracy improvement for this set of parameters as the ratio of students per rater increases. The comparison between the $RMSE$ and the $MAE$ of Tables 1 and 5 suggests that the estimates of both the BP and BNP models degrade with coarse data. The same trend emerged for the structural parameters and the densities; we report these results in the Supplementary Material.

Figure 7 The estimated densities of the subject’s true score $\theta $ , rater’s systematic bias $\tau $ , and the residual term $\epsilon $ are reported; the black solid lines and the shaded grey areas indicate the pointwise posterior mean and $95\%$ quantile-based credible intervals of the respective densities. The bottom-right panel shows the posterior distribution of the $ICC_A$ ; the black solid and dotted lines indicate, respectively, the $95\%$ credible interval and the posterior mean. The rugs at the margins of the first three panels indicate the clustering of individuals.

9 Concluding remarks

A flexible BNP framework is proposed for the analysis of holistic rating data. We adopt the two-way unbalanced design as a general setting (McGraw & Wong, Reference McGraw and Wong1996) which allows us to relate our proposal to other existing models (e.g., cross-classified or crossed random effects models, multilevel models, IRT-based rating models). We specify a measurement model to jointly estimate the subject’s latent quality (e.g., student’s proficiency) and the rater’s features (i.e., severity and consistency). Our proposal may be suitable both for balanced (i.e., when all raters score each subject; Nelson & Edwards, Reference Nelson and Edwards2010, Reference Nelson and Edwards2015) and unbalanced designs (i.e., when a subset of raters scores each subject; Martinková et al., Reference Martinková, Bartoš and Brabec2023; Ten Hove et al., Reference Ten Hove, Jorgensen and van der Ark2022). This method aims to capture latent heterogeneity among subjects and raters with the stochastic clustering induced by the DPM placed over their effects. This allows us to relax the common distributional assumptions on the respective parameters, preventing model misspecification issues (Antonelli et al., Reference Antonelli, Trippa and Haneuse2016; Walker & Gutiérrez-Peña, Reference Walker and Gutiérrez-Peña2007).

Results from the simulation study highlight the flexibility of our proposal, which provides accurate estimates across different scenarios. Exploiting the DPM prior, the respective densities of the students’ and raters’ effects are consistently estimated both when the normality assumption holds and when it is violated. Our method provides a more prominent improvement with small sample sizes and coarse data. Our proposal provides the best fit to the real data, both for continuous and coarse ratings, compared to the parametric competitor. Nonetheless, the accuracy of the estimates with coarse ratings might be a concern when subjects are only rated by a very small number of raters and the estimated true scores are used, for instance, for selection purposes or as official grades. The theoretical results presented in Section 3 are employed to make inferences about the inter-rater reliability of the single ratings.

The relatively long computational times of the MCMC chains might be prohibitive if used for repeated or massive scoring procedures. In such cases, if one is interested in capturing systematic heterogeneity among subjects or raters, any formulation of a mixture model (parametric or nonparametric) might be computationally cumbersome. In contrast, if this is not the focus of the analysis, the parametric model might be a computationally faster solution.

Under our model, rater’s systematic bias and reliability are assumed to be independent conditional on the parameters of the cluster; additionally, the reliability of the raters is assumed to be independent of their specific workload $|\mathcal {S}_j|$ (i.e., the cardinality of the subset of subjects the rater has to evaluate). These assumptions might be unrealistic in some real contexts, and they might be relaxed under more general model specifications. For example, a multivariate distribution might be specified as a base measure $H_0$ to account for the correlation between the rater’s features, and the rater-specific workload $|\mathcal {S}_j|$ might be modeled as a random variable correlated to the rater’s features. Furthermore, because the measurement model includes raters’ effects only as an additive component, all raters are assumed to have the same ability to discriminate between subjects with different latent true scores. This assumption might be relaxed by specifying an additional rater-specific multiplicative effect for the subject’s true score, similar to the GMFRMs (Uto & Ueno, Reference Uto and Ueno2016).

The model detailed in Section 2.3 might be further extended to account for multidimensional ratings, i.e., when subjects are rated on multiple items. Under this three-way design, item parameters might be identified under some general conditions, and the model might extend Paganin et al., Reference Paganin, Paciorek, Wehrhahn, Rodríguez, Rabe-Hesketh and de Valpine2023, or Karabatsos & Walker, Reference Karabatsos and Walker2009 to account for raters’ characteristics. Further BNP generalizations of the existing rating models, e.g., GMFRMs, (e.g., Uto & Ueno, Reference Uto and Ueno2016; Uto et al., Reference Uto, Tsuruta, Araki and Ueno2024) HRMs (e.g., Casabianca et al., Reference Casabianca, Lockwood and Mccaffrey2015; DeCarlo et al., Reference DeCarlo, Kim and Johnson2011; Molenaar et al., Reference Molenaar, Uluman, Tavşancl and De Boeck2021) or Trifactor Models (e.g., Shin et al., Reference Shin, Rabe-Hesketh and Wilson2019; Soland & Kuhfeld, Reference Soland and Kuhfeld2022) are left for future investigations.

The effect of covariates and contextual factors might be incorporated in the structural models (6), (8), or (9) if additional information on subjects or raters is available. This extension might relate our model to Explanatory Response Models (Kim & Wilson, Reference Kim and Wilson2020; Wilson & De Boeck, Reference Wilson and De Boeck2004) and be a BNP generalization of those methods. According to the data structure, more complex hierarchical priors might be placed over the subjects’ true scores, such as hierarchical DPMs (Paisley et al., Reference Paisley, Wang, Blei and Jordan2014; Teh et al., Reference Teh, Jordan, Beal and Blei2004), nested DPMs (Gelman et al., Reference Gelman, Hwang and Vehtari2014; Hjort et al., Reference Hjort, Holmes, Müller and Walker2010; Rodriguez et al., Reference Rodriguez, Dunson and Gelfand2008), or the hidden hierarchical DPMs recently introduced by Lijoi et al. (Reference Lijoi, Prünster and Rebaudo2023), which overcome some flaws of the previous ones. Stochastic approximations of the DPM based on stick-breaking constructions that avoid fixing a maximum number of clusters might be further considered (Arbel et al., Reference Arbel, Blasi and Prünster2019).

Our method might provide practitioners with valuable insights about the subjects’ and raters’ specific features along with the respective clustering structures. This information might be used to the great advantage of individualized teaching programs (Coates, Reference Coates2025) and might improve the matching between subjects in peer teaching activities (Stigmar, Reference Stigmar2016). Our theoretical findings and computational solutions might enhance the analysis of rating data and contribute novel knowledge about the rating process.

Supplementary material

The supplementary material for this article can be found at https://doi.org/10.1017/psy.2025.10035.

Acknowledgments

We gratefully acknowledge the anonymous reviewers for their insightful suggestions regarding the extension to the coarse rating case. We are sincerely grateful to Professors Patricia Martinkova and Elena Erosheva for the precious insights on the first draft of the article, and to Professors Antonio Lijoi and Igor Prünster for fruitful discussions on these classes of priors.

Competing interests

The authors declare none.

Appendices

A Further extensions

In this section, we present some model extensions for more flexible clustering and complex hierarchical structures. We briefly detail alternative discrete priors that generalize the Dirichlet Process, and provide a more general framework accounting for multiple sources of variability.

A.1 DP generalizations

Following the notation in Section 2.2, given $\Pi =DP(\alpha P_0)$ , the number of distinct unique values $K_n$ generated by p increases asymptotically at a logarithmic rate, with $K_n \sim \alpha \log (n)$ a.s. as $n \rightarrow \infty $ . Alternative priors might be specified over p that overcome this issue and allow for a more flexible prior specification of the number of clusters. More general specifications of $\Pi $ are briefly presented below.
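The logarithmic growth of $K_n$ under the DP can be checked empirically by simulating the induced Chinese restaurant process; the function name and the concentration value below are purely illustrative.

```python
import numpy as np

def crp_n_clusters(n, alpha, rng):
    """Number of occupied clusters after sequentially seating n items
    under a Chinese restaurant process with concentration alpha."""
    counts = []
    for _ in range(n):
        probs = np.array(counts + [alpha], dtype=float)
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(counts):
            counts.append(1)      # open a new cluster
        else:
            counts[k] += 1        # join an existing cluster
    return len(counts)

rng = np.random.default_rng(1)
alpha = 2.0
for n in (100, 500, 2000):
    mean_k = np.mean([crp_n_clusters(n, alpha, rng) for _ in range(20)])
    print(n, round(mean_k, 1), round(alpha * np.log(n), 1))
```

The empirical average number of clusters tracks the $\alpha \log(n)$ rate as $n$ grows.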

Our proposal might readily encompass these priors; since they all share the stick-breaking representation presented in (14), the $ICCs$ estimation and the semi-centered identifiability procedure still hold in these cases.

A.1.1 Mixture of Pitman–Yor process

One of the most common generalizations of the Dirichlet Process is the Pitman–Yor Process $PY(d, \alpha , P_0)$ , indexed by a discount parameter $0<d<1$ , a concentration parameter $\alpha>-d$ , and a base measure $P_0$ . This is also termed the two-parameter Poisson–Dirichlet process. For instance, we can place the $PY$ as a prior over the subject random measure, $G \sim PY(d, \alpha , G_0)$ , which might be represented as

$$ \begin{align*} G = \sum_{n \geq 1} \pi_{1n} \delta_{\xi_n}, \quad \pi_{1n} = V_{1n} \prod_{l<n}(1-V_{1l}), \quad V_{1n} \overset{\mathrm{ind}}{\sim} Beta(1-d,\alpha_1+nd), \quad \xi_n \overset{\mathrm{iid}}{\sim} G_0. \end{align*} $$

Under this specification, the number of observed clusters $K_I$ out of a sample of I subjects increases asymptotically at rate $I^d$ , with $K_I \sim S_{d,\alpha }I^d$ as $I \rightarrow \infty $ . Here $S_{d,\alpha }$ is a limiting random variable whose distribution depends on d and $\alpha $ and has positive density on $\mathbb {R}^+$ . For $d \rightarrow 0$ , we recover the $DP(\alpha , G_0)$ , whereas for larger values of d the rate of increase of $K_I$ is faster. The discount parameter d might be interpreted as the proportion of small clusters observed out of a sample of I subjects. Indeed, this parameter plays a double role in the clustering behavior of the model: higher values of d imply a reinforcement mechanism that favors the allocation of a subject to the larger clusters (the “rich-get-richer” property) and, at the same time, a higher probability of being assigned to a new cluster. This is clear from $\mathbb {E}[\pi _{1n}]=O(n^{-1/d})$ , for $0<d<1$ , which shows that the decay of the cluster sizes follows a power law.
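A truncated stick-breaking draw makes the power-law decay of the PY weights easy to inspect; the truncation level and parameter values below are illustrative, and the helper `py_weights` is a hypothetical name.

```python
import numpy as np

def py_weights(n_atoms, d, alpha, rng):
    """Truncated stick-breaking weights of a PY(d, alpha, G0) process:
    V_n ~ Beta(1 - d, alpha + n*d), pi_n = V_n * prod_{l<n} (1 - V_l)."""
    ns = np.arange(1, n_atoms + 1)
    V = rng.beta(1.0 - d, alpha + ns * d)
    return V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))

rng = np.random.default_rng(2)
pi_dp = py_weights(5000, d=0.0, alpha=1.0, rng=rng)   # d -> 0 recovers the DP
pi_py = py_weights(5000, d=0.5, alpha=1.0, rng=rng)
# heavier tail under the PY: probability mass left after the first 50 atoms
print(1 - pi_dp[:50].sum(), 1 - pi_py[:50].sum())
```

Under the DP the leftover mass after 50 atoms is negligible, while under the PY with $d=0.5$ a visible fraction of mass remains in the tail, reflecting the $O(n^{-1/d})$ decay of the weights.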

A.1.2 Mixture of Normalized Generalized Gamma process

We can alternatively specify a Normalized Generalized Gamma (NGG) process as a prior, $p \sim NGG(\alpha , d, P_0)$ (Brix, Reference Brix1999; Lijoi et al., Reference Lijoi, Mena and Prünster2007). This distribution is characterized by $\alpha>0$ , $d \in (0,1)$ , and a base measure $P_0$ . Following the previous example, we can take the subject random measure to be distributed according to an NGG, $G \sim NGG(\alpha , d, G_0)$ . It might be represented as

$$ \begin{align*} G = \sum_{n \geq1} \pi_{1n} \delta_{\xi_n}, \quad \pi_{1n} = T_n/\sum_{i \geq 1} T_i, \quad \xi_n \overset{\mathrm{iid}}{\sim} G_0, \end{align*} $$

where $T_n$ are the points of a generalized gamma process with parameters $\alpha>0$ , $d \in (0,1)$ , and $\sum _{i \geq 1} T_i< \infty $ (Brix, Reference Brix1999). For $d \rightarrow 0$ , we recover the DP. See Ghosal & van der Vaart (Reference Ghosal and van der Vaart2017) for the correspondence between the PY and NGG processes. The interpretation of the parameters $\alpha $ and d, and the comments on the power-law tail behavior of the PY process, readily apply to the NGG process.

In educational rating contexts, the PY and the NGG processes might be preferred to the DP when the interest is to identify a few large clusters of subjects with similar proficiency levels and subjects who might need more one-on-one or personalized teaching. We refer to De Blasi et al. (Reference De Blasi, Favaro, Lijoi, Mena, Prünster and Ruggiero2015); Hjort et al. (Reference Hjort, Holmes, Müller and Walker2010) and Ishwaran & James (Reference Ishwaran and James2001) for a broader treatment of this class of priors.

A.2 Multiple ways unbalanced designs

We present here an extension of our proposal to multiple-way unbalanced designs. Without loss of generality, we consider a three-way design in which ratings are additionally nested within an independent factor. This is the case when subjects are rated over different tasks (Ten Hove et al., Reference Ten Hove, Jorgensen and van der Ark2022) or across different occasions (Lin et al., Reference Lin, Tuerlinckx and Vanbelle2025).

Consider a subject $i = 1, \dots , I$ , whose attribute is independently scored by a random subset of raters $\mathcal {R}_i \subseteq \{1,\dots , J\}$ on a continuous rating scale across different tasks $d \in \{1,\dots , D\}$ . We assume that the observed rating $Y_{ijd} \in \mathbb {R}$ depends additively on subject i, rater $j \in \mathcal {R}_i$ , and task $d \in \{1,\dots , D\}$ . We let the residual part, that is, the difference between the true and the observed score, depend on rater j’s effects, i.e., its systematic bias and reliability. We specify the following decomposition of rating $Y_{ijd}$ :

(47) $$ \begin{align} Y_{ijd} = \theta_i + \tau_j + \xi_d + SR_{ij} + SD_{id} + RD_{jd} +\epsilon_{ijd}, \end{align} $$

for $i = 1,\dots ,I$ , $j \in \mathcal {R}_i$ and $d \in \{1,\dots ,D\}$ . Here $\theta _i$ is the true score of subject i, $\tau _j$ is the systematic bias of rater j, $\xi _d$ is the average easiness of task d, $\tau _j+SR_{ij}$ is the average bias of rater j in scoring subject i, $\xi _d + SD_{id}$ is the easiness of task d for subject i, $\tau _j+SR_{ij}+RD_{jd}$ is the bias of rater j in scoring subject i on task d, and $\epsilon _{ijd}$ is a residual term following a zero-mean normal distribution with variance $\sigma ^2_j$ , as in the two-way unbalanced design. All the parameters in (47) are assumed to be mutually independent.
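A direct simulation of decomposition (47) illustrates how the mutually independent components add up; the normal draws and the variance values below are only placeholders for whatever (parametric or nonparametric) priors are placed over the effects.

```python
import numpy as np

rng = np.random.default_rng(3)
I, J, D = 200, 30, 5

theta = rng.normal(0.0, 1.0, I)            # subject true scores
tau = rng.normal(0.0, 0.5, J)              # rater systematic biases
xi = rng.normal(0.0, 0.3, D)               # task easiness
SR = rng.normal(0.0, 0.2, (I, J))          # subject-by-rater interaction
SD = rng.normal(0.0, 0.2, (I, D))          # subject-by-task interaction
RD = rng.normal(0.0, 0.2, (J, D))          # rater-by-task interaction
sigma = rng.uniform(0.5, 1.5, J)           # rater-specific residual sd

Y = (theta[:, None, None] + tau[None, :, None] + xi[None, None, :]
     + SR[:, :, None] + SD[:, None, :] + RD[None, :, :]
     + sigma[None, :, None] * rng.normal(0.0, 1.0, (I, J, D)))
# under mutual independence, Var(Y_ijd) is the sum of the component variances
```

The empirical variance of the pooled ratings is close to the sum of the component variances, as implied by the mutual independence assumption.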

We extend the model presented in Section 2.3 by placing a prior over the new set of parameters. They might be equivalently equipped with parametric or nonparametric priors. In the latter case, the same modelling strategy described for $\{\tau _j\}_1^J$ in Sections 2.3 and 2.4 might be freely adopted for any of the new parameters.

The ICC class for these multiple-way models might be derived straightforwardly, as in the two-way setting, following the rationale detailed in Lin et al. (Reference Lin, Tuerlinckx and Vanbelle2025). If nonparametric priors are placed over the new set of parameters, the variances of their respective mixture distributions might be estimated as indicated in Section 2.4. As in the two-way case, the BNP ICC class for these models generalizes most of those proposed for multilevel data (e.g., Lin et al., Reference Lin, Tuerlinckx and Vanbelle2025; Ten Hove et al., Reference Ten Hove, Jorgensen and van der Ark2021, Reference Ten Hove, Jorgensen and van der Ark2022), owing to the residuals’ heteroscedasticity and the flexible mixture distribution of the parameters. As a consequence, when parametric priors are placed over the parameters in (47) and $\sigma ^2_j=\sigma ^2$ for $j=1,\dots , J$ , the standard ICCs in Lin et al. (Reference Lin, Tuerlinckx and Vanbelle2025) are recovered.

B Proofs

B.1 Proof of Proposition 1

Proof. Let $Y_{ij}$ and $Y_{ij'}$ be the ratings given by two random raters $j,j' \in \mathcal {R}_i, \;\; j \neq j'$ , to a random subject i, for $i=1,\dots ,I$ :

$$ \begin{align*} Y_{ij} = \theta_i +\tau_j +\epsilon_{ij}, \quad \quad Y_{ij'} = \theta_i +\tau_{j'} +\epsilon_{ij'}. \end{align*} $$

Assuming mutual independence between the terms of the decomposition:

$$ \begin{align*} \mathbf{Var}[ Y_{ij} |G,H] = \omega^2_G + \phi^2_H + \sigma_j^2, \quad \quad \mathbf{Var}[ Y_{ij'} |G,H] = \omega^2_G + \phi^2_H + \sigma_{j'}^2 \end{align*} $$

and the conditional covariance between the two ratings is

$$ \begin{align*} \mathbf{Cov}[Y_{ij}, Y_{ij'} |G,H] & = \mathbf{Cov}[\theta_i +\tau_j +\epsilon_{ij} , \theta_i +\tau_{j'} +\epsilon_{ij'} |G,H] \\& = \mathbf{Cov}[\theta_i,\theta_i|G,H] + \mathbf{Cov}[\theta_i,\tau_{j'}|G,H] + \mathbf{Cov}[\theta_i,\epsilon_{ij'}|G,H] \\& \quad + \mathbf{Cov}[\tau_j,\theta_i|G,H]+ \mathbf{Cov}[\tau_j,\tau_{j'}|G,H]+ \mathbf{Cov}[\tau_j,\epsilon_{ij'}|G,H] \\& \quad + \mathbf{Cov}[\epsilon_{ij},\theta_i |G,H] + \mathbf{Cov}[\epsilon_{ij},\tau_{j'} |G,H] + \mathbf{Cov}[\epsilon_{ij},\epsilon_{ij'}|G,H] \\& = \mathbf{Cov}[\theta_i,\theta_i|G,H] \\& = \omega^2_G. \end{align*} $$

The correlation between the ratings is

$$ \begin{align*} ICC_{j,j'} = \mathbf{Cor}[Y_{ij}, Y_{ij'} |G,H,\sigma_j^2,\sigma_{j'}^2] &= \frac{ \mathbf{Cov}[Y_{ij}, Y_{ij'} |G,H]}{\sqrt{ \mathbf{Var}[ Y_{ij}|G,H ]} \sqrt{\mathbf{Var}[ Y_{ij'}|G,H ] }} \\&= \frac{\omega^2_G}{\sqrt{\omega^2_G +\phi^2_H +\sigma_j^2}\sqrt{\omega^2_G +\phi^2_H +\sigma_{j'}^2} }. \end{align*} $$
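The closed-form expression above can be checked with a quick Monte Carlo simulation of the rating decomposition; the variance components chosen below are arbitrary, illustrative values.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
omega2, phi2 = 1.0, 0.25      # Var(theta | G) and Var(tau | H) (illustrative)
s2_j, s2_jp = 0.5, 1.5        # the two raters' residual variances

# shared true score plus independent rater bias and residual draws
theta = rng.normal(0.0, np.sqrt(omega2), n)
y_j = theta + rng.normal(0.0, np.sqrt(phi2), n) + rng.normal(0.0, np.sqrt(s2_j), n)
y_jp = theta + rng.normal(0.0, np.sqrt(phi2), n) + rng.normal(0.0, np.sqrt(s2_jp), n)

mc = np.corrcoef(y_j, y_jp)[0, 1]
closed = omega2 / (np.sqrt(omega2 + phi2 + s2_j) * np.sqrt(omega2 + phi2 + s2_jp))
# mc and closed agree up to Monte Carlo error
```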

B.2 Proof of statement (i) of Proposition 2

Proof. Let $Y_{ij}$ and $Y_{ij'}$ be the ratings given by two random raters $j,j' \in \mathcal {R}_i, \;\; j \neq j'$ , satisfying $\sigma _j^2=\sigma _{j'}^2=\tilde {\sigma }_H$ to a random subject i, $i=1,\dots ,I$ :

$$ \begin{align*} Y_{ij} = \theta_i +\tau_j +\epsilon_{ij}, \quad \quad Y_{ij'} = \theta_i +\tau_{j'} +\epsilon_{ij'}. \end{align*} $$

Assuming mutual independence between the terms of the decomposition:

$$ \begin{align*} \mathbf{Var}[ Y_{ij} |G,H] = \omega^2_G + \phi^2_H + \tilde{\sigma}_H, \quad \quad \mathbf{Var}[ Y_{ij'} |G,H] = \omega^2_G + \phi^2_H + \tilde{\sigma}_H \end{align*} $$

and the conditional covariance between the two ratings is

$$ \begin{align*} \mathbf{Cov}[Y_{ij}, Y_{ij'} |G,H] & = \mathbf{Cov}[\theta_i +\tau_j +\epsilon_{ij} , \theta_i +\tau_{j'} +\epsilon_{ij'} |G,H] \\& = \mathbf{Cov}[\theta_i,\theta_i|G,H] + \mathbf{Cov}[\theta_i,\tau_{j'}|G,H] + \mathbf{Cov}[\theta_i,\epsilon_{ij'}|G,H] \\& \quad + \mathbf{Cov}[\tau_j,\theta_i|G,H]+ \mathbf{Cov}[\tau_j,\tau_{j'}|G,H]+ \mathbf{Cov}[\tau_j,\epsilon_{ij'}|G,H] \\& \quad + \mathbf{Cov}[\epsilon_{ij},\theta_i |G,H] + \mathbf{Cov}[\epsilon_{ij},\tau_{j'} |G,H] + \mathbf{Cov}[\epsilon_{ij},\epsilon_{ij'}|G,H] \\& = \mathbf{Cov}[\theta_i,\theta_i|G,H]\\& = \omega^2_G. \end{align*} $$

The correlation between the ratings is

$$ \begin{align*} ICC_A = \mathbf{Cor}[Y_{ij}, Y_{ij'} |G,H] &= \frac{ \mathbf{Cov}[Y_{ij}, Y_{ij'} |G,H]}{\sqrt{ \mathbf{Var}[ Y_{ij}|G,H ]} \sqrt{\mathbf{Var}[ Y_{ij'}|G,H ] }} \\&= \frac{\omega^2_G}{\omega^2_G + \phi^2_H + \tilde{\sigma}_H}. \end{align*} $$

B.3 Proof of statement (ii) of Proposition 2

Proof. Let us consider the function $ICC$ which, conditional on G and H, is a convex function of the random variable $\sigma _j^2$ :

(48) $$ \begin{align} ICC(\sigma^2_j|G,H) = \frac{\omega^2_G}{\omega^2_G + \phi^2_H + \sigma_j^2}, \quad j=1,\dots,J. \end{align} $$

Let $ICC_A$ be the $ICC$ function evaluated at the expected value of $\sigma _j^2$ :

(49) $$ \begin{align} ICC_A = \frac{\omega^2_G}{\omega^2_G + \phi^2_H + \mathbf{E}[\sigma_j^2|G,H]}, \quad j=1,\dots,J. \end{align} $$

Note that $\mathbf {E}[\sigma ^2_j|G,H]=\mathbf {E}[\sigma ^2_{j'}|G,H]$ for $j,j'=1,\dots ,J$ , $j\neq j'$ . It readily follows from the conditional Jensen’s inequality that

(50) $$ \begin{align} ICC(\mathbf{E}[\sigma_j^2|G,H]) \leq \mathbf{E}[ICC(\sigma_j^2|G,H)]. \end{align} $$

Defining for brevity $ICC_A=ICC(\mathbf {E}[\sigma _j^2|G,H])$ and $ICC=ICC(\sigma _j^2|G,H)$ :

(51) $$ \begin{align} ICC_A \leq \mathbf{E}[ICC|G,H]. \end{align} $$

where $\mathbf {E}[ICC|G, H]=\mathbf {E}[\mathbf{Cor}\left (Y_{ij}, Y_{ij'}|G, H\right )]$ , for $i=1,\dots , I$ and $j,j' \in \mathcal {R}_i$ . That is, it is the expected correlation between two independent ratings given to a random subject.
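Statement (ii) can also be verified numerically: for any distribution over $\sigma_j^2$ (a gamma distribution below, chosen purely for illustration), the ICC evaluated at the mean residual variance never exceeds the mean ICC.

```python
import numpy as np

rng = np.random.default_rng(5)
omega2, phi2 = 1.0, 0.25                             # illustrative variance components
s2 = rng.gamma(shape=2.0, scale=0.5, size=100_000)   # hypothetical sigma_j^2 draws

icc = omega2 / (omega2 + phi2 + s2)              # ICC(sigma_j^2 | G, H)
icc_A = omega2 / (omega2 + phi2 + s2.mean())     # ICC at E[sigma_j^2 | G, H]
# Jensen's inequality for the convex map s2 -> omega2 / (omega2 + phi2 + s2):
# icc_A <= icc.mean()
```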

B.4 Proof of Proposition 3

Proof. Let $Y_{ij}$ and $Y_{ij'}$ be the ratings given by $j,j' \in \mathcal {R}_i, \;\; j \neq j'$ , to a random subject i, $i=1,\dots ,I$ :

$$ \begin{align*} Y_{ij} = \theta_i +\epsilon_{ij}, \quad \quad Y_{ij'} = \theta_i +\epsilon_{ij'}. \end{align*} $$

Assuming mutual independence between the terms of the decomposition:

$$ \begin{align*} \mathbf{Var}[ Y_{ij} |G,H] = \omega^2_G + \phi^2_H, \quad \quad \mathbf{Var}[ Y_{ij'} |G,H] = \omega^2_G + \phi^2_H \end{align*} $$

and the conditional covariance between the two ratings is

$$ \begin{align*} \mathbf{Cov}[Y_{ij}, Y_{ij'} |G,H] & = \mathbf{Cov}[\theta_i +\epsilon_{ij} , \theta_i +\epsilon_{ij'} |G,H] \\ & = \mathbf{Cov}[\theta_i,\theta_i|G,H] + \mathbf{Cov}[\theta_i,\epsilon_{ij'}|G,H]+ \mathbf{Cov}[\epsilon_{ij},\theta_i |G,H] + \mathbf{Cov}[\epsilon_{ij},\epsilon_{ij'}|G,H] \\ & = \mathbf{Cov}[\theta_i,\theta_i|G,H]\\ & = \omega^2_G. \end{align*} $$

The correlation between the ratings is

$$ \begin{align*} ICC = \mathbf{Cor}[Y_{ij}, Y_{ij'} |G,H] &= \frac{ \mathbf{Cov}[Y_{ij}, Y_{ij'} |G,H]}{\sqrt{ \mathbf{Var}[ Y_{ij}|G,H ]} \sqrt{\mathbf{Var}[ Y_{ij'}|G,H ] }} \\ &= \frac{\omega^2_G}{\omega^2_G +\phi^2_H }. \end{align*} $$

Since the conditional variance of the ratings is the same across subjects, $\mathbf {Var}[ Y_{ij} |G,H]= \omega ^2_G + \phi ^2_H$ for $i=1,\dots , I$ , the ICC is the same for all subjects.

C Plots

Figure C.1 Average estimated density across 10 independent datasets under the unimodal scenario. The columns indicate the cardinality $|\mathcal {R}_i| \in \{2,4\}$ (left and right, respectively). The solid red lines indicate the true densities; the solid black lines and the shaded grey areas indicate, respectively, the point-wise mean and $95\%$ quantile-based credible intervals; the black dotted lines indicate the density implied by the BP model.

Figure C.2 First row: examples of posterior similarity matrices for pairwise subject and raters allocation (left and right column, respectively). Second row: posterior similarity matrices for pairwise subject and raters allocation in real data analyzed in Section 7.

Footnotes

1 CPU configuration: 12th Gen Intel(R) Core(TM) i9 12900H.

References

Agresti, A. (2015). Foundations of linear and generalized linear models. Wiley.Google Scholar
Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669679.10.1080/01621459.1993.10476321CrossRefGoogle Scholar
Albrecht, R., Espejo, T., Riedel, H. B., Nissen, S. K., Banerjee, J., Conroy, S. P., Nickel, C. H. (2024). Clinical frailty scale at presentation to the emergency department: interrater reliability and use of algorithm-assisted assessment. European Geriatric Medicine, 15(1), 105113.10.1007/s41999-023-00890-yCrossRefGoogle Scholar
Antonelli, J., Trippa, L., & Haneuse, S. (2016). Mitigating bias in generalized linear mixed models: The case for Bayesian nonparametrics. Statistical Science: A Review Journal of the Institute of Mathematical Statistics, 31(1), 80.10.1214/15-STS533CrossRefGoogle ScholarPubMed
Antoniak, C. E. (1974). Mixtures of dirichlet processes with applications to bayesian nonparametric problems. The Annals of Statistics, 2(6), 11521174.10.1214/aos/1176342871CrossRefGoogle Scholar
Arbel, J., Blasi, P. D., & Prünster, I. (2019). Stochastic approximations to the Pitman–Yor process. Bayesian Analysis, 14(4), 12011219.10.1214/18-BA1127CrossRefGoogle Scholar
Bartholomew, D., Knott, M., & Moustaki, I. (2011). Latent variable models and factor analysis: A unified approach (3rd ed.) Wiley.10.1002/9781119970583CrossRefGoogle Scholar
Bartoš, F., & Martinková, P. (2024). Assessing quality of selection procedures: Lower bound of false positive rate as a function of inter-rater reliability. British Journal of Mathematical and Statistical Psychology, 77(3), 651671.10.1111/bmsp.12343CrossRefGoogle ScholarPubMed
Bradlow, E., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153168.10.1007/BF02294533CrossRefGoogle Scholar
Brix, A. (1999). Generalized gamma measures and shot-noise cox processes. Advances in Applied Probability, 31(4), 929953.10.1239/aap/1029955251CrossRefGoogle Scholar
Cao, J., Stokes, S. L., & Zhang, S. (2010). A Bayesian approach to ranking and rater evaluation: An application to grant reviews. Journal of Educational and Behavioral Statistics, 35(2), 194214.10.3102/1076998609353116CrossRefGoogle Scholar
Casabianca, J. M., Lockwood, J. R., & Mccaffrey, D. F. (2015). Trends in classroom observation scores. Educational and Psychological Measurement, 75(2), 311337.10.1177/0013164414539163CrossRefGoogle ScholarPubMed
Childs, T. M., & Wooten, N. R. (2023). Teacher bias matters: An integrative review of correlates, mechanisms, and consequences. Race Ethnicity and Education, 26(3), 368397.10.1080/13613324.2022.2122425CrossRefGoogle Scholar
Chin, M. J., Quinn, D. M., Dhaliwal, T. K., & Lovison, V. S. (2020). Bias in the air: A nationwide exploration of teachers’ implicit racial attitudes, aggregate bias, and student outcomes. Educational Researcher, 49(8), 566578.10.3102/0013189X20937240CrossRefGoogle Scholar
Coates, W. C. (2025). Precision education—A call to action to transform medical education. International Journal of Emergency Medicine, 18(1), 21.10.1186/s12245-025-00819-1CrossRefGoogle ScholarPubMed
Cook, C. R., Kilgus, S. P., & Burns, M. K. (2018). Advancing the science and practice of precision education to enhance student outcomes. Journal of School Psychology, 66, 410.10.1016/j.jsp.2017.11.004CrossRefGoogle ScholarPubMed
Cremaschi, A., De Iorio, M., Seng Chong, Y., Broekman, B., Meaney, M. J., & Kee, M. Z. (2021). A Bayesian nonparametric approach to dynamic item-response modeling: An application to the gusto cohort study. Statistics in Medicine, 40(27), 60216037.10.1002/sim.9167CrossRefGoogle Scholar
De Blasi, P., Favaro, S., Lijoi, A., Mena, R. H., Prünster, I., & Ruggiero, M. (2015). Are Gibbs-type priors the most natural generalization of the Dirichlet process? IEEE Transactions on Pattern Analysis and Machine Intelligence, 37, 212–229. https://doi.org/10.1109/TPAMI.2013.217
De Boeck, P. (2008). Random item IRT models. Psychometrika, 73, 533–559. https://doi.org/10.1007/s11336-008-9092-x
DeCarlo, L. T., Kim, Y., & Johnson, M. S. (2011). A hierarchical rater model for constructed responses, with a signal detection rater model. Journal of Educational Measurement, 48(3), 333–356. https://doi.org/10.1111/j.1745-3984.2011.00143.x
De Iorio, M., Favaro, S., Guglielmi, A., & Ye, L. (2023). Bayesian nonparametric mixture modeling for temporal dynamics of gender stereotypes. The Annals of Applied Statistics, 17(3), 2256–2278. https://doi.org/10.1214/22-AOAS1717
DeYoreo, M., & Kottas, A. (2018). Bayesian nonparametric modeling for multivariate ordinal regression. Journal of Computational and Graphical Statistics, 27(1), 71–84. https://doi.org/10.1080/10618600.2017.1316280
D'Lima, J., Taylor, S. E., Mitri, E., Harding, A., Lai, J., & Manias, E. (2024). Assessment of inter-rater reliability of screening tools to identify patients at risk of medication-related problems across the emergency department continuum of care. Australasian Emergency Care, 27(2), 136–141. https://doi.org/10.1016/j.auec.2023.10.005
Erosheva, E. A., Martinková, P., & Lee, C. J. (2021). When zero may not be zero: A cautionary note on the use of inter-rater reliability in evaluating grant peer review. Journal of the Royal Statistical Society Series A: Statistics in Society, 184(3), 904–919. https://doi.org/10.1111/rssa.12681
Escobar, M., & West, M. (1994). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90, 577–588. https://doi.org/10.1080/01621459.1995.10476550
Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2), 209–230. https://doi.org/10.1214/aos/1176342360
Fox, J.-P., & Glas, C. A. (2001). Bayesian estimation of a multilevel IRT model using Gibbs sampling. Psychometrika, 66, 271–288. https://doi.org/10.1007/BF02294839
Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., & Rubin, D. (2013). Bayesian data analysis (3rd ed.). Taylor & Francis. https://doi.org/10.1201/b16018
Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press. https://doi.org/10.1017/CBO9780511790942
Gelman, A., Hwang, J., & Vehtari, A. (2014). Understanding predictive information criteria for Bayesian models. Statistics and Computing, 24, 997–1016. https://doi.org/10.1007/s11222-013-9416-2
Ghosal, S., Ghosh, J. K., & Ramamoorthi, R. V. (1999). Posterior consistency of Dirichlet mixtures in density estimation. The Annals of Statistics, 27(1), 143–158. https://doi.org/10.1214/aos/1018031105
Ghosal, S., & van der Vaart, A. (2017). Fundamentals of nonparametric Bayesian inference. Cambridge University Press. https://doi.org/10.1017/9781139029834
Goel, A. M., & Thakor, A. V. (2015). Information reliability and welfare: A theory of coarse credit ratings. Journal of Financial Economics, 115(3), 541–557. https://doi.org/10.1016/j.jfineco.2014.11.005
Gwet, K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61(1), 29–48. https://doi.org/10.1348/000711006X126600
Harbaugh, R., & Rasmusen, E. (2018). Coarse grades: Informing the public by withholding information. American Economic Journal: Microeconomics, 10(1), 210–235.
Hart, S. A. (2016). Precision education initiative: Moving toward personalized education. Mind, Brain, and Education, 10(4), 209–211. https://doi.org/10.1111/mbe.12109
Heinzl, F., Fahrmeir, L., & Kneib, T. (2012). Additive mixed models with Dirichlet process mixture and P-spline priors. AStA Advances in Statistical Analysis, 96, 47–68. https://doi.org/10.1007/s10182-011-0161-6
Henderson, N. C., Louis, T. A., Rosner, G. L., & Varadhan, R. (2020). Individualized treatment effects with censored data via fully nonparametric Bayesian accelerated failure time models. Biostatistics, 21(1), 50–68. https://doi.org/10.1093/biostatistics/kxy028
Hjort, N., Holmes, C., Müller, P., & Walker, S. (2010). Bayesian nonparametrics. Cambridge University Press. https://doi.org/10.1017/CBO9780511802478
Ho, A. D., & Reardon, S. F. (2012). Estimating achievement gaps from test scores reported in ordinal “proficiency” categories. Journal of Educational and Behavioral Statistics, 37(4), 489–517. https://doi.org/10.3102/1076998611411918
Shin, H. J., Rabe-Hesketh, S., & Wilson, M. (2019). Trifactor models for multiple-ratings data. Multivariate Behavioral Research, 54(3), 360–381. https://doi.org/10.1080/00273171.2018.1530091
Ishwaran, H., & James, L. (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96, 161–173. https://doi.org/10.1198/016214501750332758
Jöreskog, K. G. (1994). On the estimation of polychoric correlations and their asymptotic covariance matrix. Psychometrika, 59(3), 381–389. https://doi.org/10.1007/BF02296131
Jöreskog, K. G., & Moustaki, I. (2001). Factor analysis of ordinal variables: A comparison of three approaches. Multivariate Behavioral Research, 36(3), 347–387.
Karabatsos, G., & Walker, S. (2009). A Bayesian nonparametric approach to test equating. Psychometrika, 74, 211–232. https://doi.org/10.1007/s11336-008-9096-6
Kim, J., & Wilson, M. (2020). Polytomous item explanatory item response theory models. Educational and Psychological Measurement, 80(4), 726–755. https://doi.org/10.1177/0013164419892667
Koo, T., & Li, M. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15, 155–163. https://doi.org/10.1016/j.jcm.2016.02.012
Kottas, A., Müller, P., & Quintana, F. (2005). Nonparametric Bayesian modeling for multivariate ordinal data. Journal of Computational and Graphical Statistics, 14(3), 610–625. https://doi.org/10.1198/106186005X63185
Królikowska, A., Reichert, P., Karlsson, J., Mouton, C., Becker, R., & Prill, R. (2023). Improving the reliability of measurements in orthopaedics and sports medicine. Knee Surgery, Sports Traumatology, Arthroscopy, 31(12), 5277–5285. https://doi.org/10.1007/s00167-023-07635-1
Li, F., Cui, Y., Li, Y., Guo, L., Ke, X., Liu, J., Luo, X., Zheng, Y., & Leckman, J. F. (2022). Prevalence of mental disorders in school children and adolescents in China: Diagnostic data from detailed clinical assessments of 17,524 individuals. Journal of Child Psychology and Psychiatry, 63(1), 34–46. https://doi.org/10.1111/jcpp.13445
Lijoi, A., Mena, R. H., & Prünster, I. (2007). Controlling the reinforcement in Bayesian non-parametric mixture models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 69(4), 715–740. https://doi.org/10.1111/j.1467-9868.2007.00609.x
Lijoi, A., Prünster, I., & Rebaudo, G. (2023). Flexible clustering via hidden hierarchical Dirichlet priors. Scandinavian Journal of Statistics, 50(1), 213–234. https://doi.org/10.1111/sjos.12578
Lin, T.-Y., Tuerlinckx, F., & Vanbelle, S. (2025). Reliability for multilevel data: A correlation approach. Psychological Methods. Advance online publication.
Linacre, J. M. (1989). Many-faceted Rasch measurement [Unpublished doctoral dissertation]. The University of Chicago.
Lo, A. X., Heinemann, A. W., Gray, E., Lindquist, L. A., Kocherginsky, M., Post, L. A., & Dresden, S. M. (2021). Inter-rater reliability of clinical frailty scores for older patients in the emergency department. Academic Emergency Medicine, 28(1), 110–113. https://doi.org/10.1111/acem.13953
Lo, A. Y. (1984). On a class of Bayesian nonparametric estimates: I. Density estimates. The Annals of Statistics, 12(1), 351–357. https://doi.org/10.1214/aos/1176346412
Lockwood, J. R., Castellano, K. E., & Shear, B. R. (2018). Flexible Bayesian models for inferences from coarsened, group-level achievement data. Journal of Educational and Behavioral Statistics, 43, 663–692. https://doi.org/10.3102/1076998618795124
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Addison-Wesley Publishing Company, Inc.
Martinková, P., Bartoš, F., & Brabec, M. (2023). Assessing inter-rater reliability with heterogeneous variance components models: Flexible approach accounting for contextual variables. Journal of Educational and Behavioral Statistics, 48(3), 349–383. https://doi.org/10.3102/10769986221150517
Martinková, P., Goldhaber, D., & Erosheva, E. (2018). Disparities in ratings of internal and external applicants: A case for model-based inter-rater reliability. PLoS One, 13, e0203002. https://doi.org/10.1371/journal.pone.0203002
Martinková, P., & Hladká, A. (2023). Computational aspects of psychometric methods: With R. Chapman and Hall/CRC. https://doi.org/10.1201/9781003054313
McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), 30–46. https://doi.org/10.1037/1082-989X.1.1.30
Meilă, M. (2007). Comparing clusterings—An information based distance. Journal of Multivariate Analysis, 98(5), 873–895. https://doi.org/10.1016/j.jmva.2006.11.013
Mignemi, G., Calcagnì, A., Spoto, A., & Manolopoulou, I. (2024). Mixture polarization in inter-rater agreement analysis: A Bayesian nonparametric index. Statistical Methods & Applications, 33, 325–355. https://doi.org/10.1007/s10260-023-00741-x
Miller, J. W. (2019). Fast and accurate approximation of the full conditional for gamma shape parameters. Journal of Computational and Graphical Statistics, 28(2), 476–480. https://doi.org/10.1080/10618600.2018.1537929
Molenaar, D., Uluman, M., Tavşancıl, E., & De Boeck, P. (2021). The hierarchical rater thresholds model for multiple raters and multiple items. Open Education Studies, 3, 33–48. https://doi.org/10.1515/edu-2020-0105
Muckle, T. J., & Karabatsos, G. (2009). Hierarchical generalized linear models for the analysis of judge ratings. Journal of Educational Measurement, 46(2), 198–219. https://doi.org/10.1111/j.1745-3984.2009.00078.x
Mulder, J., & Fox, J.-P. (2019). Bayes factor testing of multiple intraclass correlations. Bayesian Analysis, 14(2), 521–552. https://doi.org/10.1214/18-BA1115
Mutz, R., Bornmann, L., & Daniel, H.-D. (2012). Heterogeneity of inter-rater reliabilities of grant peer reviews and its determinants: A general estimating equations approach. PLoS One, 7(10), e48509. https://doi.org/10.1371/journal.pone.0048509
Nelson, K., & Edwards, D. (2010). Improving the reliability of diagnostic tests in population-based agreement studies. Statistics in Medicine, 29(6), 617–626. https://doi.org/10.1002/sim.3819
Nelson, K., & Edwards, D. (2015). Measures of agreement between many raters for ordinal classifications. Statistics in Medicine, 34, 3116–3132. https://doi.org/10.1002/sim.6546
Nieto, R., & Casabianca, J. M. (2019). Accounting for rater effects with the hierarchical rater model framework when scoring simple structured constructed response tests. Journal of Educational Measurement, 56(3), 547–581. https://doi.org/10.1111/jedm.12225
Paganin, S., & de Valpine, P. (2024). Computational methods for fast Bayesian model assessment via calibrated posterior p-values. Journal of Computational and Graphical Statistics, 34(2), 1–12.
Paganin, S., Paciorek, C. J., Wehrhahn, C., Rodríguez, A., Rabe-Hesketh, S., & de Valpine, P. (2023). Computational strategies and estimation performance with Bayesian semiparametric item response theory models. Journal of Educational and Behavioral Statistics, 48(2), 147–188. https://doi.org/10.3102/10769986221136105
Paisley, J., Wang, C., Blei, D. M., & Jordan, M. I. (2014). Nested hierarchical Dirichlet processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2), 256–270. https://doi.org/10.1109/TPAMI.2014.2318728
Pan, T., Shen, W., Davis-Stober, C. P., & Hu, G. (2024). A Bayesian nonparametric approach for handling item and examinee heterogeneity in assessment data. British Journal of Mathematical and Statistical Psychology, 77(1), 196–211. https://doi.org/10.1111/bmsp.12322
Papaspiliopoulos, O., Roberts, G. O., & Zanella, G. (2019). Scalable inference for crossed random effects models. Biometrika, 107(1), 25–40.
Papaspiliopoulos, O., Stumpf-Fétizon, T., & Zanella, G. (2023). Scalable Bayesian computation for crossed and nested hierarchical models. Electronic Journal of Statistics, 17(2), 3575–3612. https://doi.org/10.1214/23-EJS2172
Patz, R. J., Junker, B. W., Johnson, M. S., & Mariano, L. T. (2002). The hierarchical rater model for rated test items and its application to large-scale educational assessment data. Journal of Educational and Behavioral Statistics, 27(4), 341–384. https://doi.org/10.3102/10769986027004341
Peeters, M. J. (2015). Measuring rater judgment within learning assessments—Part 1: Why the number of categories matters in a rating scale. Currents in Pharmacy Teaching and Learning, 7(5), 656–661. https://doi.org/10.1016/j.cptl.2015.06.015
Plummer, M., Best, N., Cowles, K., & Vines, K. (2006). CODA: Convergence diagnosis and output analysis for MCMC. R News, 6(1), 7–11.
Rabe-Hesketh, S., & Skrondal, A. (2016). Generalized linear latent and mixed modeling. In W. J. van der Linden (Ed.), Handbook of item response theory (pp. 531–554).
Reardon, S. F., Shear, B. R., Castellano, K. E., & Ho, A. D. (2017). Using heteroskedastic ordered probit models to recover moments of continuous test score distributions from coarsened data. Journal of Educational and Behavioral Statistics, 42(1), 3–45. https://doi.org/10.3102/1076998616666279
Rodriguez, A., Dunson, D. B., & Gelfand, A. E. (2008). The nested Dirichlet process. Journal of the American Statistical Association, 103(483), 1131–1154. https://doi.org/10.1198/016214508000000553
Roy, S., Daniels, M. J., & Roy, J. (2024). A Bayesian nonparametric approach for multiple mediators with applications in mental health studies. Biostatistics, 25(3), 919–932. https://doi.org/10.1093/biostatistics/kxad038
San Martín, E., Jara, A., Rolin, J. M., & Mouchart, M. (2011). On the Bayesian nonparametric generalization of IRT-type models. Psychometrika, 76, 385–409. https://doi.org/10.1007/s11336-011-9213-9
Sattler, D. N., McKnight, P. E., Naney, L., & Mathis, R. (2015). Grant peer review: Improving inter-rater reliability with training. PLoS One, 10(6), e0130450. https://doi.org/10.1371/journal.pone.0130450
Savitsky, T. D., & Dalal, S. R. (2014). Bayesian non-parametric analysis of multirater ordinal data, with application to prioritizing research goals for prevention of suicide. Journal of the Royal Statistical Society Series C: Applied Statistics, 63(4), 539–557. https://doi.org/10.1111/rssc.12049
Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4(2), 639–650.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428. https://doi.org/10.1037/0033-2909.86.2.420
Soland, J., & Kuhfeld, M. (2022). Examining the performance of the trifactor model for multiple raters. Applied Psychological Measurement, 46(1), 53–67. https://doi.org/10.1177/01466216211051728
Song, X.-Y., Lu, Z.-H., Cai, J.-H., & Ip, E. (2013). A Bayesian modeling approach for generalized semiparametric structural equation models. Psychometrika, 78(4), 624–647. https://doi.org/10.1007/s11336-013-9323-7
Steinbakk, G. H., & Storvik, G. O. (2009). Posterior predictive p-values in Bayesian hierarchical models. Scandinavian Journal of Statistics, 36(2), 320–336. https://doi.org/10.1111/j.1467-9469.2008.00630.x
Stigmar, M. (2016). Peer-to-peer teaching in higher education: A critical literature review. Mentoring & Tutoring: Partnership in Learning, 24(2), 124–136. https://doi.org/10.1080/13611267.2016.1178963
Tang, N., Chow, S.-M., Ibrahim, J. G., & Zhu, H. (2017). Bayesian sensitivity analysis of a nonlinear dynamic factor analysis model with nonparametric prior and possible nonignorable missingness. Psychometrika, 82(4), 875–903. https://doi.org/10.1007/s11336-017-9587-4
Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 1566–1581. https://doi.org/10.1198/016214506000000302
Ten Hove, D., Jorgensen, T. D., & van der Ark, L. A. (2021). Interrater reliability for multilevel data: A generalizability theory approach. Psychological Methods, 27(4), 650–666. https://doi.org/10.1037/met0000391
Ten Hove, D., Jorgensen, T. D., & van der Ark, L. A. (2022). Updated guidelines on selecting an intraclass correlation coefficient for interrater reliability, with applications to incomplete observational designs. Psychological Methods, 29(5), 967–979. https://doi.org/10.1037/met0000516
Uebersax, J. S. (1993). Statistical modeling of expert ratings on medical treatment appropriateness. Journal of the American Statistical Association, 88(422), 421–427. https://doi.org/10.1080/01621459.1993.10476291
Uto, M., Tsuruta, J., Araki, K., & Ueno, M. (2024). Item response theory model highlighting rating scale of a rubric and rater–rubric interaction in objective structured clinical examination. PLoS One, 19(9), 1–23. https://doi.org/10.1371/journal.pone.0309887
Uto, M., & Ueno, M. (2016). Item response theory for peer assessment. IEEE Transactions on Learning Technologies, 9, 157–170. https://doi.org/10.1109/TLT.2015.2476806
Uto, M., & Ueno, M. (2020). A generalized many-facet Rasch model and its Bayesian estimation using Hamiltonian Monte Carlo. Behaviormetrika, 47, 1–28. https://doi.org/10.1007/s41237-020-00115-7
van Praag, B. M., Hop, J. P., & Greene, W. H. (2025). Estimation of linear models from coarsened observations: A method of moments approach. Psychometrika. Advance online publication, 1–34. https://doi.org/10.1017/psy.2024.19
Verbeke, G., & Lesaffre, E. (1996). A linear mixed-effects model with heterogeneity in the random-effects population. Journal of the American Statistical Association, 91, 217–221. https://doi.org/10.1080/01621459.1996.10476679
Wade, S. (2015). Point estimation and credible balls for Bayesian cluster analysis [Computer software manual]. R package version 1.0. Retrieved from https://www.maths.ed.ac.uk/~swade/docs/mcclust.ext-manual.pdf
Wade, S., & Ghahramani, Z. (2018). Bayesian cluster analysis: Point estimation and credible balls (with discussion). Bayesian Analysis, 13(2), 559–626. https://doi.org/10.1214/17-BA1073
Walker, S. G., & Gutiérrez-Peña, E. (2007). Bayesian parametric inference in a nonparametric framework. Test, 16, 188–197. https://doi.org/10.1007/s11749-006-0008-8
Wang, W., & Kingston, N. (2020). Using Bayesian nonparametric item response function estimation to check parametric model fit. Applied Psychological Measurement, 44(5), 331–345. https://doi.org/10.1177/0146621620909906
Werts, C., Linn, R., & Jöreskog, K. (1974). Intraclass reliability estimates: Testing structural assumptions. Educational and Psychological Measurement, 34(1), 25–33. https://doi.org/10.1177/001316447403400104
Wilson, M., & De Boeck, P. (2004). Descriptive and explanatory item response models. In Explanatory item response models: A generalized linear and nonlinear approach (pp. 43–74). Springer. https://doi.org/10.1007/978-1-4757-3990-9_2
Yang, M., & Dunson, D. B. (2010). Bayesian semiparametric structural equation models with latent variables. Psychometrika, 75, 675–693. https://doi.org/10.1007/s11336-010-9174-4
Yang, M., Dunson, D. B., & Baird, D. (2010). Semiparametric Bayes hierarchical models with mean and variance constraints. Computational Statistics & Data Analysis, 54(9), 2172–2186. https://doi.org/10.1016/j.csda.2010.03.025
Zupanc, K., & Štrumbelj, E. (2018). A Bayesian hierarchical latent trait model for estimating rater bias and reliability in large-scale performance assessment. PLoS One, 13(4), 1–16. https://doi.org/10.1371/journal.pone.0195297
Figure 1 Graphical representation of the dependencies implied by the model. The boxes indicate replicates; the four outer plates represent subjects and raters, respectively, and the inner grey plate indicates the observed rating.

Figure 2 Illustrative examples of empirical $ICC_A$ and $\mathbf {E}[ICC]$ across independent datasets and under different reliability scenarios. The grey balls indicate the mean pairwise $ICC$ between each rater and the others; the black triangles represent the computed $ICC_A$.

Figure 3 Average estimated density across 10 independent datasets under different scenarios. The columns indicate the cardinality $|\mathcal {R}_i|=\{2,4\}$ (left and right, respectively); the rows indicate the bimodal and multimodal scenarios (first and second row, respectively). The solid red lines indicate the true densities; the solid black lines and the shaded grey areas indicate, respectively, the point-wise mean and $95\%$ quantile-based credible intervals; the black dotted lines indicate the density implied by the BP model.

Figure 4 Top row: empirical distribution of the data (red solid line) and empirical distributions of replicated data (black solid lines) from the respective BNP and BP posterior distributions (left and right columns, respectively). Middle and bottom rows: test statistics computed on the data (red solid line) and histograms of those computed on replicated data.

Table 1 RMSE and MAE of individual parameters corresponding to the BP, BSP, and BNP models.

Table 2 Standardized RMSE and standardized MAE of structural parameters corresponding to the BP, BSP, and BNP models.

Figure 5 The empirical distribution of ratings and the frequency of students per teacher are reported on the left and right, respectively.

Table 3 The WAIC is reported for each of the fitted models (BNP, BP, and BSP), together with the pairwise WAIC difference ($\Delta WAIC$) between the best-fitting model and each of the others.

Table 4 Posterior means and $95\%$ quantile-based credible intervals of the estimated structural parameters of the BNP model are reported.

Figure 6 The estimated densities of the subject’s true score $\theta $, the rater’s systematic bias $\tau $, and the residual term $\epsilon $ are reported; the black solid lines and the shaded grey areas indicate the pointwise posterior means and $95\%$ quantile-based credible intervals of the respective densities. The bottom-right panel shows the posterior distribution of $ICC_A$; the black solid and dotted lines indicate, respectively, the $95\%$ credible interval and the posterior mean. The rugs at the margins of the first three panels indicate the clustering of individuals.

Table 5 RMSE and MAE of individual parameters across bimodal scenarios with coarsened ratings.

Table 6 The WAIC is reported for each of the fitted models (BNP, BP, and BSP), together with the pairwise WAIC difference ($\Delta WAIC$) between the best-fitting model and each of the others.

Table 7 Posterior means and $95\%$ quantile-based credible intervals of the estimated structural parameters of the BNP model are reported.

Figure 7 The estimated densities of the subject’s true score $\theta $, the rater’s systematic bias $\tau $, and the residual term $\epsilon $ are reported; the black solid lines and the shaded grey areas indicate the pointwise posterior means and $95\%$ quantile-based credible intervals of the respective densities. The bottom-right panel shows the posterior distribution of $ICC_A$; the black solid and dotted lines indicate, respectively, the $95\%$ credible interval and the posterior mean. The rugs at the margins of the first three panels indicate the clustering of individuals.

Figure C.1 Average estimated density across 10 independent datasets under the unimodal scenario. The columns indicate the cardinality $|\mathcal {R}_i|=\{2,4\}$ (left and right, respectively). The solid red lines indicate the true densities; the solid black lines and the shaded grey areas indicate, respectively, the point-wise mean and $95\%$ quantile-based credible intervals; the black dotted lines indicate the density implied by the BP model.

Figure C.2 First row: examples of posterior similarity matrices for pairwise subject and rater allocation (left and right columns, respectively). Second row: posterior similarity matrices for pairwise subject and rater allocation in the real data analyzed in Section 7.
Supplementary material: Mignemi and Manolopoulou supplementary material (File, 5.4 MB).