1 Introduction
Measurement models, such as the popular two-parameter item response theory (IRT) model, are commonly used to measure latent social-scientific constructs like political ideology. Such models use observed responses to a common set of stimuli (e.g., congressional bills to be voted on) in order to estimate underlying traits of respondents and mappings from those traits to the responses given (e.g., a “yea” or “nay” vote). Standard applications of these models typically proceed on the assumption that the set of stimuli used to measure constructs of interest are understood equally by all respondents, thus making their answers (and anything we learn from them) comparable. This assumption is commonly known as measurement invariance, or measurement equivalence (King et al. Reference King, Murray, Salomon and Tandon2004; Stegmueller Reference Stegmueller2011).
As early as 1980, however, researchers were aware that violations of this assumption were possible. Today, violations of this assumption are commonly referred to as differential item functioning (DIF). In the language of the time, Lord (Reference Lord1980, 212) defined DIF by stating that “if an item has a different item response function for one group than for another, it is clear that the item is biased.”
Since Lord’s description of the problem that DIF poses to measurement, a number of researchers have developed and adopted various techniques to mitigate its effects. Lord (Reference Lord1980, Reference Lord and Poortinga1977) proposed a general test of joint differences between the item parameter estimates for two groups of respondents in the data. Thissen, Steinberg, and Wainer (Reference Thissen, Steinberg, Wainer, Holland and Wainer1993) build on this work, proposing additional methods for fitting IRT models to a known reference and focal group and then testing for statistical differences in item parameters between the two groups. This work in identifying DIF is complemented by work that attempts to correct DIF under very specific circumstances and assumptions, including Aldrich and McKelvey (Reference Aldrich and McKelvey1977), Hare et al. (Reference Hare, Armstrong, Bakker, Carroll and Poole2015), Jessee (Reference Jessee2021), King et al. (Reference King, Murray, Salomon and Tandon2004), Poole (Reference Poole1998), and Stegmueller (Reference Stegmueller2011).
In this paper, we propose a model designed to improve measurement when DIF is present. To do so, we rely on Bayesian nonparametrics to flexibly estimate differences in the mappings used by respondents when presented with a common set of items. While we are not the first scholars to combine Bayesian nonparametric techniques (and specifically the Dirichlet process) with IRT models (see, e.g., Jara et al. Reference Jara, Hanson, Quintana, Müller and Rosner2011; Miyazaki and Hoshino Reference Miyazaki and Hoshino2009), to the best of our knowledge, we are the first to do so explicitly with the goal of diagnosing DIF. Our model—which we refer to as the multiple policy space (MPS) model—addresses one specific violation of measurement invariance that is of particular importance in political methodology.
Our model identifies subgroups of respondents who share common item parameter values, and whose positions in a shared latent space can thus safely be compared. Thus, while subgroups in our model will not necessarily be distinct from each other, our model can estimate group-specific latent traits by first learning a sorting of observations across unobserved groups of respondents who share a common understanding of items, and conditioning on these group memberships to carry out the measurement exercise. This is similar in spirit to work done by Lord (Reference Lord1980) and Thissen et al. (Reference Thissen, Steinberg, Wainer, Holland and Wainer1993), but a crucial difference in our work is that we do not require researchers to specify a set of group memberships a priori before testing. Rather, our work offers an automated, model-based approach to discover these group memberships from response patterns alone, which in turn also identifies groups of respondents for whom common latent trait mappings can and cannot be validly compared. In discovering these latent group memberships, we can also distinguish the set of respondents in our data who are comparable on a common latent score (e.g., a liberal–conservative ideological spectrum) from those who think along a different dimension (e.g., a libertarian–authoritarian spectrum).Footnote ^{1}
To empirically illustrate our model, we apply it to the estimation of political ideology using a dataset that contains both legislators and voters. Our application is based on the dataset analyzed by Jessee (Reference Jessee2016), which contains 32,800 respondents to a survey conducted in 2008 and 550 U.S. Congress members who served in the same year. As we discussed above and will elaborate in the next section, the aim of the MPS model in this application is to identify subsets of the voters and legislators within which item response functions (IRFs) are shared and to measure latent traits within each subset, rather than jointly scaling the actors into a common ideology space or determining whether joint scaling disrupts ideal point estimates. In our analysis, we find that 73% of the voters in the dataset share item parameters with the legislators, whereas the remaining 27% do not.
Our paper proceeds as follows. First, we introduce the substantive context and dataset of our application, focusing on the work of Jessee (Reference Jessee2016). Second, we discuss and motivate the details of our IRT model for dealing with measurement heterogeneity, discussing the role of the Dirichlet process prior—the underlying technology that our proposed model uses to nonparametrically separate respondents into groups. Third, we offer Monte Carlo simulation evidence demonstrating the ability of our model to recover the key parameters of interest. Fourth, we present a substantive application of our model to the debate on the joint scaling of legislators and voters. This debate focuses on the extent to which we can reasonably scale legislators and voters into the same ideological space, which effectively can be reframed as a question regarding the extent to which voters share the same item parameters as legislators. We conclude with some thoughts on potential applications of our approach to dealing with heterogeneity in measurement.
2 Application: Scaling Legislators and Voters
In recent years, a literature extending the canonical two-parameter IRT model to jointly scale legislators and voters using bridging items has emerged (Bafumi and Herron Reference Bafumi and Herron2010; Hirano et al. Reference Hirano, Imai, Shiraito and Taniguchi2011; Jessee Reference Jessee2012; Saiegh Reference Saiegh2015). In such applications, researchers begin with a set of items that legislators have already provided responses to, such as a set of preexisting roll call votes. Voters on a survey are then provided with the same items and asked for their responses. The responses of the voters and legislators are grouped together and jointly scaled into a common space, providing estimated ideal points of voters and legislators that in theory can then be compared to one another.
In an influential critique of this work, Jessee (Reference Jessee2016) argued that this approach did not necessarily guarantee that legislators and voters could jointly be scaled into a common space.Footnote ^{2} Jessee’s core critique was that legislators and voters potentially saw the items and the ideological space differently, even if they were expressing preferences on the same items. Joint scaling effectively constrains the item parameters for those items to be identical for both groups, but does not guarantee that they are actually identical in reality. In the language of the MPS model, Jessee claimed that there were potentially two separate clusters—one for legislators and another for voters—through which DIF can occur.
For Jessee, the question of whether voters and legislators could be jointly scaled was essentially a question of sensitivity analysis. He conceptualized the answer to this question as a binary one—that is, either all voters and legislators could be jointly scaled together, or they could not be. His proposed solution was to estimate two separate models for legislators and voters. Jessee then used the legislator item parameters to scale voters in “legislator space,” and the voter item parameters to scale legislators into “voter space.” If these estimates were similar to those obtained via joint scaling, then the results were robust and legislators and voters could be scaled together. The Jessee approach essentially adopts the approach of Lord (Reference Lord1980) and Thissen et al. (Reference Thissen, Steinberg, Wainer, Holland and Wainer1993) for testing for DIF, and adds an extra step by re-estimating latent traits for the reference and focal groups conditional on the item parameters of the other group.
Our approach to answering this question differs substantially from Jessee’s, but it is worth noting that his conception of the problem is a special case of our approach. To answer this question using our model, we can estimate an MPS model where we constrain all of the legislators to share a common set of item parameters, but allow voters to move between clusters. Voters can thus be estimated to share membership in the legislator cluster, or they can split off into other separate clusters occupied only by voters. This highlights the principal difference between the MPS model and Jessee’s approach. Jessee’s approach is a sensitivity analysis in the spirit of Lord (Reference Lord1980) that provides a binary Yes/No answer to the question of whether jointly scaling legislators and voters together will change the ideal point estimates meaningfully—that is, it scales voters using the item parameters of the legislators, and legislators using the item parameters of the voters. Substantial deviation in the estimated ideal points between these approaches suggests that voters and legislators cannot be scaled together in a common space. In contrast, the MPS model identifies the subset of voters that can be jointly scaled with legislators, which the Jessee model does not. While two special cases of the MPS model (i.e., either all voters share item parameters with the legislators, or none of them do) correspond to potential answers that Jessee’s model can provide, our model can provide intermediate answers—notably, we can identify the number and identity of the voters who share an ideological space with legislators, and voters need not all share a common ideological space with one another.
3 Model Description
Our modeling approach adopts the same group-based definition of DIF previously described by Lord (Reference Lord1980) and Thissen et al. (Reference Thissen, Steinberg, Wainer, Holland and Wainer1993). Specifically, we assume that there are subsets of respondents who share the same IRFs, which in turn are different from those used by members of other subsets.
If we knew a priori what these groups were (e.g., gender of legislators in legislative voting), correcting or accounting for DIF would be relatively easy, and would amount to conditioning on group membership during the scaling exercise. However, the subsets of respondents for whom items are expected to function in different ways are often not immediately obvious. In such cases, we can use response patterns across items to estimate membership in groups of respondents defined by clusters of item parameter values (i.e., of the parameters that define different IRFs). This is the key insight behind our approach, which relies on a Dirichlet process prior for item parameters that allows us to identify collections of individuals for whom IRFs operate similarly without the need to fix memberships or the number of such groups a priori.
To this end, we propose a model that addresses DIF violations occurring across groups of respondents. When group membership is held constant across items, we are able to identify sets of respondents who are effectively mapped onto different spaces, but who are guaranteed to be comparable within group assignment. Our approach, which we call the MPS model, is a latent-variable generalization of the standard nonparametric Dirichlet process mixture regression model (e.g., Hannah, Blei, and Powell Reference Hannah, Blei and Powell2011).Footnote ^{3}
With these intuitions in place, we now present our DP-enhanced IRT model, including a discussion of how the Dirichlet process prior can help us address the issue of heterogeneous IRFs, but leave the details of our Bayesian simulation algorithm to the Appendix.
3.1 The Multiple Policy Space Model
Let $y_{i,j}\in \{0,1\}$ be respondent i’s ( $i\in \{1,\ldots ,N\}$ ) response on item $j\in \{1,\ldots ,J\}$ . Our two-parameter IRT model defines

(1)
$$ \begin{aligned} \mathrm{Pr}(y_{i,j} = 1) &= \Phi\left(\boldsymbol{\theta}_i^{\top}\boldsymbol{\beta}_{k[i],j} - \gamma_{k[i],j}\right),\\ (\boldsymbol{\beta}_{k,j}, \gamma_{k,j}) &\sim G_0, \end{aligned} $$
where $k[i] \in \{1, 2, \ldots\}$ is a latent cluster to which respondent i belongs; $\boldsymbol {\theta }_i$ is a vector of latent respondent positions in a D-dimensional space; $\boldsymbol {\beta }_{k,j}$ is a vector of cluster-specific item-discrimination parameters; and $\gamma _{k,j}$ is a cluster-specific item-difficulty parameter.Footnote ^{4} Substantively, cluster-specific item parameters reflect the possibility that the IRF is shared by respondents belonging to the same group k but heterogeneous across groups.
To aid in the substantive interpretation of this model, it is helpful to consider the case where we only keep respondents in group $k = k'$ , and discard respondents belonging to all other groups. Thus, we are only using the item parameters from cluster $k'$ , which are common to all respondents in that cluster. Since this is the case, we can discard the cluster indexing altogether, and the first line of Equation (1) reduces to

$$ \mathrm{Pr}(y_{i,j} = 1) = \Phi\left(\boldsymbol{\theta}_i^{\top}\boldsymbol{\beta}_{j} - \gamma_{j}\right). $$
This is the standard two-parameter IRT model. Thus, we can summarize our model as follows: if cluster memberships were known, the MPS model is equivalent to taking subsets of respondents by cluster, and scaling each cluster separately using the standard two-parameter IRT model. This implies that even though they are expressing preferences on the same items, respondents in different clusters are mapping the same items onto different latent spaces. Thus, comparisons of $\boldsymbol {\theta }_i$ are only meaningful when those $\boldsymbol {\theta }_i$ belong to the same cluster (i.e., would have been scaled together in the same IRT model).Footnote ^{5}
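To make the model concrete, the cluster-specific IRF can be sketched in a few lines of code. This is an illustrative sketch rather than the paper's implementation: it assumes a probit link, and the function names (`normal_cdf`, `response_prob`) are our own.

```python
import math

def normal_cdf(x):
    # Standard normal CDF, expressed via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def response_prob(theta_i, beta_kj, gamma_kj):
    """Pr(y_ij = 1) under the cluster-specific two-parameter IRT model.

    theta_i : list of D latent trait values for respondent i
    beta_kj : list of D discrimination values for item j in cluster k[i]
    gamma_kj: difficulty of item j in cluster k[i]
    """
    utility = sum(t * b for t, b in zip(theta_i, beta_kj)) - gamma_kj
    return normal_cdf(utility)

# A respondent at theta = 0 facing an item with zero difficulty
# answers "yea" with probability one half.
p = response_prob([0.0], [1.2], 0.0)
```

Under this parameterization, respondents in different clusters can receive different response probabilities for the same item and the same latent position, which is exactly the heterogeneity the model targets.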
Given that we do not observe which observations belong to which clusters, however, we need to define a probabilistic model for the cluster memberships that does not require a priori specifying how many clusters respondents can be sorted into. For this, we rely on the Dirichlet process prior.
3.2 Sampling Cluster Memberships Using a Dirichlet Process Mixture
The Dirichlet process is a popular nonparametric Bayesian prior (Ferguson Reference Ferguson1973; see also Teh Reference Teh2010). The basic idea of the Dirichlet process is that any sample of data for which one typically estimates a set of parameters can be split into subgroups of units, letting the data guide discovery of those groups instead of requiring users to prespecify their number a priori. Technically, the Dirichlet process prior allows mixture models to have a potentially infinite number of mixture components, but in general it allows a small number of components to be occupied by observations by penalizing the total number of occupied components. It is known that the number of mixture components is not consistently estimated. Nevertheless, when used for density estimation (Ghosal, Ghosh, and Ramamoorthi Reference Ghosal, Ghosh and Ramamoorthi1999) and nonparametric generalized (mixed) linear models (Hannah et al. Reference Hannah, Blei and Powell2011; Kyung et al. Reference Kyung, Gill and Casella2009), Dirichlet process mixture models consistently estimate the density and the mean function, respectively.
We now describe the Dirichlet process mixture of our MPS model.Footnote ^{6} Let $p_{k^{\prime }}$ denote the probability that each observation is assigned to cluster $k^{\prime }$ , for $k^{\prime } = 1, 2, \dots $ , that is, $p_{k^{\prime }} \equiv \mathrm{Pr}(k[i] = k^{\prime })$ , and let the last line of Equation (1) be the base distribution from which cluster-specific item parameters are drawn. Then, under a DP-mixture model of cluster-specific IRT likelihoods, we have

(2)
$$ k[i] \sim \mathrm{Categorical}(p_1, p_2, \ldots), $$
(3)
$$ p_{k^{\prime}} = \pi_{k^{\prime}} \prod_{l = 1}^{k^{\prime} - 1} (1 - \pi_l), $$
(4)
$$ \pi_{k^{\prime}} \sim \mathrm{Beta}(1, \alpha). $$
Equations (2)–(4) are the key to understanding how the Dirichlet process mixture makes nonparametric estimation possible. At the first step in the data-generating process, we assign each observation to one of the clusters $k^{\prime } = 1, 2, \dots $ . The assignment probabilities are determined by Equations (3) and (4), which define what is called the “stick-breaking” process. The origin of the name sheds light on how this process works. When deciding the probability of the first cluster ( $k^{\prime } = 1$ ), a stick of length $1$ is broken at the location determined by the Beta random variable ( $\pi _1$ ). The probability that each observation is assigned to the first cluster is set to be the length of the broken stick, $\pi _1$ . Next, we break the remaining stick of length $1 - \pi _1$ at the proportion $\pi _2$ of the remaining stick. The length of the second broken piece ( $\pi _2 (1 - \pi _1)$ ) is used as the probability of each observation being assigned to the second cluster. After setting the assignment probability of the second cluster, we continue to break the remaining stick following the same procedure an infinite number of times. The probabilities produced by this stochastic process vanish as the cluster index increases because the remaining stick becomes shorter every time it is broken. Although we do not fix the maximum number of clusters and allow the number to diverge in theory, the property of the stick-breaking process that causes the probabilities to quickly shrink toward zero prevents the number of clusters from diverging in practice.Footnote ^{7}
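The stick-breaking construction is straightforward to simulate. The sketch below is our own illustration, assuming the standard Beta(1, α) break distribution with concentration parameter `alpha`; `tol` and `max_k` are truncation conveniences rather than parts of the model.

```python
import random

def stick_breaking_weights(alpha, tol=1e-10, max_k=10_000):
    """Draw cluster probabilities p_1, p_2, ... by stick breaking.

    alpha is the Dirichlet process concentration parameter; tol and
    max_k merely truncate the (in principle infinite) process once the
    remaining stick is negligibly short.
    """
    weights = []
    remaining = 1.0  # length of the stick still unbroken
    for _ in range(max_k):
        pi = random.betavariate(1.0, alpha)  # Beta(1, alpha) break proportion
        weights.append(pi * remaining)       # length of the piece broken off
        remaining *= 1.0 - pi                # what is left to break next time
        if remaining < tol:
            break
    return weights

random.seed(1)
p = stick_breaking_weights(alpha=1.0)
# The weights sum to (essentially) one, and the mass concentrates on the
# first few clusters, which is what keeps the number of occupied
# clusters small in practice.
```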
Accordingly, when clusters over which DIF occurs are unobserved (both in membership and in number), we can rely on this probabilistic clustering process over a potentially infinite number of groups. In this context, each cluster $k^{\prime }$ effectively defines a (potentially) different IRF, which in turn allows us to automatically sort observations into equivalence classes within which measurement invariance is expected to hold, without guaranteeing that observations sorted into different clusters will be comparable. Hence, our model partitions respondents across a (potentially infinite) set of multiple policy spaces.
In general, the substantive interpretation of estimated clusters needs to be approached cautiously. While our model is useful for identifying which respondents perceive a common latent space with each other, it will generally overestimate the total number of actual (i.e., substantively distinct) clusters in the data (Kyung et al. Reference Kyung, Gill and Casella2009; Womack, Gill, and Casella Reference Womack, Gill and Casella2014).Footnote ^{8} In the MPS model, multiple DP clusters can be thought of as being part of the same substantive group—even if their corresponding item parameters are not exactly the same. What is more, this subclustering phenomenon can exacerbate known pathologies of mixture modeling and IRT modeling, such as label switching (i.e., invariance with respect to component label permutations) and additive and multiplicative aliasing (i.e., invariance with respect to affine transformations of item parameters and ideal points).
Thus, even if all respondents actually belonged to the same cluster $k'$ , we could estimate more than one cluster (denoted here by $k^{\prime\prime}$ ), with the other clusters recovering a transformed set of item parameters $\boldsymbol {\beta }_{k^{\prime\prime},j}^{\top } = \boldsymbol {\beta }_{k',j}^{\top } K$ (where K is an arbitrary rotation matrix). However, we would still be able to see that clusters $k'$ and $k^{\prime\prime}$ were similar by examining the correlation between $\boldsymbol {\beta _{k'}}$ and $\boldsymbol {\beta _{k^{\prime\prime}}}$ , as well as the patterns of correlation between these and the item parameters associated with other clusters. When subclustering is an issue, two subclusters can be thought of as being part of the same substantive cluster if their item parameters are highly correlated, or if they share similar correlation patterns with parameters in other subclusters.Footnote ^{9}
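To see why correlation recovers affine-equivalent subclusters, consider two hypothetical discrimination vectors in which one subcluster's parameters are an affine transformation of the other's; the Pearson correlation is invariant to such transformations. All values below are invented for illustration.

```python
def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical discrimination estimates for two subclusters, where
# cluster k'' is an affine transformation (scale 2, shift 0.5) of k'.
beta_k_prime = [0.3, -1.1, 0.8, 1.5, -0.4, 0.9, -2.0, 0.1]
beta_k_double = [2.0 * b + 0.5 for b in beta_k_prime]

r = pearson(beta_k_prime, beta_k_double)  # equals 1: same substantive cluster
```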
Having presented the details of our model, we now present the results of a Monte Carlo simulation that illustrates its ability to accurately partition respondents across clusters and recover the associated item parameters within each cluster.
4 Monte Carlo Simulations
As an initial test of our MPS model, we conduct a Monte Carlo simulation to test the ability of our model to correctly recover our parameters of interest. We simulate a dataset in which $N=1,000$ respondents provide responses to $J=200$ binary items. Respondents are randomly assigned to one of three separate clusters with probabilities 0.5, 0.2, and 0.3, respectively. In each cluster, respondent ability parameters and item difficulty and discrimination parameters are all drawn from a standard normal distribution. For starting values, we use k-means clustering to generate initial cluster assignments, and principal components analysis on the subsets of the data matrix defined by those cluster assignments to generate ability starting values. Item difficulty and discrimination starting values were generated for each cluster and item by running probit regressions of the observed data on the starting ability parameter values by cluster. We run 1,000 Markov chain Monte Carlo (MCMC) iterations, discarding the first 500 as burn-in, and keeping only the sample that produces the highest posterior density as the maximum a posteriori (MAP) estimate of all parameters and latent variables, to avoid issues associated with label switching.Footnote ^{10}
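The starting-value construction for the latent traits can be sketched as follows. This is our own minimal illustration of the within-cluster principal components step for a one-dimensional trait (the name `starting_abilities` and the toy data are invented; the paper's implementation may differ).

```python
import numpy as np

rng = np.random.default_rng(0)

def starting_abilities(responses, assignments, n_clusters):
    """First-principal-component starting values within each cluster.

    responses  : (N, J) array of 0/1 answers
    assignments: length-N array of initial cluster labels
    """
    theta0 = np.zeros(len(responses))
    for k in range(n_clusters):
        idx = np.where(assignments == k)[0]
        X = responses[idx] - responses[idx].mean(axis=0)  # center within cluster
        cov = X.T @ X / max(len(idx) - 1, 1)
        eigvals, eigvecs = np.linalg.eigh(cov)
        pc1 = eigvecs[:, -1]           # leading eigenvector (eigh sorts ascending)
        theta0[idx] = X @ pc1          # project onto the first component
    return theta0

# Toy data: 30 respondents, 8 items, two arbitrary starting clusters.
Y = rng.integers(0, 2, size=(30, 8)).astype(float)
labels = rng.integers(0, 2, size=30)
theta0 = starting_abilities(Y, labels, n_clusters=2)
```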
Table 1 shows a cross-tabulation of the simulated versus estimated cluster assignments. The estimation procedure is able to separate the simulated clusters well, in the sense that none of the estimated clusters span multiple simulated clusters. However, we see evidence of the subclustering phenomenon discussed earlier. Members of simulated cluster 1, for instance, were split into estimated clusters 3, 7, 9, and 10. Since members of simulated cluster 1 were all generated using the same item parameters, the four estimated clusters that partition them are effectively noisy affine transformations of each other. Thus, we expect that the four sets of estimated item parameters for clusters 3, 7, 9, and 10 will be correlated. Simulated clusters 2 and 3 are similarly split between multiple estimated clusters, and we could expect these parameters to be similarly correlated.
In a real application, of course, the true underlying cluster memberships are not available. And as we discussed earlier, Dirichlet process mixtures are ideal for capturing the distribution of parameters by discretizing their support into an infinite number of subclusters. As a result, many of these Dirichlet subclusters may share very similar parameter values, effectively representing the same substantive groupings in terms of item functioning. Accordingly, using DP mixtures to diagnose DIF requires a formal procedure for establishing which subclusters belong together by virtue of sharing similar item parameters, and which contain observations that truly differ in their item functioning.
The practical issue of establishing equivalence across groups can be approached from a number of perspectives. For example, researchers could employ pairwise equivalence tests on the item parameters (see, e.g., Hartman and Hidalgo Reference Hartman and Hidalgo2018; Rainey Reference Rainey2014, for illustrations in Political Science), being careful to account for the problems raised by conducting multiple comparisons (e.g., using a Bonferroni-style correction, or the Benjamini–Hochberg procedure to control the false discovery rate). Given the potentially large number of pairings, however, we rely on an alternative approach that exploits the second- and third-order information contained in the item parameter correlation matrix. Specifically, we study the graph induced by correlations across entire vectors of estimated item parameters to reconstruct substantive clusters from the subclusters identified through the DP mixture, and encourage applied researchers to follow the same approach.
To do so, we treat correlations among parameters as the adjacency matrix of a weighted, undirected graph defined on the set of subclusters. The problem of finding substantive clusters can then be cast as the problem of finding the optimal number of communities of subclusters on this graph—a problem for which a number of approximate solutions exist (for a succinct review, see Sinclair Reference Sinclair and Michael Alvarez2016).
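A minimal sketch of this graph-based grouping, with thresholded connected components standing in for a full weighted community-detection algorithm (the correlation matrix, threshold, and groupings below are invented for illustration):

```python
def communities_from_correlations(corr, threshold=0.8):
    """Group subclusters into substantive clusters.

    Treats the absolute correlation matrix as the adjacency matrix of a
    graph and returns the connected components formed by edges whose
    absolute correlation meets `threshold`.

    corr: square list-of-lists of pairwise item-parameter correlations.
    """
    n = len(corr)
    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:  # depth-first search over strong edges
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            for v in range(n):
                if v not in comp and abs(corr[u][v]) >= threshold:
                    stack.append(v)
        seen |= comp
        groups.append(sorted(comp))
    return groups

# Hypothetical correlations among four subclusters: 0 and 1 look alike,
# as do 2 and 3, suggesting two substantive clusters.
corr = [
    [1.00, 0.95, 0.05, 0.10],
    [0.95, 1.00, 0.02, 0.08],
    [0.05, 0.02, 1.00, 0.97],
    [0.10, 0.08, 0.97, 1.00],
]
groups = communities_from_correlations(corr)  # [[0, 1], [2, 3]]
```

In practice one would use a modularity-based or similar weighted algorithm, but the logic is the same: subclusters whose item parameters are strongly correlated end up in the same substantive cluster.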
For instance, a simple tool for identifying the optimal number of communities in a network is given by the Gap Statistic (Tibshirani, Walther, and Hastie Reference Tibshirani, Walther and Hastie2001), which compares an average measure of dissimilarity among community members relative to the dissimilarity that would be expected under a null distribution of edge weights emerging from a no-heterogeneity scenario.Footnote ^{11}
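Concretely, for a candidate number of communities k, the statistic takes the standard form proposed by Tibshirani et al. (sketched here in our notation; the exact implementation may differ):

$$ \text{Gap}(k) = \frac{1}{B}\sum_{b=1}^{B} \log W_{k}^{(b)} - \log W_{k}, $$

where $W_k$ is the total within-community dissimilarity obtained when the subclusters are partitioned into k communities, and $W_k^{(b)}$ is the same quantity computed on the bth of B reference datasets drawn under the null (no-heterogeneity) distribution of edge weights.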
The optimal number of communities (i.e., of substantive clusters) can then be established by finding the $k^{\star }$ that maximizes $\text {Gap}(k)$ . Figure 1 shows the value of the gap statistic for different values of k, suggesting that the correct number of substantive clusters is 3 or 4.
Indeed, Figure 2 shows the result of applying a simple community detection algorithmFootnote ^{12} to the graphs formed by using correlations across discriminations (left panel) and correlations across difficulties (right panel). In both instances, the true simulated clusters are denoted using shapes for the graph nodes, and the substantive groupings discovered by the community detection algorithm are denoted using shaded areas. In all instances, the communities identified map perfectly onto the known simulation clusters.
While our previous analyses tested the correspondence between the true and estimated clusters, they say little about the recovery of the correct item parameters. In Figure 3, we explore the item discrimination parameters in a series of plots, where each panel plots two sets of item discrimination parameters against each other. Along the main diagonal, we plot combinations of the simulated item discrimination parameters (columns) for each cluster against the estimated parameters (rows) for the corresponding known cluster. In all three cases, the item parameters are well recovered and the estimates are highly correlated with the truth, with correlations of $r = 0.99$ , $r = 0.97$ , and $r = 0.97$ for the three plots.Footnote ^{13}
In turn, the off-diagonal panels present each combination of the simulated item discrimination parameters versus their (mismatched) counterparts in other clusters. Since the parameters in each cluster were generated from independent draws, the true item parameters are uncorrelated across clusters. As expected, this independence is reflected in the estimated item parameters, which appear similarly uncorrelated with one another and with parameters in other known clusters.
We repeat the same exercise in Figure 4, but this time for the latent traits. In all cases, the latent traits are highly correlated, again demonstrating correct recovery of the traits of interest. The figures also highlight the fact that, in the MPS model, estimated latent traits are only comparable to those of other respondents belonging to the same cluster. If the MPS model facilitated comparisons across clusters, then at a minimum all of the figures shown here would be consistently either positively or negatively correlated with the simulated true ideal points. However, this is not the case. This is of course not surprising—the MPS model effectively estimates a separate two-parameter IRT model for each cluster of respondents, allowing the same items to assume different item parameters for each group. Thus, ideal points across groups are no more comparable than ideal points from separately estimated IRT models would be. Of course, the MPS model makes a significant innovation in this regard—it allows us to use the data itself to sort respondents into clusters, rather than forcing the researcher to split the sample a priori.
Notably, standard measures of model fit also suggest that the MPS model fits the data better in the Monte Carlo. The MPS model produced a log-likelihood of $-85,776.71$ , but when we fit a standard IRT model that constrains all respondents to share a single cluster, the log-likelihood drops significantly to $-117,477.2$ . This improvement in fit is not surprising—compared to the standard two-parameter IRT model, MPS fits a much more flexible model. Whereas the standard, single-cluster model involves estimating 1,000 respondent and 400 item parameters for a total of 1,400 parameters, the MPS model estimates 1,000 respondent parameters and 400 item parameters per cluster. Since the maximum number of clusters in the estimation is set to 10, the MPS model effectively estimates 5,000 total parameters. Thus, a better measure of fit would penalize MPS for the added flexibility afforded by the substantial increase in parameters. The Bayesian Information Criterion (BIC) offers one such measure. It is equal to 252,043 for the single-cluster model and 232,604.7 for the MPS model, which confirms that the MPS model fits the data better—even after accounting for the substantial increase in model flexibility. Note that this BIC test is essentially a test of DIF across the identified clusters using methods similar in spirit to those proposed by Lord (Reference Lord1980) and Thissen et al. (Reference Thissen, Steinberg, Wainer, Holland and Wainer1993).
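As a rough check, the single-cluster BIC follows from the standard definition $\text{BIC} = -2\log L + p \log n$, provided one takes $n$ to be the $N \times J = 200{,}000$ individual responses (an assumption on our part):

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: -2 log L + p * log(n)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

# Single-cluster IRT in the Monte Carlo: 1,400 parameters over
# 1,000 x 200 = 200,000 responses, log-likelihood -117,477.2.
bic_single = bic(-117_477.2, 1_400, 200_000)  # ~252,043, matching the text
```

The MPS figure depends on how the effective number of parameters across clusters is counted, so we do not attempt to reproduce it here.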
Finally, it is important to note that while MPS will partition observations into subclusters even when there is no underlying heterogeneity (i.e., even when the standard IRT model is correct), the similarity of item parameters across subclusters will immediately suggest that the resulting partition is substantively spurious. To see this, consider Figure 5, which depicts the values of the gap statistic as computed on a graph defined like those in Figure 3, but resulting from a model estimated on data that have no underlying heterogeneity in IRFs. The gap statistic correctly suggests that the number of substantive clusters is, in fact, 1. The idea that there is no heterogeneity is further supported by the fact that, under such a data-generating process, the standard IRT model with a single cluster fits the data better, with $\text {BIC}_{\text {IDEAL}}= 168,430.8$ versus $\text {BIC}_{\text {MPS}}=173,686.3$ . Thus, there is little evidence that MPS will overfit data when there is no heterogeneity to be identified.
We now turn to our original motivating application: evaluating whether (or rather which) U.S. voters can be scaled on the same space as their legislators.
5 Empirical Results
We apply the MPS model to one of the main examples used in Jessee (Reference Jessee2016)—the 2008 Cooperative Congressional Election Study (CCES). This is an online sample of 32,800 survey respondents from the YouGov/Polimetrix panel, administered during October and November 2008. In total, the CCES included eight bridging items that directly corresponded to votes taken during the 110th House and Senate, which can be matched to 550 legislators.Footnote ^{14} The policy items included withdrawing troops from Iraq within 180 days, increasing the minimum wage, federal funding of stem cell research, warrantless eavesdropping of terrorist suspects, health insurance for low earners, foreclosure assistance, extension of free trade to Peru and Colombia, and the 2008 bank bailout bill.Footnote ^{15} In this example, Jessee found that joint scaling appeared to work relatively well for this dataset—that is, the ideal points from the grouped model look relatively similar regardless of whether one uses item parameters derived from respondents, the House, or the Senate.
We run 110,000 MCMC iterations, discarding the first 10,000 as burn-in, and keep only the MAP estimate of the parameters of interest. The maximum number of clusters is constrained to be 10. As in the Monte Carlo simulations, we generate starting ideal point values using principal components analysis within each cluster, and starting item parameter values using probit regression. However, rather than generating initial cluster assignments using k-means clustering, we instead start all legislators in one cluster and all voters in a second cluster. Legislators are constrained to remain in the same cluster across all iterations, but voters are permitted to change cluster memberships.^{16}
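This initialization can be sketched as follows, assuming a binary response matrix and a one-dimensional latent trait; the probit-regression step for item-parameter starting values is analogous. All names and the simulated data here are purely illustrative, not the replication code.

```python
import numpy as np

def pca_start_values(y, d=1):
    """Rough starting ideal points: the first d principal-component
    scores of the item-centered response matrix, standardized."""
    yc = y - y.mean(axis=0, keepdims=True)            # center each item
    u, s, _ = np.linalg.svd(yc, full_matrices=False)
    scores = u[:, :d] * s[:d]                         # component scores
    return (scores - scores.mean(axis=0)) / scores.std(axis=0)

# toy data: 200 respondents answering 10 items driven by one latent trait
rng = np.random.default_rng(0)
theta_true = rng.normal(size=200)
disc = rng.normal(size=10)                            # item discriminations
probs = 1 / (1 + np.exp(-np.outer(theta_true, disc)))  # logistic link, toy data only
y = (rng.uniform(size=(200, 10)) < probs).astype(int)

start = pca_start_values(y)
corr = np.corrcoef(theta_true, start[:, 0])[0, 1]     # sign of a PC is arbitrary
```

Standardizing the scores simply puts the starting values on the scale of the latent trait's prior; the sampler is free to move away from them.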
Table 2 shows a cross-tabulation of the final estimated clusters on the rows against the two separate starting clusters for the legislators and voters. All 550 legislators start in the same cluster and are constrained to remain so (although their ideal points within the cluster are permitted to change). In turn, the 32,800 surveyed voters divide themselves across six different clusters, with 15,732 respondents remaining in the same cluster as the legislators.
The count of 15,732 respondents estimated to share a cluster with the legislators is almost certainly an underestimate, due to the fact that different clusters in DP-prior models may nevertheless share similar parameter values. Table 3 explores this further, tabulating the correlations of the item discrimination parameters among the six populated estimated clusters. From this table, we see that estimated clusters 2 and 5 have item parameters that are highly correlated with those in the constrained legislator cluster. Combining respondents from clusters 1, 2, and 5, 24,102 of the 32,800 respondents in the CCES sample, or approximately 73% of the sample, lie in the same ideological space as legislators.
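The kind of post-hoc comparison reported in Table 3 is straightforward to reproduce: collect each cluster's estimated discrimination parameters into a row vector and compute pairwise correlations across clusters. A minimal sketch, in which the toy clusters and the 0.9 cutoff are illustrative rather than the paper's values:

```python
import numpy as np

def cluster_correlations(beta):
    """Pairwise correlations of item-discrimination vectors.

    beta: (K, J) array whose row k holds cluster k's J estimated
    discrimination parameters (e.g., posterior means)."""
    return np.corrcoef(beta)

# toy example: clusters 0 and 1 are noisy copies of the same item
# parameters, while cluster 2 uses an unrelated set
rng = np.random.default_rng(0)
base = rng.normal(size=50)
beta_hat = np.vstack([base,
                      base + 0.05 * rng.normal(size=50),
                      rng.normal(size=50)])
R = cluster_correlations(beta_hat)
near_duplicates = R[0, 1] > 0.9   # clusters 0 and 1 share an item space
```

High off-diagonal correlations flag pairs of Dirichlet process clusters whose item response functions are effectively the same, and whose members can therefore be scaled together.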
With this large number of observations falling in a single cluster, it is not surprising that different model selection criteria provide different indications as to whether a standard IRT model or MPS fits the data better. For instance, while a comparison of the BIC produced by our model (viz., 408,016.4) with that produced by a standard IRT model (viz., 407,033.7) would suggest the latter offers a better fit to these data, the evidence is reversed under the Akaike Information Criterion (AIC), with values of 355,419.4 and 370,214.8 for MPS and the regular IRT model, respectively. Nevertheless, an evaluation of the extent to which communities of subclusters emerge from these pairwise correlations suggests the importance of distinguishing between two sets of voters.
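The disagreement between the two criteria is a generic consequence of their different complexity penalties: 2 per parameter for AIC versus $\log n$ per parameter for BIC. A sketch with hypothetical numbers (not the values reported above):

```python
import math

def aic(loglik, n_params):
    return 2 * n_params - 2 * loglik

def bic(loglik, n_params, n_obs):
    return math.log(n_obs) * n_params - 2 * loglik

# hypothetical fits: a richer model buys a better log-likelihood
# at the cost of many extra parameters
ll_simple, k_simple = -10_000.0, 50
ll_rich, k_rich = -9_500.0, 400
n = 100_000

# AIC penalizes each parameter by 2; BIC by log n (about 11.5 here),
# so the two criteria can rank the same pair of models differently
rich_wins_aic = aic(ll_rich, k_rich) < aic(ll_simple, k_simple)
simple_wins_bic = bic(ll_rich, k_rich, n) > bic(ll_simple, k_simple, n)
```

With many observations, BIC's heavier penalty favors the smaller model even when the larger one improves the likelihood noticeably, which is exactly the pattern in the comparison above.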
The right panel of Figure 6 depicts this correlation-weighted graph, along with the substantive clusters identified by the same greedy algorithm used in the previous section (indicated using gray shaded areas). In this case, both the greedy community-detection procedure and the gap statistic (depicted on the left panel of Figure 6) identify two communities—one containing all legislators and a large number of voters, and another composed of the remaining voters who do not share the same policy space as legislators.
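The community structure of such a graph can be approximated without any graph library: treat correlations above a cutoff as edges and read off the connected components with a union-find. This is a simplified, standard-library stand-in for the greedy modularity search and gap statistic used in the text; the matrix and the 0.9 cutoff below are illustrative.

```python
def communities(corr, threshold=0.9):
    """Group clusters whose item parameters are highly correlated.

    corr: K x K list-of-lists of pairwise correlations. Links clusters
    whose correlation exceeds `threshold` and returns the connected
    components of the resulting graph, via union-find."""
    K = len(corr)
    parent = list(range(K))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path compression
            a = parent[a]
        return a

    for i in range(K):
        for j in range(i + 1, K):
            if corr[i][j] > threshold:
                parent[find(i)] = find(j)   # union the two components

    groups = {}
    for k in range(K):
        groups.setdefault(find(k), []).append(k)
    return sorted(groups.values())

# toy matrix: clusters 0, 1, 4 mutually similar; 2 and 3 form a second community
C = [[1.00, 0.95, 0.10, 0.20, 0.93],
     [0.95, 1.00, 0.15, 0.10, 0.96],
     [0.10, 0.15, 1.00, 0.97, 0.20],
     [0.20, 0.10, 0.97, 1.00, 0.10],
     [0.93, 0.96, 0.20, 0.10, 1.00]]
print(communities(C))  # [[0, 1, 4], [2, 3]]
```

Raising the threshold splits the graph into finer groups; choosing it in a principled way is precisely what the gap statistic is doing in our procedure.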
To further validate this sorting, we examine whether a model that forces all voters in subclusters 1, 2, and 5 to remain fixed in the cluster containing all legislators provides a better fit to the observed responses. Such a model fits unequivocally better than one that allows all voters to be freely allocated to clusters, with a BIC of 407,426.8 and an AIC of 365,820.8.^{17}
In addition, to explore what characterizes the 24,102 survey respondents who “think like a legislator” (i.e., who are sorted into estimated clusters 1, 2, and 5), we group these respondents together and predict membership in this pseudo-legislator group with a Bayesian binomial probit regression (with vague, uniform priors), using a range of standard covariates—including education, gender, age, income, race, party identification, political interest, and church attendance. We report these results in Figure 7.^{18}
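A membership regression of this kind can be sketched with the same data-augmentation logic used elsewhere in the paper: an Albert and Chib (1993)-style Gibbs sampler for probit regression with a flat prior on the coefficients. The covariates and data below are simulated placeholders, not the CCES variables.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)
nd = NormalDist()

def bayes_probit(X, y, iters=200):
    """Gibbs sampler for probit regression with a flat coefficient prior.
    Returns an (iters, d) array of posterior draws."""
    n, d = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    L = np.linalg.cholesky(XtX_inv)
    b = np.zeros(d)
    draws = []
    for _ in range(iters):
        # augmented latent utilities, truncated to match each outcome,
        # drawn by inverse CDF
        m = X @ b
        u = rng.uniform(1e-12, 1 - 1e-12, size=n)
        c = np.array([min(max(nd.cdf(-mi), 1e-12), 1 - 1e-12) for mi in m])
        q = np.where(y == 0, u * c, c + u * (1 - c))
        z = m + np.array([nd.inv_cdf(qi) for qi in q])
        # coefficients given the latent utilities: N((X'X)^{-1}X'z, (X'X)^{-1})
        b = XtX_inv @ (X.T @ z) + L @ rng.normal(size=d)
        draws.append(b)
    return np.array(draws)

# toy covariates: intercept plus one predictor with a true slope of 1
n = 300
x = rng.normal(size=n)
X = np.column_stack((np.ones(n), x))
y = (rng.uniform(size=n) < np.array([nd.cdf(xi) for xi in x])).astype(int)

draws = bayes_probit(X, y)
slope = draws[100:, 1].mean()   # posterior mean after discarding burn-in
```

In the application, `y` would indicate membership in the pseudo-legislator group and `X` would collect the demographic and political covariates listed above.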
We find that older voters and people who express more interest in politics tend to map their latent traits onto observed responses similarly to the way legislators do, while Black and Hispanic voters are less likely than their white counterparts to share an ideological space with legislators. And while the coefficients associated with education, income, and gender all fail to attain our chosen level of significance, their signs do indicate that more educated and richer voters also tend to think more like legislators, while women appear less likely to share the policy space of their (mostly male) legislative counterparts.
Overall, our findings are largely consistent with Jessee, who found that latent trait estimates from this dataset were consistent regardless of whether one used the item parameters estimated from legislators or voters. The key difference is that our approach identifies not only the 73% of survey respondents who follow this pattern, but also the 27% of survey respondents who do not share an ideological space with legislators. Furthermore, our fit statistics suggest that the improvement in model fit for this latter subset of respondents is substantial, even in a dataset where the recovered ideal points would be somewhat similar regardless of whether one used only the voter, House, or Senate item parameters to generate them.
6 Conclusion
When implementing commonly used measurement models, most researchers implicitly subscribe to the idea that all individuals share a common understanding of how their latent traits map onto the set of observed responses: legislators are believed to have a shared sense of where the cutpoint between voting alternatives lies, survey respondents are assumed to ascribe a common meaning to the scales presented in the questions they confront, and voters are understood to perceive the same candidates and parties as taking on similar ideological positions.
When this assumption is violated by the real data-generating process, however, this widespread strategy can be a costly oversimplification that results in invalid measures of the characteristics of interest. We propose a modeling strategy that relaxes the stringent measurement invariance assumption by assuming only that units can be separated into groups within which comparable item functioning holds, allowing researchers to identify sets of incomparable units who can be mapped onto multiple latent spaces. The distinctive feature of our proposed approach is that it does not require a priori identification of group memberships—or even a prior specification of the number of heterogeneous groups present in the sample.
On this note, it is important to reiterate that the clusters we obtain from our Dirichlet process prior models are not distinct groups, in the sense that they may share parameters that are similar enough to be considered part of the same subpopulation. Our models, therefore, are designed to account for the existence of these heterogeneous groups without directly identifying a posteriori memberships into them. In so doing, our models assume that the target of inference is the latent traits, rather than the group memberships. And while it is sometimes possible to tease out subpopulations from estimated Dirichlet process clusters, we generally discourage users from trying to ascribe substantive meaning to the clusters directly identified by our nonparametric model—except to say that observations that are estimated to be in the same Dirichlet process cluster have latent traits that can be safely compared to one another. If a more thorough interpretation of which subclusters are, in fact, substantively equivalent is of interest, we encourage researchers to post-process the Dirichlet mixture clusters in order to identify the more substantive groupings defined by item parameters that are similar enough, as we did through the use of the gap statistic on the graph of item parameter correlations in our illustration of the MPS model.^{19} Having done so, researchers can then make data-driven decisions about the presence and pervasiveness of DIF in their data. Alternatively, design-based solutions (such as anchoring vignettes) can help ascribe meaning to different subgroups, while other model-based approaches—such as the product partition DP-prior model proposed by Womack et al. (Reference Womack, Gill and Casella2014), or the repulsive DP-mixture model proposed by Xie and Xu (Reference Xie and Xu2020)—may offer potential analytical avenues, if adapted to the IRT framework. We leave these possibilities for future research.
Despite these caveats, we believe our proposed model can offer researchers a simple alternative to the standard modeling approach and its strong invariance assumptions. If heterogeneity in item functioning is a possibility—as we suspect is often the case in the social science contexts in which probabilistic measurement tools are usually deployed—our approach offers applied researchers the opportunity to assess that possibility and identify differences across units if said differences are supported by the data, rather than simply assuming those differences across subpopulations away.
A broader substantive question that this paper does not address directly is whether our empirical results hold for joint scaling of legislators and voters using different datasets and/or in other contexts. While we found that most voters share an ideological space with legislators in the CCES dataset, it is still an open question whether most voters and legislators can be jointly scaled, particularly when there are a greater number of bridging items that provide more information about how similar their IRFs are. Having presented the methodology that allows researchers to address this question, we leave it for future research.
A Computational Details
Gibbs Sampler
Truncate the stick-breaking process at some constant K. Define

1. Update the stick-breaking weight $\pi _{k^{\prime }}$ for $k^{\prime } = 1, \dots , K - 1$ by sampling from a Beta distribution s.t.
$$ \begin{align*} \pi_{k^{\prime}} \sim \mathrm{Beta} \left(1 + N_{k^{\prime}}, \alpha + \sum_{l = k^{\prime} + 1}^{K} N_l \right), \end{align*} $$where $N_k$ is the number of observations assigned to cluster k under the current state.

2. Update $k[i] \in \{1, \dots , K \}$ for $i = 1, \dots , N$ by multinomial sampling with
$$ \begin{align*} \mathrm{Pr}(k[i] = k^{\prime} \mid \boldsymbol{y}_{i}, \, \boldsymbol{\theta},\boldsymbol{\beta}, \boldsymbol{\gamma} ) \propto p_{k^{\prime}} \, \mathrm{Pr}\left( \boldsymbol{y}_{i} \mid \boldsymbol{\theta}_i,\boldsymbol{\beta}_{k^{\prime}},\boldsymbol{\gamma}_{k^{\prime}} \right), \end{align*} $$where
$$ \begin{align*} p_{k^{\prime}} &\equiv \pi_{k^{\prime}} \prod_{l = 1}^{k^{\prime} - 1} (1 - \pi_{l}), \\ \mathrm{Pr}\left( \boldsymbol{y}_{i} \mid \boldsymbol{\theta}_i,\boldsymbol{\beta}_{k^{\prime}},\boldsymbol{\gamma}_{k^{\prime}} \right) & = \prod_{j = 1}^{J} \left( \Phi \left( \boldsymbol{\beta}_{k^{\prime},j}^{\top} \boldsymbol{\theta}_{i} - \gamma_{k^{\prime},j} \right) \right)^{y_{ij}} \left( 1 - \Phi \left( \boldsymbol{\beta}_{k^{\prime},j}^{\top} \boldsymbol{\theta}_{i} - \gamma_{k^{\prime},j} \right) \right)^{1 - y_{ij}}. \end{align*} $$In practice, we augment the latent variable $y_{i,j}^{\ast }$, so that we have
$$\begin{align*}\mathrm{Pr}(k[i] = k^{\prime} \mid \boldsymbol{y}_{i}^{\ast}, \, \boldsymbol{\theta}_i,\boldsymbol{\beta}_{k^{\prime}}, \boldsymbol{\gamma}_{k^{\prime}} ) \propto p_{k^{\prime}} \, \prod_{j = 1}^{J} \mathcal{N}\left( y_{i,j}^{\ast} \mid \boldsymbol{\beta}_{k^{\prime}, j}^{\top} \boldsymbol{\theta}_i - \gamma_{k^{\prime}, j} , \, 1 \right). \end{align*}$$
3. Conditional on $\boldsymbol {\theta }$ , $\boldsymbol {\beta }$ , $\boldsymbol {\gamma }$ , and $\boldsymbol {k}$ , sample
$$\begin{align*}y_{i,j}^{\ast} \sim \begin{cases} \mathcal{N}(\boldsymbol{\beta}_{k^{\prime}, j}^{\top}\boldsymbol{\theta}_i - \gamma_{k^{\prime}, j}, 1)\,\mathcal{I}(y_{i,j}^{\ast} < 0), &\text{if } y_{i,j}=0,\\ \mathcal{N}(\boldsymbol{\beta}_{k^{\prime}, j}^{\top}\boldsymbol{\theta}_i - \gamma_{k^{\prime}, j}, 1)\,\mathcal{I}(y_{i,j}^{\ast} \geq 0), &\text{if } y_{i,j}=1, \end{cases} \end{align*}$$which can be parallelized over respondents and items, for dramatic speedups.

4. Conditional on $\boldsymbol {\theta }$ , $\boldsymbol {y}^{\ast }$ , and $\boldsymbol {k}$ , sample
$$\begin{align*}(\boldsymbol{\beta}_{k^{\prime},j}, \gamma_{k^{\prime},j}) \sim \mathcal{N}_{D+1}\left(\boldsymbol{\mu}_{k^{\prime},j},\boldsymbol{M}_{k^{\prime},j}^{-1}\right), \end{align*}$$where $\boldsymbol{M}_{k^{\prime}, j}=(\boldsymbol{X}_{k^{\prime}}^{\top}\boldsymbol{X}_{k^{\prime}}+\boldsymbol{\Omega})$; $\boldsymbol{\mu}_{k^{\prime},j}=\boldsymbol{M}_{k^{\prime}, j}^{-1}\boldsymbol{X}_{k^{\prime}}^{\top}\boldsymbol{y}^{\ast}_{k^{\prime},j}$; $\boldsymbol{X}_{k^{\prime}}$ is a matrix with typical row given by $\boldsymbol{x}_i=[\boldsymbol{\theta}_i^{\top},-1]$ for i s.t. $k[i]=k^{\prime}$, and $\boldsymbol{y}^{\ast}_{k^{\prime},j}$ is a vector with typical element $y^{\ast}_{i,j}$, again restricted to i s.t. $k[i]=k^{\prime}$.
Once again, this can be parallelized over items and clusters, reducing user computation times.

5. Conditional on $\boldsymbol {\beta }$ , $\boldsymbol {\gamma }$ , and $\boldsymbol {k}$ , and for each i s.t. $k[i]=k^{\prime }$ , sample
$$\begin{align*}\boldsymbol{\theta}_i \sim \mathcal{N}_{D}(\boldsymbol{\nu}_{k^{\prime}}, \boldsymbol{N}_{k^{\prime}}^{-1}), \end{align*}$$where $\boldsymbol{N}_{k^{\prime}}=\left(\boldsymbol{B}_{k^{\prime}}^{\top}\boldsymbol{B}_{k^{\prime}} + \boldsymbol{\Lambda}\right)$; $\boldsymbol{\nu}_{k^{\prime}}=\boldsymbol{N}_{k^{\prime}}^{-1}\boldsymbol{B}_{k^{\prime}}^{\top}\boldsymbol{w}_i$; $\boldsymbol{B}_{k^{\prime}}=[\boldsymbol{\beta}_{k^{\prime},1},\ldots,\boldsymbol{\beta}_{k^{\prime},J}]^{\top}$ is a $J\times D$ matrix, and $\boldsymbol{w}_i=\boldsymbol{y}^{\ast}_{i}+\boldsymbol{\gamma}_{k^{\prime}}$ is a $J\times 1$ vector. We parallelize these computations over respondents.

6. Finally, conditional on cluster assignments and stickbreaking weights, sample
$$\begin{align*}\alpha \sim \text{Gamma}\left(a_0 + K - 1, \; b_0 - \sum_{k^{\prime}=1}^{K-1}\log(1-\pi_{k^{\prime}})\right). \end{align*}$$
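The six steps above can be collected into a compact sampler. The following is a minimal NumPy sketch for a one-dimensional latent trait ($D = 1$): the truncation level, hyperparameters ($\boldsymbol{\Omega} = \omega I$, $\boldsymbol{\Lambda} = \lambda$, $a_0$, $b_0$), and toy data are illustrative; truncated-normal draws use the inverse CDF from the standard library's `statistics.NormalDist`; and identification constraints and the parallelization noted in Steps 3–5 are omitted.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
nd = NormalDist()

def gibbs_mps(y, K=3, iters=25, a0=1.0, b0=1.0, omega=1.0, lam=1.0):
    """One pass over Steps 1-6 per iteration; a sketch, not replication code."""
    N, J = y.shape
    theta = rng.normal(size=N)            # ideal points
    beta = rng.normal(size=(K, J))        # discriminations by cluster
    gamma = rng.normal(size=(K, J))       # difficulties by cluster
    k_of = rng.integers(K, size=N)        # cluster assignments k[i]
    v = np.full(K, 0.5)
    v[-1] = 1.0                           # last stick takes the remainder
    alpha = 1.0
    ystar = np.where(y == 1, 0.5, -0.5)   # augmented latent responses

    for _ in range(iters):
        # Step 1: stick fractions pi_k ~ Beta(1 + N_k, alpha + sum_{l>k} N_l)
        counts = np.bincount(k_of, minlength=K)
        for k in range(K - 1):
            v[k] = rng.beta(1 + counts[k], alpha + counts[k + 1:].sum())
        logw = np.log(v) + np.concatenate(([0.0], np.cumsum(np.log1p(-v[:-1]))))

        # Step 2: memberships via multinomial sampling on the augmented
        # likelihood prod_j N(ystar_ij | beta_kj * theta_i - gamma_kj, 1)
        mu = beta[:, None, :] * theta[None, :, None] - gamma[:, None, :]
        logp = logw[:, None] - 0.5 * ((ystar[None, :, :] - mu) ** 2).sum(axis=2)
        logp -= logp.max(axis=0)
        p = np.exp(logp)
        p /= p.sum(axis=0)
        k_of = np.array([rng.choice(K, p=p[:, i]) for i in range(N)])

        # Step 3: latent responses, N(m, 1) truncated to the observed sign
        m = beta[k_of] * theta[:, None] - gamma[k_of]
        u = rng.uniform(1e-12, 1 - 1e-12, size=(N, J))
        for i in range(N):
            for j in range(J):
                c = min(max(nd.cdf(-m[i, j]), 1e-12), 1 - 1e-12)  # Pr(ystar < 0)
                q = u[i, j] * c if y[i, j] == 0 else c + u[i, j] * (1 - c)
                ystar[i, j] = m[i, j] + nd.inv_cdf(q)

        # Step 4: item parameters (beta_kj, gamma_kj) per cluster; the -1
        # design column implements ystar = beta * theta - gamma
        for k in range(K):
            members = np.flatnonzero(k_of == k)
            X = np.column_stack((theta[members], -np.ones(members.size)))
            Minv = np.linalg.inv(X.T @ X + omega * np.eye(2))
            L = np.linalg.cholesky(Minv)
            for j in range(J):
                draw = Minv @ (X.T @ ystar[members, j]) + L @ rng.normal(size=2)
                beta[k, j], gamma[k, j] = draw

        # Step 5: ideal points given the item parameters of cluster k[i]
        for i in range(N):
            b = beta[k_of[i]]
            prec = b @ b + lam
            nu = b @ (ystar[i] + gamma[k_of[i]]) / prec
            theta[i] = rng.normal(nu, 1.0 / np.sqrt(prec))

        # Step 6: concentration parameter (numpy's gamma takes shape, scale)
        rate = b0 - np.log1p(-v[:-1]).sum()
        alpha = rng.gamma(a0 + K - 1, 1.0 / rate)

    return theta, beta, gamma, k_of

# toy data: 40 respondents, 6 items, no real heterogeneity
y = (rng.uniform(size=(40, 6)) < 0.5).astype(int)
theta, beta, gamma, k_of = gibbs_mps(y, K=3, iters=10)
```

Empty clusters are handled automatically in Step 4, where the prior alone determines the draw; a production implementation would vectorize Steps 3-5 as described in the text.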
Acknowledgment
We would like to thank Kevin Quinn, Iain Osgood, participants in the 2019 Asian PolMeth conference and at the UCLA Political Science Methods Workshop, and two anonymous reviewers for their useful feedback.
Data Availability Statement
Replication materials are available in Shiraito, Lo, and Olivella (Reference Shiraito, Lo and Olivella2022).
Conflict of Interest
The authors have no conflicts of interest to declare. All co-authors have seen and agree with the contents of the manuscript, and there is no financial interest to report.