Power Analysis for the Wald, LR, Score, and Gradient Tests in a Marginal Maximum Likelihood Framework: Applications in IRT

Published online by Cambridge University Press:  01 January 2025

Felix Zimmer*
Affiliation:
University of Zurich
Clemens Draxler
Affiliation:
The Health and Life Sciences University
Rudolf Debelak
Affiliation:
University of Zurich
*Correspondence should be made to Felix Zimmer, University of Zurich, Zurich, Switzerland. Email: felix.zimmer@uzh.ch

Abstract

The Wald, likelihood ratio, score, and the recently proposed gradient statistics can be used to assess a broad range of hypotheses in item response theory models, for instance, to check the overall model fit or to detect differential item functioning. We introduce new methods for power analysis and sample size planning that can be applied when marginal maximum likelihood estimation is used. This allows the application to a variety of IRT models, which are commonly used in practice, e.g., in large-scale educational assessments. An analytical method utilizes the asymptotic distributions of the statistics under alternative hypotheses. We also provide a sampling-based approach for applications where the analytical approach is computationally infeasible. This can be the case with 20 or more items, since the computational load increases exponentially with the number of items. We performed extensive simulation studies in three practically relevant settings, i.e., testing a Rasch model against a 2PL model, testing for differential item functioning, and testing a partial credit model against a generalized partial credit model. The observed distributions of the test statistics and the power of the tests agreed well with the predictions by the proposed methods in sufficiently large samples. We provide an openly accessible R package that implements the methods for user-supplied hypotheses.
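The analytical idea can be illustrated with a minimal sketch. It assumes the standard asymptotic result that the four statistics follow a central $\chi^2$ distribution under the null hypothesis and a noncentral $\chi^2$ distribution under the alternative, with a noncentrality parameter that grows linearly with the sample size. The function names, the per-person noncentrality value, and the degrees of freedom below are hypothetical placeholders, not the interface of the accompanying R package; in an application, the noncentrality would be derived from the hypothesized item parameters.

```python
import math


def gamma_p(a, x):
    """Regularized lower incomplete gamma function P(a, x), by power series."""
    if x <= 0.0:
        return 0.0
    term = total = 1.0
    n = 0
    while term > 1e-17 * total:
        n += 1
        term *= x / (a + n)
        total += term
    return min(1.0, total * math.exp(a * math.log(x) - x - math.lgamma(a + 1.0)))


def chi2_cdf(x, df):
    """CDF of the central chi-square distribution."""
    return gamma_p(df / 2.0, x / 2.0)


def chi2_quantile(p, df):
    """Quantile of the central chi-square distribution, by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def ncx2_cdf(x, df, nc):
    """CDF of the noncentral chi-square, as a Poisson mixture of central CDFs."""
    if nc <= 0.0:
        return chi2_cdf(x, df)
    half = nc / 2.0
    total = 0.0
    j = 0
    while True:
        # Poisson(half) weight, computed on the log scale to avoid underflow
        w = math.exp(-half + j * math.log(half) - math.lgamma(j + 1.0))
        total += w * chi2_cdf(x, df + 2 * j)
        if j > half and w < 1e-16:
            return total
        j += 1


def power(n, nc_per_person, df, alpha=0.05):
    """Asymptotic power, assuming noncentrality = n * nc_per_person."""
    crit = chi2_quantile(1.0 - alpha, df)
    return 1.0 - ncx2_cdf(crit, df, n * nc_per_person)


def required_sample_size(nc_per_person, df, target=0.8, alpha=0.05):
    """Smallest n whose asymptotic power reaches the target (target > alpha)."""
    if nc_per_person <= 0.0:
        raise ValueError("nc_per_person must be positive")
    hi = 1
    while power(hi, nc_per_person, df, alpha) < target:
        hi *= 2
    lo = hi // 2  # power at lo is below the target (or lo == 0)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if power(mid, nc_per_person, df, alpha) >= target:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    # Hypothetical inputs: 4 degrees of freedom, noncentrality 0.02 per person.
    n = required_sample_size(nc_per_person=0.02, df=4, target=0.8)
    print(n, round(power(n, 0.02, 4), 3))
```

Because the power is monotone in the sample size under this assumption, a doubling step followed by a binary search is enough to find the smallest adequate sample size.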

Information

Type
Theory and Methods
Creative Commons
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Copyright
Copyright © 2022 The Author(s)

Figure 1. Distributions of the Wald, LR, score, and gradient statistics under the null and an alternative hypothesis in a test of a Rasch versus a 2PL model. The parameters used stem from a model fit of the five items in the “LSAT7” dataset, which is publicly available in the mirt package (Chalmers, 2012). The curve labeled “Null” represents a central $\chi^2$ distribution that applies to all four statistics under the null hypothesis. For the other four curves, the colored areas under the curve represent the power of the corresponding test under the alternative.

Figure 2. QQ-plots for the Rasch versus 2PL hypothesis with 50 items and the sampling-based power analysis method.

Figure 3. QQ-plots for the DIF hypothesis with 5 items and the sampling-based power analysis method.

Figure 4. Observed and expected hit rates by effect size and sample size using the sampling-based power analysis approach.

Figure 5. OHR: Observed hit rate. EHR: Expected hit rate. The expected hit rate was calculated using the sampling-based power analysis approach.

Figure 6. Observed and expected hit rates by hypothesis type and the number of items using the sampling-based power analysis approach.

Table 1. Parameters for the M1 and M2 PISA 2015 items

Figure 7. Power curves for testing a Rasch against a 2PL model for two PISA item clusters.

Figure 8. QQ-plots for the Rasch versus 2PL hypothesis with 5 items and the analytical power analysis method.

Figure 9. QQ-plots for the Rasch versus 2PL hypothesis with 5 items and the sampling-based power analysis method.

Figure 10. QQ-plots for the Rasch versus 2PL hypothesis with 10 items and the analytical power analysis method.

Figure 11. QQ-plots for the Rasch versus 2PL hypothesis with 10 items and the sampling-based power analysis method.

Figure 12. QQ-plots for the Rasch versus 2PL hypothesis with 20 items and the sampling-based power analysis method.

Figure 13. QQ-plots for the DIF hypothesis with 5 items and the analytical power analysis method.

Figure 14. QQ-plots for the DIF hypothesis with 10 items and the analytical power analysis method.

Figure 15. QQ-plots for the DIF hypothesis with 10 items and the sampling-based power analysis method.

Figure 16. QQ-plots for the DIF hypothesis with 20 items and the sampling-based power analysis method.

Figure 17. QQ-plots for the DIF hypothesis with 50 items and the sampling-based power analysis method.

Figure 18. QQ-plots for the PCM versus GPCM hypothesis with 5 items and the analytical power analysis method. To maintain readability, three outliers with values greater than 1000 were removed for the Wald statistic in the condition using N = 100 and the large effect size.

Figure 19. QQ-plots for the PCM versus GPCM hypothesis with 5 items and the sampling-based power analysis method. To maintain readability, three outliers with values greater than 1000 were removed for the Wald statistic in the condition using N = 100 and the large effect size.

Figure 20. QQ-plots for the PCM versus GPCM hypothesis with 10 items and the analytical power analysis method.

Figure 21. QQ-plots for the PCM versus GPCM hypothesis with 10 items and the sampling-based power analysis method.

Figure 22. QQ-plots for the PCM versus GPCM hypothesis with 20 items and the sampling-based power analysis method.

Figure 23. QQ-plots for the PCM versus GPCM hypothesis with 50 items and the sampling-based power analysis method.

Figure 24. Observed and expected hit rates by effect size and sample size using the analytical power analysis approach.

Figure 25. OHR-EHR: Observed minus expected hit rate. The expected hit rate was calculated using the analytical power analysis approach.

Figure 26. Observed and expected hit rates by hypothesis type and the number of items using the analytical power analysis approach.

Table 2. Observed and expected hit rates for the Rasch versus 2PL hypothesis

Table 3. Observed and expected hit rates for the DIF hypothesis

Table 4. Observed and expected hit rates for the PCM versus GPCM hypothesis

Supplementary material: File

Zimmer et al. supplementary material

File 118.3 MB