Examining Differential Item Functioning from a Multidimensional IRT Perspective

Terry A. Ackerman; Ye Ma

doi:10.1007/s11336-024-09965-6


Published online by Cambridge University Press: 01 January 2025


Terry A. Ackerman*, The University of Iowa
Ye Ma, Amazon Web Services
*Correspondence should be made to Terry A. Ackerman, The University of Iowa, 8 North Shore Drive, Edwardsville, IL 62025, USA. tackerman@uiowa.edu


Abstract

Differential item functioning (DIF) is a standard analysis for every testing company. Research has demonstrated that DIF can result when test items measure different ability composites, and the groups being examined for DIF exhibit distinct underlying ability distributions on those composite abilities. In this article, we examine DIF from a two-dimensional multidimensional item response theory (MIRT) perspective. We begin by delving into the compensatory MIRT model, illustrating how items and the composites they measure can be graphically represented. Additionally, we discuss how estimated item parameters can vary based on the underlying latent ability distributions of the examinees. Analytical research highlighting the consequences of ignoring dimensionality and applying unidimensional IRT models, where the two-dimensional latent space is mapped onto a unidimensional scale, is reviewed. Next, we investigate three different approaches to understanding DIF from a MIRT standpoint: 1. Analytically Uniform and Nonuniform DIF: When two groups of interest have different two-dimensional ability distributions, a unidimensional model is estimated. 2. Accounting for the complete latent ability space: We emphasize the importance of considering the entire latent ability space when using DIF conditional approaches, which leads to the mitigation of DIF effects. 3. Scenario-Based DIF: Even when underlying two-dimensional distributions are identical for two groups, differing problem-solving approaches can still lead to DIF. Modern software programs facilitate routine DIF procedures for comparing response data from two identified groups of interest. The real challenge is to identify why DIF could occur with flagged items. Thus, as a closing challenge, we present four items (Appendix A) from a standardized test and invite readers to identify which group was favored by a DIF analysis.
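As a rough illustration of the compensatory model and the graphical item representation mentioned in the abstract, the sketch below assumes the common two-dimensional logistic form, P = 1 / (1 + exp(-(a1·θ1 + a2·θ2 + d))), together with the usual vector summary of an item (overall discrimination, direction angle, and signed distance from the origin). The function names and all parameter values are hypothetical and are not taken from the article.

```python
import math


def p_correct(theta1, theta2, a1, a2, d):
    """Probability of a correct response under a compensatory
    two-dimensional logistic model (assumed form, see lead-in)."""
    z = a1 * theta1 + a2 * theta2 + d
    return 1.0 / (1.0 + math.exp(-z))


def item_vector(a1, a2, d):
    """Summarize an item as a vector in the two-dimensional ability plane.

    Returns (mdisc, angle_deg, distance):
      mdisc     - length of the discrimination vector, sqrt(a1^2 + a2^2)
      angle_deg - angle from the theta1 axis, i.e. the composite measured
      distance  - signed distance from the origin, -d / mdisc
    """
    mdisc = math.hypot(a1, a2)
    angle_deg = math.degrees(math.acos(a1 / mdisc))
    distance = -d / mdisc
    return mdisc, angle_deg, distance


# Hypothetical item that measures both dimensions equally.
a1, a2, d = 1.0, 1.0, 0.0
mdisc, angle_deg, distance = item_vector(a1, a2, d)

# Compensation: a low theta2 can be offset by a high theta1.
p_balanced = p_correct(0.0, 0.0, a1, a2, d)
p_offset = p_correct(1.0, -1.0, a1, a2, d)

# Two hypothetical examinees matched on theta1 but differing on theta2.
# Conditioning on theta1 alone leaves a gap in success probability.
p_high_theta2 = p_correct(0.0, 0.5, a1, a2, d)
p_low_theta2 = p_correct(0.0, -0.5, a1, a2, d)

print(f"MDISC={mdisc:.3f}, angle={angle_deg:.1f} deg, distance={distance:.3f}")
print(f"balanced={p_balanced:.3f}, offset={p_offset:.3f}")
print(f"matched on theta1: {p_high_theta2:.3f} vs {p_low_theta2:.3f}")
```

The last comparison hints at the mechanism the abstract describes: if an item is sensitive to a second dimension and two groups differ on that dimension, examinees matched on the first dimension alone can still have different success probabilities.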

Keywords

multidimensional IRT; differential item functioning; compensatory and noncompensatory MIRT models

Information

Type: Theory & Methods
Information: Psychometrika, Volume 89, Issue 1, March 2024, pp. 4–41

DOI: https://doi.org/10.1007/s11336-024-09965-6
Copyright: Copyright © 2024 The Author(s), under exclusive licence to The Psychometric Society


