
Joint Item Response Models for Manual and Automatic Scores on Open-Ended Test Items

Published online by Cambridge University Press: 16 June 2025

Daniel Bengs*
Affiliation:
Leibniz Institute for Research and Information in Education, Frankfurt, Germany; Leuphana University Lüneburg, Lüneburg, Germany
Ulf Brefeld
Affiliation:
Leuphana University Lüneburg, Lüneburg, Germany
Ulf Kroehne
Affiliation:
Leibniz Institute for Research and Information in Education, Frankfurt, Germany
Fabian Zehner
Affiliation:
Leibniz Institute for Research and Information in Education, Frankfurt, Germany; Centre for International Student Assessment
*
Corresponding author: Daniel Bengs; Email: d.bengs@dipf.de

Abstract

Test items using open-ended response formats can increase an instrument’s construct validity. Traditionally, however, their application in educational testing requires human coders to score the responses. Manual scoring not only increases operational costs but also prohibits the use of evidence from open-ended items to inform routing decisions in adaptive designs. Using machine learning and natural language processing, automatic scoring provides classifiers that can instantly assign scores to text responses. Although optimized for agreement with manual scores, automatic scoring is not perfectly accurate and introduces an additional source of error into the response process, leading to a misspecification of the measurement model used with the manual score. We propose two joint models for manual and automatic scores of automatically scored open-ended items. Our models extend a given model from Item Response Theory for the manual scores by a component for the automatic scores, accounting for classification errors. The models were evaluated using data from the Programme for International Student Assessment (PISA) 2012 and simulated data, demonstrating their capacity to mitigate the impact of classification errors on ability estimation compared to a baseline that disregards classification errors.
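
To illustrate the mechanism described in the abstract, the following minimal Python sketch passes a 2PL item characteristic curve for the manual score through a classifier error model with constant false-positive and false-negative rates. The function names and all parameter values are illustrative assumptions, not estimates or code from the article.

```python
import numpy as np

def p_manual_2pl(theta, a, b):
    """2PL probability that a response receives a correct manual score."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def p_automatic(theta, a, b, fp_rate, fn_rate):
    """Probability of a correct automatic score when the classifier
    misclassifies the manual score with constant false-positive rate
    fp_rate and false-negative rate fn_rate (constant-error-rate mixture)."""
    p_m = p_manual_2pl(theta, a, b)
    return (1.0 - fn_rate) * p_m + fp_rate * (1.0 - p_m)

# Illustrative values only; not taken from the article.
theta = np.linspace(-3.0, 3.0, 61)
p_auto = p_automatic(theta, a=1.2, b=0.0, fp_rate=0.05, fn_rate=0.10)
```

Under such a mixture, the automatic-score curve acquires a lower asymptote at the false-positive rate and an upper asymptote at one minus the false-negative rate, which is consistent with the 4PL-type curves referenced for automatic scoring in the figure captions below.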

Information

Type
Theory and Methods
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of the Psychometric Society

Figure 1 Average performance measures for the CER model (a) and VER model (b) in the balanced error condition. Error bars: 95% confidence intervals.


Figure 2 Average performance measures for the CER model (a) and VER model (b) in the unbalanced error condition with increased false-positive rate. Error bars: 95% confidence intervals.


Figure 3 Fitted classifier error models of four exemplary items. (a) Conditional probability of a false-positive classification; (b) conditional probability of a false-negative classification. Blue solid line: G4PL; red dashed line: 4PL. The error models of the two models coincide where, in accordance with the results of the independence tests, constant error rates were used with the G4PL. Jittered data points are overlaid (a: ordinate 1 = false positives, ordinate 0 = true negatives; b: ordinate 1 = false negatives, ordinate 0 = true positives); the amount of jitter is ±.3 in both directions.


Figure 4 Item characteristic curves of four exemplary items, giving the probability of observing a response scored as correct by manual scoring (2PL model: solid line) and by automatic scoring (4PL model: dotted line; G4PL model, where fitted: dashed line).


Figure 5 Item information curves of four exemplary items under manual scoring (2PL model: solid line) and automatic scoring (4PL model: dotted line; G4PL model, where fitted: dashed line).


Figure 6 Standard error of measurement (SEM) for manual scoring (2PL model) and automatic scoring (4PL model: dotted line; G4PL model: dashed line), for both classifiers and test forms comprising eight PISA items.
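
For readers who want to reproduce curves of the kind shown in Figures 5 and 6, the following sketch computes Fisher item information numerically and the SEM of a test form as the reciprocal square root of the total information. The `p_4pl` function, the eight-item form, and all parameter values are hypothetical assumptions for illustration only, not the article's fitted estimates.

```python
import numpy as np

def p_4pl(theta, a, b, c, d):
    """4PL ICC: lower asymptote c (false positives), upper asymptote d (1 - false negatives)."""
    return c + (d - c) / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, icc, eps=1e-3):
    """Fisher information of a dichotomous item, I(theta) = P'(theta)^2 / (P(theta)(1 - P(theta))),
    with the derivative of the ICC taken numerically."""
    p = icc(theta)
    dp = (icc(theta + eps) - icc(theta - eps)) / (2.0 * eps)
    return dp ** 2 / (p * (1.0 - p))

def sem(theta, iccs):
    """Standard error of measurement of a test form: 1 / sqrt(sum of item informations)."""
    total = sum(item_information(theta, icc) for icc in iccs)
    return 1.0 / np.sqrt(total)

# Hypothetical eight-item form under automatic scoring (parameters invented for illustration).
rng = np.random.default_rng(0)
items = [dict(a=a, b=b, c=0.05, d=0.90)
         for a, b in zip(rng.uniform(0.8, 1.8, 8), np.linspace(-1.5, 1.5, 8))]
iccs = [lambda t, p=p: p_4pl(t, **p) for p in items]

theta = np.linspace(-3.0, 3.0, 61)
sem_curve = sem(theta, iccs)
```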


Table 1 Correlations of ability estimates (all test-takers)


Table 2 Correlations of ability estimates (test-takers with one or more classification errors)

Supplementary material

Bengs et al. supplementary material (File, 1.2 MB)