
Evaluating GPT-generated feedback for beginning-level Spanish writing: An exploratory study

Published online by Cambridge University Press:  02 March 2026

Alyssia Miller De Rutté* (Colorado State University, USA; alyssia1@colostate.edu)
Maite Correa (Colorado State University, USA; maite.correa@colostate.edu)
*Corresponding author: Alyssia Miller De Rutté; Email: alyssia1@colostate.edu

Abstract

Feedback is integral to second language (L2) writing instruction. However, large class sizes and limited teacher time often challenge the delivery of personalized feedback, prompting interest in AI-powered solutions such as ChatGPT (Escalante et al., 2023; Huete-García & Tarp, 2024; Steiss et al., 2024; Yoon et al., 2023; Zhang, 2024). This study evaluates a task-customized GPT model, “Belinda,” trained to assess A1-level Spanish learners’ writing and provide feedback. Two research questions guided the investigation: (1) Can Belinda accurately score beginner Spanish writing using a provided rubric? (2) Can Belinda deliver constructive qualitative feedback? Human and GPT-generated scores were compared for inter- and intrarater reliability, and qualitative analyses categorized the feedback for usability in the classroom. Results revealed moderate alignment between Belinda’s scores and those of the human raters, though the GPT’s reliability fell short of calibration benchmarks. Feedback quality varied, with Belinda often providing vague, incomplete, or inaccurate suggestions. Despite iterative training, the GPT struggled to balance error correction with encouragement, a critical need for novice learners. Additionally, inconsistencies across identical GPT versions raised concerns about reliability. While Belinda showed potential for automating feedback, its limitations in accuracy, contextual understanding, and positivity suggest it is not yet a viable substitute for human evaluation. These findings highlight the challenges of integrating AI into L2 instruction and underscore the need for extensive datasets, robust training, and human–AI collaboration to achieve pedagogically sound outcomes. Future research should explore hybrid feedback models and scalable solutions to enhance AI’s role in language education without compromising learner progress or confidence.
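
As a minimal sketch of the agreement analysis described above, the comparison of human and GPT rubric scores can be illustrated with quadratic-weighted Cohen’s kappa (via scikit-learn). This is an illustration only: the scores below are invented, and the kappa variant is an assumption; the study’s actual statistics appear in Tables 3–5.

# Hypothetical rubric scores (0-4 scale) for the same ten essays.
from sklearn.metrics import cohen_kappa_score

human_scores   = [3, 2, 4, 3, 1, 2, 4, 3, 2, 3]
belinda_run1   = [3, 3, 4, 2, 1, 2, 3, 3, 2, 4]   # one GPT scoring run
belinda_run2   = [3, 2, 4, 2, 2, 2, 3, 3, 2, 4]   # same GPT version, rescored

# Interrater reliability: human vs. GPT. Quadratic weights penalize
# larger disagreements more heavily, which suits an ordinal rubric.
interrater = cohen_kappa_score(human_scores, belinda_run1, weights="quadratic")

# Intrarater reliability: the same GPT version scoring twice.
intrarater = cohen_kappa_score(belinda_run1, belinda_run2, weights="quadratic")

print(f"interrater kappa = {interrater:.2f}, intrarater kappa = {intrarater:.2f}")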

Information

Type
Research Article
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press on behalf of EUROCALL, the European Association for Computer-Assisted Language Learning
Table 1. Examples of the types of feedback generated

Table 2. GPT version history

Table 3. Summary of descriptive statistics

Table 4. Interrater reliability between human raters and Belinda

Table 5. Intrarater reliability between three versions of Belinda

Supplementary material

Miller De Rutté and Correa supplementary material (File, 92.7 KB)