
Impact of prompt sophistication on ChatGPT’s output for automated written corrective feedback

Published online by Cambridge University Press: 29 December 2025

Na Luo*
Affiliation:
Lanzhou University, China (luona@lzu.edu.cn)
Yifan Wang
Affiliation:
Lanzhou University, China (wangyifan2024@lzu.edu.cn)
Zhe (Victor) Zhang
Affiliation:
Macao Polytechnic University, Macao SAR, China (victorzhang@mpu.edu.mo)
Yile Zhou
Affiliation:
Freelancer, China (zhouyile6@gmail.com)
Rongfu Zhao
Affiliation:
Lanzhou University, China (zhaorf2024@lzu.edu.cn)
Corresponding author: Na Luo; Email: luona@lzu.edu.cn

Abstract

The emergence of large language models, exemplified by ChatGPT, has garnered growing attention for their potential to generate feedback in second language writing, particularly automated written corrective feedback (AWCF). In this study, we examined how prompt design – a generic prompt and two domain-specific prompts (zero-shot and one-shot) enriched with comprehensive domain knowledge about written corrective feedback (WCF) – influences ChatGPT’s ability to provide AWCF. The accuracy and coverage of ChatGPT’s feedback under these three prompts were benchmarked against Grammarly, a widely used traditional automated writing evaluation (AWE) tool. We found that ChatGPT’s ability to flag language errors improved considerably as prompt sophistication increased through the integration of domain-specific knowledge and examples. While the generic prompt yielded substantially lower performance than Grammarly, the zero-shot domain-specific prompt achieved results comparable to Grammarly’s, and the one-shot prompt surpassed it considerably in error detection. Notably, the most pronounced improvement in ChatGPT’s performance was observed in its detection of frequent error categories, including word choice or expression, direct translation, sentence structure and pronoun use. Nonetheless, even with the most sophisticated prompt, ChatGPT still displayed certain limitations compared to Grammarly. Our study has both theoretical and practical implications. Theoretically, it lends empirical evidence to Knoth et al.’s (2024) proposition to separate domain-specific AI literacy from generic AI literacy. Practically, it sheds light on the pedagogical application and technical development of AWE systems.
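
The abstract describes three prompt conditions of increasing sophistication. As a rough sketch of what such a tiered design could look like (the wording and the helper function below are hypothetical illustrations, not the authors’ actual prompts), the conditions differ only in how much WCF domain knowledge and how many worked examples the prompt carries:

```python
# Hypothetical sketch of the three prompt tiers described in the abstract.
# The prompt wording is illustrative only, not the study's actual text.

def build_prompt(essay: str, domain_specific: bool = False,
                 example: str | None = None) -> str:
    """Assemble an AWCF prompt of increasing sophistication."""
    if not domain_specific:
        # Prompt 1: zero-shot generic -- no WCF domain knowledge
        return f"Identify and correct the language errors in this essay:\n{essay}"
    # Prompt 2: zero-shot domain-specific -- adds WCF domain knowledge,
    # e.g. an error taxonomy and feedback conventions
    prompt = (
        "You are an expert in written corrective feedback (WCF) for second "
        "language writing. Flag every error, label its category (e.g. word "
        "choice/expression, direct translation, sentence structure, pronoun) "
        "and suggest a correction.\n"
    )
    if example:
        # Prompt 3: one-shot domain-specific -- adds one worked example
        prompt += f"Example of the expected feedback:\n{example}\n"
    return prompt + f"Essay:\n{essay}"

# The three conditions benchmarked in the study, applied to a sample sentence
essay = "He suggested me to go there yesterday."
p1 = build_prompt(essay)
p2 = build_prompt(essay, domain_specific=True)
p3 = build_prompt(essay, domain_specific=True,
                  example="'She enjoys to read.' -> word choice: 'She enjoys reading.'")
```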

Information

Type
Research Article
Creative Commons
CC BY-NC
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial licence (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original article is properly cited. The written permission of Cambridge University Press or the rights holder(s) must be obtained prior to any commercial use.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of EUROCALL, the European Association for Computer-Assisted Language Learning

Figure 1. The number of errors flagged by ChatGPT with the three prompts and Grammarly. GPT-P1, GPT-P2 and GPT-P3 stand for the AWCF generated by ChatGPT with Prompt 1 (zero-shot generic), Prompt 2 (zero-shot domain-specific) and Prompt 3 (one-shot domain-specific) respectively.


Table 1. Precision and recall of ChatGPT’s AWCF across prompts compared with that of Grammarly
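
For readers unfamiliar with the metrics in Table 1: precision is the share of flagged errors that are genuine, and recall is the share of genuine errors that are flagged. A minimal sketch, assuming flagged errors are matched to gold-standard annotations by exact text span (the study’s actual matching criterion may differ):

```python
# Conventional precision/recall computation for error flagging.
# Matching flags to gold annotations by exact character span is an
# assumption here, not necessarily the study's criterion.

def precision_recall(flagged: set[tuple[int, int]],
                     gold: set[tuple[int, int]]) -> tuple[float, float]:
    """Return (precision, recall) for a set of flagged error spans."""
    true_positives = len(flagged & gold)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Example: 3 flags, 2 of which match the 4 gold-standard errors
flags = {(0, 5), (10, 14), (20, 26)}
gold = {(0, 5), (10, 14), (30, 35), (40, 44)}
print(precision_recall(flags, gold))  # (0.666..., 0.5)
```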

Supplementary material: Luo et al. supplementary material 1 (File, 415 Bytes)

Supplementary material: Luo et al. supplementary material 2 (File, 462.7 KB)