Feedback is integral to second language (L2) writing instruction. However, large class sizes and limited teacher time often hinder the delivery of personalized feedback, prompting interest in AI-powered solutions such as ChatGPT (Escalante et al., 2023; Huete-García & Tarp, 2024; Steiss et al., 2024; Yoon et al., 2023; Zhang, 2024). This study evaluates a task-customized GPT model, “Belinda,” trained to assess A1-level Spanish learners’ writing and provide feedback. Two research questions guided the investigation: (1) Can Belinda accurately score beginner Spanish writing using a provided rubric? (2) Can Belinda deliver constructive qualitative feedback? Human and GPT-generated scores were compared for inter- and intrarater reliability, and qualitative analyses categorized the feedback for classroom usability. Results revealed moderate alignment between Belinda’s scores and those of human raters, though the GPT’s reliability fell short of calibration benchmarks. Feedback quality varied, with Belinda often providing vague, incomplete, or inaccurate suggestions. Despite iterative training, the GPT struggled to balance error correction with encouragement, a critical need for novice learners. Additionally, inconsistent outputs across identical GPT versions raised further concerns about reliability. While Belinda showed potential for automating feedback, its limitations in accuracy, contextual understanding, and positivity suggest that, on its own, it is not yet a viable substitute for human evaluation. These findings highlight the challenges of integrating AI into L2 instruction and underscore the need for extensive datasets, robust training, and human–AI collaboration to achieve pedagogically sound outcomes. Future research should explore hybrid feedback models and scalable solutions that enhance AI’s role in language education without compromising learner progress or confidence.
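
As a minimal sketch of the reliability comparison described above, the snippet below computes quadratic-weighted Cohen’s kappa between human and GPT rubric scores, once across raters (inter-rater) and once across two runs of an identical GPT configuration (intra-rater). The choice of statistic, the 0–5 rubric scale, and all score values are illustrative assumptions; the abstract does not specify which agreement measure or data the study used.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical rubric scores (0-5 ordinal scale) for the same set of essays.
# These values are illustrative only, not data from the study.
human_scores = np.array([3, 4, 2, 5, 3, 1, 4, 2, 3, 5])
gpt_scores = np.array([3, 3, 2, 4, 4, 1, 4, 3, 3, 4])

# Inter-rater reliability: agreement between the human rater and the GPT.
inter_kappa = cohen_kappa_score(human_scores, gpt_scores, weights="quadratic")
print(f"Human vs. GPT (inter-rater) kappa: {inter_kappa:.2f}")

# Intra-rater reliability: re-score the same essays with an identical GPT
# configuration and compare the two passes against each other.
gpt_scores_rerun = np.array([3, 4, 2, 4, 3, 2, 4, 3, 3, 4])
intra_kappa = cohen_kappa_score(gpt_scores, gpt_scores_rerun, weights="quadratic")
print(f"GPT pass 1 vs. pass 2 (intra-rater) kappa: {intra_kappa:.2f}")
```

Quadratic weighting is used here because rubric scores are ordinal: a score of 3 against a human 4 is penalized less than a 1 against a 5, which matches how near-misses are typically treated when checking raters against calibration benchmarks.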