Introduction
The complexity, accuracy, and fluency (CAF) framework is grounded in theories of second language (L2) acquisition and performance. It relates to models of L2 competence, proficiency, and processing (Housen & Kuiken, Reference Housen and Kuiken2009). Within this framework, complexity refers to the breadth and sophistication of linguistic structures learners produce, primarily focusing on syntactic aspects. Accuracy pertains to the precision and correctness of language use, reflecting the degree to which a learner’s output adheres to the norms of the target language. Fluency addresses the fluidity, smoothness, and speed of language production. As Ortega (Reference Ortega, Kortmann and Szmrecsanyi2012) notes, CAF measures serve multiple functions: describing performance, benchmarking development, and gauging proficiency. These measures have become critical metrics for evaluating L2 proficiency across both instructional settings and standardized language assessments, demonstrating their predictive strength for language proficiency and writing quality (Crossley, Kyle, Allen, Guo, & McNamara, Reference Crossley, Kyle, Allen, Guo and McNamara2014; Plakans, Gebril, & Bilki, Reference Plakans, Gebril and Bilki2019).
As the field advances, there is a growing acknowledgment of the need to integrate the three CAF measures and treat them as a whole rather than as separate constructs (Larsen-Freeman, Reference Larsen-Freeman2009; Norris & Ortega, Reference Norris and Ortega2009; Teng & Huang, Reference Teng and Huang2023). This integrated approach reflects the inherently dynamic and holistic nature of language learning and underscores the importance of viewing CAF as interconnected dimensions that mutually inform assessments of learner language development (Hou, Loerts, & Verspoor, Reference Hou, Loerts, Verspoor, Fogal and Verspoor2020).
Parallel to these theoretical developments, significant progress has occurred in the automated measurement of syntactic complexity and fluency. Advances in natural language processing (NLP) have made the automated calculation of syntactic complexity widely accessible through tools such as L2 Syntactic Complexity Analyzer (L2SCA; Lu, Reference Lu2010) and the Tool for the Automatic Analysis of Syntactic Sophistication and Complexity (TAASSC; Kyle & Crossley, Reference Kyle and Crossley2018). Fluency measures, typically operationalized through word counts and sentence-level metrics, have similarly been automated with relative ease using existing NLP techniques.
However, accuracy, particularly the precise identification and quantification of linguistic errors, remains a domain traditionally reliant on resource-intensive human evaluation. Existing automated tools, such as Grammarly (https://app.grammarly.com/), offer learner-centric corrective feedback but do not provide the quantitative accuracy indices extensively used in L2 research, such as errors per T-unit, errors per clause, or proportions of error-free units (Polio & Shea, Reference Polio and Shea2014). The recent emergence of generative AI models, such as ChatGPT and other large language models (LLMs), has sparked interest in the feasibility of automated error analysis (Mizumoto, Shintani, Sasaki, & Teng, Reference Mizumoto, Shintani, Sasaki and Teng2024; Pfau, Polio, & Xu, Reference Pfau, Polio and Xu2023). These developments suggest significant potential for automating accuracy assessment, thereby addressing a major gap in current research and practical applications.
In response to this opportunity, the present research report introduces Auto Error Analyzer, a prototype web application specifically designed to automate the computation and reporting of accuracy metrics within the CAF framework. By leveraging cutting-edge generative AI (Llama 3.3), this tool efficiently and affordably quantifies accuracy in L2 learner texts, aligning closely with human evaluation benchmarks. This integration not only completes the automation of all CAF dimensions but also enables more scalable and reliable analyses of L2 production data, bridging the gap between human judgment and AI-driven assessment.
Background
Complexity and fluency in L2 development research
The CAF measures have been a fundamental framework for evaluating learners’ linguistic proficiency. They play a critical role in both the task-based assessment of L2 oral and written performance in classroom settings (Koizumi & In’nami, Reference Koizumi and In’nami2024) and in the context of norm-referenced language testing environments (Plakans et al., Reference Plakans, Gebril and Bilki2019). Complexity encompasses the breadth and sophistication of linguistic structures that a learner is capable of producing, emphasizing the use of advanced lexical, morphological, and syntactic structures (Housen et al., Reference Housen, Kuiken, Vedder, Housen, Kuiken and Vedder2012; Pallotti, Reference Pallotti, Winke and Brunfaut2021). Researchers have developed numerous indices to quantify syntactic complexity in L2 writing, such as mean length of T-unit (MLT), mean length of clause (MLC), coordinate phrases per clause (CP/C), complex nominals per clause (CN/C), dependent clause ratio (DC/C), T-unit complexity ratio (C/T), mean length of sentence (MLS), sentence coordination ratio (T/S), and dependent clauses per T-unit (DC/T) (see Hwang & Polio, Reference Hwang and Polio2023 for a comprehensive review of syntactic complexity indices in L2 writing studies). Recently, researchers have expanded the analytical methods with more refined syntactic indices that capture specific dimensions of complexity, including phrasal elaboration, structural variations, and clause-level relationships (Biber, Gray, Staples, & Egbert, Reference Biber, Gray, Staples and Egbert2020; Crossley, Reference Crossley2020; Kyle & Crossley, Reference Kyle and Crossley2018).
These syntactic complexity measures capture the elaborateness and hierarchy of a learner’s sentences. In empirical studies, such complexity indices have been shown to predict writing proficiency (e.g., Kuiken & Vedder, Reference Kuiken and Vedder2019; Phuoc & Barrot, Reference Phuoc and Barrot2022) and elucidate text variation (e.g., Tao & Aryadoust, Reference Tao and Aryadoust2024). For example, Mizumoto and Eguchi (Reference Mizumoto and Eguchi2023) explored the efficacy of automated essay scoring (AES) with the GPT-3 text-davinci-003 model on the 12,100 essays in the ETS Corpus of Non-Native Written English (TOEFL11). Their findings revealed that the inclusion of the above-mentioned syntactic complexity measures and additional linguistic features, such as lexical diversity (Kyle, Crossley, & Jarvis, Reference Kyle, Crossley and Jarvis2021) and lexical sophistication (Crossley, Kyle, & Berger, Reference Crossley, Kyle and Berger2018), significantly improved the accuracy of the scoring compared to using the GPT model alone.
Advances in NLP in recent years have made syntactic complexity easier to assess, using tools such as L2SCA (Lu, Reference Lu2010) and TAASSC (Kyle, Reference Kyle2016). In addition, L2SCA’s online interface, the Web-based L2 Syntactic Complexity Analyzer (https://aihaiyang.com/software/l2sca/), offers an accessible platform for easy online operation (Ai & Lu, Reference Ai, Lu, Díaz-Negrillo, Ballier and Thompson2013). These tools have enabled automated large-scale analysis of syntactic complexity without manual coding and have consequently been widely used in the field. It is important to note that the current version of Auto Error Analyzer incorporates syntactic complexity measures primarily to support its accuracy calculations. Specifically, syntactic complexity indices, derived from well-established NLP methodologies and validated tools (e.g., L2SCA, TAASSC), provide the foundational units (e.g., T-units and clauses) for quantifying accuracy metrics such as errors per T-unit or clause. While lexical and morphological complexity are equally significant, their computational integration differs substantially and lies beyond the immediate scope of this report, which focuses on accuracy assessment. Future development will explore the incorporation of these additional complexity dimensions to provide a more comprehensive analysis of L2 production.
Fluency, on the other hand, concerns the flow and speed of language production (articulation rate), breakdown (pauses), and repair (reformulations and false starts), emphasizing the ability to communicate ideas without undue pausing or hesitation. Within the CAF framework, fluency has been straightforwardly operationalized in L2 writing as the total number of words, or at times, clauses or T-units per text (Plakans et al., Reference Plakans, Gebril and Bilki2019). With the advent of computer-based writing as the standard, writing fluency now also focuses on minimal pauses, few revisions, and significant word production (Garcés-Manzanera, Reference Garcés-Manzanera2024; Gilquin, Reference Gilquin2024). Consequently, there has been an increase in studies using keystrokes as a measure of writing fluency, facilitated by keystroke logging programs (e.g., Mohsen, Reference Mohsen2021). Recently, some researchers have expanded fluency measurement methods by integrating eye-tracking technology, providing further insights into the cognitive processes underlying L2 fluency (Chukharev-Hudilainen, Saricaoglu, Torrance, & Feng, Reference Chukharev-Hudilainen, Saricaoglu, Torrance and Feng2019; Lehtilä & Lintunen, Reference Lehtilä and Lintunen2025; Révész, Michel, Lu, Kourtali, Lee, & Borges, Reference Révész, Michel, Lu, Kourtali, Lee and Borges2022).
Among the three CAF dimensions, fluency, particularly as measured by word count, is recognized as a predictor of both writing quality and overall language development in L2 learners (Plakans et al., Reference Plakans, Gebril and Bilki2019). Research has consistently demonstrated that word count serves as a robust predictor of variance in L2 writing assessment scores (Ferris, Reference Ferris1994; Gebril & Plakans, Reference Gebril and Plakans2013; Goh, Sun, & Yang, Reference Goh, Sun and Yang2020). For example, in a study examining linguistic microfeatures as predictors of L2 writing quality, Crossley et al. (Reference Crossley, Kyle, Allen, Guo and McNamara2014) identified word count (or word type) as a significant determinant of essay scores, demonstrating that test-takers who produced more text generally received higher evaluations of their writing.
Accuracy in L2 research: Significance and challenges
Accuracy, in the CAF framework, denotes the precision and correctness of language use, reflecting how closely a learner’s production in L2 adheres to the grammatical, syntactic, and lexical norms of the target language (Housen, Kuiken, & Vedder, Reference Housen, Kuiken, Vedder, Housen, Kuiken and Vedder2012). High accuracy is evident when an L2 text has few or no errors in syntax, morphology, or lexis.
Accuracy is not just a predictor of proficiency, as evidenced by its correlation with proficiency levels (Thewissen, Reference Thewissen2013) and writing scores (Polio & Shea, Reference Polio and Shea2014); it also plays a pivotal role in classrooms through formative assessment. The ability to identify, correct, and reconstruct learners’ interlanguage, including their errors, is essential. Consequently, accuracy is a key indicator for researchers and practitioners alike, underscoring its importance in evaluating language proficiency and guiding the instructional process (Pallotti, Reference Pallotti2017).
To evaluate accuracy, researchers have developed a variety of metrics and coding schemes. Updating Polio (Reference Polio1997), Polio and Shea (Reference Polio and Shea2014) comprehensively reviewed accuracy measures used in L2 writing research. They reported that common measures include (a) holistic scales for linguistic accuracy, (b) error-free units (e.g., error-free T-units), (c) the number of errors per 100 words, (d) the number of specific error types (e.g., grammar errors such as articles), and (e) measures considering the severity of the error (e.g., weighted error-free clauses). Pallotti (Reference Pallotti, Winke and Brunfaut2021) similarly highlighted several of these indicators, particularly emphasizing measures such as the average number of errors per unit (clause, sentence, or T-unit) and the proportion of error-free units, thereby reaffirming key metrics identified in Polio and Shea’s earlier review.
These measures provide a detailed view of accuracy by quantifying not just how many errors are present but also their types and distribution. For instance, one learner’s essay might have eight errors per 100 words (mainly minor spelling mistakes), while another’s has four errors per 100 words but with two being severe grammatical errors; different accuracy metrics would capture different aspects of these texts’ correctness. Notably, accuracy metrics like error-free clause counts and error rates have demonstrated validity by aligning with expert judgments and predicting writing performance (Polio & Shea, Reference Polio and Shea2014). In other words, texts with fewer errors (or higher proportions of error-free sentences) tend to receive higher ratings, confirming accuracy as an essential indicator of writing quality.
Despite its importance, accuracy has historically been the most challenging CAF dimension to measure automatically. Complexity and fluency lend themselves to automation because they deal with structural and quantitative features that computers can parse (sentence lengths, word counts, etc.). Accuracy, in contrast, requires identifying linguistic errors, a task that traditionally relies on human evaluators who can distinguish true errors from stylistic variations. While automated grammar checkers (e.g., Grammarly) and other Automated Writing Evaluation (AWE) tools provide corrective feedback to users, they do not directly output research-friendly accuracy metrics like those reviewed above (Polio & Shea, Reference Polio and Shea2014). For example, a tool like Grammarly might underline a misused article or a verb tense error and suggest a correction, but it will not tell a researcher “this essay has 5 errors in 200 words” or “80% of the sentences are error-free.” As a result, studies analyzing accuracy have had to resort to labor-intensive human coding of errors in student texts. This manual process raises concerns about consistency and scalability. Researchers must agree on what counts as an error, train raters, and spend considerable time on analysis―a barrier to large-scale or classroom-level implementation. In short, accuracy remains the least automated component of CAF, and improving this state of affairs is a pressing need in L2 research.
Automating accuracy measurement
Recent advancements in NLP and artificial intelligence (AI) are beginning to bridge the gap in accuracy assessment. Researchers have started comparing cutting-edge AI language models with existing grammar checking tools to evaluate how accurately each can detect and evaluate errors in L2 production, especially in writing. Traditional tools such as Grammarly or Criterion offer quick fixes for obvious mistakes, yet they sometimes miss context-dependent errors and, as noted, do not report comprehensive accuracy statistics. Crossley, Bradfield, and Bustamante (Reference Crossley, Bradfield and Bustamante2019) introduced the Grammar and Mechanics Error Tool (GAMET), based on LanguageTool v3.2, to automatically assess structural and mechanical errors in texts. Although GAMET achieved high precision in identifying certain error types (e.g., spelling and typographical errors), its overall accuracy and recall were still limited, particularly in grammar and punctuation error detection. Consequently, while GAMET represented an important attempt at automating accuracy assessment, it lacked the comprehensive precision offered by newer, generative AI-based methods.
In contrast, the new generation of AI, exemplified by large language models (LLMs) like Generative Pre-trained Transformer (GPT), can analyze text with greater depth and potentially provide more detailed error analysis. A key question has been whether these AI models can match or approach human accuracy in identifying and judging L2 writing errors, and early evidence is promising. For example, Pfau et al. (Reference Pfau, Polio and Xu2023) conducted a pioneering study on GPT-4’s accuracy assessment capabilities, examining 100 essays written by non-native English speakers. They used ChatGPT to detect and classify errors in each sentence and had human experts verify the AI’s performance. Their findings revealed strong alignment between ChatGPT and human evaluators, with a correlation of .97. Such a high correlation suggests that generative AI can replicate human-level accuracy judgments with remarkable precision. Building on the work of Pfau et al., Mizumoto, Yasuda, and Tamura (Reference Mizumoto, Yasuda and Tamura2024) directly compared the accuracy evaluations of generative AI models with both human assessments and Grammarly’s feedback, using a sample of 232 L2 English writing scripts. The AI (based on ChatGPT) not only showed a strong correlation (ρ = .79) with human judgment of errors, but it actually outperformed the grammar checker in mimicking human evaluations. Specifically, the AI’s error counts and ratings were more in line with the human experts’ (higher agreement) than Grammarly’s were, and the AI’s accuracy metrics were slightly better predictors of the students’ actual writing scores on a proficiency exam.
These studies demonstrate promising results, with generative AI (GPT) showing high accuracy in detecting grammatical errors and aligning closely with human judgments. The evidence supports the use of generative AI as a viable tool for automated accuracy assessment, suggesting that generative AI is now sufficiently developed to warrant consideration for practical implementation. The present work takes up this charge by focusing on accuracy automation, directly addressing the identified gap in tools in L2 research.
Having identified the critical gap in automated accuracy measurement, the following section introduces Auto Error Analyzer, a tool specifically designed to address these limitations through its innovative approach to error detection, classification, and reporting for L2 production assessment.
Description of the tool
Indices calculated in the Auto Error Analyzer
The newly developed tool, named “Auto Error Analyzer,” is freely accessible at https://langtech.jp/autoerror/. Figure 1 shows the Auto Error Analyzer landing page. Users can analyze English text entered manually or uploaded as text (.txt) or Microsoft Word (.docx) files.

Figure 1. Auto Error Analyzer landing page.
In addition to written texts, Auto Error Analyzer can also process transcribed spoken data. Since the tool focuses on error detection and classification, its core functionalities apply to both written and transcribed spoken texts, making it useful for analyzing a wider range of learner production.
Table 1 presents the indices calculated by the Auto Error Analyzer. Because accuracy indices are computed relative to production units, fluency-related indices (e.g., total number of words and sentences) and complexity-related indices (e.g., T-units and clauses) are also displayed. Since one of the defining features of the Auto Error Analyzer is its automated calculation of accuracy indices, computing T-units and clauses is essential for deriving accuracy measures such as Errors per T-unit, Errors per Clause, Error-free T-units, and Error-free Clauses. These supplementary indices, shown on the left of Table 1, are included in the web application for reference purposes only, because the Auto Error Analyzer is specifically designed to focus on the automatic calculation of accuracy indices.
Table 1. Indices calculated in the Auto Error Analyzer

The main category, accuracy, includes 12 indices that are based on widely used accuracy measures in L2 writing research (Polio & Shea, Reference Polio and Shea2014). In a recent study by Mizumoto et al. (Reference Mizumoto and Plonsky2024), GPT-4 was employed for error identification because of its demonstrated high accuracy in detecting and correcting grammatical errors, thus providing a robust baseline for AI-based accuracy assessment. Specifically, GPT-4 was used via OpenAI’s API (Application Programming Interface, a set of protocols and tools that allow software applications to communicate with each other). Auto Error Analyzer instead uses Llama 3.3, accessed through Groq’s API. This model offers capabilities comparable to GPT-4, with the added advantages of significantly faster processing and lower API costs. Given that processing speed and cost-efficiency are critical when delivering LLM-powered services, these advantages make Llama 3.3 a practical choice, particularly for real-time applications and large-scale analyses. The prompt used for error analysis in Auto Error Analyzer is the same as the one employed in Mizumoto et al. (Reference Mizumoto and Plonsky2024):
[Prompt]
Reply with a corrected version of the input sentence with all grammatical, spelling, and punctuation errors fixed. Be strict about the possible errors. If there are no errors, reply with a copy of the original sentence. Please do not add any unnecessary explanations.
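For illustration, the following minimal sketch shows how a single sentence could be submitted to Llama 3.3 with this prompt. It assumes the Groq Python client and the llama-3.3-70b-versatile model identifier; the tool’s actual implementation is available in the online supplementary material (https://osf.io/jyf3r/).

[Code sketch (Python): submitting one sentence for correction]
import os
from groq import Groq  # assumed third-party client: pip install groq

# The correction prompt reproduced from above.
PROMPT = (
    "Reply with a corrected version of the input sentence with all grammatical, "
    "spelling, and punctuation errors fixed. Be strict about the possible errors. "
    "If there are no errors, reply with a copy of the original sentence. "
    "Please do not add any unnecessary explanations."
)

client = Groq(api_key=os.environ["GROQ_API_KEY"])  # API key supplied via an environment variable

def correct_sentence(sentence: str) -> str:
    # Submit one learner sentence with the correction prompt and return the model's output.
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",  # assumed Groq model identifier for Llama 3.3
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": sentence},
        ],
        temperature=0,  # deterministic output helps keep error counts reproducible
    )
    return response.choices[0].message.content.strip()

print(correct_sentence("He go to school every days."))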
Extraction of T-units and clauses
In this application, the extraction of T-units and clauses from text involves distinct processes aimed at analyzing syntactic and grammatical structures to ensure clear and meaningful segmentation. A T-unit, or minimal terminable unit, represents an independent clause along with any attached dependent clauses or modifying phrases. To identify T-units, the program analyzes the text using an NLP approach that examines conjunctions, punctuation, and overall sentence structure. Specifically, the application focuses on conjunctions such as “and,” “but,” or “so,” and punctuation marks like semicolons and conjunctive adverbs to determine logical points for splitting sentences into meaningful units. Each potential split point is carefully assessed within the application to confirm the presence of a complete independent clause, ensuring the resulting units are coherent and self-contained.
Clauses, on the other hand, are segments within sentences identified primarily through their verbs. In this developed application, the identification of clauses involves analyzing verbs that form the core of these grammatical structures. By focusing on the main actions and their syntactic relationships, the program identifies both independent and dependent clauses, distinguishing them clearly within complex sentences. This process within the program also considers special grammatical structures like coordinated clauses and relative clauses, ensuring accurate representation and readability.
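As a rough illustration of this verb-based identification (not the tool’s actual code, which is available in the supplementary material), the following sketch uses spaCy’s dependency parse to approximate clause and T-unit counts; the specific dependency labels treated as clause markers are simplifying assumptions.

[Code sketch (Python): approximating T-unit and clause counts with spaCy]
import spacy

nlp = spacy.load("en_core_web_sm")  # small English model (python -m spacy download en_core_web_sm)

# Dependency labels treated here as signalling a clause; a simplifying assumption.
CLAUSE_DEPS = {"ROOT", "conj", "advcl", "relcl", "ccomp", "csubj", "acl"}

def count_clauses(sentence: str) -> int:
    # Approximate the clause count by counting verbs that head a clause-level dependency.
    doc = nlp(sentence)
    return sum(1 for tok in doc
               if tok.pos_ in {"VERB", "AUX"} and tok.dep_ in CLAUSE_DEPS)

def count_t_units(sentence: str) -> int:
    # Approximate T-units as the ROOT verb plus any verb coordinated with it
    # (e.g., independent clauses joined by "and", "but", or "so").
    doc = nlp(sentence)
    roots = [tok for tok in doc if tok.dep_ == "ROOT"]
    coordinated = [tok for tok in doc
                   if tok.dep_ == "conj" and tok.head.dep_ == "ROOT"
                   and tok.pos_ in {"VERB", "AUX"}]
    return max(1, len(roots) + len(coordinated))

text = "I like coffee, but my brother prefers tea because it is cheaper."
print(count_t_units(text), count_clauses(text))  # with a typical parse: 2 T-units, 3 clauses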
Following extraction, the application further refines T-units and clauses by correcting punctuation, contractions, and spacing, enhancing clarity and readability. If discrepancies arise, it performs additional refinements and reanalyses. This iterative approach ensures that the final output accurately represents the intended linguistic structures. By automating this step, the application establishes refined T-units and clauses as foundational units for calculating accuracy indices, such as errors per T-unit and errors per clause, within the overall accuracy measurement process.
In technical terms, the application uses NLP techniques similar to those in tools like L2SCA and TAASSC. Specifically, the approach used here mirrors the methodologies of these established tools by parsing text at the sentence and word level. However, unlike TAASSC, which does not directly compute T-units and clauses, this application employs custom-developed code written from scratch, inspired by the results of T-unit and clause extraction as implemented in L2SCA. While L2SCA utilizes the Stanford Parser (Klein & Manning, Reference Klein and Manning2003), this application employs spaCy, a robust NLP library in Python, complemented by custom-developed algorithms specifically tailored to replicate similar outcomes (see the online supplementary material for the actual code: https://osf.io/jyf3r/).
Indices such as Errors per T-unit, Errors per Clause, Error-free T-units, and Error-free Clauses are calculated from the detected errors and the extracted T-units and clauses. That is, after errors are automatically identified, a simple matching approach assigns each error to the T-unit and clause in which it occurs, enabling the calculation of these indices.
Technical implementation of error analysis with Llama 3.3
The error analysis component of Auto Error Analyzer is powered by Llama 3.3, accessed through Groq’s API service. This implementation involves several technical steps that transform raw text input into structured error analysis output.
First, the input text undergoes preprocessing to normalize spacing, handle special characters, and segment the text into individual sentences. This segmentation is crucial for the sentence-by-sentence analysis approach used by the tool.
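A minimal sketch of this kind of preprocessing is shown below, assuming spaCy for sentence segmentation; the actual pipeline may differ in its details.

[Code sketch (Python): normalizing and segmenting input text]
import re
import spacy

nlp = spacy.load("en_core_web_sm")

def preprocess(text: str) -> list[str]:
    # Normalize curly quotes and whitespace, then segment the text into sentences.
    text = text.replace("\u201c", '"').replace("\u201d", '"').replace("\u2019", "'")
    text = re.sub(r"\s+", " ", text).strip()
    return [sent.text.strip() for sent in nlp(text).sents if sent.text.strip()]

print(preprocess("Yesterday I go to  library. It was crowded, so I  leave early."))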
Next, each sentence is submitted to the Llama 3.3 model via Groq’s API with the specific prompt shown earlier. The implementation uses asynchronous API calls to optimize processing time, with appropriate rate limiting to respect API constraints. When the model returns a corrected sentence, a diff algorithm compares the original and corrected versions to identify specific changes. These changes are then classified into the 23 error categories (detailed below) using another set of API calls (see the online supplementary material for the actual code and prompt: https://osf.io/jyf3r/).
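The diff step could be implemented, for example, with Python’s standard difflib module; the sketch below illustrates the general idea rather than the tool’s exact code.

[Code sketch (Python): word-level diff between original and corrected sentences]
import difflib

def word_level_changes(original: str, corrected: str) -> list[tuple[str, str, str]]:
    # Compare the two sentences word by word and return (operation, original span, corrected span).
    orig_words, corr_words = original.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=orig_words, b=corr_words)
    changes = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # keep only replacements, insertions, and deletions
            changes.append((op, " ".join(orig_words[i1:i2]), " ".join(corr_words[j1:j2])))
    return changes

print(word_level_changes("He go to school every days.",
                         "He goes to school every day."))
# [('replace', 'go', 'goes'), ('replace', 'days.', 'day.')]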
After error identification and classification, a mapping algorithm associates these errors with the previously extracted T-units and clauses to generate unit-level metrics such as errors per T-unit. This association uses span indices that track the exact character positions of each error, T-unit, and clause within the original text, ensuring accurate alignment even with complex nested structures.
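The following simplified sketch illustrates how span-based mapping can yield unit-level metrics; the data structures (character-offset tuples for errors and units) are hypothetical simplifications of the tool’s internal representation.

[Code sketch (Python): mapping error spans to units and computing unit-level metrics]
def unit_level_metrics(error_spans, unit_spans):
    # error_spans: (start, end) character offsets of detected errors in the original text.
    # unit_spans: (start, end) character offsets of extracted T-units (or clauses).
    # Both are hypothetical simplifications of the tool's internal representation.
    per_unit = [sum(1 for es, ee in error_spans if es >= us and ee <= ue)
                for us, ue in unit_spans]
    return {
        "errors_per_unit": sum(per_unit) / len(unit_spans),
        "proportion_error_free_units": sum(1 for n in per_unit if n == 0) / len(unit_spans),
    }

# Example: three T-units and two detected errors (character offsets are illustrative).
print(unit_level_metrics(error_spans=[(5, 7), (30, 35)],
                         unit_spans=[(0, 20), (21, 45), (46, 70)]))
# {'errors_per_unit': 0.67, 'proportion_error_free_units': 0.33} (approximately)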
Error types detected by the tool
Defining what constitutes an “error” has proven to be elusive in L2 research (Pallotti, Reference Pallotti, Winke and Brunfaut2021, p. 4). To address this challenge systematically, Auto Error Analyzer implements a comprehensive classification framework comprising 23 error categories (Table 2). This taxonomy was developed through careful synthesis of established error coding schemes, including the Cambridge Learner Corpus Error Coding Scheme (Nicholls, Reference Nicholls2003) and L2 studies on error analysis or written corrective feedback (e.g., Dagneaux, Denness, & Granger, Reference Dagneaux, Denness and Granger1998; Ferris, Reference Ferris2011; Lee, Reference Lee2004; Lira-Gonzales & Nassaji, Reference Lira-Gonzales and Nassaji2020). The resulting categories are organized into three primary domains—grammatical, lexical, and mechanical dimensions—providing a structured approach to error identification and classification.
Table 2. Error types in Auto Error Analyzer

Note. For specific examples of each error, refer to the online supplementary material (https://osf.io/jyf3r/).
The grammatical categories (e.g., article errors, verb tense errors, subject-verb agreement errors) align closely with error taxonomies in previous studies, addressing basic syntactic and morphological error types (e.g., Ferris, Reference Ferris2011). In addition, discourse-level errors (e.g., conjunction errors and structural errors) are informed by cohesion frameworks, addressing coherence, logical connectivity, and higher-order discourse issues (e.g., Zhang et al., Reference Zhang, Diaz, Chen, Wu, Qian, Voss and Yu2024). Lexical error categories (e.g., collocation errors, inappropriate expressions, idiomatic usage) reflect corpus-based approaches that emphasize lexical appropriateness and collocational accuracy (e.g., Agustín-Llach, Reference Agustín-Llach2017). Mechanical/formal errors (e.g., spelling errors, punctuation errors, unnecessary words) correspond to surface-level or orthographic inaccuracies (e.g., Crossley et al., Reference Crossley, Bradfield and Bustamante2019). By integrating insights from established frameworks and empirical studies, the proposed 23-category scheme offers a theoretically sound, empirically grounded, and comprehensive tool for accurately capturing the full range of English as a Second Language/English as a Foreign Language (ESL/EFL) learner errors, thereby enhancing its relevance for research and pedagogical applications.
Figure 2 illustrates how Auto Error Analyzer presents its analysis results. In addition to the accuracy indices mentioned in Table 1, the tool displays the original sentences, corrected sentences, error counts, and corresponding error types in a tabular format for easy inspection. Corrections within the sentences are highlighted in bold, allowing users to quickly identify the changes. By automatically assigning error types based on the developed error categories, the tool enables users to not only count errors but also gain a clear understanding of the specific types of errors present in the text.

Figure 2. Sample of error analysis output in Auto Error Analyzer.
Furthermore, as shown in Figure 3, the tool provides a visual distribution of the various error types detected in the user’s input text. By presenting this information in an easy-to-understand format, the tool allows users to intuitively grasp not only the presence of errors but also the specific patterns and trends in their writing. This feature can help users improve their writing skills by making them more aware of recurring mistakes.

Figure 3. Error distribution plot in Auto Error Analyzer.
Another key feature of Auto Error Analyzer is its ability to download a CSV (comma-separated value) file that contains even more detailed information than the tables shown in Figure 2 (accessible via the “Download File for Verification” button). This file allows users to double-check for any inaccuracies. Specifically, it includes various linguistic metrics for each sentence, such as T-unit counts, Error-free T-units, Clause counts, Error-free Clauses, Extracted T-units, Extracted Clauses, and Clause-signaling verbs. Additionally, it provides error categories and a detailed mapping of the specific parts where errors occur, ensuring precise inspection. This enables users to verify T-unit and clause segmentation and confirm the accuracy of error detection and corrections. By facilitating this review process, users can conduct a final manual check, ensuring the reliability of reported figures, even in research contexts where numerical accuracy is critical. Since the program may not be 100% error-free, this human oversight serves as a safeguard. The philosophy behind Auto Error Analyzer’s development is rooted in the belief that, even in the AI era, continuous human validation remains essential. This approach highlights the importance of collaboration between AI and human expertise.
In line with open science principles in applied linguistics (Plonsky, Reference Plonsky2024)―which promote transparency and reproducibility in research (In’nami, Mizumoto, Plonsky, & Koizumi, Reference In’nami, Mizumoto, Plonsky and Koizumi2022), including application development (Mizumoto, Reference Mizumoto and Plonsky2024)―all code and files, except for the API key, are publicly accessible on OSF (https://osf.io/jyf3r/), GitHub (https://github.com/mizumot/AutoErrorAnalyzer), and IRIS (https://www.iris-database.org/details/37VPp-fKbgz). By making the code publicly available, as Kim, Williams, and McCallum (Reference Kim, Williams and McCallum2024) (https://github.com/phoenixwilliams/RAML_tutorial) and Kyle (https://github.com/kristopherkyle/TAASSC) have done, I hope that other researchers will build upon this tool as a foundation for further development and enhancement.
To ensure that Auto Error Analyzer provides a reliable and valid measurement of linguistic errors, a systematic validation was conducted focusing on two critical aspects: the accuracy of T-unit and clause extraction, which forms the foundation for structural analysis, and the alignment between the tool’s automated error detection and human expert judgment.
Evaluation of the tool’s performance
To assess the validity of the metrics calculated by Auto Error Analyzer, a performance comparison was conducted on two levels: (1) T-units and clause extraction, and (2) error analysis. For this comparison, I randomly selected 100 essays written by Japanese EFL university students from the dataset provided by Mizumoto et al. (Reference Mizumoto and Plonsky2024), available on OSF (https://osf.io/bj8kq/), using Microsoft Excel’s RAND function. Each essay was segmented into individual sentences, and two researchers with Ph.D. degrees in applied linguistics independently counted the errors and assigned the error types listed in Table 2. Their counts were then recorded. To align with the LLM-based accuracy assessment, both researchers received training using examples from 30 additional essays analyzed for errors by ChatGPT (GPT-4) in Mizumoto et al. (Reference Mizumoto and Plonsky2024), which were not part of the selected 100 essays. This training allowed them to familiarize themselves with typical error patterns and types. For the initial inter-rater reliability assessment in terms of the number of errors, Krippendorff’s alpha was calculated, resulting in a high agreement rate (α = .92), surpassing the recommended threshold of .80 (Krippendorff, 2004). In cases of disagreement, the final error counts were determined through discussion between the two raters and the author of this article. The assignment of error types followed the same process.
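For readers who wish to replicate this reliability check, the coefficient can be computed as in the sketch below, which assumes the third-party krippendorff Python package and uses illustrative (not actual) rating data.

[Code sketch (Python): computing Krippendorff’s alpha for two raters]
import krippendorff  # third-party package: pip install krippendorff

# Hypothetical per-essay error counts from the two raters (the real data are on OSF).
rater_1 = [3, 0, 5, 2, 7, 1, 4, 0, 6, 2]
rater_2 = [3, 1, 5, 2, 6, 1, 4, 0, 6, 3]

alpha = krippendorff.alpha(reliability_data=[rater_1, rater_2],
                           level_of_measurement="ratio")  # error counts are ratio-scaled
print(round(alpha, 2))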
For T-unit and clause validation, I compared the counts obtained using the Web-based L2SCA (https://aihaiyang.com/software/l2sca/) (Ai & Lu, Reference Ai, Lu, Díaz-Negrillo, Ballier and Thompson2013) with the T-units and clauses generated by Auto Error Analyzer for the same set of 100 essays. As seen in Figure 4, both T-units and clauses obtained from L2SCA and Auto Error Analyzer are nearly identical, with T-units (r = .95, 95% CI [.93, .97]) and Clauses (r = .97, 95% CI [.95, .98]). This indicates a very high level of agreement, confirming that the T-units and clauses extracted by Auto Error Analyzer align closely with those from L2SCA, a tool widely used in the L2 field.

Figure 4. Correlations of T-units and clauses calculated with web-based L2SCA and Auto Error Analyzer.
To compare human error evaluations with the automated error analysis provided by Llama 3.3 in Auto Error Analyzer, I calculated the errors per 100 words for each essay using both methods. Since both metrics met the assumptions of normality, I computed the Pearson correlation coefficient, which showed high agreement (r = .94, 95% CI [.92, .96]), as shown in Figure 5. This result confirms the high level of trustworthiness of the Auto Error Analyzer’s automated accuracy calculation using Llama 3.3.
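Such a correlation and its confidence interval can be reproduced with standard scientific Python libraries; the sketch below uses a Fisher z-based interval and illustrative values rather than the study data.

[Code sketch (Python): Pearson correlation with a 95% confidence interval]
import numpy as np
from scipy import stats

def pearson_with_ci(x, y, conf=0.95):
    # Pearson correlation with a confidence interval based on the Fisher z transformation.
    r, _ = stats.pearsonr(x, y)
    z = np.arctanh(r)                     # Fisher z transform of r
    se = 1 / np.sqrt(len(x) - 3)          # standard error of z
    crit = stats.norm.ppf(1 - (1 - conf) / 2)
    return r, (np.tanh(z - crit * se), np.tanh(z + crit * se))

# Hypothetical errors-per-100-words values for a handful of essays (not the study data).
human = [4.2, 7.5, 2.1, 9.0, 5.5, 3.3, 6.8, 1.9]
tool = [4.0, 7.9, 2.5, 8.6, 5.1, 3.0, 7.2, 2.2]
print(pearson_with_ci(human, tool))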

Figure 5. Correlation of errors per 100 words: Human raters and Auto Error Analyzer.
While correlation analysis demonstrates the general agreement between human raters and Auto Error Analyzer in quantifying errors, a more detailed evaluation of the tool’s classification accuracy is necessary. To comprehensively assess the performance of Auto Error Analyzer at the categorical level, I calculated three standard machine learning evaluation metrics: Precision, Recall, and F1-score.
These metrics, when applied to the errors identified by Auto Error Analyzer (i.e., Llama 3.3) versus those identified by human annotators, provide critical insight into the tool’s ability to correctly identify specific error instances (Precision), its comprehensiveness in finding all errors (Recall), and its overall effectiveness as measured by the harmonic mean of these values (F1-score). Precision measures the proportion of correctly identified errors out of all instances the model flagged as errors, reflecting the model’s ability to avoid false positives. Recall measures the proportion of actual errors that were successfully identified by the model, indicating its ability to detect errors comprehensively. F1-score combines precision and recall into a single metric, calculated as 2 × (precision × recall) ÷ (precision + recall), providing a balanced measurement that accounts for both the model’s accuracy in identifying errors and its ability to find all errors present in the text.
The calculation employed the microaverage method, meaning true positives, false positives, and false negatives were aggregated across all error categories as a single combined set. The microaverage approach was chosen due to the multi-label nature of the dataset, where each sentence could have multiple error categories assigned, and where the distribution of error categories varied significantly.
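A minimal sketch of the microaverage computation is given below; the per-category counts are hypothetical and serve only to illustrate how true positives, false positives, and false negatives are pooled before computing the metrics.

[Code sketch (Python): micro-averaged Precision, Recall, and F1-score]
def micro_prf(confusions):
    # confusions: per-category true positives, false positives, and false negatives.
    # Microaveraging pools these counts across categories before computing the metrics.
    tp = sum(c["tp"] for c in confusions)
    fp = sum(c["fp"] for c in confusions)
    fn = sum(c["fn"] for c in confusions)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical per-category counts for three of the 23 error categories.
counts = [
    {"tp": 120, "fp": 4, "fn": 9},   # e.g., article errors
    {"tp": 85, "fp": 3, "fn": 5},    # e.g., verb tense errors
    {"tp": 40, "fp": 3, "fn": 2},    # e.g., spelling errors
]
print(micro_prf(counts))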
Using the dataset of 100 essays annotated by both human raters and Auto Error Analyzer, I obtained the following microaverage results: Precision = .96, Recall = .94, and F1-score = .95. These results indicate that Auto Error Analyzer demonstrates a high degree of agreement with human annotators in error detection. The strong Precision suggests that the model makes relatively few false-positive classifications, meaning it rarely marks correct language use as erroneous. The high Recall indicates that Auto Error Analyzer effectively identifies most errors that human raters detect. The balanced F1-score confirms the reliability of the model in automated error classification, further supporting its application in research contexts where detailed error analysis is required.
Ethical considerations in Auto Error Analyzer
The development and implementation of Auto Error Analyzer raises important ethical considerations regarding AI-assisted language assessment. Central to my approach is a commitment to transparency and open science. All code used in this tool is publicly available through OSF, GitHub, and IRIS repositories, allowing researchers to inspect, modify, and improve upon the underlying mechanisms. This transparency is crucial for building trust in automated assessment tools and enables the research community to understand precisely how error detection and classification occur. Regarding data privacy, Auto Error Analyzer processes texts on demand without storing user data on the server after analysis is complete. Each analysis session remains separate, ensuring that sensitive learner texts are not retained or used for purposes beyond the immediate analysis requested by the user. While the tool democratizes access to sophisticated error analysis capabilities for researchers and educators worldwide, I emphasize that Auto Error Analyzer is designed to complement, not replace, human judgment. Particularly in high-stakes assessment contexts, results should be verified by qualified human evaluators who can contextualize automated findings within broader educational goals and learner needs. By acknowledging these ethical dimensions, I hope to promote the responsible implementation of AI-assisted language analysis tools that empower rather than diminish human expertise in language teaching and research.
Concluding remarks
This research report, which aims to automate error metrics calculation, bridges theoretical foundations with practical application by detailing the development of a web app and presenting the tool’s validation results. Although the tool is not without flaws, it provides researchers with a means to streamline error analysis. Given that human expertise remains essential in defining error constructs, annotating examples, validating AI outputs, and interpreting assessment results, Auto Error Analyzer functions as a complementary tool, enhancing rather than replacing human judgment. This integration of AI and human expertise can support more informed and effective decision-making in language learning and assessment.
Looking forward, Auto Error Analyzer is envisioned as a valuable springboard for L2 research utilizing error-related metrics. This initial prototype includes only basic functionalities. Expanding it to incorporate additional linguistic indices, along with features such as lexical diversity, lexical sophistication, and cohesion measures (see Crossley, Reference Crossley2020), is feasible. Future versions could also integrate capabilities for automated holistic scoring (Mizumoto & Eguchi, Reference Mizumoto and Eguchi2023), rubric-based analytic scoring (Yavuz, Çelik, & Yavaş Çelik, Reference Yavuz, Çelik and Yavaş Çelik2024), and tailored feedback (Steiss et al., Reference Steiss, Tate, Graham, Cruz, Hebert, Wang and Olson2024).
Despite its promising capabilities, the tool has several important limitations to consider. The system can identify and categorize a wide range of common errors but may struggle with more nuanced, context-dependent errors that require deeper pragmatic understanding or cultural awareness. For instance, errors related to register appropriateness, genre conventions, or culturally specific expressions may be misidentified or overlooked entirely. The error classification system, though comprehensive, operates within the constraints of predetermined categories that may not capture all the subtleties human evaluators can recognize. Additionally, the current version offers limited customization options—users cannot modify the existing error classification scheme or add new categories to suit specific research needs. The tool may also show bias toward standard varieties of English, potentially misidentifying features of World Englishes or regional dialects as errors. Furthermore, since validation was primarily conducted with texts from Japanese EFL students, performance may vary with texts from writers with different first language (L1) backgrounds or when analyzing different text genres beyond academic writing. It is also worth noting that Auto Error Analyzer leverages Llama 3.3 via Groq’s API, benefiting from both its remarkable speed and affordable pricing. However, potential changes in API costs or model specifications could affect its continued viability. Although Llama is currently available as open-source from Meta, the possibility of future access restrictions cannot be entirely dismissed.
These possibilities and challenges are areas for resolution in further tool development. Fortunately, this tool and the foundational research behind it (e.g., Kyle, Reference Kyle2016; Kyle, Choe, Eguchi, LaFlair, & Ziegler, Reference Kyle, Choe, Eguchi, LaFlair and Ziegler2021; Lu, Reference Lu2010) uphold the principles of open science and the philosophy of “sharing is caring.” Therefore, regardless of any future changes, the development of tools like Auto Error Analyzer will continue to be possible, and further advancements can be expected.
Supplementary material
The supplementary material for this article can be found at https://osf.io/jyf3r/.
Data availability statement
The experiment in this article earned Open Data and Open Materials badges for transparent practices. The materials are available at https://osf.io/jyf3r/.
Acknowledgments
This project was supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant No. 25K00482. In preparing this manuscript, I used ChatGPT (GPT-4) to enhance the clarity and coherence of the language, ensuring it aligns with the standards expected in scholarly journals. While ChatGPT assisted in refining the language, it did not contribute to the development of any original ideas. This project also benefited—albeit indirectly—from the work of pioneering researchers such as Dr. Xiaofei Lu and Dr. Kristopher Kyle, whose commitment to open science is exemplified by the public availability of their code.