
Aligning linguistic complexity with the difficulty of English texts for L2 learners based on CEFR levels

Published online by Cambridge University Press: 03 September 2025

Xiaopeng Zhang
Affiliation: Xi’an Jiaotong University, Xi’an, China
Xiaofei Lu*
Affiliation: The Pennsylvania State University, University Park, PA, USA
*Corresponding author: Xiaofei Lu; Email: xxl13@psu.edu

Abstract

Selecting appropriate texts for second language (L2) learners is essential for effective education. However, current text difficulty models often inadequately classify materials for L2 learners by proficiency levels. This study addresses this deficiency by employing the Common European Framework of Reference for Languages (CEFR) as its foundational framework. A cohort of expert English-L2 educators classified 1,181 texts from the CommonLit Ease of Readability corpus into CEFR levels. A random forest model was then trained using 24 linguistic complexity features to predict the CEFR levels of English texts for L2 learners. The model achieved 62.6% exact-level accuracy across the six granular CEFR levels and 82.6% across the three overarching levels, outperforming a baseline model based on three existing readability formulas. Additionally, it identified shared and unique linguistic features across different CEFR levels, highlighting the necessity to adjust text classification models to accommodate the distinct linguistic profiles of low- and high-proficiency readers.
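The two accuracy figures above refer to two label granularities: the six CEFR levels (A1 to C2) and the three overarching levels (A, B, C). The sketch below shows one way the two scores can relate, by collapsing six-level labels into their broad bands before scoring. It is a minimal illustration, not the study's procedure: the label lists are made up, the helper names `accuracy` and `to_broad` are hypothetical, and the study may instead have trained a separate three-level model.

```python
# Minimal sketch: exact-level accuracy at six and at three CEFR levels.
# The gold/pred labels below are invented for illustration only;
# they are NOT data or results from the study.

# Map each granular CEFR level to its overarching band.
BROAD_LEVEL = {
    "A1": "A", "A2": "A",
    "B1": "B", "B2": "B",
    "C1": "C", "C2": "C",
}


def accuracy(gold, predicted):
    """Return the share of items whose predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("need at least one label")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def to_broad(labels):
    """Collapse six-level labels (e.g. 'B2') into three bands (e.g. 'B')."""
    return [BROAD_LEVEL[label] for label in labels]


# Hypothetical expert ratings and model predictions for six texts.
gold = ["A1", "A2", "B1", "B2", "C1", "C2"]
pred = ["A2", "A2", "B1", "C1", "C1", "C1"]

six_level = accuracy(gold, pred)
three_level = accuracy(to_broad(gold), to_broad(pred))

print(f"six-level accuracy:   {six_level:.3f}")
print(f"three-level accuracy: {three_level:.3f}")
```

In this toy case, errors between neighbouring levels in the same band (A1 predicted as A2, C2 predicted as C1) stop counting as errors once labels are collapsed, which is why accuracy at three levels can never be lower than accuracy at six levels for the same set of predictions.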

Information

Type: Research Article
Creative Commons: CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright: © The Author(s), 2025. Published by Cambridge University Press

Table 1. Basic information of the selected texts and their related CEFR ranks

Table 2. Correlations between CEFR ranks with selected linguistic complexity indices

Figure 1. Random forest classification of English texts for L2 learners. Note: RF = random forest machine.

Figure 2. The working mechanism of an RF classifier.

Table 3. The predictive performance of the RF model on six CEFR levels

Table 4. Confusion matrix for the testing data based on six CEFR levels

Figure 3. Linguistic feature importance: Gini and accuracy. Note: A = Kuperman age of acquisition scores (CWs); B = lexical decision time (CWs); C = phonographic neighbors (CWs); D = academic words; E = mean length of T-unit; F = complex nominals per clause; G = LDA age of exposure (inverse slope); H = argument type-token ratio; I = McDonald word co-occurrence probability; J = Brysbaert concreteness scores (AWs); K = moving average type-token ratio with 50-word window; L = Brysbaert concreteness scores (CWs); M = bigram lemma type-token ratio; N = MRC word familiarity scores; O = pronoun to noun ratio; P = phonographic neighbors (FWs); Q = COCA academic bigram association strength (DPs); R = lexical density (types); S = overlap of lemmas across adjacent sentences; T = lexical density (tokens); U = COCA academic bigram to unigram association strength (DPs); V = coordinate phrases per T-unit; W = overlap of lemmas across adjacent two sentences; X = phonological neighbors (FWs); LDA = Latent Dirichlet Allocation.

Table 5. The predictive performance of the RF model on three CEFR levels

Table 6. Confusion matrix for the testing data based on three CEFR levels

Supplementary material: File

Zhang and Lu supplementary material
