
Textual form features for text readability assessment

Published online by Cambridge University Press:  11 November 2024

Wenjing Pan
Affiliation:
School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, China
Xia Li*
Affiliation:
School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, China Center for Linguistics and Applied Linguistics, Guangdong University of Foreign Studies, Guangzhou, China
Xiaoyin Chen
Affiliation:
College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
Rui Xu
Affiliation:
College of Engineering, Carnegie Mellon University, Silicon Valley, USA
Corresponding author: Xia Li; Email: xiali@gdufs.edu.cn

Abstract

Text readability assessment aims to automatically evaluate how difficult a given text is to read for a specific group of readers. Most previous studies have treated it as a classification task and explored a wide range of linguistic features that express the readability of a text from different aspects, such as semantic-based and syntactic-based features. Intuitively, as the external form of a text becomes more complex, readers experience more difficulty. Motivated by this, our research attempts to separate the external textual form from the text and investigate its effectiveness in determining readability. Specifically, in this paper we introduce a new concept, textual form complexity, to provide a novel perspective on text readability. The main idea is that the readability of a text can be measured by how challenging it is for readers to overcome the distractions of its external form and obtain its core semantics. To this end, we propose a set of textual form features that express the complexity of the outer form of a text and characterize its readability. Our findings show that the proposed external textual form features serve as effective indicators of text readability, bringing a new perspective to existing research and complementing the rich feature sets already in use.

Information

Type
Article
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Figure 1. Diagram of the distinction between outer textual form complexity and inner semantics. Four texts are presented in terms of their inner semantics and outer textual form complexity. While text 1 and text 2 convey the same inner semantics, text 2 has a more sophisticated outer textual form than text 1. Although the inner semantics of text 3 and text 4 are more difficult to comprehend, text 3 expresses them in a simple form, whereas text 4 expresses them in a more complex one.

Figure 2. An instance of the novel perspective on text readability. Three parallel texts with the same core semantics, drawn from different readability levels of the OneStopEnglish corpus, are given. The exterior textual form within the texts is marked with underlines. As can be seen, textual form complexity is positively correlated with the amount of exterior textual form, and the exterior textual form negatively affects text readability.

Figure 3. The flowchart of our method. We begin by simplifying the original texts. The original and simplified texts are then passed as inputs to feature extraction to obtain the three subsets of textual form features. Using these features, we train a classification model to predict text readability.

Table 1. Linguistic obstacles and simplification operations to remove them

Figure 4. Illustration of the simplification process. We use the sentence-level simplification model MUSS to simplify each sentence $s_j$ in each text $d_i$ of the original document set $D_{orig}$, obtaining a simplified sentence $s_j'$. Finally, we combine the simplified sentences of each text into $d_i'$ to produce a parallel simplified document set $D_{simp}$.
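The per-document simplification step described above can be sketched as follows. This is a minimal sketch: `simplify_sentence` is a hypothetical stand-in for the MUSS sentence-level simplifier, not its actual API.

```python
def build_simplified_corpus(d_orig, simplify_sentence):
    """Produce the parallel simplified document set D_simp.

    d_orig: list of documents, each a list of sentences s_j.
    simplify_sentence: callable mapping a sentence s_j to s_j'
        (a hypothetical stand-in for the MUSS sentence-level model).
    """
    d_simp = []
    for d_i in d_orig:
        # Simplify every sentence of d_i, then recombine into d_i'.
        d_i_prime = [simplify_sentence(s_j) for s_j in d_i]
        d_simp.append(d_i_prime)
    return d_simp
```

With a toy simplifier (here just lowercasing), `build_simplified_corpus([["A B", "C D"], ["E F"]], str.lower)` yields a parallel corpus aligned document-by-document and sentence-by-sentence with the original.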

Figure 5. The extraction process of the feature $BertSim$ in step 2. The sentence representations $\vec{v_{s_k}}$ and $\vec{v_s^i}$ are obtained in step 2. For each text $d_i$, we average each dimension of the sentence representations to obtain the text representation: $\vec{v_{d_i}}$ for the original text and $\vec{v_{d_i'}}$ for the simplified text. Finally, the inner product of $\vec{v_{d_i}}$ and $\vec{v_{d_i'}}$ is computed as the value of the feature $BertSim$.
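The $BertSim$ computation described in the caption can be sketched as follows, assuming sentence embeddings are already available as lists of floats (plain Python lists stand in for BERT sentence representations here):

```python
def mean_vector(sentence_vectors):
    """Average each dimension over the sentence representations
    to obtain a single text representation."""
    n = len(sentence_vectors)
    dim = len(sentence_vectors[0])
    return [sum(v[k] for v in sentence_vectors) / n for k in range(dim)]

def bert_sim(orig_sentence_vecs, simp_sentence_vecs):
    """Inner product of the original and simplified text representations."""
    v_d = mean_vector(orig_sentence_vecs)       # text vector of d_i
    v_d_prime = mean_vector(simp_sentence_vecs) # text vector of d_i'
    return sum(a * b for a, b in zip(v_d, v_d_prime))
```

For example, `bert_sim([[1, 0], [0, 1]], [[1, 1]])` averages the original sentences to `[0.5, 0.5]`, the simplified text to `[1.0, 1.0]`, and returns their inner product, `1.0`.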

Figure 6. An instance of the word-level overlap between the sentences $s_j$ (before simplification) and $s_j'$ (after simplification). We investigate the overlap from the perspective of part of speech: the common nouns, adjectives, and verbs in the two sentences are counted separately.
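The per-POS overlap count can be sketched as below. To avoid assuming any particular tagger, the input is pre-tagged (token, POS) pairs; the tag names `NOUN`, `ADJ`, and `VERB` are illustrative.

```python
from collections import Counter

CONTENT_TAGS = {"NOUN", "ADJ", "VERB"}  # nouns, adjectives, verbs

def pos_overlap(tagged_orig, tagged_simp):
    """Count shared nouns, adjectives, and verbs between s_j and s_j'.

    Each argument is a list of (token, pos_tag) pairs.
    Returns a dict mapping each POS tag to the overlap count.
    """
    overlap = {}
    for tag in CONTENT_TAGS:
        orig_counts = Counter(w for w, t in tagged_orig if t == tag)
        simp_counts = Counter(w for w, t in tagged_simp if t == tag)
        # Multiset intersection: tokens of this POS present in both sentences.
        overlap[tag] = sum((orig_counts & simp_counts).values())
    return overlap
```

Using `Counter` intersection (rather than plain sets) means repeated words are counted once per shared occurrence, which matters when a sentence repeats a content word.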

Table 2. The hyperparameter settings in RandomForest and SMO

Table 3. Statistics for the OneStopEnglish corpus

Table 4. Statistics for the NewsEla corpus

Table 5. Statistics for the WeeBit corpus

Table 6. Statistics for the Cambridge English exam data

Figure 7. Diagram of the experimental design for validation on the parallel corpus. Documents from the most simplified level of the parallel corpus are used to facilitate feature extraction, while documents from the other levels are used to train the model and make predictions. For each document in the training set, we select its simplest version and use the aligned sentence pairs in the two texts to extract our proposed textual form features.

Table 7. Simplification examples from the Cambridge corpus using MUSS and ReWordify tools

Table 8. Performance comparison of RandomForest and SMO trained with baseline Coh-Metrix features and added textual form features on OneStopEnglish and NewsEla parallel corpora

Table 9. Performance comparison of RandomForest with different types of feature sets on the WeeBit and Cambridge corpora

Table 10. Performance comparison of SMO with different types of feature sets on the WeeBit and Cambridge corpora

Table 11. Performance comparison before and after integrating our proposed features into SOTA models

Table 12. Performance comparison of our models and existing readability assessment models on the WeeBit corpus

Figure 8. Top 15 feature importance ranks on the WeeBit corpus. The $x$-axis reports MDI scores; the $y$-axis lists the top 15 features, with rank increasing from top to bottom. Blue bars represent our proposed features; gray bars represent Coh-Metrix features.

Figure 9. Top 15 feature importance ranks on the Cambridge corpus. The $x$-axis reports MDI scores; the $y$-axis lists the top 15 features, with rank increasing from top to bottom. Blue bars represent our proposed features; gray bars represent Coh-Metrix features.

Table 13. The classification accuracy on the VikiWiki-Es/Fr corpus

Table 14. Performance comparison of our method using different simplification tools on the WeeBit corpus

Table 15. Performance comparison of our method using different simplification tools on the Cambridge corpus