Sentence-level detection of Hungarian plain language with feature-guided augmentation

Published online by Cambridge University Press: 19 February 2026

István Üveges*
Affiliation:
ELTE Centre for Social Sciences, Hungary

Abstract

In this study, we investigate Hungarian Plain Language (PL) and Simple Language (SL) with the primary objective of training a machine-learning-based sentence-level PL model that flags sentences where expert intervention may be needed during PL-oriented rewriting. The analysis uses a legal-administrative PL corpus and a news-based SL corpus, currently the only publicly available high-quality Hungarian resources for PL and SL. In low-resource settings, PL data are typically scarce, so selective data augmentation is a natural candidate for improving model performance. Our aims are threefold: (i) to provide a feature-based descriptive comparison of these Text Simplification resources; (ii) to test whether selectively chosen SL sentences can augment PL training data; and (iii) to evaluate the impact of such augmentation on sentence-level PL detection. Methodologically, we extract handcrafted linguistic features spanning surface, morphosyntactic and discourse properties. We derive a PL-likeness score from logistic-regression coefficients and use it to select SL sentences most similar to PL for augmentation, followed by supervised sentence-level PL detection with XLM-RoBERTa-large. Results show clear differences between PL and SL in sentence length, lexical diversity, syntactic depth and connective use. Selective inclusion of SL sentences yields modest gains in constrained settings, whereas indiscriminate mixing reduces precision and reliability.
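The selection step described above can be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes the PL-likeness score is the linear logit of the logistic-regression model (intercept plus the coefficient-weighted sum of standardized feature values), and the feature names, coefficient values and the helper names `pl_likeness` and `select_top_fraction` are hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical coefficients standing in for those of a logistic regression
# trained on PL data (plain vs. non-plain). Values are illustrative only.
COEFS: Dict[str, float] = {
    "avg_token_len": -0.8,
    "tree_depth": -0.5,
    "connective_count": 0.3,
}
INTERCEPT = 0.0


def pl_likeness(features: Dict[str, float],
                coefs: Dict[str, float] = COEFS,
                intercept: float = INTERCEPT) -> float:
    """Assumed score: intercept + sum of coefficient * standardized feature."""
    return intercept + sum(w * features.get(name, 0.0)
                           for name, w in coefs.items())


def select_top_fraction(sentences: Sequence[str],
                        scores: Sequence[float],
                        fraction: float) -> List[str]:
    """Return the highest-scoring share of sentences (e.g. 0.05, 0.10, 0.25)."""
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, int(len(sentences) * fraction))
    # Sort by descending score; ties keep the original sentence order.
    order = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))
    return [sentences[i] for i in order[:k]]


# Toy SL sentences with made-up standardized feature values.
sl_features = {
    "s1": {"avg_token_len": 1.0, "tree_depth": 1.0, "connective_count": 0.0},
    "s2": {"avg_token_len": -1.0, "tree_depth": -1.0, "connective_count": 1.0},
    "s3": {"avg_token_len": 0.0, "tree_depth": 0.0, "connective_count": 0.0},
    "s4": {"avg_token_len": -0.5, "tree_depth": 0.0, "connective_count": 0.0},
}
sl_sentences = list(sl_features)
sl_scores = [pl_likeness(sl_features[s]) for s in sl_sentences]
augmentation_pool = select_top_fraction(sl_sentences, sl_scores, 0.25)
```

In this sketch only the selected sentences would be added to the PL training data; the indiscriminate mixing that the abstract warns against corresponds to `fraction=1.0`.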

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press

Table 1. Sentence and token count in the PL corpus

Table 2. Sentence and token counts in the HunSimpleNews corpus

Table 3. Base corpus statistics. (Avg. tokens = average number of tokens per sentence, Avg. characters = average number of characters per sentence, # Sentences = total number of sentences)

Table 4. Lexical diversity and hapax legomena ratio across corpus and variant. (Tokens = total number of tokens, Unique Lemmas = total number of unique lemmas, TTR = type–token ratio, Hapax Ratio = proportion of words occurring only once – hapax legomena)

Table 5. The top 10 POS categories in the HunSimpleNews corpus (HSN), expressed as percentages. (NOUN = Nouns, ADJ = Adjectives, PUNCT = Punctuation, DET = Determiners, VERB = Verbs, PROPN = Proper Nouns, ADV = Adverbs, CCONJ = Coordinating Conjunctions, PRON = Pronouns, ADP = Adpositions)

Table 6. The top 10 POS categories in the Plain Language corpus (PLC), expressed as percentages. (NOUN = Nouns, ADJ = Adjectives, PUNCT = Punctuation, DET = Determiners, VERB = Verbs, PROPN = Proper Nouns, ADV = Adverbs, CCONJ = Coordinating Conjunctions, PRON = Pronouns, ADP = Adpositions)

Table 7. Morphological type richness across corpus and variant. (Avg. types/sentence = average number of unique word forms per sentence, Morph. TTR = morphological type–token ratio)

Table 8. Syntactic complexity measures by corpus and variant. (Avg. Tree Depth = average depth of the syntactic parse tree per sentence, Embedded Clause Ratio = proportion of clauses embedded within sentences)

Table 9. Average number of discourse connectives per sentence across corpora and variants

Table 10. Percentage of sentences containing discourse connectives across corpora

Table 11. Flesch readability scores and surface text characteristics across corpus and variant

Figure 1. Coefficients from logistic regression trained on PL sentences only. Positive values (blue) indicate greater likelihood of ‘plain’ classification, whereas negative values (red) indicate greater likelihood of ‘non-plain’ classification.

Figure 2. Coefficients from logistic regression trained on simple versus standard HSN sentences. Positive values (blue) indicate associations with the ‘simple’ label, whereas negative values (red) indicate associations with the ‘standard’ label.

Table 12. Comparison of logistic regression coefficients for selected linguistic features

Table 13. Baseline and augmented training/validation compositions

Table 14. Classification metrics (precision, recall, F1) for the PL-only model, evaluated on the held-out PL test set (non-plain versus plain instances)

Figure 3. Confusion matrix for the PL-only model. Label 1 = plain; label 0 = non-plain.

Figure 4. Distribution of PL-likeness scores for HSN/simple sentences. The top chart shows the full range, the bottom chart zooms into the 1st to 99th percentiles. Vertical lines indicate cutoffs for the top 5%, 10% and 25% of scores.

Figure 5. Effect of PL-likeness-based augmentation with HSN data on classification performance across F1-score, precision and recall for both classes.

Table 15. Classification results across augmentation levels

Table A1. Top 10 linguistic features and coefficients from the PL (logistic regression) model used for scoring HSN sentences