
Propagating machine translation traits to predict potential impact on the target language

Published online by Cambridge University Press:  24 June 2025

Nora Aranberri*
Affiliation:
HiTZ Center, University of the Basque Country UPV/EHU, Donostia-San Sebastián, Spain
Jose A. Pascual
Affiliation:
Intelligent Systems Group, University of the Basque Country UPV/EHU, Donostia-San Sebastián, Spain
*Corresponding author: Nora Aranberri; Email: nora.aranberri@ehu.eus

Abstract

Research suggests that texts produced using machine translation (MT) do not fully represent the linguistic traits of natural language. Yet the ever-increasing quality of, and access to, MT is resulting in its steady adoption by both language professionals and general users. According to contact linguistics theories, such adoption might result in MT-specific language traits permeating the target languages. This work takes a first step towards considering the changes that a language might undergo over time by observing the variation of linguistic trends along a series of MT generations. We train ten sequential engines, using each to produce the target side of the training corpus of the next, and calculate a number of metrics to observe linguistic diversity at the lexical, morphological, and syntactic levels for a large, fixed test set. Quantitative results show an initial loss of lexical diversity, which continues in the following MT generations, albeit at a much slower pace. In turn, structural variations and, in particular, morphological variations across generations are less marked, which might indicate more stable behaviour regarding grammatical consistency. Overall, the resulting MT language seems increasingly homogeneous, marked by the reduced presence or disappearance of low-frequency words, and compact, with a decreasing proportion of function words relative to content words.
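
The generational set-up described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: `run_generations`, `toy_train`, and the `n_engines` parameter are hypothetical names introduced here, and the toy trainer merely memorises sentence pairs and drops the final token so that the loop has something to degrade. A real experiment would plug in an actual MT training routine in place of `toy_train`.

```python
from typing import Callable, List, Tuple

Corpus = List[Tuple[str, str]]   # (source, target) sentence pairs
Engine = Callable[[str], str]    # translates one source sentence


def run_generations(
    corpus: Corpus,
    test_sources: List[str],
    train: Callable[[Corpus], Engine],
    n_engines: int,
) -> List[List[str]]:
    """Train engines in sequence and return each engine's test-set output.

    After every round, the target side of the training corpus is replaced
    by the current engine's translation of the source side, so the next
    engine learns from machine-translated targets.
    """
    outputs = []
    for _ in range(n_engines):
        engine = train(corpus)
        # The test set stays fixed; only the engine changes.
        outputs.append([engine(src) for src in test_sources])
        # Regenerate the target side for the next generation.
        corpus = [(src, engine(src)) for src, _ in corpus]
    return outputs


def toy_train(corpus: Corpus) -> Engine:
    """Hypothetical stand-in for MT training: memorise the pairs and drop
    the last token of each target to mimic a small loss per generation."""
    memory = dict(corpus)

    def engine(source: str) -> str:
        tokens = memory.get(source, source).split()
        return " ".join(tokens[:-1] if len(tokens) > 1 else tokens)

    return engine


if __name__ == "__main__":
    seed = [("a b c", "x y z")]
    for generation, output in enumerate(run_generations(seed, ["a b c"], toy_train, 3)):
        print(generation, output)
```

Keeping the trainer as a parameter separates the propagation logic from any particular MT toolkit, which is the only part this sketch is meant to show.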

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Table 1. The Replicator model adapted to language evolution (from Steels (2017: 202))

Figure 1. Representation of our MT training (top) and testing (bottom) approach.

Table 2. Global quality automatic scores as reported by BLEU, TER, chrF, BLEURT, and COMET, where the output of each MT generation (Gen.), MT$_{0}$ to MT$_{10}$, is compared against the Spanish side of the test set. Scores are provided within a 0–100 range, with the best-scoring MT generation for each metric in bold. The $\downarrow$ symbol indicates that lower values of the metric correspond to better performance

Table 3. Lexical diversity and density scores as reported by type and token counts, type-token ratio (TTR), measure of textual lexical diversity (MTLD), Yule’s K, and lexical density (LD) for the Spanish side of the test set and the output of each MT generation (Gen.), MT$_{0}$ to MT$_{10}$, where the generation with the highest diversity for each metric is in bold. The $\downarrow$ symbol indicates that lower values of the metric correspond to better performance
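
Three of the measures named in this caption can be sketched in a few lines of Python. The definitions below are the common textbook ones and may differ in detail from the article's implementation: TTR as types over tokens, Yule's K as $10^{4}\,(\sum_i f_i^{2} - N)/N^{2}$ over type frequencies $f_i$ and token count $N$, and lexical density as the share of content-word tokens. The content-word tag set follows the categories listed in the figure captions (NOUN, PROPN, VERB, ADJ, ADV); the function names are introduced here for illustration, and inputs are assumed to be non-empty.

```python
from collections import Counter
from typing import Iterable, List

# Content-word categories as listed in the figure captions (assumed to be
# the ones that count towards lexical density).
CONTENT_TAGS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}


def type_token_ratio(tokens: List[str]) -> float:
    """Number of distinct word forms divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)


def yules_k(tokens: List[str]) -> float:
    """Yule's K; higher values indicate more repetition (less diversity)."""
    n = len(tokens)
    sum_sq = sum(f * f for f in Counter(tokens).values())
    return 1e4 * (sum_sq - n) / (n * n)


def lexical_density(pos_tags: Iterable[str]) -> float:
    """Share of tokens whose POS tag marks a content word."""
    tags = list(pos_tags)
    return sum(tag in CONTENT_TAGS for tag in tags) / len(tags)


if __name__ == "__main__":
    tokens = "a a b c".split()
    print(type_token_ratio(tokens))                          # 0.75
    print(yules_k(tokens))                                   # 1250.0
    print(lexical_density(["DET", "NOUN", "VERB", "ADP"]))   # 0.5
```

Raw TTR is sensitive to text length, which is one reason length-robust measures such as MTLD and Yule's K are usually reported alongside it.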

Figure 2. Normalised token counts for content words across MT generations, MT$_{0}$ to MT$_{10}$, where NOUN refers to nouns, PROPN to proper nouns, VERB to lexical verbs, ADJ to adjectives, and ADV to adverbs.

Figure 3. Normalised token counts for function words across MT generations, MT$_{0}$ to MT$_{10}$, where DET refers to determiners, PRON to pronouns, ADP to prepositions, AUX to auxiliary verbs, CCONJ to coordinating conjunctions, and SCONJ to subordinating conjunctions.

Figure 4. Normalised type counts for content words across MT generations, MT$_{0}$ to MT$_{10}$, where NOUN refers to nouns, PROPN to proper nouns, VERB to lexical verbs, ADJ to adjectives, and ADV to adverbs.

Figure 5. Normalised type counts for function words across MT generations, MT$_{0}$ to MT$_{10}$, where DET refers to determiners, PRON to pronouns, ADP to prepositions, AUX to auxiliary verbs, CCONJ to coordinating conjunctions, and SCONJ to subordinating conjunctions.

Table 4. Counts of types across frequency bands for the Spanish side of the test set and the output of each MT generation, MT$_{0}$ to MT$_{10}$ (Gen.), where Band 1 refers to words that appear over 100,000 times, Band 2 99,999–10,000 times, Band 3 9,999–1,000 times, Band 4 999–100 times, Band 5 99–10 times, and Band 6 9–1 times
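
The banding in this caption amounts to bucketing each word type by its token frequency. A minimal Python sketch follows, with `band_of` and `types_per_band` as hypothetical names. One assumption is made explicit: the caption defines Band 1 as "over 100,000" and Band 2 as starting at 99,999, so a count of exactly 100,000 is not covered; the sketch places it in Band 1.

```python
from collections import Counter
from typing import Dict, List

# (band, inclusive lower bound), highest band first. Assigning exactly
# 100,000 occurrences to Band 1 is an assumption, not stated in the caption.
LOWER_BOUNDS = [(1, 100_000), (2, 10_000), (3, 1_000), (4, 100), (5, 10), (6, 1)]


def band_of(count: int) -> int:
    """Return the frequency band for a type that occurs `count` times."""
    for band, lower in LOWER_BOUNDS:
        if count >= lower:
            return band
    raise ValueError("count must be at least 1")


def types_per_band(tokens: List[str]) -> Dict[int, int]:
    """Count how many distinct types fall into each frequency band."""
    result = {band: 0 for band, _ in LOWER_BOUNDS}
    for count in Counter(tokens).values():
        result[band_of(count)] += 1
    return result


if __name__ == "__main__":
    sample = ["a"] * 12 + ["b"] * 3 + ["c"]
    print(types_per_band(sample))
```

Comparing these per-band counts across generations is what exposes the thinning of the low-frequency bands mentioned in the abstract.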

Table 5. Morphological variety scores as reported by Shannon Entropy and Simpson’s Diversity Index for the Spanish side of the test set and the output of each MT generation, MT$_{0}$ to MT$_{10}$ (Gen.), where the generation with the highest diversity for each metric appears in bold. The $\downarrow$ symbol indicates that lower values of the metric correspond to better performance
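
Both indices in this caption are functions of a frequency distribution. The Python sketch below computes them from raw counts, for instance the counts of the different surface forms observed for a single lemma; that unit of analysis is an assumption here, as the caption does not specify it. Shannon entropy is taken as $H = -\sum_i p_i \log_2 p_i$. Simpson's index exists in several variants ($\sum_i p_i^{2}$, $1 - \sum_i p_i^{2}$, and $1/\sum_i p_i^{2}$); the sketch uses the first, for which lower values mean more diversity, and the article may use another.

```python
import math
from typing import Sequence


def shannon_entropy(counts: Sequence[int]) -> float:
    """Entropy in bits of the distribution given by raw counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


def simpson_index(counts: Sequence[int]) -> float:
    """Sum of squared proportions: the probability that two draws (with
    replacement) yield the same form. Lower values mean more diversity."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)


if __name__ == "__main__":
    evenly_spread = [2, 2]   # two forms, used equally often
    single_form = [4]        # one form only
    print(shannon_entropy(evenly_spread), simpson_index(evenly_spread))
    print(shannon_entropy(single_form), simpson_index(single_form))
```

A lemma realised by a single form yields zero entropy and a Simpson value of 1, while an even spread over several forms pushes entropy up and the Simpson value down.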

Table 6. Structural similarity scores as reported by the perplexity measure and posTER metric, for the Spanish side of the test set and the output of each MT generation, MT$_{0}$ to MT$_{10}$ (Gen.), where P:ES represents the perplexity of the Spanish LM at the word level, P:ES-POS represents the perplexity of the Spanish POS LM, and P:EN-POS represents the perplexity of the English POS LM. The $\downarrow$ symbol indicates that lower values of the metric correspond to better performance. The generation with the best scores for each metric appears in bold