Evaluating NMT using the non-inferiority principle

Published online by Cambridge University Press: 10 May 2024

María do Campo Bayón*
Affiliation:
Universitat Autònoma de Barcelona, Barcelona, Spain
Pilar Sánchez-Gijón
Affiliation:
Universitat Autònoma de Barcelona, Barcelona, Spain
*Corresponding author: María do Campo Bayón; Email: maria.docampo@autonoma.cat

Abstract

This article proposes a new neural machine translation (NMT) evaluation method based on the non-inferiority principle. To do so, we evaluate raw machine translation (MT) in terms of naturalness, defined for this research as not just the absence of fluency errors but also the fulfilment of the linguistic expectations that Galician end users have when reading original texts in Galician. Our main objective is, first, to validate the new methodology presented in our previous study by evaluating a Spanish-into-Galician NMT engine for the social media domain that was retrained with a new Twitter corpus. This new methodology and NMT engine were applied after analyzing the conclusions of a pilot survey conducted among Twitter users to evaluate their perception of tweets translated from Spanish into Galician with our NMT engine, which was created with a corpus of tweets. As in our preliminary study, our aim is to propose a robust quality approximation method based on the reception parameters of end users’ perceptions. The new survey was conducted in December 2022 with the participation of 228 Galician-speaking Twitter users. The main changes proposed are: the inclusion of more information about the participant profile, so that the non-inferiority principle can also be evaluated according to these parameters; the inclusion of a new type of tweet, the thread; the provision of context by presenting the tweets in their original display, as shown in the Twitter app; a change in the number of tweets evaluated and in the number of different questionnaires; a change in the distribution of the questionnaires; and the inclusion of an error-classification human evaluation conducted by professional linguists to correlate the findings.
We will present the steps carried out following the conclusions of the pilot study, describe the new study’s design, analyze the new findings, and present the final conclusions regarding the engine and the evaluation method based on the non-inferiority principle. Finally, we will also provide some examples of the use of this new methodology in the translation industry.

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press
Table 1. Distribution of tweets among the nine questionnaires

Table 2. Distribution of responses among the nine surveys

Table 3. Distribution of responses among age groups

Table 4. Tweets judged natural

Figure 1. Tweets judged natural.

Table 5. Global non-inferiority rate

Table 6. Non-inferiority rate by type of tweet

Table 7. Non-inferiority rate per type of tweet without random effects

Figure A1. Machine-translated short-sentence tweet.

Figure A2. Machine-translated long-sentence tweet.

Figure A3. Machine-translated short paragraph tweet.

Figure A4. Machine-translated long paragraph tweet.

Figure A5. Machine-translated thread.

Figure A6. Originally written short-sentence tweet.

Figure A7. Originally written long-sentence tweet.

Figure A8. Originally written short paragraph tweet.

Figure A9. Originally written long paragraph tweet.

Figure A10. Originally written thread.