Evaluating Optimal Reference Translations

The overall translation quality reached by current machine translation (MT) systems for high-resourced language pairs is remarkably good. Standard methods of evaluation are neither suitable nor intended to uncover the many translation errors and quality deficiencies that still persist. Furthermore, the quality of standard reference translations is commonly questioned and comparable quality levels have been reached by MT alone in several language pairs. Navigating further research in these high-resource settings is thus difficult. In this article, we propose a methodology for creating more reliable document-level human reference translations, called “optimal reference translations,” with the simple aim of raising the bar of what should be deemed “human translation quality.” We evaluate the obtained document-level optimal reference translations in comparison with “standard” ones, confirming a significant quality increase and also documenting the relationship between evaluation and translation editing.


Introduction
Machine translation (MT) is routinely evaluated using various segment-level similarity metrics against one or more reference translations. At the same time, reference translations acquired in the standard way are often criticized for their flaws of various types. For several high-resourced language pairs, MT quality reaches levels comparable to the quality of the reference translation (Freitag et al. 2022; Hassan et al. 2018) and sometimes MT even significantly surpasses humans in a particular evaluation setting (Popel et al. 2020). Given this, one could conclude that state-of-the-art MT has reached the point where reference-based evaluation is no longer reliable and we have to resort to other methods (such as targeted expert evaluation of particular outputs), even if they are costly, subjective and possibly impossible to automate.
The narrow goal of the presented work is to allow for an "extension of the expiry date" for reference-based evaluation methods. In a broader perspective, we want to formulate a methodology for creating reference translations which avoid the often-observed deficiencies of "standard" or "professional" reference translations, be it multiple interfering phenomena, inappropriate expressions, ignorance of topic-focus articulation (information structure) or other abundant shortcomings in the translation, indicating their authors' insensitivity to the topic itself, but above all to the source and target language. To this end, we introduce so-called optimal reference translations (ORT), which are intended to represent optimal (ideal or excellent) human translations (should they be the subject of a translation quality evaluation).a We focus on document-level translation and evaluation, which is in line with current trends in MT research (Maruf et al. 2019; Ma et al. 2020; Gete et al. 2022; Castilho 2022) and also this special issue of NLE. We hope that ORT will represent a new approach to the evaluation of excellent MT outputs by becoming a gold standard in the true sense of the word. Our work is concerned with the following questions:
• How to navigate future MT research for languages for which the quality level of MT is already very good?
• Is it worth creating an expensive optimal reference translation to compare with MT?
• If various groups of annotators evaluate optimal reference and standard translations, will they all recognize the difference in quality?
Subsequently, our contributions are:
• definition of optimal reference translation and an in-depth analysis of evaluations and the relationship between evaluation and translation editing;
• reflection on what it means to be a high-quality translation for different types of annotators;
• publication of the Optimal Reference Translations of English→Czech dataset with a subset evaluated in the aforementioned manner.
After discussing related work in this context (Section 2), we focus on defining ORT and describe its creation process (Section 3). Next, we describe our evaluation campaign of ORT, the data, annotation interface, and annotation instructions (Section 3.2). We then turn to a statistical perspective of our data and measure the predictability of human ratings (e.g. Overall rating from Spelling, Style, Meaning, etc.) using automated metrics (Section 4). We pay special attention to predicting document-level ratings from segment-level ones. In the penultimate Section 5, we provide a detailed qualitative analysis of human annotations and discuss this work in the greater perspective of human evaluation of translations (Section 6). Analysis code and collected data are publicly available.
proposed a new scoring metric which is focused primarily on meaning and emphasises adequacy rather than fluency, for several reasons (e.g., meaning preservation is a pressing challenge for low resource language pairs and assessing fluency is much more subjective).
Methods for automatic human translation quality estimation exist (Specia and Shah 2014; Yuan 2018), though the field focuses primarily on machine translation quality estimation. Furthermore, the definition of translation quality remains elusive and is plagued by subjectivity and low assessment agreement (House 2001; Kunilovskaya et al. 2015; Guerberof 2017).

Optimal Reference Translations
Our optimal reference translation (ORT) represents the ideal translation solution under the given conditions. Its creation is accompanied by the following phases and factors:
• diversity at the beginning (multiple translations are available from different translators, i.e., in principle there are at least two independently-created translations available),
• discussion among experienced translation theoreticians / linguists in search for the best possible solutions, leading to consensus,
• editing the newly created translations, reaching a point where none of the translation creators comes up with a better solution.
Another important condition is the documentation of all stages of the translation creation (archiving the initial solutions, notes on shortcomings, suggestions for other potential solutions, notes on translation strategies and procedures, record of the discussion among the authors, reasons why a solution was rejected, record of the amount of time spent on each text, etc.). The final characteristic of the creation of an optimal reference translation is the considerable amount of time spent by the creators on the analysis, discussion, and creation of new translations. In our definition of ORT, optimality therefore refers to:
• a carefully thought-out and documented translation process, and
• the quality of the resulting translation.
It does not, however, include the time aspect, in the sense of minimizing the time spent on the translation process. This choice is likely one more key distinction from "professional" translation. Incontestably, more than one version of ORT may be produced. The resulting ORT may vary depending on the individuality of its creators. Of course, the creators take into account the purpose and intended audience of ORT, just like in standard translations, but different collectives of ORT creators may perceive the intended purpose and audience differently or consider finer details of these aspects. Moreover, factors such as idiolect, age, experience, etc. can also play a large role, but unlike standard translations, there must always be a consensus among the creators of ORT.

Translation Creation
The underlying dataset without the evaluation has already been described in Czech (Kloudová et al. 2022). The 130 original English texts (news articles available from the Internet, covering topics ranging from politics and economics to sports and social events) were translated from English into Czech by three human translators for the Conference on Machine Translation 2020 (WMT20). The three translators were hired by WMT organizers from a translation agency. The resulting three independent parallel Czech translations (P1, P2, P3) serve as basic reference translations, from which a final "optimal reference translation" could be synthesized. It was anticipated that our creators of ORT (two translators-cum-theoreticians, i.e., professionals who deal with translation from both a practical and a theoretical point of view)b would always choose the best translation solutions from the existing three versions, or create new solutions if necessary. However, the available translations from the first-stage translators were often of insufficient quality. Therefore, in the creation of our optimal reference translations, more emphasis was placed on the input of the creators of the final version rather than on the synthesis of existing translations.

Figure 1: Source segment (SRC) and example translations N1, P1, P2 and P3, each followed by its English gloss.
SRC: Professor Blair Grubb, Vice-Principal (Education) at the University, said: "To get to this stage, our students and graduates faced competition from peers attending some of the world's top universities."
N1: Prorektor pro oblast vzdělávání profesor Blair Grubb prohlásil: "Aby se naši studenti dostali až sem, museli čelit konkurenci svých vrstevníků, kteří studují na nejlepších světových univerzitách."
    (Vice-Principal for Education, professor Blair Grubb, said: "To get to this point, our students have had to face competition from their peers studying at the world's best universities.")
P1: Profesor Blair Grubb, univerzitní zástupce ředitele pro vzdělávání, uvedl: "Abychom se dostali až do této fáze, museli naši studenti a absolventi čelit konkurenci svých vrstevníků, kteří studují na nejlepších světových univerzitách."
    (Professor Blair Grubb, the University's Deputy Director of Education, said: "For us, to get to this stage, our students and graduates have had to face competition from their peers studying at the world's best universities.")
P2: Profesor Blair Grubb, zástupce děkana (vzdělávání) na univerzitě, řekl: "Aby se dostali do této fáze, čelili naši studenti a absolventi konkurenci svým vrstevníkům, kteří navštěvují některé z nejlepších světových univerzit."
    (Professor Blair Grubb, Associate Dean (Education) at the University, said: "To get to this stage, our students and graduates have faced competition to their peers who attend some of the world's top universities.")
P3: Profesor a zástupce ředitele pro vzdělávání Blair Brubb university uvedl: "Aby se naši studenti a absolventi dostali do této fáze, museli čelit vrstevníkům z několika nejlepších universit světa."
    (Professor and Deputy Director of Education at Blair Brubb University said: "To get to this stage, our students and graduates have had to face peers from several of the best universities in the world.")

The process of creating our ORT can be described as follows: our tandem of translators-cum-theoreticians worked as a translator and revisor pair. One of them produced a first version, which the other carefully compared with the original and critiqued if necessary. Notes on the first version of the translations were given in the form of comments on individual segments of the text. The author of the first version of the translations subsequently accepted or, with justification (and subsequent discussion), did not accept the suggestions in the comments. The crucial point in the discussion was always that the final solution should be fully in line with the beliefs of both translation authors. It is worth mentioning that the discussion between the two creators had, to a large extent, the written form of exchanging notes. ORT thus do not demand live, synchronous attention of the creators.
The result of this process was two versions of ORT (many more versions could have evolved, but our priority was quality above diversity, so we decided to create two versions in parallel, N1 and N2). The first version (denoted N1) is closer to the original both in terms of meaning and linguistic (especially syntactic) structure. The second version (N2) is probably more readable, idiomatic and fluent, being even closer to the Czech news style, both syntactically, e.g. by emphasizing the ordering of syntactic elements typical of news reporting, and lexically, e.g. by a more varied choice of synonyms. The presented work is centered around the evaluation of the various human translations. Because N2 has not been created for all segments of the translation (not all the original segments allowed an appropriate linguistic variation, i.e., N1 was identical to N2), we decided not to use it. Thus, four translations were included in the evaluation: one optimal reference translation (N1) in addition to the three existing human translations.
b One of them is a co-author of this article. However, the translations are later independently evaluated and hence, to the best of our knowledge and conscience, we do not consider this to be a conflict of interest or otherwise a methodological flaw.
In Figure 1, we show the sources and example translations (P1, P2, P3), together with one of the two versions of our optimal translation, N1. During the evaluation, each translation can be further edited by annotators, in which case we identify the resulting segment as, e.g., "P1 EDIT by annotator A4." We will encounter examples in Section 5.

Annotators
We hired 11 native Czech annotators for the evaluation of translations in three groups: (1) four professional translators,c (2) four non-experts, and (3) three students of the MA Study Programme Translation and/or Interpreting: Czech and English at the Institute for Translation Studies.d Their proficiency and end-campaign questionnaire responses are presented in Section 4.1.

Data
Out of the original data (Section 3.1), we randomly selected, with manual verification, 8 consecutive segments in each of 20 documents for annotation. We refer to these 8 segments as documents because they contain most of the documents' main points. Each segment corresponds approximately to one sentence, though they are longer (31 source tokens on average) than what we would find typical for the news domain. The data contain document-level phenomena (e.g. discourse), so segments cannot be translated and evaluated independently.

Annotation Interface
We provided the annotators with online spreadsheets which showed the source text and all four translation hypotheses. This way each translation could be compared against the others while having the context available (e.g. to check for consistency). Each hypothesis column was distinguished by a colour, as shown in Figure 2, and based on annotator feedback (Section 4.1), we believe that it was manageable to perform annotations despite the amount of information shown. We showed the rest of the segments in the source language for context but did not provide any translation hypotheses for the annotators to consult or rate. Each of the 20 documents was shown in a separate tab/sheet. The annotators worked on the evaluation over a span of 3 months in an uncontrolled environment.

Annotation Instructions
The task for annotators was three-fold (see Section 8 for the full annotation guidelines):
• Grade each segment translation on a decimal scale from 0 (least) to 6 (most) in categories Spelling, Terminology, Grammar, Meaning, Style, Pragmatics and Overall (e.g. 4.0 or 5.8). This scale was chosen to balance the number of attraction points for annotators (integers) and to contain a middle point (3).
• Grade each document as a whole on the same scale and categories.
• If a segment would not receive the highest grade, there would be something wrong in the translation. Therefore, the annotators should edit the hypothesis translation into a state to which they would give the maximal scores.
c Defining who is a professional translator is not easy. The factors influencing the degree of professionalism of a translator include, among others, education and experience. Professional translators in our study have at least one of the following: (1) completed an M.A. degree programme in English-Czech translation studies, (2) completed an M.A. degree programme in interpreting or philology, or (3) have at least 10 years of translation experience.

Annotator Questionnaire
After the annotation campaign, the annotators filled in a brief survey with questions about their perception of the task and their strategy. We did not constrain the annotators in what order they should perform the annotations. As a result, they employed various approaches, the most popular being segment-category-translation.e While we attempted not to introduce a bias, almost all annotators filled in categories one-by-one as they were organized in the user interface.f This could have an effect on the rating. For example, because the six specific features were rated first and drew the annotators' attention, the final Overall rating may have been influenced primarily by them, which might not have been the case had the ordering been reversed. Pragmatics and Overall were reported as the hardest to evaluate, while Spelling was the easiest, especially because errors in spelling can be seen even without deeper translatological analysis and there were not many of them in the translations. The annotators self-reported utilizing the preceding and following context around half the time to check for document-level consistency. While they proceeded mostly linearly, about 20% (self-reported estimate) of previously completed segments were later changed. We intentionally shuffled the ordering of translations (columns in each sheet) so that the annotators would not build a bias towards the translation source in e.g. the second column. However, the annotators reported that despite this, they were sometimes able to recognize a specific translation source based on various artifacts, such as systematically not translating or localizing foreign names.

Collected Annotations
We do not do any preprocessing or filtering of the collected data. This is justified by all our annotators working on the same set of documents and by the fact that we have established connections with each of the annotators and deem them trustworthy. Any bias of an annotator's rating would therefore be present in all documents, which would not hinder even absolute comparisons. Nevertheless, we examine annotator variation later in this section.
e I.e., first finish all annotation categories for the first translation, then all annotation categories for the second translation, etc., and afterwards move to the second segment.
f I.e., starting from Spelling and ending with Overall.
In total for 20 documents, we collected:
• 7k segment-level annotations (1.8k annotations of 4 translation hypotheses). Each hypothesis is edited unless it received a very high score (in 4k cases). This amounts to 49k ratings across all categories.
• 880 document-level annotations (220 annotations of 4 translation hypotheses). This amounts to 6.2k ratings across all categories.
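The reported totals can be cross-checked with simple arithmetic. The following is a minimal sketch, assuming that every one of the 11 annotators rated all 8 segments of all 20 documents for all 4 hypotheses in all 7 categories; the function and constant names are illustrative and not taken from the released analysis code.

```python
# Study design as stated in the text; full coverage by every annotator is an assumption.
ANNOTATORS = 11
DOCUMENTS = 20
SEGMENTS_PER_DOCUMENT = 8
HYPOTHESES = 4    # P1, P2, P3 and N1
CATEGORIES = 7    # Spelling, Terminology, Grammar, Meaning, Style, Pragmatics, Overall


def annotation_counts():
    """Return the expected number of annotations and ratings at both levels."""
    segment_rows = ANNOTATORS * DOCUMENTS * SEGMENTS_PER_DOCUMENT
    segment_annotations = segment_rows * HYPOTHESES
    document_rows = ANNOTATORS * DOCUMENTS
    document_annotations = document_rows * HYPOTHESES
    return {
        "segment_rows": segment_rows,
        "segment_annotations": segment_annotations,
        "segment_ratings": segment_annotations * CATEGORIES,
        "document_rows": document_rows,
        "document_annotations": document_annotations,
        "document_ratings": document_annotations * CATEGORIES,
    }


if __name__ == "__main__":
    for name, value in annotation_counts().items():
        print(f"{name}: {value}")
```

Under these assumptions the products are 1,760 and 7,040 segment-level annotations with 49,280 ratings, and 220 and 880 document-level annotations with 6,160 ratings, which is consistent with the rounded figures of 1.8k, 7k, 49k, 220, 880 and 6.2k.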

Quality of Initial Translations
Recall the grading scale from 0 (least) to 6 (most). The translation sources (P1, P2, and P3) were of varying quality, as shown in Figure 5. Overwhelmingly, N1 was evaluated the highest, followed by P1, P2 and P3, in this order. Furthermore, there is a strong connection between the ratings at the segment and document level and also across evaluation categories.
The density distribution of features in Figure 3 shows the natural tendency of annotators to use integer scores.It also shows that all features are heavily skewed towards high scores and that on average documents receive lower scores than their segments.

Inter-Annotator Agreement
To measure inter-annotator agreement, we aggregate pairwise annotator Pearson correlations on the segment level.g At first glance, this agreement is quite low (ρ = 0.33). It can, however, be explained upon closer inspection of agreement across translations. While inter-annotator correlations for the worst translation, P3, were ρ = 0.50, the best translation had ρ = 0.13. We hypothesize that with less variance, and therefore less signal for rating, the inter-annotator agreement drops. This is even more visible from the pairwise annotator correlations for the Grammar category, in which N1 has made almost no errors (ρ = 0.03). In 28% of cases, the ordering of Overall scores for segments was the same between pairs of annotators and in 66% of cases they differed by only one transposition. In other words, the difference in the score ordering was 2 positions or more only in 8% of cases. Further individual effects of annotators are discussed in Section 4.6.
g Even though the data are not normally distributed, the Pearson correlation reveals agreement controlled for each annotator's mean and variance.
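The agreement measure described above can be sketched as follows. The text does not specify how the pairwise correlations are aggregated, so this sketch assumes a plain average over all annotator pairs; the function names and the toy ratings in the usage note are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long lists; NaN if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def mean_pairwise_agreement(ratings):
    """Average Pearson correlation over all annotator pairs.

    `ratings` maps an annotator id to a list of scores for the same
    segments in the same order. Undefined (NaN) pairs are skipped.
    """
    values = []
    for a, b in combinations(sorted(ratings), 2):
        r = pearson(ratings[a], ratings[b])
        if r == r:  # NaN is the only value not equal to itself
            values.append(r)
    return sum(values) / len(values) if values else float("nan")
```

For instance, `mean_pairwise_agreement({"A1": [6, 5, 3, 4], "A2": [5, 5, 2, 4], "A3": [6, 6, 6, 5]})` averages the three pairwise correlations. Because the correlation is unchanged when one annotator's scores are shifted or rescaled by a positive factor, a strict and a lenient annotator who rank segments identically still count as agreeing.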

Modelling Overall Quality from Components
In this section we attempt to model the Overall category based on individual categories, degree of translation editing and individual annotators.

Other Categories Individually
We first consider the predictability of individual categories and measure it using Pearson's correlation (0 = no relationship, 1 = perfect linear relationship). For both the document and segment level, we observe similar correlations, see Figure 4. Notably, Spelling is much less predictive of other categories than the rest. A possible explanation is that this was the least common mistake and the values are therefore concentrated around the highest possible score (Figure 3). Overall correlates the most with Meaning and Style. This can be explained similarly, because those features had the largest variances.

Linear Regression on Other Categories
We treat the prediction of Overall from other categories as a regression task with 6 numerical input features (Spelling, Terminology, etc.) and one numerical output feature (Overall). We subtract the mean so that only the variance is preserved and the learned coefficients of a linear regression model can be interpreted. We split document- and segment-level ratings into train/test as 778/100 and 6925/100, respectively. Figure 6 shows the results of fitting two linear regression models together with the coefficients of individual variables. Because the distributions of features are similar, as documented in Figure 3, we can interpret the magnitude of the coefficient as the importance in determining the Overall score. For both the document and segment level, Spelling and Meaning have the highest impact while Terminology and Style have the least impact.h The linear regression model [...]
h These interpretations are, however, not fully conclusive because of a possible latent co-dependent variable. The Overall variable may in reality be largely dependent on another variable X for which we do not have annotations. One hypothetical translation source could be very good if measured on the X variable and also Overall and, unrelated to that, also good at the Spelling level, which would yield similar results to those presented.

[Figure 7: correlations between automated metric scores, computed between the original and post-edited segments, and the annotator ratings in each category.]

As mentioned in Section 3.2, annotators were tasked to post-edit texts to a state which they would be content with. As a result, the annotators post-edited 62% of all the segments on average. We compute several automatic metric scores between the original and edited versions of segments and compare them to the collected scores, such as Overall. This allows us to answer the question: Does the post-editing distance (as measured by automated metrics) correspond to the annotator score (negatively)? The results in Figure 7 show that there is very little difference between individual metrics. Most score categories are equally predictive with the exception of Overall (most) and Spelling (least). For Spelling, the explanation is again (Section 4.4) its much lower variance. Overall, the more the annotators changed the original text in their post-editing, the lower the score they assigned to the hypothesis. Including the metrics in the prediction of Overall in Section 4.5.2 does not provide any additional improvement on top of other categories (final segment-level ρ is still 0.93).
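The comparison between post-editing distance and assigned score can be illustrated with a small sketch. It does not reproduce the metrics used in the study: as a stand-in, it uses a token-level dissimilarity based on `difflib.SequenceMatcher` from the Python standard library, and the English toy segments and scores are hypothetical.

```python
from difflib import SequenceMatcher
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long lists; NaN if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def edit_distance(original, edited):
    """Token-level dissimilarity in [0, 1]; 0 means the segment was left unchanged."""
    return 1.0 - SequenceMatcher(None, original.split(), edited.split()).ratio()


# Hypothetical toy data: (original hypothesis, post-edited version, Overall score).
EXAMPLES = [
    ("the cat sat on the mat", "the cat sat on the mat", 6.0),
    ("the cat sat on the mat", "the cat sat on a mat", 5.0),
    ("the cat sat on the mat", "a cat was sitting on a rug", 2.0),
]


def distance_score_correlation(examples):
    """Correlate the amount of post-editing with the score given before editing."""
    distances = [edit_distance(orig, edit) for orig, edit, _ in examples]
    scores = [score for _, _, score in examples]
    return pearson(distances, scores)
```

On the toy data the correlation is negative, matching the direction reported in the text: segments that needed heavier editing received lower scores. A similarity metric in place of the dissimilarity would flip the sign.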

[Figure 8: distribution of segment-level Overall ratings by annotator group (translators, students, non-translators).]

Annotator Differences
Recall that we considered three types of annotators: professional translators, students of translation and non-translators. Despite the same annotation guidelines, their approach to the task was vastly different. For example, Figure 8 shows the distribution of segment-level ratings of Overall. Professional translators produced a much more varied and spread-out distribution, especially compared to non-translators, who rated most segments very high. The group differences should be taken into account when modelling the annotation process statistically. When predicting segment-level Overall from other categories, as in Section 4.5.2, the individual annotator Pearson correlation ranges from as high as 0.98 to as low as 0.59. Similar to the results of Karpinska et al. (2021), we find that expert annotators are important and are less noisy. The average correlations with Overall for the translator, student and non-translator groups are 0.93, 0.91 and 0.80, respectively. The expertise feature alone yields a 0.36 correlation with Overall and the identity of the individual annotator alone 0.45. This is expected, as the groups and individual annotators have different means of the variable. This information can be used in combination with other predictive features to push the segment-level correlation from 0.93 (Figure 6) to 0.95. Greater improvement is achieved when combined with the editing distance, such as pushing BLEU from 0.66 (Figure 7) to 0.76 when individual annotators are considered as an input feature (one-hot encoded).
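Why annotator identity helps can be seen in a small sketch. One-hot annotator features amount to a separate intercept per annotator, and for a single numeric predictor the least-squares fit then has a closed form: a shared within-annotator slope plus each annotator's own means. The code below uses this closed form rather than a general regression library; the data and names are hypothetical and the numbers are not those of the study.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long lists; NaN if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def predict_with_annotator_intercepts(rows):
    """Least-squares predictions of y from x plus one intercept per annotator.

    `rows` is a list of (annotator, x, y) tuples.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for annotator, x, y in rows:
        entry = sums[annotator]
        entry[0] += x
        entry[1] += y
        entry[2] += 1
    means = {a: (sx / n, sy / n) for a, (sx, sy, n) in sums.items()}
    numerator = denominator = 0.0
    for annotator, x, y in rows:
        mean_x, mean_y = means[annotator]
        numerator += (x - mean_x) * (y - mean_y)
        denominator += (x - mean_x) ** 2
    slope = numerator / denominator if denominator else 0.0
    return [means[a][1] + slope * (x - means[a][0]) for a, x, _ in rows]


# Hypothetical toy data: (annotator, editing distance, Overall score).
# Annotator "A" is strict, annotator "B" is lenient.
ROWS = [
    ("A", 0.0, 4.0), ("A", 0.2, 3.0), ("A", 0.5, 1.5),
    ("B", 0.0, 6.0), ("B", 0.2, 5.5), ("B", 0.5, 4.5),
]

distances = [x for _, x, _ in ROWS]
scores = [y for _, _, y in ROWS]
baseline = abs(pearson(distances, scores))
with_annotators = pearson(predict_with_annotator_intercepts(ROWS), scores)
```

In the toy data both annotators lower their score as the editing distance grows, but from different starting points, so the pooled correlation (`baseline`) is weaker than the correlation of the annotator-aware predictions (`with_annotators`).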

Modelling Document-level Scores
Our annotation instructions explicitly reminded annotators to always consider the context. In other words, our segment-level scores already reflect the coherence and cohesion of the whole text, i.e., how the text is organized and structured in the previous and/or subsequent segments. This is a rather important difference from automatic segment-level evaluation, which discards any context. Annotators reported that in deciding document-level scores, they focused on the segments which were previously rated the lowest; that is, an individual poorly rated segment greatly influences the rating of the whole. We consider this observation essential for various future translation evaluations. We confirm this with the results in Figure 9, where the min aggregation of segment-level ratings is a good predictor (comparable to or slightly better than avg) of the document-level rating. Based on segment-level ratings, we are able to predict document-level Overall quality with ρ = 0.71.
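The two aggregations compared above can be sketched as follows. The toy documents are hypothetical and deliberately constructed so that one weak segment pulls the document-level rating down; they illustrate the mechanism and are not results from the collected data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long lists; NaN if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def aggregate(segment_scores, how):
    """Collapse segment-level ratings into one predicted document-level rating."""
    if how == "min":
        return min(segment_scores)
    if how == "avg":
        return sum(segment_scores) / len(segment_scores)
    raise ValueError(f"unknown aggregation: {how}")


# Hypothetical toy data: (eight segment-level Overall ratings, document-level Overall).
DOCUMENTS = [
    ([6, 6, 6, 6, 6, 6, 6, 6], 6.0),
    ([6, 6, 2, 6, 6, 6, 6, 6], 3.0),
    ([5, 5, 5, 5, 5, 5, 5, 5], 5.0),
    ([6, 4, 6, 6, 6, 6, 6, 6], 4.5),
]


def aggregation_correlation(documents, how):
    """Correlation between aggregated segment ratings and document ratings."""
    predicted = [aggregate(segments, how) for segments, _ in documents]
    observed = [document_score for _, document_score in documents]
    return pearson(predicted, observed)
```

Calling `aggregation_correlation(DOCUMENTS, "min")` and `aggregation_correlation(DOCUMENTS, "avg")` shows how the minimum tracks the document rating more closely when a single poor segment dominates the annotator's impression, whereas the average dilutes it across the other seven segments.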
It is worth noting that a similarly high correlation (ρ = 0.70) is achieved when predicting the document-level Style from the corresponding segment-level ratings.This category was supposed to reflect also the coherence and cohesion of the document.Annotators saw the entire original text, but only evaluated certain translated segments.However, they were assumed to have read the entire source text and to use the information for their evaluation.This was reflected in the Style category.
In Example 1, the original translator (ORIG) did not consider the context of the whole document, translated only word for word and committed numerous interferences. In the context of Wimbledon, it is highly unusual to use the phrase "hostí turnament" (hosts the tournament, both words being examples of lexical interference from English). In the context of tennis, the phrase "venku na kurtech" (out on the courts) is also unusual. In Czech, feminine names are typically marked using Czech morphology (e.g. Serena Williams → Serena Williamsová), which is the form predominantly found in the press. In this sentence, the name Williams follows the names The Associated Press and CNN, which is very confusing for the Czech reader. The feminine form thus makes the whole text easier to interpret and understand. The evaluator has correctly intervened in the text by using collocations such as "pořádá wimbledonský turnaj" (hosts the Wimbledon tournament), writing about "venkovní kurty" (the outside courts) and using the feminine form "Williamsová". All these changes are highlighted in bold in Example 1 and demonstrate the evaluator's (translation student) sense of textual continuity and their knowledge of the overall global context, which should have been the task of the original translator.

If we take a closer look at the evaluation of all four translations by individual annotators, several types of qualitative comparisons can be made. We focus on the following two perspectives: characteristics of the segments (1) for which N1 scores worse than P{1,2,3} and (2) for which there are the biggest differences in ratings. Even though we include example translations in Czech, we provide explanations in English which are self-contained and hence do not require any knowledge of Czech.
N1 was evaluated with the highest scores in comparison to P1, P2, and P3, across all assessed features (see Figure 5).However, there is a small number of segments in N1 which were evaluated worse than those in P{1,2,3}.For a better overview, the frequencies at which the translations P{1,2,3} were evaluated better than N1 in the Overall category are: P1: 6.16%, P2: 4.96%, P3: 3.99%.We selected these segments (for each category, not only for Overall) and analysed them.
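Frequencies of this kind can be computed as in the following minimal sketch, assuming one record per annotator and segment that holds the Overall score of each translation. The record layout, the function name and the toy values are hypothetical and are not taken from the released data.

```python
def share_rated_above_n1(records, translation):
    """Fraction of records in which `translation` is rated strictly above N1."""
    better = sum(1 for record in records if record[translation] > record["N1"])
    return better / len(records)


# Hypothetical toy data: Overall scores given by one annotator to one segment.
RECORDS = [
    {"N1": 6.0, "P1": 5.0, "P2": 6.0, "P3": 3.0},
    {"N1": 5.0, "P1": 6.0, "P2": 4.0, "P3": 4.0},
    {"N1": 6.0, "P1": 6.0, "P2": 5.5, "P3": 2.0},
    {"N1": 5.5, "P1": 4.0, "P2": 6.0, "P3": 5.0},
]

shares = {name: share_rated_above_n1(RECORDS, name) for name in ("P1", "P2", "P3")}
```

Ties are not counted as "better" here; whether the reported percentages treat ties the same way is not stated in the text.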
In most cases our analysis revealed that the evaluation and the related editing of the translation was conditioned by the erroneous judgment of the annotators, who did not check the correct wording/meaning/usage in Czech and were tempted by the source text and/or the wrong parallel translations P{1,2,3}. In other words, the optimal reference translation N1 stood the test and our analysis confirmed its quality, rather than the evaluators' judgement. Furthermore, we also encounter a reduced (imperfect) rating of some segments in N1, although no errors are apparent and no changes in the edited version have occurred in comparison to the original version. This finding is valid for all evaluated categories without exception. The share of such N1 segments with a reduced (imperfect) rating in each category is: Spelling: 1.0%, Terminology: 1.8%, Grammar: 2.0%, Meaning: 1.7%, Style: 2.8%, Pragmatics: 1.5%, Overall: 0.5%, any category: 5.4%. We perform a detailed qualitative analysis across all seven rating categories.

Spelling
In the spelling category, the segment in Example 2 demonstrates the annotator's (non-translator) ignorance of, and failure to reflect, the correct spelling and declension of the name Narendra Modi in Czech (correctly in nominative singular: Naréndra Módí) and of the Czech equivalent to the verb harass (correctly: perzekvovat, although it often appears incorrectly as perzekuovat in language usage). The proposed edits are wrong. There are more segments with incorrect or unnecessary spelling edits. Unfortunately, some annotators not only erroneously "correct" what is actually right, but also miscategorize the changes. For instance, we find erroneously corrected morphology in this category. For the source The man pleaded guilty to seven charges involving [...] the correct structure Muž se přiznal k sedmi trestným činům (dative case) týkajícím se (dative case) [...] has been edited to Muž se přiznal k sedmi trestným činům (dative case) týkajících se (genitive case, grammatical incongruency) [...].
It is quite surprising to what extent annotators do not verify and follow up information, leaving errors in translations that are contained in the original. This is particularly evident in the spelling category. An example is a typo in the original: (Pete) Townsend (correctly Townshend; however, spelled correctly in the previous segment of the source text). P1 has Townsend, P2 Townshed [!], P3 Townsend. N1 uses the corrected form Townshend, but this form has been edited in the evaluation with the result Townsend.

Terminology
In the terminology category we also detected unnecessary or erroneous corrections. For example, the correction of the segment in Example 3 does not fall under terminology (and demonstrates, inter alia, the annotator's failure to verify the information; this time the annotator was actually a professional translator). The proposed edit is not correct/necessary.

Grammar
In the grammar category, the above-mentioned segment (Sony, Disney Back To Work On Third Spider-Man Film) plays an interesting role: it was also rated 3.0 in this category, without any other changes. The proposed change (mentioned above) does not concern grammar or spelling. As it turned out, this segment received the same rating from this annotator in other categories, too, namely meaning, style, and overall quality. The change, however, affects only the meaning, and it is erroneous.
We agree with some changes in syntax, e.g. in Example 4 (rating 5.0 for this segment by a student annotator).

Meaning
In the meaning category, we observe several inconsistencies in evaluating translated segments for N1 vs P{1,2,3}. For example, a reduced rating for N1 (4.0, by a non-translator) occurs in the segment in Example 5, although there are no changes in the edited version. Both the translations P1 and P2 score 6.0, even though they contain several erroneous meaning units. The initial therapy is počáteční léčba in Czech, not vstupní, and the verb require does not mean here that the patient himself required the therapy but that his/her medical condition required it. The expression chief medical officer refers to the Czech equivalent hlavní/vedoucí/vrchní lékař, not ředitel resortu zdravotnictví (= Director of the Ministry of Health).

Style
In the style category, we agree with some edits made for N1, for example in the segment in Example 6, rated 4.0. However, the rating of P1 is 6.0, although the edited version of P1 contains stylistic modifications very similar to those in N1 and, furthermore, the annotator (translator) uses the translation strategy proposed in N1 ORIG for his P1 EDIT version.

Pragmatics
The evaluation in the category of pragmatics is also inconclusive and the analysis of N1 segments rated worse than P{1,2,3} segments does not provide any convincing data. For example, one of the annotators (non-translator) rates N1 in Example 7 with 5.0, whereas P3 receives 6.0. However, the only change we find in the edited version of N1 is the elimination of the adjective nadšenou, although nadšená chvála is a typical collocation in Czech, in contrast to the rather unusual formulation (and too literal translation) zářná chvála used in P3, which remained unchanged. Furthermore, New York Magazine is usually used in the Czech media in its original, not translated form. Nevertheless, P3 uses New Yorský magazín (the adjective does not even exist in Czech). Other inappropriate or non-existent word units used in P3 include: díky (= thanks to) used in a negative context, Gettysburgského projevu (correctly: Gettysburského, without g), pro svůj historický význam (correctly: pro jeho historický význam). The word order is not based on the principle of the Czech functional sentence perspective and is non-idiomatic and non-standard. We could give many more similar examples.

Overall
The overall category shows inconsistencies similar to those described for the previous categories. The annotators often neglect formal, meaning and other errors, as shown above. Example 8 shows that different types of errors in P2 and P3 have been ignored. The annotator (non-translator) correctly substitutes the word kredit for úvěr in P3, but does not recognize the wrong structure pokud jde o dobré jméno in P2: the Czech word kredit (a sum of money (credit) or other value as a loan for a specified period of time for a specified consideration (e.g. interest)) can also have a colloquial meaning trust, respectability, which is not the case here. The collocation public accommodations includes all services, i.e. not only accommodation, but also catering and cultural activities (public spaces and commercial services that are available to the general public, such as restaurants, theaters, and hotels). The Czech word služby (= services) is correct, not ubytování (= accommodation). The trickiest collocation of this segment is jury service, which is not soudnictví (= judiciary) in a general sense but, more specifically, účast v soudní porotě (= participation in jury trials). Leaving aside all the overlooked errors, the annotator evaluates P2 and P3 better than N1, although he/she made one change each in N1 and P2 and two changes of a comparable nature in P3.

Individual Differences in Ratings
In this part, we would like to highlight selected segments with the biggest differences across relevant categories and focus on finding out the reasons for the observed discrepancies. The segment in Example 9 shows a very low rating and multiple changes in the edited version by annotator A1, whereas annotators A2, A3, and A4 evaluate it with (almost) best scores and overlook even obvious (to the authors and translatologists) mistakes. In the example, we use subscripts to mark the expressions of interest.
Annotator A1 rightly notices errors in spelling (lower and upper case letters, 1), terminology (name of the institution and other terms, 2), meaning (unclear and contradictory statements, 3), and pragmatics (dealing with foreign realia and abbreviations, 4). These individual assessments are also reflected in the category Overall. On the other hand, the annotators A2, A3, A4 (mostly) do not notice the errors mentioned above, or just change the wording of the abbreviation or replace the translation with the original abbreviation (A2) while maintaining the best rating. From our point of view, the correct annotator is A1 with their relevant, thoughtful and sensitive interventions in the text.
There are many segments with similarly unbalanced ratings in our evaluation. As the analysis shows, the biggest problem is that some annotators fail to recognize most of the errors. Lowering the rating even though no changes were made in the edited version is also problematic. It is unclear whether the annotators simply did not pay enough attention to their task or whether they would have reached the same conclusions even after a more careful consideration of the whole task. Our qualitative analysis of selected segments confirms the findings presented in Figure 5: translators are the most rigorous and careful (average rating 5.00), students are slightly less attentive (5.3), and non-translators notice errors the least (5.8).
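The per-group averages quoted above are plain means of the ratings grouped by annotator type. A minimal sketch of that aggregation, assuming a hypothetical input of (group, rating) pairs rather than the actual annotation export:

```python
from collections import defaultdict
from statistics import mean


def mean_rating_by_group(records):
    """Average rating per annotator group.

    records: iterable of (annotator_group, rating) pairs; this input shape
    is an assumption made for illustration.
    """
    by_group = defaultdict(list)
    for group, rating in records:
        by_group[group].append(rating)
    return {group: mean(values) for group, values in by_group.items()}
```

A lower group mean indicates a stricter group on the 0-6 scale, which is how the comparison of translators, students, and non-translators is read here.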

Document-level Phenomena
In this section we present examples that show the extent to which authors of translations P1, P2, P3, and N1, and especially our evaluators have considered the context of the whole document. We examined the evaluated documents, including the source text and all four translations, looking for evidence of apparent respect or disregard for the document-level context. Documents in which certain terms occur that should be consistent throughout the text and/or should correspond in a meaningful way to the thematic and pragmatic area appear to be appropriate material to demonstrate this (spans marked with 1). These observations are in line with MT evaluation methods focused on terminology (Zouhar et al. 2020; Semenov and Bojar 2022; Agarwal et al. 2023). Another phenomenon to be observed might be a particular way of spelling words, which generally have two or more accepted spellings and the convention is just to achieve a consistent spelling throughout the document (spans marked with 2).
Finally, for the topic-focus articulation, also called functional sentence perspective (spans marked with 3), it is crucial to respect the context of the whole document (Daneš 1974; Sgall et al. 1986; Hajičová et al. 2013). It is concerned with the distribution of information as determined by all meaningful elements, including context. In Example 12, evaluated by a non-translator, we selectively document the distribution of the degrees of communicative dynamism over sentence elements in Czech, which determines the orientation or perspective of the sentence (Firbas 1992).
Our first example in this section, Example 10, illustrates the disregard for proper terminology. Translations P1, P2, and N1 were not edited. The evaluator was a non-translator.
In the next example, evaluated by the same non-translator, P1, P2, and N1 are consistent in terminology and spelling in this document.
Our third example in this section, Example 12, is represented by the article China Says It Didn't Fight Any War Nor Invaded Foreign Land which discusses armed conflicts between China and other countries.
In Czech, it is common to put the adverbials of time at the beginning or in the middle of a sentence (depending on the meaning and function of other sentence elements). When appearing at the end of a sentence, they become the focus of the statement, so the communicative dynamism and sentence continuity may get broken (in 1979 / v roce / roku 2017, in 1979 / v roce 1979, spans 3a). Based on the information in the previous text (Example 12), the diplomatic resolution (diplomatically resolved, vyřešen diplomaticky / diplomatickou cestou) stands in contrast to the armed conflicts, so it represents the focus of the statement and should appear at the end of the sentence (after the verb) (spans 3b). Finally, the states Vietnam, Malaysia, the Philippines, Brunei and Taiwan should be placed at the end of the Czech sentence (this becomes evident after reading and understanding the entire document where we are introduced to information about which countries China has had conflicts with) (spans 3c). The word claims (vznášejí nároky, mají protinároky, mají opačné nároky, si činí nárok) belongs to the topic of the statement.

Example 12
Previous article content: China on Friday said it has not provoked a "single war or conflict" or "invaded a single square" of foreign land, skirting any reference to the 1962 war with India. "China has always been dedicated to resolving territorial and maritime delimitation disputes through negotiation and consultation," stated an official white paper released, four days ahead of the country set to celebrate its 70th anniversary of the leadership of the ruling Communist Party of China (CPC) on October 1. "China safeguards world peace through real actions. Over the past 70 years, China has not provoked a single war or conflict, nor invaded a single square of foreign land," the paper titled "China and the World in the New Era" said. The white paper, while highlighting the CPS's "peaceful rise" made no reference of the bloody 1962 war with India and the vast tracts of land, especially in the Aksai Chin area, occupied by China. The Sino-India border dispute involving 3,488-km-long Line of Actual Control (LAC) remained unresolved. China also claims Arunachal Pradesh as part of South Tibet, which India contests. So far, the two countries held 21 rounds of Special Representatives talks to resolve the border dispute.
SOURCE: Besides the 1962 war, India and China had a major military standoff at Doklam in 2017 when the People's Liberation Army (PLA) tried to lay a road close to India's narrow Chicken Neck corridor connecting with the North-Eastern states in an area also claimed by Bhutan.

Discussion
Evaluating optimal reference translation(s) is in many ways a more difficult task than evaluating a "standard" (human or machine) translation. It is already common practice in the translation industry to involve multiple workers in the translation of a single document (e.g. an initial translator and a quality assurance translator). Our analysis of the optimal reference translation evaluation shows that it is crucial who evaluates such translations: Do the annotators have professional translation experience, or are they students of translation, or laypeople in the field? It appears that laypeople are less able to notice even critical mistakes in translations. As a result, for quality assurance, hiring only annotators with extensive translation experience seems to be a requirement.
However, it is important to determine whom the translation is for. If it is for a wide audience who do not scrutinize the translation quality, it may not be worth the extra cost to hire highly skilled translation evaluators. In turn, for the evaluation of machine translation systems that have reached this very high level of quality, highly skilled evaluators are needed.
We do note, however, that perfect translations or annotations likely do not exist, only their approximations. The cost of uncovering more translation errors is likely hyperlinear, i.e., two rounds of annotation do not uncover twice as many mistakes. Each use case should therefore make explicit what the target quality level is and adjust the annotation protocol accordingly.
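The diminishing returns mentioned above can be illustrated with a toy model. The sketch below assumes, purely for illustration, that every annotation round uncovers a fixed fraction of the errors still remaining; neither this model nor the numbers are measured in this work.

```python
def errors_found_per_round(total_errors, detection_rate, rounds):
    """Errors uncovered in each successive annotation round.

    Toy assumption: each round catches `detection_rate` (0..1) of the
    errors that are still present after the previous rounds.
    """
    remaining = float(total_errors)
    found = []
    for _ in range(rounds):
        caught = remaining * detection_rate
        found.append(caught)
        remaining -= caught
    return found


# Hypothetical numbers: 100 errors, each round catches half of what is left.
per_round = errors_found_per_round(100, 0.5, 3)
print(per_round)           # [50.0, 25.0, 12.5]
print(sum(per_round[:2]))  # 75.0, less than twice the 50.0 of one round
```

Under this assumption, rounds of equal cost yield ever fewer newly found errors, so the cost per uncovered error grows with the target quality level.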

Conclusion
We defined the concept of optimal reference translation (ORT), geared towards regaining informative results in reference-based machine translation evaluation. We then performed a careful manual evaluation and post-editing of ORT in comparison with three standard professional translations. The evaluation confirms that ORT deserve their name and can be regarded as a truly golden reference. In fact, the few times when ORT did not score best were examples of errors in this follow-up annotation, not examples of ORT deficiencies. Additionally, we documented that manual evaluation at these high levels of quality cannot be delegated to inexperienced annotators. Only people with substantial translation experience are sensitive to the subtle differences and can provide qualified judgements.

Future work
While we focused on evaluating human translations, the identical setup could be used for evaluating MT models, which we plan to address in future work (Zouhar and Bojar 2024). This is not part of the present work, which is focused on showing that the reference translations usually used are of insufficient quality and need to be reconsidered. Our next step will be to assess which of the multitude of automatic metrics of MT quality are sensitive to the subtleties captured in our ORT and can thus be used to reliably evaluate MT outputs of high quality. This will again require careful expert manual evaluation.

Annotation Guidelines
The following is the main part of instructions which were distributed to the annotators.

Introduction
The goal of this study is to annotate the translation quality in seven categories. There are 20 documents in the shared Google sheet, marked as Edit1, Edit2 etc. (Orig1, ... are described later in the text). The first column contains the source text in English, followed by four Czech translations. However, only eight segments should be evaluated in each document. If you don't see a translation for some segments, it is not meant to be evaluated. You will evaluate the translations both at the segment level and at the level of whole documents (or at the level of the eight continuous segments). You will also indicate a better translation if you are not satisfied with the current version. Please read the source text first. The following is a possible evaluation procedure, but it is up to you how you proceed. The next steps are (for individual translations): 1. reading the translation, 2. evaluating the segments, 3. evaluating the whole document, 4. editing the segments so that you are satisfied with the translations, 5. reading the entire newly created text and possibly making minor changes. Please keep in mind that although you are also evaluating the segments separately, they are always part of a larger text, so you should pay special attention to how they relate to each other, i.e., also to the coherence and cohesion of the whole text. This should also be reflected in the assessment (category "style" below).

Evaluation of segments
Rate each of the four translations in the following seven categories on a scale from 0 (worst) to 6 (best):
• spelling, punctuation, typography, typos,
• terminology (correctness, consistency, normativity),
• grammar: morphology (word forms) and syntax (sentence structure, functional sentence perspective),
• meaning accuracy (mistranslation, addition, omission, untranslated text segment etc.),
• style (appropriateness, consistency, idiomaticity, cross-sentence coherence and cohesion),
• pragmatics (culture-specific reference, locale conventions, appropriateness for the Czech reader),
• overall quality (evaluation of the translation in all the above-mentioned categories).

Important notes
You can rate from 0 (the worst rating) to 6 (the best rating); in addition to whole numbers (0, 1, 2, 3, 4, 5, 6), decimal numbers with one decimal place (e.g. 0.1 or 4.5) are allowed. It is not necessarily the goal to use the full range of ratings for individual translations, i.e., if you do not see an error in a given category (even if the translation of the rated segment is very easy and does not pose a challenge for the translator), you will give the highest possible score (6). We leave it to the discretion of each evaluator to decide how serious they consider a particular error to be and how many points to deduct for it. If an error affects more than one category (typically, e.g., both categories 3 and 4), this should result in a reduced rating in all relevant categories.
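The rating format described above (0 to 6, with at most one decimal place) can be checked mechanically when the collected sheets are processed. The following is a minimal sketch; the function name is hypothetical and the check is an illustration only, as it was not part of the instructions given to the annotators.

```python
from decimal import Decimal, InvalidOperation


def is_valid_rating(text):
    """True if `text` is a number from 0 to 6 with at most one decimal place."""
    try:
        value = Decimal(text.strip())
    except InvalidOperation:
        return False  # not a number at all
    if not value.is_finite():
        return False  # rejects NaN and infinities
    if value < 0 or value > 6:
        return False
    # Rounding to one decimal place must leave the value unchanged.
    return value == value.quantize(Decimal("0.1"))
```

Parsing with `Decimal` instead of `float` avoids binary rounding effects when testing for a single decimal place.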

Evaluation of documents
Rate the entire translation at the document level in the seven categories (the same as above for segment evaluation) on a scale from 0 to 6 (the same conditions as above for segment evaluation). The rating of the whole document is on the last line of each sheet.

Editing of translated segments
If a segment translation does not receive the highest rating (6) in overall quality, please edit the translation with minimal editing (changes, corrections) to the state that you would give the highest rating (6). To clarify, if translations 1, 2, 3, and 4 get an overall quality rating of 6, 5, 3, 6, respectively (for particular segments), you must edit translations 2 and 3 independently. The resulting translations should be based on the original translations, i.e., most of the time they will be different from each other even after your edits. You can use dictionaries or search the internet, but please do not use any machine translation systems. If possible, try not to copy text segments from previous translations, even if you like them. Since you probably weren't satisfied with some of the translations and didn't give them the highest possible rating, you have edited some segments. For comparison, you can look at the original translation (OrigT), which is in another sheet. For example, for document 3, the sheet is called Orig3 and is listed just after Edit3. Edit only the EditT sheet.
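The editing rule at the start of this section can be stated compactly: every translation whose overall rating for a segment is below 6 has to be edited. A minimal sketch, with a hypothetical function name, using the ratings from the example in the text:

```python
def translations_to_edit(overall_ratings, best=6.0):
    """1-based indices of translations rated below the best overall score."""
    return [index
            for index, rating in enumerate(overall_ratings, start=1)
            if rating < best]


# Overall ratings 6, 5, 3, 6 for translations 1-4 of one segment.
print(translations_to_edit([6, 5, 3, 6]))  # [2, 3]
```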

Time range
Processing one document in all four translations takes 25-75 minutes on average. Please indicate the time spent on the annotation of each document (in minutes) in the appropriate box in each sheet. If you are systematically outside this range, send us an email. Please note that annotating the first document usually takes much more time than annotating subsequent documents.

Figure 1. Example translations of the same source into Czech. Literal transcriptions of the translations are shown in italics. N1: translatologist collaboration (optimal translation), P1: professional translation agency (post-edited MT), P2, P3: professional translation agency.

Figure 2. First 5 rows of a screen for a single document with source and 4 translations in parallel. Screens were accessed by annotators in an online spreadsheet program. Note: Scalable graphics; zoom in.

Figure 3. Distribution densities of ratings of each collected variable (thin tail cropped ≥ 3 for higher resolution of high-density values).Numbers and horizontal lines show feature means.

Figure 6. Predictions of linear regression models (on document- and segment-level) for all test set items sorted by true Overall score. Formulas show fitted coefficients and Pearson's correlations with the true scores. Only a random subset of points shown for visibility.

Figure 7. Segment-level Pearson correlations between the collected scores and automated metrics between the original and edited versions of a segment. Color is based on absolute value of the correlation (note TER).

Figure 8. Distribution densities of ratings of Overall for individual annotators.