
Evaluating large language models with a word-level translation alignment task between Ancient Greek and English

Published online by Cambridge University Press:  17 September 2025

Peter M. Nadel*
Affiliation: Research Technology, Tufts University, Medford, MA, USA
Gregory Crane
Affiliation: Classical Studies, Tufts University, Medford, MA, USA
*Corresponding author: Peter M. Nadel; Email: peter.nadel@tufts.edu

Abstract

In this article, we evaluate several large language models (LLMs) on a word-level translation alignment task between Ancient Greek and English. Comparing model output to a human gold standard, we examine the performance of four LLMs, two open-weight and two proprietary. We then use the best-performing model to generate examples of word-level alignments for further finetuning of the open-weight models, and we observe significant improvement in the open-weight models after finetuning on this synthetic data. These findings suggest that open-weight models that initially cannot perform a given task can be bolstered through finetuning to achieve impressive results. We believe that this work can help inform the development of more such tools in the digital classics and the computational humanities at large.

Information

Type
Short Article
Creative Commons
Creative Commons License: CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Figure 1. An example of a single line alignment (Od. 5.1). English is aligned to Ancient Greek with the use of square brackets. A 0 was used when there was no direct alignment between the English and the Ancient Greek.
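The bracketed annotation scheme described in Figure 1 can be sketched as a small parser. The sample line and the exact token syntax below (each English token followed by the 1-based index of its Ancient Greek counterpart, 0 for no alignment) are illustrative assumptions based on the caption, not taken from the paper's data.

```python
import re

# Assumed format: "word[idx]" tokens, where idx is the 1-based position
# of the aligned Ancient Greek word, and 0 marks no direct alignment.
TOKEN = re.compile(r"(\S+?)\[(\d+)\]")

def parse_alignment(line):
    """Return (english_word, greek_index) pairs, dropping unaligned tokens."""
    pairs = []
    for word, idx in TOKEN.findall(line):
        idx = int(idx)
        if idx != 0:  # 0 means the English word has no Greek counterpart
            pairs.append((word, idx))
    return pairs

# Hypothetical sample line in the assumed format:
sample = "Dawn[1] rose[0] from[2] her[0] couch[3]"
print(parse_alignment(sample))  # [('Dawn', 1), ('from', 2), ('couch', 3)]
```

Pairs extracted this way can then be compared directly against a gold-standard set of links.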


Table 1. F1 scores before finetuning compared to Yousef et al. (2022). Bold values indicate the highest score.


Table 2. AER scores before finetuning compared to Yousef et al. (2022). Bold values indicate the best (lowest) score.
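The F1 and AER metrics reported in these tables can be computed from sets of predicted and gold alignment links, as sketched below. The helper names and the simplification of treating all gold links as "sure" (under which AER reduces to 1 − F1) are assumptions for illustration, not the paper's implementation.

```python
def precision_recall_f1(pred, gold):
    """Precision, recall, and F1 over sets of (src_idx, tgt_idx) links."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def aer(pred, sure, possible=None):
    """Alignment Error Rate: lower is better. With no sure/possible
    distinction in the gold data, AER equals 1 - F1."""
    pred, sure = set(pred), set(sure)
    possible = sure | set(possible or ())
    return 1 - (len(pred & sure) + len(pred & possible)) / (len(pred) + len(sure))

# Toy example: 2 of 3 predicted links match the gold set.
pred = {(1, 1), (2, 3), (4, 4)}
gold = {(1, 1), (2, 2), (4, 4)}
p, r, f = precision_recall_f1(pred, gold)
print(round(f, 2), round(aer(pred, gold), 2))  # 0.67 0.33
```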


Table 3. F1 scores of open-weight models after finetuning. Bold values indicate the highest score.


Table 4. AER scores of open-weight models after finetuning. Bold values indicate the best (lowest) score.


Figure 2. Comparison of model performance across different shot counts, showing the relationship between precision and recall scores. The diagonal dashed line represents equal precision and recall scores. Finetuned models (lighter colors) generally show improved performance over their base versions (darker colors).


Figure 3. Distribution of performance between base and finetuned Llama models, across different shot counts. Finetuned models show significant improvement over their base counterparts.


Table 5. Standard deviation of the best scores of proprietary models and finetuned open-weight models, compared to Yousef et al. (2022).


Table 6. Average time to align one sample, with Yousef et al. far surpassing any decoder-only architecture. Bold values indicate the best (lowest) time.


Listing 1. Training prompt template.
