
Automated annotation of parallel bible corpora with cross-lingual semantic concordance

Published online by Cambridge University Press:  25 January 2024

Jens Dörpinghaus*
Affiliation:
Federal Institute for Vocational Education and Training (BIBB), Bonn, Germany University of Bonn, Bonn, Germany University of Koblenz, Mainz, Germany

Abstract

Here we present an improved approach for the automated annotation of New Testament corpora with a cross-lingual semantic concordance based on Strong’s numbers. Derived from already annotated texts, Strong’s numbers provide references to the original Greek words. Because scholarly editions and translations of biblical texts are often not released for research use and are rarely freely available, up-to-date training data are scarce. In addition, since the annotation, curation, and quality control of alignments between these texts are expensive, few biblical resources are available to scholars. We present two improved approaches to the problem, based on dictionaries and on already annotated biblical texts. We provide a detailed evaluation on annotated and unannotated translations, and we discuss a proof of concept based on English and German New Testament translations. The results presented in this paper are novel and, to our knowledge, unique. They show promising performance, although further research is needed.

Information

Type
Article
Creative Commons
CC BY-NC-ND 4.0
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives licence (http://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided that no alterations are made and the original article is properly cited. The written permission of Cambridge University Press must be obtained prior to any commercial use and/or adaptation of the article.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Figure 1. Illustration of a parallel bible view provided by https://www.stepbible.org. It shows two English translations (ESV and KJV) and a Greek text (SBLG).


Table 1. Overview of training and test data. Here tft refers to thought-for-thought, pa to paraphrase approach, and wfw to word-for-word (formal equivalence). Texts with Strong’s numbers are used for training and testing; texts without Strong’s numbers are used only for testing. The Remarks column indicates special cases: for the Leonberger Bible, translations based on two different Greek texts are available, and the VOLX-Bible provides a text in colloquial German youth language


Figure 2. A snippet of the XML output for Acts 1:1 from diatheke.
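Strong’s numbers can be pulled from such XML output with a simple parse. The sketch below is illustrative only: it assumes OSIS-style `<w>` elements whose `lemma` attribute carries `strong:`-prefixed identifiers, and the sample string is hypothetical (the exact attribute names depend on the SWORD module and diatheke options used):

```python
import re

# Hypothetical snippet of diatheke XML output for a verse; attribute
# names may differ between SWORD modules.
xml = (
    '<verse osisID="Acts.1.1">'
    '<w lemma="strong:G3588">The</w> '
    '<w lemma="strong:G4413">former</w> '
    '<w lemma="strong:G3056">treatise</w>'
    '</verse>'
)

# Extract (Strong's number, surface word) pairs from the markup.
pairs = re.findall(r'<w lemma="strong:(G\d+)">([^<]+)</w>', xml)
print(pairs)  # [('G3588', 'The'), ('G4413', 'former'), ('G3056', 'treatise')]
```

A full pipeline would use a proper XML parser rather than regular expressions, but the pair extraction step is the same.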


Algorithm 1 Dictionary-based-matches I


Figure 3. The proposed method with example data (Acts 1:1). In general, we use as input the target text (a translated text), verse by verse, and an existing annotated text. The original Greek text with annotations could be used when adding a translation or dictionary. Existing dictionaries can be used, or dictionaries can be created from the use of Strong’s numbers in a given translation. First, we use POS tagging and lemmatization to extract the matching words. Then we annotate the target gloss by finding the best matches, either by grouping words by POS or by considering all available terms.
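The dictionary-based matching step can be sketched as follows. This is a toy illustration under stated assumptions, not the paper’s implementation: a hand-made lemma table stands in for a real POS tagger and lemmatizer, and the dictionary entries mapping Strong’s numbers to German glosses are made up for the example.

```python
# Toy sketch of dictionary-based annotation transfer.
# The dictionary maps Strong's numbers to possible German glosses; in
# practice such dictionaries are existing resources or are extracted
# from already annotated translations. All entries are illustrative.
dictionary = {
    "G1080": {"zeugen"},    # gennao, "to beget"
    "G11":   {"abraham"},
    "G2464": {"isaak"},
}

def lemmatize(word):
    # Stand-in for a real lemmatizer (here: one inflected German verb).
    table = {"zeugte": "zeugen"}
    w = word.lower().strip(".,;")
    return table.get(w, w)

def annotate(verse):
    """Assign a Strong's number to each target word whose lemma appears
    in the dictionary, or None when no entry matches."""
    out = []
    for word in verse.split():
        lemma = lemmatize(word)
        match = next((s for s, glosses in dictionary.items()
                      if lemma in glosses), None)
        out.append((word, match))
    return out

print(annotate("Abraham zeugte Isaak"))
# [('Abraham', 'G11'), ('zeugte', 'G1080'), ('Isaak', 'G2464')]
```

The paper’s algorithms additionally restrict candidate matches by POS group and score competing candidates; this sketch shows only the lemma-lookup core.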


Algorithm 2 Dictionary-based-matches II


Table 2. Overview of the approaches evaluated in this paper. POS and POS0 differ in the number of parts of speech considered for matching. The column “categories” describes whether only elements within a category are matched (all), whether all elements are mixed (none), or whether only conjunctions, prepositions, and pronouns are mixed (cpp)


Algorithm 3 Extract Dictionary


Table 3. Results of algorithm $(bible, POS_{0}, all)$ with $\varepsilon =2$ and $f=pos_z(x,y)$ for Luther 1912 and GerLeoNA28 as target texts. The column “$F_1$ (D)” shows the results by Dörpinghaus and Düing (2021)


Table 4. Results of algorithm $(bible, POS_{0}, all)$ with $f=pos_z(x,y)$ for KJV ($\varepsilon =2$) and ESV ($\varepsilon =16$) as target texts


Table 5. Results of algorithm $(bible, POS_{0}, all)$ with $\varepsilon =2$ and $f=pos_z(x,y)$ for Luther 2017 and HFA as target texts


Table 6. Results of algorithm $(bible, POS_{0}, all)$ with $\varepsilon =2$ and $f=pos_z(x,y)$ for NRSV and WEB as target texts


Table 7. This table shows the minimum, average, and maximum difference between the total number of references to Greek words (Strong’s numbers) and the detected number of words. Interestingly, these numbers are the same for all texts in one particular language


Figure 4. $F_1$ Score for different values of $\varepsilon$ (x-axis) for HFA (left) and Luther 2017 (right) as target text.


Figure 5. $F_1$ Score for different values of $\varepsilon$ (x-axis) for NRSV (left) and WEB (right) as target text.


Figure 6. Example of KJV (top) and ESV (bottom) annotations for Acts 1:3.


Figure 7. Application to Luther 2017 (Matthew 1:2). The corresponding English text according to the ASV is: “Abraham begat Isaac, and Isaac begat Jacob, and Jacob begat Judah and his brethren.”


Figure 8. Application to HFA (Matthew 1:2). The corresponding English text according to the ASV is: “Abraham begat Isaac, and Isaac begat Jacob, and Jacob begat Judah and his brethren.”


Table 8. Results of algorithm $(bible, POS_{1}, all)$ with $\varepsilon =6$ and $f=pos_z(x,y)$ for Luther 1912 and GerLeoNA28 as target texts


Table 9. Results of algorithm $(bible, POS_{1}, all)$ with $\varepsilon =6$ and $f=pos_z(x,y)$ for Luther 2017 and SLT as target texts


Figure 9. Assignment on VOLX in Matt. 1:2 (Abraham begat Isaac; and Isaac begat Jacob; and Jacob begat Judah and his brethren).


Figure 10. $F_1$ Score for different values of $\varepsilon$ (x-axis) for Luther 2017 as target text.


Table 10. Results of algorithm $(bible, POS_{1}, all)$ with $f=pos_z(x,y)$ for KJV ($\varepsilon =13$) and ESV ($\varepsilon =8$) as target texts


Table 11. Results of algorithm $(bible, POS_{1}, all)$ with $f=pos_z(x,y)$ for NRSV ($\varepsilon =2$) and WEB as target texts


Figure 11. $F_1$ Score for different values of $\varepsilon$ (x-axis) for KJV as target text.


Figure 12. $F_1$ Score for different values of $\varepsilon$ (x-axis) for ESV as target text.


Figure 13. Example of existing (top) and $POS_1$ (bottom) annotations for Romans 20:5.


Figure 14. Application to Luther 2017 (Matthew 1:2). The corresponding English text according to the ASV is: “Abraham begat Isaac, and Isaac begat Jacob, and Jacob begat Judah and his brethren.”


Figure 15. Application to HFA (Matthew 1:2). The corresponding English text according to the ASV is: “Abraham begat Isaac, and Isaac begat Jacob, and Jacob begat Judah and his brethren.”


Figure 16. Assignment on VOLX in Matt. 1:2 (Abraham begat Isaac; and Isaac begat Jacob; and Jacob begat Judah and his brethren).