Creating a large-scale diachronic corpus resource: Automated parsing in the Greek papyri (and beyond)

Published online by Cambridge University Press:  15 August 2023

Alek Keersmaekers*
Affiliation:
Department of Linguistics, KU Leuven, Leuven, Belgium
Toon Van Hal
Affiliation:
Department of Linguistics, KU Leuven, Leuven, Belgium
*Corresponding author: Alek Keersmaekers; Email: alek.keersmaekers@kuleuven.be

Abstract

This paper explores how to syntactically parse Ancient Greek texts automatically and maps ways of fruitfully employing the results of such an automated analysis. Special attention is given to documentary papyrus texts, a large diachronic corpus of non-literary Greek, which presents a unique set of challenges. Using the Stanford Graph-Based Neural Dependency Parser, we show that through careful curation of the parsing data and several manipulation strategies, it is possible to achieve a Labeled Attachment Score of about 0.85 for this corpus. We also explain how the data can be converted back to its original (Ancient Greek Dependency Treebanks) format. We describe the results of several tests we have carried out to improve parsing results, with special attention paid to the impact of the annotation format on parser performance. In addition, we offer a detailed qualitative analysis of the remaining errors, including possible ways to solve them. Moreover, the paper gives an overview of the valorisation possibilities of an automatically annotated corpus of Ancient Greek texts in the fields of linguistics, language education and humanities studies in general. The concluding section critically analyses the remaining difficulties and outlines avenues to further improve the parsing quality and the ensuing practical applications.
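
As an illustration of the metric cited in the abstract, the following is a minimal sketch of how a Labeled Attachment Score (and its unlabeled counterpart) is conventionally computed: the share of tokens whose predicted head, and for the labeled score also the predicted relation, match the gold annotation. The function name `attachment_scores`, the toy arcs, and the relation labels are illustrative assumptions, not taken from the paper or from the parser's own evaluation code.

```python
from typing import List, Tuple

# One token's analysis: (index of its head, dependency relation label).
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) for two aligned lists of per-token arcs.

    UAS counts tokens whose head is correct; LAS additionally requires
    the relation label to be correct. Hypothetical helper for illustration.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    total = len(gold)
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    labeled_hits = sum(g == p for g, p in zip(gold, predicted))
    return head_hits / total, labeled_hits / total


# Toy four-token sentence; labels are made up in an AGDT-like style.
gold = [(2, "ADV"), (0, "PRED"), (2, "OBJ"), (2, "AuxK")]
predicted = [(2, "ADV"), (0, "PRED"), (2, "SBJ"), (3, "AuxK")]

uas, las = attachment_scores(gold, predicted)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

In this toy case three of four heads are right but only two tokens have both head and label right, so the labeled score is the stricter of the two. A score of about 0.85 would correspondingly mean that roughly 85 in 100 tokens receive both the correct head and the correct relation.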

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press
Figure 1. Base representational format for ellipsis (“by nature” “wise” (particle) “nobody” – “Nobody [is] wise by nature”).

Figure 2. New representational format for elliptic copula (sentence: see Figure 1).

Figure 3. New representational format for comparative constructions (sentence: “now” “so” “I write” “to you” “since” “like” “father” “of them” “are” “you” – “So I write to you now since you are like their father”).

Figure 4. New representational format for infinitives without a main verb (sentence: “to Zenon” “be happy” “Horos” – “Horos [tells] Zenon to be happy,” i.e. “Horos greets Zenon”).

Table 1. Dependency treebanks of Greek available by the end of 2020

Table 2. Description of the training data for syntactic parsing

Table 3. Description of the development data for syntactic parsing

Table 4. Select possible representational formats for coordination structures, illustrated through the sentence “I fought” “with Psentinbaba” “and” “I went away”

Figure 5. Representational format for damaged sentences: “[gap of 11 characters] ” “I have written you a letter … letter”. The asterisk refers to the original spelling.

Table 5. Example of a Greek sentence in CoNLL format (without Lemma; sentence: ; “are you sleeping again?”)

Table 6. Overview of the main results of the test set (papyrus corpus, with N = 17,609)

Table 7. Results of combined model with automatically predicted POS and features (gold morphology)

Table 8. Results of model with automatically predicted morphology by genre

Figure 6. Confusion matrix of syntactic relations (colours are normalised by total counts).

Table 9. Qualitative analysis of 500 parsing errors