
Cross-lingual dependency parsing for a language with a unique script

Published online by Cambridge University Press:  09 September 2024

He Zhou*
Affiliation:
Department of Linguistics, Indiana University, Bloomington, IN, USA
Daniel Dakota
Affiliation:
Department of Linguistics, Indiana University, Bloomington, IN, USA
Sandra Kübler
Affiliation:
Department of Linguistics, Indiana University, Bloomington, IN, USA
Corresponding author: He Zhou; Email: hzh1@iu.edu

Abstract

Syntactic parsing is one of the core areas of Natural Language Processing. The development of large-scale multilingual language models has enabled cross-lingual parsing approaches, which allow us to develop parsers for languages that do not have treebanks available. However, these approaches rely on the assumption that languages share orthographic representations and lexical entries. In this article, we investigate methods for developing a dependency parser for Xibe, a low-resource language that is written in a unique script. We first conduct lexicalized monolingual dependency parsing experiments to examine the effectiveness of word, part-of-speech, and character embeddings as well as pre-trained language models. Results show that character embeddings can significantly improve performance, while pre-trained language models decrease performance since they do not recognize the Xibe script. We also train delexicalized monolingual models, which yield results competitive with the best lexicalized model. Since the monolingual models are trained on a very small training set, we also investigate lexicalized and delexicalized cross-lingual models. We use six closely related languages as source languages, covering a wide range of scripts. In this setting, the delexicalized models achieve higher performance than the lexicalized models. A final experiment shows that we can increase the performance of the cross-lingual model by combining source languages and selecting the sentences most similar to Xibe as the training set. However, all cross-lingual parsing results are still considerably lower than those of the monolingual model. We attribute the low performance of cross-lingual methods to syntactic and annotation differences as well as to the impoverished input of Universal Dependencies part-of-speech tags that the delexicalized model has access to.
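Delexicalized cross-lingual parsing, as described in the abstract, trains on treebanks from which word forms have been removed, so that only the POS tags and the tree structure carry signal across scripts. The following is a minimal sketch of how a CoNLL-U sentence can be delexicalized; the function name and the toy sentence are illustrative, not taken from the article.

```python
def delexicalize_conllu(lines):
    """Replace the FORM and LEMMA columns of CoNLL-U token lines with '_',
    keeping the UPOS tags and the dependency structure intact."""
    out = []
    for line in lines:
        # Comments and blank lines are passed through unchanged.
        if not line or line.startswith("#"):
            out.append(line)
            continue
        cols = line.split("\t")
        if len(cols) == 10:  # a well-formed CoNLL-U token line
            cols[1] = "_"    # FORM
            cols[2] = "_"    # LEMMA
        out.append("\t".join(cols))
    return out

# Toy two-token sentence (IDs, heads, and labels are illustrative).
sent = [
    "1\teve\tev\tNOUN\t_\t_\t2\tobl\t_\t_",
    "2\tgitti\tgit\tVERB\t_\t_\t0\troot\t_\t_",
]
for line in delexicalize_conllu(sent):
    print(line)
```

Training a parser on such delexicalized input sidesteps the mismatch in orthography and vocabulary between the source languages and Xibe, at the cost of discarding all lexical information.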

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press
Figures and Tables

Table 1. Xibe alphabet of vowels and consonants, with Latin transliterations

Table 2. Unified Mongolian graphemes and graphemes specific to the Xibe writing system

Figure 1. Example of a dependency tree; Eng.: "(We) have carried serious duty of history on our shoulder."

Table 3. Case markers in written Xibe

Table 4. Converb suffixes in written Xibe

Figure 2. Example dependency tree; Eng.: "He returned home following (his) mother."

Figure 3. Dependency annotation of the converb construction from example (2)

Table 5. Number of sentences and tokens of the selected UD treebanks

Table 6. Hyper-parameter settings for the parser

Table 7. Results for parsing Xibe in a monolingual setting using different feature combinations

Table 8. Subwords for the sentence in example (3) tokenized by the pre-trained language models

Table 9. Accuracy of correct head attachment per POS tag (sorted in descending order by the absolute frequency of the POS tag in the Xibe treebank)

Table 10. Out-of-vocabulary ratio in three folds of the Xibe treebank

Table 11. Inflectional suffixes of Xibe verbs, based on the General Introduction to Xibe Grammar (Šetuken, 2009)

Table 12. Parsing results of the POS ONLY model and the POS + VERB SUFFIX model

Table 13. F1 scores of VERB and of dependency relations for clauses, for the POS ONLY model and the POS + VERB SUFFIX model

Figure 4. Parse by the POS ONLY model (left) and the POS + VERB SUFFIX model (right); Eng.: "After (you) meeting with him, tell (him) that I ask after (his) health."

Table 14. Cross-lingual dependency parsing results

Table 15. Number of unseen labels between the training and test treebanks. The average rate reports the percentage of these labels in the Xibe test set

Table 16. F1 scores for dependency relations obtained from four delexicalized models trained, respectively, on the Turkish BOUN, Korean Kaist, and Japanese GSD and GSDLUW treebanks

Table 17. The five most frequent head-dependent pairs of adverbial clauses ("advcl") in the Korean Kaist, Korean GSD, and Xibe treebanks

Figure 5. Dependency tree and an erroneous parse for example (1). The edges above the POS tags show the treebank annotation; the edges below show the parse trained on the Turkish BOUN treebank

Table 18. Multi-source parsing results using the full multi-source treebank and subsets with sentence perplexity scores lower than $n = \{10, 15, 20\}$

Table 19. Label performance in monolingual, single-source cross-lingual, and multi-source cross-lingual parsing