
Universal Lemmatizer: A sequence-to-sequence model for lemmatizing Universal Dependencies treebanks

Published online by Cambridge University Press:  27 May 2020

Jenna Kanerva*
Affiliation:
TurkuNLP Group, Department of Future Technologies, University of Turku, Turku, Finland
Filip Ginter
Affiliation:
TurkuNLP Group, Department of Future Technologies, University of Turku, Turku, Finland
Tapio Salakoski
Affiliation:
TurkuNLP Group, Department of Future Technologies, University of Turku, Turku, Finland
*Corresponding author. Email: jmnybl@utu.fi

Abstract

In this paper, we present a novel lemmatization method based on a sequence-to-sequence neural network architecture and a morphosyntactic context representation. In the proposed method, our context-sensitive lemmatizer generates the lemma one character at a time, based on the surface form characters and the word's morphosyntactic features obtained from a morphological tagger. We argue that a sliding-window context representation suffers from sparseness, whereas in the majority of cases the morphosyntactic features of a word carry enough information to resolve lemma ambiguities, while keeping the context representation dense and more practical for machine learning systems. Additionally, we study two different data augmentation methods, utilizing autoencoder training and morphological transducers, which are especially beneficial for low-resource languages. We evaluate our lemmatizer on 52 different languages and 76 different treebanks, showing that our system outperforms all of the latest baseline systems. Compared to the best overall baseline, UDPipe Future, our system outperforms it on 62 out of 76 treebanks, reducing errors on average by 19% relative. The lemmatizer, together with all trained models, is made available as part of the Turku-neural-parsing-pipeline under the Apache 2.0 license.
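The sketch below illustrates how such a morphosyntactic context representation can be assembled for a character-level sequence-to-sequence lemmatizer: the surface form is split into characters and the tagger-predicted features are appended as extra input tokens. The function name, feature encoding, and example are illustrative assumptions, not the exact format used by the published system.

```python
def make_seq2seq_example(form, upos, feats):
    """Build a character-level source sequence for a seq2seq lemmatizer.

    The surface form is split into characters and the morphosyntactic
    features predicted by a tagger are appended as extra tokens, so the
    model can resolve lemma ambiguities without a sliding-window context.
    (Sketch only; the actual feature encoding in the paper may differ.)
    """
    src = list(form) + [f"UPOS={upos}"] + feats.split("|")
    return " ".join(src)

# Illustrative example: English "saw" as a past-tense verb.
print(make_seq2seq_example("saw", "VERB", "Mood=Ind|Tense=Past|VerbForm=Fin"))
# -> "s a w UPOS=VERB Mood=Ind Tense=Past VerbForm=Fin"
# The target side (the lemma "see") is likewise generated character by character.
```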

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2020. Published by Cambridge University Press

Figure 1. Percentage of running tokens with an ambiguous lemma and of token-tag pairs with an ambiguous lemma, calculated from the UD v2.2 training data. An ambiguous token is a word occurring with more than one lemma in the training data, whereas an ambiguous token-tag pair is a (word, morphosyntactic tags) tuple occurring with more than one lemma in the training data. All treebanks of one language are pooled together.
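A minimal sketch of how these two ambiguity rates can be computed from a CoNLL-U training file is given below; the function name and the exact filtering of multiword and empty tokens are my own assumptions.

```python
from collections import defaultdict

def ambiguity_rates(conllu_path):
    """Token-level ambiguity rates as in Figure 1 (sketch).

    A token is ambiguous if its surface form occurs with more than one
    lemma in the training data; a token-tag pair is ambiguous if the
    (form, UPOS, feats) tuple occurs with more than one lemma.
    """
    form_lemmas = defaultdict(set)   # form -> set of lemmas
    pair_lemmas = defaultdict(set)   # (form, upos, feats) -> set of lemmas
    tokens = []                      # running tokens for the final percentages
    with open(conllu_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip() or line.startswith("#"):
                continue
            cols = line.rstrip("\n").split("\t")
            if "-" in cols[0] or "." in cols[0]:  # skip multiword/empty tokens
                continue
            form, lemma, upos, feats = cols[1], cols[2], cols[3], cols[5]
            form_lemmas[form].add(lemma)
            pair_lemmas[(form, upos, feats)].add(lemma)
            tokens.append((form, upos, feats))
    amb_token = sum(len(form_lemmas[f]) > 1 for f, _, _ in tokens) / len(tokens)
    amb_pair = sum(len(pair_lemmas[t]) > 1 for t in tokens) / len(tokens)
    return 100 * amb_token, 100 * amb_pair
```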


Figure 2. Frequency comparison of the most common and the second most common lemmas in the training data for words which are ambiguous at the word-tag level. The top-100 most common ambiguous words are shown for Urdu (left), Hindi (middle), and Spanish (right), the three languages with the highest ambiguity rate in Figure 1.


Figure 3. Our encoder–decoder model architecture.


Figure 4. Visualization of the step-wise attention weights (actual system output), where the x-axis corresponds to the input sequence and the y-axis to the generated output sequence. In post-processing, each generated UNK symbol is replaced with the input symbol that has the maximal attention at the respective time step.
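The attention-based copy fallback described in the caption can be sketched as the following post-processing step; variable names are illustrative, and the published implementation may differ in detail.

```python
import numpy as np

def replace_unk(output_tokens, input_tokens, attention, unk="UNK"):
    """Replace each generated UNK with the most-attended input symbol.

    `attention` is a (len(output_tokens), len(input_tokens)) matrix of
    attention weights, as visualized in Figure 4; for every output
    position holding UNK we copy the input symbol with maximal weight
    at that time step. (Sketch only.)
    """
    out = []
    for t, tok in enumerate(output_tokens):
        if tok == unk:
            out.append(input_tokens[int(np.argmax(attention[t]))])
        else:
            out.append(tok)
    return out
```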


Figure 5. Parallel coordinates graphs for visualizing hyperparameter optimizer runs for three different treebanks (top: English, middle: Finnish, bottom: Irish). On the left side of the figure are all optimizer runs completed during the 24-hour time window, while on the right side these runs are filtered based on the validation loss to demonstrate parameter ranges resulting in low validation loss values.


Figure 6. Test set word-level error rates for our system and all baseline systems, divided into three treebank groups (big, PUD, and small), as well as the macro-average over all treebanks belonging to these groups.


Figure 7. Test set macro-average error rates over the five low-resource category treebanks for two of our models as well as all baseline systems.


Table 1. Evaluation of our two data augmentation methods, augmented with autoencoder and augmented with transducer, as well as a mixed method, compared to our basic models. Additionally, we measure the average percentage of words recognized by the transducer (Transducer Coverage) and the average percentage of words having the correct lemma among the possible analyses (Transducer Recall), which represents the oracle accuracy achievable by transducers if all lemmas could be disambiguated correctly. All metrics are measured on the token level, and in each column the highest accuracy value is bolded
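The two transducer metrics in the caption can be computed as sketched below, under the assumption that the transducer exposes an `analyze(form)` function returning the set of candidate lemmas for a word (empty if the word is not recognized); all names here are illustrative.

```python
def transducer_metrics(examples, analyze):
    """Token-level transducer coverage and recall as defined in Table 1 (sketch).

    `examples` is an iterable of (form, gold_lemma) pairs; `analyze(form)`
    returns the set of lemma candidates produced by a morphological
    transducer. Recall counts words whose gold lemma appears among the
    candidates, i.e. the oracle accuracy with perfect disambiguation.
    """
    total = covered = recalled = 0
    for form, gold_lemma in examples:
        total += 1
        candidates = analyze(form)
        if candidates:
            covered += 1
            if gold_lemma in candidates:
                recalled += 1
    coverage = 100.0 * covered / total if total else 0.0
    recall = 100.0 * recalled / total if total else 0.0
    return coverage, recall
```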


Table 2. Lemmatization accuracies for all 76 treebanks studied in this paper measured on test data with predicted segmentation. Green color indicates treebanks where our overall best method, Augm. Mixed 8K + 8K, outperforms the best overall baseline, UDPipe Future


Table A1. Lemmatization accuracy for all treebanks measured on gold and predicted segmentation and tagging