Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks

In this paper we present a novel lemmatization method based on a sequence-to-sequence neural network architecture and morphosyntactic context representation. In the proposed method, our context-sensitive lemmatizer generates the lemma one character at a time based on the surface form characters and its morphosyntactic features obtained from a morphological tagger. We argue that a sliding window context representation suffers from sparseness, while in the majority of cases the morphosyntactic features of a word bring enough information to resolve lemma ambiguities while keeping the context representation dense and more practical for machine learning systems. Additionally, we study two different data augmentation methods utilizing autoencoder training and morphological transducers, which are especially beneficial for low resource languages. We evaluate our lemmatizer on 52 different languages and 76 different treebanks, showing that our system outperforms all the latest baseline systems. Our system outperforms the best overall baseline, UDPipe Future, on 62 out of 76 treebanks, reducing errors on average by 19% relative. The lemmatizer together with all trained models is made available as a part of the Turku-neural-parsing-pipeline under the Apache 2.0 license.


Introduction
Lemmatization is a process of determining a base or dictionary form (lemma) for a given surface form. Traditionally, word base forms have been used as input features for various machine learning tasks such as parsing, but they also find applications in text indexing, lexicographical work, keyword extraction, and numerous other language technology-enabled applications. Lemmatization is especially important for languages with rich morphology, where strong normalization is required in applications. The main difficulties in lemmatization arise from encountering previously unseen words at inference time as well as from disambiguating ambiguous surface forms, which can be inflected variants of several different base forms depending on the context. The classical approaches to lemmatizing highly inflective languages are based on two-level morphology implemented using finite state transducers (FST) (Koskenniemi, 1984;Karttunen and Beesley, 1992). FSTs are models encoding vocabulary and string rewrite rules for analyzing an inflected word into its lemma and morphological tags. Due to surface form ambiguity the FST encodes all possible analyses for a word, and the early work on context-sensitive lemmatization was based on disambiguating the possible analyses in the given context (Smith et al., 2005;Aker et al., 2017;Liu and Hulden, 2017). The requirement of having a pre-defined vocabulary is impractical especially when working with Internet or social media texts where the language variation is high and adaptation fast. Therefore, there has been an increasing interest in the application of context-sensitive machine learning methods that are able to deal with open vocabulary.
arXiv:1902.00972v2 [cs.CL] 15 Apr 2020
In this paper we present a sequence-to-sequence lemmatizer with a novel context representation.
This method was used as part of the TurkuNLP submission in the CoNLL-18 Shared Task on Multilingual Parsing from Raw Text to Universal Dependencies (Zeman et al., 2018) where it ranked 1st out of 26 participants on the lemmatization sub-task. In addition to plain lemmatization, the system ranked 1st on the BLEX evaluation metric as well, a metric combining evaluation of both lemmatization and syntactic dependencies. Our Shared Task work is extended in several directions. Firstly, we analyze and justify the particular context representation used by the system using data from 52 languages; secondly, we carry out a comparison to state-of-the-art lemmatization methods; thirdly, we test and evaluate two different data augmentation methods for automatically expanding training data sizes; and finally, we release the system together with models for all 52 languages as a freely available parsing pipeline, containerized using Docker for ease of use.
The rest of the paper is structured as follows. In Section 2 we discuss the surface form ambiguity problem in the context of lemmatization, as well as present a data-driven study for justifying our contextual representation for resolving the problem. In Section 3 we describe the most important related work. In Section 4 we present our problem setting, model architecture and implementation. Experimental setups for our main evaluation as well as results are given in Sections 5 and 6. In Section 7 we describe our data augmentation studies to increase training set sizes leading to a higher prediction accuracy. In Section 8 we summarize the results as well as discuss the practical issues related to our method, most importantly prediction speed and software release. Finally we conclude the paper in Section 9.

Lemmatization Ambiguity and Morphosyntactic Context
Lemmatization methods can roughly be divided into two categories: context-aware methods, where the lemmatization system is aware of the sentence context where the word appears, and methods where the system lemmatizes individual words without contextual information. The advantage of the former approach is the ability to correctly lemmatize ambiguous words based on the contextual information, while the latter is only able to either give one lemma for each word, even though its lemmatization can vary in different contexts, or list all alternatives. While some ambiguous words, such as love in the verb-noun contrast (I love you vs. Love is all you need), are typically assigned the same lemma (love in this case rather than to love), this is not always the case. For example the English word lives receives a different lemma depending on the part-of-speech (live vs. life). Additionally words can be ambiguous within a single part-of-speech class. For example in Finnish the word koirasta is always a noun but depending on the grammatical case it should be lemmatized to koira (a dog inflected in elative case) or to koiras (a male inflected in partitive case). Note that the knowledge of the part-of-speech and inflectional tags, i.e. morphosyntactic features of the word, is sufficient to correctly lemmatize these two abovementioned examples. This holds for the majority of cases, with rare exceptions. For example, the Finnish word paikkoja is a noun in plural partitive, but it can be an inflection of two different lemmas, paikka (a place or patch) or paikko (a spare in bowling). In these rare cases, the meaning, and therefore the correct lemma, can only be derived from the semantic context, i.e. the actual meaning or topic of the sentence. Bergmanis and Goldwater (2018) did a careful evaluation of lemmatization model effectiveness with and without contextual information.
They show that including a sliding window of nearby characters significantly improves the performance compared to the context-free version of the same system. However, they only evaluate the system using a textual context (i.e. n characters/words before and after the word to be lemmatized). Suspecting that this lexical context representation suffers from sparseness, we hypothesize that the morphosyntactic features will uniquely disambiguate the lemma in all but the rarest of cases, and can serve as a more practical, dense context representation for the lemmatization task. In order to establish how uniquely the features disambiguate the lemma, we measure different levels of ambiguity on the Universal Dependencies (UD) v2.2 treebanks and present the results in Figure 1. We measure how many times a (word, morphosyntactic tags) -tuple is seen with more than one lemma compared to how many times a plain word is seen with more than one lemma in the training data.
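The ambiguity measurement described above can be sketched in a few lines of Python. This is a minimal illustration rather than the script used for Figure 1: the function name ambiguity_rates, the (form, tags, lemma) triple layout, the tag strings and the toy data are all assumptions, and details such as case normalization are not specified in the text.

```python
from collections import defaultdict

def ambiguity_rates(tokens):
    """Return (token_rate, pair_rate) as fractions of running tokens.

    tokens: iterable of (form, tags, lemma) triples (hypothetical layout).
    token_rate: share of tokens whose form occurs with more than one lemma.
    pair_rate: share of tokens whose (form, tags) pair occurs with more
    than one lemma.
    """
    tokens = list(tokens)
    if not tokens:
        return 0.0, 0.0
    lemmas_by_form = defaultdict(set)
    lemmas_by_pair = defaultdict(set)
    for form, tags, lemma in tokens:
        lemmas_by_form[form].add(lemma)
        lemmas_by_pair[(form, tags)].add(lemma)
    amb_form = sum(1 for f, _, _ in tokens if len(lemmas_by_form[f]) > 1)
    amb_pair = sum(1 for f, t, _ in tokens if len(lemmas_by_pair[(f, t)]) > 1)
    return amb_form / len(tokens), amb_pair / len(tokens)

# Toy data built from the examples in the text; the tag strings are made up.
toy = [
    ("lives", "NOUN|Number=Plur", "life"),
    ("lives", "VERB|Number=Sing", "live"),
    ("koirasta", "NOUN|Case=Ela", "koira"),
    ("koirasta", "NOUN|Case=Par", "koiras"),
    ("paikkoja", "NOUN|Case=Par|Number=Plur", "paikka"),
    ("paikkoja", "NOUN|Case=Par|Number=Plur", "paikko"),
    ("koira", "NOUN|Case=Nom", "koira"),
    ("koira", "NOUN|Case=Nom", "koira"),
]
token_rate, pair_rate = ambiguity_rates(toy)
```

In the toy data only paikkoja remains ambiguous once the tags are taken into account, which mirrors the drop from token-level to token-tag-level ambiguity discussed in this section.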
We can see that the proportion of ambiguous lemmas drops drastically for most languages when morphosyntactic tags are taken into account: on average the token-tag pair ambiguity is close to 3% of running tokens, while plain token ambiguity is close to 12%. For more than half of the languages the ambiguity drops below 1% of running tokens, to a level which does not pose an issue anymore, or, from a different point of view, can be expected to cause an issue to any machine learning system due to the rareness of the words involved, as we will demonstrate shortly. However, for a few languages the ambiguity remains at a surprisingly high level, especially for Urdu (36%) and Hindi (22%), both being Indo-Aryan languages and closely related to each other, as well as for Spanish (14%), a Romance language. To shed some light specifically on these three languages, we plot in Figure 2 the frequencies of the most common and second most common lemmas for the 100 most common ambiguous words. For all three languages, and all but a handful of words, the distribution is extremely imbalanced with only a small number of occurrences of the less frequent lemma. When investigating similar cases in languages we are familiar with, we can see that in addition to real ambiguities, in many cases these turn out to be annotation inconsistencies. For example, while the word vs. as ADP has only one meaning in the English training data and therefore should also have only one lemma, it is lemmatized 17 times as vs. and once as versus. Similarly, most of the ambiguous cases in the Finnish data are inconsistencies in the placement of compound boundary markers. Even with the real ambiguities, it is debatable whether heavily skewed distributions, where the most common lemma can be several orders of magnitude more common, can be learned given the minimal number of training examples for the rarer lemma.
Figure 1: Percentage of running tokens with ambiguous lemma and token-tag pairs with ambiguous lemma calculated from the UD v2.2 training data. An ambiguous token is a word occurring with more than one lemma in the training data, whereas an ambiguous token-tag pair is a (word, morphosyntactic tags) -tuple occurring with more than one lemma in the training data. All treebanks of one language are pooled together.
Figure 2: Frequency comparison of the most common and the second most common lemmas in the training data for words which are ambiguous at the word-tag level. The top-100 most common ambiguous words are shown for Urdu (left), Hindi (middle) and Spanish (right), the three languages with the highest ambiguity rate in Figure 1.
In the light of these findings, we therefore argue that the part-of-speech and rich morphosyntactic features are, from the practical standpoint of building a multilingual lemmatization system, sufficient to resolve the vast majority of ambiguous lemmatizations in the vast majority of the 52 languages covered by the UD dataset.

Related Work
The most common machine learning approaches to lemmatization are based on edit tree classification, where all possible edit trees or word-to-lemma transformation rules are first gathered from the training data, and then a classifier is trained to choose the correct one for a given input word. These methods do not require that the input word is known in advance as long as the correct edit pattern is seen during training. Edit-tree classifiers are used for example in (Müller et al., 2015;Straka et al., 2016;Chakrabarty et al., 2017), and the sentence-context for resolving ambiguous words can be incorporated into these classifiers for example by using global sentence features (Müller et al., 2015) or contextualized token representations (Straka et al., 2016;Chakrabarty et al., 2017;Straka, 2018b).
Many recent works build on the sequence-to-sequence learning paradigm. Bergmanis and Goldwater (2018) present the Lematus context-sensitive lemmatization system, where the model is trained to generate the lemma from a given input word one character at a time. Additionally, a context of 20 characters in each direction is concatenated with the input word, resulting in a 12% relative error decrease compared to only the word being present in the input. The Lematus system outperforms other context-aware lemmatization systems, including (Chrupała et al., 2008;Müller et al., 2015;Chakrabarty et al., 2017), and can be seen at the time of writing as the current state of the art on the task. However, the task is naturally an active research area with new directions pursued e.g. by Kondratyuk et al. (2018).
The 2018 CoNLL Shared Task on multilingual parsing included lemmatization as one of the objectives, and has given rise to a number of machine learning approaches. The Stanford system (Qi et al., 2018) ranked among the top three performing systems on large treebanks, together with our work and the abovementioned edit-tree classifier of Straka (2018b). In the Stanford system, words whose lemma cannot be looked up in a dictionary are lemmatized using a sequence-to-sequence model without any additional context information.
Sequence-to-sequence models have also been widely applied in the context of morphological reinflection, the reverse of the lemmatization task. In the CoNLL-SIGMORPHON 2017 Shared Task on Universal Morphological Reinflection (Cotterell et al., 2017) the objective was to generate the inflected word given a lemma and morphosyntactic tags.
Here several of the top-ranking systems were based on sequence-to-sequence learning (Kann and Schütze, 2017a;Bergmanis et al., 2017). The entry of Östling and Bjerva (2017) additionally tried to boost the inflection generation by learning the primary morphological reinflection objective jointly with the reverse task of lemmatization and tagging.

Methods
Taking inspiration from the top systems in the CoNLL-SIGMORPHON 2017 Shared Task, we cast lemmatization as a sequence-to-sequence rewrite problem where lemma characters are generated one at a time from the given sequence of word characters and morphosyntactic tags. We diverge from previous work on lemmatization by utilizing morphosyntactic features predicted by a tagger to represent the salient information from the context, instead of using for example contextualized word representations or a sliding window of text. We modify the usual order of a parsing pipeline to include the lemmatizer as the last step of the pipeline, running after the tagger and thus making it possible to access the predicted part-of-speech and morphological features at the time of lemmatization. In this study, we use a part-of-speech tagger modified to also predict morphological features. A more detailed discussion of the tagger is included in Section 5.1.2.
The input of our sequence-to-sequence lemmatizer model is the sequence of characters of the word together with the sequence of its morphosyntactic tags, while the output is the sequence of lemma characters. In the UD representation, three different columns are available for morphosyntactic tags: universal part-of-speech (UPOS), language-specific part-of-speech (XPOS) and morphological features, a sorted list of feature category and value pairs (FEATS). All three are used in the input together with the word characters. For example, the input and output sequences for the English word lives as a noun are:
INPUT: l i v e s UPOS=NOUN XPOS=NNS Number=Plur
OUTPUT: l i f e
Once cast in this manner, essentially any of the recent popular sequence-to-sequence model architectures can be applied to the problem. Similarly to the Lematus system, we rely on an existing neural machine translation model implementation, in our case OpenNMT: Open-Source Toolkit for Neural Machine Translation (Klein et al., 2017).
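The input and output format for the word lives can be reproduced with a short Python sketch. It is a minimal illustration under stated assumptions: the helper names build_input and build_output are hypothetical, the FEATS value is assumed to arrive as a raw CoNLL-U string with "_" for an empty value, and word forms containing whitespace are not handled.

```python
def build_input(form, upos, xpos, feats):
    """Build the space-separated source sequence for one token.

    form: surface form, split into single characters.
    feats: raw CoNLL-U FEATS string such as "Number=Plur", or "_" if empty
    (assumed convention).
    """
    symbols = list(form)
    symbols.append(f"UPOS={upos}")
    symbols.append(f"XPOS={xpos}")
    if feats and feats != "_":
        # Each feature=value pair becomes one input symbol.
        symbols.extend(feats.split("|"))
    return " ".join(symbols)

def build_output(lemma):
    """Build the space-separated target sequence of lemma characters."""
    return " ".join(lemma)

source = build_input("lives", "NOUN", "NNS", "Number=Plur")
target = build_output("life")
```

Treating every tag as a single symbol keeps the tag vocabulary small and lets the encoder embed characters and tags in the same sequence.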

Sequence-to-sequence Model
The model implemented by OpenNMT is a deep attentional encoder-decoder network. The encoder uses learned character and tag embeddings, and two bidirectional LSTM layers to encode the sequence of input characters and morphosyntactic tags into a same-length sequence of encoding vectors. The sequence of output characters is generated by a decoder with two unidirectional LSTM layers with input feeding attention (Luong et al., 2015b) on top of the encoder output. The full model architecture is illustrated in Figure 3.
An important requirement for sequence-to-sequence models is the ability to correctly deal with out-of-vocabulary (OOV) items at inference time. For example, in machine translation foreign person and place names should often be copied into the output sequence, which is not possible if the generation is based on a straightforward classification over output vocabulary learned during training. In the case of lemmatization, this issue manifests itself as characters not seen during training. Since in some languages foreign names inflect, copying full words that contain OOV characters is not a sufficient solution. For instance, a Finnish lemmatizer model trained on a typical Finnish corpus will have a vocabulary of mostly Scandinavian characters, and will be unable to correctly lemmatize the case-inflected Czech name Růžičkalla into Růžička.
In machine translation, the problem of OOV words is for the most part solved using Byte Pair Encoding (BPE) or other sub-word representations, reducing vocabulary size and handling inference-time unknown words (as unknown words can be split into known subwords) (Sennrich et al., 2016). As the lemmatizer operates on the level of characters, indivisible into smaller units, we instead rely on an alternative technique whereby the model is trained to predict an unknown symbol UNK for rare and unseen characters, and as a post-processing step, each such UNK symbol is subsequently substituted with the input symbol with the maximal attention value of the model at that point (Luong et al., 2015a;Jean et al., 2015). For instance, for the inflected name Růžičkalla, we would get
INPUT: R ů ž i č k a l l a UPOS=PROPN XPOS=N Case=Ade Number=Sing
OUTPUT: R UNK UNK i UNK k a
as the initial output of the system, later post-processed to the correct lemma Růžička based on attention weights visualized in Figure 4.
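The UNK post-processing step can be sketched as follows. This is a minimal illustration, not the OpenNMT implementation: the function name replace_unk is hypothetical, and the attention matrix below is a made-up, nearly one-hot toy alignment rather than actual model output.

```python
def replace_unk(output_symbols, input_symbols, attention, unk="UNK"):
    """Replace each generated UNK with the most attended input symbol.

    attention[t][s] is the attention weight on input position s at
    output step t (one row per generated symbol).
    """
    result = []
    for t, symbol in enumerate(output_symbols):
        if symbol == unk:
            weights = attention[t]
            best = max(range(len(input_symbols)), key=lambda s: weights[s])
            symbol = input_symbols[best]
        result.append(symbol)
    return "".join(result)

input_symbols = ["R", "ů", "ž", "i", "č", "k", "a", "l", "l", "a",
                 "UPOS=PROPN", "XPOS=N", "Case=Ade", "Number=Sing"]
output_symbols = ["R", "UNK", "UNK", "i", "UNK", "k", "a"]

# Hypothetical attention: output step t attends mostly to input position t.
n = len(input_symbols)
attention = [[0.9 if s == t else 0.1 / (n - 1) for s in range(n)]
             for t in range(len(output_symbols))]

lemma = replace_unk(output_symbols, input_symbols, attention)
```

Because the substitution copies single characters rather than whole words, it still works when only part of an inflected foreign name has to be kept in the lemma.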

Evaluation
Next we carry out an extensive evaluation of the lemmatization framework on 52 different languages with varying lemmatization complexity and training data sizes. We compare our system to several competitive lemmatization baselines. First, we give a detailed description of our experimental setup, the baseline systems and model parameters, and after that we present the evaluation results.
Figure 4: Visualization of the step-wise attention weights (actual system output), where the x-axis corresponds to the input sequence and the y-axis to the generated output sequence. In post-processing, each generated UNK symbol is replaced with the input symbol that has the maximal attention at the respective timestep.

Universal Dependencies Treebanks
We base our experiments on Universal Dependencies (UD) v2.2, a multilingual collection of 122 morpho-syntactically annotated treebanks for 71 languages, with cross-linguistically consistent annotation guidelines, including also gold standard lemma annotation (Nivre et al., 2016). The UD treebanks therefore allow us to test the lemmatization methods across diverse language typologies and training data sizes, ranging from a little over 100 to well over 1 million tokens. We restrict the data to the subset of 82 treebanks (57 languages) used in the CoNLL-18 Shared Task on Multilingual Parsing from Raw Text to Universal Dependencies (Zeman et al., 2018). In addition to allowing a direct comparison with the state-of-the-art parsing pipelines participating in the Shared Task, the treebanks from this subset all have a test set of at least 10,000 tokens, ensuring a reliable evaluation. Note that even though the test set is always at least 10,000 tokens, training sets may be considerably smaller, in several instances about 100 tokens.
Furthermore, it was also necessary to remove two treebanks with no lemma annotation (Old French-SRCMF and Thai-PUD) and four treebanks with no training data (Breton-KEB, Faroese-OFT, Japanese-Modern and Naija-NSC). The four parallel "PUD" treebanks included in the Shared Task (Czech-PUD, English-PUD, Finnish-PUD and Swedish-PUD, each including the same 1,000 sentences translated into the target language and annotated into UD) do not have dedicated training data, but can be used as additional test sets for models trained on the Czech-PDT, English-EWT, Finnish-TDT and Swedish-Talbanken treebanks, which are sufficiently similar in annotation style. Altogether, we therefore evaluate on 76 treebanks representing 52 different languages. During evaluation we show results separately for several different groups categorizing treebanks by size or other properties. These groups are PUD for 4 additional parallel test sets, big for 60 treebanks with more than 10,000 tokens of training and 5,000 tokens of development data, small for 7 treebanks with reasonably sized training data but no additional development data, and low resource for 5 treebanks with only a tiny sample of training data (around 20 sentences) and no development data. These are the same treebank groups as defined in CoNLL-18 Shared Task.
To ensure that treebanks in the small and low resource categories also have a development set for hyperparameter tuning and model selection, we adopt the data split provided by the Shared Task organizers, which creates the development set from a portion of the training data when necessary (Straka, 2018a). This data split was also used to train the Shared Task baseline model, one of the systems we compare our results to. The final numbers are always reported on the held-out test set directly specified in the UD release for each treebank. The original test section of the UD data is never used in system training and development, as suggested by the data providers and so as to be able to distribute the trained models for further comparison. For this reason we also decided not to apply N-fold cross-validation for low-resource treebanks, which otherwise would have been an option to decrease variance in the results. Furthermore, the training and development set split is also kept fixed as the development data is used only for early stopping and model selection, which we do not expect to greatly affect the numbers, and hyperparameters are not tuned separately for each treebank.

Part-of-Speech and Morphological Tagger
As the input of our lemmatizer is a word together with its part-of-speech and morphosyntactic features, we need a tagger to predict the required tags before the word can be lemmatized. We use a tagger based on the winning Stanford part-of-speech tagger from the CoNLL-17 Shared Task on multilingual parsing. The tagger has two classification layers (predicting UPOS and XPOS) over tokens in a sentence, where tokens are first embedded using a sum of learned, pre-trained and character-based LSTM embeddings, which are then encoded with a bidirectional LSTM to create a sequence of contextualized token representations. The classification layers are trained jointly on top of these shared token representations. By default, the original tagger does not predict the rich morphosyntactic features (FEATS column in CoNLL-U format). To address this, we modified the tagger training data by concatenating the morphosyntactic features with the language-specific part-of-speech tag (XPOS), thereby forcing the tagger to predict the XPOS tag and all morphosyntactic features as one multi-class classification problem. For example, in Finnish-TDT the original XPOS value N and FEATS value Case=Nom|Number=Sing are concatenated into one long string XPOS=N|Case=Nom|Number=Sing which is then predicted by the tagger. The morphological features are sorted so as to avoid duplicating label strings having the same tags in different order. After prediction, the morphosyntactic features are extracted into a separate column. The evaluation of the modified tagger shows that this data manipulation technique does not harm the prediction of the original XPOS tag, and that the accuracy of morphosyntactic feature prediction (FEATS field) is comparable to the state of the art in the CoNLL-18 Shared Task, ranking 2nd in the evaluation metric combining both morphosyntactic features and syntactic dependencies, and 3rd in the evaluation of plain morphosyntactic features.
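The label concatenation and the later extraction of the features into a separate column can be sketched as below. This is a minimal illustration under assumptions: the function names merge_xpos_feats and split_xpos_feats are hypothetical, "_" is assumed to mark an empty FEATS value, plain string sorting stands in for whatever ordering the treebank uses, and XPOS values that themselves contain "|" are not handled.

```python
def merge_xpos_feats(xpos, feats):
    """Concatenate XPOS and sorted FEATS into one tagger label."""
    parts = [f"XPOS={xpos}"]
    if feats and feats != "_":
        # Sorting ensures one label string per feature combination.
        parts.extend(sorted(feats.split("|")))
    return "|".join(parts)

def split_xpos_feats(label):
    """Recover (xpos, feats) from a predicted concatenated label."""
    parts = label.split("|")
    xpos = parts[0][len("XPOS="):]
    feats = "|".join(parts[1:]) or "_"
    return xpos, feats

label = merge_xpos_feats("N", "Case=Nom|Number=Sing")
same_label = merge_xpos_feats("N", "Number=Sing|Case=Nom")
recovered = split_xpos_feats(label)
```

With this encoding the tagger needs no architectural change: the second classification layer simply predicts a larger label set.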
In our preliminary experiments, we expected the complex morphology of some languages to result in a large number of very rare feature strings if combined in such a simple manner. We tested several models, for instance predicting a value for each category separately (for example Nominative for Case) from a shared representation layer. However, none of these alternatives surpassed the simple concatenation of morphological features. The conclusion of this experiment was that even though some languages have many unique feature combinations (the number of unique combinations ranging from 15 to 2,500), the most common ones cover the vast majority of the data, with the rare classes having no practical effect on the prediction accuracy.

Parameter Optimization
To optimize the hyperparameters of our lemmatization models, we use the RBFOpt library designed for optimizing complex black-box functions with costly evaluation (Costa and Nannicini, 2018). Different values of embedding size, recurrent layer size, dropout, learning rate, and learning rate decay parameters are experimented with. We let the RBFOpt optimizer run for 24 hours on three different treebanks, completing about 30 training runs for Finnish and English, and about 300 for the much smaller Irish treebank. The findings are visualized in Figure 5: On the left side of the figure all different runs completed by the optimizer are shown as a parallel coordinates graph, while on the right side we use a validation loss filter to show only those runs that result in low validation loss values. From this we can more easily determine the optimal parameter ranges and their mutual relationship.
Based on these optimizer runs, the lemmatization models seem to be moderately stable, most of the parameter values having individually only a small influence on the resulting validation loss, once the RBFOpt optimizer finds the appropriate region in the parameter space. The learning rate parameter (lr column) appears to have the largest impact, where lower learning rate values generally work better. Overall, the learning is stable across the parameter space, and the parameter optimization does not play a substantial role. Even default values as defined in the OpenNMT toolkit worked comparatively well.
In the final experiments, apart from the batch size, uniform hyperparameter settings based on the observations of the three optimization runs are used for all treebanks. We set the embedding size to 500, dropout to 0.3, recurrent size to 500, and we use the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 0.0005 and a learning rate decay of 0.9 starting after 20 epochs. All models are trained for 50 epochs, but for smaller treebanks we decrease the minibatch size to increase the number of updates applied during training. Our default minibatch size is 64, but for treebanks with fewer than 2,000 training sentences and fewer than 200 training sentences we use 32 and 6, respectively. Models usually converge around epochs 30-40, and final models are chosen based on prediction accuracy on the validation set. At prediction time we use beam search with beam size 5.
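The settings above can be collected into a small configuration sketch. The dictionary keys and the function name minibatch_size are hypothetical names chosen for illustration and are not OpenNMT option names; only the values come from the text.

```python
# Uniform hyperparameters reported in the text (key names are illustrative).
HYPERPARAMS = {
    "embedding_size": 500,
    "recurrent_size": 500,
    "dropout": 0.3,
    "optimizer": "adam",
    "learning_rate": 0.0005,
    "learning_rate_decay": 0.9,
    "decay_start_epoch": 20,
    "epochs": 50,
    "beam_size": 5,
}

def minibatch_size(num_training_sentences):
    """Pick the minibatch size from the training set size in sentences."""
    if num_training_sentences < 200:
        return 6
    if num_training_sentences < 2000:
        return 32
    return 64
```

Smaller batches on small treebanks mean more parameter updates per epoch, which matters when all models are trained for the same fixed number of epochs.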

Baselines
We compare our lemmatization performance to several recent baseline systems. Baseline UDPipe (Straka et al., 2016) is the organizers' baseline parsing pipeline from the CoNLL-18 Shared Task, which, due to its easy usability and availability of pretrained models, has been the go-to tool for parsing UD data. UDPipe Future (Straka, 2018b) is an updated version of the baseline UDPipe pipeline ranking high across the CoNLL-18 ST evaluation metrics. Both UDPipe versions have a lemmatizer based on the edit-tree classification method. The Stanford system (Qi et al., 2018) is a dictionary look-up followed by a context-free sequence-to-sequence lemmatizer for words unseen in the training data. Together with our entry, UDPipe Future and Stanford form the top three performing entries in the lemmatization evaluation of the CoNLL-18 ST on the big treebank category. In addition to top ranking systems from the CoNLL-18 ST, we also compare to the context-aware Lematus sequence-to-sequence lemmatizer (Bergmanis and Goldwater, 2018) which outperformed all its baselines in the earlier studies, and can be seen as the current state of the art in lemmatization research. Our final baseline (Look-up) is a simple look-up table, where lemmas are assigned based on the most common lemma seen in the training data, while unknown words are simply copied unchanged to the lemma field.
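The Look-up baseline is simple enough to sketch directly. This is a minimal illustration, assuming the table is keyed on the plain word form; the function names train_lookup and lookup_lemmatize and the toy training pairs are hypothetical, and tie-breaking between equally frequent lemmas is left unspecified.

```python
from collections import Counter, defaultdict

def train_lookup(pairs):
    """Map each word form to its most frequent lemma in the training data."""
    counts = defaultdict(Counter)
    for form, lemma in pairs:
        counts[form][lemma] += 1
    return {form: c.most_common(1)[0][0] for form, c in counts.items()}

def lookup_lemmatize(table, form):
    """Return the stored lemma, or copy an unknown word unchanged."""
    return table.get(form, form)

table = train_lookup([
    ("lives", "life"), ("lives", "life"), ("lives", "live"),
    ("dogs", "dog"),
])
```

The copy fallback explains why this baseline gets uninflected and invariant words right even when they were never seen in training.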

Results
The results are shown in Figure 6, where we measure word-level error rates separately on three treebank categories, big, PUD and small, as well as macro-average error rate over all treebanks belonging to these three categories.
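The two quantities reported here, word-level error rate per treebank and its macro-average, can be sketched as follows. The function names and the toy inputs are assumptions for illustration; the official Shared Task evaluation script is not reproduced here.

```python
def word_error_rate(gold_lemmas, predicted_lemmas):
    """Percentage of tokens whose predicted lemma differs from the gold one."""
    if len(gold_lemmas) != len(predicted_lemmas):
        raise ValueError("gold and predicted sequences differ in length")
    if not gold_lemmas:
        raise ValueError("empty evaluation set")
    errors = sum(1 for g, p in zip(gold_lemmas, predicted_lemmas) if g != p)
    return 100.0 * errors / len(gold_lemmas)

def macro_average(rates):
    """Unweighted mean over treebanks, so each treebank counts equally."""
    rates = list(rates)
    return sum(rates) / len(rates)

# Toy example: one error in four tokens.
toy_rate = word_error_rate(["a", "b", "c", "d"], ["a", "b", "x", "d"])
toy_macro = macro_average([toy_rate, 5.0])
```

A macro-average gives a small treebank the same weight as a large one, which is why results are also broken down by treebank category.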
On all three categories our system outperforms all the baselines with an overall error rate of 4.61 (macro-average across the treebanks in the three categories). Compared to the second best overall system, UDPipe Future, our error rate is 1.35 percentage points lower, reducing errors by 23% within these three treebank categories. The widest margin between our system and the second best system is in the small treebank category, where our system reduces errors by 30%, from 12.75 to 8.98, compared to the second best Lematus system. The simplistic Look-up baseline is clearly worse than all other systems, reflecting that plainly memorizing training tokens and falling back to copying unknown words is not a sufficient strategy for a language-universal lemmatizer. The three most recent baseline systems (Stanford, UDPipe Future and Lematus) perform evenly in terms of average error rate, outperforming the older Baseline UDPipe.
Figure 6: Test set word-level error rates for our system as well as all baseline systems divided into three different treebank groups, big, PUD and small, as well as macro-average over all treebanks belonging to these groups.
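The relative error reductions quoted above follow from the reported error rates, as the short check below shows. The function name is hypothetical, and the UDPipe Future error rate of 5.96 is not stated in the text but derived here as 4.61 + 1.35.

```python
def relative_error_reduction(baseline_error, system_error):
    """Fraction of the baseline's errors removed by the system."""
    return (baseline_error - system_error) / baseline_error

# Small treebanks: Lematus 12.75 vs. our system 8.98 (values from the text).
small_reduction = relative_error_reduction(12.75, 8.98)

# Overall: UDPipe Future error derived as 4.61 + 1.35 (an inference).
overall_reduction = relative_error_reduction(4.61 + 1.35, 4.61)
```

Rounded to whole percentages, these give the 30% and 23% figures reported for the small category and the overall comparison.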
The fourth treebank category used in the CoNLL-18 ST is low-resource, where only a tiny training data sample is available, usually around 20 sentences. Results for this group are given separately in Figure 7 where we measure macro-average word-level error rate over the five treebanks belonging to this category. A few dozen training sentences cannot be expected to result in a well-performing lemmatization system, and indeed, all systems have error rates near 40-50%, where almost half of the tokens are lemmatized incorrectly. Here even the Look-up baseline performs comparably to the other systems, which is for the most part caused by the fallback copying of the unknown words unchanged to the lemma field, and therefore getting the easy words correct. For our system, we report two different runs: basic is trained purely on the tiny training data sample, while official is our official submission for the CoNLL-18 ST where we experimented with preliminary data augmentation methods for automatically enriching the tiny training data sample with words analyzed by morphological transducers. The two lowest average error rates in the low-resource category are achieved by the two different versions of UDPipe (Baseline UDPipe and UDPipe Future), both belonging to the category of edit-tree classification systems. Systems based on sequence-to-sequence learning (Stanford, Lematus and ours) are hypothesized to be more data hungry, and these systems indeed achieve clearly worse results in the low-resource category, all making more errors than correct predictions. However, when we include the additional training data obtained with data augmentation methods, we are able to boost our performance (Our official) to the level of the two edit-tree classification systems, reducing errors by 24% compared to our basic models. Nevertheless, as all results are at about the same level as the simple Look-up baseline, the achieved improvement is mostly theoretical.

Training Data Augmentation
In our initial attempt to improve lemmatization performance on the low-resource languages in the CoNLL-18 Shared Task, we observed a substantial improvement over our basic run when the morphological transducers are used to generate additional training data. However, the overall accuracy on those datasets is below the limits of usable real-world systems, and thus the improvements seen are more theoretical than practical. Next, we investigate whether automatic training data augmentation methods are useful for languages with much better baseline accuracy, improving lemmatization performance in a real-life setting as well. We test two different methods, each on the full set of treebanks suitable for that method. First, we apply an autoencoder-style secondary learning objective, where the lemmatizer model is trained to repeat the given input sequence without any modification. The benefit of such an objective is that it supports stem generation without requiring any additional resources. Secondly, we repeat the experiment with the morphological transducers for all languages which have an Apertium morphological transducer available. We generate additional inflection-lemma pairs based on the known vocabulary and inflection paradigms encoded as a transducer, and these new training examples are then mixed with the original training data. Next, we explain both data augmentation methods in detail, and afterwards compare the results.
Figure 7: Test set macro-average error rates of five low-resource category treebanks for two of our models as well as all baseline systems.

Autoencoding Random Strings
In our first data augmentation method we apply joint learning of autoencoding and lemmatization. The basis of the required work in sequence-to-sequence lemmatization is the ability to repeat the word stem in the output generation. As suggested by Kann and Schütze (2017b) in the context of morphological reinflection, we hypothesize that learning to repeat the input characters as a secondary task with additional training examples could reduce the complexity of what the model has to learn for lemmatization, especially for treebanks with less training data. If the model is taught separately to repeat the input characters in the generated output, the actual lemmatization rewriting task could be learnable with less training material. In particular, this approach should help in low-resource settings, where the amount of training data is not necessarily sufficient for learning the complex task from scratch.
Following the autoencoding idea of Kann and Schütze (2017b), we enrich our lemmatization training data for each treebank by adding randomly generated strings where the input and output sequences are verbatim copies. These random strings are not equipped with any morphosyntactic tags, but instead a special tag is added to give the model the ability to distinguish these from the actual lemmatization examples to avoid confusion. Each random string is generated by sampling with replacement 3-12 characters individually from the known character vocabulary with character probabilities calculated from the training data, producing word-like items of varying lengths. However, we force each character in the vocabulary to be sampled at least once to better cover the known character vocabulary. This is achieved by first generating as many random strings as there are characters in the alphabet, each string containing the respective alphabet character at a random position. The rest of the strings are randomly sampled without any further restrictions on the alphabet. These generated strings are then mixed together with the actual training examples by randomly shuffling all training examples, and both tasks are thus trained simultaneously. The random shuffling of training examples (i.e. individual words), and therefore breaking the semantic context, does not harm the training of our lemmatizer as it is anyway looking at individual words at a time. As in our training data the morphosyntactic tags are already included for each word, and the random autoencoder strings do not use any morphosyntactic tags, there is no requirement of running the tagger at training time, thus making the training data shuffling procedure straightforward. We chose to autoencode random strings rather than actual words as that way we do not need any external resources and the method is easily repeatable for any language.
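The string generation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function names, the `COPY_TAG` marker and the tuple layout of a training example are hypothetical, and the only details taken from the text are the 3-12 character length range, frequency-weighted sampling with replacement, and the guarantee that every known character occurs at least once.

```python
import random
from collections import Counter

COPY_TAG = "<AUTOENCODE>"  # hypothetical name for the special tag


def random_autoencoder_strings(training_words, n_strings, min_len=3, max_len=12, seed=0):
    """Generate word-like random strings from the training character distribution.

    Returns max(n_strings, alphabet size) strings, because one string is always
    generated per known character.
    """
    rng = random.Random(seed)
    counts = Counter(ch for word in training_words for ch in word)
    alphabet = sorted(counts)
    weights = [counts[ch] for ch in alphabet]

    def sample():
        length = rng.randint(min_len, max_len)
        return rng.choices(alphabet, weights=weights, k=length)

    strings = []
    # One string per character, with that character forced at a random position.
    for ch in alphabet:
        chars = sample()
        chars[rng.randrange(len(chars))] = ch
        strings.append("".join(chars))
    # The remaining strings are sampled without further restrictions.
    while len(strings) < n_strings:
        strings.append("".join(sample()))
    return strings


def mix_with_training_data(lemmatization_examples, copy_strings, seed=0):
    """Shuffle (input, tags, output) examples of both tasks into one training set."""
    copy_examples = [(s, COPY_TAG, s) for s in copy_strings]
    mixed = list(lemmatization_examples) + copy_examples
    random.Random(seed).shuffle(mixed)
    return mixed
```

Because the lemmatizer sees one word at a time, shuffling the two example types together needs no sentence-level bookkeeping.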

Morphological Transducers
In our second data augmentation method we lean on additional morphological/lexical resources available for a particular language. In addition to Universal Dependencies, other projects are also striving to build unified morphological resources across many different languages. For example the UniMorph project (Kirov et al., 2016) extracts and normalizes morphological paradigms from the Wiktionary free online dictionary site. Further, finite state transducers for morphological analysis and generation for a multitude of languages are available in the Apertium framework, which includes a pool of open source resources for natural language processing (Tyers et al., 2010). Both UniMorph and Apertium frameworks can be used to collect inflected words and for each word a set of possible lemmas together with the corresponding morphological features. However, while these resources are unified within a project, their schema and guidelines differ from each other across different projects. For this reason using a mixture of training examples gathered from two or more different sources is not a straightforward task. While harmonized annotations across different languages give a good starting point for multilingual conversion, the mapping is usually not fully deterministic (see e.g. McCarthy et al. (2018) for detailed study of mapping from Universal Dependencies into UniMorph).
We expand our preliminary data augmentation experiments carried out during the CoNLL-18 ST where we used the Apertium morphological transducers to collect additional training examples. A morphological transducer is a finite-state automaton including morphological paradigms (inflection regularities/rules) and a lexicographical database (lexicon), where each lexical entry (lemma) is assigned to the inflection paradigm it follows. These linguistic resources can be compiled into an efficient finite-state transducer, an automaton which is able to return all matching lemmas and morphological hypotheses encoded in it for the given input word.
We set out to test whether improvements similar to those achieved with low-resource languages can also be seen with languages already including a reasonable amount of initial training data. We develop a language-agnostic feature mapping from Apertium features into UD, allowing us to cover all UD languages which have an Apertium morphological transducer available (Arabic, Armenian, Basque, Bulgarian, Buryat, Catalan, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Italian, Kazakh, Kurmanji, Latvian, Norwegian, Polish, Russian, Spanish, Swedish, Turkish, Ukrainian and Urdu).
For each of these languages we first gather a full vocabulary list sorted by word frequencies in descending order. These lists are gathered mainly from the web crawl datasets, but for languages not included in the distributed web crawl dataset (Armenian, Buryat, Kurmanji) we use Wikipedia dumps instead. The word frequency lists are then analyzed by the Apertium morphological transducers, where for each unique word we obtain a set of possible lemmas and their corresponding morphological features. Words not recognized by the transducer (not part of the predefined lexicon) are simply discarded. All of these Apertium analyses are then converted into the UD schema using our language-agnostic feature mapping, where each morphological feature is converted into UD based on a manually created look-up table. As the mapping from Apertium features into UD features is not a fully deterministic task, our language-agnostic feature mapping is designed for high precision and low recall, meaning that if a feature cannot be reliably translated, it will be dropped from the UD analyses. This approach may produce incomplete UD analyses, but we hypothesize that the lemmatizer model is robust enough to be able to utilize existing features without missing ones being too harmful for the training process, especially since in the actual training data these augmented examples are mixed together with the actual ones. The lemmas, on the other hand, we assume to be relatively harmonized between UD and Apertium by default, and these are used without any conversion or modification. After feature translation, we skip words which already appear in the original treebank training data, as well as all lemmas with a missing part-of-speech tag in the UD analysis due to an incomplete feature conversion, and all ambiguous words having two or more different lemmas with exactly the same morphological features.
Finally, we pick a number of most common words from the UD converted and filtered transducer output, which are then mixed together with the original treebank training data. All training examples are randomly shuffled before training.
Table 1: Evaluation of our two data augmentation methods, augmented with autoencoder and augmented with transducer as well as a mixed method, compared to our basic models. Additionally, we measure average percentage of words recognized by the transducer (Transducer Coverage) and average percentage of words having the correct lemma among the possible analyses (Transducer Recall), which represents an oracle accuracy achievable by transducers if all lemmas could be disambiguated correctly. All metrics are measured on token level.
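The filtering and selection steps described in this section can be sketched as follows. The data layout and the function name are assumptions made for illustration: analyses are taken to be already converted to UD, features are plain strings, and the budget is counted in words rather than individual analyses. Dropping a word entirely when any tag combination maps to several lemmas is one possible reading of the ambiguity filter.

```python
from collections import defaultdict


def select_augmentation_examples(analyses_by_word, treebank_words, max_words):
    """Pick the most frequent usable transducer analyses as extra training data.

    analyses_by_word: list of (word, [(lemma, upos, feats), ...]) pairs,
        sorted by descending word frequency, already converted to UD.
    treebank_words: set of word forms seen in the treebank training data.
    Returns a list of (word, upos, feats, lemma) training examples.
    """
    selected = []
    n_words = 0
    for word, analyses in analyses_by_word:
        if n_words >= max_words:
            break
        if word in treebank_words:
            continue  # already covered by the original training data
        # Drop analyses whose part-of-speech tag was lost in feature conversion.
        usable = sorted({(lemma, upos, feats) for lemma, upos, feats in analyses if upos})
        if not usable:
            continue
        # Drop words where identical tags lead to two or more different lemmas.
        lemmas_by_tags = defaultdict(set)
        for lemma, upos, feats in usable:
            lemmas_by_tags[(upos, feats)].add(lemma)
        if any(len(lemmas) > 1 for lemmas in lemmas_by_tags.values()):
            continue
        selected.extend((word, upos, feats, lemma) for lemma, upos, feats in usable)
        n_words += 1
    return selected
```

For example, the English form "axes" would be skipped under this filter, since both "axe" and "axis" are plural noun readings with identical features, whereas a form whose readings differ in part of speech or features would be kept.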

Data Augmentation Results
First we compare the two augmentation methods against our basic system, where, based on observations in Bergmanis et al. (2017), we mix 4,000 additional training examples together with the original training data in both experiments. We decided to use a constant number of additional examples rather than a percentage to better account for the low-resource languages, the ones benefiting most from the experiment, where for example a 20% increase in training data would still translate to having less than 500 training examples. Secondly, we add experiments on using a mixture of both augmentation techniques and increasing the number of additional examples included. Additionally, we test how well a morphological transducer itself could serve as a lemmatizer by measuring its coverage (how many words from the test data are recognized by the transducer) and lemma recall (how many words from the test set have the correct lemma among the possible analyses given by the transducer). Lemma recall therefore gives an upper-bound, oracle accuracy achievable by the transducer, assuming that all lemmas in its output can be correctly disambiguated. Results are given in Table 1. We measure macro accuracy over all treebanks and results are given separately for three treebank groups: All treebanks includes all 76 treebanks studied in this paper, Excluding low resource is all treebanks except the five low resource treebanks, and Transducer-only treebanks is a set of 47 treebanks representing languages which have a morphological transducer available. Note that in All treebanks results the Augm. transducer row uses the basic model for treebanks where a transducer is not available, giving a realistic comparison against the Augm. autoencoder method which does not suffer from lacking resources. In the mixed experiments, if a transducer is not available for a language, the training data is enriched only with the autoencoder examples.
The two direct transducer metrics (Transducer Coverage and Recall), however, can be realistically measured only for languages having a transducer available and the results reported for the Transducer-only treebanks group allow for a direct comparison between plain transducers and our models.
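The two transducer metrics defined above can be computed at token level as in the following sketch. The `analyze` callable is a stand-in for a morphological transducer and is an assumption of this example; it is expected to return the set of candidate lemmas for a word, or an empty set when the word is not recognized.

```python
def transducer_coverage_and_recall(test_tokens, analyze):
    """Return (coverage, lemma recall) over a list of (word, gold_lemma) tokens."""
    if not test_tokens:
        raise ValueError("test_tokens must not be empty")
    recognized = 0
    with_gold_lemma = 0
    for word, gold_lemma in test_tokens:
        candidates = analyze(word)
        if candidates:
            recognized += 1  # the transducer returned at least one analysis
        if gold_lemma in candidates:
            with_gold_lemma += 1  # an oracle disambiguator could pick the gold lemma
    total = len(test_tokens)
    return recognized / total, with_gold_lemma / total


# Toy lexicon standing in for a real transducer.
toy_lexicon = {"dogs": {"dog"}, "ran": {"run"}, "axes": {"axe", "axis"}, "left": {"left"}}
tokens = [("dogs", "dog"), ("ran", "run"), ("axes", "axis"), ("left", "leave"), ("blorp", "blorp")]
coverage, recall = transducer_coverage_and_recall(tokens, lambda w: toy_lexicon.get(w, set()))
print(coverage, recall)  # 0.8 0.6
```

By construction recall can never exceed coverage, and the gap between the two is the share of tokens that the transducer recognizes without offering the gold lemma.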
In all three groups, all augmentation methods are able to surpass the basic model, with the transducer-based method giving slightly better overall results than the autoencoder. When mixing the two methods, the same total number of examples as in the plain transducer augmentation is divided evenly between the two methods. The mixed method is not able to surpass the transducer-based one, but when increasing the amount of additional mixed data, the performance also increases slightly, with the mixed 8K + 8K, the largest mixed method tested, giving the best overall performance. When considering a macro-average over all treebanks, errors are reduced by 13% relative compared to our basic models. However, when excluding the five low resource treebanks already discussed in Section 6, the difference is smaller, and the relative error reduction becomes a mere 3%, demonstrating that, unsurprisingly, most of the benefit comes from the low resource languages and only a minimal improvement can be seen with reasonably-sized training data sets.
The average coverage for the morphological transducers is 86%, with recall being 78%. These numbers are clearly below our lemmatization methods, showing that, averaged across many languages, the approach relying on a predefined lexicon and ruleset does not compare favorably with sequence-to-sequence machine learning methods. The average transducer coverage is on par with the one reported by Tyers et al. (2010), where coverage numbers reported for a set of languages vary between 80% and 98%; with our set of languages, however, the variation is much higher, ranging between 5% and 99%, and clearly the transducers in the lower coverage region are missing much of the core vocabulary. These are measured without using morphological guessers, where unknown words can be analysed based only on their morphological shape (for example known suffixes). However, as the guessers consider every possible mapping allowed by the rules of the language, in many cases a great number of different alternatives is returned, which would need to be disambiguated later on. We therefore leave it as future work to study whether morphological guessers and sequence-to-sequence lemmatizers can have a shared interest. By comparing the transducer coverage and recall, we can estimate how harmonized the lemmas are between Apertium transducers and UD treebanks on average. If 86% of words are recognized by the transducer, but only 78% have a "correct" lemma analysis, then 8% of the treebank words are recognized but with a "wrong" lemma, hinting at an incompatible analysis. We leave it as a future study to examine whether the differences are systematic and whether further gains could be obtained by filtering or harmonizing the lemma annotations between Apertium and UD in addition to harmonizing morphological features. Such a study, however, requires knowledge of each of the involved languages.

Result Summary
In Table 2 we summarize the results of all the major experiments reported in this paper. For each treebank we present the accuracy of our best overall method, Augm. mixed 8K + 8K, and for comparison, we also add results for our basic method as well as the best overall baseline method, UDPipe Future. The comparison of our system and the UDPipe Future baseline is visualized by coloring each line green where our Mixed 8K+8K method is better than the UDPipe Future baseline. As discussed in Section 5.3, all numbers are measured on top of predicted segmentation, therefore reflecting a realistic expectation of the performance with no gold-standard data used at any point during prediction.
Out of the 76 treebanks, our method outperforms the UDPipe Future baseline on 62 treebanks. On average, across the 76 treebanks, this translates to a relative 19% error reduction. On 36 treebanks the relative error reduction is more than 20%, meaning that we are able to remove at least one fifth of the errors the best baseline system is making.
While the autoencoding augmentation method does not require any additional data, the transducer-based techniques move the system into an unconstrained setting, if considering a task setup where only the given treebanks are allowed in system training. However, in real-life situations, where all available data is allowed, the comparison between our augmented system and the baseline systems is fair. Such a real-life task setting was used for example in the CoNLL 2018 and 2017 multilingual parsing shared tasks, where a list of additional resources apart from the treebanks were given to all task participants. These allowed resources also included the Apertium morphological transducers, which makes the comparison between our augmentation methods and baseline systems from the CoNLL 2018 shared task fair. The difference between our system and a standard context-based lemmatization system is that integrating information from these additional sources is much easier with our task setting where the lemmatizer does not need the words to appear in a natural context.

Generalization and Error Propagation
To understand the generalization capability of the lemmatizer when the segmentation and morphological tagging effects are disregarded, we compare the lemmatization accuracy on top of predicted segmentation to gold-standard segmentation (sentence and word level), as well as on top of predicted morphosyntactic features to gold-standard morphosyntactic features. The same experiment also measures the risk of error propagation, where the lemmatizer makes a mistake due to incorrectly predicted morphosyntactic features. Results for all treebanks are available in Appendix A. When comparing the lemmatization accuracy of the 5 low-resource languages (Armenian, Buryat, Kazakh, Kurmanji, Upper Sorbian) on predicted and gold morphosyntactic features, the four transducer languages (Armenian, Buryat, Kazakh, Kurmanji) appear to generalize extremely well, with gold morphosyntactic features increasing the accuracy from 58%-74% to 91%-96%. For Upper Sorbian, the one low-resource language without a transducer, the generalization ability is clearly worse, with gold tags increasing accuracy only from 55% to 74%. These results suggest that the data augmentation techniques utilizing a morphological transducer are sufficient to train a high quality lemmatizer if reliable morphosyntactic features are available. However, at the same time they show that in extreme cases where the accuracy of part-of-speech tagging is barely above 50%, errors from the tagger component propagate notably. As future work, it would be interesting to study whether morphological transducers could be used to create artificial data for context-dependent morphological tagging so as to improve the tagger performance as well.
Currently, the lemmatizer is the last component in the parsing pipeline, thus not affecting the labeled attachment score of the syntactic parser. The parser currently used in the pipeline was originally designed to not consider lemmas at all; however, the lemmatizer component could be located before the syntactic parser as well, making it possible to establish whether using lemmas as additional features during parsing would improve its performance.
Table 2: Lemmatization accuracies for all 76 treebanks studied in this paper measured on test data with predicted segmentation. Green color indicates treebanks where our overall best method, Augm. Mixed 8K + 8K, outperforms the best overall baseline, UDPipe Future.

Future Work
We acknowledge that the morphological transducers used in our data augmentation study may not have been utilized to their full power. Our straightforward feature mapping from the Apertium framework into Universal Dependencies was designed to be language agnostic, thus suffering from inconsistencies in annotations between different languages and treebanks. A more focused attempt on a particular, well chosen language with an improved morphological transducer, language-specific conversion or detailed parameter tuning could yield better results. While Apertium can be considered a trustworthy source for unified morphological resources, for many languages, more developed language-specific transducers exist. For example, if particularly working on Turkish, Finnish or Hungarian, one should consider using morphological transducers by Çöltekin (2010), Pirinen (2015) and Trón et al. (2006). A focused per-language effort is naturally entirely out of scope of this current work, which can nevertheless serve as a basis for such language-specific development. Similar argumentation is suggested by Pirinen (2019), who carried out a focused evaluation of our lemmatization system and the OMorFi morphological analyzer (Pirinen, 2015) on the Finnish language. OMorFi is a mature system, being the result of a major development effort spanning several years. Its output is in the Universal Dependencies scheme, providing a valid point of comparison. A lemmatization performance of our pipeline far superior to that of OMorFi is reported, leading to the conclusion that the machine learning approach is indeed highly competitive with the traditional transducers and can be seen as the preferred approach to developing lemmatizers for new languages.
However, we leave it as a future work to study whether combining such a morphological transducer and machine learning approach in a targeted data augmentation effort would yield higher improvements for lemmatization accuracy than presented in this paper.
Another interesting direction to expand the work in future would be to test how well the lemmatizer works on short text segments, for example with search queries, where deep learning systems traditionally need to be trained separately to match the different style of writing, for example very often omitting the main verb. As the lemmatizer is operating on the word level without a notion of context, this should not pose an issue during the lemmatization. However, a separate question is how reliable a morphological tagger would be with such short text segments.

Model and Software Release
We release trained models for all 76 treebanks experimented with in this paper, embedded into a full parsing pipeline including segmentation, tagging, syntactic parsing and lemmatization. The parsing pipeline source code is available at https://turkunlp.org/Turku-neural-parser-pipeline under the Apache 2.0 license. It includes trained models for all the necessary components (segmentation, tagging, syntactic parsing and lemmatization), trained on the UD v2.2 treebanks. The whole processing pipeline can be executed with a single command, removing the need for data reformatting between the different analysis components. The pipeline runs in a Python environment which can be installed with or without GPU support. To increase usability across different platforms we also provide a publicly accessible Docker image, which wraps the pipeline in a container that can be executed without manual installation, ensuring that the pipeline can be executed and the results replicated also in the future.

Training and Prediction Speed
Typical training times for the lemmatizer models on UD treebanks with 50 training epochs are 1-2 hours on one Nvidia GeForce K80 GPU card. The largest treebanks (Czech-PDT 1.2M tokens and Russian-SynTagRus 870K tokens) took approximately 15 hours to train for the full 50 epochs. However, the training usually converges between epochs 30 and 40, and therefore training time could be reduced using an early stopping criterion.
At prediction time, our approach has several advantages over previous sequence-to-sequence lemmatizer models. First, by using morphosyntactic features instead of a sliding window of text to represent the contextual information, after running the context-dependent morphological tagger, the lemmatizer is able to process each word independently from its textual sentence context, and therefore we only need to lemmatize each unique word and feature combination. This enables us to 1) only lemmatize unique items inside each textual batch, and 2) store a cache of common preanalysed words, and only run the sequence-to-sequence model for words not already present in this global lemma cache. Together with the trained models we distribute such a global cache file for each language.
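The two prediction-time optimizations described above can be sketched as follows. This is an illustrative outline rather than the pipeline's actual code: `run_model` is a hypothetical callable wrapping the sequence-to-sequence model, and representing a token as a `(word, tags)` tuple is an assumption of the example.

```python
def lemmatize_batch(tokens, lemma_cache, run_model):
    """Lemmatize a batch of (word, tags) tokens, calling the model only when needed.

    lemma_cache: dict mapping (word, tags) to a lemma, e.g. loaded from a cache file.
    run_model: callable taking a list of (word, tags) items and returning their lemmas.
    """
    # Keep one copy of each item missing from the global cache, preserving order.
    unique_missing = list(dict.fromkeys(t for t in tokens if t not in lemma_cache))
    batch_lemmas = {}
    if unique_missing:
        batch_lemmas = dict(zip(unique_missing, run_model(unique_missing)))
    # Map the results back onto every token occurrence in the batch.
    return [lemma_cache[t] if t in lemma_cache else batch_lemmas[t] for t in tokens]
```

Because the lemma depends only on the word form and its tags, repeated occurrences within a batch and frequent words across batches never reach the model, which is where most of the speed-up would come from on CPU.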
Prediction times for the full parsing pipeline, including segmentation, tagging, syntactic parsing and lemmatization, are on the order of 1,300 tokens per second (about 100 sentences per second) on an Nvidia GeForce GTX 1070 card. On a server-grade CPU-only computer (24 cores and 250GB RAM) prediction times are 350 tokens per second, while on a consumer CPU-only laptop (8 cores and 8GB of RAM), the full pipeline can process about 280 tokens per second. These are measured with a pre-analysed lemma cache collected from the training data, and prediction times especially on CPU could be yet improved by collecting a larger pre-analysed lemma cache using, for example, large web corpora.

Conclusions
In this paper we have introduced a novel sequence-to-sequence lemmatization method utilizing morphosyntactic tags to inform the model about the context of the word. We validated the hypothesis that the tags provide a sufficient disambiguation context using statistics from the Universal Dependencies treebanks across a large number of languages. We presented a careful evaluation of our method over several baselines and 52 different languages showing that the method surpasses all the baseline systems, reducing relative errors on average by 19% across 76 treebanks compared to the best overall baseline. The lemmatizer presented in this work was also used as our entry in the CoNLL-18 Shared Task on Multilingual Parsing from Raw Text to Universal Dependencies, where we achieved the 1st place out of 26 teams on two evaluation metrics incorporating lemmatization. Additionally, we investigated two different data augmentation methods to boost the lemmatization performance of our base system. We found that augmenting the training data using a mixture of autoencoder training and the output of a morphological transducer decreases the error rate by 13% relative to the un-augmented system, with the gain being unsurprisingly concentrated on the low-resource languages.
As an overall conclusion, we have demonstrated a highly competitive performance of the generic sequence-to-sequence paradigm on the lemmatization task, surpassing in accuracy prior methods specifically developed for lemmatization.
The lemmatization models for all languages reported in the paper, source code and materials for all experiments, the full parsing pipeline source code and parsing models, as well as an easy-to-use Docker container, are available at https://turkunlp.org/Turku-neural-parser-pipeline under the Apache 2.0 license.
Table 3: Lemmatization accuracy for all treebanks measured on gold and predicted segmentation and tagging.