Joint learning of morphology and syntax with cross-level contextual information flow

Abstract We propose an integrated deep learning model for morphological segmentation, morpheme tagging, part-of-speech (POS) tagging, and dependency parsing, using cross-level contextual information flow for every word, from segments to dependencies, with an attention mechanism in the horizontal flow. Our model extends the work of Nguyen and Verspoor (2018, Proceedings of the CoNLL Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, The Association for Computational Linguistics, pp. 81–91) on joint POS tagging and dependency parsing to also include morphological segmentation and morphological tagging. We report our results on several languages. Our primary focus is agglutination in morphology, in particular Turkish morphology, for which we demonstrate improved performance compared to models trained for the individual tasks. As one of the earlier efforts in joint modeling of morphology and syntax along with dependencies, we discuss prospective guidelines for future comparison.


Introduction
NLP tasks such as morphological segmentation, morpheme tagging, part-of-speech (POS) tagging, and dependency parsing have been widely investigated for extracting the meaning of a clause. These tasks can be considered different stages in reaching semantics: morphological segmentation and morpheme tagging deal with the morphemes inside a word; POS tagging captures the words in a particular context, usually in a clause; dependency parsing views each clause as a sequence of words and the relations between them, such as subject, modifier, and complement.
In these tasks, morpheme tagging labels each morpheme in a word with the syntactic information of the morpheme in that particular context, for example, person or tense. POS tagging assigns a syntactic category to each word in a context (e.g., noun and adjective). Dependency parsing finds relations such as argument, complement, and adjunct between words in terms of word-word dependencies. Dependency labeling assigns a tag (e.g., subject, object, and modifier) to every dependency relation.

Figure 1. A Turkish clause with labeled dependency relations between morphologically complex words. First line: the orthographic form ("-" is for "null"). Second line: morphological segments. Third line: morphological tags (-GEN=genitive, .3s=3rd person singular, -ACC=accusative, -PAST=past tense). Dependencies in the article are arrowed (head to dependent) and labeled UD dependencies (de Marneffe et al. 2021).
We introduce a joint learning framework for morphological segmentation, morpheme tagging, POS tagging, and dependency parsing, with particular modeling emphasis on agglutinating languages such as Turkish. We have tested our system with other languages as well.
The proposed joint learning is a stronger form of multitask learning. Instead of the cascaded horizontal processes that are common in multitask settings, we propose a different design, motivated as follows.
In Turkish, morphology is agglutinating. Syntactic information is encoded in inflectional morphemes, which are suffixed to the word. An example is given in Figure 1. In the example, -ı in kapag-ı-nı indicates possession, which is an inflection. It also reveals that its host word is a nominal. It is in a possessive dependency relation with "book." If the stem were a verb, for example, yap-tıg-ı-nı (do-COMP-POSS.3s-ACC) "his/her doing," the same phonological shape ı would mark agreement with the subordinate subject of this verb, rather than possession. This is relevant to the choice of dependency. There is no possessive meaning in yaptıg-ı-nı, as can be observed in Ben [[sigara içme-nin] kanser yap-tıg-ı-nı] bilmiyordum (I [cigarette smoking]-GEN.3s cancer do-COMP-POSS.3s-ACC I-not-know) "I did not know that smoking caused cancer." There is possessive meaning in kapag-ı-nı. One implication is that their morphological tags must be different, because of differences in semantics.^d Morphology's role is crucial, and it can be nonlocal in setting up the correct relation of meaning. As we shall see later, in Figures 9(c) and 10, in many cases looking at the previous word is not enough to disambiguate the possessor even in examples where we can determine that the shape has possession semantics; therefore, decoding on this basis alone would not be sufficient (cf. Akyürek, Dayanık, and Yuret 2019). It needs an attention mechanism or its functional equivalent. Therefore, a standard method that learns morphology, syntax, and semantics in a horizontal pipeline would eventually suffer in such circumstances.
Joint learning addresses these concerns. It is defined as learning several related tasks in parallel by using shared knowledge contained in the various tasks, so that each task can help the other tasks to be learned better (Caruana 1997). Representational autonomy of the tasks is what makes joint modeling expedient.
We represent distributional meaning in terms of neural vector representations that help to learn some underlying form of a sentence. We use neural character and morpheme representations to learn the word representations for learning POS tags and dependency relations between words. In learning morphological segmentation of words, we aim to obtain a word representation based on morphemes. Recent work shows that representations learned from words as separate tokens are not good enough to capture the syntactic and semantic information of words in agglutinating languages because of severe sparsity. Character-level or morpheme-level representations lead to a better prediction of embeddings in such languages (e.g., Üstün, Kurfalı, and Can 2018). Vania and Lopez (2017) provide a comparison of word representations. Their results confirm that although character-level embeddings are effective for many languages, they do not perform as well as models with explicit morphological information. We use both character and morpheme representations in learning POS tags and dependencies.

^d This has been a source of variation in Turkish linguistics, which has been carried over to computational linguistics. For example, Göksel and Kerslake (2005, p. 135) mark the dependent as GEN and the head as POSS, both for the genitive-possessive construction and for agreement in finite subordination. However, Kornfilt (2001, p. 187), who has been the main resource since the 1980s for calling the elements on subordinate verbs "agreement markers" rather than possession markers, marks the first one as GEN and the second as 3SG, for third person singular. To limit interpretation, when the actual morph is shown, we use both notations, for example, GEN.3s for third person singular genitive and POSS.2s for second person singular possessive.
Our framework is built upon the joint POS tagging and dependency parsing model of Nguyen and Verspoor (2018), to which we introduce two more layers for morphological segmentation and morphological tagging. The results show that a unified framework benefits from joint learning of various levels, from morphology to dependencies. Our results with joint learning of morphological tagging and dependency parsing are encouraging compared to the model of Nguyen and Verspoor (2018), who model them separately.
The rest of the paper is organized as follows. Section 2 reviews related work in five tasks that we model. Section 3 describes the proposed joint learning framework and the components involved in the architecture. Section 4 presents experimental results and detailed error analysis. A brief section concludes (Section 5).

Related work
Adequate treatment of morphology has been one of the long-standing problems in computational linguistics. Some work in the literature deals with it as a segmentation task that only aims to segment each word into its morphemes without identifying the syntactic or semantic role of the morpheme (Goldsmith 2001; Creutz and Lagus 2007; Can and Manandhar 2010; Goldwater, Griffiths, and Johnson 2011). Some work deals with morphology as both segmentation and sequence labeling, which is called morphological tagging (Müller et al. 2015; Cotterell and Heigold 2017; Dayanık, Akyürek, and Yuret 2018). POS tagging (Schütze 1993; Clark 2000; Van Gael, Vlachos, and Ghahramani 2009) and dependency parsing (Kudo and Matsumoto 2002; Oflazer et al. 2003; Nivre and Nilsson 2005) are well-known tasks in computational linguistics.
In recent years, deep neural networks have been extensively used for all of these tasks. Regarding morphological analysis, Heigold, Neumann, and van Genabith (2017) present an empirical comparison between convolutional neural network (CNN) (Lecun et al. 1998) and long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997) architectures. Character-based architectures are meant to deal with sparsity. LSTMs give slightly better results compared to CNNs. Cotterell and Heigold (2017) apply transfer learning by jointly training a model for different languages in order to learn cross-lingual morphological features. Shared character embeddings are learned for each language, as in joint learning, under a joint loss function. Dayanık, Akyürek, and Yuret (2018) propose a sequence-to-sequence model for morphological analysis and disambiguation that combines word embeddings, POS tags, and morphological context.
Deep neural networks have also been used for POS tagging. Dos Santos and Zadrozny (2014) employ both word embeddings and character embeddings obtained from a CNN. Ling et al. (2015) introduce a model to learn word representations by composing characters. The open vocabulary problem is addressed by employing the character-level model, and final representations are used for POS tagging. The results are better compared to that of Dos Santos and Zadrozny (2014).
Joint learning has re-attracted attention in recent years (Sanh, Wolf, and Ruder 2019; Liu et al. 2019; Li et al. 2020) after the resurgence of neural networks. To our knowledge, there has not been any study that combines morphological segmentation, morpheme tagging, POS tagging, and dependency parsing in a single framework using deep neural networks, especially for morphologically rich languages. There have been various studies that combine some of these tasks. Nguyen and Verspoor (2018) introduce a model that extends the graph-based dependency parsing model of Kiperwasser and Goldberg (2016) by adding an extra layer to learn POS tags through another bidirectional LSTM (BiLSTM), whose output is used as one of the inputs to the dependency LSTMs. Yang et al. (2018) propose a joint model that performs POS tagging and dependency parsing with transition-based neural networks. Zhang et al. (2015) introduce a joint model that combines morphological segmentation, POS tagging, and dependency parsing; the model does not perform morpheme tagging. It is based on a randomized greedy algorithm that jointly predicts morphological segmentations, POS tags, and dependency trees. Straka (2018) proposes a pipeline model called UDPipe 2.0 that performs sentence segmentation, tokenization, POS tagging, lemmatization, and dependency parsing. The pipeline framework follows a bottom-up approach without jointly training the tasks. Kondratyuk and Straka (2019) extend UDPipe 2.0 by replacing the embedding and projection layers with BERT, trained across 75 languages simultaneously.
Our proposed model differs from these models in trying to combine five different tasks in a single learning framework, with the expectation that all five tasks improve by joint modeling, and with cross-level information. The baseline of our framework, Nguyen and Verspoor (2018), addresses only POS tagging and dependency parsing. We extend their model with extra layers for morphological segmentation and identification, enabling the model to perform both morphological and syntactic analysis for morphologically rich languages.

Joint learning of morphology and syntax
In the base model of Nguyen and Verspoor (2018), one component is for POS tagging. It is based on a two-layer BiLSTM (Hochreiter and Schmidhuber 1997), which encodes the sequential information coming from each word within a sentence. The encoded information is passed through a single-layer multilayer perceptron that outputs the POS tags. Their second component is for graph-based dependency parsing. It also involves a BiLSTM, which learns the features of dependency arcs and their labels by using the predicted POS tags obtained from the POS tagging component, word embeddings, and character-level word embeddings.
The overview of our architecture is shown in Figure 2. We have added one component to learn the morphological segmentation and another to do morphological identification to obtain the morpheme labels. The vectorial representations used in the model are illustrated in Figure 3. We now describe the architecture.

Morphological segmentation
The lowest-level component in our joint learning framework performs segmentation, using characters as proxies for phonological shape. For agglutinating languages in particular, the task is to make the segmentation morphologically salient, that is, corresponding to a morph. The overall sub-architecture of the segmentation component is given in Figure 4. A BiLSTM (BiLSTM_seg), which encodes the characters in a word, is used to learn the sequential character features for segment boundaries. We feed this BiLSTM with the one-hot character encoding of each character in the word.
h_i = BiLSTM_seg(e_1:n, i)

where e_1:n denotes the sequence of one-hot vectors for the characters in the word and h_i is the output at the ith character. Each output vector is reduced to a single dimension by a multilayer perceptron (MLP) with one hidden layer to predict a segment boundary at a given step:

ŷ_i = σ(MLP(h_i))

where ŷ_i denotes the probability of a segment boundary after the ith character in the word. The MLP has one sigmoid output that predicts whether there is a segment boundary at a time step: 1 indicates that there is a segment boundary after the character, and 0 indicates that the word will not be split at that point.

Figure 2. The layers of the proposed joint learning framework. The sentence "Ali okula gitti." ("Ali went to school") is processed from morphology up to dependencies.
The sigmoid is a continuous function. We use it for binary tagging: during testing, a value above 0.6 is taken as a morpheme boundary and a value below 0.6 as a non-boundary. In other words, we threshold the output to predict a segmentation boundary during testing. During training, however, we sample a segmentation boundary stochastically, according to the predicted probability.

Figure 3. The layers of the proposed joint learning framework working on the sentence "okula gitti" ("(he/she) went to school"). The vectors c, w, e, mt, s, and p denote character, word, concatenated character and word, morphological tag, segment, and POS tag, respectively.
Based on the predicted outputs for each position inside the word, we compute the binary cross-entropy loss over the Y characters inside a word:

L_seg = − Σ_{i=1}^{Y} [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ]

where ŷ_i denotes the predicted segmentation and y_i denotes the gold segmentation.

Figure 4. The sub-architecture of the segmentation component. The resulting vector obtained from each state of the BiLSTM is fed into a multilayer perceptron with one hidden layer and a sigmoid activation function in the output layer. The output of the sigmoid function is a value between 0 and 1, where 1 corresponds to a segment boundary and 0 indicates a non-boundary. In the example, there is a boundary after the character i, that is, 1 is the output from the MLP at that time step.
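The boundary decision and its loss can be sketched in a few lines of plain Python. This is a minimal illustration with toy logits, not the DyNet implementation; the function names are ours.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_boundaries(logits, threshold=0.6):
    """Test-time rule: a sigmoid value above the threshold marks a
    morpheme boundary after that character; below it, no split."""
    return [1 if sigmoid(v) > threshold else 0 for v in logits]

def segmentation_loss(probs, gold):
    """Binary cross-entropy summed over the Y character positions of a word."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(probs, gold))

# Toy word of four characters: boundaries predicted after chars 2 and 4.
print(predict_boundaries([-2.0, 1.0, -1.0, 3.0]))  # → [0, 1, 0, 1]
```

At training time, as described above, a boundary is instead sampled from the predicted probability rather than thresholded.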

Morphological tagging
The second additional component in the model is for morphological tagging. We use an encoder-decoder model which takes each word as input, encodes the characters of the word along with the context of the word, and performs decoding by generating the morpheme tags of the word. Decoding is performed sequentially, predicting the morpheme tags of a word one by one using an LSTM. However, this sequential generation need not follow the actual order of morphemes in the word: it produces the corresponding morpheme tags of the word regardless of their position. Therefore, decoding is not tied to agglutination, and portmanteau morphs, circumfixation, and lexically marked differences in a word are representable (e.g., run/ran).
The overall sub-architecture of the morphological tagging component is depicted in Figure 5. Here, we define a two-layer BiLSTM for the character encoder, where each character is encoded as the vector output at that time step of the BiLSTM, which involves all the sequential information to the right and left of the current character:

m^char_i = [→h^c_i ; ←h^c_i]

where →h^c_i is the forward representation and ←h^c_i is the backward representation obtained for the ith character in the current word. In order to decode a token into a morphological tag through the decoder, we apply an attention mechanism to learn which characters in the actual word have more impact on the current tag that will be generated by the decoder.
We estimate the weight of each character c_i on each decoded token z_j, normalized over all characters in the original word:

α_ij = exp(score(c_i, z_j)) / Σ_{i′=1}^{Y} exp(score(c_i′, z_j))    (5)

Figure 5. The encoder-decoder sub-architecture of the morphological tagging component. The top part is the encoder and the bottom part is the decoder. The example is "kahveleri bende içelim" ("let's drink coffee at my place").
Y is the length of the encoded word. The unnormalized score score(c_i, z_j), which measures how much contribution each character has on the output tokens, is computed from the relation between the encoded character c_i and the decoded state z_j as follows:

score(c_i, z_j) = v_1^T tanh(W_1 c_i + W_2 z_j)

where v_1, W_1, and W_2 are the parameters to be learned during training. This is a one-layer neural network that applies the tanh function to its output. It corresponds to the attention mechanism for the character encoder. Once the weights are estimated, the total contribution of all characters to the current word is computed as the weighted sum at time step t:

c_t = Σ_i α_it c_i

where c_t is the attended character representation of the current word, which will be one of the inputs to the decoder.
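A minimal sketch of this additive attention step in plain Python may help. Dimensions and parameter values below are toy; in the model, v_1, W_1, and W_2 are learned jointly.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def additive_attention(chars, z, W1, W2, v1):
    """score(c_i, z) = v1 · tanh(W1 c_i + W2 z); weights are softmax-normalized
    over the characters, and the attended vector is their weighted sum."""
    def matvec(W, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in W]
    scores = []
    for c in chars:
        h = [math.tanh(a + b) for a, b in zip(matvec(W1, c), matvec(W2, z))]
        scores.append(sum(vi * hi for vi, hi in zip(v1, h)))
    alphas = softmax(scores)
    dim = len(chars[0])
    return [sum(a * c[d] for a, c in zip(alphas, chars)) for d in range(dim)]
```

With two one-dimensional "character" vectors [1.0] and [3.0], the higher-scoring character receives the larger weight, so the attended vector lies closer to it.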
For the word encoder, we define a second one-layer BiLSTM, where each context word is encoded as a vector:

m^word_k = [→h^w_k ; ←h^w_k]

where →h^w_k is the forward representation and ←h^w_k is the backward representation obtained for the kth context word in the current sentence. The input to BiLSTM_w_encode is generated by another BiLSTM that processes the characters of each word and produces a character-level word embedding of the given word using character embeddings. Another attention mechanism is used for encoding the context to predict the correct morphological tags based on the context, thereby enabling morphological disambiguation. To this end, we estimate the weight of each context word w_k (including the current word) on each decoded token z_j, normalized over all words in the sentence:

α_kj = exp(score(w_k, z_j)) / Σ_{k′=1}^{N} exp(score(w_k′, z_j))    (9)

where N denotes the number of words in the sentence. The unnormalized score score(w_k, z_j), which measures how much contribution each context word has on the output tokens, is computed from the encoded word w_k and the decoded state z_j as follows:

score(w_k, z_j) = v_2^T tanh(W_3 w_k + W_4 z_j)

where v_2, W_3, and W_4 are the parameters to be learned during training. This is a one-layer neural network that applies the tanh function to its output. It is the attention mechanism for the context encoder.
Once the weights are estimated, the total contribution of all words in the sentence is computed as the weighted sum at time step t:

wv_t = Σ_k α_kt w_k

where wv_t is the attended context representation, which is another input to be fed into the decoder.
The input of the decoder at each time step t is therefore the concatenation of the embedding of the morpheme tag produced at the previous time step t − 1 and the weighted sums of the BiLSTM outputs of the character encoder and the context encoder:

x_t = [e(ŷ_{t−1}) ; c_t ; wv_t]

The decoder is a unidirectional LSTM with output at each time step t:

ŷ_t = softmax(MLP(LSTM_dec(x_t)))

where ŷ_t is the predicted morpheme tag at time step t, that is, the tth morpheme tag generated for the current word. The output of each decoder state is fed into a one-layer MLP with a softmax function to predict the next morpheme tag. Finally, categorical cross-entropy loss is computed for each position in the decoder. The morphological tagging loss is

L_mtag = − Σ_t log p(y_t)

where y_t is the gold morpheme tag at position t.

Word vector representation
We use word-level, character-level, and morpheme-level word embeddings for both POS tagging and dependency parsing. Word-level word embeddings are pretrained with word2vec (Mikolov et al. 2013). We learn the character embeddings through a BiLSTM, whose output is used for the character-level word embeddings. For the morpheme-level embeddings, we have two types: one generated from the actual morphemes and one generated from the morpheme tags. For the embeddings of the actual morphemes, we use morph2vec (Üstün et al. 2018) to pretrain the morpheme embeddings. Once the morpheme embeddings are learned, we use a BiLSTM that encodes each morpheme embedding to predict the morpheme-level word embeddings. For unseen morphemes, we use another character BiLSTM, fed with character embeddings. For the embeddings of the morpheme tags, we randomly initialize the morpheme tag embeddings, where the morpheme tags are predicted by the morpheme tagging component described above.
The final vector e_i for the ith word w_i in a given sentence s = w_1, w_2, ..., w_N is

e_i = [e^w_{w_i} ; e^c_{w_i} ; e^m_{w_i} ; e^mt_{w_i}]

where e^w_{w_i} is the word-level word embedding, e^c_{w_i} is the character-level word embedding, e^m_{w_i} is the morpheme-level word embedding, and e^mt_{w_i} is the encoding of the sequence of morpheme tags for the given word.
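The concatenation itself is straightforward; the following toy sketch uses the 200/50/100-dimensional sizes reported in the Realization section for the first three views, while the morpheme tag encoding size (50) is purely our illustrative assumption.

```python
def word_vector(e_word, e_char, e_morph, e_mtag):
    """e_i = [e^w ; e^c ; e^m ; e^mt]: concatenate the four views of a word."""
    return e_word + e_char + e_morph + e_mtag

# Toy zero vectors with illustrative dimensions (the mtag size is assumed).
e_i = word_vector([0.0] * 200, [0.0] * 50, [0.0] * 100, [0.0] * 50)
print(len(e_i))  # → 400
```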

POS tagging
We use a BiLSTM to learn the latent feature vectors for POS tags in a sentence. We feed the sequence of vectors e_1:N for the sentence to BiLSTM_POS. The output of each state is a latent feature vector v^POS_i for the ith word w_i in the sentence:

v^POS_i = BiLSTM_POS(e_1:N, i)    (16)

In order to predict the POS tag of each word, we use an MLP with a softmax output function, similar to the joint model of Nguyen and Verspoor (2018):

POS_i = argmax_{t∈T} softmax(MLP(v^POS_i))

where POS_i denotes the predicted POS tag of the ith word in the sentence and T is the set of all POS tags. We use the categorical cross-entropy loss

L_POS = − Σ_t log p(POS_t)

where POS_t is the gold POS tag.
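The tag decision and its categorical cross-entropy loss can be sketched as follows; the tagset and logits are toy values, and in the model the input comes from the BiLSTM rather than being given directly.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

TAGS = ["NOUN", "VERB", "ADJ"]  # toy POS tagset for illustration

def predict_pos(logits, tags=TAGS):
    """argmax over the softmax of the MLP output."""
    probs = softmax(logits)
    return tags[max(range(len(tags)), key=probs.__getitem__)]

def pos_loss(logits, gold_index):
    """Categorical cross-entropy: -log p(gold tag)."""
    return -math.log(softmax(logits)[gold_index])

print(predict_pos([2.0, 0.5, 0.1]))  # → NOUN
```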

Dependency parsing
We employ a BiLSTM (BiLSTM_dep) to learn latent feature vectors for the dependency relations between words. We feed BiLSTM_dep with the POS tag embeddings and the word embeddings e_i. Once the POS tags are predicted, we represent each POS tag with a vector representation e^(p)_{p_i}, which is randomly initialized. The final input vector representation for the dependency parsing component is

x_i = [e_i ; e^(p)_{p_i}]

We feed the sequence of vectors x_1:N into BiLSTM_dep. Latent feature vectors v_1:N are obtained from BiLSTM_dep:

v_k = [→h^d_k ; ←h^d_k]

where →h^d_k is the forward representation and ←h^d_k is the backward representation obtained from the output of the kth state in BiLSTM_dep.

Figure 6. The pointer network which is used to predict a dependency score between words in a sentence. Some Turkish dependencies are shown to exemplify head-dependent relations.
To predict the dependency arcs, we follow Nguyen and Verspoor (2018) and McDonald, Crammer, and Pereira (2005). Features are enriched with morpheme embeddings obtained from morph2vec (Üstün et al. 2018). The arcs are scored by an MLP with a single output node that produces a score denoted by score_arc(i, j):

score_arc(i, j) = MLP_arc([v_i ; v_j])

The MLP is part of a pointer network (Figure 6) which predicts whether two words in a sentence have a dependency relation or not.
Once the scores are predicted, we use the decoding algorithm of Eisner (1996) to find the projective parse tree with the maximum score:

t̂ = argmax_{t∈T(s)} Σ_{(h,m)∈t} score_arc(h, m)

where h denotes the head word of the dependency, m denotes the modifier word, T(s) is the set of all possible parse trees for the sentence s, and t̂ is the parse tree with the maximum score. We also predict the label for each arc with another MLP, MLP_rel, with softmax output, which computes an output vector v_(h,m) for each arc (h, m).
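For concreteness, here is a compact, from-scratch sketch of first-order Eisner decoding over an arc score matrix. This is our own illustration, not the authors' DyNet code; `scores[h][m]` is the arc score for head h and modifier m, with index 0 as the artificial root.

```python
def eisner(scores):
    """First-order projective decoding (Eisner 1996).
    Returns the head index of each token; heads[0] = -1 for the root."""
    n = len(scores)
    NEG = float("-inf")
    # inc/comp[d][s][t]: best incomplete/complete span over s..t.
    # d = 0: head on the right (arc t -> s); d = 1: head on the left (arc s -> t).
    inc = [[[0.0 if s == t else NEG for t in range(n)] for s in range(n)] for _ in range(2)]
    comp = [[[0.0 if s == t else NEG for t in range(n)] for s in range(n)] for _ in range(2)]
    inc_bp = [[[0] * n for _ in range(n)] for _ in range(2)]
    comp_bp = [[[0] * n for _ in range(n)] for _ in range(2)]
    for k in range(1, n):
        for s in range(n - k):
            t = s + k
            # Join two complete halves with a new arc (incomplete spans).
            best, arg = max((comp[1][s][r] + comp[0][r + 1][t], r) for r in range(s, t))
            inc[0][s][t] = best + scores[t][s]; inc_bp[0][s][t] = arg
            inc[1][s][t] = best + scores[s][t]; inc_bp[1][s][t] = arg
            # Extend an incomplete span with a complete one (complete spans).
            best, arg = max((comp[0][s][r] + inc[0][r][t], r) for r in range(s, t))
            comp[0][s][t] = best; comp_bp[0][s][t] = arg
            best, arg = max((inc[1][s][r] + comp[1][r][t], r) for r in range(s + 1, t + 1))
            comp[1][s][t] = best; comp_bp[1][s][t] = arg
    heads = [-1] * n

    def backtrack(kind, d, s, t):
        if s == t:
            return
        if kind == "comp":
            r = comp_bp[d][s][t]
            if d == 0:
                backtrack("comp", 0, s, r); backtrack("inc", 0, r, t)
            else:
                backtrack("inc", 1, s, r); backtrack("comp", 1, r, t)
        else:
            r = inc_bp[d][s][t]
            heads[s if d == 0 else t] = t if d == 0 else s
            backtrack("comp", 1, s, r); backtrack("comp", 0, r + 1, t)

    backtrack("comp", 1, 0, n - 1)
    return heads
```

For a toy three-token input (root plus two words) whose scores favor the arcs root→w1 and w1→w2, the decoder recovers exactly that chain.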

Cross-level contextual information flow
In the architecture which we have described so far, every time step in each layer is fed with only that time step's output obtained from one layer below. We further investigate data sharing between the layers in various time steps. For this, we incorporate contextual information from the previous word into the input of the layers for the current word, by adding contextual information obtained from morpheme tagging encoding and morpheme encoding of the previous word to POS and dependency layers of the current word. Similarly, we incorporate POS tagging encoding of the previous word into the dependency layer of the current word. An illustration of the cross-level contextual information flow is in Figure 7.
We do this by various methods. The first method is a weighted sum:

e′_{w_i} = α_1 e_{w_{i−1}} + α_2 e_{w_i}

where α_1 and α_2 are the weights combining the encoding of the previous word e_{w_{i−1}} and the encoding of the current word e_{w_i}, and e′_{w_i} denotes the updated encoding of the current word.^f The encodings may correspond to morpheme-based word embeddings, morpheme tag encodings, or POS tag encodings of words, depending on the layer to which the cross-level contextual information flow applies. The second method is elementwise multiplication between the encoding of the previous word and the encoding of the current word:

e′_{w_i} = e_{w_{i−1}} ⊙ e_{w_i}

In the third method, we use an MLP to let the full joint model learn how to combine the encodings of successive words: the encodings of the previous and the current word are concatenated and then fed into a one-layer MLP whose parameters are learned jointly with the model:

e′_{w_i} = MLP([e_{w_{i−1}} ; e_{w_i}])

In the fourth method, we employ a concatenative (or additive) attention network to learn the weights of the contextual information automatically, using a fixed-size window for the history.^g The weight of a previous word is estimated as

α_i = exp(score(w_{i−1}, x_i)) / Σ_{w∈Y} exp(score(w, x_i))

where w_{i−1} is the previous word, x_i is the encoding of the ith token (in particular, the POS tag or the dependency relation), and Y is the set of context words (including the current word itself). Thus, α_i gives the relevance of w_{i−1} for x_i in that layer. The unnormalized score score(w_{i−1}, x_i) that estimates the weight of each context word is defined as follows:

score(w_{i−1}, x_i) = v_3^T tanh(W_4 e_{w_{i−1}} + W_5 x_i)

where e_{w_{i−1}} is the encoded information of w_{i−1} (with morphemic information, morpheme tags, or POS tags) obtained from a lower layer to be fed into an upper layer, and v_3, W_4, and W_5 are the parameters to be learned during training. Once the weights are learned, max pooling is applied for the final encoding that will be fed into the upper layer.

In the fifth method, we use a dot-product attention model, where the unnormalized score score_2(w_{i−1}, x_i) estimating the weight of each context word is the dot product of the two encodings. We also employ CNNs to blend the contextual information, using convolution and pooling layers along with a rectifier unit.

^f We use manually set weights, α_1 = 0.3 and α_2 = 0.7. The weights were assigned empirically as a result of several experiments on the development set.

^g We set the window size to 2 or 5 to the left and 1 or 2 to the right, to incorporate narrower and larger contextual information. The results are discussed in Section 4.
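The first three combination methods are simple enough to sketch directly. The vectors below are toy values; in the model they are learned encodings, and the MLP parameters are trained jointly.

```python
import math

def weighted_sum(prev, cur, a1=0.3, a2=0.7):
    """Method 1: fixed-weight blend of previous- and current-word encodings
    (0.3/0.7 are the weights tuned on the development set)."""
    return [a1 * p + a2 * c for p, c in zip(prev, cur)]

def elementwise(prev, cur):
    """Method 2: elementwise (Hadamard) product of the two encodings."""
    return [p * c for p, c in zip(prev, cur)]

def mlp_combine(prev, cur, W, b):
    """Method 3: one-layer MLP (tanh) over the concatenated encodings;
    W and b stand in for parameters learned with the rest of the model."""
    x = prev + cur
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]
```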
An illustration of the attention network that allows cross-level contextual information flow for two previous words is given in Figure 8. The figure shows the attention network between the POS layer and the morpheme tagging layer, where e^mt_{i−2} and e^mt_{i−1} correspond to the morpheme tags of the previous two words and w_i is the current word with POS tag p_i. Thus, the attention network learns how the morpheme tags of the two previous words affect the POS tag of the current word. We generalize it further to a larger context to the left and right, and between various layers in the architecture. A similar cross-level contextual information flow is also built between the dependency layer and the morpheme tagging layer, the dependency layer and the segmentation layer, and the POS tagging layer and the segmentation layer. Attention can be replaced with any of the methods mentioned (weighted sum, elementwise multiplication, MLP, concatenative attention, dot-product attention, and CNN) to combine the contextual information in each layer before feeding it into the upper layer.

Figure 8. An illustration of morpheme tag cross-level contextual information flow between the POS layer and the morpheme tagging layer. The context contains only the two previous words in the example. The POS layer also takes the morpheme segmentation embeddings and word embeddings, which are based on their own attention networks; they are excluded from the figure for the sake of simplicity.

Model training
In the proposed joint learning framework, each component has its own error that contributes to the total loss L of the model:

L = L_seg + L_mtag + L_POS + L_dep

where L_dep covers both arc and label prediction. We use training and validation corpora to train the model. During training, at each epoch, the model is first trained with the Adaptive Moment Estimation (Adam) optimization algorithm (Kingma and Ba 2015). Then, per-token accuracy is calculated on the validation set. At every tenth epoch, all momentum values are cleared in the Adam trainer. We apply step-based learning rate decay and early stopping.
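The schedule can be illustrated with a small helper that tracks early stopping and the every-tenth-epoch momentum reset. The patience value is our assumption; the paper does not report it.

```python
def train_schedule(val_accs, patience=3, clear_every=10):
    """Simulate the training schedule over a sequence of validation accuracies.
    Returns (stop_epoch, epochs_at_which_momentum_was_cleared).
    patience is an assumed early-stopping parameter (not given in the paper)."""
    best, wait, cleared = float("-inf"), 0, []
    for epoch, acc in enumerate(val_accs, start=1):
        if epoch % clear_every == 0:
            cleared.append(epoch)  # clear Adam moments at every tenth epoch
        if acc > best:
            best, wait = acc, 0    # validation improved: reset patience counter
        else:
            wait += 1
            if wait >= patience:   # no improvement for `patience` epochs: stop
                return epoch, cleared
    return len(val_accs), cleared
```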

Realization
We implemented our model in DyNet v2.0 (Neubig et al. 2017), a dynamic neural network toolkit. The implementation of the model is publicly available at https://github.com/halecakir/JointParser. For the morphological segmentation component, we split each token into characters. We represent each character with a 50-dimensional character embedding. Using the characters as input, we build a BiLSTM with 50-dimensional unit size to encode the character context. We feed the encoded characters into an MLP with a single sigmoid output (output size 1) to capture the segment boundaries.
Based on the segmentation probabilities obtained from the MLP, we create a morpheme list that makes up the input word. We encode morphemes with 50-dimensional embeddings. We feed another BiLSTM with 50-dimensional units to encode each input morpheme. We concatenate the 200-dimensional word-level word embeddings, 50-dimensional character-level word embeddings, and 100-dimensional morpheme-level word embeddings for the final word representation.
For the morpheme tagging component, we use a two-layered BiLSTM with 128-dimensional unit size to encode the character context. The decoder is also based on a two-layered BiLSTM with 128-dimensional unit size that is fed with the attended characters that are encoded by the encoder.
For the POS tagging component, we use a two-layered BiLSTM with 128-dimensional unit size to encode the word context. We feed the output of the word context into an MLP with a softmax activation function and 100-dimensional hidden unit size. The output size of the MLP is equal to the number of distinct POS tags in a language. Softmax gives the probability vector for the POS categories. We use the negative log-likelihood function as a loss function for the POS tagging component.
For the dependency parsing component, we use a two-layered BiLSTM with 128-dimensional unit size and a pointer network with 100-dimensional hidden unit size.^h We use the negative log-likelihood function for training.
We adopt dropout, which randomly zeroes some weights, and observed significant improvement when we applied dropout with a rate of 0.3 after each LSTM.

Experiments and results
We have tested the system on various languages. We first report the Turkish results.

Turkish data
We used the UD Turkish Treebank for both training and evaluation. The dataset is a semi-automatic conversion of the IMST Treebank (Sulubacak et al. 2016), which is a re-annotated version of the METU-Sabancı Turkish Treebank (Oflazer et al. 2003). All three treebanks share the same raw data: 5635 sentences from daily news reports and novels.
For the pretrained word embeddings, we use 200-dimensional word embeddings trained on the Boun Web Corpus (Sak, Güngör, and Saraçlar 2008), provided by the CoNLL 2018 Shared Task. We also pretrain morph2vec (Üstün et al. 2018) on the METU-Sabancı Turkish Treebank to learn the morpheme embeddings. To train the segmentation component, we obtain the gold segments from the rule-based Zemberek analyzer (Akın and Akın 2007), disambiguating the segmentation of each word in its context in the METU-Sabancı Treebank so that each word token has only one interpretation.

Turkish results
We performed several experiments with different combinations of the components of the proposed model. All models are trained jointly with a single loss function. Depending on the examined joint tasks, we excluded morpheme tagging, segmentation, or both from the full model. We observed how the lower levels, which correspond to the morpheme tagging and segmentation tasks, affect the upper levels, which correspond to POS tagging and dependency parsing.
We followed the standard evaluation procedure of the CoNLL 2018 shared task for dependency parsing, morphological tagging, and POS tagging. For POS tagging, we report accuracy, the percentage of correctly tagged words. For dependency parsing, we present two evaluation scores: labeled attachment score (LAS) and unlabeled attachment score (UAS). An attachment score is the percentage of words that are assigned the correct head and dependency arc: LAS is the percentage of words with both the correct head and the correct dependency label, whereas UAS considers only the head, disregarding the dependency labels. We also evaluate segmentation by accuracy, the percentage of correctly segmented words. We evaluate morpheme tagging by word accuracy based on all of the morpheme tags that each word bears: a word counts as correctly tagged only if all of its morpheme tags are correct. This measure is labeled FEATS (universal morphological tags) in the tables. (All dimensions in the network are determined empirically.)
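The metrics just described can be computed straightforwardly; the sketch below is a plain-Python illustration of UAS, LAS, and the all-tags-correct word accuracy, not the official CoNLL evaluation script.

```python
def uas_las(gold, pred):
    """UAS/LAS over per-word (head, deprel) pairs: UAS counts correct
    heads; LAS additionally requires the correct dependency label."""
    assert len(gold) == len(pred)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return uas, las

def word_accuracy(gold, pred):
    """Accuracy where a word counts as correct only if its full
    annotation (e.g., all of its FEATS morpheme tags) matches."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Toy example: (head, deprel) per word; one label error, no head errors.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl")]
uas, las = uas_las(gold, pred)
assert uas == 1.0 and abs(las - 2 / 3) < 1e-9
```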
Because gold segments are not provided in the CoNLL datasets, we segmented the test set with the Zemberek morphological segmentation tool of Akın and Akın (2007) and evaluated against the Zemberek outputs. Our segmentation results are therefore computed relative to Zemberek.
All the results obtained for the proposed models on Turkish are given in Table 1. We compare them with the joint POS tagging and dependency parsing model of Nguyen and Verspoor (2018), which is our base model. Recall that it has two layers; it is given as POS-DEP (Baseline) in the table. Our baseline with these two layers only approximately replicates the results of Nguyen and Verspoor (2018): we obtained a UAS score of 70.42% and a LAS score of 62.71% from the baseline model. The baseline results in Table 1 are obtained using Nguyen and Verspoor (2018)'s own implementation; however, we have been unable to replicate their reported scores identically, possibly due to minor hyperparameter differences. POS tagging accuracy is highest, at 94.78%, when all the components (i.e., with the addition of morpheme tagging and morphological segmentation) are included in a full joint model. The same also holds for UAS, LAS, morpheme tagging accuracy, and segmentation accuracy, which are all highest when every component is adopted in the model. This shows that all the layers in the model contribute to each other during learning.
Ours being the first attempt to combine five tasks in a single joint model, we compare the performance of each component separately with different models. The results for dependency parsing are in Table 2, which shows the UAS and LAS scores of different dependency parsing models. We compare our results with the joint POS tagging and dependency parsing model by Nguyen and Verspoor (2018), with the winning contribution to the CoNLL 2018 shared task, which incorporates deep contextualized word embeddings (Che et al. 2018), with the pipeline system that performs tokenization, word segmentation, POS tagging, and dependency parsing by Qi et al. (2018), with the tree-stack LSTM model by Kırnap, Dayanık, and Yuret (2018), and with the dependency parser that incorporates morphology-based word embeddings proposed by Özateş et al. (2018).
The results show that our model has a nontrivial improvement over the baseline model of Nguyen and Verspoor (2018). Moreover, our model outperforms most of the models that participated in CoNLL 2017 and CoNLL 2018. The only models that perform better than ours are those of Che et al. (2018) and Qi et al. (2018). Their models give UAS scores of 72.25% and 71.07% and LAS scores of 66.44% and 64.42%, respectively, whereas our model gives a UAS score of 71.00% and a LAS score of 63.92%. Our UAS score is competitive with the deep contextualized model by Che et al. (2018); however, there is a 2.5% difference in the LAS scores of the two models. As the authors of that work also state, deep contextualized word embeddings (BERT, Devlin et al. 2019) affect parsing accuracy significantly. Apart from the use of contextualized word embeddings, our base model is similar to theirs; our full model additionally performs morpheme tagging and morphological segmentation. Qi et al. (2018) present another prominent model, which was placed second in the CoNLL 2018 shared task by its UAS and LAS scores. Similar to us, they introduce a neural pipeline system that performs tokenization, segmentation, POS tagging, and dependency parsing. Their architecture is similar to ours, except that they use a biaffine classifier for POS tagging, similar to that of Dozat and Manning (2017).
The comparison of our POS tagging results with other POS tagging models is given in Table 3, where we compare our model with that of Che et al. (2018). The comparison of our morphological tagging (FEATS) results with other models is given in Table 4, where we compare our model with Straka (2018). The best performing model is that of Straka (2018), with an accuracy of 91.25%. It is worth mentioning that none of these models performs joint learning for morphological tagging; ours is the only joint model that performs morphological tagging along with the other syntactic tasks. Our results are very competitive with the other models, although they fall slightly behind them.

The effect of cross-level contextual information flow
The cross-level contextual information flow has been investigated with the following methods: weighted sum, elementwise multiplication, MLP, and attention network. We include only the previous word as context in the weighted sum, elementwise multiplication, and MLP methods, and we incorporate various lengths of contextual information from both left and right in the attention mechanism.
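Of these methods, basic dot-product attention over the context words can be sketched as follows; the vector dimension and context size here are illustrative, not the model's actual settings.

```python
import numpy as np

def dot_attention(query, context):
    """Basic dot-product attention: score each context-word vector by
    its dot product with the query, normalize with softmax, and return
    the weighted sum of the context vectors."""
    scores = context @ query
    e = np.exp(scores - scores.max())   # stable softmax over context
    weights = e / e.sum()
    return weights @ context, weights

rng = np.random.default_rng(2)
query = rng.normal(size=64)             # current word's representation
context = rng.normal(size=(5, 64))      # e.g., the five previous words

summary, weights = dot_attention(query, context)
assert summary.shape == (64,)
assert abs(weights.sum() - 1.0) < 1e-9  # weights form a distribution
```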
The results of the proposed methods for cross-level contextual information flow on Turkish are given in Table 5. The highest results are obtained with basic dot-product attention.
Attention networks resolve the issue of manual weight assignment by learning the weight of each context word automatically during training. However, attention with concatenation requires a separate development set for hyperparameter tuning to find the optimal weights for the context words; this tuning has to be performed for every language and dataset separately, using a development set that is preferably different from the training set. We incorporated various lengths of contextual knowledge for both the left and the right context. For a smaller context, we incorporated the two previous words; for a larger context, the five previous words. The results obtained with two previous words are comparably lower than those with five previous words. We also incorporated the right context, with one and two following words in addition to the previous words. Here the smaller context performs better than the larger one, which could be due to the varying sentence lengths in the dataset, where there is often not enough history for the larger contexts in particular.
We split the dataset into training and test sets so as to use different lengths of contextual knowledge for shorter (fewer than 10 words) and longer (10 or more words) sentences. We used five words on the left and one word on the right for longer sentences, and two words on the left and one word on the right for shorter sentences. The overall results are comparably better than those obtained using a fixed context length for all sentences, especially in dependency parsing. However, the POS tagging and morphological tagging results fall behind those with a fixed amount of context, which could be due to the size of the training sets after splitting. It still shows that a larger context indeed helps at the syntactic level.

Results on English, Czech, Hungarian, and Finnish
There is nothing in our system that is specific to Turkish. We applied our joint system to other languages, using the Universal Dependencies v2.3 corpora. The results for English, Czech, Hungarian, and Finnish are given in Table 6. We excluded the morphological segmentation layer from the joint model for these experiments, since gold morphological segmentations are not available for these languages; only morphological tags are available in the UD treebanks.
We performed experiments both without cross-level contextual information flow and with an attention mechanism for the cross-level information flow. We used only the previous two words as contextual information, since the highest scores were obtained with this setting on Turkish. We applied the concatenative attention network for English, Hungarian, and Czech, and dot-product attention for Finnish, as these settings yielded the highest evaluation scores.
As can be seen from the results, the cross-level contextual information flow considerably improves the results in all languages. The results demonstrate a substantial improvement, especially in English LAS and Hungarian FEATS.
We compared the highest scores obtained for these languages with those of other state-of-the-art systems that participated in the CoNLL Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies (Zeman and Hajič 2018). The results are given in Table 7. There are 26 participants who shared their results on the given languages; we compare our scores only with the top five systems in the shared task. We also provide the scores of Nguyen and Verspoor (2018) to show the improvement obtained by the joint model and the cross-level contextual information flow.
Our joint model outperforms all other participants in English for both UAS and LAS. Although we have a significant improvement over the baseline model of Nguyen and Verspoor (2018) in Finnish, Hungarian, and Czech, our model falls behind the top scores in these languages. However, the joint model still performs well when all participants in the shared task are considered. For example, our model is ranked 11th in Finnish, 13th in Hungarian, and 14th in Czech according to LAS; 10th in Finnish, 13th in Hungarian, and 15th in Czech according to UAS; and 5th in English, 10th in Finnish and Hungarian, and 6th in Czech according to universal POS (UPOS) scores.

Error analysis for Turkish
We explored the errors made by the proposed model variations in greater detail for one language, Turkish. The following experiments aim to classify parsing errors with respect to structural properties of the dependency graphs. In all experiments, four different joint models have been trained and tested with the standard split of the Turkish IMST universal dependencies. Error analysis has been carried out using four properties: sentence length, projectivity, arc type, and POS.

The effect of sentence length
We first examine the effect of sentence length on the robustness of the four model variations. For that purpose, the test data were divided into equally sized sentence-length intervals of length 10. The UAS and LAS scores for different sentence lengths are given in Tables 8 and 9, respectively. The longer the sentences, the lower both scores are for all model variations. However, the MorphTag-POS-DEP and SEG-MorphTag-POS-DEP models perform rather well on shorter sentences (length 0-20) with regard to both UAS and LAS.
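The binning procedure can be sketched as follows, grouping per-sentence scores into length intervals of 10; the per-sentence scores here are illustrative.

```python
def scores_by_length_bin(sentences, bin_size=10):
    """Group per-sentence attachment scores into sentence-length bins
    (1-10, 11-20, ...) and average the scores within each bin."""
    bins = {}
    for length, score in sentences:
        lo = ((length - 1) // bin_size) * bin_size + 1
        bins.setdefault((lo, lo + bin_size - 1), []).append(score)
    return {k: sum(v) / len(v) for k, v in sorted(bins.items())}

# Illustrative (sentence length, per-sentence UAS) pairs:
data = [(7, 0.9), (9, 0.8), (15, 0.7), (34, 0.5)]
binned = scores_by_length_bin(data)
assert abs(binned[(1, 10)] - 0.85) < 1e-9   # mean of 0.9 and 0.8
assert binned[(31, 40)] == 0.5
```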
What is interesting about the scores is that the models with higher UAS and LAS scores on shorter sentences are worse on longer sentences (length 30-50). Parsing errors gradually increase with sentence length. In the literature, a similar outcome has been reported for other languages, for example English and Vietnamese (McDonald and Nivre 2007; Van Nguyen and Nguyen 2015).

Effect of projectivity
This effect concerns the structure of dependencies. An arc in a dependency tree is projective if there is a directed path from the head word to every word between the two endpoints of the arc; consequently, a projective tree has no crossing edges in the dependency graph. Projectivity is less of a problem in languages with stricter word order, such as English. However, non-projective trees occur more frequently in languages such as Czech, German, and Turkish. Table 10 shows some statistics about the test set. Turkish being a free word order language, the number of non-projective dependency graphs is quite significant. However, it is not only the relatively free word order of languages such as Turkish that leads to non-projective dependencies: Turkish root relativization may be nested (Figure 9(a)-(b)) or crossing (Figure 9(c)) without invoking free word order effects.
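The projectivity criterion can be checked mechanically via the equivalent no-crossing-arcs condition; a minimal sketch over a head-index encoding (0 = root) follows.

```python
def is_projective(heads):
    """heads[i] is the head of word i+1 (1-based indices; 0 = root).
    A dependency tree is projective iff no two of its arcs cross."""
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for i, (l1, r1) in enumerate(arcs):
        for l2, r2 in arcs[i + 1:]:
            # Two arcs cross when exactly one endpoint of one arc lies
            # strictly inside the other arc's span.
            if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:
                return False
    return True

# A simple projective tree: word 2 is the root, heading words 1 and 3.
assert is_projective([2, 0, 2])
# Non-projective: arcs (1,3) and (2,4) cross.
assert not is_projective([3, 3, 0, 2])
```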
In the case of long-range dependencies crossing clause boundaries, Turkish dependencies are always crossing, as can be seen in Figure 10; therefore, a dataset containing an abundance of such examples would afford very little chance of adequate coverage if non-projective dependencies were eliminated.
There is no limit to crossings in Turkish, because there is no limit to long-range relativization: n-level embedded complementation in an expression requires n crossings to capture the dependency relation of the extracted element with its verb, as shown in Figure 10. Therefore, there is no such thing as a mildly non-projective Turkish long-range relative clause dependency (cf. Kuhlmann and Nivre 2006).
We point out that filtering out crossing dependencies for evaluation, as done for example by Sulubacak and Eryigit (2018), would take away many non-crossing dependencies as well, so all kinds of parsers, projective and non-projective, would be affected by this elimination in their training and in their performance on simpler cases. Figure 9. Dependencies in Turkish relative clauses (in brackets) and with the relativized noun (in italics): (a) root-clause object relativization; (b) root-clause subject relativization; (c) relativization out of possessive NPs, here from "man's car": "adam-GEN.3s araba-POSS.3s." Dotted edges indicate the dependency that the "acl:rel" relation is expected to capture.
We designed the joint-task cross-level contextual information flow with these considerations in mind, so that all kinds of dependencies can be modeled when they are available in standard datasets for evaluation. Adding attention along with cross-level contextual information handling should be regarded as a first step toward handling such data, which would allow a meaningful comparison of dependency parsing with, for example, CCG parsing (Hockenmaier and Steedman 2007; Çakıcı, Steedman, and Bozşahin 2018), where so-called non-projectivity is handled by bounded function composition on complement clauses.
We hope that our sticking to dependency parsing for the moment may help assess the current state of the art and provide a prospect for the future with such data. In particular, we suggest reporting constructions by genera (control, subordination, relativization, coordination, etc.), cross-listed by bounded and long-range dependencies, as also pointed out by Hockenmaier and Steedman (2007). Without such standardization, significance tests for improvement or degradation over datasets would not reveal much about what a model has learned about a language. Figure 11 gives the percentages of projective and non-projective trees by sentence length. The results show that as the complexity, and therefore the length, of a sentence increases, the probability of generating non-projective trees also increases. Projective trees are more numerous for sentence lengths below 30, and non-projective trees for sentence lengths above 40; for example, for all sentences longer than 50, the dependency tree is non-projective.
We analyze the labeled and unlabeled accuracy scores obtained for projective and non-projective trees in testing. As can be seen in Table 11, the MorphTag-POS-DEP model outperforms the other models for both labeled and unlabeled dependency arcs in projective trees. However, for non-projective trees, the SEG-MorphTag-POS-DEP model outperforms the other models. We aim in the long run for wide-coverage systems with a reasonable approximation of semantics; we therefore take it as a good sign that the full model is responsive to non-projective, hence more complex, sentences.

The effect of constructions
Finally, we analyze the syntactic properties of dependency relations based on the POS labels obtained from the different models. We compare the results based on the dependent word's POS tag to see whether it plays a role in the accuracy of dependencies. Following McDonald and Nivre (2007), we use coarse POS categories instead of UPOS categories in order to observe the performance differences between the models, at the cost of losing some useful syntactic information. In the coarse POS categories, nouns include nouns and proper nouns, pronouns comprise pronouns and determiners, and verbs include verbs and auxiliary verbs.
UAS and LAS accuracies of the four models for the coarse POS categories are given in Tables 12 and 13, respectively. As can be seen from Table 12, there is a slight improvement in UAS. The full sets of scores are given in Tables A1 and A2, respectively, in Appendix A.

The overall results show that each component of the model has an impact on the UAS scores for the different POS categories of the dependents, and the full joint model that includes all the components has higher overall UAS accuracy across the dependent words' POS categories.
As can be seen from Table 13, the SEG-MorphTag-POS-DEP model tends to be more accurate on nouns (NOUN) and conjunctions (CONJ). With the addition of morpheme tags, the MorphTag-POS-DEP model has higher LAS accuracies for adjectives (ADJ), adverbs (ADV), and verbs (VERB). There is a considerable improvement of LAS in adpositions (ADP). The improvement is small for pronouns (PRON), which can be a sign that pronouns are usually not inflected, so that the morpheme tags and segmentation do not contribute much to the pronoun category.

Ablation
To measure the contribution of the components of the model, we remove one component at a time and test; this is an ablation analysis, of which we performed a coarse version in the previous sections. There is also another version of ablation analysis that measures the influence of the individual components by incrementally replacing each component with its gold annotations. This version is reported by Qi et al. (2018). Using this approach, researchers can decide whether an individual component is useful for higher-level tasks and estimate the limitations of their systems under perfect conditions: the errors that flow through the network from the different components are filtered out, to measure the actual effect of each component on the final scores. The results of this ablation analysis of the joint model are given in Table 14. The first row gives the results of the full joint model without gold annotations. In the second row (POS-MorphTag-GoldSEG-DEP), the morpheme segmentation component is replaced with the gold morphological segments rather than predicting them, and the rest of the components are left as they are. In the third row (POS-GoldMorphTag-SEG-DEP), the morphological tagging component is replaced with the gold morpheme tags.
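Schematically, this gold-replacement ablation amounts to swapping a component's predictor for an oracle lookup; the component names and toy outputs below are illustrative, not the actual implementation.

```python
def ablate(components, gold, use_gold):
    """Return each component's output, replacing every component named
    in `use_gold` with its gold annotation (an oracle lookup)."""
    return {name: gold[name] if name in use_gold else predict()
            for name, predict in components.items()}

# Illustrative stand-ins for predicted vs. gold annotations of one word:
components = {"SEG": lambda: ["kitap", "lar"],
              "MorphTag": lambda: ["Num=Sing"]}   # a deliberate error
gold = {"SEG": ["kitap", "lar"],
        "MorphTag": ["Num=Plur"]}

full = ablate(components, gold, use_gold=set())           # row 1
oracle = ablate(components, gold, use_gold={"MorphTag"})  # row 3
assert full["MorphTag"] == ["Num=Sing"]     # prediction kept
assert oracle["MorphTag"] == ["Num=Plur"]   # gold substituted
```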
Finally, both the morpheme segmentation and morphological tagging components are replaced with their gold annotations in the fourth row (POS-GoldMorphTag-GoldSEG-DEP). The results show that using the gold morpheme tags contributes most to the dependency scores. However, the highest improvement in POS tagging is obtained when both gold morphological segmentations and gold morpheme tags are incorporated into the model. We are encouraged by this result, given our motivation from the beginning to add morpheme information to joint learning.
When the individual components are replaced with gold annotations incrementally, from the bottom tasks to the upper layers, there is still a large gap between the ablation scores and the gold scores. This is where linguistic theories can fill the gap, for morphology, phonology, and syntax, by narrowing the solution space of every task of a joint model.

Conclusion and future work
We presented a joint learning framework for POS tagging, morphological tagging, morphological segmentation, and dependency parsing, to the extent of obtaining dependency relations of who does what to whom. We use a deep neural architecture in which each layer learns a specific level, and this knowledge is shared across the other layers.
The results show that using morphological knowledge, such as morpheme tags and the morphs themselves, improves both dependency parsing and POS tagging. Morphemic knowledge plays an important role in morphologically rich languages.
Applying deep contextualized word embeddings such as BERT (Devlin et al. 2019) or ELMo (Peters et al. 2018) remains a future goal of this work. We plan to replace the LSTMs with self-attention mechanisms of variable reach at the levels of the current architecture, as a controlled means of extending the context. We believe that using self-attention instead of LSTMs will handle long-term dependencies better, especially in longer sentences and with larger left and right contexts.