Morphological Analysis Using a Sequence Decoder

Abstract We introduce Morse, a recurrent encoder-decoder model that produces morphological analyses of each word in a sentence. The encoder turns the relevant information about the word and its context into a fixed size vector representation and the decoder generates the sequence of characters for the lemma followed by a sequence of individual morphological features. We show that generating morphological features individually rather than as a combined tag allows the model to handle rare or unseen tags and to outperform whole-tag models. In addition, generating morphological features as a sequence rather than, for example, an unordered set allows our model to produce an arbitrary number of features that represent multiple inflectional groups in morphologically complex languages. We obtain state-of-the-art results in nine languages of different morphological complexity under low-resource, high-resource, and transfer learning settings. We also introduce TrMor2018, a new high-accuracy Turkish morphology data set. Our Morse implementation and the TrMor2018 data set are available online to support future research: see https://github.com/ai-ku/Morse.jl for a Morse implementation in Julia/Knet (Yuret, 2016) and https://github.com/ai-ku/TrMor2018 for the new Turkish data set.


Introduction
MorphNet is a sequence-to-sequence recurrent neural network model that takes sentences in plain text as input and attempts to produce the correct morphological analysis of each word as output. Traditional methods, e.g. (Oflazer and Tür, 1997), first produce all possible analyses of a word using finite-state transducers and then perform disambiguation using statistical or rule-based methods. In contrast, MorphNet combines the analysis and disambiguation in a single model and obtains state-of-the-art or comparable results in producing the correct morphological analysis of each word. (The code will be released upon publication.)

Table 1: Morphological analysis for the Turkish word "masalı".

Word Analysis & Context
An example context and its translation are given below each analysis.
Morphological analysis identifies the structure of words and word parts (morphemes) such as stems, prefixes, and suffixes. For example, Table 1 shows the three possible analyses for the ambiguous Turkish word "masalı": the accusative and possessive forms of the root "masal" (tale) and the +With form of the root "masa" (table) are expressed with the same surface form (Oflazer, 1994). Producing the correct analysis is essential for downstream syntactic and semantic processing.
The importance of morphological analysis varies significantly by language family. Analytic languages like Chinese, and near-analytic languages like English, show a low ratio of morphemes to words and express most grammatical relations using function words or word order. According to (Baayen et al., 1995), less than 10% of word tokens in English text carry an affix and less than 30 word forms have the type of morphological ambiguity observed in the "masalı" example (e.g., leaves = leaf+s vs. leave+s). Thus, for most purposes, simple table lookup tends to be sufficient for English morphological analysis.
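The table-lookup point can be made concrete with a tiny sketch. This is purely illustrative: the dictionary entries and tag strings below are invented for the example, not taken from any real lexicon.

```python
# A minimal morphological lookup table for English (illustrative entries only).
# For most English tokens a static table like this resolves the analysis,
# which is why full morphological modeling matters less for analytic languages.
ANALYSES = {
    "leaves": ["leaf+Noun+Pl", "leave+Verb+3sg"],  # one of the rare ambiguous forms
    "cats": ["cat+Noun+Pl"],
    "ran": ["run+Verb+Past"],
}

def analyze(word):
    """Return all stored analyses for a word, or [] if the word is unknown."""
    return ANALYSES.get(word.lower(), [])
```

For agglutinative languages such a table is infeasible, since productive affixation yields an effectively unbounded set of word forms.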
In contrast, agglutinative languages like Turkish and Finnish tend to have a high ratio of morphemes to words and express many grammatical relations using affixes. In the Turkish training data used by (Yuret and Türe, 2006), 48% of the word tokens carry an affix and 59% of these are morphologically ambiguous. For downstream syntactic analysis, (Oflazer et al., 1999) observes that words in Turkish can have dependencies to any one of the inflectional groups of a derived word. For example, in "mavi masalı oda" (room with blue table) the adjective "mavi" (blue) modifies the noun root "masa" (table) even though the final part of speech of "masalı" is an adjective. Accurate morphological analysis and disambiguation are important prerequisites for further syntactic and semantic processing in agglutinative languages.
Previous work has separated the tasks of morphological analysis and morphological disambiguation. Morphological analysis is taken to be the task of producing all possible morphological parses of a given word. Morphological analyzers have typically been implemented as Finite State Transducers (FSTs) with language specific, manually generated rules. Morphological disambiguation is the task of selecting the correct parse for a given word in a given context among all possible parses. Various rule based, statistical, and neural network models have been implemented for morphological disambiguation. These models are described in Section 2.
In this work, our motivation is to eliminate the need for separate morphological analyzer and disambiguator components and to provide a single, easy-to-use model. We present MorphNet, a sequence-to-sequence (Sutskever et al., 2014) model for morphological analysis and disambiguation. Once trained, the model can be used either as a stand-alone application or with an external analyzer to eliminate errors for unambiguous tokens. The model uses three Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) encoders: a character based encoder to obtain word embeddings, a word based bidirectional LSTM encoder to obtain context embeddings, and a unidirectional LSTM encoder to obtain output embeddings of preceding word analyses. The decoder consists of a two-layer LSTM model. The first layer's hidden state is initialized with the context embedding, and the second layer's hidden state is initialized with the combination of word and output embeddings. The decoder learns to predict the correct morphological analysis including root characters and morphemes. Figure 1 gives the model architecture.
When we evaluated our model on available Turkish datasets, we realized that existing datasets suffer from low accuracy and small test sets, which makes model comparison difficult due to noise and statistical significance problems. To address these issues we created a new dataset, TrMor2018, which contains 460K tagged tokens and has been verified to be 97%+ accurate by trained annotators. We report our results on this new dataset as well as previously available datasets.
The main contributions of this work are:
• A new model that performs morphological analysis and disambiguation together.
• Release of a new morphological disambiguation dataset for Turkish.
• State-of-the-art or comparable results on nine different datasets in seven different languages.
In the rest of the paper, we discuss related work in Section 2, detail our model's input-output representation and individual components in Section 3, describe our datasets and introduce our new Turkish dataset in Section 4, and present our experiments and results in Section 5.

Related Work
In this section, we summarize the previous work on Morphological Analysis, Morphological Disambiguation and Morphological Tagging.

Morphological Analysis
Morphological analysis is generally performed by finite-state transducers (FSTs) which produce all possible parses for a given word (Koskenniemi, 1983). The analyzers are language dependent rule based systems that typically consist of a lexicon, two-level phonological rules, and finite state transducers that encode morphotactics (Koskenniemi, 1981; Karttunen and Wittenburg, 1983). The first rule based analyzer for Turkish was developed by (Oflazer, 1994); we used an updated version of this analyzer when creating our new Turkish dataset. Other analyzers for Turkish include (Eryigit and Adalı, 2004) and (Çöltekin, 2010).

Morphological Disambiguation
Previous work on morphological disambiguation can be grouped into four broad categories in terms of the technique they use. These categories are rule based, statistical, hybrid, and neural network based approaches.
The rule-based approaches (Karlsson et al., 1995; Oflazer and Kuruöz, 1994; Oflazer and Tür, 1996; Daybelge and Çiçekli, 2007; Daoud, 2009) exploit hand-crafted rules to select the correct parse among the candidates. The rules, typically language specific, are designed such that the model can capture the relationship between the context of the word that is subject to disambiguation and all its possible parses. Besides selecting the correct analysis, some systems (Oflazer and Tür, 1996) use these rules to eliminate the incorrect parses.
Statistical approaches try to disambiguate the target word according to statistics calculated on the corpus they use. For example, (Hakkani-Tür et al., 2002) breaks up the morphosyntactic tags into inflectional groups to handle a large set of tags, and then the model assigns a probability to each morphosyntactic tag by considering statistics over the individual inflection groups in a trigram model. (Yuret and Türe, 2006) uses decision lists to vote on each of the potential parses of a word. For each tag in the corpus, a different decision list is learnt using the Greedy Prepend Algorithm (Yuret and de la Maza, 2006).
Hybrid approaches combine hand-crafted rules with statistical methods. (Hajič et al., 2007) employs a rule-based component to reduce the number of possible parses and then runs a statistical Part of Speech (POS) tagger to disambiguate words in the Czech language.
Recently, a number of studies based on neural networks (Yıldız et al., 2016; Shen et al., 2016; Toleu et al., 2017) have also addressed morphological disambiguation. (Yıldız et al., 2016) proposed an architecture based on a convolutional neural network (CNN). The CNN is used to create a representation of the target word by using the root of the word and its morpheme features. Then, using this representation and ground truth annotations of previous words, the model predicts the correct analysis. In contrast to (Yıldız et al., 2016), to capture the context and the relationship between the target word and its surroundings, (Shen et al., 2016) exploits LSTM based neural network architectures. They customize the neural network architecture based on the input language. (Toleu et al., 2017) proposes a neural network model for morphological disambiguation of Kazakh and Turkish.

Morphological Tagging
In contrast to the two-stage systems described above, which use separate modules for morphological analysis and disambiguation, morphological tagging attempts to solve both problems at once and assign the correct tag to a given word without the aid of an analyzer that produces possible parses. This is the approach we take in MorphNet. Prior to MorphNet, (Mueller et al., 2013) attempted to solve the morphological tagging problem by employing a model based on Conditional Random Fields (CRFs) (Lafferty et al., 2001), and (Heigold et al., 2017) proposed two morphological tagging architectures based on neural networks. Their best performing architecture, similar to MorphNet, uses unidirectional and bidirectional LSTMs and takes only raw sentences as input. However, our work differs from theirs in two important ways. First, given a word in a context, they predict a morphological tag as a complete unit, rather than as a combination of component features. Consequently, their model can only produce analyses observed in the training data. In contrast, MorphNet produces the morphological tag of the target word one feature at a time. Second, their system ignores the root of the target word and predicts only the morphological features, whereas MorphNet generates the stem as well as the morphological tags.

Model
MorphNet produces the morphological analysis (stem plus morphological features) for each word in a given sentence. It is based on the sequence-to-sequence encoder-decoder network approach proposed by (Sutskever et al., 2014) for machine translation. However, we use three distinct encoders to create embeddings of various input features. First, a word encoder creates an embedding for each word based on its characters. Second, a context encoder creates an embedding for the context of each word based on the word embeddings of its left and right neighbors. Third, an output encoder creates an output embedding using the analyses of previous words. These embeddings are fed to the decoder, which produces the stem and the morphological features of a target word. In the following subsections, we explain each component in detail.

Input Output
The input to the model consists of an N-word sentence S = [w_1, ..., w_N], where w_i is the i'th word in the sentence. Each word is input as a sequence of characters w_i = [w_{i1}, ..., w_{iL_i}], w_{ij} ∈ A, where A is the set of alphanumeric characters and L_i is the number of characters in word w_i.
The output for each word consists of a stem, a part-of-speech tag, and a set of morphological features, e.g. "masal+Noun+A3sg+P3sg+Nom" for "masalı". The stem is produced one character at a time, and the morphological information is produced one feature at a time. A sample output for a word looks like [s_{i1}, ..., s_{iR_i}, f_{i1}, ..., f_{iM_i}], where s_{ij} ∈ A is an alphanumeric character in the stem, R_i is the length of the stem, M_i is the number of features, and f_{ij} ∈ T is a morphological feature from a feature set such as T = {Noun, Adj, Nom, A3sg, ...}.
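This input/output convention can be sketched in a few lines of plain Python; the helper names are ours, chosen for illustration only:

```python
# Hedged sketch of the input/output encoding described above.
# Input: a word as a character sequence; output: stem characters
# followed by morphological feature tokens, as one flat sequence.

def encode_word(word):
    """w_i as a character sequence [w_i1, ..., w_iLi]."""
    return list(word)

def encode_analysis(stem, features):
    """[s_i1, ..., s_iRi, f_i1, ..., f_iMi]: stem chars, then feature tokens."""
    return list(stem) + list(features)

# The analysis "masal+Noun+A3sg+P3sg+Nom" for the word "masalı":
tokens = encode_analysis("masal", ["Noun", "A3sg", "P3sg", "Nom"])
```

Because stem characters and feature tokens share one output sequence, the decoder can emit an arbitrary number of features per word.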

Word Encoder
We map each character w_{ij} to an A-dimensional character embedding vector a_{ij} ∈ R^A. The word encoder takes each word and processes the character embeddings from left to right, producing hidden states [h_{i1}, ..., h_{iL_i}] where h_{ij} ∈ R^H. The final hidden state e_i = h_{iL_i} is used as the word embedding for word w_i.
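The word encoder can be sketched with a minimal NumPy LSTM cell. The dimensions here are toy values (the paper uses A = 64, H = 512), and the random matrices stand in for trained parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; gates are stacked in W, U, b in the order [i, f, o, g]."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
    g = np.tanh(z[3*H:])
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def word_embedding(chars, char_emb, W, U, b, H):
    """Run the character LSTM left to right; the final hidden state is e_i."""
    h, c = np.zeros(H), np.zeros(H)
    for ch in chars:
        h, c = lstm_step(char_emb[ch], h, c, W, U, b)
    return h

rng = np.random.default_rng(0)
A_dim, H = 3, 4  # toy sizes standing in for the paper's A = 64, H = 512
char_emb = {ch: rng.normal(size=A_dim) for ch in "masalı"}
W = rng.normal(size=(4 * H, A_dim))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
e_i = word_embedding("masal", char_emb, W, U, b, H)
```

In the real model the parameters are learned jointly with the rest of the network rather than sampled.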

Context Encoder
We use a bidirectional LSTM as the context encoder.
The inputs are the word embeddings e_1, ..., e_N produced by the word encoder. The context encoder processes them in both directions and constructs a unique context embedding for each target word in the sentence. For a word w_i we define its corresponding context embedding c_i ∈ R^{2H} as the concatenation of the forward (→c_i ∈ R^H) and backward (←c_i ∈ R^H) hidden states that are produced after the forward and backward LSTMs process the word embedding e_i. Figure 1(c) illustrates the creation of the context vector for the target word "elini".
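The bidirectional scheme can be sketched as follows; for brevity we substitute a plain tanh RNN for the LSTM, and the toy dimensions and random weights are ours:

```python
import numpy as np

def rnn_states(xs, W, U, H):
    """Plain tanh RNN as a simplified stand-in for the paper's LSTM;
    returns the hidden state at every position."""
    h, states = np.zeros(H), []
    for x in xs:
        h = np.tanh(W @ x + U @ h)
        states.append(h)
    return states

def context_embeddings(word_embs, Wf, Uf, Wb, Ub, H):
    """c_i ∈ R^{2H}: forward state at i concatenated with backward state at i."""
    fwd = rnn_states(word_embs, Wf, Uf, H)
    bwd = rnn_states(word_embs[::-1], Wb, Ub, H)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

rng = np.random.default_rng(0)
D, H = 4, 2  # toy word-embedding and hidden sizes
word_embs = [rng.normal(size=D) for _ in range(3)]
cs = context_embeddings(word_embs,
                        rng.normal(size=(H, D)), rng.normal(size=(H, H)),
                        rng.normal(size=(H, D)), rng.normal(size=(H, H)), H)
```

Each c_i thus summarizes both the left and the right context of word w_i.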

Output Encoder
The output encoder captures information about the morphological features of words processed prior to each target word. For example, in order to assign the correct possessive marker to the word "masalı" (tale) in "babamın masalı" (my father's tale), it would be useful to know that the previous word "babamın" (my father) has a genitive marker. During training we use the gold morphological features; during testing we use the output of the model. The output encoder uses only the morphological features, not the stem characters, of the previous words as input. We map each morphological feature f_{ij} to a B-dimensional feature embedding vector b_{ij} ∈ R^B. A unidirectional LSTM is run over the whole sentence up to the target word to produce hidden states [t_{11}, ..., t_{i-1,M_{i-1}}] where t_{ij} ∈ R^H. The final hidden state preceding the target word, o_i = t_{i-1,M_{i-1}}, is used as the output embedding for word w_i.
Figure 1: Model illustration for the sentence "Sonra gülerek elini kardeşinin omzuna koydu" (Then he laughed and put his hand on his brother's shoulder) and target word "elini" (his hand). We use the morphological features of the words preceding the target as input to the output encoder: "Sonra+Adv gül+Verb+Pos^DB+Adverb+ByDoingSo".

Decoder
The decoder is implemented as a 2-layer LSTM network that outputs the correct tag for a single target word. By conditioning on the three input embeddings and its own hidden state, the decoder learns to generate y_i = [y_{i1}, ..., y_{iK_i}], where y_i is the correct tag of the target word w_i in sentence S, y_{ij} ∈ A ∪ T represents both stem characters and morphological feature tokens, and K_i is the total number of output tokens (stem + features) for word w_i. The first layer of the decoder is initialized with a projection of the context embedding,

h^(1)_{i0} = W_d c_i + W_db,

where W_d ∈ R^{H×2H} and W_db ∈ R^H. We initialize the second layer with the word and output embeddings after combining them by element-wise summation (⊕):

h^(2)_{i0} = e_i ⊕ o_i.

We parameterize the distribution over possible morphological features and characters at each time step j as

p(y_{ij}) = softmax(W_s h_{ij} + W_sb),

where h_{ij} ∈ R^H is the decoder hidden state at step j, W_s ∈ R^{|Y|×H}, and W_sb ∈ R^{|Y|}, with Y = A ∪ T the set of characters and morphological features in the output vocabulary.
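The softmax over the joint vocabulary of stem characters and feature tokens can be sketched as follows; the vocabulary, toy dimensions, and random weights standing in for W_s and W_sb are illustrative:

```python
import numpy as np

def output_distribution(h, Ws, Wsb):
    """Softmax over the joint output vocabulary Y = A ∪ T given decoder state h."""
    logits = Ws @ h + Wsb
    z = np.exp(logits - logits.max())  # subtract max for numerical stability
    return z / z.sum()

# Y mixes stem characters and morphological feature tokens in one vocabulary.
vocab = list("abcdefgh") + ["Noun", "A3sg", "Nom"]
rng = np.random.default_rng(1)
H = 5
Ws = rng.normal(size=(len(vocab), H))
Wsb = np.zeros(len(vocab))
p = output_distribution(rng.normal(size=H), Ws, Wsb)
```

Sampling or argmax over this distribution yields either a stem character or a feature token at each decoding step, which is what lets the model emit the stem and an arbitrary number of features as one sequence.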

Datasets
We evaluate MorphNet on several different languages and datasets. First we describe the multilingual datasets we used from the Universal Dependency treebanks. We then describe two existing datasets for Turkish and introduce our new dataset, TrMor2018.

Universal Dependency Datasets
We tested MorphNet on a diverse set of languages selected from the Universal Dependency (UD) treebanks (Nivre et al., 2016).

Table 3: Statistics for TrMor2006 (Yuret and Türe, 2006) and TrMor2016 (Yıldız et al., 2016). They share the same training set. TrMor2006 uses "Test" and TrMor2016 uses "T20K" as their test set.

Turkish Datasets
For Turkish we evaluate our model on two existing datasets. The first, TrMor2006, was first used in (Yuret and Türe, 2006). The training set was disambiguated semi-automatically and has limited accuracy. The test set was hand-tagged but is too small (862 tokens) to reliably distinguish between models with similar accuracy. We randomly extracted 100 sentences from the training set and used them as the development set while training our model.
The second dataset we used, TrMor2016, was prepared by (Yıldız et al., 2016). The training set is the same as TrMor2006, but they manually retagged a subset of the training set containing roughly 20000 tokens to be used as a larger test set (T20K in the table). Unfortunately, they did not exclude the sentences in T20K from the training set in their experiments. Also, they do not provide any accuracy or inter-annotator-agreement results for the new test set.

TrMor2018
We also evaluate MorphNet on a new dataset, TrMor2018, which we release with this paper. Our goal was to address the problems with the accuracy of the training set in TrMor2006 and to provide a larger disjoint test set to more reliably distinguish the accuracy of similar models. The new dataset consists of 34673 sentences and 460663 words in total. Similar to (Yuret and Türe, 2006), it was annotated semi-automatically in multiple passes. In order to estimate the noise level of the dataset, we randomly selected a subset of the dataset and manually disambiguated it. The subset contains 2090 sentences and 28909 words. Two annotators annotated each word independently, and we assigned the final morphological tag of each word based on the adjudication of a third. Then, we compared the manually disambiguated subset with the semi-automatic results and found the noise level of the dataset to be approximately 3%. In our experiments, we split TrMor2018 into train, development, and test sets by randomly selecting 80%, 10%, and 10% of the sentences, respectively. Table 4 shows statistics for TrMor2018.
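The 80/10/10 sentence-level split can be sketched as below; the fixed seed is our illustrative choice, since the paper states only that sentences were selected randomly:

```python
import random

def split_sentences(sentences, seed=0):
    """Shuffle sentences and split into 80% train, 10% dev, 10% test.
    Splitting at the sentence level keeps each sentence in exactly one set."""
    s = list(sentences)
    random.Random(seed).shuffle(s)
    n = len(s)
    n_train, n_dev = int(0.8 * n), int(0.1 * n)
    return s[:n_train], s[n_train:n_train + n_dev], s[n_train + n_dev:]

train, dev, test = split_sentences(range(1000))
```

Splitting by sentence rather than by token avoids leaking context between the splits.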

Experiments and Results
In this section we describe our training procedure, give experimental results, and provide an ablation analysis of our model.

Training
All LSTM units in our model have H = 512 hidden units. The character embedding vectors in the word encoder have size A = 64, and the output embedding vectors have size B = 256. We initialize model parameters with Xavier initialization (Glorot and Bengio, 2010). We train our networks using back-propagation through time with stochastic gradient descent, starting from a learning rate of l = 1.6. We decay the learning rate by a factor of 0.8 based on development accuracy and stop early if the development set accuracy does not improve for 6 consecutive epochs. We apply dropout with rate 0.3 before and after each of the LSTM units in MorphNet, as well as on the embedding layers.
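The schedule can be sketched as follows. Note one assumption on our part: we decay the learning rate on every non-improving epoch, whereas the paper states only the decay factor, the trigger (development accuracy), and the patience:

```python
def train_schedule(dev_accuracies, lr=1.6, decay=0.8, patience=6):
    """Replay a sequence of per-epoch dev accuracies: decay the learning rate
    when accuracy fails to improve (our assumption: on every such epoch) and
    stop after `patience` consecutive epochs without improvement.
    Returns the best dev accuracy seen and the final learning rate."""
    best, stale = -1.0, 0
    for acc in dev_accuracies:
        if acc > best:
            best, stale = acc, 0
        else:
            stale += 1
            lr *= decay
            if stale >= patience:
                break  # early stopping
    return best, lr
```

In a real training loop the accuracies would come from evaluating the model on the development set after each epoch.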

Turkish Results
Table 5 shows the results of several systems on different Turkish datasets. The (W/) column shows the performance when unambiguous tokens are tagged perfectly. The (W/O) column shows the accuracies when unambiguous tokens are also tagged by the model. The reason we make this distinction is to obtain results comparable to older models: when we use MorphNet to tag unambiguous words, it does not achieve 100% accuracy. Unlike MorphNet, all of the models we compare with take the output of a morphological analyzer as input and predict the correct analysis among its parses, so they always get the right result for unambiguous tokens. For the TrMor2006 dataset, we report 96.86% total accuracy, which is the best result to date, although the difference is not statistically significant. For ambiguous words, MorphNet achieves 92.86% accuracy, showing its ability to simulate both analyzer and disambiguator internally. Perhaps more importantly, the standalone MorphNet (W/O) performs better than the disambiguators even though it does not rely on an external analyzer. On the test set with 20K tokens, we achieve slightly better performance than (Yıldız et al., 2016) on ambiguous words. We hope that the new TrMor2018 dataset will allow for better system comparison due to its high accuracy and large test set.
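The distinction between the A, W, and W/O columns can be made precise with a small sketch; in the W column every unambiguous token is counted as correct regardless of the model's output:

```python
def accuracies(items):
    """items: list of (ambiguous, correct) booleans, one per token.
    Returns (A, W, WO) as in Table 5:
      A  = accuracy on ambiguous tokens only,
      W  = all-token accuracy with unambiguous tokens counted as correct,
      WO = all-token accuracy with the model tagging every token itself.
    Assumes at least one ambiguous token is present."""
    n = len(items)
    amb = [correct for ambiguous, correct in items if ambiguous]
    A = sum(amb) / len(amb)
    W = (sum(amb) + (n - len(amb))) / n
    WO = sum(correct for _, correct in items) / n
    return A, W, WO
```

W is always at least WO, since replacing the model's unambiguous-token predictions with ground truth can only help.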

Multilingual Results
To show that MorphNet is language agnostic and can easily be used with languages other than Turkish, we evaluated our model on 8 different languages and compared results with two other methods: another neural network based approach (Heigold et al., 2017) and a strong non-neural baseline (Mueller et al., 2013). Table 6 shows test set performances on these languages. For Turkish (Tr) we obtain the best score with 89.54% accuracy. For Bulgarian (Bg) we perform better than the non-neural baseline by 0.66%, while (Heigold et al., 2017) achieves the best score with an additional 0.21% gain. We also observed a 1.52% improvement over (Heigold et al., 2017) on one of the languages. French is the only language where the non-neural model achieves the best score; this may be a result of the limited morphological complexity of the language, or of using a model that is not specifically optimized for French. Among these languages, only for Hungarian (Hu) do we have a significantly lower test set accuracy. We believe this is related to the amount of Hungarian training data, which is the smallest of the eight languages. We cannot compare ourselves with others on Catalan (Ca), Italian (It), and Danish (Da), since we could not find any reported results.

Ablation Analysis
In this section, the contributions of the individual components of the full model are analyzed. In Figure 2, the components that are identical to the finalized model are shown with gray boxes. In the following two ablation studies, we remove individual modules to investigate the change in the performance of the model. We use the TrMor2018 dataset in two different configurations. In the first configuration, we use only a small portion of the training data in order to show the difference between ablation models in a regime far from the noise level of the dataset. We randomly sampled 5 different subsets from the training data and report the average performance. The size of each subset is approximately 10% of the original training set. We train all ablation models and MorphNet on these five subsets separately and evaluate each using the original development and test sets. In the second configuration, we use all the available training data. Table 7 shows the test set accuracy of each ablated model as well as the finalized model.
We start our ablation studies by removing both the context encoder and the output encoder. The resulting model (seq2seq) is a standard encoder-decoder model which is only able to employ word embeddings (i.e., no context information). In that case, we record 91.99% total accuracy.
We then improve the model by reassembling the context encoder (seq2seq+context). We observed 0.93% and 1.03% increases in ambiguous word accuracy, depending on the amount of training data used. This is the minimal version of MorphNet that is capable of learning more than only the most frequent morphological analysis of each wordform. For example, Table 8 shows the words with the same stem "röportaj" (interview) and their analyses in the training set. We tested both models on the never before seen word "röportajı". While seq2seq failed by selecting the most frequent tag of "röportaj", seq2seq+context disambiguated the target word correctly as röportaj+Noun+A3sg+Pnon+Acc. As a next step, we also include the output encoder in our model (seq2seq+context+output). This gives our full MorphNet model. It improves disambiguation performance on ambiguous tokens by 0.21% in the first configuration and 0.61% in the second configuration.
These experiments show that each of the model components has a significant individual contribution to the overall performance.

Conclusion
In this paper, we present MorphNet, a language independent neural sequence-to-sequence model, and TrMor2018, a new Turkish dataset for morphological disambiguation. MorphNet employs two different unidirectional LSTMs to obtain word and output embeddings, and a bidirectional LSTM to obtain the context embedding of the target word. It outputs the stem of the word one character at a time, followed by the morphological features, one feature at a time. We evaluated MorphNet on eight different languages and obtained state-of-the-art or comparable results for all but one language. We also release a new morphology dataset for Turkish which is semi-automatically generated and manually confirmed to have 97%+ accuracy.

Table 2 :
Data statistics of UD languages. The values in the {Train,Dev,Test}-Set columns are the number of tokens in each split. The unique tags column gives the number of distinct sets of tags (POS + morphological features) assigned to words; unique features gives the number of distinct features that exist in the dataset.

Table 3 gives the statistics for these datasets.

Table 4 :
Data statistics of the TrMor2018 dataset. Number of tokens for each split in the dataset.

Table 5 :
Test set performance of several models on Turkish datasets. A: Ambiguous token accuracy, W: Accuracy on all tokens if unambiguous tokens are tagged perfectly, W/O: Accuracy on all tokens if unambiguous tokens are also tagged with MorphNet. TrMor2006: (Yuret and Türe, 2006), TrMor2016: (Yıldız et al., 2016), TrMor2018: This paper.

Table 6 :
Test set performance of MorphNet and other models on several languages from the UD treebanks. The results are obtained using stand-alone MorphNet.

Table 8 :
Analyses of the words with the same root "röportaj". In the training set, +Noun+A3sg+Pnon+Nom is the most frequent analysis among them.