## Abstract

This article presents a probabilistic hierarchical clustering model for morphological segmentation. In contrast to existing approaches to morphology learning, our method learns the hierarchical organization of word morphology as a collection of tree-structured paradigms. The model is fully unsupervised and based on the hierarchical Dirichlet process. Tree hierarchies and the corresponding morphological paradigms are learned simultaneously. Our model is evaluated on Morpho Challenge and shows competitive performance when compared to state-of-the-art unsupervised morphological segmentation systems. Although we apply this model to morphological segmentation, the model itself can also be used for hierarchical clustering of other types of data.

## 1. Introduction

Unsupervised learning of morphology is an important task because of the benefits it provides to many other natural language processing applications, such as machine translation, information retrieval, and question answering. Morphological paradigms provide a natural way to capture the internal morphological structure of a group of morphologically related words. Following Goldsmith (2001) and Monson (2008), we use the term **paradigm** to mean a set of stems and a set of suffixes in which each combination of a stem and a suffix yields a valid word form. For example, {*walk*, *talk*, *order*, *yawn*}{*s*, *ed*, *ing*} generates the surface forms *walk* + *ed*, *walk* + *s*, *walk* + *ing*, *talk* + *ed*, *talk* + *s*, *talk* + *ing*, *order* + *s*, *order* + *ed*, *order* + *ing*, *yawn* + *ed*, *yawn* + *s*, *yawn* + *ing*. A sample paradigm is given in Figure 1.
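
The definition above treats a paradigm as the Cartesian product of a stem set and a suffix set. A minimal Python sketch of that reading follows; the function name `surface_forms` is illustrative and plain string concatenation is assumed, so stem alternations are ignored.

```python
from itertools import product


def surface_forms(stems, suffixes):
    """Return every stem + suffix combination licensed by a paradigm."""
    return [stem + suffix for stem, suffix in product(stems, suffixes)]


# The example paradigm from the text.
stems = ["walk", "talk", "order", "yawn"]
suffixes = ["s", "ed", "ing"]

forms = surface_forms(stems, suffixes)
print(len(forms))  # 12
print(forms[:3])   # ['walks', 'walked', 'walking']
```

Four stems and three suffixes yield the twelve surface forms listed above.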

Recently, we introduced a probabilistic hierarchical clustering model for learning hierarchical morphological paradigms (Can and Manandhar 2012). Each node in the hierarchical tree corresponds to a morphological paradigm and each leaf node consists of a word. A single tree is learned, where different branches of the tree correspond to different morphological forms. That model learned well-defined paradigms at the lower levels of the tree. However, merging the paradigms at the upper levels led to undersegmentation. This problem led us to search for ways to learn multiple trees. In our current approach, we learn a forest of paradigms spread over several hierarchical trees. Our evaluation on Morpho Challenge data sets shows better results than the previous method (Can and Manandhar 2012). Our results are comparable to current state-of-the-art results, with the additional benefit of inferring the hierarchical structure of morphemes, for which no comparable systems exist.

The article is organized as follows: Section 2 introduces the related work in the field. Section 3 describes the probabilistic hierarchical clustering model with its mathematical model definition and how it is applied for morphological segmentation; the same section explains the inference and the morphological segmentation. Section 4 presents the experimental setting and the obtained evaluation scores from each experiment, and Section 5 concludes and addresses the potential future work following the model presented in this article.

## 2. Related Work

There have been many unsupervised approaches to morphology learning that focus solely on segmentation (Creutz and Lagus 2005a, 2007; Snyder and Barzilay 2008; Poon, Cherry, and Toutanova 2009; Narasimhan, Barzilay, and Jaakkola 2015). Others, such as Monson et al. (2008), Can and Manandhar (2010), Chan (2006), and Dreyer and Eisner (2011), learn morphological paradigms that permit additional generalization.

A popular paradigmatic model is Linguistica (Goldsmith 2001), which uses the Minimum Description Length principle to minimize the description length of a corpus based on paradigm-like structures called **signatures**. A signature consists of a list of suffixes that are seen with a particular stem—for example, *order*-{*ed*, *ing*, *s*} denotes a signature for the stem *order*.
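
To make the notion of a signature concrete, the sketch below groups stems by the set of suffixes they occur with. It only illustrates the data structure: the segmentations are assumed to be given as input, the function name `build_signatures` is hypothetical, and no Minimum Description Length search of the kind Linguistica performs is attempted.

```python
from collections import defaultdict


def build_signatures(segmented_words):
    """Group stems by the exact set of suffixes each stem is seen with."""
    stem_to_suffixes = defaultdict(set)
    for stem, suffix in segmented_words:
        stem_to_suffixes[stem].add(suffix)

    signatures = defaultdict(set)
    for stem, suffix_set in stem_to_suffixes.items():
        # frozenset makes the suffix set usable as a dictionary key.
        signatures[frozenset(suffix_set)].add(stem)
    return dict(signatures)


# Toy input (hypothetical segmentations).
pairs = [
    ("order", "ed"), ("order", "ing"), ("order", "s"),
    ("walk", "ed"), ("walk", "ing"), ("walk", "s"),
    ("quick", "ly"),
]
signatures = build_signatures(pairs)
print(sorted(signatures[frozenset({"ed", "ing", "s"})]))  # ['order', 'walk']
```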

Snover, Jarosz, and Brent (2002) propose a generative probabilistic model that defines a probability distribution over different segmentations of the lexicon into paradigms. Paradigms are learned with a directed search algorithm that examines subsets of the lexicon, ranks them, and incrementally combines them in order to find the best segmentation of the lexicon. The proposed model addresses both inflectional and derivational morphology in a language independent framework. However, their model does not allow multiple suffixation (e.g., having multiple suffixes added to a single stem) whereas Linguistica allows this.

Monson et al. (2008) induce morphological paradigms in a deterministic framework named ParaMor. Their search algorithm begins with a set of candidate suffixes and collects candidate stems that attach to the suffixes (see Figure 2(b)). The algorithm gradually develops paradigms following search paths in a lattice-like structure. Probabilistic ParaMor, involving a statistical natural language tagger to mimic ParaMor, was introduced in Morpho Challenge 2009 (Kurimo et al. 2009). The system outperforms other unsupervised morphological segmentation systems that competed in Morpho Challenge 2009 (Kurimo et al. 2009) for the languages Finnish, Turkish, and German.

Can and Manandhar (2010) exploit syntactic categories to capture morphological paradigms. In a deterministic scheme, morphological paradigms are learned by pairing syntactic categories and identifying common suffixes between them. The paradigms compete to acquire more word pairs.

Chan (2006) describes a supervised procedure to find morphological paradigms within a hierarchical structure by applying latent Dirichlet allocation. Each paradigm is placed in a node on the tree (see Figure 2(a)). The results show that each paradigm corresponds to a part-of-speech such as adjective, noun, verb, or adverb. However, as the method is supervised, the true suffixes and segmentations are known in advance. Learning hierarchical paradigms helps not only in learning morphological segmentation, but also in learning syntactic categories. This linguistic motivation led us toward learning the hierarchical organization of morphological paradigms.

Dreyer and Eisner (2011) propose a Dirichlet process mixture model to learn paradigms. The model uses 50–100 seed paradigms to infer the remaining paradigms, which makes it semi-supervised. The model is similar to ours in the sense that it also uses Dirichlet processes (DPs); however, the model does not learn hierarchies between the paradigms.

Luo, Narasimhan, and Barzilay (2017) learn morphological families that share the same stem, such as *faithful*, *faithfully*, *unfaithful*, *faithless*, and so on, which are all derived from *faith*. Those morphological families are learned as a graph and called **morphological forests**, which deviates from the meaning of the term *forest* as we use it in this article. Although learning morphological families has been studied as a graph learning problem in Luo, Narasimhan, and Barzilay (2017), in this work we learn paradigms that generalize morphological families within a collection of hierarchical structures.

Narasimhan, Barzilay, and Jaakkola (2015) model word formation with morphological chains in terms of parent-child relations. For example, *play* and *playful* have a parent-child relationship as a result of adding the morpheme *ful* at the end of *play*. These relations are modeled by using log-linear models in order to predict the parent relations. Semantic features as given by word2vec (Mikolov et al. 2013) are used in their model in addition to orthographic features for the prediction of parent-child relations. Narasimhan, Barzilay, and Jaakkola use contrastive estimation and generate corrupted examples as pseudo negative examples within their approach.

Our model is an extension of our previous hierarchical clustering algorithm (Can and Manandhar 2012). In that algorithm, a single tree is learned that corresponds to a hierarchical organization of morphological paradigms. The parent nodes merge the paradigms from the child nodes. However, merging paradigms into a single structure causes unrelated paradigms to be merged, resulting in lower segmentation accuracy. The current model addresses this issue by learning a forest of tree structures. Within each tree structure the parent nodes merge the paradigms from the child nodes. Multiple trees ensure that paradigms that should not be merged are kept separate. Additionally, in single-tree hierarchical clustering, a manually defined context-free grammar was employed to generate the segmentation of a word. In the current model, we predict the segmentation of a word without using any manually defined grammar rules.

## 3. Probabilistic Hierarchical Clustering

Chan (2006) showed that learning the hierarchy between morphological paradigms can help reveal latent relations in data. In the latent class model of Chan (2006), morphological paradigms in a tree structure can be linked to syntactic categories (i.e., part-of-speech tags). An example output of the model is given in Figure 2(a). Furthermore, tokens can be assigned to allomorphs or gender/conjugational variants in each paradigm.

Monson et al. (2008) showed that learning paradigms within a hierarchical model gives a strong performance on morphological segmentation. In their model, each paradigm is part of another paradigm, implemented within a lattice structure (see Figure 2(b)).

Motivated by these works, we aim to learn paradigms within a hierarchical tree structure. We propose a novel hierarchical clustering model that deviates from the current hierarchical clustering models in two aspects:

1. It is a generative model based on a hierarchical Dirichlet process (HDP) that simultaneously infers the hierarchical structure and morphological segmentation.
2. Our model infers multiple trees (i.e., a forest of trees) instead of a single tree.

### 3.1. Model Overview

Let *D* = {*x*^{1}, *x*^{2}, …, *x*^{N}} denote the input data, where each *x*^{j} is a word. Let *D*_{i} ⊆ *D* denote the subset of the data generated by the tree rooted at *i*, then (see Figure 3):

The marginal probability of the data *D*_{i} in node *i* with parameters θ and hyperparameters β is given by:

Given tree *T*_{i}, the data likelihood is given recursively by:

This is a **product of experts** model (Hinton 2002) comprising *two* competing hypotheses.^{1} Hypothesis H_{1} is given by *p*(*D*_{i}) and assigns all the data points in *D*_{i} into a single cluster. Hypothesis H_{2} is given by *p*(*D*_{l}|*T*_{l})*p*(*D*_{r}|*T*_{r}) and recursively splits the data in *D*_{i} into two partitions (i.e., subtrees) *D*_{l} and *D*_{r}. The factor *Z* is the partition function or the normalizing constant given by:

The recursive partitioning scheme we use is similar to that of Heller and Ghahramani (2005). The product of experts scheme used in this paper contrasts with the more conventional sum of experts mixture model of Heller and Ghahramani (2005) that would have resulted in the mixture π *p*(*D*_{i}) + (1 − π) *p*(*D*_{l}|*T*_{l})*p*(*D*_{r}|*T*_{r}), where π denotes the mixing proportion.
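
Written side by side, the two ways of combining the hypotheses can be contrasted as below. This is a sketch assembled from the prose description: the product form assumes that H_{1} and H_{2} are multiplied and normalized by *Z* at an internal node *i* with children *l* and *r*, and the mixture form restates the expression given in the preceding paragraph.

```latex
\begin{align*}
\text{product of experts:}\quad
  p(D_i \mid T_i) &= \frac{1}{Z}\; p(D_i)\; p(D_l \mid T_l)\; p(D_r \mid T_r) \\[4pt]
\text{sum of experts:}\quad
  p(D_i \mid T_i) &= \pi\, p(D_i) + (1 - \pi)\, p(D_l \mid T_l)\, p(D_r \mid T_r)
\end{align*}
```

In the product form a node scores highly only when both hypotheses assign high probability to the data, whereas in the mixture form either hypothesis alone can account for it.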

Given a set of trees *T* = {*T*_{1}, *T*_{2}, …, *T*_{n}}, the total likelihood of data in all trees is defined as follows:

Let *T* = {*T*_{1}, *T*_{2}, …, *T*_{n}} be a set of trees. *T*_{i} is sampled from a DP as follows:

Here, *U* is a uniform base distribution that assigns equal probability to each tree. We integrate out *F*, which is a distribution over trees, instead of estimating it. Hence, the conditional probability of sampling an existing tree is computed as follows:

Here, *N*_{k} denotes the number of words in *T*_{k} and *N* denotes the total number of words in the model. A new tree is generated with the following:
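
The two cases, joining an existing tree or opening a new one, can be sketched in Python as follows. The sketch assumes the standard Chinese restaurant process form of the DP predictive distribution once *F* is integrated out, that is, *N*_{k}/(*N* + α) for an existing tree and α/(*N* + α) for a new tree; the function name and the tree sizes are hypothetical, and α = 0.0005 is the value reported in Section 4.

```python
def tree_assignment_probs(tree_sizes, alpha):
    """Chinese restaurant process probabilities for placing one word.

    tree_sizes[k] is N_k, the number of words already in tree T_k.
    Returns (probabilities of the existing trees, probability of a new tree).
    """
    total = sum(tree_sizes)  # N, the total number of words in the model
    existing = [n_k / (total + alpha) for n_k in tree_sizes]
    new_tree = alpha / (total + alpha)
    return existing, new_tree


# Hypothetical forest with three trees holding 600, 300, and 100 words.
existing, new_tree = tree_assignment_probs([600, 300, 100], alpha=0.0005)
print(existing)
print(new_tree)
```

Larger trees receive proportionally more probability mass, and with such a small α a new tree is opened only rarely.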

### 3.2. Modeling Morphology with Probabilistic Hierarchical Clustering

In our model, data points are the words and each tree node corresponds to a morphological paradigm (see Figure 4). Each word is part of all the paradigms on the path from the leaf node containing that word up to the root node. The word can share either its stem or suffix with other words in the same paradigm. Hence, a considerable number of words that may not be seen in the corpus can be generated through this approach. Our model will prefer words that share stems or suffixes to be close to each other within the tree.

For each node *i*, we define a Dirichlet process to generate stems (denoted by $s_i^1, \ldots, s_i^{N_i}$) and a separate Dirichlet process to generate suffixes (denoted by $m_i^1, \ldots, m_i^{N_i}$):

Here, *DP*(β_{s}, *P*^{s}) is a Dirichlet process that generates stems, β_{s} denotes the concentration parameter, and *P*^{s} is the base distribution. $G_i^s$ is a distribution over the stems $s_i^j$ in node *i*. Correspondingly, *DP*(β_{m}, *P*^{m}) is a Dirichlet process that generates suffixes with analogous parameters. $G_i^m$ is a distribution over the suffixes $m_i^j$ in node *i*. For smaller values of the concentration parameter, it is less likely to generate new types. Thus, sparse multinomials can be generated by the Dirichlet process, yielding a skewed distribution. We set β_{s} < 1 and β_{m} < 1 in order to generate a small number of stem and suffix types. $s_i^j$ and $m_i^j$ are the *j*th stem and suffix instance in the *i*th node, respectively.
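
The effect of a small concentration parameter can be made concrete with the expected number of distinct types (tables) after *n* draws from a Chinese restaurant process, which is the sum of β/(β + *i*) for *i* = 0, …, *n* − 1. This is a standard property of the process rather than a formula taken from this article; the function name and the choice of 1,000 tokens are illustrative, and β = 0.01 is the value reported in Section 4.

```python
def expected_num_types(num_tokens, concentration):
    """Expected number of occupied tables after num_tokens CRP draws.

    The i-th customer (counting from 0) opens a new table with
    probability concentration / (concentration + i).
    """
    return sum(concentration / (concentration + i) for i in range(num_tokens))


for beta in (0.01, 1.0, 10.0):
    print(beta, expected_num_types(1000, beta))
```

With β = 0.01 the expected number of types stays close to one even after 1,000 tokens, whereas larger values of β produce many more types.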

The base distributions for stems and suffixes are given by Equation (14) and Equation (15). Here, *c* denotes a single letter or character. We assume that letters are distributed uniformly (Creutz and Lagus 2005b), where *p*(*c*) = 1/*A* for an alphabet having *A* letters. Our model will favor shorter morphemes because they have fewer factors in the joint probability given by Equations (14) and (15).
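
A small sketch shows why shorter morphemes are favored. It assumes that the base probability of a morpheme is the product of its character probabilities, as the reference to factors in the joint probability suggests; the function name and the alphabet size of 26 are hypothetical.

```python
import math


def base_log_prob(morpheme, alphabet_size):
    """Log base probability of a morpheme under uniform characters.

    Each character contributes a factor p(c) = 1 / alphabet_size, so the
    log probability decreases linearly with the morpheme length.
    """
    return -len(morpheme) * math.log(alphabet_size)


for morpheme in ("s", "ing", "ation"):
    print(morpheme, base_log_prob(morpheme, alphabet_size=26))
```

Every additional character multiplies the probability by 1/*A*, so *s* is more probable than *ing*, which in turn is more probable than *ation*.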

To share stems and suffixes across trees, the DPs in the root nodes are tied through global distributions (*H*^{s} for stems and *H*^{m} for suffixes). These introduce dependencies between stems/suffixes in distinct trees. The model will favor stems and suffixes that are already generated in one of the trees. The HDP for a root node *i* is defined as follows:

*H*^{s} and *H*^{m} are drawn from the global DPs *DP*(α_{s}, *P*^{s}) and *DP*(α_{m}, *P*^{m}). Here, $\psi_i^z$ denotes the stem type *z* in node *i*, and $\phi_i^z$ denotes the suffix type *z* in node *i*, which are drawn from *H*^{s} and *H*^{m} (i.e., the global DPs), respectively. The plate diagram of the HDP model is given in Figure 5(b).

From the Chinese restaurant process (CRP) metaphor perspective, the global DPs generate the global sets of dishes (i.e., stems and suffixes) that constitute the menu for all trees in the model. At each tree node, there are two restaurants: one for stems and one for suffixes. At each table, a different type of dish (i.e., stem or suffix type) is served and for each stem/suffix type there exists only one table in each node (i.e., restaurant). Customers are the stem or suffix tokens. Whenever a new customer, $s_i^j$ or $m_i^j$, enters a restaurant, if the table, $\psi_i^z$ or $\phi_i^z$, serving that dish already exists, the new customer sits at that table. Otherwise, a new table is generated in the restaurant. A change in one of the restaurants in the leaf nodes leads to an update in each restaurant all the way to the root node. If the dish is not available in the root node, a new table is created for that root node and a global customer is also added to the global restaurant. If no global table exists for that dish, a new global table serving the dish is also created. This can be seen as a Chinese restaurant franchise where each node is a restaurant itself (see Figure 6).
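
The bookkeeping described by this metaphor can be sketched as follows for stems (suffixes would be handled by a second, identical set of counters). The class and function names are hypothetical, and the sketch only tracks counts; it does not compute any probabilities.

```python
from collections import Counter


class Node:
    """A tree node holding one stem restaurant (one table per stem type)."""

    def __init__(self, parent=None):
        self.parent = parent
        self.stem_counts = Counter()  # stem type -> number of customers


class GlobalRestaurant:
    """Global menu: one table per stem type, one customer per root table."""

    def __init__(self):
        self.stem_counts = Counter()


def add_stem(leaf, stem, global_restaurant):
    """Seat a stem token at the leaf and update every node up to the root."""
    node = leaf
    while node is not None:
        is_root = node.parent is None
        opens_new_table = node.stem_counts[stem] == 0
        node.stem_counts[stem] += 1
        if is_root and opens_new_table:
            # A new root table sends one customer to the global restaurant,
            # creating the global table if it does not exist yet.
            global_restaurant.stem_counts[stem] += 1
        node = node.parent


# Two small trees, each with a root and one leaf.
menu = GlobalRestaurant()
root_a, root_b = Node(), Node()
leaf_a, leaf_b = Node(parent=root_a), Node(parent=root_b)

add_stem(leaf_a, "walk", menu)
add_stem(leaf_a, "walk", menu)
add_stem(leaf_b, "walk", menu)

print(root_a.stem_counts["walk"])  # 2
print(menu.stem_counts["walk"])    # 2
```

In this sketch the global count of a stem type equals the number of trees whose root contains that type.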

The total likelihood of the data in all trees *T* = {*T*_{1}, *T*_{2}, …, *T*_{n}} is computed as follows:

Here, *p*(*S*_{i}|*T*_{i}) and *p*(*M*_{i}|*T*_{i}) are computed recursively:

Here, *Z* is the normalization constant. The same also applies for *M*_{i}.

The likelihood of the stems in the root node of tree *T*_{i}, $S_{root_i} = \{s_i^1, s_i^2, \ldots, s_i^{N_i}\}$, is:

In node *i*, there are $n_{ij}^{s}$ stem tokens of type *j*. The first factor accounts for all denominators from both cases. Similarly, the second and the fourth factors in the second line of Equation (30) correspond to the case where *K*^{s} stem types are generated globally (i.e., stem tables in the global restaurant), and the third factor corresponds to the case where, for each of the *K*^{s} stem types, there are $k_j^{s}$ stems of type *j* in distinct trees. A sample hierarchical structure is given in Figure 7.

The likelihood of the stems in a node of tree *T*_{i}, $S_i = \{s_i^1, s_i^2, \ldots, s_i^{N_i}\}$, which belong to stem tables $\{\psi_i^1, \psi_i^2, \ldots\}$ that index the items on the global menu, is reduced to:

When a new stem is sampled in node *i* (i.e., a new customer enters one of the local restaurants), the conditional probability of the new stem $s_i^j$ that belongs to type *z* (i.e., customer $s_i^j$ sitting at table $\psi_i^z$) is computed as follows (see Teh 2010 and Teh et al. 2006):

Here, the term indexed by *i* denotes the table indicators in node *i*, and $n_{iz}^{-s_i^j}$ denotes the number of stem tokens that belong to type *z* in node *i* when the last instance $s_i^j$ is excluded.

A stem type that does not yet exist in the model is generated from the base distribution *P*^{s}.

The suffix probabilities are computed analogously for the suffixes in node *i* belonging to global suffix types $\{\phi_i^1, \phi_i^2, \ldots\}$, where $N_i^{m}$ is the number of local suffix types; $n_{ij}^{m}$ is the number of suffix tokens of type *j* in node *i*; *K*^{m} is the total number of suffix types; and $k_j^{m}$ is the number of trees that contain suffixes of type *j*.

### 3.3. Metropolis-Hastings Sampling for Inferring Trees

Trees are learned along with the segmentations of words via inference. Learning trees involves two steps: 1) constructing initial trees; 2) Metropolis-Hastings sampling for inferring trees.

#### 3.3.1. Constructing Initial Trees.

The probability of inserting a word *w*^{j} = *s* + *m* in tree *T*_{k} is given by:

This includes the probability *p*(*T*_{k}|*T*, α, *U*) of choosing a particular tree. Once a tree is chosen, a branch to insert is selected at random. The algorithm for constructing the initial trees is given in Algorithm 1.

#### 3.3.2. Metropolis-Hastings Sampling.

Once the initial trees are constructed, the hierarchical structures yielding a global maximum likelihood are inferred by using the Metropolis-Hastings algorithm (Hastings 1970). The inference is performed by iteratively removing a word from a leaf node of a tree and subsequently sampling a new tree, a position within the tree, and a new segmentation (see Algorithm 2). Trees are sampled with the conditional probability given in Equations (8) and (9). Hence, trees with more words attract more words, and new trees are created in proportion to the hyperparameter α.

Here, *p*_{next}(*D*|*T*) denotes the likelihood of the data under the altered model and *p*_{cur}(*D*|*T*) denotes the likelihood of data under the current model before sampling. The normalization constant cancels out because it is the same for the numerator and the denominator. If $p_{next}(D|T)^{1/\gamma} > p_{cur}(D|T)^{1/\gamma}$, then the new sample is accepted; otherwise, the new model is still accepted with a probability *p*_{Acc}. The system is cooled in each iteration with decrements η. We refer to Section 4 for details of parameter settings.
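
The acceptance rule and the cooling schedule can be sketched as follows. The sketch assumes that *p*_{Acc} is the likelihood ratio raised to the power 1/γ, which is what the comparison above suggests, and works in log space to avoid underflow (an implementation choice of this sketch). The function names are hypothetical; γ = 2, the final temperature 0.01, and η = 0.0001 are the settings reported in Section 4.

```python
import math
import random


def accept_move(log_p_next, log_p_cur, gamma, rng=random):
    """Annealed Metropolis-Hastings acceptance test.

    Improvements are always accepted; otherwise the move is accepted with
    probability (p_next / p_cur) ** (1 / gamma), computed in log space.
    """
    if log_p_next > log_p_cur:
        return True
    p_acc = math.exp((log_p_next - log_p_cur) / gamma)
    return rng.random() < p_acc


def cool(gamma, eta=0.0001, gamma_min=0.01):
    """Lower the temperature by eta without going below gamma_min."""
    return max(gamma_min, gamma - eta)


gamma = 2.0
print(accept_move(-10.0, -12.0, gamma))  # True: the likelihood improved
gamma = cool(gamma)
```

As γ decreases, the exponent 1/γ grows and moves that lower the likelihood are accepted less and less often.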

### 3.4. Morphological Segmentation

Here, *w*^{k} denotes the *k*th word to be segmented in the test set.

### 3.5. Example Paradigms

A sample of root paradigms learned by our model for English is given in Table 1. The model can find similar word forms (i.e., separat + ists, medal + ists, hygien + ists) that are grouped in neighboring branches of the tree structure (see Figures 9, B.1, and B.2 for sample paradigms learned in English, Turkish, and German).

Stems | Suffixes |
---|---|
{final, ungod, frequent, pensive} | {ly} |
{kind, kind, kind} | {est, er, 0} |
{underrepresent, modul} | {ation} |
{plebe, hawai, muslim-croat} | {ian} |
{compuls, shredd, compuls, shredd, compuls} | {ion, er, ively, ers, ory} |
{zion, modern, modern, dynam} | {ism, ists} |
{reclaim, chas, pleas, fell} | {ing} |
{mov, engrav, stray, engrav, fantasiz, reischau, decilit, suspect} | {ing, er, e} |
{measur, measur, incontest, transport, unplay, reput} | {e, able} |
{housewar, decorat, entitl, cuss, decorat, entitl, materi, toss, flay, unconfirm, linse, equipp} | {es, ing, alise, ed} |
{fair, norw, soon, smooth, narrow, sadd, steep, noisi, statesw, narrow} | {est, ing} |
{rest, wit, name, top, odor, bay, odor, sleep} | {less, s} |
{umpir, absorb, regard, embellish, freez, gnash} | {ing} |
{nutrition, manicur, separat, medal, hygien, nutrition, genetic, preservation} | {0, ists} |

Paradigms are captured based on the similarity of either stems or suffixes. Having the same stem such as co-chair (co-chairman, co-chairmen) or trades (trades + man, trades + men) allows us to find segmentations such as co-chair + man vs. co-chair + men and trades + man vs. trades + men. Although we assume a stem + suffix segmentation, other types of segmentation, such as prefix + stem, are also covered. However, stem alterations and infixation are not covered in our model.

## 4. Evaluation

We used publicly available Morpho Challenge data sets for English, German, and Turkish for training. The English data set consists of 878,034 words, the German data set consists of 2,338,323 words, and the Turkish data set consists of 617,298 words. Although frequency of each word was available in the training set, we did not make use of this information. In other words, we use only the word types (not tokens) in training. We do not address the ambiguity of words in this work and leave this as future research.

In all experiments, the initial temperature of the system is set to γ = 2 and it is reduced to γ = 0.01 with decrements η = 0.0001 (see Equation (39)). Figure 10 shows the time required for the log likelihoods of the trees of sizes 10K, 16K, and 22K to converge. We fixed α_{s} = α_{m} = β_{s} = β_{m} = 0.01 and α = 0.0005 in all our experiments. These hyperparameters were set manually; they are the best values obtained from a number of experiments.^{2}

Precision, recall, and F-score values against training set sizes are given in Figures 11 and 12 for English and Turkish, respectively.

### 4.1. Morpho Challenge Evaluation

Although we experimented with different sizes of training sets, we used 600K randomly chosen words from the English data set and 200K randomly chosen words from each of the Turkish and German data sets for evaluation purposes.

Evaluation is performed according to the method proposed in Morpho Challenge (Kurimo et al. 2010a), which in turn is based on evaluation used by Creutz and Lagus (2007). The gold standard evaluation data set utilized within the Morpho Challenge is a hidden set that is *not* available publicly. This makes the Morpho Challenge evaluation different from other evaluations that provide test data. In this evaluation, word pairs sharing at least one common morpheme are sampled. For precision, word pairs are sampled from the system results and checked against the gold standard segmentations. For recall, word pairs are sampled from the gold standard and checked against the system results. For each matching morpheme, 1 point is given. Precision and recall are calculated by normalizing the total obtained scores.
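
The F-measure is the harmonic mean of precision and recall. The short check below, with a hypothetical helper name, reproduces the F-measure of our model from the precision and recall reported for English in Table 2.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall of our model on English (Table 2).
print(round(f_measure(55.60, 65.80), 2))  # 60.27
```

The same computation gives 50.71 for German (47.92, 53.84) and 56.41 for Turkish (57.70, 55.18), matching Tables 3 and 4.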

We compare our results with other unsupervised systems from Morpho Challenge 2010 (Kurimo et al. 2010b) for English, German, and Turkish. More specifically, we compare our model with all competing unsupervised systems: Morfessor Baseline (Creutz and Lagus 2002), Morfessor CATMAP (Creutz and Lagus 2005a), Base Inference (Lignos 2010), Iterative Compounding (Lignos 2010), Aggressive Compounding (Lignos 2010), and Nicolas, Farré, and Molinero (2010). Additionally, we compare our system with the Morpho Chain model of Narasimhan, Barzilay, and Jaakkola (2015) by re-training their model on exactly the same training sets as ours. All evaluation was carried out by the Morpho Challenge organizers based on the hidden gold data sets.

English results are given in Table 2. For English, the Base Inference algorithm of Lignos (2010) obtained the highest F-measure among the unsupervised systems competing in Morpho Challenge 2010. Our model is ranked fourth among all unsupervised systems, with an F-measure of 60.27%.

System | Precision (%) | Recall (%) | F-measure (%) |
---|---|---|---|
Hierarchical Morphological Segmentation | 55.60 | 65.80 | 60.27 |
Single Tree Prob. Clustering (Can and Manandhar 2012) | 55.60 | 57.58 | 57.33 |
Base Inference (Lignos 2010) | 80.77 | 53.76 | 64.55 |
Iterative Compounding (Lignos 2010) | 80.27 | 52.76 | 63.67 |
Aggressive Compounding (Lignos 2010) | 71.45 | 52.31 | 60.40 |
Nicolas (Nicolas, Farré, and Molinero 2010) | 67.83 | 53.43 | 59.78 |
Morfessor Baseline (Creutz and Lagus 2002) | 81.39 | 41.70 | 55.14 |
Morpho Chain (Narasimhan, Barzilay, and Jaakkola 2015) | 74.87 | 39.01 | 50.42 |
Morfessor CatMAP (Creutz and Lagus 2005a) | 86.84 | 30.03 | 44.63 |

German results are given in Table 3. Our model outperforms the other unsupervised models in Morpho Challenge 2010, with an F-measure of 50.71%.

System | Precision (%) | Recall (%) | F-measure (%) |
---|---|---|---|
Hierarchical Morphological Segmentation | 47.92 | 53.84 | 50.71 |
Single Tree Prob. Clustering (Can and Manandhar 2012) | 57.79 | 32.42 | 41.54 |
Base Inference (Lignos 2010) | 66.38 | 35.36 | 46.14 |
Iterative Compounding (Lignos 2010) | 62.13 | 34.70 | 44.53 |
Aggressive Compounding (Lignos 2010) | 59.41 | 37.21 | 45.76 |
Morfessor Baseline (Creutz and Lagus 2002) | 82.80 | 19.77 | 31.92 |
Morfessor CatMAP (Creutz and Lagus 2005a) | 72.70 | 35.43 | 47.64 |

Turkish results are given in Table 4. Again, our model outperforms the other unsupervised participants, achieving an F-measure of 56.41%.

System | Precision (%) | Recall (%) | F-measure (%) |
---|---|---|---|
Hierarchical Morphological Segmentation | 57.70 | 55.18 | 56.41 |
Single Tree Prob. Clustering (Can and Manandhar 2012) | 72.36 | 25.81 | 38.04 |
Base Inference (Lignos 2010) | 72.81 | 16.11 | 26.38 |
Iterative Compounding (Lignos 2010) | 68.69 | 21.44 | 32.68 |
Aggressive Compounding (Lignos 2010) | 55.51 | 34.36 | 42.45 |
Nicolas (Nicolas, Farré, and Molinero 2010) | 79.02 | 19.78 | 31.64 |
Morfessor Baseline (Creutz and Lagus 2002) | 89.68 | 17.78 | 29.67 |
Morpho Chain (Narasimhan, Barzilay, and Jaakkola 2015) | 69.25 | 31.51 | 43.32 |
Morfessor CatMAP (Creutz and Lagus 2005a) | 79.38 | 31.88 | 45.49 |

Our model also outperforms the model from Can and Manandhar (2012) (which we refer to as Single Tree Probabilistic Clustering), although the results are not directly comparable because the largest data set we were able to train that model on was 22K words due to the training time required. The full training set provided by Morpho Challenge was not used in Can and Manandhar (2012). Our current approach is more efficient as the training cost is divided across multiple tree structures with each tree being shallower compared with our previous model.

In order to draw a substantive empirical comparison, we performed another set of experiments by running the current approach on only 22K words, the same amount of data used for Single Tree Probabilistic Clustering (Can and Manandhar 2012). Results are given in Tables 5, 6, and 7. As shown in the tables, the current model outperforms the previous model even on the smaller data set.

Table 5. English, 22K-word training set.

System | Precision (%) | Recall (%) | F-measure (%) |
---|---|---|---|
Hierarchical Morphological Segmentation | 67.75 | 53.93 | 60.06 |
Single Tree Prob. Clustering (Can and Manandhar 2012) | 55.60 | 57.58 | 57.33 |

Table 6. German, 22K-word training set.

System | Precision (%) | Recall (%) | F-measure (%) |
---|---|---|---|
Hierarchical Morphological Segmentation | 33.93 | 65.31 | 44.66 |
Single Tree Prob. Clustering (Can and Manandhar 2012) | 57.79 | 32.42 | 41.54 |

Table 7. Turkish, 22K-word training set.

System | Precision (%) | Recall (%) | F-measure (%) |
---|---|---|---|
Hierarchical Morphological Segmentation | 64.39 | 42.99 | 51.56 |
Single Tree Prob. Clustering (Can and Manandhar 2012) | 72.36 | 25.81 | 38.04 |

### 4.2. Additional Evaluation

For additional experiments, we compare our model with Morpho Chain (Narasimhan, Barzilay, and Jaakkola 2015) based on their evaluation method that differs from the Morpho Challenge evaluation method. Their evaluation method is based on counting the correct segmentation points. For example, if the result segmentation is *booking* + *s* and the gold segmentation is *book* + *ing* + *s*, 1 point is counted. Precision and recall are calculated based on these matching segmentation points. In addition, this evaluation does not use the hidden gold data sets. Instead, the test sets are created by aggregating the test data from Morpho Challenge 2005 and Morpho Challenge 2010 (as reported in Narasimhan, Barzilay, and Jaakkola [2015]) that provide segmentation points.^{3}
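
The counting of matching segmentation points can be illustrated for a single word as follows. This is a minimal sketch with hypothetical function names: it scores one predicted segmentation against one gold segmentation, whereas the full evaluation aggregates the counts over the whole test set, and returning 0.0 for a word without segmentation points is a choice made for this sketch.

```python
def boundaries(segmentation):
    """Character offsets of the segmentation points in 'book+ing+s'."""
    points, offset = set(), 0
    for morph in segmentation.split("+")[:-1]:
        offset += len(morph)
        points.add(offset)
    return points


def boundary_scores(predicted, gold):
    """Return (matching points, precision, recall) for one word."""
    pred, ref = boundaries(predicted), boundaries(gold)
    correct = len(pred & ref)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    return correct, precision, recall


# The example from the text: one of the two gold points is found.
print(boundary_scores("booking+s", "book+ing+s"))  # (1, 1.0, 0.5)
```

The predicted segmentation proposes a single point, which is correct, and misses the point between *book* and *ing*.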

We used the same trained models as in our Morpho Challenge evaluation. The English test set contains 2,218 words and the Turkish test set contains 2,534 words.^{4}

| System | Precision (%) | Recall (%) | F-measure (%) |
|---|---|---|---|
| Hierarchical Morphological Segmentation | 67.41 | 62.5 | 64.86 |
| Morpho Chain | 72.63 | 78.72 | 75.55 |

| System | Precision (%) | Recall (%) | F-measure (%) |
|---|---|---|---|
| Hierarchical Morphological Segmentation | 89.30 | 48.22 | 62.63 |
| Morpho Chain | 70.49 | 63.27 | 66.66 |

For all systems, the Morpho Chain evaluation scores are comparatively higher than the Morpho Challenge scores. There are several reasons for this. First, the Morpho Challenge evaluation considers the morpheme labels rather than the surface forms of the morphemes. For example, *pantolon* + *u* + *yla* [with his trousers] and *emel* + *ler* + *i* + *yle* [with his desires] have in common both the possessive morpheme (*u* and *i*), labeled *POS*, and the relational morpheme (*yla* and *yle*), labeled *REL*. This increases the total number of points computed over all word pairs, and therefore lowers the scores.

Secondly, in the Morpho Chain evaluation, only the gold segmentation that has the maximum match with the result segmentation is chosen for each word (e.g., *yazımıza* has two gold segmentations: *yaz* + *ı* + *mız* + *a* [to our summer] and *yazı* + *mız* + *a* [to our writing]). In contrast, the Morpho Challenge evaluation takes all gold segmentations of a word into account. This is another factor that increases the scores in the Morpho Chain evaluation. Thus, the Morpho Chain evaluation favors precision over recall. Indeed, in the Morpho Challenge evaluation, the Morpho Chain system has high precision but suffers from low recall due to undersegmentation (see Tables 2 and 4).
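The best-match selection among alternative gold analyses can be sketched as follows. The function names, the hypothetical system output `yaz + ımıza`, and the tie-breaking behavior are assumptions for illustration, not the Morpho Chain implementation. Boundaries are represented as character offsets.

```python
from __future__ import annotations


def boundaries(segmentation: str) -> set[int]:
    """Return the character offsets of the internal morpheme boundaries."""
    points, offset = set(), 0
    morphs = [m.strip() for m in segmentation.split("+")]
    for morph in morphs[:-1]:  # no boundary after the final morph
        offset += len(morph)
        points.add(offset)
    return points


def best_gold(predicted: str, golds: list[str]) -> str:
    """Pick the gold analysis sharing the most boundaries with the prediction."""
    pred = boundaries(predicted)
    # max() returns the first maximal item, so ties go to the earlier analysis
    # (an arbitrary choice made for this sketch).
    return max(golds, key=lambda g: len(pred & boundaries(g)))


# The two gold analyses of "yazımıza" given in the text.
golds = ["yaz + ı + mız + a", "yazı + mız + a"]

# Hypothetical system output with a single boundary after "yaz".
chosen = best_gold("yaz + ımıza", golds)
```

Here the predicted boundary after *yaz* appears only in the first gold analysis, so that analysis is selected and the word is scored against it alone. The second analysis, which the prediction does not match at all, is ignored.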

It should be noted that the output of our system is not only the segmentation points, but also the hierarchical organization of morphological paradigms, which we believe is novel in this work. However, measuring the quality of hierarchical paradigms would require a correspondingly hierarchically organized gold data set, so we are unable to provide an objective measure of the quality of the hierarchical structures learned. We present different portions of the obtained trees in Appendix B (see Figures B.1 and B.2).^{5} It can be seen that words sharing the same suffixes are gathered closer to each other, such as *reestablish* + *ed*, *reclassifi* + *ed*, *circl* + *ed*, *uncloth* + *ed*, and so forth. In addition, related morphological families gather closer to each other, such as *impress* + *ively*, *impress* + *ionist*, *impress* + *ions*, *impress* + *ion*.

## 5. Conclusions and Future Work

In this article, we introduce a tree structured Dirichlet process model for hierarchical morphological segmentation. The method differs from existing hierarchical and non-hierarchical methods for learning paradigms. Our model learns morphological paradigms that are clustered hierarchically within a forest of trees.

Although our focus in this work is on hierarchical learning, our model shows competitive performance against other unsupervised morphological segmentation systems that are designed primarily for segmentation only. The system outperforms other unsupervised systems in Morpho Challenge 2010 for German and Turkish. It also outperforms the more recent Morpho Chain (Narasimhan, Barzilay, and Jaakkola 2015) system on the Morpho Challenge evaluation for German and Turkish.

The sample paradigms learned show that these can be linked to other types of latent information, such as part-of-speech tags. Combining morphology and syntax as a joint learning problem within the same model can be a fruitful direction for future work.

The hierarchical structure is beneficial because we can have both compact and more general paradigms at the same time. In this article, we use the paradigms only for the segmentation task, and applications of the learned hierarchy are left as future work.

## Appendix A. Derivation of the Full Joint Distribution

Let *D* = {*s*^{1}, *s*^{2}, …, *s*^{N}} denote the input data, where each *s*^{j} is a data item. A particular setting of a table with *N* customers has a joint probability of:

## Appendix B. Sample Tree Structures

## Acknowledgments

This research was supported by TUBITAK (The Scientific and Technological Research Council of Turkey) grant number 115E464. We thank Karthik Narasimhan for providing their data sets and code. We are grateful to Sami Virpioja for the evaluation of our results on the hidden gold data provided by Morpho Challenge. We thank our reviewers for critical feedback and spotting an error in our previous version of the article. Their comments have immensely helped improve the article.

## Notes

1. The review version of this paper and our previous work (Can and Manandhar 2012) missed this connection to the product of experts model and did not include the 1/Z term. We thank our reviewers for spotting this error.

2. The source code of the model is accessible at: https://github.com/burcu-can/TreeStructuredDP.

3. In addition to the hidden test data, Morpho Challenge also provides separate publicly available test data.

4. Because of the unavailability of word2vec word embeddings for German, we were unable to perform Morpho Chain evaluation on this language.

5. Some of the full trees are given at http://web.cs.hacettepe.edu.tr/~burcucan/TreeStructuredDP.

## References

*Unsupervised Learning of Morphology and POS tagging*