This article presents a probabilistic hierarchical clustering model for morphological segmentation. In contrast to existing approaches to morphology learning, our method learns the hierarchical organization of word morphology as a collection of tree-structured paradigms. The model is fully unsupervised and based on the hierarchical Dirichlet process. Tree hierarchies are learned simultaneously with the corresponding morphological paradigms. Our model is evaluated on Morpho Challenge data and shows competitive performance compared with state-of-the-art unsupervised morphological segmentation systems. Although we apply the model to morphological segmentation, it can also be used for hierarchical clustering of other types of data.

Unsupervised learning of morphology has been an important task because of the benefits it provides to many other natural language processing applications, such as machine translation, information retrieval, and question answering. Morphological paradigms provide a natural way to capture the internal morphological structure of a group of morphologically related words. Following Goldsmith (2001) and Monson (2008), we use the term paradigm to denote a set of stems and a set of suffixes such that each combination of a stem and a suffix yields a valid word form; for example, {walk, talk, order, yawn}{s, ed, ing} generates the surface forms walk + ed, walk + s, walk + ing, talk + ed, talk + s, talk + ing, order + s, order + ed, order + ing, yawn + ed, yawn + s, yawn + ing. A sample paradigm is given in Figure 1.
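As a concrete illustration of this definition, the following snippet enumerates the surface forms generated by the example paradigm above; the code is purely illustrative and not part of the model itself.

```python
from itertools import product

# A paradigm is a pair (stems, suffixes); every stem + suffix
# combination is assumed to yield a valid surface form.
stems = ["walk", "talk", "order", "yawn"]
suffixes = ["s", "ed", "ing"]

surface_forms = [stem + suffix for stem, suffix in product(stems, suffixes)]
print(surface_forms)
# ['walks', 'walked', 'walking', 'talks', 'talked', 'talking', ...]
```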

Figure 1: An example paradigm.

Recently, we introduced a probabilistic hierarchical clustering model for learning hierarchical morphological paradigms (Can and Manandhar 2012). Each node in the hierarchical tree corresponds to a morphological paradigm and each leaf node consists of a single word. A single tree is learned, where different branches correspond to different morphological forms. That model learned well-defined paradigms at the lower levels of the tree; however, the merging of paradigms at the upper levels led to undersegmentation. This problem led us to search for ways to learn multiple trees. In our current approach, we learn a forest of paradigms spread over several hierarchical trees. Our evaluation on Morpho Challenge data sets shows better results than the previous method (Can and Manandhar 2012). Our results are comparable to the current state of the art while having the additional benefit of inferring the hierarchical structure of morphemes, for which no comparable systems exist.

The article is organized as follows: Section 2 introduces related work in the field. Section 3 describes the probabilistic hierarchical clustering model, gives its mathematical definition, and explains how it is applied to morphological segmentation, including inference. Section 4 presents the experimental setting and the evaluation scores obtained from each experiment, and Section 5 concludes and discusses potential future work following from the model presented in this article.

There have been many unsupervised approaches to morphology learning that focus solely on segmentation (Creutz and Lagus 2005a, 2007; Snyder and Barzilay 2008; Poon, Cherry, and Toutanova 2009; Narasimhan, Barzilay, and Jaakkola 2015). Others, such as Monson et al. (2008), Can and Manandhar (2010), Chan (2006), and Dreyer and Eisner (2011), learn morphological paradigms that permit additional generalization.

A popular paradigmatic model is Linguistica (Goldsmith 2001), which uses the Minimum Description Length principle to minimize the description length of a corpus based on paradigm-like structures called signatures. A signature consists of a list of suffixes that are seen with a particular stem—for example, order-{ed, ing, s} denotes a signature for the stem order.

Snover, Jarosz, and Brent (2002) propose a generative probabilistic model that defines a probability distribution over different segmentations of the lexicon into paradigms. Paradigms are learned with a directed search algorithm that examines subsets of the lexicon, ranks them, and incrementally combines them in order to find the best segmentation of the lexicon. The proposed model addresses both inflectional and derivational morphology in a language-independent framework. However, their model does not allow multiple suffixation (i.e., multiple suffixes added to a single stem), whereas Linguistica does.

Monson et al. (2008) induce morphological paradigms in a deterministic framework named ParaMor. Their search algorithm begins with a set of candidate suffixes and collects candidate stems that attach to those suffixes (see Figure 2(b)). The algorithm gradually develops paradigms by following search paths in a lattice-like structure. Probabilistic ParaMor, which uses a statistical natural language tagger to mimic ParaMor, was introduced in Morpho Challenge 2009 (Kurimo et al. 2009), where it outperformed the other competing unsupervised morphological segmentation systems for Finnish, Turkish, and German.

Figure 2: Examples of hierarchical morphological paradigms.

Can and Manandhar (2010) exploit syntactic categories to capture morphological paradigms. In a deterministic scheme, morphological paradigms are learned by pairing syntactic categories and identifying common suffixes between them. The paradigms compete to acquire more word pairs.

Chan (2006) describes a supervised procedure to find morphological paradigms within a hierarchical structure by applying latent Dirichlet allocation. Each paradigm is placed in a node on the tree (see Figure 2(a)). The results show that each paradigm corresponds to a part-of-speech such as adjective, noun, verb, or adverb. However, as the method is supervised, the true suffixes and segmentations are known in advance. Learning hierarchical paradigms helps not only in learning morphological segmentation, but also in learning syntactic categories. This linguistic motivation led us toward learning the hierarchical organization of morphological paradigms.

Dreyer and Eisner (2011) propose a Dirichlet process mixture model to learn paradigms. The model uses 50–100 seed paradigms to infer the remaining paradigms, which makes it semi-supervised. The model is similar to ours in the sense that it also uses Dirichlet processes (DPs); however, it does not learn hierarchies between the paradigms.

Luo, Narasimhan, and Barzilay (2017) learn morphological families that share the same stem, such as faithful, faithfully, unfaithful, and faithless, which are all derived from faith. These morphological families are learned as a graph and called morphological forests, which differs from the sense in which we use the term forest in this article. Whereas learning morphological families is treated as a graph learning problem in Luo, Narasimhan, and Barzilay (2017), in this work we learn paradigms that generalize morphological families within a collection of hierarchical structures.

Narasimhan, Barzilay, and Jaakkola (2015) model word formation with morphological chains in terms of parent-child relations. For example, play and playful have a parent-child relationship as a result of adding the morpheme ful at the end of play. These relations are modeled using log-linear models in order to predict the parent relations. In addition to orthographic features, semantic features given by word2vec (Mikolov et al. 2013) are used for the prediction of parent-child relations. Narasimhan, Barzilay, and Jaakkola use contrastive estimation and generate corrupted examples as pseudo-negative examples within their approach.

Our model is an extension of our previous hierarchical clustering algorithm (Can and Manandhar 2012). In that algorithm, a single tree is learned that corresponds to a hierarchical organization of morphological paradigms, and the parent nodes merge the paradigms of their child nodes. However, such merging of paradigms into a single structure causes unrelated paradigms to be merged, resulting in lower segmentation accuracy. The current model addresses this issue by learning a forest of tree structures. Within each tree, the parent nodes merge the paradigms of the child nodes, while multiple trees ensure that paradigms that should not be merged are kept separate. Additionally, in single tree hierarchical clustering, a manually defined context-free grammar was employed to generate the segmentation of a word; in the current model, we predict the segmentation of a word without using any manually defined grammar rules.

Chan (2006) showed that learning the hierarchy between morphological paradigms can help reveal latent relations in data. In the latent class model of Chan (2006), morphological paradigms in a tree structure can be linked to syntactic categories (i.e., part-of-speech tags). An example output of the model is given in Figure 2(a). Furthermore, tokens can be assigned to allomorphs or gender/conjugational variants in each paradigm.

Monson et al. (2008) showed that learning paradigms within a hierarchical model gives a strong performance on morphological segmentation. In their model, each paradigm is part of another paradigm, implemented within a lattice structure (see Figure 2(b)).

Motivated by these works, we aim to learn paradigms within a hierarchical tree structure. We propose a novel hierarchical clustering model that differs from existing hierarchical clustering models in two respects:

  1. It is a generative model based on a hierarchical Dirichlet process (HDP) that simultaneously infers the hierarchical structure and the morphological segmentation.

  2. Our model infers multiple trees (i.e., a forest of trees) instead of a single tree.

Although not covered in this work, future work could aim at discovering other types of latent information, such as part-of-speech tags and allomorphs.

3.1. Model Overview

Let D = {x_1, x_2, …, x_N} denote the input data, where each x_j is a word. Let D_i ⊂ D denote the subset of the data generated by the tree rooted at i; then (see Figure 3):
(1)
where D_i = {x_{i1}, x_{i2}, …, x_{iN_i}}.
Figure 3: A portion of a tree rooted at i with child nodes k and j and the corresponding data points they generate. i, j, k refer to the tree nodes; T_i, T_j, T_k refer to the corresponding trees; and D_i, D_j, D_k refer to the data contained in the trees.
The marginal probability of the data items in a given node of a tree i with parameters θ and hyperparameters β is given by:
(2)  p(D_i | β) = ∫ p(D_i | θ) p(θ | β) dθ
Given the tree T_i, the data likelihood is given recursively by:
(3)  p(D_i | T_i) = (1/Z(D_i)) p(D_i) p(D_l | T_l) p(D_r | T_r)
The term (1/Z(D_i)) p(D_l | T_l) p(D_r | T_r) corresponds to a product of experts model (Hinton 2002) comprising two competing hypotheses.1 Hypothesis H1 is given by p(D_i) and assigns all the data points in D_i to a single cluster. Hypothesis H2 is given by p(D_l | T_l) p(D_r | T_r) and recursively splits the data in D_i into two partitions (i.e., subtrees) D_l and D_r. The factor Z is the partition function, or normalizing constant, given by:
(4)

The recursive partitioning scheme we use is similar to that of Heller and Ghahramani (2005). The product of experts scheme used in this article contrasts with the more conventional sum of experts mixture model of Heller and Ghahramani (2005), which would have resulted in the mixture π p(D_i) + (1 − π) p(D_l | T_l) p(D_r | T_r), where π denotes the mixing proportion.
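A minimal sketch of this recursive computation (in log space) is given below; `marginal_loglik` stands in for the marginal likelihood of Equation (2), the 1/Z normalizer is omitted, and the code is illustrative rather than the authors' implementation.

```python
class Node:
    def __init__(self, data, left=None, right=None):
        self.data = data        # data points covered by this node
        self.left = left        # left subtree (None for a leaf)
        self.right = right      # right subtree (None for a leaf)

def tree_loglik(node, marginal_loglik):
    """log p(D_i | T_i) as a product of experts: the single-cluster
    hypothesis H1 times the recursive split hypothesis H2.
    The 1/Z(D_i) normalizer of Equation (3) is omitted in this sketch."""
    h1 = marginal_loglik(node.data)                 # log p(D_i): one cluster for all of D_i
    if node.left is None and node.right is None:
        return h1                                   # a leaf has no split hypothesis
    h2 = tree_loglik(node.left, marginal_loglik) + tree_loglik(node.right, marginal_loglik)
    return h1 + h2                                  # product of experts (in log space)
```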

Finally, given the collection of trees T = {T_1, T_2, …, T_n}, the total likelihood of the data in all trees is defined as follows:
(5)  p(D | T) = ∏_{i=1}^{n} p(D_i | T_i)
Trees are generated from a DP. Let T = {T_1, T_2, …, T_n} be a set of trees. T_i is sampled from a DP as follows:
(6)  F | α, U ~ DP(α, U)
(7)  T_i | F ~ F
where α denotes the concentration parameter of the DP and U is a uniform base distribution that assigns equal probability to each tree. We integrate out F, which is a distribution over trees, instead of estimating it. Hence, the conditional probability of sampling an existing tree is computed as follows:
(8)  p(T_k | T, α, U) = N_k / (N + α)
where N_k denotes the number of words in T_k and N denotes the total number of words in the model. A new tree is generated with the following probability:
(9)  p(T_new | T, α, U) = α / (N + α)
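As an illustration of this tree-level CRP, the sketch below samples a tree for a word, choosing an existing tree in proportion to its word count N_k and a new tree in proportion to α; the standard CRP form is assumed and all names are illustrative.

```python
import random

def sample_tree(tree_sizes, alpha):
    """Pick an existing tree with probability proportional to its word
    count N_k, or a new tree with probability proportional to alpha
    (assumed Chinese restaurant process over trees, Equations (8)-(9))."""
    total = sum(tree_sizes.values()) + alpha
    r = random.uniform(0.0, total)
    for tree_id, n_k in tree_sizes.items():
        if r < n_k:
            return tree_id
        r -= n_k
    return "new_tree"

# Example: three existing trees with 5, 2, and 1 words, alpha = 0.0005.
print(sample_tree({"T1": 5, "T2": 2, "T3": 1}, alpha=0.0005))
```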

3.2. Modeling Morphology with Probabilistic Hierarchical Clustering

In our model, the data points are words and each tree node corresponds to a morphological paradigm (see Figure 4). Each word is part of every paradigm on the path from the leaf node containing that word up to the root node. A word can share either its stem or its suffix with other words in the same paradigm. Hence, a considerable number of words that are not seen in the corpus can be generated through this approach. Our model will prefer words that share stems or suffixes to be close to each other within the tree.

Figure 4: A sample hierarchical tree structure that illustrates the clusters (i.e., morphological paradigms) in each node. Each node corresponds to a cluster and the leaf nodes correspond to the input data. The figure shows the ideal forest of trees that one expects given the input data.
The plate diagram of the generative model is given in Figure 5(a). Given a child node i, we define a Dirichlet process to generate stems (denoted by s_{i1}, …, s_{iN_i}) and a separate Dirichlet process to generate suffixes (denoted by m_{i1}, …, m_{iN_i}):
(10)  G_i^s | β_s, P_s ~ DP(β_s, P_s)
(11)  s_{ij} | G_i^s ~ G_i^s
(12)  G_i^m | β_m, P_m ~ DP(β_m, P_m)
(13)  m_{ij} | G_i^m ~ G_i^m
(14)  P_s(s) = ∏_{c ∈ s} p(c)
(15)  P_m(m) = ∏_{c ∈ m} p(c)
where DP(β_s, P_s) is a Dirichlet process that generates stems, β_s denotes its concentration parameter, and P_s is its base distribution. G_i^s is a distribution over the stems s_{ij} in node i. Correspondingly, DP(β_m, P_m) is a Dirichlet process that generates suffixes with analogous parameters, and G_i^m is a distribution over the suffixes m_{ij} in node i. For smaller values of the concentration parameter, it is less likely that new types are generated; thus, the Dirichlet process can generate sparse multinomials yielding a skewed distribution. We set β_s < 1 and β_m < 1 in order to generate a small number of stem and suffix types. s_{ij} and m_{ij} are the jth stem and suffix instances in the ith node, respectively.
Figure 5: The DP model for the child nodes (left) illustrates the generation of the words talking, cooked, and yelling. Each child node maintains its own DP, independent of the DPs of other nodes. The HDP model for the root nodes (right) illustrates the generation of the words yelling, talking, and repairs. In contrast to the DPs in the child nodes, in the HDP model the stems/suffixes are shared across all root HDPs.

The base distributions for stems and suffixes are given by Equation (14) and Equation (15). Here, c denotes a single letter or character. We assume that letters are distributed uniformly (Creutz and Lagus 2005b), where p(c) = 1/A for an alphabet having A letters. Our model will favor shorter morphemes because they have fewer factors in the joint probability given by Equations (14) and (15).
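As a small illustration of this character-level base distribution, assuming an alphabet of A = 26 letters for English:

```python
def base_prob(morpheme, alphabet_size=26):
    """P_s(s) (and analogously P_m(m)): product of per-letter
    probabilities p(c) = 1/A, so shorter morphemes score higher."""
    return (1.0 / alphabet_size) ** len(morpheme)

print(base_prob("ing"))    # 1/26^3
print(base_prob("order"))  # 1/26^5 -- longer, hence less probable
```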

For the root nodes we use HDPs that share global DPs (denoted by H_s for stems and H_m for suffixes). These introduce dependencies between stems/suffixes in distinct trees: the model will favor stems and suffixes that have already been generated in one of the trees. The HDP for a root node i is defined as follows:
(16)  H_s | α_s, P_s ~ DP(α_s, P_s)
(17)  H_m | α_m, P_m ~ DP(α_m, P_m)
(18)  G_i^s | β_s, H_s ~ DP(β_s, H_s)
(19)  G_i^m | β_m, H_m ~ DP(β_m, H_m)
(20)  ψ_{iz} | H_s ~ H_s
(21)  φ_{iz} | H_m ~ H_m
(22)  s_{ij} | G_i^s ~ G_i^s
(23)  m_{ij} | G_i^m ~ G_i^m
where the base distributions H_s and H_m are drawn from the global DPs DP(α_s, P_s) and DP(α_m, P_m). Here, ψ_{iz} denotes the stem type z in node i and φ_{iz} denotes the suffix type z in node i, which are drawn from H_s and H_m (i.e., the global DPs), respectively. The plate diagram of the HDP model is given in Figure 5(b).

From the perspective of the Chinese restaurant process (CRP) metaphor, the global DPs generate the global sets of dishes (i.e., stems and suffixes) that constitute the menu for all trees in the model. At each tree node there are two restaurants: one for stems and one for suffixes. At each table a different type of dish (i.e., a stem or suffix type) is served, and for each stem/suffix type there exists only one table in each node (i.e., restaurant). The customers are the stem and suffix tokens. Whenever a new customer, s_{ij} or m_{ij}, enters a restaurant, if a table ψ_{iz} or φ_{iz} serving that dish already exists, the new customer sits at that table; otherwise, a new table is created in the restaurant. A change in one of the restaurants in a leaf node leads to updates in every restaurant on the path up to the root node. If the dish is not available in the root node, a new table is created for that root node and a global customer is also added to the global restaurant. If no global table exists for that dish, a new global table serving the dish is also created. This can be seen as a Chinese restaurant franchise where each node is a restaurant itself (see Figure 6).
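The sketch below mirrors this bookkeeping for stems: adding a stem token to a leaf node updates every restaurant on the path to the root, and a new dish at a root also seats one customer in the global restaurant. Node and counter names are illustrative, not taken from the authors' implementation.

```python
from collections import Counter

class TreeNode:
    def __init__(self, parent=None):
        self.parent = parent
        self.stem_tables = Counter()   # local restaurant: stem type -> token count

global_stem_tables = Counter()         # global restaurant: stem type -> number of trees using it

def add_stem(leaf, stem):
    """Seat a stem customer in the leaf's restaurant and propagate the
    update through every ancestor up to the root (one restaurant per node)."""
    node = leaf
    while node is not None:
        is_root = node.parent is None
        if is_root and node.stem_tables[stem] == 0:
            # New dish at the root: also seat one customer in the global restaurant.
            global_stem_tables[stem] += 1
        node.stem_tables[stem] += 1
        node = node.parent
```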

Figure 6: A depiction of the Chinese restaurant franchise (i.e., global vs. local CRPs). S1 = {walk, order, sleep, etc.}, M1 = {s, ing}, S2 = {pen, book}, M2 = {0, s}, S3 = {walk, order}, M3 = {0, s}, where 0 denotes an empty suffix. For each stem type in the distinct trees, a customer is inserted in the global restaurant. For example, two stem customers are being served the stem type walk because walk exists in two different trees.
In order to calculate the joint likelihood of the model, we need to consider both the trees and the global stem/suffix sets (i.e., the local and global restaurants). Because the model is a CRP, it is exchangeable: the order in which the words are processed does not alter the joint probability. The joint likelihood of the entire model for a given collection of trees T = {T_1, T_2, …, T_n} is computed as follows:
(24)
(25)
(26)
where p(S_i | T_i) and p(M_i | T_i) are computed recursively, analogously to Equation (3):
(27)  p(S_i | T_i) = (1/Z(S_i)) p(S_i) p(S_l | T_l) p(S_r | T_r)
where Z is the normalization constant. The same also applies for M_i.
Following the CRP, the joint probability of the stems in each root node T_i, S_i^{root} = {s_{i1}, s_{i2}, …, s_{iN_i}}, is:
(28)
(29)
(30)
where the first line in Equation (30) corresponds to the stem CRPs in the root nodes of the trees (see Equation (2.22) in Can [2011] and Equation (A.1) in Appendix A). The second line in Equation (30) corresponds to the global CRP of stems. The second factor in the first line of Equation (30) corresponds to the case where L_i^s stem types are generated for the first time, and the third factor in the first line corresponds to the case where, for each of the L_i^s stem types at node i, there are n_{ij}^s stem tokens of type j. The first factor accounts for all denominators from both cases. Similarly, the second and the fourth factors in the second line of Equation (30) correspond to the case where K^s stem types are generated globally (i.e., stem tables in the global restaurant), and the third factor corresponds to the case where, for each of the K^s stem types, there are k_j^s stems of type j in distinct trees. A sample hierarchical structure is given in Figure 7.
Figure 7: A sample hierarchical structure that contains D = {walk + ed, talk + ed, order + ed, walk + ing, talk + ing} and the corresponding global tables. Here s_{k1} = walk/1 denotes the first stem walk in node k, with frequency 1.
The joint probability of the stems in a child node T_i, S_i = {s_{i1}, s_{i2}, …, s_{iN_i}}, which belong to stem tables {ψ_{i1}, ψ_{i2}, …} that index the items on the global menu, is reduced to:
(31)
Whenever a new stem is added to a node i (i.e., a new customer enters one of the local restaurants), the conditional probability of the new stem s_{ij} belonging to type z (i.e., customer s_{ij} sitting at table ψ_{iz}) is computed as follows (see Teh 2010 and Teh et al. 2006):
(32)
where Ψ_i denotes the table indicators in node i, and n_{iz}^{-s_{ij}} denotes the number of stem tokens that belong to type z in node i when the last instance s_{ij} is excluded.
If the stem (customer) does not exist in the root node (i.e., the customer chooses a dish type that does not yet exist in that root node), the probability of the new stem is calculated as follows:
(33)
If the stem type (dish) exists in the global clusters (i.e., on the global menu), it is chosen with probability proportional to the number of trees that contain that stem type (i.e., the number of global stem customers served that dish). Otherwise, the new stem (dish type) is chosen based on the base distribution P_s.
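As an illustration of this two-level back-off, the following sketch scores a candidate stem for a root node in the standard HDP predictive form; it is a hedged approximation of the quantities in Equations (32) and (33), with all names illustrative.

```python
def stem_score(stem, node_counts, global_counts, beta_s, alpha_s, base_prob):
    """Unnormalized predictive score for seating a stem customer in a root
    node (a sketch in the standard HDP/CRP form; it does not reproduce the
    paper's exact equations)."""
    if node_counts.get(stem, 0) > 0:
        # Existing dish in this node: proportional to its local token count.
        return node_counts[stem]
    # Otherwise open a new local table and back off to the global restaurant:
    # an existing global dish is weighted by the number of trees serving it,
    # a completely new dish by the character-level base distribution P_s.
    total_global = sum(global_counts.values())
    global_term = (global_counts.get(stem, 0) + alpha_s * base_prob(stem)) / (total_global + alpha_s)
    return beta_s * global_term
```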
Analogously to Equations (30)–(33), which apply to stems, the corresponding equations for suffixes are given by Equations (34)–(37).
(34)
(35)
(36)
(37)
where M_i = {m_{i1}, m_{i2}, …, m_{iN_i}} is the set of suffixes in node i belonging to global suffix types {φ_{i1}, φ_{i2}, …}; N_i^m is the number of local suffix types; n_{ij}^m is the number of suffix tokens of type j in node i; K^m is the total number of suffix types; and k_j^m is the number of trees that contain suffixes of type j.

3.3. Metropolis-Hastings Sampling for Inferring Trees

Trees are learned along with the segmentations of words via inference. Learning trees involves two steps: 1) constructing initial trees; 2) Metropolis-Hastings sampling for inferring trees.

3.3.1. Constructing Initial Trees.

Initially, all words are split at random points with uniform probability. We use an approximation to construct the initial trees: instead of computing the full likelihood of the model, for each tree we compute only the likelihood of a single DP and assume that all words belong to this DP. The conditional probability of inserting word w_j = s + m in tree T_k is given by:
(38)
We use Equations (8) and (9) for computing the conditional probability p(T_k | T, α, U) of choosing a particular tree. Once a tree is chosen, a branch at which to insert the word is selected at random. The algorithm for constructing the initial trees is given in Algorithm 1.
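Since Algorithm 1 is not reproduced here, the sketch below outlines the construction loop described above; `insertion_loglik` and `new_tree_loglik` stand in for the single-DP approximation of Equation (38), `insert_at_random_branch` for random placement, and all names are illustrative.

```python
import math
import random

def build_initial_trees(words, alpha, insertion_loglik, new_tree_loglik, insert_at_random_branch):
    """Hedged sketch of initial tree construction: each word is split at a
    random point, a tree is sampled with weight proportional to the CRP prior
    of Equations (8)-(9) times the single-DP likelihood approximating
    Equation (38), and the word is attached at a randomly chosen branch."""
    trees = []                                              # each tree: word count plus contents
    for word in words:
        split = random.randint(1, max(1, len(word) - 1))    # uniform split point
        stem, suffix = word[:split], word[split:]
        n_total = sum(t["size"] for t in trees)
        # Unnormalized weights: existing trees and one candidate new tree.
        weights = [t["size"] / (n_total + alpha) * math.exp(insertion_loglik(t, stem, suffix))
                   for t in trees]
        weights.append(alpha / (n_total + alpha) * math.exp(new_tree_loglik(stem, suffix)))
        choice = random.choices(range(len(weights)), weights=weights)[0]
        if choice == len(trees):
            trees.append({"size": 0, "root": None})          # open a new tree
        tree = trees[choice]
        tree["size"] += 1
        insert_at_random_branch(tree, (stem, suffix))        # random insertion position
    return trees
```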


3.3.2. Metropolis-Hastings Sampling.

Once the initial trees are constructed, the hierarchical structures yielding a global maximum likelihood are inferred using the Metropolis-Hastings algorithm (Hastings 1970). Inference is performed by iteratively removing a word from a leaf node of a tree and subsequently sampling a new tree, a position within that tree, and a new segmentation (see Algorithm 2). Trees are sampled with the conditional probabilities given in Equations (8) and (9). Hence, trees with more words attract more words, and new trees are created with probability proportional to the hyperparameter α.
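Algorithm 2 is likewise not reproduced here; the following sketch shows the shape of one sampling iteration with the annealed accept rule of Equation (39), under an assumed model interface.

```python
import math
import random

def mh_iteration(model, word, gamma):
    """One Metropolis-Hastings step, as described above: remove a word,
    propose a new tree, position, and segmentation, then accept or reject
    with the annealed criterion. The `model` interface (remove_word,
    propose, apply, undo, restore_word, log_likelihood) is assumed for
    illustration only."""
    current_loglik = model.log_likelihood()
    removed_state = model.remove_word(word)          # detach the leaf holding `word`
    proposal = model.propose(word)                   # sample tree, insertion point, split
    model.apply(proposal)
    next_loglik = model.log_likelihood()
    log_accept = (next_loglik - current_loglik) / gamma   # normalizer cancels out
    if log_accept >= 0 or random.random() < math.exp(log_accept):
        return True                                  # keep the altered model
    model.undo(proposal)                             # otherwise revert the proposal
    model.restore_word(word, removed_state)
    return False
```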


Figure 8: A sampling step in the Metropolis-Hastings algorithm.
Once a tree is sampled, we draw a new position and a new segmentation, and the word is inserted at the sampled position with the sampled segmentation. The new model is either accepted or rejected with the Metropolis-Hastings accept-reject criterion. We also use a simulated annealing cooling schedule by assigning an initial temperature γ to the system and decreasing the temperature at each iteration. The accept probability we use is given by:
(39)  p_Acc = p_next(D | T)^{1/γ} / p_cur(D | T)^{1/γ}
where p_next(D | T) denotes the likelihood of the data under the altered model and p_cur(D | T) denotes the likelihood of the data under the current model before sampling. The normalization constant cancels out because it is the same for the numerator and the denominator. If p_next(D | T)^{1/γ} > p_cur(D | T)^{1/γ}, then the new sample is accepted; otherwise, the new model is still accepted with probability p_Acc. The system is cooled at each iteration with decrements η:
(40)  γ ← γ − η
We refer to Section 4 for details of the parameter settings.

3.4. Morphological Segmentation

Once the model is learned, it can be used to segment novel words. We use only the root nodes for segmentation. Viterbi decoding is used to find the morphological segmentation of each word with the maximum probability:
(41)
where w_k denotes the kth word to be segmented in the test set.
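As a rough illustration of this decoding step for the stem + suffix case, the sketch below scores every split of a word against assumed root-node stem and suffix probabilities and keeps the best one; it is not the authors' Viterbi implementation.

```python
def segment(word, stem_prob, suffix_prob):
    """Return the stem + suffix split of `word` with the highest score
    under the (assumed) root-node stem and suffix probabilities."""
    best_split, best_score = None, float("-inf")
    for i in range(1, len(word) + 1):                 # i == len(word) allows an empty suffix
        stem, suffix = word[:i], word[i:]
        score = stem_prob(stem) * suffix_prob(suffix)
        if score > best_score:
            best_split, best_score = (stem, suffix), score
    return best_split

# Example with toy probability functions (purely illustrative):
print(segment("walking",
              stem_prob=lambda s: 0.5 if s == "walk" else 1e-6,
              suffix_prob=lambda m: 0.5 if m == "ing" else 1e-6))
# ('walk', 'ing')
```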

3.5. Example Paradigms

A sample of the root paradigms learned by our model for English is given in Table 1. The model can find similar word forms (e.g., separat + ists, medal + ists, hygien + ists) that are grouped in neighboring branches of the tree structure (see Figures 9, B.1, and B.2 for sample paradigms learned in English, Turkish, and German).

Table 1: Example paradigms in English.

{final, ungod, frequent, pensive} {ly}
{kind, kind, kind} {est, er, 0}
{underrepresent, modul} {ation}
{plebe, hawai, muslim-croat} {ian}
{compuls, shredd, compuls, shredd, compuls} {ion, er, ively, ers, ory}
{zion, modern, modern, dynam} {ism, ists}
{reclaim, chas, pleas, fell} {ing}
{mov, engrav, stray, engrav, fantasiz, reischau, decilit, suspect} {ing, er, e}
{measur, measur, incontest, transport, unplay, reput} {e, able}
{housewar, decorat, entitl, cuss, decorat, entitl, materi, toss, flay, unconfirm, linse, equipp} {es, ing, alise, ed}
{fair, norw, soon, smooth, narrow, sadd, steep, noisi, statesw, narrow} {est, ing}
{rest, wit, name, top, odor, bay, odor, sleep} {less, s}
{umpir, absorb, regard, embellish, freez, gnash} {ing}
{nutrition, manicur, separat, medal, hygien, nutrition, genetic, preservation} {0, ists}
Figure 9: Sample hierarchies in English, Turkish, and German.

Paradigms are captured based on the similarity of either stems or suffixes. Sharing the same stem, such as co-chair (co-chairman, co-chairmen) or trades (trades + man, trades + men), allows us to find segmentations such as co-chair + man vs. co-chair + men and trades + man vs. trades + men. Although we assume a stem + suffix segmentation, other types of segmentation, such as prefix + stem, are also covered. However, stem alternations and infixation are not covered by our model.

We used the publicly available Morpho Challenge data sets for English, German, and Turkish for training. The English data set consists of 878,034 words, the German data set of 2,338,323 words, and the Turkish data set of 617,298 words. Although the frequency of each word was available in the training set, we did not make use of this information; in other words, we use only word types (not tokens) in training. We do not address the ambiguity of words in this work and leave this as future research.

In all experiments, the initial temperature of the system is set to γ = 2 and it is reduced to γ = 0.01 with decrements of η = 0.0001 (see Equation (39)). Figure 10 shows the time required for the log likelihoods of trees of sizes 10K, 16K, and 22K to converge. We fixed α_s = α_m = β_s = β_m = 0.01 and α = 0.0005 in all our experiments. The hyperparameters were set manually to the optimum values found over a number of experiments.2

Figure 10: The likelihood convergence over time (in minutes) for data sets of size 16K and 22K.

Precision, recall, and F-score values against training set sizes are given in Figures 11 and 12 for English and Turkish, respectively.

Figure 11: English results for different sizes of data.
Figure 12: Turkish results for different sizes of data.

4.1. Morpho Challenge Evaluation

Although we experimented with different sizes of training sets, we used a randomly chosen 600K words from the English and 200K words from the Turkish and German data sets for evaluation purposes.

Evaluation is performed according to the method proposed in Morpho Challenge (Kurimo et al. 2010a), which in turn is based on evaluation used by Creutz and Lagus (2007). The gold standard evaluation data set utilized within the Morpho Challenge is a hidden set that is not available publicly. This makes the Morpho Challenge evaluation different from other evaluations that provide test data. In this evaluation, word pairs sharing at least one common morpheme are sampled. For precision, word pairs are sampled from the system results and checked against the gold standard segmentations. For recall, word pairs are sampled from the gold standard and checked against the system results. For each matching morpheme, 1 point is given. Precision and recall are calculated by normalizing the total obtained scores.
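The sketch below illustrates the precision direction of this pair-based protocol in a simplified, non-sampled form; the official scoring additionally handles morpheme labels, alternative analyses, and sampling, and all names here are illustrative.

```python
def pair_precision(system, gold):
    """system, gold: dicts mapping a word to its set of morphemes.
    For every word pair that shares a morpheme in the system output,
    score a point if that pair also shares a morpheme in the gold
    standard, then normalize (recall swaps the roles of the two)."""
    words = [w for w in system if w in gold]
    points, total = 0, 0
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            if system[w1] & system[w2]:           # pair shares a morpheme in system output
                total += 1
                if gold[w1] & gold[w2]:           # ... and also in the gold standard
                    points += 1
    return points / total if total else 0.0

gold = {"walked": {"walk", "ed"}, "walking": {"walk", "ing"}, "talked": {"talk", "ed"}}
system = {"walked": {"walk", "ed"}, "walking": {"walk", "ing"}, "talked": {"talke", "d"}}
print(pair_precision(system, gold))   # 1.0: the one shared-morpheme pair is also shared in gold
```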

We compare our results with other unsupervised systems from Morpho Challenge 2010 (Kurimo et al. 2010b) for English, German, and Turkish. More specifically, we compare our model with all competing unsupervised systems: Morfessor Baseline (Creutz and Lagus 2002), Morfessor CATMAP (Creutz and Lagus 2005a), Base Inference (Lignos 2010), Iterative Compounding (Lignos 2010), Aggressive Compounding (Lignos 2010), and Nicolas, Farré, and Molinero (2010). Additionally, we compare our system with the Morpho Chain model of Narasimhan, Barzilay, and Jaakkola (2015) by re-training their model on exactly the same training sets as ours. All evaluation was carried out by the Morpho Challenge organizers based on the hidden gold data sets.

English results are given in Table 2. For English, the Base Inference algorithm of Lignos (2010) obtained the highest F-measure among the competing unsupervised systems in Morpho Challenge 2010. Our model ranks fourth among all unsupervised systems with an F-measure of 60.27%.

Table 2: Morpho Challenge 2010 experimental results for English.

System | Precision (%) | Recall (%) | F-measure (%)
Hierarchical Morphological Segmentation | 55.60 | 65.80 | 60.27
Single Tree Prob. Clustering (Can and Manandhar 2012) | 55.60 | 57.58 | 57.33
Base Inference (Lignos 2010) | 80.77 | 53.76 | 64.55
Iterative Compounding (Lignos 2010) | 80.27 | 52.76 | 63.67
Aggressive Compounding (Lignos 2010) | 71.45 | 52.31 | 60.40
Nicolas (Nicolas, Farré, and Molinero 2010) | 67.83 | 53.43 | 59.78
Morfessor Baseline (Creutz and Lagus 2002) | 81.39 | 41.70 | 55.14
Morpho Chain (Narasimhan, Barzilay, and Jaakkola 2015) | 74.87 | 39.01 | 50.42
Morfessor CatMAP (Creutz and Lagus 2005a) | 86.84 | 30.03 | 44.63

German results are given in Table 3. Our model outperforms the other unsupervised models in Morpho Challenge 2010 with an F-measure of 50.71%.

Table 3: Morpho Challenge 2010 experimental results for German.

System | Precision (%) | Recall (%) | F-measure (%)
Hierarchical Morphological Segmentation | 47.92 | 53.84 | 50.71
Single Tree Prob. Clustering (Can and Manandhar 2012) | 57.79 | 32.42 | 41.54
Base Inference (Lignos 2010) | 66.38 | 35.36 | 46.14
Iterative Compounding (Lignos 2010) | 62.13 | 34.70 | 44.53
Aggressive Compounding (Lignos 2010) | 59.41 | 37.21 | 45.76
Morfessor Baseline (Creutz and Lagus 2002) | 82.80 | 19.77 | 31.92
Morfessor CatMAP (Creutz and Lagus 2005a) | 72.70 | 35.43 | 47.64

Turkish results are given in Table 4. Again, our model outperforms the other unsupervised participants, achieving an F-measure of 56.41%.

Table 4: Morpho Challenge 2010 experimental results for Turkish.

System | Precision (%) | Recall (%) | F-measure (%)
Hierarchical Morphological Segmentation | 57.70 | 55.18 | 56.41
Single Tree Prob. Clustering (Can and Manandhar 2012) | 72.36 | 25.81 | 38.04
Base Inference (Lignos 2010) | 72.81 | 16.11 | 26.38
Iterative Compounding (Lignos 2010) | 68.69 | 21.44 | 32.68
Aggressive Compounding (Lignos 2010) | 55.51 | 34.36 | 42.45
Nicolas (Nicolas, Farré, and Molinero 2010) | 79.02 | 19.78 | 31.64
Morfessor Baseline (Creutz and Lagus 2002) | 89.68 | 17.78 | 29.67
Morpho Chain (Narasimhan, Barzilay, and Jaakkola 2015) | 69.25 | 31.51 | 43.32
Morfessor CatMAP (Creutz and Lagus 2005a) | 79.38 | 31.88 | 45.49

Our model also outperforms the model of Can and Manandhar (2012) (which we refer to as Single Tree Probabilistic Clustering), although the results are not directly comparable: due to the training time required, the largest data set we were able to train that model on was 22K words, and the full training set provided by Morpho Challenge was not used in Can and Manandhar (2012). Our current approach is more efficient because the training cost is divided across multiple tree structures, with each tree being shallower than in our previous model.

In order to draw a substantive empirical comparison, we performed another set of experiments by running the current approach on only 22K words, as was done for Single Tree Probabilistic Clustering (Can and Manandhar 2012). Results are given in Tables 5, 6, and 7. As shown in the tables, the current model outperforms the previous model even on the smaller data set.

Table 5: Comparison with Single Tree Probabilistic Clustering for English.

System | Precision (%) | Recall (%) | F-measure (%)
Hierarchical Morphological Segmentation | 67.75 | 53.93 | 60.06
Single Tree Prob. Clustering (Can and Manandhar 2012) | 55.60 | 57.58 | 57.33
Table 6: Comparison with Single Tree Probabilistic Clustering for German.

System | Precision (%) | Recall (%) | F-measure (%)
Hierarchical Morphological Segmentation | 33.93 | 65.31 | 44.66
Single Tree Prob. Clustering (Can and Manandhar 2012) | 57.79 | 32.42 | 41.54
Table 7: Comparison with Single Tree Probabilistic Clustering for Turkish.

System | Precision (%) | Recall (%) | F-measure (%)
Hierarchical Morphological Segmentation | 64.39 | 42.99 | 51.56
Single Tree Prob. Clustering (Can and Manandhar 2012) | 72.36 | 25.81 | 38.04

4.2. Additional Evaluation

For additional experiments, we compare our model with Morpho Chain (Narasimhan, Barzilay, and Jaakkola 2015) based on their evaluation method, which differs from the Morpho Challenge evaluation method. Their evaluation is based on counting correct segmentation points. For example, if the predicted segmentation is booking + s and the gold segmentation is book + ing + s, 1 point is counted. Precision and recall are calculated based on these matching segmentation points. In addition, this evaluation does not use the hidden gold data sets. Instead, the test sets are created by aggregating the test data from Morpho Challenge 2005 and Morpho Challenge 2010 (as reported in Narasimhan, Barzilay, and Jaakkola [2015]), which provide segmentation points.3
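A sketch of this boundary-counting scheme, ignoring the handling of multiple gold analyses discussed below (names are illustrative):

```python
def boundary_points(segmentation):
    """Character offsets of the morpheme boundaries in a segmentation
    given as a list of morphemes, e.g. ['book', 'ing', 's'] -> {4, 7}."""
    points, offset = set(), 0
    for morpheme in segmentation[:-1]:
        offset += len(morpheme)
        points.add(offset)
    return points

def boundary_prf(predicted, gold):
    """Precision, recall, and F1 over matching segmentation points."""
    p, g = boundary_points(predicted), boundary_points(gold)
    correct = len(p & g)
    precision = correct / len(p) if p else 1.0
    recall = correct / len(g) if g else 1.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(boundary_prf(["booking", "s"], ["book", "ing", "s"]))
# (1.0, 0.5, 0.666...): the single predicted boundary is correct, one gold boundary is missed
```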

We used the same trained models as in our Morpho Challenge evaluation. The English test set contains 2,218 words and the Turkish test set contains 2,534 words.4

The English results are given in Table 8 and Turkish results are given in Table 9.

Table 8: Comparison with the Morpho Chain model for English based on the Morpho Chain evaluation.

System | Precision (%) | Recall (%) | F-measure (%)
Hierarchical Morphological Segmentation | 67.41 | 62.5 | 64.86
Morpho Chain | 72.63 | 78.72 | 75.55
Table 9: Comparison with the Morpho Chain model for Turkish based on the Morpho Chain evaluation.

System | Precision (%) | Recall (%) | F-measure (%)
Hierarchical Morphological Segmentation | 89.30 | 48.22 | 62.63
Morpho Chain | 70.49 | 63.27 | 66.66

For all systems, the Morpho Chain evaluation scores are comparatively higher than the Morpho Challenge scores. There are several reasons for this. In the Morpho Challenge evaluation, morpheme labels are considered rather than the surface forms of the morphemes. For example, pantolon + u + yla [with his trousers] and emel + ler + i + yle [with his desires] have in common both a possessive morpheme (u and i), labeled POS, and a relational morpheme (yla and yle), labeled REL. This increases the total number of points computed over all word pairs and therefore lowers the scores.

Second, in the Morpho Chain evaluation, only the gold segmentation that has the maximum match with the predicted segmentation is chosen for each word (e.g., yazımıza has two gold segmentations: yaz + ı + mız + a [to our summer] and yazı + mız + a [to our writing]). In contrast, in the Morpho Challenge evaluation, all segmentations in the gold standard are evaluated. This is another factor that increases the scores in the Morpho Chain evaluation. Thus, the Morpho Chain evaluation favors precision over recall. Indeed, in the Morpho Challenge evaluation, the Morpho Chain system has high precision but suffers from low recall due to undersegmentation (see Tables 2 and 4).

It should be noted that the output of our system is not only the segmentation points but also the hierarchical organization of morphological paradigms, which we believe is novel in this work. However, because measuring the quality of hierarchical paradigms would require a correspondingly hierarchically organized gold data set, we are unable to provide an objective measure of the quality of the learned hierarchical structures. We present different portions of the obtained trees in Appendix B (see Figures B.1 and B.2).5 It can be seen that words sharing the same suffixes are gathered close to each other, such as reestablish + ed, reclassifi + ed, circl + ed, and uncloth + ed. Furthermore, related morphological families gather close to each other, such as impress + ively, impress + ionist, impress + ions, and impress + ion.

In this article, we have introduced a tree-structured Dirichlet process model for hierarchical morphological segmentation. The method differs from existing hierarchical and non-hierarchical methods for learning paradigms: our model learns morphological paradigms that are clustered hierarchically within a forest of trees.

Although our goal in this work is hierarchical learning, our model shows competitive performance against other unsupervised morphological segmentation systems that are designed primarily for segmentation only. The system outperforms the other unsupervised systems in Morpho Challenge 2010 for German and Turkish. It also outperforms the more recent Morpho Chain system (Narasimhan, Barzilay, and Jaakkola 2015) on the Morpho Challenge evaluation for English and Turkish.

The sample paradigms learned suggest that paradigms can be linked to other types of latent information, such as part-of-speech tags. Combining morphology and syntax as a joint learning problem within the same model could be a fruitful direction for future work.

The hierarchical structure is beneficial because it provides both compact and more general paradigms at the same time. In this article, we use the paradigms only for the segmentation task; applications of the learned hierarchy are left as future work.

Let D = {s_1, s_2, …, s_N} denote the input data, where each s_j is a data item. A particular seating arrangement of N customers at tables has the joint probability:
(A.1)  p(s_1, …, s_N | α, P) = [Γ(α) / Γ(α + N)] · α^K · ∏_{k=1}^{K} (n_k − 1)! · ∏_{k=1}^{K} P(s_k*)
where K is the number of tables, n_k is the number of customers at table k, and s_k* is the dish served at table k.
For each customer, either a new table is created or the customer sits at an occupied table. For each table, at least one table creation is performed, which forms the second and the fourth factors in Equation (A.1). Once a table is created, the factors that represent the number of customers sitting at it accumulate, which corresponds to the third factor. All factors involving α are aggregated into the first factor in terms of the Gamma function.
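A small sketch of this computation in log space, following the closed form of Equation (A.1) (names are illustrative):

```python
from math import lgamma, log

def crp_log_joint(table_counts, dish_base_logprobs, alpha):
    """Log joint probability of a CRP seating: table_counts[k] is the
    number of customers at table k and dish_base_logprobs[k] is the log
    base probability of the dish served there (a sketch of Equation (A.1))."""
    n = sum(table_counts)
    k = len(table_counts)
    logp = lgamma(alpha) - lgamma(alpha + n)          # aggregated alpha denominators
    logp += k * log(alpha)                            # one table creation per table
    logp += sum(lgamma(c) for c in table_counts)      # (n_k - 1)! customers joining each table
    logp += sum(dish_base_logprobs)                   # base probability of each dish
    return logp
```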
Figure B.1: Sample hierarchies in English.
Figure B.2: Sample hierarchies in Turkish.

This research was supported by TUBITAK (The Scientific and Technological Research Council of Turkey) grant number 115E464. We thank Karthik Narasimhan for providing their data sets and code. We are grateful to Sami Virpioja for the evaluation of our results on the hidden gold data provided by Morpho Challenge. We thank our reviewers for critical feedback and spotting an error in our previous version of the article. Their comments have immensely helped improve the article.

1. The review version of this paper and our previous work (Can and Manandhar 2012) missed this connection to the product of experts model and did not include the 1/Z term. We thank our reviewers for spotting this error.

2. The source code of the model is accessible at: https://github.com/burcu-can/TreeStructuredDP.

3. In addition to the hidden test data, Morpho Challenge also provides separate publicly available test data.

4. Because of the unavailability of word2vec word embeddings for German, we were unable to perform the Morpho Chain evaluation on this language.

5. Some of the full trees are given at http://web.cs.hacettepe.edu.tr/~burcucan/TreeStructuredDP.

Can, Burcu. 2011. Statistical Models for Unsupervised Learning of Morphology and POS Tagging. Ph.D. thesis, Department of Computer Science, The University of York.

Can, Burcu and Suresh Manandhar. 2010. Clustering morphological paradigms using syntactic categories. In Multilingual Information Access Evaluation I. Text Retrieval Experiments: 10th Workshop of the Cross-Language Evaluation Forum, CLEF 2009, Corfu, Greece, September 30 - October 2, 2009, Revised Selected Papers, pages 641–648, Corfu.

Can, Burcu and Suresh Manandhar. 2012. Probabilistic hierarchical clustering of morphological paradigms. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL '12, pages 654–663, Avignon.

Chan, Erwin. 2006. Learning probabilistic paradigms for morphology in a latent class model. In Proceedings of the Eighth Meeting of the ACL Special Interest Group on Computational Phonology and Morphology, SIGPHON '06, pages 69–78, New York, NY.

Creutz, Mathias and Krista Lagus. 2002. Unsupervised discovery of morphemes. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning - Volume 6, MPL '02, pages 21–30, Philadelphia, PA.

Creutz, Mathias and Krista Lagus. 2005a. Inducing the morphological lexicon of a natural language from unannotated text. In Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR 2005), pages 106–113, Espoo.

Creutz, Mathias and Krista Lagus. 2005b. Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. Technical Report A81, Helsinki University of Technology.

Creutz, Mathias and Krista Lagus. 2007. Unsupervised models for morpheme segmentation and morphology learning. ACM Transactions on Speech and Language Processing, 4(1):3:1–3:34.

Dreyer, Markus and Jason Eisner. 2011. Discovering morphological paradigms from plain text using a Dirichlet process mixture model. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 616–627, Edinburgh.

Goldsmith, John. 2001. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153–198.

Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109.

Heller, Katherine A. and Zoubin Ghahramani. 2005. Bayesian hierarchical clustering. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pages 297–304, Bonn.

Hinton, Geoff. 2002. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800.

Kurimo, Mikko, Sami Virpioja, Ville Turunen, and Krista Lagus. 2010a. Morpho Challenge Competition 2005–2010: Evaluations and results. In Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, SIGMORPHON '10, pages 87–95, Uppsala.

Kurimo, Mikko, Sami Virpioja, Ville T. Turunen, Graeme W. Blackwood, and William Byrne. 2009. Overview and results of Morpho Challenge 2009. In Proceedings of the 10th Cross-Language Evaluation Forum Conference on Multilingual Information Access Evaluation: Text Retrieval Experiments, CLEF '09, pages 578–597, Corfu.

Kurimo, Mikko, Sami Virpioja, Ville T. Turunen, Bruno Gólenia, Sebastian Spiegler, Oliver Ray, Peter Flach, Oskar Kohonen, Laura Leppänen, Krista Lagus, Constantine Lignos, Lionel Nicolas, Jacques Farré, and Miguel Molinero. 2010b. Proceedings of the Morpho Challenge 2010 Workshop. Technical Report TKK-ICS-R37, Aalto University.

Lignos, Constantine. 2010. Learning from unseen data. In Proceedings of the Morpho Challenge 2010 Workshop, pages 35–38, Espoo.

Luo, Jiaming, Karthik Narasimhan, and Regina Barzilay. 2017. Unsupervised learning of morphological forests. Transactions of the Association for Computational Linguistics, 5:353–364.

Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Monson, Christian. 2008. ParaMor: From Paradigm Structure to Natural Language Morphology Induction. Ph.D. thesis, Language Technologies Institute, School of Computer Science, Carnegie Mellon University.

Monson, Christian, Jaime Carbonell, Alon Lavie, and Lori Levin. 2008. ParaMor: Finding paradigms across morphology. In Advances in Multilingual and Multimodal Information Retrieval: Cross-Language Evaluation Forum, CLEF 2007, Lecture Notes in Computer Science, pages 900–907, Springer, Berlin.

Narasimhan, Karthik, Regina Barzilay, and Tommi S. Jaakkola. 2015. An unsupervised method for uncovering morphological chains. Transactions of the Association for Computational Linguistics, 3:157–167.

Nicolas, Lionel, Jacques Farré, and Miguel A. Molinero. 2010. Unsupervised learning of concatenative morphology based on frequency-related form occurrence. In Proceedings of the Morpho Challenge 2010 Workshop, pages 39–43, Espoo.

Poon, Hoifung, Colin Cherry, and Kristina Toutanova. 2009. Unsupervised morphological segmentation with log-linear models. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL '09, pages 209–217, Boulder, CO.

Snover, Matthew G., Gaja E. Jarosz, and Michael R. Brent. 2002. Unsupervised learning of morphology using a novel directed search algorithm: Taking the first step. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, pages 11–20, Philadelphia, PA.

Snyder, Benjamin and Regina Barzilay. 2008. Unsupervised multilingual learning for morphological segmentation. In Proceedings of ACL-08: HLT, pages 737–745, Columbus, OH.

Teh, Y. W. 2010. Dirichlet processes. In Encyclopedia of Machine Learning. Springer.

Teh, Y. W., M. I. Jordan, M. J. Beal, and D. M. Blei. 2006. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.
This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits you to copy and redistribute in any medium or format, for non-commercial use only, provided that the original work is not remixed, transformed, or built upon, and that appropriate credit to the original source is given. For a full description of the license, please visit https://creativecommons.org/licenses/by-nc-nd/4.0/legalcode.