## Abstract

We borrow the concept of representation learning from deep learning research, and we argue that the quest for Greenbergian implicational universals can be reformulated as the learning of good latent representations of languages, or sequences of surface typological features. By projecting languages into latent representations and performing inference in the latent space, we can handle complex dependencies among features in an implicit manner. The most challenging problem in turning the idea into a concrete computational model is the alarmingly large number of missing values in existing typological databases. To address this problem, we keep the number of model parameters relatively small to avoid overfitting, adopt the Bayesian learning framework for its robustness, and exploit phylogenetically and/or spatially related languages as additional clues. Experiments show that the proposed model recovers missing values more accurately than others and that some latent variables exhibit phylogenetic and spatial signals comparable to those of surface features.

## 1. Introduction

### 1.1 Representation Learning for Linguistic Typology

Beginning with the pioneering research by Greenberg (1963), linguists have taken quantitative approaches to linguistic typology. To propose a dozen cross-linguistic generalizations, called linguistic universals (an example is that languages with dominant VSO [verb–subject–object] order are always prepositional), Greenberg investigated a sample of 30 languages from around the world to correct for phylogenetic and areal effects. Linguistic universals, including those formulated in absolute terms by Greenberg, are rarely exceptionless (Dryer 1998), and therefore they are called statistical universals, as opposed to absolute universals.

While a great amount of effort has been invested into theory construction and careful analysis of field data, typologists have relied on elementary statistical concepts, such as frequency, mode, and deviation from expectation (Nichols 1992; Cysouw 2003). The limitations of superficial statistical analysis become evident especially when one seeks *diachronic* explanations of cross-linguistic variation. Although Greenberg (1978) proposed probabilistic models of language change over time, powerful statistical tools for making inferences were not available. To do so, we need a model with predictive power or the ability to draw from the present distribution generalizations that are applicable to the past. To infer the states of languages in the past, or complex latent structures in general, we need a robust statistical framework, powerful inference algorithms, and large computational resources. A statistical package that meets all of these requirements was not known to typologists in Greenberg’s day.

Today, the research community of computational linguistics knows a solution: Bayesian models armed with computationally intensive Markov chain Monte Carlo inference algorithms. Indeed, Bayesian models have been used extensively to uncover complex latent structures behind natural language text in the last decade or so (Goldwater and Griffiths 2007; Griffiths, Steyvers, and Tenenbaum 2007; Goldwater, Griffiths, and Johnson 2009).

Given this, it is somewhat surprising that with some notable exceptions (Daumé III and Campbell 2007; Daumé III 2009), the application of Bayesian statistics to typological problems has been done largely outside of the computational linguistics community, although computational linguists have recognized the usefulness of typological information for multilingual text processing (Bender 2016; O’Horan et al. 2016). In fact, it is evolutionary biology that has offered solutions to typological questions (Dediu 2010; Greenhill et al. 2010; Dunn et al. 2011; Maurits and Griffiths 2014; Greenhill et al. 2017).

In this article, we demonstrate that **representation learning**, a concept that computational linguists have become familiar with over the past decade, is useful for the study of linguistics typology (Bengio, Courville, and Vincent 2013). Although it has been applied to genomics (Asgari and Mofrad 2015; Tan et al. 2016), to our knowledge, representation learning has not been used in the context of evolution or applied to language data.

The goal of representation learning is to learn useful latent representations of the data (Bengio, Courville, and Vincent 2013). We assume that latent representations exist behind surface representations, and we seek to let a model connect the two types of representations.

To provide intuition, we consider handwritten digits represented by grayscale 28 × 28 images (LeCun et al. 1998). Each pixel takes one of 256 values. This means that there are 256^{28×28} possible images. However, only a tiny portion of them look like *natural* digits. Such data points must be smoothly connected because natural digits usually continue to look natural even if small modifications are added to them (for example, slightly rotating the images). These observations lead us to the manifold hypothesis: The data reside on low-dimensional manifolds embedded in a high-dimensional space, and thus must be able to be represented by a relatively small number of latent variables. Moreover, good latent representations must disentangle underlying abstract factors, as illustrated in Figure 1. One latent variable represents, say, the angle, while another one smoothly controls the width of digits (Chen et al. 2016). As these factors demonstrate, modification of one latent variable affects multiple pixels at once.

Our key idea is that the same argument applies to typological data, although typological features are much more informative than the pixels in an image. Combining typological features together, we can map a given language to a point in high-dimensional space. What Greenbergian universals indicate is that *natural* languages are not evenly distributed in the space. Most Greenbergian universals are implicational, that is, given in the form of “if *x* holds, then *y* also holds.” In other words, the combination of (*x*, ¬*y*) is non-existent (absolute universals) or rare (statistical universals). In addition, languages have evolved gradually, and many of them are known to share common ancestors. If we accept the uniformitarian hypothesis, that is, the assumption that universals discovered in modern languages should also apply to past languages (Croft 2002), a smooth line of *natural* languages must be drawn between a modern language and its ancestor, and by extension, between any pair of phylogenetically related languages, as illustrated in Figure 2. Thus, we expect languages to lie on a smooth lower-dimensional manifold. Just like the angle of a digit, a latent variable must control multiple surface features at once.

### 1.2 Diachrony-Aware Bayesian Representation Learning

The question, then, is how to turn this idea into a concrete computational model. The concept of representation learning was popularized in the context of deep learning, with deep autoencoders being typical models (Hinton and Salakhutdinov 2006). Accordingly, we previously proposed an autoencoder-based neural network model for typological data (Murawaki 2015). With follow-up experiments, however, we later found that the model suffered from serious problems.

The first problem is overfitting. Neural network methods are known to be data-hungry. They can approximate a wide variety of functions by simply combining general-purpose components (Hornik 1991), but the flexibility is obtained at the cost of requiring very large amounts of data. The database available to us is a matrix where languages are represented as rows and features as columns (Haspelmath et al. 2005). The number of languages is on the order of 1,000 and the number of features is on the order of 100. This precious database summarizes decades of work by various typologists, but from the viewpoint of machine learning, it is not very large.

More importantly, the typological database is characterized by an alarmingly large number of missing values. Depending on how we perform preprocessing, only 20% to 30% of the items are present in the language–feature matrix. The situation is unlikely to change in the foreseeable future for a couple of reasons. Of the thousands of languages in the world, there is ample documentation for only a handful. Even if grammatical sketches are provided by field linguists, it is not always easy for non-experts to determine the appropriate value for a given feature because typological features are highly theory-dependent. If one manages to provide some missing values, they are also likely to add previously uncovered languages to the database, with few features present. The long tail remains long. Thus there seems to be no way of escaping the problem of missing values.

Here, we combine several methods to cope with this problem. All but one can be collectively referred to as Bayesian learning. We replace the general-purpose neural network with a carefully crafted generative model that has a smaller number of model parameters. We apply prior distributions to the model parameters to penalize extreme values. In inference, we do not rely on a single point estimate of model parameters but draw multiple samples from the posterior distribution to account for uncertainty.^{1}

The last part of the proposed method can be derived when we re-interpret implicational universals in terms of the language–feature matrix: They focus on dependencies between columns and thus can be referred to as inter-feature dependencies. We can also exploit dependencies between rows, or inter-language dependencies. It is well known that the values of a typological feature do not distribute randomly in the world but reflect vertical (phylogenetic) transmissions from parents to children and horizontal (spatial or areal) transmissions between populations (Nichols 1992). For example, languages of mainland Southeast Asia, such as Hmong, Thai, and Vietnamese, are known for having similar tonal systems even though they belong to different language families (Enfield 2005). For this reason, combining inter-language dependencies with inter-feature dependencies is a promising solution to the problem of missing values. Whereas inter-feature dependencies are synchronic in nature, inter-language dependencies reflect diachrony, at least in an indirect manner. Thus, we call the combined approach **diachrony-aware learning**.

As a building block, we use a Bayesian autologistic model that takes both the vertical and horizontal factors into consideration (Murawaki and Yamauchi 2018). Just like the familiar logistic model, the autologistic model assumes that a dependent random variable (a language in our case) depends probabilistically on explanatory variables. The difference is that explanatory variables themselves are languages that are to be stochastically explained by other languages. The motivation behind this is that languages that are related either vertically or horizontally must be predictive of the language in question. Thus, languages are dependent on each other and form a **neighbor graph** in which every pair of interdependent languages is connected. A major advantage of the autologistic model over the standard tree model (Gray and Atkinson 2003; Bouckaert et al. 2012) is the ability to integrate the vertical and horizontal factors into a single model by simply using two neighbor graphs (Towner et al. 2012).

To do so, we make use of two additional resources: (1) a phylogenetic neighbor graph of languages that can be generated by connecting every pair of languages in each language family and (2) a spatial neighbor graph that connects languages within a specified distance.

A problem with the combined approach is that the model for inter-language dependencies, in its original form, cannot be integrated into the model for inter-feature dependencies. They both explain how the surface language–feature matrix is generated, even though only one generative story can exist. To resolve the conflict, we incorporate the autologistic model at the level of latent representations, rather than surface features, with the reasonable assumption that phylogenetically and/or spatially close languages tend to share the same latent variables in addition to the same surface features. In the end, the integrated Bayesian generative model first generates the latent representations of languages using inter-language dependencies, and then generates the surface representations of languages using inter-feature dependencies, as summarized in Figure 3.

Experiments show that the proposed Bayesian model recovers missing values considerably more accurately than other models. In addition, the integrated model consistently outperforms baseline models that exploit only one of the two types of dependencies, demonstrating the complementary nature of inter-feature and inter-language dependencies.

Since autologistic models require variables to be discrete, we inevitably adopt *binary* latent representations. We call our latent variables **linguistic parameters** for their superficial resemblance to parameters in the principles-and-parameters framework of generative grammar (Chomsky and Lasnik 1993) and for other reasons. A side effect of the discreteness constraint is good interpretability of linguistic parameters, in comparison with that of the continuous representations of Murawaki (2015). To demonstrate this, we project linguistic parameters on a world map and show that at least some of them exhibit phylogenetic and spatial signals comparable to those of surface features. Also, because both the surface and latent representations are discrete, linguistic parameters can readily be used as a substitute for surface features and therefore have a wide range of potential applications, including tree-based phylogenetic inference (Gray and Atkinson 2003; Bouckaert et al. 2012; Chang et al. 2015).

## 2. Background

### 2.1 Greenbergian Universals

Since Greenberg (1963), various interdependencies among typological features have been observed across the world’s languages. For example, if a language takes a verb before an object (VO), then it takes postnominal relative clauses (NRel) (VO → NRel, in shorthand), and a related universal, RelN → OV, also holds (Dryer 2011). Such cross-linguistic generalizations are specifically called **Greenbergian universals**, as opposed to Chomskyan universals, which we discuss in the next section. A Bayesian model for discovering Greenbergian universals was presented by Daumé III and Campbell (2007).

Greenbergian universals indicate that certain combinations of features are unnatural. Moreover, Greenberg (1978) discussed how to extend the synchronic observations to diachronic reasoning: Under the uniformitarian hypothesis, languages must have changed in a way such that they avoid unnatural combinations of features.

Despite these highly influential observations, most computational models of typological data assume independence between features (Daumé III 2009; Dediu 2010; Greenhill et al. 2010; Murawaki 2016; Greenhill et al. 2017). These methods are at risk for reconstructing typologically unnatural languages. A rare exception is Dunn et al. (2011), who extended Greenberg’s idea by applying a phylogenetic model of correlated evolution (Pagel and Meade 2006).

Both Greenberg (1963, 1978) and Dunn et al. (2011) focused on *pairs* of features. However, the dependencies between features are not limited to feature pairs (Tsunoda, Ueda, and Itoh 1995; Itoh and Ueda 2004). The order of relative clauses, just mentioned above, has connections to the order of adjective and noun (AdjN or NAdj), in addition to the order of object and verb, as two universals, RelN → AdjN and NAdj → NRel, are known to hold well (Dryer 2011).

Limiting the scope of research to feature pairs is understandable given that a combination of three or more features is often beyond human comprehension. Even for computers, extending a model of feature pairs (Dunn et al. 2011) to multiple features is hampered by computational intractability due to combinatorial explosion. What we propose here is a computationally tractable way to handle multiple inter-feature dependencies. We map interdependent variables to latent variables that are independent from each other by assumption. If we perform inference in the latent space (for example, reconstructing ancestral languages from their descendants) and then project the data back to the original space, we can handle inter-feature dependencies in an implicit manner.

### 2.2 Chomskyan Universals

Thanks to its tradition of providing concrete and symbolic representations to latent structures of languages, generative grammar has constantly given inspiration to computational linguists. At the same time, however, it is a source of frustration because, with Optimality Theory (Prince and Smolensky 2008) being a notable exception, it rarely explores how disambiguation is performed. Linguistic typology is no exception. Symbolic latent representations behind surface patterns are proposed (Baker 2001), but they have trouble explaining disharmonic languages around the world (Boeckx 2014).

A Chomskyan explanation for typological variation is (macro)parameters, which are part of the **principles and parameters** (P&P) framework (Chomsky and Lasnik 1993). In this framework, the structure of a language is explained by (1) a set of universal principles that are common to all languages and (2) a set of parameters whose values vary among languages. Here we skip the former because our focus is on structural variability. According to P&P, if we give specific values to all the parameters, then we obtain a specific language. Each parameter is binary and, in general, sets the values of multiple surface features in a deterministic manner. For example, the head directionality parameter is either head-initial or head-final. If head-initial is chosen, then surface features are set to VO, NAdj, and Prepositions; otherwise, the language in question becomes OV, AdjN, and Postpositions (Baker 2001). Baker (2001) discussed a number of parameters, such as the head directionality, polysynthesis, and topic-prominent parameters.

Our design decision to use *binary* latent representations is partly inspired by the parameters of generative grammar. We also borrow the term **parameter** from this research field, due to a conflict in terminology. **Features** usually refer to latent representations in the machine learning community (Griffiths and Ghahramani 2011; Bengio, Courville, and Vincent 2013). Unfortunately, the word *feature* is reserved for surface variables in the present study, and we need another name for latent variables. We admit the term *parameter* is confusing because, in the context of machine learning, it refers to a variable tied to the model itself, rather than its input or output. For clarity, we refer to binary latent representations as **linguistic parameters** throughout this article. A parameter of the model is referred to as a **model parameter**.

It should be noted that we do not intend to present the proposed method as a computational procedure to induce P&P parameters. Although our binary latent representations are partly inspired by P&P, their differences cannot be ignored. There are at least five differences between the P&P framework and the proposed model. First, whereas the primary focus of generative linguists is put on morphosyntactic characteristics of languages, the data sets we used in the experiments are not limited to them.

Second, Baker (2001) presented a hierarchical organization of parameters (see Figure 6.4 of Baker [2001]). However, we assume independence between linguistic parameters. Introducing a hierarchical structure to linguistic parameters is an interesting direction to explore, but we leave it for future work.

Third, whereas P&P hypothesizes deterministic generation, the proposed model *stochastically* generates a language’s features from its linguistic parameters. This choice appears to be inevitable because obtaining exceptionless relations from real data is virtually impossible.

Fourth, a P&P parameter typically controls a very small set of features. In contrast, if our linguistic parameter is turned on, it more or less modifies all the feature generation probabilities. However, we can expect a small number of linguistic parameters to dominate the probabilities because weights are drawn from a heavy-tailed distribution, as we describe in Section 4.1.

Lastly, our linguistic parameters are asymmetric in the sense that they do not operate at all if they are off. Although the marked–unmarked relation comes about as a natural consequence of incorporating binary variables into a computational model, this is not necessarily the case with P&P. For example, the head directionality parameter has two values, head-initial and head-final, and it is not clear which one is marked and which one is unmarked. In the Bayesian analysis of mixture models, mixture components are known to be unidentifiable because the posterior distribution is invariant to permutations in the labels (Jasra, Holmes, and Stephens 2005). In our model, the on and off of a linguistic parameter is not exactly swappable, but it is likely that a point in the search space where head-initial is treated as the marked form is separated by deep valleys from another point where head-final is treated as the marked form. If this is the case, a Gibbs sampler generally cannot cross the valleys. We discuss this point again in Section 7.3.

### 2.3 Functional Explanations

Needless to say, Greenbergian typologists themselves have provided explanations for cross-linguistic patterns although, unlike generative linguists, they generally avoid using *metaphysical* representations. Such explanations can be collectively referred to as functional explanations (Haspelmath 2008b).

One major type of functional explanation is synchronic in nature and often is a matter of economy. For example, several languages exhibit an adnominal alienability split, that is, the use of different possessive constructions for **inalienable** nouns (e.g., my arm) and **alienable** nouns (e.g., your car). An implicational universal for the split is:

If a language has an adnominal alienability split, and one of the constructions is overtly coded while the other one is zero-coded, it is always the inalienable construction that is zero-coded, while the alienable construction is overtly coded. (Haspelmath 2008a)

Another type is diachronic explanations. According to this view, at least some patterns observed in surface features arise from common paths of diachronic development (Anderson 2016). An important factor of diachronic development is grammaticalization, by which content words change into function words (Heine and Kuteva 2007). For example, the correlation between the order of adposition and noun and the order of genitive and noun might be explained by the fact that adpositions are often derived from nouns.

Regardless of whether they are synchronic or diachronic, functional explanations imply that unattested languages may simply be improbable but not impossible (Haspelmath 2008b). Because of the stochastic nature of the proposed model, we are more closely aligned with functionalists than with generative linguists. Note that the proposed model only describes patterns found in the data. It does not explain the underlying cause-and-effect mechanisms, although we hope that it can help linguists explore them.

### 2.4 Vertical and Horizontal Transmission

The standard model for phylogenetic inference is the tree model, where a trait is passed on from parent to child with occasional modifications. In fact, the recent success in the applications of statistical models to historical linguistic problems is largely attributed to the tree model (Gray and Atkinson 2003; Bouckaert et al. 2012), although the applications are subject to frequent criticism (Chang et al. 2015; Pereltsvaig and Lewis 2015). In linguistic typology, however, a non-tree-like mode of evolution has emerged as one of the central topics (Trubetzkoy 1928; Campbell 2006). Typological features, like loanwords, can be borrowed by one language from another, and as a result, vertical (phylogenetic) signals are obscured by horizontal (spatial) transmission.

The task of incorporating both vertical and horizontal transmissions within a statistical model of language evolution is notoriously challenging because of the excessive flexibility of horizontal transmissions. This is the reason why previously proposed models are coupled with some very strong assumptions—for example, that a reference tree is given a priori (Nelson-Sathi et al. 2011) and that horizontal transmissions can be modeled through time-invariant areal clusters (Daumé III 2009).

Consequently, we pursue a line of research in linguistic typology that draws on information on the current distribution of typological features without explicitly requiring the reconstruction of previous states (Nichols 1992, 1995; Parkvall 2008; Wichmann and Holman 2009). The basic assumption is that if the feature in question is vertically stable, then a phylogenetically defined group of languages will tend to share the same value. Similarly, if the feature in question is horizontally diffusible, then spatially close languages would be expected to frequently share the same feature value. Because the current distribution of typological features is more or less affected by these factors, the model needs to take both vertical and horizontal factors into account.

Murawaki and Yamauchi (2018) adopted a variant of the autologistic model, which had been widely used to model the spatial distribution of a feature (Besag 1974; Towner et al. 2012). The model was also used to impute missing values because the phylogenetic and spatial neighbors of a language had some predictive power over its feature values. Our assumption in this study is that the same predictive power applies to latent representations.

## 3. Data and Preprocessing

### 3.1 Input Specifications

The proposed model requires three types of data as the input: (1) a language–feature matrix, (2) a phylogenetic neighbor graph, and (3) a spatial neighbor graph. Table 1 lists the major symbols used in this article.

L | (l) | # of languages |

K | (k) | # of linguistic parameters (given a priori) |

M | (m) | # of model parameters of W, $\Theta ~$, and Θ |

N | (n) | # of surface discrete linguistic features |

F_{n} | # of distinct values for categorical feature n | |

A = {(v_{k}, h_{k}, u_{k})|k ∈ {1, ⋯, K}} | Model parameters for the autologistic models | |

Z ∈ {0, 1}^{L×K} | Binary latent parameter matrix | |

W ∈ ℝ^{K×M} | Weight matrix | |

$\Theta ~$ ∈ ℝ^{L×M} | Unnormalized model parameter matrix | |

Θ ∈ (0, 1)^{L×M} | Normalized model parameter matrix | |

X ∈ ℕ^{L×N} | Surface discrete linguistic feature matrix |

L | (l) | # of languages |

K | (k) | # of linguistic parameters (given a priori) |

M | (m) | # of model parameters of W, $\Theta ~$, and Θ |

N | (n) | # of surface discrete linguistic features |

F_{n} | # of distinct values for categorical feature n | |

A = {(v_{k}, h_{k}, u_{k})|k ∈ {1, ⋯, K}} | Model parameters for the autologistic models | |

Z ∈ {0, 1}^{L×K} | Binary latent parameter matrix | |

W ∈ ℝ^{K×M} | Weight matrix | |

$\Theta ~$ ∈ ℝ^{L×M} | Unnormalized model parameter matrix | |

Θ ∈ (0, 1)^{L×M} | Normalized model parameter matrix | |

X ∈ ℕ^{L×N} | Surface discrete linguistic feature matrix |

Let *L* and *N* be the numbers of languages and surface features, respectively. The language–feature matrix *X* ∈ ℕ^{L×N} contains discrete items. A substantial portion of the items may be missing. *x*_{l,n} denotes the value of feature *n* for language *l*. Features can be classified into three types: (1) binary (*x*_{l,n} ∈ {0, 1}), (2) categorical (*x*_{l,n} ∈ {1, 2, ⋯, *F*_{n}}, where *F*_{n} is the number of distinct values), and (3) count (*x*_{l,n} ∈ {0, 1, 2, ⋯}).

A neighbor graph is an undirected graph in which each node represents a language. The graph connects every pair of languages that are related in some way and thus are likely to be similar to some degree. A phylogenetic neighbor graph connects phylogenetically related languages, while a spatial neighbor graph connects spatially close pairs of languages.

### 3.2 Preprocessing

Although any database that meets the requirements described in Section 3.1 can be used, we specifically tested two typological databases in the present study: (1) the online edition^{2} of the *World Atlas of Language Structures* (WALS) (Haspelmath et al. 2005) and (2) Autotyp 0.1.0 (Bickel et al. 2017).

WALS (Haspelmath et al. 2005) is a large database of typological features compiled by dozens of typologists. Since its online version was released in 2008, it has been occasionally used by the computational linguistics community (Naseem, Barzilay, and Globerson 2012; O’Horan et al. 2016; Bender 2016). WALS covers a wide range of linguistic domains (called “areas” in WALS), such as phonology, morphology, nominal categories, and word order. All features are categorically coded. For example, Feature 81A, “Order of Subject, Object and Verb” has seven possible values: SOV, SVO, VSO, VOS, OVS, OSV, and No dominant order, and each language is assigned one of these seven values.

We downloaded a CSV file that contained metadata and feature values for each language. Sign languages were dropped because they were too different from spoken languages. Pidgins and creoles were also removed from the matrix because they belonged to the dummy language family “other.” We imputed some missing values that could trivially be inferred from other features. Feature 144D, “The Position of Negative Morphemes in SVO Languages” is an example. Because it is only applicable to SVO languages, languages for which the value of Feature 81A is SOV are given the special value Undefined. We then removed features that covered fewer than 150 languages. We manually classified features into binary and categorical features (no count features were present in WALS) and replaced text-format feature values with numerical ones.

WALS provides two-level phylogenetic groupings: **family** (upper) and **genus** (lower). For example, English belongs to the Indo-European family and to its subgroup (genus), Germanic. Genera are designed to be roughly comparable taxonomic groups so that they facilitate cross-linguistic comparison (Dryer 1989). Following Murawaki and Yamauchi (2018), we constructed a phylogenetic neighbor graph by connecting every pair of languages within each genus.

WALS associates each language with single-point geographical coordinates (longitude and latitude). Following Murawaki and Yamauchi (2018), we constructed a spatial neighbor graph by linking all language pairs that were located within a distance of *R* = 1,000 km.

Autotyp (Bickel et al. 2017) is a smaller database, and it appears to be more coherent because a smaller number of typologists led its construction. It is a mixture of raw data and automatically aggregated data and covers finer-grained domains (called “modules” in Autotyp) such as alignment per language, locus per language, and valence per language. In Autotyp, domains are classified into three types: (1) single entry per language, (2) single aggregated entry per language, and (3) multiple entries per language. We only used the first two types of domains, in which a language is given a single value per feature. As the name suggests, features belonging to the last type of domains have multiple values in general, and the number of distinct combined values can reach the order of 100.

We downloaded the data set from the GitHub repository.^{3} In addition to dropping sign languages, and creole and mixed languages, which were all marked as such in the metadata, we manually removed ancient languages. Languages without phylogenetic information or geographical coordinate points were also removed. We manually classified features into binary, categorical, and count features^{4} and assigned numerical codes to the values of the binary and categorical features. We then removed features that covered fewer than 50 languages.

The phylogenetic and spatial neighbor graphs for Autotyp data were constructed as were done for WALS. One difference was that Autotyp did not have a single phylogenetic level comparable to WALS’s genera. It instead provides six non-mandatory metadata fields: majorbranch, stock, subbranch, subsubbranch, lowestsubbranch, and quasistock. We attempted to create genus-like groups by combining these fields, but the results were far from perfect and are subject to future changes.

Table 2 summarizes the results of preprocessing (*K* and *M* are introduced later). We can see that although we removed low-coverage features, only about 20% of items were present in the language–feature matrices *X*. Figure 4 visualizes *X*. It is evident that we were dealing with a type of missing values called **missing not at random**. Data gaps are not random because both languages and features exhibit power-law behavior, that is, a small number of high-coverage languages (features) are contrasted with heavy tails of low-coverage languages (features). What is worse, the lack of one feature is predictive of the lack of some others because typologists have coded multiple related features at once.

. | . | WALS . | Autotyp . |
---|---|---|---|

L | (# of languages) | 2,607 | 1,063 |

K | (# of linguistic parameters) | 50 or 100 | |

M | (# of model parameters of W, $\Theta ~$, and Θ) | 760 | 958 |

N | (# of discrete features) | 152 | 372 |

# of binary features | 14 | 229 | |

# of categorical features | 138 | 118 | |

# of count features | 0 | 25 | |

Proportion of items present in the language–feature matrix (%) | 19.98 | 21.54 | |

# of phylogenetic neighbors on average | 30.77 | 7.43 | |

# of spatial neighbors on average | 89.10 | 38.53 |

. | . | WALS . | Autotyp . |
---|---|---|---|

L | (# of languages) | 2,607 | 1,063 |

K | (# of linguistic parameters) | 50 or 100 | |

M | (# of model parameters of W, $\Theta ~$, and Θ) | 760 | 958 |

N | (# of discrete features) | 152 | 372 |

# of binary features | 14 | 229 | |

# of categorical features | 138 | 118 | |

# of count features | 0 | 25 | |

Proportion of items present in the language–feature matrix (%) | 19.98 | 21.54 | |

# of phylogenetic neighbors on average | 30.77 | 7.43 | |

# of spatial neighbors on average | 89.10 | 38.53 |

## 4. Bayesian Generative Model

Our goal is to induce a binary latent parameter matrix *Z* ∈ {0, 1}^{L×K} from the observed portion of the language–feature matrix *X*. *K* is the number of linguistic parameters and is to be specified a priori. To obtain *Z*, we first define a probabilistic generative model that describes how *Z* is stochastically generated and how *X* is generated from *Z*. After that, we devise a method to infer *Z*, as we explain in Section 5. For now, we do not need to care about missing values.

*A*is a set of model parameters that control the generation of

*Z*, whereas

*W*is a weight matrix that connects

*Z*and

*X*.

For ease of description, we trace the generative story *backward* from *X*. Section 4.1 describes the second part, which is followed by Section 4.2 for the first part.

### 4.1 Inter-Feature Dependencies

In this section, we describe how the surface feature representations are generated from the binary latent representations. Figure 5 illustrates the process. We use matrix factorization (Srebro, Rennie, and Jaakkola 2005; Griffiths and Ghahramani 2011) to capture inter-feature dependencies. Because the discrete feature matrix *X* cannot directly be decomposed into two matrices, we instead decompose a closely related, unnormalized model parameter matrix $\Theta ~$. It is $\Theta ~$ that directly controls the stochastic generation of *X*.

*x*

_{l,n}can take a binary, categorical, or count value. As usual, we assume that a binary feature is drawn from a Bernoulli distribution:

_{l,f(n,1)}∈ (0, 1) is the corresponding model parameter. Because, as we discuss subsequently, one feature can correspond to more than one linguistic parameter, feature

*n*is mapped to the corresponding model parameter index by the function

*f*(

*n*,

*i*) ∈ {1, ⋯,

*m*, ⋯,

*M*}. A binary or count feature has one model parameter whereas a categorical feature with

*F*

_{n}distinct values has

*F*

_{n}model parameters.

*M*is the total number of these model parameters. θ

_{l,m}is an item of the normalized model parameter matrix Θ ∈ (0, 1)

^{L×M}.

_{l,f(n,i)}∈ (0, 1) and $\u2211i=1Fn$ θ

_{l,f(n,i)}= 1. As you can see, the categorical feature

*n*has

*F*

_{n}model parameters.

_{l,f(n,1)}> 0. This distribution has mean and variance θ

_{l,f(n,1)}.

^{5}

^{L×M}. For binary features, we use the sigmoid function:

*Z*and the weight matrix

*W*. The generation of

*Z*is described in Section 4.2. Each item of $\Theta ~$, $\theta ~l,m$, is language

*l*’s

*m*-th unnormalized model parameter. It is affected only by linguistic parameters with

*z*

_{l,k}= 1 because

*z*

_{l,k}= 0, the linguistic parameter

*k*has no effect on θ

_{l,f(n,i)}because exp(

*z*

_{l,k}

*w*

_{k,f(n,i)}) = 1. Otherwise, if

*w*

_{k,f(n,i)}> 0, it makes θ

_{l,f(n,i)}larger, and if

*w*

_{k,f(n,i)}< 0, it lowers θ

_{l,f(n,i)}. Suppose that for the linguistic parameter

*k*, a certain group of languages takes

*z*

_{l,k}= 1. If two categorical feature values (

*n*

_{1},

*i*

_{1}) and (

*n*

_{2},

*i*

_{2}) have large positive weights (

*w*

_{k,f(n1,i1)}> 0 and

*w*

_{k,f(n2,i2)}> 0), the pair must often co-occur in these languages. Likewise, the fact that two feature values do not co-occur can be encoded as a positive weight for one value and a negative weight for the other.

Binary and count features are more straightforward because both the sigmoid and softplus functions take a single argument and increase monotonically. For a binary feature, if *z*_{l,k} = 1 and *w*_{k,f(n,i)} > 0, then θ_{l,f(n,i)} approaches 1. *w*_{k,f(n,i)} < 0 makes θ_{l,f(n,i)} closer to 0. The best fit for the count data is obtained when the value equals the mode of the Poisson distribution, which is close to θ_{l,f(n,i)}.

Each item of *W*, *w*_{k,m}, is generated from a Student *t*-distribution with 1 degree of freedom. We choose this distribution for two reasons. First, it has heavier tails than the Gaussian distribution and allows some weights to fall far from 0. Second, our inference algorithm demands that the negative logarithm of the probability density function be differentiable, as explained in Section 5.2. The *t*-distribution satisfies the condition whereas the Laplace distribution does not.^{6}

### 4.2 Inter-Language Dependencies

The autologistic model (Murawaki and Yamauchi 2018) for the linguistic parameter *k* generates a column of *Z*, z_{*,k} = (*z*_{1,k}, ⋯, *z*_{L,k}). To construct the model, we use two neighbor graphs and the corresponding three counting functions, as illustrated in Figure 6. *V*(z_{*,k}) returns the number of pairs sharing the same value in the phylogenetic neighbor graph, and *H*(z_{*,k}) is the spatial equivalent of *V*(z_{*,k}). *U*(z_{*,k}) gives the number of languages that take the value 1.

*v*

_{k}> 0, horizontal diffusibility

*h*

_{k}> 0, and universality

*u*

_{k}∈ (−∞, ∞) for each linguistic parameter

*k*. The probability of z

_{*.k}conditioned on

*v*

_{k},

*h*

_{k}, and

*u*

_{k}is given as

The autologistic model can be interpreted in terms of the competition associated with the 2^{L} possible assignments of z_{*,k} for the probability mass 1. If a given value, z_{*,k}, has a relatively large *V*(z_{*,k}), then setting a large value for *v*_{k} enables it to appropriate fractions of the mass from its weaker rivals. However, if too large a value is set for *v*_{k}, then it will be overwhelmed by its stronger rivals.

*l*taking value

*b*∈ {0, 1}, conditioned on the rest of the languages, z

_{−l,k}= (

*z*

_{1,k}, ⋯,

*z*

_{l−1,k},

*z*

_{l+1,k}, ⋯,

*z*

_{L,k}):

*V*

_{l,k,b}is the number of language

*l*’s phylogenetic neighbors that assume the value

*b*, and

*H*

_{l,k,b}is its spatial counterpart.

*P*(

*z*

_{l,k}=

*b*∣ z

_{−l,k},

*v*

_{k},

*h*

_{k},

*u*

_{k}) is expressed by the weighted linear combination of the three factors in the log-space. It will increase with a rise in the number of phylogenetic neighbors that assume the value

*b*. However, this probability depends not only on the phylogenetic neighbors of language

*l*, but it also depends on its spatial neighbors and on universality. How strongly these factors affect the stochastic selection is controlled by

*v*

_{k},

*h*

_{k}, and

*u*

_{k}.

*Z*has

*K*columns. Accordingly, we have

*K*autologistic models:

*A*can be decomposed in a similar manner:

*v*

_{k}∼ Gamma(κ, θ),

*h*

_{k}∼ Gamma(κ, θ), and

*u*

_{k}∼ 𝒩(0, σ

^{2}). They complete the generative story. In the experiments, we set shape κ = 1, scale θ = 1, and standard deviation σ = 10. These priors are not non-informative, but they are sufficiently gentle in the regions where these model parameters typically reside.

An extension of the model is to set z_{*,K} = (1, ⋯, 1) for the last linguistic parameter *K*. Consequently, the autologistic model is dropped from the linguistic parameter *K* (while *K* − 1 autologistic models remain). With this modification, the weight vector w_{K,*} = (*w*_{K,1}, ⋯, *w*_{K,M}) is activated for all languages and serves as a bias term. We used this version of the model in the experiments.

*v*

_{k}=

*h*

_{k}= 0, Equation (11) is reduced to

*z*

_{l,k}no longer depends on z

_{−l,k}and is a simple Bernoulli trial with probability exp(

*u*

_{k})/(1 + exp(

*u*

_{k})).

^{7}

## 5. Posterior Inference

*Z*together with other latent variables. With a slight abuse of notation, let

*X*be disjointly decomposed into the observed portion

*X*

^{obs}and the remaining missing portion

*X*

^{mis}. Formally, the posterior probability is given as

As usual, we use Gibbs sampling to draw samples from the posterior distribution. Given observed values *x*_{l,n}, we iteratively update *z*_{l,k}, *v*_{k}, *h*_{k}, *u*_{k}, and w_{k,*} as well as missing values *x*_{l,n}.

*Update missing**x*_{l,n}.

*x*_{l,n} is sampled from Equations (2), (3), or (4), depending on the type of feature *n*.

*Update**z*_{l,k}*and*$xl,*mis$.

*z*

_{l,k}and $xl,*mis$, the missing portion of x

_{l,*}= (

*x*

_{l,1}, ⋯,

*x*

_{l,N}). We find that updating $xl,*mis$ drastically improves the mobility of

*z*

_{l,k}. The proposal distribution first toggles the current

*z*

_{l,k}to obtain the proposal $zl,k\u2032$ (1 if

*z*

_{l,k}= 0; 0 otherwise). As the corresponding w

_{k,*}= (

*w*

_{k,1}, ⋯,

*w*

_{k,M}) gets activated or inactivated, $\theta l,*\u2032$ = ($\theta l,1\u2032$, ⋯, $\theta l,M\u2032$) is also updated accordingly. We perform a Gibbs sampling scan on $xl,*mis$: Every missing

*x*

_{l,n}is sampled from the corresponding distribution with the proposal model parameter(s). The proposal is accepted with probability

*P*(

*z*

_{l,k}, x

_{l,*}∣ −) is the probability of generating the current state (

*z*

_{l,k}, x

_{l,*}), while

*P*($zl,k\u2032$, $xl,*\u2032$ ∣ −) is the probability of generating the proposed state ($zl,k\u2032$, $xl,*\u2032$), in which $xl,*mis$ is updated by the proposal distribution.

*Q*is the proposal function constructed as explained above. Equation (16) can be calculated by combining Equations (11), (2), (3), and (4).

*Update**v*_{k}, *h*_{k}, *and**u*_{k}.

We want to sample *v*_{k} (and *h*_{k} and *u*_{k}) from *P*(*v*_{k} ∣ −) ∝ *P*(*v*_{k})*P*(z_{*,k} ∣ *v*_{k}, *h*_{k}, *u*_{k}). This belongs to a class of problems known as **sampling from doubly-intractable distributions** (Møller et al. 2006; Murray, Ghahramani, and MacKay 2006). Although it remains a challenging problem in statistics, it is not difficult to approximately sample the variables if we give up theoretical rigorousness (Liang 2010). The details of the algorithm are described in Section 5.1.

*Update* w_{k,*}.

The remaining problem is how to update *w*_{k,m}. Because the number of weights is very large (*K* × *M*), the simple Metropolis-Hastings algorithm (Görür, Jäkel, and Rasmussen 2006; Doyle, Bicknell, and Levy 2014) is not a workable option. To address this problem, we block-sample w_{k,*} = (*w*_{k,1}, ⋯, *w*_{k,M}) using Hamiltonian Monte Carlo (HMC) (Neal 2011). We present a sketch of the algorithm in Section 5.2.

### 5.1 Approximate Sampling from Doubly Intractable Distributions

During inference, we want to sample *v*_{k} (and *h*_{k} and *u*_{k}) from its posterior distribution, *P*(*v*_{k} ∣ −) ∝ *P*(*v*_{k})*P*(z_{*,k} ∣ *v*_{k}, *h*_{k}, *u*_{k}). Unfortunately, we cannot apply the standard Metropolis-Hastings (MH) sampler to this problem because *P*(z_{*,k} ∣ *v*_{k}, *h*_{k}, *u*_{k}) contains an intractable normalization term. Such a distribution is called a **doubly intractable** distribution because Markov chain Monte Carlo itself approximates the intractable distribution (Møller et al. 2006; Murray, Ghahramani, and MacKay 2006). This problem remains an active topic in the statistics literature to date. However, if we give up theoretical rigorousness, it is not difficult to draw samples from the posterior, which are only approximately correct but work well in practice.

Specifically, we use the double MH sampler (Liang 2010). The key idea is to use an auxiliary variable to cancel out the normalization term. This sampler is based on the exchange algorithm of Murray, Ghahramani, and MacKay (2006), which samples *v*_{k} in the following steps.

- 1.
Propose $vk\u2032$ ∼

*q*($vk\u2032$|*v*_{k}, z_{*,k}) - 2.
Generate an auxiliary variable $z*,k\u2032$ ∼

*P*($z*,k\u2032$ ∣ $vk\u2032$,*h*_{k},*u*_{k}) using an*exact*sampler. - 3. Accept $vk\u2032$ with probability min{1,
*r*(*v*_{k}, $vk\u2032$, $z*,k\u2032$ ∣ z_{*,k})}, where$r(vk,vk\u2032,z*,k\u2032\u2223z*,k)=P(vk\u2032)q(vk\u2223vk\u2032,z*,k)P(vk)q(vk\u2032\u2223vk,z*,k)\xd7P(z*,k\u2223vk\u2032,hk,uk)P(z*,k\u2032\u2223vk,hk,uk)P(z*,k\u2223vk,hk,uk)P(z*,k\u2032\u2223vk\u2032,hk,uk)$(17)

*z*

_{l,k}’s starting from the current z

_{*,k}. At each step of the Gibbs sampling scan, $zl,k\u2032$ is updated according to

*P*($zl,k\u2032$ ∣ $z\u2212l,k\u2032$, $vk\u2032$,

*h*

_{k},

*u*

_{k}). Note that the auxiliary variable $z*,k\u2032$ is only used to compute Equation (17).

We construct the proposal distributions *q*($vk\u2032$ ∣ *v*_{k}, z_{*,k}) and *q*($hk\u2032$ ∣ *h*_{k}, z_{*,k}) using a log-normal distribution, and *q*($uk\u2032$ ∣ *u*_{k}, z_{*,k}) using a Gaussian distribution with mean *u*_{k}.

### 5.2 Hamiltonian Monte Carlo

HMC (Neal 2011) is a Markov chain Monte Carlo method for drawing samples from a probability density distribution. Unlike Metropolis-Hastings, it exploits gradient information to propose a new state, which can be distant from the current state. If no numerical error is involved, the new state proposed by HMC is accepted with probability 1.

HMC has a connection to Hamiltonian dynamics and the physical analogy is useful for gaining an intuition. In HMC, the variable to be sampled, *q* ∈ ℝ^{M}, is seen as a generalized coordinate of a system and is associated with a potential energy function *U*(*q*) = −log *P*(*q*), the negative logarithm of the (unnormalized) density function. The coordinate *q* is tied with an auxiliary momentum variable *p* ∈ ℝ^{M} and a kinetic function *K*(*p*). The momentum makes the object move. Because *H*(*q*, *p*) = *U*(*q*) + *K*(*p*), the sum of the kinetic and potential energy, is constant with respect to time, the time evolution of the system is uniquely defined given an initial state (*q*_{0}, *p*_{0}). The trajectory is computed to obtain a state (*q*, *p*) at some time, and that *q* is the next sample we want.

Algorithm 1 shows the pseudo-code, which is adopted from Neal (2011). The momentum variable is drawn from the Gaussian distribution (line 2). The time evolution of the system is numerically simulated using the leapfrog method (lines 4–11), where ϵ and *S* are parameters of the algorithm to be tuned. This is followed by a Metropolis step to correct for numerical errors (lines 13–18).

Going back to the sampling of w_{k,*}, we need *U*(w_{k,*}) = −log *P*(w_{k,*} ∣ −) and its gradient ∇*U*(w_{k,*}) to run HMC. The unnormalized density function *P*(w_{k,*} ∣ −) is the product of (1) the probability of generating *w*_{k,m}’s from the *t*-distribution and (2) the probability of generating *x*_{l,n}’s for each language with *z*_{l,k} = 1. Note that *U*(w_{k,*}) is differentiable because Equations (5), (6), and (7) as well as the *t*-distribution are differentiable.

## 6. Missing Value Imputation

### 6.1 Settings

Although our goal is to induce good latent representations, “goodness” is too subjective to be measured. To quantitatively evaluate the proposed model, we use missing value imputation as an approximate performance indicator. If the model predicts missing feature values better than reasonable baselines, we can say that the induced linguistic parameters are justified. Although no ground truth exists for the missing portion of a data set, missing value imputation can be evaluated by hiding some observed values and verifying the effectiveness of their recovery. We conducted a 10-fold cross-validation.

We ran the proposed model, now called SynDia, with two different settings: *K* = 50 and 100.^{8} We performed posterior inference for 500 iterations. After that, we collected 100 samples of *x*_{l,n} for each language, one per iteration. For each missing value *x*_{l,n}, we output the most frequent value among the 100 samples. The HMC parameters ϵ and *S* were set to 0.05 and 10, respectively. We applied simulated annealing to the sampling of *z*_{l,k} and $xl,*mis$. For the first 100 iterations, the inverse temperature was increased from 0.1 to 1.0.

We also tested a simplified version of SynDia, Syn, from which *v*_{k} and *h*_{k} were removed. Equation (11) was replaced with Equation (14).

We compared Syn and SynDia with several baselines.

**MFV**For each feature

*n*, always output the most frequent value among the observed portion of x_{*,n}= (*x*_{1,n}, ⋯,*x*_{L,n}).**Surface-Dia**The autologistic model applied to each surface feature

*n*(Murawaki and Yamauchi 2018). After 500 burn-in iterations, we collected 500 samples with the interval of five iterations. Among the collected samples, we chose the most frequent feature value for each language as the output.^{9}**DPMPM**A Dirichlet process mixture of multinomial distributions with a truncated stick-breaking construction (Si and Reiter 2013) used by Blasi, Michaelis, and Haspelmath (2017) for missing value imputation. It assigned a single categorical latent variable to each language. As an implementation, we used the R package

*NPBayesImpute*. We ran the model with several different values for the truncation level*K**. The best score is reported.

### 6.2 Results

Table 3 shows the result. We can see that SynDia outperformed the rest by substantial margins for both data sets. The gains of SynDia over Syn were statistically significant in both data sets (*p* < 0.01 for WALS and *p* < 0.05 for Autotyp). Although the likelihood *P*(*X* ∣ *Z*, *W*) went up as *K* increased, likelihood was not necessarily correlated with accuracy. Because of the high ratio of missing values, the model with a larger *K* can overfit the data. The best score was obtained with *K* = 100 for WALS and *K* = 50 for Autotyp, although the differences were not statistically significant in either data set. We conjecture that because WALS’s features ranged over broader domains of language, a larger *K* was needed to cover major regular patterns in the data (Haspelmath 2008b).

Type . | Model . | WALS . | Autotyp . |
---|---|---|---|

Inter-language dependencies | MFV | 54.80 | 69.42 |

Surface-Dia | 61.27 | 73.04 | |

Inter-feature dependencies | DPMPM (K* = 50) | 59.12 | 65.53 |

MCA | 65.61 | 76.29 | |

Syn (K = 50) | 72.71 | 81.24 | |

Syn (K = 100) | 72.78 | 81.20 | |

Both | SynDia (K = 50) | 73.47 | 81.58 |

SynDia (K = 100) | 73.57 | 81.33 |

Type . | Model . | WALS . | Autotyp . |
---|---|---|---|

Inter-language dependencies | MFV | 54.80 | 69.42 |

Surface-Dia | 61.27 | 73.04 | |

Inter-feature dependencies | DPMPM (K* = 50) | 59.12 | 65.53 |

MCA | 65.61 | 76.29 | |

Syn (K = 50) | 72.71 | 81.24 | |

Syn (K = 100) | 72.78 | 81.20 | |

Both | SynDia (K = 50) | 73.47 | 81.58 |

SynDia (K = 100) | 73.57 | 81.33 |

The fact that Syn outperformed Surface-Dia suggests that inter-feature dependencies have more predictive power than inter-language dependencies in the data sets. However, they are complimentary in nature, as SynDia outperformed Syn.

DPMPM performed poorly even if a small value was set to the truncation level *K** to avoid overfitting. It divided the world’s languages into a finite number of disjoint groups. In other words, the latent representation of a language was a single categorical value. As expected, DPMPM showed its limited expressive power.

MCA uses a more expressive representation for each language: a sequence of continuous variables. It outperformed DPMPM but was inferior to Syn by a large margin. We conjecture that MCA was more sensitive to initialization than the Bayesian model armed with Markov chain Monte Carlo sampling.

To investigate the effects of inter-language dependencies, we conducted ablation experiments. We removed *v*_{k}, *h*_{k}, or both from the model. Note that if both are removed, the resultant model is Syn. The result is shown in Table 4. Although not all modifications yielded statistically significant changes, we observed exactly the same pattern in both data sets: The removal of vertical and horizontal factors consistently degenerated the performance, with *h*_{k} having larger impacts than *v*_{k}.

Model . | WALS (K = 100)
. | Autotyp (K = 50)
. |
---|---|---|

Full model (SynDia) | 73.57 | 81.58 |

-vertical | 73.40 (−0.17) | 81.33 (−0.25) |

-horizontal | 73.17 (−0.39)^{†} | 81.28 (−0.30)^{†} |

-vertical-horizontal (Syn) | 72.78 (−0.79)^{††} | 81.24 (−0.34)^{†} |

Model . | WALS (K = 100)
. | Autotyp (K = 50)
. |
---|---|---|

Full model (SynDia) | 73.57 | 81.58 |

-vertical | 73.40 (−0.17) | 81.33 (−0.25) |

-horizontal | 73.17 (−0.39)^{†} | 81.28 (−0.30)^{†} |

-vertical-horizontal (Syn) | 72.78 (−0.79)^{††} | 81.24 (−0.34)^{†} |

## 7. Discussion

### 7.1 Vertical Stability and Horizontal Diffusibility

Missing value imputation demonstrates that SynDia successfully captures regularities of the data. We now move on to the question of what the learned linguistic parameters look like. We performed posterior inference again, with the settings based on the results of missing value imputation: We chose *K* = 100 for WALS and *K* = 50 for Autotyp. The difference was that we no longer performed a 10-fold cross-validation but gave all observed data to the models.

Figure 7 shows the proportion of the modal value of each linguistic parameter. For Autotyp, the largest histogram bin ranged from 0.95 to 1.0. This suggests that even with *K* = 50, macro-patterns were covered by a limited number of linguistic parameters. The rest focused on capturing micro-patterns. In contrast, linguistic parameters of WALS tended toward 0.5. There might have been more macro-patterns to be captured.

Next, we take advantage of the fact that both Surface-Dia and SynDia use the autologistic model to estimate vertical stability *v*_{k} (*v*_{n} for Surface-Dia) and horizontal diffusibility *h*_{k} (*h*_{n}). Large *v*_{k} indicates that phylogenetically related languages would be expected to frequently share the same value for linguistic parameter *k*. Similarly, *h*_{k} measures how frequently spatially close languages would be expected to share the parameter value. Because both models make use of the same neighbor graphs, we expect that if the latent representations successfully reflect surface patterns, *v*_{k}’s (*h*_{k}’s) are in roughly the same range as *v*_{n}’s (*h*_{n}’s). To put it differently, if the region in which most *v*_{k}’s reside is much closer to zero than that of most *v*_{n}’s, it serves as a signal of the model’s failure.

Figure 8 and 9 summarize the results. For both WALS and Autotyp, most linguistic parameters were within the same ranges as surface features, both in terms of vertical stability and horizontal diffusibility. Although this result does not directly prove that the latent representations are good, we can at least confirm that the model did not fail in this regard.

There were some outliers to be explained. Autotyp’s outliers were surface features and are easier to explain: Features with very large *v*_{n} were characterized by heavy imbalances in their value distributions, and their minority values were confined to small language families.

The outliers of WALS were linguistic parameters. They had much larger *v*_{k}, much larger *h*_{k}, or both, than those of surface features, but deviations from the typical range were more marked for *v*_{k}. As we see in Figure 7(a), the linguistic parameters of WALS, including the outliers, focused on capturing macro-patterns and did not try to explain variations found in small language families. As we discuss later, these linguistic parameters were associated with clear geographical distributions, which we believe may be too clear. We are unsure whether they successfully “denoised” minor variations found in surface features or simply overgeneralized macro-patterns to missing features.

### 7.2 Manual Investigation of the Latent Representations

Figure 10 visualizes the weight matrix *W*. We can find some dashed horizontal lines, which indicate strong dependencies between feature values. However, it is difficult for humans to directly derive meaningful patterns from the noise-like images.

A more effective way to visually explore linguistic parameters is to plot them on a world map. In fact, WALS Online provides the functionality of plotting a surface feature and dynamically combined features to help typologists discover geographical patterns (Haspelmath et al. 2005). As exemplified by Figure 11(a), some surface features show several geographic clusters of large size, revealing something about the evolutionary history of languages. We can do the same thing for linguistic parameters.

An example from WALS is shown in Figure 11(b). Even with a large number of missing values, SynDia yielded geographic clusters of comparable size for some linguistic parameters. Needless to say, not all surface features were associated with clear geographic patterns, and neither were linguistic parameters.

For comparison, we also investigated linguistic parameters induced by Syn (not shown in this article), which did not exploit inter-language dependencies. Some geographic clusters were found, especially when the estimation of *z*_{l,k} was stable. In our subjective evaluation, however, SynDia appeared to show clearer patterns than Syn. There were many low-coverage languages, and due to inherent uncertainty, *z*_{l,k} swung between 0 and 1 during posterior inference when inter-language dependencies were ignored. As a result, geographical signals were obscured.

The world map motivates us to look back at the cryptic weight matrix. The linguistic parameter visualized in Figure 11(b), for example, was tied to weight vector w_{k,*}, which gave the largest positive values to the following feature–value pairs: Feature 31A, “Sex-based and Non-sex-based Gender Systems”: No gender; Feature 44A, “Gender Distinctions in Independent Personal Pronouns”: No gender distinctions; Feature 30A, “Number of Genders”: None; and Feature 32A, “Systems of Gender Assignment”: No gender. We can determine that the main function of this linguistic parameter is to disable a system of grammatical gender. It is interesting that not all genderless languages turned on this linguistic parameter. The value No gender for Feature 31A is given a relatively large positive weight by another linguistic parameter, which was characterized by a tendency to avoid morphological marking. In this way, we can draw from latent parameters hypotheses of linguistic typology that are to be tested through a more rigorous analysis of data.

### 7.3. Applications

Although we have demonstrated in the previous section how we can directly examine the learned model for a typological inquiry, this approach has limitations. As we briefly discussed in Section 2.2, the search space must have numerous modes. A trivial way to prove this is to swap a pair of linguistic parameters together with the corresponding weight vector pair. The posterior probability remains the same. Although permutation invariance poses little problem in practice, arbitrariness involving marked–unmarked relations can be problematic. The model has a certain degree of flexibility that allows it to choose a marked form from a dichotomy it uncovers, and the rather arbitrary decision affects other linguistic parameters. The net outcome of unidentifiability is that even if the model provides a seemingly good explanation for an observed pattern, another run of the model may provide another explanation.

We suggest that instead of investing too much effort in interpreting induced linguistic parameters, we can use the model as a black box that connects two or more surface representations via the latent space. A schematic illustration of the proposed approach is shown in Figure 12. We first use the model to map a given language x_{l,*} into its latent representation z_{l,*}, then manipulate z_{l,*} in the latent space to obtain $zl\u2032,*$, the latent representation of a new language *l*′, and finally project $zl\u2032,*$ back into the original space to obtain its surface representation $xl\u2032,*$. By comparing x_{l,*} with $xl\u2032,*$, we can identify how manipulation in the latent space affects multiple surface features in general.

A useful property of this approach is that because both the surface and latent representations are discrete, linguistic parameters can readily be used as a substitute for surface features. This means that methods that have been successfully applied to surface features are applicable to linguistic parameters as well. Undoubtedly, the most promising method is Bayesian phylogenetic analysis (Dediu 2010; Greenhill et al. 2010; Dunn et al. 2011; Maurits and Griffiths 2014; Greenhill et al. 2017). For example, we can reconstruct an ancestral language from its descendants in the latent space. By comparing the ancestor and a descendant in the original space, we can investigate how languages change over time without being trapped into unnatural combinations of features, as was envisioned by Greenberg (1978). Unlike Dunn et al. (2011), who focused on the dependency between a pair of binary features, the proposed framework has the ability to uncover correlated evolution involving multiple features (Murawaki 2018).

## 8. Conclusions

In this article, we reformulated the quest for Greenbergian universals with representation learning. To develop a concrete model that is robust with respect to a large number of missing values, we adopted the Bayesian learning framework. We also exploited inter-language dependencies to deal with low-coverage languages. Missing value imputation demonstrates that the proposed model successfully captures regularities of the data. We plotted languages on a world map to show that some latent variables are associated with clear geographical patterns. The source code is publicly available at https://github.com/murawaki/lattyp.

The most promising application of latent representation-based analysis is Bayesian phylogenetic methods. The proposed model can be used to uncover correlated evolution involving multiple features.

## Acknowledgments

The key ideas and early experimental results were presented at the Eighth International Joint Conference on Natural Language Processing (Murawaki 2017). This article, which presents updated results, is a substantially extended version of the earlier conference paper. This work was partly supported by JSPS KAKENHI grant 18K18104.

## Notes

The combination of neural networks and Bayesian learning is actively studied (Welling and Teh 2011) and is applied to natural language tasks (Gan et al. 2017). One thing common to these studies, including ours, is the use of Hamiltonian Monte Carlo (HMC) or variants of it for inference. Bayesian neural networks use online extensions to HMC because scalability is a vital concern. In contrast, we use vanilla HMC because our database is relatively small.

We discarded ratio features whose values ranged from 0 to 1, inclusive. Although it is possible to model ratio data, it is generally not a good idea because the “raw” features from which the ratio is calculated are more informative.

Alternatively, we can use a negative binomial distribution because it may provide a closer fit by decoupling mean and variance. Another option is to use a Poisson hurdle distribution (Mullahy 1986), which deals with the high occurrence of zeroes in the data.

Alternatively, we can explicitly impose sparsity by generating another binary matrix *Z*_{W} and replacing *W* with *Z*_{W} ⊙ *W*, where ⊙ denotes element-wise multiplication.

Indian buffet processes (Griffiths and Ghahramani 2011) are the natural choice for modeling binary latent matrices (Görür, Jäkel, and Rasmussen 2006; Knowles and Ghahramani 2007; Meeds et al. 2007; Doyle, Bicknell, and Levy 2014). The Indian buffet process (IBP) is appealing for its ability to adjust the number of linguistic parameters to data. A linguistic parameter *k* is called an **active** parameter if it has one or more languages with *z*_{l,k} = 1. The key property of the IBP is that although there are an unbounded number of linguistic parameters, the number of active linguistic parameters *K*^{+} is finite. *K*^{+} changes during posterior inference. It is decremented when one linguistic parameter becomes inactive. Similarly, it is incremented when *z*_{l,k} changes from 0 to 1 for an inactive linguistic parameter *k*.

We modeled *P*(*Z* ∣ *A*) using an IBP in a preliminary study but later switched to the present model for two reasons. First, it is difficult to extend an IBP to incorporate inter-language dependencies. It appears that we have no choice but to replace the IBP with the product of the autologistic models. Second, the nonparametric model’s adaptability did not work in our case. In theory, Gibbs sampling converges to a stationary distribution after a sufficiently large number of iterations. However, we observed that the number of active linguistic parameters heavily depended on its initial value *K*_{0} because it was very rare for additional linguistic parameters to survive as active linguistic parameters. For this reason, the number of linguistic parameters for the present model, *K*, is given a priori and is fixed throughout posterior inference.

In preliminary experiments, we also tried *K* = 250 and 500, but it quickly became evident that a large *K* caused performance degeneration.

Whereas Murawaki and Yamauchi (2018) partitioned $x*,nobs$ into 10 equal sized subsets for each feature *n*, we applied the 10-fold cross-validation to *X*^{obs} as a whole, in order to allow comparison with other models. At the level of features, this resulted in slightly uneven distributions over folds but did not seem to have made a notable impact on the overall performance.

## References

*Nature*