On the Complexity and Typology of Inflectional Morphological Systems

We quantify the linguistic complexity of different languages' morphological systems. We verify that there is a statistically significant empirical trade-off between paradigm size and irregularity: a language's inflectional paradigms may be either large in size or highly irregular, but never both. We define a new measure of paradigm irregularity based on the conditional entropy of the surface realization of a paradigm: how hard it is to jointly predict all the word forms in a paradigm from the lemma. We estimate irregularity by training a predictive model. Our measurements are taken on large morphological paradigms from 36 typologically diverse languages.


Introduction
What makes an inflectional system "complex"? Linguists have sometimes considered measuring this by the size of the inflectional paradigms, i.e., the number of morpho-syntactic distinctions the language makes (McWhorter, 2001). However, this gives only a partial picture of complexity (Sagot, 2013): beyond simply being larger, some inflectional systems are also more irregular, in that it is harder to guess forms in the paradigm from other forms in the same paradigm. Ackerman and Malouf (2013) hypothesize that these two notions of morphological complexity interact: while a system may be complex along either axis, it is never complex along both. In other words, there is a trade-off.
In this work, we develop machine learning tools to operationalize this hypothesis using recurrent neural networks and latent variable models to measure the complexity of inflectional systems. We explain our approach to quantifying two aspects of inflectional complexity and, in one case, derive a variational bound to enable efficient approximation to the metric. This allows a completely data-driven approach by which we can measure the morphological complexity of a given language in a clean, relatively theory-agnostic manner.
Our study evaluates 36 typologically diverse languages, using collections of orthographic paradigms. Importantly, our method does not require a linguistic analysis of words into their constituent morphemes, e.g., hoping → hope+ing. We find support for the hypothesis of Ackerman and Malouf (2013). Concretely, we show that the more forms an inflectional paradigm has, the more predictable the forms must be from one another (for example, they might be related by a simple change of suffix). This intuition has a long history in the linguistics community: field linguists have often noted that languages with extreme morphological richness, e.g., agglutinative and polysynthetic languages, have virtually no exceptions or irregular forms. Our contribution lies in mathematically formulating this notion of regularity and providing a means to estimate it by fitting a probability model. Using these tools, we provide a quantitative verification of this conjecture on a large set of typologically diverse languages, a result that is statistically significant at p < 0.05.

Word-Based Morphology
We adopt the framework of word-based morphology (Aronoff, 1976; Spencer, 1991). Thus, for the rest of the work we will define an inflected lexicon as a set of word types. Each word type is a triple of
• a lexeme (an arbitrary integer or string that indexes the word's core meaning and part of speech),
• a slot (an arbitrary integer or object that indicates how the word is inflected), and
• a surface form (a string over a fixed phonological or orthographic alphabet Σ).
We write π(ℓ) for the set of word types (triples) in the lexicon that share lexeme ℓ, known as the paradigm of ℓ. The slots that appear in this set are said to be filled by the corresponding surface forms. For example, in the English paradigm π(walk_Verb), the past-tense slot is filled by walked.
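To make these definitions concrete, here is a minimal sketch in Python, with a hypothetical toy lexicon (the slot labels merely imitate UniMorph-style tags; none of this is the paper's actual code):

```python
# A word type is a (lexeme, slot, surface form) triple.
# Toy data for illustration only.
lexicon = {
    ("walk_Verb", "NFIN",     "walk"),
    ("walk_Verb", "PRS;3;SG", "walks"),
    ("walk_Verb", "PST",      "walked"),
    ("run_Verb",  "NFIN",     "run"),
    ("run_Verb",  "PST",      "ran"),
}

def paradigm(lexicon, lexeme):
    """pi(lexeme): the set of word types in the lexicon sharing `lexeme`."""
    return {(lex, slot, form) for (lex, slot, form) in lexicon if lex == lexeme}

print(sorted(paradigm(lexicon, "walk_Verb")))
```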
Nothing in our method requires a Bloomfieldian structuralist analysis that decomposes each word into underlying morphemes: rather, this paper is a-morphous in the sense of Anderson (1992).
More specifically, we will work within the UniMorph annotation scheme (Sylak-Glassman, 2016). In the simplest case, each slot specifies a morphosyntactic bundle of inflectional features such as tense, mood, person, number, and gender. For example, the Spanish surface form pongas appears with a slot that indicates that this word has the features [TENSE=PRESENT, MOOD=SUBJUNCTIVE, PERSON=2, NUMBER=SG]. However, in a language where two or more feature bundles systematically yield the same form across all lexemes, UniMorph generally collapses them into a single slot that realizes multiple feature bundles. Thus, a single "verb lemma" slot suffices to describe all English surface forms in {see, go, jump, . . . }: this slot indicates that the word can be a bare infinitive verb, but also that it can be a present-tense verb that may have any gender and any person/number pair other than 3rd-person singular. We postpone a discussion of the details of UniMorph until §6.1, but it is mostly compatible with other, similar schemes.

Defining Complexity
Ackerman and Malouf (2013) distinguish two types of morphological complexity, which we elaborate on below. For a more general overview of morphological complexity, see Baerman et al. (2015).

Enumerative Complexity
The first type, enumerative complexity (e-complexity), is the number of morpho-syntactic distinctions a language makes within a part of speech. For example, the enumerative complexity of English verbs can be quantified as the average size of the paradigm |π(ℓ)|, where ℓ ranges over a list of English verb lexemes.
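Reusing the toy triple representation sketched in §2, e-complexity is then just an average paradigm size (a sketch, not the paper's exact procedure):

```python
def e_complexity(lexicon, lexemes):
    """Average paradigm size: mean |pi(l)| over the given lexemes."""
    sizes = [sum(1 for (lex, _, _) in lexicon if lex == l) for l in lexemes]
    return sum(sizes) / len(sizes)

# With the toy lexicon above: (3 + 2) / 2 == 2.5
print(e_complexity(lexicon, ["walk_Verb", "run_Verb"]))
```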
The notion of e-complexity has a long history in linguistics. The idea was explicitly discussed as early as Sapir (1921). More recently, Sagot (2013) has referred to this concept as counting complexity, referencing comparison of the complexity of creoles and non-creoles by McWhorter (2001).
For a given part of speech, this quantity varies dramatically over the languages of the world. While the regular English verb paradigm has three slots in our annotation, the Archi verb has thousands (Kibrik, 1998). However, does this make the Archi system more complex? In other words, is it more difficult to describe or to learn? Despite the plethora of forms, it is often the case that one form can be regularly predicted from another, indicating that few forms actually have to be memorized for each lexeme.

Integrative Complexity
The second notion of complexity is integrative complexity (i-complexity), which measures how regular an inflectional system is on the surface. Students of a foreign language will most certainly have encountered the concept of an irregular verb. Pinning down a formal and workable cross-linguistic definition is non-trivial, but the intuition that some inflected forms are regular and others irregular dates back at least to Bloomfield (1933, pp. 273-274), who famously argued that what makes a surface form regular is that it is the output of a deterministic function. For an in-depth dissection of the subject, see Stolz et al. (2012). Ackerman and Malouf (2013) build their definition of i-complexity on the information-theoretic notion of entropy (Shannon, 1948). Their intuition is that a morphological system should be considered irregular to the extent that its forms are unpredictable. They say, for example, that the nominative singular form is unpredictable in a language if many nouns express it with the suffix -o while many others use -∅. In this paper, we will propose an improvement to their entropy-based measure.

The Low-Entropy Conjecture
The low-entropy conjecture, as formulated by Ackerman and Malouf (2013, p. 436), "is the hypothesis that enumerative morphological complexity is effectively unrestricted, as long as the average conditional entropy, a measure of integrative complexity, is low." In other words, morphological systems face a tradeoff between e-complexity and i-complexity: a system may be complex under either metric, but not under both. Indeed, Ackerman & Malouf go so far as to say that there need be no upper bound on e-complexity as long as the i-complexity remains sufficiently low.
This line of thinking harks back to the equal complexity conjecture of Hockett, who stated: "objective measurement is difficult, but impressionistically it would seem that the total grammatical complexity of any language, counting both the morphology and syntax, is about the same as any other" (Hockett, 1958, pp. 180-181). Similar trade-offs have been found in other branches of linguistics (see Oh (2015) for a review). For example, there is a trade-off between rate of speech and syllable complexity (Pellegrino et al., 2011): even though Spanish speakers utter many more syllables per second than Chinese speakers, the overall information rate is quite similar, as Chinese syllables carry more information (they bear tone).
Hockett's equal complexity conjecture is controversial: languages such as Riau Indonesian seem low in complexity across both morphology and syntax (Gil, 1994). This is why Ackerman and Malouf instead posit only that a linguistic system has bounded complexity. Their low-entropy conjecture says that the "total" complexity of a morphological system (e-complexity and i-complexity) must not be too high, though it can be low, as indeed it is in isolating languages such as Mandarin Chinese.

Entropic Integrative Complexity
In this section, we advocate for a probabilistic treatment of paradigmatic morphology. We assume that a language's inflectional morphology system is a distribution p over possible paradigms (Dreyer and Eisner, 2009; Cotterell et al., 2015). For instance, knowing the 4-slot English verbal paradigm means knowing a joint distribution over 4-tuples of surface forms. This is the "base distribution" from which each new word type's paradigm is assumed to have been sampled. Each observed paradigm, such as (run, runs, running, ran), provides evidence about this distribution. The fact that some paradigms are used more frequently than others (more tokens) does not mean that they have higher base probability under the morphological system p. Rather, their higher usage is a semantic effect or simply a rich-get-richer effect (Dreyer and Eisner, 2011). We expect the base distribution to place low probability on implausible paradigms: e.g., p(run, snur, running, nar) is low, perhaps close to zero. Moreover, we expect the conditionals of this distribution to assign high probability to the result of applying regular processes: e.g., p(sprint, sprints, sprinting | sprinted) in English should be close to 1.
So should p(wug, wugs, wugging | wugged), where wug is a novel word. We note that p (when smoothed) will have support over Σ * × · · · × Σ * : it assigns positive probability to any n-tuple of strings. The model is thus capable of evaluating arbitrary wug-formations (Berko, 1958), including irregular ones.
So how do we relate p to the i-complexity of a language? Here, we again follow the spirit of Ackerman and Malouf (2013) and argue that the entropy H(p) is an appropriate measure, defined in our setting as the joint entropy

$$H(p) \;=\; -\sum_{\mathbf{m}\,\in\,\Sigma^*\times\cdots\times\Sigma^*} p(\mathbf{m}) \log_2 p(\mathbf{m}) \qquad (2)$$

where $\mathbf{m} = (m_1, \ldots, m_n)$ ranges over n-tuples of surface forms.
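For intuition, equation (2) can be computed exactly when the support is small. A minimal sketch, with a hypothetical toy distribution over 2-slot paradigms:

```python
import math

# Hypothetical base distribution over (lemma, past tense) paradigms.
p = {
    ("walk", "walked"): 0.5,
    ("jump", "jumped"): 0.25,
    ("run",  "ran"):    0.25,
}

def joint_entropy(p):
    """H(p) = -sum_m p(m) log2 p(m), summing over whole paradigm tuples m."""
    return -sum(prob * math.log2(prob) for prob in p.values() if prob > 0)

print(joint_entropy(p))  # 1.5 bits
```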

A Variational Upper Bound on Entropy
Lamentably, the paradigm entropy defined in equation (2) requires approximation. First, we do not actually know the true distribution p. Furthermore, even if we knew p, a sufficiently expressive distribution would render direct computation intractable: it involves n nested sums over the infinite set Σ*. Thus, following Brown et al. (1992), we use a probability model to estimate an upper bound for the paradigm entropy. Our starting point is a well-known bound on the entropy of p:

$$H(p) \;\le\; H(p, q) \qquad (3)$$

where q is any other distribution over the same space as p. In our setting, the cross-entropy H(p, q) is defined as

$$H(p, q) \;=\; -\sum_{\mathbf{m}\,\in\,\Sigma^*\times\cdots\times\Sigma^*} p(\mathbf{m}) \log_2 q(\mathbf{m}) \qquad (4)$$

The quality of the bound in (3) depends on how close q is to p, as measured by the KL-divergence D(p || q), with equality in (3) if and only if p = q.

Figure 1: Two potential directed graphical models for the paradigm completion task. The topology in (a) encodes the network where all forms are predicted from the lemma. The topology in (b), on the other hand, makes it easier to predict forms given the others: pongas is predicted from ponga, with which it shares a stem. Qualitatively, the structure learning algorithm discussed in §4.2 finds trees structured similarly to (b).
Choice of q. The bound in equation (3) holds for any choice of q. We cannot practically search over all distributions to find the tightest bound. Nevertheless, we can still find a reasonably good q through direct estimation of a probability model. Given a set of true morphological paradigms D_train drawn from p, we can fit our probability model q in any reasonable way, for example by (locally) maximizing the log-likelihood

$$\sum_{\mathbf{m}\,\in\, D_{\text{train}}} \log_2 q(\mathbf{m}) \qquad (5)$$

This is equivalent to seeking the tightest bound (3) achievable by any distribution q in a parametric family Q. We discuss our specific choice of Q in §4 below.
Estimate of i-complexity. Having chosen q, we can estimate H(p, q) using a separate held-out sample D_test:

$$\hat{H}(p, q) \;=\; -\frac{1}{d} \sum_{\mathbf{m}\,\in\, D_{\text{test}}} \log_2 q(\mathbf{m}) \qquad (6)$$

where d = |D_test|. This can be computed as long as we can evaluate q, and the estimate converges to H(p, q) as the test sample size d → ∞. We return this estimate as our practical approximation of the desired i-complexity H(p).
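Operationally, equation (6) is a few lines of code once the model can score a paradigm. A sketch, where `q_logprob` is an assumed interface standing in for whatever trained model returns log₂ q(m):

```python
def cross_entropy_estimate(q_logprob, test_paradigms):
    """Equation (6): -(1/d) * sum over held-out paradigms of log2 q(m).
    Upper-bounds H(p) in expectation; converges to H(p, q) as d grows."""
    d = len(test_paradigms)
    return -sum(q_logprob(m) for m in test_paradigms) / d
```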

A Generative Model of the Paradigm
To fit q, we need a tractable parametric family Q of joint distributions over paradigms. To define Q, we follow Cotterell et al. (2017b) and arrange the n slots into a tree-structured Bayesian network (a directed graphical model). For a given tree T, we have the function pa_T(i), which returns the parent of the i-th cell, or the empty string if the i-th cell is the root. We show two possible tree structures for Spanish verbs in Figure 1. Now, we may write a particular element of Q, a factored joint distribution over all n forms, as

$$q_\theta(m_1, \ldots, m_n) \;=\; \prod_{i=1}^{n} q_\theta\bigl(m_i \mid m_{\mathrm{pa}_T(i)}\bigr) \qquad (7)$$

where θ represents the parameter vector of the Bayesian network. We specifically model all of the conditional probabilities in (7) using a neural sequence-to-sequence model with parameters θ, as described in §4.1 below. As our distribution q_θ is a smooth function of θ, we can maximize (5) via gradient-based optimization, as outlined in §6.2.
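A sketch of how the factorization in equation (7) is scored, assuming a conditional scorer `log_q_cond(target, source)` (standing in for the shared sequence-to-sequence model of §4.1) and a parent table encoding the tree:

```python
def log_q_joint(forms, parent, log_q_cond):
    """Equation (7): log q(m_1..m_n) = sum_i log q(m_i | m_{pa_T(i)}).
    `parent[i]` is the index of slot i's parent, or None at the root,
    in which case the root form is predicted from the empty string."""
    total = 0.0
    for i, form in enumerate(forms):
        source = "" if parent[i] is None else forms[parent[i]]
        total += log_q_cond(form, source)
    return total
```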

Neural Sequence-to-Sequence Model
The state of the art in morphological reinflection (Kann and Schütze, 2016) uses an LSTM-based sequence-to-sequence model (Sutskever et al., 2014) with attention (Bahdanau et al., 2015). The idea is to model reinflection as "translation" of an input character sequence, with a description of the desired output slot appended to the input sequence in the form of special characters. For example, in German, consider the mapping from the nominative singular form Hand to the nominative plural form Hände. This is encoded with the source string H a n d IN=NOM IN=SG OUT=NOM OUT=PL and the target string H ä n d e. If the slot realizes multiple feature bundles, we append each of them to the source string. This encoding may be suboptimal, as it throws away which features belong to which bundles. It is similar to the encoding in Kann and Schütze (2016), and allows the same LSTM with parameters θ to be reused at each factor of (7). Different factors q_θ(m_i | m_j) are distinguished only by the fact that the morphological tags for slots i and j are appended to the input string before the LSTM is applied to it.
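The input encoding described above is plain string manipulation. A sketch (the helper name is ours, not from the papers cited):

```python
def encode_reinflection_input(source_form, in_tags, out_tags):
    """Build the character-level source string in the style of
    Kann and Schütze (2016): characters, then IN= and OUT= tag symbols."""
    tokens = list(source_form)
    tokens += [f"IN={t}" for t in in_tags]
    tokens += [f"OUT={t}" for t in out_tags]
    return " ".join(tokens)

print(encode_reinflection_input("Hand", ["NOM", "SG"], ["NOM", "PL"]))
# -> H a n d IN=NOM IN=SG OUT=NOM OUT=PL
```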

Structure Learning
Which tree over the n slots is optimal? It is not clear a priori how to arrange the slots in a paradigm such that their predictability is maximized. For instance, consider the irregular Spanish verb poner: we may want to predict its present subjunctive forms, e.g., pongas and pongan, from another form that shares the same stem, e.g., ponga; this maximizes predictability in that we no longer have to account for the irregular present subjunctive stem change. Our goal, however, is to select the optimal tree for the data, rather than rely on pre-specified linguistic knowledge of the language.

Table 1: Structuralist analysis of Modern Greek nominal inflection classes, organized into singular and plural columns (Ralli, 1994; Ralli, 2002).
In graph-theoretic terms, we choose the highest-weighted directed spanning tree over n vertices, as found by the algorithm of Edmonds (1967). The weight of a candidate tree is the sum of all its edge weights and the weight of its root vertex, where we define the weight of a candidate edge to m_i from m_j as $\frac{1}{d}\sum_{\mathbf{m}\in D_{\text{dev}}} \log_2 q(m_i \mid m_j)$, and define the weight of vertex m_i as $\frac{1}{d}\sum_{\mathbf{m}\in D_{\text{dev}}} \log_2 q(m_i \mid \text{empty string})$, where D_dev is a set of development paradigms. In each case, q is a sequence-to-sequence model trained on D_train, so computing these n² weights requires us to train n² sequence-to-sequence models. Under this scheme, the weight of a candidate tree is the log-likelihood $\frac{1}{d}\sum_{\mathbf{m}\in D_{\text{dev}}} \log_2 q(m_1, \ldots, m_n)$ of a model whose structure is given by the tree and whose conditional distributions are given by these trained q distributions. Recall that our estimate of H(p, q) is the same, but evaluated on D_test (equation (6)).
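As a sketch of this selection step, one can attach a dummy root whose outgoing edge weights are the root-vertex weights and run an off-the-shelf implementation of Edmonds' algorithm, e.g., networkx's maximum spanning arborescence (the glue code below is ours, assuming the weights are the dev-set log-likelihoods defined above):

```python
import networkx as nx

def select_tree(edge_weight, root_weight, n):
    """edge_weight[(j, i)]: dev log-likelihood of predicting slot i from j.
    root_weight[i]: dev log-likelihood of predicting slot i from the
    empty string. A dummy ROOT node carries the root-vertex weights, so
    Edmonds' algorithm chooses the root along with the tree edges."""
    g = nx.DiGraph()
    for i in range(n):
        g.add_edge("ROOT", i, weight=root_weight[i])
    for (j, i), w in edge_weight.items():
        g.add_edge(j, i, weight=w)
    arb = nx.maximum_spanning_arborescence(g)
    return [(j, i) for j, i in arb.edges() if j != "ROOT"]
```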
In fact, as in §4.1, we train only a single shared LSTM-based sequence-to-sequence model to perform all n² transductions. Once we have selected the tree, we could retrain the model to focus on only the n transductions actually required by the tree, but our present experiments do not retrain.

A Methodological Comparison to Ackerman and Malouf (2013)

Our formulation of the low-entropy principle differs somewhat from that of Ackerman and Malouf (2013). We highlight the differences below.
Heuristic Approximation to p. Ackerman and Malouf (2013) first construct what we regard as a heuristic approximation to the joint distribution p over forms in a paradigm. They first provide a structuralist decomposition of words into their constituent morphemes. Then, they consider a distribution r(m_i | m_j) that builds new forms by swapping morphemes. In contrast to our neural sequence-to-sequence approach, this distribution unfortunately does not have support over Σ* and, thus, cannot consider changes other than substitution of affixes.
As a concrete example of r, consider Table 1's Modern Greek example from Ackerman and Malouf (2013). The conditional distribution r(m_{GEN;SG} | m_{ACC;PL} = -i) over genitive singular forms is peaked, since there is exactly one possible transformation: substituting -us for -i. This is not always the case: for Modern Greek, Ackerman and Malouf (2013) estimated that r(m_{NOM;SG} | m_{ACC;PL} = -a) swaps -a for -∅ with probability 2/3 and for -o with probability 1/3. We reiterate that no other transformation would be possible, e.g., swapping -a for -es or mapping it to some arbitrary form such as foo.
Average Conditional Entropy. The second difference is their reliance on the pair-wise conditional entropy between two cells. That is, they argue for the quantity

$$H(M_i \mid M_j) \;=\; -\sum_{m_j \in \Sigma^*} r(m_j) \sum_{m_i \in \Sigma^*} r(m_i \mid m_j)\, \log_2 r(m_i \mid m_j) \qquad (8)$$

where each m_j is a given form. (We have written the sums over Σ*, but as r has finite support, in practice one only has to consider the possible reinflections of m_j that the annotation of the data admits.) The entropy of an entire paradigm is then the average conditional entropy:

$$\frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i} H(M_i \mid M_j) \qquad (9)$$
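For concreteness, a sketch of equations (8) and (9) over an explicit finite table r; here we assume, as one possible choice, a uniform distribution over the observed realizations of the conditioning cell:

```python
import math
from itertools import permutations

def avg_conditional_entropy(r, slots):
    """Equations (8)-(9): average over ordered cell pairs (i, j) of
    H(M_i | M_j) under the heuristic model r. `r[(i, j)]` maps each
    realization of cell j to a dict of conditional probabilities over
    realizations of cell i; realizations of the conditioning cell are
    weighted uniformly here (an assumption of this sketch)."""
    pair_entropies = []
    for i, j in permutations(slots, 2):
        cond = r[(i, j)]
        h = sum(
            -sum(p * math.log2(p) for p in dist.values() if p > 0)
            for dist in cond.values()
        ) / len(cond)
        pair_entropies.append(h)
    return sum(pair_entropies) / len(pair_entropies)
```

With Table 1's classes, for instance, `r[("NOM;SG", "ACC;PL")]` might map "-a" to {"-∅": 2/3, "-o": 1/3}, per the example above.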

Critique of Ackerman and Malouf (2013)
Now, we offer a critique of Ackerman and Malouf (2013) on three points: (i) different linguistic theories may offer different results, (ii) there is no principled manner to handle morphological irregularity, and (iii) average conditional entropy overestimates the i-complexity in comparison to joint entropy. We discuss each in turn.
Theory-dependent Entropy. We consider a classical example from English morpho-phonology that demonstrates the dependence of paradigm entropy on the specific analysis chosen. In regular English plural formation, the speaker has three choices: [z], [s] and [ɪz]. Here are two potential analyses. On the one hand, we may treat this as a case of pure allomorphy with three potential, unrelated suffixes. Under such an analysis, the entropy will reflect the empirical distribution: roughly, $-\left(\frac{1}{4}\log_2\frac{1}{4} + \frac{3}{8}\log_2\frac{3}{8} + \frac{3}{8}\log_2\frac{3}{8}\right) \approx 1.561$ bits. On the other hand, if we assume a unique underlying affix /z/, which is attached and then converted to either [z], [s] or [ɪz] by an application of perfectly regular phonology, this part of the morphological system of English has an entropy of 0: there is only one choice. See Kenstowicz (1994, p. 72) for a discussion of these alternatives from a theoretical standpoint. Note that our goal is not to advocate for one of these analyses, but merely to suggest that Ackerman and Malouf (2013)'s quantity is analysis-dependent. In contrast, our approach is theory-agnostic in that we jointly learn string-to-string transformations, reminiscent of a-morphous morphology (Anderson, 1992), and thus our (approximation to) paradigm entropy does not suffer this drawback. Indeed, our assumptions are limited: recurrent neural networks are universal function approximators, and it has been shown that there exists a finite RNN that can compute any computable function (Siegelmann and Sontag, 1991; Siegelmann and Sontag, 1995). Thus, the only true assumption we make of morphology is mild: we assume it is Turing-computable; that language is Turing-computable is a fundamental tenet of cognitive science (McCulloch and Pitts, 1943; Sobel and Li, 2013).
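As a quick arithmetic check of the allomorphy analysis above (using the rough empirical suffix distribution given in the text):

```python
import math
probs = [1/4, 3/8, 3/8]  # rough empirical distribution over {[z], [s], [ɪz]}
print(-sum(p * math.log2(p) for p in probs))  # ~1.561 bits
```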

Morphological Irregularity. A second problem with Ackerman and Malouf (2013) is that their heuristic distribution r offers no principled way to handle morphological irregularity: a form that is not related to the rest of its paradigm by affix substitution, e.g., a suppletive form, can receive no probability mass at all. We highlight the importance of this point because suppletive forms are certainly very common in academic English: the plural of binyan is binyanim and the plural of lemma is lemmata. It is unlikely that native English speakers have even a partial model of Hebrew and Greek nominal morphology, respectively, in their heads; a more plausible scenario is simply that these forms are learned by rote. As speakers and hearers are capable of producing and analyzing these forms, we should demand the same capacity of our models. We note that such restrictive assumptions are relatively common in the literature, e.g., Allen and Becker (2015)'s sublexical learner is likewise incapable of placing probability mass on irregulars.
Average Conditional Entropy versus Joint Entropy. Finally, we take issue with the formulation of paradigm entropy as average conditional entropy, as exhibited in equation (9). For one, it does not correspond to the entropy of any one joint distribution, as the product of the conditionals does not yield the joint; this denies the quantity a clean mathematical interpretation. Second, it is Priscian (Robins, 2013) in its analysis, in that any form can be generated from any other, which, in practice, will cause it to overestimate the i-complexity of a morphological system. Consider the German dative plural Händen (from Hand "hand"). Predicting this form from the nominative singular Hand is difficult, but predicting it from the nominative plural Hände is trivial: just add the suffix -n. In Ackerman and Malouf (2013)'s formulation, r(Händen | Hand) and r(Händen | Hände) both contribute to the paradigm's entropy, with the former raising the quantity. We believe this is suboptimal; as we have shown in §4, an entropy-based formulation of morphological complexity need not have this property, i.e., only one of the two conditional entropies must count towards the final entropy of the paradigm, as is the case in the maximum spanning arborescence.

Experiments
The crux of our experimentation is simple: we will plot e-complexity versus i-complexity over as many languages as possible, and then devise a numerical test of whether the low-entropy conjecture appears to hold.

Data and UniMorph Annotation
At the moment, the largest source of annotated full paradigms is the UniMorph dataset, which contains data that have been extracted from Wiktionary, as well as other morphological lexica and analyzers, and then converted into a universal format. A subset of UniMorph was used in the CoNLL-SIGMORPHON 2017 shared task on morphological inflection generation (Cotterell et al., 2017a). We use verbal paradigms from 23 typologically diverse languages, and nominal paradigms from 31 typologically diverse languages. These are the UniMorph languages that contain at least 500 distinct verbal or nominal paradigms. As the neural methods require a large set of annotated training examples to achieve high performance, it is difficult to use them in a lower-resource scenario.
Empirically Measuring i-Complexity. Here we follow the procedure from §3.1 and §4. That is, we partition the available paradigms into training, development and test sets. We train the factors of our generative model (§4.1) on the training set, selecting among potential model structures on the development set using Edmonds's algorithm (§4.2), and then evaluate i-complexity on the unseen test set (§3.1). Using held-out data in this way gives a fair estimate of the actual predictability (i-complexity) of the paradigms, which is why it is standard practice on most common NLP tasks, though less common in quantitative approaches to linguistic theory.

Figure 2: The y-axis is cross-entropy, an approximation to the paradigm entropy and a measure of i-complexity. The x-axis is the size of the paradigm, a measure of e-complexity. Both graphs overlay purple and green points, as discussed in §6.2. For concreteness, the purple points are models trained with the same number of paradigms observed at training time across languages, and the green points are models trained with the same number of slot-to-slot mappings observed at training time. The purple curve is the Pareto curve for the purple points, and the area under it is shaded in purple; similarly for green.
Empirically Measuring e-Complexity. The measurement of the e-complexity in this scheme is relatively straightforward. Following Ackerman and Malouf, we simply count the number of slots in the paradigm (for nouns or for verbs, as appropriate).

Experimental Details
For the i-complexity experiments, we split the full set of UniMorph nominal paradigms into train, development, and test sets as follows. We held out at random 50 full paradigms for the development set, and 50 others for the test set. To form the development and test sets, we include all pairwise mappings between inflected forms in each of the 50 paradigms, except the identity mapping.
We sampled D train from the remaining data. We tried two ways of doing this, which deal differently with the fact that different languages have a different number of slots per paradigm. Both regimes seemed reasonable, so we tried it both ways to confirm that the choice did not affect the qualitative results.
Equal Number of Paradigms (Purple). In the first regime, D_train (for each language) contains 600 paradigms chosen from the non-held-out data. We trained the reinflection model of §4.2 on all n² mappings from these paradigms. Henceforth, we will refer to this training regime as the purple scheme.
Equal Number of Pairs (Green). In the second regime, we trained the reinflection model of §4.2 on 60,000 (m_i, m_j) or (m_i, empty string) pairs sampled without replacement from the non-held-out paradigms. This matches the amount of training data across languages, but may disadvantage languages with large paradigms, since the reinflection model will see fewer examples of any individual mapping between paradigm slots. Henceforth, we will refer to this training regime as the green scheme.
Model and Training Details. We use the OpenNMT toolkit (Klein et al., 2017). We largely follow the recipe given in Kann and Schütze (2016), the winning submission to the 2016 SIGMORPHON shared task on inflectional morphology. Accordingly, we use a character embedding size of 300, and 100 hidden units in both the encoder and decoder. Our gradient-based optimization method was AdaDelta (Zeiler, 2012) with a minibatch size of 80. We trained for 20 epochs and selected the test model based on performance on the development set. We decoded using beam search with a beam size of 12.

Results and Analysis
Our results are listed in Table 2 and plotted in Figure 2, where each dot represents a language. We saw little difference between the green and the purple training schemes, though it was not clear a priori that this would be the case. The plots appear to show a clear trade-off between i-complexity and e-complexity. We now provide quantitative support for this impression by constructing a statistical significance test.
Visually, Ackerman and Malouf's low-entropy conjecture boils down to the claim that languages cannot exist in the upper right-hand corner of the graph, i.e., they cannot have both high e-complexity and high i-complexity. In other words, the upper right-hand corner of the graph is "emptier" than it would be by chance.
How can we quantify this? The Pareto curve for a multiobjective optimization problem shows, for each x, the maximum value y of the second objective that can be achieved while keeping the first objective ≥ x (and vice-versa). This is shown in Figure 2 as a step curve, showing the maximum i-complexity y that was actually achieved for each level x of e-complexity. This curve is the tightest non-increasing function that upper-bounds all of the observed points: we have no evidence from our sample of languages that any language can appear above the curve.
We say that the upper right-hand corner is "empty" to the extent that the area under the Pareto curve is small. To ask whether it is indeed emptier than would be expected by chance, we perform a nonparametric permutation test that destroys the claimed correlation between the e-complexity and i-complexity values. From our observed points {(x_1, y_1), . . . , (x_m, y_m)}, we can stochastically construct a new set of points {(x_1, y_σ(1)), . . . , (x_m, y_σ(m))}, where σ is a permutation of 1, 2, . . . , m selected uniformly at random. The resulting scatterplot is what we would expect under the null hypothesis of no correlation. Our p-value is the probability that the new scatterplot has an even emptier upper right-hand corner, that is, the probability that the area under the null-hypothesis Pareto curve is ≤ the area under the actually observed Pareto curve. We estimate this probability by constructing 10,000 random scatterplots.
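A sketch of this test (our own re-implementation of the procedure just described, not the paper's code; it assumes all x-values are positive, as paradigm sizes are):

```python
import random

def pareto_area(points):
    """Area under the tightest non-increasing step curve that
    upper-bounds all (x, y) points."""
    xs = sorted({x for x, _ in points})
    area, prev_x = 0.0, 0.0
    for x in xs:
        # Curve height on (prev_x, x]: max y among points with x-coord >= x.
        height = max(y for px, y in points if px >= x)
        area += (x - prev_x) * height
        prev_x = x
    return area

def permutation_test(points, trials=10_000, seed=0):
    """p-value: probability that randomly re-pairing the observed x's and
    y's yields an even emptier upper right-hand corner, i.e., a Pareto
    area <= the observed one."""
    rng = random.Random(seed)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    observed = pareto_area(points)
    hits = 0
    for _ in range(trials):
        perm = ys[:]
        rng.shuffle(perm)
        if pareto_area(list(zip(xs, perm))) <= observed:
            hits += 1
    return hits / trials
```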
In the purple training scheme, we find that the upper right-hand corner is significantly empty, with p < 0.017 and p < 0.045 for the verbal and nominal paradigms, respectively. In the green training scheme, we find that the upper right-hand corner is significantly empty with p < 0.042 and p < 0.034 in the verbal and nominal paradigms, respectively.

Future Directions
Learnability. Ackerman & Malouf's hypothesis is an interesting starting point for future work. It seems to be implicitly motivated by the notion that naturally occurring languages must be learnable. In other words, the intuition is that languages with large paradigms need to be regular overall, because in such a language, the average word type is observed too rarely for a learner to memorize an irregular surface form for it. Yet even in such a language, some word types are frequent, because some lexemes and some slots are especially useful. Thus, if learnability of the lexicon is indeed the driving force (rather than, say, description length of the lexicon; Rissanen and Ristad, 1994), then we should make the finer-grained conjecture that irregularity (unpredictability) will be better tolerated for the more frequently observed word types, regardless of paradigm size. Better yet, we should directly investigate whether naturally occurring inflectional systems are more learnable (at least by machine learning algorithms) than would be expected by chance. This is what one would predict if languages are shaped by natural selection or, more plausibly, by noisy transmission from each generation to the next (Hare and Elman, 1995).

Moving Beyond the Forms. The complexity of morphological inflections is only a small bit of the larger question of morphological typology. We have left many bits unexplored. In the realm of morphology, for instance, we have conditioned on the morpho-syntactic feature bundles. Ideally, we would like to explain the underlying mechanisms that give rise to these feature bundles and the distinctions they make.
In addition, our current treatment depends upon a paradigmatic treatment of morphology. While viewing inflectional morphology as paradigmatic is not controversial, derivational morphology is still often viewed as syntagmatic. Can we discover a quantitative formulation of derivational complexity? We note that paradigmatic treatments of derivational morphology have been offered: see Cotterell et al. (2017c) for a computational perspective and the references therein for theoretical positions and arguments.

Conclusions
We have provided a clean mathematical formulation of the enumerative and integrative complexity of inflectional systems, using tools from generative modeling and deep learning. With an empirical study of 36 typologically diverse languages, we have shown that there is a Pareto-style trade-off between e-complexity and i-complexity in morphological systems. In short, this means that morphological systems can either mark a large number of morpho-syntactic distinctions, as Finnish, Turkish, and other agglutinative and polysynthetic languages do, or they may have a high level of unpredictability, i.e., irregularity.
This trade-off is a bit different from other trade-offs in linguistic typology, in that a language is under no obligation to be morphologically rich: it may have both low e-complexity and low i-complexity. Carstairs-McCarthy (2010) has pointed out that languages need not have morphology at all, though they must have phonology and syntax.
Interestingly, NLP has largely focused on e-complexity. Our community views a language as morphologically complex if it has a profusion of unique forms, even if they are very predictable. The reason is probably our habit of working at the word level, so that all forms not found in the training set are out-of-vocabulary. However, as NLP moves to the character level, we will need other definitions of morphological richness. A language like Hungarian with almost perfectly predictable morphology may be easier to process than a language like German with an abundance of irregularity.