## Abstract

Steedman (2020) proposes as a formal universal of natural language grammar that grammatical permutations of the kind that have given rise to transformational rules are limited to a class known to mathematicians and computer scientists as the “separable” permutations. This class of permutations is exactly the class that can be expressed in combinatory categorial grammars (CCGs). The excluded non-separable permutations do in fact seem to be absent in a number of studies of crosslinguistic variation in word order in nominal and verbal constructions.

The number of permutations that are separable grows in the number *n* of lexical elements in the construction as the Large Schröder Number *S*_{n−1}. Because that number grows much more slowly than the *n*! number of all permutations, this generalization is also of considerable practical interest for computational applications such as parsing and machine translation.

The present article examines the mathematical and computational origins of this restriction, and the reason it is exactly captured in CCG without the imposition of any further constraints.

## 1. Introduction

Permutation of elements in a construction, either within a single language or across languages, creating discontinuous semantic dependencies, is a widespread characteristic of natural languages that has motivated transformational rules (Chomsky 1957, passim), among other expressive grammar formalisms. Steedman (2020) proposes as a formal universal of natural language grammar that grammatical permutations are limited to a class known to mathematicians and computer scientists as the “separable” permutations, which are those orders over which a tree can be constructed such that all leaves descending from any node form a continuous subset *i*…*j* of the original ordered set 1, …, *n*.

The evidence for this claim is based on a number of studies of two data sets of alternations in two major constructions. The first concerns the permutations of the four elements Det(erminer) Num(erator) Adj(ective) N(oun) that surface in that order in English noun phrases (NP) like “These five young lads,” but in Yoruba and many other languages are found in the mirror-image permutation. The crosslinguistic variation in this construction has been studied by Cinque (2005) and Abels and Neeleman (2012) (among others) for languages with a single dominant order, and also within a single language with very free word order for the same construction by Nchare (2012). The second construction is the verb group, also consisting of four elements, which we can think of mnemonically as the verb sequence “must have been dancing,” and whose variations with various verb types, both across and within the various Germanic languages and dialects, have been studied by Wurmbrand (2004; 2006) and Abels (2016).

In both constructions, the attested orders fall on highly skewed Zipfian power-law distributions, with the rarest orders found in only a very few cases. Such distributions make it impossible to decide whether further unattested orders are entirely excluded by hard constraints on the universal space of possible grammars, or merely disfavored by soft constraints such as “Harmony” (proposed by Culbertson, Smolensky, and Legendre 2012, among others considered below) that may make them vanishingly rare in samples of the size that we can actually access (Abels and Neeleman 2012). Of course, a further possibility is that the unattested orders arise from a combination of hard and soft constraints.

The present study defers to such authors on the specific characteristics of the soft constraints and the distributions themselves, and considers only the nature and origin of a specific hard constraint on the formal operations that define the set of permutations that such distributions can apply to in the first place.

Both the nominal and verbal constructions concerned can be thought of in grammatical terms as made up of elements or “categories” with a “natural order of dominance” stemming from their semantics. For these constructions, that order of dominance happens in English to be the same as their left-to-right order on the page, inducing a right-branching tree-structure. However, in other languages, the same relations of dominance allow all of the word orders that would arise if the same tree were treated as a mobile, with daughters allowed to rotate around each other, as well as some word orders involving apparent displacements that go beyond rotation. While the constructions of interest allow only one natural order of dominance, other sets of categories including multiple words of the same category may allow more than one natural order to emerge. (For example, the set consisting of the words “divorced, alleged, spy” has two natural orders of dominance, the other one corresponding to a distinct sense “alleged divorced spy.”)

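We can therefore abstract over the two constructions of interest as being made up of elements numbered 1 > 2 > 3 > 4 according to that natural order of dominance, under the following correspondence:

- NP: 1 = Det (“these”), 2 = Num (“five”), 3 = Adj (“young”), 4 = N (“lads”)
- VP: 1 = “must,” 2 = “have,” 3 = “been,” 4 = “dancing”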

There is widespread suspicion that not all of the four-factorial permutations of these constructions are allowed (Williams 2003; Svenonius 2007).

In considering whether a given permutation is allowed for the NP construction, Cinque (2005; 2010) emphasizes the need to exclude from consideration word orders for the NP in which the adjective is extraposed or has the markers of a relative clause. The exclusion is crucial, since such modifiers may be adjoined to the entire NP, rather than to N, thereby defining a *different* natural order of dominance, with the extraposed element as 1. Cinque rejects a number of the orders claimed by Dryer (2018) on this ground. The present paper follows Cinque strictly concerning the attested fixed NP orders.

Cinque also proposes to exclude free word-order alternation of the kind investigated by Nchare (2012) for the Shupamem NP, on the grounds that alternating orders reflect distinctions of focus. Accounts of focus alternations have sometimes assumed the existence of an FP-like focus-phrasal external functional projection, offering an extra target for a different variety of movement resembling extraposition (Szabolcsi 1983, 1994; Giusti 2006). However, this is an additional and somewhat theory-internal assumption. English achieves similar alternations of focus or contrast in the NP by in situ accent placement without varying word order, for which Steedman (2014) offers a lexicalized alternative semantics without focus-movement or focus-projection, other than by surface-compositional syntactic derivation. We therefore depart from Cinque on this point, and provisionally accept Nchare’s free orders as arising from the same mechanism as Cinque’s.

The orders attested in these three studies for the NP and VP constructions can then be summarized as in Figure 1, in which the alphabetic ordering of the permutations (which will be used for reference throughout this article) follows Cinque. (Abels 2016 adopts a different convention.) Cinque (2007 n.13) notes the possibility that Figure 1(m) constitutes a fifteenth order for the NP, attested for only one language so far, Dhivehi (Maldivian; Cain 2000), indicated by parentheses in the figure.^{1}

We note at this point that Dryer (2018) claims several further word orders as basic for NP in fixed-order languages, in addition to those listed by Cinque. In particular, he claims (2018: pages 17, 29) that Kilivila (Senft 1986) and four other languages have order (g), which is excluded by the theory developed below, as their default order. However, Cinque argues that Senft’s example of this order involves adjectival extraposition, noting that Senft (1986, page 96) also cites the canonical order (a) for the full four-element construction itself. Of the other four languages cited by Dryer, he himself notes for Yapese (Jensen 1977) that the adjective in his example of order (g) is marked as a relative clause including the copula, hence arguably also extraposed from NP. Of the remaining three languages for which Dryer claims this order, Katu (Costello 1969, page 22) does not lexically distinguish demonstratives from locatives, but the one example Costello gives (1969, page 34 (87)) involving both an adjective and a demonstrative locative has the canonical reverse order N Adj Dem. The example given by Tryon (1967, page 60) for Dehu/Drehu includes the copula with the adjective, so is arguably also extraposed.

In the lexically multifunctional language Teop (Mosel 2017), to which Dryer also attributes (g) as base order, adjectives are expressed as adjoined adjectival phrases, with their own copies of the article or agreement and numerator, so that a strong possibility of extraposition or apposition clearly exists here also.

In terms of the CCG theory developed below, adjective extraposition requires the addition of a distinct NP-adjunct category, syntactically and semantically non-isomorphic to the standard adjective, inducing a *different* Natural Order of Dominance over the four elements, with the extraposed adjectival as 1, the demonstrative as 2, the numerator as 3, and the noun as 4. Thus, none of these languages constitutes a counterexample to the claim below that order (g) is universally excluded for the standard categories. We have strictly followed Cinque’s account of the attested NP orders, and exclude several other orders listed by Dryer whose attestation appears to depend on adjective extraposition, including (h).

Abels (2016) gives only qualified support for four of the orders that he admits for the Germanic verb group, namely Figure 1f, h, m, and s, in some cases because they are attested only as alternates. These orders are similarly marked by parentheses.

It is striking that just two of the 24 permutations over these two constructions are entirely without even equivocal support for attestation in these three studies, namely, Figure 1g and j—that is, 2413 and its mirror-image 3142. These are the permutations that would give rise to the following word orders for the two constructions:

The above examples of permutations that in CCG are neither allowed as base orders nor as alternates for the two constructions include lexical types indicating the dominance relations among these elements, using an order-free categorial notation, in which a category of the form *X*|*Y* indicates an element that combines with a sister constituent of type *Y* adjacent to the right or to the left to yield a constituent of type *X*.^{2}

These categories immediately offer a hint as to a hard constraint forcing the two orders 2413 and its mirror-image 3142 to remain unattested. A formal proof is set out below, but it is obvious from inspection of (2) and (3) that these are the only two permutations in which *no category of the form X|Y is adjacent to a category Y, or to a category Y|Z that could yield Y as its result*.
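The adjacency observation can be checked mechanically. The following is a minimal sketch, assuming the abstraction introduced above in which element *i* selects element *i* + 1, so that two neighbors can combine only if their numbers differ by one. The function name `has_combinable_neighbors` is ours and purely illustrative.

```python
from itertools import permutations

def has_combinable_neighbors(order):
    """True if some adjacent pair of elements is i, i+1 (in either order)."""
    return any(abs(a - b) == 1 for a, b in zip(order, order[1:]))

# Orders in which no element is adjacent to the element it selects.
stuck = [p for p in permutations((1, 2, 3, 4))
         if not has_combinable_neighbors(p)]
print(stuck)  # [(2, 4, 1, 3), (3, 1, 4, 2)]
```

The check concerns only the first combination step, which is all that is needed to see that a strictly contiguous derivation cannot even begin for these two orders.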

It is also interesting that the two permutations 2413 and its mirror-image 3142 are distinguished by mathematicians and computer scientists as the only two permutations on the order 1234 that are *non-separable* (Avis and Newborn 1981; Bose, Buss, and Lubiw 1998). This observation motivated Steedman (2020) to hypothesize that for a fixed natural order of dominance of arbitrary length all permutations generated by CCG are within the set of *separable permutations*.

The main contribution of the present article is to show that this hypothesis indeed holds for sentences of arbitrary length. We also establish a close connection between CCG derivations and *separating trees*, a connection between CCG parsing and the number of paths in a lattice, and the relation of this result to other grammar formalisms, namely, ITG (Wu 1996), TAG (Joshi 1985), MG (Stabler 1996), and CAT (Williams 2003).

The paper is structured in the following way. In Section 2 we give a short overview of CCG. In Section 3 we give a more specific definition of natural order of dominance and prove that Steedman’s (2020) hypothesis is true. Section 4 shows an alternative, possibly more intuitive, way of enumerating the permutations that CCG allows, first by establishing the isomorphism between the normal forms of CCG derivations and *separating trees*, and then by showing the connection between CCG shift-reduce parsing and the Large Schröder Number of paths in a lattice on the Cartesian plane. Section 5 establishes the statistical significance of the fact that the two non-separable permutations that are *predicted* to be unattested are among the vanishingly small number of permutations that are in fact empirically unattested so far. Section 6 puts this result in the context of other grammar formalisms and their expressiveness. Appendix A gives CCG derivations for all 22 separable permutations of the attested NP and VP word orders.

## 2. Combinatory Categorial Grammar

Combinatory Categorial Grammar (CCG) is a radically lexicalized theory of grammar in which all language-specific syntactic and semantic information concerning word order and subcategorization is specified in lexical entries or *categories*, and is projected onto the sentences of the language by universal rules that are *combinatory*, in the sense that they apply to pairs of non-empty, strictly contiguous, categories. The latter property immediately guarantees that no CCG can recognize the unattested permutations in (2) and (3).

### 2.1 The Categorial Lexicon

The lexical fragment for the very common English NP order in CCG is a *directional* categorial grammar, in which all instances of | are instantiated as /, meaning that they have to combine with an element to the right, thus:^{3}

Slashes identify categories of the form *X*/*Y* and *X*∖*Y* as *functions* taking an argument of syntactic type *Y* to the right or to the left, and as yielding a result of type *X*.

By contrast, the following lexical fragment defines the even more common mirror-image word order glossed as “Lads young five these,” as required, for example, for Yoruba (Hawkins 1983, page 119).

### 2.2 Rules of Function Application

The universally available rules (6) of syntactic combination called **forward** and **backward application** (respectively labeled > and <) allow syntactic derivation from such lexicons.

The type ★ of the slashes in *X*/_{★}*Y* and *X*∖_{★}*Y* limits the categories to which these rules can apply, and can be ignored for the moment.
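In the usual CCG notation, the two application rules can be written as follows. This is a sketch of the standard formulation, which we assume corresponds to (6a) and (6b) respectively, with ★ marking slashes that permit application.

```latex
\begin{align*}
X/_{\star}Y \quad Y \;&\Rightarrow\; X && (>)\\
Y \quad X\backslash_{\star}Y \;&\Rightarrow\; X && (<)
\end{align*}
```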

The forward rule (6a) allows the following derivation for the English lexicon (4):

The rightward arrow > on all combinations in (7) indicates that it is the rightward functional application rule (6a) that has applied in these cases.^{4}

The derivation in (7) is clearly isomorphic to the standard linguistic context-free phrase structure *tree*, except that it is the right way up—that is, its leaves are at the top and its root at the bottom.

Together with the rightward and leftward application rules (6) for combination to the right and left, such categories immediately generate eight of the 24 permutations, as follows:

Cinque (2005) gives counts for each word order in his sample of 700 or so languages, quantized to four levels. The present counts are updated according to Cinque’s 2013 sample of around 1,500 languages, with “very few” corresponding to 10–29, “few” to 30–99, “many” to 100–299, and “very many” to 300+ (cf. Merlo 2015, Table 1). The discontinuous alpha-numeration in (8) again reflects the place of these orders in Cinque’s ordering of the 24 permutations of these elements introduced earlier in Figure 1, which we take as standard. All eight orders are among those endorsed by Cinque and Nchare.^{5}

The above eight orders are the ones obtained by treating the tree (7) as a mobile, allowing sisters to rotate around each other. The same eight orders are similarly available for the Germanic serial verb construction, although for these orders we have no meaningful frequency estimates:
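The mobile metaphor is easy to make concrete. The sketch below assumes the right-branching tree over elements 1 to 4 and enumerates every leaf order reachable by swapping sisters; the tuple encoding of trees and the name `rotations` are illustrative choices of ours, not part of the theory.

```python
def rotations(tree):
    """Yield every leaf order obtainable by swapping sisters at any node.

    A tree is either an int (a leaf) or a pair (left, right).
    """
    if isinstance(tree, int):
        yield (tree,)
        return
    left, right = tree
    for l in rotations(left):
        for r in rotations(right):
            yield l + r   # sisters in their original order
            yield r + l   # sisters rotated around each other

# Right-branching tree for the natural order of dominance 1 > 2 > 3 > 4.
mobile = (1, (2, (3, 4)))
orders = sorted(set(rotations(mobile)))
print(len(orders))  # 8
print(orders)
```

Each of the three binary nodes contributes an independent choice of order, which is why exactly 2³ = 8 orders result.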

Again all of these orders are attested in Germanic, although Abels’s endorsement of (s) is equivocal, on the grounds that it is only attested as an alternate in his sample.

It is striking that all possible orders based on purely applicative derivations appear to be attested for both constructions, and that all of the NP orders admitted by Cinque (2013) with “very many” or “many” attestations are among them, suggesting soft constraints favoring “harmony” or consistent directionality across categories and purely applicative derivations.^{6}

However, the other 14 orders attested by the authorities in Figure 1 involve discontinuities. For example, in Figure 1c, in its instantiation as the Masai NP order “These lads five young,” element 3, the adjective “young,” is separated from element 4, the noun “lads,” by element 2, the numerator “five.” How do we allow the attested orders without also allowing the unattested orders in (2) and (3)?

To do so, something more than the simple application rules is required. Cinque and others propose transformational movement as that “something more.” The present approach offers an alternative to movement, or any other syntactic operation of “action at a distance” over non-contiguous elements.

### 2.3 Rules That Change Word Order

CCGs also include universally available rules of functional composition, strictly limited in the first-order case to the following four combinatory rules:^{7}

The types × and ◇ of the slashes in *X*/_{×}*Y* and *X*/_{◇}*Y* limit the categories to which these rules can apply.
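In the usual CCG notation, the four first-order composition rules can be written as follows. This is a sketch of the standard formulation; we assume that the first two (harmonic) rules correspond to (10) and the last two (crossing) rules to (11), with ◇ and × typing the slash of the primary function.

```latex
\begin{align*}
X/_{\diamond}Y \quad Y/Z \;&\Rightarrow\; X/Z && (>\mathbf{B})\\
Y\backslash Z \quad X\backslash_{\diamond}Y \;&\Rightarrow\; X\backslash Z && (<\mathbf{B})\\
X/_{\times}Y \quad Y\backslash Z \;&\Rightarrow\; X\backslash Z && (>\mathbf{B}_{\times})\\
Y/Z \quad X\backslash_{\times}Y \;&\Rightarrow\; X/Z && (<\mathbf{B}_{\times})
\end{align*}
```

In each rule the two inputs are strictly adjacent, and the directionality of the argument *Z* is carried over unchanged from the secondary function to the result.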

In the full theory, these rules are generalized to “second level” cases, in which the secondary function is of the form (*Y*|*Z*)|*W* such as the following “forward crossing” instance, in which | matches either / or ∖ in both input and output:

The combination of crossing rules and generalized composition is the source of (slightly) greater than context-free expressive power in CCG, allowing analyses of trans-context-free constructions like Germanic crossed dependencies (Bresnan et al. 1982; Steedman 1985, 2000).

All syntactic rules in CCG are subject to a generalization called the Combinatory Projection Principle (CPP), which says that rules must apply consistent with the directionality specified on the primary function *X*|*Y*, and must project unchanged onto their result the directionality of any argument(s) specified on the secondary function *Y*|*Z*:^{8}

Because directionality is originally specified in the lexicon, this principle amounts to the statement that syntax cannot override the word order specified there.

In particular, this principle universally excludes rules like the following from CCG:

The same principle excludes all movement, copying, deletion-under-identity, or other action-at-a-distance, all structure-changing operations such as “restructuring,” “reanalysis,” or “reconstruction,” and all “traces” and other syntactic empty categories, and makes syntactic derivation strictly type-dependent, rather than structure-dependent.

The types ◇ and × on the slashes on the primary function *X*|*Y* in the composition rules (10) and (11), like the type ★ on the application rules (6), allow categories to be lexically restricted as to which of the rules can apply to them or to their projections under the CPP. The absence of specific slash-typing on the secondary function *Y*|*Z* is an abbreviation meaning that it schematizes over all slash types. However, the CPP (13) requires that the corresponding slash type on the argument |*Z* in the result *X*|*Z* is the *same* slash type.

This slash-typing system differs from the one used by Baldridge and Kruijff (2003) in that types are not organized into a type hierarchy. Avoiding a slash-type hierarchy allows a stricter specification of which rules can be applied. The type we now write as *X*/_{◇★}*Y* was written following Steedman (2005) and Baldridge and Kruijff (2003) as *X*/_{◇}*Y*, and *X*/_{×★}*Y* was written *X*/_{×}*Y*.^{9}

The inclusion of the harmonic composition rules (10) allows some additional derivations, and supports a variety of “non-constituent” coordinations (Steedman 1985; Dowty 1988), of which the following is a simple example:^{10}

The crossing composition rules (11), unlike the harmonic rules (10), have a reordering effect that is relevant to the present discussion. For example, in English they allow a non-movement-based account of the “Heavy NP Shift” construction, thus:

It will be obvious from the above derivation that allowing the crossing composition rules (11) to apply to unrestricted categories induces *alternation of word order*, as here between the Heavy-Shifted order in (16) and the canonical order *I will buy a book tomorrow*. Steedman (2020) shows that such word-order alternations can be excluded for fixed NP word-order languages of the kind considered by Cinque by restricting the slash type of the functor categories in the lexicon as either × (“only crossing-compose”) or ★ (“only apply”).

If, as in the alternation in English between (15) and the canonical order, we want a category to combine by *both* forward harmonic composition *and* forward application, then we assign a lexical category of the form *X*/_{◇★}*Y*, bearing the union of ◇ and ★ slash types, as in derivation (15). If we want all three rule-types to apply to a forward category, to allow free word order, then we assign it the union of all three slash types *X*/_{◇×★}*Y*, which to save space and maintain compatibility with earlier notations we write as the universal forward slash *X*/*Y* as in (16). However, such details of language-specific fine-tuning of grammars can be ignored for the present purposes.

## 3. The Generalization Concerning the Permutations Allowed by CCG

In this section we prove that the earlier claim that CCG generates only separable permutations holds for constructions of arbitrary length. Section 3.1 extends the definition of natural order of dominance to an arbitrary number of elements and shows how the natural order of dominance follows from the CCG lexicon that defines constructions. Section 3.2 then defines the notion of separable permutations. Sections 3.3 and 3.4 define the proposition more formally and prove its correctness. Section 3.5 shows how this result applies to the standard version of CCG including lexical type-raising and the full range of combinatory rules.

### 3.1 Natural Order of Dominance

The notion of natural order of dominance has been used up until this point informally, in the context of the categories defining the four core elements of the NP and VP constructions that point the way to a generalization. To see the generalization to other constructions of arbitrary length, a more formal definition is needed.

We define the Natural Order of Dominance (NOD) over categories of type *X*|*Y* where both *X* and *Y* are categories of first order, as follows. A word *a*_{i} with a category *X*|*Y* *immediately dominates* a word *a*_{j} with category *Y*|*Z*.^{11} A NOD over *n* words is then defined as an ordering *a*_{1}, …, *a*_{n} such that *a*_{i} immediately dominates *a*_{i+1} for all *i* < *n*. (There may be more than one NOD if there is more than one word of the same type.)

Because every word’s category can select for only one argument category, and because each word *a*_{i} in a NOD immediately dominates word *a*_{i+1}, that NOD defines a total ordering of words paired with their categories. We can write any NOD as a sequence of categories *X*_{1}|*X*_{2}, *X*_{2}|*X*_{3}, … *X*_{n−1}|*X*_{n}, *X*_{n}|*X*_{n+1} where each category *X*_{i}|*X*_{i+1} is a category of a word taking position *i* in the NOD.
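The definition can be read as a simple chaining procedure over an order-free lexicon. The following is a minimal sketch under that reading: the encoding of a category *X*|*Y* as a pair `(result, argument)`, the use of `None` for an atomic final element, and the function name `natural_orders` are all our own illustrative assumptions.

```python
def natural_orders(lexicon):
    """Return every ordering of the words in which each word's argument
    category is the result category of the word that follows it.

    `lexicon` maps a word to a pair (result, argument) standing for the
    order-free category result|argument; argument is None if atomic.
    """
    def extend(chain, remaining):
        if not remaining:
            yield tuple(chain)
            return
        for w in remaining:
            # w may follow the chain if the last word selects w's result type.
            if not chain or lexicon[chain[-1]][1] == lexicon[w][0]:
                yield from extend(chain + [w], remaining - {w})

    return list(extend([], frozenset(lexicon)))

np_lexicon = {
    "lads":  ("N", None),      # atomic final element
    "young": ("N'", "N"),
    "these": ("NP", "NumP"),
    "five":  ("NumP", "N'"),
}
print(natural_orders(np_lexicon))  # [('these', 'five', 'young', 'lads')]
```

A lexicon containing two words of the same type, such as two modifiers that both take and return the same category, would yield more than one ordering, in line with the parenthetical remark in the definition.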

In the previous sections we have invoked NOD over the elements of noun and verb phrases with categories: *NP*|*NumP*, *NumP*|*N*′, *N*′|*N*, *N* and *VP*_{1}|*VP*_{2}, *VP*_{2}|*VP*_{3}, *VP*_{3}|*VP*_{4}, *VP*_{4}. These sequences match the generalized definition of NOD except for the last element being atomic *X*_{n} instead of a function category *X*_{n}|*X*_{n+1}. To allow such sequences, we extend the definition of NOD to allow the last element of the NOD, *X*_{n}|*X*_{n+1}, to be an atomic *X*_{n}. Because |*X*_{n+1} will never be canceled, it makes no difference to NOD whether it is present or not. In the proofs we will use the *X*_{n}|*X*_{n+1} form for convenience.

The linearization of these *n* words that is consistent with their NOD will be called their **Canonical Linearization** (CL).

### 3.2 Separable Permutations and Separating Trees

Separable permutations were first formulated in algorithmic form by Avis and Newborn (1981) and have since then been rediscovered in different (but formally equivalent) forms such as bootstrap percolation in permutation matrices (Shapiro and Stephens 1991), permutations avoiding the “forbidden” patterns 2413 and 3142 (West 1996), and inversion transduction grammar (ITG) permutations (Wu 1996, among others).

The pattern-avoiding definition of West (1996) can be stated in the following way: The permutation *π* is separable if and only if there are no digits *a* < *b* < *c* < *d* that appear in *π* as *b* … *d* … *a* … *c* or *c* … *a* … *d* … *b*. For example, 4 1 6 3 5 2 is not separable because the digits 3 < 4 < 5 < 6 form the forbidden pattern 4 … 6 … 3 … 5. Notice that the elements of the forbidden pattern do not need to be immediately next to each other: They are forbidden even when discontiguous.
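The pattern-avoiding definition translates directly into a brute-force test. The sketch below examines every four-element subsequence, so it runs in time proportional to *n*⁴ and is meant only as an illustration of the definition; the names `FORBIDDEN`, `pattern`, and `is_separable` are ours.

```python
from itertools import combinations

FORBIDDEN = {(2, 4, 1, 3), (3, 1, 4, 2)}

def pattern(values):
    """Relative order of distinct values, e.g. (4, 6, 3, 5) -> (2, 4, 1, 3)."""
    ranks = {v: r for r, v in enumerate(sorted(values), start=1)}
    return tuple(ranks[v] for v in values)

def is_separable(perm):
    """True if no four (possibly discontiguous) entries form 2413 or 3142."""
    # combinations() keeps the entries in their left-to-right order in perm.
    return all(pattern(sub) not in FORBIDDEN for sub in combinations(perm, 4))

print(is_separable((4, 1, 6, 3, 5, 2)))        # False: 4 ... 6 ... 3 ... 5
print(is_separable((4, 3, 1, 2, 5, 8, 6, 7)))  # True
```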

Bose, Buss, and Lubiw (1998) define the separable permutations in terms of *separating trees*, which we define for present purposes as follows:

For example, a permutation such as 4 3 1 2 5 8 6 7 over the range [1 … 8] can be split into two contiguous sub-sequences over continuous ranges, namely: 4 3 1 2 over range [1 … 4] and 5 8 6 7 over range [5 … 8], with the resulting binary node labeled +. If we continue to recursively factorize contiguous sub-sequences over continuous ranges, we form the separating tree shown in Figure 2.

It is important for what follows to note that the identities of the leaves of a separating tree are redundant, in the sense that they are fully determined by the tree structure and + / − labels on the non-terminal nodes.

Bose, Buss, and Lubiw (1998) show that labels on separating trees define a sorting operation that reorders the children based on their ranges: If the label on a node is + then the range in the canonical linearization covered by the left child is ordered before that of the right child, and the existing order and label + is retained. If the label is −, then the CL range covered by the left child is ordered after that of the right, and the − operator is applied to restore the canonical order of its children, changing the label to +. This sorting procedure is exemplified in the series of transformations of the trees in Figure 2 that swap children of all nodes labeled with − until the canonical order results.
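The recursive factorization and the sorting operation can be sketched as follows. This is a simple quadratic illustration rather than the linear-time algorithm of Bose, Buss, and Lubiw; the tuple encoding of trees and the names `separating_tree` and `sort_leaves` are our own assumptions. Because a separable permutation may have several separating trees, the tree returned here need not have the same shape as the one in Figure 2.

```python
def separating_tree(perm):
    """Return a separating tree for `perm`, or None if none exists.

    Leaves are ints; internal nodes are (label, left, right) with label
    '+' if the left child's range precedes the right child's, else '-'.
    """
    if len(perm) == 1:
        return perm[0]
    for k in range(1, len(perm)):
        left, right = perm[:k], perm[k:]
        # Each contiguous half must cover a continuous range of values.
        if max(left) - min(left) != len(left) - 1:
            continue
        if max(right) - min(right) != len(right) - 1:
            continue
        lt, rt = separating_tree(left), separating_tree(right)
        if lt is not None and rt is not None:
            return ('+' if max(left) < min(right) else '-', lt, rt)
    return None

def sort_leaves(tree):
    """Swap the children of every '-' node and read off the leaves."""
    if isinstance(tree, int):
        return (tree,)
    label, left, right = tree
    l, r = sort_leaves(left), sort_leaves(right)
    return l + r if label == '+' else r + l

tree = separating_tree((4, 3, 1, 2, 5, 8, 6, 7))
print(tree[0])                         # '+': the root splits [1..4] from [5..8]
print(sort_leaves(tree))               # (1, 2, 3, 4, 5, 6, 7, 8)
print(separating_tree((2, 4, 1, 3)))   # None
```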

The restriction of separating trees to binary structures limits the permutations that they can cover. For instance, we cannot create a separating tree for the permutation 2 4 1 3 because it cannot be split into contiguous subtrees covering continuous ranges.

Bose, Buss, and Lubiw (1998) show that the set of permutations that is described by separating trees is precisely the set of separable permutations. In other words, every separable permutation can be mapped to a separating tree and every separating tree can be mapped to a separable permutation.

Although it is clear that every separating tree maps to a unique separable permutation, some separable permutations can be mapped to multiple different separating trees. We will return to this point in Section 4.

### 3.3 The Claim

We observe that there is a direct relation between CCG derivations and separating trees. The intuition is clear from Figure 3: All CCG derivations for the canonical linearization of the five-element NOD, exemplified in Figure 3a, consist entirely of forward combinations, while the isomorphic separating trees for the canonical linearization, exemplified in Figure 3b, consist entirely of positive + nodes (cf. Figure 2d). Similarly, CCG derivations for non-canonical linearizations, exemplified in Figure 3c, include backward combinations, which correspond to − nodes in the isomorphic separating tree in Figure 3d.^{12}

We seek to prove that the following claims hold for any NOD of *n* elements:

1. CCG can derive *only* separable permutations of the canonical linearization.
2. CCG can derive *all* separable permutations of the canonical linearization.
3. The cardinality of the set of permutations derivable by CCG is bounded by *S*_{n−1}, the (*n* − 1)th element of the Large Schröder Number series (OEIS series A006318).
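To give a sense of the bound, the following sketch computes the first few Large Schröder Numbers with a standard three-term recurrence and prints *S*_{n−1} next to *n*!. The function name `large_schroeder` is ours; the recurrence is a well-known property of the series and is not taken from the present article.

```python
from math import factorial

def large_schroeder(count):
    """Return [S_0, ..., S_{count-1}] of the Large Schröder series."""
    s = [1, 2]
    for n in range(2, count):
        # Recurrence: (n + 1) S_n = 3 (2n - 1) S_{n-1} - (n - 2) S_{n-2}
        s.append((3 * (2 * n - 1) * s[n - 1] - (n - 2) * s[n - 2]) // (n + 1))
    return s[:count]

schroeder = large_schroeder(10)
for n in range(1, 11):
    # Columns: n, the bound S_{n-1}, and the number n! of all permutations.
    print(n, schroeder[n - 1], factorial(n))
```

For *n* = 4 the bound is *S*_{3} = 22 against 4! = 24, matching the two excluded orders 2413 and 3142.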

We begin by proving three helpful lemmas. To prove the first of these lemmas we use **strong mathematical induction**, sometimes also called *complete induction*. To prove that some proposition *P*(*n*) holds for all natural numbers using strong induction it is sufficient to prove the following statements:

- (i) base case: *P*(1), and
- (ii) inductive step: [*P*(1) ∧ *P*(2) ∧ ⋯ ∧ *P*(*k*)] ⇒ *P*(*k* + 1) holds for all *k* ∈ ℕ.

Weak and strong induction differ only in the inductive hypothesis: weak induction assumes *P*(*k*), whereas strong induction assumes *P*(1) ∧ ⋯ ∧ *P*(*k*). Both are equally sound proving strategies, but strong induction is more convenient for proving propositions about trees. A more detailed distinction between weak and strong induction is available in many textbooks that cover proof methods (Eccles 1997; Liebeck 2010; Gunderson 2014).

### 3.4 The Proof

**Lemma 1** For a fixed NOD of *n* elements with canonical linearization *X*_{1}|*X*_{2} *X*_{2}|*X*_{3} … *X*_{n}|*X*_{n+1}:

- (i) CCG derivations can form all possible contiguous binary branching trees;
- (ii) Every node that spans from *i* to *j* in every tree derives a category *X*_{i}|*X*_{j+1}.

*Proof*. *Base case*: For *n* = 1 we have a single category in the canonical linearization *X*_{1}|*X*_{2}. This is the only possible tree and it is binary in a trivial sense. Its only node spans from 1 to 1 deriving category *X*_{1}|*X*_{1+1}.

*Inductive step*: Assuming the strong inductive hypothesis holds for all 1 ≤ *k* ≤ *n*, we prove that it also holds for *n* + 1. We analyze the canonical linearization *X*_{1}|*X*_{2} *X*_{2}|*X*_{3} … *X*_{n+1}|*X*_{n+2} by splitting it into two contiguous parts where the first one is *X*_{1}|*X*_{2} … *X*_{k}|*X*_{k+1} and the second one is *X*_{k+1}|*X*_{k+2} … *X*_{n+1}|*X*_{n+2} for any *k* ∈ [1 … *n*]. Each of these two parts forms a smaller NOD with length ≤ *n*. From the inductive assumption it follows that for both of these parts CCG can form all possible binary branching trees, and that all nodes spanning *i* to *j* within these parts will be labeled with *X*_{i}|*X*_{j+1}. That means that the root node of the left and right part will be *X*_{1}|*X*_{k+1} and *X*_{k+1}|*X*_{n+2}, respectively. Now we can combine the left and right part into the whole structure using forward function composition >**B** to give us a new node *X*_{1}|*X*_{n+2}. This node satisfies the labeling condition of the lemma. The binary branching condition is also satisfied because we put no constraints on the size of the spans of the left and the right part (as long as they are non-empty), and by the inductive assumption, each part can derive all possible binary trees.□
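The construction in the proof can be replayed for small *n*. The sketch below enumerates every binary derivation of the canonical linearization by forward composition and asserts the labeling condition at each node; it is a sanity check under our own encoding, in which the pair `(a, b)` stands for the category *X*_{a}|*X*_{b}, and is not a substitute for the proof.

```python
def derivations(i, j):
    """Yield all binary derivations over canonical positions i..j.

    A derivation is (category, children); children is () for a word.
    """
    if i == j:
        yield ((i, i + 1), ())
        return
    for k in range(i, j):                      # split point, as in the proof
        for left in derivations(i, k):
            for right in derivations(k + 1, j):
                (a, b), (c, d) = left[0], right[0]
                assert b == c, "composition needs X_a|X_b followed by X_b|X_d"
                assert (a, d) == (i, j + 1), "labeling condition of Lemma 1(ii)"
                yield ((a, d), (left, right))

trees = list(derivations(1, 5))
print(len(trees))               # 14 binary trees over five leaves
print({t[0] for t in trees})    # {(1, 6)}
```

The count of 14 is the Catalan number of binary bracketings of five leaves, so no binary tree over the canonical linearization is missed.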

**Lemma 2** All word orders that can be derived by CCG for a given NOD can also be derived by an isomorphic separating tree that reorders the same permutation into the canonical linearization for that NOD.

*Proof*. It is sufficient to show that any CCG derivation tree corresponds to an isomorphic separating tree over the same permutation.

Consider any arbitrary CCG derivation tree for a canonical linearization of a NOD of length *n*. Example: One such derivation for the NOD for *n* = 5 is in Figure 3a.

From Lemma 1, we know that all nodes with span *i* to *j* will have label *X*_{i}|*X*_{j+1} and all binary nodes will use forward function composition >**B**. We can derive the same order with an isomorphic separating tree where instead of using CCG forward combination we have the + operator at every node, because the children are in the canonical order. Example: Figure 3b.

Non-canonical linearization in CCG requires backward combinations. For any node in a CCG derivation for the canonical linearization whose left child has category *X*_{i}|*X*_{j+1} and right child category *X*_{j+1}|*X*_{k+1} we can obtain the inverse order by reversing the directionality of the combination and reversing the order of the child nodes, making *X*_{j+1}|*X*_{k+1} precede *X*_{i}|*X*_{j+1}, so that all words in the range *j* + 1 to *k* in the canonical order precede words *i* to *j*. Example: Figure 3c shows this reordering effect for the two nodes where forward is replaced by backward combination.

Any CCG derivation reordered in this way corresponds to an isomorphic separating tree in which the unrotated nodes corresponding to forward combination are relabeled as + and the rotated nodes corresponding to backward combination are relabeled as −. Example: Figure 3d.

By the result of Bose, Buss, and Lubiw discussed in Section 3.2, applying the − operator to convert all such nodes to + and invert the linear order of their children reverses the process to recover the canonical linearization of which this is a separable permutation.□
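The correspondence between rotated derivations and separating trees can be made concrete with a small data structure. The sketch below is a minimal illustration under our own assumptions: trees are stored in their canonical (unrotated) orientation, the class names `Leaf` and `Node` and both functions are hypothetical, and a node labeled − is read as a rotated node whose children surface in inverse order.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Leaf:
    index: int            # position in the canonical linearization (1-based)


@dataclass
class Node:
    label: str            # "+" = unrotated (forward), "-" = rotated (backward)
    left: "Tree"
    right: "Tree"


Tree = Union[Leaf, Node]


def surface_order(tree: Tree) -> List[int]:
    """Read off the word order that the separating tree yields."""
    if isinstance(tree, Leaf):
        return [tree.index]
    left, right = surface_order(tree.left), surface_order(tree.right)
    return left + right if tree.label == "+" else right + left


def canonical_order(tree: Tree) -> List[int]:
    """Treat every node as "+", which recovers the canonical linearization."""
    if isinstance(tree, Leaf):
        return [tree.index]
    return canonical_order(tree.left) + canonical_order(tree.right)


example = Node("-", Node("+", Leaf(1), Leaf(2)), Node("-", Leaf(3), Leaf(4)))
print(surface_order(example), canonical_order(example))
```

Relabeling the two − nodes of `example` as + turns the surface order 4, 3, 1, 2 back into 1, 2, 3, 4, which is the reversal step appealed to at the end of the proof.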

**Lemma 3** CCG can derive all possible separating trees for an NOD of arbitrary length.

*Proof*. By Lemma 1, CCG can derive all possible binary branching trees on the canonical linearization. By Lemma 2, each node in each of these binary trees can be labeled either + or −, defining a separating tree isomorphic to the CCG derivation for the corresponding linearization. Because we saw in Section 3.2 that a separating tree determines the identities of its leaves, the set of all possible binary trees with all possible node labelings is the set of all possible separating trees under the definition (17) there.□

Finally, we can prove the main claims that relate CCG derivations of NOD and separable permutations of canonical linearization of NOD.

**Theorem 1** CCG derives only separable permutations.

*Proof*. By Lemma 2, all CCG derivations can be mapped to separating trees. All separating trees correspond to a separable permutation (Bose, Buss, and Lubiw 1998). Therefore CCG derives only separable permutations.□

**Theorem 2** CCG derives all separable permutations over a NOD.

*Proof*. By Lemma 3, CCG can derive all possible separating trees. Every separable permutation corresponds to a separating tree (Bose, Buss, and Lubiw 1998). Therefore, CCG derives all separable permutations.□

**Theorem 3** The cardinality of the set of permutations derivable by CCG is bounded by *S*_{n−1}, the *n* − 1th Large Schröder Number.

*Proof*. By Theorems 1 and 2, CCG derives all and only separable permutations over an NOD of length *n*. The number of separable permutations of length *n* is the *n* − 1th Large Schröder Number (Shapiro and Stephens 1991). Therefore, the number of permutations derivable by CCG for that NOD is also the *n* − 1th Large Schröder Number.□

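The Large Schröder numbers are given by the following recurrence, where *S*_{0} = 1:

$S_{n} = S_{n-1} + \sum_{k=0}^{n-1} S_{k} S_{n-k-1}$ (18)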
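As a sanity check on Theorem 3 for small *n*, the recurrence *S*_{0} = 1, *S*_{n} = *S*_{n−1} + Σ_{k=0}^{n−1} *S*_{k}*S*_{n−k−1} can be compared with a brute-force count. The sketch below assumes the standard characterization of the separable permutations as those avoiding the patterns 2413 and 3142; the function names are ours.

```python
from itertools import combinations, permutations


def large_schroeder(n_max):
    """Return [S_0, ..., S_n_max] from S_n = S_{n-1} + sum_k S_k * S_{n-k-1}."""
    s = [1]
    for n in range(1, n_max + 1):
        s.append(s[n - 1] + sum(s[k] * s[n - k - 1] for k in range(n)))
    return s


def contains(perm, pattern):
    """True if some subsequence of `perm` is order-isomorphic to `pattern`."""
    for positions in combinations(range(len(perm)), len(pattern)):
        values = [perm[p] for p in positions]
        ranks = sorted(values)
        if tuple(ranks.index(v) + 1 for v in values) == pattern:
            return True
    return False


def count_separable(n):
    """Brute-force count of permutations of 1..n avoiding 2413 and 3142."""
    return sum(
        1
        for perm in permutations(range(1, n + 1))
        if not contains(perm, (2, 4, 1, 3)) and not contains(perm, (3, 1, 4, 2))
    )


schroeder = large_schroeder(8)
print(schroeder)
print([count_separable(n) for n in range(1, 7)])
```

The brute-force counts for lengths 1 to 6 coincide with *S*_{0} to *S*_{5}; in particular, four elements give 22 separable permutations.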

### 3.5 Generalization to Full CCG

The above results have been proven using a restricted version of CCG called “pure first-order” CCG (Koller and Kuhlmann 2009), in which all variables in categories *X*_{i}|*X*_{i+1} correspond to atomic types such as N, rather than function categories, such as (*S*∖*NP*). The standard version of CCG is a second-order calculus, in which arguments can have function types, a possibility that is central to the CCG analysis of relativization, coordination, and adjunction, including the phenomenon of parasitic gaps. Indeed, all cases of so-called movement reduce in CCG to contiguous merger with a semantically second-order functor.

Categories that select for multiple arguments are “Curried” unary functors in CCG, with a function as their result, such as (*X*_{i}|*X*_{j})|*X*_{i+1}. They do not define a NOD over the arguments *X*_{i+1}, *X*_{j}, because those arguments do not stand in a dominance relation, and they potentially allow the second-level composition rules such as (12) to apply. Nevertheless, they have a canonical linearization in which all combinations are forward, and (by Lemma 2) dominate separating trees over separable permutations of that canonical ordering, although they no longer allow *all* separable permutations. (The large Schröder numbers thus constitute an *upper bound* on the allowable permutations of canonical linearizations in CCG.)

The present results generalize immediately to full CCG. The semantics of the combinatory rules is invariant, so it makes no difference to the combinatorial possibilities if the “canceling” variable *Y* in a rule such as application matches an atomic category or a function category. (In fact, we have already exploited this possibility in writing the atomic symbols *VP* and *N* for what are semantically predicate or functional types.)

Type-raising is a lexical operation that exchanges the role of argument and function by making the former into a function over the latter and that does allow permutations that would not be allowed by the original categories. It does so by changing the NOD among the categories. However, because this is a lexical operation whose effects are determined before syntactic derivation can apply, the main result of the present paper still applies: Only separable permutations on the canonical linearization order defined by the lexical categories can be captured.

## 4. Normal-Form Separating Trees and CCG Parsing

The results shown above depended on the fact that the canonical linearization on the category sets of interest is binary tree-complete, and therefore separating tree-complete under rotation, and therefore separable-permutation-complete. However, we noted earlier that there are more separating trees than separable permutations, because many separating trees yield the same permutation. Indeed, the canonical linearization of an NOD of length *n*, and likewise its inverse, is yielded by *C*_{n−1} distinct separating trees, where *C*_{n−1} is the *n* − 1th Catalan Number (Church and Patil 1982), so that the total number of separating trees is much larger than the corresponding Large Schröder number.

For example, Figure 4 shows all eight separating trees derived by rotation from the two binary trees on the canonical linearization *CL*_{3} of three elements, in which + on each of the two nodes means that it is unrotated and − means that it is rotated. These trees yield all separable permutations of three elements, but the two orders 1, 2, 3 and 3, 2, 1 are each represented by two separating trees, since *C*_{2} = 2. For the purpose of efficient parsing with CCG, we would like there to be only one CCG derivation, and one separating tree, for each separable permutation.
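The three-element case can be reproduced by exhaustive enumeration. The following sketch is our own illustration (the function name is hypothetical): it generates the surface order of every separating tree over a span by trying each split point and each of the two labels, and then counts how often each order arises.

```python
from collections import Counter


def separating_tree_yields(lo, hi):
    """Yield the surface order of every separating tree over the span lo..hi."""
    if lo == hi:
        yield (lo,)
        return
    for split in range(lo, hi):
        for left in separating_tree_yields(lo, split):
            for right in separating_tree_yields(split + 1, hi):
                yield left + right      # node labeled +: children unrotated
                yield right + left      # node labeled -: children rotated


counts = Counter(separating_tree_yields(1, 3))
print(sum(counts.values()), len(counts))
print(sorted(order for order, c in counts.items() if c > 1))
```

For three elements there are eight trees but only six distinct orders, with 1, 2, 3 and 3, 2, 1 each produced twice.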

### 4.1 Normal-Form Separating Trees

Shapiro and Stephens (1991) observed that the spurious ambiguity of separating trees arises when two nodes that are in parent–child relation have the same label, as for the trees in Figure 4a and 4b. They proposed a normal form that makes the mapping between the set of normal form separating trees and the set of separable permutations bijective. This normal form states that no node can be labeled with the same + / − label as its right child. This excludes trees like those in Figure 4a and 4e and means that the number of normal-form separating trees is the same as that of the separable permutations.

This normal form for separating trees is reminiscent of the CCG derivational normal form proposed by Eisner (1996) to enforce a one-to-one mapping between CCG syntactic derivations and the semantic interpretations that they generate (cf. Hockenmaier and Bisk 2010). Eisner normal form forbids particular sequences of combinatory rules of the same type from occurring in a parent–child relationship in the derivation.

Using normal-form separating trees, we can distinguish the possible cases, shown in Figure 5. Figures 5a and 5b show the configurations where the right child is a single word. Figures 5c and 5d show those where the right child is a non-single word constituent that adheres to the normal form. The length of the examples is identified as *n* + 1 in order to be consistent with the conventional numeration for the Large Schröder numbers. The length of the constituent A is *k*. We are interested in the number of normal-form separating-trees *S*_{n}, since that will also be the number of separable permutations of length *n* + 1.

The number of trees that conform to the pattern in Figure 5a clearly depends on the number of possible concrete sub-trees that can act as constituent A. Here A is of length *k* = *n*, which means that the number of separating trees of the pattern in Figure 5a is *S*_{n−1}. If we ignore the label of the root node, the other three patterns show all possible ways of constructing the right child: having a single terminal node, or a binary node labeled + or −. This means that we can express the sum of trees conforming to these three patterns as the number of trees that have their root unlabeled (since we are ignoring the root label) and have the left and right child conforming to the normal form for separating trees. For a given *k*, which plays the role of a split point in parsing, the number of those trees would be *S*_{k−1}*S*_{n+1−k−1}, but because *k* is not fixed and can be any number between 1 and *n*, we need to sum over all of those values: $\sum_{k=1}^{n} S_{k-1} S_{n-k}$. We can shift this sum to start from zero to obtain the equivalent formula $\sum_{k=0}^{n-1} S_{k} S_{n-k-1}$. When we put the formulas for all four patterns together we get the Large Schröder recurrence in Equation (18). We can conclude that the number of normal-form separating trees for a NOD of length *n* is also *S*_{n−1}.
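The counting argument can be cross-checked by enumerating normal-form separating trees directly. The sketch below is a minimal illustration with names of our own choosing. It builds trees over surface positions, assumes that + gives the left child the lower block of canonical indices and − the higher block, and discards any node that carries the same label as its right child.

```python
def normal_form_orders(size, offset=0):
    """Yield (root_label, order) for each normal-form separating tree over
    `size` surface positions, using canonical indices offset+1 .. offset+size."""
    if size == 1:
        yield (None, (offset + 1,))
        return
    for k in range(1, size):                    # k words under the left child
        for label in "+-":
            # "+": left child takes the lower indices; "-": the higher indices
            if label == "+":
                left_off, right_off = offset, offset + k
            else:
                left_off, right_off = offset + size - k, offset
            for _, left in normal_form_orders(k, left_off):
                for right_label, right in normal_form_orders(size - k, right_off):
                    if right_label != label:    # normal form: differ from right child
                        yield (label, left + right)


for n in range(1, 7):
    orders = [order for _, order in normal_form_orders(n)]
    print(n, len(orders), len(set(orders)))
```

For lengths 1 to 6 the number of trees equals the number of distinct orders they yield, namely 1, 2, 6, 22, 90, and 394, so that no permutation is derived twice.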

### 4.2 The Relation of Large Schröder Numbers to Shift-Reduce Parsing

The Large Schröder numbers are often also presented in relation to the number of lattice paths in the Cartesian plane such that each path starts at lower-left corner (0,0), ends at upper right corner (*n*, *n*), contains no points above the diagonal *y* = *x* (that is, can only be in the lower right triangle), and is composed only of unit steps →, ↑, and ↗ or more concretely (1, 0), (0, 1), and (1, 1) (Weisstein 2018). In this section we show how this view of Large Schröder numbers can be interpreted as Shift-Reduce parsing with CCG derivations corresponding to the normal-form separating trees defined above.

Shift-Reduce is one of the most widely used parsing algorithms in computational linguistics and it is probably the simplest way to construct a parser. It is based on having a parsing state with a stack (filled with constituents built so far) and a buffer (filled with words in the sentence suffix that are yet to be processed). A *shift* transition takes a word from the buffer and puts it on top of the stack. For binarized grammars like CCG, a *reduceX* transition replaces the two topmost constituents on the stack with a single new constituent labeled *X*. In our CCG derivations we will have two reduce operations: reduce + for forward combinators and reduce − for backward combinators. We will again take an example sentence of length *n* + 1. The initial parsing state will be a stack containing the first word and a buffer containing the other *n* words.

We can visually represent any valid successful shift-reduce parsing trace by starting at the point (0, 0) and taking → to represent a shift transition and ↑ labeled + or − to represent a reduceX transition. The Shift-Reduce transition sequence for the derivation tree in Figure 6a is shown in Figure 6b. Clearly, such paths will never cross the diagonal, because we can never have more reduce operations than shifts and every successful parse has *n* shifts and *n* reduce steps, so that every path ends at (*n*, *n*).

The set of SR parse paths can be further restricted to those corresponding to the normal-form parses and separating trees defined in the preceding Section 4.1, as follows. First, the normal form says that a parent node cannot have the same label as its right child node. What that means in SR parsing terms is that *we cannot have two consecutive reduce transitions with the same label, that is, two consecutive reduce + or two consecutive reduce − transitions*. Therefore, when we have a sequence of reduce operations, expressed as an uninterrupted vertical sequence of ↑, we need only know the label of the first reduce transition in the sequence because the other labels can be inferred from the normal form: They will alternate between + and −. This modified representation is shown in Figure 6c.

The final modification is to use a slightly different encoding of the initial reduce in each of the reduce sequences. The first reduce step in the sequence always comes after a shift, which means that it forms a shape → ↑. If that sequence is [shift, reduce−], we will encode it with a diagonal ↗ and if it is [shift, reduce+] we will encode it with existing → ↑ but without any label, as shown in Figure 6d. This encoding defines a one-to-one mapping between CCG normal-form shift-reduce parses and paths in the Schröder lattice.
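The encoding just described can be sketched in a few lines. The code below is illustrative and rests on our own assumptions: a derivation tree is given in surface order as nested tuples `(label, left, right)` with words as strings, the words `w1` … `w4` are placeholders, and the function names are hypothetical.

```python
def transitions(tree):
    """Shift-reduce actions for a derivation tree given in surface order.

    A leaf is a word (str); an internal node is (label, left, right) with
    label "+" (forward combination) or "-" (backward combination).
    """
    if isinstance(tree, str):
        return ["shift"]
    label, left, right = tree
    return transitions(left) + transitions(right) + ["reduce" + label]


def schroeder_steps(tree):
    """Encode a normal-form parse as unit steps (1,0), (0,1), and (1,1)."""
    actions = transitions(tree)[1:]     # the first word is already on the stack
    steps, i = [], 0
    while i < len(actions):
        if actions[i] == "shift" and actions[i + 1 : i + 2] == ["reduce-"]:
            steps.append((1, 1))        # [shift, reduce-] becomes a diagonal step
            i += 2
        else:
            steps.append((1, 0) if actions[i] == "shift" else (0, 1))
            i += 1
    return steps


def stays_below_diagonal(steps):
    """True if the path never has more vertical than horizontal progress."""
    x = y = 0
    for dx, dy in steps:
        x, y = x + dx, y + dy
        if y > x:
            return False
    return True


tree_a = ("+", "w1", ("-", "w2", ("+", "w3", "w4")))
tree_b = ("-", ("-", "w1", "w2"), "w3")
print(schroeder_steps(tree_a))
print(schroeder_steps(tree_b))
```

Labels on the vertical steps are omitted because, under the normal form, they alternate and can be recovered from the first reduce of each sequence.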

These observations support another way to think about the number of permutations that CCG allows for a NOD of length *n* in terms of SR normal-form parsing. When we start in the initial position (0,0) we can take paths that start either with ↗ or → (↑ is illegal because it would cross the upper diagonal). If we take ↗ (Figure 7a) we find ourselves in a lattice of the same shape, smaller by one unit, which means that the number of normal-form paths through it is *S*_{n−1}. If, on the other hand, we take → we will end up in position (1, 0). Any path that goes from (1, 0) to (*n*, *n*) must at some point have a vertical arrow that crosses diagonal (*y* + 1, *y*), as shown in Figure 7b. The first point at which the path crosses that diagonal will be some concrete point (*k* + 1, *k*). The number of paths that go from (1, 0) to (*n*, *n*) and contain arc ↑ at (*k* + 1, *k*) is *S*_{k}*S*_{n−k−1}. Again, we need to sum over all possible paths that cross the diagonal (*y* + 1, *y*) over each possible *k*, which gives $\sum_{k=0}^{n-1} S_{k} S_{n-k-1}$. Summing results for initial ↗ and initial → again yields the Large Schröder number recurrence from Equation (18).
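The lattice-path view lends itself to a direct count. The following sketch, with a function name of our own, counts paths from (0, 0) to (*n*, *n*) built from the steps (1, 0), (0, 1), and (1, 1) that never rise above the diagonal, using memoized recursion rather than the recurrence itself.

```python
from functools import lru_cache


def count_schroeder_paths(n):
    """Number of paths (0,0) -> (n,n) with steps (1,0), (0,1), (1,1), y <= x."""

    @lru_cache(maxsize=None)
    def paths(x, y):
        if y > x or x > n:              # above the diagonal or outside the lattice
            return 0
        if (x, y) == (n, n):
            return 1
        return paths(x + 1, y) + paths(x, y + 1) + paths(x + 1, y + 1)

    return paths(0, 0)


print([count_schroeder_paths(n) for n in range(6)])
```

The counts for *n* = 0 to 5 agree with *S*_{0} to *S*_{5}, so a sentence of *n* + 1 words has *S*_{n} normal-form shift-reduce parses under this encoding.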

## 5. Statistical Significance of the Global Result

The above results confirm that all 22 separable permutations are allowed, at least as alternates, for any construction defined by four elements of the form *A*|*B*, *B*|*C*, *C*|*D*, *D*, and the 2 non-separable permutations are forbidden. The fact that, in both the constructions surveyed, the distribution of the frequencies with which the separable permutation orders are actually observed is so skewed means that some orders are so vanishingly rare as to only be attested in one construction, and/or even by only one or two languages. The possibility that this skewness arises from soft constraints related to ease of processing or learnability is interesting in its own right, and is the subject of ongoing linguistic research (Culbertson, Smolensky, and Legendre 2012; Merlo 2015; Dryer 2018; Merlo and Ouwayda 2018).^{13}

Nevertheless, according to the present theory, the two permutations (g) and (j) are excluded for a different reason, namely, by a hard constraint that is a consequence of the theory of grammar itself. The fact that the two disallowed orders are the only two that are totally unattested in the union of Cinque’s, Nchare’s, and Abels’s surveys is a strong result that is unlikely to have arisen by chance. The strength of this result can be assessed quantitatively, as follows.

*not* falsify them. For instance, a theory that forbids 3 permutations will be consistent only with one possible set of 21 observed permutations, while a theory that forbids 2 (like the one presented here) will be consistent with 22 different sets of 21 permutations. The probability of our predictions being not falsified by chance is the number of observations (permutation sets) that would be consistent with our theory divided by the total number of possible observations, that is, approximately 1 in 100:
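The figures quoted here can be recomputed under one reading of the calculation, namely that an "observation" is a set of 21 attested orders drawn from the 24 permutations of four elements. That reading, and the names in the sketch below, are assumptions of ours.

```python
from math import comb

TOTAL_PERMUTATIONS = 24     # 4! orders of four elements
OBSERVED = 21               # attested orders in the union of the surveys


def consistent_sets(forbidden):
    """Sets of OBSERVED orders that avoid all `forbidden` excluded orders."""
    return comb(TOTAL_PERMUTATIONS - forbidden, OBSERVED)


possible_sets = comb(TOTAL_PERMUTATIONS, OBSERVED)
chance = consistent_sets(2) / possible_sets
print(consistent_sets(3), consistent_sets(2), possible_sets, round(chance, 4))
```

On this reading, a theory forbidding three orders is consistent with one set and a theory forbidding two with 22 sets, out of 2,024 possible sets, a ratio of about 0.011.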

Both results are evidently unlikely to come up by chance.

## 6. Discussion

The prediction of CCG concerning possible word orders is significant from both scientific and engineering perspectives. The Large Schröder series *S*_{n} grows much more slowly with *n* than the total factorial number of permutations *n*!, so that the proportion $\frac{n! - S_{n-1}}{n!}$ of non-separable permutations that are disallowed by CCG grows rapidly with *n*, as 0%, 0%, 0%, 8%, 25%, 45%, 64%, 79%, 89%, …. The proportion $\frac{S_{n-1}}{n!}$ of separable permutations that are allowed shrinks correspondingly rapidly with *n*.
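The percentages in this series follow directly from the Large Schröder recurrence (*S*_{0} = 1, *S*_{n} = *S*_{n−1} + Σ_{k=0}^{n−1} *S*_{k}*S*_{n−k−1}). A short sketch, with a function name of our own, recomputes them:

```python
from math import factorial


def disallowed_percentages(n_max):
    """Rounded percentage of non-separable permutations for n = 1 .. n_max."""
    s = [1]                                   # s[n] holds S_n
    for n in range(1, n_max):
        s.append(s[n - 1] + sum(s[k] * s[n - k - 1] for k in range(n)))
    return [
        round(100 * (factorial(n) - s[n - 1]) / factorial(n))
        for n in range(1, n_max + 1)
    ]


print(disallowed_percentages(9))
```

Running it for *n* = 1 to 9 reproduces the series given in the text.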

From a practical NLP perspective there are many areas where this saving can be achieved, with machine translation and parsing, including “sequence-to-sequence” semantic parsing, being the obvious use-cases. If we know the limited number of word-order variations among languages, alignment algorithms (or attention as its neural equivalent [Bahdanau, Cho, and Bengio 2015]) can be constrained (or biased) in the right direction (Zhang and Gildea 2005; Dong and Lapata 2016).^{14}

Unsupervised parser induction is another application where introducing an inductive bias (or hard constraint) for contiguous derivation trees with a normal form significantly reduces the number of alternatives that a learner needs to consider (Clark and Lappin 2010). These savings are especially significant in comparison to other theories of grammar that use discontinuous constituents, empty strings, or very powerful, often Turing-complete, unification mechanisms.

We saw earlier that the number and type of permutations allowed by CCG appears to correspond to the crosslinguistic variation in word order that is actually attested. CCG also provides a strong and easily testable prediction of what word orders will be found in the future in human languages that have not so far been examined (Steedman 2020). These properties stem from the combinatory projection principle (13), and in particular from the restriction to *binary* combinatory rules.

So far, the empirical studies of crosslinguistic variation cited above have been confined to no more than four core elements of NP and VP. However, we have shown that the results proved here are completely general to sets of categories of any size that define a natural order of dominance. The verb group in particular can be naturally extended to serial verbs of various kinds. The cartographic spines of functional projections identified by Cinque and Rizzi (2008) offer several tens of functional heads in a supposedly universal order of dominance, although there is more to say here about the relation of syntax and morphology and the degrees of crosslinguistic variation they allow. Any firm evidence for any non-separable permutation of these elements in any language for which the natural order of dominance can be unequivocally determined would immediately falsify the present theory. As we noted earlier, the number of forbidden non-separable falsifying cases rises very rapidly as the size of such sets increases beyond 4.

However, the power-law distributions that we observe on all such dimensions of variation, and the fact that the performance factors that lie behind them favor continuity and semantic homomorphism, mean that those falsifying cases will always lie out in the long tail. We have seen that the total number of languages that have ever been studied in the requisite detail is in the case of the nominal construction only just enough to take us far enough out into that long tail to achieve a confidence level against ever seeing a non-separable permutation of even 1 in 100. That means our chances of reaching a definitive answer to the question for even a five-element construction are quite daunting.

Naturally, our work is not the first work to try to find a formal constraint of this kind. Here we review some results from other formalisms.

**ITG** Inversion Transduction Grammars (Wu 1996) are a type of syntax transduction grammar developed for the task of machine translation. ITG can generate only separable permutations of the words in the source language, so it is natural to ask how this property of ITG differs from the one proposed here. The main difference lies in what is considered the reference word order that is being permuted. Here, CCG gives a clear definition: The reference word order is the CL that reflects the NOD, and ultimately the semantics. To say that CCG generates only separable permutations of the NOD is to make a clear prediction about what permutations can be found in natural languages. However, the same is not the case with ITG, which does not identify any NOD. For ITG the reference word order is that of the source language in the MT task. Taking a different language as a source language gives a different prediction of what will be found in the world’s languages. Take, for instance, the previous example “these five young lads.” If ITG takes English as a source language then it will make the same prediction as CCG. But if we had, for example, taken Spanish as a source language, with the word order “these five lads young,” ITG would be able to derive “five lads these young,” which has not been attested and that CCG predicts to be impossible. More importantly, starting from Spanish ITG would be unable to derive the word order “lads these young five,” which was attested by both Cinque and Nchare for NP and by Abels and Neeleman for the equivalent VP permutation. Extending ITG to a more powerful formalism such as Permutation Trees (Zhang and Gildea 2007; Stanojević and Sima’an, 2015) does not solve this problem as long as the NOD is ignored.
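The dependence on the reference order can be illustrated with the example above. The sketch below is ours: it tests separability by recursively splitting an order into two blocks of contiguous values, and expresses each target order as positions in either the English or the Spanish reference order (written with English glosses, as in the text). The function names are hypothetical.

```python
def is_separable(order):
    """True if `order` can be split recursively into blocks of contiguous values."""
    if len(order) <= 1:
        return True
    for i in range(1, len(order)):
        left, right = order[:i], order[i:]
        if max(left) < min(right) or min(left) > max(right):
            return is_separable(left) and is_separable(right)
    return False


def relative_to(reference, target):
    """Express `target` as 1-based positions in the `reference` word order."""
    return [reference.index(word) + 1 for word in target]


english = ["these", "five", "young", "lads"]
spanish = ["these", "five", "lads", "young"]
unattested = ["five", "lads", "these", "young"]
attested = ["lads", "these", "young", "five"]

for target in (unattested, attested):
    for name, reference in (("English", english), ("Spanish", spanish)):
        order = relative_to(reference, target)
        print(" ".join(target), "|", name, order, is_separable(order))
```

Relative to the English order the unattested string is the non-separable 2, 4, 1, 3 and the attested one is separable, whereas relative to the Spanish order the judgments are reversed.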

**TAG** Tree-Adjoining Grammars (Joshi 1985) are often compared to CCG because they are weakly equivalent, in the sense that they generate the same set of languages (Joshi, Vijay-Shanker, and Weir 1991). The problem discussed in our article is about the set of permutations allowed for a fixed natural order of dominance. This means that results from our paper would transfer to TAG only if TAG was strongly equivalent to CCG, which Koller and Kuhlmann (2009) have shown not to be the case: There are some dependency structures that are derivable by CCG that are not derivable by TAG and vice versa. In particular, Koller and Kuhlmann’s (2009, Figure 8b) example of a dependency structure derivable by TAG but not by CCG is actually the dependency structure of the non-separable permutation 2413, or (g) under the alphanumeric convention used in this article.

Previous work on the permutations derivable by TAG has looked at the different scrambled argument orders allowed in German. This was motivated by the work of Becker, Rambow, and Niv (1992), who claim that scrambling requires even higher generative power than LCFRS. Joshi, Becker, and Rambow (2000) show that this is indeed the case if we want unbounded embedding levels of scrambling, but note that it is not clear that this is needed in modeling attestable natural language data. The *tree-local multi-component* version of TAG (**TL-MC-TAG**) does not increase weak generative power but does increase strong generative power, allowing complete scrambling of three arguments but not four.

Scrambling in CCG is handled by the assumption that all arguments are lexically type-raised as functions over the verb or verb complex, which allows less oblique arguments (crucially, the subject) to compose with the German verb complex in advance of more oblique arguments. As noted earlier, lexical type raising alters the NOD, making the subject take position 1 and the verb position *n* in the order of dominance. If the generalization of composition rules in CCG is limited to the second-order case illustrated in (12), CCG is subject to the same limitation as TL-MC-TAG: three arguments can scramble freely, but four cannot. For more discussion on German scrambling and CCG, see Hockenmaier and Young (2008).

Chen-Main and Joshi (2008) take the problem of scrambling of noun arguments of four verbs: *N*_{1}*N*_{2}*N*_{3}*N*_{4}*V*_{1}*V*_{2}*V*_{3}*V*_{4}. They show that there are 22 permutations of *N*_{1} to *N*_{4} that can be derived with the extended version of TL-MC-TAG that supports flexible composition and multiple adjoining. However, these permutations cannot be directly compared with the NP and VP constructions considered in the empirical studies discussed earlier.^{15}

**MG** Minimalist Grammars (Stabler 1996) are a rigorously defined formalism for Chomskyan Minimalism (Chomsky 1995b) that has higher generative capacity than CCG. Stabler (2011) has used MG to derive possible permutations for the four core elements of a noun phrase and found that, under some assumptions, MG can generate only 16 out of the complete 24 possible permutations. These 16 include the 14 permutations claimed by Cinque (2005), but exclude some of those claimed by Nchare (2012) for the free word-order language Shupamem. The assumption that Stabler makes is that the grammar uses neither empty strings nor head-movement, both of which are prevalent in most linguistic work on minimalist syntax. Interestingly, Stabler’s 16 include permutation (j), which is one of the two non-separable permutations that are excluded under the present theory.

The limitation of combinatory reduction to the separable permutations is also implicit in Svenonius’s (2007) ***1-3-X-2** restriction on adjunct placement (among other constructions), where X is an adjunct to 1, and in Williams’s (2003, pages 203–211) proposal for his categorial calculus **CAT**. CAT has a standard directional categorial lexicon and rule of application, with a combinatory operation REASSOCIATE equivalent to composition, and an operation FLIP, unavailable in CCG, which reverses the directionality on the argument of a functor category.

Williams’s CAT variety of Minimalism is closely related to CCG. However, without the addition of morpho-lexical type-raising, permitting arguments to compose, CAT is unable to express the variety of constructions that CCG makes available, which include relativization, coordination-reduction, Germanic scrambling, and Hungarian VM fronting (Steedman 2020). In the absence of lexical type-raising these constructions require the extension of CAT by the addition of powerful rules of “action at a distance” unavailable in CCG, such as movement, copying, and/or deletion, together with attendant constraints to limit their considerable expressive power.

A related proposal to ours by Medeiros (2018) hypothesizes that the set of allowed permutations is the one that could be sorted with a single stack, so-called **stack-sortable permutations** or 231-avoiding permutations (Knuth 1968). This set of permutations exactly covers Cinque’s 14, but does not fit the other empirical data considered here: 7 permutations out of 21 that are observed are predicted not to be possible by Medeiros. As with Williams (2003), it is also unclear that Medeiros’s mechanism will handle phenomena of unbounded or *wh* movement, as CCG does. However, there is an interesting relation between the sorting automata for 231-avoiding permutations and the automata that sort separable permutations: 231-avoiding permutations are sortable with a single stack while separable permutations are sortable with a sequence of *n* − 1 pop-stacks (Avis and Newborn 1981).
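The size of the stack-sortable class for four elements can be confirmed with a direct simulation. The sketch below is a textbook single-stack sorting pass rather than Medeiros’s own formulation, and which of the two orders (surface or canonical) is fed to the stack is an assumption on our part; the function name is hypothetical.

```python
from itertools import permutations


def stack_sortable(perm):
    """True if one pass through a single stack outputs `perm` in increasing order."""
    stack, output = [], []
    for item in perm:
        while stack and stack[-1] < item:     # pop everything smaller than the input
            output.append(stack.pop())
        stack.append(item)
    while stack:                              # flush the stack at the end of input
        output.append(stack.pop())
    return output == sorted(perm)


sortable = [p for p in permutations(range(1, 5)) if stack_sortable(p)]
print(len(sortable))
print(stack_sortable((2, 4, 1, 3)), stack_sortable((3, 1, 4, 2)))
```

Fourteen of the 24 orders pass the test, and both non-separable orders fail it, so for four elements the stack-sortable orders form a proper subset of the 22 separable ones.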

## 7. Conclusion

This article has argued that the facts of attested word order in the constructions considered exploit all and only the degrees of freedom that CCG allows, despite the fact that some of the orders that CCG allows for some constructions are vanishingly rare due to unrelated performance soft constraints that induce the Zipfian (power-law) distributions discussed by Culbertson, Smolensky, and Legendre (2012) and Culbertson and Adger (2014).

Explanatory adequacy in a theory depends on being able to explain why the data has just the degrees of freedom it does, and why other things *don’t* happen. If the latter have been prevented by hard constraints, as in various ways they have been for these constructions by Cinque, Svenonius, Abels, Neeleman, Nchare, and Stabler, then those constraints themselves have in turn to be explained.

No hard constraints are needed at the level of competence grammar in CCG. The limitation on the grammatical permutation of *n* elements in natural grammars to the Large Schröder number *S*_{n−1} of separable permutations is a formal universal stemming from the limited expressivity of CCG itself—specifically, its restriction of its operators to strictly adjacent pairs of non-empty constituent types via the Combinatory Projection Principle (13). This result is a corollary of the Combinatory Categorial theory of grammar, and the formally explicit reduction that it affords of what in Minimalist terms would be called move or “internal merge” to contiguous (that is, “external”) merge.

## Appendix A: The Ensemble of Attested Word Orders Over 4 Elements

In this appendix we review the attested orders of four core elements of NP and VP as reported by Cinque, Nchare, and Abels and Neeleman. We also give CCG derivation trees for these orders from Steedman (2020).

If our lexicon consists of four entirely unrestricted categories of the form *A*|*B*, *B*|*C*, *C*|*D*, *D*, then for both the nominal and verbal constructions all 22 separable permutations are allowed, and only the two non-separable permutations are unanalyzable, as a consequence of the combinatory projection principle (13), as follows.

**The NP**

For the NP, the 22 orders are allowed under the following derivations (“ ×” marks the two sequences (g) and (j) that are unanalyzable). Only essential compositions are indicated: all other combinations are application. For non-basic orders, the annotation “from *z*” indicates the basic pure-applicative order among those in (8) on whose lexicon a particular derived order is based:^{16}

As noted earlier, to limit a given language to just one of these word orders, we need to restrict its lexicon via slash-typing to allow *only* the rules we see in the relevant derivation in (21) to apply to the categories in question. For example, to restrict the lexicon of Maasai to allow only order (c) and no other, we need the following more restricted version of the lexicon for basic order (21b):

This lexicon allows the following derivation and no other for NP via rule (11a):

This is Cinque’s attested order (21c).^{17}

For a language like Shupamem with many orders, in the worst case we can simply allow multiple entries for lexical items supporting all those orders. However, in practice, smaller lexicons with entries supporting more than one order seem to apply. (For example, in Shupamem, Dem seems to bear the completely unrestricted bidirectional slash type *NP*|*NumP*—Steedman [2020, page 19].)

**The VP**

Turning to the Germanic VP construction surveyed in Abels (2016), we see the identical picture to that for the elements of the NP (21) emerge for the verb group. That is, only 22 out of the total 24 permutations are allowed:

Once again, the two non-separable permutations (24) are among those unattested by Abels for the orders for VP_{1} VP_{2} VP_{3} VP_{4}, including those that he is equivocal toward but which others have claimed. Once again, we can restrict word order for a given language using slash types, and once again the kinds of alternation found for serial verbs in languages like Hungarian can in practice be captured in quite small lexicons using more flexible types (Steedman 2020). Crucially, none of these attested or predicted word orders requires any kind of movement or action-at-a-distance.

Interestingly, the one order predicted under the present hypothesis that was not attested for the NP (21), is among the orders attested by Abels for the Germanic verb group (albeit somewhat grudgingly, as “spontaneously, possibly as alternate,” citing Wurmbrand (2004, page 59), who found it accepted by some Austrian German speakers).

## Notes

Of course, we need further lexical categories to allow, e.g., *These young lads, Five lads*, etc., as NP. This might be done via underspecification using X-bar-theoretic features (Chomsky 1970), but we will ignore them for now.

Although compositional semantics and logical form are suppressed for the purposes of this article, the semantics of the rules in (6) is also the application of semantic functions such as *young*′ to arguments such as *lads*′ to yield logical forms such as *young*′ *lads*′. In general, if the functor *X*|*Y* has logical form *f* and the argument *Y* has logical form *a*, then the result *X* always has logical form *f* *a* (read “*f* of *a*”). Thus, semantics is “surface compositional” in CCG.

Cinque’s (2013) study doubles the size of his sample to around 1,500 languages, and adds no new attested orders to his original 14—unsurprisingly, given the power-law distribution of the frequencies.

It will be convenient in the following discussions to refer to the function *X*|*Y* in such rules as the “primary” function, and to the other function *Y*|*Z* (etc.) as “secondary.” While we continue to suppress explicit semantics for the purposes of the present article, the composition rules (10) and (11), like the application rules (6), have an invariant surface-compositional semantics, such that if the meaning of the primary function *X*|*Y* is a functor *f* and that of the secondary function *Y*|*Z* is *g*, then the meaning of the result *X*|*Z* is *λz*.*f*(*g* *z*), the *composition* of the two functors, which, if applied to an argument of type *Z* and meaning *a*, yields an *X* meaning *f*(*g* *a*).

This definition of immediate dominance is a special case of dependency structure for Pure First-Order CCG as defined by Koller and Kuhlmann (2009). We shall see below that this definition in fact generalizes to the second-order case where *X* and/or *Y* are function categories.

The actual combinatory rules involved will either be crossed or harmonic, but we can ignore the distinction here because of our use of the universal | slash.

This possibility is also acknowledged by Abels and Neeleman (2012).

This observation is also implicit in Whitelock’s (1992) early proposal for “shake and bake” translation for Japanese.

When Chen-Main and Joshi (2008) report that their variation of TAG can capture permutation (g) 2413 of verbal arguments, they are actually referring to a permutation 24135678 of eight elements, of which the last four (the verbs) are polyvalent, defining a quite different (partial) NOD from the 1 > 2 > 3 > 4 linearization.

See Appendix A for the definition of the quantized counts.

Depending on the facts concerning coordinations like (15) in the language in question, we may want to use /_{◇★} or ∖_{◇★} slash types in place of /_{★} or ∖_{★} to allow harmonic composition as well as application in a language like English.

## Acknowledgments

We are grateful to Peter Buneman, Shay Cohen, Paola Merlo, Chris Stone, Bonnie Webber, and the Referees for *Computational Linguistics* for helpful comments and advice. The work was supported by ERC Advanced Fellowship 742137 SEMANTAX.

## References

*The Grammar of Shupamem*

*N*-kings problem