It linked all perplexed meanings
Into one perfect peace.
—Procter and Sullivan, 1877, “The Lost Chord”
Let me begin by thanking the Association for Computational Linguistics and its Executive Committee for conferring on me the great honor of their Lifetime Achievement Award for 2018, which of course I share with all the wonderful students and colleagues that have made many essential contributions to this work over many years.
At the heart of the work that I have been pursuing over my research lifetime so far, whether in parsing and sentence processing, spoken language understanding, semantics, or even in musical understanding by machine, there lies a theory of natural language grammar that brings parsing, compositional semantics, statistical modeling, and logical inference into the closest possible relation. This theory of grammar is combinatory, in the sense that its operations are type-dependent and restricted to strictly string-adjacent phonologically or graphologically-realized inputs, and categorial, in the sense that those operands pair a syntactic type with a type-transparent semantic representation or logical form.
I’d like to use this opportunity to briefly address three questions that revolve around the theory of grammar, both combinatory and otherwise. The first question concerns the way that Combinatory Categorial Grammar (CCG) was developed with a number of colleagues, over a number of stages and in slightly different forms. The second is an essentially evolutionary question of why natural language grammar should take a combinatory form. The third question is that of what the future holds for CCG and other structural theories of grammar in computational linguistics and NLP in the age of deep learning.
I have called this talk “The Lost Combinator” in homage to the Victorian era poem “The Lost Chord,” in the hope of suggesting that the theoretical development of CCG has always been empirical, rather than axiomatic, in search of the simplest explanation of the facts of language, rather than for confirmation of linguistic received opinion, however intuitively salient.
1. In the Beginning
In the late 1960s (when I was a psychology undergraduate at the University of Sussex under Stuart Sutherland, and then started as a graduate student in artificial intelligence at Edinburgh under Christopher Longuet-Higgins), a broad community of theoretical linguists, psychologists, and computational linguists saw themselves as all working on the same problem, under the definition provided by the “transformational” theory of grammar proposed by Chomsky (1957, 1965), using theories of psycholinguistic processing, language acquisition, and language evolution proposed by Lashley (1951), Miller, Galanter, and Pribram (1960), Miller (1967), and Lenneberg (1967), theories of natural language semantics proposed by Carnap (1956), Montague (1970), and Lewis (1970), and computational models of parsing such as those proposed by Thorne, Bratley, and Dewar (1968) and Woods (1970). (I myself was so convinced that this program would succeed that I believed it was time to apply the same methods to other cognitive faculties, taking as my research project for Ph.D. their application to the interpretation of music by machine, following the lead of Max Clowes [1971] in machine vision.)
Almost immediately, this consensus fell apart. First, Chomsky himself was among the first (1965) to recognize that transformational rules, though descriptively revealing, were so expressive as to have little explanatory force, and required many apparently arbitrary constraints (Ross 1967). Second, psychologists realized that psycholinguistic measures of processing difficulty of sentences bore almost no relation to their transformational derivational complexity (Marslen-Wilson 1973; Fodor, Bever, and Garrett 1974). Finally, computational linguists attempting to implement transformational grammars as parsers realized that they were spending all their time implementing even more constraints on rules, in order to limit search arising from overgeneration (Friedman 1971; Gross 1978). (Meanwhile, I realized that the problem had not in fact been solved, and returned to natural language processing, thanks to a postdoc at Sussex with Philip Johnson-Laird.)
This disillusion wasn’t just a case of internal academic squabbling. There were also a couple of influential reports commissioned by the U.S. and UK governments that ended funding for machine translation (MT) and artificial intelligence (AI) (Pierce et al. 1966; Lighthill 1973). As a result of the second of these reports, which determined that AI was never going to work, PhDs in artificial intelligence like my classmate Geoff Hinton and myself spent ten years or so after graduation in psychology departments (in my case, at the Universities of Sussex and Warwick), until yet another report said AI was working after all and that Britain and the U.S. were falling behind Japan in this vital area. As a result, I could get hired again in computer science, first briefly back at Edinburgh, and then at the University of Pennsylvania. (I learned a lesson from this odyssey that I have tried to remember whenever I have been appointed to a committee to report on anything, which is that while reports very rarely do any good, they can very easily do a great deal of harm.)
Meanwhile, as a result of these conflicts, the scientific study of language fragmented. The linguists swiftly abjured any responsibility for their grammars (“Competence”) bearing any relation to processing (“Performance”). Because the psychologists could hardly abandon Performance, they in turn became agnostic about grammar, retreating to context-free surface grammar (which they tended to refer to as “parsing strategies”), or a touchingly optimistic belief in its emergence from neural models. Meanwhile, the computational linguists (whose machines were growing exponentially in size and speed from the 16K byte core of the machine that supported the whole group when I started my graduate studies, on to levels that would soon permit parsing the entire contents of the then embryonic Web) similarly found that very little of what the linguists and psychologists cared about was usable at scale, and that none of it significantly improved overall performance over very much simpler context-free or even finite-state methods that the linguists had shown to be incomplete. The reason of course was Zipf’s law, which means that the events with respect to which the low-level methods are incomplete are off in the long tail.
It also became apparent to a few computationalists working on speech, MT, and information retrieval that the real problem was not grammar but ambiguity and its resolution by world-knowledge, and that the solution lay in probabilistic models (Bar-Hillel 1960/1964; Spärck Jones 1964/1986; Wilks 1975; Jelinek and Lafferty 1991) (although it was not immediately apparent how to combine statistical models with grammar-based systems without making obviously false independence assumptions).
Nevertheless, as any red-blooded psychologist had always insisted, the divorce between competence and performance that everyone else had accepted did not make any sense. The grammar and the processor had to have evolved in lock-step, as a package deal, for what could be the evolutionary selective advantage of a grammar that you cannot process, or a parser without a grammar?
It seemed equally obvious that surface syntax and the underlying semantic or conceptual representation must also be closely related, since the only reasonable basis for child language acquisition that has ever been on offer is that the child attaches language-specific grammar to a universal conceptual representation or “language of mind” (Miller 1967; Bowerman 1973; Wexler and Culicover 1980). It seemed to follow that radically new theories of grammar were needed.
2. The Problem of Discontinuity
In the augmented transition network (ATN) parsers of the time (Woods 1970), unbounded wh-dependencies were handled by: (a) putting a pointer into a * or HOLD register as soon as a wh-item like “which” was encountered, without regard to where it would end up; and (b) retrieving the pointer from HOLD when a verb needing an object like “had” was encountered, without regard to where it had started out. (The ATN also included an ingenious mechanism for coordination called SYSCONJ, which one finds even now being reinvented on an almost yearly basis—cf. Woods [2010].) A * register was also used for wh-constructions within a systemic grammar framework by Winograd (1972, pages 52–53) in his inspiring conversational program SHRDLU.
However, it was unclear how to generalize the HOLD register to handle the multiple long-range dependencies, including crossing dependencies, that are found in many other languages. In particular, if the HOLD register were assumed to be a stack, then the ATN becomes a two-stack machine (since we are already implicitly using one stack as a PDA to parse the context-free core grammar).
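For concreteness, the HOLD idea can be sketched in a few lines of Haskell. The toy recogniser below is not a model of Woods’s ATN in any detail, and its four-word lexicon and single rule S → NP V NP are invented: a fronted “which” puts a filler on HOLD, and a missing object position consumes it. That a single boolean slot suffices here, and would fail as soon as fillers nested or crossed, is exactly the limitation just described.

```haskell
type Toks  = [String]
type State = (Toks, Bool)          -- remaining input, and is a filler on HOLD?
type Recog = State -> [State]      -- a nondeterministic recogniser

andThen :: Recog -> Recog -> Recog
andThen p q st = concatMap q (p st)

word :: String -> Recog
word w (t : ts, h) | t == w = [(ts, h)]
word _ _                    = []

setHold, gap, lexNP, np, relClause, verb, clause :: Recog
setHold (ts, False) = [(ts, True)]     -- one slot only: no nested fillers
setHold _           = []

gap (ts, True) = [(ts, False)]         -- realise the held filler as an empty NP
gap _          = []

lexNP (t : ts, h) | t `elem` ["john", "mary"] = [(ts, h)]
lexNP ("the" : n : ts, h)
  | n `elem` ["report", "claims"] = (ts, h) : relClause (ts, h)
lexNP _ = []

np st = lexNP st ++ gap st

relClause = word "which" `andThen` setHold `andThen` clause

verb (t : ts, h) | t `elem` ["submitted", "rejected"] = [(ts, h)]
verb _ = []

clause = np `andThen` verb `andThen` np   -- S -> NP V NP (either NP may be a gap)

recognise :: [String] -> Bool
recognise ws = any (\(ts, h) -> null ts && not h) (clause (ws, False))

main :: IO ()
main = do
  print (recognise (words "john submitted the report"))                      -- True
  print (recognise (words "mary rejected the report which john submitted"))  -- True
  print (recognise (words "mary rejected the report which john submitted the claims"))  -- False: the filler is never consumed
```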
On the computational side at least, the reaction to this impasse took two distinct forms. Both sought to reduce the two major operators of the transformational theory, substitution of immediate constituents (what is nowadays called “Merge”) and displacement of non-immediate constituents (“Move”), to a single operation. On the one hand, Lexical Functional Grammar (Bresnan and Kaplan 1982) and Head-driven Phrase Structure Grammar (Pollard and Sag 1994) followed Kay (1979) in making unification the basis of movement and merger. Because unification can pass information across unbounded structures, this can be thought of as reducing Merge to Move.
On the other hand, Generalized Phrase Structure Grammar (Gazdar 1981), Tree Adjoining Grammar (TAG; Joshi and Levy 1982), and Combinatory Categorial Grammar (CCG; Ades and Steedman 1982) sought to reduce Move to various forms of local merger. In particular, the latter authors suggested that the same stack could be used to capture both long-range dependency and recursion in CCG.1
3. Why Does Natural Language Allow Discontinuity?
Natural language grammar exhibits discontinuity because semantically language is an applicative system. Applicative systems (such as programming languages) support the twin notions of: (a) Application of a function/concept to an argument/entity; and (b) Abstraction, or the definition of a new function/concept in terms of existing ones.
Language is in that sense inherently computational. It seems to follow that linguistics is (or should be) inherently computational as well. (Of course, it does not follow that computationalists have nothing to learn from linguistics.)
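For readers who have not met them, the combinators in question can be written down directly as ordinary higher-order functions. The following minimal Haskell sketch shows application alongside B (composition), T (type-raising), and S (substitution), under Curry’s names; the numeric examples are mine, purely for illustration.

```haskell
-- Application: a function applied to an argument.
apply :: (a -> b) -> a -> b
apply f x = f x

-- B f g = \x -> f (g x)     (composition)
b :: (b -> c) -> (a -> b) -> a -> c
b f g x = f (g x)

-- T x = \f -> f x           (type-raising: an argument becomes a function
--                            over the functions that apply to it)
t :: a -> (a -> b) -> b
t x f = f x

-- S f g = \x -> f x (g x)   (substitution)
s :: (a -> b -> c) -> (a -> b) -> a -> c
s f g x = f x (g x)

main :: IO ()
main = do
  print (apply (+ 1) 41)      -- 42
  print (b (+ 1) (* 2) 10)    -- 21: (\x -> (x * 2) + 1) 10
  print (t 3 (+ 4))           -- 7:  (\f -> f 3) (+ 4)
  print (s (+) (* 2) 5)       -- 15: (\x -> x + (x * 2)) 5
```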
4. Combinatory Categorial Grammar (CCG)
CCG lexicalizes all bounded dependencies, such as passive, raising, control, exceptional case-marking, and so forth, via lexical logical form. All syntactic rules are Combinatory—that is, binary operators over contiguous phonologically realized categories and their logical forms. These rules are restricted by a Combinatory Projection Principle, which in essence says they cannot override the decisions already taken in the language-specific lexicon, but must be consistent with and project unchanged the directionality specified there. All such language-specific information is specified in the lexicon: The combinatory rules like composition are free and universal. All arguments, such as subjects and objects, are lexically type-raised to be functions over the predicate, as if they were morphologically cased as in Latin, exchanging the roles of predicate and argument.
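As a concrete illustration of combinatory rules as binary operators over pairs of a syntactic category and a logical form, here is a minimal Haskell sketch with a three-word lexicon. The category datatype, the unreduced string-based logical forms, and the rule inventory are my own simplifications; directionality is modelled only by the forward and backward slashes themselves.

```haskell
data Cat = S | NP
         | Fwd Cat Cat          -- X/Y : seeks a Y to its right
         | Bwd Cat Cat          -- X\Y : seeks a Y to its left
  deriving (Eq, Show)

type LF   = String
type Sign = (Cat, LF)           -- a syntactic type paired with its logical form

app :: LF -> LF -> LF
app f a = "(" ++ f ++ " " ++ a ++ ")"

-- Forward application (>):   X/Y : f    Y : a    =>   X : f a
fapp :: Sign -> Sign -> Maybe Sign
fapp (Fwd x y, f) (y', a) | y == y' = Just (x, app f a)
fapp _ _ = Nothing

-- Backward application (<):  Y : a    X\Y : f    =>   X : f a
bapp :: Sign -> Sign -> Maybe Sign
bapp (y', a) (Bwd x y, f) | y == y' = Just (x, app f a)
bapp _ _ = Nothing

-- Forward composition (>B):  X/Y : f    Y/Z : g   =>   X/Z : \z. f (g z)
fcomp :: Sign -> Sign -> Maybe Sign
fcomp (Fwd x y, f) (Fwd y' z, g)
  | y == y' = Just (Fwd x z, "(\\z." ++ app f (app g "z") ++ ")")
fcomp _ _ = Nothing

-- Forward type-raising (>T): X : a   =>   S/(S\X) : \p. p a
traise :: Sign -> Sign
traise (x, a) = (Fwd S (Bwd S x), "(\\p." ++ app "p" a ++ ")")

john, mary, likes :: Sign
john  = (NP, "john'")
mary  = (NP, "mary'")
likes = (Fwd (Bwd S NP) NP, "likes'")          -- (S\NP)/NP

main :: IO ()
main = do
  -- Standard derivation: ((likes' mary') john')
  print (fapp likes mary >>= bapp john)
  -- "John likes" as an S/NP constituent, by type-raising and composition.
  print (fcomp (traise john) likes >>= \jl -> fapp jl mary)
```

The second derivation in main builds “John likes” as a nonstandard constituent of type S/NP; applied to “Mary”, its logical form beta-reduces to the same result as the standard derivation, which is the sense in which such additional derivations are semantically “spurious” (Section 4.2).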
Interestingly, the latest “minimalist” form of the transformational theory has also proposed that Move should be relabeled as an “internal” form of standard or “external” Merge (Chomsky 2001/2004, page 110), though without providing any formal basis for the reduction other than identifying internal Merge as “a grammatical transformation.” (If anything deserved the soubriquet “the lost combinator,” it would be this notional unitary combination of application and abstraction in a single perfect operator, linking or merging all types, as in the epigraph to this article.)
4.1 Expressivity of CCG
B^2 rules allow us to “grow” categories of arbitrarily high valency, such as ((S\NP_y)\NP_x)\NP_w. As we saw earlier, in some Germanic languages like Dutch, Swiss German, and West-Flemish, serial verbs are linearized using such rules to require crossing discontinuous dependencies. Thus, B^2 rules give CCG slightly greater than context-free power.
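Written as a plain function, second-order composition makes this growth in valency visible in its type; the following fragment is a sketch only, ignoring the directional side of the CCG rule.

```haskell
-- B^2, second-order composition: composing a one-argument function into a
-- two-argument one yields a two-argument result, the kind of rule that lets
-- a verb cluster accumulate arguments, as in ((S\NP_y)\NP_x)\NP_w.
b2 :: (c -> d) -> (a -> b -> c) -> a -> b -> d
b2 f g x y = f (g x y)              -- B^2 f g = \x y -> f (g x y)

main :: IO ()
main = print (b2 negate (+) 3 4)    -- -7: negate applied to (3 + 4)
```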
Nevertheless, CCG is still not as expressive as movement. In particular, we can only capture permutations that are what is called “separable,” where separability is related to the idea of obtaining the permutations by rebracketing and rotating sister nodes (Steedman 2018).
Twenty-one of the 22 separable permutations of “These five young boys” are attested (Cinque 2005; Nchare 2012). The two forbidden orders are among the unattested three.3
The number of separable permutations grows much more slowly in n than n!, the number of all permutations. For example, for n = 8, around 80% of the permutations are non-separable. There are obvious implications for the problem of alignment in machine translation and neural semantic parsing, to which we will return.
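These figures are easy to check by brute force, using the standard characterization of separable permutations as those containing no subsequence whose relative order is 2413 or 3142 (a characterization assumed here rather than stated in the text):

```haskell
import Data.List (permutations, subsequences)

-- The relative-order pattern of a list of distinct numbers, e.g.
-- relOrder [5,8,2,6] == [2,4,1,3].
relOrder :: [Int] -> [Int]
relOrder s = [ 1 + length [ y | y <- s, y < x ] | x <- s ]

separable :: [Int] -> Bool
separable p = not (any forbidden fours)
  where
    fours       = [ q | q <- subsequences p, length q == 4 ]
    forbidden q = relOrder q == [2,4,1,3] || relOrder q == [3,1,4,2]

countSeparable :: Int -> Int
countSeparable n = length (filter separable (permutations [1 .. n]))

main :: IO ()
main = do
  -- Four words ("These five young boys"): 22 of the 24 orders are separable.
  print (countSeparable 4)
  -- Eight words: only 8558 of 40320 orders are separable, i.e. roughly 79%
  -- of all orders are non-separable, the "around 80%" figure in the text.
  let n     = 8
      sep   = countSeparable n
      total = product [1 .. n]
  print (sep, total, fromIntegral (total - sep) / fromIntegral total :: Double)
```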
In 1986, I moved to Computer and Information Science at Penn, first as a visitor (incredibly, partly still under Sloan funding—at least, that was what Aravind Joshi told me), and then as a member of faculty, where much of the further development of CCG was worked out. Crucially, Aravind’s students proved in a series of papers (Vijay-Shanker, Weir, and Joshi 1987; Joshi, Vijay-Shanker, and Weir 1991; passim) that the “shared stack” claim of Ades and Steedman (1982) was correct, by showing that both CCG and Aravind Joshi’s TAG were weakly equivalent to Linear Indexed Grammar (Gazdar 1988), a new level of the Language Hierarchy characterized by the (Linear) Embedded Push-down Automaton (cf. Kuhlmann, Koller, and Satta 2015).
The class of languages characterizable by these formalisms fell within the requirements of what Joshi (1988) called “mild context sensitivity” (MCS), which he proposed as a criterion for what could count as a computationally “reasonable” theory of natural languages—informally speaking, that they are polynomially recognizable, and exhibit constant growth and some limit on crossing dependencies. However, the MCS class is much, much larger than the CCG/TAG languages, including the multiple context free languages and even (under certain further assumptions) the languages of Chomskian minimalism, so it seems appropriate to distinguish TAG and CCG as “slightly non-context-free” (SNCF, with apologies to the French railroad company).
4.2 CCG for Natural Language Processing
Despite its SNCF complexity, CCG was originally assumed to be totally unpromising as a grammar formalism for parsing, because of the extra derivational ambiguity introduced by type-raising and the combinatory rules, particularly composition. However, these supposedly “spurious” constituents also show up under coordination, and as intonational phrases. So any grammar with the same coverage as CCG will engender the same degree of nondeterminism in the parser (because it is there in the grammar). In fact, this is just another drop in the ocean of derivational ambiguity that faces all natural language processors, and can be handled by exactly the same statistical models as other varieties.
In particular, the head-word dependency models pioneered by Don Hindle, Mats Rooth, Mike Collins, and Eugene Charniak are straightforwardly applicable (Hockenmaier and Steedman 2002b; Clark and Curran 2004). CCG is also particularly well-adapted to parsing with “supertagger” front ends, which can be optimized using embeddings and long short-term memory (LSTM) (Lewis and Steedman 2014; Lewis, Lee, and Zettlemoyer 2016).
CCG is now quite widely used in applications, especially those that call for transparency between semantic and syntactic processing, and/or dislocation, such as machine translation (Birch and Osborne 2011; Mehay and Brew 2012), machine reading (Krishnamurthy and Mitchell 2014), incremental parsing (Pareschi and Steedman 1987; Niv 1994; Xu, Clark, and Zhang 2014; Ambati et al. 2015), human–robot interaction (Chai et al. 2014; Matuszek et al. 2013), and semantic parser induction (Zettlemoyer and Collins 2005; Kwiatkowski et al. 2010; Abend et al. 2017). Some of my own work with the same techniques has returned to their application in musical analysis (Granroth-Wilding and Clark 2014; McLeod and Steedman 2016), and shown that CCG grammars of the same SNCF class and parsing models of the same statistical kind are required there as well: It is only in the details of their compositional semantics that music and language differ very greatly.
Rather than reflecting on this substantial body of work in detail, I’d like to conclude by examining two further more speculative questions. The first is an evolutionary question: Why should natural language be a combinatory calculus in the first place? The second is a question about the future development of our subject: Will CCG and other grammar-based theories continue to be relevant to NLP in the age of deep learning and recursive neural networks? I’ll take these questions in order in the next two sections.
5. Why Is Language Combinatory?
Natural language looks like a distinctively combinatory applicative system because B, T, and S evolved independently, in our animal ancestors, to support planning of action sequences before there was any language (Steedman 2002). Even purely reactive planning animals like pigeons need application. Composition B and Substitution S are needed for seriation of actions. (Even rats need to compose sensory-motor actions to make plans.) Macaques can form the concept “food that you need to wash before eating.” Type-raising T is needed to map tools onto actions that they allow (affordances). (Chimpanzees and some other animals can plan with tools.) Second-order combinators like B^2 are needed to make plans with arbitrary numbers of entities, including non-present tools, and crucially including other agents whose cooperation is yet to be obtained. Only humans seem to be able to do the latter.
Thus, chimpanzees solve the (mistitled) monkey and bananas problem, using tools like old crates to gain altitude in order to obtain objects that are otherwise out of reach (Köhler 1925) (Figure 1(a)).
Köhler’s chimpanzee Grande’s planning amounts to composing (B) affordances (T) of crates and so on, such as actions of moving, stacking, and climbing on them. Similarly, macaques can apply S(B before eat) wash to a sweet potato (Kawai 1965) (Figure 1(b)). Interestingly, Köhler and much subsequent work showed that the crates and other tools have to be there in the situation already for the animal to be able to plan with them: Grande was unable to achieve plans that involve fetching crates from the next room, even if she had recently seen them there.
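The macaque plan can be spelled out directly, with the same B and S written as Haskell functions over symbolic actions. The string encoding, and the order in which “before” takes its two arguments, are illustrative assumptions; the point is only that B and S assemble the plan without any variable binding.

```haskell
b :: (b -> c) -> (a -> b) -> a -> c       -- B f g = \x -> f (g x)
b f g x = f (g x)

s :: (a -> b -> c) -> (a -> b) -> a -> c  -- S f g = \x -> f x (g x)
s f g x = f x (g x)

-- Symbolic actions over objects, represented as strings.
eat, wash :: String -> String
eat x  = "eat(" ++ x ++ ")"
wash x = "wash(" ++ x ++ ")"

before :: String -> String -> String
before p q = "before(" ++ p ++ ", " ++ q ++ ")"

main :: IO ()
main = putStrLn (s (b before eat) wash "potato")
-- prints: before(eat(potato), wash(potato))
```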
More generally, the problem of planning can be viewed as the problem of search for a sequence of actions in a lattice or labyrinth of possible states. Crucially, such search has the same recursive character as parser search. For example, there are both chart-based dynamic-programming and stack-based shift-reduce-style algorithms for the purpose. The latter seem a more plausible alternative in evolutionary terms, as affording a mechanism that is common to both semantic interpretation and parsing, rather than requiring the independent evolution of something like the CKY algorithm.4
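For concreteness, here is a minimal breadth-first sketch of planning as search over such a lattice, with an invented monkey-and-crate state space; a shift-reduce or chart-based search of the kind just mentioned would organize the same space differently, but over the same states and actions.

```haskell
import qualified Data.Set as Set

-- Generic breadth-first planner over any state space: find a shortest
-- action sequence from a start state to a goal.
plan :: Ord s => (s -> [(String, s)]) -> s -> (s -> Bool) -> Maybe [String]
plan step start goal = go (Set.singleton start) [(start, [])]
  where
    go _ [] = Nothing
    go seen ((s, path) : rest)
      | goal s    = Just (reverse path)
      | otherwise = go seen' (rest ++ next)
      where
        next  = [ (s', a : path) | (a, s') <- step s
                                 , not (s' `Set.member` seen) ]
        seen' = foldr (Set.insert . fst) seen next

-- Toy state: where the monkey and the crate are, and whether the monkey is
-- standing on the crate (the bananas hang over the centre of the room).
data St = St { monkeyAt :: String, crateAt :: String, onCrate :: Bool }
  deriving (Eq, Ord, Show)

actions :: St -> [(String, St)]
actions s =
  [ ("walk to " ++ p, s { monkeyAt = p })
  | p <- places, p /= monkeyAt s, not (onCrate s) ] ++
  [ ("push crate to " ++ p, s { monkeyAt = p, crateAt = p })
  | p <- places, p /= monkeyAt s, crateAt s == monkeyAt s, not (onCrate s) ] ++
  [ ("climb on crate", s { onCrate = True })
  | crateAt s == monkeyAt s, not (onCrate s) ]
  where places = ["door", "window", "centre"]

main :: IO ()
main = print (plan actions
                   (St "door" "window" False)
                   (\s -> monkeyAt s == "centre" && onCrate s))
-- Just ["walk to window","push crate to centre","climb on crate"]
```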
Planning therefore provides the infrastructure for linguistic performance, as well as the operators for competence grammar, allowing the two to emerge together, as what was referred to above as an evolutionary “package deal.”
6. CCG in the Age of Deep Learning
I hope to have convinced you that CCG grammars are both linguistically and evolutionarily explanatory, as well as supporting practical semantic parsers that are fast enough to parse the Web.
Nevertheless, like other supervised parsers, CCG parsers are limited by the weakness of parsing models based on only a million words of WSJ training data, no matter how cleverly we smooth them using synthetic data and embeddings. Moreover, the constructions they are specifically needed for—unbounded dependencies and so forth—are, as we have noted, rare, and off in the long tail. As a consequence, supervised parsers are easily matched in overall performance by end-to-end models (Vinyals et al. 2015).
This outcome says more about the weakness of treebank grammars and parsers than about the strength of deep learning. Nevertheless, in the area of semantic parser induction for arbitrary knowledge graphs such as FreeBase (Reddy, Lapata, and Steedman 2014), I would say that CCG and other grammar-based parsers have already been superseded by semisupervised end-to-end training of deep neural networks (DNNs) (Dong and Lapata 2016, 2018; Jia and Liang 2016), raising the question of whether DNNs will replace structured models for practical applications in general.
DNNs are effective because we don’t have access to the universal semantic representations that allow the child to induce full CCG for natural languages and that we really ought to be using both in semantic parser induction, and in building knowledge graphs. Because the FreeBase Query language is quite unlike any kind of linguistic logical form, let alone like the universal language of mind, it may well be more practically effective with small and idiosyncratic data sets to induce semantic parsers for them by end-to-end deep neural brute force, rather than by CCG semantic parser induction.
But is it really possible that the problem of parsing could be completely solved by recurrent neural network (RNN)/LSTM, perhaps augmented by attention/a stack (He et al. 2017; Kuncoro et al. 2018)? Do semantic-parsing-as-end-to-end-translation learners actually learn syntax, as has been claimed?
It seems likely that end-to-end semantic role labeler parsers and neural machine translation will continue to have difficulty with long-range wh-dependencies, because the evidence for their detailed idiosyncrasies is so sparse.
By contrast, supervised CCG parsers do rather well on embedded subject extraction (Hockenmaier and Steedman 2002a; Clark, Steedman, and Curran 2004), which is a fairly frequent construction that happens to be rather unambiguously determined by the parsing model.
Moreover, in principle at least, these constructions could be learned by grammar-based semantic parser induction from examples, using the methods of Kwiatkowski et al. (2010, 2011) and Abend et al. (2017).
These observations make it seem likely that there will be a continued need for structured representations in tasks like QA where long-range dependencies matter. Nevertheless, like rock n’roll, deep learning and distributional representations are clearly here to stay. The future in parsing for such tasks probably lies with hybrid systems using neural front ends for disambiguation, and grammars for assembling meaning representations.
7. The Shape of Things to Come
To arrive at a meaning representation language that is form-independent (and ultimately language-independent), we are using CCG parsers to machine-read the Web for relations between typed named entities, in order to detect consistent patterns of entailment between relations over named entities of the same types, using directional similarity over entity vectors representing relations. We then build an entailment graph (cleaning it up and closing it under relations such as transitivity [cf. Berant et al. 2015]). Cliques of mutually entailing relations in the entailment graph then constitute paraphrases that can be collapsed to a single relation identifier (Lewis and Steedman 2013a). (This can be done across text from multiple languages [Lewis and Steedman 2013b].)
We can then replace the original naive semantics for relation expressions with the relevant paraphrase cluster identifiers, and reparse the entire corpus using this now both form-independent and language-independent semantic representation, building an enormous knowledge graph, with the entities as nodes, and the paraphrase identifiers as relations.
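The clique-collapsing step can be sketched very simply: once the entailment graph has been closed under transitivity, cliques of mutual entailment coincide with strongly connected components. In the sketch below the relation names and entailment edges are invented; in the pipeline just described they would come from directional similarity over entity vectors.

```haskell
import Data.Graph (flattenSCC, stronglyConnComp)
import Data.List (nub)
import qualified Data.Map as Map

-- Directed entailment edges between relation predicates (invented).
entails :: [(String, String)]
entails =
  [ ("buy", "own"), ("purchase", "own"), ("acquire", "own")
  , ("buy", "purchase"), ("purchase", "buy")   -- mutual: a paraphrase pair
  ]

-- Map each relation to the identifier of its paraphrase cluster: one
-- identifier per strongly connected component of the entailment graph.
paraphraseId :: Map.Map String String
paraphraseId = Map.fromList
  [ (r, "rel:" ++ head cluster)
  | scc <- stronglyConnComp graph
  , let cluster = flattenSCC scc
  , r <- cluster ]
  where
    rels  = nub (concatMap (\(u, v) -> [u, v]) entails)
    graph = [ (r, r, [ v | (u, v) <- entails, u == r ]) | r <- rels ]

main :: IO ()
main = mapM_ print (Map.toList paraphraseId)
-- "buy" and "purchase" share one identifier; "own" and "acquire" keep their own.
```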
If this project is successful, the language-independent paraphrase cluster identifiers will perform the function of a “hidden” version of the decompositional semantic features in semantic representations like those of Katz and Postal (1964), Jackendoff (1990), Moens and Steedman (1988), White (1994), and Pustejovsky (1998), while the entailment graph will form a similarly hidden version of the “meaning postulates” of Carnap (1952) and Fodor, Fodor, and Garrett (1975). Such semantic representations are essentially distributional, but with the advantage that they can be combined with traditional logical operators such as quantifiers and negation.
This proposal for the discovery of hidden semantic primitives underlying natural language semantics stands in contrast to another quite different contemporary approach to distributional semantics that seeks to use dimensionally reduced vector-based representations of collocations to represent word meanings, using linear-algebraic operations such as vector and tensor addition and multiplication in place of traditional compositional semantics. It is an interesting open question whether vector-based distributional word embeddings can be used, together with directional similarity measures, to build entailment graphs of a similar kind (Henderson and Popa 2016; Chang et al. 2018). It is quite likely that some kind of hybrid approach will be needed here too, to combat the eternal silence of the infinite spaces of the long tail.
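As a small illustration of what a directional similarity measure looks like, the following sketch computes an asymmetric inclusion score over sparse feature vectors, roughly in the spirit of Weeds precision; the features and weights are invented, and real entailment-graph construction would of course involve much more than this.

```haskell
import qualified Data.Map as Map

type Vec = Map.Map String Double

-- Proportion of u's feature mass that falls on features v also has: high
-- inclusion of u in v is (defeasible) evidence that u entails v.
inclusion :: Vec -> Vec -> Double
inclusion u v = shared / total
  where
    shared = sum [ w | (f, w) <- Map.toList u, f `Map.member` v ]
    total  = sum (Map.elems u)

main :: IO ()
main = do
  let buy = Map.fromList [("of-shares", 3), ("from-dealer", 2)]
      own = Map.fromList [("of-shares", 3), ("from-dealer", 2), ("outright", 4)]
  print (inclusion buy own)   -- 1.0: buy's contexts are included in own's
  print (inclusion own buy)   -- about 0.56: the reverse direction scores lower
```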
8. Conclusion
Algorithms like LSTM and RNN may work in practice. But do they work in theory? In particular, can they learn all the syntactic stuff in the long tail, like non-constituent coordination, subject extractions, and crossing dependencies, in a way that will support semantic interpretation? If they are not actually learning syntax, but are instead learning a huge finite-state transducer or a soft augmented transition network, then by concentrating on them as mechanisms for natural language processing, we are in danger of losing sight of the computational linguistic project of also providing computational explanations of language and mind.
Even if we convince ourselves that something like SEQ2TREE is a psychologically real learning mechanism, and that children learn their first language by end-to-end mapping of the sentences of their language onto the situationally afforded structures of the universal language of mind, we still face the supreme challenge of finding out what that universal semantic target language looks like. We will not get an answer to that question unless we can rise above using SQL, SPARQL, and hand-built ontologies and logics such as OWL as proxies for the hidden language of mind, to use machine learning for what it is really good at, namely, finding out such hidden variables and their values, for use in a truly natural language processing.
Acknowledgments
The work was supported in part by ERC Advanced Fellowship GA 742137 SEMANTAX; ARC Discovery grant DP160102156; a Google Faculty Award and a Bloomberg L.P. Gift Award; a University of Edinburgh and Huawei Technologies award; my co-investigators Nate Chambers and Mark Johnson; the combinator found, Bonnie Webber; and all my teachers and students over many years.
Notes
1. As well as at Warwick and Edinburgh, much of the early work on CCG was done during 1980–1981, when I was a visiting fellow at the University of Texas at Austin under funding from the Sloan Foundation to Stanley Peters and Phil Gough; this was my first introduction to the United States.
2. The combinators are identified as B^n, T, and S^n for historical reasons. This combinatory calculus is distinct from the categorial type-logic stemming from the work of Lambek (1958) that similarly underpins Type-Logical Grammar (Oehrle 1988; Hepple 1990; Morrill 1994; Moortgat 1997), in which the grammar is a logic, rather than a calculus, and some but not all of the CCG combinatory rules are theorems.
3. This point seems to have escaped Berwick and Chomsky (2016, pages 175–176, n9).