I want to thank the ACL for the Lifetime Achievement Award of 2016. I am deeply honored, and I share this honor with the outstanding collaborators and students I have been lucky to have over my lifetime.
The title of my talk describes two fields of linguistics, which differ in their approaches to data and analysis and in their fundamental concepts. What I call the garden is traditional linguistics, including generative grammar. In the garden, linguists primarily analyze what I call “cultivated” data (that is, data elicited or introspected by the linguist) and form qualitative generalizations expressed in symbolic representations such as syntactic trees and prosodic phrases. What I am calling the bush could also be called “the wilderness.” In the bush, linguists collect “wild data,” spontaneously produced by speakers, and form quantitative generalizations based on concepts such as conditional probability and information content.
syntactic trees, prosodic phrases
conditional probability, information content
1. MIT Years: Into the Garden
I came into linguistics with a bachelor's degree in philosophy from Reed College, and after a couple of false starts I ended up in grad school at MIT studying formal grammar in the Department of Linguistics and Philosophy with Chomsky and Halle in the heyday of generative grammar. At MIT, Chomsky was my doctoral advisor, and my mentor was Morris Halle, who ran the Department at that time.
The exciting goal was to infer the nature of the mind's capacity for language from the structure of human language, viewed as a purely combinatorial set of formal patterns, like the formulas of symbolic logic. It was apparently exciting even to Fred Jelinek as an MIT doctoral student in information theory ten years before me (Jelinek 2009). In his Lifetime Award speech he recounts how as a grad student he attended some of Chomsky's lectures with his wife, got the “crazy notion” that he should switch from information theory to linguistics, and went as far as discussing it with Chomsky when his advisor Fano got wind of it and said he had to complete his Ph.D. in Information Theory. He had no choice. The rest is history.¹
The MIT epistemology held that the structure of language could not be learned inductively from what we hear; it had to be deduced from innate, universal cognitive structures specific to human language. This approach had methodological advantages for a philosophical linguist: First, a limitless profusion of data in our own minds came from our intuitions about sentences that we had never heard before; second, a sustained and messy relationship to the world of facts and data was not required; and third, with the proper training, scientific research could conveniently be done from an armchair using introspection.
2. Psychological (Un)reality
I got my Ph.D. from MIT in 1972 and taught briefly at Stanford and at UMass, Amherst, before joining the MIT faculty in 1975 as an Associate Professor of Linguistics. Very early on in my career as a linguist I had become aware of discrepancies between the MIT transformational grammar models and the findings of psycholinguists. For example, the theory that more highly transformed syntactic structures would require more complex processing during language comprehension and development did not work.
With a year off on a Guggenheim fellowship (1975–1976), I began to think about designing a more psychologically realistic system of transformational grammar that made much less use of syntactic transformations in favor of an enriched lexicon and pragmatics. The occasion was a 1975 symposium jointly sponsored by MIT and AT&T to assess the past and future impact of telecommunications technology on society, in celebration of the centennial of the invention of the telephone. What did I know about any of this? Absolutely nothing. I was invited to participate by Morris Halle. From Harvard Psychology, George Miller invited Eric Wanner, Mike Maratsos, and Ron Kaplan.
Ron Kaplan and I developed our common interests in relating formal grammar to computational psycholinguistics, and we began to collaborate. In 1977 we each taught courses at the IV International Summer School in Computational and Mathematical Linguistics, organized by Antonio Zampolli at the Scuola Normale Superiore, Pisa. In 1978 Kaplan visited MIT and we taught a joint graduate course in computational psycholinguistics. From 1978 to 1983, I consulted at the Computer Science Laboratory, Xerox Corporation Palo Alto Research Center (1978–1980) and the Cognitive and Instructional Sciences Group, Xerox PARC (1981–1983).
3. Lexical-Functional Grammar
During the 1978 fall semester at MIT we developed the LFG formalism (Kaplan and Bresnan 1982; Dalrymple et al. 1995). Lexical-functional grammar was a hybrid of augmented recursive transition networks (Woods 1970; Kaplan 1972), used for computational psycholinguistic modeling of relative clause comprehension (Wanner and Maratsos 1978), and my “realistic” transformational grammars, which offloaded a huge amount of grammatical encoding from syntactic transformations to the lexicon and pragmatics (Bresnan 1978) (see Figure 1).
As often noted, the LFG functional structures can be directly mapped to dependency graphs (Mel'cuk 1988; Carroll, Briscoe, and Sanfilippo 1998; King et al. 2003; Sagae, MacWhinney, and Lavie 2004; de Marneffe and Manning 2008) (see Figure 2). Some early statistical NLP parsers such as the Stanford Parser were dual-structure models like LFG with dependency graphs labeled by grammatical functions replacing LFG f-structures (de Marneffe and Manning 2008).
A key idea in LFG is that both active and passive argument structures are lexically stored (or created by bounded lexical rules) (see Figure 3). Independent evidence for lexical storage is that passive verbs undergo lexical rules of word-formation (Bresnan 1982b; Bresnan et al. 2015); surface features of passive SUBJ(ects) are retained (e.g., in tag questions). All relation-changing transformations are re-analyzed lexically in this way: passive, dative, raising, there-insertion, etc., etc.
Figure 4 shows how the lexical features of active and passive terminal strings are mapped into the appropriate predicate–argument relations. The respective subjects of the upper and lower c-structure trees are first and third person singular pronouns, which give rise to the f-structures indexed i. The lexical forms are, respectively, active and passive, mapping each subject f-structure to the appropriate argument role of the verb hit.
We soon involved a highly productive group of young researchers in linguistics and psychology (initially represented in Bresnan 1982a). The original group included Steve Pinker as a young postdoc from Harvard, Marilyn Ford as an MIT postdoc from Australia, Jane Grimshaw then teaching at Brandeis, and doctoral students at MIT and Harvard: Lori Levin, K. P. Mohanan, Carol Neidle, Avery Andrews, and Annie Zaenen.
4. Stanford and Becoming an Africanist
In 1982 I took a sabbatical leave from MIT at the Center for Advanced Study in the Behavioral Sciences at Stanford, where I also spent time at Xerox PARC nearby.
At Stanford, the Center for the Study of Language and Information (CSLI) was being launched with a grant from the System Development Foundation. I decided to stay in California by joining the Stanford Linguistics Department and CSLI the following year, half-time. From 1983 to 1992, I worked the other half of my time as a member of the Research Staff, Intelligent Systems Laboratory, Xerox Corporation Palo Alto Research Center, which John Seely Brown headed during that time. LFG was considered useful in some computational linguistics applications such as MT.
In 1984 a Malaŵian linguist on a Fulbright Fellowship came to Stanford as a visitor. I had corresponded with him a few years earlier when I was a young faculty member at MIT and he had received his Ph.D. from the University of London. We both had lexical syntactic inclinations, and he had evidence from the Bantu language Chicheŵa, one of the major languages of Malaŵi. His name was Sam Mchombo.
Sam Mchombo arrived at Stanford just as CSLI was forming, and we began to collaborate on the problems of analyzing the (sooo cool) linguistic properties of Chicheŵa in the LFG framework: Chicheŵa has 18 genders (but not masculine/feminine), tone morphemes, pronouns incorporated into verb morphology, relation changes all expressed by verb stem suffixation (which undergo derivational morphology), configurational discourse functions, …!
One of Sam's ideas was that the Chicheŵa object marker, prefixed to the verb stem, functions syntactically as a full-blooded object pronoun. In LFG, the formal analysis is simple (see Figure 5). The analysis has rich empirical motivation in phonology, morphology, syntax, discourse, and language change (Bresnan and Mchombo 1987, 1995). Similar analyses have been explored in many disparate languages (e.g., Austin and Bresnan 1996; Bresnan et al. 2015). This and other work by many colleagues and students on a wide variety of languages helped to establish LFG as a flexible and well-developed linguistic theory useful to typologists and field linguists.
I wrote National Science Foundation grant proposals for successive projects with Sam and other Bantuists including Katherine Demuth and Lioba Moshi, and in the summer of 1986 did field work in Tanzania. In Tanzania I took time out to celebrate my birthday by hiking up Mt. Kilimanjaro with a group of young physicians from Europe doing volunteer work in Africa.
There I was, out of the armchair and onto the hard-packed dirt floor of a thatched home in a village on the slopes of Mt. Kilimanjaro, being served the eyeball of a freshly slaughtered young goat, which I was told was a particular honor normally reserved for men. … But I was still in the garden of linguistics:
syntactic trees, prosodic phrases
Two intellectual shocks caused me eventually to leave the garden and completely change my research paradigm.
5. The Shock of Constraint Conflict
The first shock was the discovery that universal principles of grammar may be inconsistent and conflict with each other. The expressions of a language are not those that perfectly satisfy a set of true and universal constraints or rules, but are those that may violate some constraints in order to satisfy other more important constraints, optimizing constraint satisfaction. This insight came into linguistics from outside the field, from neural network approaches to cognition (Prince and Smolensky 1997). Yet as my former student Jane Grimshaw pointed out, we can see traces of it everywhere, even in corners of English syntax that had seemed exception-ridden.
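The logic of optimizing over ranked, violable constraints can be illustrated with a minimal Python sketch. The candidate names, the constraint names (`C1` to `C3`), and the violation counts are all invented for illustration and come from no analysis cited here. The sketch assumes strict ranking, so candidates are compared lexicographically on their violation profiles.

```python
from typing import Dict, List, Tuple

# Hypothetical violation counts: candidate -> constraint -> number of violations.
VIOLATIONS: Dict[str, Dict[str, int]] = {
    "cand_a": {"C1": 0, "C2": 2, "C3": 0},
    "cand_b": {"C1": 1, "C2": 0, "C3": 0},
    "cand_c": {"C1": 0, "C2": 1, "C3": 3},
}


def profile(candidate: str, ranking: List[str]) -> Tuple[int, ...]:
    """Violation counts ordered from the highest-ranked constraint down."""
    return tuple(VIOLATIONS[candidate][c] for c in ranking)


def optimal(candidates: List[str], ranking: List[str]) -> str:
    """Return the candidate whose profile is lexicographically smallest.

    Tuple comparison in Python is lexicographic, which mirrors strict
    domination: one violation of a higher-ranked constraint outweighs
    any number of violations of lower-ranked ones.
    """
    return min(candidates, key=lambda cand: profile(cand, ranking))


candidates = list(VIOLATIONS)
winner_1 = optimal(candidates, ["C1", "C2", "C3"])  # C1 >> C2 >> C3
winner_2 = optimal(candidates, ["C2", "C1", "C3"])  # C2 >> C1 >> C3
```

Under the first ranking the winner is `cand_c`, even though it violates two constraints; under the second ranking the winner is `cand_b`. No candidate satisfies every constraint, and re-ranking the same constraints selects a different output.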
In response to these ideas, I began in the mid-1990s to work out how to do optimality-theoretic (OT) syntax using LFG as the representational basis (e.g., Bresnan 2000). OT-style constraint ranking in large-scale LFG grammars was adopted in standard LFG parsing systems for ambiguity management (Frank et al. 1998; Kaplan et al. 2004; King et al. 2004). And Jonas Kuhn (2001, 2003) solved general computational problems of generation and parsing for OT syntax with LFG representations.
6. Grammars Hard and Soft
My former student Chris Manning (who wrote his Linguistics doctoral dissertation at Stanford under my supervision) joined the Stanford faculty as Assistant Professor of Computer Science and Linguistics in 1999, and I began to attend his lectures and meet with him to discuss research. In studies of English using the Switchboard corpus, we found that English has soft, statistical shadows of hard person constraints in other languages: For example, it has person-driven active/passive alternations (Bresnan, Dingare, and Manning 2001), and person-driven dative alternations (Bresnan and Nikitina 2009). As Bresnan, Dingare, and Manning (2001) observe, “The same categorical phenomena which are attributed to hard grammatical constraints in some languages continue to show up as statistical preferences in other languages, motivating a grammatical model that can account for soft constraints.”
Examples of such models include stochastic OT (Bresnan, Deo, and Sharma 2007; Maslova 2007; Bresnan and Nikitina 2009), maximum-entropy OT (Goldwater and Johnson 2003; Gerhard Jaeger 2007), random fields (Johnson and Riezler 2003), data-oriented parsing (Bod and Kaplan 2003; Bod 2006), and other exemplar-based theories of grammar (Hay and Bresnan 2006; Walsh et al. 2010).
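To make the idea of a soft constraint concrete, here is a minimal Python sketch in the spirit of a maximum-entropy model: each candidate's probability is proportional to the exponential of its negated weighted violation sum. The candidates, constraint names, weights, and violation counts are invented for illustration, and the function is a simplified assumption, not the implementation used in any of the cited studies.

```python
import math
from typing import Dict


def maxent_probs(
    violations: Dict[str, Dict[str, int]],
    weights: Dict[str, float],
) -> Dict[str, float]:
    """Softmax over negated weighted violation sums ("harmonies")."""
    harmonies = {
        cand: -sum(weights[c] * n for c, n in viols.items())
        for cand, viols in violations.items()
    }
    # Subtract the maximum harmony before exponentiating, for numerical stability.
    top = max(harmonies.values())
    exps = {cand: math.exp(h - top) for cand, h in harmonies.items()}
    total = sum(exps.values())
    return {cand: e / total for cand, e in exps.items()}


# Hypothetical toy input: two candidates, two weighted constraints.
toy_violations = {
    "x": {"C1": 1, "C2": 0},
    "y": {"C1": 0, "C2": 1},
}
toy_weights = {"C1": 2.0, "C2": 1.0}
probs = maxent_probs(toy_violations, toy_weights)
```

Here `y` is preferred because it violates only the lower-weighted constraint, but `x` still receives some probability. A constraint that is categorical under strict ranking therefore surfaces as a statistical preference.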
7. Into the Bush
In addition to Chris Manning, another person who helped me go into the bush was Harald Baayen. I first met Harald while attending a 2003 LSA workshop on Probability Theory in Linguistics. The presenters included Harald, Janet Pierrehumbert (with her student Jen Hay), Chris Manning, and others. As I watched the presenters give graphic visualizations of quantitative data showing dynamic linguistic phenomena, I thought, “I want to do that!”
Bresnan et al. (2007) collected and manually annotated 2,360 instances of dative NP or PP constructions from the (unparsed) Switchboard Corpus (Godfrey et al. 1992) and another 905 from the Treebank Wall Street Journal corpus, using the manually parsed Treebank corpora (Marcus et al. 1993) to discover the set of dative verbs. With Harald Baayen's help we fit a series of generalized linear and generalized linear mixed-effect models to the data. I read several books on statistical modeling that Harald recommended to me; I learned R, the programming language and environment for computational statistics (R Core Team 2015), which he also highly recommended (stressing that his name is R. Harald Baayen). Then I was able to show that under bootstrapping of speaker clusters and cross-validation, the models were highly accurate.
Figure 7 illustrates a quantitative generalization that emerged from the models: The paired parameter estimates for the recipient and theme have opposite signs. I dubbed this phenomenon quantitative harmonic alignment after the qualitative harmonic alignment in syntax studied by Aissen (1999) and others in the framework of Optimality Theory. Figure 8 provides a qualitative schematic depiction of the quantitative phenomena. The hierarchies of discourse accessibility, animacy, definiteness, pronominality, and weight are aligned with the initial/final syntactic positions of the postverbal arguments across constructions. In LFG, the linear order follows from alignment with the hierarchy of grammatical functions in f-structure (Bresnan and Nikitina 2009).
Interestingly, there are hard animacy constraints on dative as well as genitive alternations in some languages (Rosenbach 2005; Bresnan 2007a; Rosenbach 2008), and even hard weight constraints (O'Connor, Maling, and Skarabela 2009). The facts suggest that these constraints play a role in syntactic typology and should not be brushed off as external to grammar and out of bounds to the theoretical linguist.
In principle, these quantitative harmonic alignment effects could be formulated as constraints on LFG f-structures. Annie Zaenen saw how this kind of work might be useful for paraphrase analysis to improve generation, and she got us involved with Mark Steedman and colleagues at Edinburgh on an animacy annotation project (Zaenen et al. 2004).
8. Data Shock
The second intellectual shock that pushed me further into the bush was realizing that in the garden, we had been relying all along on inconsistent binary grammaticality judgments that can be manipulated by changing the probabilities of the contexts, and we had vastly underestimated the human language capacity.
At the suggestion of Jeff Elman, I used the dative corpus model to measure the predictive power of English language users (Bresnan 2007b). Inspired by Anette Rosenbach's (2003) beautiful experiment on the genitive alternation, and with her advice, I made questionnaires asking participants to rate the naturalness of contextualized alternative dative constructions sampled from our dative data set, by allocating 100 points between the alternatives (see Figure 9).
In one set of task instructions I asked participants to rate the choices in accordance with their own intuitions; in another I asked them to guess what the original speaker actually said in the discourse excerpt, and to rate their confidence in their guess in the same way, splitting 100 points between the alternatives. The findings were similar: As the log odds of a PP dative construction increased, the ratings of each participant showed a linear increase as well. The participants could tell which dative construction the original speaker was going to use, and their own ratings matched the corpus probabilities (see Figure 10). This finding has been replicated across speakers of other varieties of English (Bresnan and Ford 2010).
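For readers less familiar with log odds, the following Python sketch shows how a model's log odds for the PP dative relates to a probability, and how an idealized rater who matched that probability exactly would split 100 points. The function names and the idealized rater are assumptions for illustration; the actual ratings were only reported to increase linearly with the log odds.

```python
import math
from typing import Tuple


def log_odds_to_prob(log_odds: float) -> float:
    """Logistic function: convert log odds to a probability (moderate values only)."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def prob_to_log_odds(p: float) -> float:
    """Logit: convert a probability strictly between 0 and 1 to log odds."""
    return math.log(p / (1.0 - p))


def split_points(p_pp_dative: float, total: int = 100) -> Tuple[int, int]:
    """Points a hypothetical, perfectly probability-matching rater would give
    to the PP dative and to the double object alternative."""
    pp_points = round(total * p_pp_dative)
    return pp_points, total - pp_points


# Log odds of 0 means both constructions are equally likely.
even_split = split_points(log_odds_to_prob(0.0))
```

Positive log odds favor the PP dative and negative log odds favor the double object construction. Equal steps in log odds correspond to unequal steps in probability near 0 and 1.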
A second experiment reported in Bresnan (2007b) used linguistic manipulations that raise or lower probability to see whether they influence grammaticality judgments. Certain semantic classes of verbs reported by linguists to be ungrammatical in the double object construction are nevertheless found in actual usage (Bresnan et al. 2007). For example, whisper is reported to be ungrammatical in the double object construction, but Internet queries yield whisper me the answer, along with whisper the password to the fat lady. The double object context with the pronoun recipient is more harmonically aligned and far more probable. The reportedly ungrammatical examples constructed by linguists tend to utilize the far less probable positionings of argument types, like whisper the fat lady the answer. In the experiment I found that participants rated the reportedly ungrammatical constructions in the more probable contexts higher than the reportedly grammatical constructions in the less probable contexts.
9. Probabilistic Syntactic Knowledge
The simplifying assumption in the garden of linguistic theory has been that speakers' knowledge of their language is characterized by a static, categorical system of grammar. Although this has been a fruitful idealization, I came to see that it ultimately underestimates human language capacities. My own research showed that language users can match the probabilities of linguistic features of the environment and they have powerful predictive capabilities that enable them to anticipate the variable linguistic choices of others. Therefore, the working hypothesis I have adopted contrasts strongly with that of the garden: Grammar itself is inherently variable and probabilistic in nature, rather than categorical and algebraic.
With collaborators, I began to look for implicit knowledge of syntactic probabilities in various areas of linguistics—in phonetic reflexes of construction frequencies and probabilities (Hay and Bresnan 2006; Tily et al. 2009; Kuperman and Bresnan 2012); in language development (de Marneffe et al. 2012; Van den Bosch and Bresnan 2015); in language change (Wolk et al. 2013; Szmrecsányi et al. 2014); and in comparative syntactic variation across varieties of English (Bresnan and Ford 2010; Ford and Bresnan 2015).
10. Deeper into the Bush
My research program is taking me even deeper into the bush, as I will now illustrate with several visualizations of wild data.
Previous corpus studies point to informativeness (or closely related concepts) as an important predictor of verb contraction (Frank and Jaeger 2008; Bresnan and Spencer 2012; Barth and Kapatsinski 2014). Informativeness derives from predictability: More predictable events are less informative, and can even be redundant (Shannon 1948). The less predictable the host–verb combination, the more informative it is, and the less likely to contract.
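The link between predictability and informativeness can be sketched in a few lines of Python using Shannon's surprisal, the negative log probability of an event. The bigram counts below are invented toy numbers, and the conditional probability of the verb given its host is one simple choice of measure assumed here for illustration; it is not necessarily the informativity measure used in the studies cited.

```python
import math
from typing import Dict, Tuple

# Hypothetical toy counts of (host, following word) bigrams.
COUNTS: Dict[Tuple[str, str], int] = {
    ("it", "is"): 6,
    ("it", "was"): 2,
    ("mommy", "is"): 1,
    ("mommy", "went"): 3,
}


def surprisal_bits(counts: Dict[Tuple[str, str], int], host: str, verb: str) -> float:
    """Surprisal of the verb given the host, in bits: -log2 P(verb | host)."""
    host_total = sum(n for (h, _), n in counts.items() if h == host)
    p = counts[(host, verb)] / host_total
    return -math.log2(p)


predictable = surprisal_bits(COUNTS, "it", "is")       # P = 6/8
unpredictable = surprisal_bits(COUNTS, "mommy", "is")  # P = 1/4
```

In this toy data "is" is highly predictable after "it" and so carries little information, whereas after "mommy" it is less predictable and carries 2 bits. On the account described above, the second combination would be the less likely of the two to contract.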
Figure 11 shows the strong inverse relation between verb contraction and informativity of host–verb bigrams for the verb is/'s in data from the Canterbury Corpus of New Zealand English (Gordon et al. 2004), which I collected with Jen Hay in the summer and fall of 2015.² Jen and I discussed implications of the informativity of verb contraction. As vocabulary richness grows, local word combinations become less predictable. Average predictability across contexts is what makes a host–verb combination more or less informative. Hence, increases in vocabulary richness would lead to increased informativity of host–verb combinations, potentially causing dynamic changes in verb contractions over time. These implications led me to ask whether children decrease their use of verb contractions in periods of increasing vocabulary richness during language development, a question never before asked, as far as I could tell from the literature.
It was not difficult to answer my question. I had already collected verb contraction data from longitudinal corpora in CHILDES (MacWhinney 2000).³ The CHILDES corpora are an invaluable resource. They have been morphologically analyzed, automatically parsed, and manually checked. As a matter of fact, the syntactic parses use dependency graphs derived from LFG functional structure relations (Sagae, MacWhinney, and Lavie 2004; Sagae, Lavie, and MacWhinney 2005). Computational tools are provided, including the CLAN VOCD tool (MacWhinney 2015), which calculates vocabulary richness based on a sophisticated algorithm for averaging morpheme or lemma counts (lemmas being the distinct words disregarding inflections).
I used the VOCD tool to calculate the lemma diversity of all of the language produced by each child at each recording session in the longitudinal corpora we selected. The panel plots in Figure 12 show increasing vocabulary richness as age increases between about 20 and 60 months, with the exception of one child (“Adam”), whose vocabulary richness was already high at the time of sampling and shows a dip midway in the recordings.
From the relations among verb contraction, informativity, and vocabulary richness we would expect children's subject–verb contractions to decrease as their vocabulary richness increases. Figure 13 shows that this expectation may be met: Overall, the children's contractions are tending to decrease, and only one child shows an increase in contraction during the period from 20 to 60 months: Adam, the child whose vocabulary richness shows the dip in Figure 12.
How do the dynamics of contraction in the children's data compare to their parents' contractions in conversational interactions during the same periods? Presumably the parents themselves would not be experiencing rapid vocabulary growth during this period, so they would not show a decline in verb contractions for that reason. Figure 14 shows the aggregated is contraction data by children vs. parents for all host-is/'s bigrams and for the frequent bigrams what is/'s and Mommy is/'s. Strikingly, the children in the aggregate show declines in the proportion of contractions while their parents' proportions remain constant.
These data visualizations appear to support a major cognitive role for implicit knowledge of syntactic probability. But they also raise many, many questions for further research—a good indicator of their potential fertility. This will be the focus of my next research project.
I am aware that this use of information theory characterizes just one small prosodic island in English syntax, the contractible host–verb sequence. It does make me wonder whether the entirety of hierarchical syntactic structure could somehow reflect a vast topography of overlapping peaks and valleys of informativeness, requiring parallel computations on multiple scales—something that one of you perhaps could figure out how to do.
11. Looking Forward
My work now has essentially the same exciting goal as when I started my study of linguistics at MIT: to infer the nature of the mind's capacity for language from the structure of human language. But now I know that linguistic structure is quantitative as well as qualitative, and I can use methods that I have been learning in the bush —
conditional probability, information content
What I hope to see going forward are increasingly powerful applications of computational linguistic theory, techniques, and resources to deepen our understanding of human language and cognition.
¹ Fred Jelinek, a pioneer in the application of statistical methods to automatic speech recognition and machine translation, became notorious among linguists for his (possibly apocryphal) comment: “Every time I fire a linguist, the performance of the speech recognizer goes up.”
² The NZ data consist of 11,719 instances of variable is contraction from 412 speakers, collected with the assistance of Vicky Watson, and n-grams from the entire Canterbury Corpus (n = 1,087,113 words).
³ The CHILDES data were collected in the summer of 2015 with Arto Anttila and the assistance of Gwynn Lyons from the corpora described by Brown (1973), Clark (1978), Demetras (1986), Kuczaj (1977), Sachs (1983), and Suppes (1974). The data include all full and contracted instances of the verb is (n = 56,088) by children (19,888 instances) and their parents (26,592) as well as others present.