Abstract
Because the 2020 ACL Lifetime Achievement Award presentation could not be done in person, we replaced the usual LTA talk with an interview between Professor Kathy McKeown (Columbia University) and the recipient, Bonnie Webber. The following is an edited version of the interview, with added citations.
Bonnie Webber: Let me start by thanking you, Kathy, for agreeing to interview me. I think you were the first Ph.D. student I met when I joined the faculty at Penn after finishing my own Ph.D. You were Aravind Joshi’s Ph.D. student at the time, working on generating answers to definition questions. I believe your Ph.D. thesis then became the first book published in the Cambridge University Press series on Natural Language Processing, of which Aravind was the editor-in-chief.
Kathy McKeown: Well, Bonnie, thanks so much for asking me to do this. I really will enjoy it.
What I remember of you from that time was your arrival in the department like a breath of fresh air. You were very animated, always vocal. And to a young woman like myself at that time, you were an amazing role model. So I’m happy to have this chance to ask you more questions than I could for my keynote address.
You’ve frequently asserted that we shouldn’t develop computational models based simply on standard assumptions about a linguistic phenomenon, but rather that we should first try to characterize the phenomenon as fully as possible and then see how much of that characterization can be covered computationally. Given that current approaches in NLP don’t focus on subtle linguistic differences and what a model does or doesn’t cover, but instead worry about scale, do you think your approach to research still has something to contribute?
Bonnie: Well, I certainly hope so! Let me answer the question in terms of standard metrics used in Natural Language Processing—in particular, F-score, Precision, and Recall. Although F-score has mostly been used as a basis for “League Table” ratings, I would contend that there is much to be learned from its components—Precision and Recall. I always encourage my students to look at them very carefully.
Let me elaborate. Errors in Precision reflect incorrect positive predictions—cases where a system has generalized incorrectly about what belongs to the class of interest, saying that more things belong to it than actually do. Errors in Recall, on the other hand, reflect incorrect negative predictions—cases where a system has generalized incorrectly about what doesn’t belong to the class of interest, omitting things from the class that should be there.
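In the standard formulation, with TP, FP, and FN counting true positives, false positives, and false negatives, Precision = TP / (TP + FP) and Recall = TP / (TP + FN). The following is a minimal sketch of that arithmetic; the counts are made up for illustration and are not figures from the interview.

```python
# Minimal sketch: Precision is penalized by false positives (over-acceptance),
# Recall by false negatives (omissions). The counts below are illustrative only.
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    precision = tp / (tp + fp)  # hurt by incorrect positive predictions
    recall = tp / (tp + fn)     # hurt by incorrect negative predictions
    return precision, recall

p, r = precision_recall(tp=80, fp=20, fn=40)
f_score = 2 * p * r / (p + r)   # the F-score that "League Tables" report
print(f"Precision={p:.2f}  Recall={r:.2f}  F={f_score:.2f}")
```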
Now, there are at least two possible reasons for low Precision and/or low Recall: Either the system itself is unable to learn from the data because it’s too puny, or there’s something wrong with the data. What doesn’t surprise me, having spent years creating annotated corpora, is how easy it is for the latter to be the case. That is, errors in Precision and Recall show that there’s more to a phenomenon than we thought.
So what I’m trying to say is that there is a lot of linguistics that’s not a solved problem, and that errors in Recall and Precision should encourage people to try to better understand and characterize the phenomenon. Even if a system looks good, in terms of its True Positives and True Negatives, many of the tokens it labels correctly may just be minor variations of one another. On the other hand, looking at what has been incorrectly rejected or accepted can really tell you a lot about a phenomenon and lead to a better result on the next iteration. So, yes, I think an analytic approach to research really does have something to contribute.
Kathy: Could you give an example of this that you’ve come across personally?
Bonnie: Oh yes. I’d be happy to give one. When we started creating the Penn Discourse TreeBank—that is, when we started annotating discourse coherence relations (relations that are signaled either explicitly, with some kind of connective, or implicitly, by inference), we followed the standard practice of having annotators label each relation with the sense that was taken to hold between its arguments. For example, the situation described by one argument could be taken to be the result of the situation described by the other argument.
But then we had examples like the following (Rohde et al. 2017, 2018), which are made up rather than taken from the Wall Street Journal corpus:
- (1) It’s too far to walk. Let’s take the bus.
- (2) It’s too far to walk, so let’s take the bus.
- (3) It’s too far to walk. Instead, let’s take the bus.
- (4) It’s too far to walk. So instead, let’s take the bus.
Back in 2006, we modified the tool we used in annotating the PDTB to allow annotators to record either one or two senses as holding between the arguments of a relation. However, we neither insisted that they record multiple senses nor even reminded them that they could. So there weren’t many tokens annotated with multiple senses in the PDTB-2 (Prasad, Webber, and Joshi 2014), and they were ignored in nearly all of the work done on learning to annotate discourse relations automatically (Lin, Ng, and Kan 2012; Xue et al. 2015, 2016).
Anyway, what moved us to allow annotators to label relations with either one or two senses was noticing the kinds of disagreements they had with each other, and seeing for ourselves that the labels they were assigning all basically seemed correct. I’m now hoping that people will start taking seriously the concept of multiple sense labels and distinguishing cases where multiple relations hold from those where only one or the other holds. This should be easier, given the larger number of discourse relations annotated with multiple senses in the enlarged PDTB-3 corpus (Webber et al. 2019).
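As a concrete illustration of what taking multiple sense labels seriously could mean for evaluation, here is a minimal sketch that treats sense labelling as multi-label classification; the sense names and the toy gold/predicted annotations are illustrative assumptions, not PDTB data or tooling.

```python
# A minimal sketch (illustrative, not PDTB tooling): scoring discourse relation
# sense labelling as multi-label classification, so that a token like
# "So instead, let's take the bus" can carry both a Result and a Substitution
# sense rather than being forced into a single label.
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score

gold = [{"Result", "Substitution"}, {"Result"}, {"Substitution"}]  # annotators' senses
pred = [{"Result"},                 {"Result"}, {"Substitution"}]  # a system's output

mlb = MultiLabelBinarizer()
y_true = mlb.fit_transform(gold)
y_pred = mlb.transform(pred)

# Micro-averaged F1 gives partial credit for recovering one of two senses,
# instead of counting the first example as simply wrong.
print(f1_score(y_true, y_pred, average="micro"))
```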
Kathy: When you started in the field in the 1970s and the 1980s, any research that didn’t involve syntax or parsing was taken to be on the lunatic fringe of NLP. More recently, we’ve seen that syntax and parsing seem to be receding and are no longer seen as relevant. They’re being replaced by neural models that reject explicit notions of syntax and parsing, and the same is true of semantics. So, do you think that neural models will also successfully encode discourse, with discourse theory and discourse processing also becoming irrelevant?
Bonnie: That’s a good question. I’m going to frame my answer first, in terms of discourse structure and the knowledge and reasoning needed to handle discourse phenomena such as coreference and discourse coherence relations, and second, in terms of the way that a lot of material is left implicit in text or speech because it can be gleaned from context or inferred.
With respect to discourse structure, discourse generally has some kind of macro structure that often reflects the genre of the text or speech being analyzed and the kind of information being conveyed. Examples include answers to definition questions, where, as you detail in your Ph.D. thesis (McKeown 1985), a question like What is an X? is often answered by giving a statement identifying what an X is, then giving one or more illustrations of X, and describing the attributes of an X.
Of course, by now everybody knows that news articles have a macro structure consisting of a title and then an optional strap line, followed by text in an inverted pyramid structure (Stede 2012; Webber, Egg, and Kordoni 2012). Similarly, biomedical research papers (and more recently, their abstracts) are written with a macro structure that comprises a statement of the objectives of the work, the methods used in the work, the results that were obtained, and then a discussion of the results with respect to the objectives (Lin et al. 2006; Mizuta et al. 2006; Hirohata et al. 2008; Agarwal and Yu 2009; Cohen et al. 2010; Liakata et al. 2010).
Now, while computational linguistics papers do not share the same structure as biomedical research papers, it’s likely that, from a recognition perspective, neural models will indeed be effective at recognizing this macro-level discourse structure, because the features needed to recognize it are very local and there’s nothing really deep about it. Also, there’s a lot of available text. So we just need researchers working on the problem.
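To make the point about "very local" features concrete, here is a toy sketch of classifying abstract sentences into the zones mentioned above using only surface lexical cues; the handful of training sentences, the zone labels, and the model choice are illustrative assumptions and are not taken from any of the cited systems.

```python
# A toy sketch: recognizing the macro structure of a biomedical abstract as
# sentence-level zone classification from purely local lexical features.
# The training sentences and labels are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = [
    "We aimed to assess whether the drug lowers blood pressure.",
    "Our objective was to compare the two treatment regimes.",
    "Patients were randomly assigned to treatment or placebo.",
    "Blood pressure was measured at baseline and after eight weeks.",
    "Mean systolic pressure fell by 12 mmHg in the treatment group.",
    "No serious adverse events were observed.",
    "These findings suggest the drug is effective for hypertension.",
    "Further trials are needed to confirm long-term benefits.",
]
zones = ["OBJECTIVE", "OBJECTIVE", "METHODS", "METHODS",
         "RESULTS", "RESULTS", "DISCUSSION", "DISCUSSION"]

# Word and bigram counts are about as "local" as features get.
classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                           LogisticRegression(max_iter=1000))
classifier.fit(sentences, zones)
print(classifier.predict(["Participants were recruited from three clinics."]))
```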
By the way, listeners might be interested to learn how news articles came to adopt the “inverted pyramid” structure: Its adoption goes back to the days when news began to be transmitted quickly across the United States by the new technology of the telegraph in the 1850s. If the transmission dropped, the “inverted pyramid” ensured that the most important information would not be lost.
More recently, neural models are beginning to be used to identify and exploit the reasoning, world knowledge, and pragmatics needed in order to understand text—in particular, to handle the kinds of ambiguity, under-specification and information conveyed implicitly through context or inference. I do believe there will be some success at encoding discourse, but I don’t think it means that discourse theory and discourse processing will become irrelevant.
As you know, the standard illustration of the need for world knowledge and reasoning in resolving ambiguous pronouns comes from the minimal pair that Terry Winograd introduced in his 1970 Ph.D. thesis (Winograd 1972). This is the famous example contrasting the pair of sentences
- (5a) The City Council refused the women a permit because they feared violence.
- (5b) The City Council refused the women a permit because they advocated violence.
In 2019, eight years after such minimal pairs had been taken up as the basis of the Winograd Schema Challenge, a team from Oxford, Imperial, and the Alan Turing Institute (Kocijan et al. 2019b) showed that fine-tuning a BERT language model with several million artificial Winograd-schema-like sentences extracted from Wikipedia could enable that model to correctly resolve over 75% of the examples in the Challenge, and also improve the performance of systems in resolving more prosaic examples of personal pronoun anaphors (Kocijan et al. 2019a).
Now, although this is very impressive and something we should learn from, it is unclear how much further performance can improve, given that it reflects general patterns of personal pronoun use rather than any specific understanding of coreference. For example, at least as much cleverness would be needed to create versions of WikiCREM for training systems to handle event anaphors, bridging anaphors, and other forms of ambiguity, under-specification, and information conveyed implicitly. As such, there is still a lot for current and future NLP researchers to do—and for languages other than English as well!
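For readers curious what resolving an ambiguous pronoun with a masked language model can look like in practice, here is a rough sketch of the general idea, not the cited authors' code; the model name, the single-token candidates, and the replacement of the pronoun by a mask are simplifying assumptions.

```python
# A rough sketch of pronoun disambiguation with a masked language model:
# replace the ambiguous pronoun with a mask and compare the model's probability
# for each (single-token) candidate referent. This simplifies the cited work,
# which handles multi-token candidates and fine-tunes the model on Wikipedia data.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def score_candidates(masked_sentence: str, candidates: list[str]) -> dict[str, float]:
    """Probability of each single-token candidate at the [MASK] position."""
    inputs = tokenizer(masked_sentence, return_tensors="pt")
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos, :].squeeze(0)
    probs = torch.softmax(logits, dim=-1)
    return {c: probs[tokenizer.convert_tokens_to_ids(c)].item() for c in candidates}

# Example (5a): does "they" refer to the council or to the women?
print(score_candidates(
    "The city council refused the women a permit because the [MASK] feared violence.",
    ["council", "women"]))
```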
Kathy: So as my final question, I want to ask whether you feel that discourse or any other field in which you’ve worked has changed over the course of your career.
Bonnie: Well, I’m always surprised to be reminded how long my career has been! As an M.Sc. student at Harvard, I was fortunate to be offered a part-time job at Bolt Beranek and Newman (BBN), working on question answering, as part of Bill Woods’s LUNAR system (Woods, Kaplan, and Nash-Webber 1972; Woods 1978). My responsibility was to enable LUNAR to handle more types of questions, and to support more ways of asking any particular question. My Ph.D. thesis, which I completed while at BBN, characterized the kinds of discourse anaphors found in questions posed to LUNAR, and what kind of entities had to be available for those anaphors to be resolved (Webber 1979, 1982).
When I went to Penn in 1978, Aravind Joshi was also carrying out and supervising work on question answering, among all the many other things that he was interested in, such as code switching, formal languages, and the complexity of Natural Language grammar. Besides your own work, Aravind’s supervision here included Jerry Kaplan’s work on correcting factual misconceptions underlying a question (Kaplan 1982; Joshi, Webber, and Weischedel 1984; Webber 1986), and Eric Mays’ work on possible versus impossible changes in dynamic databases (Mays, Joshi, and Webber 1982; Mays 1984; Webber 1986). In both cases, it was assumed that directly answering a question that had some underlying misconception would mislead the questioner and prolong the misconception. This wasn’t the only research on what, at the time, we called cooperative question answering (e.g., Cheikes and Webber 1988; Cheikes 1991), a label adopted by other researchers as well, including Farah Benamara and Patrick Saint-Dizier.
However, there was a big change in question answering in the 1990s, when NIST instituted question answering tracks as part of its annual Text REtrieval Conference (TREC). Systems competing in these tracks took an information retrieval approach to question answering—searching free text for answers to factoid questions, which have a single answer, or for answers to list questions or definition questions (cf. https://aclweb.org/aclwiki/Question_Answering_(State_of_the_art)). This was then followed by a focus on using machine learning and neural networks in question answering, reintroducing semantic parsing as a way of directly converting questions to formal queries or database queries. With respect to the kinds of things that interested us in cooperative question answering, I think it has fallen to dialogue systems to pick up the slack on the kinds of interactions between a questioner and a respondent that question answering really requires.
On the other hand, I think we can still rely on the fact that problems we tried to address and approaches we tried to use 10, 20, 30, or even 40 years ago, and then abandoned for lack of machines powerful enough to support our efforts, are now being rediscovered, with people making new progress on them. The optimist in me thinks, or at least hopes, that this will continue to be the case. So while question answering has changed, hopefully it will come back to where we started and go off with much more power behind it.
We should probably wrap up now because there’s not that much time left. But before we do, I want to thank you, and Mark Johnson, and Anna Korhonen, and Barbara Di Eugenio, and Candy Sidner, and Mark Steedman, all of whom helped me prepare for this interview. And finally, I would also like to express my everlasting gratitude to the first recipient of the Lifetime Achievement Award, Aravind Joshi, who, until his death in December 2017, was always a source of generosity and good humor and stimulating ideas and encouragement to all the people that he interacted with. We were all very lucky to have known him.
And thanks again for taking the time to interview me.