Abstract
Suppose we want to build a system that answers a natural language question by representing its semantics as a logical form and computing the answer given a structured database of facts. The core part of such a system is the semantic parser that maps questions to logical forms. Semantic parsers are typically trained from examples of questions annotated with their target logical forms, but this type of annotation is expensive.
Our goal is to instead learn a semantic parser from question–answer pairs, where the logical form is modeled as a latent variable. We develop a new semantic formalism, dependency-based compositional semantics (DCS), and define a log-linear distribution over DCS logical forms. The model parameters are estimated using a simple procedure that alternates between beam search and numerical optimization. On two standard semantic parsing benchmarks, we show that our system obtains accuracies comparable even to those of state-of-the-art systems that do require annotated logical forms.
1. Introduction
One of the major challenges in natural language processing (NLP) is building systems that both handle complex linguistic phenomena and require minimal human effort. The difficulty of achieving both criteria is particularly evident in training semantic parsers, where annotating linguistic expressions with their associated logical forms is expensive but until recently, seemingly unavoidable. Advances in learning latent-variable models, however, have made it possible to progressively reduce the amount of supervision required for various semantics-related tasks (Zettlemoyer and Collins 2005; Branavan et al. 2009; Liang, Jordan, and Klein 2009; Clarke et al. 2010; Artzi and Zettlemoyer 2011; Goldwasser et al. 2011). In this article, we develop new techniques to learn accurate semantic parsers from even weaker supervision.
We demonstrate our techniques on the concrete task of building a system to answer questions given a structured database of facts; see Figure 1 for an example in the domain of U.S. geography. This problem of building natural language interfaces to databases (NLIDBs) has a long history in NLP, starting from the early days of artificial intelligence with systems such as Lunar (Woods, Kaplan, and Webber 1972), Chat-80 (Warren and Pereira 1982), and many others (see Androutsopoulos, Ritchie, and Thanisch [1995] for an overview). We believe NLIDBs provide an appropriate starting point for semantic parsing because they lead directly to practical systems, and they allow us to temporarily sidestep intractable philosophical questions on how to represent meaning in general. Early NLIDBs were quite successful in their respective limited domains, but because these systems were constructed from manually built rules, they became difficult to scale up, both to other domains and to more complex utterances. In response, against the backdrop of a statistical revolution in NLP during the 1990s, researchers began to build systems that could learn from examples, with the hope of overcoming the limitations of rule-based methods. One of the earliest statistical efforts was the Chill system (Zelle and Mooney 1996), which learned a shift-reduce semantic parser. Since then, there has been a healthy line of work yielding increasingly more accurate semantic parsers by using new semantic representations and machine learning techniques (Miller et al. 1996; Zelle and Mooney 1996; Tang and Mooney 2001; Ge and Mooney 2005; Kate, Wong, and Mooney 2005; Zettlemoyer and Collins 2005; Kate and Mooney 2006; Wong and Mooney 2006; Kate and Mooney 2007; Wong and Mooney 2007; Zettlemoyer and Collins 2007; Kwiatkowski et al. 2010; Kwiatkowski et al. 2011).
Although statistical methods provided advantages such as robustness and portability, their application in semantic parsing achieved only limited success. One of the main obstacles was that these methods depended crucially on having examples of utterances paired with logical forms, which require substantial human effort to obtain. Furthermore, the annotators must be proficient in some formal language, which drastically reduces the size of the annotator pool, dampening any hope of acquiring enough data to fulfill the vision of learning highly accurate systems.
In response to these concerns, researchers have recently begun to explore the possibility of learning a semantic parser without any annotated logical forms (Clarke et al. 2010; Artzi and Zettlemoyer 2011; Goldwasser et al. 2011; Liang, Jordan, and Klein 2011). It is in this vein that we develop our present work. Specifically, given a set of (x,y) example pairs, where x is an utterance (e.g., a question) and y is the corresponding answer, we wish to learn a mapping from x to y. What makes this mapping particularly interesting is that it passes through a latent logical form z, which is necessary to capture the semantic complexities of natural language. Also note that whereas the logical form z was the end goal in much of earlier work on semantic parsing, for us it is just an intermediate variable—a means towards an end. Figure 2 shows the graphical model which captures the learning setting we just described: The question x, answer y, and world/database w are all observed. We want to infer the logical forms z and the parameters θ of the semantic parser, which are unknown quantities.
Although liberating ourselves from annotated logical forms reduces cost, it does increase the difficulty of the learning problem. The core challenge here is program induction: On each example (x,y), we need to efficiently search over the exponential space of possible logical forms (programs) z and find ones that produce the target answer y, a computationally daunting task. There is also a statistical challenge: How do we parametrize the mapping from utterance x to logical form z so that it can be learned from only the indirect signal y? To address these two challenges, we must first discuss the issue of semantic representation. There are two basic questions here: (i) what should the formal language for the logical forms z be, and (ii) what are the compositional mechanisms for constructing those logical forms?
The semantic parsing literature has considered many different formal languages for representing logical forms, including SQL (Giordani and Moschitti 2009), Prolog (Zelle and Mooney 1996; Tang and Mooney 2001), a simple functional query language called FunQL (Kate, Wong, and Mooney 2005), and lambda calculus (Zettlemoyer and Collins 2005), just to name a few. The construction mechanisms are equally diverse, including synchronous grammars (Wong and Mooney 2007), hybrid trees (Lu et al. 2008), Combinatory Categorial Grammars (CCG) (Zettlemoyer and Collins 2005), and shift-reduce derivations (Zelle and Mooney 1996). It is worth pointing out that the choice of formal language and the construction mechanism are decisions which are really more orthogonal than is often assumed—the former is concerned with what the logical forms look like; the latter, with how to generate a set of possible logical forms compositionally given an utterance. (How to score these logical forms is yet another dimension.)
Existing systems are rarely based on the joint design of the formal language and the construction mechanism; one or the other is often chosen for convenience from existing implementations. For example, Prolog and SQL have often been chosen as formal languages for convenience in end applications, but they were not designed for representing the semantics of natural language, and, as a result, the construction mechanism that bridges the gap between natural language and formal language is generally complex and difficult to learn. CCG (Steedman 2000) is quite popular in computational linguistics (for example, see Bos et al. [2004] and Zettlemoyer and Collins [2005]). In CCG, logical forms are constructed compositionally using a small handful of combinators (function application, function composition, and type raising). For a wide range of canonical examples, CCG produces elegant, streamlined analyses, but its success really depends on having a good, clean lexicon. During learning, there is often a great amount of uncertainty over the lexical entries, which makes CCG more cumbersome. Furthermore, in real-world applications, we would like to handle disfluent utterances, and this further strains CCG by demanding either extra type-raising rules and disharmonic combinators (Zettlemoyer and Collins 2007) or a proliferation of redundant lexical entries for each word (Kwiatkowski et al. 2010).
To cope with the challenging demands of program induction, we break away from tradition in favor of a new formal language and construction mechanism, which we call dependency-based compositional semantics (DCS). The guiding principle behind DCS is to provide a simple and intuitive framework for constructing and representing logical forms. Logical forms in DCS are tree structures called DCS trees. The motivation is two-fold: (i) DCS trees are meant to parallel syntactic dependency trees, which facilitates parsing; and (ii) a DCS tree essentially encodes a constraint satisfaction problem, which can be solved efficiently using dynamic programming to obtain the denotation of a DCS tree. In addition, DCS provides a mark–execute construct, which provides a uniform way of dealing with scope variation, a major source of trouble in any semantic formalism. The construction mechanism in DCS is a generalization of labeled dependency parsing, which leads to simple and natural algorithms. To a linguist, DCS might appear unorthodox, but it is important to keep in mind that our primary goal is effective program induction, not necessarily to model new linguistic phenomena in the tradition of formal semantics.
Armed with our new semantic formalism, DCS, we then define a discriminative probabilistic model, which is depicted in Figure 2. The semantic parser is a log-linear distribution over DCS trees z given an utterance x. Notably, z is unobserved, and we instead observe only the answer y, which is obtained by evaluating z on a world/database w. There are an exponential number of possible trees z, and usually dynamic programming can be used to efficiently search over trees. However, in our learning setting (independent of the semantic formalism), we must enforce the global constraint that z produces y. This makes dynamic programming infeasible, so we use beam search (though dynamic programming is still used to compute the denotation of a fixed DCS tree). We estimate the model parameters with a simple procedure that alternates between beam search and optimizing a likelihood objective restricted to those beams. This yields a natural bootstrapping procedure in which learning and search are integrated.
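The alternation between beam search and likelihood optimization described above can be sketched as a simple loop. Here `beam_search`, `execute`, and `update_params` are hypothetical stand-ins for candidate generation, denotation computation, and the restricted likelihood optimization; this is a sketch of the overall control flow, not the article's implementation:

```python
def learn(examples, beam_search, execute, update_params, theta, iterations=5):
    """Alternating estimation sketch: search under the current parameters,
    keep logical forms whose denotation matches the answer, then refit.

    beam_search(x, theta) -> candidate logical forms for utterance x
    execute(z)            -> answer obtained by evaluating z on the world
    update_params(theta, beams) -> parameters fit to the retained beams
    (all three are hypothetical helpers)
    """
    for _ in range(iterations):
        beams = []
        for x, y in examples:
            # Keep only candidate logical forms z that produce the answer y.
            good = [z for z in beam_search(x, theta) if execute(z) == y]
            beams.append((x, good))
        # Optimize the log-linear likelihood restricted to these beams.
        theta = update_params(theta, beams)
    return theta
```

Because the beams produced under better parameters contain better candidates, and better candidates yield better parameters, this loop has the bootstrapping flavor described in the text.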
We evaluated our DCS-based approach on two standard benchmarks, Geo, a U.S. geography domain (Zelle and Mooney 1996), and Jobs, a job queries domain (Tang and Mooney 2001). On Geo, we found that our system significantly outperforms previous work that also learns from answers instead of logical forms (Clarke et al. 2010). What is perhaps a more significant result is that our system obtains comparable accuracies to state-of-the-art systems that do rely on annotated logical forms. This demonstrates the viability of training accurate systems with much less supervision than before.
The rest of this article is organized as follows: Section 2 introduces DCS, our new semantic formalism. Section 3 presents our probabilistic model and learning algorithm. Section 4 provides an empirical evaluation of our methods. Section 5 situates this work in a broader context, and Section 6 concludes.
2. Representation
In this section, we present the main conceptual contribution of this work, dependency-based compositional semantics (DCS), using the U.S. geography domain (Zelle and Mooney 1996) as a running example. To do this, we need to define the syntax and semantics of the formal language. The syntax is defined in Section 2.2 and is quite straightforward: The logical forms in the formal language are simply trees, which we call DCS trees. In Section 2.3, we give a type-theoretic definition of worlds (also known as databases or models) with respect to which we can define the semantics of DCS trees.
The semantics, which is the heart of this article, contains two main ideas: (i) using trees to represent logical forms as constraint satisfaction problems or extensions thereof, and (ii) dealing with cases when syntactic and semantic scope diverge (e.g., for generalized quantification and superlative constructions) using a new construct which we call mark–execute. We start in Section 2.4 by introducing the semantics of a basic version of DCS which focuses only on (i) and then extend it to the full version (Section 2.5) to account for (ii).
Finally, having fully specified the formal language, we describe a construction mechanism for mapping a natural language utterance to a set of candidate DCS trees (Section 2.6).
2.1 Notation
Operations on tuples will play a prominent role in this article. For a sequence v = (v1, …, vk), we use |v| = k to denote the length of the sequence. For two sequences u and v, we use u + v = (u1, …, u|u|, v1, …, v|v|) to denote their concatenation.
For a sequence of positive indices i = (i1, …, im), let vi = (vi1, …, vim) consist of the components of v specified by i; we call vi the projection of v onto i. We use negative indices to exclude components: v−i = v(1,…,|v|)\i. We can also combine sequences of indices by concatenation: vi,j = vi + vj. Some examples: if v = (a, b, c, d), then v2 = b, v3,1 = (c, a), v−3 = (a, b, d), v3,−3 = (c, a, b, d).
2.2 Syntax of DCS Trees
The syntax of the DCS formal language is built from two ingredients, predicates and relations:
Let 𝒫 be a set of predicates. We assume that 𝒫 contains a special null predicate ø, domain-independent predicates (e.g., count, <, >, and =), and domain-specific predicates (for the U.S. geography domain, state, river, border, etc.). Right now, think of predicates as just labels, which have yet to receive formal semantics.
Let ℛ be the set of relations. Note that unlike the predicates 𝒫, which can vary across domains, the relations ℛ are fixed. The full set of relations is shown in Table 1. For now, just think of relations as labels—their semantics will be defined in Section 2.4.
Table 1: Relations in DCS.

Name | Relation | Description of semantic function
---|---|---
join | j,j′ for j, j′ ∈ {1, 2, …} | j-th component of parent = j′-th component of child
aggregate | Σ | parent = set of feasible values of child
extract | e | mark node for extraction
quantify | q | mark node for quantification, negation
compare | c | mark node for superlatives, comparatives
execute | xi for i ∈ {1, 2, …}* | process marked nodes specified by i
The logical forms in DCS are called DCS trees. A DCS tree is a directed rooted tree in which nodes are labeled with predicates and edges are labeled with relations; each node also maintains an ordering over its children. Formally:
Definition 1 (DCS trees)
Let 𝒵 be the set of DCS trees, where each z ∈ 𝒵 consists of (i) a predicate z.p ∈ 𝒫 and (ii) a sequence of edges z.e = (z.e1, …, z.em). Each edge e consists of a relation e.r ∈ ℛ (see Table 1) and a child tree e.c ∈ 𝒵.
We will either draw a DCS tree graphically or write it compactly as 〈p; r1 : c1; …; rm : cm〉 where p is the predicate at the root node and c1, …, cm are its m children connected via edges labeled with relations r1, …, rm, respectively. Figure 3(a) shows an example of a DCS tree expressed using both graphical and compact formats.
A DCS tree is a logical form, but it is designed to look like a syntactic dependency tree, only with predicates in place of words. As we will see over the course of this section, it is this transparency between syntax and semantics provided by DCS that leads to a simple and streamlined compositional semantics suitable for program induction.
2.3 Worlds
In the context of question answering, the DCS tree is a formal specification of the question. To obtain an answer, we still need to evaluate the DCS tree with respect to a database of facts (see Figure 4 for an example). We will use the term world to refer to this database (it is sometimes also called a model, but we avoid this term to avoid confusion with the probabilistic model for learning that we will present in Section 3.1). Throughout this work, we assume the world is fully observed and fixed, which is a realistic assumption for building natural language interfaces to existing databases, but questionable for modeling the semantics of language in general.
2.3.1 Types and Values
To define a world, we start by constructing a set of values 𝒱. The exact set of values depends on the domain (we will continue to use U.S. geography as a running example). Briefly, 𝒱 contains numbers, strings, tuples, sets, and other higher-order entities.
To be more precise, we construct 𝒱 recursively. First, define a set of primitive values 𝒱⋆, which includes the following:
Numeric values. Each value has the form x : t, where x ∈ ℝ is a real number and t ∈ {number, ordinal, percent, length, …} is a tag. The tag allows us to differentiate 3, 3rd, 3%, and 3 miles—this will be important in Section 2.6.3. We simply write x for the value x : number.
Symbolic values. Each value has the form x : t, where x is a string (e.g., Washington) and t ∈ {string, city, state, river, …} is a tag. Again, the tag allows us to differentiate, for example, the entities Washington: city and Washington: state.
Now we build the full set of values 𝒱 from the primitive values 𝒱⋆. To define 𝒱, we need a bit more machinery: To avoid logical paradoxes, we construct 𝒱 in increasing order of complexity using types (see Carpenter [1998] for a similar construction). The casual reader can skip this construction without losing any intuition.
Define the set of types 𝒯 to be the smallest set that satisfies the following properties:

1. The primitive type ⋆ ∈ 𝒯;
2. The tuple type (t1, …, tk) ∈ 𝒯 for each k ≥ 0 and each non-tuple type ti ∈ 𝒯, i = 1, …, k; and
3. The set type {t} ∈ 𝒯 for each tuple type t ∈ 𝒯.
For each type t ∈ 𝒯, we construct a corresponding set of values 𝒱t:

1. For the primitive type t = ⋆, the primitive values 𝒱⋆ have already been specified. Note that these types are rather coarse: Primitive values with different tags are considered to have the same type ⋆.
2. For a tuple type t = (t1, …, tk), 𝒱t contains all tuples whose components have the corresponding types: 𝒱(t1,…,tk) = {(v1, …, vk) : vi ∈ 𝒱ti for i = 1, …, k}.
3. For a set type t = {t′}, 𝒱t contains all subsets of values of the element type t′: 𝒱{t′} = {s : s ⊆ 𝒱t′}. With this last condition, we ensure that all elements of a set must have the same type. Note that a set is still allowed to contain values with different tags (e.g., {(Washington: city), (Washington: state)} is a valid set, which might denote the semantics of the utterance things named Washington). Another distinction is that types are domain-independent, whereas tags tend to be more domain-specific.
Definition 2 (World)
A world w is a function that maps each non-null predicate p ∈ 𝒫 ∖ {ø} to a set of tuples w(p) and maps the null predicate ø to the set of all values: w(ø) = 𝒱.
Remarks. In higher-order logic and lambda calculus, we construct function types and values, whereas in DCS, we construct tuple types and values. The two are equivalent in representational power, but this discrepancy does point out the fact that lambda calculus is based on function application, whereas DCS, as we will see, is based on declarative constraints. The set type {(⋆, ⋆)} in DCS corresponds to the function type ⋆ → (⋆ → bool). In DCS, there is no explicit bool type—it is implicitly represented by using sets.
2.3.2 Examples
Domain-independent predicates such as count and average are defined in terms of helper functions on sets (e.g., countt relates a set of type {(t)} to its cardinality). These helper functions are monomorphic: For example, countt only computes cardinalities of sets of type {(t)}. In practice, we mostly operate on sets of primitives (t = ⋆). To reduce notation, we omit t to refer to this version: count = count⋆, average = average⋆, and so forth.
2.4 Semantics of DCS Trees without Mark–Execute (Basic Version)
The semantics or denotation of a DCS tree z with respect to a world w is denoted 〚z〛w. First, we define the semantics of DCS trees with only join relations (Section 2.4.1). In this case, a DCS tree encodes a constraint satisfaction problem (CSP); this is important because it highlights the constraint-based nature of DCS and also naturally leads to a computationally efficient way of computing denotations (Section 2.4.2). We then allow DCS trees to have aggregate relations (Section 2.4.3). The fragment of DCS which has only join and aggregate relations is called basic DCS.
2.4.1 Basic DCS Trees as Constraint Satisfaction Problems
Let z be a DCS tree with only join relations on its edges. In this case, z encodes a CSP as follows: For each node x in z, the CSP has a variable with value a(x); the collection of these values is referred to as an assignment a. The predicates and relations of z introduce constraints:
1. a(x) ∈ w(p) for each node x labeled with predicate p ∈ 𝒫; and
2. a(x)j = a(y)j′ for each edge (x, y) labeled with relation j,j′ ∈ ℛ, which says that the j-th component of a(x) must equal the j′-th component of a(y).
Figure 3(a) shows an example of a DCS tree. The corresponding CSP has four variables c, m, ℓ, s. In Figure 3(b), we have written the equivalent lambda calculus formula. The non-root nodes are existentially quantified, the root node c is λ-abstracted, and all constraints introduced by predicates and relations are conjoined. The λ-abstraction of c represents the fact that the denotation is the set of feasible values for c (note the equivalence between the Boolean function λc.p(c) and the set {c : p(c)}).
Remarks. Note that CSPs only allow existential quantification and conjunction. Why did we choose this particular logical subset as a starting point, rather than allowing universal quantification, negation, or disjunction? There seems to be something fundamental about this subset, which also appears in Discourse Representation Theory (DRT) (Kamp and Reyle 1993; Kamp, van Genabith, and Reyle 2005). Briefly, logical forms in DRT are called Discourse Representation Structures (DRSs), each of which contains (i) a set of existentially quantified discourse referents (variables), (ii) a set of conjoined discourse conditions (constraints), and (iii) nested DRSs. If we exclude nested DRSs, a DRS is exactly a CSP. The default existential quantification and conjunction are quite natural for modeling cross-sentential anaphora: New variables can be added to a DRS and connected to other variables. Indeed, DRT was originally motivated by these phenomena (see Kamp and Reyle [1993] for more details).
Tree-structured CSPs can capture unboundedly complex recursive structures—such as cities in states that border states that have rivers that…. Trees are limited, however, in that they are unable to capture long-distance dependencies such as those arising from anaphora. For example, in the phrase a state with a river that traverses its capital, its binds to state, but this dependence cannot be captured in a tree structure. A solution is to simply add an edge between the its node and the state node that forces the two nodes to have the same value. The result is still a well-defined CSP, though not a tree-structured one. The situation would become trickier if we were to integrate the other relations (aggregate, mark, and execute). We might be able to incorporate some ideas from Hybrid Logic Dependency Semantics (Baldridge and Kruijff 2002; White 2006), given that hybrid logic extends the tree structures of modal logic with nominals, thereby allowing a node to freely reference other nodes. In this article, however, we will stick to trees and leave the full exploration of non-trees for future work.
2.4.2 Computation of Join Relations
The time complexity for computing the denotation of a DCS tree 〚z〛w scales linearly with the number of nodes, but there is also a dependence on the cost of performing the join and project operations. For details on how we optimize these operations and handle infinite sets of tuples (for predicates such as count), see Liang (2011).
The denotation of DCS trees is defined in terms of the feasible values of a CSP, and the recurrence in Equation (15) is only one way of computing this denotation. In light of the extensions to come, however, we now consider Equation (15) as the actual definition rather than just a computational mechanism. It will still be useful to refer to the CSP in order to access the intuition of using declarative constraints.
2.4.3 Aggregate Relation
Thus far, we have focused on DCS trees that use only join relations, which are insufficient for capturing higher-order phenomena in language. For example, consider the phrase number of major cities. Suppose that number corresponds to the count predicate, and that major cities maps to a DCS tree whose denotation is the set of major cities. We cannot simply join count with the root of this DCS tree, because count needs to be joined with the set of major cities as a whole (the denotation of the entire subtree), not with a single city.
Figure 5(a) shows the DCS tree for our running example. The denotation of the middle node is {(s)}, where s is the set of all major cities. Everything above this node is an ordinary CSP: s constrains the count node, which in turn constrains the root node to |s|. Figure 5(b) shows another example of using the aggregate relation Σ. Here, the node right above Σ is constrained to be a set of pairs of major cities and their populations. The average predicate then computes the desired answer.
Remarks. A DCS tree that contains only join and aggregate relations can be viewed as a collection of tree-structured CSPs connected via aggregate relations. The tree structure still enables us to compute denotations efficiently based on the recurrences in Equations (15) and (16).
Recall that a DCS tree with only join relations is a DRS without nested DRSs. The aggregate relation corresponds to the abstraction operator in DRT and is one way of making nested DRSs. It turns out that the abstraction operator is sufficient to obtain the full representational power of DRT, and subsumes generalized quantification and disjunction constructs in DRT. By analogy, we use the aggregate relation to handle disjunction (Figure 5(c)) and generalized quantification (Section 2.5.6).
DCS restricted to join relations is less expressive than first-order logic because it does not have universal quantification, negation, and disjunction. The aggregate relation is analogous to lambda abstraction, and in basic DCS we use the aggregate relation to implement those basic constructs using higher-order predicates such as not, every, and union. We can also express logical statements such as generalized quantification, which go beyond first-order logic.
2.5 Semantics of DCS Trees with Mark–Execute (Full Version)
Basic DCS includes two types of relations, join and aggregate, but it is already quite expressive. In general, however, it is not enough just to be able to express the meaning of a sentence using some logical form; we must be able to derive the logical form compositionally and simply from the sentence.
Consider the superlative construction most populous city, which has a basic syntactic dependency structure shown in Figure 6(a). Figure 6(b) shows that we can in principle already use a DCS tree with only join and aggregate relations to express the correct semantics of the superlative construction. Note, however, that the two structures are quite divergent—the syntactic head is city and the semantic head is argmax. This divergence runs counter to a principal desideratum of DCS, which is to create a transparent interface between coarse syntax and semantics.
In this section, we introduce mark and execute relations, which will allow us to use the DCS tree in Figure 6(c) to represent the semantics associated with Figure 6(a); these two are more similar than (a) and (b). The focus of this section is on this mark–execute construct—using mark and execute relations to give proper semantically scoped denotations to syntactically scoped tree structures.
The basic intuition of the mark–execute construct is as follows: We mark a node low in the tree with a mark relation; then, higher up in the tree, we invoke it with a corresponding execute relation (Figure 7). For our example in Figure 6(c), we mark the population node, which puts the child argmax in a temporary store; when we execute the city node, we fetch the superlative predicate argmax from the store and invoke it.
This divergence between syntactic and semantic scope arises in other linguistic contexts besides superlatives, such as quantification and negation. In each of these cases, the general template is the same: A syntactic modifier low in the tree needs to have semantic force higher in the tree. A particularly compelling case of this divergence happens with quantifier scope ambiguity (e.g., Some river traverses every city), where the quantifiers appear in fixed syntactic positions, but the surface and inverse scope readings correspond to different semantically scoped denotations. Analogously, a single syntactic structure involving superlatives can also yield two different semantically scoped denotations—the absolute and relative readings (e.g., state bordering the largest state). The mark–execute construct provides a unified framework for dealing with all these forms of divergence between syntactic and semantic scope. See Figures 8 and 9 for concrete examples of this construct.
2.5.1 Denotations
We now formalize the mark–execute construct. We saw that the mark–execute construct appears to act non-locally, putting things in a store and retrieving them later. This means that if we want the denotation of a DCS tree to only depend on the denotations of its subtrees, the denotations need to contain more than the set of feasible values for the root node, as was the case for basic DCS. We need to augment denotations to include information about all marked nodes, because these can be accessed by an execute relation higher up in the tree.
More specifically, let z be a DCS tree and d = 〚z〛w be its denotation. The denotation d consists of n columns. The first column always corresponds to the root node of z, and the rest of the columns correspond to non-root marked nodes in z. In the example in Figure 10, there are two columns, one for the root state node and the other for the size node, which is marked by c. The columns are ordered according to a pre-order traversal of z, so column 1 always corresponds to the root node. The denotation d contains a set of arrays d.A, where each array represents a feasible assignment of values to the columns of d; note that we existentially quantify over non-marked nodes, so they do not correspond to any column in the denotation. For example, in Figure 10, the first array in d.A corresponds to assigning (OK) to the state node (column 1) and (TX, 2.7e5) to the size node (column 2). If there are no marked nodes, d.A is basically a set of tuples, which corresponds to a denotation in basic DCS. For each marked node, the denotation d also maintains a store with information to be retrieved when that marked node is executed. A store σ for a marked node contains the following: (i) the mark relation σ.r (c in the example), (ii) the base denotation σ.b, which essentially corresponds to the denotation of the subtree rooted at the marked node excluding the mark relation and its subtree (〚〈size〉〛w in the example), and (iii) the child denotation σ.c, the denotation of the child of the mark relation (〚〈argmax〉〛w in the example). The store of any unmarked node is always empty (σ = ø).
Definition 3 (Denotations)
Let 𝒟 be the set of denotations, where each denotation d ∈ 𝒟 consists of
a set of arrays d.A, where each array a = [a1, …, an] ∈ d.A is a sequence of n tuples for some n ≥ 0; and
a sequence of n stores d.σ = (d.σ1, …, d.σn), where each store σ contains a mark relation σ.r ∈ {e, q, c, ø}, a base denotation σ.b ∈ 𝒟, and a child denotation σ.c ∈ 𝒟.
For notational convenience, we write d as 《A; (r1, b1, c1); …; (rn, bn, cn)》. Also let d.ri = d.σi.r, d.bi = d.σi.b, and d.ci = d.σi.c. Let d{σi = x} be the denotation which is identical to d, except with d.σi = x; d{ri = x}, d{bi = x}, and d{ci = x} are defined analogously. We also define a projection operation d[i] for denotations, which keeps and reorders the columns (both arrays and stores) according to the index sequence i, with a negative index indicating a column to be removed. Extending this notation further, we use ø to denote the indices of the non-initial columns with empty stores (i > 1 such that d.σi = ø). We can then use d[−ø] to represent projecting away the non-initial columns with empty stores. For the denotation d in Figure 10, d[1] keeps column 1, d[−ø] keeps both columns, and d[2, −2] swaps the two columns.
In basic DCS, denotations are sets of tuples, which works quite well for representing the semantics of wh-questions such as What states border Texas? But what about polar questions such as Does Louisiana border Texas? The denotation should be a simple Boolean value, which basic DCS does not represent explicitly. Using our new denotations, we can represent Boolean values explicitly using zero-column structures: true corresponds to a singleton set containing just the empty array (dT = 《{[ ]}》) and false is the empty set (dF = 《∅》).
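The structures above can be sketched in code. The following is a minimal, illustrative Python model (not the paper's implementation) of denotations as arrays-plus-stores, with a simplified projection operation and the zero-column Boolean encodings; the store's base and child denotations are omitted for brevity.

```python
from dataclasses import dataclass

# A store for a marked column; the base denotation and child denotation of
# a store are omitted in this sketch. r = None plays the role of the empty
# store (ø).
@dataclass(frozen=True)
class Store:
    r: object = None

@dataclass
class Denotation:
    A: set            # set of arrays; each array is a tuple of n tuples
    stores: tuple     # one Store per column

    def project(self, idx):
        """d[idx]: keep and reorder the columns listed in idx (1-based)."""
        A = {tuple(a[i - 1] for i in idx) for a in self.A}
        stores = tuple(self.stores[i - 1] for i in idx)
        return Denotation(A, stores)

# Boolean values as zero-column denotations:
d_true = Denotation({()}, ())    # true: singleton set with the empty array
d_false = Denotation(set(), ())  # false: the empty set of arrays
```

For instance, a two-column denotation with a state in column 1 and a (state, size) pair in its marked column 2 can be projected down to column 1 or have its columns reordered.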
2.5.2 Base Case
Equation (19) defines the denotation for a DCS tree z with a single node with predicate p. The denotation of z has one column whose arrays correspond to the tuples w(p); the store for that column is empty.
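As a sketch of this base case, with denotations simplified to plain sets of arrays (stores omitted, and with an illustrative toy world):

```python
# Base case (cf. Equation (19)): a single node with predicate p denotes one
# column whose arrays are the tuples w(p), each wrapped as a one-element
# array; the store for that column is empty (omitted here).
def denote_single_node(w, p):
    return {(t,) for t in w[p]}

# A toy world: 'state' maps to a set of 1-tuples of state names.
w = {"state": {("OK",), ("TX",), ("LA",)}}
```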
2.5.3 Join Relations
Equation (20) defines the recurrence for join relations. On the left-hand side of Equation (20) is a DCS tree with p at the root and a sequence of edges e, followed by a final edge whose join relation j/j′ connects to a child DCS tree c. On the right-hand side, we take the recursively computed denotation of 〈p; e〉, the DCS tree without the final edge, and perform a join-project-inactive operation (notated ⋈j,j′) with the denotation of the child DCS tree c.
Note that the join works on column 1; the other columns are carried along for the ride. As another piece of convenient notation, we use ∗ to represent all components, so ⋈∗,∗ imposes the join condition that the entire tuples have to agree (a1 = a′1).
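A minimal sketch of this join on column 1, with denotations again simplified to plain sets of arrays (projection of the child's inactive columns is omitted, and the example tuples are illustrative):

```python
# Keep an array a of d1 whenever component j of its column-1 tuple agrees
# with component j' of some column-1 tuple a' in d2; with j = j' = "*", the
# entire tuples must agree. Other columns of d1 are carried along untouched.
def join(d1, j, jp, d2):
    def agrees(a, ap):
        if j == "*" and jp == "*":
            return a[0] == ap[0]
        return a[0][j - 1] == ap[0][jp - 1]
    return {a for a in d1 if any(agrees(a, ap) for ap in d2)}

states = {(("TX",),), (("CA",),)}
border_pairs = {(("TX", "OK"),)}  # illustrative border tuples
```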
2.5.4 Aggregate Relations
There is another case, however: what happens to settings of a2,…,an that do not co-occur with any value of a1′ in A? Then, S(a) = ∅, but note that A′ by construction will not have the desired array [∅, a2,…,an]. As a concrete example, suppose A = ∅ and we have one column (n = 1). Then A′ = ∅, rather than the desired {[∅]}.
Fixing this problem is slightly tricky. There are an infinite number of a2,…,an which do not co-occur with any a1′ in A, so for which ones do we actually include [∅, a2, …, an]? Certainly, the answer to this question cannot come from A, so it must come from the stores. In particular, for each column i ∈ {2, …, n}, we have conveniently stored a base denotation σi.b. We consider any ai that occurs in column 1 of the arrays of this base denotation ([ai] ∈ σi.b.A[1]). For this a2,…,an, we include [∅, a2, …, an] in A′′ as long as a2,…,an does not co-occur with any a1. An example is given in Figure 11.
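The following is an illustrative two-column sketch of this fix: the arrays alone only yield groups with positive support, and the base denotation of column 2 supplies the values that must receive an empty set.

```python
# Aggregate sketch for n = 2: group the feasible column-1 values by the
# column-2 value. The fix adds an empty-set entry for any column-2 value
# present in that column's base denotation but co-occurring with no
# column-1 value.
def aggregate(A, base2):
    grouped = {}
    for a1, a2 in A:
        grouped.setdefault(a2, set()).add(a1)
    result = {(frozenset(S), a2) for a2, S in grouped.items()}
    for a2 in base2 - set(grouped):          # the negative information
        result.add((frozenset(), a2))
    return result
```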
The reason for storing base denotations is thus partially revealed: The arrays represent feasible values of a CSP and can only contain positive information. When we aggregate, we need to access possibly empty sets of feasible values—a kind of negative information, which can only be recovered from the base denotations.
2.5.5 Mark Relations
2.5.6 Execute Relations
Equation (22) defines the denotation of a DCS tree where the last edge of the root is an execute relation. Similar to the aggregate case (21), we recurse on the DCS tree without the last edge (〈p; e〉) and then join it to the result of applying the execute operation Xi to the denotation of the child (〚c〛w).
The execute operation Xi is the most intricate part of DCS and is what does the heavy lifting. The operation is parametrized by a sequence of distinct indices i that specifies the order in which the columns should be processed. Specifically, i indexes into the subsequence of columns with non-empty stores. We then process this subsequence of columns in reverse order, where processing a column means performing some operations depending on the stored relation in that column. For example, suppose that columns 2 and 3 are the only non-empty columns. Then X12 processes column 3 before column 2. On the other hand, X21 processes column 2 before column 3. We first define the execute operation Xi for a single column i. There are three distinct cases, depending on the relation stored in column i:
- 1.
By default, the denotation of a DCS tree is the set of feasible values of the root node (which occupies column 1). To return the set of feasible values of another node, we mark that node with e. Upon execution, the feasible values of that node move into column 1. Extraction can be used to handle in situ questions (see Figure 8(a)).
- 2.
Unmarked nodes (those that do not have an edge with a mark relation) are existentially quantified and have narrower scope than all marked nodes. Therefore, we can make a node x have wider scope than another node y by marking x (with e) and executing y before x (see Figure 8(d,e) for examples). The extract relation e (in fact, any mark relation) signifies that we want to control the scope of a node, and the execute relation allows us to set that scope.
Figure 8(c) shows an example with two interacting quantifiers. The denotation of the DCS tree before execution is the same in both readings, as shown in Figure 15. The quantifier scope ambiguity is resolved by the choice of execute relation: X12 gives the surface scope reading, and X21 gives the inverse scope reading.
Figure 8(d) shows how extraction and quantification work together. First, the no quantifier is processed for each city, which is an unprocessed marked node. Here, the extract relation is a technical trick to give city wider scope.
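The column-ordering convention of the execute relation can be sketched as follows (an illustrative helper, not the full execute operation):

```python
# The index sequence i selects into the subsequence of columns with
# non-empty stores, and those columns are then processed in reverse order
# of i.
def processing_order(stores, i):
    nonempty = [col for col, s in enumerate(stores, start=1) if s is not None]
    return [nonempty[k - 1] for k in reversed(i)]

# Columns 2 and 3 are the only columns with non-empty stores:
stores = [None, "e", "q"]
```

This reproduces the example from the text: X12 processes column 3 before column 2, while X21 processes column 2 before column 3.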
Comparatives and Superlatives. Comparative and superlative constructions involve comparing entities, and for this we rely on a set S of entity–degree pairs (x,y), where x is an entity and y is a numeric degree. Recall that we can treat S as a function, which maps an entity x to the set of degrees S(x) associated with x. Note that this set can contain multiple degrees. For example, in the relative reading of state bordering the largest state, we would have a degree for the size of each neighboring state.
We use the same mark relation c for both comparative and superlative constructions. In terms of the DCS tree, there are three key parts: (i) the root x, which corresponds to the entity to be compared, (ii) the child c of a c relation, which corresponds to the comparative or superlative predicate, and (iii) c's parent p, which contains the “degree information” (which will be described later) used for comparison. We assume that the root is marked (usually with a relation e). This forces us to compute a comparison degree for each value of the root node. In terms of the denotation d corresponding to the DCS tree prior to execution, the entity to be compared occurs in column 1 of the arrays d.A, the degree information occurs in column i of the arrays d.A, and the denotation of the comparative or superlative predicate itself is the child denotation at column i (d.ci).
- 1.
Suppose the degree information has arity 2 (Arity(d.A[i]) = 2). This occurs, for example, in most populous city (see Figure 9(b)), where column i is the population node. In this case, we simply set the degree to the second component of population by projection. Now columns 1 and 2 contain the degrees and entities, respectively. We concatenate columns 2 and 1 (+2,1(·)) and aggregate to produce a denotation dS which contains the set of entity–degree pairs in column 1.
- 2.
Suppose the degree information has arity 1 (Arity(d.A[i]) = 1). This occurs, for example, in state bordering the most states (see Figure 9(a)), where column i is the lower marked state node. In this case, the degree of an entity from column 2 is the number of different values that column 1 can take. To compute this, aggregate the set of values (∑ (d′)) and apply the count predicate. Now with the degrees and entities in columns 1 and 2, respectively, we concatenate the columns and aggregate again to obtain dS.
Figure 9(a) and Figure 9(b) show examples of superlative constructions with the arity 1 and arity 2 types of degree information, respectively. Figure 9(c) shows an example of a comparative construction. Comparatives and superlatives use the same machinery, differing only in the predicate: argmax versus (more than Texas). But both predicates have the same template behavior: Each takes a set of entity–degree pairs and returns any entity satisfying some property. For argmax, the property is obtaining the highest degree; for more, it is having a degree higher than a threshold. We can handle generalized superlatives (the five largest or the fifth largest or the 5% largest) as well by swapping in a different predicate; the execution mechanisms defined in Equation (41) remain the same.
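The shared template can be sketched directly: each predicate consumes a set S of entity–degree pairs and returns the entities satisfying some property. The helper names below (more_than, top_k) are illustrative, not the paper's notation.

```python
# argmax: entities obtaining the highest degree.
def argmax(S):
    best = max(d for _, d in S)
    return {x for x, d in S if d == best}

# more: entities whose degree exceeds a threshold.
def more_than(threshold):
    return lambda S: {x for x, d in S if d > threshold}

# A generalized superlative: the k largest.
def top_k(k):
    return lambda S: {x for x, d in sorted(S, key=lambda p: -p[1])[:k]}
```

Swapping in a different predicate of this shape changes the construction handled (superlative, comparative, generalized superlative) without touching the surrounding machinery.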
We saw that the mark–execute machinery allows decisions regarding quantifier scope to be made in a clean and modular fashion. Superlatives also have scope ambiguities in the form of absolute versus relative readings. Consider the example in Figure 9(d). In the absolute reading, we first compute the superlative in a narrow scope (the largest state is Alaska), and then connect it with the rest of the phrase, resulting in the empty set (because no states border Alaska). In the relative reading, we consider the first state as the entity we want to compare, and its degree is the size of a neighboring state. In this case, the lower state node cannot be set to Alaska because there are no states bordering it. The result is therefore any state that borders Texas (the largest state that does have neighbors). The two DCS trees in Figure 9(d) show that we can naturally account for this form of superlative ambiguity based on where the scope-determining execute relation is placed without drastically changing the underlying tree structure.
Remarks. These scope divergence issues are not specific to DCS—every serious semantic formalism must address them. Generative grammar uses quantifier raising to move the quantifier from its original syntactic position up to the desired semantic position before semantic interpretation even occurs (Heim and Kratzer 1998). Other mechanisms such as Montague's (1973) quantifying in, Cooper storage (Cooper 1975), and Carpenter's (1998) scoping constructor handle scope divergence during semantic interpretation. Roughly speaking, these mechanisms delay application of a quantifier, “marking” its spot with a dummy pronoun (as in Montague's quantifying in) or putting it in a store (as in Cooper storage), and then “executing” the quantifier at a later point in the derivation either by performing a variable substitution or retrieving it from the store. Continuations, from programming languages, are another solution (Barker 2002; Shan 2004); they set the semantics of a quantifier to be a function from its continuation (which captures all the semantic content of the clause minus the quantifier) to the final denotation of the clause. Intuitively, continuations reverse the normal evaluation order, allowing a quantifier to remain in situ but still outscope the rest of the clause. In fact, the mark and execute relations of DCS are analogous to the shift and reset operators used in continuations. One of the challenges with allowing flexible scope is that free variables can yield invalid scopings, a well-known issue with Cooper storage that the continuation-based approach solves. Invalid scopings are filtered out by the construction mechanism (Section 2.6).
One difference between mark–execute in DCS and many other mechanisms is that DCS trees (which contain mark and execute relations) are the final logical forms—the handling of scope divergence occurs in the computation of their denotations. The analog in the other mechanisms resides in the construction mechanism—the actual final logical form is quite simple.9 Therefore, we have essentially pushed the inevitable complexity from the construction mechanism into the semantics of the logical form. This is a conscious design decision: We want our construction mechanism, which maps natural language to logical form, to be simple and not burdened with complex linguistic issues, for our focus is on learning this mapping. Unfortunately, the denotations of our logical forms (Section 2.5.1) do become more complex than those of lambda calculus expressions, but we believe this is a reasonable tradeoff to make for our particular application.
2.6 Construction Mechanism
We have thus far defined the syntax (Section 2.2) and semantics (Section 2.5) of DCS trees, but we have only vaguely hinted at how these DCS trees might be connected to natural language utterances by appealing to idealized examples. In this section, we formally define the construction mechanism for DCS, which takes an utterance x and produces a set of DCS trees ZL(x).
Because we motivated DCS trees based on dependency syntax, it might be tempting to take a dependency parse tree of the utterance, replace the words with predicates, and attach some relations on the edges to produce a DCS tree. To a first approximation, this is what we will do, but we need to be a bit more flexible for several reasons: (i) some nodes in the DCS tree do not have predicates (e.g., children of an e relation or parent of an xi relation); (ii) some nodes have predicates that do not correspond to words (e.g., in California cities, there is an implicit loc predicate that bridges CA and city); (iii) some words might not correspond to any predicates in our world (e.g., please); and (iv) the DCS tree might not always be aligned with the syntactic structure, depending on which syntactic formalism one subscribes to. Although syntax was the inspiration for the DCS formalism, we will not actually use it in construction.
It is also worth stressing the purpose of the construction mechanism. In linguistics, the purpose of the construction mechanism is to try to generate the exact set of valid logical forms for a sentence. We view the construction mechanism instead as simply a way of creating a set of candidate logical forms. A separate step defines a distribution over this set to favor certain logical forms over others. The construction mechanism should therefore simply overapproximate the set of logical forms. Linguistic constraints that are normally encoded in the construction mechanism (for example, in CCG, that the disharmonic pair s/np and s\np cannot be coordinated, or that non-indefinite quantifiers cannot extend their scope beyond clause boundaries) would be instead encoded as features (Section 3.1.1). Because feature weights are estimated from data, one can view our approach as automatically learning the linguistic constraints relevant to our end task.
2.6.1 Lexical Triggers
The construction mechanism assumes a fixed set of lexical triggers L. Each trigger is a pair (s, p), where s is a sequence of words (usually one) and p is a predicate (e.g., s = California and p = CA). We use L(s) to denote the set of predicates p triggered by s ((s, p) ∈ L). We should think of the lexical triggers L not as pinning down the precise predicate for each word, but rather as producing an overapproximation. For example, L might contain {(city, city), (city, state), (city, river), … }, reflecting our initial ignorance prior to learning.
We also define a set of trace predicates L(ε), which can be introduced without an overt lexical element. Their name is inspired by trace/null elements in syntax, but they serve a practical rather than a theoretical role here. As we shall see in Section 2.6.2, trace predicates provide more flexibility in the construction of logical forms, allowing us to insert a predicate based on the partial logical form constructed thus far and assess its compatibility with the words afterwards (based on features), rather than insisting on a purely lexically driven formalism. Section 4.1.3 describes the lexical triggers and trace predicates that we use in our experiments.
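The trigger set can be pictured as a simple overapproximating map from word sequences to candidate predicates; the entries below are illustrative, not the actual trigger set used in the experiments.

```python
# Lexical triggers L as a map from word sequences to candidate predicates:
L = {
    ("california",): {"CA"},
    ("city",): {"city", "state", "river"},  # deliberately ambiguous
    ("most",): {"argmax"},                  # a domain-independent trigger
}

def triggered(s):
    """L(s): the set of predicates triggered by the word sequence s."""
    return L.get(tuple(w.lower() for w in s), set())

# Trace predicates L(epsilon) can be inserted with no overt lexical element:
L_trace = {"loc", "border"}
```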
2.6.2 Recursive Construction of DCS Trees
The base case: we take the phrase (sequence of words) over span i..j and look up the set of predicates p in the set of lexical triggers. For each predicate, we construct a one-node DCS tree. We also extend the definition of DCS trees in Section 2.2 to allow each node to store the indices of the span i..j that triggered the predicate at that node; this is denoted by 〈p〉i..j. This span information will be useful in Section 3.1.1, where we will need to talk about how an utterance x is aligned with a DCS tree z.
The recursive case: we define a function T1(a, b), which takes two DCS trees, a and b, and returns a set of new DCS trees formed by combining a and b. Figure 17 shows this recurrence graphically.
Inserting trace predicates allows us to build logical forms with more predicates than are explicitly triggered by the words. This ability is useful for several reasons. Sometimes, there is a predicate not overtly expressed, especially in noun compounds (e.g., California cities). For semantically light words such as prepositions (e.g., for) it is difficult to enumerate all the possible predicates that they might trigger; it is simpler computationally to try to insert trace predicates. We can even omit lexical triggers for transitive verbs such as border because the corresponding predicate border can be inserted as a trace predicate.
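The base and recursive cases can be sketched as a CKY-style procedure. The code below is a heavily simplified illustration: the combination step just attaches one root under the other, abstracting away relations, trace-predicate insertion, and span annotations.

```python
# Skeletal construction: C[(i, j)] holds candidate trees for span i..j,
# each tree represented as (predicate, children-tuple).
def construct(words, triggers):
    n = len(words)
    C = {}
    # Base case: look up lexical triggers for each single word.
    for i in range(n):
        C[(i, i + 1)] = {(p, ()) for p in triggers.get(words[i], set())}
    # Recursive case: combine trees over adjacent spans.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            cell = set()
            for k in range(i + 1, j):
                for a in C[(i, k)]:
                    for b in C[(k, j)]:
                        cell.add((a[0], a[1] + (b,)))  # b attached under a
                        cell.add((b[0], b[1] + (a,)))  # a attached under b
            C[(i, j)] = cell
    return C[(0, n)]
```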
2.6.3 Filtering using Abstract Interpretation
The construction procedure as described thus far is extremely permissive, generating many DCS trees that are obviously wrong—for example, one that tries to compare a state with the number 3. There is nothing wrong with this expression syntactically: Its denotation will simply be empty (with respect to the world). But semantically, this DCS tree is anomalous.
We cannot simply discard all DCS trees with empty denotations, however, because we would incorrectly rule out semantically valid trees whose denotations merely happen to be empty. The difference is that even though such a denotation is empty in this world, it might not be empty in a different world where history and geology took another turn, whereas it is simply impossible to compare cities and numbers.
Now returning to our motivating example at the beginning of this section, we see that the bad DCS tree has an empty abstract denotation 《∅; ø》, whereas the good DCS tree has a non-empty abstract denotation, as desired.
Remarks. Computing denotations on an abstract world is called abstract interpretation (Cousot and Cousot 1977) and is a very powerful framework commonly used in the programming languages community. The idea is to obtain information about a program (in our case, a DCS tree) without running it concretely, but rather just by running it abstractly. It is closely related to type systems, but the type of abstractions one uses is often much richer than standard type systems.
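The idea can be sketched as follows: map every concrete value to an abstract one (its coarse type) and evaluate in the abstract world. The type table and join below are illustrative, not the paper's actual abstraction.

```python
# An empty abstract denotation signals a semantically anomalous tree, not
# one that merely happens to be false in this world.
def abstract(v):
    if isinstance(v, (int, float)):
        return "NUM"
    kinds = {"TX": "STATE", "OK": "STATE", "SF": "CITY"}
    return kinds.get(v, "UNK")

def abstract_join(tuples1, tuples2):
    """Join on first components, computed entirely in the abstract world."""
    a1 = {tuple(abstract(v) for v in t) for t in tuples1}
    a2 = {tuple(abstract(v) for v in t) for t in tuples2}
    return {t for t in a1 if any(t[0] == u[0] for u in a2)}
```

Comparing a state with a number is empty in every world, while joining two state-typed sets survives abstraction.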
2.6.4 Comparison with CCG
In some sense, the DCS construction mechanism pushes the complexity out of the lexicon. In linguistics, this complexity usually would end up in the grammar, which would be undesirable. We do not have to respect this tradeoff, however, because the construction mechanism only produces an overapproximation, which means it is possible to have both a simple “lexicon” and a simple “grammar.”
Type raising is a combinator in CCG that traditionally converts x to λf.f(x). In recent work, Zettlemoyer and Collins (2007) introduced more general type-changing combinators to allow conversion from one entity into a related entity in general (a kind of generalized metonymy). For example, in order to parse Boston flights, Boston is transformed to λx.to(x, Boston). This type changing is analogous to inserting trace predicates in DCS, but there is an important distinction: Type changing is a unary operation and is unconstrained in that it changes logical forms into new ones without regard for how they will be used downstream. Inserting trace predicates is a binary operation that is constrained by the two predicates that it is mediating. In the example, to would only be inserted to combine Boston with flight. This is another instance of the general principle of delaying uncertain decisions until there is more information.
3. Learning
In Section 2, we defined DCS trees and a construction mechanism for producing a set of candidate DCS trees given an utterance. We now define a probability distribution over that set (Section 3.1) and an algorithm for estimating the parameters (Section 3.2). The number of candidate DCS trees grows exponentially, so we use beam search to control this growth. The final learning algorithm alternates between beam search and optimization of the parameters, leading to a natural bootstrapping procedure which integrates learning and search.
3.1 Semantic Parsing Model
3.1.1 Features
We now define the feature vector φ(x, z) ∈ ℝd, the core part of the semantic parsing model. Each component j = 1, …, d of this vector is a feature, and φ(x,z)j is the number of times that feature occurs in (x,z). Rather than working with indices, we treat features as symbols (e.g., TriggerPred[states, state]). Each feature captures some property about (x,z) that abstracts away from the details of the specific instance and allows us to generalize to new instances that share common features.
The features are organized into feature templates, where each feature template instantiates a set of features. Figure 20 shows all the feature templates for a concrete example. The feature templates are as follows:
PredHit contains the single feature PredHit, which fires for each predicate in z.
PredRel contains features {PredRel[α(p), q] : p ∈ 𝒫, q ∈ ({↙, ↘} × ℛ)*}, where 𝒫 is the set of predicates and ℛ the set of relations. A feature fires when a node x has predicate p and is connected via some path q = (d1, r1), …, (dm, rm) to the lowest descendant node y with the property that each node between x and y has a null predicate. Each (d,r) on the path represents an edge labeled with relation r connecting to a left (d = ↙) or right (d = ↘) child. If x has no children, then m = 0. The most common case is when m = 1, but m = 2 also occurs with the aggregate and execute relations (e.g., such a feature fires for Figure 5(a)).
PredRelPred contains features {PredRelPred[α(p), q, α(p′)] : p, p′ ∈ 𝒫, q ∈ ({↙, ↘} × ℛ)*}, which are the same as PredRel, except that we include both the predicate p of x and the predicate p′ of the descendant node y. These features do not fire if m = 0.
TriggerPred contains features {TriggerPred[s, p] : s ∈ W*, p ∈ 𝒫}, where W = {it, Texas, …} is the set of words. Each of these features fires when a span of the utterance with words s triggers the predicate p—more precisely, when a subtree 〈p; e〉i..j exists with s = xi+1..j. Note that these lexicalized features use the predicate p rather than the abstracted version α(p).
TracePred contains features {TracePred[s, p, d] : s ∈ W*, p ∈ 𝒫, d ∈ {↙, ↘}}, each of which fires when a trace predicate p has been inserted over a word s. The situation is the following: Suppose we have a subtree a that ends at position k (there is a predicate in a that is triggered by a phrase with right endpoint k) and another subtree b that begins at k′. Recall that in the construction mechanism (46), we can insert a trace predicate p ∈ L(ε) between the roots of a and b. Then, for every word xj between the spans of the two subtrees (j = {k + 1, …, k′}), the feature TracePred[xj, p, d] fires (d = ↙ if b dominates a and d = ↘ if a dominates b).
TraceRel contains features {TraceRel[s, d, r] : s ∈ W*, d ∈ {↙, ↘}, r ∈ ℛ}, each of which fires when some trace predicate with parent relation r has been inserted over a word s.
TracePredRel contains features {TracePredRel[p, r, s] : p ∈ 𝒫, r ∈ ℛ, s ∈ W*}, each of which fires when a predicate p is connected via child relation r to some trace predicate over a word s.
These features are simple generic patterns which can be applied for modeling essentially any distribution over sequences and labeled trees—there is nothing specific to DCS at all. The first half of the feature templates (PredHit, Pred, PredRel, PredRelPred) capture properties of the tree independent of the utterance, and are similar to those used for syntactic dependency parsing. The other feature templates (TriggerPred, TracePred, TraceRel, TracePredRel) connect predicates in the DCS tree with words in the utterance, similar to those in a model of machine translation.
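Concretely, such a feature vector can be sketched as a sparse count vector over symbolic keys, scored by a dot product with a weight vector; the feature names below are illustrative instances of the templates.

```python
from collections import Counter

# phi(x, z) as a sparse count vector; the model scores (x, z) by a dot
# product with the weights theta.
def score(phi, theta):
    return sum(count * theta.get(f, 0.0) for f, count in phi.items())

phi = Counter({
    "PredHit": 2,
    "TriggerPred[states,state]": 1,
    "TracePred[in,loc]": 1,
})
theta = {"PredHit": -0.1, "TriggerPred[states,state]": 1.5}
```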
3.2 Parameter Estimation
We have now fully specified the details of the graphical model in Figure 2: Section 3.1 described semantic parsing and Section 2 described semantic evaluation. Next, we focus on the inferential problem of estimating the parameters θ of the model from data.
3.2.1 Objective Function
3.2.2 Algorithm
Given a candidate set function C(x), we can optimize Equation (71) to obtain estimates of the parameters θ. Ideally, we would use C(x) = ZL(x), the candidate sets from our construction mechanism in Section 2.6, but we quickly run into the problem of computing Equation (72) efficiently. Note that ZL(x) (defined in Equation (44)) grows exponentially with the length of x. This by itself is not a show-stopper. Our features (Section 3.1.1) decompose along the edges of the DCS tree, so it is possible to use dynamic programming12 to compute the second expectation of Equation (72). The problem is computing the first expectation, which sums over the subset of candidate DCS trees z satisfying the constraint 〚z〛w = y. Though this is a smaller set, there is no efficient dynamic program for it because the constraint does not decompose along the structure of the DCS tree. Therefore, we need to approximate this first expectation, and, in fact, we will approximate the second as well so that the two expectations in Equation (72) are coherent.
We now have a chicken-and-egg problem: If we had good parameters θ, we could generate good candidate sets C(x) using beam search. If we had good candidate sets C(x), we could generate good parameters by optimizing our objective in Equation (71). This problem leads to a natural solution: simply alternate between the two steps (Figure 21). This procedure is not guaranteed to converge, due to the heuristic nature of the beam search, but we have found it to be convergent in practice.
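The alternation can be sketched as a skeleton in which beam search and parameter optimization are placeholders supplied by the caller (they stand in for the paper's actual components):

```python
# Beam search under the current parameters yields candidate sets C(x);
# optimizing the objective on those candidates yields new parameters.
def learn(examples, beam_search, optimize, T=5):
    theta = {}                                # initial parameters
    for _ in range(T):
        C = {x: beam_search(x, theta) for x, _ in examples}
        theta = optimize(examples, C, theta)
    return theta
```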
4. Experiments
We have now completed the conceptual part of this article—using DCS trees to represent logical forms (Section 2), and learning a probabilistic model over these trees (Section 3). In this section, we evaluate and study our approach empirically. Our main result is that our system can obtain comparable accuracies to state-of-the-art systems that require annotated logical forms. All the code and data are available at cs.stanford.edu/∼pliang/software/.
4.1 Experimental Set-up
We first describe the data sets (Section 4.1.1) that we use to train and evaluate our system. We then mention various choices in the model and learning algorithm (Section 4.1.2). One of these choices is the lexical triggers, which are further discussed in Section 4.1.3.
4.1.1 Data sets
We tested our methods on two standard data sets, referred to in this article as Geo and Jobs. These data sets were created by Ray Mooney's group during the 1990s and have been used to evaluate semantic parsers for over a decade.
U.S. Geography. The Geo data set, originally created by Zelle and Mooney (1996), contains 880 questions about U.S. geography and a database of facts encoded in Prolog. The questions in Geo ask about general properties (e.g., area, elevation, and population) of geographical entities (e.g., cities, states, rivers, and mountains). Across all the questions, there are 280 word types, and the length of an utterance ranges from 4 to 19 words, with an average of 8.5 words. The questions involve conjunctions, superlatives, and negation, but no generalized quantification. Each question is annotated with a logical form in Prolog, for example:
Because our approach learns from answers, not logical forms, we evaluated the annotated logical forms on the provided database to obtain the correct answers.
Recall that a world/database w maps each predicate to a set of tuples w(p). Some predicates contain the set of tuples explicitly (e.g., mountain); others can be derived (e.g., higher takes two entities x and y and returns true if elevation(x) > elevation(y)). Other predicates are higher-order (e.g., sum, highest) in that they take other predicates as arguments. We do not use the provided domain-specific higher-order predicates (e.g., highest), but rather provide domain-independent higher-order predicates (e.g., argmax) and the ordinary domain-specific predicates (e.g., elevation). This provides more compositionality and therefore better generalization. Similarly, we use more and elevation instead of higher. Altogether, the domain contains 43 predicates plus one predicate for each value (e.g., CA).
Job Queries. The Jobs data set (Tang and Mooney 2001) contains 640 natural language queries about job postings. Most of the questions ask for jobs matching various criteria: job title, company, recruiter, location, salary, languages and platforms used, areas of expertise, required/desired degrees, and required/desired years of experience. Across all utterances, there are 388 word types, and the length of an utterance ranges from 2 to 23 words, with an average of 9.8 words.
The utterances are mostly based on conjunctions of criteria, with a sprinkling of negation and disjunction. Here is an example:
The Jobs data set comes with a database, which we can use as the world w. When the logical forms are evaluated on this database, however, close to half of the answers are empty (no jobs match the requested criteria). Therefore, there is a large discrepancy between obtaining the correct logical form (which has been the focus of most work on semantic parsing) and obtaining the correct answer (our focus).
To bring these two into better alignment, we generated a random database as follows: We created m = 100 jobs. For each job j, we go through each predicate p (e.g., company) that takes two arguments: a job and a target value. For each of the possible target values v, we add (j,v) to w(p) independently with probability α = 0.8. For example, for p = company, j = job37, we might add (job37, IBM) to w(company). The result is a database with a total of 23 predicates (which includes the domain-independent ones) in addition to the value predicates (e.g., IBM).
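The generation procedure above can be sketched as follows; the predicate and value inventories in the code are illustrative, not the data set's actual schema.

```python
import random

# Create m jobs; for each two-argument predicate p and possible target
# value v, add (job, v) to w(p) independently with probability alpha.
def generate_database(m=100, alpha=0.8, seed=0, predicates=None):
    predicates = predicates or {"company": ["IBM", "Google"],
                                "language": ["Java", "C++"]}
    rng = random.Random(seed)
    w = {p: set() for p in predicates}
    for j in range(m):
        job = f"job{j}"
        for p, values in predicates.items():
            for v in values:
                if rng.random() < alpha:
                    w[p].add((job, v))
    return w
```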
4.1.2 Settings
There are a number of settings that control the tradeoffs between computation, expressiveness, and generalization power of our model, shown here. For now, we will use generic settings chosen rather crudely; Section 4.3.4 will explore the effect of changing these settings.
Lexical Triggers The lexical triggers L (Section 2.6.1) define the set of candidate DCS trees for each utterance. There is a tradeoff between expressiveness and computational complexity: The more triggers we have, the more DCS trees we can consider for a given utterance, but then either the candidate sets become too large or beam search starts dropping the good DCS trees. Choosing lexical triggers is important and requires additional supervision (Section 4.1.3).
Features Our probabilistic semantic parsing model is defined in terms of feature templates (Section 3.1.1). Richer features increase expressiveness but also might lead to overfitting. By default, we include all the feature templates.
Number of training examples (n) An important property of any learning algorithm is its sample complexity—how many training examples are required to obtain a certain level of accuracy? By default, all training examples are used.
Number of training iterations (T) Our learning algorithm (Figure 21) alternates between updating candidate sets and updating parameters for T iterations. We use T = 5 as the default value.
Beam size (K) The computation of the candidate sets in Figure 21 is based on beam search where each intermediate state keeps at most K DCS trees. The default value is K = 100.
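The pruning step at each intermediate state amounts to keeping the top-K scoring trees, which can be written generically as:

```python
def beam_prune(trees, score, K=100):
    # Keep only the K highest-scoring DCS trees at an intermediate state.
    return sorted(trees, key=score, reverse=True)[:K]

# Toy usage: prune integers by their own value.
top3 = beam_prune(range(10), lambda t: t, K=3)  # [9, 8, 7]
```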
Optimization algorithm To optimize the objective function our default is to use the standard L-BFGS algorithm (Nocedal 1980) with a backtracking line search for choosing the step size.
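As a self-contained illustration of the backtracking component only, here is a standard Armijo line search in minimization form; this is a generic sketch, not the actual implementation used in our experiments:

```python
def backtracking_line_search(f, grad, x, direction, alpha=1.0, beta=0.5, c=1e-4):
    # Shrink the step size until the Armijo sufficient-decrease condition
    # holds along the given descent direction (minimization form).
    fx = f(x)
    slope = sum(g * d for g, d in zip(grad(x), direction))
    while f([xi + alpha * di for xi, di in zip(x, direction)]) > fx + c * alpha * slope:
        alpha *= beta
    return alpha
```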
Regularization (λ) The regularization parameter λ > 0 in the objective function is another knob for controlling the tradeoff between fitting and overfitting. The default is λ = 0.01.
4.1.3 Lexical Triggers
The lexical trigger set L (Section 2.6.1) is a set of entries (s, p), where s is a sequence of words and p is a predicate. We run experiments on two sets of lexical triggers: base triggers LB and augmented triggers LB+P.
Base Triggers. The base trigger set LB includes three types of entries:
Domain-independent triggers: For each domain-independent predicate (e.g., argmax), we manually specify a few words associated with that predicate (e.g., most). The full list is shown at the top of Figure 22.
Values: For each value x that appears in the world (specifically, x ∈ vj ∈ w(p) for some tuple v, index j, and predicate p), LB contains an entry (x, x) (e.g., (Boston, Boston: city)). Note that this rule implicitly specifies an infinite number of triggers.
Regarding predicate names, we do not add entries such as (city, city), because we want our system to be language-independent. In Turkish, for instance, we would not have the luxury of lexicographical cues that associate city with şehir. So we should think of the predicates as just symbols predicate1, predicate2, and so on. On the other hand, values in the database are generally proper nouns (e.g., city names) for which there are generally strong cross-linguistic lexicographic similarities.
Part-of-speech (POS) triggers: For each domain-specific predicate p, we specify a set of POS tags T. Implicitly, LB contains all pairs (x, p) where the word x has a POS tag t ∈ T. For example, for city, we would specify nn and nns, which means that any word which is a singular or plural common noun triggers the predicate city. Note that city triggers city as desired, but state also triggers city.
The POS triggers for Geo and Jobs domains are shown in the left side of Figure 22. Note that some predicates such as traverse and loc are not associated with any POS tags. Predicates corresponding to verbs and prepositions are not included as overt lexical triggers, but rather included as trace predicates L(ε). In constructing the logical forms, nouns and adjectives serve as anchor points. Trace predicates can be inserted between these anchors. This strategy is more flexible than requiring each predicate to spring from some word.
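As a sketch of how POS triggers expand, consider the following; the trigger table here is illustrative, not the full contents of Figure 22:

```python
# Hypothetical POS-trigger table: each domain-specific predicate lists the
# POS tags that implicitly trigger it.
POS_TRIGGERS = {
    "city": {"nn", "nns"},
    "state": {"nn", "nns"},
    "major": {"jj"},
}

def triggered_predicates(word, tag):
    # Every predicate whose tag set contains this word's POS tag fires; hence
    # the word 'state' (tagged nn) triggers both city and state, as noted above.
    return {p for p, tags in POS_TRIGGERS.items() if tag in tags}
```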
Augmented Triggers. We now define the augmented trigger set LB+P, which contains more domain-specific information than LB. Specifically, for each domain-specific predicate (e.g., city), we manually specify a single prototype word (e.g., city) associated with that predicate. Under LB+P, city would trigger only city because city is a prototype word, but town would trigger all the nn predicates (city, state, country, etc.) because it is not a prototype word.
Prototype triggers require only a modest amount of domain-specific supervision (see the right side of Figure 22 for the entire list for Geo and Jobs). In fact, as we'll see in Section 4.2, prototype triggers are not absolutely required to obtain good accuracies, but they give an extra boost and also improve computational efficiency by reducing the set of candidate DCS trees.
Finally, to determine triggering, we stem all words using the Porter stemmer (Porter 1980), so that mountains triggers the same predicates as mountain. We also decompose superlatives into two words (e.g., largest is mapped to most large), allowing us to construct the logical form more compositionally.
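A minimal sketch of this preprocessing follows, with a toy suffix-stripper standing in for the actual Porter stemmer and a crude superlative rule that handles only regular -est forms:

```python
def toy_stem(word):
    # Crude stand-in for the Porter stemmer: strip a plural -s so that,
    # e.g., 'mountains' and 'mountain' share the same form.
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def decompose_superlative(word):
    # Map a regular superlative to two words, e.g., 'largest' -> ['most', 'large'].
    if word.endswith("est") and len(word) > 4:
        return ["most", word[:-3] + "e"]
    return [word]
```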
4.2 Comparison with Other Systems
We now compare our approach with existing methods. We used the same training-test splits as Zettlemoyer and Collins (2005) (600 training and 280 test examples for Geo, 500 training and 140 test examples for Jobs). For development, we created five random splits of the training data. For each split, we put 70% of the examples into a development training set and the remaining 30% into a development test set. The actual test set was only used for obtaining final numbers.
4.2.1 Systems that Learn from Question–Answer Pairs
We first compare our system (henceforth, LJK11) with Clarke et al. (2010) (henceforth, CGCR10), which is most similar to our work in that it also learns from question–answer pairs without using annotated logical forms. CGCR10 works with the FunQL language and casts semantic parsing as integer linear programming (ILP). In each iteration, the learning algorithm solves the ILP to predict the logical form for each training example. The examples with correct predictions are fed to a structural support vector machine (SVM) and the model parameters are updated.
Though similar in spirit, there are some important differences between CGCR10 and our approach. They use ILP instead of beam search and a structural SVM instead of log-linear models, but the main difference is which examples are used for learning. Our approach learns on any feasible example (Section 3.2.1), one where the candidate set contains a logical form that evaluates to the correct answer. CGCR10 uses a much more stringent criterion: The highest-scoring logical form must evaluate to the correct answer. Therefore, for their algorithm to progress, the model must already be non-trivially good before learning even starts. This is reflected in the amount of prior knowledge and initialization that CGCR10 uses before learning starts: WordNet features, syntactic parse trees, and a set of lexical triggers with 1.42 words per non-value predicate. Our system with base triggers requires only simple indicator features, POS tags, and 0.5 words per non-value predicate.
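The two criteria can be contrasted in a few lines; here `evaluate` and `score` are placeholders for denotation computation and model scoring:

```python
def is_feasible(candidates, correct_answer, evaluate):
    # Our criterion: learn from the example if *any* candidate logical
    # form evaluates to the correct answer.
    return any(evaluate(z) == correct_answer for z in candidates)

def strict_criterion(candidates, correct_answer, evaluate, score):
    # The stricter criterion: the single highest-scoring candidate must
    # itself evaluate to the correct answer.
    return evaluate(max(candidates, key=score)) == correct_answer
```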
CGCR10 created a version of Geo which contains 250 training and 250 test examples. Table 2 compares the empirical results on this split. We see that our system (LJK11) with base triggers significantly outperforms CGCR10 (84.0% vs. 73.2%), and it even outperforms the version of CGCR10 that is trained using logical forms (84.0% vs. 80.4%). If we use augmented triggers, we widen the gap by another 3.6 percentage points.
| System | Reference | Accuracy (%) |
|---|---|---|
| CGCR10 w/answers | Clarke et al. (2010) | 73.2 |
| CGCR10 w/logical forms | Clarke et al. (2010) | 80.4 |
| LJK11 w/base triggers | Liang, Jordan, and Klein (2011) | 84.0 |
| LJK11 w/augmented triggers | Liang, Jordan, and Klein (2011) | 87.6 |
4.2.2 State-of-the-Art Systems
We now compare our system (LJK11) with state-of-the-art systems, which all require annotated logical forms (except Precise). Here is a brief overview of the systems:
Cocktail (Tang and Mooney 2001) uses inductive logic programming to learn rules for driving the decisions of a shift-reduce semantic parser. It assumes that a lexicon (mapping from words to predicates) is provided.
Precise (Popescu, Etzioni, and Kautz 2003) does not use learning, but instead relies on matching words to strings in the database using various heuristics based on WordNet and the Charniak parser. Like our work, it also uses database type constraints to rule out spurious logical forms. One of the unique features of Precise is that it has 100% precision—it refuses to parse an utterance which it deems semantically intractable.
Scissor (Ge and Mooney 2005) learns a generative probabilistic model that extends the Collins (1999) models with semantic labels, so that syntactic and semantic parsing can be done jointly.
Silt (Kate, Wong, and Mooney 2005) learns a set of transformation rules for mapping utterances to logical forms.
Krisp (Kate and Mooney 2006) uses SVMs with string kernels to drive the local decisions of a chart-based semantic parser.
Wasp (Wong and Mooney 2006) uses log-linear synchronous grammars to transform utterances into logical forms, starting with word alignments obtained from the IBM models.
λ-Wasp (Wong and Mooney 2007) extends Wasp to work with logical forms that contain bound variables (lambda abstraction).
LNLZ08 (Lu et al. 2008) learns a generative model over hybrid trees, which are logical forms augmented with natural language words. IBM model 1 is used to initialize the parameters, and a discriminative reranking step works on top of the generative model.
ZC05 (Zettlemoyer and Collins 2005) learns a discriminative log-linear model over CCG derivations. Starting with a manually constructed domain-independent lexicon, the training procedure grows the lexicon by adding lexical entries derived from associating parts of an utterance with parts of the annotated logical form.
ZC07 (Zettlemoyer and Collins 2007) extends ZC05 with extra (disharmonic) combinators to increase the expressive power of the model.
KZGS10 (Kwiatkowski et al. 2010) uses a restricted higher-order unification procedure, which iteratively breaks up a logical form into smaller pieces. This approach gradually adds lexical entries of increasing generality, thus obviating the need for the manually specified templates used by ZC05 and ZC07 for growing the lexicon. IBM model 1 is used to initialize the parameters.
KZGS11 (Kwiatkowski et al. 2011) extends KZGS10 by factoring lexical entries into a template plus a sequence of predicates that fill the slots of the template. This factorization improves generalization.
With the exception of Precise, all other systems require annotated logical forms, whereas our system learns only from annotated answers. On the other hand, our system does rely on a few manually specified lexical triggers, whereas many of the later systems essentially require no manually crafted lexica. For us, the lexical triggers play a crucial role in the initial stages of learning because they constrain the set of candidate DCS trees; otherwise we would face a hopelessly intractable search problem. The other systems induce lexica using unsupervised word alignment (Wong and Mooney 2006; Wong and Mooney 2007; Kwiatkowski et al. 2010; Kwiatkowski et al. 2011) and/or on-line lexicon learning (Zettlemoyer and Collins 2005; Zettlemoyer and Collins 2007; Kwiatkowski et al. 2010; Kwiatkowski et al. 2011). Unfortunately, we cannot use these automatic techniques because they rely on having annotated logical forms.
Table 3 shows the results for Geo. Semantic parsers are typically evaluated on the accuracy of the logical forms: precision (the accuracy on utterances which are successfully parsed) and recall (the accuracy on all utterances). We focus only on recall (a lower bound on precision) and simply use the word accuracy to refer to recall. Our system is evaluated only on answer accuracy because our model marginalizes out the latent logical form. All other systems are evaluated on the accuracy of logical forms. To calibrate, we also evaluated KZGS10 on answer accuracy and found that it was quite similar to its logical form accuracy (88.9% vs. 88.2%). This does not imply that our system would necessarily have a high logical form accuracy because multiple logical forms can produce the same answer, and our system does not receive a training signal to tease them apart. Even with only base triggers, our system (LJK11) outperforms all but two of the systems, falling short of KZGS10 by only one percentage point (87.9% vs. 88.9%). With augmented triggers, our system takes the lead (91.4% vs. 88.9%).
| System | Reference | LF (%) | Answer (%) |
|---|---|---|---|
| Cocktail | Tang and Mooney (2001) | 79.4 | – |
| Precise | Popescu, Etzioni, and Kautz (2003) | 77.5 | 77.5 |
| Scissor | Ge and Mooney (2005) | 72.3 | – |
| Silt | Kate, Wong, and Mooney (2005) | 54.1 | – |
| Krisp | Kate and Mooney (2006) | 71.7 | – |
| Wasp | Wong and Mooney (2006) | 74.8 | – |
| λ-Wasp | Wong and Mooney (2007) | 86.6 | – |
| LNLZ08 | Lu et al. (2008) | 81.8 | – |
| ZC05 | Zettlemoyer and Collins (2005) | 79.3 | – |
| ZC07 | Zettlemoyer and Collins (2007) | 86.1 | – |
| KZGS10 | Kwiatkowski et al. (2010) | 88.2 | 88.9 |
| KZGS11 | Kwiatkowski et al. (2011) | 88.6 | – |
| LJK11 w/base triggers | Liang, Jordan, and Klein (2011) | – | 87.9 |
| LJK11 w/augmented triggers | Liang, Jordan, and Klein (2011) | – | 91.4 |
Table 4 shows the results for Jobs. The two learning-based systems (Cocktail and ZC05) are actually outperformed by Precise, which is able to use strong database type constraints. By exploiting this information and doing learning, we obtain the best results.
| System | Reference | LF (%) | Answer (%) |
|---|---|---|---|
| Cocktail | Tang and Mooney (2001) | 79.4 | – |
| Precise | Popescu, Etzioni, and Kautz (2003) | 88.0 | 88.0 |
| ZC05 | Zettlemoyer and Collins (2005) | 79.3 | – |
| LJK11 w/base triggers | Liang, Jordan, and Klein (2011) | – | 90.7 |
| LJK11 w/augmented triggers | Liang, Jordan, and Klein (2011) | – | 95.0 |
4.3 Empirical Properties
In this section, we try to gain intuition into properties of our approach. All experiments in this section were performed on random development splits. Throughout this section, “accuracy” means development test accuracy.
4.3.1 Error Analysis
To understand the type of errors our system makes, we examined one of the development runs, which had 34 errors on the test set. We classified these errors into the following categories (the number of errors in each category is shown in parentheses):
Incorrect POS tags (8): Geo is out-of-domain for our POS tagger, so the tagger makes some basic errors that adversely affect the predicates that can be lexically triggered. For example, the question What states border states … is tagged as wp vbz nn nns …, which means that the first states cannot trigger state. In another example, major river is tagged as nnp nnp, so these cannot trigger the appropriate predicates either, and thus the desired DCS tree cannot even be constructed.
Non-projectivity (3): The candidate DCS trees are defined by a projective construction mechanism (Section 2.6) that prohibits edges in the DCS tree from crossing. This means we cannot handle utterances such as largest city by area, because the desired DCS tree would have city dominating area dominating argmax. To construct this DCS tree, we could allow local reordering of the words.
Unseen words (2): We never saw at least or sea level at training time. The former has the correct lexical trigger, but its feature weight is zero (having never been updated), which is not enough to encourage its use. For the latter, the problem is more structural: We have no lexical triggers for 0: length, and only adding more lexical triggers can solve this problem.
Wrong lexical triggers (7): Sometimes the error is localized to a single lexical trigger. For example, the model incorrectly thinks Mississippi is the state rather than the river, and that Rochester is the city in New York rather than the name, even though there are contextual cues to disambiguate in these cases.
Extra words (5): Sometimes, words trigger predicates that should be ignored. For example, for population density, the first word triggers population, which is used rather than density.
Over-smoothing of DCS tree (9): The first half of our features (Figure 20) are defined on the DCS tree alone; these produce a form of smoothing that encourages DCS trees to look alike regardless of the words. We found several instances where this essential tool for generalization went too far. For example, in state of Nevada, the trace predicate border is inserted between the two nouns, because it creates a structure more similar to that of the common question what states border Nevada?
4.3.2 Visualization of Features
Having analyzed the behavior of our system for individual utterances, let us move from the token level to the type level and analyze the learned parameters of our model. We do not look at raw feature weights, because there are complex interactions between them not represented by examining individual weights. Instead, we look at expected feature counts, which we think are more interpretable.
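Expected feature counts under a log-linear model can be computed as follows; this is a generic sketch in which each candidate is represented simply by its feature-count dictionary:

```python
import math

def expected_feature_counts(candidates, weights):
    # candidates: list of feature-count dicts, one per candidate DCS tree.
    # Returns E_p[f] under p(z) proportional to exp(weights . f(z)).
    scores = [sum(weights.get(f, 0.0) * v for f, v in feats.items())
              for feats in candidates]
    m = max(scores)  # subtract the max score for numerical stability
    probs = [math.exp(s - m) for s in scores]
    Z = sum(probs)
    expected = {}
    for p, feats in zip(probs, candidates):
        for f, v in feats.items():
            expected[f] = expected.get(f, 0.0) + (p / Z) * v
    return expected
```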
4.3.3 Learning, Search, Bootstrapping
Recall from Section 3.2.1 that a training example is feasible (with respect to our beam search) if the resulting candidate set contains a DCS tree with the correct answer. Infeasible examples are skipped, but an example may become feasible in a later iteration. A natural question is how many training examples are feasible in each iteration. Figure 24 shows the answer: Initially, only around 30% of the training examples are feasible; this is not surprising given that all the parameters are zero, so our beam search is essentially unguided. Training on just these examples improves the parameters, however, and over the next few iterations, the number of feasible examples steadily increases to around 97%.
In our algorithm, learning and search are deeply intertwined. Search is of course needed to learn, but learning also improves search. The general approach is similar in spirit to Searn (Daume, Langford, and Marcu 2009), although we do not have any formal guarantees at this point.
Our algorithm also has a bootstrapping flavor. The “easy” examples are processed first, where easy is defined by the ability of beam search to generate the correct answer. This bootstrapping occurs quite naturally: Unlike most bootstrapping algorithms, we do not have to set a confidence threshold for accepting new training examples, something that can be quite tricky to do. Instead, our threshold falls out of the discrete nature of the beam search.
4.3.4 Effect of Various Settings
So far, we have used our approach with default settings (Section 4.1.2). How sensitive is the approach to these choices? Table 5 shows the impact of the feature templates. Figure 25 shows the effect of the number of training examples, number of training iterations, beam size, and regularization parameter. The overall conclusion is that there are no big surprises: Our default settings could be improved on slightly, but these differences are often smaller than the variation across different development splits.
| Features | Accuracy (%) |
|---|---|
| PRED | 13.4 ±1.6 |
| PRED + PREDREL | 18.4 ±3.5 |
| PRED + PREDREL + PREDRELPRED | 23.1 ±5.0 |
| PRED + TRIGGERPRED | 61.3 ±1.1 |
| PRED + TRIGGERPRED + TRACE* | 76.4 ±2.3 |
| PRED + PREDREL + PREDRELPRED + TRIGGERPRED + TRACE* | 84.7 ±3.5 |
5. Discussion
The work we have presented in this article addresses three important themes. The first theme is semantic representation (Section 5.1): How do we parametrize the mapping from utterances to their meanings? The second theme is program induction (Section 5.2): How do we efficiently search through the space of logical structures given a weak feedback signal? Finally, the last theme is grounded language (Section 5.3): How do we use constraints from the world to guide learning of language and conversely use language to interact with the world?
5.1 Semantic Representation
Since the late nineteenth century, philosophers and linguists have worked on elucidating the relationship between an utterance and its meaning. One of the pillars of formal semantics is Frege's principle of compositionality, that the meaning of an utterance is built by composing the meaning of its parts. What these parts are and how they are composed is the main question. The dominant paradigm, which stems from the seminal work of Richard Montague (1973) in the early 1970s, states that parts are lambda calculus expressions that correspond to syntactic constituents, and composition is function application.
Consider the compositionality principle from a statistical point of view, where we construe compositionality as factorization. Factorization, the way a statistical model breaks into features, is necessary for generalization: It enables us to learn from previously seen examples and interpret new utterances. Projecting back to Frege's original principle, the parts are the features (Section 3.1.1), and composition is the DCS construction mechanism (Section 2.6) driven by parameters learned from training examples.
Taking the statistical view of compositionality, finding a good semantic representation becomes designing a good statistical model. But statistical modeling must also deal with the additional issue of language acquisition or learning, which presents complications: In absorbing training examples, our learning algorithm must inevitably traverse through intermediate models that are wrong or incomplete. The algorithms must therefore tolerate this degradation, and do so in a computationally efficient way. For example, in the line of work on learning probabilistic CCGs (Zettlemoyer and Collins 2005; Zettlemoyer and Collins 2007; Kwiatkowski et al. 2010), many candidate lexical entries must be entertained for each word even when polysemy does not actually exist (Section 2.6.4).
To improve generalization, the lexicon can be further factorized (Kwiatkowski et al. 2011), but this is all done within the constraints of CCG. DCS represents a departure from this tradition, which replaces a heavily lexicalized constituency-based formalism with a lightly-lexicalized dependency-based formalism. We can think of DCS as a shift in linguistic coordinate systems, which makes certain factorizations or features more accessible. For example, we can define features on paths between predicates in a DCS tree which capture certain lexical patterns much more easily than in a lambda calculus expression or a CCG derivation.
DCS has a family resemblance to a semantic representation called natural logic form (Alshawi, Chang, and Ringgaard 2011), which is also motivated by the benefits of working with dependency-based logical forms. The goals and the detailed structure of the two semantic formalisms are different, however. Alshawi, Chang, and Ringgaard (2011) focus on parsing complex sentences in an open domain where a structured database or world does not exist. Whereas they do equip their logical forms with a full model-theoretic semantics, the logical forms are actually closer to dependency trees: Quantifier scope is left unspecified, and the predicates are simply the words.
Perhaps not immediately apparent is the fact that DCS draws an important idea from Discourse Representation Theory (DRT) (Kamp and Reyle 1993)—not from the treatment of anaphora and presupposition which it is known for, but something closer to its core. This is the idea of having a logical form where all variables are existentially quantified and constraints are combined via conjunction—a Discourse Representation Structure (DRS) in DRT, or a basic DCS tree with only join relations. Computationally, these logical structures conveniently encode CSPs. Linguistically, it appears that existential quantifiers play an important role and should be treated specially (Kamp and Reyle 1993). DCS takes this core and focuses on semantic compositionality and computation, whereas DRT focuses more on discourse and pragmatics.
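To make the computational point concrete, here is a tiny brute-force check of such a conjunctive logical form, treated directly as a CSP. This is illustrative only; actual DCS evaluation exploits the tree structure rather than enumerating assignments:

```python
from itertools import product

def satisfiable(domains, constraints):
    # All variables existentially quantified, constraints conjoined:
    # search for one assignment that satisfies every constraint.
    names = list(domains)
    for values in product(*(domains[v] for v in names)):
        assignment = dict(zip(names, values))
        if all(check(assignment) for check in constraints):
            return True
    return False
```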
In addition to the statistical view of DCS as a semantic representation, it is useful to think about DCS from the perspective of programming language design. Two programming languages can be equally expressive, but what matters is how simple it is to express a desired type of computation in a given language. In some sense, we designed the DCS formal language to make it easy to represent computations expressed by natural language. An important part of DCS is the mark–execute construct, a uniform framework for dealing with the divergence between syntactic and semantic scope. This construct allows us to build simple DCS tree structures and still handle the complexities of phenomena such as quantifier scope variation. Compared to lambda calculus, think of DCS as a higher-level programming language tailored to natural language, which results in simpler programs (DCS trees). Simpler programs are easier for us to work with and easier for an algorithm to learn.
5.2 Program Induction
Searching over the space of programs is challenging. This is the central computational challenge of program induction, that of inferring programs (logical forms) from their behavior (denotations). This problem has been tackled by different communities in various forms: program induction in AI, programming by demonstration in Human–Computer Interaction, and program synthesis in programming languages. The core computational difficulty is that the supervision signal—the behavior—is a complex function of the program that cannot be easily inverted. What program generated the output Arizona, Nevada, and Oregon?
Perhaps somewhat counterintuitively, program induction is easier if we infer programs for not a single task but for multiple tasks. The intuition is that when the tasks are related, the solution to one task can help another task, both computationally in navigating the program space and statistically in choosing the appropriate program if there are multiple feasible possibilities (Liang, Jordan, and Klein 2010). In our semantic parsing work, we want to infer a logical form for each utterance (task). Clearly the tasks are related because they use the same vocabulary to talk about the same domain.
Natural language also makes program induction easier by providing side information (words) which can be used to guide the search. There have been several papers that induce programs in this setting: Eisenstein et al. (2009) induce conjunctive formulae from natural language instructions, Piantadosi et al. (2008) induce first-order logic formulae using CCG in a small domain assuming observed lexical semantics, and Clarke et al. (2010) induce logical forms in semantic parsing. In the ideal case, the words would determine the program predicates, and the utterance would determine the entire program compositionally. But of course, this mapping is not given and must be learned.
5.3 Grounded Language
In recent years, there has been an increased interest in connecting language with the world. One of the primary issues in grounded language is alignment—figuring out what fragments of utterances refer to what aspects of the world. In fact, semantic parsers trained on utterances paired with annotated logical forms (those discussed in Section 4.2.2) need to solve the task of aligning words to predicates. Some can learn from utterances paired with a set of logical forms, one of which is correct (Kate and Mooney 2007; Chen and Mooney 2008). Liang, Jordan, and Klein (2009) tackle the even more difficult alignment problem of segmenting and aligning a discourse to a database of facts, where many parts on either side are irrelevant.
If we know how the world relates to language, we can leverage structure in the world to guide the learning and interpretation of language. We saw that type constraints from the database/world reduce the set of candidate logical forms and lead to more accurate systems (Popescu, Etzioni, and Kautz 2003; Liang, Jordan, and Klein 2011). Even for syntactic parsing, information from the denotation of an utterance can be helpful (Schuler 2003).
One of the exciting aspects about using the world for learning language is that it opens the door to many new types of supervision. We can obtain answers given a world, which are cheaper to obtain than logical forms (Clarke et al. 2010; Liang, Jordan, and Klein 2011). Other researchers have also pushed in this direction in various ways: learning a semantic parser based on bootstrapping and estimating the confidence of its own predictions (Goldwasser et al. 2011), learning a semantic parser from user interactions with a dialog system (Artzi and Zettlemoyer 2011), and learning to execute natural language instructions from just a reward signal using reinforcement learning (Branavan et al. 2009; Branavan, Zettlemoyer, and Barzilay 2010; Branavan, Silver, and Barzilay 2011). In general, supervision from the world is indirectly related to the learning task, but it is often much more plentiful and natural to obtain.
The benefits can also flow from language to the world. For example, previous work learned to interpret language to troubleshoot a Windows machine (Branavan et al. 2009; Branavan, Zettlemoyer, and Barzilay 2010), win a game of Civilization (Branavan, Silver, and Barzilay 2011), play a legal game of solitaire (Eisenstein et al. 2009; Goldwasser and Roth 2011), and navigate a map by following directions (Vogel and Jurafsky 2010; Chen and Mooney 2011). Even when the objective in the world is defined independently of language (e.g., in Civilization), language can provide a useful bias towards the non-linguistic end goal.
6. Conclusions
The main conceptual contribution of this article is a new semantic formalism, dependency-based compositional semantics (DCS), and techniques to learn a semantic parser from question–answer pairs where the intermediate logical form (a DCS tree) is induced in an unsupervised manner. Our final question–answering system was able to match the accuracies of state-of-the-art systems that learn from annotated logical forms.
There is currently a significant conceptual gap between our question–answering system (which can be construed as a natural language interface to a database) and open-domain question–answering systems. The former focuses on understanding a question compositionally and computing the answer compositionally, whereas the latter focuses on retrieving and ranking answers from a large unstructured textual corpus. The former has depth; the latter has breadth. Developing methods that can both model the semantic richness of language and scale up to an open-domain setting remains an open challenge.
We believe that it is possible to push our approach in the open-domain direction. Neither DCS nor the learning algorithm is tied to having a clean rigid database, which could instead be a database generated from a noisy information extraction process. The key is to drive the learning with the desired behavior, the question–answer pairs. The latent variable is the logical form or program, which just tries to compute the desired answer by piecing together whatever information is available. Of course, there are many open challenges ahead, but with the proper combination of linguistic, statistical, and computational insight, we hope to eventually build systems with both breadth and depth.
Acknowledgments
We thank Luke Zettlemoyer and Tom Kwiatkowski for providing us with data and answering questions, as well as the anonymous reviewers for their detailed feedback. P. L. was supported by an NSF Graduate Research Fellowship.
Notes
We use the term sequence to refer to both tuples (v1, …, vk) and arrays [v1, …, vk]. For our purposes, there is no functional difference between tuples and arrays; the distinction is convenient when we start to talk about arrays of tuples.
Technically, the node is c and the variable is a(c), but we use c to denote the variable to simplify notation.
Unlike the CSPs corresponding to DCS trees, the CSPs corresponding to DRSs need not be tree-structured, though economical DRT (Bos 2009) imposes a tree-like restriction on DRSs for computational reasons.
DRT started the dynamic semantics tradition where meanings are context-change potentials, a natural way to capture anaphora. The DCS formalism presented here does not deal with anaphora, so we give it a purely static semantics.
The two meanings are: (i) there is a river x such that x traverses every city; and (ii) for every city x, some river traverses x.
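In standard first-order notation (our rendering, not the article's; predicate names are illustrative), the two scope readings can be written as:

```latex
% Reading (i): the existential takes wide scope -- one river traverses every city:
\exists x.\, \mathrm{river}(x) \wedge \forall y.\, \bigl(\mathrm{city}(y) \rightarrow \mathrm{traverse}(x, y)\bigr)

% Reading (ii): the universal takes wide scope -- each city is traversed by some river:
\forall y.\, \mathrm{city}(y) \rightarrow \exists x.\, \bigl(\mathrm{river}(x) \wedge \mathrm{traverse}(x, y)\bigr)
```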
The two meanings are: (i) a state that borders Alaska (which is the largest state); and (ii) a state with the highest score, where the score of a state x is the maximum size of any state that x borders (Alaska is irrelevant here because no states border it).
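The two superlative readings can likewise be formalized; this is our sketch, with illustrative predicate names, not the article's notation:

```latex
% Reading (i): absolute -- "the largest state" is resolved first (to Alaska),
% and we ask which states border it:
\lambda x.\, \mathrm{state}(x) \wedge
  \mathrm{border}\Bigl(x,\; \operatorname*{arg\,max}_{y:\,\mathrm{state}(y)} \mathrm{size}(y)\Bigr)

% Reading (ii): relative -- each state is scored by the largest state it borders,
% and we take the highest-scoring state:
\operatorname*{arg\,max}_{x:\,\mathrm{state}(x)}\;
  \max_{y:\,\mathrm{state}(y) \wedge \mathrm{border}(x,\,y)} \mathrm{size}(y)
```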
The join and project operations are taken from relational algebra.
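As a brief illustration of these two relational-algebra operations (a minimal sketch of their standard definitions, not the authors' implementation; the function names and toy facts are our own), relations can be represented as sets of tuples:

```python
def project(relation, columns):
    """Projection: keep only the given column indices of each tuple."""
    return {tuple(row[i] for i in columns) for row in relation}

def join(r1, r2, i, j):
    """Equi-join: concatenate pairs of rows from r1 and r2 whose
    values agree at column i of r1 and column j of r2."""
    return {row1 + row2 for row1 in r1 for row2 in r2 if row1[i] == row2[j]}

# Toy geography facts: traverse(river, state) and state(name).
traverse = {("mississippi", "arkansas"), ("colorado", "utah")}
state = {("utah",), ("arkansas",)}

# "rivers that traverse some state": join traverse's second column with
# state's first column, then project onto the river column.
rivers = project(join(traverse, state, 1, 0), [0])
```

Here `rivers` comes out as `{("mississippi",), ("colorado",)}`; DCS denotations are computed by composing operations of exactly this kind along the tree.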
Defined this way, we can only handle conservative quantifiers, because the nuclear scope will always be a subset of the restrictor. This design decision is inspired by DRT, where it provides a way of modeling donkey anaphora. We are not treating anaphora in this work, but we can handle it by allowing pronouns in the nuclear scope to create anaphoric edges into nodes in the restrictor. These constraints naturally propagate through the nuclear scope's CSP without affecting the restrictor.
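For reference, a generalized quantifier $Q$ with restrictor $A$ and nuclear scope $B$ is conservative when its truth depends only on the part of the scope that lies inside the restrictor (a standard definition from generalized quantifier theory, stated here in our own notation):

```latex
Q(A, B) \;\Longleftrightarrow\; Q(A,\, A \cap B)
% e.g. "every A is B" holds iff "every A is an A that is B" holds
```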
In the continuation-based approach, this difference corresponds to the difference between assigning a denotational versus an operational semantics.
To further reduce the search space, F imposes a few additional constraints: for example, limiting the number of columns to 2, and only allowing trace predicates between arity-1 predicates.
The state of the dynamic program would be the span i..j and the head predicate over that span.
Note that the numbers for LJK11 differ from those presented in Liang, Jordan, and Klein (2011), which reports results based on 10 different splits rather than the set-up used by CGCR10.
Our system produces a logical form for every utterance, and thus our precision is the same as our recall.
The 88.2% corresponds to 87.9% in Kwiatkowski et al. (2010). The difference is due to using a slightly newer version of the code.
The 87.9% and 91.4% correspond to 88.6% and 91.1% in Liang, Jordan, and Klein (2011). These differences are due to minor differences in the code.
Here, world need not refer to the physical world, but could be any virtual world. The point is that the world has non-trivial structure and exists extra-linguistically.
References
Author notes
Computer Science Division, University of California, Berkeley, CA 94720, USA. E-mail: [email protected].
Computer Science Division and Department of Statistics, University of California, Berkeley, CA 94720, USA. E-mail: [email protected].
Computer Science Division, University of California, Berkeley, CA 94720, USA. E-mail: [email protected].