We propose a novel graph-based approach for semantic parsing that resolves two problems observed in the literature: (1) seq2seq models fail on compositional generalization tasks; (2) previous work using phrase structure parsers cannot cover all the semantic parses observed in treebanks. We prove that both MAP inference and latent tag anchoring (required for weakly-supervised learning) are NP-hard problems. We propose two optimization algorithms based on constraint smoothing and conditional gradient to approximately solve these inference problems. Experimentally, our approach delivers state-of-the-art results on GeoQuery, Scan, and Clevr, both for i.i.d. splits and for splits that test for compositional generalization.

Semantic parsing aims to transform a natural language utterance into a structured representation that can be easily manipulated by software (e.g., to query a database). As such, it is a central task in human–computer interfaces. Andreas et al. (2013) first proposed to rely on machine translation models for semantic parsing, where the target representation is linearized and treated as a foreign language. Due to recent advances in deep learning, and especially in sequence-to-sequence (seq2seq) architectures with attention for machine translation (Bahdanau et al., 2015), it is appealing to use the same architectures for standard structured prediction problems (Vinyals et al., 2015). This approach is indeed common in semantic parsing (Jia and Liang, 2016; Dong and Lapata, 2016; Wang et al., 2020), as well as in other domains. Unfortunately, seq2seq architectures have well-known limitations for semantic parsing. First, at test time, the decoding algorithm is typically based on beam search, as the model is autoregressive and does not make any independence assumption. In case of prediction failure, it is therefore unknown whether the error is due to the weighting function or to the optimal solution falling out of the beam. Second, seq2seq models are known to fail when compositional generalization is required (Lake and Baroni, 2018; Finegan-Dollak et al., 2018a; Keysers et al., 2020).

In order to bypass these problems, Herzig and Berant (2021) proposed to represent the semantic content associated with an utterance as a phrase structure, i.e., using the same representation usually associated with syntactic constituents. As such, their semantic parser is based on standard span-based decoding algorithms (Hall et al., 2014; Stern et al., 2017; Corro, 2020) with additional well-formedness constraints from the semantic formalism. Given a weighting function, MAP inference is a polynomial time problem that can be solved via a variant of the CYK algorithm (Kasami, 1965; Younger, 1967; Cocke, 1970). Experimentally, Herzig and Berant (2021) show that their approach outperforms seq2seq models in terms of compositional generalization, therefore effectively bypassing the two major problems of these architectures.

The complexity of MAP inference for phrase structure parsing is directly impacted by the considered search space (Kallmeyer, 2010). Importantly, (ill-nested) discontinuous phrase structure parsing is known to be NP-hard, even with a bounded block-degree (Satta, 1992). Herzig and Berant (2021) explore two restricted inference algorithms, both of which have a cubic time complexity with respect to the input length. The first one only considers continuous phrase structures, that is, derived trees that could have been generated by a context-free grammar, and the second one also considers a specific type of discontinuities, see Corro (2020, Section 3.6). Both algorithms fail to cover the full set of phrase structures observed in semantic treebanks, see Figure 1.

Figure 1: 

Example of a semantic phrase structure from GeoQuery. This structure is outside of the search space of the parser of Herzig and Berant (2021) as the constituent in red is discontinuous and also has a discontinuous parent (in red + green).


In this work, we propose to reduce semantic parsing without reentrancy (i.e., a given predicate or entity cannot be used as an argument of two different predicates) to a bi-lexical dependency parsing problem. As such, we tackle the same semantic content as the aforementioned previous work, but using a different mathematical representation (Rambow, 2010). We identify two main benefits to our approach: (1) as we allow crossing arcs, i.e., “non-projective graphs”, all datasets are guaranteed to be fully covered, and (2) it allows us to rely on optimization methods to tackle the intractability of inference in our novel graph-based formulation of the problem. More specifically, in our setting we need to jointly assign predicates/entities to words that convey semantic content and to identify the arguments of predicates via bi-lexical dependencies. We show that MAP inference in this setting is equivalent to the maximum generalized spanning arborescence problem (Myung et al., 1995) with supplementary constraints that ensure well-formedness with respect to the semantic formalism. Although this problem is NP-hard, we propose an optimization algorithm that solves a linear relaxation of the problem and can deliver an optimality certificate.

Our contributions can be summarized as follows:

  • We propose a novel graph-based approach for semantic parsing without reentrancy;

  • We prove the NP-hardness of MAP inference and latent anchoring inference;

  • We propose a novel integer linear programming formulation for this problem together with an approximate solver based on conditional gradient and constraint smoothing;

  • We tackle the training problem using variational approximations of objective functions, including the weakly-supervised scenario;

  • We evaluate our approach on GeoQuery, Scan, and Clevr and observe that it outperforms baselines on both i.i.d. splits and splits that test for compositional generalization.

Code to reproduce the experiments is available online.1

We propose to reduce semantic parsing to parsing the abstract syntax tree (AST) associated with a semantic program. We focus on semantic programs whose ASTs do not have any reentrancy, i.e., a single predicate or entity cannot be the argument of two different predicates. Moreover, we assume that each predicate or entity is anchored on exactly one word of the sentence and each word can be the anchor of at most one predicate or entity. As such, the semantic parsing problem can be reduced to assigning predicates and entities to words and identifying arguments via dependency relations, see Figure 2. In order to formalize our approach to the semantic parsing problem, we will use concepts from graph theory. We therefore first introduce the vocabulary and notions that will be useful in the rest of this article. Notably, the notions of cluster and generalized arborescence will be used to formalize our prediction problem.

Figure 2: 

(top) The semantic program corresponding to the sentence “What rivers do not run through Tennessee?” in the GeoQuery dataset. (middle) The associated AST. (bottom) Two examples illustrating the intuition of our model: We jointly assign predicates/entities and identify argument dependencies. As such, the resulting structure is strongly related to a syntactic dependency parse, except that the dependency structure does not cover all words.


Notations and Definitions.

Let G = 〈V, A〉 be a directed graph with vertices V and arcs A ⊆ V × V. An arc in A from a vertex u ∈ V to a vertex v ∈ V is denoted either a ∈ A or u → v ∈ A. For any subset of vertices U ⊆ V, we denote σG+(U) (resp., σG−(U)) the set of arcs leaving one vertex of U and entering one vertex of V ∖ U (resp., leaving one vertex of V ∖ U and entering one vertex of U) in the graph G. Let B ⊆ A be a subset of arcs. We denote V[B] the cover set of B, i.e., the set of vertices that appear as an extremity of at least one arc in B. A graph G = 〈V, A〉 is an arborescence2 rooted at u ∈ V if and only if (iff) it contains |V| − 1 arcs and there is a directed path from u to each vertex in V. In the rest of this work, we will assume that the root is always vertex 0 ∈ V. Let B ⊆ A be a set of arcs such that G′ = 〈V[B], B〉 is an arborescence. Then G′ is a spanning arborescence of G iff V[B] = V.
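To make the notation concrete, here is a minimal Python sketch of these operations for a graph stored as a set of (u, v) arc pairs; all function names are ours, purely illustrative:

def sigma_out(arcs, U):
    """sigma+(U): arcs leaving a vertex of U and entering V minus U."""
    return {(u, v) for (u, v) in arcs if u in U and v not in U}

def sigma_in(arcs, U):
    """sigma-(U): arcs leaving V minus U and entering a vertex of U."""
    return {(u, v) for (u, v) in arcs if u not in U and v in U}

def cover_set(B):
    """V[B]: vertices appearing as an extremity of at least one arc of B."""
    return {w for arc in B for w in arc}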

Let π = {V0, …, Vn} be a partition of V containing n + 1 clusters, and let B ⊆ A be such that G′ = 〈V[B], B〉 is an arborescence. G′ is a generalized not-necessarily-spanning arborescence (resp., generalized spanning arborescence) on the partition π of G iff V[B] contains at most one vertex per cluster in π (resp., contains exactly one).

Let W ⊆ V be a set of vertices. Contracting W consists in replacing in G the set W by a new vertex w ∉ V, replacing all the arcs u → v ∈ σG−(W) by an arc u → w and all the arcs u → v ∈ σG+(W) by an arc w → v. Given a graph with partition π, the contracted graph is the graph where each cluster in π has been contracted. While contracting a graph may introduce parallel arcs, this is not an issue in practice, even for weighted graphs.
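The contraction operation can be implemented directly from this definition; below is a minimal sketch (illustrative names) that also resolves parallel arcs by keeping only the heaviest one, which is safe for maximum-weight problems since no other parallel arc can appear in an optimal solution:

def contract(arcs, weights, clusters):
    # `clusters[u]` gives the cluster of vertex u.
    contracted = {}  # (cluster_u, cluster_v) -> (best original arc, weight)
    for (u, v) in arcs:
        cu, cv = clusters[u], clusters[v]
        if cu == cv:
            continue  # arcs internal to a cluster disappear
        w = weights[(u, v)]
        if (cu, cv) not in contracted or w > contracted[(cu, cv)][1]:
            contracted[(cu, cv)] = ((u, v), w)
    return contracted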

2.1 Semantic Grammar and AST

The semantic programs we focus on take the form of a functional language, i.e., a representation where each predicate is a function that takes other predicates or entities as arguments. The semantic language is typed, in the same sense as in “typed programming languages”. For example, in GeoQuery, the predicate capital_2 expects an argument of type city and returns an object of type state. In the datasets we use, the typing system disambiguates the position of arguments in a function: For a given function, either all arguments are of the same type or the order of arguments is unimportant—an example of both is the predicate intersection_river in GeoQuery, which takes two arguments of type river, but the result of the execution is unchanged if the arguments are swapped.3

Formally, we define the set of valid semantic programs as the set of programs that can be produced with a semantic grammar 𝒢 = 〈E, T, fType, fArgs〉 where:

  • E is the set of predicates and entities, which we will refer to as the set of tags—w.l.o.g. we assume that Root ∉ E, where Root is a special tag used for parsing;

  • T is the set of types;

  • fType : E → T is a typing function that assigns a type to each tag;

  • fArgs : E × T → ℕ is a valency function that assigns the numbers of expected arguments of a given type to each tag.

A tag e ∈ E is an entity iff ∀t ∈ T : fArgs(e, t) = 0. Otherwise, e is a predicate.

A semantic program in a functional language can be equivalently represented as an AST, a graph where instances of predicates and entities are represented as vertices and where arcs identify arguments of predicates. Formally, an AST is a labeled graph G = 〈V, A, l〉 where the function l : V → E assigns a tag to each vertex and arcs identify the arguments of tags, see Figure 2. An AST G is well-formed with respect to the grammar 𝒢 iff G is an arborescence and the valency and type constraints are satisfied, i.e., ∀u ∈ V, t ∈ T:

fArgs(l(u), t) = |{u → v ∈ A : fType(l(v)) = t}|,

where the right-hand side counts the outgoing arcs of u into vertices of type t.
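To make this definition concrete, here is a minimal Python sketch of the valency/type check (all names are illustrative; the arborescence check is omitted):

from collections import Counter

def valency_ok(vertices, arcs, label, f_type, f_args, types):
    # `arcs` is a list of (parent, child) pairs, `label[u]` the tag of u,
    # `f_type[tag]` its type, and `f_args[(tag, t)]` the expected number
    # of arguments of type t (0 if absent).
    for u in vertices:
        # Count the children of u per type.
        counts = Counter(f_type[label[v]] for (p, v) in arcs if p == u)
        # The counts must match the valency function exactly.
        for t in types:
            if counts.get(t, 0) != f_args.get((label[u], t), 0):
                return False
    return True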

2.2 Problem Reduction and Complexity

In our setting, semantic parsing is a joint sentence tagging and dependency parsing problem (Bohnet and Nivre, 2012; Li et al., 2011; Corro et al., 2017): each content word (i.e., each word that conveys a semantic meaning) must be tagged with a predicate or an entity, and dependencies between content words identify arguments of predicates, see Figure 2. However, our semantic parsing setting differs from standard syntactic analysis in two ways: (1) the resulting structure is not-necessarily-spanning: there are words (e.g., function words) that must not be tagged and that do not have any incident dependency—and these words are not known in advance, so they must be identified jointly with the rest of the structure; (2) the dependency structure is highly constrained by the typing mechanism, that is, the predicted structure must be a valid AST. Nevertheless, similarly to the aforementioned works, our parser is graph-based: for a given input, we build a (complete) directed graph and decoding is reduced to computing a constrained subgraph of maximum weight.

Given a sentence w = w1 … wn with n words and a grammar 𝒢, we construct a clustered labeled graph G = 〈V, A, π, l¯〉 as follows. The partition π = {V0, …, Vn} contains n + 1 clusters, where V0 is a root cluster and each cluster Vi, i ≠ 0, is associated with word wi. The root cluster V0 = {0} contains a single vertex that will be used as the root, and every other cluster contains |E| vertices. The extended labeling function l¯ : V → E ∪ {Root} assigns a tag in E to each vertex v ∈ V ∖ {0} and Root to vertex 0. Distinct vertices in a cluster Vi cannot have the same label, i.e., ∀u, v ∈ Vi : u ≠ v ⇒ l¯(u) ≠ l¯(v).

Let B ⊆ A be a subset of arcs. The graph G′ = 〈V[B], B〉 defines a 0-rooted generalized valency-constrained not-necessarily-spanning arborescence iff it is a generalized arborescence of G, there is exactly one arc leaving 0, and the sub-arborescence rooted at the destination of that arc is a valid AST with respect to the grammar 𝒢. As such, there is a one-to-one correspondence between ASTs anchored on the sentence w and generalized valency-constrained not-necessarily-spanning arborescences in the graph G, see Figure 3b.

Figure 3: 

(a) Example of a sentence and its associated AST (solid arcs) from the GeoQuery dataset. The dashed edges indicate predicate and entity anchors (note that this information is not available in the dataset). (b) The corresponding generalized valency-constrained not-necessarily-spanning arborescence (red arcs). The root is the isolated top left vertex. Adding ∅ tags and the dotted orange arcs produces a generalized spanning arborescence.


For any sentence w, our aim is to find the AST that most likely corresponds to it. Thus, after building the graph G as explained above, the neural network described in Appendix B is used to produce a vector of weights μ ∈ ℝ|V| associated with the set of vertices V and a vector of weights ϕ ∈ ℝ|A| associated with the set of arcs A. Given these weights, graph-based semantic parsing is reduced to an optimization problem called the maximum generalized valency-constrained not-necessarily-spanning arborescence (MGVCNNSA) in the graph G.

Theorem 1.The MGVCNNSA problem is NP-hard.

The proof is in  Appendix A.

2.3 Mathematical Program

Our graph-based approach to semantic parsing has allowed us to prove the intrinsic hardness of the problem. We follow previous work on graph-based parsing (Martins et al., 2009; Koo et al., 2010), and on other topics, by proposing an integer linear programming (ILP) formulation in order to compute (approximate) solutions.

Remember that in the joint tagging and dependency parsing interpretation of the semantic parsing problem, the resulting structure is not-necessarily-spanning, meaning that some words may not be tagged. In order to rely on well-known algorithms for computing spanning arborescences as a subroutine of our approximate solver, we first introduce the notion of extended graph. Given a graph G = 〈V, A, π, l¯〉, we construct an extended graph G¯ = 〈V¯, A¯, π¯, l¯〉4 containing n additional vertices {1¯, …, n¯} that are distributed along clusters, i.e., π¯ = {V0, V1 ∪ {1¯}, …, Vn ∪ {n¯}}, and arcs from the root to these extra vertices, i.e., A¯ = A ∪ {0 → i¯ ∣ 1 ≤ i ≤ n}. Let B ⊆ A be a subset of arcs such that 〈V[B], B〉 is a generalized not-necessarily-spanning arborescence on G. Let B¯ ⊆ A¯ be a subset of arcs defined as B¯ = B ∪ {0 → i¯ ∣ σ〈V[B],B〉−(Vi) = ∅}. Then, there is a one-to-one correspondence between generalized not-necessarily-spanning arborescences 〈V[B], B〉 on G and generalized spanning arborescences 〈V¯[B¯], B¯〉 on G¯, see Figure 3b.

Let x ∈ {0, 1}V¯ and y ∈ {0, 1}A¯ be variable vectors indexed by vertices and arcs such that a vertex v ∈ V¯ (resp., an arc a ∈ A¯) is selected iff xv = 1 (resp., ya = 1). The set of 0-rooted generalized valency-constrained spanning arborescences on G¯ can be written as the set of variables 〈x, y〉 satisfying the following linear constraints. First, we restrict y to structures that are spanning arborescences over G¯ where clusters have been contracted:
ya ∈ {0, 1}  ∀a ∈ A¯  (1)

∑a ∈ σG¯−(∪i∈I V¯i) ya ≥ 1  ∀I ⊆ {1, …, n}, I ≠ ∅  (2)

∑a ∈ σG¯−(V¯i) ya = 1  ∀i ∈ {1, …, n}  (3)

where V¯i = Vi ∪ {i¯} denotes the extended cluster associated with word wi.
Constraints (2) ensure that the contracted graph is weakly connected. Constraints (3) force each cluster to have exactly one incoming arc. The set of vectors y that satisfy these three constraints is exactly the set of 0-rooted spanning arborescences on the contracted graph, see Schrijver (2003, Section 52.4) for an in-depth analysis of this polytope. The root vertex is always selected, and other vertices are selected iff they have one incoming selected arc:
x0 = 1  (4)

xv = ∑a ∈ σG¯−({v}) ya  ∀v ∈ V¯ ∖ {0}  (5)
Note that constraints (1)–(3) do not force selected arcs to leave from a selected vertex, as they operate at the cluster level. This property will be enforced via the following valency constraints:
∑u ∈ V ∖ {0} y0→u = 1  (6)

∑u→v ∈ A : fType(l¯(v)) = t yu→v = fArgs(l¯(u), t) · xu  ∀u ∈ V ∖ {0}, ∀t ∈ T  (7)
Constraint (6) forces the root to have exactly one outgoing arc into a vertex u ∈ V ∖ {0} (i.e., a vertex that is not part of the extra vertices introduced in the extended graph) that will be the root of the AST. Constraints (7) force the selected vertices and arcs to produce a well-formed AST with respect to the grammar 𝒢. Note that these constraints are only defined for vertices in V ∖ {0}, i.e., they are defined neither for the root vertex nor for the extra vertices introduced in the extended graph.
To simplify notation, we introduce the following sets:

𝒞(sa) = {〈x, y〉 ∣ constraints (1)–(5) are satisfied},
𝒞(val) = {〈x, y〉 ∣ constraints (6)–(7) are satisfied},

and 𝒞 = 𝒞(sa) ∩ 𝒞(val). Given vertex weights μ ∈ ℝV¯ and arc weights ϕ ∈ ℝA¯, computing the MGVCNNSA is equivalent to solving the following ILP:

max〈x, y〉 μ⊤x + ϕ⊤y s.t. 〈x, y〉 ∈ 𝒞.  (Ilp1)
Without the constraint 〈x, y〉 ∈ 𝒞(val), the problem would be easy to solve. The set 𝒞(sa) is the set of spanning arborescences over the contracted graph, hence to maximize over this set we can simply: (1) contract the graph and assign to each arc in the contracted graph the weight of its corresponding arc plus the weight of its destination vertex in the original graph; (2) run the maximum spanning arborescence algorithm (MSA; Edmonds, 1967; Tarjan, 1977) on the contracted graph, which has an 𝒪(n²) time-complexity. This process is illustrated in Figure 5 (top). Note that the contracted graph may have parallel arcs, which is not an issue in practice as only the one of maximum weight can appear in a solution of the MSA.
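As an illustration, here is a sketch of this easy subproblem using networkx (our choice for illustration; the paper's own implementation is not specified here). It assumes the contracted-graph dictionary produced by the contraction sketch above, with arc weights that already include the weight of their destination vertex; since no arc enters the root cluster 0, the returned arborescence is necessarily rooted at 0:

import networkx as nx

def msa_on_contracted(n_clusters, contracted):
    # `contracted` maps cluster pairs (ci, cj) to (original arc, weight).
    g = nx.DiGraph()
    g.add_nodes_from(range(n_clusters))
    for (ci, cj), (arc, w) in contracted.items():
        g.add_edge(ci, cj, weight=w, original_arc=arc)
    # Edmonds' algorithm on the contracted graph.
    msa = nx.maximum_spanning_arborescence(g, attr="weight", preserve_attrs=True)
    return [d["original_arc"] for _, _, d in msa.edges(data=True)]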

We have established that MAP inference in our semantic parsing framework is an NP-hard problem. We proposed an ILP formulation of the problem that would be easy to solve if some constraints were removed. This property suggests the use of an approximation algorithm that introduces the difficult constraints as penalties. As a similar setting arises from our weakly supervised loss function, we defer the presentation of the approximation algorithm to Section 4.

3.1 Supervised Training Objective

We define the likelihood of a pair 〈x, y〉 ∈ 𝒞 via the Boltzmann distribution:

p(x, y) = exp(μ⊤x + ϕ⊤y − c(μ, ϕ)),

where c(μ, ϕ) is the log-partition function:

c(μ, ϕ) = log ∑〈x′, y′〉 ∈ 𝒞 exp(μ⊤x′ + ϕ⊤y′).

During training, we aim to maximize the log-likelihood of the training dataset. The log-likelihood of an observation 〈x, y〉 is defined as:

ℓ(μ, ϕ; x, y) = μ⊤x + ϕ⊤y − c(μ, ϕ).
Unfortunately, computing the log-partition function is intractable as it requires summing over all feasible solutions. Instead, we rely on a surrogate lower-bound as an objective function. To this end, we derive an upper bound (because it is negated in ℓ) to the second term: a sum of log-sum-exp functions that sums over each cluster of vertices independently and over incoming arcs in each cluster independently, which is tractable. This loss can be understood as a generalization of the head selection loss used in dependency parsing (Zhang et al., 2017). We now detail the derivation and prove that it is an upper bound to the log-partition function.
Let U be a matrix such that each row contains a pair 〈x, y〉 ∈ 𝒞 and let Δ∣𝒞∣ be the simplex of dimension ∣𝒞∣ − 1, i.e., the set of all stochastic vectors of dimension ∣𝒞∣. The log-partition function can then be rewritten using its variational formulation:

c(μ, ϕ) = max p ∈ Δ∣𝒞∣ p⊤U [μ; ϕ] + H[p],

where H[p] = −∑i pi log pi is the Shannon entropy and [μ; ϕ] denotes the concatenation of μ and ϕ. We refer the reader to Boyd and Vandenberghe (2004, Example 3.25), Wainwright and Jordan (2008, Section 3.6), and Beck (2017, Section 4.4.10). Note that this formulation remains impractical as p has an exponential size. Let 𝓜 = conv(𝒞) be the marginal polytope, i.e., the convex hull of the feasible integer solutions. We can rewrite the above variational formulation as:

c(μ, ϕ) = max 〈x¯, y¯〉 ∈ 𝓜 μ⊤x¯ + ϕ⊤y¯ + H𝓜(〈x¯, y¯〉),

where H𝓜 is a joint entropy function defined such that the equality holds. The maximization in this reformulation acts on the marginal probabilities of parts (vertices and arcs) and therefore has a polynomial number of variables. We refer the reader to Wainwright and Jordan (2008, Section 5.2.1) and Blondel et al. (2020, Section 7) for more details. Unfortunately, this optimization problem is hard to solve as 𝓜 cannot be characterized in an explicit manner and H𝓜 is defined indirectly and lacks a polynomial closed form (Wainwright and Jordan, 2008, Section 3.7). However, we can derive an upper bound to the log-partition function by decomposing the entropy term H𝓜 (Cover, 1999, Property 4 on page 41: the joint entropy is upper-bounded by the sum of the entropies of the parts) and by using an outer approximation to the marginal polytope 𝓛 ⊇ 𝓜 (i.e., increasing the search space).
In particular, we observe that each pair 〈x, y〉 ∈ 𝒞 has exactly one vertex selected per cluster V¯i ∈ π¯ and one incoming arc selected per cluster V¯i ∈ π¯ ∖ {V0}. We denote 𝒞(one) the set of all pairs 〈x, y〉 that satisfy these constraints. By using 𝓛 = conv(𝒞(one)) as an outer approximation to the marginal polytope (see Figure 4), the optimization problem can be rewritten as a sum of independent problems. As each of these problems is the variational formulation of a log-sum-exp term, the upper bound on c(μ, ϕ) can be expressed as a sum of log-sum-exp functions, one over the vertices of each cluster V¯i ∈ π¯ ∖ {V0} and one over the incoming arcs σG¯−(V¯i) of each cluster V¯i ∈ π¯ ∖ {V0}. Although this type of approximation may not result in a Bayes consistent loss (Corro, 2023), it works well in practice.
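The resulting surrogate loss is thus a sum of per-cluster cross-entropy-like terms; a PyTorch sketch with illustrative shapes (not the exact implementation of Appendix B) could look as follows:

import torch

def surrogate_loss(mu, phi, gold_vertex, gold_arc):
    # `mu[i]` holds the scores of the candidate vertices of cluster i
    # (including the extra "empty" vertex of the extended graph), `phi[i]`
    # the scores of the arcs entering cluster i; `gold_vertex[i]` and
    # `gold_arc[i]` index the annotated vertex/arc of the cluster.
    loss = 0.0
    for i in range(len(mu)):
        # One log-sum-exp over the vertices of the cluster ...
        loss = loss + torch.logsumexp(mu[i], dim=0) - mu[i][gold_vertex[i]]
        # ... and one over the incoming arcs of the cluster.
        loss = loss + torch.logsumexp(phi[i], dim=0) - phi[i][gold_arc[i]]
    return loss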
Figure 4: 

Polyhedron illustration. The solid lines represent the convex hull of feasible solutions of Ilp1, denoted 𝓜, whose vertices are feasible integer solutions (black vertices). The dashed lines represent the convex hull of feasible solutions of the linear relaxation of Ilp1, which has non-integral vertices (in white). Finally, the dotted lines represent the polyhedron 𝓛 that is used to approximate c(μ, ϕ). All its vertices are integral, but some of them are not feasible solutions of Ilp1.


3.2 Weakly Supervised Training Objective

Unfortunately, training data often does not include gold pairs 〈x, y〉 but instead only the AST, without word anchors (or word alignment). This is the case for the three datasets we use in our experiments. We thus consider our training signal to be the set of all structures that induce the annotated AST, which we denote 𝒞*.

The weakly supervised loss is defined as:

ℓ(μ, ϕ; 𝒞*) = log ∑〈x, y〉 ∈ 𝒞* p(x, y),

i.e., we marginalize over all the structures that induce the gold AST. We can rewrite this loss as:

ℓ(μ, ϕ; 𝒞*) = log ∑〈x, y〉 ∈ 𝒞* exp(μ⊤x + ϕ⊤y) − c(μ, ϕ).
The two terms are intractable. We approximate the second term using the bound defined in Section 3.1.
We now derive a tractable lower bound to the first term. Let q be a proposal distribution such that q(x, y) = 0 if 〈x, y〉 ∉ 𝒞*. We derive the following lower bound via Jensen’s inequality:

log ∑〈x, y〉 ∈ 𝒞* exp(μ⊤x + ϕ⊤y) ≥ 𝔼q[μ⊤x + ϕ⊤y] + H[q].
This bound holds for any distribution q satisfying the aforementioned condition. We choose to maximize this lower bound using a distribution that gives a probability of one to a single structure, as in “hard” EM (Neal and Hinton, 1998, Section 6).

For a given sentence w, let G = 〈V, A, π, l¯〉 be a graph defined as in Section 2.2 and G′ = 〈V′, A′, l′〉 be an AST defined as in Section 2.1. We aim to find the GVCNNSA in G of maximum weight whose induced AST is exactly G′. This is equivalent to aligning each vertex in V′ with one vertex of V ∖ {0} such that at most one vertex per cluster of π appears in the alignment, where the weight of an alignment is defined as:

  1. for each vertex u′ ∈ V′, we add the weight of the vertex u ∈ V it is aligned to—moreover, if u′ is the root of the AST, we also add the weight of the arc 0 → u;

  2. for each arc u′ → v′ ∈ A′, we add the weight of the arc u → v, where u ∈ V (resp., v ∈ V) is the vertex that u′ (resp., v′) is aligned with.

Note that this latent anchoring inference consists in computing a (partial) alignment between vertices of G and G′, but the fact that we need to take into account arc weights forbids the use of the Kuhn–Munkres algorithm (Kuhn, 1955).

Theorem 2.Computing the anchoring of maximum weight of an AST with a graph G is NP-hard.

The proof is in  Appendix A.

Therefore, we propose an optimization-based approach to compute the distribution q. Note that the problem has a constraint requiring each cluster Vi ∈ π to be aligned with at most one vertex v′ ∈ V′, i.e., each word in the sentence can be aligned with at most one vertex in the AST. If we remove this constraint, then the problem becomes tractable via dynamic programming. Indeed, we can recursively fill a table Chart[u′, u], u′ ∈ V′ and u ∈ V, containing the score of aligning vertex u′ to vertex u plus the score of the best alignment of all the descendants of u′. To this end, we simply visit the vertices V′ of the AST in reverse topological order, see Algorithm 1 and the sketch below. The best alignment can be retrieved via back-pointers.

[Algorithm 1: Dynamic program computing Chart for latent anchoring without the cluster constraint.]
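A Python sketch of this dynamic program (illustrative names; back-pointers and the weight of the 0 → u arc for the AST root are omitted for brevity):

def relaxed_alignment(ast_order, ast_children, candidates, vertex_w, arc_w):
    # `ast_order` lists AST vertices in reverse topological order (children
    # before parents), `candidates[u_ast]` the graph vertices that u_ast may
    # be anchored on, and `vertex_w` / `arc_w` the vertex and arc scores.
    chart = {}
    for u_ast in ast_order:
        for u in candidates[u_ast]:
            score = vertex_w[u_ast, u]
            for c_ast in ast_children[u_ast]:
                # Best anchor of each child, given that u_ast is anchored on u.
                score += max(arc_w[u, v] + chart[c_ast, v]
                             for v in candidates[c_ast])
            chart[u_ast, u] = score
    return chart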

Computing q is therefore equivalent to solving the following ILP:

max〈x, y〉 μ⊤x + ϕ⊤y s.t. 〈x, y〉 ∈ 𝒞*(relaxed) and ∑v ∈ Vi xv ≤ 1 ∀Vi ∈ π,  (Ilp2)

where the set 𝒞*(relaxed) is the set of feasible solutions of the dynamic program in Algorithm 1, whose convex hull can be described via linear constraints (Martin et al., 1990).
In this section, we propose an efficient way to solve the linear relaxations of MAP inference (Ilp1) and latent anchoring inference (Ilp2) via constraint smoothing and the conditional gradient method. We focus on problems of the following form:

max z f(z) s.t. z ∈ conv(𝒞(easy)), Az = b (or Az ≤ b),

where the vector z is the concatenation of the vectors x and y defined previously and conv denotes the convex hull of a set. We explained previously that if the set of constraints of the form Az = b for (Ilp1) or Az ≤ b for (Ilp2) were absent, the problem would be easy to solve under a linear objective function. In fact, there exists an efficient linear maximization oracle (LMO), i.e., a function that returns the optimal integral solution, for the set conv(𝒞(easy)). This setting covers both (Ilp1) and (Ilp2), where we have 𝒞(easy) = 𝒞(sa) and 𝒞(easy) = 𝒞*(relaxed), respectively.
An appealing approach in this setting is to introduce the problematic constraints as penalties in the objective:

max z ∈ conv(𝒞(easy)) f(z) − δS(Az),

where δS is the indicator function of the set S:

δS(u) = 0 if u ∈ S, and δS(u) = +∞ otherwise.

In the equality case, we use S = {b} and in the inequality case, we use S = {u ∣ u ≤ b}.

4.1 Conditional Gradient Method

Given a proper, smooth, and differentiable function g and a nonempty, bounded, closed, and convex set conv(𝒞(easy)), the conditional gradient method (a.k.a. Frank-Wolfe; Frank and Wolfe, 1956; Levitin and Polyak, 1966; Lacoste-Julien and Jaggi, 2015) can be used to solve optimization problems of the following form:

max z ∈ conv(𝒞(easy)) g(z).

Contrary to the projected gradient method, this approach does not require computing projections onto the feasible set conv(𝒞(easy)), which is, in most cases, computationally expensive. Instead, the conditional gradient method only relies on an LMO:

lmo(ψ) ∈ argmax z ∈ conv(𝒞(easy)) ψ⊤z.
The algorithm constructs a solution to the original problem as a convex combination of elements returned by the LMO. The pseudo-code is given in Algorithm 2. An interesting property of this method is that its step size range is bounded, which allows the use of simple linesearch techniques.

[Algorithm 2: Conditional gradient method.]
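For concreteness, here is a Python sketch of the loop for the equality case, using the smoothed quadratic penalty derived in Section 4.2 below, the default step size γ = 2/(k + 2), and the β schedule of Section 4.3; lmo is a callback, e.g., the MSA routine for MAP inference (all names are illustrative):

import numpy as np

def conditional_gradient(theta, A, b, lmo, n_iter=100, beta0=1.0):
    # Initial point: any vertex of conv(C(easy)).
    z = np.asarray(lmo(theta), dtype=float)
    for k in range(n_iter):
        beta = beta0 / np.sqrt(k + 1)  # smoothness schedule (Section 4.3)
        # Gradient of theta @ z - ||A z - b||^2 / (2 beta).
        grad = theta - A.T @ (A @ z - b) / beta
        s = np.asarray(lmo(grad), dtype=float)  # linear maximization oracle
        gamma = 2.0 / (k + 2.0)  # default step size; linesearch also possible
        z += gamma * (s - z)  # convex combination stays in conv(C(easy))
    return z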

4.2 Smoothing

Unfortunately, the function g(z) = f(z) − δS(Az) is non-smooth due to the indicator function term, preventing the use of the conditional gradient method. We propose to rely on the framework proposed by Yurtsever et al. (2018), where the indicator function is replaced by a smooth approximation. The indicator function of the set S can be rewritten as:

δS(u) = δS**(u) = sup v u⊤v − σS(v),

where δS** denotes the Fenchel biconjugate of the indicator function and σS(u) = sup t ∈ S u⊤t is the support function of S. More details can be found in Beck (2017, Sections 4.1 and 4.2). In order to smooth the indicator function, we add a β-parameterized convex regularizer −(β/2)∥·∥₂² to its Fenchel biconjugate:

δS,β(u) = sup v u⊤v − σS(v) − (β/2)∥v∥₂²,

where β > 0 controls the quality and the smoothness of the approximation (Nesterov, 2005).

Equalities.

In the case where S = {b}, with a few computations that are detailed by Yurtsever et al. (2018), we obtain:

δS,β(Az) = (1/(2β)) ∥Az − b∥₂².

That is, we obtain a quadratic penalty term in the objective. Note that this term is similar to the term introduced in an augmented Lagrangian (Nocedal and Wright, 1999, Equation 17.36), and adds a penalty in the objective for vectors z s.t. Az ≠ b.

Inequalities.

In the case where S = {u ∣ u ≤ b}, similar computations lead to:

δS,β(Az) = (1/(2β)) ∥[Az − b]+∥₂²,

where [·]+ denotes the Euclidean projection onto the non-negative orthant (i.e., clipping negative values). Similarly to the equality case, this term introduces a penalty in the objective for vectors z s.t. Az > b. This penalty function is also called the Courant-Beltrami penalty function.
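Both smoothed penalties and their gradients with respect to z are straightforward to compute; here is a numpy sketch (illustrative, assuming a dense constraint matrix A):

import numpy as np

def eq_penalty(z, A, b, beta):
    """(1 / (2 beta)) * ||A z - b||^2 and its gradient w.r.t. z."""
    r = A @ z - b
    return (r @ r) / (2 * beta), A.T @ r / beta

def ineq_penalty(z, A, b, beta):
    """(1 / (2 beta)) * ||[A z - b]_+||^2 and its gradient w.r.t. z."""
    r = np.maximum(A @ z - b, 0.0)  # projection onto the non-negative orthant
    return (r @ r) / (2 * beta), A.T @ r / beta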

Figure 5 (bottom) illustrates how the gradient of the penalty term can “force” the LMO to return solutions that satisfy the smoothed constraints.

Figure 5: 

Illustration of the approximate inference algorithm on the two-word sentence “List states”, where we assume the grammar has one entity state_all and one predicate loc_1 that takes exactly one entity as argument. The left graph is the extended graph for the sentence, including vertex and arc weights (in black). If we ignore constraints (6)–(7), inference is reduced to computing the MSA on the contracted graph (solid arcs in the middle column). This may lead to solutions that do not satisfy constraints (6)–(7) on the expanded graph (top example). However, the gradient of the smoothed constraint (7) will induce penalties (in red) on vertex and arc scores that encourage the loc_1 predicate to either be dropped from the solution or to have an outgoing arc to a state_all argument. Computing the MSA on the contracted graph with penalties results in a solution that satisfies constraints (6)–(7) (bottom example).


4.3 Practical Details

Smoothness.

In practice, we need to choose the smoothness parameter β. We follow Yurtsever et al. (2018) and use β(k) = β(0)/√(k + 1), where k is the iteration number and β(0) = 1.

Step Size.

Another important choice in the algorithm is the step size γ. We show that when the smoothed constraints are equalities, computing the optimal step size has a simple closed-form solution if the function f is linear, which is the case for (Ilp1), i.e., MAP decoding. The step size problem at iteration k is defined as:

γ(k) = argmax γ ∈ [0,1] g(z(k) + γ(s(k) − z(k))),

where s(k) is the atom returned by the LMO. By assumption, f is linear and can be written as f(z) = θ⊤z. Let d = s(k) − z(k). Ignoring the box constraints on γ, by first-order optimality conditions, we have:

γ = (β θ⊤d − (Az(k) − b)⊤Ad) / ∥Ad∥₂².

We can then simply clip the result so that it satisfies the box constraints. Unfortunately, in the inequality case, there is no simple closed-form solution. We approximate the step size using 10 iterations of the bisection algorithm for root finding.
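For illustration, a sketch of this step-size computation under the stated assumptions (linear f, equality constraints; names are illustrative):

import numpy as np

def step_size_eq(theta, A, b, beta, z, d):
    # Maximizes theta @ (z + gamma d) - ||A (z + gamma d) - b||^2 / (2 beta)
    # over gamma, then clips to the box [0, 1].
    Ad = A @ d
    denom = Ad @ Ad
    if denom == 0.0:
        return 1.0  # the penalty is unaffected along direction d
    gamma = (beta * (theta @ d) - (A @ z - b) @ Ad) / denom
    return float(np.clip(gamma, 0.0, 1.0))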

Non-integral Solutions.

As we solve the linear relaxation of the original ILPs, the optimal solutions may not be integral (Figure 4). Therefore, we use simple heuristics to construct a feasible solution to the original ILP in these cases. For MAP inference, we simply solve the ILP5 using CPLEX, but introduce only the variables that have a non-null value in the linear relaxation, leading to a very sparse problem that is fast to solve. For latent anchoring, we simply use the Kuhn–Munkres algorithm with the non-integral solution as assignment costs, as sketched below.
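The latent anchoring rounding step is a standard assignment problem; here is a scipy sketch (assuming frac[i, j] holds the relaxed score of anchoring AST vertex i on word j; names are illustrative):

import numpy as np
from scipy.optimize import linear_sum_assignment

def round_alignment(frac):
    # Kuhn-Munkres on the negated scores, i.e., maximize the total score.
    rows, cols = linear_sum_assignment(-np.asarray(frac))
    return list(zip(rows, cols))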

We compare our method to baseline systems on both i.i.d. splits (Iid) and splits that test for compositional generalization, on three datasets. The neural network is described in Appendix B.

Datasets.

Scan (Lake and Baroni, 2018) contains natural language navigation commands. We use the variant of Herzig and Berant (2021) for semantic parsing. The Iid split is the simple split (Lake and Baroni, 2018). The compositional splits are primitive right (Right) and primitive around right (ARight) (Loula et al., 2018).

GeoQuery (Zelle and Mooney, 1996) uses the FunQL formalism (Kate et al., 2005) and contains questions about the US geography. The Iid split is the standard split and compositional generalization is evaluated on two splits: Length where the examples are split by program length and Template (Finegan-Dollak et al., 2018a) where they are split such that all semantic programs having the same AST are in the same split.

Clevr (Johnson et al., 2017) contains synthetic questions over object relations in images. Closure (Bahdanau et al., 2019) introduces additional question templates that require compositional generalization. We use the original split as our Iid split, and the Closure templates as a compositional split on which the model is evaluated.

Baselines.

We compare our approach against the architecture proposed by Herzig and Berant (2021) (SpanBasedSP) as well as the seq2seq baselines they used. In Seq2Seq (Jia and Liang, 2016), the encoder is a bi-LSTM over pre-trained GloVe embeddings (Pennington et al., 2014) or ELMo (Peters et al., 2018) and the decoder is an attention-based LSTM (Bahdanau et al., 2015). BERT2Seq replaces the encoder with BERT-base. GRAMMAR is similar to Seq2Seq but the decoding is constrained by a grammar. BART (Lewis et al., 2020) is pre-trained as a denoising autoencoder.

Results.

We report the denotation accuracies in Table 1. Our approach outperforms all other methods. In particular, the seq2seq baselines suffer from a significant drop in accuracy on splits that require compositional generalization. While SpanBasedSP is able to generalize, our approach outperforms it. Note that we observed that the GeoQuery execution script used to compute denotation accuracy in previous work contains several bugs that overestimate the true accuracy. Therefore, we also report denotation accuracy with a corrected executor (see  Appendix C) for fair comparison with future work.

Table 1: 

Denotation and exact match accuracy on the test sets. All baseline results are taken from Herzig and Berant (2021). For our approach, we also report exact match accuracy, i.e., the percentage of sentences for which the prediction is identical to the gold program. The last line reports the exact match accuracy without the use of CPLEX to round non-integral solutions (Section 4.3).

                      | Scan                 | GeoQuery                 | Clevr
                      | Iid   Right   ARight | Iid   Template   Length  | Iid    Closure
Baselines (denotation accuracy only)
Seq2Seq               | 99.9  11.6    –      | 78.5  46.0       24.3    | 100    59.5
+ ELMo                | 100   54.9    41.6   | 79.3  50.0       25.7    | 100    64.2
BERT2Seq              | 100   77.7    95.3   | 81.1  49.6       26.1    | 100    56.4
GRAMMAR               | 100   0.0     4.2    | 72.1  54.0       24.6    | 100    51.3
BART                  | 100   50.5    100    | 87.1  67.0       19.3    | 100    51.5
SpanBasedSP           | 100   100     100    | 86.1  82.2       63.6    | 96.7   98.8
Our approach
Denotation accuracy   | 100   100     100    | 92.9  89.9       74.9    | 100    99.6
↳ Corrected executor  | –     –       –      | 91.8  88.7       74.5    | –      –
Exact match           | 100   100     100    | 90.7  86.2       69.3    | 100    99.6
↳ w/o CPLEX heuristic | 100   100     100    | 90.0  83.0       67.5    | 100    98.0

We also report exact match accuracy, with and without the heuristic that constructs integral solutions from fractional ones. The exact match accuracy is always lower than or equal to the denotation accuracy. This shows that our approach can sometimes provide the correct denotation even though the prediction differs from the gold semantic program. Importantly, while our approach outperforms the baselines, its accuracy is still significantly worse on the split that requires generalizing to longer programs.

Graph-based Methods.

Graph-based methods have been popularized by syntactic dependency parsing (McDonald et al., 2005), where MAP inference is realized via the maximum spanning arborescence algorithm (Chu and Liu, 1965; Edmonds, 1967). A benefit of this algorithm is that it has an 𝒪(n²) time-complexity (Tarjan, 1977), i.e., it is more efficient than algorithms exploring more restricted search spaces (Eisner, 1997; Gómez-Rodríguez et al., 2011; Pitler et al., 2012, 2013).

In the case of semantic structures, Kuhlmann and Jonsson (2015) proposed an 𝒪(n³) algorithm for computing maximum not-necessarily-spanning acyclic graphs under a noncrossing arc constraint. Without the noncrossing constraint, the problem is known to be NP-hard (Grötschel et al., 1985). To bypass this computational complexity, Dozat and Manning (2018) proposed to handle each dependency as an independent binary classification problem, that is, they do not enforce any constraint on the output structure. Note that, contrary to our work, these approaches allow for reentrancy but do not enforce well-formedness of the output with respect to the semantic grammar. Lyu and Titov (2018) use a similar approach for AMR parsing, where tags are predicted first, followed by arc predictions, and finally heuristics are used to ensure that the output graph is valid. On the contrary, we do not use a pipeline: we focus on joint decoding, where the validity of the output is directly encoded in the search space.

Previous work in the literature has also considered reduction to graph-based methods for other problems, e.g., for discontinuous constituency parsing (Fernández-González and Martins, 2015; Corro et al., 2017), lexical segmentation (Constant and Le Roux, 2015), and machine translation (Zaslavskiy et al., 2009), inter alia.

Compositional Generalization.

Several authors observed that compositional generalization insufficiency is an important source of error for semantic parsers, especially ones based on seq2seq architectures (Lake and Baroni, 2018; Finegan-Dollak et al., 2018b; Herzig and Berant, 2019; Keysers et al., 2020). Wang et al. (2021) proposed a latent re-ordering step to improve compositional generalization, whereas Zheng and Lapata (2021) relied on latent predicate tagging in the encoder. There has also been an interest in using data augmentation methods to improve generalization (Jia and Liang, 2016; Andreas, 2020; Akyürek et al., 2021; Qiu et al., 2022; Yang et al., 2022).

Recently, Herzig and Berant (2021) showed that span-based parsers do not exhibit such problematic behavior. Unfortunately, these parsers fail to cover the set of semantic structures observed in English treebanks, and we hypothesize that this would be even worse for free word order languages. Our graph-based approach does not exhibit this downside. Previous work by Jambor and Bahdanau (2022) also considered graph-based methods for compositional generalization, but their approach predicts each part independently without any well-formedness or acyclicity constraint.

In this work, we focused on graph-based semantic parsing for formalisms that do not allow reentrancy. We conducted a complexity study of two inference problems that appear in this setting. We proposed ILP formulations of these problems together with a solver for their linear relaxation based on the conditional gradient method. Experimentally, our approach outperforms comparable baselines.

One downside of our semantic parser is speed (we parse approximately 5 sentences per second for GeoQuery). However, we hope this work will give a better understanding of the semantic parsing problem, together with a baseline for faster methods.

Future research will investigate extensions for (1) ASTs that contain reentrancies and (2) prediction algorithms for the case where a single word can be the anchor of more than one predicate or entity. These two properties are crucial for semantic representations like Abstract Meaning Representation (Banarescu et al., 2013). Moreover, even if our graph-based semantic parser provides better results than previous work on length generalization, this setting is still difficult. A more general research direction on neural architectures that generalize better to longer sentences is important.

We thank François Yvon and the anonymous reviewers for their comments and suggestions. We thank Jonathan Herzig and Jonathan Berant for fruitful discussions. This work benefited from computations done on the Saclay-IA platform and on the HPC resources of IDRIS under the allocation 2022-AD011013727 made by GENCI.

2. In the NLP community, arborescences are often called (directed) trees. We stick with the term arborescence as it is more standard in the graph theory literature, see for example Schrijver (2003). Using the term tree introduces a confusion between two unrelated algorithms: Kruskal’s maximum spanning tree algorithm (Kruskal, 1956), which operates on undirected graphs, and Edmonds’ maximum spanning arborescence algorithm (Edmonds, 1967), which operates on directed graphs. Moreover, this prevents any confusion between the graph object called arborescence and the semantic structure called AST.

3. There are a few corner cases like exclude_river, for which we simply assume arguments are in the same order as they appear in the input sentence.

4. The labeling function is unchanged as there is no need for types for vertices in V¯ ∖ V.

5. We use the multi-commodity flow formulation of Martins et al. (2009) instead of the cycle breaking constraints (2).

References

Ekin Akyürek, Afra Feyza Akyürek, and Jacob Andreas. 2021. Learning to recombine and resample data for compositional generalization. In International Conference on Learning Representations.
Jacob Andreas. 2020. Good-enough compositional data augmentation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7556–7566, Online. Association for Computational Linguistics.
Jacob Andreas, Andreas Vlachos, and Stephen Clark. 2013. Semantic parsing as machine translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 47–52, Sofia, Bulgaria. Association for Computational Linguistics.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings.
Dzmitry Bahdanau, Harm de Vries, Timothy J. O’Donnell, Shikhar Murty, Philippe Beaudoin, Yoshua Bengio, and Aaron C. Courville. 2019. CLOSURE: Assessing systematic generalization of CLEVR models. CoRR, abs/1912.05783.
Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract Meaning Representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186, Sofia, Bulgaria. Association for Computational Linguistics.
Amir Beck. 2017. First-Order Methods in Optimization. SIAM.
Mathieu Blondel, André F. T. Martins, and Vlad Niculae. 2020. Learning with Fenchel-Young losses. Journal of Machine Learning Research, 21(35):1–69.
Bernd Bohnet and Joakim Nivre. 2012. A transition-based system for joint part-of-speech tagging and labeled non-projective dependency parsing. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1455–1465, Jeju Island, Korea. Association for Computational Linguistics.
Stephen Boyd and Lieven Vandenberghe. 2004. Convex Optimization. Cambridge University Press.
Yoeng-Jin Chu and Tseng-Hong Liu. 1965. On the shortest arborescence of a directed graph. Scientia Sinica, 14:1396–1400.
John Cocke. 1970. Programming Languages and Their Compilers: Preliminary Notes. New York University.
Matthieu Constant and Joseph Le Roux. 2015. Dependency representations for lexical segmentation. In 6th Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL 2015), Bilbao, Spain.
Caio Corro. 2020. Span-based discontinuous constituency parsing: A family of exact chart-based algorithms with time complexities from O(n^6) down to O(n^3). In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2753–2764, Online. Association for Computational Linguistics.
Caio Corro. 2023. On the inconsistency of separable losses for structured prediction. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Association for Computational Linguistics.
Caio Corro, Joseph Le Roux, and Mathieu Lacroix. 2017. Efficient discontinuous phrase-structure parsing via the generalized maximum spanning arborescence. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1644–1654, Copenhagen, Denmark. Association for Computational Linguistics.
Thomas M. Cover. 1999. Elements of Information Theory. John Wiley & Sons.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 33–43, Berlin, Germany. Association for Computational Linguistics.
Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In International Conference on Learning Representations.
Timothy Dozat and Christopher D. Manning. 2018. Simpler but more accurate semantic dependency parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 484–490, Melbourne, Australia. Association for Computational Linguistics.
Christophe Duhamel, Luis Gouveia, Pedro Moura, and Mauricio Souza. 2008. Models and heuristics for a minimum arborescence problem. Networks, 51(1):34–47.
Jack Edmonds. 1967. Optimum branchings. Journal of Research of the National Bureau of Standards – B. Mathematics and Mathematical Physics, 71:233.
Jason Eisner. 1997. Bilexical grammars and a cubic-time probabilistic parser. In Proceedings of the Fifth International Workshop on Parsing Technologies, pages 54–65, Boston/Cambridge, Massachusetts, USA. Association for Computational Linguistics.
Daniel Fernández-González and André F. T. Martins. 2015. Parsing as reduction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1523–1533, Beijing, China. Association for Computational Linguistics.
Catherine Finegan-Dollak, Jonathan K. Kummerfeld, Li Zhang, Karthik Ramanathan, Sesh Sadasivam, Rui Zhang, and Dragomir Radev. 2018a. Improving text-to-SQL evaluation methodology. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 351–360, Melbourne, Australia. Association for Computational Linguistics.
Catherine Finegan-Dollak, Jonathan K. Kummerfeld, Li Zhang, Karthik Ramanathan, Sesh Sadasivam, Rui Zhang, and Dragomir Radev. 2018b. Improving text-to-SQL evaluation methodology. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 351–360, Melbourne, Australia. Association for Computational Linguistics.
Marguerite Frank and Philip Wolfe. 1956. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1–2):95–110.
Michael R. Garey and David S. Johnson. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness (Series of Books in the Mathematical Sciences). W. H. Freeman.
Carlos Gómez-Rodríguez, John Carroll, and David Weir. 2011. Dependency parsing schemata and mildly non-projective dependency parsing. Computational Linguistics, 37(3):541–586.
Martin Grötschel, Michael Jünger, and Gerhard Reinelt. 1985. Acyclic Subdigraphs and Linear Orderings: Polytopes, Facets, and a Cutting Plane Algorithm, pages 217–264. Springer Netherlands, Dordrecht.
David Hall, Greg Durrett, and Dan Klein. 2014. Less grammar, more features. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 228–237, Baltimore, Maryland. Association for Computational Linguistics.
Jonathan Herzig and Jonathan Berant. 2019. Don’t paraphrase, detect! Rapid and effective data collection for semantic parsing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3810–3820, Hong Kong, China. Association for Computational Linguistics.
Jonathan Herzig and Jonathan Berant. 2021. Span-based semantic parsing for compositional generalization. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 908–921, Online. Association for Computational Linguistics.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
Dora Jambor and Dzmitry Bahdanau. 2022. LAGr: Label aligned graphs for better systematic generalization in semantic parsing. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3295–3308, Dublin, Ireland. Association for Computational Linguistics.
Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12–22, Berlin, Germany. Association for Computational Linguistics.
Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross B. Girshick. 2017. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1988–1997.
Laura Kallmeyer. 2010. Parsing Beyond Context-Free Grammars. Springer Science & Business Media.
Tadao Kasami. 1965. An Efficient Recognition and Syntax-Analysis Algorithm for Context-Free Languages. Coordinated Science Laboratory, University of Illinois at Urbana-Champaign.
Rohit J. Kate, Yuk Wah Wong, and Raymond J. Mooney. 2005. Learning to transform natural to formal languages. In Proceedings of the 20th National Conference on Artificial Intelligence - Volume 3, AAAI’05, pages 1062–1068. AAAI Press.
Daniel Keysers, Nathanael Schärli, Nathan Scales, Hylke Buisman, Daniel Furrer, Sergii Kashubin, Nikola Momchev, Danila Sinopalnikov, Lukasz Stafiniak, Tibor Tihon, Dmitry Tsarkov, Xiao Wang, Marc van Zee, and Olivier Bousquet. 2020. Measuring compositional generalization: A comprehensive method on realistic data. In International Conference on Learning Representations.
Terry Koo, Alexander M. Rush, Michael Collins, Tommi Jaakkola, and David Sontag. 2010. Dual decomposition for parsing with non-projective head automata. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1288–1298, Cambridge, MA. Association for Computational Linguistics.
Joseph B. Kruskal. 1956. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7(1):48–50.
Marco Kuhlmann and Peter Jonsson. 2015. Parsing to noncrossing dependency graphs. Transactions of the Association for Computational Linguistics, 3:559–570.
Harold W. Kuhn. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1–2):83–97.
Simon Lacoste-Julien and Martin Jaggi. 2015. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.
Brenden Lake and Marco Baroni. 2018. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2873–2882. PMLR.
Evgeny S. Levitin and Boris T. Polyak. 1966. Constrained minimization methods. USSR Computational Mathematics and Mathematical Physics, 6(5):1–50.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
Zhenghua Li, Min Zhang, Wanxiang Che, Ting Liu, Wenliang Chen, and Haizhou Li. 2011. Joint models for Chinese POS tagging and dependency parsing. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1180–1191, Edinburgh, Scotland, UK. Association for Computational Linguistics.
João Loula, Marco Baroni, and Brenden Lake. 2018. Rearranging the familiar: Testing compositional generalization in recurrent networks. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 108–114, Brussels, Belgium. Association for Computational Linguistics.
Chunchuan Lyu and Ivan Titov. 2018. AMR parsing as graph prediction with latent alignment. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 397–407, Melbourne, Australia. Association for Computational Linguistics.
Richard Kipp Martin, Ronald L. Rardin, and Brian A. Campbell. 1990. Polyhedral characterization of discrete dynamic programming. Operations Research, 38(1):127–138.
André Martins, Noah Smith, and Eric Xing. 2009. Concise integer linear programming formulations for dependency parsing. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 342–350, Suntec, Singapore. Association for Computational Linguistics.
Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. 2005. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 523–530, Vancouver, British Columbia, Canada. Association for Computational Linguistics.
Ryan McDonald and Giorgio Satta. 2007. On the complexity of non-projective data-driven dependency parsing. In Proceedings of the Tenth International Conference on Parsing Technologies, pages 121–132, Prague, Czech Republic. Association for Computational Linguistics.
Young-Soo Myung, Chang-Ho Lee, and Dong-Wan Tcha. 1995. On the generalized minimum spanning tree problem. Networks, 26(4):231–241.
Radford M. Neal and Geoffrey E. Hinton. 1998. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355–368. Springer.
Yu Nesterov. 2005. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152.
Jorge Nocedal and Stephen J. Wright. 1999. Numerical Optimization. Springer.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
Emily Pitler, Sampath Kannan, and Mitchell Marcus. 2012. Dynamic programming for higher order parsing of gap-minding trees. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 478–488, Jeju Island, Korea. Association for Computational Linguistics.
Emily Pitler, Sampath Kannan, and Mitchell Marcus. 2013. Finding optimal 1-endpoint-crossing trees. Transactions of the Association for Computational Linguistics, 1:13–24.
Linlu Qiu, Peter Shaw, Panupong Pasupat, Pawel Nowak, Tal Linzen, Fei Sha, and Kristina Toutanova. 2022.
.
2022
.
Improving compositional generalization with latent structure and data augmentation
. In
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
, pages
4341
4362
,
Seattle, United States
.
Association for Computational Linguistics
.
Owen
Rambow
.
2010
.
The simple truth about dependency and phrase structure representations: An opinion piece
. In
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
, pages
337
340
,
Los Angeles, California
.
Association for Computational Linguistics
.
V. Venkata
Rao
and
R.
Sridharan
.
2002
.
Minimum-weight rooted not-necessarily-spanning arborescence problem
.
Networks
,
39
(
2
):
77
87
.
Giorgio
Satta
.
1992
.
Recognition of linear context-free rewriting systems
. In
30th Annual Meeting of the Association for Computational Linguistics
, pages
89
95
,
Newark, Delaware, USA
.
Association for Computational Linguistics
.
Alexander
Schrijver
.
2003
.
Combinatorial Optimization: Polyhedra and Efficiency
,
volume 24
.
Springer
.
Mitchell
Stern
,
Jacob
Andreas
, and
Dan
Klein
.
2017
.
A minimal span-based neural constituency parser
. In
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
, pages
818
827
,
Vancouver, Canada
.
Association for Computational Linguistics
.
Robert Endre
Tarjan
.
1977
.
Finding optimum branchings
.
Networks
,
7
(
1
):
25
35
.
Oriol
Vinyals
,
Łukasz
Kaiser
,
Terry
Koo
,
Slav
Petrov
,
Ilya
Sutskever
, and
Geoffrey
Hinton
.
2015
.
Grammar as a foreign language
. In
Advances in Neural Information Processing Systems
,
volume 28
.
Curran Associates, Inc.
Martin J.
Wainwright
and
Michael Irwin
Jordan
.
2008
.
Graphical Models, Exponential Families, and Variational Inference
.
Now Publishers Inc.
Bailin
Wang
,
Mirella
Lapata
, and
Ivan
Titov
.
2021
.
Structured reordering for modeling latent alignments in sequence transduction
. In
Advances in Neural Information Processing Systems
,
volume 34
, pages
13378
13391
.
Curran Associates, Inc.
Bailin
Wang
,
Richard
Shin
,
Xiaodong
Liu
,
Oleksandr
Polozov
, and
Matthew
Richardson
.
2020
.
RAT-SQL: Relation-aware schema encoding and linking for text-to-SQL parsers
. In
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
, pages
7567
7578
,
Online
.
Association for Computational Linguistics
.
Jingfeng
Yang
,
Le
Zhang
, and
Diyi
Yang
.
2022
.
SUBS: Subtree substitution for compositional semantic parsing
. In
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
, pages
169
174
,
Seattle, United States
.
Association for Computational Linguistics
.
Daniel H.
Younger
.
1967
.
Recognition and parsing of context-free languages in time n3
.
Information and Control
,
10
(
2
):
189
208
.
Alp
Yurtsever
,
Olivier
Fercoq
,
Francesco
Locatello
, and
Volkan
Cevher
.
2018
.
A conditional gradient framework for composite convex minimization with applications to semidefinite programming
. In
Proceedings of the 35th International Conference on Machine Learning
,
volume 80 of Proceedings of Machine Learning Research
, pages
5727
5736
.
PMLR
.
Mikhail
Zaslavskiy
,
Marc
Dymetman
, and
Nicola
Cancedda
.
2009
.
Phrase-based statistical machine translation as a traveling salesman problem
. In
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP
, pages
333
341
,
Suntec, Singapore
.
Association for Computational Linguistics
.
John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2, AAAI’96, pages 1050–1055. AAAI Press.
Xingxing Zhang, Jianpeng Cheng, and Mirella Lapata. 2017. Dependency parsing as head selection. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 665–676, Valencia, Spain. Association for Computational Linguistics.
Hao Zheng and Mirella Lapata. 2021. Compositional generalization via semantic tagging. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1022–1032, Punta Cana, Dominican Republic. Association for Computational Linguistics.

A Proofs

Proof: Theorem 1. We prove Theorem 1 by reducing the maximum not-necessarily-spanning arborescence (MNNSA) problem, which is known to be NP-hard (Rao and Sridharan, 2002; Duhamel et al., 2008), to the MGVCNNSA.

Let G = 〈V, A, ψ〉 be a weighted graph where V = {0, …, n} and ψ ∈ ℝ^{|A|} is the vector of arc weights. The MNNSA problem aims to compute a subset of arcs B ⊆ A such that 〈V[B], B〉 is an arborescence of maximum weight, where its weight is defined as ∑_{a ∈ B} ψ_a.

Let 𝒢 = 〈E, T, fType, fArgs〉 be a grammar such that E = {0, …, n − 1}, T = {t} and ∀e ∈ E : fType(e) = t ∧ fArgs(e, t) = e. Intuitively, a tag e ∈ E will be associated with vertices that require exactly e outgoing arcs.

We construct a clustered labeled weighted graph G′ = 〈V′, A′, π, l, ψ′〉 as follows. π = {V′_0, …, V′_n} is a partition of V′ such that each cluster V′_i contains n − 1 vertices and represents the vertex i ∈ V. The labeling function l assigns a different tag to each vertex in a cluster, i.e., ∀V′_i ∈ π, ∀u′, v′ ∈ V′_i : u′ ≠ v′ ⇒ l(u′) ≠ l(v′). The set of arcs is defined as A′ = {u′ → v′ | ∃ i → j ∈ A s.t. u′ ∈ V′_i ∧ v′ ∈ V′_j}. The weight vector ψ′ ∈ ℝ^{|A′|} is such that ∀u′ → v′ ∈ A′ : u′ ∈ V′_u ∧ v′ ∈ V′_v ⇒ ψ′_{u′→v′} = ψ_{u→v}.

As such, there is a one-to-one correspondence between solutions of the MNNSA on graph G and solutions of the MGVCNNSA on graph G′.
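
To make the construction concrete, the following sketch builds G′ from G. This is a minimal Python rendering of the reduction above, not part of the formal proof: the graph representation, the function name, and the choice of tags 0, …, n − 2 within each cluster (the proof only requires distinct tags per cluster) are our own illustrative assumptions.

```python
from itertools import product

def reduce_mnnsa_to_mgvcnnsa(n, psi):
    """Build the clustered labeled weighted graph G' from G = <V, A, psi>,
    where V = {0, ..., n} and psi maps each arc (i, j) to its weight."""
    # Each cluster V'_i represents vertex i of G and contains n - 1 copies
    # of it; a copy tagged e must have exactly e outgoing arcs, since the
    # grammar sets fType(e) = t and fArgs(e, t) = e.
    clusters = {i: [(i, e) for e in range(n - 1)] for i in range(n + 1)}
    label = {v: v[1] for cluster in clusters.values() for v in cluster}
    # An arc u' -> v' exists between copies iff i -> j is an arc of G,
    # and it inherits the weight of i -> j.
    arcs = {
        (u, v): w
        for (i, j), w in psi.items()
        for u, v in product(clusters[i], clusters[j])
    }
    return clusters, label, arcs
```

A maximum weight solution of the MGVCNNSA on the resulting graph selects, for each vertex of G kept in the arborescence, the copy whose tag equals its out-degree, and therefore recovers the MNNSA of G.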

Note that our proof assumes that arcs leaving the root cluster must satisfy the constraints defined by the grammar, whereas we previously only required the root vertex to have a single outgoing arc. The latter constraint can be encoded directly in the grammar, but we omit its presentation for brevity. McDonald and Satta (2007) study the constrained arity case for spanning arborescences and prove NP-hardness by reducing the Hamiltonian path problem to their problem. While the arity constraint is similar in their problem and ours, our proof covers the not-necessarily-spanning case instead of the spanning one. Although the two problems seem related, they must be studied separately: e.g., computing the maximum spanning arborescence is a polynomial time problem, whereas computing the MNNSA is an NP-hard problem.

Proof: Theorem 2. We prove Theorem 2 by reducing the maximum directed Hamiltonian path problem, which is known to be NP-hard (Garey and Johnson, 1979, Appendix A1.3), to the latent anchoring problem.

Let G = 〈V, A, ψ〉 be a weighted graph where V = {1, …, n} and ψ ∈ ℝ^{|A|} is the vector of arc weights. The maximum Hamiltonian path problem aims to compute a subset of arcs B ⊆ A such that V[B] = V and 〈V[B], B〉 is a path of maximum weight, where its weight is defined as ∑_{a ∈ B} ψ_a.

Let 𝒢 = 〈E, T, fType, fArgs〉 be a grammar such that E = {0, 1}, T = {t} and ∀e ∈ E : fType(e) = t ∧ fArgs(e, t) = e.

We construct a clustered labeled weighted graph G′ = 〈V′, A′, π, l, ψ′〉 as follows. π = {V′_0, …, V′_n} is a partition of V′ such that V′_0 = {0} and each cluster V′_i ≠ V′_0 contains 2 vertices and represents the vertex i ∈ V. The labeling function l assigns a different tag to each vertex in every non-root cluster, i.e., ∀V′_i ∈ π, i > 0, ∀u′, v′ ∈ V′_i : u′ ≠ v′ ⇒ l(u′) ≠ l(v′). The set of arcs is defined as A′ = {0 → u′ | u′ ∈ V′ ∖ {0}} ∪ {u′ → v′ | i → j ∈ A ∧ u′ ∈ V′_i ∧ v′ ∈ V′_j}. The weight vector ψ′ ∈ ℝ^{|A′|} is such that ∀u′ → v′ ∈ A′ : u′ ∈ V′_u ∧ v′ ∈ V′_v ⇒ ψ′_{u′→v′} = ψ_{u→v}, and arcs leaving 0 have null weights.

We construct an AST G″ = 〈V″, A″, l′〉 such that V″ = {1, …, n}, A″ = {i → i + 1 | 1 ≤ i < n}, and the labeling function l′ assigns the tag 0 to n and the tag 1 to every other vertex.

As such, there is a one-to-one correspondence between solutions of the maximum Hamiltonian path problem on graph G and maximum weight mappings of G″ into G′.
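
As an illustration, the chain AST G″ of this reduction can be written down directly. This is a small Python sketch under the same illustrative conventions as above; the function name and data layout are our own.

```python
def build_chain_ast(n):
    """Build the AST G'' = <V'', A'', l'>: a chain 1 -> 2 -> ... -> n
    where only the last vertex is tagged with out-degree 0."""
    vertices = list(range(1, n + 1))
    arcs = [(i, i + 1) for i in range(1, n)]            # arcs i -> i + 1
    labels = {i: 0 if i == n else 1 for i in vertices}  # tag = out-degree
    return vertices, arcs, labels
```

Anchoring this chain in G′ forces every selected vertex to have exactly one successor (tag 1) except the last one (tag 0), so a maximum weight anchoring traces a directed Hamiltonian path of maximum weight in G.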

B Experimental Setup

The neural architecture used in our experiments to produce the weights μ and ϕ is composed of: (1) an embedding layer of dimension 100 for Scan or BERT-base (Devlin et al., 2019) for the other datasets, followed by a bi-LSTM (Hochreiter and Schmidhuber, 1997) with a hidden size of 400; (2) a linear projection of dimension 500 over the output of the bi-LSTM followed by a Tanh activation and another linear projection of dimension |E| to obtain μ; (3) a linear projection of dimension 500 followed by a Tanh activation and a bi-affine layer (Dozat and Manning, 2017) to obtain ϕ.
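
For concreteness, the weighting architecture can be sketched in PyTorch as follows. This is a minimal rendering of the description above, assuming precomputed token embeddings as input; the module, class, and parameter names are ours and are not taken from the actual implementation.

```python
import torch
import torch.nn as nn

class Scorer(nn.Module):
    def __init__(self, emb_dim, n_tags, hidden=400, proj=500, dropout=0.3):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.drop = nn.Dropout(dropout)
        # (2) tag weights mu: projection -> Tanh -> projection to |E|
        self.tag_mlp = nn.Sequential(
            nn.Linear(2 * hidden, proj), nn.Tanh(), nn.Dropout(dropout),
            nn.Linear(proj, n_tags))
        # (3) arc weights phi: projection -> Tanh -> bi-affine layer
        self.head_mlp = nn.Sequential(
            nn.Linear(2 * hidden, proj), nn.Tanh(), nn.Dropout(dropout))
        self.dep_mlp = nn.Sequential(
            nn.Linear(2 * hidden, proj), nn.Tanh(), nn.Dropout(dropout))
        self.biaffine = nn.Parameter(torch.zeros(proj + 1, proj + 1))

    def forward(self, emb):               # emb: (batch, length, emb_dim)
        h, _ = self.lstm(self.drop(emb))  # dropout over BERT/embedding output
        h = self.drop(h)                  # dropout over bi-LSTM output
        mu = self.tag_mlp(h)              # (batch, length, |E|)
        head, dep = self.head_mlp(h), self.dep_mlp(h)
        # append a constant feature so the bi-affine product includes biases
        ones = head.new_ones(head.shape[0], head.shape[1], 1)
        head = torch.cat([head, ones], dim=-1)
        dep = torch.cat([dep, ones], dim=-1)
        phi = torch.einsum("bix,xy,bjy->bij", head, self.biaffine, dep)
        return mu, phi                    # tag and arc weights
```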

We apply dropout with a probability of 0.3 over the outputs of BERT-base and the bi-LSTM and after both Tanh activations. The learning rate is 5 × 10⁻⁴ and each batch is composed of 30 examples. We keep the parameters that obtain the best accuracy on the development set after 25 epochs. Training the model takes between 40 minutes (GeoQuery) and 8 hours (Clevr). Note, however, that the bottleneck is the conditional gradient method, which runs on the CPU.
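
A matching training setup could look like the following sketch; the optimizer is not specified above, so the use of Adam and the value of |E| are illustrative assumptions.

```python
model = Scorer(emb_dim=100, n_tags=50)  # |E| = 50 is a placeholder value
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
# Train for 25 epochs with batches of 30 examples, keeping the checkpoint
# that achieves the best development accuracy.
```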

C GeoQuery Denotation Accuracy Issue

The denotation accuracy is evaluated by checking whether the denotation returned by an executor is the same when given the gold semantic program and the prediction of the model. It can be higher than the exact match accuracy when different semantic programs yield the same denotation.
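
In code, the metric amounts to the following sketch, assuming a hypothetical execute function that runs a FunQL program against the GeoQuery database.

```python
def denotation_accuracy(gold_programs, predicted_programs, execute):
    """Fraction of examples whose predicted program yields the same
    denotation as the gold program."""
    matches = sum(
        execute(gold) == execute(pred)
        for gold, pred in zip(gold_programs, predicted_programs))
    return matches / len(gold_programs)
```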

When we evaluated our approach using the same executor as the baselines of Herzig and Berant (2021), we observed two main issues regarding the behavior of predicates: (1) several predicates have undefined behaviors (e.g., population_1 and traverse_2 in the case of an argument of type country), in the sense that they are not implemented; (2) the behavior of some predicates is incorrect with respect to their expected semantics (e.g., traverse_1 and traverse_2). These two sources of errors yield incorrect denotations for several semantic programs, leading to an overestimation of the denotation accuracy when both the gold and predicted programs accidentally return an empty denotation (potentially for different reasons, due to the aforementioned implementation issues).

We implemented a corrected executor addressing the issues that we found. It is available here: https://github.com/alban-petit/geoquery-funql-executor.

Author notes

Action Editor: James Henderson

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.