## Abstract

We propose a novel graph-based approach for semantic parsing that resolves two problems observed in the literature: (1) seq2seq models fail on compositional generalization tasks; (2) previous work using phrase structure parsers cannot cover all the semantic parses observed in treebanks. We prove that both MAP inference and latent tag anchoring (required for weakly-supervised learning) are NP-hard problems. We propose two optimization algorithms based on constraint smoothing and conditional gradient to approximately solve these inference problems. Experimentally, our approach delivers state-of-the-art results on GeoQuery, Scan, and Clevr, both for i.i.d. splits and for splits that test for compositional generalization.

## 1 Introduction

Semantic parsing aims to transform a natural language utterance into a structured representation that can be easily manipulated by software (e.g., to query a database). As such, it is a central task in human–computer interfaces. Andreas et al. (2013) first proposed to rely on machine translation models for semantic parsing, where the target representation is linearized and treated as a foreign language. Due to recent advances in deep learning, and especially in sequence-to-sequence (seq2seq) with attention architectures for machine translation (Bahdanau et al., 2015), it is appealing to use the same architectures for standard structured prediction problems (Vinyals et al., 2015). This approach is indeed common in semantic parsing (Jia and Liang, 2016; Dong and Lapata, 2016; Wang et al., 2020), as well as in other domains. Unfortunately, there are well-known limitations to seq2seq architectures for semantic parsing. First, at test time, the decoding algorithm is typically based on beam search, as the model is autoregressive and does not make any independence assumption. In case of prediction failure, it is therefore unknown whether this is due to errors in the weighting function or to the optimal solution falling out of the beam. Secondly, seq2seq models are known to fail when compositional generalization is required (Lake and Baroni, 2018; Finegan-Dollak et al., 2018a; Keysers et al., 2020).

In order to bypass these problems, Herzig and Berant (2021) proposed to represent the semantic content associated with an utterance as a phrase structure, i.e., using the same representation usually associated with syntactic constituents. As such, their semantic parser is based on standard span-based decoding algorithms (Hall et al., 2014; Stern et al., 2017; Corro, 2020) with additional well-formedness constraints from the semantic formalism. Given a weighting function, MAP inference is a polynomial time problem that can be solved via a variant of the CYK algorithm (Kasami, 1965; Younger, 1967; Cocke, 1970). Experimentally, Herzig and Berant (2021) show that their approach outperforms seq2seq models in terms of compositional generalization, therefore effectively bypassing the two major problems of these architectures.

The complexity of MAP inference for phrase structure parsing is directly impacted by the considered search space (Kallmeyer, 2010). Importantly, (ill-nested) discontinuous phrase structure parsing is known to be NP-hard, even with a bounded block-degree (Satta, 1992). Herzig and Berant (2021) explore two restricted inference algorithms, both of which have a cubic time complexity with respect to the input length. The first one only considers continuous phrase structures, that is, derived trees that could have been generated by a context-free grammar, and the second one also considers a specific type of discontinuities, see Corro (2020, Section 3.6). Both algorithms fail to cover the full set of phrase structures observed in semantic treebanks, see Figure 1.

In this work, we propose to reduce semantic parsing without reentrancy (i.e., a given predicate or entity cannot be used as an argument for two different predicates) to a bi-lexical dependency parsing problem. As such, we tackle the same semantic content as aforementioned previous work but using a different mathematical representation (Rambow, 2010). We identify two main benefits to our approach: (1) as we allow crossing arcs, i.e., “non-projective graphs”, all datasets are guaranteed to be fully covered and (2) it allows us to rely on optimization methods to tackle inference intractability of our novel graph-based formulation of the problem. More specifically, in our setting we need to jointly assign predicates/entities to words that convey a semantic content and to identify arguments of predicates via bi-lexical dependencies. We show that MAP inference in this setting is equivalent to the maximum generalized spanning arborescence problem (Myung et al., 1995) with supplementary constraints to ensure well-formedness with respect to the semantic formalism. Although this problem is NP-hard, we propose an optimization algorithm that solves a linear relaxation of the problem and can deliver an optimality certificate.

Our contributions can be summarized as follows:

We propose a novel graph-based approach for semantic parsing without reentrancy;

We prove the NP-hardness of MAP inference and latent anchoring inference;

We propose a novel integer linear programming formulation for this problem together with an approximate solver based on conditional gradient and constraint smoothing;

We tackle the training problem using variational approximations of objective functions, including the weakly-supervised scenario;

We evaluate our approach on GeoQuery, Scan, and Clevr and observe that it outperforms baselines on both i.i.d. splits and splits that test for compositional generalization.


## 2 Graph-based Semantic Parsing

We propose to reduce semantic parsing to parsing the abstract syntax tree (AST) associated with a semantic program. We focus on semantic programs whose ASTs do not have any reentrancy, i.e., a single predicate or entity cannot be the argument of two different predicates. Moreover, we assume that each predicate or entity is anchored on exactly one word of the sentence and each word can be the anchor of at most one predicate or entity. As such, the semantic parsing problem can be reduced to assigning predicates and entities to words and identifying arguments via dependency relations, see Figure 2. In order to formalize our approach to the semantic parsing problem, we will use concepts from graph theory. We therefore first introduce the vocabulary and notions that will be useful in the rest of this article. Notably, the notions of cluster and generalized arborescence will be used to formalize our prediction problem.

### Notations and Definitions.

Let *G* = 〈*V*, *A*〉 be a directed graph with vertices *V* and arcs *A* ⊆ *V* × *V*. An arc in *A* from a vertex *u* ∈ *V* to a vertex *v* ∈ *V* is denoted either *a* ∈ *A* or *u* → *v* ∈ *A*. For any subset of vertices *U* ⊆ *V*, we denote *σ*_{G}^{+}(*U*) (resp., *σ*_{G}^{−}(*U*)) the set of arcs leaving one vertex of *U* and entering one vertex of *V* ∖ *U* (resp., leaving one vertex of *V* ∖ *U* and entering one vertex of *U*) in the graph *G*. Let *B* ⊆ *A* be a subset of arcs. We denote *V*[*B*] the cover set of *B*, i.e., the set of vertices that appear as an extremity of at least one arc in *B*. A graph *G* = 〈*V*, *A*〉 is an arborescence^{2} rooted at *u* ∈ *V* if and only if (iff) it contains |*V*| − 1 arcs and there is a directed path from *u* to each vertex in *V*. In the rest of this work, we will assume that the root is always vertex 0 ∈ *V*. Let *B* ⊆ *A* be a set of arcs such that *G*′ = 〈*V*[*B*], *B*〉 is an arborescence. Then *G*′ is a spanning arborescence of *G* iff *V*[*B*] = *V*.
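The arborescence definition above can be checked mechanically; a minimal sketch (function name and representation are our own, not from the paper):

```python
from collections import defaultdict

def is_arborescence(vertices, arcs, root=0):
    """Check that <vertices, arcs> is an arborescence rooted at `root`:
    it must contain |V| - 1 arcs and every vertex must be reachable
    from the root by a directed path."""
    if len(arcs) != len(vertices) - 1:
        return False
    children = defaultdict(list)
    for u, v in arcs:
        children[u].append(v)
    # depth-first traversal from the root
    seen, stack = set(), [root]
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        stack.extend(children[u])
    return seen == set(vertices)
```

Note that with exactly |*V*| − 1 arcs, reachability from the root already implies that every other vertex has in-degree one.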

Let *π* = {*V*_{0}, …, *V*_{n}} be a partition of *V* containing *n* + 1 clusters. *G*′ is a generalized not-necessarily-spanning arborescence (resp. generalized spanning arborescence) on the partition *π* of *G* iff *G*′ is an arborescence and *V* [*B*] contains at most one vertex per cluster in *π* (resp. contains exactly one).

Let *W* ⊆ *V* be a set of vertices. Contracting *W* consists in replacing in *G* the set *W* by a new vertex *w* ∉ *V*, replacing all the arcs *u* → *v* ∈ *σ*^{−}(*W*) by an arc *u* → *w* and all the arcs *u* → *v* ∈ *σ*^{+}(*W*) by an arc *w* → *v*. Given a graph with partition *π*, the contracted graph is the graph where each cluster in *π* has been contracted. While contracting a graph may introduce parallel arcs, it is not an issue in practice, even for weighted graphs.
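Cluster contraction with parallel-arc resolution can be sketched as follows; in the weighted case only the heaviest of a group of parallel arcs matters, as noted above (names and representation are illustrative):

```python
def contract_clusters(arcs, weights, cluster_of):
    """Contract each cluster to a single node. Parallel arcs that appear
    after contraction are resolved by keeping only the one of maximum
    weight; intra-cluster arcs disappear."""
    best = {}  # (cluster_u, cluster_v) -> (weight, original arc)
    for (u, v), w in zip(arcs, weights):
        cu, cv = cluster_of[u], cluster_of[v]
        if cu == cv:
            continue  # arc internal to a cluster: removed by contraction
        key = (cu, cv)
        if key not in best or w > best[key][0]:
            best[key] = (w, (u, v))
    return best
```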

### 2.1 Semantic Grammar and AST

The semantic programs we focus on take the form of a functional language, i.e., a representation where each predicate is a function that takes other predicates or entities as arguments. The semantic language is typed, in the same sense as in “typed programming languages”. For example, in GeoQuery, the predicate capital_2 expects an argument of type city and returns an object of type state. In the datasets we use, the typing system disambiguates the position of arguments in a function: for a given function, either all arguments are of the same type or the order of arguments is unimportant—an example of both is the predicate intersection_river in GeoQuery, which takes two arguments of type river, but the result of the execution is unchanged if the arguments are swapped.^{3}

Formally, we define the set of valid semantic programs as the set of programs that can be produced with a semantic grammar 𝒢 = 〈*E*, *T*, *f*_{Type}, *f*_{Args}〉 where:

- *E* is the set of predicates and entities, which we will refer to as the set of tags—w.l.o.g. we assume that Root ∉ *E*, where Root is a special tag used for parsing;
- *T* is the set of types;
- *f*_{Type}: *E* → *T* is a typing function that assigns a type to each tag;
- *f*_{Args}: *E* × *T* → ℕ is a valency function that assigns the number of expected arguments of a given type to each tag.

A tag *e* ∈ *E* is an entity iff ∀*t* ∈ *T*: *f*_{Args}(*e*, *t*) = 0. Otherwise, *e* is a predicate.

An AST is a labeled graph *G* = 〈*V*, *A*, *l*〉, where the function *l*: *V* → *E* assigns a tag to each vertex and arcs identify the arguments of tags, see Figure 2. An AST *G* is well-formed with respect to the grammar 𝒢 iff *G* is an arborescence and the valency and type constraints are satisfied, i.e., ∀*u* ∈ *V*, *t* ∈ *T*: ∣{*u* → *v* ∈ *A* ∣ *f*_{Type}(*l*(*v*)) = *t*}∣ = *f*_{Args}(*l*(*u*), *t*).
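The valency and type constraints can be tested directly from *f*_{Type} and *f*_{Args}; below is a hedged sketch using the paper's GeoQuery example (capital_2 expects one argument of type city and returns a state; the tag name cityid and all identifiers are illustrative, and the arborescence check is assumed to be done separately):

```python
# Toy grammar fragment based on the paper's example. These dictionaries are
# illustrative, not the datasets' full grammars.
f_type = {"capital_2": "state", "cityid": "city"}   # f_Type
f_args = {("capital_2", "city"): 1}                 # f_Args, 0 if absent

def well_formed(vertices, arcs, label):
    """Valency/type constraints of a candidate AST: for every vertex u and
    every type t, the number of children of u whose tag has type t must
    equal f_Args(label(u), t)."""
    for u in vertices:
        for t in set(f_type.values()):
            have = sum(1 for (p, v) in arcs if p == u and f_type[label[v]] == t)
            if have != f_args.get((label[u], t), 0):
                return False
    return True
```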

### 2.2 Problem Reduction and Complexity

In our setting, semantic parsing is a joint sentence tagging and dependency parsing problem (Bohnet and Nivre, 2012; Li et al., 2011; Corro et al., 2017): each content word (i.e., each word that conveys a semantic meaning) must be tagged with a predicate or an entity, and dependencies between content words identify arguments of predicates, see Figure 2. However, our semantic parsing setting differs from standard syntactic analysis in two ways: (1) the resulting structure is not-necessarily-spanning: some words (e.g., function words) must not be tagged and do not have any incident dependency—and those words are not known in advance, so they must be identified jointly with the rest of the structure; (2) the dependency structure is highly constrained by the typing mechanism, that is, the predicted structure must be a valid AST. Nevertheless, similarly to the aforementioned works, our parser is graph-based, that is, for a given input we build a (complete) directed graph and decoding is reduced to computing a constrained subgraph of maximum weight.

Given a sentence **w** = *w*_{1}…*w*_{n} with *n* words and a grammar 𝒢, we construct a clustered labeled graph *G* = 〈*V*, *A*, *π*, $l\xaf$〉 as follows. The partition *π* = {*V*_{0}, …, *V*_{n}} contains *n* + 1 clusters, where *V*_{0} is a root cluster and each cluster *V*_{i}, *i* ≠ 0, is associated to word *w*_{i}. The root cluster *V*_{0} = {0} contains a single vertex that will be used as the root, and every other cluster contains |*E*| vertices. The extended labeling function $l\xaf$: *V* → *E* ∪ {Root} assigns a tag in *E* to each vertex *v* ∈ *V* ∖ {0} and Root to vertex 0. Distinct vertices in a cluster *V*_{i} cannot have the same label, i.e., ∀*u*, *v* ∈ *V*_{i}: *u* ≠ *v* ⇒ $l\xaf$(*u*) ≠ $l\xaf$(*v*).

Let *B* ⊆ *A* be a subset of arcs. The graph *G*′ = 〈*V*[*B*], *B*〉 defines a 0-rooted generalized valency-constrained not-necessarily-spanning arborescence iff it is a generalized arborescence of *G*, there is exactly one arc leaving 0, and the sub-arborescence rooted at the destination of that arc is a valid AST with respect to the grammar 𝒢. As such, there is a one-to-one correspondence between ASTs anchored on the sentence **w** and generalized valency-constrained not-necessarily-spanning arborescences in the graph *G*, see Figure 3b.

For any sentence **w**, our aim is to find the AST that most likely corresponds to it. Thus, after building the graph *G* as explained above, the neural network described in Appendix B is used to produce a vector of weights **μ** ∈ ℝ^{|V|} associated to the set of vertices *V* and a vector of weights **ϕ** ∈ ℝ^{|A|} associated to the set of arcs *A*. Given these weights, graph-based semantic parsing is reduced to an optimization problem called the maximum generalized valency-constrained not-necessarily-spanning arborescence (MGVCNNSA) problem in the graph *G*.
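The construction of the clustered graph can be sketched as follows (a toy illustration of the construction in this section, with our own names; the weights **μ** and **ϕ** produced by the neural network of Appendix B are omitted):

```python
def build_clustered_graph(n_words, tags):
    """Build the clustered graph: cluster V_0 = {0} holds the root vertex,
    and cluster V_i holds one vertex per tag in E for word w_i."""
    vertices = [0]                       # vertex 0 is the root
    cluster_of, label = {0: 0}, {0: "Root"}
    for i in range(1, n_words + 1):
        for tag in tags:
            v = len(vertices)
            vertices.append(v)
            cluster_of[v] = i
            label[v] = tag
    # candidate arcs: between vertices of distinct clusters; no arc may
    # enter the root vertex
    arcs = [(u, v) for u in vertices for v in vertices[1:]
            if u != v and cluster_of[u] != cluster_of[v]]
    return vertices, arcs, cluster_of, label
```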

**Theorem 1.***The MGVCNNSA problem is NP-hard.*

The proof is in Appendix A.

### 2.3 Mathematical Program

Our graph-based approach to semantic parsing has allowed us to prove the intrinsic hardness of the problem. We follow previous work on graph-based parsing (Martins et al., 2009; Koo et al., 2010), and other topics, by proposing an integer linear programming (ILP) formulation in order to compute (approximate) solutions.

Remember that in the joint tagging and dependency parsing interpretation of the semantic parsing problem, the resulting structure is not-necessarily-spanning, meaning that some words may not be tagged. In order to rely on well-known algorithms for computing spanning arborescences as a subroutine of our approximate solver, we first introduce the notion of extended graph. Given a graph *G* = 〈*V*, *A*, *π*, $l\xaf$〉, we construct an extended graph $G\xaf$ = 〈$V\xaf$, $A\xaf$, $\pi \xaf$, $l\xaf$〉^{4} containing *n* additional vertices {$1\xaf$, …, $n\xaf$} that are distributed along clusters, i.e., $\pi \xaf$ = {*V*_{0}, *V*_{1} ∪ {$1\xaf$}, …, *V*_{n} ∪ {$n\xaf$}}, and arcs from the root to these extra vertices, i.e., $A\xaf$ = *A* ∪ {0 → $i\xaf$∣1 ≤ *i* ≤ *n*}. Let *B* ⊆ *A* be a subset of arcs such that 〈*V*[*B*], *B*〉 is a generalized not-necessarily-spanning arborescence on *G*. Let $B\xaf$ ⊆ $A\xaf$ be a subset of arcs defined as $B\xaf$ = *B* ∪ {0 → $i\xaf$∣$\sigma \u2329V[B],B\u232a\u2212$(*V*_{i}) = ∅}. Then, there is a one-to-one correspondence between generalized not-necessarily-spanning arborescences 〈*V*[*B*], *B*〉 and generalized spanning arborescences 〈$V\xaf$[$B\xaf$], $B\xaf$〉, see Figure 3b.

Let **x** ∈ {0, 1}^{∣$V\xaf$∣} and **y** ∈ {0, 1}^{∣$A\xaf$∣} be variable vectors indexed by vertices and arcs, such that a vertex *v* ∈ $V\xaf$ (resp., an arc *a* ∈ $A\xaf$) is selected iff *x*_{v} = 1 (resp., *y*_{a} = 1). The set of 0-rooted generalized valency-constrained spanning arborescences on $G\xaf$ can be written as the set of variables 〈**x**, **y**〉 satisfying the following linear constraints. First, we restrict **y** to structures that are spanning arborescences over $G\xaf$ where clusters have been contracted: (1) **y** ≥ **0**; (2) each non-root cluster has exactly one selected incoming arc, i.e., ∑_{a ∈ *σ*^{−}(*V*_{i})} *y*_{a} = 1 for all *i* ≠ 0; (3) every set of clusters that does not contain the root cluster has at least one selected incoming arc. The sets of variables **y** that satisfy these three constraints are exactly the 0-rooted spanning arborescences on the contracted graph, see Schrijver (2003, Section 52.4) for an in-depth analysis of this polytope. The root vertex is always selected and other vertices are selected iff they have one incoming selected arc: *x*_{0} = 1 and *x*_{v} = ∑_{a ∈ *σ*^{−}(*v*)} *y*_{a} for all *v* ∈ $V\xaf$ ∖ {0}. We denote 𝒞^{(sa)} the set of pairs 〈**x**, **y**〉 satisfying these constraints. We additionally require that exactly one selected arc leaving the root enters a vertex *u* ∈ *V* ∖ {0} (i.e., a vertex that is not part of the extra vertices introduced in the extended graph); this vertex will be the root of the AST. Constraints (7) force the selected vertices and arcs to produce a well-formed AST with respect to the grammar 𝒢; we denote 𝒞^{(val)} the set of pairs satisfying these valency constraints. Note that these constraints are only defined for vertices in *V* ∖ {0}, i.e., they are neither defined for the root vertex nor for the extra vertices introduced in the extended graph.

Let 𝒞 = 𝒞^{(sa)} ∩ 𝒞^{(val)}. Given vertex weights **μ** ∈ ℝ^{∣$V\xaf$∣} and arc weights **ϕ** ∈ ℝ^{∣$A\xaf$∣}, computing the MGVCNNSA is equivalent to solving the following ILP: maximize **μ**^{⊤}**x** + **ϕ**^{⊤}**y** subject to 〈**x**, **y**〉 ∈ 𝒞.

Without the constraints 〈**x**, **y**〉 ∈ 𝒞^{(val)}, the problem would be easy to solve. The set 𝒞^{(sa)} is the set of spanning arborescences over the contracted graph, hence to maximize over this set we can simply: (1) contract the graph and assign to each arc in the contracted graph the weight of its corresponding arc plus the weight of its destination vertex in the original graph; (2) run the maximum spanning arborescence algorithm (MSA; Edmonds, 1967; Tarjan, 1977) on the contracted graph, which has a 𝒪(*n*^{2}) time-complexity. This process is illustrated in Figure 5 (top). Note that the contracted graph may have parallel arcs, which is not an issue in practice as only the one of maximum weight can appear in a solution of the MSA.
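Step (2) can be illustrated on a toy contracted graph. For clarity we brute-force the maximum spanning arborescence instead of implementing Chu–Liu/Edmonds; this is only a sketch for checking small instances, not the 𝒪(*n*²) algorithm the paper relies on:

```python
from itertools import product

def msa_brute_force(n, arc_weight, root=0):
    """Maximum spanning arborescence on a graph with vertices 0..n-1, by
    enumerating the incoming arc of every non-root vertex. Exponential in n:
    a real implementation would use Chu-Liu/Edmonds instead."""
    best, best_arcs = float("-inf"), None
    for parents in product(range(n), repeat=n - 1):  # parent of vertices 1..n-1
        arcs = [(p, v + 1) for v, p in enumerate(parents) if p != v + 1]
        if len(arcs) != n - 1:
            continue
        # every vertex must be reachable from the root
        seen, frontier = {root}, [root]
        while frontier:
            u = frontier.pop()
            for (p, v) in arcs:
                if p == u and v not in seen:
                    seen.add(v)
                    frontier.append(v)
        if len(seen) != n:
            continue
        w = sum(arc_weight.get(a, float("-inf")) for a in arcs)
        if w > best:
            best, best_arcs = w, arcs
    return best, best_arcs
```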

We have established that MAP inference in our semantic parsing framework is an NP-hard problem. We proposed an ILP formulation of the problem that would be easy to solve if some constraints were removed. This property suggests the use of an approximation algorithm that introduces the difficult constraints as penalties. As a similar setting arises from our weakly supervised loss function, the presentation of the approximation algorithm is deferred until Section 4.

## 3 Training Objective Functions

### 3.1 Supervised Training Objective

We define the probability of a pair 〈**x**, **y**〉 ∈ 𝒞 via the Boltzmann distribution: *p*(**x**, **y**) = exp(**μ**^{⊤}**x** + **ϕ**^{⊤}**y** − *c*(**μ**, **ϕ**)), where *c*(**μ**, **ϕ**) is the log-partition function: *c*(**μ**, **ϕ**) = log ∑_{〈**x**′, **y**′〉 ∈ 𝒞} exp(**μ**^{⊤}**x**′ + **ϕ**^{⊤}**y**′). The supervised training objective is the negative log-likelihood, which for a gold structure 〈**x**, **y**〉 is defined as: ℓ(**μ**, **ϕ**; **x**, **y**) = *c*(**μ**, **ϕ**) − **μ**^{⊤}**x** − **ϕ**^{⊤}**y**.

Let *U* be a matrix such that each row contains a pair 〈**x**, **y**〉 ∈ 𝒞 and Δ^{∣𝒞∣} be the simplex of dimension ∣𝒞∣ − 1, i.e., the set of all stochastic vectors of dimension ∣𝒞∣. The log-partition function can then be rewritten using its variational formulation: *c*(**μ**, **ϕ**) = max_{**p** ∈ Δ^{∣𝒞∣}} **p**^{⊤}*U***θ** + *H*[**p**], where **θ** denotes the concatenation of **μ** and **ϕ**, and *H*[**p**] = −∑_{i} *p*_{i} log *p*_{i} is the Shannon entropy. We refer the reader to Boyd and Vandenberghe (2004, Example 3.25), Wainwright and Jordan (2008, Section 3.6), and Beck (2017, Section 4.4.10). Note that this formulation remains impractical as **p** has an exponential size. Let 𝓜 = conv(𝒞) be the marginal polytope, i.e., the convex hull of the feasible integer solutions. We can rewrite the above variational formulation as: *c*(**μ**, **ϕ**) = max_{〈**x̂**, **ŷ**〉 ∈ 𝓜} **μ**^{⊤}**x̂** + **ϕ**^{⊤}**ŷ** + *H*_{𝓜}(**x̂**, **ŷ**), where *H*_{𝓜} is a joint entropy function defined such that the equality holds. The maximization in this reformulation acts on the marginal probabilities of parts (vertices and arcs) and has therefore a polynomial number of variables. We refer the reader to Wainwright and Jordan (2008, Section 5.2.1) and Blondel et al. (2020, Section 7) for more details. Unfortunately, this optimization problem is hard to solve as 𝓜 cannot be characterized in an explicit manner and *H*_{𝓜} is defined indirectly and lacks a polynomial closed form (Wainwright and Jordan, 2008, Section 3.7). However, we can derive an upper bound to the log-partition function by decomposing the entropy term *H*_{𝓜} (Cover, 1999, Property 4 on page 41, i.e., the sum of the entropies of the parts is an upper bound on the joint entropy) and by using an outer approximation to the marginal polytope 𝓛 ⊇ 𝓜 (i.e., increasing the search space).

Remember that any pair 〈**x**, **y**〉 ∈ 𝒞 has exactly one vertex selected per cluster $Vi\xaf$ ∈ $\pi \xaf$ and one incoming arc selected per cluster $Vi\xaf$ ∈ $\pi \xaf$ ∖ {*V*_{0}}. We denote 𝒞^{(one)} the set of all the pairs 〈**x**, **y**〉 that satisfy these constraints. By using 𝓛 = conv(𝒞^{(one)}) as an outer approximation to the marginal polytope (see Figure 4), the optimization problem can be rewritten as a sum of independent problems. As each of these problems is the variational formulation of a log-sum-exp term, the upper bound on *c*(**μ**, **ϕ**) can be expressed as a sum of log-sum-exp functions: one over the vertices of each cluster $Vi\xaf$ ∈ $\pi \xaf$ ∖ {*V*_{0}} and one over the incoming arcs $\sigma G\xaf\u2212$($Vi\xaf$) of each cluster $Vi\xaf$ ∈ $\pi \xaf$ ∖ {*V*_{0}}. Although this type of approximation may not result in a Bayes consistent loss (Corro, 2023), it works well in practice.

### 3.2 Weakly Supervised Training Objective

Unfortunately, training data often does not include gold pairs 〈**x**, **y**〉 but instead only the AST, without word anchors (or word alignment). This is the case for the three datasets we use in our experiments. We thus consider our training signal to be the set of all structures that induce the annotated AST, which we denote 𝒞*.

Let *q* be a proposal distribution such that *q*(**x**, **y**) = 0 if 〈**x**, **y**〉 ∉ 𝒞*. We derive the following lower bound on the log-likelihood of the annotated AST via Jensen's inequality: log ∑_{〈**x**, **y**〉 ∈ 𝒞*} *p*(**x**, **y**) ≥ 𝔼_{q}[**μ**^{⊤}**x** + **ϕ**^{⊤}**y**] − *c*(**μ**, **ϕ**) + *H*[*q*]. The lower bound holds for any *q* satisfying the aforementioned condition. We choose to maximize this lower bound using a distribution that gives a probability of one to a single structure, as in “hard” EM (Neal and Hinton, 1998, Section 6).

For a given sentence **w**, let *G* = 〈*V*, *A*, *π*, $l\xaf$〉 be a graph defined as in Section 2.2 and *G*′ = 〈*V*′, *A*′, *l*′〉 be an AST defined as in Section 2.1. We aim to find the GVCNNSA in *G* of maximum weight whose induced AST is exactly *G*′. This is equivalent to aligning each vertex in *V*′ with one vertex of *V* ∖ {0} s.t. there is at most one vertex per cluster of *π* appearing in the alignment, and where the weight of an alignment is defined as follows:

- for each vertex *u*′ ∈ *V*′, we add the weight of the vertex *u* ∈ *V* it is aligned to—moreover, if *u*′ is the root of the AST, we also add the weight of the arc 0 → *u*;
- for each arc *u*′ → *v*′ ∈ *A*′, we add the weight of the arc *u* → *v*, where *u* ∈ *V* (resp. *v* ∈ *V*) is the vertex that *u*′ (resp. *v*′) is aligned with.

This problem amounts to computing a matching between *G*′ and *G*, but the fact that we need to take into account arc weights forbids the use of the Kuhn–Munkres algorithm (Kuhn, 1955).

**Theorem 2.***Computing the anchoring of maximum weight of an AST with a graph G is NP-hard.*

The proof is in Appendix A.

Therefore, we propose an optimization-based approach to compute the distribution *q*. Note that the problem has a constraint requiring each cluster *V*_{i} ∈ *π* to be aligned with at most one vertex *v*′ ∈ *V*′, i.e., each word in the sentence can be aligned with at most one vertex in the AST. If we remove this constraint, then the problem becomes tractable via dynamic programming. Indeed, we can recursively construct a table Chart[*u*′, *u*], *u*′ ∈ *V*′ and *u* ∈ *V*, containing the score of aligning vertex *u*′ to vertex *u* plus the score of the best alignment of all the descendants of *u*′. To this end, we simply visit the vertices *V*′ of the AST in reverse topological order, see Algorithm 1. The best alignment can be retrieved via back-pointers.
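The relaxed dynamic program can be sketched as follows (a hedged reconstruction, not the paper's exact Algorithm 1; names are our own):

```python
def relaxed_alignment(ast_vertices, ast_arcs, graph_vertices,
                      vertex_weight, arc_weight):
    """Chart[u', u] = weight of aligning AST vertex u' to graph vertex u,
    plus the score of the best alignment of all descendants of u'.
    `ast_vertices` must be in reverse topological order (children first).
    Each child picks its best anchor independently: this drops the
    at-most-one-vertex-per-cluster constraint, hence a relaxation."""
    children = {u: [] for u in ast_vertices}
    for (u, v) in ast_arcs:
        children[u].append(v)
    chart = {}
    for up in ast_vertices:
        for u in graph_vertices:
            score = vertex_weight[u]
            for vp in children[up]:
                # missing arcs get weight -inf so they are never selected
                score += max(chart[(vp, v)]
                             + arc_weight.get((u, v), float("-inf"))
                             for v in graph_vertices)
            chart[(up, u)] = score
    return chart
```

The best alignment is recovered by keeping back-pointers alongside the `max`, as in the text.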

Computing the distribution *q* is therefore equivalent to solving the following ILP: maximize the alignment weight over 𝒞^{*(relaxed)} subject to the additional constraint that at most one vertex of the AST is aligned per cluster, where 𝒞^{*(relaxed)} is the set of feasible solutions of the dynamic program in Algorithm 1, whose convex hull can be described via linear constraints (Martin et al., 1990).

## 4 Efficient Inference

In this section, we consider optimization problems of the following general form: max *f*(**z**) − *δ*_{S}(*A***z**) s.t. **z** ∈ conv(𝒞^{(easy)}), where **z** is the concatenation of the vectors **x** and **y** defined previously and conv denotes the convex hull of a set. We explained previously that if the set of constraints of the form *A***z** = **b** for (Ilp1) or *A***z** ≤ **b** for (Ilp2) was absent, the problem would be easy to solve under a linear objective function. In fact, there exists an efficient linear maximization oracle (LMO), i.e., a function that returns the optimal integral solution, for the set conv(𝒞^{(easy)}). This setting covers both (Ilp1) and (Ilp2), where we have 𝒞^{(easy)} = 𝒞^{(sa)} and 𝒞^{(easy)} = 𝒞^{*(relaxed)}, respectively.

The function *δ*_{S} is the indicator function of the set *S*: *δ*_{S}(**u**) = 0 if **u** ∈ *S*, and +∞ otherwise. In the equality case, we use *S* = {**b**}, and in the inequality case, we use *S* = {**u** ∣ **u** ≤ **b**}.

### 4.1 Conditional Gradient Method

Given a concave differentiable function *g* and a nonempty, bounded, closed, and convex set conv(𝒞^{(easy)}), the conditional gradient method (a.k.a. Frank–Wolfe; Frank and Wolfe, 1956; Levitin and Polyak, 1966; Lacoste-Julien and Jaggi, 2015) can be used to solve optimization problems of the form max_{**z** ∈ conv(𝒞^{(easy)})} *g*(**z**). Contrary to the projected gradient method, it does not require computing projections onto conv(𝒞^{(easy)}), which is, in most cases, computationally expensive. Instead, the conditional gradient method only relies on an LMO: lmo(**z**) = argmax_{**z**′ ∈ conv(𝒞^{(easy)})} **z**′^{⊤}∇*g*(**z**).
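A generic sketch of the iteration (function and variable names are our own; we use the standard 2/(*k* + 2) step size, while Section 4.3 derives a closed-form one for the MAP case):

```python
def conditional_gradient(grad, lmo, z0, iterations=100):
    """Frank-Wolfe loop: s = lmo(grad g(z)), then move z toward s.
    No projection is ever needed: the iterate stays inside
    conv(C^(easy)) because each update is a convex combination."""
    z = list(z0)
    for k in range(iterations):
        g = grad(z)           # gradient of the (smoothed) objective at z
        s = lmo(g)            # vertex of conv(C^(easy)) maximizing <s, g>
        gamma = 2.0 / (k + 2.0)
        z = [zi + gamma * (si - zi) for zi, si in zip(z, s)]
    return z
```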

### 4.2 Smoothing

The objective function *g*(**z**) = *f*(**z**) − *δ*_{S}(*A***z**) is non-smooth due to the indicator function term, preventing the use of the conditional gradient method. We propose to rely on the framework proposed by Yurtsever et al. (2018), where the indicator function is replaced by a smooth approximation. The indicator function of the set *S* can be rewritten as its Fenchel biconjugate: *δ*_{S}(**u**) = sup_{**t**} **u**^{⊤}**t** − *σ*_{S}(**t**), where *σ*_{S}(**u**) = sup_{**t** ∈ S} **u**^{⊤}**t** is the support function of *S*. More details can be found in Beck (2017, Sections 4.1 and 4.2). In order to smooth the indicator function, we add a *β*-parameterized convex regularizer −$\frac{\beta}{2}$∥·∥₂² to its Fenchel biconjugate: *δ*_{S}^{β}(**u**) = sup_{**t**} **u**^{⊤}**t** − *σ*_{S}(**t**) − $\frac{\beta}{2}$∥**t**∥₂², where *β* > 0 controls the quality and the smoothness of the approximation (Nesterov, 2005).

#### Equalities.

In the equality case *S* = {**b**}, with a few computations that are detailed by Yurtsever et al. (2018), we obtain: *δ*_{S}^{β}(*A***z**) = $\frac{1}{2\beta}$∥*A***z** − **b**∥₂². This term introduces a quadratic penalty in the objective for vectors **z** s.t. *A***z** ≠ **b**.

#### Inequalities.

In the inequality case *S* = {**u** ∣ **u** ≤ **b**}, similar computations lead to: *δ*_{S}^{β}(*A***z**) = $\frac{1}{2\beta}$∥[*A***z** − **b**]₊∥₂², where [·]₊ denotes the Euclidean projection onto the non-negative orthant (i.e., clipping negative values). Similarly to the equality case, this term introduces a penalty in the objective for vectors **z** s.t. *A***z** > **b**. This penalty function is also called the Courant–Beltrami penalty function.

Figure 5 (bottom) illustrates how the gradient of the penalty term can “force” the LMO to return solutions that satisfy the smoothed constraints.
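Both smoothed penalties are simple to implement; a minimal sketch (dense matrix representation, for illustration only):

```python
def smoothed_penalty_eq(A, z, b, beta):
    """(1/(2*beta)) * ||A z - b||_2^2: smooth surrogate for A z = b."""
    r = [sum(a * zj for a, zj in zip(row, z)) - bi
         for row, bi in zip(A, b)]
    return sum(ri * ri for ri in r) / (2.0 * beta)

def smoothed_penalty_ineq(A, z, b, beta):
    """(1/(2*beta)) * ||[A z - b]_+||_2^2 (Courant-Beltrami penalty):
    smooth surrogate for A z <= b; satisfied rows contribute nothing."""
    r = [max(0.0, sum(a * zj for a, zj in zip(row, z)) - bi)
         for row, bi in zip(A, b)]
    return sum(ri * ri for ri in r) / (2.0 * beta)
```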

### 4.3 Practical Details

#### Smoothness.

In practice, we need to choose the smoothness parameter *β*. We follow Yurtsever et al. (2018) and use *β*^{(k)} = *β*^{(0)}/$\sqrt{k+1}$, where *k* is the iteration number and *β*^{(0)} = 1.

#### Step Size.

One advantage of the conditional gradient method is that we can compute the optimal step size *γ*. We show that when the smoothed constraints are equalities, computing the optimal step size has a simple closed-form solution if the function *f* is linear, which is the case for (Ilp1), i.e., MAP decoding. The step size problem at iteration *k* is defined as: *γ*^{(k)} = argmax_{γ ∈ [0, 1]} *g*(**z**^{(k)} + *γ*(**s**^{(k)} − **z**^{(k)})), where **s**^{(k)} is the solution returned by the LMO. The function *f* is linear and can be written as *f*(**z**) = **θ**^{⊤}**z**. Ignoring the box constraints on *γ* and writing **d** = **s**^{(k)} − **z**^{(k)}, by first-order optimality conditions we have: *γ*^{(k)} = (*β***θ**^{⊤}**d** − (*A***z**^{(k)} − **b**)^{⊤}*A***d**) / ∥*A***d**∥₂², which is then projected onto [0, 1].

#### Non-integral Solutions.

As we solve the linear relaxation of the original ILPs, the optimal solutions may not be integral (Figure 4). Therefore, we use simple heuristics to construct a feasible solution to the original ILP in these cases. For MAP inference, we simply solve the ILP^{5} using CPLEX but introduce only the variables that have a non-null value in the linear relaxation, leading to a very sparse problem that is fast to solve. For latent anchoring, we simply run the Kuhn–Munkres algorithm with the non-integral solution as assignment costs.

## 5 Experiments

We compare our method to baseline systems both on i.i.d. splits (Iid) and splits that test for compositional generalization for three datasets. The neural network is described in Appendix B.

### Datasets.

Scan (Lake and Baroni, 2018) contains natural language navigation commands. We use the variant of Herzig and Berant (2021) for semantic parsing. The Iid split is the *simple* split (Lake and Baroni, 2018). The compositional splits are *primitive right* (Right) and *primitive around right* (ARight) (Loula et al., 2018).

GeoQuery (Zelle and Mooney, 1996) uses the FunQL formalism (Kate et al., 2005) and contains questions about the US geography. The Iid split is the standard split and compositional generalization is evaluated on two splits: Length where the examples are split by program length and Template (Finegan-Dollak et al., 2018a) where they are split such that all semantic programs having the same AST are in the same split.

Clevr (Johnson et al., 2017) contains synthetic questions over object relations in images. Closure (Bahdanau et al., 2019) introduces additional question templates that require compositional generalization. We use the original split as our Iid split and the Closure split as a compositional split where the model is evaluated on Closure.

### Baselines.

We compare our approach against the architecture proposed by Herzig and Berant (2021) (SpanBasedSP) as well as the seq2seq baselines they used. In Seq2Seq (Jia and Liang, 2016), the encoder is a bi-LSTM over pre-trained GloVe embeddings (Pennington et al., 2014) or ELMo (Peters et al., 2018) and the decoder is an attention-based LSTM (Bahdanau et al., 2015). BERT2Seq replaces the encoder with BERT-base. GRAMMAR is similar to Seq2Seq but the decoding is constrained by a grammar. BART (Lewis et al., 2020) is pre-trained as a denoising autoencoder.

### Results.

We report the denotation accuracies in Table 1. Our approach outperforms all other methods. In particular, the seq2seq baselines suffer from a significant drop in accuracy on splits that require compositional generalization. While SpanBasedSP is able to generalize, our approach outperforms it. Note that we observed that the GeoQuery execution script used to compute denotation accuracy in previous work contains several bugs that lead to overestimating the true accuracy. Therefore, we also report denotation accuracy with a corrected executor (see Appendix C) for fair comparison with future work.

| | Scan Iid | Scan Right | Scan ARight | GeoQuery Iid | GeoQuery Template | GeoQuery Length | Clevr Iid | Clevr Closure |
|---|---|---|---|---|---|---|---|---|
| **Baselines (denotation accuracy only)** | | | | | | | | |
| Seq2Seq | 99.9 | 11.6 | 0 | 78.5 | 46.0 | 24.3 | 100 | 59.5 |
| + ELMo | 100 | 54.9 | 41.6 | 79.3 | 50.0 | 25.7 | 100 | 64.2 |
| BERT2Seq | 100 | 77.7 | 95.3 | 81.1 | 49.6 | 26.1 | 100 | 56.4 |
| GRAMMAR | 100 | 0.0 | 4.2 | 72.1 | 54.0 | 24.6 | 100 | 51.3 |
| BART | 100 | 50.5 | 100 | 87.1 | 67.0 | 19.3 | 100 | 51.5 |
| SpanBasedSP | 100 | 100 | 100 | 86.1 | 82.2 | 63.6 | 96.7 | 98.8 |
| **Our approach** | | | | | | | | |
| Denotation accuracy | 100 | 100 | 100 | 92.9 | 89.9 | 74.9 | 100 | 99.6 |
| ↳ Corrected executor | | | | 91.8 | 88.7 | 74.5 | | |
| Exact match | 100 | 100 | 100 | 90.7 | 86.2 | 69.3 | 100 | 99.6 |
| ↳ w/o CPLEX heuristic | 100 | 100 | 100 | 90.0 | 83.0 | 67.5 | 100 | 98.0 |


We also report exact match accuracy, with and without the heuristic that constructs integral solutions from fractional ones. The exact match accuracy is always lower than or equal to the denotation accuracy. This shows that our approach can sometimes provide the correct denotation even though the prediction differs from the gold semantic program. Importantly, while our approach outperforms the baselines, its accuracy is still significantly worse on the split that requires generalizing to longer programs.

## 6 Related Work

### Graph-based Methods.

Graph-based methods have been popularized by syntactic dependency parsing (McDonald et al., 2005) where MAP inference is realized via the maximum spanning arborescence algorithm (Chu and Liu, 1965; Edmonds, 1967). A benefit of this algorithm is that it has a 𝒪(*n*^{2}) time-complexity (Tarjan, 1977), i.e., it is more efficient than algorithms exploring more restricted search spaces (Eisner, 1997; Gómez-Rodríguez et al., 2011; Pitler et al., 2012, 2013).

In the case of semantic structures, Kuhlmann and Jonsson (2015) proposed an 𝒪(*n*^{3}) algorithm for the maximum not-necessarily-spanning acyclic graph problem with a noncrossing arc constraint. Without the noncrossing constraint, the problem is known to be NP-hard (Grötschel et al., 1985). To bypass this computational complexity, Dozat and Manning (2018) proposed to handle each dependency as an independent binary classification problem, that is, they do not enforce any constraint on the output structure. Note that, contrary to our work, these approaches allow for reentrancy but do not enforce well-formedness of the output with respect to the semantic grammar. Lyu and Titov (2018) use a similar approach for AMR parsing, where tags are predicted first, followed by arcs, and heuristics are finally used to ensure that the output graph is valid. In contrast, we do not use a pipeline: we focus on joint decoding where the validity of the output is directly encoded in the search space.
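The independent-classification strategy of Dozat and Manning (2018) amounts to thresholding each arc score separately. A minimal sketch (function name and threshold are our own choices) makes explicit that nothing rules out cycles or disconnected outputs:

```python
import numpy as np

def predict_arcs(arc_scores, threshold=0.5):
    """Treat every candidate arc u -> v as an independent binary decision."""
    probs = 1.0 / (1.0 + np.exp(-arc_scores))  # elementwise sigmoid
    return probs > threshold                    # boolean adjacency matrix

# two vertices that strongly select each other: the output contains a cycle,
# since no structural constraint is enforced on the predicted graph
scores = np.array([[-9.0, 2.0],
                   [3.0, -9.0]])
arcs = predict_arcs(scores)
assert arcs[0, 1] and arcs[1, 0]
```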

Previous work in the literature has also considered reduction to graph-based methods for other problems, e.g., for discontinuous constituency parsing (Fernández-González and Martins, 2015; Corro et al., 2017), lexical segmentation (Constant and Le Roux, 2015), and machine translation (Zaslavskiy et al., 2009), *inter alia*.

### Compositional Generalization.

Several authors have observed that the lack of compositional generalization is an important source of error for semantic parsers, especially ones based on seq2seq architectures (Lake and Baroni, 2018; Finegan-Dollak et al., 2018b; Herzig and Berant, 2019; Keysers et al., 2020). Wang et al. (2021) proposed a latent re-ordering step to improve compositional generalization, whereas Zheng and Lapata (2021) relied on latent predicate tagging in the encoder. There has also been interest in using data augmentation to improve generalization (Jia and Liang, 2016; Andreas, 2020; Akyürek et al., 2021; Qiu et al., 2022; Yang et al., 2022).

Recently, Herzig and Berant (2021) showed that span-based parsers do not exhibit such problematic behavior. Unfortunately, these parsers fail to cover the set of semantic structures observed in English treebanks, and we hypothesize that this would be even worse for free word order languages. Our graph-based approach does not exhibit this downside. Previous work by Jambor and Bahdanau (2022) also considered graph-based methods for compositional generalization, but their approach predicts each part independently without any well-formedness or acyclicity constraint.

## 7 Conclusion

In this work, we focused on graph-based semantic parsing for formalisms that do not allow reentrancy. We conducted a complexity study of two inference problems that appear in this setting. We proposed ILP formulations of these problems together with a solver for their linear relaxation based on the conditional gradient method. Experimentally, our approach outperforms comparable baselines.

One downside of our semantic parser is its speed (we parse approximately 5 sentences per second on GeoQuery). However, we hope this work will give a better understanding of the semantic parsing problem, together with a baseline for faster methods.

Future research will investigate extensions for (1) ASTs that contain reentrancies and (2) prediction algorithms for the case where a single word can be the anchor of more than one predicate or entity. These two properties are crucial for semantic representations like Abstract Meaning Representation (Banarescu et al., 2013). Moreover, even though our graph-based semantic parser improves over previous work on length generalization, this setting remains difficult. More generally, research on neural architectures that generalize better to longer sentences is an important direction.

## Acknowledgments

We thank François Yvon and the anonymous reviewers for their comments and suggestions. We thank Jonathan Herzig and Jonathan Berant for fruitful discussions. This work benefited from computations done on the Saclay-IA platform and on the HPC resources of IDRIS under the allocation 2022-AD011013727 made by GENCI.

## Notes

In the NLP community, arborescences are often called (directed) trees. We stick with the term arborescence as it is more standard in the graph theory literature, see for example Schrijver (2003). Using the term tree introduces a confusion between two unrelated algorithms, Kruskal’s maximum spanning tree algorithm (Kruskal, 1956) that operates on undirected graphs and Edmonds’ maximum spanning arborescence algorithm (Edmonds, 1967) that operates on directed graphs. Moreover, this prevents any confusion between the graph object called arborescence and the semantic structure called AST.

There are a few corner cases like exclude_river, for which we simply assume arguments are in the same order as they appear in the input sentence.

The labeling function is unchanged as there is no need for types for vertices in *V̄* ∖ *V*.


### A Proofs

*Proof: Theorem 1*. We prove Theorem 1 by reducing the maximum not-necessarily-spanning arborescence (MNNSA) problem, which is known to be NP-hard (Rao and Sridharan, 2002; Duhamel et al., 2008), to the MGVCNNSA.

Let *G* = 〈*V*, *A*, ** ψ**〉 be a weighted graph where *V* = {0, …, *n*} and ** ψ** ∈ ℝ^{|A|} are arc weights. The MNNSA problem aims to compute the subset of arcs *B* ⊆ *A* such that 〈*V*[*B*], *B*〉 is an arborescence of maximum weight, where its weight is defined as ∑_{a∈B} *ψ*_{a}.

Let 𝒢 = 〈*E*, *T*, *f*_{Type}, *f*_{Args}〉 be a grammar such that *E* = {0, …, *n* − 1}, *T* = {*t*} and ∀*e* ∈ *E* : *f*_{Type}(*e*) = *t* ∧ *f*_{Args}(*e*, *t*) = *e*. Intuitively, a tag *e* ∈ *E* will be associated to vertices that require exactly *e* outgoing arcs.

We construct a clustered labeled weighted graph *G*′ = 〈*V*′, *A*′, *π*, *l*, ** ψ**′〉 as follows. *π* = {*V*_{0}′, …, *V*_{n}′} is a partition of *V*′ such that each cluster *V*_{i}′ contains *n* − 1 vertices and represents the vertex *i* ∈ *V*. The labeling function *l* assigns a different tag to each vertex in a cluster, i.e., ∀*V*_{i}′ ∈ *π*, ∀*u*′, *v*′ ∈ *V*_{i}′ : *u*′ ≠ *v*′ ⇒ *l*(*u*′) ≠ *l*(*v*′). The set of arcs is defined as *A*′ = {*u*′ → *v*′ | ∃ *i* → *j* ∈ *A* s.t. *u*′ ∈ *V*_{i}′ ∧ *v*′ ∈ *V*_{j}′}. The weight vector ** ψ**′ ∈ ℝ^{|A′|} is such that ∀*u*′ → *v*′ ∈ *A*′ : *u*′ ∈ *V*_{u}′ ∧ *v*′ ∈ *V*_{v}′ ⇒ *ψ*′_{u′→v′} = *ψ*_{u→v}.

As such, there is a one-to-one correspondence between solutions of the MNNSA on graph *G* and solutions of the MGVCNNSA on graph *G*′.
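To make the construction concrete, here is a toy sketch of the reduction instance (the helper name is hypothetical; we instantiate the tags of each cluster as {0, …, *n* − 2}, which suffices since each cluster has *n* − 1 vertices):

```python
def reduce_mnnsa_instance(n, weights):
    """Build the clustered labeled graph G' of the Theorem 1 reduction.

    Original vertices are {0, ..., n}; `weights` maps arcs (i, j) to psi_ij.
    Vertex (i, e) of cluster i carries tag e, read as 'exactly e outgoing arcs'.
    """
    clusters = {i: [(i, e) for e in range(n - 1)] for i in range(n + 1)}
    tag = {(i, e): e for i in range(n + 1) for e in range(n - 1)}
    # every original arc i -> j is duplicated between all vertex pairs of the
    # two clusters, and each copy keeps the original weight psi_ij
    arcs = {((i, e), (j, f)): w
            for (i, j), w in weights.items()
            for e in range(n - 1) for f in range(n - 1)}
    return clusters, tag, arcs

clusters, tag, arcs = reduce_mnnsa_instance(3, {(0, 1): 2.0, (1, 2): 1.0})
assert clusters[0] == [(0, 0), (0, 1)]   # cluster of size n - 1 = 2
assert len(arcs) == 8                     # each original arc yields 4 copies
```

Selecting one vertex per used cluster then picks, via its tag, the out-degree of the corresponding original vertex, which is what makes the correspondence one-to-one.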

Note that our proof considers that arcs leaving the root cluster satisfy constraints defined by the grammar, whereas we previously only required the root vertex to have a single outgoing arc. The latter constraint can be added directly in a grammar, but we omit its presentation for brevity. The constrained arity case presented by McDonald and Satta (2007) focuses on spanning arborescences with an arity constraint, by reducing the Hamiltonian path problem to their problem. While the arity constraint is similar in their problem and ours, our proof considers the not-necessarily-spanning case instead of the spanning one. Although the two problems seem related, they need to be studied separately: e.g., computing the maximum spanning arborescence is a polynomial-time problem, whereas computing the MNNSA is NP-hard.

*Proof: Theorem 2*. We prove Theorem 2 by reducing the maximum directed Hamiltonian path problem, which is known to be NP-hard (Garey and Johnson, 1979, Appendix A1.3), to the latent anchoring problem.

Let *G* = 〈*V*, *A*, ** ψ**〉 be a weighted graph where *V* = {1, …, *n*} and ** ψ** ∈ ℝ^{|A|} are arc weights. The maximum Hamiltonian path problem aims to compute the subset of arcs *B* ⊆ *A* such that *V*[*B*] = *V* and 〈*V*[*B*], *B*〉 is a path of maximum weight, where its weight is defined as ∑_{a∈B} *ψ*_{a}.

Let 𝒢 = 〈*E*, *T*, *f*_{Type}, *f*_{Args}〉 be a grammar such that *E* = {0, 1}, *T* = {*t*} and ∀*e* ∈ *E* : *f*_{Type}(*e*) = *t* ∧ *f*_{Args}(*e*, *t*) = *e*.

We construct a clustered labeled weighted graph *G*′ = 〈*V*′, *A*′, *π*, *l*, ** ψ**′〉 as follows. *π* = {*V*_{0}′, …, *V*_{n}′} is a partition of *V*′ such that *V*_{0}′ = {0} and each cluster *V*_{i}′ ≠ *V*_{0}′ contains 2 vertices and represents the vertex *i* ∈ *V*. The labeling function *l* assigns a different tag to each vertex in a cluster except the root, i.e., ∀*V*_{i}′ ∈ *π* with *i* > 0, ∀*u*′, *v*′ ∈ *V*_{i}′ : *u*′ ≠ *v*′ ⇒ *l*(*u*′) ≠ *l*(*v*′). The set of arcs is defined as *A*′ = {0 → *u*′ | *u*′ ∈ *V*′ ∖ {0}} ∪ {*u*′ → *v*′ | *i* → *j* ∈ *A* ∧ *u*′ ∈ *V*_{i}′ ∧ *v*′ ∈ *V*_{j}′}. The weight vector ** ψ**′ ∈ ℝ^{|A′|} is such that ∀*u*′ → *v*′ ∈ *A*′ : *u*′ ∈ *V*_{u}′ ∧ *v*′ ∈ *V*_{v}′ ⇒ *ψ*′_{u′→v′} = *ψ*_{u→v}, and arcs leaving 0 have null weights.

We construct an AST *G*″ = 〈*V*″, *A*″, *l*′〉 such that *V*″ = {1, …, *n*}, *A*″ = {*i* → *i* + 1|1 ≤ *i* < *n*} and the labeling function *l*′ assigns the tag 0 to *n* and the tag 1 to every other vertex.

As such, there is a one-to-one correspondence between solutions of the maximum Hamiltonian path problem on graph *G* and solutions of the mapping of maximum weight of *G*″ with *G*′.
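As with Theorem 1, the construction can be sketched in a few lines (hypothetical helper name; tags are {0, 1}, where tag *e* again means "exactly *e* outgoing arcs"):

```python
def reduce_hampath_instance(n, weights):
    """Build the clusters, arcs, and path AST of the Theorem 2 reduction.

    Original vertices are {1, ..., n}; `weights` maps arcs (i, j) to psi_ij.
    Vertex (i, e) of cluster i carries tag e in {0, 1}.
    """
    clusters = {0: [0]}  # the root cluster contains a single vertex
    clusters.update({i: [(i, 0), (i, 1)] for i in range(1, n + 1)})
    # arcs leaving the root have null weights; every original arc i -> j is
    # duplicated between the vertex pairs of the two clusters
    arcs = {(0, (i, e)): 0.0 for i in range(1, n + 1) for e in (0, 1)}
    arcs.update({((i, e), (j, f)): w
                 for (i, j), w in weights.items()
                 for e in (0, 1) for f in (0, 1)})
    # the AST to anchor: a chain 1 -> 2 -> ... -> n, last vertex tagged 0
    ast_arcs = [(i, i + 1) for i in range(1, n)]
    ast_tags = {i: (0 if i == n else 1) for i in range(1, n + 1)}
    return clusters, arcs, ast_arcs, ast_tags

clusters, arcs, ast_arcs, ast_tags = reduce_hampath_instance(
    3, {(1, 2): 4.0, (2, 3): 1.0})
assert clusters[0] == [0]
assert ast_arcs == [(1, 2), (2, 3)]   # anchoring this chain = finding a path
```

Anchoring the chain AST in *G*′ forces every cluster to be visited exactly once, which is precisely a Hamiltonian path in the original graph.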

### B Experimental Setup

The neural architecture used in our experiments to produce the weights ** μ** and ** ϕ** is composed of: (1) an embedding layer of dimension 100 for Scan, or BERT-base (Devlin et al., 2019) for the other datasets, followed by a bi-LSTM (Hochreiter and Schmidhuber, 1997) with a hidden size of 400; (2) a linear projection of dimension 500 over the output of the bi-LSTM, followed by a Tanh activation and another linear projection of dimension |*E*| to obtain ** ϕ**; (3) a linear projection of dimension 500 followed by a Tanh activation and a bi-affine layer (Dozat and Manning, 2017) to obtain ** μ**.

We apply dropout with a probability of 0.3 over the outputs of BERT-base and the bi-LSTM and after both Tanh activations. The learning rate is 5 × 10^{−4} and each batch is composed of 30 examples. We keep the parameters that obtain the best accuracy on the development set after 25 epochs. Training the model takes between 40 minutes for GeoQuery and 8 hours for Clevr. Note, however, that the bottleneck is the conditional gradient method, which runs on the CPU.
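The bi-affine scorer of step (3) computes, for every head/dependent pair, a bilinear term plus two linear terms. A generic numpy sketch (toy dimensions and names of our own choosing, not our actual implementation) is:

```python
import numpy as np

def biaffine_scores(H_head, H_dep, U, w, b):
    """mu[u, v] = H_head[u] @ U @ H_dep[v] + w[:d] . H_head[u] + w[d:] . H_dep[v] + b."""
    d = H_head.shape[1]
    bilinear = H_head @ U @ H_dep.T                                # (n, n)
    linear = (H_head @ w[:d])[:, None] + (H_dep @ w[d:])[None, :]  # (n, n)
    return bilinear + linear + b

# toy dimensions: 4 words with projections of size 5 (the paper uses 500)
rng = np.random.default_rng(0)
n, d = 4, 5
H_head = np.tanh(rng.standard_normal((n, d)))  # head-role representations
H_dep = np.tanh(rng.standard_normal((n, d)))   # dependent-role representations
U = rng.standard_normal((d, d))
w = rng.standard_normal(2 * d)
mu = biaffine_scores(H_head, H_dep, U, w, 0.0)
assert mu.shape == (n, n)  # one score per candidate arc u -> v
```

Using two separate projections for head and dependent roles lets the same word carry different representations depending on which end of the arc it occupies.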

### C GeoQuery Denotation Accuracy Issue

The denotation accuracy is evaluated by checking whether the denotation returned by an executor is the same when given the gold semantic program and the prediction of the model. It can be higher than the exact match accuracy when different semantic programs yield the same denotation.

When we evaluated our approach using the same executor as the baselines of Herzig and Berant (2021), we observed two main issues regarding the behavior of predicates: (1) several predicates have undefined behaviors (e.g., population_1 and traverse_2 in the case of an argument of type country), in the sense that they are not implemented; (2) the behavior of some predicates is incorrect with respect to their expected semantics (e.g., traverse_1 and traverse_2). These two sources of error yield incorrect denotations for several semantic programs, leading to an overestimation of the denotation accuracy when both the gold and predicted programs accidentally return an empty denotation (potentially for different reasons, due to the aforementioned implementation issues).
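The overestimation mechanism can be illustrated with a toy executor (the predicate names follow the examples above, but the executor and database are hypothetical):

```python
def broken_execute(program, db):
    """Toy executor: unimplemented predicates silently return an empty denotation."""
    predicate, arg = program
    table = db.get(predicate)
    if table is None:            # undefined behavior for missing predicates
        return set()
    return {row for row in table if arg in row}

db = {"capital_1": {("georgia", "atlanta")}}
gold = ("population_1", "georgia")  # not implemented in the toy executor
pred = ("traverse_2", "georgia")    # a different, also unimplemented, program
# both executions return the empty set, so denotation accuracy counts a match
# even though the predicted program differs from the gold one
assert broken_execute(gold, db) == broken_execute(pred, db) == set()
assert gold != pred
```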

We implemented a corrected executor addressing the issues that we found. It is available here: https://github.com/alban-petit/geoquery-funql-executor.

## Author notes

Action Editor: James Henderson