On Graph-based Reentrancy-free Semantic Parsing

We propose a novel graph-based approach for semantic parsing that resolves two problems observed in the literature: (1) seq2seq models fail on compositional generalization tasks; (2) previous work using phrase structure parsers cannot cover all the semantic parses observed in treebanks. We prove that both MAP inference and latent tag anchoring (required for weakly-supervised learning) are NP-hard problems. We propose two optimization algorithms based on constraint smoothing and conditional gradient to approximately solve these inference problems. Experimentally, our approach delivers state-of-the-art results on GeoQuery, Scan, and Clevr, both for i.i.d. splits and for splits that test for compositional generalization.


Introduction
Semantic parsing aims to transform a natural language utterance into a structured representation that can be easily manipulated by software (for example, to query a database). As such, it is a central task in human-computer interfaces. Andreas et al. (2013) first proposed to rely on machine translation models for semantic parsing, where the target representation is linearized and treated as a foreign language. Due to recent advances in deep learning, and especially in sequence-to-sequence (seq2seq) with attention architectures for machine translation (Bahdanau et al., 2015), it is appealing to use the same architectures for standard structured prediction problems (Vinyals et al., 2015). This approach is indeed common in semantic parsing (Jia and Liang, 2016; Dong and Lapata, 2016; Wang et al., 2020), inter alia. Unfortunately, there are well-known limitations to seq2seq architectures for semantic parsing. First, at test time, the decoding algorithm is typically based on beam search, as the model is autoregressive and does not make any independence assumption. In case of prediction failure, it is therefore unknown whether this is due to errors in the weighting function or to the optimal solution falling out of the beam. Secondly, seq2seq models are known to fail when compositional generalization is required (Lake and Baroni, 2018; Finegan-Dollak et al., 2018a; Keysers et al., 2020).
In order to bypass these problems, Herzig and Berant (2021) proposed to represent the semantic content associated with an utterance as a phrase structure, i.e. using the same representation usually associated with syntactic constituents. As such, their semantic parser is based on standard span-based decoding algorithms (Hall et al., 2014; Stern et al., 2017; Corro, 2020) with additional well-formedness constraints from the semantic formalism. Given a weighting function, MAP inference is a polynomial time problem that can be solved via a variant of the CYK algorithm (Kasami, 1965; Younger, 1967; Cocke, 1970). Experimentally, Herzig and Berant (2021) show that their approach outperforms seq2seq models in terms of compositional generalization, therefore effectively bypassing the two major problems of these architectures.
The complexity of MAP inference for phrase structure parsing is directly impacted by the search space considered (Kallmeyer, 2010). Importantly, (ill-nested) discontinuous phrase structure parsing is known to be NP-hard, even with a bounded block-degree (Satta, 1992). Herzig and Berant (2021) explore two restricted inference algorithms, both of which have a cubic time complexity with respect to the input length. The first one only considers continuous phrase structures, i.e. derived trees that could have been generated by a context-free grammar, and the second one also considers a specific type of discontinuities, see Corro (2020, Section 3.6). Both algorithms fail to cover the full set of phrase structures observed in semantic treebanks, see Figure 1.

Figure 1: Example of a semantic phrase structure from GEOQUERY for the sentence "What state has the most major cities?". This structure is outside of the search space of the parser of Herzig and Berant (2021) as the constituent in red is discontinuous and also has a discontinuous parent (in red+green).
In this work, we propose to reduce semantic parsing without reentrancy (i.e. a given predicate or entity cannot be used as an argument for two different predicates) to a bi-lexical dependency parsing problem. As such, we tackle the same semantic content as the aforementioned previous work but using a different mathematical representation (Rambow, 2010). We identify two main benefits to our approach: (1) as we allow crossing arcs, i.e. "non-projective graphs", all datasets are guaranteed to be fully covered; and (2) it allows us to rely on optimization methods to tackle inference intractability of our novel graph-based formulation of the problem. More specifically, in our setting we need to jointly assign predicates/entities to words that convey a semantic content and to identify arguments of predicates via bi-lexical dependencies. We show that MAP inference in this setting is equivalent to the maximum generalized spanning arborescence problem (Myung et al., 1995) with supplementary constraints to ensure well-formedness with respect to the semantic formalism. Although this problem is NP-hard, we propose an optimization algorithm that solves a linear relaxation of the problem and can deliver an optimality certificate.
Our contributions can be summarized as follows:

• We propose a novel graph-based approach for semantic parsing without reentrancy;
• We prove the NP-hardness of MAP inference and latent anchoring inference;
• We propose a novel integer linear programming formulation for this problem together with an approximate solver based on conditional gradient and constraint smoothing;
• We tackle the training problem using variational approximations of objective functions, including the weakly-supervised scenario;
• We evaluate our approach on GEOQUERY, SCAN and CLEVR and observe that it outperforms baselines on both i.i.d. splits and splits that test for compositional generalization.
Code to reproduce the experiments is available online.

Graph-based semantic parsing
We propose to reduce semantic parsing to parsing the abstract syntax tree (AST) associated with a semantic program. We focus on semantic programs whose ASTs do not have any reentrancy, i.e. a single predicate or entity cannot be the argument of two different predicates. Moreover, we assume that each predicate or entity is anchored on exactly one word of the sentence and that each word can be the anchor of at most one predicate or entity. As such, the semantic parsing problem can be reduced to assigning predicates and entities to words and identifying arguments via dependency relations, see Figure 2.
In order to formalize our approach to the semantic parsing problem, we will use concepts from graph theory. We therefore first introduce the vocabulary and notions that will be useful in the rest of this article. Notably, the notions of cluster and generalized arborescence will be used to formalize our prediction problem.

Notations and definitions. Let G = ⟨V, A⟩ be a directed graph with vertices V and arcs A ⊆ V × V. Given U ⊆ V, we denote σ⁺_G(U) (resp. σ⁻_G(U)) the set of arcs leaving one vertex of U and entering one vertex of V \ U (resp. leaving one vertex of V \ U and entering one vertex of U) in the graph G. Let B ⊆ A be a subset of arcs. We denote V[B] the cover set of B, i.e. the set of vertices that appear as an extremity of at least one arc in B. A graph G = ⟨V, A⟩ is an arborescence² rooted at u ∈ V if and only if (iff) it contains |V| − 1 arcs and there is a directed path from u to each vertex in V. In the rest of this work, we will assume that the root is always vertex 0. Let π = {V_0, ..., V_n} be a partition of V containing n + 1 clusters. The graph ⟨V[B], B⟩ is a generalized not-necessarily-spanning arborescence (resp. generalized spanning arborescence) on the partition π of G iff it is an arborescence and V[B] contains at most one vertex per cluster in π (resp. contains exactly one).

² In the NLP community, arborescences are often called (directed) trees. We stick with the term arborescence as it is more standard in the graph theory literature, see for example Schrijver (2003). Using the term tree would introduce a confusion between two unrelated algorithms: Kruskal's maximum spanning tree algorithm (Kruskal, 1956), which operates on undirected graphs, and Edmonds' maximum spanning arborescence algorithm (Edmonds, 1967), which operates on directed graphs. Moreover, this prevents any confusion between the graph object called arborescence and the semantic structure called AST.
Let W ⊆ V be a set of vertices. Contracting W consists in replacing in G the set W by a new vertex w ∉ V, replacing all the arcs u → v ∈ σ⁻(W) by an arc u → w and all the arcs u → v ∈ σ⁺(W) by an arc w → v. Given a graph with partition π, the contracted graph is the graph where each cluster in π has been contracted. While contracting a graph may introduce parallel arcs, this is not an issue in practice, even for weighted graphs.
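For illustration, the arborescence definition translates directly into code; the following is a minimal sketch (our own illustration, not code from the paper):

```python
from collections import defaultdict

def is_arborescence(arcs, root):
    """An arc set B is an arborescence rooted at `root` iff every covered
    vertex except the root has exactly one incoming arc and every covered
    vertex is reachable from the root (hence |B| = |V[B]| - 1)."""
    vertices = {v for arc in arcs for v in arc}
    indeg = defaultdict(int)
    children = defaultdict(list)
    for u, v in arcs:
        indeg[v] += 1
        children[u].append(v)
    if indeg[root] != 0 or any(indeg[v] != 1 for v in vertices if v != root):
        return False
    # Depth-first search from the root: every covered vertex must be reachable.
    seen, stack = {root}, [root]
    while stack:
        u = stack.pop()
        for v in children[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == vertices

print(is_arborescence([(0, 1), (0, 2), (2, 3)], root=0))  # True
print(is_arborescence([(0, 1), (2, 3)], root=0))  # False: vertex 2 has no incoming arc
```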

Semantic grammar and AST.
The semantic programs we focus on take the form of a functional language, i.e. a representation where each predicate is a function that takes other predicates or entities as arguments. The semantic language is typed, in the same sense as in typed programming languages. For example, in GEOQUERY, the predicate capital_2 expects an argument of type city and returns an object of type state. In the datasets we use, the typing system disambiguates the position of arguments in a function: for a given function, either all arguments are of the same type or the order of arguments is unimportant. An example of both is the predicate intersection_river in GEOQUERY that takes two arguments of type river, but the result of the execution is unchanged if the arguments are swapped.

Formally, we define the set of valid semantic programs as the set of programs that can be produced with a semantic grammar G = ⟨E, T, f_TYPE, f_ARGS⟩ where:

• E is the set of predicates and entities, which we will refer to as the set of tags; w.l.o.g. we assume that ROOT ∉ E, where ROOT is a special tag used for parsing;
• T is the set of types;
• f_TYPE : E → T is a typing function that assigns a type to each tag;
• f_ARGS : E × T → ℕ is a valency function that assigns the number of expected arguments of a given type to each tag.
A tag e ∈ E is an entity iff ∀t ∈ T : f_ARGS(e, t) = 0. Otherwise, e is a predicate. A semantic program in a functional language can be equivalently represented as an AST, a graph where instances of predicates and entities are represented as vertices and where arcs identify arguments of predicates. Formally, an AST is a labeled graph G = ⟨V, A, l⟩ where the function l : V → E assigns a tag to each vertex and arcs identify the arguments of tags, see Figure 2. An AST G is well-formed with respect to the grammar G iff G is an arborescence and the valency and type constraints are satisfied, i.e.:

∀u ∈ V, t ∈ T : f_ARGS(l(u), t) = |{ u → v ∈ A | f_TYPE(l(v)) = t }|.
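For illustration, the valency/type constraint can be checked with a few lines of Python. The grammar fragment below (tags and types borrowed from the GEOQUERY examples above) is a hypothetical sketch, and the arborescence check is omitted:

```python
from collections import Counter

# Hypothetical grammar fragment: f_TYPE assigns a type to each tag,
# f_ARGS gives the number of expected arguments of each type.
F_TYPE = {"capital_2": "state", "city_all": "city"}
F_ARGS = {("capital_2", "city"): 1}  # capital_2 expects exactly one city argument

def satisfies_valency(vertices, arcs, label):
    """For every vertex u and type t, the number of children of u whose tag
    has type t must equal f_ARGS(l(u), t)."""
    for u in vertices:
        child_types = Counter(F_TYPE[label[v]] for (p, v) in arcs if p == u)
        for t in set(F_TYPE.values()):
            if child_types.get(t, 0) != F_ARGS.get((label[u], t), 0):
                return False
    return True

print(satisfies_valency([0, 1], [(0, 1)], {0: "capital_2", 1: "city_all"}))  # True
print(satisfies_valency([0], [], {0: "capital_2"}))  # False: missing city argument
```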

Problem reduction and complexity
In our setting, semantic parsing is a joint sentence tagging and dependency parsing problem (Bohnet and Nivre, 2012; Li et al., 2011; Corro et al., 2017): each content word (i.e. each word that conveys a semantic meaning) must be tagged with a predicate or an entity, and dependencies between content words identify arguments of predicates, see Figure 2. However, our semantic parsing setting differs from standard syntactic analysis in two ways: (1) the resulting structure is not-necessarily-spanning: there are words (e.g. function words) that must not be tagged and that do not have any incident dependency, and those words are not known in advance, they must be identified jointly with the rest of the structure; (2) the dependency structure is highly constrained by the typing mechanism, that is, the predicted structure must be a valid AST. Nevertheless, similarly to the aforementioned works, our parser is graph-based, that is, for a given input we build a (complete) directed graph and decoding is reduced to computing a constrained subgraph of maximum weight.

Given a sentence w = w_1 ... w_n with n words and a grammar G, we construct a clustered labeled graph G = ⟨V, A, π, l⟩ as follows. The partition π = {V_0, ..., V_n} contains n + 1 clusters, where V_0 is a root cluster and each cluster V_i, i ≠ 0, is associated to word w_i. The root cluster V_0 = {0} contains a single vertex that will be used as the root, and every other cluster contains |E| vertices: the extended labeling function l assigns a distinct tag to each vertex of a cluster, so that tagging word w_i with tag e is equivalent to selecting the corresponding vertex in V_i.

Let B ⊆ A be a subset of arcs. The graph ⟨V[B], B⟩ defines a 0-rooted generalized valency-constrained not-necessarily-spanning arborescence iff it is a generalized arborescence of G, there is exactly one arc leaving 0, and the sub-arborescence rooted at the destination of that arc is a valid AST with respect to the grammar G. As such, there is a one-to-one correspondence between ASTs anchored on the sentence w and generalized valency-constrained not-necessarily-spanning arborescences in the graph G, see Figure 3b.
For any sentence w, our aim is to find the AST that most likely corresponds to it. Thus, after building the graph G as explained above, the neural network described in Appendix B is used to produce a vector of weights µ ∈ ℝ^|V| associated to the set of vertices V and a vector of weights φ ∈ ℝ^|A| associated to the set of arcs A. Given these weights, graph-based semantic parsing is reduced to an optimization problem called the maximum generalized valency-constrained not-necessarily-spanning arborescence (MGVCNNSA) in the graph G.

Theorem 1. Computing the MGVCNNSA is NP-hard.
The proof is in Appendix A.

Mathematical program
Our graph-based approach to semantic parsing has allowed us to prove the intrinsic hardness of the problem. We follow previous work on graph-based parsing (Martins et al., 2009; Koo et al., 2010), inter alia, by proposing an integer linear programming (ILP) formulation in order to compute (approximate) solutions.
Remember that in the joint tagging and dependency parsing interpretation of the semantic parsing problem, the resulting structure is not-necessarily-spanning, meaning that some words may not be tagged. In order to rely on well-known algorithms for computing spanning arborescences as a subroutine of our approximate solver, we first introduce the notion of extended graph. Given a graph G = ⟨V, A, π, l⟩, we construct an extended graph Ḡ = ⟨V̄, Ā, π̄, l̄⟩ containing n additional vertices {1̄, ..., n̄} that are distributed along clusters, i.e. π̄ = {V_0, V_1 ∪ {1̄}, ..., V_n ∪ {n̄}}, and arcs from the root to these extra vertices, i.e. Ā = A ∪ {0 → ī | 1 ≤ i ≤ n}.

Let B ⊆ A be a subset of arcs such that ⟨V[B], B⟩ is a generalized not-necessarily-spanning arborescence on G. Let B̄ ⊆ Ā be the subset of arcs defined as B̄ = B ∪ {0 → ī | V_i ∩ V[B] = ∅}, i.e. B augmented with an arc from the root to the extra vertex of every cluster not covered by B. Then, there is a one-to-one correspondence between generalized not-necessarily-spanning arborescences ⟨V[B], B⟩ and generalized spanning arborescences ⟨V̄[B̄], B̄⟩, see Figure 3b.
Let x ∈ {0, 1}^|V̄| and y ∈ {0, 1}^|Ā| be variable vectors indexed by vertices and arcs such that a vertex v ∈ V̄ (resp. an arc a ∈ Ā) is selected iff x_v = 1 (resp. y_a = 1). The set of 0-rooted generalized valency-constrained spanning arborescences on Ḡ can be written as the set of variables x, y satisfying the following linear constraints. First, we restrict y to structures that are spanning arborescences over Ḡ where clusters have been contracted:

Σ_{a ∈ Ā} y_a = n,  (1)
Σ_{a ∈ σ⁻(∪_{V_i ∈ π′} V_i)} y_a ≥ 1  ∀π′ ⊆ π̄ \ {V_0}, π′ ≠ ∅,  (2)
Σ_{a ∈ σ⁻(V_i)} y_a = 1  ∀V_i ∈ π̄ \ {V_0}.  (3)

Constraints (2) ensure that the contracted graph is weakly connected. Constraints (3) force each cluster to have exactly one incoming arc. The set of vectors y that satisfy these three constraints is exactly the set of 0-rooted spanning arborescences on the contracted graph, see Schrijver (2003, Section 52.4) for an in-depth analysis of this polytope. The root vertex is always selected and other vertices are selected iff they have one incoming selected arc:

x_0 = 1,  (4)
x_u = Σ_{a ∈ σ⁻({u})} y_a  ∀u ∈ V̄ \ {0}.  (5)

Note that constraints (1)-(3) do not force selected arcs to leave from a selected vertex, as they operate at the cluster level. This property will be enforced via the valency constraints:

Σ_{u ∈ V \ {0}} y_{0 → u} = 1,  (6)
Σ_{u → v ∈ A : f_TYPE(l(v)) = t} y_{u → v} = f_ARGS(l(u), t) x_u  ∀u ∈ V \ {0}, t ∈ T.  (7)

Constraint (6) forces the root to have exactly one outgoing arc into a vertex u ∈ V \ {0} (i.e. a vertex that is not part of the extra vertices introduced in the extended graph) that will be the root of the AST. Constraints (7) force the selected vertices and arcs to produce a well-formed AST with respect to the grammar G. Note that these constraints are only defined for vertices in V \ {0}, i.e. they are defined neither for the root vertex nor for the extra vertices introduced in the extended graph.
To simplify notations, we introduce the following sets:

C^(sa) = { ⟨x, y⟩ | x and y satisfy (1)-(5) },
C^(val) = { ⟨x, y⟩ | x and y satisfy (6)-(7) },

and C = C^(sa) ∩ C^(val). Given vertex weights µ ∈ ℝ^|V̄| and arc weights φ ∈ ℝ^|Ā|, computing the MGVCNNSA is equivalent to solving the following ILP:

(ILP1)  max_{x, y} µ⊤x + φ⊤y  s.t. ⟨x, y⟩ ∈ C^(sa) and ⟨x, y⟩ ∈ C^(val).

Without the constraint ⟨x, y⟩ ∈ C^(val), the problem would be easy to solve. The set C^(sa) is the set of spanning arborescences over the contracted graph, hence to maximize over this set we can simply: (1) contract the graph and assign to each arc in the contracted graph the weight of its corresponding arc plus the weight of its destination vertex in the original graph; (2) run the maximum spanning arborescence algorithm (MSA; Edmonds, 1967; Tarjan, 1977) on the contracted graph, which has a O(n²) time-complexity. This process is illustrated in Figure 5 (top). Note that the contracted graph may have parallel arcs, which is not an issue in practice as only the one of maximum weight can appear in a solution of the MSA.
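For illustration, here is a sketch of this easy subproblem using networkx's implementation of Edmonds' algorithm; the data layout (cluster lists, weight dictionaries) is a hypothetical choice made for the example:

```python
import networkx as nx

def msa_over_contracted(clusters, mu, phi):
    """clusters[i]: list of vertices of cluster V_i (clusters[0] = [0], the root).
    mu[v]: vertex weight; phi[(u, v)]: arc weight.
    Returns the arcs of the original graph selected by the MSA over the
    contracted graph. Constraints (6)-(7) are NOT enforced here."""
    cluster_of = {v: i for i, cl in enumerate(clusters) for v in cl}
    contracted = nx.DiGraph()
    contracted.add_nodes_from(range(len(clusters)))
    for (u, v), w in phi.items():
        i, j = cluster_of[u], cluster_of[v]
        if i == j or j == 0:  # no self-loops, no arc entering the root cluster
            continue
        score = w + mu[v]  # fold the destination vertex weight into the arc
        # Parallel arcs: keep only the heaviest arc between two clusters.
        if not contracted.has_edge(i, j) or score > contracted[i][j]["weight"]:
            contracted.add_edge(i, j, weight=score, arc=(u, v))
    msa = nx.maximum_spanning_arborescence(contracted, attr="weight",
                                           preserve_attrs=True)
    return [msa[i][j]["arc"] for i, j in msa.edges]
```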
We have established that MAP inference in our semantic parsing framework is an NP-hard problem. We proposed an ILP formulation of the problem that would be easy to solve if some constraints were removed. This property suggests the use of an approximation algorithm that introduces the difficult constraints as penalties. As a similar setting arises from our weakly-supervised loss function, the presentation of the approximation algorithm is deferred until Section 4.

Supervised training objective
We define the likelihood of a pair ⟨x, y⟩ ∈ C via the Boltzmann distribution:

p_{µ,φ}(x, y) = exp(µ⊤x + φ⊤y − c(µ, φ)),

where c(µ, φ) is the log-partition function:

c(µ, φ) = log Σ_{⟨x, y⟩ ∈ C} exp(µ⊤x + φ⊤y).

During training, we aim to maximize the log-likelihood of the training dataset. The log-likelihood of an observation ⟨x, y⟩ is defined as:

log p_{µ,φ}(x, y) = µ⊤x + φ⊤y − c(µ, φ).

Unfortunately, computing the log-partition function is intractable as it requires summing over all feasible solutions. Instead, we rely on a surrogate lower bound as an objective function. To this end, we derive an upper bound (because it is negated in the log-likelihood) to the second term: a sum of log-sum-exp functions that sums over each cluster of vertices independently and over incoming arcs in each cluster independently, which is tractable. This loss can be understood as a generalization of the head selection loss used in dependency parsing (Zhang et al., 2017). We now detail the derivation and prove that it is an upper bound to the log-partition function.
Let U be a matrix such that each row contains a pair ⟨x, y⟩ ∈ C and let ∆^|C| be the simplex of dimension |C| − 1, i.e. the set of all stochastic vectors of dimension |C|. The log-partition function can then be rewritten using its variational formulation:

c(µ, φ) = max_{p ∈ ∆^|C|} p⊤U [µ; φ] + H[p],

where H[p] = −Σ_i p_i log p_i is the Shannon entropy. We refer the reader to Boyd and Vandenberghe (2004, Example 3.25), Wainwright and Jordan (2008, Section 3.6) and Beck (2017, Section 4.4.10). Note that this formulation remains impractical as p has an exponential size. Let M = conv(C) be the marginal polytope, i.e. the convex hull of the feasible integer solutions. We can rewrite the above variational formulation as:

c(µ, φ) = max_{⟨x̄, ȳ⟩ ∈ M} µ⊤x̄ + φ⊤ȳ + H_M[x̄, ȳ],

where H_M is a joint entropy function defined such that the equality holds. The maximization in this reformulation acts on the marginal probabilities of parts (vertices and arcs) and therefore has a polynomial number of variables. We refer the reader to Wainwright and Jordan (2008, Section 5). In particular, we observe that each pair ⟨x, y⟩ ∈ C has exactly one vertex selected per cluster V_i ∈ π and one incoming arc selected per cluster V_i ∈ π \ {V_0}. We denote C^(one) the set of all the pairs ⟨x, y⟩ that satisfy these constraints. By using L = conv(C^(one)) as an outer approximation to the marginal polytope (see Figure 4), the optimization problem can be rewritten as a sum of independent problems. As each of these problems is the variational formulation of a log-sum-exp term, the upper bound on c(µ, φ) can be expressed as a sum of log-sum-exp functions, one over vertices in each cluster V_i ∈ π \ {V_0} and one over incoming arcs σ⁻_G(V_i) for each cluster V_i ∈ π \ {V_0}. Although this type of approximation may not result in a Bayes consistent loss (Corro, 2023), it works well in practice.
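For illustration, the resulting surrogate loss can be sketched in a few lines of PyTorch; the per-cluster tensor layout below is an assumption made for the example:

```python
import torch

def surrogate_nll(vertex_scores, arc_scores, gold_vertex, gold_arc):
    """vertex_scores[i]: tensor of scores of the vertices in cluster V_i (i >= 1);
    arc_scores[i]: tensor of scores of the arcs entering cluster V_i.
    gold_vertex[i] / gold_arc[i]: indices of the gold vertex and gold incoming arc.
    The intractable log-partition is replaced by a sum of per-cluster
    log-sum-exp terms (head-selection style)."""
    loss = 0.0
    for i in range(len(vertex_scores)):
        gold = vertex_scores[i][gold_vertex[i]] + arc_scores[i][gold_arc[i]]
        bound = torch.logsumexp(vertex_scores[i], dim=0) \
              + torch.logsumexp(arc_scores[i], dim=0)
        loss = loss + bound - gold  # upper bound on the negative log-likelihood
    return loss
```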

Weakly-supervised training objective
Unfortunately, training data often does not include gold pairs ⟨x, y⟩ but instead only the AST, without word anchors (or word alignment). This is the case for the three datasets we use in our experiments. We thus consider our training signal to be the set of all structures that induce the annotated AST, which we denote C*.
The weakly-supervised loss is defined as log Σ_{⟨x, y⟩ ∈ C*} p_{µ,φ}(x, y), i.e. we marginalize over all the structures that induce the gold AST. We can rewrite this loss as:

log Σ_{⟨x, y⟩ ∈ C*} exp(µ⊤x + φ⊤y) − c(µ, φ).

The two terms are intractable. We approximate the second term using the bound defined in Section 3.1. We now derive a tractable lower bound for the first term. Let q be a proposal distribution such that q(x, y) = 0 if ⟨x, y⟩ ∉ C*. We derive the following lower bound via Jensen's inequality:

log Σ_{⟨x, y⟩ ∈ C*} exp(µ⊤x + φ⊤y) ≥ Σ_{⟨x, y⟩ ∈ C*} q(x, y)(µ⊤x + φ⊤y) + H[q].

This bound holds for any distribution q satisfying the aforementioned condition. We choose to maximize this lower bound using a distribution that gives a probability of one to a single structure, as in "hard" EM (Neal and Hinton, 1998, Section 6).
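For completeness, the Jensen step can be written out explicitly (using that q is supported on C*):

```latex
\log \sum_{(x,y)\in C^*} \exp(\mu^\top x + \phi^\top y)
  = \log \sum_{(x,y)\in C^*} q(x,y)\,
        \frac{\exp(\mu^\top x + \phi^\top y)}{q(x,y)}
  \geq \sum_{(x,y)\in C^*} q(x,y)\,(\mu^\top x + \phi^\top y)
     - \sum_{(x,y)\in C^*} q(x,y)\,\log q(x,y),
```

where the inequality is Jensen's inequality applied to the concave logarithm and the last term is the entropy H[q].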
For a given sentence w, let G = ⟨V, A, π, l⟩ be a graph defined as in Section 2.2 and let G′ = ⟨V′, A′, l′⟩ be an AST defined as in Section 2.1. We aim to find the GVCNNSA in G of maximum weight whose induced AST is exactly G′. This is equivalent to aligning each vertex in V′ with one vertex of V \ {0} such that there is at most one vertex per cluster of π appearing in the alignment, and where the weight of an alignment is defined as follows:

1. for each vertex u′ ∈ V′, we add the weight of the vertex u ∈ V it is aligned to; moreover, if u′ is the root of the AST, we also add the weight of the arc 0 → u;
2. for each arc u′ → v′ ∈ A′, we add the weight of the arc u → v, where u ∈ V (resp. v ∈ V) is the vertex that u′ (resp. v′) is aligned with.
Note that this latent anchoring inference consists in computing a (partial) alignment between the vertices of G and G′, but the fact that we need to take into account arc weights forbids the use of the Kuhn-Munkres algorithm (Kuhn, 1955).
Theorem 2. Computing the anchoring of maximum weight of an AST with a graph G is NP-hard.
The proof is in Appendix A. Therefore, we propose an optimization-based approach to compute the distribution q.

Algorithm 1: Unconstrained alignment of maximum weight between a graph G and an AST G′. The table CHART[u′, u] is filled by visiting the vertices of the AST in reverse topological order; a vertex u′ can only be mapped to a vertex u if they have the same tag, and the score of the best alignment is read at the root r′ ∈ V′ of the AST.

Algorithm 2: Conditional gradient method. At each iteration, the LMO is called to obtain a direction d; if the dual gap is small, the algorithm stops; otherwise the optimal step size γ is computed or approximated and the current point is updated as z^(k+1) = z^(k) + γd.

Note that the problem has a constraint requiring each cluster V_i ∈ π to be aligned with at most one vertex v′ ∈ V′, i.e. each word in the sentence can be aligned with at most one vertex in the AST. If we remove this constraint, then the problem becomes tractable via dynamic programming. Indeed, we can recursively construct a table CHART[u′, u], u′ ∈ V′ and u ∈ V, containing the score of aligning vertex u′ to vertex u plus the score of the best alignment of all the descendants of u′. To this end, we simply visit the vertices V′ of the AST in reverse topological order, see Algorithm 1. The best alignment can be retrieved via back-pointers.
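Under this relaxation, the dynamic program of Algorithm 1 can be sketched as follows; this is a minimal illustration with hypothetical data structures (it assumes φ is defined for every ordered pair of distinct vertices, and it omits back-pointers):

```python
def unconstrained_alignment(ast_vertices, ast_arcs, graph_vertices,
                            same_tag, mu, phi):
    """ast_vertices: AST vertices in topological order (parents first).
    same_tag(u_ast, u): whether graph vertex u carries the tag of u_ast.
    mu[u]: vertex weight; phi[(u, v)]: arc weight.
    Returns CHART with CHART[u_ast][u] = best score of anchoring u_ast on u
    together with all its descendants (cluster constraint ignored)."""
    chart = {}
    for u_ast in reversed(ast_vertices):  # reverse topological order
        chart[u_ast] = {}
        children = [v for (p, v) in ast_arcs if p == u_ast]
        for u in graph_vertices:
            if not same_tag(u_ast, u):
                continue  # tags must match
            score, feasible = mu[u], True
            for c in children:
                # Best anchor for child c, plus the weight of the arc u -> anchor.
                cand = [chart[c][v] + phi[(u, v)] for v in chart[c] if v != u]
                if not cand:
                    feasible = False
                    break
                score += max(cand)
            if feasible:
                chart[u_ast][u] = score
    return chart
```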
Computing q is therefore equivalent to solving the following ILP:

(ILP2)  max_{x, y} µ⊤x + φ⊤y  s.t. ⟨x, y⟩ ∈ C*(relaxed) and Σ_{v ∈ V_i} x_v ≤ 1  ∀V_i ∈ π \ {V_0},

where the set C*(relaxed) is the set of feasible solutions of the dynamic program in Algorithm 1, whose convex hull can be described via linear constraints (Martin et al., 1990).

Efficient inference
In this section, we propose an efficient way to solve the linear relaxations of MAP inference (ILP1) and latent anchoring inference (ILP2) via constraint smoothing and the conditional gradient method.
We focus on problems of the following form:

max_z f(z)  s.t. z ∈ conv(C^(easy)) and Az = b (or Az ≤ b),

where the vector z is the concatenation of the vectors x and y defined previously and conv denotes the convex hull of a set. We explained previously that if the set of constraints of the form Az = b for (ILP1) or Az ≤ b for (ILP2) was absent, the problem would be easy to solve under a linear objective function. In fact, there exists an efficient linear maximization oracle (LMO), i.e. a function that returns the optimal integral solution, for the set conv(C^(easy)). This setting covers both (ILP1) and (ILP2), where we have C^(easy) = C^(sa) and C^(easy) = C*(relaxed), respectively. An appealing approach in this setting is to introduce the problematic constraints as penalties in the objective:

max_z f(z) − δ_S(Az)  s.t. z ∈ conv(C^(easy)),

where δ_S is the indicator function of the set S:

δ_S(u) = 0 if u ∈ S, +∞ otherwise.

In the equality case, we use S = {b} and in the inequality case, we use S = {u | u ≤ b}.
Figure 5: Illustration of the approximate inference algorithm on the two-word sentence "List states", where we assume the grammar has one entity state_all and one predicate loc_1 that takes exactly one entity as argument. The left graph is the extended graph for the sentence, including vertex and arc weights (in black). If we ignore constraints (6)-(7), inference is reduced to computing the MSA on the contracted graph (solid arcs in the middle column). This may lead to solutions that do not satisfy constraints (6)-(7) on the expanded graph (top example). However, the gradient of the smoothed constraint (7) will induce penalties (in red) on vertex and arc scores that will encourage the loc_1 predicate to either be dropped from the solution or to have an outgoing arc to a state_all argument. Computing the MSA on the contracted graph with penalties results in a solution that satisfies constraints (6)-(7) (bottom example).

Conditional gradient method
Given a proper, smooth and differentiable function g and a nonempty, bounded, closed and convex set conv(C^(easy)), the conditional gradient method (a.k.a. Frank-Wolfe; Frank and Wolfe, 1956; Levitin and Polyak, 1966; Lacoste-Julien and Jaggi, 2015) can be used to solve optimization problems of the following form:

max_z g(z)  s.t. z ∈ conv(C^(easy)).

Contrary to the projected gradient method, this approach does not require computing projections onto the feasible set conv(C^(easy)), which is, in most cases, computationally expensive. Instead, the conditional gradient method only relies on an LMO:

LMO(z) = argmax_{s ∈ conv(C^(easy))} ⟨∇g(z), s⟩.

The algorithm constructs a solution to the original problem as a convex combination of elements returned by the LMO. The pseudo-code is given in Algorithm 2. An interesting property of this method is that its step size range is bounded. This allows for simple line-search techniques.
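A generic sketch of the method, assuming a gradient oracle and an LMO are available as callables; the 2/(k+2) step size used here is the textbook default, whereas the text above computes or approximates the optimal one:

```python
import numpy as np

def conditional_gradient(grad, lmo, z0, max_iter=100, tol=1e-6):
    """grad(z): gradient of the smooth objective g at z (numpy array).
    lmo(g): argmax over conv(C_easy) of <g, s>, returned as a vertex of the
    polytope. The iterate is a convex combination of LMO outputs, so it
    always stays feasible."""
    z = np.asarray(z0, dtype=float)
    for k in range(max_iter):
        g = grad(z)
        s = lmo(g)             # linear maximization oracle
        d = s - z              # Frank-Wolfe direction
        if g @ d <= tol:       # duality gap certificate: stop if small
            break
        gamma = 2.0 / (k + 2)  # default step size; line search also possible
        z = z + gamma * d
    return z
```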

Smoothing
Unfortunately, the function g(z) = f(z) − δ_S(Az) is non-smooth due to the indicator function term, preventing the use of the conditional gradient method. We propose to rely on the framework proposed by Yurtsever et al. (2018), where the indicator function is replaced by a smooth approximation. The indicator function of the set S can be rewritten as:

δ_S(u) = δ**_S(u) = sup_t ⟨u, t⟩ − σ_S(t),

where δ**_S denotes the Fenchel biconjugate of the indicator function and σ_S(u) = sup_{t ∈ S} ⟨u, t⟩ is the support function of S. More details can be found in Beck (2017, Sections 4.1 and 4.2). In order to smooth the indicator function, we add a β-parameterized convex regularizer −(β/2)‖·‖₂² to its Fenchel biconjugate:

δ^β_S(u) = sup_t ⟨u, t⟩ − σ_S(t) − (β/2)‖t‖₂²,

where β > 0 controls the quality and the smoothness of the approximation (Nesterov, 2005).
Equalities. In the case where S = {b}, with a few computations that are detailed by Yurtsever et al. (2018), we obtain:

δ^β_{{b}}(Az) = (1/(2β)) ‖Az − b‖₂².

That is, we have a quadratic penalty term in the objective. Note that this term is similar to the term introduced in an augmented Lagrangian (Nocedal and Wright, 1999, Equation 17.36), and adds a penalty in the objective for vectors z s.t. Az ≠ b.
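The computation can be reproduced in two lines: since σ_{{b}}(t) = ⟨b, t⟩, the supremum is over a concave quadratic in t:

```latex
\delta^{\beta}_{\{b\}}(u)
  = \sup_{t}\;\langle u - b, t\rangle - \tfrac{\beta}{2}\lVert t\rVert_2^2
  = \tfrac{1}{2\beta}\lVert u - b\rVert_2^2
  \qquad\text{(optimum at } t^\star = (u - b)/\beta\text{)},
```

so the smoothed objective becomes g_β(z) = f(z) − (1/(2β))‖Az − b‖₂².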
Inequalities. In the case where S = {u | u ≤ b}, similar computations lead to:

δ^β_S(Az) = (1/(2β)) ‖[Az − b]₊‖₂²,

where [·]₊ denotes the Euclidean projection onto the non-negative orthant (i.e. clipping negative values).
Similarly to the equality case, this term introduces a penalty in the objective for vectors z s.t. Az > b. This penalty function is also called the Courant-Beltrami penalty function.
Figure 5 (bottom) illustrates how the gradient of the penalty term can "force" the LMO to return solutions that satisfy the smoothed constraints.

Practical details
Smoothness. In practice, we need to choose the smoothness parameter β. We follow Yurtsever et al. (2018) and use β^(k) = β^(0)/√(k + 1), where k is the iteration number and β^(0) = 1.
Step size. Another important choice in the algorithm is the step size γ. We show that when the smoothed constraints are equalities, computing the optimal step size has a simple closed-form solution if the function f is linear, which is the case for (ILP1), i.e. MAP decoding. The step size problem at iteration k is defined as:

max_{γ ∈ [0,1]} f(z^(k) + γd) − (1/(2β)) ‖A(z^(k) + γd) − b‖₂².

By assumption, f is linear and can be written as f(z) = θ⊤z. Ignoring the box constraints on γ, by first-order optimality conditions, we have:

γ = (β θ⊤d − (Az^(k) − b)⊤Ad) / ‖Ad‖₂².

We can then simply clip the result so that it satisfies the box constraints. Unfortunately, in the inequality case, there is no simple closed-form solution.
We approximate the step size using 10 iterations of the bisection algorithm for root finding.
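The bisection step can be sketched as follows, assuming dh(γ) evaluates the directional derivative of the smoothed objective along d (a hypothetical helper; the objective is concave in γ, so dh is decreasing):

```python
def bisect_step_size(dh, lo=0.0, hi=1.0, iters=10):
    """Find gamma in [lo, hi] where the directional derivative dh changes
    sign. Returns hi if dh stays non-negative (the objective still increases
    at gamma = hi) and lo if it is non-positive from the start."""
    if dh(hi) >= 0:
        return hi
    if dh(lo) <= 0:
        return lo
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if dh(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```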
Non-integral solutions. As we solve the linear relaxation of the original ILPs, the optimal solutions may not be integral. Therefore, we use simple heuristics to construct a feasible solution to the original ILP in these cases. For MAP inference, we simply solve the ILP using CPLEX, but introducing only variables that have a non-null value in the linear relaxation, leading to a very sparse problem which is fast to solve. For latent anchoring, we simply use the Kuhn-Munkres algorithm with the non-integral solution as assignment costs.
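For the latent anchoring rounding step, the Kuhn-Munkres heuristic can be sketched with SciPy, using the fractional vertex values as assignment scores (the matrix layout is our assumption):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def round_alignment(frac):
    """frac[i, j]: fractional score of aligning AST vertex i to word j,
    taken from the relaxed solution. Returns one word per AST vertex,
    maximizing the total fractional score."""
    rows, cols = linear_sum_assignment(frac, maximize=True)
    return dict(zip(rows, cols))

# Example: 2 AST vertices, 3 words.
frac = np.array([[0.9, 0.1, 0.0],
                 [0.2, 0.7, 0.1]])
print(round_alignment(frac))  # {0: 0, 1: 1}
```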

Experiments
We compare our method to baseline systems both on i.i.d. splits (IID) and on splits that test for compositional generalization, for three datasets. The neural network is described in Appendix B.
Datasets. SCAN (Lake and Baroni, 2018) contains natural language navigation commands. We use the variant of Herzig and Berant (2021) for semantic parsing. The IID split is the simple split (Lake and Baroni, 2018). The compositional splits are primitive right (RIGHT) and primitive around right (ARIGHT) (Loula et al., 2018).
GEOQUERY (Zelle and Mooney, 1996) uses the FunQL formalism (Kate et al., 2005) and contains questions about US geography. The IID split is the standard split. Compositional generalization is evaluated on two splits: LENGTH, where the examples are split by program length, and TEMPLATE (Finegan-Dollak et al., 2018a), where they are split such that all semantic programs having the same AST are in the same split.
CLEVR (Johnson et al., 2017) contains synthetic questions over object relations in images. CLOSURE (Bahdanau et al., 2019) introduces additional question templates that require compositional generalization. We use the original split as our IID split and the CLOSURE split as a compositional split, where the model is evaluated on CLOSURE.
Baselines. We compare our approach against the architecture proposed by Herzig and Berant (2021) (SPANBASEDSP) as well as the seq2seq baselines they used. In SEQ2SEQ (Jia and Liang, 2016), the encoder is a bi-LSTM over pre-trained GloVe embeddings (Pennington et al., 2014) or ELMO (Peters et al., 2018), and the decoder is an attention-based LSTM (Bahdanau et al., 2015). BERT2SEQ replaces the encoder with BERT-base. GRAMMAR is similar to SEQ2SEQ but the decoding is constrained by a grammar. BART (Lewis et al., 2020) is pre-trained as a denoising autoencoder.
Results. We report the denotation accuracies in Table 1. Our approach outperforms all other methods. In particular, the seq2seq baselines suffer from a significant drop in accuracy on splits that require compositional generalization. While SPANBASEDSP is able to generalize, our approach outperforms it. Note that we observed that the GEOQUERY execution script used to compute denotation accuracy in previous work contains several bugs that overestimate the true accuracy. Therefore, we also report denotation accuracy with a corrected executor (see Appendix C; available at https://github.com/alban-petit/geoquery-funql-executor) for fair comparison with future work.
We also report exact match accuracy, with and without the heuristic to construct integral solutions from fractional ones. The exact match accuracy is always lower than or equal to the denotation accuracy. This shows that our approach can sometimes provide the correct denotation even though the prediction is different from the gold semantic program. Importantly, while our approach outperforms baselines, its accuracy is still significantly worse on the split that requires generalizing to longer programs.


Related work

Graph-based methods. Graph-based methods have been popularized by syntactic dependency parsing (McDonald et al., 2005), where MAP inference is realized via the maximum spanning arborescence algorithm (Chu and Liu, 1965; Edmonds, 1967). A benefit of this algorithm is that it has a O(n²) time-complexity (Tarjan, 1977), i.e. it is more efficient than algorithms exploring more restricted search spaces (Eisner, 1997; Gómez-Rodríguez et al., 2011; Pitler et al., 2012, 2013).
In the case of semantic structures, Kuhlmann and Jonsson (2015) proposed a O(n³) algorithm for the maximum not-necessarily-spanning acyclic graphs with a noncrossing arc constraint. Without the noncrossing constraint, the problem is known to be NP-hard (Grötschel et al., 1985). To bypass this computational complexity, Dozat and Manning (2018) proposed to handle each dependency as an independent binary classification problem, that is, they do not enforce any constraint on the output structure. Note that, contrary to our work, these approaches allow for reentrancy but do not enforce well-formedness of the output with respect to the semantic grammar. Lyu and Titov (2018) use a similar approach for AMR parsing where tags are predicted first, followed by arc predictions, and finally heuristics are used to ensure the output graph is valid. On the contrary, we do not use a pipeline and we focus on joint decoding where validity of the output is directly encoded in the search space.
Previous work in the literature has also considered reduction to graph-based methods for other problems, e.g. for discontinuous constituency parsing (Fernández-González and Martins, 2015; Corro et al., 2017), lexical segmentation (Constant and Le Roux, 2015) and machine translation (Zaslavskiy et al., 2009), inter alia.
Compositional generalization. Several authors observed that compositional generalization insufficiency is an important source of error for semantic parsers, especially ones based on seq2seq architectures (Lake and Baroni, 2018; Finegan-Dollak et al., 2018b; Herzig and Berant, 2019; Keysers et al., 2020). Wang et al. (2021) proposed a latent re-ordering step to improve compositional generalization, whereas Zheng and Lapata (2021) relied on latent predicate tagging in the encoder. There has also been an interest in using data augmentation methods to improve generalization (Jia and Liang, 2016; Andreas, 2020; Akyürek et al., 2021; Qiu et al., 2022; Yang et al., 2022).
Recently, Herzig and Berant (2021) showed that span-based parsers do not exhibit such problematic behavior. Unfortunately, these parsers fail to cover the set of semantic structures observed in English treebanks, and we hypothesize that this would be even worse for free word order languages. Our graph-based approach does not exhibit this downside. Previous work by Jambor and Bahdanau (2022) also considered graph-based methods for compositional generalization, but their approach predicts each part independently without any well-formedness or acyclicity constraint.

Conclusion
In this work, we focused on graph-based semantic parsing for formalisms that do not allow reentrancy. We conducted a complexity study of two inference problems that appear in this setting. We proposed ILP formulations of these problems together with a solver for their linear relaxations based on the conditional gradient method. Experimentally, our approach outperforms comparable baselines.
One downside of our semantic parser is speed (we parse approximately 5 sentences per second for GEOQUERY). However, we hope this work will give a better understanding of the semantic parsing problem, together with a baseline for faster methods.
Future research will investigate extensions for (1) ASTs that contain reentrancies and (2) prediction algorithms for the case where a single word can be the anchor of more than one predicate or entity. These two properties are crucial for semantic representations like Abstract Meaning Representation (Banarescu et al., 2013). Moreover, even if our graph-based semantic parser provides better results than previous work on length generalization, this setting remains difficult. A more general research direction on neural architectures that generalize better to longer sentences is important.
Giorgio Satta. 1992. Recognition of linear context-free rewriting systems. In 30th Annual Meeting of the Association for Computational Linguistics, pages 89-95, Newark, Delaware, USA. Association for Computational Linguistics.

Figure 3: (a) Example of a sentence and its associated AST (solid arcs) from the GEOQUERY dataset. The dashed edges indicate predicate and entity anchors (note that this information is not available in the dataset). (b) The corresponding generalized valency-constrained not-necessarily-spanning arborescence (red arcs). The root is the isolated top left vertex. Adding ∅ tags and dotted orange arcs produces a generalized spanning arborescence.

Table 1: Denotation and exact match accuracy on the test sets. All the baseline results were taken from Herzig and Berant (2021). For our approach, we also report exact match accuracy, i.e. the percentage of sentences for which the prediction is identical to the gold program. The last line reports the exact match accuracy without the use of CPLEX to round non-integral solutions (Section 4.3).