Abstract
We propose a novel graph-based approach for semantic parsing that resolves two problems observed in the literature: (1) seq2seq models fail on compositional generalization tasks; (2) previous work using phrase structure parsers cannot cover all the semantic parses observed in treebanks. We prove that both MAP inference and latent tag anchoring (required for weakly-supervised learning) are NP-hard problems. We propose two optimization algorithms based on constraint smoothing and conditional gradient to approximately solve these inference problems. Experimentally, our approach delivers state-of-the-art results on GeoQuery, Scan, and Clevr, both for i.i.d. splits and for splits that test for compositional generalization.
1 Introduction
Semantic parsing aims to transform a natural language utterance into a structured representation that can be easily manipulated by software (e.g., to query a database). As such, it is a central task in human–computer interfaces. Andreas et al. (2013) first proposed to rely on machine translation models for semantic parsing, where the target representation is linearized and treated as a foreign language. Due to recent advances in deep learning, and especially in sequence-to-sequence (seq2seq) architectures with attention for machine translation (Bahdanau et al., 2015), it is appealing to use the same architectures for standard structured prediction problems (Vinyals et al., 2015). This approach is indeed common in semantic parsing (Jia and Liang, 2016; Dong and Lapata, 2016; Wang et al., 2020), as well as in other domains. Unfortunately, there are well-known limitations to seq2seq architectures for semantic parsing. First, at test time, the decoding algorithm is typically based on beam search, as the model is autoregressive and does not make any independence assumption. In case of prediction failure, it is therefore unknown whether this is due to errors in the weighting function or to the optimal solution falling out of the beam. Second, these architectures are known to fail when compositional generalization is required (Lake and Baroni, 2018; Finegan-Dollak et al., 2018a; Keysers et al., 2020).
In order to bypass these problems, Herzig and Berant (2021) proposed to represent the semantic content associated with an utterance as a phrase structure, i.e., using the same representation usually associated with syntactic constituents. As such, their semantic parser is based on standard span-based decoding algorithms (Hall et al., 2014; Stern et al., 2017; Corro, 2020) with additional well-formedness constraints from the semantic formalism. Given a weighting function, MAP inference is a polynomial time problem that can be solved via a variant of the CYK algorithm (Kasami, 1965; Younger, 1967; Cocke, 1970). Experimentally, Herzig and Berant (2021) show that their approach outperforms seq2seq models in terms of compositional generalization, therefore effectively bypassing the two major problems of these architectures.
The complexity of MAP inference for phrase structure parsing is directly impacted by the considered search space (Kallmeyer, 2010). Importantly, (ill-nested) discontinuous phrase structure parsing is known to be NP-hard, even with a bounded block-degree (Satta, 1992). Herzig and Berant (2021) explore two restricted inference algorithms, both of which have a cubic time complexity with respect to the input length. The first one only considers continuous phrase structures, that is, derived trees that could have been generated by a context-free grammar, and the second one also considers a specific type of discontinuities, see Corro (2020, Section 3.6). Both algorithms fail to cover the full set of phrase structures observed in semantic treebanks, see Figure 1.
In this work, we propose to reduce semantic parsing without reentrancy (i.e., a given predicate or entity cannot be used as an argument for two different predicates) to a bi-lexical dependency parsing problem. As such, we tackle the same semantic content as the aforementioned previous work but using a different mathematical representation (Rambow, 2010). We identify two main benefits to our approach: (1) as we allow crossing arcs, i.e., “non-projective graphs”, all datasets are guaranteed to be fully covered, and (2) it allows us to rely on optimization methods to tackle the intractability of inference in our novel graph-based formulation of the problem. More specifically, in our setting we need to jointly assign predicates/entities to words that convey a semantic content and to identify arguments of predicates via bi-lexical dependencies. We show that MAP inference in this setting is equivalent to the maximum generalized spanning arborescence problem (Myung et al., 1995) with supplementary constraints to ensure well-formedness with respect to the semantic formalism. Although this problem is NP-hard, we propose an optimization algorithm that solves a linear relaxation of the problem and can deliver an optimality certificate.
Our contributions can be summarized as follows:
We propose a novel graph-based approach for semantic parsing without reentrancy;
We prove the NP-hardness of MAP inference and latent anchoring inference;
We propose a novel integer linear programming formulation for this problem together with an approximate solver based on conditional gradient and constraint smoothing;
We tackle the training problem using variational approximations of objective functions, including the weakly-supervised scenario;
We evaluate our approach on GeoQuery, Scan, and Clevr and observe that it outperforms baselines on both i.i.d. splits and splits that test for compositional generalization.
2 Graph-based Semantic Parsing
We propose to reduce semantic parsing to parsing the abstract syntax tree (AST) associated with a semantic program. We focus on semantic programs whose ASTs do not have any reentrancy, i.e., a single predicate or entity cannot be the argument of two different predicates. Moreover, we assume that each predicate or entity is anchored on exactly one word of the sentence and each word can be the anchor of at most one predicate or entity. As such, the semantic parsing problem can be reduced to assigning predicates and entities to words and identifying arguments via dependency relations, see Figure 2. In order to formalize our approach to the semantic parsing problem, we will use concepts from graph theory. We therefore first introduce the vocabulary and notions that will be useful in the rest of this article. Notably, the notions of cluster and generalized arborescence will be used to formalize our prediction problem.
Notations and Definitions.
Let G = 〈V, A〉 be a directed graph with vertices V and arcs A ⊆ V × V. An arc in A from a vertex u ∈ V to a vertex v ∈ V is denoted either a ∈ A or u → v ∈ A. For any subset of vertices U ⊆ V, we denote σ+(U) (resp. σ−(U)) the set of arcs leaving one vertex of U and entering one vertex of V ∖ U (resp. leaving one vertex of V ∖ U and entering one vertex of U) in the graph G. Let B ⊆ A be a subset of arcs. We denote V[B] the cover set of B, i.e., the set of vertices that appear as an extremity of at least one arc in B. A graph G = 〈V, A〉 is an arborescence2 rooted at u ∈ V if and only if (iff) it contains |V| − 1 arcs and there is a directed path from u to each vertex in V. In the rest of this work, we will assume that the root is always vertex 0 ∈ V. Let B ⊆ A be a set of arcs such that G′ = 〈V[B], B〉 is an arborescence. Then G′ is a spanning arborescence of G iff V[B] = V.
Let π = {V0, …, Vn} be a partition of V containing n + 1 clusters. The graph G′ = 〈V[B], B〉 is a generalized not-necessarily-spanning arborescence (resp. a generalized spanning arborescence) on the partition π of G iff G′ is an arborescence and V[B] contains at most one vertex per cluster in π (resp. exactly one vertex per cluster).
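To make these definitions concrete, here is a minimal Python sketch (our own illustration, not part of the original formalization) that checks whether a set of arcs forms an arborescence rooted at a given vertex, and whether it is a generalized (not-necessarily-spanning) arborescence on a partition:

```python
from collections import defaultdict

def is_arborescence(vertices, arcs, root):
    """An arborescence rooted at `root` contains |V| - 1 arcs and has a
    directed path from `root` to every vertex in V."""
    vertices = set(vertices)
    if len(arcs) != len(vertices) - 1:
        return False
    children = defaultdict(list)
    for u, v in arcs:
        children[u].append(v)
    reached, stack = {root}, [root]
    while stack:
        for v in children[stack.pop()]:
            if v in reached:          # a vertex entered twice: not a tree
                return False
            reached.add(v)
            stack.append(v)
    return reached == vertices

def is_generalized_arborescence(arcs, partition, root):
    """Arborescence over its cover set V[B] that uses at most one vertex
    per cluster of the partition (not-necessarily-spanning case)."""
    cover = {v for arc in arcs for v in arc}
    if root not in cover or not is_arborescence(cover, arcs, root):
        return False
    return all(len(cover & set(cluster)) <= 1 for cluster in partition)

# Example: arcs 0 -> 3 -> 5 with clusters {0}, {1, 2, 3}, {4, 5, 6}
print(is_generalized_arborescence([(0, 3), (3, 5)],
                                  [[0], [1, 2, 3], [4, 5, 6]], root=0))  # True
```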
Let W ⊆ V be a set of vertices. Contracting W consists in replacing in G the set W by a new vertex w ∉ V, replacing all the arcs u → v ∈ σ−(W) by an arc u → w and all the arcs u → v ∈ σ+(W) by an arc w → v. Given a graph with partition π, the contracted graph is the graph where each cluster in π has been contracted. While contracting a graph may introduce parallel arcs, it is not an issue in practice, even for weighted graphs.
2.1 Semantic Grammar and AST
The semantic programs we focus on take the form of a functional language, i.e., a representation where each predicate is a function that takes other predicates or entities as arguments. The semantic language is typed in the same sense as in “typed programming languages”. For example, in GeoQuery, the predicate capital_2 expects an argument of type city and returns an object of type state. In the datasets we use, the typing system disambiguates the position of arguments in a function: for a given function, either all arguments are of the same type or the order of arguments is unimportant—an example of both is the predicate intersection_river in GeoQuery, which takes two arguments of type river, but the result of the execution is unchanged if the arguments are swapped.3
Formally, we define the set of valid semantic programs as the set of programs that can be produced with a semantic grammar 𝒢 = 〈E, T, fType, fArgs〉 where:
E is the set of predicates and entities, which we will refer to as the set of tags—w.l.o.g. we assume that Root ∉ E where Root is a special tag used for parsing;
T is the set of types;
fType : E → T is a typing function that assigns a type to each tag;
fArgs : E × T → ℕ is a valency function that assigns the number of expected arguments of a given type to each tag (a minimal sketch of this grammar as a data structure is given after this list).
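As a concrete illustration, the grammar 𝒢 can be stored as plain mappings; the GeoQuery-flavored entries below are illustrative examples only, not the full grammar used in the experiments.

```python
from dataclasses import dataclass

@dataclass
class SemanticGrammar:
    tags: set      # E: predicates and entities
    types: set     # T
    f_type: dict   # fType: tag -> type of the object it produces
    f_args: dict   # fArgs: (tag, type) -> number of expected arguments of that type

    def expected_args(self, tag, typ):
        # tags absent from f_args for a given type expect zero arguments of that type
        return self.f_args.get((tag, typ), 0)

# Illustrative GeoQuery-like fragment:
grammar = SemanticGrammar(
    tags={"capital_2", "cityid", "intersection_river", "riverid"},
    types={"city", "state", "river"},
    f_type={"capital_2": "state", "cityid": "city",
            "intersection_river": "river", "riverid": "river"},
    f_args={("capital_2", "city"): 1,             # capital_2 expects one city argument
            ("intersection_river", "river"): 2},  # two (order-free) river arguments
)
assert grammar.expected_args("capital_2", "city") == 1
```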
2.2 Problem Reduction and Complexity
In our setting, semantic parsing is a joint sentence tagging and dependency parsing problem (Bohnet and Nivre, 2012; Li et al., 2011; Corro et al., 2017): each content word (i.e., each word that conveys a semantic meaning) must be tagged with a predicate or an entity, and dependencies between content words identify arguments of predicates, see Figure 2. However, our semantic parsing setting differs from standard syntactic analysis in two ways: (1) the resulting structure is not-necessarily-spanning: there are words (e.g., function words) that must not be tagged and that do not have any incident dependency; these words are not known in advance and must be identified jointly with the rest of the structure; (2) the dependency structure is highly constrained by the typing mechanism, that is, the predicted structure must be a valid AST. Nevertheless, similarly to the aforementioned works, our parser is graph-based: for a given input, we build a (complete) directed graph and decoding is reduced to computing a constrained subgraph of maximum weight.
Given a sentence w = w1 … wn with n words and a grammar 𝒢, we construct a clustered labeled graph G = 〈V, A, π, ℓ〉 as follows. The partition π = {V0, …, Vn} contains n + 1 clusters, where V0 is a root cluster and each cluster Vi, i ≠ 0, is associated to word wi. The root cluster V0 = {0} contains a single vertex that will be used as the root, and every other cluster contains |E| vertices. The extended labeling function ℓ : V → E ∪ {Root} assigns a tag in E to each vertex v ∈ V ∖ {0} and Root to vertex 0. Distinct vertices in a cluster Vi cannot have the same label, i.e., ∀u, v ∈ Vi : u ≠ v ⇒ ℓ(u) ≠ ℓ(v).
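This construction is mechanical; a sketch (reusing the `SemanticGrammar` above and our own vertex numbering) could look as follows.

```python
def build_clustered_graph(words, grammar):
    """Cluster 0 = {0} is the root; cluster i holds one vertex per tag in E.
    Arcs connect vertices of distinct clusters and no arc enters the root
    cluster; this sketch keeps the arc set complete, it could be pruned."""
    tags = sorted(grammar.tags)
    label = {0: "Root"}
    clusters = [{0}]
    next_vertex = 1
    for _ in words:
        cluster = set()
        for tag in tags:
            label[next_vertex] = tag
            cluster.add(next_vertex)
            next_vertex += 1
        clusters.append(cluster)
    arcs = [(u, v)
            for i, cu in enumerate(clusters) for u in cu
            for j, cv in enumerate(clusters) for v in cv
            if i != j and j != 0]
    return clusters, label, arcs
```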
Let B ⊆ A be a subset of arcs. The graph G′ = 〈V[B], B〉 defines a 0-rooted generalized valency-constrained not-necessarily-spanning arborescence iff it is a generalized arborescence of G, there is exactly one arc leaving 0 and the sub-arborescence rooted at the destination of that arc is a valid AST with respect to the grammar 𝒢. As such, there is a one-to-one correspondence between ASTs anchored on the sentence w and generalized valency-constrained not-necessarily-spanning arborescences in the graph G, see Figure 3b.
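Validity with respect to the grammar amounts to a local valency check at every vertex of the arborescence: the multiset of the children's types must match fArgs exactly. A minimal sketch, building on the structures above:

```python
from collections import Counter

def is_valid_ast(root, children, label, grammar):
    """children: vertex -> list of child vertices; label: vertex -> tag.
    Every vertex must receive exactly the arguments its tag expects."""
    stack = [root]
    while stack:
        u = stack.pop()
        got = Counter(grammar.f_type[label[v]] for v in children.get(u, []))
        for typ in grammar.types:
            if got[typ] != grammar.expected_args(label[u], typ):
                return False
        stack.extend(children.get(u, []))
    return True
```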
For any sentence w, our aim is to find the AST that most likely corresponds to it. Thus, after building the graph G as explained above, the neural network described in Appendix B is used to produce a vector of weights μ ∈ ℝ|V| associated to the set of vertices V and a vector of weights ϕ ∈ ℝ|A| associated to the set of arcs A. Given these weights, graph-based semantic parsing is reduced to an optimization problem called the maximum generalized valency-constrained not-necessarily-spanning arborescence (MGVCNNSA) in the graph G.
Theorem 1. The MGVCNNSA problem is NP-hard.
The proof is in Appendix A.
2.3 Mathematical Program
Our graph-based approach to semantic parsing has allowed us to prove the intrinsic hardness of the problem. We follow previous work on graph-based parsing (Martins et al., 2009; Koo et al., 2010), and other topics, by proposing an integer linear programming (ILP) formulation in order to compute (approximate) solutions.
Remember that in the joint tagging and dependency parsing interpretation of the semantic parsing problem, the resulting structure is not-necessarily-spanning, meaning that some words may not be tagged. In order to rely on well-known algorithms for computing spanning arborescences as a subroutine of our approximate solver, we first introduce the notion of extended graph. Given a graph G = 〈V, A, π, ℓ〉, we construct an extended graph Ĝ = 〈V̂, Â, π̂, ℓ〉4 containing n additional vertices {v̂1, …, v̂n} that are distributed along clusters, i.e., π̂ = {V0, V1 ∪ {v̂1}, …, Vn ∪ {v̂n}}, and arcs from the root to these extra vertices, i.e., Â = A ∪ {0 → v̂i ∣ 1 ≤ i ≤ n}. Let B ⊆ A be a subset of arcs such that 〈V[B], B〉 is a generalized not-necessarily-spanning arborescence on G. Let B̂ ⊆ Â be a subset of arcs defined as B̂ = B ∪ {0 → v̂i ∣ V[B] ∩ Vi = ∅}. Then, there is a one-to-one correspondence between generalized not-necessarily-spanning arborescences 〈V[B], B〉 and generalized spanning arborescences 〈V̂[B̂], B̂〉, see Figure 3b.
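The following sketch makes the construction explicit (our own code; `clusters` and `arcs` come from the graph construction above, and the dummy vertices play the role of the v̂i).

```python
def extend_graph(clusters, arcs):
    """Add one dummy vertex per word cluster and an arc 0 -> dummy.
    Returns the extended clusters, extended arcs, and each cluster's dummy."""
    next_vertex = 1 + max(v for c in clusters for v in c)
    ext_clusters, ext_arcs, dummy = [set(clusters[0])], list(arcs), {}
    for i in range(1, len(clusters)):
        dummy[i] = next_vertex
        ext_clusters.append(set(clusters[i]) | {next_vertex})
        ext_arcs.append((0, next_vertex))   # root -> dummy vertex of cluster i
        next_vertex += 1
    return ext_clusters, ext_arcs, dummy

def to_spanning(arcs_B, clusters, dummy):
    """One-to-one mapping: complete a not-necessarily-spanning arborescence B
    into a spanning one by attaching the dummy vertex of every uncovered cluster."""
    cover = {v for arc in arcs_B for v in arc}
    return list(arcs_B) + [(0, dummy[i]) for i in range(1, len(clusters))
                           if not cover & set(clusters[i])]
```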
We have established that MAP inference in our semantic parsing framework is an NP-hard problem. We proposed an ILP formulation of the problem that would be easy to solve if some constraints were removed. This property suggests the use of an approximation algorithm that introduces the difficult constraints as penalties. As a similar setting arises from our weakly supervised loss function, the presentation of the approximation algorithm is deferred until Section 4.
3 Training Objective Functions
3.1 Supervised Training Objective
3.2 Weakly Supervised Training Objective
Unfortunately, training data often does not include gold pairs 〈x, y〉 but instead only the AST, without word anchors (or word alignment). This is the case for the three datasets we use in our experiments. We thus consider our training signal to be the set of all structures that induce the annotated AST, which we denote 𝒞*.
For a given sentence w, let G = 〈V, A, π, ℓ〉 be a graph defined as in Section 2.2 and G′ = 〈V′, A′, l′〉 be an AST defined as in Section 2.1. We aim to find the GVCNNSA in G of maximum weight whose induced AST is exactly G′. This is equivalent to aligning each vertex in V′ with one vertex of V ∖ {0} s.t. there is at most one vertex per cluster of π appearing in the alignment and where the weight of an alignment is defined as:
for each vertex u′ ∈ V′, we add the weight of the vertex u ∈ V it is aligned to—moreover, if u′ is the root of the AST we also add the weight of the arc 0 → u;
for each arc u′ → v′ ∈ A′, we add the weight of the arc u → v, where u ∈ V (resp. v ∈ V) is the vertex that u′ (resp. v′) is aligned with.
Theorem 2. Computing the anchoring of maximum weight of an AST with a graph G is NP-hard.
The proof is in Appendix A.
Therefore, we propose an optimization-based approach to compute the distribution q. Note that the problem has a constraint requiring each cluster Vi ∈ π to be aligned with at most one vertex v′ ∈ V′, i.e., each word in the sentence can be aligned with at most one vertex in the AST. If we remove this constraint, then the problem becomes tractable via dynamic programming. Indeed, we can recursively construct a table Chart[u′, u], u′ ∈ V′ and u ∈ V, containing the score of aligning vertex u′ to vertex u plus the score of the best alignment of all the descendants of u′. To this end, we simply visit the vertices V′ of the AST in reverse topological order, see Algorithm 1. The best alignment can be retrieved via back-pointers.
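A minimal sketch of this relaxed dynamic program follows (our own reconstruction of the procedure described above, not the authors' Algorithm 1 verbatim); `ast_order` must list the AST vertices in reverse topological order so that children are processed before their parents.

```python
def relaxed_anchoring(ast_order, ast_children, ast_label,
                      graph_vertices, graph_label, mu, phi):
    """chart[(u_ast, u)] = weight of anchoring AST vertex u_ast on graph vertex u
    plus the best anchoring of all descendants of u_ast (the at-most-one-vertex-
    per-cluster constraint is dropped). mu: vertex weights; phi: (u, v) -> arc weight."""
    chart, back = {}, {}
    for u_ast in ast_order:
        for u in graph_vertices:
            if graph_label[u] != ast_label[u_ast]:
                continue  # an AST vertex can only be anchored on a vertex with its tag
            score, pointers = mu[u], []
            for v_ast in ast_children.get(u_ast, []):
                best_v, best_s = None, float("-inf")
                for v in graph_vertices:
                    if (v_ast, v) in chart and (u, v) in phi:
                        s = phi[(u, v)] + chart[(v_ast, v)]
                        if s > best_s:
                            best_v, best_s = v, s
                score += best_s
                pointers.append((v_ast, best_v))
            chart[(u_ast, u)], back[(u_ast, u)] = score, pointers
    return chart, back
```

The best anchoring is then recovered by taking the maximum chart entry for the AST root and following the back-pointers, as described above.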
4 Efficient Inference
4.1 Conditional Gradient Method
4.2 Smoothing
Equalities.
Inequalities.
Figure 5 (bottom) illustrates how the gradient of the penalty term can “force” the LMO to return solutions that satisfy the smoothed constraints.
4.3 Practical Details
Smoothness.
In practice, we need to choose the smoothness parameter β. We follow Yurtsever et al. (2018) and use β(k) = β(0)/√(k + 1), where k is the iteration number and β(0) = 1.
Step Size.
Non-integral Solutions.
As we solve the linear relaxation of the original ILPs, the optimal solutions may not be integral (Figure 4). Therefore, we use simple heuristics to construct a feasible solution to the original ILP in these cases. For MAP inference, we solve the ILP5 using CPLEX but introduce only variables that have a non-null value in the linear relaxation, leading to a very sparse problem that is fast to solve. For latent anchoring, we use the Kuhn–Munkres algorithm with the non-integral values of the relaxed solution as assignment costs.
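For the latent anchoring rounding step, one possible implementation uses SciPy's Hungarian-algorithm solver, taking the fractional alignment values of the relaxation as assignment profits (a sketch under the assumption that these values are available as a dense matrix):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def round_anchoring(frac_align):
    """frac_align[i, j]: fractional value of aligning AST vertex i with word
    cluster j in the relaxed solution. Returns an integral one-to-one
    assignment maximizing the total fractional value."""
    rows, cols = linear_sum_assignment(np.asarray(frac_align), maximize=True)
    return dict(zip(rows, cols))
```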
5 Experiments
We compare our method to baseline systems both on i.i.d. splits (Iid) and splits that test for compositional generalization for three datasets. The neural network is described in Appendix B.
Datasets.
Scan (Lake and Baroni, 2018) contains natural language navigation commands. We use the variant of Herzig and Berant (2021) for semantic parsing. The Iid split is the simple split (Lake and Baroni, 2018). The compositional splits are primitive right (Right) and primitive around right (ARight) (Loula et al., 2018).
GeoQuery (Zelle and Mooney, 1996) uses the FunQL formalism (Kate et al., 2005) and contains questions about US geography. The Iid split is the standard split, and compositional generalization is evaluated on two splits: Length, where the examples are split by program length, and Template (Finegan-Dollak et al., 2018a), where they are split such that all semantic programs sharing the same AST are in the same split.
Clevr (Johnson et al., 2017) contains synthetic questions over object relations in images. Closure (Bahdanau et al., 2019) introduces additional question templates that require compositional generalization. We use the original split as our Iid split and a compositional split where the model is evaluated on Closure.
Baselines.
We compare our approach against the architecture proposed by Herzig and Berant (2021) (SpanBasedSP) as well as the seq2seq baselines they used. In Seq2Seq (Jia and Liang, 2016), the encoder is a bi-LSTM over pre-trained GloVe embeddings (Pennington et al., 2014) or ELMo (Peters et al., 2018) and the decoder is an attention-based LSTM (Bahdanau et al., 2015). BERT2Seq replaces the encoder with BERT-base. GRAMMAR is similar to Seq2Seq but the decoding is constrained by a grammar. BART (Lewis et al., 2020) is pre-trained as a denoising autoencoder.
Results.
We report the denotation accuracies in Table 1. Our approach outperforms all other methods. In particular, the seq2seq baselines suffer from a significant drop in accuracy on splits that require compositional generalization. While SpanBasedSP is able to generalize, our approach outperforms it. We also observed that the GeoQuery execution script used to compute denotation accuracy in previous work contains several bugs that lead to overestimating the true accuracy. Therefore, we additionally report denotation accuracy with a corrected executor (see Appendix C) for fair comparison with future work.
| | Scan | | | GeoQuery | | | Clevr | |
|---|---|---|---|---|---|---|---|---|
| | Iid | Right | ARight | Iid | Template | Length | Iid | Closure |
| Baselines (denotation accuracy only) | | | | | | | | |
| Seq2Seq | 99.9 | 11.6 | 0 | 78.5 | 46.0 | 24.3 | 100 | 59.5 |
| + ELMo | 100 | 54.9 | 41.6 | 79.3 | 50.0 | 25.7 | 100 | 64.2 |
| BERT2Seq | 100 | 77.7 | 95.3 | 81.1 | 49.6 | 26.1 | 100 | 56.4 |
| GRAMMAR | 100 | 0.0 | 4.2 | 72.1 | 54.0 | 24.6 | 100 | 51.3 |
| BART | 100 | 50.5 | 100 | 87.1 | 67.0 | 19.3 | 100 | 51.5 |
| SpanBasedSP | 100 | 100 | 100 | 86.1 | 82.2 | 63.6 | 96.7 | 98.8 |
| Our approach | | | | | | | | |
| Denotation accuracy | 100 | 100 | 100 | 92.9 | 89.9 | 74.9 | 100 | 99.6 |
| ↳ Corrected executor | | | | 91.8 | 88.7 | 74.5 | | |
| Exact match | 100 | 100 | 100 | 90.7 | 86.2 | 69.3 | 100 | 99.6 |
| ↳ w/o CPLEX heuristic | 100 | 100 | 100 | 90.0 | 83.0 | 67.5 | 100 | 98.0 |
We also report exact match accuracy, with and without the heuristic to construct integral solutions from fractional ones. The exact match accuracy is always lower than or equal to the denotation accuracy. This shows that our approach can sometimes provide the correct denotation even though the prediction differs from the gold semantic program. Importantly, while our approach outperforms the baselines, its accuracy is still significantly worse on the split that requires generalizing to longer programs.
6 Related Work
Graph-based Methods.
Graph-based methods have been popularized by syntactic dependency parsing (McDonald et al., 2005), where MAP inference is realized via the maximum spanning arborescence algorithm (Chu and Liu, 1965; Edmonds, 1967). A benefit of this algorithm is its 𝒪(n²) time complexity (Tarjan, 1977), i.e., it is more efficient than algorithms exploring more restricted search spaces (Eisner, 1997; Gómez-Rodríguez et al., 2011; Pitler et al., 2012, 2013).
In the case of semantic structures, Kuhlmann and Jonsson (2015) proposed a 𝒪(n³) algorithm for computing maximum not-necessarily-spanning acyclic graphs under a noncrossing arc constraint. Without the noncrossing constraint, the problem is known to be NP-hard (Grötschel et al., 1985). To bypass this computational complexity, Dozat and Manning (2018) proposed to handle each dependency as an independent binary classification problem, that is, they do not enforce any constraint on the output structure. Note that, contrary to our work, these approaches allow for reentrancy but do not enforce well-formedness of the output with respect to the semantic grammar. Lyu and Titov (2018) use a similar approach for AMR parsing, where tags are predicted first, followed by arc predictions; heuristics are then used to ensure that the output graph is valid. In contrast, we do not use a pipeline: we focus on joint decoding, where the validity of the output is directly encoded in the search space.
Previous work in the literature has also considered reduction to graph-based methods for other problems, e.g., for discontinuous constituency parsing (Fernández-González and Martins, 2015; Corro et al., 2017), lexical segmentation (Constant and Le Roux, 2015), and machine translation (Zaslavskiy et al., 2009), inter alia.
Compositional Generalization.
Several authors observed that compositional generalization insufficiency is an important source of error for semantic parsers, especially ones based on seq2seq architectures (Lake and Baroni, 2018; Finegan-Dollak et al., 2018b; Herzig and Berant, 2019; Keysers et al., 2020). Wang et al. (2021) proposed a latent re-ordering step to improve compositional generalization, whereas Zheng and Lapata (2021) relied on latent predicate tagging in the encoder. There has also been an interest in using data augmentation methods to improve generalization (Jia and Liang, 2016; Andreas, 2020; Akyürek et al., 2021; Qiu et al., 2022; Yang et al., 2022).
Recently, Herzig and Berant (2021) showed that span-based parsers do not exhibit such problematic behavior. Unfortunately, these parsers fail to cover the set of semantic structures observed in English treebanks, and we hypothesize that this would be even worse for free word order languages. Our graph-based approach does not exhibit this downside. Previous work by Jambor and Bahdanau (2022) also considered graph-based methods for compositional generalization, but their approach predicts each part independently without any well-formedness or acyclicity constraint.
7 Conclusion
In this work, we focused on graph-based semantic parsing for formalisms that do not allow reentrancy. We conducted a complexity study of two inference problems that appear in this setting. We proposed ILP formulations of these problems together with a solver for their linear relaxation based on the conditional gradient method. Experimentally, our approach outperforms comparable baselines.
One downside of our semantic parser is speed (we parse approximately 5 sentences per second for GeoQuery). However, we hope this work will give a better understanding of the semantic parsing problem, together with a baseline for faster methods.
Future research will investigate extensions for (1) ASTs that contain reentrancies and (2) prediction algorithms for the case where a single word can be the anchor of more than one predicate or entity. These two properties are crucial for semantic representations like Abstract Meaning Representation (Banarescu et al., 2013). Moreover, even if our graph-based semantic parser provides better results than previous work on length generalization, this setting is still difficult. A more general research direction on neural architectures that generalize better to longer sentences is important.
Acknowledgments
We thank François Yvon and the anonymous reviewers for their comments and suggestions. We thank Jonathan Herzig and Jonathan Berant for fruitful discussions. This work benefited from computations done on the Saclay-IA platform and on the HPC resources of IDRIS under the allocation 2022-AD011013727 made by GENCI.
Notes
2. In the NLP community, arborescences are often called (directed) trees. We stick with the term arborescence as it is more standard in the graph theory literature, see for example Schrijver (2003). Using the term tree introduces a confusion between two unrelated algorithms: Kruskal's maximum spanning tree algorithm (Kruskal, 1956), which operates on undirected graphs, and Edmonds's maximum spanning arborescence algorithm (Edmonds, 1967), which operates on directed graphs. Moreover, this prevents any confusion between the graph object called arborescence and the semantic structure called AST.
3. There are a few corner cases like exclude_river, for which we simply assume arguments are in the same order as they appear in the input sentence.
4. The labeling function is unchanged as there is no need for types for the extra vertices in V̂ ∖ V.
References
A Proofs
Proof: Theorem 1. We prove Theorem 1 by reducing the maximum not-necessarily-spanning arborescence (MNNSA) problem, which is known to be NP-hard (Rao and Sridharan, 2002; Duhamel et al., 2008), to the MGVCNNSA.
Let G = 〈V, A, ψ〉 be a weighted graph where V = {0, …, n} and ψ ∈ ℝ|A| are arc weights. The MNNSA problem aims to compute the subset of arcs B ⊆ A such that 〈V[B], B〉 is an arborescence of maximum weight, where its weight is defined as ∑a∈Bψa.
Let 𝒢 = 〈E, T, fType, fArgs〉 be a grammar such that E = {0, …, n − 1}, T = {t} and ∀e ∈ E : fType(e) = t ∧ fArgs(e, t) = e. Intuitively, a tag e ∈ E will be associated to vertices that require exactly e outgoing arcs.
We construct a clustered labeled weighted graph G′ = 〈V′, A′, π, l, ψ′〉 as follows. π = {V0′, …, Vn′} is a partition of V′ such that each cluster Vi′ contains n − 1 vertices and represents the vertex i ∈ V. The labeling function l assigns a different tag to each vertex in a cluster, i.e., ∀Vi′ ∈ π, ∀u′, v′ ∈ Vi′ : u′ ≠ v′ ⇒ l(u′) ≠ l(v′). The set of arcs is defined as A′ = {u′ → v′ | ∃i → j ∈ A s.t. u′ ∈ Vi′ ∧ v′ ∈ Vj′}. The weight vector ψ′ ∈ ℝ|A′| is such that ∀u′ → v′ ∈ A′ : u′ ∈ Vi′ ∧ v′ ∈ Vj′ ⇒ ψ′u′→v′ = ψi→j.
As such, there is a one-to-one correspondence between solutions of the MNNSA on graph G and solutions of the MGVCNNSA on graph G′.
Note that our proof considers that arcs leaving the root cluster satisfy the constraints defined by the grammar, whereas we previously only required the root vertex to have a single outgoing arc. The latter constraint can be added directly in the grammar, but we omit the presentation for brevity. The constrained-arity case presented by McDonald and Satta (2007) focuses on spanning arborescences with an arity constraint, by reducing the Hamiltonian path problem to their problem. While the arity constraint is similar in their problem and ours, our proof considers the not-necessarily-spanning case instead of the spanning one. Although the two problems seem related, they need to be studied separately, e.g., computing the maximum spanning arborescence is a polynomial-time problem whereas computing the MNNSA is an NP-hard problem.
Proof: Theorem 2. We prove Theorem 2 by reducing the maximum directed Hamiltonian path problem, which is known to be NP-hard (Garey and Johnson, 1979, Appendix A1.3), to the latent anchoring.
Let G = 〈V, A, ψ〉 be a weighted graph where V = {1, …, n} and ψ ∈ ℝ|A| are arc weights. The maximum Hamiltonian path problem aims to compute the subset of arcs B ⊆ A such that V [B] = V and 〈V[B], B〉 is a path of maximum weight, where its weight is defined as ∑a∈Bψa.
Let 𝒢 = 〈E, T, fType, fArgs〉 be a grammar such that E = {0, 1}, T = {t} and ∀e ∈ E : fType(e) = t ∧ fArgs(e, t) = e.
We construct a clustered labeled weighted graph G′ = 〈V′, A′, π, l, ψ′〉 as follows. π = {V0′, …, Vn′} is a partition of V′ such that V0′ = {0} and each cluster Vi′ ≠ V0′ contains 2 vertices and represents the vertex i ∈ V. The labeling function l assigns a different tag to each vertex in a cluster except the root, i.e., ∀Vi′ ∈ π, i > 0, ∀u′, v′ ∈ Vi′ : u′ ≠ v′ ⇒ l(u′) ≠ l(v′). The set of arcs is defined as A′ = {0 → u′ | u′ ∈ V′ ∖ {0}} ∪ {u′ → v′ | i → j ∈ A ∧ u′ ∈ Vi′ ∧ v′ ∈ Vj′}. The weight vector ψ′ ∈ ℝ|A′| is such that ∀u′ → v′ ∈ A′ with u′ ∈ Vi′ and v′ ∈ Vj′, i, j > 0 : ψ′u′→v′ = ψi→j, and arcs leaving 0 have null weights.
We construct an AST G″ = 〈V″, A″, l′〉 such that V″ = {1, …, n}, A″ = {i → i + 1|1 ≤ i < n} and the labeling function l′ assigns the tag 0 to n and the tag 1 to every other vertex.
As such, there is a one-to-one correspondence between solutions of the maximum Hamiltonian path problem on graph G and solutions of the maximum-weight anchoring of G″ with G′.
B Experimental Setup
The neural architecture used in our experiments to produce the weights μ and ϕ is composed of: (1) an embedding layer of dimension 100 for Scan or BERT-base (Devlin et al., 2019) for the other datasets, followed by a bi-LSTM (Hochreiter and Schmidhuber, 1997) with a hidden size of 400; (2) a linear projection of dimension 500 over the output of the bi-LSTM followed by a Tanh activation and another linear projection of dimension |E| to obtain μ; (3) a linear projection of dimension 500 followed by a Tanh activation and a bi-affine layer (Dozat and Manning, 2017) to obtain ϕ.
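A minimal PyTorch sketch of this scoring head (our reconstruction from the description above, not the authors' exact code; the bi-LSTM encoder is assumed to output 800-dimensional vectors since it is bidirectional with hidden size 400):

```python
import torch
import torch.nn as nn

class Scorer(nn.Module):
    def __init__(self, enc_dim=800, proj_dim=500, n_tags=50):
        super().__init__()
        # vertex (tag) weights mu
        self.tag_scorer = nn.Sequential(
            nn.Linear(enc_dim, proj_dim), nn.Tanh(), nn.Dropout(0.3),
            nn.Linear(proj_dim, n_tags))
        # arc weights phi: projection + Tanh, then a bi-affine product
        self.arc_proj = nn.Sequential(
            nn.Linear(enc_dim, proj_dim), nn.Tanh(), nn.Dropout(0.3))
        self.biaffine = nn.Parameter(torch.randn(proj_dim + 1, proj_dim + 1) * 0.01)

    def forward(self, h):                       # h: (n_words, enc_dim) encoder output
        mu = self.tag_scorer(h)                 # (n_words, n_tags) vertex weights
        ones = h.new_ones(h.size(0), 1)         # bias column for the bi-affine product
        z = torch.cat([self.arc_proj(h), ones], dim=-1)
        phi = z @ self.biaffine @ z.t()         # (n_words, n_words) arc weights
        return mu, phi
```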
We apply dropout with a probability of 0.3 over the outputs of BERT-base and the bi-LSTM, and after both Tanh activations. The learning rate is 5 × 10−4 and each batch is composed of 30 examples. We keep the parameters that obtain the best accuracy on the development set after 25 epochs. Training the model takes from 40 minutes for GeoQuery to 8 hours for Clevr. Note that the bottleneck is the conditional gradient method, which is computed on the CPU.
C GeoQuery Denotation Accuracy Issue
The denotation accuracy is evaluated by checking whether the denotation returned by an executor is the same when given the gold semantic program and the prediction of the model. It can be higher than the exact match accuracy when different semantic programs yield the same denotation.
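In code, the metric reduces to comparing executor outputs (`execute` here stands for whichever FunQL executor is used, and is an assumption of this sketch):

```python
def denotation_match(gold_program, predicted_program, execute):
    # Different programs count as correct as long as they yield the same denotation.
    return execute(predicted_program) == execute(gold_program)
```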
When we evaluated our approach using the same executor as the baselines of Herzig and Berant (2021), we observed two main issues regarding the behavior of predicates: (1) several predicates have undefined behaviors (e.g., population_1 and traverse_2 in the case of an argument of type country), in the sense that they are not implemented; (2) the behavior of some predicates is incorrect with respect to their expected semantics (e.g., traverse_1 and traverse_2). These two sources of errors result in incorrect denotations for several semantic programs, leading to an overestimation of the denotation accuracy when both the gold and predicted programs return by accident an empty denotation (potentially for different reasons, due to the aforementioned implementation issues).
We implemented a corrected executor addressing the issues that we found. It is available here: https://github.com/alban-petit/geoquery-funql-executor.