## Abstract

Motivated by the task of semantic parsing, we describe a transition system that generalizes standard transition-based dependency parsing techniques to generate a graph rather than a tree. Our system includes a cache with fixed size *m*, and we characterize the relationship between the parameter *m* and the class of graphs that can be produced through the graph-theoretic concept of tree decomposition. We find empirically that small cache sizes cover a high percentage of sentences in existing semantic corpora.

## 1. Introduction

As statistical natural language processing systems have progressed to provide deeper representations, there has been renewed interest in graph-based representations of semantic structures and in algorithms to produce them. Typically, these algorithms behave similarly to standard parsing algorithms for retrieving syntactic representations: They take as input a sentence and produce as output a graph representation of the semantics of the sentence itself.

At the same time, recent years have seen a general trend from chart-based syntactic parsers toward stack-based transition systems, as the accuracy of transition systems has increased, and as speed has become increasingly important for real-world applications. On the syntactic side, stack-based transition systems for projective dependency parsing run in time $O(n)$, where *n* is the sentence length; for a general overview of these systems, see, for instance, the presentation of Nivre (2008). There have also been a number of extensions of stack-based transition systems to handle non-projective trees (e.g., Attardi 2006; Nivre 2009; Choi and McCallum 2013; Gómez-Rodríguez and Nivre 2013; Pitler and McDonald 2015).

Stack-based transition systems can produce general graphs rather than trees. Perhaps the simplest way to generate graphs is to shift one word at a time onto the stack, and then consider building all possible arcs between each word on the stack and the next word in the buffer. This is essentially the algorithm of Covington (2001), generalized to produce graphs rather than non-projective trees. This algorithm was also cast as a stack-based transition system by Nivre (2008). The algorithm runs in time $O(n^2)$, and requires the system to discriminate the arcs to be built from a large set of possibilities, potentially leading to errors.

Traditional stack-based parsing, which is restricted to trees, and the Covington algorithm as generalized to graph parsing can be thought of as two extremes, with a wide set of possible intermediate approaches staking out different trade-offs between expressiveness, on the one hand, and time and the discrimination required of machine learning components on the other. In this article, we mathematically explore this trade-off and precisely characterize the relationship between parsing systems and the set of graphs they can build. We describe a parsing system based on adding a working set, which we refer to as a **cache**, to the traditional stack and buffer. With cache size 2, our algorithm can only build trees, while with unbounded cache, our algorithm can build any graph, because it is then equivalent to the Covington algorithm generalized to graphs. We speculate that small, fixed cache sizes provide a good trade-off for fast and accurate string-to-graph parsing.

We analyze the class of graphs that can be successfully constructed by our parsing system, making use of the graph-theoretic notion of treewidth. The treewidth of a graph gives a measure of how tightly interconnected it is: Trees have treewidth 1, and fully connected graphs on *n* vertices have treewidth *n* − 1. We show that the class of graphs constructed by our parser is precisely characterized by treewidth: A transition system of cache size *m* can produce graphs of treewidth *m* − 1. Our framework assumes an input order of vertices, corresponding to the word order of the string, and we define a concept of relative treewidth to characterize the set of graphs that the parser can produce given a fixed input order of vertices. Finally, we develop an oracle algorithm for our parsing system, and prove its correctness. We also provide an algorithm for computing the minimal cache size needed to parse a given data set.

In general, a graph’s relative treewidth with respect to an input order may be much higher than its absolute treewidth. However, if relative treewidth with respect to the real English word order is low, and not significantly higher than the absolute treewidth, this indicates that the word order provides valuable information about the graph structure to be predicted, and that efficient parsing is possible by making use of this information. We test this hypothesis with experiments on Abstract Meaning Representation (Banarescu et al. 2013), a semantic formalism where the meaning of a sentence is encoded as a directed graph. We find that, for English sentences, these structures have low relative treewidth with respect to the English word order, and can thus be parsed efficiently using a transition-based parser with small cache size. In order to compare across a wider variety of the semantic representations that have been proposed (Kuhlmann and Oepen 2016), we also experiment with three sets of semantic dependencies from the Semeval 2015 semantic dependency parsing task (Oepen et al. 2015). With these data sets, which are generally closer to the surface string structure than Abstract Meaning Representation, we find somewhat higher relative treewidth. In every data set that we analyzed, over 99% of sentences can be covered with a cache size of eight.

## 2. Tree Decomposition and Treewidth

In this section we introduce and define the notions of tree decomposition and treewidth, as well as a few related concepts that we will use throughout this article. As usual, we denote an undirected graph as *G* = (*V*, *E*), where *V* is the set of vertices and *E* is the set of edges. Each edge is represented as an unordered pair (*u*, *v*) with *u*, *v* ∈ *V*.

The theory developed in this article is based on the graph theoretical notion of tree decomposition, which has been independently developed in several areas of computer science and discrete mathematics. From an application-oriented perspective, tree decomposition has proven very useful in discrete optimization and in the design of polynomial time algorithms using dynamic programming techniques.

The intuitive idea behind the notion of tree decomposition can be explained as follows. At this point, we use the term “interconnection” in a rather informal way; the precise meaning of this notion will be mathematically defined later. A tree is a special kind of graph where the vertices are arranged in a hierarchical way, with the property that the set of vertices in any subtree have only one interconnection with the set of the remaining vertices. In contrast, for a general graph this is not possible, meaning that we cannot group vertices in a hierarchical structure and pretend that there are a small number of interconnections between vertices in any subtree and the remaining vertices. This is apparent for a complete graph, that is, a graph where each vertex is connected with every other vertex. More interestingly, this is also true for a grid-like graph, where any hierarchical decomposition of the set of vertices will always lead to some subtree with a number of interconnections with the remaining structure that is not bounded by a constant. Note that, in contrast with a complete graph, where each vertex has a number of neighbors that is not bounded by a constant, in a grid each vertex has at most four neighbors. Still, the internal structure of a grid is unfavorable for this type of hierarchical arrangement. The notion of tree decomposition of a graph, and the related notion of treewidth, provide us precisely with the information we need: To what degree is it possible to arrange the vertices of a graph into some hierarchical structure, with the property that interconnections between vertices in any subtree and the remaining vertices are kept to a minimum?

A tree decomposition of a graph *G* is a type of tree having a subset of *G*’s vertices at each node. To avoid confusion, when describing tree decompositions we use the terms **node** and **arc**, and when describing graphs we use the terms **vertex** and **edge**. In a tree decomposition *T*, the set of nodes is denoted *I* and the set of arcs is denoted *F*. The subset of *V* associated with node *i* ∈ *I* is referred to as a **bag**, and is denoted by *X*_{i}. Formally, a **tree decomposition** of a graph *G* = (*V*, *E*) is defined as a pair ({*X*_{i} ∣ *i* ∈ *I*}, *T* = (*I*, *F*)) where tree *T* satisfies all of the following properties.

- *Vertex cover*: The nodes of the tree *T* cover all the vertices of *G*: $\bigcup_{i \in I} X_i = V$.
- *Edge cover*: Each edge in *G* is included in some node of *T*. That is, for all edges (*u*, *v*) ∈ *E*, there exists an *i* ∈ *I* with *u*, *v* ∈ *X*_{i}.
- *Running intersection*: The nodes of *T* containing a given vertex of *G* form a connected subtree. Mathematically, for all *i*, *j*, *k* ∈ *I*, if *j* is on the (unique) path from *i* to *k* in *T*, then *X*_{i} ∩ *X*_{k} ⊆ *X*_{j}.

The **width** of a tree decomposition ({*X*_{i}}, *T*) is max_{i} |*X*_{i}| − 1. The **treewidth** of a graph is the minimum width over all tree decompositions:

$$\mathbf{tw}(G) = \min_{(\{X_i\},\, T) \in TD(G)} \max_i |X_i| - 1$$

where *TD*(*G*) is the set of valid tree decompositions of *G*. We refer to a tree decomposition achieving the minimum possible width as being **optimal**.
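The three properties and the definition of width can be checked mechanically. The following is a minimal sketch, assuming bags are given as a dictionary from node identifiers to vertex sets and that the arcs already form a tree; the function names `is_tree_decomposition` and `width` are our own and purely illustrative.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple


def is_tree_decomposition(
    vertices: Set[Hashable],
    edges: Iterable[Tuple[Hashable, Hashable]],
    bags: Dict[Hashable, Set[Hashable]],
    arcs: Iterable[Tuple[Hashable, Hashable]],
) -> bool:
    """Check vertex cover, edge cover, and running intersection.

    Assumption: (bags.keys(), arcs) is already known to be a tree.
    """
    # Vertex cover: the union of all bags equals V.
    union = set()
    for bag in bags.values():
        union |= bag
    if union != set(vertices):
        return False
    # Edge cover: both endpoints of every edge share some bag.
    for u, v in edges:
        if not any(u in bag and v in bag for bag in bags.values()):
            return False
    # Running intersection: the nodes holding a vertex are connected in T.
    adjacency = {i: set() for i in bags}
    for i, j in arcs:
        adjacency[i].add(j)
        adjacency[j].add(i)
    for v in vertices:
        holders = {i for i, bag in bags.items() if v in bag}
        start = next(iter(holders))
        seen, frontier = {start}, [start]
        while frontier:
            i = frontier.pop()
            for j in (adjacency[i] & holders) - seen:
                seen.add(j)
                frontier.append(j)
        if seen != holders:
            return False
    return True


def width(bags: Dict[Hashable, Set[Hashable]]) -> int:
    """Width of a decomposition: size of the largest bag, minus one."""
    return max(len(bag) for bag in bags.values()) - 1


# A cycle on four vertices, with one valid and one invalid decomposition.
cycle_vertices = {"a", "b", "c", "d"}
cycle_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
good_bags = {0: {"a", "b", "c"}, 1: {"a", "c", "d"}}
good_arcs = [(0, 1)]
# Vertex "a" sits in bags 0 and 2 but not in bag 1 between them.
bad_bags = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c", "d", "a"}}
bad_arcs = [(0, 1), (1, 2)]
```

The second decomposition covers every vertex and edge but violates running intersection, which is why the check rejects it.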

In general, more densely interconnected graphs have higher treewidth. For instance, any tree has treewidth 1, a graph consisting of a single cycle with three or more vertices has treewidth 2, and a fully connected graph of *n* vertices has treewidth *n* − 1. Low treewidth thus indicates some treelike structure underlying the graph. When certain properties of the graph must be checked, the tree decomposition is often helpful in organizing computation (see, e.g., results of Courcelle [1990] and Arnborg, Lagergren, and Seese [1991]). Finding the treewidth of a graph is an NP-complete problem (Arnborg, Corneil, and Proskurowski 1987).

Consider the graph *G* in Figure 1 with vertex set *V* = {*A*, *B*, *C*, …, *Q*, *R*, *S*}. At first sight, *G*’s structure seems rather intricate, with edges scattered all over the picture. However, an optimal tree decomposition of *G* reveals that there is a tree-like structure underlying *G*. An optimal tree decomposition *T* of *G* is displayed in Figure 2(a), where we use gray circles to indicate the sets of vertices of *G* that represent the bags of *T*. Because adjacent bags of *T* share some of their vertices, our gray circles partially overlap. The tree-like structure underlying *G* is apparent from this overlapping. This tree decomposition has bags of three vertices each, and thus the graph’s treewidth is 2. In this representation, it is also apparent that there is a low number of interconnections among vertices in a subtree of *T* and the remaining vertices of *G*.

An alternative representation of the same tree decomposition is shown in Figure 2(b), where we focus on the vertices and ignore the edges of the graph. It is easy to see that the vertex cover and the edge cover conditions in the definition of tree decomposition are both satisfied by *T*. As an example of the running intersection property, note that the vertex *S* appears in three adjacent nodes of the tree decomposition.

Although general tree decompositions are undirected trees, in this article we will work with rooted, directed tree decompositions, in which one node is designated as the root, and the children of each node are ordered. We say that a rooted, ordered tree decomposition of graph *G* having width *k* is **smooth** if each bag contains exactly *k* + 1 vertices, and each bag contains the same vertices as its parent bag, with exactly one vertex removed and one vertex added. The tree decomposition in Figure 2(b) is smooth.

The concept of smooth tree decompositions, for standard unrooted tree decompositions, was introduced by Bodlaender (1996). Throughout this article, we also require that the root of a smooth tree decomposition contains *k* + 1 copies of the special symbol $, with vertices of *G* being added one at a time in the bags below the root. It is easy to see that the size of a smooth tree decomposition (i.e., the number of nodes of the tree) is the number of vertices in the graph plus one.

**Lemma 1.** Any tree decomposition *T* of graph *G* can be transformed into a smooth tree decomposition *T*′ of *G* of equal width.

*Proof.* Let *k* be the width of *T*. At each bag having fewer than *k* + 1 vertices, continue adding vertices from adjacent bags until all bags have the same size. If two adjacent bags *B*_{1} and *B*_{2} end up having the same vertices, collapse *B*_{1} and *B*_{2} into a single bag, and merge the children of the two bags in a way that preserves their order. If two adjacent bags *B*_{1} and *B*_{2} differ by more than one vertex in their contents, add intermediate bags by adding vertices from *B*_{2} and removing vertices from *B*_{1} one at a time. Finally, choose a bag *B* as the root of the tree constructed so far. Add a new root containing *k* + 1 instances of the special symbol $, and intermediate bags connecting the root to *B* adding one vertex of *B* at a time, and removing instances of $.

As already discussed in the Introduction, in natural language processing applications, we are not provided with a graph structure as input, and we are not asked to recognize whether that graph belongs to some formal language. We are instead given as input an ordered sequence of vertices of some graph, or a superset thereof, and we are asked to retrieve the graph itself. Although this latter problem is apparently more difficult than the former, since the edges of the graph must be decoded, the input ordering of the vertices plays an important role and can be used to restrict the search space, ultimately ending up with a more efficient computation. This idea is at the basis of the algorithms for graph parsing developed in this article. We therefore now introduce the notion of relative treewidth with respect to a given order of the vertices of a graph, which is original to this article.

Let *G* = (*V*, *E*) be some graph and let *T* be a smooth tree decomposition of *G*. We define the **vertex order** π(*T*) of *T* to be the sequence of vertices produced by visiting *T* in a preorder, left to right traversal and by listing the vertices newly introduced at the visited bags. Each vertex of *V* will appear exactly once in π(*T*). We will analyze the behavior of our parser when given a fixed input order over the vertices in terms of a notion of relative treewidth with respect to the input order. We define the **relative treewidth** of *G* with respect to an order π of *G*’s vertices to be the minimum width of any tree decomposition of *G* whose vertex order is π. Formally, we write

$$\mathbf{rtw}(G, \pi) = \min_{(\{X_i\},\, T) \in TD(G),\ \pi(T) = \pi} \max_i |X_i| - 1$$

Consider the chain-like graph *G* in Figure 3, with vertex set {1, 2, 3, 4}. In the top row, *G* is presented in vertex order π = (1, 2, 3, 4), along with a tree decomposition *T* such that π(*T*) = π. In the bottom row *G* is presented in vertex order π′ = (1, 2, 4, 3), with the tree decomposition *T*′ achieving the minimum width such that π(*T*′) = π′. The order π′ increases the relative treewidth: **rtw**(*G*, π) = 1, while **rtw**(*G*, π′) = 2.
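The vertex order of a smooth tree decomposition can be read off with a preorder traversal. The sketch below assumes a node is a pair `(bag, children)` with children ordered left to right; this representation and the two decompositions of the chain graph are our own reconstruction and need not coincide with the bags drawn in Figure 3.

```python
from typing import List


def vertex_order(root) -> List[str]:
    """Preorder, left-to-right listing of the vertices newly introduced at each bag."""
    order: List[str] = []

    def visit(node, parent_bag):
        bag, children = node
        for v in bag:
            # "$" is the filler symbol of the root, not a vertex of the graph.
            if v != "$" and v not in parent_bag:
                order.append(v)
        for child in children:
            visit(child, bag)

    visit(root, ())
    return order


# Chain 1-2-3-4 with a width-1 smooth decomposition (bags of size 2).
chain_tree = (("$", "$"), [
    (("$", "1"), [
        (("1", "2"), [
            (("2", "3"), [
                (("3", "4"), []),
            ]),
        ]),
    ]),
])

# Same graph, width-2 smooth decomposition (bags of size 3).
reordered_tree = (("$", "$", "$"), [
    (("$", "$", "1"), [
        (("$", "1", "2"), [
            (("1", "2", "4"), [
                (("2", "4", "3"), []),
            ]),
        ]),
    ]),
])
```

In the second tree, vertex 4 is introduced before vertex 3, so both 2 and 4 must stay in the bag until 3 arrives; this is what forces bags of size three.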

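For any graph *G*, there exists a vertex order achieving its optimal width, that is, an order π for which **rtw**(*G*, π) equals the treewidth of *G*. This is because an optimal tree decomposition of *G* can always be converted to a smooth tree decomposition of equal width, which provides us with the optimal order.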

Although our notion of relative treewidth with respect to a vertex order is superficially similar to the standard vertex elimination algorithm for finding a tree decomposition (Bodlaender 2006), which also takes as input a vertex order, these orders are in fact distinct. We will not make use of the vertex elimination algorithm in this article, but we describe it briefly here for readers interested in the connection between these two concepts. In the vertex elimination algorithm, vertices are processed in the input order, by adding edges connecting the current vertex’s remaining neighbors and then eliminating the current vertex. Each vertex along with its neighbors at the time of its elimination form one bag of the tree decomposition. The order of the vertex elimination algorithm corresponds to the order in which vertices are introduced in some outside–in traversal of the tree decomposition, whereas the order of our concept of relative treewidth corresponds to a pre-order traversal of a smooth tree decomposition.
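For readers who want to experiment with the difference between the two kinds of order, the following is a minimal sketch of the vertex elimination procedure as described in the paragraph above. The function names are ours, and the code returns only the bags, not the arcs of the resulting tree decomposition.

```python
from typing import Hashable, Iterable, List, Sequence, Set, Tuple


def elimination_bags(
    edges: Iterable[Tuple[Hashable, Hashable]],
    order: Sequence[Hashable],
) -> List[Set[Hashable]]:
    """Eliminate vertices in the given order and return one bag per vertex."""
    neighbors = {v: set() for v in order}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    bags = []
    for v in order:
        remaining = neighbors.pop(v)  # neighbors of v not yet eliminated
        for a in remaining:
            neighbors[a].discard(v)          # eliminate v
            neighbors[a] |= remaining - {a}  # connect v's remaining neighbors
        bags.append({v} | remaining)
    return bags


def elimination_width(edges, order) -> int:
    """Width of the decomposition induced by an elimination order."""
    return max(len(bag) for bag in elimination_bags(edges, order)) - 1


chain = [(1, 2), (2, 3), (3, 4)]
```

On the chain graph, the elimination order (1, 2, 4, 3) yields width 1, whereas the relative treewidth with respect to the same sequence is 2, as stated above for Figure 3. Eliminating vertex 2 first, by contrast, connects vertices 1 and 3 and yields width 2.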

## 3. Cache Transition Parser

In this section, we introduce a nondeterministic computational model for graph-based parsing, which we call a **cache transition parser**. The model takes as input an ordered sequence of vertices, reads it strictly from left to right, and incrementally produces a graph as output. Our model is an extension of the transition-based parsing framework described by Nivre (2008) for dependency tree parsing. We assume the reader is familiar with such a framework. We also provide a characterization of our cache transition parsers using the notions of tree decomposition and width that have been introduced in Section 2. Throughout this section, for integer *m* ≥ 1 we write [*m*] to denote the set {1, …, *m*}.

Informally, a cache transition parser is a transition-based parser that processes input vertices and produces an output graph. The graph is defined on the input vertices, or on a subset thereof. Besides its stack and buffer, the parser also uses a cache. A cache is a fixed-size array of *m* ≥ 1 elements and, along with the stack, represents the storage of the parser. At any time during the computation, a vertex that is in the storage of the parser is either in the cache or else in the stack, but not in both at the same time. The graph vertices in the input buffer are shifted into the cache *before* entering the stack. While in the cache, vertices can be directly accessed and edges between these vertices can be constructed.

Because the cache has fixed size, in order to be able to read a new vertex from the buffer and shift it into the cache, we need to make new room in the cache by moving some other vertex *v* from the cache into the stack. Once *v* is in the stack, it is no longer accessible for the operations of edge construction. Typically, the parser moves *v* out of the cache and into the stack when it predicts that, in the process of edge construction, *v* does not need to be accessed for a while. For instance, this happens when *v*’s neighbors that still need to be processed are all placed at a far distance in the buffer.

Crucially, the cache is not governed by a first-in first-out policy as in a queue: The vertex *v* that we move out of the cache might not be the “oldest” vertex that has been introduced in the cache itself. As a consequence, the choice of the vertex that is moved out of the cache and into the stack at each step may considerably alter the original ordering of the vertices in the input.

In addition to this operation, it is also possible to pop some vertex from the stack and put it back into the cache. Again, because the cache has fixed size, in order to be able to do this we need to make new room in the cache. This time, room is made by permanently removing some vertex from the cache, meaning that this vertex is dropped out of the parser storage. This happens when the parser decides that all of the edges incident to a vertex have been processed, and the vertex itself is no longer needed. Going back to our running example about vertex *v*, when the distant neighbors of *v* reach the front of the buffer and are shifted into the cache, we can exploit the previous operation, pop *v* from the stack, and move it back into the cache, where it will be available for the construction of the new edges. Altogether, the combination of these two operations has the effect of repeatedly moving *v* back and forth between the cache and the stack.

Formally, a **cache transition parser** consists of a stack, a cache, and an input buffer. The stack is a sequence σ of vertices and integers, as explained subsequently, with the topmost element always at the rightmost position. The buffer is a sequence of vertices β containing a suffix of the input, with the first element to be read at the leftmost position. Finally, the cache is a sequence of vertices η. The element at the leftmost position is called the first element of the cache, and the element at the rightmost position is called the last element.

A **configuration** of our parser has the form (σ, η, β, *E*), where σ, η, and β are the stack, cache, and buffer described above, and *E* is the set of edges being built. The initial configuration of the parser is ([], [$, …, $], [*v*_{1}, …, *v*_{n}], ∅), meaning that the stack and edge set are initially empty, and the cache is filled with *m* occurrences of the special symbol $. The final configuration is ([], [$, …, $], [], *E*_{G}), where the stack and the cache are as in the initial configuration and the buffer is empty. The constructed graph has set of vertices {*v*_{1}, …, *v*_{n}} and set of edges *E*_{G}.

The **transitions** of the parser are specified as follows.

- push(*i*, *C*) is parameterized by a position in the cache *i* ∈ [*m*] and a set of positions in the cache *C* ⊆ [*m*] ∖ {*i*}. It takes a configuration:

  $$(\sigma, [v_1, \ldots, v_{i-1}, v_i, v_{i+1}, \ldots, v_m], v|\beta, E)$$

  and moves to a configuration:

  $$(\sigma|i|v_i, [v_1, \ldots, v_{i-1}, v_{i+1}, \ldots, v_m, v], \beta, E'), \qquad E' = E \cup \{(v_k, v) \mid k \in C\}$$

  Here, we have shifted the next vertex *v* out of the buffer and moved it into the last position of the cache. We have also taken the vertex *v*_{i} appearing in position *i* in the cache and pushed it onto the stack σ, along with the integer *i* recording the position in the cache from which it came. Finally, we have added some edges to the graph being built, where the new edges connect the shifted vertex *v* with some subset of the other vertices in the cache. This subset is specified by the parameter *C*.
- pop takes a configuration:

  $$(\sigma|i|v, [v_1, \ldots, v_m], \beta, E)$$

  and moves to a configuration:

  $$(\sigma, [v_1, \ldots, v_{i-1}, v, v_i, \ldots, v_{m-1}], \beta, E)$$

  Here we have popped a vertex *v* from the stack, along with the integer *i* recording the position in the cache that it originally came from. We place *v* in position *i* in the cache, shifting the remainder of the cache one position to the right, and discarding the last element in the cache.
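The two transitions translate almost literally into code. The sketch below is illustrative only: configurations are plain tuples (stack, cache, buffer, edges), positions are 1-based as in the definitions, and the function names `initial`, `push`, and `pop` are our own. The positions in *C* refer to the cache before the transition is applied.

```python
from typing import Iterable, List, Set, Tuple

# (stack, cache, buffer, edges); the stack alternates positions and vertices.
Config = Tuple[list, List[str], List[str], Set[Tuple[str, str]]]


def initial(m: int, vertices: Iterable[str]) -> Config:
    """Empty stack, cache filled with m copies of "$", full buffer, no edges."""
    return ([], ["$"] * m, list(vertices), set())


def push(config: Config, i: int, C: Set[int]) -> Config:
    """Shift the next buffer vertex into the cache and move cache[i] to the stack."""
    stack, cache, buffer, edges = config
    m = len(cache)
    if not (1 <= i <= m) or any(k == i or not (1 <= k <= m) for k in C):
        raise ValueError("i must be in [m] and C a subset of [m] minus {i}")
    v = buffer[0]
    new_edges = edges | {(cache[k - 1], v) for k in C}
    new_cache = cache[: i - 1] + cache[i:] + [v]
    return (stack + [i, cache[i - 1]], new_cache, buffer[1:], new_edges)


def pop(config: Config) -> Config:
    """Put the top stack vertex back at its recorded position; drop the last cache element."""
    stack, cache, buffer, edges = config
    i, v = stack[-2], stack[-1]
    new_cache = cache[: i - 1] + [v] + cache[i - 1 : -1]
    return (stack[:-2], new_cache, buffer, edges)


# A complete run with cache size 3 that builds a triangle on a, b, c.
c = initial(3, ["a", "b", "c"])
c = push(c, 1, set())    # cache becomes [$, $, a]
c = push(c, 1, {3})      # edge (a, b); cache becomes [$, a, b]
c = push(c, 1, {2, 3})   # edges (a, c) and (b, c); cache becomes [a, b, c]
c = pop(pop(pop(c)))     # back to the initial stack and cache
```

After the three pops the configuration is final: the stack is empty, the cache again holds three copies of $, and the buffer has been consumed.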

Consider the sentence “John wants Mary to succeed” and the associated semantic representation displayed as a graph in Figure 4. Note that the graph, with vertices in the order of the English sentence, corresponds to the graph at the bottom row of Figure 3. When given the vertex sequence [j, w, m, t, s] as input, our nondeterministic parser will be able to construct the given graph using the run displayed in Figure 5. For instance, when the parser reaches the configuration ([1, $, 1, $, 1, $], [j, w, m], [s], *E*_{1}), the transition push(1, {2, 3}) pushes the vertex j from the cache into the stack, shifts the vertex s from the buffer into the cache, and constructs the two new edges (w, s) and (m, s), because w and m occupy positions 2 and 3 of the cache. The resulting configuration is then ([1, $, 1, $, 1, $, 1, j], [w, m, s], [], *E*_{1} ∪ {(w, s), (m, s)}).

We have described our transition system as producing undirected graphs with unlabeled edges, but it can be easily extended to produce directed graphs and labeled edges. Directed graphs can be produced by modifying the parameter *C* of the push transition to be defined as a set of tuples, where each tuple consists of an integer *k* and a binary variable that specifies whether to produce edge (*v*, *v*_{k}) or edge (*v*_{k}, *v*). Similarly, labeled edges can be produced by adding a value to each tuple specifying the edge’s label. Notice that the tuple representation could also be used to allow multiple arcs between the same two nodes, with different directions and labels. These extensions do not fundamentally change the set of graphs that can be produced with a given cache size; a directed (or labeled) graph can be produced if and only if its undirected (or unlabeled) counterpart can be produced. For this reason, we treat our graphs as undirected (and unlabeled) in the remainder of this article.

We now show that, in our parser, each pop transition reverses the effect of some previous push transition, in a sense that will be specified below. For *s* ≥ 1, consider a sequence of 2*s* transitions γ = *t*_{1}, …, *t*_{2s}. We say that γ is **minimal reversing** if it consists of *s* push transitions intermixed with *s* pop transitions, with the property that in any proper prefix of γ of the form *t*_{1}, …, *t*_{k}, *k* ∈ [2*s* − 1], the number of push transitions is strictly greater than the number of pop transitions. It is not difficult to see that, if γ is minimal reversing, *t*_{1} must be a push transition and *t*_{2s} must be a pop transition.
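The definition of a minimal reversing sequence depends only on the pattern of push and pop transitions, not on their parameters. A small illustrative check, with transitions abbreviated to the strings `"push"` and `"pop"` (an encoding of our own), might look as follows.

```python
from typing import Sequence


def is_minimal_reversing(transitions: Sequence[str]) -> bool:
    """True if pushes and pops balance, and pushes strictly lead in every proper prefix."""
    if len(transitions) == 0 or len(transitions) % 2 == 1:
        return False
    depth = 0  # number of pushes minus number of pops seen so far
    for k, t in enumerate(transitions, start=1):
        depth += 1 if t == "push" else -1
        # In a proper prefix the pushes must strictly outnumber the pops.
        if k < len(transitions) and depth <= 0:
            return False
    return depth == 0
```

Note that a sequence such as push, pop, push, pop is balanced but not minimal reversing, because its proper prefix of length two already contains as many pops as pushes.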

**Lemma 2.** Let *c* be a configuration of the parser with stack σ and cache η. Let also γ be a minimal reversing sequence of transitions. If we apply to *c* the transitions of γ in the given order, we reach a configuration *c*′ with stack σ′ = σ and cache η′ = η.

*Proof.* Let γ = *t*_{1}, …, *t*_{2s}. We proceed by induction on *s*. If *s* = 1, γ must be composed of a push followed by a pop. The definition of the pop transition exactly restores the stack and the cache of the configuration *c* to which the push applied.

If *s* > 1, let γ′ = *t*_{2}, …, *t*_{2s−1}. It is not difficult to see that *t*_{2} must be a push transition and *t*_{2s−1} must be a pop transition. However, a proper prefix of γ′ might now have a number of push transitions that equals the number of pop transitions, making γ′ not minimal reversing. If this is the case, we split γ′ exactly at that point, and apply the same reasoning to the two subsequences, until γ′ is divided into subsequences that are all minimal reversing. Assume now that *c*_{1} is the configuration obtained by applying *t*_{1} to *c*, and *c*_{2s−1} is the configuration obtained by applying γ′ to *c*_{1}. Using the inductive hypothesis on each of the minimal reversing subsequences of γ′, we obtain that the stack and the cache of *c*_{1} and *c*_{2s−1} are equal. We have already observed that a pop transition applied to *c*_{1} would restore the stack and the cache of *c*. Because the stack and the cache of *c*_{1} and *c*_{2s−1} are the same, we conclude that the pop transition *t*_{2s} applied to *c*_{2s−1} produces configuration *c*_{2s} with exactly the same stack and cache as *c*.

Consider now a complete run of the parser, that is, a run starting at the initial configuration for a given input, and ending in a final configuration. Lemma 2 suggests that such a run can be represented by means of some underlying tree structure, as described in what follows. Each configuration of the cache reached at some timestep in the run is a node of the tree. Each push transition descends from one node of the tree to some of its children, and each pop transition returns to the parent node. We call this underlying tree structure the **derivation tree**. The derivation tree represents the history of the parsing process that produces the output graph, and it is possible to show that the set of derivation trees associated with the runs of a cache parser on any input can be generated by a context-free grammar. This follows from the fact that our parser is a special kind of push-down automaton.^{1}

Consider again the run displayed in Figure 5. We represent this run by means of the derivation tree displayed in Figure 6. Note that a walk through the tree that combines a preorder and a postorder visit exactly provides the sequence with the content of the cache at each timestep in the original run. Observe that each subtree of the derivation tree corresponds to a minimal reversing sequence within the run.

We now list some important properties of the derivation trees representing the runs of a cache transition parser, which are used subsequently. All these properties are direct consequences of the definition of the push and pop transitions and are rather intuitive; we therefore omit a formal proof.

1. The bag at each node contains the same items as its parent, with one vertex removed and one vertex added.
2. Every edge of the graph being built by the run of the parser can be associated with some bag that contains both of the edge’s endpoints, with one of the endpoints in the *m*-th position of the cache.
3. The bags containing a vertex *v* form a connected subgraph of the tree. This is in turn a subtree rooted at the bag where the vertex is first pushed into the cache (and also eventually deleted from the cache), and having as leaves the bags where the vertex is removed from the cache and pushed onto the stack (or equivalently, the bags where the vertex is popped from the stack and pushed back into the cache).
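To make the derivation tree concrete, the following sketch rebuilds it from a run and lets one check the first two properties on an example. It is an illustration under our own encoding: a run is a list of tuples `("push", i, C)` and `("pop",)`, and the class `Bag` and the helper names are hypothetical. Returning to the parent on a pop relies on Lemma 2.

```python
from collections import Counter
from typing import List, Optional, Sequence, Set, Tuple


class Bag:
    """One node of the derivation tree: the cache contents at some timestep."""

    def __init__(self, cache: List[str], parent: Optional["Bag"] = None):
        self.cache = cache
        self.parent = parent
        self.children: List["Bag"] = []
        self.edges: Set[Tuple[str, str]] = set()  # edges built on entering this bag


def derivation_tree(m: int, vertices: Sequence[str], run: Sequence[tuple]) -> Bag:
    """Build the derivation tree of a run of a parser with cache size m."""
    root = Bag(["$"] * m)
    node, buffer = root, list(vertices)
    for transition in run:
        if transition[0] == "push":
            _, i, C = transition
            v = buffer.pop(0)
            child = Bag(node.cache[: i - 1] + node.cache[i:] + [v], parent=node)
            child.edges = {(node.cache[k - 1], v) for k in C}
            node.children.append(child)
            node = child   # a push descends to a new child
        else:
            node = node.parent  # a pop returns to the parent
    return root


def preorder(node: Bag):
    yield node
    for child in node.children:
        yield from preorder(child)


def differs_by_one(parent: Bag, child: Bag) -> bool:
    """Property 1: one item removed and one item added (as multisets)."""
    added = Counter(child.cache) - Counter(parent.cache)
    removed = Counter(parent.cache) - Counter(child.cache)
    return sum(added.values()) == 1 and sum(removed.values()) == 1


# Cache size 2, star graph with edges (a, b) and (a, c).
star_run = [("push", 1, set()), ("push", 1, {2}), ("pop",),
            ("push", 1, {2}), ("pop",), ("pop",)]
star_root = derivation_tree(2, ["a", "b", "c"], star_run)
star_bags = [bag.cache for bag in preorder(star_root)]
```

In this example the bag [$, a] has two children, [a, b] and [a, c], so the derivation tree branches even though the input is a sequence.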

We can now provide a characterization of the runs/derivation trees of a cache transition parser in terms of the notions of tree decomposition and width of the graph being constructed by the parser itself.

**Lemma 3.** Consider a cache transition parser with cache size *m*, and consider a run of the parser with input a vertex sequence π and with output the constructed graph *G*. Let *T* be the derivation tree representing the run. Then *T* forms a smooth tree decomposition of *G* having width *m* − 1 and having vertex order π(*T*) = π.

*Proof.* Properties 1 to 3 guarantee that *T* is a smooth tree decomposition of *G*. Each bag is first created by a push transition, which adds one vertex to the cache and removes one vertex from the cache. Because the bags of *T* have size *m*, the size of the cache, the width of *T* is *m* − 1. Recall that the vertex order π(*T*) is the sequence of vertices produced by visiting *T* in a preorder traversal and listing the vertices newly introduced at the visited bags. Since the derivation tree *T* is constructed depth first by pushing vertices from the input buffer into the cache, π(*T*) is exactly the order of the vertices in π.

We can also prove the converse of the previous lemma.

**Lemma 4.** Consider a graph *G* with a smooth tree decomposition *T* having width *m* − 1, and let π(*T*) be the vertex order of *T*. Then *T* is a derivation tree of a cache transition parser with cache size *m*, and *G* is constructed by the associated run given π(*T*) as input.

*Proof.* Let the cache transition parser take a sequence of transitions corresponding to a depth-first traversal of *T*, pushing an element from π(*T*) into the cache each time it descends one level in *T*, and popping each time it ascends. Let (*u*, *v*) be an edge of *G*. Because *T* is a tree decomposition of *G*, there is a bag of *T* containing both *u* and *v*. Without loss of generality, let *u* be the vertex that was introduced before *v* along the path from the root of *T* to the bag containing both *u* and *v*. Let *b*_{v} be the bag at which *v* is introduced. Because *v* can only appear in bags in the subtree of *T* rooted at *b*_{v}, this bag containing both *u* and *v* must appear in this subtree. Furthermore, by the running intersection property, since *u* appears in a bag at or below *b*_{v}, and is introduced above *b*_{v}, *u* must appear in *b*_{v}. Thus, because bags of *T* correspond to the cache at each step of the parser, the parser’s cache will contain *u* at the step at which *v* is pushed into the rightmost position of the cache. Therefore, the automaton can build each edge of *G*.
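The construction in this proof can be sketched in code. The function below walks a smooth tree decomposition depth first and emits one push per descent and one pop per ascent, choosing *C* as the cache positions of the neighbors of the newly introduced vertex. The node encoding `(bag, children)`, the transition encoding, and the function name are assumptions of ours; the input is assumed to be a valid smooth tree decomposition.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def transitions_from_decomposition(root, edges: Iterable[Tuple[str, str]]):
    """Return (run, built): a transition sequence and the edges it constructs."""
    adjacent = {frozenset(e) for e in edges}
    run: List[tuple] = []
    built: Set[frozenset] = set()

    def descend(node, cache: List[str]) -> None:
        _, children = node
        for child in children:
            child_bag = child[0]
            # Smoothness: exactly one item leaves and one vertex enters.
            (removed,) = (Counter(cache) - Counter(child_bag)).elements()
            (added,) = (Counter(child_bag) - Counter(cache)).elements()
            i = cache.index(removed) + 1  # 1-based position to push out
            C = {k + 1 for k, u in enumerate(cache)
                 if frozenset((u, added)) in adjacent}
            built.update(frozenset((cache[k - 1], added)) for k in C)
            run.append(("push", i, C))
            descend(child, cache[: i - 1] + cache[i:] + [added])
            run.append(("pop",))

    descend(root, list(root[0]))
    return run, built


# Chain 1-2-3-4 with a width-2 smooth decomposition whose vertex order is 1, 2, 4, 3.
chain_edges = [("1", "2"), ("2", "3"), ("3", "4")]
tree = (("$", "$", "$"), [
    (("$", "$", "1"), [
        (("$", "1", "2"), [
            (("1", "2", "4"), [
                (("2", "4", "3"), []),
            ]),
        ]),
    ]),
])
run, built = transitions_from_decomposition(tree, chain_edges)
```

Every edge of the chain is built at the moment its later endpoint is shifted into the cache, which is the argument made in the proof.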

Combining Lemmas 3 and 4, and using Lemma 1 from Section 2, we have the following main result, which is a characterization of the relative treewidth of a graph with respect to an ordering of its vertices.

**Theorem 1.** Let *G* be some graph and let π be some ordering of its vertices. The relative treewidth of *G* with respect to π is *m* − 1 if and only if a transition parser with input π can construct *G* using cache size *m* but not using cache size *m* − 1.

The computational problem of deciding whether a transition parser with cache size *m* and with input π can construct *G* is treated in Section 4. Furthermore, the problem of efficiently computing the smallest cache size *m* that allows a transition parser to construct *G* from input π is treated in Section 5.

Similarly to Theorem 1, the following result provides a characterization of the treewidth of a graph. Again, the result is a direct consequence of Lemmas 3, 4, and 1.

**Theorem 2.** A graph *G* has treewidth *m* − 1 if and only if a transition parser with cache size *m* can construct *G* for some input ordering of *G*’s vertices, and no transition parser with cache size *m* − 1 can construct *G* for any ordering of *G*’s vertices.

## 4. Oracle Algorithm

A cache transition parser is a nondeterministic automaton: For a fixed vertex sequence π, the parser could construct several graphs, all having tree decompositions with vertex order π (see Lemma 4). Even for an individual graph *G*, there may be several runs of the parser on π, each constructing *G* through a tree decomposition having vertex order π. This is usually called spurious ambiguity.

In this section we develop an algorithm that can be used to drive a cache transition parser with cache size *m*, in such a way that the parser becomes deterministic. This means that at most one computation is possible for each pair of *G* and π. More precisely, our algorithm takes as input a configuration *c* of the parser obtained when running on π, and a graph *G* to be constructed. Then the algorithm computes the unique transition that should be applied to *c* in order to construct *G* according to a canonical tree decomposition of width *m* − 1 having vertex order π. If no such tree decomposition exists, then the algorithm fails at some configuration obtained when running on π.

In the literature on transition-based parsing, algorithms of this type are called **oracles** (Nivre 2008). Oracles are used to produce training data for the parser out of gold target structures. In our case, if we are given a data set of vertex sequences paired with gold graphs, an oracle can be used to provide a set of canonical transition sequences for training a classifier to predict the best transition at each configuration. The oracle algorithm can also be used to support Theorem 1 in computing the relative treewidth of *G* with respect to some vertex order π. Finally, we will later use the oracle algorithm (in Section 5) to compute the minimal cache size needed to parse a given data set of gold graphs.

Let *E*_{G} be the set of edges of the gold graph *G*. The oracle algorithm can look into *E*_{G} in order to decide which transition to use at *c*, or else to decide that it should fail. This decision is based on three mutually exclusive rules, listed below. Assume that *c* has cache η = [*v*_{1}, …, *v*_{m}] and buffer β. The first rule is given by:

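1. If there is no edge (*v*_{m}, *v*) in *E*_{G} such that vertex *v* is in β, the oracle chooses transition pop.

For *j* ∈ [|β|], we write β_{j} to denote the *j*-th vertex in β. We choose a vertex $v_{i^*}$ in η such that:

$$i^* = \operatorname*{argmax}_{i \in [m]} \; \min \{\, j \mid (v_i, \beta_j) \in E_G \,\} \qquad (1)$$

where the minimum over an empty set is taken to be +∞. That is, $v_{i^*}$ is the vertex in the cache whose closest neighbor in the buffer lies furthest in the future. We also define the set *C* of edges connecting β_{1} to the vertices that remain in the cache:

$$C = \{\, (v_i, \beta_1) \mid i \in [m],\ i \neq i^*,\ (v_i, \beta_1) \in E_G \,\} \qquad (2)$$

The remaining two rules are then defined as follows for configuration *c*:

2. If Rule 1 does not apply, and there is no edge (*v*, β_{1}) in *E*_{G} such that vertex *v* is in the stack or $v = v_{i^*}$, the oracle chooses transition push(*i**, *C*).
3. If Rule 1 and Rule 2 do not apply, the oracle fails.

Rule 2 thus checks that all edges between β_{1} and previously shifted vertices can be built using the cache. If this is not the case, it means that *G* cannot be produced with the given cache size, and thus the parser rejects it.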
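The three rules can be read as a small decision procedure. The Python sketch below is one illustrative reading, not the authors' implementation: the function name `oracle_step`, the representation of *E*_{G} as a set of unordered pairs, the 0-based cache indices, and the choice to skip pop when the stack is empty but input remains (so that the first vertex can be shifted) are all assumptions made for the example. The selection of $v_{i^*}$ follows the scoring described later in this section: each cache vertex is scored by the buffer position of its closest neighbor, with +∞ when it has none.

```python
import math

def oracle_step(cache, stack, buffer, edges):
    """Choose the oracle transition for one non-final configuration.

    cache  -- list [v_1, ..., v_m]; placeholders such as "$" are allowed
    stack  -- list of (cache index, vertex) pairs, top of stack last
    buffer -- list of unread vertices, beta_1 first
    edges  -- gold edge set E_G as a set of frozenset({u, v}) pairs
    """
    if not stack and not buffer:
        raise ValueError("final configuration: nothing left to do")

    def adjacent(u, v):
        return frozenset((u, v)) in edges

    # Rule 1: pop when the rightmost cache vertex has no neighbor in the buffer.
    # Assumption of this sketch: with an empty stack there is nothing to pop,
    # so we fall through and try to shift the next vertex instead.
    if stack and not any(adjacent(cache[-1], v) for v in buffer):
        return ("pop",)

    # Score each cache vertex by the buffer position of its closest neighbor
    # (+inf if it has none) and select the vertex with the largest score.
    def score(v):
        return next((j for j, w in enumerate(buffer) if adjacent(v, w)), math.inf)

    i_star = max(range(len(cache)), key=lambda i: score(cache[i]))
    beta_1 = buffer[0]

    # Rule 2: push unless beta_1 has a neighbor on the stack or at v_{i*}.
    lost = [v for _, v in stack] + [cache[i_star]]
    if not any(adjacent(v, beta_1) for v in lost):
        new_edges = {(v, beta_1) for i, v in enumerate(cache)
                     if i != i_star and adjacent(v, beta_1)}
        return ("push", i_star, new_edges)

    # Rule 3: neither Rule 1 nor Rule 2 applies.
    return ("fail",)
```

For a triangle on vertices a, b, c read in that order, the sketch fails at c with a cache of size 2 and pushes with a cache of size 3, in line with the triangle having treewidth 2.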

The restrictions on the transitions imposed by the oracle algorithm lead to certain properties in the tree decompositions that a cache transition parser produces when running in oracle mode. For a graph *G* we define an **eager tree decomposition** of *G* to be a smooth tree decomposition *T* produced by the parser running on input π in oracle mode, where π is some sequence of *G*’s vertices. We now show that the eager tree decomposition is a normal form for tree decompositions of graphs, preserving both the width and the vertex order.

**Lemma 5.** Any smooth tree decomposition *T* of graph *G* can be transformed into an eager tree decomposition *T*′ of *G* of equal width. Moreover, we have π(*T*′) = π(*T*).

*Proof*. Because *T* is a smooth tree decomposition, by Lemma 4 there exists a cache transition parser with cache size equal to the width of *T* + 1, such that a run of this parser on π(*T*) produces *T*. If the transitions of this run do not violate Rules 1 to 3 in the definition of our oracle, then *T* is also an eager tree decomposition. In case the run shows some violations of the three rules, we change *T* in order to eliminate these violations from the run, in a way that does not increase the width/cache size and preserves the order.

Suppose that our run contains some push transition that occurs when the rightmost vertex *v* in the cache η has no forward-pointing edge leading to some vertex in the buffer. This represents a violation of Rule 1 of the oracle. Let *I* be the set of nodes of *T*, and let *i* ∈ *I* be the node of *T* with rightmost vertex *v* in the cache, to which this push transition applies; see Figure 7. If there are several push transitions out of node *i*, those that represent a violation of Rule 1 must all be grouped at the right. We then choose the rightmost one. Let *i*_{1} ∈ *I* be the node of *T* produced by this push transition, and let *T*_{1} be the subtree of *T* rooted at *i*_{1}. The vertices of *G* that are pushed into the cache in the run associated with *T*_{1} cannot contain any neighbor of *v*. Thus *v* is not needed in *T*_{1}. We can therefore reattach subtree *T*_{1} to the parent node of *i*, *p*(*i*), in such a way that *i*_{1} becomes the immediate right sibling of *i*; see again Figure 7. Furthermore, we can replace all occurrences of *v* in *T*_{1} with copies of the vertex introduced at *p*(*i*).

Let *T*′ be the tree resulting from the above transformation of *T*. Because our transformation has not changed the size of the bags of *T*, *T*′ is still a smooth tree decomposition of *G*, with the same width as *T*. Since our transformation has moved *T*_{1} one level up in *T* without “jumping over” any other subtree of *T*, we must have π(*T*′) = π(*T*). Note that this transformation of *T* has removed from our run the alleged violation of Rule 1.

Suppose now that our run violates Rule 2 of the oracle. Because the run produces *G*, this can only happen if the parser does not push into the stack the vertex from the cache that will be needed furthest in the future. Let then *v*_{1} be the vertex that is pushed onto the stack, and let *v*_{2} ≠ *v*_{1} be the vertex that is needed furthest in the future. Let also *i*_{1} ∈ *I* be the node of *T* that is created at this step, and let *T*_{1} be the subtree of *T* rooted at *i*_{1}. Because *v*_{1} is removed from the cache when *i*_{1} is created, *v*_{1} does not appear anywhere in *T*_{1}, and none of the vertices that are pushed in *T*_{1} are neighbors of *v*_{1} in *G*. If *v*_{1} is not a neighbor of the vertices that are pushed in *T*_{1}, then *v*_{2} cannot be a neighbor of these vertices either, since *v*_{2}’s first neighbor occurs strictly after *v*_{1}’s first neighbor in β. Therefore, although *v*_{2} appears in subtree *T*_{1}, it is never used there to construct an edge of *G*. All occurrences of *v*_{2} in *T*_{1} can then be replaced by occurrences of *v*_{1}.

Let *T*′ be the tree resulting from the second transformation above. Again, tree *T*′ is a smooth tree decomposition of *G*, with the same width as *T*. Furthermore, the replacement of *v*_{2} by *v*_{1} does not affect the bags of *T* where these nodes have been introduced for the first time. Therefore we must have π(*T*′) = π(*T*). Note that this transformation of *T* has removed from our run the alleged violation of Rule 2.

These two transformations can be iterated until the resulting tree is an eager tree decomposition. From these observations, this tree has the same width and vertex order as *T*.

We can now prove the correctness of our oracle. We say that a cache transition parser running in oracle mode **accepts** its input if it reaches a final configuration.

**Theorem 3.** Let *G* be some graph and let π be some ordering of its vertices. Assume that the relative treewidth of *G* with respect to π is *m* − 1. Then a cache transition parser with cache size *m* running in oracle mode on input *G* and π will accept.

*Proof*. By Lemma 5, there exists an eager tree decomposition *T* for *G* of width *m* − 1 such that π(*T*) = π. By definition of eager tree decomposition, a cache transition parser with cache size *m* can run in oracle mode on input *G* and π, without any violation of Rules 1 to 3. The parser will then accept.

We conclude this section with a computational analysis of the cache transition parser running in oracle mode. Let *G* and π be the input to the parser, and assume the cache size is *m*. Each pop transition can be carried out in constant time. Each push transition involves the processing of *m* − 1 vertices from the cache, testing their connection in *G* to the vertex shifted into the cache. This can be easily carried out in total time $O(m)$.

We now consider the computation of Rules 1 to 3 of the oracle at each step of the parser. We preprocess *G* in such a way that, for each vertex *v*, we have an adjacency list *a*(*v*) with all of *v*’s neighbors, sorted according to the left-to-right order in which these vertices appear in π. All of the adjacency lists together can be computed in time $O(|G| \log(d))$, where *d* is the maximum degree of a vertex of *G*. The main idea, explained in more detail subsequently, is to remove from each *a*(*v*) the vertices as soon as the associated edges are processed. In this way, at each timestep, each *a*(*v*) is an ordered list of the unprocessed neighbors of *v*. These vertices must necessarily appear in the buffer.

Assume the current configuration has cache [*v*_{1}, …, *v*_{m}]. The computation of the oracle rules can be carried out as follows.

- To compute Rule 1, it suffices to check whether *a*(*v*_{m}) is empty, because any unprocessed neighbor of *v*_{m} must necessarily be located in the buffer. This takes time $O(1)$.
- To compute Rule 2, we consider each vertex *v*_{i} in the cache. If *a*(*v*_{i}) is empty, we assign to *v*_{i} a score of +∞. Otherwise, let *v* be the first vertex in *a*(*v*_{i}), and let *i*_{v} be the index of *v* in π. We then assign to *v*_{i} a score of *i*_{v}. According to Equation (1), we can now compute index *i** by finding the vertex in the cache with the maximum score, arbitrarily solving any tie. This can be done in time $O(m)$.

  Next, we need to compute set *C* as defined in Equation (2). Let *v* be the first vertex in the buffer. We check that the backward neighbors in *a*(*v*) are all in the cache and do not include vertex $v_{i^*}$, as required by Rule 2. This again can be done in time $O(m)$.
- Finally, Rule 3 trivially takes time $O(1)$.

To conclude our analysis, we need to consider the amount of time spent in the updating of the adjacency lists. This is done right after each push transition, when a vertex *v* is shifted from the buffer into the cache. We observe that, at this time, all of the backward neighbors in *a*(*v*) must be in the cache, otherwise the push transition would not be possible and the computation would fail by Rule 3. We can then remove these backward neighbors from *a*(*v*) in time $O(m)$. Symmetrically, for each vertex *v*′ that is removed from *a*(*v*), we also remove *v* from *a*(*v*′). Note that *v* is always the first element of *a*(*v*′), and can thus be removed in time $O(1)$.

To summarize, at each step in the parsing process, we check Rules 1 to 3 of the oracle, we perform the required transition, and we update all of the adjacency lists in total time $O(m)$. The parser makes exactly one push and one pop transition for each arc of the eager tree decomposition of *G* given vertex order π. Because the number of arcs is |π|, the processing time (excluding the initialization of the adjacency lists) is $O(|\pi |m)$.
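The bookkeeping just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the name `oracle_run`, the `"$"` placeholder that initially fills the cache, the 0-based cache indices, and the convention that pop is skipped while the stack is empty are all choices made for the example, and the input is assumed to be a simple graph with no vertex named `"$"`. Each adjacency list *a*(*v*) is kept as a `deque` sorted by position in π, so that processed neighbors can be removed from the front in constant time.

```python
import math
from collections import deque
from itertools import takewhile

PAD = "$"  # placeholder filling the initial cache (assumed convention)

def oracle_run(order, edge_list, m):
    """Run the oracle on vertex order `order`; return the list of
    transitions, or None if the oracle fails with cache size m."""
    pos = {v: k for k, v in enumerate(order)}
    neighbors = {v: [] for v in order}
    for u, v in edge_list:
        neighbors[u].append(v)
        neighbors[v].append(u)
    # a[v]: unprocessed neighbors of v, sorted by position in the input.
    a = {v: deque(sorted(ns, key=pos.__getitem__)) for v, ns in neighbors.items()}
    a[PAD] = deque()

    cache, stack, transitions = [PAD] * m, [], []
    k = 0  # order[k] is the first vertex of the buffer
    while stack or k < len(order):
        if stack and not a[cache[-1]]:
            # Rule 1: restore the vertex saved by the matching push.
            i, v = stack.pop()
            cache.pop()
            cache.insert(i, v)
            transitions.append(("pop",))
            continue
        # Equation (1): evict the vertex whose next neighbor is furthest away.
        i_star = max(range(m),
                     key=lambda i: pos[a[cache[i]][0]] if a[cache[i]] else math.inf)
        v = order[k]
        # Backward neighbors of v sit at the front of its sorted list.
        back = list(takewhile(lambda u: pos[u] < k, a[v]))
        kept = set(cache) - {cache[i_star]}
        if any(u not in kept for u in back):
            return None  # Rule 3: some edge of v cannot be built
        # Rule 2: push, then mark the edges just built as processed.
        for u in back:
            a[v].popleft()
            a[u].popleft()  # v is the first unprocessed neighbor of u
        stack.append((i_star, cache.pop(i_star)))
        cache.append(v)
        transitions.append(("push", i_star, [(u, v) for u in back]))
        k += 1
    return transitions
```

On the path a-b-c read left to right with cache size 2, the sketch produces three push transitions followed by three pop transitions, one push and one pop per shifted vertex.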

Combining the initialization and the processing time, we have the following result.

**Theorem 4.** Let graph *G* and vertex ordering π be the input to a cache transition parser with cache size *m*, running in oracle mode. Let also *d* be the maximum degree of a vertex of *G*. A run of the parser takes time $O(|G| \log(d) + |\pi| m)$.

As already discussed, this computational result refers to the training phase, where we use the oracle to map gold graphs and orderings into canonical transition sequences for training a classifier that would choose the optimal transition when decoding strings into graphs. As for the decoder itself, because there is no need to compute Rules 1 to 3 of the oracle or to initialize the adjacency lists, the running time will be $O(|\pi |m)$ plus some function that accounts for the time for the computation of the classifier, which we do not deal with here.

As a second remark, in case we have a small value for the cache size *m*, the decoding time $O(|\pi |m)$ is very close to the linear time of a transition-based system for dependency tree parsing. More precisely, when *m* = 2 our parser will only be able to build trees (see discussion in Section 7.2). On the other extreme, when *m* = |π|, the parser will become the transition-based implementation of the Covington algorithm (Nivre 2008) generalized to graphs. This algorithm is able to parse arbitrary graphs, and will run in quadratic time in the length of the input. In the next section we discuss an algorithm that computes the minimal value of *m* for an input set of data. As we will see in Section 6, on real data for English we obtain values of *m* that are very small. This suggests that the graphs of interest for semantic representation of English sentences can be processed almost as efficiently as their syntactic dependency tree counterpart, when the vertices are provided according to the English order.

## 5. Computing Minimal Cache Size

We now examine the problem of computing the relative treewidth of a graph *G* with respect to an order π. As already seen in Theorem 1, this provides the smallest cache size needed by our parser in order to process π and produce *G*. This problem is also central in parsing applications: Its solution will allow us to compute in Section 6 the minimal cache size that guarantees a complete coverage of a given data set.

Let *T*_{1} and *T*_{2} be two smooth tree decompositions for the same graph *G*. We say that *T*_{1} and *T*_{2} are **m-equivalent** if the following conditions both hold.

- *T*_{1} and *T*_{2} have the same branching structure, that is, *T*_{1} and *T*_{2} are the same if we ignore the content of the bags at their nodes.
- Corresponding nodes of *T*_{1} and *T*_{2} introduce the same vertex of *G*.

In other words, *T*_{1} and *T*_{2} are m-equivalent if they differ only in the choice of the vertices of *G* that are dropped off at corresponding bags. As a direct consequence of the definition, we have that if *T*_{1} and *T*_{2} are m-equivalent, then π(*T*_{1}) = π(*T*_{2}).

**Lemma 6.** Let *G* be a graph and let π be a vertex order for *G*. When running in oracle mode on *G* and π, transition parsers with different cache sizes have associated eager tree decompositions that are m-equivalent.

*Proof.* We start by showing that, regardless of the size of the cache, when parsing in oracle mode, the sequence of push and pop transitions is always the same, and at corresponding timesteps of parsers with different cache size, the vertex in the rightmost position of the cache is always the same. To do this, we use induction on the number of moves, and we take advantage of the fact that, when parsing in oracle mode, the sequence of push and pop transitions depends only on the rightmost vertex in the cache and on the current position in the input buffer (see Rule 1 of our oracle).

Suppose that the first *h* − 1 moves for parsers of two different cache sizes, both producing *G*, consist of the same sequence of push and pop transitions (although the vertices chosen to be removed from the cache and pushed onto the stack may differ). Assume also that the rightmost vertex in the cache is the same for each of the first *h* − 1 moves for both parsers. Because both parsers have pushed the same number of times, both parsers will be at the same location in the buffer. Because the choice of push or pop depends only on the rightmost vertex in the cache and the position in the buffer, the choice of push or pop at step *h* will be the same for both parsers. If both parsers push, they will both shift the same vertex from the buffer, and will place the same vertex in the rightmost position of the cache. If both parsers pop, they will both return at timestep *h* to the cache configuration that they had at some previous timestep *i* < *h*. By the induction hypothesis, the cache configuration will have the same rightmost vertex.

Because parsers of any cache size running in oracle mode on *G* and π follow the same sequence of push and pop transitions, for all these runs the associated derivation trees and eager tree decompositions have the same branching structure. Furthermore, because corresponding nodes in these derivation trees have cache configurations with the same rightmost vertex, corresponding nodes of the tree decompositions introduce the same vertex of *G*. We thus conclude that the associated tree decompositions are all m-equivalent.

The next result shows an easy lower bound on the width of a smooth tree decomposition.

**Lemma 7.** Let τ be a subtree of a smooth tree decomposition *T* of graph *G*, and let *h* be the number of vertices of *G* introduced outside τ that are adjacent in *G* to vertices introduced inside τ. Then the width of *T* is at least *h*.

*Proof.* Each vertex is introduced in the topmost node of *T* in which it appears, so vertices introduced in τ appear only in τ. Each edge *e* of *G* incident on a node introduced inside τ must be assigned to a bag of *T* inside τ. If the other endpoint of *e* is introduced outside of τ, then the other endpoint must occur both inside and outside τ, and, by the running intersection property of tree decompositions, must occur in the bag *B* at the root of τ. If there are *h* such distinct endpoints, *B* must contain these *h* vertices and the vertex introduced at *B*, for a total size of *h* + 1 vertices. Therefore, the width of *T* is at least *h*.

The combination of Lemmas 6 and 7 leads to an efficient algorithm for finding the relative treewidth of *G* with respect to π, reported below. In the algorithm we use the following property. Let *T* be a smooth tree decomposition of *G*, and let τ be a subtree of *T*. Let also *v* be a vertex of *G* that is introduced outside of τ and that is adjacent in *G* to some vertex introduced inside τ. Then *v* must be introduced at some node of *T* that dominates the root of τ. To see this, consider that *v* is introduced at the topmost node of *T* in which it appears, since *T* is smooth. Furthermore, by the running intersection property this node must dominate the root of τ.
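The computation that Lemmas 6 and 7 suggest can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not a transcription of Algorithm 1: the name `relative_treewidth` and the dictionaries it uses are hypothetical, the input is assumed to be a simple graph, and a vertex is shifted whenever no node is open (an assumed convention for starting the run). By Lemma 6 the branching structure of the eager tree decomposition depends only on Rule 1, so the sketch builds that structure directly. It then counts, for each node, the vertices introduced above it that are adjacent to vertices introduced at or below it, and returns the largest count.

```python
def relative_treewidth(order, edge_list):
    """Largest number of 'vertices introduced above, adjacent to vertices
    introduced at or below' over all nodes of the eager tree decomposition."""
    pos = {v: k for k, v in enumerate(order)}
    earlier = {v: [] for v in order}   # backward neighbors of each vertex
    last_nb = {v: -1 for v in order}   # position of the last neighbor
    for u, v in edge_list:
        if pos[u] > pos[v]:
            u, v = v, u
        earlier[v].append(u)
        last_nb[u] = max(last_nb[u], pos[v])

    # Branching structure: each node is named by the vertex it introduces.
    parent, path = {}, []
    for k, v in enumerate(order):
        # Rule 1: pop while the current vertex has no neighbor left to read.
        while path and last_nb[path[-1]] < k:
            path.pop()
        parent[v] = path[-1] if path else None
        path.append(v)

    # Count, bottom-up, the outside vertices each subtree still needs.
    needed = {v: set(earlier[v]) for v in order}
    width = 0
    for v in reversed(order):          # descendants come before ancestors
        width = max(width, len(needed[v]))
        p = parent[v]
        if p is not None:
            needed[p] |= needed[v] - {p}
    return width
```

For the path a-b-c read left to right the sketch returns 1, and for a triangle on the same vertices it returns 2.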

The next result proves the correctness of Algorithm 1.

**Theorem 5.** Let graph *G* and vertex order π be the input to Algorithm 1. Then the algorithm returns the relative treewidth of *G* with respect to π.

*Proof.* Assume that Algorithm 1 returns integer *k*. We start by showing that there exists an eager tree decomposition *T* of *G* such that π(*T*) = π and the width of *T* is *k*. Let *T*_{a} be the eager tree decomposition produced at Step 3 of Algorithm 1. We construct a tree decomposition $T_a'$ by copying the branching structure of *T*_{a} and by editing each of the bags of *T*_{a} as described in what follows.

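Let *i* be a node of *T*_{a} and let *X*_{i} be the associated bag, introducing vertex *v*_{i} of *G*. We replace *X*_{i} with the bag $X_i'$ that contains *v*_{i} together with every vertex of *G* that is introduced at a node properly dominating *i* and is adjacent in *G* to some vertex introduced in the subtree of *T*_{a} rooted at *i*. For every *i* ∈ *I* we have $X_i'$ ⊆ *X*_{i}.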

We now argue that $T_a'$ is a valid tree decomposition of *G*. First, note that every vertex of *G* is introduced at some bag of $T_a'$. More precisely, if *v* is introduced at bag *X*_{i} of *T*_{a}, for some *i* ∈ *I*, then *v* is introduced at the corresponding bag $X_i'$ of $T_a'$. Furthermore, each vertex *v* of *G* appears in a connected subtree of $T_a'$. To see this, observe that if *v* ∈ *X*_{i} is dropped from $X_i'$, for some *i* ∈ *I*, then *v* will not appear in any of the bags $X_j'$ for nodes *j* that are dominated by *i*. Finally, each edge of *G* can be assigned to the bag that introduces the lower of its two endpoints (as noted above, the node introducing one endpoint must be an ancestor of the node introducing the other).

Because *T*_{a} and $T_a'$ have the same branching structure, and because vertices of *G* are introduced at corresponding nodes in *T*_{a} and $T_a'$, we have that π($T_a'$) = π(*T*_{a}) = π. Note that each $X_i'$ is constructed following essentially the same condition at Step 4 of Algorithm 1, which provides value *k*. Hence the largest bag of $T_a'$ has size *k* + 1 and $T_a'$ has width *k*. By Lemma 1, $T_a'$ can be transformed into a smooth tree decomposition of width *k*, preserving the order, and by Lemma 5 this smooth tree decomposition can in turn be transformed into an eager tree decomposition of width *k*, again preserving the order.

Let us now assume that the relative treewidth of *G* with respect to π is *k*′ < *k*. From Theorem 3, we have that a cache transition parser with cache size *k*′ + 1 running in oracle mode on *G* and π will accept. Let *T*′ be the eager tree decomposition associated with the run of the parser, and let *T* be the eager tree decomposition at Step 3 of Algorithm 1. By Lemma 6, *T* and *T*′ are m-equivalent.

Because corresponding nodes of *T* and *T*′ introduce the same vertex of *G*, we have that Step 4 of Algorithm 1 would return the same value *k* when running on *T*′. We can then apply Lemma 7 to *T*′, and conclude that *T*′ has width at least *k*. However, by Lemma 3, *T*′ will have width at most *k*′ < *k*. Since this is a contradiction, it cannot be possible for a tree decomposition of *G* to have order π and to have width smaller than *k*.

We conclude this section with a computational analysis of Algorithm 1. Following Theorem 4, Step 2 of the algorithm takes time $O(|G| \log(d) + |V_G|^2)$, where *V*_{G} is the set of *G*’s vertices (with |*V*_{G}| = |π|). To compute Step 4, let *I* be the set of nodes of *T*. For each *i* ∈ *I* we maintain a list of vertices introduced above *i* that are connected to nodes introduced below *i*. This list can be computed in time $O(|V_G|)$ using the lists at the children of *i*. The whole step then takes time $O(|V_G|^2)$. We have thus shown the following result.

**Theorem 6.** Let graph *G* and vertex order π be the input to Algorithm 1. Let also *d* be the maximum degree of a vertex in *G*. Algorithm 1 can be implemented to run in time $O(|V_G|^2 \log(d))$.

## 6. Experiments

In this section we consider several families of graph-based representations of semantic structures for natural language that are commonly used nowadays. We run experiments on graph data sets for these representations, with the aim to assess the coverage that our cache parser provides with different cache sizes.

We first evaluate our algorithm on Abstract Meaning Representation (AMR) (Banarescu et al. 2013). AMR is a semantic formalism where the meaning of a sentence is encoded as a rooted, directed graph. Figure 8 shows an example of an AMR graph in which the nodes represent the AMR concepts and the edges represent the relations between the concepts they connect. AMR concepts consist of predicate senses, named entity annotations, and in some cases, simply lemmas of English words. AMR relations consist of core semantic roles drawn from the Propbank (Palmer, Gildea, and Kingsbury 2005) as well as very fine-grained semantic relations defined specifically for AMR. We use the training set of LDC2015E86 for SemEval 2016 task 8 on meaning representation parsing (May 2016), which contains 16,833 sentences. This data set covers various domains including newswire and Web discussion forums.

For each graph, we derive a vertex order corresponding to the English word order by using the automatically generated alignments provided with the data set, which align tokens in the string to concepts or edges in the graph. We first collapse subgraphs of named entities and dates to a single node on the graph side. For example, the subgraph corresponding to “John” is collapsed to a single node “person+John,” and the same goes for the subgraph for person name “Mary.” There are also some vertices in the graph that are not aligned to any token. We want to linearize all vertices (concepts) in the graph in such a way that the order on the string side is preserved as much as possible (we call this the string order). We first sort the aligned vertices according to the positions of their aligned tokens, using the first position in case a vertex is aligned to multiple positions. If an unaligned vertex is the parent of an aligned vertex, we insert it right before the aligned vertex in the sequence. Otherwise, for simplicity, we append the unaligned vertex to the end of the vertex sequence, according to its relative order in the depth-first traversal of the graph.
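The linearization procedure can be sketched in Python as follows. This is a hedged illustration rather than the code used for the experiments: the function name `string_order` and its argument layout are hypothetical, the depth-first order is assumed to be supplied by the caller, "first position" is taken to mean the smallest token index, and an unaligned vertex with several aligned children is placed before the leftmost of them (the text does not specify this case).

```python
def string_order(vertices, children, alignment, dfs_order):
    """Linearize graph vertices following the token order of the sentence.

    vertices  -- iterable of vertex ids
    children  -- dict: vertex -> list of child vertices
    alignment -- dict: vertex -> list of token positions (absent = unaligned)
    dfs_order -- list of all vertices in depth-first traversal order
    """
    # Aligned vertices, sorted by their first (smallest) token position.
    aligned = sorted((v for v in vertices if alignment.get(v)),
                     key=lambda v: min(alignment[v]))
    rank = {v: k for k, v in enumerate(aligned)}

    before = {v: [] for v in aligned}  # unaligned parents placed before v
    leftover = []                      # unaligned vertices appended at the end
    for v in dfs_order:
        if alignment.get(v):
            continue
        kids = [c for c in children.get(v, []) if c in rank]
        if kids:
            # Assumption: attach to the leftmost aligned child.
            before[min(kids, key=rank.__getitem__)].append(v)
        else:
            leftover.append(v)

    order = []
    for v in aligned:
        order.extend(before[v])
        order.append(v)
    return order + leftover
```

Because `leftover` is filled while scanning `dfs_order`, the vertices appended at the end keep their relative depth-first order.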

After we have constructed the input vertices with vertex order π, we run Algorithm 1 to determine the relative treewidth of each AMR graph with respect to the vertex order π. Figure 9 shows the distribution of relative treewidth of AMR graphs in the data set. We can see that over 99% of the AMR graphs can be built using a cache size of 8. As shown in Table 1, the average relative treewidth with respect to the string order is 2.80. The average treewidth of this data set (i.e., the average of minimum relative treewidth of each AMR graph with respect to any vertex order) is 1.52. This shows that using the string order as a constraint does not significantly increase the treewidth statistics of the data set.

| data | real | string (gold) | string | reversed string (gold) | reversed string | random |
|---|---|---|---|---|---|---|
| LDC2015E86 | 1.52 | - | 2.80 | - | 3.08 | 4.84 |
| hand aligned | 1.43 | 2.61 | 2.68 | 2.79 | 2.90 | 4.81 |

In the worst case, the relative treewidth of a graph can be as high as 16, whereas the maximum treewidth in the data is 4. This is because, when the string order is used as a constraint on the vertex order, the string-to-vertex alignment does not always follow the preorder traversal of vertices that is desirable in the width computation. The most problematic case is the traversal of a node with many branches, of which the op structure is the most prominent instance in the AMR data. For example, in Figure 10, *n* different concepts are connected to the parent concept “and” with the op_{k} (*k* = 1, ⋯, *n*) relation. This structure does not introduce high treewidth because we can put “and” and each *c*_{i} into a separate bag, forming a chain of bags of width 1. However, when we use the string order as a constraint, we first introduce vertices *c*_{1}, *c*_{2}, …, *c*_{n−1}, and then introduce “and” and *c*_{n}. This results in a chain structure of length *n* + 1 in the tree decomposition, where “and” is introduced at the *n*-th bag in the chain. According to Algorithm 1, because *c*_{1}, *c*_{2}, …, *c*_{n−1} are all introduced above the bag that introduces “and” and all connect to “and,” the relative treewidth is at least *n* − 1. In general, a high-branching structure with most children introduced before the parent results in a relative treewidth that is much larger than the real treewidth of the graph.
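A worked instance of this argument for *n* = 4 may help. The notation $h(\cdot)$, counting the vertices introduced above a bag that are adjacent to vertices introduced at or below it, is introduced here only for illustration and follows the counting used in Lemma 7; π′ denotes an alternative order that introduces the parent first.

$$
\begin{aligned}
\pi &= (c_1, c_2, c_3, \text{and}, c_4), \qquad E_G = \{(\text{and}, c_k) \mid k = 1, \dots, 4\} \\
h(\text{and}) &= |\{c_1, c_2, c_3\}| = 3 = n - 1 \\
\pi' &= (\text{and}, c_1, c_2, c_3, c_4) \\
h(c_k) &= |\{\text{and}\}| = 1 \qquad (k = 1, \dots, 4)
\end{aligned}
$$

Under the string order π the relative treewidth is therefore at least 3, whereas under π′ a width of 1 suffices.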

Another reason for high relative treewidth is the alignment errors from the automatic alignments. When multiple instances of the same word align to multiple vertices with the same concept labels, the automatic alignment usually cannot distinguish them and often creates a many-to-many alignment between instances of the word in the string and instances of the concept in the graph. This results in a wrong traversal order of the multiple vertices and a larger relative treewidth. Our worst-case sentence, with relative treewidth of 16, is due to this type of error in the automatically generated alignments.

We additionally experiment on a smaller data set of 200 hand-aligned AMR/English sentence pairs by Pourdamghani et al. (2014). From Table 1, we can see that the average relative treewidth of these AMR graphs with respect to the string order is 2.61 when using the gold alignment, and the average treewidth is 1.43. If we use the automatic alignment for these AMRs, the relative treewidth becomes 2.68. The maximum relative treewidth is 6 in both cases. This number is much lower than the maximum relative treewidth of the LDC2015E86 training data because the maximum sentence length of the smaller data set is 54, whereas for LDC2015E86 the maximum sentence length is as large as 225. The comparison also shows that alignment errors can result in higher relative treewidth, though not significantly.

We also evaluate the impact of different vertex orders on the relative treewidth. We can see from the table that if we reverse the vertex order (reversed string order), the relative treewidth is 3.08. This number is slightly larger than the one obtained with the string order. The reason might be that English is more likely to have relation arcs going from left to right. If we randomize the vertex order, the relative treewidth becomes 4.84.

We also evaluate the coverage of our algorithm on semantic graph-based representations other than AMR. We consider the set of semantic graphs in the Broad-Coverage Semantic Dependency Parsing task of SemEval 2015 (Oepen et al. 2015), which uses three distinct graph representations for English semantic dependencies.

- **DELPH-IN MRS-Derived Bi-Lexical Dependencies (DM)**: These semantic dependencies are derived from the annotation of Sections 00-21 of the WSJ Corpus with gold-standard HPSG analyses provided by the LinGO English Resource Grammar (Flickinger 2000; Flickinger, Zhang, and Kordoni 2012). Among other layers of linguistic analysis, this representation also includes logical-form meaning representations in the framework of Minimal Recursion Semantics (MRS) (Copestake et al. 2005).
- **Enju Predicate-Argument Structures (PAS)**: This data set comes from the HPSG-based annotation of Penn Treebank, which is used for training the wide-coverage HPSG parser Enju (Miyao 2006). Enju can effectively analyze syntactic/semantic structures of English sentences and output phrase structures and predicate-argument structures with a wide-coverage grammar and a probabilistic model trained on this data.
- **Prague Semantic Dependencies (PSD)**: The Prague Czech-English Dependency Treebank (Hajic et al. 2012) is a set of parallel dependency trees over the WSJ texts from the Penn Treebank, and their Czech translations. The PSD bi-lexical dependencies have been extracted from what is called the **tectogrammatical** annotation layer (t-trees). We experiment on the English part of the treebank in this article.

| data | real | string |
|---|---|---|
| AMR | 1.52 | 2.80 |
| DM | 1.30 | 2.95 |
| PSD | 1.61 | 3.01 |
| PAS | 1.72 | 3.84 |

To summarize, we find that the relative treewidth of the analyzed semantic graph-based structures, with respect to the English word order, is low enough to make efficient parsing possible. Furthermore, the fact that the real word order results in lower relative treewidth than random orders, or even the reverse order, indicates that the real English word order provides valuable information that our parsing framework can exploit.

## 7. Comparison with Other Formalisms

In this section we compare our cache transition parser with existing formalisms that have been used for graph-based parsing, as well as with similar transition-based systems for dependency tree parsing.

### 7.1. Connection to Hyperedge Replacement Grammars

Hyperedge Replacement Grammars (HRGs) are a general graph rewriting formalism (Drewes, Kreowski, and Habel 1997) that has been applied by a number of authors to semantic graphs such as AMRs (Jones et al. 2012; Jones, Goldwater, and Johnson 2013; Peng, Song, and Gildea 2015). Our parsing formalism can be related to HRG through the concept of tree decomposition.

HRGs contain rules that rewrite a nonterminal hyperedge into a graph fragment consisting of a number of new nonterminal hyperedges and terminal edges (see the example grammar in Figure 12). The vertices that a nonterminal hyperedge connects to, known as its **ports**, are also the points at which the rule’s right-hand side graph fragment is attached to the rest of the graph after the nonterminal is rewritten. An HRG derivation of a graph can be viewed as a derivation tree, with a grammar rule at each node, where the rule expanding a nonterminal hyperedge is a child of the rule that introduced the nonterminal in its right-hand side. This derivation tree provides a tree decomposition of the derived graph. More precisely, the tree decomposition has the same nodes and the same branching structure as the derivation tree, and each bag in the tree decomposition contains all the vertices that appear in the right-hand side of the rule at the corresponding node.

In the other direction, it is possible to extract HRG rules from a tree decomposition by treating each bag as a rule. Nonterminals in the extracted rules correspond to the arcs of the tree decomposition. For an arc (*i*, *j*) of the tree decomposition, let *X*_{i} and *X*_{j} be the bags at nodes *i* and *j*, respectively. The set of vertices *X*_{i} ∩ *X*_{j}, known as the arc’s separator, are the vertices to which the corresponding HRG nonterminal was connected before being rewritten.
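The extraction just described can be sketched in a few lines of Python. The representation used here (a dictionary of bags, a dictionary of child lists, and the `extract_rule_skeletons` helper) is hypothetical and chosen only for illustration. For each bag, the sketch records the ports inherited from the arc to the parent and the separator handed to each child; terminal edges and labels are omitted.

```python
from typing import Dict, FrozenSet, List


def extract_rule_skeletons(
    bags: Dict[str, FrozenSet[str]],
    children: Dict[str, List[str]],
    root: str,
) -> Dict[str, dict]:
    """Return one rule skeleton per node of a rooted tree decomposition."""
    rules = {}
    # Each work item is (node, separator shared with its parent).
    work = [(root, frozenset())]
    while work:
        node, ports = work.pop()
        bag = bags[node]
        child_separators = {}
        for child in children.get(node, []):
            separator = bag & bags[child]  # X_i ∩ X_j for the arc (node, child)
            child_separators[child] = separator
            work.append((child, separator))
        rules[node] = {
            "ports": ports,                        # attachment points of the rewritten nonterminal
            "new_vertices": bag - ports,           # vertices first introduced by this rule
            "rhs_nonterminals": child_separators,  # one nonterminal per child arc
        }
    return rules


# Toy decomposition of the path a - b - c - d (an assumption for illustration).
bags = {
    "n1": frozenset({"a", "b"}),
    "n2": frozenset({"b", "c"}),
    "n3": frozenset({"c", "d"}),
}
children = {"n1": ["n2"], "n2": ["n3"]}
rules = extract_rule_skeletons(bags, children, "n1")
for node in ("n1", "n2", "n3"):
    print(node, rules[node])
```

In this toy case the rule for `n2` has the single port `b`, introduces the vertex `c`, and passes the separator `{c}` to the nonterminal that `n3` rewrites.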

An HRG generating graphs consisting of a single cycle of any size is shown in Figure 12. An example derivation of this grammar along with the corresponding tree decomposition is shown in Figure 13. Cycles have treewidth 2, as seen from the fact that the largest bag in the figure has size 3, and the grammar rules involve at most three vertices.
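The width-2 claim for cycles can be checked mechanically. The sketch below does not reproduce Figure 13; it builds one possible decomposition of a cycle on vertices 0, ..., n−1 (bags {0, i, i+1} arranged on a path, which is our own choice for illustration) and verifies the three defining conditions of a tree decomposition. The function names are hypothetical.

```python
def cycle_decomposition(n):
    """Path-shaped decomposition of the cycle 0-1-...-(n-1)-0, for n >= 3."""
    bags = [frozenset({0, i, i + 1}) for i in range(1, n - 1)]
    tree_arcs = [(j, j + 1) for j in range(len(bags) - 1)]
    return bags, tree_arcs


def is_tree_decomposition(vertices, edges, bags, tree_arcs):
    """Check the defining conditions; assumes tree_arcs already forms a tree."""
    # 1. Every vertex occurs in some bag.
    if not all(any(v in bag for bag in bags) for v in vertices):
        return False
    # 2. Both endpoints of every edge occur together in some bag.
    if not all(any(u in bag and v in bag for bag in bags) for u, v in edges):
        return False
    # 3. The bags containing any given vertex form a connected subtree.
    for v in vertices:
        holding = {i for i, bag in enumerate(bags) if v in bag}
        seen, frontier = set(), [min(holding)]
        while frontier:
            i = frontier.pop()
            if i in seen:
                continue
            seen.add(i)
            for a, c in tree_arcs:
                if a == i and c in holding:
                    frontier.append(c)
                if c == i and a in holding:
                    frontier.append(a)
        if seen != holding:
            return False
    return True


n = 6
vertices = list(range(n))
edges = [(i, (i + 1) % n) for i in range(n)]
bags, tree_arcs = cycle_decomposition(n)
valid = is_tree_decomposition(vertices, edges, bags, tree_arcs)
width = max(len(bag) for bag in bags) - 1
print(valid, width)  # prints: True 2
```

This only shows that width 2 is achievable for a cycle; that no decomposition of width 1 exists follows from the fact that graphs of treewidth 1 are forests.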

Because each run of our parser corresponds to a smooth tree decomposition, it is possible to describe a corresponding HRG. This HRG will have a number of specific properties. Because our tree decompositions are smooth, the HRG rules will always introduce exactly one new vertex in their right-hand side. Because the separators of a smooth tree decomposition of treewidth *k* all contain *k* vertices, the nonterminals of the derived HRG all have exactly *k* ports. The branching factor of a node in our tree decomposition corresponds to the number of right-hand side nonterminals in the corresponding HRG rule, and is potentially unlimited.
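As a small illustration of these properties, the following sketch checks smoothness and reads off the number of ports per nonterminal. It assumes the usual definition of smoothness (every bag has *k* + 1 vertices and adjacent bags share exactly *k*). The helper name `smoothness_report` and the example bags, a width-2 decomposition of a six-vertex cycle, are hypothetical.

```python
def smoothness_report(bags, parent_child_arcs):
    """Summarize a rooted tree decomposition given as bags plus (parent, child) arcs."""
    k = max(len(bag) for bag in bags) - 1
    separators = [bags[p] & bags[c] for p, c in parent_child_arcs]
    smooth = (
        all(len(bag) == k + 1 for bag in bags)
        and all(len(sep) == k for sep in separators)
    )
    return {
        "width": k,
        "smooth": smooth,
        # Ports of the nonterminal associated with each arc.
        "ports": [len(sep) for sep in separators],
        # Vertices introduced by each non-root rule.
        "new_vertices": [len(bags[c] - bags[p]) for p, c in parent_child_arcs],
    }


# Assumed example: bags {0, i, i+1} for a cycle on vertices 0..5, arranged on a path.
bags = [
    frozenset({0, 1, 2}),
    frozenset({0, 2, 3}),
    frozenset({0, 3, 4}),
    frozenset({0, 4, 5}),
]
arcs = [(0, 1), (1, 2), (2, 3)]
report = smoothness_report(bags, arcs)
print(report)
```

For this example every nonterminal has two ports and every non-root rule introduces one new vertex, matching the properties stated above for *k* = 2.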

In general, the branching factor of the tree decompositions produced by our parser is not bounded by a constant, while any fixed HRG has a maximum number of right-hand side nonterminals in its rules. This implies that, although it is possible to extract an HRG corresponding to one run of our parser, it is not always possible to produce an HRG whose derivations correspond to all possible runs of the parser. We emphasize that these observations apply to the derivations of the HRG rather than to the language of graphs produced. It is an open question whether all the graphs of fixed relative treewidth with respect to a vertex order can be generated by a fixed HRG.

### 7.2. Connection to Existing Transition-Based Systems

As already mentioned in the Introduction, transition systems using a stack data structure have been very successful in dependency tree parsing, and several proposals can be found in the literature. In this section we compare some of these systems with our cache transition parser.

We have already mentioned that when we use a cache with size 2, we can only construct graphs that are trees. This follows from the fact that any graph with at least one cycle has relative treewidth larger than one, and thus cannot be parsed with cache size 2, by Theorem 1. When the cache size is bounded by two, the edge construction operations that are available to the parser resemble the left-arc and the right-arc transitions of the arc-standard parser described by Nivre (2008), if we disregard the fact that these transitions remove the dependent vertex from the stack. However, this similarity is only superficial and the two parsers are incomparable, as explained subsequently.

The arc-standard parser can construct trees in which the root is at the rightmost position of the input string, and all of the remaining tokens are left children of the root. As already discussed in Section 6, such trees have relative treewidth proportional to the length of the string. By Theorem 1, these trees cannot be parsed with a cache size of 2.

In the opposite direction, when using cache size of 2 we can construct non-projective trees, something that is not possible with an arc-standard parser. As a simple example, consider the non-projective dependency tree *G* shown in the left part of Figure 14, where we have disregarded the edge directions. The derivation tree in the right part of Figure 14 displays a run of a parser with cache size 2 constructing *G*. When vertices *v*_{1} and *v*_{2} are in the cache for the first time, the parser constructs the edge (*v*_{1}, *v*_{2}). It then pushes *v*_{3} into the cache, moving *v*_{2} out of the cache and into the stack and constructing the edge (*v*_{1}, *v*_{3}). It then pops *v*_{3} from the cache, moving back to the configuration with cache content *v*_{1}, *v*_{2}. Next, the parser pushes *v*_{4} into the cache, moving *v*_{1} into the stack and constructing the edge (*v*_{2}, *v*_{4}). Afterward, it pops *v*_{4} from the cache, again moving back to the cache content *v*_{1}, *v*_{2}. Two more pops conclude the computation.
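The run just described can be replayed with a small simulation. This is a simplified sketch rather than the formal transition system defined earlier in the article: the `CacheParser` class, the `$` placeholder for empty cache slots, and the `connect` argument are assumptions introduced for illustration, and edges are stored as undirected, unlabeled pairs.

```python
PAD = "$"  # placeholder for an empty cache slot (an assumption of this sketch)


class CacheParser:
    def __init__(self, tokens, cache_size):
        self.buffer = list(tokens)
        self.cache = [PAD] * cache_size
        self.stack = []   # entries are (cache position, vertex)
        self.edges = set()
        self.trace = []   # (action, stack snapshot, cache snapshot)

    def _record(self, action):
        self.trace.append((action, tuple(self.stack), tuple(self.cache)))

    def push(self, evict, connect=()):
        """Move cache[evict] to the stack, shift the next token into the rightmost
        cache slot, and add an edge from it to each cache vertex in `connect`."""
        vertex = self.buffer.pop(0)
        self.stack.append((evict, self.cache.pop(evict)))
        for other in connect:
            assert other in self.cache, "edges may only involve vertices in the cache"
            self.edges.add((other, vertex))
        self.cache.append(vertex)
        self._record(f"push {vertex}")

    def pop(self):
        """Discard the rightmost cache vertex and restore the top stack entry
        to the cache position it was evicted from."""
        position, vertex = self.stack.pop()
        self.cache.pop()
        self.cache.insert(position, vertex)
        self._record("pop")


parser = CacheParser(["v1", "v2", "v3", "v4"], cache_size=2)
parser.push(0)                  # cache: $, v1
parser.push(0, connect=["v1"])  # cache: v1, v2; edge (v1, v2)
parser.push(1, connect=["v1"])  # v2 moves to the stack; cache: v1, v3; edge (v1, v3)
parser.pop()                    # cache: v1, v2
parser.push(0, connect=["v2"])  # v1 moves to the stack; cache: v2, v4; edge (v2, v4)
parser.pop()                    # cache: v1, v2
parser.pop()
parser.pop()                    # two more pops conclude the computation

for action, stack, cache in parser.trace:
    print(f"{action:8} stack={list(stack)} cache={list(cache)}")
print(sorted(parser.edges))
```

After the third push, the trace shows *v*_{2} on top of the stack while the cache holds *v*_{1}, *v*_{3}, which is the point at which the parser has effectively reordered the input.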

The fact that we can build non-projective trees, even with the minimum cache size of 2, is explained by the capability of our parser to use the cache to “reorder” vertices with respect to their original ordering in the input. In our example, for instance, at some step in the computation we have *v*_{2} in the stack and *v*_{1}, *v*_{3} in the cache in the given order. This reordering of the input tokens is somewhat reminiscent of the parsing model proposed by Nivre (2009), where a special transition called swap is used to reorder the input tokens and to construct non-projective dependency trees.

An alternative model for parsing non-projective dependency trees has been proposed by Attardi (2006). This parser constructs edges by connecting vertices in the stack that are not at adjacent positions, using special transitions called left-arc_{k} and right-arc_{k} for arbitrarily large values of integer *k* > 0. More precisely, left-arc_{k} and right-arc_{k} create an edge between the topmost vertex in the stack and the vertex at position *k* + 1. (In the original formulation of the parser, the two vertices involved are the first vertex in the buffer and the vertex at position *k* in the stack. If we consider the first element of the buffer as an additional stack element sitting on the top of the top-most stack symbol, the two formulations are equivalent.) The left-arc_{k} and right-arc_{k} transitions are superficially similar to the operations for edge construction that we exploit in our cache, since both operations connect vertices that are not at adjacent positions. However, the two models present two substantial differences, as discussed below.

First, the left-arc_{k} and right-arc_{k} transitions remove the dependent vertex from the stack, while vertices in the cache are retained after edge construction. This feature allows us to create loops in the produced graphs, or edge re-entrancies in case we produce directed graphs. Second, and most importantly, in the cache parser there is an interplay between the stack and the cache that allows us to reorder vertices in the stack with respect to their original ordering in the input, as already observed. This is not possible in Attardi’s parser. In other words, the cache should not be regarded as a finite size window at the top of the stack, as the case of Attardi’s parser might suggest: In the cache parser we can pick up any vertex from the cache and move it into the stack. Hence the stack is effectively used to “delay” the processing of vertices, if these vertices are not needed in the current branch of the computation.

The cache transition parsing proposed in this article can also be related to the two-register transition system of Pitler and McDonald (2015) for non-projective dependency parsing. In addition to the buffer and the stack, a two-register transition system uses two registers to store input vertices and to create edges involving these vertices and vertices in the stack. Because the stack and the registers are manipulated independently of one another, this technique basically alters the input order of the tokens, making it possible to produce non-projective trees. The cache parser can then be viewed as a generalization of the two-register transition system. This is because in a cache parser one can move tokens in and out of the cache repeatedly, as already discussed. This is not possible in a register transition system. It would be interesting then to explore the use of our cache parsers for non-projective dependency grammars.

We conclude this section with a discussion of other transition-based systems explicitly designed for graph parsing, as opposed to tree parsing. Sagae and Tsujii (2008) have possibly been the first authors to extend the stack-based transition framework for dependency tree parsing to directed acyclic graphs, with the motivation of representing semantically motivated predicate-argument relations and anaphoric references. This is done by dropping the constraint of a single head per word, and by using post-processing transformations that introduce non-projectivity. Titov et al. (2009) and Henderson et al. (2013) present a transition system for synchronous syntactic-semantic parsing, with the motivation of modeling the syntax/semantic interface. On the semantic side, their system mainly captures the predicate-argument structure and semantic role labeling. Their model has then been adapted by Du et al. (2014) for semantic-only parsing.

Later, Wang, Xue, and Pradhan (2015) proposed a transition system for AMR parsing. Unlike traditional stack-based transition parsers that process input strings, this system takes as input a dependency tree and processes its edges using a stack, applying tree-to-graph transformations that produce a directed acyclic graph. Similarly to Sagae and Tsujii (2008), the system presented by Damonte, Cohen, and Satta (2017) extends standard approaches for transition-based dependency parsing to AMR parsing, allowing re-entrancies. Similar extensions of transition-based systems to AMR parsing also appear in Zhou et al. (2016) and Ribeyre, de La Clergerie, and Seddah (2015).

All of these approaches are based on the idea of extending the transition inventory of standard transition-based dependency parsing systems in order to produce graph representations. From a theoretical perspective, what is missing from these proposals is a mathematical characterization of the set of graphs that can be produced and, with few exceptions, a precise description of the oracle algorithms that are used to produce training data from the gold graphs. Furthermore, all of these proposals still retain the stack and buffer architecture of the transition-based dependency parsing system they extend. In contrast, the proposal in this article introduces the novel idea of using a cache component in stack-based transition systems. As we have already discussed, the specific interplay between the stack and the cache allows the system to split the computation into different branches, and for each branch to reorder the input tokens in a way that allows edge processing locally to the cache, even in cases where the involved vertices are at a long distance in the input sequence. We have also provided a mathematical characterization of the graphs that can be constructed in this way, in terms of the novel notion of relative treewidth, and we have specified and analyzed an oracle algorithm to produce training data from the gold graphs.

## 8. Concluding Remarks

Our transition system is motivated by the task of semantic parsing of natural language sentences, and we now proceed to discuss some of the issues that still need to be addressed in developing a practical system based on our framework. The primary task is to develop a machine learning system for predicting the parser’s next action at each step. The optimal cache size will need to be determined empirically, as it may be beneficial to trade off coverage of the small number of sentences requiring large cache size in order to make the prediction of parser actions more accurate. We speculate that it will be desirable to decompose the push action into steps that first make the decision of whether to push or pop, and then whether to build each of the potential arcs within the cache individually, in order to reduce the space of predictions at each step. In the literature on dependency grammar parsing, models of this type are called **arc-factored models** and are frequently used. Further experimentation will be required to determine the best set of features and the best architecture for the machine learning component.

A possible extension of our framework is the development of a dynamic programming algorithm to allow efficient exploration of the space of possible runs of a parser on an input string. Intuitively, different runs on the same string might share common subparts. These subparts can be computed only once, and then “shared” among different runs using dynamic programming techniques. Dynamic programming algorithms for transition-based dependency parsing have been proposed by Huang and Sagae (2010) and Kuhlmann, Gómez-Rodríguez, and Satta (2011). These algorithms could be extended to our system, which is also fundamentally stack-based. Dynamic programming algorithms simulating transition-based parsers have proven useful in the realization of so-called dynamic oracles (Goldberg, Sartorio, and Satta 2014) for transition-based parsers, improving parsing performance with respect to static oracles, that is, oracles of the type discussed in Section 4. Furthermore, dynamic programming algorithms are the basis for methods for unsupervised learning, such as the inside-outside algorithm (Charniak 1993).

Although we have treated the input buffer as an ordering of the vertices of the final graph, this is a simplification of the problem setting of semantic parsing for NLP. Given as input a sequence of English words, the parser must also predict which words correspond to zero, one, or more vertices of the final graph, and possibly insert vertices not corresponding to any English word. This could be accomplished either by preprocessing the input string with a separate concept identification phase (Flanigan et al. 2014), or by extending the actions of the transition system to include moves inserting new vertices into the graph. We have not included moves inserting new vertices, in order to simplify our exposition, but such moves would not fundamentally alter the correspondence between parsing runs and tree decompositions described in this article.

The correspondence between runs of our parser and tree decompositions of the output graph allows for a precise characterization of the class of graphs covered, as well as simple and efficient algorithms for providing an oracle sequence of parser moves, and for determining the minimum cache size required to cover a data set. We find experimentally that semantic graphs have low relative treewidth with respect to English word order, indicating that our parsing approach provides a practical method of exploiting the word order in semantic parsing. Our concept of relative treewidth with respect to a vertex order appears to be new in the graph theory literature, and may have applications outside of natural language processing. Our transition system was primarily motivated by these theoretical considerations, and many other definitions are possible. In particular, our decision that vertices can only be popped from the rightmost position in the cache simplifies our analysis. Theoretical characterization of, and experimentation with, the set of other possible transition systems for building graphs is a promising area for future research.

## Note

The term “derivation tree” has been used in the literature to denote an underlying derivation process associated with a generative grammar (a rewriting system); see, for instance, the definition of tree-adjoining grammars (Joshi and Schabes 1997). Because we have a recognition device here, rather than a grammar, we are making a slight abuse of terminology.

## Author notes

Dipartimento di Ingegneria dell’Informazione, Università di Padova, Via Gradenigo 6/A, 35131 Padova, Italy. E-mail: satta@dei.unipd.it.

Computer Science Department, University of Rochester, Rochester NY 14627. E-mail: xpeng@cs.rochester.edu.