## Abstract

Transition-based parsing is a widely used approach for dependency parsing that combines high efficiency with expressive feature models. Many different transition systems have been proposed, often formalized in slightly different frameworks. In this article, we show that a large number of the known systems for projective dependency parsing can be viewed as variants of the same stack-based system with a small set of elementary transitions that can be composed into complex transitions and restricted in different ways. We call these systems divisible transition systems and prove a number of theoretical results about their expressivity and complexity. In particular, we characterize an important subclass called efficient divisible transition systems that parse planar dependency graphs in linear time. We go on to show, first, how this system can be restricted to capture exactly the set of planar dependency trees and, secondly, how the system can be generalized to *k*-planar trees by making use of multiple stacks. Using the first known efficient test for *k*-planarity, we investigate the coverage of *k*-planar trees in available dependency treebanks and find a very good fit for 2-planar trees. We end with an experimental evaluation showing that our 2-planar parser gives significant improvements in parsing accuracy over the corresponding 1-planar and projective parsers for data sets with non-projective dependency trees and performs on a par with the widely used arc-eager pseudo-projective parser.

## 1. Introduction

Syntactic parsing using dependency-based representations has attracted considerable interest in computational linguistics in recent years, both because it appears to provide a useful interface to downstream applications of parsing and because many dependency parsers combine competitive parsing accuracy with highly efficient processing. Among the most efficient systems available are transition-based dependency parsers, which perform a greedy search through a transition system, or abstract state machine, that maps sentences to dependency trees, guided by statistical models trained on treebank data (Yamada and Matsumoto 2003; Nivre, Hall, and Nilsson 2004; Attardi 2006; Zhang and Clark 2008). Transition systems for dependency parsing come in many different varieties, and our aim in the first part of this article is to deepen our understanding of these systems by analyzing them in a uniform framework.

More precisely, we demonstrate that a number of well-known systems from the literature can all be viewed as variants of a stack-based system with five elementary transitions, where different variants are obtained by composing elementary transitions into complex transitions and by adding restrictions on their applicability. We call such systems **divisible** transition systems and prove a number of theoretical results about their expressivity (which classes of dependency graphs they can handle) and their complexity (what upper bounds exist on the length of transition sequences). In particular, we show that an important subclass called **efficient** divisible transition systems derive planar dependency graphs in time that is linear in the length of the sentence using standard inference methods for transition-based dependency parsing. Even though many of these results were already known for particular systems, the general framework allows us to derive these results from more general principles and thereby to establish connections between previously unrelated systems. We then go on to show that there are interesting cases of efficient divisible transition systems that have not yet been explored, notably a system that is sound and complete for **planar** dependency trees, a mild extension to the class of projective trees that are assumed in most existing systems.

In the second part of the article, we take the planar parsing system as our point of departure for addressing the problem of non-projective dependency parsing. Despite the impressive results obtained with dependency parsers limited to strictly projective dependency trees—that is, trees where every subtree has a contiguous yield—it is clear that most if not all languages have syntactic constructions whose analysis requires non-projective trees. It is also clear, however, that allowing arbitrary non-projective trees makes parsing computationally hard (McDonald and Satta 2007) and does not seem justified by the data in available treebanks (Kuhlmann and Nivre 2006; Nivre 2006a; Havelka 2007). This suggests that we should try to find a superset of projective trees that is permissive enough to encompass constructions found in natural language yet restricted enough to permit efficient parsing. Proposals for such a set include trees with bounded arc degree (Nivre 2006a; Nivre 2007), well-nested trees with bounded gap degree (Kuhlmann and Nivre 2006; Kuhlmann and Möhl 2007), as well as trees parsable by a particular transition system such as that proposed by Attardi (2006).

In the same vein, Yli-Jyrä (2003) introduced the concept of **multiplanarity**, which generalizes the simple notion of planarity by saying that a dependency tree is *k*-planar if it can be decomposed into at most *k* planar subgraphs, a proposal that remains largely unexplored because an efficient test for *k*-planarity has been lacking. In this article, we construct a test for *k*-planarity by reducing it to a graph coloring problem. Applying this test to a wide range of dependency treebanks, we show that, although simple planarity (or 1-planarity) is clearly insufficient (Kuhlmann and Nivre 2006), the set of 2-planar dependency trees gives a very good fit with the available data, better than many of the previously proposed superclasses of projective trees. We then demonstrate how the transition system for planar dependency parsing can be generalized to *k*-planarity by introducing additional stacks. In particular, we define a two-stack system for 2-planar dependency parsing that is provably correct and has linear complexity. Finally, we show that the 2-planar parser, when evaluated on data sets with a non-negligible proportion of non-projective trees, gives significant improvements in parsing accuracy over the corresponding 1-planar and projective parsers, and provides comparable accuracy to the widely used arc-eager pseudo-projective parser.

The remainder of the article is structured as follows. Section 2 reviews basic concepts of dependency parsing and in particular the formalization of stack-based transition systems from Nivre (2008). Section 3 introduces our system of elementary transitions, uses it to analyze a number of parsing algorithms from the literature as divisible transition systems, proves a number of theoretical results about the expressivity and complexity of such systems, and finally introduces a divisible transition system for 1-planar dependency parsing. Section 4 reviews the notion of multiplanarity, introduces an efficient procedure for determining the smallest *k* for which a dependency tree is *k*-planar, and uses this procedure in an empirical investigation of available dependency treebanks. Section 5 shows how the divisible transition system framework and the 1-planar parser can be generalized to handle *k*-planar trees by introducing additional stacks, presents proofs of correctness and complexity for the 2-planar case, and reports the results of an experimental evaluation of projective, pseudo-projective, 1-planar and 2-planar dependency parsing. Section 6 reviews related work, and Section 7 concludes and makes suggestions for future research.

Some of the contributions in this article (namely, the test for multiplanarity and the 1-planar and 2-planar parsers) were published previously by Gómez-Rodríguez and Nivre (2010); this article substantially revises and extends the ideas presented in that paper. The framework of divisible transition systems and all the derived theoretical results, including the properties and proofs regarding the 1-planar and 2-planar parsers, are entirely new contributions of this article.

## 2. Dependency Parsing

Dependency parsing is based on the idea that syntactic structure can be analyzed in terms of binary, asymmetric relations between the words of a sentence, an idea that has a long tradition in descriptive and theoretical linguistics (Tesnière 1959; Sgall, Hajičová, and Panevová 1986; Mel'čuk 1988; Hudson 1990). In computational linguistics, dependency structures have become increasingly popular in the interface to downstream applications of parsing, such as information extraction (Culotta and Sorensen 2004; Stevenson and Greenwood 2006; Buyko and Hahn 2010), question answering (Shen and Klakow 2006; Bikel and Castelli 2008), and machine translation (Quirk, Menezes, and Cherry 2005; Xu et al. 2009). And although dependency structures can easily be extracted from other syntactic representations, such as phrase structure trees, this has also led to an increased interest in statistical parsers that specifically produce dependency trees (Eisner 1996; Yamada and Matsumoto 2003; Nivre, Hall, and Nilsson 2004; McDonald, Crammer, and Pereira 2005).

Current approaches to statistical dependency parsing can be broadly grouped into **graph-based** and **transition-based** techniques (McDonald and Nivre 2007). Graph-based parsers parameterize the parsing problem by the structure of the dependency trees and learn models for scoring entire parse trees for a given sentence. Many of these models permit exact inference using dynamic programming (Eisner 1996; McDonald, Crammer, and Pereira 2005; Carreras 2007; Koo and Collins 2010), but recent work has explored approximate search methods in order to widen the scope of features especially when processing non-projective trees (McDonald and Pereira 2006; Riedel and Clarke 2006; Nakagawa 2007; Smith and Eisner 2008; Martins, Smith, and Xing 2009; Koo et al. 2010; Martins et al. 2010). Transition-based parsers parameterize the parsing problem by the structure of a transition system, or abstract state machine, for mapping sentences to dependency trees and learn models for scoring individual transitions from one state to the other. Traditionally, transition-based parsers have relied on local optimization and greedy, deterministic parsing (Yamada and Matsumoto 2003; Nivre, Hall, and Nilsson 2004; Attardi 2006; Nivre 2008), but globally trained models and non-greedy parsing methods such as beam search are increasingly used (Johansson and Nugues 2006; Titov and Henderson 2007; Zhang and Clark 2008; Huang, Jiang, and Liu 2009; Huang and Sagae 2010; Zhang and Nivre 2011). In empirical evaluations, the two main approaches to dependency parsing often achieve very similar accuracy, but transition-based parsers tend to be more efficient. In this article, we will be concerned exclusively with transition-based models.

In the remainder of this background section, we first introduce the syntactic representations used by dependency parsers, starting from a general characterization of dependency graphs and discussing a number of different restrictions of this class that will be relevant for the analysis later on. We then go on to review the formalization of transition systems proposed by Nivre (2008), and in particular the class of stack-based systems that provides the framework for our discussion of existing and novel transition-based models. Finally, we discuss the implementation of efficient parsers based on these transition systems.

### 2.1 Dependency Graphs

In dependency parsing, the syntactic structure of a sentence is modeled by a **dependency graph**, which represents each token and its syntactic dependents through labeled, directed arcs. This is exemplified in Figure 1 for a Czech sentence taken from the Prague Dependency Treebank (Hajič et al. 2001; Böhmová et al. 2003), and in Figure 2 for an English sentence taken from the Penn Treebank (Marcus, Santorini, and Marcinkiewicz 1993; Marcus et al. 1994).^{1} In the former case, an artificial token root has been inserted at the beginning of the sentence, serving as the unique root of the graph and ensuring that the graph is a tree even if more than one token is independent of all other tokens. In the latter case, no such device has been used, and we will not in general assume the existence of an artificial root node prefixed to the sentence, although all our models will be compatible with such a device.

**Definition 1**

A **dependency graph** for a sentence *x* = *w*_{1}, …, *w*_{n} is a directed graph *G* = (*V*, *A*), where

1. *V* = {1, …, *n*} is a set of nodes,
2. *A* ⊆ *V* × *V* is a set of directed arcs, containing no loops (i.e., arcs of the form (*v*, *v*) are disallowed for all *v* ∈ *V*).

The set *V* of **nodes** (or **vertices**) is the set of positive integers up to and including *n*, each corresponding to the linear position of a token in the sentence (where the first token may or may not be the special token root). The set *A* of **arcs** (or **directed edges**) is a set of pairs (*i*, *j*), where *i* and *j* are distinct nodes. Because arcs are used to represent dependency relations, we say that *i* is the **head** of *j*; conversely, we say that *j* is a **dependent** of *i*. A node with no incoming arcs is called a **root**.

We will say that two arcs (*i*,*j*) and (*k*,*l*) **cross** if min (*i*,*j*) < min (*k*, *l*) < max (*i*,*j*) < max (*k*,*l*) or min (*k*,*l*) < min (*i*, *j*) < max (*k*,*l*) < max (*i*,*j*), and that an arc (*i*,*j*) **covers** a node *k* if min (*i*,*j*) < *k* < max (*i*,*j*).
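The two relations can be written down almost verbatim as predicates. The following is a minimal sketch, assuming arcs are represented as pairs of integers `(head, dependent)`; the function names `cross` and `covers` are chosen here for illustration.

```python
def cross(arc1, arc2):
    """Return True if the two arcs cross in the sense defined above."""
    (i, j), (k, l) = arc1, arc2
    lo1, hi1 = min(i, j), max(i, j)
    lo2, hi2 = min(k, l), max(k, l)
    # Exactly one endpoint of each arc lies strictly inside the other arc.
    return lo1 < lo2 < hi1 < hi2 or lo2 < lo1 < hi2 < hi1


def covers(arc, k):
    """Return True if node k lies strictly between the endpoints of the arc."""
    i, j = arc
    return min(i, j) < k < max(i, j)
```

Both relations ignore arc direction, so `cross((1, 3), (2, 4))` and `cross((3, 1), (4, 2))` give the same answer, and arcs that merely share an endpoint or are nested do not cross.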

Note that the dependency graphs defined by Definition 1 are **unlabeled** dependency graphs. Adding labels is straightforward by redefining arcs as triples (*i*, *l*, *j*), consisting of a head *i*, a label *l*, and a dependent *j*, but excluding labels for now will simplify the formal analysis without limiting the generality of the results. We will discuss the generalization to labeled dependency graphs whenever relevant, and the experiments reported in Section 5 all use labeled graphs.

**Definition 2**

Let *G* = (*V*, *A*) be a dependency graph.

1. Single-Head(*G*) ⇔ every node in *G* has at most one incoming arc.
2. Acyclic(*G*) ⇔ there are no (directed) cycles in *G*.
3. Connected(*G*) ⇔ *G* is weakly connected.
4. Tree(*G*) ⇔ *G* is a directed tree.
5. Planar(*G*) ⇔ there are no crossing arcs in *G*.
6. No-Covered-Roots(*G*) ⇔ there is no root covered by an arc in *G*.
7. Projective(*G*) ⇔ Planar(*G*) and No-Covered-Roots(*G*).

A dependency graph *G* that satisfies Tree(*G*) is called a **dependency tree**.

The final three constraints are usually defined only for dependency trees, although we have extended them to apply to dependency graphs in general. The most common of these is the Projective constraint, which for dependency trees is equivalent to the requirement that every subtree must have a contiguous yield and rules out both crossing arcs and covered roots. By contrast, the Planar constraint forbids crossing arcs but allows covered roots, which in the case of dependency trees is a very mild relaxation because there can be at most one covered root without violating the Tree constraint.
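The constraints of Definition 2 can be checked mechanically. The sketch below is one possible way to do so, assuming a graph is given as the number of nodes `n` together with a collection of `(head, dependent)` pairs; all function names are illustrative. The tree test relies on the fact that a single-head, acyclic graph is a forest, which is weakly connected exactly when it has *n* − 1 arcs.

```python
from itertools import combinations


def is_single_head(n, arcs):
    """Every node has at most one incoming arc."""
    incoming = [0] * (n + 1)
    for _, dep in set(arcs):
        incoming[dep] += 1
    return all(count <= 1 for count in incoming)


def is_acyclic(n, arcs):
    """No directed cycles: repeatedly peel off nodes without incoming arcs."""
    children = {v: [] for v in range(1, n + 1)}
    indegree = {v: 0 for v in range(1, n + 1)}
    for head, dep in set(arcs):
        children[head].append(dep)
        indegree[dep] += 1
    frontier = [v for v in indegree if indegree[v] == 0]
    removed = 0
    while frontier:
        v = frontier.pop()
        removed += 1
        for dep in children[v]:
            indegree[dep] -= 1
            if indegree[dep] == 0:
                frontier.append(dep)
    return removed == n


def is_tree(n, arcs):
    """A single-head, acyclic graph with n - 1 arcs is a directed tree."""
    return (is_single_head(n, arcs) and is_acyclic(n, arcs)
            and len(set(arcs)) == n - 1)


def is_planar(arcs):
    """No two arcs cross."""
    def cross(a, b):
        (lo1, hi1), (lo2, hi2) = sorted(a), sorted(b)
        return lo1 < lo2 < hi1 < hi2 or lo2 < lo1 < hi2 < hi1
    return not any(cross(a, b) for a, b in combinations(set(arcs), 2))


def has_no_covered_roots(n, arcs):
    """No node without an incoming arc lies strictly inside an arc."""
    roots = set(range(1, n + 1)) - {dep for _, dep in arcs}
    return not any(min(i, j) < r < max(i, j) for r in roots for i, j in arcs)


def is_projective(n, arcs):
    return is_planar(arcs) and has_no_covered_roots(n, arcs)
```

For instance, the three-node tree with arcs (2, 1) and (1, 3) is planar but not projective: its root, node 2, is covered by the arc (1, 3).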

**Example 1**

Consider the dependency graphs depicted in Figures 1 and 2, ignoring labels for the time being:

*G*_{1} satisfies Tree (hence also Single-Head, Acyclic, and Connected) and No-Covered-Roots but violates Planar (hence also Projective) because there are crossing arcs. By contrast, *G*_{2} satisfies all constraints listed in Definition 2.

### 2.2 Transition Systems for Dependency Parsing

Transition-based dependency parsing is based on the notion of a **transition system**, or abstract state machine, for mapping sentences to dependency graphs. Such systems are nondeterministic in general and usually combined with heuristic search, guided by a treebank-induced function for scoring different transitions out of a given configuration. For the time being, we will ignore the details of the search procedure and concentrate on the underlying transition systems. We will adopt the general framework of Nivre (2008) but restricted to *stack-based* systems.^{2}

**Definition 3**

A **transition system** for dependency parsing is a quadruple *S* = (*C*, *T*, *c*_{s}, *C*_{t}), where

1. *C* is a set of **configurations**, each of which contains a buffer *β* of (remaining) nodes and a set *A* of dependency arcs,
2. *T* is a set of **transitions**, each of which is a (partial) function *t* : *C* → *C*,
3. *c*_{s} is an **initialization function**, mapping a sentence *x* = *w*_{1}, …, *w*_{n} to a configuration with *β* = [1, …, *n*],
4. *C*_{t} ⊆ *C* is a set of **terminal configurations**.

In **stack-based** transition systems, a configuration takes the form of a triple *c* = (*σ*, *β*, *A*), where *σ* is a stack of nodes, *β* is a buffer of nodes, and *A* is a set of dependency arcs; the initialization function is *c*_{s}(*x*) = ([ ], [1, …, *n*], ∅) (for *x* = *w*_{1}, …, *w*_{n}); and the set of terminal configurations is *C*_{t} = {*c* | *c* = (*σ*, [ ], *A*) for any *σ*, *A*} (Nivre 2008).

We use the notation *σ*_{c}, *β*_{c}, and *A*_{c} to refer to the value of *σ*, *β*, and *A* in a given configuration *c*; we use |*σ*| and |*β*| to refer to the size of *σ* and *β* (i.e., the number of nodes), and we use [ ] to denote an empty stack or buffer.

**Definition 4**

Let *S* = (*C*, *T*, *c*_{s}, *C*_{t}) be a transition system. A **transition sequence**^{3} for a sentence *x* = *w*_{1}, …, *w*_{n} in *S* is a sequence *C*_{0,m} = (*c*_{0}, *c*_{1}, …, *c*_{m}) of configurations, such that

1. *c*_{0} = *c*_{s}(*x*),
2. *c*_{m} ∈ *C*_{t},
3. for every *i* (1 ≤ *i* ≤ *m*), *c*_{i} = *t*(*c*_{i − 1}) for some *t* ∈ *T*.

The **parse** assigned to *x* by *C*_{0,m} is the dependency graph *G*_{c_m} = ({1, …, *n*}, *A*_{c_m}), where *A*_{c_m} is the set of dependency arcs in *c*_{m}.

Starting from the initial configuration for the sentence to be parsed, transitions will manipulate *σ*, *β*, and *A* until a terminal configuration is reached (*β* is empty). Because the node set *V* is given by the input sentence itself, the set *A*_{c_m} of dependency arcs in the terminal configuration will determine the output dependency graph *G*_{c_m} = (*V*, *A*_{c_m}).

**Definition 5**

Let *S* = (*C*, *T*, *c*_{s}, *C*_{t}) be a transition system for dependency parsing.

1. *S* is **sound** for a class 𝒢 of dependency graphs if and only if, for every sentence *x* and every transition sequence *C*_{0,m} for *x* in *S*, the parse *G*_{c_m} ∈ 𝒢.
2. *S* is **complete** for a class 𝒢 of dependency graphs if and only if, for every sentence *x* and every dependency graph *G*_{x} for *x* in 𝒢, there is a transition sequence *C*_{0,m} for *x* in *S* such that *G*_{c_m} = *G*_{x}.
3. *S* is **correct** for a class 𝒢 of dependency graphs if and only if it is sound and complete for 𝒢.

**Example 2**

Nivre's (2008) arc-standard transition system uses three transitions:

The unlabeled dependency graph in Figure 2 is derived by the transition sequence in Figure 3. For labeled dependency parsing, the Left-Arc and Right-Arc transitions in addition have a parameter for the label *l* of the arc being added.
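As an illustration, the arc-standard system can be sketched in a few lines. This sketch assumes the buffer-based formulation in which Shift moves the first buffer node onto the stack, Left-Arc makes the stack top a dependent of the first buffer node and pops it, and Right-Arc makes the first buffer node a dependent of the stack top, removes it, and moves the stack top back to the buffer. The tuple representation and the three-word example are hypothetical and are not the sentence of Figure 2.

```python
def initial(n):
    """Initial configuration: empty stack, full buffer, no arcs."""
    return ((), tuple(range(1, n + 1)), frozenset())


def shift(c):
    stack, buf, arcs = c
    if not buf:
        return None  # undefined on an empty buffer
    return (stack + (buf[0],), buf[1:], arcs)


def left_arc(c):
    stack, buf, arcs = c
    if not stack or not buf:
        return None
    i, j = stack[-1], buf[0]
    # Arc from the first buffer node j to the stack top i; i is popped.
    return (stack[:-1], buf, arcs | {(j, i)})


def right_arc(c):
    stack, buf, arcs = c
    if not stack or not buf:
        return None
    i, j = stack[-1], buf[0]
    # Arc from the stack top i to the first buffer node j; j is removed
    # and i returns to the front of the buffer.
    return (stack[:-1], (i,) + buf[1:], arcs | {(i, j)})


def run(n, transitions):
    """Apply a sequence of transitions to the initial configuration."""
    c = initial(n)
    for t in transitions:
        c = t(c)
        if c is None:
            raise ValueError("transition not permissible")
    return c


# Derives the tree in which word 2 heads words 1 and 3.
final = run(3, [shift, left_arc, shift, right_arc, shift])
```

The final configuration has an empty buffer, so it is terminal, and its arc set is the parse.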

### 2.3 Transition-Based Parsing

Transition-based parsers score transition sequences with a model in which **f**(*c*_{i − 1}, *t*_{i}) is a feature vector representation of transition *t*_{i} out of configuration *c*_{i − 1} and **w** is a corresponding weight vector. Finding the highest scoring transition sequence under this model is a hard problem in general, and transition-based parsers therefore have to rely on heuristic search for the optimal transition sequence. Many systems simply use greedy 1-best search (Yamada and Matsumoto 2003; Nivre, Hall, and Nilsson 2004; Attardi 2006).

Another common approach is to use beam search with a fixed beam size (Johansson and Nugues 2006; Titov and Henderson 2007; Zhang and Clark 2008). In this case, the step that selects and applies the single best transition is replaced by an inner loop that expands all configurations in the current beam using all permissible transitions and then discards all except the *k* highest scoring configurations. The outer loop terminates when all configurations in the beam are terminal, and the dependency graph corresponding to the highest scoring configuration is returned. Setting the beam size to 1 makes this equivalent to greedy 1-best search.
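The two search strategies can be sketched generically, independently of any particular transition system. In the sketch below, which is an illustration rather than the algorithm listing of any specific parser, a transition is assumed to be a function that returns a new configuration or `None` when it is not permissible, and `score(config, transition)` is a hypothetical stand-in for the model score of a transition out of a configuration.

```python
def greedy_parse(config, transitions, is_terminal, score):
    """Greedy 1-best search: always apply the best permissible transition."""
    while not is_terminal(config):
        options = [(t, t(config)) for t in transitions]
        options = [(t, nxt) for t, nxt in options if nxt is not None]
        if not options:
            raise ValueError("no permissible transition")
        _, config = max(options, key=lambda o: score(config, o[0]))
    return config


def beam_parse(config, transitions, is_terminal, score, beam_size):
    """Beam search: keep the beam_size best partial sequences at each step."""
    beam = [(0.0, config)]
    while not all(is_terminal(c) for _, c in beam):
        expanded = []
        for total, c in beam:
            if is_terminal(c):
                expanded.append((total, c))  # finished items stay in the beam
                continue
            for t in transitions:
                nxt = t(c)
                if nxt is not None:
                    expanded.append((total + score(c, t), nxt))
        if not expanded:
            raise ValueError("no permissible transition")
        expanded.sort(key=lambda item: item[0], reverse=True)
        beam = expanded[:beam_size]
    return max(beam, key=lambda item: item[0])[1]
```

With `beam_size=1` the beam holds a single configuration and the procedure follows the same path as `greedy_parse`, up to tie-breaking. With a larger beam, a sequence whose first transition is locally worse can still win if its later transitions score higher.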

The time complexity of transition-based parsing depends not only on the underlying transition system but also on the scoring model and the search algorithm. However, as long as the number of configurations considered by the search algorithm is bounded by a constant *k* and every transition can be scored and executed in constant time relative to a fixed model, the asymptotic time complexity of a parser using a transition system *S* is given by an upper bound on the length of transition sequences in *S* (Nivre 2008). Similarly, the space complexity is given by an upper bound on the size of a configuration *c* ∈ *C*, because at most *k* configurations need to be stored at any given time. For most of the systems considered in this article, we will see that the length of a transition sequence is *O*(*n*), where *n* is the length of the input sentence, which translates into a linear bound on parsing time for transition-based parsers using beam search (with greedy 1-best search as a special case).

Transition-based dependency parsing using beam search has the advantage of low parsing complexity in combination with very few restrictions on feature representations, which enables fast and accurate parsing, but does not guarantee that the optimal transition sequence is found. Recent work on tabularization for transition-based parsing has shown that it is possible to use exact dynamic programming under certain conditions, but this leads either to very inefficient parsing or to very restricted feature representations. Thus, Huang and Sagae (2010) present a dynamic programming scheme for a feature-rich arc-standard parser, but the resulting parsing complexity is *O*(*n*^{7}) and they therefore have to resort to beam search in practical parsing experiments. Conversely, Kuhlmann, Gómez-Rodríguez, and Satta (2011) show how to obtain cubic complexity for a tabularized arc-eager parser but only for very impoverished feature representations. Hence, for the remainder of this article, we will assume that transition sequence length is a relevant complexity bound, because it translates into a bound on running time for parsers that use beam search, as practically all state-of-the-art transition-based parsers currently do. This bound holds as long as every transition can be scored and executed in constant time, which is true even when including complex features like the valency features of Zhang and Nivre (2011), which are expensive to use in dynamic programming because of the combinatorial effect they have on parsing complexity.

## 3. Divisible Transition Systems

In the last decade, several different dependency parsers have been defined as stack-based transition systems, which differ from each other in the order in which they add dependency arcs as well as in the constraints that they impose on output dependency graphs. In their original definitions, these differences arise from the fact that each algorithm uses a distinct set of transitions. In this section, we show how these algorithms can be expressed using a common set of transitions, which we call **elementary transitions**. Under this framework, the original transitions of each algorithm are viewed as combinations of one or more elementary transitions by means of the standard function operations of composition and restriction. A direct consequence of this is that each of the parsers expressed in this framework can be viewed as a restriction of the algorithm that uses elementary transitions directly, allowing any possible concatenation of elementary transitions. We call the systems that are analyzable within this framework **divisible transition systems**.

The elementary transitions in our framework represent five primitive operations that can be applied to stack-based configurations:
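In a notation assumed here for illustration, where σ|*i* denotes a stack with top node *i* and *j*|β a buffer with first node *j*, the five operations can be written as follows. This is a sketch that follows the prose description of each operation, including the condition that an arc is added only if it is not already in *A*.

```latex
\begin{align*}
\textsc{Shift}:     &\quad (\sigma,\; i|\beta,\; A) \Rightarrow (\sigma|i,\; \beta,\; A) \\
\textsc{Unshift}:   &\quad (\sigma|i,\; \beta,\; A) \Rightarrow (\sigma,\; i|\beta,\; A) \\
\textsc{Reduce}:    &\quad (\sigma|i,\; \beta,\; A) \Rightarrow (\sigma,\; \beta,\; A) \\
\textsc{Left-Arc}:  &\quad (\sigma|i,\; j|\beta,\; A) \Rightarrow (\sigma|i,\; j|\beta,\; A \cup \{(j, i)\})
                     \quad \text{only if } (j, i) \notin A \\
\textsc{Right-Arc}: &\quad (\sigma|i,\; j|\beta,\; A) \Rightarrow (\sigma|i,\; j|\beta,\; A \cup \{(i, j)\})
                     \quad \text{only if } (i, j) \notin A
\end{align*}
```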

The first three operations modify the stack and/or buffer by moving a word from the buffer to the top of the stack (Shift), moving a word from the stack to the buffer (Unshift), or popping a word from the stack (Reduce). The remaining two operations create dependency arcs involving the top of the stack and the first word in the buffer (Left-Arc, Right-Arc). We assume that Left-Arc and Right-Arc only apply to configurations where the new arc is not already an element of the arc set *A*, an assumption that is needed in certain cases to guarantee termination (that is, to rule out transition sequences where the same arc is added an indefinite number of times). Note that, in the case of labeled dependency graphs, the Left-Arc and Right-Arc transitions will have a label parameter, and this restriction should not prevent the addition of an arc with the same head and dependent as one or more existing arcs, as long as the label is different.

Different parsing algorithms can now be defined using *composition* of elementary transitions, which is defined as standard function composition.

**Definition 6**

Let *t*_{1},*t*_{2} : *C* →*C* be transitions. Their **composition** is the partial function *t*_{1};*t*_{2} : *C* → *C* mapping each *c* ∈ *C* to *t*_{2}(*t*_{1}(*c*)).

Transitions are (partial) functions *t* : *C* → *C*, and we use *T*_{e} to refer to the set of elementary transitions. In addition, we use function restriction to impose constraints on their domain, traditionally expressed in the literature as side conditions. For this purpose, we use the standard notation by which the **restriction** of a function *f* : *X* → *Y* to a subset *A* ⊆ *X* is written as *f*|_{A}, the function that agrees with *f* on *A* and is undefined elsewhere.

Transition systems that can be defined using composition of elementary transitions with restrictions are said to be *divisible*.

**Definition 7**

A stack-based transition system *S* = (*C*, *T*, *c*_{s}, *C*_{t}) is **divisible** if and only if every transition in *T* is of the form *t*_{1}|_{s_{1}}; …; *t*_{p}|_{s_{p}}, where *p* > 0, *t*_{i} ∈ *T*_{e}, *s*_{i} ⊆ *C*.

In other words, a stack-based transition system is divisible if and only if each of its transitions can be written as a composition of restrictions of the elementary transitions Shift, Unshift, Reduce, Left-Arc, and Right-Arc. Note that the definition allows the use of unrestricted elementary transitions in the composition, because for any transition *t*, we have that *t* = *t*|_{C}.^{4}
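The ingredients of a divisible system (elementary transitions, composition, and restriction) can be sketched as ordinary higher-order functions. The representation below is an assumption made for illustration: a configuration is a triple of a stack tuple, a buffer tuple, and a frozen set of `(head, dependent)` arcs, and a partial function returns `None` where it is undefined.

```python
def shift(c):
    s, b, a = c
    return (s + b[:1], b[1:], a) if b else None


def unshift(c):
    s, b, a = c
    return (s[:-1], s[-1:] + b, a) if s else None


def reduce_(c):
    s, b, a = c
    return (s[:-1], b, a) if s else None


def left_arc(c):
    s, b, a = c
    # Undefined if stack or buffer is empty or the arc already exists.
    if not s or not b or (b[0], s[-1]) in a:
        return None
    return (s, b, a | {(b[0], s[-1])})


def right_arc(c):
    s, b, a = c
    if not s or not b or (s[-1], b[0]) in a:
        return None
    return (s, b, a | {(s[-1], b[0])})


def compose(*ts):
    """t1; t2; ...; tp: apply left to right, undefined if any step is."""
    def composed(c):
        for t in ts:
            if c is None:
                return None
            c = t(c)
        return c
    return composed


def restrict(t, in_subset):
    """t restricted to the configurations for which in_subset holds."""
    return lambda c: t(c) if in_subset(c) else None
```

For example, `compose(left_arc, reduce_)` adds a leftward arc and then pops the new dependent, and `restrict(t, lambda c: True)` behaves exactly like `t`, mirroring the observation that an unrestricted transition is its own restriction to *C*.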

### 3.1 Examples of Divisible Transition Systems

In this section, we show that a number of transition-based parsers from the literature use divisible transition systems that can be defined using only elementary transitions. This includes the arc-eager and arc-standard projective parsers described in Nivre (2003) and Nivre (2008), the arc-eager and arc-standard parsers for directed acyclic graphs from Sagae and Tsujii (2008), the hybrid parser of Kuhlmann, Gómez-Rodríguez, and Satta (2011), and the easy-first parser of Goldberg and Elhadad (2010). We also give examples of transition systems that are not divisible (Attardi 2006; Nivre 2009).

To state the restrictions used in these systems, we use the following notation for subsets of the set of configurations *C*. The set *H*_{σ}(*C*) is the subset of configurations where the node on top of the stack has been assigned a head in *A*, and *H̄*_{σ}(*C*) is the subset where the top node has *not* been assigned a head in *A*. Similarly, the set *H*_{β}(*C*) is the subset of configurations where the first node in the buffer has been assigned a head in *A*, and *H̄*_{β}(*C*) is the subset where the first node has *not* been assigned a head in *A*. Note that there are configurations that are neither in *H*_{σ}(*C*) nor in *H̄*_{σ}(*C*), namely, those where the stack is empty. There are also configurations that are neither in *H*_{β}(*C*) nor in *H̄*_{β}(*C*), because the buffer is empty, but these are all terminal configurations.

**Example 3**

Nivre's (2008) arc-standard parser, previously defined in Example 2, is a bottom–up parser for projective dependency trees. Its transitions can be defined in terms of elementary transitions as follows:

The Shift_{AS} transition is the same as the elementary Shift transition. The Left-Arc_{AS} transition composes the elementary Left-Arc transition with the Reduce transition to ensure that the left dependent of the new arc is popped from the stack and therefore cannot be assigned more than one head. The Right-Arc_{AS} transition, finally, composes four elementary transitions, where Right-Arc is responsible for adding a left-headed arc, Shift and Reduce jointly remove the dependent of the new arc from the buffer, and Unshift moves the head of the new arc back to the buffer so that it can find a head to the left. It is worth noting that the arc-standard system for projective trees does not make use of restrictions.

Although this description of the arc-standard parser corresponds to its definition in Nivre (2008), where arcs are created involving the topmost stack node and the first buffer node, the system has also been presented in an equivalent form with arcs built between the two top nodes in the stack (Nivre 2004). This variant can also be described as a divisible transition system, with Left-Arc_{AS′} = Unshift; Left-Arc; Reduce; Shift and Right-Arc_{AS′} = Unshift; Right-Arc; Shift; Reduce.^{5}
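The decompositions described in this example can be checked on small configurations. The sketch below assumes a tuple-based representation of configurations (stack, buffer, frozen arc set) and partial functions that return `None` when undefined; the helper names are illustrative.

```python
def shift(c):
    s, b, a = c
    return (s + b[:1], b[1:], a) if b else None


def unshift(c):
    s, b, a = c
    return (s[:-1], s[-1:] + b, a) if s else None


def reduce_(c):
    s, b, a = c
    return (s[:-1], b, a) if s else None


def left_arc(c):
    s, b, a = c
    if not s or not b or (b[0], s[-1]) in a:
        return None
    return (s, b, a | {(b[0], s[-1])})


def right_arc(c):
    s, b, a = c
    if not s or not b or (s[-1], b[0]) in a:
        return None
    return (s, b, a | {(s[-1], b[0])})


def compose(*ts):
    def composed(c):
        for t in ts:
            if c is None:
                return None
            c = t(c)
        return c
    return composed


# Buffer-based arc-standard system, as described in the text.
shift_as = shift
left_arc_as = compose(left_arc, reduce_)
right_arc_as = compose(right_arc, shift, reduce_, unshift)

# Stack-based variant: arcs between the two topmost stack nodes.
left_arc_as2 = compose(unshift, left_arc, reduce_, shift)
right_arc_as2 = compose(unshift, right_arc, shift, reduce_)
```

Applied to a configuration with stack [1] and buffer [2, 3], `right_arc_as` yields stack [ ], buffer [1, 3], and the arc (1, 2): the dependent 2 is gone and the head 1 is back in the buffer. Applied to stack [1, 2] and buffer [3], `left_arc_as2` yields stack [2] with the arc (2, 1), and `right_arc_as2` yields stack [1] with the arc (1, 2), leaving the buffer untouched in both cases.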

**Example 4**

Nivre's (2003) arc-eager parser is a parser for projective dependency trees, which adds arcs in a strict left-to-right order using the following transitions:

As in the first example, the Shift_{AE} transition is equivalent to the elementary Shift transition, but the Right-Arc_{AE} transition differs from Right-Arc_{AS} by not popping the right dependent from the stack after adding the arc and shifting. Instead, right dependents are removed from the stack in a separate transition Reduce_{AE}, which is equivalent to the elementary transition Reduce but restricted to *H*_{σ}(*C*) to ensure that unattached nodes are not removed. The Left-Arc_{AE} transition, finally, is the same as Left-Arc_{AS} but restricted to *H̄*_{σ}(*C*) (configurations whose top stack node has no head yet), a restriction that is not needed in the arc-standard system where nodes on the stack can never have a head.
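The role of the two restrictions can be made concrete in code. The sketch below follows the prose description of the arc-eager system under stated assumptions: configurations are (stack, buffer, frozen arc set) triples, Right-Arc_{AE} is taken to be Right-Arc followed by Shift with no further restriction, Reduce_{AE} is restricted to configurations whose stack top has a head, and Left-Arc_{AE} to configurations whose stack top has none. Helper names are illustrative.

```python
def shift(c):
    s, b, a = c
    return (s + b[:1], b[1:], a) if b else None


def reduce_(c):
    s, b, a = c
    return (s[:-1], b, a) if s else None


def left_arc(c):
    s, b, a = c
    if not s or not b or (b[0], s[-1]) in a:
        return None
    return (s, b, a | {(b[0], s[-1])})


def right_arc(c):
    s, b, a = c
    if not s or not b or (s[-1], b[0]) in a:
        return None
    return (s, b, a | {(s[-1], b[0])})


def compose(*ts):
    def composed(c):
        for t in ts:
            if c is None:
                return None
            c = t(c)
        return c
    return composed


def restrict(t, in_subset):
    return lambda c: t(c) if in_subset(c) else None


def top_has_head(c):
    s, _, a = c
    return bool(s) and any(dep == s[-1] for _, dep in a)


def top_has_no_head(c):
    s, _, a = c
    return bool(s) and not any(dep == s[-1] for _, dep in a)


shift_ae = shift
right_arc_ae = compose(right_arc, shift)
reduce_ae = restrict(reduce_, top_has_head)
left_arc_ae = restrict(compose(left_arc, reduce_), top_has_no_head)
```

Both predicates are false when the stack is empty, so neither restricted transition applies in that case. Without the restriction on `left_arc_ae`, a node that already received a head through `right_arc_ae` could be given a second one, violating the single-head property.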

**Example 5**

The easy-first parser of Goldberg and Elhadad (2010) is a parser for projective trees that adds arcs in a bottom–up order but in a non-directional manner, trying to make the easier attachment decisions first regardless of the position of the corresponding words in the sentence. This parsing strategy corresponds to the following divisible transition system:

where *i* is a strictly positive integer. Note that this means that the system has an infinite set of transitions. In practice, however, only the Attach-Right(*i*)_{EF} and Attach-Left(*i*)_{EF} transitions such that 1 ≤ *i* ≤ *n* − 1 need to be considered when parsing a string of length *n*: Because the number of nodes in the buffer is bounded by *n*, transitions with *i* ≥ *n* will always be undefined because the buffer will become empty before the first *i* + 1 elementary transitions can be applied. Therefore, to parse strings of length *n* we only need 2*n* − 1 transitions.

The purpose of an Attach-Right(*i*)_{EF} (or Attach-Left(*i*)_{EF}) is to create a rightward (or leftward) arc involving the *i*th and (*i* + 1)th words in the input string, and then remove the dependent. This means that the system is not limited to building arcs in a predetermined order (such as left to right). Instead, it can generate the same tree in different orders depending on the criterion used to choose a transition at each configuration. In particular, the parser by Goldberg and Elhadad (2010) can be seen as an implementation of this transition system, which uses a training algorithm that assigns a weight to each of the Attach-Right(*i*)_{EF} and Attach-Left(*i*)_{EF} transitions in such a way that “easier” (more reliable) attachments are performed first.

The Attach-Left(*i*)_{EF} and Attach-Right(*i*)_{EF} transitions are essentially the same as Left-Arc_{AS} and Right-Arc_{AS} in the arc-standard system, but preceded by *i* instances of Shift and succeeded by *i* − 1 instances of Unshift, which means that a separate Shift transition is needed only to reach a terminal configuration by pushing the final root(s) onto the stack. This analysis reveals that the two systems are similar in that they build dependency trees bottom–up but differ with respect to the order in which arcs are added. It is worth pointing out that using sequences of Shift and Unshift transitions is not the most efficient way of implementing easy-first parsing in practice.
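One reading of this description can be sketched as follows. The sketch assumes that Attach-Left(*i*) is *i* Shifts, then Left-Arc and Reduce, then *i* − 1 Unshifts, and that Attach-Right(*i*) is *i* Shifts, then Right-Arc, Shift, Reduce, Unshift, then *i* − 1 Unshifts, with the stack empty between attachments. It is an interpretation of the prose rather than a transcription of the system's formal definition, and the helper names are illustrative.

```python
def shift(c):
    s, b, a = c
    return (s + b[:1], b[1:], a) if b else None


def unshift(c):
    s, b, a = c
    return (s[:-1], s[-1:] + b, a) if s else None


def reduce_(c):
    s, b, a = c
    return (s[:-1], b, a) if s else None


def left_arc(c):
    s, b, a = c
    if not s or not b or (b[0], s[-1]) in a:
        return None
    return (s, b, a | {(b[0], s[-1])})


def right_arc(c):
    s, b, a = c
    if not s or not b or (s[-1], b[0]) in a:
        return None
    return (s, b, a | {(s[-1], b[0])})


def compose(*ts):
    def composed(c):
        for t in ts:
            if c is None:
                return None
            c = t(c)
        return c
    return composed


def attach_left(i):
    """Make the i-th pending node a dependent of the (i+1)-th and remove it."""
    return compose(*([shift] * i), left_arc, reduce_, *([unshift] * (i - 1)))


def attach_right(i):
    """Make the (i+1)-th pending node a dependent of the i-th and remove it."""
    return compose(*([shift] * i), right_arc, shift, reduce_, unshift,
                   *([unshift] * (i - 1)))


start = ((), (1, 2, 3), frozenset())
step1 = attach_left(2)(start)    # arc (3, 2); pending nodes are now 1, 3
step2 = attach_right(1)(step1)   # arc (1, 3); pending node is now 1
final = shift(step2)             # push the root to reach a terminal configuration
```

In this sketch the pending nodes always sit in the buffer, and attachments can be made at any position in any order, which is what allows the easier decisions to be taken first.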

**Example 6**

The hybrid parser introduced by Kuhlmann, Gómez-Rodríguez, and Satta (2011) is a bottom–up projective transition system that builds each given dependency tree in a unique order, rather than allowing each node to collect its dependents in different orders like the arc-standard or easy-first systems. Its transitions can be defined as follows:

Note that this parser creates leftward arcs between the first node in the buffer and the top node on the stack, just like arc-standard and arc-eager. Rightward arcs, however, are created by making the topmost stack node a dependent of the second topmost stack node, and removing the former from the stack.

**Example 7**

Sagae and Tsujii's (2008) arc-standard DAG parser performs bottom–up parsing without the common assumption that syntactic structures are represented as trees, allowing nodes to have multiple heads. It uses the following transitions:

Whereas the first three transitions are exactly the same as in Nivre's (2008) arc-standard parser, the Left-Attach_{DS} and Right-Attach_{DS} transitions differ from Left-Reduce_{DS} and Right-Reduce_{DS} in that they do not remove the dependent of the new arc, thus allowing it to have additional incoming arcs. The restrictions on these transitions disallow the creation of both a left and a right arc between the same pair of nodes. Note that the class of dependency structures that can be output by this system does not exactly correspond to DAGs, however, because the system allows transition sequences that create dependency graphs with cycles. For example, starting from any configuration with at least two nodes on the stack and one node in the buffer and applying Right-Attach_{DS}, Right-Reduce_{DS}, Shift_{DS}, and Left-Attach_{DS} gives rise to a cyclic structure.

**Example 8**

Sagae and Tsujii's (2008) arc-eager DAG parser allows nodes with multiple heads like the previous one but adds arcs in a strict left-to-right order like Nivre's (2003) arc-eager parser. The transition system can be defined as follows:

Here the first two transitions are the same as in Nivre's (2003) arc-eager parser, whereas the Left-Arc_{DE} and Right-Arc_{DE} transitions differ from their counterparts Left-Arc_{AE} and Right-Arc_{AE} by not removing the dependent of the new arc. Like the previous arc-standard system, this system can produce cyclic dependency graphs. For example, starting from any configuration with at least one node on the stack and two nodes in the buffer and applying Right-Arc_{DE}, Shift_{DE}, Right-Arc_{DE}, Reduce_{DE}, and Left-Arc_{DE} creates a cycle of length 3.
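The cycle-producing sequence can be traced in a few lines of Python. This is a sketch under assumptions: Left-Arc_{DE} and Right-Arc_{DE} are modeled here simply as unrestricted elementary arc-adding transitions (their actual restrictions are not reproduced), configurations are (stack, buffer, arcs) triples with (head, dependent) arcs, and the function names are hypothetical.

```python
def shift(c):
    stack, buffer, arcs = c
    return (stack + buffer[:1], buffer[1:], arcs)


def reduce_(c):
    stack, buffer, arcs = c
    return (stack[:-1], buffer, arcs)


def left_arc(c):
    # head = first buffer node, dependent = top stack node; the dependent is kept
    stack, buffer, arcs = c
    return (stack, buffer, arcs | {(buffer[0], stack[-1])})


def right_arc(c):
    # head = top stack node, dependent = first buffer node; the dependent is kept
    stack, buffer, arcs = c
    return (stack, buffer, arcs | {(stack[-1], buffer[0])})


def find_cycle(arcs):
    """Return the nodes of some directed cycle, or None if the graph is acyclic."""
    succ = {}
    for head, dep in arcs:
        succ.setdefault(head, []).append(dep)

    def visit(node, path):
        if node in path:
            return path[path.index(node):]
        for nxt in succ.get(node, []):
            found = visit(nxt, path + [node])
            if found:
                return found
        return None

    for start in succ:
        found = visit(start, [])
        if found:
            return found
    return None


# One node on the stack, two in the buffer, as in the example.
c = ((1,), (2, 3), frozenset())
for t in (right_arc, shift, right_arc, reduce_, left_arc):
    c = t(c)

cycle = find_cycle(c[2])
print(sorted(c[2]), cycle)
```

The resulting arc set contains 1 → 2, 2 → 3, and 3 → 1, which is the cycle of length 3 mentioned above.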

### 3.2 Properties of Divisible Transition Systems

The elementary transition framework not only allows us to describe a wide range of transition-based parsers in a clear and concise way, but it can also easily be used to prove formal properties of transition systems. To do so, we consider the successions of transitions allowed by these algorithms, and break their transitions up into chains of elementary transitions.

**Definition 8**

Let *C*_{0,m} = (*c*_{0},*c*_{1},…,*c*_{m}) be a transition sequence for a sentence *x* under a transition system *S*. The **standard transition chain** associated with *C*_{0,m} is the sequence of transitions *T*_{0,m} = (*t*_{1},*t*_{2},…,*t*_{m}) such that *t*_{i}(*c*_{i − 1}) = *c*_{i} for each *i* ∈ [1,*m*].

**Definition 9**

**Definition 10**

Let *E*_{0,m} = (*e*_{1},*e*_{2},…,*e*_{q}) be the elementary transition chain for some transition sequence *C*_{0,m} = (*c*_{0},*c*_{1},…,*c*_{m}). Then:

- The **computation function** associated with *C*_{0,m} is the function *e*_{1};*e*_{2};…;*e*_{q}, resulting from composing the elementary transitions in the chain. Note that the same function could also be obtained from composing the transitions in the standard transition chain associated with *C*_{0,m}, and that this function will always map *c*_{0} to *c*_{m}.

- The **elementary transition sequence** associated with *C*_{0,m} is the sequence of configurations *C*′_{0,m} = (*c*′_{0},*c*′_{1},…,*c*′_{q}) such that *c*′_{0} = *c*_{0}, and *c*′_{i} = *e*_{i}(*c*′_{i − 1}) for all *i* ∈ [1,*q*]. Note that *c*′_{q} will always equal *c*_{m}. We will say that *e*_{i} is **applied** to the configuration *c*′_{i − 1} in *C*′_{0,m}.

#### 3.2.1 Constraints on Dependency Graphs

Here we consider properties related to the graph constraints No-Covered-Roots, Single-Head, Acyclicity, and Planar.

**Proposition 1**

If all elementary Reduce transitions in the elementary transition chains under *S* are applied to configurations in *H*_{σ}(*C*), then no dependency graph generated by *S* contains covered roots.

This property implies that algorithms where Reduce transitions are restricted to the set *H*_{σ}(*C*) always satisfy the No-Covered-Roots constraint. Note that this restriction may be expressed explicitly in the transition definitions (as in Example (4)), but it may also be implicit. For example, in Example (3), we defined Left-Arc_{AS} = Left-Arc; Reduce. Although we did not explicitly restrict the Reduce in this composition to *H*_{σ}(*C*), the Left-Arc transition always produces configurations that are trivially in *H*_{σ}(*C*) (because the transition gives the topmost stack node a head), so the Reduce transition in this algorithm is implicitly restricted to *H*_{σ}(*C*). The same observation can be applied to subsequent properties.

**Proof**

To prove this proposition, we first make some simple observations about divisible transition systems that will be useful for this and subsequent proofs.

**Lemma 1**

In every configuration in an (elementary) transition sequence under a divisible transition system *S*, elements in the stack and buffer are ordered, that is, if the configuration is of the form ([*s*_{1}, …, *s*_{k}], [*b*_{1}, …, *b*_{l}], *A*), then we know that *s*_{1} < … < *s*_{k} < *b*_{1} < … < *b*_{l}. This can be easily seen by induction. It holds in initial configurations, because the stack is empty and the buffer is ordered, and all of the elementary transitions preserve the order of the nodes. Note that this lemma implies that a node cannot be in both the stack and the buffer of the same configuration.

**Lemma 2**

We will call *Π*(*c*) the set of elements that are present either in the stack or in the buffer in a configuration *c*. Let *C*′_{0,m} = (*c*′_{0},*c*′_{1},…,*c*′_{q}) be an (elementary) transition sequence under a divisible transition system *S*. Then, we have that *Π*(*c*′_{q}) ⊆ *Π*(*c*′_{q − 1}) ⊆ … ⊆ *Π*(*c*′_{0}) = {1, …, *n*}. This means that the set *Π* monotonically decreases in the course of an (elementary) transition sequence or, in plain language, that a node that is removed from the stack and buffer can never be placed back there by elementary transitions. This can be easily seen by observing that the transitions Shift, Unshift, Left-Arc, and Right-Arc leave the set *Π* unchanged, whereas the Reduce transition removes one element from it by popping the stack.

**Lemma 3**

Let *E*_{0,m} = (*e*_{1}, …, *e*_{q}) and *C*′_{0,m} = (*c*′_{0},*c*′_{1}, …, *c*′_{q}) be an elementary transition chain and its corresponding elementary transition sequence under a divisible transition system *S*. If *v* ∉ *Π*(*c*′_{i}) for some *v* ∈ [1,*n*] and *i* ∈ [0,*q*], then there exists some *j* ∈ [1,*i*] such that *e*_{j} = Reduce and *c*′_{j − 1} has *v* on the top of the stack. This amounts to saying that the only way an element can be removed from the set *Π* in a divisible system is by a Reduce transition, as observed earlier. Thus, whenever a token *v* is not present in *Π*(*c*′_{i}) for a given configuration *c*′_{i}, we can assume that it was previously popped by a Reduce transition applied to a configuration that had *v* on the top of the stack.

With these observations, it is easy to show that if a graph generated by a transition sequence in a divisible transition system *S* has at least one covered root, then the transition sequence applies at least one Reduce transition to a configuration that is not in *H*_{σ}(*C*). Let *G* be a dependency graph in which the node *j* is a root, covered by an arc connecting the nodes *i* and *k* (*i* < *k*). If a transition sequence *C*_{0,m} generates *G*, then it must apply a Left-Arc or Right-Arc transition to a configuration having *i* at the top of the stack and *k* as the first element in the buffer, which is the only way of adding the arc involving *i* and *k*. By Lemma 1, we know that in that configuration *c*, *j* ∉ *Π*(*c*). By Lemma 3, we know that there must thus be a previous application of a Reduce transition with *j* on the top of the stack. Because *j* is a root, by definition this configuration is not in *H*_{σ}(*C*), and the proposition is proved.
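The notion of a covered root used in this argument is easy to state as a check on a finished graph. The following Python sketch assumes that arcs are (head, dependent) pairs over nodes 1, …, *n*, that a root is a node without an incoming arc, and that "covered" means lying strictly between the endpoints of some arc; the function name `covered_roots` is hypothetical.

```python
def covered_roots(n, arcs):
    """Return the roots (nodes without a head) that lie strictly inside some arc."""
    has_head = {dep for _, dep in arcs}
    roots = [v for v in range(1, n + 1) if v not in has_head]
    return [r for r in roots if any(min(a) < r < max(a) for a in arcs)]


# Node 2 has no head and sits under the arc between 1 and 3.
print(covered_roots(3, {(1, 3)}))
# Giving node 2 a head removes the violation.
print(covered_roots(3, {(1, 3), (3, 2)}))
```

In the first graph, any divisible system that builds the arc (1, 3) must already have reduced node 2 while it was headless, which is exactly the situation that Proposition 1 rules out.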

**Proposition 2**

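If all the elementary Left-Arc transitions in the elementary transition chains under *S* are applied to configurations outside *H*_{σ}(*C*) (that is, configurations where the topmost stack node has no head), and all the elementary Right-Arc transitions are applied to configurations outside *H*_{β}(*C*) (configurations where the first buffer node has no head), then all the dependency graphs generated by *S* obey the Single-Head constraint.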

**Proof**

The proof of this proposition is straightforward. Because elementary transitions either leave the generated dependency graph as it is or add one dependency arc to it, an elementary transition sequence will generate a graph violating the Single-Head constraint if and only if it contains a Left-Arc or Right-Arc transition that adds an incoming arc to a node that already has a head in the graph.

**Proposition 3**

If all the elementary Left-Arc and Right-Arc transitions in the elementary transition chains under *S* are applied to configurations (*σ*|*i*, *j*| *β*, *A*) ∈ *C* where *i* and *j* belong to different connected components of the undirected graph underlying *A*, then the undirected graphs underlying all the dependency graphs generated by *S* are acyclic (i.e., the dependency graphs generated by *S* have no undirected cycles). Note that this in turn implies Acyclicity.

**Proof**

Again, this proposition is straightforward, because a cycle can only be created in the undirected graph underlying the generated dependency graph if an arc is added between nodes that are already connected.

**Proposition 4**

All dependency graphs generated by a divisible system *S* are planar.

**Proof**

To prove this proposition, we observe that a graph is non-planar if and only if it contains two arcs (*i*, *j*) and (*k*, *l*) such that min (*i*, *j*) < min (*k*, *l*) < max (*i*, *j*) < max (*k*, *l*). We can show that there is no elementary transition chain that creates such a pair of arcs.

- An elementary transition chain that first adds the arc (*i*,*j*) and later the arc (*k*,*l*) must apply a Left-Arc or Right-Arc transition to a configuration having min (*i*,*j*) at the top of the stack and max (*i*,*j*) as the first element in the buffer, which is the only way of adding the first arc. By Lemma 1, we know that in that configuration *c*, min (*k*,*l*) ∉ *Π*(*c*); and by Lemma 2, we know that min (*k*,*l*) ∉ *Π*(*c*′) for every subsequent configuration *c*′ in the sequence. Given that an arc involving min (*k*,*l*) and max (*k*,*l*) can only be built from a configuration having min (*k*,*l*) in the stack, we conclude that after adding the arc (*i*,*j*) to the arc set, the parser will never be able to reach a configuration allowing it to add the arc (*k*,*l*).

- An elementary transition chain that first adds the arc (*k*,*l*) and later the arc (*i*,*j*) is not possible. The reasoning is analogous, but in this case max (*i*,*j*) is the node that gets removed from the set *Π* when the arc (*k*,*l*) is added, making it impossible to add the arc (*i*,*j*) afterwards.
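The crossing condition used in this proof translates directly into a planarity test. The sketch below is a straightforward quadratic check written for illustration only (it is not the efficient *k*-planarity test mentioned in the abstract); arcs are assumed to be pairs of node positions, and the names `crosses` and `is_planar` are hypothetical.

```python
from itertools import combinations


def crosses(a, b):
    """True if arcs a and b cross: min(a) < min(b) < max(a) < max(b), or the same with a and b swapped."""
    (lo1, hi1), (lo2, hi2) = sorted([(min(a), max(a)), (min(b), max(b))])
    return lo1 < lo2 < hi1 < hi2


def is_planar(arcs):
    """A graph is planar in the sense used here if no two of its arcs cross."""
    return not any(crosses(a, b) for a, b in combinations(arcs, 2))


print(is_planar({(1, 3), (2, 4)}))  # the arcs interleave
print(is_planar({(1, 4), (3, 2)}))  # one arc nests inside the other
```

Arc direction plays no role in the test, because the condition depends only on the minimum and maximum endpoint of each arc.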

The properties considered in this section can be used as a tool set for easily proving the soundness of transition systems with respect to different sets of dependency graphs, as well as for designing new transition systems. We exemplify the former in Example (9) and the latter in Section 3.3.

**Example 9**

Consider the transition set of the arc-eager parser in Example (4), repeated here for convenience:

We can easily conclude the following:

- The algorithm enforces the No-Covered-Roots constraint by Proposition 1, because Reduce transitions are restricted to *H*_{σ}(*C*).

- The algorithm enforces the Single-Head constraint by Proposition 2, because Left-Arc elementary transitions are explicitly restricted to configurations outside *H*_{σ}(*C*) and Right-Arc transitions are implicitly restricted to configurations outside *H*_{β}(*C*). (Trivially, none of the transitions can produce a configuration in *H*_{β}(*C*).)

- The algorithm enforces the Acyclicity constraint by Proposition 3, because by construction none of the transitions can produce a configuration *c* where the first node in the buffer is connected to any node in *Π*(*c*).

- The graphs it generates are planar by Proposition 4.

- The algorithm generates only projective dependency graphs, because the combination of Planar, Acyclicity, Single-Head, and No-Covered-Roots implies Projective.

#### 3.2.2 Termination and Complexity

In general, there are two ways in which a transition-based parser may fail to parse a given input sentence. On the one hand, it may terminate in a non-terminal configuration where no transition can be applied. On the other hand, it may fail to terminate at all, because the system allows an infinite sequence of transitions. We say that a system is **robust** if it can never get stuck in a non-terminal configuration and **bounded** if it does not permit infinite loops.

**Definition 11**

A divisible transition system *S* = (*C*, *T*, *c*_{s}, *C*_{t}) is **robust** if and only if, for every non-terminal configuration *c* ∈ *C* ∖ *C*_{t}, there is some transition *t* ∈ *T* such that *t*(*c*) ∈ *C*.

**Definition 12**

A divisible transition system *S* = (*C*, *T*, *c*_{s}, *C*_{t}) is **bounded** if and only if there exists no non-terminal configuration *c* ∈ *C* ∖ *C*_{t} and (non-empty) sequence of transitions *t*_{1}, …, *t*_{k} (*t*_{i} ∈ *T*) such that *t*_{1};…;*t*_{k}(*c*) = *c*.

In this section, we first provide sufficient conditions for robustness and boundedness and then go on to discuss the parsing complexity for a subset of divisible systems that are guaranteed to be robust and bounded.

**Proposition 5**

Let *S* = (*C*, *T*, *c*_{s}, *C*_{t}) be a divisible transition system. If Shift ∈ *T*, then *S* is robust.

**Proof**

It is clear that Shift ∈ *T* is sufficient for robustness, because it applies to every configuration that has a non-empty buffer *β*, which by definition includes every non-terminal configuration.

Thus, in order to guarantee robustness, it is enough that a divisible transition system includes the elementary Shift transition. This is the case for all the divisible systems exemplified in Section 3.1. Before we go on to characterize bounded systems, it is convenient to introduce three auxiliary functions that characterize the effect a transition *t* has on an arbitrary configuration *c*:

- *A*(*t*) = |*A*_{t(c)}| − |*A*_{c}|
- *Π*(*t*) = |*Π*(*c*)| − |*Π*(*t*(*c*))|
- *β*(*t*) = |*β*_{c}| − |*β*_{t(c)}|

*A*(*t*) is the *increase* in size of the arc set *A*, which is always non-negative as there are no elementary transitions that remove arcs.

*Π*(*t*) is the *decrease* in size of the set of nodes that are on the stack *σ* or in the buffer *β*, which is also non-negative as there are no elementary transitions that add new nodes.

*β*(*t*) is the *decrease* in the size of the buffer *β*, which can be negative as well as positive (or zero).

**Proposition 6**

Let *S* = (*C*, *T*, *c*_{s}, *C*_{t}) be a divisible transition system. If every transition *t* ∈ *T* is such that *A*(*t*) > 0 or *Π*(*t*) > 0 or *β*(*t*) > 0, then *S* is bounded.

**Proof**

To see why the disjunctive condition excludes looping transition sequences, consider an arbitrary configuration *c* and an arbitrary transition *t* for which the condition holds. If *A*(*t*) > 0 or *Π*(*t*) > 0, then *c* is clearly not reachable from *t*(*c*), because there are no transitions that delete arcs (first case) or insert nodes (second case). If *A*(*t*) = 0 and *Π*(*t*) = 0, then *β*(*t*) > 0 and *c* could be reachable from *t*(*c*) only if there is a transition *t*′ such that *β*(*t*′) < 0 (that is, a transition that puts nodes back in the buffer). But any such transition *t*′ would have to have either *A*(*t*′) > 0 or *Π*(*t*′) > 0, which would again rule out the possibility of a loop. We may therefore conclude that there is no sequence of transitions *t*_{1}, …, *t*_{k} such that *t*_{1};…;*t*_{k}(*c*) = *c* and, hence, that *S* is bounded.

**Example 10**

The condition of Proposition 6 does not hold for the elementary transition system, because *A*(Unshift) = 0, *Π*(Unshift) = 0, and *β*(Unshift) = −1. In fact, this system is not bounded, because we can have an unbounded number of alternating Shift and Unshift transitions without reaching a terminal configuration.

By contrast, the arc-eager system from Examples (4) and (9) is bounded, which can be seen by observing that *β*(Shift_{AE}) = 1, *Π*(Reduce_{AE}) = 1, *A*(Left-Arc_{AE}) = 1, and *A*(Right-Arc_{AE}) = 1. The same reasoning can be applied to show that all the transition systems introduced in Examples (3)–(8) are bounded.
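The three auxiliary functions can be computed mechanically for the elementary transitions. The Python sketch below assumes configurations of the form (stack, buffer, arcs) with (head, dependent) arcs, and uses the fact from Lemma 1 that no node is in both the stack and the buffer, so that |*Π*(*c*)| is the sum of their sizes. The helper `effect` and the transition functions are hypothetical names, and the effect is measured on one sample configuration rather than proved for all.

```python
def shift(c):
    stack, buffer, arcs = c
    return (stack + buffer[:1], buffer[1:], arcs)


def unshift(c):
    stack, buffer, arcs = c
    return (stack[:-1], stack[-1:] + buffer, arcs)


def reduce_(c):
    stack, buffer, arcs = c
    return (stack[:-1], buffer, arcs)


def left_arc(c):
    stack, buffer, arcs = c
    return (stack, buffer, arcs | {(buffer[0], stack[-1])})


def effect(t, c):
    """Return (A(t), Pi(t), beta(t)) as observed on configuration c."""
    stack, buffer, arcs = c
    stack2, buffer2, arcs2 = t(c)
    a = len(arcs2) - len(arcs)                                   # increase in arcs
    pi = (len(stack) + len(buffer)) - (len(stack2) + len(buffer2))  # decrease in nodes
    beta = len(buffer) - len(buffer2)                            # decrease in buffer
    return (a, pi, beta)


def satisfies_condition(t, c):
    """The disjunctive condition of Proposition 6, checked on c."""
    return any(x > 0 for x in effect(t, c))


c = ((1,), (2, 3), frozenset())
for t in (shift, unshift, reduce_, left_arc):
    print(t.__name__, effect(t, c), satisfies_condition(t, c))
```

On this sample configuration only Unshift fails the condition, in line with the observation that the unrestricted elementary system is not bounded.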

As already stated, the running time of a transition-based parser that only explores a constant number of transition sequences (such as a greedy deterministic parser or a beam-search parser with a constant-size beam) is given by an upper bound on the length of a transition sequence. To prove such bounds for divisible transition systems, we will first prove a linear bound on the number of arcs in planar graphs.

**Lemma 4**

A planar dependency graph with *n* nodes (*n* > 1) has no more than 4*n* − 6 arcs.

**Proof**

For *n* = 2, we can trivially have at most two arcs, (1,2) and (2,1), and thus the lemma holds because 2 = 4 · 2 − 6. For the induction step, let *n* > 2. We will show that if the lemma holds for graphs with fewer than *n* nodes, then it also holds for graphs with *n* nodes.

To do so, we first give some preliminary definitions. We will say that the **length** of an arc (*i*,*j*) is ℓ(*i*,*j*) = max (*i*,*j*) − min (*i*,*j*). We will call the **domain** of an arc (*i*,*j*) the set *δ*(*i*,*j*) = { min (*i*,*j*) , min (*i*,*j*) + 1 , …, max (*i*,*j*) − 1 }. Note that the number of elements in the domain of an arc equals its length. We will say that an arc (*i*,*j*) **covers** an arc (*k*,*l*) if (*i*,*j*) ≠ (*k*,*l*) and min (*i*,*j*) ≤ min (*k*, *l*) < max (*k*,*l*) ≤ max (*i*,*j*). Note that an arc (*i*,*j*) covers an arc (*k*,*l*) if and only if *δ*(*k*,*l*) ⊂ *δ*(*i*,*j*), and a pair of distinct arcs (*i*,*j*) and (*k*,*l*) cross (as defined in Section 2.1) if and only if none of them covers the other and *δ*(*k*,*l*) ∩ *δ*(*i*,*j*) ≠ ∅. Thus, we conclude that a pair of distinct arcs that do not cross or cover each other have disjoint domains.

Let *G* = (*V* = {1, …, *n*}, *A*) be a planar dependency graph. Let *A*_{c} = {*a*_{1}, …, *a*_{m}} be the set of arcs in *A* with length strictly smaller than *n* − 1, and that are not covered by any arc in *A* with length strictly smaller than *n* − 1. By definition of *A*_{c}, we know that

ℓ(*a*_{i}) ≤ *n* − 2 for every *i* ∈ {1, …, *m*}. (1)

On the other hand, because *G* is planar, a pair of arcs in *A*_{c} cannot cross each other. Furthermore, by definition of *A*_{c}, an arc in *A*_{c} cannot cover another arc in *A*_{c}. Therefore, the domains of *a*_{1}, …, *a*_{m} are disjoint subsets of {1, …, *n* − 1}, and thus

∑_{i=1}^{m} ℓ(*a*_{i}) ≤ *n* − 1. (2)

By definition of *A*_{c}, every arc in *A* is either (i) an arc of length at least *n* − 1 (i.e., (1,*n*) or (*n*,1)), or (ii) an arc *a*_{i} ∈ *A*_{c}, or (iii) an arc covered by some arc *a*_{i} ∈ *A*_{c}. For each given *i* ∈ {1, …, *m*}, the arcs of types (ii) and (iii) form a subgraph of *G* with ℓ(*a*_{i}) + 1 nodes. Because ℓ(*a*_{i}) + 1 < *n*, we can apply the induction hypothesis to conclude that there are at most 4(ℓ(*a*_{i}) + 1) − 6 arcs of this type for each value of *i*. Combining this with Equations (1) and (2), we conclude that the total number of arcs in *A* is bounded by

2 + ∑_{i=1}^{m} (4(ℓ(*a*_{i}) + 1) − 6) = 2 − 2*m* + 4 ∑_{i=1}^{m} ℓ(*a*_{i}). (3)

It is easy to see that the expression is maximized for *m* = 2, and in that case the value of the expression is bounded by 2 − 2 · 2 + 4(*n* − 1) = 4*n* − 6. This proves the induction step and thus concludes the proof of Lemma 4.
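The bound of Lemma 4 can be checked by brute force for very small *n*. The sketch below is a sanity check rather than a proof: it enumerates all sets of undirected spans over *n* positions, keeps the largest crossing-free one, and doubles its size because each span can carry both (*i*,*j*) and (*j*,*i*) without introducing a crossing. The function names are hypothetical, and the enumeration is exponential, so it is only meant for *n* ≤ 5 or so.

```python
from itertools import combinations


def crosses(a, b):
    """True if the two spans interleave (same crossing test as for arcs)."""
    (lo1, hi1), (lo2, hi2) = sorted([(min(a), max(a)), (min(b), max(b))])
    return lo1 < lo2 < hi1 < hi2


def max_planar_arcs(n):
    """Largest number of arcs in a planar dependency graph on n nodes, by enumeration."""
    spans = list(combinations(range(1, n + 1), 2))
    best = 0
    for mask in range(1 << len(spans)):
        chosen = [s for k, s in enumerate(spans) if (mask >> k) & 1]
        if all(not crosses(a, b) for a, b in combinations(chosen, 2)):
            best = max(best, len(chosen))
    return 2 * best  # both directions of every span


for n in range(2, 6):
    print(n, max_planar_arcs(n), 4 * n - 6)
```

For these small cases the enumeration reaches the bound exactly, which suggests that 4*n* − 6 is tight and not merely an upper estimate.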

Thanks to the result in Lemma 4, we can now proceed to prove bounds on the length of transition sequences in divisible systems that are guaranteed to be robust and bounded, that is, systems that satisfy the conditions of Propositions 5 and 6. We call such systems *efficient* divisible transition systems.

**Definition 13**

A divisible transition system *S* = (*C*, *T*, *c*_{s}, *C*_{t}) is **efficient** if and only if Shift ∈ *T* and, for every *t* ∈ *T*, *A*(*t*) > 0 or *Π*(*t*) > 0 or *β*(*t*) > 0.

We give three increasingly tight bounds for (i) arbitrary efficient divisible transition systems, (ii) systems that in addition have a constant bound on the growth of the buffer, and (iii) systems that have a constant bound on the number of elementary transitions that a composite transition can contain.

**Proposition 7**

Let *S* = (*C*, *T*, *c*_{s}, *C*_{t}) be an efficient divisible transition system. Then the length of a transition sequence for a sentence *x* of length *n* in *S* is *O*(*n*^{2}).

**Proof**

Consider an arbitrary transition sequence *C*_{0,m} = (*c*_{s}(*x*), …, *c*_{m}) in *S* for a sentence *x* of length *n* and the corresponding transition chain *T*_{1,m} = (*t*_{1}, …, *t*_{m}). The following must hold:

- The number of transitions *t* in *T*_{1,m} for which *A*(*t*) > 0 is bounded by the maximum number of arcs in a planar dependency graph, which is 4*n* − 6 (by Lemma 4).

- The number of transitions *t* in *T*_{1,m} for which *Π*(*t*) > 0 is according to Lemma 2 bounded by the number of nodes in the initial configuration *c*_{s}(*x*), which is *n*.

- The longest contiguous subsequence *T*_{i,k} = (*t*_{i}, …, *t*_{k}) of *T*_{1,m} such that all transitions *t*_{j} ∈ *T*_{i,k} have *A*(*t*_{j}) = 0, *Π*(*t*_{j}) = 0 and *β*(*t*_{j}) > 0 is bounded by the maximum size of the buffer, which again according to Lemma 2 is bounded by the number *n* of nodes in the initial configuration *c*_{s}(*x*).

Therefore, *T*_{1,m} can contain at most *O*(*n*) transitions of the first two types and at most *O*(*n*^{2}) transitions of the third type, because there are at most *O*(*n*) transitions that increase the size of the buffer.

**Proposition 8**

Let *S* = (*C*, *T*, *c*_{s}, *C*_{t}) be an efficient divisible transition system such that, for every transition *t* ∈ *T*, *β*(*t*) ≥ −*k* for some constant *k*. Then the length of every transition sequence for a sentence *x* of length *n* in *S* is *O*(*n*).

**Proof**

This follows from the same kind of considerations as in the proof of Proposition 7 together with the observation that the total number of transitions *t* for which *A*(*t*) = 0, *Π*(*t*) = 0, and *β*(*t*) > 0 is now bounded by *kn* instead of *n*^{2}, because each of the *O*(*n*) transitions that may increase the size of the buffer can only do so by at most *k*.

**Proposition 9**

Let *S* = (*C*, *T*, *c*_{s}, *C*_{t}) be an efficient divisible transition system such that, for every transition *t* = *t*_{1};…;*t*_{m} (*t* ∈ *T*, *t*_{i} ∈ *T*_{e}), *m* ≤ *k* for some constant *k*. Then the length of every elementary transition sequence for a sentence *x* of length *n* in *S* is *O*(*n*).

**Proof**

This follows from Proposition 8 together with the constant bound on the number of elementary transitions in a composite transition.

**Example 11**

All the systems defined in Section 3.1 satisfy the condition of Proposition 8 and therefore have a linear bound on the length of their transition sequences. In addition, all the systems except the easy-first parser satisfy the condition of Proposition 9 and therefore also have a linear bound on the number of elementary transitions. To see why this fails for the easy-first parser, note that the number of elementary transitions in Attach-Left(*i*)_{EF} and Attach-Right(*i*)_{EF} depends on *i*, which can grow with the size of the sentence. Nevertheless, *β*(Attach-Left(*i*)_{EF}) = *β*(Attach-Right(*i*)_{EF}) = 1 (for all values of *i*), which guarantees the linear bound on composite transitions.

### 3.3 Planar Dependency Parsing

So far in this section, we have shown how a number of well-known transition systems from the literature can be formulated and studied as divisible transition systems, that is, as restrictions of the same generic system based on five elementary transitions. In this section, we show how this formulation can also be used to define a novel algorithm. Specifically, we can obtain a transition system that will be able to parse any planar dependency graph (regardless of projectivity) if we use all elementary transitions except Unshift directly as the transitions of the system. On top of this system, we can use Propositions 1–4 to add optional restrictions to the system in order to enforce the Single-Head, Acyclicity, and No-Covered-Roots constraints. In addition, we can use Propositions 5–9 to show that there is a linear bound on the length of elementary transition sequences in this system. In this way, we obtain an efficient parser for planar dependency graphs, optionally restricted to trees, which is a novel contribution in itself. More importantly, however, we will show in Section 5 how this system can be generalized to a system capable of handling non-planar, hence non-projective, dependency trees using the concept of multiplanarity (to be introduced in Section 4).

#### 3.3.1 Correctness

This transition system can parse all the planar dependency graphs. To prove its correctness, we must show soundness (all the graphs produced by the system are planar) and completeness (all the planar graphs can be obtained by the system). Soundness is trivial given Proposition 4, so we only need to prove completeness. To do so, we prove the stronger claim in Lemma 5.

**Lemma 5**

Let *G* = (*V*,*A*) be a planar dependency graph for a sentence *w*_{1} …*w*_{n}. Then there is a transition sequence in *S*_{P} ending in a terminal configuration of the form (*σ*, [ ], *A*) such that all the nodes that are not covered by any dependency arc in *A* are in *σ*.

**Proof**

To prove this lemma, we proceed by induction on the length *n* of the sentence. In the case where *n* = 1, the only possible planar dependency graph is the graph *G*_{0} = ({1} , ∅ ) with a single node and no arcs. It is easy to see that the transition sequence that applies a single Shift transition meets the required conditions, because it ends in a terminal configuration ([1], [ ], ∅ ).

For the induction step, we assume that the lemma holds for sentences of length *n* and prove that it then also holds for sentences of length *n* + 1, for any *n* ≥ 1. Let *G*_{n + 1} = (*V*_{n + 1}, *A*_{n + 1}) be a planar dependency graph for a sentence *w*_{1} … *w*_{n + 1}. We denote by *L*_{n + 1} the set of arcs {(*i*, *j*) ∈ *A*_{n + 1} | *i* = *n* + 1 ∨ *j* = *n* + 1}, that is, the set of incoming and outgoing arcs from the node *n* + 1 in *G*_{n + 1}, and we denote by *G*_{n} the graph (*V*_{n}, *A*_{n}) with *V*_{n} = *V*_{n + 1} ∖ {*n* + 1} and *A*_{n} = *A*_{n + 1} ∖ *L*_{n + 1}, that is, the graph obtained by removing the node *n* + 1 and all its incoming and outgoing arcs from *G*_{n + 1}.

By the induction hypothesis, there exists a transition sequence *C*_{n} whose final configuration is of the form (*σ*_{n}, [ ], *A*_{n}), such that *σ*_{n} contains all the nodes that are not covered by any dependency arc in *A*_{n}. From this transition sequence *C*_{n}, we will obtain a transition sequence *C*_{n + 1} meeting the conditions asserted by the lemma for the graph *G*_{n + 1}. To do so, we first observe that the planarity of the graph *G*_{n + 1} implies that the left endpoints of the arcs in *L*_{n + 1} cannot be covered by any arc in *A*_{n}, because this would mean that the arc in *L*_{n + 1} and the covering arc would cross. Therefore, by the induction hypothesis, we know that all the left endpoints of the arcs in *L*_{n + 1} are in *σ*_{n}. Thus, if the left endpoints of the arcs in *L*_{n + 1} are *i*_{1}, *i*_{2}, …, *i*_{e} (with *i*_{1} < *i*_{2} < … < *i*_{e}), then the stack *σ*_{n} (which is ordered, by Lemma 1) is of the form [ …, *i*_{1}, …, *i*_{2}, …, *i*_{e}, … ].

With this in mind, we can obtain the transition sequence *C*_{n + 1} from *C*_{n} by adding the following extra transitions at the end of its associated transition chain: for each left endpoint, from *i*_{e} down to *i*_{1}, as many Reduce transitions as are needed to bring that endpoint to the top of the stack, followed by *arcs*(*i*) for that endpoint; and finally one Shift transition. Here we use the notation *arcs*(*i*) as shorthand for:

- Left-Arc, if (*n* + 1, *i*) ∈ *L*_{n + 1} ∧ (*i*, *n* + 1) ∉ *L*_{n + 1},

- Right-Arc, if (*n* + 1, *i*) ∉ *L*_{n + 1} ∧ (*i*, *n* + 1) ∈ *L*_{n + 1},

- Left-Arc; Right-Arc, if (*n* + 1, *i*) ∈ *L*_{n + 1} ∧ (*i*, *n* + 1) ∈ *L*_{n + 1}.

The final configuration of the transition sequence *C*_{n + 1} obtained in this way from *C*_{n} is of the form (*σ*, *β*, *A*), where:

- *β* = [ ], since the nodes 1, …, *n* are removed from the buffer by *C*_{n}, and *n* + 1 is removed by the extra Shift transition,

- *A* = *A*_{n + 1}, because *A*_{n + 1} = *A*_{n} ∪ *L*_{n + 1}, the arcs in *A*_{n} are added to the set by *C*_{n}, and all the arcs in *L*_{n + 1} are added by *arcs*(*i*_{e}), …, *arcs*(*i*_{1}),

- All the nodes that are not covered by arcs in *A*_{n + 1} are in *σ*, because they were in *σ*_{n} (a node not covered by arcs in *A*_{n + 1} is trivially not covered by arcs in *A*_{n}) and the Reduce transitions applied after *C*_{n} only remove nodes to the right of *i*_{1}, which are covered by the arc (*n* + 1, *i*_{1}) or (*i*_{1}, *n* + 1).^{6}
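The inductive construction in this proof can be read as a procedure that produces a transition chain for a given planar graph. The Python sketch below follows that reading under assumptions: the system uses only Shift, Reduce, Left-Arc, and Right-Arc, arcs are (head, dependent) pairs over nodes 1, …, *n*, and the function name `planar_oracle` is hypothetical. It is not presented as the authors' training oracle.

```python
def planar_oracle(n, arcs):
    """Build a transition chain for the graph ({1..n}, arcs), mimicking the proof of Lemma 5.

    Returns (chain, stack, built): the list of transition names, the final stack,
    and the set of arcs actually created.
    """
    stack, buffer, built, chain = [], list(range(1, n + 1)), set(), []
    while buffer:
        j = buffer[0]
        # Left endpoints (still on the stack) of arcs that involve the next node j.
        pending = {i for i in stack if (i, j) in arcs or (j, i) in arcs}
        while pending:
            i = stack[-1]
            if i in pending:
                if (j, i) in arcs:          # arcs(i): Left-Arc adds (j, i)
                    built.add((j, i))
                    chain.append("Left-Arc")
                if (i, j) in arcs:          # arcs(i): Right-Arc adds (i, j)
                    built.add((i, j))
                    chain.append("Right-Arc")
                pending.remove(i)
            else:
                stack.pop()                 # node is now covered by an arc to j
                chain.append("Reduce")
        stack.append(buffer.pop(0))
        chain.append("Shift")
    return chain, stack, built


gold = {(2, 1), (2, 3), (4, 2)}
chain, stack, built = planar_oracle(4, gold)
print(chain)
print(stack, built == gold)
```

For a planar input every arc is built and every uncovered node remains on the final stack. For a non-planar input such as {(1, 3), (2, 4)}, node 2 is reduced before the arc (2, 4) can be added, so that arc is never created.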

#### 3.3.2 Constraints on Planar Dependency Parsing

As we have just proved, the transition system *S*_{P} is able to parse all planar dependency graphs. In many practical applications, however, it is convenient to exclude some subset of those graphs, for example, those that have cycles or more than one head per node. The results obtained in Section 3.2 can be used to easily add common constraints to the planar parser. The constraints can be added individually or jointly, so that we can obtain a variant of the planar parser with the Single-Head, Acyclicity, and No-Covered-Roots constraints, or with any combination of them.

*Single-Head Constraint.* To add the Single-Head constraint to the *S*_{P} transition system, we restrict the Left-Arc_{P} transition to configurations outside *H*_{σ}(*C*), and the Right-Arc_{P} transition to configurations outside *H*_{β}(*C*). The soundness of this variant for the set of planar dependency graphs that meet the Single-Head constraint is trivially given by Proposition 2. Completeness is also straightforward, because, as discussed in Proposition 2, applying a Left-Arc transition to a configuration of *H*_{σ}(*C*) or a Right-Arc transition to a configuration of *H*_{β}(*C*) will *always* generate a graph violating the Single-Head constraint. Therefore, any graph that meets the Single-Head constraint and can be obtained using the *S*_{P} transition system (which has been proven complete) can also be generated by this one.

*Acyclicity Constraint.* Analogously to the case for the Single-Head constraint, we can add the Acyclicity constraint to the *S*_{P} transition system by applying Proposition 3. To do so, we restrict the Left-Arc_{P} and Right-Arc_{P} transitions so that they can only be applied when the top of the stack and the first node of the buffer are not already connected by an undirected path in *A*.

The soundness of this variant for the set of acyclic planar dependency graphs is trivially implied by Proposition 3. This variant is not complete for acyclic planar dependency graphs, because it actually enforces a stronger variant of Acyclicity: it will only accept dependency graphs that have no undirected cycles. We can combine this acyclicity check with the Single-Head constraint by intersecting the two restrictions. We then obtain a parser that is sound and complete for the set of planar dependency graphs that meet the Single-Head and Acyclicity constraints. The reason is that, under the Single-Head constraint, standard Acyclicity and undirected acyclicity are equivalent, because every undirected cycle is also a directed cycle.

If we need a parser that enforces only directed Acyclicity but allows nodes with multiple heads, this can also be achieved. Instead of checking for an undirected path, the restrictions must check that the new arc does not create a directed cycle (that is, that there is no directed path from the top of the stack to the first node of the buffer for Left-Arc, and no directed path in the opposite direction for Right-Arc). Although the check for undirected cycles can be implemented in constant time if the parser implementation keeps track of the connected component of each node in *A*, the check for directed cycles is computationally more costly.^{7}
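One way to keep track of the connected component of each node is a union-find structure. The following is a minimal sketch under that assumption; the class `Components` and its method names are illustrative and do not come from the article, and union-find gives near-constant amortized time per check rather than a strict constant.

```python
class Components:
    """Tracks connected components of the undirected arc set (hypothetical helper)."""

    def __init__(self, n):
        # Nodes are numbered 1..n; index 0 is unused.
        self.parent = list(range(n + 1))

    def find(self, x):
        # Path halving keeps the trees shallow.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def can_link(self, i, j):
        """True if an arc between i and j would not close an undirected cycle."""
        return self.find(i) != self.find(j)

    def link(self, i, j):
        # Merge the components of i and j after the arc has been added.
        self.parent[self.find(i)] = self.find(j)


comps = Components(4)
added = []
for head, dep in [(1, 2), (3, 2), (1, 3), (4, 3)]:
    if comps.can_link(head, dep):
        comps.link(head, dep)
        added.append((head, dep))
# The arc (1, 3) is rejected because 1 and 3 are already connected through 2.
```

Note that this check alone still admits nodes with several heads, such as node 2 in the example; the Single-Head restriction is a separate test.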

*No-Covered-Roots Constraint.* Similarly to the other constraints, we can add the No-Covered-Roots constraint to *S*_{P} by applying Proposition 1. To do so, we restrict the Reduce_{P} transition to configurations in *H*_{σ}(*C*).

The soundness of the resulting parser with respect to planar dependency graphs complying with the No-Covered-Roots constraint is directly given by Proposition 1. To prove completeness, we observe that the transition sequences that we build for each graph in the proof of Lemma 5 only reduce nodes that are then covered by an arc. Therefore, given a graph *G* that satisfies the No-Covered-Roots constraint, we know that the transition sequence built as in that proof will never reduce a root node. Therefore, all of its Reduce transitions will be applied to configurations in *H*_{σ}(*C*) and, hence, that same transition sequence will also parse *G* in this variant of the transition system, which proves completeness. The No-Covered-Roots restriction can be combined with any combination of the other two restrictions. Note that the result of applying the No-Covered-Roots restriction alone is equivalent to the arc-eager parser by Sagae and Tsujii (2008). If the Single-Head, Acyclicity, and No-Covered-Roots restrictions are applied at the same time, together with the Planar constraint that is implicit in the algorithm itself, we obtain a projective parser different from the projective parsers described in Section 3.1.

#### 3.3.3 Complexity of Planar Dependency Parsing

To study the runtime complexity of the planar parser, it suffices to observe that the planar transition system in any of its variants (with or without constraints) satisfies the following:

- It is efficient, by Definition 13, because it contains the elementary Shift transition and β(Shift_{P}) = 1, ∏(Reduce_{P}) = 1, *A*(Left-Arc_{P}) = 1, and *A*(Right-Arc_{P}) = 1. This implies that the system is robust (by Proposition 5) and bounded (by Proposition 6).
- The length of every transition sequence in the planar parser is *O*(*n*) (by Proposition 8) and the same holds for elementary transition sequences (by Proposition 9).

Given constant-time execution of each transition and deterministic or fixed-size beam search, the complexity of the planar parser is therefore *O*(*n*), both for the unrestricted version and for the variant that enforces the Single-Head and Acyclicity constraints.

### 3.4 Beyond Planarity

Although the divisible transition system framework introduced in Section 3 can be used to represent and study a wide range of parsers, we have seen by Proposition 4 that it is limited to parsers that generate planar dependency graphs. As already noted, planarity is a very mild relaxation of the better known projectivity constraint, the only difference being that planarity allows graphs with covered roots (see Definition 2), and studies of natural language treebanks have shown the vast majority of non-projective structures to be non-planar as well (Kuhlmann and Nivre 2006; Havelka 2007).^{8} Therefore, being able to parse planar dependency graphs only provides a modest improvement in practical coverage with respect to projective parsing. To increase this coverage further, we need to be able to handle dependency graphs with crossing arcs.

To be able to build such graphs, several stack-based transition systems have been proposed in the literature that introduce extra flexibility by allowing actions that fall outside the divisible transition system framework, like the systems by Attardi (2006) and Nivre (2009) shown at the end of Section 3.1. Because these parsers use diverse strategies to support different subsets of non-planar structures, namely allowing arcs to be built to or from nodes deep in the stack in the case of Attardi (2006) and adding transitions able to reorder stack nodes in the case of Nivre (2009), it seems unlikely that a simple extension of the framework can encompass all of them in a natural way. We can, however, extend the framework individually for each approach by adding the respective new transitions as elementary transitions, but the details and properties of each of these extensions fall outside the scope of this article.

Instead, in the next sections we will focus on introducing a different extension of the framework that is achieved by adding additional stacks, giving support to a generalization of the planar transition system described in Section 3.3 that can parse a large set of non-planar graphs.

## 4. Multiplanar Dependency Graphs

Because it has been shown that exact parsing becomes computationally intractable when arbitrary non-projective dependency graphs are allowed (McDonald and Satta 2007), a substantial amount of research in recent years has been devoted to finding a superset of projective dependency graphs that is rich enough to cover the non-projective phenomena found in natural language and restricted enough to allow for simple and efficient parsing, that is, a suitable set of **mildly non-projective dependency structures**. To this end, different sets of dependency trees have been proposed, such as trees with bounded arc degree (Nivre 2006a; Nivre 2007), well-nested trees with bounded gap degree (Kuhlmann and Nivre 2006; Kuhlmann and Möhl 2007), mildly ill-nested trees with bounded gap degree (Gómez-Rodríguez, Weir, and Carroll 2009), or the operationally defined set of trees parsed by the transition system of Attardi (2006).

In the same vein, a straightforward way to relax the planarity constraint to obtain richer sets of non-projective dependency graphs is the notion of **multiplanarity**, or *k*-planarity, originally introduced by Yli-Jyrä (2003). Quite simply, a dependency graph is said to be *k*-planar if it can be decomposed into *k* planar dependency graphs.

**Definition 14**

A dependency graph *G* = (*V*, *A*) is ***k*-planar** if there exist planar dependency graphs *G*_{1} = (*V*, *A*_{1}), …, *G*_{k} = (*V*, *A*_{k}) (called **planes**) such that *A* = *A*_{1} ∪ ⋯ ∪ *A*_{k}.

Intuitively, we can associate planes with colors and say that a dependency graph *G* is *k*-planar if it is possible to assign one of *k* colors to each of its arcs in such a way that arcs with the same color do not cross. Note that there may be multiple ways of dividing a *k*-planar graph into planes, as shown in the example of Figure 4. By definition, 1-planarity is equivalent to planarity, and increasing values of *k* yield increasingly rich sets of dependency graphs.

The notion of *k*-planarity has so far played a marginal role in the dependency parsing literature, because little was known about the properties of these structures. No algorithm was known to determine whether a given graph was *k*-planar, and no efficient parsing algorithm existed for *k*-planar dependency structures. In this article, we overcome these problems. In the remainder of this section, we present a procedure to determine the minimum value of *k* for which a given structure is *k*-planar, and we use it to show that the overwhelming majority of sentences in a number of dependency treebanks have a tree that is at most 2-planar. In Section 5, we then show how the 1-planar dependency parser described in Section 3.3 can be generalized to handle *k*-planar dependency graphs by introducing additional stacks. In particular, we present a linear-time transition-based parser that is provably correct for 2-planar dependency trees.^{9}

### 4.1 Test for Multiplanarity

In order for a constraint on non-projective dependency structures to be useful for practical parsing, it must provide a good balance between parsing efficiency and coverage of non-projective phenomena present in natural language treebanks. For example, Kuhlmann and Nivre (2006) and Havelka (2007) have shown that the vast majority of structures present in existing treebanks are well-nested and have a small gap degree (Bodirsky, Kuhlmann, and Möhl 2005), leading to an interest in parsers for these kinds of structures (Gómez-Rodríguez, Weir, and Carroll 2009; Kuhlmann and Satta 2009). No similar analysis has been performed for *k*-planar structures, however. Yli-Jyrä (2003) does provide evidence that all except two structures in the Danish Dependency Treebank (Kromann 2003) are at most 3-planar, but his analysis is based on constraints that restrict the possible ways of assigning planes to dependency arcs, and he is not guaranteed to find the minimal number *k* for which a given structure is *k*-planar.

In this section, we provide a procedure for finding the minimal number *k* such that a dependency graph is *k*-planar and use it to show that the vast majority of sentences in a number of dependency treebanks are at most 2-planar, with a coverage comparable to that of well-nestedness. The idea is to reduce the problem of determining whether a dependency graph *G* = (*V*, *A*) is *k*-planar, for a given value of *k*, to a standard graph coloring problem. To do this, we first consider an undirected graph *U*(*G*) derived from *G*.

Note that we can formally say that two arcs (*i*, *j*) and (*k*, *l*) in a dependency graph *G* such that min(*i*, *j*) < min(*k*, *l*) are crossing arcs if and only if min(*i*, *j*) < min(*k*, *l*) < max(*i*, *j*) < max(*k*, *l*). These are the pairs of arcs that were forbidden in the planarity constraint introduced in Definition 2. The graph *U*(*G*), which we call the **crossings graph** of *G*, has one node corresponding to each arc in the dependency graph *G*, with an undirected edge between two nodes if they correspond to crossing arcs in *G*. Figure 5 shows the crossings graph of the 2-planar structure in Figure 4.
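The construction of the crossings graph can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the function names `crosses` and `crossings_graph` are assumptions, arcs are represented as pairs of word positions, and the example arcs are made up rather than taken from Figure 4.

```python
from itertools import combinations


def crosses(arc1, arc2):
    """True if the two arcs cross, regardless of their direction."""
    lo1, hi1 = min(arc1), max(arc1)
    lo2, hi2 = min(arc2), max(arc2)
    return lo1 < lo2 < hi1 < hi2 or lo2 < lo1 < hi2 < hi1


def crossings_graph(arcs):
    """Adjacency sets of U(G): one node per arc, one edge per crossing pair."""
    graph = {arc: set() for arc in arcs}
    for a, b in combinations(arcs, 2):
        if crosses(a, b):
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Hypothetical dependency arcs over a six-word sentence.
example_arcs = [(2, 1), (2, 4), (3, 5), (4, 6)]
u = crossings_graph(example_arcs)
```

Arcs that merely share an endpoint, such as (2, 4) and (4, 6), are not connected in the result, because the inequalities in the definition are strict.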

As noted earlier, a dependency graph *G* is *k*-planar if each of its arcs can be assigned one of *k* colors in such a way that two arcs that cross each other are not assigned the same color. In terms of the crossings graph, because each arc in *G* corresponds to a node in *U*(*G*) and each pair of crossing arcs in *G* corresponds to an edge in *U*(*G*), this is equivalent to saying that *G* is *k*-planar if each of the *nodes* of *U*(*G*) can be assigned one of *k* colors such that no two neighbors have the same color. This amounts to solving the well-known *k*-coloring problem for *U*(*G*).

For *k* = 1 the problem is trivial: A graph is 1-colorable if and only if it has no edges. This corresponds to a dependency graph being planar if and only if it does not have crossing arcs. For *k* = 2, the problem is equivalent to determining whether the graph is bipartite, and it can be solved in time linear in the size of the graph by simple breadth-first search. Given any undirected graph *U* = (*V*,*E*), we pick an arbitrary node *v* and give it one of two colors. This forces us to give the other color to all its neighbors, the first color to the neighbors' neighbors, and so on. This process continues until we have processed all the nodes in the connected component of *v*. If this has resulted in assigning two different colors to the same node, the graph is not 2-colorable. Otherwise, we have obtained a 2-coloring of the connected component of *U* that contains *v*. If there are still unprocessed nodes, we repeat the process by arbitrarily selecting one of them, continue with the rest of the connected components, and in this way obtain a 2-coloring of the whole graph if it exists. Because this process can be completed by visiting each node and edge of the graph *U* once, its complexity is *O*(|*V*| + |*E*|). The crossings graph of a dependency graph with *n* nodes can trivially be built in time *O*(*n*^{2}) by checking each pair of dependency arcs to determine if they cross, and cannot contain more than *n*^{2} edges, meaning that we can check if the dependency graph for a sentence of length *n* is 2-planar in *O*(*n*^{2}) time.
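The breadth-first procedure for *k* = 2 can be illustrated as follows. This is a sketch under simple assumptions (arcs are distinct pairs of word positions; `two_planar_split` is a hypothetical name), and the two example graphs are constructed for illustration only.

```python
from collections import deque
from itertools import combinations


def crosses(arc1, arc2):
    """True if the two arcs cross, regardless of their direction."""
    lo1, hi1 = min(arc1), max(arc1)
    lo2, hi2 = min(arc2), max(arc2)
    return lo1 < lo2 < hi1 < hi2 or lo2 < lo1 < hi2 < hi1


def two_planar_split(arcs):
    """Map each arc to plane 0 or 1, or return None if no 2-coloring exists."""
    arcs = list(arcs)
    neighbours = {arc: [] for arc in arcs}
    for a, b in combinations(arcs, 2):
        if crosses(a, b):
            neighbours[a].append(b)
            neighbours[b].append(a)
    plane = {}
    for start in arcs:                  # one search per connected component
        if start in plane:
            continue
        plane[start] = 0
        queue = deque([start])
        while queue:
            a = queue.popleft()
            for b in neighbours[a]:
                if b not in plane:
                    plane[b] = 1 - plane[a]   # forced: the other color
                    queue.append(b)
                elif plane[b] == plane[a]:
                    return None               # odd cycle of crossings
    return plane


chain = [(1, 3), (2, 4), (3, 5)]        # consecutive arcs cross
triangle = [(1, 4), (2, 5), (3, 6)]     # every pair of arcs crosses
split_chain = two_planar_split(chain)
split_triangle = two_planar_split(triangle)
```

In the first example the two outer arcs end up in the same plane; in the second, three mutually crossing arcs make a 2-coloring impossible, so the function returns `None`.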

For *k* > 2, the *k*-coloring problem is known to be NP-complete (Karp 1972).^{10} We have found this not to be a problem in practice when using it to measure multiplanarity in natural language treebanks, because the effective problem size can be reduced by noting that each connected component of the crossings graph can be treated separately, and that nodes that are not part of a cycle need not be considered. If we have a valid coloring for all the cycles in the graph, the rest of the nodes can be safely colored by breadth-first search as in the *k* = 2 case. Given that non-projective sentences in natural language tend to have a small proportion of non-projective arcs, the connected components of their crossings graphs tend to be very small and with few cycles, and *k*-colorings for them can quickly be found by brute-force search.
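A brute-force search of the kind mentioned above might look like the following sketch. It uses plain backtracking over the whole crossings graph and tries increasing values of *k*; it does not implement the reductions described in the text (separate treatment of connected components, skipping nodes outside cycles), and the name `min_planarity` is an assumption made for illustration.

```python
def crosses(arc1, arc2):
    """True if the two arcs cross, regardless of their direction."""
    lo1, hi1 = min(arc1), max(arc1)
    lo2, hi2 = min(arc2), max(arc2)
    return lo1 < lo2 < hi1 < hi2 or lo2 < lo1 < hi2 < hi1


def min_planarity(arcs):
    """Smallest k such that the arcs can be split into k planes."""
    arcs = list(arcs)
    adj = {a: [b for b in arcs if b != a and crosses(a, b)] for a in arcs}

    def colorable(k, i, color):
        # Try to color arcs[i:], given the colors already chosen for arcs[:i].
        if i == len(arcs):
            return True
        a = arcs[i]
        used = {color[b] for b in adj[a] if b in color}
        for c in range(k):
            if c not in used:
                color[a] = c
                if colorable(k, i + 1, color):
                    return True
                del color[a]
        return False

    k = 1
    while not colorable(k, 0, {}):
        k += 1
    return k


k_flat = min_planarity([(1, 2), (2, 3)])             # no crossings
k_chain = min_planarity([(1, 3), (2, 4), (3, 5)])    # bipartite crossings
k_triangle = min_planarity([(1, 4), (2, 5), (3, 6)]) # mutually crossing arcs
```

The worst-case running time is exponential, which is consistent with the NP-completeness result; the sketch is only practical because crossings graphs of treebank sentences tend to be small.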

### 4.2 Treebank Coverage

To find out the prevalence of *k*-planar trees in natural language treebanks for various values of *k*, we applied the technique described in the previous section to all the trees in the training set for eight languages in the CoNLL-X shared task on dependency parsing (Buchholz and Marsi 2006): Arabic (Hajič et al. 2004), Czech (Hajič et al. 2006), Danish (Kromann 2003), Dutch (Van der Beek et al. 2002), German (Brants et al. 2002), Portuguese (Afonso et al. 2002), Swedish (Nilsson, Hall, and Nivre 2005), and Turkish (Atalay, Oflazer, and Say 2003; Oflazer et al. 2003). The results are shown in Table 1.

| Language | Trees | Non-Projective | Not Planar | Not 2-Planar | Not 3-Planar | Not 4-Planar | Ill-nested |
|---|---|---|---|---|---|---|---|
| Arabic | 2,995 | 205 (6.84%) | 158 (5.28%) | 0 (0.00%) | 0 (0.00%) | 0 (0.00%) | 1 (0.03%) |
| Czech | 87,889 | 20,353 (23.16%) | 16,660 (18.96%) | 82 (0.09%) | 0 (0.00%) | 0 (0.00%) | 96 (0.11%) |
| Danish | 5,512 | 853 (15.48%) | 827 (15.00%) | 1 (0.02%) | 1 (0.02%) | 0 (0.00%) | 6 (0.11%) |
| Dutch | 13,349 | 4,865 (36.44%) | 4,115 (30.83%) | 162 (1.21%) | 1 (0.01%) | 0 (0.00%) | 15 (0.11%) |
| German | 39,573 | 10,927 (27.61%) | 10,908 (27.56%) | 671 (1.70%) | 0 (0.00%) | 0 (0.00%) | 419 (1.06%) |
| Portuguese | 9,071 | 1,718 (18.94%) | 1,713 (18.88%) | 8 (0.09%) | 0 (0.00%) | 0 (0.00%) | 7 (0.08%) |
| Swedish | 6,159 | 293 (4.76%) | 280 (4.55%) | 5 (0.08%) | 0 (0.00%) | 0 (0.00%) | 14 (0.23%) |
| Turkish | 5,510 | 657 (11.92%) | 657 (11.92%) | 10 (0.18%) | 0 (0.00%) | 0 (0.00%) | 20 (0.36%) |

As we can see, the coverage provided by the 2-planarity constraint is comparable to that of well-nestedness. In most of the treebanks, well over 99% of the sentences are 2-planar, and 3-planarity has almost total coverage. In comparison to well-nestedness, it is worth noting that no efficient parser has been proposed that is able to handle *all* well-nested dependency trees, only well-nested trees with bounded gap degree, which reduces coverage (Kuhlmann and Möhl 2007; Gómez-Rodríguez, Carroll, and Weir 2011). As will be seen in the next section, the class of 2-planar dependency trees not only has good coverage of linguistic structures in existing treebanks but is also parsable with a linear-time transition-based parser, making it a theoretically as well as practically interesting subclass of non-projective dependency trees.

## 5. Multiplanar Dependency Parsing

The divisible transition system framework introduced in Section 3 can be generalized to support *k*-planar dependency graphs by using *k* stacks instead of only one and applying the Shift and Unshift elementary transitions to all of them at the same time, whereas Reduce, Left-Arc, and Right-Arc only affect one stack at a time. The stack on which these latter transitions are applied is decided by an extra elementary transition, called Switch, which cycles through the *k* stacks selecting one of them as the *active* stack.

This generalization has the property that the set of arcs created in the context of each individual stack will be planar, but pairs of arcs created in different stacks are allowed to cross. In this way, a *k*-stack parser will be able to build a *k*-planar dependency forest by using each of the stacks to construct one of its *k* planes.

Although the general case of *k*-planar dependency parsing is interesting as a theoretical construction, we will limit ourselves in this article to the 2-planar case and show how a system built by generalizing the planar parser defined in Section 3.3 to use two stacks instead of one can yield an efficient parser for 2-planar dependency graphs, in particular 2-planar trees. As we saw in Section 4.2, this class of structures gives almost perfect coverage in existing treebanks, and we will therefore leave the exploration of *k*-planar dependency parsing for *k* higher than 2 as future work.

Note that, because we are only interested in defining a single transition system using the multi-stack generalization of the divisible transition system framework, we will introduce the system directly as a generalization of the planar transition system, rather than showing the step-by-step details of how the general framework is first extended to multiple stacks (as outlined earlier) and then defining the new system on top of the extended framework.

### 5.1 2-Planar Dependency Parsing

The transition system *S*_{2P} has configurations of the form (*σ*^{1}, *σ*^{2}, *B*, *A*), where we call *σ*^{1} the **active stack** and *σ*^{2} the **inactive stack**. Because the system uses two stacks rather than one, it does not conform to the standard definition of a stack-based transition system given in Section 2.2, but it behaves analogously. In this case, the initialization function is *c*_{s}(*w*_{1}, …, *w*_{n}) = ([ ], [ ], [1, …, *n*], ∅) and the set of terminal configurations is the set of all configurations of the form (*σ*^{1}, *σ*^{2}, [ ], *A*), that is, those with an empty buffer.

The system has five transitions. The Shift_{2P} transition pops the first (leftmost) word in the buffer, and pushes it to *both* stacks. The Left-Arc_{2P} transition adds an arc from the first word in the buffer to the top of the *active* stack. The Right-Arc_{2P} transition adds an arc from the top of the *active* stack to the first word in the buffer. The Reduce_{2P} transition pops the top word from the *active* stack, implying that we have added all arcs to or from it on the plane tied to that stack. The Switch_{2P} transition, finally, makes the active stack inactive and vice versa, changing the plane the parser is working with. In order to exemplify how this system can parse non-planar dependency graphs, Figure 6 shows a transition sequence for the tree in Figure 1.
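The five transitions can be made concrete with a small sketch. It follows the prose description only: the class `Config` and the function names are illustrative assumptions, preconditions (such as a non-empty active stack) are not checked, arcs are stored as (head, dependent) pairs, and the example sentence is hypothetical rather than the tree of Figure 1.

```python
from dataclasses import dataclass, field


@dataclass
class Config:
    active: list                                # sigma^1, top is the last element
    inactive: list                              # sigma^2
    buffer: list                                # B, first word is buffer[0]
    arcs: set = field(default_factory=set)      # A, as (head, dependent) pairs


def initial(n):
    return Config([], [], list(range(1, n + 1)))


def shift(c):
    word = c.buffer.pop(0)
    c.active.append(word)      # the word is pushed to both stacks
    c.inactive.append(word)


def left_arc(c):
    c.arcs.add((c.buffer[0], c.active[-1]))     # buffer front -> stack top


def right_arc(c):
    c.arcs.add((c.active[-1], c.buffer[0]))     # stack top -> buffer front


def reduce(c):
    c.active.pop()             # only the active stack is affected


def switch(c):
    c.active, c.inactive = c.inactive, c.active


def is_terminal(c):
    return not c.buffer


# Build the crossing arcs 1 -> 3 and 2 -> 4 for a four-word sentence.
c = initial(4)
for step in [shift, shift, reduce, right_arc, shift,
             switch, reduce, right_arc, shift]:
    step(c)
```

The arc 1 → 3 is created while the first stack is active and the arc 2 → 4 after the Switch, so each plane stays free of crossings even though the two arcs cross each other.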

#### 5.1.1 Correctness of 2-Planar Dependency Parsing

To show that this transition system is correct for the set of 2-planar dependency graphs, we need to prove that it is sound (every graph produced by the system is 2-planar) and complete (all 2-planar graphs can be derived from the system). We do this by proving two corresponding lemmas, the second of which is a stronger claim than mere completeness.

**Lemma 6**

The system *S*_{2P} is sound for the set of 2-planar dependency graphs.

**Proof**

This lemma is proven by showing that the algorithm cannot create a pair of crossing arcs on the same stack. This is done by applying the proof of Proposition 4 separately to each of the two stacks of the 2-planar system (or, alternatively, by observing that the transition system resulting from ignoring one of the stacks in the 2-planar system is divisible). This implies that, given each of the two stacks, the subgraph formed by the arcs created by a transition sequence in configurations where that stack was active is planar, which trivially implies that the graph generated by the sequence is 2-planar.

**Lemma 7**

Let *G* = (*V*,*A*) be a 2-planar dependency graph for a sentence *w*_{1} …*w*_{n}, with planes *P*_{1} and *P*_{2}. Then there is a transition sequence in *S*_{2P} ending in a terminal configuration of the form (*σ*^{1}, *σ*^{2}, [ ], *A*) such that all the nodes that are not covered by any dependency arc in *P*_{1} are in *σ*^{1}, and all the nodes that are not covered by any dependency arc in *P*_{2} are in *σ*^{2}.

**Proof**

The proof is analogous to that of the planar parser, but we have to handle two stacks and two planes. As in the planar case, we proceed by induction on the length *n* of the sentence. In the case where *n* = 1, the only possible 2-planar dependency graph is the graph *G*_{0} = ({1} , ∅ ) with a single node and no arcs, and the transition sequence that applies a single Shift transition meets the conditions of the lemma, because it ends in a terminal configuration ([1], [1], [ ], ∅ ).

We now assume that the lemma holds for sentences of length *n* and prove that it also holds for sentences of length *n* + 1, for any *n* ≥ 0. Let *G*_{n + 1} = (*V*_{n + 1}, *A*_{n + 1}) be a 2-planar dependency graph for a sentence *w*_{1} … *w*_{n + 1}, with planes *P*_{1} and *P*_{2}. We denote by *L*_{n + 1} the set of incoming and outgoing arcs of the node *n* + 1 in *G*_{n + 1}, and we denote by *G*_{n} = (*V*_{n}, *A*_{n}) the graph obtained by removing the node *n* + 1 and all its incoming and outgoing arcs from *G*_{n + 1}. It is easy to show that the graphs obtained from *P*_{1} and *P*_{2} by removing the node *n* + 1 and the arcs in *L*_{n + 1} are planes of *G*_{n}. They are planar graphs (being subgraphs of *P*_{1} and *P*_{2}, which are planar) and the union of their arc sets is *A*_{n + 1} ∖ *L*_{n + 1} = *A*_{n} (because the union of the arc sets of *P*_{1} and *P*_{2} is *A*_{n + 1}, as *P*_{1} and *P*_{2} are planes of *G*_{n + 1}).

By the induction hypothesis, there exists a transition sequence *C*_{n} for *G*_{n} whose final configuration has an empty buffer and arc set *A*_{n}, and whose *b*-th stack contains all the nodes that are not covered by any dependency arc in the *b*-th plane of *G*_{n}, for *b* = 1,2. From this transition sequence *C*_{n}, we will obtain a transition sequence *C*_{n + 1} meeting the conditions asserted by the lemma for the graph *G*_{n + 1}.

For *b* = 1,2, the planarity of the *b*-th plane of *G*_{n + 1} implies that the left endpoints of the arcs of *L*_{n + 1} in that plane cannot be covered by any arc in the *b*-th plane of *G*_{n}, because this would mean that the arc in *L*_{n + 1} and the covering arc would cross. Therefore, by the induction hypothesis, we know that all these left endpoints are in the *b*-th stack at the end of *C*_{n}. Thus, if the left endpoints of the arcs of *L*_{n + 1} in the first plane are *i*_{1}, *i*_{2}, …, *i*_{e} and those of the arcs in the second plane are *j*_{1}, *j*_{2}, …, *j*_{f}, then the first stack (which is ordered, because the same reasoning as in Lemma 1 can be applied to the 2-planar transition system) contains *i*_{1}, …, *i*_{e} in this order from bottom to top, possibly with other nodes in between, and the second stack likewise contains *j*_{1}, …, *j*_{f}. With this in mind, we can obtain the transition sequence *C*_{n + 1} from *C*_{n} by adding extra transitions at the end of its associated transition chain: Reduce transitions that bring each of *i*_{e}, …, *i*_{1} in turn to the top of the first stack, each followed by *arcs*(*i*) for that node; the corresponding transitions for *j*_{f}, …, *j*_{1} on the second stack, with Switch transitions to change the active stack as needed; and a Shift transition for the node *n* + 1. Here we use the notation *arcs*(*i*) as shorthand for:

- Left-Arc, if (*n* + 1, *i*) ∈ *L*_{n + 1} ∧ (*i*, *n* + 1) ∉ *L*_{n + 1},
- Right-Arc, if (*n* + 1, *i*) ∉ *L*_{n + 1} ∧ (*i*, *n* + 1) ∈ *L*_{n + 1},
- Left-Arc; Right-Arc, if (*n* + 1, *i*) ∈ *L*_{n + 1} ∧ (*i*, *n* + 1) ∈ *L*_{n + 1}.

The final configuration reached after appending these transitions to *C*_{n} is of the form (*σ*^{1}, *σ*^{2}, *β*, *A*), where:

- *β* = [ ], because the nodes 1, …, *n* are removed from the buffer by *C*_{n}, and *n* + 1 is removed by the extra Shift transition;
- *A* = *A*_{n + 1}, because the arcs in *A*_{n} are added to the set by *C*_{n}, and all the arcs in *L*_{n + 1} are added by *arcs*(*i*_{e}), …, *arcs*(*i*_{1}), *arcs*(*j*_{f}), …, *arcs*(*j*_{1});
- all the nodes that are not covered by arcs in the *b*-th plane of *G*_{n + 1} are in *σ*^{b}, for *b* = 1,2, because they were in the *b*-th stack at the end of *C*_{n} (a node not covered by arcs in a plane of *G*_{n + 1} is trivially not covered by arcs in the corresponding plane of *G*_{n}) and the Reduce transitions applied after *C*_{n} only remove nodes to the right of *i*_{1} from the first stack, which are covered by the arc (*n* + 1, *i*_{1}) or (*i*_{1}, *n* + 1), and nodes to the right of *j*_{1} from the second stack, which are covered by the arc (*n* + 1, *j*_{1}) or (*j*_{1}, *n* + 1).^{11}

**Proposition 10**

The system *S*_{2P} is correct for the set of 2-planar dependency graphs.

**Proof**

The proposition follows from Lemma 6 and Lemma 7.

#### 5.1.2 Constraints on 2-Planar Dependency Parsing

The No-Covered-Roots constraint is not so straightforward to implement in the 2-planar parser, because in the 2-planar case a node without a head may need to be reduced from one stack and get a head later from the other stack, so restricting the Reduce transitions in the 2-planar parser to nodes with a head would also forbid some structures without covered roots. In any case, the No-Covered-Roots constraint does not seem practically meaningful when we go beyond planar structures.

#### 5.1.3 Complexity of 2-Planar Dependency Parsing

To reason about the complexity of the 2-planar parser, we first note that a naive implementation of the transition system as given here does not guarantee termination. The reason is that the system allows an infinite sequence of Switch transitions, switching the active and inactive stacks repeatedly and cycling between the same two configurations without making any advance. This can easily be avoided in practice by forbidding Switch transitions from being executed if the last transition in the sequence was also a Switch. Note that we could also have incorporated this restriction into the formal system (for example, by adding a flag to configurations to indicate whether the previous transition was a Switch or not), but this would have unnecessarily complicated the notation. Assuming that our implementation of the 2-planar parser has this restriction on Switch transitions, we can show that the length of a transition sequence for a sentence of length *n* is *O*(*n*) in the same way as for efficient divisible systems (see Section 3.2.2).
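The restriction on consecutive Switch transitions amounts to a one-line filter on the candidate transitions at each step. The sketch below assumes transitions are identified by plain strings and that the previous transition is remembered by the caller; the name `permitted` is hypothetical, and the other applicability conditions of the system are deliberately left out.

```python
SWITCH = "Switch"
ALL_TRANSITIONS = ["Shift", "Reduce", "Left-Arc", "Right-Arc", SWITCH]


def permitted(transitions, previous):
    """Drop Switch from the candidates if the previous transition was a Switch."""
    return [t for t in transitions
            if not (t == SWITCH and previous == SWITCH)]


after_switch = permitted(ALL_TRANSITIONS, previous=SWITCH)
after_shift = permitted(ALL_TRANSITIONS, previous="Shift")
at_start = permitted(ALL_TRANSITIONS, previous=None)
```

Keeping the previous transition outside the configuration mirrors the choice made in the text of not adding a flag to the formal system.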

**Proposition 11**

Let *S*_{2P} be the 2-planar system restricted so that two consecutive Switch transitions are not permitted. Then the length of every transition sequence for a sentence *x* of length *n* in *S*_{2P} is *O*(*n*).

**Proof**

The proof follows the same lines as for efficient divisible transition systems. For every transition chain *T*_{1,m} = *t*_{1}, …, *t*_{m} for *x* = *w*_{1}, …, *w*_{n}, the following must hold:

- The number of Shift transitions in *T*_{1,m} is at most *n*, because each node in {1,…,*n*} can only be shifted once.
- The number of Reduce transitions in *T*_{1,m} is at most 2*n*, because each node in {1,…,*n*} can only be reduced twice (once per stack).
- The number of Left-Arc and Right-Arc transitions in *T*_{1,m} is bounded by the maximum number of arcs in a 2-planar dependency graph with *n* nodes, which is 8*n* − 12.^{12}
- Given the ban on consecutive Switch transitions, the maximum number of Switch transitions in *T*_{1,m} is 1 plus the number of other transitions.

From this we conclude that *m* ≤ 2(*n* + 2*n* + 8*n* − 12) + 1 and hence that *m* is *O*(*n*).
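For readers who want the arithmetic spelled out, the four counts in the proof can be added up explicitly. The derivation below only restates the figures given above; the grouping labels are added for readability.

```latex
\begin{aligned}
m &\le \underbrace{n}_{\text{Shift}}
     + \underbrace{2n}_{\text{Reduce}}
     + \underbrace{8n - 12}_{\text{Left-Arc, Right-Arc}}
     + \underbrace{(n + 2n + 8n - 12) + 1}_{\text{Switch}} \\
  &= 2(11n - 12) + 1 \\
  &= 22n - 23 .
\end{aligned}
```

For a sentence of ten words, for instance, this bound evaluates to 197 transitions.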

Applying the same reasoning as for the planar parser regarding constant-time execution of transitions and fixed-size beam search, we conclude that the complexity of the 2-planar parser is still *O*(*n*), both for the unrestricted version and for the variant with the Single-Head and Acyclicity constraints.

Throughout this article, we have presented complexity results for transition-based parsers under the assumption that these parsers use deterministic search or fixed-size beam search because this is the most straightforward method to make parsing practically feasible with the rich history-based feature models that are the key component of accurate transition-based parsers. The relevance of this assumption is further supported by recent results on tabularization and dynamic programming for transition-based parsing, which show that such techniques either lead to a significant increase in parsing complexity or require drastic simplifications in the feature models used. In the former case, practical parsing still has to rely on approximate inference, as in Huang and Sagae (2010). In the latter case, dynamic programming provides an exact inference method only for a very simple approximation of the original transition-based model, as in Kuhlmann, Gómez-Rodríguez, and Satta (2011). In general, this exemplifies the tradeoff between approximate inference with richer models (beam search) and exact inference with simpler models (dynamic programming). Thus, although the feature model used by Zhang and Nivre (2011) to achieve state-of-the-art accuracy for English makes dynamic programming very difficult due to the combinatorial effect on parsing complexity of complex valency and label set features, the feature representation of a single configuration can still be computed in constant time, which is all that is required to achieve linear-time parsing with beam search. The same is true for all the transition systems and feature models explored in this article. Nevertheless, it is an interesting theoretical question whether the novel 2-planar system allows for tabularization and what the resulting complexity would be. 
At present, we do not know the exact answer to this question, but a reasonable conjecture is that complexity would be exponential for the class of feature models that are relevant for transition-based parsing.

### 5.2 Experimental Evaluation

In this section, we present an experimental evaluation of the novel 1-planar and 2-planar transition systems in comparison to the widely used arc-eager projective system of Nivre (2003) (analyzed earlier in Example (4)). Besides being the default parsing algorithm in MaltParser (Nivre, Hall, and Nilsson 2006), this system is also the basis of the ISBN Dependency Parser (Titov and Henderson 2007) and ZPar (Zhang and Clark 2008; Zhang and Nivre 2011). In addition to a strictly projective arc-eager parser, we also include a version that uses **pseudo-projective parsing** (Nivre and Nilsson 2005) to recover non-projective arcs. This is the most widely used method for non-projective transition-based parsing and as such a competitive baseline for the 2-planar parser.

In order to make the comparison as exact as possible, we have chosen to implement all four systems in the MaltParser framework and use the same type of classifiers and feature models. For the arc-eager baselines, we copy the set-up from the CoNLL-X shared task on dependency parsing, which includes the use of support vector machines with a polynomial kernel, history-based feature models tuned separately for each language, and pseudo-projective parsing with the Head encoding (Nivre et al. 2006). For the 1-planar and 2-planar parsers, we use the same type of classifier but modify the feature model to take into account the following systematic differences between the transition systems:

- In both the 1-planar and 2-planar parser, we need to add features over the arc connecting the top node of the stack and the first node of the buffer (if any). No such arc can exist in the arc-eager system used by the projective and pseudo-projective baseline systems.
- In the 2-planar parser, we need to add features over the top nodes of the inactive stack. No such nodes exist in the 1-planar and arc-eager systems.

Table 2 shows parsing results for the same eight data sets from the CoNLL-X shared task that were investigated with respect to *k*-planarity in Section 4.2: Arabic, Czech, Danish, Dutch, German, Portuguese, Swedish, and Turkish. The overall accuracy metric is labeled attachment score (LAS), the percentage of tokens that are assigned both the correct head and the correct label. In addition, we report labeled precision (LP-NP) and recall (LR-NP) specifically on non-projective dependency arcs, where an arc (*i*, *j*) is taken to be non-projective if and only if there is some node *k* such that min (*i*,*j*) < *k* < max (*i*,*j*) and not *i* →* *k*. Precision is the percentage of non-projective arcs output by the system that are correct, and recall is the percentage of non-projective arcs in the gold standard that are output by the system. Note that, although precision is undefined for the projective parser because it does not output any non-projective arcs, recall may nevertheless be greater than zero because arcs that are non-projective in the gold standard can be projective in the output of the parser.^{13}
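The definitions of LP-NP and LR-NP can be illustrated with a short sketch. It assumes that each analysis is a set of (head, dependent, label) triples in which every node has at most one head and the root has no incoming arc; the function names and the two toy analyses are illustrative and are not taken from the evaluation itself.

```python
def dominates(head_of, i, k):
    """True if i ->* k, that is, i is k itself or an ancestor of k."""
    seen = set()
    while k is not None and k not in seen:
        if k == i:
            return True
        seen.add(k)
        k = head_of.get(k)
    return False


def non_projective_arcs(arcs):
    """Arcs (h, d, label) with some node strictly between h and d not dominated by h."""
    head_of = {d: h for h, d, _ in arcs}
    return {
        (h, d, label) for h, d, label in arcs
        if any(not dominates(head_of, h, k)
               for k in range(min(h, d) + 1, max(h, d)))
    }


def np_precision_recall(gold, predicted):
    gold, predicted = set(gold), set(predicted)
    gold_np = non_projective_arcs(gold)
    pred_np = non_projective_arcs(predicted)
    # Precision is undefined (None) when no non-projective arc is predicted.
    precision = len(pred_np & gold) / len(pred_np) if pred_np else None
    recall = len(gold_np & predicted) / len(gold_np) if gold_np else None
    return precision, recall


# Gold: node 2 is the root; the arc 1 -> 3 spans node 2 without dominating it.
gold = {(2, 1, "c"), (1, 3, "a"), (2, 4, "b")}
# Prediction: a projective tree rooted at node 1 that still contains 1 -> 3.
pred = {(1, 2, "c"), (1, 3, "a"), (3, 4, "b")}
result = np_precision_recall(gold, pred)
```

The toy prediction contains no non-projective arc, so precision is undefined, yet recall is 1.0 because the gold non-projective arc 1 → 3 appears in the output as a projective arc. This is the situation described in the note at the end of the paragraph above.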

Looking first at the overall LAS results, we see that the 2-planar parser outperforms both the 1-planar and the projective parser for languages with a high proportion of non-projective trees (≥ 19%): Czech, Dutch, German, and Portuguese. This is in line with our expectations, given the substantially higher coverage of the 2-planar parser for non-projective structures, and the difference is statistically significant at the 0.05 level for all languages in this group (McNemar's test). For three of these languages, the 2-planar parser also outperforms the pseudo-projective parser, although the differences are not statistically significant, and only in the case of Dutch is the pseudo-projective parser significantly better. Given the relatively small difference in coverage between the projective and 1-planar parser, one would expect these systems to have very similar performance, and this is also what we find except for Portuguese where the 1-planar parser is significantly better than the projective arc-eager parser.
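For readers unfamiliar with McNemar's test, the following Python sketch shows one common form, the exact two-sided binomial version applied to paired per-token correctness. Which variant and which unit of comparison were used in the experiments is not stated here, so both choices are assumptions of the sketch, as is the function name `mcnemar_exact`.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Exact two-sided McNemar test on paired correctness decisions.

    Only discordant pairs matter: b counts items that system A gets right
    and system B gets wrong, c counts the reverse. Under the null
    hypothesis, b follows a Binomial(b + c, 0.5) distribution.
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # the systems never disagree
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical data: system A is right on ten tokens that system B gets wrong.
a = [True] * 10 + [True] * 5
b = [False] * 10 + [True] * 5
p_value = mcnemar_exact(a, b)
print(p_value < 0.05)
```

A difference would be reported as significant at the 0.05 level when the returned p-value falls below 0.05.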

For languages with a lower proportion of non-projective trees (Arabic, Danish, Swedish, Turkish), there are generally smaller differences between the parsers, and for Danish and Turkish there are in fact no statistically significant differences at all, which indicates that the increased expressivity is not beneficial (nor harmful) when non-projective structures are rare. Interestingly, it seems that the planar parsers have an advantage over the arc-eager parsers for Arabic, where the 2-planar parser is significantly better than both the projective and pseudo-projective parsers. By contrast, the arc-eager parsers seem to have an advantage for Swedish, where the projective and pseudo-projective parsers are both significantly better than the 1-planar parser. At present, we have no explanation for this language-specific variation.

Turning next to labeled precision (LP-NP) and recall (LR-NP) on non-projective dependency arcs, we again find that the 2-planar parser does quite well on the four languages with 19% or more non-projective trees, with precision consistently over 50% and recall in the 35–60% range. Again, the results are very similar to those achieved with the pseudo-projective parser, with the 2-planar parser giving higher precision for Dutch and German and higher recall for German. For the remaining four languages, both precision and recall remain low, which probably points to a sparse data problem when learning how to switch between the two planes during parsing, but the same holds true for the pseudo-projective parser. As expected, the 1-planar parser has only marginally higher recall than the projective parser (which, as pointed out earlier, may recover non-projective dependencies by accident), but it is interesting to note that the 1-planar parser has relatively high precision on the few non-projective arcs that it predicts, in some cases comparable to that of the 2-planar parser.

In conclusion, the experimental evaluation shows that the 2-planar parser has the potential to improve parsing accuracy over a strictly projective (or 1-planar) parser for languages with a sufficient proportion of non-projective trees, and that it generally performs at about the same level as the widely used arc-eager pseudo-projective parser. We believe that it is possible to improve results even further by careful optimization of features and other parameters, but this will have to be left for future research. It would also be interesting to explore the use of global optimization and beam search, which has been shown to improve accuracy over local learning and greedy search (Titov and Henderson 2007; Zhang and Clark 2008; Zhang and Nivre 2011).

## 6. Related Work

The literature on dependency parsing has grown enormously in recent years and we will not attempt a comprehensive review here but focus on previous research related to the three main themes of the article: a formal framework for analyzing and constructing transition systems for dependency parsing (Section 3), a procedure for classifying mildly non-projective dependency structures in terms of multiplanarity (Section 4), and a novel transition-based parser for (a subclass of) non-projective dependency structures (Section 5).

### 6.1 Frameworks for Dependency Parsing

Due to the growing popularity of dependency parsing, several proposals have been made that group and study different dependency parsers under common (more or less formal) frameworks. Thus, Buchholz and Marsi (2006) observed that almost all of the systems participating in the CoNLL-X shared task could be classified as belonging to one of two approaches, which they called the “all pairs” and the “stepwise” approaches. This was taken up by McDonald and Nivre (2007), who called the first approach global exhaustive graph-based parsing and the second approach local greedy transition-based parsing. The terms *graph-based* and *transition-based* have become well established, even though there now exist graph-based models that do not perform exhaustive search (McDonald and Pereira 2006; Koo et al. 2010) as well as transition-based models that are neither local nor greedy (Titov and Henderson 2007; Zhang and Clark 2008).

Nivre (2008), building on earlier work in Nivre (2006b), formalizes transition-based parsing by means of *transition systems* and *oracles*. Two distinct types of transition systems are described, differing in the data structures they use to store partially processed tokens: stack-based and list-based systems. The formalization of stack-based systems provided there has been one point of departure for the present article (see Section 2.2) but, whereas general stack-based systems allow transitions to be arbitrary partial functions from configurations to configurations, we have focused on a class of systems where transitions are obtained by composing a small set of elementary transitions, allowing us to derive specific formal properties.

Gómez-Rodríguez, Carroll, and Weir (2011) propose a common deductive framework that can be used to describe a wide range of dependency parsers, including both graph-based and transition-based algorithms. Although the high abstraction level of this framework makes it able to describe and relate very different parsing strategies, it also means that it is not suitable to describe lower-level properties of transition-based parsers such as their computational complexity when implemented with beam search. Kuhlmann, Gómez-Rodríguez, and Satta (2011) introduce a technique to obtain polynomial-time deductive parsers that simulate all the transition sequences allowed by a transition system.

### 6.2 Mildly Non-Projective Dependency Structures

Most natural language treebanks contain non-projective dependency analyses (Havelka 2007), but the general problem of parsing arbitrary non-projective dependency graphs has been shown to be computationally intractable except under strong independence assumptions (McDonald and Satta 2007). This has motivated researchers to look for sets of dependency structures that have more coverage of linguistic phenomena than projective structures, while being more efficiently parsable than unrestricted non-projective graphs.

Several sets have been defined by applying different restrictions to dependency graphs, such as arc degree (Nivre 2006a; Nivre 2007), gap degree and well-nestedness (Bodirsky, Kuhlmann, and Möhl 2005; Kuhlmann and Nivre 2006; Kuhlmann and Möhl 2007), and *k*-ill-nestedness (Maier and Lichte 2009). Among these sets, only well-nested dependency structures with bounded gap degree have been shown to have exact polynomial-time algorithms (Kuhlmann 2010; Gómez-Rodríguez, Carroll, and Weir 2011). For dependency structures with bounded arc degree, a greedy transition-based parser based on the algorithm of Covington (2001) is described in Nivre (2007).

Other sets have been defined operationally as the set of dependency structures that are parsable by a given algorithm. These include the graphs parsable by the transition system of Attardi (2006) or the more restrictive dynamic programming variant of Cohen, Gómez-Rodríguez, and Satta (2011), the set of structures that yield binarizable productions with the algorithm of Kuhlmann and Satta (2009), or the set of mildly ill-nested structures (Gómez-Rodríguez, Weir, and Carroll 2009; Gómez-Rodríguez, Carroll, and Weir 2011).

As mentioned earlier, the notion of multiplanarity was originally introduced by Yli-Jyrä (2003), who also presents additional constraints on *k*-planar graphs. No algorithms were previously known to determine whether a given graph was *k*-planar or to efficiently parse *k*-planar dependency structures, however.

### 6.3 Non-Projective Transition-Based Parsing

Whereas early transition-based dependency parsers were restricted to projective dependency graphs (Yamada and Matsumoto 2003; Nivre 2003), several techniques have been proposed to accommodate non-projectivity within the transition-based framework. Pseudo-projective parsing, proposed by Nivre and Nilsson (2005), is a general technique applicable to any data-driven parser. Before training the parser, dependency structures are projectivized using lifting operations (Kahane, Nasr, and Rambow 1998), and partial information about the lifting paths is encoded in augmented arc labels. After parsing, dependency structures are deprojectivized using a heuristic search procedure guided by the augmented arc labels.

A more integrated approach is to deal with non-projectivity by adding extra transitions to projective transition systems. Attardi (2006) parses a restricted set of non-projective trees by adding transitions that create arcs using nodes deeper than the top of the stack. Nivre (2009) instead uses a transition that changes the order of input words, obtaining full coverage of non-projective structures in quadratic worst-case time (but achieving linear practical performance). A similar technique is used by Tratz and Hovy (2011) to develop an *O*(*n*^{2} log *n*) non-projective version of the easy-first parser of Goldberg and Elhadad (2010).

Finally, the parsing algorithm described by Covington (2001) can be implemented as a list-based transition system that in its unrestricted form is complete for all non-projective trees (Nivre 2008). The worst-case complexity for this system is *O*(*n*^{2}), but efficiency can be improved in practice by bounding the arc degree (Nivre 2006a; Nivre 2007).

## 7. Conclusion

Although data-driven dependency parsing has seen tremendous progress during the last decade in terms of empirically observed accuracy for a wide range of languages, it is probably fair to say that our theoretical understanding of the methods used is still less developed than for the more familiar paradigm of context-free grammar parsing. In this article, we have tried to contribute to the theoretical foundations of dependency parsing in essentially two ways.

Our first contribution is the framework of divisible transition systems, where transition systems for dependency parsing can be defined by composition and restriction of the five elementary transitions Shift, Unshift, Reduce, Left-Arc, and Right-Arc. On the one hand, this can be used as an analytical tool to characterize existing systems for dependency parsing and prove formal properties related to expressivity and complexity. Thus, we have shown that all divisible systems, including a number of well-known systems from the literature, are sound for planar dependency graphs and can be restricted to satisfy a number of other formal constraints, and we have characterized the subclass of efficient divisible transition systems that give linear parsing complexity when combined with greedy inference or beam search as is customary in transition-based parsing. Even though most of these results have been established previously for particular systems, the general framework allows us to show how the results follow from more general principles. On the other hand, the framework can be used to develop new systems with required formal properties. To illustrate this, we have presented a system that is both sound and complete for planar dependency graphs (with or without additional formal constraints) and that fills a gap in the dependency parsing literature.
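The idea of building complex transitions by composing elementary ones can be sketched in a few lines of Python. The encoding below is an assumption made for illustration: a configuration is a triple of stack, buffer, and arc set, arcs are unlabeled `(head, dependent)` pairs, and a transition returns `None` where it is undefined. The two composed transitions at the end are meant to mirror arc-eager style attachments and should be read as an example of composition, not as a restatement of the formal definitions in Section 3.

```python
from typing import Callable, FrozenSet, Optional, Tuple

Arc = Tuple[int, int]  # (head, dependent)
Config = Tuple[Tuple[int, ...], Tuple[int, ...], FrozenSet[Arc]]
Transition = Callable[[Config], Optional[Config]]


def shift(c: Config) -> Optional[Config]:
    stack, buf, arcs = c
    return (stack + (buf[0],), buf[1:], arcs) if buf else None


def unshift(c: Config) -> Optional[Config]:
    stack, buf, arcs = c
    return (stack[:-1], (stack[-1],) + buf, arcs) if stack else None


def reduce_(c: Config) -> Optional[Config]:
    stack, buf, arcs = c
    return (stack[:-1], buf, arcs) if stack else None


def left_arc(c: Config) -> Optional[Config]:
    """Add an arc from the first buffer node to the top stack node."""
    stack, buf, arcs = c
    if not stack or not buf or (buf[0], stack[-1]) in arcs:
        return None
    return stack, buf, arcs | {(buf[0], stack[-1])}


def right_arc(c: Config) -> Optional[Config]:
    """Add an arc from the top stack node to the first buffer node."""
    stack, buf, arcs = c
    if not stack or not buf or (stack[-1], buf[0]) in arcs:
        return None
    return stack, buf, arcs | {(stack[-1], buf[0])}


def compose(*transitions: Transition) -> Transition:
    """Apply the given transitions left to right; undefined if any step is."""
    def composed(c: Config) -> Optional[Config]:
        current: Optional[Config] = c
        for t in transitions:
            if current is None:
                return None
            current = t(current)
        return current
    return composed


# Complex transitions in the style of an arc-eager system (illustrative).
left_arc_then_reduce = compose(left_arc, reduce_)
right_arc_then_shift = compose(right_arc, shift)

c0: Config = ((), (1, 2, 3), frozenset())
final = right_arc_then_shift(shift(left_arc_then_reduce(shift(c0))))
print(final)
```

Restricting a system then amounts to narrowing the set of configurations on which each composed transition is defined, which in this encoding would mean returning `None` in more cases.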

Our second contribution consists in extending the available techniques for dependency parsing to multiplanar dependency graphs, an interesting hierarchy of mildly non-projective dependency structures that have remained unexplored due to the lack of suitable formal tools. First of all, we have shown that the problem of finding the smallest *k* such that a dependency graph is *k*-planar can be reduced to the familiar *k*-coloring problem for undirected graphs and can thereby be solved efficiently for *k* ≤ 2 but in practice also for higher *k* due to the sparseness of non-projective dependencies in natural language. Using this procedure, we have shown that the set of 2-planar dependency trees has a coverage in existing treebanks that is at least as good as alternative characterizations of mildly non-projective dependency structures. In addition, we have shown how the planar dependency parser defined in the first part of the article can be generalized to *k*-planar dependency graphs and in particular to the 2-planar case. Preliminary experiments using standard methods for transition-based parsing show that this system can give significant improvements over a strictly projective system for languages with a non-negligible proportion of non-projective dependencies.
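The reduction to graph coloring can be made concrete for the 2-planar case with a short Python sketch. It builds an undirected graph whose vertices are the dependency arcs and whose edges connect crossing arcs, and then tries to 2-color it by breadth-first search. The function names and the quadratic construction of the crossings graph are choices made for this illustration and need not match the procedure of Section 4 in detail; arcs are assumed to be distinct pairs of word positions.

```python
from collections import deque
from itertools import combinations
from typing import Dict, Iterable, Optional, Tuple

Arc = Tuple[int, int]


def crosses(e: Arc, f: Arc) -> bool:
    """Two arcs cross iff their endpoints strictly interleave."""
    a, b = sorted(e)
    c, d = sorted(f)
    return a < c < b < d or c < a < d < b


def two_plane_assignment(arcs: Iterable[Arc]) -> Optional[Dict[Arc, int]]:
    """Assign each arc to plane 0 or 1 so that no two arcs in the same plane
    cross, or return None if the graph is not 2-planar."""
    arcs = list(arcs)
    adjacent = {e: [] for e in arcs}
    for e, f in combinations(arcs, 2):
        if crosses(e, f):
            adjacent[e].append(f)
            adjacent[f].append(e)
    plane: Dict[Arc, int] = {}
    for start in arcs:
        if start in plane:
            continue
        plane[start] = 0
        queue = deque([start])
        while queue:
            e = queue.popleft()
            for f in adjacent[e]:
                if f not in plane:
                    plane[f] = 1 - plane[e]
                    queue.append(f)
                elif plane[f] == plane[e]:
                    return None  # odd cycle of crossings: more than 2 planes needed
    return plane


print(two_plane_assignment([(1, 3), (2, 4), (4, 5)]))
print(two_plane_assignment([(1, 4), (2, 5), (3, 6)]))
```

The first call yields a valid assignment with the two crossing arcs in different planes; the second returns `None` because the three arcs cross pairwise and therefore require three planes.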

There are a number of directions for future research that suggest themselves. First of all, there are many instances of divisible transition systems that have not yet been explored, either theoretically or for practical parsing applications. For example, as remarked in Section 3.3.2, there is a way of restricting the 1-planar parser to projective forests, which is different from previously explored systems for projective dependency parsing. Secondly, it may be interesting to study different ways of extending divisible transition systems for greater expressivity, besides introducing additional stacks. This may involve the addition of new transition types, as proposed by Attardi (2006) and Nivre (2009), or new data structures, as in the list-based systems of Nivre (2008). Finally, it would be interesting to see what level of accuracy can be reached for 2-planar dependency parsing with proper feature selection in combination with the latest techniques for global optimization and non-greedy search (Titov and Henderson 2007; Zhang and Clark 2008; Huang and Sagae 2010; Zhang and Nivre 2011).

## Acknowledgments

The authors would like to thank Johan Hall for support with the MaltParser system and three anonymous reviewers for useful comments on previous versions of the manuscript. The first author has been partially funded by the Spanish Ministry of Economy and Competitiveness and FEDER (project TIN2010-18552-C03-02) and Xunta de Galicia (Rede Galega de Recursos Lingüísticos para unha Sociedade do Coñecemento). Part of the reported experiments were conducted with the help of computing resources provided by the Supercomputing Center of Galicia (CESGA).

## Notes

The dependency graph has in this case been derived automatically from the constituency-based annotation in the treebank using standard head-finding rules and heuristics for inferring dependency labels.

In addition to stack-based systems, Nivre (2008) also investigates *list-based* systems, which make use of arbitrary lists instead of stacks that obey the last-in first-out constraint.

Please note that, according to standard terminology both in transition-based dependency parsing and for transition systems more generally in computer science, a *transition sequence* is a sequence of *configurations*, not a sequence of *transitions*. We will later introduce the term *transition chain* for the corresponding sequence of *transitions*. We realize that these terms are potentially confusing but prefer not to deviate from the standard terminology.

It is worth noting that the assumption that Left-Arc and Right-Arc only apply to configurations where the new arc is not already in the arc set *A* could be formally stated by restricting these transitions to the sets *LA* = {(σ|*i*, *j*|β, *A*) | (*j*, *i*) ∉ *A*} (for Left-Arc) and *RA* = {(σ|*i*, *j*|β, *A*) | (*i*, *j*) ∉ *A*} (for Right-Arc). Because these restrictions are part of the definition of the elementary transitions themselves, however, we prefer to leave it implicit notationwise.

Additionally, it is also possible to define the divisible transition system framework itself in such a way that the Left-Arc and Right-Arc elementary transitions themselves act upon the two topmost stack nodes, rather than on the topmost stack node and first buffer node. Although this definition can capture exactly the same set of parsers as the one we are using and makes it more natural to describe the mentioned arc-standard variant, we have not used it because it significantly complicates the definition of other algorithms, such as the arc-eager or 1-planar parsers.

This assumes that at least one arc was created to or from node *n* + 1 (i.e., that *e* > 0). In the case where *e* = 0, it is trivial to show that all nodes not covered by arcs in *G*_{n + 1} are in *σ*, because in that case no Reduce transitions are applied at all after *C*_{n}.

Strictly speaking, for undirected cycles using the techniques of path compression and union by rank for disjoint sets, the amortized time per operation is *O*(*α*(*n*)), where *n* is the number of nodes and *α*(*n*) is the inverse of the Ackermann function, which means that *α*(*n*) is less than 5 for all remotely practical values of *n* and is effectively a small constant (Cormen, Leiserson, and Rivest 1990).

This is true in particular if dependency graphs are restricted to trees that have their roots at the periphery, as in Figure 1, in which case the two notions become equivalent.

The test for multiplanarity and the 2-planar parser have previously been described in Gómez-Rodríguez and Nivre (2010).

Note that this does not necessarily imply that the problem of determining whether a graph is *k*-planar is also NP-complete, because there might be polynomial algorithms that solve it without involving a reduction to the *k*-coloring problem.

This is assuming that arcs are created to or from node *n* + 1 in both planes (i.e., that *e* > 0 and *f* > 0), but the cases where *e* = 0 or *f* = 0 are trivial, because in those cases no new Reduce transitions are applied to the respective stacks after *C*_{n}.

This follows from Lemma 4, because a 2-planar graph can be broken up into two planes, each of which is a planar graph with *n* nodes. Moreover, if the Single-Head and Acyclicity constraints are used, the maximum number of arcs is *n* − 1, because every node can have at most one incoming arc and there must be at least one root.

When this happens, there is by necessity an error elsewhere in the parser output, because the projectivity of the arc implies that at least one gold standard arc must be missing.

## References

## Author notes

Departamento de Computación, Universidade da Coruña, Facultad de Informática, Campus de Elviña s/n, 15071 A Coruña, Spain. E-mail: cgomezr@udc.es.

Department of Linguistics and Philology, Uppsala University, Box 635, 75126 Uppsala, Sweden. E-mail: joakim.nivre@lingfil.uu.se.