## Abstract

Derivations under different grammar formalisms allow extraction of various dependency structures. Particularly, bilexical deep dependency structures beyond surface tree representation can be derived from linguistic analysis grounded by CCG, LFG, and HPSG. Traditionally, these dependency structures are obtained as a by-product of grammar-guided parsers. In this article, we study the alternative data-driven, transition-based approach, which has achieved great success for tree parsing, to build general dependency graphs. We integrate existing tree parsing techniques and present two new transition systems that can generate arbitrary directed graphs in an incremental manner. Statistical parsers that are competitive in both accuracy and efficiency can be built upon these transition systems. Furthermore, the heterogeneous design of transition systems yields diversity of the corresponding parsing models and thus greatly benefits parser ensemble. Concerning the disambiguation problem, we introduce two new techniques, namely, transition combination and tree approximation, to improve parsing quality. Transition combination makes every action performed by a parser significantly change configurations. Therefore, more distinct features can be extracted for statistical disambiguation. With the same goal of extracting informative features, tree approximation induces tree backbones from dependency graphs and re-uses tree parsing techniques to produce tree-related features. We conduct experiments on CCG-grounded functor–argument analysis, LFG-grounded grammatical relation analysis, and HPSG-grounded semantic dependency analysis for English and Chinese. Experiments demonstrate that data-driven models with appropriate transition systems can produce high-quality deep dependency analysis, comparable to more complex grammar-driven models. 
Experiments also indicate the effectiveness of the heterogeneous design of transition systems for parser ensemble, transition combination, as well as tree approximation for statistical disambiguation.

## 1. Introduction

The derivations licensed by a grammar under deep grammar formalisms, for example, combinatory categorial grammar (CCG; Steedman 2000), lexical-functional grammar (LFG; Bresnan and Kaplan 1982) and head-driven phrase structure grammar (HPSG; Pollard and Sag 1994), are able to produce rich linguistic information encoded as bilexical dependencies. Under CCG, this is done by relating the lexical heads of functor categories and their arguments (Clark, Hockenmaier, and Steedman 2002). Under LFG, bilexical grammatical relations can be easily derived as the backbone of F-structures (Sun et al. 2014). Under HPSG, predicate–argument structures (Miyao, Ninomiya, and Tsujii 2004) or reduction of minimal recursion semantics (Ivanova et al. 2012) can be extracted from typed feature structures corresponding to whole sentences. Dependency analysis grounded in deep grammar formalisms usually goes beyond tree representations and is well suited for producing meaning representations. Figure 1 is an example from CCGBank. The deep dependency graph conveniently represents more semantically motivated information than the surface tree. For instance, it directly captures the *Agent–Predicate* relations between the word “people” and the conjuncts “fight,” “eat,” and “drink.”

Automatically building deep dependency structures is desirable for many practical NLP applications, for example, information extraction (Miyao et al. 2008) and question answering (Reddy, Lapata, and Steedman 2014). Traditionally, deep dependency graphs are generated as a by-product of grammar-guided parsers. The challenge is that a deep-grammar-guided parsing model usually cannot produce full coverage and the time complexity of the corresponding parsing algorithms is very high. Previous work on data-driven dependency parsing mainly focused on tree-shaped representations. Nevertheless, recent work has shown that a data-driven approach is also applicable to generating more general linguistic graphs. Sagae and Tsujii (2008) present an initial study on applying transition-based methods to generate HPSG-style predicate–argument structures, and obtain competitive results. Furthermore, Titov et al. (2009) and Henderson et al. (2013) have shown that graphs more general than planar ones can be produced by augmenting existing transition systems.

This work follows early encouraging research and studies transition-based approaches to construct deep dependency graphs. The computational challenge to incremental graph spanning is the existence of a large number of crossing arcs in deep dependency analysis. To tackle this problem, we integrate insightful ideas, especially the ones illustrated in Nivre (2009) and Gómez-Rodríguez and Nivre (2010), developed in the tree spanning scenario, and design two new transition systems, both of which are able to produce arbitrary directed graphs. In particular, we explore two techniques to localize transition actions to maximize the effect of a greedy search procedure. In this way, the corresponding parsers for generating linguistically motivated bilexical graphs can process sentences in close to linear time with respect to the number of input words. This efficiency advantage allows deep linguistic processing for very-large-scale text data.

For syntactic parsing, ensembled methods have been shown to be very helpful in boosting accuracy (Sagae and Lavie 2006; Zhang et al. 2009; McDonald and Nivre 2011). In particular, Surdeanu and Manning (2010) presented a nice comparative study on various ensemble models for dependency tree parsing. They found that the diversity of base parsers is more important than complex ensemble models for learning. Motivated by this observation, the authors proposed a hybrid transition-based parser that achieved state-of-the-art performance by combining complementary prediction powers of different transition systems. One advantage of their architecture is the linear-time decoding complexity, given that all base models run in linear-time. Another concern of our work is about the model diversity obtained by the heterogeneous design of transition systems for general graph spanning. Empirical evaluation indicates that statistical parsers built upon our new transition systems as well as the existing best transition system—namely, Titov et al. (2009)'s system (thmm, hereafter)—exhibit complementary parsing strengths, which benefit system combination. In order to take advantage of this model diversity, we propose a simple yet effective ensemble model to build a better hybrid system.

We implement statistical parsers using the structured perceptron algorithm (Collins 2002) for transition classification and use a beam decoder for global inference. Concerning the disambiguation problem, we introduce two new techniques, namely, transition combination and tree approximation, to improve parsing quality. To increase system coverage, the Arc transitions designed by the thmm as well as our systems do not change the nodes in the stack nor buffer in a configuration: Only the nodes linked to the top of the stack or buffer are modified. Therefore, features derived from the configurations before and after an Arc transition are not distinct enough to train a good classifier. To deal with this problem, we propose the transition combination technique and three algorithms to derive oracles for modified transition systems. When we apply our models to semantics-oriented deep dependency structures, for example, CCG-grounded functor–argument analysis and HPSG-grounded reduced minimal recursion semantics (MRS; Copestake et al. 2005) analysis, we find that syntactic trees can provide very helpful features. In case the syntactic information is not available, we introduce a tree approximation technique to induce tree backbones from deep dependency graphs. Such tree backbones can be utilized to train a tree parser which provides pseudo tree features.

To evaluate transition-based models for deep dependency parsing, we conduct experiments on CCG-grounded functor–argument analysis (Hockenmaier and Steedman 2007; Tse and Curran 2010), LFG-grounded grammatical relation analysis (Sun et al. 2014), and HPSG-grounded semantic dependency analysis (Miyao, Ninomiya, and Tsujii 2004; Ivanova et al. 2012) for English and Chinese. Empirical evaluation indicates some non-obvious facts:

1. Data-driven models with appropriate transition systems and disambiguation techniques can produce high-quality deep dependency analysis, comparable to more complex grammar-driven models.

2. Parsers built upon heterogeneous transition systems and decoding orders have complementary prediction strengths, and the parsing quality can be significantly improved by system combination; compared to the best individual system, system combination gets an absolute labeled F-score improvement of 1.21 on average.

3. Transition combination significantly improves parsing accuracy on a wide range of conditions, resulting in an absolute labeled F-score improvement of 0.74 on average.

4. Pseudo trees contribute to semantic dependency parsing (SDP) as much as syntactic trees do, and result in an absolute labeled F-score improvement of 1.27 on average.

We compare our parser with representative state-of-the-art parsers (Miyao and Tsujii 2008; Auli and Lopez 2011b; Martins and Almeida 2014; Xu, Clark, and Zhang 2014; Du, Sun, and Wan 2015) with respect to different architectures. To evaluate the impact of grammatical knowledge, we compare our parser with parsers guided by treebank-induced HPSG and CCG grammars. Both of our individual and ensembled parsers achieve equivalent accuracy to HPSG and CCG chart parsers (Miyao and Tsujii 2008; Auli and Lopez 2011b), and outperform a shift-reduce CCG parser (Xu, Clark, and Zhang 2014). It is worth noting that our parsers exclude all syntactic and grammatical information. In other words, strictly less information is used. This result demonstrates the effectiveness of data-driven approaches to the deep linguistic processing problem. Compared to other types of data-driven parsers, our individual parser achieves equivalent performance to and our hybrid parser obtains slightly better results than factorization parsers based on dual decomposition (Martins and Almeida 2014; Du, Sun, and Wan 2015). This result highlights the effectiveness of the lightweight, transition-based approach.

Parsers based on the two new transition systems have been utilized as base components for parser ensemble (Du et al. 2014) for SemEval 2014 Task 8 (Oepen et al. 2014). Our hybrid system obtained the best overall performance in the closed track of this shared task. In this article, we re-implement all models, calibrate features more carefully, and thus obtain improved accuracy. The idea of extracting a tree-shaped backbone from a deep dependency graph has also been used to design other types of parsing models in our early work (Du et al. 2014, 2015; Du, Sun, and Wan 2015). Nevertheless, the idea of training a pseudo tree parser to serve a transition-based graph parser is new.

The implementation of our parser is available at http://www.icst.pku.edu.cn/lcwm/grass.

## 2. Transition Systems for Graph Spanning

### 2.1 Background Notations

A dependency graph *G* = (*V*, *A*) is a labeled directed graph, such that for sentence *x* = *w*_{1} , … , *w*_{n} the following holds:

1. *V* = {0, 1, 2, … , *n*},
2. *A* ⊆ *V* × *R* × *V*.

*V* consists of *n* + 1 nodes, each of which is represented by a single integer. In particular, 0 represents a virtual root node *w*_{0}, and all others correspond to words in *x*. The arc set *A* represents the labeled dependency relations of the particular analysis *G*. Specifically, an arc (*i*, *r*, *j*) ∈ *A* represents a dependency relation *r* from head *w*_{i} to dependent *w*_{j}. A dependency graph *G* is thus a set of labeled dependency relations between the root and the words of *x*. To simplify the description in this section, we mainly consider unlabeled parsing and assume the relation set *R* is a singleton. Put another way, we assume *A* ⊆ *V* × *V*. It is straightforward to adapt the discussions in this article for labeled parsing. To do so, we can parameterize transitions with possible dependency relations. For empirical evaluation as discussed in Section 5, we will test both labeled and unlabeled parsing models.
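To make the definition concrete, the following minimal Python sketch stores a dependency graph as a set of (head, relation, dependent) triples over the nodes 0, … , *n*. The class name `DependencyGraph`, its methods, and the relation label `arg1` are illustrative assumptions, not part of the implementation described in this article.

```python
from dataclasses import dataclass, field


@dataclass
class DependencyGraph:
    """Labeled directed graph G = (V, A) over a sentence of n words.

    Nodes are the integers 0..n, where 0 is the virtual root.
    Class and method names are illustrative, not taken from the paper.
    """
    n: int
    arcs: set = field(default_factory=set)  # triples (head, relation, dependent)

    def add_arc(self, head: int, relation: str, dependent: int) -> None:
        for node in (head, dependent):
            if not 0 <= node <= self.n:
                raise ValueError(f"node {node} is outside 0..{self.n}")
        self.arcs.add((head, relation, dependent))

    def unlabeled(self) -> set:
        """Project the labeled arcs onto V x V (the unlabeled setting)."""
        return {(h, d) for h, _, d in self.arcs}

    def heads_of(self, dependent: int) -> list:
        """All heads of a node; unlike in a tree, there may be several."""
        return sorted(h for h, _, d in self.arcs if d == dependent)


# Toy example: word 1 is an argument of three different predicates.
g = DependencyGraph(n=4)
for predicate in (2, 3, 4):
    g.add_arc(predicate, "arg1", 1)
```

A node with several heads, as in the toy example, is exactly what separates these graphs from dependency trees.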

Following Nivre (2008), we define a transition system for dependency parsing as a quadruple *S* = (*C*, *T*, *c*_{s}, *C*_{t}), where

1. *C* is a set of configurations, each of which contains a buffer *β* of (remaining) words and a set *A* of dependency arcs,
2. *T* is a set of transitions, each of which is a (partial) function *t* : *C* → *C*,
3. *c*_{s} is an initialization function, mapping a sentence *x* to a configuration with *β* = [1, … , *n*],
4. *C*_{t} ⊆ *C* is a set of terminal configurations.

Given a sentence *x* = *w*_{1}, … , *w*_{n} and a graph *G* = (*V*, *A*) on it, if there is a sequence of transitions *t*_{1}, … , *t*_{m} and a sequence of configurations *c*_{0}, … , *c*_{m} such that *c*_{0} = *c*_{s}(*x*), *t*_{i}(*c*_{i−1}) = *c*_{i} (*i* = 1, … , *m*), *c*_{m} is a terminal configuration, and *A*_{cm} = *A*, we say the sequence of transitions is an **oracle** sequence. We define *Ā*_{ci} = *A* − *A*_{ci} as the set of arcs that remain to be built at *c*_{i}. We denote a transition sequence as either t_{1,m} or c_{0,m}.

In a typical transition-based parsing process, the input words are put into a queue and partially built structures are organized by a stack. A set of Shift/Reduce actions are performed sequentially to consume words from the queue and update the partial parsing results organized by the stack. Our new systems designed for deep parsing differ with respect to their information structures to define a configuration and the behaviors of transition actions.

### 2.2 Naive Spanning and Locality

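For each pair of words *i* and *j*, naive spanning chooses between two kinds of actions: 1) adding an arc (*i*, *j*) or (*j*, *i*) and 2) adding no arc at all. In this way, the algorithm builds a graph by incrementally trying to link every pair of words.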

The complexity of naive spanning is *Θ*(*n*^{2}),^{1} because it does nothing to exploit the topological properties of a linguistic structure. In other words, the naive graph-spanning idea does not take full advantage of the **greedy** search of the transition-based parsing architecture. In contrast, a well-designed transition system for (projective) tree parsing can decode in linear time by exploiting locality among subtrees. Take the arc-eager system presented in Nivre (2008), for example: Only the nodes at the top of the stack and the buffer are allowed to be linked. This limitation is the key to implementing a linear-time decoder. In the following, we introduce two ideas to localize a transition action, that is, to allow a transition to manipulate only the frontier items in the data structures of a configuration. By this means, we can decrease the number of possible transitions for each configuration and thus minimize the total decoding time.
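The quadratic behavior of naive spanning can be illustrated with a short sketch. It assumes a hypothetical `decide(i, j)` callback, standing in for a classifier or an oracle, that returns the subset of {(*i*, *j*), (*j*, *i*)} to add for each pair; the function name and the step counter are illustrative only.

```python
def naive_spanning(n, decide):
    """Visit every node pair once and let decide(i, j) choose the arcs.

    decide is a hypothetical stand-in for a classifier or an oracle: it
    returns a subset of {(i, j), (j, i)}; the empty set means "no arc".
    """
    arcs, steps = set(), 0
    for j in range(1, n + 1):
        for i in range(j):  # nodes 0..j-1, including the virtual root 0
            arcs |= set(decide(i, j))
            steps += 1
    return arcs, steps


# Oracle-style usage: answer each pair by looking it up in a gold arc set.
gold = {(0, 2), (2, 1), (2, 3), (3, 1)}
arcs, steps = naive_spanning(3, lambda i, j: {(i, j), (j, i)} & gold)
```

With *n* words plus the root, the loop always makes *n*(*n* + 1)/2 decisions, regardless of how sparse the gold graph is.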

### 2.3 System 1: Online Re-ordering

The online re-ordering approach that we explore is to provide the system with the ability to re-order the nodes during parsing in an online fashion. The key idea, as introduced in Titov et al. (2009) and Nivre (2009), is to allow a SWAP transition that switches the position of the two topmost nodes on the stack. By changing the linear order of words, the system is able to build crossing arcs for graph spanning. We refer to this approach as **online re-ordering**. We introduce a stack-based transition system with online re-ordering for deep dependency parsing. The obtained oracle parser is complete with respect to the class of all directed graphs without self-loops.

#### 2.3.1 The System

We define a transition system *S*_{S} = (*C*, *T*, *c*_{s}, *C*_{t}), where a configuration *c* = (*σ*, *β*, *A*) ∈ *C* contains a stack *σ* of nodes, besides *β* and *A*. We set the initial configuration for a sentence *x* = *w*_{1}, … , *w*_{n} to be *c*_{s}(*x*) = ([], [1, … , *n*], {}), and take *C*_{t} to be the set of all configurations of the form *c*_{t} = (*σ*, [], *A*) (for any *σ* and any *A*). The transitions are shown in Figure 2 and explained as follows.

- Shift (sh) removes the front from the buffer and pushes it onto the stack.
- Left/Right-Arc (la/ra) updates a configuration by adding (*j*, *i*)/(*i*, *j*) to *A*, where *i* is the top of the stack and *j* is the front of the buffer.
- Pop (pop) updates a configuration by popping the top of the stack.
- Swap (sw) updates a configuration with stack *σ*|*i*|*j* by moving *i* back to the buffer.

A variation of transition Swap is Swap_{T}, which updates the configuration by swapping *i* and *j*. However, the system of this variation is not complete with respect to directed graphs because the power of transition Swap_{T} is limited, and counterexamples of completeness can be found. For more theoretical discussion about this system (i.e., thmm), see Titov et al. (2009). We also denote Titov et al. (2009)'s system as *S*_{T}.
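The transitions of the online re-ordering system can be sketched as pure functions over a configuration (*σ*, *β*, *A*). This is a minimal, unlabeled illustration: the function names, the tuple encoding of configurations, and the `run` helper are assumptions made for the example, and preconditions (non-empty stack or buffer) are not checked.

```python
def shift(c):
    stack, buf, arcs = c
    return stack + [buf[0]], buf[1:], arcs


def left_arc(c):
    stack, buf, arcs = c  # adds (j, i): buffer front j heads stack top i
    return stack, buf, arcs | {(buf[0], stack[-1])}


def right_arc(c):
    stack, buf, arcs = c  # adds (i, j): stack top i heads buffer front j
    return stack, buf, arcs | {(stack[-1], buf[0])}


def pop(c):
    stack, buf, arcs = c
    return stack[:-1], buf, arcs


def swap(c):
    stack, buf, arcs = c  # stack sigma|i|j: i moves back to the buffer front
    return stack[:-2] + stack[-1:], [stack[-2]] + buf, arcs


TRANSITIONS = {"sh": shift, "la": left_arc, "ra": right_arc,
               "pop": pop, "sw": swap}


def run(n, names):
    """Apply a transition sequence to the initial configuration c_s(x)."""
    c = ([], list(range(1, n + 1)), frozenset())
    for name in names:
        c = TRANSITIONS[name](c)
    return c


# The crossing arcs 1 -> 3 and 2 -> 4 need a single Swap.
stack, buf, arcs = run(4, ["sh", "sh", "sw", "sh", "ra", "pop",
                           "sh", "pop", "ra", "pop", "sh"])
```

In the usage example, Swap moves node 1 back to the buffer so that it can be re-shifted above node 2 and linked to node 3 before node 2 is linked to node 4.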

#### 2.3.2 Theoretical Analysis

The soundness of *S*_{S} is trivial. To demonstrate the completeness of the system, we give a constructive proof that can derive oracle transitions for any arbitrary graph. To simplify the description, the labels attached to transitions are not considered. The idea is inspired by Titov et al. (2009). Given a sentence *x* = *w*_{1}, … , *w*_{n} and a graph *G* = (*V*, *A*) on it, we start with the initial configuration *c*_{0} = *c*_{s}(*x*) and compute the oracle transitions step by step. On the *i*-th step, let *p* be the top of *σ*_{ci−1}, *b* be the front of *β*_{ci−1}; let *L*(*j*) be the ordered list of nodes connected to *j* in *Ā*_{ci−1} for any node *j* ∈ *σ*_{ci−1}; let *L*(*σ*_{ci−1}) = [*L*(*j*_{0}), … , *L*(*j*_{l})] if *σ*_{ci−1} = [*j*_{l}, … , *j*_{0}].

The oracle transition for each configuration is derived as follows. If there is no arc linked to *p* in *Ā*_{ci−1}, then we set *t*_{i} to pop; if there exists *a* ∈ *Ā*_{ci−1} linking *p* and *b*, then we set *t*_{i} to la or ra correspondingly. When there are only sh and sw left, we see if there is any node *q* under the top of *σ*_{ci−1} such that *L*(*q*) precedes *L*(*p*) by the lexicographical order. If so, we set *t*_{i} to sw; else we set *t*_{i} to sh. An example for when to do sw is shown in Figure 3. Let *c*_{i} = *t*_{i}(*c*_{i−1}); we continue to compute *t*_{i+1}, until *β*_{ci} is empty.
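The oracle construction above can be sketched in a few lines of Python. The sketch makes several assumptions that the text does not fix: arcs are unlabeled (head, dependent) pairs over nodes 1, … , *n* without self-loops, *L*(*j*) is ordered by current buffer position, and the function and variable names are illustrative.

```python
def oracle_reordering(n, gold_arcs):
    """Oracle transitions for the online re-ordering system (unlabeled).

    Assumes (head, dependent) pairs over nodes 1..n and no self-loops;
    L(j) is ordered by current buffer position (an assumption).
    Returns the transition sequence and the set of arcs it builds.
    """
    stack, buf = [], list(range(1, n + 1))
    remaining = set(gold_arcs)  # arcs still to be built
    built, transitions = set(), []

    def links(j):
        # L(j): buffer positions of the nodes still to be linked with j.
        # Relies on Lemma 2: stack nodes only have pending arcs to the buffer.
        position = {node: k for k, node in enumerate(buf)}
        return sorted(position[d if h == j else h]
                      for h, d in remaining if j in (h, d))

    while buf:
        b = buf[0]
        p = stack[-1] if stack else None
        if p is None:
            action = "sh"
        elif not any(p in arc for arc in remaining):
            action = "pop"  # nothing left to attach to the stack top
        elif (b, p) in remaining:
            action = "la"
        elif (p, b) in remaining:
            action = "ra"
        elif any(links(q) < links(p) for q in stack[:-1]):
            action = "sw"  # a deeper node must be linked before p
        else:
            action = "sh"

        if action == "sh":
            stack.append(buf.pop(0))
        elif action == "pop":
            stack.pop()
        elif action == "sw":
            buf.insert(0, stack.pop(-2))
        else:
            arc = (b, p) if action == "la" else (p, b)
            remaining.discard(arc)
            built.add(arc)
        transitions.append(action)
    return transitions, built


sequence, built = oracle_reordering(4, {(1, 3), (2, 4)})
```

Python compares lists lexicographically, so `links(q) < links(p)` directly expresses the condition that *L*(*q*) strictly precedes *L*(*p*).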

**Lemma 1**

If *t*_{i} is sh, the list [*L*(*j*_{0}), … , *L*(*j*_{l})] is completely ordered by lexicographical order.

**Proof**

It cannot be the case that for some *u* > 0, *L*(*j*_{u}) strictly precedes *L*(*j*_{0}), otherwise *t*_{i} should be sw. It also cannot be the case that for some *u* > *v* > 0, *L*(*j*_{u}) strictly precedes *L*(*j*_{v}), because when *j*_{v−1} is shifted onto the stack, *L*(*j*_{v}) precedes *L*(*j*_{u}) and all the transitions do not change *L*(*j*_{v}) and *L*(*j*_{u}) afterwards. ∎

**Lemma 2**

For *i* = 0, … , *m*, there is no arc (*j*, *k*) ∈ *Ā*_{ci} such that *j*, *k* ∈ *σ*_{ci}.

**Proof**

When *j* ∈ *σ*_{ci} is shifted onto the stack by the *w*-th transition *t*_{w}, there must be no arc (*j*, *k*) or (*k*, *j*) in *Ā*_{cw} such that *k* ∈ *σ*_{cw}. Otherwise, by induction every node in *σ*_{cw−1} can only link to nodes in *β*_{cw−1}, which implies that *L*(*k*) has one of the smallest lexicographical orders, and from Lemma 1 the top of *σ*_{cw−1} must be linked to *j*. And not sh, but la or ra should be applied. ∎

**Theorem 1**

*t*_{1}, … , *t*_{m} is an oracle sequence of transitions for *G*.

**Proof**

From Lemma 2, we can infer that *Ā*_{cm} = ∅, so it suffices to show that the sequence of transitions is always finite. We define a **swap sequence** to be a subsequence *t*_{i}, … , *t*_{j} such that *t*_{i} and *t*_{j} are sw, *t*_{i−1} and *t*_{j+1} are not sw, and a **shift sequence** similarly. It can be seen that a swap sequence is always followed by a shift sequence, the length of which is no less than that of the swap sequence, and if the two sequences are of the same length, the next transition cannot be sw. Let #(*t*) be the number of transitions of type *t* in the sequence; then #(la), #(ra), #(pop), and #(sh) − #(sw) are all finite. Therefore the number of swap sequences is finite, indicating that the transition sequence is finite. ∎

### 2.4 System 2: Two-Stack–Based System

A majority of transition systems organize partial parsing results with a stack. Classical parsers, including arc-standard and arc-eager ones, add dependency arcs only between nodes that are adjacent on the stack or the buffer. A natural idea to produce crossing arcs is to temporarily move nodes that block non-adjacent nodes to an extra memory module, like the two-stack–based system for two-planar graphs (Gómez-Rodríguez and Nivre 2010) and the list-based system (Nivre 2008). In this article, we design a new transition system to handle crossing arcs by using two stacks. This system is also complete with respect to the class of directed graphs without self-loops.

#### 2.4.1 The System

We define the two-stack–based transition system *S*_{2S} = (*C*, *T*, *c*_{s}, *C*_{t}), where a configuration *c* = (*σ*, *σ*′, *β*, *A*) ∈ *C* contains a primary stack *σ* and a secondary stack *σ*′. We set *c*_{s}(*x*) = ([], [], [1, … , *n*], {}) for the sentence *x* = *w*_{1}, … , *w*_{n}, and we take *C*_{t} to be the set of all configurations with empty buffers. The transition set *T* contains six types of transitions, as shown in Figure 4. We only explain Mem and Recall:

- Mem (mem) pops the top element from the primary stack and pushes it onto the secondary stack.
- Recall (rc) moves the top element of the secondary stack back to the primary stack.

#### 2.4.2 Theoretical Analysis

The soundness of this system is trivial, and the completeness is also straightforward after we give the construction of an oracle transition sequence for an arbitrary graph. The oracle is computed as follows on the *i*-th step: We do la, ra, and pop transitions just like in Section 2.3.2. After that, let *b* be the front of *β*_{ci−1}; we see if there is a node *j* ∈ *σ*_{ci−1} or *j* ∈ *σ*′_{ci−1} linked to *b* by an arc in *Ā*_{ci−1}. If *j* ∈ *σ*_{ci−1}, then we do a sequence of mem to make *j* the top of *σ*_{ci−1}; if *j* ∈ *σ*′_{ci−1}, then we do a sequence of rc to make *j* the top of *σ*_{ci−1}. When no node in *σ*_{ci−1} or *σ*′_{ci−1} is linked to *b*, we do sh.
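The oracle for the two-stack–based system can be sketched along the same lines. The sketch assumes unlabeled (head, dependent) arcs over nodes 1, … , *n* without self-loops, assumes that la and ra link the top of the primary stack with the front of the buffer, and prefers Mem over Recall when both stacks hold a node linked to *b*; these choices and all names are illustrative.

```python
def oracle_two_stack(n, gold_arcs):
    """Oracle transitions for the two-stack system (unlabeled sketch).

    Assumes (head, dependent) pairs over nodes 1..n and no self-loops.
    Returns the transition sequence and the set of arcs it builds.
    """
    stack, stack2, buf = [], [], list(range(1, n + 1))
    remaining = set(gold_arcs)  # arcs still to be built
    built, transitions = set(), []

    while buf:
        b = buf[0]
        p = stack[-1] if stack else None

        def linked(j):
            return (j, b) in remaining or (b, j) in remaining

        if p is not None and not any(p in arc for arc in remaining):
            action = "pop"
        elif p is not None and (b, p) in remaining:
            action = "la"
        elif p is not None and (p, b) in remaining:
            action = "ra"
        elif any(linked(j) for j in stack):
            action = "mem"  # uncover a deeper primary-stack node
        elif any(linked(j) for j in stack2):
            action = "rc"   # bring a memorized node back
        else:
            action = "sh"

        if action == "sh":
            stack.append(buf.pop(0))
        elif action == "pop":
            stack.pop()
        elif action == "mem":
            stack2.append(stack.pop())
        elif action == "rc":
            stack.append(stack2.pop())
        else:
            arc = (b, p) if action == "la" else (p, b)
            remaining.discard(arc)
            built.add(arc)
        transitions.append(action)
    return transitions, built


sequence, built = oracle_two_stack(4, {(1, 3), (2, 4)})
```

On the crossing arcs of the usage example, node 2 is parked on the secondary stack while node 1 is linked to node 3, and it is recalled once node 4 reaches the front of the buffer.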

**Theorem 2**

*S*_{2S} is complete with respect to directed graphs without self-loops.

**Proof**

The completeness immediately follows the fact that the computed oracle sequence is finite, and every time a node is shifted onto *σ*_{ci}, no arc in *Ā*_{ci} links nodes in *σ*_{ci} to the shifted node. ∎

#### 2.4.3 Related Systems

Gómez-Rodríguez and Nivre (2010, 2013) introduced a two-stack–based transition system for tree parsing. Their study is motivated by the observation that the majority of dependency trees in various treebanks are actually planar or two-planar graphs. Accordingly, their algorithm is specially designed to handle projective trees and two-planar trees, but not all graphs. Because many more crossing arcs exist in deep dependency structures and more sentences are assigned graphs that are neither planar nor two-planar, their strategy of utilizing two stacks is not suitable for the deep dependency parsing problem. Different from their system, our new system maximizes the utility of two memory modules and is able to handle any directed graph.

The list-based systems, such as the basic one introduced by Nivre (2008) and the extended one introduced by Choi and Palmer (2011), also use two memory modules. The function of the secondary memory module of their systems and ours is very different. In our design, only nodes involved in a subgraph that contains crossing arcs may be put into the second stack. In the existing list-based systems, both lists are heavily used, and nodes may be transferred between them many times. The function of the two lists is to simulate one memory module that allows accessing any unit in it.

### 2.5 Extension

#### 2.5.1 Graphs with Loops

It is easy to extend our system to generate arbitrary directed graphs by adding a new transition:

- Self-Arc adds an arc from the top element of the primary memory module (*σ*) to itself, but does not update any stack or buffer.

**Theorem 3**

*S*_{S} and *S*_{2S} augmented with Self-Arc are complete with respect to directed graphs.

#### 2.5.2 Labeled Parsing and Supertagging

It is also straightforward to adapt the two transition systems for labeled dependency graph generation. To do so, we can parameterize Left-Arc and Right-Arc transitions with dependency relations. For example, a parameterized transition Left-Arc_{r} tells the system not only that there is an arc between the frontier node of the stack and the frontier node of the buffer but also that this arc holds a relation *r*. Some linguistic representations assign labels to nodes as well. When a deep grammar is considered to license the representation, node labels are usually called “supertags.” To assign supertags to words, namely, nodes in a dependency graph, we can parameterize the Shift transition with tag labels.

## 3. Statistical Disambiguation

### 3.1 Transition Classification

A transition-based parser approximates an oracle for a transition system *S* that maps a configuration *c* to a transition *t* that is defined on *c*. More formally, a transition-based statistical parser tries to find the transition sequence c_{0,m} that maximizes the following score:

$$\mathrm{Score}(c_{0,m}) = \sum_{i=0}^{m-1} \mathrm{Score}(c_i, t_{i+1})$$

Following the state-of-the-art discriminative disambiguation technique for data-driven parsing, we define the score function as a linear combination of features defined over a configuration and a transition, as follows:

$$\mathrm{Score}(c_i, t_{i+1}) = \theta^{\top} \phi(c_i, t_{i+1})$$

where *ϕ* defines a vector for each configuration–transition pair and *θ* is the weight vector for linear combination.
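As a small illustration of the linear model, the sketch below scores a transition sequence with sparse feature vectors stored as dictionaries. The feature names and the `extract` function are hypothetical and far simpler than the feature templates used in this article.

```python
def score_transition(theta, phi):
    """Inner product of the weight vector and a sparse feature vector."""
    return sum(theta.get(name, 0.0) * value for name, value in phi.items())


def score_sequence(theta, extract, configs, transitions):
    """Sum the scores of (c_i, t_{i+1}) over all steps of a sequence."""
    return sum(score_transition(theta, extract(c, t))
               for c, t in zip(configs, transitions))


def extract(config, transition):
    """Hypothetical feature function phi(c, t) over (stack, buffer)."""
    stack, buf = config
    front = buf[0] if buf else None
    return {f"t={transition}": 1.0,
            f"b0={front}|t={transition}": 1.0}


theta = {"t=sh": 0.5, "b0=2|t=sh": 2.0}
configs = [([], [1, 2]), ([1], [2])]  # c_0 and c_1
total = score_sequence(theta, extract, configs, ["sh", "sh"])
```

Features that never received a weight contribute zero, which is why `theta.get(name, 0.0)` is used rather than direct indexing.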

Exact calculation of the maximization is extremely hard without any assumption about *ϕ*. Even with a proper *ϕ* for real-world parsing, exact decoding is still impractical for most practical feature designs. In this article, we follow the recent success of using beam search for approximate decoding. During parsing, the parser keeps track of a fixed number of partial outputs to avoid making decisions too early. Training a parser in the discriminative setting corresponds to estimating *θ* associated with rich features. Previous research on dependency parsing shows that the structured perceptron (Collins 2002; Collins and Roark 2004) is one of the strongest learning algorithms. In all experiments, we use the averaged perceptron algorithm with early update to estimate parameters. The whole parser is very similar to the transition-based system introduced in Zhang and Clark (2008, 2011b).
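Beam-search decoding can be sketched generically as follows. The callbacks `is_terminal`, `legal`, `apply`, and `score` are hypothetical placeholders for the components of a concrete transition system and model; the toy problem at the end only shows why keeping more than one hypothesis can matter, and it is not a parser.

```python
def beam_search(initial, is_terminal, legal, apply, score, beam_size):
    """Keep the beam_size best partial transition sequences at each step.

    All callbacks are hypothetical placeholders: legal(c) lists the
    transitions defined on c, apply(c, t) returns the next configuration,
    and score(c, t) is the model score of taking t in c.
    """
    beam = [(0.0, initial, [])]  # (total score, configuration, history)
    while not all(is_terminal(c) for _, c, _ in beam):
        candidates = []
        for total, c, history in beam:
            if is_terminal(c):
                candidates.append((total, c, history))
                continue
            for t in legal(c):
                candidates.append((total + score(c, t),
                                   apply(c, t), history + [t]))
        candidates.sort(key=lambda item: item[0], reverse=True)
        beam = candidates[:beam_size]
    return max(beam, key=lambda item: item[0])


# Toy problem: a "configuration" is the string of choices made so far.
def toy_score(c, t):
    if c == "":
        return 1.0 if t == "a" else 0.0  # "a" looks better at first...
    if c == "b":
        return 5.0 if t == "a" else 0.0  # ...but "b" pays off later
    return 0.0


def decode(beam_size):
    return beam_search("", lambda c: len(c) == 2, lambda c: "ab",
                       lambda c, t: c + t, toy_score, beam_size)


greedy_total, _, _ = decode(1)
beam_total, _, beam_history = decode(2)
```

With a beam of size 1 the search commits to the locally best first choice, whereas a beam of size 2 keeps the alternative alive long enough to find the higher-scoring sequence.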

### 3.2 Transition Combination

In thmm, *S*_{S}, and *S*_{2S} alike, the Left/Right-Arc transition does not modify either the stack or the buffer. Only new edges are added to the target graph. When automatic classifiers are utilized to approximate an oracle, a majority of the features for predicting an Arc transition overlap with the features for the successive transition. Empirically, this property significantly decreases the parsing accuracy. A key observation about linguistically motivated bilexical graphs is that there is usually at most one edge between any two words; therefore an Arc transition is not followed by another Arc. As a result, any Arc together with its successive transition substantially modifies a configuration. To practically improve the performance of a statistical parser, we combine every pair of two successive transitions starting with Arc and transform the proposed two transition systems into two modified ones. For example, in our two-stack–based system, after combining, we obtain the transitions presented in Figure 5.
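The effect of transition combination on an oracle sequence can be sketched as a simple rewriting step. The `+` naming of combined transitions is an assumption made for the example, and the sketch also merges two consecutive Arc transitions with the following non-Arc transition, which is only needed for the rare case of two edges between the same pair of words.

```python
ARC_TRANSITIONS = {"la", "ra"}


def combine(transitions):
    """Merge each Arc transition with the transition(s) that follow it.

    Consecutive Arcs are merged together with the next non-Arc
    transition. The "+" naming scheme is illustrative only.
    """
    combined, pending = [], []
    for t in transitions:
        if t in ARC_TRANSITIONS:
            pending.append(t)
        else:
            combined.append("+".join(pending + [t]))
            pending = []
    if pending:  # the sequence ended on an Arc transition
        combined.append("+".join(pending))
    return combined


original = ["sh", "sh", "sw", "sh", "ra", "pop", "sh", "pop", "ra", "pop", "sh"]
merged = combine(original)
```

After merging, every action in the sequence changes the stack or the buffer, so the configurations seen by the classifier before and after each action differ.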

The number of edges between any two words could be at most two in real data. If there are two edges between two words *w*_{a} and *w*_{b}, it must be *w*_{a} → *w*_{b} and *w*_{b} → *w*_{a}. We call these two edges a two-cycle, and call this problem the two-cycle problem. In our combined transitions, a Left/Right-Arc transition should appear before a non-ARC transition. In order to generate two edges between two words, we have two strategies:

- A) Add a new type of transition to each system, which consists of a Left-Arc transition, a Right-Arc transition, and any other non-Arc transition (e.g., Left-Arc-Right-Arc-Recall for *S*_{2S}).
- B) Use a non-directional Arc transition instead of Left/Right-Arc. Here, an Arc transition may add one or two edges depending on its label. In detail, we propose two algorithms, namely, EncodeLabel and DecodeLabel (see Algorithms 1 and 2), to deal with labels for the Arc transition.

In our experiments, strategy B performs better.

First, let us consider accuracy. Generally speaking, it is harder for transition classification if more target transitions are defined. Using strategy A, we should add additional transitions to handle the two-cycle condition. Based on our experiments, the performance decreases when using more transitions.

Considering efficiency, we can save time by only using labels that appear in training data in strategy B. If we have a total of *K* possible labels in training data, they will generate *K*^{2} two-cycle types, but only *k* possible combinations of two-cycle appear in training data (*k* ≪ *K*^{2}). In strategy A, we must add *K*^{2} transitions to deal with all possible two-cycle types, but most of them do not make sense. Using fewer two-cycle types helps us eliminate the invalid calculation and save time effectively.

Using strategy B, we change the original edges' labels and use the Arc(label)–non-Arc transition instead of Left/Right-Arc(label)–non-Arc. An Arc(label)–non-Arc transition should execute the Arc(label) transition first, then execute the non-Arc transition. Arc(label) generates one or two edges depending on its label. We encode not only two-cycle labels but also Left/Right-Arc labels. In practice, we only use those labels that appear in training data. Because labels that do not appear only contribute non-negative weights while training, we can eliminate them without any performance loss.
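One possible way to realize strategy B is sketched below. It is not a reproduction of Algorithms 1 and 2: the pair encoding, the function names, and the relation labels are assumptions chosen only to show how a single non-directional Arc label can carry one or two directed edges.

```python
def encode_label(left_rel=None, right_rel=None):
    """Merge the relation(s) between stack top i and buffer front j.

    left_rel labels the edge j -> i (Left-Arc direction), right_rel the
    edge i -> j (Right-Arc direction); a two-cycle sets both.
    This pair encoding is an illustrative assumption.
    """
    if left_rel is None and right_rel is None:
        raise ValueError("an Arc transition must add at least one edge")
    return (left_rel, right_rel)


def decode_label(label, i, j):
    """Recover the labeled edges (head, relation, dependent) of a label."""
    left_rel, right_rel = label
    edges = set()
    if left_rel is not None:
        edges.add((j, left_rel, i))
    if right_rel is not None:
        edges.add((i, right_rel, j))
    return edges


def observed_labels(training_pairs):
    """Keep only the encoded labels that occur in the training data."""
    return {encode_label(left, right) for left, right in training_pairs}


# Hypothetical training observations: (left relation, right relation).
seen = observed_labels([("arg1", None), (None, "arg2"),
                        ("arg1", "arg2"), ("arg1", None)])
two_cycle_edges = decode_label(("arg1", "arg2"), i=3, j=5)
```

Restricting the label set to `seen` mirrors the observation that only a small number of the *K*^{2} possible two-cycle types occur in training data.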

We design three algorithms to derive oracles for the modified versions of thmm, *S*_{S}, and *S*_{2S}, respectively. Algorithms 3 to 5 illustrate the key steps of the procedure, which finds the next transition *t* given a configuration *c* and gold graph *G*_{gold} = (*V*_{x}, *A*_{gold}) for the three systems. When this key procedure, namely, the ExtractOneOracle method, is well defined, the oracle for an entire sentence can be derived by the ExtractOracle methods, which apply it step by step. We want to emphasize that, although the ExtractOracle methods initialize the parameter label in ExtractOneOracle as nil, if an arc transition is predicted in the ExtractOneOracle method, it will call ExtractOneOracle recursively to return an Arc(label)–non-Arc transition and assign a value for that label.

### 3.3 Feature Design

Developing features has been shown to be crucial to advancing the state-of-the-art in dependency parsing (Koo and Collins 2010; Zhang and Nivre 2011). To build accurate deep dependency parsers, we utilize a large set of features for transition classification.

To conveniently define all features, we use the following notation. In a configuration with stack *σ* and buffer *β*, we denote the top two nodes in *σ* by *σ*_{0} and *σ*_{1}, and the front of *β* by *β*_{0}. In a configuration of the two-stack–based system with the second stack *σ*′, the top element of *σ*′ is denoted by *σ*_{0}′. The left-most dependent of node *n* is denoted by *n.lc*, the right-most one by *n.rc*. The left-most parent of node *n* is denoted by *n.lp*, the right-most one by *n.rp*. We denote the word and POS-tag of node *n* by *w*_{n} and *p*_{n}, respectively. Our parser derives the so-called path features from dependency trees. The path features collect POS tags or the first letter of POS tags along the tree between two nodes. Given two nodes *n*_{1} and *n*_{2}, we denote the path feature as *path*(*n*_{1}, *n*_{2}) and the coarse-grained path feature as *cpath*(*n*_{1}, *n*_{2}). The syntactic head of a node *n* is denoted as *n.h*.
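The structural part of this notation can be computed directly from the arcs built so far. The sketch below assumes unlabeled (head, dependent) arcs and takes "left-most" and "right-most" to mean the smallest and largest node index; the function names are illustrative.

```python
def dependents(arcs, n):
    """Dependents of node n among the arcs built so far, left to right."""
    return sorted(d for h, d in arcs if h == n)


def parents(arcs, n):
    """Parents (heads) of node n; a graph node may have several."""
    return sorted(h for h, d in arcs if d == n)


def frontier(arcs, n):
    """Return n.lc, n.rc, n.lp, and n.rp, or None where undefined."""
    deps, heads = dependents(arcs, n), parents(arcs, n)
    return {
        "lc": deps[0] if deps else None,
        "rc": deps[-1] if deps else None,
        "lp": heads[0] if heads else None,
        "rp": heads[-1] if heads else None,
    }


partial_arcs = {(2, 1), (2, 3), (4, 2), (1, 2)}
features_of_2 = frontier(partial_arcs, 2)
```

Because a node in a dependency graph may have more than one head, both a left-most and a right-most parent are needed, unlike in tree parsing where a single head suffices.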

We use the same feature templates for the online re-ordering and the two-stack–based systems; they differ slightly from those of the THMM system. Figure 6 defines the basic feature template functions. All feature templates are listed here.

- •
thmm system: f

_{uni}(*σ*_{0}), f_{uni}(*σ*_{1}), g_{uni}(*β*_{0}), f_{context}(*σ*_{0}), f_{context}(*β*_{0}), f_{pair−l}(*σ*_{0},*β*_{0}), f_{pair−l}(*σ*_{1},*β*_{0}), f_{pair}(*σ*_{0},*σ*_{1}), f_{tri}(*σ*_{0},*β*_{0},*σ*_{1}), f_{tri−l}(*σ*_{0},*β*_{0},*σ*_{0}.*l*_{p}), f_{tri−l}(*σ*_{0},*β*_{0},*σ*_{0}.*rp*), f_{tri−l}(*σ*_{0},*β*_{0},*σ*_{0}.*lc*), f_{tri−l}(*σ*_{0},*β*_{0},*σ*_{0}.*lc*), f_{tri−l}(*σ*_{0},*β*_{0},*β*_{0}.*lp*), f_{tri−l}(*σ*_{0},*β*_{0},*β*_{0}.*lc*), f_{tri−l}(*σ*_{1},*β*_{0},*σ*_{1}.*lp*), f_{tri−l}(*σ*_{1},*β*_{0},*σ*_{1}.*rp*), f_{tri−l}(*σ*_{1},*β*_{0},*σ*_{1}.*lc*), f_{tri−l}(*σ*_{1},*β*_{0},*σ*_{1}.*lc*), f_{tri−l}(*σ*_{1},*β*_{0},*β*_{0}.*lp*), f_{tri−l}(*σ*_{1},*β*_{0},*β*_{0}.*lc*), f_{quar−l}(*σ*_{0},*β*_{0},*σ*_{0}.*rp*,*σ*_{0}.*rc*), f_{quar−l}(*σ*_{0},*β*_{0},*σ*_{0}.*lc*,*σ*_{0}.*lc*2), f_{quar−l}(*σ*_{0},*β*_{0},*σ*_{0}.*rc*,*σ*_{0}.*rc*2), f_{quar−l}(*σ*_{0},*β*_{0},*β*_{0}.*lp*,*β*_{0}.*lc*), f_{quar−l}(*σ*_{0},*β*_{0},*β*_{0}.*lc*,*β*_{0}.*lc*2), f_{quar−l}(*σ*_{1},*β*_{0},*σ*_{1}.*rp*,*σ*_{1}.*rc*), f_{quar−l}(*σ*_{1},*β*_{0},*σ*_{1}.*lc*,*σ*_{1}.*lc*2), f_{quar−l}(*σ*_{1},*β*_{0},*σ*_{1}.*rc*,*σ*_{1}.*rc*2), f_{quar−l}(*σ*_{1},*β*_{0},*β*_{0}.*lp*,*β*_{0}.*lc*), f_{quar−l}(*σ*_{1},*β*_{0},*β*_{0}.*lc*,*β*_{0}.*lc*2), f_{path}(*σ*_{0},*β*_{0}), f_{path}(*σ*_{1},*β*_{0}), f_{char}(*σ*_{0}), f_{char}(*β*_{0}), - •
Online re-ordering/two stack system: f

_{uni}(*σ*_{0}), f_{uni}(*σ*_{1}), f_{uni}(*σ*_{0}′), g_{uni}(*β*_{0}), f_{context}(*σ*_{0}), f_{context}(*β*_{0}), f_{pair−l}(*σ*_{0},*β*_{0}), f_{pair−l}(*σ*_{1},*β*_{0}), f_{pair−l}(*σ*_{0}′,*β*_{0}), f_{pair}(*σ*_{0},*σ*_{1}), f_{pair}(*σ*_{0},*σ*_{0}′), f_{tri}(*σ*_{0},*β*_{0},*σ*_{1}), f_{tri}(*σ*_{0},*β*_{0},*σ*_{0}′), f_{tri−l}(*σ*_{0},*β*_{0},*σ*_{0}.*lp*), f_{tri−l}(*σ*_{0},*β*_{0},*σ*_{0}.*rp*), f_{tri−l}(*σ*_{0},*β*_{0},*σ*_{0}.*lc*), f_{tri−l}(*σ*_{0},*β*_{0},*σ*_{0}.*lc*), f_{tri−l}(*σ*_{0},*β*_{0},*β*_{0}.*lp*), f_{tri−l}(*σ*_{0},*β*_{0},*β*_{0}.*lc*), f_{tri−l}(*σ*_{1},*β*_{0},*σ*_{1}.*lp*), f_{tri−l}(*σ*_{1},*β*_{0},*σ*_{1}.*rp*), f_{tri−l}(*σ*_{1},*β*_{0},*σ*_{1}.*lc*), f_{tri−l}(*σ*_{1},*β*_{0},*σ*_{1}.*lc*), f_{tri−l}(*σ*_{1},*β*_{0},*β*_{0}.*lp*), f_{tri−l}(*σ*_{1},*β*_{0},*β*_{0}.*lc*), f_{tri−l}(*σ*_{0}′,*β*_{0},*σ*_{0}′.*lp*), f_{tri−l}(*σ*_{0}′,*β*_{0},*σ*_{0}′.*rp*), f_{tri−l}(*σ*_{0}′,*β*_{0},*σ*_{0}′.*lc*), f_{tri−l}(*σ*_{0}′,*β*_{0},*σ*_{0}′.*lc*), f_{tri−l}(*σ*_{0}′,*β*_{0},*β*_{0}.*lp*), f_{tri−l}(*σ*_{0}′,*β*_{0},*β*_{0}.*lc*), f_{quar−l}(*σ*_{0},*β*_{0},*σ*_{0}.*rp*,*σ*_{0}.*rc*), f_{quar−l}(*σ*_{0},*β*_{0},*σ*_{0}.*lc*,*σ*_{0}.*lc*2), f_{quar−l}(*σ*_{0},*β*_{0},*σ*_{0}.*rc*,*σ*_{0}.*rc*2), f_{quar−l}(*σ*_{0},*β*_{0},*β*_{0}.*lp*,*β*_{0}.*lc*), f_{quar−l}(*σ*_{0},*β*_{0},*β*_{0}.*lc*,*β*_{0}.*lc*2), f_{quar−l}(*σ*_{1},*β*_{0},*σ*_{1}.*rp*,*σ*_{1}.*rc*), f_{quar−l}(*σ*_{1},*β*_{0},*σ*_{1}.*lc*,*σ*_{1}.*lc*2), f_{quar−l}(*σ*_{1},*β*_{0},*σ*_{1}.*rc*,*σ*_{1}.*rc*2), f_{quar−l}(*σ*_{1},*β*_{0},*β*_{0}.*lp*,*β*_{0}.*lc*), f_{quar−l}(*σ*_{1},*β*_{0},*β*_{0}.*lc*,*β*_{0}.*lc*2), f_{quar−l}(*σ*_{0}′,*β*_{0},*σ*_{0}′.*rp*,*σ*_{0}′.*rc*), f_{quar−l}(*σ*_{0}′,*β*_{0},*σ*_{0}′.*lc*,*σ*_{0}′.*lc*2), f_{quar−l}(*σ*_{0}′,*β*_{0},*σ*_{0}′.*rc*,*σ*_{0}′.*rc*2), f_{quar−l}(*σ*_{0}′,*β*_{0},*β*_{0}.*lp*,*β*_{0}.*lc*), f_{quar−l}(*σ*_{0}′,*β*_{0},*β*_{0}.*lc*,*β*_{0}.*lc*2), f_{path}(*σ*_{0},*β*_{0}), f_{path}(*σ*_{1},*β*_{0}), 
f_{path}(*σ*_{0}′,*β*_{0}), f_{char}(*σ*_{0}), f_{char}(*β*_{0})

## 4. Tree Approximation

Tree structures exhibit many computationally attractive properties, and parsing techniques for tree-structured representations are quite mature. When we consider semantics-oriented graphs, such as the representations for semantic role labeling (SRL; Surdeanu et al. 2008; Hajič et al. 2009), CCG-grounded functor–argument analysis (Clark, Hockenmaier, and Steedman 2002), HPSG-grounded predicate–argument analysis (Miyao, Ninomiya, and Tsujii 2004), and reduction of MRS (Ivanova et al. 2012), syntactic trees can provide very useful features for semantic disambiguation (Punyakanok, Roth, and Yih 2008). Our parser also utilizes a *path* feature template (as defined in Section 3.3) to incorporate syntactic information for disambiguation.

In case syntactic tree information is not available, we introduce a tree approximation technique to induce tree backbones from deep dependency graphs. Such tree backbones can be utilized to train a tree parser that provides *pseudo* path features. In particular, we introduce an algorithm to associate every graph with a projective dependency tree, which we call **weighted conversion**. The tree reflects partial information about the corresponding graph. The key idea underlying this algorithm is to assign heuristic weights to all ordered pairs of words, and then find the tree with the maximum total weight. In this way, a tree frame of a given graph is automatically derived as an alternative to syntactic analysis.

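Given a set of nodes *V*, each possible edge (*i*, *j*), where *i*, *j* ∈ *V*, is assigned a heuristic weight ω(*i*, *j*). Among all trees (denoted as 𝒯) over *V*, the maximum spanning tree *T*^{max} contains the maximum sum of values of edges:

$$T^{\max} = \operatorname*{arg\,max}_{T \in \mathcal{T}} \sum_{(i,j) \in T} \omega(i, j)$$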

We separate the ω(*i*, *j*) into three parts (ω(*i*, *j*) = *A*(*i*, *j*) + *B*(*i*, *j*) + *C*(*i*, *j*)) that are as defined here.

- *A*(*i*, *j*) = *a* · max{*y*(*i*, *j*), *y*(*j*, *i*)}: *a* is the weight for an existing edge of the graph, ignoring direction.
- *B*(*i*, *j*) = *b* · *y*(*i*, *j*): *b* is the weight for the forward edge on the graph.
- *C*(*i*, *j*) = *n* − |*i* − *j*|: This term estimates the importance of an edge, where *n* is the length of the given sentence. For dependency parsing, we consider edges with short distance to be more important because those edges can be predicted more accurately in future parsing processes.
- *a* ≫ *b* ≫ *n* or *a* > *bn* > *n*^{2}: The converted tree should contain as many arcs of the original graph as possible, and the direction of the arcs should not be changed if possible. The relationship of *a*, *b*, and *n* guarantees this.

After all edges are weighted, we can use maximum spanning tree algorithms to obtain the converted tree. To obtain the projective tree, we choose Eisner's algorithm. For any graph, we can call this algorithm and get a corresponding tree. However, the tree is informative only when the given graph is dense enough. Fortunately, this condition holds for semantic dependency parsing.
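The weighting scheme can be sketched as follows. This is only an illustration under stated assumptions: *y*(*i*, *j*) is taken to be 1 if the graph contains an edge from *i* to *j* and 0 otherwise, and the concrete values `b = n + 1` and `a = b * n + 1` are one hypothetical choice that satisfies *a* > *bn* > *n*^{2}. The function name `conversion_weights` is not from the paper, and the projective decoding step (Eisner's algorithm) is not shown.

```python
def conversion_weights(n, edges):
    """Return an n x n matrix with omega[i][j] = A(i, j) + B(i, j) + C(i, j).

    n     -- sentence length (nodes are numbered 0 .. n-1)
    edges -- set of directed (i, j) pairs present in the dependency graph
    """
    b = n + 1       # assumed value: guarantees b * n > n ** 2
    a = b * n + 1   # assumed value: guarantees a > b * n

    def y(i, j):
        # Assumed indicator: 1 if the graph has an edge from i to j.
        return 1 if (i, j) in edges else 0

    omega = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # no self-loops in a dependency tree
            A = a * max(y(i, j), y(j, i))  # edge exists, ignoring direction
            B = b * y(i, j)                # edge exists in this direction
            C = n - abs(i - j)             # shorter edges score higher
            omega[i][j] = A + B + C
    return omega


omega = conversion_weights(3, {(0, 1)})
print(omega[0][1], omega[1][0], omega[1][2])
```

With these values, a pair connected in the graph always outweighs an unconnected pair, and the original direction always outweighs the reversed one, which mirrors the priorities stated in the list above.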

## 5. Empirical Evaluation

### 5.1 Set-up

We present empirical evaluation of different incremental graph spanning algorithms for CCG-style functor–argument analysis, LFG-style grammatical relation analysis, and HPSG-style semantic dependency analysis for English and Chinese. Linguistically speaking, these types of syntacto-semantic dependencies directly encode information such as coordination, extraction, raising, control, as well as many other long-range dependencies. Experiments for a variety of formalisms and languages profile different aspects of transition-based deep dependency parsing models.

Figure 7 visualizes cross-format annotations assigned to the English sentence: *A similar technique is almost impossible to apply to other crops, such as cotton, soybeans, and rice*. This running example illustrates a range of linguistic phenomena such as coordination, verbal chains, argument and modifier prepositional phrases, complex noun phrases, and the so-called **tough** construction. The first format is from the popular corpus PropBank, which is widely used by various SRL systems. We can clearly see that compared with SRL, SDP uses dense graphs to represent much more syntacto–semantic information. This difference suggests to us that we should explore different algorithms for producing SRL and SDP graphs. Another thing worth noting is that, for the same phenomenon, annotation schemes may not agree with each other. Take the coordination construction, for example. For more details about the difference among different data sets, please refer to Ivanova et al. (2012).

For CCG analysis, we conduct experiments on English and Chinese CCGBank (Hockenmaier and Steedman 2007; Tse and Curran 2010). Following the previous experimental set-up for English CCG parsing, we use Sections 02–21 as training data, Section 00 as development data, and Section 23 for testing. To conduct Chinese parsing experiments, we use data setting **C** of Tse and Curran (2012). For grammatical relation analysis, we conduct experiments on Chinese GRBank data (Sun et al. 2014). The selection of training, development, and test data also follows Sun et al.'s (2014) experiments.

We also evaluate all parsing models using more HPSG-grounded semantics-oriented data, namely, DeepBank^{2} (Flickinger, Zhang, and Kordoni 2012) and EnjuBank (Miyao, Ninomiya, and Tsujii 2004). Unlike corpora converted from the Penn Treebank, DeepBank's annotations are essentially based on the parsing results given by a large-scale, linguistically precise HPSG grammar, namely, the LinGO English Resource Grammar (ERG; Flickinger 2000), and are manually disambiguated. As part of the full HPSG sign, the ERG also makes available a logical-form representation of propositional semantics, in the framework of minimal recursion semantics (MRS; Copestake et al. 2005). Such semantic information is reduced into variable-free bilexical dependency graphs (Oepen and Lønning 2006; Ivanova et al. 2012). In summary, DeepBank gives the *reduction of logical-form meaning representations* with respect to MRS. EnjuBank (Miyao, Ninomiya, and Tsujii 2004) provides another corpus for semantic dependency parsing. This type of annotation is somewhat shallower than DeepBank, given that only basic predicate–argument structures are concerned. Unlike DeepBank but similar to CCGBank and GRBank, EnjuBank is semi-automatically converted from Penn Treebank–style annotations with linguistic heuristics. To conduct HPSG experiments, we use Sections 00 to 19 as training data and Section 20 as development data to tune parameters. For final evaluation, we use Sections 00 to 20 as training data and Section 21 as test data. The DeepBank and EnjuBank data sets are from SemEval 2014 Task 8 (Oepen et al. 2014), and the data splitting policy follows the shared task. Table 1 gives a summary of the data sets for experiments.

| Language | Formalism | Data | Training | Test |
|---|---|---|---|---|
| English | CCG | CCGBank | 39,604 | 2,407 |
| | HPSG | DeepBank | 34,003 | 1,348 |
| | HPSG | EnjuBank | 34,003 | 1,348 |
| Chinese | CCG | CCGBank | 22,339 | 2,813 |
| | LFG | GRBank | 22,277 | 2,557 |


Experiments for English CCG-grounded analysis were performed using automatically assigned POS-tags that are generated by a symbol-refined generative HMM tagger^{3} (SR-HMM; Huang, Harper, and Petrov 2010). Experiments for English HPSG-grounded analysis used POS-tags provided by the shared task. For the experiments on Chinese CCGBank and GRBank, we use gold-standard POS tags.

We use the averaged perceptron algorithm with early update to estimate parameters, and beam search for decoding. We set the beam size to 16 and the number of iterations to 20 for all experiments. The measure for comparing two dependency graphs is precision and recall of tokens that are defined as 〈*w*_{h}, *w*_{d}, *l*〉 tuples, where *w*_{h} is the head, *w*_{d} is the dependent, and *l* is the relation. Labeled precision/recall (LP/LR) is the ratio of tuples correctly identified by the automatic generator, and unlabeled precision/recall (UP/UR) is the ratio regardless of *l*. F-score is the harmonic mean of precision and recall. These measures correspond to attachment scores (LAS/UAS) in dependency tree parsing and are also used by SemEval 2014 Task 8. The de facto standard for evaluating CCG parsers also considers supertags. Because no supertagging is performed in our experiments, only the unlabeled precision/recall/F-score is comparable to the results reported in other papers. The labeled performance reported here only considers the labels assigned to dependency arcs that indicate the argument types. For example, an arc label *arg1* denotes that the dependent is the first argument of the head.
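The evaluation measures can be made concrete with a short sketch. It assumes each graph is represented as a set of (head, dependent, label) tuples; the function names `prf` and `evaluate` are hypothetical and are not taken from the shared task's scorer.

```python
def prf(gold, predicted):
    """Precision, recall, and F-score between two sets of items."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


def evaluate(gold, predicted):
    """Return ((LP, LR, LF), (UP, UR, UF)) for two dependency graphs.

    Each graph is a set of (head, dependent, label) tuples.
    """
    labeled = prf(gold, predicted)
    # Unlabeled scores compare (head, dependent) pairs and ignore the label.
    unlabeled = prf({(h, d) for h, d, _ in gold},
                    {(h, d) for h, d, _ in predicted})
    return labeled, unlabeled


gold = {(2, 1, "arg1"), (2, 3, "arg2"), (4, 3, "arg1")}
pred = {(2, 1, "arg1"), (2, 3, "arg1")}
labeled, unlabeled = evaluate(gold, pred)
print(labeled, unlabeled)
```

In this toy case one of the two predicted arcs carries the wrong label, so it counts toward the unlabeled scores but not toward the labeled ones.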

### 5.2 Parsing Efficiency

We evaluate the real running time of our final trained parser using realistic data. The test sentences are collected from English Wikipedia and Chinese Gigaword (LDC2005T14). First, we show the influence of beam size in Figure 8. In this experiment, the models trained on DeepBank are used for testing. We can see that the parsers run in nearly linear time regardless of the beam width in realistic situations. Second, we report the averaged real running time of models trained on different data sets in Figure 9. Again, we can see that the parser runs in close to linear time for a variety of linguistically motivated representations. The results also suggest that our proposed transition-based parsers can automatically learn the complexity of linguistically motivated dependency structures from an annotated corpus. Note that, although we work within the deep parsing framework, the study of formal grammars is only partially relevant to data-driven dependency parsing: our parsers rely on inductive inference from treebank data and use a grammar only *implicitly*.

### 5.3 Importance of Transition Combination

Figure 10 and Table 2 summarize the labeled parsing results on all five data sets. In this experiment, we distinguish parsing models with and without transition combination. All models take only the surface word form and POS tag information and do not derive features from any syntactic analysis. The importance of transition combination is highlighted by the comparative evaluation of parsers with and without this mechanism. Significant improvements are observed over a wide range of conditions: Parsers based on different transition systems for different languages and different formalisms almost always benefit. This result suggests a necessary strategy for designing transition systems for producing deep dependency graphs: Configurations should be essentially modified by every transition.

Because of the importance of transition combination, all the following experiments utilize the transition combination strategy.

### 5.4 Model Diversity and Parser Ensemble

#### 5.4.1 Model Diversity

For two models *A* and *B*, we define a diversity metric over the sets of dependencies that each model *X* returns for the held-out sentences. Tables 3 and 4 show the model diversity evaluated on English and Chinese data, respectively. We can see that parsing models built upon different transition systems do vary. Even for one specific transition system, different processing directions yield quite different parsing results.

#### 5.4.2 Parser Ensemble

Parser ensemble has been shown to be very effective in boosting the performance of data-driven tree parsers (Nivre and McDonald 2008; Surdeanu and Manning 2010; Sun and Wan 2013). Empirically, the two proposed systems together with the existing THMM system exhibit complementary prediction powers, and their combination yields superior accuracy. We present a simple yet effective voting strategy for parser ensemble. For each pair of words in each sentence, we count the number of models that give positive predictions. If the number is greater than a threshold (we set it to half the number of models in this work), we add this arc to the final graph and label it with the most common label among those the models give.
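The voting strategy can be sketched in a few lines. The sketch assumes each model's output for a sentence is a dictionary mapping an ordered (head, dependent) pair to a label, and that ties between equally frequent labels may be broken arbitrarily; the function name `vote` and this data layout are illustrative assumptions rather than the authors' implementation.

```python
from collections import Counter, defaultdict


def vote(predictions):
    """Combine the graphs predicted by several models for one sentence.

    predictions -- list of graphs, each a dict {(head, dependent): label}
    Returns a dict in the same format containing only the arcs predicted
    by more than half of the models.
    """
    threshold = len(predictions) / 2
    votes = defaultdict(Counter)
    for graph in predictions:
        for arc, label in graph.items():
            votes[arc][label] += 1

    combined = {}
    for arc, labels in votes.items():
        # The number of models that predicted this arc, under any label.
        if sum(labels.values()) > threshold:
            # Most frequent label; ties are broken arbitrarily.
            combined[arc] = labels.most_common(1)[0][0]
    return combined


models = [
    {(2, 1): "arg1", (2, 3): "arg2"},
    {(2, 1): "arg1", (4, 3): "arg1"},
    {(2, 1): "arg2", (2, 3): "arg2"},
]
print(vote(models))
```

Because the arc decision and the label decision are made separately, an arc can be accepted even when the models disagree on its label.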

Table 5 presents the parsing accuracy of the combined model where six base models are utilized for voting. We can see that a system ensemble is quite helpful. Given that our graph parsers all run in expected linear time, the combined system also runs very efficiently.

### 5.5 Impact of Syntactic Parsing

#### 5.5.1 Effectiveness of Syntactic Features

Syntactic parsing, especially full parsing, has been shown to be very important for boosting the performance of SRL, a well-studied shallow semantic parsing task (Punyakanok, Roth, and Yih 2008). According to the comprehensive evaluation presented in Punyakanok, Roth, and Yih (2008) and Zhuang and Zong (2010) (see Table 6), there is an essential gap between full and shallow parsing-based SRL systems. If we consider a system that takes only word form and POS tags as input, the performance gap will be larger.

| Language | Parsing | Precision | Recall | F-score |
|---|---|---|---|---|
| English | Full parsing | 77.09% | 75.51% | 76.29 |
| | Shallow parsing | 75.48% | 67.13% | 71.06 |
| Chinese | Full parsing | 79.17% | 72.09% | 75.47 |
| | Shallow parsing | 72.57% | 67.02% | 69.68 |


When we consider semantics-oriented deep dependency structures, including the representations for CCG-grounded functor–argument analysis (Clark, Hockenmaier, and Steedman 2002), HPSG-grounded predicate–argument analysis (Miyao, Ninomiya, and Tsujii 2004), and reduction of MRS (Ivanova et al. 2012), syntactic parses can also provide very useful features for disambiguation. To evaluate the impact of syntactic tree parsing, we add more features, namely, path features, to our parsing models. The detailed description of syntactic features is presented in Section 3.3. In this work, we apply syntactic dependency parsers rather than phrase-structure parsers. Figure 11 summarizes the impact of features derived from syntactic trees. We can clearly see that syntactic features are effective in enhancing semantic dependency parsing. These informative features lead to average absolute improvements of 1.14% and 1.03% for English and Chinese CCG parsing, respectively. Compared with SRL, the improvement brought by syntactic parsing is smaller. We think one main reason for this difference is the information density of different types of graphs. SRL graphs usually annotate only verbal predicates and their nominalizations, whereas the semantic graphs grounded by CCG and HPSG target all words. In other words, SRL provides partial analysis and semantic dependency parsing provides full analysis. Accordingly, SRL needs structural information generated by a syntactic parser much more than semantic dependency parsing does.

#### 5.5.2 Comparison of Different Tree Parsers

There are two dominant data-driven approaches to syntactic dependency tree parsing: transition-based (Yamada and Matsumoto 2003; Nivre 2008) and graph-based (McDonald 2006; Torres Martins, Smith, and Xing 2009). In terms of overall per-token prediction, the transition-based and graph-based tree parsers achieve comparable performance (Suzuki et al. 2009; Weiss et al. 2015). To evaluate the impact of the two tree parsing approaches on semantic dependency parsing, we use two tree parsers to serve our graph parser. The first one is our in-house implementation of the algorithm presented in Zhang and Nivre (2011), and the second one is a second-order graph-based parser^{4} (Bohnet 2010). The tree parsers are trained with the unlabeled tree annotations provided by the English and Chinese CCGBank data. For both English and Chinese experiments, 5-fold cross-validation is performed to parse the training data to avoid overfitting. The accuracy of the tree parsers is shown in Table 7. Results presented in Figure 12 indicate that the two parsers are also equally effective for producing semantic analysis. This result is somewhat non-obvious given that the combination of a graph-based and a transition-based parser usually gives significantly better parsing performance (Nivre and McDonald 2008; Torres Martins et al. 2008).
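The 5-fold procedure for parsing the training data can be sketched as follows. The `train` and `parse` arguments are hypothetical stand-ins for whichever tree parser is used, and the round-robin assignment of sentences to folds is an assumption; the paper does not state how the folds were formed.

```python
def jackknife(sentences, train, parse, k=5):
    """Parse every training sentence with a model that never saw it.

    sentences -- list of training sentences (any objects)
    train     -- hypothetical callable: list of sentences -> model
    parse     -- hypothetical callable: (model, sentence) -> analysis
    k         -- number of folds
    Returns a list of analyses aligned with the input sentences.
    """
    results = [None] * len(sentences)
    for fold in range(k):
        # Assumed round-robin split: sentence i belongs to fold i % k.
        held_out = [i for i in range(len(sentences)) if i % k == fold]
        rest = [sentences[i] for i in range(len(sentences)) if i % k != fold]
        model = train(rest)
        for i in held_out:
            results[i] = parse(model, sentences[i])
    return results


# Toy check: the "model" is just the set of sentences it was trained on,
# and "parsing" reports whether the sentence was seen during training.
sentences = ["s%d" % i for i in range(10)]
seen = jackknife(sentences, train=set, parse=lambda model, s: s in model)
print(seen)
```

The point of the procedure is that the tree features seen while training the graph parser have roughly the same error profile as the features the tree parser will produce on unseen test sentences.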

### 5.6 Effectiveness of Tree Approximation

In case syntactic information is not available, we propose a tree approximation technique to induce tree backbones from deep dependency graphs. In particular, our technique guarantees that the automatically derived trees are projective, which is a necessary condition for a number of effective tree parsing algorithms. We can utilize these *pseudo* trees as an alternative to syntactic analysis. To evaluate the effectiveness of tree approximation, we compare the contributions of syntactic trees and pseudo trees to semantic dependency parsing. In this experiment, we use a transition-based tree parser to generate automatic analysis. Figure 13 presents the results. Generally speaking, *pseudo* trees contribute to semantic dependency parsing as well as syntactic trees do. Sometimes, they perform even better. There is a considerable drop when DeepBank data are used. We think the main reason is the low density of DeepBank graphs. Because there are fewer edges in the original graphs, it is harder to extract informative pseudo trees. As a result, the final graph parsing benefits less.

It is also possible to build a parser ensemble on pseudo tree enhanced models. However, system combination is not as effective here as it is when integrating non-tree models. Table 8 summarizes the detailed parsing accuracy. We can see that system ensemble is still helpful, though the improvement is limited.

### 5.7 Comparison with Other Parsers

#### 5.7.1 Comparison with Grammar-Based Parsers

We compare our parser with several representative Treebank-guided, grammar-based parsers that achieve state-of-the-art performance for CCG and HPSG analysis. The grammar-based parsers selected represent two different architectures.

- The first type of parser implements a shift-reduce parsing architecture and also uses beam search for practical decoding. In particular, we compare our parser with the state-of-the-art CCG parser introduced in Xu, Clark, and Zhang (2014).^{5} This parser extends a shift-reduce CFG parser (Zhang and Clark 2011a) with a dependency model.
- The second type of parser implements the chart parsing architecture with some refinements. For CCG analysis, we focus on the parser proposed by Auli and Lopez (2011b). The basic system architecture follows the well-engineered C&C Parser,^{6} and additionally applies a number of advanced machine learning and optimization techniques, including belief propagation, dual decomposition (Auli and Lopez 2011a), and parameter estimation with softmax-margin loss (Auli and Lopez 2011b), to enhance the results. For HPSG analysis, we compare with the well-studied Enju Parser,^{7} which develops a number of advanced techniques for discriminative deep parsing, for example, maximum entropy estimation with feature forest (Miyao and Tsujii 2008) and efficient decoding with supertagging and CFG-filtering (Matsuzaki, Miyao, and Tsujii 2007).

Table 9 shows the final results on the test data for each data set. The representative shift-reduce parser used for comparison has a learning and decoding architecture very similar to that of our system. Like our parser, Xu, Clark, and Zhang's (2014) parser incrementally processes a sentence and uses a beam decoder that performs an inexact search. Xu, Clark, and Zhang's parser sets the beam width to 128, whereas ours uses 16. It also uses the structured prediction algorithm for parameter estimation. The major difference is that the shift-reduce CCG parser explicitly utilizes a core grammar to guide decoding, whereas our parser excludes all such information. In fact, our models reported here also exclude all syntactic information because no syntactic parse is used for feature extraction. We can see that our individual system based on the two-stack transition system achieves performance equivalent to that of the CCG-driven parser. Moreover, when this individual system is augmented with tree approximation, the accuracy is significantly improved. Note that the individual system in both settings does not rely on any explicit syntactic information. This result on the one hand indicates the effectiveness of adapting syntactic parsing techniques for full semantic parsing, and on the other hand suggests the possibility of using only semantically structural (not syntactically structural) information to achieve high-accuracy semantic parsing.

Statistical parsers based on chart parsing are able to perform a more principled search and therefore usually achieve better parsing accuracy than a *normal* shift-reduce parser. We also compare our parsing models with two state-of-the-art chart parsers, namely, the Enju Parser (Miyao and Tsujii 2008) and Auli and Lopez's (2011b) parser. Unlike Xu, Clark, and Zhang's (2014) shift-reduce parser and our models, Auli and Lopez's (2011b) parser is not guaranteed to produce an analysis for every sentence. Usually, the numerical performance evaluated on all sentences is lower than the results obtained only on sentences that can be parsed. Note that Auli and Lopez (2011b) only reported results on sentences that are covered, whereas Oepen et al. (2014) reported the results achieved by the Enju Parser on all sentences. From Table 9, we can clearly see that our graph-spanning models are very competitive. The best individual and combined models outperform the Enju Parser and perform as well as Auli and Lopez's (2011b) parser. It is worth noting that strictly less information is used by our parsers.

#### 5.7.2 Comparison with Other Data-Driven Parsers

We also compare our parser with recently developed data-driven, factorization models (Martins and Almeida 2014; Du, Sun, and Wan 2015). Different from projective but similar to non-projective tree parsing, decoding for factorization models where very basic second-order sibling factors are incorporated is NP-hard. See the proof presented in our early work (Du, Sun, and Wan 2015) for details. To perform principled decoding, dual decomposition is used and achieves good empirical results (Martins and Almeida 2014; Du, Sun, and Wan 2015).

From Table 9, we can see that the transition-based approach augmented with tree approximation is comparable to the factorization approach in general. Compared with the Turbo Parser, our individual and hybrid models perform significantly worse on DeepBank but significantly better on EnjuBank. We think one main reason is the difference in annotation styles. Though both corpora are based on HPSG, the annotations in question are quite different. DeepBank graphs are sparser than EnjuBank graphs, which makes tree approximation less effective. It seems that the transition-based parser suffers more when fewer output edges are targeted. The two approaches achieve equivalent performance for CCG parsing.

## 6. Related Work

Deep linguistic processing is concerned with NLP approaches that aim at modeling the complexity of natural languages in rich linguistic representations. Such approaches are typically related to a particular computational linguistic theory (e.g., CCG, LFG, and HPSG). Parsing in these formalisms provides an elegant way to generate deep syntacto-semantic dependency structures with high quality (Clark and Curran 2007; Miyao, Sagae, and Tsujii 2007; Miyao and Tsujii 2008). The incremental shift-reduce parsing architecture has been implemented for CCG parsing (Zhang and Clark 2011a; Ambati et al. 2015). Besides using phrase-structure rules only, a shift-reduce parser can be enhanced by incorporating a dependency model (Xu, Clark, and Zhang 2014). Our parser and the two above parsers have some essential resemblances, including learning and decoding algorithms. The main difference is the usage of syntactic and grammatical information. The comparison in Section 5.7 gives a rough idea of the impact of explicitly using grammatical constraints. A deep-grammar-guided parsing model usually cannot produce full coverage and the time complexity of the corresponding parsing algorithms is very high. Some NLP applications may favor lightweight solutions to build deep dependency structures.

Unlike grammar-guided approaches, data-driven approaches make essential use of machine learning from linguistic annotations in order to parse new sentences. Such approaches, for example, transition-based (Yamada and Matsumoto 2003; Nivre 2008) and graph-based (McDonald 2006; Torres Martins, Smith, and Xing 2009) models, have attracted the most attention in dependency parsing in recent years. Several successful parsers (e.g., MST, Mate, and Malt parsers) have been built and applied to many NLP applications. Recently, two advanced techniques have been studied to enhance a transition-based parser. First, developing features has been shown to be crucial to advancing parsing accuracy, and a very rich feature set is carefully evaluated by Zhang and Nivre (2011). Second, beyond deterministic greedy search, beam search and principled dynamic programming strategies have been used to explore more possible hypotheses (Zhang and Clark 2008; Huang and Sagae 2010). When we implement our graph parser, we also leverage rich features and beam search to obtain good parsing accuracy.

Most research has concentrated on surface dependency structures, and the majority of existing approaches are limited to producing only tree-shaped graphs. We notice three notable exceptions in early work. Sagae and Tsujii (2008) proposed a DAG parser that is able to handle projective directed dependency graphs, and that uses the pseudo-projective parsing technique (Nivre and Nilsson 2005) to build crossing arcs. Titov et al. (2009) and Henderson et al. (2013) introduced non-planar parsing to parse PropBank (Palmer, Gildea, and Kingsbury 2005) structures. However, neither technique handles crossing arcs fully. There have been a number of papers on building non-projective trees, which inspired the design of our transition systems. In particular, we borrow key ideas from Nivre (2009), Gómez-Rodríguez and Nivre (2010), and Gómez-Rodríguez and Nivre (2013). In addition to the investigation of the transition-based approach, McDonald and Pereira (2006) presented a factorization parser that can generate dependency graphs in which a word may depend on multiple heads, and evaluated it on the Danish Treebank. Very recently, the dual decomposition technique has been adopted to achieve principled decoding for factorization models. High-accuracy models have been introduced in Martins and Almeida (2014) and Du, Sun, and Wan (2015).

## 7. Conclusion

We study transition-based approaches that produce general dependency graphs directly from input sequences of words, in a way nearly as simple as tree parsers. We introduce two new graph-spanning algorithms to generate arbitrary directed graphs, which suit deep dependency parsing well. We also introduce transition combination and tree approximation for statistical disambiguation. Statistical parsers built upon these new techniques have been evaluated with dependency structures that are extracted from linguistically deep CCG, LFG, and HPSG derivations. Our models achieve state-of-the-art performance on five representative data sets for English and Chinese parsing. Experiments demonstrate the effectiveness of grammar-free, transition-based approaches to dealing with complex linguistic phenomena beyond surface syntax.

In addition to deep dependency parsing, many other NLP tasks (e.g., quantifier scope disambiguation [Manshadi, Gildea, and Allen 2013] and event extraction [Li, Ji, and Huang 2013]), can be formulated as graph spanning problems. We think such tasks can benefit from algorithms that span general graphs rather than trees, and our new transition-based parsers can provide practical solutions to these tasks.

## Acknowledgments

This work was supported by the National Natural Science Foundation of China under grants 61300064 and 61331011, and the National High-Tech R&D Program under grant 2015AA015403. We are very grateful to the anonymous reviewers for their insightful and constructive comments and suggestions.

## Notes

We assume that at most one edge exists between two words. This is a reasonable assumption for a linguistic representation.

The unlabeled parsing results are not reported in the original paper. The figures presented in Table 9 are provided by Wenduan Xu.

## References

## Author notes

The authors are with the Institute of Computer Science and Technology, the MOE Key Laboratory of Computational Linguistics, Peking University, Beijing 100871, China. E-mail: zhangxunah@pku.edu.cn, duyantao@pku.edu.cn, ws@pku.edu.cn, wanxiaojun@pku.edu.cn. Weiwei Sun is the corresponding author.