## Abstract

Arc-eager dependency parsers process sentences in a single left-to-right pass over the input and have linear time complexity with greedy decoding or beam search. We show how such parsers can be constrained to respect two different types of conditions on the output dependency graph: span constraints, which require certain spans to correspond to subtrees of the graph, and arc constraints, which require certain arcs to be present in the graph. The constraints are incorporated into the arc-eager transition system as a set of preconditions for each transition and preserve the linear time complexity of the parser.

## 1. Introduction

Data-driven dependency parsers in general achieve high parsing accuracy without relying on hard constraints to rule out (or prescribe) certain syntactic structures (Yamada and Matsumoto 2003; Nivre, Hall, and Nilsson 2004; McDonald, Crammer, and Pereira 2005; Zhang and Clark 2008; Koo and Collins 2010). Nevertheless, there are situations where additional information sources, not available at the time of training the parser, may be used to derive hard constraints at parsing time. For example, Figure 1 shows the parse of a greedy arc-eager dependency parser trained on the *Wall Street Journal* section of the Penn Treebank before (left) and after (right) being constrained to build a single subtree over the span corresponding to the named entity “Cat on a Hot Tin Roof,” which does not occur in the training set but can easily be found in on-line databases. In this case, adding the span constraint fixes both prepositional phrase attachment errors. Similar constraints can also be derived from dates, times, or other measurements that can often be identified with high precision using regular expressions (Karttunen et al. 1996), but are under-represented in treebanks.

In this article, we examine the problem of constraining transition-based dependency parsers based on the arc-eager transition system (Nivre 2003, 2008), which perform a single left-to-right pass over the input, eagerly adding dependency arcs at the earliest possible opportunity, resulting in linear time parsing. We consider two types of constraints: **span constraints**, exemplified earlier, require the output graph to have a single subtree over one or more (non-overlapping) spans of the input; **arc constraints** instead require specific arcs to be present in the output dependency graph. The main contribution of the article is to show that both span and arc constraints can be implemented as efficiently computed preconditions on parser transitions, thus maintaining the linear runtime complexity of the parser.^{1}

Demonstrating accuracy improvements due to hard constraints is challenging, because the phenomena we wish to integrate as hard constraints are by definition not available in the parser's training and test data. Moreover, adding hard constraints may be desirable even if it does not improve parsing accuracy. For example, many organizations have domain-specific gazetteers and want the parser output to be consistent with these even if the output disagrees with gold treebank annotations, sometimes because of expectations of downstream modules in a pipeline. In this article, we concentrate on the theoretical side of constrained parsing, but we nevertheless provide some experimental evidence illustrating how hard constraints can improve parsing accuracy.

## 2. Preliminaries and Notation

*Dependency Graphs.* Given a set *L* of dependency labels, we define a **dependency graph** for a sentence *x* = *w*_{1}, …, *w*_{n} as a labeled directed graph *G* = (*V*_{x}, *A*), consisting of a set of nodes *V*_{x} = {1, …, *n*}, where each node *i* corresponds to the linear position of a word *w*_{i} in the sentence, and a set of labeled arcs *A* ⊆ *V*_{x} × *L* × *V*_{x}, where each arc (*i*, *l*, *j*) represents a dependency with head *w*_{i}, dependent *w*_{j}, and label *l*. We assume that the final word *w*_{n} is always a dummy word Root and that the corresponding node *n* is a designated root node.

Given a dependency graph *G* for sentence *x*, we say that a subgraph *G*_{[i,j]} = (*V*_{[i,j]}, *A*_{[i,j]}) of *G* is a **projective spanning tree** over the interval [*i*, *j*] (1 ≤ *i* ≤ *j* ≤ *n*) iff (i) *G*_{[i,j]} contains all nodes corresponding to words between *w*_{i} and *w*_{j} inclusive, (ii) *G*_{[i,j]} is a directed tree, and (iii) it holds for every arc (*h*, *l*, *d*) ∈ *A*_{[i,j]} that there is a directed path from *h* to every node *k* such that min(*h*, *d*) < *k* < max(*h*, *d*) (projectivity). We now define two constraints on a dependency graph *G* for a sentence *x*:

- *G* is a **projective dependency tree** (pdt) if and only if it is a projective spanning tree over the interval [1, *n*] rooted at node *n*.
- *G* is a **projective dependency graph** (pdg) if and only if it can be extended to a projective dependency tree simply by adding arcs.
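The pdt definition can be made concrete with a small checker. The following is a minimal sketch under our own assumptions: arcs are represented as (head, label, dependent) triples over nodes 1..*n*, and the function name `is_projective_dependency_tree` is hypothetical. It tests only the pdt property; deciding whether a graph is a pdg (extendable to a pdt) is not attempted here.

```python
def is_projective_dependency_tree(n, arcs):
    """Return True iff `arcs` form a projective tree over nodes 1..n rooted at n.

    `arcs` is an iterable of (head, label, dependent) triples (assumed encoding).
    """
    head = {}
    for h, _label, d in arcs:
        in_range = 1 <= h <= n and 1 <= d <= n
        if not in_range or h == d or d in head:
            return False  # bad node, self-loop, or a second head for d
        head[d] = h
    # In a tree rooted at n, every node except n has exactly one head.
    if n in head or len(head) != n - 1:
        return False

    def dominates(h, k):
        """True iff following head links upward from k reaches h."""
        seen = set()
        while k in head and k not in seen:
            seen.add(k)
            k = head[k]
            if k == h:
                return True
        return False

    # Every node must be reachable from the root (this also rules out cycles).
    if not all(dominates(n, k) for k in range(1, n)):
        return False
    # Projectivity: a head dominates every node strictly between it and its dependent.
    return all(
        dominates(h, k)
        for d, h in head.items()
        for k in range(min(h, d) + 1, max(h, d))
    )
```

The check is quadratic in the worst case, which is acceptable for validating parser output but is not part of the linear-time parsing procedure itself.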

*Arc-Eager Transition-Based Parsing.* In the arc-eager transition system of Nivre (2003), a **parser configuration** is a triple *c* = (*Σ*, *B*, *A*) such that *Σ* and *B* are disjoint sublists of the nodes *V*_{x} of some sentence *x*, and *A* is a set of dependency arcs over *V*_{x} (and some label set *L*). Following Ballesteros and Nivre (2013), we take the initial configuration for a sentence *x* = *w*_{1}, …, *w*_{n} to be *c*_{s}(*x*) = ([ ], [1, …, *n*], { }), where *n* is the designated root node, and we take a terminal configuration to be any configuration of the form *c* = ([ ], [*n*], *A*) (for any arc set *A*). We will refer to the list *Σ* as the **stack** and the list *B* as the **buffer**, and we will use the variables *σ* and *β* for arbitrary sublists of *Σ* and *B*, respectively. For reasons of perspicuity, we will write *Σ* with its head (top) to the right and *B* with its head to the left. Thus, *c* = (*σ*|*i*, *j*|*β*, *A*) is a configuration with the node *i* on top of the stack *Σ* and the node *j* as the first node in the buffer *B*.

There are four types of **transitions** for going from one configuration to the next, defined formally in Figure 2 (disregarding for now the Added Preconditions column):

- Left-Arc_{l} adds the arc (*j*, *l*, *i*) to *A*, where *i* is the node on top of the stack and *j* is the first node in the buffer, and pops the stack. It has as a precondition that the token *i* does not already have a head.
- Right-Arc_{l} adds the arc (*i*, *l*, *j*) to *A*, where *i* is the node on top of the stack and *j* is the first node in the buffer, and pushes *j* onto the stack. It has as a precondition that *j* ≠ *n*.
- Reduce pops the stack and requires that the top token has a head.
- Shift removes the first node in the buffer and pushes it onto the stack, with the precondition that *j* ≠ *n*.

A **transition sequence** for a sentence *x* is a sequence *C*_{0,m} = (*c*_{0}, *c*_{1}, …, *c*_{m}) of configurations, such that *c*_{0} is the initial configuration *c*_{s}(*x*), *c*_{m} is a terminal configuration, and there is a legal transition *t* such that *c*_{i} = *t*(*c*_{i−1}) for every *i*, 1 ≤ *i* ≤ *m*. The dependency graph derived by *C*_{0,m} is *G*_{c_m} = (*V*_{x}, *A*_{c_m}), where *A*_{c_m} is the set of arcs in *c*_{m}.
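To illustrate the four transitions and their basic preconditions, the following is a minimal sketch of the unconstrained system. The class `Config`, the transition codes `"LA"`, `"RA"`, `"RE"`, `"SH"`, and the helper names are our own illustrative choices, not notation from Figure 2, and the arc set is stored as a dependent-to-head map for convenience.

```python
from dataclasses import dataclass, field


@dataclass
class Config:
    stack: list                                  # top of the stack is stack[-1]
    buffer: list                                 # first buffer node is buffer[0]
    heads: dict = field(default_factory=dict)    # dependent -> (head, label)

    def arcs(self):
        return {(h, l, d) for d, (h, l) in self.heads.items()}


def initial(n):
    return Config([], list(range(1, n + 1)))


def is_terminal(c, n):
    return not c.stack and c.buffer == [n]


def permissible(c, name, n):
    """Basic preconditions only (no added constraints)."""
    if not c.buffer:
        return False
    i = c.stack[-1] if c.stack else None
    j = c.buffer[0]
    if name == "LA":                 # stack top must not have a head yet
        return i is not None and i not in c.heads
    if name == "RA":                 # the root node n is never pushed
        return i is not None and j != n
    if name == "RE":                 # stack top must already have a head
        return i is not None and i in c.heads
    if name == "SH":
        return j != n
    raise ValueError(name)


def apply(c, name, n, label=None):
    if not permissible(c, name, n):
        raise ValueError(f"{name} is not permissible")
    if name == "LA":
        c.heads[c.stack.pop()] = (c.buffer[0], label)
    elif name == "RA":
        j = c.buffer.pop(0)          # list.pop(0) is O(n); an index would be O(1)
        c.heads[j] = (c.stack[-1], label)
        c.stack.append(j)
    elif name == "RE":
        c.stack.pop()
    else:                            # "SH"
        c.stack.append(c.buffer.pop(0))
    return c


# A three-word sentence plus the root node, so n = 4.
n = 4
c = initial(n)
sequence = [("SH", None), ("LA", "sbj"), ("SH", None),
            ("RA", "obj"), ("RE", None), ("LA", "root")]
for name, label in sequence:
    apply(c, name, n, label)
```

In this example every non-root node is pushed once and popped once, so the sequence has 6 = 2(*n* − 1) transitions and ends in the terminal configuration ([ ], [4], *A*).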

*Complexity and Correctness.* For a sentence of length *n*, the number of transitions in the arc-eager system is bounded by 2*n* (Nivre 2008). This means that a parser using greedy inference (or constant width beam search) will run in *O*(*n*) time provided that transitions plus required precondition checks can be performed in *O*(1) time. This holds for the arc-eager system and, as we will demonstrate, its constrained variants as well.

The arc-eager transition system as presented here is sound and complete for the set of pdts (Nivre 2008). For a specific sentence *x* = *w*_{1}, …, *w*_{n}, this means that any transition sequence for *x* produces a pdt (soundness), and that any pdt for *x* is generated by some transition sequence (completeness).^{2} In constrained parsing, we want to restrict the system so that, when applied to a sentence *x*, it is sound and complete for the subset of pdts that satisfy all constraints.

## 3. Parsing with Arc Constraints

*Arc Constraints.* Given a sentence *x* = *w*_{1}, …, *w*_{n} and a label set *L*, an **arc constraint set** is a set *A*_{C} of dependency arcs (*i*, *l*, *j*) (1 ≤ *i*, *j* ≤ *n*, *i* ≠ *j* ≠ *n*, *l* ∈ *L*), where each arc is required to be included in the parser output. Because the arc-eager system can only derive pdts, the arc constraint set has to be such that the **constraint graph** *G*_{C} = (*V*_{x}, *A*_{C}) can be extended to a pdt, which is equivalent to requiring that *G*_{C} is a pdg. Thus, the task of arc-constrained parsing can be defined as the task of deriving a pdt *G* such that *G*_{C} ⊆ *G*. An arc-constrained transition system is sound if it only derives proper extensions of the constraint graph and complete if it derives all such extensions.

*Added Preconditions.* We know that the unconstrained arc-eager system can derive any pdt for the input sentence *x*, which means that any arc in *V*_{x} × *L* × *V*_{x} is reachable from the initial configuration, including any arc in the arc constraint set *A*_{C}. Hence, in order to make the parser respect the arc constraints, we only need to add preconditions that block transitions that would make an arc in *A*_{C} unreachable.^{3} We achieve this through the following preconditions, defined formally in Figure 2 under the heading Arc Constraints for each transition:

- Left-Arc_{l} in a configuration (*σ*|*i*, *j*|*β*, *A*) adds the arc (*j*, *l*, *i*) and makes unreachable any arc that involves *i* and a node in the buffer (other than (*j*, *l*, *i*)). Hence, we permit Left-Arc_{l} only if no such arc is in *A*_{C}.
- Right-Arc_{l} in a configuration (*σ*|*i*, *j*|*β*, *A*) adds the arc (*i*, *l*, *j*) and makes unreachable any arc that involves *j* and a node on the stack (other than (*i*, *l*, *j*)). Hence, we permit Right-Arc_{l} only if no such arc is in *A*_{C}.
- Reduce in a configuration (*σ*|*i*, *j*|*β*, *A*) pops *i* from the stack and makes unreachable any arc that involves *i* and a node in the buffer. Hence, we permit Reduce only if no such arc is in *A*_{C}.
- Shift in a configuration (*σ*, *i*|*β*, *A*) moves *i* to the stack and makes unreachable any arc that involves *i* and a node on the stack. Hence, we permit Shift only if no such arc is in *A*_{C}.
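The four added preconditions can be sketched directly as set-based checks. This is a naive illustration that scans the stack or buffer and is therefore not constant time; the function names and transition codes are hypothetical. One point is our own assumption rather than something stated above: for the two arc-adding transitions, the sketch also refuses to assign a head that contradicts a constrained head of the dependent, since Figure 2 is not reproduced here and such an assignment could never be repaired later.

```python
def involves(arc, a, b):
    """True iff `arc` connects nodes a and b, in either direction."""
    h, _label, d = arc
    return {h, d} == {a, b}


def arc_constraints_permit(name, label, stack, buffer, constraints):
    """Added preconditions only; the basic preconditions are checked elsewhere.

    `constraints` is the set A_C of (head, label, dependent) triples.
    `stack[-1]` is the stack top i and `buffer[0]` is the buffer front j.
    """
    i = stack[-1] if stack else None
    j = buffer[0] if buffer else None
    if name == "LA":      # adds (j, label, i) and pops i
        new, node, pool, guard_head = (j, label, i), i, buffer, True
    elif name == "RA":    # adds (i, label, j) and pushes j
        new, node, pool, guard_head = (i, label, j), j, stack, True
    elif name == "RE":    # pops i
        new, node, pool, guard_head = None, i, buffer, False
    elif name == "SH":    # pushes j
        new, node, pool, guard_head = None, j, stack, False
    else:
        raise ValueError(name)

    for arc in constraints:
        if arc == new:
            continue      # the arc being added is itself the required one
        if guard_head and arc[2] == node:
            return False  # assumption: never give `node` a head other than its constrained one
        if any(involves(arc, node, k) for k in pool):
            return False  # this required arc would become unreachable
    return True
```

For example, with *A*_{C} = {(2, obj, 3)}, stack [2], and buffer [3, 4], the only transition that passes is Right-Arc with label obj: Shift, Reduce, Left-Arc, and Right-Arc with any other label are all blocked.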

*Complexity and Correctness.* Because the transitions remain the same, the arc-constrained parser will terminate after at most 2*n* transitions, just like the unconstrained system. However, in order to guarantee termination, we must also show that at least one transition is applicable in every non-terminal configuration. This is trivial in the unconstrained system, where the Shift transition can apply to any configuration that has a non-empty buffer. In the arc-constrained system, Shift will be blocked if there is an arc *a* ∈ *A*_{C} involving the node *i* to be shifted and some node on the stack, and we need to show that one of the three remaining transitions is then permissible. If *a* involves *i* and the node on top of the stack, then either Left-Arc_{l} or Right-Arc_{l} is permissible (in fact, required). Otherwise, either Left-Arc_{l} or Reduce must be permissible, because their preconditions are implied by the fact that *G*_{C} is a pdg.

In order to obtain linear parsing complexity, we must also be able to check all preconditions in constant time. This can be achieved by preprocessing the sentence *x* and arc constraint set *A*_{C} and recording for each node *i* ∈ *V*_{x} its constrained head (if any), its leftmost constrained dependent (if any), and its rightmost constrained dependent (if any), so that we can evaluate the preconditions in each configuration without having to scan the stack and buffer linearly. Because there are at most *O*(*n*) arcs in the arc constraint set, the preprocessing will not take more than *O*(*n*) time but guarantees that all permissibility checks can be performed in *O*(1) time.
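The preprocessing step can be sketched as follows. This is one possible realization under stated assumptions, not the formulation of Figure 2: the function names are hypothetical, the buffer is assumed to be the contiguous suffix *j*, …, *n* (it is only ever consumed from the front), `heads` is the parser's current dependent-to-head map, and the checks rely on all earlier transitions having respected the same preconditions.

```python
def preprocess(n, constraints):
    """One O(|A_C|) pass: constrained head and extreme constrained dependents."""
    chead = [None] * (n + 1)   # chead[d] = (head, label) required for d
    ldep = [None] * (n + 1)    # leftmost constrained dependent of each node
    rdep = [None] * (n + 1)    # rightmost constrained dependent of each node
    for h, label, d in constraints:
        chead[d] = (h, label)
        ldep[h] = d if ldep[h] is None else min(ldep[h], d)
        rdep[h] = d if rdep[h] is None else max(rdep[h], d)
    return chead, ldep, rdep


def pending_right(i, j, chead, rdep):
    """Is some required arc between stack node i and a buffer node (>= j) still open?"""
    head_in_buffer = chead[i] is not None and chead[i][0] >= j
    dep_in_buffer = rdep[i] is not None and rdep[i] >= j
    return head_in_buffer or dep_in_buffer


def pending_left(j, heads, chead, ldep):
    """Is some required arc between buffer front j and a stack node still open?"""
    head_on_stack = chead[j] is not None and chead[j][0] < j
    # Left dependents attach right to left, so the leftmost one is attached last.
    dep_on_stack = ldep[j] is not None and ldep[j] < j and ldep[j] not in heads
    return head_on_stack or dep_on_stack


def permitted(name, label, i, j, heads, tables):
    """Constant-time version of the added preconditions (i = stack top, j = buffer front)."""
    chead, ldep, rdep = tables
    if name == "SH":
        return not pending_left(j, heads, chead, ldep)
    if name == "RE":
        return not pending_right(i, j, chead, rdep)
    if name == "LA":   # i may only receive its constrained head, and owes nothing to the buffer
        return chead[i] in (None, (j, label)) and not (rdep[i] is not None and rdep[i] >= j)
    if name == "RA":   # j may only receive its constrained head, and owes nothing to the stack
        open_dep = ldep[j] is not None and ldep[j] < j and ldep[j] not in heads
        return chead[j] in (None, (i, label)) and not open_dep
    raise ValueError(name)
```

Each check reads a fixed number of table entries, so no scan of the stack or buffer is needed once the three tables have been built.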

Finally, we note that the arc-constrained system is sound and complete in the sense that it derives all and only pdts compatible with a given arc constraint set *A*_{C} for a sentence *x*. Soundness follows from the fact that, for every arc (*i*, *l*, *j*) ∈ *A*_{C}, the preconditions force the system to reach a configuration of the form (*σ*|min(*i*, *j*), max(*i*, *j*)|*β*, *A*) in which either Left-Arc_{l} (*i* > *j*) or Right-Arc_{l} (*i* < *j*) will be the only permissible transition. Completeness follows from the observation that every pdt *G* compatible with *A*_{C} is also a pdg and can therefore be viewed as a larger constraint set for which every transition sequence (given soundness) derives *G* exactly.

*Empirical Case Study: Imperatives.* Consider the problem of parsing commands to personal assistants such as Siri or Google Now. In this setting, the distribution of utterances is highly skewed towards imperatives, making them easy to identify. Unfortunately, parsers trained on treebanks like the Penn Treebank (PTB) typically do a poor job of parsing such utterances (Hara et al. 2011). However, we know that if the first word of a command is a verb, it is likely the root of the sentence. An arc-eager beam search parser (Zhang and Nivre 2011) trained on the PTB achieves a labeled attachment score of 82.14 on a set of commands.^{4} However, if we constrain the same parser so that the first word of the sentence must be the root, accuracy jumps to 85.56. This is independent of simply knowing that the first word of the sentence is a verb, as both parsers in this experiment had access to gold part-of-speech tags.

## 4. Parsing with Span Constraints

*Span Constraints.* Given a sentence *x* = *w*_{1}, …, *w*_{n}, we take a **span constraint set** to be a set *S*_{C} of non-overlapping spans [*i*, *j*] (1 ≤ *i* < *j* ≤ *n*). The task of span-constrained parsing can then be defined as the task of deriving a pdt *G* such that, for every span [*i*, *j*] ∈ *S*_{C}, *G*_{[i,j]} is a (projective) spanning tree over the interval [*i*, *j*]. A span-constrained transition system is sound if it only derives dependency graphs compatible with the span constraint set and complete if it derives all such graphs. In addition, we may add the requirement that no word inside a span may have dependents outside the span (none), or that only the root of the span may have such dependents (root).
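As an illustration of what a span constraint demands of the output, the following sketch checks a finished tree against a span constraint set. It assumes the input is already a pdt, encoded as a hypothetical dependent-to-head map `heads`; under that assumption a span is covered by a single subtree exactly when one word in the span has its head outside the span. The `condition` argument mirrors the optional none and root requirements; the function name is our own.

```python
def satisfies_spans(heads, spans, condition=None):
    """Check span constraints on a complete tree.

    heads:     dict dependent -> head (the root node has no entry)
    spans:     iterable of (i, j) pairs, inclusive and non-overlapping
    condition: None, "none", or "root" (restriction on outside dependents)
    """
    for i, j in spans:

        def inside(k):
            return k is not None and i <= k <= j

        # Words of the span whose head lies outside the span (or that have no head).
        roots = [k for k in range(i, j + 1) if not inside(heads.get(k))]
        if len(roots) != 1:
            return False
        # Words inside the span that govern some word outside it.
        outward = {h for d, h in heads.items() if inside(h) and not inside(d)}
        if condition == "none" and outward:
            return False
        if condition == "root" and outward - set(roots):
            return False
    return True
```

A checker of this kind is useful for testing a constrained parser, but it plays no role in the parsing procedure itself, which enforces the constraints incrementally.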

*Added Preconditions.* Unlike the case of arc constraints, parsing with span constraints cannot be reduced to simply enforcing (and blocking) specific dependency arcs. In this sense, span constraints are more global than arc constraints as they require entire subgraphs of the dependency graph to have a certain property. Nevertheless, we can use the same basic technique as before and enforce span constraints by adding new preconditions to transitions, but these preconditions need to refer to variables that are updated dynamically during parsing. We need to keep track of two things:

- Which word is the designated root of a span? A word becomes the designated root *r*(*s*) of its span *s* if it acquires a head outside the span or if it acquires a dependent outside the span under the root condition.
- How many connected components are in the subgraph over the current span up to and including the last word pushed onto the stack? A variable #cc is set to 1 when the first span word enters the stack, incremented by 1 for every Shift and decremented by 1 for every Left-Arc_{l}.

The added preconditions must then guarantee the following:

- The designated root must not acquire a head *inside* the span.
- No word except the designated root may acquire a head *outside* the span.
- The designated root must not be popped from the stack before the last word of the span has been pushed onto the stack.
- The last word of a span must not be pushed onto the stack in a Right-Arc_{l} transition if #cc > 1.
- The last word of a span must not be pushed onto the stack in a Shift transition if #cc > 0.
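The bookkeeping for #cc and the two requirements on pushing the last word of a span can be sketched as follows. This is a partial illustration for a single span under our own assumptions: the class `SpanState` and its method names are hypothetical, the designated-root requirements are not modeled, and the counter is decremented only for a Left-Arc whose two endpoints both lie inside the span, which is our reading of the update rule.

```python
class SpanState:
    """Tracks the '#cc' counter for one span [start, end] during parsing."""

    def __init__(self, start, end):
        self.start, self.end = start, end
        self.cc = 0

    def inside(self, k):
        return k is not None and self.start <= k <= self.end

    def may_push(self, name, j):
        """May buffer front j be pushed by transition `name` ("SH" or "RA")?"""
        if j != self.end:
            return True              # only the last word of the span is restricted
        if name == "SH":
            return self.cc == 0      # Shift is blocked if #cc > 0
        return self.cc <= 1          # Right-Arc is blocked if #cc > 1

    def update(self, name, i, j):
        """Update #cc for a transition taken with stack top i and buffer front j."""
        if name == "SH" and self.inside(j):
            self.cc = 1 if j == self.start else self.cc + 1
        elif name == "RA" and j == self.start:
            self.cc = 1              # first span word enters the stack with an outside head
        elif name == "LA" and self.inside(i) and self.inside(j):
            self.cc -= 1             # the component rooted at i is merged into j's


# Span [2, 4]: shift words 1, 2, 3, then try to push the last span word 4.
state = SpanState(2, 4)
state.update("SH", None, 1)
state.update("SH", 1, 2)
state.update("SH", 2, 3)
blocked = not state.may_push("SH", 4) and not state.may_push("RA", 4)
state.update("LA", 3, 4)             # attach 3 to 4, leaving one open component
```

After the final Left-Arc in this trace the counter is 1, so word 4 may be pushed by Right-Arc (attaching it to word 2) but not by Shift, which would leave two components inside the span.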

*Complexity and Correctness.* To show that the span-constrained parser always terminates after at most 2*n* transitions, it is again sufficient to show that there is at least one permissible transition for every non-terminal configuration. Here, Shift is blocked if the word *i* to be shifted is the last word of a span and #cc > 0. But in this case, one of the other three transitions must be permissible. If #cc = 1, then Right-Arc_{l} is permissible; if #cc > 1 and the word on top of the stack does not have a head, then Left-Arc_{l} is permissible; and if #cc > 1 and the word on top of the stack already has a head, then Reduce is permissible (as #cc > 1 rules out the possibility that the word on top of the stack has its head outside the span). In order to obtain linear parsing complexity, all preconditions should be verifiable in constant time. This can be achieved during initial sentence construction by recording the span *s*(*i*) for every word *i* (with a dummy span for words that are not inside a span) and by updating *r*(*s*) (for every span *s*) and #cc as described herein.

Finally, we note that the span-constrained system is sound and complete in the sense that it derives all and only pdts compatible with a given span constraint set *S*_{C} for a sentence *x*. Soundness follows from the observation that failure to have a connected subgraph *G*_{[i,j]} for some span [*i*, *j*] ∈ *S*_{C} can only arise from pushing *j* onto the stack in a Shift with #cc > 0 or a Right-Arc_{l} with #cc > 1, which is explicitly ruled out by the added preconditions. Completeness can be established by showing that a transition sequence that derives a pdt *G* compatible with *S*_{C} in the unconstrained system cannot violate any of the added preconditions, which is straightforward but tedious.

*Empirical Case Study: Korean Parsing.* In Korean, white-space-separated tokens correspond to phrasal units (similar to Japanese *bunsetsus*) and not to basic syntactic categories like nouns, adjectives, or verbs. For this reason, a further segmentation step is needed in order to transform the space-delimited tokens to units that are a suitable input for a parser and that will appear as the leaves of a syntactic tree. Here, the white-space boundaries are good candidates for posing hard constraints on the allowed sentence structure, as only a single dependency link is allowed between different phrasal units, and all the other links are phrase-internal. An illustration of the process is given in Figure 3. Experiments on the Korean Treebank from McDonald et al. (2013) show that adding span constraints based on white space indeed improves parsing accuracy for an arc-eager beam search parser (Zhang and Nivre 2011). Unlabeled attachment score increases from an already high 94.10 without constraints to 94.92, and labeled attachment score increases from 89.91 to 90.75.

*Combining Constraints.* What happens if we want to add arc constraints on top of the span constraints? In principle, we can simply take the conjunction of the added preconditions from the arc constraint case and the span constraint case, but some care is required to ensure correctness. First of all, we have to check that the arc constraints are consistent with the span constraints and do not require, for example, that there are two words with outside heads inside the same span. In addition, we need to update the variables *r*(*s*) already in the preprocessing phase in case the arc constraints by themselves fix the designated root because they require a word inside the span to have an outside head or (under the root condition) to have an outside dependent.

## 5. Conclusion

We have shown how the arc-eager transition system for dependency parsing can be modified to take into account both arc constraints and span constraints, without affecting the linear runtime and while preserving natural notions of soundness and completeness. Besides the practical applications discussed in the introduction and case studies, constraints can also be used as partial oracles for parser training.

## Notes

^{1} Although span and arc constraints can easily be added to other dependency parsing frameworks, this often affects parsing complexity. For example, in graph-based parsing (McDonald, Crammer, and Pereira 2005) arc constraints can be enforced within the *O*(*n*^{3}) Eisner algorithm (Eisner 1996) by pruning out inconsistent chart cells, but span constraints require the parser to keep track of full subtree end points, which would necessitate the use of *O*(*n*^{4}) algorithms (Eisner and Satta 1999).

^{2} Although the transition system in Nivre (2008) is complete but not sound, it is trivial to show that the system as presented here (with the root node at the end of the buffer) is both sound and complete.

^{4} Data and splits from the Web Treebank of Petrov and McDonald (2012). Commands used for evaluation were sentences from the test set that had a sentence-initial verb root.

## References

## Author notes

Uppsala University, Department of Linguistics and Philology, Box 635, SE-75126, Uppsala, Sweden. E-mail: joakim.nivre@lingfil.uu.se.

Bar-Ilan University, Department of Computer Science, Ramat-Gan, 5290002, Israel. E-mail: yoav.goldberg@gmail.com.

Google, 76 Buckingham Palace Road, London SW1W9TQ, United Kingdom. E-mail: ryanmcd@google.com.