## Abstract

We present a new cubic-time algorithm to calculate the optimal next step in shift-reduce dependency parsing, relative to the ground-truth tree; such a calculation is commonly referred to as a dynamic oracle. Unlike existing algorithms, it is applicable if the training corpus contains non-projective structures. We then show that for a projective training corpus, the time complexity can be improved from cubic to linear.

## 1 Introduction

A deterministic parser may rely on a classifier that predicts the next step, given
features extracted from the present configuration (Yamada and Matsumoto, 2003; Nivre et al., 2004). It was found that accuracy improves if the
classifier is trained not just on configurations that correspond to the
ground-truth, or “gold”, tree, but also on configurations that a
parser would typically reach when the classifier strays from the optimal predictions.
This is known as a *dynamic oracle*.^{1}

The effective calculation of the optimal step for some kinds of parsing relies on ‘arc-decomposability’, as in the case of Goldberg and Nivre (2012, 2013). This generally requires a projective training corpus; an attempt to extend this to non-projective training corpora had to resort to an approximation (Aufrant et al., 2018). It is known how to calculate the optimal step for a number of non-projective parsing algorithms, however (Gómez-Rodríguez et al., 2014; Gómez-Rodríguez and Fernández-González, 2015; Fernández-González and Gómez-Rodríguez, 2018a); see also de Lhoneux et al. (2017).

Ordinary shift-reduce dependency parsing is known at least since Fraser (1989); see also Nasr (1995). Nivre (2008) calls it “arc-standard parsing.” For shift-reduce dependency parsing, calculation of the optimal step is regarded as difficult. The best known algorithm is cubic and is only applicable if the training corpus is projective (Goldberg et al., 2014). We present a new cubic-time algorithm that is also applicable to non-projective training corpora. Moreover, its architecture is modular, expressible as a generic tabular algorithm for dependency parsing plus a context-free grammar that expresses the allowable transitions of the parsing strategy. This differs from approaches that require specialized tabular algorithms for different kinds of parsing (Gómez-Rodríguez et al., 2008; Huang and Sagae, 2010; Kuhlmann et al., 2011).

The generic tabular algorithm is interesting in its own right, and can be used to determine the optimal projectivization of a non-projective tree. This is not to be confused with pseudo-projectivization (Kahane et al., 1998; Nivre and Nilsson, 2005), which generally has a different architecture and is used for a different purpose, namely, to allow a projective parser to produce non-projective structures, by encoding non-projectivity into projective structures before training, and then reconstructing potential non-projectivity after parsing.

A presentational difference with earlier work is that we do not define optimality in terms of “loss” or “cost” functions but directly in terms of attainable accuracy. This perspective is shared by Straka et al. (2015), who also relate accuracies of competing steps, albeit by means of actual parser output and not in terms of best attainable accuracies.

We further show that if the training corpus is projective, then the time complexity can be reduced to linear. To achieve this, we develop a new approach that excludes computations whose accuracies are guaranteed not to exceed the accuracies of the remaining computations. The main theoretical conclusion is that arc-decomposability is not a necessary requirement for efficient calculation of the optimal step.

Despite advances in unrestricted non-projective parsing, as, for example, Fernández-González and Gómez-Rodríguez (2018b), many state-of-the-art dependency parsers are projective, as, for example, Qi and Manning (2017). One main practical contribution of the current paper is that it introduces new ways to train projective parsers using non-projective trees, thereby enlarging the portion of trees from a corpus that is available for training. This can be done either after applying optimal projectivization, or by computing optimal steps directly for non-projective trees. This can be expected to lead to more accurate parsers, especially if a training corpus is small and a large proportion of it is non-projective.

## 2 Preliminaries

In this paper, a *configuration* (for sentence length *n*) is a 3-tuple (*α*, *β*, *T*) consisting of a *stack* *α*, which is a string of integers, each between 0 and *n*, a *remaining input* *β*, which is a suffix of the string 1 ⋯ *n*, and a set *T* of pairs (*a*, *a′*) of integers, with 0 ≤ *a* ≤ *n* and 1 ≤ *a′* ≤ *n*. Further, *αβ* is a subsequence
of 0 1 ⋯ *n*, starting with 0. Integer 0 represents an
artificial input position, not corresponding to any actual token of an input
sentence.

An integer *a′* (1 ≤ *a′* ≤ *n*) occurs as the second element of a pair (*a*, *a′*) ∈ *T* if and only if it does
not occur in *αβ*. Furthermore, for each *a′* there is at most one *a* such that
(*a*, *a′*) ∈ *T*. If
(*a*, *a′*) ∈ *T* then *a′* is generally called a *dependent* of *a*, but as we will frequently need concepts from graph theory in
the remainder of this article, we will consistently call *a′* a *child* of *a* and *a* the *parent* of *a′*; if *a′* < *a* then *a′* is a *left child* and if *a* < *a′* then it is a *right
child*. The terminology is extended in the usual way to include *descendants* and *ancestors*. Pairs
(*a*, *a′*) will henceforth be called *edges*.

For sentence length *n*, the *initial configuration* is
(0, 1 2 ⋯ *n*, ∅), and a *final
configuration* is of the form
(0,*ε*,*T*), where *ε* denotes the empty string. The three transitions of
shift-reduce dependency parsing are given in Table 1. By *step* we mean the application of a transition on a
particular configuration. By *computation* we mean a series of steps,
the formal notation of which uses ⊢^{*}, the reflexive,
transitive closure of ⊢. If (0, 1 2 ⋯ *n*,
∅)⊢^{*}(0, *ε*, *T*), then *T* represents a tree, with 0 as root
element, and *T* is *projective*, which means that for
each node, the set of its descendants (including that node itself) is of the form
{*a*, *a* + 1,…, *a′* − 1, *a′*}, for
some *a* and *a′*. In general, a *dependency tree* is any tree of nodes labelled 0, 1, …, *n*, with 0 being the root.
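
The interval characterization of projectivity lends itself to a direct check. A minimal sketch (the representation, a tree as a set of (parent, child) pairs, is ours):

```python
def is_projective(n, edges):
    """Check projectivity of a dependency tree over nodes 0..n.

    edges: set of (parent, child) pairs; node 0 is the root.
    A tree is projective iff for each node, the set of its descendants
    (including the node itself) is an interval {a, a+1, ..., a'}.
    """
    children = {a: [] for a in range(n + 1)}
    for parent, child in edges:
        children[parent].append(child)

    def descendants(a):
        result = {a}
        for c in children[a]:
            result |= descendants(c)
        return result

    for a in range(n + 1):
        d = descendants(a)
        # an interval of integers has exactly max - min + 1 elements
        if max(d) - min(d) + 1 != len(d):
            return False
    return True
```

For instance, the tree with edges 0→2, 2→1, 2→3 is projective, while 0→1, 0→2, 1→3, 2→4 is not (the descendants of node 1 are {1, 3}, which is not an interval).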

| transition | effect | condition |
| --- | --- | --- |
| **shift** | (*α*, *bβ*, *T*) ⊢ (*αb*, *β*, *T*) | |
| **reduce_left** | (*αa*_{1}*a*_{2}, *β*, *T*) ⊢ (*αa*_{1}, *β*, *T* ∪ {(*a*_{1}, *a*_{2})}) | |
| **reduce_right** | (*αa*_{1}*a*_{2}, *β*, *T*) ⊢ (*αa*_{2}, *β*, *T* ∪ {(*a*_{2}, *a*_{1})}) | provided \|*α*\| > 0 |

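
The transitions of Table 1 can be sketched as follows (a minimal implementation of our own; a configuration is a triple of stack, remaining input, and edge set):

```python
def shift(config):
    """(α, bβ, T) ⊢ (αb, β, T): move the next input symbol onto the stack."""
    stack, buffer, edges = config
    assert buffer, "shift requires nonempty remaining input"
    return stack + [buffer[0]], buffer[1:], edges

def reduce_left(config):
    """(αa1a2, β, T) ⊢ (αa1, β, T ∪ {(a1, a2)}): a2 becomes a child of a1."""
    stack, buffer, edges = config
    assert len(stack) >= 2
    a1, a2 = stack[-2], stack[-1]
    return stack[:-2] + [a1], buffer, edges | {(a1, a2)}

def reduce_right(config):
    """(αa1a2, β, T) ⊢ (αa2, β, T ∪ {(a2, a1)}): a1 becomes a child of a2,
    provided |α| > 0, i.e. the stack holds at least three elements."""
    stack, buffer, edges = config
    assert len(stack) >= 3, "reduce_right requires |α| > 0"
    a1, a2 = stack[-2], stack[-1]
    return stack[:-2] + [a2], buffer, edges | {(a2, a1)}
```

Starting from the initial configuration ([0], [1, 2], ∅), the sequence shift, shift, reduce_left, reduce_left yields the final configuration ([0], [], {(1, 2), (0, 1)}).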

The *score* of a tree *T* for a sentence is the number
of edges that it has in common with a given gold tree *T*_{g} for that sentence, or
formally |*T* ∩ *T*_{g}|. The *accuracy* is the score divided by *n*. Note that neither tree need be projective for the score to be defined, but in this paper the first tree, *T*, will normally be projective. Where indicated, *T*_{g} too is assumed to be projective.
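
In code, under an edge-set representation (names ours):

```python
def score(T, T_gold):
    """Number of edges T shares with the gold tree: |T ∩ T_gold|."""
    return len(set(T) & set(T_gold))

def accuracy(T, T_gold, n):
    """Score divided by sentence length n."""
    return score(T, T_gold) / n
```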

Assume an arbitrary configuration (*α*, *β*, *T*) for sentence length *n* and assume a gold tree *T*_{g} for a sentence of that same
length, and assume three steps (*α*, *β*, *T*) ⊢
(*α*_{i}, *β*_{i}, *T*_{i}), with *i* =
1, 2, 3, obtainable by a **shift**, **reduce_left**, or **reduce_right**, respectively. (If *β* = *ε*, or |*α*| ≤ 2, then
naturally some of the three transitions need to be left out of consideration.) We
now wish to calculate, for each of *i* = 1, 2, 3, the maximum value
of |*T*_{i}*′*∩ *T*_{g}|, for any *T*_{i}*′* such that (*α*_{i}, *β*_{i}, *T*_{i})
⊢^{*}(0, *ε*, *T*_{i}*′*).
For *i* = 1, 2, 3, let *σ*_{i} be this maximum
value. The absolute scores *σ*_{i} are strictly speaking irrelevant; the relative values determine which is the optimal
step, or which *are* the optimal steps, to reach a tree with the
highest score. Note that |{*i* |*σ*_{i} =
max_{j} *σ*_{j}}| is
either 1, 2, or 3. In the remainder of this article, we find it more convenient to
calculate *σ*_{i} −|*T* ∩ *T*_{g}| for each *i*—or, in other words, gold edges that were previously found
are left out of consideration.

We can put restrictions on the set of allowable computations
(*α*, *β*, *T*)
⊢^{*}(0, *ε*, *T* ∪ *T′*). The *left-before-right* strategy demands that all edges (*a*, *a′*)
∈ *T′* with *a′* < *a* are found before any edges (*a*, *a′*) ∈ *T′* with *a* < *a′*, for each *a* that is rightmost in *α* or that occurs in *β*. The *strict left-before-right* strategy in addition disallows edges (*a*, *a′*) ∈ *T′* with *a′* < *a* for each *a* in *α* other than the rightmost element. The intuition is that
a non-strict strategy allows us to correct mistakes already made: If we have already
pushed other elements on top of a stack element *a*, then *a* will necessarily obtain right children before it occurs on
top of the stack again, when it can take (more) left children. By contrast, the
strict strategy would not allow these left children.

The definition of the *right-before-left* strategy is symmetric to
that of the left-before-right strategy, but there is no independent *strict
right-before-left* strategy. In this paper we consider all three
strategies in order to emphasize the power of our framework. It is our understanding
that Goldberg et al. (2014) does not commit
to any particular strategy.

## 3 Tabular Dependency Parsing

We here consider context-free grammars (CFGs) of a special form, with nonterminals in *N* ∪
(*N*_{ℓ} × *N*_{r}), for appropriate finite sets *N*, *N*_{ℓ}, *N*_{r}, which need not be disjoint. The
finite set of terminals is denoted *Σ*. There is a single
start symbol *S* ∈ *N*. Rules are of one of the
forms:

- (*B*, *C*) → *a*,
- *A* → (*B*, *C*),
- (*B′*, *C*) → *A* (*B*, *C*),
- (*B*, *C′*) → (*B*, *C*) *A*,

where *A* ∈ *N*, *B*, *B′*∈ *N*_{ℓ}, *C*, *C′*∈ *N*_{r}, *a* ∈ *Σ*. A first additional requirement is that if
(*B′*, *C*) → *A* (*B*, *C*) is a rule, then
(*B′*, *C′*) → *A* (*B*, *C′*), for any *C′*∈ *N*_{r}, is also a rule, and if
(*B*, *C′*) → (*B*, *C*) *A* is a rule, then
(*B′*, *C′*) →
(*B′*, *C*) *A*, for any *B′*∈ *N*_{ℓ}, is also a rule. This
justifies our notation of such rules in the remainder of this paper as
(*B′*, _ ) → *A* (*B*, _ ) and ( _, *C′*)
→ ( _, *C*) *A*, respectively. These two
kinds of rules correspond to attachment of left and right children, respectively, in
dependency parsing. Secondly, we require that there is precisely one rule
(*B*, *C*) → *a* for each *a* ∈ *Σ*.
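
The first requirement can be read as an instruction to close a set of base rules under all possible second components. A sketch (the tuple encoding of rules is ours):

```python
def expand_rules(N_l, N_r, left_base, right_base):
    """Expand underscore rules into concrete CFG rules.

    left_base:  set of (B2, A, B), meaning (B2, _) -> A (B, _)
    right_base: set of (C2, C, A), meaning (_, C2) -> (_, C) A
    Returns concrete rules ((B2, C), 'left', A, (B, C)), closed over
    all C in N_r, and ((B, C2), 'right', (B, C), A), closed over all
    B in N_l, as the requirement demands.
    """
    rules = set()
    for B2, A, B in left_base:
        for C in N_r:  # the right component is left untouched
            rules.add(((B2, C), "left", A, (B, C)))
    for C2, C, A in right_base:
        for B in N_l:  # the left component is left untouched
            rules.add(((B, C2), "right", (B, C), A))
    return rules
```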

Note that the additional requirements make the grammar explicitly
“split” in the sense of Eisner and Satta (1999), Eisner (2000), and Johnson (2007). That
is, the two processes of attaching left and right children, respectively, are
independent, with rules (*B*, *C*) → *a* creating “initial states” *B* and *C*, respectively, for these two processes. Rules of the form *A* → (*B*, *C*) then
combine the end results of these two processes, possibly placing constraints on
allowable combinations of *B* and *C*.

To bring out the relation between our subclass of CFGs and bilexical grammars, one could explicitly write (*B*, *C*)(*a*) → *a*, *A*(*a*) → (*B*, *C*)(*a*), (*B′*, _)(*b*) → *A*(*a*) (*B*, _)(*b*), and ( _, *C′*)(*c*) → ( _, *C*)(*c*) *A*(*a*).

Purely symbolic parsing is extended to weighted parsing much as usual, except that
instead of attaching weights to rules, we attach a score *w*(*i*, *j*) to each pair
(*i*, *j*), which is a potential edge. This can be
done for any semiring. In the semiring we will first use, a value is either a
non-negative integer or −∞. Further, *w*_{1} ⊕ *w*_{2} = max(*w*_{1}, *w*_{2}), and *w*_{1} ⊗ *w*_{2} = *w*_{1} + *w*_{2} if *w*_{1} ≠ −∞ and *w*_{2} ≠ −∞, while *w*_{1} ⊗ *w*_{2} = −∞ otherwise. Naturally, the identity element of ⊕ is 𝟘 = −∞ and the identity element of ⊗ is 𝟙 = 0.
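
A sketch of this semiring's two operations (Python's float arithmetic would handle −∞ on its own, but the explicit guard mirrors the definition):

```python
NEG_INF = float("-inf")  # the zero element 𝟘 of the semiring

def oplus(w1, w2):
    """⊕: keep the better of two scores; identity is -inf."""
    return max(w1, w2)

def otimes(w1, w2):
    """⊗: combine scores by addition; -inf is absorbing; identity is 0."""
    if w1 == NEG_INF or w2 == NEG_INF:
        return NEG_INF
    return w1 + w2
```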

Tabular weighted parsing can be realized following Eisner and Satta (1999). We assume the input is a string *a*_{0}*a*_{1} ⋯ *a*_{n} ∈ *Σ*^{*}, with *a*_{0} being the prospective root of a tree. Table 2 presents the cubic-time algorithm in
the form of a system of recursive equations. With the semiring we chose above, *W*_{ℓ}(*B*, *i*, *j*) represents the highest score of any right-most derivation of the form (*B*, _) ⇒ *A*_{1} (*B*_{1}, _) ⇒ *A*_{1}*A*_{2} (*B*_{2}, _) ⇒^{*} *A*_{1} ⋯ *A*_{m} (*B*_{m}, _) ⇒ *A*_{1} ⋯ *A*_{m} *a*_{j} ⇒^{*} *a*_{i} ⋯ *a*_{j}, for some *m* ≥ 0, and *W*_{r}(*C*, *i*, *j*) has symmetric meaning. Intuitively, *W*_{ℓ}(*B*, *i*, *j*) considers *a*_{j} and its left dependents and *W*_{r}(*C*, *i*, *j*) considers *a*_{i} and its right dependents. A
value *W*_{ℓ}(*B*, *C*, *i*, *j*), or *W*_{r}(*B*, *C*, *i*, *j*), represents the
highest score combining *a*_{i} and its
right dependents and *a*_{j} and its left
dependents, meeting in the middle at some *k*, including also an edge
from *a*_{i} to *a*_{j}, or from *a*_{j} to *a*_{i}, respectively.

One may interpret the grammar in Table 3 as
encoding all possible computations of a shift-reduce parser, and thereby all
projective trees. As there is only one way to instantiate the underscores, we obtain
rule (*S*, *S*) → (*S*, *S*) *S*, which corresponds to **reduce_left**, and rule (*S*, *S*) → *S* (*S*, *S*), which corresponds
to **reduce_right**.

Figure 1 presents a parse tree for the grammar
and the corresponding dependency tree. Note that if we are not given a particular
strategy, such as left-before-right, then the parse tree underspecifies whether left
children or right children are attached first. This is necessarily the case because
the grammar is split. Therefore, the computation in this example may consist of
three **shift**s, followed by one **reduce_left**, one **reduce_right**, and one **reduce_left**, or it may consist of two **shift**s, one **reduce_right**, one **shift**, and two **reduce_left**s.

For a given gold tree *T*_{g}, which may or
may not be projective, we let *w*(*i*, *j*) = *δ*_{g}(*i*, *j*), where we define *δ*_{g}(*i*, *j*) = 1 if (*i*, *j*) ∈ *T*_{g} and *δ*_{g}(*i*, *j*) = 0 otherwise. With the grammar from Table 3, the value *W* found by weighted parsing
is now the score of the most accurate projective tree. By backtracing from *W* as usual, we can construct the (or more correctly, a) tree
with that highest accuracy. We have thereby found an effective way to projectivize a
treebank in an optimal way. With a different semiring, we can count the number of
trees with the highest accuracy, which reflects the degree of “choice”
when projectivizing a treebank.
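
As an illustration of optimal projectivization, the following sketch computes the score of the best projective tree directly with a split (Eisner-style) dynamic program over the weights *w*(*i*, *j*) = *δ*_{g}(*i*, *j*), rather than via the grammar of Table 3; it should compute the same value *W*. The code is ours, following the standard formulation of Eisner's algorithm.

```python
def max_projective_score(n, gold_edges):
    """Score of the best projective tree, relative to a (possibly
    non-projective) gold tree over nodes 0..n, with node 0 the root.

    gold_edges: set of (parent, child); w(i, j) = 1 if (i, j) is gold, else 0.
    Runs in O(n^3) time via Eisner's split dynamic program.
    """
    w = lambda h, d: 1 if (h, d) in gold_edges else 0
    NEG = float("-inf")
    # C[i][j][d]: best complete span i..j, headed at i (d=1) or j (d=0)
    # I[i][j][d]: best incomplete span, with the edge between i and j
    C = [[[0, 0] for _ in range(n + 1)] for _ in range(n + 1)]
    I = [[[NEG, NEG] for _ in range(n + 1)] for _ in range(n + 1)]
    for length in range(1, n + 1):
        for i in range(0, n + 1 - length):
            j = i + length
            # join two complete halves, then add the edge i->j or j->i
            best = max(C[i][r][1] + C[r + 1][j][0] for r in range(i, j))
            I[i][j][1] = best + w(i, j)
            I[i][j][0] = best + w(j, i)
            # extend an incomplete span into a complete one
            C[i][j][1] = max(I[i][r][1] + C[r][j][1] for r in range(i + 1, j + 1))
            C[i][j][0] = max(C[i][r][0] + I[r][j][0] for r in range(i, j))
    return C[0][n][1]
```

For a projective gold tree the result equals *n*; for the non-projective gold tree 0→1, 0→2, 1→3, 2→4, the best projective tree recovers three of the four gold edges.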

## 4 𝓞(*n*^{3}) Time Algorithm

In a computation starting from a configuration
(*a*_{0}⋯*a*_{k},*b*_{1}⋯*b*_{m},*T*),
not every projective parse of the string *a*_{0}⋯*a*_{k}*b*_{1}⋯*b*_{m} is achievable. The structures that are achievable are captured by the grammar in Table 4, with *P* for
prefix and *S* for suffix (also for “start symbol”).
Nonterminals *P* and (*P*, *P*)
correspond to a node *a*_{i} (0 ≤ *i* < *k*) that does not have children.
Nonterminal *S* corresponds to a node that has either *a*_{k} or some *b*_{j} (1 ≤ *j* ≤ *m*) among its descendants. This then
means that the node will appear on top of the stack at some point in the
computation. Nonterminal (*S*,*S*) also corresponds to
a node that has one of the rightmost *m* + 1 nodes among its
descendants, and, in addition, if it itself is not one of the rightmost *m* + 1 nodes, then it must have a left child.

Nonterminal (*P*,*S*) corresponds to a node *a*_{i} (0 ≤ *i* < *k*) that has *a*_{k} among its descendants but that does not have a left child. Nonterminal
(*S*, *P*) corresponds to a node *a*_{i} (0 ≤ *i* < *k*) that has a left child but no right children. For *a*_{i} to be given a left child, it is
required that it eventually appear on top of the stack. This requirement is encoded
in the absence of a rule with right-hand side (*S*, *P*). In other words, (*S*, *P*)
cannot be part of a successful derivation, unless the rule (*S*, *S*) → (*S*, *P*) *S* is subsequently used, which then corresponds to giving *a*_{i} a right child that has *a*_{k} among its descendants.

Figure 2 shows an example. Note that we can
partition a parse tree into “columns”, each consisting of a path
starting with a label in *N*, then a series of labels in *N*_{ℓ} × *N*_{r} and ending with a label in *Σ*.

Suppose we have a configuration
(*a*_{0}⋯*a*_{k},*b*_{1}⋯*b*_{m},*T*)
for sentence length *n*, which implies *k* + *m* ≤ *n*. We need to decide whether a **shift**, **reduce_left**, or **reduce_right** should be done in order to achieve the highest accuracy, for given
gold tree *T*_{g}. For this, we calculate
three values *σ*_{1}, *σ*_{2} and *σ*_{3},
and determine which is highest.

The first value *σ*_{1} is obtained by investigating
the configuration
(*a*_{0}⋯*a*_{k}*b*_{1},*b*_{2}⋯*b*_{m},∅)
resulting after a shift. We run our generic tabular algorithm for the grammar in Table 4, for input *p*^{k +1}*s*^{m}, to obtain *σ*_{1} = *W*. The scores are
obtained by translating indices of *a*_{0}⋯*a*_{k}*b*_{1}⋯*b*_{m} = *c*_{0}⋯*c*_{k +m} to indices in the original input, that is, we let *w*(*i*, *j*) = *δ*_{g}(*c*_{i}, *c*_{j}). However, the shift, which
pushes an element on top of *a*_{k}, implies
that *a*_{k} will obtain right children
before it can obtain left children. If we assume the left-before-right strategy,
then we should avoid that *a*_{k} obtains
left children. We could do that by refining the grammar, but find it easier to set *w*(*k*, *i*) = −∞
for all *i* < *k*.

For the second value *σ*_{2}, we investigate the
configuration (*a*_{0} ⋯ *a*_{k−1}, *b*_{1} ⋯ *b*_{m}, ∅) resulting after a **reduce_left**. The same grammar and algorithm are used, now for input *p*^{k−1}*s*^{m+1}. With *a*_{0} ⋯ *a*_{k−1}*b*_{1} ⋯ *b*_{m} = *c*_{0} ⋯ *c*_{k+m−1}, we let *w*(*i*, *j*) = *δ*_{g}(*c*_{i}, *c*_{j}). We let *σ*_{2} = *W* ⊗ *δ*_{g}(*a*_{k−1}, *a*_{k}). In case of a strict
left-before-right strategy, we set *w*(*k* − 1, *i*) = −∞ for *i* < *k* − 1, to avoid that *a*_{k−1} obtains left
children after having obtained a right child *a*_{k}.

If *k* ≤ 1 then the third value is *σ*_{3} = −∞, as no **reduce_right** is applicable. Otherwise we investigate
(*a*_{0}⋯*a*_{k−2}*a*_{k},*b*_{1}⋯*b*_{m},∅).
The same grammar and algorithm are used as before, and *w*(*i*,*j*) = *δ*_{g}(*c*_{i},*c*_{j})
with *a*_{0}⋯*a*_{k−2}*a*_{k}*b*_{1}⋯*b*_{m} = *c*_{0}⋯*c*_{k +m−1}. Now *σ*_{3} = *W* ⊗ *δ*_{g}(*a*_{k},*a*_{k−1}).
In case of a right-before-left strategy, we set *w*(*k* − 1, *i*) =
−∞ for *k* < *i*.

We conclude that the time complexity of calculating the optimal step is three times
the time complexity of the algorithm of Table 2, hence cubic in *n*.

For a proof of correctness, it is sufficient to show that each parse tree by the
grammar in Table 4 corresponds to a
computation with the same score, and conversely that each computation corresponds to
an equivalent parse tree. Our grammar has spurious ambiguity, just as the
shift-reduce parser from Table 1, and this
can be resolved in the same way, depending on whether the intended strategy is
(non-)strict left-before-right or right-before-left, and whether the configuration
is the result of a **shift**, **reduce_left**, or **reduce_right**. Concretely, we can restrict parse trees to attach children lower in
the tree if they would be attached earlier in the computation, and thereby we obtain
a bijection between parse trees and computations. For example, in the middle column
of the parse tree in Figure 2, the
(*P*, *S*) and its right child occur below the
(*S*, *S*) and its left child, to indicate that the **reduce_left** precedes the **reduce_right**.

The proof in one direction assumes a parse tree, which is traversed to gather the
steps of a computation. This traversal is post-order, from left to right, but
skipping the nodes representing stack elements below the top of the stack, starting
from the leftmost node labeled *s*. Each node *ν* with a label in *N*_{ℓ} × *N*_{r} corresponds to a step. If the
child of *ν* is labeled *s*, then we have a **shift**, and if it has a right or left child with a label in *N*, then it corresponds to a **reduce_left** or **reduce_right**, respectively. The configuration resulting from that step can be
constructed as sketched in Figure 3. We follow
the shortest path from *ν* to the root. All the leaves to the
right of the path correspond to the remaining input. For the stack, we gather the
leaves in the columns of the nodes on the path, as well as those of the left
children of nodes on the path. Compare this with the concept of *right-sentential forms* in the theory of context-free
parsing.

For a proof in the other direction, we can make use of existing parsing theory, which
tells us how to translate a computation of the shift-reduce parser to a dependency
structure, which in turn is easily translated to an undecorated parse tree. It then
remains to show that the nodes in that tree can be decorated (in fact in a unique
way), according to the rules from Table 4.
This is straightforward given the meanings of *P* and *S* described earlier in this section. Most notably, the absence
of a rule with right-hand side ( _, *P*)*P* does not prevent the decoration of a tree that was constructed out of a computation,
because a reduction involving two nodes within the stack is only possible if the
rightmost of these nodes eventually appears on top of the stack, which is only
possible when the computation has previously made *a*_{k} a descendant of that node,
hence we would have *S* rather than *P*.

## 5 𝓞(*n*^{2}) Time Algorithm

Assume a given configuration (*α*, *β*, *T*) as before, resulting from a shift
or reduction. Let *α* = *a*_{0}⋯*a*_{k}, *A* = {*a*_{0}, …, *a*_{k}}, and let *B* be the set of nodes in *β*. We again
wish to calculate the maximum value of |*T′*∩ *T*_{g}| for any *T′* such that (*α*, *β*, ∅) ⊢^{*}(0, *ε*, *T′*), but now under the
assumption that *T*_{g} is projective. Let
us call this value *σ*_{max}. We
define *w* in terms of *δ*_{g} as in the previous
section, setting *w*(*i*, *j*) =
−∞ for an appropriate subset of pairs (*i*, *j*) to enforce a strategy that is (non-)strict left-before-right
or right-before-left.

The edges in *T*_{g} ∩
(*B* × *B*) partition the remaining input
into maximal connected components. Within these components, a node *b* ∈ *B* is called *critical* if it satisfies one or both of the following two
conditions:

- At least one descendant of *b* (according to *T*_{g}) is not in *B*.
- The parent of *b* (according to *T*_{g}) is not in *B*.
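
A sketch of identifying the critical nodes (the representation, with the gold tree as a child-to-parent map, is ours):

```python
def critical_nodes(parent, B):
    """Return the set of critical nodes within the remaining input B.

    parent: dict mapping each node to its parent in the gold tree T_g.
    A node b in B is critical if its parent (according to T_g) is not
    in B, or at least one of its descendants is not in B.
    """
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)

    def has_outside_descendant(b):
        # True if some proper descendant of b lies outside B
        return any(c not in B or has_outside_descendant(c)
                   for c in children.get(b, []))

    return {b for b in B
            if parent.get(b) not in B or has_outside_descendant(b)}
```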

Let *B*_{crit} ⊆ *B* be
the set of critical nodes, listed in order as *b*_{1},…,*b*_{m},
and let *B*_{ncrit} = *B* ∖ *B*_{crit}. Figure 4 sketches three components as well as edges in *T*_{g} ∩ (*A* × *B*) and *T*_{g} ∩ (*B* × *A*). Component *C*_{1}, for example, contains the critical elements *b*_{1}, *b*_{2}, and *b*_{3}. The triangles under *b*_{1}, …, *b*_{7} represent subtrees consisting of edges leading to non-critical nodes. For each *b* ∈ *B*_{crit},
|*T*_{g} ∩
({*b*}× *A*)| is zero or
more, or in words, critical nodes have zero or more children in the stack. Further,
if (*a*, *b*) ∈ *T*_{g} ∩ (*A* × *B*_{crit}), then *b* is the rightmost critical node in a component; examples are *b*_{5} and *b*_{7} in the
figure.

Let *T*_{max} be any tree such that
(*α*, *β*, ∅)
⊢^{*}(0, *ε*, *T*_{max}) and
|*T*_{max} ∩ *T*_{g}| = *σ*_{max}. Then we can find
another tree *T*_{max}*′* that
has the same properties and in addition satisfies:

1. *T*_{g} ∩ (*B* × *B*_{ncrit}) ⊆ *T*_{max}*′*,
2. *T*_{max}*′* ∩ (*B*_{ncrit} × *A*) = ∅,
3. *T*_{max}*′* ∩ (*B* × *B*_{crit}) ⊆ *T*_{g},

or in words, (1) the subtrees rooted in the critical nodes are entirely included, (2) no child of a non-critical node is in the stack, and (3) within the remaining input, all edges to critical nodes are gold. Very similar observations were made before by Goldberg et al. (2014), and therefore we will not give full proofs here. The structure of the proof is in each case that all violations of a property can be systematically removed, by rearranging the computation, in a way that does not decrease the score.

We need two more properties:

4. If (*a*, *b*) ∈ *T*_{max}*′* ∩ (*A* × *B*_{crit}) ∖ *T*_{g}, then either:
   - *b* is the rightmost critical node in its component, or
   - there is (*b*, *a′*) ∈ *T*_{max}*′* ∩ *T*_{g}, for some *a′* ∈ *A*, and there is at least one other critical node *b′* to the right of *b*, but in the same component, such that (*b′*, *a*″) ∈ *T*_{max}*′* ∩ *T*_{g} or (*a*″, *b′*) ∈ *T*_{max}*′* ∩ *T*_{g}, for some *a*″ ∈ *A*.
5. If (*b*, *a*) ∈ *T*_{max}*′* ∩ (*B*_{crit} × *A*) ∖ *T*_{g}, then there is (*b*, *a′*) ∈ *T*_{max}*′*, for some *a′* ∈ *A*, such that *a′* is a sibling of *a* immediately to its right.

Figure 5, to be discussed in more detail later, illustrates property (4) for the non-gold edge from *a*_{4}; this edge leads to *b*_{4} (which has an outgoing gold edge to *a*_{5}) rather than to *b*_{5} or *b*_{6}. It further respects property (4) because of the gold edges connected to *b*_{7} and *b*_{8}, which occur to the right of *b*_{4} but in the same component. Property (5) is illustrated by the non-gold edge from *b*_{3} to *a*_{8}, which has sibling *a*_{9} immediately to the right.

The proof that property (4) may be assumed to hold, without loss of generality, again
involves making local changes to the computation, in particular replacing the *b* in an offending non-gold edge (*a*, *b*) ∈ *A* × *B*_{crit} by another critical node *b′* further to the left or at the right end of the
component. Similarly, for property (5), if we have an offending non-gold edge
(*b*, *a*), then we can rearrange the computation,
such that node *a* is reduced not into *b* but into
one of the descendants of *b* in *B* that was given
children in *A*. If none of the descendants of *b* in *B* was given children in *A*, then *a* can instead be reduced into its neighbor in the stack
immediately to the left, without affecting the score.

By properties (1)–(3), we can from here on ignore non-critical nodes, so that
the remaining task is to calculate *σ*_{max} −|*B*_{ncrit}|. In fact, we go
further than that and calculate *σ*_{max} − *M*, where *M* =
|*T*_{g} ∩
(*B* × *B*)|. In other words, we take for
granted that the score can be at least as much as the number of gold edges within
the remaining input, which leaves us with the task of counting the additional gold
edges in the optimal computation. For any given component *C* we can
consider the sequence of edges that the computation creates between *A* and *C*, in the order in which they are
created:

- for the first gold edge between *C* and *A*, we count +1,
- for each subsequent gold edge between *C* and *A*, we count +1,
- we ignore interspersed non-gold edges from *C* to *A*,
- but following a non-gold edge from *A* to *C*, the immediately next gold edge between *C* and *A* is *not* counted, because that non-gold edge implies that another gold edge in *B*_{crit} × *B*_{crit} cannot be created.

This is illustrated by Figure 5. For
(*b*_{3}, *a*_{9}) we count +1, it
being the first gold edge connected to the component. For the subsequent three gold
edges, we count +1 for each, ignoring the non-gold edge
(*b*_{3}, *a*_{8}). The non-gold
edge (*a*_{4}, *b*_{4}) implies that
the parent of *b*_{4} is already determined. One would then
perhaps expect we count −1 for non-creation of
(*b*_{5}, *b*_{4}), considering
(*b*_{5}, *b*_{4}) was already
counted as part of *M*. Instead, we let this −1 cancel out
against the following (*b*_{7}, *a*_{3}), by letting the latter contribute +0 rather than
+1. The subsequent edge (*b*_{7}, *a*_{2}) again contributes +1, but the non-gold edge
(*a*_{1}, *b*_{7}) means that the
subsequent (*a*_{0}, *b*_{8})
contributes +0. Hence the net count in this component is 5.
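
Our reading of the counting rules, as a small function over the sequence of edges between a component and the stack (the encoding of edges as direction/gold pairs is ours):

```python
def component_count(edge_seq):
    """Count gold edges between a component C and the stack A.

    edge_seq: the edges between A and C in order of creation, each a
    pair (direction, gold) with direction 'C->A' or 'A->C' and gold a
    boolean. A gold edge counts +1, except the first gold edge after a
    non-gold A->C edge, which counts +0 (it cancels against a gold edge
    in B_crit x B_crit that can no longer be created); non-gold C->A
    edges are ignored.
    """
    count = 0
    suppress_next_gold = False
    for direction, gold in edge_seq:
        if gold:
            if suppress_next_gold:
                suppress_next_gold = False  # contributes +0
            else:
                count += 1
        elif direction == "A->C":
            suppress_next_gold = True
        # non-gold 'C->A' edges: ignored
    return count
```

On an edge sequence matching our reading of the Figure 5 example, the function yields the net count 5.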

The main motivation for properties (1)–(5) is that they limit the input
positions that can be relevant for a node that is on top of the stack, thereby
eliminating one factor *m* from the time complexity. More
specifically, the gold edges relate a stack element to a “current critical
node” in a “current component”. We need to distinguish however
between three possible *states*:

- 𝒩 (none): none of the critical nodes from the current component were shifted onto the stack yet,
- 𝒞 (consumed): the current critical node was ‘consumed’, by its having been shifted and assigned a parent,
- ℱ (fresh): the current critical node was not consumed, but at least one of the preceding critical nodes in the same component was consumed.

For 0 ≤ *i* ≤ *k*, we define *p*(*i*) to be the index *j* such
that (*b*_{j}, *a*_{i}) ∈ *T*_{g}, and if there is no such *j*, then *p*(*i*) = ⊥,
where ⊥ denotes ‘undefined’. For 0 ≤ *i* < *k*, we let *p*_{≥ }(*i*) = *p*(*i*) if *p*(*i*) ≠ ⊥, and *p*_{≥ }(*i*) = *p*_{≥ }(*i* + 1) otherwise, and
further *p*_{≥ }(*k*) = *p*(*k*). Intuitively, we seek a critical node
that is the parent of *a*_{i}, or if there
is none, of *a*_{i +1},… We define *c*(*i*) to be the smallest *j* such that (*a*_{i}, *b*_{j}) ∈ *T*_{g}, or in words, the index of
the leftmost child in the remaining input, and *c*(*i*) = ⊥ if there is none.
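
A sketch of these index functions, with ⊥ rendered as `None` (the list and set representations are ours):

```python
def make_index_functions(gold_edges, a, b):
    """Return functions p, p_geq, c over stack a[0..k] and critical
    nodes b[1..m] (b is 1-indexed; index 0 holds a None padding entry).

    p(i): the j with (b[j], a[i]) a gold edge, else None.
    p_geq(i): p(i) if defined, else p_geq(i + 1); p_geq(k) = p(k).
    c(i): smallest j with (a[i], b[j]) a gold edge, else None.
    """
    k, m = len(a) - 1, len(b) - 1

    def p(i):
        for j in range(1, m + 1):
            if (b[j], a[i]) in gold_edges:
                return j
        return None

    def p_geq(i):
        while i <= k:
            if p(i) is not None:
                return p(i)
            i += 1
        return None

    def c(i):
        for j in range(1, m + 1):
            if (a[i], b[j]) in gold_edges:
                return j
        return None

    return p, p_geq, c
```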

As representative element of a component with critical element *b*_{j} we take the critical element
that is rightmost in that component, or formally, we define *R*(*j*) to be the largest *j′* such that *b*_{j′} is an ancestor (by *T*_{g} ∩
(*B*_{crit} × *B*_{crit})) of *b*_{j}. For completeness, we define *R*(⊥) = ⊥. We let *P*(*i*) = *R*( *p*(*i*)) and *P*_{≥ }(*i*) = *R*( *p*_{≥ }(*i*)). Note that *R*(*c*(*i*))
=*c*(*i*) for each *i*. For 0
≤ *i* ≤ *k* and 1 ≤ *j* ≤ *m*, we let *p′*(*i*, *j*) = *p*(*i*) if *P*(*i*)
= *R*(*j*) and *p′*(*i*, *j*) = ⊥
otherwise; or in words, *p′*(*i*, *j*) is the index of the parent of *a*_{i} in the remaining input,
provided it is in the same component as *b*_{j}.

Table 5 presents the algorithm, expressed as a system of recursive equations. Here **score**(*i*, *j*, *q*) represents the maximum number of gold edges (in addition to *M*) in a computation from (*a*_{0}⋯*a*_{i}*a*_{j}, *b*_{ℓ}⋯*b*_{m}, ∅), where *ℓ* depends on the state *q* ∈ {𝒩, 𝒞, ℱ}. If *q* = 𝒩, then *ℓ* is the smallest number such that *R*(*ℓ*) = *P*_{≥}(*j*); critical nodes from the current component were not yet shifted. If *q* = 𝒞, then *ℓ* = *p*_{≥}(*j*) + 1 or *ℓ* = *P*_{≥}(*j*) + 1; this can be related to the two cases distinguished by property (4). If *q* = ℱ, then *ℓ* is greater than the smallest number such that *R*(*ℓ*) = *P*_{≥}(*j*), but smaller than or equal to *p*_{≥}(*j*) or equal to *P*_{≥}(*j*) + 1. Similarly, **score***′*(*i*, *j*) represents the maximum number of gold edges in a computation from (*a*_{0}⋯*a*_{i}*b*_{j}, *b*_{j+1}⋯*b*_{m}, ∅).

For *i* ≥ 0, the value of **score**(*i*, *j*, *q*) is the maximum (by ⊕) of three values. The first corresponds to a reduction of *a*_{j} into *a*_{i}, which turns the stack into *a*_{0}⋯*a*_{i−1}*a*_{i}; this also includes shifts of any remaining right children of *a*_{i}, if there are any, and their reduction into *a*_{i}. Because there is a new top-of-stack, the state is updated using *τ*. The function **nchildren** counts the critical nodes that are children of *a*_{i}. We define **nchildren** in terms of *w* rather than *T*_{g}, as in the case of the right-before-left strategy after a **reduce_right** we would preclude right children of *a*_{k} by setting *w*(*k*, *i*) = −∞ for *k* < *i*. The leftmost of the children, at index *c*(*i*), is not counted (or in other words, 1 is subtracted from the number of children) if it is in the current component *P*_{≥}(*j*) and the state is anything other than ‘none’; here Δ is the indicator function, which returns 1 if its Boolean argument evaluates to true, and 0 otherwise. Figure 6 illustrates one possible case.

The second value corresponds to a reduction of *a*_{i} into *a*_{j}, which turns the stack into *a*_{0}⋯*a*_{i−1}*a*_{j},
leaving the state unchanged as the top of the stack is unchanged. The third value is
applicable if *a*_{j} has parent *b*_{ℓ} that has not yet been
consumed, and it corresponds to a shift of *b*_{ℓ} and a reduction of *a*_{i} into *b*_{ℓ} (and possibly further
shifts and reductions that are implicit), resulting in stack *a*_{0}⋯*a*_{i}*b*_{ℓ}.
If this creates the first gold edge connected to the current component, then we add 1.

For *i* ≥ 0, the value of **score***′*(*i*, *j*) is
the maximum of two values. The first value distinguishes two cases. In the first
case, *a*_{i} does not have a parent in the
same component as *b*_{j}, and *a*_{i} is reduced into *b*_{j} without counting the (non-gold)
edge. In the second case, *a*_{i} is reduced
into its parent, which is *b*_{j} or another
critical node that is an ancestor of *b*_{j}; in this case we count the gold
edge. The second value in the definition of **score***′*(*i*, *j*)
corresponds to a reduction of *b*_{j} into *a*_{i} (as well as shifts of any
critical nodes that are children of *a*_{i},
and their reduction into *a*_{i}), resulting
in stack *a*_{0}⋯*a*_{i−1}*a*_{i}.
The state is updated using *τ′*, in the light of the
new top-of-stack.

The top-level call is **score**(*k* − 1, *k*, 𝒩). As this does not account for right children of
the top of stack *a*_{k}, we need to add **nchildren**(*k*). Putting everything together, we have *σ*_{max} = *M* ⊗ **score**(*k* − 1, *k*, 𝒩) ⊗ **nchildren**(*k*). The time complexity
is quadratic in *k* + *m* ≤ *n*,
given the quadratically many combinations of *i* and *j* in **score**(*i*, *j*, *q*) and **score***′*(*i*, *j*).
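The equations combine scores with ⊕ (maximum) and ⊗ (addition), i.e., they operate in a tropical semiring in which −∞ marks impossible configurations, and they are evaluated by memoization. A minimal Python sketch of these two operations and of the evaluation pattern follows; the toy recursion is a stand-in, not one of the actual equations of Table 5:

```python
from functools import lru_cache

NEG_INF = float("-inf")

def oplus(x, y):
    """The semiring 'max' operation: keep the better of two scores."""
    return max(x, y)

def otimes(x, y):
    """The semiring 'plus' operation: combine scores; -inf (an
    impossible configuration) absorbs anything it is combined with."""
    return NEG_INF if NEG_INF in (x, y) else x + y

# Evaluation by memoization: each distinct argument tuple is computed
# once and then looked up, giving the quadratic bound for score(i, j, q).
@lru_cache(maxsize=None)
def toy_score(i):
    # Toy stand-in for a Table-5 equation: maximum over two options.
    if i <= 0:
        return 0
    return oplus(otimes(toy_score(i - 1), 1), NEG_INF)
```

With `lru_cache`, the number of evaluations is bounded by the number of distinct argument tuples, which is what makes the complexity analysis above go through.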

## 6 𝓞(*n*) Time Algorithm

Under the same assumption as in the previous section, namely, that *T*_{g} is projective, we can
further reduce the time complexity of computing *σ*_{max}, by two
observations. First, let us define *λ*(*i*, *j*) to be true if and only if there is an *ℓ* < *i* such that
(*a*_{ℓ}, *a*_{j}) ∈ *T*_{g} or
(*a*_{j}, *a*_{ℓ}) ∈ *T*_{g}. If
(*a*_{j}, *a*_{i})∉*T*_{g} and *λ*(*i*, *j*) is false, then
the highest score attainable from a configuration
(*a*_{0}⋯*a*_{i−1}*a*_{j}, *β*,∅) is no higher than the highest score
attainable from
(*a*_{0}⋯*a*_{i−1}*a*_{i}, *β*,∅), or, if *a*_{j} has a parent *b*_{j′}, from
(*a*_{0}⋯*a*_{i}*b*_{j′}, *β′*,∅), for appropriate suffix *β′* of *β*. This means that
in order to calculate **score**(*i*, *j*, *q*) we do not
need to calculate **score**(*i* − 1, *j*, *q*) in this case.

Secondly, if (*a*_{j}, *a*_{i})∉*T*_{g} and *λ*(*i*, *j*) is true, and
if there is *ℓ′* < *i* such that
(*a*_{ℓ′}, *a*_{i}) ∈ *T*_{g} or
(*a*_{i}, *a*_{ℓ′}) ∈ *T*_{g}, then there are no edges between *a*_{j} and *a*_{i′} for any *i′* with *ℓ′* < *i′* < *i*, because of projectivity of *T*_{g}. We therefore do not need to
calculate **score**(*i′*, *j*, *q*)
for such values of *i′* in order to find the computation with
the highest score. This is illustrated in Figure 7.
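Both predicates can be evaluated directly from an edge list. A Python sketch, under a hypothetical representation in which each gold edge between stack symbols *a*_{i} and *a*_{j} (in either direction) is stored as an unordered index pair:

```python
# Sketch: the predicates behind the two observations, for stack
# symbols only. Each gold edge between a_i and a_j, in either
# direction, is stored as frozenset((i, j)).

def lam(edges, i, j):
    """lambda(i, j): is there an l < i with an edge between a_l and a_j?"""
    return any(frozenset((l, j)) in edges for l in range(i))

def kappa(edges, i):
    """The smallest l' < i with an edge between a_{l'} and a_i,
    or i - 1 if there is no such l'; used to skip the values i' with
    l' < i' < i, which projectivity rules out as edge endpoints."""
    for l in range(i):
        if frozenset((l, i)) in edges:
            return l
    return i - 1
```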

We define *κ*(*i*) to be the smallest *ℓ′* such that (*a*_{ℓ′}, *a*_{i}) ∈ *T*_{g} or (*a*_{i}, *a*_{ℓ′}) ∈ *T*_{g}, or *i* − 1 if there is no such *ℓ′*. In the definition of **score**, we may now replace *w*(*j*, *i*) ⊗ **score**(*i* − 1, *j*, *q*) by an expression that, in the cases identified by the two observations above, recurses on **score**(*κ*(*i*), *j*, *q*) instead.

We further define *λ′*(*i*, *j*) to be true if and only if there is an *ℓ* < *i* such that (*a*_{ℓ}, *b*_{j′}) ∈ *T*_{g} or (*b*_{j′}, *a*_{ℓ}) ∈ *T*_{g} for some *j′* with *R*(*j′*) = *R*(*j*). In the definition of **score***′*, we may now replace **score***′*(*i* − 1, *j*) by an analogous expression in terms of *κ* and *λ′*.

The number of distinct values of **score**(*i*, *j*, *q*) and **score***′*(*i*, *j*) that are calculated for any *i* is now linear. To see this, consider that for any *i*, **score**(*i*, *j*, *q*) would be calculated only if *j* = *i* + 1, if (*a*_{i}, *a*_{j}) ∈ *T*_{g} or (*a*_{j}, *a*_{i}) ∈ *T*_{g}, if (*a*_{j}, *a*_{i +1}) ∈ *T*_{g}, or if *j* is the smallest index such that there is *ℓ* < *i* with (*a*_{ℓ}, *a*_{j}) ∈ *T*_{g} or (*a*_{j}, *a*_{ℓ}) ∈ *T*_{g}. Similarly, **score***′*(*i*, *j*) would be calculated only if **score**(*i*, *j′*, *q*) would be calculated and (*b*_{j}, *a*_{j′}) ∈ *T*_{g}, if (*b*_{j}, *a*_{i +1}) ∈ *T*_{g}, or if *j* is the smallest index such that there is *ℓ* ≤ *i* with (*a*_{ℓ}, *b*_{j′}) ∈ *T*_{g} or (*b*_{j′}, *a*_{ℓ}) ∈ *T*_{g} for some *j′* such that *b*_{j′} is an ancestor of *b*_{j} in the same component.
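To make the counting argument concrete, the following sketch enumerates, for a gold tree over stack symbols only, the pairs (*i*, *j*) satisfying the stack-internal conditions above; it is a hypothetical simplification that ignores the conditions involving the remaining input, and so slightly undercounts, but it suffices to show that a chain-shaped tree yields linearly many pairs rather than quadratically many:

```python
# Sketch: the (i, j) pairs for which score(i, j, q) would be computed,
# using only the stack-internal conditions listed above.
def computed_pairs(edges, k):
    """edges: a set of frozenset index pairs, each standing for a gold
    edge between a_i and a_j in either direction; k: top stack index."""
    pairs = set()
    for i in range(k):
        for j in range(i + 1, k + 1):
            if (j == i + 1                                  # adjacent pair
                    or frozenset((i, j)) in edges           # edge a_i - a_j
                    or frozenset((j, i + 1)) in edges):     # edge a_j - a_{i+1}
                pairs.add((i, j))
    return pairs

# A chain a_0 - a_1 - ... - a_9 yields O(k) pairs, not O(k^2).
chain = {frozenset((i, i + 1)) for i in range(9)}
print(len(computed_pairs(chain, 9)))  # 17, i.e., at most 2 per value of i
```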

## 7 Towards Constant Time Per Calculation

A typical application would calculate the optimal step for several or even all
configurations within one computation. Between one configuration and the next, the
stack differs at most in the two rightmost elements and the remaining input differs
at most in that it loses its leftmost element. Therefore, all but a constant number
of values of **score**(*i*, *j*, *q*) and **score***′*(*i*, *j*) can
be reused, to make the time complexity closer to constant time for each calculation
of the optimal step. The practical relevance of this is limited however if one would
typically reload the data structures containing the relevant values, which are of
linear size. Hence we have not pursued this further.

## 8 Experiments

Our experiments were run on a laptop with an Intel i7-7500U processor (4 cores, 2.70
GHz) with 8 GB of RAM. The implementation language is Java, with DL4J^{2} for the classifier, realized as a
neural network with a single layer of 256 hidden nodes. Training is with batch size
100, and 20 epochs. Features are the (gold) parts of speech and length-100 word2vec
representations of the word forms of the top-most three stack elements, as well as
of the left-most three elements of the remaining input, and the left-most and
right-most dependency relations in the top-most two stack elements.

### 8.1 Optimal Projectivization

We need to projectivize our training corpus for the experiments in Section 8.2, using the algorithm described at the end of Section 3. As we are not aware of literature reporting experiments with optimal projectivization, we briefly describe our findings here.

Projectivizing all the training sets in Universal Dependencies v2.2^{3} took 244 sec in total, or 0.342
ms per tree. As mentioned earlier, there may be multiple projectivized trees
that are optimal in terms of accuracy, for a single gold tree. We are not aware
of meaningful criteria that tell us how to choose any particular one of them,
and for our experiments in Section
8.2 we have chosen an arbitrary one. It is conceivable, however, that
the choices of the projectivized trees would affect the accuracy of a parser
trained on them. Figure 8 illustrates the
degree of “choice” when projectivizing trees. We consider two
languages that are known to differ widely in the prevalence of non-projectivity,
namely Ancient Greek (PROIEL) and Japanese (BCCWJ), and we consider one more
language, German (GSD), that falls in between (Straka et al., 2015). As can be expected, the degree of choice grows
roughly exponentially in sentence length.

Table 6 shows that pseudo-projectivization
is non-optimal. We realized pseudo-projectivization using MaltParser 1.9.0.^{4}

### 8.2 Computing the Optimal Step

To investigate the run-time behavior of the algorithms, we trained our
shift-reduce dependency parser on the German training corpus, after it was
projectivized as in Section 8.1. In a
second pass over the same corpus, the parser followed the steps returned by the
trained classifier. For each configuration that was obtained in this way, the
running time was recorded of calculating the optimal step, with the non-strict
left-before-right strategy. For each configuration, it was verified that the
calculated scores, for **shift**, **reduce_left**, and **reduce_right**, were the same across the three algorithms from Sections 4, 5, and 6.

The two-pass design was inspired by Choi and Palmer (2011). We chose this design, rather than online learning, as we found it easiest to implement. Goldberg and Nivre (2012) discuss the relation between multi-pass and online learning approaches.

As Figure 9 shows, the running times of the
algorithms from Sections 5 and 6 grow slowly as the summed length of
stack and remaining input grows; note the logarithmic scale. The improvement of
the linear-time algorithm over the quadratic-time algorithm is perhaps less than
one may expect. This is because the cost of calculating the critical nodes and
constructing the necessary tables, such as *p*, *p′*, and *R*, is considerable compared
to the cost of the memoized recursive calls of **score** and **score***′*.

Both these algorithms contrast with the algorithm from Section 4, applied on projectivized trees as above (hence
tagged proj in Figure 9), and with the
remaining input simplified to just its critical nodes. For *k* + *m* = 80, the cubic-time algorithm is slower than the
linear-time algorithm by a factor of about 65. Nonetheless, we find that the
cubic-time algorithm is practically relevant, even for long sentences.

The decreases at roughly *k* + *m* = 88, which are
most visible for Section 4 (proj),
are explained by the fact that the running time is primarily determined by *k* + *m′*, where *m′* is the number of critical nodes. Because *k* + *m* is bounded by the sentence length
and the stack height *k* tends to be much less than the sentence
length, high values of *k* + *m* tend to result
from the length *m* of the remaining input being large, which in
turn implies that there will be more non-critical nodes that are removed before
the most time-consuming part of the analyses is entered. This is confirmed by Figure 10.

The main advantage of the cubic-time algorithm is that it is also applicable if the training corpus has not been projectivized. To explore this we have run this algorithm on the same corpus again, but now without projectivization in the second pass (for training the classifier in the first pass, projectivization was done as before). In this case, we can no longer remove non-critical nodes (without it affecting correctness), and now the curve is monotone increasing, as shown by Section 4 (unproj) in Figure 9. Nevertheless, with mean running times below 0.25 sec even for input longer than 100 tokens, this algorithm is practically relevant.

### 8.3 Accuracy

If a corpus is large enough for the parameters of a classifier to be reliably
estimated, or if the vast majority of trees is projective, then accuracy is not
likely to be much affected by the work in this paper. We therefore also consider
six languages that have some of the smallest corpora in UD v2.2 in combination
with a relatively large proportion of non-projective trees: Danish, Basque,
Greek, Old Church Slavonic, Gothic, and Hungarian. For these languages, Table 7 shows that accuracy is generally
higher if training can benefit from *all* trees. In a few cases,
it appears to be slightly better to train directly on non-projective trees
rather than on optimally projectivized trees.

| LAS / UAS | (1) subset | (2) all | (3) subset Sec. 6 | (4) all Sec. 6 | (5) all Sec. 4 |
|---|---|---|---|---|---|
| de, 13% | 71.15 | 71.33 | 71.69 | 72.57 | 72.55 |
| 263,804 | 78.09 | 78.14 | 78.96 | 79.78 | 79.77 |
| da, 13% | 69.11 | 71.42 | 69.95 | 72.18 | 72.25 |
| 80,378 | 75.13 | 76.98 | 76.30 | 78.00 | 78.21 |
| eu, 34% | 54.69 | 58.27 | 54.11 | 57.49 | 57.81 |
| 72,974 | 67.49 | 70.07 | 67.71 | 70.07 | 70.13 |
| el, 12% | 71.62 | 72.78 | 70.49 | 72.66 | 72.34 |
| 42,326 | 77.45 | 78.34 | 77.14 | 78.91 | 78.36 |
| cu, 20% | 56.25 | 59.09 | 56.31 | 58.78 | 59.52 |
| 37,432 | 68.08 | 69.95 | 69.07 | 70.10 | 70.94 |
| got, 22% | 51.96 | 55.00 | 53.44 | 55.94 | 56.20 |
| 35,024 | 64.48 | 66.58 | 65.85 | 67.85 | 68.09 |
| hu, 26% | 52.70 | 56.20 | 54.09 | 57.37 | 57.62 |
| 20,166 | 65.72 | 68.96 | 67.55 | 70.20 | 70.30 |


## 9 Conclusions

We have presented the first algorithm to calculate the optimal step for shift-reduce dependency parsing that is applicable to non-projective training corpora. Perhaps even more innovative than its functionality is its modular architecture, which implies that the same is possible for related kinds of parsing, as long as the set of allowable transitions can be described in terms of a split context-free grammar. The application of the framework to, among others, arc-eager dependency parsing is to be reported elsewhere.

We have also shown that calculation of the optimal step is possible in linear time if the training corpus is projective. This is the first time this has been shown for a form of projective, deterministic dependency parsing that does not have the property of arc-decomposability.

## Acknowledgments

The author wishes to thank the reviewers for comments and suggestions, which led to substantial improvements.

## Notes

1. A term we avoid here, as dynamic oracles are neither oracles nor dynamic, especially in our formulation, which allows gold trees to be non-projective. Following, for example, Kay (2000), an oracle informs a parser whether a step may lead to the correct parse. If the gold tree is non-projective and the parsing strategy only allows projective trees, then there are no steps that lead to the correct parse. At best, there is an optimal step, by some definition of optimality. An algorithm to compute the optimal step, for a given configuration, would typically not change over time, and therefore is not dynamic in any generally accepted sense of the word.