## Abstract

Syntactic representations based on word-to-word dependencies have a long-standing tradition in descriptive linguistics, and receive considerable interest in many applications. Nevertheless, dependency syntax has remained something of an island from a formal point of view. Moreover, most formalisms available for dependency grammar are restricted to projective analyses, and are thus unable to support natural accounts of phenomena such as wh-movement and cross–serial dependencies. In this article we present a formalism for non-projective dependency grammar in the framework of linear context-free rewriting systems. A characteristic property of our formalism is a close correspondence between the non-projectivity of the dependency trees admitted by a grammar on the one hand, and the parsing complexity of the grammar on the other. We show that parsing with unrestricted grammars is intractable. We therefore study two constraints on non-projectivity, block-degree and well-nestedness. Jointly, these two constraints define a class of “mildly” non-projective dependency grammars that can be parsed in polynomial time. An evaluation on five dependency treebanks shows that these grammars have good coverage of empirical data.

## 1. Introduction

Syntactic representations based on word-to-word dependencies have a long-standing tradition in descriptive linguistics. Since the seminal work of Tesnière (1959), they have become the basis for several linguistic theories, such as Functional Generative Description (Sgall, Hajičová, and Panevová 1986), Meaning–Text Theory (Mel'čuk 1988), and Word Grammar (Hudson 2007). In recent years they have also been used for a wide range of practical applications, such as information extraction, machine translation, and question answering. We ascribe the widespread interest in dependency structures to their intuitive appeal, their conceptual simplicity, and in particular to the availability of accurate and efficient dependency parsers for a wide range of languages (Buchholz and Marsi 2006; Nivre et al. 2007).

Although there exist both a considerable practical interest and an extensive linguistic literature, dependency syntax has remained something of an island from a formal point of view. In particular, there are relatively few results that bridge between dependency syntax and other traditions, such as phrase structure or categorial syntax. This makes it hard to gauge the similarities and differences between the paradigms, and hampers the exchange of linguistic resources and computational methods. An overarching goal of this article is to bring dependency grammar closer to the mainland of formal study.

One of the few bridging results for dependency grammar is thanks to Gaifman (1965), who studied a formalism that we will refer to as Hays–Gaifman grammar, and proved it to be weakly equivalent to context-free phrase structure grammar. Although this result is of fundamental importance from a theoretical point of view, its practical usefulness is limited. In particular, Hays–Gaifman grammar is restricted to projective dependency structures, which is similar to the familiar restriction to contiguous constituents. Yet, *non*-projective dependencies naturally arise in the analysis of natural language. One classic example of this is the phenomenon of cross–serial dependencies in Dutch. In this language, the nominal arguments of verbs that also select an infinitival complement occur in the same order as the verbs themselves:

The German linearization gives rise to a projective structure, where the verb–argument dependencies are nested within each other, whereas the Dutch linearization induces a non-projective structure with crossing edges. To account for such structures we need to turn to formalisms more expressive than Hays–Gaifman grammars.

In this article we present a formalism for non-projective dependency grammar based on linear context-free rewriting systems (LCFRSs) (Vijay-Shanker, Weir, and Joshi 1987; Weir 1988). This framework was introduced to facilitate the comparison of various grammar formalisms, including standard context-free grammar, tree-adjoining grammar (Joshi and Schabes 1997), and combinatory categorial grammar (Steedman and Baldridge 2011). It also comprises, among others, multiple context-free grammars (Seki et al. 1991), minimalist grammars (Michaelis 1998), and simple range concatenation grammars (Boullier 2004).

The article is structured as follows. In Section 2 we provide the technical background to our work; in particular, we introduce our terminology and notation for linear context-free rewriting systems. An LCFRS generates a set of terms (formal expressions) which are interpreted as derivation trees of objects from some domain. Each term also has a secondary interpretation under which it denotes a tuple of strings, representing the string yield of the derived object. In Section 3 we introduce the central notion of a lexicalized linear context-free rewriting system, which is an LCFRS in which each rule of the grammar is associated with an overt lexical item, representing a syntactic head (cf. Schabes, Abeillé, and Joshi 1988 and Schabes 1990). We show that this property gives rise to an additional interpretation under which each term denotes a dependency tree on its yield. With this interpretation, lexicalized LCFRSs can be used as dependency grammars.

In Section 4 we show how to acquire lexicalized LCFRSs from dependency treebanks. This works in much the same way as the extraction of context-free grammars from phrase structure treebanks (cf. Charniak 1996), except that the derivation trees of dependency trees are not immediately accessible in the treebank. We therefore present an efficient algorithm for computing a canonical derivation tree for an input dependency tree; from this derivation tree, the rules of the grammar can be extracted in a straightforward way. The algorithm was originally published by Kuhlmann and Satta (2009). It produces a restricted type of lexicalized LCFRS that we call “canonical.” In Section 5 we provide a declarative characterization of this class of grammars, and show that every lexicalized LCFRS is (strongly) equivalent to a canonical one, in the sense that it induces the same set of dependency trees.

In Section 6 we present a simple parsing algorithm for LCFRSs. Although the runtime of this algorithm is polynomial in the length of the sentence, the degree of the polynomial depends on two grammar-specific measures called fan-out and rank. We show that even in the restricted case of canonical grammars, parsing is an NP-hard problem. It is therefore important to keep the fan-out and the rank of a grammar as low as possible, and much of the recent work on LCFRSs has been devoted to the development of techniques that optimize parsing complexity in various scenarios (Gómez-Rodríguez and Satta 2009; Gómez-Rodríguez et al. 2009; Kuhlmann and Satta 2009; Gildea 2010; Gómez-Rodríguez, Kuhlmann, and Satta 2010; Sagot and Satta 2010; Crescenzi et al. 2011).

In this article we explore the impact of non-projectivity on parsing complexity. In Section 7 we present the structural correspondent of the fan-out of a lexicalized LCFRS, a measure called **block-degree** (or gap-degree) (Holan et al. 1998). Although there is no theoretical upper bound on the block-degree of the dependency trees needed for linguistic analysis, we provide evidence from several dependency treebanks showing that, from a practical point of view, this upper bound can be put at a value of as low as 2. In Section 8 we study a second constraint on non-projectivity called **well-nestedness** (Bodirsky, Kuhlmann, and Möhl 2005), and show that its presence facilitates tractable parsing. This comes at the cost of a small loss in coverage on treebank data. Bounded block-degree and well-nestedness jointly define a class of “mildly” non-projective dependency grammars that can be parsed in polynomial time.

Section 9 summarizes our main contributions and concludes the article.

## 2. Technical Background

We assume basic familiarity with linear context-free rewriting systems (see, e.g., Vijay-Shanker, Weir, and Joshi 1987 and Weir 1988) and only review the terminology and notation that we use in this article.

A **linear context-free rewriting system** (LCFRS) is a structure *G* = (*N*, Σ, *P*, *S*) where *N* is a set of nonterminals, Σ is a set of function symbols, *P* is a finite set of production rules, and *S* ∈ *N* is a distinguished start symbol. Rules take the form

$$A_0 \to f(A_1, \ldots, A_m) \tag{1}$$

where *f* is a function symbol and the *A*_{i} are nonterminals. Rules are used for rewriting in the same way as in a context-free grammar, with the function symbols acting as terminals. The outcome of the rewriting process is a set *T*(*G*) of terms, tree-formed expressions built from function symbols. Each term is then associated with a string yield, more specifically a *tuple* of strings. For this, every function symbol *f* comes with a **yield function** that specifies how to compute the yield of a term *f*(*t*_{1}, …, *t*_{m}) from the yields of its subterms *t*_{i}. Yield functions are defined by equations

$$f(\langle x_{1,1}, \ldots, x_{1,k_1} \rangle, \ldots, \langle x_{m,1}, \ldots, x_{m,k_m} \rangle) = \langle \alpha_1, \ldots, \alpha_{k_0} \rangle \tag{2}$$

where the tuple on the right-hand side consists of strings over the variables on the left-hand side and some given alphabet of yield symbols, and contains exactly one occurrence of each variable. For a yield function *f* defined by an equation of this form, we say that *f* is of **type** *k*_{1} ⋯ *k*_{m} → *k*_{0}, denoted by *f* : *k*_{1} ⋯ *k*_{m} → *k*_{0}. To guarantee that the string yield of a term is well-defined, each nonterminal *A* is associated with a **fan-out** *ϕ*(*A*) ≥ 1, and it is required that for every rule (1),

$$f : \phi(A_1) \cdots \phi(A_m) \to \phi(A_0).$$

In Equation (2), the values *m* and *k*_{0} are called the **rank** and the **fan-out** of *f*, respectively. The rank and the fan-out of an LCFRS are the maximal rank and fan-out of its yield functions.

**Example 1**

Figure 2 shows an example of an LCFRS for the language {〈*a*^{n}*b*^{n}*c*^{n}*d*^{n}〉 | *n* ≥ 0}.

A yield function *f* defined by Equation (2) is uniquely determined by the tuple on the right-hand side of the equation. We call this tuple the **template** of *f*, and use it as the canonical function symbol for *f*. This gives rise to a compact notation for LCFRSs, illustrated in the right column of Figure 2. In this notation, to save some subscripts, we use the following shorthands for variables: *x* and *x*_{1} for *x*_{1,1}; *x*_{2} for *x*_{1,2}; *x*_{3} for *x*_{1,3}; *y* and *y*_{1} for *x*_{2,1}; *y*_{2} for *x*_{2,2}; *y*_{3} for *x*_{2,3}.
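To make the template notation concrete, the following Python sketch evaluates yield functions given by their templates. The encoding is an assumption made for illustration only: a variable *x*_{i,j} is written as the pair `(i, j)`, a yield symbol is a string, and a term is a pair of a template and a list of subterms. The toy grammar for {〈*a*^{n}*b*^{n}*c*^{n}*d*^{n}〉} is our own and need not coincide with the one in Figure 2.

```python
def apply_template(template, args):
    """Apply the yield function with the given template to argument tuples."""
    result = []
    for component in template:
        parts = []
        for sym in component:
            if isinstance(sym, tuple):      # variable x_{i,j}
                i, j = sym
                parts.append(args[i - 1][j - 1])
            else:                           # yield symbol
                parts.append(sym)
        result.append("".join(parts))
    return tuple(result)


def yield_of(term):
    """Compute the string yield of a term (template, list of subterms)."""
    template, subterms = term
    return apply_template(template, [yield_of(t) for t in subterms])


def type_of(template):
    """Return ([k_1, ..., k_m], k_0), read off the variables of the template."""
    ks = {}
    for component in template:
        for sym in component:
            if isinstance(sym, tuple):
                ks[sym[0]] = max(ks.get(sym[0], 0), sym[1])
    return [ks[i] for i in sorted(ks)], len(template)


# Hypothetical templates (not taken from Figure 2).
BASE = ((), ())                                   # <eps, eps>
STEP = (("a", (1, 1), "b"), ("c", (1, 2), "d"))   # <a x1 b, c x2 d>
TOP = (((1, 1), (1, 2)),)                         # <x1 x2>

term = (TOP, [(STEP, [(STEP, [(BASE, [])])])])
print(yield_of(term))   # ('aabbccdd',)
print(type_of(STEP))    # ([2], 2)
```

Here `STEP` has rank 1 and fan-out 2, and `TOP` concatenates the two components of its single argument into one string.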

## 3. Lexicalized LCFRSs as Dependency Grammars

The grammars *G*_{1} and *G*_{2} are **lexicalized** in the sense that each of their yield functions is associated with a lexical item, such as *sah* or *zag* (cf. Schabes, Abeillé, and Joshi 1988 and Schabes 1990). Productions with lexicalized yield functions can be read as dependency rules. For example, the rules associated with *sah* and *zag* can be read as stating that the verb *to see* requires two dependents, one noun (N) and one verb (V). Based on this reading, every term generated by a lexicalized LCFRS not only yields a tuple of strings, but also induces a dependency tree on these strings: Each parent–child relation in the term represents a dependency between the associated lexical items (cf. Rambow and Joshi 1997). Thus every lexicalized LCFRS can be reinterpreted as a dependency grammar. To illustrate the idea, Figure 4 shows (the tree representations of) two terms generated by the grammars *G*_{1} and *G*_{2}, together with the dependency trees induced by them. Note that these are the same trees that we gave for (iii) and (iv) in Figure 1.

Our goal for the remainder of this section is to make the notion of induction formally precise. To this end we will reinterpret the yield functions of lexicalized LCFRSs as operations on dependency trees.

### 3.1 Dependency Trees

By a dependency tree, we mean a pair ($\vec{w}$, *D*), where $\vec{w}$ is a tuple of strings, and *D* is a tree-shaped graph whose nodes correspond to the occurrences of symbols in $\vec{w}$, and whose edges represent dependency relations between these occurrences. We identify occurrences in $\vec{w}$ by pairs (*i*, *j*) of integers, where *i* indexes the component of $\vec{w}$ that contains the occurrence, and *j* specifies the linear position of the occurrence within that component. We can then formally define a **dependency graph** for a tuple of strings

$$\vec{w} = \langle a_{1,1} \cdots a_{1,n_1}, \ldots, a_{k,1} \cdots a_{k,n_k} \rangle$$

as a directed graph *G* = (*V*, *E*) where

$$V = \{(i, j) \mid 1 \le i \le k,\ 1 \le j \le n_i\} \quad\text{and}\quad E \subseteq V \times V.$$

We use *u* and *v* as variables for nodes, and denote edges (*u*, *v*) as *u* → *v*. A **dependency tree** *D* for $\vec{w}$ is a dependency graph for $\vec{w}$ in which there exists a root node *r* such that for any node *u*, there is exactly one directed path from *r* to *u*. A dependency tree is called **simple** if $\vec{w}$ consists of a single string *w*. In this case, we write the dependency tree as (*w*, *D*), and identify occurrences by their linear positions *j* in *w*, with 1 ≤ *j* ≤ |*w*|.
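As an illustration of these definitions, here is a minimal Python sketch that checks whether a graph over the occurrences of a tuple of strings is a dependency tree. The representation (token lists for components, a set of node pairs for edges) and the function names are assumptions of this sketch. The check uses the standard equivalent formulation of the uniqueness-of-paths condition: the root has no incoming edge, every other node has exactly one, and every node is reachable from the root.

```python
def nodes_of(words):
    """Occurrences (i, j) of a tuple of strings, given as a tuple of token lists."""
    return {(i, j)
            for i, component in enumerate(words, start=1)
            for j in range(1, len(component) + 1)}


def is_dependency_tree(words, edges, root):
    """True if there is exactly one directed path from root to every node."""
    nodes = nodes_of(words)
    if root not in nodes:
        return False
    if any(u not in nodes or v not in nodes for u, v in edges):
        return False
    indegree = {u: 0 for u in nodes}
    children = {}
    for u, v in edges:
        indegree[v] += 1
        children.setdefault(u, []).append(v)
    if indegree[root] != 0:
        return False
    if any(d != 1 for u, d in indegree.items() if u != root):
        return False
    seen, stack = {root}, [root]
    while stack:                      # depth-first search from the root
        u = stack.pop()
        for v in children.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == nodes


# Hypothetical example: a tree for the tuple <a b, c d e> rooted at (1, 2).
WORDS = (["a", "b"], ["c", "d", "e"])
EDGES = {((1, 2), (1, 1)), ((1, 2), (2, 1)), ((1, 1), (2, 3)), ((2, 1), (2, 2))}
print(is_dependency_tree(WORDS, EDGES, (1, 2)))   # True
```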

**Example 2**

Writing *D*_{i} as *D*_{i} = (*V*_{i}, *E*_{i}) we have: …

For a node *u*, the set of **descendants** of *u*, which we denote by ⌊*u*⌋, is the set of nodes that can be reached from *u* by following a directed path consisting of zero or more edges. We write *u* < *v* to express that the node *u* precedes the node *v* when reading the yield from left to right. Formally, precedence is the lexicographical order on occurrences:

$$(i_1, j_1) < (i_2, j_2) \quad\text{if and only if}\quad i_1 < i_2, \text{ or } i_1 = i_2 \text{ and } j_1 < j_2.$$

### 3.2 Operations on Dependency Trees

A yield function *f* is called **lexicalized** if its template contains exactly one yield symbol, representing a lexical item; this symbol is then called the **anchor** of *f*. With every lexicalized yield function *f* we associate an operation *f*′ on dependency trees as follows. Let $\vec{w}_1, \ldots, \vec{w}_m, \vec{w}$ be tuples of strings such that

$$f(\vec{w}_1, \ldots, \vec{w}_m) = \vec{w},$$

and let *D*_{i} be a dependency tree for $\vec{w}_i$. By the definition of yield functions, every occurrence *u* in an input tuple $\vec{w}_i$ corresponds to exactly one occurrence in the output tuple $\vec{w}$; we denote this occurrence by $\bar{u}$. Let *G* be the dependency graph for $\vec{w}$ that has an edge $\bar{u} \to \bar{v}$ whenever there is an edge *u* → *v* in some *D*_{i}, and no other edges. Because *f* is lexicalized, there is exactly one occurrence *r* in the output tuple $\vec{w}$ that does not correspond to any occurrence in some $\vec{w}_i$; this is the occurrence of the anchor of *f*. Let *D* be the dependency tree for $\vec{w}$ that is obtained by adding to the graph *G* all edges of the form $r \to \bar{r}_i$, where *r*_{i} is the root node of *D*_{i}. By this construction, the occurrence *r* of the anchor becomes the root node of *D*, and the root nodes of the input dependency trees *D*_{i} become its dependents. We then define

$$f'((\vec{w}_1, D_1), \ldots, (\vec{w}_m, D_m)) = (\vec{w}, D).$$

**Example 3**

The dependency trees *D*_{1}, *D*_{2} are defined as … We show that $f'((\vec{w}_1, D_1), (\vec{w}_2, D_2)) = (\vec{w}, D)$, where *D* = (*V*, *E*) with … The correspondences between the occurrences *u* in the input tuples and the occurrences $\bar{u}$ in the output tuple are as follows: … By copying the edges from the input dependency trees, we obtain the intermediate dependency graph *G* = (*V*, *E*′) for $\vec{w}$, where … The occurrence *r* of the anchor *b* of *f* in $\vec{w}$ is (1, 2); the nodes of *G* that correspond to the root nodes of *D*_{1} and *D*_{2} are $\bar{r}_1$ and $\bar{r}_2$. The dependency tree *D* is obtained by adding the edges $r \to \bar{r}_1$ and $r \to \bar{r}_2$ to *G*.
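The operation associated with a lexicalized yield function can be sketched in Python as follows. This is an illustrative sketch under our own assumptions, not the article's implementation: templates use pairs `(i, j)` for variables and strings for the anchor, a dependency tree is a triple of token lists, an edge set, and a root, and the input trees below are hypothetical (they are chosen so that the anchor *b* lands at occurrence (1, 2), but they need not match Figure 6).

```python
def apply_operation(template, trees):
    """Apply the tree operation of a lexicalized yield function.

    trees is a list of (words, edges, root); words is a tuple of token lists.
    Returns the output tree in the same format.
    """
    out_words, bar, anchor = [], {}, None
    for h, component in enumerate(template, start=1):
        tokens = []
        for sym in component:
            if isinstance(sym, tuple):          # variable x_{i,j}
                i, j = sym
                for p, tok in enumerate(trees[i - 1][0][j - 1], start=1):
                    tokens.append(tok)
                    # input occurrence (j, p) of argument i -> output occurrence
                    bar[(i, (j, p))] = (h, len(tokens))
            else:                               # the anchor
                tokens.append(sym)
                anchor = (h, len(tokens))
        out_words.append(tokens)
    edges = set()
    for i, (_, in_edges, root) in enumerate(trees, start=1):
        # copy the edges of the i-th input tree
        edges |= {(bar[(i, u)], bar[(i, v)]) for u, v in in_edges}
        # attach the root of the i-th input tree to the anchor
        edges.add((anchor, bar[(i, root)]))
    return tuple(out_words), edges, anchor


# Hypothetical inputs for f = <x1 b, y x2>.
F = (((1, 1), "b"), ((2, 1), (1, 2)))
TREE_1 = ((["a"], ["e"]), {((1, 1), (2, 1))}, (1, 1))
TREE_2 = ((["c", "d"],), {((1, 1), (1, 2))}, (1, 1))

words, edges, root = apply_operation(F, [TREE_1, TREE_2])
print(words)   # (['a', 'b'], ['c', 'd', 'e'])
print(root)    # (1, 2)
```

In this example the copied edges are (1, 1) → (2, 3) and (2, 1) → (2, 2), and the two added edges lead from the anchor (1, 2) to (1, 1) and (2, 1).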

## 4. Extraction of Dependency Grammars

We now show how to extract lexicalized linear context-free rewriting systems from dependency treebanks. To this end, we adapt the standard technique for extracting context-free grammars from phrase structure treebanks (Charniak 1996).

Our technique was originally published by Kuhlmann and Satta (2009). In recent work, Maier and Lichte (2011) have shown how to unify it with a similar technique for the extraction of range concatenation grammars from discontinuous constituent structures, due to Maier and Søgaard (2008). To simplify our presentation we restrict our attention to treebanks containing simple dependency trees.

For every dependency tree (*w*, *D*) in the treebank, we compute a **construction tree**, a term *t* over yield functions that induces (*w*, *D*). Then we collect a set of production rules, one rule for each node of the construction trees. As an example, consider Figure 7, which shows a dependency tree with one of its construction trees. (The analysis is taken from Kübler, McDonald, and Nivre [2009].) From this construction tree we extract the following rules: … The nonterminals (in bold) represent linear positions of nodes.

Rules like these can serve as the starting point for practical systems for data-driven, non-projective dependency parsing (Maier and Kallmeyer 2010).

Because the extraction of rules from construction trees is straightforward, the problem that we focus on in this section is how to obtain these trees in the first place. Our procedure for computing construction trees is based on the concept of “blocks.”

### 4.1 Blocks

Let *D* be a dependency tree. A **segment** of *D* is a contiguous, non-empty sequence of nodes of *D*, all of which belong to the same component of the string yield. Thus a segment contains its endpoints, as well as all nodes between the endpoints in the precedence order. For a node *u* of *D*, a **block** of *u* is a longest segment consisting of descendants of *u*. This means that the left endpoint of a block of *u* either is the first node in its component, or is preceded by a node that is not a descendant of *u*. A symmetric property holds for the right endpoint.

**Example 4**

Consider the node 2 of the dependency tree in Figure 7. The descendants of 2 fall into two blocks, marked by the dashed boxes: 1 2 and 5 6 7.

We use $\vec{u}$ and $\vec{v}$ as variables for blocks. Extending the precedence order on nodes, we say that a block $\vec{u}$ precedes a block $\vec{v}$, denoted by $\vec{u} < \vec{v}$, if the right endpoint of $\vec{u}$ precedes the left endpoint of $\vec{v}$.
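The definition of blocks translates directly into a short Python sketch for simple dependency trees. The tree below is hypothetical: it is chosen only so that node 2 has the two blocks 1 2 and 5 6 7 mentioned in the example, and it need not coincide with Figure 7. Blocks are represented as pairs of left and right endpoints, which is an assumption of this sketch.

```python
def descendants(children, u):
    """Nodes reachable from u by zero or more edges."""
    result = {u}
    for c in children.get(u, []):
        result |= descendants(children, c)
    return result


def blocks(children, u):
    """Blocks of u as (left endpoint, right endpoint) pairs, in precedence order."""
    result = []
    for v in sorted(descendants(children, u)):
        if result and result[-1][1] == v - 1:
            result[-1] = (result[-1][0], v)     # extend the current block
        else:
            result.append((v, v))               # start a new block
    return result


def block_precedes(b1, b2):
    """b1 precedes b2 if the right endpoint of b1 precedes the left endpoint of b2."""
    return b1[1] < b2[0]


# Hypothetical tree with nodes 1..8 and root 3.
CHILDREN = {3: [2, 4], 2: [1, 5], 5: [7], 7: [6], 4: [8]}
print(blocks(CHILDREN, 2))   # [(1, 2), (5, 7)]
print(blocks(CHILDREN, 4))   # [(4, 4), (8, 8)]
```

This direct computation sorts the descendants of each node separately, so it is only meant to illustrate the definition, not to be efficient.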

### 4.2 Computing Canonical Construction Trees

To obtain a canonical construction tree *t* for a dependency tree (*w*, *D*) we label each node *u* of *D* with a yield function *f* as follows. Let $\vec{w}$ be the tuple consisting of the blocks of *u*, in the order of their precedence, and let $\vec{w}_1, \ldots, \vec{w}_m$ be the corresponding tuples for the children of *u*. We may view blocks as strings of nodes. Taking this view, we compute the (unique) yield function *g* with the property that

$$g(\vec{w}_1, \ldots, \vec{w}_m) = \vec{w}.$$

The anchor of *g* is the node *u*, the rank of *g* corresponds to the number of children of *u*, the variables in the template of *g* represent the blocks of these children, and the components of the template represent the blocks of *u*. To obtain *f*, we take the template of *g* and replace the occurrence of *u* with the corresponding lexical item.

**Example 5**

Note that in order to properly define *f* we need to assume some order on the children of *u*. The function *g* (and hence the construction tree *t*) is unique up to the specific choice of this order. In the following we assume that children are ordered from left to right based on the position of their leftmost descendants.
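The labeling step can be sketched in Python as follows. The code computes, for each node of a simple dependency tree, the template of its yield function and the ordered list of its children, and then collects one rule per node. The sentence and tree are hypothetical (chosen so that node 2 has the blocks 1 2 and 5 6 7), the helper names are our own, and the direct block computation stands in for the more efficient algorithm of Section 4.3.

```python
def descendants(children, u):
    result = {u}
    for c in children.get(u, []):
        result |= descendants(children, c)
    return result


def blocks(children, u):
    result = []
    for v in sorted(descendants(children, u)):
        if result and result[-1][1] == v - 1:
            result[-1] = (result[-1][0], v)
        else:
            result.append((v, v))
    return result


def template_for(children, words, u):
    """Return (ordered children, template) for node u."""
    # children ordered by the position of their leftmost descendants
    kids = sorted(children.get(u, []), key=lambda c: min(descendants(children, c)))
    start = {}   # left endpoint of a child block -> (variable, right endpoint)
    for i, c in enumerate(kids, start=1):
        for j, (left, right) in enumerate(blocks(children, c), start=1):
            start[left] = ((i, j), right)
    template = []
    for left, right in blocks(children, u):
        component, p = [], left
        while p <= right:
            if p == u:                       # the anchor
                component.append(words[u])
                p += 1
            else:                            # a whole block of some child
                var, end = start[p]
                component.append(var)
                p = end + 1
        template.append(tuple(component))
    return kids, tuple(template)


def extract_rules(children, words):
    """One rule (node, template, ordered children) per node."""
    return [(u,) + template_for(children, words, u)[::-1] for u in sorted(words)]


# Hypothetical sentence and tree; nonterminals are node positions.
WORDS = dict(enumerate("A hearing is scheduled on the issue today".split(), start=1))
CHILDREN = {3: [2, 4], 2: [1, 5], 5: [7], 7: [6], 4: [8]}

print(template_for(CHILDREN, WORDS, 2))
# ([1, 5], (((1, 1), 'hearing'), ((2, 1),)))
```

For the root node 3 the same function returns the template 〈*x*_{1,1} is *x*_{2,1} *x*_{1,2} *x*_{2,2}〉 with children 2 and 4.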

### 4.3 Computing the Blocks of a Dependency Tree

The algorithmically most interesting part of our extraction procedure is the computation of the yield function *g*. The template of *g* is uniquely determined by the left-to-right sequence of the endpoints of the blocks of *u* and its children. An efficient algorithm that can be used to compute these sequences is given in Table 1.

#### 4.3.1 Description

We start at a virtual root node ⊥ (line 1) which serves as the parent of the real root node. For each node *next* in the precedence order of *D*, we follow the shortest path from the current node *current* to *next*. To determine this path, we compute the lowest common ancestor *lca* of the two nodes (lines 4–5), using a set of markings on the nodes. At the beginning of each iteration of the *for* loop in line 2, all ancestors of *current* (including the virtual root node ⊥) are marked; therefore, we find *lca* by going upwards from *next* to the first node that is marked. To restore the loop invariant, we then unmark all nodes on the path from *current* to *lca* (lines 6–9). Each time we move down from a node to one of its children (line 12), we record the information that *next* is the left endpoint of a block of *current*. Symmetrically, each time we move up from a node to its parent (lines 8 and 17), we record the information that *next* – 1 is the right endpoint of a block of *current*. The *while* loop in lines 15–18 takes us from the last node of the dependency tree back to the node ⊥.

#### 4.3.2 Runtime Analysis

We analyze the runtime of our algorithm. Let *m* be the total number of blocks of *D*. Let us write *n*_{i} for the total number of iterations of the *i*th *while* loop, and let *n* = *n*_{1} + *n*_{2} + *n*_{3} + *n*_{4}. Under the reasonable assumption that every line in Table 1 can be executed in constant time, the runtime of the algorithm clearly is in *O*(*n*). Because each iteration of loop 2 and loop 4 determines the right endpoint of a block, we have *n*_{2} + *n*_{4} = *m*. Similarly, as each iteration of loop 3 fixes the left endpoint of a block, we have *n*_{3} = *m*. To determine *n*_{1}, we note that every node that is pushed to the auxiliary stack in loop 1 is popped again in loop 3; therefore, *n*_{1} = *n*_{3} = *m*. Putting everything together, we have *n* = 3*m*, and we conclude that the runtime of the algorithm is in *O*(*m*). Note that this runtime is asymptotically optimal for the task we are considering.
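The traversal described above can be sketched in Python as follows. This is our own reading of the prose description, not a transcription of Table 1: the virtual root ⊥ is encoded as node 0, the four `while` loops are labeled in comments to match the runtime analysis, and the `parent` map in the example is a hypothetical tree.

```python
def block_endpoints(parent, n):
    """Compute the blocks of every node of a simple dependency tree.

    parent maps each node 1..n to its parent; the parent of the root is the
    virtual root 0. Returns a dict from nodes to lists of (left, right) pairs.
    """
    left = {u: [] for u in range(1, n + 1)}
    right = {u: [] for u in range(1, n + 1)}
    marked = {0}        # invariant: current and all of its ancestors are marked
    current = 0
    for nxt in range(1, n + 1):
        # loop 1: climb from nxt to the lowest marked ancestor (the lca)
        stack, lca = [], nxt
        while lca not in marked:
            stack.append(lca)
            lca = parent[lca]
        # loop 2: move up from current to lca, closing blocks at nxt - 1
        while current != lca:
            right[current].append(nxt - 1)
            marked.discard(current)
            current = parent[current]
        # loop 3: move down from lca to nxt, opening blocks at nxt
        while stack:
            current = stack.pop()
            marked.add(current)
            left[current].append(nxt)
    # loop 4: return from the last node to the virtual root
    while current != 0:
        right[current].append(n)
        current = parent[current]
    return {u: list(zip(left[u], right[u])) for u in left}


# Hypothetical tree with root 3.
PARENT = {1: 2, 2: 3, 3: 0, 4: 3, 5: 2, 6: 7, 7: 5, 8: 4}
BLOCKS = block_endpoints(PARENT, 8)
print(BLOCKS[2])   # [(1, 2), (5, 7)]
```

In this example the tree has *m* = 10 blocks in total, and each block is opened exactly once in loop 3 and closed exactly once in loop 2 or loop 4, in line with the analysis.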

## 5. Canonical Grammars

Our extraction technique produces a restricted type of lexicalized linear context-free rewriting system that we will refer to as “canonical.” In this section we provide a declarative characterization of these grammars, and show that every lexicalized LCFRS is equivalent to a canonical one.

### 5.1 Definition of Canonical Grammars

Let *f* : *k*_{1} ⋯ *k*_{m} → *k* be a yield function with template 〈α_{1}, …, α_{k}〉. For variables *x*, *y* we write *x* <_{f} *y* to state that *x* precedes *y* in the template of *f*, that is, in the string α_{1} ⋯ α_{k}. Recall that, in the context of our extraction procedure, the components in the template of *f* represent the blocks of a node *u*, and the variables in the template represent the blocks of the children of *u*. For a variable *x*_{i,j} we call *i* the **argument index** and *j* the **component index** of the variable.

**Property 1**

For all 1 ≤ *i*_{1}, *i*_{2} ≤ *m*, if *i*_{1} < *i*_{2} then $x_{i_1,1} <_f x_{i_2,1}$.

This property is an artifact of our decision to order the children of a node from left to right based on the position of their leftmost descendants. A variable with argument index *i* represents a block of the *i*th child of *u* in that order. An example of a yield function that does not have Property 1 is 〈*x*_{2,1}*x*_{1,1}〉, which defines a kind of “reverse concatenation operation.”

**Property 2**

For all 1 ≤ *i* ≤ *m* and 1 ≤ *j*_{1}, *j*_{2} ≤ *k*_{i}, if *j*_{1} < *j*_{2} then $x_{i,j_1} <_f x_{i,j_2}$.

This property reflects that, in our extraction procedure, the variable *x*_{i,j} represents the *j*th block of the *i*th child of *u*, where the blocks of a node are ordered from left to right based on their precedence. An example of a yield function that violates the property is 〈*x*_{1,2}*x*_{1,1}〉, which defines a kind of **swapping operation**. In the literature on LCFRSs and related formalisms, yield functions with Property 2 have been called **monotone** (Michaelis 2001; Kracht 2003), **ordered** (Villemonte de la Clergerie 2002; Kallmeyer 2010), and **non-permuting** (Kanazawa 2009).

**Property 3**

No component α_{h} is the empty string.

This property, which is similar to ε-freeness as known from context-free grammars, has been discussed for multiple context-free grammars (Seki et al. 1991, Property N3 in Lemma 2.2) and range concatenation grammars (Boullier 1998, Section 5.1). For our extracted grammars it holds because each component *α*_{h} represents a block, and blocks are always non-empty.

**Property 4**

No component α_{h} contains a substring of the form $x_{i,j_1} x_{i,j_2}$.

This property, which does not seem to have been discussed in the literature before, is a reflection of the facts that variables with the same argument index represent blocks of the same child node, and that these blocks are *longest* segments of descendants.

A yield function with Properties 1–4 is called **canonical**. An LCFRS is canonical if all of its yield functions are canonical.
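The four properties can be checked mechanically on a template. The following Python sketch does so under our usual illustrative encoding (pairs `(i, j)` for variables, strings for yield symbols); the function name is hypothetical, and the template is assumed to be well-formed in the sense that every argument index *i* occurs with component indices 1, …, *k*_{i}.

```python
def canonical_properties(template):
    """Return a dict mapping property numbers 1-4 to True or False."""
    seq = [s for comp in template for s in comp if isinstance(s, tuple)]
    pos = {var: p for p, var in enumerate(seq)}   # position in alpha_1 ... alpha_k
    m = max((i for i, _ in seq), default=0)
    return {
        # Property 1: first components are ordered by argument index
        # (checking neighbouring indices suffices by transitivity)
        1: all(pos[(i, 1)] < pos[(i + 1, 1)] for i in range(1, m)),
        # Property 2: components of the same argument occur in order
        2: all(pos[(i, j)] < pos[(i, j + 1)] for i, j in seq if (i, j + 1) in pos),
        # Property 3: no empty component
        3: all(len(comp) > 0 for comp in template),
        # Property 4: no two adjacent variables with the same argument index
        4: not any(isinstance(a, tuple) and isinstance(b, tuple) and a[0] == b[0]
                   for comp in template for a, b in zip(comp, comp[1:])),
    }


def is_canonical(template):
    return all(canonical_properties(template).values())


print(canonical_properties((((2, 1), (1, 1)),)))
# {1: False, 2: True, 3: True, 4: True}
print(is_canonical((((1, 1), "c", (1, 2)),)))   # True
```

Note that the swapping template 〈*x*_{1,2}*x*_{1,1}〉 fails Property 4 as well as Property 2, because its two variables are adjacent and share an argument index.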

**Lemma 1**

A lexicalized LCFRS is canonical if and only if it can be extracted from a dependency treebank using the technique presented in Section 4.

**Proof**

We have already argued for the “if” part of the claim: every grammar extracted by our technique satisfies Properties 1–4. To prove the “only if” part, it suffices to show that for every canonical, lexicalized yield function *f*, one can construct a dependency tree such that the construction tree extracted for this dependency tree contains *f*. This is an easy exercise.

We conclude by noting that Properties 2–4 are also shared by the treebank grammars extracted from constituency treebanks using the technique by Maier and Søgaard (2008).

### 5.2 Equivalence Between General and Canonical Grammars

Two lexicalized LCFRSs are called **strongly equivalent** if they induce the same set of dependency trees. We show the following equivalence result:

**Lemma 2**

For every lexicalized LCFRS *G* one can construct a strongly equivalent lexicalized LCFRS *G*′ such that *G*′ is canonical.

**Proof**

Our proof of this lemma uses two normal-form results about multiple context-free grammars: Michaelis (2001, Section 2.4) provides a construction that transforms a multiple context-free grammar into a weakly equivalent multiple context-free grammar in which all rules satisfy Property 2, and Seki et al. (1991, Lemma 2.2) present a corresponding construction for Property 3. Whereas both constructions are only quoted to preserve weak equivalence, we can verify that, in the special case where the input grammar is a lexicalized LCFRS, they also preserve the set of induced dependency trees. To complete the proof of Lemma 2, we show that every lexicalized LCFRS can be cast into normal forms that satisfy Property 1 and Property 4. It is not hard then to combine the four constructions into a single one that simultaneously establishes all properties of canonical yield functions.

**Lemma 3**

For every lexicalized LCFRS *G* one can construct a strongly equivalent lexicalized LCFRS *G*′ such that *G*′ only contains yield functions which satisfy Property 1.

**Proof**

The proof is very simple. Intuitively, Property 1 enforces a canonical naming of the arguments of yield functions. To establish it, we determine, for every yield function *f*, a permutation π that renames the argument indices of the variables occurring in the template of *f* in such a way that the template meets Property 1. This renaming gives rise to a modified yield function $f_\pi$. We then replace every rule *A* → *f*(*A*_{1}, …, *A*_{m}) with the modified rule $A \to f_\pi(A_{\pi(1)}, \ldots, A_{\pi(m)})$, in which the *i*th argument of $f_\pi$ is the π(*i*)th argument of *f*.
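The renaming can be sketched in Python as follows, again with pairs `(i, j)` for variables. The function name and the example rule are hypothetical; the sketch sorts the arguments by the position of their first component and permutes the nonterminals on the right-hand side of the rule accordingly.

```python
def normalize_argument_order(template, rhs):
    """Rename argument indices so that the template satisfies Property 1.

    rhs is the list of nonterminals A_1, ..., A_m of the rule.
    Returns the renamed template and the permuted list of nonterminals.
    """
    first, p = {}, 0
    for comp in template:
        for s in comp:
            if isinstance(s, tuple):
                if s[1] == 1:
                    first[s[0]] = p        # position of x_{i,1}
                p += 1
    order = sorted(first, key=first.get)   # order[new - 1] = old argument index
    new_index = {old: new for new, old in enumerate(order, start=1)}
    new_template = tuple(
        tuple((new_index[s[0]], s[1]) if isinstance(s, tuple) else s for s in comp)
        for comp in template
    )
    return new_template, [rhs[old - 1] for old in order]


# Hypothetical rule A -> <x_{2,1} a x_{1,1}>(N, V).
print(normalize_argument_order((((2, 1), "a", (1, 1)),), ["N", "V"]))
# ((((1, 1), 'a', (2, 1)),), ['V', 'N'])
```

Both versions of the rule produce the same yield and the same dependencies; only the naming of the arguments changes.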

**Lemma 4**

For every lexicalized LCFRS *G* one can construct a strongly equivalent lexicalized LCFRS *G*′ such that *G*′ only contains yield functions which satisfy Property 4.

**Proof**

The idea behind our construction of the grammar *G*′ is perhaps best illustrated by an example. Imagine that the grammar *G* generates the term *t* shown in Figure 8a. The yield function *f*_{1} = 〈*x*_{1}*c x*_{2}*x*_{3}〉 at the root node of that term violates Property 4, as its template contains the offending substring *x*_{2}*x*_{3}. We set up *G*′ in such a way that instead of *t* it generates the term *t*′ shown in Figure 8b in which *f*_{1} is replaced with the yield function *f*′_{1} = 〈*x*_{1}*c x*_{2}〉. To obtain *f*′_{1} from *f*_{1} we *reduce* the offending substring *x*_{2}*x*_{3} to the single variable *x*_{2}. In order to ensure that *t* and *t*′ induce the same dependency tree (shown in Figure 8c), we then *adapt* the function *f*_{2} = 〈*x*_{1}*b*, *y*, *x*_{2}〉 at the first child of the root node: Dual to the reduction, we replace the two-component sequence *y*, *x*_{2} in the template of *f*_{2} with the single component *y x*_{2}; in this way we get *f*′_{2} = 〈*x*_{1}*b*, *y x*_{2}〉.

An algorithm for constructing *G*′ is given in Table 2. For every rule *A* → *f*(*A*_{1}, …, *A*_{m}) of *G* we construct new rules

$$(A, g) \to f'((A_1, g_1), \ldots, (A_m, g_m))$$

where *g* and the *g*_{i} are yield functions encoding adaptation operations. As an example, the adaptation of the function *f*_{2} in the term *t* may be encoded into the adaptor function 〈*x*_{1}, *x*_{2}*x*_{3}〉. The function *f*′_{2} can then be written as the composition of this function and *f*_{2}:

$$f'_2 = \langle x_1, x_2 x_3 \rangle \circ f_2$$

The yield function *f*′ and the adaptor functions *g*_{i} are computed based on the template of the *g*-adapted yield function *f*, that is, the composed function *g* ∘ *f*. In Table 2 we write this as *f*′ = *reduce*(*f*, *g*) and *g*_{i} = *adapt*(*f*, *g*, *i*), respectively. Let us denote the template of the adapted function *g* ∘ *f* by τ. An *i-block* of τ is a maximal, non-empty substring of some component of τ that consists of variables with argument index *i*. To compute the template of *g*_{i} we read the *i*-blocks of τ from left to right and rename the variables by changing their argument indices from *i* to 1. To compute the template of *f*′ we take the template τ and replace the *j*th *i*-block with the variable *x*_{i,j}, for all argument indices *i* and component indices *j*.
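The two operations *reduce* and *adapt* can be sketched together in Python. The encoding (pairs `(i, j)` for variables, strings for anchors) and the function names are assumptions of this sketch; it computes the composed template τ, splits each component into *i*-blocks, and returns both the reduced template and the adaptor functions. The example replays the functions *f*_{1} and *f*_{2} from the proof.

```python
def compose(g, f):
    """Template of g o f: substitute the j-th component of f for x_{1,j} in g."""
    return tuple(tuple(s for (_, j) in comp for s in f[j - 1]) for comp in g)


def split_runs(component):
    """Split a component into i-blocks and single yield symbols."""
    runs = []
    for s in component:
        i = s[0] if isinstance(s, tuple) else None
        if i is not None and runs and runs[-1][0] == i:
            runs[-1][1].append(s)          # extend the current i-block
        else:
            runs.append((i, [s]))
    return runs


def reduce_and_adapt(f, g, rank):
    """Return (template of f', [templates of g_1, ..., g_rank])."""
    tau = compose(g, f)
    count = {i: 0 for i in range(1, rank + 1)}
    adaptors = {i: [] for i in range(1, rank + 1)}
    reduced = []
    for comp in tau:
        new_comp = []
        for i, run in split_runs(comp):
            if i is None:
                new_comp.extend(run)                   # keep the anchor
            else:
                count[i] += 1
                new_comp.append((i, count[i]))         # j-th i-block -> x_{i,j}
                adaptors[i].append(tuple((1, j) for (_, j) in run))
        reduced.append(tuple(new_comp))
    return tuple(reduced), [tuple(adaptors[i]) for i in range(1, rank + 1)]


IDENTITY = (((1, 1),),)                                # <x>
F1 = (((1, 1), "c", (1, 2), (1, 3)),)                  # <x1 c x2 x3>
F2 = (((1, 1), "b"), ((2, 1),), ((1, 2),))             # <x1 b, y, x2>

f1_reduced, (g1,) = reduce_and_adapt(F1, IDENTITY, 1)
print(f1_reduced)   # (((1, 1), 'c', (1, 2)),)
print(g1)           # (((1, 1),), ((1, 2), (1, 3)))
f2_reduced, _ = reduce_and_adapt(F2, g1, 2)
print(f2_reduced)   # (((1, 1), 'b'), ((2, 1), (1, 2)))
```

The adaptor `g1` is the function 〈*x*_{1}, *x*_{2}*x*_{3}〉 from the text, and `f2_reduced` is 〈*x*_{1}*b*, *y x*_{2}〉.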

| Input: a linear context-free rewriting system *G* = (*N*, Σ, *P*, *S*) |
| --- |
| 1: *P*′ ← ∅; *agenda* ← {(*S*, 〈*x*〉)}; *chart* ← ∅ |
| 2: **while** *agenda* is not empty **do** |
| 3:  remove some (*A*, *g*) from *agenda* |
| 4:  **if** (*A*, *g*) ∉ *chart* **then** |
| 5:   add (*A*, *g*) to *chart* |
| 6:   **for each** rule *A* → *f*(*A*_{1}, …, *A*_{m}) ∈ *P* **do** |
| 7:    *f*′ ← *reduce*(*f*, *g*); *g*_{i} ← *adapt*(*f*, *g*, *i*) (1 ≤ *i* ≤ *m*) |
| 8:    **for each** *i* from 1 to *m* **do** |
| 9:     add (*A*_{i}, *g*_{i}) to *agenda* |
| 10:    add (*A*, *g*) → *f*′((*A*_{1}, *g*_{1}), …, (*A*_{m}, *g*_{m})) to *P*′ |


Our algorithm is controlled by an agenda and a chart, both containing pairs of the form (*A*, *g*), where *A* is a nonterminal of *G* and *g* is an adaptor function. These pairs also constitute the nonterminals of the new grammar *G*′. The fan-out of a nonterminal is the fan-out of *g*. The agenda is initialized with the pair (*S*, 〈*x*〉) where 〈*x*〉 is the identity function; this pair also represents the start symbol of *G*′. To see that the algorithm terminates, one may observe that the fan-out of every nonterminal (*A*, *g*) added to the agenda is upper-bounded by the fan-out of *A*. Hence, there are only finitely many pairs (*A*, *g*) that may occur in the chart, and a finite number of iterations of the *while*-loop.

## 6. Parsing and Recognition

Lexicalized linear context-free rewriting systems are able to account for arbitrarily non-projective dependency trees. This expressiveness comes with a price: In this section we show that parsing with lexicalized LCFRSs is intractable, unless we are willing to restrict the class of grammars.

### 6.1 Parsing Algorithm

To ground our discussion of parsing complexity, we present a simple bottom–up parsing algorithm for LCFRSs, specified as a grammatical deduction system (Shieber, Schabes, and Pereira 1995). Several similar algorithms have been described in the literature (Seki et al. 1991; Bertsch and Nederhof 2001; Kallmeyer 2010). We assume that we are given a grammar *G* = (*N*, Σ, *P*, *S*) and a string *w* = *a*_{1} ⋯ *a*_{n} ∈ *V** to be parsed.

**Item form.** The items of the deduction system take the form

$$[A, l_1, r_1, \ldots, l_k, r_k]$$

where *A* ∈ *N* with *ϕ*(*A*) = *k*, and the remaining components are indices identifying the left and right endpoints of pairwise non-overlapping substrings of *w*. More formally, 0 ≤ *l*_{h} ≤ *r*_{h} ≤ *n*, and for all *h*, *h*′ with *h* ≠ *h*′, either *r*_{h} ≤ *l*_{h′} or *r*_{h′} ≤ *l*_{h}. The intended interpretation of an item of this form is that *A* derives a term *t* ∈ *T*(*G*) that yields the specified substrings of *w*, that is,

$$A \Rightarrow_G^{*} t \quad\text{and}\quad \mathit{yield}(t) = \langle a_{l_1+1} \cdots a_{r_1}, \ldots, a_{l_k+1} \cdots a_{r_k} \rangle.$$

**Goal item.** The goal item is [*S*, 0, *n*]. By this item, there exists a term that can be derived from the start symbol *S* and yields the full string 〈*w*〉.

**Inference rules.** The inference rules of the deduction system are defined based on the rules in *P*. Each production rule *A* → *f*(*A*_{1}, …, *A*_{m}) with *f* : *k*_{1} ⋯ *k*_{m} → *k* is converted into a set of inference rules of the form

$$\frac{[A_1, l_{1,1}, r_{1,1}, \ldots, l_{1,k_1}, r_{1,k_1}] \quad \cdots \quad [A_m, l_{m,1}, r_{m,1}, \ldots, l_{m,k_m}, r_{m,k_m}]}{[A, l_1, r_1, \ldots, l_k, r_k]} \tag{3}$$

Each such rule is subject to the following constraints. Let 1 ≤ *h* ≤ *k*, *v* ∈ *V**, 1 ≤ *i* ≤ *m*, and 1 ≤ *j* ≤ *k*_{i}. We write *δ*(*l*, *v*) = *r* to assert that *r* = *l* + |*v*| and that *v* is the substring of *w* between indices *l* and *r*. These constraints ensure that the substrings corresponding to the premises of the inference rule can be combined into the substrings corresponding to the conclusion by means of the yield function *f*.

Based on the deduction system, a tabular parser for LCFRSs can be implemented using standard dynamic programming techniques. This parser will compute a packed representation of the set of all derivation trees that the grammar *G* assigns to the string *w*. Such a packed representation is often called a **shared forest** (Lang 1994). In combination with appropriate semirings, the shared forest is useful for many tasks in syntactic analysis and machine learning (Goodman 1999; Li and Eisner 2009).
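The deduction system can be illustrated with a small recognizer. The following is a minimal sketch, not the tabular parser itself: it computes the chart by naive fixpoint iteration and records items only, without building a shared forest. The rule encoding (triples of left-hand side, template, and argument nonterminals, with zero-based variables `(i, j)`), the function names, and the toy grammar are assumptions made for this illustration. Matching terminals and argument spans from left to right stands in for the side constraints on the inference rules.

```python
from itertools import product


def instantiate(template, premises, w):
    """Yield every tuple of spans that the template can produce from the premises.

    template: tuple of components; a component is a tuple whose elements are
              either a terminal (str) or a variable (i, j), both zero-based.
    premises: premises[i][j] is the span (l, r) of component j of argument i.
    """
    def end_of(component, left):
        pos = left
        for sym in component:
            if isinstance(sym, tuple):            # variable x_{i,j}
                l, r = premises[sym[0]][sym[1]]
                if l != pos:                      # spans must be adjacent
                    return None
                pos = r
            elif pos < len(w) and w[pos] == sym:  # terminal must match w
                pos += 1
            else:
                return None
        return pos

    choices = []
    for component in template:
        spans = []
        for left in range(len(w) + 1):
            right = end_of(component, left)
            if right is not None:
                spans.append((left, right))
        choices.append(spans)
    for spans in product(*choices):
        # the components of one item must be pairwise non-overlapping
        if all(a[1] <= b[0] or b[1] <= a[0]
               for h, a in enumerate(spans) for b in spans[h + 1:]):
            yield spans


def recognize(rules, start, w):
    """Return True if `start` derives a term that yields the 1-tuple <w>."""
    chart = set()                                 # items (A, spans)
    changed = True
    while changed:
        changed = False
        for lhs, template, rhs in rules:
            pools = [[s for (b, s) in chart if b == a] for a in rhs]
            for premises in product(*pools):
                for spans in instantiate(template, premises, w):
                    if (lhs, spans) not in chart:
                        chart.add((lhs, spans))
                        changed = True
    return (start, ((0, len(w)),)) in chart       # goal item [S, 0, n]


# Toy grammar (not from the article) for { a^n b^n c^n d^n : n >= 1 }.
RULES = [
    ("S", (((0, 0), (0, 1)),), ("A",)),
    ("A", (("a", (0, 0), "b"), ("c", (0, 1), "d")), ("A",)),
    ("A", (("a", "b"), ("c", "d")), ()),
]

print(recognize(RULES, "S", "aabbccdd"))
print(recognize(RULES, "S", "aabbcd"))
```

The sketch tries every left endpoint for every component, so it is far from the complexity bounds discussed next; an agenda-driven implementation would index items by their endpoints instead.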

### 6.2 Parsing Complexity

The runtime of the parsing algorithm is *O*(|*G*||*w*|^{c}), where |*G*| denotes the size of some suitable representation of the grammar *G*, and *c* denotes the maximal number of instantiations of an inference rule (cf. McAllester 2002). Let us write *c*(*f*) for the specialization of *c* to inference rules for productions with yield function *f*. We refer to this value as the **parsing complexity** of *f* (cf. Gildea 2010). Then to show an upper bound on *c* it suffices to show an upper bound on the parsing complexities of the yield functions that the parser has to handle. An obvious such upper bound for a yield function *f* : *k*_{1} ⋯ *k*_{m} → *k* is

$$c(f) \le 2k + \sum_{i=1}^{m} 2k_i$$

Here we imagine that we could choose each endpoint in Equation (3) independently of all the others. By virtue of the constraints, however, some of the endpoints cannot be chosen freely; in particular, some of the substrings may be adjacent. In general, to show an upper bound *c*(*f*) ≤ *b* we specify a strategy for choosing *b* endpoints, and then argue that, given the constraints, these choices determine the remaining endpoints.

**Lemma 5**

For every yield function *f* : *k*_{1} ⋯ *k*_{m} → *k*,

$$c(f) \le k + \sum_{i=1}^{m} k_i$$

**Proof**

We adopt the following strategy for choosing endpoints: For 1 ≤ *h* ≤ *k*, choose the value of *l*_{h}. Then, for 1 ≤ *i* ≤ *m* and 1 ≤ *j* ≤ *k*_{i}, choose the value of *r*_{i,j}. It is not hard to see that these choices suffice to determine all other endpoints. In particular, each left endpoint *l*_{i′,j′} will be shared either with the left endpoint *l*_{h} of some component (by constraint C2), or with some right endpoint *r*_{i,j} (by constraint C4).
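To make the counting argument concrete, the following sketch compares the bound obtained by choosing every endpoint independently with the bound obtained from the strategy in the proof (one left endpoint per component of the result, one right endpoint per component of each argument). The function names are hypothetical and chosen only for this illustration.

```python
def naive_bound(k, arg_fanouts):
    """All endpoints chosen independently: two endpoints per component."""
    return 2 * k + 2 * sum(arg_fanouts)


def strategy_bound(k, arg_fanouts):
    """One left endpoint per component of the result, plus one right
    endpoint per component of each argument, as in the proof strategy."""
    return k + sum(arg_fanouts)


# A binary context-free rule: fan-out 1, two arguments of fan-out 1.
print(naive_bound(1, [1, 1]), strategy_bound(1, [1, 1]))   # 6 3
```

For a binary context-free rule the strategy gives the exponent 3, in line with the usual cubic-time algorithms for context-free grammars.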

### 6.3 Universal Recognition

The runtime of our parsing algorithm for LCFRSs is exponential in both the rank and the fan-out of the input grammar. One may wonder whether there are parsing algorithms that can be substantially faster. We now show that the answer to this question is likely to be negative even if we restrict ourselves to canonical lexicalized LCFRSs. To this end we study the universal recognition problem for this class of grammars.

The **universal recognition problem** for a class of linear context-free rewriting systems is to decide, given a grammar *G* from the class in question and a string *w*, whether *G* yields 〈*w*〉. A straightforward algorithm for solving this problem is to first compute the shared forest for *G* and *w*, and to return “yes” if and only if the shared forest is non-empty. Choosing appropriate data structures, the emptiness of shared forests can be decided in linear time and space with respect to the size of the forest. Therefore, the computational complexity of universal recognition is upper-bounded by the complexity of constructing the shared forest. Conversely, parsing cannot be faster than universal recognition.

In the next three lemmas we prove that the universal recognition problem for canonical lexicalized LCFRSs is NP-complete unless we restrict ourselves to a class of grammars where both the fan-out and the rank of the yield functions are bounded by constants. Lemma 6, which shows that the universal recognition problem of lexicalized LCFRSs is in NP, distinguishes lexicalized LCFRSs from general LCFRSs, for which the universal recognition problem is known to be PSPACE-complete (Kaji et al. 1992). The crucial difference between general and lexicalized LCFRSs is the fact that in the latter, the size of the generated terms is bounded by the length of the input string. Lemma 7 and Lemma 8, which establish two NP-hardness results for lexicalized LCFRSs, are stronger versions of the corresponding results for general LCFRSs presented by Satta (1992), and are proved using similar reductions. They show that the hardness results hold under significant restrictions of the formalism: to lexicalized form and to canonical yield functions. Note that, whereas in Section 5.2 we have shown that every lexicalized LCFRS is equivalent to a canonical one, the normal form transformation increases the size of the original grammar by a factor that is at least exponential in the fan-out.

**Lemma 6**

The universal recognition problem of lexicalized LCFRSs is in NP.

**Proof**

Let *G* be a lexicalized LCFRS, and let *w* be a string. To test whether *G* yields 〈*w*〉, we guess a term *t* ∈ *T*(*G*) and check whether *t* yields 〈*w*〉. Let |*t*| denote the length of some string representation of *t*. Since the yield functions of *G* are lexicalized, |*t*| ≤ |*w*||*G*|. Using a simple tabular algorithm, we can verify in time *O*(|*w*||*G*|) whether a candidate term *t* belongs to *T*(*G*). It is then straightforward to compute the string yield of *t* in time *O*(|*w*||*G*|). Thus we have a nondeterministic polynomial-time decider for the universal recognition problem.

For the following two lemmas, recall the decision problem 3SAT, which is known to be NP-complete. An instance of 3SAT is a Boolean formula φ in conjunctive normal form where each clause contains exactly three literals, which may be either variables or negated variables. We write *m* for the number of distinct variables that occur in φ, and *n* for the number of clauses. In the proofs the index *i* will always range over values from 1 to *m*, and the index *j* will range over values from 1 to *n*.

In order to make the grammars in the following reductions more readable, we use yield functions with more than one lexical anchor. Our use of these yield functions is severely restricted, however, and each of our grammars can be transformed into a proper lexicalized LCFRS without affecting the correctness or polynomial size of the reductions.

**Lemma 7**

The universal recognition problem for canonical lexicalized LCFRSs with unbounded fan-out and rank 1 is NP-hard.

**Proof**

To prove this claim, we provide a polynomial-time reduction of 3SAT. The basic idea is to use the derivations of the grammar to guess truth assignments for the variables, and to use the feature of unbounded fan-out to ensure that the truth assignment satisfies all clauses.

We construct a grammar *G* and a string *w* as follows. Let *M* denote the *m* × *n* matrix with entries *M*_{i,j} = (*v*_{i}, *c*_{j}), that is, entries in the same row share the same variable, and entries in the same column share the same clause. We set up *G* in such a way that each of its derivations simulates a row-wise iteration over *M*. Before visiting a new row, the derivation chooses a truth value for the corresponding variable, and sticks to that choice until the end of the row. The string *w* is built up during the iteration over *M* in a column-wise fashion, where each column corresponds to one component of a tuple with fan-out *n*. More specifically, for each entry (*v*_{i}, *c*_{j}), the derivation generates one of two strings. One of these strings is generated only if *v*_{i} can be used to satisfy *c*_{j} under the hypothesized truth assignment. By this construction, every successful derivation of *G* represents a truth assignment that satisfies φ. Conversely, using a satisfying truth assignment for φ, we will be able to construct a derivation of *G* that yields *w*.

To see how the traversal of the matrix *M* can be implemented by the grammar *G*, consider the grammar fragment in Figure 9. Each of the rules specifies one possible step of the iteration for the pair (*v*_{i}, *c*_{j}) under the truth assignment *v*_{i} = *true*; rules with left-hand side *F*_{i,j} (not shown here) specify possible steps under the assignment *v*_{i} = *false*.

**Lemma 8**

The universal recognition problem for canonical lexicalized LCFRSs with unbounded rank and fan-out 2 is NP-hard.

**Proof**

We construct a grammar *G* and a string *w*, again based on the matrix *M* mentioned in the previous proof. Also as in the previous reduction, we set up the grammar *G* to simulate a row-wise iteration over *M*. The major difference this time is that the entries of *M* are not visited during one long rank 1 derivation, but during *mn* rather short fan-out 2 subderivations. During the traversal of *M*, for each entry (*v*_{i}, *c*_{j}), we generate a tuple consisting of two substrings of *w*. The right component of the tuple consists of one of the two strings mentioned previously. As before, one of these strings is generated only if *v*_{i} can be used to satisfy *c*_{j} under the hypothesized truth assignment. The left component consists of one of two strings, the first of which is denoted by σ_{i,j}. These strings are generated to represent the truth assignments *v*_{i} = *true* and *v*_{i} = *false*, respectively. By this construction, each substring *w*_{⊲, i} can be derived in exactly one of two ways, ensuring a consistent truth assignment for all subderivations that are linked to the same variable *v*_{i}.

The grammar *G* is defined as follows. There is one rather complex rule to rewrite the start symbol *S*; this rule sets up the general topology of *w*. Let *I* be the *m* × *n* matrix with entries *I*_{i,j} = (*i* − 1)*n* + *j*. Consider first the sequence of variables of the form *x*_{h,1}, where the argument index *h* is taken from a row-wise reading of the matrix *I*; in this case, the argument indices will simply go up from 1 to *mn*. Consider next the sequence of variables of the form *x*_{h,2}, where *h* is taken from a column-wise reading of the matrix *I*. Then *S* can be expanded with a rule built from these two sequences. Note that there is one nonterminal *V*_{i,j} for each variable–clause pair (*v*_{i}, *c*_{j}). These nonterminals can be rewritten by a further set of rules. The remaining rules rewrite the nonterminals *T*_{i,j} and *F*_{i,j}. It is not hard to see that both *G* and *w* can be constructed in polynomial time.

## 7. Block-Degree

To obtain efficient parsing, we would like to have grammars with as low a fan-out as possible. Therefore it is interesting to know how low we can go without losing too much coverage. In lexicalized LCFRSs extracted from dependency treebanks, the fan-out of a grammar has a structural correspondence in the maximal number of blocks per subtree, a measure known as “block-degree.” In this section we formally define block-degree, and evaluate grammar coverage under different bounds on this measure.

### 7.1 Definition of Block-Degree

Recall the concept of “blocks” that was defined in Section 4.2. The **block-degree** of a node *u* of a dependency tree *D* is the number of distinct blocks of *u*. The block-degree of *D* is the maximal block-degree of its nodes.^{2}

**Example 6**

Figure 10 shows two non-projective dependency trees. For *D*_{1}, consider the node 2. The descendants of 2 fall into two blocks, marked by the dashed boxes. Because this is the maximal number of blocks per node in *D*_{1}, the block-degree of *D*_{1} is 2. Similarly, we can verify that the block-degree of the dependency tree *D*_{2} is 3.

A dependency tree is **projective** if its block-degree is 1. In a projective dependency tree, each subtree corresponds to a substring of the underlying tuple of strings. In a non-projective dependency tree, a subtree may span over several, discontinuous substrings.

### 7.2 Computing the Block-Degrees

Using a straightforward extension of the algorithm in Table 1, the block-degrees of all nodes of a dependency tree *D* can be computed in time *O*(*m*), where *m* is the total number of blocks. To compute the block-degree of *D*, we simply take the maximum over the degrees of each node. We can also adapt this procedure to test whether *D* is projective, by aborting the computation as soon as we discover that some node has more than one block. The runtime of this test is linear in the number of nodes of *D*.
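As an illustration of the definition, the following sketch computes block-degrees directly from the sets of descendants. It is a simple quadratic-time computation, not the linear-time extension of the algorithm in Table 1. The encoding is an assumption made here: nodes are numbered 0, 1, … by their position in the sentence, and `heads[u]` is the parent of `u` (`None` for the root).

```python
def yields(heads):
    """Return, for every node, the set of its descendants (including itself)."""
    desc = [{u} for u in range(len(heads))]
    for u in range(len(heads)):
        p = heads[u]
        while p is not None:          # walk up to the root
            desc[p].add(u)
            p = heads[p]
    return desc


def blocks(nodes):
    """Split a set of positions into maximal contiguous intervals."""
    result = []
    for u in sorted(nodes):
        if result and result[-1][-1] == u - 1:
            result[-1].append(u)
        else:
            result.append([u])
    return result


def block_degree(heads):
    """Maximal number of blocks over all nodes of the tree."""
    return max(len(blocks(d)) for d in yields(heads))


def is_projective(heads):
    return block_degree(heads) == 1


print(block_degree([None, 0, 1]))      # a chain 0 -> 1 -> 2
print(block_degree([1, None, 0, 1]))   # node 0 has descendants {0, 2}
```

In the second tree, the descendants of node 0 are the positions 0 and 2, which form two blocks because position 1 lies outside the subtree.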

### 7.3 Block-Degree in Extracted Grammars

In a lexicalized LCFRS extracted from a dependency treebank, there is a one-to-one correspondence between the blocks of a node *u* and the components of the template of the yield function *f* extracted for *u*. In particular, the fan-out of *f* is exactly the block-degree of *u*. As a consequence, any bound on the block-degree of the trees in the treebank translates into a bound on the fan-out of the extracted grammar. This has consequences for the generative capacity of the grammars: As Seki et al. (1991) show, the class of LCFRSs with fan-out *k* > 1 can generate string languages that cannot be generated by the class of LCFRSs with fan-out *k* − 1.

It may be worth emphasizing that the one-to-one correspondence between blocks and tuple components is a consequence of two characteristic properties of extracted grammars (Properties 3 and 4), and does not hold for non-canonical lexicalized LCFRSs.

**Example 7**

The following term induces a two-node dependency tree with block-degree 1, but contains yield functions with fan-out 2: 〈*a x*_{1}*x*_{2}〉(〈*b*, ɛ〉). Note that the yield functions in this term violate both Property 3 and Property 4.

### 7.4 Coverage on Dependency Treebanks

In order to assess the consequences of different bounds on the fan-out, we now evaluate the block-degree of dependency trees in real-world data. Specifically, we look into five dependency treebanks used in the 2006 CoNLL shared task on dependency parsing (Buchholz and Marsi 2006): the Prague Arabic Dependency Treebank (Hajič et al. 2004), the Prague Dependency Treebank of Czech (Böhmová et al. 2003), the Danish Dependency Treebank (Kromann 2003), the Slovene Dependency Treebank (Džeroski et al. 2006), and the Metu-Sabancı treebank of Turkish (Oflazer et al. 2003). The full data used in the CoNLL shared task also included treebanks that were produced by conversion of corpora originally annotated with structures other than dependencies, which is a potential source of “noise” that one has to take into account when interpreting any findings. Here, we consider only genuine dependency treebanks. More specifically, our statistics concern the training sections of the treebanks that were set off for the task. For similar results on other data sets, see Kuhlmann and Nivre (2006), Havelka (2007), and Maier and Lichte (2011).

Our results are given in Table 3. For each treebank, we list the number of rules extracted from that treebank, as well as the number of corresponding dependency trees. We then list the number of rules that we lose if we restrict ourselves to rules with fan-out = 1, or rules with fan-out ≤ 2, as well as the number of dependency trees that we lose because their construction trees contain at least one such rule. We count rule *tokens*, meaning that two otherwise identical rules are counted twice if they were extracted from different trees, or from different nodes in the same tree.

**Table 3**

| Treebank | Rules | Trees | Rules lost (fan-out = 1) | Trees lost (fan-out = 1) | Rules lost (fan-out ≤ 2) | Trees lost (fan-out ≤ 2) |
|---|---|---|---|---|---|---|
| Arabic | 55,839 | 1,460 | 411 | 163 | 1 | 1 |
| Czech | 1,322,111 | 72,703 | 22,283 | 16,831 | 328 | 312 |
| Danish | 99,576 | 5,190 | 1,229 | 811 | 11 | 9 |
| Slovene | 30,284 | 1,534 | 530 | 340 | 14 | 11 |
| Turkish | 62,507 | 4,997 | 924 | 580 | 54 | 33 |


By putting the bound at fan-out 1, we lose between 0.74% (Arabic) and 1.75% (Slovene) of the rules, and between 11.16% (Arabic) and 23.15% (Czech) of the trees in the treebanks. This loss is quite substantial. If we instead put the bound at fan-out ≤ 2, then rule loss is reduced by between 94.16% (Turkish) and 99.76% (Arabic), and tree loss is reduced by between 94.31% (Turkish) and 99.39% (Arabic). This outcome is surprising. For example, Holan et al. (1998) argue that it is impossible to give a theoretical upper bound for the block-degree of reasonable dependency analyses of Czech. Here we find that, if we are ready to accept a loss of as little as 0.02% of the rules extracted from the Prague Dependency Treebank, and up to 0.5% of the trees, then such an upper bound can be set at a block-degree as low as 2.
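The percentages for tree loss and for the reduction in loss can be recomputed from the counts in Table 3. The following sketch does this for the figures quoted above; the dictionary layout and the helper name `pct` are choices made for this illustration, and the counts are copied from the table.

```python
# treebank: (trees, rules lost at fan-out = 1, trees lost at fan-out = 1,
#            rules lost at fan-out <= 2, trees lost at fan-out <= 2)
TABLE3 = {
    "Arabic":  (1460, 411, 163, 1, 1),
    "Czech":   (72703, 22283, 16831, 328, 312),
    "Danish":  (5190, 1229, 811, 11, 9),
    "Slovene": (1534, 530, 340, 14, 11),
    "Turkish": (4997, 924, 580, 54, 33),
}


def pct(part, whole):
    """Percentage rounded to two decimal places."""
    return round(100 * part / whole, 2)


for name, (trees, r1, t1, r2, t2) in TABLE3.items():
    tree_loss = pct(t1, trees)              # trees lost at fan-out = 1
    rule_reduction = pct(r1 - r2, r1)       # rule loss avoided at fan-out <= 2
    tree_reduction = pct(t1 - t2, t1)       # tree loss avoided at fan-out <= 2
    print(f"{name:8} {tree_loss:6.2f} {rule_reduction:6.2f} {tree_reduction:6.2f}")
```

The extremes of the three printed columns correspond to the ranges stated in the text: Arabic and Czech for tree loss, Turkish and Arabic for both reductions.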

## 8. Well-Nestedness

The parsing of LCFRSs is exponential both in the fan-out and in the rank of the grammars. In this section we study “well-nestedness,” another restriction on the non-projectivity of dependency trees, and show how enforcing this constraint allows us to restrict our attention to the class of LCFRSs with rank 2.

### 8.1 Definition of Well-Nestedness

Let *D* be a dependency tree, and let *u* and *v* be nodes of *D*. The descendants of *u* and *v* **overlap**, denoted by ⌊*u*⌋ ≬ ⌊*v*⌋, if there exist nodes *u*_{l}, *u*_{r} ∈ ⌊*u*⌋ and *v*_{l}, *v*_{r} ∈ ⌊*v*⌋ such that

$$u_l < v_l < u_r < v_r \quad \text{or} \quad v_l < u_l < v_r < u_r$$

A dependency tree *D* is called **well-nested** if for all pairs of nodes *u*, *v* of *D*

$$\lfloor u \rfloor \between \lfloor v \rfloor \quad \text{implies} \quad \lfloor u \rfloor \cap \lfloor v \rfloor \neq \emptyset$$

In other words, ⌊*u*⌋ and ⌊*v*⌋ may overlap only if *u* is an ancestor of *v*, or *v* is an ancestor of *u*. If this implication does not hold, then *D* is called **ill-nested**.

**Example 8**

The dependency trees *D*_{1} and *D*_{2} are well-nested: *D*_{1} does not contain any overlapping sets of descendants at all. In *D*_{2}, although ⌊1⌋ and ⌊2⌋ overlap, it is also the case that ⌊1⌋ ⊇ ⌊2⌋. In contrast, *D*_{3} is ill-nested.

The following lemma characterizes well-nestedness in terms of blocks.

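**Lemma 9**

A dependency tree is ill-nested if and only if it contains two sibling nodes *u*, *v* with blocks *U*_{1}, *U*_{2} of *u* and blocks *V*_{1}, *V*_{2} of *v* such that

$$U_1 < V_1 < U_2 < V_2 \tag{4}$$

where one block precedes another if all of its nodes precede all nodes of the other.

**Proof**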

Let *D* be a dependency tree. Suppose that *D* contains a configuration of the form (4). This configuration witnesses that the sets ⌊*u*⌋ and ⌊*v*⌋ overlap. Because *u*, *v* are siblings, ⌊*u*⌋ ∩ ⌊*v*⌋ = ∅. Therefore we conclude that *D* is ill-nested. Conversely now, suppose that *D* is ill-nested. In this case, there exist two nodes *u* and *v* such that ⌊*u*⌋ and ⌊*v*⌋ overlap and ⌊*u*⌋ ∩ ⌊*v*⌋ = ∅ (*). Here, we may assume *u* and *v* to be siblings: otherwise, we may replace either *u* or *v* with its parent node, and property (*) will continue to hold. Because ⌊*u*⌋ and ⌊*v*⌋ overlap, there exist descendants *u*_{l}, *u*_{r} ∈ ⌊*u*⌋ and *v*_{l}, *v*_{r} ∈ ⌊*v*⌋ such that

$$u_l < v_l < u_r < v_r \quad \text{or} \quad v_l < u_l < v_r < u_r$$

Without loss of generality, assume that we have the first case. The nodes *u*_{l} and *u*_{r} belong to different blocks of *u*, say *U*_{1} and *U*_{2}; and the nodes *v*_{l} and *v*_{r} belong to different blocks of *v*, say *V*_{1} and *V*_{2}. Then it is not hard to verify Equation (4).

Note that projective dependency trees are always well-nested; in these structures, every node has exactly one block, so configuration (4) is impossible. For every *k* > 1, there are both well-nested and ill-nested dependency trees with block-degree *k*.

### 8.2 Testing for Well-Nestedness

Based on Lemma 9, testing whether a dependency tree *D* is well-nested can be done in time linear in the number of blocks in *D* using a simple subsequence test as follows. We run the algorithm given in Table 1, maintaining a stack *s*[*u*] for every node *u*. The first time we make a down step to *u*, we push *u* to the stack for the parent of *u*; every other time, we pop the stack for the parent until we either find *u* as the topmost element, or the stack becomes empty. In the latter case, we terminate the computation and report that *D* is ill-nested; if the computation can be completed without any stack ever becoming empty, we report that *D* is well-nested.

To show that the algorithm is sound, suppose that some stack *s*[*p*] becomes empty when making a down step to some child *v* of *p*. In this case, the node *v* must have been popped from *s*[*p*] when making a down step to some other child *u* of *p*, and that child must have already been on the stack before the first down step to *v*. This witnesses the existence of a configuration of the form in Equation (4).
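For comparison, well-nestedness can also be tested directly from the definition. The sketch below is not the linear-time stack-based procedure described above; it checks every pair of nodes with disjoint sets of descendants for an interleaving of the form *u*_{l} < *v*_{l} < *u*_{r} < *v*_{r}. The encoding is an assumption made here: nodes are numbered by position, and `heads[u]` is the parent of `u` (`None` for the root).

```python
def yields(heads):
    """Return, for every node, the set of its descendants (including itself)."""
    desc = [{u} for u in range(len(heads))]
    for u in range(len(heads)):
        p = heads[u]
        while p is not None:
            desc[p].add(u)
            p = heads[p]
    return desc


def interleave(a, b):
    """True if there are a_l < b_l < a_r < b_r with a_l, a_r in a and b_l, b_r in b.

    Assumes that a and b are disjoint sets of positions.
    """
    labels = [lab for _, lab in sorted([(x, 0) for x in a] + [(x, 1) for x in b])]
    want, k = (0, 1, 0, 1), 0
    for lab in labels:                 # greedy subsequence matching
        if lab == want[k]:
            k += 1
            if k == 4:
                return True
    return False


def is_well_nested(heads):
    ys = yields(heads)
    for u in range(len(heads)):
        for v in range(u + 1, len(heads)):
            if ys[u].isdisjoint(ys[v]) and (
                    interleave(ys[u], ys[v]) or interleave(ys[v], ys[u])):
                return False           # overlapping but disjoint yields
    return True


print(is_well_nested([1, None, 0, 1]))      # non-projective, well-nested
print(is_well_nested([4, 4, 0, 1, None]))   # yields {0, 2} and {1, 3} interleave
```

In the second tree, nodes 0 and 1 are siblings whose descendants occupy positions {0, 2} and {1, 3}, so their blocks alternate and the tree is ill-nested.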

### 8.3 Well-Nestedness in Extracted Grammars

Recall the relation *x* <_{f} *y* from Section 5.1. A yield function *f* : *k*_{1} ⋯ *k*_{m} → *k* is **ill-nested** if there are argument indices 1 ≤ *i*_{1}, *i*_{2} ≤ *m* with *i*_{1} ≠ *i*_{2} and component indices 1 ≤ *j*_{1}, *j*′_{1} ≤ *k*_{i_1} and 1 ≤ *j*_{2}, *j*′_{2} ≤ *k*_{i_2}, such that

$$x_{i_1,j_1} <_f x_{i_2,j_2} <_f x_{i_1,j_1'} <_f x_{i_2,j_2'} \tag{5}$$

Otherwise, we say that *f* is **well-nested**. As an immediate consequence of Lemma 9, a restriction to well-nested dependency trees translates into a restriction to well-nested yield functions in the extracted grammars. This puts them into the class of what Kanazawa (2009) calls “well-nested multiple context-free grammars.”^{3} These grammars have a number of interesting properties that set them apart from general LCFRSs; in particular, they have a standard pumping lemma (Kanazawa 2009). The yield languages generated by well-nested multiple context-free grammars form a proper subhierarchy within the languages generated by general LCFRSs (Kanazawa and Salvati 2010). Perhaps the most prominent subclass of well-nested LCFRSs is the class of tree-adjoining grammars (Joshi and Schabes 1997).
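The syntactic condition can be checked mechanically. The sketch below assumes that the relation *x* <_{f} *y* orders the variables by their left-to-right position in the template of *f* (an assumption, since Section 5.1 is not repeated here), and represents a template only by the sequence of argument indices of its variables. The function names are hypothetical.

```python
from itertools import permutations


def has_alternation(arg_order, p, q):
    """True if arg_order contains p, q, p, q as a (scattered) subsequence."""
    want, k = (p, q, p, q), 0
    for index in arg_order:
        if index == want[k]:
            k += 1
            if k == 4:
                return True
    return False


def is_well_nested_template(arg_order):
    """arg_order lists the argument index of each variable, left to right."""
    return not any(has_alternation(arg_order, p, q)
                   for p, q in permutations(set(arg_order), 2))


# <x_1 y_1, x_2 y_2>: the variables of the two arguments alternate.
print(is_well_nested_template([1, 2, 1, 2]))
# <x_1 y_1, y_2 x_2>: the second argument is nested inside the first.
print(is_well_nested_template([1, 2, 2, 1]))
```

Terminals and component boundaries play no role in this test, which is why the template is reduced to a sequence of argument indices.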

Similar to the situation with block-degree, the correspondence between structural well-nestedness and syntactic well-nestedness is tight only for canonical grammars. For non-canonical grammars, syntactic well-nestedness alone does not imply structural well-nestedness, nor the other way around.

### 8.4 Coverage on Dependency Treebanks

To estimate the coverage of well-nested grammars, we extend the evaluation presented in Section 7.4. Table 4 shows how many rules and trees in the five dependency treebanks we lose if we restrict ourselves to well-nested yield functions with fan-out ≤ 2. The losses reported in Table 3 are repeated here for comparison. Although the coverage of well-nested rules is significantly smaller than the coverage of rules without this requirement, rule loss is still reduced by between 92.65% (Turkish) and 99.51% (Arabic) when compared to the fan-out = 1 baseline.

**Table 4**

| Treebank | Rules | Trees | Rules lost (fan-out = 1) | Trees lost (fan-out = 1) | Rules lost (fan-out ≤ 2) | Trees lost (fan-out ≤ 2) | Rules lost (fan-out ≤ 2, well-nested) | Trees lost (fan-out ≤ 2, well-nested) |
|---|---|---|---|---|---|---|---|---|
| Arabic | 55,839 | 1,460 | 411 | 163 | 1 | 1 | 2 | 2 |
| Czech | 1,322,111 | 72,703 | 22,283 | 16,831 | 328 | 312 | 407 | 382 |
| Danish | 99,576 | 5,190 | 1,229 | 811 | 11 | 9 | 17 | 15 |
| Slovene | 30,284 | 1,534 | 530 | 340 | 14 | 11 | 17 | 13 |
| Turkish | 62,507 | 4,997 | 924 | 580 | 54 | 33 | 68 | 43 |


### 8.5 Binarization of Well-Nested Grammars

Our main interest in well-nestedness comes from the following:

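**Lemma 10**

The universal recognition problem for well-nested lexicalized LCFRSs with fan-out *k* and unbounded rank can be decided in time *O*(|*G*| · |*w*|^{2k+2}).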

To prove this lemma, we will provide an algorithm for the **binarization** of well-nested lexicalized LCFRSs. In the context of LCFRSs, a binarization is a procedure for transforming a grammar into an equivalent one with rank at most 2. Binarization, either explicit at the level of the grammar or implicit at the level of some parsing algorithm, is essential for achieving efficient recognition algorithms, in particular the usual cubic-time algorithms for context-free grammars. Note that our binarization only preserves *weak* equivalence; in effect, it reduces the universal recognition problem for well-nested lexicalized LCFRSs to the corresponding problem for well-nested LCFRSs with rank 2. Many interesting semiring computations on the original grammar can be simulated on the binarized grammar, however. A direct parsing algorithm for well-nested dependency trees has been presented by Gómez-Rodríguez, Carroll, and Weir (2011).

A **concatenation function** takes a *k*_{1}-tuple and a *k*_{2}-tuple and returns the (*k*_{1} + *k*_{2} − 1)-tuple that is obtained by concatenating the two arguments:

$$\langle x_1, \ldots, x_{k_1 - 1}, \; x_{k_1} y_1, \; y_2, \ldots, y_{k_2} \rangle \tag{6}$$

The simplest concatenation function is the standard concatenation operation 〈*x y*〉. We will write *conc* : *k*_{1} *k*_{2} to refer to a concatenation function of the type given in Equation (6). By counting endpoints, we see that the parsing complexity of concatenation functions is

$$c(\mathit{conc} : k_1 k_2) = 2k_1 + 2k_2 - 1$$

A **wrapping function** takes a *k*_{1}-tuple (for some *k*_{1} ≥ 2) and a *k*_{2}-tuple and returns the (*k*_{1} + *k*_{2} − 2)-tuple that is obtained by “wrapping” the first argument around the second argument, filling some gap in the former:

$$\langle x_1, \ldots, x_{j-1}, \; x_j y_1, \; y_2, \ldots, y_{k_2 - 1}, \; y_{k_2} x_{j+1}, \; x_{j+2}, \ldots, x_{k_1} \rangle \tag{7}$$

For *k*_{2} = 1, the two middle components merge into the single component *x*_{j} *y*_{1} *x*_{j+1}. The simplest function of this type is 〈*x*_{1} *y x*_{2}〉, which wraps a 2-tuple around a 1-tuple. We write *wrap* : *k*_{1} *k*_{2} *j* to refer to a wrapping function of the type given in Equation (7). The parsing complexity is

$$c(\mathit{wrap} : k_1 k_2 j) = 2k_1 + 2k_2 - 2$$

The constants of the binarized grammar have the form 〈*ɛ*〉, 〈*ɛ*, *ɛ*〉, and 〈*a*〉, where *a* is the anchor of some yield function of the original grammar.
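The two kinds of binary functions can be sketched as operations on tuples of strings. This is an illustration under assumptions: the first argument is joined to the second at its last component for concatenation, and the gap between components *j* and *j* + 1 (counted from 1) is filled for wrapping. The function names follow the notation *conc* and *wrap*, but the code is not taken from the article.

```python
def conc(xs, ys):
    """Concatenate a k1-tuple and a k2-tuple into a (k1 + k2 - 1)-tuple."""
    return xs[:-1] + (xs[-1] + ys[0],) + ys[1:]


def wrap(xs, ys, j):
    """Wrap the k1-tuple xs around the k2-tuple ys, filling the gap between
    components j and j + 1 of xs (1-based); the result has k1 + k2 - 2 components."""
    if not 1 <= j < len(xs):
        raise ValueError("j must identify a gap of the first argument")
    if len(ys) == 1:
        middle = (xs[j - 1] + ys[0] + xs[j],)
    else:
        middle = (xs[j - 1] + ys[0],) + ys[1:-1] + (ys[-1] + xs[j],)
    return xs[:j - 1] + middle + xs[j + 1:]


print(conc(("a", "b"), ("c", "d")))
print(wrap(("x1", "x2"), ("y",), 1))
print(wrap(("a", "b", "c"), ("p", "q"), 2))
```

The number of components of each result matches the fan-outs stated in the text: *k*_{1} + *k*_{2} − 1 for concatenation and *k*_{1} + *k*_{2} − 2 for wrapping.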

#### 8.5.1 Parsing Complexity

Because the fan-out of the binarized grammar is bounded by *k*, for concatenation functions *conc* : *k*_{1} *k*_{2} we have *k*_{1} + *k*_{2} − 1 ≤ *k*, and for wrapping functions *wrap* : *k*_{1} *k*_{2} *j* we have *k*_{1} + *k*_{2} − 2 ≤ *k*. We can therefore rewrite the general parsing complexities as

$$c(\mathit{conc} : k_1 k_2) \le 2k + 1 \qquad c(\mathit{wrap} : k_1 k_2 j) \le 2k + 2$$

Thus the maximal parsing complexity in the binarized grammar is 2*k* + 2; this is achieved by wrapping operations. This gives the bound stated in Lemma 10.
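As a worked instance, consider fan-out *k* = 2, and assume the endpoint counts 2*k*_{1} + 2*k*_{2} − 1 for concatenation and 2*k*_{1} + 2*k*_{2} − 2 for wrapping. Wrapping a 2-tuple around a 2-tuple produces a 2-tuple and attains the maximum:

$$c(\mathit{conc} : 2\;1) = 2 \cdot 2 + 2 \cdot 1 - 1 = 5 \le 2k + 1, \qquad c(\mathit{wrap} : 2\;2\;1) = 2 \cdot 2 + 2 \cdot 2 - 2 = 6 = 2k + 2$$

Under these assumptions the recognition time for fan-out 2 is *O*(|*G*| · |*w*|^{6}), the same exponent as in the standard parsing algorithms for tree-adjoining grammars.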

#### 8.5.2 Binarization

Consider a rule *A* → *f*(*A*_{1}, …, *A*_{m}) where *f* is not already a concatenation function, wrapping function, or constant. We decompose this rule into up to three rules

$$A \to f'(B, C) \qquad B \to f_1(B_1, \ldots, B_{m_1}) \qquad C \to f_2(C_1, \ldots, C_{m_2})$$

as follows. We match the template of *f* against one of three cases, shown schematically in Figure 12. In each case we select a concatenation or wrapping function *f*′ (shown in the right half of the figure), and split up the template of *f* into two parts defining yield functions *f*_{1} and *f*_{2}, respectively. In Figure 12, *f*_{1} is drawn shaded, and *f*_{2} is drawn non-shaded.^{4} The split of *f* partitions the variables that occur in the template, in the sense that if for some argument index 1 ≤ *i* ≤ *m*, either *f*_{1} or *f*_{2} contains *any* variable with argument index *i*, then it contains *all* such variables. The two sequences of argument nonterminals are obtained by collecting the nonterminal *A*_{i} if the variables with argument index *i* belong to the template of *f*_{1} and *f*_{2}, respectively. The nonterminals *B* and *C* are fresh nonterminals. We do not create rules for *f*_{1} and *f*_{2} if they are identity functions.

**Example 9**

The template 〈*x*_{1} *a x*_{2} *y*_{1}, *y*_{2}, *y*_{3} *x*_{3}〉 is complex and matches Case 3 in Figure 12, because its first component starts with the variable *x*_{1} and its last component ends with the variable *x*_{3}. We therefore split the template into two smaller parts 〈*x*_{1} *a x*_{2}, *x*_{3}〉 and 〈*y*_{1}, *y*_{2}, *y*_{3}〉. The function 〈*y*_{1}, *y*_{2}, *y*_{3}〉 is an identity. We therefore create two rules. Note that the index *j* for the wrapping function was chosen to be *j* = 2 because there were more component boundaries between *x*_{2} and *x*_{3} than between *x*_{1} and *x*_{2}. The template 〈*x*_{1} *a x*_{2}, *x*_{3}〉 requires further decomposition according to Case 3. This time, the two smaller parts are the identity function 〈*x*_{1}, *x*_{2}, *x*_{3}〉 and the constant 〈*a*〉. We therefore create the corresponding rules for this second decomposition. At this point, the transformation ends.

#### 8.5.3 Correctness

Starting from a yield function *f*_{0} : *k*_{1} ⋯ *k*_{m} → *k*, each step of the binarization decomposes some yield function *f* into two new yield functions *f*_{1}, *f*_{2}. Let us denote the fan-outs of the three functions by *h*, *h*_{1}, *h*_{2}, respectively. We have

$$h = h_1 + h_2 - 1 \quad \text{in Case 1 and Case 2} \tag{8}$$

$$h = h_1 + h_2 - 2 \quad \text{in Case 3} \tag{9}$$

From Equation (8) it is clear that in Case 1 and Case 2, both *h*_{1} and *h*_{2} are upper-bounded by *h*. In Case 3 we have *h*_{1} ≥ 2, which together with Equation (9) implies that *h*_{2} ≤ *h*. However, *h*_{1} is upper-bounded by *h* only if *h*_{2} ≥ 2; if *h*_{2} = 1, then *h*_{1} may be greater than *h*. As an example, consider the decomposition of 〈*x*_{1} *a x*_{2}〉 (fan-out 1) into the wrapping function 〈*x*_{1}, *x*_{2}〉 (fan-out 2) and the constant 〈*a*〉 (fan-out 1). But because in Case 3 the index *j* is chosen to maximize the number of component boundaries between the variables *x*_{i,j} and *x*_{i,j+1}, the assumption *h*_{2} = 1 implies that each of the *h*_{1} components of *f*_{1} contains at least one variable with argument index *i*: if there were a component without such a variable, then the two variables that surrounded that component would have given rise to a different choice of *j*. Hence we deduce that *h*_{1} ≤ *k*_{i}.

## 9. Conclusion

In this article, we have presented a formalism for non-projective dependency grammar based on linear context-free rewriting systems, along with a technique for extracting grammars from dependency treebanks. We have shown that parsing with the full class of these grammars is intractable. Therefore, we have investigated two constraints on the non-projectivity of dependency trees, block-degree and well-nestedness. Jointly, these two constraints define a class of “mildly” non-projective dependency grammars that can be parsed in polynomial time.

Our results in Sections 7 and 8 allow us to relate the formal power of an LCFRS to the structural properties of the dependency structures that it induces. Although we have used this relation to identify a class of dependency grammars that can be parsed in polynomial time, it also provides us with a new perspective on the question about the descriptive adequacy of a grammar formalism. This question has traditionally been discussed on the basis of strong and weak generative capacity (Bresnan et al. 1982; Huybregts 1984; Shieber 1985). A notion of generative capacity based on dependency trees makes a useful addition to this discussion, in particular when comparing formalisms for which no common concept of strong generative capacity exists. As an example for a result in this direction, see Koller and Kuhlmann (2009).

We have defined the dependency trees that an LCFRS induces by means of a compositional mapping on the derivations. While we would claim that compositionality is a generally desirable property, the particular notion of induction is up for discussion. In particular, our interpretation of derivations may not always be in line with how the grammar producing these derivations is actually *used*. One formalism for which such a mismatch between derivation trees and dependency trees has been pointed out is tree-adjoining grammar (Rambow, Vijay-Shanker, and Weir 1995; Candito and Kahane 1998). Resolving this mismatch provides an interesting line of future work.

One aspect that we have not discussed here is the linguistic adequacy of block-degree and well-nestedness. Each of our dependency grammars is restricted to a finite block-degree. As a consequence of this restriction, our dependency grammars are not expressive enough to capture linguistic phenomena that require unlimited degrees of non-projectivity, such as the “scrambling” in German subordinate clauses (Becker, Rambow, and Niv 1992). The question whether it is reasonable to assume a bound on the block-degree of dependency trees, perhaps for some performance-based reason, is open. Likewise, it is not clear whether well-nestedness is a “natural” constraint on dependency analyses (Chen-Main and Joshi 2010; Maier and Lichte 2011).

## Acknowledgments

The author gratefully acknowledges financial support from The German Research Foundation (Sonderforschungsbereich 378, project MI 2) and The Swedish Research Council (diary no. 2008-296).

## Notes

1. We draw the nodes of a dependency tree as circles, and the edges as arrows pointing towards the dependent (away from the root node). Following Hays (1964), we use dotted lines to help us keep track of the positions of the nodes in the linear order, and to associate nodes with lexical items.

2. We note that, instead of counting the blocks of each node, one may also count the gaps between these blocks and define the “gap-degree” of a dependency tree (Holan et al. 1998).

3. Kanazawa (2009) calls a multiple context-free grammar well-nested if each of its rules is non-deleting, non-permuting (our Property 2), and well-nested according to (5).

4. In order for these parts to make well-defined templates, we will in general need to rename the variables. We leave this renaming implicit here.


## Author notes

Department of Linguistics and Philology, Box 635, 751 26 Uppsala, Sweden.