Abstract
Syntactic representations based on word-to-word dependencies have a long-standing tradition in descriptive linguistics, and receive considerable interest in many applications. Nevertheless, dependency syntax has remained something of an island from a formal point of view. Moreover, most formalisms available for dependency grammar are restricted to projective analyses, and thus not able to support natural accounts of phenomena such as wh-movement and cross–serial dependencies. In this article we present a formalism for non-projective dependency grammar in the framework of linear context-free rewriting systems. A characteristic property of our formalism is a close correspondence between the non-projectivity of the dependency trees admitted by a grammar on the one hand, and the parsing complexity of the grammar on the other. We show that parsing with unrestricted grammars is intractable. We therefore study two constraints on non-projectivity, block-degree and well-nestedness. Jointly, these two constraints define a class of “mildly” non-projective dependency grammars that can be parsed in polynomial time. An evaluation on five dependency treebanks shows that these grammars have a good coverage of empirical data.
1. Introduction
Syntactic representations based on word-to-word dependencies have a long-standing tradition in descriptive linguistics. Since the seminal work of Tesnière (1959), they have become the basis for several linguistic theories, such as Functional Generative Description (Sgall, Hajičová, and Panevová 1986), Meaning–Text Theory (Mel'čuk 1988), and Word Grammar (Hudson 2007). In recent years they have also been used for a wide range of practical applications, such as information extraction, machine translation, and question answering. We ascribe the widespread interest in dependency structures to their intuitive appeal, their conceptual simplicity, and in particular to the availability of accurate and efficient dependency parsers for a wide range of languages (Buchholz and Marsi 2006; Nivre et al. 2007).
Although there exist both a considerable practical interest and an extensive linguistic literature, dependency syntax has remained something of an island from a formal point of view. In particular, there are relatively few results that bridge between dependency syntax and other traditions, such as phrase structure or categorial syntax. This makes it hard to gauge the similarities and differences between the paradigms, and hampers the exchange of linguistic resources and computational methods. An overarching goal of this article is to bring dependency grammar closer to the mainland of formal study.
One of the few bridging results for dependency grammar is thanks to Gaifman (1965), who studied a formalism that we will refer to as Hays–Gaifman grammar, and proved it to be weakly equivalent to context-free phrase structure grammar. Although this result is of fundamental importance from a theoretical point of view, its practical usefulness is limited. In particular, Hays–Gaifman grammar is restricted to projective dependency structures, which is similar to the familiar restriction to contiguous constituents. Yet, non-projective dependencies naturally arise in the analysis of natural language. One classic example of this is the phenomenon of cross–serial dependencies in Dutch. In this language, the nominal arguments of verbs that also select an infinitival complement occur in the same order as the verbs themselves:
In German, by contrast, the nominal arguments occur in the inverse order of the verbs:

Figure 1 shows dependency trees for the two examples.1 The German linearization gives rise to a projective structure, where the verb–argument dependencies are nested within each other, whereas the Dutch linearization induces a non-projective structure with crossing edges. To account for such structures we need to turn to formalisms more expressive than Hays–Gaifman grammars.

In this article we present a formalism for non-projective dependency grammar based on linear context-free rewriting systems (LCFRSs) (Vijay-Shanker, Weir, and Joshi 1987; Weir 1988). This framework was introduced to facilitate the comparison of various grammar formalisms, including standard context-free grammar, tree-adjoining grammar (Joshi and Schabes 1997), and combinatory categorial grammar (Steedman and Baldridge 2011). It also comprises, among others, multiple context-free grammars (Seki et al. 1991), minimalist grammars (Michaelis 1998), and simple range concatenation grammars (Boullier 2004).
The article is structured as follows. In Section 2 we provide the technical background to our work; in particular, we introduce our terminology and notation for linear context-free rewriting systems. An LCFRS generates a set of terms (formal expressions) which are interpreted as derivation trees of objects from some domain. Each term also has a secondary interpretation under which it denotes a tuple of strings, representing the string yield of the derived object. In Section 3 we introduce the central notion of a lexicalized linear context-free rewriting system, which is an LCFRS in which each rule of the grammar is associated with an overt lexical item, representing a syntactic head (cf. Schabes, Abeillé, and Joshi 1988 and Schabes 1990). We show that this property gives rise to an additional interpretation under which each term denotes a dependency tree on its yield. With this interpretation, lexicalized LCFRSs can be used as dependency grammars.
In Section 4 we show how to acquire lexicalized LCFRSs from dependency treebanks. This works in much the same way as the extraction of context-free grammars from phrase structure treebanks (cf. Charniak 1996), except that the derivation trees of dependency trees are not immediately accessible in the treebank. We therefore present an efficient algorithm for computing a canonical derivation tree for an input dependency tree; from this derivation tree, the rules of the grammar can be extracted in a straightforward way. The algorithm was originally published by Kuhlmann and Satta (2009). It produces a restricted type of lexicalized LCFRS that we call “canonical.” In Section 5 we provide a declarative characterization of this class of grammars, and show that every lexicalized LCFRS is (strongly) equivalent to a canonical one, in the sense that it induces the same set of dependency trees.
In Section 6 we present a simple parsing algorithm for LCFRSs. Although the runtime of this algorithm is polynomial in the length of the sentence, the degree of the polynomial depends on two grammar-specific measures called fan-out and rank. We show that even in the restricted case of canonical grammars, parsing is an NP-hard problem. It is therefore important to keep the fan-out and the rank of a grammar as low as possible, and much of the recent work on LCFRSs has been devoted to the development of techniques that optimize parsing complexity in various scenarios (Gómez-Rodríguez and Satta 2009; Gómez-Rodríguez et al. 2009; Kuhlmann and Satta 2009; Gildea 2010; Gómez-Rodríguez, Kuhlmann, and Satta 2010; Sagot and Satta 2010; and Crescenzi et al. 2011).
In this article we explore the impact of non-projectivity on parsing complexity. In Section 7 we present the structural correspondent of the fan-out of a lexicalized LCFRS, a measure called block-degree (or gap-degree) (Holan et al. 1998). Although there is no theoretical upper bound on the block-degree of the dependency trees needed for linguistic analysis, we provide evidence from several dependency treebanks showing that, from a practical point of view, this upper bound can be put at a value of as low as 2. In Section 8 we study a second constraint on non-projectivity called well-nestedness (Bodirsky, Kuhlmann, and Möhl 2005), and show that its presence facilitates tractable parsing. This comes at the cost of a small loss in coverage on treebank data. Bounded block-degree and well-nestedness jointly define a class of “mildly” non-projective dependency grammars that can be parsed in polynomial time.
Section 9 summarizes our main contributions and concludes the article.
2. Technical Background
We assume basic familiarity with linear context-free rewriting systems (see, e.g., Vijay-Shanker, Weir, and Joshi 1987 and Weir 1988) and only review the terminology and notation that we use in this article.
Example 1
Figure 2 shows an example of an LCFRS for the language {〈anbncndn〉 | n ≥ 0}.
The yield function f defined in Equation (2) is uniquely determined by the tuple on the right-hand side of the equation. We call this tuple the template of the yield function f, and use it as the canonical function symbol for f. This gives rise to a compact notation for LCFRSs, illustrated in the right column of Figure 2. In this notation, to save some subscripts, we use the following shorthands for variables: x and x1 for x1,1; x2 for x1,2; x3 for x1,3; y and y1 for x2,1; y2 for x2,2; y3 for x2,3.
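The grammar in Figure 2 is not reproduced here, but the idea can be made concrete with a small Python sketch. The rules below are one standard LCFRS for this language (an assumption for illustration, not necessarily the rules of Figure 2); terminals are written as strings, and the variable xi,j is written as the pair (i, j).

    def apply_template(template, args):
        """Evaluate a yield function given by its template. A template is a
        tuple of components; each component is a list of items that are
        either terminal strings or pairs (i, j) standing for x_{i,j}, the
        j-th component of the i-th argument. args is a list of string tuples."""
        return tuple(
            ''.join(item if isinstance(item, str) else args[item[0] - 1][item[1] - 1]
                    for item in comp)
            for comp in template)

    # A standard LCFRS for { a^n b^n c^n d^n | n >= 0 }: the nonterminal A has
    # fan-out 2 and derives the pairs <a^n b^n, c^n d^n>; S glues them together.
    #   A -> <ε, ε>()        A -> <a x1 b, c x2 d>(A)        S -> <x1 x2>(A)
    wrap = (['a', (1, 1), 'b'], ['c', (1, 2), 'd'])
    base = apply_template(([], []), [])           # ('', '')
    once = apply_template(wrap, [base])           # ('ab', 'cd')
    twice = apply_template(wrap, [once])          # ('aabb', 'ccdd')
    apply_template(([(1, 1), (1, 2)],), [twice])  # ('aabbccdd',)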
3. Lexicalized LCFRSs as Dependency Grammars
Our goal for the remainder of this section is to make the notion of induction formally precise. To this end we will reinterpret the yield functions of lexicalized LCFRSs as operations on dependency trees.
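Before the formal development, the idea can be previewed in a few lines of Python, continuing the template representation from the sketch in Section 2 and writing the lexical anchor as 'HEAD'. Evaluating a derivation term produces both a tuple of strings and a dependency tree: the anchor of each rule becomes a node, and the anchors of its subterms become its dependents. The toy rules in the usage example are hypothetical and merely mimic the Dutch cross-serial pattern from Section 1; they are not taken from the article.

    from itertools import count

    class Rule:
        """A lexicalized LCFRS rule: one anchor word plus a template whose
        components are lists of items, each item either the marker 'HEAD'
        (the lexical anchor) or a pair (i, j) standing for x_{i,j}."""
        def __init__(self, anchor, template):
            self.anchor = anchor
            self.template = template

    def evaluate(term, ids=None):
        """Evaluate a derivation term (rule, [subterms]). Returns a tuple of
        token lists (the string yield), the identifier of the root anchor,
        and the set of induced dependency edges (head id, dependent id)."""
        if ids is None:
            ids = count()
        rule, subterms = term
        sub = [evaluate(s, ids) for s in subterms]
        head = next(ids)
        edges = set().union(*(e for _, _, e in sub)) | {(head, h) for _, h, _ in sub}
        value = []
        for comp in rule.template:
            tokens = []
            for item in comp:
                if item == 'HEAD':
                    tokens.append((head, rule.anchor))
                else:
                    i, j = item
                    tokens.extend(sub[i - 1][0][j - 1])
            value.append(tokens)
        return tuple(value), head, edges

    # Hypothetical toy rules for the cross-serial pattern "Jan kinderen zag zwemmen":
    r_zag  = Rule('zag',      [[(1, 1), (2, 1), 'HEAD', (2, 2)]])  # fan-out 1
    r_zwem = Rule('zwemmen',  [[(1, 1)], ['HEAD']])                # fan-out 2
    r_jan  = Rule('Jan',      [['HEAD']])
    r_kind = Rule('kinderen', [['HEAD']])
    (words,), head, edges = evaluate((r_zag, [(r_jan, []), (r_zwem, [(r_kind, [])])]))
    sentence = [w for _, w in words]              # ['Jan', 'kinderen', 'zag', 'zwemmen']
    pos = {i: k + 1 for k, (i, _) in enumerate(words)}
    deps = {(pos[h], pos[d]) for h, d in edges}   # {(3, 1), (3, 4), (4, 2)}: crossing edges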
3.1 Dependency Trees
Example 2
3.2 Operations on Dependency Trees
Example 3
4. Extraction of Dependency Grammars
We now show how to extract lexicalized linear context-free rewriting systems from dependency treebanks. To this end, we adapt the standard technique for extracting context-free grammars from phrase structure treebanks (Charniak 1996).
Our technique was originally published by Kuhlmann and Satta (2009). In recent work, Maier and Lichte (2011) have shown how to unify it with a similar technique for the extraction of range concatenation grammars from discontinuous constituent structures, due to Maier and Søgaard (2008). To simplify our presentation we restrict our attention to treebanks containing simple dependency trees.
Because the extraction of rules from construction trees is straightforward, the problem that we focus on in this section is how to obtain these trees in the first place. Our procedure for computing construction trees is based on the concept of “blocks.”
4.1 Blocks
Let D be a dependency tree. A segment of D is a contiguous, non-empty sequence of nodes of D, all of which belong to the same component of the string yield. Thus a segment contains its endpoints, as well as all nodes between the endpoints in the precedence order. For a node u of D, a block of u is a longest segment consisting of descendants of u. This means that the left endpoint of a block of u either is the first node in its component, or is preceded by a node that is not a descendant of u. A symmetric property holds for the right endpoint.
Example 4
Consider the node 2 of the dependency tree in Figure 7. The descendants of 2 fall into two blocks, marked by the dashed boxes: 1 2 and 5 6 7.
We use B and B′ as variables for blocks. Extending the precedence order on nodes, we say that a block B precedes a block B′, denoted by B ≺ B′, if the right endpoint of B precedes the left endpoint of B′.
4.2 Computing Canonical Construction Trees
Example 5
Note that in order to properly define f we need to assume some order on the children of u. The function g (and hence the construction tree t) is unique up to the specific choice of this order. In the following we assume that children are ordered from left to right based on the position of their leftmost descendants.
4.3 Computing the Blocks of a Dependency Tree
The algorithmically most interesting part of our extraction procedure is the computation of the yield function g. The template of g is uniquely determined by the left-to-right sequence of the endpoints of the blocks of u and its children. An efficient algorithm that can be used to compute these sequences is given in Table 1.
4.3.1 Description
We start at a virtual root node ⊥ (line 1) which serves as the parent of the real root node. For each node next in the precedence order of D, we follow the shortest path from the current node current to next. To determine this path, we compute the lowest common ancestor lca of the two nodes (lines 4–5), using a set of markings on the nodes. At the beginning of each iteration of the for loop in line 2, all ancestors of current (including the virtual root node ⊥) are marked; therefore, we find lca by going upwards from next to the first node that is marked. To restore the loop invariant, we then unmark all nodes on the path from current to lca (lines 6–9). Each time we move down from a node to one of its children (line 12), we record the information that next is the left endpoint of a block of current. Symmetrically, each time we move up from a node to its parent (lines 8 and 17), we record the information that next – 1 is the right endpoint of a block of current. The while loop in lines 15–18 takes us from the last node of the dependency tree back to the node ⊥.
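A minimal Python sketch of this procedure is given below. It represents the dependency tree as a parent array over the nodes 1, …, n in precedence order, with 0 playing the role of the virtual root ⊥, and follows the marking strategy just described, although it does not mirror the line numbers of Table 1.

    def blocks(parent, n):
        """Compute, for every node u, the list of its blocks as (left, right)
        pairs of endpoints. parent[u] is the parent of node u; nodes are
        numbered 1..n in precedence order, and parent[root] == 0 (the virtual
        root). A sketch of the marking strategy described above."""
        left = {u: [] for u in range(n + 1)}
        right = {u: [] for u in range(n + 1)}
        marked = {0}        # invariant: current and all of its ancestors are marked
        current = 0
        for nxt in range(1, n + 1):
            # find the lowest marked ancestor of nxt (the lca of current and nxt)
            path_down, node = [], nxt
            while node not in marked:
                path_down.append(node)
                node = parent[node]
            lca = node
            # move up from current to lca, closing one block per visited node
            node = current
            while node != lca:
                marked.discard(node)
                right[node].append(nxt - 1)
                node = parent[node]
            # move down from lca to nxt, opening one block per visited node
            for node in reversed(path_down):
                marked.add(node)
                left[node].append(nxt)
            current = nxt
        # close the blocks that are still open at the last node
        node = current
        while node != 0:
            right[node].append(n)
            node = parent[node]
        return {u: list(zip(left[u], right[u])) for u in range(1, n + 1)}

For the tree of Figure 7, this would return, among other things, the two blocks (1, 2) and (5, 7) for node 2.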
4.3.2 Runtime Analysis
We analyze the runtime of our algorithm. Let m be the total number of blocks of D. Let us write ni for the total number of iterations of the ith while loop, and let n = n1 + n2 + n3 + n4. Under the reasonable assumption that every line in Table 1 can be executed in constant time, the runtime of the algorithm clearly is in O(n). Because each iteration of loop 2 and loop 4 determines the right endpoint of a block, we have n2 + n4 = m. Similarly, as each iteration of loop 3 fixes the left endpoint of a block, we have n3 = m. To determine n1, we note that every node that is pushed to the auxiliary stack in loop 1 is popped again in loop 3; therefore, n1 = n3 = m. Putting everything together, we have n = 3m, and we conclude that the runtime of the algorithm is in O(m). Note that this runtime is asymptotically optimal for the task we are considering.
5. Canonical Grammars
Our extraction technique produces a restricted type of lexicalized linear context-free rewriting system that we will refer to as “canonical.” In this section we provide a declarative characterization of these grammars, and show that every lexicalized LCFRS is equivalent to a canonical one.
5.1 Definition of Canonical Grammars
Property 1
For all 1 ≤ i1, i2 ≤ m, if i1 < i2 then xi1,1 precedes xi2,1 in the template.
This property is an artifact of our decision to order the children of a node from left to right based on the position of their leftmost descendants. A variable with argument index i represents a block of the ith child of u in that order. An example of a yield function that does not have Property 1 is 〈x2,1x1,1〉, which defines a kind of “reverse concatenation operation.”
Property 2
For all 1 ≤ i ≤ m and 1 ≤ j1, j2 ≤ ki, if j1 < j2 then xi,j1 precedes xi,j2 in the template.
This property reflects that, in our extraction procedure, the variable xi,j represents the jth block of the ith child of u, where the blocks of a node are ordered from left to right based on their precedence. An example of a yield function that violates the property is 〈x1,2x1,1〉, which defines a kind of swapping operation. In the literature on LCFRSs and related formalisms, yield functions with Property 2 have been called monotone (Michaelis 2001; Kracht 2003), ordered (Villemonte de la Clergerie 2002; Kallmeyer 2010), and non-permuting (Kanazawa 2009).
Property 3
No component αh is the empty string.
This property, which is similar to ε-freeness as known from context-free grammars, has been discussed for multiple context-free grammars (Seki et al. 1991, Property N3 in Lemma 2.2) and range concatenation grammars (Boullier 1998, Section 5.1). For our extracted grammars it holds because each component αh represents a block, and blocks are always non-empty.
Property 4
No component αh contains a substring of the form xi,j xi,j+1.
This property, which does not seem to have been discussed in the literature before, is a reflection of the facts that variables with the same argument index represent blocks of the same child node, and that these blocks are longest segments of descendants.
A yield function with Properties 1–4 is called canonical. An LCFRS is canonical if all of its yield functions are canonical.
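As a concrete illustration, Properties 1–4 can be checked mechanically. The sketch below uses the same template representation as the earlier examples (terminals as strings, the variable xi,j as the pair (i, j)) and assumes a linear template in which every variable occurs exactly once; it is not code from the article.

    def is_canonical(template):
        """Check Properties 1-4 for a single template."""
        variables = [item for comp in template for item in comp
                     if isinstance(item, tuple)]
        # Property 1: argument indices appear for the first time in increasing order
        first_seen = []
        for i, _ in variables:
            if i not in first_seen:
                first_seen.append(i)
        if first_seen != sorted(first_seen):
            return False
        # Property 2 (monotone/ordered): for each child, blocks occur in order j = 1, 2, ...
        last_j = {}
        for i, j in variables:
            if j <= last_j.get(i, 0):
                return False
            last_j[i] = j
        # Property 3: no component is the empty string
        if any(len(comp) == 0 for comp in template):
            return False
        # Property 4: no substring x_{i,j} x_{i,j+1} within a component
        for comp in template:
            for a, b in zip(comp, comp[1:]):
                if (isinstance(a, tuple) and isinstance(b, tuple)
                        and a[0] == b[0] and b[1] == a[1] + 1):
                    return False
        return True

    is_canonical(([(2, 1), (1, 1)],))         # False: violates Property 1
    is_canonical(([(1, 2), (1, 1)],))         # False: violates Property 2
    is_canonical((['a', (1, 1)], [(1, 2)]))   # True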
Lemma 1
A lexicalized LCFRS is canonical if and only if it can be extracted from a dependency treebank using the technique presented in Section 4.
Proof
We have already argued for the “only if” part of the claim. To prove the “if” part, it suffices to show that for every canonical, lexicalized yield function f, one can construct a dependency tree such that the construction tree extracted for this dependency tree contains f. This is an easy exercise.
We conclude by noting that Properties 2–4 are also shared by the treebank grammars extracted from constituency treebanks using the technique by Maier and Søgaard (2008).
5.2 Equivalence Between General and Canonical Grammars
Two lexicalized LCFRSs are called strongly equivalent if they induce the same set of dependency trees. We show the following equivalence result:
Lemma 2
For every lexicalized LCFRS G one can construct a strongly equivalent lexicalized LCFRS G′ such that G′ is canonical.
Proof
Our proof of this lemma uses two normal-form results about multiple context-free grammars: Michaelis (2001, Section 2.4) provides a construction that transforms a multiple context-free grammar into a weakly equivalent multiple context-free grammar in which all rules satisfy Property 2, and Seki et al. (1991, Lemma 2.2) present a corresponding construction for Property 3. Although both constructions are only stated to preserve weak equivalence, we can verify that, in the special case where the input grammar is a lexicalized LCFRS, they also preserve the set of induced dependency trees. To complete the proof of Lemma 2, we show that every lexicalized LCFRS can be cast into normal forms that satisfy Property 1 and Property 4. It is not hard then to combine the four constructions into a single one that simultaneously establishes all properties of canonical yield functions.
Lemma 3
For every lexicalized LCFRS G one can construct a strongly equivalent lexicalized LCFRS G′ such that G′ only contains yield functions which satisfy Property 1.
Proof
The proof is very simple. Intuitively, Property 1 enforces a canonical naming of the arguments of yield functions. To establish it, we determine, for every yield function f, a permutation that renames the argument indices of the variables occurring in the template of f in such a way that the template meets Property 1. This renaming gives rise to a modified yield function f′. We then replace every rule A → f(A1, …, Am) with a modified rule A → f′(…) in which the nonterminals A1, …, Am are reordered according to the same permutation.
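For illustration, the renaming step can be sketched as follows, using the same template representation as before; the sketch assumes a non-erasing rule, so that every argument index occurs in the template.

    def normalize_property1(template, rhs):
        """Rename argument indices by order of first occurrence so that the
        template satisfies Property 1, and reorder the right-hand side
        nonterminals rhs = [A1, ..., Am] accordingly."""
        order = []                      # argument indices in order of first occurrence
        for comp in template:
            for item in comp:
                if isinstance(item, tuple) and item[0] not in order:
                    order.append(item[0])
        rename = {old: new for new, old in enumerate(order, start=1)}
        new_template = tuple(
            [(rename[item[0]], item[1]) if isinstance(item, tuple) else item
             for item in comp]
            for comp in template)
        new_rhs = [rhs[old - 1] for old in order]
        return new_template, new_rhs

    # <x_{2,1} a x_{1,1}>(B, C)  becomes  <x_{1,1} a x_{2,1}>(C, B)
    normalize_property1(([(2, 1), 'a', (1, 1)],), ['B', 'C'])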
Lemma 4
For every lexicalized LCFRS G one can construct a strongly equivalent lexicalized LCFRS G′ such that G′ only contains yield functions which satisfy Property 4.
Proof
The idea behind our construction of the grammar G′ is perhaps best illustrated by an example. Imagine that the grammar G generates the term t shown in Figure 8a. The yield function f1 = 〈x1c x2x3〉 at the root node of that term violates Property 4, as its template contains the offending substring x2x3. We set up G′ in such a way that instead of t it generates the term t′ shown in Figure 8b in which f1 is replaced with the yield function f′1 = 〈x1c x2〉. To obtain f′1 from f1 we reduce the offending substring x2x3 to the single variable x2. In order to ensure that t and t′ induce the same dependency tree (shown in Figure 8c), we then adapt the function f2 = 〈x1b, y, x2〉 at the first child of the root node: Dual to the reduction, we replace the two-component sequence y, x2 in the template of f2 with the single component y x2; in this way we get f′2 = 〈x1b, y x2〉.
Input: a linear context-free rewriting system G = (N, Σ, P, S)
1: P′ ← ∅; agenda ← {(S, 〈x〉)}; chart ← ∅
2: while agenda is not empty
3:    remove some (A, g) from agenda
4:    if (A, g) ∉ chart then
5:       add (A, g) to chart
6:       for each rule A → f(A1, …, Am) ∈ P do
7:          f′ ← reduce(f, g); gi ← adapt(f, g, i) (1 ≤ i ≤ m)
8:          for each i from 1 to m do
9:             add (Ai, gi) to agenda
10:         add (A, g) → f′((A1, g1), …, (Am, gm)) to P′
Our algorithm is controlled by an agenda and a chart, both containing pairs of the form (A, g), where A is a nonterminal of G and g is an adaptor function. These pairs also constitute the nonterminals of the new grammar G′; the fan-out of a nonterminal (A, g) is the fan-out of g. The agenda is initialized with the pair (S, 〈x〉), where 〈x〉 is the identity function; this pair also represents the start symbol of G′. To see that the algorithm terminates, one may observe that the fan-out of every nonterminal (A, g) added to the agenda is upper-bounded by the fan-out of A. Hence, there are only finitely many pairs (A, g) that may occur in the chart, and the while loop terminates after a finite number of iterations.
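A Python rendering of this control structure is sketched below. The reduce and adapt operations are passed in as functions and left abstract, since only the agenda/chart discipline is at issue here; the representation of adaptors as hashable template tuples, with ('x',) for the identity adaptor 〈x〉, is an assumption made for the sketch.

    from collections import deque

    def transform_grammar(grammar, start, reduce_fn, adapt_fn):
        """Agenda/chart closure from the pseudocode above. grammar maps a
        nonterminal A of G to a list of rules (f, [A1, ..., Am]); reduce_fn
        and adapt_fn stand for the reduce and adapt operations, whose
        definitions are not spelled out here."""
        identity = ('x',)                       # the identity adaptor <x>
        start_symbol = (start, identity)
        agenda = deque([start_symbol])
        chart = set()
        new_rules = []
        while agenda:
            A, g = agenda.popleft()
            if (A, g) in chart:
                continue
            chart.add((A, g))
            for f, children in grammar[A]:
                f_prime = reduce_fn(f, g)
                adaptors = [adapt_fn(f, g, i) for i in range(1, len(children) + 1)]
                agenda.extend(zip(children, adaptors))
                # new rule: (A, g) -> f'((A1, g1), ..., (Am, gm))
                new_rules.append(((A, g), f_prime, list(zip(children, adaptors))))
        return start_symbol, new_rules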
6. Parsing and Recognition
Lexicalized linear context-free rewriting systems are able to account for arbitrarily non-projective dependency trees. This expressiveness comes with a price: In this section we show that parsing with lexicalized LCFRSs is intractable, unless we are willing to restrict the class of grammars.
6.1 Parsing Algorithm
To ground our discussion of parsing complexity, we present a simple bottom–up parsing algorithm for LCFRSs, specified as a grammatical deduction system (Shieber, Schabes, and Pereira 1995). Several similar algorithms have been described in the literature (Seki et al. 1991; Bertsch and Nederhof 2001; Kallmeyer 2010). We assume that we are given a grammar G = (N, Σ, P, S) and a string w = a1 ⋯ an ∈ Σ* to be parsed.
Goal item. The goal item is [S, 0, n]. This item asserts the existence of a term that can be derived from the start symbol S and whose yield is the full string 〈w〉.
Based on the deduction system, a tabular parser for LCFRSs can be implemented using standard dynamic programming techniques. This parser will compute a packed representation of the set of all derivation trees that the grammar G assigns to the string w. Such a packed representation is often called a shared forest (Lang 1994). In combination with appropriate semirings, the shared forest is useful for many tasks in syntactic analysis and machine learning (Goodman 1999; Li and Eisner 2009).
6.2 Parsing Complexity
Lemma 5
Proof
We adopt the following strategy for choosing endpoints: For 1 ≤ h ≤ k, choose the value of lh. Then, for 1 ≤ i ≤ m and 1 ≤ j ≤ ki, choose the value of ri,j. It is not hard to see that these choices suffice to determine all other endpoints. In particular, each left endpoint li′,j′ will be shared either with the left endpoint lh of some component (by constraint C2), or with some right endpoint ri,j (by constraint C4).
6.3 Universal Recognition
The runtime of our parsing algorithm for LCFRSs is exponential in both the rank and the fan-out of the input grammar. One may wonder whether there are parsing algorithms that can be substantially faster. We now show that the answer to this question is likely to be negative even if we restrict ourselves to canonical lexicalized LCFRSs. To this end we study the universal recognition problem for this class of grammars.
The universal recognition problem for a class of linear context-free rewriting systems is to decide, given a grammar G from the class in question and a string w, whether G yields 〈w〉. A straightforward algorithm for solving this problem is to first compute the shared forest for G and w, and to return “yes” if and only if the shared forest is non-empty. Choosing appropriate data structures, the emptiness of shared forests can be decided in linear time and space with respect to the size of the forest. Therefore, the computational complexity of universal recognition is upper-bounded by the complexity of constructing the shared forest. Conversely, parsing cannot be faster than universal recognition.
In the next three lemmas we prove that the universal recognition problem for canonical lexicalized LCFRSs is NP-complete unless we restrict ourselves to a class of grammars where both the fan-out and the rank of the yield functions are bounded by constants. Lemma 6, which shows that the universal recognition problem of lexicalized LCFRSs is in NP, distinguishes lexicalized LCFRSs from general LCFRSs, for which the universal recognition problem is known to be PSPACE-complete (Kaji et al. 1992). The crucial difference between general and lexicalized LCFRSs is the fact that in the latter, the size of the generated terms is bounded by the length of the input string. Lemma 7 and Lemma 8, which establish two NP-hardness results for lexicalized LCFRSs, are stronger versions of the corresponding results for general LCFRSs presented by Satta (1992), and are proved using similar reductions. They show that the hardness results hold under significant restrictions of the formalism: to lexicalized form and to canonical yield functions. Note that, whereas in Section 5.2 we have shown that every lexicalized LCFRS is equivalent to a canonical one, the normal form transformation increases the size of the original grammar by a factor that is at least exponential in the fan-out.
Lemma 6
The universal recognition problem of lexicalized LCFRSs is in NP.
Proof
For the following two lemmas, recall the decision problem 3SAT, which is known to be NP-complete. An instance of 3SAT is a Boolean formula φ in conjunctive normal form where each clause contains exactly three literals, which may be either variables or negated variables. We write m for the number of distinct variables that occur in φ, and n for the number of clauses. In the proofs the index i will always range over values from 1 to m, and the index j will range over values from 1 to n.
In order to make the grammars in the following reductions more readable, we use yield functions with more than one lexical anchor. Our use of these yield functions is severely restricted, however, and each of our grammars can be transformed into a proper lexicalized LCFRS without affecting the correctness or polynomial size of the reductions.
Lemma 7
The universal recognition problem for canonical lexicalized LCFRSs with unbounded fan-out and rank 1 is NP-hard.
Proof
To prove this claim, we provide a polynomial-time reduction of 3SAT. The basic idea is to use the derivations of the grammar to guess truth assignments for the variables, and to use the feature of unbounded fan-out to ensure that the truth assignment satisfies all clauses.
To see how the traversal of the matrix M can be implemented by the grammar G, consider the grammar fragment in Figure 9. Each of the rules specifies one possible step of the iteration for the pair (vi, cj) under the truth assignment vi = true; rules with left-hand side Fi,j (not shown here) specify possible steps under the assignment vi = false.
Lemma 8
The universal recognition problem for canonical lexicalized LCFRSs with unbounded rank and fan-out 2 is NP-hard.
Proof
7. Block-Degree
To obtain efficient parsing, we would like to have grammars with as low a fan-out as possible. Therefore it is interesting to know how low we can go without losing too much coverage. In lexicalized LCFRSs extracted from dependency treebanks, the fan-out of a grammar has a structural correspondence in the maximal number of blocks per subtree, a measure known as “block-degree.” In this section we formally define block-degree, and evaluate grammar coverage under different bounds on this measure.
7.1 Definition of Block-Degree
Recall the concept of “blocks” that was defined in Section 4.2. The block-degree of a node u of a dependency tree D is the number of distinct blocks of u. The block-degree of D is the maximal block-degree of its nodes.2
Example 6
Figure 10 shows two non-projective dependency trees. For D1, consider the node 2. The descendants of 2 fall into two blocks, marked by the dashed boxes. Because this is the maximal number of blocks per node in D1, the block-degree of D1 is 2. Similarly, we can verify that the block-degree of the dependency tree D2 is 3.
A dependency tree is projective if its block-degree is 1. In a projective dependency tree, each subtree corresponds to a substring of the underlying tuple of strings. In a non-projective dependency tree, a subtree may span several discontinuous substrings.
7.2 Computing the Block-Degrees
Using a straightforward extension of the algorithm in Table 1, the block-degrees of all nodes of a dependency tree D can be computed in time O(m), where m is the total number of blocks. To compute the block-degree of D, we simply take the maximum over the degrees of each node. We can also adapt this procedure to test whether D is projective, by aborting the computation as soon as we discover that some node has more than one block. The runtime of this test is linear in the number of nodes of D.
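Building on the blocks() sketch from Section 4.3, both quantities can be expressed as follows; this simple version computes all blocks first rather than aborting early, so the projectivity test also runs in O(m) rather than in time linear in the number of nodes.

    def block_degrees(parent, n):
        """Per-node block-degrees and the block-degree of the whole tree,
        reusing the blocks() sketch from Section 4.3."""
        bs = blocks(parent, n)
        degrees = {u: len(b) for u, b in bs.items()}
        return degrees, max(degrees.values())

    def is_projective(parent, n):
        """A dependency tree is projective iff every node has exactly one block."""
        return block_degrees(parent, n)[1] == 1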
7.3 Block-Degree in Extracted Grammars
In a lexicalized LCFRS extracted from a dependency treebank, there is a one-to-one correspondence between the blocks of a node u and the components of the template of the yield function f extracted for u. In particular, the fan-out of f is exactly the block-degree of u. As a consequence, any bound on the block-degree of the trees in the treebank translates into a bound on the fan-out of the extracted grammar. This has consequences for the generative capacity of the grammars: As Seki et al. (1991) show, the class of LCFRSs with fan-out k > 1 can generate string languages that cannot be generated by the class of LCFRSs with fan-out k − 1.
It may be worth emphasizing that the one-to-one correspondence between blocks and tuple components is a consequence of two characteristic properties of extracted grammars (Properties 3 and 4), and does not hold for non-canonical lexicalized LCFRSs.
Example 7
The following term induces a two-node dependency tree with block-degree 1, but contains yield functions with fan-out 2: 〈a x1x2〉(〈b, ε〉). Note that the yield functions in this term violate both Property 3 and Property 4.
7.4 Coverage on Dependency Treebanks
In order to assess the consequences of different bounds on the fan-out, we now evaluate the block-degree of dependency trees in real-world data. Specifically, we look into five dependency treebanks used in the 2006 CoNLL shared task on dependency parsing (Buchholz and Marsi 2006): the Prague Arabic Dependency Treebank (Hajič et al. 2004), the Prague Dependency Treebank of Czech (Böhmová et al. 2003), the Danish Dependency Treebank (Kromann 2003), the Slovene Dependency Treebank (Džeroski et al. 2006), and the Metu-Sabancı treebank of Turkish (Oflazer et al. 2003). The full data used in the CoNLL shared task also included treebanks that were produced by conversion of corpora originally annotated with structures other than dependencies, which is a potential source of “noise” that one has to take into account when interpreting any findings. Here, we consider only genuine dependency treebanks. More specifically, our statistics concern the training sections of the treebanks that were set off for the task. For similar results on other data sets, see Kuhlmann and Nivre (2006), Havelka (2007), and Maier and Lichte (2011).
Our results are given in Table 3. For each treebank, we list the number of rules extracted from that treebank, as well as the number of corresponding dependency trees. We then list the number of rules that we lose if we restrict ourselves to rules with fan-out = 1, or rules with fan-out ≤ 2, as well as the number of dependency trees that we lose because their construction trees contain at least one such rule. We count rule tokens, meaning that two otherwise identical rules are counted twice if they were extracted from different trees, or from different nodes in the same tree.
Table 3. Rules and trees lost when restricting the extracted grammars to fan-out = 1 and to fan-out ≤ 2.

Treebank | rules | trees | rules lost (fan-out = 1) | trees lost (fan-out = 1) | rules lost (fan-out ≤ 2) | trees lost (fan-out ≤ 2)
Arabic | 5,839 | 1,460 | 411 | 163 | 1 | 1
Czech | 1,322,111 | 72,703 | 22,283 | 16,831 | 328 | 312
Danish | 99,576 | 5,190 | 1,229 | 811 | 11 | 9
Slovene | 30,284 | 1,534 | 530 | 340 | 14 | 11
Turkish | 62,507 | 4,997 | 924 | 580 | 54 | 33
By putting the bound at fan-out 1, we lose between 0.74% (Arabic) and 1.75% (Slovene) of the rules, and between 11.16% (Arabic) and 23.15% (Czech) of the trees in the treebanks. This loss is quite substantial. If we instead put the bound at fan-out ≤ 2, then rule loss is reduced by between 94.16% (Turkish) and 99.76% (Arabic), and tree loss is reduced by between 94.31% (Turkish) and 99.39% (Arabic). This outcome is surprising. For example, Holan et al. (1998) argue that it is impossible to give a theoretical upper bound for the block-degree of reasonable dependency analyses of Czech. Here we find that, if we are ready to accept a loss of as little as 0.02% of the rules extracted from the Prague Dependency Treebank, and up to 0.5% of the trees, then such an upper bound can be set at a block-degree as low as 2.
8. Well-Nestedness
The parsing of LCFRSs is exponential both in the fan-out and in the rank of the grammars. In this section we study “well-nestedness,” another restriction on the non-projectivity of dependency trees, and show how enforcing this constraint allows us to restrict our attention to the class of LCFRSs with rank 2.
8.1 Definition of Well-Nestedness
Example 8
Lemma 9
Proof
Note that projective dependency trees are always well-nested; in these structures, every node has exactly one block, so configuration (4) is impossible. For every k > 1, there are both well-nested and ill-nested dependency trees with block-degree k.
8.2 Testing for Well-Nestedness
Based on Lemma 9, testing whether a dependency tree D is well-nested can be done in time linear in the number of blocks in D using a simple subsequence test as follows. We run the algorithm given in Table 1, maintaining a stack s[u] for every node u. The first time we make a down step to u, we push u to the stack for the parent of u; every other time, we pop the stack for the parent until we either find u as the topmost element, or the stack becomes empty. In the latter case, we terminate the computation and report that D is ill-nested; if the computation can be completed without any stack ever becoming empty, we report that D is well-nested.
To show that the algorithm is sound, suppose that some stack s[p] becomes empty when making a down step to some child v of p. In this case, the node v must have been popped from s[p] when making a down step to some other child u of p, and that child must have already been on the stack before the first down step to v. This witnesses the existence of a configuration of the form in Equation (4).
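For reference, here is a direct quadratic-time check in Python, based on the standard definition of well-nestedness (the yields of no two disjoint subtrees may interleave); it is not the linear-time stack-based test just described.

    def is_well_nested(parent, n):
        """Quadratic reference check: the tree (given as a parent array over
        nodes 1..n, with parent[root] == 0) is well-nested iff no two
        disjoint subtrees interleave."""
        children = {u: [] for u in range(n + 1)}
        for u in range(1, n + 1):
            children[parent[u]].append(u)
        yields = {}
        def collect(u):
            ys = {u}
            for c in children[u]:
                ys |= collect(c)
            yields[u] = ys
            return ys
        for r in children[0]:
            collect(r)
        def interleave(a, b):
            # merge the two position sets and collapse runs of equal labels;
            # the sets interleave iff the collapsed pattern has length >= 4 (ABAB or BABA)
            labels = [lab for _, lab in sorted([(p, 'A') for p in a] + [(p, 'B') for p in b])]
            collapsed = [lab for k, lab in enumerate(labels) if k == 0 or lab != labels[k - 1]]
            return len(collapsed) >= 4
        for u in range(1, n + 1):
            for v in range(u + 1, n + 1):
                if u not in yields[v] and v not in yields[u]:
                    if interleave(yields[u], yields[v]):
                        return False
        return True

    # the structure with edges 5 -> 1, 5 -> 2, 1 -> 3, 2 -> 4 is ill-nested:
    is_well_nested({1: 5, 2: 5, 3: 1, 4: 2, 5: 0}, 5)   # False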
8.3 Well-Nestedness in Extracted Grammars
Similar to the situation with block-degree, the correspondence between structural well-nestedness and syntactic well-nestedness is tight only for canonical grammars. For non-canonical grammars, syntactic well-nestedness alone does not imply structural well-nestedness, nor the other way around.
8.4 Coverage on Dependency Treebanks
To estimate the coverage of well-nested grammars, we extend the evaluation presented in Section 7.4. Table 4 shows how many rules and trees in the five dependency treebanks we lose if we restrict ourselves to well-nested yield functions with fan-out ≤ 2. The losses reported in Table 3 are repeated here for comparison. Although the coverage of well-nested rules is significantly smaller than the coverage of rules without this requirement, rule loss is still reduced by between 92.65% (Turkish) and 99.51% (Arabic) when compared to the fan-out = 1 baseline.
Table 4. Rules and trees lost when additionally requiring well-nestedness; the losses for fan-out = 1 and fan-out ≤ 2 from Table 3 are repeated for comparison.

Treebank | rules | trees | rules lost (fan-out = 1) | trees lost (fan-out = 1) | rules lost (fan-out ≤ 2) | trees lost (fan-out ≤ 2) | rules lost (fan-out ≤ 2, well-nested) | trees lost (fan-out ≤ 2, well-nested)
Arabic | 5,839 | 1,460 | 411 | 163 | 1 | 1 | 2 | 2
Czech | 1,322,111 | 72,703 | 22,283 | 16,831 | 328 | 312 | 407 | 382
Danish | 99,576 | 5,190 | 1,229 | 811 | 11 | 9 | 17 | 15
Slovene | 30,284 | 1,534 | 530 | 340 | 14 | 11 | 17 | 13
Turkish | 62,507 | 4,997 | 924 | 580 | 54 | 33 | 68 | 43
8.5 Binarization of Well-Nested Grammars
Our main interest in well-nestedness comes from the following:
Lemma 10
To prove this lemma, we will provide an algorithm for the binarization of well-nested lexicalized LCFRSs. In the context of LCFRSs, a binarization is a procedure for transforming a grammar into an equivalent one with rank at most 2. Binarization, either explicit at the level of the grammar or implicit at the level of some parsing algorithm, is essential for achieving efficient recognition algorithms, in particular the usual cubic-time algorithms for context-free grammars. Note that our binarization only preserves weak equivalence; in effect, it reduces the universal recognition problem for well-nested lexicalized LCFRSs to the corresponding problem for well-nested LCFRSs with rank 2. Many interesting semiring computations on the original grammar can be simulated on the binarized grammar, however. A direct parsing algorithm for well-nested dependency trees has been presented by Gómez-Rodríguez, Carroll, and Weir (2011).
8.5.1 Parsing Complexity
8.5.2 Binarization
Example 9
8.5.3 Correctness
9. Conclusion
In this article, we have presented a formalism for non-projective dependency grammar based on linear context-free rewriting systems, along with a technique for extracting grammars from dependency treebanks. We have shown that parsing with the full class of these grammars is intractable. Therefore, we have investigated two constraints on the non-projectivity of dependency trees, block-degree and well-nestedness. Jointly, these two constraints define a class of “mildly” non-projective dependency grammars that can be parsed in polynomial time.
Our results in Sections 7 and 8 allow us to relate the formal power of an LCFRS to the structural properties of the dependency structures that it induces. Although we have used this relation to identify a class of dependency grammars that can be parsed in polynomial time, it also provides us with a new perspective on the question about the descriptive adequacy of a grammar formalism. This question has traditionally been discussed on the basis of strong and weak generative capacity (Bresnan et al. 1982; Huybregts 1984; Shieber 1985). A notion of generative capacity based on dependency trees makes a useful addition to this discussion, in particular when comparing formalisms for which no common concept of strong generative capacity exists. As an example for a result in this direction, see Koller and Kuhlmann (2009).
We have defined the dependency trees that an LCFRS induces by means of a compositional mapping on the derivations. While we would claim that compositionality is a generally desirable property, the particular notion of induction is up for discussion. In particular, our interpretation of derivations may not always be in line with how the grammar producing these derivations is actually used. One formalism for which such a mismatch between derivation trees and dependency trees has been pointed out is tree-adjoining grammar (Rambow, Vijay-Shanker, and Weir 1995; Candito and Kahane 1998). Resolving this mismatch provides an interesting line of future work.
One aspect that we have not discussed here is the linguistic adequacy of block-degree and well-nestedness. Each of our dependency grammars is restricted to a finite block-degree. As a consequence of this restriction, our dependency grammars are not expressive enough to capture linguistic phenomena that require unlimited degrees of non-projectivity, such as the “scrambling” in German subordinate clauses (Becker, Rambow, and Niv 1992). The question whether it is reasonable to assume a bound on the block-degree of dependency trees, perhaps for some performance-based reason, is open. Likewise, it is not clear whether well-nestedness is a “natural” constraint on dependency analyses (Chen-Main and Joshi 2010; Maier and Lichte 2011).
Acknowledgments
The author gratefully acknowledges financial support from The German Research Foundation (Sonderforschungsbereich 378, project MI 2) and The Swedish Research Council (diary no. 2008-296).
Notes
We draw the nodes of a dependency tree as circles, and the edges as arrows pointing towards the dependent (away from the root node). Following Hays (1964), we use dotted lines to help us keep track of the positions of the nodes in the linear order, and to associate nodes with lexical items.
We note that, instead of counting the blocks of each node, one may also count the gaps between these blocks and define the “gap-degree” of a dependency tree (Holan et al. 1998).
Kanazawa (2009) calls a multiple context-free grammar well-nested if each of its rules is non-deleting, non-permuting (our Property 2), and well-nested according to (5).
In order for these parts to make well-defined templates, we will in general need to rename the variables. We leave this renaming implicit here.
References
Author notes
Department of Linguistics and Philology, Box 635, 751 26 Uppsala, Sweden. E-mail: [email protected].