Abstract
We explore the concept of hybrid grammars, which formalize and generalize a range of existing frameworks for dealing with discontinuous syntactic structures. Covered are both discontinuous phrase structures and non-projective dependency structures. Technically, hybrid grammars are related to synchronous grammars, where one grammar component generates linear structures and another generates hierarchical structures. By coupling lexical elements of both components together, discontinuous structures result. Several types of hybrid grammars are characterized. We also discuss grammar induction from treebanks. The main advantage over existing frameworks is the ability of hybrid grammars to separate discontinuity of the desired structures from time complexity of parsing. This permits exploration of a large variety of parsing algorithms for discontinuous structures, with different properties. This is confirmed by the reported experimental results, which show a wide variety of running time, accuracy, and frequency of parse failures.
1. Introduction
Much of the theory of parsing assumes syntactic structures that are trees, formalized such that the children of each node are ordered, and the yield of a tree, that is, the leaves read from left to right, is the sentence. In different terms, each node in the hierarchical syntactic structure of a sentence corresponds to a phrase that is a list of adjacent words, without any gaps. Such a structure is easy to represent in terms of bracketed notation, which is used, for instance, in the Penn Treebank (Marcus, Santorini, and Marcinkiewicz, 1993).
Describing syntax in terms of such narrowly defined trees seems most appropriate for relatively rigid word-order languages such as English. Nonetheless, the aforementioned Penn Treebank of English contains traces and other elements that encode additional structure next to the pure tree structure as indicated by the brackets. This is in keeping with observations that even English cannot be described adequately without a more general form of trees, allowing for so-called discontinuity (McCawley, 1982; Stucky, 1987). In a discontinuous structure, the set of leaves dominated by a node of the tree need not form a contiguous sequence of words, but may comprise one or more gaps. The need for discontinuous structures tends to be even greater for languages with relatively free word order (Kathol and Pollard 1995; Müller 2004).
In the context of dependency parsing (Kübler, McDonald, and Nivre 2009), the more specific term non-projectivity is used instead of, or next to, discontinuity. See Rambow (2010) for a discussion of the relation between constituent and dependency structures and see Maier and Lichte (2009) for a comparison of discontinuity and non-projectivity. As shown by, for example, Hockenmaier and Steedman (2007) and Evang and Kallmeyer (2011), discontinuity encoded using traces in the Penn Treebank can be rendered in alternative, and arguably more explicit, forms. In many modern treebanks, discontinuous structures have been given a prominent status (e.g., Böhmová et al. 2000). Figure 1 shows an example of a non-projective dependency structure.
The most established parsing algorithms are compiled out of context-free grammars (CFGs), or closely related formalisms such as tree substitution grammars (Sima’an et al. 1994) or regular tree grammars (Brainerd 1969; Gécseg and Steinby 1997). These parsers, which have a time complexity of 𝒪(n^3), where n is the length of the input string, operate by composing adjacent substrings of the input sentence into longer substrings. As a result, the structures they can build directly do not involve any discontinuity. The need for discontinuous syntactic structures thus poses a challenge to traditional parsing algorithms.
One possible solution is commonly referred to as pseudo-projectivity in the literature on dependency parsing (Kahane, Nasr, and Rambow, 1998; Nivre and Nilsson, 2005; McDonald and Pereira, 2006). A standard parsing system is trained on a corpus of projective dependency structures that was obtained by applying a lifting operation to non-projective structures. In a first pass, this system is applied to unlabeled sentences and produces projective dependencies. In a second pass, the lifting operation is reversed to introduce non-projectivity. A related idea for discontinuous phrase structures is the reversible splitting conversion of Boyd (2007). See also Johnson (2002), Campbell (2004), and Gabbard, Kulick, and Marcus (2006).
The two passes of pseudo-projective dependency parsing need not be strictly separated in time. For example, one way to characterize the algorithm by Nivre (2009) is that it combines the first pass with the second. Here the usual one-way input tape is replaced by a buffer. A non-topmost element from the parsing stack, which holds a word previously read from the input sentence, can be transferred back to the buffer, and thereby input positions can be effectively swapped. This then results in a non-projective dependency structure.
A second potential solution to obtain syntactic structures that go beyond context-free power is to use more expressive grammatical formalisms. One approach proposed by Reape (1989, 1994) is to separate linear order from the parent–child relation in syntactic structure, and to allow shuffling of the order of descendants of a node, which need not be its direct children. The set of possible orders is restricted by linear precedence constraints. A further restriction may be imposed by compaction (Kathol and Pollard, 1995). As discussed by Fouvry and Meurers (2000) and Daniels and Meurers (2002), this may lead to exponential parsing complexity; see also Daniels and Meurers (2004). Separating linear order from the parent–child relation is in the tradition of head-driven phrase structure grammar (HPSG), where grammars are commonly hand-written. This differs from our objectives to induce grammars automatically from training data, as will become clear in the following sections.
To stay within a polynomial time complexity, one may also consider tree adjoining grammars (TAGs), which can describe strictly larger classes of word order phenomena than CFGs (Rambow and Joshi, 1997). The resulting parsers have a time complexity of 𝒪(n^6) (Vijay-Shanker and Joshi, 1985). However, the derived trees they generate are still continuous. Although their derivation trees may be argued to be discontinuous, these by themselves are not normally the desired syntactic structures. Moreover, it was argued by Becker, Joshi, and Rambow (1991) that further additions to TAGs are needed to obtain adequate descriptions of certain non-context-free phenomena. These additions further increase the time complexity.
In order to obtain desired syntactic structures, one may combine TAG parsing with an idea that is related to that of pseudo-projectivity. For example, Kallmeyer and Kuhlmann (2012) propose a transformation that turns a derivation tree of a (lexicalized) TAG into a non-projective dependency structure. The same idea has been applied to derivation trees of other formalisms, in particular (lexicalized) linear context-free rewriting systems (LCFRSs) (Kuhlmann, 2013), whose weak generative power subsumes that of TAGs.
Parsers more powerful than those for CFGs often incur high time costs. In particular, LCFRS parsers have a time complexity that is polynomial in the sentence length, but with a degree that is determined by properties of the grammar. This degree typically increases with the amount of discontinuity in the desired structures. Difficulties in running LCFRS parsers for natural languages are described, for example, by Kallmeyer and Maier (2013).
In the architectures we have discussed, the common elements are:
- •
a grammar, in some fixed formalism, that determines the set of sentences that are accepted, and
- •
a procedure to build (discontinuous) structures, guided by the derivation of input sentences.
The general concept of hybrid grammars leaves open the choice of the string grammar formalism and that of the tree grammar formalism. In this article we consider simple macro grammars (Fischer, 1968) and LCFRSs as string grammar formalisms. The tree grammar formalisms we consider are simple context-free tree grammars (Rounds, 1970) and simple definite clause programs (sDCP), inspired by Deransart and Małuszynski (1985). This gives four combinations, each leading to one class of hybrid grammars. In addition, more fine-grained subclasses can be defined by placing further syntactic restrictions on the string and tree formalisms.
To place hybrid grammars in the context of existing parsing architectures, let us consider classical grammar induction from a treebank, for example, for context-free grammars (Charniak, 1996) or for LCFRSs (Maier and Søgaard, 2008; Kuhlmann and Satta, 2009). Rules are extracted directly from the trees in the training set, and unseen input strings are consequently parsed according to these structures. Grammars induced in this way can be seen as restricted hybrid grammars, in which no freedom exists in the relation between the string component and the tree component. In particular, the presence of discontinuous structures generally leads to high time complexity of string parsing. In contrast, the framework in this article detaches the string component of the grammar from the tree component. Thereby the parsing process of input strings is no longer bound to follow the tree structures, while the same tree structures as before can still be produced, provided the tree component is suitably chosen. This allows string parsing with low time complexity in combination with production of discontinuous trees.
Nederhof and Vogler (2014) presented experiments with various subclasses of hybrid grammars for the purpose of constituent parsing. Trade-offs between speed and accuracy were identified. In the present article, we extend our investigation to dependency parsing. This includes induction of a hybrid grammar from a dependency treebank. Before turning to the experiments, we present several completeness results about existence of hybrid grammars generating non-projective dependency structures.
This article is organized as follows. After preliminaries in Section 2, Section 3 defines hybrid trees. These are able to capture both discontinuous phrase structures and non-projective dependency structures. Thanks to the concept of hybrid trees, there will be no need, later in the article, to distinguish between hybrid grammars for constituent parsing and hybrid grammars for dependency parsing. To make this article self-contained, we define two existing string grammar formalisms and two tree grammar formalisms in Section 4. The four classes of hybrid grammars that result by combining these formalisms are presented in Section 5, which also discusses how to use them for parsing.
How to induce hybrid grammars from treebanks is discussed in Section 6. Section 7 reports on experiments that provide proof of concept. In particular, LCFRS/sDCP-hybrid grammars are induced from corpora of dependency structures and phrase structures and employed to predict the syntactic structure of unlabeled sentences. It is demonstrated that hybrid grammars allow a wide variety of results, in terms of time complexity, accuracy, and frequency of parse failures. How hybrid grammars relate to existing ideas is discussed in Section 8.
The text refers to a number of theoretical results that are not central to the main content of this article. In order to preserve the continuity of the discussion, we have deferred their proofs to appendices.
2. Preliminaries
Let ℕ = {0,1,2, …} and ℕ+ = ℕ ∖ {0}. For each n ∈ ℕ+, we let [n] stand for the set {1, … , n}, and we let [0] stand for ∅. We write [n]0 to denote [n] ∪ {0}. We fix an infinite list x1,x2, … of pairwise distinct variables. We let X = {x1,x2,x3, …} and Xk = {x1, … , xk} for each k ∈ ℕ. For any set A, the power set of A is denoted by 𝒫(A).
A ranked set Δ is a set of symbols associated with a rank function assigning a number rkΔ(δ) ∈ ℕ to each symbol δ ∈ Δ. A ranked alphabet is a ranked set with a finite number of symbols. We let Δ(k) denote {δ ∈ Δ∣rkΔ(δ) = k}.
The following definitions were inspired by Seki and Kato (2008). The sets of terms and sequence-terms (s-terms) over ranked set Δ, with variables in some set Y ⊆ X, are denoted by TΔ(Y) and TΔ*(Y), respectively, and defined inductively as follows:
Y ⊆ TΔ(Y),
if k ∈ ℕ, δ ∈ Δ(k), and si ∈ TΔ*(Y) for each i ∈ [k], then δ(s1, … , sk) ∈ TΔ(Y), and
if n ∈ ℕ and ti ∈ TΔ(Y) for each i ∈ [n], then 〈t1, … , tn〉 ∈ TΔ*(Y).
We let TΔ* and TΔ stand for TΔ*(∅) and TΔ(∅), respectively. Throughout this article, we use variables such as s and si for s-terms and variables such as t and ti for terms. The length |s| of a s-term s = 〈t1, … , tn〉 is n.
The justification for using s-terms as defined here is that they provide the required flexibility for dealing with both strings and unranked trees, in combination with derivational nonterminals in various kinds of grammar. By using an alphabet Δ = Δ(0) one can represent strings. For instance, if Δ contains the symbols a and b, then the s-term 〈a(),b(),a()〉 denotes the string aba. We will therefore refer to such s-terms simply as strings.
By using an alphabet Δ =Δ(1) one may represent trees without fixing the number of child nodes that a node with a certain label should have. Conceptually, one may think of such node labels as unranked, as is common in parsing theory of natural language. For instance, the s-term 〈a(〈b(〈〉),a(〈〉)〉)〉 denotes the unranked tree a(b,a). We will therefore refer to such s-terms simply as trees, and we will sometimes use familiar terminology, such as “node,” “parent,” and “sibling,” as well as common graphical representations of trees.
If theoretical frameworks require trees over ranked alphabets in the conventional sense (without s-terms), one may introduce a distinguished symbol cons of rank 2, replacing each s-term of length greater than 1 by an arrangement of subterms combined using occurrences of that symbol. Another symbol nil of rank 0 may be introduced to replace each s-term of length 0. Hence δ(〈α(〈〉),β(〈〉),γ(〈〉)〉) could be more conventionally written as δ(cons(α(nil),cons(β(nil),γ(nil)))).
Concatenation of s-terms is given by 〈t1, … , tn〉⋅〈tn+1, … , tn+m〉 = 〈t1, … , tn +m〉. Sequences such as s1, … , sk or x1, … , xk will typically be abbreviated to s1,k or x1,k, respectively. For δ ∈ Δ(0) we sometimes abbreviate δ() to δ.
In examples we also abbreviate 〈t1, … , tn〉 to t1⋯tn—that is, omitting the angle brackets and commas. In particular, for n = 0, the s-term 〈〉 is abbreviated by ε. Moreover, we sometimes abbreviate δ(〈〉) to δ. Whether δ then stands for δ(〈〉) or for δ() depends on whether δ ∈ Δ(1) or δ ∈ Δ(0), which will be clear from the context.
The subterm at position p in a term t (or a s-term s) is defined as follows. Here the set pos(t) of positions in a term t = δ(s1,k) is {ε} ∪ {ip ∣ i ∈ [k], p ∈ pos(si)}, and the set pos(s) of positions in a s-term s = 〈t1,n〉 is {ip ∣ i ∈ [n], p ∈ pos(ti)}. For any term t we have t|ε = t. For a term t = δ(s1,k), i ∈ [k], and p ∈ pos(si) we have t|ip = si|p. For a s-term s = 〈t1,n〉, i ∈ [n], and p ∈ pos(ti) we have s|ip = ti|p.
The label at position p in a term t is denoted by t(p). In other words, if t|p equals δ(s1, … , sk) or x ∈ X, then t(p) equals δ or x, respectively. Let Γ ⊆ Δ. The subset of pos(t) consisting of all positions where the label is in Γ is denoted by posΓ(t), or formally posΓ(t) = {p ∈ pos(t)∣t(p) ∈ Γ}. Analogously to t(p) and posΓ(t) for terms t, one may define s(p) and posΓ(s) for s-terms s.
The expression t[s′]p (or s[s′]p) denotes the s-term obtained from t (or from s) by replacing the subterm at position p by s-term s′. For any term t we have t[s′]ε = s′. For a term t = δ(s1,k), i ∈ [k], and p ∈ pos(si) we have t[s′]ip = 〈δ(s1,i−1, si[s′]p, si+1,k)〉. For a s-term s = 〈t1,n〉, i ∈ [n], and p ∈ pos(ti) we have s[s′]ip = 〈t1,i−1〉 ⋅ ti[s′]p ⋅ 〈ti+1,n〉.
For term t (or s-term s) with variables in Xk and s-terms si (i ∈ [k]), the first-order substitution t[s1,k] (or s[s1,k], respectively) denotes the s-term obtained from t (or from s) by replacing each occurrence of any variable xi by si, or formally:
- •
xi[s1,k] = si,
- •
δ(s′1,n)[s1,k] = 〈δ(s′1[s1,k], … , s′n[s1,k])〉,
- •
〈t1,n〉[s1,k] = t1[s1,k] ⋅… ⋅ tn[s1,k].
If s ∈ TΔ*, s|p = δ(s1,k), and s′ ∈ TΔ*(Xk), then the second-order substitution s⟦s′⟧p denotes the s-term obtained from s by replacing the subterm at position p by s′, with the variables in s′ replaced by the corresponding s-terms found immediately below p, or formally s⟦s′⟧p = s[s′[s1,k]]p.
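To make these operations concrete, the following minimal Python sketch may help. It assumes an encoding that is not from this article: a term is a tagged tuple ("sym", name, [s1, … , sk]) with every si a s-term (a list of terms), and a variable xi is ("var", i).

```python
def subst1_term(t, args):
    """First-order substitution on a term; the result is an s-term, mirroring
    x_i[s_1,k] = s_i and delta(...)[s_1,k] = <delta(...)>."""
    if t[0] == "var":
        return list(args[t[1] - 1])
    _, name, kids = t
    return [("sym", name, [subst1(s, args) for s in kids])]

def subst1(s, args):
    """First-order substitution on an s-term: concatenate the s-terms
    obtained from its member terms."""
    out = []
    for t in s:
        out.extend(subst1_term(t, args))
    return out

def subterm(s, pos):
    """The subterm s|pos; indices are 1-based and alternate between selecting
    a member term and selecting an argument s-term."""
    node, in_sterm = s, True
    for i in pos:
        node = node[i - 1] if in_sterm else node[2][i - 1]
        in_sterm = not in_sterm
    return node

def replace(s, pos, new):
    """s[new]_pos: replace the term at position pos by the s-term `new`,
    splicing it into the surrounding s-term."""
    i = pos[0] - 1
    if len(pos) == 1:
        return s[:i] + new + s[i + 1:]
    _, name, kids = s[i]
    kids = list(kids)
    kids[pos[1] - 1] = replace(kids[pos[1] - 1], pos[2:], new)
    return s[:i] + [("sym", name, kids)] + s[i + 1:]

def subst2(s, sp, pos):
    """Second-order substitution s[[sp]]_pos = s[ sp[s_1,k] ]_pos, where
    s|pos = delta(s_1, ..., s_k) and sp may use variables x_1, ..., x_k."""
    args = subterm(s, pos)[2]
    return replace(s, pos, subst1(sp, args))

# Example: <a(), b(), a()> encodes the string "aba".
aba = [("sym", "a", []), ("sym", "b", []), ("sym", "a", [])]
```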
3. Hybrid Trees
The purpose of this section is to unify existing notions of non-projective dependency structures and discontinuous phrase structures, formalized using s-terms.
We fix a ranked alphabet Σ = Σ(1) and a subset Γ ⊆ Σ. A hybrid tree over (Γ,Σ) is a pair h = (s,≤s), where s ∈ TΣ* and ≤s is a total order on posΓ(s). In words, a hybrid tree combines hierarchical structure, in the form of a s-term over the full alphabet Σ, with a linear structure, which can be seen as a string over Γ ⊆ Σ. This string will be denoted by str(h). Formally, let posΓ(s) = {p1, … , pn} with pi ≤s pi+1 (i ∈ [n − 1]). Then str(h) = s(p1)⋯s(pn). In order to avoid the treatment of pathological cases we assume that s ≠ 〈〉 and posΓ(s) ≠ ∅.
A hybrid tree (s,≤s) is a phrase structure if ≤s is a total order on the leaves of s. The elements of Γ would typically represent lexical items, and the elements of Σ ∖ Γ would typically represent syntactic categories. A hybrid tree (s,≤s) is a dependency structure if Γ = Σ, whereby the linear structure of a hybrid tree involves all of its nodes, which represent lexical items. Our dependency structures generalize totally ordered trees (Kuhlmann and Niehren, 2008) by considering s-terms instead of usual terms over a ranked alphabet.
We say that a phrase structure (s,≤s) over (Γ,Σ) is continuous if for each p ∈ pos(s) the set posΓ(s|p) is a complete span, that is, if the following condition holds: if p1, p2, p′ satisfy pp1, p′, pp2 ∈ posΓ(s) and pp1 ≤s p′ ≤s pp2, then p′ = pp3 for some p3. If the same condition holds for a dependency structure (s,≤s), then we say that (s,≤s) is projective. If the condition is not satisfied, then we call a phrase structure discontinuous and a dependency structure non-projective.
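A small sketch of these notions, continuing the assumed encoding above, may be useful; the list `order` of Γ-positions of s sorted by ≤s is an assumption of the sketch, not the article's notation.

```python
def term_positions(s, prefix=()):
    """Yield (position, label) for every term position of s-term s."""
    for i, (_, label, args) in enumerate(s, start=1):
        p = prefix + (i,)
        yield p, label
        for j, arg in enumerate(args, start=1):
            yield from term_positions(arg, p + (j,))

def str_h(order, labels):
    """str(h): the labels of the Gamma-positions read off in linear order."""
    return [labels[p] for p in order]

def is_continuous(s, order):
    """Complete-span test: for each position p of s, the Gamma-positions
    dominated by p (p included, if it is one) must form a contiguous block
    of the linear order.  For dependency structures (Gamma = Sigma) this is
    exactly the projectivity test."""
    labels = dict(term_positions(s))
    rank = {p: i for i, p in enumerate(order)}   # index of each Gamma-position
    for p in labels:
        ranks = sorted(rank[q] for q in rank if q[:len(p)] == p)
        if ranks and ranks[-1] - ranks[0] + 1 != len(ranks):
            return False                          # a gap below p
    return True
```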
The phenomenon of cross-serial dependencies in Dutch (Bresnan et al., 1982) is illustrated in Figure 4b, using a non-projective dependency structure.
Figure 5 gives an abstract rendering of cross-serial dependencies in Dutch, this time in terms of discontinuous phrase structure.
4. Basic Grammatical Formalisms
The concept of hybrid grammars is illustrated in Section 5, first on the basis of a coupling of linear context-free rewriting systems and simple definite clause programs, and then three more such couplings are introduced that further involve simple macro grammars and simple context-free tree grammars. In the current section we discuss these basic classes of grammars, starting with the simplest ones.
4.1 Macro Grammars
The definitions in this section are very close to those in Fischer (1968) with the difference that the notational framework of s-terms is used for strings, as in Seki and Kato (2008).
We write s1 ⇒G,p,ρ s2 for the “derives” relation, using rule ρ at position p in a s-term. Formally, we write s1 ⇒G,p,ρ s2 if ρ = A(x1,k) → r is in P, s1(p) = A, and s2 = s1⟦r⟧p. We write s1 ⇒G s2 if s1 ⇒G,p,ρ s2 for some p and ρ, and ⇒G* is the reflexive, transitive closure of ⇒G. Derivation in i steps is denoted by ⇒Gi. The (string) language induced by macro grammar G is [G] = {s ∈ TΓ* ∣ 〈S()〉 ⇒G* s}.
In the sequel we will focus our attention on macro grammars with the property that for each rule A(x1,k) → r and each i ∈ [k], variable xi has exactly one occurrence in r. In this article, such grammars will be called simple macro grammars (sMGs).
4.2 Context-free Tree Grammars
The definitions in this section are a slight generalization of those in Rounds (1970) and Engelfriet and Schmidt (1977,1978), as here they involve s-terms. In Section 3 we already argued that the extra power due to s-terms can be modeled using fixed symbols cons and nil and is therefore not very significant in itself. The benefit of the generalization lies in the combination with macro grammars, as discussed in Section 5.
A (generalized) context-free tree grammar (CFTG) is a tuple G =(N,S,Σ,P), where Σ is a ranked alphabet with Σ = Σ(1) and N, S, and P are as for macro grammars except that Γ is replaced by Σ in the specification of the rules.
The “derives” relation and other relevant notation are defined as for macro grammars. Note that the language induced by a CFTG G is not a string language but a tree language, or more precisely, its elements are sequences of trees.
As for macro grammars, we will focus our attention on CFTGs with the property that for each rule A(x1,k) → r and each i ∈ [k], variable xi has exactly one occurrence in r. In this article, such grammars will be called simple context-free tree grammars (sCFTGs). Note that if N = N(0), then an sCFTG is a regular tree grammar (Brainerd 1969; Gécseg and Steinby 1997). sCFTGs are a natural generalization of the widely used TAGs; see Kepser and Rogers (2011), Maletti and Engelfriet (2012), and Gebhardt and Osterholzer (2015).
4.3 Linear Context-free Rewriting Systems
In Vijay-Shanker, Weir, and Joshi (1987), the semantics of LCFRS is introduced by distinguishing two phases. In the first phase, a tree over function symbols is generated by a regular tree grammar. In the second phase, the function symbols are interpreted, each composing a sequence of tuples of strings into another tuple of strings. This formalism is equivalent to the multiple CFGs of Seki et al. (1991). We choose a notation similar to that of the formalisms discussed before, which will also enable us to couple these string-generating grammars to tree-generating grammars, as will be discussed later.
A derivation can be represented by a derivation tree d (cf. Figure 6), which is obtained by glueing together the rules as they are used in the derivation. The backbone of d is a usual derivation tree of the CFG underlying the LCFRS (nonterminals and solid lines). Each argument of a nonterminal is represented as a box to the right of the nonterminal, and dashed arrows indicate dependencies of values. An xi above a box specifies an argument of a nonterminal on the right-hand side of a rule, whereas the content of a box is an argument of the nonterminal on the left-hand side of a rule. The boxes taken as vertices and the dashed arrows taken as edges constitute a dependency graph. It can be evaluated by first sorting its vertices topologically and, according to this sorting, substituting in an obvious manner the relevant s-terms into each other. The final s-term that is evaluated in this way is an element of the language [G]. This s-term, denoted by ϕ(d), is called the evaluation of d. The notion of dependency graph originates from attribute grammars (Knuth, 1968; Paakki, 1995); it should not be confused with the linguistic concept of dependency.
Note that if the rules in a derivation are given, then the choice of ri for each variable xi in each rule instance is uniquely determined. For a given string s, the set of all LCFRS derivations (in compact tabular form) can be obtained in polynomial time in the length of s (Seki et al., 1991). See also Kallmeyer and Maier (2010, 2013) for the extension with probabilities.
All strings derived by G have the interlaced structure a^m c^n b^m d^n with m,n ∈ ℕ, where the i-th occurrence of a corresponds to the i-th occurrence of b and the i-th occurrence of c corresponds to the i-th occurrence of d. This resembles cross-serial dependencies in Swiss German (Shieber, 1985) in an abstract way; a and c represent noun phrases with different case markers (dative or accusative) and b and d are verbs that take different arguments (dative or accusative noun phrases).
There is a subclass of LCFRS called well-nested LCFRS. Its time complexity of parsing is lower than that of general LCFRS (Gómez-Rodríguez, Kuhlmann, and Satta 2010), and the class of languages it induces is strictly included in the class of languages induced by general LCFRSs (Kanazawa and Salvati, 2010). One can see sMGs as syntactic variants of well-nested LCFRSs (cf. footnote 3 of Kanazawa 2009), the former being more convenient for our purposes of constructing hybrid grammars, when the string component is to be explicitly restricted to have the power of sMG / well-nested LCFRS. The class of languages they induce also equals the class of string languages induced by sCFTGs.
4.4 Definite Clause Programs
In this section we describe a particular kind of definite clause program. Our definition is inspired by Deransart and Małuszynski (1985), who investigated the relation between logic programs and attribute grammars, together with the “syntactic single use requirement” from Giegerich (1988). The values produced are s-terms. The induced class of s-term languages is strictly larger than that of sCFTGs (cf. Appendix B).
As discussed subsequently, the class of string languages that results if we take the yields of those s-terms equals the class of string languages induced by LCFRSs. Thereby, our class of definite clause programs relates to LCFRSs much as the class of sCFTGs relates to sMGs.
A simple definite clause program (sDCP) is a tuple G = (N,S,Σ,P), where N is a ranked alphabet of nonterminals and Σ = Σ(1) is a ranked alphabet of terminals (as for CFTGs).2 Moreover, each nonterminal A ∈ N has a number of arguments, each of which is either an inherited argument or a synthesized argument. The number of inherited arguments is the i-rank and the number of synthesized arguments is the s-rank of A; we let rkN(A) = i-rk(A) + s-rk(A) denote the rank of A. The start symbol S has only one argument, which is synthesized—that is, rkN(S) = s-rk(S) = 1 and i-rk(S) = 0.
The “derives” relation ⇒G and other relevant notation are defined as for LCFRSs. Thus, in particular, the language [G] induced by sDCP G is defined analogously; its elements are now sequences of trees.
In the same way as for LCFRS we can represent a derivation of a sDCP G as a derivation tree d. A box in its dependency graph is placed to the left of a nonterminal occurrence if it represents an inherited argument and to the right otherwise. As before, ϕ(d) denotes the evaluation of d, which is now a s-term over Σ.
If a sDCP is such that the dependency graph for any derivation contains no cycles, then we say the sDCP contains no cycles. In this case, if the rules in a derivation are given, then the choice of ri for each variable xi in each rule instance is uniquely determined, and can be computed in linear time in the size of the derivation. The existence of cycles is decidable, as we know from the literature on attribute grammars (Knuth, 1968). There are sufficient conditions for absence of cycles, such as the grammar being L-attributed (Bochmann, 1976; Deransart, Jourdan, and Lorho, 1988). In this article, we will assume that sDCPs contain no cycles. A one-pass computation model can be formalized as a bottom–up tree-generating tree-to-hypergraph transducer (Engelfriet and Vogler, 1998). One may alternatively evaluate sDCP arguments using the more general mechanism of unification, as it exists for example in HPSG (Pollard and Sag, 1994).
If the sDCP is single-synthesized—that is, each nonterminal has exactly one synthesized argument (and any number of inherited arguments)—then there is an equivalent sCFTG. The converse also holds. For proofs, see Appendix A.
Single-synthesized sDCPs have the same s-term generating power as sCFTGs.
On the basis of the flattening function, Appendix C proves the following result.
sDCPs have the same string-generating power as LCFRSs.
5. Hybrid Grammars
A hybrid grammar consists of a string grammar and a tree grammar. Intuitively, the string grammar is used for parsing a given string w, and the tree grammar simultaneously generates a hybrid tree h with str(h) = w. To synchronize these two processes, we couple derivations in the grammars in a way similar to how this is commonly done for synchronous grammars—namely, by indexed symbols. However, we apply the mechanism not only to derivational nonterminals but also to terminals.
Let Ω be a ranked alphabet. With each symbol ω ∈ Ω and each index u ∈ ℕ+ we associate an indexed symbol ωu, of the same rank as ω. Let Δ be another ranked alphabet (Ω ∩ Δ = ∅) and Y ⊆ X. We consider the set of all s-terms built over indexed symbols from Ω, (unindexed) symbols from Δ, and variables in Y, in which each index u occurs at most once.
For a s-term s, let ind(s) be the set of all indices occurring in s. The deindexing function removes all indices from a s-term s. The corresponding set of terms with indexed symbols, and the effect of deindexing on terms, are defined much as above.
5.1 LCFRS/sDCP Hybrid Grammars
We first couple a LCFRS and a sDCP in order to describe a set of hybrid trees.
In order to define the “derives” relation, for some index u and some rule ρ of the form of Equation (4), we need the additional notions of nonterminal reindexing and of terminal reindexing. The nonterminal reindexing is an injective function fU that replaces each index at a nonterminal occurrence of the rule by one that does not clash with the indices in an existing set U ⊆ ℕ+. We may, for example, define fU such that it maps each v ∈ ℕ+ to the smallest v′ ∈ ℕ+ such that v′ ∉ U ∪ {fU(1), … , fU(v − 1)}. We extend fU to apply to terms, s-terms, and rules in a natural way, to replace indices by other indices, but leaving all other symbols unaffected. A terminal reindexing g maps indices at occurrences of terminals in the rule to the indices of corresponding occurrences of terminals in the sentential form; we extend g to terms, s-terms, and rules in the same way as for fU. The definition of fU is fixed, while an appropriate g needs to be chosen for each derivation step. The relation [s1, s2] ⇒G [s′1, s′2], for index u and rule ρ, then holds if and only if:
- •
ρ ∈ P is ,
- •
,
- •
there is a terminal reindexing g such that (and hence ),
- •
is obtained from by consistently substituting occurrences of variables by s-terms in , and is obtained from by consistently substituting occurrences of variables by s-terms in .
From a pair [s1,s2] ∈ [G], we can construct the hybrid tree (s,≤s) over (Γ,Σ) by letting s be the result of removing all indices from s2 and, for each combination of positions p1, p′1, p2, p′2 such that s1(p1) and s2(p2) carry the same index and s1(p′1) and s2(p′2) carry the same index, setting p2 ≤s p′2 if and only if p1 ≤ℓ p′1. (The lexicographical ordering ≤ℓ on positions here simplifies to the linear ordering of integers, as positions in strings always have length 1.) In words, occurrences of terminals in s obtain a total order in accordance with the order in which corresponding terminals occur in s1. The set of all such (s,≤s) will be denoted by L(G).
| application of … | nonterminal reindexing | terminal reindexing |
|---|---|---|
| ρ1 | f{2,3,4}(2) = 5 | g identity |
| ρ2 | f{2,3,4,5} identity | g(1) = 2 and g(2) = 4 |
| ρ3 | f{3} identity | g(1) = 3 |

Here fU(i) = i and g(j) = j if not specified otherwise.
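As a concrete check, the canonical fU described above can be computed with a few lines of Python (an illustrative sketch, assuming indices are positive integers); it reproduces the value f{2,3,4}(2) = 5 from the table above.

```python
def f_U(U, v):
    """The canonical nonterminal reindexing: f_U(v) is the smallest positive
    integer not in U and not among f_U(1), ..., f_U(v - 1)."""
    used = set(U)
    for _ in range(v):                 # assign f_U(1), ..., f_U(v) in turn
        w = 1
        while w in used:
            w += 1
        used.add(w)
    return w

assert f_U({2, 3, 4}, 1) == 1          # 1 is free
assert f_U({2, 3, 4}, 2) == 5          # 1 is taken by f_U(1), so 5 is next
```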
Note that in the LCFRS that is the first component, nonterminal V has fanout 2, and the LCFRS thereby has fanout 2. The tree produced by the second component is a parse tree in the traditional sense—that is, it specifies exactly how the first component analyzes the input string. Each LCFRS can in fact be extended to become a canonical LCFRS/sDCP hybrid of this kind, in which there is no freedom in the coupling of the string grammar and the tree grammar. Traditional frameworks for LCFRS parsing can be reinterpreted as using such hybrid grammars.
A derivation of a LCFRS/sDCP hybrid grammar G can be represented by a derivation tree (cf. Figure 8). It combines a derivation tree of the first component of G and a derivation tree of its second component, letting them share the common parts. We make a graphical distinction between arguments of the first component (rectangles) and those of the second component (ovals). Implicit in a derivation tree is the reindexing of terminal indices. The evaluation ϕ1 of the derivation tree d in Figure 8 yields the indexed string, and the evaluation ϕ2 of d yields the tree.
5.2 Other Classes of Hybrid Grammars
In order to illustrate the generality of our framework, we will sketch three more classes of hybrid grammars. In these three classes, the first component or the second component, or both, are less powerful than in the case of the LCFRS/sDCP hybrid grammars defined previously, and thereby the resulting hybrid grammars are less powerful, in the light of the observation that sMGs are syntactic variants of well-nested LCFRSs and sCFTGs are syntactic variants of sDCPs with s-rank restricted to 1. Noteworthy are the differences between the four classes of hybrid grammars in the formal definition of their derivations.
In the definition of the “derives” relation we have to use a reindexing function. Because terminals are produced by a rule application (instead of being consumed as in the LCFRS/sDCP case), there is no need for a terminal reindexing that matches indices of terminal occurrences of the applied rule with those of the sentential form. Instead, terminal indices occurring in the rule have to be reindexed away from the sentential form in the same way as for nonterminal indices. Thus we use one reindexing function fU that applies to nonterminal indices and terminal indices.
We define [s1, s2] ⇒G [s′1, s′2], for some index u and some rule ρ, if and only if:
- •
there are positions p1 and p2 such that s1(p1) and s2(p2) are occurrences of the same nonterminal with index u,
- •
ρ ∈ P is ,
- •
U = ind(s1) ∖{u} = ind(s2) ∖{u},
- •
for i = 1,2.
5.3 Probabilistic LCFRS/sDCP Hybrid Grammars and Parsing
In the usual way, we can extend LCFRS/sDCP hybrid grammars to become probabilistic LCFRS/sDCP hybrid grammars. For this we can assign a probability to each hybrid rule, under the constraint of a properness condition. More precisely, for each nonterminal A the probabilities of all hybrid rules with A in their left-hand sides sum up to 1. In this way a probabilistic LCFRS/sDCP hybrid grammar induces a distribution over hybrid trees. Thus it can be considered as a generative model. See also Nivre (2010) for a general survey of probabilistic parsing.
Algorithm 1 shows a parsing pipeline that takes as input a probabilistic LCFRS/sDCP hybrid grammar G and a sentence w ∈ Γ*. As output it computes the hybrid tree h ∈ L(G) that is derived by the most likely derivation whose first component derives w. In line 1 the first component of G is extracted. Because we will later restore the second component, we assume that a tag is attached to each LCFRS rule that uniquely identifies the original LCFRS/sDCP rule. This also means that two otherwise identical LCFRS rules are treated as distinct if they were taken from two different LCFRS/sDCP rules. In line 2 the string w is parsed. For this any standard LCFRS parser can be used. Such parsers, for instance those by Seki et al. (1991) and Kallmeyer (2010), typically run in polynomial time. The technical details of the chosen LCFRS parser (such as the form of its items or the manner in which it iterates over rules) are irrelevant, as for our framework only the functionality of the parsing component matters. The parsing algorithm builds a succinct representation of all derivations of w. In a second, linear-time phase the most likely derivation tree d for w is extracted. In line 3 the dependency graph of d is enriched by the inherited and synthesized arguments that correspond to the second components of the rules occurring in d; here, the identity of the original LCFRS/sDCP rule is needed. This results in an intermediate structure that is not yet a derivation tree of G, as the terminals still need to be assigned unique indices. This is done first by an indexing of the terminals from w in an arbitrary manner, and then by traversing the derivation to associate these indices with the appropriate terminal occurrences in the derivation, leading to a derivation tree d′ of the LCFRS/sDCP hybrid grammar G. Note that ϕ1(d′) = s1. Finally, in line 6 the derivation tree d′ is evaluated by ϕ2, yielding a tree in TΣ*.
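The data flow of this pipeline can be rendered schematically as follows. Every helper is a hypothetical placeholder for a step just described, not an actual API of any existing parser; only the order of the steps is taken from the text.

```python
def extract_lcfrs(G):
    """Line 1: the LCFRS component, each rule tagged with its original
    LCFRS/sDCP rule (placeholder)."""
    raise NotImplementedError

def lcfrs_parse(G1, w):
    """Line 2: any standard LCFRS parser; a succinct representation of all
    derivations of w, or None on parse failure (placeholder)."""
    raise NotImplementedError

def best_derivation(forest):
    """The most likely derivation tree, extracted in linear time (placeholder)."""
    raise NotImplementedError

def attach_sdcp(G, d):
    """Line 3: enrich the dependency graph of d with the inherited and
    synthesized arguments of the second components (placeholder)."""
    raise NotImplementedError

def index_terminals(d_hat, w):
    """Index the terminals of w arbitrarily and propagate the indices
    through the derivation (placeholder)."""
    raise NotImplementedError

def evaluate_phi2(d_prime):
    """Line 6: evaluate the derivation tree by phi_2 (placeholder)."""
    raise NotImplementedError

def parse_hybrid(G, w):
    G1 = extract_lcfrs(G)
    forest = lcfrs_parse(G1, w)
    if forest is None:
        return None                       # parse failure
    d = best_derivation(forest)
    d_hat = attach_sdcp(G, d)
    d_prime = index_terminals(d_hat, w)
    return evaluate_phi2(d_prime)         # an s-term over Sigma
```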
6. Grammar Induction
For most practical applications, hybrid grammars would not be written by hand, but would be automatically extracted from finite corpora of hybrid trees, over which they should generalize. To be precise, the task is to construct a hybrid grammar G out of a corpus c such that c ⊆ L(G).
During grammar induction each hybrid tree h = (s,≤s) of the corpus is decomposed. A decomposition will determine the structure of a derivation of h by the grammar. Classically, this decomposition and thereby the resulting derivations resemble the structure of s. This approach has been pursued, for instance, by Charniak (1996) for CFGs, and by Maier and Søgaard (2008) and Kuhlmann and Satta (2009) for LCFRSs. There is no guarantee, however, that this approach is optimal from the perspective of, for example, parsing efficiency, the grammar’s potential to generalize from training data, or the size of the grammar.
For a more general approach that offers additional degrees of freedom we extend the framework of Nederhof and Vogler (2014) and let grammar induction depend on a decomposition strategy of the string str(h), called recursive partitioning. One may choose one out of several such strategies. We consider four instances of induction of hybrid grammars, which vary in the type of hybrid trees that occur in the corpus, and the type of the resulting hybrid grammar:
| | corpus | hybrid grammar |
|---|---|---|
| 1. | phrase structures | LCFRS/sDCP |
| 2. | phrase structures | LCFRS/sCFTG |
| 3. | dependency structures | LCFRS/sDCP |
| 4. | dependency structures | LCFRS/sCFTG |
6.1 Recursive Partitioning and Induction of LCFRS
A recursive partitioning of a string w of length n is a tree π whose nodes are labeled with subsets of [n]. The root of π is labeled with [n]. Each leaf of π is labeled with a singleton subset of [n]. Each non-leaf node has at least two children and is labeled with the union of the labels of its children, which furthermore must be disjoint. To fit recursive partitionings into our framework of s-terms, we regard 𝒫(ℕ) as a ranked alphabet with 𝒫(ℕ) = 𝒫(ℕ)(1). We let π ∈ T𝒫(ℕ)* with |π| = 1.
We say a set J ⊆ [n] has fanoutk if k is the smallest number such that J can be written as J = J1 ∪… ∪ Jk, where:
- •
each Jℓ (ℓ ∈ [k]) is of the form {iℓ,iℓ + 1, … , iℓ + mℓ}, for some iℓ and mℓ ≥ 0, and
- •
i ∈ Jℓ, i′ ∈ Jℓ′ (ℓ,ℓ′ ∈ [k]) and ℓ < ℓ′ imply i < i′.
Note that J1, … , Jk are uniquely defined by this, and we write spans( J) = 〈J1, … , Jk〉. The fanout of a recursive partitioning is defined as the maximal fanout of its nodes.
Figure 10 presents two recursive partitionings of a string with seven positions. The left one has fanout 3 (because of the node label {1,3,6,7} ={1}∪{3}∪{6,7}), whereas the right one has fanout 2.
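A compact sketch of spans and fanout (assuming J is any finite set of positive integers) replays the figures just mentioned.

```python
def spans(J):
    """Decompose J into its maximal runs of consecutive positions,
    ordered left to right."""
    runs, prev = [], None
    for i in sorted(J):
        if prev is not None and i == prev + 1:
            runs[-1].append(i)
        else:
            runs.append([i])
        prev = i
    return runs

def fanout(J):
    return len(spans(J))

assert spans({1, 3, 6, 7}) == [[1], [3], [6, 7]]   # the node from Figure 10
assert fanout({1, 3, 6, 7}) == 3
```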
Algorithm 2 constructs a LCFRS G out of a string w = α1⋯αn and a recursive partitioning π of w. The nonterminals are symbols of the form ⦇J⦈, where J is a node label from π. (In practice, the nonterminals ⦇J⦈ are replaced by other symbols, as will be explained in Section 6.8.) For each position p of π a rule is constructed. If p is a leaf labeled by {i}, then the constructed rule simply generates αi where ⦇{i}⦈ has fanout 1 (lines 6 and 7). For each internal position labeled J0 and its children labeled J1, … , Jj we compute the spans of these labels (line 9). For each argument q of ⦇J0⦈ a s-term sq is constructed by analyzing the corresponding component J0,q of spans( J0) (lines 12–13). Here we exploit the fact that J0,q can be decomposed in a unique way into a selection of sets that are among the spans of J1, … , Jj. Each of these sets translates to one variable. The resulting grammar G allows for exactly one derivation tree that derives w and has the same structure as π. We say that G parses w according to π.
We observe that G as constructed in this way is in a particular normal form.3 To be precise, by the notation in Equation (2) each rule satisfies one of the following:
- •
n ≥ 2 and no terminal occurs in s1, … , sk0, (structural rule)
- •
n = 0, k0 = 1, and s1 = 〈α〉 for some α ∈ Γ. (terminal generating rule)
Conversely, each derivation tree d of a LCFRS in normal form can be translated to a recursive partitioning πd, by processing d bottom–up as follows. Each leaf, that is, a rule A(〈α〉) →〈〉, is replaced by the singleton set {i} where i is the position of the corresponding occurrence of α in the accepted string. Each internal node is replaced by the union of the sets that were computed for its children. For the constructed LCFRS G that parses w according to π and for its only derivation tree d, we have π = πd.
Observe that terminals are generated individually by the rules that were constructed from the leaves of the recursive partitioning but not by the structural rules obtained from internal nodes. Consequently, the LCFRS that we induce is in general not lexicalized. In order to obtain a lexicalized LCFRS, the notion of recursive partitioning would need to be generalized. We suspect that such a generalization is feasible, in conjunction with a corresponding generalization of the induction techniques presented in this article. This would be technically involved, however, and is therefore left for further research.
6.2 Construction of Recursive Partitionings
Figure 11 sketches three pipelines to induce a LCFRS from a hybrid tree, which differ in the way the recursive partitioning is constructed. The first way (cf. Figure 11a) is to extract a recursive partitioning directly from a hybrid tree (s,≤s). This extraction is specified in Algorithm 3, which recursively traverses s. For each node in s, the gathered input positions consist of those obtained from the children (lines 6 and 7), plus possibly one input position from the node itself if its label is in Γ (line 5). The case distinction in line 8 and following is needed because every non-leaf in a recursive partitioning must have at least two children.
For a first example, consider the phrase structure in Figure 4a. The extracted recursive partitioning is given at the beginning of Example 16. For a second example, consider the dependency structure and the extracted recursive partitioning in Figure 12.
Note that if Algorithm 3 is applied to an arbitrary hybrid tree, then the accumulated set of input positions is empty for those positions of s that do not dominate elements with labels in Γ. The algorithm requires further (straightforward) refinements before it can be applied to hybrid trees with several roots, that is, if |s| > 1. In much of what follows, we ignore these special cases.
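A sketch in the spirit of Algorithm 3, reusing the assumed term encoding from Section 2, is given below; `strpos`, a map from the Γ-positions of s to the corresponding positions of str(h), is an assumption of the sketch. In line with the remark above, it handles only the single-rooted case.

```python
def rec_part(t, strpos, p):
    """Recursive partitioning of the subtree rooted at term t (position p):
    a pair (J, children), or None if no Gamma-position is dominated."""
    parts = [({strpos[p]}, [])] if p in strpos else []
    for j, arg in enumerate(t[2], start=1):          # argument s-terms of t
        for i, child in enumerate(arg, start=1):
            sub = rec_part(child, strpos, p + (j, i))
            if sub is not None:
                parts.append(sub)
    if not parts:
        return None               # nothing in Gamma below this node
    if len(parts) == 1:
        return parts[0]           # a non-leaf needs at least two children
    J = set().union(*(q for q, _ in parts))
    return (J, parts)

def extract_partitioning(s, strpos):
    assert len(s) == 1            # single root, as assumed above
    return rec_part(s[0], strpos, (1,))
```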
By this procedure, a recursive partitioning extracted from a discontinuous phrase structure or a non-projective dependency structure will have a fanout greater than 1. By then applying Algorithm 2, the resulting LCFRS will have fanout greater than 1. The more discontinuity exists in the input structures, the greater the fanout will be of the recursive partitioning and the resulting LCFRS, which in turn leads to greater parsing complexity. We therefore consider how to reduce the fanout of a recursive partitioning, before it is given as input to Algorithm 2. We aim to keep the structure of a given recursive partitioning largely unchanged, except where it exceeds a certain threshold on the fanout.
Algorithm 4 presents one possible such procedure. It starts at the root, which by definition has fanout 1. Assuming the fanout of the current node does not exceed k, there are two cases to be distinguished. If the label J of the present node is a singleton, then the node is a leaf, and we can stop (lines 2–3). Otherwise, we search breadth-first through the subtree rooted in the present node to identify a descendant p such that both its label J′ and J ∖ J′ have fanout not exceeding k (line 4). It is easy to see that such a node always exists: Ultimately, breadth-first search will reach the leaves, which are each labeled with a single number. One of these numbers must match either the lowest or the highest element of some maximal subset of consecutive numbers from J, so that the fanout of J cannot increase if that number is removed.
The current node is now given two children π|p and t. The first is the subtree rooted in the node labeled J′ that we identified earlier, and the second is a copy of the subtree rooted in the present node, but with J′ subtracted from the label of every node (lines 7–14). Nodes labeled with the empty set are removed (line 10), and if a node has the same label as its parent then the two are collapsed (line 13). As the two children each have fanout not exceeding k, we can apply the procedure recursively (line 6).
The recursive partitioning π in the left half of Figure 10 has a node labeled {1,3,6,7}, with fanout 3. With J = {1,2,3,5,6,7} and k = 2, one possible choice for J′ is {3,7}, as then both J′ and J ∖ J′ = {1,2,5,6} have fanout not exceeding 2. This leads to the partitioning π′ in the right half of the figure. Because now all node labels have fanout not exceeding 2, recursive traversal will make no further changes. The partitioning π′ is similar to π in the sense that subtrees that are not on the path to {3,7} remain unchanged. Another valid choice for J′ would be {5}, as J ∖ {5} = {1,2,3,6,7} has fanout 2. Not valid choices for J′ would be {2} or {1,6}, as J ∖ {2} = {1,3,5,6,7} and J ∖ {1,6} = {2,3,5,7} both have fanout 3.
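Reusing fanout from the sketch in Section 6.1, the admissibility test of line 4 is a one-liner; the assertions replay the example above.

```python
def admissible(J, Jp, k):
    """A candidate J' is admissible iff both J' and J minus J' have
    fanout at most k."""
    return fanout(Jp) <= k and fanout(J - Jp) <= k

J = {1, 2, 3, 5, 6, 7}
assert admissible(J, {3, 7}, 2)        # complement {1,2,5,6}, fanout 2
assert admissible(J, {5}, 2)           # complement {1,2,3,6,7}, fanout 2
assert not admissible(J, {2}, 2)       # complement {1,3,5,6,7}, fanout 3
assert not admissible(J, {1, 6}, 2)    # complement {2,3,5,7}, fanout 3
```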
Algorithm 4 ensures that subsequent induction of an LCFRS (cf. Figure 11b) leads to a binary LCFRS. Note the difference between binarization algorithms such as those from Gómez-Rodríguez and Satta (2009) and Gómez-Rodríguez et al. (2009), which are applied on grammar rules, and our procedure, which is applied before any grammar is obtained. Unlike van Cranenburgh (2012), moreover, our objective is not to obtain a “coarse” grammar for the purpose of coarse-to-fine parsing.
Note that if k is chosen to be 1, then the resulting partitioning is consistent with derivations of a CFG. Even simpler partitionings exist. In particular, the left-branching partitioning has internal node labels that are {1,2, … , m}, each with children labeled {1, … , m − 1} and {m}. These are consistent with the computations of finite automata (FA) in reverse direction; see Figure 11c. Similarly, there is a right-branching recursive partitioning, reflecting finite-state processing in a forward direction. The use of branching partitionings completely detaches string parsing from the structure of the given hybrid tree.
The relation between recursive partitioning and worst-case parsing complexity of the induced LCFRS is summarized in Table 1. We remark that the parsing complexity for unrestricted LCFRSs would improve from 𝒪(n^{3k}) to 𝒪(n^{2k+2}) if we could ensure that the LCFRSs are well-nested (Gómez-Rodríguez, Kuhlmann, and Satta 2010), or in other words, if we replace LCFRSs by sMGs. How to refine pipeline (a) to achieve this is left for future investigation.
| Pipeline | Type of the induced LCFRS | Parsing complexity for string of length n |
|---|---|---|
| (a) | LCFRS of arbitrary fanout k and rank m | 𝒪(n^{k·(m+1)}) |
| (b) k ≥ 1 | binarized k-LCFRS | 𝒪(n^{3k}) |
| (b) k = 1 | binarized CFG | 𝒪(n^3) |
| (c) right | FA | 𝒪(n) |
| (c) left | (reverse) FA | 𝒪(n) |
6.3 Induction of Hybrid Grammars
In the remainder of Section 6 we extend the induction pipelines for LCFRS to induction pipelines for LCFRS/sDCP hybrid grammars, as illustrated in Figure 14. Given a hybrid tree h = (s,≤s), we choose a recursive partitioning π obtained in one of the ways discussed in the previous section. We apply Algorithm 2 to induce a LCFRS G1 that generates str(h). Using the same recursive partitioning π, we induce a sDCP G2 that generates s. For this we use either Algorithm 5 or 6, depending on whether h is a phrase structure or a dependency structure.
In preparation, we define several notions to relate labels of the recursive partitioning with subsets of the positions of s. Let posΓ(s) = {p1, … , pn} with pi ≤s pi+1 (i ∈ [n − 1]) and let J be a label of π. We define the set Π(J) = {pi ∣ i ∈ J}, which identifies the nodes of s corresponding to the elements in J. For any subset U ⊆ pos(s), we construct the sets ⊤(U) and ⊥(U), which, intuitively, delimit U from the top and the bottom, respectively. Formally, p ∈ ⊤(U) if and only if p ∈ U and parent(p) ∉ U. We have p ∈ ⊥(U) if and only if parent(p) ∈ U and p ∉ U.
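Under the position encoding assumed in the earlier sketches (the parent of a term position p + (j, i) is p), the delimiter sets can be written directly:

```python
def parent(p):
    return p[:-2] if len(p) > 1 else None   # root terms have no parent

def top(U):
    """⊤(U): positions in U whose parent is not in U."""
    return {p for p in U if parent(p) not in U}

def bottom(U, all_positions):
    """⊥(U): positions outside U whose parent is in U."""
    return {p for p in all_positions if p not in U and parent(p) in U}
```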
6.4 Induction of LCFRS/sDCP Hybrid Grammars from Phrase Structures
Grammar induction is relatively straightforward for a given hybrid tree h = (s,≤s) that is a phrase structure, and a given recursive partitioning π. For each node of π, its label is a set of positions of str(h), and these positions must each correspond to a leaf node of s, by virtue of h being a phrase structure. We can apply a closure operation on this set of leaf nodes that includes a node if all of its children are included. Formally, let J be a label of π. Then C(J) is the smallest set U ⊆pos(s) satisfying (i) Π(J) ⊆ U and (ii) if p ∈pos(s), children(p)≠∅, and children(p) ⊆ U, then p ∈ U.
The set C(J) corresponds to a set of (maximal, disjoint) sub-s-terms of s, that is, C(J) can be partitioned such that each part contains the positions that correspond to one sub-s-term. These parts can be arranged according to the lexicographical ordering on positions in s. Formally, for each set U ⊆pos(s), we define s-rk(U) to be the maximal number k such that U can be partitioned into sets U1, … , Uk where:
- •
for every i ∈ [k], p ∈ Ui, and p′ ∈ U: if p′ = parent(p) or p′ = right-sibling(p), then p′ ∈ Ui, and
- •
for every i,j ∈ [k], p ∈ Ui, and p′ ∈ Uj we have p ≤ℓp′ implies i ≤ j.
Note that U1, … , Uk are uniquely defined by this, and we write gspans(U) = 〈U1, … , Uk〉. The function “gspans” generalizes “spans,” which we defined for recursive partitioning earlier. Similarly, s-rk as used here generalizes the notion of fanout as used for node labels of a recursive partitioning.
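The closure C(J) defined above can be sketched as a fixed-point computation; the `children` map, from each position of s to the list of its child positions, is an assumption of the sketch.

```python
def closure(pi_J, children):
    """C(J): the smallest superset of Pi(J) that contains every position
    all of whose children are already contained."""
    U = set(pi_J)
    changed = True
    while changed:
        changed = False
        for p, ch in children.items():
            if p not in U and ch and all(c in U for c in ch):
                U.add(p)
                changed = True
    return U
```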
These concepts allow us to define a mapping from a node J in π to the set gspans(C(J)) of positions of s. This mapping is such that if J1, … , Jm are the child nodes of node J in π, then the j-th set in gspans(C(Ji)) (i ∈ [m], j ∈ [s-rk(Ji)]) is contained in the q-th set in gspans(C(J)) for some q ∈ [s-rk(J)]. A different way of looking at this is that the image of J can be constructed out of the images of Ji (i ∈ [m]), possibly by adding further nodes linking existing sub-s-terms together.
Such composition of s-terms into larger s-terms is realized by a sDCP without inherited arguments, as constructed by Algorithm 5. It builds one production for each node J0 with children J1, … , Jm. J0 has a synthesized attribute for each set in gspans(C(J0)). The corresponding s-term sq is constructed by traversing the positions in the q-th set of gspans(C(J0)) (lines 18–31). A sequence of consecutive positions that are also in some set in gspans(C(Ji)) is realized by a single variable (lines 21–24). Remaining positions become nodes of sq (lines 26–30).
Figure 16 compares the two derivation trees of the LCFRS/sDCP hybrid grammars of Examples 19 (left) and 20 (right).
We observe the following general property of Algorithm 5: If the recursive partitioning extracted by Algorithm 3 is given as input, then each nonterminal of the induced sDCP G2 has s-rank 1 (as in Example 19). This coincides with pipeline (a) of Figure 11, that is, the induction of a LCFRS of arbitrary fanout. However, if the recursive partitioning is transformed by Algorithm 4 or if the left-branching or right-branching recursive partitioning is used (as in Example 20 and as in pipelines (b) and (c) of Figure 11), then the fanout of the induced LCFRS decreases and its derivations are binarized. At the same time, the numbers of synthesized arguments in the induced sDCP may increase. In other words, we witness a trade-off between the degree of mild context-sensitivity of the LCFRS and the numbers of arguments of the sDCP.
We conclude:
For each phrase structure h and recursive partitioning π of str(h), we can construct a LCFRS/sDCP hybrid grammar G such that G generates h and parses str(h) according to π. Moreover, the sDCP that is the second component only has synthesized arguments.
6.5 Induction of LCFRS/sCFTG Hybrid Grammars from Phrase Structures
Given a recursive partitioning π and a phrase structure h = (s,≤s), the construction in Section 6.4 relied on a mapping from a node of π labeled J to sets of positions of maximal sub-s-terms in s whose yields together cover exactly the positions in J. We now say π is chunky with respect to h if, for each node J of π, we have s-rk(C(J)) = 1, that is, the nodes in its image under the mapping form a single sub-s-term. If π is chunky with respect to h, then each sDCP nonterminal in the construction from Section 6.4 will have a single synthesized argument. A sDCP with this property is equivalent to a sCFTG in which all nonterminals have rank 0, that is, a regular tree grammar. Therefore:
For each phrase structure h and recursive partitioning π of str(h) that is chunky with respect to h, we can construct a LCFRS/sCFTG hybrid grammar G such that G generates h and parses str(h) according to π. Moreover, the second component of G is a regular tree grammar.
6.6 Induction of LCFRS/sDCP Hybrid Grammars from Dependency Structures
Let h = (s,≤s) be a hybrid tree over (Σ,Σ), or in other words, a dependency structure, and let π be a recursive partitioning of str(h). The task is to construct a LCFRS/sDCP hybrid grammar that generates h and parses str(h) according to π.
In the case of phrase structures, we could identify entire subtrees of s whose yields corresponded to subsets of node labels of π. A sequence of consecutive such subtrees (i.e., whose roots were siblings) was then translated to a synthesized argument of a sDCP rule. With dependency structures, however, we need to allow for the scenario where we have a label J of a node in π and two positions p and p′ in s with parent(p′) = p, and p ∈ Π(J) whereas p′∉Π(J). Thus we will need to consider a subtree of s rooted in p, with a “gap” for child p′. This gap will be implemented by introducing an inherited argument in the sDCP, so that the gap can be filled by a structure built elsewhere.
Like the induction algorithms considered before, Algorithm 6 constructs a rule for each node p of π. If p is a leaf of π (line 6), labeled with a singleton J0 = {i}, we construct a sDCP rule producing α(〈x1〉), where α is the i-th element of str(h), assuming the node labeled α is not a leaf in s (line 10). The i-rank of ⦇J0⦈ is 1. Coupled to the corresponding LCFRS rule, this creates a hybrid rule. If the node labeled α is a leaf of s, we can dispense with the inherited argument of ⦇J0⦈ in the second component and replace α(〈x1〉) by α (line 8).
If p is an internal node of π, then we proceed as follows. We determine the set Π(J0) of positions of s that correspond to the numbers in the label of p. We compute the sets of positions ⊤(Π(J0)) and ⊥(Π(J0)) that delimit Π(J0) from the top and bottom, respectively. We bundle consecutive positions in ⊤(Π(J0)) and ⊥(Π(J0)) by applying gspans. For brevity, we write ⊤max(J0) instead of gspans(⊤(Π(J0))) and ⊥max(J0) instead of gspans(⊥(Π(J0))). The nonterminal ⦇J0⦈ is given a synthesized attribute for each set in ⊤max(J0) (line 12), and an inherited attribute for each set in ⊥max(J0) (line 13).
Analogously, we determine sets ⊤max(Ji) and ⊥max(Ji) of positions of s for each child of p labeled Ji (lines 14–15). Each set in ⊤max(Ji) corresponds to a distinct variable in the sDCP rule. Each set in ⊥max(Ji) corresponds to a s-term which combines variables. In particular, each set in ⊥max(Ji) is the (disjoint) union of some of the sets that correspond to variables. Conversely, each set that corresponds to a variable is disjoint with all but one of the sets in the ⊥max(Ji). This fact is used in lines 20–27 to construct these s-terms. Having specified the variables and s-terms, the construction of the sDCP rule is completed in line 18. Much as in Section 6.4, this sDCP rule can be coupled to the corresponding LCFRS rule to form a hybrid rule.
Figure 17 presents part of the hybrid tree from Figure 4b, together with part of a recursive partitioning. Let us use the symbols pP, pM, ph, pl for the four positions in the hybrid tree corresponding to the string positions 2,3,5,6. Naturally, Π({2,3,5,6}) ={pP,pM,ph,pl}, Π({2,6}) ={pP,pl}, Π({3,5}) ={pM,ph}.
We conclude:
For each dependency structure h and recursive partitioning π of str(h), we can construct a LCFRS/sDCP hybrid grammar G such that G generates h and parses str(h) according to π.
6.7 Induction of LCFRS/sCFTG Hybrid Grammars from Dependency Structures
For dependency structures, we can define a notion of chunkiness similar to that in Section 6.5, relying on the definitions in Section 6.4. We say a recursive partitioning π with respect to dependency structure h = (s,≤s) is chunky if for every node label J of π the length of ⊤max(J) is 1. We have seen in Section 6.6 that the length of ⊤max(J) determines the fanout of the corresponding nonterminal in the second component of the constructed hybrid rule. We observed before that a sDCP grammar in which all nonterminals have s-rank 1 is equivalent to a sCFTG. We may conclude:
For each dependency structure h and recursive partitioning π of str(h) that is chunky with respect to h, we can construct a LCFRS/sCFTG hybrid grammar G such that G generates h and parses str(h) according to π.
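Chunkiness is then a direct check on ⊤max. A sketch, reusing top and gspans from the sketch above and assuming a map Pi from the node labels of π to their position sets Π(J):

```python
def is_chunky(labels, Pi, parent, children):
    """True iff |gspans(top(Pi[J]))| = 1 for every node label J of pi."""
    return all(
        len(gspans(top(Pi[J], parent), parent, children)) == 1
        for J in labels
    )
```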
6.8 Induction on a Corpus
In the previous sections, we induced a LCFRS/sDCP hybrid grammar G from a single phrase structure or dependency structure h. Given a training corpus c of phrase structures or dependency structures, we now want to induce a single hybrid grammar G that generalizes over the corpus. To this end, we apply one of the induction techniques to each hybrid tree h in c. The resulting grammars are condensed into a single hybrid grammar G by relabeling the existing nonterminals of the form ⦇J⦈. The relabeling must follow a consistent naming scheme, to ensure that hybrid rules constructed from one hybrid tree for adjacent nodes of π can still link together to form a derivation. Beyond that, the naming scheme should also allow rules constructed from different hybrid trees to interact, so that L(G) contains meaningful hybrid trees that were not in c.
Two such naming schemes were considered in Nederhof and Vogler (2014) for a corpus of phrase structures. With strict labeling, a nonterminal name consists of the terminal labels at the roots of the relevant subtrees of s. In Example 20, therefore, we would replace ⦇{1,2}⦈ by 〈hat,ADV〉. (In a more realistic grammar, we would likely have a part of speech instead of hat.) This tends to produce many nonterminal labels, one for each combination of terminals. Therefore, an alternative was considered, called child labeling: for two or more consecutive siblings in s, we collapse their sequence of terminals into a single tag of the form children-of(X), where X is the terminal label of the parent. This creates far fewer nonterminal names, but conflates different sequences of terminals that may occur below the same terminal. For more details, we refer the reader to Nederhof and Vogler (2014).
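The two schemes can be sketched as follows. Here roots are the root positions of the relevant consecutive subtrees of s (left to right), label maps a position to its terminal label, and parent is as before; the function names are ours, and the treatment of a single root under child labeling is an assumption of the sketch.

```python
def strict_name(roots, label):
    # one nonterminal name per combination of terminal labels,
    # e.g. ('hat', 'ADV') in Example 20
    return tuple(label[p] for p in roots)

def child_name(roots, label, parent):
    # two or more consecutive siblings collapse into children-of(X),
    # where X is the terminal label of the shared parent
    if len(roots) >= 2:
        return 'children-of(%s)' % label[parent[roots[0]]]
    return strict_name(roots, label)  # assumption: a single root keeps its label
```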
For grammar induction from a corpus of dependency structures, there is the additional complication of inherited arguments. If a LCFRS/sDCP hybrid grammar is induced from a single dependency structure, using nonterminals of the form ⦇J⦈ as in Section 6.6, then cycles cannot occur in derivations of the sDCP, because, by definition, no cycles occur in the given dependency structure. However, if we replace nonterminals of the form ⦇J⦈ by any other choice of symbols and combine rules induced from different hybrid trees of a corpus, then cycles may arise if the relationship between inherited and synthesized arguments is confused.
One solution is to encode the dependencies between inherited and synthesized attributes into the nonterminal names—that is, for each synthesized attribute the list of inherited attributes from which it receives subtrees is specified (Angelov et al., 2014). Conceptually this is close to the g functions in the proof of Theorem 2 in Appendix C. One way to formalize this encoding is as follows.
As explained in Section 6.6, if we have a node label J in π, and ⊥max(J) = 〈O1, …, Ok〉 and ⊤max(J) = 〈I1, …, Ik′〉, then this leads to the creation of a nonterminal with k inherited arguments and k′ synthesized arguments, so k + k′ arguments in total. We can construct a s-term σ in which every number in [k + k′] occurs exactly once. In σ, argument numbers are located relative to one another as the corresponding positions in ⊥max(J) ⋅ ⊤max(J) are located in the hybrid tree. More precisely, if the i-th and the j-th element of ⊥max(J) ⋅ ⊤max(J) represent positions that share a parent, and the positions of the i-th precede those of the j-th, then the number i precedes j in the root of the same sub-s-term of σ. Similarly, if the positions of the j-th element are descendants of at least one position in the i-th element, then j occurs as a descendant of i in σ.
In combination with strict labeling, the nonterminal ⦇{3,5}⦈ in Example 21 could be replaced by 〈Pietlezen,Marie,helpen,σ〉, where σ is the s-term 3(1(2)). The “3” (third argument of the nonterminal) stands for the node in the hybrid tree labeled with helpen. Descendants of this node are the two nodes labeled with Piet and lezen, which belong to the first argument. The “2” (second argument) stands for the node labeled Marie, which is a descendant of lezen and therefore occurs below the “1.”
7. Experiments
In this section, we present the first experimental results on induction of LCFRS/sDCP hybrid grammars from dependency structures and their application to parsing. We further present experiments on constituent parsing similar to those of Nederhof and Vogler (2014), but now with a larger portion of the TIGER corpus (Brants et al., 2004).
The purpose of the experiments is threefold: (i) A proof-of-concept for the induction techniques developed in Section 6 is provided. (ii) The influence of the strategy of recursive partitioning is evaluated empirically, as is the influence of the nonterminal naming scheme. We are particularly interested in how the strategy of recursive partitioning affects the size, parse time, accuracy, and robustness of the induced hybrid grammars. (iii) The performance of our architecture for syntactic parsing based on LCFRS/sDCP hybrid grammar is compared with two existing parsing architectures.
For all experiments, a corpus is split into a training set and a test set. A LCFRS/sDCP hybrid grammar is induced from the training set. Probabilities of rules are determined by relative frequency estimation. The induced grammar is then applied to each sentence of the test set (see Section 5.3), and the parse obtained from the most probable derivation is compared with the gold standard, resulting in a score for each sentence. The average of these scores over all test sentences is computed, weighted by sentence length.
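For concreteness, here is a minimal sketch of the two estimation and aggregation steps, under the assumption (ours, for the sketch) that extracted rules are hashable objects exposing their left-hand side nonterminal as lhs:

```python
from collections import Counter, defaultdict

def relative_frequency(extracted_rules):
    """P(rule) = count(rule) / count of all extracted rules with the same lhs."""
    counts = Counter(extracted_rules)
    lhs_totals = defaultdict(int)
    for rule, c in counts.items():
        lhs_totals[rule.lhs] += c
    return {rule: c / lhs_totals[rule.lhs] for rule, c in counts.items()}

def corpus_score(scores, lengths):
    """Average of per-sentence scores, weighted by sentence length."""
    return sum(s * n for s, n in zip(scores, lengths)) / sum(lengths)
```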
All algorithms are implemented in Python, and experiments are run on a server with two 2.6-GHz Intel Xeon E5-2630 v2 CPUs and 64 GB of RAM. Each experiment uses a single thread; the measured running times may be slightly distorted by the usual load jitter. For probabilistic LCFRS parsing we use two off-the-shelf systems: if the induced grammar’s first component is equivalent to a FA, then we use the OpenFST framework (Allauzen et al., 2007) with the Python bindings of Gorman (2016); otherwise, we use the LCFRS parser of Angelov and Ljunglöf (2014), which is part of the runtime system of the Grammatical Framework (Ranta, 2011).
7.1 Dependency Parsing
In our experiments on dependency parsing we use a corpus based on TIGER as provided in the 2006 CoNLL shared task (Buchholz and Marsi, 2006). The task specifies a split of TIGER into a training set (39,216 sentences) and a test set (357 sentences). Each sentence in the corpus consists of a sequence of tokens. A token has up to 10 fields, including the sentence position, form, lemma, part-of-speech (POS) tag, sentence position of the head, and dependency relation (DEPREL) to the head. In TIGER, 52 POS tags and 46 DEPRELs are used. We adopt the three evaluation metrics from the shared task, namely, the percentages of tokens for which a parser correctly predicts the head (the unlabeled attachment score, UAS), the DEPREL (the label accuracy, LA), or both head and DEPREL (the labeled attachment score, LAS). Punctuation is removed from both training and test sets, and is thereby ignored for the purposes of these metrics. For testing we restrict ourselves to the 281 sentences of up to 20 (non-punctuation) tokens.
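The three metrics are straightforward to compute once punctuation has been removed; the following sketch assumes (as a representation choice of ours) that gold and predicted analyses are given as aligned lists of (head, DEPREL) pairs:

```python
def attachment_scores(gold, predicted):
    """Token-level UAS, LAS, and LA for one sentence (punctuation removed)."""
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, predicted)) / n  # head correct
    las = sum(g == p for g, p in zip(gold, predicted)) / n        # head and DEPREL
    la = sum(g[1] == p[1] for g, p in zip(gold, predicted)) / n   # DEPREL correct
    return uas, las, la
```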
Our experiments with grammar induction are controlled by three parameters, each ranging over a number of values:
- The naming scheme for nonterminals can be (i) strict labeling or (ii) child labeling, as outlined in Section 6.8.
- In both naming schemes, sequences of terminal labels from the hybrid tree are composed into nonterminal labels. For each terminal, which is a CoNLL token, we include only particular fields, namely, (i) POS and DEPREL, (ii) POS, or (iii) DEPREL. We call this parameter the argument label.
- The considered methods to obtain recursive partitionings include (i) direct extraction (see Algorithm 3), (ii) transformation to fanout k (see Algorithm 4), (iii) right-branching, and (iv) left-branching; a minimal sketch of the two branching strategies follows this list.
Each choice of values for the parameters determines an experimental scenario. In particular, direct extraction in combination with strict labeling amounts to traditional LCFRS parsing.
We compare our induction framework with rparse (Maier and Kallmeyer 2010), which induces and trains unlexicalized, binarized, and Markovized LCFRS. As the parser for these LCFRS we choose the runtime system of the Grammatical Framework, because it is faster than rparse’s built-in parser (for a comparison, cf. Angelov and Ljunglöf 2014).
Additionally, we use MaltParser (Nivre, Hall, and Nilsson 2006) to compare our approach with a well-established transition-based parsing architecture. MaltParser is not state-of-the-art, but it allows us to disable features based on lemmas and word forms, matching our own implementation, which does not take lemmas or word forms as part of the input. We use the stacklazy strategy (Nivre, Kuhlmann, and Hall 2009) with the LibLinear classifier.
Experimental Results. Statistics on the induced hybrid grammars and parsing results are listed in Table 2. For the purpose of measuring UAS, LAS, and LA, we take a default dependency structure in the case of a parse failure; in this default structure, the head of the i-th word is the (i − 1)-th word, and the DEPREL fields are left empty. Figure 19 shows the distributions of the fanout of nonterminals in the LCFRS and of the numbers of arguments in the sDCP for the different recursive partitioning strategies, with child labeling and POS+DEPREL as argument labels fixed. In the following we discuss trends that can be observed in these data when one parameter is varied while the others are kept fixed.
Table 2. Statistics on the induced hybrid grammars and dependency parsing results for TIGER (CoNLL-X split).

| extraction | arg. lab. | nont. | rules | fmax | favg | fail | UAS | LAS | LA | time |
|---|---|---|---|---|---|---|---|---|---|---|
| **child labeling** |  |  |  |  |  |  |  |  |  |  |
| direct | POS+DEPREL | 4,043 | 61,923 | 4 | 1.06 | 68 | 68.0 | 59.5 | 63.2 | 97 |
| k = 1 | POS+DEPREL | 17,333 | 60,399 | 1 | 1.00 | 9 | 85.8 | 79.7 | 85.5 | 94 |
| k = 2 | POS+DEPREL | 10,777 | 49,448 | 2 | 1.18 | 11 | 85.7 | 79.7 | 85.4 | 103 |
| k = 3 | POS+DEPREL | 10,381 | 48,844 | 3 | 1.19 | 11 | 85.6 | 79.7 | 85.3 | 108 |
| r-branch | POS+DEPREL | 92,624 | 191,341 | 1 | 1.00 | 60 | 68.9 | 60.3 | 65.0 | 27 |
| l-branch | POS+DEPREL | 96,980 | 196,125 | 1 | 1.00 | 56 | 70.3 | 61.9 | 66.7 | 28 |
| direct | POS | 799 | 43,439 | 4 | 1.10 | 23 | 78.2 | 59.7 | 67.2 | 113 |
| k = 1 | POS | 6,060 | 28,880 | 1 | 1.00 | 4 | 83.2 | 65.3 | 73.2 | 70 |
| k = 2 | POS | 2,396 | 20,592 | 2 | 1.38 | 4 | 83.8 | 65.4 | 73.1 | 86 |
| k = 3 | POS | 2,121 | 20,100 | 3 | 1.44 | 4 | 83.4 | 65.2 | 73.1 | 88 |
| r-branch | POS | 47,661 | 123,367 | 1 | 1.00 | 23 | 77.9 | 59.8 | 68.1 | 53 |
| l-branch | POS | 49,203 | 125,406 | 1 | 1.00 | 26 | 78.5 | 60.2 | 67.5 | 51 |
| direct | DEPREL | 527 | 33,844 | 4 | 1.10 | 1 | 79.5 | 72.4 | 82.7 | 251 |
| k = 1 | DEPREL | 4,344 | 21,613 | 1 | 1.00 | 1 | 78.0 | 70.6 | 81.4 | 172 |
| k = 2 | DEPREL | 1,739 | 15,184 | 2 | 1.35 | 1 | 78.7 | 70.9 | 81.6 | 263 |
| r-branch | DEPREL | 40,239 | 99,113 | 1 | 1.00 | 1 | 77.2 | 69.1 | 80.7 | 94 |
| l-branch | DEPREL | 37,535 | 92,390 | 1 | 1.00 | 1 | 78.1 | 69.8 | 80.9 | 85 |
| **strict labeling** |  |  |  |  |  |  |  |  |  |  |
| direct | POS+DEPREL | 48,404 | 106,284 | 4 | 1.01 | 68 | 68.0 | 59.5 | 63.2 | 122 |
| k = 1 | POS+DEPREL | 103,425 | 162,124 | 1 | 1.00 | 57 | 72.2 | 64.2 | 68.3 | 185 |
| k = 2 | POS+DEPREL | 92,395 | 150,013 | 2 | 1.08 | 55 | 72.0 | 64.4 | 68.6 | 266 |
| k = 3 | POS+DEPREL | 91,412 | 149,106 | 3 | 1.08 | 55 | 72.0 | 64.4 | 68.6 | 240 |
| r-branch | POS+DEPREL | 251,536 | 338,294 | 1 | 1.00 | 141 | 44.6 | 31.4 | 32.7 | 54 |
| l-branch | POS+DEPREL | 264,190 | 349,299 | 1 | 1.00 | 137 | 45.3 | 32.6 | 34.2 | 53 |
| direct | POS | 29,165 | 80,695 | 4 | 1.00 | 23 | 77.6 | 59.6 | 67.5 | 120 |
| k = 1 | POS | 62,769 | 115,363 | 1 | 1.00 | 18 | 78.7 | 60.9 | 69.1 | 201 |
| k = 2 | POS | 54,082 | 104,390 | 2 | 1.09 | 18 | 79.5 | 61.3 | 69.4 | 237 |
| k = 3 | POS | 53,186 | 103,503 | 3 | 1.11 | 18 | 79.6 | 61.4 | 69.5 | 231 |
| r-branch | POS | 181,432 | 277,201 | 1 | 1.00 | 98 | 55.1 | 36.4 | 40.8 | 88 |
| l-branch | POS | 190,890 | 286,273 | 1 | 1.00 | 108 | 52.2 | 33.9 | 37.7 | 87 |
| direct | DEPREL | 17,047 | 53,342 | 4 | 1.00 | 3 | 82.4 | 76.5 | 84.6 | 178 |
| k = 1 | DEPREL | 37,333 | 71,423 | 1 | 1.00 | 1 | 83.2 | 78.0 | 85.8 | 188 |
| k = 2 | DEPREL | 31,956 | 63,487 | 2 | 1.08 | 2 | 83.0 | 77.5 | 85.5 | 231 |
| r-branch | DEPREL | 126,841 | 197,261 | 1 | 1.00 | 2 | 80.8 | 74.4 | 83.1 | 101 |
| l-branch | DEPREL | 124,722 | 192,922 | 1 | 1.00 | 2 | 81.2 | 74.8 | 83.5 | 96 |
| rparse (v = 1, h = 5) |  | 46,799 | 72,962 | 5 | 1.07 | 2 | 85.3 | 79.0 | 86.4 | 228 |
| MaltParser, unlexicalized, stacklazy |  |  |  |  |  | 0 | 88.2 | 83.7 | 88.7 | 2 |
Naming Scheme. As expected, child labeling leads to significantly fewer distinct nonterminals than strict labeling, by a factor between 2.5 and 37, depending on the other parameter values. This makes the tree language less refined, and one might therefore expect generally lower scores. However, in many cases where strict labeling suffers from a higher proportion of parse failures (so that the parser must fall back on the default structure), child labeling in fact has the higher scores.
Argument Labels. Including only POS tags in nonterminal labels generally leads to high UAS but low LA. Including DEPRELs but not POS tags yields higher LA and LAS in all cases. For child labeling, this comes at the expense of a lower UAS in all cases except one. For strict labeling, UAS is higher in all cases. There are also fewer parse failures, which may be because there are fewer DEPRELs than POS tags.
With the combination of POS tags and DEPRELs, the tree language is described most accurately, and indeed this achieves some of the highest UAS and LAS in the case of child labeling. However, the scores can be low in the presence of many parse failures, owing to the fall-back on the default structure. This holds in particular for strict labeling.
Recursive Partitionings. Concerning the choice of the method to obtain recursive partitionings, the baseline is direct extraction, which produces an unrestricted LCFRS in the first component. For instance, for child labeling and POS+DEPREL as argument labels, the induced LCFRS has fanout 4, and about 88% of the rules have three or more nonterminals on the right-hand side. In total there are 4,043 nonterminals, the majority of which have fanout 1. Reducing the fanout to 1 ≤ k ≤ 3 leads to a binarized grammar with smaller fanout, as desired, but the number of nonterminals is roughly quadrupled. By transforming a recursive partitioning with parameter k ≥ 2, the average fanout favg of nonterminals may in fact increase, which is somewhat counter-intuitive. Whereas the distribution of nonterminals with fanout 1, 2, 3 is 94.8%, 4.2%, 0.9% in the case of direct extraction, it changes to 83.4%, 14.2%, 2.5% for the transformation with k = 2. We suspect the increase is due to the high fanout of newly introduced nodes (see line 8 of Algorithm 4).
We can further see in Figure 19 that the number of arguments in the sDCP increases significantly once the fanout is restricted to k = 1. The binarization involved in reducing the fanout to some value of k seems to improve the ability of the grammar to generalize over the training data: both the number of parse failures drops and the scores increase. The exact choice of k has little impact on the scores, however.
In most cases, the measured parse times increase for higher values of k. In some cases, however, the parse times for direct extraction are lower still. This may be explained by differences in average fanout and in grammar size.
The left-branching and right-branching recursive partitionings lead to many specialized nonterminal symbols (and rules). In comparison with the partitioning strategies discussed earlier, a nonterminal can have up to 10 sDCP arguments, the average lying between 3 and 4. The scores are often worse and there tend to be more parse failures. In the case of child labeling with DEPREL as argument label, however, the scores are very similar. Left-branching seems to lead to slightly higher scores than right-branching, except for POS as argument label. With left-branching and right-branching recursive partitionings, parsing tends to be faster than with the other recursive partitionings. However, in many cases we do not observe the predicted asymptotic differences, which may be due to the larger grammar sizes.
Overall, child labeling with POS+DEPREL and k = 1 and strict labeling with DEPREL and k = 1 turned out to be the best choices for maximizing UAS/LAS and LA, respectively. The former scenario slightly outperforms rparse with respect to UAS and LAS and parse time whereas rparse obtains better LA. Still, MaltParser outperforms these experimental scenarios with respect to both the obtained scores and the measured parse times. Our restriction to sentence length 20, which we mentioned earlier, was motivated by the excessive computational costs for instances with high (average) fanout.
The most suitable choices for naming scheme, argument labeling, and recursive partitioning may differ between languages and between annotation schemes. For instance, for the NEGRA benchmark (Skut et al., 1997) used by Maier and Kallmeyer (2010), we obtain the results in Table 3, which suggest that the combination of child labeling, DEPREL, and k = 2 is the most suitable choice if LAS is the target score. Scenarios with POS+DEPREL give lower scores than DEPREL, mainly because of the many parse failures, even though such argument labeling can be expected to yield more refined tree languages. For this reason, we also consider a cascade of three scenarios with child labeling, k = 1, and argument labels set to POS+DEPREL, POS, and DEPREL, respectively, where in the case of a parse failure we fall back to the next scenario. This cascade achieves better scores than any individual experimental scenario, and the additional parse time relative to POS+DEPREL alone is small. Both the single best scenario and the cascade perform better than the baseline LCFRS (rparse simple) induced with the algorithm of Kuhlmann and Satta (2009), as well as the LCFRS obtained from this baseline by binarization, vertical Markovization v = 1, and horizontal Markovization h = 3. Again, hybrid grammars fall short of MaltParser. We do not include results for NEGRA with the strict labeling strategy, because the numbers of parse failures are too high and, consequently, the accuracies too low to be of interest.
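Operationally, the cascade is a simple wrapper around the individual scenarios. The sketch below assumes a function parse(grammar, sentence) that returns None on parse failure and a function default_structure building the fall-back dependency structure described earlier; both names are hypothetical.

```python
def cascade_parse(sentence, grammars, parse, default_structure):
    """Try each grammar in order; fall back to the default structure on failure."""
    for grammar in grammars:  # e.g. [g_pos_deprel, g_pos, g_deprel]
        result = parse(grammar, sentence)
        if result is not None:
            return result
    return default_structure(sentence)
```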
Table 3. Statistics on the induced hybrid grammars and dependency parsing results for NEGRA (child labeling).

| extraction | arg. lab. | nont. | rules | fmax | favg | fail |  |  | UAS | LAS | LA | time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **child labeling** |  |  |  |  |  |  |  |  |  |  |  |  |
| direct | P+D | 4,739 | 27,042 | 7 | 1.10 | 693 | 51.7 | 40.7 | 52.2 | 40.8 | 42.6 | 253 |
| k = 1 | P+D | 13,178 | 35,071 | 1 | 1.00 | 202 | 75.9 | 68.7 | 77.0 | 69.1 | 73.3 | 288 |
| k = 2 | P+D | 11,156 | 32,231 | 2 | 1.17 | 195 | 76.5 | 69.6 | 77.7 | 70.1 | 74.2 | 355 |
| r-branch | P+D | 42,577 | 79,648 | 1 | 1.00 | 775 | 45.5 | 33.1 | 45.7 | 32.8 | 34.7 | 49 |
| l-branch | P+D | 40,100 | 75,321 | 1 | 1.00 | 768 | 45.8 | 33.4 | 46.0 | 33.2 | 35.0 | 45 |
| direct | POS | 675 | 19,276 | 7 | 1.24 | 303 | 68.7 | 51.5 | 69.3 | 50.0 | 55.4 | 300 |
| k = 1 | POS | 3,464 | 15,826 | 1 | 1.00 | 30 | 81.7 | 65.5 | 82.5 | 63.5 | 70.6 | 244 |
| k = 2 | POS | 2,099 | 13,347 | 2 | 1.40 | 35 | 81.6 | 65.2 | 82.4 | 63.3 | 70.5 | 410 |
| r-branch | POS | 19,804 | 51,733 | 1 | 1.00 | 372 | 62.7 | 46.4 | 62.7 | 44.7 | 50.6 | 222 |
| l-branch | POS | 17,240 | 45,883 | 1 | 1.00 | 342 | 63.7 | 47.4 | 63.9 | 45.6 | 51.4 | 197 |
| direct | DEP | 2,505 | 19,511 | 7 | 1.13 | 3 | 78.5 | 72.2 | 78.9 | 71.6 | 78.6 | 484 |
| k = 1 | DEP | 8,059 | 22,613 | 1 | 1.00 | 1 | 78.5 | 71.7 | 79.5 | 71.7 | 79.0 | 608 |
| k = 2 | DEP | 6,651 | 20,314 | 2 | 1.20 | 1 | 78.7 | 72.1 | 79.8 | 72.0 | 79.2 | 971 |
| k = 3 | DEP | 6,438 | 19,962 | 3 | 1.25 | 1 | 78.6 | 72.0 | 79.5 | 71.9 | 79.1 | 1,013 |
| r-branch | DEP | 27,653 | 54,360 | 1 | 1.00 | 2 | 76.0 | 68.4 | 76.3 | 67.5 | 76.1 | 216 |
| l-branch | DEP | 25,699 | 50,418 | 1 | 1.00 | 1 | 75.8 | 68.4 | 76.2 | 67.6 | 76.1 | 198 |
| cascade: child labeling, k = 1, P+D/POS/DEP |  |  |  |  |  | 1 | 83.2 | 76.2 | 84.3 | 76.1 | 81.6 | 325 |
| LCFRS (Maier and Kallmeyer, 2010) |  |  |  |  |  | - | 79.0 | 71.8 | - | - | - | - |
| rparse simple |  | 920 | 18,587 | 7 | 1.37 | 56 | 77.1 | 70.6 | 77.3 | 70.0 | 76.2 | 350 |
| rparse (v = 1, h = 3) |  | 40,141 | 61,450 | 7 | 1.10 | 13 | 78.4 | 72.2 | 78.5 | 71.4 | 79.0 | 778 |
| MaltParser, unlexicalized, stacklazy |  |  |  |  |  | 0 | 85.0 | 80.2 | 85.6 | 80.0 | 85.0 | 24 |
Note that the hybrid grammar obtained with child labeling, DEPREL, and direct extraction should, in principle, be similar to the baseline LCFRS. Indeed, both grammars have the same fanout and similar numbers of rules. The differences in size can be explained by the separation of terminal-generating and structural rules during hybrid grammar induction, which is not present in the induction algorithm of Kuhlmann and Satta (2009). This separation also leads to different generalization over the training data, as indicated by the higher accuracy and the lower numbers of parse failures for the hybrid grammar.
One possible refinement of this result is to choose different argument labels for inherited and synthesized arguments. One may also use form or lemma next to, or instead of, POS tags and DEPRELs, possibly in combination with smoothing techniques to handle unknown words. Another subject for future investigation is the use of splitting and merging (Petrov et al., 2006) to determine parts of nonterminal names. New recursive partitioning strategies may be developed, and one could consider blending grammars that were induced using different recursive partitioning strategies.
7.2 Constituent Parsing
The experiments for constituent parsing are carried out as in Nederhof and Vogler (2014), but on a larger portion of the TIGER corpus: we now use the first 40,000 sentences for training (omitting 10 in which a single tree does not span the entire sentence); from the remaining 10,474 sentences we remove those longer than 20 words, leaving 7,597 sentences for testing. The results are displayed in Table 4 and support conclusions similar to those of Nederhof and Vogler (2014). Note that for direct extraction both labeling strategies yield the same grammar; thus, only one entry is shown. Unsurprisingly, the larger training set leads to smaller proportions of parse failures and to improved F-measure. Another consequence is that the more fine-grained strict labeling now outperforms child labeling, except in the case of right-branching and left-branching, where the numbers of parse failures push the F-measure down.
Table 4. Constituent parsing results for TIGER.

| extraction | nont. | rules | fail | R | P | F1 | # gaps | time |
|---|---|---|---|---|---|---|---|---|
| **strict labeling** |  |  |  |  |  |  |  |  |
| direct | 104 | 31,656 | 10 | 77.5 | 77.7 | 76.9 | 0.0139 | 2,098 |
| k = 1 | 26,936 | 62,221 | 11 | 77.1 | 77.1 | 76.4 | 0.0136 | 2,514 |
| k = 2 | 19,387 | 51,414 | 9 | 77.5 | 77.8 | 76.9 | 0.0136 | 2,892 |
| k = 3 | 18,685 | 50,678 | 9 | 77.5 | 77.8 | 76.9 | 0.0135 | 2,886 |
| r-branch | 164,842 | 248,063 | 658 | 61.2 | 58.2 | 59.1 | 0.0135 | 1,341 |
| l-branch | 284,816 | 348,536 | 3,662 | 37.7 | 34.8 | 35.7 | 0.0131 | 2,143 |
| **child labeling** |  |  |  |  |  |  |  |  |
| k = 1 | 2,117 | 13,877 | 1 | 75.3 | 74.9 | 74.5 | 0.0140 | 1,196 |
| k = 2 | 473 | 8,078 | 1 | 75.6 | 75.4 | 74.9 | 0.0144 | 1,576 |
| k = 3 | 176 | 7,352 | 1 | 75.7 | 75.4 | 74.9 | 0.0144 | 1,667 |
| r-branch | 27,222 | 83,035 | 36 | 75.0 | 74.4 | 74.1 | 0.0148 | 502 |
| l-branch | 87,171 | 162,645 | 137 | 74.5 | 73.9 | 73.6 | 0.0146 | 702 |
Note that the average numbers of gaps per constituent are very similar for the different recursive partitionings. One may observe once more that the parse times of right-branching and left-branching recursive partitionings are higher than one might expect from the asymptotic time complexities. This is again due to the considerable sizes of the grammars.
8. Related Work
Two kinds of discriminative models of non-projective dependency parsing have been intensively studied. One is based on an algorithm for finding the maximum spanning tree (MST) of a weighted directed graph, where the vertices are the words of a sentence, and each edge represents a potential dependency relation (McDonald et al., 2005). In principle, any non-projective structure can be obtained, depending on the weights of the edges. A disadvantage of MST dependency parsing is that it is difficult to put any constraints on the desirable structures beyond the local constraints encoded in the edge weights (McDonald and Pereira, 2006).
The second kind of model involves stack-based transition systems, with an added transition that swaps stack elements. In particular, Nivre (2009) introduced a deterministic system that uses a classifier to determine the next transition to be applied. The worst-case time complexity is quadratic, and the expected complexity is linear. The classifier relies on features that look at neighboring words in the sentence, as well as at vertical and horizontal context in the syntactic tree. Advances in learning the relevant features are due to Chen and Manning (2014).
In this article we have assumed a generative model of hybrid grammars, which differs from deterministic, stack-based models in at least two ways, one of which is superficial whereas the other is more fundamental. The superficial difference is the presence of nonterminals in hybrid grammars. These, however, fulfill a role that is comparable to that of sets of features used by classifiers. We conjecture that machine learning techniques could even be introduced to create more refined nonterminals for hybrid grammars. The more fundamental difference lies in the determinism of the discussed stack-based models, which is difficult to realize for hybrid grammars, except perhaps in the case of left-branching and right-branching recursive partitionings. Without determinism, the time complexity grows with the fanout that we allow for the left components of hybrid grammars. What we get in return for the higher running time are more powerful models for parsing the input.
As in the case of MST dependency parsing, the discussed stack-based transition systems can in principle produce any non-projective structure. The non-projectivity allowed by hybrid grammars is determined by the hybrid rules, which in turn are determined by non-projectivity that occurs in the training data. This means that there is no restriction per se on non-projectivity in structures produced by a parser for test data, provided the training data contains an adequate amount of non-projectivity. This even holds if we restrict the fanout of the first component, although we may then need more training data to obtain the same coverage and accuracy.
Many established algorithms for constituent parsing rely on generative models and grammar induction (Collins, 1997; Charniak, 2000; Klein and Manning, 2003; Petrov et al., 2006). Most of these are unable to produce discontinuous structures. A notable exception is Maier and Søgaard (2008), where the induced grammar is a LCFRS. Such a grammar can be seen as a special case of a LCFRS/sDCP hybrid grammar, with the restriction that each nonterminal has a single synthesized argument. This restriction limits the power of hybrid grammars. In particular, discontinuous structures can now only be produced if the fanout is strictly greater than 1, which also implies the time complexity is more than cubic.
Generative models proposed for dependency parsing have often been limited to projective structures, based on either context-free grammars (Eisner, 1996; Klein and Manning, 2004) or tree substitution grammars (Blunsom and Cohn, 2010). Exceptions are recent models based on LCFRS (Maier and Kallmeyer, 2010; Kuhlmann, 2013). As in the case of constituent parsing, these can be seen as restricted LCFRS/sDCP hybrid grammars.
A related approach is from Satta and Kuhlmann (2013). Although it does not use an explicit grammar, there is a clear link to mildly context-sensitive grammar formalisms, in particular lexicalized TAG, following from the work of Bodirsky, Kuhlmann, and Möhl (2005).
The flexibility of generative models has been demonstrated by a large body of literature. Applications and extensions of generative models include syntax-based machine translation (Charniak, Knight, and Yamada, 2003), discriminative reranking (Collins, 2000), and involvement of discriminative training criteria (Henderson, 2004). It is very likely that these apply to hybrid grammars as well. Further discussion is outside the scope of this article.
Appendix A. Single-Synthesized sDCPs Have the Same s-term Generating Power as sCFTGs
The proof of Theorem 1 is as follows.
Proof. Let G be a single-synthesized sDCP. We construct a sCFTG G′ such that [G] = [G′]. For each nonterminal A from G, there will be one nonterminal A′ in G′, with rk(A′) = i-rk(A).
There is an inverse transformation from a sCFTG to a sDCP with s-rk(A) = 1 for each A. This is as straightforward as the transformation above. In fact, one can think of sCFTGs and sDCPs with s-rank restricted to 1 as syntactic variants of one another.
A more extensive treatment of a very closely related result is given by Mönnich (2010), who uses a special form of attributed tree transducers in place of sDCPs. That result for tree languages was inspired by an earlier result of Duske et al. (1977) for string languages, involving (non-simple) macro grammars (with IO derivation) and simple-L-attributed grammars. For arbitrary s-rank, a related result is that attributed tree transductions are equivalent to attribute-like macro tree transducers, as shown by Fülöp and Vogler (1999).
Appendix B. The Class of s-term Languages Induced by sDCP Is Strictly Larger than that Induced by sCFTG
In Section 5 of Engelfriet and Filè (1981) it was proved that [G], viewed as a language of binary trees, cannot be induced by a 1S-AG, that is, an attribute grammar with one synthesized attribute (and any number of inherited attributes). Although a single-synthesized sDCP can generate s-terms (and not only trees, as a 1S-AG can), it is easy to see that this extra power does not help to induce [G]. Thus we conclude that there is no single-synthesized sDCP that induces [G].
Appendix C. sDCP Have the Same String Generating Power as LCFRS
The proof of Theorem 2 is as follows.
Proof. We show that for each sDCP G1 there is a LCFRS G2 such that [G2] = pre([G1]), where pre was extended from s-terms to sets of s-terms in the obvious way. Our construction first produces sDCP G′1 from G1 by replacing every s-term s in every rule of G1 by pre(s). It is not difficult to see that [G′1] =pre([G1]).
Next, for every nonterminal A in G′1, with i-rk(A) = k′ and s-rk(A) = k, we introduce nonterminals of the form A(g), where g is a mapping from [k] to sequences of numbers in [k′], such that each j ∈ [k′] occurs precisely once in g(1) ⋅ … ⋅ g(k). The intuition is that if g(i) = j1 ⋯ jr, then a value appearing as the jq-th inherited argument of A reappears as part of the i-th synthesized argument, and it is the q-th inherited argument to do so. For instance, if k = 2, k′ = 3, g(1) = 2, and g(2) = 1 3, then the second inherited argument resurfaces in the first synthesized argument, while the first and third inherited arguments resurface, in that order, in the second synthesized argument. (This concept is similar to the argument selector in Courcelle and Franchi-Zannettacci [1982, page 175].) For each A, only those functions g are considered that are consistent with at least one subderivation of G′1.
When a nonterminal Am, occurring as the m-th member in the right-hand side of a rule of the form of Equation (3), is replaced by Am(gm), a variable appearing as the i-th synthesized argument is split up into several new variables. These variables are drawn from X so as not to clash with variables used before. The terms of the inherited arguments whose indices are listed by gm(i) are shifted to the left-hand side of the new rule to be constructed, interspersed with the new variables. This shifting of terms may happen along dependencies between inherited and synthesized arguments of other members in the right-hand side. (This shifting is very similar to the one used to obtain the rules of an IO-macro grammar from a simple L-attributed grammar; Duske et al., 1977, Def. 6.1.)
Then the following two statements are equivalent:

1. …, and g is the argument selector function of this derivation.
2. …
The grammar G″1 constructed here has no inherited arguments and is almost the required LCFRS G2. To follow our definitions precisely, it remains to consistently rename the variables in each rule so that they are drawn from the set Xm, for some m. Furthermore, the symbols from Σ now need to be explicitly assigned rank 0.
For the converse direction of the theorem, consider a LCFRS G. We construct a sDCP G′ by taking the same nonterminals as those of G, with the same ranks. Each argument is synthesized. To obtain the rules of G′, we replace each term a() by the term a(〈〉).
Acknowledgments
We are grateful to the reviewers for their constructive criticism and encouragement. We also thank Markus Teichmann for helpful discussions. The third author was financially supported by the Deutsche Forschungsgemeinschaft through project DFG VO 1011/8-1.
Notes
The term “hybrid tree” was used before by Lu et al. (2008), also for a mixture of a tree structure and a linear structure, generated by a probabilistic model. However, the linear “surface” structure was obtained by a simple left-to-right tree traversal, whereas a meaning representation was obtained by a slightly more flexible traversal of the same tree. The emphasis in the current article is rather on separating the linear structure from the tree structure. Note that similar distinctions of multiple strata were made before in both constituent linguistics (see, e.g., Chomsky 1981) and dependency linguistics (see, e.g., Mel’čuk 1988).
Here the term “simple” is used for definite clause programs with a meaning analogous to that of the term for MGs and CFTGs. This is more restrictive than the identically named notion in Deransart and Małuszynski (1985).
See Seki et al. (1991) for an even stronger normal form.
References
Author notes
School of Computer Science, University of St. Andrews, North Haugh, St. Andrews, KY16 9SX, UK.