## Abstract

We explore the concept of hybrid grammars, which formalize and generalize a range of existing frameworks for dealing with discontinuous syntactic structures. Covered are both discontinuous phrase structures and non-projective dependency structures. Technically, hybrid grammars are related to synchronous grammars, where one grammar component generates linear structures and another generates hierarchical structures. By coupling lexical elements of both components together, discontinuous structures result. Several types of hybrid grammars are characterized. We also discuss grammar induction from treebanks. The main advantage over existing frameworks is the ability of hybrid grammars to separate discontinuity of the desired structures from time complexity of parsing. This permits exploration of a large variety of parsing algorithms for discontinuous structures, with different properties. This is confirmed by the reported experimental results, which show a wide variety of running time, accuracy, and frequency of parse failures.

## 1. Introduction

Much of the theory of parsing assumes syntactic structures that are trees, formalized such that the children of each node are ordered, and the yield of a tree, that is, the leaves read from left to right, is the sentence. In different terms, each node in the hierarchical syntactic structure of a sentence corresponds to a phrase that is a list of adjacent words, without any gaps. Such a structure is easy to represent in terms of bracketed notation, which is used, for instance, in the Penn Treebank (Marcus, Santorini, and Marcinkiewicz, 1993).

Describing syntax in terms of such narrowly defined trees seems most appropriate for relatively rigid word-order languages such as English. Nonetheless, the aforementioned Penn Treebank of English contains traces and other elements that encode additional structure next to the pure tree structure as indicated by the brackets. This is in keeping with observations that even English cannot be described adequately without a more general form of trees, allowing for so-called **discontinuity** (McCawley, 1982; Stucky, 1987). In a discontinuous structure, the set of leaves dominated by a node of the tree need not form a contiguous sequence of words, but may comprise one or more gaps. The need for discontinuous structures tends to be even greater for languages with relatively free word order (Kathol and Pollard 1995; Müller 2004).

In the context of dependency parsing (Kübler, McDonald, and Nivre 2009), the more specific term **non-projectivity** is used instead of, or next to, *discontinuity*. See Rambow (2010) for a discussion of the relation between constituent and dependency structures and see Maier and Lichte (2009) for a comparison of discontinuity and non-projectivity. As shown by, for example, Hockenmaier and Steedman (2007) and Evang and Kallmeyer (2011), discontinuity encoded using traces in the Penn Treebank can be rendered in alternative, and arguably more explicit, forms. In many modern treebanks, discontinuous structures have been given a prominent status (e.g., Böhmová et al. 2000). Figure 1 shows an example of a non-projective dependency structure.

The most established parsing algorithms are compiled out of **context-free grammars** (CFGs), or closely related formalisms such as tree substitution grammars (Sima’an et al. 1994) or regular tree grammars (Brainerd 1969; Gécseg and Steinby 1997). These parsers, which have a time complexity of $O(n^3)$, where *n* is the length of the input string, operate by composing adjacent substrings of the input sentence into longer substrings. As a result, the structures they can build directly do not involve any discontinuity. The need for discontinuous syntactic structures thus poses a challenge to traditional parsing algorithms.

One possible solution is commonly referred to as **pseudo-projectivity** in the literature on dependency parsing (Kahane, Nasr, and Rambow, 1998; Nivre and Nilsson, 2005; McDonald and Pereira, 2006). A standard parsing system is trained on a corpus of projective dependency structures that was obtained by applying a **lifting** operation to non-projective structures. In a first pass, this system is applied to unlabeled sentences and produces projective dependencies. In a second pass, the lifting operation is reversed to introduce non-projectivity. A related idea for discontinuous phrase structures is the reversible splitting conversion of Boyd (2007). See also Johnson (2002), Campbell (2004), and Gabbard, Kulick, and Marcus (2006).

The two passes of pseudo-projective dependency parsing need not be strictly separated in time. For example, one way to characterize the algorithm by Nivre (2009) is that it combines the first pass with the second. Here the usual one-way input tape is replaced by a buffer. A non-topmost element from the parsing stack, which holds a word previously read from the input sentence, can be transferred back to the buffer, and thereby input positions can be effectively swapped. This then results in a non-projective dependency structure.

A second potential solution to obtain syntactic structures that go beyond context-free power is to use more expressive grammatical formalisms. One approach proposed by Reape (1989, 1994) is to separate linear order from the parent–child relation in syntactic structure, and to allow shuffling of the order of descendants of a node, which need not be its direct children. The set of possible orders is restricted by linear precedence constraints. A further restriction may be imposed by compaction (Kathol and Pollard, 1995). As discussed by Fouvry and Meurers (2000) and Daniels and Meurers (2002), this may lead to exponential parsing complexity; see also Daniels and Meurers (2004). Separating linear order from the parent–child relation is in the tradition of head-driven phrase structure grammar (HPSG), where grammars are commonly hand-written. This differs from our objectives to induce grammars automatically from training data, as will become clear in the following sections.

To stay within a polynomial time complexity, one may also consider **tree adjoining grammars** (TAGs), which can describe strictly larger classes of word order phenomena than CFGs (Rambow and Joshi, 1997). The resulting parsers have a time complexity of $O(n^6)$ (Vijay-Shanker and Joshi, 1985). However, the derived trees they generate are still continuous. Although their derivation trees may be argued to be discontinuous, these by themselves are not normally the desired syntactic structures. Moreover, it was argued by Becker, Joshi, and Rambow (1991) that further additions to TAGs are needed to obtain adequate descriptions of certain non-context-free phenomena. These additions further increase the time complexity.

In order to obtain desired syntactic structures, one may combine TAG parsing with an idea that is related to that of pseudo-projectivity. For example, Kallmeyer and Kuhlmann (2012) propose a transformation that turns a derivation tree of a (lexicalized) TAG into a non-projective dependency structure. The same idea has been applied to derivation trees of other formalisms, in particular (lexicalized) **linear context-free rewriting systems** (LCFRSs) (Kuhlmann, 2013), whose weak generative power subsumes that of TAGs.

Parsers more powerful than those for CFGs often incur high time costs. In particular, LCFRS parsers have a time complexity that is polynomial in the sentence length, but with a degree that is determined by properties of the grammar. This degree typically increases with the amount of discontinuity in the desired structures. Difficulties in running LCFRS parsers for natural languages are described, for example, by Kallmeyer and Maier (2013).

In the architectures we have discussed, the common elements are:

- a grammar, in some fixed formalism, that determines the set of sentences that are accepted, and
- a procedure to build (discontinuous) structures, guided by the derivation of input sentences.

These two elements are combined in the central concept of this article, the **hybrid grammar**, introduced in Nederhof and Vogler (2014). Such a grammar consists of a string grammar and a tree grammar. Derivations are coupled, as in synchronous grammars (Shieber and Schabes, 1990; Satta and Peserico, 2005). In addition, each occurrence of a terminal symbol in the string grammar is coupled to an occurrence of a terminal symbol in the tree grammar. The string grammar defines the set of accepted sentences. The tree grammar, whose rules are tied to the rules of the string grammar for synchronous rewriting, determines the resulting syntactic structures, which may be discontinuous. The way the syntactic structures are obtained critically relies on the coupling of terminal symbols in the two component grammars, which is where our theory departs from that of synchronous grammars. A hybrid grammar generates a set of **hybrid trees**.^{1} Figure 2 shows an example of a hybrid tree, which corresponds to the non-projective dependency structure of Figure 1.

The general concept of hybrid grammars leaves open the choice of the string grammar formalism and that of the tree grammar formalism. In this article we consider simple macro grammars (Fischer, 1968) and LCFRSs as string grammar formalisms. The tree grammar formalisms we consider are simple context-free tree grammars (Rounds, 1970) and **simple definite clause programs** (sDCP), inspired by Deransart and Małuszynski (1985). This gives four combinations, each leading to one class of hybrid grammars. In addition, more fine-grained subclasses can be defined by placing further syntactic restrictions on the string and tree formalisms.

To place hybrid grammars in the context of existing parsing architectures, let us consider classical grammar induction from a treebank, for example, for context-free grammars (Charniak, 1996) or for LCFRSs (Maier and Søgaard, 2008; Kuhlmann and Satta, 2009). Rules are extracted directly from the trees in the training set, and unseen input strings are consequently parsed according to these structures. Grammars induced in this way can be seen as restricted hybrid grammars, in which no freedom exists in the relation between the string component and the tree component. In particular, the presence of discontinuous structures generally leads to high time complexity of string parsing. In contrast, the framework in this article detaches the string component of the grammar from the tree component. Thereby the parsing process of input strings is no longer bound to follow the tree structures, while the same tree structures as before can still be produced, provided the tree component is suitably chosen. This allows string parsing with low time complexity in combination with production of discontinuous trees.

Nederhof and Vogler (2014) presented experiments with various subclasses of hybrid grammars for the purpose of constituent parsing. Trade-offs between speed and accuracy were identified. In the present article, we extend our investigation to dependency parsing. This includes induction of a hybrid grammar from a dependency treebank. Before turning to the experiments, we present several completeness results about existence of hybrid grammars generating non-projective dependency structures.

This article is organized as follows. After preliminaries in Section 2, Section 3 defines hybrid trees. These are able to capture both discontinuous phrase structures and non-projective dependency structures. Thanks to the concept of hybrid trees, there will be no need, later in the article, to distinguish between hybrid grammars for constituent parsing and hybrid grammars for dependency parsing. To make this article self-contained, we define two existing string grammar formalisms and two tree grammar formalisms in Section 4. The four classes of hybrid grammars that result by combining these formalisms are presented in Section 5, which also discusses how to use them for parsing.

How to induce hybrid grammars from treebanks is discussed in Section 6. Section 7 reports on experiments that provide proof of concept. In particular, LCFRS/sDCP-hybrid grammars are induced from corpora of dependency structures and phrase structures and employed to predict the syntactic structure of unlabeled sentences. It is demonstrated that hybrid grammars allow a wide variety of results, in terms of time complexity, accuracy, and frequency of parse failures. How hybrid grammars relate to existing ideas is discussed in Section 8.

The text refers to a number of theoretical results that are not central to the main content of this article. In order to preserve the continuity of the discussion, we have deferred their proofs to appendices.

## 2. Preliminaries

Let ℕ = {0,1,2, …} and ℕ_{ +} = ℕ ∖{0}. For each *n* ∈ ℕ_{ +}, we let [*n*] stand for the set {1, … , *n*}, and we let [0] stand for ∅. We write [*n*]_{0} to denote [*n*] ∪{0}. We fix an infinite list *x*_{1},*x*_{2}, … of pairwise distinct **variables**. We let *X* = {*x*_{1},*x*_{2},*x*_{3}, …} and *X*_{k} = {*x*_{1}, … , *x*_{k}} for each *k* ∈ ℕ. For any set *A*, the power set of *A* is denoted by $P(A)$.

A **ranked set** Δ is a set of symbols associated with a rank function assigning a number rk_{Δ}(δ) ∈ ℕ to each symbol δ ∈ Δ. A **ranked alphabet** is a ranked set with a finite number of symbols. We let Δ^{(k)} denote {δ ∈ Δ∣rk_{Δ}(δ) = *k*}.

The following definitions were inspired by Seki and Kato (2008). The sets of **terms** and **sequence-terms** (**s-terms**) over ranked set Δ, with variables in some set *Y* ⊆ *X*, are denoted by *T*_{Δ}(*Y*) and $T_\Delta^*(Y)$, respectively, and defined inductively as follows:

- *Y* ⊆ *T*_{Δ}(*Y*),
- if *k* ∈ ℕ, δ ∈ Δ^{(k)}, and $s_i \in T_\Delta^*(Y)$ for each *i* ∈ [*k*], then δ(*s*_{1}, … , *s*_{k}) ∈ *T*_{Δ}(*Y*), and
- if *n* ∈ ℕ and *t*_{i} ∈ *T*_{Δ}(*Y*) for each *i* ∈ [*n*], then $\langle t_1, \ldots, t_n \rangle \in T_\Delta^*(Y)$.

We let $T_\Delta^*$ and *T*_{Δ} stand for $T_\Delta^*(\emptyset)$ and *T*_{Δ}(∅), respectively. Throughout this article, we use variables such as *s* and *s*_{i} for s-terms and variables such as *t* and *t*_{i} for terms. The **length** |*s*| of a s-term *s* = 〈*t*_{1}, … , *t*_{n}〉 is *n*.

The justification for using s-terms as defined here is that they provide the required flexibility for dealing with both strings and unranked trees, in combination with derivational nonterminals in various kinds of grammar. By using an alphabet Δ = Δ^{(0)} one can represent strings. For instance, if Δ contains the symbols *a* and *b*, then the s-term 〈*a*(),*b*(),*a*()〉 denotes the string *aba*. We will therefore refer to such s-terms simply as strings.

By using an alphabet Δ =Δ^{(1)} one may represent trees without fixing the number of child nodes that a node with a certain label should have. Conceptually, one may think of such node labels as unranked, as is common in parsing theory of natural language. For instance, the s-term 〈*a*(〈*b*(〈〉),*a*(〈〉)〉)〉 denotes the unranked tree *a*(*b*,*a*). We will therefore refer to such s-terms simply as trees, and we will sometimes use familiar terminology, such as “node,” “parent,” and “sibling,” as well as common graphical representations of trees.

If theoretical frameworks require trees over ranked alphabets in the conventional sense (without s-terms), one may introduce a distinguished symbol *cons* of rank 2, replacing each s-term of length greater than 1 by an arrangement of subterms combined using occurrences of that symbol. Another symbol *nil* of rank 0 may be introduced to replace each s-term of length 0. Hence δ(〈α(〈〉),β(〈〉),γ(〈〉)〉) could be more conventionally written as δ(*cons*(α(*nil*),*cons*(β(*nil*),γ(*nil*)))).
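As an illustration, this conversion can be sketched in a few lines, representing a term as a pair of a label and its list of argument s-terms, and a s-term as a list of terms (this encoding and the function names are our own, chosen only for the sketch):

```python
# A term is (label, [s-term, ...]); a s-term is a list of terms.
# Convert s-terms into ranked form using the distinguished symbols
# 'cons' (rank 2) and 'nil' (rank 0): an empty s-term becomes nil,
# a singleton s-term becomes its single element, and a longer s-term
# becomes a right-nested arrangement of cons, as in the text above.

def to_ranked_term(t):
    label, args = t
    return (label, [to_ranked_sterm(s) for s in args])

def to_ranked_sterm(s):
    if len(s) == 0:
        return ("nil", [])
    if len(s) == 1:
        return to_ranked_term(s[0])
    return ("cons", [to_ranked_term(s[0]), to_ranked_sterm(s[1:])])

# delta(< alpha(<>), beta(<>), gamma(<>) >)
example = ("delta", [[("alpha", [[]]), ("beta", [[]]), ("gamma", [[]])]])
print(to_ranked_term(example))
```

Running this on the example yields delta(cons(alpha(nil), cons(beta(nil), gamma(nil)))), matching the conventional notation above.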

Concatenation of s-terms is given by 〈*t*_{1}, … , *t*_{n}〉⋅〈*t*_{n+1}, … , *t*_{n+m}〉 = 〈*t*_{1}, … , *t*_{n +m}〉. Sequences such as *s*_{1}, … , *s*_{k} or *x*_{1}, … , *x*_{k} will typically be abbreviated to *s*_{1,k} or *x*_{1,k}, respectively. For δ ∈ Δ^{(0)} we sometimes abbreviate δ() to δ.

In examples we also abbreviate 〈*t*_{1}, … , *t*_{n}〉 to *t*_{1}⋯*t*_{n}—that is, omitting the angle brackets and commas. In particular, for *n* = 0, the s-term 〈〉 is abbreviated by ε. Moreover, we sometimes abbreviate δ(〈〉) to δ. Whether δ then stands for δ(〈〉) or for δ() depends on whether δ ∈ Δ^{(1)} or δ ∈ Δ^{(0)}, which will be clear from the context.

Terms and s-terms can be addressed by **positions** (cf. Gorn addresses). The set of all positions in term *t* or in s-term *s* is denoted by pos(*t*) or pos(*s*), respectively, and defined inductively by:

- pos(*x*) = {ε} for *x* ∈ *X*,
- pos(δ(*s*_{1}, … , *s*_{k})) = {ε} ∪ {*ip* ∣ *i* ∈ [*k*], *p* ∈ pos(*s*_{i})}, and
- pos(〈*t*_{1}, … , *t*_{n}〉) = {*ip* ∣ *i* ∈ [*n*], *p* ∈ pos(*t*_{i})}.

Let ≤_{ℓ} denote the lexicographical order of positions. We say that position *p* is a **parent** of position *p*′ if *p*′ is of the form *pij*, for some numbers *i* and *j*, and we denote parent(*pij*) = *p*. For a position *p*′ of length 0 or 1, we set parent(*p*′) = nil. Conversely, the set of **children** of a position *p*, denoted by children(*p*), contains each position *p*′ with parent(*p*′) = *p*. The **right sibling** of the position *pij*, denoted by right-sibling(*pij*), is *pi*(*j* + 1).

The subterm at position *p* in a term *t* (or a s-term *s*) is defined as follows. For any term *t* we have *t*|_{ε} = *t*. For a term *t* = δ(*s*_{1,k}), *i* ∈ [*k*],*p* ∈pos(*s*_{i}) we have *t*|_{ip} =*s*_{i}|_{p}. For a s-term *s* = 〈*t*_{1,n}〉, *i* ∈ [*n*],*p* ∈pos(*t*_{i}) we have *s*|_{ip} =*t*_{i}|_{p}.

The **label** at position *p* in a term *t* is denoted by *t*(*p*). In other words, if *t*|_{p} equals δ(*s*_{1}, … , *s*_{k}) or *x* ∈ *X*, then *t*(*p*) equals δ or *x*, respectively. Let Γ ⊆ Δ. The subset of pos(*t*) consisting of all positions where the label is in Γ is denoted by pos_{Γ}(*t*), or formally pos_{Γ}(*t*) = {*p* ∈ pos(*t*)∣*t*(*p*) ∈ Γ}. Analogously to *t*(*p*) and pos_{Γ}(*t*) for terms *t*, one may define *s*(*p*) and pos_{Γ}(*s*) for s-terms *s*.

The expression *t*[*s*′]_{p} (or *s*[*s*′]_{p}) denotes the s-term obtained from *t* (or from *s*) by replacing the subterm at position *p* by s-term *s*′. For any term *t* we have *t*[*s*′]_{ε} = *s*′. For a term *t* = δ(*s*_{1,k}), *i* ∈ [*k*],*p* ∈pos(*s*_{i}) we have *t*[*s*′]_{ip} =〈δ(*s*_{1,i−1},*s*_{i}[*s*′]_{p},*s*_{i +1,k})〉. For a s-term *s* = 〈*t*_{1,n}〉, *i* ∈ [*n*],*p* ∈pos(*t*_{i}) we have *s*[*s*′]_{ip} = 〈*t*_{1,i−1}〉⋅ *t*_{i}[*s*′]_{p} ⋅〈*t*_{i +1,n}〉.

For term *t* (or s-term *s*) with variables in *X*_{k} and s-terms *s*_{i} (*i* ∈ [*k*]), the **first-order substitution***t*[*s*_{1,k}] (or *s*[*s*_{1,k}], respectively) denotes the s-term obtained from *t* (or from *s*) by replacing each occurrence of any variable *x*_{i} by *s*_{i}, or formally:

- *x*_{i}[*s*_{1,k}] = *s*_{i},
- δ(*s*′_{1,n})[*s*_{1,k}] = 〈δ(*s*′_{1}[*s*_{1,k}], … , *s*′_{n}[*s*_{1,k}])〉,
- 〈*t*_{1,n}〉[*s*_{1,k}] = *t*_{1}[*s*_{1,k}] ⋅ … ⋅ *t*_{n}[*s*_{1,k}].
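The three equations above translate directly into code. The following sketch uses an illustrative encoding of our own: a term is a (label, argument-list) pair or a variable string such as "x1", and a s-term is a list of terms:

```python
# First-order substitution t[s_1,...,s_k]: replace each occurrence of
# a variable x_i by the s-term s_i.  Substituting into a term yields a
# s-term, mirroring the three equations above.

def subst_term(t, subs):
    if isinstance(t, str):            # x_i[s_1,k] = s_i
        return list(subs[int(t[1:]) - 1])
    label, args = t                   # delta(...)[s_1,k] = <delta(...)>
    return [(label, [subst_sterm(a, subs) for a in args])]

def subst_sterm(s, subs):             # <t_1,n>[s_1,k] = concatenation
    out = []
    for t in s:
        out.extend(subst_term(t, subs))
    return out

# s = alpha(A()) x2 B(x1, beta), as in the example below
s = [("alpha", [[("A", [])]]), "x2", ("B", [["x1"], [("beta", [[]])]])]
d1 = [("delta1", [[]])]               # the s-term <delta_1(<>)>
d2 = [("delta2", [[]])]
# s[delta_1, delta_2] = alpha(A()) delta_2 B(delta_1, beta)
print(subst_sterm(s, [d1, d2]))
```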

If $s \in T_\Delta^*$, *s*|_{p} = δ(*s*_{1,k}), and $s' \in T_\Delta^*(X_k)$, then the **second-order substitution** $s \llbracket s' \rrbracket_p$ denotes the s-term obtained from *s* by replacing the subterm at position *p* by *s*′, with the variables in *s*′ replaced by the corresponding s-terms found immediately below *p*, or formally $s \llbracket s' \rrbracket_p = s[s'[s_{1,k}]]_p$.

As an example, let Δ = {*A*, *B*, *C*, α, β, γ_{1}, γ_{2}, δ_{1}, δ_{2}}, where rk_{Δ}(*A*) = 0, rk_{Δ}(*B*) = rk_{Δ}(*C*) = 2, and all other symbols have rank 1. An example of a s-term in $T_\Delta^*(X_2)$ is *s* = α(*A*()) *x*_{2} *B*(*x*_{1}, β), which is short for:

〈α(〈*A*()〉), *x*_{2}, *B*(〈*x*_{1}〉, 〈β(〈〉)〉)〉

We have |*s*| = 3, and pos(*s*) = {1, 111, 2, 3, 311, 321}. Note that 31 is not a position, as positions must point to terms, not s-terms. Further, *s*|_{1} = α(*A*()), *s*|_{111} = *A*(), *s*|_{311} = *x*_{1}, *s*(2) = *x*_{2}, and *s*(3) = *B*.

An example of first-order substitution is *s*[δ_{1}, δ_{2}] = α(*A*()) δ_{2} *B*(δ_{1}, β). With *s*′ = γ_{1} *C*(δ_{1}, δ_{2}) γ_{2}, an example of second-order substitution is:

$s \llbracket s' \rrbracket_3 = \alpha(A())\; x_2\; \gamma_1\; C(\delta_1, \delta_2)\; \gamma_2$

The s-term *s* and its positions are illustrated in Figure 3.
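The position set in this example can be checked mechanically. The following sketch reuses our illustrative encoding of terms as (label, argument-list) pairs or variable strings, and s-terms as lists:

```python
# Positions (Gorn addresses) of a s-term, as tuples of integers.
# A position of a term delta(s_1,...,s_k) is the empty tuple, or an
# argument index i followed by a position in s_i; a position of a
# s-term <t_1,...,t_n> is an element index i followed by a position
# in t_i.  Positions therefore always point to terms, never s-terms.

def pos_term(t):
    if isinstance(t, str):            # a variable
        return {()}
    _, args = t
    return {()} | {(i,) + p
                   for i, s in enumerate(args, 1)
                   for p in pos_sterm(s)}

def pos_sterm(s):
    return {(i,) + p
            for i, t in enumerate(s, 1)
            for p in pos_term(t)}

# s = alpha(A()) x2 B(x1, beta), as in the example above
s = [("alpha", [[("A", [])]]), "x2", ("B", [["x1"], [("beta", [[]])]])]
print(sorted("".join(map(str, p)) for p in pos_sterm(s)))
```

The output lists exactly the positions 1, 111, 2, 3, 311, 321 claimed above; 31 is absent because it would address a s-term.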

## 3. Hybrid Trees

The purpose of this section is to unify existing notions of non-projective dependency structures and discontinuous phrase structures, formalized using s-terms.

We fix a ranked alphabet Σ = Σ^{(1)} and a subset Γ ⊆ Σ. A **hybrid tree** over (Γ,Σ) is a pair *h* = (*s*,≤_{s}), where $s\u2208T\Sigma *$ and ≤_{s} is a total order on pos_{Γ}(*s*). In words, a hybrid tree combines hierarchical structure, in the form of a s-term over the full alphabet Σ, with a linear structure, which can be seen as a string over Γ ⊆ Σ. This string will be denoted by str(*h*). Formally, let pos_{Γ}(*s*) = {*p*_{1}, … , *p*_{n}} with *p*_{i} ≤_{s}*p*_{i +1} (*i* ∈ [*n* − 1]). Then str(*h*) = *s*(*p*_{1})⋯*s*(*p*_{n}). In order to avoid the treatment of pathological cases we assume that *s*≠〈〉 and pos_{Γ}(*s*)≠∅.

A hybrid tree (*s*,≤_{s}) is a **phrase structure** if ≤_{s} is a total order on the leaves of *s*. The elements of Γ would typically represent lexical items, and the elements of Σ ∖ Γ would typically represent syntactic categories. A hybrid tree (*s*,≤_{s}) is a **dependency structure** if Γ = Σ, whereby the linear structure of a hybrid tree involves all of its nodes, which represent lexical items. Our dependency structures generalize **totally ordered trees** (Kuhlmann and Niehren, 2008) by considering s-terms instead of usual terms over a ranked alphabet.

We say that a phrase structure (*s*,≤_{s}) over (Γ,Σ) is **continuous** if for each *p* ∈pos(*s*) the set pos_{Γ}(*s*|_{p}) is a complete span, that is, if the following condition holds: if *p*_{1}, *p*_{2}, *p*′ satisfy *pp*_{1}, *p*′, *pp*_{2} ∈pos_{Γ}(*s*) and *pp*_{1} ≤_{s}*p*′≤_{s}*pp*_{2}, then *p*′ = *pp*_{3} for some *p*_{3}. If the same condition holds for a dependency structure (*s*,≤_{s}), then we say that (*s*,≤_{s}) is **projective**. If the condition is not satisfied, then we call a phrase structure **discontinuous** and a dependency structure **non-projective**.
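The projectivity condition can be tested algorithmically: a dependency structure is projective if and only if, for every node, the sentence positions of the node and all its descendants form a contiguous interval. A minimal sketch, assuming a dependency structure given as a map from each 1-based sentence position to the position of its head (0 for the root); the encoding and the sample structures are our own illustrations:

```python
# Check projectivity: for every word, the positions covered by its
# subtree must form an interval with no gaps (a "complete span").

def is_projective(heads):
    n = len(heads)
    children = {i: [] for i in range(n + 1)}
    for dep, head in heads.items():
        children[head].append(dep)

    def descendants(i):
        out = {i}
        for c in children[i]:
            out |= descendants(c)
        return out

    for i in range(1, n + 1):
        span = descendants(i)
        if max(span) - min(span) + 1 != len(span):
            return False
    return True

# word 4 heads word 2, so the subtree of word 4 covers {2, 4},
# skipping position 3: non-projective.
print(is_projective({1: 3, 2: 4, 3: 0, 4: 3}))   # False
print(is_projective({1: 2, 2: 0, 3: 2, 4: 3}))   # True
```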

As an example, consider the hybrid tree *h* = (*s*,≤_{s}) depicted in Figure 4a, where ≤_{s} is given by 11111 ≤_{s} 11112 ≤_{s} 11211. The bottom line indicates the word order in German, with the adverb **schnell** [quickly] separating the two verbs of the verb phrase. The dashed lines connect the leaves of the tree structure to the total order. (Alternative analyses exist that do not require discontinuity; we make no claim that the shown structure is the most linguistically adequate.)

The phenomenon of cross-serial dependencies in Dutch (Bresnan et al., 1982) is illustrated in Figure 4b, using a non-projective dependency structure.

Figure 5 gives an abstract rendering of cross-serial dependencies in Dutch, this time in terms of discontinuous phrase structure.

## 4. Basic Grammatical Formalisms

The concept of hybrid grammars is illustrated in Section 5, first on the basis of a coupling of linear context-free rewriting systems and simple definite clause programs, and then three more such couplings are introduced that further involve simple macro grammars and simple context-free tree grammars. In the current section we discuss these basic classes of grammars, starting with the simplest ones.

### 4.1 Macro Grammars

The definitions in this section are very close to those in Fischer (1968) with the difference that the notational framework of s-terms is used for strings, as in Seki and Kato (2008).

A **macro grammar** (MG) is a tuple *G* = (*N*, *S*, Γ, *P*), where *N* is a ranked alphabet of **nonterminals**, *S* ∈ *N*^{(0)} is the **start symbol**, Γ = Γ^{(0)} is a ranked alphabet of **terminals** with Γ ∩ *N* = ∅, and *P* is a finite set of **rules**, each of the form:

$A(x_{1,k}) \to r$

where *A* ∈ *N*^{(k)} and $r \in T_{N \cup \Gamma}^*(X_k)$. A macro grammar in which each nonterminal has rank 0 is a CFG.

We write $\Rightarrow_G^{p,\rho}$ for the “derives” relation, using rule $\rho = A(x_{1,k}) \to r$ at position *p* in a s-term. Formally, we write $s_1 \Rightarrow_G^{p,\rho} s_2$ if $s_1 \in T_{N \cup \Gamma}^*$, *s*_{1}(*p*) = *A*, and $s_2 = s_1 \llbracket r \rrbracket_p$. We write *s*_{1} ⇒_{G} *s*_{2} if $s_1 \Rightarrow_G^{p,\rho} s_2$ for some *p* and ρ, and $\Rightarrow_G^*$ is the reflexive, transitive closure of ⇒_{G}. Derivation in *i* steps is denoted by $\Rightarrow_G^i$. The (string) language induced by macro grammar *G* is $[G] = \{ s \in T_\Gamma^* \mid \langle S \rangle \Rightarrow_G^* s \}$.

In the sequel we will focus our attention on macro grammars with the property that for each rule $A(x1,k)\u2192r$ and each *i* ∈ [*k*], variable *x*_{i} has exactly one occurrence in *r*. In this article, such grammars will be called **simple** macro grammars (sMGs).

As an example, a sMG can generate the language of all strings *ww*, where *w* is a string over {**a**, **b**}. In this and in the following examples, bold letters (which may be lower-case or upper-case) represent terminals and upper-case italic letters represent nonterminals.
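One way to obtain this copy language with a sMG (the particular rules are our own illustration, not necessarily those of the original example) is with a nonterminal *C* of rank 2 that builds the two halves in parallel: S → C(ε, ε), C(*x*_{1}, *x*_{2}) → C(*x*_{1}**a**, *x*_{2}**a**), C(*x*_{1}, *x*_{2}) → C(*x*_{1}**b**, *x*_{2}**b**), and C(*x*_{1}, *x*_{2}) → *x*_{1}*x*_{2}; each variable occurs exactly once per rule, so the grammar is simple. A sketch enumerating the strings derivable within a bounded number of steps:

```python
# Enumerate strings of the copy language {ww : w in {a,b}*} by
# mirroring the sMG rules: C(u, v) either stops, yielding u + v,
# or extends both of its arguments with the same terminal.

def derive_C(u, v, steps):
    results = {u + v}                  # rule C(x1, x2) -> x1 x2
    if steps > 0:
        for letter in "ab":            # C(x1, x2) -> C(x1 letter, x2 letter)
            results |= derive_C(u + letter, v + letter, steps - 1)
    return results

# start rule S -> C(eps, eps), expanded up to 2 growth steps
print(sorted(derive_C("", "", 2)))
```

The output is every string *ww* with |*w*| ≤ 2, i.e. ε, aa, bb, aaaa, abab, baba, bbbb.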

### 4.2 Context-free Tree Grammars

The definitions in this section are a slight generalization of those in Rounds (1970) and Engelfriet and Schmidt (1977,1978), as here they involve s-terms. In Section 3 we already argued that the extra power due to s-terms can be modeled using fixed symbols *cons* and *nil* and is therefore not very significant in itself. The benefit of the generalization lies in the combination with macro grammars, as discussed in Section 5.

A (generalized) **context-free tree grammar** (CFTG) is a tuple *G* =(*N*,*S*,Σ,*P*), where Σ is a ranked alphabet with Σ = Σ^{(1)} and *N*, *S*, and *P* are as for macro grammars except that Γ is replaced by Σ in the specification of the rules.

The “derives” relation $\Rightarrow_G^{p,\rho}$ and other relevant notation are defined as for macro grammars. Note that the language $[G] = \{ s \in T_\Sigma^* \mid \langle S \rangle \Rightarrow_G^* s \}$ induced by a CFTG *G* is not a string language but a tree language, or more precisely, its elements are **sequences** of trees.

As for macro grammars, we will focus our attention on CFTGs with the property that for each rule $A(x_{1,k}) \to r$ and each *i* ∈ [*k*], variable *x*_{i} has exactly one occurrence in *r*. In this article, such grammars will be called **simple** context-free tree grammars (sCFTGs). Note that if *N* = *N*^{(0)}, then a sCFTG is a **regular tree grammar** (Brainerd 1969; Gécseg and Steinby 1997). sCFTGs are a natural generalization of the widely used TAGs; see Kepser and Rogers (2011), Maletti and Engelfriet (2012), and Gebhardt and Osterholzer (2015).

In a linguistic interpretation of such a tree language, **a** could stand for a noun, **b** for a verb, and **c** for an adverb that modifies exactly one of the verbs.

### 4.3 Linear Context-free Rewriting Systems

In Vijay-Shanker, Weir, and Joshi (1987), the semantics of LCFRS is introduced by distinguishing two phases. In the first phase, a tree over function symbols is generated by a regular tree grammar. In the second phase, the function symbols are interpreted, each composing a sequence of tuples of strings into another tuple of strings. This formalism is equivalent to the multiple CFGs of Seki et al. (1991). We choose a notation similar to that of the formalisms discussed before, which will also enable us to couple these string-generating grammars to tree-generating grammars, as will be discussed later.

A **linear context-free rewriting system** (LCFRS) is a tuple *G* = (*N*, *S*, Γ, *P*), where *N* is a ranked alphabet of **nonterminals**, *S* ∈ *N*^{(1)} is the **start symbol**, Γ = Γ^{(0)} is a ranked alphabet of **terminals** with Γ ∩ *N* = ∅, and *P* is a finite set of **rules**, each of the form:

$A_0(s_{1,k_0}) \to \langle A_1(x_{m_0+1}, \ldots, x_{m_1}), \ldots, A_n(x_{m_{n-1}+1}, \ldots, x_{m_n}) \rangle$

where *n* ∈ ℕ, $A_i \in N^{(k_i)}$ for each *i* ∈ [*n*]_{0}, $m_i = \sum_{j: 1 \le j \le i} k_j$ for each *i* ∈ [*n*] (and *m*_{0} = 0), and $s_j \in T_\Gamma^*(X_{m_n})$ for each *j* ∈ [*k*_{0}]. In words, the right-hand side is a s-term consisting of nonterminals *A*_{i} (*i* ∈ [*n*]), with *k*_{i} distinct variables as arguments; there are *m*_{n} variables altogether, which is the sum of the ranks of all *A*_{i} (*i* ∈ [*n*]). The left-hand side is an occurrence of *A*_{0} with each argument being a string of variables and terminals. Furthermore, we demand that each *x*_{j} (*j* ∈ [*m*_{n}]) occurs exactly once on the left-hand side. The rank of a nonterminal is called its **fanout**, and the largest rank of any nonterminal is called the **fanout** of the grammar.

Given a rule ρ ∈ *P*, a **rule instance of ρ** is obtained by choosing some $r_i \in T_\Gamma^*$ for each variable *x*_{i} (*i* ∈ [*m*_{n}]), and replacing the two occurrences of *x*_{i} in the rule by this *r*_{i}. Much as in the preceding sections, $\Rightarrow_G^\rho$ is a binary relation on s-terms, with $s_1 \cdot \langle t \rangle \cdot s_2 \Rightarrow_G^\rho s_1 \cdot s' \cdot s_2$ whenever $t \to s'$ is a rule instance of ρ. We write ⇒_{G} for $\bigcup_{\rho \in P} \Rightarrow_G^\rho$. The (string) language induced by LCFRS *G* is $[G] = \{ s \in T_\Gamma^* \mid \langle S(s) \rangle \Rightarrow_G^* \langle\rangle \}$. If $\langle S(s) \rangle \Rightarrow_G^* \langle\rangle$, then there are *i* ∈ ℕ and ρ_{1}, … , ρ_{i} such that $\langle S(s) \rangle \Rightarrow_G^{\rho_1} \cdots \Rightarrow_G^{\rho_i} \langle\rangle$. We call ρ_{1}⋯ρ_{i} a **derivation of** *G*.

A derivation can be represented by a **derivation tree** *d* (cf. Figure 6), which is obtained by glueing together the rules as they are used in the derivation. The backbone of *d* is a usual derivation tree of the CFG underlying the LCFRS (nonterminals and solid lines). Each argument of a nonterminal is represented as a box to the right of the nonterminal, and dashed arrows indicate dependencies of values. An *x*_{i} above a box specifies an argument of a nonterminal on the right-hand side of a rule, whereas the content of a box is an argument of the nonterminal on the left-hand side of a rule. The boxes taken as vertices and the dashed arrows taken as edges constitute a **dependency graph**. It can be evaluated by first sorting its vertices topologically and, according to this sorting, substituting in an obvious manner the relevant s-terms into each other. The final s-term that is evaluated in this way is an element of the language [*G*]. This s-term, denoted by ϕ(*d*), is called the **evaluation of d**. The notion of dependency graph originates from attribute grammars (Knuth, 1968; Paakki, 1995); it should not be confused with the linguistic concept of dependency.

Note that if the rules in a derivation are given, then the choice of *r*_{i} for each variable *x*_{i} in each rule instance is uniquely determined. For a given string *s*, the set of all LCFRS derivations (in compact tabular form) can be obtained in polynomial time in the length of *s* (Seki et al., 1991). See also Kallmeyer and Maier (2010, 2013) for the extension with probabilities.

Consider an LCFRS *G* that generates the language described below. A derivation in *G* and the corresponding derivation tree *d* are depicted in Figure 6; its evaluation is the s-term ϕ(*d*) = **aacbbd**.

All strings derived by *G* have the interlaced structure **a**^{m}**c**^{n}**b**^{m}**d**^{n} with *m*,*n* ∈ ℕ, where the *i*-th occurrence of **a** corresponds to the *i*-th occurrence of **b** and the *i*-th occurrence of **c** corresponds to the *i*-th occurrence of **d**. This resembles cross-serial dependencies in Swiss German (Shieber, 1985) in an abstract way; **a** and **c** represent noun phrases with different case markers (dative or accusative) and **b** and **d** are verbs that take different arguments (dative or accusative noun phrases).
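The interlaced structure arises naturally from the tuple semantics of LCFRS. The following sketch evaluates a hypothetical fanout-2 grammar for this language (our own reconstruction, not necessarily the rules of *G*): a nonterminal *A* derives the pair (**a**^{m}, **b**^{m}), *C* derives (**c**^{n}, **d**^{n}), and the start rule S(*x*_{1}*x*_{3}*x*_{2}*x*_{4}) → 〈A(*x*_{1}, *x*_{2}), C(*x*_{3}, *x*_{4})〉 interleaves the four components:

```python
# Bottom-up evaluation of an LCFRS derivation: each nonterminal
# contributes a tuple of strings, and the start rule concatenates
# the components in the order prescribed by its left-hand side.

def derive_A(m):           # A(a x1, b x2) -> <A(x1, x2)>; A(eps, eps) -> <>
    return ("a" * m, "b" * m)

def derive_C(n):           # C(c x1, d x2) -> <C(x1, x2)>; C(eps, eps) -> <>
    return ("c" * n, "d" * n)

def derive_S(m, n):        # S(x1 x3 x2 x4) -> <A(x1, x2), C(x3, x4)>
    x1, x2 = derive_A(m)
    x3, x4 = derive_C(n)
    return x1 + x3 + x2 + x4

print(derive_S(2, 1))      # -> "aacbbd"
```

Because the *A*-components and *C*-components are interleaved on the left-hand side of the start rule, the language is not well-nested, which is why fanout 2 is essential here.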

There is a subclass of LCFRS called **well-nested** LCFRS. The time complexity of parsing is lower for this subclass than for general LCFRS (Gómez-Rodríguez, Kuhlmann, and Satta 2010), and the class of languages it induces is strictly included in the class of languages induced by general LCFRSs (Kanazawa and Salvati, 2010). One can see sMGs as syntactic variants of well-nested LCFRSs (cf. footnote 3 of Kanazawa 2009), the former being more convenient for our purposes of constructing hybrid grammars, when the string component is to be explicitly restricted to have the power of sMG / well-nested LCFRS. The class of languages induced by sMGs also equals the class of string languages induced by sCFTGs.

### 4.4 Definite Clause Programs

In this section we describe a particular kind of definite clause program. Our definition is inspired by Deransart and Małuszynski (1985), who investigated the relation between logic programs and attribute grammars, together with the “syntactic single use requirement” from Giegerich (1988). The values produced are s-terms. The induced class of s-term languages is strictly larger than that of sCFTGs (cf. Appendix B).

As discussed subsequently, the class of string languages that results if we take the yields of those s-terms equals the class of string languages induced by LCFRSs. Thereby, our class of definite clause programs relates to LCFRSs much as the class of sCFTGs relates to sMGs.

A simple definite clause program (sDCP) is a tuple *G* = (*N*, *S*, Σ, *P*), where *N* is a ranked alphabet of **nonterminals** and Σ = Σ^{(1)} is a ranked alphabet of **terminals** (as for CFTGs).^{2} Moreover, each nonterminal *A* ∈ *N* has a number of arguments, each of which is either an **inherited argument** or a **synthesized argument**. The number of inherited arguments is the **i-rank** and the number of synthesized arguments is the **s-rank** of *A*; we let rk_{N}(*A*) = i-rk(*A*) + s-rk(*A*) denote the **rank** of *A*. The **start symbol** *S* has only one argument, which is synthesized—that is, rk_{N}(*S*) = s-rk(*S*) = 1 and i-rk(*S*) = 0.

Every rule in *P* has the form

$A_0(x^{(0)}_{1,k_0}, s^{(0)}_{1,k'_0}) \to A_1(s^{(1)}_{1,k'_1}, x^{(1)}_{1,k_1}) \cdots A_n(s^{(n)}_{1,k'_n}, x^{(n)}_{1,k_n})$

where *n* ∈ ℕ, *k*_{0} = i-rk(*A*_{0}) and *k*′_{0} = s-rk(*A*_{0}), and *k*′_{i} = i-rk(*A*_{i}) and *k*_{i} = s-rk(*A*_{i}) for *i* ∈ [*n*]. The set of variables occurring in the lists $x^{(i)}_{1,k_i}$ (*i* ∈ [*n*]_{0}) equals *X*_{m}, where $m = \sum_{i \in [n]_0} k_i$. In other words, every variable from *X*_{m} occurs exactly once in all these lists together; this is where values “enter” the rule. Further, the s-terms in $s^{(i)}_{1,k'_i}$ (*i* ∈ [*n*]_{0}) are in $T^*_\Sigma(X_m)$ and together contain each variable in *X*_{m} exactly once (syntactic single use requirement); this is where values are combined and “exit” the rule.

The “derives” relation ⇒_{G} and other relevant notation are defined as for LCFRSs. Thus, in particular, the language induced by sDCP *G* is $[G] = \{s \in T^*_\Sigma \mid \langle S(s)\rangle \Rightarrow^*_G \langle\rangle\}$, whose elements are now sequences of trees.

In the same way as for LCFRS we can represent a derivation of a sDCP *G* as derivation tree *d*. A box in its dependency graph is placed to the left of a nonterminal occurrence if it represents an inherited argument and to the right otherwise. As before, ϕ(*d*) denotes the evaluation of *d*, which is now a s-term over Σ.

Consider a sDCP in which one argument of *B* is inherited and all other arguments are synthesized. Its derivation tree *d* has the evaluation ϕ(*d*) = **c** **B**(**c** **B**(**a** **A**() **b**) **d**) **d**, a s-term derived by the sDCP of this example.

If a sDCP is such that the dependency graph for any derivation contains no cycles, then we say the sDCP contains no cycles. In this case, if the rules in a derivation are given, then the choice of *r*_{i} for each variable *x*_{i} in each rule instance is uniquely determined, and can be computed in linear time in the size of the derivation. The existence of cycles is decidable, as we know from the literature on attribute grammars (Knuth, 1968). There are sufficient conditions for absence of cycles, such as the grammar being L-attributed (Bochmann, 1976; Deransart, Jourdan, and Lorho, 1988). In this article, we will assume that sDCPs contain no cycles. A one-pass computation model can be formalized as a bottom–up tree-generating tree-to-hypergraph transducer (Engelfriet and Vogler, 1998). One may alternatively evaluate sDCP arguments using the more general mechanism of unification, as it exists for example in HPSG (Pollard and Sag, 1994).
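Acyclicity of a given dependency graph can be tested with a standard topological-sort check. The following sketch assumes an adjacency-list representation; it is our own illustration, not the attribute-grammar circularity test of Knuth (1968):

```python
from collections import deque

def is_acyclic(graph):
    """Kahn's algorithm: a directed graph (node -> list of successors)
    is acyclic iff every node can be removed in topological order."""
    indeg = {v: 0 for v in graph}
    for succs in graph.values():
        for w in succs:
            indeg[w] = indeg.get(w, 0) + 1
    queue = deque(v for v, d in indeg.items() if d == 0)
    seen = 0
    while queue:
        v = queue.popleft()
        seen += 1
        for w in graph.get(v, []):
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    return seen == len(indeg)
```

On an acyclic dependency graph, the same topological order also gives the linear-time evaluation order of the arguments.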

If the sDCP is **single-synthesized**—that is, each nonterminal has exactly one synthesized argument (and any number of inherited arguments)—then there is an equivalent sCFTG. The converse also holds. For proofs, see Appendix A.

Single-synthesized sDCPs have the same s-term generating power as sCFTGs.

We define a function “flatten” from $T_\Sigma(X) \cup T^*_\Sigma(X)$ to $T^*_\Sigma(X)$, which returns the sequence of node labels in a pre-order tree traversal. (Alternatively, one could define a function “yield” from $T_\Sigma(X) \cup T^*_\Sigma(X)$ to $T^*_\Gamma(X)$, where Γ is a subset of Σ, which erases symbols in Σ ∖ Γ. The erased symbols would typically be linguistic categories as opposed to lexical elements. This does not affect the validity of what follows, however.) The function is defined recursively on the structure of terms and s-terms.
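A sketch of “flatten”, assuming a term is encoded as a (label, children) pair and a s-term as a Python list of terms (a hypothetical encoding, chosen only for illustration):

```python
def flatten(t):
    """Return the sequence of node labels in a pre-order traversal."""
    if isinstance(t, list):            # a s-term: a sequence of terms
        return [lab for child in t for lab in flatten(child)]
    label, children = t                # a single term
    return [label] + flatten(children)
```

Applied to the s-term **c** **B**(**c** **B**(**a** **A**() **b**) **d**) **d** from the earlier sDCP example, this yields the label sequence c B c B a A b d d.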

On the basis of the flattening function, Appendix C proves the following result.

sDCP have the same string generating power as LCFRS.

## 5. Hybrid Grammars

A hybrid grammar consists of a string grammar and a tree grammar. Intuitively, the string grammar is used for parsing a given string *w*, and the tree grammar simultaneously generates a hybrid tree *h* with str(*h*) = *w*. To synchronize these two processes, we couple derivations in the grammars in a way similar to how this is commonly done for synchronous grammars—namely, by **indexed** symbols. However, we apply the mechanism not only to derivational nonterminals but also to terminals.

Let Ω be a ranked alphabet. We define the ranked set $I(\Omega) = \{\omega^{[u]} \mid \omega \in \Omega, u \in \mathbb{N}_+\}$ of **indexed** symbols, with $\mathrm{rk}(\omega^{[u]}) = \mathrm{rk}_\Omega(\omega)$; we call *u* the index of $\omega^{[u]}$. Let Δ be another ranked alphabet (Ω ∩ Δ = ∅) and *Y* ⊆ *X*. We let $I^*_{\Omega,\Delta}(Y)$ be the set of all s-terms $s \in T^*_{I(\Omega)\cup\Delta}(Y)$ in which each index *u* occurs at most once.

For a s-term *s*, let ind(*s*) be the set of all indices occurring in *s*. The deindexing function $D$ removes all indices from a s-term $s \in I^*_{\Omega,\Delta}(Y)$ to obtain $D(s) \in T^*_{\Omega\cup\Delta}(Y)$. The set $I_{\Omega,\Delta}(Y) \subseteq T_{I(\Omega)\cup\Delta}(Y)$ of terms with indexed symbols is defined much as above. We let $I^*_{\Omega,\Delta} = I^*_{\Omega,\Delta}(\emptyset)$ and $I_{\Omega,\Delta} = I_{\Omega,\Delta}(\emptyset)$.
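Index bookkeeping and the deindexing function $D$ can be sketched as follows, assuming indexed symbols are encoded as (symbol, index) pairs with index `None` for unindexed symbols from Δ (a hypothetical encoding):

```python
def indices(s):
    """Collect all indices occurring in a s-term, i.e., ind(s)."""
    out = []
    for (sym, idx), children in s:
        if idx is not None:
            out.append(idx)
        out.extend(indices(children))
    return out

def well_indexed(s):
    """Each index may occur at most once."""
    ids = indices(s)
    return len(ids) == len(set(ids))

def deindex(s):
    """The deindexing function D: strip all indices."""
    return [(sym, deindex(children)) for (sym, idx), children in s]
```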

### 5.1 LCFRS/sDCP Hybrid Grammars

We first couple a LCFRS and a sDCP in order to describe a set of hybrid trees.

A **LCFRS/sDCP hybrid grammar** (over Γ and Σ) is a tuple *G* = (*N*, *S*, (Γ,Σ), *P*), subject to the following restrictions. The objects Γ and Σ are ranked alphabets with Γ = Γ^{(0)} and Σ = Σ^{(1)}. As mere sets of symbols, we demand Γ ⊆ Σ. Let Δ be the ranked alphabet Σ ∖ Γ, with rk_{Δ}(δ) = rk_{Σ}(δ) = 1 for δ ∈ Δ. The set *P* is a finite set of **hybrid rules**, each of the form:

$[\,A(s^{(1)}_{1,k}) \to r_1,\ A(s^{(2)}_{1,m}) \to r_2\,] \quad (4)$

Indices occur at nonterminals in *r*_{1} and *r*_{2} and at terminals from Γ in $s^{(1)}_{1,k}$, in $s^{(2)}_{1,m}$, in *r*_{1}, and in *r*_{2}. We require that each index in a hybrid rule either couples a pair of identical terminals or couples a pair of identical nonterminals. Let *P*_{1} be the set of all $D(A(s^{(1)}_{1,k})) \to D(r_1)$, where $A(s^{(1)}_{1,k}) \to r_1$ occurs as the first component of a hybrid rule as in Equation (4). The set *P*_{2} is similarly defined, taking the second components. We now further require that (*N*, *S*, Γ, *P*_{1}) is a LCFRS and (*N*, *S*, Σ, *P*_{2}) is a sDCP. We refer to these two grammars as the first and second **components**, respectively, of *G*.

In order to define the “derives” relation $\Rightarrow^{u,\rho}_G$, for some index *u* and some rule ρ of the form of Equation (4), we need the additional notions of nonterminal reindexing and of terminal reindexing. The **nonterminal reindexing** is an injective function *f*_{U} that replaces each index at a nonterminal occurrence of the rule by one that does not clash with the indices in an existing set *U* ⊆ ℕ_{+}. We may, for example, define *f*_{U} such that it maps each *v* ∈ ℕ_{+} to the smallest *v*′ ∈ ℕ_{+} such that *v*′ ∉ *U* ∪ {*f*_{U}(1), … , *f*_{U}(*v* − 1)}. We extend *f*_{U} to apply to terms, s-terms, and rules in a natural way, to replace indices by other indices, but leaving all other symbols unaffected. A **terminal reindexing** *g* maps indices at occurrences of terminals in the rule to the indices of corresponding occurrences of terminals in the sentential form; we extend *g* to terms, s-terms, and rules in the same way as for *f*_{U}. The definition of *f*_{U} is fixed, while an appropriate *g* needs to be chosen for each derivation step.

We define $[s_1, s_2] \Rightarrow^{u,\rho}_G [s'_1, s'_2]$ if and only if:

- ρ ∈ *P* is $[\,A(s^{(1)}_{1,k}) \to r_1,\ A(s^{(2)}_{1,k}) \to r_2\,]$,
- *s*_{1} and *s*_{2} contain occurrences of $A^{[u]}(s'^{(1)}_{1,k})$ and $A^{[u]}(s'^{(2)}_{1,k})$, respectively, and $s'_1$ and $s'_2$ result from replacing these occurrences by $r'_1$ and $r'_2$, respectively,
- there is a terminal reindexing *g* such that $g(s^{(1)}_{1,k}) = s'^{(1)}_{1,k}$ (and hence $g(s^{(2)}_{1,k}) = s'^{(2)}_{1,k}$),
- $A(s'^{(1)}_{1,k}) \to r'_1$ is obtained from $g(f_U(A(s^{(1)}_{1,k}) \to r_1))$ by consistently substituting occurrences of variables by s-terms in $I^*_{\Gamma,\emptyset}$, and $A(s'^{(2)}_{1,k}) \to r'_2$ is obtained from $g(f_U(A(s^{(2)}_{1,k}) \to r_2))$ by consistently substituting occurrences of variables by s-terms in $I^*_{\Gamma,\Delta}$.

As an example, consider a derivation step with index *u* = 1. We define the nonterminal reindexing such that *f*_{U}(1) = 1 and *f*_{U}(2) = 7, and as terminal reindexing we use *g* such that *g*(3) = 5. Reindexing the rule accordingly and substituting s-terms for the variables in both components yields the derivation step.

We write ⇒_{G} for $\bigcup_{u \in \mathbb{N}_+, \rho \in P} \Rightarrow^{u,\rho}_G$; the hybrid language [*G*] induced by *G* is then defined in Equation (5). Note that apart from the indices, the first component of a pair [*s*_{1}, *s*_{2}] ∈ [*G*] consists of a string of terminals from Γ and the second is a s-term built up of terminals in Γ ∪ Δ. Moreover, every occurrence of γ ∈ Γ in *s*_{1} corresponds to exactly one in *s*_{2} and vice versa, because of the common indices.

From a pair [*s*_{1},*s*_{2}] ∈ [*G*], we can construct the hybrid tree (*s*,≤_{s}) over (Γ,Σ) by letting $s=D(s2)$ and, for each combination of positions *p*_{1}, *p*′_{1}, *p*_{2}, *p*′_{2} such that $s1(p1)=s2(p2)\u2208I(\Gamma )$ and $s1(p1\u2032)=s2(p2\u2032)\u2208I(\Gamma )$, we set *p*_{2} ≤_{s}*p*′_{2} if and only if *p*_{1} ≤_{ℓ}*p*′_{1}. (The lexicographical ordering ≤_{ℓ} on positions here simplifies to the linear ordering of integers, as positions in strings always have length 1.) In words, occurrences of terminals in *s* obtain a total order in accordance with the order in which corresponding terminals occur in *s*_{1}. The set of all such (*s*,≤_{s}) will be denoted by *L*(*G*).
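The induced total order on the Γ-labeled positions of *s* can be sketched as follows, assuming the string component is given as its left-to-right sequence of indices and the tree component as a map from positions to indices (a hypothetical representation):

```python
def order_from_string(s1_indices, s2_indices):
    """Return the tree positions sorted by <=_s, i.e., by the order in
    which their coupled terminals occur in the string component."""
    rank = {idx: i for i, idx in enumerate(s1_indices)}
    return sorted(s2_indices, key=lambda p: rank[s2_indices[p]])
```

For instance, if the string component carries indices 3, 1, 2 from left to right, a tree position coupled via index 3 precedes one coupled via index 1.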

Consider a hybrid grammar *G* in which all arguments in the second component are synthesized. In a derivation applying rules ρ_{1}, ρ_{2}, and ρ_{3}, we have used the following reindexing functions:

| application of | nonterminal reindexing | terminal reindexing |
|---|---|---|
| ρ_{1} | *f*_{{2,3,4}}(2) = 5 | *g* identity |
| ρ_{2} | *f*_{{2,3,4,5}} identity | *g*(1) = 2 and *g*(2) = 4 |
| ρ_{3} | *f*_{{3}} identity | *g*(1) = 3 |

with *f*_{U}(*i*) = *i* and *g*(*j*) = *j* where not specified otherwise.

Note that in the LCFRS that is the first component, nonterminal *V* has fanout 2, and the LCFRS thereby has fanout 2. The tree produced by the second component is a parse tree in the traditional sense—that is, it specifies exactly how the first component analyzes the input string. Each LCFRS can in fact be extended to become a canonical LCFRS/sDCP hybrid of this kind, in which there is no freedom in the coupling of the string grammar and the tree grammar. Traditional frameworks for LCFRS parsing can be reinterpreted as using such hybrid grammars.

A derivation of a LCFRS/sDCP hybrid grammar *G* can be represented by a **derivation tree** (cf. Figure 8). It combines a derivation tree of the first component of *G* and a derivation tree of its second component, letting them share the common parts. We make a graphical distinction between arguments of the first component (rectangles) and those of the second component (ovals). Implicit in a derivation tree is the reindexing of terminal indices. The evaluation ϕ_{1} of the derivation tree *d* in Figure 8 yields the indexed string, and the evaluation ϕ_{2} of *d* yields the indexed tree.

Consider a hybrid grammar in which nonterminal *T* (for transitive verb) has two inherited arguments, for the subject and the object, whereas *I* (intransitive verb) has one inherited argument, for the subject. (To keep the example simple, we conflate different verb forms, so that the grammar overgenerates.) We derive a Dutch cross-serial example (abbreviating each Dutch word by its first letter).

### 5.2 Other Classes of Hybrid Grammars

In order to illustrate the generality of our framework, we sketch three more classes of hybrid grammars. In these three classes, the first component, the second component, or both are less powerful than in the case of the LCFRS/sDCP hybrid grammars defined previously. The resulting hybrid grammars are thereby less powerful, in light of the observation that sMGs are syntactic variants of well-nested LCFRSs and sCFTGs are syntactic variants of sDCPs with s-rank restricted to 1. Noteworthy are the differences between the four classes of hybrid grammars in the formal definition of their derivations.

A **sMG/sCFTG hybrid grammar** (over Γ and Σ) is a tuple *G* = (*N*, *S*, (Γ,Σ), *P*), where Γ and Σ are as in the case of LCFRS/sDCP hybrid grammars. The hybrid rules in *P* are now of the form $[\,A(x_{1,k}) \to r_1,\ A(x_{1,k}) \to r_2\,]$.

In the definition of the “derives” relation $\u21d2Gu,\rho $ we have to use a reindexing function. Because terminals are produced by a rule application (instead of being consumed as in the LCFRS/sDCP case), there is no need for a terminal reindexing that matches indices of terminal occurrences of the applied rule with those of the sentential form. Instead, terminal indices occurring in the rule have to be reindexed away from the sentential form in the same way as for nonterminal indices. Thus we use one reindexing function *f*_{U} that applies to nonterminal indices and terminal indices.

We define $[s_1, s_2] \Rightarrow^{u,\rho}_G [s'_1, s'_2]$ for every $s_1, s'_1 \in I^*_{N\cup\Gamma,\emptyset}$ and $s_2, s'_2 \in I^*_{N\cup\Gamma,\Delta}$ if and only if:

- there are positions *p*_{1} and *p*_{2} such that $s_1(p_1) = A^{[u]}$ and $s_2(p_2) = A^{[u]}$,
- ρ ∈ *P* is $[\,A(x_{1,k}) \to r_1,\ A(x_{1,k}) \to r_2\,]$,
- *U* = ind(*s*_{1}) ∖ {*u*} = ind(*s*_{2}) ∖ {*u*},
- $s'_i = s_i〚f_U(r_i)〛_{p_i}$ for *i* = 1, 2.

We write ⇒_{G} for $\bigcup_{u \in \mathbb{N}_+, \rho \in P} \Rightarrow^{u,\rho}_G$ and define the hybrid language induced by *G* accordingly (Equation (7)). As before, this defines a set *L*(*G*) of hybrid trees. Note the structural difference between Equations (5) and (7).

Consider an example sMG/sCFTG hybrid grammar with nonterminal *A*. One can derive, for instance, a hybrid tree by synchronously rewriting indexed occurrences of *A* in both components.

One can analogously define **LCFRS/sCFTG hybrid grammars** and **sMG/sDCP hybrid grammars**, with suitable definitions of ⇒_{G} obtained straightforwardly by combining elements from the earlier definitions; the hybrid languages induced by a LCFRS/sCFTG hybrid grammar and by a sMG/sDCP hybrid grammar are defined accordingly.

### 5.3 Probabilistic LCFRS/sDCP Hybrid Grammars and Parsing

In the usual way, we can extend LCFRS/sDCP hybrid grammars to become probabilistic LCFRS/sDCP hybrid grammars. For this we can assign a probability to each hybrid rule, under the constraint of a **properness** condition. More precisely, for each nonterminal *A* the probabilities of all hybrid rules with *A* in their left-hand sides sum up to 1. In this way a probabilistic LCFRS/sDCP hybrid grammar induces a distribution over hybrid trees. Thus it can be considered as a generative model. See also Nivre (2010) for a general survey of probabilistic parsing.
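The properness condition can be checked mechanically. The sketch below assumes a toy representation of rules as (left-hand-side nonterminal, probability) pairs, which is our own encoding for illustration:

```python
from collections import defaultdict

def is_proper(rules):
    """Properness: for each left-hand-side nonterminal, the
    probabilities of its rules sum to 1 (up to rounding error)."""
    totals = defaultdict(float)
    for lhs, prob in rules:
        totals[lhs] += prob
    return all(abs(t - 1.0) < 1e-9 for t in totals.values())
```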

Algorithm 1 shows a parsing pipeline that takes as input a probabilistic LCFRS/sDCP hybrid grammar *G* and a sentence *w* ∈ Γ*. As output it computes the hybrid tree *h* ∈ *L*(*G*) that is derived by the most likely derivation whose first component derives *w*. In line 1 the first component of *G* is extracted. Because we will later restore the second component, we assume that a tag is attached to each LCFRS rule that uniquely identifies the original LCFRS/sDCP rule. This also means that two otherwise identical LCFRS rules are treated as distinct if they were taken from two different LCFRS/sDCP rules. In line 2 the string *w* is parsed. For this, any standard LCFRS parser can be used. Such parsers, for instance those by Seki et al. (1991) and Kallmeyer (2010), typically run in polynomial time. The technical details of the chosen LCFRS parser (such as the form of items or the form of iteration over rules) are irrelevant, as only the functionality of the parsing component matters for our framework. The parsing algorithm builds a succinct representation of all derivations of *w*. In a second, linear-time phase the most likely derivation tree *d* for *w* is extracted. In line 3 the dependency graph of *d* is enriched by the inherited and synthesized arguments that correspond to the second components of the rules occurring in *d*; here, the identity of the original LCFRS/sDCP rule is needed. This results in an intermediate structure $\hat{d}$. This is not yet a derivation tree of *G*, as the terminals still need to be assigned unique indices. This is done first by indexing the terminals from *w* in an arbitrary manner, and then by traversing the derivation to associate these indices with the appropriate terminal occurrences, leading to a derivation tree *d*′ of the LCFRS/sDCP hybrid grammar *G*. Note that ϕ_{1}(*d*′) = *s*_{1}. Finally, in line 6 the derivation tree *d*′ is evaluated by ϕ_{2}, yielding a tree in $I^*_{\Gamma,\Delta}$.

Consider the hybrid grammar *G* from Example 11 and the sentence *w* = **h** **s** **g**. Figure 9 shows *d*, $\hat{d}$, *d*′, and *s*_{1}. The LCFRS *G*′ results from the extraction in line 1 of Algorithm 1.

## 6. Grammar Induction

For most practical applications, hybrid grammars would not be written by hand, but would be automatically extracted from finite corpora of hybrid trees, over which they should generalize. To be precise, the task is to construct a hybrid grammar *G* out of a corpus *c* such that *c* ⊆ *L*(*G*).

During grammar induction each hybrid tree *h* = (*s*,≤_{s}) of the corpus is decomposed. A decomposition will determine the structure of a derivation of *h* by the grammar. Classically, this decomposition and thereby the resulting derivations resemble the structure of *s*. This approach has been pursued, for instance, by Charniak (1996) for CFGs, and by Maier and Søgaard (2008) and Kuhlmann and Satta (2009) for LCFRSs. There is no guarantee, however, that this approach is optimal from the perspective of, for example, parsing efficiency, the grammar’s potential to generalize from training data, or the size of the grammar.

For a more general approach that offers additional degrees of freedom we extend the framework of Nederhof and Vogler (2014) and let grammar induction depend on a decomposition strategy of the string str(*h*), called recursive partitioning. One may choose one out of several such strategies. We consider four instances of induction of hybrid grammars, which vary in the type of hybrid trees that occur in the corpus, and the type of the resulting hybrid grammar:

| | corpus | hybrid grammar |
|---|---|---|
| 1. | phrase structures | LCFRS/sDCP |
| 2. | phrase structures | LCFRS/sCFTG |
| 3. | dependency structures | LCFRS/sDCP |
| 4. | dependency structures | LCFRS/sCFTG |

### 6.1 Recursive Partitioning and Induction of LCFRS

A **recursive partitioning** of a string *w* of length *n* is a tree π whose nodes are labeled with subsets of [*n*]. The root of π is labeled with [*n*]. Each leaf of π is labeled with a singleton subset of [*n*]. Each non-leaf node has at least two children and is labeled with the union of the labels of its children, which furthermore must be disjoint. To fit recursive partitionings into our framework of s-terms, we regard $\mathcal{P}([n])$ as a ranked alphabet with $\mathcal{P}([n]) = \mathcal{P}([n])^{(1)}$. We let $\pi \in T^*_{\mathcal{P}([n])}$ with |π| = 1.
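These defining conditions translate into a straightforward checker. The sketch below assumes π is encoded as nested (label, children) pairs with frozenset labels, a hypothetical encoding chosen for illustration:

```python
def is_recursive_partitioning(node, n=None):
    """Check: root labeled [n], leaves singletons, and each internal
    node the disjoint union of the labels of its >= 2 children."""
    label, children = node
    if n is not None and label != frozenset(range(1, n + 1)):
        return False
    if not children:
        return len(label) == 1
    if len(children) < 2:
        return False
    union, total = set(), 0
    for child in children:
        union |= child[0]
        total += len(child[0])
    disjoint = total == len(union)
    return disjoint and union == set(label) and all(
        is_recursive_partitioning(c) for c in children)
```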

We say a set *J* ⊆ [*n*] has **fanout** *k* if *k* is the smallest number such that *J* can be written as *J* = *J*_{1} ∪ … ∪ *J*_{k}, where:

- each *J*_{ℓ} (ℓ ∈ [*k*]) is of the form {*i*_{ℓ}, *i*_{ℓ} + 1, … , *i*_{ℓ} + *m*_{ℓ}}, for some *i*_{ℓ} and *m*_{ℓ} ≥ 0, and
- *i* ∈ *J*_{ℓ}, *i*′ ∈ *J*_{ℓ′} (ℓ, ℓ′ ∈ [*k*]) and ℓ < ℓ′ imply *i* < *i*′.

Note that *J*_{1}, … , *J*_{k} are uniquely defined by this, and we write spans( *J*) = 〈*J*_{1}, … , *J*_{k}〉. The fanout of a recursive partitioning is defined as the maximal fanout of its nodes.
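Both spans(*J*) and the fanout can be computed by one pass over *J* in increasing order; a small sketch:

```python
def spans(J):
    """Decompose J into its maximal runs of consecutive integers,
    ordered left to right: spans(J) = <J_1, ..., J_k>."""
    result, run = [], []
    for i in sorted(J):
        if run and i == run[-1] + 1:
            run.append(i)
        else:
            if run:
                result.append(run)
            run = [i]
    if run:
        result.append(run)
    return result

def fanout(J):
    return len(spans(J))
```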

Figure 10 presents two recursive partitionings of a string with seven positions. The left one has fanout 3 (because of the node label {1,3,6,7} ={1}∪{3}∪{6,7}), whereas the right one has fanout 2.

Algorithm 2 constructs a LCFRS *G* out of a string *w* = α_{1}⋯α_{n} and a recursive partitioning π of *w*. The nonterminals are symbols of the form ⦇*J*⦈, where *J* is a node label from π. (In practice, the nonterminals ⦇*J*⦈ are replaced by other symbols, as will be explained in Section 6.8.) For each position *p* of π a rule is constructed. If *p* is a leaf labeled by {*i*}, then the constructed rule simply generates α_{i} where ⦇{*i*}⦈ has fanout 1 (lines 6 and 7). For each internal position labeled *J*_{0} and its children labeled *J*_{1}, … , *J*_{j} we compute the spans of these labels (line 9). For each argument *q* of ⦇*J*_{0}⦈ a s-term *s*_{q} is constructed by analyzing the corresponding component *J*_{0,q} of spans( *J*_{0}) (lines 12–13). Here we exploit the fact that *J*_{0,q} can be decomposed in a unique way into a selection of sets that are among the spans of *J*_{1}, … , *J*_{j}. Each of these sets translates to one variable. The resulting grammar *G* allows for exactly one derivation tree that derives *w* and has the same structure as π. We say that ** G parses w according to π**.

We observe that *G* as constructed in this way is in a particular **normal form**.^{3} To be precise, by the notation in Equation (2), each rule satisfies one of the following:

- *n* ≥ 2 and $s_{1,k_0} \in T^*_\emptyset(X_{m_n})$, (**structural rule**)
- *n* = 0, *k*_{0} = 1, and *s*_{1} = 〈α〉 for some α ∈ Γ. (**terminal generating rule**)

Conversely, each derivation tree *d* of a LCFRS in normal form can be translated to a recursive partitioning π_{d}, by processing *d* bottom–up as follows. Each leaf, that is, a rule *A*(〈α〉) →〈〉, is replaced by the singleton set {*i*} where *i* is the position of the corresponding occurrence of α in the accepted string. Each internal node is replaced by the union of the sets that were computed for its children. For the constructed LCFRS *G* that parses *w* according to π and for its only derivation tree *d*, we have π = π_{d}.
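This bottom-up translation can be sketched in Python, assuming a toy encoding of derivation trees in which each leaf carries the input position of the terminal it generates (a hypothetical representation, not the paper's notation):

```python
def to_partitioning(d):
    """Translate a derivation tree of a normal-form LCFRS into the
    recursive partitioning pi_d: leaves become singleton position
    sets, internal nodes the union of their children's labels."""
    kind, payload, children = d
    if kind == "leaf":                      # a rule A(<alpha>) -> <>
        return (frozenset({payload}), [])
    subtrees = [to_partitioning(c) for c in children]
    label = frozenset().union(*(t[0] for t in subtrees))
    return (label, subtrees)
```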

For the recursive partitioning of **h** **s** **g** in which the root has children labeled {1,3} and {2}, Algorithm 2 constructs a LCFRS in which nonterminal ⦇{1,3}⦈ has fanout 2. If Algorithm 2 is instead given the recursive partitioning of **h** **s** **g** in which the root has children labeled {1,2} and {3}, then it produces a LCFRS that is, in fact, a CFG.

Observe that terminals are generated individually by the rules that were constructed from the leaves of the recursive partitioning but *not* by the structural rules obtained from internal nodes. Consequently, the LCFRS that we induce is in general not lexicalized. In order to obtain a lexicalized LCFRS, the notion of recursive partitioning would need to be generalized. We suspect that such a generalization is feasible, in conjunction with a corresponding generalization of the induction techniques presented in this article. This would be technically involved, however, and is therefore left for further research.

### 6.2 Construction of Recursive Partitionings

Figure 11 sketches three pipelines to induce a LCFRS from a hybrid tree, which differ in the way the recursive partitioning is constructed. The first way (cf. Figure 11a) is to extract a recursive partitioning directly from a hybrid tree (*s*,≤_{s}). This extraction is specified in Algorithm 3, which recursively traverses *s*. For each node in *s*, the gathered input positions consist of those obtained from the children (lines 6 and 7), plus possibly one input position from the node itself if its label is in Γ (line 5). The case distinction in line 8 and following is needed because every non-leaf in a recursive partitioning must have at least two children.

For a first example, consider the phrase structure in Figure 4a. The extracted recursive partitioning is given at the beginning of Example 16. For a second example, consider the dependency structure and the extracted recursive partitioning in Figure 12.

Note that if Algorithm 3 is applied to an arbitrary hybrid tree, then the accumulated set of input positions is empty for those positions of *s* that do not dominate elements with labels in Γ. The algorithm requires further (straightforward) refinements before it can be applied on hybrid trees with several roots, that is, if |*s*| > 1. In much of what follows, we ignore these special cases.

By this procedure, a recursive partitioning extracted from a discontinuous phrase structure or a non-projective dependency structure will have a fanout greater than 1. By then applying Algorithm 2, the resulting LCFRS will have fanout greater than 1. The more discontinuity exists in the input structures, the greater the fanout will be of the recursive partitioning and the resulting LCFRS, which in turn leads to greater parsing complexity. We therefore consider how to reduce the fanout of a recursive partitioning, before it is given as input to Algorithm 2. We aim to keep the structure of a given recursive partitioning largely unchanged, except where it exceeds a certain threshold on the fanout.

Algorithm 4 presents one possible such procedure. It starts at the root, which by definition has fanout 1. Assuming the fanout of the current node does not exceed *k*, then there are two cases to be distinguished. If the label *J* of the present node is a singleton, then the node is a leaf, and we can stop (lines 2–3). Otherwise, we search breadth-first through the subtree rooted in the present node to identify a descendant *p* such that both its label *J*′ and *J* ∖ *J*′ have fanout not exceeding *k* (line 4). It is easy to see that such a node always exists: Ultimately, breadth-first search will reach the leaves, which are each labeled with a single number. One of these numbers must match either the lowest or the highest element of some maximal subset of consecutive numbers from *J*, so that the fanout of *J* cannot increase if that number is removed.

The current node is now given two children π|_{p} and *t*. The first is the subtree rooted in the node labeled *J*′ that we identified earlier, and the second is a copy of the subtree rooted in the present node, but with *J*′ subtracted from the label of every node (lines 7–14). Nodes labeled with the empty set are removed (line 10), and if a node has the same label as its parent then the two are collapsed (line 13). As the two children each have fanout not exceeding *k*, we can apply the procedure recursively (line 6).

The recursive partitioning π in the left half of Figure 10 has a node labeled {1,3,6,7}, with fanout 3. With *J* = {1,2,3,5,6,7} and *k* = 2, one possible choice for *J*′ is {3,7}, as then both *J*′ and *J* ∖ *J*′ ={1,2,5,6} have fanout not exceeding 2. This leads to the partitioning π′ in the right half of the figure. Because now all node labels have fanout not exceeding 2, recursive traversal will make no further changes. The partitioning π′ is similar to π in the sense that subtrees that are not on the path to {3,7} remain unchanged. Other valid choices for *J*′ would be {2} and {5}. Not a valid choice for *J*′ would be {1,6}, as *J* ∖{1,6} ={2,3,5,7}, which has fanout 3.
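The choice condition in line 4 of Algorithm 4 (both *J*′ and *J* ∖ *J*′ must have fanout at most *k*) can be checked with a fanout helper; the sketch below replays the numbers of this example:

```python
def fanout(J):
    """Number of maximal runs of consecutive integers in J."""
    J = sorted(J)
    return sum(1 for a, b in zip([None] + J, J) if a is None or b != a + 1)

def valid_choice(J, J_prime, k):
    """True iff J' is a valid split: fanout(J') <= k and
    fanout(J \\ J') <= k."""
    return fanout(J_prime) <= k and fanout(set(J) - set(J_prime)) <= k
```

With *J* = {1,2,3,5,6,7} and *k* = 2, the split *J*′ = {3,7} passes, whereas *J*′ = {1,6} fails because {2,3,5,7} has fanout 3.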

Algorithm 4 ensures that subsequent induction of an LCFRS (cf. Figure 11b) leads to a binary LCFRS. Note the difference between binarization algorithms such as those from Gómez-Rodríguez and Satta (2009) and Gómez-Rodríguez et al. (2009), which are applied on grammar rules, and our procedure, which is applied *before* any grammar is obtained. Unlike van Cranenburgh (2012), moreover, our objective is not to obtain a “coarse” grammar for the purpose of coarse-to-fine parsing.

Note that if *k* is chosen to be 1, then the resulting partitioning is consistent with derivations of a CFG. Even simpler partitionings exist. In particular, the **left-branching** partitioning has internal node labels that are {1,2, … , *m*}, each with children labeled {1, … , *m* − 1} and {*m*}. These are consistent with the computations of finite automata (FA) in reverse direction; see Figure 11c. Similarly, there is a **right-branching** recursive partitioning, reflecting finite-state processing in a forward direction. The use of branching partitionings completely detaches string parsing from the structure of the given hybrid tree.
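Both branching partitionings can be constructed directly; a sketch using a nested (label, children) encoding of partitionings (our own representation, for illustration):

```python
def left_branching(n):
    """Left-branching partitioning: {1..m} -> {1..m-1}, {m}."""
    node = (frozenset({1}), [])
    for m in range(2, n + 1):
        node = (frozenset(range(1, m + 1)), [node, (frozenset({m}), [])])
    return node

def right_branching(n):
    """Right-branching partitioning: {m..n} -> {m}, {m+1..n}."""
    node = (frozenset({n}), [])
    for m in range(n - 1, 0, -1):
        node = (frozenset(range(m, n + 1)), [(frozenset({m}), []), node])
    return node
```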

For instance, the left-branching recursive partitioning of a string α_{1}⋯α_{6} yields the structural rules ⦇[*i* + 1]⦈(〈*x*_{1}*x*_{2}〉) → ⦇[*i*]⦈(〈*x*_{1}〉) ⦇{*i* + 1}⦈(〈*x*_{2}〉) for each *i* ∈ [5], and the terminal generating rules ⦇{*i*}⦈(〈α_{i}〉) → 〈〉 for each *i* ∈ [6].

The relation between recursive partitioning and worst-case parsing complexity of the induced LCFRS is summarized in Table 1. We remark that the parsing complexity for unrestricted LCFRSs would improve from $O(n^{(m+1)\cdot k})$ to $O(n^{2\cdot k+2})$ if we could ensure that the LCFRSs are well-nested (Gómez-Rodríguez, Kuhlmann, and Satta 2010), or in other words, if we replace LCFRSs by sMGs. How to refine pipeline (a) to achieve this is left for future investigation.

| Pipeline | Type of the induced LCFRS | Parsing complexity for string of length *n* |
|---|---|---|
| (a) | LCFRS of arbitrary *k* and *m* | $O(n^{(m+1)\cdot k})$ |
| (b) *k* ≥ 1 | binarized *k*-LCFRS | $O(n^{3k})$ |
| (b) *k* = 1 | binarized CFG | $O(n^3)$ |
| (c) right | FA | $O(n)$ |
| (c) left | (reverse) FA | $O(n)$ |

### 6.3 Induction of Hybrid Grammars

In the remainder of Section 6 we extend the induction pipelines for LCFRS to induction pipelines for LCFRS/sDCP hybrid grammars, as illustrated in Figure 14. Given a hybrid tree *h* = (*s*,≤_{s}), we choose a recursive partitioning π obtained in one of the ways discussed in the previous section. We apply Algorithm 2 to induce a LCFRS *G*_{1} that generates str(*h*). Using the same recursive partitioning π, we induce a sDCP *G*_{2} that generates *s*. For this we use either Algorithm 5 or 6, depending on whether *h* is a phrase structure or a dependency structure.

Next, *G*_{1} and *G*_{2} are combined into a single LCFRS/sDCP hybrid grammar *G*, which synchronously generates the hybrid tree *h*. For this, we synchronize terminals and nonterminals of *G*_{1} and *G*_{2} (via indexed symbols) by slightly changing Algorithm 2. Because the terminals that require synchronization are only generated by the rules constructed for the leaves of π, we alter line 7 so that each generated terminal carries an index. Likewise, we need to index the nonterminals on the right-hand side of each structural rule, that is, we alter line 14 accordingly. The corresponding indexing needed in *G*_{2} will be described in Sections 6.4 and 6.6.

In preparation, we define several notions to relate labels of the recursive partitioning with subsets of the positions of *s*. Let pos_{Γ}(*s*) = {*p*_{1}, … , *p*_{n}} with *p*_{i} ≤_{s}*p*_{i +1} (*i* ∈ [*n* − 1]) and let *J* be a label of π. We define the set Π(*J*) ={*p*_{i}∣ *i* ∈ *J*}, which identifies the nodes of *s* corresponding to the elements in *J*. For any subset *U* ⊆pos(*s*), we construct the sets ⊤(*U*) and ⊥(*U*), which, intuitively, delimit *U* from the top and the bottom, respectively. Formally, *p* ∈⊤(*U*) if and only if *p* ∈ *U* and parent(*p*)∉*U*. We have *p* ∈⊥(*U*) if and only if parent(*p*) ∈ *U* and *p*∉*U*.
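Both delimiting sets follow directly from a child-to-parent map; a sketch under the assumption that positions are encoded as dictionary keys (a hypothetical representation):

```python
def top(U, parent):
    """⊤(U): positions in U whose parent lies outside U."""
    return {p for p in U if parent.get(p) not in U}

def bottom(U, parent):
    """⊥(U): positions outside U whose parent lies in U."""
    return {p for p in parent if parent[p] in U and p not in U}
```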

### 6.4 Induction of LCFRS/sDCP Hybrid Grammars from Phrase Structures

Grammar induction is relatively straightforward for a given hybrid tree *h* = (*s*,≤_{s}) that is a phrase structure, and a given recursive partitioning π. For each node of π, its label is a set of positions of str(*h*), and these positions must each correspond to a leaf node of *s*, by virtue of *h* being a phrase structure. We can apply a closure operation on this set of leaf nodes that includes a node if all of its children are included. Formally, let *J* be a label of π. Then *C*(*J*) is the smallest set *U* ⊆pos(*s*) satisfying (i) Π(*J*) ⊆ *U* and (ii) if *p* ∈pos(*s*), children(*p*)≠∅, and children(*p*) ⊆ *U*, then *p* ∈ *U*.
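The closure operation is a fixpoint computation; a sketch assuming a map from each internal position to its list of children (our own encoding, for illustration):

```python
def closure(nodes, children):
    """C(J): smallest superset U of the given leaf positions such
    that a node joins U as soon as all of its children are in U."""
    U = set(nodes)
    changed = True
    while changed:
        changed = False
        for p, ch in children.items():
            if ch and p not in U and all(c in U for c in ch):
                U.add(p)
                changed = True
    return U
```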

The set *C*(*J*) corresponds to a set of (maximal, disjoint) sub-s-terms of *s*, that is, *C*(*J*) can be partitioned such that each part contains the positions that correspond to one sub-s-term. These parts can be arranged according to the lexicographical ordering on positions in *s*. Formally, for each set *U* ⊆pos(*s*), we define s-rk(*U*) to be the maximal number *k* such that *U* can be partitioned into sets *U*_{1}, … , *U*_{k} where:

- for every *i* ∈ [*k*], *p* ∈ *U*_{i}, and *p*′ ∈ *U*: if *p*′ = parent(*p*) or *p*′ = right-sibling(*p*), then *p*′ ∈ *U*_{i}, and
- for every *i*, *j* ∈ [*k*], *p* ∈ *U*_{i}, and *p*′ ∈ *U*_{j}: *p* ≤_{ℓ} *p*′ implies *i* ≤ *j*.

Note that *U*_{1}, … , *U*_{k} are uniquely defined by this, and we write gspans(*U*) = 〈*U*_{1}, … , *U*_{k}〉. The function “gspans” generalizes “spans,” which we defined for recursive partitioning earlier. Similarly, s-rk as used here generalizes the notion of fanout as used for node labels of a recursive partitioning.
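One way to compute gspans is to group each position in *U* with its parent and right sibling whenever those are also in *U*, and then order the resulting groups. The sketch below uses a small union-find; passing the tree as explicit `parent` and `right_sibling` maps, and encoding positions as Gorn-address tuples (so that Python's tuple order is the lexicographic order), are our own choices:

```python
def gspans(U, parent, right_sibling):
    """Partition U into maximal parts closed under parent and
    right-sibling links inside U, ordered by their first position."""
    root = {p: p for p in U}      # union-find forest over U
    def find(p):
        while root[p] != p:
            p = root[p]
        return p
    def union(p, q):
        root[find(p)] = find(q)
    for p in U:
        for q in (parent.get(p), right_sibling.get(p)):
            if q in U:
                union(p, q)
    parts = {}
    for p in sorted(U):           # lexicographic order on positions
        parts.setdefault(find(p), []).append(p)
    return sorted(parts.values(), key=lambda part: part[0])
    # s-rk(U) is then len(gspans(U, parent, right_sibling))
```

For a root with children `(0,)`, `(1,)`, `(2,)`, the set `{(0,), (2,)}` splits into two parts (s-rank 2), while the adjacent siblings `{(0,), (1,)}` form a single part (s-rank 1).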

These concepts allow us to define a mapping from a node *J* in π to the set gspans(*C*(*J*)) of positions of *s*. This mapping is such that if *J*_{1}, … , *J*_{m} are the child nodes of node *J* in π, then each set $U_j^{(i)}$ in gspans(*C*(*J*_{i})) (*i* ∈ [*m*], *j* ∈ [s-rk(*J*_{i})]) is contained in some set $U_q^{(0)}$ in gspans(*C*(*J*)) (*q* ∈ [s-rk(*J*)]). A different way of looking at this is that the image of *J* can be constructed out of the images of the *J*_{i} (*i* ∈ [*m*]), possibly by adding further nodes linking existing sub-s-terms together.

Such composition of s-terms into larger s-terms is realized by a sDCP without inherited arguments, as constructed by Algorithm 5. It builds one production for each node *J*_{0} with children *J*_{1}, … , *J*_{m}. *J*_{0} has a synthesized attribute for each set $U_q^{(0)}$ in gspans(*C*(*J*_{0})). The corresponding s-term *s*_{q} is constructed by traversing the positions in $U_q^{(0)}$ (lines 18–31). A sequence of consecutive positions that are also in some $U_j^{(i)}$ in gspans(*C*(*J*_{i})) is realized by a single variable $x_j^{(i)}$ (lines 21–24). Remaining positions become nodes of *s*_{q} (lines 26–30).

Each rule constructed for a leaf of π selects the relevant position *p*_{i} before selecting the relevant subtree *s*|_{*p*} of *s*. For the indexing of nonterminals, we change line 16 accordingly.

As an example (Example 19), consider the hybrid tree *h* in Figure 4a, in combination with the recursive partitioning extracted from the hybrid tree by Algorithm 3. The children of the root are {1,3} and {2}. The relevant s-term for {1,3} is **V**(**hat**, **gearbeitet**) and the s-term for {2} is **ADV**(**schnell**). Application of Algorithm 5 yields a corresponding sDCP grammar.

If instead the left-branching recursive partitioning is used (Example 20), the s-term for {1,2} consists of **hat** and **ADV**(**schnell**), and the one for {3} is **gearbeitet**. (In a real-world grammar we would have parts of speech occurring above all the words.) Applying Algorithm 5 and synchronizing the resulting sDCP grammar with the second LCFRS of Example 16 yields a LCFRS/sDCP hybrid grammar. The fanout of the LCFRS is 1 and the s-rank of the sDCP is 2.

Figure 16 compares the two derivation trees of the LCFRS/sDCP hybrid grammars of Examples 19 (left) and 20 (right).

We observe the following general property of Algorithm 5: If the recursive partitioning extracted by Algorithm 3 is given as input, then each nonterminal of the induced sDCP has s-rank 1 (as in Example 19). This coincides with pipeline (a) of Figure 11, that is, the induction of a LCFRS of arbitrary fanout. However, if the recursive partitioning is transformed by Algorithm 4, or if the left-branching or right-branching recursive partitioning is used (as in Example 20 and as in pipelines (b) and (c) of Figure 11), then the fanout of the induced LCFRS decreases and its derivations are binarized. At the same time, the number of synthesized arguments in the induced sDCP may increase. In other words, we witness a trade-off between the degree of mild context-sensitivity of the LCFRS and the number of arguments of the sDCP.

We conclude:

For each phrase structure *h* and recursive partitioning π of str(*h*), we can construct a LCFRS/sDCP hybrid grammar *G* such that *G* generates *h* and parses str(*h*) according to π. Moreover, the sDCP that is the second component only has synthesized arguments.

### 6.5 Induction of LCFRS/sCFTG Hybrid Grammars from Phrase Structures

Given a recursive partitioning π and a phrase structure *h* = (*s*,≤_{s}), the construction in Section 6.4 relied on a mapping from a node of π labeled *J* to sets of positions of maximal sub-s-terms in *s* whose yields together cover exactly the positions in *J*. We now say π is **chunky** with respect to *h* if, for each node *J* of π, we have s-rk(*C*(*J*)) = 1, that is, the nodes in its image under the mapping form a single sub-s-term. If π is chunky with respect to *h*, then each sDCP nonterminal in the construction from Section 6.4 will have a single synthesized argument. A sDCP with this property is equivalent to a sCFTG in which all nonterminals have rank 0, that is, a regular tree grammar. Therefore:

For each phrase structure *h* and recursive partitioning π of str(*h*) that is chunky with respect to *h*, we can construct a LCFRS/sCFTG hybrid grammar *G* such that *G* generates *h* and parses str(*h*) according to π. Moreover, the second component of *G* is a regular tree grammar.

### 6.6 Induction of LCFRS/sDCP Hybrid Grammars from Dependency Structures

Let *h* = (*s*,≤_{s}) be a hybrid tree over (Σ,Σ), or in other words, a dependency structure, and let π be a recursive partitioning of str(*h*). The task is to construct a LCFRS/sDCP hybrid grammar that generates *h* and parses str(*h*) according to π.

In the case of phrase structures, we could identify entire subtrees of *s* whose yields corresponded to subsets of node labels of π. A sequence of consecutive such subtrees (i.e., whose roots were siblings) was then translated to a synthesized argument of a sDCP rule. With dependency structures, however, we need to allow for the scenario where we have a label *J* of a node in π and two positions *p* and *p*′ in *s* with parent(*p*′) = *p*, and *p* ∈ Π(*J*) whereas *p*′∉Π(*J*). Thus we will need to consider a subtree of *s* rooted in *p*, with a “gap” for child *p*′. This gap will be implemented by introducing an inherited argument in the sDCP, so that the gap can be filled by a structure built elsewhere.

Like the induction algorithms considered before, Algorithm 6 constructs a rule for each node *p* of π. If *p* is a leaf of π (line 6), labeled with a singleton *J*_{0} = {*i*}, we construct the sDCP rule ⦇*J*_{0}⦈(*x*_{1}, 〈α(〈*x*_{1}〉)〉) →〈〉, where α is the *i*-th element of str(*h*), assuming the node labeled α is not a leaf in *s* (line 10). The i-rank of ⦇*J*_{0}⦈ is 1. Coupled to the corresponding LCFRS rule, this creates a hybrid rule. If the node labeled α is a leaf of *s*, we can dispense with the inherited argument of ⦇*J*_{0}⦈ in the second component and replace α(〈*x*_{1}〉) by α (line 8).

If *p* is an internal node of π, then we proceed as follows. We determine the set Π(*J*_{0}) of positions of *s* that correspond to the numbers in the label of *p*. We compute the sets of positions ⊤(Π(*J*_{0})) and ⊥(Π(*J*_{0})) that delimit Π(*J*_{0}) from the top and bottom, respectively. We bundle consecutive positions in ⊤(Π(*J*_{0})) and ⊥(Π(*J*_{0})) by applying gspans. For brevity, we write ⊤_{max}(*J*_{0}) instead of gspans(⊤(Π(*J*_{0}))) and ⊥_{max}(*J*_{0}) instead of gspans(⊥(Π(*J*_{0}))). The nonterminal ⦇*J*_{0}⦈ is given a synthesized attribute for each set $I_q^{(0)}$ in ⊤_{max}(*J*_{0}) (line 12), and an inherited attribute for each set $O_j^{(0)}$ in ⊥_{max}(*J*_{0}) (line 13).

Analogously, we determine sets $O_j^{(i)}$ and $I_q^{(\ell)}$ of positions of *s* for each child of *p* labeled *J*_{i} (lines 14–15). Each set $O_j^{(i)}$ corresponds to a distinct variable $x_j^{(i)}$ in the sDCP rule. Each set $I_q^{(\ell)}$ corresponds to a s-term $s_q^{(\ell)}$ which combines variables. In particular, each set $I_q^{(\ell)}$ is the (disjoint) union of some of the sets $O_j^{(i)}$. Conversely, each set $O_j^{(i)}$ is disjoint from all but one of the sets $I_q^{(\ell)}$. This fact is used in lines 20–27 to construct the s-term $s_q^{(\ell)}$. Having specified the variables and s-terms, the construction of the sDCP rule is completed in line 18. Much as in Section 6.4, this sDCP rule can be coupled to the corresponding LCFRS rule to form a hybrid rule.

Figure 17 presents part of the hybrid tree from Figure 4b, together with part of a recursive partitioning. Let us use the symbols *p*_{P}, *p*_{M}, *p*_{h}, *p*_{l} for the four positions in the hybrid tree corresponding to the string positions 2,3,5,6. Naturally, Π({2,3,5,6}) ={*p*_{P},*p*_{M},*p*_{h},*p*_{l}}, Π({2,6}) ={*p*_{P},*p*_{l}}, Π({3,5}) ={*p*_{M},*p*_{h}}.

The sequence ⊥_{max}({2,3,5,6}) has length 0. The i-rank of ⦇{3,5}⦈ is 1 as ⊥_{max}({3,5}) contains a single set {*p*_{P}, *p*_{l}}, whereas its s-rank is 2 as ⊤_{max}({3,5}) contains two sets. Because the first (and only) set of ⊥_{max}({2,6}), namely {*p*_{M}}, equals the first set of ⊤_{max}({3,5}), the variable *x*_{2} is shared between the inherited argument of ⦇{2,6}⦈ and the first synthesized argument of ⦇{3,5}⦈.

We have ⊥_{max}({6}) =〈{*p*_{M}}〉 and therefore ⦇{6}⦈ has i-rank 1. Further, ⊤_{max}({2}) =〈{*p*_{P}}〉, ⊤_{max}({6}) =〈{*p*_{l}}〉, and we already saw that ⊤_{max}({2,6}) =〈{*p*_{P}, *p*_{l}}〉. The fact that {*p*_{P}, *p*_{l}} ={*p*_{P}}∪{*p*_{l}} and *p*_{P} <_{ℓ} *p*_{l} explains the concatenation *x*_{2}*x*_{3} in the only synthesized argument in the left-hand side. Figure 18 shows the derivation tree that is induced by the sDCP rules on the given recursive partitioning (again abbreviating each Dutch word by its first letter).
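The delimiter sets in this example can be checked mechanically. The self-contained sketch below encodes only the relevant fragment of the tree in Figure 4b (*p*_{h} dominating *p*_{P} and *p*_{l}, and *p*_{l} dominating *p*_{M}); the dictionary encoding is our own:

```python
# Parent map for the fragment: helpen (p_h) has children Piet (p_P)
# and lezen (p_l); lezen has the child Marie (p_M). The parent of
# helpen lies outside the fragment.
parent = {'p_P': 'p_h', 'p_l': 'p_h', 'p_M': 'p_l'}

def top(U):
    return {p for p in U if parent.get(p) not in U}

def bottom(U):
    return {p for p in parent if parent[p] in U and p not in U}

# {3,5} corresponds to {p_M, p_h}: delimited from below by the sibling
# pair {p_P, p_l} (one bundle, i-rank 1) and from above by p_M and p_h
# separately (two bundles, s-rank 2).
assert bottom({'p_M', 'p_h'}) == {'p_P', 'p_l'}
assert top({'p_M', 'p_h'}) == {'p_M', 'p_h'}

# {2,6} corresponds to {p_P, p_l}: delimited from below by {p_M}.
assert bottom({'p_P', 'p_l'}) == {'p_M'}
assert top({'p_P', 'p_l'}) == {'p_P', 'p_l'}
```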

We conclude:

For each dependency structure *h* and recursive partitioning π of str(*h*), we can construct a LCFRS/sDCP hybrid grammar *G* such that *G* generates *h* and parses str(*h*) according to π.

### 6.7 Induction of LCFRS/sCFTG Hybrid Grammars from Dependency Structures

For dependency structures, we can define a notion of chunkiness similar to that in Section 6.5, relying on the definitions in Section 6.4. We say a recursive partitioning π with respect to a dependency structure *h* = (*s*,≤_{s}) is **chunky** if for every node label *J* of π the length of ⊤_{max}(*J*) is 1. We have seen in Section 6.6 that the length of ⊤_{max}(*J*) determines the number of synthesized arguments, that is, the s-rank, of the corresponding nonterminal in the second component of the constructed hybrid rule. We observed before that a sDCP grammar in which all nonterminals have s-rank 1 is equivalent to a sCFTG. We may conclude:

For each dependency structure *h* and recursive partitioning π of str(*h*) that is chunky with respect to *h*, we can construct a LCFRS/sCFTG hybrid grammar *G* such that *G* generates *h* and parses str(*h*) according to π.

### 6.8 Induction on a Corpus

In the previous sections, we induced a LCFRS/sDCP hybrid grammar *G* from a single phrase structure or dependency structure *h*. Given a training corpus *c* of phrase structures or dependency structures, we now want to induce a single hybrid grammar *G* that generalizes over the corpus. To this end, we apply one of the induction techniques to each hybrid tree *h* in *c*. The resulting grammars are condensed into a *single* hybrid grammar *G* by relabeling the existing nonterminals of the form ⦇*J*⦈. This relabeling should follow a consistent naming scheme, to ensure that hybrid rules constructed from one hybrid tree for adjacent nodes of π can still link together to form a derivation. Beyond that, the naming scheme should also allow rules that were constructed from different hybrid trees to interact, such that *L*(*G*) contains meaningful hybrid trees that were not in *c*.

Two such naming schemes were considered in Nederhof and Vogler (2014) for a corpus of phrase structures. By **strict labeling**, a nonterminal name is chosen to consist of the terminal labels at the roots of the relevant subtrees of *s*. In Example 20, therefore, we would replace ⦇{1,2}⦈ by 〈**hat**,**ADV**〉. (In a more realistic grammar, we would likely have a part of speech instead of **hat**.) This tends to lead to many nonterminal labels for different combinations of terminals. Therefore, an alternative was considered, called **child labeling**. This means that for two or more consecutive siblings in *s*, we collapse their sequence of terminals into a single tag of the form children-of(*X*), where *X* is the terminal label of the parent. This creates far fewer nonterminal names, but conflates different sequences of terminals that may occur below the same terminal. For more details, we refer the reader to Nederhof and Vogler (2014).
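The two schemes might be sketched as follows; the function names and the string format of the labels are our own, not the paper's implementation:

```python
def strict_label(root_labels):
    """Strict labeling: the name records the full sequence of terminal
    labels at the roots of the relevant subtrees."""
    return '<' + ','.join(root_labels) + '>'

def child_label(parent_label):
    """Child labeling: a sequence of consecutive siblings is collapsed
    into one tag determined only by the parent's terminal label."""
    return 'children-of(' + parent_label + ')'
```

For instance, `strict_label(['hat', 'ADV'])` yields `'<hat,ADV>'`, while child labeling would map any sequence of siblings below **hat** to the single tag `'children-of(hat)'`.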

For grammar induction with a corpus of dependency structures, there is the additional complication of inherited arguments. If a LCFRS/sDCP hybrid grammar is induced from a single dependency structure, using nonterminals of the form ⦇*J*⦈ as in Section 6.6, then cycles cannot occur in derivations of the sDCP. This is because, by definition, no cycles occur in the given dependency structure. However, if we replace nonterminals of the form ⦇*J*⦈ by any other choice of symbols, and combine rules induced from different hybrid trees of a corpus, then cycles may arise if the relationship between inherited and synthesized arguments is confused.

One solution is to encode the dependencies between inherited and synthesized attributes into the nonterminal names—that is, for each synthesized attribute the list of inherited attributes from which it receives subtrees is specified (Angelov et al., 2014). Conceptually this is close to the *g* functions in the proof of Theorem 2 in Appendix C. One way to formalize this encoding is as follows.

As explained in Section 6.6, if we have a node label *J* in π, and ⊥_{max}(*J*) =〈*O*_{1}, … , *O*_{k}〉 and ⊤_{max}(*J*) =〈*I*_{1}, … , *I*_{k′}〉, then this leads to the creation of a nonterminal with *k* inherited arguments and *k*′ synthesized arguments, so *k* + *k*′ arguments in total. We can construct a s-term σ in $T_{[k+k']}^{*}$, in which every number in [*k* + *k*′] occurs exactly once. In σ, argument numbers are located relative to one another as the corresponding positions in ⊥_{max}(*J*) ⋅⊤_{max}(*J*) are located in the hybrid tree. More precisely, if the *i*-th element and the *j*-th element of ⊥_{max}(*J*) ⋅⊤_{max}(*J*) represent positions that share a parent, and the positions of the *i*-th precede those of the *j*-th, then the number *i* precedes *j* in the root of the same sub-s-term of σ. Similarly, if the positions of the *j*-th element are descendants of at least one position in the *i*-th element, then *j* occurs as a descendant of *i* in σ.

In combination with strict labeling, the nonterminal ⦇{3,5}⦈ in Example 21 could be replaced by 〈**Piet** **lezen**,**Marie**,**helpen**,σ〉, where σ is the s-term 3(1(2)). The “3” (third argument of the nonterminal) stands for the node in the hybrid tree labeled with **helpen**. Descendants of this node are the two nodes labeled with **Piet** and **lezen**, which belong to the first argument. The “2” (second argument) stands for the node labeled **Marie**, which is a descendant of **lezen** and therefore occurs below the “1.”

## 7. Experiments

In this section, we present the first experimental results on induction of LCFRS/sDCP hybrid grammars from dependency structures and their application to parsing. We further present experiments on constituent parsing similar to those of Nederhof and Vogler (2014), but now with a larger portion of the TIGER corpus (Brants et al., 2004).

The purpose of the experiments is threefold: (i) A proof-of-concept for the induction techniques developed in Section 6 is provided. (ii) The influence of the strategy of recursive partitioning is evaluated empirically, as is the influence of the nonterminal naming scheme. We are particularly interested in how the strategy of recursive partitioning affects the size, parse time, accuracy, and robustness of the induced hybrid grammars. (iii) The performance of our architecture for syntactic parsing based on LCFRS/sDCP hybrid grammar is compared with two existing parsing architectures.

For all experiments, a corpus is split into a training set and a test set. A LCFRS/sDCP hybrid grammar is induced from the training set. Probabilities of rules are determined by relative frequency estimation. The induced grammar is then applied on each sentence of the test set (see Section 5.3), and the parse obtained from the most probable derivation is compared with the gold standard, resulting in a score for each sentence. The average of these scores for all test sentences is computed, weighted by sentence length.
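Relative frequency estimation assigns each rule its count divided by the total count of rules with the same left-hand side. A minimal sketch, assuming rules are represented as (left-hand side, rule body) pairs:

```python
from collections import Counter

def relative_frequencies(rules):
    """rules: list of (lhs_nonterminal, body) pairs observed during
    induction, one entry per rule occurrence in the training corpus.
    Returns a dict mapping each distinct rule to its probability."""
    rule_counts = Counter(rules)
    lhs_counts = Counter(lhs for lhs, _ in rules)
    return {rule: count / lhs_counts[rule[0]]
            for rule, count in rule_counts.items()}
```

For example, if `S` is rewritten twice by one rule and once by another, the two rules receive probabilities 2/3 and 1/3, so the probabilities of all rules with the same left-hand side sum to 1.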

All algorithms are implemented in Python and experiments are run on a server with two 2.6-GHz Intel Xeon E5-2630 v2 CPUs and 64 GB of RAM. Each experiment uses a single thread; the measured running time might be slightly distorted because of the usual load jitter. For probabilistic LCFRS parsing we use two off-the-shelf systems: If the induced grammar’s first component is equivalent to a FA, then we use the OpenFST (Allauzen et al., 2007) framework with the Python bindings of Gorman (2016). Otherwise, we utilize the LCFRS parser of Angelov and Ljunglöf (2014), which is part of the runtime system of the Grammatical Framework (Ranta, 2011).

### 7.1 Dependency Parsing

In our experiments on dependency parsing we use a corpus based on TIGER as provided in the 2006 CoNLL shared task (Buchholz and Marsi, 2006). The task specifies splits of TIGER into a training set (39,216 sentences) and a test set (357 sentences). Each sentence in the corpus consists of a sequence of tokens. A token has up to 10 fields, including the sentence position, form, lemma, **part-of-speech** (POS) tag, sentence position of the head, and **dependency relation** (DEPREL) to the head. In TIGER, 52 POS tags and 46 DEPRELs are used. We adopt the three evaluation metrics from the shared task, namely, the percentages of tokens for which a parser correctly predicts the head (the **unlabeled attachment score** or UAS), the DEPREL (the **label accuracy** or LA), or both head and DEPREL (the **labeled attachment score** or LAS). Punctuation is removed from both training and test sets, and is thereby ignored for the purposes of these metrics. For testing we restrict ourselves to the 281 sentences of up to 20 (non-punctuation) tokens.
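For a single sentence, the three metrics can be computed as in the following sketch; representing each token as a (head, DEPREL) pair is our own encoding, and punctuation is assumed to be removed already:

```python
def attachment_scores(gold, predicted):
    """gold, predicted: lists of (head, deprel) pairs, one per token.
    Returns (UAS, LAS, LA) as fractions of correctly predicted tokens."""
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, predicted)) / n  # head only
    las = sum(g == p for g, p in zip(gold, predicted)) / n        # head and deprel
    la = sum(g[1] == p[1] for g, p in zip(gold, predicted)) / n   # deprel only
    return uas, las, la
```

Averaging these per-sentence scores weighted by sentence length, as described above, amounts to computing the same fractions over all test tokens at once.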

Each terminal symbol of a dependency structure is a pair (*a*, *b*), where *a* is the POS tag and *b* is the DEPREL to the head. Because input strings consist only of POS tags, each terminal symbol in the LCFRS-part of a rule is projected to its first component. Although our formal definition of hybrid rules only allows linking of pairs of terminal symbols that are identical, the definition can be suitably generalized, without changing the theory in any significant way. An example is a generalized hybrid rule in which a POS tag **NN** is linked to a node label (**NN**, **dobj**).

Our experiments with grammar induction are controlled by three parameters, each ranging over a number of values:

- The *naming scheme* for nonterminals can be (i) strict labeling or (ii) child labeling, as outlined in Section 6.8.
- In both naming schemes, sequences of terminal labels from the hybrid tree are composed into nonterminal labels. For each terminal, which is a CoNLL token, we include only particular fields, namely, (i) POS and DEPREL, (ii) POS, or (iii) DEPREL. We call this parameter **argument label**.
- The considered methods to obtain *recursive partitionings* include (i) direct extraction (see Algorithm 3), (ii) transformation to fanout *k* (see Algorithm 4), (iii) right-branching, and (iv) left-branching.

Each choice of values for the parameters determines an **experimental scenario**. In particular, direct extraction in combination with strict labeling achieves traditional LCFRS parsing.

We compare our induction framework with rparse (Maier and Kallmeyer 2010), which induces and trains unlexicalized, binarized, and Markovized LCFRS. To parse with these LCFRS, we choose the runtime system of the Grammatical Framework, because it is faster than rparse's built-in parser (for a comparison, cf. Angelov and Ljunglöf 2014).

Additionally, we use MaltParser (Nivre, Hall, and Nilsson 2006) to compare our approach with a well-established transition-based parsing architecture, which is not state-of-the-art but allows us to disable features of lemmas and word forms, to match our own implementation, which does not take lemmas or word forms as part of the input. The stacklazy strategy (Nivre, Kuhlmann, and Hall 2009) with the LibLinear classifier is used.

*Experimental Results.* Statistics on the induced hybrid grammars and parsing results are listed in Table 2. For the purpose of measuring UAS, LAS, and LA, we take a default dependency structure in the case of a parse failure; in this default structure, the head of the *i*-th word is the (*i* − 1)-th word, and the DEPREL fields are left empty. Figure 19 shows the distributions of the fanout of nonterminals in the LCFRS and the numbers of arguments in the sDCP, depending on the recursive partitioning strategy, if we fix child labeling with POS+DEPREL as argument labels. In the following we discuss trends that can be observed in these data if we change the value of one parameter while keeping others fixed.
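The default structure used upon parse failure can be sketched as follows; the function name and the convention that head 0 denotes the artificial root (as in the CoNLL format) are assumptions on our part:

```python
def default_dependency_structure(n):
    """Fallback for parse failures: the head of the i-th word is the
    (i-1)-th word (the first word attaches to the artificial root 0),
    and all DEPREL fields are left empty."""
    return [(i - 1, '') for i in range(1, n + 1)]
```

This yields a left-branching chain, which is why scenarios with many parse failures still obtain nonzero attachment scores.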

| extraction | arg. lab. | nont. | rules | f_{max} | f_{avg} | fail | UAS | LAS | LA | time |
|---|---|---|---|---|---|---|---|---|---|---|
| **child labeling** |  |  |  |  |  |  |  |  |  |  |
| direct | POS+DEPREL | 4,043 | 61,923 | 4 | 1.06 | 68 | 68.0 | 59.5 | 63.2 | 97 |
| k = 1 | POS+DEPREL | 17,333 | 60,399 | 1 | 1.00 | 9 | 85.8 | 79.7 | 85.5 | 94 |
| k = 2 | POS+DEPREL | 10,777 | 49,448 | 2 | 1.18 | 11 | 85.7 | 79.7 | 85.4 | 103 |
| k = 3 | POS+DEPREL | 10,381 | 48,844 | 3 | 1.19 | 11 | 85.6 | 79.7 | 85.3 | 108 |
| r-branch | POS+DEPREL | 92,624 | 191,341 | 1 | 1.00 | 60 | 68.9 | 60.3 | 65.0 | 27 |
| l-branch | POS+DEPREL | 96,980 | 196,125 | 1 | 1.00 | 56 | 70.3 | 61.9 | 66.7 | 28 |
| direct | POS | 799 | 43,439 | 4 | 1.10 | 23 | 78.2 | 59.7 | 67.2 | 113 |
| k = 1 | POS | 6,060 | 28,880 | 1 | 1.00 | 4 | 83.2 | 65.3 | 73.2 | 70 |
| k = 2 | POS | 2,396 | 20,592 | 2 | 1.38 | 4 | 83.8 | 65.4 | 73.1 | 86 |
| k = 3 | POS | 2,121 | 20,100 | 3 | 1.44 | 4 | 83.4 | 65.2 | 73.1 | 88 |
| r-branch | POS | 47,661 | 123,367 | 1 | 1.00 | 23 | 77.9 | 59.8 | 68.1 | 53 |
| l-branch | POS | 49,203 | 125,406 | 1 | 1.00 | 26 | 78.5 | 60.2 | 67.5 | 51 |
| direct | DEPREL | 527 | 33,844 | 4 | 1.10 | 1 | 79.5 | 72.4 | 82.7 | 251 |
| k = 1 | DEPREL | 4,344 | 21,613 | 1 | 1.00 | 1 | 78.0 | 70.6 | 81.4 | 172 |
| k = 2 | DEPREL | 1,739 | 15,184 | 2 | 1.35 | 1 | 78.7 | 70.9 | 81.6 | 263 |
| r-branch | DEPREL | 40,239 | 99,113 | 1 | 1.00 | 1 | 77.2 | 69.1 | 80.7 | 94 |
| l-branch | DEPREL | 37,535 | 92,390 | 1 | 1.00 | 1 | 78.1 | 69.8 | 80.9 | 85 |
| **strict labeling** |  |  |  |  |  |  |  |  |  |  |
| direct | POS+DEPREL | 48,404 | 106,284 | 4 | 1.01 | 68 | 68.0 | 59.5 | 63.2 | 122 |
| k = 1 | POS+DEPREL | 103,425 | 162,124 | 1 | 1.00 | 57 | 72.2 | 64.2 | 68.3 | 185 |
| k = 2 | POS+DEPREL | 92,395 | 150,013 | 2 | 1.08 | 55 | 72.0 | 64.4 | 68.6 | 266 |
| k = 3 | POS+DEPREL | 91,412 | 149,106 | 3 | 1.08 | 55 | 72.0 | 64.4 | 68.6 | 240 |
| r-branch | POS+DEPREL | 251,536 | 338,294 | 1 | 1.00 | 141 | 44.6 | 31.4 | 32.7 | 54 |
| l-branch | POS+DEPREL | 264,190 | 349,299 | 1 | 1.00 | 137 | 45.3 | 32.6 | 34.2 | 53 |
| direct | POS | 29,165 | 80,695 | 4 | 1.00 | 23 | 77.6 | 59.6 | 67.5 | 120 |
| k = 1 | POS | 62,769 | 115,363 | 1 | 1.00 | 18 | 78.7 | 60.9 | 69.1 | 201 |
| k = 2 | POS | 54,082 | 104,390 | 2 | 1.09 | 18 | 79.5 | 61.3 | 69.4 | 237 |
| k = 3 | POS | 53,186 | 103,503 | 3 | 1.11 | 18 | 79.6 | 61.4 | 69.5 | 231 |
| r-branch | POS | 181,432 | 277,201 | 1 | 1.00 | 98 | 55.1 | 36.4 | 40.8 | 88 |
| l-branch | POS | 190,890 | 286,273 | 1 | 1.00 | 108 | 52.2 | 33.9 | 37.7 | 87 |
| direct | DEPREL | 17,047 | 53,342 | 4 | 1.00 | 3 | 82.4 | 76.5 | 84.6 | 178 |
| k = 1 | DEPREL | 37,333 | 71,423 | 1 | 1.00 | 1 | 83.2 | 78.0 | 85.8 | 188 |
| k = 2 | DEPREL | 31,956 | 63,487 | 2 | 1.08 | 2 | 83.0 | 77.5 | 85.5 | 231 |
| r-branch | DEPREL | 126,841 | 197,261 | 1 | 1.00 | 2 | 80.8 | 74.4 | 83.1 | 101 |
| l-branch | DEPREL | 124,722 | 192,922 | 1 | 1.00 | 2 | 81.2 | 74.8 | 83.5 | 96 |
| rparse (v = 1, h = 5) |  | 46,799 | 72,962 | 5 | 1.07 | 2 | 85.3 | 79.0 | 86.4 | 228 |
| MaltParser, unlexicalized, stacklazy |  |  |  |  |  | 0 | 88.2 | 83.7 | 88.7 | 2 |


*Naming Scheme.* As expected, child labeling leads to significantly fewer distinct nonterminals than strict labeling, by a factor of between 2.5 and 37, depending on the other parameter values. This makes the tree language less refined, and one may expect the scores therefore to be generally lower. However, in many cases where strict labeling suffers from a higher proportion of parse failures (so that the parser must fall back on the default structure), it is child labeling that has the higher scores.

*Argument Labels.* Including only POS tags in nonterminal labels generally leads to high UAS but low LA. By including DEPRELs but not POS tags, the LA and LAS are higher in all cases. For child labeling, this is at the expense of a lower UAS, in all cases except one. For strict labeling, UAS is higher in all cases. There are also fewer parse failures, which may be because the number of DEPRELs is smaller than the number of POS tags.

With the combination of POS tags and DEPRELs, the tree language can be most accurately described, and we see that this achieves some of the highest UAS and LAS in the case of child labeling. However, the scores can be low in the presence of many parse failures, due to the fall-back on the default structure. This holds in particular in the case of strict labeling.

*Recursive Partitionings.* Concerning the choice of the method to obtain recursive partitionings, the baseline is direct extraction, which produces an unrestricted LCFRS in the first component. For instance, for child labeling and POS+DEPREL as argument labels, the induced LCFRS has fanout 4, and about 88% of the rules have three or more nonterminals on the right-hand side. In total there are 4,043 nonterminals, the majority of which have fanout 1. Reduction of the fanout to 1 ≤ *k* ≤ 3 leads to a binarized grammar with smaller fanout, as desired, but the number of nonterminals is quadrupled. By transforming a recursive partitioning with parameter *k* ≥ 2, the average fanout *f*_{avg} of nonterminals may in fact increase, which is somewhat counter-intuitive. Whereas the distribution of nonterminals with fanout 1, 2, 3 is 94.8%, 4.2%, 0.9% in the case of direct extraction, it changes to 83.4%, 14.2%, 2.5% for the transformation with *k* = 2. We suspect the increase is due to the high fanout of newly introduced nodes (see line 8 of Algorithm 4).

We can further see in Figure 19 that the number of arguments in the sDCP increases significantly once the fanout is restricted to *k* = 1. The binarization involved in the process of reducing the fanout to some value of *k* seems to improve the ability of the grammar to generalize over the training data, as both the number of parse failures drops and the scores increase. The exact choice of *k* has little impact on the scores, however.

The measured parse times in most cases increase if higher values of *k* are chosen. In some cases, however, the parse times are lower for direct extraction. This may be explained by the higher average fanout and differences in the sizes of the grammars.

The left-branching and right-branching recursive partitionings lead to many specialized nonterminal symbols (and rules). In comparison with the partitioning strategies discussed earlier, we observe that a nonterminal can have up to 10 sDCP arguments, the average lying between 3 and 4. The scores are often worse and there tend to be more parse failures. However, in the case of child labeling with DEPREL as argument label, the scores are very similar. Left-branching seems to lead to slightly higher scores than right-branching recursive partitioning, except for POS as argument label. With left-branching and right-branching recursive partitionings, parsing tends to be faster than with the other recursive partitionings. However, in many cases we do not observe the predicted asymptotic differences, which may be due to larger grammar sizes.

Overall, child labeling with POS+DEPREL and *k* = 1 and strict labeling with DEPREL and *k* = 1 turned out to be the best choices for maximizing UAS/LAS and LA, respectively. The former scenario slightly outperforms rparse with respect to UAS and LAS and parse time whereas rparse obtains better LA. Still, MaltParser outperforms these experimental scenarios with respect to both the obtained scores and the measured parse times. Our restriction to sentence length 20, which we mentioned earlier, was motivated by the excessive computational costs for instances with high (average) fanout.

The most suitable choices for naming scheme, argument labeling, and recursive partitioning may differ between languages and between annotation schemes. For instance, for the NEGRA (Skut et al., 1997) benchmark used by Maier and Kallmeyer (2010), we obtain the results in Table 3, which suggest the combination of child labeling, DEPREL, and *k* = 2 is the most suitable choice if LAS is the target score. It appears that scenarios with POS+DEPREL give lower scores than DEPREL, mainly because of the many parse failures, despite the fact that such argument labeling can be expected to lead to more refined tree languages. For this reason, we also consider a cascade of three scenarios with child labeling, *k* = 1, and argument labels set to POS+DEPREL, POS, or DEPREL, respectively, where in the case of parse failure we fall back to the next scenario. This cascade achieves better scores than any individual experimental scenario and the additional parse time with respect to POS+DEPREL alone is small. Both the single best scenario and the cascade perform better than the baseline LCFRS (rparse simple) induced with the algorithm by Kuhlmann and Satta (2009) and the LCFRS that is obtained from the baseline by binarization, vertical Markovization *v* = 1, and horizontal Markovization *h* = 3. Again, hybrid grammars fall short of MaltParser. We do not include results for NEGRA with the strict labeling strategy, because the numbers of parse failures are too high and, consequently, accuracies are too low to be of interest.
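The cascade amounts to a simple fallback loop over parsers ordered from most to least refined argument labeling; the sketch below uses our own function names and represents a parse failure as `None`:

```python
def cascade_parse(sentence, parsers, fallback):
    """Try each parser in order; return the first successful parse,
    or the default structure if all of them fail."""
    for parse in parsers:
        result = parse(sentence)
        if result is not None:   # None signals a parse failure
            return result
    return fallback(sentence)
```

Here `parsers` would hold the grammars induced with POS+DEPREL, POS, and DEPREL argument labels, in that order, and `fallback` would produce the default dependency structure described earlier.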

| extraction | arg. lab. | nont. | rules | *f*_{max} | *f*_{avg} | fail | UAS | LAS | UAS | LAS | LA | time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **child labeling** | | | | | | | | | | | | |
| direct | P+D | 4,739 | 27,042 | 7 | 1.10 | 693 | 51.7 | 40.7 | 52.2 | 40.8 | 42.6 | 253 |
| *k* = 1 | P+D | 13,178 | 35,071 | 1 | 1.00 | 202 | 75.9 | 68.7 | 77.0 | 69.1 | 73.3 | 288 |
| *k* = 2 | P+D | 11,156 | 32,231 | 2 | 1.17 | 195 | 76.5 | 69.6 | 77.7 | 70.1 | 74.2 | 355 |
| r-branch | P+D | 42,577 | 79,648 | 1 | 1.00 | 775 | 45.5 | 33.1 | 45.7 | 32.8 | 34.7 | 49 |
| l-branch | P+D | 40,100 | 75,321 | 1 | 1.00 | 768 | 45.8 | 33.4 | 46.0 | 33.2 | 35.0 | 45 |
| direct | POS | 675 | 19,276 | 7 | 1.24 | 303 | 68.7 | 51.5 | 69.3 | 50.0 | 55.4 | 300 |
| *k* = 1 | POS | 3,464 | 15,826 | 1 | 1.00 | 30 | 81.7 | 65.5 | 82.5 | 63.5 | 70.6 | 244 |
| *k* = 2 | POS | 2,099 | 13,347 | 2 | 1.40 | 35 | 81.6 | 65.2 | 82.4 | 63.3 | 70.5 | 410 |
| r-branch | POS | 19,804 | 51,733 | 1 | 1.00 | 372 | 62.7 | 46.4 | 62.7 | 44.7 | 50.6 | 222 |
| l-branch | POS | 17,240 | 45,883 | 1 | 1.00 | 342 | 63.7 | 47.4 | 63.9 | 45.6 | 51.4 | 197 |
| direct | DEP | 2,505 | 19,511 | 7 | 1.13 | 3 | 78.5 | 72.2 | 78.9 | 71.6 | 78.6 | 484 |
| *k* = 1 | DEP | 8,059 | 22,613 | 1 | 1.00 | 1 | 78.5 | 71.7 | 79.5 | 71.7 | 79.0 | 608 |
| *k* = 2 | DEP | 6,651 | 20,314 | 2 | 1.20 | 1 | 78.7 | 72.1 | 79.8 | 72.0 | 79.2 | 971 |
| *k* = 3 | DEP | 6,438 | 19,962 | 3 | 1.25 | 1 | 78.6 | 72.0 | 79.5 | 71.9 | 79.1 | 1,013 |
| r-branch | DEP | 27,653 | 54,360 | 1 | 1.00 | 2 | 76.0 | 68.4 | 76.3 | 67.5 | 76.1 | 216 |
| l-branch | DEP | 25,699 | 50,418 | 1 | 1.00 | 1 | 75.8 | 68.4 | 76.2 | 67.6 | 76.1 | 198 |
| cascade: child labeling, *k* = 1, P+D/POS/DEP | | | | | | 1 | 83.2 | 76.2 | 84.3 | 76.1 | 81.6 | 325 |
| LCFRS (Maier and Kallmeyer, 2010) | | | | | | – | 79.0 | 71.8 | – | – | – | – |
| rparse simple | | 920 | 18,587 | 7 | 1.37 | 56 | 77.1 | 70.6 | 77.3 | 70.0 | 76.2 | 350 |
| rparse (*v* = 1, *h* = 3) | | 40,141 | 61,450 | 7 | 1.10 | 13 | 78.4 | 72.2 | 78.5 | 71.4 | 79.0 | 778 |
| MaltParser, unlexicalized, stacklazy | | | | | | 0 | 85.0 | 80.2 | 85.6 | 80.0 | 85.0 | 24 |


Note that the hybrid grammar obtained with child labeling, DEPREL, and direct extraction should, in principle, be similar to the baseline LCFRS. Indeed, both grammars have the same fanout and similar numbers of rules. The differences in size can be explained by the separation of terminal generating and structural rules during hybrid grammar induction, which is not present in the induction algorithm by Kuhlmann and Satta (2009). This separation also leads to different generalization over the training data, as the higher accuracy and the lower numbers of parse failures for the hybrid grammar indicate.

One possible refinement of this result is to choose different argument labels for inherited and synthesized arguments. One may also use form or lemma next to, or instead of, POS tags and DEPRELs, possibly in combination with smoothing techniques to handle unknown words. Another subject for future investigation is the use of splitting and merging (Petrov et al., 2006) to determine parts of nonterminal names. New recursive partitioning strategies may be developed, and one could consider blending grammars that were induced using different recursive partitioning strategies.

## 7.2 Constituent Parsing

The experiments for constituent parsing are carried out as in Nederhof and Vogler (2014) but on a larger portion of the TIGER corpus: We now use the first 40,000 sentences for training (omitting 10 where a single tree does not span the entire sentence), and from the remaining 10,474 sentences we remove the ones with length greater than 20, leaving 7,597 sentences for testing. The results are displayed in Table 4 and support conclusions similar to those in Nederhof and Vogler (2014). Note that for direct extraction both labeling strategies yield the same grammar; thus, only one entry is shown. Unsurprisingly, the larger training set leads to smaller proportions of parse failures and to improvements of F-measure. Another consequence is that the more fine-grained strict labeling now outperforms child labeling, except in the case of right-branching and left-branching, where the numbers of parse failures push the F-measure down.

| extraction | nont. | rules | fail | R | P | F1 | # gaps | time |
|---|---|---|---|---|---|---|---|---|
| **strict labeling** | | | | | | | | |
| direct | 104 | 31,656 | 10 | 77.5 | 77.7 | 76.9 | 0.0139 | 2,098 |
| *k* = 1 | 26,936 | 62,221 | 11 | 77.1 | 77.1 | 76.4 | 0.0136 | 2,514 |
| *k* = 2 | 19,387 | 51,414 | 9 | 77.5 | 77.8 | 76.9 | 0.0136 | 2,892 |
| *k* = 3 | 18,685 | 50,678 | 9 | 77.5 | 77.8 | 76.9 | 0.0135 | 2,886 |
| r-branch | 164,842 | 248,063 | 658 | 61.2 | 58.2 | 59.1 | 0.0135 | 1,341 |
| l-branch | 284,816 | 348,536 | 3,662 | 37.7 | 34.8 | 35.7 | 0.0131 | 2,143 |
| **child labeling** | | | | | | | | |
| *k* = 1 | 2,117 | 13,877 | 1 | 75.3 | 74.9 | 74.5 | 0.0140 | 1,196 |
| *k* = 2 | 473 | 8,078 | 1 | 75.6 | 75.4 | 74.9 | 0.0144 | 1,576 |
| *k* = 3 | 176 | 7,352 | 1 | 75.7 | 75.4 | 74.9 | 0.0144 | 1,667 |
| r-branch | 27,222 | 83,035 | 36 | 75.0 | 74.4 | 74.1 | 0.0148 | 502 |
| l-branch | 87,171 | 162,645 | 137 | 74.5 | 73.9 | 73.6 | 0.0146 | 702 |


Note that the average numbers of gaps per constituent are very similar for the different recursive partitionings. One may observe once more that the parse times of right-branching and left-branching recursive partitionings are higher than one might expect from the asymptotic time complexities. This is again due to the considerable sizes of the grammars.

## 8. Related Work

Two kinds of discriminative models of non-projective dependency parsing have been intensively studied. One is based on an algorithm for finding the maximum spanning tree (MST) of a weighted directed graph, where the vertices are the words of a sentence, and each edge represents a potential dependency relation (McDonald et al., 2005). In principle, any non-projective structure can be obtained, depending on the weights of the edges. A disadvantage of MST dependency parsing is that it is difficult to put any constraints on the desirable structures beyond the local constraints encoded in the edge weights (McDonald and Pereira, 2006).
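To make the MST formulation concrete, the following sketch finds the maximum spanning arborescence by brute-force enumeration over head assignments. Real MST parsers use the Chu-Liu/Edmonds algorithm instead of enumeration, but for tiny examples the exhaustive version illustrates the same objective; the scoring matrix is an invented example, not from the cited work.

```python
from itertools import product

def mst_parse(score, root=0):
    """Brute-force maximum spanning arborescence for tiny graphs.

    score[h][d] is the weight of a potential dependency arc from head h
    to dependent d; vertex `root` is an artificial root node. Returns a
    dict mapping each non-root word to its chosen head.
    """
    n = len(score)
    deps = [d for d in range(n) if d != root]
    best, best_heads = float("-inf"), None
    # Enumerate every assignment of a head to each non-root word.
    for heads in product(range(n), repeat=len(deps)):
        assignment = dict(zip(deps, heads))
        if any(h == d for d, h in assignment.items()):
            continue  # no self-loops
        # Keep only assignments where every word reaches the root,
        # i.e. the arcs form a tree (no cycles).
        if not all(_reaches_root(d, assignment, root) for d in deps):
            continue
        total = sum(score[h][d] for d, h in assignment.items())
        if total > best:
            best, best_heads = total, assignment
    return best_heads

def _reaches_root(d, assignment, root):
    seen = set()
    while d != root:
        if d in seen:
            return False  # cycle detected
        seen.add(d)
        d = assignment[d]
    return True
```

Note how the only constraint imposed is treeness: any non-projective structure can win if its arc weights are high enough, which is exactly the property (and the limitation) discussed above.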

The second kind of model involves stack-based transition systems, with an added transition that swaps stack elements. In particular, Nivre (2009) introduced a deterministic system that uses a classifier to determine the next transition to be applied. The worst-case time complexity is quadratic, and the expected complexity is linear. The classifier relies on features that look at neighboring words in the sentence, as well as at vertical and horizontal context in the syntactic tree. Advances in learning the relevant features are due to Chen and Manning (2014).
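A minimal executor for such a swap-based transition system can illustrate how non-projective arcs arise; this is an arc-standard sketch, and the exact transition inventory of Nivre (2009) differs in details. SWAP returns the second-topmost stack element to the buffer, so words can be reordered before arcs are built.

```python
# Minimal executor for a swap-based, arc-standard transition system.
# Given a transition sequence it reconstructs the (possibly
# non-projective) arcs; a real parser would instead choose each
# transition with a trained classifier.

def run_transitions(n, transitions):
    """Apply transitions for a sentence of n words (ids 1..n, root 0).

    Returns the set of arcs (head, dependent)."""
    stack, buffer, arcs = [0], list(range(1, n + 1)), set()
    for t in transitions:
        if t == "shift":
            stack.append(buffer.pop(0))
        elif t == "swap":          # second-topmost goes back to buffer
            j = stack.pop()
            i = stack.pop()
            stack.append(j)
            buffer.insert(0, i)
        elif t == "left-arc":      # top of stack governs the item below
            j = stack.pop()
            i = stack.pop()
            arcs.add((j, i))
            stack.append(j)
        elif t == "right-arc":     # item below governs top of stack
            j = stack.pop()
            arcs.add((stack[-1], j))
    return arcs
```

For a three-word sentence, the sequence shift, shift, swap, shift, shift, left-arc, right-arc, right-arc produces the non-projective arc set {(3, 1), (2, 3), (0, 2)}, in which word 3 governs word 1 across the intervening word 2.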

In this article we have assumed a generative model of hybrid grammars, which differs from deterministic, stack-based models in at least two ways, one of which is superficial whereas the other is more fundamental. The superficial difference is the presence of nonterminals in hybrid grammars. These, however, fulfill a role that is comparable to that of sets of features used by classifiers. We conjecture that machine learning techniques could even be introduced to create more refined nonterminals for hybrid grammars. The more fundamental difference lies in the determinism of the discussed stack-based models, which is difficult to realize for hybrid grammars, except perhaps in the case of left-branching and right-branching recursive partitionings. Without determinism, the time complexity grows with the fanout that we allow for the left components of hybrid grammars. What we get in return for the higher running time are more powerful models for parsing the input.

As in the case of MST dependency parsing, the discussed stack-based transition systems can in principle produce any non-projective structure. The non-projectivity allowed by hybrid grammars is determined by the hybrid rules, which in turn are determined by non-projectivity that occurs in the training data. This means that there is no restriction per se on non-projectivity in structures produced by a parser for test data, provided the training data contains an adequate amount of non-projectivity. This even holds if we restrict the fanout of the first component, although we may then need more training data to obtain the same coverage and accuracy.

Many established algorithms for constituent parsing rely on generative models and grammar induction (Collins, 1997; Charniak, 2000; Klein and Manning, 2003; Petrov et al., 2006). Most of these are unable to produce discontinuous structures. A notable exception is Maier and Søgaard (2008), where the induced grammar is a LCFRS. Such a grammar can be seen as a special case of a LCFRS/sDCP hybrid grammar, with the restriction that each nonterminal has a single synthesized argument. This restriction limits the power of hybrid grammars. In particular, discontinuous structures can now only be produced if the fanout is strictly greater than 1, which also implies the time complexity is more than cubic.

Generative models proposed for dependency parsing have often been limited to projective structures, based on either context-free grammars (Eisner, 1996; Klein and Manning, 2004) or tree substitution grammars (Blunsom and Cohn, 2010). Exceptions are recent models based on LCFRS (Maier and Kallmeyer, 2010; Kuhlmann, 2013). As in the case of constituent parsing, these can be seen as restricted LCFRS/sDCP hybrid grammars.

A related approach is from Satta and Kuhlmann (2013). Although it does not use an explicit grammar, there is a clear link to mildly context-sensitive grammar formalisms, in particular lexicalized TAG, following from the work of Bodirsky, Kuhlmann, and Möhl (2005).

The flexibility of generative models has been demonstrated by a large body of literature. Applications and extensions of generative models include syntax-based machine translation (Charniak, Knight, and Yamada, 2003), discriminative reranking (Collins, 2000), and involvement of discriminative training criteria (Henderson, 2004). It is very likely that these apply to hybrid grammars as well. Further discussion is outside the scope of this article.

## Appendix A. Single-Synthesized sDCPs Have the Same s-term Generating Power as sCFTGs

The proof of Theorem 1 is as follows.

*Proof*. Let *G* be a single-synthesized sDCP. We construct a sCFTG *G*′ such that [*G*] = [*G*′]. For each nonterminal *A* from *G*, there will be one nonterminal *A*′ in *G*′, with rk(*A*′) = i-rk(*A*).

Let $A_0(x^{(0)}_{1,k_0}, s^{(0)}) \to \langle A_1(s^{(1)}_{1,k_1}, x^{(1)}), \ldots, A_n(s^{(n)}_{1,k_n}, x^{(n)})\rangle$ be a rule of *G*, with *k*_{m} = i-rk(*A*_{m}) for *m* ∈ [*n*]_{0}. Assume without loss of generality that $x^{(0)}_{1,k_0} = x_{1,k_0}$. Then *G*′ will contain the rule $A'_0(x_{1,k_0}) \to \mathrm{rhs}(s^{(0)})$, where rhs replaces each variable $x^{(m)}$ (*m* ∈ [*n*]) by $A'_m(\mathrm{rhs}(s^{(m)}_1), \ldots, \mathrm{rhs}(s^{(m)}_{k_m}))$, leaving terminal symbols and the variables $x_{1,k_0}$ unchanged. One can then show that $\langle A(s_{1,k}, s)\rangle \Rightarrow^*_{G} \langle\rangle$ if and only if $A'(s_1, \ldots, s_k) \Rightarrow^*_{G'} s$, for all nonterminals *A* of *G* and s-terms *s*, *s*_{1}, …, *s*_{k}. Hence [*G*] = [*G*′].

There is an inverse transformation from a sCFTG to a sDCP with s-rk(*A*) = 1 for each *A*. This is as straightforward as the transformation above. In fact, one can think of sCFTGs and sDCPs with s-rank restricted to 1 as syntactic variants of one another.

A more extensive treatment of a very closely related result is from Mönnich (2010), who refers to a special form of attributed tree transducers in place of sDCPs. This result for tree languages was inspired by an earlier result by Duske et al. (1977) for string languages, involving (non-simple) macro grammars (with IO derivation) and simple-L-attributed grammars. For arbitrary s-rank, a related result is that attributed tree transductions are equivalent to attribute-like macro tree transducers, as shown by Fülöp and Vogler (1999).

## Appendix B. The Class of s-term Languages Induced by sDCP Is Strictly Larger than that Induced by sCFTG

Consider a sDCP *G* with s-rk(*S*) = 1, s-rk(*A*) = 2, and i-rk(*A*) = 0. It should be clear that the induced s-term language [*G*] contains trees of the form **A**(*s*, *s*′), where *s* and *s*′ are s-terms over {**B**} and {**B**′}, respectively, such that pos(*s*) = pos(*s*′) and *s*′ is obtained from *s* by replacing each **B** by **B**′.

In Section 5 of Engelfriet and Filè (1981) it was proved that [*G*], viewed as a language of binary trees, cannot be induced by 1S-AG, that is, attribute grammars with one synthesized attribute (and any number of inherited attributes). Although single-synthesized sDCP can generate s-terms (and not only trees, as 1S-AG can), it is easy to see that this extra power does not help to induce [*G*]. Thus we conclude that there is no single-synthesized sDCP that induces [*G*].

## Appendix C. sDCP Have the Same String Generating Power as LCFRS

The proof of Theorem 2 is as follows.

*Proof*. We show that for each sDCP *G*_{1} there is a LCFRS *G*_{2} such that [*G*_{2}] = pre([*G*_{1}]), where pre was extended from s-terms to sets of s-terms in the obvious way. Our construction first produces sDCP *G*′_{1} from *G*_{1} by replacing every s-term *s* in every rule of *G*_{1} by pre(*s*). It is not difficult to see that [*G*′_{1}] = pre([*G*_{1}]).

Next, for every nonterminal *A* in *G*′_{1}, with i-rk(*A*) = *k*′ and s-rk(*A*) = *k*, we introduce nonterminals of the form *A*^{(g)}, where *g* is a mapping from [*k*] to sequences of numbers in [*k*′], such that each *j* ∈ [*k*′] occurs precisely once in *g*(1) ⋅ … ⋅ *g*(*k*). The intuition is that if $g(i) = \langle j_1, \ldots, j_{p_i}\rangle$, then a value appearing as the *j*_{q}-th inherited argument of *A* reappears as part of the *i*-th synthesized argument, and it is the *q*-th inherited argument to do so. (This concept is similar to the argument selector in Courcelle and Franchi-Zannettacci [1982, page 175].) For each *A*, only those functions *g* are considered that are consistent with at least one subderivation of *G*′_{1}.
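The argument selector can be illustrated with a small sketch, under a hypothetical encoding: an s-term is a nested Python list whose leaves are either terminal strings or pairs `("x", j)` standing for the *j*-th inherited-argument variable. Then *g* is read off a nonterminal's synthesized arguments by listing variable indices left to right.

```python
# Hypothetical illustration of the argument selector function g:
# g(i) lists, in left-to-right order, the indices of the inherited
# variables that reappear in the i-th synthesized argument; across
# all i, each variable occurs exactly once.

def argument_selector(synth_args):
    """synth_args: list of s-terms, one per synthesized argument.

    Returns {i: [j_1, ..., j_{p_i}]} with 1-based argument positions."""
    def collect(term):
        if isinstance(term, tuple) and term[0] == "x":
            return [term[1]]            # inherited variable x_j
        if isinstance(term, list):      # sequence: recurse in order
            return [j for t in term for j in collect(t)]
        return []                       # terminal symbol
    return {i + 1: collect(s) for i, s in enumerate(synth_args)}
```

For instance, a single synthesized argument σ *x*_{2} α *x*_{1} yields *g*(1) = ⟨2, 1⟩, because *x*_{2} occurs before *x*_{1}.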

We will show that, when replacing a nonterminal *A*_{m} by $A_m^{(g_m)}$ as the *m*-th member in the right-hand side of a rule of the form of Equation (3), a variable $x^{(m)}_i$ appearing as the *i*-th synthesized argument is split up into several new variables $x^{(m,0)}_i$, …, $x^{(m,p^{(m)}_i)}_i$. These variables are drawn from *X* so as not to clash with variables used before. The $p^{(m)}_i$ terms of the inherited arguments whose indices are listed by *g*_{m}(*i*) will be shifted to the left-hand side of a new rule to be constructed, interspersed with the $p^{(m)}_i + 1$ new variables. This shifting of terms may happen along dependencies between inherited and synthesized arguments of other members in the right-hand side. (This shifting is again very similar to the one for obtaining the rules of an IO-macro grammar from a simple L-attributed grammar [Duske et al., 1977, Def. 6.1].)

We construct *G*″_{1} from *G*′_{1} by iteratively adding more elements to its set of nonterminals and to its set of rules, until no more new nonterminals and rules can be found. In each iteration, we consider a rule from *G*′_{1} of the form $A_0(x^{(0)}_{1,k'_0}, s^{(0)}_{1,k_0}) \to \langle A_1(s^{(1)}_{1,k'_1}, x^{(1)}_{1,k_1}), \ldots, A_n(s^{(n)}_{1,k'_n}, x^{(n)}_{1,k_n})\rangle$, and a choice of functions *g*_{1}, …, *g*_{n} such that the nonterminals $A_m^{(g_m)}$ (*m* ∈ [*n*]) were found before. We define a function exp that expands terms and s-terms accordingly. For each *i* ∈ [*k*_{0}], define $g_0(i) = \langle j_1, \ldots, j_{p_i}\rangle$, where $\exp(s^{(0)}_i)$ is of the form $r_0\, x^{(0)}_{j_1}\, r_1 \cdots x^{(0)}_{j_{p_i}}\, r_{p_i}$ for suitable s-terms $r_j$ (*j* ∈ [*p*_{i}]_{0}). *G*″_{1} will now contain the corresponding rule, in which each variable $x^{(m)}_i$ is replaced by the new variables $x^{(m,0)}_i, \ldots, x^{(m,p^{(m)}_i)}_i$, for *m* ∈ [*n*] and *i* ∈ [*k*_{m}]. This also defines the nonterminal $A_0^{(g_0)}$ to be added to *G*″_{1} if it does not already exist.

For the relation between *G*′_{1} and *G*″_{1} we allow the sentential forms in derivations of a sDCP to contain variables; they can be viewed as extra nullary symbols. Moreover, each derivation of the form $\langle A(x_{1,k'_0}, s_{1,k_0})\rangle \Rightarrow^*_{G'_1} \langle\rangle$ induces in an obvious way an argument selector function *g*. Let *A* be a nonterminal with *k*′_{0} inherited arguments and *k*_{0} synthesized arguments. Let $s_{1,k_0}$ be s-terms in $T_\Sigma(X_{k'_0})$ such that each $x_i \in X_{k'_0}$ occurs exactly once in $s_{1,k_0}$. For each *j* ∈ [*k*_{0}], let $s_j$ be of the form $r^{(j)}_0\, x_{j_1}\, r^{(j)}_1 \cdots x_{j_{p_j}}\, r^{(j)}_{p_j}$, and define *g* by $g(j) = \langle j_1, \ldots, j_{p_j}\rangle$ for each *j* ∈ [*k*_{0}].

Then the following two statements are equivalent:

1. $\langle A(x_{1,k'_0}, s_{1,k_0})\rangle \Rightarrow^*_{G'_1} \langle\rangle$ and *g* is the argument selector function of this derivation.
2. $\langle A^{(g)}(r^{(1)}_{0,p_1}, \ldots, r^{(k_0)}_{0,p_{k_0}})\rangle \Rightarrow^*_{G''_1} \langle\rangle$.

The grammar *G*″_{1} constructed here has no inherited arguments and is almost the required LCFRS *G*_{2}. To precisely follow our definitions, what remains is to consistently rename the variables in each rule to obtain the set *X*_{m}, for some *m*. Furthermore, the symbols from Σ now need to be explicitly assigned rank 0.

For the converse direction of the theorem consider LCFRS *G*. We construct sDCP *G*′ by taking the same nonterminals as those of *G*, with the same ranks. Each argument is synthesized. To obtain the rules of *G*′, we replace each term *a*() by the term *a*(〈〉).

As an example, consider a sDCP *G*_{1} over nonterminals *A*, *B*, and *C* and terminal symbols σ and α, with i-rk(*A*) = 2, i-rk(*B*) = 1, and i-rk(*C*) = 2, and let *G*′_{1} be obtained from it as described above. For the first rule of *G*′_{1} we have exp(σ *x*_{2} α *x*_{1}) = σ *x*_{2} α *x*_{1}, and we obtain the function *g* with *g*(1) = 〈2, 1〉, because *x*_{2} occurs before *x*_{1}; this leads to a new rule of *G*″_{1}. For the second rule of *G*′_{1}, assume *g*_{2} = *g*, with *g* the function we found for the first rule. Further assume we previously found the nonterminal $B^{(g_1)}$ with *g*_{1}(1) = 〈1〉, which is in fact the only allowable function for *B*. Since *x*_{1} occurs before *x*_{2}, we have *g*_{0}(1) = 〈1, 2〉, and we derive the corresponding rule of *G*″_{1}.

## Acknowledgments

We are grateful to the reviewers for their constructive criticism and encouragement. We also thank Markus Teichmann for helpful discussions. The third author was financially supported by the Deutsche Forschungsgemeinschaft through project DFG VO 1011/8-1.

## Notes

The term “hybrid tree” was used before by Lu et al. (2008), also for a mixture of a tree structure and a linear structure, generated by a probabilistic model. However, the linear “surface” structure was obtained by a simple left-to-right tree traversal, whereas a meaning representation was obtained by a slightly more flexible traversal of the same tree. The emphasis in the current article is rather on separating the linear structure from the tree structure. Note that similar distinctions of multiple **strata** were made before in both constituent linguistics (see, e.g., Chomsky 1981) and dependency linguistics (see, e.g., Mel’čuk 1988).

The term "simple" is used here for definite clause programs with a meaning analogous to that of the corresponding term for MGs and CFTGs. This is more restrictive than the identically named notion in Deransart and Małuszynski (1985).

See Seki et al. (1991) for an even stronger normal form.

## References

- *Lectures on Government and Binding*, volume 9 of *Studies in Generative Grammar*.
- *Developments in Language Theory*, volume 5583 of *Lecture Notes in Computer Science*.
- *Mathematical Systems Theory*.
- *Dependency Parsing*, volume 2(1) of *Synthesis Lectures on Human Language Technologies*.
- *Proceedings of the 2nd International Conference on Graph Transformations*, volume 3256 of *Lecture Notes in Computer Science*.
- *Discontinuous Constituency*, volume 20 of *Syntax and Semantics*.

## Author notes

School of Computer Science, University of St. Andrews, North Haugh, St. Andrews, KY16 9SX, UK.