## Abstract

Tree transducers are defined as relations between trees, but in syntax-based machine translation, we are ultimately concerned with the relations between the strings at the yields of the input and output trees. We examine the formal power of Multi Bottom-Up Tree Transducers from this point of view.

## 1. Introduction

Many current approaches to syntax-based statistical machine translation fall under the theoretical framework of synchronous tree substitution grammars (STSGs). Tree substitution grammars (TSGs) generalize context-free grammars (CFGs) in that each rule expands a nonterminal to produce an arbitrarily large tree fragment, rather than a fragment of depth one as in a CFG. Synchronous TSGs generate tree fragments in the source and target languages in parallel, with each rule producing a tree fragment in either language. Systems such as that of Galley et al. (2006) extract STSG rules from parallel bilingual text that has been automatically parsed in one language, and the STSG nonterminals correspond to nonterminals in these parse trees. Chiang’s 2007 Hiero system produces simpler STSGs with a single nonterminal.

STSGs have the advantage that they can naturally express many re-ordering and restructuring operations necessary for machine translation (MT). They have the disadvantage, however, that they are not closed under composition (Maletti et al. 2009). Therefore, if one wishes to construct an MT system as a pipeline of STSG operations, the result may not be expressible as an STSG. Recently, Maletti (2010) has argued that multi bottom–up tree transducers (MBOTs) (Lilin 1981; Arnold and Dauchet 1982; Engelfriet, Lilin, and Maletti 2009) provide a useful representation for natural language processing applications because they generalize STSGs, but have the added advantage of being closed under composition. MBOTs generalize traditional bottom–up tree transducers in that they allow transducer states to pass more than one output subtree up to subsequent transducer operations. The number of subtrees taken by a state is called its **rank**. MBOTs are linear and non-deleting; that is, operations cannot copy or delete arbitrarily large tree fragments.

Although STSGs and MBOTs both perform operations on trees, it is important to note that, in MT, we are primarily interested in translational relations between strings. Tree operations such as those provided by STSGs are ultimately tools to translate a string in one natural language into a string in another. Whereas MBOTs originate in the tree transducer literature and are defined to take a tree as input, MT systems such as those of Galley et al. (2006) and Chiang (2007) find a parse of the source language sentence as part of the translation process, and the decoding algorithm, introduced by Yamada and Knight (2002), has more in common with CYK parsing than with simulating a tree transducer.

In this article, we investigate the power of MBOTs, and of compositions of STSGs in particular, in terms of the set of *string* translations that they generate. We relate MBOTs and compositions of STSGs to existing grammatical formalisms defined on strings through five main results, which we outline subsequently. The first four results serve to situate general MBOTs among string formalisms, and the fifth result addresses MBOTs resulting from compositions of STSGs in particular.

Our first result is that the translations produced by MBOTs are a subset of those produced by linear context-free rewriting systems (LCFRSs) (Vijay-Shanker, Weir, and Joshi 1987). LCFRS provides a very general framework that subsumes CFG, tree adjoining grammar (TAG; Joshi, Levy, and Takahashi 1975; Joshi and Schabes 1997), and more complex systems, as well as synchronous context-free grammar (SCFG) (Aho and Ullman 1972) and synchronous tree adjoining grammar (STAG) (Shieber and Schabes 1990; Schabes and Shieber 1994) in the context of translation. LCFRS allows grammar nonterminals to generate more than one span in the final string; the number of spans produced by an LCFRS nonterminal corresponds to the rank of an MBOT state. Our second result states that the translations produced by MBOTs are equivalent to a specific restricted form of LCFRS, which we call 1-m-LCFRS. From the construction relating MBOTs and 1-m-LCFRSs follow results about the source and target sides of the translations produced by MBOTs. In particular, our third result is that the translations produced by MBOTs are context-free within the source language, and hence are strictly less powerful than LCFRSs. This implies that MBOTs are not as general as STAGs, for example. Similarly, MBOTs are not as general as the generalized multitext grammars proposed for machine translation by Melamed (2003), which retain the full power of LCFRSs in each language (Melamed, Satta, and Wellington 2004). Our fourth result is that the output of an MBOT, when viewed as a string language, does retain the full power of LCFRSs. This fact is mentioned by Engelfriet, Lilin, and Maletti (2009, page 586), although no explicit construction is given.

Our final result specifically addresses the string translations that result from compositions of STSGs, with the goal of better understanding the complexity of using such compositions in machine translation systems. We show that the translations produced by compositions of STSGs are more powerful than those produced by single STSGs, or, equivalently, by SCFGs. Although it is known that STSGs are not closed under composition, the proofs used previously in the literature rely on differences in tree structure, and do not generate string translations that cannot be generated by STSG. Our result implies that current approaches to machine translation decoding will need to be extended to handle arbitrary compositions of STSGs.

## 2. Preliminaries

A **ranked alphabet** is an alphabet where each symbol has an integer **rank**, denoting the number of children the symbol takes in a tree. *T*_{Σ} denotes the set of trees constructed from ranked alphabet *Σ*. We use parentheses to write trees: for example, *a*(*b*, *c*, *d*) is an element of *T*_{Σ} if *a* is an element of *Σ* with rank 3, and *b*, *c*, and *d* are elements of *Σ* with rank 0. Similarly, given a ranked alphabet *Σ* and a set *X*, *Σ*(*X*) denotes the set of trees consisting of a single symbol of *Σ* of rank *k* dominating a sequence of *k* elements from *X*. We use *T*_{Σ}(*X*) to denote the set of arbitrarily sized trees constructed from ranked alphabet *Σ* having items from set *X* at some leaf positions. That is, *T*_{Σ}(*X*) is the smallest set such that *X* ⊂ *T*_{Σ}(*X*), and *σ*(*t*_{1}, …, *t*_{k}) ∈ *T*_{Σ}(*X*) if *σ* is an element of *Σ* with rank *k* and *t*_{1}, …, *t*_{k} ∈ *T*_{Σ}(*X*).

A **multi bottom–up tree transducer** (MBOT) (Lilin 1981; Arnold and Dauchet 1982; Engelfriet, Lilin, and Maletti 2009; Maletti 2010) is a system (*S*, *Σ*, *Δ*, *F*, *R*) where:

- *S*, *Σ*, and *Δ* are ranked alphabets of states, input symbols, and output symbols, respectively.
- *F* ⊂ *S* is a set of accepting states.
- *R* is a finite set of rules *l* → *r* where, using a set of variables *X*, *l* ∈ *T*_{Σ}(*S*(*X*)) and *r* ∈ *S*(*T*_{Δ}(*X*)), such that:
  - every *x* ∈ *X* that occurs in *l* occurs exactly once in *r* and vice versa, and
  - *l* ∉ *S*(*X*) or *r* ∉ *S*(*X*).

A transduction proceeds by applying rules from *R*. We replace the fragment *l* with *r*, copying the subtree under each variable in *l* to the location of the corresponding variable in *r*. Transducer rules apply bottom–up from the leaves of the input tree, as shown in Figure 1, and must terminate in an accepting state. We use underlined symbols for the transducer states, in order to distinguish them from the symbols of the input and output alphabets.
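To make the tree notation concrete, the definitions above can be sketched in code (a minimal sketch; the representation of trees as label/children pairs and the helper names are our own, not from the article):

```python
# Sketch of trees over a ranked alphabet, illustrating membership in T_Sigma.
# The tuple-based tree representation is our own choice.

SIGMA = {"a": 3, "b": 0, "c": 0, "d": 0}  # ranked alphabet: symbol -> rank

def is_tree(t, sigma):
    """Check membership in T_Sigma: each symbol dominates rank(symbol) subtrees."""
    label, children = t
    return (label in sigma
            and sigma[label] == len(children)
            and all(is_tree(c, sigma) for c in children))

def tree_yield(t):
    """Concatenate the leaf labels from left to right."""
    label, children = t
    if not children:
        return [label]
    return [leaf for c in children for leaf in tree_yield(c)]

# a(b, c, d) is in T_Sigma since rank(a) = 3 and b, c, d have rank 0.
t = ("a", [("b", []), ("c", []), ("d", [])])
assert is_tree(t, SIGMA)
assert tree_yield(t) == ["b", "c", "d"]
```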

We define a **translation** to be a set of string pairs, and we define the **yield** of an MBOT *M* to be the set of string pairs (*s*, *t*) such that there exist: a tree *s*′ ∈ *T*_{Σ} having *s* as its yield, a tree *t*′ ∈ *T*_{Δ} having *t* as its yield, and a transduction from *s*′ to *t*′ that is accepted by *M*. We refer to *s* as the **source side** and *t* as the **target side** of the translation. We use the notation source(*T*) to denote the set of source strings of a translation *T*, source(*T*) = { *s* |(*s*,*t*) ∈ *T* }, and we use the notation target(*T*) to denote the set of target strings. We use the notation yield(MBOT) to denote the set of translations produced by the set of all MBOTs.

A **linear context-free rewriting system** (LCFRS) is defined as a system (*V*_{N}, *V*_{T}, *P*, *S*), where *V*_{N} is a set of nonterminal symbols, *V*_{T} is a set of terminal symbols, *P* is a set of productions, and *S* ∈ *V*_{N} is a distinguished start symbol. Associated with each nonterminal *B* is a **fan-out** *ϕ*(*B*), which tells how many spans *B* covers in the final string. Productions *p* ∈ *P* take the form:

*p*: *A* → *g*(*B*_{1}, *B*_{2}, …, *B*_{r})

where *A*, *B*_{1}, …, *B*_{r} ∈ *V*_{N}, and *g* is a function

*g*: (*V*_{T}^{*})^{ϕ(B_{1})} × ⋯ × (*V*_{T}^{*})^{ϕ(B_{r})} → (*V*_{T}^{*})^{ϕ(A)}

which specifies how to assemble the spans of the righthand side nonterminals into the *ϕ*(*A*) spans of the lefthand side nonterminal. The function *g* must be **linear** and **non-erasing**, which means that, if we write

*g*(〈*x*_{1,1}, …, *x*_{1,ϕ(B_{1})}〉, …, 〈*x*_{r,1}, …, *x*_{r,ϕ(B_{r})}〉) = 〈*t*_{1}, …, *t*_{ϕ(A)}〉

then the tuple of strings on the right-hand side contains each variable *x*_{i,j} from the left-hand side exactly once, and may also contain terminals from *V*_{T}. The process of generating a string from an LCFRS grammar consists of first choosing, top–down, a production to expand each nonterminal, and then, bottom–up, applying the functions associated with each production to build the string. We refer to the tree induced by top–down nonterminal expansions of an LCFRS as the **derivation tree**, or sometimes simply as a derivation.

A CFG can be viewed as an LCFRS in which all nonterminals have fan-out one, reflected in the fact that all tuples defining the productions’ functions contain just one string. Just as CFG is equivalent to LCFRS with fan-out 1, SCFG and TAG can be represented as LCFRS with fan-out 2. Higher values of fan-out allow strictly more powerful grammars (Rambow and Satta 1999). Polynomial-time parsing is possible for any fixed LCFRS grammar, but the degree of the polynomial depends on the grammar. Parsing general LCFRS grammars, where the grammar is considered part of the input, is NP-complete (Satta 1992).
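As an illustration of fan-out (a sketch of our own, not an example from the article): a single nonterminal *A* of fan-out 2 suffices to generate the non-context-free language { a^n b^n c^n }, with the two spans of *A* grown in parallel and concatenated only by the start production.

```python
# Sketch: bottom-up evaluation of an LCFRS-style derivation for {a^n b^n c^n},
# a language that requires fan-out 2.

def expand_A(n):
    """n applications of A -> g(A), g(<x, y>) = <a x b, c y>,
    with base case A -> <eps, eps>; returns A's two spans."""
    x, y = "", ""
    for _ in range(n):
        x, y = "a" + x + "b", "c" + y
    return (x, y)

def expand_S(n):
    """Start production S -> f(A), f(<x, y>) = <x y> (fan-out 1)."""
    x, y = expand_A(n)
    return x + y

assert expand_S(3) == "aaabbbccc"
```

Because the two spans are assembled only at the end, the a/b counts and the c count stay synchronized, which is exactly what fan-out 2 buys over a CFG.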

Following Melamed, Satta, and Wellington (2004), we represent translation in LCFRS by using a special symbol # to separate the strings of the two languages. Our LCFRS grammars will only generate strings of the form *s*#*t*, where *s* and *t* are strings not containing the symbol #, and we will identify *s* as the source string and *t* as the target string. We use the notation trans(LCFRS) to denote the set of translations that can be produced by taking the string language of some LCFRS and splitting each string into a pair at the location of the # symbol.

## 3. Translations Produced by General MBOTs

In this section, we relate the yield of general MBOTs to string rewriting systems.

To begin, we show that the translation produced by any MBOT is also produced by an LCFRS by giving a straightforward construction for converting MBOT rules to LCFRS rules.

We first consider MBOT rules having only variables, as opposed to alphabet symbols of rank zero, at their leaves. For an MBOT rule *l* → *r* with *l* ∈ *T*_{Σ}(*S*(*X*)), let *S*_{1}, *S*_{2}, …, *S*_{k} be the sequence of states appearing from left to right immediately above the leaves of *l*. Without loss of generality, we will name the variables such that *x*_{i,j} is the *j*th child of the *i*th state, *S*_{i}, and the sequence of variables at the leaves of *l*, read from left to right, is: *x*_{1,1},…,*x*_{1,d(S1)},…,*x*_{k,1},…,*x*_{k,d(Sk)}, where *d*(*S*_{i}) is the rank of state *S*_{i}. Let *S*_{0} be the state symbol at the root of the right-hand-side (r.h.s.) tree *r* ∈ *S*(*T*_{Δ}(*X*)). Let *π* and *μ* be functions such that *x*_{π(1),μ(1)}, *x*_{π(2),μ(2)}, …, *x*_{π(n),μ(n)} is the sequence of variables at the leaves of *r* read from left to right. We will call this sequence the **yield** of *r*. Finally, let *p*(*i*) for 1 ≤ *i* ≤ *d*(*S*_{0}) be the position in the yield of *r* of the rightmost leaf of *S*_{0}’s *i*th child. Thus, for all *i*, 1 ≤ *p*(*i*) ≤ *n*.

The LCFRS rule for *l* → *r* is constructed as *S*_{0} → *g*(*S*_{1}, *S*_{2}, …, *S*_{k}). The LCFRS nonterminal *S*_{i} has fan-out equal to the corresponding MBOT state’s rank plus one: *ϕ*(*S*_{i}) = *d*(*S*_{i}) + 1. This is because the LCFRS nonterminal has one span in the source language, and *d*(*S*_{i}) spans in the target language of the translation. The combination function for the LCFRS rule *S*_{0} → *g*(*S*_{1}, *S*_{2}, …, *S*_{k}) is:

*g*(〈*e*_{1}, *f*_{1,1}, …, *f*_{1,d(S_{1})}〉, …, 〈*e*_{k}, *f*_{k,1}, …, *f*_{k,d(S_{k})}〉) = 〈*e*_{1} *e*_{2} ⋯ *e*_{k}, *f*_{π(1),μ(1)} ⋯ *f*_{π(p(1)),μ(p(1))}, …, *f*_{π(p(d(S_{0})−1)+1),μ(p(d(S_{0})−1)+1)} ⋯ *f*_{π(n),μ(n)}〉

where *p*(0) = 0, so that the *i*th target span concatenates *f*_{π(t),μ(t)} for *p*(*i*−1) < *t* ≤ *p*(*i*). Here we use *e*_{i} for the variables in the LCFRS rule corresponding to spans in the input tree of the MBOT, and *f*_{i,j} for variables corresponding to the output tree. The pattern in which these spans fit together is specified by the functions *π* and *μ* that were read off of the MBOT rule.
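The assembly performed by the constructed combination function can be sketched as follows (a sketch under the notation above; the function name, the 0-indexing of yield positions, and the data layout are our own assumptions):

```python
# Sketch of the combination function built from a terminal-free MBOT rule.
# pi and mu give the left-to-right order of variables in the yield of the
# rule's r.h.s.; p[i] is the yield position of the rightmost leaf under the
# (i+1)th child of S_0 (so the boundaries split the yield into target spans).

def combine(e, f, pi, mu, p):
    """e: source spans e_1..e_k; f: dict (i, j) -> target span f_{i,j}.
    Returns <e_1 ... e_k, span_1, ..., span_d(S_0)>."""
    source = "".join(e)
    yield_strings = [f[(pi[t], mu[t])] for t in range(len(pi))]
    spans, start = [], 0
    for end in p:                       # split the yield at the boundaries p(i)
        spans.append("".join(yield_strings[start:end]))
        start = end
    return tuple([source] + spans)

# Toy instance: two r.h.s. states of rank 1; the single child of S_0
# contains both variables in reversed order.
e = ["s1", "s2"]
f = {(1, 1): "t1", (2, 1): "t2"}
assert combine(e, f, pi=[2, 1], mu=[1, 1], p=[2]) == ("s1s2", "t2t1")
```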

We now extend the construction to rules with alphabet symbols of rank zero at their leaves. Let *α*_{0} be the sequence of rank zero symbols appearing at the leaves of *l* to the left of *x*_{1,1}, and let *α*_{i} for 1 ≤ *i* ≤ *k* be the sequence of rank zero symbols to the right of *x*_{i,d(S_{i})}, and to the left of *x*_{i+1,1} if *i* < *k*. Let *β*_{i,j} be the sequence of symbols of rank zero at the leaves of *r* appearing in the subtree under the *i*th child of *S*_{0} after the *j*th variable in this subtree and before the (*j*+1)th variable, with *β*_{i,0} being to the left of the first variable, and *β*_{i,p(i)} being to the right of the last variable. We can add these sequences of terminal symbols to the LCFRS rule by writing *α*_{0} *e*_{1} *α*_{1} ⋯ *e*_{k} *α*_{k} for the source span, and by inserting each *β*_{i,j} into the *i*th target span after its *j*th variable, with *β*_{i,0} at the front. An example of this conversion is shown in Figure 3. In this example, *α*_{1} = *of*, *β*_{1,1} is the terminal string shown in the figure, all other *α* and *β* values are the empty string, and *d*(*S*_{0}) = 1. We refer to the LCFRS rule constructed from MBOT rule *l* → *r* as *p*_{l→r}.

Finally, we add a **start rule** *S* → *g*(*S*_{i}), with *g*(〈*e*, *f*〉) = 〈*e*#*f*〉, for each *S*_{i} ∈ *F*, to generate all final states *S*_{i} of the MBOT from the start symbol *S* of the LCFRS.

We now show that the language of the LCFRS constructed from a given MBOT is identical to the yield of the MBOT. We represent MBOT transductions as derivation trees, where each node is labeled with an MBOT rule, and each node’s children are the rules used to produce the subtrees matched by any variables in the rule. We can construct an LCFRS derivation tree by simply relabeling each node with the LCFRS rule constructed from the node’s MBOT rule. Because, in the MBOT derivation tree, each node has children which produce the states required by the MBOT rule’s left-hand side (l.h.s.), it also holds that, in the LCFRS derivation tree, each node has as its children rules which expand the set of nonterminals appearing in the parent’s r.h.s. Therefore the LCFRS tree constitutes a valid derivation.

Given the mapping from MBOT derivations to LCFRS derivations, the following lemma relates the strings produced by the derivations:

**Lemma 1**

Let *T*_{MBOT} be an MBOT derivation tree with *I* as its input tree and *O* as its output tree, and construct *T*_{LCFRS} by mapping each node *n*_{MBOT} in *T*_{MBOT} to a node *n*_{LCFRS} labeled with the LCFRS production constructed from the rule at *n*_{MBOT}. Let 〈*t*_{0}, *t*_{1}, …, *t*_{k}〉 be the string tuple returned by the LCFRS combination function at any node *n*_{LCFRS} in *T*_{LCFRS}. The string *t*_{0} contains the yield of the node of *I* at which the MBOT rule at the node of *T*_{MBOT} corresponding to *n*_{LCFRS} was applied. Furthermore, the strings *t*_{1}, …, *t*_{k} contain the *k* yields of the *k* MBOT output subtrees (subtrees of *O*) that are found as children of the root (state symbol) of the MBOT rule’s right-hand side.

**Proof**

We proceed by induction on the derivation tree, writing the tuple returned by each combination function *g* as 〈*t*_{0}, *t*_{1}, …, *t*_{k}〉. An MBOT rule applying at the bottom of the input tree cannot contain any variables, and for MBOT rules of this type, our construction produces an LCFRS rule with a combination function of the form *g*() = 〈*t*_{0}, *t*_{1}, …, *t*_{k}〉, taking no arguments and returning string constants equal to the yield of the MBOT rule’s l.h.s., and the sequence of yields of the *k* subtrees under the r.h.s.’s root. Now we consider how further rules in the LCFRS derivation make use of the tuple 〈*t*_{0}, *t*_{1}, …, *t*_{k}〉. Our LCFRS combination functions always concatenate the first elements of the input tuples in order, adding any terminals present in the portion of the input tree matched by the MBOT rule’s l.h.s. Thus the combination functions maintain the property that the first element in the resulting tuple, *t*_{0}, contains the yield of the subtree of the input tree where the corresponding MBOT rule applied. The combination functions combine the remaining elements in their input tuples in the same order given by the MBOT rule’s r.h.s., again adding any terminals added to the output tree by the MBOT rule. Thus, at each step, the strings *t*_{1}, …, *t*_{k} returned by LCFRS combination functions contain the *k* yields of the *k* MBOT output subtrees found as children of the root (state symbol) of the MBOT rule’s r.h.s. By induction, the lemma holds at each node in the derivation tree. ▪

The correspondence between LCFRS string tuples and MBOT tree yields gives us our first result:

**Theorem 1**

yield(MBOT) ⊂ trans(LCFRS).

**Proof**

From a given MBOT, construct an LCFRS as described previously. For any transduction of the MBOT, from Lemma 1, there exists an LCFRS derivation which produces a string consisting of the yield of the MBOT’s input and output trees joined by the # symbol. In the other direction, we note that any valid derivation of the LCFRS corresponds to an MBOT transduction on some input tree; this input tree can be constructed by assembling the left-hand sides of the MBOT rules from which the LCFRS rules of the LCFRS derivation were originally constructed. Because there is a one-to-one correspondence between LCFRS and MBOT derivations, the translation produced by the LCFRS and the yield of the MBOT are identical.

Because we can construct an LCFRS generating the same translation as the yield of any given MBOT, we see that yield(MBOT) ⊂ trans(LCFRS). ▪

The translations produced by MBOTs are equivalent to the translations produced by a certain restricted class of LCFRS grammars, which we now specify precisely.

**Theorem 2**

The class of translations yield(MBOT) is equivalent to trans(1-m-LCFRS), where 1-m-LCFRS is defined to be the class of LCFRS grammars where each rule either is a start rule of the form *S* → *g*(*S*_{i}), with *g*(〈*e*, *f*〉) = 〈*e*#*f*〉, or meets both of the following conditions:

- The combination function keeps the two sides of the translation separate: it can be decomposed into a function *g*_{1} that builds the first span of the l.h.s. nonterminal from only the first spans of the r.h.s. nonterminals, and a function *g*_{2} that builds the remaining l.h.s. spans from only the remaining r.h.s. spans.
- The function *g*_{1} returns a tuple of length 1.

**Proof**

Our construction for transforming an MBOT to an LCFRS produces LCFRS grammars satisfying the given constraints, so yield(MBOT) ⊂ trans(1-m-LCFRS).

In the other direction, consider a 1-m-LCFRS rule whose combination function returns as its source span the string *α*_{0} *e*_{1} *α*_{1} ⋯ *e*_{k} *α*_{k}, where each *α*_{i} is a string of terminals, and whose target spans are sequences of symbols *t*_{i,j}, each of which is either a variable *f*_{i′,j′} or a single terminal. From such a rule we construct an MBOT rule by reversing the construction given above: the l.h.s. is a tree of depth one in which a fresh input symbol dominates the terminals of the *α*_{i} interleaved with the states *S*_{i}, each taking its variables as children, and the r.h.s. places under the state *S*_{0} one output subtree per target span, built from the symbols *t*_{i,j}. By the same reasoning used for our construction of LCFRS grammars from MBOTs, there is a one-to-one correspondence between derivation trees of the 1-m-LCFRS and the constructed MBOT, and the yield strings also correspond at each node in the derivation trees. Therefore, trans(1-m-LCFRS) ⊂ yield(MBOT).

Because we have containment in both directions, yield(MBOT) = trans(1-m-LCFRS). ▪

We now move on to consider the languages formed by the source and target projections of MBOT translations.

Grammars of the class 1-m-LCFRS have the property that, for any nonterminal *A* (other than the start symbol *S*) having fan-out *ϕ*(*A*), one span is always realized in the source string (to the left of the # separator), and *ϕ*(*A*) − 1 spans are always realized in the target language (to the right of the separator). This property is introduced by the start rules *S* → *g*(*S*_{i}), with *g*(〈*e*, *f*〉) = 〈*e*#*f*〉, and is maintained by all further productions because of the condition on 1-m-LCFRS that the combination function must keep the two sides of the translation separate. For a 1-m-LCFRS rule constructed from an MBOT, we define the rule’s source language projection to be the rule obtained by discarding all the target language spans, as well as the separator symbol # in the case of the start productions. The definition of 1-m-LCFRS guarantees that the combination function returning a rule’s l.h.s. source span needs to have only the r.h.s. source spans available as arguments.

For an LCFRS *G*, we define *L*(*G*) to be the language produced by *G*. We define source(*G*) to be the LCFRS obtained by projecting each rule in *G*. Because more than one rule may have the same projection, we label the rules of source(*G*) with their origin rule, preserving a one-to-one correspondence between rules in the two grammars. Similarly, we obtain a rule’s target language projection by discarding the source language spans, and define target(*G*) to be the resulting grammar.

**Lemma 2**

For an LCFRS *G* constructed from an MBOT *M* by the given construction, *L*(source(*G*)) = source(yield(*M*)), and *L*(target(*G*)) = target(yield(*M*)).

**Proof**

There is a valid derivation tree in the source language projection for each valid derivation tree in the full LCFRS, because for any expansion rewriting a nonterminal of fan-out *ϕ*(*A*) in the full grammar, we can apply the projected rule to the corresponding nonterminal of fan-out 1 in the projected derivation. In the other direction, for any expansion in a derivation of the source projection, a nonterminal of fan-out *ϕ*(*A*) will be available for expansion in the corresponding derivation of the full LCFRS. Because there is a one-to-one correspondence between derivations in the full LCFRS and its source projection, the language generated by the source projection is the source of the translation generated by the original LCFRS. By the same reasoning, there is a one-to-one correspondence between derivations in the target projection and the full LCFRS, and the language produced by the target projection is the target side of the translation of the full LCFRS. ▪

Lemma 2 implies that it is safe to evaluate the power of the source and target projections of the LCFRS independently. This fact leads to our next result.

**Theorem 3**

yield(MBOT) ⊊ trans(LCFRS).

**Proof**

By Lemma 2, the source side of the yield of any MBOT is generated by the source projection of the corresponding 1-m-LCFRS. In this projection every nonterminal has fan-out 1, so the projection is a context-free grammar, and the source side of the translation is a context-free language. An LCFRS can, however, generate translations whose source side is not context-free (for example, a translation placing a language such as *a*^{n}*b*^{n}*c*^{n} to the left of the # separator), so the containment of Theorem 1 is strict. ▪

Although the source side of the translation produced by an MBOT must be a context-free language, we now show that the target side can be any language produced by an LCFRS.

**Theorem 4**

target(yield(MBOT)) = LCFRS.

**Proof**

For an LCFRS rule *A* → *g*(*B*_{1}, …, *B*_{r}) in which each symbol *t*_{i,j} in the tuple returned by *g* is either some variable *x*_{i′,j′} or a terminal from the alphabet of the LCFRS, we construct a corresponding MBOT rule, where the MBOT’s input alphabet contains a symbol S for each LCFRS nonterminal *S*, and the MBOT’s output alphabet contains *ϕ*(*S*) symbols S_{i} for each LCFRS nonterminal *S*. This construction for converting an LCFRS to an MBOT shows that LCFRS ⊂ target(yield(MBOT)).

Given our earlier construction for generating the target projection of the LCFRS derived from an MBOT, we know that target(yield(MBOT)) ⊂ LCFRS. Combining these two facts yields the theorem. ▪

## 4. Composition of STSGs

Maletti et al. (2009) discuss the composition of extended top–down tree transducers, which are equivalent to STSGs, as shown by Maletti (2010). They show that this formalism is not closed under composition in terms of the tree transformations that are possible. In this article, we focus on the string yields of the formalisms under discussion, and from this point of view we now examine the question of whether the *yield* of the composition of two STSGs is itself the yield of an STSG in general. It is important to note that, although we focus on the yield of the composition, in our notion of STSG composition, the tree structure output by the first STSG still serves as input to the second STSG.

We define the translation *T*_{crisscross} (Equation (2)) as a set of string pairs built from runs of the characters c and d. Here we have 2ℓ repeated sequences of characters c and d, each occurring *n*_{i} times, with each integer *n*_{i} for 1 ≤ *i* ≤ 2ℓ varying freely; the target string contains the same runs as the source string, rearranged according to the alignment pattern shown in Figure 7. We will show that *T*_{crisscross} cannot be produced by any SCFG.
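For concreteness, string pairs of this general shape can be generated as follows (a sketch only: the exact run pattern and permutation of Figure 7 and Equation (2) are not reproduced in this excerpt, so we use the non-decomposable permutation (2,4,1,3) discussed later in this section as a stand-in, and in small instances adjacent target runs of the same character may abut):

```python
# Sketch: build a string pair whose 2*l aligned runs of c's and d's are
# reordered in the target by a non-decomposable permutation.
# The permutation (2,4,1,3) is a stand-in for the pattern of Figure 7.

def crisscross_pair(lengths, perm=(2, 4, 1, 3)):
    """lengths: run lengths n_1 .. n_{2l}; source runs alternate c and d.
    Returns (source, target), where the target reorders the aligned runs."""
    runs = [("c" if i % 2 == 0 else "d") * n for i, n in enumerate(lengths)]
    source = "".join(runs)
    target = "".join(runs[i - 1] for i in perm)
    return source, target

src, tgt = crisscross_pair([1, 2, 3, 4])
assert src == "cddcccdddd"   # runs c^1 d^2 c^3 d^4
assert tgt == "ddddddcccc"   # runs d^2 d^4 c^1 c^3, in the order (2,4,1,3)
```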

An SCFG is a system (*V*, *Σ*, *Δ*, *P*, *S*) where *V* is a set of nonterminals, *Σ* and *Δ* are the terminal alphabets of the source and target language respectively, *S* ∈ *V* is a distinguished start symbol, and *P* is a set of productions of the following general form:

*X*_{0} → 〈 *w*_{0} *X*_{1}^{(1)} *w*_{1} ⋯ *X*_{n}^{(n)} *w*_{n} , *w*′_{0} *X*_{π(1)}^{(π(1))} *w*′_{1} ⋯ *X*_{π(n)}^{(π(n))} *w*′_{n} 〉

where *π* is a permutation of length *n*, the *w*_{i} ∈ *Σ** and *w*′_{i} ∈ *Δ** are strings of terminals, and the variables *X*_{i} for 0 ≤ *i* ≤ *n* range over nonterminal symbols (for example, *X*_{1} and *X*_{2} may both stand for nonterminal *A*). In SCFG productions, the l.h.s. nonterminal rewrites into a string of terminals and nonterminals in both the source and target languages, and pairs of r.h.s. nonterminals that are linked by the same superscript index must be further rewritten by the same rule.

In terms of string translations, STSGs and SCFGs are equivalent, because any SCFG is also an STSG with rules of depth 1, and any STSG can be converted to an SCFG with the same string translation by simply removing the internal tree nodes in each rule. We will adopt SCFG terminology for our proof because the internal structure of STSG rules is not relevant to our result.

Here, **rank** refers to the maximum number of nonterminals on the r.h.s. of a rule. We will show that strings of this form cannot be produced by any SCFG of rank less than 2ℓ. Intuitively, factoring the alignment pattern of Figure 7 into smaller SCFG rules would require identifying subsequences in the two languages that are consistently aligned to one another, and, as can be seen from the figure, no such subsequences exist. Because ℓ can be unboundedly large in our translation, the translation cannot be produced by any SCFG of fixed rank.

Without loss of generality, we assume that all terminals are produced by rules whose r.h.s. contains no nonterminals; an arbitrary SCFG can be converted into this form as follows:

- 1. Associate each sequence of terminals with the preceding nonterminal, or the following nonterminal in the case of initial terminals.
- 2. Replace each group consisting of a nonterminal and its associated terminals with a fresh nonterminal *A*, and add a rule rewriting *A* as the group in source and target. (Nonterminals with no associated terminals may be left intact.)
- 3. In each rule created in the previous step, replace each sequence of terminals with another fresh nonterminal *B*, and add a rule rewriting *B* as the terminal sequence in source and target.

We refer to the nonterminals that directly produce terminal sequences as **preterminals**, and we refer to a pair of linked preterminals as an **aligned preterminal pair**. Assuming that aligned preterminal pairs are indexed consecutively in the source side of the sentential form, we refer to the sequence of indices in the target side as the **preterminal permutation** of a derivation. For example, the preterminal permutation of the derivation in Figure 10 is (3,2,1). The permutation of any sentential form of an SCFG of rank *r* can be produced by composing permutations of length no greater than *r*, as can be shown by induction over the length of the derivation. Thus, while the permutation (3,2,1) of our example can be produced by composing permutations of length 2, the preterminal permutation (2,4,1,3) can never be produced by an SCFG of rank 2 (Wu 1997). In fact, this restriction also applies to subsequences of the preterminal permutation.
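The contrast between (3,2,1) and (2,4,1,3) can be checked mechanically. The following sketch uses the standard shift-reduce test for decomposability into permutations of length 2; it is not an algorithm from the article:

```python
# Sketch: test whether a permutation can be built by composing permutations
# of length <= 2, by greedily merging adjacent intervals of values.

def decomposable_rank2(perm):
    """Return True iff perm is a composition of permutations of length <= 2."""
    stack = []  # each entry is an interval (lo, hi) of already-merged values
    for v in perm:
        stack.append((v, v))
        # merge the top two intervals while they cover adjacent value ranges
        while len(stack) >= 2:
            lo2, hi2 = stack[-1]
            lo1, hi1 = stack[-2]
            if hi1 + 1 == lo2 or hi2 + 1 == lo1:
                stack[-2:] = [(min(lo1, lo2), max(hi1, hi2))]
            else:
                break
    return len(stack) == 1

assert decomposable_rank2((3, 2, 1))          # producible by a rank-2 SCFG
assert not decomposable_rank2((2, 4, 1, 3))   # the classic counterexample (Wu 1997)
```

Each stack entry records a contiguous range of values already merged; a permutation factors into length-2 permutations exactly when the whole input collapses to a single range.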

**Lemma 3**

Let *π* be a preterminal permutation produced by an SCFG derivation containing rules of maximum rank *r*, and let *π*′ be a permutation obtained from *π* by removing some elements and renumbering the remaining elements with a strictly increasing function. Then *π*′ falls within the class of compositions of permutations of length no greater than *r*.

**Proof**

From each rule in the derivation producing preterminal permutation *π*, construct a new rule by removing any nonterminals whose indices were removed from *π*. The resulting sequence of rules produces preterminal permutation *π*′ and contains rules of rank no greater than *r*. ▪

As an example of Lemma 3, removing any element from the permutation (3,2,1) results in the permutation (2,1), which can still (trivially) be produced by an SCFG of rank 2.
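The removal-and-renumbering operation of Lemma 3 can be sketched as follows (the function name is ours):

```python
# Sketch of Lemma 3's operation: drop some elements of a permutation, then
# renumber the survivors with the unique strictly increasing map.

def project(perm, keep):
    """Remove elements of perm whose values are not in `keep`, then renumber
    the survivors onto 1..len(keep), preserving relative order of values."""
    kept = [v for v in perm if v in keep]
    rank = {v: i + 1 for i, v in enumerate(sorted(kept))}
    return tuple(rank[v] for v in kept)

# Removing the element 3 from (3, 2, 1) leaves (2, 1): still rank-2 producible.
assert project((3, 2, 1), keep={1, 2}) == (2, 1)
# A subsequence ordered like (2, 4, 1, 3) certifies that rank 2 is insufficient.
assert project((2, 5, 4, 1, 3), keep={1, 2, 3, 4}) == (2, 4, 1, 3)
```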

We will make use of another general fact about SCFGs, which we derive by applying Ogden’s Lemma (Ogden 1968), a generalized pumping lemma for context-free languages, to the source language of an SCFG.

**Lemma 4 (Ogden’s Lemma)**

For each context-free grammar *G* = (*V*, *Σ*, *P*, *S*) there is an integer *k* such that for any word *ξ* in *L*(*G*), if any *k* or more distinct positions in *ξ* are designated as distinguished, then there is some *A* in *V* and there are words *α*, *β*, *γ*, *δ*, and *μ* in *Σ** such that:

- *S* ⇒* *αAμ* ⇒* *αβAδμ* ⇒* *αβγδμ* = *ξ*, and hence *αβ*^{m}*γδ*^{m}*μ* ∈ *L*(*G*) for all *m* ≥ 0.
- *γ* contains at least one of the distinguished positions.
- Either *α* and *β* both contain distinguished positions, or *δ* and *μ* both contain distinguished positions.
- *βγδ* contains at most *k* distinguished positions.

**Lemma 5**

For each SCFG *G* = (*V*, *Σ*, *Δ*, *P*, *S*) having source alphabet *Σ* and target alphabet *Δ*, there is an integer *k* such that for any string pair (*ξ*, *ξ*′) in *L*(*G*), if any *k* or more distinct positions in *ξ* are designated as distinguished, then there is some *A* in *V* and there are words *α*, *β*, *γ*, *δ*, and *μ* in *Σ** and *α*′, *β*′, *γ*′, *δ*′, and *μ*′ in *Δ** such that:

- (*S*, *S*) ⇒* (*αAμ*, *α*′*Aμ*′) ⇒* (*αβAδμ*, *α*′*β*′*Aδ*′*μ*′) ⇒* (*αβγδμ*, *α*′*β*′*γ*′*δ*′*μ*′) = (*ξ*, *ξ*′), and hence (*αβ*^{m}*γδ*^{m}*μ*, *α*′*β*′^{m}*γ*′*δ*′^{m}*μ*′) ∈ *L*(*G*) for all *m* ≥ 0.
- *γ* contains at least one of the distinguished positions.
- Either *α* and *β* both contain distinguished positions, or *δ* and *μ* both contain distinguished positions.
- *βγδ* contains at most *k* distinguished positions.

Note that no conditions are placed on the lengths of *α*′, *β*′, *γ*′, *δ*′, and *μ*′, and indeed these may all be the empty string.

**Proof**

Applying Ogden’s Lemma to the source side of the SCFG, there is a sequence of rules of the source projection of *G* which licenses the derivation *A* ⇒* *βAδ*. If we write the *j*th rule in this sequence as *A*_{j} → *ν*_{j}, there must exist a synchronous rule in *G* of the form *A*_{j} → 〈*ν*_{j}, *ν*_{j}′〉 that rewrites the same nonterminal. Thus *G* licenses a synchronous derivation (*A*, *A*) ⇒* (*βAδ*, *β*′*Aδ*′) for some *β*′ and *δ*′. Similarly, the source derivation *S* ⇒* *αAμ* has a synchronous counterpart for some *α*′ and *μ*′, and the source derivation *A* ⇒* *γ* has a synchronous counterpart for some *γ*′. Because the synchronous derivation can be repeated any number of times, the string pairs (*αβ*^{m}*γδ*^{m}*μ*, *α*′*β*′^{m}*γ*′*δ*′^{m}*μ*′) are generated by the SCFG for all *m* ≥ 0. The further conditions on *α*, *β*, *γ*, *δ*, and *μ* follow directly from Ogden’s Lemma. ▪

We refer to a substring arising from a term c^{ni} or d^{ni} in the definition of *T*_{crisscross} (Equation (2)) as a **run**. In order to distinguish runs, we refer to the run arising from c^{ni} or d^{ni} as the *i*th run. We refer to the pair (c^{ni}, c^{ni}) or (d^{ni}, d^{ni}) consisting of the *i*th run in the source and target strings as the *i*th **aligned run**. We now use Lemma 5 to show that aligned runs must be generated from aligned preterminal pairs.

**Lemma 6**

Assume that some SCFG *G*′ generates the translation *T*_{crisscross} for some fixed ℓ. There exists a constant *k* such that, in any derivation of grammar *G*′ having each *n*_{i} > *k*, for any *i*, 1 ≤ *i* ≤ 2ℓ, there exists at least one aligned preterminal pair among the subsequences of source and target preterminals generating the *i*th aligned run.

**Proof**

We consider a source string of the form given in Equation (2), such that the length *n*_{i} of each run is greater than the constant *k* of Lemma 5. For a fixed *i*, 1 ≤ *i* ≤ 2ℓ, we consider the distinguished positions to be all and only the terminals in the *i*th run. This implies that the run can be pumped to be arbitrarily long; indeed, this follows from the definition of the language itself.

Because our distinguished positions are within the *i*th run, and because Lemma 5 guarantees that either *α*, *β*, and *γ* all contain distinguished positions or *γ*, *δ*, and *μ* all contain distinguished positions, we are guaranteed that either *β* or *δ* lies entirely within the *i*th run. Consider the case where *β* lies within the run. We must consider three possibilities for the location of *δ* in the string:

*Case 1*. The string *δ* also lies entirely within the *i*th run.

*Case 2*. The string *δ* contains substrings of more than one run. This cannot occur, because pumped strings of the form *αβ*^{m}*γδ*^{m}*μ* would contain more than 2ℓ runs, which is not allowed under the definition of the translation.

*Case 3*. The string *δ* lies entirely within the *j*th run, where *j* ≠ *i*. The strings *αβ*^{m}*γδ*^{m}*μ* have the same form as *αβγδμ*, with the exception that the *i*th and *j*th runs are extended from lengths *n*_{i} and *n*_{j} to some greater lengths *n*_{i}′ and *n*_{j}′. By the definition of the translation, for each source string, only one target string is permitted. For string pairs of the form of Equation (3) to belong to the translation, *β*′ and *δ*′ must lie within the *i*th and *j*th aligned runs in the target side. Because the permutation of Figure 7 cannot be decomposed, there must exist some *k* such that the *k*th aligned run lies between the *i*th and *j*th aligned runs in one side of the translation, and outside the *i*th and *j*th aligned runs in the other side of the translation. If this were not the case, we would be able to decompose the permutation by factoring out the subsequence between the *i*th and *j*th runs on both sides of the translation. Consider the case where the *k*th aligned run lies between the *i*th and *j*th aligned runs in the source side, and therefore is a substring of *γ* in the source, and a substring of either *α*′ or *μ*′ in the target. We apply Lemma 5 a second time, with all terminals of the *k*th run as the distinguished positions, to the derivation (*A*, *A*) ⇒* (*γ*, *γ*′) by taking *A* as the start symbol of the grammar. This implies that there exist strings such that all the resulting pumped string pairs are members of the translation. Either the new *β* or the new *δ* is a substring of the source side of the *k*th aligned run, so the *k*th aligned run can be pumped to be arbitrarily long in the source without changing its length in the target. This contradicts the definition of the translation. Similarly, the case where the *k*th aligned run lies between *β*′ and *δ*′ in the target leads to a contradiction. Thus the assumption that *j* ≠ *i* must be false.

Because Cases 2 and 3 are impossible, *δ* must lie entirely within the *i*th run. Similarly, in the case where *δ* contains distinguished positions, *β* must lie within the *i*th run. Thus both *β* and *δ* always lie entirely within the *i*th aligned run.

Because *β* and *δ* lie within the *i*th aligned run, the strings *αβ*^{m}*γδ*^{m}*μ* have the same form as *αβγδμ*, with the exception that the *i*th run is extended from length *n*_{i} to some greater length *n*_{i}′. For the pairs of Equation (3) to be members of the translation, *β*′ and *δ*′ must be substrings of the *i*th aligned run in the target. Because *β*^{m}*γδ*^{m} and (*β*′)^{m}*γ*′(*δ*′)^{m} were derived from the same nonterminal, the two sequences of preterminals generating these two strings consist of aligned preterminal pairs. Because both *β*^{m}*γδ*^{m} and (*β*′)^{m}*γ*′(*δ*′)^{m} are substrings of the *i*th aligned run, we have at least one aligned preterminal pair among the source and target preterminal sequences generating the *i*th aligned run. ▪

**Lemma 7**

Assume that some SCFG *G*′ generates the translation for some fixed ℓ. There exists a constant *k* such that, if (*ξ*, *ξ*′) is a string pair generated by *G*′ having each *n*_{i} > *k*, any derivation of (*ξ*, *ξ*′) with grammar *G*′ must contain a rule of rank at least 2ℓ.

**Proof**

Because the choice of *i* in Lemma 6 was arbitrary, each aligned run must contain at least one aligned preterminal pair. If we select one such preterminal pair from each run, the associated permutation is that of Figure 7. This permutation cannot be decomposed, so, by Lemma 3, it cannot be generated by an SCFG derivation containing only rules of rank less than 2ℓ. ▪
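The notion of a non-decomposable permutation invoked here can be made concrete: a permutation decomposes exactly when some proper span of at least two adjacent positions maps onto a set of adjacent positions, so that the span could be factored out as a single block. A minimal sketch of this check (a hypothetical helper, not taken from the paper):

```python
def is_decomposable(perm):
    """Return True if perm (a permutation of 0..n-1) contains a proper
    block: a span of >= 2 adjacent positions, shorter than the whole
    permutation, whose images are also adjacent. SCFG derivations built
    from low-rank rules can only generate permutations that factor into
    such blocks; a permutation with no proper block cannot be decomposed."""
    n = len(perm)
    for i in range(n):
        lo = hi = perm[i]
        for j in range(i + 1, n):
            lo, hi = min(lo, perm[j]), max(hi, perm[j])
            width = j - i + 1
            if width < n and hi - lo + 1 == width:
                return True  # positions i..j map onto the range lo..hi
    return False

# (2, 0, 3, 1) has no proper block and so cannot be decomposed,
# while (1, 0, 3, 2) factors into the blocks {0, 1} and {2, 3}.
```

On this test, the "crisscross" style permutations used in the proofs above return `False` for every proper span, which is what forces a single rule of high rank.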

We will use one more general fact about SCFGs to prove our main result.

**Lemma 8**

The translations generated by SCFGs are closed under intersection of the source-side language with a regular language.

**Proof**

Let *V* be the nonterminal set of *G*, and let *S* be the state set of *F*. Construct the SCFG *G*′ with nonterminal set *V*×*S* ×*S* by applying the construction of Bar-Hillel, Perles, and Shamir (1961) for intersection of a CFG and finite state machine to the source side of each rule in *G*. ▪
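The construction in this proof can be sketched for the plain CFG case (a hypothetical encoding, assumed here for illustration: rules as `(lhs, rhs)` pairs in Chomsky normal form, uppercase strings as nonterminals, and a deterministic FSA given as a transition map); applying it to the source side of each SCFG rule yields the grammar *G*′:

```python
def bar_hillel(rules, states, arcs):
    """Sketch of the Bar-Hillel, Perles, and Shamir (1961) construction:
    intersect a CFG (rules of the forms A -> B C and A -> 'w') with a
    finite-state machine, producing rules over triples (q, A, r) meaning
    "A derives a string taking the FSA from state q to state r".
    `arcs` maps (state, word) -> next state; the start symbols of the
    result are (q0, S, qf) for the FSA's start q0 and each accepting qf."""
    out = []
    for lhs, rhs in rules:
        if len(rhs) == 1 and not rhs[0].isupper():   # terminal rule A -> 'w'
            for q in states:
                r = arcs.get((q, rhs[0]))
                if r is not None:
                    out.append(((q, lhs, r), [rhs[0]]))
        elif len(rhs) == 2:                          # binary rule A -> B C
            for q in states:
                for s in states:
                    for r in states:
                        out.append(((q, lhs, r),
                                    [(q, rhs[0], s), (s, rhs[1], r)]))
    return out
```

Note that each output rule has the same rank as the rule it was built from, which is the property used in the proof of Theorem 1 below.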

Now we are ready for our main result.

**Theorem 1**

SCFG = yield(STSG) ⊊ yield(STSG;STSG), where the semicolon denotes composition.

**Proof**

Assume that some SCFG *G* generates *T*_{crisscross}. Note that the translation for fixed ℓ is the result of intersecting the source side of *T*_{crisscross} with the regular language a[c^{+}d^{+}]^{ℓ}a. By Lemma 8, we can construct an SCFG *G*^{ℓ} generating this translation. By Lemma 7, for each ℓ, *G*^{ℓ} has rank at least 2ℓ. The intersection construction does not increase the rank of the grammar, so *G* has rank at least 2ℓ. Because ℓ is unbounded in the definition of *T*_{crisscross}, and because any SCFG has a finite maximum rank, *T*_{crisscross} cannot be produced by any SCFG. ▪

### 4.1. Implications for Machine Translation

The ability of MBOTs to represent the composition of STSGs is given as a motivation for the MBOT formalism by Maletti (2010), but this raises the issue of whether synchronous parsing and machine translation decoding can be undertaken efficiently for MBOTs resulting from the composition of STSGs.

In discussing the complexity of synchronous parsing problems, we distinguish the case where the grammar is considered part of the input, and the case where the grammar is fixed, and only the source and target strings are considered part of the input. For SCFGs, synchronous parsing is NP-complete when the grammar is considered part of the input and can have arbitrary rank. For any fixed grammar, however, synchronous parsing is possible in time polynomial in the lengths of the source and target strings, with the degree of the polynomial depending on the rank of the fixed SCFG (Satta and Peserico 2005). Because MBOTs subsume SCFGs, the problem of recognizing whether a string pair belongs to the translation produced by an arbitrary MBOT, when the MBOT is considered part of the input, is also NP-complete.

Given our construction for converting an MBOT to an LCFRS, we can use standard LCFRS tabular parsing techniques to determine whether a given string pair belongs to the translation defined by the yield of a fixed MBOT. As with arbitrary-rank SCFG, LCFRS parsing is polynomial in the length of the input string pair, but the degree of the polynomial depends on the complexity of the MBOT. To be precise, the degree of the polynomial for LCFRS parsing is determined by the fan-out and rank of the grammar (Seki et al. 1991), and the fan-out and rank of the LCFRS produced by our construction depend in turn on the ranks of the MBOT's states and rules.

If we restrict ourselves to MBOTs that are derived from the composition of STSGs, synchronous parsing is NP-complete if the STSGs to compose are part of the input, because a single STSG suffices. For a composition of fixed STSGs, we obtain a fixed MBOT, and polynomial time parsing is possible. Theorem 5 indicates that we cannot apply SCFG parsing techniques off the shelf, but rather that we must implement some type of more general parsing system. Either of the STSGs used in our proof of Theorem 5 can be binarized and synchronously parsed in time *O*(*n*^{6}), but tabular parsing for the LCFRS resulting from composition has higher complexity. Thus, composing STSGs generally increases the complexity of synchronous parsing.

The problem of language-model–integrated decoding with synchronous grammars is closely related to that of synchronous parsing; both problems can be seen as intersecting the grammar with a fixed source-language string and a finite-state machine constraining the target-language string. The widely used decoding algorithms for SCFG (Yamada and Knight 2002; Zollmann and Venugopal 2006; Huang et al. 2009) search for the highest-scoring translation when combining scores from a weighted SCFG and a weighted finite-state language model. As with SCFG, language-model–integrated decoding for weighted MBOTs can be performed by adding *n*-gram language model state to each candidate target language span. This, as with synchronous parsing, gives an algorithm that is polynomial in the length of the input sentence for a fixed MBOT, but with an exponent that depends on the complexity of the MBOT. Furthermore, Theorem 5 indicates that SCFG-based decoding techniques cannot be applied off the shelf to compositions of STSGs, and that composition of STSGs in general increases decoding complexity.

Finally, we note that finding the highest-scoring translation without incorporating a language model is equivalent to parsing with the source or target projection of the MBOT used to model translation. For the source language of the MBOT, this implies time *O*(*n*^{3}) because the problem reduces to CFG parsing. For the target language of the MBOT, this implies polynomial-time parsing, where the degree of the polynomial depends on the MBOT, as a result of Theorem 4.

## 5. Conclusion

MBOTs are desirable for natural language processing applications because they are closed under composition and can be used to represent sequences of transformations of the type performed by STSGs. However, the string translations produced by MBOTs representing compositions of STSGs are strictly more powerful than the string translations produced by STSGs, which are equivalent to the translations produced by SCFGs. From the point of view of machine translation, because parsing with general LCFRS is NP-complete, restrictions on the power of MBOTs will be necessary in order to achieve polynomial–time algorithms for synchronous parsing and language-model–integrated decoding. Our result on the string translations produced by compositions of STSGs implies that algorithms for SCFG-based synchronous parsing or language-model-integrated decoding cannot be applied directly to these problems, and that composing STSGs generally increases the complexity of these problems. Developing parsing algorithms specific to compositions of STSGs, as well as possible restrictions on the STSGs to be composed, presents an interesting area for future work.

## Acknowledgements

We are grateful for extensive feedback on earlier versions of this work from Giorgio Satta, Andreas Maletti, Adam Purtee, and three anonymous reviewers. This work was partially funded by NSF grant IIS-0910611.

## References

## Author notes

Computer Science Department, University of Rochester, Rochester NY 14627. E-mail: gildea@cs.rochester.edu.

*Language and Information: Selected Essays on their Theory and Application*, Addison-Wesley, 1964, pages 116–150.