Abstract
We study the problem of sampling trees from forests, in the setting where probabilities for each tree may be a function of arbitrarily large tree fragments. This setting extends recent work for sampling to learn Tree Substitution Grammars to the case where the tree structure (TSG derived tree) is not fixed. We develop a Markov chain Monte Carlo algorithm which corrects for the bias introduced by unbalanced forests, and we present experiments using the algorithm to learn Synchronous Context-Free Grammar rules for machine translation. In this application, the forests being sampled represent the set of Hiero-style rules that are consistent with fixed input word-level alignments. We demonstrate equivalent machine translation performance to standard techniques but with much smaller grammars.
1. Introduction
Recent work on learning Tree Substitution Grammars (TSGs) has developed procedures for sampling TSG rules from known derived trees (Cohn, Goldwater, and Blunsom 2009; Post and Gildea 2009). Here one samples binary variables at each node in the tree, indicating whether the node is internal to a TSG rule or is a split point between two rules. We consider the problem of learning TSGs in cases where the tree structure is not known, but rather where possible tree structures are represented in a forest. For example, we may wish to learn from text where treebank annotation is unavailable, but a forest of likely parses can be produced automatically. Another application on which we focus our attention in this article arises in machine translation, where we want to learn translation rules from a forest representing the phrase decompositions that are consistent with an automatically derived word alignment. Both these applications involve sampling TSG trees from forests, rather than from fixed derived trees.
Chappelier and Rajman (2000) present a widely used algorithm for sampling trees from forests: One first computes an inside probability for each node bottom–up, and then chooses an incoming hyperedge for each node top–down, sampling according to each hyperedge's inside probability. Johnson, Griffiths, and Goldwater (2007) use this sampling algorithm in a Markov chain Monte Carlo framework for grammar learning. We can combine the representations used in this algorithm and in the TSG learning algorithm discussed earlier, maintaining two variables at each node of the forest, one for the identity of the incoming hyperedge, and another representing whether the node is internal to a TSG rule or is a split point. However, computing an inside probability for each node, as in the first phase of the algorithm of Johnson, Griffiths, and Goldwater (2007), becomes difficult because of the exponential number of TSG rules that can apply at any node in the forest. Not only is the number of possible TSG rules that can apply given a fixed tree structure exponentially large in the size of the tree, but the number of possible tree structures under a node is also exponentially large. This problem is particularly acute during grammar learning, as opposed to sampling according to a fixed grammar, because any tree fragment is a valid potential rule. Cohn and Blunsom (2010) address the large number of valid unseen rules by decomposing the prior over TSG rules into an equivalent probabilistic context-free grammar; however, this technique only applies to certain priors. In general, algorithms that match all possible rules are likely to be prohibitively slow, as well as unwieldy to implement. In this article, we design a sampling algorithm that avoids explicitly computing inside probabilities for each node in the forest.
In Section 2, we derive a general algorithm for sampling tree fragments from forests. We avoid computing inside probabilities, as in the TSG sampling algorithms of Cohn, Goldwater, and Blunsom (2009) and Post and Gildea (2009), but we must correct for the bias introduced by the forest structure, a complication that does not arise when the tree structure is fixed. In order to simplify the presentation of the algorithm, we first set aside the complication of large, TSG-style rules, and describe an algorithm for sampling trees from forests while avoiding computation of inside probabilities. This algorithm is then generalized to learn the composed rules of TSG in Section 2.3.
As an application of our technique, we present machine translation experiments in the remainder of the article. We learn Hiero-style Synchronous Context-Free Grammar (SCFG) rules (Chiang 2007) from bilingual sentences for which a forest of possible minimal SCFG rules has been constructed from fixed word alignments. The construction of this forest and its properties are described in Section 3. We make the assumption that the alignments produced by a word-level model are correct in order to simplify the computation necessary for rule learning. This approach seems safe given that the pipeline of alignment followed by rule extraction has generally remained the state of the art despite attempts to learn joint models of alignment and rule decomposition (DeNero, Bouchard-Cote, and Klein 2008; Blunsom et al. 2009; Blunsom and Cohn 2010a). We apply our sampling algorithm to learn the granularity of rule decomposition in a Bayesian framework, comparing sampling algorithms in Section 4. The end-to-end machine translation experiments of Section 5 show that our algorithm is able to achieve performance equivalent to the standard technique of extracting all rules, but results in a significantly smaller grammar.
2. Sampling Trees from Forests
As a motivating example, consider the small example forest of Figure 1. This forest contains a total of five trees, one under the hyperedge labeled A, and four under the hyperedge labeled B (the cross-product of the two options for deriving node 4 and the two options for deriving node 5).
A tree can be specified by attaching a variable zn to each node n in the forest indicating which incoming hyperedge is to be used in the current tree. For example, variable z1 can take values A and B in Figure 1, whereas variables z2 and z3 can only take a single value. We use z to refer to the entire set of variables zn in a forest. Each assignment to z specifies a unique tree, τ(z), which can be found by following the incoming hyperedges specified by z from the goal node of each forest down to the terminals.
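To make this representation concrete, the following Python sketch encodes a forest with the same shape as Figure 1 (node and hyperedge names, and the leaf structure, are our own illustrative choices) and reads off τ(z) by following the chosen incoming hyperedges down from the goal node.

```python
# A forest shaped like Figure 1: the goal node 1 has two incoming
# hyperedges, A (children 2, 3) and B (children 4, 5); nodes 4 and 5 each
# have two incoming hyperedges; nodes 2 and 3 have one; 6 and 7 are
# terminals.  The leaf structure is illustrative, not taken from the figure.
FOREST = {
    1: {"A": [2, 3], "B": [4, 5]},
    2: {"a": [6]}, 3: {"b": [7]},
    4: {"c": [6], "d": [6]}, 5: {"e": [7], "f": [7]},
    6: {}, 7: {},
}
GOAL = 1

def tau(z):
    """Return the tree selected by assignment z as a set of (node, hyperedge)
    pairs, found by following z down from the goal node."""
    tree, stack = [], [GOAL]
    while stack:
        n = stack.pop()
        if not FOREST[n]:              # terminal node: nothing to choose
            continue
        tree.append((n, z[n]))
        stack.extend(FOREST[n][z[n]])  # descend along the chosen hyperedge
    return tuple(sorted(tree))

z = {1: "B", 2: "a", 3: "b", 4: "c", 5: "f"}
print(tau(z))   # only z[t] = (z1, z4, z5) actually matters for this tree
```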
A naive sampling strategy would be to resample each of these variables zn in turn, holding all others constant, as in standard Gibbs sampling. Suppose that our target distribution Pt is uniform over the five trees in the forest. If we choose an incoming hyperedge according to the probability Pt(τ(z)) of the resulting tree, holding all other variable assignments fixed, we see that, because Pt is uniform, we will choose with uniform probability 1/m among the m incoming hyperedges at each node. In particular, we will choose between the two incoming hyperedges at the root (node 1) with equal probability, meaning that, over the long run, the sampler will spend half its time in the state for the single tree corresponding to nodes 2 and 3, and only one eighth of its time in each of the four other possible trees. Our naive algorithm has failed at its goal of sampling among the five possible trees each with probability 1/5.
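Continuing the sketch above, the imbalance is easy to verify by enumeration: because the naive strategy resamples every zn uniformly among its incoming hyperedges, its long-run distribution is uniform over assignments to z, and tallying the trees those assignments select reproduces the 1/2 versus 1/8 split described here.

```python
from collections import Counter
from itertools import product

# Enumerate every assignment to z; under the naive strategy all of them are
# equally likely in the long run, so counting trees over assignments gives
# the sampler's long-run distribution over trees.
nodes = [n for n in FOREST if FOREST[n]]            # non-terminal nodes
choices = [list(FOREST[n]) for n in nodes]
tally = Counter(tau(dict(zip(nodes, c))) for c in product(*choices))

total = sum(tally.values())
for tree, count in sorted(tally.items(), key=lambda kv: -kv[1]):
    print(count / total, tree)
# The single tree under hyperedge A gets 1/2; each of the four trees under
# B gets 1/8, instead of the uniform 1/5 we want.
```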
Thus, we cannot adopt the simple Gibbs sampling strategy, used for TSG induction from fixed trees, of resampling one variable at a time according to the target distribution, conditioned on all other variables. The intuitive reason for this, as illustrated by the example, is the bias introduced by forests that are bushier (that is, have more derivations for each node) in some parts than in others. The algorithm derived in the remainder of this section corrects for this bias, while avoiding the computation of inside probabilities in the forest.
2.1 Choosing a Stationary Distribution
We will design our sampling algorithm by first choosing a distribution Pz over the set of variables z defined earlier. We will show correctness of our algorithm by showing that it is a Markov chain converging to Pz, and that Pz results in the desired distribution Pt over trees.
The tree specified by an assignment to z will be denoted τ(z) (see Table 1). For a tree t the vector containing the variables found at nodes in t will be denoted z[t]. This is a subvector of z: for example, in Figure 1, if t chooses A at node 1 then z[t] = (z1, z2, z3), and if t chooses B at node 1 then z[t] = (z1, z4, z5). We use z[¬t] to denote the variables not used in the tree t. The vectors z[t] and z[¬t] differ according to t, but for any tree t, the two vectors form a partition of z. There are many values of z that correspond to the same tree, but each tree t corresponds to a unique subvector of variables z[t] and a unique assignment to those specific variables—we will denote this unique assignment of variables in z[t] by ζ(t). (In terms of τ and ζ one has for any z and t that τ(z) = t if and only if z[t] = ζ(t).)
Table 1

| Symbol | Definition |
|---|---|
| Pt | desired distribution on trees |
| z | vector of variables |
| Z | random vector over z |
| τ(z) | tree corresponding to setting of z |
| Z[t] | subset of random variables that occur in tree t |
| ζ(t) | setting of variables in Z[t] |
| Z[¬t] | subset of random variables that do not occur in t |
We proceed by designing a Gibbs sampler for this Pz. The sampler resamples variables from z one at a time, according to the joint probability Pz(z) for each alternative. The set of possible values for zn at a node n having m incoming hyperedges consists of the hyperedges ej, 1 ≤ j ≤ m. Let sj be the vector z with the value of zn changed to ej. Note that the τ(sj)'s only differ at nodes below n.
Let z[inn] be the vector consisting of the variables at nodes below n (that is, contained in subtrees rooted at n, or “inside” n) in the forest, and let z[¬inn] be the vector consisting of the variables not under node n. Thus the vectors z[inn] and z[¬inn] partition the complete vector z. We will use the notation z[t ∩ inn] to represent the vector of variables from z that are both in a tree t and under a node n.
For Gibbs sampling, we need to compute the relative probabilities of the alternatives sj. We now consider the two terms of Equation (2) in this setting. Because of our requirement that Pz correspond to the desired distribution Pt, the first term of Equation (2) can be computed, up to a normalization constant, by evaluating our model over trees (Equation (1)).
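The article's Equations (1), (2), and (6) are not reproduced in this excerpt, so the following sketch is only our reading of the update: it takes Pz(z) to be Pt(τ(z)) divided by the number of settings of the unused variables z[¬τ(z)], which satisfies the requirement Pz(τ(Z) = t) = Pt(t) quoted in the proof of Theorem 1. Under that choice, the Gibbs weight of a candidate hyperedge is Pt of the resulting tree times a density factor equal to the product of in-degrees of the nodes the candidate uses below n. The code continues the Python example from earlier in this section.

```python
import random

def in_degree(n):
    return len(FOREST[n])

def nodes_below(n, e, assign):
    """Non-terminal nodes used below n when hyperedge e is chosen at n,
    following the current assignment at the descendants."""
    used, stack = [], list(FOREST[n][e])
    while stack:
        m = stack.pop()
        if FOREST[m]:
            used.append(m)
            stack.extend(FOREST[m][assign[m]])
    return used

def resample_node(n, z, P_t):
    """Gibbs update for z_n under our reading of the density factor:
    weight(e) = P_t(resulting tree) * product of in-degrees of the nodes
    the candidate subtree uses (equivalently, P_t divided by the number of
    settings of the variables the candidate leaves unused)."""
    edges, weights = list(FOREST[n]), []
    for e in edges:
        s = dict(z)
        s[n] = e
        density = 1.0
        for m in nodes_below(n, e, s):
            density *= in_degree(m)
        weights.append(P_t(tau(s)) * density)
    z[n] = random.choices(edges, weights=weights)[0]

# With a uniform P_t, the update at the goal node weights hyperedge A by 1
# and hyperedge B by 2 * 2 = 4, so each of the five trees is sampled with
# probability 1/5 in the long run.
```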
2.2 Sampling Schedule
In a standard Gibbs sampler, updates are made iteratively to each variable z1, …, zN, and this general strategy can be applied in our case. However, it may be wasteful to continually update variables that are not used by the current tree and are unlikely to be used by any tree. We propose an alternative sampling schedule consisting of sweeping from the root of the current tree down to its leaves, resampling variables at each node in the current tree as we go. If an update changes the structure of the current tree, the sweep continues along the new tree structure. This strategy is shown in Algorithm 1, where v(z, i) denotes the ith variable in a top–down ordering of the variables of the current tree τ(z). The top–down ordering may be depth-first or breadth-first, among other possibilities, as long as the variables at each node have lower indices than the variables at the node's descendants in the current tree.
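A minimal sketch of one sweep (ours, not the article's Algorithm 1 verbatim), reusing resample_node from the Section 2.1 sketch: the sweep starts at the goal node, resamples the variable at every tree node it reaches, and then descends along whichever hyperedge is selected there after the update, so it automatically follows any change in the tree structure.

```python
def top_down_sweep(z, P_t):
    """One breadth-first sweep over the current tree, parents before
    children, resampling each variable and following the tree as it changes."""
    queue = [GOAL]
    while queue:
        n = queue.pop(0)
        if not FOREST[n]:                 # terminal: nothing to resample
            continue
        resample_node(n, z, P_t)          # may replace the subtree under n
        queue.extend(FOREST[n][z[n]])     # descend along the *new* choice
```

Running repeated sweeps and recording τ(z) after each then gives samples whose distribution converges to Pt (Theorem 1), provided the update at each node uses the corrected weights.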
To show that this sampling schedule will converge to the desired distribution over trees, we will first show that Pz is a stationary distribution for the transition defined by a single step of the sweep:
Lemma 1
For any setting of variables z, any top–down ordering v(z,i), and any i, updating variable zv(z,i) according to Equation (6) is stationary with respect to the distribution Pz defined by Equation (2).
Proof
By symmetry of the right-hand side of Equation (9) in z and z′, we see that Equation (7) is satisfied. Because detailed balance implies stationarity, Pz is a stationary distribution of P(Z(i + 1) = z′|Z(i) = z).
This lemma allows us to prove the correctness of our main algorithm:
Theorem 1
For any top–down sampling schedule v(z,i), and any desired distribution over trees Pt that assigns non-zero probability to all trees in the forest, Algorithm 1 will converge to Pt.
Proof
Because Pz is stationary for each step of the sweep, it is stationary for one entire sweep from top to bottom.
To show that the Markov chain defined by an entire sweep is ergodic, we must show that it is aperiodic and irreducible. It is aperiodic because the chain can stay in the same configuration with non-zero probability by selecting the same setting for each variable in the sweep. The chain is irreducible because any configuration can be reached in a finite number of steps by sorting the variables in topological order bottom–up in the forest, and then, for each variable, executing one sweep that selects a tree that includes the desired variable with the desired setting.
Because Pz is stationary for the chain defined by entire sweeps, and this chain is ergodic, the chain will converge to Pz. Because Equation (2) guarantees that Pz( τ(Z) = t) = Pt(t), convergence to Pz implies convergence to Pt.
2.3 Sampling Composed Rules
The proof that the sampling algorithm converges to the correct distribution still applies in the TSG setting, as it makes use of the partition of z into z[τ(z)] and z[¬τ(z)], but does not depend on the functional form of the desired distribution over trees Pt.
3. Phrase Decomposition Forest
In the remainder of this article, we will apply the algorithm developed in the previous section to the problem of learning rules for machine translation in the context of a Hiero-style, SCFG-based system. As in Hiero, our grammars will make use of a single nonterminal X, and will contain rules with a mixture of nonterminals and terminals on the right-hand side, with at most two nonterminal occurrences in the right-hand side of a rule. In general, many overlapping rules of varying sizes are consistent with the input word alignments, meaning that we must address a type of segmentation problem in order to learn rules of the right granularity. Given the restriction to two right-hand side nonterminals, the maximum number of rules that can be extracted from an input sentence pair is O(n¹²) in the sentence length, because the left and right boundaries of the left-hand side (l.h.s.) nonterminal and each of the two right-hand side nonterminals can take O(n) positions in each of the two languages. This complexity leads us to explore sampling algorithms, as dynamic programming approaches are likely to be prohibitively slow. In this section, we show that the problem of learning rules can be analyzed as a problem of identifying tree fragments of unknown size and shape in a forest derived from the input word alignments for each sentence. These tree fragments are similar to the tree fragments used in TSG learning. As in TSG learning, each rule of the final grammar consists of some number of adjacent, minimal tree fragments: one-level treebank expansions in the case of TSG learning and minimal SCFG rules, defined subsequently, in the case of translation. The internal structure of TSG rules is used during parsing to determine the final tree structure to output, whereas the internal structure of machine translation rules will not be used at decoding time. This distinction is irrelevant during learning. A more significant difference from TSG learning is that the sets of minimal tree fragments in our SCFG application come not from a single, known tree, but rather from a forest representing the set of bracketings consistent with the input word alignments.
We now proceed to precisely define this phrase decomposition forest and discuss some of its theoretical properties. The phrase decomposition forest is designed to extend the phrase decomposition tree defined by Zhang, Gildea, and Chiang (2008) in order to explicitly represent each possible minimal rule with a hyperedge.
If two phrases n = ([i1, j1], [i2, j2]) and n′ = ([i′1, j′1], [i′2, j′2]) intersect, we can take the union of the two phrases by taking the union of the source and target language spans, respectively. That is, n ∪ n′ = ([i1, j1] ∪ [i′1, j′1], [i2, j2] ∪ [i′2, j′2]). An important property of phrases is that if two phrases intersect, their union is also a phrase. For example, given that have a date with her and with her today are both valid phrases in Figure 2, have a date with her today must also be a valid phrase. Given a set T of phrases, we define the union closure of the phrase set T, denoted ∪*(T), to be constructed by repeatedly joining intersecting phrases until there are no intersecting phrases left.
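A small sketch of the union-closure operation (the span representation and the exact intersection test are our own reading: a phrase is a pair of inclusive source and target spans, and two phrases intersect if they overlap without one containing the other):

```python
def overlap(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

def contains(a, b):
    return a[0] <= b[0] and b[1] <= a[1]

def intersect(p, q):
    """Phrases p and q intersect if their spans overlap (source or target)
    but neither phrase contains the other."""
    return ((overlap(p[0], q[0]) or overlap(p[1], q[1]))
            and not (contains(p[0], q[0]) and contains(p[1], q[1]))
            and not (contains(q[0], p[0]) and contains(q[1], p[1])))

def union(p, q):
    return ((min(p[0][0], q[0][0]), max(p[0][1], q[0][1])),
            (min(p[1][0], q[1][0]), max(p[1][1], q[1][1])))

def union_closure(phrases):
    """Repeatedly join intersecting phrases until none intersect."""
    phrases, changed = set(phrases), True
    while changed:
        changed = False
        for p in list(phrases):
            for q in list(phrases):
                if p != q and intersect(p, q):
                    phrases -= {p, q}
                    phrases.add(union(p, q))
                    changed = True
                    break
            if changed:
                break
    return phrases

# e.g., two overlapping phrases merge into one (spans are illustrative,
# not read off Figure 2):
print(union_closure({((2, 6), (1, 6)), ((5, 7), (4, 7))}))   # {((2, 7), (1, 7))}
```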
Each edge in E, written as T → n, is made of a set of non-intersecting tail nodes T ⊂ V, and a single head node n ∈ V that covers each tail node. Each edge is an SCFG rule consistent with the word alignments. Each tail node corresponds to a right-hand-side nonterminal in the SCFG rule, and any position included in n but not included in any tail node corresponds to a right-hand-side terminal in the SCFG rule. For example, given the aligned sentence pair of Figure 2, the edge {([3,4],[5,6]), ([5,6],[3,4])} → ([2,6],[1,6]) corresponds to an SCFG rule whose English side is have a X2 with X1.
For the rest of this section, we assume that there are no unaligned words. Unaligned words can be temporarily removed from the alignment matrix before building the phrase decomposition forest. After extracting the forest, they are put back into the alignment matrix. For each derivation in the phrase decomposition forest, an unaligned word appears in the SCFG rule whose left-hand side corresponds to the lowest forest node that covers the unaligned word.
Definition 1
An edge T → n is minimal if there does not exist another edge T′ → n such that T′ covers T.
A minimal edge is an SCFG rule that cannot be decomposed by factoring out some part of its right-hand side as a separate rule. We define a phrase decomposition forest to be made of all phrases from a sentence pair, connected by all minimal SCFG rules. A phrase decomposition forest compactly represents all possible SCFG rules that are consistent with word alignments. For the example word alignment shown in Figure 2, the phrase decomposition forest is shown in Figure 3. Each boxed phrase in Figure 2 corresponds to a node in the forest of Figure 3, and hyperedges in Figure 3 represent ways of building phrases out of shorter phrases.
The structure and size of phrase decomposition forests are constrained by the following lemma:
Lemma 2
When there exists more than one minimal edge leading to the same head node n = ([i1,j1], [i2,j2]), each of these minimal edges is a binary split of phrase pair n, which gives us either a straight or inverted binary SCFG rule with no terminals.
Proof
Another interesting property of phrase decomposition forests relates to the length of derivations. A derivation is a tree of minimal edges reaching from a given node all the way down to the forest's terminal nodes. The length of a derivation is the number of minimal edges it contains.
Lemma 3
All derivations under a node in a phrase decomposition forest have the same length.
Proof
This is proved by induction. As the base case, all the nodes at the bottom of the phrase decomposition forest have only one derivation, of length 1. For the induction step, we consider the two possible cases in Lemma 2. The case where a node n has only a single incoming edge is trivial: n has a single derivation length because, by the induction hypothesis, the children under that edge already do. For the case where there are multiple valid binary splits for a node n at span (i, j), let the split points be k1, …, ks. Because the intersection of two phrases is also a phrase, we know that the spans (i, k1), (k1, k2), …, (ks, j) are all valid phrases, and so is any concatenation of consecutive spans in this sequence. Any derivation in this sub-forest structure leading from these s + 1 spans to n has length s, which completes the induction step.
Because all the different derivations under the same node in a phrase decomposition forest contain the same number of minimal rules, we call that number the level of the node. The fact that nodes can be grouped by levels forms the basis of our fast iterative sampling algorithm as described in Section 5.3.
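Because Lemma 3 guarantees that every derivation under a node has the same length, the level of each node can be computed in a single bottom-up pass by expanding any one incoming hyperedge. A sketch, using the same node → {hyperedge: tail nodes} encoding as the Section 2 example, and giving word-level terminals level 0 so that minimal phrase pairs come out at level 1:

```python
def node_levels(forest):
    """Level of each node: the number of minimal edges in any (hence every)
    derivation under it.  Terminals get level 0; any other node gets
    1 + the sum of the levels of the tail nodes of one incoming hyperedge."""
    levels = {}
    def level(n):
        if n not in levels:
            if not forest[n]:
                levels[n] = 0
            else:
                e = next(iter(forest[n]))   # by Lemma 3, any edge will do
                levels[n] = 1 + sum(level(m) for m in forest[n][e])
        return levels[n]
    for n in forest:
        level(n)
    return levels

# e.g., on the toy forest from Section 2, node_levels(FOREST) assigns the
# goal node level 3, and the value is the same whichever hyperedge is expanded.
```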
3.1 Constructing the Phrase Decomposition Forest
Given a word-aligned sentence pair, a phrase decomposition tree can be extracted with a shift-reduce algorithm (Zhang, Gildea, and Chiang 2008). Whereas the algorithm of Zhang, Gildea, and Chiang (2008) constructs a single tree which compactly represents the set of possible phrase trees, we wish to represent the set of all trees as a forest. We now describe a bottom–up parsing algorithm, shown in Algorithm 2, for building this forest. The algorithm considers all spans (i, j) in order of increasing length. The CYK-like loop over split points k (line 10) is only used for the case where a phrase can be decomposed into two phrases, corresponding to a binary SCFG rule with no right-hand side terminals. By Lemma 2, this is the only source of ambiguity in constructing phrase decompositions. When no binary split is found (line 16), a single hyperedge is made that connects the current span with all its maximal children. (A child is maximal if it is not itself covered by another child.) This case can produce SCFG rules with more than two right-hand side nonterminals, and it also produces any rules containing both terminals and nonterminals on the right-hand side. Right-hand side nonterminals correspond to previously constructed nodes n(l, m) in line 23, and right-hand side terminals correspond to advancing a position in the string in line 20.
The running time of the algorithm is O(n³) in terms of the length of the Chinese sentence f. The size of the resulting forests depends on the input alignments. The worst case in terms of forest size is when the input consists of a monotonic, one-to-one word alignment. In this situation, all (i, k, j) tuples correspond to valid hyperedges, and the size of the output forest is O(n³). At the other extreme, when given a non-decomposable permutation as an input alignment, the output forest consists of a single hyperedge. In practice, given Chinese–English word alignments from GIZA++, we find that the resulting forests are highly constrained, and the algorithm's running time is negligible in our overall system. In fact, we find it better to rebuild every forest from a word alignment every time we re-sample a sentence, rather than storing the hypergraphs across sampling iterations.
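A simplified sketch of the ambiguous, binary-split portion of the construction (not the article's Algorithm 2, which also builds the hyperedges to maximal children and handles terminals): it checks phrase-pair consistency with the usual projection test, assuming no unaligned words as in the discussion above, and enumerates the (i, k, j) tuples whose two halves are both phrases.

```python
def target_projection(i, j, align):
    """Target span covered by source span [i, j) under alignment points (s, t)."""
    ts = [t for (s, t) in align if i <= s < j]
    return (min(ts), max(ts) + 1) if ts else None

def is_phrase(i, j, align):
    """Source span [i, j) is a phrase iff its target projection maps back
    inside [i, j) (the usual consistency check; no unaligned words assumed)."""
    proj = target_projection(i, j, align)
    if proj is None:
        return False
    back = [s for (s, t) in align if proj[0] <= t < proj[1]]
    return min(back) >= i and max(back) < j

def binary_hyperedges(flen, align):
    """For every phrase span [i, j), every split point k such that [i, k)
    and [k, j) are both phrases yields a binary hyperedge (i, k, j)."""
    edges = []
    for length in range(2, flen + 1):
        for i in range(flen - length + 1):
            j = i + length
            if not is_phrase(i, j, align):
                continue
            for k in range(i + 1, j):
                if is_phrase(i, k, align) and is_phrase(k, j, align):
                    edges.append((i, k, j))
    return edges

# Worst case from the text: a monotonic one-to-one alignment makes every
# (i, k, j) tuple a valid hyperedge.
print(len(binary_hyperedges(4, [(0, 0), (1, 1), (2, 2), (3, 3)])))   # 10
```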
4. Comparison of Sampling Methods
To empirically verify the sampling methods presented in Section 2, we construct phrase decomposition forests over which we try to learn composed translation rules. In this section, we use a simple probability model for the tree probability Pt in order to study the convergence behavior of our sampling algorithm. We will use a more sophisticated probability model for our end-to-end machine translation experiments in Section 5. For studying convergence, we desire a simpler model with a probability that can be evaluated in closed form.
4.1 Model
We use a very basic generative model based on a Dirichlet process defined over composed rules. The model is essentially the same as the TSG model used by Cohn, Goldwater, and Blunsom (2009) and Post and Gildea (2009).
4.2 Sampling Methods
We wish to sample from the set of possible decompositions into rules, including composed rules, for each sentence in our training data. We follow the top–down sampling schedule discussed in Section 2 and also implement tree-level rejection sampling as a baseline.
Our rejection sampling baseline is a form of Metropolis-Hastings in which a new tree t is sampled from a simple proposal distribution Q(t), and then either accepted or rejected according to the Metropolis-Hastings acceptance rule, as shown in Algorithm 3. As in Algorithm 1, we use v(z,i) to denote a top–down ordering of forest variables. As in all our experiments, Pt is the current tree probability conditioned on the current trees for all other sentences in our corpus, using Equation (11) as the rule probability in Equation (10).
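The accept/reject step is the standard Metropolis-Hastings rule; in the sketch below, draw_from_Q and the densities Q and P_t are placeholders for the proposal and model of Algorithm 3, which is not reproduced here.

```python
import random

def mh_step(current_tree, P_t, Q, draw_from_Q):
    """One tree-level Metropolis-Hastings step: propose a whole new tree
    from Q and accept it with probability
    min(1, [P_t(new) * Q(old)] / [P_t(old) * Q(new)])."""
    proposal = draw_from_Q()
    accept = min(1.0, (P_t(proposal) * Q(current_tree)) /
                      (P_t(current_tree) * Q(proposal)))
    return proposal if random.random() < accept else current_tree
```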
We now describe in more detail our implementation of the approach of Section 2.2. We define two operations on a hypergraph node n, SampleCut and SampleEdge, to change the sampled tree from the hypergraph. SampleCut(n) chooses whether n is a segmentation point or not, deciding if two rules should merge, while SampleEdge(n) chooses a hyperedge under n, making an entire new subtree. Algorithm 4 shows our implementation of Algorithm 1 in terms of tree operations and the sampling operations SampleEdge(n) and SampleCut(n).
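A sketch of the two operations, using the same node → {hyperedge: tail nodes} encoding as the Section 2 sketches; the weight functions are placeholders for the scores given by Equations (10)-(11) together with the correction of Section 2, neither of which is reproduced in this excerpt.

```python
import random

def sample_cut(n, cut, weight_of_cut):
    """SampleCut(n): decide whether n is a segmentation point, that is,
    whether the rule above n and the rule below n stay separate (True) or
    merge into one composed rule (False)."""
    w_cut, w_join = weight_of_cut(n, True), weight_of_cut(n, False)
    cut[n] = random.random() < w_cut / (w_cut + w_join)

def sample_edge(n, z, weight_of_edge):
    """SampleEdge(n): choose an incoming hyperedge for n, replacing the
    entire subtree below n."""
    edges = list(FOREST[n])
    weights = [weight_of_edge(n, e) for e in edges]
    z[n] = random.choices(edges, weights=weights)[0]
```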
4.3 Experiments
We used a Chinese–English parallel corpus available from the Linguistic Data Consortium (LDC), composed of newswire text. The corpus consists of 41K sentence pairs, which is 1M words on the English side. We constructed phrase decomposition forests with this corpus and ran the top–down sampling algorithm and the rejection sampling algorithm described in Section 4.2 for one hundred iterations. We used α = 100 for every experiment. The likelihood of the current state was calculated for every iteration. Each setting was repeated five times, and then we computed the average likelihood for each iteration.
Figure 5 shows a comparison of the likelihoods found by rejection sampling and top–down sampling. As expected, we found that the likelihood converged much more quickly with top–down sampling. Figure 6 shows a comparison between two different versions of top–down sampling: the first experiment was run with the density factor described in Section 2, Equation (6), and the second one was run without the density factor. The density factor has a much smaller effect on the convergence of our algorithm than does the move from rejection sampling to top–down sampling, such that the difference between the two curves shown in Figure 6 is not visible at the scale of Figure 5. (The first ten iterations are omitted in Figure 6 in order to highlight the difference.) The small difference is likely due to the fact that our trees are relatively evenly balanced, such that the ratio of the density factor for two trees is not significant in comparison to the ratio of their model probabilities. Nevertheless, we do find higher likelihood states with the density factor than without it. This shows that, in addition to providing a theoretical guarantee that our Markov chain converges to the desired distribution Pt in the limit, the density factor also helps us find higher probability trees in practice.
5. Application to Machine Translation
The results of the previous section demonstrate the performance of our algorithm in terms of the probabilities of the model it is given, but do not constitute an end-to-end application. In this section we demonstrate its use in a complete machine translation system, using the SCFG rules found by the sampler in a Hiero-style MT decoder. We discuss our approach and how it relates to previous work in machine translation in Section 5.1 before specifying the precise probability model used for our experiments in Section 5.2, discussing a technique to speed up the model's burn-in in Section 5.3, and describing our experiments in Section 5.4.
5.1 Approach
A typical pipeline for training current statistical machine translation systems consists of the following three steps: word alignment, rule extraction, and tuning of feature weights. Word alignment is most often performed using the models of Brown et al. (1993) and Vogel, Ney, and Tillmann (1996). Phrase extraction is performed differently for phrase-based (Koehn, Och, and Marcu 2003), hierarchical (Chiang 2005), and syntax-based (Galley et al. 2004) translation models, whereas tuning algorithms are generally independent of the translation model (Och 2003; Chiang, Marton, and Resnik 2008; Hopkins and May 2011).
Recently, a number of efforts have been made to combine the word alignment and rule extraction steps into a joint model, with the hope both of avoiding some of the errors of the word-level alignment, and of automatically learning the decomposition of sentence pairs into rules (DeNero, Bouchard-Cote, and Klein 2008; Blunsom et al. 2009; Blunsom and Cohn 2010a; Neubig et al. 2011). This approach treats both word alignment and rule decomposition as hidden variables in an EM-style algorithm. While these efforts have been able to match the performance of systems based on two successive steps for word alignment and rule extraction, they have generally not improved performance enough to become widely adopted. One possible reason for this is the added complexity and in particular the increased computation time when compared to the standard pipeline. The accuracy of word-level alignments from the standard GIZA++ package has proved hard to beat, in particular when large amounts of training data are available.
Given this state of affairs, the question arises whether static word alignments can be used to guide rule learning in a model which treats the decomposition of a sentence pair into rules as a hidden variable. Such an approach would favor rules which are consistent with the other sentences in the data, and would contrast with the standard practice in Hiero-style systems of simply extracting all overlapping rules consistent with static word alignments. Constraining the search over rule decomposition with word alignments has the potential to significantly speed up training of rule decomposition models, overcoming one of the barriers to their widespread use. Rule decomposition models also have the benefit of producing much smaller grammars than are achieved when extracting all possible rules. This is desirable given that the size of translation grammars is one of the limiting computational factors in current systems, necessitating elaborate strategies for rule filtering and indexing.
In this section, we apply our sampling algorithm to learn rules for the Hiero translation model of Chiang (2005). Hiero is based on SCFG, with a number of constraints on the form that rules can take. The grammar has a single nonterminal, and each rule has at most two right-hand side nonterminals. Most significantly, Hiero allows rules with mixed terminals and nonterminals on the right-hand side. This has the great benefit of allowing terminals to control re-ordering between languages, but also leads to very large numbers of valid rules during the rule extraction process. We wish to see whether, by adding a learned model of sentence decomposition to Hiero's original method of leveraging fixed word-level alignments, we can learn a small set of rules in a system that is both efficient to train and efficient to decode. Our approach of beginning with fixed word alignments is similar to that of Sankaran, Haffari, and Sarkar (2011), although their sampling algorithm reanalyzes individual phrases extracted with Hiero heuristics rather than entire sentences, and produces rules with no more than one nonterminal on the right-hand side.
Most previous work on joint word alignment and rule extraction models was evaluated indirectly, by resorting to heuristic methods to extract rules from learned word alignments or bracketing structures (DeNero, Bouchard-Cote, and Klein 2008; Zhang et al. 2008; Blunsom et al. 2009; Levenberg, Dyer, and Blunsom 2012), and does not directly learn the SCFG rules that are used during decoding. In this article, we work with lexicalized translation rules with a mix of terminals and nonterminals, and we use the rules found by our sampler directly for decoding. Because word alignments are fixed in our model, any improvements we observe in translation quality indicate that our model learns how SCFG rules interact with each other, rather than fixing word alignment errors.
The problem of rule decomposition is not only relevant to the Hiero model. Translation models that make use of monolingual parsing, such as string-to-tree (Galley et al. 2004) and tree-to-string (Liu, Liu, and Lin 2006) models, are known to benefit greatly from learning composed rules (Galley et al. 2006). In the particular case of Hiero rule extraction, although there is no explicit rule composition step, the extracted rules are in fact “composed rules” in the sense of string-to-tree or tree-to-string rule extraction, because they can be further decomposed into smaller SCFG rules that are also consistent with word alignments. Although our experiments only include the Hiero model, the method presented in this article is also applicable to string-to-tree and tree-to-string models, because the phrase decomposition forest presented in Section 3 can be extended to rule learning and extraction for other syntax-based MT models.
5.2 Model
In this section, we describe a generative model based on the Pitman-Yor process (Pitman and Yor 1997; Teh 2006) over derivation trees consisting of composed rules. Bayesian methods have been applied to a number of segmentation tasks in natural language processing, including word segmentation, TSG learning, and learning machine translation rules, as a way of controlling the overfitting that arises because Expectation Maximization tends to prefer longer segments. However, it is important to note that the Bayesian priors in most cases control the size and number of the clusters, but do not explicitly control the size of rules. In many cases, this type of Bayesian prior alone is not strong enough to overcome the preference for longer, less generalizable rules. For example, some previous work in word segmentation (Liang and Klein 2009; Naradowsky and Toutanova 2011) adopts a “length penalty” to remedy this situation. Because we have the prior knowledge that longer rules are less likely to generalize and are therefore less likely to be good rules, we adopt a similar scheme to control the length of rules in our model.
The first two parameters, a concentration parameter α and a discount parameter d, control the shape of the distribution G by controlling the size and the number of clusters. The label of each cluster is drawn from the base distribution P0. Because our alignment is fixed, we do not need a complex base distribution that differentiates better-aligned phrases from others. We use a uniform distribution in which each rule of the same size has equal probability. Because there are fewer possible short rules than long rules, the base distribution should assign a larger uniform probability to each shorter rule and a smaller uniform probability to each longer rule. We reuse the Poisson probability for the base distribution, essentially assuming that the number of possible rules of length ℓ is 1/P(ℓ; λ).
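Equations (10)-(11) are not reproduced in this excerpt, so the sketch below is only a plausible reading of the model: it assumes the usual construction G ∼ PY(d, α, P0) suggested by the parameter names above, takes P0(r) to be the Poisson probability of the rule's length, and computes the standard Chinese-restaurant predictive probability for a single Pitman-Yor process (the article actually keeps a separate process per rule length). rule_length is a hypothetical helper, and the parameter values match those reported in Section 5.4.2.

```python
import math

def rule_length(rule):
    return len(rule)           # placeholder: number of symbols in the rule

def poisson(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def p0(rule, lam=2.0):
    """Base distribution: the Poisson probability of the rule's length."""
    return poisson(rule_length(rule), lam)

def pyp_predictive(rule, counts, tables, total_count, total_tables,
                   alpha=5.0, d=0.5):
    """Pitman-Yor (Chinese-restaurant) predictive probability of a rule given
    the current seating arrangement: a discounted count for the rule itself
    plus a new-table term weighted by the base distribution."""
    seen = max(counts.get(rule, 0) - d * tables.get(rule, 0), 0.0)
    new = (alpha + d * total_tables) * p0(rule)
    return (seen + new) / (total_count + alpha)
```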
5.3 Stratified Sampling
We use the same Gibbs sampler introduced in Section 4.2. The SampleEdge operation in our Gibbs sampler can be a relatively expensive operation, because the entire subtree under a node is being changed during sampling. We observe that in a phrase decomposition forest, lexicalized rules, which are crucial to translation quality, appear at the bottom level of the forest. This lexicalized information propagates up the forest as rules get composed. It is reasonable to constrain initial sampling iterations to work only on those bottom-level nodes, and then gradually lift the constraint. This not only makes the sampler much more efficient, but also gives it a chance to focus on getting better estimates of the more important parameters, before starting to consider nodes at higher levels, which correspond to rules of larger size. Fortunately, as mentioned in Section 3, each node in a phrase decomposition forest already has a unique level, with level 1 nodes corresponding to minimal phrase pairs. We design the sampler to use a stratified sampling process (i.e., sampling level 1 nodes for K iterations, then level 1 and 2 nodes for K iterations, and so on). We emphasize that when we sample level 2 nodes, level 1 nodes are also sampled, which means parameters for the smaller rules are given more chance to mix, and thereby settle into a more stable distribution.
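The schedule itself is simple to express given the node levels of Section 3; a sketch (sweep stands in for the sampler of Section 4.2, K = 10 matches the schedule of Section 5.4.2, and the cap of seven levels mirrors the constraint described there):

```python
def stratified_sampling(forest, z, sweep, K=10, max_level=7):
    """Sample bottom levels first: stage s touches only nodes of level <= s,
    and s increases by one every K iterations."""
    levels = node_levels(forest)                    # from the Section 3 sketch
    for iteration in range(K * max_level):
        stage = iteration // K + 1                  # 1, 2, ..., max_level
        active = {n for n, lv in levels.items() if 1 <= lv <= stage}
        sweep(z, active)                            # resample only active nodes
```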
In our experiments, running the first 100 iterations of sampling with regular sampling techniques took us about 18 hours. However, with stratified sampling, it took only about 6 hours. We also compared translation quality as measured by decoding with rules from the 100th sample, and by averaging over every 10th sample. Both sampling methods gave us roughly the same translation quality as measured in BLEU. We therefore used stratified sampling throughout our experiments.
5.4 Experiments
We used a Chinese–English parallel corpus available from LDC,1 composed of newswire text. The corpus consists of 41K sentence pairs, which is 1M words on the English side. We used a 392-sentence development set with four references for parameter tuning, and a 428-sentence test set with four references for testing.2 The development set and the test set have sentences with fewer than 30 words. A trigram language model was used for all experiments. BLEU (Papineni et al. 2002) was calculated for evaluation.
5.4.1 Baseline
For our baseline system, we extract Hiero translation rules using the heuristic method (Chiang 2007), with the standard Hiero rule extraction constraints. We use our in-house SCFG decoder for translation with both the Hiero baseline and our sampled grammars. Our features for all experiments include differently normalized rule counts and lexical weightings (Koehn, Och, and Marcu 2003) of each rule. Feature weights are tuned with Pairwise Ranking Optimization (Hopkins and May 2011) on the baseline grammar and the development set, and are then used throughout the experiments.
Because our sampling procedure results in a smaller rule table, we also establish a no-singleton baseline to compare our results to a simple heuristic method of reducing rule table size. The no-singleton baseline discards rules that occur only once and that have more than one word on the Chinese side during the Hiero rule extraction process, before counting the rules and computing feature scores.
5.4.2 Experimental Settings
Model parameters. For all experiments, we used d = 0.5 for the Pitman-Yor discount parameter, except where we compared the Pitman-Yor process with the Dirichlet process (d = 0). Although we have a separate Pitman-Yor process for each rule length, we used the same α = 5 for all rule sizes in all experiments, including the Dirichlet process experiments. For the rule length probability, a Poisson distribution with λ = 2 was used for all experiments.
Sampling. The samples are initialized such that all nodes in a forest are set to be segmented, and a random edge is chosen under each node. For all experiments, we ran the sampler for 100 iterations and took the sample from the last iteration to compare with the baseline. For stratified sampling, we raised the maximum level being sampled every 10th iteration. We also tried “averaging” samples, where samples from every 10th iteration are merged into a single grammar. For averaging, we took the samples from the 0th iteration (initialization) to the 70th iteration at every 10th iteration.3 We chose the 70th iteration (the last iteration of level-7 sampling) as the final sample because we constrained the sampler not to resample nodes whose span covers more than seven words (this constraint applies to SampleCut only; for such nodes, SampleCut always segments), and because the likelihood becomes very stable at that point.
5.4.3 Results
Table 2 summarizes our results. As a general test of our probability model, we compare the result from initialization and the 100th sample. The translation performance of the grammar from the 100th iteration of sampling is much higher than that of the initialization state. This shows that states with higher probability in our Markov chain generally do result in better translation, and that the sampling process is able to learn valuable composed rules.
Table 2

| | iteration | model | pruning | #rules | dev | test | time (s) |
|---|---|---|---|---|---|---|---|
| Baseline | | heuristic | Hiero | 3.59M | 25.5 | 25.1 | 809 |
| No-singleton | | heuristic | Hiero | 1.09M | 24.7 | 24.2 | 638 |
| Sampled | 0th (init) | Pitman-Yor | scope < 3 | 212K | 19.9 | 19.1 | 489 |
| Sampled | 100th | Pitman-Yor | scope < 3 | 313K | 23.9 | 23.3 | 1,214 |
| Sampled | averaged (0 to 70) | Pitman-Yor | scope < 3 | 885K | 26.2 | 24.5 | 1,488 |
| Sampled | averaged (0 to 70) | Pitman-Yor | Hiero | 785K | 25.6 | 25.1 | 532 |
| Sampled | averaged (0 to 70) | Dirichlet | scope < 3 | 774K | 24.6 | 23.8 | 930 |
In order to determine whether the composed rules learned by our algorithm are particularly valuable, we compare them to the standard baseline of extracting all rules. The grammar taken from a single sample (the 100th) contains only about 9% as many rules as the baseline grammar, but still produces translation results that are not far below the baseline. A simple way to reduce the number of rules in the baseline grammar is to remove all rules that occur only once in the training data and that contain more than a single word on the Chinese side. This “no-singleton” baseline still leaves us with more rules than our algorithm, with translation results between those of the standard baseline and our algorithm.
We also wish to investigate the trade-off between grammar size and translation performance that is induced by including rules from multiple steps of the sampling process. It is helpful for translation quality to include more than one analysis of each sentence in the final grammar in order to increase coverage of new sentences. Averaging samples also better approximates the long-term behavior of the Markov chain, whereas taking a single sample involves an arbitrary random choice. When we average eight different samples, we get a larger number of rules than from a single sample, but still only a quarter as many rules as in the Hiero baseline. The translation results with eight samples are comparable to the Hiero baseline (not significantly different according to 1,000 iterations of paired bootstrap resampling [Koehn 2004]). Translation results are better with the sampled grammar than with the no-singleton method of reducing grammar size, while the sampled grammar was smaller than the no-singleton rule set. Thus, averaging samples seems to produce a good trade-off between grammar size and quality.
The filtering applied to the final rule set affects both the grammar size and decoding speed, because rules with different terminal/nonterminal patterns have varying decoding complexities. We experimented with two methods of filtering the final grammar: retaining rules of scope no greater than three, and applying the more restrictive Hiero constraints. We do not see a consistent difference in translation quality between these methods, but there is a large impact in terms of speed: the Hiero constraints dramatically speed decoding. The following is the full list of the Hiero constraints, taken verbatim from Chiang (2007):
If there are multiple initial phrase pairs containing the same set of alignments, only the smallest is kept. That is, unaligned words are not allowed at the edges of phrases.
Initial phrases are limited to a length of 10 words on either side.
Rules are limited to five nonterminals plus terminals on the French side.
Rules can have at most two nonterminals, which simplifies the decoder implementation. This also makes our grammar weakly equivalent to an inversion transduction grammar (Wu 1997), although the conversion would create a very large number of new nonterminal symbols.
It is prohibited for nonterminals to be adjacent on the French side, a major cause of spurious ambiguity.
A rule must have at least one pair of aligned words, so that translation decisions are always based on some lexical evidence.
Finally, we wish to investigate whether the added power of the Pitman-Yor process gives any benefit over the simpler Dirichlet process prior, using the same modeling of word length in both cases. We find better translation quality with the Pitman-Yor process, indicating that the additional strength of the Pitman-Yor process in suppressing infrequent rules helps prevent overfitting.
6. Conclusion
We presented a hypergraph sampling algorithm that overcomes the difficulties inherent in computing inside probabilities in applications where the segmentation of the tree into rules is not known.
Given parallel text with word-level alignments, we use this algorithm to learn sentence bracketing and SCFG rule composition. Our rule learning algorithm is based on a compact structure that represents all possible SCFG rules extracted from word-aligned sentence pairs, and works directly with highly lexicalized model parameters. We show that by effectively controlling overfitting with a Bayesian model, and designing algorithms that efficiently sample that parameter space, we are able to learn more compact grammars with competitive translation quality. Based on the framework we built in this work, it is possible to explore other rule learning possibilities that are known to help translation quality, such as learning refined nonterminals.
Our general sampling algorithm is likely to be useful in settings beyond machine translation. One interesting application would be unsupervised or partially supervised learning of (monolingual) TSGs, given text where the tree structure is completely or partially unknown, as in the approach of Blunsom and Cohn (2010b).
Acknowledgments
This work was partially funded by NSF grant IIS-0910611.
Notes
We randomly sampled our data from various sources (LDC2006E86, LDC2006E93, LDC2002E18, LDC2002L27, LDC2003E07, LDC2003E14, LDC2004T08, LDC2005T06, LDC2005T10, LDC2005T34, LDC2006E26, LDC2005E83, LDC2006E34, LDC2006E85, LDC2006E92, LDC2006E24). The language model is trained on the English side of the entire data set (1.65M sentences, or 39.3M words).
They are from the newswire portion of the NIST MT evaluation data from 2004, 2005, and 2006.
Not including initialization has negligible effect on translation quality.
Author notes
Computer Science Dept., University of Rochester, Rochester NY 14627. E-mail: [email protected].