Abstract
In statistical machine translation (SMT), the optimization of the system parameters to maximize translation accuracy is now a fundamental part of virtually all modern systems. In this article, we survey 12 years of research on optimization for SMT, from the seminal work on discriminative models (Och and Ney 2002) and minimum error rate training (Och 2003), to the most recent advances. Starting with a brief introduction to the fundamentals of SMT systems, we follow by covering a wide variety of optimization algorithms for use in both batch and online optimization. Specifically, we discuss losses based on direct error minimization, maximum likelihood, maximum margin, risk minimization, ranking, and more, along with the appropriate methods for minimizing these losses. We also cover recent topics, including large-scale optimization, nonlinear models, domain-dependent optimization, and the effect of MT evaluation measures or search on optimization. Finally, we discuss the current state of affairs in MT optimization, and point out some unresolved problems that will likely be the target of further research in optimization for MT.
1. Introduction
Machine translation (MT) has long been both one of the most promising applications of natural language processing technology and one of the most elusive. However, over approximately the past decade, huge gains in translation accuracy have been achieved (Graham et al. 2014), and commercial systems deployed for hundreds of language pairs are being used by hundreds of millions of users. There are many reasons for these advances in the accuracy and coverage of MT, but among them two particularly stand out: statistical machine translation (SMT) techniques that make it possible to learn statistical models from data, and massive increases in the amount of data available to learn SMT models.
Within the SMT framework, there have been two revolutions in the way we mathematically model the translation process. The first was the pioneering work of Brown et al. (1993), who proposed the idea of SMT and described methods for estimating the parameters used in translation. In that work, the parameters of a word-based generative translation model were optimized to maximize the conditional likelihood of the training corpus. The second was the discriminative training framework proposed by Och and Ney (2002) and Och (2003), who proposed log-linear models for MT, optimized either to maximize the probability of selecting the correct sentence from a k-best list of candidates, or to directly achieve the highest accuracy over the entire corpus. By describing the scoring function for MT as a flexibly parameterizable log-linear model, and describing discriminative algorithms to optimize these parameters, it became possible to treat MT like many other structured prediction problems, such as POS tagging or parsing (Collins 2002).
However, within the general framework of structured prediction, MT stands apart in many ways, and as a result requires a number of unique design decisions not necessary in other frameworks (as summarized in Table 1). The first is the search space that must be considered. The search space in MT is generally too large to expand exhaustively, so it is necessary to decide which subset of all the possible hypotheses should be used in optimization. In addition, the evaluation of MT accuracy is not straightforward, with automatic evaluation measures for MT still being researched to this day. From the optimization perspective, even once we have chosen an automatic evaluation measure, it is not necessarily the case that it can be decomposed for straightforward integration with structured learning algorithms. Given this evaluation measure, it is then necessary to incorporate it into a loss function for optimization to target. The loss function should be closely related to the final evaluation objective, while allowing for the use of efficient optimization algorithms. Finally, it is necessary to choose an optimization algorithm. In many cases it is possible to choose a standard algorithm from other fields, but there are also algorithms that have been tailored towards the unique challenges posed by MT.
Table 1: Design decisions in MT optimization, with the sections where each is discussed.

Which Loss Functions?        | Which Optimization Algorithm?
Error (§3.1)                 | Minimum Error Rate Training (§5.1)
Softmax (§3.2)               | Gradient-based Methods (§5.2, §6.5)
Risk (§3.3)                  | Margin-based Methods (§5.3)
Margin, Perceptron (§3.4)    | Linear Regression (§5.4)
Ranking (§3.5)               | Perceptron (§6.2)
Minimum Squared Error (§3.6) | MIRA (§6.3)
                             | AROW (§6.4)

Which Evaluation Measure?                | Which Hypotheses to Target?
Corpus-level, Sentence-level (§2.5)      | k-best vs. Lattice vs. Forest (§2.4)
BLEU and Approximations (§2.5.1, §2.5.2) | Merged k-bests (§5)
Other Measures (§8.3)                    | Forced Decoding (§2.4), Oracles (§4)

Other Topics:
Large Data Sets (§7), Non-linear Models (§8.1),
Domain Adaptation (§8.2), Search and Optimization (§8.4)
In this article, we survey the state of the art in machine translation optimization in a comprehensive and systematic fashion, covering a wide variety of topics, with a unified set of terminology. In Section 2, we first provide definitions of the problem of machine translation, describe briefly how models are built, how features are defined, and how translations are evaluated, and finally define the optimization setting. In Section 3, we next describe a variety of loss functions that have been targeted in machine translation optimization. In Section 4, we explain the selection of oracle translations, a non-trivial process that directly affects the optimization results. In Section 5, we describe batch optimization algorithms, starting with the popular minimum error rate training, and continuing with other approaches using likelihood, margin, rank loss, or risk as objectives. In Section 6, we describe online learning algorithms, first explaining the relationship between corpus-level optimization and sentence-level optimization, and then moving on to algorithms based on perceptron, margin, or likelihood-based objectives. In Section 7, we describe the recent advances in scaling training of MT systems up to large amounts of data through parallel computing, and in Section 8, we cover a number of other topics in MT optimization such as non-linear models, domain adaptation, and the relationship between MT evaluation and optimization. Finally, we conclude in Section 9, overviewing the methods described, making a brief note about which methods see the most use in actual systems, and outlining some of the unsolved problems in the optimization of MT systems.
2. Machine Translation Preliminaries and Definitions
Before delving into the details of actual optimization algorithms, we first introduce preliminaries and definitions regarding MT in general and the MT optimization problem in particular. We focus mainly on the aspects of MT that are relevant to optimization, and readers may refer to Koehn (2010) or Lopez (2008) for more details about MT in general.
2.1 Machine Translation
Machine translation is the problem of automatically translating from one natural language to another. Formally, we define this problem by specifying F to be the collection of all source sentences to be translated, f ∈ F as one of these sentences, and E(f) as the collection of all possible target language sentences that can be obtained by translating f. Machine translation systems perform this translation process by dividing the translation of a full sentence into the translation and recombination of smaller parts, which are represented as hidden variables that together form a derivation.
For example, in phrase-based translation (Koehn, Och, and Marcu 2003), the hidden variables will be the alignment between the phrases of the source and target sentences, and in tree-based translation models (Yamada and Knight 2001; Chiang 2007), the hidden variables will represent the latent tree structure used to generate the translation. We will define D(f) to be the space of possible derivations that can be acquired from source sentence f, and d ∈ D(f) to be one of those derivations. Any particular derivation d will correspond to exactly one e ∈ E(f), although the opposite is not true (the derivation uniquely determines the translation, but there can be multiple derivations corresponding to a particular translation). We also define the tuple 〈e, d〉 consisting of a target sentence and its corresponding derivation, along with the set of all of these tuples for a given source sentence.
The optimization problem that we will be surveying in this article is generally concerned with finding the most effective weight vector w from the set of possible weight vectors ℝ^M. Optimization is also widely called tuning in the SMT literature. In addition, because of the exponentially large number of possible translations in E(f) that must be considered, it is necessary to take advantage of the problem structure, making MT optimization an instance of structured learning.
2.2 Model Construction
The first step of creating a machine translation system is model construction, in which translation models (TMs) are extracted from a large parallel corpus. The TM is usually created by first aligning the parallel text (Och and Ney 2003), using this text to extract multi-word phrase pairs or synchronous grammar rules (Koehn, Och, and Marcu 2003; Chiang 2007), and scoring these rules according to several features explained in more detail in Section 2.3. The construction of the TM is generally performed first in a manner that does not directly consider the optimization of translation accuracy, followed by an optimization step that explicitly considers the accuracy achieved by the system. In this survey, we focus on the optimization step, and thus do not cover elements of model construction that do not directly optimize an objective function related to translation accuracy, but interested readers can reference Koehn (2010) for more details.
In the context of this article, however, the TM is particularly important in the role it plays in defining our derivation space D(f). For example, in the case of phrase-based translation, only phrase pairs included in the TM will be expanded during the process of searching for the best translation (explained in Section 2.4).
This has major implications from the point of view of optimization, the most important of which being that we must use separate data for training the TM and optimizing the parameters w. The reason for this lies in the fact that the TM is constructed in such a way that allows it to “memorize” long multi-word phrases included in the training data. Using the same data to train the model parameters will result in overfitting, learning parameters that heavily favor using these memorized multi-word phrases, which will not be present in a separate test set.
The traditional way to solve this problem is to train the TM on a large parallel corpus on the order of hundreds of thousands to tens of millions of sentences, then perform optimization of parameters on a separate set of data consisting of around one thousand sentences, often called the development set. When learning the weights for larger feature sets, however, a smaller development set is often not sufficient, and it is common to perform cross-validation, holding out some larger portion of the training set for parameter optimization. It is also possible to perform leave-one-out training, where counts of rules extracted from a particular sentence are subtracted from the model before translating the sentence (Wuebker, Mauser, and Ney 2010).
2.3 Features for Machine Translation
Given this overall formulation of MT, the features h(f, e, d) that we choose to use to represent each translation hypothesis are of great importance. In particular, with regard to optimization, there are two important distinctions between types of features: local vs. non-local, and dense vs. sparse.
With regard to the first distinction, local features, such as phrase translation probabilities, do not require additional contexts from other partial derivations, and they are computed independently from one another. On the other hand, when features for a particular phrase pair or synchronous rule cannot be computed independently from other pairs, they are called non-local features. This distinction is important, as local features will not result in an increase in the size of the search space, whereas non-local features have the potential to make search more difficult.
The second distinction is between dense features, which define a small number of highly informative feature functions, and sparse features, which define a large number of less informative feature functions. Dense features are generally easier to optimize, both from a computational point of view because the smaller number of features reduces computational and memory requirements, and because the smaller number of parameters reduces the risk of overfitting. On the other hand, sparse features allow for more flexibility, as their parameters can be directly optimized to increase translation accuracy, so if optimization is performed well they have the potential to greatly increase translation accuracy. The remainder of this section describes some of the widely used features in more detail.
2.3.1 Dense Features
Because the n-gram language model assigns higher penalties to longer translations, it is common to add a word penalty feature that measures the length of translation e to compensate. Similarly, phrase penalty or rule penalty features express the trade-off between longer or shorter derivations. There exist other features that are dependent on the underlying MT system model. Phrase-based MT relies heavily on distortion probabilities, which are computed from the source-side distance between phrase pairs that are adjacent on the target side. More refined lexicalized reordering models estimate their parameters from the training data based on the relative distance of two phrase pairs (Tillman 2004; Galley and Manning 2008).
2.3.2 Sparse Features
Although dense features form the foundation of most SMT systems, in recent years the ability to define richer feature sets and directly optimize the system using rich features has been shown to allow for significant increases in accuracy. On the other hand, large and sparse feature sets make the MT optimization problem significantly harder, and many of the optimization methods we will cover in the rest of this survey are aimed at optimizing rich feature sets.
Another alternative for the creation of features that are sparse, but less sparse than features of entire phrases or rules, is lexical features (Watanabe et al. 2007). Lexical features, similar to lexical weighting, focus on the correspondence between the individual words that are included in a phrase or rule. The simplest variety of lexical features remembers which source words f are aligned with which target words e, and fires a feature for each pair. It is also possible to condition lexical features on the surrounding context in the source language (Chiang, Knight, and Wang 2009; Xiao et al. 2011), fire features between every pair of words in the source or target sentences (Watanabe et al. 2007), or integrate bigrams on the target side (Watanabe et al. 2007). Of these, the first two can be calculated from the source and local target context, but target bigrams require target bigram context and are thus non-local features.
One final variety of features that has proven useful is syntax-based features (Blunsom and Osborne 2008; Marton and Resnik 2008). In particular, phrase-based and hierarchical phrase-based translations do not directly consider syntax (in the linguistic sense) in the construction of the models, so introducing this information in the form of features has a potential for benefit. One way to introduce this information is to parse the input sentence before translation, and use the information in the parse tree in the calculation of features. For example, we can count the number of times a phrase or translation rule matches, or partially matches (Marton and Resnik 2008), a span with a particular label, based on the assumption that rules that match a syntactic span are more likely to be syntactically reasonable.
2.3.3 Summary Features
2.4 Decoding
The problem of decoding is treated as a search problem in which partial derivations, together with the states ρ in Equation (9), are enumerated to form hypotheses or states. In phrase-based MT, search is carried out by enumerating partial derivations in left-to-right order on the target side while remembering the translated source word positions. Similarly, the search in MT with synchronous grammars is performed by using the CYK+ algorithm (Chappelier and Rajman 1998) on the source side and generating partial derivations for progressively longer source spans. Because of the enormous search space brought about by maintaining ρ in each partial derivation, beam search is used to heuristically prune the search space. As a result, search is inexact because of the search errors caused by heuristic pruning, in which the best-scoring hypothesis found is not necessarily optimal in terms of the given model parameters.
The search is efficiently carried out by merging equivalent states encoded as ρ (Koehn, Och, and Marcu 2003; Huang and Chiang 2007), and the space is succinctly represented by compact data structures, such as graphs (Ueffing, Och, and Ney 2002) (or lattices) in phrase-based MT (Koehn, Och, and Marcu 2003) and hypergraphs (Klein and Manning 2004) (or packed forests) in tree-based MT (Huang and Chiang 2007). These data structures may be directly used as compact representations of all derivations for optimization.
However, using these data structures directly can be unwieldy, and thus it is more common to obtain a k-best list as an approximation of the derivation space. Figure 1(a) shows an example of k-best English translations for a French input sentence, 'la délégation chinoise appuiera pleinement la présidence.' The k-best list may be obtained either from a lattice in Figure 1(b) or from a forest in Figure 1(c). It should be noted that different derivations in a k-best list may share the same translation because of variation in the phrases or rules used to construct a translation, e.g., the choice of the single phrase "support the chair" or the two phrases "support" and "the chair" in Figure 1(b). A more diverse k-best list can be obtained by extracting a unique k-best list that maintains only the best-scored derivation among those sharing the same translation (Huang, Knight, and Joshi 2006; Hasan, Zens, and Ney 2007), by incorporating a penalty term when scoring derivations (Gimpel et al. 2013), or by performing Monte Carlo sampling to acquire a more diverse set of candidates (Blunsom and Osborne 2008).
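As a small illustration, the following sketch extracts such a unique k-best list; the (score, translation, derivation) tuple format is a simplifying assumption, not the representation of any particular decoder:

```python
def unique_kbest(kbest, k):
    """Keep only the best-scoring derivation per distinct translation.

    `kbest` is assumed to be a list of (score, translation, derivation)
    tuples, already sorted from best to worst score."""
    seen = set()
    unique = []
    for score, translation, derivation in kbest:
        if translation in seen:
            continue  # a better-scoring derivation of this string exists
        seen.add(translation)
        unique.append((score, translation, derivation))
        if len(unique) == k:
            break
    return unique

# Example: two derivations share the translation "support the chair".
kbest = [(1.2, "support the chair", "d1"),
         (1.1, "support the chair", "d2"),
         (0.9, "support a chair",   "d3")]
print(unique_kbest(kbest, 2))  # keeps d1 and d3
```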
Another class of decoding problem is forced decoding, in which the output from a decoder is forced to match with a reference translation of the input sentence. In phrase-based MT, this is implemented by adding additional features to reward hypotheses that match with the given target sentence (Liang, Zhang, and Zhao 2012; Yu et al. 2013). In MT using synchronous grammars, it is carried out by biparsing over two languages, for instance, by a variant of the CYK algorithm (Wu 1997) or by a more efficient two-step algorithm (Dyer 2010b; Peitz et al. 2012). Even if we perform forced decoding, we are still not guaranteed that the decoder will be able to produce the reference translation (because of unknown words, reordering limits, or other factors). This problem can be resolved by preserving the prefix of partial derivations (Yu et al. 2013), or by allowing approximate matching of the target side (Liang, Zhang, and Zhao 2012). It is also possible to create a neighborhood of a forced decoding derivation by adding additional hyperedges to the true derivation, which allows for efficient generation of negative examples for discriminative learning algorithms (Xiao et al. 2011).
2.5 Evaluation
Once we have a machine translation system that can produce translations, we next must perform evaluation to judge how good the generated translations actually are. As the final consumer of machine translation output is usually a human, the most natural form of evaluation is manual evaluation by human annotators. However, because human evaluation is expensive and time-consuming, in recent years there has been a shift to automatic calculation of the quality of MT output.
In general, automatic evaluation measures use a set of data consisting of N input sentences F = 〈f(1), …, f(N)〉, each of which has a reference translation e(i) that was created by a human translator. The input F is automatically translated using a machine translation system to acquire MT results Ê = 〈ê(1), …, ê(N)〉, which are then compared to the corresponding references. The closer the MT output is to the reference, the better it is deemed to be, according to automatic evaluation. In addition, as there are often many ways to translate a particular sentence, it is also possible to perform evaluation with multiple references created by different translators. There has also been some work on encoding a huge number of references in a lattice, created either by hand (Dreyer and Marcu 2012) or by automatic paraphrasing (Zhou, Lin, and Hovy 2006).
One major distinction between evaluation measures is whether they are calculated at the corpus level or the sentence level. Corpus-level measures are calculated by taking statistics over the whole corpus, whereas sentence-level measures are calculated by measuring sentence-level accuracy and defining the corpus-level accuracy as the average of the sentence-level accuracies. All optimization algorithms that are applicable to corpus-level measures are applicable to sentence-level measures, but the opposite is not true, making this distinction important from the optimization point of view.
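The distinction can be made concrete with a short sketch, in which sent_score, sent_stats, and score_of_stats are hypothetical stand-ins for the components of an actual measure:

```python
def sentence_level(hyps, refs, sent_score):
    # Sentence-level measure: the corpus score is simply the average
    # of per-sentence scores.
    return sum(sent_score(h, r) for h, r in zip(hyps, refs)) / len(hyps)

def corpus_level(hyps, refs, sent_stats, score_of_stats):
    # Corpus-level measure (e.g., BLEU): sufficient statistics are summed
    # over the whole corpus *before* the score is computed, so the result
    # is generally not an average of per-sentence scores.
    totals = [0] * len(sent_stats(hyps[0], refs[0]))
    for h, r in zip(hyps, refs):
        totals = [t + s for t, s in zip(totals, sent_stats(h, r))]
    return score_of_stats(totals)
```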
The most commonly used MT evaluation measure BLEU (Papineni et al. 2002) is defined on the corpus level, and we will cover it in detail as it plays an important role in some of the methods that follow. Of course, there have been many other evaluation measures proposed since BLEU, with TER (Snover et al. 2006) and METEOR (Banerjee and Lavie 2005) being among the most widely used. The great majority of metrics other than BLEU are defined on the sentence level, and thus are conducive to optimization algorithms that require sentence-level evaluation measures. We discuss the role of evaluation in MT optimization more completely in Section 8.3.
2.5.1 BLEU
2.5.2 BLEU+1
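As a point of reference, here is a minimal sketch of the commonly used BLEU+1 formulation (Lin and Och 2004), which smooths the n-gram precision counts for n ≥ 2 so that sentence-level scores do not collapse to zero when some n-gram order has no matches; the tokenization by whitespace is a simplifying assumption:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_plus_one(hyp, ref, max_n=4):
    """Sentence-level BLEU with add-one smoothing on the n-gram
    precisions for n >= 2 (a sketch of the standard formulation)."""
    hyp, ref = hyp.split(), ref.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        match = sum(min(c, r[g]) for g, c in h.items())
        total = max(len(hyp) - n + 1, 0)
        if n == 1:
            if match == 0:
                return 0.0
            log_prec += math.log(match / total)
        else:
            log_prec += math.log((match + 1) / (total + 1))  # add-one smoothing
    bp = min(0.0, 1.0 - len(ref) / len(hyp))  # log brevity penalty
    return math.exp(bp + log_prec / max_n)

print(bleu_plus_one("the chinese delegation fully supports the chair",
                    "the chinese delegation will fully support the chair"))
```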
2.6 The Optimization Setting
During the optimization process, we will assume that we have some data consisting of source sentences F with corresponding references E as defined in the previous section, and that we would like to use these to optimize the parameters of the model. As mentioned in Section 2.5, it is also possible to use more than one reference translation in evaluation, but in this survey we will assume for simplicity of exposition that only one reference is used.
3. Defining a Loss Function
The first step in performing optimization is defining the loss function that we are interested in optimizing. The choice of a proper loss function is critical in that it affects both the final performance of the optimized MT system and the possible choices of optimization algorithm. This section describes several common choices of loss function, along with their various features.
3.1 Error
Error has the advantage of being simple, easy to explain, and directly related to translation performance, and these features make it perhaps the most commonly used loss in current machine translation systems. On the other hand, it also has a large disadvantage in that the loss function expressed in Equation (17) is not convex, and most MT evaluation measures used in the calculation of the error function error(·) are not continuously differentiable. This makes direct minimization of error a difficult optimization problem (particularly for larger feature sets), and thus a number of other, easier-to-optimize losses are used as well.
3.2 Softmax Loss
One thing to note about error is that there is no concept of “probability” of each translation candidate incorporated in its calculation. Being able to define a well-scaled probability of candidates can be useful, however, for estimation of confidence measures or incorporation with downstream applications. Softmax loss is a loss that is similar to the zero–one loss, but directly defines a probabilistic model and attempts to maximize the probability of the oracle translations (Berger, Della Pietra, and Della Pietra 1996; Och and Ney 2002; Blunsom, Cohn, and Osborne 2008).
From Equation (21) we can see that only the oracle translations contribute to the numerator, and all candidates in c(i) contribute to the denominator. Thus, intuitively, the softmax objective prefers parameter settings that assign high scores to the oracle translations, and lower scores to any other members of c(i) that are not oracles.
It should be noted that this loss can be calculated from a k-best list by iterating over the entire list and calculating the numerators and denominators in Equation (19). It is also possible, but more involved, to calculate over lattices or forests by using dynamic programming algorithms such as the forward–backward or inside–outside algorithms (Blunsom, Cohn, and Osborne 2008; Gimpel and Smith 2009).
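For concreteness, here is a minimal sketch of computing this loss from a k-best list of model scores w · h(f, e, d), using a log-sum-exp shift for numerical stability; the list-of-scores representation is an assumption for illustration:

```python
import math

def softmax_loss(scores, oracle_indices):
    """Negative log-probability that the model assigns to the oracle set,
    computed from a k-best list of model scores."""
    m = max(scores)
    log_denom = m + math.log(sum(math.exp(s - m) for s in scores))
    oracle_scores = [scores[i] for i in oracle_indices]
    mo = max(oracle_scores)
    log_numer = mo + math.log(sum(math.exp(s - mo) for s in oracle_scores))
    return log_denom - log_numer  # equals -log P(oracles | f)

print(softmax_loss([2.0, 1.5, 0.3], oracle_indices=[0]))
```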
3.3 Risk-Based Loss
In contrast to softmax loss, which can be viewed as a probabilistic version of zero–one loss, risk defines a probabilistic version of the translation error (Smith and Eisner 2006; Zens, Hasan, and Ney 2007; Li and Eisner 2009; He and Deng 2012). Specifically, risk is based on the expected error incurred by a probabilistic model parameterized by w. This combines the advantages of the probabilistic model in softmax loss with the direct consideration of translation accuracy afforded by using error directly. In comparison to error, it also has the advantage of being differentiable, allowing for easier optimization.
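A minimal sketch of computing risk from a k-best list, assuming the full distribution is approximated by the k-best candidates' scores and precomputed sentence-level errors (the scale hyperparameter is an illustrative assumption):

```python
import math

def risk(scores, errors, scale=1.0):
    """Expected error under the distribution that the model scores induce
    over a k-best list; scale sharpens or flattens the distribution."""
    m = max(scores)
    weights = [math.exp(scale * (s - m)) for s in scores]
    z = sum(weights)
    return sum(w / z * e for w, e in zip(weights, errors))

# The risk falls as probability mass moves toward low-error candidates.
print(risk([2.0, 1.5, 0.3], [0.1, 0.4, 0.9]))
```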
3.4 Margin-Based Loss
The zero–one loss in Section 3.1 was based on whether the oracle received a higher score than other hypotheses. The idea of margin, which is behind the classification paradigm of support vector machines (SVMs) (Joachims 1998), takes this a step further, finding parameters that explicitly maximize the distance, or margin, between correct and incorrect candidates. The main advantage of margin-based methods is that they are able to consider the error function, and often achieve high accuracy. These advantages make margin-based methods perhaps the second most popular loss used in current MT systems after direct minimization of error.
3.5 Ranking Loss
It should be noted that standard ranking techniques make a hard decision between candidates with higher and lower error, which can cause problems when the ranking by error does not correlate well with the ranking measured by the model. The cross-entropy ranking loss solves this problem by softly fitting the model distribution to the distribution of ranking measured by errors (Green et al. 2014).
3.6 Mean Squared Error Loss
4. Choosing Oracles
In the previous section, many loss functions used oracle translations, which are defined as a set of translations for any sentence that are “good.” Choosing oracle translations is not a trivial task, and in this section we describe the details involved.
4.1 Bold vs. Local Updates
In other structured learning tasks such as part-of-speech tagging or parsing, it is common to simply use the correct answer as an oracle. In translation, this is equivalent to optimizing towards an actual human reference, which is called bold update (Liang et al. 2006). It should be noted that even if we know the reference e, we still need to obtain a derivation d, and thus it is necessary to perform forced decoding (described in Section 2.4) to obtain this derivation.
However, bold update has a number of practical difficulties. For example, we are not guaranteed that the decoder is able to actually produce the reference (for example, in the case of unknown words), in which case forced decoding will fail. In addition, even if the hypothesis exists in the search space, it might require a large change in parameters w to ensure that the reference gets a higher score than all other hypotheses. This is true in the case of non-literal translations, for example, which may be producible by the decoder, but only by using a derivation that would normally receive an extremely low probability.
Local update is an alternative method that selects an oracle from a set of hypotheses produced during the normal decoding process. The space of hypotheses used to select oracles is usually based on k-best lists, but can also include lattices or forests output by the decoder as described in Section 2.4. Because of the previously mentioned difficulties with bold update, it has been empirically observed that local update tends to outperform bold update in online optimization (Liang et al. 2006). However, it also makes it necessary to select oracle translations from a set of imperfect decoder outputs, and we will describe this process in more detail in the following section.
4.2 Selecting Oracles and Approximating Corpus-Level Errors
However, when using a corpus-level error function we need a slightly more sophisticated method, such as the greedy method of Venugopal and Vogel (2005). In this method (Figure 2), the oracle is first initialized either as an empty set or by randomly picking from the candidates. Next, we iterate randomly through the translation candidates in c(i), trying to replace the current oracle o(i) with each candidate and checking the change in the error function (line 9); if the error decreases, we replace the oracle with the tested candidate. This process is repeated until there is no change in O.
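A sketch of this greedy procedure, where corpus_error is a hypothetical function that evaluates one selected candidate per sentence against the references (random initialization is used here; initializing with an empty set works analogously):

```python
import random

def greedy_oracle_selection(candidates, corpus_error):
    """Greedy search for corpus-level oracles, in the spirit of
    Venugopal and Vogel (2005). candidates[i] is the candidate list
    c(i) for sentence i."""
    oracles = [random.choice(c) for c in candidates]  # random initialization
    changed = True
    while changed:
        changed = False
        # Visit the sentences in random order.
        for i in random.sample(range(len(candidates)), len(candidates)):
            best = corpus_error(oracles)
            for cand in candidates[i]:
                trial = oracles[:i] + [cand] + oracles[i + 1:]
                if corpus_error(trial) < best:  # replacement lowers the error
                    oracles[i] = cand
                    best = corpus_error(trial)
                    changed = True
    return oracles
```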
4.3 Selecting Oracles for Margin-Based Methods
5. Batch Methods
Now that we have explained the details of calculating loss functions used in machine translation, we turn to the actual algorithms used to optimize these loss functions. In this section, we cover batch learning approaches to MT optimization. Batch learning works by considering the entire training data on every update of the parameters, in contrast to online learning (covered in the following section), which considers only part of the data at any one time. In standard approaches to batch learning, for every training example 〈f(i), e(i)〉 we enumerate every translation and derivation in the respective sets E(f(i)) and D(f(i)), and attempt to adjust the parameters so that we can achieve the translations with the lowest error for the entire data.
However, as mentioned previously, the entire space of derivations is too large to handle in practice. To resolve this problem, most batch learning algorithms for MT follow the general procedure shown in Figure 3, performing iterations that alternate between decoding and optimization (Och and Ney 2002). In line 6, GEN(f(i), w(t)) indicates that we use the current parameters w(t) to perform decoding of sentence f(i) and obtain a subset of all derivations. For convenience, we will assume that this subset is expressed using a k-best list kbest(i), but it is also possible to use lattices or forests, as explained in Section 2.4.
A k-best list with scores for each hypothesis can be used as an approximation for the distribution over potential translations of f(i) according to the parameters w. However, because the size of the k-best list is limited, and the presence of search errors in decoding means that we are not even guaranteed to find the highest-scoring hypotheses, this approximation is far from perfect. The effect of this approximation is particularly obvious if the lack of coverage of the k-best list is systematic. For example, if the hypotheses in the k-best list are all much too short, optimization may attempt to fix this by adjusting the parameters to heavily favor very long hypotheses, far overshooting the actual optimal parameters.
As a way to alleviate the problems caused by this approximation, in line 7 we merge the k-best lists from multiple decoding iterations, finding a larger and more accurate set C of derivations. Given C and the training data 〈F, E〉, we minimize the loss function ℓ(·), regularized by Ω(w), and obtain new parameters w(t+1) (line 9). Generation of k-best lists and optimization is performed until a hard limit of T iterations is reached, or until training has converged. In this setting, convergence is usually defined as any iteration in which the merged k-best list does not change, or in which the parameters w do not change (Och 2003).
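The overall loop can be sketched as follows, with decode_kbest and optimize standing in for a real decoder and loss minimizer (candidates are assumed hashable so that the merged pools can be represented as sets):

```python
def batch_optimize(data, w, decode_kbest, optimize, T=10):
    """Skeleton of the batch loop in Figure 3: alternate decoding and
    optimization, merging k-best lists across iterations."""
    merged = [set() for _ in data]  # the growing candidate pool C
    for t in range(T):
        converged = True
        for i, (f, e) in enumerate(data):
            kbest = decode_kbest(f, w)       # GEN(f, w)
            if not set(kbest) <= merged[i]:  # any new candidates this round?
                converged = False
            merged[i] |= set(kbest)          # merge with earlier k-bests
        if converged:  # merged lists stopped changing, one convergence criterion
            break
        w = optimize(data, merged, w)        # minimize the regularized loss
    return w
```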
Within this batch optimization framework, the most critical challenge is to find an effective way to solve the optimization problem in line 9 of Figure 3. Section 5.1 describes methods for directly optimizing the error function. There are also methods for optimizing other losses such as those based on probabilistic models (Section 5.2), error margins (Section 5.3), ranking (Section 5.4), and risk (Section 5.5).
5.1 Error Minimization
5.1.1 Minimum Error Rate Training Overview
Minimum error rate training (MERT) (Och 2003) is one of the first, and currently the most widely used, methods for MT optimization, and focuses mainly on direct minimization of the error described in Section 3.1. Because error is not continuously differentiable, MERT uses optimization methods that do not require the calculation of a gradient, such as iterative line search inspired by Powell's method (Och 2003; Press et al. 2007), or the downhill simplex (Nelder-Mead) method (Press et al. 2007; Zens, Hasan, and Ney 2007; Zhao and Chen 2009).
The algorithm for MERT using line search is shown in Figure 4. Here, we assume that w and h(·) are M-dimensional, and bm is an M-dimensional vector whose m-th element is 1 and whose remaining elements are zero. For each of the T iterations, we choose the dimension m of the feature vector (line 6), and for each possible weight vector w(j) + γbm choose the γ ∈ ℝ that minimizes ℓerror(·) using line search (line 7). Then, among the γ found for each of the M search dimensions, we perform an update using the dimension and γ that afford the largest reduction in error (lines 9 and 10). This algorithm can be deemed a variety of steepest descent, which is the standard method used in most implementations of MERT (Koehn et al. 2007). Another alternative is a variant of coordinate descent (e.g., Powell's method), in which search and update are performed in each dimension.
One feature of MERT is that it is known to easily fall into local optima of the error function. Because of this, it is standard to choose R starting points (line 4), perform optimization starting at each of these points, and finally choose, from among the weights acquired from each of the R random restarts, the w that minimizes the loss. The R starting points are generally chosen so that one of the points is the best w from the previous iteration, and the remaining R − 1 have each element of w chosen randomly and uniformly from some interval, although it has also been shown that a more intelligent choice of initial points can result in better final scores (Moore and Quirk 2008).
5.1.2 Line Search for MERT
A function like Equation (41) that chooses the highest-scoring line for each span over γ is called an envelope, and can be used to compactly express the results we will obtain by rescoring c(i) according to a particular γ (Figure 6a). After finding the envelope, for each line that participates in the envelope, we can calculate the sufficient statistics necessary for calculating the loss ℓerror(·) and error error(·). For example, given the envelope in Figure 6a, Figure 6b is an example of the sentence-wise loss with respect to γ.
The envelope shown in Equation (41) can also be viewed as the problem of finding a convex hull in computational geometry. A standard and efficient algorithm for finding the convex hull of multiple lines is the sweep line algorithm (Bentley and Ottmann 1979; Macherey et al. 2008) (see Figure 7). Here, we assume L is the set of lines corresponding to the K translation candidates in c(i); each line l ∈ L is expressed as 〈a(l), b(l), γ(l)〉 with intercept a(l) = a(f(i), e, d) and slope b(l) = b(f(i), e, d). Furthermore, we define γ(l) as an intersection point, initialized to −∞. SortLines(L) in Figure 7 sorts the lines in order of their slope b(l), and if two lines lk1 and lk2 have the same slope, it keeps the one with the larger intercept (say, lk1 when a(lk1) > a(lk2)) and deletes the other. We next process the sorted set of lines L′ (|L′| ≤ K) in order of ascending slope (lines 4–18). Assuming H is the envelope, expressed as the set of lines it contains, we find the line in H that intersects the line under consideration at the highest point (lines 6–12), and update the envelope H. As L contains at most K lines, H's size is also at most K.
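A compact sketch of this sweep, representing each line by its intercept and slope; the equal-slope deduplication and the pop/push logic mirror the description above, although details such as the 〈a(l), b(l), γ(l)〉 bookkeeping are simplified:

```python
def envelope(lines):
    """Upper envelope of lines y = a + b * gamma, built by a sweep in
    order of increasing slope. `lines` is a list of (a, b) pairs; returns
    (a, b, gamma) triples, where gamma is the point at which each line
    enters the envelope."""
    # For equal slopes, only the line with the largest intercept can win.
    dedup = {}
    for a, b in lines:
        dedup[b] = max(dedup.get(b, float("-inf")), a)
    hull = []  # stack of [a, b, gamma], gamma = leftmost winning point
    for b in sorted(dedup):  # ascending slope
        a = dedup[b]
        while hull:
            a2, b2, g2 = hull[-1]
            gamma = (a2 - a) / (b - b2)  # intersection with the stack top
            if gamma <= g2:
                hull.pop()  # the top line never wins: discard it
            else:
                hull.append([a, b, gamma])
                break
        if not hull:
            hull.append([a, b, float("-inf")])
    return [tuple(l) for l in hull]

print(envelope([(0.0, 1.0), (1.0, 0.0), (3.0, -1.0)]))  # (1.0, 0.0) never wins
```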
Given a particular input sentence f(i), its set of translation candidates c(i), and the resulting envelope H(i), we can also define the set of intersections between the lines in the envelope, γ1 < γ2 < ⋯, which divide the range of γ into spans. We also define Δℓj to be the change in the loss function that occurs when we move from one span to the next across the intersection γj. If we first calculate the loss incurred when setting γ = −∞, then process the spans in increasing order, keeping track of the difference incurred at each span boundary, it is possible to efficiently calculate the loss curve over all spans of γ.
In addition, whereas all explanation of line search to this point has focused on the procedure for a single sentence, by calculating the envelopes for each sentence in the data 1 ≤ i ≤ N, and combining these envelopes into a single plane, it is relatively simple to perform this processing on the corpus level as well. It should be noted that for corpus-based evaluation measures such as BLEU, when performing corpus-level processing, we do not keep track of the change in the loss, but the change in the sufficient statistics required to calculate the loss for each sentence. In the case of BLEU, the sufficient statistics amount to n-gram counts cn, n-gram matches mn, and reference lengths r. We then calculate the loss curve ℓerror(·) for the entire corpus based on these sufficient statistics, and find a γ that minimizes Equation (17) based on this curve. By repeating this line search for each parameter until we can no longer obtain a decrease, it is possible to find a local minimum in the loss function, even for non-convex or non-differentiable functions.
5.1.3 MERT's Weaknesses and Extensions
Although MERT is widely used as the standard optimization procedure for MT, it also has a number of weaknesses, and a number of extensions to the MERT framework have been proposed to resolve these problems.
The first weakness of MERT is the randomness in the optimization process. Because each iteration of the training algorithm generally involves a number of random restarts, the results will generally change over multiple training runs, with the changes often being quite significant. Some research has shown that this randomness can be stabilized somewhat by improving the ability of the line search algorithm to find a globally good solution, either by choosing random seeds more intelligently (Moore and Quirk 2008; Foster and Kuhn 2009) or by searching in directions that consider multiple features at once, instead of using the simple coordinate ascent described in Figure 4 (Cer, Jurafsky, and Manning 2008). Orthogonally to actual improvement of the results, Clark et al. (2011) suggest that because randomness is a fundamental feature of MERT and other optimization algorithms for MT, it is better experimental practice to perform optimization multiple times, and report the resulting means and standard deviations over the various optimization runs.
It is also possible to optimize the MERT objective using other optimization algorithms. For example, Suzuki, Duh, and Nagata (2011) present a method for using particle swarm optimization, a distributed algorithm where many “particles” are each associated with a parameter vector, and the particle updates its vector in a way such that it moves towards the current local and global optima. Another alternative optimization algorithm is Galley and Quirk's (2011) method for using linear programming to perform search for optimal parameters over more than one dimension, or all dimensions at a single time. However, as MERT remains a fundamentally computationally hard problem, this method takes large amounts of time for larger training sets or feature spaces.
It should be noted that instability in MERT is not entirely due to the fact that search is random, but also due to the fact that k-best lists are poor approximations of the whole space of possible translations. One way to improve this approximation is by performing MERT over an exponentially large number of hypotheses encoded in a translation lattice (Macherey et al. 2008) or hypergraph (Kumar et al. 2009). It is possible to perform MERT over these sorts of packed data structures by observing the fact that the envelopes used in MERT can be expressed as a semiring (Dyer 2010a; Sokolov and Yvon 2011), allowing for exact calculation of the full envelope for all hypotheses in a lattice or hypergraph using polynomial-time dynamic programming (the forward algorithm or inside algorithm, respectively). There has also been work to improve the accuracy of the k-best approximation by either sampling k-best candidates from the translation lattice (Chatterjee and Cancedda 2010), or performing forced decoding to find derivations that achieve the reference translation, and adding them to the k-best list (Liang, Zhang, and Zhao 2012).
The second weakness of MERT is that it has no concept of regularization, causing it to overfit the training data if there are too many features, and there have been several attempts to incorporate regularization to ameliorate this problem. Cer, Jurafsky, and Manning (2008) propose a method to incorporate regularization by not choosing the plateau in the loss curve that minimizes the loss itself, but choosing the point considering the loss values for a few surrounding plateaus, helping to avoid points that have a low loss but are surrounded by plateaus with higher loss. It is also possible to incorporate regularization into MERT-style line search using an SVM-inspired margin-based objective (Hayashi et al. 2009) or by using scale-invariant regularization methods such as L0 or a scaled version of L2 (Galley et al. 2013).
The final weakness of MERT is that it has computational problems when scaling to large numbers of features. When using only a standard set of 20 or so features, MERT is able to perform training in reasonable time, but the number of line searches, and thus the time, required in Figure 4 scales linearly with the number of features. Thus training with hundreds of features is time-consuming, and there are no published results training standard MERT on thousands or millions of features. It should be noted, however, that Galley et al. (2013) report results for thousands of features by choosing intelligent search directions calculated from the gradient of expected BLEU, as explained in Section 5.5.2.
5.2 Gradient-Based Batch Optimization
In the previous section, MERT optimized a loss function that was exactly equivalent to the error function, which is not continuously differentiable and thus precludes the use of standard convex optimization algorithms used in other optimization problems. In contrast, other losses such as the softmax loss described in Section 3.2 and risk-based losses described in Section 3.3 are differentiable, allowing for the use of these algorithms for MT optimization (Smith and Eisner 2006; Blunsom and Osborne 2008).
Convex optimization is well covered in the standard machine learning literature, so we do not cover it in depth, but methods such as conjugate gradient (using first-order statistics) (Nocedal and Wright 2006) and the limited-memory Broyden-Fletcher-Goldfarb-Shanno method (using second-order statistics) (Liu and Nocedal 1989) are standard options for optimizing these losses. These methods are equally applicable when the loss is combined with a differentiable regularizer Ω(w), such as L2 regularization. Using a non-differentiable regularizer such as L1 makes optimization more difficult, but can be handled by other algorithms such as the orthant-wise limited-memory quasi-Newton method (Andrew and Gao 2007).
In addition to the function being differentiable, if it is also convex we can be guaranteed that these algorithms will not get stuck in local optima and instead they will reach a globally optimal solution. In general, the softmax objective is convex if there is only one element in the oracle set o(i), and not necessarily convex if there are multiple oracles. In the case of MT, as there are usually multiple translations e that minimize error(·), and multiple derivations d that result in the same translation e, o(i) will generally contain multiple members. Thus, we cannot be entirely certain that we will reach a global optimum.
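As an illustration of how such a loss can be handed to an off-the-shelf optimizer, the following sketch minimizes an L2-regularized softmax loss for a single k-best list with SciPy's L-BFGS implementation; the toy feature matrix and oracle mask are invented for the example:

```python
import numpy as np
from scipy.optimize import minimize

def softmax_loss_grad(w, feats, oracle_mask, l2=1.0):
    """L2-regularized softmax loss and its gradient over one k-best list.
    feats is a (K, M) matrix of feature vectors h(f, e, d); oracle_mask
    is a boolean vector marking the oracle candidates."""
    scores = feats @ w
    scores = scores - scores.max()  # shift for numerical stability
    p = np.exp(scores)
    p /= p.sum()                    # model distribution over the k-best list
    p_oracle = p[oracle_mask] / p[oracle_mask].sum()
    loss = -np.log(p[oracle_mask].sum()) + 0.5 * l2 * w @ w
    # Gradient: expected features under the model minus expected features
    # under the (renormalized) oracle distribution, plus the regularizer.
    grad = feats.T @ p - feats[oracle_mask].T @ p_oracle + l2 * w
    return loss, grad

feats = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
oracle = np.array([True, False, False])
res = minimize(softmax_loss_grad, np.zeros(2), args=(feats, oracle),
               jac=True, method="L-BFGS-B")
print(res.x)  # weights that favor the oracle's features
```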
5.3 Margin-Based Optimization
Minimizing the margin-based loss described in Section 3.4, possibly with the addition of a regularizer, is also a relatively standard problem in the machine learning literature. Methods to solve Equation (25) include sequential minimization optimization (Platt 1999), dual coordinate descent (Hsieh et al. 2008), as well as the quadratic program solvers used in standard SVMs (Joachims 1998).
It should also be noted that there have been several attempts to apply the margin-based online learning algorithms explained in Section 6.3 in a batch setting, where the whole training corpus is decoded before each iteration of optimization (Cherry and Foster 2012; Gimpel and Smith 2012). We will explain these methods in more detail later, but it should be noted that the advantage of using these methods in a batch setting mainly lies in simplicity; for online learning it is often necessary to implement the optimization procedure directly within the decoder, whereas in a batch setting the implementation of the decoding and optimization algorithms can be performed separately.
5.4 Ranking and Linear Regression Optimization
The rank-based loss described in Section 3.5 is essentially the combination of multiple losses over binary decisions. These binary decisions can be solved using gradient-based or margin-based methods, and thus optimization itself can be performed with the algorithms described in the previous two sections. However, one important concern in this setting is training time. At worst, the number of pairwise comparisons for any particular k-best list is k(k − 1)/2, leading to unmanageably large amounts of time required for training.
One way to alleviate this problem is to randomly sample a small number of these k(k − 1)/2 hypothesis pairs for use in optimization, which has been shown empirically to allow for increases in training speed without decreases in accuracy. For example, Hopkins and May (2011) describe a method dubbed pairwise ranking optimization that selects 5,000 pairs randomly for each sentence, and among these random pairs uses the 50 with the largest difference in error to train the classifier. Other selection heuristics are also possible and potentially increase accuracy, for example, avoiding training on candidate pairs with overly different scores (Nakov, Guzmán, and Vogel 2013), or performing Monte Carlo sampling (Roth et al. 2010; Haddow, Arun, and Koehn 2011). Recently, a method has also been proposed that uses an efficient ranking SVM formulation, which alleviates the need for this sampling and explicitly performs ranking over all pairs (Dreyer and Dong 2015).
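A sketch of the sampling step of pairwise ranking optimization under these settings; the min_diff threshold and the (features, error) representation of candidates are simplifying assumptions:

```python
import random

def pro_sample(kbest, n_samples=5000, n_keep=50, min_diff=0.05):
    """Sampling in the spirit of Hopkins and May (2011): draw random
    candidate pairs, keep those whose error difference exceeds a
    threshold, and return the largest-difference pairs as binary
    training examples. kbest holds (features, error) tuples."""
    pairs = []
    for _ in range(n_samples):
        (h1, e1), (h2, e2) = random.sample(kbest, 2)
        if abs(e1 - e2) > min_diff:
            pairs.append((abs(e1 - e2), h1, e1, h2, e2))
    pairs.sort(key=lambda p: p[0], reverse=True)
    examples = []
    for _, h1, e1, h2, e2 in pairs[:n_keep]:
        x = [a - b for a, b in zip(h1, h2)]  # feature difference vector
        y = 1 if e1 < e2 else -1             # lower error should rank higher
        examples.append((x, y))
    return examples
```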
The mean squared error loss described in Section 3.6, which is similar to ranking loss in that it will prefer a proper ordering of the k-best list, is much easier to optimize. This loss can be minimized using standard techniques for solving least-squared-error linear regression (Press et al. 2007).
5.5 Risk Minimization
It should be noted that in Equation (24), and the discussion up to this point, we have been using not the corpus-based error, but the sentence-based error err(e(i), e). There have also been attempts to make the risk minimization framework applicable to corpus-level error error(·), specifically BLEU. We will discuss two such methods.
5.5.1 Linear BLEU
5.5.2 Expectations of Sufficient Statistics
DeNero, Chiang, and Knight (2009) present an alternative method that calculates not the expectation of the error itself, but the expectation of the sufficient statistics used in calculating the error. In contrast to sentence-level approximations or formulations such as linear BLEU, the expectation of the sufficient statistics can be calculated directly at the corpus level. Because of this, by maximizing the evaluation score derived from these expected statistics, it is possible to directly optimize for a corpus-level error, in a manner similar to MERT (Pauls, DeNero, and Klein 2009).
6. Online Methods
In the batch learning methods of Section 5, the steps of decoding and optimization are performed sequentially over the entire training data. In contrast, online learning performs updates not after the whole corpus has been processed, but over smaller subsets of the training data termed mini-batches. One of the major advantages of online methods is that updates are performed on a much more fine-grained basis; it is often the case that online methods converge faster than batch methods, particularly on larger data sets. On the other hand, online methods have the disadvantage of being harder to implement (they often must be implemented inside the decoder, whereas batch methods can be separate), and also of generally being less stable (with sensitivity to the order in which the training data is processed, among other factors).
In the online learning algorithm in Figure 8, from the training data 〈F, E〉 we first randomly choose a mini-batch consisting of K sentences of parallel data (line 4). We then decode each source sentence of the mini-batch and generate a k-best list (line 7), which is used in optimization (line 9). In contrast to the batch learning algorithm in Figure 3, we do not merge the k-bests from previous iterations. In addition, optimization is performed not over the entire data, but only over the mini-batch and its corresponding k-best lists. As in batch learning, within the online learning framework there are a number of optimization algorithms and objective functions that can be used.
The first thing we must consider during online learning is that because we only optimize over the data in the mini-batch, it is not possible to directly optimize a corpus-level evaluation measure such as BLEU, and it is necessary to define an error function that is compatible with the learning framework (see Section 6.1). Once the error has been set, we can perform parameter updates according to a number of different algorithms including the perceptron (Section 6.2), MIRA (Section 6.3), AROW (Section 6.4), and stochastic gradient descent (SGD) (Section 6.5).
6.1 Approximating the Error
In online learning, parameters are updated not with respect to the entire training corpus, but with respect to a subset of data sampled from the corpus. This has consequences for the calculation of translation quality when using a corpus-level evaluation measure such as BLEU. For example, when choosing an oracle for oracle-based optimization methods, the oracles chosen when considering the entire corpus will be different from the oracles chosen when considering a mini-batch. In general, the amount of difference between the corpus-level and mini-batch level oracles will vary depending on the size of a mini-batch, with larger mini-batches providing a better approximation (Tan et al. 2013; Watanabe 2012). Thus, when using smaller batches, especially single sentences, it is necessary to use methods to approximate the corpus-level error function as covered in the next two sections.
6.1.1 Approximation with a Pseudo-Corpus
6.1.2 Approximation with Decay
When approximating the error function using a pseudo-corpus, it is necessary to remember translation candidates for every sentence in the corpus. In addition, the size of differences in the sentence-level error becomes dependent on the number of other sentences in the corpus, making it necessary to perform scaling of the error, particularly for max-margin methods (Watanabe et al. 2007). As an alternative method that alleviates these problems, there has also been a method proposed that remembers a single set of sufficient statistics for the whole corpus, and upon every update forces these statistics to decay according to some criterion (Chiang, Marton, and Resnik 2008; Chiang, Knight, and Wang 2009; Chiang 2012).
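A rough sketch of the decay-based approximation; the decay factor and the exact way the current sentence is credited are illustrative assumptions rather than values from the methods cited above:

```python
def decay_and_add(pseudo_stats, sent_stats, decay=0.9):
    """Update the single set of corpus-level sufficient statistics:
    scale down the running statistics so that older sentences gradually
    lose influence, then add the current sentence's statistics.
    The decay factor 0.9 is an illustrative assumption."""
    return [decay * p + s for p, s in zip(pseudo_stats, sent_stats)]

def score_in_context(pseudo_stats, sent_stats, bleu_from_stats):
    # One possible way to credit the current sentence: the gain in BLEU
    # when its statistics are added to the decayed pseudo-corpus statistics.
    combined = [p + s for p, s in zip(pseudo_stats, sent_stats)]
    return bleu_from_stats(combined) - bleu_from_stats(pseudo_stats)
```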
6.2 The Perceptron
In line 12, we return the final parameters to be used in translation. The most straightforward approach here is to simply return the parameters resulting from the final iteration of the perceptron training, but in a popular variant called the averaged perceptron, we instead use the average of the parameters over all iterations in training (Collins 2002). This averaging helps reduce overfitting of sentences that were viewed near the end of the training process, and is known to improve robustness to unknown data, resulting in higher translation accuracy (Liang et al. 2006).
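A sketch of the averaged perceptron loop for MT, where decode, features, and oracle are stand-ins for the decoder, feature extractor, and oracle selection described earlier (when the model-best output equals the oracle, the two updates cancel and the weights are unchanged):

```python
def averaged_perceptron(data, decode, features, oracle, T=5):
    """Structured perceptron with parameter averaging (Collins 2002),
    sketched for MT: move the weights toward the oracle's features and
    away from the model-best candidate's features."""
    w = {}
    w_sum = {}  # running sum of the weights, for averaging
    updates = 0
    for t in range(T):
        for f, e_ref in data:
            e_best = decode(f, w)
            e_oracle = oracle(f, e_ref, w)
            for feat, val in features(f, e_oracle).items():
                w[feat] = w.get(feat, 0.0) + val
            for feat, val in features(f, e_best).items():
                w[feat] = w.get(feat, 0.0) - val
            for feat, val in w.items():
                w_sum[feat] = w_sum.get(feat, 0.0) + val
            updates += 1
    # Return the averaged weights rather than the final ones.
    return {feat: val / updates for feat, val in w_sum.items()}
```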
6.3 MIRA
6.4 AROW
One of the problems often pointed out with MIRA is that it is overly aggressive, mainly because it attempts to classify the current training example correctly according to Equation (60), even when the training example is an outlier or includes noise. One way to reduce the effect of noise is through the use of adaptive regularization of weights (AROW) (Crammer, Kulesza, and Dredze 2009; Chiang 2012). AROW is based on a similar concept to MIRA, but instead of working directly on the weight vector w, it defines a Gaussian distribution over the weights. The covariance matrix Σ is usually assumed to be diagonal, and each variance term in Σ functions as a sort of learning rate for its corresponding weight, with weights of higher variance being updated more widely, and weights of lower variance being updated less widely.
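A sketch of a single diagonal-covariance AROW update for a binary decision (such as separating an oracle from a non-oracle candidate); the hyperparameter r and the margin threshold of 1 follow the standard formulation of Crammer, Kulesza, and Dredze (2009):

```python
def arow_update(mu, sigma, x, y, r=1.0):
    """One diagonal AROW update. mu is the mean weight vector, sigma the
    per-weight variances, x the (difference) feature vector, y in {+1, -1}."""
    margin = y * sum(m * xi for m, xi in zip(mu, x))
    if margin >= 1.0:
        return mu, sigma  # already separated with sufficient margin
    confidence = sum(s * xi * xi for s, xi in zip(sigma, x))
    beta = 1.0 / (confidence + r)
    alpha = (1.0 - margin) * beta
    # High-variance (rarely seen) weights move more; variances only shrink,
    # so frequently updated weights become progressively more conservative.
    mu = [m + alpha * s * y * xi for m, s, xi in zip(mu, sigma, x)]
    sigma = [s - beta * (s * xi) ** 2 for s, xi in zip(sigma, x)]
    return mu, sigma

mu, sigma = arow_update([0.0, 0.0], [1.0, 1.0], x=[1.0, -0.5], y=1)
print(mu, sigma)
```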
6.5 Stochastic Gradient Descent
Stochastic gradient descent (SGD) is a gradient-based online algorithm for optimizing differentiable losses, possibly with the addition of L2 regularization Ω2(w), or L1 regularization Ω1(w). As SGD relies on gradients, it can be thought of as an alternative to the gradient-based batch algorithms in Section 5.2. Compared with batch algorithms, SGD requires less memory and tends to converge faster, but requires more care (particularly with regard to the selection of a learning rate) to ensure that it converges to a good answer.
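A minimal sketch of one SGD step with L2 regularization and a decaying learning rate (the rate schedule shown is an illustrative choice, and choosing it well is exactly the care mentioned above):

```python
def sgd_step(w, grad_loss, eta=0.1, l2=1e-4):
    """One stochastic gradient descent step on a differentiable loss with
    L2 regularization: move against the mini-batch loss gradient plus the
    gradient of the regularizer."""
    return [wi - eta * (gi + l2 * wi) for wi, gi in zip(w, grad_loss)]

# Toy usage with a decaying learning rate.
w = [0.0, 0.0]
for t, grad in enumerate([[1.0, -2.0], [0.5, 0.5]], start=1):
    w = sgd_step(w, grad, eta=0.1 / (1.0 + 0.01 * t))
print(w)
```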
7. Large-Scale Optimization
Up to this point, we have generally given a mathematical or algorithmic explanation of the various optimization methods, and placed a smaller emphasis on factors such as training efficiency. In traditional optimization settings for MT where we optimize only a small number of weights for dense features on a training set of around 1,000 sentences, efficiency is often less of a concern. However, when trying to move to larger sets of sparse features, 1,000 sentences of training data is simply not enough to robustly estimate the parameters, and larger training sets become essential. When moving to larger training sets, parallelization of both the decoding process and the optimization process becomes essential. In this section, we outline the methods that can be used to perform parallelization, greatly increasing the efficiency of training. As parallelization in MT has seen wider use with respect to online learning methods, we will start with a description of online methods and touch briefly upon batch methods afterwards.
7.1 Large-Scale Online Optimization
Within the online learning framework, it is possible to improve the efficiency of learning through parallelization (McDonald, Hall, and Mann 2010). An example of this is shown in Figure 10, where the training data is split into S shards (line 2), learning is performed locally over each shard 〈Fs, Es〉, and the S sets of parameters ws acquired through local learning are combined according to a function mix(·), a process called parameter mixing (line 6). This can be considered an instance of the MapReduce (Dean and Ghemawat 2008) programming model, where the Map step assigns shards to S CPUs and performs training, and the Reduce step combines the resulting parameters ws.
In the training algorithm of Figure 10, because parameters are learned locally on each shard, it is not necessarily guaranteed that the parameters are optimized for the data as a whole. In addition, it is also known that some divisions of the data can lead to contradictions between the parameters (McDonald, Hall, and Mann 2010). Because of this, when performing distributed online learning, it is common to perform parameter mixing several times throughout the training process, which allows the separate shards to share information and prevents contradiction between the learned parameters. Based on the timing of the update, these varieties of mixing are called synchronous update and asynchronous update.
7.1.1 Synchronous Update
In the online learning algorithm with synchronous update shown in Figure 11, learning is performed independently over each shard 〈Fs, Es〉(1 ≤ s ≤ S). The difference between this and Figure 10 lies in the fact that learning is performed T′ times, with each iteration initialized with the parameters w(t) from the previous iteration (line 7).
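A minimal sketch of this synchronous, iterative parameter mixing follows; train_shard (one pass of online learning over a single shard, started from the current weights) is a hypothetical stand-in, and mix(·) is instantiated as a uniform average.

```python
# A minimal sketch of synchronous iterative parameter mixing in the
# style of Figure 11; `train_shard` is a hypothetical routine that
# runs one epoch of online learning on one shard, starting from w.
import numpy as np
from multiprocessing import Pool

def iterative_mixing(shards, dim, T_prime=5):
    w = np.zeros(dim)
    with Pool(len(shards)) as pool:
        for t in range(T_prime):
            # map: train locally on each shard, starting from current w
            ws = pool.starmap(train_shard, [(shard, w) for shard in shards])
            # reduce: mix the shard parameters (here, a uniform average)
            w = np.mean(ws, axis=0)
    return w
```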
Simianer, Riezler, and Dyer (2012) propose another method for mixing parameters that, instead of averaging at each iteration, preserves only the parameters that have been learned consistently across shards and sets all remaining parameters to zero, allowing for a simple sort of feature selection. In particular, we define an S × M matrix W whose s-th row is the parameter vector ws learned on shard s, take the L2 norm of each of the M feature columns, and average the columns with high norm values while setting the rest to zero.
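A sketch of this selection step, under the assumption that the number of surviving features K is given, might look as follows.

```python
# A sketch of the L2-norm feature selection of Simianer, Riezler, and
# Dyer (2012): stack the S shard weight vectors into an S x M matrix,
# keep the K feature columns with the largest norms, zero the rest.
import numpy as np

def l2_select_and_mix(ws, K):
    W = np.vstack(ws)                       # S x M matrix, one row per shard
    norms = np.linalg.norm(W, axis=0)       # L2 norm of each feature column
    keep = np.argsort(norms)[-K:]           # indices of the K largest norms
    w = np.zeros(W.shape[1])
    w[keep] = W[:, keep].mean(axis=0)       # average surviving columns
    return w
```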
7.1.2 Asynchronous Update
While parallel learning with synchronous update is guaranteed to converge (McDonald, Hall, and Mann 2010), parameter mixing only occurs after all the data has been processed, leading to inefficiency over large data sets. To fix this problem, asynchronous update sends information about parameter updates to each shard asynchronously, allowing the parameters to be updated more frequently, resulting in faster learning (Chiang, Marton, and Resnik 2008; Chiang, Knight, and Wang 2009).
The algorithm for learning with asynchronous update is shown in Figure 12. With the data 〈Fs, Es〉 (1 ≤ s ≤ S) split into S pieces, each shard performs T iterations of training by sampling a mini-batch (line 7), translating each sentence (line 10), and performing optimization on the mini-batch level (line 12).
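The following heavily simplified sketch illustrates the control flow of asynchronous updating with threads sharing a single weight vector; minibatch_update (computing a parameter delta from a mini-batch given a snapshot of the weights) is a hypothetical routine, and a real implementation would distribute across machines rather than threads.

```python
# A minimal sketch of asynchronous updating: each worker repeatedly
# samples a mini-batch from its shard, computes an update against the
# freshest shared weights, and applies it immediately rather than
# waiting for the other shards. `minibatch_update` is hypothetical.
import random
import threading
import numpy as np

def async_train(shards, dim, T=10, batch_size=8):
    w = np.zeros(dim)                       # shared parameters
    lock = threading.Lock()

    def worker(shard):
        for _ in range(T):
            batch = random.sample(shard, min(batch_size, len(shard)))
            with lock:                      # snapshot the freshest weights
                w_local = w.copy()
            delta = minibatch_update(w_local, batch)
            with lock:                      # push the update immediately
                w[:] = w + delta

    threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return w
```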
7.2 Large-Scale Batch Optimization
Compared with online learning, within the batch optimization framework, parallelization is usually straightforward. Often, decoding takes the majority of time required for the optimization process, and because the parameters will be identical for each sentence in the decoding run (Figure 3, line 6), decoding can be parallelized trivially. The process of parallelizing optimization itself depends slightly on the optimization algorithm, but is generally possible to achieve in a number of ways.
The first, and simplest, method for parallelization is the parallelization of optimization runs. The most obvious example of this is MERT, where random restarts are required. Each of the random restarts is completely independent, so it is possible to run these on different nodes, and finally check which run achieved the best accuracy and use that result.
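As a sketch, parallelizing random restarts amounts to mapping independent runs over workers and keeping the best result; mert_from (one complete MERT run from a given starting point, returning an (accuracy, weights) pair) is a hypothetical routine.

```python
# A sketch of parallelizing MERT random restarts: each restart is
# independent, so we run them on separate workers and keep the best.
from multiprocessing import Pool

def parallel_restarts(starts):
    with Pool(len(starts)) as pool:
        results = pool.map(mert_from, starts)      # (accuracy, weights) pairs
    return max(results, key=lambda r: r[0])[1]     # weights of the best run
```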
Another more fine-grained method for parallelization, again most useful for MERT, is the parallelization of search directions. In the loop starting at Figure 4, line 6, MERT performs line search in several different directions, each one being independent of the others. Each of these line searches can be performed in parallel, and the direction allowing for the greatest gain in accuracy is chosen when all threads have completed.
A method that is applicable to a much broader array of optimization methods is the parallelization of calculation of sufficient statistics. In this approach, like in Section 7.1.1, we first split the data into shards 〈Fs, Es〉(1 ≤ s ≤ S). Then, over these shards we calculate the sufficient statistics necessary to perform a parameter update. For example, in MERT these sufficient statistics would consist of the envelope for each of the potential search directions. In gradient based methods, the sufficient statistics would consist of the gradient calculated with respect to only the data on the shard. Finally, when all threads have finished calculating these statistics, a master thread combines the statistics from each shard, either by merging the envelopes, or by adding the gradients.
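For the gradient case, a sketch of this map-and-combine pattern is shown below; shard_gradient (the gradient of the loss restricted to one shard's data) is a hypothetical per-shard routine.

```python
# A sketch of parallelizing sufficient statistics for a gradient-based
# method: each shard computes the gradient over its own data, and a
# master sums them. `shard_gradient` is hypothetical.
import numpy as np
from multiprocessing import Pool

def parallel_gradient(w, shards):
    with Pool(len(shards)) as pool:
        grads = pool.starmap(shard_gradient, [(w, s) for s in shards])
    return np.sum(grads, axis=0)   # combine: gradients simply add
```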
8. Other Topics in MT Optimization
In this section we cover several additional topics in optimization for MT, including nonlinear models (Section 8.1), optimization for a particular domain or test set (Section 8.2), and the interaction between evaluation measures (Section 8.3) or search (Section 8.4) and optimization.
8.1 Non-Linear Models
Note that up until this point, all models that we have considered calculate the scores for translation hypotheses according to a linear model, where the score is calculated according to the dot product of the features and weights shown in Equation (1). However, linear models are obviously limited in their expressive power, and a number of works have attempted to move beyond linear combinations of features to nonlinear combinations.
In general, most nonlinear models for machine learning can be applied to MT as well, with one major caveat. Specifically, the efficiency of the decoding process largely relies on the feature locality assumption mentioned in Section 2.3. Unfortunately, the locality assumption breaks down when moving beyond a simple linear scoring function, and overcoming this problem is the main obstacle to applying nonlinear models to MT (or structured learning in general). A number of countermeasures to this problem exist:
Reranking: The simplest and most commonly used method for incorporating nonlinearity, or other highly nonlocal features that cannot easily be incorporated in search, is reranking (Shen, Sarkar, and Och 2004). In this case, a system optimized using a standard linear model is used to create a k-best list of outputs, and this k-best list is then reranked using the nonlinear model (Nguyen, Mahajan, and He 2007; Duh and Kirchhoff 2008); a minimal sketch follows this list. Because we are now only dealing with fully expanded hypotheses, scoring becomes trivial, but reranking also has the major downsides of potentially missing useful hypotheses not included in the k-best list,9 and requiring time directly proportional to the size of the k-best list.
Local Nonlinearity: Another possibility is to first use a nonlinear function to calculate local features, which are then used as part of the standard linear model (Liu et al. 2013). Alternatively, it is possible to treat feature-value pairs as new binary features (Clark, Dyer, and Lavie 2014). In this case, all effects of nonlinearity are resolved before the search actually begins, allowing for the use of standard and efficient search algorithms. On the other hand, it is not possible to incorporate non-local features into the nonlinear model.
Improved Search Techniques: Although there is no general-purpose solution to incorporating nonlinear models into search, for some particular models it is possible to perform search in a way that allows for incorporation of nonlinearities. For example, ensemble decoding has been used with stacking-based models (Razmara and Sarkar 2013), and it has been shown that the search space can be simplified to the extent that kernel functions can be calculated efficiently (Wang, Shawe-Taylor, and Szedmak 2007).
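Returning to the reranking approach above, the following minimal sketch rescores a k-best list with a small one-hidden-layer network; the feature matrix and the network parameters W1, b1, and w2 are illustrative assumptions, not a model from the works cited.

```python
# A minimal k-best reranking sketch: a linear-model system produces a
# k-best list, and a nonlinear scorer (a tiny one-hidden-layer network
# over hypothesis features) picks the final output.
import numpy as np

def rerank(kbest_feats, W1, b1, w2):
    """kbest_feats: k x M matrix, one feature vector per hypothesis."""
    hidden = np.tanh(kbest_feats @ W1 + b1)   # nonlinear feature combination
    scores = hidden @ w2                      # scalar score per hypothesis
    return int(np.argmax(scores))             # index of the best hypothesis
```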
Once the problems of search have been solved, a number of actual learning techniques can be used to model nonlinear scoring functions. Among the most popular nonlinear functions are those utilizing kernels, and methods applied to MT include kernel-like functions over the feature space such as the Parzen window, binning, and Gaussian kernels (Nguyen, Mahajan, and He 2007), and the n-spectrum string kernel for finding associations between the source and target strings (Wang, Shawe-Taylor, and Szedmak 2007). Neural networks are another popular method for modeling nonlinearities, and it has been shown that neural networks can effectively be used to calculate new local features for MT (Liu et al. 2013). Methods such as boosting or stacking, which combine together multiple parameterizations of the translation model, have been incorporated through reranking (Duh and Kirchhoff 2008; Lagarda and Casacuberta 2008; Duan et al. 2009; Sokolov, Wisniewski, and Yvon 2012b), or ensemble decoding (Razmara and Sarkar 2013). Regression decision trees have also been introduced as a method for inducing nonlinear functions, incorporated through history-based search algorithms (Turian, Wellington, and Melamed 2006), or by using the trees to induce features local to the search state (Toutanova and Ahn 2013).
8.2 Domain-Dependent Optimization
One widely acknowledged feature of machine learning problems in general is that the parameters are sensitive to the domain of the data, and by optimizing the parameters with data from the target domain it is possible to achieve gains in accuracy. This also holds in machine translation, although much of the work on domain adaptation has focused on adapting the model learning process prior to explicit optimization towards an evaluation measure (Koehn and Schroeder 2007). However, there are a few works on optimization-based domain adaptation in MT, which we summarize subsequently.
One relatively simple way of performing domain adaptation is by selecting a subset of the training data that is similar to the data that we want to translate (Li et al. 2010). This can be done by selecting sentences that are similar to our test corpus, or even selecting adaptation data for each individual test sentence (Liu et al. 2012). If no parallel data exist in the target domain, it has been shown that first automatically translating data from the source to the target language (or vice versa), then using these data for optimization and model training, is also helpful (Ueffing, Haffari, and Sarkar 2007; Li et al. 2011; Zhao et al. 2011). In addition, in a computer-assisted translation scenario, it is possible to reflect post-edited translations back into the optimization process as new in-domain training data (Mathur, Mauro, and Federico 2013; Denkowski, Dyer, and Lavie 2014).
Once adaptation data have been chosen, it is necessary to decide how to use them. The most straightforward way is to simply use these in-domain data in optimization, but if the data set is small it is preferable to combine both in-domain and out-of-domain data to achieve more robust parameter estimates. This is essentially equivalent to the standard domain-adaptation problem in machine learning, and in the context of MT there have been methods proposed to perform Bayesian adaptation of probabilistic models (Sanchis-Trilles and Casacuberta 2010), and online update using ultraconservative algorithms (Liu et al. 2012). This can be extended to cover multiple target domains using multi-task learning (Cui et al. 2013).
Finally, it has been noted that when optimizing a few dense parameters, it is useful to make the distinction between in-domain translation (when the model training data matches the test domain) and cross-domain translation (when the model training data mismatches the test domain). In cross-domain translation, fewer long rules will be used and translation probabilities will be less reliable, so the parameters must change accordingly (Pecina, Toral, and van Genabith 2012). It has also been shown that building translation models for several domains and tuning the parameters to maximize translation accuracy can improve MT accuracy on the target domain (Haddow 2013). Another option for making the distinction between in-domain and out-of-domain data is by firing different features for in-domain and out-of-domain training data, allowing for the learning of different weights for different domains (Clark, Lavie, and Dyer 2012).
8.3 Evaluation Measures and Optimization
In the entirety of this article, we have assumed that optimization for MT aims to reduce MT error defined using an evaluation measure, generally BLEU. However, as mentioned in Section 2.5, evaluation of MT is an active research field, and there are many alternatives in addition to BLEU. Thus, it is of interest whether changing the measure used in optimization can affect the overall quality of the translations achieved, as measured by human evaluators.
There have been a few comprehensive studies on the effect of the metric used in optimization on human assessments of the generated translations (Cer, Manning, and Jurafsky 2010; Callison-Burch et al. 2011). These studies showed the rather surprising result that despite the fact that other evaluation measures had proven superior to BLEU with regard to post facto correlation with human evaluation, a BLEU-optimized system proved superior to systems tuned using other metrics. Since this result, however, there have been other reports stating that systems optimized using other metrics such as TESLA (Liu, Dahlmeier, and Ng 2011) and MEANT (Lo et al. 2013) achieve superior results to BLEU-optimized systems.
There have also been attempts to directly optimize not automatic, but human evaluation measures of translation quality (Zaidan and Callison-Burch 2009). However, the cost of performing this sort of human-in-the-loop optimization is prohibitive, so Zaidan and Callison-Burch (2009) propose a method that re-uses partial hypotheses in evaluation. Saluja, Lane, and Zhang (2012) also propose a method for incorporating binary good/bad input into optimization, with the motivation that this sort of feedback is easier for human annotators to provide than generating new reference sentences.
8.4 Search and Optimization
As mentioned in Section 2.4, because MT decoders perform approximate search, they may make search errors and not find the hypothesis that achieves the highest model score. There have been a few attempts to consider this fact in the optimization process.
For example, in the perceptron algorithm of Section 6.2 it is known that the convergence guarantees of the structured perceptron no longer hold when using approximate search. The first method that can be used to resolve this problem is the early updating strategy (Collins and Roark 2004; Cowan, Kučerová, and Collins 2006). The early updating strategy is a variety of bold updating, where the decoder output e*(i) must be exactly equal to the reference e(i). Decoding proceeds as normal, but the moment the correct hypothesis e(i) can no longer be produced by any hypothesis in the search space (i.e., a search error has occurred), search is stopped and an update is performed using only the partial derivation. The second method is the max-violation perceptron (Huang, Fayong, and Guo 2012; Yu et al. 2013). In the max-violation perceptron, forced decoding is performed to acquire a derivation 〈e*(i), d*(i)〉 that can exactly reproduce the correct output e(i), and the update is performed at the point where the score of a partial hypothesis exceeds that of the partial derivation 〈e*(i), d*(i)〉 by the greatest margin (the point of "maximum violation").
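To make the control flow of early updating concrete, the following heavily simplified sketch stops decoding at the first search error and updates with the partial derivations accumulated so far; every helper (initial, expand, prune, finished, gold_extend, reachable, feats) is a hypothetical stand-in for decoder internals, not a real API.

```python
# A heavily simplified sketch of early updating (Collins and Roark
# 2004): decoding stops the moment no hypothesis on the beam can
# still reach the reference, and the perceptron update uses only the
# partial derivations seen up to that point. All helpers are
# hypothetical decoder hooks.
def early_update(f, e_ref, w, beam_size):
    beam = [initial(f)]                  # partial hypotheses for f
    gold = initial(f)                    # partial gold derivation
    while beam and not finished(beam[0]):
        beam = prune(expand(beam, w), beam_size)
        gold = gold_extend(gold, e_ref)
        if not any(reachable(h, e_ref) for h in beam):
            # search error: update toward the partial gold derivation
            for k, v in feats(gold).items():
                w[k] += v
            for k, v in feats(beam[0]).items():
                w[k] -= v
            break                        # stop early for this sentence
    return w
```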
Search-aware tuning (Liu and Huang 2014) is a method that is able to consider search errors using an arbitrary optimization method. It does so by defining an evaluation measure for not only full sentences, but also partial derivations that occur during the search process, and optimizes parameters for k-best lists of partial derivations.
Finally, there has also been some work on optimizing features not of the model itself, but parameters of the search process, using the downhill simplex algorithm (Chung and Galley 2012). Using this method, it is possible to adjust the beam width, distortion penalty, or other parameters that actually affect the size and shape of the derivation space, as opposed to simply rescoring hypotheses within it.
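As an illustration, downhill simplex search over two such meta-parameters can be written with SciPy's Nelder-Mead implementation; decode_and_score is a hypothetical wrapper (stubbed below with a placeholder function) that would decode the development set under the given settings and return an error such as 1 − BLEU.

```python
# A sketch of tuning decoder meta-parameters (e.g., beam width and
# distortion limit) with downhill simplex search, in the spirit of
# Chung and Galley (2012). SciPy's Nelder-Mead method implements the
# simplex algorithm.
from scipy.optimize import minimize

def decode_and_score(x):
    """Hypothetical wrapper: decode the dev set with beam width
    int(round(x[0])) and distortion limit int(round(x[1])), then
    return 1 - BLEU. Stubbed here with a placeholder surface."""
    return (x[0] - 40.0) ** 2 + (x[1] - 4.0) ** 2

result = minimize(decode_and_score, x0=[50.0, 6.0], method="Nelder-Mead")
best_beam, best_distortion = result.x
```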
9. Conclusion
In this survey article, we have provided a review of the current state-of-the-art in machine translation optimization, covering batch optimization, online optimization, expansions to large scale data, and a number of other topics. While these optimization algorithms have already led to large improvements in machine translation accuracy, the task of MT optimization is, as stated in the Introduction, an extremely hard one that is far from solved.
The utility of an optimization algorithm can be viewed from a number of perspectives. The final accuracy achieved is, of course, one of the most important factors, but speed, scalability, ease of implementation, final resulting model size, and many other factors play an important role. We can assume that the algorithms being used outside of the context of research on optimization itself are those that satisfy these criteria in some way. Although it is difficult to discern exactly which algorithms are seeing the largest amount of use (industrial SMT systems rarely disclose this sort of information publicly), one proxy is to look at systems that performed well on shared tasks such as the Workshop on Machine Translation (WMT) (Bojar et al. 2014). In Table 2 we show the percentage of WMT systems using each optimization algorithm over the past four years, both for all systems and for the systems that achieved the highest level of human evaluation in the resource-constrained setting for at least one language pair. From these statistics we can see that even after over ten years, MERT is still the dominant optimization algorithm. However, starting in WMT 2013, we can see a move to systems based on MIRA, and to a lesser extent ranking, particularly in the most competitive systems.
| Algorithm | 2011 All | 2011 Best | 2012 All | 2012 Best | 2013 All | 2013 Best | 2014 All | 2014 Best |
|---|---|---|---|---|---|---|---|---|
| MERT | 80 | 100 | 79 | 100 | 68 | 25 | 63 | 50 |
| MIRA | 0 | 0 | 0 | 0 | 20 | 75 | 27 | 50 |
| Ranking | 0 | 0 | 4 | 0 | 8 | 0 | 5 | 0 |
| Softmax | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Risk | 3 | 0 | 0 | 0 | 0 | 0 | 5 | 0 |
| None | 17 | 0 | 17 | 0 | 4 | 0 | 0 | 0 |
In these systems, the preferred choice of an optimization algorithm seems to be MERT when using up to 20 features, and MIRA when using a large number of features (up to several hundred). There are fewer examples of systems using large numbers of features (tens of thousands, or millions) in actual competitive systems, with a few exceptions (Dyer et al. 2009; Neidert et al. 2014; Wuebker et al. 2014). In the case when a large number of sparse features are used, it is most common to use a softmax or risk-based objective and gradient-based optimization algorithms, often combining the features into summary features and performing a final tuning pass with MERT.
The fact that algorithms other than MERT are seeing adoption in competitive systems for shared tasks is a welcome sign for the future of MT optimization research. However, there are still many open questions in the field, a few of which can be outlined here:
Stable Training with Millions of Features: At the moment, there is still no stable training recipe that has been widely proven to effectively optimize millions of features. Finding an algorithm that gives consistent improvements in this setting is perhaps the largest open problem in MT optimization.
Evaluation Measures for Optimization: Although many evaluation measures show consistent improvements in correlation with human evaluation scores over BLEU when used to evaluate the output of existing MT systems, there are few results that show that systems optimized with evaluation measures other than BLEU achieve consistent improvements in human evaluation scores.
Better Training/Utilization of Nonlinear Scoring Functions: Nonlinear functions using neural networks have recently achieved large improvements in a number of areas of natural language processing and machine learning. Finding better methods to incorporate this sort of nonlinear scoring function into MT is a highly promising direction, but will require improvements both in learning the scoring functions and in correctly incorporating them into MT decoding.
Appendix A: Derivation for xBLEU Gradients
In this appendix, we explain in detail how to derive a gradient for the xBLEU objective in Equation (54), which has not been described completely in previous work.
After calculating this gradient, it is possible to optimize the objective using standard gradient-based methods. However, like MR using sentence-level evaluation mentioned in Section 5.5, the objective is not convex, and the same precautions need to be taken to avoid falling into local optima.
Notes
It should be noted that although most work on MT optimization is concerned with linear models (and thus we will spend the majority of this article discussing optimization of these models), optimization using non-linear models is also possible, and is discussed in Section 8.1.
It should also be noted there have been a few recent attempts to jointly perform rule extraction and optimization, doing away with this two-step process (Xiao and Xiong 2013).
We let #A(a) denote the number of times a appears in a multiset A, and define: |A| = ∑a #A(a), #A∪B(a) = max{#A(a), #B(a)}, and #A∩B(a) = min{#A(a), #B(a)}.
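For concreteness, these multiset operations match the semantics of Python's Counter type, which implements exactly the min/max counting used in clipped n-gram matching:

```python
# Multiset intersection takes the minimum count of each item and
# union the maximum, exactly as defined in this note.
from collections import Counter

A, B = Counter("aab"), Counter("abb")
assert (A & B)["a"] == min(A["a"], B["a"])   # intersection: min counts
assert (A | B)["b"] == max(A["b"], B["b"])   # union: max counts
assert sum(A.values()) == 3                  # |A|
```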
Equation (25) can be regarded as an instance of ranking loss described in Section 3.5 in which better translations are selected only from a set of oracle translations.
We take the inverse because we would like model scores and errors to be inversely correlated.
More accurately, finding the oracle in the k-best list by enumeration of the hypotheses is easy, but finding the oracle in a compressed data structure such as a lattice is computationally difficult, and approximation algorithms are necessary (Leusch, Matusov, and Ney 2008; Li and Khudanpur 2009; Sokolov, Wisniewski, and Yvon 2012a).
Liu et al. (2012) propose a method to avoid over-aggressive moves in parameter space by considering the balance between increase in the evaluation score and the similarity with the parameters on the previous iteration.
This problem can be ameliorated somewhat by ensuring that there is sufficient diversity in the n-best list (Gimpel et al. 2013).
References
Author notes
8916-5 Takayama-cho, Ikoma, Nara, Japan. E-mail: [email protected].
6-10-1 Roppongi, Minato-ku, Tokyo, Japan. E-mail: [email protected].
This work was mostly done while the second author was affiliated with the National Institute of Information and Communications Technology, 3–5 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619–0289, Japan.