## Abstract

In statistical machine translation (SMT), the optimization of the system parameters to maximize translation accuracy is now a fundamental part of virtually all modern systems. In this article, we survey 12 years of research on optimization for SMT, from the seminal work on discriminative models (Och and Ney 2002) and minimum error rate training (Och 2003), to the most recent advances. Starting with a brief introduction to the fundamentals of SMT systems, we follow by covering a wide variety of optimization algorithms for use in both batch and online optimization. Specifically, we discuss losses based on direct error minimization, maximum likelihood, maximum margin, risk minimization, ranking, and more, along with the appropriate methods for minimizing these losses. We also cover recent topics, including large-scale optimization, nonlinear models, domain-dependent optimization, and the effect of MT evaluation measures or search on optimization. Finally, we discuss the current state of affairs in MT optimization, and point out some unresolved problems that will likely be the target of further research in optimization for MT.

## 1. Introduction

Machine translation (MT) has long been both one of the most promising applications of natural language processing technology and one of the most elusive. However, over approximately the past decade, huge gains in translation accuracy have been achieved (Graham et al. 2014), and commercial systems deployed for hundreds of language pairs are being used by hundreds of millions of users. There are many reasons for these advances in the accuracy and coverage of MT, but among them two particularly stand out: statistical machine translation (SMT) techniques that make it possible to learn statistical models from data, and massive increases in the amount of data available to learn SMT models.

Within the SMT framework, there have been two revolutions in the way we mathematically model the translation process. The first was the pioneering work of Brown et al. (1993), who proposed the idea of SMT, and described methods for estimation of the parameters used in translation. In that work, the parameters of a word-based generative translation model were optimized to maximize the conditional likelihood of the training corpus. The second major advance in SMT is the discriminative training framework proposed by Och and Ney (2002) and Och (2003), who propose log-linear models for MT, optimized to maximize either the probability of getting the correct sentence from a *k*-best list of candidates, or to directly achieve the highest accuracy over the entire corpus. By describing the scoring function for MT as a flexibly parameterizable log-linear model, and describing discriminative algorithms to optimize these parameters, it became possible to think of MT like many other **structured prediction** problems, such as POS tagging or parsing (Collins 2002).

However, within the general framework of structured prediction, MT stands apart in many ways, and as a result requires a number of unique design decisions not necessary in other frameworks (as summarized in Table 1). The first is the **search space** that must be considered. The search space in MT is generally too large to expand exhaustively, so it is necessary to decide which subset of all the possible hypotheses should be used in optimization. In addition, the **evaluation** of MT accuracy is not straightforward, with automatic evaluation measures for MT still being researched to this day. From the optimization perspective, even once we have chosen an automatic evaluation metric, it is not necessarily the case that it can be decomposed for straightforward integration with structured learning algorithms. Given this evaluation measure, it is necessary to incorporate it into a **loss function** to target. The loss function should be closely related to the final evaluation objective, while allowing for the use of efficient optimization algorithms. Finally, it is necessary to choose an **optimization algorithm**. In many cases it is possible to choose a standard algorithm from other fields, but there are also algorithms that have been tailored towards the unique challenges posed by MT.

Table 1

| Which Loss Functions? | Which Optimization Algorithm? |
| --- | --- |
| Error (§3.1) | Minimum Error Rate Training (§5.1) |
| Softmax (§3.2) | Gradient-based Methods (§5.2, §6.5) |
| Risk (§3.3) | Margin-based Methods (§5.3) |
| Margin, Perceptron (§3.4) | Linear Regression (§5.4) |
| Ranking (§3.5) | Perceptron (§6.2) |
| Minimum Squared Error (§3.6) | MIRA (§6.3) |
| | AROW (§6.4) |
| **Which Evaluation Measure?** | **Which Hypotheses to Target?** |
| Corpus-level, Sentence Level (§2.5) | k-best vs. Lattice vs. Forest (§2.4) |
| BLEU and Approximations (§2.5.1, §2.5.2) | Merged k-bests (§5) |
| Other Measures (§8.3) | Forced Decoding (§2.4), Oracles (§4) |
| **Other Topics:** Large Data Sets (§7), Non-linear Models (§8.1), Domain Adaptation (§8.2), Search and Optimization (§8.4) | |


In this article, we survey the state of the art in machine translation optimization in a comprehensive and systematic fashion, covering a wide variety of topics, with a unified set of terminology. In Section 2, we first provide definitions of the problem of machine translation, describe briefly how models are built, how features are defined, and how translations are evaluated, and finally define the optimization setting. In Section 3, we next describe a variety of loss functions that have been targeted in machine translation optimization. In Section 4, we explain the selection of oracle translations, a non-trivial process that directly affects the optimization results. In Section 5, we describe batch optimization algorithms, starting with the popular minimum error rate training, and continuing with other approaches using likelihood, margin, rank loss, or risk as objectives. In Section 6, we describe online learning algorithms, first explaining the relationship between corpus-level optimization and sentence-level optimization, and then moving on to algorithms based on perceptron, margin, or likelihood-based objectives. In Section 7, we describe the recent advances in scaling training of MT systems up to large amounts of data through parallel computing, and in Section 8, we cover a number of other topics in MT optimization such as non-linear models, domain adaptation, and the relationship between MT evaluation and optimization. Finally, we conclude in Section 9, overviewing the methods described, making a brief note about which methods see the most use in actual systems, and outlining some of the unsolved problems in the optimization of MT systems.

## 2. Machine Translation Preliminaries and Definitions

Before delving into the details of actual optimization algorithms, we first introduce preliminaries and definitions regarding MT in general and the MT optimization problem in particular. We focus mainly on the aspects of MT that are relevant to optimization, and readers may refer to Koehn (2010) or Lopez (2008) for more details about MT in general.

### 2.1 Machine Translation

Machine translation is the problem of automatically translating from one natural language to another. Formally, we define this problem by specifying $\mathcal{F}$ to be the collection of all source sentences to be translated, $\boldsymbol{f} \in \mathcal{F}$ as one of the sentences, and $\mathcal{E}(\boldsymbol{f})$ as the collection of all possible target language sentences that can be obtained by translating $\boldsymbol{f}$. Machine translation systems perform this translation process by dividing the translation of a full sentence into the translation and recombination of smaller parts, which are represented as **hidden variables**, which together form a **derivation**.

For example, in phrase-based translation (Koehn, Och, and Marcu 2003), the hidden variables will be the alignment between the phrases of the source and target sentences, and in tree-based translation models (Yamada and Knight 2001; Chiang 2007), the hidden variables will represent the latent tree structure used to generate the translation. We will define $\mathcal{D}(\boldsymbol{f})$ to be the space of possible derivations that can be acquired from source sentence $\boldsymbol{f}$, and $\boldsymbol{d} \in \mathcal{D}(\boldsymbol{f})$ to be one of those derivations. Any particular derivation $\boldsymbol{d}$ will correspond to exactly one $\boldsymbol{e} \in \mathcal{E}(\boldsymbol{f})$, although the opposite is not true (the derivation uniquely determines the translation, but there can be multiple derivations corresponding to a particular translation). We also define tuple $\langle \boldsymbol{e}, \boldsymbol{d} \rangle$ consisting of a target sentence and its corresponding derivation, and $\mathcal{C}(\boldsymbol{f}) \subseteq \mathcal{E}(\boldsymbol{f}) \times \mathcal{D}(\boldsymbol{f})$ as the set of all of these tuples.

To choose among these candidates, we define a **linear model** that determines the score of each translation candidate. In this linear model we first define an $M$-dimensional **feature vector** for each output and its derivation as $\boldsymbol{h}(\boldsymbol{f}, \boldsymbol{e}, \boldsymbol{d}) \in \mathbb{R}^{M}$. For each feature, we also define a corresponding weight, resulting in an $M$-dimensional **weight vector** $\boldsymbol{w} \in \mathbb{R}^{M}$. Based on these feature and weight vectors, we proceed to define the problem of selecting the best $\langle \boldsymbol{e}, \boldsymbol{d} \rangle$ as the following maximization problem

$$\langle \hat{\boldsymbol{e}}, \hat{\boldsymbol{d}} \rangle = \operatorname*{arg\,max}_{\langle \boldsymbol{e}, \boldsymbol{d} \rangle \in \mathcal{C}(\boldsymbol{f})} \boldsymbol{w}^{\top} \boldsymbol{h}(\boldsymbol{f}, \boldsymbol{e}, \boldsymbol{d}) \tag{1}$$

where the dot product of the parameters and features is equivalent to the score assigned to a particular translation.

The **optimization** problem that we will be surveying in this article is generally concerned with finding the most effective weight vector $\boldsymbol{w}$ from the set of possible weight vectors $\mathbb{R}^{M}$.^{1} Optimization is also widely called **tuning** in the SMT literature. In addition, because of the exponentially large number of possible translations in $\mathcal{E}(\boldsymbol{f})$ that must be considered, it is necessary to take advantage of the problem structure, making MT optimization an instance of **structured learning**.
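The maximization in Equation (1) can be sketched over a small, explicitly enumerated candidate set. The sketch below is illustrative only: the feature names (`tm`, `lm`, `word_penalty`), their values, and the derivation labels are hypothetical, and a real decoder searches a far larger space rather than a two-element list.

```python
from typing import Dict, List, Tuple

# A candidate is a (translation, derivation, feature dictionary) triple.
Candidate = Tuple[str, str, Dict[str, float]]


def score(weights: Dict[str, float], features: Dict[str, float]) -> float:
    """Dot product of the weight and feature vectors, stored sparsely by name."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def best_candidate(weights: Dict[str, float], candidates: List[Candidate]) -> Candidate:
    """Return the highest-scoring candidate (the argmax in Equation (1))."""
    return max(candidates, key=lambda cand: score(weights, cand[2]))


# Hypothetical feature names and values, chosen only for illustration.
weights = {"tm": 1.0, "lm": 0.5, "word_penalty": -0.1}
candidates = [
    ("the chinese delegation", "d1", {"tm": -1.2, "lm": -3.0, "word_penalty": 3.0}),
    ("chinese delegation the", "d2", {"tm": -1.0, "lm": -6.0, "word_penalty": 3.0}),
]
best = best_candidate(weights, candidates)
print(best[0])
```

Changing `weights` changes which candidate wins, which is exactly the degree of freedom that optimization exploits.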

### 2.2 Model Construction

The first step of creating a machine translation system is model construction, in which **translation models** (TMs) are extracted from a large parallel corpus. The TM is usually created by first aligning the parallel text (Och and Ney 2003), using this text to extract multi-word phrase pairs or synchronous grammar rules (Koehn, Och, and Marcu 2003; Chiang 2007), and scoring these rules according to several features explained in more detail in Section 2.3. The construction of the TM is generally performed first in a manner that does not directly consider the optimization of translation accuracy, followed by an optimization step that explicitly considers the accuracy achieved by the system.^{2} In this survey, we focus on the optimization step, and thus do not cover elements of model construction that do not directly optimize an objective function related to translation accuracy, but interested readers can reference Koehn (2010) for more details.

In the context of this article, however, the TM is particularly important in the role it plays in defining our derivation space $\mathcal{D}(\boldsymbol{f})$. For example, in the case of phrase-based translation, only phrase pairs included in the TM will be expanded during the process of searching for the best translation (explained in Section 2.4).

This has major implications from the point of view of optimization, the most important of which is that we must use separate data for training the TM and optimizing the parameters $\boldsymbol{w}$. The reason for this lies in the fact that the TM is constructed in such a way that allows it to “memorize” long multi-word phrases included in the training data. Using the same data to train the model parameters will result in **overfitting**, learning parameters that heavily favor using these memorized multi-word phrases, which will not be present in a separate test set.

The traditional way to solve this problem is to train the TM on a large parallel corpus on the order of hundreds of thousands to tens of millions of sentences, then perform optimization of parameters on a separate set of data consisting of around one thousand sentences, often called the **development set**. When learning the weights for larger feature sets, however, a smaller development set is often not sufficient, and it is common to perform **cross-validation**, holding out some larger portion of the training set for parameter optimization. It is also possible to perform **leaving-one-out** training, where counts of rules extracted from a particular sentence are subtracted from the model before translating the sentence (Wuebker, Mauser, and Ney 2010).

### 2.3 Features for Machine Translation

Given this overall formulation of MT, the features $\boldsymbol{h}(\boldsymbol{f}, \boldsymbol{e}, \boldsymbol{d})$ that we choose to use to represent each translation hypothesis are of great importance. In particular, with regard to optimization, there are two important distinctions between types of features: local vs. non-local, and dense vs. sparse.

With regard to the first distinction, **local features**, such as phrase translation probabilities, do not require additional contexts from other partial derivations, and they are computed independently from one another. On the other hand, when features for a particular phrase pair or synchronous rule cannot be computed independently from other pairs, they are called **non-local features**. This distinction is important, as local features will not result in an increase in the size of the search space, whereas non-local features have the potential to make search more difficult.

The second distinction is between **dense features**, which define a small number of highly informative feature functions, and **sparse features**, which define a large number of less informative feature functions. Dense features are generally easier to optimize, both from a computational point of view because the smaller number of features reduces computational and memory requirements, and because the smaller number of parameters reduces the risk of overfitting. On the other hand, sparse features allow for more flexibility, as their parameters can be directly optimized to increase translation accuracy, so if optimization is performed well they have the potential to greatly increase translation accuracy. The remainder of this section describes some of the widely used features in more detail.

#### 2.3.1 Dense Features

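Commonly used dense features include **translation probabilities** or **relative frequencies**, in which the logs of the sentence-wise probability distributions $p(\boldsymbol{f} \mid \boldsymbol{e})$ and $p(\boldsymbol{e} \mid \boldsymbol{f})$ are split into the sum of phrase or rule log probabilities

$$\log p(\boldsymbol{f} \mid \boldsymbol{e}) = \sum_{\langle \alpha, \beta \rangle \in \boldsymbol{d}} \log p(\alpha \mid \beta), \qquad \log p(\boldsymbol{e} \mid \boldsymbol{f}) = \sum_{\langle \alpha, \beta \rangle \in \boldsymbol{d}} \log p(\beta \mid \alpha).$$

Here $\alpha$ and $\beta$ are the source and target sides of a phrase pair or rule. These features are estimated using counts of each phrase derived from the training corpus as follows:

$$p(\alpha \mid \beta) = \frac{\mathrm{count}(\alpha, \beta)}{\sum_{\alpha'} \mathrm{count}(\alpha', \beta)}, \qquad p(\beta \mid \alpha) = \frac{\mathrm{count}(\alpha, \beta)}{\sum_{\beta'} \mathrm{count}(\alpha, \beta')}.$$

In addition, it is also common to use **lexical weighting**, which estimates parameters for each phrase pair or rule by further decomposing them into word-wise probabilities (Koehn, Och, and Marcu 2003). This helps more accurately estimate the reliability of phrase pairs or rules that have low counts. It should be noted that all of these features can be calculated directly from the rules themselves, and are thus local features.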
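As a minimal sketch of relative-frequency estimation, the following computes both conditional probabilities from a list of extracted phrase pairs and sums their logs over a derivation. The phrase pairs and counts are invented for illustration, and the sketch assumes that extraction has already produced one `(source, target)` tuple per observed occurrence.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Tuple

PhrasePair = Tuple[str, str]  # (source side alpha, target side beta)


def relative_frequencies(
    extracted: Iterable[PhrasePair],
) -> Tuple[Dict[PhrasePair, float], Dict[PhrasePair, float]]:
    """Estimate p(alpha | beta) and p(beta | alpha) from phrase-pair counts."""
    pairs = list(extracted)
    pair_counts = Counter(pairs)
    source_counts = Counter(alpha for alpha, _ in pairs)
    target_counts = Counter(beta for _, beta in pairs)
    p_src_given_tgt = {(a, b): c / target_counts[b] for (a, b), c in pair_counts.items()}
    p_tgt_given_src = {(a, b): c / source_counts[a] for (a, b), c in pair_counts.items()}
    return p_src_given_tgt, p_tgt_given_src


def derivation_log_prob(derivation: List[PhrasePair], table: Dict[PhrasePair, float]) -> float:
    """Sum of phrase-level log probabilities over the phrase pairs in a derivation."""
    return sum(math.log(table[pair]) for pair in derivation)


# Toy extraction output (hypothetical counts).
extracted = [("maison", "house")] * 3 + [("maison", "home"), ("domicile", "home")]
p_src_given_tgt, p_tgt_given_src = relative_frequencies(extracted)
feature_value = derivation_log_prob([("maison", "house"), ("domicile", "home")], p_tgt_given_src)
```

Because each term depends only on the phrase pair itself, the resulting feature is local in the sense defined above.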

The $n$-gram language model assigns higher penalties for longer translations, and it is common to add a **word penalty feature** that measures the length of translation $\boldsymbol{e}$ to compensate for this. Similarly, **phrase penalty** or **rule penalty** features express the trade-off between longer or shorter derivations. There exist other features that are dependent on the underlying MT system model. Phrase-based MT heavily relies on the **distortion probabilities** that are computed by the distance on the source side of target-adjacent phrase pairs. More refined **lexicalized reordering models** estimate the parameters from the training data based on the relative distance of two phrase pairs (Tillman 2004; Galley and Manning 2008).

#### 2.3.2 Sparse features

Although dense features form the foundation of most SMT systems, in recent years the ability to define richer feature sets and directly optimize the system using rich features has been shown to allow for significant increases in accuracy. On the other hand, large and sparse feature sets make the MT optimization problem significantly harder, and many of the optimization methods we will cover in the rest of this survey are aimed at optimizing rich feature sets.

One straightforward class of sparse features is **phrase features** or **rule features**, which count the occurrence of every phrase or rule. Of course, it is only possible to learn parameters for a translation rule if it exists in the training data used in optimization, so when using a smaller data set for optimization it is difficult to robustly learn these features. Chiang, Knight, and Wang (2009) have noted that this problem can be alleviated by only selecting and optimizing the more frequent of the sparse features. Simianer, Riezler, and Dyer (2012) also propose features using the “shape” of translation rules, transforming a rule into a string that simply indicates whether each word is a terminal (T) or non-terminal (N). Count-based features can also be extended to cover other features of the translation, such as phrase or rule bigrams, indicating which phrases or rules tend to be used together (Simianer, Riezler, and Dyer 2012).
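The rule-shape idea can be sketched as a small string transformation. Everything concrete here is an assumption made for illustration: the rule, the convention that non-terminals are written `X1`, `X2`, and so on, and the `|||` separator in the output string are not taken from the cited work.

```python
import re
from typing import List

# Assumed convention: non-terminals are written X1, X2, ...; anything else is a terminal.
NONTERMINAL = re.compile(r"^X\d+$")


def side_shape(tokens: List[str]) -> str:
    """Map each symbol on one side of a rule to N (non-terminal) or T (terminal)."""
    return " ".join("N" if NONTERMINAL.match(tok) else "T" for tok in tokens)


def rule_shape(source: List[str], target: List[str]) -> str:
    """Build a shape string usable as the name of a sparse count feature."""
    return f"{side_shape(source)} ||| {side_shape(target)}"


# Hypothetical hierarchical rule used only as an example.
shape = rule_shape(["ne", "X1", "pas"], ["did", "not", "X1"])
print(shape)
```

Many distinct rules map to the same shape, so shape features are considerably less sparse than one feature per rule.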

Another alternative for the creation of features that are sparse, but less sparse than features of phrases or rules, is **lexical features** (Watanabe et al. 2007). Lexical features, similar to lexical weighting, focus on the correspondence between the individual words that are included in a phrase or rule. The simplest variety of lexical features remembers which source words *f* are aligned with which target words *e*, and fires a feature for each pair. It is also possible to condition lexical features on the surrounding context in the source language (Chiang, Knight, and Wang 2009; Xiao et al. 2011), fire features between every pair of words in the source or target sentences (Watanabe et al. 2007), or integrate bigrams on the target side (Watanabe et al. 2007). Of these, the first two can be calculated from source and local target context, but target bigrams require target bigram context and are thus non-local features.

One final variety of features that has proven useful is **syntax-based features** (Blunsom and Osborne 2008; Marton and Resnik 2008). In particular, phrase-based and hierarchical phrase-based translations do not directly consider syntax (in the linguistic sense) in the construction of the models, so introducing this information in the form of features has a potential for benefit. One way to introduce this information is to parse the input sentence before translation, and use the information in the parse tree in the calculation of features. For example, we can count the number of times a phrase or translation rule matches, or partially matches (Marton and Resnik 2008), a span with a particular label, based on the assumption that rules that match a syntactic span are more likely to be syntactically reasonable.

#### 2.3.3 Summary features

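A dense feature created from a group of sparse features and their weights is called a **summary feature**, and can be expressed as follows:

$$h_{\mathrm{sum}}(\boldsymbol{f}, \boldsymbol{e}, \boldsymbol{d}) = \boldsymbol{w}_{\mathrm{sparse}}^{\top} \boldsymbol{h}_{\mathrm{sparse}}(\boldsymbol{f}, \boldsymbol{e}, \boldsymbol{d}).$$

There has also been work that splits sparse features into not one, but multiple groups, creating a dense feature for each group (Xiang and Ittycheriah 2011; Liu et al. 2013).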

### 2.4 Decoding

Given an input sentence $\boldsymbol{f}$, the task of decoding is defined as an inference problem of finding the best scoring derivation according to Equation (1). In general, the inference is intractable if we enumerate all possible derivations in $\mathcal{D}(\boldsymbol{f})$ and rank each derivation by the model. We assume that a derivation is composed of a set of steps

$$\boldsymbol{d} = d_1, d_2, \ldots, d_J$$

where each $d_j$ is a step (for example, a phrase pair in phrase-based MT or a synchronous rule in tree-based MT) ordered in a particular way. We also assume that each feature function can be decomposed over each step, and Equation (1) can be expressed by

$$\langle \hat{\boldsymbol{e}}, \hat{\boldsymbol{d}} \rangle = \operatorname*{arg\,max}_{\langle \boldsymbol{e}, \boldsymbol{d} \rangle \in \mathcal{C}(\boldsymbol{f})} \sum_{i=1}^{M} w_i \sum_{j=1}^{J} h_i\big(d_j, \rho_i(d_1^{j-1})\big) \tag{9}$$

where $h_i(d_j, \rho_i(d_1^{j-1}))$ is a feature function for the $j$th step decomposed from the global feature function of $h_i(\boldsymbol{f}, \boldsymbol{e}, \boldsymbol{d})$. As mentioned in the previous section, non-local features require information that cannot be calculated directly from the rule itself, and $\rho_i(d_1^{j-1})$ is a variable that defines the residual information to score this $i$th feature function using the partial derivation (Gesmundo and Henderson 2011; Green, Cer, and Manning 2014). For example, in phrase-based translation, for an $n$-gram language model feature, $\rho_i(d_1^{j-1})$ will be the $n-1$ word suffix of the partial translation (Koehn, Och, and Marcu 2003). The local feature functions, such as phrase translation probabilities in Section 2.3.1, require no context from partial derivations, and thus $\rho_i(d_1^{j-1}) = \emptyset$.

The problem of decoding is treated as a search problem in which partial derivations together with $\rho$ in Equation (9) are enumerated to form hypotheses or states. In phrase-based MT, search is carried out by enumerating partial derivations in left-to-right order on the target side while remembering the translated source word positions. Similarly, the search in MT with synchronous grammars is performed by using the CYK+ algorithm (Chappelier and Rajman 1998) on the source side and generating partial derivations for progressively longer source spans. Because of the enormous search space brought about by maintaining $\rho$ in each partial derivation, **beam search** is used to heuristically prune the search space. As a result, the search is **inexact** because of the **search error** caused by heuristic pruning, in which the best scoring hypothesis is not necessarily optimal in terms of given model parameters.

The search is efficiently carried out by merging equivalent states encoded as ρ (Koehn, Och, and Marcu 2003; Huang and Chiang 2007), and the space is succinctly represented by compact data structures, such as **graphs** (Ueffing, Och, and Ney 2002) (or **lattices**) in phrase-based MT (Koehn, Och, and Marcu 2003) and **hypergraphs** (Klein and Manning 2004) (or **packed forests**) in tree-based MT (Huang and Chiang 2007). These data structures may be directly used as compact representations of all derivations for optimization.

However, using these data structures directly can be unwieldy, and thus it is more common to obtain a *k*-best list as an approximation of the derivation space. Figure 1(a) shows an example of *k*-best English translations for a French input sentence, ‘*la délégation chinoise appuiera pleinement la présidence*.’ The *k*-best list may be obtained either from a lattice in Figure 1(b) or from a forest in Figure 1(c). It should be noted that different derivations in a *k*-best list may share the same translation due to the variation of phrases or rules in constructing a translation, e.g., the choice of *support the chair* or *support* and *the chair* in Figure 1(b). A diverse *k*-best list can be obtained by extracting a unique *k*-best list that maintains only the best scored derivation sharing the same translation (Huang, Knight, and Joshi 2006; Hasan, Zens, and Ney 2007), by incorporating a penalty term when scoring derivations (Gimpel et al. 2013), or by performing Monte Carlo sampling to acquire a more diverse set of candidates (Blunsom and Osborne 2008).
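The first of these strategies, keeping only the best-scored derivation for each distinct translation, can be sketched as a post-processing step over an ordinary *k*-best list. This is a simplification under stated assumptions: the cited methods extract unique lists directly from a lattice or forest, whereas the sketch filters an already enumerated list, and the candidate tuples and scores are hypothetical.

```python
from typing import List, Tuple

# (translation, derivation label, model score); higher scores are better.
Hypothesis = Tuple[str, str, float]


def unique_kbest(kbest: List[Hypothesis], k: int) -> List[Hypothesis]:
    """Keep the best-scored derivation per distinct translation, up to k entries."""
    seen = set()
    unique: List[Hypothesis] = []
    for hyp in sorted(kbest, key=lambda h: h[2], reverse=True):
        translation = hyp[0]
        if translation in seen:
            continue  # a better-scored derivation of this translation is already kept
        seen.add(translation)
        unique.append(hyp)
        if len(unique) == k:
            break
    return unique


# Two derivations share one translation, as with "support the chair" in Figure 1(b).
kbest = [
    ("will fully support the chair", "d1: [support the chair]", -4.1),
    ("will fully support the chair", "d2: [support] [the chair]", -4.6),
    ("will support fully the chair", "d3", -5.0),
]
diverse = unique_kbest(kbest, k=2)
```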

Another class of decoding problem is **forced decoding**, in which the output from a decoder is forced to match with a reference translation of the input sentence. In phrase-based MT, this is implemented by adding additional features to reward hypotheses that match with the given target sentence (Liang, Zhang, and Zhao 2012; Yu et al. 2013). In MT using synchronous grammars, it is carried out by **biparsing** over two languages, for instance, by a variant of the CYK algorithm (Wu 1997) or by a more efficient two-step algorithm (Dyer 2010b; Peitz et al. 2012). Even if we perform forced decoding, we are still not guaranteed that the decoder will be able to produce the reference translation (because of unknown words, reordering limits, or other factors). This problem can be resolved by preserving the prefix of partial derivations (Yu et al. 2013), or by allowing approximate matching of the target side (Liang, Zhang, and Zhao 2012). It is also possible to create a **neighborhood** of a forced decoding derivation by adding additional hyperedges to the true derivation, which allows for efficient generation of negative examples for discriminative learning algorithms (Xiao et al. 2011).

### 2.5 Evaluation

Once we have a machine translation system that can produce translations, we next must perform **evaluation** to judge how good the generated translations actually are. As the final consumer of machine translation output is usually a human, the most natural form of evaluation is manual evaluation by human annotators. However, because human evaluation is expensive and time-consuming, in recent years there has been a shift to automatic calculation of the quality of MT output.

In general, automatic evaluation measures use a set of data consisting of $N$ input sentences $F = \{\boldsymbol{f}^{(1)}, \ldots, \boldsymbol{f}^{(N)}\}$, each of which has a **reference translation**, $E = \{\boldsymbol{e}^{(1)}, \ldots, \boldsymbol{e}^{(N)}\}$, that was created by a human translator. The input $F$ is automatically translated using a machine translation system to acquire MT results $\hat{E} = \{\hat{\boldsymbol{e}}^{(1)}, \ldots, \hat{\boldsymbol{e}}^{(N)}\}$, which are then compared to the corresponding references. The closer the MT output is to the reference, the better it is deemed to be, according to automatic evaluation. In addition, as there are often many ways to translate a particular sentence, it is also possible to perform evaluation with multiple references created by different translators. There has also been some work on encoding a huge number of references in a lattice, created either by hand (Dreyer and Marcu 2012) or by automatic paraphrasing (Zhou, Lin, and Hovy 2006).

One major distinction between optimization measures is whether they are calculated on the **corpus level** or the **sentence level**. Corpus-level measures are calculated by taking statistics over the whole corpus, whereas sentence-level measures are calculated by measuring sentence-level accuracy, and defining the corpus-level accuracy as the average of the sentence-level accuracies. All optimization algorithms that are applicable to corpus-level measures are applicable to sentence-level measures, but the opposite is not true, making this distinction important from the optimization point of view.

The most commonly used MT evaluation measure BLEU (Papineni et al. 2002) is defined on the corpus level, and we will cover it in detail as it plays an important role in some of the methods that follow. Of course, there have been many other evaluation measures proposed since BLEU, with TER (Snover et al. 2006) and METEOR (Banerjee and Lavie 2005) being among the most widely used. The great majority of metrics other than BLEU are defined on the sentence level, and thus are conducive to optimization algorithms that require sentence-level evaluation measures. We discuss the role of evaluation in MT optimization more completely in Section 8.3.

#### 2.5.1 BLEU

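BLEU is defined as the geometric mean of $n$-gram precisions (usually for $n$ from 1 to 4), and a brevity penalty to prevent short sentences from receiving unfairly high evaluation scores. For a single reference sentence $\boldsymbol{e}$ and a corresponding system output $\hat{\boldsymbol{e}}$, we can define $c_n(\hat{\boldsymbol{e}})$ as the number of $n$-grams in $\hat{\boldsymbol{e}}$, and $m_n(\boldsymbol{e}, \hat{\boldsymbol{e}})$ as the number of $n$-grams in $\hat{\boldsymbol{e}}$ that match $\boldsymbol{e}$:

$$c_n(\hat{\boldsymbol{e}}) = \big|\{g_n \in \hat{\boldsymbol{e}}\}\big|, \qquad m_n(\boldsymbol{e}, \hat{\boldsymbol{e}}) = \big|\{g_n \in \hat{\boldsymbol{e}}\} \cap \{g_n \in \boldsymbol{e}\}\big|.$$

Here, $\{g_n \in \boldsymbol{e}\}$ and $\{g_n \in \hat{\boldsymbol{e}}\}$ are multisets that can contain identical $n$-grams more than once, and $\cap$ is an operator for multisets that allows for consideration of multiple instances of the same $n$-gram.^{3} Note that the total count for a candidate $n$-gram is **clipped** to be no more than the count in the reference translation. If we have a corpus of reference sets $R = \{\boldsymbol{e}^{(1)}, \ldots, \boldsymbol{e}^{(N)}\}$, where each sentence has $M$ references $\boldsymbol{e}^{(i)} = \{\boldsymbol{e}^{(i)}_1, \ldots, \boldsymbol{e}^{(i)}_M\}$, the BLEU score of the corresponding system outputs $\hat{E} = \{\hat{\boldsymbol{e}}^{(1)}, \ldots, \hat{\boldsymbol{e}}^{(N)}\}$ can be defined as

$$\mathrm{BLEU}(R, \hat{E}) = \exp\left(\frac{1}{4} \sum_{n=1}^{4} \log \frac{\sum_{i=1}^{N} m_n(\boldsymbol{e}^{(i)}, \hat{\boldsymbol{e}}^{(i)})}{\sum_{i=1}^{N} c_n(\hat{\boldsymbol{e}}^{(i)})}\right) \cdot \mathrm{BP}(R, \hat{E})$$

where the first term corresponds to the geometric mean of the $n$-gram precisions, and the second term $\mathrm{BP}(R, \hat{E})$ is the **brevity penalty**. With multiple references, $m_n$ clips each candidate $n$-gram count to its maximum count in any single reference. The brevity penalty is necessary here because evaluation of precision favors systems that output only the words and phrases that have high accuracy, and avoid outputting more difficult-to-translate content that might not match the reference. The brevity penalty prevents this by discounting outputs that are shorter than the reference

$$\mathrm{BP}(R, \hat{E}) = \min\left(1, \exp\left(1 - \frac{\sum_{i=1}^{N} |\tilde{\boldsymbol{e}}^{(i)}|}{\sum_{i=1}^{N} |\hat{\boldsymbol{e}}^{(i)}|}\right)\right)$$

where $\tilde{\boldsymbol{e}}^{(i)}$ is defined as the longest reference with a length shorter than or equal to $\hat{\boldsymbol{e}}^{(i)}$.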
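The corpus-level computation described above (clipped $n$-gram matches accumulated over all sentences, followed by a brevity penalty) can be sketched as follows. This is an illustrative implementation rather than a reference one: tokenization is assumed to be done already, and the fallback to the shortest reference when every reference is longer than the output is an assumption, since the text above does not specify that case.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Sentence = Sequence[str]


def ngrams(tokens: Sentence, n: int) -> Counter:
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def match_and_total(refs: List[Sentence], hyp: Sentence, n: int) -> Tuple[int, int]:
    """Clipped n-gram matches m_n and total hypothesis n-grams c_n."""
    hyp_counts = ngrams(hyp, n)
    max_ref = Counter()
    for ref in refs:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matches = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return matches, sum(hyp_counts.values())


def effective_ref_length(refs: List[Sentence], hyp: Sentence) -> int:
    """Longest reference no longer than the hypothesis (assumed fallback: shortest)."""
    not_longer = [len(ref) for ref in refs if len(ref) <= len(hyp)]
    return max(not_longer) if not_longer else min(len(ref) for ref in refs)


def corpus_bleu(references: List[List[Sentence]], hypotheses: List[Sentence], max_n: int = 4) -> float:
    matches, totals = [0] * max_n, [0] * max_n
    ref_len = hyp_len = 0
    for refs, hyp in zip(references, hypotheses):
        for n in range(1, max_n + 1):
            m, c = match_and_total(refs, hyp, n)
            matches[n - 1] += m
            totals[n - 1] += c
        ref_len += effective_ref_length(refs, hyp)
        hyp_len += len(hyp)
    if min(matches) == 0:
        return 0.0  # some n-gram order has no match, so the geometric mean is zero
    log_precision = sum(math.log(m / c) for m, c in zip(matches, totals)) / max_n
    brevity_penalty = min(1.0, math.exp(1.0 - ref_len / hyp_len))
    return math.exp(log_precision) * brevity_penalty


reference = "the chinese delegation will fully support the chair".split()
exact = corpus_bleu([[reference]], [reference])
truncated = corpus_bleu([[reference]], [reference[:6]])
```

In the `truncated` case every $n$-gram precision is perfect, so the score is reduced only by the brevity penalty.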

#### 2.5.2 BLEU+1

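Although BLEU can also be calculated on a single sentence, in this case it is common for the number of matches of higher-order $n$-grams to become zero, resulting in a BLEU score of zero for the entire sentence. One common solution to this problem is the use of a smoothed version of BLEU, commonly referred to as BLEU+1 (Lin and Och 2004). In BLEU+1, we add one to the numerators and denominators of each $n$-gram of order greater than one

$$m'_n(\boldsymbol{e}, \hat{\boldsymbol{e}}) = m_n(\boldsymbol{e}, \hat{\boldsymbol{e}}) + \delta(n > 1), \qquad c'_n(\hat{\boldsymbol{e}}) = c_n(\hat{\boldsymbol{e}}) + \delta(n > 1)$$

where $\delta(\cdot)$ is a function that takes a value of 1 when the corresponding statement is true. We can then re-define a sentence-level BLEU using these smoothed counts

$$\mathrm{BLEU}'(\boldsymbol{e}, \hat{\boldsymbol{e}}) = \exp\left(\frac{1}{4} \sum_{n=1}^{4} \log \frac{m'_n(\boldsymbol{e}, \hat{\boldsymbol{e}})}{c'_n(\hat{\boldsymbol{e}})}\right) \cdot \mathrm{BP}(\boldsymbol{e}, \hat{\boldsymbol{e}})$$

and the corpus-level evaluation can be re-defined as the average of sentence-level evaluations

$$\mathrm{BLEU}'(R, \hat{E}) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{BLEU}'(\boldsymbol{e}^{(i)}, \hat{\boldsymbol{e}}^{(i)}).$$

It has also been noted, however, that the average of sentence-level BLEU+1 is not a very accurate approximation of corpus-level BLEU, but by adjusting the smoothing heuristics it is possible to achieve a more accurate approximation (Nakov, Guzman, and Vogel 2012).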
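A minimal sketch of sentence-level BLEU+1 with a single reference follows, together with the corpus-level average. The add-one smoothing is applied only for orders above one, as described in the text; the example sentences are hypothetical, and returning zero for an empty output or for an output with no unigram match is a choice made for this sketch.

```python
import math
from collections import Counter
from typing import List, Sequence

Sentence = Sequence[str]


def ngrams(tokens: Sentence, n: int) -> Counter:
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_plus_one(ref: Sentence, hyp: Sentence, max_n: int = 4) -> float:
    """Sentence-level BLEU with add-one smoothing for n-gram orders above one."""
    if not hyp:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        matches = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        smoothing = 1 if n > 1 else 0
        matches, total = matches + smoothing, total + smoothing
        if matches == 0:
            return 0.0  # only possible for unigrams, which are not smoothed
        log_precision += math.log(matches / total) / max_n
    brevity_penalty = min(1.0, math.exp(1.0 - len(ref) / len(hyp)))
    return math.exp(log_precision) * brevity_penalty


def average_bleu_plus_one(refs: List[Sentence], hyps: List[Sentence]) -> float:
    """Corpus-level score defined as the mean of sentence-level BLEU+1."""
    return sum(bleu_plus_one(r, h) for r, h in zip(refs, hyps)) / len(hyps)


ref = "the chinese delegation will fully support the chair".split()
hyp = "the chinese delegation supports the chair".split()
smoothed = bleu_plus_one(ref, hyp)
```

Here `hyp` has no matching 4-gram, so unsmoothed sentence-level BLEU would be zero, while the smoothed score remains positive.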

### 2.6 The Optimization Setting

During the optimization process, we will assume that we have some data consisting of sources $F$ with corresponding references $E$ as defined in the previous section, and that we would like to use these to optimize the parameters of the model. As mentioned in Section 2.5, it is also possible to use more than one reference translation in evaluation, but in this survey we will assume for simplicity of exposition that only one reference is used.

If we change the parameters $\boldsymbol{w}$, this will affect the scores calculated according to the model, and thus the result acquired during decoding, as described in Section 2.4. To express whether this effect is a positive or negative one, we define a **loss function** $\ell(F, E; \boldsymbol{w})$ that provides a numerical indicator of how “bad” the translations generated when we use a particular $\boldsymbol{w}$ are. As the goal of optimization is to achieve better translations, we would like to choose parameters that reduce this loss. More formally, we can cast the problem as minimizing the expectation of $\ell(\cdot)$, or **risk minimization**:

$$\hat{\boldsymbol{w}} = \operatorname*{arg\,min}_{\boldsymbol{w} \in \mathbb{R}^{M}} \mathbb{E}_{Pr(F, E)}\big[\ell(F, E; \boldsymbol{w})\big].$$

Here, $Pr(F, E)$ is the true joint distribution over all sets of input and output sentences that we are likely to be required to translate. However, in reality we will not know the true distribution over all sets of sentences a user may ask us to translate. Instead, we have a single set of data (henceforth, **training data**), and attempt to find the $\boldsymbol{w}$ that minimizes the loss on this data:

$$\hat{\boldsymbol{w}} = \operatorname*{arg\,min}_{\boldsymbol{w} \in \mathbb{R}^{M}} \ell(F, E; \boldsymbol{w}).$$

Because we are now optimizing on a single empirically derived set of training data, this framework is called **empirical risk minimization**.

In addition, it is common to introduce **regularization** to prevent the learning of parameters that over-fit the training data. This gives us the framework of **regularized empirical risk minimization**, which will encompass most of the methods described in this survey, and is formalized as

$$\hat{\boldsymbol{w}} = \operatorname*{arg\,min}_{\boldsymbol{w} \in \mathbb{R}^{M}} \ell(F, E; \boldsymbol{w}) + \lambda \Omega(\boldsymbol{w})$$

where $\lambda$ is a parameter adjusting the strength of regularization, and $\Omega(\boldsymbol{w})$ is a regularization term, common choices for which include the $L_2$ regularizer $\Omega_2(\boldsymbol{w}) = \frac{1}{2}\|\boldsymbol{w}\|_2^2 = \frac{1}{2}\boldsymbol{w}^{\top}\boldsymbol{w}$ or the $L_1$ regularizer $\Omega_1(\boldsymbol{w}) = \|\boldsymbol{w}\|_1 = \sum_{m=1}^{M} |w_m|$ (Tibshirani 1996; Chen and Rosenfeld 1999). Intuitively, if $\lambda$ is set to a small value, optimization will attempt to learn a $\boldsymbol{w}$ that effectively minimizes loss on the training data, but there is a risk of over-fitting reducing generalization capability. On the other hand, if $\lambda$ is set to a larger value, optimization will be less aggressive in minimizing loss on the training data, reducing over-fitting, but also possibly failing to capture useful information that could be used to improve accuracy.
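The effect of $\lambda$ can be illustrated with a one-dimensional toy problem. The sketch below is not an MT loss: the quadratic `toy_loss` is a hypothetical stand-in for the training loss, the $L_2$ regularizer is assumed to be half the squared Euclidean norm, and the minimizer is found by a plain grid search so that no optimization library is needed.

```python
from typing import Callable, List, Sequence


def l2_regularizer(w: Sequence[float]) -> float:
    """Assumed form: half the squared Euclidean norm."""
    return 0.5 * sum(v * v for v in w)


def l1_regularizer(w: Sequence[float]) -> float:
    """Sum of absolute values of the weights."""
    return sum(abs(v) for v in w)


def regularized_objective(
    loss: Callable[[Sequence[float]], float],
    w: Sequence[float],
    lam: float,
    regularizer: Callable[[Sequence[float]], float],
) -> float:
    """Training loss plus lambda times the regularization term."""
    return loss(w) + lam * regularizer(w)


def toy_loss(w: Sequence[float]) -> float:
    """Hypothetical stand-in for the training loss, minimized at w = 2."""
    return (w[0] - 2.0) ** 2


def grid_minimizer(lam: float) -> float:
    """Grid search over w in [-3, 3] with step 0.01."""
    grid: List[float] = [i / 100 for i in range(-300, 301)]
    return min(grid, key=lambda v: regularized_objective(toy_loss, [v], lam, l2_regularizer))


unregularized = grid_minimizer(0.0)
regularized = grid_minimizer(2.0)
```

With $\lambda = 0$ the minimizer sits at the loss optimum, while $\lambda = 2$ pulls it toward zero, trading training loss for smaller weights.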

## 3. Defining a Loss Function

The first step in performing optimization is defining the loss function that we are interested in optimizing. The choice of a proper loss function is critical in that it affects the final performance of the optimized MT system, and also the possible choices for optimization algorithms. This section describes several common choices for loss functions, and describes their various features.

### 3.1 Error

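The first and most straightforward loss that we can attempt to optimize is **error** (Och 2003). We assume that by comparing the decoder's translation result $\hat{E}$ with the reference *E*, we are able to calculate a function $\mathrm{error}(E, \hat{E})$ that describes the extent of error included in the translations. For example, if we use the BLEU described in Section 2.5 as an evaluation measure for our system, it is natural to use 1 − BLEU as an error function, so that as our evaluation improves, the error decreases. Converting this to a loss function that is dependent on the model parameters, we obtain the following loss expressing the error over the 1-best results obtained by decoding in Equation (1):

$$\ell_{\mathrm{error}}(F, E, C; \mathbf{w}) = \mathrm{error}\left(E, \left\{ \operatorname*{argmax}_{\langle \mathbf{e}, \mathbf{d} \rangle \in c^{(i)}} \mathbf{w}^{\top} \mathbf{h}(f^{(i)}, \mathbf{e}, \mathbf{d}) \right\}_{i=1}^{N} \right) \tag{17}$$

Here *c*^{(i)} denotes the set of candidates considered for sentence *i*, *C* the collection of these sets, and *N* the number of sentences.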

Error has the advantage of being simple, easy to explain, and directly related to translation performance, and these features make it perhaps the most commonly used loss in current machine translation systems. On the other hand, it also has a large disadvantage in that the loss function expressed in Equation (17) is not convex, and most MT evaluation measures used in the calculation of the error function error(·) are not continuously differentiable. This makes direct minimization of error a difficult optimization problem (particularly for larger feature sets), and thus a number of other, easier-to-optimize losses are used as well.

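A special case of error is the **zero–one loss**. Zero–one loss focuses on whether an **oracle translation** is chosen as the system output. Oracle translations can be loosely defined as “good” translations, such as the reference translation *e*^{(i)}, or perhaps the best translation in the *k*-best list (described in detail in Section 4). If we define the set of oracle translations for sentence *i* as *o*^{(i)}, zero–one loss is defined by plugging the following zero–one error function into Equation (17):

$$\mathrm{error}(E, \hat{E}) = \sum_{i=1}^{N} \left( 1 - \delta\left( \langle \hat{e}^{(i)}, \hat{d}^{(i)} \rangle \in o^{(i)} \right) \right)$$

where $\langle \hat{e}^{(i)}, \hat{d}^{(i)} \rangle$ is the one-best translation candidate, and $\delta(\cdot)$ is one if $\langle \hat{e}^{(i)}, \hat{d}^{(i)} \rangle$ is a member of *o*^{(i)} and zero otherwise.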
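As a small illustration of why this loss is hard to optimize directly, the following sketch computes zero–one loss over toy *k*-best lists. The data layout (one feature vector per candidate, oracle sets given as candidate indices) and the names `KBEST` and `ORACLES` are assumptions made for this example.

```python
from typing import List, Sequence, Set


def dot(w: Sequence[float], h: Sequence[float]) -> float:
    """Model score w^T h of one candidate."""
    return sum(a * b for a, b in zip(w, h))


def one_best(w: Sequence[float], feats: List[Sequence[float]]) -> int:
    """Index of the highest-scoring candidate (ties go to the lowest index)."""
    return max(range(len(feats)), key=lambda k: dot(w, feats[k]))


def zero_one_loss(
    w: Sequence[float],
    kbest_feats: List[List[Sequence[float]]],
    oracles: List[Set[int]],
) -> int:
    """Number of sentences whose 1-best candidate is not an oracle."""
    return sum(
        0 if one_best(w, feats) in oracle else 1
        for feats, oracle in zip(kbest_feats, oracles)
    )


# Two toy sentences with two candidates each (hypothetical feature values).
KBEST = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.2, 0.0], [0.0, 1.0]],
]
ORACLES = [{0}, {1}]
```

Because the loss depends on ***w*** only through the identity of the 1-best candidate, rescaling ***w*** leaves it unchanged, and small changes in ***w*** either have no effect or cause an abrupt jump.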

### 3.2 Softmax Loss

One thing to note about error is that there is no concept of “probability” of each translation candidate incorporated in its calculation. Being able to define a well-scaled probability of candidates can be useful, however, for estimation of confidence measures or incorporation with downstream applications. **Softmax loss** is a loss that is similar to the zero–one loss, but directly defines a probabilistic model and attempts to maximize the probability of the oracle translations (Berger, Della Pietra, and Della Pietra 1996; Och and Ney 2002; Blunsom, Cohn, and Osborne 2008).
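As a sketch of the standard form such a model takes (the notation is an assumption chosen to match the surrounding sections, with *c*^{(i)} the candidate set and *o*^{(i)} the oracle set for sentence *i*), the log-linear probability of a candidate and the resulting loss can be written as:

$$p_{\mathbf{w}}(\mathbf{e}, \mathbf{d} \mid f^{(i)}, c^{(i)}) = \frac{\exp\left(\mathbf{w}^{\top} \mathbf{h}(f^{(i)}, \mathbf{e}, \mathbf{d})\right)}{\sum_{\langle \mathbf{e}', \mathbf{d}' \rangle \in c^{(i)}} \exp\left(\mathbf{w}^{\top} \mathbf{h}(f^{(i)}, \mathbf{e}', \mathbf{d}')\right)}$$

$$\ell_{\mathrm{softmax}}(F, E, C; \mathbf{w}) = -\sum_{i=1}^{N} \log \sum_{\langle \mathbf{e}, \mathbf{d} \rangle \in o^{(i)}} p_{\mathbf{w}}(\mathbf{e}, \mathbf{d} \mid f^{(i)}, c^{(i)}) = -\sum_{i=1}^{N} \log \frac{\sum_{\langle \mathbf{e}, \mathbf{d} \rangle \in o^{(i)}} \exp\left(\mathbf{w}^{\top} \mathbf{h}(f^{(i)}, \mathbf{e}, \mathbf{d})\right)}{\sum_{\langle \mathbf{e}', \mathbf{d}' \rangle \in c^{(i)}} \exp\left(\mathbf{w}^{\top} \mathbf{h}(f^{(i)}, \mathbf{e}', \mathbf{d}')\right)}$$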

From Equation (21) we can see that only the oracle translations contribute to the numerator, and all candidates in *c*^{(i)} contribute to the denominator. Thus, intuitively, the softmax objective prefers parameter settings that assign high scores to the oracle translations, and lower scores to any other members of *c*^{(i)} that are not oracles.

It should be noted that this loss can be calculated from a *k*-best list by iterating over the entire list and calculating the numerators and denominators in Equation (19). It is also possible, but more involved, to calculate over lattices or forests by using dynamic programming algorithms such as the forward–backward or inside–outside algorithms (Blunsom, Cohn, and Osborne 2008; Gimpel and Smith 2009).
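The *k*-best calculation described here can be sketched as follows, assuming the usual log-linear model in which a candidate's probability is proportional to exp(***w***^{⊺}***h***). The data layout (feature vectors per candidate, oracle sets as candidate indices) is hypothetical.

```python
import math
from typing import List, Sequence, Set


def dot(w: Sequence[float], h: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(w, h))


def logsumexp(xs: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def softmax_loss(
    w: Sequence[float],
    kbest_feats: List[List[Sequence[float]]],
    oracles: List[Set[int]],
) -> float:
    """Negative log probability mass assigned to the oracle candidates."""
    total = 0.0
    for feats, oracle in zip(kbest_feats, oracles):
        scores = [dot(w, h) for h in feats]
        numerator = logsumexp([scores[k] for k in sorted(oracle)])
        denominator = logsumexp(scores)  # iterates over the whole k-best list
        total -= numerator - denominator
    return total
```

With all-zero weights every candidate is equally likely, so a single oracle in a list of four candidates gives a loss of log 4, and the loss is zero when every candidate is an oracle.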

### 3.3 Risk-Based Loss

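In contrast to softmax loss, which can be viewed as a probabilistic version of zero–one loss, **risk** defines a probabilistic version of the translation error (Smith and Eisner 2006; Zens, Hasan, and Ney 2007; Li and Eisner 2009; He and Deng 2012). Specifically, risk is based on the expected error incurred by a probabilistic model parameterized by ***w***. This combines the advantages of the probabilistic model in softmax loss with the direct consideration of translation accuracy afforded by using error directly. In comparison to error, it also has the advantage of being differentiable, allowing for easier optimization.

Given a scaling parameter γ ≥ 0, risk is defined as the expectation of the sentence-level error err(·) under the model distribution:

$$\ell_{\mathrm{risk}}(F, E, C; \gamma, \mathbf{w}) = \sum_{i=1}^{N} \sum_{\langle \mathbf{e}, \mathbf{d} \rangle \in c^{(i)}} \mathrm{err}(e^{(i)}, \mathbf{e}) \, p_{\gamma, \mathbf{w}}(\mathbf{e}, \mathbf{d} \mid f^{(i)}, c^{(i)})$$

where

$$p_{\gamma, \mathbf{w}}(\mathbf{e}, \mathbf{d} \mid f^{(i)}, c^{(i)}) = \frac{\exp\left(\gamma \mathbf{w}^{\top} \mathbf{h}(f^{(i)}, \mathbf{e}, \mathbf{d})\right)}{\sum_{\langle \mathbf{e}', \mathbf{d}' \rangle \in c^{(i)}} \exp\left(\gamma \mathbf{w}^{\top} \mathbf{h}(f^{(i)}, \mathbf{e}', \mathbf{d}')\right)}$$

When γ = 0, regardless of the parameters ***w***, every hypothesis 〈***e***, ***d***〉 will be assigned a uniform probability, and when γ = 1 the probabilities are equivalent to those in the log-linear model of Equation (19). When γ → ∞, the probability of the highest-scored hypothesis will approach 1, and thus our objective will approach the error defined in Equation (17). This γ can be adjusted in a way that allows for more effective search of the parameter space, as described in more detail in Section 5.5.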
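The effect of γ can be checked numerically with a minimal sketch. It assumes risk is the expected sentence-level error under a distribution proportional to exp(γ ***w***^{⊺}***h***) over each *k*-best list; the feature values and errors are hypothetical.

```python
import math
from typing import List, Sequence


def dot(w: Sequence[float], h: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(w, h))


def risk(
    w: Sequence[float],
    gamma: float,
    kbest_feats: List[List[Sequence[float]]],
    kbest_errs: List[List[float]],
) -> float:
    """Expected sentence-level error under the gamma-scaled distribution."""
    total = 0.0
    for feats, errs in zip(kbest_feats, kbest_errs):
        scaled = [gamma * dot(w, h) for h in feats]
        m = max(scaled)  # subtract the maximum for numerical stability
        weights = [math.exp(s - m) for s in scaled]
        z = sum(weights)
        total += sum(p * e for p, e in zip(weights, errs)) / z
    return total
```

For a single sentence with errors 0.2, 0.5, and 0.8, setting γ = 0 gives the uniform average of 0.5, whereas a very large γ gives approximately the error of the highest-scoring candidate.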

### 3.4 Margin-Based Loss

The zero–one loss in Section 3.1 was based on whether the oracle received a higher score than other hypotheses. The idea of **margin**, which is behind the classification paradigm of **support vector machines** (SVMs) (Joachims 1998), takes this a step further, finding parameters that explicitly maximize the distance, or margin, between correct and incorrect candidates. The main advantage of margin-based methods is that they are able to consider the error function, and often achieve high accuracy. These advantages make margin-based methods perhaps the second most popular loss used in current MT systems after direct minimization of error.

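The margin-based loss is defined as follows: for oracle candidates *o*^{(i)} and non-oracle candidates *c*^{(i)} \ *o*^{(i)}, the margin $\mathbf{w}^{\top} \Delta \mathbf{h}(\cdot)$ between oracle $\mathbf{e}^{*}$ and non-oracle $\mathbf{e}$ should be greater than the difference in the error Δ err(·).^{4} We then define the loss as the total amount that this margin is violated:

$$\ell_{\mathrm{margin}}(F, E, C; \mathbf{w}) = \sum_{i=1}^{N} \sum_{\langle \mathbf{e}^{*}, \mathbf{d}^{*} \rangle \in o^{(i)}} \sum_{\langle \mathbf{e}, \mathbf{d} \rangle \in c^{(i)} \setminus o^{(i)}} \max\left\{ 0, \; \Delta \mathrm{err}(e^{(i)}, \mathbf{e}^{*}, \mathbf{e}) - \mathbf{w}^{\top} \Delta \mathbf{h}(f^{(i)}, \mathbf{e}^{*}, \mathbf{d}^{*}, \mathbf{e}, \mathbf{d}) \right\}$$

where $\Delta \mathrm{err}(e^{(i)}, \mathbf{e}^{*}, \mathbf{e}) = \mathrm{err}(e^{(i)}, \mathbf{e}) - \mathrm{err}(e^{(i)}, \mathbf{e}^{*})$ and $\Delta \mathbf{h}(f^{(i)}, \mathbf{e}^{*}, \mathbf{d}^{*}, \mathbf{e}, \mathbf{d}) = \mathbf{h}(f^{(i)}, \mathbf{e}^{*}, \mathbf{d}^{*}) - \mathbf{h}(f^{(i)}, \mathbf{e}, \mathbf{d})$. In this loss calculation, the number of pairs for each sentence is $|o^{(i)}| \times |c^{(i)} \setminus o^{(i)}|$. Note that here err(·) is not calculated on the corpus level, but on the sentence level, and may not directly correspond to our corpus-level error error(·).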

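A simpler variant that considers only a single pair of candidates for each sentence is the **hinge loss**. If we define $\langle \hat{e}^{(i)}, \hat{d}^{(i)} \rangle$ as the 1-best translation candidate

$$\langle \hat{e}^{(i)}, \hat{d}^{(i)} \rangle = \operatorname*{argmax}_{\langle \mathbf{e}, \mathbf{d} \rangle \in c^{(i)}} \mathbf{w}^{\top} \mathbf{h}(f^{(i)}, \mathbf{e}, \mathbf{d})$$

and $\langle e^{*(i)}, d^{*(i)} \rangle \in o^{(i)}$ as the oracle translation

$$\langle e^{*(i)}, d^{*(i)} \rangle = \operatorname*{argmax}_{\langle \mathbf{e}, \mathbf{d} \rangle \in o^{(i)}} \mathbf{w}^{\top} \mathbf{h}(f^{(i)}, \mathbf{e}, \mathbf{d})$$

the hinge loss can be defined as follows:

$$\ell_{\mathrm{hinge}}(F, E, C; \mathbf{w}) = \sum_{i=1}^{N} \max\left\{ 0, \; \Delta \mathrm{err}(e^{(i)}, e^{*(i)}, \hat{e}^{(i)}) - \mathbf{w}^{\top} \Delta \mathbf{h}(f^{(i)}, e^{*(i)}, d^{*(i)}, \hat{e}^{(i)}, \hat{d}^{(i)}) \right\}$$

A special instance of this hinge loss that is widely used in machine translation, and machine learning in general, is **perceptron loss** (Liang et al. 2006), which further removes the term considering the error, and simply incurs a penalty if the 1-best candidate receives a higher score than the oracle:

$$\ell_{\mathrm{perceptron}}(F, E, C; \mathbf{w}) = \sum_{i=1}^{N} \max\left\{ 0, \; - \mathbf{w}^{\top} \Delta \mathbf{h}(f^{(i)}, e^{*(i)}, d^{*(i)}, \hat{e}^{(i)}, \hat{d}^{(i)}) \right\}$$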
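The relationship between the two losses can be seen in a small per-sentence sketch. It assumes that the 1-best candidate is the highest-scoring member of the whole list and that the oracle is the highest-scoring member of the oracle set; both choices, and the toy data layout, are assumptions made for illustration.

```python
from typing import List, Sequence, Set, Tuple


def dot(w: Sequence[float], h: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(w, h))


def hinge_and_perceptron(
    w: Sequence[float],
    feats: List[Sequence[float]],
    errs: List[float],
    oracle: Set[int],
) -> Tuple[float, float]:
    """Return (hinge loss, perceptron loss) for one sentence."""
    scores = [dot(w, h) for h in feats]
    best = max(range(len(feats)), key=lambda k: scores[k])  # 1-best candidate
    star = max(sorted(oracle), key=lambda k: scores[k])     # best-scoring oracle
    margin = scores[star] - scores[best]                    # w^T delta-h
    delta_err = errs[best] - errs[star]                     # extra error of 1-best
    hinge = max(0.0, delta_err - margin)
    perceptron = max(0.0, -margin)  # hinge without the error term
    return hinge, perceptron
```

When the oracle is already the 1-best candidate both losses are zero; otherwise the hinge loss exceeds the perceptron loss by the difference in error between the two candidates.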

A further extension of margin-based losses is the **relative margin** (Eidelman, Marton, and Resnik 2013). To explain the relative margin, we first define the **worst hypothesis** as

$$\langle \check{e}^{(i)}, \check{d}^{(i)} \rangle = \operatorname*{argmax}_{\langle \mathbf{e}, \mathbf{d} \rangle \in c^{(i)}} \mathrm{err}(e^{(i)}, \mathbf{e})$$

and then calculate the **spread**, which is the difference of errors between the oracle hypothesis and worst hypothesis, $\Delta \mathrm{err}(e^{(i)}, e^{*(i)}, \check{e}^{(i)})$. An additional term can then be added to the objective function to penalize parameter settings with large spreads. The intuition behind the relative margin criterion is that in addition to increasing the margin, considering the spread reduces the variance between the non-oracle hypotheses. Given an identical margin, having a smaller variance indicates that an unseen hypothesis will be less likely to pass over the margin and be misclassified.

### 3.5 Ranking Loss

Ranking loss is based on the **ranking** framework (Herbrich, Graepel, and Obermayer 1999; Freund et al. 2003; Burges et al. 2005; Cao et al. 2007), where, for an arbitrary pair of translation candidates, a binary classifier is trained to distinguish which of the two candidates has the lower error. If a particular pair of candidates in the training data 〈*e*_{k}, *d*_{k}〉 and 〈*e*_{k′}, *d*_{k′}〉 is ranked in the correct order, the following condition is satisfied:

$$\mathrm{err}(e^{(i)}, e_{k}) < \mathrm{err}(e^{(i)}, e_{k'}) \iff \mathbf{w}^{\top} \mathbf{h}(f^{(i)}, e_{k}, d_{k}) > \mathbf{w}^{\top} \mathbf{h}(f^{(i)}, e_{k'}, d_{k'})$$

This can be expressed as

$$\mathrm{err}(e^{(i)}, e_{k}) < \mathrm{err}(e^{(i)}, e_{k'}) \iff \mathbf{w}^{\top} \Delta \mathbf{h}(f^{(i)}, e_{k}, d_{k}, e_{k'}, d_{k'}) > 0$$

where Δ***h***(*f*^{(i)}, *e*_{k}, *d*_{k}, *e*_{k′}, *d*_{k′}) can be treated as training data to be classified using any variety of **binary classifier**. Each binary decision made by this classifier becomes an individual choice, and thus the ranking loss is the sum of these individual losses. As the binary classifier, it is possible to use perceptron, hinge, or softmax losses between the correct and incorrect answers.
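A minimal sketch of this reduction to binary classification is given below, using a hinge-style penalty on each pair. The `margin` parameter and the data layout are assumptions made for illustration; a perceptron-style loss corresponds to `margin=0.0`.

```python
from itertools import combinations
from typing import List, Sequence


def dot(w: Sequence[float], h: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(w, h))


def ranking_loss(
    w: Sequence[float],
    feats: List[Sequence[float]],
    errs: List[float],
    margin: float = 0.0,
) -> float:
    """Sum of pairwise losses over all candidate pairs with different errors."""
    loss = 0.0
    for k, k2 in combinations(range(len(feats)), 2):
        if errs[k] == errs[k2]:
            continue  # no preference between equally good candidates
        better, worse = (k, k2) if errs[k] < errs[k2] else (k2, k)
        # Feature difference for this pair: one binary classification example.
        delta_h = [a - b for a, b in zip(feats[better], feats[worse])]
        loss += max(0.0, margin - dot(w, delta_h))
    return loss
```

A weight vector that orders every pair correctly incurs no loss with a zero margin, whereas reversing the sign of the weights is penalized on every pair.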

It should be noted that standard ranking techniques make a hard decision between candidates with higher and lower error, which can cause problems when the ranking by error does not correlate well with the ranking measured by the model. The **cross-entropy ranking loss** solves this problem by softly fitting the model distribution to the distribution of ranking measured by errors (Green et al. 2014).

### 3.6 Mean Squared Error Loss

The **mean squared error loss** is another method that does not make a hard zero–one decision between the better and worse candidates, but instead attempts to directly estimate the difference in scores (Bazrafshan, Chung, and Gildea 2012). This is done by first finding the difference in errors between the two candidates, $\Delta \mathrm{err}(e^{(i)}, \mathbf{e}^{*}, \mathbf{e})$, and defining the loss as the mean squared error of the difference between the inverse of the difference in the errors and the difference in the model scores.^{5}

## 4. Choosing Oracles

In the previous section, many loss functions used **oracle translations**, which are defined as a set of translations for any sentence that are “good.” Choosing oracle translations is not a trivial task, and in this section we describe the details involved.

### 4.1 Bold vs. Local Updates

In other structured learning tasks such as part-of-speech tagging or parsing, it is common to simply use the correct answer as an oracle. In translation, this is equivalent to optimizing towards an actual human reference, which is called **bold update** (Liang et al. 2006). It should be noted that even if we know the reference ***e***, we still need to obtain a derivation ***d***, and thus it is necessary to perform forced decoding (described in Section 2.4) to obtain this derivation.

However, bold update has a number of practical difficulties. For example, we are not guaranteed that the decoder is able to actually produce the reference (for example, in the case of unknown words), in which case forced decoding will fail. In addition, even if the hypothesis exists in the search space, it might require a large change in parameters ***w*** to ensure that the reference gets a higher score than all other hypotheses. This is true in the case of non-literal translations, for example, which may be producible by the decoder, but only by using a derivation that would normally receive an extremely low probability.

**Local update** is an alternative method that selects an oracle from a set of hypotheses produced during the normal decoding process. The space of hypotheses used to select oracles is usually based on *k*-best lists, but can also include lattices or forests output by the decoder as described in Section 2.4. Because of the previously mentioned difficulties with bold update, it has been empirically observed that local update tends to outperform bold update in online optimization (Liang et al. 2006). However, it also makes it necessary to select oracle translations from a set of imperfect decoder outputs, and we will describe this process in more detail in the following section.

### 4.2 Selecting Oracles and Approximating Corpus-Level Errors

When selecting oracles from the candidate set, we define *o*^{(i)} ⊆ *c*^{(i)} as the set of oracle translations, derivation-translation pairs in *c*^{(i)} that minimize the error function

$$O = \{o^{(i)}\}_{i=1}^{N} = \operatorname*{argmin}_{\{\langle \mathbf{e}^{(i)}, \mathbf{d}^{(i)} \rangle \in c^{(i)}\}_{i=1}^{N}} \mathrm{error}\left(E, \{\mathbf{e}^{(i)}\}_{i=1}^{N}\right)$$

One thing to note here is that error(·) is a corpus-level error function. As mentioned in Section 2.5, evaluation measures for MT can be classified into those that are decomposable on the sentence level, and those that are not. If this error function can be decomposed as the sum of sentence-level errors, such as BLEU+1, choosing the oracle is simple; we simply need to find the set of candidates that have the lowest error independently sentence by sentence.^{6}

However, when using a corpus-level error function we need a slightly more sophisticated method, such as the **greedy method** of Venugopal and Vogel (2005). In this method (Figure 2), the oracle is first initialized either as an empty set or by randomly picking from the candidates. Next, we iterate randomly through the translation candidates in *c*^{(i)}, try replacing the current oracle *o*^{(i)} with the candidate, and check the change in the error function (Line 9), and if the error decreases, replace the oracle with the tested candidate. This process is repeated until there is no change in *O*.
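The greedy procedure can be sketched as follows. This is an illustration under several assumptions rather than a transcription of Figure 2: each candidate is summarized by hypothetical sufficient statistics (matches, length), `corpus_error` is a toy corpus-level measure that does not decompose over sentences, a single oracle is kept per sentence, and the oracle is initialized with the first candidate.

```python
import random
from typing import List, Tuple

Stats = Tuple[int, int]  # (matches, length): hypothetical sufficient statistics


def corpus_error(stats: List[Stats]) -> float:
    """Toy corpus-level error: one minus the ratio of summed statistics."""
    matches = sum(m for m, _ in stats)
    length = sum(n for _, n in stats)
    return 1.0 - matches / length if length else 1.0


def greedy_oracles(kbest_stats: List[List[Stats]], seed: int = 0) -> List[int]:
    """Return one oracle index per sentence, chosen greedily."""
    rng = random.Random(seed)
    oracle = [0] * len(kbest_stats)  # initialize with the first candidate

    def error_of(choice: List[int]) -> float:
        return corpus_error([kbest_stats[i][k] for i, k in enumerate(choice)])

    changed = True
    while changed:  # repeat until no oracle changes
        changed = False
        for i, cands in enumerate(kbest_stats):
            order = list(range(len(cands)))
            rng.shuffle(order)  # visit candidates in random order
            for k in order:
                trial = oracle[:]
                trial[i] = k
                if error_of(trial) < error_of(oracle) - 1e-12:
                    oracle = trial  # keep the replacement if error decreases
                    changed = True
    return oracle
```

As with any greedy search, the result is only guaranteed to be a local optimum of the corpus-level error.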

### 4.3 Selecting Oracles for Margin-Based Methods

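In the hinge loss of Section 3.4, the loss is calculated between the 1-best candidate $\langle \hat{e}^{(i)}, \hat{d}^{(i)} \rangle$ and the oracle $\langle e^{*(i)}, d^{*(i)} \rangle$, but there is no guarantee that $\langle \hat{e}^{(i)}, \hat{d}^{(i)} \rangle$ and $\langle e^{*(i)}, d^{*(i)} \rangle$ form the pair with the minimal margin. Thus, when using margin-based objectives, it is common to modify the criterion for selecting candidates to use in the update as follows (Chiang, Marton, and Resnik 2008; Chiang, Knight, and Wang 2009):

$$\langle \bar{e}^{(i)}, \bar{d}^{(i)} \rangle = \operatorname*{argmax}_{\langle \mathbf{e}, \mathbf{d} \rangle \in c^{(i)}} \mathbf{w}^{\top} \mathbf{h}(f^{(i)}, \mathbf{e}, \mathbf{d}) + \mathrm{err}(e^{(i)}, \mathbf{e}) \tag{35}$$

$$\langle \dot{e}^{(i)}, \dot{d}^{(i)} \rangle = \operatorname*{argmax}_{\langle \mathbf{e}, \mathbf{d} \rangle \in c^{(i)}} \mathbf{w}^{\top} \mathbf{h}(f^{(i)}, \mathbf{e}, \mathbf{d}) - \mathrm{err}(e^{(i)}, \mathbf{e}) \tag{36}$$

Thus, we can replace $\langle \hat{e}^{(i)}, \hat{d}^{(i)} \rangle$ and $\langle e^{*(i)}, d^{*(i)} \rangle$ with $\langle \bar{e}^{(i)}, \bar{d}^{(i)} \rangle$ and $\langle \dot{e}^{(i)}, \dot{d}^{(i)} \rangle$, resulting in a margin of

$$\Delta \mathrm{err}(e^{(i)}, \dot{e}^{(i)}, \bar{e}^{(i)}) - \mathbf{w}^{\top} \Delta \mathbf{h}(f^{(i)}, \dot{e}^{(i)}, \dot{d}^{(i)}, \bar{e}^{(i)}, \bar{d}^{(i)})$$

which is the largest margin in the *k*-best list. Explaining more intuitively, this criterion provides a bias towards selecting hypotheses with high error, making the learning algorithm work harder to correctly classify very bad hypotheses than it does for hypotheses that are only slightly worse than the oracle. Inference methods that consider the loss as in Equations (35) and (36) are called **loss-augmented inference** (Taskar et al. 2005) methods, and can minimize losses with respect to the candidate with the largest violation. Gimpel and Smith (2012) take this a step further, defining a **structured ramp loss** that additionally considers Equations (28) and (29) within this framework.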
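A minimal sketch of this kind of loss-augmented selection over a *k*-best list follows. It assumes the selection adds the sentence-level error to the model score for the candidate to move away from, and subtracts it for the candidate to move towards; the names `hope` and `fear` are informal labels used here only for illustration.

```python
from typing import List, Sequence, Tuple


def dot(w: Sequence[float], h: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(w, h))


def hope_fear(
    w: Sequence[float],
    feats: List[Sequence[float]],
    errs: List[float],
) -> Tuple[int, int, float]:
    """Return (hope index, fear index, margin violation) for one sentence."""
    scores = [dot(w, h) for h in feats]
    indices = range(len(feats))
    # High model score and high error: the candidate to move away from.
    fear = max(indices, key=lambda k: scores[k] + errs[k])
    # High model score and low error: the candidate to move towards.
    hope = max(indices, key=lambda k: scores[k] - errs[k])
    delta_err = errs[fear] - errs[hope]
    margin = scores[hope] - scores[fear]
    return hope, fear, max(0.0, delta_err - margin)
```

In the toy test case used to check this sketch, the candidate with error 0.5 is selected as the one to move away from even though a candidate with error 1.0 exists, because the latter already receives a low model score.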

## 5. Batch Methods

Now that we have explained the details of calculating loss functions used in machine translation, we turn to the actual algorithms used in optimizing using these loss functions. In this section, we cover **batch learning** approaches to MT optimization. Batch learning works by considering the entire training data on every update of the parameters, in contrast to online learning (covered in the following section), which considers only part of the data at any one time. In standard approaches to batch learning, for every training example 〈*f*^{(i)}, *e*^{(i)}〉 we enumerate every translation and derivation in the respective sets of possible translations and derivations, and attempt to adjust the parameters so that we can achieve the translations with the lowest error for the entire data.

However, as mentioned previously, the entire space of derivations is too large to handle in practice. To resolve this problem, most batch learning algorithms for MT follow the general procedure shown in Figure 3, performing iterations that alternate between decoding and optimization (Och and Ney 2002). In line 6, GEN(*f*^{(i)}, *w*^{(t)}) indicates that we use the current parameters *w*^{(t)} to perform decoding of sentence *f*^{(i)}, and obtain a subset of all derivations. For convenience, we will assume that this subset is expressed using a *k*-best list *kbest*^{(i)}, but it is also possible to use lattices or forests, as explained in Section 2.4.

A *k*-best list with scores for each hypothesis can be used as an approximation for the distribution over potential translations of *f*^{(i)} according to the parameters ***w***. However, because the size of the *k*-best list is limited, and the presence of search errors in decoding means that we are not even guaranteed to find the highest-scoring hypotheses, this approximation is far from perfect. The effect of this approximation is particularly obvious if the lack of coverage of the *k*-best list is systematic. For example, if the hypotheses in the *k*-best list are all much too short, optimization may attempt to fix this by adjusting the parameters to heavily favor very long hypotheses, far overshooting the actual optimal parameters.^{7}

As a way to alleviate the problems caused by this approximation, in line 7 we merge the *k*-best lists from multiple decoding iterations, finding a larger and more accurate set *C* of derivations. Given *C* and the training data 〈*F*, *E*〉, we perform minimization of the Ω(***w***)-regularized loss function ℓ(·) and obtain new parameters *w*^{(t+1)} (line 9). Generation of *k*-best lists and optimization is performed until a hard limit of *T* iterations is reached, or until training has converged. In this setting, usually convergence is defined as any iteration in which the merged *k*-best list does not change, or when the parameters ***w*** do not change (Och 2003).

Within this batch optimization framework, the most critical challenge is to find an effective way to solve the optimization problem in line 9 of Figure 3. Section 5.1 describes methods for directly optimizing the error function. There are also methods for optimizing other losses such as those based on probabilistic models (Section 5.2), error margins (Section 5.3), ranking (Section 5.4), and risk (Section 5.5).
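The alternation between decoding and optimization described above can be sketched as a short loop. This is an illustration under stated assumptions, not a transcription of Figure 3: `decode_kbest` and `optimize` are hypothetical callables standing in for the decoder and for whichever loss minimizer is chosen, and candidates are assumed to be hashable so that *k*-best lists can be merged as sets.

```python
from typing import Callable, Hashable, Iterable, List, Sequence, Set, Tuple

Weights = Tuple[float, ...]


def batch_optimize(
    sources: Sequence[str],
    decode_kbest: Callable[[str, Weights], Iterable[Hashable]],
    optimize: Callable[[List[Set[Hashable]], Weights], Sequence[float]],
    w0: Sequence[float],
    max_iters: int = 10,
) -> Tuple[Weights, List[Set[Hashable]]]:
    """Alternate decoding and optimization until convergence or max_iters."""
    w: Weights = tuple(w0)
    merged: List[Set[Hashable]] = [set() for _ in sources]
    for _ in range(max_iters):
        grew = False
        for i, f in enumerate(sources):
            # Decode with the current weights and merge in unseen candidates.
            new = set(decode_kbest(f, w)) - merged[i]
            if new:
                merged[i] |= new
                grew = True
        if not grew:
            break  # merged k-best lists did not change: converged
        w_new = tuple(optimize(merged, w))
        if w_new == w:
            break  # parameters did not change: converged
        w = w_new
    return w, merged
```

Separating the decoder and the optimizer behind two callables mirrors the practical advantage of batch methods noted later in this section, namely that the two components can be implemented independently.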

### 5.1 Error Minimization

#### 5.1.1 Minimum Error Rate Training Overview

Minimum error rate training (MERT) (Och 2003) is one of the first, and currently the most widely used, methods for MT optimization, and focuses mainly on direct minimization of the error described in Section 3.1. Because error is not continuously differentiable, MERT uses optimization methods that do not require the calculation of a gradient, such as iterative line search inspired by **Powell's method** (Och 2003; Press et al. 2007), or the **Downhill-Simplex method** (**Nelder-Mead method**) (Press et al. 2007; Zens, Hasan, and Ney 2007; Zhao and Chen 2009).

The algorithm for MERT using line search is shown in Figure 4. Here, we assume that ***w*** and ***h***(·) are *M*-dimensional, and *b*^{m} is an *M*-dimensional vector where the *m*-th element is 1 and the rest of the elements are zero. For the *T* iterations, we decide the dimension *m* of the feature vector (line 6), and for each possible weight vector *w*^{(j)} + γ*b*^{m} choose the γ ∈ ℝ that minimizes ℓ_{error}(·) using **line search** (line 7). Then, among the γ for each of the *M* search dimensions, we perform an update using the one that affords the largest reduction in error (lines 9 and 10). This algorithm can be deemed a variety of **steepest descent**, which is a standard method used in most implementations of MERT (Koehn et al. 2007). Another alternative is a variant of **coordinate descent** (e.g., Powell's method), in which search and update is performed in each dimension.

One feature of MERT is that it is known to easily fall into local optima of the error function. Because of this, it is standard to choose *R* starting points (line 4), perform optimization starting at each of these starting points, and finally choose the ***w*** that minimizes the loss from the weights acquired from each of the *R* random restarts. The *R* starting points are generally chosen so that one of the points is the best ***w*** from the previous iteration, and the remaining *R* − 1 have each element of ***w*** chosen randomly and uniformly from some interval, although it has also been shown that more intelligent choice of initial points can result in better final scores (Moore and Quirk 2008).

#### 5.1.2 Line Search for MERT

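In the line search step of MERT, for a fixed search direction *b*^{m} we would like to find the γ for which the 1-best candidates chosen from each *c*^{(i)} achieve the lowest error. In order to do so, MERT uses an algorithm that allows for exact enumeration of which of the *K* candidates in *c*^{(i)} will be chosen for each value of γ. Concretely, we define

$$\langle \hat{e}^{(i)}, \hat{d}^{(i)} \rangle = \operatorname*{argmax}_{\langle \mathbf{e}, \mathbf{d} \rangle \in c^{(i)}} \left(\mathbf{w}^{(j)} + \gamma b^{m}\right)^{\top} \mathbf{h}(f^{(i)}, \mathbf{e}, \mathbf{d}) = \operatorname*{argmax}_{\langle \mathbf{e}, \mathbf{d} \rangle \in c^{(i)}} a(f^{(i)}, \mathbf{e}, \mathbf{d}) + \gamma \cdot b(f^{(i)}, \mathbf{e}, \mathbf{d}) \tag{40}$$

where each hypothesis 〈***e***, ***d***〉 in *c*^{(i)} of Equation (40) is expressed as a line with intercept $a(f^{(i)}, \mathbf{e}, \mathbf{d}) = \mathbf{w}^{(j)\top} \mathbf{h}(f^{(i)}, \mathbf{e}, \mathbf{d})$ and slope $b(f^{(i)}, \mathbf{e}, \mathbf{d}) = h_{m}(f^{(i)}, \mathbf{e}, \mathbf{d})$ with γ as a parameter. Equation (40) is a function that returns the translation candidate with the highest score. We can define a function *g*(γ; *f*^{(i)}) that corresponds to the score of this highest-scoring candidate as follows:

$$g(\gamma; f^{(i)}) = \max_{\langle \mathbf{e}, \mathbf{d} \rangle \in c^{(i)}} a(f^{(i)}, \mathbf{e}, \mathbf{d}) + \gamma \cdot b(f^{(i)}, \mathbf{e}, \mathbf{d}) \tag{41}$$

We can see that Equation (41) is a **piecewise linear** function (Papineni 1999; Och 2003), as at any given γ ∈ ℝ the translation candidate with the highest score *a*(·) + γ · *b*(·) will be selected, and this score corresponds to the line that is in the highest position at that particular γ. In Figure 5, we show an example with four translation candidates. If we set γ to a very small value such as −∞, the candidate with the smallest slope will be chosen. Furthermore, if we make γ gradually larger, we will see that this candidate continues to be the highest scoring candidate until we reach its intersection with the next line in the envelope. For two lines with intercepts $a_1$, $a_2$ and slopes $b_1 < b_2$, this intersection lies at

$$\gamma = \frac{a_1 - a_2}{b_2 - b_1}$$

after which the second candidate will be the highest scoring candidate. If we continue increasing γ, we will continue by selecting further candidates starting at their corresponding intersections.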

A function like Equation (41) that chooses the highest-scoring line for each span over γ is called an **envelope**, and can be used to compactly express the results we will obtain by rescoring *c*^{(i)} according to a particular γ (Figure 6a). After finding the envelope, for each line that participates in the envelope, we can calculate the sufficient statistics necessary for calculating the loss ℓ_{error}(·) and error error(·). For example, given the envelope in Figure 6a, Figure 6b is an example of the sentence-wise loss with respect to γ.

The envelope shown in Equation (41) can also be viewed as the problem of finding a **convex hull** in computational geometry. A standard and efficient algorithm for finding a convex hull of multiple lines is the **sweep line algorithm** (Bentley and Ottmann 1979; Macherey et al. 2008) (see Figure 7). Here, we assume *L* is a set of the lines corresponding to the *K* translation candidates in *c*^{(i)}, and each line *l* ∈ *L* is expressed as 〈*a*(*l*), *b*(*l*), γ(*l*)〉 with intercept *a*(*l*) = *a*(*f*^{(i)}, ***e***, ***d***) and slope *b*(*l*) = *b*(*f*^{(i)}, ***e***, ***d***). Furthermore, we define γ(*l*) as an intersection initialized to −∞. SortLines(*L*) in Figure 7 sorts the lines in the order of their slope *b*(*l*), and if two lines *l*_{k1} and *l*_{k2} have the same slope, chooses the one with the larger intercept, *a*(*l*_{k1}) > *a*(*l*_{k2}), and deletes the other. We next process the sorted set of lines *L*′ (|*L*′| ≤ *K*) in order of ascending slope (lines 4–18). If we assume *H* is the envelope expressed as the set of lines it contains, we find the line that intersects with the line under consideration at the highest point (lines 6–12), and update the envelope *H*. As *L* contains at most *K* lines, *H*'s size is also at most *K*.
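The envelope computation can be sketched compactly. The following is one possible stack-based implementation of the sort-and-sweep idea described above, written under the assumption that each candidate is given as an (intercept, slope) pair; it is not a line-by-line transcription of Figure 7.

```python
from typing import Dict, List, Tuple

Line = Tuple[float, float]                 # (intercept a, slope b)
EnvelopeLine = Tuple[float, float, float]  # (a, b, gamma where the line starts)


def upper_envelope(lines: List[Line]) -> List[EnvelopeLine]:
    """Lines on the upper envelope, ordered by increasing gamma."""
    # Among lines with the same slope, keep only the largest intercept.
    best: Dict[float, float] = {}
    for a, b in lines:
        if b not in best or a > best[b]:
            best[b] = a
    hull: List[EnvelopeLine] = []
    for b in sorted(best):  # process in order of ascending slope
        a = best[b]
        start = float("-inf")
        while hull:
            a0, b0, g0 = hull[-1]
            x = (a0 - a) / (b - b0)  # gamma where the new line overtakes the top
            if x <= g0:
                hull.pop()  # the top line is never the highest: discard it
            else:
                start = x
                break
        hull.append((a, b, start))
    return hull
```

Each line is pushed and popped at most once, so after sorting the sweep itself takes time linear in the number of candidates.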

Given a particular input sentence *f*^{(i)}, its set of translation candidates *c*^{(i)}, and the resulting envelope *H*^{(i)}, we can also define the set of intersections between adjacent lines in the envelope, which divide the range of γ into spans. For each intersection, we also define the change in the loss function that occurs when we move from one span to the next. If we first calculate the loss incurred when setting γ = −∞, then process the spans in increasing order, keeping track of the difference incurred at each span boundary, it is possible to efficiently calculate the loss curve over all spans of γ.

In addition, whereas all explanation of line search to this point has focused on the procedure for a single sentence, by calculating the envelopes for each sentence in the data 1 ≤ *i* ≤ *N*, and combining these envelopes into a single plane, it is relatively simple to perform this processing on the corpus level as well. It should be noted that for corpus-based evaluation measures such as BLEU, when performing corpus-level processing, we do not keep track of the change in the loss, but the change in the sufficient statistics required to calculate the loss for each sentence. In the case of BLEU, the sufficient statistics amount to *n*-gram counts *c*_{n}, *n*-gram matches *m*_{n}, and reference lengths *r*. We then calculate the loss curve ℓ_{error}(·) for the entire corpus based on these sufficient statistics, and find a γ that minimizes Equation (17) based on this curve. By repeating this line search for each parameter until we can no longer obtain a decrease, it is possible to find a local minimum in the loss function, even for non-convex or non-differentiable functions.

#### 5.1.3 MERT's Weaknesses and Extensions

Although MERT is widely used as the standard optimization procedure for MT, it also has a number of weaknesses, and a number of extensions to the MERT framework have been proposed to resolve these problems.

The first weakness of MERT is the **randomness** in the optimization process. Because each iteration of the training algorithm generally involves a number of random restarts, the results will generally change over multiple training runs, with the changes often being quite significant. Some research has shown that this randomness can be stabilized somewhat by improving the ability of the line-search algorithm to find a globally good solution by choosing random seeds more intelligently (Moore and Quirk 2008; Foster and Kuhn 2009) or by searching in directions that consider multiple features at once, instead of using the simple coordinate ascent as described in Figure 4 (Cer, Jurafsky, and Manning 2008). Orthogonally to actual improvement of the results, Clark et al. (2011) suggest that because randomness is a fundamental feature of MERT and other optimization algorithms for MT, it is better experimental practice to perform optimization multiple times, and report the resulting means and standard deviations over various optimization runs.

It is also possible to optimize the MERT objective using other optimization algorithms. For example, Suzuki, Duh, and Nagata (2011) present a method for using **particle swarm optimization**, a distributed algorithm where many “particles” are each associated with a parameter vector, and the particle updates its vector in a way such that it moves towards the current local and global optima. Another alternative optimization algorithm is Galley and Quirk's (2011) method for using **linear programming** to perform search for optimal parameters over more than one dimension, or all dimensions at a single time. However, as MERT remains a fundamentally computationally hard problem, this method takes large amounts of time for larger training sets or feature spaces.

It should be noted that instability in MERT is not entirely due to the fact that search is random, but also due to the fact that *k*-best lists are poor approximations of the whole space of possible translations. One way to improve this approximation is by performing MERT over an exponentially large number of hypotheses encoded in a translation lattice (Macherey et al. 2008) or hypergraph (Kumar et al. 2009). It is possible to perform MERT over these sorts of packed data structures by observing the fact that the envelopes used in MERT can be expressed as a **semiring** (Dyer 2010a; Sokolov and Yvon 2011), allowing for exact calculation of the full envelope for all hypotheses in a lattice or hypergraph using polynomial-time dynamic programming (the **forward algorithm** or **inside algorithm**, respectively). There has also been work to improve the accuracy of the *k*-best approximation by either sampling *k*-best candidates from the translation lattice (Chatterjee and Cancedda 2010), or performing forced decoding to find derivations that achieve the reference translation, and adding them to the *k*-best list (Liang, Zhang, and Zhao 2012).

The second weakness of MERT is that it has no concept of **regularization**, causing it to overfit the training data if there are too many features, and there have been several attempts to incorporate regularization to ameliorate this problem. Cer, Jurafsky, and Manning (2008) propose a method to incorporate regularization by not choosing the plateau in the loss curve that minimizes the loss itself, but choosing the point considering the loss values for a few surrounding plateaus, helping to avoid points that have a low loss but are surrounded by plateaus with higher loss. It is also possible to incorporate regularization into MERT-style line search using an SVM-inspired margin-based objective (Hayashi et al. 2009) or by using scale-invariant regularization methods such as *L*_{0} or a scaled version of *L*_{2} (Galley et al. 2013).

The final weakness of MERT is that it has computational problems when scaling to **large numbers of features**. When using only a standard set of 20 or so features, MERT is able to perform training in reasonable time, but the number of line searches, and thus time, required in the algorithm of Figure 4 scales linearly with the number of features. Thus training of hundreds of features is time-consuming, and there are no published results training standard MERT on thousands or millions of features. It should be noted, however, that Galley et al. (2013) report results for thousands of features by choosing intelligent search directions by calculating the gradient of expected BLEU, as explained in Section 5.5.2.

### 5.2 Gradient-Based Batch Optimization

In the previous section, MERT optimized a loss function that was exactly equivalent to the error function, which is not continuously differentiable and thus precludes the use of standard convex optimization algorithms used in other optimization problems. In contrast, other losses such as the softmax loss described in Section 3.2 and risk-based losses described in Section 3.3 are differentiable, allowing for the use of these algorithms for MT optimization (Smith and Eisner 2006; Blunsom and Osborne 2008).

Convex optimization is well covered in the standard machine learning literature, so we do not cover it in depth, but methods such as **conjugate gradient** (using first-order statistics) (Nocedal and Wright 2006) and the **limited-memory Broyden–Fletcher–Goldfarb–Shanno** method (using second-order statistics) (Liu and Nocedal 1989) are standard options for optimizing these losses. These methods are equally applicable when the loss is combined with a differentiable regularizer Ω(***w***), such as *L*_{2} regularization. Using a non-differentiable regularizer such as *L*_{1} makes optimization more difficult, but can be handled by other algorithms such as **orthant-wise limited-memory quasi-Newton** (Andrew and Gao 2007).

In addition to the function being differentiable, if it is also convex we can be guaranteed that these algorithms will not get stuck in local optima and instead they will reach a globally optimal solution. In general, the softmax objective is convex if there is only one element in the oracle set *o*^{(i)}, and not necessarily convex if there are multiple oracles. In the case of MT, as there are usually multiple translations ***e*** that minimize error(·), and multiple derivations ***d*** that result in the same translation ***e***, *o*^{(i)} will generally contain multiple members. Thus, we cannot be entirely certain that we will reach a global optimum.

### 5.3 Margin-Based Optimization

Minimizing the margin-based loss described in Section 3.4, possibly with the addition of a regularizer, is also a relatively standard problem in the machine learning literature. Methods to solve Equation (25) include **sequential minimal optimization** (Platt 1999), **dual coordinate descent** (Hsieh et al. 2008), as well as the **quadratic program** solvers used in standard SVMs (Joachims 1998).

There have also been several attempts to apply the margin-based online learning algorithms explained in Section 6.3 in a batch setting, where the whole training corpus is decoded before each iteration of optimization (Cherry and Foster 2012; Gimpel and Smith 2012). We will explain these methods in more detail later, but the advantage of using them in a batch setting mainly lies in simplicity; for online learning it is often necessary to directly implement the optimization procedure within the decoder, whereas in a batch setting the implementation of the decoding and optimization algorithm can be performed separately.

### 5.4 Ranking and Linear Regression Optimization

The rank-based loss described in Section 3.5 is essentially the combination of multiple losses over binary decisions. These binary decisions can be solved using gradient-based or margin-based methods, and thus optimization itself can be solved with the algorithms described in the previous two sections. However, one important concern in this setting is training time. In the worst case, the number of pairwise comparisons for any particular *k*-best list is *k*(*k* − 1)/2, leading to unmanageably large amounts of time required for training.

One way to alleviate this problem is by randomly sampling a small number of these *k*(*k* − 1)/2 pairs of hypotheses for use in optimization, which has been shown empirically to allow for increases in training speed without decreases in accuracy. For example, Hopkins and May (2011) describe a method dubbed **pairwise ranking optimization** that selects 5,000 pairs randomly for each sentence, and among these random pairs uses the 50 with the largest difference in error for training the classifier. Other selection heuristics are also possible and potentially increase accuracy, for example, avoiding training on candidate pairs with overly different scores (Nakov, Guzmán, and Vogel 2013), or performing Monte Carlo sampling (Roth et al. 2010; Haddow, Arun, and Koehn 2011). Recently, a method has also been proposed that uses an efficient ranking SVM formulation that alleviates the need for this sampling and explicitly performs ranking over all pairs (Dreyer and Dong 2015).
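The sampling step can be illustrated with a short sketch. It is not the authors' implementation: the function name `sample_pro_pairs`, the `errors` input (one sentence-level error per candidate), and the choice to collapse duplicate pairs are all assumptions made for illustration.

```python
import random


def sample_pro_pairs(errors, n_samples=5000, n_keep=50, seed=0):
    """Return (better, worse) index pairs for one k-best list.

    errors[j] is the sentence-level error of candidate j (hypothetical input).
    """
    rng = random.Random(seed)
    k = len(errors)
    gaps = {}  # (better, worse) -> difference in error; duplicates collapse
    for _ in range(n_samples):
        a, b = rng.randrange(k), rng.randrange(k)
        if errors[a] == errors[b]:
            continue  # ties (including a == b) carry no ranking signal
        better, worse = (a, b) if errors[a] < errors[b] else (b, a)
        gaps[(better, worse)] = errors[worse] - errors[better]
    # Keep the sampled pairs whose difference in error is largest.
    ranked = sorted(gaps, key=gaps.get, reverse=True)
    return ranked[:n_keep]
```

Each returned pair would then become one binary training example for the classifier, with the feature difference between the two candidates as input.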

The mean squared error loss described in Section 3.6, which is similar to ranking loss in that it will prefer a proper ordering of the *k*-best list, is much easier to optimize. This loss can be minimized using standard techniques for solving least-squared-error linear regression (Press et al. 2007).

### 5.5 Risk Minimization

We define the entropy of *p*_{γ,w}(·) as

$$H(p_{\gamma,w}) = -\sum_{i} \sum_{\langle e,d \rangle \in c^{(i)}} p_{\gamma,w}(e,d \mid f^{(i)}) \log p_{\gamma,w}(e,d \mid f^{(i)})$$

When γ takes a small value this entropy will be high, indicating that the loss function is relatively smooth and less sensitive to local optima. Conversely, when γ → ∞, the entropy becomes lower, and the loss function becomes more peaky with more local optima. It has been noted that this fact can be used for effective optimization through the process of **deterministic annealing** (Sindhwani, Keerthi, and Chapelle 2006). In deterministic annealing, the parameter γ is not set as a hyperparameter, and instead the entropy *H*(*p*_{γ,w}) is directly used as a regularization function during the optimization process (Smith and Eisner 2006; Li and Eisner 2009):

$$\min_{\gamma, w} \; \sum_{i} \sum_{\langle e,d \rangle \in c^{(i)}} \mathrm{err}(e^{(i)}, e)\, p_{\gamma,w}(e,d \mid f^{(i)}) \; - \; T \cdot H(p_{\gamma,w}) \tag{45}$$

In Equation (45), *T* is the **temperature**, which can either be set as a hyperparameter, or gradually decreased from ∞ to −∞ (or 0) through a process of **cooling** (Smith and Eisner 2006). The motivation for cooling is that if we start with a large *T*, the earlier steps using a smoother function will allow us to approach the global optimum, and the later steps will allow us to approach the actual error function.
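The relationship between the scaling factor γ, the entropy, and the temperature can be checked numerically on a single *k*-best list. The sketch below assumes that the annealed objective is the expected error minus *T* times the entropy; the function names and the toy inputs are hypothetical.

```python
import math


def scaled_posterior(scores, gamma):
    """Softmax over model scores scaled by gamma (numerically stabilized)."""
    top = max(gamma * s for s in scores)
    exps = [math.exp(gamma * s - top) for s in scores]
    z = sum(exps)
    return [x / z for x in exps]


def entropy(probs):
    """Shannon entropy of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def annealed_objective(scores, errors, gamma, temperature):
    """Expected error minus temperature times entropy (assumed form)."""
    probs = scaled_posterior(scores, gamma)
    risk = sum(p * e for p, e in zip(probs, errors))
    return risk - temperature * entropy(probs)
```

With a large temperature the entropy term dominates and flat distributions are preferred; as the temperature approaches 0 the objective reduces to the plain expected error.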

It should be noted that in Equation (24), and the discussion up to this point, we have been using not the corpus-based error, but the sentence-based error err(*e*^{(i)}, ***e***). There have also been attempts to make the risk minimization framework applicable to corpus-level error error(·), specifically BLEU. We will discuss two such methods.

#### 5.5.1 Linear BLEU

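**Linear BLEU** (Tromble et al. 2008) provides an approximation for corpus-level BLEU that can be divided among sentences. Linear BLEU uses a **Taylor expansion** to approximate the effect that the sufficient statistics of any particular sentence will have on corpus-level BLEU. We define *r* as the total length of the reference translations, *c* as the total length of the candidates, and *c*_{n} and *m*_{n} (1 ≤ *n* ≤ 4) as the translation candidate's number of *n*-grams, and number of *n*-grams that match the reference, respectively. Taking the equation for corpus-level BLEU (Papineni et al. 2002) and assuming that the *n*-gram counts are approximately equal for 1 ≤ *n* ≤ 4, we get the following approximation:

$$\log \mathrm{BLEU} = \min\left\{0, 1 - \frac{r}{c}\right\} + \frac{1}{4}\sum_{n=1}^{4} \log \frac{m_n}{c_n} \approx \min\left\{0, 1 - \frac{r}{c}\right\} + \frac{1}{4}\sum_{n=1}^{4} \log \frac{m_n}{c}$$

If we assume that when we add the sufficient statistics of a particular sentence ***e***, the corpus-level statistics change to *r*′, *c*′, *c*′_{n}, and *m*′_{n}, then we can express the change in BLEU in the logarithm domain as follows:

$$\Delta \log \mathrm{BLEU} = \log \mathrm{BLEU}' - \log \mathrm{BLEU}$$

If we make the assumption that there is no change in the brevity penalty, Δ log BLEU relies solely on *m*_{n} and *c*. Δ log BLEU can then be approximated using a first-order Taylor expansion as follows:

$$\Delta \log \mathrm{BLEU} \approx (c' - c) \left.\frac{\partial \log \mathrm{BLEU}'}{\partial c'}\right|_{c'=c} + \sum_{n=1}^{4} (m'_n - m_n) \left.\frac{\partial \log \mathrm{BLEU}'}{\partial m'_n}\right|_{m'_n=m_n} = -\frac{c' - c}{c} + \frac{1}{4}\sum_{n=1}^{4} \frac{m'_n - m_n}{m_n}$$

As *c*′ − *c* is the length of ***e***, and *m*′_{n} − *m*_{n} is the number of *n*-grams (*g*_{n}) in ***e*** that match the *n*-grams in *e*^{(i)}, the sentence-level error function err_{lBLEU}(·) for linear BLEU is

$$\mathrm{err}_{\mathrm{lBLEU}}(e^{(i)}, e) = \frac{|e|}{c} - \frac{1}{4}\sum_{n=1}^{4} \frac{1}{m_n} \sum_{g_n \in e} \delta(g_n \in e^{(i)})$$

where δ(·) is 1 if its argument holds and 0 otherwise, and *c* and *m*_{n} are set to a fixed value (Tromble et al. 2008). For example, in the batch optimization algorithm of Figure 3 they can be calculated based on the *k*-best list generated prior to optimization.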
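As a rough illustration, a linear-BLEU-style sentence error can be computed from the candidate length and its *n*-gram matches once the corpus-level constants are fixed. The sketch below assumes the form |***e***|/*c* minus one quarter of the sum over *n* of matches divided by *m*_{n}, and uses clipped match counts; the function names and the constants passed in are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def linear_bleu_error(reference, candidate, c, m, max_n=4):
    """Sentence-level error under a linear approximation of log BLEU.

    c is the fixed corpus-level candidate length and m[n - 1] the fixed
    corpus-level number of matched n-grams (both assumed precomputed).
    """
    err = len(candidate) / c
    for n in range(1, max_n + 1):
        ref_counts = Counter(ngrams(reference, n))
        cand_counts = Counter(ngrams(candidate, n))
        # Clipped matches: a reference n-gram can be matched only as
        # often as it occurs in the reference (an assumption here).
        matches = sum(min(cnt, ref_counts[g]) for g, cnt in cand_counts.items())
        err -= matches / (max_n * m[n - 1])
    return err
```

Because the error is a sum of per-sentence terms with fixed coefficients, it decomposes over sentences and can be plugged into the expected-error computation directly.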

#### 5.5.2 Expectations of Sufficient Statistics

DeNero, Chiang, and Knight (2009) present an alternative method that calculates not the expectation of the error itself, but the expectation of the **sufficient statistics** used in calculating the error. In contrast to sentence-level approximations or formulations such as linear BLEU, the expectation of the sufficient statistics can be calculated directly on the corpus level. Because of this, by maximizing the evaluation derived by these expected statistics, it is possible to directly optimize for a corpus-level error, in a manner similar to MERT (Pauls, DeNero, and Klein 2009).

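When this approach is applied to BLEU, the resulting measure is called **xBLEU** (Rosti et al. 2010, 2011), and the required sufficient statistics include *n*-gram counts and matched *n*-gram counts. We define the *k*th translation candidate in *c*^{(i)} as 〈*e*_{k}, *d*_{k}〉, its score as *s*_{i,k} = γ***w***^{⊺}***h***(*f*^{(i)}, *e*_{k}, *d*_{k}), and the probability in Equation (22) as

$$p_{i,k} = \frac{\exp(s_{i,k})}{\sum_{k'} \exp(s_{i,k'})}$$

Next, we define the expectation of the *n*-gram (*g*_{n} ∈ *e*_{k}) frequency as *c*_{n,i,k}, the expectation of the number of *n*-gram matches as *m*_{n,i,k}, and the expectation of the reference length as *r*_{i,k}. These values can be calculated as:

$$c_{n,i,k} = |\{g_n \in e_k\}| \cdot p_{i,k}, \qquad m_{n,i,k} = |\{g_n \in e_k \wedge g_n \in e^{(i)}\}| \cdot p_{i,k}, \qquad r_{i,k} = |e^{(i)}| \cdot p_{i,k}$$

It should be noted that although these equations apply to *k*-best lists, it is also possible to calculate statistics over lattices or forests using dynamic programming algorithms and tools such as the **expectation semiring** (Eisner 2002; Li and Eisner 2009).

Using these statistics, xBLEU is defined as

$$\mathrm{xBLEU} = \prod_{n=1}^{4} \left( \frac{\sum_{i}\sum_{k} m_{n,i,k}}{\sum_{i}\sum_{k} c_{n,i,k}} \right)^{1/4} \cdot \phi\left( 1 - \frac{\sum_{i}\sum_{k} r_{i,k}}{\sum_{i}\sum_{k} c_{1,i,k}} \right)$$

where φ(*x*) is the brevity penalty. Compared to the risk minimization in Equation (24), we define our optimization problem as the maximization of xBLEU:

$$\hat{w} = \arg\max_{w} \; \mathrm{xBLEU}$$

It is possible to calculate a gradient for xBLEU, allowing for optimization using gradient-based optimization methods, and we explain the full (somewhat involved) derivation in Appendix A.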
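The following sketch computes a BLEU score from expected sufficient statistics over *k*-best lists. It assumes clipped *n*-gram matches and the standard, non-smoothed brevity penalty min(1, exp(*x*)); a differentiable approximation of the penalty would be needed for gradient-based training. The function names and the data layout are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def xbleu(kbest_lists, references, weights, gamma=1.0, max_n=4):
    """BLEU over expected statistics.

    kbest_lists[i] is a list of (tokens, feature_vector) candidates for
    sentence i, and references[i] is its reference token list.
    """
    exp_c = [0.0] * max_n  # expected n-gram counts
    exp_m = [0.0] * max_n  # expected matched n-gram counts
    exp_r = 0.0            # expected reference length
    for cands, ref in zip(kbest_lists, references):
        scores = [gamma * sum(w * h for w, h in zip(weights, feats))
                  for _, feats in cands]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        z = sum(exps)
        for (tokens, _), e in zip(cands, exps):
            p = e / z  # posterior probability of this candidate
            exp_r += p * len(ref)
            for n in range(1, max_n + 1):
                cand = ngram_counts(tokens, n)
                refc = ngram_counts(ref, n)
                exp_c[n - 1] += p * sum(cand.values())
                exp_m[n - 1] += p * sum(min(v, refc[g]) for g, v in cand.items())
    if min(exp_m) <= 0.0:
        return 0.0
    log_prec = sum(math.log(m / c) for m, c in zip(exp_m, exp_c)) / max_n
    brevity = min(1.0, math.exp(1.0 - exp_r / exp_c[0]))
    return brevity * math.exp(log_prec)
```

Shifting probability mass toward candidates with more matches raises the score, which is the signal a gradient-based optimizer would follow.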

## 6. Online Methods

In the batch learning methods of Section 5, the steps of decoding and optimization are performed sequentially over the entire training data. In contrast, **online learning** performs updates not after the whole corpus has been processed, but over smaller subsets of the training data termed **mini-batches**. One of the major advantages of online methods is that updates are performed on a much more fine-grained basis; as a result, online methods often converge faster than batch methods, particularly on larger data sets. On the other hand, online methods have the disadvantage of being harder to implement (they often must be implemented inside the decoder, whereas batch methods can be separate), and also generally being less stable (with sensitivity to the order in which the training data is processed or other factors).

In the online learning algorithm in Figure 8, from the training data 〈*F*, *E*〉 we first randomly choose a mini-batch consisting of *K* sentences of parallel data (line 4). We then decode each source sentence of the mini-batch and generate a *k*-best list (line 7), which is used in optimization (line 9). In contrast to the batch learning algorithm in Figure 3, we do not merge the *k*-bests from previous iterations. In addition, optimization is performed not over the entire data, but only over the mini-batch and its corresponding *k*-best lists. Like batch learning, within the online learning framework, there are a number of optimization algorithms and objective functions that can be used.

The first thing we must consider during online learning is that because we only optimize over the data in the mini-batch, it is not possible to directly optimize a corpus-level evaluation measure such as BLEU, and it is necessary to define an error function that is compatible with the learning framework (see Section 6.1). Once the error has been set, we can perform parameter updates according to a number of different algorithms including the perceptron (Section 6.2), MIRA (Section 6.3), AROW (Section 6.4), and stochastic gradient descent (SGD) (Section 6.5).

### 6.1 Approximating the Error

In online learning, parameters are updated not with respect to the entire training corpus, but with respect to a subset of data sampled from the corpus. This has consequences for the calculation of translation quality when using a corpus-level evaluation measure such as BLEU. For example, when choosing an oracle for oracle-based optimization methods, the oracles chosen when considering the entire corpus will be different from the oracles chosen when considering a mini-batch. In general, the amount of difference between the corpus-level and mini-batch level oracles will vary depending on the size of a mini-batch, with larger mini-batches providing a better approximation (Tan et al. 2013; Watanabe 2012). Thus, when using smaller batches, especially single sentences, it is necessary to use methods to approximate the corpus-level error function as covered in the next two sections.

#### 6.1.1 Approximation with a Pseudo-Corpus

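The first method approximates the corpus-level error by maintaining a **pseudo-corpus**, and using it to augment the statistics used in the mini-batch error calculation (Watanabe et al. 2007). Specifically, given the training data 〈*F*, *E*〉 of *N* sentences, we define its corresponding pseudo-corpus $\bar{E} = \langle \bar{e}^{(1)}, \ldots, \bar{e}^{(N)} \rangle$. $\bar{e}^{(i)}$ could be, for example, either the 1-best translation candidate or the oracle calculated during the decoding step in line 7 of Figure 8. In the pseudo-corpus approximation, the sentence-level error for the translation candidate ***e***′ acquired by decoding the *i*th source sentence in the training data can be defined as the corpus-level error acquired when in $\bar{E}$, the *i*th sentence is replaced with ***e***′:

$$\mathrm{err}(e^{(i)}, e') = \mathrm{error}\left(E, \langle \bar{e}^{(1)}, \ldots, \bar{e}^{(i-1)}, e', \bar{e}^{(i+1)}, \ldots, \bar{e}^{(N)} \rangle\right)$$

#### 6.1.2 Approximation with Decay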

When approximating the error function using a pseudo-corpus, it is necessary to remember translation candidates for every sentence in the corpus. In addition, the size of differences in the sentence-level error becomes dependent on the number of other sentences in the corpus, making it necessary to perform scaling of the error, particularly for max-margin methods (Watanabe et al. 2007). As an alternative method that alleviates these problems, there has also been a method proposed that remembers a single set of sufficient statistics for the whole corpus, and upon every update forces these statistics to **decay** according to some criterion (Chiang, Marton, and Resnik 2008; Chiang, Knight, and Wang 2009; Chiang 2012).

For BLEU, these sufficient statistics consist of the number of *n*-grams, the number of matched *n*-grams, and the reference length. When evaluating the error for a candidate ***e***′ of the *i*th sentence in the training data, we first decay the sufficient statistics

$$\bar{s} \leftarrow \nu \cdot \bar{s}$$

where ν is the amount of decay, taking a value such as 0.9 or 0.99. Next, based on these sufficient statistics, we calculate the error of each candidate for the sentence by summing the sufficient statistics of the sentence with the decayed sufficient statistics $\bar{s}$. For example, given the *i*th training sentence 〈*f*^{(i)}, *e*^{(i)}〉, if stat_{BLEU}(*e*^{(i)}, ***e***′) returns the sufficient statistics of the candidate and BLEU(·) is a function calculating BLEU from a particular set of sufficient statistics, we can use the following equation:

$$\mathrm{err}(e^{(i)}, e') = 1 - \mathrm{BLEU}\left(\bar{s} + \mathrm{stat}_{\mathrm{BLEU}}(e^{(i)}, e')\right)$$

After performing an update for a particular sentence, $\bar{s}$ is then updated with the statistics from the 1-best hypothesis found during decoding. When the training data is large, this function will place more emphasis on the recently generated examples, forgetting the older ones.
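A minimal sketch of the decay approximation is given below. It assumes that the error is 1 minus BLEU computed from the decayed history plus the candidate's statistics, and that statistics are stored as a flat list [*c*_{1..4}, *m*_{1..4}, *r*]; the class name `DecayedBleu` and this layout are hypothetical.

```python
import math


def bleu_from_stats(stats, max_n=4):
    """BLEU from [c_1..c_4, m_1..m_4, r]: counts, matches, reference length."""
    counts = stats[:max_n]
    matches = stats[max_n:2 * max_n]
    ref_len = stats[2 * max_n]
    if min(matches) <= 0.0:
        return 0.0
    log_prec = sum(math.log(m / c) for m, c in zip(matches, counts)) / max_n
    log_bp = min(0.0, 1.0 - ref_len / counts[0])
    return math.exp(log_bp + log_prec)


class DecayedBleu:
    """Keeps one decayed set of sufficient statistics for the whole corpus."""

    def __init__(self, decay=0.9, max_n=4):
        self.decay = decay
        self.stats = [0.0] * (2 * max_n + 1)

    def error(self, sentence_stats):
        # Candidate statistics are added to the decayed history.
        combined = [self.decay * s + b
                    for s, b in zip(self.stats, sentence_stats)]
        return 1.0 - bleu_from_stats(combined)

    def update(self, one_best_stats):
        # Decay the history, then absorb the 1-best statistics.
        self.stats = [self.decay * s + b
                      for s, b in zip(self.stats, one_best_stats)]
```

One practical effect visible in this sketch is that a candidate with no 4-gram matches receives a graded error once some history has accumulated, rather than the maximal error that plain sentence-level BLEU would assign.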

### 6.2 The Perceptron

The first online algorithm we cover is the **perceptron algorithm** (shown in Figure 9), which, as its name suggests, optimizes the perceptron loss of Section 3.4. The most central feature of the algorithm is that when the 1-best and oracle translations differ, we update the parameters at line 9. Because the gradient of the perceptron loss in Equation (31) with respect to ***w*** is

$$\nabla_{w}\, \ell_{\mathrm{perceptron}} = h(f^{(i)}, \hat{e}^{(i)}, \hat{d}^{(i)}) - h(f^{(i)}, e^{*(i)}, d^{*(i)})$$

where 〈*ê*^{(i)}, *d̂*^{(i)}〉 is the 1-best translation, this algorithm can also be viewed as updating the parameters based on this gradient, making the parameters for oracle 〈*e*^{∗(i)}, *d*^{∗(i)}〉 stronger, and the parameters for the mistaken translation weaker.

In line 12, we return the final parameters to be used in translation. The most straightforward approach here is to simply return the parameters resulting from the final iteration of the perceptron training, but in a popular variant called the **averaged perceptron**, we instead use the average of the parameters over all iterations in training (Collins 2002). This averaging helps reduce overfitting of sentences that were viewed near the end of the training process, and is known to improve robustness to unknown data, resulting in higher translation accuracy (Liang et al. 2006).
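The update and the averaging step can be sketched over precomputed *k*-best lists as follows. This is an illustration under simplifying assumptions, not the algorithm of Figure 9: candidates are given as (feature vector, error) pairs, the oracle is taken to be the lowest-error candidate, and no decoding is performed.

```python
def averaged_perceptron(examples, num_features, epochs=5):
    """Train on k-best lists; returns (final_weights, averaged_weights).

    examples is a list of k-best lists, each a list of
    (feature_vector, error) candidates (hypothetical layout).
    """
    w = [0.0] * num_features
    total = [0.0] * num_features
    steps = 0
    for _ in range(epochs):
        for kbest in examples:
            oracle = min(kbest, key=lambda c: c[1])
            one_best = max(
                kbest, key=lambda c: sum(wi * hi for wi, hi in zip(w, c[0])))
            if one_best is not oracle:
                # Strengthen oracle features, weaken 1-best features.
                w = [wi + ho - hb
                     for wi, ho, hb in zip(w, oracle[0], one_best[0])]
            total = [ti + wi for ti, wi in zip(total, w)]
            steps += 1
    return w, [ti / steps for ti in total]
```

Accumulating the weights after every step, rather than only after updates, means that parameter settings which survive for many steps contribute more to the average.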

### 6.3 MIRA

The perceptron algorithm updates the parameters by adding the difference between the feature vectors of the oracle and the 1-best to the current parameters *w*^{(t)} (see Equation (59)). This update has the advantage of being simple and being guaranteed to converge when the data is linearly separable, but it is common for MT to handle feature sets that do not allow for linear separation of the 1-best and oracle hypotheses, resulting in instability in learning. The **margin infused relaxed algorithm** (MIRA) (Crammer and Singer 2003; Crammer et al. 2006) is another online learning algorithm designed to help reduce these problems of instability. The update in MIRA follows the same Equation (59), but also adds an additional term that prevents the parameters *w*^{(t)} from varying largely from their previous values:^{8}

$$\ell_{\mathrm{MIRA}}(w; w^{(t)}) = \frac{1}{2} \lVert w - w^{(t)} \rVert^{2} + \lambda_{\mathrm{MIRA}}\, \ell_{\mathrm{margin}}(w) \tag{60}$$

^{8} It should be noted that although this is defined as a loss, it is dependent on the parameters at the previous time step, in contrast to the losses in Section 3.

Here ℓ_{margin}(*w*) denotes the margin-based loss of Section 3.4 calculated over the current mini-batch. For each mini-batch (or single sentence when *K* = 1, line 9 of Figure 9), the next parameters *w*^{(t+1)} are chosen to minimize Equation (60) (Watanabe et al. 2007; Chiang, Marton, and Resnik 2008; Chiang, Knight, and Wang 2009). λ_{MIRA} is a hyperparameter that controls the amount of fitting to the data, with larger values indicating a stronger fit. Intuitively, the MIRA objective contains an error-minimizing term similar to that of the perceptron, but also contains a regularization term with respect to the change in parameters compared to *w*^{(t)}, preferring smaller changes in parameters. When *K* > 1, this equation can be formulated using Lagrange multipliers and solved using a quadratic programming solver similar to that used in SVMs. When *K* = 1, we can simply use the following update formula:

$$w^{(t+1)} = w^{(t)} + \alpha^{(t)} \Delta h^{(t)}, \qquad \alpha^{(t)} = \min\left\{ \lambda_{\mathrm{MIRA}},\; \frac{\ell_{\mathrm{margin}}(w^{(t)})}{\lVert \Delta h^{(t)} \rVert^{2}} \right\}$$

where Δ*h*^{(t)} is the difference between the feature vectors of the oracle and the 1-best hypothesis. The amount of update is proportional to the difference in loss between hypotheses, but when the difference in the features between the oracle and incorrect hypotheses is small, the features that are different will be subject to an extremely large update. The function of the parameter λ_{MIRA} is to control these over-aggressive updates. It should also be noted that when we set α^{(t)} = 1, MIRA reduces to the perceptron algorithm.
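For the single-sentence case, a passive-aggressive style step with a capped step size can be sketched as below. The sketch assumes a hinge-style loss equal to the difference in error between the 1-best and the oracle minus the current score margin, with the step size capped by a constant playing the role of λ_{MIRA}; the function name and argument layout are hypothetical.

```python
def mira_update(w, h_oracle, h_best, err_oracle, err_best, lam=0.01):
    """One capped margin-based update for a single sentence (K = 1)."""
    delta_h = [a - b for a, b in zip(h_oracle, h_best)]
    norm_sq = sum(d * d for d in delta_h)
    if norm_sq == 0.0:
        return list(w)  # identical features: nothing to update
    margin = sum(wi * d for wi, d in zip(w, delta_h))
    loss = max(0.0, (err_best - err_oracle) - margin)
    # Without the cap, a small feature difference yields a huge step.
    alpha = min(lam, loss / norm_sq)
    return [wi + alpha * d for wi, d in zip(w, delta_h)]
```

With a large cap the step exactly closes the margin violation for the current example, while a small cap yields the more conservative behavior described above.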

### 6.4 AROW

One of the problems often pointed out with MIRA is that it is overaggressive, mainly because it attempts to classify the current training example correctly according to Equation (60), even when the training example is an outlier or includes noise. One way to reduce the effect of noise is through the use of **adaptive regularization of weights** (AROW) (Crammer, Kulesza, and Dredze 2009; Chiang 2012). AROW is based on a similar concept to MIRA, but instead of working directly on the weight vector ***w***, it defines a Gaussian distribution over the weights. The covariance matrix Σ is usually assumed to be diagonal, and each variance term in Σ functions as a sort of learning rate for its corresponding weight, with weights of higher variance being updated more widely, and weights with lower variance being updated less widely.

In the case where *K* = 1 and where λ_{MIRA} = λ_{var}, the AROW update that minimizes this loss consists of updates of both ***w*** and Σ (Crammer, Kulesza, and Dredze 2009).

### 6.5 Stochastic Gradient Descent

**Stochastic gradient descent** (SGD) is a gradient-based online algorithm for optimizing differentiable losses, possibly with the addition of *L*_{2} regularization Ω_{2}(***w***), or *L*_{1} regularization Ω_{1}(***w***). As SGD relies on gradients, it can be thought of as an alternative to the gradient-based batch algorithms in Section 5.2. Compared with batch algorithms, SGD requires less memory and tends to converge faster, but requires more care (particularly with regard to the selection of a learning rate) to ensure that it converges to a good answer.

In SGD, the gradient of the regularized loss over the current mini-batch is used to update the current parameters *w*^{(t)} (Bottou 1998):

$$w^{(t+1)} = w^{(t)} - \eta^{(t)} \nabla_{w} \left( \ell(w^{(t)}) + \lambda\, \Omega(w^{(t)}) \right) \tag{69}$$

where ℓ(·) is the loss over the mini-batch, λ is the strength of regularization, and η^{(t)} > 0 is the **learning rate**, which is generally initialized to a value η^{(1)} and gradually reduced according to a function update(·) as learning progresses. One standard method for updating η^{(t)} according to the following formula

$$\eta^{(t+1)} = \frac{\eta^{(1)}}{1 + t/T}$$

with *T* a constant controlling the speed of decay, allows for a guarantee of convergence (Collins et al. 2008). In Equation (69), the parameters are updated and we obtain *w*^{(t+1)}. Within this framework, in the perceptron algorithm η^{(t)} is set to a fixed value, and in MIRA the amount of update changes for every mini-batch. SGD-style online gradient-based methods have been used in translation for optimizing risk-based (Gao and He 2013), ranking-based (Watanabe 2012; Green et al. 2013), and other (Tillmann and Zhang 2006) objectives. When the regularization term Ω(***w***) is not differentiable, such as *L*_{1} regularization, it is a common practice to use **forward-backward splitting** (FOBOS) (Duchi and Singer 2009; Green et al. 2013) in which the optimization is performed in two steps:

$$w^{(t+\frac{1}{2})} = w^{(t)} - \eta^{(t)} \nabla_{w}\, \ell(w^{(t)}) \tag{71}$$

$$w^{(t+1)} = \arg\min_{w} \; \frac{1}{2} \lVert w - w^{(t+\frac{1}{2})} \rVert^{2} + \eta^{(t)} \lambda\, \Omega(w) \tag{72}$$

First, we perform updates without considering the regularization term in Equation (71). Second, the regularization term is applied in Equation (72), which balances regularization and proximity to *w*^{(t+1/2)}. As an alternative to FOBOS, it is possible to use **dual averaging**, which keeps track of the average of previous gradients and optimizes for these along with the full regularization term (Xiao 2010).
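For *L*_{1} regularization the second FOBOS step has a well-known closed form, soft thresholding, which the sketch below uses. The function names are hypothetical, and `lam` stands for an assumed regularization strength.

```python
def soft_threshold(value, threshold):
    """Shrink a value toward zero by `threshold`, clipping at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fobos_l1_step(w, grad, eta, lam):
    """One FOBOS step with L1 regularization.

    Step 1 takes a plain gradient step on the loss; step 2 solves the
    proximal problem, which for L1 is soft thresholding at eta * lam.
    """
    half = [wi - eta * gi for wi, gi in zip(w, grad)]
    return [soft_threshold(v, eta * lam) for v in half]
```

Weights whose magnitude after the gradient step falls below the threshold become exactly zero, which is what makes this procedure attractive for large sparse feature sets.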

An alternative to a single global learning rate is the use of **adaptive gradient** (AdaGrad) (Duchi, Hazan, and Singer 2011; Green et al. 2013) updates. The motivation behind AdaGrad is similar to that of AROW (Section 6.4), using second-order covariance statistics Σ to adjust the learning rate of individual parameters based on their update frequency. If we define the SGD gradient as *g*^{(t)} for notational simplicity, the update rule for AdaGrad can be expressed as follows:

$$\left(\Sigma^{(t+1)}\right)^{-1} = \left(\Sigma^{(t)}\right)^{-1} + g^{(t)} g^{(t)\top}, \qquad w^{(t+1)} = w^{(t)} - \eta \left(\Sigma^{(t+1)}\right)^{1/2} g^{(t)}$$

Like AROW, it is common to use a diagonal covariance matrix, and each time an update is performed the variance for the updated features decreases, reducing the overall learning rate for more commonly updated features. It should be noted that as the update of each feature is automatically controlled by the covariance matrix, there is no need to decay η as is necessary in SGD.
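With a diagonal matrix, the AdaGrad update amounts to dividing each gradient component by the square root of that feature's accumulated squared gradients. The sketch below shows this per-feature form; the function name and the small constant `eps`, added to avoid division by zero, are assumptions.

```python
import math


def adagrad_step(w, grad, sum_sq, eta=0.1, eps=1e-8):
    """One diagonal AdaGrad step.

    sum_sq holds the running sum of squared gradients per feature.
    Returns the new weights and the updated running sums.
    """
    new_sum = [s + g * g for s, g in zip(sum_sq, grad)]
    new_w = [wi - eta * g / (math.sqrt(s) + eps)
             for wi, g, s in zip(w, grad, new_sum)]
    return new_w, new_sum
```

Features that are updated often accumulate large sums and take progressively smaller steps, while rarely seen features keep a comparatively large effective learning rate.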

## 7. Large-Scale Optimization

Up to this point, we have generally given a mathematical or algorithmic explanation of the various optimization methods, and placed a smaller emphasis on factors such as training efficiency. In traditional optimization settings for MT where we optimize only a small number of weights for dense features on a training set of around 1,000 sentences, efficiency is often less of a concern. However, when trying to move to larger sets of sparse features, 1,000 sentences of training data is simply not enough to robustly estimate the parameters, and larger training sets become essential. When moving to larger training sets, parallelization of both the decoding process and the optimization process becomes essential. In this section, we outline the methods that can be used to perform parallelization, greatly increasing the efficiency of training. As parallelization in MT has seen wider use with respect to online learning methods, we will start with a description of online methods and touch briefly upon batch methods afterwards.

### 7.1 Large-Scale Online Optimization

Within the online learning framework, it is possible to improve the efficiency of learning through parallelization (McDonald, Hall, and Mann 2010). An example of this is shown in Figure 10, where the training data is split into *S* **shards** (line 2), learning is performed locally over each shard 〈*F*_{s}, *E*_{s}〉, and the *S* sets of parameters *w*_{s} acquired through local learning are combined according to a function mix(·), a process called **parameter mixing** (line 6). This can be considered an instance of the **MapReduce** (Dean and Ghemawat 2008) programming model, where each Map assigns shards to *S* CPUs and performs training, and Reduce combines the resulting parameters *w*_{s}.

In the training algorithm of Figure 10, because parameters are learned locally on each shard, it is not necessarily guaranteed that the parameters are optimized for the data as a whole. In addition, it is also known that some divisions of the data can lead to contradictions between the parameters (McDonald, Hall, and Mann 2010). Because of this, when performing distributed online learning, it is common to perform parameter mixing several times throughout the training process, which allows the separate shards to share information and prevents contradiction between the learned parameters. Based on the timing of the update, these varieties of mixing are called **synchronous update** and **asynchronous update**.

#### 7.1.1 Synchronous Update

In the online learning algorithm with synchronous update shown in Figure 11, learning is performed independently over each shard 〈*F*_{s}, *E*_{s}〉(1 ≤ *s* ≤ *S*). The difference between this and Figure 10 lies in the fact that learning is performed *T*′ times, with each iteration initialized with the parameters *w*^{(t)} from the previous iteration (line 7).

For the mixing weights μ_{s} used to combine the parameters of each shard, it is possible to use a uniform distribution, or a weight proportional to the number of online updates performed at each shard (McDonald, Hall, and Mann 2010; Simianer, Riezler, and Dyer 2012). It should be noted that this algorithm can be considered a variety of the MapReduce framework, allowing for relatively straightforward implementation using parallel processing infrastructure such as Hadoop (Eidelman 2012).

Simianer, Riezler, and Dyer (2012) propose another method for mixing parameters that, instead of averaging at each iteration, chooses to preserve only the parameters that have been learned over all shards, and sets all the remaining parameters to zero, allowing for a simple sort of feature selection. In particular, the method defines an *S* × *M* matrix that stacks the parameters learned at each shard, takes the *L*_{2} norm of each matrix column, and averages the columns with high norm values while setting the rest to zero.
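Both mixing strategies can be sketched on plain lists of shard parameters. This is an illustration only: the function names are hypothetical, and selecting a fixed number `keep` of columns is an assumed way of deciding which norms count as high.

```python
import math


def mix_uniform(shard_weights):
    """Average the parameter vectors of all shards with equal weight."""
    s = len(shard_weights)
    return [sum(col) / s for col in zip(*shard_weights)]


def mix_with_selection(shard_weights, keep):
    """Average only the `keep` features with the largest column L2 norm.

    shard_weights is an S x M list of lists; all other features are
    set to zero, giving a simple form of feature selection.
    """
    num_features = len(shard_weights[0])
    norms = [math.sqrt(sum(w[j] ** 2 for w in shard_weights))
             for j in range(num_features)]
    ranked = sorted(range(num_features), key=lambda j: norms[j], reverse=True)
    kept = set(ranked[:keep])
    averaged = mix_uniform(shard_weights)
    return [averaged[j] if j in kept else 0.0 for j in range(num_features)]
```

A feature that received weight on only one shard has a small column norm and is dropped, whereas a feature with consistent weight across shards survives and is averaged.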

#### 7.1.2 Asynchronous Update

While parallel learning with synchronous update is guaranteed to converge (McDonald, Hall, and Mann 2010), parameter mixing only occurs after all the data has been processed, leading to inefficiency over large data sets. To fix this problem, asynchronous update sends information about parameter updates to each shard asynchronously, allowing the parameters to be updated more frequently, resulting in faster learning (Chiang, Marton, and Resnik 2008; Chiang, Knight, and Wang 2009).

The algorithm for learning with asynchronous update is shown in Figure 12. With the data 〈*F*_{s}, *E*_{s}〉 (1 ≤ *s* ≤ *S*) split into *S* pieces, each shard performs *T* iterations of training by sampling a mini-batch (line 7), translating each sentence (line 10), and performing optimization on the mini-batch level (line 12).

The asynchronous operator in line 13 sends the result of the mini-batch level update to each shard *s*′ (*s*′ ≠ *s*, 1 ≤ *s*′ ≤ *S*). It should be noted that because the number of dimensions *M* in the parameter vector is extremely large, and each mini-batch will only update a small fraction of these parameters, by only sending the parameters that have actually changed in each mini-batch update we can greatly increase the efficiency of the training. In operator copy_{async}(·) of line 6, the learner receives the updates from all the other shards and mixes them together to acquire the full update vector. It should be noted that at mixing time, there is no need to wait for the update vectors from all of the other shards; update can be performed with only the update vectors that have been received at the time. Because of this, each shard will not necessarily be using parameters that reflect all the most recent updates performed on other shards. However, as updates are performed much more frequently than in synchronous update, it is easier to avoid local optima, and learning tends to converge faster. It can also be shown that (under some conditions) the amount of accuracy lost by this delay in update is bounded (Zinkevich, Langford, and Smola 2009; Recht et al. 2011).

### 7.2 Large-Scale Batch Optimization

Compared with online learning, within the batch optimization framework, parallelization is usually straightforward. Often, decoding takes the majority of time required for the optimization process, and because the parameters will be identical for each sentence in the decoding run (Figure 3, line 6), decoding can be parallelized trivially. The process of parallelizing optimization itself depends slightly on the optimization algorithm, but is generally possible to achieve in a number of ways.

The first, and simplest, method for parallelization is the parallelization of **optimization runs**. The most obvious example of this is MERT, where random restarts are required. Each of the random restarts is completely independent, so it is possible to run these on different nodes, and finally check which run achieved the best accuracy and use that result.

Another more fine-grained method for parallelization, again most useful for MERT, is the parallelization of **search directions**. In the loop starting at Figure 4, line 6, MERT performs line search in several different directions, each one being independent of the others. Each of these line searches can be performed in parallel, and the direction allowing for the greatest gain in accuracy is chosen when all threads have completed.

A method that is applicable to a much broader array of optimization methods is the parallelization of **calculation of sufficient statistics**. In this approach, as in Section 7.1.1, we first split the data into shards 〈*F*_{s}, *E*_{s}〉(1 ≤ *s* ≤ *S*). Then, over these shards we calculate the sufficient statistics necessary to perform a parameter update. For example, in MERT these sufficient statistics would consist of the envelope for each of the potential search directions. In gradient-based methods, the sufficient statistics would consist of the gradient calculated with respect to only the data on the shard. Finally, when all threads have finished calculating these statistics, a master thread combines the statistics from each shard, either by merging the envelopes, or by adding the gradients.
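For gradient-based methods, this pattern can be sketched with a thread pool that computes one partial gradient per shard and a master step that adds them. The squared-error loss here is only a stand-in for an MT loss, and the function names are hypothetical; in CPython, threads illustrate the structure rather than providing a real speedup for pure-Python computation.

```python
from concurrent.futures import ThreadPoolExecutor


def shard_gradient(w, shard):
    """Gradient of a squared-error loss over one shard (stand-in loss).

    shard is a list of (feature_vector, target) pairs.
    """
    grad = [0.0] * len(w)
    for features, target in shard:
        residual = sum(wi * hi for wi, hi in zip(w, features)) - target
        for j, hj in enumerate(features):
            grad[j] += 2.0 * residual * hj
    return grad


def parallel_gradient(w, shards):
    """Compute per-shard gradients concurrently, then add them."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial = list(pool.map(lambda s: shard_gradient(w, s), shards))
    # Master step: the full gradient is the sum of the shard gradients.
    return [sum(col) for col in zip(*partial)]
```

Because the loss is a sum over training examples, the summed shard gradients equal the gradient over the full data, so the parameter update is unchanged by the split.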

## 8. Other Topics in MT Optimization

In this section we cover several additional topics in optimization for MT, including nonlinear models (Section 8.1), optimization for a particular domain or test set (Section 8.2), and the interaction between evaluation measures (Section 8.3) or search (Section 8.4) and optimization.

### 8.1 Non-Linear Models

Note that up until this point, all models that we have considered calculate the scores for translation hypotheses according to a linear model, where the score is calculated according to the dot product of the features and weights shown in Equation (1). However, linear models are obviously limited in their expressive power, and a number of works have attempted to move beyond linear combinations of features to nonlinear combinations.

In general, most nonlinear models for machine learning can be applied to MT as well, with one major caveat. Specifically, the efficiency of the decoding process largely relies on the feature locality assumption mentioned in Section 2.3. Unfortunately, the locality assumption breaks down when moving beyond a simple linear scoring function, and overcoming this problem is the main obstacle to applying nonlinear models to MT (or structured learning in general). A number of countermeasures to this problem exist:

- **Reranking:** The simplest and most commonly used method for incorporating nonlinearity, or other highly nonlocal features that cannot be easily incorporated in search, is through the use of reranking (Shen, Sarkar, and Och 2004). In this case, a system optimized using a standard linear model is used to create a *k*-best list of outputs, and this *k*-best list is then reranked using the nonlinear model (Nguyen, Mahajan, and He 2007; Duh and Kirchhoff 2008). Because we are now only dealing with fully expanded hypotheses, scoring becomes trivial, but reranking also has the major downsides of potentially missing useful hypotheses not included in the *k*-best list,^{9} and requiring time directly proportional to the size of the *k*-best list.
- **Local Nonlinearity:** Another possibility is to first use a nonlinear function to calculate local features, which are then used as part of the standard linear model (Liu et al. 2013). Alternatively, it is possible to treat feature-value pairs as new binary features (Clark, Dyer, and Lavie 2014). In this case, all effects of nonlinearity are resolved before the search actually begins, allowing for the use of standard and efficient search algorithms. On the other hand, it is not possible to incorporate non-local features into the nonlinear model.
- **Improved Search Techniques:** Although there is no general-purpose solution to incorporating nonlinear models into search, for some particular models it is possible to perform search in a way that allows for incorporation of nonlinearities. For example, **ensemble decoding** has been used with stacking-based models (Razmara and Sarkar 2013), and it has been shown that the search space can be simplified to the extent that kernel functions can be calculated efficiently (Wang, Shawe-Taylor, and Szedmak 2007).

Once the problems of search have been solved, a number of actual learning techniques can be used to model nonlinear scoring functions. One of the most popular examples of nonlinear functions are those utilizing kernels, and methods applied to MT include kernel-like functions over the feature space such as the Parzen window, binning, and Gaussian kernels (Nguyen, Mahajan, and He 2007), or the *n*-spectrum string kernel for finding associations between the source and target strings (Wang, Shawe-Taylor, and Szedmak 2007). Neural networks are another popular method for modeling nonlinearities, and it has been shown that neural networks can effectively be used to calculate new local features for MT (Liu et al. 2013). Methods such as boosting or stacking, which combine together multiple parameterizations of the translation model, have been incorporated through reranking (Duh and Kirchhoff 2008; Lagarda and Casacuberta 2008; Duan et al. 2009; Sokolov, Wisniewski, and Yvon 2012b), or ensemble decoding (Razmara and Sarkar 2013). Regression decision trees have also been introduced as a method for inducing nonlinear functions, incorporated through history-based search algorithms (Turian, Wellington, and Melamed 2006), or by using the trees to induce features local to the search state (Toutanova and Ahn 2013).

### 8.2 Domain-Dependent Optimization

One widely acknowledged feature of machine learning problems in general is that the parameters are sensitive to the domain of the data, and by optimizing the parameters with data from the target domain it is possible to achieve gains in accuracy. In machine translation, this is also very true, although much of the work on domain adaptation has focused on adapting the model learning process prior to explicit optimization towards an evaluation measure (Koehn and Schroeder 2007). However, there are a few works on optimization-based domain adaptation in MT, as we will summarize subsequently.

One relatively simple way of performing domain adaptation is by selecting a **subset** of the training data that is similar to the data that we want to translate (Li et al. 2010). This can be done by selecting sentences that are similar to our test corpus, or even selecting adaptation data for each individual test sentence (Liu et al. 2012). If no parallel data exist in the target domain, it has also been shown that first automatically translating data from the source to the target language or vice versa, and then using these data for optimization and model training, is helpful (Ueffing, Haffari, and Sarkar 2007; Li et al. 2011; Zhao et al. 2011). In addition, in a computer-assisted translation scenario, it is possible to reflect post-edited translations back into the optimization process as new in-domain training data (Mathur, Mauro, and Federico 2013; Denkowski, Dyer, and Lavie 2014).
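The subset selection idea can be sketched as follows. This is a minimal example that assumes cosine similarity over bag-of-words counts as the similarity measure; the cited works may use different measures, and the function names and toy sentences are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_tuning_subset(pool: List[Tuple[str, str]],
                         test_sources: List[str],
                         n: int) -> List[Tuple[str, str]]:
    """Keep the n (source, target) pairs whose source side is most
    similar to the pooled vocabulary of the test-set source sentences."""
    test_profile = Counter(w for s in test_sources for w in s.split())
    ranked = sorted(
        pool,
        key=lambda pair: cosine(Counter(pair[0].split()), test_profile),
        reverse=True,
    )
    return ranked[:n]


# Hypothetical candidate pool; target sides are placeholders.
pool = [
    ("the patient received a dose", "<target 1>"),
    ("the parliament voted today", "<target 2>"),
    ("dose adjustment for the patient", "<target 3>"),
]
test_sources = ["patient dose information", "the dose was reduced"]
subset = select_tuning_subset(pool, test_sources, n=2)
```

Selecting data for each individual test sentence, as opposed to the whole test corpus, would correspond to calling `select_tuning_subset` with a single-element `test_sources` list.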

Once adaptation data have been chosen, it is necessary to decide how to use the data. The most straightforward way is to simply use these in-domain data in optimization, but if the data set is small it is preferable to combine both in- and out-of-domain data to achieve more robust parameter estimates. This is essentially equivalent to the standard domain-adaptation problem in machine learning, and in the context of MT there have been methods proposed to perform Bayesian adaptation of probabilistic models (Sanchis-Trilles and Casacuberta 2010), and online update using ultraconservative algorithms (Liu et al. 2012). This can be extended to cover multiple target domains using multi-task learning (Cui et al. 2013).

Finally, it has been noted that when optimizing a few dense parameters, it is useful to make the distinction between in-domain translation (when the model training data matches the test domain) and cross-domain translation (when the model training data mismatches the test domain). In cross-domain translation, fewer long rules will be used, and translation probabilities will be less reliable, and the parameters must change accordingly to account for this (Pecina, Toral, and van Genabith 2012). It has also been shown that building TMs for several domains and tuning the parameters to maximize translation accuracy can improve MT accuracy on the target domain (Haddow 2013). Another option for making the distinction between in-domain and out-of-domain data is by firing different features for in-domain and out-of-domain training data, allowing for the learning of different weights for different domains (Clark, Lavie, and Dyer 2012).
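To make the idea of domain-specific features concrete, the sketch below assumes one common realization, feature augmentation, in which every feature fires once as a shared feature and once as a copy tagged with the domain of the sentence. The feature names, the domain labels, and the weights are hypothetical.

```python
from typing import Dict


def augment(feats: Dict[str, float], domain: str) -> Dict[str, float]:
    """Return the shared features plus a domain-tagged copy of each one."""
    out = dict(feats)
    out.update({f"{domain}:{name}": value for name, value in feats.items()})
    return out


def score(weights: Dict[str, float], feats: Dict[str, float]) -> float:
    """Linear model score; features without a learned weight contribute 0."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())


# Hypothetical weights: shared weights plus domain-specific corrections.
weights = {"tm": 1.0, "lm": 0.5, "news:lm": 0.3, "medical:tm": -0.4}
feats = {"tm": -2.0, "lm": -4.0}

news_score = score(weights, augment(feats, "news"))        # LM trusted more
medical_score = score(weights, augment(feats, "medical"))  # TM trusted less
```

Because the shared and domain-tagged copies are ordinary features of a linear model, any of the optimization algorithms discussed in this article can learn their weights without modification.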

### 8.3 Evaluation Measures and Optimization

In the entirety of this article, we have assumed that optimization for MT aims to reduce MT error defined using an evaluation measure, generally BLEU. However, as mentioned in Section 2.5, evaluation of MT is an active research field, and there are many alternatives in addition to BLEU. Thus, it is of interest whether changing the measure used in optimization can affect the overall quality of the translations achieved, as measured by human evaluators.

There have been a few comprehensive studies on the effect of the metric used in optimization on human assessments of the generated translations (Cer, Manning, and Jurafsky 2010; Callison-Burch et al. 2011). These studies showed the rather surprising result that, despite the fact that other evaluation measures had proven superior to BLEU with regard to post facto correlation with human evaluation, a BLEU-optimized system proved superior to systems tuned using other metrics. Since this result, however, there have been other reports stating that systems optimized using other metrics such as TESLA (Liu, Dahlmeier, and Ng 2011) and MEANT (Lo et al. 2013) achieve superior results to BLEU-optimized systems.

There have also been attempts to directly optimize not automatic, but **human evaluation measures** of translation quality (Zaidan and Callison-Burch 2009). However, the cost of performing this sort of human-in-the-loop optimization is prohibitive, so Zaidan and Callison-Burch (2009) propose a method that re-uses partial hypotheses in evaluation. Saluja, Lane, and Zhang (2012) also propose a method for incorporating binary good/bad input into optimization, with the motivation that this sort of feedback is easier for human annotators to provide than generating new reference sentences.

There has also been work on optimizing towards **multiple evaluation metrics** at one time. The easiest way to do so is to simply use the linear interpolation of two or more metrics as the error function (Dyer et al. 2009; He and Way 2009; Servan and Schwenk 2011):

$$\mathrm{error}(\cdot) = \sum_{i=1}^{L} \rho_i \, \mathrm{error}_i(\cdot)$$

where *L* is the number of error functions, and ρ_{i} is a manually set interpolation coefficient for its respective error function. There are also more sophisticated methods based on the idea of optimizing towards Pareto-optimal hypotheses (Duh et al. 2012), which achieve errors lower than all other hypotheses on at least one evaluation measure. To incorporate this concept of Pareto optimality into optimization, the Pareto-optimal set is defined on the sentence level, and ranking loss (Section 3.5) is used to ensure that the Pareto-optimal hypotheses achieve a higher score than those that are not Pareto optimal. This method has also been extended to take advantage of ensemble decoding, where multiple parameter settings are used simultaneously in decoding (Sankaran, Sarkar, and Duh 2013).
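Both ways of combining metrics can be sketched in a few lines of Python. The example assumes that each hypothesis comes with a vector of sentence-level errors (lower is better) and uses the standard notion of dominance to define the Pareto-optimal set, which is an assumption about the exact criterion; the function names and error values are hypothetical.

```python
from typing import List, Sequence, Tuple


def interpolated_error(errors: Sequence[float], rho: Sequence[float]) -> float:
    """Linear interpolation of L error functions with coefficients rho."""
    if len(errors) != len(rho):
        raise ValueError("need one coefficient per error function")
    return sum(r * e for r, e in zip(rho, errors))


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b: no worse on every measure, strictly better on one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_optimal(error_vectors: List[Sequence[float]]) -> List[int]:
    """Indices of hypotheses that no other hypothesis dominates."""
    return [i for i, v in enumerate(error_vectors)
            if not any(dominates(u, v)
                       for j, u in enumerate(error_vectors) if j != i)]


def ranking_pairs(error_vectors: List[Sequence[float]]) -> List[Tuple[int, int]]:
    """(better, worse) pairs for a ranking loss: every Pareto-optimal
    hypothesis should outscore every hypothesis that is not."""
    optimal = set(pareto_optimal(error_vectors))
    return [(i, j) for i in sorted(optimal)
            for j in range(len(error_vectors)) if j not in optimal]


# Hypothetical errors under two measures for three hypotheses.
errors = [(0.30, 0.50), (0.40, 0.40), (0.50, 0.60)]
frontier = pareto_optimal(errors)
pairs = ranking_pairs(errors)
combined = interpolated_error(errors[0], rho=(0.5, 0.5))
```

In this toy example the first two hypotheses are each best on one measure, so both are Pareto optimal, whereas interpolation forces a single trade-off through the choice of the coefficients.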

### 8.4 Search and Optimization

As mentioned in Section 2.4, because MT decoders perform approximate search, they may make search errors and not find the hypothesis that achieves the highest model score. There have been a few attempts to consider this fact in the optimization process.

For example, in the perceptron algorithm of Section 6.2 it is known that the convergence guarantees of the structured perceptron no longer hold when using approximate search. The first method that can be used to resolve this problem is the **early updating** strategy (Collins and Roark 2004; Cowan, Kučerová, and Collins 2006). The early updating strategy is a variety of bold updates, where the decoder output *e*^{∗(i)} must be exactly equal to the reference *e*^{(i)}. Decoding proceeds as normal, but the moment the correct hypothesis *e*^{(i)} can no longer be produced by any hypothesis in the search space (i.e., a search error has occurred), search is stopped and update is performed using only the partial derivation. The second method is the **max-violation perceptron** (Huang, Fayong, and Guo 2012; Yu et al. 2013). In the max-violation perceptron, forced decoding is performed to acquire a derivation 〈*e*^{∗(i)}, *d*^{∗(i)}〉 that can exactly reproduce the correct output *e*^{(i)}, and update is performed at the point when the score of a partial hypothesis exceeds that of the partial hypothesis 〈*e*^{∗(i)}, *d*^{∗(i)}〉 by the greatest margin (the point of “maximum violation”).
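The difference between the two strategies lies in where along the search the update is made. The sketch below assumes that the decoder has already recorded, for each search step, whether the reference derivation is still reachable from the beam and the model scores of the reference prefix and of the best partial hypothesis; these inputs, the function names, and the toy numbers are all hypothetical, and integration with an actual decoder is not shown.

```python
from typing import Dict, List, Optional


def early_update_step(gold_in_beam: List[bool]) -> Optional[int]:
    """First step at which the reference can no longer be produced."""
    for step, in_beam in enumerate(gold_in_beam):
        if not in_beam:
            return step
    return None  # no search error occurred


def max_violation_step(gold_scores: List[float],
                       beam_best_scores: List[float]) -> Optional[int]:
    """Step at which the best partial hypothesis outscores the reference
    prefix by the largest margin, or None if there is no violation."""
    violations = [best - gold
                  for gold, best in zip(gold_scores, beam_best_scores)]
    if not violations:
        return None
    step = max(range(len(violations)), key=lambda t: violations[t])
    return step if violations[step] > 0 else None


def perceptron_update(weights: Dict[str, float],
                      gold_feats: Dict[str, float],
                      pred_feats: Dict[str, float],
                      lr: float = 1.0) -> None:
    """Move weights towards the reference prefix features and away from
    the features of the competing partial hypothesis (in place)."""
    for name in set(gold_feats) | set(pred_feats):
        delta = gold_feats.get(name, 0.0) - pred_feats.get(name, 0.0)
        weights[name] = weights.get(name, 0.0) + lr * delta


# Hypothetical trace of a four-step search.
gold_in_beam = [True, True, False, False]
gold_scores = [1.0, 1.5, 1.2, 1.0]
beam_best_scores = [1.0, 1.6, 2.0, 2.6]

early = early_update_step(gold_in_beam)                      # step 2
violation = max_violation_step(gold_scores, beam_best_scores)  # step 3

weights: Dict[str, float] = {}
perceptron_update(weights, {"a": 1.0}, {"b": 1.0})
```

In either case the update itself is the ordinary perceptron update, applied to the features of the two partial derivations at the chosen step.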

**Search-aware tuning** (Liu and Huang 2014) is a method that is able to consider search errors using an arbitrary optimization method. It does so by defining an evaluation measure for not only full sentences, but also partial derivations that occur during the search process, and optimizes parameters for *k*-best lists of partial derivations.

Finally, there has also been some work on optimizing not the parameters of the model itself, but the parameters of the search process, using the downhill simplex algorithm (Chung and Galley 2012). Using this method, it is possible to adjust the beam width, distortion penalty, or other parameters that actually affect the size and shape of the derivation space, as opposed to simply rescoring hypotheses within it.
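For readers unfamiliar with the downhill simplex algorithm, the following is a simplified, self-contained sketch of it (using reflection, expansion, inside contraction, and shrinking only). The objective `toy_dev_error` is a hypothetical stand-in: in an actual system, each evaluation would decode the development set with the given search parameters and return the resulting error, and integer-valued parameters such as the beam width would need to be rounded.

```python
from typing import Callable, List, Sequence


def downhill_simplex(f: Callable[[Sequence[float]], float],
                     x0: Sequence[float],
                     step: float = 1.0,
                     iters: int = 300) -> List[float]:
    """Minimize f without gradients by repeatedly moving the worst
    vertex of a simplex of n + 1 points (simplified variant)."""
    n = len(x0)
    simplex = [list(x0)]
    for i in range(n):  # one extra vertex per dimension
        point = list(x0)
        point[i] += step
        simplex.append(point)
    for _ in range(iters):
        # Note: f is re-evaluated freely here; a real tuner would cache
        # values, since each evaluation is a full decoding run.
        simplex.sort(key=f)
        best, second_worst, worst = simplex[0], simplex[-2], simplex[-1]
        centroid = [sum(p[j] for p in simplex[:-1]) / n for j in range(n)]
        reflected = [c + (c - w) for c, w in zip(centroid, worst)]
        if f(reflected) < f(best):
            expanded = [c + 2.0 * (c - w) for c, w in zip(centroid, worst)]
            simplex[-1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(second_worst):
            simplex[-1] = reflected
        else:
            contracted = [c + 0.5 * (w - c) for c, w in zip(centroid, worst)]
            if f(contracted) < f(worst):
                simplex[-1] = contracted
            else:  # shrink every vertex towards the best one
                simplex = [best] + [
                    [b + 0.5 * (x - b) for b, x in zip(best, p)]
                    for p in simplex[1:]
                ]
    return min(simplex, key=f)


def toy_dev_error(params: Sequence[float]) -> float:
    # Hypothetical smooth surrogate with its minimum at a log beam width
    # of 3.0 and a distortion penalty of -1.0.
    log_beam_width, distortion_penalty = params
    return (log_beam_width - 3.0) ** 2 + (distortion_penalty + 1.0) ** 2


best_params = downhill_simplex(toy_dev_error, [0.0, 0.0])
```

A gradient-free method is a natural fit here because the error as a function of search parameters is obtained only by running the decoder and offers no usable gradient.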

## 9. Conclusion

In this survey article, we have provided a review of the current state of the art in machine translation optimization, covering batch optimization, online optimization, expansions to large-scale data, and a number of other topics. While these optimization algorithms have already led to large improvements in machine translation accuracy, the task of MT optimization is, as stated in the Introduction, an extremely hard one that is far from solved.

The utility of an optimization algorithm can be viewed from a number of perspectives. The final accuracy achieved is, of course, one of the most important factors, but speed, scalability, ease of implementation, final resulting model size, and many other factors play an important role. We can assume that the algorithms being used outside of the context of research on optimization itself are those that satisfy these criteria in some way. Although it is difficult to discern exactly which algorithms are seeing the largest amount of use (industrial SMT systems rarely disclose this sort of information publicly), one proxy for this is to look at systems that performed well on shared tasks such as the Workshop on Machine Translation (WMT) (Bojar et al. 2014). In Table 2 we show the percentage of WMT systems using each optimization algorithm for the past four years, both including all systems, and systems that achieved the highest level of human evaluation in the resource-constrained setting for at least one language pair. From these statistics we can see that even after over ten years, MERT is still the dominant optimization algorithm. However, starting in WMT 2013, we can see a move to systems based on MIRA, and to a lesser extent ranking, particularly in the most competitive systems.

Algorithm | WMT 2011 All | WMT 2011 Best | WMT 2012 All | WMT 2012 Best | WMT 2013 All | WMT 2013 Best | WMT 2014 All | WMT 2014 Best |
---|---|---|---|---|---|---|---|---|
MERT | 80 | 100 | 79 | 100 | 68 | 25 | 63 | 50 |
MIRA | 0 | 0 | 0 | 0 | 20 | 75 | 27 | 50 |
Ranking | 0 | 0 | 4 | 0 | 8 | 0 | 5 | 0 |
Softmax | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Risk | 3 | 0 | 0 | 0 | 0 | 0 | 5 | 0 |
None | 17 | 0 | 17 | 0 | 4 | 0 | 0 | 0 |

In these systems, the preferred choice of an optimization algorithm seems to be MERT when using up to 20 features, and MIRA when using a large number of features (up to several hundred). There are fewer examples of systems using large numbers of features (tens of thousands, or millions) in actual competitive systems, with a few exceptions (Dyer et al. 2009; Neidert et al. 2014; Wuebker et al. 2014). In the case when a large number of sparse features are used, it is most common to use a softmax or risk-based objective and gradient-based optimization algorithms, often combining the features into summary features and performing a final tuning pass with MERT.
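The idea of combining sparse features into summary features can be sketched as follows. This is a minimal example under the assumption that the sparse weights have already been learned with a gradient-based method; the feature names, the weights, and the name `sparse_summary` are hypothetical, and the final MERT pass over the dense features is not shown.

```python
from typing import Dict


def summary_feature(sparse_feats: Dict[str, float],
                    sparse_weights: Dict[str, float]) -> float:
    """Collapse many sparse features into one dense value: their
    weighted sum under the already-learned sparse weights."""
    return sum(sparse_weights.get(name, 0.0) * value
               for name, value in sparse_feats.items())


def to_dense(dense_feats: Dict[str, float],
             sparse_feats: Dict[str, float],
             sparse_weights: Dict[str, float]) -> Dict[str, float]:
    """Dense feature vector for the final tuning pass: the original
    dense features plus a single summary feature."""
    out = dict(dense_feats)
    out["sparse_summary"] = summary_feature(sparse_feats, sparse_weights)
    return out


# Hypothetical learned weights for sparse word-pair features.
sparse_weights = {"pair:maison=house": 0.8, "pair:petite=small": 0.5}
hyp_sparse = {"pair:maison=house": 1.0, "pair:chat=cat": 1.0}
hyp_dense = {"lm": -4.0, "tm": -2.0}

dense = to_dense(hyp_dense, hyp_sparse, sparse_weights)
```

After this step the final tuning pass only has to set a handful of weights, which is the regime in which MERT is reported above to be the preferred choice.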

The fact that algorithms other than MERT are seeing adoption in competitive systems for shared tasks is a welcome sign for the future of MT optimization research. However, there are still many open questions in the field, a few of which can be outlined here:

**Stable Training with Millions of Features:** At the moment, there is still no stable training recipe that has been widely proven to effectively optimize millions of features. Finding an algorithm that gives consistent improvements in this setting is perhaps the largest open problem in MT optimization.

**Evaluation Measures for Optimization:** Although many evaluation measures show consistent improvements in correlation with human evaluation scores over BLEU when used to evaluate the output of existing MT systems, there are few results that show that systems optimized with evaluation measures other than BLEU achieve consistent improvements in human evaluation scores.

**Better Training/Utilization of Nonlinear Scoring Functions:** Nonlinear functions using neural networks have recently achieved large improvements in a number of areas of natural language processing and machine learning. Finding better methods to incorporate this sort of nonlinear scoring function into MT is a highly promising direction, but will require improvements in both learning the scoring functions and correctly incorporating these functions into MT decoding.

## Appendix A: Derivation for xBLEU Gradients

In this appendix, we explain in detail how to derive a gradient for the xBLEU objective in Equation (54), which has not been described completely in previous work.

*P*) · *B*, leading to the following equation

Taking the derivative of xBLEU with respect to ***w***, we get

Thus,

and

Additionally, the following equation holds:

After calculating this gradient, it is possible to optimize this according to standard gradient-based methods. However, like MR using sentence-level evaluation mentioned in Section 5.5, the evaluation measure is not convex, and the same precautions need to be taken to avoid falling into local optima.

## Notes

It should be noted that although most work on MT optimization is concerned with linear models (and thus we will spend the majority of this article discussing optimization of these models), optimization using non-linear models is also possible, and is discussed in Section 8.1.

It should also be noted there have been a few recent attempts to jointly perform rule extraction and optimization, doing away with this two-step process (Xiao and Xiong 2013).

We let #_{A} (*a*) denote the number of times *a* appeared in a multiset *A*, and define: |*A*| = ∑_{a} #_{A}(*a*), #_{A∪B}(*a*) = max{#_{A}(*a*), #_{B}(*a*)}, and #_{A∩B}(*a*) = min{#_{A}(*a*), #_{B}(*a*)}.

Equation (25) can be regarded as an instance of ranking loss described in Section 3.5 in which better translations are selected only from a set of oracle translations.

We take the inverse because we would like model scores and errors to be inversely correlated.

More accurately, finding the oracle in the *k*-best list by enumeration of the hypotheses is easy, but finding the oracle in a compressed data structure such as a lattice is computationally difficult, and approximation algorithms are necessary (Leusch, Matusov, and Ney 2008; Li and Khudanpur 2009; Sokolov, Wisniewski, and Yvon 2012a).

Liu et al. (2012) propose a method to avoid over-aggressive moves in parameter space by considering the balance between increase in the evaluation score and the similarity with the parameters on the previous iteration.

This problem can be ameliorated somewhat by ensuring that there is sufficient diversity in the *n*-best list (Gimpel et al. 2013).

## References

## Author notes

8916-5 Takayama-cho, Ikoma, Nara, Japan. E-mail: neubig@is.naist.jp.

6-10-1 Roppongi, Minato-ku, Tokyo, Japan. E-mail: tarow@google.com.

This work was mostly done while the second author was affiliated with the National Institute of Information and Communications Technology, 3–5 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619–0289, Japan.