## Abstract

Hyperparameter selection is a crucial part of building neural machine translation (NMT) systems across both academia and industry. Fine-grained adjustments to a model’s architecture or training recipe can mean the difference between a positive and negative research result or between a state-of-the-art and underperforming system. While recent literature has proposed methods for automatic hyperparameter optimization (HPO), there has been limited work on applying these methods to NMT, due in part to the high costs associated with experiments that train large numbers of model variants. To facilitate research in this space, we introduce a lookup-based approach that uses a library of pre-trained models for fast, low-cost HPO experimentation. Our contributions include (1) the release of a large collection of trained NMT models covering a wide range of hyperparameters, (2) the proposal of targeted metrics for evaluating HPO methods on NMT, and (3) a reproducible benchmark of several HPO methods against our model library, including novel graph-based and multiobjective methods.

## 1 Introduction

Choosing effective hyperparameters is crucial for building strong neural machine translation (NMT) systems. Although some choices present obvious trade-offs (e.g., more and larger layers tend to increase quality at the cost of speed), others are more subtle (e.g., effects of batch size, learning rate, and normalization techniques on different layer types). Optimal versus suboptimal hyperparameters can lead to dramatic swings in system performance; consider the wide range of BLEU scores for variants of the same base system in Figure 1 (left). In practice, these hyperparameters are often tuned manually based on intuition and heuristics, a tedious and error-prone process that can lead to unreliable experimental results and underperforming shared task or production systems. The difficulty is compounded when system builders must jointly optimize *multiple objectives*, such as translation accuracy (BLEU) and decoding speed, which are largely uncorrelated as shown in Figure 1 (right).

In the past decade, various hyperparameter optimization (HPO) methods have emerged in the machine learning literature under the labels of “AutoML” (Bergstra et al., 2011; Hutter et al., 2011; Bardenet et al., 2013; Snoek et al., 2015) and “neural architecture search” (Zoph and Le, 2016; Liu et al., 2018a,b; Cai et al., 2018; Real et al., 2019). However, it is unclear how they perform on NMT; we are not aware of any prior work with comprehensive evaluation. First, state-of-the-art NMT models (Sutskever et al., 2014; Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017) require significant computational resources for training. Second, they usually have large hyperparameter search spaces. Thus, it is prohibitively expensive in practice to compare HPO methods on NMT tasks.

In order to *enable reproducible HPO research on NMT tasks*, we adopt a benchmark procedure based on “table-lookup”. This approach was recently introduced to neural architecture search by Ying et al. (2019), and to hyperparameter optimization by Klein and Hutter (2019). First, we train an extremely large number of NMT models with diverse hyperparameter settings and record their performance metrics (e.g., BLEU, decoding time) in a table. Then, we constrain our HPO methods to sample from this finite set of models. This allows us to simply look up their pre-computed performance metrics, and it amortizes the burden of computation: as long as we have trained a large number of representative NMT models and pre-computed their metrics beforehand, HPO algorithm developers no longer need to bear the cost of training NMT systems. Importantly, this kind of benchmark significantly shortens the turnaround time of HPO experiments, enabling fast repeated trials for rigorous tests and facilitating detailed error analysis.

The main contributions of this work are:

- **Dataset**: We release a benchmark dataset^{1} for comparing HPO methods on NMT models. This “table-lookup” HPO dataset supports both single-objective and multiobjective optimization of translation accuracy and decoding time (Section 3). Specifically, we trained a total of 2,245 Transformers (Vaswani et al., 2017) on six different corpora (with a cost of approximately 1,547 GPU days), and collected all pairs of hyperparameter settings and corresponding performance metrics.
- **Evaluation protocols**: We provide three kinds of metrics for evaluating HPO methods, based on different computational budgets (Section 4). We also demonstrate error analysis techniques that are enabled by this “table-lookup” framework, which provide insights into algorithm behavior (Section 7).
- **HPO method benchmarks**: We benchmark the performance of several HPO methods on our dataset (Section 6). These include Bayesian optimization as well as a novel graph-based method that exploits the structure of the hyperparameter space (Section 5). We also extend these methods to handle the multiobjective optimization of both BLEU and decoding time. These experiments illustrate how to utilize the dataset to rigorously evaluate HPO for NMT.

## 2 HPO Problem Definition

Given a machine learning algorithm with *H* hyperparameters, we denote the domain of the *h*-th hyperparameter by Λ_{h} and the overall hyperparameter configuration space as **Λ** = Λ_{1} × Λ_{2} × … × Λ_{H}. When trained with a hyperparameter setting **λ** ∈ **Λ** on data $\mathcal{D}_{train}$, the algorithm’s performance metric on some validation data $\mathcal{D}_{valid}$ is $f(\boldsymbol{\lambda}) := V(\boldsymbol{\lambda}, \mathcal{D}_{train}, \mathcal{D}_{valid})$. In the context of NMT, *f*(⋅) or $V(\cdot)$ could be the perplexity, translation accuracy (e.g., BLEU score), or decoding time on $\mathcal{D}_{valid}$. In general, *f*(⋅) is computationally expensive to obtain; it requires training a model to completion, then evaluating some performance metric on a validation set. For purposes of exposition, we assume that lower *f*(⋅) is better, so we might define *f*(⋅) as 1 − BLEU.

The goal of hyperparameter optimization is then to find a $\boldsymbol{\lambda}_\star = \arg\min_{\boldsymbol{\lambda} \in \boldsymbol{\Lambda}} f(\boldsymbol{\lambda})$, with as few evaluations of *f*(⋅) as possible. An HPO problem can be challenging for three reasons: (a) **Λ** may be a combinatorially large space, prohibiting grid search over hyperparameters. (b) *f*(⋅) may be expensive to compute, so there is a tight budget on how many evaluations of *f*(⋅) are allowed. (c) *f* is not a continuous function and no gradient information can be exploited, forcing us to view the $\arg\min$ as a blackbox discrete search problem. NMT HPO search exhibits all these conditions.

One class of algorithms that tackles the HPO problem is sequential model-based optimization (SMBO), illustrated in Figure 2. SMBO approximates *f* with a cheap-to-evaluate surrogate model $\hat{f}$ (Feurer and Hutter, 2019; Luo, 2016; Jones et al., 1998). SMBO starts by querying *f* with initial hyperparameters {*λ*_{init}} and recording the resulting $(\boldsymbol{\lambda}_{init}, f(\boldsymbol{\lambda}_{init}))$ pairs. Then, it iteratively (1) fits the surrogate $\hat{f}$ on the pairs observed so far; (2) gets the predictions $\hat{f}(\boldsymbol{\lambda}_i)$ for unlabeled/unobserved hyperparameters; and (3) selects a promising *λ*_{p} to query next based on these predictions and an acquisition function, whose role is to trade off exploration of regions of **Λ** with high model uncertainty against exploitation of regions of **Λ** with low $\hat{f}(\cdot)$.
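The SMBO loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names (`smbo`, `nearest_neighbor_surrogate`, `greedy_acquisition`), the toy nearest-neighbor surrogate, the purely exploitative acquisition, and the quadratic toy objective are all hypothetical stand-ins chosen to keep the example self-contained.

```python
import random


def smbo(candidates, query, fit_surrogate, acquisition, n_init=3, budget=20, seed=0):
    """Run SMBO over a finite list of candidate configurations (lower f is better)."""
    rng = random.Random(seed)
    # Initial design: query a few randomly chosen configurations.
    observed = {lam: query(lam) for lam in rng.sample(candidates, n_init)}
    while len(observed) < min(budget, len(candidates)):
        predict = fit_surrogate(observed)                       # (1) fit the surrogate
        unobserved = [lam for lam in candidates if lam not in observed]
        scores = {lam: acquisition(predict(lam), observed)      # (2) predict
                  for lam in unobserved}
        lam_p = max(scores, key=scores.get)                     # (3) select
        observed[lam_p] = query(lam_p)                          # query f
    best = min(observed, key=observed.get)
    return best, observed


def nearest_neighbor_surrogate(observed):
    """Toy surrogate: predict the f value of the closest observed configuration."""
    def predict(lam):
        nearest = min(observed, key=lambda o: sum((a - b) ** 2 for a, b in zip(o, lam)))
        return observed[nearest]
    return predict


def greedy_acquisition(prediction, observed):
    """Toy acquisition: pure exploitation, prefers the lowest predicted f."""
    return -prediction


if __name__ == "__main__":
    grid = [(x,) for x in range(20)]          # toy one-dimensional search space
    best, history = smbo(grid, lambda lam: (lam[0] - 7) ** 2,
                         nearest_neighbor_surrogate, greedy_acquisition, budget=10)
    print(best, len(history))
```

A real acquisition function would also use the surrogate's uncertainty to balance exploration and exploitation; the greedy rule here only shows where that decision plugs into the loop.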

## 3 Table-Lookup HPO Datasets

### 3.1 Table-Lookup Framework

To evaluate a newly devised HPO algorithm, one needs to run each component of the loop in Figure 2. However, the “query” step is computationally expensive: We need to train a new NMT system each time we sample a new hyperparameter.

The idea of table lookup is to simply pre-train a large set of *I* NMT systems and record the pairs {*λ*_{i},*f*(*λ*_{i})}_{i=1,…,I} in a table. Thus, when running the loop in Figure 2, the HPO algorithm developer can look up *f*(*λ*_{i}) whenever necessary, without having to train an NMT model from scratch. This significantly speeds up the experimental process. The advantages are:

- One can perform multiple random trials of the same algorithm, to test robustness.
- One can perform comparisons with more baseline algorithms, to make stronger claims.
- One can perform the same experiment under different budget constraints, to simulate different real-world use cases.
- One can track the progress of an experiment with respect to oracle results, allowing for more detailed error analysis of HPO.

To be effective, table lookup depends on two important assumptions: First, the table has to be sufficiently large to cover the space of hyperparameters **Λ**. Second, the HPO algorithm needs to be modified to sample from the finite set of hyperparameters in the table; this is usually easy to implement but the assumption is that finite-sample results will generalize.
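As a rough sketch of the lookup idea, the “query” step can be replaced by a dictionary access that also counts how many evaluations an HPO method has consumed. The class name `LookupTable`, the metric keys `bleu` and `decode_time`, and the three rows below are hypothetical; the metric values are invented for illustration and are not taken from the released data.

```python
class LookupTable:
    """Serve pre-computed metrics instead of training a model for each query."""

    def __init__(self, rows):
        # rows: mapping from a hyperparameter tuple to its recorded metrics
        self.rows = dict(rows)
        self.num_queries = 0  # runtime, counted in function evaluations

    def configurations(self):
        """The finite set of configurations an HPO method may sample from."""
        return list(self.rows)

    def query(self, config, metric="bleu"):
        """Look up f(config); raises KeyError for configurations not in the table."""
        self.num_queries += 1
        return self.rows[config][metric]


# Hypothetical rows: (bpe_k, layers, embed, hidden, att_heads, init_lr) -> metrics.
# The metric values are invented for illustration only.
toy_rows = {
    (10, 2, 256, 1024, 8, 3e-4): {"bleu": 12.1, "decode_time": 95.0},
    (30, 4, 512, 1024, 16, 3e-4): {"bleu": 14.2, "decode_time": 160.0},
    (50, 4, 1024, 2048, 8, 6e-4): {"bleu": 13.0, "decode_time": 240.0},
}
table = LookupTable(toy_rows)
best = max(table.configurations(), key=table.query)  # exhaustive search: 3 queries
```

Because every query is a lookup, the same HPO run can be repeated many times with different seeds at negligible cost, which is what makes the repeated-trial protocol practical.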

### 3.2 HPO Algorithm Selection/Development

There exist many choices of HPO algorithm, which can be evaluated or further developed on our lookup tables. Figure 3 illustrates this process. The performance of HPO algorithm candidates on various MT datasets serves as the basis for HPO selection. The selected HPO algorithm can then be applied on new MT datasets.

There are two kinds of generalization effects at play: (1) generalization of an HPO algorithm across MT datasets, and (2) generalization of MT models and their associated hyperparameters across MT datasets. We mainly care about (1) in the algorithm development process, which is why we opt to provide six distinct datasets, described in Section 3.3 (as opposed to, e.g., a single dataset trained on large MT data). If an HPO algorithm is efficient at finding good hyperparameter configurations on many MT datasets, then we can more reasonably believe that it will run quickly on a new dataset, regardless of the underlying MT data characteristics. Even if the best configuration on one MT dataset does not transfer to another, a robust HPO algorithm should still be capable of finding good hyperparameters because the algorithm learns from scratch on each dataset independently.

### 3.3 MT Data and Setup

To create a robust HPO benchmark, we trained NMT models on six different parallel corpora, which exhibit a variety of characteristics:

**TED Talks**: We trained Chinese–English (**zh-en**) and Russian–English (**ru-en**) models on the data-split of Duh (2018). This is a mid-resource setup, where $\mathcal{D}_{train}$ consists of 170k lines for zh-en and 180k lines for ru-en. $\mathcal{D}_{valid}$ has 1,958 sentences and is multiway parallel for both language pairs.

**WMT2019 Robustness task** (Li et al., 2019): We trained models on Japanese–English data, in both directions (**ja-en**, **en-ja**). $\mathcal{D}_{train}$ has 4M lines from a mix of domains. $\mathcal{D}_{valid}$ is a concatenation of 4k mixed-domain sentences and 1k Reddit sentences, for a total of 5,405 lines. The goal of the Robustness task is to test how NMT systems perform on non-standard and noisy text (e.g., Reddit).

**Low Resource tasks**: We trained models using the IARPA MATERIAL datasets for Swahili–English (**sw-en**) and Somali–English (**so-en**). $\mathcal{D}_{train}$ consists of only 24k lines for both language pairs (BUILD set), and $\mathcal{D}_{valid}$ consists of 2,675 lines (ANALYSIS2 set).

Although there are many potential MT datasets we could choose from, we believe these six form a good representative set. They range from high- to low-resource and include both noisy and clean settings. They also have different levels of similarity: for example, zh-en and ru-en TED Talks use the same multiway parallel $\mathcal{D}_{valid}$, so one could ask whether the optimal hyperparameters transfer.

The text is tokenized by Jieba for Chinese, by Kytea for Japanese, and by the Moses tokenizer for the rest. Byte pair encoding (BPE) segmentation (Sennrich et al., 2016) is learned and applied separately for each side of bitext. We train Transformer NMT models with Sockeye^{3} (Hieber et al., 2017), focusing on these hyperparameters:

- **preprocessing configurations**: number of BPE symbols^{4} (bpe)
- **training settings**: initial learning rate (init_lr) for the Adam optimizer
- **architecture designs**:^{5} number of layers (#layers), embedding size (#embed), number of hidden units in each layer (#hidden), number of heads in self-attention (#att_heads).

These hyperparameters are chosen because they significantly affect both the accuracy and the speed of the resulting NMT system. Other hyperparameters are kept at their Sockeye defaults.^{6} Table 1 shows our overall hyperparameter space **Λ**. In total, across all six datasets, we have 1,983 models; Table 2 shows the exact number of models per dataset, along with the best models and their hyperparameter settings.^{7}

Table 1: Hyperparameter space **Λ** for each dataset.

| dataset | bpe (1k) | #layers | #embed | #hidden | #att_heads | init_lr (10^{−4}) |
|---|---|---|---|---|---|---|
| zh, ru, ja, en | 10, 30, 50 | 2, 4 | 256, 512, 1024 | 1024, 2048 | 8, 16 | 3, 6, 10 |
| sw | 1, 2, 4, 8, 16, 32 | 1, 2, 4, 6 | 256, 512, 1024 | 1024, 2048 | 8, 16 | 3, 6, 10 |
| so | 1, 2, 4, 8, 16, 32 | 1, 2, 4 | 256, 512, 1024 | 1024, 2048 | 8, 16 | 3, 6, 10 |

Table 2: Number of models per dataset, with the best BLEU and the hyperparameter settings of the best model.

| Dataset | #models | Best BLEU | bpe | #layers | #embed | #hidden | #att_heads | init_lr |
|---|---|---|---|---|---|---|---|---|
| zh-en | 118 | 14.66 | 30k | 4 | 512 | 1024 | 16 | 3e-4 |
| ru-en | 176 | 20.23 | 10k | 4 | 256 | 2048 | 8 | 3e-4 |
| ja-en | 150 | 16.41 | 30k | 4 | 512 | 2048 | 8 | 3e-4 |
| en-ja | 168 | 20.74 | 10k | 4 | 1024 | 2048 | 8 | 3e-4 |
| sw-en | 767 | 26.09 | 1k | 2 | 256 | 1024 | 8 | 6e-4 |
| so-en | 604 | 11.23 | 8k | 2 | 512 | 1024 | 8 | 3e-4 |

#### Rationale for Hyperparameter Values:

There are various design trade-offs in deciding the range and granularity of hyperparameter values. First, we might expand to a wider range of values (e.g., change #hidden = {1024, 2048} to {512, 1024, 2048, 4096}). The effect is that we test the HPO algorithm on a wider range of inputs, with potentially more variability in metrics like BLEU and inference speed. Second, we might use a more fine-grained range of values (e.g., change #hidden = {1024, 2048} to {1024, 1536, 2048}). This might result in smoother metrics, making it easier for HPO algorithms to learn. Although a wider range and finer granularity are desirable properties for an HPO dataset, each additional value multiplies the number of models to train, because the grid is the cross-product of all values. In general, we think Table 1 represents a reasonable set of values used in the literature. Nevertheless, empirical findings from table-lookup datasets should be interpreted in light of the limits of hyperparameter range and granularity.
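To make the cross-product growth concrete, the sketch below counts the full grid implied by the value lists in Table 1. The dictionary keys and the helper `grid_size` are illustrative names, and the sketch assumes that the first row of Table 1 applies separately to each of the four datasets it lists.

```python
from math import prod

# Value lists copied from Table 1 (bpe in thousands, init_lr in units of 1e-4).
GRID = {
    "zh, ru, ja, en": {
        "bpe": [10, 30, 50], "layers": [2, 4], "embed": [256, 512, 1024],
        "hidden": [1024, 2048], "att_heads": [8, 16], "init_lr": [3, 6, 10],
    },
    "sw": {
        "bpe": [1, 2, 4, 8, 16, 32], "layers": [1, 2, 4, 6], "embed": [256, 512, 1024],
        "hidden": [1024, 2048], "att_heads": [8, 16], "init_lr": [3, 6, 10],
    },
    "so": {
        "bpe": [1, 2, 4, 8, 16, 32], "layers": [1, 2, 4], "embed": [256, 512, 1024],
        "hidden": [1024, 2048], "att_heads": [8, 16], "init_lr": [3, 6, 10],
    },
}


def grid_size(space):
    """Number of configurations in the cross-product of all value lists."""
    return prod(len(values) for values in space.values())


sizes = {name: grid_size(space) for name, space in GRID.items()}
# sizes == {"zh, ru, ja, en": 216, "sw": 864, "so": 648}

# Adding a third #hidden value to the first row grows the grid from 216 to 324.
finer = dict(GRID["zh, ru, ja, en"], hidden=[1024, 1536, 2048])
finer_size = grid_size(finer)
```

Under these counts, the per-dataset model totals in Table 2 (118 to 176 for the first four datasets, 767 for sw-en, 604 for so-en) are smaller than the full grids, so each table covers a subset of the cross-product.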

### 3.4 Objectives: Accuracy and Cost

We train all models on $\mathcal{D}_{train}$ until they converge in terms of perplexity on $\mathcal{D}_{valid}$. We then record various performance measurements:

- **Translation accuracy**: BLEU (Papineni et al., 2002) and perplexity on $\mathcal{D}_{valid}$.
- **Computational cost**: GPU wall-clock time for decoding $\mathcal{D}_{valid}$, number of updates for the model to converge, GPU memory used for training, and total number of model parameters.

In this paper, we use BLEU on $\mathcal{D}_{valid}$ for single-objective experiments; we use BLEU and decoding time for multiobjective experiments.

### 3.5 Hyperparameter Importance/Correlation

We might be interested in whether good configurations are good across all datasets. This can be checked by ranking configurations by BLEU for each dataset and then measuring the correlation between the rankings. We show Spearman’s correlation coefficients in Figure 4. NMT systems for the same language pair (ja-en vs. en-ja) are highly correlated. In contrast, other pairs show low correlation (0.084 for ja-en vs. so-en), implying the need to run HPO separately on each new dataset.
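The rank-correlation check can be sketched with the standard library alone. The function names are illustrative, the BLEU values are toy numbers, and restricting the comparison to configurations present in both tables is an assumption made here because the per-dataset tables do not contain the same set of models.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(bleu_a, bleu_b):
    """Spearman correlation of BLEU over configurations shared by two datasets."""
    shared = sorted(set(bleu_a) & set(bleu_b))
    ranks_a = average_ranks([bleu_a[c] for c in shared])
    ranks_b = average_ranks([bleu_b[c] for c in shared])
    return pearson(ranks_a, ranks_b)


# Toy tables: configuration id -> BLEU (invented values).
dataset_a = {"c1": 10.0, "c2": 12.0, "c3": 15.0, "c4": 9.0}
dataset_b = {"c1": 1.0, "c2": 2.0, "c3": 9.0}
rho = spearman(dataset_a, dataset_b)  # computed over c1, c2, c3 only
```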

The table-lookup approach also enables in-depth analyses of how hyperparameters generally affect system performance. Following Klein and Hutter (2019), we assess the importance of hyperparameters with fANOVA, which computes the variation in BLEU when changing a specific hyperparameter with values of all the other hyperparameters fixed. In Figure 5, on en-ja, when considering only the top-performing NMT models (top left), #att_heads, init_lr, and #embed impact BLEU the most; over the entire configuration space (top middle), #embed is the distinguishing factor. The analysis can be extended to pairs of hyperparameters, where we observe that the interaction of init_lr and #embed is important (Figure 5 bottom left).

Questions may arise over whether the results on en-ja can be taken as general conclusions. We find that it is dataset-dependent: the hyperparameter importance ranking differs across language pairs and depends on the range and granularity of hyperparameters considered. As shown in the right column of Figure 5, bpe is the most important hyperparameter for sw-en, instead of #embed. This shows the diversity of our selected MT datasets, and suggests that hyperparameter importance analysis is a good tool for probing the search-space characteristics of these datasets.

### 3.6 Reproducible and Efficient Benchmarks

Our table-lookup dataset enables reproducible and efficient benchmarks for HPO of NMT systems. Li and Talwalkar (2019) introduce two notions of reproducibility: exact reproducibility (the reproducibility of reported experimental results); and broad reproducibility (the generalization of the experimental results).^{9} Our benchmarks are exactly reproducible in the sense that we provide the tables that record all model results (Section 3.3) and the code to run and evaluate our HPO algorithms (Section 6). However, they are not guaranteed to be broadly reproducible, because the generalizability of the results might be restricted by the fixed collections of hyperparameter configurations, the variance associated with multiple runs, and the unknown best representative set of MT data. As a result, in this work, we are careful not to draw general conclusions from our observations, and instead aim to show how the dataset can be used to facilitate HPO research.

## 4 Evaluation Protocols

To assess HPO method performance, we measure the **runtime** to reach a *quality indicator* (e.g., BLEU) target value. The **runtime** is defined as the number of NMT models trained, or equivalently the number of function evaluations *f*(**λ**) in Figure 2. We consider two ways to measure the HPO performance: **fixed-target** and **fixed-budget**.

### 4.1 Single-Objective Evaluation Metrics

For single-objective optimization, we have:

- **fixed-target best (ftb)**: We fix the quality indicator value to the best value in the dataset and measure runtime to reach this target.
- **fixed-target close (ftc)**: We measure the runtime to reach a target that is slightly less than the oracle best. This is useful when one can tolerate some performance loss.
- **fixed-budget (fb)**: We fix the budget of function evaluations and measure the difference between the oracle best quality indicator value (e.g., oracle best BLEU) in the dataset vs. the best value achieved by systems queried by the HPO method.

The fixed-budget metric asks what is the best possible system assuming a hard constraint on training resources. The fixed-target metrics ask how much training information is needed to find the best (or approximate best) system in the dataset.
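Given the sequence of BLEU scores that one HPO run looks up, the three single-objective metrics reduce to a few lines. This is a sketch with hypothetical names; the tolerance of 1.5 BLEU for ftc and the budget of 3 evaluations for fb are arbitrary illustrative choices, since the text only says “slightly less” and “fix the budget”.

```python
def fixed_target(trace, target):
    """Number of evaluations until a queried model reaches `target` BLEU (None if never)."""
    for runtime, bleu in enumerate(trace, start=1):
        if bleu >= target:
            return runtime
    return None


def fixed_budget(trace, oracle_best, budget):
    """Gap between the oracle-best BLEU and the best BLEU among the first `budget` queries."""
    return oracle_best - max(trace[:budget])


# BLEU of the models queried by one HPO run, in query order (toy values).
trace = [10.0, 12.5, 11.0, 14.0, 13.0]
oracle_best = 14.0

ftb = fixed_target(trace, oracle_best)            # 4
ftc = fixed_target(trace, oracle_best - 1.5)      # 2
fb = fixed_budget(trace, oracle_best, budget=3)   # 1.5
```

Lower is better for all three values: ftb and ftc count evaluations, and fb measures the remaining BLEU gap to the oracle.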

### 4.2 Multiobjective Evaluation Metrics

In practice, one might desire to optimize multiple objectives, such as translation accuracy and speed. Suppose we have *J* objectives, and they can be jointly represented as *F*(**λ**) = [*f*^{1}(**λ**), *f*^{2}(**λ**), ⋯, *f*^{J}(**λ**)]. As it is unlikely that any one **λ** will optimize all objectives simultaneously, we adopt the concept of *Pareto optimality* (Godfrey et al., 2007). In the context of minimization, **λ** is said to dominate **λ′**, that is, **λ** ≺ **λ′**, if *f*^{j}(**λ**) ≤ *f*^{j}(**λ′**) ∀*j* and *f*^{j}(**λ**) < *f*^{j}(**λ′**) for at least one *j*. If nothing dominates **λ**, we call it a *Pareto-optimal* solution. The set of all Pareto-optimal solutions is referred to as the *Pareto front*, that is, $\{\boldsymbol{\lambda} \mid \nexists\, \boldsymbol{\lambda}' \in \boldsymbol{\Lambda} : \boldsymbol{\lambda}' \prec \boldsymbol{\lambda}\}$. Intuitively, these are solutions satisfying all possible trade-offs in the multiobjective space. Figure 1 shows an example of Pareto solutions that maximize BLEU and minimize decoding time.
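The dominance relation and the Pareto front translate directly into code. In this sketch, both objectives are minimized, so BLEU is negated; the point values and function names are illustrative.

```python
def dominates(a, b):
    """True if objective vector `a` dominates `b` (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Points that are not dominated by any other point in the list."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Each point is (-BLEU, decoding time), so that both coordinates are minimized.
points = [(-20.0, 50.0), (-18.0, 30.0), (-18.0, 60.0), (-15.0, 30.0), (-10.0, 10.0)]
front = pareto_front(points)
# front == [(-20.0, 50.0), (-18.0, 30.0), (-10.0, 10.0)]
```

Here (-18.0, 60.0) is dominated by (-20.0, 50.0), and (-15.0, 30.0) is dominated by (-18.0, 30.0); the remaining three points each represent a different accuracy/speed trade-off.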

For multiobjective optimization, the quality indicator becomes the Pareto front, thus we have:

- **fixed-target all (fta)**: We measure the runtime to find all points on the Pareto front.
- **fixed-target one (fto)**: We measure the runtime to get at least one Pareto point.
- **fixed-budget (fbp)**: We fix the budget of function evaluations and measure the number of Pareto-optimal points obtained.
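The three multiobjective metrics can be computed from the order in which an HPO run queries configurations. The sketch below uses hypothetical configuration ids and invented objective values, with both objectives minimized (BLEU is negated); the function names are illustrative.

```python
def dominates(a, b):
    """True if objective vector `a` dominates `b` (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_configs(table):
    """Ids of configurations whose objective vectors are not dominated by any other."""
    return {c for c, v in table.items()
            if not any(dominates(w, v) for w in table.values())}


def multiobjective_metrics(trace, table, budget):
    """Return fta, fto (runtimes, None if never reached) and fbp for one HPO run."""
    front = pareto_configs(table)
    found = set()
    fto = fta = None
    for runtime, config in enumerate(trace, start=1):
        if config in front:
            found.add(config)
            if fto is None:
                fto = runtime          # first Pareto point found
            if fta is None and found == front:
                fta = runtime          # whole Pareto front found
    fbp = len(front & set(trace[:budget]))
    return {"fta": fta, "fto": fto, "fbp": fbp}


# Toy table: configuration id -> (-BLEU, decoding time).
table = {"a": (-20.0, 50.0), "b": (-18.0, 30.0), "c": (-18.0, 60.0),
         "d": (-15.0, 30.0), "e": (-10.0, 10.0)}
metrics = multiobjective_metrics(["c", "b", "d", "a", "e"], table, budget=4)
# metrics == {"fta": 5, "fto": 2, "fbp": 2}
```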

In the literature, a common way to compare HPO methods is to plot quality indicator value as a function of runtime on a graph (e.g., see Figure 6). The proposed metrics can be viewed as summary statistics drawn as line thresholds on such graphs (Hansen et al., 2016), where the budget/target is set to a value appropriate for the use case.

### 4.3 Repeated Trials

Some HPO methods may be sensitive to randomness in initial seeds {*λ*_{init}} (Feurer et al., 2015). We suggest that repeated randomized trials are important for a rigorous evaluation, and this is only feasible with a table-lookup dataset. In our experiments, we average results of HPO runs across 100 trials, where each trial is seeded with a different set of 3 random initial hyperparameter settings.
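As an illustration of why repeated trials matter, the sketch below repeats random search without replacement on a toy table and reports the mean and standard deviation of the ftb runtime. The function names and the toy table are hypothetical; the table size of 118 is chosen to match the number of zh-en models in Table 2, but the BLEU values are invented.

```python
import random
from statistics import mean, stdev


def random_search_ftb(bleu_table, rng):
    """Runtime for random search without replacement to reach the oracle-best BLEU."""
    best = max(bleu_table.values())
    order = list(bleu_table)
    rng.shuffle(order)
    for runtime, config in enumerate(order, start=1):
        if bleu_table[config] == best:
            return runtime


def repeated_trials(bleu_table, n_trials=100, seed=0):
    """Mean and standard deviation of the ftb runtime over independent trials."""
    rng = random.Random(seed)
    runtimes = [random_search_ftb(bleu_table, rng) for _ in range(n_trials)]
    return mean(runtimes), stdev(runtimes)


# Toy table with 118 configurations and a unique best model (invented BLEU values).
toy_table = {i: float(i) for i in range(118)}
ftb_mean, ftb_std = repeated_trials(toy_table)
```

With a unique best model among *N* configurations, the position of that model in a uniformly random order is uniform on 1..*N*, so the expected runtime is (*N* + 1)/2 with standard deviation $\sqrt{(N^2 - 1)/12}$. For *N* = 118 this gives 59.5 and about 34.1, which is in line with the 61±34 reported for RS on zh-en in Table 3. A single trial could land anywhere in that range, which is why averages over many trials are needed.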

## 5 Methods

We now describe the two HPO/SMBO methods used in our experiments. Bayesian optimization^{10} is a popular existing method; graph-based SMBO is a novel method that adapts ideas from graph-based semi-supervised learning to the HPO problem.

### 5.1 Bayesian Optimization (BO)

Given a target function $f: \boldsymbol{\Lambda} \to \mathbb{R}$, Bayesian optimization (Brochu et al., 2010; Shahriari et al., 2015; Frazier, 2018) aims to find an input $\boldsymbol{\lambda}_\star \in \arg\min_{\boldsymbol{\lambda} \in \boldsymbol{\Lambda}} f(\boldsymbol{\lambda})$. It models *f* with a posterior probability distribution *p*(*f*∣ℒ), where ℒ is a set of observed points. This posterior distribution is updated each time we observe *f* at a new point *λ*_{p}. The *utility* of each candidate point is quantified by an acquisition function $a: \boldsymbol{\Lambda} \to \mathbb{R}$, and $\boldsymbol{\lambda}_p \in \arg\max_{\boldsymbol{\lambda} \in \boldsymbol{\Lambda}} a(\boldsymbol{\lambda})$. In practice, a prominent choice for *p*(*f*∣ℒ) is Gaussian process regression, and a common acquisition function is Expected Improvement (EI).

#### 5.1.1 Gaussian Process Regression

A Gaussian process is specified by a mean function *m*(**λ**) and a covariance function or *kernel* *k*(**λ**, **λ′**), and the sufficient statistics of the posterior predictive distribution, *μ*(⋅)^{11} and Σ(⋅), can be computed with **K**_{★} = *k*(**Λ**_{observed}, **λ**) and **K** = *k*(**Λ**_{observed}, **Λ**_{observed}). In the case of HPO, the kernel *k*(⋅,⋅) measures the similarity between hyperparameter configurations and *μ*(⋅) is a prediction of the *f*(⋅) values of not-evaluated hyperparameters.
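For reference, the textbook form of the GP posterior predictive statistics in the notation above is given below. It assumes a zero prior mean and noise-free observations, and it introduces the symbol $\mathbf{f}_{\text{observed}}$ for the vector of observed *f* values; these are assumptions of this sketch, and the exact form used in the experiments may differ (for example, by including a noise term).

```latex
\[
\begin{aligned}
\mu(\boldsymbol{\lambda}) &= \mathbf{K}_{\star}^{\top}\,\mathbf{K}^{-1}\,\mathbf{f}_{\text{observed}}, \\
\Sigma(\boldsymbol{\lambda}) &= k(\boldsymbol{\lambda}, \boldsymbol{\lambda})
  - \mathbf{K}_{\star}^{\top}\,\mathbf{K}^{-1}\,\mathbf{K}_{\star}.
\end{aligned}
\]
```

Both quantities depend only on the observed configurations and the query point **λ**, not on the other unobserved configurations.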

#### 5.1.2 Expected Improvement (EI)

Expected improvement is defined relative to *f*_{min}, the best observed value thus far, with $\hat{f}(\boldsymbol{\lambda}) = \mu(\boldsymbol{\lambda})$. When the prediction $\hat{f}(\boldsymbol{\lambda})$ follows a normal distribution, as in the GP, EI can be computed in closed form. Our acquisition function computes EI for each point in the grid of hyperparameters and queries the one with the largest value.
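For a Gaussian prediction with mean μ and standard deviation σ, the standard closed form of EI for minimization is $(f_{min} - \mu)\,\Phi(z) + \sigma\,\phi(z)$ with $z = (f_{min} - \mu)/\sigma$, where Φ and φ are the standard normal CDF and PDF. A minimal sketch of this formula follows; the function name, argument names, and candidate values are illustrative.

```python
import math


def expected_improvement(mu, sigma, f_min):
    """Closed-form EI for minimization, given a Gaussian prediction N(mu, sigma^2)."""
    if sigma <= 0.0:
        # No predictive uncertainty: the improvement is deterministic.
        return max(f_min - mu, 0.0)
    z = (f_min - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (f_min - mu) * cdf + sigma * pdf


# Candidates as (predicted mean, predicted std); f is 1 - BLEU, so lower is better.
f_min = 0.80
candidates = {"c1": (0.82, 0.01), "c2": (0.85, 0.10), "c3": (0.79, 0.00)}
scores = {c: expected_improvement(mu, sigma, f_min) for c, (mu, sigma) in candidates.items()}
next_query = max(scores, key=scores.get)
```

In this toy example the uncertain candidate `c2` scores higher than the slightly better but certain `c3`, which shows how EI rewards exploration as well as exploitation.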

### 5.2 Graph-Based SMBO (GB)

Semi-supervised learning addresses the question of how to utilize a handful of labeled data and a large amount of unlabeled data to improve prediction accuracy. Graph-based semi-supervised learning (GBSSL, Zhu et al., 2003; Zhu, 2005) describes the structure of data with a graph, where each vertex is a data point and each weighted edge reflects the similarity between vertices. It makes a *smoothness* assumption that neighbors connected by edges tend to have similar labels, and labels can propagate throughout the graph.

In SMBO surrogate modeling, we hope to make predictions for the unlabeled or not-evaluated points in the hyperparameter space based on the information of labeled or evaluated points. If we pre-define the set of all potential points, then this becomes highly related to semi-supervised learning. From this point of view, we propose GBSSL equipped with suitable acquisition functions as a new SMBO method for searching over a grid of representative hyperparameter configurations.

#### 5.2.1 Graph-Based Regression

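Consider a graph *G* = (*V*, *E*) with nodes *V* corresponding to *n* points, of which ℒ denotes the set of labeled points {(*λ*_{1}, *f*(1)), ⋯, (*λ*_{l}, *f*(*l*))}, where *f*(*i*) is short for *f*(*λ*_{i}), and $\mathcal{U}$ denotes the set of unlabeled points {*λ*_{l+1}, ⋯, *λ*_{l+u}}, where *n* = *l* + *u*. The edges *E* are represented by an *n* × *n* weight matrix *W*. For instance, *W* can be given by the radial basis function (RBF) of the distance between *λ*_{i} and *λ*_{j}. *G* is not necessarily fully connected; in practice, *k*NN graphs with a small *k* turn out to perform well, where nodes *i*, *j* are connected if *i* is in *j*’s *k*-nearest-neighborhood or vice versa.^{12}

The smoothness of a labeling over the graph is measured by an *energy* function:

$$E(f) = \frac{1}{2} \sum_{i,j} W_{ij} \big(f(i) - f(j)\big)^2 \tag{5}$$

We fix *f*(*i*), *i* ∈ ℒ (or *f*_{L}) to be the true labels and aim to find *f*(*i*), *i* ∈ $\mathcal{U}$ (or *f*_{U}) that minimizes the energy.

With the diagonal degree matrix *D*, where $D_{ii} = \sum_j W_{ij}$, and the *combinatorial Laplacian* **Δ** = *D* − *W*, Equation (5) can then be rewritten as *E*(*f*) = *f*^{T}**Δ***f*. If we partition the Laplacian matrix into blocks:

$$\boldsymbol{\Delta} = \begin{bmatrix} \boldsymbol{\Delta}_{LL} & \boldsymbol{\Delta}_{LU} \\ \boldsymbol{\Delta}_{UL} & \boldsymbol{\Delta}_{UU} \end{bmatrix} \tag{6}$$

then we can obtain the *f*() values for unlabeled points by:

$$f_U = -\boldsymbol{\Delta}_{UU}^{-1} \boldsymbol{\Delta}_{UL} f_L \tag{7}$$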
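The graph-based regression step can be sketched without any matrix library. In the harmonic-function formulation of Zhu et al. (2003), the energy minimizer satisfies $f_U = (D_{UU} - W_{UU})^{-1} W_{UL} f_L$, which means every unlabeled value equals the weighted average of its neighbors. The sketch below reaches the same fixed point by iterating that averaging rule instead of inverting a matrix. The function names, the RBF bandwidth form $\exp(-\lVert \lambda_i - \lambda_j \rVert^2 / (2\sigma^2))$, the iteration count, and the toy chain graph are assumptions made for illustration.

```python
import math


def rbf_weights(points, sigma=1.0):
    """Dense RBF weight matrix with zero diagonal (the bandwidth form is an assumption)."""
    n = len(points)
    return [[0.0 if i == j else
             math.exp(-sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                      / (2 * sigma ** 2))
             for j in range(n)] for i in range(n)]


def label_propagation(weights, labels, n_iter=1000):
    """Clamp labeled nodes and repeatedly average neighbours for unlabeled nodes.

    weights: symmetric n x n matrix (list of lists) with zero diagonal.
    labels:  dict mapping a labeled node index to its observed f value.
    """
    n = len(weights)
    f = [labels.get(i, 0.0) for i in range(n)]
    unlabeled = [i for i in range(n) if i not in labels]
    for _ in range(n_iter):
        for i in unlabeled:
            degree = sum(weights[i])
            if degree > 0.0:
                # Harmonic update: weighted average of the neighbours' current values.
                f[i] = sum(w * fj for w, fj in zip(weights[i], f)) / degree
    return f


# Toy chain graph 0 - 1 - 2 - 3 with unit weights; the two end nodes are labeled.
chain = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
predictions = label_propagation(chain, {0: 0.0, 3: 1.0})
# predictions is approximately [0.0, 1/3, 2/3, 1.0]
```

The iteration converges when every connected component of the graph contains at least one labeled node; for large grids a sparse linear solver would be the more usual choice.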

#### 5.2.2 Expected Influence (EIF)

We propose a novel acquisition function called *expected influence* that exploits the graph structure. The idea is to query the point that, if its *f*() were observed, would have the highest potential to change the *f*() of all other points as we re-run label propagation through the graph.

We first scale the labels on the graph *f*(*i*) ∈ ℝ to be between 0 and 1. The best labeled point is set to 1; for the other labeled points, we first compute the probability that a random walk starting at 1 reaches it, then set the label to be 1 if the probability is larger than 0.5 and 0 otherwise.

If we were to query an unlabeled point *k*, there are two scenarios: Its label is either 1 with probability *f*(*k*) or 0 with probability 1 − *f*(*k*). For each scenario, we then consider including *k* as a newly added “labeled” point and re-running label propagation. $f^{+(\lambda_k, 1)}(i)$ are the new predictions for points *i* in the scenario where *k* is added with label 1. If *k* is an influencer in the positive direction, this means that many points *i* will now have large $f^{+(\lambda_k, 1)}(i)$; otherwise, $f^{+(\lambda_k, 1)}(i)$ might be small on average in magnitude. On the other hand, suppose we add *k* with label 0 and run label propagation again to obtain new predictions $f^{+(\lambda_k, 0)}(i)$. If *k* is an influencer in the negative direction, this means that $f^{+(\lambda_k, 0)}(i)$ will be small (or conversely $1 - f^{+(\lambda_k, 0)}(i)$ will be large).

We then query the point *p* that maximizes the expected influence over these two scenarios.
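One way to write the criterion described in the preceding paragraphs is to weight the influence in each scenario by its probability. The expression below is a reading of that description and should be treated as an assumption; in particular, the range of the sums over *i* and any normalization are not specified here.

```latex
\[
\boldsymbol{\lambda}_p = \arg\max_{k \in \mathcal{U}}
  \Big[ f(k) \sum_{i} f^{+(\lambda_k, 1)}(i)
  \;+\; \big(1 - f(k)\big) \sum_{i} \big(1 - f^{+(\lambda_k, 0)}(i)\big) \Big]
\]
```

The first term rewards points that would pull many predictions up if their label turned out to be 1, and the second rewards points that would pull many predictions down if their label turned out to be 0.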

### 5.3 BO vs. GB

There is a connection between BO and GB due to the link between GPs and graphs. The GB method defines a Gaussian random field on the graph, which is a multivariate Gaussian distribution on the nodes. This is equivalent to “finite set” GPs. Zhu (2005) showed that the kernel matrix *K* of the finite set GP is equivalent to the inverse of a function of the graph Laplacian **Δ**, that is, $K = (2\beta(\boldsymbol{\Delta} + I/\sigma^2))^{-1}$.^{13} The difference between the finite set GP and the GP is that the kernel matrix of the former is defined on $\mathcal{L} \cup \mathcal{U}$, while that of the latter is defined on **Λ**. As a semi-supervised method, the label propagation rule of GB (Equation (7)) shows that all the nodes on the graph contribute to the prediction of a single unlabeled node, whereas for GP, the posterior predictive distribution of a new point does not depend on other unlabeled points, as shown by Equation (1).

The main advantage of GB is that it offers flexibility to build graphs over the search space. For instance, one can build a graph with configurations from different model architectures, for example, RNN, CNN, and Transformers. Nodes of the same architecture might gather into a cluster, and clusters can be connected with each other. One can also manipulate the edge weights by manually defined heuristics. One example of such rules could be Euclidean distance scaled by hyperparameter importance. We leave this as future work.

The theoretical caveat of the GB method is that it is restricted to a discrete search space defined by a graph. If a dense grid is desired to mimic a continuous search space, increasing time and space complexity would make it a less efficient method.

### 5.4 Multiobjective Optimization

For multiobjective optimization, we can use the same surrogate models to estimate each $\hat{f}^j$ independently, but we need a new acquisition function that considers the Pareto front. Various methods have been proposed (Zitzler and Thiele, 1998; Ponweiser et al., 2008; Picheny, 2015; Shah and Ghahramani, 2016; Svenson and Santner, 2016). Here, we adopt the expected hypervolume improvement (EHVI) method (Emmerich et al., 2011), which is a generalization of EI. EHVI serves as an *infill criterion* and can be combined with different surrogate models. Algorithm 1 provides pseudo-code for the framework.
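EHVI builds on the hypervolume indicator: the size of the region that is dominated by the current Pareto front and bounded by a reference point. The sketch below computes this for two minimized objectives, together with the deterministic hypervolume improvement of a single candidate. It is not EHVI itself, which additionally takes the expectation of this improvement under the surrogate's predictive distribution. The function names, the reference point, and the point values are illustrative assumptions.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by `ref` (both objectives minimized)."""
    # Only points strictly better than the reference in both objectives contribute.
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    area, best_y = 0.0, ref[1]
    for x, y in pts:
        if y < best_y:
            # New horizontal strip between the previous best y and this point's y.
            area += (ref[0] - x) * (best_y - y)
            best_y = y
    return area


def hypervolume_improvement(candidate, points, ref):
    """Gain in dominated area if `candidate` were added to `points`."""
    return hypervolume_2d(list(points) + [candidate], ref) - hypervolume_2d(points, ref)


front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
ref = (4.0, 4.0)
base = hypervolume_2d(front, ref)                        # 6.0
gain = hypervolume_improvement((2.0, 1.0), front, ref)   # 1.0
none = hypervolume_improvement((3.0, 3.0), front, ref)   # 0.0 (dominated candidate)
```

A candidate that is dominated by the current front adds no area, so an acquisition function based on this quantity naturally prefers points that extend the front.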

## 6 Experiments and Results

We evaluate HPO methods on six NMT tasks with the provided benchmark dataset and report their performance measured by three runtime-based assessment metrics mentioned in Section 4. The code base is provided to ensure reproducibility.^{14}

### 6.1 Single-Objective Optimization

For single-objective optimization, our goal is to find a hyperparameter configuration giving the highest BLEU score over a predefined grid.

#### 6.1.1 Experimental Comparison

We run the comparison with two surrogate models, two kernels,^{15} and two acquisition functions, leading to the following HPO systems, where all the GB systems are introduced by this work:

- **RS**: random search (Bergstra and Bengio, 2012), which uniformly samples hyperparameter configurations at random over the grid.
- **BO_EI_M**: GP-based BO with Matérn52 covariance function and expected improvement as acquisition function.
- **BO_EI_R**: GP-based BO with RBF kernel and EI as acquisition function.
- **GB_EI_M**: GB with Matérn52 kernel and EI as acquisition function.^{16}
- **GB_EI_R**: GB with RBF kernel and EI.
- **GB_EIF_M**: GB with Matérn52 kernel and expected influence as acquisition function.
- **GB_EIF_R**: GB with RBF and EIF.

We use the George library (Ambikasaran et al., 2014) for GP implementation. For all the methods, configurations are sampled without replacement.

#### 6.1.2 Results

Results for single-objective optimization are summarized in Table 3:

- RS always needs to explore roughly half of all the NMT models to find the best one (ftb).
- The effectiveness of BO is confirmed: on sw-en, BO_EI_M takes only 10% of the runtime used by RS to reach the optimum.
- For ftb, the best GB outperforms the best BO on four of the six datasets: on en-ja, GB_EI_M reduces the ftb runtime of BO_EI_M by 38 (from 60 to 22). GB_EIF often works better than GB_EI.
- The Matérn and RBF kernels perform almost equally well for both BO and GB.
- Varying the initialization can result in noticeable variance in performance. We suggest that researchers run enough random trials when evaluating HPO systems.
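The observation about RS can be sanity-checked with a small simulation. Under uniform sampling without replacement, the position at which the single best configuration is drawn is uniform over 1..n, so its expectation is (n+1)/2 and its standard deviation is roughly n/√12, which is also large. The sketch below uses a synthetic table of made-up scores and looks for the exact best configuration with no tolerance; the grid size, score range, and function names are illustrative assumptions.

```python
import random
import statistics

def ftb_random_search(scores, rng):
    """Number of lookups random search needs before it draws the best configuration."""
    order = list(range(len(scores)))
    rng.shuffle(order)                           # uniform sampling without replacement
    best = max(range(len(scores)), key=scores.__getitem__)
    return order.index(best) + 1

def simulate(n_configs=100, n_trials=2000, seed=0):
    """Mean and standard deviation of ftb for random search on a synthetic table."""
    rng = random.Random(seed)
    scores = [rng.uniform(5.0, 35.0) for _ in range(n_configs)]  # made-up BLEU scores
    runs = [ftb_random_search(scores, rng) for _ in range(n_trials)]
    return statistics.mean(runs), statistics.stdev(runs)
```

The same loop structure applies to any HPO method on a lookup table: replace the shuffled order with the order in which the method selects configurations.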

| Method | zh-en ftb | zh-en ftc | zh-en fb | ru-en ftb | ru-en ftc | ru-en fb | ja-en ftb | ja-en ftc | ja-en fb |
|---|---|---|---|---|---|---|---|---|---|
| RS | 61±34 | 14±11 | 0.26±0.25 | 79±47 | 20±17 | 0.42±0.29 | 71±43 | 16±15 | 0.40±0.24 |
| BO_EI_M | 29±19 | 13±9 | 0.24±0.24 | 41±19 | 26±17 | 0.51±0.36 | 27±17 | 16±15 | 0.39±0.45 |
| BO_EI_R | 24±15 | 11±8 | 0.22±0.26 | 40±26 | 20±13 | 0.44±0.37 | 20±11 | 13±9 | 0.33±0.44 |
| GB_EI_M | 84±15 | 13±8 | 0.35±0.21 | 50±34 | 18±17 | 0.35±0.25 | 23±7 | 6±3 | 0.14±0.11 |
| GB_EI_R | 86±15 | 12±7 | 0.33±0.20 | 51±32 | 18±17 | 0.35±0.28 | 21±6 | 6±3 | 0.10±0.12 |
| GB_EIF_M | 19±21 | 8±5 | 0.11±0.17 | 32±18 | 22±13 | 0.46±0.31 | 13±4 | 6±2 | 0.01±0.04 |
| GB_EIF_R | 13±20 | 6±4 | 0.06±0.15 | 28±17 | 17±12 | 0.33±0.30 | 13±3 | 6±2 | 0.01±0.05 |

| Method | en-ja ftb | en-ja ftc | en-ja fb | sw-en ftb | sw-en ftc | sw-en fb | so-en ftb | so-en ftc | so-en fb |
|---|---|---|---|---|---|---|---|---|---|
| RS | 71±46 | 12±10 | 0.71±0.37 | 334±201 | 186±152 | 2.45±0.97 | 301±161 | 39±39 | 0.63±0.32 |
| BO_EI_M | 60±29 | 15±17 | 0.86±0.60 | 33±17 | 29±17 | 1.60±1.41 | 65±62 | 19±21 | 0.41±0.36 |
| BO_EI_R | 62±36 | 13±12 | 0.79±0.58 | 55±47 | 33±24 | 1.42±1.33 | 52±70 | 13±11 | 0.24±0.30 |
| GB_EI_M | 22±20 | 11±11 | 0.42±0.57 | 63±37 | 62±36 | 3.56±0.95 | 187±99 | 61±28 | 1.17±0.44 |
| GB_EI_R | 24±21 | 13±12 | 0.47±0.59 | 56±26 | 55±26 | 3.39±0.95 | 201±104 | 62±29 | 1.16±0.44 |
| GB_EIF_M | 47±22 | 9±7 | 0.63±0.32 | 58±24 | 57±24 | 3.13±0.51 | 42±30 | 26±8 | 0.48±0.13 |
| GB_EIF_R | 45±22 | 10±7 | 0.69±0.39 | 59±25 | 58±25 | 3.15±0.52 | 42±30 | 28±7 | 0.49±0.12 |


### 6.2 Multiobjective Optimization

We now show benchmarks for multiobjective optimization. Our goal is to search for configurations that achieve higher BLEU and lower decoding time.
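For readers less familiar with Pareto optimality, the sketch below extracts the Pareto-optimal configurations from a list of (BLEU, decoding time) pairs: a configuration is kept if no other configuration is at least as good on both objectives and strictly better on one. The tuple layout and function name are illustrative assumptions.

```python
def pareto_front(configs):
    """Return names of configurations not dominated by any other.

    configs: list of (name, bleu, decode_time) tuples;
    BLEU is maximized and decoding time is minimized.
    """
    front = []
    for name, bleu, time in configs:
        dominated = any(
            b >= bleu and t <= time and (b > bleu or t < time)
            for _, b, t in configs
        )
        if not dominated:
            front.append(name)
    return front
```

This quadratic-time check is adequate for grids with a few hundred to a few thousand configurations.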

#### 6.2.1 Experimental Comparison

We run the comparison on the following systems, where GB systems are introduced by this work:

- **RS**: random search, uniformly samples the configurations at random.
- **BO_M**: GP-based BO equipped with Matérn kernel and EHVI as the infill criterion.
- **BO_R**: GP-based BO with RBF kernel and EHVI.
- **GB_M**: GB equipped with Matérn kernel and EHVI as the infill criterion.
- **GB_R**: GB with RBF kernel and EHVI.

#### 6.2.2 Results

The multiobjective optimization evaluation results are summarized in Table 4:

- RS is a poor choice for multiobjective optimization if one aims to quickly collect as many Pareto-optimal configurations as possible: to find all the true optima, RS usually needs to go through the whole search space (fta), and with a fixed budget it obtains far fewer Pareto points than the other methods (fbp).
- BO is generally superior across datasets. On sw-en, it takes less than half of the time that RS needs to find the Pareto set (344 vs. 719), and it finds 8.6 more Pareto points than RS with 200 NMT models evaluated.
- GB performs comparably to BO on four datasets, whereas on sw-en and so-en BO noticeably outperforms GB, suggesting that GB might not be a perfect solution for multiobjective tasks.
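The three columns of Table 4 can be computed from the order in which a method evaluates configurations. The sketch below is one reading of the metrics as they are used in the discussion above, and is an assumption rather than the benchmark's own code: fto is taken as the number of evaluations until the first Pareto-optimal point is found, fta as the number until all J Pareto-optimal points are found, and fbp as the number of Pareto-optimal points found within a budget of B evaluations.

```python
def multiobjective_metrics(order, pareto_set, budget):
    """Compute (fto, fta, fbp) for one run of an HPO method.

    order: config ids in the order the method evaluated them.
    pareto_set: set of ids of the true Pareto-optimal configurations.
    budget: number of evaluations allowed for fbp.
    fto and fta are None if the target is never reached.
    """
    found = set()
    fto = fta = None
    fbp = 0
    for step, cfg in enumerate(order, start=1):
        if cfg in pareto_set:
            found.add(cfg)
            if fto is None:
                fto = step                      # first Pareto-optimal point found
            if fta is None and len(found) == len(pareto_set):
                fta = step                      # all Pareto-optimal points found
        if step == budget:
            fbp = len(found)                    # snapshot at the budget
    if budget > len(order):
        fbp = len(found)                        # run ended before the budget
    return fto, fta, fbp
```

Averaging these values over many randomly initialized runs gives mean and standard deviation entries of the kind reported in the table.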

| Method | zh-en fto | zh-en fta (J=3) | zh-en fbp (B=50) | ru-en fto | ru-en fta (J=4) | ru-en fbp (B=50) | ja-en fto | ja-en fta (J=5) | ja-en fbp (B=50) |
|---|---|---|---|---|---|---|---|---|---|
| RS | 30±24 | 88±22 | 1.3±0.8 | 33±26 | 139±28 | 1.3±0.9 | 21±18 | 129±20 | 1.7±1.0 |
| BO_M | 24±16 | 81±16 | 1.7±0.7 | 16±14 | 80±26 | 2.4±0.9 | 17±13 | 77±28 | 3.3±1.3 |
| BO_R | 20±13 | 75±15 | 1.8±0.5 | 17±15 | 84±32 | 2.4±1.0 | 18±14 | 94±32 | 2.8±1.2 |
| GB_M | 24±16 | 85±16 | 1.8±0.6 | 17±14 | 102±30 | 1.9±0.9 | 16±12 | 103±21 | 2.4±1.1 |
| GB_R | 24±15 | 90±12 | 1.7±0.6 | 17±12 | 103±30 | 2.0±0.9 | 19±12 | 107±20 | 2.2±1.0 |

| Method | en-ja fto | en-ja fta (J=8) | en-ja fbp (B=50) | sw-en fto | sw-en fta (J=14) | sw-en fbp (B=200) | so-en fto | so-en fta (J=7) | so-en fbp (B=200) |
|---|---|---|---|---|---|---|---|---|---|
| RS | 17±16 | 150±17 | 2.5±1.4 | 54±51 | 719±47 | 3.4±1.7 | 88±73 | 534±55 | 2.1±1.3 |
| BO_M | 15±10 | 100±34 | 4.6±1.7 | 26±20 | 344±201 | 12.0±2.8 | 30±21 | 321±113 | 5.1±1.2 |
| BO_R | 17±13 | 93±30 | 4.3±2.0 | 28±27 | 454±153 | 10.0±2.2 | 31±25 | 399±129 | 4.7±1.4 |
| GB_M | 17±13 | 121±28 | 4.0±1.5 | 59±75 | 469±198 | 7.8±4.3 | 61±63 | 447±99 | 2.9±1.4 |
| GB_R | 17±14 | 119±24 | 3.6±1.5 | 58±75 | 509±193 | 7.4±4.1 | 66±58 | 426±102 | 2.9±1.4 |


## 7 Analysis

### 7.1 HPO Algorithm Behavior

Section 6 shows how to rigorously compare HPO methods based on various performance metrics. Here we illustrate examples of how to obtain deeper insights into HPO algorithm behavior using the table-lookup framework.

For single-objective optimization, we compare the best BLEU and the mean squared error (MSE), that is, the average squared difference between ground-truth BLEU and predictions, achieved by different HPO methods over time. Figure 6 (left) shows that BO and GB converge much faster than RS, and that GB is superior over time. This can be partly explained by Figure 6 (right): GB already fits the data well at the beginning, while BO starts from a much larger MSE that decreases gradually.
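Because the lookup table stores the ground-truth BLEU of every configuration, the surrogate's MSE can be computed over the whole grid at each iteration. A minimal sketch of that computation follows; the function name and the list-based inputs are illustrative assumptions.

```python
def surrogate_mse(true_bleu, predicted_bleu):
    """Average squared difference between ground-truth BLEU and surrogate predictions."""
    if len(true_bleu) != len(predicted_bleu) or not true_bleu:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(true_bleu, predicted_bleu)]
    return sum(squared_errors) / len(squared_errors)
```

Tracking this value after every surrogate refit yields a curve of the kind described for Figure 6 (right).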

For multiobjective optimization, we show the evolution of Pareto-optimal fronts in Figure 7. There is a trend that Pareto fronts are moving towards the lower right corner at each iteration, verifying the effectiveness of our HPO methods.

### 7.2 Effect of Random Initialization

NMT training might not be deterministic due to the random initialization of model parameters. All the experimental results so far are obtained from a single run using one random seed. To explore the variance in model performance induced by initialization effects, we fix the hyperparameter configurations and train models initialized with various random seeds. Specifically, we select five hyperparameter configurations^{20} and retrain each of them an additional five times with different random initializations. We did this for two datasets: the low-resource sw-en task and the larger WMT2019 ja-en task.

The results on ja-en and sw-en are shown in Figure 8. In most cases the variance in performance stays within a small range, and the ranking of configurations remains about the same across random seeds. Based on this observation, we think that it is a reasonable strategy to use a single run to build table-lookup datasets; at the same time, it should be understood that the BLEU scores in the lookup table are only approximations. We note that there can be a few cases where the variance is large, and this might be best addressed by inventing HPO methods that explicitly account for such uncertainty.

## 8 Related Work

To alleviate the computational burden of benchmarking HPO methods and to improve research reproducibility, several studies have explored the table-lookup framework. Klein and Hutter (2019) published a mix of datasets focusing on feed-forward neural networks. Ying et al. (2019) released a dataset of convolutional architectures for image classification problems. To the best of our knowledge, this work is the first that focuses on NMT and transformer models.

One challenge with table-lookup is that sufficient coverage of the hyperparameter grid is assumed. Eggensperger et al. (2015) and Klein et al. (2019) propose using a predictive meta-model trained on a table-lookup benchmark to approximate hyperparameters that are not in the table. This is an interesting avenue for future work.

Studies on HPO for NMT are scarce. Qin et al. (2017) propose an evolution strategy–based HPO method for NMT. So et al. (2019) apply NAS to Transformer on NMT tasks. There is also work on empirically exploring hyperparameters and architectures of NMT systems (Bahar et al., 2017; Britz et al., 2017; Lim et al., 2018), though the focus is on finding general best-practice configurations. This differs from the goal of HPO, which aims to find the best configuration specific to a given dataset.

## 9 Conclusions

In this paper, we presented a benchmark dataset for hyperparameter optimization of neural machine translation systems. We provided multiple evaluation protocols and analysis approaches for comparing HPO methods. We benchmarked Bayesian optimization and a novel graph-based semi-supervised learning method on the dataset for both single-objective and multiobjective optimization. Our hope is that this kind of dataset will facilitate reproducible research and rigorous evaluation of HPO for complex and expensive models.

## Acknowledgments

This work is supported in part by an Amazon Research Award and an IARPA MATERIAL grant. We are especially grateful to Michael Denkowski for helpful discussions and feedback throughout the project.

## Notes

We focus on SMBO methods in this paper, but note that our dataset is amenable to any HPO method.

Same number of BPE operations is used for both sides.

Same values are used for encoder and decoder.

In this paper, we only focused on integer and real-valued hyperparameters. Categorical hyperparameters need special treatment for most HPO algorithms, thus are not considered.

Note that not all possible hyperparameter configurations are included in the dataset: We excluded ones where training failed or clearly did not learn (e.g., achieved ≈ 0 BLEU).

The ranking is computed only on the subset of MT systems common in all datasets. For this, we consider 30k bpe (for zh, ru, ja, en) to be equivalent to 32k bpe (for sw, so).

They comment: “Of the 12 papers published since 2018 at NeurIPS, ICML, and ICLR that introduce novel Neural Architecture Search methods, none are exactly reproducible.”

For simplicity, we assume a mean of 0 for the prior.

In experiments, based on initial tuning, we set *k*NN so that each point has on average $n/7$ neighbors.

*β* and *σ* are adjustable parameters.

We choose Matérn52 and RBF kernel because they exhibit different properties and are both frequently used in the literature. As shown in Rasmussen (2003), a parameter *ν* of the Matérn class of covariance functions can affect the smoothness of the functions drawn from GP. For $\nu = 1/2$, the process becomes very rough, and for $\nu \to \infty$, the covariance function converges to RBF kernel.

We can make an equivalence between the covariance matrix in multivariate Gaussian distribution and the inverse of a function of the graph Laplacian **Δ** (see Section 5.3 for details), so EI can also be applied to GB models.

Except for en-ja, where tolerance is set to 1 BLEU, because BLEU difference between top two models is > 0.5.

Including three initial evaluations.

Budget is adjusted based on the size of search space.

Four of these are randomly selected. We also include the configuration that achieved the best BLEU in Table 2.