## Abstract

This article deals with Gaussian process surrogate models for the Covariance Matrix Adaptation Evolutionary Strategy (CMA-ES)—several already existing and two by the authors recently proposed models are presented. The work discusses different variants of surrogate model exploitation and focuses on the benefits of employing the Gaussian process uncertainty prediction, especially during the selection of points for the evaluation with a surrogate model. The experimental part of the article thoroughly compares and evaluates the five presented Gaussian process surrogate and six other state-of-the-art optimizers on the COCO benchmarks. The algorithm presented in most detail, DTS-CMA-ES, which combines cheap surrogate-model predictions with the objective function evaluations in every iteration, is shown to approach the function optimum at least comparably fast and often faster than the state-of-the-art black-box optimizers for budgets of roughly 25–100 function evaluations per dimension, in 10- and less-dimensional spaces even for 25–250 evaluations per dimension.

## 1 Introduction

Computers have been used to solve continuous optimization problems in engineering and many other areas for decades. A broad set of these problems is specified by *black-box objective functions*—functions without a known mathematical expression and with values gathered only experimentally or via a time-demanding computer simulation. Such functions can be found, for example, in engineering design optimization (Bekasiewicz and Koziel, 2017), computational fluid dynamics (CFD) simulations (Lee et al., 2016), or in the search for optimal materials (Baerns and Holeňa, 2009; Zaefferer et al., 2016). Because of the time or cost demands, the allowable budget of such evaluations is often strictly limited, and reaching a reasonable value of the objective function (also known as *fitness* in the evolutionary context) in as few evaluations as possible is usually the prime criterion for assessing black-box optimizers.

The early successful black-box optimization algorithms, such as those proposed by Nelder and Mead (1965) or Brent (1973), have been heavily used for decades (both are, for example, a standard part of MATLAB). However, these algorithms are often unable to escape from the local optimum of the function's attractor where they were started. Therefore, stochastic algorithms such as evolutionary algorithms have become increasingly popular for global optimization.

An important group of global optimizers is formed by Bayesian optimization algorithms, introduced in the 1970s by Mockus and Žilinskas (1972) and Mockus et al. (1978). Nowadays, Efficient Global Optimization (EGO) (Jones, 2001) and its successors (Hutter et al., 2011, 2013) are used most often. These algorithms estimate the prediction uncertainty, usually using Gaussian processes, and they work perfectly globally for very few *fitness function evaluations per dimension* (FE/D). However, they often get stuck in a local optimum after several tens of FE/D.

For higher fitness evaluations budgets, the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) by Hansen and Ostermeier (2001) stands for one of the most successful evolutionary continuous black-box optimizers of the last two decades. It iteratively samples $\lambda $ points from a Gaussian distribution $N(m,\sigma 2C)$ and then recalculates the distribution parameters $m$, $\sigma $ and $C$ based on the $\mu $ best fitness-evaluated points. The CMA-ES profits from its invariant properties, particularly on ill-conditioned or highly multi-modal functions, or in high-dimensional spaces when the allowable number of FE/D is higher than roughly several hundreds.

The CMA-ES learns some core characteristics of the fitness via perturbations of its covariance matrix $C$ and the step-size $\sigma $. Nevertheless, the exact information from the passed evaluations can be used more intensively via *surrogate modeling*—a technique in which a regression model of the objective function is used interchangeably with the function itself. While spending extra CPU time for model construction, surrogate models have been shown to save some of the original-fitness evaluations; see Jin (2011) or Rios and Sahinidis (2012). Surrogate models have been increasingly often used in many optimization algorithms, including the CMA-ES. The acceleration of the CMA-ES via surrogates was shown, for example, by Auger et al. (2004), Kern et al. (2006) and Loshchilov et al. (2013) and it also constitutes the research objective of this article.

A particular position among existing surrogate models is occupied by Gaussian processes (GPs). For an unseen point, they return not only the expected value but the whole probability distribution of the expected output. The variance of this distribution can serve as an uncertainty measure of the prediction.

This article reviews previous attempts to use Gaussian processes as surrogates for the CMA-ES and thoroughly compares most of them with each other as well as with a few other important black-box optimization methods. The empirical comparison is based on the COCO benchmark suite (Hansen et al., 2016)—a platform repeatedly used for assessing black-box optimizers for the past eight years.

As a new contribution to the development of GP-based surrogate models for the CMA-ES, this article introduces two novel algorithms: S-CMA-ES (Surrogate CMA-ES) and DTS-CMA-ES (Doubly Trained Surrogate CMA-ES). The former switches between the CMA-ES iterations evaluating the population with the original fitness and the iterations using a GP or random-forest surrogate model. The latter uses a GP surrogate model for the most promising points in every iteration, and the selection of these points relies on the GP ability to estimate the whole distribution of the predicted fitness value.

The principles of the S-CMA-ES and DTS-CMA-ES were recently presented in conference articles (Bajer et al., 2015; Pitra et al., 2016, 2017). This article more thoroughly describes the details of the employed Gaussian process surrogate models, evaluates their parameter settings, reinvestigates the criteria for selecting points for the evaluation with the original fitness in the DTS-CMA-ES evolution control, presents such a new criterion, and brings new experimental results of the proposed algorithms. The DTS-CMA-ES both with and without self-adaptation is shown to converge to optima on average faster or comparably to the fastest black-box optimizers in dimensions up to $D\u226410$ starting from approximately 25 FE/D and in dimension $D=20$ between approximately 25 and 100 FE/D. It is worth mentioning that the versions without and with the self-adaptation were placed the first and second, respectively in the two single-objective tracks of the *Black Box Optimization Competition* at the conference GECCO 2017.^{1}

Related surrogate-assisted CMA-ES algorithms are briefly reviewed in the next section, followed by the foundations of the CMA-ES and Gaussian process surrogate modeling in Section 3. Section 4 provides the description of our new algorithms which is supplemented by their implementation details and parameter settings in the first half of Section 5 and by the experimental comparison of 13 different algorithms in the second half. Section 6 concludes the article and suggests a few future research objectives.

## 2 Related Works

Surrogate modeling has been used for accelerating expensive optimization for several decades, but we will restrict our attention only to its combination with the CMA-ES while primarily focussing on Gaussian process models.

Surrogate models were probably first used with the CMA-ES by Jin et al. (2001, 2002). Their framework uses multilayer perceptron networks with the *individual-* and *generation-based* evolution control (EC) strategies. In each generation, the individual-based EC selects a smaller number $\lambda $ of points for evaluation with the original fitness (*original-evaluated* points in the following) out of a larger set of $\lambda Pre$ model-evaluated points: the authors suggest choosing them randomly or selecting the points with the best model fitness. The generation-based EC uses cycles of $\Lambda $ generations. It switches between $\eta $ generations in which all the points are model-evaluated and $(\Lambda -\eta )$ generations evaluated with the original fitness. The proportions $\lambda /\lambda Pre$ and $\eta /\Lambda $ of the original-fitness evaluations in both strategies are proportional to the current model error divided by the highest measured error.

**LS-CMA-ES**. The *LS-CMA-ES*, introduced by Auger et al. (2004), uses a quadratic model approximation of the fitness trained on the recent $O(n2)$ points. This approximation is shown to provide a faster update of the CMA-ES covariance matrix when applied to the fitness functions similar to ellipsoid. The algorithm also automatically controls whether to use the model for the CMA-ES update or not.

**lmm-CMA**. Two years later, Kern et al. (2006) presented their *Local meta-model CMA-ES* (lmm-CMA) which uses a set of local quadratic models to predict ranking of the CMA-ES' $\mu $ best points from each generation (provided that there are enough points to construct the model). Being inspired by the approximate ranking procedure of Runarsson (2004), the lmm-CMA first evaluates the current population with the models and then iteratively evaluates the best-predicted points with the original fitness until the difference in ranking of the best $\mu $ points between two consecutive iterations becomes low enough. This algorithm was later slightly modified by Auger et al. (2013).

**$s*$ACM-ES**. Recently, Loshchilov et al. (2013) incorporated the Ranking SVM model into the CMA-ES in their *$s*$ACM-ES* algorithm. The Ranking SVM is trained to predict only ranking of the input points, which perfectly suits the CMA-ES' invariance with respect to monotonous fitness scaling. In addition, the CMA-ES covariance matrix $C$ is used in the SVM as a fixed kernel, which results in low computational complexity of the model fitting.

Although Gaussian processes have been studied in probability theory for many decades, they started to be widely used as a regression model only about twenty years ago: it was due to advances in computational power and because of clarifying contributions like those by Neal (1994, 1997), Williams and Rasmussen (1996) and MacKay (1998). Since that time, Gaussian processes have been established as a model with a decent number of parameters, and they are used in many optimization domains, as long as the number of continuous problem variables does not exceed few dozen and the function being regressed is smooth enough.

**MA-ES**. To the best of our knowledge, the first combination of the CMA-ES optimizer and a Gaussian process model was in the *Metamodel-Assisted Evolution Strategy* (MA-ES) proposed by Ulmer et al. (2003). According to Jin's terminology, the algorithm employs individual-based evolution control—it uses a GP surrogate model for the pre-selection of points for evaluation with the original fitness. The algorithm pre-selects points based on one of two criteria: the best GP mean prediction, or the *probability of improvement* (PoI)—the probability that the considered point's function value will drop below some value, usually the so-far achieved optimum $fmin$ (see Section 4.2). The authors have shown that the latter criterion, which uses the GP's prediction of uncertainty, performs better especially on multi-modal functions like Ackley's or Griewank. The same algorithm name was used for the GP-surrogate evolutionary strategy (not necessarily the CMA-ES) by Emmerich et al. (2002, 2006) who also showed superior performance of the criteria exploiting the GP uncertainty on multi-modal functions.

**GPOP**. Büche et al. (2005) combined GPs and the CMA-ES in the opposite way. In fact, their *Gaussian process optimization procedure* (GPOP) is a Bayesian optimization algorithm (similar to the EGO mentioned above) which additionally restricts the search around the current best point $x^opt$. In each iteration, the GPOP constructs a GP model of the local neighborhood of $x^opt$, and the CMA-ES is then used to locate the optimum of this model for evaluation with the original fitness. The CMA-ES optimizes functions defined as $(y^(x)-\alpha s^(x))$ where $y^(x)$ and $s^(x)$ are the mean and standard deviation of the GP prediction in $x$ respectively. The authors suggest using four different $\alpha =0,1,2,4$ in each generation: while $\alpha =0$ exploits the current best-predicted point, $\alpha =4$ explores mainly uncertain regions of the search space.

**Kriging metamodeling CMA-ES**. Kruisselbrink et al. (2010) used GPs to handle robustness in optimization of noisy functions. Their *kriging metamodeling based CMA-ES* evaluates the population with the noisy original function and then constructs a separate GP/kriging model for each such point in order to estimate its true value without noise. The algorithm uses the CMA-ES covariance matrix $C$ for the transformation of the input space. Apart from a procedure for selecting training points for the GP from an archive, the authors also suggest which points to take for an additional expensive evaluation if there are not enough points for the GP model building. The kriging method is shown to approximate the noisy fitness much better than two other methods based on a simple (re-)evaluation of the points during CMA-ES optimization.

**Ensemble of Local Gaussian Process Models**. The article by Lu et al. (2013) presents another surrogate-assisted CMA-ES algorithm that employs an ensemble of local Gaussian process models. The algorithm uses the GP models to prescreen a larger population ($\lambda =25$, $\mu =5$) and to select the best 6 points for expensive re-evaluation according to the same criterion as in Büche et al. (2005) with $\alpha =2$. Although the results in the article seem rather promising, we were not able to reproduce them using the provided source code.

**EGO-CMA**. Mohammadi et al. (2015) combine GPs and the CMA-ES in their *EGO-CMA*. It starts as the EGO algorithm and switches into the CMA-ES after a small number of EGO iterations. Hence, it is not a surrogate model in the proper sense of repeatedly switching between the evaluation with the original fitness and with the model. The core contribution of the article lies in the sophisticated initialization of the CMA-ES covariance matrix and step-size based on the learned parameters of the EGO's Gaussian process. The authors show interesting preliminary results on the 5-dimensional Sphere and Ackley functions which we were also not able to reproduce due to the incompleteness of the provided source code.

**SAPEO**. The last single-objective GP/CMA-ES algorithm we are aware of is the Surrogate-Assisted Partial Order-Based Evolutionary Optimization Algorithm (SAPEO) by Volz et al. (2017). The algorithm uses several dominance relations that combine the GP predicted mean and variance to induce a partial order on the points in a CMA-ES population, usually using confidence intervals. If this partial order is able to identify the ranking of the $\mu $ best points with a given probability, the algorithm proceeds without any further fitness evaluations. Otherwise, the SAPEO tries to choose its another dominance relation or original-evaluates some of the points if necessary.

Being interested in the mentioned works, we present a thorough comparison of the following algorithms in Section 5: the *Metamodel-Assisted Evolution Strategy* (MA-ES) by Ulmer et al. (2003), the *GPOP* by Büche et al. (2005), the *lmm-CMA* in the modification by Auger et al. (2013), the *$s*$ACM-ES* by Loshchilov et al. (2013), the *SAPEO* by Volz et al. (2017), and our algorithms *S-CMA-ES* and *DTS-CMA-ES*.

## 3 Surrogate Modeling in the Context of the CMA-ES and Gaussian Processes

The considered optimization problem can be formalized as follows. Let $f:X\u2192R$ be a function defined on a connected set $X\u2286RD$.

*black-box*. The true optimum $yopt$ is often unknown in real-world applications, and thus the stopping criteria have to be defined differently. On the contrary, the formulas and the optima $yopt$ are known for the COCO benchmarks, so the framework uses a precision $\Delta fopt=10-8$—the optimization ends when a point $x^opt$ for which $f(x^opt)\u2208[yopt,yopt+\Delta fopt]$ is found.

Furthermore, the function $f$ is supposed to be *expensive*. More precisely, we assume that evaluating the fitness of a point represents a considerably higher cost than training a regression model. This assumption, commonly valid in real-world problems, justifies the aim to use surrogate models, and it implies that the criteria for assessing different black-box optimizers are calculated based on the number of the expensive function evaluations, not on the CPU time spent during the optimization.

As a *surrogate model*, we understand any regression model $f^:X\u2192R$ that is trained on the data from an archive $A={(xi,yi)|yi=f(xi),i=1,\cdots ,N}$. If it can be defined only on a substantially restricted $X'\u2282X$, we then call it a *local surrogate model*.

### 3.1 The CMA-ES

All of the surrogate-assisted algorithms considered in this article are based on the Covariance Matrix Adaptation Evolution Strategy (CMA-ES; Hansen and Ostermeier, 2001; Hansen, 2006). The algorithm iteratively samples a new population of points from a Gaussian distribution $N(m,\sigma 2C)$ and uses the $\mu $ best-evaluated points to update its state variables: the current distribution mean $m$, which is used as an estimate of the function optimum, the covariance matrix $C$, and the step-size $\sigma $ (cf. Algorithm 1).

The CMA-ES leaves only several parameters to be set by its user: the initial population size $\lambda $, step-size $\sigma (0)$, and mean $m(0)$. Increasing the recommended starting value of the population size $\lambda =4+\u230a3lnD\u230b$ decreases the convergence speed on unimodal functions, but enables the algorithm to escape easily from local optima (see, e.g., $f1$ and $f17$ in Figure 3). The initial step-size on (all dimensions equally-) bounded problems has been recommended as one half of the search space width $\sigma (0)=0.5((xup)1-(xlow)1)$, and the initial mean has been advised to be sampled uniformly $m(0)\u223cU(X)$. The remaining parameters of the CMA-ES have been tuned by its authors (Hansen, 2006).

Without stating explicitly, the CMA-ES is used in our algorithms with the default setting of its IPOP-CMA-ES version which gradually doubles the population size on restarts. These restarts occur when the optimization converges to a local minimum or stalls for several generations without finding better fitness.

### 3.2 Training Sets

The quality of a surrogate model is largely determined by the selection of already evaluated points into its training set. We denote $Nmax$ the given maximum number of points in a training set and $Nmin$ the given minimum number of points required to train a model. All points from the archive $A$ are used if $|A|<Nmax$. Let us list a few possible methods for training set selection (TSS):

Taking up to $Nmax$ most recently evaluated points (e.g., in Ulmer et al., 2003 or Loshchilov et al., 2013),

selecting the union of the $k$ nearest neighbors of every point for which the fitness should be predicted (e.g., in Kern et al., 2006), where $k$ is maximal such that the total number of selected points does not exceed $Nmax$,

clustering the points in the input space into $Nmax$ clusters and taking the points nearest to clusters' centroids (e.g., in Bajer et al., 2015), and

selecting $Nmax$ points which are closest to any point in the current population.

All of these selection methods can be restricted to an area considered relevant for the particular generation. Localizing this area around the current algorithm position (around the mean $m(g)$ in the CMA-ES, for example) is important for local optimizers, and especially for the 3.2 which is not biased towards the current population. The algorithms in Section 4 use the Mahalanobis distance given by the CMA-ES matrix $\sigma 2C$ for selecting points into training sets, similarly to, for example, Kruisselbrink et al. (2010).

### 3.3 Gaussian Processes

A Gaussian process on the considered set $X\u2286RD$ is a collection of random variables $(fGP(x))x\u2208X$, indexed by the space $X$, such that any finite $n$-element subcollection has a joint $n$-dimensional normal distribution.

Each Gaussian process is specified by its mean $\mu :X\u2192R$ and covariance function $K:X\xd7X\u2192R0+$ where $\mu (x)=E[fGP(x)]$ and $K(x1,x2)=cov(fGP(x1),fGP(x2))$. Both functions are parametrized by a relatively small number of parameters which are usually fitted by the maximum-likelihood (ML) or leave-one-out cross-validation (LOO-CV) method. Since both functions $\mu ,K$ are themselves parameters of the GP, their parameters are usually called hyperparameters of the GP.

**Prediction**. Using a Gaussian process for prediction always starts with a training set of $N$ points $XN={xi|xi\u2208X,i=1,\u2026,N}$ for which the function values $yN={yi=f(xi),i=1,\u2026,N}$ are known. If the mean function $\mu $ of the GP is zero, then $yN|XN\u223cN(0,CN)$. We consider the prediction in a new, $(N+1)$-st point $(x*,y*)$. Adding this point to the training set, the conditional distribution of the extended vector $yN+1=(y1,\u2026,yN,y*)\u22a4$ is

*noisy*:

From the computational perspective, the calculation of the covariance matrix $CN$ takes $O(DN2)$ time, and the complexity of the likelihood calculation for hyperparameter estimation is $O(N3)$ due to inversion of $CN$. Once the $CN-1$ is calculated, the complexity of the prediction is only $O(N2)$.

**Covariance functions**. The covariance function $K$ describes prior assumptions on the shape of the modeled function: $K(xp,xq)$ models the covariance $cov(f(xp),f(xq))$ between the *function values* at two data points $xp$ and $xq$.

For any $N\u2208N$, the matrix $KN$ has to be positive semidefinite. Most of the covariance functions encountered in applications are stationary; that is, the value of the function $K(x1,x2)$ depends only on the distance $r(x1,x2)$ between two input points for some metric $r$, usually the Euclidean distance.

*squared-exponential*

*Matérn*covariance function for two values of its parameter $\nu \u2208{3/2,5/2}$

The closer the points $x1$ and $x2$ are, the closer the covariance function value is to $\sigma f2$, and the stronger the correlation between the function values $f(x1)$ and $f(x2)$ is. The parameter $\u2113$, a characteristic length-scale, is a distance to which the distance of two considered data points is compared, and $\sigma f2$ is the signal variance. For all three mentioned covariance functions, a version which has a separate length scale parameter per each dimension exists (often called automatic relevance determination, ARD). However, these variants, especially in higher dimensions, dramatically increase the number of hyperparameters. Our preliminary experiments showed that the Gaussian processes with ARD covariance functions often give worse results, probably because they are sensitive to over-fitting.

## 4 Designing New Gaussian Process Surrogates for the CMA-ES

The results in the experimental section reveal that the convergence speeds of the Gaussian process surrogate CMA-ES algorithms GPOP and MA-ES are considerably lower on the COCO testbed, compared to the lmm-CMA with quadratic surrogate models or $s*$ACM-ES with ranking Support vector machines. These results, which are in contradiction to a great success of Gaussian processes during the last decade, motivated us to design new GP surrogate CMA-ES algorithms S-CMA-ES and DTS-CMA-ES. In addition to the use of GPs as a regression model of the objective function, we aim to utilize the GP uncertainty estimation in the optimization, too.

### 4.1 Surrogate CMA-ES: Generation-Based Evolution Control

Our first GP surrogate algorithm S-CMA-ES (Bajer et al., 2015) uses the generation-based EC: once the model is trained, it is used instead of the original fitness for several consecutive generations of the CMA-ES. Then, the original fitness is used for one generation, the model is re-trained also using the new points, and this procedure repeats (Algorithm 2). This EC, used for example in the $s*$ACM-ES, has a great advantage of simplicity and, compared to the individual-based EC, there is no need for any special selection of points into the population of the respective generation.

Apart from the surrogate-model parameters described below, the algorithm is parametrized by the model life-length $gm$: the number of generations for which the model is used. Although this parameter is problem-dependent, it was shown that the 5-generations life-length works reasonably well on most COCO functions (Bajer et al., 2015). Two years later, Repický et al. (2017) brought a variant called A-S-CMA-ES that governs the model life-length adaptively and which slightly improves the convergence rate for a higher number of FE/D. Both these algorithms demonstrate an impressive speed-up at the beginning of optimization runs, but they get often stuck in local optima and are outperformed by other algorithms, particularly by the DTS-CMA-ES.

### 4.2 Doubly Trained Surrogate CMA-ES: Closer to the Individual-Based Evolution Control

In the generation-based EC, the whole population is always original-evaluated at once which makes the exploitation of the GP predictive uncertainty difficult. Additionally, the CMA-ES can drift outside the region where any archive points are stored. These two facts led us to employ another evolution control. However, the other common choice, individual-based EC, can modify the distribution $N(m,\sigma 2C)$ of the CMA-ES population due to non-uniform selection of the $\lambda $ promising points from the larger set of model-evaluated points.

Although Loshchilov et al. (2010) suggested subsampling the points according to the predicted fitness mapped onto a Gaussian distribution and Hansen (2011) has later given restrictions on injecting external points into the CMA-ES population, we have proposed another solution called *doubly trained* EC. Each generation of this EC can be summarized in the following steps:

sample a new population of size $\lambda $ (standard CMA-ES offspring),

train the

*first*surrogate model on the original-evaluated points from the archive $A$,select $\u2308\alpha \lambda \u2309$ point(s) wrt. a criterion $C$, which is based on the

*first*model's prediction,evaluate these point(s) with the original fitness,

re-train the surrogate model also using these new point(s), and

predict the fitness for the non-original evaluated points with this

*second*model.

The algorithm is partially similar to the evolution control of the lmm-CMA, but the lmm-CMA always selects the points with the best-predicted fitness (in step (3)) and retrains the model generally more times than twice (it cycles steps (3)–(5)).

The key characteristics of the doubly trained EC are the following. First, due to step (1), the population is always a sample from the CMA-ES distribution $Nm,\sigma 2C$, similarly to the generation-based EC. Second, in step (3), the Gaussian process estimation of uncertainty can be utilized for the selection of a typically low number of original-evaluated points; see below for possible criteria and the description of the parameter $\alpha $. Third, the estimated fitness values in step (6), which are returned to the CMA-ES, are predicted using the model trained on as recent points as possible. Finally, evaluating at least a small number of points in each generation using the original fitness maintains at least a decent number of points in the training set near the current CMA-ES distribution mean.

Employing the doubly trained EC in the CMA-ES results in the Doubly trained surrogate CMA-ES (DTS-CMA-ES, (Pitra et al., 2016)) whose pseudocode is shown in Algorithm 3. The algorithm is parametrized by the parameter $\alpha (0)$—the initial value of the ratio of original-evaluated points in a population, by the criterion $C$ for the selection of these points, and by the surrogate model and its parameters. Furthermore, a user can let the ratio $\alpha $ adapt throughout the algorithm run, which requires an additional set of parameters—see Algorithm 4 and the corresponding paragraph below.

**Ranking difference error**. In accordance with most of the model-assessment methods for the CMA-ES surrogates, we have also used the difference in ranking of the fitness values in a population as a measure of model quality. We have proposed a normalized version of the error measure used by Kern et al. (2006), which we call the

*Ranking difference error*(RDE). We denote as

*ranking*a function $\rho :R\lambda \u2192{1,\u2026,\lambda}\lambda $ which maps each element in a vector $y\u2208R\lambda $ to its rank, that is, $(\rho (y))i\u2264(\rho (y))j$ whenever $(y)i\u2264(y)j$. The RDE is defined for two vectors of fitness values $y1*$, $y2*$. We assume the second vector $y2*$ to be a more accurate prediction: $y2*$ might contain more original-evaluated $f$-values, or the $y2*$'s model training set could be larger. The RDE is then calculated as

The larger the difference in rankings of these two $f$-values vectors, the more the CMA-ES would be fooled if it got the less accurate $f$-values prediction $y1*$ instead of $y2*$.

**Criteria for the selection of original-evaluated points**. While the non-Gaussian-process surrogate algorithms select points for the original evaluation mostly according to the predicted fitness, the Gaussian process surrogates, which for any point $x$ predict the whole Gaussian distribution $N(y^(x),(s^(x))2)$, offer more options when used with the individual-based or doubly trained EC. The following criteria are considered in the DTS-CMA-ES. Whereas the first four are defined for any point of the input space and have been used in Bayesian optimization for decades, the last one, *Expected ranking difference error*, is our new contribution directly exploiting the DTS-CMA-ES' Gaussian processes prediction and is defined only for the points from the considered population.

*GP predictive mean*. The GP mean prediction $y^(x)$ is the maximum-likelihood estimate of the original fitness. This criterion is defined as its negative value$CM(x)=-y^(x).$(15)*GP predictive standard deviation*. Choosing the points with the highest uncertainty leads to the criterion$CSTD(x)=s^(x).$(16)*Expected improvement*(EI). If $ymin$ stands for the minimum fitness in the considered training set $y1,\u2026,yN$, the EI criterion is$CEI(x)=E((ymin-f(x))I(f(x)<ymin)|y1,\u2026,yN),whereI(f(x)<ymin)=1forf(x)<ymin0forf(x)\u2265ymin.$(17)*Probability of improvement*(PoI). The PoI express the probability of finding lower fitness than some threshold $T$where $\phi $ is the distribution function of $N(0,1)$. As $T$, a value slightly lower than $ymin$ is usually chosen, see, e.g., Jones (2001) for an evaluation of these thresholds.$CPoI(x,T)=P(f(x)\u2264T|y1,\u2026,yN)=\phi T-y^(x)s^(x),$(18)*Expected ranking difference error*(ERDE). The goal of this criterion is to select the points from the current population for which the expected RDE would decrease most after adding them to the GP training set. It is a ranking counterpart of the*GP predictive standard deviation*criterion since it selects the points for which the model is least certain. In DTS-CMA-ES, the $RDE\mu $ has been used as an error measure.^{2}For each point $xk*\u2208{x1,\u2026,x\lambda}$ in the considered population, let us denote $X-k$ the population with the $k$-th point removed and $y^-k=fM1(X-k)$ the vector of the corresponding mean predictions of the first DTS-CMA-ES model. The ERDE criterion is the expected contribution of $xk*$ to the ranking errorwhere $y^-k+=fM1+(X-k)$ is the vector of mean predictions of a GP $fM1+$. This process $fM1+$ is identical to the first model $fM1$ except that the $k$-th point $(xk*,yk)$ is additionally incorporated into its training set, say to its end, which becomes $(XN+,yN+)=([XNxk*],(yN\u22a4yk)\u22a4)$; particularly, $fM1+$ uses the same hyperparameters as $fM1$. The expectation is calculated over normally distributed $yk\u223cN(y^k,(s^k)2)$ with density $\varphi (yk)$, where $y^k$ and $(s^k)2$ are defined using the first DTS-CMA-ES model $fM1$ by equations (8) and (9).$CERDE(xk*,X-k)=E[RDE\mu (y^-k,y^-k+)]=\u222b-\u221e\u221eRDE\mu (y^-k,y^-k+)\varphi (yk)dyk,$(19)The vector of predicted means $y^-k+$ is given by a multivariate variant of (8)For the matrix $A=K(X-k,XN+)CN+-1$ that is for a considered $xk*$ constant, the value of each element of $y^-k+$ depends only on $yk$ as$y^-k+=K(X-k,XN+)CN+-1yN+.$(20)$(y^-k+)i=(A)i,1\u2026NyN+yk(A)i,N+1.$(21)Further, the integral (19) over $yk$ can be calculated as the sum of subintegrals $Jp$ over the intervals $Ip=[lp,up]$, $p=1,\u2026,(12(\lambda -1)(\lambda -2)+1)$; the intervals are defined in such a way that the ranking of $y^-k+$ is constant on each of them. The intervals $Ip$'s can be derived from the mutual comparisonswhere the interval boundaries $u1=l2\u2264u2=l3\u2264\u2026\u2264up-1=lp$ are the points for which the equalities arise. As the rankings are on each $Ip$ constant, the subintegrals simplify to $Jp=RDE\mu (y^-k,y^-k+)(\Phi (up)-\Phi (lp))$ where $\Phi $ is the distribution function of $N(y^k,(s^k)2)$. The complexity of calculating ERDE for each point is thus $O(N3+\lambda 2N)$ which enables efficient computation of this criterion even for moderately large $\lambda $.$(y^-k+)i\u2264(y^-k+)jfori\u2260j,i,j=1,\u2026,(\lambda -1),$(22)

**Self-adaptation of the original-evaluated points ratio $\alpha $**. Whereas the speed-up of all surrogate-assisted algorithms is based on the model's ability to estimate fitness, the models can mislead the optimization algorithm when their error is high. Similarly to the lmm-CMA and $s*$ACM-ES, such an adverse effect can be controlled by raising the number of original-evaluated points if necessary. The DTS-CMA-ES measures the ranking difference error of the model and raises the original-evaluated points ratio $\alpha $ if the model error gets higher (Pitra et al., 2017). The details are summarized in Algorithm 4: first, the $RDE\mu $ of the last model is exponentially smoothed with the update rate $\beta $, and the ratio $\alpha $ is then calculated as the result of a linear transfer function that is defined by the bounds on the smoothed error $\epsilon min$, $\epsilon max$ and on the resulting ratio $\alpha min$, $\alpha max$.

## 5 Implementation Details and Experimental Validation

This section starts with the details of the proposed algorithms, particularly addressing the training of and predicting with the Gaussian processes as well as the evaluation of their parameter settings. The second part continues with the experimental evaluation of the DTS-CMA-ES as well as other GP/CMA-ES algorithms on the COCO benchmarks. All the algorithms proposed by the authors are implemented in MATLAB.^{3}

### 5.1 Configuration of the Gaussian Process Models

The GP models employed in our algorithms make use of the GPML 4.0—a toolbox that accompanied the book of Rasmussen and Williams (2006). The only modifications are using the MATLAB fmincon optimizer or the CMA-ES for the ML estimation of the GP hyperparameters and slight modifications in error handling in the case of numerical instabilities.

The estimation of the GP model hyperparameters is preceded by the transformation of both the input and output values; see Algorithm 5 for the details. The likelihood optimization in step 4 is started at and bounded by values obtained from brief experiments; see Table 1 for the exact values. The CMA-ES is used instead of fmincon if fmincon fails due to an infeasible starting point or numerical problems.

Hyperparameter . | Starting Value . | Lower Bound . | Upper Bound . |
---|---|---|---|

$m\mu $ | median $f$-value $medyN$ | $minyN-2\Delta y$ | $maxyN+2\Delta y$ |

$\sigma f2$ | 0.5 | $exp(-2)$ | $exp(25)$ |

$\u2113$ | 2 | $exp(-2)$ | $exp(25)$ |

$\sigma n2$ | $10-2$ | $10-6$ | 10 |

Hyperparameter . | Starting Value . | Lower Bound . | Upper Bound . |
---|---|---|---|

$m\mu $ | median $f$-value $medyN$ | $minyN-2\Delta y$ | $maxyN+2\Delta y$ |

$\sigma f2$ | 0.5 | $exp(-2)$ | $exp(25)$ |

$\u2113$ | 2 | $exp(-2)$ | $exp(25)$ |

$\sigma n2$ | $10-2$ | $10-6$ | 10 |

Analogically, the predictions with such trained GP models start with the inverse transformation of the input variables into the space that was used in the model training, and the predicted $f$-values and variances are transformed back to the original scale. Note that the GP surrogate models have to save not only the covariance function and the corresponding hyperparameters but also these transformations and the model's training set.

**Evaluation of GP model parameters**. Because we are not aware of any systematic comparison of different parameter settings of the GP surrogate models for the CMA-ES in literature, we provide such evaluation on the data from the COCO benchmark functions. Table 2 lists all the tested parameters with their considered values; the $rmaxA$ denotes the maximum distance between $m(g)$ and the points to be selected from $A$. The minimum training set size $Nmin$ is $3\xb7D$ in all our GP models mentioned in this article.

Parameter . | Considered Values . |
---|---|

training set selection method $TSS$ | (TSS1), (TSS2), (TSS3), (TSS4) |

maximum distance $rmaxA$ | $2Q\chi 2(0.99,D)$, $4Q\chi 2(0.99,D)$ |

$Nmax$ | $10\xb7D$, $15\xb7D$, $20\xb7D$ |

covariance function $K$ | $KSE$, $KMate\xb4rn\nu =3/2$, $KMate\xb4rn\nu =5/2$ |

Parameter . | Considered Values . |
---|---|

training set selection method $TSS$ | (TSS1), (TSS2), (TSS3), (TSS4) |

maximum distance $rmaxA$ | $2Q\chi 2(0.99,D)$, $4Q\chi 2(0.99,D)$ |

$Nmax$ | $10\xb7D$, $15\xb7D$, $20\xb7D$ |

covariance function $K$ | $KSE$, $KMate\xb4rn\nu =3/2$, $KMate\xb4rn\nu =5/2$ |

As training and testing datasets for this comparison, three generations from the first half and three generations from the second half of optimization runs of the DTS-CMA-ES ($\alpha =0.05$), limited to 250 FE/D, were recorded on the first five instances of 24 COCO functions in dimensions 2, 3, 5, 10, and 20. For each recorded generation, the CMA-ES state variables, archive $A$ and populations (as testing sets) were saved.

Figure 1 depicts the third quartiles of $RDE\mu /rsucc$ where $rsucc$ denotes the ratio of successful trainings of the second model $fM2$. Both the errors and ratios were taken for each model setting of the full-factorial design of the parameters from Table 2 across 15 repetitions (5 instances $\xd7$ 3 generations) per each half of the optimization run on each function in each dimension. The same results concerning the influence of the individual parameters are summarized over functions via the $n$-way ANOVA in Table 3.

dim . | Part of Run . | Trainset Selection . | $rmaxA$ . | $Nmax$ . | Covariance Function . | ||||
---|---|---|---|---|---|---|---|---|---|

2-D | i | 0.40 | $k$-NN (TSS2) | 0.41 | 2 | 0.40 | $20\xb7D$ | 0.40 | $KMate\xb4rn\nu =5/2$ |

0.41 | population (TSS4) | 0.41 | 4 | 0.41 | $15\xb7D$ | 0.41 | $KMate\xb4rn\nu =3/2$ | ||

0.41 | clustering (TSS3) | 0.41 | $10\xb7D$ | 0.41 | $KSE$ | ||||

0.41 | recent (TSS1) | ||||||||

2-D | ii | 0.42 | $k$-NN (TSS2) | 0.42 | 2$\u2605\u2605$ | 0.42 | $20\xb7D$ | 0.42 | $KSE$ |

0.42 | recent (TSS1) | 0.42 | 4 $\u2193$ | 0.42 | $15\xb7D$ | 0.42 | $KMate\xb4rn\nu =5/2$ | ||

0.42 | clustering (TSS3) | 0.42 | $10\xb7D$ | 0.42 | $KMate\xb4rn\nu =3/2$ | ||||

0.42 | population (TSS4) | ||||||||

1<$\u2605\u2605$2 | |||||||||

3-D | i | 0.38 | recent (TSS1) | 0.38 | 4 | 0.38 | $15\xb7D$ | 0.38 | $KMate\xb4rn\nu =3/2$ |

0.38 | $k$-NN (TSS2) | 0.39 | 2 | 0.38 | $10\xb7D$ | 0.38 | $KMate\xb4rn\nu =5/2$ | ||

0.38 | clustering (TSS3) | 0.38 | $20\xb7D$ | 0.39 | $KSE$ | ||||

0.39 | population (TSS4) | ||||||||

3-D | ii | 0.38 | recent (TSS1)$\u2605\u2605\u2605$ | 0.39 | 2 | 0.39 | $15\xb7D$ | 0.39 | $KMate\xb4rn\nu =3/2$ |

0.39 | $k$-NN (TSS2)$\u2605\u2605\u2605$ | 0.39 | 4 | 0.39 | $10\xb7D$ | 0.39 | $KMate\xb4rn\nu =5/2$ | ||

0.39 | clustering (TSS3) $\u2193$ | 0.39 | $20\xb7D$ | 0.40 | $KSE$ | ||||

0.40 | population (TSS4) $\u2193$ | ||||||||

1<$\u2605$3, 1<$\u2605\u2605\u2605$4, 2<$\u2605$4 | |||||||||

5-D | i | 0.34 | $k$-NN (TSS2)$\u2605\u2605\u2605$ | 0.34 | 2$\u2605\u2605\u2605$ | 0.34 | $10\xb7D$ | 0.34 | $KMate\xb4rn\nu =3/2$$\u2605\u2605\u2605$ |

0.34 | recent (TSS1)$\u2605\u2605\u2605$ | 0.35 | 4 $\u2193$ | 0.34 | $15\xb7D$ | 0.34 | $KMate\xb4rn\nu =5/2$$\u2605\u2605\u2605$ | ||

0.34 | clustering (TSS3)$\u2605\u2605\u2605$ | 0.34 | $20\xb7D$ | 0.35 | $KSE$$\u2193$ | ||||

0.35 | population (TSS4) $\u2193$ | ||||||||

1<$\u2605\u2605\u2605$4, 2<$\u2605$4, 3<$\u2605$4 | 1<$\u2605\u2605\u2605$2 | 1<$\u2605\u2605\u2605$3, 2<$\u2605\u2605\u2605$3 | |||||||

5-D | ii | 0.35 | $k$-NN (TSS2)$\u2605\u2605\u2605$ | 0.35 | 2$\u2605\u2605$ | 0.35 | $15\xb7D$ | 0.35 | $KMate\xb4rn\nu =5/2$$\u2605\u2605\u2605$ |

0.35 | recent (TSS1)$\u2605\u2605\u2605$ | 0.36 | 4 $\u2193$ | 0.35 | $10\xb7D$ | 0.35 | $KMate\xb4rn\nu =3/2$$\u2605\u2605\u2605$ | ||

0.35 | clustering (TSS3)$\u2605\u2605\u2605$ | 0.35 | $20\xb7D$ | 0.36 | $KSE$$\u2193$ | ||||

0.37 | population (TSS4) $\u2193$ | ||||||||

1<$\u2605\u2605\u2605$4, 2<$\u2605\u2605\u2605$4, 3<$\u2605\u2605$4 | 1<$\u2605\u2605$2 | 1<$\u2605\u2605$3, 2<$\u2605\u2605$3 | |||||||

10-D | i | 0.34 | $k$-NN (TSS2)$\u2605\u2605\u2605$ | 0.33 | 4$\u2605\u2605\u2605$ | 0.34 | $20\xb7D$ | 0.33 | $KMate\xb4rn\nu =5/2$$\u2605\u2605\u2605$ |

0.34 | recent (TSS1)$\u2605\u2605\u2605$ | 0.35 | 2 $\u2193$ | 0.34 | $15\xb7D$ | 0.34 | $KSE$$\u2193$ | ||

0.34 | clustering (TSS3) $\u2193$ | 0.34 | $10\xb7D$ | 0.35 | $KMate\xb4rn\nu =3/2$$\u2193$ | ||||

0.35 | population (TSS4) $\u2193$ | ||||||||

1<$\u2605$3, 1<$\u2605\u2605\u2605$4, 2<$\u2605\u2605$4 | 1<$\u2605\u2605\u2605$2 | 1<$\u2605\u2605$2, 1<$\u2605\u2605\u2605$3 | |||||||

10-D | ii | 0.32 | $k$-NN (TSS2)$\u2605\u2605\u2605$ | 0.32 | 4$\u2605\u2605\u2605$ | 0.33 | $20\xb7D$ | 0.33 | $KMate\xb4rn\nu =5/2$$\u2605\u2605\u2605$ |

0.33 | recent (TSS1) $\u2193$ | 0.34 | 2 $\u2193$ | 0.33 | $15\xb7D$ | 0.34 | $KSE$$\u2193$ | ||

0.34 | clustering (TSS3) $\u2193$ | 0.34 | $10\xb7D$ | 0.34 | $KMate\xb4rn\nu =3/2$$\u2193$ | ||||

0.34 | population (TSS4) $\u2193$ | ||||||||

1<$\u2605\u2605$2, 1<$\u2605\u2605\u2605$3, 1<$\u2605\u2605\u2605$4 | 1<$\u2605\u2605\u2605$2 | 1<$\u2605\u2605\u2605$2, 1<$\u2605\u2605\u2605$3 | |||||||

20-D | i | 0.35 | $k$-NN (TSS2)$\u2605\u2605\u2605$ | 0.34 | 4$\u2605\u2605\u2605$ | 0.35 | $20\xb7D$$\u2605\u2605\u2605$ | 0.34 | $KMate\xb4rn\nu =5/2$$\u2605\u2605\u2605$ |

0.36 | population (TSS4) $\u2193$ | 0.37 | 2 $\u2193$ | 0.35 | $15\xb7D$$\u2605\u2605\u2605$ | 0.34 | $KMate\xb4rn\nu =3/2$$\u2193$ | ||

0.36 | clustering (TSS3) $\u2193$ | 0.36 | $10\xb7D$$\u2193$ | 0.39 | $KSE$$\u2193$ | ||||

0.36 | recent (TSS1) $\u2193$ | ||||||||

1<$\u2605$2, 1<$\u2605\u2605\u2605$3, 1<$\u2605\u2605\u2605$4 | 1<$\u2605\u2605\u2605$2 | 1<$\u2605\u2605\u2605$3, 2<$\u2605\u2605$3 | 1<$\u2605$2, 1<$\u2605\u2605\u2605$3, | ||||||

2<$\u2605\u2605\u2605$3 | |||||||||

20-D | ii | 0.34 | $k$-NN (TSS2)$\u2605\u2605\u2605$ | 0.33 | 4$\u2605\u2605\u2605$ | 0.34 | $20\xb7D$$\u2605\u2605\u2605$ | 0.32 | $KMate\xb4rn\nu =5/2$$\u2605\u2605\u2605$ |

0.35 | recent (TSS1) $\u2193$ | 0.36 | 2 $\u2193$ | 0.35 | $15\xb7D$$\u2605\u2605\u2605$ | 0.34 | $KMate\xb4rn\nu =3/2$$\u2193$ | ||

0.35 | population (TSS4) $\u2193$ | 0.36 | $10\xb7D$$\u2193$ | 0.38 | $KSE$$\u2193$ | ||||

0.36 | clustering (TSS3) $\u2193$ | ||||||||

1<$\u2605$2, 1<$\u2605\u2605\u2605$3, 1<$\u2605\u2605\u2605$4, 2<$\u2605$4 | 1<$\u2605\u2605\u2605$2 | 1<$\u2605\u2605\u2605$3, 2<$\u2605\u2605\u2605$3 | 1<$\u2605\u2605\u2605$2, 1<$\u2605\u2605\u2605$3, | ||||||

2<$\u2605\u2605\u2605$3 |

dim . | Part of Run . | Trainset Selection . | $rmaxA$ . | $Nmax$ . | Covariance Function . | ||||
---|---|---|---|---|---|---|---|---|---|

2-D | i | 0.40 | $k$-NN (TSS2) | 0.41 | 2 | 0.40 | $20\xb7D$ | 0.40 | $KMate\xb4rn\nu =5/2$ |

0.41 | population (TSS4) | 0.41 | 4 | 0.41 | $15\xb7D$ | 0.41 | $KMate\xb4rn\nu =3/2$ | ||

0.41 | clustering (TSS3) | 0.41 | $10\xb7D$ | 0.41 | $KSE$ | ||||

0.41 | recent (TSS1) | ||||||||

2-D | ii | 0.42 | $k$-NN (TSS2) | 0.42 | 2$\u2605\u2605$ | 0.42 | $20\xb7D$ | 0.42 | $KSE$ |

0.42 | recent (TSS1) | 0.42 | 4 $\u2193$ | 0.42 | $15\xb7D$ | 0.42 | $KMate\xb4rn\nu =5/2$ | ||

0.42 | clustering (TSS3) | 0.42 | $10\xb7D$ | 0.42 | $KMate\xb4rn\nu =3/2$ | ||||

0.42 | population (TSS4) | ||||||||

1<$\u2605\u2605$2 | |||||||||

3-D | i | 0.38 | recent (TSS1) | 0.38 | 4 | 0.38 | $15\xb7D$ | 0.38 | $KMate\xb4rn\nu =3/2$ |

0.38 | $k$-NN (TSS2) | 0.39 | 2 | 0.38 | $10\xb7D$ | 0.38 | $KMate\xb4rn\nu =5/2$ | ||

0.38 | clustering (TSS3) | 0.38 | $20\xb7D$ | 0.39 | $KSE$ | ||||

0.39 | population (TSS4) | ||||||||

3-D | ii | 0.38 | recent (TSS1)$\u2605\u2605\u2605$ | 0.39 | 2 | 0.39 | $15\xb7D$ | 0.39 | $KMate\xb4rn\nu =3/2$ |

0.39 | $k$-NN (TSS2)$\u2605\u2605\u2605$ | 0.39 | 4 | 0.39 | $10\xb7D$ | 0.39 | $KMate\xb4rn\nu =5/2$ | ||

0.39 | clustering (TSS3) $\u2193$ | 0.39 | $20\xb7D$ | 0.40 | $KSE$ | ||||

0.40 | population (TSS4) $\u2193$ | ||||||||

1<$\u2605$3, 1<$\u2605\u2605\u2605$4, 2<$\u2605$4 | |||||||||

5-D | i | 0.34 | $k$-NN (TSS2)$\u2605\u2605\u2605$ | 0.34 | 2$\u2605\u2605\u2605$ | 0.34 | $10\xb7D$ | 0.34 | $KMate\xb4rn\nu =3/2$$\u2605\u2605\u2605$ |

0.34 | recent (TSS1)$\u2605\u2605\u2605$ | 0.35 | 4 $\u2193$ | 0.34 | $15\xb7D$ | 0.34 | $KMate\xb4rn\nu =5/2$$\u2605\u2605\u2605$ | ||

0.34 | clustering (TSS3)$\u2605\u2605\u2605$ | 0.34 | $20\xb7D$ | 0.35 | $KSE$$\u2193$ | ||||

0.35 | population (TSS4) $\u2193$ | ||||||||

1<$\u2605\u2605\u2605$4, 2<$\u2605$4, 3<$\u2605$4 | 1<$\u2605\u2605\u2605$2 | 1<$\u2605\u2605\u2605$3, 2<$\u2605\u2605\u2605$3 | |||||||

5-D | ii | 0.35 | $k$-NN (TSS2)$\u2605\u2605\u2605$ | 0.35 | 2$\u2605\u2605$ | 0.35 | $15\xb7D$ | 0.35 | $KMate\xb4rn\nu =5/2$$\u2605\u2605\u2605$ |

0.35 | recent (TSS1)$\u2605\u2605\u2605$ | 0.36 | 4 $\u2193$ | 0.35 | $10\xb7D$ | 0.35 | $KMate\xb4rn\nu =3/2$$\u2605\u2605\u2605$ | ||

0.35 | clustering (TSS3)$\u2605\u2605\u2605$ | 0.35 | $20\xb7D$ | 0.36 | $KSE$$\u2193$ | ||||

0.37 | population (TSS4) $\u2193$ | ||||||||

1<$\u2605\u2605\u2605$4, 2<$\u2605\u2605\u2605$4, 3<$\u2605\u2605$4 | 1<$\u2605\u2605$2 | 1<$\u2605\u2605$3, 2<$\u2605\u2605$3 | |||||||

10-D | i | 0.34 | $k$-NN (TSS2)$\u2605\u2605\u2605$ | 0.33 | 4$\u2605\u2605\u2605$ | 0.34 | $20\xb7D$ | 0.33 | $KMate\xb4rn\nu =5/2$$\u2605\u2605\u2605$ |

0.34 | recent (TSS1)$\u2605\u2605\u2605$ | 0.35 | 2 $\u2193$ | 0.34 | $15\xb7D$ | 0.34 | $KSE$$\u2193$ | ||

0.34 | clustering (TSS3) $\u2193$ | 0.34 | $10\xb7D$ | 0.35 | $KMate\xb4rn\nu =3/2$$\u2193$ | ||||

0.35 | population (TSS4) $\u2193$ | ||||||||

1<$\u2605$3, 1<$\u2605\u2605\u2605$4, 2<$\u2605\u2605$4 | 1<$\u2605\u2605\u2605$2 | 1<$\u2605\u2605$2, 1<$\u2605\u2605\u2605$3 | |||||||

10-D | ii | 0.32 | $k$-NN (TSS2)$\u2605\u2605\u2605$ | 0.32 | 4$\u2605\u2605\u2605$ | 0.33 | $20\xb7D$ | 0.33 | $KMate\xb4rn\nu =5/2$$\u2605\u2605\u2605$ |

0.33 | recent (TSS1) $\u2193$ | 0.34 | 2 $\u2193$ | 0.33 | $15\xb7D$ | 0.34 | $KSE$$\u2193$ | ||

0.34 | clustering (TSS3) $\u2193$ | 0.34 | $10\xb7D$ | 0.34 | $KMate\xb4rn\nu =3/2$$\u2193$ | ||||

0.34 | population (TSS4) $\u2193$ | ||||||||

1<$\u2605\u2605$2, 1<$\u2605\u2605\u2605$3, 1<$\u2605\u2605\u2605$4 | 1<$\u2605\u2605\u2605$2 | 1<$\u2605\u2605\u2605$2, 1<$\u2605\u2605\u2605$3 | |||||||

20-D | i | 0.35 | $k$-NN (TSS2)$\u2605\u2605\u2605$ | 0.34 | 4$\u2605\u2605\u2605$ | 0.35 | $20\xb7D$$\u2605\u2605\u2605$ | 0.34 | $KMate\xb4rn\nu =5/2$$\u2605\u2605\u2605$ |

0.36 | population (TSS4) $\u2193$ | 0.37 | 2 $\u2193$ | 0.35 | $15\xb7D$$\u2605\u2605\u2605$ | 0.34 | $KMate\xb4rn\nu =3/2$$\u2193$ | ||

0.36 | clustering (TSS3) $\u2193$ | 0.36 | $10\xb7D$$\u2193$ | 0.39 | $KSE$$\u2193$ | ||||

0.36 | recent (TSS1) $\u2193$ | ||||||||

1<$\u2605$2, 1<$\u2605\u2605\u2605$3, 1<$\u2605\u2605\u2605$4 | 1<$\u2605\u2605\u2605$2 | 1<$\u2605\u2605\u2605$3, 2<$\u2605\u2605$3 | 1<$\u2605$2, 1<$\u2605\u2605\u2605$3, | ||||||

2<$\u2605\u2605\u2605$3 | |||||||||

20-D | ii | 0.34 | $k$-NN (TSS2)$\u2605\u2605\u2605$ | 0.33 | 4$\u2605\u2605\u2605$ | 0.34 | $20\xb7D$$\u2605\u2605\u2605$ | 0.32 | $KMate\xb4rn\nu =5/2$$\u2605\u2605\u2605$ |

0.35 | recent (TSS1) $\u2193$ | 0.36 | 2 $\u2193$ | 0.35 | $15\xb7D$$\u2605\u2605\u2605$ | 0.34 | $KMate\xb4rn\nu =3/2$$\u2193$ | ||

0.35 | population (TSS4) $\u2193$ | 0.36 | $10\xb7D$$\u2193$ | 0.38 | $KSE$$\u2193$ | ||||

0.36 | clustering (TSS3) $\u2193$ | ||||||||

1<$\u2605$2, 1<$\u2605\u2605\u2605$3, 1<$\u2605\u2605\u2605$4, 2<$\u2605$4 | 1<$\u2605\u2605\u2605$2 | 1<$\u2605\u2605\u2605$3, 2<$\u2605\u2605\u2605$3 | 1<$\u2605\u2605\u2605$2, 1<$\u2605\u2605\u2605$3, | ||||||

2<$\u2605\u2605\u2605$3 |

Not surprisingly, the highest error $RDE\mu /rsucc$ is visible on the *Attractive sector*$f6$ function. This benchmark has its optimum in a point without continuous derivatives and with different slopes in different directions which is, therefore, very hard to regress by GPs. On the contrary, the lowest errors (in lower dimensions often equal to 0) were always achieved on the *Linear slope*$f5$ whose results were omitted from the heatmaps for better color scaling of the remaining functions. Low errors were measured in higher dimensions on the *Sphere* function $f1$; relatively high errors in lower dimensions on $f1$ are probably due to a fast decrease of the CMA-ES step-size $\sigma $, which results in the low number of points in the training sets.

Let us look more closely at the effects of individual surrogate model parameters.

The dependence of the model error on the maximum radius of the training set $rmaxA$ seems to be problem-dependent in low dimensions (in 2-D and 5-D—lower errors can be observed for smaller radius, for example, on the

*Rastrigin's*functions $f3$ and $f15$ or on the multimodal*Gallagher's Gaussians*$f21$ and $f22$). Rather surprising are the significantly lower errors for smaller radius in the 5-D, but Table 3 shows that larger radius yields significantly lower error in the 10-D and 20-D spaces.The model error appears not to depend on the number of training points $Nmax$ (with the exception of 20-D where both $20\xb7D$ and $15\xb7D$ points are significantly better than $10\xb7D$ points). This insignificance is, however, affected by the number of points that are available for the training sets: the number of these points is often lower than $10\xb7D$.

Further, the choice of the covariance function has shown to be insignificant for 2-D and 3-D, but starting from 5-D, the Matérn functions have shown to be significantly better and starting from 10-D even more specifically with the parameter $\nu =5/2$. Results in Figure 1 reveal that even though the squared exponential covariance exhibits in few cases the best performance (e.g., $f9$ in 2-D), it is more often susceptible to training failure or high $RDE\mu $ error, as can be seen, for example, on $f21$ and $f23$ in 10-D or on several functions in 20-D.

Finally, the TSS method of choice according to Table 3 is (TSS2)—taking the $k$ nearest neighbors to each of the points in the population, in lower dimensions closely followed by (TSS1)—taking up to $Nmax$ most recent points.

For the experiments in Section 5.3, the surrogate model settings performing significantly best in 10-D and 20-D were chosen for all dimensions: $TSS=$ (TSS2), $rmaxA=4Q\chi 2(0.99,D)$, $Nmax=20\xb7D$, $K=KMate\xb4rn\nu =5/2$.

**Fixed hyperparameter values**. In addition to the evaluation of GP model parameters, we performed a brief investigation of inferring the values of two GP hyperparameters $\u2113$ and $m\mu $ from the DTS-CMA-ES state variables instead of their ML estimation.

Table 4 shows the third quartiles of $RDE\mu /rsucc$ for independently set $\u2113=1$ or $\mu (x)=0$ for two covariance functions, measured on the same datasets as in the previous evaluation—on the $2\xd715$ populations of points per each COCO function and considered dimension. As can be seen, using the fixed lengthscale $\u2113=1$ provides practically the same results as ML-estimated $\u2113$ for $D=2,3$, but is harmful in higher dimensions 10-D and 20-D where fitting the lengthscale seems to be essential for a good GP prediction. On the other hand, using the constant mean function $\mu (x)=0$ in connection with the Matérn covariance in higher dimensions does not dramatically deteriorate the prediction, but provides worse results in low-dimensional spaces.

covar. fun. . | $\mu (x)$ . | $\u2113$ . | 2-D . | 3-D . | 5-D . | 10-D . | 20-D . | |||||
---|---|---|---|---|---|---|---|---|---|---|---|---|

$KMate\xb4rn\nu =5/2$ | ML estimate | ML estimate | 0.41 | $\xb10.07$ | $0.39$ | $\xb10.07$ | $0.33$ | $\xb10.06$ | $0.31$ | $\xb10.08$ | $0.31$ | $\xb10.12$ |

$KMate\xb4rn\nu =5/2$ | ML estimate | $\u2113=1$ | 0.41 | $\xb10.06$ | 0.39 | $\xb10.07$ | 0.38 | $\xb10.07$ | 0.42 | $\xb10.06$ | 0.45 | $\xb10.08$ |

$KMate\xb4rn\nu =5/2$ | $m\mu =0$ | ML estimate | 0.45 | $\xb10.08$ | 0.42 | $\xb10.08$ | 0.37 | $\xb10.09$ | 0.33 | $\xb10.08$ | 0.33 | $\xb10.11$ |

$KMate\xb4rn\nu =5/2$ | $m\mu =0$ | $\u2113=1$ | $0.40$ | $\xb10.06$ | 0.39 | $\xb10.07$ | 0.39 | $\xb10.07$ | 0.44 | $\xb10.06$ | 0.49 | $\xb10.07$ |

$KSE$ | ML estimate | ML estimate | 0.41 | $\xb10.07$ | 0.39 | $\xb10.07$ | 0.35 | $\xb10.05$ | 0.32 | $\xb10.08$ | 0.37 | $\xb10.13$ |

$KSE$ | ML estimate | $\u2113=1$ | 0.41 | $\xb10.06$ | 0.40 | $\xb10.07$ | 0.39 | $\xb10.07$ | 0.42 | $\xb10.06$ | 0.47 | $\xb10.09$ |

$KSE$ | $m\mu =0$ | ML estimate | 0.45 | $\xb10.09$ | 0.44 | $\xb10.09$ | 0.39 | $\xb10.08$ | 0.36 | $\xb10.10$ | 0.58 | $\xb10.34$ |

$KSE$ | $m\mu =0$ | $\u2113=1$ | 0.41 | $\xb10.06$ | 0.41 | $\xb10.07$ | 0.40 | $\xb10.07$ | 0.44 | $\xb10.06$ | 0.51 | $\xb10.09$ |

covar. fun. . | $\mu (x)$ . | $\u2113$ . | 2-D . | 3-D . | 5-D . | 10-D . | 20-D . | |||||
---|---|---|---|---|---|---|---|---|---|---|---|---|

$KMate\xb4rn\nu =5/2$ | ML estimate | ML estimate | 0.41 | $\xb10.07$ | $0.39$ | $\xb10.07$ | $0.33$ | $\xb10.06$ | $0.31$ | $\xb10.08$ | $0.31$ | $\xb10.12$ |

$KMate\xb4rn\nu =5/2$ | ML estimate | $\u2113=1$ | 0.41 | $\xb10.06$ | 0.39 | $\xb10.07$ | 0.38 | $\xb10.07$ | 0.42 | $\xb10.06$ | 0.45 | $\xb10.08$ |

$KMate\xb4rn\nu =5/2$ | $m\mu =0$ | ML estimate | 0.45 | $\xb10.08$ | 0.42 | $\xb10.08$ | 0.37 | $\xb10.09$ | 0.33 | $\xb10.08$ | 0.33 | $\xb10.11$ |

$KMate\xb4rn\nu =5/2$ | $m\mu =0$ | $\u2113=1$ | $0.40$ | $\xb10.06$ | 0.39 | $\xb10.07$ | 0.39 | $\xb10.07$ | 0.44 | $\xb10.06$ | 0.49 | $\xb10.07$ |

$KSE$ | ML estimate | ML estimate | 0.41 | $\xb10.07$ | 0.39 | $\xb10.07$ | 0.35 | $\xb10.05$ | 0.32 | $\xb10.08$ | 0.37 | $\xb10.13$ |

$KSE$ | ML estimate | $\u2113=1$ | 0.41 | $\xb10.06$ | 0.40 | $\xb10.07$ | 0.39 | $\xb10.07$ | 0.42 | $\xb10.06$ | 0.47 | $\xb10.09$ |

$KSE$ | $m\mu =0$ | ML estimate | 0.45 | $\xb10.09$ | 0.44 | $\xb10.09$ | 0.39 | $\xb10.08$ | 0.36 | $\xb10.10$ | 0.58 | $\xb10.34$ |

$KSE$ | $m\mu =0$ | $\u2113=1$ | 0.41 | $\xb10.06$ | 0.41 | $\xb10.07$ | 0.40 | $\xb10.07$ | 0.44 | $\xb10.06$ | 0.51 | $\xb10.09$ |

In summary, the ability of Gaussian processes to predict the right ranking varies to a large extent with a problem at hand. On the contrary, a particular *type of covariance function* or some of the training set parameters, such as the algorithm of *selection points from the archive* or the *maximum radius* for this selection, might be preferred, especially in higher dimensions.

### 5.2 DTS-CMA-ES Implementation and Parameter Settings

The implementation of the DTS-CMA-ES is based on the IPOP-CMA-ES in the MATLAB version $3.62\beta $. For the COCO benchmarks, the initial mean is uniformly sampled from $[-4,4]D$, the initial step-size $\sigma (0)=83$ and the number of IPOP restarts is limited to 50. Changes in the CMA-ES code are rather subtle: the doubly trained EC is employed just after the CMA-ES new population sampling (step 3 in Alg. 3), and the EC returns to the CMA-ES a mix of the original and GP model fitness, scaled such that none of the predicted fitness is lower than the so-far minimum original $f$-value in $A$.

Both the GP model trainings in the DTS-CMA-ES (steps 4 and 9) can fail, usually either due to numerical errors in Cholesky matrix decomposition during GP hyperparameter estimation (used for the GP covariance matrix inversion), or the model is over-fitted in such a way that it returns a constant almost everywhere. If training the first model $fM1$ fails and there is no successfully trained model at most two generations old, the algorithm proceeds as the standard CMA-ES: the whole population is evaluated with the original fitness. If training the second model $fM2$ fails, the first model is used for the final prediction, too, and the self-adaptation step is skipped.

**Ratio of original-evaluated points $\alpha $**. The ratio of original-evaluated points should inversely depend on the model error and can be either set fixed or can be adapted by the DTS-CMA-ES. For the evaluations budget of roughly 50–200 FE/D, the experimentally discovered fixed value $\alpha =0.05$ has shown to achieve one of the best results on the COCO testbed (Pitra et al., 2016)---updated results are provided later in this section. The self-adaptation setting is, on the other hand, better for larger evaluation budgets and for harder-to-regress functions, and its setting is described below.

**Population size**. Replacing the exact fitness values with values returned by a surrogate model introduces an uncertainty of which the CMA-ES is not aware. The CMA-ES updates are then too rapid with respect to the number of original FE/D, and especially the step-size $\sigma (g)$ is shrunk too fast. As a result, the CMA-ES is often trapped in a local optimum. As the goal of our work is not a precise fine-tuning of the CMA-ES internal constants, we use the population size $\lambda =8+\u23086lnD\u2309$ that is doubled compared to the default. This enlargement results in adequate CMA-ES updates and the convergence rate of the DTS-CMA-ES is considerably improved (Pitra et al., 2016).

**Criteria for the selection of original-evaluated points**. Figure 2 shows aggregated results of all five mentioned criteria, separately for selected unimodal and multimodal COCO benchmarks. These graphs confirm both theoretical expectations and the previous investigations, e.g., in Ulmer et al. (2003). Selecting the original-evaluated points based on the best GP predictive mean ($CM$) works most *exploitative*: it leads the algorithm towards the nearest local optimum. It thus works well on the unimodal functions, but it provides the worst results on the multimodal benchmarks.

Similarly, the $CEI$ is in comparison with the $CPoI$ less localized and thus more *explorative*. The $CPoI$ performs considerably better on the unimodal functions while the $CEI$ and also the $CSTD$ have superior results on the multimodal functions where the $CPoI$ still performs adequately. $CERDE$ has been shown neither the best nor the worst; it performs similarly to or slightly better than the $CSTD$ on the unimodal functions, but considerably worse in the case of the multimodal benchmarks.

Altogether, the best average results on the whole COCO benchmark set is provided by the $CPoI$, which is used with the threshold $T=fmin-0.05(fmax-fmin)$ in the rest of this work. However, it can be replaced with the $CEI$ or $CSTD$ when some a priori knowledge about the multimodality of the fitness is available.

**Self-adaptation parameter setting**. The specific behavior of the proposed $\alpha $-self-adaptation is determined by the following set of parameters whose values were recently investigated by Pitra et al. (2017).

$\beta $ – exponential smoothing update rate (a.k.a. learning rate) moderates vivid $\alpha $ oscillations. Pitra et al. (2017) recommend $\beta =0.3$, which is used in our experiments, too, and which performs very similarly to the value $\beta =0.2$, the value used in the $s*$ACM-ES (Loshchilov et al., 2013). The DTS-CMA-ES has shown to be rather robust with respect to the exact setting of this parameter.

$\alpha min$, $\alpha max$ – minimum and maximum bounds for the ratio $\alpha $—should be set to enable the self-adaptation to adequately react on a broad set of problems. The used minimum bound $\alpha min=0.04$ corresponds to one evaluated point per generation in $D=2,\u2026,20$, and the maximum bound should be set to $\alpha max=1.0$ for the possibility of disabling the surrogate modeling completely on the fitness landscapes where the GPs are not able to satisfyingly predict the population ranking.

- $\epsilon min$, $\epsilon max$ – lower and upper saturation bounds on the exponentially smoothed $RDE\mu $ error for the linear transfer function. These parameters determine the self-adaptive behavior to a large extent and should change with both the dimension $D$ of the problem and the actual used ratio $\alpha (g+1)$. We have used the proposed set of multiple linear regression models $fmin\epsilon (\alpha ,D)$, $fmax\epsilon (\alpha ,D)$ from Pitra et al. (2017) with the largest considered interval $\epsilon min,\epsilon max$, i.e. using the regression models $\epsilon minQ2$ and $\epsilon maxQ3$ that are most capable of adapting to changing GP ranking error. We report the parameters of these linear models which were tuned on one half of the COCO functionswhere$\epsilon minQ2=fmin\epsilon (\alpha ,D)=(1ln(D)\alpha \alpha ln(D)\alpha 2)\xb7bmin\epsilon maxQ3=fmax\epsilon (\alpha ,D)=(1ln(D)\alpha \alpha ln(D)\alpha 2)\xb7bmax$Because the calculation of the values $\epsilon min$, $\epsilon max$ and the calculation of the resulting ratio $\alpha (g+1)$ (step 3 in Algorithm 4) are mutually dependent, they are alternately repeated in a cycle until convergence or 500 iterations.$bmin=(0.11-0.0092-0.130.0440.14)\u22a4bmax=(0.35-0.0470.440.044-0.19)\u22a4.$

### 5.3 Evaluation of Gaussian Process-Based CMA-ES Algorithms

This section assesses the performance of the six algorithms that are based on the CMA-ES and a GP model: the S-CMA-ES, the DTS-CMA-ES in both the fixed and self-adaptive version, the MA-ES, GPOP and SAPEO. For the sake of comparison with other state-of-the art black-box optimizers, the graphs in Figures 3 and 4 also present the results of the standard IPOP-CMA-ES with both the recommended and the doubled population size, two other surrogate-assisted CMA-ES algorithms lmm-CMA and $s*$ACM-ES, the SMAC algorithm as a representative of Bayesian optimization algorithms, and two local-search numerical optimization algorithms based on a trust region method: the BOBYQA algorithm by Powell (2009) and the interior-point method (Byrd et al., 2000) from the MATLAB fmincon function.^{4} The numbers of COCO functions for which one algorithm is better than each of the others are presented in Table 5.

$5-D$ . | S-CMA-ES . | 0.05 DTS . | adp. DTS . | MA-ES . | GPOP . | SAPEO . | CMA-ES . | CMA-ES 2pop . | $s*$ACM-ES . | lmm-CMA . | BOBYQA . | SMAC . | fmincon . | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

eval. budget . | $13$ . | 1 . | $13$ . | 1 . | $13$ . | 1 . | $13$ . | 1 . | $13$ . | 1 . | $13$ . | 1 . | $13$ . | 1 . | $13$ . | 1 . | $13$ . | 1 . | $13$ . | 1 . | $13$ . | 1 . | $13$ . | 1 . | $13$ . | 1 . |

S-CMA-ES | — | — | 3 | 2 | 5 | 2 | 12 | 9 | 9 | 13 | 9 | 7 | 14 | 12 | 13 | 14 | 11 | 10 | 5 | 2 | 8 | 8 | 16 | 18 | 12 | 10 |

0.05 DTS | 21$*$ | 22$*$ | — | — | 16 | 15 | 18 | 19$*$ | 21 | 23$*$ | 19 | 21$*$ | 19$*$ | 19$*$ | 21$*$ | 18$*$ | 20$*$ | 19$*$ | 15 | 16 | 16 | 19 | 19$*$ | 20$*$ | 16 | 22$*$ |

adp. DTS | 19 | 22$*$ | 6 | 7 | — | — | 17 | 20 | 15 | 20$*$ | 18 | 19 | 22 | 23$*$ | 23$*$ | 23$*$ | 22 | 18 | 12 | 16 | 12 | 15 | 18 | 20$*$ | 13 | 15 |

MA-ES | 12 | 15 | 6 | 5 | 7 | 4 | — | — | 12 | 15 | 13 | 12 | 18 | 16 | 20 | 15 | 16 | 11 | 8 | 6 | 10 | 10 | 14 | 19 | 11 | 10 |

GPOP | 15 | 11 | 3 | 1 | 9 | 4 | 12 | 9 | — | — | 11 | 9 | 14 | 13 | 16 | 11 | 15 | 9 | 8 | 7 | 10 | 7 | 12 | 13 | 14 | 10 |

SAPEO | 15 | 17 | 5 | 3 | 6 | 5 | 11 | 12 | 13 | 15 | — | — | 15 | 16 | 21 | 16 | 14 | 11 | 6 | 6 | 10 | 10 | 14 | 18 | 13 | 9 |

CMA-ES | 10 | 12 | 5 | 5 | 2 | 1 | 6 | 8 | 10 | 11 | 9 | 8 | — | — | 18 | 10 | 11 | 11 | 3 | 6 | 11 | 9 | 12 | 15 | 11 | 9 |

CMA-ES 2pop | 11 | 10 | 3 | 6 | 1 | 1 | 4 | 9 | 8 | 13 | 3 | 8 | 6 | 14 | — | — | 9 | 10 | 5 | 8 | 8 | 10 | 8 | 15 | 11 | 10 |

$s*$ACM-ES | 13 | 14 | 4 | 5 | 2 | 6 | 8 | 13 | 9 | 15 | 10 | 13 | 13 | 13 | 15 | 14 | — | — | 6 | 5 | 9 | 12 | 13 | 17 | 12 | 13 |

lmm-CMA | 19 | 22 | 9 | 8 | 12 | 8 | 16 | 18 | 16 | 17 | 18 | 18 | 21 | 18 | 19$*$ | 16 | 18 | 19 | — | — | 14 | 14 | 17 | 19$*$ | 13 | 13 |

BOBYQA | 16 | 16 | 8 | 4 | 12 | 8 | 14 | 14 | 14 | 17 | 14 | 14 | 13 | 15 | 16 | 14 | 15 | 12 | 10 | 10 | — | — | 15 | 17 | 16 | 12 |

SMAC | 8 | 6 | 5 | 3 | 6 | 3 | 10 | 5 | 12 | 11 | 10 | 6 | 12 | 9 | 16 | 9 | 11 | 7 | 6 | 4 | 9 | 7 | — | — | 13 | 11 |

fmincon | 12 | 14 | 8 | 2 | 11 | 9 | 13 | 14 | 10 | 14 | 11 | 15 | 13 | 15 | 13 | 14 | 12 | 11 | 11 | 11 | 8 | 12 | 11 | 13 | — | — |

S-CMA-ES | — | — | 1 | 3 | 8 | 6 | 11 | 10 | 15 | 12 | 8 | 8 | 11 | 11 | 12 | 7 | 9 | 5 | 11 | 4 | 11 | 11 | 18 | 16 | 13 | 10 |

0.05 DTS | 23 | 21 | — | — | 13 | 14 | 18 | 17 | 18$*$ | 18 | 17 | 16 | 19$*$ | 18 | 17$*$ | 14 | 15 | 6 | 17 | 14 | 19 | 17 | 20$*$ | 19$*$ | 17 | 13 |

adp. DTS | 16 | 18$*$ | 9 | 10 | — | — | 20 | 21 | 20$*$ | 21$*$ | 17 | 18 | 22$*$ | 21 | 23$*$ | 21 | 15 | 9 | 17 | 12 | 17 | 13 | 20$*$ | 20$*$ | 16 | 11 |

MA-ES | 13 | 14 | 6 | 7 | 4 | 3 | — | — | 15 | 15 | 8 | 3 | 18 | 11 | 17 | 9 | 8 | 4 | 6 | 6 | 12 | 11 | 16 | 18 | 12 | 11 |

GPOP | 9 | 12 | 6 | 6 | 4 | 3 | 9 | 9 | — | — | 9 | 7 | 11 | 10 | 13 | 7 | 6 | 5 | 7 | 6 | 9 | 6 | 16 | 17 | 14 | 9 |

SAPEO | 16 | 16 | 7 | 8 | 7 | 6 | 16 | 21 | 15 | 17 | — | — | 21 | 21 | 20 | 16 | 11 | 11 | 9 | 10 | 15 | 12 | 20 | 20$*$ | 16 | 11 |

CMA-ES | 13 | 13 | 5 | 6 | 2 | 3 | 6 | 13 | 13 | 14 | 3 | 3 | — | — | 19 | 12 | 6 | 9 | 4 | 6 | 11 | 11 | 14 | 17 | 12 | 9 |

CMA-ES 2pop | 12 | 17 | 7 | 10 | 1 | 3 | 7 | 15 | 11 | 17 | 4 | 8 | 5 | 12 | — | — | 5 | 8 | 5 | 10 | 8 | 13 | 14 | 19 | 12 | 9 |

$s*$ACM-ES | 15 | 19$*$ | 9 | 18 | 9 | 15 | 16 | 20 | 18 | 19$*$ | 13 | 13 | 18 | 15 | 19 | 16 | — | — | 10 | 13 | 16 | 17 | 17 | 21$*$ | 16 | 15 |

lmm-CMA | 13 | 20 | 7 | 10 | 7 | 12 | 18 | 18 | 17 | 18 | 15 | 14 | 20 | 18 | 19 | 14 | 14 | 11 | — | — | 18 | 18 | 17$*$ | 20$*$ | 13 | 10 |

BOBYQA | 13 | 13 | 5 | 7 | 7 | 11 | 12 | 13 | 15 | 18 | 9 | 12 | 13 | 13 | 16 | 11 | 8 | 7 | 6 | 6 | — | — | 15 | 17 | 13 | 13 |

SMAC | 6 | 8 | 4 | 5 | 4 | 4 | 8 | 6 | 8 | 7 | 4 | 4 | 10 | 7 | 10 | 5 | 7 | 3 | 6 | 3 | 9 | 7 | — | — | 11 | 10 |

fmincon | 11 | 14 | 7 | 11 | 8 | 13 | 12 | 13 | 10 | 15 | 8 | 13 | 12 | 15 | 12 | 15 | 8 | 9 | 11 | 14 | 11 | 11 | 13 | 14 | — | — |

$5-D$ . | S-CMA-ES . | 0.05 DTS . | adp. DTS . | MA-ES . | GPOP . | SAPEO . | CMA-ES . | CMA-ES 2pop . | $s*$ACM-ES . | lmm-CMA . | BOBYQA . | SMAC . | fmincon . | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

eval. budget . | $13$ . | 1 . | $13$ . | 1 . | $13$ . | 1 . | $13$ . | 1 . | $13$ . | 1 . | $13$ . | 1 . | $13$ . | 1 . | $13$ . | 1 . | $13$ . | 1 . | $13$ . | 1 . | $13$ . | 1 . | $13$ . | 1 . | $13$ . | 1 . |

S-CMA-ES | — | — | 3 | 2 | 5 | 2 | 12 | 9 | 9 | 13 | 9 | 7 | 14 | 12 | 13 | 14 | 11 | 10 | 5 | 2 | 8 | 8 | 16 | 18 | 12 | 10 |

0.05 DTS | 21$*$ | 22$*$ | — | — | 16 | 15 | 18 | 19$*$ | 21 | 23$*$ | 19 | 21$*$ | 19$*$ | 19$*$ | 21$*$ | 18$*$ | 20$*$ | 19$*$ | 15 | 16 | 16 | 19 | 19$*$ | 20$*$ | 16 | 22$*$ |

adp. DTS | 19 | 22$*$ | 6 | 7 | — | — | 17 | 20 | 15 | 20$*$ | 18 | 19 | 22 | 23$*$ | 23$*$ | 23$*$ | 22 | 18 | 12 | 16 | 12 | 15 | 18 | 20$*$ | 13 | 15 |

MA-ES | 12 | 15 | 6 | 5 | 7 | 4 | — | — | 12 | 15 | 13 | 12 | 18 | 16 | 20 | 15 | 16 | 11 | 8 | 6 | 10 | 10 | 14 | 19 | 11 | 10 |

GPOP | 15 | 11 | 3 | 1 | 9 | 4 | 12 | 9 | — | — | 11 | 9 | 14 | 13 | 16 | 11 | 15 | 9 | 8 | 7 | 10 | 7 | 12 | 13 | 14 | 10 |

SAPEO | 15 | 17 | 5 | 3 | 6 | 5 | 11 | 12 | 13 | 15 | — | — | 15 | 16 | 21 | 16 | 14 | 11 | 6 | 6 | 10 | 10 | 14 | 18 | 13 | 9 |

CMA-ES | 10 | 12 | 5 | 5 | 2 | 1 | 6 | 8 | 10 | 11 | 9 | 8 | — | — | 18 | 10 | 11 | 11 | 3 | 6 | 11 | 9 | 12 | 15 | 11 | 9 |

CMA-ES 2pop | 11 | 10 | 3 | 6 | 1 | 1 | 4 | 9 | 8 | 13 | 3 | 8 | 6 | 14 | — | — | 9 | 10 | 5 | 8 | 8 | 10 | 8 | 15 | 11 | 10 |

$s*$ACM-ES | 13 | 14 | 4 | 5 | 2 | 6 | 8 | 13 | 9 | 15 | 10 | 13 | 13 | 13 | 15 | 14 | — | — | 6 | 5 | 9 | 12 | 13 | 17 | 12 | 13 |

lmm-CMA | 19 | 22 | 9 | 8 | 12 | 8 | 16 | 18 | 16 | 17 | 18 | 18 | 21 | 18 | 19$*$ | 16 | 18 | 19 | — | — | 14 | 14 | 17 | 19$*$ | 13 | 13 |

BOBYQA | 16 | 16 | 8 | 4 | 12 | 8 | 14 | 14 | 14 | 17 | 14 | 14 | 13 | 15 | 16 | 14 | 15 | 12 | 10 | 10 | — | — | 15 | 17 | 16 | 12 |

SMAC | 8 | 6 | 5 | 3 | 6 | 3 | 10 | 5 | 12 | 11 | 10 | 6 | 12 | 9 | 16 | 9 | 11 | 7 | 6 | 4 | 9 | 7 | — | — | 13 | 11 |

fmincon | 12 | 14 | 8 | 2 | 11 | 9 | 13 | 14 | 10 | 14 | 11 | 15 | 13 | 15 | 13 | 14 | 12 | 11 | 11 | 11 | 8 | 12 | 11 | 13 | — | — |

S-CMA-ES | — | — | 1 | 3 | 8 | 6 | 11 | 10 | 15 | 12 | 8 | 8 | 11 | 11 | 12 | 7 | 9 | 5 | 11 | 4 | 11 | 11 | 18 | 16 | 13 | 10 |

0.05 DTS | 23 | 21 | — | — | 13 | 14 | 18 | 17 | 18$*$ | 18 | 17 | 16 | 19$*$ | 18 | 17$*$ | 14 | 15 | 6 | 17 | 14 | 19 | 17 | 20$*$ | 19$*$ | 17 | 13 |

adp. DTS | 16 | 18$*$ | 9 | 10 | — | — | 20 | 21 | 20$*$ | 21$*$ | 17 | 18 | 22$*$ | 21 | 23$*$ | 21 | 15 | 9 | 17 | 12 | 17 | 13 | 20$*$ | 20$*$ | 16 | 11 |

MA-ES | 13 | 14 | 6 | 7 | 4 | 3 | — | — | 15 | 15 | 8 | 3 | 18 | 11 | 17 | 9 | 8 | 4 | 6 | 6 | 12 | 11 | 16 | 18 | 12 | 11 |

GPOP | 9 | 12 | 6 | 6 | 4 | 3 | 9 | 9 | — | — | 9 | 7 | 11 | 10 | 13 | 7 | 6 | 5 | 7 | 6 | 9 | 6 | 16 | 17 | 14 | 9 |

SAPEO | 16 | 16 | 7 | 8 | 7 | 6 | 16 | 21 | 15 | 17 | — | — | 21 | 21 | 20 | 16 | 11 | 11 | 9 | 10 | 15 | 12 | 20 | 20$*$ | 16 | 11 |

CMA-ES | 13 | 13 | 5 | 6 | 2 | 3 | 6 | 13 | 13 | 14 | 3 | 3 | — | — | 19 | 12 | 6 | 9 | 4 | 6 | 11 | 11 | 14 | 17 | 12 | 9 |

CMA-ES 2pop | 12 | 17 | 7 | 10 | 1 | 3 | 7 | 15 | 11 | 17 | 4 | 8 | 5 | 12 | — | — | 5 | 8 | 5 | 10 | 8 | 13 | 14 | 19 | 12 | 9 |

$s*$ACM-ES | 15 | 19$*$ | 9 | 18 | 9 | 15 | 16 | 20 | 18 | 19$*$ | 13 | 13 | 18 | 15 | 19 | 16 | — | — | 10 | 13 | 16 | 17 | 17 | 21$*$ | 16 | 15 |

lmm-CMA | 13 | 20 | 7 | 10 | 7 | 12 | 18 | 18 | 17 | 18 | 15 | 14 | 20 | 18 | 19 | 14 | 14 | 11 | — | — | 18 | 18 | 17$*$ | 20$*$ | 13 | 10 |

BOBYQA | 13 | 13 | 5 | 7 | 7 | 11 | 12 | 13 | 15 | 18 | 9 | 12 | 13 | 13 | 16 | 11 | 8 | 7 | 6 | 6 | — | — | 15 | 17 | 13 | 13 |

SMAC | 6 | 8 | 4 | 5 | 4 | 4 | 8 | 6 | 8 | 7 | 4 | 4 | 10 | 7 | 10 | 5 | 7 | 3 | 6 | 3 | 9 | 7 | — | — | 11 | 10 |

fmincon | 11 | 14 | 7 | 11 | 8 | 13 | 12 | 13 | 10 | 15 | 8 | 13 | 12 | 15 | 12 | 15 | 8 | 9 | 11 | 14 | 11 | 11 | 13 | 14 | — | — |

Whereas the detailed parameter setting of the DTS-CMA-ES can be found above, the list of the settings of the remaining algorithms follows.

**S-CMA-ES**population size $\lambda =4+\u23083lnD\u2309$, model life-length $gm=5$, Gaussian process surrogate model: (TSS3), $rmaxA=10$, $Nmax=15\xb7D$, $KMate\xb4rn\nu =5/2$**MA-ES**population size $\lambda =10$, parent number $\mu =2$, number of training points $2\lambda $, extended population size $\lambda Pre=3\lambda $, predictive mean as pre-selection criterion, our MATLAB implementation based on the implementation of the S-CMA-ES**GPOP**population size $\lambda =10$, parent number $\mu =2$, training set size parameters $NC=NR=5D$, input space termination tolerance $10-8$, minimum fitness change for 10 last iterations unless restart $10-9$, perturbation size $m=1$, our GPML-based MATLAB implementation using the Gaussian processes from the S-CMA-ES**SAPEO**the COCO/BBOB results have been kindly provided by the algorithm's authors; variant*ucp-2*: sample-size $2\lambda $, using three increasingly risky dominance relations**CMA-ES**the IPOP-CMA-ES version (MATLAB code 3.62$\beta $) with single ($4+\u230a3lnD\u230b$) and doubled population size, 50 IPOP restarts, $\sigma (0)=83$, other settings left default**lmm-CMA, $s*$ACM-ES, SMAC**results downloaded from the COCO/BBOB results archive, the BIPOP-$s*$ACM-ES-k (2013) used for the $s*$ACM-ES, SMAC results are only up to 100 FE/D; initializations of the lmm-CMA and SMAC are rather disputable on $f19$**BOBYQA, fmincon**both algorithms initialized from a new uniformly sampled point every fifth restart, otherwise used in their default settings, Dlib C$++$ implementation of BOBYQA, fmincon with the interior-point algorithm from MATLAB 2017a

The graphs in Figures 3 and 4 document the differences between the GP/CMA-ES algorithms. They depict the best-achieved logarithmic median distances to the benchmarks' optima $\Delta flog$ for the respective numbers of FE/D. These medians (and in the case of the GP/CMA-ES algorithms also the first and third quartiles) are from 15 independent instances for each respective algorithm, function and dimension. The results of the GP/CMA-ES algorithms are represented by thick lines and the reference algorithms use thinner lines to see whether some better algorithm than any of the GP/CMA-ES exists.

**S-CMA-ES**. A noticeable pattern among our algorithms is, with few exceptions in 20-D, that the S-CMA-ES almost always converges to the optimum at the same rate or slower than both versions of DTS-CMA-ES. It might be at least partly because the generation-based evolution control of the S-CMA-ES is not able to exploit the GP models uncertainty, and, therefore, we will point our attention to the DTS-CMA-ES.

**DTS-CMA-ES**. The fixed ($\alpha =0.05$) version of the DTS-CMA-ES converges fastest out of the six GP/CMA-ES algorithms particularly on the subset $f8\u202614$ of the unimodal functions (which are more or less after some transformation similar to the Rosenbrock's function $f8$) and also on the highly multimodal functions $f19\u202621,23,24$, often functions without any global structure. While some of the non-GP algorithms $s*$ACM-ES, lmm-CMA, BOBYQA or fmincon perform similarly or better on the first part of functions ($f8\u202614$), the performance on the multimodal functions ($f19\u202621,23,24$) is often the best out of all compared algorithms. These results are probably because the fixed DTS-CMA-ES is able to descend to the functions' optima regardless of the relatively high model error which forces the adaptive algorithms to spend more original evaluations than necessary. On the other hand, the limitations of GP models can be seen on the non-smooth functions $f6$ or $f13$ where the GP is not able to regress the fitness around the optimum and where other algorithms perform better.

The self-adaptive DTS-CMA-ES is mostly the second or the third GP/CMA-ES algorithm on the functions where the fixed DTS-CMA-ES succeeds (see the previous paragraph). However, the self-adaptation improves the fixed DTS-CMA-ES behavior on the multimodal functions with a global structure, especially on the Rastrigin functions $f3,4,15$ and Shaffers functions $f17,18$ where the self-adaptive DTS-CMA-ES is usually the best of all 13 algorithms. A closer look at the convergence graphs reveals that the self-adaptive DTS-CMA-ES indeed speeds up the 2-population IPOP-CMA-ES on these functions, approximately by the factor of 2.

**MA-ES**. Our implementation of the MA-ES most often occupies the worse half of the GP-based algorithm ranking, even on the unimodal functions for which the $(2,10)$ setting was, according to its authors, mainly aimed. The results are often very similar to or slightly better than the results of the IPOP-CMA-ES, which signifies that the proposed individual-based EC is not able to adequately speed up the CMA-ES compared to the EC of the other surrogate algorithms. However, the results on the *Attractive sector*$f6$ are rather surprising because the MA-ES' model does not mislead the search even though the function is not smooth in the optimum and thus very hard to regress by GPs.

**GPOP, SMAC**. Although the GPOP belongs to Bayesian optimization algorithms, it descends much closer to the global optima than SMAC. For very few FE/D, SMAC is slightly faster but very soon traps in some local optima. The GPOP is often the slowest or the second slowest GP/CMA-ES algorithm on the COCO set (in 20-D on $f3,4$, $f6,7$, $f10\u202616$, $f19\u202624$), but it is the fastest during initial generations on the *Sphere*$f1$ and *Ellipsoidal*$f2$, where the GP is able to regress the function well even with a very small number of training points. The GPOP is the second on the *Schaffers'* functions $f17,18$ (rather surprisingly only in 20-D, it is the slowest in 5-D).

**$s*$ACM-ES**. The BIPOP-$s*$ACM-ES' convergence speed is often similar or slightly worse than the self-adaptive DTS-CMA-ES in 5-D. In 20-D, however, the $s*$ACM-ES converges much faster compared to the other algorithms, and it is the best or the second among all 13 algorithms on the unimodal functions $f2$, $f7,8$, $f10\cdots 14$. It probably profits from its ranking surrogate model and, therefore, from the invariance to monotonous transformations of the fitness. A considerably faster convergence is clearly visible especially on the ill-conditioned or non-smooth unimodal functions.

**SAPEO**. The SAPEO algorithm usually converges at a similar rate as the MA-ES or the IPOP-CMA-ES. Nonetheless, its performance is considerably better than the MA-ES on functions $f9$, $f11$, $f14$, $f19$, $f22$, and it is the best algorithm on $f6$ and $f16$ in 20-D, where it obviously adapts well to the fitness. Hence, the concept of uncertainty propagation is rather promising, especially considering the GP model was trained only on $2\lambda $ points.

**lmm-CMA**. The quadratic-model-based lmm-CMA ranks on most benchmarks somewhere between or very close to the fixed and self-adaptive DTS-CMA-ES ($f2\cdots 4$, $f8\cdots 12$, $f14\cdots 18$, $f21,22$) which conforms with the average results in Figure 5 where it is very close to the self-adaptive DTS-CMA-ES. This similarity in terms of the results is in accordance with the fact that both algorithms use a metric (i.e., not a rank-based) surrogate model as well as a similar kind of adaptivity.

**BOBYQA and****fmincon**. The BOBYQA is the best optimizer (out of all 13 compared) in the very beginning of the optimization, i.e. up to roughly 15–20 FE/D. Later, the behavior of both the BOBYQA and fmincon dramatically changes with the benchmark at hand. They are rather successful on the unimodal functions—BOBYQA on $f1$, $f5$, $f8,9$, and fmincon on the ill-conditioned $f10\cdots 14$ and also on $f1,2$ and $f6$ in 5-D. Moreover, they are in few cases able to successfully solve the multimodal benchmarks—BOBYQA adequately $f19\cdots 24$, and fmincon is one of the best in 20-D on $f21,22$. But not surprisingly, these algorithms otherwise often stuck in local optima, and on several benchmarks rank worse than the standard IPOP-CMA-ES.

Although it is not the prime criterion for the comparison of black-box optimizers, we also provide the average CPU time per one original function evaluation for the algorithms that we have benchmarked in Table 6.

Algorithm . | 2-D . | 3-D . | 5-D . | 10-D . | 20-D . |
---|---|---|---|---|---|

S-CMA-ES | 2.5e−01 | 1.5e−01 | 1.7e−01 | 2.6e−01 | 1.1e+00 |

$0.05/2pop$ DTS-CMA-ES | 7.3e−01 | 6.0e−01 | 7.5e−01 | 1.7e+00 | 1.3e+01 |

adaptive DTS-CMA-ES | 5.0e−01 | 2.7e−01 | 2.9e−01 | 9.1e−01 | 5.6e+00 |

MA-ES | 2.1e−02 | 4.5e−02 | 4.8e−02 | 4.6e−02 | 7.1e−02 |

GPOP | 9.4e−01 | 1.2e+00 | 1.8e+00 | 3.4e+00 | 1.2e+01 |

CMA-ES $2pop$ | 3.2e−03 | 2.2e−03 | 1.4e−03 | 8.4e−04 | 6.1e−04 |

BOBYQA | 1.8e−03 | 1.3e−03 | 9.5e−04 | 6.2e−04 | 5.3e−04 |

fmincon | 2.8e−03 | 1.7e−03 | 1.1e−03 | 6.8e−04 | 4.5e−04 |

Algorithm . | 2-D . | 3-D . | 5-D . | 10-D . | 20-D . |
---|---|---|---|---|---|

S-CMA-ES | 2.5e−01 | 1.5e−01 | 1.7e−01 | 2.6e−01 | 1.1e+00 |

$0.05/2pop$ DTS-CMA-ES | 7.3e−01 | 6.0e−01 | 7.5e−01 | 1.7e+00 | 1.3e+01 |

adaptive DTS-CMA-ES | 5.0e−01 | 2.7e−01 | 2.9e−01 | 9.1e−01 | 5.6e+00 |

MA-ES | 2.1e−02 | 4.5e−02 | 4.8e−02 | 4.6e−02 | 7.1e−02 |

GPOP | 9.4e−01 | 1.2e+00 | 1.8e+00 | 3.4e+00 | 1.2e+01 |

CMA-ES $2pop$ | 3.2e−03 | 2.2e−03 | 1.4e−03 | 8.4e−04 | 6.1e−04 |

BOBYQA | 1.8e−03 | 1.3e−03 | 9.5e−04 | 6.2e−04 | 5.3e−04 |

fmincon | 2.8e−03 | 1.7e−03 | 1.1e−03 | 6.8e−04 | 4.5e−04 |

## 6 Conclusion and Further Research Directions

More than the last decade of research in the area of model-based CMA-ES algorithms has recognized the surrogate-assisted CMA-ES algorithms as state-of-the art black-box continuous optimizers, especially when only tens to hundreds function evaluations are available for the optimization. In contrast to the overall success of Gaussian processes, employing Gaussian processes as surrogate models, namely in the algorithms MA-ES, GPOP or Kriging metamodeling CMA-ES, has gained considerably less attention compared to the quadratic-based lmm-CMAor SVM-based $s*$ACM-ES.

This article re-investigates the employment of Gaussian processes in the CMA-ES. The presented algorithm DTS-CMA-ES differs from the lmm-CMA and $s*$ACM-ES not only in the regression model, but, more importantly, it uses Gaussian process uncertainty prediction for selecting points in its evolution control. The selection that does not use uncertainty performs on average worst. In addition, uncertainty has been shown to help more on the multimodal benchmarks.

The fifth section provides an evaluation of different parameter settings. The selection of the $k$ nearest neighbor points as training sets for surrogates performed best, a selection method rather rarely encountered in literature. Other parameter settings, used in the experiments due to their good performance, are rather expected (e.g., a large radius for the selection of training set, the Matèrn covariance function, etc.).

The self-adaptive version of the DTS-CMA-ES uses a similar adaptation mechanism of the number of points for the evaluation with the original fitness as the lmm-CMA. This adaptive version also performs similar to the lmm-CMA on many COCO benchmark functions, which suggests a hypothesis that the self-adaptation decides the algorithm performance to a great extent. Less importantly, the self-adaptive DTS-CMA-ES is the best out of 13 compared algorithms on at least two benchmarks.

The last part of the article empirically compares the available implementations of the algorithms combining the Gaussian processes and the CMA-ES with the CMA-ES itself and five other state-of-the-art black box optimizers. The fastest convergence rate in terms of the original fitness evaluations out of the six compared GP/CMA-ES algorithms was reached by the DTS-CMA-ES with the fixed parameter $\alpha =0.05$. This DTS-CMA-ES variant represents an algorithm of choice for multimodal functions with weak global structure and is very eligible for unimodal landscapes, too, especially in lower dimensions. The self-adaptive version, on the other hand, excels on the globally decreasing multimodal functions where it outperforms other compared algorithms.

Competitive results on a relatively broad spectrum of the COCO benchmarks were also provided by the two non-GP surrogate-assisted CMA-ES algorithms. Since the $s*$ACM-ES uses a ranking surrogate model, it is valuable especially for functions without continuous derivatives in the optima. It also provides impressive results in 20-D where it represents the best choice for many unimodal functions while still providing average results on the multimodal ones.

According to the experimental results, different surrogate CMA-ES algorithms or their variants perform best on different fitness landscapes, but the mechanism of selecting the best-performing setting or smarter adaptation to the fitness at hand is still an open issue. Such adaptivity (or meta-learning) should select the appropriate evolution control parameters as well as the most convenient surrogate model and its parameters, and, in particular, it should decide whether to use the metric-based (probably Gaussian processes) or the ranking-based (e.g., ranking SVM) model. Besides, the parameters of the CMA-ES itself are problem-dependent, too. In fact, CMA-ES variants more suitable for expensive problems have been already proposed (Loshchilov, 2017; Belkhir et al., 2017), and they can be obviously combined with surrogate models.

To conclude, the surrogate-model-based CMA-ES algorithms constitute state-of-the-art optimizers for black-box optimization problems. In particular, the Gaussian processes have been shown to be a very competitive (if not the best) surrogate models, especially for multimodal moderately-dimensional objective functions.

## Acknowledgments

This research was supported by the Czech Science Foundation grants No. 17-01251 and 18-18080S. Access to the National Grid Infrastructure MetaCentrum provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures” (CESNET LM2015042) is greatly appreciated.

## Notes

^{1}

The detailed results can be found on the competition's webpage at https://bbcomp.ini.rub.de/

^{2}

$RDE\mu $ could be replaced by arbitrary error measure between $f$-values vectors which only depends on ranks of the elements of these vectors.

^{3}

The source code is available on Github https://github.com/bajeluk/surrogate-cmaes

^{4}

Additional results are on the author's webpage http://bajeluk.matfyz.cz/gpcomparison2017

## References

*Catalytic science series*.