Abstract

This article deals with Gaussian process surrogate models for the Covariance Matrix Adaptation Evolution Strategy (CMA-ES)—several already existing models and two models recently proposed by the authors are presented. The work discusses different variants of surrogate model exploitation and focuses on the benefits of employing the Gaussian process uncertainty prediction, especially during the selection of points for evaluation with the surrogate model. The experimental part of the article thoroughly compares and evaluates the five presented Gaussian process surrogate algorithms and six other state-of-the-art optimizers on the COCO benchmarks. The algorithm presented in most detail, DTS-CMA-ES, which combines cheap surrogate-model predictions with objective function evaluations in every iteration, is shown to approach the function optimum at least comparably fast and often faster than the state-of-the-art black-box optimizers for budgets of roughly 25–100 function evaluations per dimension, and in spaces of dimension 10 and lower even for 25–250 evaluations per dimension.

1  Introduction

Computers have been used to solve continuous optimization problems in engineering and many other areas for decades. A broad set of these problems is specified by black-box objective functions—functions without a known mathematical expression and with values gathered only experimentally or via a time-demanding computer simulation. Such functions can be found, for example, in engineering design optimization (Bekasiewicz and Koziel, 2017), computational fluid dynamics (CFD) simulations (Lee et al., 2016), or in the search for optimal materials (Baerns and Holeňa, 2009; Zaefferer et al., 2016). Because of the time or cost demands, the allowable budget of such evaluations is often strictly limited, and reaching a reasonable value of the objective function (also known as fitness in the evolutionary context) in as few evaluations as possible is usually the prime criterion for assessing black-box optimizers.

The early successful black-box optimization algorithms, such as those proposed by Nelder and Mead (1965) or Brent (1973), have been heavily used for decades (both are, for example, a standard part of MATLAB). However, these algorithms are often unable to escape from the basin of attraction of the local optimum in which they were started. Therefore, stochastic algorithms such as evolutionary algorithms have become increasingly popular for global optimization.

An important group of global optimizers is formed by Bayesian optimization algorithms, introduced in the 1970s by Mockus and Žilinskas (1972) and Mockus et al. (1978). Nowadays, Efficient Global Optimization (EGO) (Jones, 2001) and its successors (Hutter et al., 2011, 2013) are used most often. These algorithms estimate the prediction uncertainty, usually using Gaussian processes, and they perform very well globally for very low budgets of fitness function evaluations per dimension (FE/D). However, they often get stuck in a local optimum after several tens of FE/D.

For higher fitness-evaluation budgets, the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) by Hansen and Ostermeier (2001) is one of the most successful evolutionary continuous black-box optimizers of the last two decades. It iteratively samples λ points from a Gaussian distribution N(m, σ²C) and then recalculates the distribution parameters m, σ, and C based on the μ best fitness-evaluated points. The CMA-ES profits from its invariance properties, particularly on ill-conditioned or highly multi-modal functions, or in high-dimensional spaces when the allowable number of FE/D exceeds roughly several hundred.

The CMA-ES learns some core characteristics of the fitness via the adaptation of its covariance matrix C and the step-size σ. Nevertheless, the exact information from the past evaluations can be used more intensively via surrogate modeling—a technique in which a regression model of the objective function is used interchangeably with the function itself. While spending extra CPU time on model construction, surrogate models have been shown to save a portion of the original-fitness evaluations; see Jin (2011) or Rios and Sahinidis (2012). Surrogate models have been used increasingly often in many optimization algorithms, including the CMA-ES. The acceleration of the CMA-ES via surrogates was shown, for example, by Auger et al. (2004), Kern et al. (2006), and Loshchilov et al. (2013), and it also constitutes the research objective of this article.

A particular position among existing surrogate models is occupied by Gaussian processes (GPs). For an unseen point, they return not only the expected value but the whole probability distribution of the expected output. The variance of this distribution can serve as an uncertainty measure of the prediction.

This article reviews previous attempts to use Gaussian processes as surrogates for the CMA-ES and thoroughly compares most of them with each other as well as with a few other important black-box optimization methods. The empirical comparison is based on the COCO benchmark suite (Hansen et al., 2016)—a platform repeatedly used for assessing black-box optimizers for the past eight years.

As a new contribution to the development of GP-based surrogate models for the CMA-ES, this article introduces two novel algorithms: S-CMA-ES (Surrogate CMA-ES) and DTS-CMA-ES (Doubly Trained Surrogate CMA-ES). The former switches between the CMA-ES iterations evaluating the population with the original fitness and the iterations using a GP or random-forest surrogate model. The latter uses a GP surrogate model for the most promising points in every iteration, and the selection of these points relies on the GP ability to estimate the whole distribution of the predicted fitness value.

The principles of the S-CMA-ES and DTS-CMA-ES were recently presented in conference articles (Bajer et al., 2015; Pitra et al., 2016, 2017). This article describes the details of the employed Gaussian process surrogate models more thoroughly, evaluates their parameter settings, reinvestigates the criteria for selecting points for evaluation with the original fitness in the DTS-CMA-ES evolution control, presents a new such criterion, and brings new experimental results of the proposed algorithms. The DTS-CMA-ES, both with and without self-adaptation, is shown to converge to optima on average faster than or comparably to the fastest black-box optimizers in dimensions up to D = 10 starting from approximately 25 FE/D, and in dimension D = 20 between approximately 25 and 100 FE/D. It is worth mentioning that the versions without and with the self-adaptation placed first and second, respectively, in the two single-objective tracks of the Black Box Optimization Competition at the conference GECCO 2017.1

Related surrogate-assisted CMA-ES algorithms are briefly reviewed in the next section, followed by the foundations of the CMA-ES and Gaussian process surrogate modeling in Section 3. Section 4 provides the description of our new algorithms which is supplemented by their implementation details and parameter settings in the first half of Section 5 and by the experimental comparison of 13 different algorithms in the second half. Section 6 concludes the article and suggests a few future research objectives.

2  Related Works

Surrogate modeling has been used for accelerating expensive optimization for several decades, but we restrict our attention to its combination with the CMA-ES, primarily focusing on Gaussian process models.

Surrogate models were probably first used with the CMA-ES by Jin et al. (2001, 2002). Their framework uses multilayer perceptron networks with the individual- and generation-based evolution control (EC) strategies. In each generation, the individual-based EC selects a smaller number λ of points for evaluation with the original fitness (original-evaluated points in the following) out of a larger set of λ_Pre model-evaluated points: the authors suggest choosing them randomly or selecting the points with the best model fitness. The generation-based EC uses cycles of Λ generations. It switches between η generations in which all the points are model-evaluated and (Λ − η) generations evaluated with the original fitness. The proportions λ/λ_Pre and η/Λ of the original-fitness evaluations in both strategies are set proportionally to the current model error divided by the highest measured error.

LS-CMA-ES. The LS-CMA-ES, introduced by Auger et al. (2004), uses a quadratic approximation of the fitness trained on the most recent O(n²) points. This approximation is shown to provide a faster update of the CMA-ES covariance matrix when applied to fitness functions similar to an ellipsoid. The algorithm also automatically controls whether to use the model for the CMA-ES update or not.

lmm-CMA. Two years later, Kern et al. (2006) presented their Local meta-model CMA-ES (lmm-CMA) which uses a set of local quadratic models to predict ranking of the CMA-ES' μ best points from each generation (provided that there are enough points to construct the model). Being inspired by the approximate ranking procedure of Runarsson (2004), the lmm-CMA first evaluates the current population with the models and then iteratively evaluates the best-predicted points with the original fitness until the difference in ranking of the best μ points between two consecutive iterations becomes low enough. This algorithm was later slightly modified by Auger et al. (2013).

s*ACM-ES. Recently, Loshchilov et al. (2013) incorporated the Ranking SVM model into the CMA-ES in their s*ACM-ES algorithm. The Ranking SVM is trained to predict only the ranking of the input points, which perfectly suits the CMA-ES' invariance with respect to monotonic transformations of the fitness. In addition, the CMA-ES covariance matrix C is used in the SVM as a fixed kernel, which results in a low computational complexity of the model fitting.

Although Gaussian processes have been studied in probability theory for many decades, they started to be widely used as a regression model only about twenty years ago, owing to advances in computational power and to clarifying contributions like those by Neal (1994, 1997), Williams and Rasmussen (1996), and MacKay (1998). Since that time, Gaussian processes have been established as a model with a decent number of parameters, and they are used in many optimization domains, as long as the number of continuous problem variables does not exceed a few dozen and the function being regressed is smooth enough.

MA-ES. To the best of our knowledge, the first combination of the CMA-ES optimizer and a Gaussian process model was in the Metamodel-Assisted Evolution Strategy (MA-ES) proposed by Ulmer et al. (2003). According to Jin's terminology, the algorithm employs individual-based evolution control—it uses a GP surrogate model for the pre-selection of points for evaluation with the original fitness. The algorithm pre-selects points based on one of two criteria: the best GP mean prediction, or the probability of improvement (PoI)—the probability that the considered point's function value will drop below some value, usually the so-far achieved optimum fmin (see Section 4.2). The authors have shown that the latter criterion, which uses the GP's prediction of uncertainty, performs better especially on multi-modal functions like Ackley's or Griewank. The same algorithm name was used for the GP-surrogate evolutionary strategy (not necessarily the CMA-ES) by Emmerich et al. (2002, 2006) who also showed superior performance of the criteria exploiting the GP uncertainty on multi-modal functions.

GPOP. Büche et al. (2005) combined GPs and the CMA-ES in the opposite way. In fact, their Gaussian process optimization procedure (GPOP) is a Bayesian optimization algorithm (similar to the EGO mentioned above) which additionally restricts the search around the current best point x̂_opt. In each iteration, the GPOP constructs a GP model of the local neighborhood of x̂_opt, and the CMA-ES is then used to locate the optimum of this model for evaluation with the original fitness. The CMA-ES optimizes functions defined as ŷ(x) − α ŝ(x), where ŷ(x) and ŝ(x) are the mean and standard deviation of the GP prediction in x, respectively. The authors suggest using four different values α ∈ {0, 1, 2, 4} in each generation: while α = 0 exploits the current best-predicted point, α = 4 explores mainly uncertain regions of the search space.

Kriging metamodeling CMA-ES. Kruisselbrink et al. (2010) used GPs to handle robustness in optimization of noisy functions. Their kriging metamodeling based CMA-ES evaluates the population with the noisy original function and then constructs a separate GP/kriging model for each such point in order to estimate its true value without noise. The algorithm uses the CMA-ES covariance matrix C for the transformation of the input space. Apart from a procedure for selecting training points for the GP from an archive, the authors also suggest which points to take for an additional expensive evaluation if there are not enough points for the GP model building. The kriging method is shown to approximate the noisy fitness much better than two other methods based on a simple (re-)evaluation of the points during CMA-ES optimization.

Ensemble of Local Gaussian Process Models. The article by Lu et al. (2013) presents another surrogate-assisted CMA-ES algorithm that employs an ensemble of local Gaussian process models. The algorithm uses the GP models to prescreen a larger population (λ=25, μ=5) and to select the best 6 points for expensive re-evaluation according to the same criterion as in Büche et al. (2005) with α=2. Although the results in the article seem rather promising, we were not able to reproduce them using the provided source code.

EGO-CMA. Mohammadi et al. (2015) combine GPs and the CMA-ES in their EGO-CMA. It starts as the EGO algorithm and switches to the CMA-ES after a small number of EGO iterations. Hence, it does not use a surrogate model in the proper sense of repeatedly switching between evaluation with the original fitness and with the model. The core contribution of the article lies in the sophisticated initialization of the CMA-ES covariance matrix and step-size based on the learned parameters of the EGO's Gaussian process. The authors show interesting preliminary results on the 5-dimensional Sphere and Ackley functions, which we were also not able to reproduce due to the incompleteness of the provided source code.

SAPEO. The last single-objective GP/CMA-ES algorithm we are aware of is the Surrogate-Assisted Partial Order-Based Evolutionary Optimization Algorithm (SAPEO) by Volz et al. (2017). The algorithm uses several dominance relations that combine the GP predicted mean and variance, usually via confidence intervals, to induce a partial order on the points in a CMA-ES population. If this partial order is able to identify the ranking of the μ best points with a given probability, the algorithm proceeds without any further fitness evaluations. Otherwise, the SAPEO tries another of its dominance relations or, if necessary, evaluates some of the points with the original fitness.

Building on the mentioned works, we present a thorough comparison of the following algorithms in Section 5: the Metamodel-Assisted Evolution Strategy (MA-ES) by Ulmer et al. (2003), the GPOP by Büche et al. (2005), the lmm-CMA in the modification by Auger et al. (2013), the s*ACM-ES by Loshchilov et al. (2013), the SAPEO by Volz et al. (2017), and our algorithms S-CMA-ES and DTS-CMA-ES.

3  Surrogate Modeling in the Context of the CMA-ES and Gaussian Processes

The considered optimization problem can be formalized as follows. Let f: X → ℝ be a function defined on a connected set X ⊆ ℝ^D.

All the benchmarks in this article have bound-constrained input spaces; thus X forms a multi-dimensional interval X = [(x_low)_1, (x_up)_1] × ⋯ × [(x_low)_D, (x_up)_D], where x_low and x_up are the vectors of the lower and upper bounds, and (x)_i is the i-th element of a vector x. In the case of minimization, the optimization consists in finding x_opt ∈ X for which

f(x_opt) = y_opt = min_{x ∈ X} f(x).    (1)

Here, the function f is considered black-box. The true optimum y_opt is often unknown in real-world applications, and the stopping criteria thus have to be defined differently. On the contrary, the formulas and the optima y_opt are known for the COCO benchmarks, so the framework uses a precision Δf_opt = 10^−8: the optimization ends when a point x̂_opt for which f(x̂_opt) ∈ [y_opt, y_opt + Δf_opt] is found.

Furthermore, the function f is supposed to be expensive. More precisely, we assume that evaluating the fitness of a point represents a considerably higher cost than training a regression model. This assumption, commonly valid in real-world problems, justifies the aim to use surrogate models, and it implies that the criteria for assessing different black-box optimizers are calculated based on the number of the expensive function evaluations, not on the CPU time spent during the optimization.

As a surrogate model, we understand any regression model f̂: X → ℝ that is trained on data from an archive A = {(x_i, y_i) | y_i = f(x_i), i = 1, …, N}. If it can be defined only on a substantially restricted subset X′ ⊂ X, we call it a local surrogate model.

3.1  The CMA-ES

All of the surrogate-assisted algorithms considered in this article are based on the Covariance Matrix Adaptation Evolution Strategy (CMA-ES; Hansen and Ostermeier, 2001; Hansen, 2006). The algorithm iteratively samples a new population of points from a Gaussian distribution N(m, σ²C) and uses the μ best-evaluated points to update its state variables: the current distribution mean m, which is used as an estimate of the function optimum, the covariance matrix C, and the step-size σ (cf. Algorithm 1).

[Algorithm 1: The CMA-ES]

The CMA-ES leaves only several parameters to be set by its user: the initial population size λ, step-size σ(0), and mean m(0). Increasing the recommended starting value of the population size, λ = 4 + ⌊3 ln D⌋, decreases the convergence speed on unimodal functions, but enables the algorithm to escape more easily from local optima (see, e.g., f1 and f17 in Figure 3). On problems bounded equally in all dimensions, the initial step-size has been recommended as one half of the search-space width, σ(0) = 0.5((x_up)_1 − (x_low)_1), and the initial mean has been advised to be sampled uniformly, m(0) ∼ U(X). The remaining parameters of the CMA-ES have been tuned by its authors (Hansen, 2006).
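For illustration, the following sketch (in Python, not the authors' MATLAB code) computes these recommended initial values; the function name and the assumption of equal bounds in all dimensions are ours.

```python
import numpy as np

def cmaes_initial_settings(D, x_low, x_up, rng=None):
    """Recommended initial CMA-ES parameters described above (a sketch):
    population size lambda, step-size sigma(0) for an equally bounded space,
    and a uniformly sampled initial mean m(0)."""
    rng = np.random.default_rng() if rng is None else rng
    lam = 4 + int(np.floor(3 * np.log(D)))        # default population size
    sigma0 = 0.5 * (x_up[0] - x_low[0])           # half of the search-space width
    m0 = rng.uniform(x_low, x_up)                 # m(0) ~ U(X)
    return lam, sigma0, m0

# Example: a 10-dimensional problem on [-5, 5]^10
lam, sigma0, m0 = cmaes_initial_settings(10, np.full(10, -5.0), np.full(10, 5.0))
```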

Unless stated explicitly otherwise, the CMA-ES is used in our algorithms with the default settings of its IPOP-CMA-ES version, which gradually doubles the population size on restarts. These restarts occur when the optimization converges to a local minimum or stalls for several generations without finding a better fitness.

3.2  Training Sets

The quality of a surrogate model is largely determined by the selection of already evaluated points into its training set. We denote by N_max the given maximum number of points in a training set and by N_min the given minimum number of points required to train a model. All points from the archive A are used if |A| < N_max. Let us list a few possible methods for training set selection (TSS):

  1. Taking up to Nmax most recently evaluated points (e.g., in Ulmer et al., 2003 or Loshchilov et al., 2013),

  2. selecting the union of the k nearest neighbors of every point for which the fitness should be predicted (e.g., in Kern et al., 2006), where k is maximal such that the total number of selected points does not exceed Nmax,

  3. clustering the points in the input space into Nmax clusters and taking the points nearest to clusters' centroids (e.g., in Bajer et al., 2015), and

  4. selecting Nmax points which are closest to any point in the current population.

All of these selection methods can be restricted to an area considered relevant for the particular generation. Localizing this area around the current algorithm position (around the mean m(g) in the CMA-ES, for example) is important for local optimizers, and especially for the TSS methods that are not biased towards the current population. The algorithms in Section 4 use the Mahalanobis distance given by the CMA-ES matrix σ²C for selecting points into training sets, similarly to, for example, Kruisselbrink et al. (2010).
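As an illustration of two of the TSS methods listed above, the following Python sketch implements method 1 (most recent points) and method 4 (points closest to the population) under the Mahalanobis metric given by σ²C; the function names and the array-based archive representation are our assumptions, not the authors' MATLAB implementation.

```python
import numpy as np

def mahalanobis_dist(X, x0, sigma2_C_inv):
    """Distances of the rows of X from x0 in the metric given by sigma^2 C."""
    diff = X - x0
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, sigma2_C_inv, diff))

def tss_recent(X_archive, y_archive, n_max):
    """TSS method 1: take up to n_max most recently evaluated points."""
    return X_archive[-n_max:], y_archive[-n_max:]

def tss_closest_to_population(X_archive, y_archive, X_pop, sigma2_C_inv, n_max):
    """TSS method 4: the n_max archive points closest to any point of the
    current population, measured by the Mahalanobis distance."""
    d = np.min([mahalanobis_dist(X_archive, x, sigma2_C_inv) for x in X_pop], axis=0)
    idx = np.argsort(d)[:n_max]
    return X_archive[idx], y_archive[idx]
```

Here sigma2_C_inv would be obtained as np.linalg.inv(sigma**2 * C) from the current CMA-ES state.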

3.3  Gaussian Processes

A Gaussian process on the considered set X ⊆ ℝ^D is a collection of random variables (f_GP(x))_{x∈X}, indexed by the set X, such that any finite n-element subcollection has a joint n-dimensional normal distribution.

Each Gaussian process is specified by its mean function μ: X → ℝ and covariance function K: X × X → ℝ₀⁺, where μ(x) = E[f_GP(x)] and K(x_1, x_2) = cov(f_GP(x_1), f_GP(x_2)). Both functions are parametrized by a relatively small number of parameters, which are usually fitted by the maximum-likelihood (ML) or leave-one-out cross-validation (LOO-CV) method. Since both functions μ, K are themselves parameters of the GP, their parameters are usually called hyperparameters of the GP.

Function values y are often accessible only as noisy observations y = f_GP(x) + ε, where ε is a zero-mean Gaussian noise with the variance σ_n². The noise variance σ_n² determines how precisely the GP can fit the training data. Consequently, the covariance of noisy observations becomes

cov(y_p, y_q) = K(x_p, x_q) + σ_n² δ_{p,q},    (2)

where δ_{p,q} = 1 for p = q and δ_{p,q} = 0 otherwise.
Prediction. Using a Gaussian process for prediction always starts with a training set of N points X_N = {x_i | x_i ∈ X, i = 1, …, N} for which the function values y_N = (y_i = f(x_i), i = 1, …, N) are known. If the mean function μ of the GP is zero, then y_N | X_N ∼ N(0, C_N). We consider the prediction in a new, (N+1)-st point (x*, y*). Adding this point to the training set, the conditional distribution of the extended vector y_{N+1} = (y_1, …, y_N, y*) is

y_{N+1} | X_{N+1} ∼ N(0, C_{N+1}).    (3)

Let us look at the covariance matrices if the observations are considered noisy:

C_N = K_N + σ_n² I_N,    where    (4)

(K_N)_{i,j} = (K(X_N, X_N))_{i,j} = K(x_i, x_j),    (5)

and K_N = K(X_N, X_N) is the matrix of the covariance-function values between all the training points. The extended covariance matrix can be written as

C_{N+1} = [ K(X_N, X_N) + σ_n² I_N    K(X_N, x*) ]
          [ K(x*, X_N)                K(x*, x*)  ],    (6)

where K(X_N, x*) = K(x*, X_N)ᵀ is the vector of covariances between the new point x* and the training data X_N.

Finally, conditioning the prior distribution of y_{N+1} (3) on the observations y_N, the one-dimensional Gaussian distribution of y* is (Rasmussen and Williams, 2006)

y* | X_{N+1}, y_N ∼ N(ŷ*, (ŝ*)²),    where    (7)

ŷ* = K(x*, X_N) C_N^{−1} y_N,    (8)

(ŝ*)² = K(x*, x*) − K(x*, X_N) C_N^{−1} K(X_N, x*).    (9)

Abandoning the assumption μ(x) = 0, equation (8) generalizes to

ŷ* = μ(x*) + K(x*, X_N) C_N^{−1} (y_N − μ(X_N)),    (10)

where the mean function is most commonly set to a constant, μ(x) = m_μ.

From the computational perspective, the calculation of the covariance matrix C_N takes O(DN²) time, and the complexity of the likelihood calculation for hyperparameter estimation is O(N³) due to the inversion of C_N. Once C_N^{−1} is calculated, the complexity of the prediction is only O(N²).
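To make the prediction equations concrete, the following Python sketch computes the posterior mean and variance of equations (8)–(10), using a Cholesky factorization in place of the explicit inverse of C_N; the interface (a covariance function passed as a Python callable) is our assumption, not the GPML implementation used by the authors.

```python
import numpy as np

def gp_predict(K, X_train, y_train, x_star, sigma_n2, mean_const=0.0):
    """Posterior mean and variance at x_star for a GP with covariance function K,
    constant mean mean_const and noise variance sigma_n2 (a minimal sketch)."""
    N = X_train.shape[0]
    K_NN = np.array([[K(xi, xj) for xj in X_train] for xi in X_train])
    C_N = K_NN + sigma_n2 * np.eye(N)                       # eq. (4)
    L = np.linalg.cholesky(C_N)                             # C_N = L L^T
    k_star = np.array([K(x_star, xi) for xi in X_train])    # K(x*, X_N)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train - mean_const))
    v = np.linalg.solve(L, k_star)
    y_hat = mean_const + k_star @ alpha                     # eq. (10)
    s2_hat = K(x_star, x_star) - v @ v                      # eq. (9)
    return y_hat, s2_hat
```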

Covariance functions. The covariance function K describes prior assumptions on the shape of the modeled function: K(x_p, x_q) models the covariance cov(f(x_p), f(x_q)) between the function values at two data points x_p and x_q.

For any N ∈ ℕ, the matrix K_N has to be positive semidefinite. Most of the covariance functions encountered in applications are stationary; that is, the value of K(x_1, x_2) depends only on the distance r(x_1, x_2) between the two input points for some metric r, usually the Euclidean distance.

We have used the two most common covariance functions: the squared-exponential

K_SE(x_1, x_2) = σ_f² exp(−r(x_1, x_2)² / (2ℓ²)),    (11)

which is suitable especially when the modeled function is rather smooth, and the Matérn covariance function for two values of its parameter ν ∈ {3/2, 5/2}

K_Matérn^{ν=3/2}(x_1, x_2) = σ_f² (1 + √3 r(x_1, x_2)/ℓ) exp(−√3 r(x_1, x_2)/ℓ),    (12)

K_Matérn^{ν=5/2}(x_1, x_2) = σ_f² (1 + √5 r(x_1, x_2)/ℓ + 5 r(x_1, x_2)²/(3ℓ²)) exp(−√5 r(x_1, x_2)/ℓ).    (13)

The closer the points x_1 and x_2 are, the closer the covariance function value is to σ_f², and the stronger the correlation between the function values f(x_1) and f(x_2). The parameter ℓ, the characteristic length-scale, is the distance to which the distance of the two considered data points is compared, and σ_f² is the signal variance. For all three mentioned covariance functions, a version with a separate length-scale parameter for each dimension exists (often called automatic relevance determination, ARD). However, these variants, especially in higher dimensions, dramatically increase the number of hyperparameters. Our preliminary experiments showed that Gaussian processes with ARD covariance functions often give worse results, probably because they are more sensitive to over-fitting.
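A direct transcription of the covariance functions (11)–(13) as functions of the distance r, in Python; a minimal sketch with the hyperparameters σ_f² and ℓ passed explicitly.

```python
import numpy as np

def k_se(r, sigma_f2, ell):
    """Squared-exponential covariance, eq. (11), as a function of the distance r."""
    return sigma_f2 * np.exp(-r**2 / (2.0 * ell**2))

def k_matern32(r, sigma_f2, ell):
    """Matern covariance with nu = 3/2, eq. (12)."""
    a = np.sqrt(3.0) * r / ell
    return sigma_f2 * (1.0 + a) * np.exp(-a)

def k_matern52(r, sigma_f2, ell):
    """Matern covariance with nu = 5/2, eq. (13)."""
    a = np.sqrt(5.0) * r / ell
    return sigma_f2 * (1.0 + a + a**2 / 3.0) * np.exp(-a)
```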

4  Designing New Gaussian Process Surrogates for the CMA-ES

The results in the experimental section reveal that the convergence speed of the Gaussian process surrogate CMA-ES algorithms GPOP and MA-ES on the COCO testbed is considerably lower than that of the lmm-CMA with quadratic surrogate models or of the s*ACM-ES with ranking support vector machines. These results, which contrast with the great success of Gaussian processes during the last decade, motivated us to design the new GP surrogate CMA-ES algorithms S-CMA-ES and DTS-CMA-ES. In addition to using GPs as a regression model of the objective function, we also aim to utilize the GP uncertainty estimation in the optimization.

4.1  Surrogate CMA-ES: Generation-Based Evolution Control

Our first GP surrogate algorithm S-CMA-ES (Bajer et al., 2015) uses the generation-based EC: once the model is trained, it is used instead of the original fitness for several consecutive generations of the CMA-ES. Then, the original fitness is used for one generation, the model is re-trained also using the new points, and this procedure repeats (Algorithm 2). This EC, used for example in the s*ACM-ES, has a great advantage of simplicity and, compared to the individual-based EC, there is no need for any special selection of points into the population of the respective generation.

[Algorithm 2: S-CMA-ES]

Apart from the surrogate-model parameters described below, the algorithm is parametrized by the model life-length g_m: the number of generations for which the model is used. Although this parameter is problem-dependent, a 5-generation life-length was shown to work reasonably well on most COCO functions (Bajer et al., 2015). Two years later, Repický et al. (2017) introduced a variant called A-S-CMA-ES that governs the model life-length adaptively and slightly improves the convergence rate for higher numbers of FE/D. Both these algorithms demonstrate an impressive speed-up at the beginning of optimization runs, but they often get stuck in local optima and are outperformed by other algorithms, particularly by the DTS-CMA-ES.
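The following Python pseudocode sketches the generation-based EC loop of the S-CMA-ES as described above; the CMA-ES interface (sample_population, update, stop) and the model interface are hypothetical placeholders standing in for the authors' MATLAB implementation.

```python
def s_cmaes_loop(es, train_model, f_original, archive, g_m):
    """Generation-based EC (a sketch): one original-evaluated generation, then
    g_m consecutive generations evaluated with the surrogate model, repeated."""
    while not es.stop():
        X = es.sample_population()
        y = [f_original(x) for x in X]            # original-fitness generation
        archive.extend(zip(X, y))
        es.update(X, y)
        model = train_model(archive)              # (re-)train on the archive
        for _ in range(g_m):                      # g_m model-evaluated generations
            if es.stop():
                break
            X = es.sample_population()
            y_hat = [model.predict_mean(x) for x in X]
            es.update(X, y_hat)
```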

4.2  Doubly Trained Surrogate CMA-ES: Closer to the Individual-Based Evolution Control

In the generation-based EC, the whole population is always original-evaluated at once, which makes the exploitation of the GP predictive uncertainty difficult. Additionally, the CMA-ES can drift outside the region where any archive points are stored. These two facts led us to employ a different evolution control. However, the other common choice, the individual-based EC, can modify the distribution N(m, σ²C) of the CMA-ES population due to the non-uniform selection of the λ promising points from a larger set of model-evaluated points.

Although Loshchilov et al. (2010) suggested subsampling the points according to the predicted fitness mapped onto a Gaussian distribution and Hansen (2011) has later given restrictions on injecting external points into the CMA-ES population, we have proposed another solution called doubly trained EC. Each generation of this EC can be summarized in the following steps:

  1. sample a new population of size λ (standard CMA-ES offspring),

  2. train the first surrogate model on the original-evaluated points from the archive A,

  3. select αλ point(s) with respect to a criterion C, which is based on the first model's prediction,

  4. evaluate these point(s) with the original fitness,

  5. re-train the surrogate model also using these new point(s), and

  6. predict the fitness for the non-original evaluated points with this second model.

The algorithm is partially similar to the evolution control of the lmm-CMA, but the lmm-CMA always selects the points with the best-predicted fitness (in step (3)) and retrains the model generally more times than twice (it cycles steps (3)–(5)).

The key characteristics of the doubly trained EC are the following. First, due to step (1), the population is always a sample from the CMA-ES distribution N(m, σ²C), similarly to the generation-based EC. Second, in step (3), the Gaussian process estimation of uncertainty can be utilized for the selection of a typically low number of original-evaluated points; see below for possible criteria and the description of the parameter α. Third, the estimated fitness values in step (6), which are returned to the CMA-ES, are predicted using a model trained on points that are as recent as possible. Finally, evaluating at least a small number of points with the original fitness in each generation maintains at least a decent number of training-set points near the current CMA-ES distribution mean.
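For clarity, one generation of the doubly trained EC (steps (1)–(6)) can be sketched as follows; sample_pop, train_gp, criterion, and the model's predict_mean are hypothetical helpers standing in for the CMA-ES and GP components, not the authors' implementation.

```python
import numpy as np

def doubly_trained_generation(sample_pop, train_gp, f_original, archive,
                              criterion, alpha, lam):
    """A schematic sketch of one doubly trained EC generation."""
    X = sample_pop(lam)                             # (1) sample from N(m, sigma^2 C)
    model1 = train_gp(archive)                      # (2) first model
    scores = criterion(model1, X)                   # (3) rank by the criterion C
    n_orig = max(1, int(np.ceil(alpha * lam)))
    chosen = np.argsort(-scores)[:n_orig]           #     most promising points
    y = np.full(lam, np.nan)
    for i in chosen:                                # (4) original evaluations
        y[i] = f_original(X[i])
        archive.append((X[i], y[i]))
    model2 = train_gp(archive)                      # (5) re-trained model
    for i in range(lam):                            # (6) model predictions for the rest
        if i not in set(chosen):
            y[i] = model2.predict_mean(X[i])
    return X, y
```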

[Algorithm 3: DTS-CMA-ES]

Employing the doubly trained EC in the CMA-ES results in the Doubly Trained Surrogate CMA-ES (DTS-CMA-ES; Pitra et al., 2016), whose pseudocode is shown in Algorithm 3. The algorithm is parametrized by the parameter α(0)—the initial value of the ratio of original-evaluated points in a population—by the criterion C for the selection of these points, and by the surrogate model and its parameters. Furthermore, a user can let the ratio α adapt throughout the algorithm run, which requires an additional set of parameters—see Algorithm 4 and the corresponding paragraph below.

Ranking difference error. In accordance with most of the model-assessment methods for CMA-ES surrogates, we have also used the difference in the ranking of the fitness values within a population as a measure of model quality. We have proposed a normalized version of the error measure used by Kern et al. (2006), which we call the Ranking difference error (RDE). We denote by ranking a function ρ: ℝ^λ → {1, …, λ}^λ which maps each element of a vector y ∈ ℝ^λ to its rank, that is, (ρ(y))_i ≤ (ρ(y))_j whenever (y)_i ≤ (y)_j. The RDE is defined for two vectors of fitness values y_1^*, y_2^*. We assume the second vector y_2^* to be the more accurate prediction: y_2^* might contain more original-evaluated f-values, or the training set of the model behind y_2^* could be larger. The RDE is then calculated as

RDE_μ(y_1^*, y_2^*) = ( Σ_{i: (ρ(y_2^*))_i ≤ μ} |(ρ(y_2^*))_i − (ρ(y_1^*))_i| ) / ( max_{π ∈ Permutations of (1,…,λ)} Σ_{i: π(i) ≤ μ} |i − π(i)| ) ∈ [0, 1].    (14)

The larger the difference in rankings of these two f-values vectors, the more the CMA-ES would be fooled if it got the less accurate f-values prediction y1* instead of y2*.
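A small computational sketch of RDE_μ (14) in Python; the normalizing maximum over permutations is computed here as a maximum-weight assignment of the ranks 1, …, μ to positions, and the tie-breaking rule for equal fitness values is our assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.stats import rankdata

def rde_mu(y1, y2, mu):
    """Ranking difference error of prediction y1 against the more accurate y2,
    normalized by the worst possible error over all rankings (a sketch)."""
    lam = len(y1)
    r1 = rankdata(y1, method='ordinal')
    r2 = rankdata(y2, method='ordinal')
    err = sum(abs(int(r2[i]) - int(r1[i])) for i in range(lam) if r2[i] <= mu)
    # worst case: place ranks 1..mu on positions 1..lam so that sum |i - rank| is maximal
    cost = -np.abs(np.arange(1, lam + 1)[:, None] - np.arange(1, mu + 1)[None, :])
    rows, cols = linear_sum_assignment(cost)
    max_err = int(-cost[rows, cols].sum())
    return err / max_err
```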

Criteria for the selection of original-evaluated points. While the non-Gaussian-process surrogate algorithms select points for the original evaluation mostly according to the predicted fitness, the Gaussian process surrogates, which for any point x predict the whole Gaussian distribution N(ŷ(x), (ŝ(x))²), offer more options when used with the individual-based or doubly trained EC. The following criteria are considered in the DTS-CMA-ES; a small computational sketch of the first four is given after the list. Whereas the first four are defined for any point of the input space and have been used in Bayesian optimization for decades, the last one, the Expected ranking difference error, is our new contribution directly exploiting the Gaussian process prediction of the DTS-CMA-ES and is defined only for points from the considered population.

  • GP predictive mean. The GP mean prediction ŷ(x) is the maximum-likelihood estimate of the original fitness. This criterion is defined as its negative value
    C_M(x) = −ŷ(x).    (15)
  • GP predictive standard deviation. Choosing the points with the highest uncertainty leads to the criterion
    C_STD(x) = ŝ(x).    (16)
  • Expected improvement (EI). If y_min stands for the minimum fitness in the considered training set y_1, …, y_N, the EI criterion is
    C_EI(x) = E((y_min − f(x)) · I(f(x) < y_min) | y_1, …, y_N),    (17)
    where I(f(x) < y_min) = 1 for f(x) < y_min and I(f(x) < y_min) = 0 for f(x) ≥ y_min.
  • Probability of improvement (PoI). The PoI expresses the probability of finding a fitness lower than some threshold T
    C_PoI(x, T) = P(f(x) ≤ T | y_1, …, y_N) = Φ((T − ŷ(x)) / ŝ(x)),    (18)
    where Φ is the distribution function of N(0, 1). As T, a value slightly lower than y_min is usually chosen; see, e.g., Jones (2001) for an evaluation of such thresholds.
  • Expected ranking difference error (ERDE). The goal of this criterion is to select the points from the current population for which the expected RDE would decrease most after adding them to the GP training set. It is a ranking counterpart of the GP predictive standard deviation criterion since it selects the points for which the model is least certain. In DTS-CMA-ES, the RDEμ has been used as an error measure.2

    For each point x_k^* ∈ {x_1, …, x_λ} of the considered population, let us denote by X_−k the population with the k-th point removed and by ŷ_−k = f_M1(X_−k) the vector of the corresponding mean predictions of the first DTS-CMA-ES model. The ERDE criterion is the expected contribution of x_k^* to the ranking error
    C_ERDE(x_k^*, X_−k) = E[RDE_μ(ŷ_−k, ŷ_−k^+)] = ∫_{−∞}^{∞} RDE_μ(ŷ_−k, ŷ_−k^+) ϕ(y_k) dy_k,    (19)
    where ŷ_−k^+ = f_M1^+(X_−k) is the vector of mean predictions of a GP f_M1^+. This process f_M1^+ is identical to the first model f_M1 except that the k-th point (x_k^*, y_k) is additionally incorporated into its training set, say at its end, which becomes (X_N^+, y_N^+) = ([X_N x_k^*], (y_N, y_k)); in particular, f_M1^+ uses the same hyperparameters as f_M1. The expectation is calculated over normally distributed y_k ∼ N(ŷ_k, (ŝ_k)²) with density ϕ(y_k), where ŷ_k and (ŝ_k)² are given by the first DTS-CMA-ES model f_M1 via equations (8) and (9).
    The vector of predicted means ŷ_−k^+ is given by a multivariate variant of (8)
    ŷ_−k^+ = K(X_−k, X_N^+) (C_N^+)^{−1} y_N^+.    (20)
    Because the matrix A = K(X_−k, X_N^+) (C_N^+)^{−1} is constant for a considered x_k^*, the value of each element of ŷ_−k^+ depends on y_k only as
    (ŷ_−k^+)_i = (A)_{i,1:N} y_N + y_k (A)_{i,N+1}.    (21)
    Further, the integral (19) over y_k can be calculated as the sum of subintegrals J_p over intervals I_p = [l_p, u_p], p = 1, …, (½(λ−1)(λ−2) + 1); the intervals are defined in such a way that the ranking of ŷ_−k^+ is constant on each of them. The intervals I_p can be derived from the mutual comparisons
    (ŷ_−k^+)_i ≤ (ŷ_−k^+)_j    for i ≠ j, i, j = 1, …, (λ−1),    (22)
    where the interval boundaries u_1 = l_2, u_2 = l_3, …, u_{p−1} = l_p are the points at which the equalities arise. As the rankings are constant on each I_p, the subintegrals simplify to J_p = RDE_μ(ŷ_−k, ŷ_−k^+)(Φ(u_p) − Φ(l_p)), where Φ is the distribution function of N(ŷ_k, (ŝ_k)²). The complexity of calculating the ERDE for each point is thus O(N³ + λ²N), which enables efficient computation of this criterion even for moderately large λ.
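The first four criteria reduce to closed-form expressions in the GP predictive mean ŷ(x) and standard deviation ŝ(x); a Python sketch follows. The closed form used for C_EI is the standard Bayesian-optimization expression for a Gaussian predictive distribution, which the text above does not spell out, so it should be read as our assumption; the ERDE criterion is omitted here because it requires re-ranking the whole population.

```python
from scipy.stats import norm

def crit_mean(y_hat, s_hat):
    """GP predictive mean criterion C_M, eq. (15): negative predicted mean."""
    return -y_hat

def crit_std(y_hat, s_hat):
    """GP predictive standard deviation criterion C_STD, eq. (16)."""
    return s_hat

def crit_poi(y_hat, s_hat, T):
    """Probability of improvement C_PoI below a threshold T, eq. (18)."""
    return norm.cdf((T - y_hat) / s_hat)

def crit_ei(y_hat, s_hat, y_min):
    """Expected improvement C_EI, eq. (17), in its usual closed form."""
    z = (y_min - y_hat) / s_hat
    return (y_min - y_hat) * norm.cdf(z) + s_hat * norm.pdf(z)
```

In the doubly trained EC, the points with the highest criterion values are then chosen for the original evaluation in step (3).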

Self-adaptation of the original-evaluated points ratio α. Whereas the speed-up of all surrogate-assisted algorithms is based on the model's ability to estimate fitness, the models can mislead the optimization algorithm when their error is high. Similarly to the lmm-CMA and s*ACM-ES, such an adverse effect can be controlled by raising the number of original-evaluated points if necessary. The DTS-CMA-ES measures the ranking difference error of the model and raises the original-evaluated points ratio α if the model error gets higher (Pitra et al., 2017). The details are summarized in Algorithm 4: first, the RDEμ of the last model is exponentially smoothed with the update rate β, and the ratio α is then calculated as the result of a linear transfer function that is defined by the bounds on the smoothed error εmin, εmax and on the resulting ratio αmin, αmax.
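A minimal sketch of this adaptation step (exponential smoothing of RDE_μ followed by a clipped linear transfer function); the function signature is ours, and Algorithm 4 may differ in details.

```python
import numpy as np

def update_alpha(rde_last, rde_smoothed, beta, eps_min, eps_max,
                 alpha_min, alpha_max):
    """One self-adaptation step of the original-evaluated points ratio alpha."""
    rde_smoothed = (1 - beta) * rde_smoothed + beta * rde_last   # exponential smoothing
    # linear transfer: eps_min -> alpha_min, eps_max -> alpha_max, clipped to the bounds
    t = np.clip((rde_smoothed - eps_min) / (eps_max - eps_min), 0.0, 1.0)
    alpha = alpha_min + t * (alpha_max - alpha_min)
    return alpha, rde_smoothed
```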

5  Implementation Details and Experimental Validation

This section starts with the details of the proposed algorithms, particularly addressing the training of and predicting with the Gaussian processes as well as the evaluation of their parameter settings. The second part continues with the experimental evaluation of the DTS-CMA-ES as well as other GP/CMA-ES algorithms on the COCO benchmarks. All the algorithms proposed by the authors are implemented in MATLAB.3

5.1  Configuration of the Gaussian Process Models

The GP models employed in our algorithms make use of GPML 4.0—the toolbox accompanying the book by Rasmussen and Williams (2006). The only modifications are the use of the MATLAB fmincon optimizer or the CMA-ES for the ML estimation of the GP hyperparameters and slight changes in the error handling in the case of numerical instabilities.

[Algorithm 4: Self-adaptation of the ratio α]

[Algorithm 5: Training of the GP surrogate model]

The estimation of the GP model hyperparameters is preceded by the transformation of both the input and output values; see Algorithm 5 for the details. The likelihood optimization in step 4 is started at and bounded by values obtained from brief experiments; see Table 1 for the exact values. The CMA-ES is used instead of fmincon if fmincon fails due to an infeasible starting point or numerical problems.

Table 1:
Bounds and starting values for the ML estimation of the hyperparameters; y_N is the vector of f-values from the GP training set and Δy = max y_N − min y_N.

Hyperparameter   Starting Value              Lower Bound       Upper Bound
m_μ              median f-value med(y_N)     min y_N − 2Δy     max y_N + 2Δy
σ_f²             0.5                         exp(−2)           exp(25)
ℓ                                            exp(−2)           exp(25)
σ_n²             10⁻²                        10⁻⁶              10

Analogically, the predictions with such trained GP models start with the inverse transformation of the input variables into the space that was used in the model training, and the predicted f-values and variances are transformed back to the original scale. Note that the GP surrogate models have to save not only the covariance function and the corresponding hyperparameters but also these transformations and the model's training set.

Evaluation of GP model parameters. Because we are not aware of any systematic comparison of different parameter settings of GP surrogate models for the CMA-ES in the literature, we provide such an evaluation on data from the COCO benchmark functions. Table 2 lists all the tested parameters with their considered values; r_max^A denotes the maximum distance between m(g) and the points to be selected from A. The minimum training set size N_min is 3·D in all the GP models mentioned in this article.

Table 2:
Parameters of the GP surrogate models. The maximum distance r_max^A is derived using the Mahalanobis distance given by the covariance matrix σ²C. Q_χ²(0.99, D) is the 0.99-quantile of the χ²_D distribution, and therefore √Q_χ²(0.99, D) is the 0.99-quantile of the norm of a D-dimensional normally distributed random vector.

Parameter                           Considered Values
training set selection method TSS   (TSS1), (TSS2), (TSS3), (TSS4)
maximum distance r_max^A            2√Q_χ²(0.99, D), 4√Q_χ²(0.99, D)
N_max                               10·D, 15·D, 20·D
covariance function K               K_SE, K_Matérn ν=3/2, K_Matérn ν=5/2

As training and testing datasets for this comparison, three generations from the first half and three generations from the second half of optimization runs of the DTS-CMA-ES (α=0.05), limited to 250 FE/D, were recorded on the first five instances of 24 COCO functions in dimensions 2, 3, 5, 10, and 20. For each recorded generation, the CMA-ES state variables, archive A and populations (as testing sets) were saved.

Figure 1 depicts the third quartiles of RDEμ/rsucc where rsucc denotes the ratio of successful trainings of the second model fM2. Both the errors and ratios were taken for each model setting of the full-factorial design of the parameters from Table 2 across 15 repetitions (5 instances × 3 generations) per each half of the optimization run on each function in each dimension. The same results concerning the influence of the individual parameters are summarized over functions via the n-way ANOVA in Table 3.

Figure 1:

The third quartiles from sets of 15 values of RDE_μ/r_succ. Results for the f5 Linear slope (often equal to zero) were omitted for better visibility of the other results. The two columns per COCO function belong to the first and second half of the optimization runs, respectively. Solid horizontal lines separate the different TSS methods (TSS1)–(TSS4) (in that order), dashed lines separate sectors with the smaller and the larger maximum distance r_max^A. The three triples of settings within each sector represent raising values of N_max, with the three covariance functions K_SE, K_Matérn ν=3/2, and K_Matérn ν=5/2 within each triple.

Table 3:
GP surrogate model parameters. Results from the combination of an n-way ANOVA analysis without interactions and post hoc Tukey–Kramer comparison tests of the effects of the four parameters on the mean of RDE_μ/r_succ. The table shows the values of the parameters sorted according to the corresponding regressed RDE_μ/r_succ (shown left) for the respective combinations of dimension and half of the optimization run. Values whose mean response is, according to Tukey–Kramer, significantly higher than the respective lowest mean response are marked. Stars after the values indicate the statistical significance of rejecting the ANOVA test hypotheses that the effects of the respective parameters on the mean responses of RDE_μ/r_succ are negligible. Significant results of the Tukey–Kramer comparisons between the values of the respective parameters are given in the gray lines of each block (1 = best value, 2 = second best, etc.). Stars */**/*** denote a p-value lower than 0.05/0.01/0.001 for the respective tests.
dim | Part of Run | Trainset Selection | r_max^A | N_max | Covariance Function
2-D i   0.40 k-NN (TSS2)        0.41        0.40 20·D   0.40 K_Matérn ν=5/2
        0.41 population (TSS4)  0.41        0.41 15·D   0.41 K_Matérn ν=3/2
        0.41 clustering (TSS3)              0.41 10·D   0.41 K_SE
        0.41 recent (TSS1)
2-D ii  0.42 k-NN (TSS2)        0.42 2      0.42 20·D   0.42 K_SE
        0.42 recent (TSS1)      0.42 4      0.42 15·D   0.42 K_Matérn ν=5/2
        0.42 clustering (TSS3)              0.42 10·D   0.42 K_Matérn ν=3/2
        0.42 population (TSS4)
        1<
3-D i   0.38 recent (TSS1)      0.38        0.38 15·D   0.38 K_Matérn ν=3/2
        0.38 k-NN (TSS2)        0.39        0.38 10·D   0.38 K_Matérn ν=5/2
        0.38 clustering (TSS3)              0.38 20·D   0.39 K_SE
        0.39 population (TSS4)
3-D ii  0.38 recent (TSS1)      0.39        0.39 15·D   0.39 K_Matérn ν=3/2
        0.39 k-NN (TSS2)        0.39        0.39 10·D   0.39 K_Matérn ν=5/2
        0.39 clustering (TSS3)              0.39 20·D   0.40 K_SE
        0.40 population (TSS4)
        1<3, 1<4, 2<
5-D i   0.34 k-NN (TSS2)        0.34 2      0.34 10·D   0.34 K_Matérn ν=3/2
        0.34 recent (TSS1)      0.35 4      0.34 15·D   0.34 K_Matérn ν=5/2
        0.34 clustering (TSS3)              0.34 20·D   0.35 K_SE
        0.35 population (TSS4)
        1<4, 2<4, 3<1< 1<3, 2<
5-D ii  0.35 k-NN (TSS2)        0.35 2      0.35 15·D   0.35 K_Matérn ν=5/2
        0.35 recent (TSS1)      0.36 4      0.35 10·D   0.35 K_Matérn ν=3/2
        0.35 clustering (TSS3)              0.35 20·D   0.36 K_SE
        0.37 population (TSS4)
        1<4, 2<4, 3<1< 1<3, 2<
10-D i  0.34 k-NN (TSS2)        0.33 4      0.34 20·D   0.33 K_Matérn ν=5/2
        0.34 recent (TSS1)      0.35 2      0.34 15·D   0.34 K_SE
        0.34 clustering (TSS3)              0.34 10·D   0.35 K_Matérn ν=3/2
        0.35 population (TSS4)
        1<3, 1<4, 2<1< 1<2, 1<
10-D ii 0.32 k-NN (TSS2)        0.32 4      0.33 20·D   0.33 K_Matérn ν=5/2
        0.33 recent (TSS1)      0.34 2      0.33 15·D   0.34 K_SE
        0.34 clustering (TSS3)              0.34 10·D   0.34 K_Matérn ν=3/2
        0.34 population (TSS4)
        1<2, 1<3, 1<1< 1<2, 1<
20-D i  0.35 k-NN (TSS2)        0.34 4      0.35 20·D   0.34 K_Matérn ν=5/2
        0.36 population (TSS4)  0.37 2      0.35 15·D   0.34 K_Matérn ν=3/2
        0.36 clustering (TSS3)              0.36 10·D   0.39 K_SE
        0.36 recent (TSS1)
        1<2, 1<3, 1<1<1<3, 2<1<2, 1<3,
        2<
20-D ii 0.34 k-NN (TSS2)        0.33 4      0.34 20·D   0.32 K_Matérn ν=5/2
        0.35 recent (TSS1)      0.36 2      0.35 15·D   0.34 K_Matérn ν=3/2
        0.35 population (TSS4)              0.36 10·D   0.38 K_SE
        0.36 clustering (TSS3)
        1<2, 1<3, 1<4, 2<1<1<3, 2<1<2, 1<3,
        2<

Not surprisingly, the highest error RDE_μ/r_succ is visible on the Attractive sector function f6. This benchmark has its optimum in a point without continuous derivatives and with different slopes in different directions; it is therefore very hard to regress by GPs. On the contrary, the lowest errors (in lower dimensions often equal to 0) were always achieved on the Linear slope f5, whose results were omitted from the heatmaps for better color scaling of the remaining functions. Low errors were measured in higher dimensions on the Sphere function f1; the relatively high errors in lower dimensions on f1 are probably due to a fast decrease of the CMA-ES step-size σ, which results in a low number of points in the training sets.

Let us look more closely at the effects of individual surrogate model parameters.

  • The dependence of the model error on the maximum radius of the training set r_max^A seems to be problem-dependent in low dimensions: in 2-D and 5-D, lower errors can be observed for the smaller radius, for example, on the Rastrigin functions f3 and f15 or on the multimodal Gallagher's Gaussians f21 and f22. The significantly lower errors for the smaller radius in 5-D are rather surprising, but Table 3 shows that the larger radius yields significantly lower errors in the 10-D and 20-D spaces.

  • The model error appears not to depend on the number of training points N_max (with the exception of 20-D, where both 20·D and 15·D points are significantly better than 10·D points). This insignificance is, however, affected by the number of points actually available for the training sets, which is often lower than 10·D.

  • Further, the choice of the covariance function has shown to be insignificant in 2-D and 3-D, but starting from 5-D, the Matérn functions have proven significantly better, and starting from 10-D, more specifically the one with the parameter ν = 5/2. The results in Figure 1 reveal that even though the squared-exponential covariance exhibits the best performance in a few cases (e.g., f9 in 2-D), it is more often susceptible to training failure or a high RDE_μ error, as can be seen, for example, on f21 and f23 in 10-D or on several functions in 20-D.

  • Finally, the TSS method of choice according to Table 3 is (TSS2), taking the k nearest neighbors of each point in the population, closely followed in lower dimensions by (TSS1), taking up to N_max most recent points.

For the experiments in Section 5.3, the surrogate model settings performing significantly best in 10-D and 20-D were chosen for all dimensions: TSS = (TSS2), r_max^A = 4√Q_χ²(0.99, D), N_max = 20·D, K = K_Matérn ν=5/2.

Fixed hyperparameter values. In addition to the evaluation of the GP model parameters, we performed a brief investigation of inferring the values of the two GP hyperparameters ℓ and m_μ from the DTS-CMA-ES state variables instead of their ML estimation.

Because the DTS-CMA-ES surrogate model already transforms the input coordinates into a space where the Euclidean distance equals the Mahalanobis distance r(x_1, x_2) = √((x_1 − x_2)ᵀ σ⁻² C⁻¹ (x_1 − x_2)) of the original space, a natural value of the length-scale parameter of the squared-exponential covariance K_SE (11) would be the constant ℓ = 1, as it then corresponds to the simple kernel

K(x_1, x_2) = σ_f² exp(−r(x_1, x_2)²/2),    (23)

in the transformed space, proposed for RBF kernels by Loshchilov et al. (2010). Likewise, a common recommendation for the GP mean function is the constant zero, μ(x) = 0 (Rasmussen and Williams, 2006). Note that the GP surrogate models standardize the f-values to zero mean regardless of the mean function (step 3 in Algorithm 5).
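The input transformation mentioned above can be written compactly with a Cholesky factor of σ²C; a Python sketch follows, with the centering at a reference point m being our choice (any fixed point preserves pairwise distances).

```python
import numpy as np

def to_mahalanobis_space(X, m, sigma, C):
    """Map inputs so that Euclidean distances in the new coordinates equal the
    Mahalanobis distances given by sigma^2 C in the original space:
    x' = B^{-1}(x - m), where B B^T = sigma^2 C."""
    B = np.linalg.cholesky(sigma**2 * C)
    return np.linalg.solve(B, (X - m).T).T
```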

Table 4 shows the third quartiles of RDE_μ/r_succ for independently setting ℓ = 1 or μ(x) = 0 for the two covariance functions, measured on the same datasets as in the previous evaluation—the 2×15 populations of points per COCO function and considered dimension. As can be seen, using the fixed length-scale ℓ = 1 provides practically the same results as the ML-estimated ℓ for D = 2, 3, but it is harmful in the higher dimensions 10-D and 20-D, where fitting the length-scale seems to be essential for a good GP prediction. On the other hand, using the constant mean function μ(x) = 0 in connection with the Matérn covariance in higher dimensions does not dramatically deteriorate the prediction, but it provides worse results in low-dimensional spaces.

Table 4:
Means and standard deviations of the third quartiles of RDE_μ/r_succ for ML-estimated and fixed hyperparameter values of the DTS-CMA-ES GP models. The third quartiles are from 2×15 independent datasets per function and dimension, and the means and standard deviations are averaged across 2×23 data (24 COCO functions without the Linear slope f5, in the first and second half of the optimization run) for each respective dimension.

covar. fun.      μ(x)         ℓ            2-D         3-D         5-D         10-D        20-D
K_Matérn ν=5/2   ML estimate  ML estimate  0.41 ±0.07  0.39 ±0.07  0.33 ±0.06  0.31 ±0.08  0.31 ±0.12
K_Matérn ν=5/2   ML estimate  ℓ=1          0.41 ±0.06  0.39 ±0.07  0.38 ±0.07  0.42 ±0.06  0.45 ±0.08
K_Matérn ν=5/2   m_μ=0        ML estimate  0.45 ±0.08  0.42 ±0.08  0.37 ±0.09  0.33 ±0.08  0.33 ±0.11
K_Matérn ν=5/2   m_μ=0        ℓ=1          0.40 ±0.06  0.39 ±0.07  0.39 ±0.07  0.44 ±0.06  0.49 ±0.07
K_SE             ML estimate  ML estimate  0.41 ±0.07  0.39 ±0.07  0.35 ±0.05  0.32 ±0.08  0.37 ±0.13
K_SE             ML estimate  ℓ=1          0.41 ±0.06  0.40 ±0.07  0.39 ±0.07  0.42 ±0.06  0.47 ±0.09
K_SE             m_μ=0        ML estimate  0.45 ±0.09  0.44 ±0.09  0.39 ±0.08  0.36 ±0.10  0.58 ±0.34
K_SE             m_μ=0        ℓ=1          0.41 ±0.06  0.41 ±0.07  0.40 ±0.07  0.44 ±0.06  0.51 ±0.09

In summary, the ability of Gaussian processes to predict the right ranking varies to a large extent with the problem at hand. Nevertheless, a particular type of covariance function or some of the training set parameters, such as the algorithm for selecting points from the archive or the maximum radius for this selection, might be preferred, especially in higher dimensions.

5.2  DTS-CMA-ES Implementation and Parameter Settings

The implementation of the DTS-CMA-ES is based on the IPOP-CMA-ES in the MATLAB version 3.62beta. For the COCO benchmarks, the initial mean is uniformly sampled from [−4, 4]^D, the initial step-size is σ(0) = 8/3, and the number of IPOP restarts is limited to 50. The changes in the CMA-ES code are rather subtle: the doubly trained EC is employed just after the new CMA-ES population is sampled (step 3 in Algorithm 3), and the EC returns to the CMA-ES a mix of the original and GP model fitness, scaled such that none of the predicted fitness values is lower than the so-far minimal original f-value in A.

Both GP model trainings in the DTS-CMA-ES (steps 4 and 9) can fail, usually either due to numerical errors in the Cholesky decomposition of the GP covariance matrix (used for its inversion) during hyperparameter estimation, or because the model is over-fitted in such a way that it returns a constant almost everywhere. If the training of the first model f_M1 fails and there is no successfully trained model at most two generations old, the algorithm proceeds as the standard CMA-ES: the whole population is evaluated with the original fitness. If the training of the second model f_M2 fails, the first model is used for the final prediction, too, and the self-adaptation step is skipped.

Ratio of original-evaluated points α. The ratio of original-evaluated points should depend inversely on the model error and can either be set to a fixed value or be adapted by the DTS-CMA-ES. For evaluation budgets of roughly 50–200 FE/D, the experimentally discovered fixed value α = 0.05 has been shown to achieve some of the best results on the COCO testbed (Pitra et al., 2016); updated results are provided later in this section. The self-adaptive setting is, on the other hand, better for larger evaluation budgets and for harder-to-regress functions; its setting is described below.

Population size. Replacing the exact fitness values with values returned by a surrogate model introduces an uncertainty of which the CMA-ES is not aware. The CMA-ES updates are then too rapid with respect to the number of original FE/D, and especially the step-size σ(g) is shrunk too fast. As a result, the CMA-ES is often trapped in a local optimum. As the goal of our work is not a precise fine-tuning of the CMA-ES internal constants, we use the population size λ = 8 + ⌊6 ln D⌋, which is double the default value. This enlargement results in adequate CMA-ES updates, and the convergence rate of the DTS-CMA-ES is considerably improved (Pitra et al., 2016).

Criteria for the selection of original-evaluated points. Figure 2 shows aggregated results of all five mentioned criteria, separately for selected unimodal and multimodal COCO benchmarks. These graphs confirm both the theoretical expectations and the previous investigations, for example, in Ulmer et al. (2003). Selecting the original-evaluated points based on the best GP predictive mean (C_M) is the most exploitative option: it leads the algorithm towards the nearest local optimum. It thus works well on the unimodal functions, but it provides the worst results on the multimodal benchmarks.

Figure 2:

Criteria for the selection of original-evaluated points (CM, CSTD, CEI, CPoI, CERDE) in the α=0.05/DTS-CMA-ES. The log10 of the median best f-value distances to the optima were linearly scaled to [-8,0] and averaged (solid lines) over the selected unimodal (first row) and multimodal (second row) benchmarks separately. The first and third quartiles from the respective sets of medians are denoted with dotted lines.

Similarly, the CEI is less localized than the CPoI and thus more explorative. The CPoI performs considerably better on the unimodal functions, while the CEI and also the CSTD achieve superior results on the multimodal functions, where the CPoI still performs adequately. The CERDE is neither the best nor the worst: it performs similarly to or slightly better than the CSTD on the unimodal functions, but considerably worse on the multimodal benchmarks.

Altogether, the best average results on the whole COCO benchmark set are provided by the CPoI, which is used with the threshold T = fmin − 0.05(fmax − fmin) in the rest of this work. However, it can be replaced with the CEI or the CSTD when some a priori knowledge about the multimodality of the fitness is available.
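
A minimal MATLAB sketch of this criterion follows; the predictive means and variances are illustrative numbers, and taking fmin and fmax from the GP predictions of the current population is our assumption about the exact definition of the threshold.

    % Probability of improvement C_PoI with threshold T = fmin - 0.05*(fmax - fmin).
    mu = [1.2 0.7 1.9 0.5 1.4 0.9];          % GP predictive means of the population
    s2 = [0.20 0.05 0.30 0.40 0.10 0.02];    % GP predictive variances
    fmin = min(mu);  fmax = max(mu);
    T = fmin - 0.05*(fmax - fmin);           % threshold slightly below the best value

    z   = (T - mu) ./ sqrt(s2);
    poi = 0.5 * erfc(-z / sqrt(2));          % PoI(x) = Phi((T - mu(x)) / sigma(x))

    nOrig = max(1, ceil(0.05 * numel(mu)));  % alpha = 0.05 share of the population
    [~, order] = sort(poi, 'descend');
    origEvaluated = order(1:nOrig)           % indices selected for original evaluation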

Self-adaptation parameter setting. The specific behavior of the proposed α-self-adaptation is determined by the following set of parameters whose values were recently investigated by Pitra et al. (2017).

  • β – exponential smoothing update rate (a.k.a. learning rate) that moderates oscillations of α. Pitra et al. (2017) recommend β=0.3, which is used in our experiments, too, and which performs very similarly to β=0.2, the value used in the s*ACM-ES (Loshchilov et al., 2013). The DTS-CMA-ES has proven rather robust with respect to the exact setting of this parameter.

  • αmin, αmax – minimum and maximum bounds for the ratio α, which should be set so that the self-adaptation can react adequately on a broad set of problems. The used minimum bound αmin=0.04 corresponds to one evaluated point per generation in D = 2, ..., 20, and the maximum bound should be set to αmax=1.0 to allow disabling the surrogate modeling completely on fitness landscapes where the GPs are not able to predict the population ranking satisfactorily.

  • εmin, εmax – lower and upper saturation bounds on the exponentially smoothed RDEμ error for the linear transfer function. These parameters determine the self-adaptive behavior to a large extent and should change with both the dimension D of the problem and the currently used ratio α(g+1). We have used the proposed set of multiple linear regression models fεmin(α,D), fεmax(α,D) from Pitra et al. (2017) with the largest considered interval [εmin, εmax], i.e., the regression models εminQ2 and εmaxQ3 that are most capable of adapting to a changing GP ranking error. The parameters of these linear models, which were tuned on one half of the COCO functions, are
    εminQ2 = fεmin(α, D) = (1, ln(D), α, α·ln(D), α²) · bmin,
    εmaxQ3 = fεmax(α, D) = (1, ln(D), α, α·ln(D), α²) · bmax,
    where
    bmin = (0.11, −0.0092, −0.13, 0.044, 0.14)T and bmax = (0.35, −0.047, 0.44, 0.044, −0.19)T.
    Because the calculation of the values εmin, εmax and the calculation of the resulting ratio α(g+1) (step 3 in Algorithm 4) are mutually dependent, they are repeated alternately in a cycle until convergence or for at most 500 iterations; a simplified sketch of this computation is given after the list.
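
The following MATLAB sketch illustrates this alternating computation under simplifying assumptions: the saturated linear transfer function between the bounds and the placement of the exponential smoothing are inferred from the description above and may differ in details from the actual DTS-CMA-ES implementation; all numeric inputs except the regression coefficients are illustrative.

    % Hedged sketch of one alpha-update (step 3 of Algorithm 4).
    D    = 5;                     % problem dimension
    beta = 0.3;                   % exponential smoothing rate of the RDE_mu error
    aMin = 0.04;  aMax = 1.0;     % bounds on the ratio alpha
    bMin = [0.11; -0.0092; -0.13; 0.044;  0.14];   % coefficients of eps_min^Q2
    bMax = [0.35; -0.047;   0.44; 0.044; -0.19];   % coefficients of eps_max^Q3

    rdePrev = 0.25;  rdeNew = 0.18;                % illustrative RDE_mu errors
    rde = (1 - beta)*rdePrev + beta*rdeNew;        % exponentially smoothed error

    alpha = 0.10;                                  % previous ratio alpha^(g)
    for it = 1:500
      x = [1, log(D), alpha, alpha*log(D), alpha^2];   % regressors of the linear models
      epsMin = x * bMin;
      epsMax = x * bMax;
      t = min(max((rde - epsMin) / (epsMax - epsMin), 0), 1);
      alphaNew = aMin + t*(aMax - aMin);               % saturated linear transfer function
      if abs(alphaNew - alpha) < 1e-8, alpha = alphaNew; break; end
      alpha = alphaNew;                                % alternate until convergence
    end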

5.3  Evaluation of Gaussian Process-Based CMA-ES Algorithms

This section assesses the performance of the six algorithms that are based on the CMA-ES and a GP model: the S-CMA-ES, the DTS-CMA-ES in both the fixed and the self-adaptive version, the MA-ES, the GPOP, and the SAPEO. For the sake of comparison with other state-of-the-art black-box optimizers, the graphs in Figures 3 and 4 also present the results of the standard IPOP-CMA-ES with both the recommended and the doubled population size, two other surrogate-assisted CMA-ES algorithms, the lmm-CMA and the s*ACM-ES, the SMAC algorithm as a representative of Bayesian optimization algorithms, and two local-search numerical optimization algorithms based on a trust-region method: the BOBYQA algorithm by Powell (2009) and the interior-point method (Byrd et al., 2000) from the MATLAB fmincon function.4 The numbers of COCO functions for which one algorithm is better than each of the others are presented in Table 5.

Figure 3:

Medians (solid) and 1st/3rd quartiles (dotted) of the distances to the optima of 24 COCO benchmarks in 5-D for 13 black-box optimizers with the emphasis on the six Gaussian process/CMA-ES algorithms. Medians/quartiles were calculated across 15 independent function instances for each algorithm and are shown in the log10 scale.

Figure 4:

Medians (solid) and 1st/3rd quartiles (dotted) of the distances to the optima of 24 COCO benchmarks in 20-D for 13 black-box optimizers with the emphasis on the six Gaussian process/CMA-ES algorithms. Medians/quartiles were calculated across 15 independent function instances for each algorithm and are shown in the log10 scale.

Table 5:
A pairwise comparison of the algorithms in 5-D and 20-D in terms of the number of wins of the row algorithm against the column algorithm over all COCO functions for two evaluation budgets. The budget denoted 1 is the smallest number of FE/D at which any of the algorithms reached the target ft = 10^−8 on the respective function, or 250 FE/D if the target was not reached; the budget denoted 1/3 is one third of the budget 1. For all four combinations of budgets and dimensions, the Iman-Davenport test rejects the hypothesis of equal ranks of the algorithms with p-value < 10^−7. The asterisk marks the row algorithm achieving a significantly lower value of the objective function than the column algorithm (on medians over 15 instances from 24 functions) according to the Friedman post-hoc test with the Shaffer correction at the family-wise level 0.05.
5-D (column algorithms: S-CMA-ES, 0.05 DTS, adp. DTS, MA-ES, GPOP, SAPEO, CMA-ES, CMA-ES 2pop, s*ACM-ES, lmm-CMA, BOBYQA, SMAC, fmincon; two sub-columns per algorithm for the evaluation budgets 1/3 and 1)
S-CMA-ES — — 12 13 14 12 13 14 11 10 16 18 12 10 
0.05 DTS 21* 22* — — 16 15 18 19* 21 23* 19 21* 19* 19* 21* 18* 20* 19* 15 16 16 19 19* 20* 16 22* 
adp. DTS 19 22* — — 17 20 15 20* 18 19 22 23* 23* 23* 22 18 12 16 12 15 18 20* 13 15 
MA-ES 12 15 — — 12 15 13 12 18 16 20 15 16 11 10 10 14 19 11 10 
GPOP 15 11 12 — — 11 14 13 16 11 15 10 12 13 14 10 
SAPEO 15 17 11 12 13 15 — — 15 16 21 16 14 11 10 10 14 18 13 
CMA-ES 10 12 10 11 — — 18 10 11 11 11 12 15 11 
CMA-ES 2pop 11 10 13 14 — — 10 10 15 11 10 
s*ACM-ES 13 14 13 15 10 13 13 13 15 14 — — 12 13 17 12 13 
lmm-CMA 19 22 12 16 18 16 17 18 18 21 18 19* 16 18 19 — — 14 14 17 19* 13 13 
BOBYQA 16 16 12 14 14 14 17 14 14 13 15 16 14 15 12 10 10 — — 15 17 16 12 
SMAC 10 12 11 10 12 16 11 — — 13 11 
fmincon 12 14 11 13 14 10 14 11 15 13 15 13 14 12 11 11 11 12 11 13 — — 
20-D
S-CMA-ES — — 11 10 15 12 11 11 12 11 11 11 18 16 13 10 
0.05 DTS 23 21 — — 13 14 18 17 18* 18 17 16 19* 18 17* 14 15 17 14 19 17 20* 19* 17 13 
adp. DTS 16 18* 10 — — 20 21 20* 21* 17 18 22* 21 23* 21 15 17 12 17 13 20* 20* 16 11 
MA-ES 13 14 — — 15 15 18 11 17 12 11 16 18 12 11 
GPOP 12 — — 11 10 13 16 17 14 
SAPEO 16 16 16 21 15 17 — — 21 21 20 16 11 11 10 15 12 20 20* 16 11 
CMA-ES 13 13 13 13 14 — — 19 12 11 11 14 17 12 
CMA-ES 2pop 12 17 10 15 11 17 12 — — 10 13 14 19 12 
s*ACM-ES 15 19* 18 15 16 20 18 19* 13 13 18 15 19 16 — — 10 13 16 17 17 21* 16 15 
lmm-CMA 13 20 10 12 18 18 17 18 15 14 20 18 19 14 14 11 — — 18 18 17* 20* 13 10 
BOBYQA 13 13 11 12 13 15 18 12 13 13 16 11 — — 15 17 13 13 
SMAC 10 10 — — 11 10 
fmincon 11 14 11 13 12 13 10 15 13 12 15 12 15 11 14 11 11 13 14 — — 

Whereas the detailed parameter settings of the DTS-CMA-ES can be found above, the settings of the remaining algorithms are listed below.

  1. S-CMA-ES population size λ=4+3lnD, model life-length gm=5, Gaussian process surrogate model with training set selection TSS3, rAmax=10, Nmax=15·D, Matérn covariance function with ν=5/2 (a summary of these settings is sketched after this list)

  2. MA-ES population size λ=10, parent number μ=2, number of training points 2λ, extended population size λPre=3λ, predictive mean as pre-selection criterion, our MATLAB implementation based on the implementation of the S-CMA-ES

  3. GPOP population size λ=10, parent number μ=2, training set size parameters NC=NR=5D, input space termination tolerance 10^−8, minimum fitness change over the last 10 iterations (otherwise restart) 10^−9, perturbation size m=1, our GPML-based MATLAB implementation using the Gaussian processes from the S-CMA-ES

  4. SAPEO the COCO/BBOB results have been kindly provided by the algorithm's authors; variant ucp-2: sample-size 2λ, using three increasingly risky dominance relations

  5. CMA-ES the IPOP-CMA-ES version (MATLAB code 3.62β) with the single (4+3lnD) and the doubled population size, 50 IPOP restarts, σ(0) = 8/3, other settings left at their defaults

  6. lmm-CMA, s*ACM-ES, SMAC results downloaded from the COCO/BBOB results archive, the BIPOP-s*ACM-ES-k (2013) used for the s*ACM-ES, SMAC results are only up to 100 FE/D; initializations of the lmm-CMA and SMAC are rather disputable on f19

  7. BOBYQA, fmincon both algorithms initialized from a new uniformly sampled point every fifth restart, otherwise used in their default settings, Dlib C++ implementation of BOBYQA, fmincon with the interior-point algorithm from MATLAB 2017a
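
For illustration, the S-CMA-ES settings from item 1 can be collected in a MATLAB struct as sketched below; the field names are purely illustrative and do not correspond to the identifiers used in the surrogate-cmaes source code.

    % Hedged restatement of the S-CMA-ES settings from item 1 above.
    D = 5;                                      % example dimension
    scmaes.lambda       = 4 + floor(3*log(D));  % population size
    scmaes.modelLife    = 5;                    % model life-length g_m
    scmaes.trainSetSel  = 'TSS3';               % training set selection method
    scmaes.rMaxArchive  = 10;                   % r^A_max, radius for training-point selection
    scmaes.nMaxTrain    = 15*D;                 % N_max, maximum number of training points
    scmaes.covarianceFn = 'Matern, nu = 5/2';   % Matern covariance function
    disp(scmaes);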

The graphs in Figures 3 and 4 document the differences between the GP/CMA-ES algorithms. They depict the best-achieved logarithmic median distances to the benchmarks' optima Δflog for the respective numbers of FE/D. These medians (and, in the case of the GP/CMA-ES algorithms, also the first and third quartiles) are computed from 15 independent instances for each respective algorithm, function, and dimension. The results of the GP/CMA-ES algorithms are drawn with thick lines and the reference algorithms with thinner lines, so that it is easy to see whether any algorithm performs better than all of the GP/CMA-ES variants.

S-CMA-ES. A noticeable pattern among our algorithms is, with few exceptions in 20-D, that the S-CMA-ES almost always converges to the optimum at the same rate as or more slowly than both versions of the DTS-CMA-ES. This might be at least partly because the generation-based evolution control of the S-CMA-ES is not able to exploit the GP model's uncertainty; therefore, we focus our attention on the DTS-CMA-ES.

DTS-CMA-ES. The fixed (α=0.05) version of the DTS-CMA-ES converges fastest out of the six GP/CMA-ES algorithms particularly on the subset f8–f14 of the unimodal functions (which are, more or less after some transformation, similar to the Rosenbrock function f8) and also on the highly multimodal functions f19–f21, f23, f24, often functions without any global structure. While some of the non-GP algorithms (s*ACM-ES, lmm-CMA, BOBYQA, or fmincon) perform similarly or better on the former group (f8–f14), the performance on the multimodal functions (f19–f21, f23, f24) is often the best of all compared algorithms. These results are probably due to the fixed DTS-CMA-ES being able to descend to the functions' optima regardless of the relatively high model error, which forces the adaptive algorithms to spend more original evaluations than necessary. On the other hand, the limitations of GP models can be seen on the non-smooth functions f6 and f13, where the GP is not able to regress the fitness around the optimum and where other algorithms perform better.

The self-adaptive DTS-CMA-ES is mostly the second or the third GP/CMA-ES algorithm on the functions where the fixed DTS-CMA-ES succeeds (see the previous paragraph). However, the self-adaptation improves the behavior of the fixed DTS-CMA-ES on the multimodal functions with a global structure, especially on the Rastrigin functions f3, f4, f15 and the Schaffers' functions f17, f18, where the self-adaptive DTS-CMA-ES is usually the best of all 13 algorithms. A closer look at the convergence graphs reveals that the self-adaptive DTS-CMA-ES indeed speeds up the 2-population IPOP-CMA-ES on these functions, approximately by a factor of 2.

MA-ES. Our implementation of the MA-ES most often occupies the worse half of the GP-based algorithm ranking, even on the unimodal functions at which the (2,10) setting was, according to its authors, mainly aimed. The results are often very similar to or slightly better than the results of the IPOP-CMA-ES, which indicates that the proposed individual-based EC is not able to speed up the CMA-ES adequately compared to the EC of the other surrogate algorithms. However, the results on the Attractive sector f6 are rather surprising, because the MA-ES model does not mislead the search even though the function is not smooth in the optimum and is thus very hard to regress by GPs.

GPOP, SMAC. Although the GPOP belongs to Bayesian optimization algorithms, it descends much closer to the global optima than SMAC. For very few FE/D, SMAC is slightly faster, but it very soon gets trapped in local optima. The GPOP is often the slowest or the second slowest GP/CMA-ES algorithm on the COCO set (in 20-D on f3, f4, f6, f7, f10–f16, f19–f24), but it is the fastest during the initial generations on the Sphere f1 and the Ellipsoidal f2, where the GP is able to regress the function well even with a very small number of training points. The GPOP is the second best on the Schaffers' functions f17, f18 (rather surprisingly only in 20-D; it is the slowest in 5-D).

s*ACM-ES. The convergence speed of the BIPOP-s*ACM-ES is often similar to or slightly worse than that of the self-adaptive DTS-CMA-ES in 5-D. In 20-D, however, the s*ACM-ES converges much faster than the other algorithms, and it is the best or the second best among all 13 algorithms on the unimodal functions f2, f7, f8, f10–f14. It probably profits from its ranking surrogate model and, therefore, from the invariance to monotonic transformations of the fitness. The considerably faster convergence is clearly visible especially on the ill-conditioned or non-smooth unimodal functions.

SAPEO. The SAPEO algorithm usually converges at a similar rate as the MA-ES or the IPOP-CMA-ES. Nonetheless, its performance is considerably better than that of the MA-ES on the functions f9, f11, f14, f19, f22, and it is the best algorithm on f6 and f16 in 20-D, where it obviously adapts well to the fitness. Hence, the concept of uncertainty propagation is rather promising, especially considering that the GP model was trained on only 2λ points.

lmm-CMA. The quadratic-model-based lmm-CMA ranks on most benchmarks somewhere between or very close to the fixed and the self-adaptive DTS-CMA-ES (f2–f4, f8–f12, f14–f18, f21, f22), which conforms with the average results in Figure 5, where it is very close to the self-adaptive DTS-CMA-ES. This similarity of the results is in accordance with the fact that both algorithms use a metric (i.e., not a rank-based) surrogate model as well as a similar kind of adaptivity.

Figure 5:

Median distances to the benchmarks' optima averaged (solid lines) over all 24 COCO/BBOB functions in dimensions 2, 5, 10, 20; dotted lines denote the first and third quartiles. The log10 of the median distances were linearly scaled to [-8,0] for each function before calculating the averages/quartiles.

BOBYQA and fmincon. The BOBYQA is the best optimizer (out of all 13 compared) at the very beginning of the optimization, i.e., up to roughly 15–20 FE/D. Later, the behavior of both the BOBYQA and fmincon changes dramatically with the benchmark at hand. They are rather successful on the unimodal functions: BOBYQA on f1, f5, f8, f9, and fmincon on the ill-conditioned f10–f14 and also on f1, f2, and f6 in 5-D. Moreover, they are in a few cases able to solve the multimodal benchmarks successfully: BOBYQA handles f19–f24 adequately, and fmincon is one of the best in 20-D on f21, f22. Not surprisingly, however, these algorithms otherwise often get stuck in local optima and on several benchmarks rank worse than the standard IPOP-CMA-ES.

Although it is not the prime criterion for the comparison of black-box optimizers, Table 6 also provides the average CPU time per one original function evaluation for the algorithms that we have benchmarked ourselves.

Table 6:
The CPU time in seconds per one original function evaluation: averaged results over all 24 BBOB functions measured on the computers of the Czech national computational grid MetaCentrum (single-threaded jobs on mostly Intel Xeon-based computers).
Algorithm    2-D    3-D    5-D    10-D    20-D
S-CMA-ES 2.5e−01 1.5e−01 1.7e−01 2.6e−01 1.1e+00 
0.05/2pop DTS-CMA-ES 7.3e−01 6.0e−01 7.5e−01 1.7e+00 1.3e+01 
adaptive DTS-CMA-ES 5.0e−01 2.7e−01 2.9e−01 9.1e−01 5.6e+00 
MA-ES 2.1e−02 4.5e−02 4.8e−02 4.6e−02 7.1e−02 
GPOP 9.4e−01 1.2e+00 1.8e+00 3.4e+00 1.2e+01 
CMA-ES 2pop 3.2e−03 2.2e−03 1.4e−03 8.4e−04 6.1e−04 
BOBYQA 1.8e−03 1.3e−03 9.5e−04 6.2e−04 5.3e−04 
fmincon 2.8e−03 1.7e−03 1.1e−03 6.8e−04 4.5e−04 

6  Conclusion and Further Research Directions

More than a decade of research in the area of model-based CMA-ES algorithms has established the surrogate-assisted CMA-ES algorithms as state-of-the-art black-box continuous optimizers, especially when only tens to hundreds of function evaluations are available for the optimization. In contrast to the overall success of Gaussian processes, employing Gaussian processes as surrogate models, namely in the algorithms MA-ES, GPOP, or the Kriging metamodeling CMA-ES, has gained considerably less attention than the quadratic-model-based lmm-CMA or the SVM-based s*ACM-ES.

This article re-investigates the employment of Gaussian processes in the CMA-ES. The presented algorithm DTS-CMA-ES differs from the lmm-CMA and the s*ACM-ES not only in the regression model, but, more importantly, in using the Gaussian process uncertainty prediction for selecting the points in its evolution control. The selection criterion that does not use the uncertainty performs on average worst, and the uncertainty has been shown to help more on the multimodal benchmarks.

The fifth section provides an evaluation of different parameter settings. The selection of the k nearest-neighbor points as training sets for the surrogates performed best, a selection method rather rarely encountered in the literature. Other parameter settings, used in the experiments due to their good performance, are rather expected (e.g., a large radius for the training set selection, the Matérn covariance function, etc.).

The self-adaptive version of the DTS-CMA-ES uses an adaptation mechanism for the number of points evaluated with the original fitness similar to that of the lmm-CMA. This adaptive version also performs similarly to the lmm-CMA on many COCO benchmark functions, which suggests the hypothesis that the self-adaptation determines the algorithm performance to a great extent. Less importantly, the self-adaptive DTS-CMA-ES is the best out of the 13 compared algorithms on at least two benchmarks.

The last part of the article empirically compares the available implementations of the algorithms combining Gaussian processes and the CMA-ES with the CMA-ES itself and five other state-of-the-art black-box optimizers. The fastest convergence rate in terms of the original fitness evaluations out of the six compared GP/CMA-ES algorithms was reached by the DTS-CMA-ES with the fixed parameter α=0.05. This DTS-CMA-ES variant represents the algorithm of choice for multimodal functions with weak global structure and is well suited for unimodal landscapes, too, especially in lower dimensions. The self-adaptive version, on the other hand, excels on the multimodal functions with a global structure, where it outperforms the other compared algorithms.

Competitive results on a relatively broad spectrum of the COCO benchmarks were also provided by the two non-GP surrogate-assisted CMA-ES algorithms. Since the s*ACM-ES uses a ranking surrogate model, it is valuable especially for functions without continuous derivatives in the optima. It also provides impressive results in 20-D where it represents the best choice for many unimodal functions while still providing average results on the multimodal ones.

According to the experimental results, different surrogate CMA-ES algorithms or their variants perform best on different fitness landscapes, but the mechanism for selecting the best-performing setting, or a smarter adaptation to the fitness at hand, is still an open issue. Such adaptivity (or meta-learning) should select the appropriate evolution control parameters as well as the most convenient surrogate model and its parameters, and, in particular, it should decide whether to use a metric-based (probably Gaussian process) or a ranking-based (e.g., ranking SVM) model. Besides, the parameters of the CMA-ES itself are problem-dependent, too. In fact, CMA-ES variants more suitable for expensive problems have already been proposed (Loshchilov, 2017; Belkhir et al., 2017), and they can obviously be combined with surrogate models.

To conclude, the surrogate-model-based CMA-ES algorithms constitute state-of-the-art optimizers for black-box optimization problems. In particular, Gaussian processes have been shown to be very competitive (if not the best) surrogate models, especially for multimodal, moderately dimensional objective functions.

Acknowledgments

This research was supported by the Czech Science Foundation grants No. 17-01251 and 18-18080S. Access to the National Grid Infrastructure MetaCentrum provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures” (CESNET LM2015042) is greatly appreciated.

Notes

1. The detailed results can be found on the competition's webpage at https://bbcomp.ini.rub.de/

2. The RDEμ could be replaced by an arbitrary error measure between f-value vectors that depends only on the ranks of the elements of these vectors.

3. The source code is available on GitHub at https://github.com/bajeluk/surrogate-cmaes

4. Additional results are available on the author's webpage at http://bajeluk.matfyz.cz/gpcomparison2017

References

Auger, A., Brockhoff, D., and Hansen, N. (2013). Benchmarking the local metamodel CMA-ES on the noiseless BBOB'2013 test bed. In Proceedings of the 15th Annual Conference Companion (GECCO), pp. 1225–1232.

Auger, A., Schoenauer, M., and Vanhaecke, N. (2004). LS-CMA-ES: A second-order algorithm for covariance matrix adaptation. In Parallel Problem Solving from Nature, pp. 182–191. Lecture Notes in Computer Science, Vol. 3242.

Baerns, M., and Holeňa, M. (2009). Combinatorial development of solid catalytic materials: Design of high-throughput experiments, data analysis, data mining. Vol. 7 of Catalytic Science Series. London: World Scientific.

Bajer, L., Pitra, Z., and Holeňa, M. (2015). Benchmarking Gaussian processes and random forests surrogate models on the BBOB noiseless testbed. In Proceedings of the 15th Annual Conference Companion (GECCO), pp. 1143–1150.

Bekasiewicz, A., and Koziel, S. (2017). Surrogate-assisted design optimization of photonic directional couplers. International Journal of Numerical Modelling: Electronic Networks, Devices and Fields, 30(3–4).

Belkhir, N., Dréo, J., Savéant, P., and Schoenauer, M. (2017). Per instance algorithm configuration of CMA-ES with limited budget. In Proceedings of the 19th Annual Conference (GECCO), pp. 681–688.

Brent, R. P. (1973). Algorithms for minimization without derivatives. Prentice-Hall Series in Automatic Computation. Englewood Cliffs, NJ: Prentice-Hall.

Büche, D., Schraudolph, N. N., and Koumoutsakos, P. (2005). Accelerating evolutionary algorithms with Gaussian process fitness function models. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 35(2):183–194.

Byrd, R. H., Gilbert, J. C., and Nocedal, J. (2000). A trust region method based on interior point techniques for nonlinear programming. Mathematical Programming, 89(1):149–185.

Emmerich, M., Giannakoglou, K., and Naujoks, B. (2006). Single- and multiobjective evolutionary optimization assisted by Gaussian random field metamodels. IEEE Transactions on Evolutionary Computation, 10(4):421–439.

Emmerich, M., Giotis, A., Özdemir, M., Bäck, T., and Giannakoglou, K. (2002). Metamodel-assisted evolution strategies. In Parallel Problem Solving from Nature, pp. 361–370. Lecture Notes in Computer Science, Vol. 2439.

Hansen, N. (2006). The CMA evolution strategy: A comparing review. In J. A. Lozano, P. Larrañaga, I. Inza, and E. Bengoetxea (Eds.), Towards a new evolutionary computation, Number 192 in Studies in Fuzziness and Soft Computing (pp. 75–102). Berlin Heidelberg: Springer.

Hansen, N. (2011). Injecting external solutions into CMA-ES. Technical Report 7748, INRIA Saclay. Retrieved from arXiv:1110.4181.

Hansen, N., Auger, A., Mersmann, O., Tušar, T., and Brockhoff, D. (2016). COCO: A platform for comparing continuous optimizers in a black-box setting. Retrieved from arXiv:1603.08785.

Hansen, N., and Ostermeier, A. (2001). Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195.

Hutter, F., Hoos, H., and Leyton-Brown, K. (2013). An evaluation of sequential model-based optimization for expensive blackbox functions. In Proceedings of the 15th Annual Conference Companion (GECCO), pp. 1209–1216.

Hutter, F., Hoos, H. H., and Leyton-Brown, K. (2011). Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent Optimization, pp. 507–523. Lecture Notes in Computer Science, Vol. 6683.

Jin, Y. (2011). Surrogate-assisted evolutionary computation: Recent advances and future challenges. Swarm and Evolutionary Computation, 1(2):61–70.

Jin, Y., Olhofer, M., and Sendhoff, B. (2001). Managing approximate models in evolutionary aerodynamic design optimization. In Proceedings of the 2001 Congress on Evolutionary Computation, Vol. 1, pp. 592–599.

Jin, Y., Olhofer, M., and Sendhoff, B. (2002). A framework for evolutionary optimization with approximate fitness functions. IEEE Transactions on Evolutionary Computation, 6(5):481–494.

Jones, D. R. (2001). A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21(4):345–383.

Kern, S., Hansen, N., and Koumoutsakos, P. (2006). Local meta-models for optimization using evolution strategies. In Parallel Problem Solving from Nature, pp. 939–948. Lecture Notes in Computer Science, Vol. 4193.

Kruisselbrink, J., Emmerich, M. T. M., Deutz, A., and Bäck, T. (2010). A robust optimization approach using Kriging metamodels for robustness approximation in the CMA-ES. In 2010 IEEE Congress on Evolutionary Computation, pp. 1–8.

Lee, H., Jo, Y., Lee, D.-J., and Choi, S. (2016). Surrogate model based design optimization of multiple wing sails considering flow interaction effect. Ocean Engineering, 121:422–436.

Loshchilov, I. (2017). LM-CMA: An alternative to L-BFGS for large scale black-box optimization. Evolutionary Computation, 25(1):143–171.

Loshchilov, I., Schoenauer, M., and Sebag, M. (2010). Comparison-based optimizers need comparison-based surrogates. In Parallel Problem Solving from Nature, pp. 364–373. Lecture Notes in Computer Science, Vol. 6238.

Loshchilov, I., Schoenauer, M., and Sebag, M. (2013). Intensive surrogate model exploitation in self-adaptive surrogate-assisted CMA-ES (saACM-ES). In Proceedings of the 15th Annual Conference (GECCO), pp. 439–446.

Lu, J., Li, B., and Jin, Y. (2013). An evolution strategy assisted by an ensemble of local Gaussian process models. In Proceedings of the 15th Annual Conference (GECCO), pp. 447–454.

MacKay, D. J. C. (1998). Introduction to Gaussian processes. Neural Networks and Machine Learning, 168:133–165.

Mockus, J. B., Tiešis, V., and Žilinskas, A. (1978). The application of Bayesian method for seeking the extremum. In L. C. W. Dixon and G. P. Szegö (Eds.), Towards global optimization 2 (pp. 117–129). Amsterdam: North-Holland.

Mockus, J. B., and Žilinskas, A. (1972). On Bayesian method for seeking the minimum. Avtomatika i vychislitelnaja technika (in Russian).

Mohammadi, H., Riche, R. L., and Touboul, E. (2015). Making EGO and CMA-ES complementary for global optimization. In Learning and Intelligent Optimization, pp. 287–292. Lecture Notes in Computer Science, Vol. 8994.

Neal, R. M. (1994). Bayesian learning for neural networks. Doctoral thesis, University of Toronto, Toronto.

Neal, R. M. (1997). Monte Carlo implementation of Gaussian process models for Bayesian regression and classification. Technical Report 9702, Department of Statistics, University of Toronto.

Nelder, J. A., and Mead, R. (1965). A simplex method for function minimization. The Computer Journal, 7(4):308–313.

Pitra, Z., Bajer, L., and Holeňa, M. (2016). Doubly trained evolution control for the surrogate CMA-ES. In Parallel Problem Solving from Nature. Lecture Notes in Computer Science, Vol. 9921.

Pitra, Z., Bajer, L., Repický, J., and Holeňa, M. (2017). Adaptive doubly trained evolution control for the covariance matrix adaptation evolution strategy. In Information Technologies—Applications and Theory, Martin. CreateSpace Independent Publishing Platform.

Powell, M. J. (2009). The BOBYQA algorithm for bound constrained optimization without derivatives. Technical Report NA2009/06, University of Cambridge, Cambridge, UK.

Rasmussen, C. E., and Williams, C. K. I. (2006). Gaussian processes for machine learning. Adaptive Computation and Machine Learning Series. Cambridge, MA: MIT Press.

Repický, J., Bajer, L., Pitra, Z., and Holeňa, M. (2017). Adaptive generation-based evolution control for Gaussian process surrogate models. In Information Technologies—Applications and Theory, Martin. CreateSpace Independent Publishing Platform.

Rios, L. M., and Sahinidis, N. V. (2012). Derivative-free optimization: A review of algorithms and comparison of software implementations. Journal of Global Optimization, 56(3):1247–1293.

Runarsson, T. P. (2004). Constrained evolutionary optimization by approximate ranking and surrogate models. In Parallel Problem Solving from Nature, pp. 401–410. Lecture Notes in Computer Science.

Ulmer, H., Streichert, F., and Zell, A. (2003). Evolution strategies assisted by Gaussian processes with improved preselection criterion. In The 2003 Congress on Evolutionary Computation, Vol. 1, pp. 692–699.

Volz, V., Rudolph, G., and Naujoks, B. (2017). Investigating uncertainty propagation in surrogate-assisted evolutionary algorithms. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pp. 881–888.

Williams, C. K. I., and Rasmussen, C. E. (1996). Gaussian processes for regression. In D. S. Touretzky and M. E. Hasselmo (Eds.), NIPS 1995 (pp. 514–520). Cambridge, MA: MIT Press.

Zaefferer, M., Gaida, D., and Bartz-Beielstein, T. (2016). Multi-fidelity modeling and optimization of biogas plants. Applied Soft Computing, 48:13–28.