## Abstract

We introduce an acceleration for covariance matrix adaptation evolution strategies (CMA-ES) by means of *adaptive diagonal decoding* (dd-CMA). This diagonal acceleration endows the default CMA-ES with the advantages of separable CMA-ES without inheriting its drawbacks. Technically, we introduce a diagonal matrix $D$ that expresses coordinate-wise variances of the sampling distribution in *DCD* form. The diagonal matrix can learn a rescaling of the problem in the coordinates within a linear number of function evaluations. Diagonal decoding can also exploit separability of the problem, but, crucially, does not compromise the performance on nonseparable problems. The latter is accomplished by modulating the learning rate for the diagonal matrix based on the condition number of the underlying correlation matrix. dd-CMA-ES not only combines the advantages of default and separable CMA-ES, but may achieve overadditive speedup: it improves the performance, and even the scaling, of the better of default and separable CMA-ES on classes of nonseparable test functions that reflect, arguably, a landscape feature commonly observed in practice.

The article makes two further secondary contributions: we introduce two different approaches to guarantee positive definiteness of the covariance matrix with active CMA, which is valuable in particular with large population size; we revise the default parameter setting in CMA-ES, proposing accelerated settings in particular for large dimension.

All our contributions can be viewed as independent improvements of CMA-ES, yet they are also complementary and can be seamlessly combined. In numerical experiments with dd-CMA-ES up to dimension 5120, we observe remarkable improvements over the original covariance matrix adaptation on functions with coordinate-wise ill-conditioning. The improvement is observed also for large population sizes up to about dimension squared.

## 1 Introduction

In real-world applications of continuous optimization that involve simulations, such as physics or chemistry simulations, the relation between a candidate solution and its objective function value can rarely be expressed as an explicit mathematical formula. The objective function value is computed through a complex simulation with a candidate solution as input. In such scenarios, we gain information about the problem only through the evaluation of the objective function value of a given candidate solution. Continuous optimization of $f:\mathbb{R}^n \to \mathbb{R}$ is referred to as black-box continuous optimization if we gain information about the problem only through the evaluation $x \mapsto f(x)$ of a given candidate solution $x \in \mathbb{R}^n$.

Black-box continuous optimization arises widely in real-world applications such as model parameter calibration and the design of robot controllers. It often involves computationally expensive simulations to evaluate the quality of candidate solutions. The search cost of black-box continuous optimization is therefore the number of simulations, that is, the number of objective function calls; a search algorithm should locate good-quality solutions with as few $f$-calls as possible. Practitioners need to choose one or a few search algorithms to solve their problems and tune their hyperparameters based on their prior knowledge of the problems. However, prior knowledge is often limited in the black-box situation due to the black-box relation between $x$ and $f(x)$. Hence, algorithm selection, as well as hyperparameter tuning, is a tedious task for practitioners, who are typically not experts in search algorithms.

The covariance matrix adaptation evolution strategy (CMA-ES), developed by Hansen and Ostermeier (2001), Hansen et al. (2003), Hansen and Kern (2004), and Jastrebski and Arnold (2006), is recognized as a state-of-the-art derivative-free search algorithm for *difficult* continuous optimization problems (Rios and Sahinidis, 2013). CMA-ES is a stochastic, comparison-based search algorithm that maintains a multivariate normal distribution as the sampling distribution of candidate solutions. The distribution parameters, the mean vector and the covariance matrix, are updated at each iteration based on the candidate solutions and their objective value ranking, such that the sampling distribution is more likely to produce promising candidate solutions in the next iteration. The update of the distribution parameters can partially be interpreted as natural gradient ascent on the manifold of the distribution parameter space equipped with the Fisher metric (Akimoto et al., 2010; Glasmachers et al., 2010; Ollivier et al., 2017), thereby revealing the connection to natural evolution strategies (Wierstra et al., 2008; Sun et al., 2009; Glasmachers et al., 2010; Wierstra et al., 2014), whose parameter update is derived explicitly from the natural gradient principle.

Invariance (Hansen, 2000; Hansen et al., 2011) is one of the governing principles in the design of CMA-ES and the essence of its success on *difficult* continuous optimization problems exhibiting ruggedness, ill-conditioning, and nonseparability. CMA-ES has several invariance properties, such as invariance to order-preserving transformations of the objective function, and invariance to translation, rotation, and coordinate-wise scaling of the search space (Hansen and Auger, 2014). Invariance guarantees that the algorithm works identically on an original problem and its transformed version. Thanks to its invariance properties, CMA-ES works, after an adaptation phase, equally well on separable and well-conditioned functions, which are easy for most search algorithms, and on nonseparable and ill-conditioned functions produced by an affine coordinate transformation of the former, which are considered difficult for many other search algorithms. This also allows the default hyperparameter values to depend solely on the search space dimension and the population size, whereas many other search algorithms need tuning depending on problem difficulties to be efficient (Karafotias et al., 2015).

On the other hand, exploiting problem structure, if possible, is beneficial for optimization speed. Different variants of CMA-ES have been proposed to exploit problem structure such as separability and limited variable dependency. They aim to achieve a better scaling with the dimension (Knight and Lunacek, 2007; Ros and Hansen, 2008; Akimoto et al., 2014; Akimoto and Hansen, 2016; Loshchilov, 2017). However, they lose some of the invariance properties of CMA-ES and compromise the performance on problems where their specific, more or less restrictive assumptions on the problem structure do not hold. For instance, the separable CMA-ES (Ros and Hansen, 2008) reduces the degrees of freedom of the covariance matrix from $(n^2+n)/2$ to $n$ by adapting only the diagonal elements of the covariance matrix. It scales better in terms of internal time and space complexity and in terms of the number of function evaluations needed to adapt the coordinate-wise variances of the search distribution. Good results are hence observed on functions with weak variable dependencies. However, unsurprisingly, the convergence speed of the separable CMA deteriorates significantly on nonseparable and ill-conditioned functions, where the shape of the level sets of the objective function cannot be reasonably well approximated by the equiprobability ellipsoid defined by a normal distribution with diagonal (co)variance matrix.

In this article, we aim to improve the performance of CMA-ES on a relatively wide class of problems by exploiting problem structure, crucially, however, without compromising the performance on more difficult problems without this structure.

The first mechanism we are concerned with is the so-called active covariance matrix update, originally proposed for the $(\mu,\lambda)$-CMA-ES with intermediate recombination (Jastrebski and Arnold, 2006) and later incorporated into the $(1+1)$-CMA-ES (Arnold and Hansen, 2010). It utilizes unpromising solutions to actively decrease the eigenvalues of the covariance matrix. The active update consistently improves the adaptation speed of the covariance matrix, in particular on functions where a low dimensional subspace dominates the function value. The positive definiteness of the covariance matrix is, however, not guaranteed when the active update is utilized. Practically, a small enough learning rate for the covariance matrix suffices to keep the covariance matrix positive definite with overwhelming probability; however, we would like to increase the learning rate when the population size is large. We propose two novel schemes that guarantee the positive definiteness of the covariance matrix, so that we can take advantage of the active update even when a large population size is desired, e.g., when the objective function evaluations are distributed over many CPUs or when the objective function is rugged.

The main contribution of this article is the diagonal acceleration of CMA by means of *adaptive diagonal decoding*, referred to as dd-CMA. We introduce a coordinate-wise variance matrix, $D^2$, of the sampling distribution alongside the positive definite symmetric matrix $C$, such that the resulting covariance matrix of the sampling distribution is represented by $DCD$. We call $D$ the diagonal decoding matrix. We update $C$ with the original CMA, whereas $D$ is updated similarly to separable CMA. An important point is that we want to update $D$ faster than $C$, by setting higher learning rates for the $D$ update. However, when $C$ contains nonzero covariances, the update of $D$ can result in a drastic change of the sampling distribution and disturb the adaptation of $C$. To resolve this issue, we introduce an adaptive damping mechanism for the $D$ update, so that the difference (e.g., Kullback-Leibler divergence) between the distributions before and after the update remains sufficiently small. With this damping, $D$ is updated as fast as in separable CMA on a separable function if the correlation matrix of the sampling distribution is close to the identity, whereas the $D$ update is suppressed when the correlation matrix is away from the identity, that is, when variable dependencies have been learned.

The update of $D$ breaks the rotation invariance of the original CMA; hence, we lose the mathematical guarantee of the original CMA that it performs identically on all functions in an equivalence group defined by this invariance property. The ultimate aim of this work is to gain significant speed-up in some situations and to preserve the performance of the original CMA in the worst case. Functions where we expect the diagonally accelerated CMA to outperform the original one have variables with different sensitivities, that is, coordinate-wise ill-conditioning. Such functions may often appear in practice, since variables in a black-box continuous optimization problem can have distinct meanings. Diagonal acceleration, however, can even be superior to the optimal additive portfolio of the original CMA and the separable CMA. We demonstrate in numerical experiments that dd-CMA outperforms the original CMA not only on separable functions but also on nonseparable ill-conditioned functions with coordinate-wise ill-conditioning that separable CMA cannot efficiently solve.

The last contribution is a set of improved and simplified default parameter settings for the covariance matrix adaptation and for the cumulation factor of the so-called evolution path. These parameters, whose default values have previously been expressed by independent formulas, are reformulated so that their dependencies become clearer. The new default learning rates also improve the adaptation speed of the covariance matrix on high dimensional problems without compromising stability.

The rest of this article is organized as follows. We introduce the original and separable CMA-ES in Section 2. The active update of the covariance matrix with positive definiteness guarantee is proposed in Section 3. The adaptive diagonal decoding mechanism is introduced in Section 4. Section 5 is devoted to explaining the renewed default hyperparameter values for the covariance matrix adaptation, and provides an algorithm summary of dd-CMA-ES and a link to publicly available Python code. Numerical experiments are conducted in Section 6 to assess how effectively each component of CMA with diagonal acceleration works in different situations. We conclude the article in Section 7.

## 2 Evolution Strategy with Covariance Matrix Adaptation

We review the ($\mu_w$, $\lambda$)-CMA-ES, consisting of weighted recombination, cumulative step-size adaptation (CSA), and rank-one and rank-$\mu$ covariance matrix adaptation.

The CMA-ES maintains a multivariate normal distribution, $\mathcal{N}(m, \sigma^2 DCD)$, where $m \in \mathbb{R}^n$ is the mean vector that represents the center of the search distribution, $\sigma \in \mathbb{R}_{+}$ is the so-called step-size that represents the scaling factor of the distribution spread, and $C \in \mathbb{R}^{n \times n}$ is a positive definite symmetric matrix that represents the shape of the distribution ellipsoid. Though the covariance matrix of the sampling distribution is $\sigma^2 DCD$, we often call $C$ the covariance matrix in the context of CMA-ES. In this article, we adopt this terminology, and we will call $\sigma^2 DCD$ the covariance matrix of the sampling distribution to distinguish the two when necessary. The positive definite diagonal matrix $D \in \mathbb{R}^{n \times n}$ is regarded as a diagonal decoding matrix, which represents the scaling factor of each design variable. It is fixed during the optimization, usually $D = I$, and does not appear in the standard formulation. However, it plays an important role in this article, and we define the CMA-ES with $D$ in view of the later modification.
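To make the role of $D$ concrete, the sampling step might be sketched as follows (a minimal illustration of ours, not the paper's implementation; `d` holds the diagonal of $D$, and a Cholesky factor of $C$ is used for sampling):

```python
import numpy as np

def sample_population(m, sigma, d, C, lam, rng):
    """Draw lam candidate solutions from N(m, sigma^2 * D C D).

    d holds the diagonal of the decoding matrix D; C is the positive
    definite symmetric shape matrix.
    """
    A = np.linalg.cholesky(C)               # C = A A^T
    Z = rng.standard_normal((lam, len(m)))  # rows ~ N(0, I)
    Y = Z @ A.T                             # rows ~ N(0, C)
    return m + sigma * (d * Y)              # rows ~ N(m, sigma^2 D C D)

rng = np.random.default_rng(1)
X = sample_population(np.zeros(5), 0.5, np.ones(5), np.eye(5), 12, rng)
```

With $D = I$ and $C = I$, this reduces to sampling from $\mathcal{N}(m, \sigma^2 I)$.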

### 2.1 Original CMA-ES

The CMA-ES repeats the following steps until it meets one or more termination criteria:

1. Sample $\lambda$ candidate solutions, $x_i$, independently from $\mathcal{N}(m, \sigma^2 DCD)$;
2. Evaluate the candidate solutions on the objective, $f(x_i)$, and sort them in ascending order, $f(x_{1:\lambda}) \leq \cdots \leq f(x_{\lambda:\lambda})$, where $i:\lambda$ denotes the index of the $i$-th best candidate solution among $x_1, \dots, x_\lambda$;
3. Update the distribution parameters, $m$, $\sigma$, and $C$.
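These steps can be sketched as a toy generation loop (a hedged illustration of ours with a mean-only update and fixed step-size; the log-linear weights follow the usual CMA default, but the overall function is not the paper's algorithm):

```python
import numpy as np

def cma_generation(f, m, sigma, d, C, lam, rng):
    """One generation of the loop above: sample, evaluate and rank,
    then update (here: only the mean, by weighted recombination)."""
    n = len(m)
    A = np.linalg.cholesky(C)
    Z = rng.standard_normal((lam, n))
    X = m + sigma * (d * (Z @ A.T))            # step 1: sample
    idx = np.argsort([f(x) for x in X])        # step 2: rank, best first
    mu = lam // 2
    w = np.log((lam + 1) / 2) - np.log(np.arange(1, mu + 1))
    w /= w.sum()                               # positive weights sum to one
    return w @ X[idx[:mu]]                     # step 3: weighted recombination

rng = np.random.default_rng(0)
m = np.full(4, 3.0)
for _ in range(80):                            # optimize the sphere function
    m = cma_generation(lambda x: x @ x, m, 0.3, np.ones(4), np.eye(4), 8, rng)
```

Without step-size and covariance adaptation the mean only drifts toward the optimum down to the fixed sampling scale, which is why the full algorithm also updates $\sigma$ and $C$.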

#### 2.1.1 Sampling and Recombination

#### 2.1.2 Step-Size Adaptation


The log step-size is updated as

$$\ln \sigma^{(t+1)} = \ln \sigma^{(t)} + \frac{c_\sigma}{d_\sigma}\left(\frac{\lVert p_\sigma^{(t+1)}\rVert}{\mathrm{E}\lVert \mathcal{N}(0, I)\rVert} - 1\right),$$

where $p_\sigma$ is the evolution path for step-size adaptation, $c_\sigma$ is its cumulation factor, and $d_\sigma$ is the damping parameter.
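A hedged numeric sketch of the CSA log step-size update (the function name is ours; the expectation $\mathrm{E}\lVert\mathcal{N}(0,I)\rVert$ is replaced by its standard approximation):

```python
import numpy as np

def csa_log_sigma_update(sigma, p_sigma, c_sigma, d_sigma):
    """Hedged sketch of the CSA log step-size update: the step-size grows
    when the evolution path p_sigma is longer than expected under random
    selection, and shrinks when it is shorter."""
    n = len(p_sigma)
    # standard approximation of E||N(0, I)|| in dimension n
    chi_n = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))
    return sigma * np.exp((c_sigma / d_sigma)
                          * (np.linalg.norm(p_sigma) / chi_n - 1))
```

A path of exactly expected length leaves $\sigma$ unchanged; a longer path increases it.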

#### 2.1.3 Covariance Matrix Adaptation

### 2.2 Separable Covariance Matrix Adaptation

One advantage of the separable CMA is that all operations can be done in linear time per $f$-call. Therefore, it is promising if $f$-calls are cheap. The other advantage is that one can set the learning rates greater than those used for the $C$ update, since the degrees of freedom of the covariance matrix of the sampling distribution is $n$, rather than $n(n+1)/2$. The adaptation of the covariance matrix is hence faster than for the original CMA. However, if the problem is nonseparable and has strong variable dependencies, adapting the coordinate-wise scaling is not enough to make the search efficient. More precisely, if the inverse Hessian of the objective function, $\mathrm{Hess}(f)^{-1}$, is not well approximated by a diagonal matrix, the convergence speed will be $O(1/\mathrm{Cond}(D^2\,\mathrm{Hess}(f)))$, which is empirically observed on convex quadratic functions as well as theoretically deduced in Akimoto et al. (2018). In practice, it is rarely known in advance whether the separable CMA is appropriate or not. Ros and Hansen (2008) propose to run the separable CMA for $100n$ function evaluations and then switch to the original CMA. Such an algorithm has been benchmarked in Ros (2009), where the first $1 + 100n/\lambda$ iterations use the separable CMA.
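As an illustration of the linear-time property, a rank-$\mu$-style variance update that touches only the $n$ diagonal entries might look as follows (a hedged sketch of ours; the function name, signature, and decay form are illustrative, not the paper's exact separable CMA update):

```python
import numpy as np

def sep_cma_rank_mu_update(d2, Y, w, c_mu_sep):
    """Hedged sketch of a separable rank-mu step: only the n coordinate-wise
    variances d2 are adapted, at O(mu * n) cost per update.

    Y: (mu, n) array of selected mutation steps (x_i - m) / sigma.
    """
    return (1 - c_mu_sep * w.sum()) * d2 + c_mu_sep * (w @ Y**2)
```

No matrix decomposition is needed, which is what keeps the per-iteration cost linear in $n$.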

## 3 Active Covariance Matrix Update with Guarantee of Positive Definiteness


The components of the covariance matrix adaptation (rank-one update, rank-$\mu$ update, and active update) produce complementary effects. The rank-one update of the covariance matrix (the second term on the RHS of Eq. (14); Hansen and Ostermeier, 2001) accumulates the successive steps of the distribution mean and increases the eigenvalues of the covariance matrix in the direction of the successive movements. It excels at learning one long axis of the covariance matrix, and is effective on functions with a subspace of relatively small dimension where the function value is less sensitive than in its orthogonal subspace. Figure 1a visualizes an example of such a function. See also Figure 7 in Hansen and Auger (2014) for numerical results. On the other hand, since the update is always of rank one, the learning rate $c_1$ needs to be sufficiently small to keep the covariance matrix regular and stable.

The rank-$\mu$ update (the third term on the RHS of Eq. (14) with positive $w_i$; Hansen et al., 2003) utilizes the information of $\mu$ successful candidate solutions in a different way than the rank-one update. It computes the empirical covariance matrix of successful mutation steps $y_i$. The update matrix is, with probability one, of rank $\min(n, \mu)$, allowing a relatively large learning rate $c_\mu$. This reduces the number of iterations needed to adapt the covariance matrix when $\lambda \gg 1$.

Both the rank-one and rank-$\mu$ updates try to increase the variances in the subspace of successful mutation steps. The eigenvalues corresponding to unsuccessful mutation steps only passively fade out. The active update actively decreases such eigenvalues. It consistently accelerates covariance matrix adaptation, and the improvement is particularly pronounced on functions where the Hessian of the objective function has a small number of dominating eigenvalues. Figure 1c shows an example of such a function.

A disadvantage of the active update with negative recombination weights such as Eq. (13) is that positive definiteness of the covariance matrix is no longer guaranteed. It is easy to see that the rank-one and rank-$\mu$ updates of CMA in Eq. (8) guarantee that the minimal eigenvalue of $C^{(t+1)}$ is no smaller than the minimal eigenvalue of $C^{(t)}$ times $1 - c_1 - c_\mu \sum_{i=1}^{\mu} w_i$, since the second and third terms only increase the eigenvalues. However, the introduction of negative recombination weights can violate positive definiteness because the third term may decrease the minimum eigenvalue arbitrarily. Jastrebski and Arnold (2006) set a sufficiently small learning rate for the active update; that is, the absolute values of the negative recombination weights sum up to a value smaller than one. This prevents the covariance matrix from becoming non-positive-definite with high probability, but it does not *guarantee* positive definiteness. Moreover, it becomes ineffective when the population size is relatively large and a greater learning rate is desired.
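The eigenvalue bound for nonnegative weights can be checked numerically. The following sketch (with illustrative numbers of ours, not the paper's defaults) verifies that the rank-one and rank-$\mu$ terms, being positive semidefinite, cannot push the smallest eigenvalue below the stated factor:

```python
import numpy as np

# Numerical check of the eigenvalue bound: with nonnegative weights, the
# rank-one and rank-mu terms are positive semidefinite, so the smallest
# eigenvalue of C(t+1) is at least (1 - c1 - c_mu * sum(w)) times the
# smallest eigenvalue of C(t).  All numbers below are illustrative.
rng = np.random.default_rng(2)
n, mu = 5, 3
C = np.eye(n)
c1, c_mu = 0.1, 0.2
w = np.array([0.5, 0.3, 0.2])             # nonnegative recombination weights
Y = rng.standard_normal((mu, n))          # selected mutation steps
pc = rng.standard_normal(n)               # evolution path

C_next = ((1 - c1 - c_mu * w.sum()) * C
          + c1 * np.outer(pc, pc)         # rank-one update
          + c_mu * (Y.T * w) @ Y)         # rank-mu update

lower_bound = (1 - c1 - c_mu * w.sum()) * np.linalg.eigvalsh(C).min()
```

With a negative weight in `w`, the third term is no longer positive semidefinite and the bound can fail, which is exactly the problem addressed in the following subsections.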


### 3.1 Method 1: Scaling Down the Update Factor

for any two square matrices $A$ and $B$. Moreover, $d_n(I + \alpha^{(t)} Z^{(t)}) = 1 + \alpha^{(t)} d_n(Z^{(t)})$. Therefore, if $\alpha^{(t)} < 1/\lvert d_n(Z^{(t)})\rvert$, the resulting covariance matrix is guaranteed to be positive definite.
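A hedged sketch of this safeguard (the function name and the safety margin below the theoretical limit $1/\lvert d_n(Z)\rvert$ are our illustration):

```python
import numpy as np

def pd_safe_alpha(Z, margin=0.7):
    """Hedged sketch of Method 1: scale the update factor alpha so that
    I + alpha * Z stays positive definite.  d_n(Z) is the smallest
    eigenvalue of the symmetric update matrix Z; margin < 1 keeps a
    safety gap below the theoretical limit 1 / |d_n(Z)|."""
    d_n = np.linalg.eigvalsh(Z).min()
    if d_n >= 0:
        return 1.0          # the update cannot break positive definiteness
    return min(1.0, margin / abs(d_n))
```

For `Z = diag(-2, 1)` this returns `0.35`, and `I + 0.35 * Z = diag(0.3, 1.35)` is indeed positive definite.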

### 3.2 Method 2: Scaling Down Negative Weights

An alternative way to guarantee the positive definiteness of the covariance matrix is to scale down the sum of the absolute values of the negative weights, combined with the projection of unpromising solutions introduced above. We use the same update formula (21), but with $\alpha^{(t)} = 1$.

The positive definiteness of the covariance matrix is guaranteed under the condition $1/t_{\mathrm{eig}} > c_1 + c_\mu \sum_{i=1}^{\lambda} w_i + n c_\mu \sum_{i: w_i < 0} \lvert w_i \rvert$. More precisely, we have the following claim.

If $1/t_{\mathrm{eig}} > c_1 + c_\mu \sum_{i=1}^{\lambda} w_i + n c_\mu \sum_{i: w_i < 0} \lvert w_i \rvert$, then $C^{(t+t_{\mathrm{eig}})} \succeq \Bigl(1 - t_{\mathrm{eig}} \Bigl(c_1 + c_\mu \sum_{i=1}^{\lambda} w_i + n c_\mu \sum_{i: w_i < 0} \lvert w_i \rvert\Bigr)\Bigr) C^{(t)}$.

This method is simpler than the method described in Section 3.1, since it does not require an additional eigenvalue computation. However, to guarantee positive definiteness in this way, we bound the unrealistic worst case where all unpromising candidate solutions are sampled on a line. Therefore, the scaling-down factor is set much smaller than the value actually needed in practice. Figure 2 visualizes the correction factor $\alpha$ under our choice of the default pre-weights and learning rates described in Section 5. The sum of the negative recombination weights needs to decrease as the population size increases. For $n \geq 320$, the factor is less than 0.1 for $\lambda \geq n^{1.5}$. Unless the internal computational time becomes a critical bottleneck, we prefer the method described in Section 3.1 over the method in Section 3.2.
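The required scaling of the negative weights can be derived directly from the condition above. The following sketch (the function name and the 0.9 safety margin are our illustration) computes a factor that restores the guarantee:

```python
import numpy as np

def neg_weight_scale(n, t_eig, c1, c_mu, w):
    """Hedged sketch of Method 2: factor by which the negative weights are
    scaled down so that 1/t_eig > c1 + c_mu*sum(w) + n*c_mu*sum(|w_neg|)
    holds, guaranteeing positive definiteness over t_eig iterations.
    The 0.9 safety margin is illustrative."""
    pos = w[w > 0].sum()
    neg = np.abs(w[w < 0]).sum()
    # after scaling by alpha: c1 + c_mu*(pos - alpha*neg) + n*c_mu*alpha*neg
    # must stay below 1/t_eig  =>  alpha < budget / (c_mu * neg * (n - 1))
    budget = 1 / t_eig - c1 - c_mu * pos
    return min(1.0, 0.9 * budget / (c_mu * neg * (n - 1)))
```

The factor shrinks roughly like $1/n$ for fixed weights, matching the observation that the sum of the negative weights must decrease for large $n$ and $\lambda$.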

### 3.3 Choice of Recombination Weights


The positive recombination weights are unchanged from the default settings without active CMA. They are approximately proportional to the positive half of the optimal recombination weights. However, this is not the case for the negative recombination weights. The optimal recombination weights are symmetric about zero, that is, $w_i = -w_{\lambda - i + 1}$, whereas our negative weights tend to level out, for the following reasons. The above-mentioned quality gain results consider only the mean update, and the obtained optimal values are not necessarily optimal for the covariance matrix adaptation. Since we use only positive recombination weights for the mean vector update, the negative weights do not need to correspond to these optimal recombination weights. Furthermore, the optimal negative weights (greater absolute values for worse steps) counteract our motivation for rescaling unpromising steps discussed in Section 3.1. Our choice of the negative recombination weights is a natural extension of the default positive recombination weights. The shape of our recombination weights is somewhat similar to the one in xCMA-ES (Krause and Glasmachers, 2015), where the first half is $w_i - 1/\lambda$ and the last half is $-1/\lambda$. A minor difference is that xCMA-ES assigns negative values even to some of the better half of the candidate solutions, whereas our setting assigns positive values only to the better half and negative values only to the worse half.

Positive and negative recombination weights are scaled differently. Positive weights are scaled to sum to one, whereas negative weights are scaled so that the sum of their absolute values is the minimum of $1 + c_1/c_\mu$ and $1 + \frac{2\mu_w^-}{\mu_w + 2}$. The latter corrects for unbalanced positive and negative weights, but is usually greater than the former with our default values. The former makes the decay factor $1 - c_1 - c_\mu \sum_{i=1}^{\lambda} w_i$ of the covariance matrix update (see Eq. (14)) equal to one. That is, the covariance matrix does not passively shrink, and only the negative update can shrink the covariance matrix. Krause and Glasmachers (2015) also aim at avoiding passive shrinkage by setting the sum of the weights to zero, but the effect of the rank-one update was not taken into account; thus, in xCMA a passive decay factor of $1 - c_1$ remains.
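A hedged sketch of this weight construction (the log-linear pre-weights follow the usual CMA default; the function signature is our illustration):

```python
import numpy as np

def recombination_weights(lam, c1, c_mu):
    """Hedged sketch of the weight scheme described above.  Pre-weights
    w'_i = ln((lam+1)/2) - ln(i); the positive half is normalized to sum
    to one, the negative half is scaled so that the sum of its absolute
    values equals min(1 + c1/c_mu, 1 + 2*mu_w_neg/(mu_w + 2))."""
    w = np.log((lam + 1) / 2) - np.log(np.arange(1, lam + 1))
    pos, neg = w[w > 0], w[w < 0]
    mu_w = pos.sum() ** 2 / (pos ** 2).sum()       # variance effective mass
    mu_w_neg = neg.sum() ** 2 / (neg ** 2).sum()
    target = min(1 + c1 / c_mu, 1 + 2 * mu_w_neg / (mu_w + 2))
    w[w > 0] = pos / pos.sum()
    w[w < 0] = target * neg / np.abs(neg).sum()    # abs values sum to target
    return w
```

Better solutions receive positive weights, worse solutions negative ones, and the pre-weights level out toward the middle ranks rather than mirroring the positive half.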

## 4 Adaptive Diagonal Decoding (dd-CMA)

The separable CMA-ES (Ros and Hansen, 2008) adapts a diagonal covariance matrix faster than the standard CMA-ES because its learning rates are $O(n)$ times greater. This works well on separable objective functions, where separable CMA-ES adapts the (co)variance matrix by a factor of $O(n)$ faster than CMA-ES. To accelerate standard CMA, we combine separable CMA and standard CMA. We adapt both $D$ and $C$ at the same time, where $D$ is updated with *adaptive* learning rates that can be much greater than those used to update $C$.

### 4.1 Algorithm

Importantly, we set the learning rates for the $D$ update, $c_{1,D}$ and $c_{\mu,D}$, to be about $n$ times greater than the learning rates for the $C$ update, $c_1$ and $c_\mu$. Moreover, we maintain an additional evolution path, $p_{c,D}$, for the rank-one update of $D$, since an adequate cumulation factor, $c_{c,D}$, may be different from the one for $p_c$.

### 4.2 Damping Factor

The *dynamic* damping factor $\beta^{(t)}$ is the crucial novelty that prevents diagonal decoding from disrupting the adaptation of the covariance matrix $C$ in CMA-ES. The damping factor is introduced to prevent the sampling distribution from changing drastically due to the diagonal matrix update. Figure 3 visualizes three example cases with different $C$. When the diagonal decoding matrix changes from the identity matrix to $D$, the change of the distribution from $\mathcal{N}(0, C)$ to $\mathcal{N}(0, DCD)$ is minimal when $C$ is diagonal, and is comparatively large if $C$ is nondiagonal. The change grows with the condition number of the correlation matrix $\mathrm{corr}(C)$. In the worst case in Figure 3, the two distributions overlap only near the center of the distribution.
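This effect can be quantified with the Kullback-Leibler divergence. The following illustrative 2-D computation (function name and numbers are ours) shows that the same change of $D$ causes a much larger divergence when $C$ carries strong correlations:

```python
import numpy as np

def kl_gauss(C0, C1):
    """KL divergence KL(N(0, C0) || N(0, C1)) between centered Gaussians."""
    n = C0.shape[0]
    return 0.5 * (np.trace(np.linalg.inv(C1) @ C0) - n
                  + np.log(np.linalg.det(C1) / np.linalg.det(C0)))

# The same decoding change D is applied to an uncorrelated and to a
# strongly correlated C; the divergence grows with the correlation.
D = np.diag([2.0, 0.5])

def change(rho):
    C = np.array([[1.0, rho], [rho, 1.0]])
    return kl_gauss(C, D @ C @ D)
```

Here `change(0.9)` is several times larger than `change(0.0)`, mirroring the worst case sketched in Figure 3.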


In a nutshell, we quantify the distribution change induced by changing $D$ by measuring its KL divergence. We show that $\beta$ in Eq. (31) bounds the KL divergence due to a change of $D$ approximately by the KL divergence that the same change of $D$ would cause when $C = I$. The latter is directly determined by the learning rates $c_{1,D}$ and $c_{\mu,D}$.

### 4.3 Implementation Remark

Note that $DCD$ is not a *unique* parametrization for the covariance matrix.

The eigendecomposition of $C^{(t)}$ is performed every $t_{\mathrm{eig}}$ iterations. We keep $\sqrt{C}$ and $\sqrt{C}^{-1}$ until the next matrix decomposition is performed. Suppose that the eigendecomposition $C^{(t)} = E \Lambda E^{\mathsf{T}}$ is performed at iteration $t$. Then, the above matrices are $\sqrt{C} = E \Lambda^{\frac{1}{2}} E^{\mathsf{T}}$ and $\sqrt{C}^{-1} = E \Lambda^{-\frac{1}{2}} E^{\mathsf{T}}$, and we can compute $\beta$ in Eq. (31) as $\beta = \max\bigl(1, \bigl(\max_i [\Lambda]_{i,i} / \min_i [\Lambda]_{i,i}\bigr)^{\frac{1}{2}} - \beta_{\mathrm{thresh}} + 1\bigr)$. This $\beta$ is kept until the next eigendecomposition is performed. The additional computational cost of adaptive diagonal decoding is then $O(n^2/t_{\mathrm{eig}} + \lambda n)$ per iteration, which is smaller than the computational cost of the other parts of the CMA-ES, $O(n^3/t_{\mathrm{eig}} + \lambda n^2)$ per iteration.
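A hedged sketch of this computation, reusing the eigenvalues from the periodic decomposition (the function name and the threshold value are illustrative, not necessarily the default):

```python
import numpy as np

def damping_beta(Lambda, beta_thresh=2.0):
    """Hedged sketch of the beta formula above, given the eigenvalues
    Lambda of C cached from the periodic eigendecomposition."""
    cond = Lambda.max() / Lambda.min()
    return max(1.0, cond ** 0.5 - beta_thresh + 1.0)
```

For a well-conditioned $C$ the damping is inactive ($\beta = 1$, full-speed $D$ update), while a large condition number slows the $D$ update down proportionally.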

One might think that maintaining $D$ is unnecessary and that $C$ could instead be updated by pre- and post-multiplying with a diagonal matrix that absorbs the effect of the $D$ update. However, because diagonal decoding may change the distribution shape quickly, the decomposition of $C$ would then need to be done every iteration. The chosen parametrization avoids frequent matrix decompositions while still allowing the sampling distribution to change rapidly.

## 5 Algorithm Summary and Strategy Parameters

The dd-CMA-ES combines weighted recombination, active covariance matrix update with positive definiteness guarantee (described in Section 3.1), adaptive diagonal decoding (described in Section 4), and CSA, and is summarized in Algorithm 1.

Providing good default values for the strategy parameters (aka hyperparameters) is, needless to say, essential for the success of a novel algorithm. Especially in a black-box optimization scenario, parameter tuning for an application relies on trial and error and is usually prohibitively time consuming. The computation of the default parameter values is summarized in Algorithm 2. Since we have $c_1/c_\mu = c_{1,D}/c_{\mu,D}$ as long as $c_1 + c_\mu < 1$ (see below), the recombination weights for the update of $C$ and of $D$ are the same, $w_i = w_{i,D}$.

The first important change is the scaling of the learning rates with the dimension $n$. In previous works (Ros and Hansen, 2008; Hansen and Auger, 2014; Akimoto and Hansen, 2016), the default learning rates were set inversely proportional to the degrees of freedom $m$ of the covariance model; that is, $\Theta(n^{-2})$ for the original CMA and $\Theta(n^{-1})$ for the separable CMA. Our choice is based on empirical observations that $\Theta(n^{-2})$ is exceedingly conservative for higher dimensional problems, say $n > 100$: in experiments to identify $c_\mu$ for the original CMA-ES, we vary $c_\mu$ and investigate the number of evaluations to reach a given target. We find that the location of the minimum and the shape of the dependency remain roughly the same over a wide range of dimensions when the evaluations are plotted against $c_\mu m / n^{0.35}$ (see also Figure 5). This observation suggests in particular that $c_\mu = \Theta(n/m)$ is likely to fail with increasing dimension, whereas $c_\mu = \Theta(n^{0.35}/m)$ is a stable setting, such that the new setting of $\Theta(n^{0.25}/m)$ (replacing $\Theta(n^{0}/m)$) is still sufficiently conservative. Figure 4 depicts the difference between the default learning rate values in Hansen and Auger (2014) and the above formulas.
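A hedged sketch of how such a scaling might be coded (the constant, the $\mu_w$ factor, and the function name are our illustration, not the paper's exact defaults):

```python
def c_mu_sketch(n, mu_w, separable=False):
    """Hedged sketch of the revised scaling Theta(n^0.25 / m), where m is
    the degrees of freedom of the covariance model (n*(n+1)/2 for the full
    matrix, n for separable CMA).  Constants are illustrative only."""
    m = n if separable else n * (n + 1) / 2
    return min(1.0, mu_w * n ** 0.25 / m)
```

The rate still decreases with the dimension, but more slowly than the previous $\Theta(1/m)$ scaling, and the separable model retains its $O(n)$ advantage.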

Another important change in the default parameter values from previous studies (Ros and Hansen, 2008; Hansen and Auger, 2014) is in the cumulation factor $c_c$. Previously, the typical choices were $c_c = \frac{4}{n+4}$ or $c_c = \frac{4 + \mu_w/n}{n + 4 + 2\mu_w/n}$. The new and simpler default value for the cumulation factor is motivated by an experiment on the Cigar function, investigating the dependency of $c_c$ on $c_1$. The effect of the rank-one update of the covariance matrix $C$ is most pronounced when the covariance matrix needs to increase a single eigenvalue. To skip over the usual initial phase, where $\sigma$ decreases without changing the covariance matrix, and to emphasize the effect of $c_c$, the mean vector is initialized at $m^{(0)} = (1, 0, \dots, 0)$ and $\sigma^{(0)} = 10^{-6}$. The rank-$\mu$ update is switched off; that is, $c_\mu = c_{\mu,D} = 0$. We compare the number of $f$-calls until the target function value of $10^{-2}$ is reached. Note that $\sigma^{(0)}$ is chosen to avoid $\sigma$ adaptation at the beginning, and the target value is chosen so that we stop the optimization once the covariance matrix has learned the right scaling. Figure 5 compares the number of $f$-calls spent by the original CMA and the separable CMA on the 20 dimensional Cigar function. Because the $x$-axis is normalized by the new default value for $c_c$, the shape of the graph and the location of the minimum remain roughly the same for different values of $c_1$ and $\lambda$. The adaptation speed of the covariance matrix tends to be best around $c_c = \sqrt{\mu_w c_1}/2$ or $\sqrt{\mu_w c_{1,D}}/2$. The minimum is more pronounced for smaller $c_1$ (or $c_{1,D}$). Nevertheless, we observe little improvement with the new setting in practice, since more $f$-calls are usually spent to adapt the overall covariance matrix and to adapt the step-size to converge. We have done the same experiments in dimension 80 and observe the same trend.

None of the strategy parameters are meant to be tuned by users, except the population size $\lambda$. A larger population size typically results in finding a better local minimum on rugged functions, such as multimodal or noisy functions (Hansen and Kern, 2004). It is also advantageous when the objective function values are evaluated in parallel (Hansen et al., 2003). Restart strategies that perform independent restarts with increasing population size, such as the IPOP strategy (Harik and Lobo, 1999; Auger and Hansen, 2005b), the BIPOP strategy (Hansen, 2009), or the NIPOP strategy (Loshchilov et al., 2017), are useful policies that automate this parameter tuning. They can be applied to the CMA-ES with diagonal acceleration in a straightforward way. We leave research along this line to future work.

## 6 Experiments

We conduct numerical simulations to see the effect of the algorithmic components of CMA-ES with diagonal acceleration, dd-CMA-ES, namely the active update with positive definiteness guarantee and the adaptive diagonal decoding. Since the effect of the active covariance matrix update with default population size has been investigated by Jastrebski and Arnold (2006) and our contribution is a positive definiteness guarantee for the covariance matrix especially for large population size, we investigate how the effect of the active update scales as the population size increases. Our main focus is, however, on the effect of adaptive diagonal decoding. Particularly, we investigate (i) how much better dd-CMA scales on separable functions than the CMA without diagonal decoding (plain CMA), (ii) how closely dd-CMA matches the performance of separable CMA on separable functions, and (iii) how the scaling of dd-CMA compares to the plain CMA on various nonseparable functions. Moreover, we investigate how effective dd-CMA is when the population size is increased.

### 6.1 Common Setting

Table 1 summarizes the test functions used in the following experiments together with the initial $m^{(0)}$ and $\sigma^{(0)}$. The initial covariance matrix $C^{(0)}$ and the diagonal decoding matrix $D^{(0)}$ are always set to the identity matrix. The random vectors and the random matrix appearing in Table 1 are initialized randomly for each problem instance, but the same values are used for different algorithms for a fair comparison. A trial is considered a success if the target function value of $10^{-8}$ is reached before the algorithm spends $5\times 10^{4} n$ function evaluations; otherwise it is regarded as a failure. For $n\leqslant 40$ we conducted 20 independent trials for each setting, 10 for $80\leqslant n\leqslant 320$, and 3 for $n\geqslant 640$. When the computational effort of evaluating the test function scales worse than linearly with the dimension (i.e., on rotated functions), we may omit dimensions larger than 320.

| | $f(x)$ | $m^{(0)}$ | $\sigma^{(0)}$ |
|---|---|---|---|
| Sphere | $\lVert x\rVert^{2}$ | $3\cdot\mathbf{1}$ | 1 |
| Cigar | $\langle u,x\rangle^{2}+10^{6}(\lVert x\rVert^{2}-\langle u,x\rangle^{2})$ | $3\cdot\mathbf{1}$ | 1 |
| Discus | $10^{6}\langle u,x\rangle^{2}+(\lVert x\rVert^{2}-\langle u,x\rangle^{2})$ | $3\cdot\mathbf{1}$ | 1 |
| Ellipsoid | $\lVert D_{\mathrm{ell}}^{3}z\rVert^{2}$ | $3\cdot\mathbf{1}$ | 1 |
| TwoAxes | $10^{6}\sum_{i=1}^{n/2}[z]_{i}^{2}+\sum_{i=n/2+1}^{n}[z]_{i}^{2}$ | $3\cdot\mathbf{1}$ | 1 |
| Ell-Cig | $10^{-4}\langle u,y\rangle^{2}+(\lVert y\rVert^{2}-\langle u,y\rangle^{2})$ | $3\cdot\mathbf{1}$ | 1 |
| Ell-Dis | $10^{4}\langle u,y\rangle^{2}+(\lVert y\rVert^{2}-\langle u,y\rangle^{2})$ | $3\cdot\mathbf{1}$ | 1 |
| Rosenbrock | $\sum_{i=1}^{n-1}100([z]_{i}^{2}-[z]_{i+1})^{2}+([z]_{i}-1)^{2}$ | $0$ | 0.1 |
| Bohachevsky | $\sum_{i=1}^{n-1}([z]_{i}^{2}+2[z]_{i+1}^{2}-0.3\cos(3\pi[z]_{i})-0.4\cos(4\pi[z]_{i+1})+0.7)$ | $\mathcal{N}(0,8^{2}I)$ | 7 |
| Rastrigin | $\sum_{i=1}^{n}[z]_{i}^{2}+10(1-\cos(2\pi[z]_{i}))$ | $\mathcal{N}(0,3^{2}I)$ | 2 |

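For illustration, a few of the test functions in Table 1 might be sketched in Python as follows. This is a minimal sketch, not the article's reference implementation: the exponential spacing of the diagonal of $D_{\mathrm{ell}}$ and the treatment of $u$ as a fixed unit vector are assumptions.

```python
import numpy as np

def sphere(x):
    """Sphere: ||x||^2."""
    return float(np.sum(x ** 2))

def cigar(x, u):
    """Cigar with long axis along the unit vector u:
    <u,x>^2 + 1e6 (||x||^2 - <u,x>^2)."""
    s = float(np.dot(u, x)) ** 2
    return s + 1e6 * (float(np.sum(x ** 2)) - s)

def ellipsoid(x, R=None):
    """Ellipsoid ||D_ell^3 z||^2 with z = R x (R = I if None).
    Assumption: D_ell scales axis i by 10^{(i-1)/(n-1)}, n >= 2,
    a common convention giving coefficients 10^{6(i-1)/(n-1)}."""
    n = len(x)
    z = x if R is None else R @ x
    d = 10.0 ** (np.arange(n) / (n - 1))  # diagonal of D_ell
    return float(np.sum((d ** 3 * z) ** 2))
```

The rotated variants are obtained by passing a random orthogonal matrix `R`; for the separable instances, `R` is the identity.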

We compare the following CMA variants:

- **Plain CMA** updates $C$ while $D$ is kept the identity matrix, with or without the active update described in Section 3 (Algorithm 1 without the $D$-update);
- **Separable CMA** updates $D$ as described in Section 4 while $C$ is kept the identity matrix, with or without the active update, where negative recombination weights are set to zero (Algorithm 1 without the $C$-update);
- **dd-CMA**, as summarized in Algorithm 1, updates $C$ as described in Section 2.1 with the active update described in Section 4, and updates $D$ as described in Section 4.

### 6.2 Active CMA

First, the effect of the active update is investigated. The plain and separable CMA with and without active update are compared.

Figure 6 shows the number of iterations spent by each algorithm plotted against population size $\lambda $. The number of iterations to reach the target function value decreases as $\lambda $ increases and tends to level out. The plain CMA is consistently faster with active update than without. As expected from the results of Jastrebski and Arnold (2006), the effect of active covariance matrix update is most pronounced on functions with a small number of sensitive variables such as the Discus function. The advantage of the active update diminishes as $\lambda $ increases in plain CMA, whereas the speed-up in separable CMA becomes even slightly more pronounced.

### 6.3 Adaptive Diagonal Decoding

Next, we compare dd-CMA with the plain and the separable CMA with active update.

Figure 7 compares dd-CMA with the plain CMA on the rotated Cigar, Ellipsoid, and Discus functions, and with the separable CMA on the separable Cigar, Ellipsoid, and Discus functions. Note that the plain CMA scales equally well on separable and rotated functions, while the separable CMA does not reach the target function value on rotated functions before the maximum budget is exhausted (Ros and Hansen, 2008). Displayed are the numbers of function evaluations to reach the target function value on each problem instance. The results of dd-, plain, and separable CMA on the Sphere function are displayed for reference.

On the Sphere function, no covariance matrix adaptation is required, and all algorithms perform quite similarly. On the Cigar function, the rank-one covariance matrix update is known to be rather effective, and even the plain CMA scales linearly on the rotated Cigar function. On both the separable and the rotated Cigar function, dd-CMA is competitive with separable and plain CMA, respectively, thereby combining the better performance of the two.

The discrepancy between plain and separable CMA is much more pronounced on the Ellipsoid and Discus functions, where the plain CMA scales superlinearly, whereas the separable CMA exhibits linear or slightly worse than linear scaling on separable instances and fails to find the optimum within the given budget on rotated instances. On the rotated functions, dd-CMA is competitive with the plain CMA, whereas on the separable functions it significantly reduces the number of function evaluations compared to plain CMA; for example, ten times fewer $f$-calls are required on the 160-dimensional separable Ellipsoid function.

The dd-CMA is competitive with separable CMA on the separable Discus function up to dimension 5120, which is the ideal situation, since we cannot expect faster adaptation of the distribution shape on separable functions. On the separable Ellipsoid function, the performance curve of dd-CMA starts deviating from that of the separable CMA around $n=320$ and scales more or less the same as the plain CMA afterwards, yet it requires ten times fewer $f$-calls. To improve the scaling of dd-CMA on the separable Ellipsoid function, we might need to set $\beta_{\mathrm{thresh}}$ depending on $n$, or use another monotone transformation of the condition number of the correlation matrix in Eq. (31). Performing the eigen decomposition of $C$ less frequently (i.e., setting $\beta_{\mathrm{eig}}$ smaller) is another possible way to improve the performance of dd-CMA on the separable Ellipsoid function, while compromising the performance on the rotated Ellipsoid function.^{14}

Figures 8a and 8b show the results on the Ell-Cig and Ell-Dis functions, which are nonseparable ill-conditioned functions with additional coordinate-wise scaling; Figure 8c shows the results on the nonrotated and rotated Rosenbrock functions. The separable CMA locates the target $f$-value within the given budget only on the nonrotated Rosenbrock function in smaller dimensions. The experiment on the Ell-Cig function reveals that diagonal decoding can improve the scaling of the plain CMA *even on nonseparable functions*. The improvement on the Ell-Dis function is much less pronounced. The reason why, compared to the plain CMA, dd-CMA is more suitable for Ell-Cig than for Ell-Dis is discussed in relation to Figure 9. On the nonrotated Rosenbrock function, dd-CMA becomes advantageous as the dimension increases, just as the separable CMA is faster than the plain CMA when $n>80$.

Figure 9 provides insight into the typical behavior of dd-CMA on different functions. On the separable Ellipsoid, the correlation matrix deviates from the identity matrix due to stochastic fluctuations and the initial parameter setting, and the inverse damping decreases marginally. This effect is more pronounced for $n>320$ and is the reason for the impaired scaling of dd-CMA on the separable Ellipsoid in Figure 7d. The diagonal decoding matrix, that is, the variance matrix, adapts the coordinate-wise scaling efficiently as long as the condition number of the correlation matrix $C$ is not too high. On the rotated Ellipsoid function, where the diagonal elements of the inverse Hessian of the objective function, $\frac{1}{2}R^{\mathrm T}D_{\mathrm{ell}}^{-6}R$, likely have similar values, the diagonal decoding matrix remains close to the identity and the correlation matrix learns the ill-conditioning. The inverse damping parameter decreases so that the adaptation of the diagonal decoding matrix does not disturb the adaptation of the distribution shape. Figure 9b shows that adaptive diagonal decoding without the dynamic damping factor of Eq. (31) severely disturbs the adaptation of $C$ on the rotated Ellipsoid.

Ideally, the diagonal decoding matrix $D$ adapts the coordinate-wise standard deviations and $C$ adapts the correlation matrix of the sampling distribution. If the correlation matrix $C$ is easier to adapt than the full covariance matrix $DCD$, we expect a speed-up of the adaptation of the distribution shape. The functions Ell-Cig and Ell-Dis in Figures 9c and 9d are such examples. Once $D$ becomes inversely proportional to $D_{\mathrm{ell}}^{2}$, the problems become rotated Cigar and rotated Discus functions, respectively. The run on the Ell-Cig function in Figure 9c depicts the ideal case. The coordinate-wise scaling is first adapted by $D$, and then the correlation matrix learns a long Cigar axis. The inverse damping factor is kept at a high value similar to the one observed on the separable Ellipsoid, and it starts decreasing only after $D$ is adapted. On the Ell-Dis function (Figure 9d), however, the short noncoordinate axis is learned (too) quickly by the correlation matrix. Therefore, the inverse damping factor decreases before $D$ has adapted the full coordinate-wise scaling. Since the function value is more sensitive in the subspace corresponding to the large eigenvalues of the Hessian of the objective, and the ranking of candidate solutions is mostly determined by this subspace, the CMA-ES first shrinks the distribution in this subspace. This is the main reason why dd-CMA is more efficient on Ell-Cig than on Ell-Dis.

Figure 10 shows, similar to Figure 6, the number of iterations versus the population size $\lambda$ on three functions in dimension 40. For all larger population sizes up to 13312, dd-CMA performs on par with the better of plain and separable CMA, as ideally to be expected.

We remark that dd-CMA with the default $\lambda=13$ has a relatively large variance in performance on the separable TwoAxes function. Increasing $\lambda$ from its default value reduces this variance and even reduces the number of $f$-calls to reach the target. This defect of dd-CMA is even more pronounced in higher dimensions and may suggest increasing the default $\lambda$ for dd-CMA. We leave this to be addressed in future work.

Figure 11 shows the number of iterations in successful runs, the success rate, and the average number of evaluations in successful runs divided by the success rate (Price, 1997; Auger and Hansen, 2005a) on two multimodal 40-dimensional functions. The maximum numbers of $f$-calls are $5n\times 10^{4}$ on Bohachevsky and $2n\times 10^{5}$ on Rastrigin. Comparing separable to dd-CMA on separable functions and plain to dd-CMA on rotated functions, dd-CMA tends to spend fewer iterations but to require a larger $\lambda$ to reach the same success rate. The latter is attributed to doubly shrinking the overall variance by updating $C$ and $D$. For $\lambda\gg n$, we have $c_\mu\approx 1$ and $c_{\mu,D}\approx 1$, and the effect of shrinking the overall variance due to the $C$ and $D$ updates is not negligible. Then, the overall variance is smaller than in plain and separable CMA, which also leads to faster convergence. This effect disappears if we force the $D$ update to keep its determinant unchanged, which can be realized by replacing $\Delta_D$ in Eq. (29) with $\Delta_D-(\operatorname{Tr}(\Delta_D)/n)I$.
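The trace correction can be illustrated with a small sketch. This is a sketch under the assumption of a multiplicative, exponential form of the diagonal update, $D \leftarrow D\exp(\Delta_D/2)$, standing in for Eq. (29) of the article; the key fact is $\det(\exp(M))=\exp(\operatorname{Tr}(M))$, so a trace-free exponent leaves the determinant unchanged.

```python
import numpy as np

def trace_corrected(delta_d):
    """Subtract the mean of the diagonal so that Tr(Delta_D) = 0."""
    n = len(delta_d)
    return delta_d - np.sum(delta_d) / n

# Hypothetical raw update of the diagonal of D (illustrative values).
delta = np.array([0.4, -0.1, 0.3])
corrected = trace_corrected(delta)

d_old = np.ones(3)                    # diagonal of D before the update
d_new = d_old * np.exp(corrected / 2) # assumed multiplicative update
# det(D) is the product of the diagonal entries; with a trace-free
# exponent it is unchanged, since det(exp(M)) = exp(Tr(M)) = 1.
```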

## 7 Summary and Conclusion

In this article, we have put forward several techniques to improve multirecombinative nonelitistic covariance matrix adaptation evolution strategies without changing their internal computational effort.

The first component is concerned with the active covariance matrix update, which utilizes unsuccessful candidate solutions to actively shrink the sampling distribution in unpromising directions. We propose two ways to guarantee the positive definiteness of the covariance matrix by rescaling unsuccessful candidate solutions.

The second and main component is a diagonal acceleration of CMA by adaptive diagonal decoding, dd-CMA. The covariance matrix adaptation is accelerated by adapting a coordinate-wise scaling separately from the positive definite symmetric covariance matrix. This drastically accelerates the adaptation on high-dimensional functions with coordinate-wise ill-conditioning, whereas it does not significantly affect the adaptation of the covariance matrix on highly ill-conditioned nonseparable functions without a coordinate-wise scaling component.

The last component is a set of improved default parameters for CMA. The scaling of the learning rates is relaxed from $\Theta(1/n^{2})$ to $\Theta(1/n^{1.75})$ for default CMA and from $\Theta(1/n)$ to $\Theta(1/n^{0.75})$ for separable CMA. This accelerates the learning in all investigated CMA variants on high-dimensional problems, say $n\geqslant 100$.

Algorithm selection, as well as hyperparameter tuning of an algorithm, is a troublesome issue in black-box optimization. For CMA-ES, we needed to decide whether to use the plain CMA or the separable CMA, based on limited knowledge of the problem of interest. If we select the separable CMA but the objective function is nonseparable and highly ill-conditioned, we will not obtain a reasonable solution within most given function evaluation budgets. Therefore, plain CMA is the safer choice, though it may be slower on nearly separable functions. The dd-CMA automates this decision and achieves a faster adaptation speed on functions with ill-conditioned variables, usually without compromising the performance on nonseparable and highly ill-conditioned functions. Moreover, the advantage of dd-CMA is not limited to separable problems. We found a class of nonseparable problems with variables of different sensitivity on which dd-CMA-ES decidedly outperforms CMA-ES *and* separable CMA-ES (see Figure 8a). As dd-CMA-ES improves the performance on a relatively wide range of problems with mis-scaled variables, it is a prime candidate to become the default CMA-ES variant.

## Acknowledgments

The authors thank the associate editor and the anonymous reviewers for their valuable comments and suggestions. This work is partly supported by JSPS KAKENHI grant number 19H04179.

## Notes

^{1}

Any covariance matrix, $\Sigma$, can be uniquely decomposed into $\Sigma=DCD$, where $D$ is a diagonal matrix and $C$ is a correlation matrix. The addressed problem class can be characterized in that, for the best problem approximation $\Sigma=DCD$, *both matrices, $C$ and $D$, have a non-negligible condition number, say no less than 100*.
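A minimal sketch of this decomposition (assuming NumPy): $D$ carries the coordinate-wise standard deviations, and $C=D^{-1}\Sigma D^{-1}$ is the correlation matrix with unit diagonal. The example matrix below is arbitrary, chosen only for illustration.

```python
import numpy as np

def dcd_decompose(sigma):
    """Unique decomposition Sigma = D C D of a covariance matrix:
    D is diagonal with the coordinate-wise standard deviations,
    C is the correlation matrix (unit diagonal)."""
    d = np.sqrt(np.diag(sigma))      # coordinate-wise std. deviations
    c = sigma / np.outer(d, d)       # correlation matrix
    return np.diag(d), c

# Illustrative covariance with both coordinate-wise scaling and correlation.
Sigma = np.array([[4.0, 0.6],
                  [0.6, 0.25]])
D, C = dcd_decompose(Sigma)
# cond(D) reflects the coordinate-wise scaling, cond(C) the correlations;
# the problem class of interest has both non-negligible.
```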

^{2}

Ties, $f(x_{i:\lambda})=\cdots=f(x_{i+k-1:\lambda})$, are treated by redistributing the averaged recombination weights $\sum_{l=0}^{k-1}w_{i+l}/k$ to the tied solutions $x_{i:\lambda},\dots,x_{i+k-1:\lambda}$.

^{3}

When $D$ is not the identity, (3) is not exactly equivalent to the original CSA (Hansen and Ostermeier, 2001): $z_{i:\lambda}=(D\sqrt{C})^{-1}(x_{i:\lambda}-m)/\sigma$ in this article, whereas originally $z_{i:\lambda}=(\sqrt{DCD})^{-1}(x_{i:\lambda}-m)/\sigma$. This difference results in rotating the second term of the RHS of (3) at each iteration with a different orthogonal matrix, and ends up in a different $\lVert p_\sigma\rVert$. Krause et al. (2016) have theoretically studied the effect of this difference and argued that this systematic difference becomes small if the parameterization of the covariance matrix of the sampling distribution is unique and it converges up to scale. If $D$ is fixed, the parameterization is unique. Later in this article, we update both $D$ and $C$, but we force the parameterization to be unique by Eqs. (34) and (35). Hence, the systematic difference is expected to be small. See Krause et al. (2016) for details.

^{4}

An elegant alternative to introducing $\gamma_\sigma$ is to use $c_\sigma(t)=\max(c_\sigma,1/t)$ in place of $c_\sigma$ in Eq. (3), where the first iteration is $t=1$. This resembles a simple average while $t\leqslant 1/c_\sigma$ and only afterwards discounts older information by the original decay of $1-c_\sigma$.
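The averaging behavior of this schedule can be seen on the plain weighted recursion $m_t=(1-c_t)m_{t-1}+c_t x_t$; this is a simplification of the path update (3), which additionally carries a normalization $\sqrt{c_t(2-c_t)}$ that is omitted here. With $c_t=1/t$, the recursion is exactly a running average.

```python
def schedule(c_sigma, t):
    """c_sigma(t) = max(c_sigma, 1/t); simple averaging while t <= 1/c_sigma,
    constant exponential discounting afterwards."""
    return max(c_sigma, 1.0 / t)

def run(xs, c_sigma):
    """Apply the weighted recursion m_t = (1 - c_t) m_{t-1} + c_t x_t,
    a simplified stand-in for the cumulation in Eq. (3)."""
    m = 0.0
    for t, x in enumerate(xs, start=1):  # first iteration is t = 1
        c = schedule(c_sigma, t)
        m = (1 - c) * m + c * x
    return m
```

As long as $t\le 1/c_\sigma$, the result equals the arithmetic mean of the inputs; only afterwards does the constant decay take over.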

^{5}

As of 2018, many implementations of CMA-ES feature the active update of $C$, and it should be considered the default.

^{6}

There are variants of CMA-ES that update a factored matrix $A$ satisfying $C=AA^{\mathrm T}$ (e.g., eNES by Sun et al., 2009). No matter how $A$ is updated, the positive semidefiniteness of $C$ is guaranteed since $AA^{\mathrm T}$ is always positive semidefinite. However, this approach has the potential drawback that a negative update may end up increasing a variance. To see this, consider the case $A\leftarrow A(I-\eta ee^{\mathrm T})$, where $e$ is some vector and $\eta>0$ represents the learning rate times the recombination weight. Then, the covariance matrix follows $C\leftarrow A(I-\eta ee^{\mathrm T})^{2}A^{\mathrm T}$. This update shrinks a variance if $\eta$ is sufficiently small ($\eta<1/\lVert e\rVert^{2}$); however, it increases the variance if $\eta$ is large ($\eta>1/\lVert e\rVert^{2}$). Hence, a negative update with a long vector $e$ and/or a large $\eta$ will cause the opposite effect. Therefore, the factored matrix update is not a conclusive solution to the positive definiteness issue.

^{7}

Some references (e.g., Harville, 2008) use the term “positive semidefinite matrix” for a matrix with positive and zero eigenvalues and “non-negative definite matrix” for matrices with non-negative eigenvalues, whereas other references use both terms for matrices with non-negative eigenvalues. In this article, we apply the latter terminology.

^{8}

Interesting results are derived from the quality gain (and progress rate) analysis; see, for example, Section 4.2 of Hansen et al. (2015). Comparing the $(1+1)$-ES and the $(\mu,\lambda)$-ES with intermediate recombination, we realize that they reach the same quality gain in the limit for $n$ to infinity when $\mu$ takes its optimal value, $\mu\approx 0.27\lambda$ (Beyer, 1995). That is, a nonelitist strategy with optimal $\mu/\lambda$ for intermediate recombination can reach the same speed as the elitist strategy. With the optimal recombination weights above, the weighted recombination ES is 2.475 times faster than those algorithms. If we employ only the positive half of the optimal recombination weights, the speed is halved, yet it remains faster by a factor of 1.237.

^{9}

Considering the eigen decomposition of the covariance matrix, $E\Lambda E^{\mathrm T}$, instead of the decomposition $DCD$, our attempts to additionally adapt $\Lambda$ were unsuccessful. We attribute the failure to the strong interdependency between $\Lambda$ and $E$. Compared to the eigen decomposition, the diagonal decoding model $DCD$ also has the advantage of being interpretable as a rescaling of variables. We never considered the decomposition $CD^{2}C$ as proposed by one of the reviewers of this article; however, at first sight, we do not see any particular advantage in favor of this decomposition.

^{10}

Let $A$ and $B$ be an arbitrary positive definite symmetric matrix and an arbitrary symmetric matrix, respectively. Let $\otimes$ and $\operatorname{vec}$ denote the Kronecker product of two matrices and the matrix-vector rearrangement operator that successively stacks the columns of a matrix. From Theorem 16.2.2 of Harville (2008), we have $\operatorname{Tr}(ABA^{-1}B)=\operatorname{vec}(B)^{\mathrm T}(A\otimes A^{-1})\operatorname{vec}(B)$. The eigenvalues of the Kronecker product $A\otimes A^{-1}$ are the products of the eigenvalues of $A$ and $A^{-1}$ (Theorem 21.11.1 of Harville, 2008). They are all positive and upper bounded by the condition number of $A$. Therefore, we have $\operatorname{vec}(B)^{\mathrm T}(A\otimes A^{-1})\operatorname{vec}(B)\leqslant\operatorname{Cond}(A)\lVert\operatorname{vec}(B)\rVert^{2}=\operatorname{Cond}(A)\operatorname{Tr}(B^{2})$. Letting $A=\operatorname{corr}(C)$ and $B=\bar{\Delta}_D$, we have $\operatorname{Tr}(\bar{\Delta}_D C\bar{\Delta}_D C^{-1})=\operatorname{Tr}(ABA^{-1}B)\leqslant(\beta+\beta_{\mathrm{thresh}}-1)^{2}\operatorname{Tr}(\bar{\Delta}_D^{2})$.
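The intermediate bound $\operatorname{Tr}(ABA^{-1}B)\leqslant\operatorname{Cond}(A)\operatorname{Tr}(B^{2})$ can be checked numerically; the random matrices below are arbitrary, chosen only to instantiate a positive definite $A$ and a symmetric $B$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)      # symmetric positive definite
S = rng.standard_normal((n, n))
B = (S + S.T) / 2                # symmetric

# Bound from the footnote: Tr(A B A^{-1} B) <= Cond(A) Tr(B^2).
lhs = np.trace(A @ B @ np.linalg.inv(A) @ B)
rhs = np.linalg.cond(A) * np.trace(B @ B)
```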

^{11}

Its Python code is available at https://gist.github.com/youheiakimoto/1180b67b5a0b1265c204cba991fa8518

^{12}

The learning rate for the rank-$\mu$ update is usually $\mu'$ times greater than the learning rate for the rank-one update, where $\mu'\in(\mu_w-2,\,\mu_w-1/2)$ is monotonous in $\mu_w$ and approaches $\mu_w-3/2$ for $\lambda\to\infty$. Using $\mu'$ instead of $\mu_w$ is a correction for small $\mu_w$. When $\mu_w=1$, we have $\mu'<1/2$ (the first three terms cancel out). In this case, because $\mu=1$, the rank-one update contains more information than the rank-$\mu$ update, as the evolution path accumulates information over time. Using $\mu_w$ instead of $\mu'$ would result in the same learning rates for both updates, whereas the learning rate for the rank-$\mu$ update should be smaller in this case.

^{13}

The appearance of $\mu_w$ in the cumulation factor is motivated as follows. Hansen and Ostermeier (2001) analyzed the situation where the selection vector $\sqrt{\mu_w}\sum_{i=1}^{\mu}w_i z_{i:\lambda}$ alternates between two vectors $v$ and $w$ over iterations and showed that the squared Euclidean norm of the evolution path, that is, the eigenvalue of $p_c p_c^{\mathrm T}$, remains in $O(\lVert v+w\rVert^{2}/c_c)$ if $\lVert v+w\rVert>0$. On the other hand, Akimoto et al. (2008) showed that the above selection vector can potentially be proportional to $\sqrt{\mu_w}$ when $\sigma$ is (too) small. Altogether, we have $c_1\lVert p_c\rVert^{2}\in O(c_1\mu_w/c_c)$, where $c_1\mu_w\to 2$ and $c_c\to\sqrt{2}/2$ as $\mu_w\to\infty$. Therefore, the eigenvalue added by the rank-one update is bounded by a constant with overwhelming probability.

^{14}

If we set $\beta_{\mathrm{eig}}=10$ instead of $\beta_{\mathrm{eig}}=10n$, dd-CMA spends about 1.5 times more FEs on the 5120-dimensional separable Ellipsoid function than separable CMA as displayed in Figure 7d, whereas it spends about $15\%$ more FEs on the 320-dimensional rotated Ellipsoid function than plain CMA as displayed in Figure 7d. This might be a more practical choice, but further investigation is required.