## Abstract

The recently proposed *mirrored sampling* technique for evolution strategies aims at generating more evenly distributed samples in high-dimensional search spaces. By enlarging the diversity of the mutation samples, mirrored sampling improves the convergence rate. Motivated by this technique, this article introduces a new derandomized sampling technique called *mirrored orthogonal sampling*. The performance of this new technique is both theoretically analyzed and empirically studied on the sphere function. In particular, the mirrored orthogonal sampling technique is applied to the well-known Covariance Matrix Adaptation Evolution Strategy (CMA-ES). The resulting algorithm is experimentally tested on the well-known Black-Box Optimization Benchmark (BBOB). The benchmark results show that mirrored orthogonal sampling outperforms both the standard CMA-ES and its variant using mirrored sampling.

## 1 Introduction

In evolution strategies (ESs), derandomized sampling aims at improving on random sampling by generating “good” random mutation points from a given distribution (usually Gaussian). Loosely speaking, the “goodness” of a mutation sample refers to its diversity, which measures how evenly the mutation points are arranged in the search space. Intuitively, a mutation sample with high diversity can explore the search space more thoroughly. We will introduce the definition of diversity in Section 2.2.

Much effort has been devoted to derandomized sampling and several methods have been proposed. Niederreiter (1992) proposes to adopt *quasi-random* variables, which have already been applied to genetic algorithms (Kimura and Matsumura, 2005) and evolution strategies (Teytaud and Gelly, 2007). Despite their successes, these approaches are more complicated to implement than simple random sampling and may introduce undesired biases, as will be shown later. A recent systematic overview of modern variants of evolution strategies can be found in Bäck et al. (2013).

The recently proposed *mirrored sampling* technique (Brockhoff et al., 2010) is a simple and effective derandomized sampling method for ESs. Instead of generating $\lambda$ i.i.d. Gaussian samples, the sample points are paired and placed symmetrically about the current parent, where only one sample point of each pair is generated by simple random sampling. In mirrored sampling, half of the sample points are thus independent and the other half are dependent. No additional computational cost is incurred by the mirrored sampling technique. It has been theoretically proven that the performance of the $(1,+\lambda)$-ES can be improved by applying the mirrored sampling technique (Auger et al., 2011a).

The purpose of this article is to elaborate and analyze an improvement of the mirrored sampling technique, which is called *mirrored orthogonal sampling*. Its basic idea has been briefly introduced in Wang et al. (2014) and it is based on the following intuition. In mirrored sampling, half of the samples are still obtained using the simple random sampling, which would suffer from the same problem as before, namely the sampling errors (see Section 2.2 for an in-depth discussion). Thus, the diversity of random samples could be further enhanced by improving the mirrored sampling technique. This is achieved by completely discarding the simple random sampling component in mirrored sampling and replacing it by the *orthogonal sampling*. The resulting technique is called *mirrored orthogonal sampling* here. This article extends our previous work (Wang et al., 2014) in the following aspects:

- The motivation, intuition, and formalization of mirrored orthogonal sampling are presented in detail.

- The procedure for tuning the strategy parameters of mirrored orthogonal sampling is discussed.

- The progress rate approach is applied to analyze and compare the simple random, mirrored, and mirrored orthogonal sampling techniques.

- The benchmark results are presented and explained in detail.

Based on the convergence rate analysis approach of Brockhoff et al. (2010), we analyze the convergence rate of the isotropic $(1,\lambda)$-ES with mirrored orthogonal sampling on the sphere function. For the $(\mu,\lambda)$-ES, a bias would occur during recombination, potentially leading to premature convergence. This bias is avoided by applying the *pairwise selection* technique, in which only the better point of a mirrored pair is allowed to participate in the weighted recombination. Mirrored orthogonal sampling is applied within a CMA-ES for the experimental validation.

This article is organized as follows. Section 2 introduces the background of derandomized sampling as well as the motivation of our work. In Section 3.1, the new derandomized sampling approach is proposed and explained in detail. Section 3.2 concentrates on the implementation issues of our approach. Both a theoretical and an empirical study of the convergence rates are presented in Section 4, using the progress rate analysis to analyze the algorithm performance. In Section 5, the experimental results of mirrored orthogonal sampling are shown and compared to the other sampling methods. Finally, conclusions and possible directions for further research are given in Section 6. Throughout this article, we use $n$ to denote the dimensionality of the search space and $\lambda$ to denote the population size of the evolution strategy.

## 2 Background and Related Work

### 2.1 Evolution Strategy

Evolution strategies rely on the **mutation operator**, where random variables are used to perturb candidate solutions locally. Most commonly, the mutation operator is realized by taking a **simple random sample** (of size $\lambda > 1$) from the *multivariate Gaussian distribution*:

$$x_i = m + \sigma z_i, \qquad z_i \sim N(0, C), \qquad i = 1, \ldots, \lambda. \tag{1}$$

The new parental point $m$ is computed from the $\mu$ selected offspring by *weighted recombination* (Hansen and Ostermeier, 2001):

$$m \leftarrow \sum_{i=1}^{\mu} w_i x_{i:\lambda}, \qquad \sum_{i=1}^{\mu} w_i = 1, \tag{2}$$

where $x_{i:\lambda}$ denotes the $i$th best of the $\lambda$ offspring. The parameters of the sampling distribution are adapted during the search. In the *Covariance Matrix Adaptation Evolution Strategy* (CMA-ES) (Hansen, 2006), this is achieved by maximum likelihood estimation. In addition, in CMA-ES the step size $\sigma$ is controlled by the *Cumulative Step-size Adaptation* (CSA) mechanism (Hansen and Ostermeier, 2001), where the steps (displacements of $m$ between iterations) are accumulated with exponential decay and the length of the accumulated path is used to adjust the step size. The rate of change of the step size is regulated by the so-called *damping factor* $d_\sigma$. Please refer to Hansen (2006) for the details. Note that the original damping factor is modified when applying the proposed sampling approach to CMA-ES (Section 3.4).

### 2.2 Sampling Error and Space Exploration

The standard mutation operator, i.e., *simple random sampling* (Equation (1)), samples pseudo-random numbers directly from a given distribution. However, it suffers from the so-called *sampling error*, which describes the situation in which the properties estimated from a sample differ largely from the properties of the underlying population. The sampling error is caused by unrepresentative or biased samples, which typically arise when the sample size is small.

An example of biased samples is illustrated in Figure 1, in which four i.i.d. mutation vectors are sampled from a multivariate Gaussian distribution $N(m, C)$ (the step size is ignored in this case). The black solid ellipsoid represents the covariance matrix $C$. The diversity of the mutation vectors is unsatisfactory because the minimal distance between samples is relatively small. A strong sampling error occurs in this case: if the mean and covariance of the distribution were estimated from these four vectors, the results would deviate largely from $m$ and $C$.

Consequently, a large portion of the search space is not reached (at least half of the space in this case). Moreover, if the objective function is locally convex near the optimum (as illustrated by the dashed ellipsoids), the probability that a new search point leads to an improvement can be very small, as shown by the area marked by vertical lines. Therefore, if the population size is small, a biased sample can occur such that none of the mutations leads to an improvement, hindering the progress of this generation. The sampling error has an even bigger side effect in modern evolution strategies (e.g., CMA-ES) because those algorithms tend to exploit small populations to speed up their convergence rate. To overcome this problem, it is proposed to apply derandomized sampling methods for small populations.
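The effect of the sample size on the sampling error can be illustrated numerically. The sketch below (a hypothetical illustration, not part of the original study; all names are ours) estimates the mean and covariance from a sample of four points and from a much larger sample, and measures how far the estimates deviate from $m$ and $C$:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2                    # dimensionality of the toy search space
m = np.zeros(n)          # true mean of the mutation distribution
C = np.eye(n)            # true covariance matrix

def estimation_error(sample_size):
    """Deviation of the empirical mean/covariance from m and C."""
    sample = rng.multivariate_normal(m, C, size=sample_size)
    mean_err = np.linalg.norm(sample.mean(axis=0) - m)
    cov_err = np.linalg.norm(np.cov(sample.T) - C)
    return mean_err, cov_err

small_err = estimation_error(4)      # lambda = 4, as in Figure 1
large_err = estimation_error(4000)   # errors shrink roughly as 1/sqrt(size)
```

With only four samples the estimation errors are typically large, which is exactly the regime in which modern ESs with small populations operate.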

### 2.3 Quasi-Random Sampling

Several techniques have been proposed to reduce the sampling error as much as possible. The first method is quasi-random sampling, which produces low-discrepancy sequences of samples (Dick and Pillichshammer, 2010). The discrepancy of a sequence is low if the proportion of points in the sequence falling into an arbitrary set is close to proportional to the measure of this set. Low-discrepancy sequences are commonly used as a replacement for simple random samples from the uniform distribution. Intuitively, such sequences span the search space more “evenly” than pseudo-random numbers. They are widely used in numerical approaches like the quasi-Monte-Carlo method (Niederreiter, 1992) to achieve a faster rate of convergence. Due to these advantages, quasi-random sampling has also been applied in genetic algorithms (Kimura and Matsumura, 2005) and evolution strategies (Teytaud and Gelly, 2007). Specifically, it has already been applied to the well-known CMA-ES (Hansen and Ostermeier, 2001; Hansen et al., 2003): Teytaud and Gelly (2007) propose to replace the simple random Gaussian samples by a low-discrepancy sequence in the mutation operator. A method for generating quasi-random samples according to the Gaussian distribution also has to be developed, because quasi-random samples are usually generated for the uniform distribution. It is argued that the efficiency of CMA-ES is improved due to the better diversity of quasi-random samples. However, such an approach would cause a systematic bias in the step size adaptation (see Section 3.3).

### 2.4 Mirrored Sampling

The mirrored sampling technique (Brockhoff et al., 2010) is another method for obtaining “good” samples and it successfully accelerates the convergence of ESs (Auger et al., 2010). It is a simple and elegant idea in which a single random mutation is used to create two sample points: Instead of generating $\lambda$ i.i.d. search points, only half of the mutation vectors are generated using simple random sampling, namely $\{z_{2i-1}\}_{i=1}^{\lambda/2}$ with $z_{2i-1} \sim N(0, \sigma^2 C)$. Each mutation vector $z_{2i-1}$ is used to generate a pair of offspring, $x_{2i-1} = m + z_{2i-1}$ and $x_{2i} = m - z_{2i-1}$, which are symmetric about the center of mass $m$ (the parental point).

To make the following discussion clear, a mutation obtained directly from simple random sampling is called an original mutation. The mirrored sampling method is described in Algorithm 1, acting as an alternative to the standard mutation operator (simple random sampling) in evolution strategies. For an odd $\lambda$, it begins with generating $\lceil \lambda/2 \rceil$ mutations in the first generation, corresponding to $\lceil \lambda/2 \rceil$ mirrored ones. To keep the population size always equal to $\lambda$, all $\lceil \lambda/2 \rceil$ original mutations and $\lceil \lambda/2 \rceil - 1$ mirrored ones undergo the evaluation and selection procedure, while the last mirrored mutation is held out for the next iteration (Lines 18–21). In the next iteration, the held-out mirrored mutation is used (Lines 3–9) and only $\lfloor \lambda/2 \rfloor$ new mutations need to be drawn. The following generations repeat this procedure. The static variable $z_{\text{last}}$ in Algorithm 1 stores the held-out mutation. Here, the notation proposed in Brockhoff et al. (2010) is used, such that any ES algorithm using mirrored sampling is denoted as $(\mu,+\lambda_m)$-ES. Consequently, in the $(1+1_m)$-ES, a mirrored mutation is used if and only if the iteration index is even. By using mirrored sampling, the mutations in each mirrored pair are dependent and explore two anti-parallel directions, such that the mirrored counterpart of an unsuccessful mutation has a certain chance to yield an improvement.
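For the even-$\lambda$ case (without the hold-out scheme of Algorithm 1), the core of mirrored sampling can be sketched as follows; the function name and variable names are ours:

```python
import numpy as np

def mirrored_mutations(rng, m, sigma, C, lam):
    """One generation of mirrored sampling for even lam (cf. Algorithm 1).

    Only lam/2 Gaussian vectors are drawn; each original mutation z is
    reused as -z, so the offspring come in pairs symmetric about m.
    """
    assert lam % 2 == 0, "odd lam requires the hold-out scheme of Algorithm 1"
    Z = rng.multivariate_normal(np.zeros(len(m)), C, size=lam // 2)
    offspring = []
    for z in Z:
        offspring.append(m + sigma * z)  # original mutation
        offspring.append(m - sigma * z)  # mirrored mutation
    return np.array(offspring)

rng = np.random.default_rng(0)
m = np.ones(5)
pop = mirrored_mutations(rng, m, sigma=0.5, C=np.eye(5), lam=6)
```

Note that each consecutive pair of offspring sums to $2m$, i.e., the pairs are exactly symmetric about the parent.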

Note that the mirrored sampling method is very similar to the so-called *opposition-based learning* method (Rahnamayan et al., 2006), in which the candidate solution is mirrored with respect to the center of the smallest hyper-box covering the current population. This approach is implemented in the differential evolution (DE) algorithm to generate an opposite population occasionally, which improves the performance of DE (Rahnamayan et al., 2006).

### 2.5 Deterministic Orthogonal Sampling

Orthogonal sampling, which denotes a sampling approach utilizing orthogonal search directions, is another solution to enhance the mutation diversity. This sampling scheme can be found in Coordinate Descent (Schwefel, 1993), Adaptive Coordinate Descent (ACiD) (Loshchilov et al., 2011), and Rosenbrock's Local Search (Rosenbrock, 1960). Intuitively, by taking samples on the orthogonal directions, the search space is covered more evenly.

Normally, in this approach, an orthonormal basis $\Xi = \{\xi_1, \xi_2, \ldots, \xi_n\}$, representing the possible search directions, is maintained in each optimization iteration. In each iteration, a line search is conducted along a basis vector, which is achieved by sampling two trial points: one point is created by adding the basis vector to the current search point $m$, while the other one is generated through mirroring. In the next iteration, a different basis vector in $\Xi$ is picked for exploration. The general framework of this method is summarized below:

1. Initialize the search point $m$, a set of orthonormal basis vectors $\Xi = \{\xi_1, \xi_2, \ldots, \xi_n\}$ as the search directions, and the step sizes $\{\sigma_1, \sigma_2, \ldots, \sigma_n\}$, one for each search direction.

2. While the termination condition is not satisfied, perform steps (a) to (e) in each iteration. Let $g$ be the iteration counter:

   (a) Choose the basis vector $\xi_i$ as the exploration direction, where $i = g \bmod n$, and generate one trial point: $x_1 = m + \sigma_i \xi_i$.

   (b) For Rosenbrock's local search, go to (c). For the other methods, use $\xi_i$ to generate the other trial point: $x_2 = m - \sigma_i \xi_i$.

   (c) Evaluate the trial points $x_1, x_2$ (if $x_2$ exists). Set the search point $m$ to the one with the best fitness value.

   (d) Update the step size $\sigma_i$ according to a deterministic or stochastic rule and increase the iteration counter $g$ by one.

   (e) If $g \bmod n = 0$, update the basis $\Xi$ according to the search points of the most recent $n$ iterations.

When all basis directions have been tried, the orthogonal basis $\Xi$ is either kept unchanged or updated based on the successful trials in the history. Note that the update rules vary from algorithm to algorithm: in Coordinate Descent, the basis is fixed to the canonical basis of $\mathbb{R}^n$ during the whole process; in ACiD, the basis is updated by Adaptive Encoding (Hansen, 2008), which is a generalization of the covariance matrix adaptation in CMA-ES. We deliberately term this sampling method *deterministic* orthogonal sampling because the update of the orthonormal basis is completely deterministic, and to distinguish it from the *random* orthogonal sampling proposed here.

## 3 Mirrored Orthogonal Sampling

In this section, we elaborate the mirrored orthogonal sampling technique.

### 3.1 The Proposed Method

This new method is motivated by the following observation: In mirrored sampling, half of the mutation vectors (the mirrored ones) completely depend on the other half (the original ones). Mirrored sampling ensures a significant difference between these two halves of mutations. In addition, the mirrored mutation is anti-parallel to the original one and thus a mirrored pair would never miss one half of the search space, no matter how the search space is partitioned.^{1} However, within each half of mutations, the diversity is still not regulated such that many mutations might be “squeezed” in a narrow direction. Thus, the mirrored sampling technique can still generate unrepresentative samples as described in Section 2.2.

In order to alleviate this issue, we consider the deterministic orthogonal sampling method (Section 2.5), where the mutations are selected from a precomputed orthogonal basis and thus the minimal distance between samples is enlarged. The disadvantage is that deterministic search directions are used and only one of the orthogonal vectors can be used in one evolution cycle, which limits its usability for the general $(\mu,\lambda)$-ES. Instead of just picking vectors from an orthogonal basis, it is proposed here to create uniformly random orthogonal vectors, in the sense that each vector is stochastic (instead of deterministic) and uniformly random (meaning that each search direction is sampled with the same probability). Such samples are defined as follows:

The *uniform random orthogonal vectors* are defined as a set of random vectors $\{O_1, O_2, \ldots, O_k\} \subset \mathbb{R}^n$ ($k \leq n$) satisfying the following three properties:

- Orthogonality: $\forall i \neq j \in \{1, 2, \ldots, k\}, \ \langle O_i, O_j \rangle = 0$.

- $\chi$-distributed norm: $\forall i \in \{1, 2, \ldots, k\}, \ \lVert O_i \rVert = \sqrt{\langle O_i, O_i \rangle} \sim \chi(n)$.

- Uniformity: for each vector $O_i$, its normalization $O_i / \lVert O_i \rVert$ is distributed *uniformly* on the unit sphere.

Remarks: (1) The norm of each sample vector is restricted to the $\chi(n)$ distribution to mimic the behavior of a standard Gaussian vector. (2) The uniform distribution on the unit sphere is equivalent to the *rotation-invariance* property with respect to an arbitrary rotation matrix^{2} $R$: the random vector $x$ and the rotated one $x' = Rx$ are identically distributed. (3) Throughout this article, the dot product is taken as the inner product, namely $\langle x, y \rangle = x^\top y$.

We call the procedure generating such vectors *random orthogonal sampling*. For clarity, we shall refer to the default mutation operator (Equation (1)) in CMA-ES as *simple random sampling*. In addition, the random orthogonal samples are rescaled and rotated according to the covariance matrix $C$ before they are added to the parental point $m$, as with the default mutation operator.

The combination of random orthogonal sampling with the mirroring technique is called *mirrored orthogonal sampling*. In addition, any ES algorithm equipped with it is denoted as $(\mu,+\lambda_{mo})$-ES here. The detailed algorithm of the mirrored orthogonal sampling method is given in Algorithm 2. Note that an algorithm for generating random orthogonal Gaussian vectors (which is explained in the following) is invoked in line 10 and replaces the direct sampling of the Gaussian distribution. The remainder of this algorithm is basically the same as mirrored sampling (Algorithm 1).
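The combined operator can be sketched as follows. This is a simplified illustration assuming $C = I$, using a QR decomposition of a Gaussian matrix in place of the explicit Gram-Schmidt process of Algorithm 3 (the two are numerically equivalent for this purpose); the function name is ours:

```python
import numpy as np

def mirrored_orthogonal_mutations(rng, m, sigma, lam, n):
    """Sketch of the mirrored orthogonal mutation operator for C = I.

    lam/2 <= n orthogonal directions are obtained by orthonormalizing
    a Gaussian matrix (QR plays the role of Gram-Schmidt here), their
    lengths are rescaled to follow chi(n), and each direction is then
    mirrored to form a pair symmetric about the parent m.
    """
    k = lam // 2
    assert lam % 2 == 0 and k <= n
    Q, _ = np.linalg.qr(rng.standard_normal((n, k)))  # orthonormal columns
    lengths = np.sqrt(rng.chisquare(n, size=k))       # chi(n) norms
    Z = (Q * lengths).T                               # k orthogonal mutations
    offspring = []
    for z in Z:
        offspring.append(m + sigma * z)
        offspring.append(m - sigma * z)
    return np.array(offspring)

rng = np.random.default_rng(3)
pop = mirrored_orthogonal_mutations(rng, m=np.zeros(8), sigma=1.0, lam=6, n=8)
```

For a general covariance matrix $C$, each vector would additionally be multiplied by $C^{1/2}$ before being added to $m$, as stated above.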

Compared to mirrored sampling, which ensures the difference within any mirrored pair, the orthogonalization method is exploited to guarantee the significant differences among mutations. Therefore, it is straightforward to compare the performance of mirrored orthogonal sampling to that of mirrored sampling/simple random sampling. Such a comparison is presented in the experimental results (Section 5).

### 3.2 Implementation of Random Orthogonal Sampling

In order to implement the random orthogonal sampling technique, the well-known Gram-Schmidt process (Björck, 1994) is exploited to generate the orthogonal sample points. The Gram-Schmidt process is a method for orthonormalizing a set of vectors in an inner product space, most commonly the Euclidean space $\mathbb{R}^n$. It takes a finite, linearly independent set $S = \{v_1, \ldots, v_k\}$ ($k \leq n$) and generates an orthogonal set $S' = \{u_1, \ldots, u_k\}$ that spans a $k$-dimensional subspace of $\mathbb{R}^n$. The pseudocode of the Gram-Schmidt process is listed in Algorithm 3.
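A minimal Python sketch of the classical Gram-Schmidt process is given below, in a normalized variant (for serious numerical work, the modified Gram-Schmidt variant or a QR decomposition is preferred for stability):

```python
import numpy as np

def gram_schmidt(V):
    """Orthonormalize the rows of V by the classical Gram-Schmidt process.

    Subtracts from each vector its projections onto the previously
    accepted directions; (numerically) dependent vectors are skipped.
    """
    basis = []
    for v in V:
        u = v - sum(np.dot(v, b) * b for b in basis)  # remove projections
        norm = np.linalg.norm(u)
        if norm > 1e-12:
            basis.append(u / norm)
    return np.array(basis)

rng = np.random.default_rng(1)
U = gram_schmidt(rng.standard_normal((3, 5)))  # 3 orthonormal vectors in R^5
```

Applied to i.i.d. standard Gaussian input vectors, the resulting directions can then be rescaled to $\chi(n)$-distributed lengths to obtain random orthogonal samples.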

The random orthogonal sample is obtained by applying the Gram-Schmidt process to i.i.d. standard Gaussian vectors. The orthogonality and the restriction on the vector lengths are immediately satisfied. The rotation invariance of the vectors can be shown as follows. First, standard normal vectors are rotation-invariant, meaning that every $s_i \sim N(0, I)$ has the same distribution as $Rs_i$, where $R$ is a rotation matrix taken from $SO(n)$. Second, the orthogonalization formula of the Gram-Schmidt process, which is encoded in Algorithm 3, reads as follows:

$$u_i = v_i - \sum_{j=1}^{i-1} \frac{\langle v_i, u_j \rangle}{\langle u_j, u_j \rangle}\, u_j.$$

Because inner products are invariant under rotation, rotating all input vectors by $R$ rotates each output vector $u_i$ by $R$ as well, so the resulting orthogonal vectors inherit the rotation invariance of the Gaussian inputs.

### 3.3 Recombination and Pairwise Selection

When both members of a mirrored pair enter the weighted recombination, their contributions partially cancel each other, which biases the step-size adaptation. To fix this undesirable effect, the *pairwise selection* heuristic introduced in Auger et al. (2011b) is adopted here. Pairwise selection prevents the pairwise cancellation by allowing only the better mutation of a mirrored pair to contribute to the weighted recombination. The effect of combining pairwise selection and mirroring is presented by the solid curves in Figure 2a, in which no bias in the step-size adaptation can be observed. In the following sections, pairwise selection is used in the ES whenever mirrored sampling or mirrored orthogonal sampling is used.

### 3.4 Application to the CMA-ES Algorithm

We apply the *mirrored orthogonal sampling* technique to the CMA-ES (Hansen and Ostermeier, 2001). In addition to the recombination problem discussed in the last section, some tuning is required to find default settings of the control parameters for the new sampling technique. The step size control mechanism, *cumulative step-size adaptation* (CSA), is exploited in CMA-ES. In the CSA technique, the damping factor $d_\sigma$ controls the adaptation speed of the step size $\sigma$ and was originally developed for i.i.d. Gaussian mutations. However, the mutations generated by mirrored orthogonal sampling are no longer independently distributed. Therefore, the damping factor $d_\sigma$ needs to be optimized for the newly proposed technique. The default setting of the damping factor in Hansen (2006) is $d_\sigma = 1 + 2\max\{0, \sqrt{(\mu_{\mathrm{eff}} - 1)/(n+1)} - 1\} + c_\sigma$. Note that $\mu_{\mathrm{eff}}$ is defined as the variance effective selection mass (Hansen and Ostermeier, 2001) of the recombination weights $\{w_i\}_{i=1}^{\mu}$ and is computed according to $\mu_{\mathrm{eff}} = \left(\sum_{i=1}^{\mu} w_i^2\right)^{-1}$. $c_\sigma$ is the cumulation constant used for the evolution path and usually $c_\sigma \ll 1$. For the other default parameters of CMA-ES and their explanation, please refer to Hansen (2006).
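The default damping factor can be computed as sketched below. This is an illustrative computation assuming the commonly used logarithmic recombination weights and the usual $c_\sigma = (\mu_{\mathrm{eff}} + 2)/(n + \mu_{\mathrm{eff}} + 5)$ from Hansen (2006); the function names are ours:

```python
import math

def default_d_sigma(lam, n):
    """Illustrative computation of the default CSA damping factor d_sigma."""
    mu = lam // 2
    # Logarithmic recombination weights, normalized to sum to one.
    w = [math.log((lam + 1) / 2) - math.log(i) for i in range(1, mu + 1)]
    total = sum(w)
    w = [wi / total for wi in w]
    # Variance effective selection mass of the weights.
    mu_eff = 1.0 / sum(wi * wi for wi in w)
    c_sigma = (mu_eff + 2) / (n + mu_eff + 5)
    return 1 + 2 * max(0.0, math.sqrt((mu_eff - 1) / (n + 1)) - 1) + c_sigma

d = default_d_sigma(lam=10, n=20)  # for small lambda this is 1 + c_sigma
```

For default (small) population sizes, $\mu_{\mathrm{eff}} - 1 < n + 1$, so the $\max$ term vanishes and $d_\sigma$ stays slightly above one.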

We tune the damping factor under the default $\lambda$ setting, which grows logarithmically with the dimensionality. The tuning follows the procedure proposed in Brockhoff et al. (2010) to choose the new $d_\sigma$ setting. First, every strategy parameter except $d_\sigma$ is initialized to its default value. Second, multiple $d_\sigma$ values are evaluated in an experiment on the sphere function $f(x) = \sum_{i=1}^{n} x_i^2$, where the performance can be assumed to be a unimodal function of $d_\sigma$, such that a unique optimal value for $d_\sigma$ can be determined. An example of this second step for the $(\mu/\mu_w, \lambda_{mo})$-CMA-ES is shown in Figure 2a. Finally, all tuning curves from step 2 are collected and the feasible ranges of $d_\sigma$ are chosen according to three criteria (Brockhoff et al., 2010):

- Decreasing the selected $d_\sigma$ from the feasible value by a factor of two leads to a better performance than increasing it by a factor of two.

- Decreasing the selected $d_\sigma$ by a factor of three never leads to an observed failure.

- The selected $d_\sigma$ should never lead to a performance that is two times slower than the optimal performance in the tuning graph.

## 4 Performance Analysis

In this section, we analyze the possible performance improvement introduced by mirrored orthogonal sampling. We first give the theoretical analysis for the single-parent evolution strategy and then investigate the multiparent strategies empirically on the sphere function.

### 4.1 Theoretical Aspects

For simplicity, the population size $\lambda$ is assumed to be **even** in the following analysis. In practice, when $\lambda$ is odd, the corresponding progress rate can be bounded from below by using $\lambda - 1$ in the analysis and from above by using $\lambda + 1$.

Note that although some results (e.g., Figure 4b) could equivalently be obtained using the theoretical framework of *convergence rate analysis* (Brockhoff et al., 2010), we did not adopt that approach because the progress rate analysis gives more insight into why the proposed sampling method outperforms its counterparts. The link between progress rate and convergence rate is elaborated in Auger and Hansen (2011). For the convergence rate analysis of mirrored sampling, see Auger et al. (2011a,b).

#### 4.1.1 Mirrored Sampling

The expected largest projection under mirrored sampling plays the role of the *progress coefficient* from Beyer (1993). We denote it by $c_{1,\lambda}^{m}$ here. It can be compared to the progress coefficient of random sampling, which reads:

Numerically, we plot the progress coefficients of random sampling and mirrored sampling against the population size in Figure 4b. Mirrored sampling (the curve marked by triangles) shows a small yet clear advantage over random sampling for small population sizes. For larger populations, the two curves converge, implying that mirrored sampling then provides no speed-up to the ES algorithm. Thus, the application of mirrored sampling should be limited to small population settings.

For mirrored orthogonal sampling, we would like to use the same approach as for the mirrored sampling analysis above. However, it is hard to analytically obtain the CDF and the density function of the largest projection onto $PO$ for mirrored orthogonal sampling. Therefore, we compute its CDF and density function empirically by Monte-Carlo simulation. For the simulation, the population size $\lambda$ is set to $2n$. The mirrored orthogonal samples are projected onto $PO$ and the largest projections are stored, from which the CDF is estimated. The results are also summarized in Figure 4. In Figure 4a, the CDF of mirrored orthogonal sampling (the solid curve marked by stars) places more probability mass at larger values than the CDF of mirrored sampling. As a consequence, in Figure 4b, the progress coefficients of mirrored orthogonal sampling are significantly bigger than those of mirrored sampling, even for large populations.
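The Monte-Carlo estimation described above can be sketched as follows. This simplified illustration estimates the expected largest projection of the mutations onto a fixed unit direction (the first coordinate axis, standing in for $PO$); the function names and run count are ours:

```python
import numpy as np

def progress_coefficient(sampler, n, lam, runs=3000, seed=0):
    """Monte-Carlo estimate of E[max_i <z_i, e_1>] for a sampling scheme."""
    rng = np.random.default_rng(seed)
    best = [sampler(rng, n, lam)[:, 0].max() for _ in range(runs)]
    return float(np.mean(best))

def simple_random(rng, n, lam):
    """lam i.i.d. standard Gaussian mutation vectors."""
    return rng.standard_normal((lam, n))

def mirrored(rng, n, lam):
    """lam/2 Gaussian vectors plus their mirrored counterparts."""
    Z = rng.standard_normal((lam // 2, n))
    return np.vstack([Z, -Z])

c_rand = progress_coefficient(simple_random, n=10, lam=4)
c_mirr = progress_coefficient(mirrored, n=10, lam=4)
# For small lam, mirrored sampling yields a slightly larger coefficient.
```

Replacing `mirrored` with a mirrored orthogonal sampler reproduces the comparison of Figure 4b empirically.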

#### 4.1.2 Mirrored Orthogonal Sampling: The Worst Case Analysis

The worst case analysis of mirrored orthogonal sampling is conducted for a population size of $2n$. We will call this population setting “full mutations.” Under this condition, the progress rate is maximized (as will be explained later) and it is possible to provide analytical results. The progress under the condition $\lambda < 2n$ will also be discussed later.

In 2-D with $\lambda = 4$, the worst case (together with the best case) of progress for the $(1,\lambda_{mo})$-ES is shown in Figure 3b. For simplicity, suppose the step size is fixed to $\sigma = 1$. Among the mutations centered at $P_1$, one mutation points to the optimum $O$ and therefore performs optimally. We call this mutation scenario the best case of progress. The progress coefficient in this case is the expected norm of a standard Gaussian mutation. It serves as the upper bound of the progress coefficient and is the same for random, mirrored, and mirrored orthogonal sampling.

The worst case of progress is indicated by the mutations centered at $P_2$, in which the angle formed by the line segment $P_2O$ and mutation $s_i$ is the same as the angle ($\pi/4$, as shown in the figure) formed by $P_2O$ and $s_j$. In this scenario, the expected projections of $s_i$ and $s_j$ are the same. It is not possible to make the expected projection of one mutation smaller without rendering the expected projection of the other one larger. For example, if we rotate $s_j$ slightly clockwise, its projection becomes smaller; however, $s_i$ is rotated at the same time and its projection gets larger. Consequently, the largest projection of all the mutations becomes larger. Therefore, among all possible mutation scenarios, $P_2$ gives the lower bound of the largest projection of the mutations onto $P_2O$. Recall from Equation (10b) that the progress made by the $(1,\lambda)$-ES is determined by the largest projection. Thus, scenario $P_2$ is the worst case of progress.

In the case that mirrored orthogonal sampling does not use “full mutations,” namely $\lambda < 2n$, the progress rate is reduced compared to the “full mutations” case. This is because some subspace may remain uncovered when $\lambda < 2n$; it is therefore possible that the subspace in which progress could be made simply remains unexplored.

### 4.2 Empirical Aspects

For the multiparental variants of ES, we only consider their empirical convergence rates here. Similar to the convergence rate estimation in Loshchilov et al. (2011), the effect of the mirrored orthogonal sampling technique on the sphere function is investigated empirically by incorporating it into the well-known CMA-ES algorithm.

On the 20-D sphere function, the convergence rates of the $(\mu/\mu_w, \lambda_{mo})$-CMA-ES and other comparable ES variants are illustrated in Figure 5a. The empirical convergence rate is estimated as the average slope of the convergence curve over 200 runs. For all the CMA-ES variants tested here, the default population size settings are applied (Hansen, 2006): $\lambda = 4 + \lfloor 3 \ln n \rfloor$, $\mu = \lfloor \lambda/2 \rfloor$. The legend “$(1+1)$-ES” represents the $(1+1)$-ES with $1/5$ success rule step size control, while “$(1+1)$-ES optimal” denotes the $(1+1)$-ES with the scale-invariant step size setting $\sigma = \frac{1.2}{n}\lVert x^{(k)} \rVert$, which proves to be the optimal step size setting on the sphere function (Loshchilov et al., 2011). Pairwise selection is always used if the mirroring operation is present in the sampling procedure. The mirrored sampling CMA-ES with modified $d_\sigma$ (Equation (9)) is denoted as “$(\mu/\mu_w, \lambda_m)$-CMA-ES”. The curve labeled “$(\mu/\mu_w, \lambda_{mo})$-CMA-ES” stands for the mirrored orthogonal CMA-ES with modified $d_\sigma$ (Equation (8)). In addition, “optimal $d_\sigma$” represents the mirrored orthogonal CMA-ES using the optimal $d_\sigma$ tuning on the sphere function, corresponding to the minimal value of the tuning curve in Figure 2a. According to the empirical results, the convergence of the $(\mu/\mu_w, \lambda_{mo})$-CMA-ES (marked by diamonds) is slower than but close to that of the $(1+1)$-ES (marked by upside-down triangles), while the $(\mu/\mu_w, \lambda_{mo})$-CMA-ES using the optimal parameter settings gradually catches up with the convergence rates of the optimal $(1+1)$-ES in high dimensions.

The relation between the empirical convergence rate and the dimensionality is shown in Figure 5b. The algorithms tested here are the same as in Figure 5a. There is a clear leap in convergence rate between the CMA-ES and its mirrored orthogonal competitor. The advantage of the mirrored orthogonal CMA-ES over the mirrored CMA-ES is significant and preserved even for large dimensions. The upper limit of the $(\mu/\mu_w, \lambda_{mo})$-CMA-ES on the sphere function is shown by the convergence rates achieved under the optimal $d_\sigma$ tuning, which is even better than the $(1+1)$-ES for almost all dimensions. However, the optimal $d_\sigma$ setting on the sphere function turned out not to be robust on other fitness functions and is therefore not used.

## 5 Experimental Validation

The mirrored orthogonal version of CMA-ES with pairwise selection has been tested on the noiseless BBOB testbed (Hansen et al., 2010). Using the automatic comparison procedures provided by this benchmark, the BBOB results of the $(\mu/\mu_w, \lambda_{mo})$-CMA-ES are compared to those of the $(\mu/\mu_w, \lambda_m)$-CMA-ES and the $(\mu/\mu_w, \lambda)$-CMA-ES.

### 5.1 Experimental Settings

The three algorithms, $(\mu/\mu_w, \lambda_{mo})$-CMA-ES, $(\mu/\mu_w, \lambda_m)$-CMA-ES, and $(\mu/\mu_w, \lambda)$-CMA-ES, are benchmarked on BBOB-2012^{3} and their results are compared and processed by the postprocessing procedure of BBOB.

The BBOB parameter settings of the experiment are the same for all tested ES variants. The initial global step size $\sigma$ is set to 1. The maximum number of function evaluations is set to $10^4 \times n$. The initial solution (initial parent) is sampled uniformly in the hyper-box $[-4, 4]^n$. The dimensions tested in the experiment are $n \in \{2, 3, 5, 10, 20, 40\}$.

In addition, two independent but similar experiments are conducted. In the first experiment, the default population size setting, $\lambda = 4 + \lfloor 3\ln n \rfloor$, is used to configure all three algorithms. The result of this experiment is denoted as **small population** in the following. In this experiment, the strategy parameters are identical except for the $d_\sigma$ setting. The default setting $d_\sigma = 1 + 2\max\left\{0, \sqrt{(\mu_w-1)/(n+1)} - 1\right\} + c_\sigma$ is used in the standard CMA-ES, while the modified $d_\sigma$ values, as stated in Equations (9) and (8), are used for mirrored and mirrored orthogonal sampling, respectively. The other experiment uses a relatively large population size, namely $\lambda = 2n$; its result is denoted as **large population**. In this experiment, the strategy parameters are exactly the same for all three ES variants. The modified $d_\sigma$ is not used here because it was tuned under the default population size setting rather than the large one.
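For concreteness, the default strategy-parameter setting used in the small-population experiment can be sketched as below. The formulas follow the commonly published CMA-ES defaults (Hansen, 2006); the exact weight normalization is our assumption, and the paper's modified $d_\sigma$ from Equations (8) and (9) would replace the default damping computed here.

```python
import numpy as np

def default_cma_parameters(n):
    """Default CMA-ES strategy parameters for dimension n (a sketch
    following the commonly published defaults, Hansen 2006)."""
    lam = 4 + int(np.floor(3 * np.log(n)))   # default population size
    mu = lam // 2                            # number of selected parents
    # log-decreasing positive recombination weights, normalized to sum 1
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()
    mu_w = 1.0 / np.sum(w ** 2)              # variance-effective selection mass
    c_sigma = (mu_w + 2) / (n + mu_w + 5)    # step-size learning rate
    # default damping; replaced by the modified d_sigma for the
    # mirrored and mirrored orthogonal variants
    d_sigma = 1 + 2 * max(0.0, np.sqrt((mu_w - 1) / (n + 1)) - 1) + c_sigma
    return lam, mu, mu_w, c_sigma, d_sigma

# For n = 20: lambda = 4 + floor(3 ln 20) = 12, mu = 6
lam, mu, mu_w, c_sigma, d_sigma = default_cma_parameters(20)
```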

### 5.2 Results and Discussion

The BBOB noiseless testbed (Hansen et al., 2009) contains 24 test functions, which are classified into several groups such as separable, ill-conditioned, and multimodal functions. Due to space limitations, only comparisons of the aggregated empirical cumulative distribution functions (ECDFs) of run lengths over all test functions are presented here. The ECDF of the run length estimates the cumulative distribution of the number of function evaluations consumed by an ES to reach a given precision target.
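The run-length ECDF described above can be computed in a few lines: for each evaluation budget, it is the fraction of runs that reached the precision target within that budget. The run lengths below are hypothetical values for illustration only, not BBOB data.

```python
import numpy as np

def run_length_ecdf(run_lengths, budgets):
    """Fraction of runs that reached the precision target within each
    budget of function evaluations; unsuccessful runs are np.inf."""
    rl = np.asarray(run_lengths, dtype=float)
    return np.array([np.mean(rl <= b) for b in budgets])

# Hypothetical run lengths of 8 runs (np.inf = target never reached):
lengths = [120, 300, 450, 450, 900, np.inf, 1500, np.inf]
ecdf = run_length_ecdf(lengths, budgets=[100, 500, 1000, 2000])
# fractions solved within each budget: [0.0, 0.5, 0.625, 0.75]
```

Plotting such curves for several algorithms against a log-scaled budget axis gives the aggregated ECDF figures referred to in this section.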

**Small population.** The results under the default small population setting are shown in Figure 6. The comparison between mirrored orthogonal sampling and mirrored sampling is shown in Figures 6a and 6b. Four different target precision values ($10^k$ with $k \in \{1,-1,-4,-8\}$) are presented. On the left side, the comparisons in 5-D indicate a large performance improvement by mirrored orthogonal sampling, which holds for all target precisions. On the right side, the situation in 20-D still shows a small advantage of the mirrored orthogonal sampling technique over the other algorithms. The comparison between mirrored orthogonal sampling and the standard CMA-ES is given in Figures 6c and 6d; it shows approximately the same pattern as Figures 6a and 6b. The improvement introduced by mirrored orthogonal sampling decreases as the dimensionality increases.

**Large population.** For the cases where the population size is linearly related to the dimensionality, we are mainly interested in validating the theoretical performance advantage of mirrored orthogonal sampling (Sections 4.1.1 and 4.1.2). Thus, the results of the original $(\mu/\mu_w, \lambda)$-CMA-ES are not shown here. The results are illustrated in Figure 7. From the comparison between the ECDFs in 5-D (left half) and those in 20-D (right half), it is obvious that the improvement remains significant as the dimensionality grows. The more detailed results in 5-D, shown in Figure 8, indicate that the mirrored orthogonal sampling technique outperforms its mirrored counterpart on almost all test functions: highly conditioned functions $f_{10}$–$f_{14}$, multimodal functions with adequate global structure $f_{15}$–$f_{19}$, separable functions $f_1$–$f_5$, and multimodal functions with weak global structure $f_{20}$–$f_{24}$. The detailed results in 10-D, summarized in Figure 9, show roughly the same comparison as in 5-D, except that it is hard to judge which algorithm is better from the ECDFs of the multimodal functions with adequate global structure $f_{15}$–$f_{19}$ (Figure 9c).

The better experimental results for a large population suggest that the newly proposed mirrored orthogonal sampling technique is most suitable when the population size is about twice the dimensionality.

## 6 Discussion and Conclusion

In this article, we propose a new mutation operator, mirrored orthogonal sampling, to generate evenly distributed samples for evolution strategies. Several approaches to derandomized sampling, including mirrored sampling, are briefly introduced. Our theoretical analysis shows that the performance improvement given by mirrored sampling vanishes in a large population setting. As a remedy to this limitation, random orthogonal samples are introduced as a possible improvement of mirrored sampling. Pairwise selection is also used to avoid the undesired bias caused by the mirroring operation. The resulting sampling method, called mirrored orthogonal sampling, is applied to the CMA-ES after some parameter tuning. The performance of random, mirrored, and mirrored orthogonal sampling is compared both analytically and empirically. On the sphere function, the $(1,\lambda)$-ES with mirrored orthogonal sampling is only slightly slower than the $(1+1)$-ES with $1/5$ rule, as shown in the empirical analysis. Finally, we tested the $(\mu/\mu_w, \lambda_{mo})$-CMA-ES on the BBOB benchmark regarding its performance for small and for large population sizes. The results reveal the advantages of mirrored orthogonal sampling over mirrored sampling and over the standard $(\mu/\mu_w, \lambda)$-CMA-ES. In particular, on highly conditioned and multimodal functions, the competitiveness of the new mirrored orthogonal sampling becomes significant. As discussed in the theoretical analysis (Section 4.1), the proposed method is well suited for problems where the dimensionality is larger than or similar to half the population size. However, in very high dimensions, the advantage of the new method gradually diminishes.
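The core idea summarized above can be sketched in code. This is our reading of the operator, under the assumption that the orthogonal directions come from a QR decomposition and that each direction is given a chi-distributed length so that its norm matches an $n$-dimensional standard Gaussian vector; the paper's exact operator may differ in these details.

```python
import numpy as np

def mirrored_orthogonal_sample(n, lam, sigma=1.0, rng=None):
    """Sketch of mirrored orthogonal sampling (our reading):
    draw lam/2 mutually orthogonal directions, then append their
    mirrored (negated) copies, giving lam mutation vectors."""
    assert lam % 2 == 0 and lam // 2 <= n
    rng = np.random.default_rng(rng)
    half = lam // 2
    # orthonormal directions from a Gaussian matrix via QR (assumption)
    q, _ = np.linalg.qr(rng.standard_normal((n, half)))
    # chi-distributed lengths so each sample's norm matches that of an
    # n-dimensional standard Gaussian vector (assumption)
    lengths = np.sqrt(rng.chisquare(n, size=half))
    z = q * lengths                      # scale each column
    return sigma * np.hstack([z, -z]).T  # rows: lam mirrored samples
```

With pairwise selection, only the better sample of each mirrored pair $(z_i, -z_i)$ would enter the ranking, which removes the bias that plain selection would introduce under mirroring.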

Some interesting future directions can be identified, based on the suggested new method of generating mutations. First, the pairwise selection method is chosen here to avoid the undesired bias. A more advanced idea introduced in Auger et al. (2011b), *selective mirroring*, is also a suitable option for use with mirrored orthogonal sampling. More work is needed to identify the best possible selection method for mirrored orthogonal sampling.

Second, more parameter tuning should be done. The learning rates $c_1$ and $c_\mu$ for the rank-one and rank-$\mu$ updates of the covariance matrix remain unchanged from their suggested settings. It is important to adapt these parameters to the new sampling technique to obtain the best possible speed-up of the algorithm.

Third, concerning the progress rate analysis (Section 4.1), deriving the distribution function of the uniform random orthogonal vectors remains an open problem, so the exact progress rate formula for mirrored orthogonal sampling is unknown. This is planned as another part of future work. Finally, it would be interesting to apply mirrored orthogonal sampling to more recent CMA-ES variants such as the active CMA-ES.

## Acknowledgments

The authors are grateful for the financial support by the Dutch Research Project (NWO) PROMIMOOC (project number 650.002.001).

## Notes

^{1}

Note that a mirrored pair can lie on the partition boundary. However, this event has measure zero in $\mathbb{R}^n$.

^{2}

An $n$-dimensional rotation matrix $R$ satisfies the conditions $R^{-1} = R^\top$ and $\det R = 1$. All such matrices form the so-called special orthogonal group $SO(n)$.
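A matrix uniformly distributed over $SO(n)$ (with respect to the Haar measure) can be sampled by QR-decomposing a Gaussian matrix and fixing the signs; this standard construction is a sketch for illustration, not the paper's own procedure.

```python
import numpy as np

def random_rotation(n, rng=None):
    """Draw a rotation matrix Haar-uniformly from SO(n)."""
    rng = np.random.default_rng(rng)
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    # fix column signs to make the QR factorization unique, which
    # yields the Haar-uniform distribution over O(n)
    q *= np.sign(np.diag(r))
    if np.linalg.det(q) < 0:   # restrict from O(n) to SO(n)
        q[:, 0] = -q[:, 0]
    return q

R = random_rotation(5, rng=0)
# R satisfies R^{-1} = R^T and det(R) = 1 up to floating-point error
```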

^{3}

The exact version is v11.06.