## Abstract

This paper analyzes a (1, λ)-Evolution Strategy, a randomized comparison-based adaptive search algorithm, optimizing a linear function with a linear constraint. The algorithm uses resampling to handle the constraint. Two cases are investigated: first, the case where the step-size is constant, and second, the case where the step-size is adapted using cumulative step-size adaptation. For each case we exhibit a Markov chain describing the behavior of the algorithm. Stability of the chain implies, by applying a law of large numbers, either convergence or divergence of the algorithm; divergence is the desired behavior. In the constant step-size case, we show stability of the Markov chain and prove the divergence of the algorithm. In the cumulative step-size adaptation case, we prove stability of the Markov chain in the simplified case where the cumulation parameter equals 1, and discuss steps to obtain similar results for the full (default) algorithm where the cumulation parameter is smaller than 1. The stability of the Markov chain allows us to deduce geometric divergence or convergence, depending on the dimension, constraint angle, population size, and damping parameter, at a rate that we estimate. Our results complement previous studies where stability was assumed.

## 1 Introduction

Derivative-free optimization (DFO) methods are tailored for the optimization of numerical problems in a black box context, where the objective function is pictured as a black box that *solely* returns *f* values (in particular, no gradients are available).

Evolution strategies (ES) are *comparison-based* randomized DFO algorithms. At iteration *t*, candidate solutions are sampled from a multivariate normal distribution centered on a mean vector. The candidate solutions are ranked according to *f*, and the updates of the mean and of the other parameters of the distribution (usually a step-size and a covariance matrix) are performed using solely the ranking information given by the candidate solutions. Since evolution strategies do not directly use the function values of the new points, but only how the objective function *f* ranks the different samples, they are invariant to the composition (to the left) of the objective function with a strictly increasing function.
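As a concrete illustration, the comparison-based update and the resulting invariance can be sketched as follows (a minimal sketch with illustrative names, not the paper's algorithm; selection keeps only the best of the sampled candidates):

```python
import numpy as np

# Minimal sketch of one comparison-based ES iteration (illustrative names,
# not taken from the paper): sample lam candidates around the mean x,
# rank them by f, and move the mean to the best candidate.
def es_iteration(f, x, sigma, lam, rng):
    candidates = x + sigma * rng.standard_normal((lam, x.size))
    ranked = sorted(candidates, key=f)  # only the ranking of f-values is used
    return ranked[-1]                   # maximization: keep the best sample

f = lambda y: y[0]                      # linear objective
g_of_f = lambda y: np.arctan(f(y))      # strictly increasing transformation of f

x0 = np.zeros(2)
# Identical seeds give identical samples, hence identical rankings and
# identical updates: the ES is invariant to composing f with g.
x1 = es_iteration(f, x0, 1.0, 10, np.random.default_rng(7))
x2 = es_iteration(g_of_f, x0, 1.0, 10, np.random.default_rng(7))
assert np.allclose(x1, x2)
```

The assertion makes the invariance tangible: composing *f* with any strictly increasing map leaves the ranking, and therefore the update, unchanged.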

This property and the black box scenario make evolution strategies suited to a wide class of real-world problems, where constraints on the variables are often imposed. Different techniques for handling constraints in randomized algorithms have been proposed; see Mezura-Montes and Coello (2011) for a survey. For ES, common techniques are resampling (resampling a solution until it lies in the feasible domain), repair of infeasible solutions by projecting them onto the feasible domain (Arnold, 2011b, 2013), penalty methods where infeasible solutions are penalized either by a quantity depending on the distance to the constraint, if it can be computed, with adaptive penalty weights (e.g., Hansen et al., 2009; Arnold and Porter, 2015), or by the constraint value itself (e.g., stochastic ranking; Runarsson and Yao, 2000), and methods inspired by multiobjective optimization (e.g., Mezura-Montes and Coello, 2008).

In this paper we focus on the resampling method and study it on a simple constrained problem. More precisely, we study a (1, λ)-ES optimizing *a linear function with a linear constraint*, resampling any infeasible solution until a feasible solution is sampled. The linear function models the situation where the current point is far from the optimum relative to the step-size, and “solving” this function means diverging. The linear constraint models being close to the constraint relative to the step-size and far from other constraints. Because of the invariance of the algorithm to the composition of the objective function with a strictly increasing map, the linear function can be composed with a non-differentiable function with many discontinuities without any impact on our analysis.
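The resampling constraint handling described above can be sketched as follows (a hedged sketch with our own function names; a step is redrawn until the candidate lies in the feasible half-space):

```python
import numpy as np

# Hedged sketch of constraint handling by resampling (names are ours):
# redraw the normal step until the candidate x + step is feasible.
def sample_feasible_step(x, sigma, g, rng):
    while True:
        step = sigma * rng.standard_normal(x.size)
        if g(x + step) <= 0:        # feasible iff the constraint value is <= 0
            return step

rng = np.random.default_rng(0)
g = lambda y: y[1]                  # linear constraint: feasible iff y[1] <= 0
x = np.array([0.0, -0.5])           # current point at distance 0.5 from the constraint
steps = [sample_feasible_step(x, 1.0, g, rng) for _ in range(1000)]
assert all(g(x + s) <= 0 for s in steps)
```

Note that the loop terminates with probability 1 whenever the feasible half-space has positive probability under the sampling distribution, but the number of redraws is unbounded; this motivates the finite-sample construction of Section 3.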

The problem we address was studied previously for different step-size adaptation mechanisms and different constraint-handling methods: with constant step-size, self-adaptation, and cumulative step-size adaptation, and with the constraint handled through resampling or repair of infeasible solutions (Arnold, 2011a, 2012, 2013). The conclusion is that with step-size adaptation, the (1, λ)-ES fails to diverge unless some requirements on internal parameters of the algorithm are met. However, the approach followed in the aforementioned studies relies on finding simplified theoretical models to explain the behavior of the algorithm: typically these models arise from approximations (considering some random variables equal to their expected value, etc.) and assume mathematical properties, like the existence of stationary distributions of underlying Markov chains, without accompanying proof.

In contrast, our motivation is to study the algorithm without simplifications and to prove rigorously different mathematical properties of the algorithm that allow deducing the exact behavior of the algorithm, as well as to provide tools and methodology for such studies. Our theoretical studies need to be complemented by simulations of the convergence/divergence rates. The mathematical properties that we derive show that these numerical simulations converge fast. Our results are largely in agreement with the aforementioned studies of simplified models, thereby backing up their validity.

As for the step-size adaptation mechanism, our aim is to study cumulative step-size adaptation (CSA), also called path length control, the default step-size mechanism of the CMA-ES algorithm (Hansen and Ostermeier, 2001). The mathematical object to study for this purpose is a discrete-time, continuous-state-space Markov chain defined as the pair: evolution path and distance to the constraint normalized by the step-size. More precisely, stability properties like irreducibility and existence of a stationary distribution of this Markov chain need to be studied to deduce the geometric divergence of CSA and to have a rigorous mathematical framework for Monte Carlo simulations that allow studying the influence of different parameters of the algorithm. We start by illustrating the methodology in detail on the simpler case where the step-size is constant. We show in this case that the distance to the constraint reaches a stationary distribution. This latter property was assumed in a previous study (Arnold, 2011a). We then prove that the algorithm diverges at a constant speed. We then apply this approach to the case where the step-size is adapted using path length control. We show that in the special case where the cumulation parameter *c* equals 1, the expected logarithmic step-size change converges to a constant *r*, and the average logarithmic step-size change converges in probability to the same constant, which depends on parameters of the problem and of the algorithm. This implies geometric divergence (if *r* > 0) or convergence (if *r* < 0) at the rate *r*, for which estimates are provided.

This paper is organized as follows. In Section 2 we define the -ES using resampling and the problem. In Section 3 we provide some preliminary derivations on the distributions that come into play for the analysis. In Section 4 we analyze the constant step-size case. In Section 5 we analyze the cumulative step-size adaptation case. Finally, in Section 6, we discuss our results and our methodology.

A preliminary version of this paper appeared in conference proceedings (Chotard et al., 2014). The analysis of path length control with cumulation parameter equal to 1 is, however, fully new, as is the discussion on how to analyze the case with cumulation parameter smaller than 1. Also, Figures 4–11 as well as the convergence of the progress rate in Theorem 10 are new.

### 1.1 Notation

Throughout this article, we denote by φ the density function of the standard multivariate normal distribution (the dimension being clear from the context), and by Φ the cumulative distribution function of a standard univariate normal distribution. The standard (unidimensional) normal distribution is denoted by N(0, 1), the (*n*-dimensional) multivariate normal distribution with identity covariance matrix by N(0, Idₙ), and the *i*th-order statistic of λ i.i.d. standard normal random variables by N_{i:λ}. The uniform distribution on an interval *I* is denoted by U_I. The set of natural numbers (including 0) is denoted by ℕ, and the set of real numbers by ℝ; ℝ₊ denotes the set of nonnegative real numbers, and for a set *A*, 1_A denotes the indicator function of *A*. For a topological space 𝒳, ℬ(𝒳) denotes the Borel σ-algebra of 𝒳. We denote by Λₙ the Lebesgue measure on ℝⁿ, and for X ∈ ℬ(ℝⁿ), Λ_X denotes the trace measure A ↦ Λₙ(A ∩ X). For two vectors **x** and **y**, we denote by [**x**]ᵢ the *i*th coordinate of **x** and by **x** · **y** the scalar product of **x** and **y**. For (a, b) ∈ ℕ² with a ≤ b, we denote by [a..b] the interval of integers between *a* and *b*. The gamma function is denoted by Γ. For two random vectors *X* and *Y*, we write *X* ∼ *Y* if *X* and *Y* are equal in distribution. For (X_t)_{t∈ℕ} a sequence of random variables and *X* a random variable, we write X_t → X a.s. if X_t converges almost surely to *X*, and X_t → X in probability if X_t converges in probability to *X*. For *X* a random variable and μ a probability measure, we denote by E[X] the expected value of *X*, and by E_μ[X] the expected value of *X* when *X* has distribution μ.

## 2 Problem Statement and Algorithm Definition

### 2.1 (1, λ)-ES with Resampling

In this paper we study the behavior of a (1, λ)-Evolution Strategy *maximizing* a function *f* : ℝⁿ → ℝ, with a constraint defined by a function *g* restricting the feasible space. To handle the constraint, the algorithm resamples any infeasible solution until a feasible solution is found.

The *i*th candidate solution is obtained by resampling until feasibility, and the realization of the multivariate normal distribution giving this candidate is called a feasible step. Note that a feasible step is not distributed as a multivariate normal distribution.

### 2.2 Linear Fitness Function with Linear Constraint

Both *f*, the function that we optimize, and *g*, the constraint, are linear functions. W.l.o.g., we assume that the gradient of *f* is a unit vector. We denote by **n** a normal vector to the constraint hyperplane, and choose an orthonormal Euclidean coordinate system with its origin located on the constraint hyperplane and its first basis vector equal to the gradient of *f*. The vector **n** lives in the plane generated by the first two basis vectors and forms a positive angle with the gradient; we define θ as the constraint angle between them, and restrict our study to θ ∈ (0, π/2). The function *g* can then be seen as a signed distance to the linear constraint: a point is feasible if and only if it lies on the feasible side of the hyperplane (see Figure 1). Overall, the problem reads as given in Eq. (7).

Although these vectors live in ℝⁿ, because of the choice of the coordinate system and the independence of the coordinates, only the first two coordinates of these vectors are affected by the resampling implied by *g* and the selection according to *f*; the remaining coordinates are unaffected standard normal coordinates. With an abuse of notation, each of these vectors will henceforth denote its two-dimensional counterpart consisting of its first two coordinates, and only the first two basis vectors of the coordinate system will be used.

We initialize the algorithm by choosing an initial point and a strictly positive initial step-size, which implies that the initial normalized distance to the constraint is positive.

## 3 Preliminary Results and Definitions

Throughout this section we derive the probability density functions of the feasible step and of the selected step, and define each of them as a function of the normalized distance to the constraint and of an i.i.d. sequence of random vectors.

### 3.1 Feasible Steps

The *i*th feasible step is distributed as the standard multivariate normal distribution truncated by the constraint, as stated in the following lemma.

**Lemma 1.** Consider a (1, λ)-ES with resampling optimizing a function *f* under a constraint function *g*. If *g* is a linear form determined by a vector as in Eq. (6), then the distribution of the feasible step depends only on the normalized distance to the constraint, and its density, given the value of this normalized distance, is the standard normal density truncated to the feasible half-plane and renormalized.

It will be important in the sequel to be able to express the feasible step as a function of the normalized distance and of a *finite* number of random samples. Hence we give an alternative way to sample it, rather than the resampling technique, which involves an unbounded number of samples.

**Lemma 2.** Consider a (1, λ)-ES with resampling optimizing a function *f* under a constraint function *g*, where *g* is a linear form determined by a vector as in Eq. (6). Let the feasible step be the random vector described in Lemma 1, and let R be the two-dimensional rotation matrix of angle θ. Then the feasible step can be expressed as in Eq. (11), where F⁻¹ denotes the generalized inverse of the cumulative distribution function of the truncated normal coordinate, with an i.i.d. sequence of uniform random variables and an i.i.d. sequence of standard normal random variables.

*Proof.* We define a new coordinate system as the image of the original basis by the rotation of angle θ (see Figure 1). In the new basis, only the coordinate along the constraint normal is affected by the resampling. Hence this coordinate follows a truncated normal distribution with the cumulative distribution function given above, while the other coordinate follows an independent standard normal distribution. Using the fact that if a random variable has cumulative distribution function *F*, then F⁻¹(U), for F⁻¹ the generalized inverse of *F* and *U* uniform on [0, 1], has the same distribution as this random variable, we obtain Eq. (11).
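The inverse-CDF construction used in this proof can be sketched numerically (a sketch with our own names, using the standard library's `statistics.NormalDist`; `delta` plays the role of the normalized distance that truncates the coordinate along the constraint normal):

```python
import numpy as np
from statistics import NormalDist

nd = NormalDist()  # standard univariate normal: cdf = Phi, inv_cdf = Phi^{-1}

# For a standard normal truncated to (-inf, delta], the CDF is
# F_delta(x) = Phi(x) / Phi(delta) for x <= delta, so its generalized
# inverse is F_delta^{-1}(u) = Phi^{-1}(u * Phi(delta)).
def truncated_normal_via_inverse_cdf(delta, u):
    return nd.inv_cdf(u * nd.cdf(delta))

rng = np.random.default_rng(3)
delta = 0.7
u = rng.uniform(1e-12, 1.0, size=10_000)  # uniform variates in (0, 1)
samples = np.array([truncated_normal_via_inverse_cdf(delta, ui) for ui in u])
assert samples.max() <= delta             # truncation is respected
```

A single uniform draw thus replaces the unbounded loop of the resampling technique, which is exactly what makes the finite-sample representation of the lemma possible.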

We now extend our study to the selected step.

### 3.2 Selected Step

The selected step is chosen among the different feasible steps to maximize the function *f*, and it has the density described in the following lemma.

*Proof.* The function *f* being linear, the rankings of the candidates correspond to the order statistics of the feasible steps. We obtain the joint cumulative distribution of the selected step by summing over disjoint events. The feasible steps being independent and identically distributed, differentiating with respect to *x* and *y* yields the density of Eq. (12).
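The selection of the best of λ feasible steps under a linear objective can be sketched as follows (our names; the selected step's objective value is exactly the largest order statistic of the candidates' objective values):

```python
import numpy as np

# Sketch of the selection step (names are ours): among lam feasible steps,
# keep the one maximizing the linear objective f.
def select_step(feasible_steps, f):
    return max(feasible_steps, key=f)

rng = np.random.default_rng(5)
f = lambda s: s[0]                     # linear objective: first coordinate
lam = 10
steps = rng.standard_normal((lam, 2))  # unconstrained steps, for illustration
best = select_step(steps, f)
# The selected value is the maximum order statistic of the f-values.
assert f(best) == max(f(s) for s in steps)
```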

We may now obtain the marginal densities of the two coordinates of the selected step.

In order to apply a law of large numbers, we will need to express the selected step as a function of a *finite* number of i.i.d. random variables. Using the notation of Lemma 2, we define a function that, given the normalized distance and a pair of uniform and normal samples, produces the resampled step in the original coordinate system (resp. the rotated one). Finally, we define a second function that selects the best of λ such steps. As shown in the following proposition, given the normalized distance and the samples, this function is distributed as the selected step.

## 4 Constant Step-Size Case

We illustrate in this section our methodology on the simple case where the step-size is constant, and prove that the algorithm diverges in probability at a constant speed and that the progress rate (see Arnold, 2011a, Eq. 2) converges to a strictly positive constant (Theorem 11). The analysis of CSA is then a generalization of the results presented here, with more technical results to derive. Note that with constant step-size the progress rate definition coincides with the fitness gain.

As suggested by Arnold (2011a), the sequence of normalized distances to the constraint plays a central role in the analysis, and we show that it admits a stationary measure. We first prove that this sequence is a homogeneous Markov chain.

It follows from the definition of the chain and from Proposition 5 that the next normalized distance has the same distribution as a time-independent function of the current one and of an i.i.d. sequence of samples; hence the sequence is a homogeneous Markov chain.

The invariant measure also underlies the study carried out by Arnold (2011a, Section 4), where it is stated, “Assuming for now that the mutation strength is held constant, when the algorithm is iterated, the distribution of the normalized distance values tends to a stationary limit distribution.” We provide a formal proof that the chain indeed admits a stationary limit distribution, and prove some other useful properties that allow us to conclude on the divergence of the algorithm.

### 4.1 Study of the Stability of the Markov Chain

The *t*-step transition kernel *P^t* is defined by *P^t*(δ, *A*) = Pr(the chain is in *A* after *t* steps when started from δ). From the transition kernel, we now derive the first properties of the Markov chain. First we investigate the so-called φ-irreducibility property.

A Markov chain on a state space is *φ-irreducible* if there exists a nontrivial measure φ such that for every set *A* with φ(*A*) > 0 and for every starting point δ, there exists *t* > 0 such that *P^t*(δ, *A*) > 0. We denote by ℬ₊ the set of Borel sets with strictly positive φ-measure.

We now define *small sets* and *petite sets*. A set *C* is called a small set if there exist *t* ∈ ℕ and a nontrivial measure ν such that for all sets *A* and all δ ∈ *C*, *P^t*(δ, *A*) ≥ ν(*A*). A set *C* is called a petite set if there exist a probability measure α on ℕ and a nontrivial measure ν such that for all sets *A* and all δ ∈ *C*, Σ_{t∈ℕ} α(*t*) *P^t*(δ, *A*) ≥ ν(*A*). A small set is therefore also a petite set. The existence of a small set, combined with a control of the Markov chain outside the small set, allows deducing powerful stability properties of the Markov chain. If there exists a ν-small set *C* such that ν(*C*) > 0, then the Markov chain is called strongly aperiodic.

For every set *A* with strictly positive Lebesgue measure and every starting point, the one-step transition probability into *A* is strictly positive. Hence the chain is irreducible with respect to the Lebesgue measure.

(the continuity in *x* coming from the dominated convergence theorem). Given a compact set *C* of the state space, there exists a lower bound on the transition density valid for all points of *C*. Hence for all sets *A* and all points of *C*, the one-step transition probability dominates a nontrivial measure. The measure being nontrivial, this shows that compact sets are small, and for *C* a compact set of positive measure we have ν(*C*) > 0; hence the chain is strongly aperiodic. Note also that the same reasoning holds for sets of the form (0, *M*] instead of *C*; hence these sets are also small sets.

A chain is *Harris-recurrent* if for all sets *A* ∈ ℬ₊ and for all starting points, the occupation time of *A* (the number of time steps the chain spends in *A*) is almost surely infinite. We show that the Markov chain is positive and Harris-recurrent by using so-called Foster–Lyapunov drift conditions: define the *drift* operator for a positive function *V* as ΔV(δ) = E[V(next state) | current state = δ] − V(δ). Drift conditions express that outside a small set, the drift operator is negative. We show a drift condition for *V*-geometric ergodicity, where, given a function *f* ≥ 1, a positive and Harris-recurrent chain with invariant measure π is called *f-geometrically ergodic* if π(*f*) is finite and there exists a rate r_f > 1 such that the series Σ_t r_f^t ∥*P^t*(δ, ·) − π∥_f is finite, where for a signed measure ν the norm ∥ν∥_f denotes the supremum of |ν(*g*)| over all functions *g* bounded by *f* in absolute value. To prove *V*-geometric ergodicity, we prove that there exist a small set *C*, constants b ∈ ℝ and ε > 0, and a function *V* ≥ 1, finite for at least one point, such that for all δ, ΔV(δ) ≤ −εV(δ) + b 1_C(δ). If the Markov chain is φ-irreducible and aperiodic, this drift condition implies that the chain is *V*-geometrically ergodic (Meyn and Tweedie, 1993, Theorem 15.0.1), as well as positive and Harris-recurrent.
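The geometric drift condition just described can be written compactly as follows (a standard form, reconstructed here from Meyn and Tweedie, 1993, Ch. 15, with the small set $C$ and constants $\varepsilon$, $b$ as above):

```latex
\Delta V(\delta) \;=\; \int V(y)\, P(\delta, \mathrm{d}y) \;-\; V(\delta)
\;\le\; -\,\varepsilon\, V(\delta) \;+\; b\, \mathbf{1}_C(\delta)
\qquad \text{for all } \delta ,
```

with $C$ a small set, $\varepsilon > 0$, and $b < \infty$; the negativity of the drift outside $C$ forces the chain back toward $C$ at a geometric rate.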

Because sets of the form (0, *M*] with *M* > 0 are small sets, and drift conditions investigate the negativity outside a small set, we need to study the chain for large values of the normalized distance. The following technical lemma studies the limit of the drift quotient as the normalized distance tends to infinity.

For the proof see the appendix. We are now ready to prove a drift condition for geometric ergodicity.

Consider a (1, λ)-ES with resampling and with constant step-size optimizing the constrained problem (7), and let the sequence of normalized distances be the Markov chain exhibited in Eq. (18). This Markov chain is *V*-geometrically ergodic, with *V* an exponential function with small enough exponent, and is Harris-recurrent and positive with an invariant probability measure.

Using Lemma 4 we obtain the bound in the previous equation. As its right-hand side is finite, we can exchange the integral with the series by Fubini’s theorem, which with Taylor series yields the limit of the drift quotient. Since this limit is strictly smaller than 1 for a small enough exponent, there exist ε > 0, b < ∞, and *M* > 0 such that the drift inequality holds.

According to Proposition 7, the set (0, *M*] is a small set, hence it is petite (Meyn and Tweedie, 1993, Prop. 5.5.3). Furthermore, the chain is a φ-irreducible aperiodic Markov chain, so it satisfies the conditions of Theorem 15.0.1 from Meyn and Tweedie (1993), which together with Lemma 15.2.2, Theorem 9.1.8, and Theorem 14.0.1 from Meyn and Tweedie (1993) proves the proposition.

We have proved rigorously the existence (and uniqueness) of an invariant measure for the Markov chain, which provides the steady-state behavior described by Arnold (2011a, Section 4). As the Markov chain is positive and Harris-recurrent, we may now apply a law of large numbers (Meyn and Tweedie, 1993, Theorem 17.1.7) in Eq. (19) to obtain the divergence of the algorithm and an exact expression of the divergence rate.

Here the rate is expressed as an expectation under the invariant probability measure of Proposition 9. From Proposition 4, the Markov chain of normalized distances is Harris-recurrent and positive, and since the sampling sequence is i.i.d., the extended chain is also Harris-recurrent and positive with an invariant probability measure, so to apply the law of large numbers (Meyn and Tweedie, 1993, Theorem 17.0.1) we only need the integrand to be integrable. Lemma 4 together with Jensen’s inequality shows that the relevant expectation is finite; therefore the integrand is bounded by a constant. As the invariant measure is a probability measure, the integrand is integrable. Hence we may apply the LLN to Eq. (19), and the equality in distribution in Eq. (19) allows us to deduce the convergence in probability of its left-hand side to the right-hand side of the previous equation.

From Eq. (19), the expected logarithmic progress is the expectation of the integrand under the *t*-step distribution, and as the integrand is integrable, Fubini’s theorem yields the expression of the progress rate. According to Proposition 9, the chain is *V*-geometrically ergodic, so the distance in *V*-norm between the *t*-step transition kernel and the invariant measure decreases geometrically in *t*. We showed that the integrand is bounded, so there exists *k* such that the difference between its expectations under the *t*-step distribution and under the invariant measure is bounded by *k* times a geometric factor, which converges to 0 when *t* goes to infinity.

We showed rigorously the divergence of the algorithm, gave an exact expression of the divergence rate, and showed that the progress rate converges to the same rate. The fact that the chain is *V*-geometrically ergodic gives in addition a geometric bound on the distance between the *t*-step distribution and the invariant distribution. This implies that the invariant distribution can be simulated efficiently by a Monte Carlo simulation, giving precise estimates of the divergence rate.

A Monte Carlo simulation of the divergence rate on the right-hand side of expressions (30) and (31), over a large number of time steps, gives the progress rate of Arnold (2011a), which once normalized yields Figure 2. We normalize per λ, as in evolution strategies the cost of the algorithm is assumed to be the number of *f*-calls. We see that for small constraint angles, the normalized serial progress rate is roughly independent of λ; only for larger constraint angles does the serial progress rate depend on λ, where smaller λ are preferable.

Figure 3 is obtained through simulations of the Markov chain defined in Eq. (18), over a large number of time steps, where the values of the normalized distance are averaged over time. We see that for the largest constraint angles the chain stays far from the constraint, since the selection does not attract toward the constraint anymore. With a larger population size the algorithm is closer to the constraint, as better samples are more likely to be found close to the constraint.
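The Monte Carlo procedure behind Figures 2 and 3 can be sketched generically as follows (the true update of Eq. (18) is replaced by a stand-in AR(1) recursion, purely to illustrate the time-averaging that the law of large numbers justifies for Harris-positive chains):

```python
import numpy as np

# Generic sketch of a Monte Carlo time average along a Markov chain: the
# LLN for Harris-positive chains guarantees convergence of the average of
# h over the trajectory to the stationary expectation of h.
def time_average(update, h, state0, T, rng):
    state, total = state0, 0.0
    for _ in range(T):
        state = update(state, rng)
        total += h(state)
    return total / T

# Stand-in update (NOT the chain of Eq. (18)): an AR(1) recursion whose
# stationary distribution has mean 0.
update = lambda d, rng: 0.5 * d + rng.standard_normal()
avg = time_average(update, lambda d: d, 1.0, 200_000, np.random.default_rng(2))
assert abs(avg) < 0.05   # close to the stationary mean 0
```

The geometric ergodicity proved above is what makes such averages reliable: the bias from the initial state decays geometrically, so moderate trajectory lengths already give precise estimates.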

## 5 Cumulative Step-Size Adaptation

In this section we apply the techniques introduced in the previous section to the case where the step-size is adapted using cumulative step-size adaptation. This technique was studied on sphere functions by Arnold and Beyer (2004) and on ridge functions by Arnold and MacLeod (2008).

The evolution path cumulates the selected steps. If *f* ranked the different samples uniformly at random, these samples being normally distributed, the evolution path would also follow a standard normal distribution independently of the value of *c*. Its expected squared length would then equal the dimension *n*, since this is the distribution of the evolution path under random selection. If the squared length of the path is greater (resp. lower) than *n*, then the step-size is increased (resp. decreased) following the CSA update rule, where the damping parameter determines how much the step-size can change and can be set here to its default value.

Note that for *c* < 1, the update is a function only of the evolution path, the selected step, and *K_t*.
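One common form of the CSA update can be sketched as follows (a sketch with our parameter names, not necessarily the exact constants of the paper): the path accumulates selected steps, and the step-size grows exactly when the squared path length exceeds its expectation *n* under random selection.

```python
import numpy as np

# Sketch of a cumulative step-size adaptation update (names/constants ours):
#   path  <- (1 - c) * path + sqrt(c * (2 - c)) * selected_step
#   sigma <- sigma * exp( c / (2 * d_sigma) * (||path||^2 / n - 1) )
def csa_update(path, sigma, selected_step, c, d_sigma, n):
    path = (1.0 - c) * path + np.sqrt(c * (2.0 - c)) * selected_step
    sigma *= np.exp(c / (2.0 * d_sigma) * (path @ path / n - 1.0))
    return path, sigma

n = 2
path, sigma = np.zeros(n), 1.0
rng = np.random.default_rng(4)
for _ in range(100):
    # Under random selection the selected step is standard normal, so the
    # expected squared path length is n and the log step-size is unbiased.
    path, sigma = csa_update(path, sigma, rng.standard_normal(n), 0.5, 1.0, n)
assert sigma > 0.0
```

The normalization sqrt(c(2 − c)) keeps the path standard normal under random selection, which is what makes ||path||² comparable with its expectation *n*.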

We prove in the next proposition that for *c* < 1, the pair consisting of the normalized distance and the evolution path is a homogeneous Markov chain, and we make its update function explicit. In the case where *c* = 1, the chain reduces to the normalized distance alone.

With Eq. (32) and Eq. (17) we get Eq. (37). From Lemma 5 it follows that the next state is a function of only the current state and of i.i.d. random variables; hence the pair is a time-homogeneous Markov chain. Fixing *c* = 1 in Eqs. (36) and (37) immediately yields Eq. (38); then the normalized distance is a function of only its current value and of i.i.d. random variables, so in this case it is a time-homogeneous Markov chain on its own.

To apply the LLN we need the Markov chain to be Harris-positive and to satisfy the properties mentioned in the following lemma.

We believe that the latter result can be generalized to the case *c* < 1 if, for any starting point, there exists a *t* such that there is a path of events of length *t* leading from the starting point into a set of small enough normalized distances and evolution paths.

Since compact sets are small sets for this chain, it is sufficient to show that the limits of the drift quotient at 0 and at infinity are negative. These limits result from the limits studied in the following lemma, corresponding to the decomposition in Eq. (41).

The proof of this lemma consists in applications of Lebesgue’s dominated convergence theorem (see appendix).

We now prove the Harris positivity of the chain by proving a stronger property, namely geometric ergodicity, which we show using the drift inequality (29).

Consider a (1, λ)-ES with resampling and cumulative step-size adaptation maximizing the constrained problem (7). For *c* = 1, the Markov chain from Proposition 11 is *V*-geometrically ergodic, with *V* an exponential function with small enough exponent, and positive Harris with an invariant measure.

Take for *V* a positive exponential function (its exponent is strictly positive and is specified later), a random vector distributed as the selected step, and *K* a random variable following a chi-squared distribution. We first study the drift quotient for large values of the normalized distance. From Eq. (41) we have the drift quotient expressed with the quantities defined in Eq. (35) and in Eq. (16). From Lemma 6, following the same notation as in the lemma, when the normalized distance tends to infinity and if the exponent is small enough, the right-hand side of the previous equation converges to a limit that we expand with Taylor series. Furthermore, as the density of the relevant variable is integrable for a small enough exponent, we can use Fubini’s theorem to invert the series (which are integrals for the counting measure) and the integral. The same reasoning holds for *E*₂ and *E*₃ (for *E*₃, with the chi-squared distribution, an additional integrability condition is needed for all terms). From Chotard et al. (2012a), the resulting limit is strictly smaller than 1; therefore, for a small enough exponent, there exist ε > 0 and *M* such that the drift inequality holds whenever the normalized distance exceeds *M*.

From Lemma 6, the corresponding limits at 0 are finite. Hence, using Eq. (46), the drift quotient is bounded in the neighborhood of 0, so there exist constants such that the drift inequality holds for all small values of the normalized distance.