This paper analyzes a (1,λ)-Evolution Strategy, a randomized comparison-based adaptive search algorithm optimizing a linear function with a linear constraint. The algorithm uses resampling to handle the constraint. Two cases are investigated: first, the case where the step-size is constant, and second, the case where the step-size is adapted using cumulative step-size adaptation. We exhibit for each case a Markov chain describing the behavior of the algorithm. Stability of the chain implies, by applying a law of large numbers, either convergence or divergence of the algorithm. Divergence is the desired behavior. In the constant step-size case, we show stability of the Markov chain and prove the divergence of the algorithm. In the cumulative step-size adaptation case, we prove stability of the Markov chain in the simplified case where the cumulation parameter equals 1, and discuss steps to obtain similar results for the full (default) algorithm where the cumulation parameter is smaller than 1. The stability of the Markov chain allows us to deduce geometric divergence or convergence, depending on the dimension, constraint angle, population size, and damping parameter, at a rate that we estimate. Our results complement previous studies where stability was assumed.
Derivative-free optimization (DFO) methods are tailored for the optimization of numerical problems in a black box context, where the objective function is pictured as a black box that solely returns f values (in particular, no gradients are available).
Evolution strategies (ES) are comparison-based randomized DFO algorithms. At iteration t, solutions are sampled from a multivariate normal distribution centered in a vector X_t. The candidate solutions are ranked according to f, and the updates of X_t and other parameters of the distribution (usually a step-size σ_t and a covariance matrix) are performed using solely the ranking information given by the candidate solutions. Since ES do not directly use the function values of the new points but only how the objective function f ranks the different samples, they are invariant to the composition (to the left) of the objective function by a strictly increasing function.
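This invariance can be illustrated with a short script: since selection uses only the ranking of f-values, composing f with any strictly increasing map leaves the selected candidate unchanged. A minimal sketch, where the linear objective and the map y ↦ y³ are our own illustrative choices:

```python
import random

def select_best(candidates, f):
    # Comparison-based selection: only the ranking induced by f is used.
    return max(candidates, key=f)

rng = random.Random(0)
f = lambda x: 2.0 * x[0] - x[1]   # an illustrative linear objective
g = lambda y: y ** 3              # a strictly increasing transformation

candidates = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(10)]

best_f = select_best(candidates, f)
best_gf = select_best(candidates, lambda x: g(f(x)))
assert best_f == best_gf          # the selected point is unchanged
```

Any strictly increasing g preserves the ranking of the f-values, so selection, and therefore the whole trajectory of the algorithm, is unaffected.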
This property and the black box scenario make evolution strategies suited for a wide class of real-world problems, where constraints on the variables are often imposed. Different techniques for handling constraints in randomized algorithms have been proposed; see Mezura-Montes and Coello (2011) for a survey. For ES, common techniques are resampling (resampling a solution until it lies in the feasible domain), repair methods that project infeasible points onto the feasible domain (Arnold, 2011b, 2013), penalty methods where infeasible solutions are penalized either by a quantity that depends on the distance to the constraint, if this distance can be computed, with adaptive penalty weights (e.g., Hansen et al., 2009; Arnold and Porter, 2015), or by the constraint value itself (e.g., stochastic ranking; Runarsson and Yao, 2000), or methods inspired by multiobjective optimization (e.g., Mezura-Montes and Coello, 2008).
In this paper we focus on the resampling method and study it on a simple constrained problem. More precisely, we study a (1,λ)-ES optimizing a linear function with a linear constraint, resampling any infeasible solution until a feasible solution is sampled. The linear function models the situation where the current point is far from the optimum relative to the step-size, and “solving” this function means diverging. The linear constraint models being close to the constraint relative to the step-size and far from other constraints. Because of the invariance of the algorithm to the composition of the objective function by a strictly increasing map, the linear function can be composed with a non-differentiable function with many discontinuities without any impact on our analysis.
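The resampling constraint-handling just described amounts to a rejection loop: draw candidates until one satisfies the constraint. A minimal sketch, where the half-plane constraint g and the parameter values are our own illustrative choices:

```python
import random

def sample_feasible(mean, sigma, g, rng):
    """Sample from a normal distribution centered in `mean` with
    step-size `sigma`, resampling until g(x) <= 0 (feasible)."""
    while True:
        x = tuple(m + sigma * rng.gauss(0.0, 1.0) for m in mean)
        if g(x) <= 0.0:
            return x

rng = random.Random(1)
g = lambda x: x[1] - 1.0   # illustrative linear constraint: feasible iff x_2 <= 1

samples = [sample_feasible((0.0, 0.0), 1.0, g, rng) for _ in range(1000)]
assert all(g(x) <= 0.0 for x in samples)   # every returned sample is feasible
```

The number of draws per accepted sample is unbounded but geometrically distributed, which is why Section 3 later gives an equivalent sampling scheme using a fixed number of random variables.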
The problem we address was studied previously for different step-size adaptation mechanisms and different constraint-handling methods: with constant step-size, self-adaptation, and cumulative step-size adaptation, and the constraint being handled through resampling or repairing infeasible solutions (Arnold, 2011a, 2012, 2013). The conclusion is that with step-size adaptation, the (1,λ)-ES fails to diverge unless some requirements on internal parameters of the algorithm are met. However, the approach followed in the aforementioned studies relies on finding simplified theoretical models to explain the behavior of the algorithm: typically these models arise from approximations (e.g., considering some random variables equal to their expected value) and assume mathematical properties, like the existence of stationary distributions of underlying Markov chains, without accompanying proof.
In contrast, our motivation is to study the algorithm without simplifications and to prove rigorously different mathematical properties of the algorithm that allow deducing the exact behavior of the algorithm, as well as to provide tools and methodology for such studies. Our theoretical studies need to be complemented by simulations of the convergence/divergence rates. The mathematical properties that we derive show that these numerical simulations converge fast. Our results are largely in agreement with the aforementioned studies of simplified models, thereby backing up their validity.
As for the step-size adaptation mechanism, our aim is to study cumulative step-size adaptation (CSA), also called path length control, the default step-size mechanism of the CMA-ES algorithm (Hansen and Ostermeier, 2001). The mathematical object to study for this purpose is a discrete-time, continuous-state-space Markov chain defined as the pair (evolution path, distance to the constraint normalized by the step-size). More precisely, stability properties like irreducibility and existence of a stationary distribution of this Markov chain need to be studied to deduce the geometric divergence of the CSA and to have a rigorous mathematical framework to perform Monte Carlo simulations that allow studying the influence of different parameters of the algorithm. We start by illustrating the methodology in detail on the simpler case where the step-size is constant. We show in this case that the normalized distance to the constraint reaches a stationary distribution; this latter property was assumed in a previous study (Arnold, 2011a). We then prove that the algorithm diverges at a constant speed. We then apply this approach to the case where the step-size is adapted using path length control. We show that in the special case where the cumulation parameter c equals 1, the expected logarithmic step-size change E(ln(σ_{t+1}/σ_t)) converges to a constant r, and the average logarithmic step-size change (1/t) ln(σ_t/σ_0) converges in probability to the same constant, which depends on parameters of the problem and of the algorithm. This implies geometric divergence (if r > 0) or convergence (if r < 0) at the rate r, for which estimates are provided.
This paper is organized as follows. In Section 2 we define the (1,λ)-ES using resampling and the problem. In Section 3 we provide some preliminary derivations on the distributions that come into play for the analysis. In Section 4 we analyze the constant step-size case. In Section 5 we analyze the cumulative step-size adaptation case. Finally, in Section 6, we discuss our results and our methodology.
A preliminary version of this paper appeared in conference proceedings (Chotard et al., 2014). The analysis of path length control with cumulation parameter equal to 1 is however fully new, as is the discussion on how to analyze the case with cumulation parameter smaller than 1. Also, Figures 4–11 as well as the convergence of the progress rate in Theorem 10 are new.
Throughout this article, we denote by φ the density function of the standard multivariate normal distribution (the dimension being clear from the context), and by Φ the cumulative distribution function of a standard univariate normal distribution. The standard (unidimensional) normal distribution is denoted by N(0,1), the (n-dimensional) multivariate normal distribution with covariance matrix identity by N(0, I_n), and the ith-order statistic of λ i.i.d. standard normal random variables by N_{i:λ}. The uniform distribution on an interval I is denoted by U_I. The set of natural numbers (including 0) is denoted by ℕ, and the set of real numbers by ℝ. We denote by ℝ_+ the set {x ∈ ℝ : x ≥ 0}, and for A a subset of ℝ^n, the set A* denotes A \ {0}, and 1_A denotes the indicator function of A. For a topological space X, B(X) denotes the Borel σ-algebra of X. We denote by Λ_n the Lebesgue measure on ℝ^n, and for X ∈ B(ℝ^n), Λ_X denotes the trace measure A ↦ Λ_n(A ∩ X). For two vectors x and y, we denote by [x]_i the ith coordinate of x and by x·y the scalar product of x and y. For (a, b) ∈ ℕ² with a ≤ b, we denote by [a..b] the interval of integers between a and b. The gamma function is denoted by Γ. For X and Y two random vectors, we write X =_d Y if X and Y are equal in distribution. For (X_t)_{t∈ℕ} a sequence of random variables and X a random variable, we write X_t → X a.s. if X_t converges almost surely to X, and X_t → X in probability if X_t converges in probability to X. For X a random variable and π a probability measure, we denote by E(X) the expected value of X, and by E_π(X) the expected value of X when X has distribution π.
2 Problem Statement and Algorithm Definition
2.1 (1,λ)-ES with Resampling
In this paper we study the behavior of a (1,λ)-Evolution Strategy maximizing a function f: ℝ^n → ℝ, with n ≥ 2 and λ ≥ 2, with a constraint defined by a function g: ℝ^n → ℝ restricting the feasible space to the set {x ∈ ℝ^n : g(x) ≤ 0}. To handle the constraint, the algorithm resamples any infeasible solution until a feasible solution is found.
2.2 Linear Fitness Function with Linear Constraint
Although X_t and the sampled steps are in ℝ^n, because of the choice of the coordinate system and the independence of the coordinates of the sampled steps, only the first two coordinates of these vectors are affected by the resampling implied by g and by the selection according to f. Therefore [X_t]_i = [X_0]_i for all i ∈ [3..n]. With an abuse of notation, the vector X_t will denote the two-dimensional vector ([X_t]_1, [X_t]_2), and likewise for the other vectors involved in the algorithm. The coordinate system will also be restricted to its first two basis vectors.
We initialize the algorithm by choosing a feasible initial point X_0 and an initial step-size σ_0 > 0, which implies that the initial normalized distance to the constraint is positive.
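Putting the pieces of this section together, one iteration of the constrained (1,λ)-ES with resampling can be sketched as follows. This is an illustrative instantiation, not the paper's exact parameterization: we maximize f(x) = [x]_1, take the feasible set {x : x·n ≤ 0} with unit normal n at angle θ, and fix the step-size, population size, and starting point arbitrarily.

```python
import math
import random

rng = random.Random(2)

theta = math.pi / 4                               # constraint angle (illustrative)
n_vec = (math.cos(theta), math.sin(theta))        # unit normal of the constraint
f = lambda x: x[0]                                # linear objective to maximize
g = lambda x: x[0] * n_vec[0] + x[1] * n_vec[1]   # feasible iff g(x) <= 0

def iterate(x, sigma, lam):
    """One (1,lam)-ES iteration with resampling of infeasible candidates."""
    candidates = []
    for _ in range(lam):
        while True:                               # resample until feasible
            c = (x[0] + sigma * rng.gauss(0, 1),
                 x[1] + sigma * rng.gauss(0, 1))
            if g(c) <= 0.0:
                break
        candidates.append(c)
    return max(candidates, key=f)                 # comparison-based selection

x, sigma, lam = (0.0, -1.0), 1.0, 5               # feasible start: g(x) < 0
f0 = f(x)
for _ in range(200):
    x = iterate(x, sigma, lam)
    assert g(x) <= 0.0                            # the current point stays feasible
assert f(x) > f0                                  # f-values drift upward (divergence)
```

With constant step-size this simulation exhibits the steady increase of f-values whose rigorous proof is the subject of Section 4.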
3 Preliminary Results and Definitions
Throughout this section we derive the probability density functions of the feasible steps and of the selected step, and define these random vectors as functions of the normalized distance to the constraint and of an i.i.d. sequence of random vectors.
3.1 Feasible Steps
The ith feasible step is distributed according to the standard multivariate normal distribution truncated by the constraint, as stated in the following lemma.
It will be important in the sequel to be able to express the feasible step as a function of the normalized distance to the constraint and of a finite number of random samples. Hence we give an alternative way to sample the feasible step that, unlike the resampling technique, does not involve an unbounded number of samples.
We define a new coordinate system, the image of (e_1, e_2) by a rotation (see Figure 1). In the new basis, only the coordinate along the first basis vector is affected by the resampling; hence this coordinate follows a truncated normal distribution, while the coordinate along the second basis vector follows an independent standard normal distribution. Using the fact that if a random variable has cumulative distribution function F, then F^{-1}(U), with F^{-1} the generalized inverse of F and U uniformly distributed on [0,1], has the same distribution as this random variable, we obtain Eq. (11).
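The inverse-transform argument can be made concrete: to sample a standard normal variable conditioned on lying below a threshold q (a stand-in for the constraint-dependent bound), one evaluates Φ^{-1}(U Φ(q)) with U uniform, instead of resampling. A sketch using the standard library; the threshold q is an arbitrary illustrative value:

```python
import random
from statistics import NormalDist

nd = NormalDist()          # standard univariate normal: cdf, inv_cdf, pdf
rng = random.Random(3)
q = 0.5                    # truncation threshold (illustrative)

def truncated_normal(q, rng):
    """Sample N(0,1) conditioned on being <= q via inverse transform:
    if F is the truncated cdf, then F^{-1}(U) with U ~ U(0,1] has law F."""
    u = 1.0 - rng.random()          # uniform on (0, 1], avoids inv_cdf(0)
    return nd.inv_cdf(u * nd.cdf(q))

samples = [truncated_normal(q, rng) for _ in range(100000)]
assert all(s <= q for s in samples)

# The mean of N(0,1) truncated to (-inf, q] is -pdf(q)/cdf(q).
mean_exact = -nd.pdf(q) / nd.cdf(q)
mean_emp = sum(samples) / len(samples)
assert abs(mean_emp - mean_exact) < 0.02
```

One call to `inv_cdf` replaces a rejection loop whose length is unbounded, which is exactly the point of the alternative sampling scheme above.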
We now extend our study to the selected step.
3.2 Selected Step
The selected step is chosen among the different feasible steps to maximize the function f, and it has the density described in the following lemma.
We may now obtain the marginal distributions of the coordinates of the selected step.
4 Constant Step-Size Case
We illustrate in this section our methodology on the simple case where the step-size is constant, and prove that f(X_t) diverges in probability at constant speed and that the progress rate (see Arnold, 2011a, Eq. 2) converges to a strictly positive constant (Theorem 11). The analysis of the CSA is then a generalization of the results presented here, with more technical results to derive. Note that the progress rate definition coincides with the expected fitness gain, that is, E(f(X_{t+1}) − f(X_t)).
As suggested by Arnold (2011a), the sequence (δ_t)_{t∈ℕ} of normalized distances to the constraint plays a central role in the analysis, and we show that it admits a stationary measure. We first prove that this sequence is a homogeneous Markov chain.
It follows from the definition of the algorithm and from Proposition 5 that δ_{t+1} has the same distribution as a time-independent function of δ_t and of i.i.d. random vectors; hence (δ_t)_{t∈ℕ} is a homogeneous Markov chain.
The invariant measure also underlies the study carried out by Arnold (2011a, Section 4), where it is stated, “Assuming for now that the mutation strength is held constant, when the algorithm is iterated, the distribution of δ-values tends to a stationary limit distribution.” We provide a formal proof that (δ_t)_{t∈ℕ} indeed admits a stationary limit distribution π, and prove some other useful properties that allow us to conclude the divergence of the algorithm.
4.1 Study of the Stability of (δ_t)_{t∈ℕ}
From the transition kernel, we now derive the first properties of the Markov chain (δ_t)_{t∈ℕ}. First we investigate the so-called φ-irreducibility property.
A Markov chain on a state space X is φ-irreducible if there exists a nontrivial measure φ such that for every set A ∈ B(X) with φ(A) > 0 and every initial state x ∈ X, there exists t ∈ ℕ such that the probability of reaching A from x in t steps is strictly positive. We denote by B+(X) the set of Borel sets of X with strictly positive φ-measure.
Because sets of the form [0, M] with M > 0 are small sets, and drift conditions investigate the negativity of the drift outside a small set, we need to study the chain for δ large. The following technical lemma studies the limit of the drift-related quantities as δ tends to infinity.
For the proof see the appendix. We are now ready to prove a drift condition for geometric ergodicity.
Consider a (1,λ)-ES with resampling and with constant step-size optimizing the constrained problem (7), and let (δ_t)_{t∈ℕ} be the Markov chain exhibited in Eq. (18). The Markov chain (δ_t)_{t∈ℕ} is V-geometrically ergodic with V: δ ↦ exp(αδ) for α > 0 small enough, and is Harris-recurrent and positive with invariant probability measure π.
According to Proposition 7, the set appearing in the drift condition is a small set, hence it is petite (Meyn and Tweedie, 1993, Prop. 5.5.3). Furthermore (δ_t)_{t∈ℕ} is a φ-irreducible aperiodic Markov chain, so it satisfies the conditions of Theorem 15.0.1 of Meyn and Tweedie (1993), which, together with Lemma 15.2.2, Theorem 9.1.8, and Theorem 14.0.1 of the same reference, proves the proposition.
We have proved rigorously the existence (and uniqueness) of an invariant measure for the Markov chain (δ_t)_{t∈ℕ}, which provides the steady-state behavior described by Arnold (2011a, Section 4). As the Markov chain is positive and Harris-recurrent, we may now apply a law of large numbers (Meyn and Tweedie, 1993, Theorem 17.1.7) in Eq. (19) to obtain the divergence of f(X_t) and an exact expression of the divergence rate.
From Proposition 4 the Markov chain (δ_t)_{t∈ℕ} is Harris-recurrent and positive, and since the random vectors appearing in the update are i.i.d., the extended chain is also Harris-recurrent and positive with an invariant probability measure, so to apply the Law of Large Numbers (Meyn and Tweedie, 1993, Theorem 17.0.1) we only need the function being averaged to be integrable with respect to this invariant measure.
From Eq. (19), the divergence rate can be written as the limit of an empirical average over the chain, and Fubini’s theorem gives the integrability needed for the Law of Large Numbers. According to Proposition 9, (δ_t)_{t∈ℕ} is V-geometrically ergodic, so there exist constants r > 1 and R < ∞ such that the V-norm distance between the t-step transition kernel and the invariant measure π is bounded by R r^{-t} V(δ). We showed that the function being averaged is bounded by a multiple of V; hence the difference between its expectation at time t and its expectation under π is bounded by a constant times r^{-t}, which converges to 0 when t goes to infinity.
We showed rigorously the divergence of f(X_t), gave an exact expression of the divergence rate, and showed that the progress rate converges to the same rate. The fact that the chain is V-geometrically ergodic gives the existence of a constant C such that the error after t iterations is bounded by C r^{-t}. This implies that the distribution π can be simulated efficiently by a Monte Carlo simulation, giving precise estimations of the divergence rate of f(X_t).
A Monte Carlo simulation of the divergence rate in the right-hand side of expressions (30) and (31), run over many time steps, gives the progress rate of Arnold (2011a), which once normalized by the step-size and the population size λ yields Figure 2. We normalize per λ, as in evolution strategies the cost of the algorithm is assumed to be the number of f-calls. We see that for small values of the constraint angle, the normalized serial progress rate takes roughly the same value for all λ. Only for larger constraint angles does the serial progress rate depend on λ, where smaller λ are preferable.
Figure 3 is obtained through simulations of the Markov chain (δ_t)_{t∈ℕ} defined in Eq. (18), run over many time steps, where the values of δ_t are averaged over time. We see that as the constraint angle approaches its upper limit, the average normalized distance to the constraint diverges, since the selection does not attract toward the constraint anymore. With a larger population size the algorithm is closer to the constraint, as better samples are more likely to be found close to the constraint.
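The time-averaging behind Figure 3 can be reproduced by simulating the algorithm directly and tracking the normalized distance to the constraint, taken here (our convention for this sketch) as δ = −g(X_t)/σ for the half-space constraint {x : g(x) ≤ 0}; the angle, population size, and run length are illustrative choices:

```python
import math
import random

rng = random.Random(4)
theta = math.pi / 4
n_vec = (math.cos(theta), math.sin(theta))
g = lambda x: x[0] * n_vec[0] + x[1] * n_vec[1]   # feasible iff g(x) <= 0
sigma, lam = 1.0, 5

x = (0.0, -1.0)                                   # feasible starting point
deltas = []
for t in range(20000):
    cands = []
    for _ in range(lam):
        while True:                               # resample until feasible
            c = (x[0] + sigma * rng.gauss(0, 1),
                 x[1] + sigma * rng.gauss(0, 1))
            if g(c) <= 0.0:
                break
        cands.append(c)
    x = max(cands, key=lambda c: c[0])            # maximize f(x) = [x]_1
    deltas.append(-g(x) / sigma)                  # normalized distance to constraint

# The ergodic time average of the normalized distance stays finite and
# positive, consistent with the existence of a stationary distribution.
avg = sum(deltas) / len(deltas)
assert 0.0 < avg < 100.0
```

Geometric ergodicity is what justifies trusting such a single long run: the time average converges to the expectation under π at a geometric rate.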
5 Cumulative Step-Size Adaptation
In this section we apply the techniques introduced in the previous section to the case where the step-size is adapted using cumulative step-size adaptation. This technique was studied on sphere functions by Arnold and Beyer (2004) and on ridge functions by Arnold and MacLeod (2008).
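For reference, a common form of the CSA update (following Hansen and Ostermeier, 2001) can be sketched as follows for a (1,λ) strategy; the cumulation parameter c and the damping d_sigma are the parameters discussed in this section, the default values and the expected-step-length approximation are the usual ones, and the exact normalization analyzed in this paper may differ:

```python
import math

def csa_update(path, sigma, step, n, c=0.3, d_sigma=1.0):
    """One cumulative step-size adaptation update.

    `step` is the selected step divided by sigma (an N(0, I_n) sample
    under purely random selection). The evolution path accumulates
    steps; the step-size grows when the path is longer than expected
    under random selection, and shrinks when it is shorter."""
    new_path = [(1.0 - c) * p + math.sqrt(c * (2.0 - c)) * s
                for p, s in zip(path, step)]
    # Usual approximation of E||N(0, I_n)||:
    chi_n = math.sqrt(n) * (1.0 - 1.0 / (4.0 * n) + 1.0 / (21.0 * n * n))
    norm = math.sqrt(sum(p * p for p in new_path))
    new_sigma = sigma * math.exp((c / d_sigma) * (norm / chi_n - 1.0))
    return new_path, new_sigma

# A short path signals anti-correlated steps: the step-size decreases.
_, s_down = csa_update([0.0, 0.0], 1.0, [0.0, 0.0], n=2)
assert s_down < 1.0
# A long path signals correlated steps: the step-size increases.
_, s_up = csa_update([3.0, 3.0], 1.0, [3.0, 3.0], n=2)
assert s_up > 1.0
```

The case c = 1 studied below corresponds to a path with no memory: the path then equals the latest normalized step, which is why the Markov chain simplifies in that case.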
Note that for c = 1, the logarithmic step-size change is a function only of the normalized distance δ_t, the sampled random vectors, and K_t, and we denote it accordingly.
We prove in the next proposition that the pair formed by the normalized distance to the constraint and the evolution path is a homogeneous Markov chain, and we make its update function explicit. In the case where c = 1, the chain reduces to (δ_t)_{t∈ℕ}.
To apply the law of large numbers we need the Markov chain to be Harris-positive, together with the properties stated in the following lemma.
We believe that the latter result can be generalized to the case c < 1 if, for any initial condition, there exists t ∈ ℕ such that a path of events of length t leads with positive probability from that initial condition into a suitable small set.
The proof of this lemma consists of applications of Lebesgue’s dominated convergence theorem (see appendix).
We now prove the Harris positivity of the chain by proving a stronger property, namely geometric ergodicity, which we show using the drift inequality (29).