Abstract

This paper analyzes a $(1,\lambda)$-Evolution Strategy, a randomized comparison-based adaptive search algorithm, optimizing a linear function with a linear constraint. The algorithm uses resampling to handle the constraint. Two cases are investigated: first, the case where the step-size is constant, and second, the case where the step-size is adapted using cumulative step-size adaptation. We exhibit for each case a Markov chain describing the behavior of the algorithm. Stability of the chain implies, by applying a law of large numbers, either convergence or divergence of the algorithm. Divergence is the desired behavior. In the constant step-size case, we show stability of the Markov chain and prove the divergence of the algorithm. In the cumulative step-size adaptation case, we prove stability of the Markov chain in the simplified case where the cumulation parameter c equals 1, and discuss steps to obtain similar results for the full (default) algorithm where c is smaller than 1. The stability of the Markov chain allows us to deduce geometric divergence or convergence, depending on the dimension, constraint angle, population size, and damping parameter, at a rate that we estimate. Our results complement previous studies where stability was assumed.

1  Introduction

Derivative-free optimization (DFO) methods are tailored for the optimization of numerical problems in a black box context, where the objective function is pictured as a black box that solely returns f values (in particular, no gradients are available).

Evolution strategies (ES) are comparison-based randomized DFO algorithms. At iteration t, candidate solutions are sampled from a multivariate normal distribution centered in a vector $X_t$. The candidate solutions are ranked according to f, and the updates of $X_t$ and other parameters of the distribution (usually a step-size $\sigma_t$ and a covariance matrix) are performed using solely the ranking information given by the candidate solutions. Since ES do not directly use the function values of the new points but only how the objective function f ranks the different samples, they are invariant to the composition (to the left) of the objective function by a strictly increasing function $h : \mathbb{R}\to\mathbb{R}$.

This property and the black box scenario make evolution strategies suited for a wide class of real-world problems, where constraints on the variables are often imposed. Different techniques for handling constraints in randomized algorithms have been proposed; see Mezura-Montes and Coello (2011) for a survey. For ES, common techniques are resampling (resample a solution until it lies in the feasible domain), repair of solutions, where infeasible points are projected onto the feasible domain (Arnold, 2011b, 2013), penalty methods, where infeasible solutions are penalized either by a quantity that depends on the distance to the constraint, if the latter can be computed, possibly with adaptive penalty weights (e.g., Hansen et al., 2009; Arnold and Porter, 2015), or by the constraint value itself (e.g., stochastic ranking; Runarsson and Yao, 2000), or methods inspired by multiobjective optimization (e.g., Mezura-Montes and Coello, 2008).

In this paper we focus on the resampling method and study it on a simple constrained problem. More precisely, we study a $(1,\lambda)$-ES optimizing a linear function with a linear constraint and resampling any infeasible solution until a feasible solution is sampled. The linear function models the situation where the current point is far from the optimum relative to the step-size, and “solving” this function means diverging. The linear constraint models being close to the constraint relative to the step-size and far from other constraints. Because of the invariance of the algorithm to the composition of the objective function by a strictly increasing map, the linear function can be composed by a function without derivative and with many discontinuities without any impact on our analysis.

The problem we address was studied previously for different step-size adaptation mechanisms and different constraint-handling methods: with constant step-size, with self-adaptation, and with cumulative step-size adaptation, the constraint being handled through resampling or through repairing infeasible solutions (Arnold, 2011a; 2012; 2013). The conclusion is that with step-size adaptation, the $(1,\lambda)$-ES fails to diverge unless some requirements on internal parameters of the algorithm are met. However, the approach followed in the aforementioned studies relies on finding simplified theoretical models to explain the behavior of the algorithm: typically these models arise from approximations (considering some random variables equal to their expected value, etc.) and assume mathematical properties like the existence of stationary distributions of underlying Markov chains without accompanying proof.

In contrast, our motivation is to study the algorithm without simplifications and to prove rigorously different mathematical properties of the algorithm that allow deducing the exact behavior of the algorithm, as well as to provide tools and methodology for such studies. Our theoretical studies need to be complemented by simulations of the convergence/divergence rates. The mathematical properties that we derive show that these numerical simulations converge fast. Our results are largely in agreement with the aforementioned studies of simplified models, thereby backing up their validity.

As for the step-size adaptation mechanism, our aim is to study cumulative step-size adaptation (CSA), also called path length control, the default step-size mechanism for the CMA-ES algorithm (Hansen and Ostermeier, 2001). The mathematical object to study for this purpose is a discrete-time, continuous state space Markov chain that is defined as the pair formed by the evolution path and the distance to the constraint normalized by the step-size. More precisely, stability properties like irreducibility and existence of a stationary distribution of this Markov chain need to be studied to deduce the geometric divergence of CSA and to have a rigorous mathematical framework to perform Monte Carlo simulations that allow studying the influence of different parameters of the algorithm. We start by illustrating in detail the methodology on the simpler case where the step-size is constant. We show in this case that the normalized distance to the constraint reaches a stationary distribution. This latter property was assumed in a previous study (Arnold, 2011a). We then prove that the algorithm diverges at a constant speed. We then apply this approach to the case where the step-size is adapted using path length control. We show that in the special case where the cumulation parameter c equals 1, the expected logarithmic step-size change $E(\ln(\sigma_{t+1}/\sigma_t))$ converges to a constant r, and the average logarithmic step-size change $\frac{1}{t}\ln(\sigma_t/\sigma_0)$ converges in probability to the same constant, which depends on parameters of the problem and of the algorithm. This implies geometric divergence (if $r > 0$) or convergence (if $r < 0$) at the rate r, for which estimates are provided.

This paper is organized as follows. In Section 2 we define the $(1,\lambda)$-ES using resampling and the problem. In Section 3 we provide some preliminary derivations on the distributions that come into play for the analysis. In Section 4 we analyze the constant step-size case. In Section 5 we analyze the cumulative step-size adaptation case. Finally, in Section 6, we discuss our results and our methodology.

A preliminary version of this paper appeared in conference proceedings (Chotard et al., 2014). The analysis of path length control with cumulation parameter equal to 1 is however fully new, as is the discussion on how to analyze the case with cumulation parameter smaller than 1. Also, Figures 4–11 as well as the convergence of the progress rate in Theorem 1 are new.

1.1  Notation

Throughout this article, we denote by $\varphi$ the density function of the standard multivariate normal distribution (the dimension being clarified within the context), and by $\Phi$ the cumulative distribution function of a standard univariate normal distribution. The standard (unidimensional) normal distribution is denoted by $\mathcal{N}(0,1)$, the (n-dimensional) multivariate normal distribution with covariance matrix identity by $\mathcal{N}(0, I_n)$, and the $i$th order statistic of $\lambda$ i.i.d. standard normal random variables by $N_{i:\lambda}$. The uniform distribution on an interval $I$ is denoted by $\mathcal{U}_I$. The set of natural numbers (including 0) is denoted by $\mathbb{N}$, with $\mathbb{N}^* = \mathbb{N}\setminus\{0\}$, and the set of real numbers by $\mathbb{R}$. We denote by $\mathbb{R}_+$ the set $\{x\in\mathbb{R} : x\ge 0\}$ and by $\mathbb{R}_+^*$ the set $\{x\in\mathbb{R} : x > 0\}$; for A a set, $\mathbf{1}_A$ denotes the indicator function of A. For a topological space $\mathcal{X}$, $\mathcal{B}(\mathcal{X})$ denotes the Borel $\sigma$-algebra of $\mathcal{X}$. We denote by $\Lambda$ the Lebesgue measure, and for $B\in\mathcal{B}(\mathcal{X})$, $\Lambda_B$ denotes the trace measure $A\mapsto\Lambda(A\cap B)$. For two vectors $\boldsymbol{x}$ and $\boldsymbol{y}$, we denote by $[\boldsymbol{x}]_i$ the $i$th coordinate of $\boldsymbol{x}$ and by $\boldsymbol{x}\cdot\boldsymbol{y}$ the scalar product of $\boldsymbol{x}$ and $\boldsymbol{y}$. Take $(a,b)\in\mathbb{N}^2$ with $a\le b$; we denote by $[a..b]$ the interval of integers between a and b. The gamma function is denoted by $\Gamma$. For $X$ and $Y$ two random vectors, we write $X\stackrel{d}{=}Y$ if $X$ and $Y$ are equal in distribution. For $(X_t)_{t\in\mathbb{N}}$ a sequence of random variables and $X$ a random variable, we write $X_t\xrightarrow{\text{a.s.}}X$ if $X_t$ converges almost surely to $X$, and $X_t\xrightarrow{P}X$ if $X_t$ converges in probability to $X$. For $X$ a random variable and $\mu$ a probability measure, we denote by $E(X)$ the expected value of $X$, and by $E_\mu(X)$ the expected value of $X$ when $X$ has distribution $\mu$.

2  Problem Statement and Algorithm Definition

2.1  $(1,\lambda)$-ES with Resampling

In this paper we study the behavior of a $(1,\lambda)$-Evolution Strategy maximizing a function $f : \mathbb{R}^n\to\mathbb{R}$, $n\ge 2$, $\lambda\ge 2$, with a constraint defined by a function $g : \mathbb{R}^n\to\mathbb{R}$ restricting the feasible space to $\mathcal{X}_{\text{feasible}} = \{\boldsymbol{x}\in\mathbb{R}^n : g(\boldsymbol{x})\le 0\}$. To handle the constraint, the algorithm resamples any infeasible solution until a feasible solution is found.

From iteration $t\in\mathbb{N}$, given the vector $X_t\in\mathbb{R}^n$ and step-size $\sigma_t\in\mathbb{R}_+^*$, the algorithm generates $\lambda$ new candidates:
$$Y_t^i = X_t + \sigma_t\, N_t^i, \tag{1}$$
with $i\in[1..\lambda]$, and $(N_t^i)_{i\in[1..\lambda]}$ i.i.d. standard multivariate normal random vectors. If a new sample $Y_t^i$ lies outside the feasible domain, that is, $g(Y_t^i) > 0$, then it is resampled until it lies within the feasible domain. The first feasible $i$th candidate solution is denoted by $\tilde{Y}_t^i$, and the realization of the multivariate normal distribution giving $\tilde{Y}_t^i$ is $\tilde{N}_t^i$, that is,
$$\tilde{Y}_t^i = X_t + \sigma_t\,\tilde{N}_t^i. \tag{2}$$
The vector $\tilde{N}_t^i$ is called a feasible step. Note that $\tilde{N}_t^i$ is not distributed as a multivariate normal distribution.
We define $\star = \operatorname{argmax}_{i\in[1..\lambda]} f(\tilde{Y}_t^i)$ as the index realizing the maximum objective function value, and call $\tilde{N}_t^\star$ the selected step. The vector $X_t$ is then updated as the solution realizing the maximum value of the objective function, that is,
$$X_{t+1} = X_t + \sigma_t\,\tilde{N}_t^\star. \tag{3}$$
The step-size and other internal parameters are then adapted. We denote for the moment in a nonspecific manner the adaptation as
$$\sigma_{t+1} = \sigma_t\,\xi_t, \tag{4}$$
where $\xi_t$ is a random variable whose distribution is a function of the selected steps $(\tilde{N}_k^\star)_{k\le t}$, of $\sigma_t$, and of internal parameters of the algorithm. We define later specific rules for this adaptation.
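The following minimal sketch illustrates this iteration loop in Python. It is our illustration, not the authors' code: the function and variable names are ours, and the step-size update of Eq. (4) is left as a placeholder since it is deliberately nonspecific at this point.

```python
# A minimal sketch of the (1,lambda)-ES with resampling, Eqs. (1)-(3).
import numpy as np

def one_lambda_es_resampling(f, g, x0, sigma0, lam, n_iter, seed=0):
    rng = np.random.default_rng(seed)
    x, sigma = np.asarray(x0, dtype=float), float(sigma0)
    for _ in range(n_iter):
        steps = []
        for _ in range(lam):
            z = rng.standard_normal(x.shape)    # N_t^i
            while g(x + sigma * z) > 0:         # resample until feasible
                z = rng.standard_normal(x.shape)
            steps.append(z)                     # feasible step
        star = max(range(lam), key=lambda i: f(x + sigma * steps[i]))
        x = x + sigma * steps[star]             # Eq. (3)
        # sigma = sigma * xi_t                  # Eq. (4), left unspecified here
    return x, sigma
```

For the linear problem of Section 2.2 below one would take, for instance, `f = lambda y: y[0]` and `g = lambda y: y @ n_vec`, with `n_vec` a unit normal vector of the constraint.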

2.2  Linear Fitness Function with Linear Constraint

In this article we consider the case where f, the function that we optimize, and g, the constraint, are linear functions. W.l.o.g., we assume that $\|\nabla f\| = 1$. We denote by $\boldsymbol{n}$ a unit normal vector to the constraint hyperplane. We choose an orthonormal Euclidean coordinate system with basis $(\boldsymbol{e}_1,\ldots,\boldsymbol{e}_n)$ with its origin located on the constraint hyperplane, where $\boldsymbol{e}_1$ is equal to the gradient $\nabla f$; hence
$$f(\boldsymbol{x}) = [\boldsymbol{x}]_1, \tag{5}$$
and the vector $\boldsymbol{n}$ lives in the plane generated by $\boldsymbol{e}_1$ and $\boldsymbol{e}_2$ and is such that the angle between $\boldsymbol{n}$ and $\boldsymbol{e}_1$ is positive. We define $\theta$ as the angle between $\boldsymbol{n}$ and $\boldsymbol{e}_1$, and restrict our study to $\theta\in(0,\pi/2)$. The function g can be seen as a signed distance to the linear constraint, as
$$g(\boldsymbol{x}) = \boldsymbol{x}\cdot\boldsymbol{n} = [\boldsymbol{x}]_1\cos\theta + [\boldsymbol{x}]_2\sin\theta. \tag{6}$$
A point $\boldsymbol{x}$ is feasible if and only if $g(\boldsymbol{x})\le 0$ (see Figure 1). Overall the problem reads
$$\text{maximize } f(\boldsymbol{x}) = [\boldsymbol{x}]_1 \quad\text{subject to}\quad g(\boldsymbol{x}) = \boldsymbol{x}\cdot\boldsymbol{n} \le 0. \tag{7}$$
Figure 1:

Linear function f with a linear constraint, in the plane generated by $\nabla f$ and $\boldsymbol{n}$, a normal vector to the constraint hyperplane with angle $\theta\in(0,\pi/2)$ with $\nabla f$. The point $\boldsymbol{x}$ is at signed distance $g(\boldsymbol{x})$ from the constraint.

Although $X_t$ and $\tilde{N}_t^\star$ are in $\mathbb{R}^n$, because of the choice of the coordinate system and the independence of the coordinates of the sampled vectors, only the two first coordinates of these vectors are affected by the resampling implied by g and the selection according to f. Therefore $[\tilde{N}_t^\star]_i\sim\mathcal{N}(0,1)$ for $i\in[3..n]$. With an abuse of notation, the vector $\tilde{N}_t^\star$ will denote the two-dimensional vector $([\tilde{N}_t^\star]_1, [\tilde{N}_t^\star]_2)$, likewise $N_t^i$ will denote the two-dimensional vector $([N_t^i]_1, [N_t^i]_2)$, and $\boldsymbol{n}$ will denote the two-dimensional vector $(\cos\theta, \sin\theta)$. The coordinate system will also be used as $(\boldsymbol{e}_1, \boldsymbol{e}_2)$ only.

Following Arnold (2011a; 2012) and Arnold and Brauer (2008), we denote the normalized signed distance to the constraint as $\delta_t$, that is,
$$\delta_t = -\frac{g(X_t)}{\sigma_t}. \tag{8}$$

We initialize the algorithm by choosing $X_0$ such that $g(X_0) < 0$ and $\sigma_0 > 0$, which implies that $\delta_0 > 0$.

3  Preliminary Results and Definitions

Throughout this section we derive the probability density functions of the random vectors $\tilde{N}_t^i$ and $\tilde{N}_t^\star$ and give a definition of $\tilde{N}_t^i$ and of $\tilde{N}_t^\star$ as a function of $\delta_t$ and of an i.i.d. sequence of random vectors.

3.1  Feasible Steps

The random vector $\tilde{N}_t^i$, the ith feasible step, is distributed as the standard multivariate normal distribution truncated by the constraint, as stated in the following lemma.

Lemma 1:
Let a $(1,\lambda)$-ES with resampling optimize a function f under a constraint function g. If g is a linear form determined by a vector $\boldsymbol{n}$ as in Eq. (6), then the distribution of the feasible step $\tilde{N}_t^i$ only depends on the normalized distance to the constraint $\delta_t$, and its density given that $\delta_t$ equals $\delta$ reads
$$p_\delta(\boldsymbol{x}) = \frac{\varphi(\boldsymbol{x})\,\mathbf{1}_{\{\boldsymbol{x}\cdot\boldsymbol{n}\le\delta\}}}{\Phi(\delta)}. \tag{9}$$
Proof:
A solution $Y_t^i = X_t + \sigma_t N_t^i$ is feasible if and only if $g(Y_t^i)\le 0$, which is equivalent to $g(X_t) + \sigma_t\, N_t^i\cdot\boldsymbol{n}\le 0$. Hence, dividing by $\sigma_t$, a solution is feasible if and only if $N_t^i\cdot\boldsymbol{n}\le\delta_t$. Since the standard multivariate normal distribution is rotationally invariant, $N_t^i\cdot\boldsymbol{n}$ follows a standard (unidimensional) normal distribution. Hence the probability that a solution or a step is feasible is given by
$$P\!\left(N_t^i\cdot\boldsymbol{n}\le\delta_t \mid \delta_t = \delta\right) = \Phi(\delta).$$
Therefore the probability density function of the random variable $\tilde{N}_t^i\cdot\boldsymbol{n}$ for $\delta_t = \delta$ is $x\mapsto\varphi(x)\mathbf{1}_{\{x\le\delta\}}/\Phi(\delta)$. For any vector $\boldsymbol{v}$ orthogonal to $\boldsymbol{n}$, the random variable $\tilde{N}_t^i\cdot\boldsymbol{v}$ is not affected by the resampling and is therefore still distributed as a standard (unidimensional) normal distribution. With a change of variables, using the fact that the standard multivariate normal distribution is rotationally invariant, we obtain the joint distribution of Eq. (9).
Then the marginal density function $p_1(\cdot\mid\delta)$ of $[\tilde{N}_t^i]_1$ can be computed by integrating Eq. (9) over $[\boldsymbol{x}]_2$ and reads
$$p_1(x\mid\delta) = \varphi(x)\,\frac{\Phi\!\left(\frac{\delta - x\cos\theta}{\sin\theta}\right)}{\Phi(\delta)} \tag{10}$$
(see Arnold, 2011a, Eq. 4), and we denote by $F_1(\cdot\mid\delta)$ its cumulative distribution function.

It will be important in the sequel to be able to express the vector $\tilde{N}_t^i$ as a function of $\delta_t$ and of a finite number of random samples. Hence we give an alternative way to sample $\tilde{N}_t^i$, rather than the resampling technique that involves an unbounded number of samples.

Lemma 2:
Let a $(1,\lambda)$-ES with resampling optimize a function f under a constraint function g, where g is a linear form determined by a vector $\boldsymbol{n}$ as in Eq. (6). Let the feasible step $\tilde{N}_t^i$ be the random vector described in Lemma 1, and let $\mathcal{R}_\theta$ be the two-dimensional rotation matrix of angle $\theta$. Then
$$\tilde{N}_t^i \stackrel{d}{=} \mathcal{R}_\theta\begin{pmatrix} F_\delta^{-1}(\mathcal{U}_t^i) \\ \mathcal{N}_t^i \end{pmatrix}, \tag{11}$$
where $F_\delta^{-1}$ denotes the generalized inverse of the cumulative distribution function $F_\delta : x\mapsto\Phi(\min(x,\delta))/\Phi(\delta)$ of $\tilde{N}_t^i\cdot\boldsymbol{n}$, with $(\mathcal{U}_t^i)_{i\in[1..\lambda]}$ i.i.d. $\mathcal{U}_{[0,1]}$ and $(\mathcal{N}_t^i)_{i\in[1..\lambda]}$ i.i.d. $\mathcal{N}(0,1)$ random variables.
Proof:

We define a new coordinate system $(\boldsymbol{n}, \boldsymbol{n}^\perp)$ (see Figure 1). It is the image of $(\boldsymbol{e}_1, \boldsymbol{e}_2)$ by $\mathcal{R}_\theta$. In the new basis $(\boldsymbol{n}, \boldsymbol{n}^\perp)$, only the coordinate along $\boldsymbol{n}$ is affected by the resampling. Hence the random variable $\tilde{N}_t^i\cdot\boldsymbol{n}$ follows a truncated normal distribution with cumulative distribution function equal to $F_\delta$, while the random variable $\tilde{N}_t^i\cdot\boldsymbol{n}^\perp$ follows an independent standard normal distribution. Using the fact that if a random variable has cumulative distribution function F, then $F^{-1}(\mathcal{U})$, for $F^{-1}$ the generalized inverse of F and $\mathcal{U}\sim\mathcal{U}_{[0,1]}$, has the same distribution as this random variable, we get that $\tilde{N}_t^i\cdot\boldsymbol{n}\stackrel{d}{=}F_\delta^{-1}(\mathcal{U}_t^i)$, so we obtain Eq. (11).
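In code, Lemma 2 replaces the unbounded resampling loop by a single uniform and a single normal draw per feasible step. The sketch below assumes our reconstruction of the notation, where $\mathcal{R}_\theta$ maps $(\boldsymbol{e}_1, \boldsymbol{e}_2)$ onto $(\boldsymbol{n}, \boldsymbol{n}^\perp)$; the function name is ours.

```python
# Sample a feasible step via Eq. (11): the coordinate along n is a
# truncated normal obtained by inverse-CDF, the orthogonal one is normal.
import numpy as np
from scipy.stats import norm

def sample_feasible_step(delta, theta, rng):
    u, w = rng.uniform(), rng.standard_normal()
    along_n = norm.ppf(u * norm.cdf(delta))   # F_delta^{-1}(u), always <= delta
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return rot @ np.array([along_n, w])       # back to the (e_1, e_2) basis
```

One can check empirically that the output always satisfies $\boldsymbol{x}\cdot\boldsymbol{n}\le\delta$ and matches the distribution produced by naive resampling.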

We now extend our study to the selected step $\tilde{N}_t^\star$.

3.2  Selected Step

The selected step $\tilde{N}_t^\star$ is chosen among the $\lambda$ different feasible steps to maximize the function f, and it has the density described in the following lemma.

Lemma 3:
Let a $(1,\lambda)$-ES with resampling optimize problem (7). Then the distribution of the selected step $\tilde{N}_t^\star$ only depends on the normalized distance to the constraint $\delta_t$, and its density, given that $\delta_t$ equals $\delta$, reads
$$p^\star_\delta(\boldsymbol{x}) = \lambda\, p_\delta(\boldsymbol{x})\, F_1([\boldsymbol{x}]_1\mid\delta)^{\lambda-1}, \tag{12}$$
where $p_\delta$ is the density of $\tilde{N}_t^i$ given that $\delta_t = \delta$ given in Eq. (9), $F_1(\cdot\mid\delta)$ is the cumulative distribution function of $[\tilde{N}_t^i]_1$ whose density is given in Eq. (10), and $\boldsymbol{x}\in\mathbb{R}^2$.
Proof:
The function f being linear, the rankings on $(\tilde{Y}_t^i)_{i\in[1..\lambda]}$ correspond to the order statistics on $([\tilde{N}_t^i]_1)_{i\in[1..\lambda]}$. If we look at the joint cumulative distribution function $F^\star_\delta$ of $\tilde{N}_t^\star$,
$$F^\star_\delta(x, y) = \sum_{i=1}^{\lambda} P\!\left([\tilde{N}_t^i]_1\le x,\ [\tilde{N}_t^i]_2\le y,\ \text{and } [\tilde{N}_t^j]_1\le[\tilde{N}_t^i]_1 \text{ for all } j\ne i\right)$$
by summing disjoint events. The vectors $(\tilde{N}_t^i)_{i\in[1..\lambda]}$ being independent and identically distributed,
$$F^\star_\delta(x, y) = \lambda\int_{-\infty}^{x}\int_{-\infty}^{y} p_\delta((u,v))\, F_1(u\mid\delta)^{\lambda-1}\, dv\, du.$$
Differentiating with respect to x and y yields the density of $\tilde{N}_t^\star$ of Eq. (12).

We may now obtain the marginals of $[\tilde{N}_t^\star]_1$ and $[\tilde{N}_t^\star]_2$.

Corollary 1:
Let a $(1,\lambda)$-ES with resampling optimize problem (7). Then the marginal distribution of $[\tilde{N}_t^\star]_1$ only depends on $\delta_t$, and its density given that $\delta_t$ equals $\delta$ reads
$$p^\star_1(x\mid\delta) = \lambda\, p_1(x\mid\delta)\, F_1(x\mid\delta)^{\lambda-1}, \tag{13}$$
and the same holds for $[\tilde{N}_t^\star]_2$, whose marginal density reads
$$p^\star_2(y\mid\delta) = \int_{\mathbb{R}} \lambda\, p_1(x\mid\delta)\, F_1(x\mid\delta)^{\lambda-1}\,\frac{\varphi(y)\,\mathbf{1}_{\left\{y\le\frac{\delta - x\cos\theta}{\sin\theta}\right\}}}{\Phi\!\left(\frac{\delta - x\cos\theta}{\sin\theta}\right)}\, dx. \tag{14}$$
Proof:

Integrating Eq. (12) over $[\boldsymbol{x}]_2$ directly yields Eq. (13).

The conditional density function of $[\tilde{N}_t^\star]_2$ given that $[\tilde{N}_t^\star]_1 = x$ is
$$p_{2|1}(y\mid x, \delta) = \frac{p^\star_\delta((x,y))}{p^\star_1(x\mid\delta)}.$$
As $p^\star_1(x\mid\delta) = \lambda\, p_1(x\mid\delta) F_1(x\mid\delta)^{\lambda-1}$, using the previous equation with Eq. (12) gives that $p_{2|1}(y\mid x,\delta) = p_\delta((x,y))/p_1(x\mid\delta)$, which with Eq. (9) gives
$$p_{2|1}(y\mid x, \delta) = \frac{\varphi(y)\,\mathbf{1}_{\{x\cos\theta + y\sin\theta\le\delta\}}}{\Phi\!\left(\frac{\delta - x\cos\theta}{\sin\theta}\right)}.$$
The condition $x\cos\theta + y\sin\theta\le\delta$ is equivalent to $y\le(\delta - x\cos\theta)/\sin\theta$; hence, integrating this conditional density against the marginal (13), Eq. (14) holds.
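The densities of Eqs. (10) and (13) are straightforward to evaluate numerically, with the cumulative distribution function $F_1(\cdot\mid\delta)$ obtained by quadrature. A sketch (the function names are ours):

```python
# Marginal densities of the first coordinate of the feasible step,
# Eq. (10), and of the selected step, Eq. (13).
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def p1(x, delta, theta):
    """Density of [feasible step]_1 given delta, Eq. (10)."""
    return (norm.pdf(x)
            * norm.cdf((delta - x * np.cos(theta)) / np.sin(theta))
            / norm.cdf(delta))

def F1(x, delta, theta):
    """CDF of Eq. (10), by quadrature."""
    return quad(p1, -np.inf, x, args=(delta, theta))[0]

def p1_star(x, delta, theta, lam):
    """Density of [selected step]_1 given delta, Eq. (13)."""
    return lam * p1(x, delta, theta) * F1(x, delta, theta) ** (lam - 1)
```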
We will need an expression of the random vector $\tilde{N}_t^\star$ as a function of $\delta_t$ and of a random vector composed of a finite number of i.i.d. random variables. Using the notation of Lemma 2, we define the function $\tilde{\mathcal{N}} : \mathbb{R}_+^*\times[0,1]\times\mathbb{R}\to\mathbb{R}^2$ as
$$\tilde{\mathcal{N}}(\delta, u, w) = \mathcal{R}_\theta\begin{pmatrix} F_\delta^{-1}(u) \\ w \end{pmatrix}. \tag{15}$$
According to Lemma 2, given that $\mathcal{U}\sim\mathcal{U}_{[0,1]}$ and $\mathcal{N}\sim\mathcal{N}(0,1)$, the vector $(F_\delta^{-1}(\mathcal{U}), \mathcal{N})$ (resp. $\tilde{\mathcal{N}}(\delta, \mathcal{U}, \mathcal{N})$) is distributed as the resampled step in the coordinate system $(\boldsymbol{n}, \boldsymbol{n}^\perp)$ (resp. $(\boldsymbol{e}_1, \boldsymbol{e}_2)$). Finally, let $\boldsymbol{w} = (u^1, w^1, \ldots, u^\lambda, w^\lambda)\in([0,1]\times\mathbb{R})^\lambda$ and let $\mathcal{G} : \mathbb{R}_+^*\times([0,1]\times\mathbb{R})^\lambda\to\mathbb{R}^2$ be the function defined as
$$\mathcal{G}(\delta, \boldsymbol{w}) = \tilde{\mathcal{N}}(\delta, u^{i^\star}, w^{i^\star}) \quad\text{with}\quad i^\star = \underset{i\in[1..\lambda]}{\operatorname{argmax}}\,[\tilde{\mathcal{N}}(\delta, u^i, w^i)]_1. \tag{16}$$
As shown in the following proposition, given that $\delta_t = \delta$ and that $\boldsymbol{W}_t = (\mathcal{U}_t^1, \mathcal{N}_t^1, \ldots, \mathcal{U}_t^\lambda, \mathcal{N}_t^\lambda)$, the random vector $\mathcal{G}(\delta, \boldsymbol{W}_t)$ is distributed as the selected step $\tilde{N}_t^\star$.
Proposition 1:
Let a $(1,\lambda)$-ES with resampling optimize the problem defined in Eq. (7), and let $(\boldsymbol{W}_t)_{t\in\mathbb{N}}$ be an i.i.d. sequence of random vectors with $\boldsymbol{W}_t = (\mathcal{U}_t^1, \mathcal{N}_t^1, \ldots, \mathcal{U}_t^\lambda, \mathcal{N}_t^\lambda)$, where the $(\mathcal{U}_t^i)_{i\in[1..\lambda]}$ are i.i.d. $\mathcal{U}_{[0,1]}$ and the $(\mathcal{N}_t^i)_{i\in[1..\lambda]}$ are i.i.d. $\mathcal{N}(0,1)$ random variables. Then
$$\tilde{N}_t^\star \stackrel{d}{=} \mathcal{G}(\delta_t, \boldsymbol{W}_t), \tag{17}$$
where the function $\mathcal{G}$ is defined in Eq. (16).
Proof:

Since f is a linear function with gradient $\boldsymbol{e}_1$, $f(\tilde{Y}_t^i)\le f(\tilde{Y}_t^j)$ is equivalent to $[\tilde{N}_t^i]_1\le[\tilde{N}_t^j]_1$. Hence $\star = \operatorname{argmax}_{i\in[1..\lambda]}[\tilde{N}_t^i]_1$ and therefore $\tilde{N}_t^\star = \tilde{N}_t^{\operatorname{argmax}_i[\tilde{N}_t^i]_1}$. From Lemma 2 and Eq. (15), $\tilde{N}_t^i\stackrel{d}{=}\tilde{\mathcal{N}}(\delta_t, \mathcal{U}_t^i, \mathcal{N}_t^i)$, so $\tilde{N}_t^\star\stackrel{d}{=}\tilde{\mathcal{N}}(\delta_t, \mathcal{U}_t^{i^\star}, \mathcal{N}_t^{i^\star})$ with $i^\star = \operatorname{argmax}_{i\in[1..\lambda]}[\tilde{\mathcal{N}}(\delta_t, \mathcal{U}_t^i, \mathcal{N}_t^i)]_1$, which from Eq. (16) is $\mathcal{G}(\delta_t, \boldsymbol{W}_t)$.
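The map $\mathcal{G}$ of Eq. (16) is directly implementable: draw the $\lambda$ feasible steps as in Lemma 2 and keep the one with the largest first coordinate. A sketch (the function name is ours):

```python
# G(delta, w) of Eq. (16): lambda feasible steps, selection on [.]_1.
import numpy as np
from scipy.stats import norm

def selected_step(delta, theta, lam, rng):
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    u = rng.uniform(size=lam)                 # the U_t^i
    w = rng.standard_normal(lam)              # the N_t^i
    steps = (rot @ np.vstack([norm.ppf(u * norm.cdf(delta)), w])).T
    return steps[np.argmax(steps[:, 0])]      # argmax of the first coordinate
```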

4  Constant Step-Size Case

We illustrate in this section our methodology on the simple case where the step-size is constant and equal to $\sigma > 0$, and prove that $([X_t]_1)_{t\in\mathbb{N}}$ diverges in probability at constant speed and that the progress rate $\varphi^\star_t = E([X_{t+1}]_1 - [X_t]_1)$ (see Arnold, 2011a, Eq. 2) converges to a strictly positive constant (Theorem 1). The analysis of CSA is then a generalization of the results presented here, with more technical results to derive. Note that the progress rate definition coincides with the fitness gain, that is, $\varphi^\star_t = E(f(X_{t+1}) - f(X_t))$.

As suggested by Arnold (2011a), the sequence $(\delta_t)_{t\in\mathbb{N}}$ plays a central role for the analysis, and we show that it admits a stationary measure. We first prove that this sequence is a homogeneous Markov chain.

Proposition 2:
Consider the $(1,\lambda)$-ES with resampling and with constant step-size $\sigma$ optimizing the constrained problem (7). Then the sequence $(\delta_t)_{t\in\mathbb{N}}$ is a homogeneous Markov chain on $\mathbb{R}_+^*$ and
$$\delta_{t+1} = \delta_t - \mathcal{G}(\delta_t, \boldsymbol{W}_t)\cdot\boldsymbol{n}, \tag{18}$$
where $\mathcal{G}$ is the function defined in Eq. (16) and $(\boldsymbol{W}_t)_{t\in\mathbb{N}}$ is an i.i.d. sequence with $\boldsymbol{W}_t\sim\mu_{\boldsymbol{W}} := (\mathcal{U}_{[0,1]}\otimes\mathcal{N}(0,1))^{\otimes\lambda}$ for all $t\in\mathbb{N}$.
Proof:

It follows from the definition of $\delta_t$ in Eq. (8) that $\delta_{t+1} = -g(X_{t+1})/\sigma = \delta_t - \tilde{N}_t^\star\cdot\boldsymbol{n}$, and Proposition 1 states that $\tilde{N}_t^\star\stackrel{d}{=}\mathcal{G}(\delta_t, \boldsymbol{W}_t)$. Since $\delta_{t+1}$ has the same distribution as a time-independent function of $\delta_t$ and of $\boldsymbol{W}_t$, where the $(\boldsymbol{W}_t)_{t\in\mathbb{N}}$ are i.i.d., it is a homogeneous Markov chain.

The Markov chain $(\delta_t)_{t\in\mathbb{N}}$ comes into play for investigating the divergence of $f(X_t) = [X_t]_1$. Indeed, we can express $([X_t]_1 - [X_0]_1)/t$ in the following manner:
$$\frac{[X_t]_1 - [X_0]_1}{t} = \frac{\sigma}{t}\sum_{k=0}^{t-1}[\tilde{N}_k^\star]_1 \stackrel{d}{=} \frac{\sigma}{t}\sum_{k=0}^{t-1}[\mathcal{G}(\delta_k, \boldsymbol{W}_k)]_1. \tag{19}$$
The latter term suggests the use of a Law of Large Numbers (LLN) to prove the convergence of $([X_t]_1 - [X_0]_1)/t$, which in turn implies, if the limit is positive, the divergence of $[X_t]_1$ at a constant rate. Sufficient conditions on a Markov chain to be able to apply the LLN include the existence of an invariant probability measure $\pi$. The limit term is then expressed as an expectation over the stationary distribution. More precisely, assuming the LLN can be applied, the following limit will hold:
$$\frac{[X_t]_1 - [X_0]_1}{t} \xrightarrow[t\to\infty]{} \sigma\, E_{\pi\otimes\mu_{\boldsymbol{W}}}\!\left([\mathcal{G}(\delta, \boldsymbol{W})]_1\right). \tag{20}$$
If the Markov chain is also V-ergodic with the function $\delta\mapsto E(|[\mathcal{G}(\delta, \boldsymbol{W})]_1|)$ dominated by $V$, then the progress rate converges to the same limit:
$$E\!\left([X_{t+1}]_1 - [X_t]_1\right) \xrightarrow[t\to\infty]{} \sigma\, E_{\pi\otimes\mu_{\boldsymbol{W}}}\!\left([\mathcal{G}(\delta, \boldsymbol{W})]_1\right). \tag{21}$$
We prove formally these two equations in Theorem 1.

The invariant measure $\pi$ also underlies the study carried out by Arnold (2011a, Section 4), where it is stated, “Assuming for now that the mutation strength is held constant, when the algorithm is iterated, the distribution of $\delta$-values tends to a stationary limit distribution.” We provide a formal proof that $(\delta_t)_{t\in\mathbb{N}}$ indeed admits a stationary limit distribution $\pi$, and prove some other useful properties that allow us to conclude the divergence of $[X_t]_1$.

4.1  Study of the Stability of $(\delta_t)_{t\in\mathbb{N}}$

We study in this section the stability of $(\delta_t)_{t\in\mathbb{N}}$. We first derive its transition kernel: for all $\delta\in\mathbb{R}_+^*$ and all $A\in\mathcal{B}(\mathbb{R}_+^*)$, since $\delta_{t+1} = \delta_t - \tilde{N}_t^\star\cdot\boldsymbol{n}$,
$$P(\delta, A) := P\!\left(\delta_{t+1}\in A\mid\delta_t = \delta\right) = \int_{\mathbb{R}^2}\mathbf{1}_A(\delta - \boldsymbol{x}\cdot\boldsymbol{n})\, p^\star_\delta(\boldsymbol{x})\, d\boldsymbol{x}, \tag{22}$$
where $p^\star_\delta$ is the density of $\tilde{N}_t^\star$ given in Eq. (12). For $t\in\mathbb{N}^*$, the t-steps transition kernel $P^t$ is defined by $P^t(\delta, A) := P(\delta_t\in A\mid\delta_0 = \delta)$.

From the transition kernel, we now derive the first properties of the Markov chain $(\delta_t)_{t\in\mathbb{N}}$. First we investigate the so-called $\varphi$-irreducibility property.

A Markov chain $(\delta_t)_{t\in\mathbb{N}}$ on a state space $\mathcal{X}$ is $\varphi$-irreducible if there exists a nontrivial measure $\varphi$ such that for all sets $A\in\mathcal{B}(\mathcal{X})$ with $\varphi(A) > 0$ and for all $\delta\in\mathcal{X}$, there exists $t\in\mathbb{N}^*$ such that $P^t(\delta, A) > 0$. We denote by $\mathcal{B}^+(\mathcal{X})$ the set of Borel sets of $\mathcal{X}$ with strictly positive $\varphi$-measure.

We also need the notions of small set and petite set. A set $C\in\mathcal{B}(\mathcal{X})$ is called a small set if there exist $t\in\mathbb{N}^*$ and a nontrivial measure $\nu_t$ such that for all sets $A\in\mathcal{B}(\mathcal{X})$ and all $\delta\in C$,
$$P^t(\delta, A) \ge \nu_t(A). \tag{23}$$
A set $C\in\mathcal{B}(\mathcal{X})$ is called a petite set if there exist a probability measure $a$ on $\mathbb{N}$ and a nontrivial measure $\nu_a$ such that for all sets $A\in\mathcal{B}(\mathcal{X})$ and all $\delta\in C$,
$$\sum_{t\in\mathbb{N}} P^t(\delta, A)\, a(t) \ge \nu_a(A). \tag{24}$$
A small set is therefore also a petite set. The existence of a small set combined with a control of the Markov chain outside the small set allows deducing powerful stability properties of the Markov chain. If there exists a $\nu_1$-small set C such that $\nu_1(C) > 0$, then the Markov chain is called strongly aperiodic.
Proposition 3:

Consider a $(1,\lambda)$-ES with resampling and with constant step-size optimizing the constrained problem (7), and let $(\delta_t)_{t\in\mathbb{N}}$ be the Markov chain exhibited in Eq. (18). Then $(\delta_t)_{t\in\mathbb{N}}$ is $\varphi$-irreducible (with respect to the trace of the Lebesgue measure on $\mathbb{R}_+^*$), strongly aperiodic, and compact sets of $\mathbb{R}_+^*$ and sets of the form $(0, M]$ with $M > 0$ are small sets.

Proof:
Take $\delta\in\mathbb{R}_+^*$ and $A\in\mathcal{B}(\mathbb{R}_+^*)$. Using Eq. (22) and Eq. (12), the transition kernel can be written
$$P(\delta, A) = \int_{\mathbb{R}^2}\mathbf{1}_A(\delta - \boldsymbol{x}\cdot\boldsymbol{n})\,\lambda\,\frac{\varphi(\boldsymbol{x})\,\mathbf{1}_{\{\boldsymbol{x}\cdot\boldsymbol{n}\le\delta\}}}{\Phi(\delta)}\, F_1([\boldsymbol{x}]_1\mid\delta)^{\lambda-1}\, d\boldsymbol{x}.$$
We remove $\delta$ from the indicator function $\mathbf{1}_A$ by the substitution of variables $u = \delta - \boldsymbol{x}\cdot\boldsymbol{n}$ and $v = \boldsymbol{x}\cdot\boldsymbol{n}^\perp$. As this substitution is the composition of a rotation and a translation, the determinant of its Jacobian matrix is 1. We denote $\boldsymbol{x}_{u,v} = (\delta - u)\,\boldsymbol{n} + v\,\boldsymbol{n}^\perp$. Then $\mathbf{1}_{\{\boldsymbol{x}_{u,v}\cdot\boldsymbol{n}\le\delta\}} = \mathbf{1}_{\mathbb{R}_+}(u)$, and
$$P(\delta, A) = \int_A\left(\int_{\mathbb{R}}\lambda\,\frac{\varphi(\boldsymbol{x}_{u,v})}{\Phi(\delta)}\, F_1([\boldsymbol{x}_{u,v}]_1\mid\delta)^{\lambda-1}\, dv\right)\mathbf{1}_{\mathbb{R}_+}(u)\, du =: \int_A h(\delta, u)\,\mathbf{1}_{\mathbb{R}_+}(u)\, du. \tag{25}$$
For all $(\delta, u)\in\mathbb{R}_+^*\times\mathbb{R}_+^*$ the function $h$ is strictly positive; hence for all A with $\Lambda_{\mathbb{R}_+^*}(A) > 0$, $P(\delta, A) > 0$. Hence $(\delta_t)_{t\in\mathbb{N}}$ is irreducible with respect to the trace of the Lebesgue measure on $\mathbb{R}_+^*$.
In addition, the function $(\delta, u)\mapsto h(\delta, u)$ is continuous as the composition of continuous functions (the continuity of $\delta\mapsto F_1(x\mid\delta)$ for all x coming from the dominated convergence theorem). Given a compact C of $\mathbb{R}_+^*$, we know that for all $u\in\mathbb{R}_+^*$, $\inf_{\delta'\in C} h(\delta', u) > 0$. Hence for all $A\in\mathcal{B}(\mathbb{R}_+^*)$ and all $\delta\in C$,
$$P(\delta, A) \ge \nu_C(A) := \int_A \inf_{\delta'\in C} h(\delta', u)\,\mathbf{1}_{\mathbb{R}_+}(u)\, du.$$
The measure $\nu_C$ being nontrivial, the previous equation shows that compact sets of $\mathbb{R}_+^*$ are small and that for C a compact such that $\Lambda(C) > 0$, we have $\nu_C(C) > 0$; hence the chain is strongly aperiodic. Note also that since h extends to a function that is continuous in $\delta$ on $[0, M]$ and strictly positive for $u > 0$, the same reasoning holds for $(0, M]$ instead of C (where $M > 0$). Hence the set $(0, M]$ is also a small set.
The application of the LLN for a $\varphi$-irreducible Markov chain $(\delta_t)_{t\in\mathbb{N}}$ on a state space $\mathcal{X}$ requires the existence of an invariant probability measure $\pi$, that is, satisfying for all $A\in\mathcal{B}(\mathcal{X})$
$$\pi(A) = \int_{\mathcal{X}} P(\delta, A)\,\pi(d\delta). \tag{26}$$
If a Markov chain admits an invariant probability measure, then the Markov chain is called positive.
A typical assumption to apply the LLN is positivity and Harris recurrence. A $\varphi$-irreducible chain $(\delta_t)_{t\in\mathbb{N}}$ on a state space $\mathcal{X}$ is Harris-recurrent if for all sets $A\in\mathcal{B}^+(\mathcal{X})$ and for all $\delta\in\mathcal{X}$, $P(\eta_A = \infty\mid\delta_0 = \delta) = 1$, where $\eta_A$ is the occupation time of A, that is, $\eta_A = \sum_{t=1}^{\infty}\mathbf{1}_A(\delta_t)$. We show that the Markov chain $(\delta_t)_{t\in\mathbb{N}}$ is positive and Harris-recurrent by using so-called Foster-Lyapunov drift conditions: define the drift operator for a positive function V as
$$\Delta V(\delta) := E\!\left(V(\delta_{t+1})\mid\delta_t = \delta\right) - V(\delta). \tag{27}$$
Drift conditions translate that outside a small set, the drift operator is negative. We show a drift condition for V-geometric ergodicity, where, given a function $f\ge 1$, a positive and Harris-recurrent chain with invariant probability measure $\pi$ is called f-geometrically ergodic if $\pi(f) := \int f\, d\pi < \infty$ and there exists $r_f > 1$ such that
$$\sum_{t\in\mathbb{N}} r_f^t\,\|P^t(\delta, \cdot) - \pi\|_f < \infty \quad\text{for all }\delta\in\mathcal{X}, \tag{28}$$
where for a signed measure $\nu$, $\|\nu\|_f$ denotes $\sup_{g : |g|\le f}\left|\int g\, d\nu\right|$.
To prove the V-geometric ergodicity, we prove that there exist a small set C, constants $b\in\mathbb{R}$, $\epsilon\in\mathbb{R}_+^*$, and a function $V\ge 1$ finite for at least some $\delta$ such that for all $\delta\in\mathcal{X}$,
$$\Delta V(\delta) \le -\epsilon V(\delta) + b\,\mathbf{1}_C(\delta). \tag{29}$$
If the Markov chain $(\delta_t)_{t\in\mathbb{N}}$ is $\varphi$-irreducible and aperiodic, this drift condition implies that the chain is V-geometrically ergodic (Meyn and Tweedie, 1993, Theorem 15.0.1), as well as positive and Harris-recurrent.

Because sets of the form $(0, M]$ with $M > 0$ are small sets, and drift conditions investigate the negativity outside a small set, we need to study the chain for $\delta$ large. The following lemma is a technical lemma studying the limit of $E(\exp(-\alpha\,\mathcal{G}(\delta, \boldsymbol{W})\cdot\boldsymbol{n}))$ for $\delta$ to infinity.

Lemma 4:

Consider the $(1,\lambda)$-ES with resampling optimizing the constrained problem (7), and let $\mathcal{G}$ be the function defined in Eq. (16). We denote by $K_\delta$ and $K_\infty$ the random variables $\mathcal{G}(\delta, \boldsymbol{W})\cdot\boldsymbol{n}$ and $\cos\theta\, N_{\lambda:\lambda} + \sin\theta\,\mathcal{N}$, with $\boldsymbol{W}\sim\mu_{\boldsymbol{W}}$ and $\mathcal{N}\sim\mathcal{N}(0,1)$ independent of $N_{\lambda:\lambda}$. For $\theta\in(0,\pi/2)$ and any $\alpha > 0$, $E(\exp(-\alpha K_\delta))$ is finite and $\lim_{\delta\to+\infty} E(\exp(-\alpha K_\delta)) = E(\exp(-\alpha K_\infty))$.

For the proof see the appendix. We are now ready to prove a drift condition for geometric ergodicity.

Proposition 4:

Consider a $(1,\lambda)$-ES with resampling and with constant step-size optimizing the constrained problem (7), and let $(\delta_t)_{t\in\mathbb{N}}$ be the Markov chain exhibited in Eq. (18). The Markov chain $(\delta_t)_{t\in\mathbb{N}}$ is V-geometrically ergodic with $V : \delta\mapsto\exp(\alpha\delta)$ for $\alpha > 0$ small enough, and is Harris-recurrent and positive with invariant probability measure $\pi$.

Proof:
Take the function $V : \delta\mapsto e^{\alpha\delta}$; then, using Eq. (18) and the notation of Lemma 4,
$$\frac{\Delta V(\delta)}{V(\delta)} = E\!\left(e^{-\alpha K_\delta}\right) - 1.$$
With Lemma 4 we obtain that
$$\lim_{\delta\to+\infty}\frac{\Delta V(\delta)}{V(\delta)} = E\!\left(e^{-\alpha K_\infty}\right) - 1.$$
As the right-hand side of the previous equation is finite, we can invert integral and series with Fubini's theorem, so with Taylor series
$$E\!\left(e^{-\alpha K_\infty}\right) = \sum_{k\in\mathbb{N}}\frac{(-\alpha)^k}{k!}\, E\!\left(K_\infty^k\right) = 1 - \alpha\, E(K_\infty) + o(\alpha),$$
which in turn yields
$$\lim_{\delta\to+\infty}\frac{\Delta V(\delta)}{V(\delta)} = -\alpha\, E(K_\infty) + o(\alpha).$$
Since for $\lambda\ge 2$, $E(K_\infty) = \cos\theta\, E(N_{\lambda:\lambda}) > 0$, for $\theta\in(0,\pi/2)$ and $\alpha$ small enough, we get $\lim_{\delta\to+\infty}\Delta V(\delta)/V(\delta) < 0$. Hence there exist $\epsilon > 0$, $M > 0$, and $b < \infty$ such that
$$\Delta V(\delta) \le -\epsilon V(\delta) + b\,\mathbf{1}_{(0,M]}(\delta).$$

According to Proposition 3, $(0, M]$ is a small set; hence it is petite (Meyn and Tweedie, 1993, Prop. 5.5.3). Furthermore $(\delta_t)_{t\in\mathbb{N}}$ is a $\varphi$-irreducible aperiodic Markov chain, so it satisfies the conditions of Theorem 15.0.1 from Meyn and Tweedie (1993), which with Lemma 15.2.2, Theorem 9.1.8, and Theorem 14.0.1 from Meyn and Tweedie (1993) proves the proposition.
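The drift argument can be illustrated numerically. For $V(\delta) = e^{\alpha\delta}$, Eq. (18) gives $\Delta V(\delta)/V(\delta) = E(e^{-\alpha\,\mathcal{G}(\delta,\boldsymbol{W})\cdot\boldsymbol{n}}) - 1$, and a Monte Carlo estimate shows the quotient becoming negative for large $\delta$. A sketch reusing `selected_step` above; the parameter values are our choices:

```python
# Monte Carlo estimate of the drift quotient Delta V / V for V = exp(alpha * .).
import numpy as np

def drift_quotient(delta, theta, lam, alpha, n_samples, rng):
    n_vec = np.array([np.cos(theta), np.sin(theta)])
    steps = np.array([selected_step(delta, theta, lam, rng)
                      for _ in range(n_samples)])
    return np.exp(-alpha * steps @ n_vec).mean() - 1.0

rng = np.random.default_rng(0)
for delta in (0.1, 1.0, 10.0):
    print(delta, drift_quotient(delta, np.pi / 4, 10, 0.1, 10_000, rng))
# the quotient is negative for large delta, as in the proof
```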

We have proved rigorously the existence (and uniqueness) of an invariant measure for the Markov chain $(\delta_t)_{t\in\mathbb{N}}$, which provides the steady-state behavior described by Arnold (2011a, Section 4). As the Markov chain is positive and Harris-recurrent, we may now apply a Law of Large Numbers (Meyn and Tweedie, 1993, Theorem 17.1.7) in Eq. (19) to obtain the divergence of $[X_t]_1$ and an exact expression of the divergence rate.

Theorem 1:
Consider a $(1,\lambda)$-ES with resampling and with constant step-size $\sigma$ optimizing the constrained problem (7), and let $(\delta_t)_{t\in\mathbb{N}}$ be the Markov chain exhibited in Eq. (18). The sequence $([X_t]_1)_{t\in\mathbb{N}}$ diverges in probability to $+\infty$ at constant speed, that is,
$$\frac{[X_t]_1 - [X_0]_1}{t} \xrightarrow[t\to\infty]{P} \sigma\, E_{\pi\otimes\mu_{\boldsymbol{W}}}\!\left([\mathcal{G}(\delta, \boldsymbol{W})]_1\right) > 0, \tag{30}$$
and the expected progress satisfies
$$\varphi^\star_t = E\!\left([X_{t+1}]_1 - [X_t]_1\right) \xrightarrow[t\to\infty]{} \sigma\, E_{\pi\otimes\mu_{\boldsymbol{W}}}\!\left([\mathcal{G}(\delta, \boldsymbol{W})]_1\right), \tag{31}$$
where $\varphi^\star_t$ is the progress rate defined by Arnold (2011a, Eq. (2)), $\mathcal{G}$ is defined in Eq. (16), $(\boldsymbol{W}_t)_{t\in\mathbb{N}}$ is an i.i.d. sequence such that $\boldsymbol{W}_t\sim\mu_{\boldsymbol{W}}$ for all $t\in\mathbb{N}$, $\pi$ is the stationary measure of $(\delta_t)_{t\in\mathbb{N}}$ whose existence was proved in Proposition 4, and $\mu_{\boldsymbol{W}}$ is the probability measure of $\boldsymbol{W}_t$.
Proof:

From Proposition 4 the Markov chain $(\delta_t)_{t\in\mathbb{N}}$ is Harris-recurrent and positive, and since $(\boldsymbol{W}_t)_{t\in\mathbb{N}}$ is i.i.d., the chain $(\delta_t, \boldsymbol{W}_t)_{t\in\mathbb{N}}$ is also Harris-recurrent and positive with invariant probability measure $\pi\otimes\mu_{\boldsymbol{W}}$, so to apply the Law of Large Numbers (Meyn and Tweedie, 1993, Theorem 17.0.1) to Eq. (19) we only need the function $(\delta, \boldsymbol{w})\mapsto[\mathcal{G}(\delta, \boldsymbol{w})]_1$ to be $\pi\otimes\mu_{\boldsymbol{W}}$-integrable.

With Fubini-Tonelli's theorem, $E_{\pi\otimes\mu_{\boldsymbol{W}}}(|[\mathcal{G}(\delta, \boldsymbol{W})]_1|)$ equals $\int_{\mathbb{R}_+^*} E(|[\mathcal{G}(\delta, \boldsymbol{W})]_1|)\,\pi(d\delta)$. As $F_1(x\mid\delta)\le 1$ and, from Eq. (10), $p_1(x\mid\delta)\le\varphi(x)/\Phi(\delta)$, with Eq. (13) we obtain that $|x|\, p^\star_1(x\mid\delta)\le\lambda\,|x|\,\varphi(x)/\Phi(\delta)$, so the function $x\mapsto|x|\, p^\star_1(x\mid\delta)$ is integrable. Hence for all $\delta\in\mathbb{R}_+^*$, $E(|[\mathcal{G}(\delta, \boldsymbol{W})]_1|)$ is finite. Using the dominated convergence theorem, the function $\delta\mapsto p^\star_1(x\mid\delta)$ is continuous, hence so is $\delta\mapsto E(|[\mathcal{G}(\delta, \boldsymbol{W})]_1|)$. Finally, using Lemma 4 with Jensen's inequality shows that $\limsup_{\delta\to+\infty} E(|[\mathcal{G}(\delta, \boldsymbol{W})]_1|)$ is finite. Therefore the function $\delta\mapsto E(|[\mathcal{G}(\delta, \boldsymbol{W})]_1|)$ is bounded by a constant $M_{\mathcal{G}}$. As $\pi$ is a probability measure, $E_{\pi\otimes\mu_{\boldsymbol{W}}}(|[\mathcal{G}(\delta, \boldsymbol{W})]_1|)\le M_{\mathcal{G}} < \infty$, meaning $(\delta, \boldsymbol{w})\mapsto[\mathcal{G}(\delta, \boldsymbol{w})]_1$ is $\pi\otimes\mu_{\boldsymbol{W}}$-integrable. Hence we may apply the LLN on Eq. (19):
$$\frac{\sigma}{t}\sum_{k=0}^{t-1}[\mathcal{G}(\delta_k, \boldsymbol{W}_k)]_1 \xrightarrow[t\to\infty]{\text{a.s.}} \sigma\, E_{\pi\otimes\mu_{\boldsymbol{W}}}\!\left([\mathcal{G}(\delta, \boldsymbol{W})]_1\right).$$
The equality in distribution in Eq. (19) allows us to deduce the convergence in probability of the left-hand side of Eq. (19) to the right-hand side of the previous equation.

From Eq. (19), $E([X_{t+1}]_1 - [X_t]_1) = \sigma\, E([\mathcal{G}(\delta_t, \boldsymbol{W}_t)]_1)$. As the function $(\delta, \boldsymbol{w})\mapsto[\mathcal{G}(\delta, \boldsymbol{w})]_1$ is integrable, with Fubini's theorem, $E([\mathcal{G}(\delta_t, \boldsymbol{W}_t)]_1) = \int E([\mathcal{G}(\delta, \boldsymbol{W})]_1)\, P^t(\delta_0, d\delta)$, so $|E([\mathcal{G}(\delta_t, \boldsymbol{W}_t)]_1) - E_{\pi\otimes\mu_{\boldsymbol{W}}}([\mathcal{G}(\delta, \boldsymbol{W})]_1)|$ equals $|\int E([\mathcal{G}(\delta, \boldsymbol{W})]_1)\,(P^t(\delta_0, d\delta) - \pi(d\delta))|$. According to Proposition 4, $(\delta_t)_{t\in\mathbb{N}}$ is V-geometrically ergodic with $V : \delta\mapsto e^{\alpha\delta}$, so there exist $r > 1$ and $R < \infty$ such that $\|P^t(\delta_0, \cdot) - \pi\|_V \le R\, V(\delta_0)\, r^{-t}$. We showed that the function $\delta\mapsto E(|[\mathcal{G}(\delta, \boldsymbol{W})]_1|)$ is bounded by $M_{\mathcal{G}}$, so since $V\ge 1$, there exists k such that $E(|[\mathcal{G}(\delta, \boldsymbol{W})]_1|)\le k\, V(\delta)$ for all $\delta$. Hence $|E([\mathcal{G}(\delta_t, \boldsymbol{W}_t)]_1) - E_{\pi\otimes\mu_{\boldsymbol{W}}}([\mathcal{G}(\delta, \boldsymbol{W})]_1)| \le k\, R\, V(\delta_0)\, r^{-t}$, which converges to 0 when t goes to infinity.

As the measure $\pi$ is an invariant measure for the Markov chain $(\delta_t)_{t\in\mathbb{N}}$, using Eq. (18), $E_{\pi\otimes\mu_{\boldsymbol{W}}}(\delta_{t+1}) = E_{\pi\otimes\mu_{\boldsymbol{W}}}(\delta_t)$ (both being finite since $\pi(V) < \infty$); hence $E_{\pi\otimes\mu_{\boldsymbol{W}}}(\mathcal{G}(\delta, \boldsymbol{W})\cdot\boldsymbol{n}) = 0$, and thus
$$\cos\theta\, E_{\pi\otimes\mu_{\boldsymbol{W}}}\!\left([\mathcal{G}(\delta, \boldsymbol{W})]_1\right) = -\sin\theta\, E_{\pi\otimes\mu_{\boldsymbol{W}}}\!\left([\mathcal{G}(\delta, \boldsymbol{W})]_2\right).$$
We see from Eq. (14) that, conditionally on $[\tilde{N}_t^\star]_1 = x$, the second coordinate follows a standard normal distribution truncated from above at $(\delta - x\cos\theta)/\sin\theta$, whose expected value is strictly negative for any x; hence the expected value $E_{\pi\otimes\mu_{\boldsymbol{W}}}([\mathcal{G}(\delta, \boldsymbol{W})]_2)$ is strictly negative. With the previous equation this implies that $E_{\pi\otimes\mu_{\boldsymbol{W}}}([\mathcal{G}(\delta, \boldsymbol{W})]_1)$ is strictly positive.

We showed rigorously the divergence of $[X_t]_1$ and gave an exact expression of the divergence rate, and we showed that the progress rate converges to the same rate. The fact that the chain $(\delta_t)_{t\in\mathbb{N}}$ is V-geometrically ergodic gives that there exists a constant $r > 1$ such that $\|P^t(\delta, \cdot) - \pi\|_V = O(r^{-t})$. This implies that the distribution $\pi$ can be simulated efficiently by a Monte Carlo simulation, giving precise estimations of the divergence rate of $[X_t]_1$.

A Monte Carlo simulation of the divergence rate in the right-hand side of expressions (30) and (31), run over a large number of time steps, gives the progress rate $\varphi^\star$ of Arnold (2011a), which once normalized by $\sigma$ and $\lambda$ yields Figure 2. We normalize per $\lambda$, as in evolution strategies the cost of the algorithm is assumed to be the number of f-calls. We see that for small values of the constraint angle $\theta$, the normalized serial progress rate is roughly independent of $\lambda$. Only for larger constraint angles does the serial progress rate depend on $\lambda$, with smaller $\lambda$ being preferable.
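A sketch of such a Monte Carlo estimate, iterating the chain of Eq. (18) and time-averaging $[\mathcal{G}(\delta_k, \boldsymbol{W}_k)]_1$ as in Eqs. (19) and (20). It reuses `selected_step` above; the burn-in and run lengths are our choices, not necessarily those used for Figure 2:

```python
# Estimate the divergence rate of Eqs. (30)-(31) with sigma = 1.
import numpy as np

def progress_rate(theta, lam, n_steps, rng, delta0=1.0, burnin=10_000):
    n_vec = np.array([np.cos(theta), np.sin(theta)])
    delta, acc = delta0, 0.0
    for t in range(burnin + n_steps):
        step = selected_step(delta, theta, lam, rng)   # ~ G(delta_t, W_t)
        if t >= burnin:
            acc += step[0]
        delta = delta - step @ n_vec                   # Eq. (18)
    return acc / n_steps

rng = np.random.default_rng(1)
print(progress_rate(np.pi / 6, 5, 100_000, rng) / 5)   # normalized per f-call
```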

Figure 2:

Normalized progress rate $\varphi^\star/\sigma$ divided by $\lambda$ for the $(1,\lambda)$-ES with constant step-size and resampling, plotted against the constraint angle $\theta$, for different values of $\lambda$.

Figure 3 is obtained through simulations of the Markov chain $(\delta_t)_{t\in\mathbb{N}}$ defined in Eq. (18), run over a large number of time steps, where the values of $\delta_t$ are averaged over time. We see that when $\theta$ approaches $\pi/2$, the average normalized distance from the constraint diverges, since the selection does not attract toward the constraint anymore. With a larger population size the algorithm is closer to the constraint, as better samples are more likely to be found close to the constraint.

Figure 3:

Average normalized distance $\delta$ from the constraint for the $(1,\lambda)$-ES with constant step-size and resampling, plotted against the constraint angle $\theta$, for different values of $\lambda$.

5  Cumulative Step-Size Adaptation

In this section we apply the techniques introduced in the previous section to the case where the step-size is adapted using cumulative step-size adaptation. This technique was studied on sphere functions by Arnold and Beyer (2004) and on ridge functions by Arnold and MacLeod (2008).

In CSA the step-size is adapted using a path $\boldsymbol{p}_t$, a vector of $\mathbb{R}^n$, that sums up the different selected steps with a discount factor. More precisely, the evolution path is defined by its initialization $\boldsymbol{p}_0$ and
$$\boldsymbol{p}_{t+1} = (1-c)\,\boldsymbol{p}_t + \sqrt{c(2-c)}\,\tilde{N}_t^\star. \tag{32}$$
The variable $c\in(0,1]$ is called the cumulation parameter and determines the "memory" of the evolution path, with the importance of a step decreasing in $(1-c)^t$. The backward time horizon is consequently about $1/c$. The coefficients in Eq. (32) are chosen such that if $\boldsymbol{p}_t$ follows a standard normal distribution, and if f ranks uniformly randomly the different samples and these samples are normally distributed, then $\boldsymbol{p}_{t+1}$ will also follow a standard normal distribution independently of the value of c.
The length of the evolution path is compared to the expected length $E(\|\mathcal{N}(0, I_n)\|)$ of a Gaussian vector (that corresponds to the expected length under random selection) (see Hansen and Ostermeier, 2001). To simplify the analysis we study here a modified version of CSA introduced by Arnold (2002), where the squared length of the evolution path is compared with the expected squared length of a Gaussian vector, that is, n, the expected squared length of the evolution path under random selection. If $\|\boldsymbol{p}_{t+1}\|^2$ is greater (resp. lower) than n, then the step-size is increased (resp. decreased) following
$$\sigma_{t+1} = \sigma_t\exp\!\left(\frac{c}{2 d_\sigma}\left(\frac{\|\boldsymbol{p}_{t+1}\|^2}{n} - 1\right)\right), \tag{33}$$
where the damping parameter $d_\sigma$ determines how much the step-size can change.
As for $\tilde{N}_t^\star$, we also have $[\boldsymbol{p}_t]_i\sim\mathcal{N}(0,1)$ for $i\in[3..n]$. It is convenient in the sequel to also denote by $\boldsymbol{p}_t$ the two-dimensional vector $([\boldsymbol{p}_t]_1, [\boldsymbol{p}_t]_2)$. With this (small) abuse of notation, Eq. (33) is rewritten as
$$\sigma_{t+1} = \sigma_t\exp\!\left(\frac{c}{2 d_\sigma}\left(\frac{\|\boldsymbol{p}_{t+1}\|^2 + K_t}{n} - 1\right)\right), \tag{34}$$
with $(K_t)_{t\in\mathbb{N}}$ an i.i.d. sequence of random variables following a chi-squared distribution with $n-2$ degrees of freedom. We denote by $\eta^\star$ the multiplicative step-size change $\sigma_{t+1}/\sigma_t$, which is the function
$$\eta^\star(\boldsymbol{p}_t, \delta_t, \boldsymbol{W}_t, K_t) = \exp\!\left(\frac{c}{2 d_\sigma}\left(\frac{\left\|(1-c)\,\boldsymbol{p}_t + \sqrt{c(2-c)}\,\mathcal{G}(\delta_t, \boldsymbol{W}_t)\right\|^2 + K_t}{n} - 1\right)\right). \tag{35}$$

Note that for $c = 1$, $\eta^\star$ is a function only of $\delta_t$, $\boldsymbol{W}_t$, and $K_t$, which we denote by $\eta_1^\star(\delta_t, \boldsymbol{W}_t, K_t)$.
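In code, one CSA update of the evolution path and of the step-size multiplier of Eqs. (34) and (35) reads as follows. A sketch reusing `selected_step` above; the function name is ours:

```python
# One CSA update in the two-dimensional abuse-of-notation form.
import numpy as np

def csa_update(p, delta, theta, lam, c, d_sigma, n, rng):
    p_new = (1 - c) * p + np.sqrt(c * (2 - c)) * selected_step(delta, theta, lam, rng)
    k = rng.chisquare(n - 2)            # contribution of the n-2 free coordinates
    eta = np.exp((c / (2 * d_sigma)) * ((p_new @ p_new + k) / n - 1))  # Eq. (35)
    return eta, p_new
```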

We prove in the next proposition that for $c < 1$ the sequence $(\delta_t, \boldsymbol{p}_t)_{t\in\mathbb{N}}$ is a homogeneous Markov chain, and make explicit its update function. In the case where $c = 1$, the chain reduces to $(\delta_t)_{t\in\mathbb{N}}$.

Proposition 5:
Consider a $(1,\lambda)$-ES with resampling and cumulative step-size adaptation maximizing the constrained problem (7). Take $c\in(0,1)$. The sequence $(\delta_t, \boldsymbol{p}_t)_{t\in\mathbb{N}}$ is a time-homogeneous Markov chain and
$$\delta_{t+1} = \frac{\delta_t - \mathcal{G}(\delta_t, \boldsymbol{W}_t)\cdot\boldsymbol{n}}{\eta^\star(\boldsymbol{p}_t, \delta_t, \boldsymbol{W}_t, K_t)}, \tag{36}$$
$$\boldsymbol{p}_{t+1} = (1-c)\,\boldsymbol{p}_t + \sqrt{c(2-c)}\,\mathcal{G}(\delta_t, \boldsymbol{W}_t), \tag{37}$$
with $(K_t)_{t\in\mathbb{N}}$ an i.i.d. sequence of random variables following a chi-squared distribution with $n-2$ degrees of freedom, $\mathcal{G}$ defined in Eq. (16), and $(\boldsymbol{W}_t)_{t\in\mathbb{N}}$ defined in Proposition 1.
If c = 1, then the sequence $(\delta_t)_{t\in\mathbb{N}}$ is a time-homogeneous Markov chain and
$$\delta_{t+1} = \frac{\delta_t - \mathcal{G}(\delta_t, \boldsymbol{W}_t)\cdot\boldsymbol{n}}{\eta_1^\star(\delta_t, \boldsymbol{W}_t, K_t)}. \tag{38}$$
Proof:

With Eq. (32) and Eq. (17) we get Eq. (37).

From Eq. (8) and Proposition 1 it follows that
$$\delta_{t+1} = -\frac{g(X_{t+1})}{\sigma_{t+1}} = \frac{-g(X_t) - \sigma_t\,\mathcal{G}(\delta_t, \boldsymbol{W}_t)\cdot\boldsymbol{n}}{\sigma_t\,\eta^\star(\boldsymbol{p}_t, \delta_t, \boldsymbol{W}_t, K_t)} = \frac{\delta_t - \mathcal{G}(\delta_t, \boldsymbol{W}_t)\cdot\boldsymbol{n}}{\eta^\star(\boldsymbol{p}_t, \delta_t, \boldsymbol{W}_t, K_t)}.$$
So $(\delta_{t+1}, \boldsymbol{p}_{t+1})$ is a function of only $(\delta_t, \boldsymbol{p}_t)$ and of i.i.d. random variables; hence $(\delta_t, \boldsymbol{p}_t)_{t\in\mathbb{N}}$ is a time-homogeneous Markov chain.

Fixing $c = 1$ in Eqs. (36) and (37) immediately yields Eq. (38), and then $\delta_{t+1}$ is a function of only $\delta_t$ and of i.i.d. random variables, so in this case $(\delta_t)_{t\in\mathbb{N}}$ is a time-homogeneous Markov chain.

As for the constant step-size case, the Markov chain $(\delta_t, \boldsymbol{p}_t)_{t\in\mathbb{N}}$ is important when investigating the convergence or divergence of the step-size of the algorithm. Indeed, from Eq. (34) we can express $\frac{1}{t}\ln(\sigma_t/\sigma_0)$ as
$$\frac{1}{t}\ln\frac{\sigma_t}{\sigma_0} = \frac{c}{2 d_\sigma}\,\frac{1}{t}\sum_{k=0}^{t-1}\left(\frac{\|\boldsymbol{p}_{k+1}\|^2 + K_k}{n} - 1\right). \tag{39}$$
The right-hand side suggests we use the LLN. The convergence of $\frac{1}{t}\ln(\sigma_t/\sigma_0)$ to a strictly positive (resp. strictly negative) limit will imply the divergence (resp. convergence) of $\sigma_t$ at a geometric rate.
It turns out that the dynamics of the chain $(\delta_t, \boldsymbol{p}_t)_{t\in\mathbb{N}}$ look complex to analyze. Establishing drift conditions looks particularly challenging. We therefore restrict the rest of the study to the simpler case where $c = 1$; hence the Markov chain of interest is $(\delta_t)_{t\in\mathbb{N}}$. Then Eq. (39) becomes
$$\frac{1}{t}\ln\frac{\sigma_t}{\sigma_0} = \frac{1}{2 d_\sigma}\,\frac{1}{t}\sum_{k=0}^{t-1}\left(\frac{\|\mathcal{G}(\delta_k, \boldsymbol{W}_k)\|^2 + K_k}{n} - 1\right). \tag{40}$$
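For c = 1 the right-hand side of Eq. (40) can be estimated directly by iterating the chain of Eq. (38). A sketch reusing `selected_step` above; the parameter values are our choices:

```python
# Monte Carlo estimate of the rate (1/t) ln(sigma_t / sigma_0) for c = 1.
import numpy as np

def log_sigma_rate(theta, lam, d_sigma, n, n_steps, rng, delta0=1.0):
    n_vec = np.array([np.cos(theta), np.sin(theta)])
    delta, acc = delta0, 0.0
    for _ in range(n_steps):
        step = selected_step(delta, theta, lam, rng)
        k = rng.chisquare(n - 2)
        log_eta = (1.0 / (2 * d_sigma)) * ((step @ step + k) / n - 1)  # c = 1
        acc += log_eta
        delta = (delta - step @ n_vec) / np.exp(log_eta)               # Eq. (38)
    return acc / n_steps  # positive: geometric divergence; negative: convergence

rng = np.random.default_rng(2)
print(log_sigma_rate(np.pi / 4, 10, 1.0, 10, 100_000, rng))
```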

To apply the LLN we need the Markov chain $(\delta_t)_{t\in\mathbb{N}}$ to be Harris-positive, as well as the properties mentioned in the following lemma.

Lemma 5 (Chotard and Auger 2015, Prop. 7):

Consider a $(1,\lambda)$-ES with resampling and cumulative step-size adaptation maximizing the constrained problem (7). For $c = 1$ the Markov chain $(\delta_t)_{t\in\mathbb{N}}$ from Proposition 5 is $\varphi$-irreducible (with respect to the trace of the Lebesgue measure on $\mathbb{R}_+^*$), strongly aperiodic, and compact sets of $\mathbb{R}_+^*$ are small sets for this chain.

We believe that the latter result can be generalized to the case $c < 1$ if for any $(\delta^*, \boldsymbol{p}^*)$ there exists $t\in\mathbb{N}^*$ such that for all $(\delta, \boldsymbol{p})$ there exists a path of events of length t, of positive probability, from $(\delta, \boldsymbol{p})$ to the set $B(\delta^*, \epsilon)\times B(\boldsymbol{p}^*, \epsilon)$ for $\epsilon$ small enough.

To show the Harris positivity of $(\delta_t)_{t\in\mathbb{N}}$ we use the drift function $V : \delta\mapsto\delta^\alpha + \delta^{-\alpha}$ for a (small) $\alpha > 0$. From the definition of the drift operator in Eq. (27) and the update of $\delta_t$ in Eq. (38), we have
$$\Delta V(\delta) = E\!\left(\left(\frac{\delta - \mathcal{G}(\delta, \boldsymbol{W})\cdot\boldsymbol{n}}{\eta_1^\star(\delta, \boldsymbol{W}, K)}\right)^{\!\alpha}\right) + E\!\left(\left(\frac{\delta - \mathcal{G}(\delta, \boldsymbol{W})\cdot\boldsymbol{n}}{\eta_1^\star(\delta, \boldsymbol{W}, K)}\right)^{\!-\alpha}\right) - V(\delta). \tag{41}$$
To verify the drift condition (29), using the fact from Lemma 5 that for $c = 1$ every compact set of $\mathbb{R}_+^*$ is a small set, it is sufficient to show that the limits of $\Delta V(\delta)/V(\delta)$ in 0 and $+\infty$ are negative. These limits result from the limits studied in the following lemma, corresponding to the decomposition in Eq. (41).
Lemma 6:
For $\alpha > 0$ small enough,
$$\lim_{\delta\to+\infty}\frac{1}{\delta^{\alpha}}\, E\!\left(\left(\frac{\delta - \mathcal{G}(\delta, \boldsymbol{W})\cdot\boldsymbol{n}}{\eta_1^\star(\delta, \boldsymbol{W}, K)}\right)^{\!\alpha}\right) = e^{\frac{\alpha}{2 d_\sigma}}\, E_1 E_2 E_3, \tag{42}$$
$$\lim_{\delta\to+\infty}\frac{1}{\delta^{-\alpha}}\, E\!\left(\left(\frac{\delta - \mathcal{G}(\delta, \boldsymbol{W})\cdot\boldsymbol{n}}{\eta_1^\star(\delta, \boldsymbol{W}, K)}\right)^{\!-\alpha}\right) = e^{-\frac{\alpha}{2 d_\sigma}}\,\bar{E}_1\bar{E}_2\bar{E}_3, \tag{43}$$
$$\lim_{\delta\to 0} E\!\left(\left(\frac{\delta - \mathcal{G}(\delta, \boldsymbol{W})\cdot\boldsymbol{n}}{\eta_1^\star(\delta, \boldsymbol{W}, K)}\right)^{\!\alpha}\right) = E\!\left(\left(\frac{-\mathcal{G}(0, \boldsymbol{W})\cdot\boldsymbol{n}}{\eta_1^\star(0, \boldsymbol{W}, K)}\right)^{\!\alpha}\right) < \infty, \tag{44}$$
$$\lim_{\delta\to 0} E\!\left(\left(\frac{\delta - \mathcal{G}(\delta, \boldsymbol{W})\cdot\boldsymbol{n}}{\eta_1^\star(\delta, \boldsymbol{W}, K)}\right)^{\!-\alpha}\right) = E\!\left(\left(\frac{-\mathcal{G}(0, \boldsymbol{W})\cdot\boldsymbol{n}}{\eta_1^\star(0, \boldsymbol{W}, K)}\right)^{\!-\alpha}\right) < \infty, \tag{45}$$
where $E_1 = E(\exp(-\frac{\alpha}{2 d_\sigma n} N_{\lambda:\lambda}^2))$, $E_2 = E(\exp(-\frac{\alpha}{2 d_\sigma n}\mathcal{N}^2))$ with $\mathcal{N}\sim\mathcal{N}(0,1)$, and $E_3 = E(\exp(-\frac{\alpha}{2 d_\sigma n} K))$, and $\bar{E}_1$, $\bar{E}_2$, $\bar{E}_3$ are defined analogously with $\alpha$ replaced by $-\alpha$; where $\mathcal{G}$ is the function defined in Eq. (16) and $\eta_1^\star$ is defined in Eq. (35) (for c = 1), K is a random variable following a chi-squared distribution with $n - 2$ degrees of freedom, and $\boldsymbol{W}\sim\mu_{\boldsymbol{W}}$ is a random vector.

The proof of this lemma consists in applications of Lebesgue’s dominated convergence theorem (see appendix).

We now prove the Harris positivity of $(\delta_t)_{t\in\mathbb{N}}$ by proving a stronger property, namely, the geometric ergodicity, which we show using the drift inequality (29).

Proposition 6:

Consider a $(1,\lambda)$-ES with resampling and cumulative step-size adaptation maximizing the constrained problem (7). For $c = 1$ and $\lambda\ge 3$ the Markov chain $(\delta_t)_{t\in\mathbb{N}}$ from Proposition 5 is V-geometrically ergodic with $V : \delta\mapsto\delta^\alpha + \delta^{-\alpha}$ for $\alpha > 0$ small enough, and positive Harris with invariant measure $\pi$.

Proof:
Take V the positive function $\delta\mapsto\delta^\alpha + \delta^{-\alpha}$ (the parameter $\alpha$ is strictly positive and is specified later), $\boldsymbol{W}\sim\mu_{\boldsymbol{W}}$ a random vector, and K a random variable following a chi-squared distribution with $n-2$ degrees of freedom. We first study $\Delta V(\delta)/V(\delta)$ when $\delta\to+\infty$. From Eq. (41) we have the following drift quotient
$$\frac{\Delta V(\delta)}{V(\delta)} = \frac{E\!\left(\left(\frac{\delta - \mathcal{G}(\delta, \boldsymbol{W})\cdot\boldsymbol{n}}{\eta_1^\star(\delta, \boldsymbol{W}, K)}\right)^{\!\alpha}\right) + E\!\left(\left(\frac{\delta - \mathcal{G}(\delta, \boldsymbol{W})\cdot\boldsymbol{n}}{\eta_1^\star(\delta, \boldsymbol{W}, K)}\right)^{\!-\alpha}\right)}{\delta^\alpha + \delta^{-\alpha}} - 1, \tag{46}$$
with $\eta_1^\star$ defined in Eq. (35) and $\mathcal{G}$ in Eq. (16). From Lemma 6, following the same notation as in the lemma, when $\delta\to+\infty$ and if $\alpha$ is small enough, the right-hand side of the previous equation converges to $e^{\frac{\alpha}{2 d_\sigma}} E_1 E_2 E_3 - 1$. With Taylor series
$$E_1 = E\!\left(\sum_{k\in\mathbb{N}}\frac{1}{k!}\left(-\frac{\alpha}{2 d_\sigma n} N_{\lambda:\lambda}^2\right)^{\!k}\right).$$
Furthermore, as the density of $N_{\lambda:\lambda}$ at x equals $\lambda\varphi(x)\Phi(x)^{\lambda-1}$ and $\sum_{k\in\mathbb{N}}\frac{1}{k!}(\frac{\alpha}{2 d_\sigma n} x^2)^k = \exp(\frac{\alpha}{2 d_\sigma n} x^2)$, which for $\alpha$ small enough is integrable against this density, we get
$$E_1 = \sum_{k\in\mathbb{N}}\frac{1}{k!}\left(-\frac{\alpha}{2 d_\sigma n}\right)^{\!k} E\!\left(N_{\lambda:\lambda}^{2k}\right) = 1 - \frac{\alpha}{2 d_\sigma n}\, E\!\left(N_{\lambda:\lambda}^2\right) + o(\alpha).$$
Therefore we can use Fubini's theorem to invert series (which are integrals for the counting measure) and integral. The same reasoning holding for $E_2$ and $E_3$ (for $E_3$, with the chi-squared distribution, we need $\alpha < d_\sigma n$ for all the series terms to be integrable), we have
$$e^{\frac{\alpha}{2 d_\sigma}} E_1 E_2 E_3 = 1 + \frac{\alpha}{2 d_\sigma} - \frac{\alpha}{2 d_\sigma n}\left(E\!\left(N_{\lambda:\lambda}^2\right) + E\!\left(\mathcal{N}^2\right) + E(K)\right) + o(\alpha),$$
and as $E(\mathcal{N}^2) = 1$ and $E(K) = n - 2$,
$$e^{\frac{\alpha}{2 d_\sigma}} E_1 E_2 E_3 = 1 - \frac{\alpha}{2 d_\sigma n}\left(E\!\left(N_{\lambda:\lambda}^2\right) - 1\right) + o(\alpha).$$
From Chotard et al. (2012a), if $\lambda\ge 3$, then $E(N_{\lambda:\lambda}^2) > 1$. Therefore, for $\alpha$ small enough, we have $\lim_{\delta\to+\infty}\Delta V(\delta)/V(\delta) < 0$, so there exist $\epsilon_1 > 0$ and $M > 0$ such that $\Delta V(\delta)\le-\epsilon_1 V(\delta)$ whenever $\delta\ge M$.
Similarly, when $\delta$ is small enough, using Lemma 6 (Eqs. (44) and (45)), $E(((\delta - \mathcal{G}(\delta, \boldsymbol{W})\cdot\boldsymbol{n})/\eta_1^\star(\delta, \boldsymbol{W}, K))^{\alpha})$ and $E(((\delta - \mathcal{G}(\delta, \boldsymbol{W})\cdot\boldsymbol{n})/\eta_1^\star(\delta, \boldsymbol{W}, K))^{-\alpha})$ converge to finite limits. Hence, using Eq. (46), $\lim_{\delta\to 0}\Delta V(\delta)/V(\delta) = -1$. So there exist $\epsilon_2 > 0$ and $m > 0$ such that $\Delta V(\delta)\le-\epsilon_2 V(\delta)$ whenever $\delta\le m$.