## Abstract

We study the behaviour of a (1, λ)-ES that handles constraints by resampling infeasible candidate solutions for linear optimization problems with a conically constrained feasible region. The analysis generalizes prior work in that no particular orientation of the cone relative to the gradient of the objective function is assumed. Expressions that describe the strategy's single-step behaviour are derived. Assuming that the mutation strength is adapted in a scale-invariant manner, a simple zeroth-order model is used to determine the speed of convergence of the strategy. We then derive expressions that approximately characterize the average step size and convergence rate attained when using cumulative step size adaptation and compare those values with optimal ones.

## 1 Introduction

Analytical studies of evolution strategies in select test environments contribute to the understanding of the dependence of algorithm behaviour on strategy as well as problem parameters. The ranges of strategy variants and test function classes that have been considered in analytical studies of evolution strategies for constrained optimization are more limited than in the unconstrained case. An improved analytically based understanding of the behaviour of evolution strategies for a wider range of constrained test function classes will be useful for interpreting experimental results, selecting and designing useful test functions for empirical comparisons of algorithms, and developing and improving constraint handling techniques.

Early work that analyzes the performance of evolution strategies for constrained optimization includes that of Rechenberg (1973), who studies the behaviour of the (1+1)-ES^{1} for the axis-aligned corridor model. Schwefel (1981) considers the performance of the (1, λ)-ES in the same environment. Beyer (1989) analyzes the behaviour of the (1+1)-ES for a constrained discus-like function. All of these studies have in common that the normal vectors of the constraint planes are orthogonal to the gradient vector of the objective function.

More recently, Arnold and Brauer (2008) consider a linear problem with a single linear constraint of general orientation. They provide a quantitative confirmation of the observation made by Schwefel (1995, p. 116f) that when using the one-fifth rule, the presence of constraints may lead to the step size being systematically reduced in situations where the angle between the gradient direction and the normal vector of the active constraint, which we refer to as the constraint angle, is small. Arnold (2011a, 2011b) studies the behaviour of the (1, λ)-ES applied to the same problem and compares constraint handling through resampling of infeasible candidate solutions with constraint handling through a simple repair mechanism that projects infeasible candidate solutions onto the feasible region. He finds that cumulative step size adaptation may fail in the face of small constraint angles, and that using the repair mechanism makes it possible to choose the algorithm's parameters such that stagnation can be avoided for any constraint angle. Arnold (2013b) extends the analysis of the linearly constrained problem to consider the (μ/μ, λ)-ES with cumulative step size adaptation. Arnold (2012) considers the (1, λ)-ES with mutative self-adaptation in the same environment and finds that constraint handling through resampling of infeasible candidate solutions results in stagnation in the face of small constraint angles.

Common benchmark problems, such as those collected by Michalewicz and Schoenauer (1996), have optima located on the boundary of the feasible region. A linear problem with a single linear constraint does not possess a finite optimum and does not model that situation well. Rather than increasing the step size indefinitely, an evolution strategy applied to a constrained problem with the optimal solution located on the boundary of the feasible region needs to reduce its step size as it approaches the optimal solution while at the same time avoiding premature stagnation. Arnold (2013a) studies the behaviour of a (1, λ)-ES that resamples infeasible candidate solutions for a linear problem with a feasible region delimited by a right circular cone, where the optimal solution lies at the cone's apex and the gradient direction coincides with the cone's central axis. He finds that cumulative step size adaptation enables the strategy to converge linearly for the entire range of parameter settings considered. Assuming that computational costs are dominated by objective function evaluations, the step sizes generated yield close-to-optimal rates of convergence. Using smaller values for the number of feasible offspring generated per time step results in the strategy operating at a greater distance from the boundary of the feasible region and thus in fewer infeasible offspring that need to be resampled.

This paper extends the analysis presented by Arnold (2013a) to linear optimization problems with a conically constrained feasible region of general orientation. That is, the assumption that the gradient direction coincides with the cone's central axis is dropped, removing some of the artificial symmetry of the original problem. The remainder of the paper is organized as follows. Section 2 briefly describes the optimization problem and algorithm under consideration. Section 3 analyzes the single-step behaviour of the strategy and derives expressions for the expected step that it takes. Section 4 investigates the multi-step limit behaviour of the strategy assuming that the mutation strength is adapted such that the step size is proportional to the distance from the axis of the cone that delimits the feasible region. A simple zeroth-order model is employed, and its accuracy is evaluated using computer experiments. Section 5 employs the same approach for the analysis of the behaviour of cumulative step size adaptation. Expressions characterizing the step size as well as the convergence rate achieved with cumulative step size adaptation are derived and compared with optimal ones. Section 6 concludes with a brief discussion and goals for future research.

## 2 Problem and Algorithm

Choosing a coordinate system such that the *x*_{1} axis coincides with the cone's central axis, and the *x*_{2} axis is oriented such that the span of the *x*_{1} and *x*_{2} axes includes the gradient direction, the problem can without loss of generality be stated as that of minimizing subject to constraints where , , and . The problem considered by Arnold (2013a) is recovered for . Figure 1 illustrates the setup. Larger values of result in a narrower cone of feasible solutions. If , then the optimal solution to the constrained optimization problem lies at the cone's apex and has objective function value zero. If , then all points with are optimal. If , then no optimal solution exists and the objective function value can be decreased indefinitely.

The optimization strategy considered throughout this paper is the same as that employed by Arnold (2013a). Assuming a feasible initial candidate solution , the (1, λ)-ES iteratively generates a sequence of further candidate solutions by performing the following four steps per iteration.

1. Generate λ feasible offspring candidate solutions , . Vectors **z**^{(i)} are generated by repeatedly sampling mutation vectors **w** from an *N*-dimensional normal distribution with mean zero and unit covariance matrix. The first vector **w** for which the resulting candidate solution is feasible is labelled **z**^{(i)}. Step size parameter σ is referred to as the mutation strength.
2. Evaluate the **x**^{(i)} for and let denote the mutation vector corresponding to the offspring candidate solution with the smallest objective function value. Any ties are broken arbitrarily.
3. Update the parental candidate solution according to .
4. Modify the mutation strength.
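The iteration described above can be sketched in a few lines of Python. The objective and cone constraint below are placeholders standing in for the paper's equations, which are not reproduced here; the parameter values and the specific constraint form *x*_{1}² ≥ ξ·(*x*_{2}² + … + *x*_{N}²) are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

N = 10       # search space dimension
lam = 10     # number of feasible offspring per iteration (lambda)
xi = 1.0     # cone parameter; assumed constraint form below
theta = 0.3  # assumed angle between gradient direction and cone axis

def f(x):
    # placeholder linear objective; stands in for the paper's objective function
    return np.cos(theta) * x[0] + np.sin(theta) * x[1]

def feasible(x):
    # placeholder cone: x_1 >= 0 and x_1^2 >= xi * (x_2^2 + ... + x_N^2)
    return x[0] >= 0.0 and x[0] ** 2 >= xi * np.sum(x[1:] ** 2)

def one_iteration(x, sigma):
    """One iteration of the (1, lambda)-ES with resampling of infeasible offspring."""
    best_x, best_f = None, np.inf
    for _ in range(lam):
        while True:  # resample until a feasible offspring is obtained
            w = rng.standard_normal(N)
            if feasible(x + sigma * w):
                break
        cand = x + sigma * w
        if f(cand) < best_f:  # ties broken arbitrarily (first encountered wins)
            best_x, best_f = cand, f(cand)
    return best_x  # parent is replaced by the best offspring

x = np.zeros(N)
x[0] = 1.0   # feasible initial candidate solution on the cone's axis
sigma = 0.1  # mutation strength held fixed here; adaptation is discussed later
for _ in range(100):
    x = one_iteration(x, sigma)
```

With the mutation strength held fixed, the fourth step of the iteration is omitted; adapting σ is the subject of the later sections.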

## 3 Single-Step Behaviour

The analysis of the single-step behaviour of the strategy is simplified by virtue of the symmetries inherent in the constraint functions. Specifically, the location of a candidate solution is adequately described by its *x*_{1} and *x*_{2} coordinates, and the distance from the origin in the space spanned by the *x*_{3} through *x*_{N} axes. When considering single iterations of the (1, λ)-ES with parental candidate solution **x**, we can choose a coordinate system such that , where *R* > 0.^{2} Moreover, in our calculations we make simplifications assuming that the dimension *N* of the search space tends to infinity.

### 3.1 Normalization and Constraints

As **x** is feasible, it follows from Equations (4) and (5) that . The purpose of describing the state of the algorithm in terms of , , and *R* rather than in terms of *x*_{1}, *x*_{2}, and *R* will become clear in Section 4, where we consider scale-invariant behaviour.

As the components *w*_{i} of a mutation vector are independent standard normal random variables, is feasible if and only if and . Using Equations (4), (5), (6), and (7), these conditions can equivalently be written as and . For given values of , , and , the left-hand side of Equation (8) converges in distribution to , and the condition will be satisfied with overwhelming probability. The left-hand side of Equation (9) converges to , and the condition will be satisfied if .

### 3.2 Distribution of Feasible Offspring

The joint density of the components *z*_{1}, *z*_{2}, and *z*_{3} of mutation vector **z** resulting in feasible offspring is . The marginal probability density of the *z*_{1} and *z*_{2} components is , where Equation (10) has been used. As the components *z*_{i} with have no influence on whether an offspring candidate solution is feasible, they are independently distributed with standard normal density.

### 3.3 Normalized Fitness Difference

### 3.4 Expected Step

The expected values of the *z*_{1}, *z*_{2}, and *z*_{3} components of the mutation vector corresponding to the offspring candidate solution selected to replace the parent can be computed using elementary results from the field of order statistics (Balakrishnan and Rao, 1998). Specifically, as there are λ possible outcomes of selection and the selected candidate solution has a smaller objective function value than the remaining candidate solutions, the expected values can be obtained by integration over the densities computed above as . Letting , the calculations detailed in Appendix B derive expressions allowing the straightforward numerical computation of the expected values.
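The order-statistics integrals have a direct Monte Carlo counterpart that is useful for checking numerical implementations of the Appendix B expressions. The sketch below is a simplified stand-in: it uses an assumed cone constraint and an objective taken to increase along the *x*_{1} axis, both of which are illustrative assumptions rather than the paper's equations.

```python
import numpy as np

rng = np.random.default_rng(3)

N, lam, xi, sigma = 10, 10, 1.0, 0.1

def feasible(x):
    # assumed cone constraint standing in for the paper's constraint functions
    return x[0] >= 0.0 and x[0] ** 2 >= xi * np.sum(x[1:] ** 2)

def selected_z1(x):
    """z_1 component of the mutation vector of the best of lam feasible offspring."""
    zs = []
    while len(zs) < lam:
        w = rng.standard_normal(N)
        if feasible(x + sigma * w):
            zs.append(w)  # resampling: only feasible offspring are retained
    # with the objective assumed to increase along x_1, the best offspring
    # is the one with the smallest z_1 component
    return min(z[0] for z in zs)

x = np.zeros(N)
x[0] = 1.0  # parent well inside the cone, so truncation by the constraint is weak
est = np.mean([selected_z1(x) for _ in range(2000)])
```

For weak constraint truncation, `est` approaches the expectation of the smallest of `lam` standard normal variates (about −1.54 for `lam = 10`); the Appendix B expressions capture how truncation by the constraint shifts such values.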

### 3.5 Change of Normalized Slack

### 3.6 Change of Objective Bias

## 4 Steady State Behaviour

In this section we consider the behaviour of the (1, λ)-ES with scale-invariant mutation strength. That is, we assume that is adapted such that the normalized mutation strength is constant. While this does not constitute a viable algorithm, the results will form a stepping stone toward the analysis of the behaviour of cumulative step size adaptation in Section 5. Moreover, they will allow deriving optimal parameter settings against which the values generated by any step size adaptation mechanism can be compared.
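A scale-invariant step size rule is easy to emulate numerically. The sketch below assumes the normalization σ = σ*·r/N, with r the distance of the parent from the cone's axis (the paper's precise normalization is not reproduced here), and again uses an assumed cone constraint and an objective with gradient along the cone's axis purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

N, lam, xi = 10, 10, 1.0
sigma_star = 1.0  # normalized mutation strength held constant

def feasible(x):
    # assumed cone constraint: x_1 >= 0 and x_1^2 >= xi * sum_{i>=2} x_i^2
    return x[0] >= 0.0 and x[0] ** 2 >= xi * np.sum(x[1:] ** 2)

def f(x):
    return x[0]  # assumed objective with gradient along the cone's axis

def step(x, sigma):
    """Best of lam feasible offspring; infeasible samples are discarded."""
    best, best_f = None, np.inf
    for _ in range(lam):
        while True:
            w = rng.standard_normal(N)
            if feasible(x + sigma * w):
                break
        if f(x + sigma * w) < best_f:
            best, best_f = x + sigma * w, f(x + sigma * w)
    return best

x = np.zeros(N)
x[0], x[1] = 1.0, 1.0 / np.sqrt(xi)  # start on the boundary of the feasible region
logs = []
for _ in range(500):
    r = np.linalg.norm(x[1:])        # distance of the parent from the cone's axis
    sigma = sigma_star * r / N       # scale-invariant step size (assumed normalization)
    x_new = step(x, sigma)
    logs.append(np.log(f(x) / f(x_new)))
    x = x_new
delta = N * np.mean(logs)            # estimate of the normalized quality gain
```

A positive value of `delta` indicates linear convergence toward the cone's apex; sweeping `sigma_star` reproduces the qualitative dependence of the quality gain on the normalized mutation strength discussed below.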

Figure 2 plots the average normalized slack and average objective bias in runs of the (1, λ)-ES with scale-invariant step size against the normalized mutation strength . The curves have been obtained by numerical root finding using Equations (27) and (28) with Equations (20), (21), (22), (25), and (26). The points mark average measurements made in runs of the strategy for search space dimensions . Each run is initialized with the parental candidate solution on the boundary of the feasible region. Mixing times are very long for small values of the normalized mutation strength. In order to avoid initialization effects from impacting the measurements, the strategy is run with decreasing normalized mutation strength for 400*N* iterations in which no measurements are made. The data points shown in the figure are the result of averaging the normalized slack and objective bias over the following 20,000 time steps. A run is terminated prematurely and no data point is included in the figure if the generation of an offspring results in at least 1,000 infeasible candidate solutions being sampled before a feasible one is found.

It can be seen from the figure that the average normalized slack increases with increasing normalized mutation strength as well as with increasing . In contrast to the case where (shown at the top and considered previously by Arnold, 2013a), in which the normalized amount of slack appears to tend to zero as normalized mutation strengths decrease, for the curves are nearly flat for small values of . At the same time, the objective bias, which is zero for , takes on negative values for . For small values of the normalized mutation strength, average values of are close to , indicating that the strategy tracks the boundary of the feasible region near the bottom of the cone shown in Figure 1. The accuracy of the predictions made on the basis of the zeroth-order model and the assumption of infinite search space dimensionality is quite good throughout. It visibly decreases for larger values of or , where a model that includes fluctuations of would be required for more accurate results. Deviations of the measured values from the analytically obtained curves can also be observed for small values of the normalized mutation strength; here the long mixing times are responsible, and higher accuracy would result from discarding, as well as averaging over, more iterations in the experimental runs.

Two quantities of particular interest are the probability *P*_{feasible} of a random candidate solution being feasible and the normalized per iteration quality gain .^{3} The former determines the number of candidate solutions that need to be sampled on average (and thereby the number of constraint function evaluations) in order to generate a feasible offspring candidate solution. The latter coincides with the rate of convergence (or divergence, if negative) of the strategy, scaled with the search space dimension *N*. From Equation (1) with Equations (4), (6), and (7), the normalized quality gain of the strategy equals . Performing a Taylor expansion of the logarithm and dropping terms that disappear in the limit yields as an approximation for the normalized quality gain of the strategy.
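The Taylor expansion step can be made explicit in generic form. Writing the objective value after the step as f(**x**^{(t+1)}) = f(**x**^{(t)})(1 + δ) for a relative change δ (a generic device; the paper's Equation (30) expresses the result in terms of its normalized state variables, which are not reproduced here), the logarithmic quality gain expands as

```latex
N\,\mathrm{E}\!\left[\ln\frac{f(\mathbf{x}^{(t)})}{f(\mathbf{x}^{(t+1)})}\right]
  = -N\,\mathrm{E}\!\left[\ln(1+\delta)\right]
  = -N\,\mathrm{E}[\delta] + \frac{N}{2}\,\mathrm{E}\!\left[\delta^{2}\right] - \cdots,
```

where the omitted terms disappear in the limit considered.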

In the case that (i.e., when an optimal solution does not exist), the strategy can be observed to diverge linearly. In contrast to the convergent case, once negative objective function values of candidate solutions have been achieved, solution quality is improved by increasing absolute objective function values. We thus use the negative of Equation (30) to estimate the normalized rate of divergence. Accordingly, we measure the normalized quality gain in runs of the evolution strategy as the average observed value of . After initialization effects have faded, we do not observe changes in sign of the parental objective function value, ensuring that the logarithm is real valued.

Figure 3 plots the average probability *P*_{feasible} and the normalized quality gain from runs of the -ES with and scale-invariant step size against the normalized mutation strength . The lines have been obtained from Equations (11) and (30) with Equations (20) and (21), with the average normalized slack and objective bias obtained from the zeroth-order model based on Equations (27) and (28). The points mark measurements made in the same runs of the evolution strategy used to generate Figure 2. The higher accuracy of the quality gain predictions compared to those of *P*_{feasible} is a result of the skewness of the distribution of normalized slack values, which is not considered in the simple zeroth-order model, and the differing slopes of the functions in Equations (11) and (30) in the vicinity of the normalized slack and objective bias values typically encountered.

It can be seen from the figure that the average probability *P*_{feasible} decreases with increasing normalized mutation strength and thus that larger mutation strengths result in more constraint function evaluations per iteration. Moreover, in contrast to the case of a linear function with a single linear constraint (as considered in Arnold and Brauer, 2008; Arnold, 2011a, 2011b, 2012, 2013b), *P*_{feasible} can drop significantly below one-half for the conically constrained problem. In five of the six cases considered, the normalized quality gain shows the usual unimodal behaviour, where the quality gain first increases with increasing normalized mutation strength , but then starts decreasing and eventually turns negative (signifying divergence) as increases further. Normalized quality gain values are larger for than for . The sixth case (, ) is the only one where and thus where there is no optimal solution at the cone's apex. In that case, larger normalized mutation strengths result in more rapid divergence and increasingly larger quality gain values.
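The dependence of *P*_{feasible} on the mutation strength is straightforward to probe by Monte Carlo simulation. The constraint form below (*x*_{1} ≥ 0, *x*_{1}² ≥ ξ·Σ_{i≥2} *x*_{i}²) is an assumed stand-in for the paper's cone, and the parent is placed on the cone's axis for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

N, xi = 10, 1.0  # dimension and assumed cone parameter

def p_feasible(x, sigma, samples=20000):
    """Monte Carlo estimate of the probability that x + sigma*w is feasible."""
    cand = x + sigma * rng.standard_normal((samples, N))
    # assumed cone: x_1 >= 0 and x_1^2 >= xi * sum_{i>=2} x_i^2
    ok = (cand[:, 0] >= 0.0) & (cand[:, 0] ** 2 >= xi * np.sum(cand[:, 1:] ** 2, axis=1))
    return ok.mean()

x = np.zeros(N)
x[0] = 1.0  # parent on the cone's axis, unit distance from the apex
probs = [p_feasible(x, s) for s in (0.05, 0.2, 0.8)]
```

The estimates decrease monotonically in σ, as less of the mutation distribution falls inside the cone; the decrease is sharper still when the parent sits close to the cone's boundary, as the text indicates is the case in steady state.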

Figure 4 plots the optimal normalized mutation strength (i.e., the normalized mutation strength that maximizes the normalized quality gain), the corresponding average probability *P*_{feasible}, and the resulting normalized quality gain against both and . The curves have been obtained by numerically optimizing the normalized quality gain computed from Equation (30) with the average normalized slack and objective bias obtained from the zeroth-order model based on Equations (27) and (28). It can be seen that where the problem parameters are such that an optimal solution exists, optimal normalized mutation strengths increase with increasing values of . The larger normalized mutation strengths for larger values of result in faster convergence of the -ES. Figure 4 also shows that the faster convergence in the face of larger values of comes at the cost of small values of *P*_{feasible} and thus larger numbers of constraint function evaluations.

## 5 Mutation Strength Adaptation

Achieving scale-invariant step size of course requires information generally not available to the algorithm. Instead, evolution strategies employ one of several mechanisms for adapting their mutation strength. Among the techniques used, mutative self-adaptation (Schwefel, 1981) and cumulative step size adaptation (Ostermeier et al., 1994) are the two most common, with the latter forming the global step size adaptation component of the CMA-ES (Hansen and Ostermeier, 2001).

Cumulative step size adaptation maintains a search path with **s**^{(0)} = **0** that is updated according to , where superscripts indicate iteration number, and that implements an exponentially fading record of past steps taken by the strategy. Constant , which goes to zero as , is referred to as the cumulation parameter and determines the effective length of the memory implemented in the search path. The mutation strength is updated according to^{4} , where *D* > 0 is a damping constant that scales the magnitude of the updates. Clearly, the sign of determines whether the mutation strength is increased or decreased. The underlying idea is that long search paths indicate that the steps made by the strategy point predominantly in one direction and could beneficially be replaced with fewer but longer steps. Short search paths indicate that the strategy steps back and forth and that it should operate with a smaller step size. The search path has an expected squared length of *N* if consecutive steps are perpendicular on average, in which case no change in mutation strength is effected.
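The search path mechanism can be summarized in code. The coefficients below, (1 − c) and √(c(2 − c)) for the path update and exp((‖**s**‖² − N)/(2DN)) for the step size update, are a common parametrization of cumulative step size adaptation and are assumptions here; the paper's exact update equations are not reproduced.

```python
import numpy as np

def csa_update(s, z, sigma, N, c, D):
    """One cumulative step size adaptation step (squared-length variant).

    The coefficients are a common parametrization assumed for illustration,
    not the paper's equations verbatim.
    """
    # exponentially fading record of the selected mutation vectors
    s = (1.0 - c) * s + np.sqrt(c * (2.0 - c)) * z
    # increase sigma if the path exceeds its expected squared length N under
    # consecutive steps that are perpendicular on average, decrease it otherwise
    sigma = sigma * np.exp((np.dot(s, s) - N) / (2.0 * D * N))
    return s, sigma
```

A path built from consistently aligned steps grows beyond squared length N and drives σ up; vanishing or alternating steps shrink the path and drive σ down.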

The approach to the analysis of cumulative step size adaptation is the same as that used in Section 4 to study the behaviour of the scale-invariant strategy. We identify a set of variables that describes the state of the algorithm such that the update equations for those variables that are induced by the strategy behaviour are time-invariant. We derive approximate values for the mean values of the joint limit distribution of the variables using a simple zeroth-order model. Imposing conditions akin to Equations (27) and (28) yields a system of as many equations as there are variables. The same approach has been used to analyze the behaviour of cumulative step size adaptation in other environments (Arnold, 2002; Arnold and Beyer, 2010; Arnold, 2013b). A better characterization of the joint limit distribution could be obtained by expanding it into a Gram-Charlier series with unknown cumulants as proposed by Beyer and Schwefel (2002). Considering cumulants beyond the means would allow deriving approximations for variances and covariances, and so forth.

The state variables used in the analysis are the components *s*_{1} and *s*_{2} of the search path in the directions of the *x*_{1} and *x*_{2} axes, the length of the projection of the search path onto the vector pointing from the current candidate solution to the origin of the space spanned by the *x*_{3} through *x*_{N} axes, and the deviation of the search path's squared length from the value where no change in mutation strength is effected.

The evolution of the component of the search path in the direction of the *x*_{1} axis is governed by . Imposing condition E[*s*_{1}^{(t+1)}] = *s*_{1}^{(t)} and solving for *s*_{1} = *s*_{1}^{(t)} yields as an approximation to the average value of *s*_{1}. Proceeding equivalently in the subspace spanned by the *x*_{2} axis results in as an approximation to the average value of *s*_{2}.

The evolution of the projection of the search path onto the space spanned by the *x*_{3} through *x*_{N} axes is described by . Using Equation (6), cancelling terms, expanding the fraction into a Taylor series, taking expected values, and omitting terms that disappear in the limit yields . Imposing condition and solving for yields as an approximation to the average value of .

Figure 5 plots the average normalized mutation strength and the normalized quality gain of the (1, λ)-ES with cumulative step size adaptation against constraint function parameter . The curves have been obtained by numerical root finding from Equations (38), (27), and (28) with Equations (20), (21), and (22) to compute the expected values. The thin dotted lines show the values for optimal scale-invariant mutation strength derived in Section 4 for comparison. The points mark averages measured in runs of the (1, λ)-ES with cumulative step size adaptation with , *D* = 1, and . Each run is initialized with the parental candidate solution on the boundary of the feasible region and an initial mutation strength such that . The first 40*N* iterations of each run are discarded in order to prevent initialization effects from impacting the measurements. The data points shown in the figure are the result of averaging over the following 20,000 time steps, where again data points from runs in which at least 1,000 infeasible candidate solutions were sampled in the generation of a single offspring are excluded.

It can be seen that cumulative step size adaptation successfully controls the mutation strength of the strategy in that linear convergence (as signified by positive quality gain values) is achieved for all combinations of parameter values of and considered that admit an optimal solution. Step sizes generated usually differ from optimal ones. However, quality gain values observed are not far from optimal ones, and the lower mutation strengths employed for large values of have the advantage of resulting in fewer infeasible candidate solutions being generated.

Figure 6 illustrates the magnitude of the error made in the derivation of the normalized quality gain. The lines have been obtained from Equation (30) using the simple zeroth-order model that neglects fluctuations and assumes . The points mark measurements made in runs of the (1, 10)-ES with cumulative step size adaptation as described above, but without terminating a run once 1,000 infeasible candidate solutions have been sampled in the generation of a single offspring candidate solution. It can be seen that while for and the quality of the approximation is good for *N* as small as 2, larger values of as well as may result in significant amounts of error in low-dimensional search spaces.

## 6 Discussion and Conclusions

To conclude, we have presented an analysis of the behaviour of a (1, λ)-ES that handles constraints by resampling infeasible candidate solutions, applied to conically constrained linear problems. Conically constrained linear problems are among the simplest problems to analyze that have optimal solutions located on the boundary of the feasible region. Assuming scale-invariant mutation strength, we have found that the probability of generating infeasible offspring, and thus the need to resample repeatedly, increases with increasing step size. Larger values of the parameter of the constraint function correspond to a more strongly constrained problem (i.e., one with a narrower feasible region), but at the same time make it possible to achieve faster rates of convergence. However, the faster rates of convergence, which are achieved with larger step sizes, come at the cost of generating more infeasible candidate solutions that need to be resampled.

We then investigated the performance of cumulative step size adaptation and found that it generates mutation strengths that enable the evolution strategy to converge linearly for the entire range of parameter settings considered that result in problems with the optimal solution located at the cone's apex. Quality gain values achieved using cumulative step size adaptation are close to optimal, and while the gap to optimal values widens for larger values of , the mutation strengths generated by cumulative step size adaptation have the advantage of resulting in fewer infeasible candidate solutions than optimal settings do.

Future work to be conducted includes the extension of the analysis presented here to evolution strategies that employ non-singleton populations of parental candidate solutions, such as the (μ/μ, λ)-ES. For the special case that equals zero and the gradient direction is aligned with the cone's axis, this has been accomplished by Porter and Arnold (2014), who find that employing non-singleton populations with multi-recombination leads to the evolution strategy operating at a greater distance from the boundary of the feasible region, resulting in fewer infeasible offspring being generated. This is advantageous in the case that the cost of constraint function evaluations is non-negligible, and it remains to be seen whether the same can be observed for non-zero values of . It is furthermore desirable to conduct similar analyses for other step size adaptation and constraint handling techniques, as well as for further constrained problems with different characteristics.

## Acknowledgments

This research was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC). The suggestion to consider a conically constrained linear problem of general orientation is due to one of the reviewers of Arnold (2013a). The author is grateful to one of the reviewers of the present paper for pointing out a major flaw in the original manuscript.

## References

## Appendix A Distribution of Normalized Fitness Difference Values

## Appendix B Computation of the Expected Values

### B.1 Computation of

### B.2 Computation of

### B.3 Computation of

## Notes

^{1} See Beyer and Schwefel (2002) for an introduction to evolution strategy terminology.

^{2} While *R* = 0 is possible, after initialization the probability of generating such a candidate solution is vanishingly small.

^{3} We employ a logarithmic definition of the quality gain in order to avoid the pitfalls of the definition based on expected relative improvement pointed out by Auger and Hansen (2006).

^{4} Notice that this update differs from the original prescription in that Ostermeier et al. (1994) adapt the mutation strength based on the length of the search path rather than on its squared length. With appropriately chosen parameters both variants often behave similarly. As in previous work analyzing cumulative step size adaptation, basing the update of the mutation strength on the squared length of the search path simplifies the analysis.