## Abstract

Theoretical analyses of evolution strategies are indispensable for gaining a deep understanding of their inner workings. For constrained problems, current research still focuses on rather simple problem formulations. This work presents a theoretical analysis of a multi-recombinative evolution strategy with cumulative step size adaptation (CSA) applied to a conically constrained linear optimization problem. The state of the strategy is modeled by random variables and a stochastic iterative mapping is introduced. For the analytical treatment, fluctuations are neglected and the mean value iterative system is considered. Nonlinear difference equations are derived based on one-generation progress rates. Building on these, expressions for the steady state of the mean value iterative system are derived. Comparison with real algorithm runs shows that, under the considered assumptions, the theoretical derivations are able to predict the dynamics and the steady state values of the real runs.

## 1 Introduction

Thorough theoretical investigations of evolution strategies (ESs) are necessary for gaining a deep understanding of how they work. Much research has been done on analyzing ESs applied to unconstrained problems. For the constrained setting, there are still many aspects for which a deep theoretical understanding is missing. Theoretical comparisons of different infeasibility handling techniques are of interest. In addition, the influence of parameter choices has not yet been investigated in depth for constrained problems. Furthermore, theoretical comparisons of mutation strength control mechanisms are not fully explored. Moreover, as all of these heavily depend on the problem under consideration, more problems that model specific parts of practical constrained optimization problems need to be considered. As a step in that direction, this work theoretically analyzes a $(\mu/\mu_I,\lambda)$-ES with cumulative step size adaptation (CSA) applied to a conically constrained linear problem.

Regarding related work, a $(1,\lambda)$-ES that handles constraints by discarding infeasible offspring has been analyzed by Arnold (2011b) for a single linear constraint. Repair by projection has been considered by Arnold (2011a), and a comparison with repair by reflection and repair by truncation has been performed by Hellwig and Arnold (2016). In all of those works, the one-generation and steady state behavior of the respective algorithms is analyzed considering CSA, and expressions containing integrals are evaluated by numerical integration. Based on Lagrangian constraint handling, Arnold and Porter (2015) presented a $(1+1)$-ES applied to the sphere model with a single linear inequality constraint. The one-generation behavior has been analyzed in that work, and the insights gained from that analysis have been used to develop an algorithm that is evaluated experimentally.

A theoretical investigation based on Markov chains for a multirecombinative variant with Lagrangian constraint handling has been presented by Atamna et al. (2016). Investigation of a single linear constraint in that work has been extended to multiple linear constraints (Atamna et al., 2017).

The simple linearly constrained optimization problem with a single linear constraint used in Arnold (2011b), Arnold (2011a), and Hellwig and Arnold (2016) has been one step toward a better understanding of ESs for constrained optimization. However, as it does not have an optimal solution, it models only parts of the difficulties in handling constrained optimization problems. In particular, if the linear objective function gradient points toward the constraint, the ES has to avoid premature convergence. As a step further, Arnold (2013) has considered a conically constrained linear problem in which the linear objective function gradient coincides with the cone axis. As an extension, Arnold (2014) has proposed the more general problem that does not assume that the objective function gradient points in the direction of the cone axis. However, in the work presented in our article, the objective function gradient is chosen in the direction of the cone axis because this simplifies the theoretical analysis. This problem models the situation of the optimum lying on the boundary of the feasibility region in addition to the premature convergence difficulty. Hence, the ES has to avoid premature convergence, but at the same time the steps need to become smaller and smaller during the process of approaching the optimum. Arnold (2013) has considered a $(1,\lambda)$-ES applied to that problem with infeasible offspring being discarded. Integral expressions for the one-generation behavior have been derived, and numerical integration has been used to analyze the steady state behavior. For the $\sigma$ control strategy, CSA has been used. Spettel and Beyer (2018a) have considered the same problem and have analyzed a $(1,\lambda)$-$\sigma$-Self-Adaptation ES ($\sigma$SA-ES). That analysis has been extended to the multirecombinative $(\mu/\mu_I,\lambda)$ variant (Spettel and Beyer, 2018b).
In contrast to the work by Arnold, closed-form expressions have been derived under asymptotic assumptions for the one-generation behavior, the mean value dynamics, and the steady state behavior. This enables one to directly analyze the influence of the different algorithmic parameters on the algorithm behavior.

The contribution of this article is the analysis considering CSA instead of $\sigma$SA as the mutation strength control mechanism in the $(\mu/\mu_I,\lambda)$-ES with repair by projection applied to the conically constrained linear problem. Based on the derivations for the microscopic behavior in Spettel and Beyer (2018b), approximate closed-form expressions are derived for predicting the mean value dynamics and the steady state behavior of the multirecombinative CSA-ES with projection. Multirecombinative ESs with projection have already shown their suitability for constrained problems in real-world applications. Beyer and Finck (2012) have considered repair by projection in the context of a portfolio optimization problem. This design gave rise to a more principled design using projection for optimization problems with linear constraints (Spettel et al., 2018). In line with those empirical investigations, the theoretical analysis in this work concerns multirecombinative ESs. In order to simplify the analysis, isotropic mutations are considered. This removes the need to analyze the covariance matrix adaptation. Again, the aim is to derive closed-form expressions under asymptotic considerations for predicting the mean value dynamics and the steady state behavior. Because the CSA is a completely different approach for adapting the mutation strength $\sigma$ in comparison with $\sigma$SA, one main interest is the comparison of those two $\sigma$ control mechanisms. CSA uses information gained from the observed steps taken in the search space, whereas $\sigma$SA uses information from the objective function space. Furthermore, it is not immediately apparent how the cumulation parameter in the CSA should be chosen. The theoretical analysis gives insights in this regard.
A particular focus of the analysis lies on the difference between the choice of $c=1/\sqrt{N}$ proposed in early publications on the covariance matrix adaptation evolution strategy (CMA-ES) (Hansen and Ostermeier, 1997) and $c=\Theta(1/N)$ as proposed in newer publications (Hansen and Ostermeier, 2001; Hansen, 2016).

The remainder of the article is organized as follows. Section 2 introduces the optimization problem under consideration and describes the algorithm that is analyzed. Section 3 concerns the theoretical analysis. First, a mean value iterative system that models the dynamics of the ES is derived in Section 3.1. Second, steady state considerations are shown in Section 3.2. For the theoretical considerations, plots comparing them with results of real ES runs are presented to show the approximation quality. Section 4 discusses the results and Section 5 concludes the article.

## 2 Problem and Algorithm

The state of an ES individual can be uniquely described in the $(x, r)^T$-space. It consists of $x$, the distance from $0$ in $x_1$-direction (the cone axis), and $r$, the distance from the cone axis. Because isotropic mutations are considered in the ES, the coordinate system can be rotated (w.l.o.g.) such that $(\tilde{x}, \tilde{r})^T$ corresponds to $(\tilde{x}, \tilde{r}, 0, \dots, 0)^T$ in the parameter space. Figure 1 visualizes the problem. The equation for the cone boundary is $r = x/\sqrt{\xi}$, which follows from Equation (2). The projection line can be derived using the cone direction vector $(1, 1/\sqrt{\xi})^T$ and its counterclockwise rotation by 90 degrees, $(-1/\sqrt{\xi}, 1)^T$, yielding $r = -\sqrt{\xi}\,x + q\,\frac{\xi+1}{\sqrt{\xi}}$. The values $x$ and $\tilde{x}$ denote a parental individual and an offspring individual, respectively. The corresponding mutation is indicated as $\sigma\tilde{z}$. The values of $x$ and $r$ after projection are denoted by $q$ and $q_r$, respectively.
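The projection step can be sketched in the $(x, r)$-plane from the geometry above: the feasible cone is $r \le x/\sqrt{\xi}$, and an infeasible point is mapped to the nearest point on the boundary line through the origin with direction $(1, 1/\sqrt{\xi})^T$. The following function is an illustrative reconstruction, not code from the analyzed algorithm; the handling of points behind the cone apex is an assumption added for completeness.

```python
import numpy as np

def project_to_cone(x, r, xi):
    """Project a point (x, r) onto the feasible cone r <= x / sqrt(xi).

    Infeasible points are mapped to the nearest point on the cone
    boundary, i.e., the line through the origin with direction
    (1, 1/sqrt(xi)) in the (x, r)-plane.
    """
    if r <= x / np.sqrt(xi):
        return x, r  # already feasible, nothing to do
    # orthogonal projection onto the boundary line through the origin
    d = np.array([1.0, 1.0 / np.sqrt(xi)])
    d /= np.linalg.norm(d)
    t = np.dot([x, r], d)
    if t <= 0.0:
        return 0.0, 0.0  # nearest feasible point is the cone apex
    q, q_r = t * d
    return q, q_r

# example: for xi = 4 the boundary is r = x/2; the infeasible point
# (1, 2) is projected onto that line
q, q_r = project_to_cone(1.0, 2.0, 4.0)
```

Note that the projected point satisfies both the boundary equation $q_r = q/\sqrt{\xi}$ and the projection-line equation $r = -\sqrt{\xi}\,x + q\,\frac{\xi+1}{\sqrt{\xi}}$ from above.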

^{1}For the mutation strength update, first the cumulative $s$-vector is updated. The cumulation parameter $c$ determines the fading strength. The mutation strength is then updated using this $s$-vector. The parameter $D$ acts as a damping factor. If the squared length of the $s$-vector is smaller than $N$, the step size is decreased; otherwise, it is increased. Intuitively, this means that multiple correlated steps allow a larger step size and vice versa. The update of the generation counter ends one iteration of the generation loop. The values $x^{(g)}$, $r^{(g)}$, $q_l$, $q_l'$, $\langle q\rangle$, and $\langle q_r\rangle$ are only needed in the theoretical analysis and can be removed in practical applications of the ES. They are indicated in the algorithm in Lines 4, 5, 14, 15, 21, and 22, respectively.
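The update described above can be sketched as follows. This is a generic CSA step for a $(\mu/\mu_I,\lambda)$-ES with isotropic mutations; the normalization constant $\sqrt{\mu c(2-c)}$ and the damping term $2DN$ are commonly used variants and should be treated as assumptions, since the exact constants of Algorithm 1 are not reproduced here.

```python
import numpy as np

def csa_update(s, sigma, z_avg, c, D, mu, N):
    """One generic cumulative step size adaptation step (sketch).

    s      -- cumulative search path vector
    sigma  -- current mutation strength
    z_avg  -- average of the mu selected mutation vectors
    c      -- cumulation parameter (fading strength)
    D      -- damping factor
    """
    # exponential fading of old path information, accumulation of the new step
    s = (1.0 - c) * s + np.sqrt(mu * c * (2.0 - c)) * z_avg
    # ||s||^2 > N lengthens the step, ||s||^2 < N shortens it
    sigma = sigma * np.exp((np.dot(s, s) - N) / (2.0 * D * N))
    return s, sigma

# example: a zero step shrinks sigma, a long accumulated path enlarges it
s1, sig1 = csa_update(np.zeros(4), 1.0, np.zeros(4), c=0.25, D=1.0, mu=3, N=4)
```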

Figure 2 shows an example of the $x$- and $r$-dynamics of Algorithm 1 (solid line) in comparison with results of the closed-form approximate iterative system (dotted line) that is derived in the sections that follow. As one can see, the theoretical prediction comes close to the real dynamics for the case shown.

## 3 Theoretical Analysis

### 3.1 Derivation of a Mean Value Iterative System for Modeling the Dynamics of the ES

#### 3.1.1 Derivation of Mean Value Difference Equations for $x$ and $r$

^{2}The expression for $\phi_x^{(g)*}$ has been derived as

#### 3.1.2 Derivation of a Mean Value Difference Equation for $s_1$

#### 3.1.3 Derivation of a Mean Value Difference Equation for $s_{\odot}$

#### 3.1.4 Derivation of a Mean Value Difference Equation for $\|s\|^2$

#### 3.1.5 Derivation of a Mean Value Difference Equation for $\sigma $

#### 3.1.6 Summary of the Mean Value Difference Equations

For the figures, results of 100 real runs of the ES have been averaged for generating the solid lines. The lines for the iteration by approximation have been computed by iterating the mean value iterative system (Equations (38) to (44)) with Equations (19), (20), and (A.21) for $\phi_x^{(g)}$ (and $\phi_x^{(g)*}$), $\phi_r^{(g)}$ (and $\phi_r^{(g)*}$), and $\phi_{r^2}^{(g)}$, respectively. The lines for the iteration with one-generation experiments have been generated by iterating the system (Equations (38) to (44)) and simulating $\phi_x^{(g)}$ (and $\phi_x^{(g)*}$), $\phi_r^{(g)}$ (and $\phi_r^{(g)*}$), and $\phi_{r^2}^{(g)}$. It can happen that in a generation of iterating the system (Equations (38) to (44)), infeasible $(x^{(g)}, r^{(g)})^T$ values are created. In such circumstances, the corresponding $(x^{(g)}, r^{(g)})^T$ have been projected back.

### 3.2 Behavior of the ES in the Steady State

The goal of this section is to analyze the steady state behavior of the mean value iterative system that is summarized in Section 3.1.6. A working ES should steadily decrease $x$ and $r$ (Equations (38) and (39), respectively) in order to move toward the optimizer. For determining the steady state normalized mutation strength value, the fixed point of the system of nonlinear equations (Equations (40) to (44)) is to be computed.
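Since Equations (40) to (44) are not reproduced here, the following sketch only illustrates the numerical procedure: a plain fixed-point iteration applied to a hypothetical stand-in mapping. The toy system below is an arbitrary contraction chosen for illustration, not the actual steady state system.

```python
import numpy as np

def fixed_point(g, x0, tol=1e-12, max_iter=10_000):
    """Plain fixed-point iteration x <- g(x) until the change is below tol."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = np.asarray(g(x))
        if np.max(np.abs(x_new - x)) < tol:
            return x_new
        x = x_new
    raise RuntimeError("fixed-point iteration did not converge")

# toy stand-in system (a contraction): x = cos(y)/2, y = sin(x)/2
g = lambda v: np.array([np.cos(v[1]) / 2.0, np.sin(v[0]) / 2.0])
xy = fixed_point(g, [0.0, 0.0])
```

For the actual system, the mapping `g` would be replaced by one sweep of Equations (40) to (44); more robust root finders (e.g., Newton-type methods) could equally be used.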

Section 3.2.1 comprises a first step toward closed-form approximations for the steady state values of the system summarized in Section 3.1.6. Expressions are derived that finally lead to a steady state equation for the normalized mutation strength. A closed-form solution of that derived equation (Equation (62)) is not apparent. Hence, further assumptions for different cases are considered in Sections 3.2.2 and 3.2.3.

Section 3.2.2 considers $c=1/\sqrt{N}$, which has been proposed in early CMA-ES papers such as Hansen and Ostermeier (1997). Section 3.2.3 uses $c=\Theta(1/N)$ as proposed in newer publications (see Hansen and Ostermeier, 2001; Hansen, 2016). Again, to arrive at closed-form expressions, asymptotic considerations ($N\to\infty$) are taken into account.

#### 3.2.1 Derivations Toward Closed-Form Steady State Expressions

Interestingly, the choice of $c=\Theta(1/N)$ results in steady state normalized $x$ progress values that are independent of $\xi$. They stay almost constant with increasing $\xi$ for the dimensions considered, whereas for $c=\Theta(1/\sqrt{N})$ the steady state normalized $x$ progress values decrease with increasing $\xi$ for larger dimensions. For the considered dimensions, only for $N=40$ does the choice $c=1/\sqrt{N}$ achieve constant progress with increasing $\xi$.

In addition, the choice of $c=\Theta(1/N)$ results in better steady state normalized $x$ progress compared to $c=\Theta(1/\sqrt{N})$ the higher $N$ and $\xi$ get. For example, for the configurations considered in the plots, the choice $c=\frac{\mu+2}{N+\mu+5}$ results in the highest steady state $x$ progress for $N=400$ and $\xi\geq 25$. For $N=40$, the choices $c=1/\sqrt{N}$ for $\xi\geq 25$ and $c=2/N$ for $\xi<25$ result in the highest steady state $x$ progress.

In order to analyze the influence of the parameters further theoretically, the aim of the following two sections is to derive closed-form expressions using asymptotic assumptions.

#### 3.2.2 Derivation of Closed-Form Approximations for the Steady State with the Assumptions $c=\Theta(1/\sqrt{N})$ and $N\to\infty$

The goal of this section is to simplify the expressions derived in Section 3.2.1 further using additional asymptotic assumptions in order to arrive at closed-form steady state approximations.

Figure 5 shows plots of the steady state computations. Results computed by Equation (74) have been compared to real $(\mu/\mu_I,\lambda)$-CSA-ES runs for $\lambda=10$ and $\mu\in\{1,2,3\}$. Further experimental results (including results for $\mu\in\{4,6,8\}$) are provided in the supplementary material (Figures 15 to 18 in Appendix B). The values for the points denoting the approximations have been determined by computing the normalized steady state mutation strength $\sigma_{ss}^*$ using Equation (74) for different values of $\xi$. The results for $\phi_x^*$ and $\phi_r^*$ have been determined by using the computed steady state $\sigma_{ss}^*$ values with Equation (48). The approximations for $\left(\frac{x}{\sqrt{\xi}\,r}\right)_{ss}$ have been determined by evaluating Equation (47). The values for the points denoting the experiments have been determined by computing the averages of the corresponding quantities in real ES runs. The figures show that the derived expressions become more accurate for larger values of $\xi$ and $N$. Again, the deviations for small $\xi$ are due to approximations in the derivation of the local progress rates. The deviations for small $N$ stem from the use of the asymptotic assumption $N\to\infty$.

#### 3.2.3 Derivation of Closed-Form Approximations for the Steady State with the Assumptions $c=\Theta(1/N)$ and $N\to\infty$

Figure 6 shows plots of the steady state computations comparing the $(\mu/\mu_I,\lambda)$-CSA-ES for $\lambda=10$ and $\mu\in\{1,2,3\}$. Further experimental results (including results for $\mu\in\{4,6,8\}$) are provided in the supplementary material (Figures 19 to 22 in Appendix B). Results computed by Equation (87) have been compared to real ES runs. The values for the points denoting the approximations have been determined by computing the normalized steady state mutation strength $\sigma_{ss}^*$ using Equation (87) for different values of $\xi$. The results for $\phi_x^*$ and $\phi_r^*$ have been determined by using the computed steady state $\sigma_{ss}^*$ values with Equation (48). The approximations for $\left(\frac{x}{\sqrt{\xi}\,r}\right)_{ss}$ have been determined by evaluating Equation (47). The values for the points denoting the experiments have been determined by computing the averages of the corresponding quantities in real ES runs. The rather large deviations of $\phi_{x,ss}^*$ and $\phi_{r,ss}^*$ for values of $\xi$ ranging from around 3 to 30 stem from the $\sigma_{ss}^*$ approximation (Equation (87)). A further investigation resulted in the plots shown in Figure 13 and Figure 14 in the supplementary material. They compare the numerical solution of Equation (62) with real ES runs and show a better agreement.

## 4 Discussion

The comparison of the mean value iterative system summarized in Section 3.1.6 with real ES runs shows that the theoretical prediction is not perfect because of the asymptotic assumptions used in the derivations (see Figure 3). For large $\xi$ and large $N$, the prediction comes close to the real behavior. The deviations for small $N$ are due to the asymptotic assumption $N\to\infty$ that is used in the derivations of the microscopic and macroscopic aspects of the ES. It is used to simplify the expressions and thus make a theoretical analysis tractable. The deviations for small $\xi$ stem from the derivation of the offspring cumulative distribution function after the projection step in $x_1$-direction, $P_Q(q)$ (for the details, refer to Section 3.1.2.1.2.3 in Spettel and Beyer (2018c), in particular to the step from Equation (3.73) to Equation (3.74)). The same observations regarding the deviations can be made for the derived steady state expressions (see Figures 5 and 6).

For the steady state derivations, it is of particular interest to compare the results obtained in this work for the CSA-ES with the results obtained for the $\sigma$SA-ES. The $(\mu/\mu_I,\lambda)$-$\sigma$SA-ES applied to the same conically constrained problem has been theoretically analyzed by Spettel and Beyer (2018b). In that work, the microscopic and macroscopic aspects of the $(\mu/\mu_I,\lambda)$-$\sigma$SA-ES have been investigated. For the microscopic aspects, expressions for the local progress for $x$ and $r$ and the self-adaptation response (SAR) function have been derived using asymptotic assumptions. These results have then been used for the macroscopic analysis. The mean value dynamics generated by iteration using those local measures have been compared to real runs. In addition, steady state expressions have been derived and discussed. They show that the $\sigma$SA-ES is able to achieve sufficiently high mutation strengths to keep the progress almost constant for increasing $\xi$. Surprisingly, for the CSA-ES, the choice of the cumulation parameter $c$ has a qualitative influence on the behavior. As detailed in the following paragraphs, of the two cases $c=1/\sqrt{N}$ and $c=\Theta(1/N)$ considered, only the choice of $c=\Theta(1/N)$ results in a qualitative behavior of the CSA-ES that is similar to that of the $\sigma$SA-ES.

A further aspect for discussion is the special case of the non-recombinative $(1,\lambda)$-CSA-ES that is contained in the derivations for the $(\mu/\mu_I,\lambda)$-CSA-ES. The iterative system for the non-recombinative ($\mu=1$) case can be derived analogously to the multirecombinative ($\mu>1$) case. The resulting equations differ in the expressions for $\phi_x^{(g)}$, $\phi_r^{(g)}$, and $\phi_{r^2}^{(g)}$. This has been investigated further (not shown here), and for $N^2(1+1/\xi)\gg\sigma^{(g)*2}$ the mean value iterative systems of the $(1,\lambda)$-CSA-ES and the $(\mu/\mu_I,\lambda)$-CSA-ES agree. An interesting difference between the case $\mu=1$ and the case $\mu>1$ concerns the evolution near the cone boundary. Whereas the CSA-ES with $\mu=1$ evolves on the boundary, the CSA-ES with $\mu>1$ attains a certain steady state distance from the boundary (refer to the subplots of Figures 5 and 6 showing $\left(\frac{x}{\sqrt{\xi}\,r}\right)_{ss}$). This can be explained as follows: Considering a parental individual on the cone boundary, the offspring are infeasible with overwhelming probability for sufficiently large $N$. Hence, they are repaired by projection and lie on the boundary after projection. In particular, the best of them is on the boundary. Therefore, for $\mu=1$, the ES evolves on the boundary. For $\mu>1$, the centroid computation after projection results in a parent that lies inside the feasible region due to the convexity of that region.
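The convexity argument can be checked numerically: averaging distinct points on the cone boundary yields a point strictly inside the cone. The following toy example (with $\xi=1$, $N=3$, and an illustrative helper `boundary_distance` that is not part of Algorithm 1) demonstrates this geometric fact.

```python
import numpy as np

xi = 1.0  # feasible cone: r <= x1 / sqrt(xi); for xi = 1 the boundary is r = x1

def boundary_distance(p):
    """x1/sqrt(xi) - r: zero on the cone boundary, positive strictly inside."""
    x1, r = p[0], np.linalg.norm(p[1:])
    return x1 / np.sqrt(xi) - r

# two distinct points on the cone boundary (r = x1 holds for both)
p1 = np.array([1.0, 1.0, 0.0])
p2 = np.array([1.0, 0.0, 1.0])

# their centroid lies strictly inside the cone
centroid = (p1 + p2) / 2.0
```

Here the centroid $(1, 0.5, 0.5)^T$ has $r=\sqrt{0.5}\approx 0.707 < 1 = x_1$, so it is strictly feasible, mirroring the steady state distance from the boundary observed for $\mu>1$.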

## 5 Conclusion

In this work, the $(\mu/\mu_I,\lambda)$-CSA-ES has been theoretically analyzed. For this, a mean value iterative system has been introduced and compared to real ES runs. Based on this derived system, steady state expressions have been derived and compared to ES simulations. It has been shown that for $c=\Theta(1/N)$, a behavior similar to that of the $\sigma$SA can be achieved. For $c=1/\sqrt{N}$, the mutation strength achieved is not high enough to reach a steady state progress similar to that of the $\sigma$SA. As with the $\sigma$SA-ES, the CSA-ES evolves on the boundary in the non-recombinative case and at a certain distance from the boundary in the multirecombinative case.

To conclude the article, topics for future work are outlined. In addition to the $\sigma $SA and the CSA for the $\sigma $ control mechanism, it is of interest to investigate the behavior of Meta-ESs applied to the conically constrained problem. Meta-ESs can be used as a further approach to control the mutation strength. The main idea is to evolve the mutation strength by competition of two ESs running with different mutation strengths for an isolation period. A Meta-ES controls the mutation strength evolution by picking and mutating the mutation strength of the better performing ES.
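The Meta-ES principle outlined above can be sketched as follows. This is a minimal illustration on a simple sphere objective rather than the conically constrained problem, and all parameter values (the mutation factor `alpha`, the isolation period `gamma`, and the population size) are assumptions chosen for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
sphere = lambda y: float(np.dot(y, y))  # toy objective standing in for the real problem

def run_es(y, sigma, lam, gamma):
    """Inner (1, lam)-ES running with a fixed sigma for gamma generations."""
    for _ in range(gamma):
        offspring = y + sigma * rng.standard_normal((lam, y.size))
        y = min(offspring, key=sphere)  # comma selection: best offspring survives
    return y

def meta_es_step(y, sigma, alpha=1.3, lam=10, gamma=20):
    """One Meta-ES step: two inner ESs compete with sigma/alpha and sigma*alpha."""
    candidates = [(run_es(y.copy(), s, lam, gamma), s)
                  for s in (sigma / alpha, sigma * alpha)]
    # the better performing ES wins; its (mutated) mutation strength survives
    return min(candidates, key=lambda c: sphere(c[0]))

y, sigma = np.ones(10), 1.0
for _ in range(5):
    y, sigma = meta_es_step(y, sigma)
```

The outer loop thus adapts $\sigma$ purely by competition over isolation periods, without any cumulation path or self-adaptation response.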

Comparison of the repair by projection approach with other repair methods is another topic for further research. Analysis of ESs applied to other constrained problems is another research direction for the future. A possibility for a direct next step is the analysis of the conically constrained linear optimization problem for which the objective function gradient does not coincide with the cone axis (Arnold, 2014).

## Acknowledgments

This work was supported by the Austrian Science Fund FWF under grant P29651-N32.

## Notes

^{1}

Note that the order statistic notation $m;\lambda$ is used to denote the $m$-th best (w.r.t. fitness) out of $\lambda$ values. The notation $(x)_k$ is used to denote the $k$-th element of a vector $x$. It is equivalent to writing $x_k$.
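As a small illustration of the $m;\lambda$ notation (the fitness values below are arbitrary, and lower is taken to be better):

```python
import numpy as np

# fitness values of lambda = 5 offspring (minimization: lower is better)
f = np.array([3.2, 0.5, 4.1, 1.7, 2.9])
order = np.argsort(f)   # indices sorted from best to worst

f_1_lam = f[order[0]]   # the 1;lambda value: best of the 5
f_3_lam = f[order[2]]   # the 3;lambda value: third best of the 5
```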

^{2}

In the further considerations, the symbols “$\simeq$” and “$\approx$” are used. Expressions of the form $\text{lhs} \simeq \text{rhs}$ denote that $\text{lhs}$ is asymptotically equal to $\text{rhs}$ under given asymptotic assumptions (e.g., $N\to\infty$). The particular assumptions are stated explicitly for every use of “$\simeq$”. That is, in the limit case of the given assumptions, $\text{lhs}$ is equal to $\text{rhs}$. The form $\text{lhs} \approx \text{rhs}$ is used for cases where $\text{rhs}$ is an approximation of $\text{lhs}$ under given assumptions that are not of an asymptotic nature. In this sense, “$\approx$” is weaker than “$\simeq$”.