## Abstract

The convergence behaviors of so-called natural evolution strategies (NES) and of the information-geometric optimization (IGO) approach are considered. After a review of the NES/IGO ideas, which are based on information geometry, the implications of this philosophy for the optimization dynamics are investigated by considering the optimization performance on the class of positive definite quadratic objective functions (the ellipsoid model). Exact differential equations describing the approach to the optimizer are derived and solved. It is rigorously shown that the original NES philosophy of optimizing the expected value of the objective function leads to very slow (i.e., sublinear) convergence toward the optimizer. This is the real reason why state-of-the-art implementations of IGO algorithms optimize the expected value of transformed objective functions, for example, by utility functions based on ranking. It is shown that these utility functions are localized fitness functions that change during the IGO flow. The governing differential equations describing this flow are derived. In the case of convergence, the solutions to these equations exhibit an exponentially fast approach to the optimizer (i.e., linear convergence order). Furthermore, it is proven that the IGO philosophy leads to an adaptation of the covariance matrix that equals in the asymptotic limit—up to a scalar factor—the inverse of the Hessian of the objective function considered.

## 1 Introduction

For decades, the engineering of evolution strategies (ES) as well as of other evolutionary algorithm (EA) related designs (including genetic algorithms, particle swarm optimization, and differential evolution, to name a few) was mainly based on biomimicry (also referred to as bionics). That is, principles gleaned from biology were translated into optimization algorithms. While there were attempts to put the design philosophy of ESs on some kind of “axiomatic” base by formulating design principles (see Beyer and Deb, 2001; Beyer, 2001; Hansen, 2006; Beyer, 2007), a new development by Wierstra et al. (2008) and Sun et al. (2009) called natural evolution strategies (NES) promised a more principled approach to the design of EAs. This new approach yielded algorithm implementations that were similar to the well-known covariance matrix adaptation (CMA) ES of Hansen et al. (2003) with rank-μ update (see Glasmachers et al., 2010). This striking similarity gave rise to attempts grounding the CMA-ES in information geometry (Amari and Nagaoka, 2000), resulting in the work of Akimoto, Nagata, et al. (2012).

An alternative view on NES is to consider it as an algorithmic implementation of an information-geometric optimization (IGO) flow in the parameter space of the respective probability distribution family. That is, an IGO differential equation describing the time evolution of the distribution parameters is set up and solved numerically by discretizing the time (thus getting a generational algorithm, i.e., an EA) and estimating the expected values by Monte-Carlo sampling [the mutation and recombination process in ES; see Akimoto, Auger, et al. (2012)].

From a bird’s-eye view, most of the research done in this field follows this line of approach, directly or indirectly, with the aim of deriving real EA implementations. On the other hand, the IGO flow differential equation can be used to derive theoretical assertions w.r.t. the convergence behavior and, more generally, the time evolution of the IGO flow system. This line of research was opened by the work of Glasmachers (2012) and Akimoto, Auger, et al. (2012). Considering isotropic mutations, they proved convergence to the optimizer of convex quadratic functions using different techniques that avoid the direct solution of the IGO equation system. The most advanced results have been provided by Akimoto (2012), who proved linear convergence order on functions with ellipsoidal level sets.

It should be clear that proving convergence of the IGO flow without calculating the real time evolution is but a first step. The direct solution of the IGO flow system, however, yields the maximum information, which allows for a deeper understanding of the evolutionary dynamics of IGO. This is the primary goal of this paper. It is devoted to the calculation of the exact time evolution of NES algorithms optimizing convex quadratic objective functions (the so-called ellipsoid model). The NES to be analyzed uses normally distributed mutations with a fully developed covariance matrix. Both cases are considered: the ordinary expected fitness value optimization and the case of rank-based utility optimization. It will be shown that the original NES philosophy, that is, the natural gradient ascent in the expected fitness landscape, leads to sublinear convergence. That is, the original NES idea results in convergent but slowly performing algorithms. Introducing a rank-based evaluation instead of the direct fitness changes the behavior drastically. This rank-based evaluation localizes the search in the sense that the globally defined objective function is replaced by a locally (in time) acting utility function. Such a utility can be obtained in different ways. For example, considering the optimal ES weighting of Arnold (2006), one gets a locally standardized fitness. Alternatively, truncation selection as used in standard ESs has a similar effect. In both cases, asymptotically exponentially fast convergence to the optimizer will be proven. These results provide the theoretical explanation of why recent NES implementations always rely on rank-based utility functions. Additionally, the analysis technique presented also allows for the calculation of the dynamics of the covariance matrix evolution. As will be shown, the covariance matrix approaches the inverse of the Hessian of the objective function up to a scalar factor.

The rest of the paper is organized as follows. The philosophy of NES and IGO will be introduced in a self-contained manner. However, this introduction will not consider implementation aspects, since these are not needed in this paper. In Section 2 it will be shown—not solely for didactical reasons—that ordinary gradient ascent in expected fitness landscapes may cause problems in that different parameterizations may yield qualitatively different (and sometimes undesirable) behaviors. Thereafter, the idea of information distance constrained gradient ascent is introduced, building the philosophical basis of NES. The resulting IGO flow will be analyzed in Section 3 by solving the differential equation system, yielding slowly converging evolution dynamics. In Section 4, the idea of localized fitness evaluation is added to the NES framework and analyzed for optimal weighting and truncation selection, assuming normally distributed fitness values. Finally, in Section 5, conclusions are drawn.

## 2 NES/IGO in a Nutshell

Given a function to be optimized, there are numerous options to perform a gradual approach to the optimizer. One way—as pursued in evolutionary algorithms—consists of randomly sampling candidate solutions from a family of probability distributions, the density of which may be given parametrically.^{1} Here, the set of distribution parameters evolves over time *t* (the generation counter) with the goal of changing the distribution in such a manner that, with increasing *t*, the distribution is more and more concentrated about the optimizer. The manner in which the parameter set is changed over the generations *t* characterizes the different EA classes, such as estimation of distribution algorithms (EDAs) and ESs. Alternatively, the distribution can even be implicitly represented by the population of candidate solutions, as is usually done in genetic algorithms (GAs). However, in this work, we will focus on EAs where the distribution parameters are explicitly represented. In the next section, we will start with a naive approach by maximizing the expected value of *f*. It will be shown that this approach does not always lead to a stably performing EA. Therefore, in an attempt to get a theoretically principled approach based on expected value maximization, the class of so-called natural evolution strategies (NESs) will be considered (Wierstra et al., 2008). The underlying ideas will be presented and investigated in subsequent sections.

### 2.1 Maximizing the Expected Value of *f*

Instead of direct *f*-maximization, one considers the expected value of *f* under the sampling distribution; this leads to a transformed optimization problem, Equation (2). This approach is the starting point of the NESs of Wierstra et al. (2008). In order to have a meaningful optimization problem, as in Equation (2), one has to ensure that the parameter set allows for the expression of the estimate of the optimizer of *f* with arbitrary precision. This can be achieved if the expected value vector is itself part of the parameter set. For example, this is attained by using multivariate Gaussian normal distributions with mean vector and symmetric covariance matrix, Equation (3). The idea behind such a parameterization is that the mean evolves toward the optimizer and the covariance matrix shrinks in such a manner that the distribution of the samples gets more and more concentrated about the optimizer.
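As a numerical illustration (not part of the original derivation), the gradient of the expected value can be estimated by Monte Carlo sampling via the log-likelihood trick, since the gradient of E[*f*] equals the expectation of *f* times the score function. The following sketch uses an arbitrary one-dimensional toy objective and checks the estimator against the closed-form derivative:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # toy objective to be maximized; optimizer at x = 2
    return -(x - 2.0) ** 2

def grad_expected_f(y, sigma, n_samples=200_000):
    """Monte Carlo estimate of d/dy E[f(x)] for x ~ N(y, sigma^2),
    via the log-likelihood trick:
    d/dy E[f] = E[f(x) * d/dy ln p(x|y)] = E[f(x) * (x - y) / sigma^2]."""
    x = rng.normal(y, sigma, n_samples)
    return np.mean(f(x) * (x - y) / sigma**2)

# closed form: E[f] = -((y - 2)^2 + sigma^2), so dE/dy = -2 (y - 2) = 4 at y = 0
g = grad_expected_f(y=0.0, sigma=1.0)
```

Discretizing such gradient estimates over generations is what turns the expected-value view into a sampling-based EA.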

In the limit of infinitesimally small time steps, the *t*-discrete values of the distribution parameters become a continuous function of time *t*, and one obtains Equation (6). This is an ordinary differential equation (ODE) system that can serve as a model description of ESs. A modified version of Equation (6) has already been used in the so-called information-geometric optimization (IGO) framework in order to investigate the convergence behavior of infinite population size models of ESs; see Akimoto, Auger, et al. (2012), Glasmachers (2012), and the remainder of this paper.

If *f* is given in a simple form, one can calculate this time evolution. To this end, the maximization of the general ellipsoid model in Equation (7) is considered, with a symmetric positive definite matrix and an arbitrary (constant) vector. Note that the fitness model in Equation (7) can be regarded as a local approximation of more complex objective functions; in that case, the matrix is simply the Hessian of *f*.

There exists a finite time *t*_{0} above which the covariance matrix loses positive definiteness. That is, implementing an ES on the basis of this parameterization might cause convergence problems due to inappropriate covariance matrix evolution.

A remedy is to parameterize the covariance matrix by a matrix square root.^{2} This changes the density in Equation (8) accordingly, and the derivative w.r.t. the factor matrix **A** takes a new form. As a result, Equation (10b) changes to Equation (17). This equation can be solved using the *Ansatz* in Equation (18), as can easily be checked by inserting Equation (18) into Equation (17). As one can see, **A** and therefore the covariance matrix remain positive definite for all finite times because the initial matrix is positive definite. The result in Equation (18) is in contrast to Equation (16): the square root parameterization avoids the problem of evolving a nonpositive definite covariance matrix.

Let us have a closer look at the dynamics imposed by Equation (18). To this end, the solution is projected into the eigen system of the Hessian. Consider the eigenvalue problem in Equation (19). Note that a nondecreasing ordering of the eigenvalues has been chosen in Equation (19); that is, *q*_{1} is the smallest eigenvalue. Using the eigenvectors as the basis, Equation (18) shows that the projected components shrink exponentially quickly. However, this shrinking appears with different time constants. In the asymptotic limit, the dynamics are dominated by the slowest decaying mode, that is, the mode belonging to the smallest eigenvalue *q*_{1}. Transforming back, the covariance matrix degenerates: samples are *predominantly* produced in the direction of the largest principal axis, while the other directions, related to the larger eigenvalues, are overproportionally dampened. In other words, given an ellipsoid success domain (defined by Equation (7)), search is predominantly conducted in the major axis direction. That is, for large *t*, the search degenerates into a one-dimensional subspace. Similarly, if the ground state is degenerate with multiplicity *m*, then there are *m* orthogonal directions, resulting in a degenerate search in the corresponding *m*-dimensional subspace.
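The mode domination can be illustrated numerically. The following sketch assumes, for illustration only, a matrix-exponential decay of the type discussed above (the constant factors are not those of Equation (18)); the normalized covariance then collapses onto the eigen direction of the smallest eigenvalue:

```python
import numpy as np

# Hessian with eigenvalues q_1 = 1 < q_2 = 4 (eigenvectors = coordinate axes)
Q = np.diag([1.0, 4.0])
C0 = np.array([[2.0, 0.5],
               [0.5, 1.0]])            # initial covariance (SPD)

def C(t):
    # assumed illustrative decay law: each eigen mode shrinks like exp(-q_k t)
    E = np.diag(np.exp(-t * np.diag(Q)))
    return E @ C0 @ E

direction = C(5.0) / np.trace(C(5.0))  # normalized covariance at large t
```

Essentially all sampling mass concentrates along the first axis (the largest principal axis of the ellipsoid), reproducing the degeneration into a one-dimensional subspace.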

### 2.2 How to Get Unbiased

The aim is to change the distribution parameters from generation *t* to the next only minimally while being maximally unbiased. That is, the information gained in this step should be rather small. A quantity measuring this change is reminiscent of a relative entropy loss, or information gain *I*, considering the ratio of the new to the old density (see Rényi, 1961). The natural logarithm has been used instead of the usual base 2, resulting in an additional constant factor which is, however, without any relevance for the considerations here. Equation (24) evaluates to Equation (25), where the resulting integral can also be interpreted as the Kullback-Leibler divergence. While it is well known that the information gain defined in Equation (25) is not a distance measure in the sense of a metric (see, e.g., Eguchi and Copas, 2006), its infinitesimal version is. Interpreting the parameter changes as small quantities, the Taylor expansion of *I* up to the second order in the components yields, after a simple calculation, Equation (26), where the coefficients are the components of the Fisher information matrix (see Lehmann and Casella, 1998).^{3}

While for noninfinitesimal deviations the information gain is generally not symmetric in its two arguments, one can easily show that symmetry holds for infinitesimal deviations. That is, the rhs of Equation (26) can be considered as a differential length element on the statistical manifold induced by the distribution family. This is the starting point of information geometry: *I_{ij}* is just the metric tensor of the statistical manifold (see Amari and Nagaoka, 2000). However, for the considerations in this paper, the apparatus of Riemannian geometry is not really needed. All that is needed is the distance formula in Equation (26).
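The relation between the information gain and the Fisher quadratic form in Equation (26) can be spot-checked numerically for univariate Gaussians, for which the Kullback-Leibler divergence is available in closed form (the parameter values below are arbitrary):

```python
import math

def kl_gauss(mu_p, s_p, mu_q, s_q):
    """Exact KL divergence KL(N(mu_p, s_p^2) || N(mu_q, s_q^2))."""
    return math.log(s_q / s_p) + (s_p**2 + (mu_p - mu_q)**2) / (2 * s_q**2) - 0.5

mu, s = 0.3, 1.7
dmu, ds = 1e-3, -2e-3

exact = kl_gauss(mu, s, mu + dmu, s + ds)
# second-order approximation (1/2) delta^T I delta with the Gaussian
# Fisher matrix I = diag(1/s^2, 2/s^2) in the (mean, std) parameterization
approx = 0.5 * (dmu**2 / s**2 + 2 * ds**2 / s**2)
```

For infinitesimal parameter changes, the exact divergence and the Fisher quadratic form agree up to third-order terms, as asserted above.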

A collapse of the distribution into a subspace of the *R*^{N} might be avoided by constraining the search step in the parameter space such that the information gain *I* is kept at a (given) small level, while searching for the parameter change that provides the largest increase of the expected value. Using a Taylor expansion, this yields a constrained maximization problem that can be solved by Lagrange’s method. Using Equations (26) and (28) and neglecting higher order terms, one obtains the Lagrange function in Equation (29). Taking the derivatives w.r.t. the parameter change and the Lagrange multiplier, one gets Equations (30a) and (30b). Setting Equation (30a) to zero in order to find the stationary point, one obtains Equation (31) in matrix notation, in which the inverse of the Fisher matrix of Equation (27) appears. Setting Equation (30b) to zero and inserting Equation (31), one can solve for the Lagrange multiplier; reinserting the result into Equation (31) yields Equation (33). Noting that the square root in Equation (33) is an infinitesimal quantity, one can interpret it as an infinitesimal time change from *t* to *t* + d*t* that causes a corresponding change of the parameters. Therefore, dividing by the time increment, the rhs of Equation (33) can be interpreted as the time derivative of the parameter vector. Thus, one gets the ordinary differential equation (ODE) system in Equation (34). This ODE system is different from Equation (6) in that the gradient direction is transformed by the inverse of the Fisher information matrix. According to Amari (1998), the rhs of Equation (34) is called the “natural gradient”^{4} in contrast to the ordinary gradient.

The idea of natural gradient ascent in the expected value landscape is the key idea of the NESs^{5} (see, e.g., Wierstra et al., 2008). The differential Equation (34) is also referred to as an IGO differential equation in Akimoto, Nagata, et al. (2012). The information flow generated by this differential equation will be the subject of investigation in the next section. An astonishing result will be derived for the ellipsoid model in Equation (7), showing that the IGO flow in Equation (34) results in only sublinear convergence order.
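The parameterization independence that motivates the natural gradient can be made concrete with a small closed-form example (a sketch, not taken from the paper: E[*f*] = −(y² + σ²)/2 corresponds to f(x) = −x²/2 with x ~ N(y, σ²)). The same Gaussian family is parameterized once by (y, σ) and once by (y, v) with v = σ²; the ordinary gradients of E[*f*] disagree after the change of variables, whereas the natural-gradient flows induce the identical change of the distribution:

```python
# Natural gradient = (inverse Fisher matrix) @ (ordinary gradient),
# computed in closed form for two parameterizations of N(y, sigma^2).

def natural_grad_sigma(y, sigma):
    # Fisher for (y, sigma): diag(1/sigma^2, 2/sigma^2); ordinary grad: (-y, -sigma)
    return (sigma**2 * (-y), (sigma**2 / 2) * (-sigma))

def natural_grad_var(y, v):
    # Fisher for (y, v), v = sigma^2: diag(1/v, 1/(2 v^2)); ordinary grad: (-y, -1/2)
    return (v * (-y), (2 * v**2) * (-0.5))

y, sigma = 1.0, 2.0
dy1, dsigma = natural_grad_sigma(y, sigma)
dy2, dv = natural_grad_var(y, sigma**2)
# consistency via the chain rule: d(sigma^2)/dt = 2 sigma dsigma/dt must equal dv/dt
```

The mean flows coincide, and the induced variance flows agree through the chain rule; an ordinary-gradient ascent would not pass this check.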

## 3 On the Dynamics of IGO without Utility Functions

According to Amari (1998, p. 251),

“the ordinary gradient does not give the steepest direction of a target function; rather, the steepest direction is given by the natural (or contravariant) gradient.”

This assertion was made under the premise of a Riemannian metric structure (Amari, 1998, p. 251). From the analysis presented so far in this paper, this assertion seems hard to justify under the premise of expected value maximization of Equation (1). While the family of Gaussian distributions gives rise to a nonflat metric, it remains an open question how quickly Equation (34) approaches the steady state and the optimizer. To shed some light on this matter, we will investigate the dynamics of the IGO flow on the general quadratic function *f* given by Equation (7), using Gaussian samples parameterized by mean vector and covariance matrix. To this end, the inverse Fisher information matrix of the Gaussian distribution must be calculated in order to get the concrete form of Equation (34).

### 3.1 Derivation of the IGO Differential Equation

Since the mean does not depend on the covariance parameters and vice versa, the resulting Fisher information matrix in Equation (35) subdivides into a part acting on the mean and another acting on the covariance; that is, the cross terms vanish. As for the components w.r.t. the mean, one immediately gets Equation (36) from Equation (35). Treating the covariance-related part in Equation (35) yields Equation (37); thus, one gets Equation (38) for the covariance-related part of the Fisher matrix. Due to the block structure, calculating the inverse of the Fisher matrix reduces to the calculation of the inverses of Equations (36) and (38) separately. Considering Equation (36), one immediately obtains Equation (39); the correctness of Equation (40) is proven directly by multiplication. Now, the IGO differential equation can be directly obtained from Equation (34) using Equations (39) and (40); it reads as Equation (41). While the mean dynamics were directly obtained as a matrix vector product, the covariance dynamics needed an intermediate step. Using the gradients already calculated in Equation (9), one finally obtains the nonlinear IGO differential equation system in Equation (42). In the next section, a solution to the system in Equation (42) will be derived.
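The mean-related Fisher block, Equation (36), equals the inverse covariance matrix. As an illustration (numbers are arbitrary), this can be verified by a Monte Carlo estimate of the expected outer product of the score s = C⁻¹(x − y):

```python
import numpy as np

rng = np.random.default_rng(1)
y = np.array([1.0, -2.0])
C = np.array([[2.0, 0.6],
              [0.6, 1.0]])
Cinv = np.linalg.inv(C)

x = rng.multivariate_normal(y, C, size=200_000)
scores = (x - y) @ Cinv          # rows: grad_y ln p(x | y, C) = C^{-1}(x - y)
F = scores.T @ scores / len(x)   # Monte Carlo Fisher estimate E[s s^T]
```

The estimate reproduces C⁻¹ up to sampling noise, confirming the block structure used in the derivation.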

### 3.2 On the Time Evolution of the IGO System

A closer look at Equation (42) reveals that Equation (42b) is independent of Equation (42a). Therefore, in a first step, Equation (42b) can be solved on its own.

Equation (42b) can be treated by an *Ansatz*. Calculating the time derivative of Equation (51), one gets Equation (52); here it was implicitly assumed that the involved matrices commute.^{7} Comparing Equation (52) with Equation (50), one obtains the ODE in Equation (53). This ODE can be formally integrated using the matrix logarithm, yielding Equation (54), where a constant matrix remains to be determined. Inserting this result in Equation (51) and taking the initial condition into account, one gets the constant [using Equation (54)]. Plugging this into Equation (54) leads to the next theorem.
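The sublinear convergence announced above can already be seen in the one-dimensional specialization of the flow (constant factors aside). For f(x) = −(q/2)x² and x ~ N(y, c), the natural-gradient flow of the expected fitness reads ẏ = −cqy, ċ = −qc², with the closed-form solution y(t) = y₀/(1 + qc₀t), c(t) = c₀/(1 + qc₀t): the approach to the optimizer is only algebraic in t. A forward-Euler check of this (illustrative) special case:

```python
# 1-D natural-gradient flow of the expected fitness for f(x) = -(q/2) x^2,
# x ~ N(y, c): dy/dt = -c q y, dc/dt = -q c^2 (Fisher metric diag(1/c, 1/(2c^2))).
q, y0, c0 = 2.0, 1.0, 1.0
dt, steps = 1e-4, 500_000            # forward Euler up to T = 50

y, c = y0, c0
for _ in range(steps):
    y, c = y + dt * (-c * q * y), c + dt * (-q * c * c)

T = dt * steps
y_exact = y0 / (1.0 + q * c0 * T)    # closed form: algebraic 1/t decay
```

After t = 50 the mean is still only a factor of about 1/(1 + qc₀t) ≈ 1/101 closer to the optimizer, far slower than any exponential decay.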

### 3.3 Discussion

## 4 NES/IGO Localized—On the Use of Utility Transforms

### 4.1 Assessing Utility by Individual Ranking

Modern NES implementations assign *utility* values to the individuals sampled, that is, weights reflecting the *ranking* of those individuals. Thus, NES became increasingly similar to classical ESs, as in Beyer and Schwefel (2002), Hansen et al. (2003), and Beyer (2007). It is very important to stress once more: the use of fitness shaping techniques cannot be deduced from the natural gradient ascent philosophy. It must be introduced in an ad hoc manner. In the literature, its use is mainly justified by rendering “the algorithm invariant under monotonically growing (i.e., rank preserving) transformations of the fitness function” (Glasmachers et al., 2010, p. 394). The same argument is used to explain the application of weighted selection in CMA-ES (see Hansen, 2006). Therefore, it does not come as a surprise that in the fully developed IGO framework (see Akimoto, Auger, et al., 2012), a rank-based utility function replaces the objective function in Equation (1). Thus, Equation (1) changes to Equation (64). Here, it is important to note that the utility function itself depends on the distribution parameters. That is, the globally defined fitness function is replaced by a locally acting function.

The local character of this weighting function becomes immediately clear when looking at a standard ES with truncation selection. There, offspring are generated according to the offspring distribution. The best individuals (i.e., those with the largest *f* values in the case of maximization) are taken with positive weights, and the rest get weights equal to zero,^{8,9} Equation (65). This ranking clearly depends on the choice of the distribution parameters. That is, the weighting changes with every iteration step; in the continuous limit, it becomes a function of time. As a replacement for the objective function, it can be regarded as some kind of *local* fitness. Climbing up the expected value landscape in Equation (1) is therefore governed by the gradient acting on the density function exclusively. That is, instead of the IGO differential equation in Equation (34), one now has to consider Equation (66) (see Ollivier et al., 2011), and the gradient must be evaluated according to Equation (67). That is, the gradient operator must not act on the utility function itself.
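For illustration, one generational cycle of a discretized flow with rank-based truncation weights might look as follows. This is a sketch under simplifying assumptions that are not in the text: the covariance is kept fixed at the identity, only the mean is updated, and all parameter values are arbitrary (a full NES/CMA-ES additionally adapts the covariance matrix):

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    # toy ellipsoid to be maximized; optimizer at (3, -1)
    return -((x[..., 0] - 3.0) ** 2 + 4.0 * (x[..., 1] + 1.0) ** 2)

def igo_truncation_step(y, lam=200, mu=50, eta=0.5):
    """One sketched generation: sample, rank by f, assign truncation
    weights (1/mu for the best mu, zero otherwise), and update the mean.
    For the Gaussian mean, the natural gradient of the weighted
    log-likelihood sum is the weighted average of the deviations x_i - y."""
    x = rng.normal(y, 1.0, size=(lam, 2))   # covariance fixed at identity
    order = np.argsort(f(x))[::-1]          # best (largest f) first
    w = np.zeros(lam)
    w[order[:mu]] = 1.0 / mu
    return y + eta * (w @ (x - y))

y = np.zeros(2)
for _ in range(80):
    y = igo_truncation_step(y)
```

Note that the weights depend only on ranks, so any rank-preserving transformation of f leaves the iteration unchanged, which is exactly the invariance argument quoted above.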

Such weights reflect only the ranking (not the raw *f* values). Moreover, ES theory even allows for the determination of optimal weights based on asymptotic sphere model assumptions. Arnold (2006) has shown for ESs with isotropic mutations that choosing the weights according to Equation (69) guarantees maximal progress toward the optimizer, provided that the mutation strength is controlled correctly. Here, the lhs in Equation (69) is the expected value of the *m*th order statistic of the standard normal variate, given by the integral in Equation (70), in which the pdf and cdf of the standard normal variate appear. For large sample sizes, Equation (70) can be expressed asymptotically by Equation (71) (Arnold et al., 1992, p. 128, Equation (5.5.2)), using the quantile function of the standard normal variate. These optimal weights are different from the heuristic weights in Equation (65) in that each of the individuals gets a weight, not only the best ones. Unlike Equations (68) and (65), Equation (69) also yields negative weights, namely for individuals whose *f* values are below the median of *f*. The weights are antisymmetric about the median.

In the remainder of this section, the effect of the local weight models on the dynamics of the ES will be investigated considering the linear and quadratic fitness models in Equation (7). It turns out that using the optimal weights results in surprisingly simple expressions that allow for closed-form solutions of the IGO differential equation.
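The order-statistics integral of the type in Equation (70) and a quantile approximation in the spirit of Equation (71) can be checked with a short computation (only the Python standard library is used; the integration resolution is an arbitrary choice):

```python
import math
from statistics import NormalDist

nd = NormalDist()

def expected_order_stat(m, lam, lo=-10.0, hi=10.0, n=20_000):
    """E[z_{m:lam}]: expected value of the m-th smallest of lam iid
    standard normal samples, via the classical order-statistics
    integral, evaluated with the midpoint rule."""
    c = lam * math.comb(lam - 1, m - 1)
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        z = lo + (i + 0.5) * h
        F = nd.cdf(z)
        total += c * z * nd.pdf(z) * F ** (m - 1) * (1.0 - F) ** (lam - m)
    return total * h

lam = 20
exact = [expected_order_stat(m, lam) for m in range(1, lam + 1)]
approx = [nd.inv_cdf(m / (lam + 1)) for m in range(1, lam + 1)]  # quantile rule
```

The expectations are antisymmetric about the median, and the quantile approximation is accurate for middle ranks while only asymptotic for extreme ranks.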

### 4.2 Asymptotic Fitness Distributions and Search Gradients

Evaluating Equation (67) requires, in general, an *N*-dimensional integral. Taking into account, however, that the local weight function depends only on the *f* values, Equation (67) can be transformed into a one-dimensional integral by mapping the mutation vectors to their fitness values. Thus, Equation (67) changes to Equation (72). This integral is tractable provided that the pdf of the *f* values can be expressed in simple terms, for example, in terms of a Gaussian distribution, Equation (73). This holds exactly for linear fitness functions. Provided that the eigenvalue spectrum of the Hessian of the quadratic fitness function in Equation (7) does not contain singularly dominating eigenvalues, Equation (73) also holds for the quadratic case for large *N*, due to the central limit theorem of statistics. In order to proceed under these assumptions, the expected value and the variance must be determined. To this end, the local fitness function is considered that follows from Equation (7) by substituting the current parent state and the mutation vector. Since the mutation vector has zero mean, one only has to consider the expected value of the quadratic term, which yields the known result (Beyer, 2001, p. 122); thus, one gets Equation (76). As for the variance, it is noted that constant terms do not contribute to the variance of a random variate. Making use of the standard deviation formula in Beyer (2001, p. 123, Equation 4.59), one finds, mutatis mutandis, Equation (77). Note that Equation (77) also contains the case of linear fitness as a special case, for which the quadratic contribution vanishes. Since the remaining calculation depends on the concrete weight function *W_f*, it will be performed in separate sections focusing on the optimal weights in Equation (71) and the truncation selection weights in Equation (65), respectively.
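Expressions of the kind entering Equations (76) and (77) can be spot-checked by Monte Carlo sampling. The sketch below assumes the generic quadratic form f(x) = bᵀx − (1/2)xᵀQx with z ~ N(0, C); all matrices are arbitrary illustrative choices, and the stated mean/variance formulas are the standard results for Gaussian quadratic forms:

```python
import numpy as np

rng = np.random.default_rng(3)

Q = np.array([[3.0, 0.5],
              [0.5, 1.0]])              # SPD "Hessian" (illustrative)
b = np.array([1.0, -2.0])
y = np.array([0.5, 0.5])                # current mean
C = np.array([[0.8, 0.2],
              [0.2, 0.5]])              # current covariance (SPD)

def f(x):
    return x @ b - 0.5 * np.einsum('...i,ij,...j->...', x, Q, x)

z = rng.multivariate_normal(np.zeros(2), C, size=400_000)
samples = f(y + z)

g = b - Q @ y                           # gradient of f at y
E_theory = f(y) - 0.5 * np.trace(Q @ C)                 # E[f(y + z)]
V_theory = g @ C @ g + 0.5 * np.trace(Q @ C @ Q @ C)    # Var[f(y + z)]
```

The sampled mean and variance match the closed-form expressions; the linear-plus-half-trace structure is what makes the one-dimensional Gaussian approximation of the fitness distribution workable.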

### 4.3 Dynamics of IGO with Arnold’s Optimal Weights

The rank *m* of an individual can be directly inferred from its *f* value via the (local) cumulative distribution function of the fitness samples. Assuming normality given by Equation (73), this transfers to Equation (85). Plugging this into Arnold’s optimal (sphere) weight formula in Equation (69), using the asymptotically exact version in Equation (71), one ends up with the surprisingly simple expression in Equation (86) (for large sample sizes): Arnold’s optimal weights transform into a utility function that is simply the local standardization of *f* in the infinite population size model. This result also shares a similarity with the so-called fitness baseline method found in older versions of NES (see Wierstra et al., 2008); however, the standardization by the fitness standard deviation is missing there. As we will see below, this difference causes considerable differences in the convergence behavior of the ES.

In the resulting integrals, the odd *t* powers vanish, and one obtains the required gradient expressions; using Equation (81), one gets the final form. Now, the IGO differential equations in Equation (66) can be set up using Equation (41), replacing *E_f* by *E_W*, where the local fitness standard deviation is given by Equation (77). Finding a closed-form solution to the nonlinear ODE system in Equation (92) seems generally excluded. However, an asymptotic solution can be derived.

#### 4.3.1 Solving the Quadratic Fitness Case

See the derivation given above.

Both the mean and the covariance dynamics are driven by the *same* scalar function on the rhs in Equation (105). As a result, the solution can be expressed in terms of a single scalar function that remains to be determined. To summarize, the general solution to Equation (105) reads as Equation (106). Inserting this into Equation (105), one obtains Equation (107). Since the denominator in Equation (105) is positive, this scalar function is monotonically increasing; by virtue of Equation (106), it is also unbounded and diverges for large times. Therefore, an asymptotically exact solution to Equation (107) can be constructed by neglecting the subdominant term in the denominator of Equation (107), leading to the simple differential equation in Equation (108), the solution of which is Equation (109) with the inverse time constant in Equation (110). Equating Equation (106) with Equation (104) and resolving for the covariance matrix yields Equation (111). Now, inserting the asymptotic solution from Equation (109) into Equation (111), one obtains Equation (112). This result gives rise to the following theorem.

As for Equation (113), the proof is obtained by following the derivation presented above yielding Equation (112), and finally inserting Equation (110). In order to prove Equation (114), one has to make use of Lemma 1, Equation (102). Inserting Equation (113) into Equation (102) yields Equation (114).

The inverse time constant depends on the search space dimensionality *N* only via an unexpected square root law and shows no dependency on the Hessian. While the latter might come as a surprise, one should recall that we have derived the asymptotic solution to Equation (93). According to Equation (113), the covariance matrix becomes proportional to the inverse of the Hessian. That is, the IGO ES effectively “sees” a sphere model: asymptotically, the IGO ES has transformed the ellipsoidal level sets of *f* into spherical ones, and the dependencies on the Hessian vanish. Note that, according to Equation (111), the influence of the initially chosen covariance matrix vanishes exponentially fast.

The law in Equation (115) is somewhat peculiar. Considering the performance of isotropic -ES with mutation strength control using self-adaptation or cumulative stepsize adaptation on the sphere model, Arnold and Beyer (2004) and Meyer-Nieberg and Beyer (2005), respectively, found time constants proportional to *N*. That is, the expected runtime for a fixed relative improvement in *f* values is proportional to *N*, whereas for the IGO ODE it grows only with the square root of the search space dimensionality. It is currently an open question whether this square root law indicates a general lower bound on the *N* scaling for algorithms derived from the IGO ODE, taking into account that the IGO ODE is an infinite population size model.

Considering Equation (113), one sees that the covariance matrix adapts to the desired behavior in Equation (23) exponentially fast. That is, we have *proven* that this type of IGO approximates the inverse of the Hessian up to a scalar factor. Such a behavior has been observed in real ESs using covariance matrix adaptation, as, for example, the evolutionary gradient search ES (Arnold and Salomon, 2007, p. 492, Footnote 3). A similar result has been derived in Akimoto (2012) using difference equations, where it was shown that the covariance matrix evolves, up to a scalar factor, to the inverse of the Hessian of *f*. In that work, a specially tailored IGO NES has been considered that deviates from the model given in Equation (93). It relies on an ad hoc constructed time-dependent stepsize scaling factor that uses eigenvalue information taken from the actual covariance matrix.

Comparing the results of Theorem 3 with those of Theorems 1 and 2 regarding NES without local utility transformation, one sees that the rank-based weighting yields qualitatively better performance on quadratic fitness models. Having a closer look at the governing ODEs (42) and (92), one notices the standard deviation terms in the denominators of the latter. Getting closer to the optimizer, this standard deviation gets smaller, thus counteracting the decrease of the numerators on the rhs of Equations (92a) and (92b), with the result of larger derivatives. Tracing back these additional terms, one finds that they are due to the standardization term from Equation (86) entering the integral in Equation (88). That is, the linear convergence order of this IGO version is clearly a result of the (sphere) optimal rank-based weighting in Equation (69), leading to the locally standardized utility function in Equation (86).
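The qualitative difference can be reproduced in one dimension. For f(x) = −(q/2)x² and x ~ N(y, c), the expected-fitness flow is ẏ = −cqy, ċ = −qc²; dividing the right-hand sides by the local fitness standard deviation D = q·sqrt(y²c + c²/2) mimics the standardized utility (this is only a structural sketch of Equation (92), constant factors aside). The first flow decays algebraically, the second exponentially:

```python
import math

# f(x) = -(q/2) x^2, x ~ N(y, c). Plain expected-fitness flow:
#   dy/dt = -c q y, dc/dt = -q c^2
# Standardized-utility flow (sketch): divide both right-hand sides by the
# local fitness standard deviation D = q * sqrt(y^2 c + c^2 / 2).
q = 1.0
dt, steps = 1e-4, 300_000            # forward Euler up to T = 30

yp, cp = 1.0, 1.0                    # plain flow
ys, cs = 1.0, 1.0                    # standardized flow
for _ in range(steps):
    yp, cp = yp + dt * (-cp * q * yp), cp + dt * (-q * cp * cp)
    D = q * math.sqrt(ys * ys * cs + cs * cs / 2.0)
    ys, cs = ys + dt * (-cs * q * ys / D), cs + dt * (-q * cs * cs / D)
```

After the same integration time, the plain flow has crept only to about y₀/(1 + qc₀t), while the standardized flow has contracted by many orders of magnitude without losing positivity of the variance.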

#### 4.3.2 The Linear Fitness Case

### 4.4 Dynamics of IGO with Truncation Selection

Under truncation selection, only individuals above a certain *f* quantile are used in the calculation of the gradient in Equation (67). Let ϑ = μ/λ be the truncation ratio; then the local weighting function in Equation (65) returns a constant positive weight for *f* values above the quantile and zero otherwise, Equation (120). Assuming normally distributed fitness values *f*, the pdf is given by Equation (73). Therefore, the quantile obeys Equation (121). That is, the first line in Equation (120) is fulfilled for *f* values that fulfill Equation (122), where the inverse of the cdf (i.e., the quantile function) of the standard normal variate appears. Now, plugging Equation (120) into Equation (87) yields Equation (123). Changing the integration variable from *f* to the standardized variate and noting how Equation (122) transforms yields Equation (124).

The remaining *t* integration can be performed using the integral formulas (A.16) and (A.17) in Beyer (2001, p. 331). One gets Equations (125) and (126), with the abbreviation in Equation (127). Thus, one obtains Equation (128) for Equation (124). Inserting the gradients w.r.t. the mean and the covariance matrix given by Equations (81), (82), and (84) into Equation (128) yields Equations (129) and (130). The IGO ODE system in Equation (41) then becomes, replacing *E_f* by *E_W*, Equations (135a) and (135b). Its asymptotic solution will be investigated in the next section.
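The coefficients generated by the *t* integration are of the progress-coefficient type known from ES theory. As an illustration (not the paper's exact h quantities), the mean of the selected fraction, E[z | z above the (1 − ϑ) quantile] = φ(Φ⁻¹(1 − ϑ))/ϑ for a standard normal z, can be computed and cross-checked with the Python standard library:

```python
from statistics import NormalDist

nd = NormalDist()

def selection_intensity(theta):
    """Mean of the selected fraction: E[z | z above the (1 - theta)
    quantile] = pdf(q) / theta with q = Phi^{-1}(1 - theta)."""
    q = nd.inv_cdf(1.0 - theta)
    return nd.pdf(q) / theta

def selection_intensity_numeric(theta, hi=10.0, n=50_000):
    # midpoint-rule check of (1/theta) * int_q^inf z pdf(z) dz
    q = nd.inv_cdf(1.0 - theta)
    h = (hi - q) / n
    return sum((q + (i + 0.5) * h) * nd.pdf(q + (i + 0.5) * h)
               for i in range(n)) * h / theta
```

Sharper truncation (smaller ϑ) yields a larger selection intensity, which is the mechanism behind the faster covariance contraction discussed below.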

#### 4.4.1 Dynamics of the Quadratic Fitness Case

For truncation ratio ϑ = 1/2, the time *t* undergoes a linear transformation; this is equivalent to a change of the inverse time constant of Equation (110) by a constant factor. Thus, the asymptotic dynamics can be taken from Theorem 3, Equations (113) and (114), yielding Equation (136), where the factor was calculated using Equation (125). For this special case, we have shown that the IGO ODE exhibits linear convergence order. Considering ϑ ≠ 1/2 is much more involved and not completely solved up until now. Therefore, we will first discuss the qualitative behavior of the ODE system when changing ϑ, and afterward we will derive a solution that holds in the vicinity of ϑ = 1/2.

Considering the different matrix terms in Equation (135b), one notices that all these single terms are positive definite. For the last term, this becomes clear by substitution, noting that the Hessian is positive definite.

First, consider the case ϑ = 1/2 in Equation (135b). The term with the *h*_{2} factor vanishes because this factor is zero at ϑ = 1/2. The remaining expression in the bracket is positive definite. Due to the negative sign in front of the bracket, the remaining ODE describes the *contraction* of the covariance matrix, the dynamics of which are given by Equation (113).

Increasing the truncation ratio, that is, ϑ > 1/2, increases the positive definiteness of the square bracket, resulting in a faster contraction of the covariance matrix. If, however, the contraction rate is too fast, the evolution of the mean vector cannot keep pace. In such a case, the ODE system describes premature convergence of the IGO algorithm.

Conversely, decreasing the truncation ratio, that is, ϑ < 1/2, decreases the positive definiteness of the square bracket compared to the ϑ = 1/2 case. Actually, the expression in the square bracket can become negative definite. As a result, the covariance matrix expands. This behavior is desirable to a certain extent, especially in cases where the initial covariance matrix was chosen too small. However, it can also result in uncontrolled growth.

Now, the linear time transformation of *t* can be applied to Equations (140a) and (140b), yielding an ODE system that is similar to the system in Equation (93) except for the time parameter. Therefore, Theorem 3 can be applied using the transformed time, Equation (142), instead of *t* in Equations (113) and (114). The exponent in Equations (113) and (114) changes accordingly, and the inverse time constant follows from Equations (141), (125), and (127), as do the asymptotic dynamics. The special case in Equation (136) is contained in Equation (145), as can be checked by inserting the corresponding values in Equations (144) and (145) and comparing with Equation (136). Concerning the admissible truncation ratios, it should be pointed out that the validity range of Equation (145) cannot be determined by the calculations presented. Further research is needed to determine the range for which Equation (145) guarantees linear convergence order.

An alternative convergence proof for IGO using the special case of isotropic mutations has been presented by Glasmachers (2012). However, in that work, the existence of a linear convergence rate was only claimed, not proven. Another approach, proving convergence to the optimizer using Lyapunov’s stability analysis, was proposed by Akimoto, Nagata, et al. (2012). However, that analysis did not provide any assertions concerning the convergence order. Therefore, it does not contribute to the question of why and how a local weighting function *W*_{f} is needed for efficiently working ESs.

Considering Equation (145a), one sees that the covariance matrix becomes asymptotically proportional to the inverse of the Hessian of *f*, that is, to the inverse of the matrix of the quadratic model in Equation (7). This is a proof of the long-conjectured property of CMA-ES-like algorithms using truncation selection that is empirically observed when running such algorithms on goal functions that can be locally approximated by quadratic functions (see Arnold and Salomon, 2007).
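
This asymptotic proportionality can also be observed numerically. The following sketch is a minimal xNES-style implementation in the spirit of Glasmachers et al. (2010); it is not the exact algorithm analyzed in this paper, and the population size and the reduced covariance learning rate are illustrative choices. Run on the ellipsoid f(x) = x₁² + 9x₂² with Hessian H = diag(2, 18), the adapted covariance matrix C should align with H⁻¹, so the ratio C₀₀/C₁₁ should approach 9.

```python
import numpy as np

def sym_expm(M):
    # Matrix exponential of a symmetric matrix via eigendecomposition.
    w, V = np.linalg.eigh(M)
    return (V * np.exp(w)) @ V.T

def xnes(f, mu, sigma, iters=2000, seed=0):
    # Minimal xNES-style sketch: natural-gradient updates for the mean,
    # the step size, and the normalized transformation matrix B.
    rng = np.random.default_rng(seed)
    d = len(mu)
    lam = 12                 # population size, enlarged here to reduce noise
    eta_mu = 1.0
    # Covariance/step-size learning rate, halved from the common xNES
    # default 0.6*(3+log d)/(d*sqrt(d)) for a smoother illustration.
    eta_sb = 0.3 * (3 + np.log(d)) / (d * np.sqrt(d))
    # Rank-based utility weights; the best sample gets the largest weight.
    u = np.maximum(0.0, np.log(lam / 2 + 1) - np.log(np.arange(1, lam + 1)))
    u = u / u.sum() - 1.0 / lam
    B = np.eye(d)
    for _ in range(iters):
        Z = rng.standard_normal((lam, d))
        X = mu + sigma * Z @ B.T
        order = np.argsort([f(x) for x in X])     # minimization: best first
        Z = Z[order]
        G_delta = u @ Z
        G_M = sum(uk * (np.outer(z, z) - np.eye(d)) for uk, z in zip(u, Z))
        G_sigma = np.trace(G_M) / d
        G_B = G_M - G_sigma * np.eye(d)
        mu = mu + eta_mu * sigma * B @ G_delta
        sigma = sigma * np.exp(0.5 * eta_sb * G_sigma)
        B = B @ sym_expm(0.5 * eta_sb * G_B)
    return mu, sigma, B

# Ellipsoid f(x) = x_1^2 + 9 x_2^2; Hessian H = diag(2, 18).
f = lambda x: x[0]**2 + 9.0 * x[1]**2
mu, sigma, B = xnes(f, np.array([3.0, 2.0]), 1.0)
C = sigma**2 * B @ B.T
print(f(mu), C[0, 0] / C[1, 1])   # ratio tends toward H's eigenvalue ratio 9
```

The ratio fluctuates around its asymptotic value because of the finite population size, which is precisely the kind of sampling effect the infinite-population IGO ODE abstracts away.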

#### 4.4.2 The Linear Fitness Case

## 5 Conclusions

Deriving EAs from first principles is a tempting approach to put EA engineering on a sound scientific base. In an attempt to provide such an approach, the NESs were proposed at the end of the last decade. A closer look at the premises of that approach reveals a paradigm that might be condensed into “an information gain constrained gradient ascent in the expected value fitness landscape.” According to Amari (1998), the resulting ascent direction is referred to as the natural gradient, being the steepest ascent direction. However, as has been shown in this paper, considering quadratic objective functions, the natural gradient ascent, when applied to expected fitness landscapes, results in a slowly (but still) converging information geometric flow toward the optimizer. The approach to the optimizer obeys a sublinear law, Equation (48). While Amari’s claim concerning the steepest ascent direction is not wrong, because this is the best direction one can get given the constraints imposed on the information gain, natural gradient ascent alone does not necessarily yield efficient EAs.
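
This sublinear behavior can be reproduced in a strongly simplified setting. The sketch below is an own derivation used only for illustration, not an equation system from this paper: for the sphere model f(x) = −‖x‖² under an isotropic Gaussian N(μ, σ²I), the Fisher metric of (μ, σ) turns the gradient of the expected fitness into the flow dμ/dt = −2σ²μ, dσ/dt = −σ³, whose exact solution μ(t) = μ₀/(1 + 2σ₀²t) approaches the optimizer only as 1/t.

```python
import numpy as np

# Natural-gradient flow of the *expected* fitness for the sphere model
# f(x) = -||x||^2 under N(mu, sigma^2 I) (own simplified derivation,
# illustrating the sublinear behavior only):
#   dmu/dt    = -2 sigma^2 mu,
#   dsigma/dt = -sigma^3,
# with exact solution mu(t) = mu0 / (1 + 2 sigma0^2 t).

def integrate(mu0, sigma0, t_end, dt=1e-3):
    mu, sigma = mu0, sigma0
    t = 0.0
    while t < t_end:
        mu = mu - dt * 2.0 * sigma**2 * mu    # explicit Euler step
        sigma = sigma - dt * sigma**3
        t += dt
    return mu, sigma

mu0, sigma0 = 5.0, 1.0
for t_end in (10.0, 100.0):
    mu, _ = integrate(mu0, sigma0, t_end)
    print(t_end, mu, mu0 / (1.0 + 2.0 * sigma0**2 * t_end))
```

Multiplying the simulation time by ten reduces the residual distance by only roughly a factor of ten, i.e., the flow converges, but merely sublinearly.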

Satisfactory, however, is the result concerning the covariance matrix evolution. In all cases investigated that rely on natural gradient ascent, it has been proven that, in the case of convergence, the covariance matrix gets asymptotically proportional to the inverse of the Hessian of *f*. This behavior is desirable since it counteracts the degeneration tendencies of the search distribution in subspaces.

The slow convergence behavior of the original NES is due to the combination of two properties of the original NES approach: On the one hand, the optimization of the objective function *f* is transformed into a globally defined expected value landscape. On the other hand, fast changes of the search distribution are suppressed by the natural gradient. Since returning to ordinary gradient descent causes problems for some distribution parameterizations (as has been shown in Section 2.1), the remaining option is to localize the evaluation. That is, the utility of candidate solutions must be evaluated in a time-local manner.

As might have become clear in the discussion given above, the NES/IGO ODE design contains two decisions, which are rather independent:

1. The choice of an appropriate statistical manifold and an ascent or descent principle, respectively.

2. The choice of a utility function that transforms the original objective function.

The IGO ODE design choice (1) seems to be obviously fixed by the steepest ascent/descent in accordance with the metric determined by the distribution family chosen. Apart from the fact that the choice of the distribution family cannot be deduced from the IGO principle, real ES implementations—for example, xNES; see Glasmachers et al. (2010)—also depart from the natural gradient direction by introducing different learning rates for the different distribution parameters. This is clearly for the sake of real algorithm performance and marks the limitations of the infinite population size assumption inherent in the IGO ODE approach. Such implementational tweaks are hard to deduce from the IGO philosophy.

The same holds for the design choice (2) regarding the utility function used. In order to get linear convergence order in the IGO ODE framework, one has to localize the evaluation of the *f* values generated. There are different options to localize the evaluation. Arnold’s optimal weight function, derived for the ES on the sphere model, yields in the asymptotic population limit a utility function that is just the locally standardized fitness. That is, *f* values below the local mean value get a negative evaluation. While a similar fitness baseline can already be found in early NES versions, the local standard deviation in the denominator of the standardization formula in Equation (86) makes the decisive difference: it ensures large utility signals when getting closer to the optimizer. It would be interesting to implement this theoretical finding in a new NES version and evaluate its performance on standard testbeds. This new NES version would work without ranking, similar to the evolutionary gradient search of Arnold and Salomon (2007).
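
The effect of the standardization can be illustrated directly. In the sketch below (the sphere function and the sampling scales are illustrative choices), the raw spread of the *f* values in a population vanishes as the search distribution closes in on the optimizer, while the locally standardized utility keeps unit magnitude by construction.

```python
import numpy as np

# Raw f-value differences shrink as the population approaches the
# optimizer, but the locally standardized fitness (f - mean_f) / std_f
# always delivers a utility signal of unit magnitude.
rng = np.random.default_rng(1)
f = lambda x: np.sum(x**2, axis=-1)      # sphere, optimizer at the origin

for dist in (10.0, 0.1, 1e-3):           # distance of the mean to the optimizer
    mu = np.array([dist, 0.0])
    sigma = 0.1 * dist                   # step size scales with the distance
    X = mu + sigma * rng.standard_normal((1000, 2))
    fx = f(X)
    raw_spread = fx.max() - fx.min()     # shrinks roughly like dist^2
    utility = (fx - fx.mean()) / fx.std()
    print(dist, raw_spread, utility.std())   # utility.std() stays 1.0
```

This is the numerical counterpart of the observation that the standard deviation in the denominator keeps the utility signal large arbitrarily close to the optimizer.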

Truncation selection localizes the evaluation differently: in the case of *f* maximization, it accepts *f* values above the *local* quantile. That is, instead of using the expected value of the utility in Equation (86), one has to consider an expected value that can be given the simple interpretation of being proportional to the probability of *f* realizations that are greater than or equal to the local quantile. Thus, gradient ascent is aiming here at increasing the probability of generating above-quantile values. The quantile may be regarded as a threshold that changes with time. It gradually increases (in the case of *f* maximization) during the IGO flow. This again ensures, similar to the standard deviation in the denominator of Equation (86), that there is a large utility signal.

As we have seen, the choice of the utility function has a strong impact on the performance of the IGO system. This performance also depends on the class of fitness functions considered. For example, Arnold’s weight scheme yields an exponentially fast approach to the optimizer of the ellipsoid model. However, as Equation (118) shows, it performs rather slowly on linear functions. Truncation selection as an alternative option yields an exponential growth in Equation (151), provided that the truncation ratio is less than . Therefore, it is better suited for these linear functions. It should be clear that it is impossible to draw general conclusions regarding the usefulness of specific utility functions without fixing the objective function class to be optimized.

While all utility functions considered are *f* compliant, that is, they emphasize the selection of locally better *f* values, one can call this into question. As a matter of fact, *f* compliance can imply an emphasis on local search. Even seemingly well-posed arguments regarding the advantages of truncation selection ensuring the invariance under monotonic *f* transformations can be challenged when considering demands of robust optimization, as shown in Beyer and Sendhoff (2007).

The theoretical investigations in this paper concerned the performance of IGO on linear and quadratic models. From this analysis one cannot draw reliable conclusions regarding optimization on multimodal objective functions. Considering global optimization, the question arises whether utility functions that allow for non-*f*-compliance to a certain degree could be a means to improve global search. To this end, IGO ODEs of simple multimodal test functions should be considered in a future research program.

As has been shown in this paper, the use of localized function evaluations yields strong enough utility signals to counteract the information conserving property of the natural gradient. As a result, one can obtain exponentially fast convergence to the optimizer in the case of quadratic objective functions. Considering Equations (114) and (145), one sees that the exponential approach takes place with a time constant proportional to . The time constant does not depend on . This is in contrast to the naive expected value maximization of the dynamics given in Equation (14). Obviously, using the natural gradient approach makes the dynamics asymptotically independent of the shape of the ellipsoid model defined by . That is, using the Fisher metric, the IGO framework asymptotically transforms (for ) the ellipsoidal problem into a spherical one. While we have considered the choice of the metric and the utility function as independent design choices, it seems remarkable that this transformation result is independent of the three different utilities chosen. It raises the question of how sensitive this result is w.r.t. other utility functions (e.g., non-*f*-compliant versions).

Let us consider the principal limitations of the IGO ODE theory. Due to the infinite population size, it is hard—if not impossible—to get meaningful assertions w.r.t. the behavior of real NES implementations derived from IGO. Obviously, this is a concern especially for properties that are related to the population size, such as the learning rates. However, it is also a concern considering the parameterization used. While the IGO ODE theory is invariant w.r.t. parameter transformations due to the differential geometry imposed by the Fisher metric, in real NES implementations the choice of an appropriate parameterization of the distribution family seems to have considerable influence on the performance of the NES (see, e.g., Glasmachers et al., 2010). Furthermore, in advanced NES versions, the direction of the IGO flow of the parameters departs from the natural gradient in that different step-size factors (learning parameters) are assigned to the different parameter gradients. That is, the flow vector no longer points in the natural gradient direction. However, such deviations could be incorporated into the IGO ODE framework. In the simplest case, this would lead to different factors in Equations (93) and (135). Finding closed-form solutions to those ODEs might be a challenge for future research.

While the IGO ODE theory presented provides useful insight into the dynamic behavior of such systems, one should be cautious when transferring these infinite population size results to real NES implementations. As already pointed out, real NES implementations usually deviate considerably from the IGO ODE, such that the original ODE does not correctly describe the real ES. This concerns Monte Carlo gradient estimations, explicit methods of step-size control and evolution path cumulation (as used in CMA-ES; see Hansen et al., 2003), and different learning parameters. The current development of IGO theory cannot account for sampling aspects, path cumulations, and so on. Analyzing the convergence behavior of real NES poses the same problems as in the case of classical ES theory and needs approaches and techniques similar to those developed for the ES in Beyer (2001). With regard to NES, a first treatise has been provided by Schaul (2012).

Probably the most important benefit of the IGO philosophy lies in its inspiring power for deriving new EA variants on a scientifically grounded base. Even though final implementations most often deviate considerably from pure theory, the IGO philosophy can provide new starting points for research. One such starting point (proposed by one of the reviewers of this paper) could concern the combination of evolutionary methods, system dynamics, and time-series analysis that tracks the parameter evolution in order to make predictions for future states. Future work will show whether this idea leads to improved ES algorithms.

## Acknowledgments

This work was partially supported by the Austrian Science Fund FWF under grant P22649-N23. The author would like to thank the anonymous reviewers and Michael Hellwig for helpful discussions.

## References


## Notes

^{1} In this work, only real-valued optimization is considered, and therefore, probability distributions will be described by their probability density functions (pdfs) parameterized by sets of distribution parameters.

^{2} This has the additional advantage that generating the samples can be done directly by the transformation of isotropic standard normally distributed random vector components. That is, no matrix square root operations are needed.

^{3} Here, a calligraphic symbol has been used in order to distinguish this matrix from the unity matrix.

^{4} The term “natural gradient” is put into quotes throughout this paper because, in the author’s opinion, there is nothing “natural” here. As has been shown, the “natural gradient” yields a special ascent direction.

^{5} NES inherits its “natural” from the “natural gradient” ascent, thus being a rather special, that is, synthetic choice. A better characterizing term would be “synthetic ES” (SES) instead of NES.

^{6} The part of the parameter vector that corresponds to the mean is indexed by a single index, whereas the matrix-related part needs two indices.