Abstract

The convergence behaviors of so-called natural evolution strategies (NES) and of the information-geometric optimization (IGO) approach are considered. After a review of the NES/IGO ideas, which are based on information geometry, the implications of this philosophy w.r.t. optimization dynamics are investigated considering the optimization performance on the class of positive quadratic objective functions (the ellipsoid model). Exact differential equations describing the approach to the optimizer are derived and solved. It is rigorously shown that the original NES philosophy optimizing the expected value of the objective functions leads to very slow (i.e., sublinear) convergence toward the optimizer. This is the real reason why state of the art implementations of IGO algorithms optimize the expected value of transformed objective functions, for example, by utility functions based on ranking. It is shown that these utility functions are localized fitness functions that change during the IGO flow. The governing differential equations describing this flow are derived. In the case of convergence, the solutions to these equations exhibit an exponentially fast approach to the optimizer (i.e., linear convergence order). Furthermore, it is proven that the IGO philosophy leads to an adaptation of the covariance matrix that equals in the asymptotic limit—up to a scalar factor—the inverse of the Hessian of the objective function considered.

1  Introduction

For decades, engineering of evolution strategies (ES) as well as of other evolutionary algorithm (EA) related designs (including genetic algorithms, particle swarm optimization, and differential evolution, to name a few) was mainly based on biomimicry (also referred to as bionics). That is, principles gleaned from biology were translated into optimization algorithms. While there were attempts to put the design philosophy of ESs on some kind of “axiomatic” base by formulating design principles (see Beyer and Deb, 2001; Beyer, 2001; Hansen, 2006; Beyer, 2007), a new development by Wierstra et al. (2008) and Sun et al. (2009) called natural evolution strategies (NES) promised a more principled approach to the design of EAs. This new approach yielded algorithm implementations that were similar to the well-known covariance matrix adaptation (CMA) ES of Hansen et al. (2003) with rank-$\mu$ update (see Glasmachers et al., 2010). This striking similarity gave rise to attempts grounding CMA-ES on information geometry (Amari and Nagaoka, 2000) resulting in the work of Akimoto, Nagata, et al. (2012).

An alternative view on NES is to consider it as an algorithmic implementation of an information-geometric optimization (IGO) flow in the parameter space of the respective probability distribution family. That is, an IGO differential equation describing the time evolution of the distribution parameters is set up and solved numerically by discretizing the time (thus getting a generational algorithm, i.e., an EA) and estimating the expected values by Monte-Carlo sampling [the mutation and recombination process in ES; see Akimoto, Auger, et al. (2012)].

From a bird’s-eye view, most of the research done in this field follows directly or indirectly this line of approach with the aim to derive real EA implementations. On the other hand, the IGO flow differential equation can be used to derive theoretical assertions w.r.t. the convergence behavior and generally the time evolution of the IGO flow system. This line of research has been opened by the work of Glasmachers (2012) and Akimoto, Auger, et al. (2012). Considering isotropic mutations, they proved convergence to the optimizer of convex quadratic functions using different techniques that avoid the direct solution of the IGO equation system. The most advanced results have been provided by Akimoto (2012) who proved linear convergence order on functions with ellipsoidal level sets.

It should be clear that proving convergence of the IGO flow without calculating the real time evolution is but a first step. However, the direct solution of the IGO flow system yields the maximum information which allows for a deeper understanding of the evolutionary dynamics of IGO. This is the primary goal of this paper. It is devoted to the calculation of the exact time evolution of NES algorithms optimizing convex quadratic objective functions (the so-called ellipsoid model). The NES to be analyzed uses normally distributed mutations with a fully developed covariance matrix $\mathbf{C}$. Both cases are considered: the ordinary expected fitness value optimization and the case of rank-based utility optimization. It will be shown that the original NES philosophy, that is, the natural gradient ascent in the expected fitness landscape, leads to sublinear convergence. That is, the original NES idea results in convergent but slowly performing algorithms. Introducing a rank-based evaluation instead of the direct fitness changes the behavior drastically. This rank-based evaluation localizes the search in the sense that the globally defined objective function is replaced by a locally (in time) acting utility function. Such a utility can be obtained in different ways. For example, considering the optimal ES weighting of Arnold (2006), one gets a locally standardized fitness. Alternatively, $(\mu, \lambda)$-truncation selection has a similar effect. In both cases asymptotically exponentially fast convergence to the optimizer will be proven. These results provide the theoretical explanation of why recent NES implementations always rely on rank-based utility functions. Additionally, the analysis technique presented also allows for the calculation of the dynamics of the covariance matrix evolution. As will be shown, $\mathbf{C}$ approaches the inverse of the Hessian of the objective function up to a scalar factor.

The rest of the paper is organized as follows. The philosophy of NES and IGO will be introduced in a self-contained manner. However, this introduction will not consider implementation aspects since these are not needed in this paper. In Section 2 it will be shown—not solely for didactical reasons—that ordinary gradient ascent in expected fitness landscapes may cause problems in that different parameterizations may yield qualitatively different (and sometimes undesirable) behaviors. Thereafter, the idea of information distance constrained gradient ascent is introduced, building the philosophical basis of NES. The resulting IGO flow will be analyzed in Section 3 by solving the differential equation system yielding a slowly converging evolution dynamics. In Section 4 the idea of localized fitness evaluation is added to the NES framework and analyzed for optimal weighting and truncation selection assuming normally distributed fitness values. Finally, in Section 5 conclusions are drawn.

2  NES/IGO in a Nutshell

Given a function $f$ to be optimized, there are numerous options to perform a gradual approach to the optimizer $\hat{\mathbf{y}}$. One way—as pursued in evolutionary algorithms—consists of randomly sampling values from a family of probability distributions the density of which may be given by $p(\mathbf{y} \,|\, \boldsymbol{\theta})$.1 Here, the set of distribution parameters $\boldsymbol{\theta}$ evolves over time t (the generation counter) with the goal to change the distribution in such a manner that with increasing t the distribution is more and more concentrated about the optimizer $\hat{\mathbf{y}}$. The manner in which the parameter set $\boldsymbol{\theta}$ is changed over the generations t characterizes the different EA classes, such as estimation of distribution algorithms (EDAs) and ESs. Alternatively, $\boldsymbol{\theta}$ can even be implicitly represented by the population of candidate solutions as is usually done in genetic algorithms (GAs). However, in this work, we will focus on EAs where the distribution parameters are explicitly represented. In the next section, we will start with a naive approach by minimizing the expected value of f. It will be shown that this approach does not always lead to a stably performing EA. Therefore, in an attempt to get a theoretically principled approach based on expected value maximization, the class of so-called natural evolution strategies (NESs) will be considered (Wierstra et al., 2008). The underlying ideas will be presented and investigated in subsequent sections.

2.1  Maximizing the Expected Value of f

While each real implementation of EAs has to produce a population of random samples $\mathbf{y}_l$ and has to evaluate these samples w.r.t. fitness, that is, $f(\mathbf{y}_l)$, the information obtained from these samples can be used in a different manner. From the EA modeling perspective, one possible objective is to consider the expected value
$$J(\boldsymbol{\theta}) := \mathrm{E}[f(\mathbf{y}) \,|\, \boldsymbol{\theta}] = \int f(\mathbf{y})\, p(\mathbf{y} \,|\, \boldsymbol{\theta})\, \mathrm{d}\mathbf{y}$$
1
as the target quantity to be optimized. That is, in the case of f-maximization, this leads to a transformed optimization problem
$$\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}}\, J(\boldsymbol{\theta}).$$
2
This approach is the starting point of the NESs by Wierstra et al. (2008). In order to have a meaningful optimization problem, as in Equation (2), one has to ensure that the parameter set $\boldsymbol{\theta}$ allows for the expression of the estimate of the optimizer of $f$ with arbitrary precision. This can be achieved if the expected value vector $\mathbf{m} = \mathrm{E}[\mathbf{y}]$ is itself part of the parameter set $\boldsymbol{\theta}$. For example, this is attained by using multivariate Gaussian normal distributions with mean $\mathbf{m}$ and the symmetric covariance matrix $\mathbf{C}$, that is,
$$p(\mathbf{y} \,|\, \boldsymbol{\theta}) = \frac{1}{(2\pi)^{N/2}\sqrt{\det \mathbf{C}}}\, \exp\!\left(-\tfrac{1}{2}(\mathbf{y}-\mathbf{m})^{\mathsf{T}}\mathbf{C}^{-1}(\mathbf{y}-\mathbf{m})\right), \qquad \boldsymbol{\theta} = (\mathbf{m}, \mathbf{C}).$$
3
The idea behind such a parameterization is that $\mathbf{m}$ evolves toward the optimizer $\hat{\mathbf{y}}$ and the covariance matrix $\mathbf{C}$ shrinks in such a manner that the distribution of the samples gets more and more concentrated about the optimizer $\hat{\mathbf{y}}$.
An implementation of the idea of maximizing the expected value function according to Equation (2) starts by realizing a gradient ascent in the parameter space of $\boldsymbol{\theta}$. To this end, the gradient of $J(\boldsymbol{\theta})$ is needed, symbolized by the nabla operator $\nabla_{\boldsymbol{\theta}}$. In real-world cases, however, the expected value in Equation (1) cannot be calculated analytically. Therefore, a Monte Carlo sampling technique is applied in order to estimate the gradient of Equation (1), using $\nabla_{\boldsymbol{\theta}}\, p = p\, \nabla_{\boldsymbol{\theta}} \ln p$; see Sun et al. (2009):
$$\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) \approx \frac{1}{\lambda} \sum_{l=1}^{\lambda} f(\mathbf{y}_l)\, \nabla_{\boldsymbol{\theta}} \ln p(\mathbf{y}_l \,|\, \boldsymbol{\theta}), \qquad \mathbf{y}_l \sim p(\mathbf{y} \,|\, \boldsymbol{\theta}),$$
4
and $\lambda$ is the number of samples taken. $\lambda$ is also referred to as the offspring population size in ESs. The gradient estimate in Equation (4) is then used to perform a hill climbing step in the $\boldsymbol{\theta}$-space, thus forming an iterative update formula
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + \beta\, \nabla_{\boldsymbol{\theta}} J\!\left(\boldsymbol{\theta}^{(t)}\right),$$
5
where $\beta > 0$ is a step-size factor. While in practice, one is interested in a $\beta$ that provides a fast approach to the optimizer, one can also consider the limit case $\beta \to 0$. Subtracting $\boldsymbol{\theta}^{(t)}$ on both sides of Equation (5), dividing by $\beta$, and taking the limit $\beta \to 0$, the t-discrete values become a continuous time t function and one obtains
$$\frac{\mathrm{d}\boldsymbol{\theta}}{\mathrm{d}t} = \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}).$$
6
This is an ordinary differential equation (ODE) system that can serve as a model description of ESs. A modified version of Equation (6) has already been used in the so-called information-geometric optimization (IGO) framework in order to investigate the convergence behavior of infinite population size models of ESs; see Akimoto, Auger, et al. (2012), Glasmachers (2012), and the remainder of this paper.
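The steps above can be made concrete with a small numerical sketch (illustrative Python, not from the paper): the gradient of Equation (1) w.r.t. the mean is estimated by the log-likelihood trick of Equation (4) and iterated as in Equation (5). For simplicity only the mean is adapted; the covariance is kept fixed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Quadratic test function f(y) = b^T y - 0.5 y^T Q y with maximizer Q^{-1} b
Q = np.diag([1.0, 4.0])
b = np.array([1.0, 2.0])
f = lambda y: b @ y - 0.5 * y @ Q @ y

def mc_mean_gradient(m, C, lam=1000):
    """Monte Carlo estimate of grad_m E[f], cf. Eq. (4):
    (1/lam) sum_l f(y_l) grad_m ln p(y_l), with grad_m ln p = C^{-1}(y - m)."""
    C_inv = np.linalg.inv(C)
    ys = rng.multivariate_normal(m, C, size=lam)
    return np.mean([f(y) * (C_inv @ (y - m)) for y in ys], axis=0)

m = np.zeros(2)
C = np.eye(2)          # covariance kept fixed in this simplified sketch
beta = 0.05            # step-size factor of Eq. (5)
for _ in range(2000):  # Euler iteration approximating the ODE (6)
    m = m + beta * mc_mean_gradient(m, C)

print(m)               # close to the optimizer Q^{-1} b = [1.0, 0.5]
```

Note that the exact gradient here is $\mathbf{b} - \mathbf{Q}\mathbf{m}$; the Monte Carlo estimate merely adds sampling noise around this direction.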
Before considering more advanced versions of Equations (1) and (6), being the basis for mature versions of NES and IGO algorithms, a motivation for those will be given here. Equation (6) describes the continuous time evolution of the $\boldsymbol{\theta}$ trajectories. Provided that f is given in simple form, one can calculate this time evolution. To this end, the maximization of the general ellipsoid model
$$f(\mathbf{y}) = \mathbf{b}^{\mathsf{T}}\mathbf{y} - \tfrac{1}{2}\, \mathbf{y}^{\mathsf{T}}\mathbf{Q}\mathbf{y}$$
7
is considered with the symmetric positive definite matrix $\mathbf{Q}$ and an arbitrary (constant) vector $\mathbf{b}$. Note that the fitness model in Equation (7) can be regarded as a local approximation of more complex objective functions $f$. In that case, $\mathbf{Q}$ is simply the (negative) Hessian of f, that is, $\mathbf{Q} = -\nabla^2 f(\mathbf{y})$.
Using Gaussian random vectors $\mathbf{y} \sim \mathcal{N}(\mathbf{m}, \mathbf{C})$, the expected value of Equation (1) can be easily calculated using results from Beyer (2001, p. 122):
$$J(\mathbf{m}, \mathbf{C}) = \mathrm{E}[f(\mathbf{y}) \,|\, \mathbf{m}, \mathbf{C}] = \mathbf{b}^{\mathsf{T}}\mathbf{m} - \tfrac{1}{2}\,\mathbf{m}^{\mathsf{T}}\mathbf{Q}\mathbf{m} - \tfrac{1}{2}\operatorname{Tr}[\mathbf{Q}\mathbf{C}].$$
8
Here, $\operatorname{Tr}[\mathbf{Q}\mathbf{C}]$ represents the trace of the matrix product $\mathbf{Q}\mathbf{C}$. Calculating the gradient of Equation (8) yields
$$\nabla_{\mathbf{m}} J = \mathbf{b} - \mathbf{Q}\mathbf{m}, \qquad \nabla_{\mathbf{C}} J = -\tfrac{1}{2}\,\mathbf{Q}.$$
9
Inserting this into Equation (6), one obtains
$$\frac{\mathrm{d}\mathbf{m}}{\mathrm{d}t} = \mathbf{b} - \mathbf{Q}\mathbf{m},$$
10a
$$\frac{\mathrm{d}\mathbf{C}}{\mathrm{d}t} = -\tfrac{1}{2}\,\mathbf{Q}.$$
10b
Equation (10a) does not depend on $\mathbf{C}$. A particular solution to the inhomogeneous equation is given by the stationary state condition $\mathrm{d}\mathbf{m}/\mathrm{d}t = \mathbf{0}$ leading to $\mathbf{Q}\mathbf{m} = \mathbf{b}$. That is, the particular solution is just the optimizer of Equation (7):
$$\hat{\mathbf{y}} = \mathbf{Q}^{-1}\mathbf{b}.$$
11
Consider the deviation measuring the distance of $\mathbf{m}$ to the optimizer $\hat{\mathbf{y}}$,
$$\mathbf{z}(t) := \mathbf{m}(t) - \hat{\mathbf{y}}.$$
12
Equation (10a) can be rewritten using Equation (11),
$$\frac{\mathrm{d}\mathbf{z}}{\mathrm{d}t} = -\mathbf{Q}\,\mathbf{z}.$$
13
As one can easily check by insertion, its solution is given by
$$\mathbf{z}(t) = \mathrm{e}^{-\mathbf{Q}t}\,\mathbf{z}(0).$$
14
Thus, one obtains
$$\mathbf{m}(t) = \hat{\mathbf{y}} + \mathrm{e}^{-\mathbf{Q}t}\left(\mathbf{m}(0) - \hat{\mathbf{y}}\right),$$
15
that is, the model equation approaches the optimizer exponentially quickly (linear convergence order).
The solution of Equation (10b) is very simple, as one can check by insertion:
$$\mathbf{C}(t) = \mathbf{C}(0) - \tfrac{t}{2}\,\mathbf{Q}.$$
16
However, this result indicates a problem, since there is a $t_0$ above which the covariance matrix $\mathbf{C}(t)$ loses positive definiteness. That is, implementing an ES on the basis of the parameterization $\boldsymbol{\theta} = (\mathbf{m}, \mathbf{C})$ might cause convergence problems due to inappropriate covariance matrix evolution.
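A tiny numeric illustration (assuming the gradient convention $\partial \operatorname{Tr}[\mathbf{Q}\mathbf{C}]/\partial \mathbf{C} = \mathbf{Q}$; with other normalization conventions the constants may differ) shows how a linearly shrinking covariance loses positive definiteness at a finite $t_0$:

```python
import numpy as np

Q = np.diag([1.0, 4.0])   # positive definite
C0 = np.eye(2)

def C(t):
    # Plain-gradient covariance flow: C(t) = C(0) - (t/2) Q, cf. Eq. (16)
    return C0 - 0.5 * t * Q

# Smallest eigenvalue of C(t); here it crosses zero at t0 = 2 / q_max = 0.5
for t in (0.0, 0.4, 0.6):
    print(t, np.linalg.eigvalsh(C(t))[0])
```

Beyond $t_0$ the "covariance" is no longer a valid covariance matrix, so sampling from the corresponding normal distribution becomes impossible.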
This covariance matrix adaptation problem can be avoided by using another ad hoc parameterization. Using a decomposition $\mathbf{C} = \mathbf{A}\mathbf{A}$ with a symmetric matrix $\mathbf{A}$ resolves the problem. That is, one uses $\boldsymbol{\theta} = (\mathbf{m}, \mathbf{A})$ as parameterization of the Gaussian distribution family.2 This changes $\operatorname{Tr}[\mathbf{Q}\mathbf{C}]$ in Equation (8) to $\operatorname{Tr}[\mathbf{Q}\mathbf{A}\mathbf{A}]$ and the derivative w.r.t. A becomes $\nabla_{\mathbf{A}} J = -\tfrac{1}{2}(\mathbf{Q}\mathbf{A} + \mathbf{A}\mathbf{Q})$. As a result, Equation (10b) changes to
$$\frac{\mathrm{d}\mathbf{A}}{\mathrm{d}t} = -\tfrac{1}{2}\left(\mathbf{Q}\mathbf{A} + \mathbf{A}\mathbf{Q}\right).$$
17
This equation can be solved using the Ansatz $\mathbf{A}(t) = \mathrm{e}^{\mathbf{M}t}\,\mathbf{A}(0)\,\mathrm{e}^{\mathbf{M}t}$ yielding
$$\mathbf{A}(t) = \mathrm{e}^{-\frac{t}{2}\mathbf{Q}}\,\mathbf{A}(0)\,\mathrm{e}^{-\frac{t}{2}\mathbf{Q}},$$
18
as can be easily checked by inserting Equation (18) into Equation (17). As one can see, $\mathbf{A}(t)$ and therefore $\mathbf{C}(t) = \mathbf{A}(t)\mathbf{A}(t)$ remain positive definite for all $t \geq 0$ because $\mathbf{A}(0)$ is positive definite and Equation (18) is a congruence transformation of $\mathbf{A}(0)$. The result in Equation (18) is in contrast to Equation (16). Using the $(\mathbf{m}, \mathbf{A})$ parameterization avoids the problem of evolving a nonpositive definite covariance matrix. Let us have a closer look at the $\mathbf{A}$-dynamics imposed by Equation (18). To this end, $\mathbf{A}$ is projected into the eigen system of $\mathbf{Q}$. Consider the eigenvalue problem
$$\mathbf{Q}\,\mathbf{v}_k = q_k\,\mathbf{v}_k, \qquad 0 < q_1 \leq q_2 \leq \cdots \leq q_N.$$
19
Note, a nondecreasing ordering of the eigenvalues has been chosen in Equation (19), that is, $q_1$ is the smallest eigenvalue. Using the eigenvectors $\mathbf{v}_k$ ($k = 1, \ldots, N$) as the basis, Equation (18) becomes
$$\mathbf{v}_k^{\mathsf{T}}\mathbf{A}(t)\,\mathbf{v}_l = \mathrm{e}^{-\frac{q_k + q_l}{2}t}\;\mathbf{v}_k^{\mathsf{T}}\mathbf{A}(0)\,\mathbf{v}_l.$$
20
That is, $\mathbf{A}$ and therefore $\mathbf{C}$ shrink exponentially quickly. However, this shrinking appears at different time constants $1/q_k$. In the asymptotic limit $t \to \infty$, the dynamics are dominated by the slowest decaying mode, that is, the $k = l = 1$ mode. Therefore
$$\mathbf{A}(t) \simeq \mathrm{e}^{-q_1 t}\left(\mathbf{v}_1^{\mathsf{T}}\mathbf{A}(0)\,\mathbf{v}_1\right)\mathbf{v}_1\mathbf{v}_1^{\mathsf{T}}.$$
21
Transforming back, the covariance matrix becomes
$$\mathbf{C}(t) = \mathbf{A}(t)\,\mathbf{A}(t) \simeq \mathrm{e}^{-2 q_1 t}\left(\mathbf{v}_1^{\mathsf{T}}\mathbf{A}(0)\,\mathbf{v}_1\right)^2\mathbf{v}_1\mathbf{v}_1^{\mathsf{T}}.$$
22
This means that samples are predominantly produced in the direction of the largest principal axis $\mathbf{v}_1$ while the other directions related to the $q_k$ with $k > 1$ are overproportionally dampened. In other words, given an ellipsoid success domain (defined by Equation (7)), search is predominantly conducted in the major axis direction. That is, for large t, the search degenerates into a one-dimensional subspace of the $\mathbb{R}^N$. Similarly, if the ground state is degenerated with a multiplicity of m, $q_1 = \cdots = q_m < q_{m+1}$, then there are m orthogonal directions resulting in a degenerated search in the corresponding m-dimensional subspace of $\mathbb{R}^N$.
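The collapse onto the major axis can be observed numerically. The sketch below (illustrative code, assuming the flow $\mathbf{A}(t) = \mathrm{e}^{-t\mathbf{Q}/2}\mathbf{A}(0)\,\mathrm{e}^{-t\mathbf{Q}/2}$ derived above) tracks the condition number of $\mathbf{C}(t)$ and the alignment of its dominant eigenvector with $\mathbf{v}_1$:

```python
import numpy as np

# Rotated ellipsoid: Q = R diag(1, 5) R^T, so v1 = R[:, 0] (smallest eigenvalue q1 = 1)
th = 0.3
R = np.array([[np.cos(th), -np.sin(th)], [np.sin(th), np.cos(th)]])
Q = R @ np.diag([1.0, 5.0]) @ R.T
A0 = np.eye(2)

def expm_sym(M):
    """Matrix exponential of a symmetric matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.exp(w)) @ V.T

def C(t):
    # A(t) = e^{-tQ/2} A(0) e^{-tQ/2};  C = A A
    E = expm_sym(-0.5 * t * Q)
    A = E @ A0 @ E
    return A @ A

for t in (0.0, 2.0, 6.0):
    w, V = np.linalg.eigh(C(t))
    align = abs(V[:, -1] @ R[:, 0])   # |cos| of angle between dominant axis and v1
    print(t, w[-1] / w[0], align)     # condition number explodes, alignment tends to 1
```

For growing $t$ the eigenvalue ratio grows like $\mathrm{e}^{2(q_N - q_1)t}$, that is, the sampling distribution flattens onto the major axis.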

2.2  How to Get Unbiased

Degeneration of the search distribution is undesirable. Instead of Equation (22), it would be desirable to get an asymptotic behavior like this
$$\mathbf{C}(t) \propto c(t)\,\mathbf{Q}^{-1}, \qquad c(t) \to 0.$$
23
This would ensure that the sample points are distributed according to the shape of the ellipsoid described by $\mathbf{Q}$ and the covariance matrix could shrink in an even manner. How can such a behavior be ensured by first principles?
It is quite clear that such a behavior can only be obtained by constraining the evolution of the search distribution $p(\mathbf{y} \,|\, \boldsymbol{\theta})$. The goal is to change $\boldsymbol{\theta}$ from t to $t + 1$ only minimally while being maximally unbiased. That is, the information gained in this step should be rather small. A quantity measuring this change is reminiscent of a relative entropy loss or information gain I considering the ratio $p(\mathbf{y} \,|\, \boldsymbol{\theta}^{(t+1)}) / p(\mathbf{y} \,|\, \boldsymbol{\theta}^{(t)})$ (see Rényi, 1961):
$$I := \mathrm{E}\!\left[\ln \frac{p(\mathbf{y} \,|\, \boldsymbol{\theta}^{(t+1)})}{p(\mathbf{y} \,|\, \boldsymbol{\theta}^{(t)})} \,\middle|\, \boldsymbol{\theta}^{(t+1)}\right].$$
24
The logarithm with base e has been used instead of the usual base 2, resulting in an additional factor of $1/\ln 2$ which is, however, without any relevance for the considerations here. Equation (24) evaluates to
$$I = \int p(\mathbf{y} \,|\, \boldsymbol{\theta}^{(t+1)})\, \ln \frac{p(\mathbf{y} \,|\, \boldsymbol{\theta}^{(t+1)})}{p(\mathbf{y} \,|\, \boldsymbol{\theta}^{(t)})}\, \mathrm{d}\mathbf{y},$$
25
where the resulting integral can be also interpreted as the Kullback-Leibler divergence $D_{\mathrm{KL}}\!\left(p_{\boldsymbol{\theta}^{(t+1)}} \,\|\, p_{\boldsymbol{\theta}^{(t)}}\right)$. While it is well-known that the information gain defined in Equation (25) is not a distance measure with the meaning of a metric (see, e.g., Eguchi and Copas, 2006), its infinitesimal version is. Interpreting the parameter changes $\delta\theta_i := \theta_i^{(t+1)} - \theta_i^{(t)}$ as small quantities, the Taylor expansion of I up to the second order in the $\delta\theta_i$ components yields after a simple calculation
$$I \approx \frac{1}{2} \sum_{i} \sum_{j} I_{ij}\, \delta\theta_i\, \delta\theta_j,$$
26
where
$$I_{ij} := \mathrm{E}\!\left[\frac{\partial \ln p(\mathbf{y} \,|\, \boldsymbol{\theta})}{\partial \theta_i}\; \frac{\partial \ln p(\mathbf{y} \,|\, \boldsymbol{\theta})}{\partial \theta_j} \,\middle|\, \boldsymbol{\theta}\right]$$
27
are the components of the Fisher information matrix (see Lehmann and Casella, 1998).3

While for noninfinitesimal deviations it generally holds that $D_{\mathrm{KL}}(p \,\|\, q) \neq D_{\mathrm{KL}}(q \,\|\, p)$, one can easily show that symmetry holds for infinitesimal deviations. That is, the rhs of Equation (26) can be considered as a differential length element on the statistical manifold induced by $p(\mathbf{y} \,|\, \boldsymbol{\theta})$. This is the starting point of information geometry by noting that $I_{ij}$ is just the metric tensor of the statistical manifold (see Amari and Nagaoka, 2000). However, for the considerations in this paper, the apparatus of Riemannian geometry is not really needed. All that is needed is the distance formula in Equation (26).
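For a one-dimensional normal distribution with parameters $(m, \sigma)$ the Fisher matrix is $\operatorname{diag}(1/\sigma^2,\, 2/\sigma^2)$, so the quadratic form of Equation (26) can be checked against the closed-form KL divergence (a small verification sketch, not from the paper):

```python
import math

def kl_gauss(m1, s1, m0, s0):
    """KL( N(m1, s1^2) || N(m0, s0^2) ), closed form."""
    return math.log(s0 / s1) + (s1**2 + (m1 - m0)**2) / (2 * s0**2) - 0.5

m, s = 1.0, 2.0
dm, ds = 1e-3, -2e-3                 # small parameter perturbation

kl = kl_gauss(m + dm, s + ds, m, s)
# Quadratic approximation of Eq. (26) with Fisher matrix diag(1/s^2, 2/s^2):
quad = 0.5 * (dm**2 / s**2 + 2 * ds**2 / s**2)
print(kl, quad)                      # agree to leading order
```

Swapping the roles of the two distributions changes the KL value only in third and higher orders of the perturbation, which is the infinitesimal symmetry mentioned above.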

As we have seen at the end of Section 2.1, a gradient ascent in the $\boldsymbol{\theta}$-parameter space on the expected value landscape can result in a degeneration of the search distribution with the result that the search gets finally restricted to subspaces of the $\mathbb{R}^N$. Such a collapse might be avoided by constraining the search step $\delta\boldsymbol{\theta}$ in the parameter space such that the information gain I is kept at a (given) small level $\varepsilon$, that is, $I = \varepsilon$, while searching for the $\delta\boldsymbol{\theta}$ that provides the largest increase of $J(\boldsymbol{\theta})$. Using Taylor expansion for $J(\boldsymbol{\theta} + \delta\boldsymbol{\theta})$ this yields the constrained maximization problem
$$\delta\boldsymbol{\theta}^* = \arg\max_{\delta\boldsymbol{\theta}}\; \delta\boldsymbol{\theta}^{\mathsf{T}}\, \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$$
28a
$$\text{subject to} \quad \tfrac{1}{2}\, \delta\boldsymbol{\theta}^{\mathsf{T}}\, \mathbf{I}\, \delta\boldsymbol{\theta} = \varepsilon$$
28b
that can be solved by using Lagrange’s method. Using Equations (26) and (28), neglecting higher order terms, one obtains the Lagrange function
$$L(\delta\boldsymbol{\theta}, \kappa) = \delta\boldsymbol{\theta}^{\mathsf{T}}\, \nabla_{\boldsymbol{\theta}} J + \kappa\left(\varepsilon - \tfrac{1}{2}\, \delta\boldsymbol{\theta}^{\mathsf{T}}\, \mathbf{I}\, \delta\boldsymbol{\theta}\right).$$
29
Taking the derivatives w.r.t. $\delta\boldsymbol{\theta}$ and $\kappa$, one gets
$$\frac{\partial L}{\partial\, \delta\boldsymbol{\theta}} = \nabla_{\boldsymbol{\theta}} J - \kappa\, \mathbf{I}\, \delta\boldsymbol{\theta},$$
30a
$$\frac{\partial L}{\partial \kappa} = \varepsilon - \frac{1}{2}\, \delta\boldsymbol{\theta}^{\mathsf{T}}\, \mathbf{I}\, \delta\boldsymbol{\theta}.$$
30b
Setting Equation (30a) to zero in order to find the stationary point, one obtains in matrix notation $\nabla_{\boldsymbol{\theta}} J = \kappa\, \mathbf{I}\, \delta\boldsymbol{\theta}$. Solving for $\delta\boldsymbol{\theta}$ yields
$$\delta\boldsymbol{\theta} = \frac{1}{\kappa}\, \mathbf{I}^{-1}\, \nabla_{\boldsymbol{\theta}} J,$$
31
where $\mathbf{I}^{-1}$ is the inverse of the Fisher matrix in Equation (27). Setting Equation (30b) to zero and inserting Equation (31), one gets in matrix notation (note, $\mathbf{I}^{\mathsf{T}} = \mathbf{I}$)
$$\varepsilon = \frac{1}{2\kappa^2}\, \left(\nabla_{\boldsymbol{\theta}} J\right)^{\mathsf{T}}\, \mathbf{I}^{-1}\, \nabla_{\boldsymbol{\theta}} J.$$
32
Solving for $\kappa$ and reinserting the result into Equation (31) yields
$$\delta\boldsymbol{\theta} = \sqrt{\frac{2\varepsilon}{\left(\nabla_{\boldsymbol{\theta}} J\right)^{\mathsf{T}}\, \mathbf{I}^{-1}\, \nabla_{\boldsymbol{\theta}} J}}\;\; \mathbf{I}^{-1}\, \nabla_{\boldsymbol{\theta}} J.$$
33
Noting that the square root in Equation (33) is an infinitesimal quantity as $\varepsilon \to 0$, one can interpret this as an infinitesimal time change from t to $t + \delta t$ that causes a change from $\boldsymbol{\theta}$ to $\boldsymbol{\theta} + \delta\boldsymbol{\theta}$, leading to $\delta\boldsymbol{\theta} \propto \delta t$. Therefore, dividing by $\delta t$, the rhs of Equation (33) can be interpreted as the time derivative of $\boldsymbol{\theta}$. Thus, one gets (after an appropriate rescaling of time) the ordinary differential equation (ODE) system,
$$\frac{\mathrm{d}\boldsymbol{\theta}}{\mathrm{d}t} = \mathbf{I}^{-1}(\boldsymbol{\theta})\, \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}).$$
34
This ODE system is different from Equation (6) in that the gradient direction given by $\nabla_{\boldsymbol{\theta}} J$ is transformed by the inverse of the Fisher information matrix. According to Amari (1998), the rhs of Equation (34), $\tilde{\nabla}_{\boldsymbol{\theta}} J := \mathbf{I}^{-1}\, \nabla_{\boldsymbol{\theta}} J$, is called "natural gradient"4 in contrast to the ordinary gradient $\nabla_{\boldsymbol{\theta}} J$.

The idea of natural gradient ascent in the expected value landscape is the key idea of the NESs5 (see, e.g., Wierstra et al., 2008). The differential Equation (34) is also referred to as an IGO differential equation in Akimoto, Nagata, et al. (2012). The information flow generated by this differential equation will be the subject of investigation in the next section. A very astonishing result will be derived for the ellipsoid model in Equation (7) showing that the IGO flow in Equation (34) results in a sublinear convergence order.

3  On the Dynamics of IGO without Utility Functions

According to Amari (1998, p. 251),

“the ordinary gradient does not give the steepest direction of a target function; rather, the steepest direction is given by the natural (or contravariant) gradient.”

This assertion was made under the premise of a Riemannian metric structure (Amari, 1998, p. 251). From the analysis presented so far in this paper, this assertion seems hard to justify under the premise of expected value maximization of Equation (1). While the family of Gaussian distributions gives rise to a nonflat metric, it remains an open question how quickly Equation (34) approaches the steady state and the optimizer $\hat{\mathbf{y}}$. To shed some light on this matter, we will investigate the dynamics of the IGO flow on the general quadratic function f given by Equation (7) using Gaussian samples parameterized by $\boldsymbol{\theta} = (\mathbf{m}, \mathbf{C})$, that is, $\mathbf{y} \sim \mathcal{N}(\mathbf{m}, \mathbf{C})$. To this end, the inverse Fisher information matrix of the Gaussian distribution must be calculated in order to get the concrete form of Equation (34).

3.1  Derivation of the IGO Differential Equation

As a first step, the Fisher information matrix $\mathbf{I}$ must be calculated. This calculation starting from Equation (27) is simple, but somewhat lengthy. Therefore, we abstain from presenting it here and use the result of Kay (1993, p. 47):
$$I_{ab} = \frac{\partial \mathbf{m}^{\mathsf{T}}}{\partial \theta_a}\, \mathbf{C}^{-1}\, \frac{\partial \mathbf{m}}{\partial \theta_b} + \frac{1}{2} \operatorname{Tr}\!\left[\mathbf{C}^{-1}\, \frac{\partial \mathbf{C}}{\partial \theta_a}\, \mathbf{C}^{-1}\, \frac{\partial \mathbf{C}}{\partial \theta_b}\right].$$
35
Here, $a$ and $b$ are (multi-) indices corresponding to the components of the vector $\mathbf{m}$ and the covariance matrix $\mathbf{C}$, respectively.6 Since $\mathbf{m}$ does not depend on $\mathbf{C}$ and vice versa, the resulting Fisher information in Equation (35) subdivides into a part acting on $\mathbf{m}$ and another on $\mathbf{C}$. That is, the cross terms vanish. As for the components w.r.t. $\mathbf{m}$, one immediately gets from Equation (35),
$$I^{(\mathbf{m})}_{ij} = \left(\mathbf{C}^{-1}\right)_{ij}.$$
36
Treating the $\mathbf{C}$-related part in Equation (35) yields
$$I^{(\mathbf{C})}_{(ij)(kl)} = \frac{1}{2} \operatorname{Tr}\!\left[\mathbf{C}^{-1}\, \frac{\partial \mathbf{C}}{\partial C_{ij}}\, \mathbf{C}^{-1}\, \frac{\partial \mathbf{C}}{\partial C_{kl}}\right].$$
37
Thus, one gets for the $\mathbf{C}$-related part of $\mathbf{I}$,
$$I^{(\mathbf{C})}_{(ij)(kl)} = \frac{1}{2}\left[\left(\mathbf{C}^{-1}\right)_{ik}\left(\mathbf{C}^{-1}\right)_{jl} + \left(\mathbf{C}^{-1}\right)_{il}\left(\mathbf{C}^{-1}\right)_{jk}\right].$$
38
Due to the block structure of $\mathbf{I}$, calculating the inverse of $\mathbf{I}$ reduces to the calculation of the inverse of Equations (36) and (38) separately. Considering Equation (36), one immediately concludes that
$$\left(\mathbf{I}^{-1}\right)^{(\mathbf{m})}_{ij} = C_{ij}.$$
39
The correctness of
$$\left(\mathbf{I}^{-1}\right)^{(\mathbf{C})}_{(ij)(kl)} = C_{ik}\, C_{jl} + C_{il}\, C_{jk}$$
40
is proven directly by calculating
$$\sum_{k,l} I^{(\mathbf{C})}_{(ij)(kl)} \left(\mathbf{I}^{-1}\right)^{(\mathbf{C})}_{(kl)(mn)} = \delta_{im}\,\delta_{jn} + \delta_{in}\,\delta_{jm},$$
which is the symmetrized identity on the space of symmetric matrices.
Now, the IGO differential equation can be directly obtained from Equation (34) using Equations (39) and (40). It reads
$$\frac{\mathrm{d}\mathbf{m}}{\mathrm{d}t} = \mathbf{C}\, \nabla_{\mathbf{m}} J, \qquad \frac{\mathrm{d}\mathbf{C}}{\mathrm{d}t} = 2\, \mathbf{C}\left(\nabla_{\mathbf{C}} J\right)\mathbf{C}.$$
41
While the $\mathbf{m}$-dynamics were directly obtained as a matrix vector product, the $\mathbf{C}$-dynamics needed an intermediate step:
$$\sum_{k,l}\left(C_{ik}\, C_{jl} + C_{il}\, C_{jk}\right)\left(\nabla_{\mathbf{C}} J\right)_{kl} = 2\left(\mathbf{C}\left(\nabla_{\mathbf{C}} J\right)\mathbf{C}\right)_{ij}.$$
Using the gradients already calculated in Equation (9), one finally obtains the nonlinear IGO differential equation system
$$\frac{\mathrm{d}\mathbf{m}}{\mathrm{d}t} = \mathbf{C}\left(\mathbf{b} - \mathbf{Q}\mathbf{m}\right),$$
42a
$$\frac{\mathrm{d}\mathbf{C}}{\mathrm{d}t} = -\mathbf{C}\,\mathbf{Q}\,\mathbf{C}.$$
42b
In the next section, a solution to the system in Equation (42) will be derived.
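The mean-related Fisher block of Equations (36) and (39) can be verified by Monte Carlo, estimating $\mathrm{E}\!\left[\nabla_{\mathbf{m}} \ln p\, (\nabla_{\mathbf{m}} \ln p)^{\mathsf{T}}\right]$ directly (an illustrative sketch, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
m = np.array([0.3, -0.7])
C = np.array([[2.0, 0.5], [0.5, 1.0]])
C_inv = np.linalg.inv(C)

# Score w.r.t. the mean: grad_m ln p(y | m, C) = C^{-1} (y - m)
ys = rng.multivariate_normal(m, C, size=200_000)
scores = (ys - m) @ C_inv              # row l holds (C^{-1}(y_l - m))^T
I_mc = scores.T @ scores / len(ys)     # E[score score^T], cf. Eq. (27)

print(np.max(np.abs(I_mc - C_inv)))    # small: the Fisher block equals C^{-1}
```

Inverting this block gives $\mathbf{C}$ itself, which is exactly the transformation appearing in the $\mathbf{m}$-part of Equation (41).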

3.2  On the Time Evolution of the IGO System

A closer look at Equation (42) reveals that Equation (42b) is independent of Equation (42a). That is, in a first step Equation (42b) must be solved.

Theorem 1 (IGO C-Dynamics): 
The nonlinear ODE system $\frac{\mathrm{d}\mathbf{C}}{\mathrm{d}t} = -\mathbf{C}\,\mathbf{Q}\,\mathbf{C}$ with the symmetric matrices $\mathbf{Q}$ and $\mathbf{C}(0)$, both invertible, and the initial condition $\mathbf{C}(t = 0) = \mathbf{C}(0)$ has the solution
$$\mathbf{C}(t) = \left[\mathbf{C}(0)^{-1} + t\,\mathbf{Q}\right]^{-1}.$$
43
Proof: 
Let $\mathbf{B}$ be an invertible matrix. Let $\mathbf{C} = \mathbf{B}^{-1}$ and substitute this expression in Equation (42b), then multiply from the left with $\mathbf{B}$; this yields the ODE
$$\mathbf{B}\, \frac{\mathrm{d}\mathbf{B}^{-1}}{\mathrm{d}t} = -\mathbf{Q}\,\mathbf{B}^{-1}.$$
44
Now, differentiate the identity $\mathbf{B}\mathbf{B}^{-1} = \mathbf{1}$ and multiply from the right with $\mathbf{B}^{-1}$. This yields, after rearrangement,
$$\frac{\mathrm{d}\mathbf{B}^{-1}}{\mathrm{d}t} = -\mathbf{B}^{-1}\, \frac{\mathrm{d}\mathbf{B}}{\mathrm{d}t}\, \mathbf{B}^{-1}.$$
45
Plugging this result into Equation (44) and multiplication from the right with $\mathbf{B}$ yields
$$\frac{\mathrm{d}\mathbf{B}}{\mathrm{d}t} = \mathbf{Q}.$$
46
This ODE system can be easily solved yielding $\mathbf{B}(t) = t\,\mathbf{Q} + \mathbf{K}$, where $\mathbf{K}$ is a constant (time independent) matrix. Thus, one gets $\mathbf{B}^{-1}(t) = \left(t\,\mathbf{Q} + \mathbf{K}\right)^{-1}$. Recalling that $\mathbf{C} = \mathbf{B}^{-1}$ was originally substituted, one gets
$$\mathbf{C}(t) = \left(t\,\mathbf{Q} + \mathbf{K}\right)^{-1}.$$
47
The constant matrix $\mathbf{K}$ is obtained by considering the initial condition $\mathbf{C}(t = 0) = \mathbf{C}(0)$. Using Equation (47), one gets $\mathbf{C}(0) = \mathbf{K}^{-1}$ and therefore $\mathbf{K} = \mathbf{C}(0)^{-1}$. Introduced in the rhs of Equation (47), this finally yields Equation (43).
Rewriting the dynamics in Equation (43) for $t \to \infty$, one gets for the asymptotic behavior,
$$\mathbf{C}(t) \simeq \frac{1}{t}\, \mathbf{Q}^{-1}.$$
48
That is, for sufficiently large time, IGO forgets the initial covariance matrix $\mathbf{C}(0)$ and approaches a matrix that is proportional to the inverse of the matrix $\mathbf{Q}$ as postulated by Equation (23). However, this approach is rather slow since it obeys a $1/t$ law. The question arises whether this slow approach also transfers to the $\mathbf{m}$-dynamics.
In order to derive a solution for $\mathbf{m}(t)$, it should be first noted that Equation (42a) has a fixed point (attractor), characterized by $\mathrm{d}\mathbf{m}/\mathrm{d}t = \mathbf{0}$, which is determined by the solution of $\mathbf{b} - \mathbf{Q}\mathbf{m} = \mathbf{0}$, already calculated in Equation (11). That is, $\mathbf{m}$ approaches the maximizer $\hat{\mathbf{y}}$ of $f$ for $t \to \infty$. Let us investigate the approach to the maximizer by considering the evolution of the deviation $\mathbf{z}(t) = \mathbf{m}(t) - \hat{\mathbf{y}}$. Using Equation (11), the ODE (42a) becomes
$$\frac{\mathrm{d}\mathbf{z}}{\mathrm{d}t} = -\mathbf{C}(t)\, \mathbf{Q}\, \mathbf{z}.$$
49
Inserting Equation (47) with $\mathbf{K} = \mathbf{C}(0)^{-1}$ into Equation (49) and applying matrix algebra, one gets step-by-step
$$\frac{\mathrm{d}\mathbf{z}}{\mathrm{d}t} = -\left[t\,\mathbf{Q} + \mathbf{C}(0)^{-1}\right]^{-1} \mathbf{Q}\, \mathbf{z} = -\mathbf{B}(t)^{-1}\, \mathbf{Q}\, \mathbf{z}, \qquad \mathbf{B}(t) := t\,\mathbf{Q} + \mathbf{C}(0)^{-1}.$$
50
This is a linear ODE system with a time dependent coefficient matrix that can be solved using the Ansatz
$$\mathbf{z}(t) = \mathrm{e}^{\mathbf{M}(t)}\, \mathbf{z}(0).$$
51
Calculating the time derivative of Equation (51), one gets
$$\frac{\mathrm{d}\mathbf{z}}{\mathrm{d}t} = \frac{\mathrm{d}\mathbf{M}}{\mathrm{d}t}\, \mathrm{e}^{\mathbf{M}(t)}\, \mathbf{z}(0).$$
52
Here it was implicitly assumed that $\mathbf{M}$ and $\mathrm{d}\mathbf{M}/\mathrm{d}t$ commute.7 Comparing Equation (52) with Equation (50), one obtains the ODE
$$\frac{\mathrm{d}\mathbf{M}}{\mathrm{d}t} = -\mathbf{B}(t)^{-1}\, \mathbf{Q} = -\left[t\,\mathbf{Q} + \mathbf{C}(0)^{-1}\right]^{-1} \mathbf{Q}.$$
53
This ODE can be formally integrated using the matrix logarithm yielding $\mathbf{M}(t) = -\ln\!\left[t\,\mathbf{Q} + \mathbf{C}(0)^{-1}\right] + \mathbf{K}_M$ where $\mathbf{K}_M$ is a constant matrix to be determined below. Inserting this result in Equation (51), one gets
$$\mathbf{z}(t) = \left[t\,\mathbf{Q} + \mathbf{C}(0)^{-1}\right]^{-1} \mathrm{e}^{\mathbf{K}_M}\, \mathbf{z}(0).$$
54
Taking the initial condition $\mathbf{z}(t = 0) = \mathbf{z}(0)$ into account, one gets [using Equation (54)]:
$$\mathrm{e}^{\mathbf{K}_M} = \mathbf{C}(0)^{-1}.$$
55
Plugging this into Equation (54) leads to the next theorem.
Theorem 2 (IGO Residual Distance to Optimizer Dynamics): 
Consider the IGO system in Equation (42) where samples are generated using Gaussian normal vectors $\mathbf{y} \sim \mathcal{N}(\mathbf{m}, \mathbf{C})$ and the fitness is given by the ellipsoid model in Equation (7). Then the residual distance $\mathbf{z}(t) = \mathbf{m}(t) - \hat{\mathbf{y}}$ to the optimizer $\hat{\mathbf{y}}$ of the ellipsoid model in Equation (7) obeys the ODE $\mathrm{d}\mathbf{z}/\mathrm{d}t = -\mathbf{C}(t)\,\mathbf{Q}\,\mathbf{z}$ and its solution is given by
$$\mathbf{z}(t) = \left[\mathbf{1} + t\,\mathbf{C}(0)\,\mathbf{Q}\right]^{-1} \mathbf{z}(0).$$
56
Proof: 
Since the ODE regarding $\mathbf{z}$ has been already derived in Equation (50), it remains to insert the solution Equation (56) into both sides of Equation (50) and to show their equality. To this end, Equation (56) is expressed in terms of the matrix $\mathbf{B}(t) = t\,\mathbf{Q} + \mathbf{C}(0)^{-1}$ defined in Equation (50),
$$\mathbf{z}(t) = \left[\mathbf{C}(0)\,\mathbf{B}(t)\right]^{-1} \mathbf{z}(0) = \mathbf{B}(t)^{-1}\, \mathbf{C}(0)^{-1}\, \mathbf{z}(0).$$
57
Let us first calculate the lhs of Equation (50) by making use of Equation (45) (replacing $\mathbf{B}^{-1}$ by $\mathbf{B}(t)^{-1}$):
$$\frac{\mathrm{d}\mathbf{z}}{\mathrm{d}t} = \frac{\mathrm{d}\mathbf{B}^{-1}}{\mathrm{d}t}\, \mathbf{C}(0)^{-1}\, \mathbf{z}(0) = -\mathbf{B}^{-1}\, \frac{\mathrm{d}\mathbf{B}}{\mathrm{d}t}\, \mathbf{B}^{-1}\, \mathbf{C}(0)^{-1}\, \mathbf{z}(0).$$
58
Since
$$\frac{\mathrm{d}\mathbf{B}}{\mathrm{d}t} = \mathbf{Q},$$
one gets for Equation (58),
$$\frac{\mathrm{d}\mathbf{z}}{\mathrm{d}t} = -\mathbf{B}^{-1}\, \mathbf{Q}\, \mathbf{B}^{-1}\, \mathbf{C}(0)^{-1}\, \mathbf{z}(0).$$
59
The rhs of Equation (50) yields
$$-\mathbf{B}^{-1}\, \mathbf{Q}\, \mathbf{z}(t) = -\mathbf{B}^{-1}\, \mathbf{Q}\left[\mathbf{1} + t\,\mathbf{C}(0)\,\mathbf{Q}\right]^{-1} \mathbf{z}(0).$$
60
Equality of Equations (59) and (60) is proven if $\mathbf{B}^{-1}\, \mathbf{C}(0)^{-1}$ is equal to $\left[\mathbf{1} + t\,\mathbf{C}(0)\,\mathbf{Q}\right]^{-1}$. This transfers to their inverses, that is, $\mathbf{C}(0)\,\mathbf{B}(t)$ and $\mathbf{1} + t\,\mathbf{C}(0)\,\mathbf{Q}$. Direct calculation yields
$$\mathbf{C}(0)\,\mathbf{B}(t) = \mathbf{C}(0)\left[t\,\mathbf{Q} + \mathbf{C}(0)^{-1}\right] = t\,\mathbf{C}(0)\,\mathbf{Q} + \mathbf{1},$$
61a
$$\mathbf{1} + t\,\mathbf{C}(0)\,\mathbf{Q} = t\,\mathbf{C}(0)\,\mathbf{Q} + \mathbf{1}.$$
61b
Since Equations (61a) and (61b) agree, Equations (59) and (60) agree as well and thus Equation (56) solves Equation (50), which completes the proof.

3.3  Discussion

Let us have a closer look at Equation (56) and investigate the asymptotic behavior $t \to \infty$:
$$\mathbf{z}(t) \simeq \frac{1}{t}\left[\mathbf{C}(0)\,\mathbf{Q}\right]^{-1} \mathbf{z}(0) = \frac{1}{t}\, \mathbf{Q}^{-1}\, \mathbf{C}(0)^{-1}\, \mathbf{z}(0).$$
62
This is a very astonishing result for the natural gradient ascent: The optimizer is approached very slowly, obeying a $1/t$ law, in contrast to the exponential decay of the ordinary gradient ascent given by Equation (14). There is also another difference concerning the manner in which the optimizer is approached. According to Equation (62), the IGO flow follows a straight line, the direction of which is given by $\mathbf{Q}^{-1}\,\mathbf{C}(0)^{-1}\,\mathbf{z}(0)$. That is, each component of the deviation vector $\mathbf{z}$ changes in a proportional manner. This can also be seen using Equation (49) in conjunction with the asymptotic Equation (48), yielding
$$\frac{\mathrm{d}\mathbf{z}}{\mathrm{d}t} \simeq -\frac{1}{t}\, \mathbf{Q}^{-1}\, \mathbf{Q}\, \mathbf{z} = -\frac{1}{t}\, \mathbf{z}.$$
63
This is in contrast to the ordinary gradient ascent dynamics in Equation (13) that locally transform the ascent direction by the $\mathbf{Q}$ matrix. The latter yields trajectories in Equation (14) being bent in the $\mathbf{z}$ space while the natural gradient case produces a straight line. In that sense one could argue that in the asymptotic limit the natural gradient ascent proceeds along a “geodesic” in a flat $\mathbf{z}$-space. Interestingly, the search behavior of the $(\mu/\mu_I, \lambda)$-ES with $\sigma$-self-adaptation does also exhibit such a search behavior when considering the expected value dynamics in the steady state (see Beyer and Melkozerov, 2013). However, following a geodesic does not necessarily guarantee a fast approach to the optimizer $\hat{\mathbf{y}}$. While the $(\mu/\mu_I, \lambda)$-ES approaches $\hat{\mathbf{y}}$ exponentially fast (thus, exhibiting linear convergence order in expectation), the IGO dynamics in Equation (42a) yield according to Equation (62) only a disappointingly slow $1/t$ behavior, that is, sublinear convergence order. While this result seems astonishing at first glance, it does not really come as a surprise when considering the premises under which the natural gradient has been derived: The natural gradient direction is a result of the constrained optimization problem in Equation (28) that penalizes the steepest ascent of $J(\boldsymbol{\theta})$, such that the information gain in Equation (28b) is fixed at a small value $\varepsilon$. That is, constraining the information gain in Equation (24) necessarily results in a slowly changing search distribution as in Equation (48). The real surprise is, however, that using the natural gradient changes the convergence behavior in such a drastic manner. Since state-of-the-art implementations of NESs do not exhibit such a sublinear convergence behavior, another ingredient of these strategies comes into focus: the application of utility functions.
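The sublinear behavior is easy to reproduce numerically. The sketch below (illustrative code, assuming the flow $\mathrm{d}\mathbf{m}/\mathrm{d}t = \mathbf{C}(\mathbf{b} - \mathbf{Q}\mathbf{m})$, $\mathrm{d}\mathbf{C}/\mathrm{d}t = -\mathbf{C}\mathbf{Q}\mathbf{C}$; constants may differ with other normalization conventions) integrates the full system and compares the residual distance with the closed form of Theorem 2:

```python
import numpy as np

Q = np.diag([1.0, 3.0])
b = np.array([1.0, 1.5])
y_hat = np.linalg.solve(Q, b)              # optimizer of the ellipsoid model

m = np.array([3.0, -2.0])
C = np.eye(2)
z0 = m - y_hat
C0Q = C @ Q                                # C(0) Q, needed for the closed form

dt, T = 1e-3, 100.0
for _ in range(int(T / dt)):               # Euler integration of the coupled system
    m_new = m + dt * (C @ (b - Q @ m))
    C = C - dt * (C @ Q @ C)               # covariance update uses the old C
    m = m_new

z_euler = m - y_hat
z_closed = np.linalg.solve(np.eye(2) + T * C0Q, z0)   # Theorem 2 closed form
print(np.linalg.norm(z_euler), np.linalg.norm(z_closed))
```

Even after $t = 100$ the residual distance is only reduced roughly by a factor of $1/t$, in sharp contrast to the exponential decay of the ordinary gradient flow.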

4  NES/IGO Localized—On the Use of Utility Transforms

4.1  Assessing Utility by Individual Ranking

While the NES approach has a certain scientific appeal, taking it too literally, one ends up with slowly converging algorithms on quadratic functions. Actually, even one of the first publications on NES, Wierstra et al. (2008), did not literally implement the NES update in Equation (5) to perform the maximization of the expected value in Equation (1). Instead, transformed fitness functions have been introduced replacing $f$ in Equation (1). This technique was called fitness shaping in Wierstra et al. (2008). In later publications it was emphasized that the original NES algorithm “converges slowly or even prematurely” (Glasmachers et al., 2010, p. 393), thus supporting the new theoretical findings of Section 3 from the empirical perspective. In recent versions of NES, for example, exponential NES, fitness shaping is realized by assigning utility values to the individuals sampled, that is, weights, reflecting the ranking of those individuals. Thus, NES became increasingly similar to classical ESs, as in Beyer and Schwefel (2002), Hansen et al. (2003), and Beyer (2007). It is very important to stress once more: The use of fitness shaping techniques cannot be deduced from the natural gradient ascent philosophy. It must be introduced in an ad hoc manner. In the literature, its use is mainly justified by rendering “the algorithm invariant under monotonically growing (i.e., rank preserving) transformations of the fitness function” (Glasmachers et al., 2010, p. 394). The same argument is used to explain the application of weighted recombination selection in CMA-ES (see Hansen, 2006). Therefore, it does not come as a surprise that in the fully developed IGO framework (see Akimoto, Auger, et al., 2012), a rank-based utility function $u_{\boldsymbol{\theta}}$ replaces the objective function $f$ in Equation (1). Thus, Equation (1) changes to
$$J(\boldsymbol{\theta}) = \mathrm{E}\!\left[u_{\boldsymbol{\theta}}(f(\mathbf{y})) \,\middle|\, \boldsymbol{\theta}\right] = \int u_{\boldsymbol{\theta}}(f(\mathbf{y}))\, p(\mathbf{y} \,|\, \boldsymbol{\theta})\, \mathrm{d}\mathbf{y}.$$
64
Here, it is important to note that $u_{\boldsymbol{\theta}}$ itself depends on the distribution parameters $\boldsymbol{\theta}$. That is, the globally defined fitness function $f$ is replaced by a locally acting function $u_{\boldsymbol{\theta}}(f)$. The local character of this weighting function becomes immediately clear when looking at a standard ES with $(\mu, \lambda)$-truncation selection. There, $\lambda$ offspring ($\lambda \geq \mu$) are generated according to the offspring distribution $p(\mathbf{y} \,|\, \boldsymbol{\theta})$. The $\mu$ best individuals (i.e., those with the largest f values in the case of maximization) are taken with equal weights $w_m$ (where m denotes the fitness rank, m = 1 being the best) and the rest get weights equal to zero89:
$$w_m = \begin{cases} 1/\mu, & \text{for } 1 \leq m \leq \mu, \\ 0, & \text{for } \mu < m \leq \lambda. \end{cases}$$
65
This ranking clearly depends on the choice of the distribution parameters $\boldsymbol{\theta}$. That is, $u_{\boldsymbol{\theta}}$ changes with every iteration step. Thus, in the continuous limit, it becomes a function of time. As a replacement for $f$, it can be regarded as some kind of local fitness. Climbing up the expected value landscape in Equation (1) is therefore governed by the gradient acting on the density function exclusively. That is, instead of the IGO differential equation in Equation (34), one now has to consider
$$\frac{\mathrm{d}\boldsymbol{\theta}}{\mathrm{d}t} = \mathbf{I}^{-1}(\boldsymbol{\theta})\, \left.\nabla_{\boldsymbol{\theta}'} \int u_{\boldsymbol{\theta}(t)}(f(\mathbf{y}))\, p(\mathbf{y} \,|\, \boldsymbol{\theta}')\, \mathrm{d}\mathbf{y}\,\right|_{\boldsymbol{\theta}' = \boldsymbol{\theta}(t)}$$
66
(see Ollivier et al., 2011), and the gradient must be evaluated according to
$$\nabla_{\boldsymbol{\theta}} J = \int u_{\boldsymbol{\theta}(t)}(f(\mathbf{y}))\, \nabla_{\boldsymbol{\theta}}\, p(\mathbf{y} \,|\, \boldsymbol{\theta})\, \mathrm{d}\mathbf{y}.$$
67
That is, the gradient operator must not act on $u_{\boldsymbol{\theta}(t)}$.
Besides $(\mu, \lambda)$-selection in Equation (65), rank-based weights putting nonlinear emphasis on the best individuals are state of the art in CMA-ES. For example, Hansen (2006) proposed the heuristic formula
$$w_m \propto \ln\!\left(\frac{\lambda + 1}{2}\right) - \ln m, \qquad m = 1, \ldots, \mu.$$
68
This puts different weights on the $\mu$ best individuals (i.e., those with the largest f values). Moreover, ES theory even allows for the determination of optimal weights based on asymptotic sphere model assumptions. Arnold (2006) has shown for ESs with isotropic mutations that choosing
$$w_m \propto \mathrm{E}\!\left[z_{m:\lambda}\right], \qquad m = 1, \ldots, \lambda,$$
69
guarantees maximal progress toward the optimizer, provided that the mutation strength is controlled correctly. Here $\mathrm{E}[z_{m:\lambda}]$ in Equation (69) is the expected value of the mth order statistic (the mth largest of $\lambda$ samples) of the standard normal variate given by the integral
$$\mathrm{E}\!\left[z_{m:\lambda}\right] = \lambda \binom{\lambda - 1}{m - 1} \int_{-\infty}^{\infty} z\, \varphi(z)\, \Phi(z)^{\lambda - m}\left[1 - \Phi(z)\right]^{m - 1} \mathrm{d}z,$$
70
where φ and Φ are the pdf and cdf of the standard normal variate, respectively. For large λ (i.e., large sample sizes), Equation (70) can be expressed asymptotically (Arnold et al., 1992, p. 128, Equation (5.5.2))
formula
71
where Φ⁻¹ is the quantile function of the standard normal variate. These optimal weights are different from the heuristic weights in Equation (65) in that each of the λ individuals gets a weight, not only the μ best individuals. Unlike Equations (68) and (65), Equation (69) also yields negative weights for individuals whose f values are below the median of f. The weights are antisymmetric about the median, that is, the weight of the mth best individual is the negative of the weight of the mth worst one.
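The order-statistic expectations behind Arnold's weights can be estimated by plain Monte Carlo; the following standard-library Python sketch (sample sizes are arbitrary, and the estimates are only approximate) also makes the antisymmetry about the median visible:

```python
import random

def mc_order_stat_means(lam, trials=20000, seed=0):
    """Monte Carlo estimate of the expected m-th order statistics
    (descending order, m = 0 is the largest) of lam standard normal samples."""
    rng = random.Random(seed)
    sums = [0.0] * lam
    for _ in range(trials):
        z = sorted((rng.gauss(0.0, 1.0) for _ in range(lam)), reverse=True)
        for m in range(lam):
            sums[m] += z[m]
    return [s / trials for s in sums]

e = mc_order_stat_means(lam=5)
# antisymmetry about the median: E[z_{m}] = -E[z_{lam+1-m}]
for m in range(5):
    assert abs(e[m] + e[5 - 1 - m]) < 0.05
```

This antisymmetry is a direct consequence of the symmetry of the standard normal distribution, so it holds for every sample size λ, not just the one used here.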

In the remainder of this section, the effect of the local weight models on the dynamics of ES will be investigated considering the linear and quadratic fitness models in Equation (7). It turns out that using optimal weights results in surprisingly simple expressions that allow for closed-form solutions of the IGO differential equation.

4.2  Asymptotic Fitness Distributions and Search Gradients

In order to investigate the dynamics of IGO with local weighting, one first has to calculate the search gradient in Equation (67), a rather difficult task due to the N-dimensional integral. Taking into account, however, that the local weight function depends only on the f values, Equation (67) can be transformed into a one-dimensional integral by mapping x to its fitness value f. Thus, Equation (67) changes to
formula
72
This integral is tractable provided that the pdf of f can be expressed by simple expressions, for example, in terms of a Gaussian distribution:
formula
73
This holds exactly for linear fitness functions. Provided that the eigenvalue spectrum of the Hessian Q of the quadratic fitness function in Equation (7) does not contain singularly dominating eigenvalues, Equation (73) also holds for the quadratic case if N → ∞, due to the central limit theorem of statistics. In order to proceed under these assumptions, the expected value and the variance of f must be determined. To this end, the local fitness function that follows from Equation (7) by substitution is considered:
formula
74
Since , one only has to consider that yields (Beyer, 2001, p. 122). Thus, one gets
formula
75
As for the variance of Equation (74), it is noted that constant terms do not contribute to the variance of a random variate. Therefore
formula
76
Making use of the standard deviation formula in Beyer (2001, p. 123, Equation 4.59), one finds, mutatis mutandis:
formula
77
Note that Equation (77) also contains the case of linear fitness as a special case, obtained by setting Q = 0.
The next step concerns the calculation of the gradient in Equation (72). To this end, it is noted that the gradient of the density equals the density times the gradient of its logarithm. Thus Equation (72) changes to
formula
78
Taking the logarithm of Equation (73), one gets
formula
79
and applying the gradient calculation yields
formula
80
In order to complete Equation (80), the gradients of the mean and the standard deviation of f are to be calculated. Comparing Equation (75) with Equation (8), and taking Equation (9) into account, one immediately gets
formula
81
Straightforward calculation using Equation (77) yields
formula
82
Concerning the gradient, a detailed componentwise calculation is presented:
formula
83
Rewriting in matrix vector notation yields
formula
84
The results obtained are to be inserted into Equation (80) and then into Equation (72). Since the final substitution step depends on the weighting function Wf, it will be performed in separate sections focusing on optimal weights in Equation (71) and truncation selection weights in Equation (65), respectively.

4.3  Dynamics of IGO with Arnold’s Optimal Weights

In order to analyze the IGO flow in Equation (66), recall that this differential equation describes an infinite population model, that is, λ → ∞. Thus, in expectation, the rank m of an individual can be directly inferred from its f value via the (local) cumulative distribution function of the fitness samples. Assuming normality given by Equation (73), this transfers to
formula
85
Plugging this into Arnold’s optimal (sphere) weight formula in Equation (69), using the asymptotically exact version in Equation (71), one ends up with the surprisingly simple expression (for λ → ∞),
formula
86
That is, Arnold’s optimal weights transform to a utility function that is simply the local standardization of f in the infinite population size model. This result also shares a similarity with the so-called fitness baseline method found in older versions of NES (see Wierstra et al., 2008). However, the standardization by the local standard deviation is missing there. As we will see below, this difference causes considerable differences in the convergence behavior of the ES.
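In a finite-sample setting, the locally standardized utility of Equation (86) simply subtracts the sample mean of the f values and divides by their sample standard deviation. A minimal one-dimensional sketch (Python; step size, sample size, and the toy objective are illustrative assumptions, not taken from the text) shows the resulting natural-gradient step on the distribution mean moving toward the optimizer:

```python
import math
import random

def standardized_utilities(fs):
    """Local standardization of the sampled f values, cf. Equation (86)."""
    n = len(fs)
    mean = sum(fs) / n
    var = sum((f - mean) ** 2 for f in fs) / n
    return [(f - mean) / math.sqrt(var) for f in fs]

def nes_mean_step(mean, sigma, f, lam=200, eta=0.1, rng=random):
    """One natural-gradient step of the mean of N(mean, sigma^2).
    For the mean parameter the Fisher information is 1/sigma^2, so the
    natural gradient estimate is the sample average of u * (x - mean)."""
    xs = [rng.gauss(mean, sigma) for _ in range(lam)]
    us = standardized_utilities([f(x) for x in xs])
    grad = sum(u * (x - mean) for u, x in zip(us, xs)) / lam
    return mean + eta * grad

random.seed(3)
f = lambda x: -x * x          # maximization => optimizer at 0
m = 5.0
for _ in range(100):
    m = nes_mean_step(m, sigma=1.0, f=f)
assert abs(m) < 1.0           # moved close to the optimizer
```

Note how the σ_f denominator keeps the utility values at unit scale no matter how close the mean is to the optimizer; without it, the update signal of the raw fitness baseline decays.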
The calculation of the gradient in Equation (78) can be completed now. First, using Equations (80) and (73), one gets
formula
87
Inserting the result from Equation (86), one obtains
formula
88
Performing a variable transformation to the standardized fitness variable t, Equation (88) changes to
formula
89
Since the integrals over odd t powers vanish, one obtains
formula
90
Using Equation (81), one gets
formula
91
Now, the IGO differential equations in Equation (66) can be set up using Equation (41), and replacing Ef by EW:
formula
92a
formula
92b
where the standard deviation of f is given by Equation (77). Finding a closed-form solution to the nonlinear ODE system in Equation (92) seems impossible in general. However, an asymptotic solution for large t can be derived.

4.3.1  Solving the Quadratic Fitness Case

In a first step, it is again noted that Equation (92a) has a fixed point at the optimizer of the quadratic fitness function. This is fully analogous to Equations (10a) and (42a). Therefore, Equation (92a) can be changed to an ODE system describing the evolution of the distance vector to the optimizer defined by Equation (12). It follows (see also Equation (49)):
formula
93a
formula
93b
where , from Equation (77), changes to
formula
94
As the next step, a functional connection between and will be derived. To this end, Equation (93b) is multiplied by from the right, yielding
formula
95
Multiplying Equation (95) with from the left yields
formula
96
Taking the identity
formula
97
into account (compare the inverse case, Equation (45)), one gets
formula
98
This leads immediately to
formula
99
where is a constant vector to be determined below. Solving Equation (99) for , one gets
formula
100
This is a very simple dependence showing that the evolution of the distance vector to the optimizer is governed by the evolution of the covariance matrix. The constant vector is obtained by applying the initial conditions and to Equation (100), yielding . Solving for yields
formula
101
and reintroduced into Equation (100), one obtains the following lemma.
Lemma 1 (Optimal Weight IGO -Dynamics Are Governed by -Evolution):
Consider the IGO dynamics with locally fitness standardized weights in Equation (86) and offspring sampling acting on a general quadratic fitness model in Equation (7) that generates normally distributed fitness values. Let be the vector representing the distance of to the optimizer according to Equation (12). Let and be the initial values. Then the -dynamics are given by
formula
102
Proof:

See the derivation given above.

The simple result of Lemma 1 obtained from the ODE system in Equation (93) is remarkable if one takes into account that the ODE system is nonlinear. Unfortunately, the next step, calculating the dynamics, seems intractable in general. Therefore, one has to settle for an asymptotic solution of Equation (93b) that describes the evolution of for . To this end, Equation (93b) is multiplied from both the left and the right with taking Equation (97) into account. Furthermore, is substituted in Equation (94), taking Equation (100) into account, yielding
formula
103
Multiplying from the right with , one obtains, using the abbreviation
formula
104
formula
105
That is, all off-diagonal elements of must be constants, since according to Equation (105), . Therefore, . Furthermore, all diagonal elements of have the same scalar function on the rhs in Equation (105). As a result, , where is a function to be determined. To summarize, the general solution to Equation (105) reads
formula
106
Inserting this into Equation (105), one obtains
formula
107
Since the denominator in Equation (105) is and , it holds . By virtue of Equation (106) it follows that and . Thus, is a monotonically increasing unbounded function and for . Therefore, an asymptotically exact solution to Equation (107) can be constructed by letting in the denominator of Equation (107), leading to the simple differential equation,
formula
108
Its solution is
formula
109
with the inverse time constant
formula
110
Equating Equation (106) with Equation (104) and resolving for yields
formula
111
Now inserting the asymptotic solution from Equation (109) into Equation (111), one obtains
formula
112
This result gives rise to the following theorem.
Theorem 3 (Optimal Weight IGO Asymptotic Dynamics):
Consider the IGO dynamics governed by Equation (93) for with locally fitness standardized weights in Equation (86) and offspring sampling acting on a general quadratic fitness model in Equation (7) that generates normally distributed fitness values. Let be the vector representing the distance of to the optimizer according to Equation (12). Let and be the initial values. Then the asymptotic -dynamics are given by
formula
113
and the asymptotic -dynamics obey
formula
114
Proof:

As for Equation (113), the proof is obtained by following the derivation presented above yielding Equation (112), and finally inserting Equation (110). In order to prove Equation (114), one has to make use of Lemma 1, Equation (102). Inserting Equation (113) into Equation (102) yields Equation (114).

The results obtained are remarkable. At first, Equation (114) shows that IGO approaches the optimizer exponentially fast, thus providing linear convergence order. The rate of approach is determined by the time constant
formula
115
that depends on the search space dimension N only, with an unexpected square root law and no dependency on the Hessian Q. While the latter might come as a surprise, one should recall that we have derived the asymptotic solution to Equation (93). According to Equation (113), the covariance matrix gets proportional to the inverse of Q. That is, the IGO ES effectively “sees” a sphere model. Therefore, asymptotically, the IGO ES has transformed the ellipsoidal level sets of f into spherical ones and the dependencies on Q vanish. Note that, according to Equation (111), the influence of the initially chosen covariance matrix vanishes exponentially fast.

The square root law in Equation (115) is somewhat peculiar. Considering the performance of isotropic ESs with mutation strength control using self-adaptation or cumulative stepsize adaptation on the sphere model, Arnold and Beyer (2004) and Meyer-Nieberg and Beyer (2005), respectively, found time constants proportional to N. That is, the expected runtime for a fixed relative improvement in f values is proportional to N, whereas for the IGO ODE it grows only with the square root of the search space dimensionality. It is currently an open question whether this square root law indicates a general lower bound w.r.t. N for algorithms derived from the IGO ODE, taking into account that the IGO ODE is an infinite population size model.

Considering Equation (113), one sees that the covariance matrix adapts to the desired behavior in Equation (23) exponentially fast. That is, we have proven that this type of IGO approximates the inverse of the Hessian Q up to a scalar factor. Such a behavior has been observed in real ES using covariance matrix adaptation, as, for example, the evolutionary gradient search ES (Arnold and Salomon, 2007, p. 492, Footnote 3). A similar result has been derived in Akimoto (2012) using difference equations, where it was shown that the covariance matrix evolves, up to a scalar factor, to the inverse of the Hessian of f. In that work, a specially tailored IGO NES has been considered that deviates from the model given in Equation (93). It relies on an ad hoc constructed time-dependent stepsize scaling factor that uses eigenvalue information taken from the actual covariance matrix.

Comparing the results of Theorem 3 with those of Theorems 1 and 2 regarding NES without local utility transformation, one sees that the rank-based weighting yields qualitatively better performance on quadratic fitness models. Having a closer look at the governing ODEs (42) and (92), one notices the standard deviation terms in the denominators of the latter. Getting closer to the optimizer, the local standard deviation of f gets smaller, thus counteracting the decrease of the numerators on the rhs of Equations (92a) and (92b), with the result of larger derivatives. Tracing back the additional terms, one finds that these are due to the standardization denominator from Equation (86) in the integral in Equation (88). That is, the linear convergence order of this IGO version is clearly a result of the (sphere) optimal rank-based weighting in Equation (69) leading to the locally standardized utility function in Equation (86).

4.3.2  The Linear Fitness Case

The behavior of IGO with the weights in Equation (86) on linear fitness is obtained as the special case Q = 0 of Equation (7). If introduced in Equations (92) and (77), one obtains the IGO system
formula
116a
formula
116b
This IGO ODE system is exact in that it does not require the assumption of fitness normality: under Gaussian sampling, a linear function always yields normally distributed fitness values. Solving Equation (116) is trivial; one immediately gets for Equation (116b)
formula
117
If inserted in Equation (116a), this yields
formula
118
That is, the mean of the search distribution moves linearly in time. If the initial covariance matrix is chosen isotropically, then the evolution is in the direction of the gradient of f.

4.4  Dynamics of IGO with (μ, λ)-Truncation Selection

Truncation selection, also known as (μ, λ)-selection in ESs, is the standard selection that takes (in the case of maximization) those μ individuals out of the sample of λ offspring individuals that produce the greatest fitness values. In the infinite population model, this means that only individuals above the corresponding f quantile are used in the calculation of the gradient in Equation (67). Let ϑ be the truncation ratio
formula
119
then the local weighting function in Equation (65) returns a constant positive weight for f values above the quantile and zero otherwise (for λ → ∞):
formula
120
Assuming normally distributed fitness values f, the pdf is given by Equation (73). Therefore, the quantile obeys the equation
formula
121
That is, the first line in Equation (120) is fulfilled for f values that fulfill
formula
122
where Φ⁻¹ is the inverse of the cdf (i.e., the quantile function) of the standard normal variate. Now, plugging Equation (120) into Equation (87) yields
formula
123
Changing the integration variable f to the standardized variable t and noting how Equation (122) transforms under this substitution yields
formula
124
The t integration can be performed using the integral formulas (A.16) and (A.17) in Beyer (2001, p. 331). One gets
formula
125
and
formula
126
where
formula
127
Thus, one obtains for Equation (124):
formula
128
Inserting the gradients w.r.t. and given by Equations (81), (82), and (84) into Equation (128) yields
formula
129
and
formula
130
The IGO ODE system in Equation (41) becomes (replacing Ef by EW)
formula
131
and
formula
132
Its asymptotic solution will be investigated in the next section.
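Under the normality assumption in Equation (73), the acceptance condition in Equations (121) and (122) reduces to a simple threshold test. The following standard-library Python sketch (a numerical sanity check, not part of the derivation) verifies that a fraction ϑ of the samples exceeds the threshold:

```python
import random
from statistics import NormalDist

def truncation_threshold(mean, sigma, theta):
    """f quantile above which a fraction theta of normally distributed
    f values lies, cf. Equations (121) and (122)."""
    return mean + sigma * NormalDist().inv_cdf(1.0 - theta)

rng = random.Random(42)
mean, sigma, theta = 2.0, 3.0, 0.3
thr = truncation_threshold(mean, sigma, theta)
fs = [rng.gauss(mean, sigma) for _ in range(200000)]
frac = sum(f >= thr for f in fs) / len(fs)
assert abs(frac - theta) < 0.01   # roughly a fraction theta is accepted
```

Since the threshold is expressed through the local mean and standard deviation, it moves with the distribution parameters during the flow, exactly as described for the quantile above.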

4.4.1  Dynamics of the Quadratic Fitness Case

Comparing Equation (92a) with Equations (131) and (93a), one sees that the mean can again be substituted by the distance vector to the optimizer. Thus, the variable transformation of Equation (12) is introduced into Equations (131) and (132). This leads to
formula
133
and
formula
134
where the standard deviation of f is again given by Equation (94). Pulling common factors out of Equations (133) and (134), one obtains the ODE system
formula
135a
formula
135b
This is a nonlinear ODE system for which closed-form solutions seem difficult to obtain. Yet it shares similarities with Equation (93). A direct asymptotic solution can be given for the special case of truncation ratio ϑ = 1/2. Due to Equation (127), the second term in the brackets of Equation (135) vanishes and the resulting ODE system becomes proportional to the ODE system in Equation (93) with a constant factor. That is, time t undergoes a linear transformation, which is equivalent to a change of the inverse time constant in Equation (110). Thus, the asymptotic dynamics can be taken from Theorem 3, Equations (113) and (114), yielding
formula
136a
formula
136b
where the prefactor was calculated using Equation (125). For this special case, we have thus shown that the IGO ODE exhibits linear convergence order. Considering ϑ ≠ 1/2 is much more involved and not completely solved up to now. Therefore, we will first discuss the qualitative behavior of the ODE system when changing ϑ; afterward, we will derive a solution that holds in the vicinity of ϑ = 1/2.

Considering the different matrix terms in Equation (135b), one notices that all these single terms are positive definite; that is, it holds for that , , and . As for the latter case, this becomes clear by substitution and noting that is positive definite.

First, consider the case ϑ = 1/2 in Equation (135b). The term with the h² factor vanishes because h = 0 for ϑ = 1/2. The remaining expression in the bracket is positive definite. Due to the negative sign in front of the bracket, the remaining ODE describes the contraction of the covariance matrix, the dynamics of which are given by Equation (113).

Increasing the truncation ratio, that is, ϑ > 1/2, the additional terms increase the positiveness of the square bracket, resulting in a faster contraction of the covariance matrix. If, however, the contraction rate is too fast, the contraction of the distance vector to the optimizer cannot keep pace with that evolution. In such a case, the ODE system describes premature convergence of the IGO algorithm.

Conversely, decreasing the truncation ratio, that is, ϑ < 1/2, the positiveness of the square bracket is decreased compared to the case ϑ = 1/2. Actually, the expression in the square bracket can become negative definite. As a result, the covariance matrix expands. This behavior is desirable to a certain extent, especially in cases where the initial covariance matrix was chosen too small. However, it can also result in uncontrolled growth.

While a quantitative analysis of the contraction/expansion behavior depending on ϑ remains to be done, the behavior of the ODE system in the vicinity of ϑ = 1/2 can be derived by some kind of continuation. To this end, the solution in Equation (136) for ϑ = 1/2 is used as an Ansatz with an unknown inverse time constant
formula
137a
formula
137b
This Ansatz is inserted into the ODE system in Equation (135) in order to determine the unknown inverse time constant. At first, the term with the h² factor in Equation (135b) is considered; its asymptotic form for t → ∞ is obtained by inserting Equation (137):
formula
138
As the next step, the expression in Equation (137) is investigated. Using Equation (94), one obtains
formula
139
Plugging Equations (138) and (139) into Equation (135), one gets a simplified ODE system that holds for large t
formula
140a
formula
140b
with
formula
141
Now, the linear time transformation
formula
142
can be applied to Equations (140a) and (140b), yielding the ODE system
formula
143a
formula
143b
This ODE system is similar to the system in Equation (93) except for the time parameter. Therefore, Theorem 3 can be applied using , from Equation (142), instead of t in Equations (113) and (114). Thus, the exponent in Equations (113) and (114) becomes . That is, writing , one gets for the inverse time constant using Equations (141), (125), and (127):
formula
144
and the asymptotic dynamics become
formula
145a
formula
145b
The special case ϑ = 1/2 in Equation (136) is contained in Equation (145), as can be checked by inserting ϑ = 1/2 in Equations (144) and (145) and comparing with Equation (136). Concerning other ϑ values, it should be pointed out that the validity range of ϑ in Equation (145) cannot be determined by the calculations presented. Further research is needed to determine the ϑ range for which Equation (145) guarantees linear convergence order.

An alternative convergence proof for IGO using the special case of isotropic mutations has been presented by Glasmachers (2012). However, in that work, the existence of a linear convergence rate was only claimed without proof. Another approach proving convergence to the optimizer using Lyapunov’s stability analysis was proposed by Akimoto, Nagata, et al. (2012). However, that analysis did not provide any assertions concerning the convergence order. Therefore, it does not contribute to the question of why and how a local weighting function Wf is needed for efficiently working ESs.

Considering Equation (145a), one sees that the covariance matrix gets asymptotically similar to the inverse of the Hessian of f, that is, the Q matrix of the quadratic model in Equation (7). This is a proof of the long-conjectured property of CMA-ES-like algorithms using truncation selection (for the case considered here) that is empirically observed when running such algorithms on goal functions that can be locally approximated by quadratic functions (see Arnold and Salomon, 2007).

4.4.2  The Linear Fitness Case

The IGO ODEs can be directly obtained from the quadratic case in Equations (131), (132), and (77) by setting Q = 0. This yields
formula
146a
formula
146b
This system holds for all linear fitness functions. In order to derive the solution, Equation (146b) is multiplied by from the left. This yields
formula
147
Therefore, one immediately obtains
formula
148
Plugging this result into Equation (146a), one gets
formula
149
The solution to Equation (149) is easily obtained; it reads
formula
150
as can easily be checked by inserting Equation (150) into Equation (149). Taking the definition in Equation (146b) and Equation (127) into account, one finally obtains
formula
151
where
formula
152
As one can see, the behavior of IGO with truncation selection on linear functions is qualitatively influenced by the truncation ratio ϑ. For ϑ > 1/2, the covariance matrix shrinks exponentially fast according to Equation (148). Thus, one observes premature convergence. In the opposite case, ϑ < 1/2, one gets exponential growth of the covariance matrix, and the mean dynamics increase exponentially quickly as well. It should be mentioned that the influence of the truncation ratio on the convergence behavior has also been found by Glasmachers (2012) considering IGO with isotropic mutations, that is, for the special case of a covariance matrix proportional to the identity.
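The role of the ϑ = 1/2 boundary can be made plausible by a small one-dimensional computation (a heuristic check under the normality assumption, not the derivation used above): for a standard normal variate truncated at the (1 − ϑ) quantile t = Φ⁻¹(1 − ϑ), the conditional second moment of the selected samples is 1 + tφ(t)/ϑ, which exceeds 1 for ϑ < 1/2 (expansion tendency), equals 1 at ϑ = 1/2, and falls below 1 for ϑ > 1/2 (contraction tendency):

```python
import math
from statistics import NormalDist

def selected_second_moment(theta):
    """E[z^2 | z >= t] for a standard normal z truncated at the
    (1 - theta) quantile t; equals 1 + t*phi(t)/theta."""
    nd = NormalDist()
    t = nd.inv_cdf(1.0 - theta)
    phi_t = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
    return 1.0 + t * phi_t / theta

assert selected_second_moment(0.3) > 1.0                 # theta < 1/2: expansion
assert abs(selected_second_moment(0.5) - 1.0) < 1e-12    # neutral boundary
assert selected_second_moment(0.7) < 1.0                 # theta > 1/2: contraction
```

This matches the sign pattern found above: small truncation ratios inflate the selected-sample variance and hence the covariance estimate, large ones deflate it.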

5  Conclusions

Deriving EAs from first principles is a tempting approach to put EA engineering on a sound scientific base. In an attempt to provide such an approach, the NESs were proposed at the end of the last decade. A closer look at the premises of that approach reveals a paradigm that might be condensed into “an information gain constrained gradient ascent in the expected value fitness landscape.” According to Amari (1998), the resulting ascent direction is referred to as the natural gradient, being the steepest direction. However, as has been shown in this paper for quadratic objective functions, natural gradient ascent applied to expected fitness landscapes results in a slowly (but still) converging information geometric flow toward the optimizer. The approach to the optimizer obeys the sublinear law of Equation (48). While Amari’s claim concerning the steepest descent direction is not wrong, because this is the best direction one can get given the constraints imposed on the information gain, natural gradient ascent alone does not necessarily yield efficient EAs.

Satisfactory, however, is the result concerning the covariance matrix evolution. In all cases investigated that rely on natural gradient ascent, it has been proven that, in the case of convergence, the covariance matrix gets asymptotically proportional to the inverse of the Hessian of f. This behavior is desirable since it counteracts the degeneration tendencies of the search distribution in subspaces.

The slow convergence behavior of the original NES is due to the combination of two properties of the original NES approach: On the one hand, the optimization of the objective function f is transformed into a globally defined expected value landscape. On the other hand, fast changes of the search distribution are suppressed by the natural gradient. Since returning to ordinary gradient descent causes problems for some distribution parameterizations (as has been shown in Section 2.1), the remaining option is to localize the evaluation. That is, the utility of candidate solutions must be evaluated in a time-local manner.

As might have become clear from the discussion above, the NES/IGO ODE design contains two decisions, which are rather independent:

  1. The choice of an appropriate statistical manifold and an ascent or descent principle, respectively.

  2. The choice of a utility function that transforms the original objective function.

The IGO ODE design choice (1) seems to be obviously fixed by the steepest ascent/descent in accordance with the metric determined by the distribution family chosen. Apart from the fact that the choice of the distribution family cannot be deduced from the IGO principle, real ES implementations—for example, xNES; see Glasmachers et al. (2010)—also depart from the natural gradient direction by introducing different learning rates for the mean and the covariance matrix. This is clearly for the sake of real algorithm performance and marks the limitations of the infinite population size assumption inherent in the IGO ODE approach. Such implementational tweaks are hard to deduce from the IGO philosophy.

The same holds for design choice (2) regarding the utility used. In order to get linear convergence order in the IGO ODE framework, one has to localize the evaluation of the f values generated. There are different options to localize the evaluation. Arnold’s optimal weight function, derived for the ES on the sphere model, yields in the asymptotic population limit a utility function that is just the locally standardized fitness. That is, f values below the local mean get a negative evaluation. While a similar fitness baseline can already be found in early NES versions, the local standard deviation in the denominator of the standardization formula in Equation (86) makes the decisive difference. It ensures large utility signals when getting closer to the optimizer. It would be interesting to implement this theoretical finding in a new NES version and evaluate its performance on standard testbeds. This new NES version would work without ranking, similar to the evolutionary gradient search of Arnold and Salomon (2007).

As an alternative option, truncation selection has been considered. In the case of f maximization, it accepts f values above the local quantile. That is, instead of using the expected value of the utility in Equation (86), one has to consider the expected value
formula
153
This quantity can be given the simple interpretation of being proportional to the probability of f realizations that are greater than or equal to the local quantile. Thus, gradient ascent aims here at increasing the probability of generating above-quantile values. The quantile may be regarded as a threshold that changes with time. It gradually increases (in the case of f maximization) during the IGO flow. This again ensures—similar to the standard deviation in the denominator of Equation (86)—that there is a large utility signal.

As we have seen, the choice of the utility function has a strong impact on the performance of the IGO system. This performance also depends on the class of fitness functions considered. For example, Arnold’s weight scheme yields an exponentially fast approach to the optimizer of the ellipsoid model. However, as Equation (118) shows, it performs rather slowly on linear functions. Truncation selection as an alternative option yields exponential growth in Equation (151), provided that the truncation ratio is less than 1/2. Therefore, it is better suited for these linear functions. It should be clear that it is impossible to draw general conclusions regarding the usefulness of specific utility functions without fixing the objective function class to be optimized.

While all utility functions considered are f compliant, that is, they emphasize the selection of locally better f values, one can call this into question. As a matter of fact, f compliance can imply an emphasis on local search. Even seemingly well-posed arguments regarding the advantages of truncation selection ensuring the invariance under monotonic f transformations can be challenged when considering demands of robust optimization, as shown in Beyer and Sendhoff (2007).

The theoretical investigations in this paper concerned the performance of IGO on linear and quadratic models. From this analysis one cannot draw reliable conclusions regarding optimization on multimodal objective functions. Considering global optimization, the question arises whether utility functions that allow for non-f-compliance to a certain degree could be a means to improve global search. To this end, IGO ODEs of simple multimodal test functions should be considered in a future research program.

As has been shown in this paper, the use of localized function evaluations yields strong enough utility signals to counteract the information conserving property of the natural gradient. As a result, one can obtain exponentially fast convergence to the optimizer in the case of quadratic objective functions. Considering Equations (114) and (145), one sees that the exponential approach takes place with a time constant proportional to the square root of N. The time constant does not depend on the Hessian Q. This is in contrast to the naive expected value maximization dynamics given in Equation (14). Obviously, using the natural gradient approach makes the dynamics asymptotically independent of the shape of the ellipsoid model defined by Q. That is, using the Fisher metric, the IGO framework asymptotically transforms (for t → ∞) the ellipsoidal problem into a spherical one. While we have considered the choice of the metric and the utility function as independent design choices, it seems remarkable that this transformation result is independent of the three different utilities chosen. It raises the question of how sensitive this result is w.r.t. other utility functions (e.g., non-f-compliant versions).

Let us consider the principal limitations of the IGO ODE theory. Due to the infinite population size, it is hard—if not impossible—to get meaningful assertions w.r.t. the behavior of real NES implementations derived from IGO. Obviously, this is a concern, especially for properties that are related to the population size, as, for example, learning rates. However, it also is a concern considering the parameterization used. While the IGO ODE theory is invariant w.r.t. parameter transformations due to the differential geometry imposed by the Fisher metric, in real NES implementations the choice of an appropriate parameterization of the distribution family seems to have considerable influence on the performance of the NES (see, e.g., Glasmachers et al., 2010). Furthermore, in advanced NES versions, the direction of the IGO flow of the parameters departs from the natural gradient in that different step-size factors (learning parameters) are assigned to the mean and the covariance gradients. That is, the flow vector no longer points in the natural gradient direction. However, such deviations could be incorporated into the IGO ODE framework. In the simplest case, this would lead to different factors in Equations (93) and (135). Finding closed-form solutions to those ODEs might be a challenge for future research.

While the IGO ODE theory presented provides useful insight into the dynamic behavior of such systems, one should be cautious when transferring these infinite population size results to real NES implementations. As already pointed out, real NES implementations usually deviate considerably from the IGO ODE, such that the original ODE does not correctly describe the real ES. This concerns Monte Carlo gradient estimations, explicit methods of step-size control and evolution path cumulation (as used in CMA-ES; see Hansen et al., 2003), and different learning parameters. The current development of IGO theory cannot account for sampling aspects, path cumulations, and so on. Analyzing the convergence behavior of real NES poses the same problems as in the case of classical ES theory and needs similar approaches and techniques as have been developed for the ES in Beyer (2001). With regard to NES, a first treatise has been provided by Schaul (2012).

Probably the most important benefit of the IGO philosophy lies in its inspiring power for deriving new EA variants on a scientifically grounded base. Even though final implementations do (most often) considerably deviate from pure theory, the IGO philosophy can provide new starting points for research. One such starting point (proposed by one of the reviewers of this paper) could concern the combination of evolutionary methods, system dynamics, and time-series analysis that tracks the evolution of the strategy parameters to make predictions for future states. Future work will show whether this idea leads to improved ES algorithms.

Acknowledgments

This work was partially supported by the Austrian Science Fund FWF under grant P22649-N23. The author would like to thank the anonymous reviewers and Michael Hellwig for helpful discussions.

References

Akimoto, Y. (2012). Analysis of the natural gradient algorithm on monotonic convex-quadratic-composite functions. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO, pp. 1293–1300.

Akimoto, Y., Auger, A., and Hansen, N. (2012). Convergence of the continuous time trajectories of isotropic evolution strategies on monotonic C2-composite functions. In C. Coello Coello, V. Cutello, K. Deb, S. Forrest, G. Nicosia, and M. Pavone (Eds.), Parallel Problem Solving from Nature, PPSN XII. Lecture notes in computer science, Vol. 7491 (pp. 42–51). Berlin: Springer-Verlag.

Akimoto, Y., Nagata, Y., Ono, I., and Kobayashi, S. (2012). Theoretical foundation for CMA-ES from information geometry perspective. Algorithmica, 64(4):698–716.

Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276.

Amari, S., and Nagaoka, H. (2000). Methods of information geometry. Providence, RI: American Mathematical Society.

Arnold, D. (2006). Weighted multirecombination evolution strategies. Theoretical Computer Science, 361(1):18–37.

Arnold, B., Balakrishnan, N., and Nagaraja, H. (1992). A first course in order statistics. New York: Wiley.

Arnold, D., and Beyer, H.-G. (2004). Performance analysis of evolutionary optimization with cumulative step length adaptation. IEEE Transactions on Automatic Control, 49(4):617–622.

Arnold, D., and Salomon, R. (2007). Evolutionary gradient search revisited. IEEE Transactions on Evolutionary Computation, 11(4):480–495.

Beyer, H.-G. (2001). The theory of evolution strategies. Berlin: Springer.

Beyer, H.-G. (2007). Evolution strategies. Scholarpedia, 2(8):1965.

Beyer, H.-G., and Deb, K. (2001). On self-adaptive features in real-parameter evolutionary algorithms. IEEE Transactions on Evolutionary Computation, 5(3):250–270.

Beyer, H.-G., and Melkozerov, A. (2013). The dynamics of self-adaptive multi-recombinant evolution strategies on the general ellipsoid model. IEEE Transactions on Evolutionary Computation. doi:10.1109/TEVC.2013.2283968

Beyer, H.-G., and Schwefel, H.-P. (2002). Evolution strategies: A comprehensive introduction. Natural Computing, 1(1):3–52.

Beyer, H.-G., and Sendhoff, B. (2007). Robust optimization—A comprehensive survey. Computer Methods in Applied Mechanics and Engineering, 196(33–34):3190–3218.

Eguchi, S., and Copas, J. (2006). Interpreting Kullback–Leibler divergence with the Neyman–Pearson lemma. Journal of Multivariate Analysis, 97(9):2034–2040.

Glasmachers, T. (2012). Convergence of the IGO-flow of isotropic Gaussian distributions on convex quadratic problems. In C. Coello Coello, V. Cutello, K. Deb, S. Forrest, G. Nicosia, and M. Pavone (Eds.), Parallel Problem Solving from Nature, PPSN XII. Lecture notes in computer science, Vol. 7491 (pp. 1–10). Berlin: Springer-Verlag.

Glasmachers, T., Schaul, T., Sun, Y., Wierstra, D., and Schmidhuber, J. (2010). Exponential natural evolution strategies. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO, pp. 393–400.

Hansen, N. (2006). The CMA evolution strategy: A comparing review. In J. Lozano, P. Larrañaga, I. Inza, and E. Bengoetxea (Eds.), Towards a new evolutionary computation: Advances in the estimation of distribution algorithms (pp. 75–102). Berlin: Springer.

Hansen, N., Müller, S., and Koumoutsakos, P. (2003). Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary Computation, 11(1):1–18.

Kay, S. (1993). Fundamentals of statistical signal processing, Vol. I: Estimation theory. Englewood Cliffs, NJ: Prentice Hall.

Lehmann, E., and Casella, G. (1998). Theory of point estimation. Berlin: Springer.

Meyer-Nieberg, S., and Beyer, H.-G. (2005). On the analysis of self-adaptive recombination strategies: First results. In Proceedings of the CEC’05 Conference, pp. 2341–2348.

Ollivier, Y., Arnold, L., Auger, A., and Hansen, N. (2011). Information-geometric optimization algorithms: A unifying picture via invariance principles. (Technical Report arXiv:1106.3708v1)

Rényi, A. (1961). On measures of information and entropy. In Proceedings of the Fourth Berkeley Symposium on Mathematics, Statistics and Probability, pp. 547–561.

Schaul, T. (2012). Natural evolution strategies converge on sphere functions. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO, pp. 329–336.

Sun, Y., Wierstra, D., Schaul, T., and Schmidhuber, J. (2009). Stochastic search using the natural gradient. In J. D. Schaffer (Ed.), Proceedings of the 26th Annual International Conference on Machine Learning, ICML’09, pp. 1161–1168.

Wierstra, D., Schaul, T., Peters, J., and Schmidhuber, J. (2008). Natural evolution strategies. In Proceedings of the IEEE World Congress on Computational Intelligence, CEC, pp. 3381–3387.

Notes

1

In this work, only real-valued optimization is considered; therefore, probability distributions will be described by their probability density functions (pdfs) parameterized by sets of distribution parameters.

2

This has the additional advantage that generating the samples can be done directly by the transformation of isotropic standard normally distributed random vector components. That is, no matrix square root operations are needed.
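The sampling shortcut described in this note can be sketched as follows (function and variable names are illustrative assumptions): if the distribution is parameterized by a transformation matrix A with covariance C = A·Aᵀ, samples of N(m, C) are obtained directly as x = m + A·z from standard normal z, with no matrix square root of C.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(m, A, lam):
    """Draw lam samples of N(m, A @ A.T) by transforming isotropic
    standard normal vectors z; no matrix square root of the covariance
    is ever computed."""
    z = rng.standard_normal((lam, m.size))   # isotropic standard normal components
    return m + z @ A.T                       # transformed samples, covariance A @ A.T
```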

3

Here, a calligraphic letter has been used in order to distinguish this matrix from the identity matrix.

4

The term “natural gradient” is put into quotes throughout this paper because in the author’s opinion there is nothing “natural” here. As has been shown, the “natural gradient” yields a special ascent direction.

5

NES inherits its “natural” from the “natural gradient” ascent, thus being a rather special, that is, synthetic, choice. A better characterizing term would be “synthetic ES” (SES) instead of NES.

6

The part of the parameter vector that corresponds to the mean is indexed by a single index, whereas the covariance matrix related part needs two indices.