The convergence behaviors of so-called natural evolution strategies (NES) and of the information-geometric optimization (IGO) approach are considered. After a review of the NES/IGO ideas, which are based on information geometry, the implications of this philosophy w.r.t. optimization dynamics are investigated by considering the optimization performance on the class of positive definite quadratic objective functions (the ellipsoid model). Exact differential equations describing the approach to the optimizer are derived and solved. It is rigorously shown that the original NES philosophy of optimizing the expected value of the objective function leads to very slow (i.e., sublinear) convergence toward the optimizer. This is the real reason why state-of-the-art implementations of IGO algorithms optimize the expected value of transformed objective functions, for example, by utility functions based on ranking. It is shown that these utility functions are localized fitness functions that change during the IGO flow. The governing differential equations describing this flow are derived. In the case of convergence, the solutions to these equations exhibit an exponentially fast approach to the optimizer (i.e., linear convergence order). Furthermore, it is proven that the IGO philosophy leads to an adaptation of the covariance matrix that, in the asymptotic limit, equals the inverse of the Hessian of the objective function considered up to a scalar factor.
For decades, the engineering of evolution strategies (ES) as well as of other evolutionary algorithm (EA) related designs (including genetic algorithms, particle swarm optimization, and differential evolution, to name a few) was mainly based on biomimicry (also referred to as bionics). That is, principles gleaned from biology were translated into optimization algorithms. While there were attempts to put the design philosophy of ESs on some kind of “axiomatic” base by formulating design principles (see Beyer and Deb, 2001; Beyer, 2001; Hansen, 2006; Beyer, 2007), a new development by Wierstra et al. (2008) and Sun et al. (2009) called natural evolution strategies (NES) promised a more principled approach to the design of EAs. This new approach yielded algorithm implementations that were similar to the well-known covariance matrix adaptation (CMA) ES of Hansen et al. (2003) with rank-μ update (see Glasmachers et al., 2010). This striking similarity gave rise to attempts grounding CMA-ES on information geometry (Amari and Nagaoka, 2000), resulting in the work of Akimoto, Nagata, et al. (2012).
An alternative view on NES is to consider it as an algorithmic implementation of an information-geometric optimization (IGO) flow in the parameter space of the respective probability distribution family. That is, an IGO differential equation describing the time evolution of the distribution parameters is set up and solved numerically by discretizing the time (thus getting a generational algorithm, i.e., an EA) and estimating the expected values by Monte-Carlo sampling [the mutation and recombination process in ES; see Akimoto, Auger, et al. (2012)].
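The discretize-and-sample recipe just described can be sketched in a few lines. The following Python sketch is illustrative only: it assumes an isotropic Gaussian N(m, σ²I) rather than the fully parameterized distribution analyzed in this paper, and the function name, learning rates, and the simple linear ranking weights are assumptions, not the utilities treated later. It performs one Euler step of the IGO flow with the expectations replaced by Monte-Carlo estimates over λ samples:

```python
import numpy as np

def nes_step(m, sigma, f, lam=20, eta_m=1.0, eta_s=0.1, rng=None):
    """One Euler step of a Monte-Carlo IGO/NES update for an isotropic
    Gaussian N(m, sigma^2 I). A sketch, not the exact scheme analyzed
    in the paper; minimization of f is assumed."""
    if rng is None:
        rng = np.random.default_rng()
    N = len(m)
    z = rng.standard_normal((lam, N))          # isotropic standard-normal samples
    y = m + sigma * z                          # candidate solutions
    # rank-based utility weights (fitness shaping); best sample gets the
    # largest weight, the weights sum to zero
    order = np.argsort([f(yi) for yi in y])    # ascending f: best first
    w = np.zeros(lam)
    w[order] = np.linspace(1.0, -1.0, lam) / lam
    # natural-gradient updates (closed form for this Gaussian family)
    m = m + eta_m * sigma * (w @ z)
    sigma = sigma * np.exp(eta_s / (2.0 * N) * (w @ (np.sum(z**2, axis=1) - N)))
    return m, sigma
```

Iterating this step corresponds to the time discretization of the IGO flow; letting λ grow recovers the infinite population size model studied in the remainder of the paper.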
From a bird’s-eye view, most of the research done in this field follows directly or indirectly this line of approach with the aim to derive real EA implementations. On the other hand, the IGO flow differential equation can be used to derive theoretical assertions w.r.t. the convergence behavior and, more generally, the time evolution of the IGO flow system. This line of research was opened by the work of Glasmachers (2012) and Akimoto, Auger, et al. (2012). Considering isotropic mutations, they proved convergence to the optimizer of convex quadratic functions using different techniques that avoid the direct solution of the IGO equation system. The most advanced results have been provided by Akimoto (2012), who proved linear convergence order on functions with ellipsoidal level sets.
It should be clear that proving convergence of the IGO flow without calculating the real time evolution is but a first step. However, the direct solution of the IGO flow system yields the maximum information, which allows for a deeper understanding of the evolutionary dynamics of IGO. This is the primary goal of this paper. It is devoted to the calculation of the exact time evolution of NES algorithms optimizing convex quadratic objective functions (the so-called ellipsoid model). The NES to be analyzed uses normally distributed mutations with a fully developed covariance matrix C. Both cases are considered: the ordinary expected fitness value optimization and the case of rank-based utility optimization. It will be shown that the original NES philosophy, that is, the natural gradient ascent in the expected fitness landscape, leads to sublinear convergence. That is, the original NES idea results in convergent but slowly performing algorithms. Introducing a rank-based evaluation instead of the direct fitness changes the behavior drastically. This rank-based evaluation localizes the search in the sense that the globally defined objective function is replaced by a locally (in time) acting utility function. Such a utility can be obtained in different ways. For example, considering the optimal ES weighting of Arnold (2006), one gets a locally standardized fitness. Alternatively, (μ, λ)-ES truncation selection has a similar effect. In both cases, asymptotically exponentially fast convergence to the optimizer will be proven. These results provide the theoretical explanation of why recent NES implementations always rely on rank-based utility functions. Additionally, the analysis technique presented also allows for the calculation of the dynamics of the covariance matrix evolution. As will be shown, C approaches the inverse of the Hessian of the objective function up to a scalar factor.
The rest of the paper is organized as follows. The philosophy of NES and IGO will be introduced in a self-contained manner. However, this introduction will not consider implementation aspects since these are not needed in this paper. In Section 2 it will be shown—not solely for didactical reasons—that ordinary gradient ascent in expected fitness landscapes may cause problems in that different parameterizations may yield qualitatively different (and sometimes undesirable) behaviors. Thereafter, the idea of information distance constrained gradient ascent is introduced, building the philosophical basis of NES. The resulting IGO flow will be analyzed in Section 3 by solving the differential equation system yielding a slowly converging evolution dynamics. In Section 4 the idea of localized fitness evaluation is added to the NES framework and analyzed for optimal weighting and truncation selection assuming normally distributed fitness values. Finally, in Section 5 conclusions are drawn.
2 NES/IGO in a Nutshell
Given a function f(y) to be optimized, there are numerous options to perform a gradual approach to the optimizer ŷ. One way, as pursued in evolutionary algorithms, consists of randomly sampling y values from a family of probability distributions the density of which may be given by p(y; θ). Here, the set of distribution parameters θ evolves over time t (the generation counter) with the goal to change the distribution in such a manner that with increasing t the distribution is more and more concentrated about the optimizer ŷ. The manner in which the parameter set θ is changed over the generations t characterizes the different EA classes such as estimation of distribution algorithms (EDAs) and ESs. Alternatively, θ can even be implicitly represented by the population of candidate solutions, as is usually done in genetic algorithms (GAs). However, in this work, we will focus on EAs where the distribution parameters are explicitly represented. In the next section, we will start with a naive approach of minimizing the expected value of f. It will be shown that this approach does not always lead to a stably performing EA. Therefore, in an attempt to get a theoretically principled approach based on expected value maximization, the class of so-called natural evolution strategies (NESs) will be considered (Wierstra et al., 2008). The underlying ideas will be presented and investigated in the subsequent sections.
2.1 Maximizing the Expected Value of f
2.2 How to Get Unbiased
While for noninfinitesimal deviations δθ the Kullback-Leibler divergence is generally asymmetric in its two arguments, one can easily show that symmetry holds for infinitesimal deviations. That is, the rhs of Equation (26) can be considered as a differential length element on the statistical manifold induced by p(y; θ). This is the starting point of information geometry, noting that I_ij is just the metric tensor of the statistical manifold (see Amari and Nagaoka, 2000). However, for the considerations in this paper, the apparatus of Riemannian geometry is not really needed. All that is needed is the distance formula in Equation (26).
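In standard notation, the second-order expansion behind this statement reads as follows (a sketch in common symbols; the right-hand side is the differential length element referred to as Equation (26)):

```latex
D_{\mathrm{KL}}\!\left(p(\cdot;\boldsymbol{\theta}) \,\big\|\, p(\cdot;\boldsymbol{\theta}+\delta\boldsymbol{\theta})\right)
  = \frac{1}{2} \sum_{i,j} I_{ij}(\boldsymbol{\theta})\, \delta\theta_i\, \delta\theta_j
    + O\!\left(\|\delta\boldsymbol{\theta}\|^3\right),
\qquad
I_{ij}(\boldsymbol{\theta})
  = \mathrm{E}\!\left[\frac{\partial \ln p(\mathbf{y};\boldsymbol{\theta})}{\partial \theta_i}\,
                      \frac{\partial \ln p(\mathbf{y};\boldsymbol{\theta})}{\partial \theta_j}\right].
```

Since the leading term is quadratic in δθ, swapping the two arguments changes the divergence only at third order, which is why the asymmetry vanishes for infinitesimal deviations.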
The idea of natural gradient descent in the expected value landscape is the key idea of the NESs (see, e.g., Wierstra et al., 2008). The differential Equation (34) is also referred to as an IGO differential equation in Akimoto, Nagata, et al. (2012). The information flow generated by this differential equation will be the subject of investigation in the next section. A rather astonishing result will be derived for the ellipsoid model in Equation (7), showing that the IGO flow in Equation (34) results in a sublinear convergence order.
3 On the Dynamics of IGO without Utility Functions
According to Amari (1998, p. 251),
“the ordinary gradient does not give the steepest direction of a target function; rather, the steepest direction is given by the natural (or contravariant) gradient.”
This assertion was made under the premise of a Riemannian metric structure (Amari, 1998, p. 251). From the analysis presented so far in this paper, this assertion seems hard to justify under the premise of the expected value maximization of Equation (1). While the family of Gaussian distributions gives rise to a nonflat metric, it remains an open question how quickly Equation (34) approaches the steady state and the optimizer ŷ. To shed some light on this matter, we will investigate the dynamics of the IGO flow on the general quadratic function f given by Equation (7) using Gaussian samples parameterized by θ = (m, C), that is, y ∼ N(m, C). To this end, the inverse Fisher information matrix of the Gaussian distribution must be calculated in order to get the concrete form of Equation (34).
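For the Gaussian family, the resulting natural-gradient flow has a well-known closed form. As a sketch (writing J(m, C) for the expected objective value and absorbing any rate constant into the time scale; this states the standard result, not a quotation of Equation (34)):

```latex
\dot{\mathbf{m}} = \mathbf{C}\, \nabla_{\mathbf{m}} J,
\qquad
\dot{\mathbf{C}} = 2\, \mathbf{C} \left( \nabla_{\mathbf{C}} J \right) \mathbf{C}.
```

These forms arise because the Fisher information of N(m, C) w.r.t. m is C⁻¹; multiplying the vanilla gradient by the inverse Fisher matrix replaces the Euclidean steepest ascent direction by the metric-corrected one.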
3.1 Derivation of the IGO Differential Equation
3.2 On the Time Evolution of the IGO System
4 NES/IGO Localized—On the Use of Utility Transforms
4.1 Assessing Utility by Individual Ranking
In the remainder of this section, the effect of the local weight models on the dynamics of the ES will be investigated considering the linear and quadratic fitness models in Equation (7). It turns out that using optimal weights results in surprisingly simple expressions that allow for closed-form solutions of the IGO differential equation.
4.2 Asymptotic Fitness Distributions and Search Gradients
4.3 Dynamics of IGO with Arnold’s Optimal Weights
4.3.1 Solving the Quadratic Fitness Case
See the derivation given above.
As for Equation (113), the proof is obtained by following the derivation presented above yielding Equation (112), and finally inserting Equation (110). In order to prove Equation (114), one has to make use of Lemma 1, Equation (102). Inserting Equation (113) into Equation (102) yields Equation (114).
The law in Equation (115) is somewhat peculiar. Considering the performance of isotropic ES with mutation strength control using self-adaptation or cumulative step-size adaptation on the sphere model, Arnold and Beyer (2004) and Meyer-Nieberg and Beyer (2005), respectively, found time constants proportional to N. That is, the expected runtime for a fixed relative improvement in f values is proportional to N, whereas for the IGO ODE it grows only with the square root of the search space dimensionality. It is currently an open question whether this square-root law indicates a general lower bound on the N scaling for algorithms derived from the IGO ODE, taking into account that the IGO ODE is an infinite population size model.
Considering Equation (113), one sees that the covariance matrix adapts to the desired behavior in Equation (23) exponentially fast. That is, we have proven that this type of IGO approximates the inverse of the Hessian up to a scalar factor. Such a behavior has been observed in real ES using covariance matrix adaptation, as, for example, the evolutionary gradient search ES (Arnold and Salomon, 2007, p. 492, Footnote 3). A similar result has been derived in Akimoto (2012) using difference equations, where it was shown that C evolves, up to a scalar factor, to the inverse of the Hessian of f. In that work, a specially tailored IGO NES has been considered that deviates from the model given in Equation (93). It relies on an ad hoc constructed time-dependent step-size scaling factor that uses eigenvalue information taken from the actual C matrix.
Comparing the results of Theorem 3 with those of Theorems 1 and 2 regarding NES without local utility transformation, one sees that the rank-based weighting yields qualitatively better performance on quadratic fitness models. Having a closer look at the governing ODEs (42) and (92), one notices the additional terms in the denominators of the latter. Getting closer to the optimizer, these denominators get smaller, thus counteracting the decrease of the numerators in the rhs of Equations (92a) and (92b), with the result of larger derivatives. Tracing back the additional terms, one finds that these are due to the local standard deviation term from Equation (86) in the integral in Equation (88). That is, the linear convergence order of this IGO version is clearly a result of the (sphere) optimal rank-based weighting in Equation (69) leading to the locally standardized utility function in Equation (86).
4.3.2 The Linear Fitness Case
4.4 Dynamics of IGO with (μ, λ)-Truncation Selection
4.4.1 Dynamics of the Quadratic Fitness Case
Considering the different matrix terms in Equation (135b), one notices that each of these single terms is positive definite. As for the last of them, this becomes clear by substitution and by noting that the resulting matrix is positive definite.
First, consider the case ϑ = 1/2 in Equation (135b). The term with the h² factor vanishes in this case. The remaining expression in the bracket is positive definite. Due to the negative sign in front of the bracket, the remaining ODE describes the contraction of the covariance matrix C, the dynamics of which is given by Equation (113).
Increasing the truncation ratio, that is, ϑ > 1/2, the additional terms increase the positive definiteness of the square bracket, resulting in a faster contraction of the covariance matrix C. If, however, the contraction rate is too fast, the contraction of the mean vector toward the optimizer cannot keep pace with that evolution. In such a case, the ODE system describes premature convergence of the IGO algorithm.
Conversely, decreasing the truncation ratio, that is, ϑ < 1/2, the positive definiteness of the square bracket is decreased compared to the ϑ = 1/2 case. Actually, the expression in the square bracket can become negative definite. As a result, the covariance matrix expands. This behavior is desirable to a certain extent, especially in cases where the initial covariance matrix was chosen too small. However, it can also result in uncontrolled growth.
An alternative convergence proof for IGO using the special case of isotropic mutations has been presented by Glasmachers (2012). However, in that work, the existence of a linear convergence rate was only claimed, without proof. Another approach, proving convergence to the optimizer using Lyapunov’s stability analysis, was proposed by Akimoto, Nagata, et al. (2012). However, that analysis did not provide any assertions concerning the convergence order. Therefore, it does not contribute to the question of why and how a local weighting function W_f is needed for efficiently working ESs.
Considering Equation (145a), one sees that the covariance matrix gets asymptotically similar to the inverse of the Hessian of f, that is, of the matrix defining the quadratic model in Equation (7). This is a proof of the long-conjectured property of CMA-ES-like algorithms using truncation selection (for the case considered) that is empirically observed when running such algorithms on goal functions that can be locally approximated by quadratic functions (see Arnold and Salomon, 2007).
4.4.2 The Linear Fitness Case
Deriving EAs from first principles is a tempting approach to put EA engineering on a sound scientific base. In an attempt to provide such an approach, the NESs were proposed at the end of the last decade. A closer look at the premises of that approach reveals a paradigm that might be condensed into “an information gain constrained gradient ascent in the expected value fitness landscape.” According to Amari (1998), the resulting ascent direction is referred to as the natural gradient, being the steepest direction. However, as has been shown in this paper considering quadratic objective functions, natural gradient ascent applied to expected fitness landscapes results in a slowly (but still) converging information-geometric flow toward the optimizer. The approach to the optimizer obeys only a sublinear power law, Equation (48). While Amari’s claim concerning the steepest direction is not wrong, because this is the best direction one can get given the constraints imposed on the information gain, natural gradient ascent alone does not necessarily yield efficient EAs.
Satisfactory, however, is the result concerning the covariance matrix evolution. In all cases investigated that rely on natural gradient ascent, it has been proven that, in the case of convergence, C gets asymptotically proportional to the inverse of the Hessian of f. This behavior is desirable since it counteracts the degeneration tendencies of the search distribution in subspaces.
The slow convergence behavior of the original NES is due to the combination of two properties of the original NES approach: On the one hand, the optimization of the objective function f is transformed into a globally defined expected value landscape. On the other hand, fast changes of the search distribution are suppressed by the natural gradient. Since returning to ordinary gradient descent causes problems for some distribution parameterizations (as has been shown in Section 2.1), the remaining option is to localize the evaluation. That is, the utility of candidate solutions must be evaluated in a time-local manner.
As might have become clear in the above given discussion, the NES/IGO ODE design contains two decisions, which are rather independent:
(1) The choice of an appropriate statistical manifold and an ascent or descent principle, respectively.
(2) The choice of a utility function that transforms the original objective function.
The IGO ODE design choice (1) seems to be obviously fixed by the steepest ascent/descent in accordance with the metric determined by the distribution family chosen. Apart from the fact that the choice of the distribution family cannot be deduced from the IGO principle, real ES implementations (for example, xNES; see Glasmachers et al., 2010) also depart from the natural gradient direction by introducing different learning rates for the m and the C updates. This is clearly for the sake of real algorithm performance and marks the limitations of the infinite population size assumption inherent in the IGO ODE approach. Such implementational tweaks are hard to deduce from the IGO philosophy.
The same holds for the design choice (2) regarding the utility used. In order to get linear convergence order in the IGO ODE framework, one has to localize the evaluation of the f values generated. There are different options to localize the evaluation. Arnold’s optimal weight function, derived for the ES on the sphere model, yields in the asymptotic population limit a utility function that is just the locally standardized fitness. That is, f values below the local mean get a negative evaluation. While a similar fitness baseline can already be found in early NES versions, the local standard deviation in the denominator of the standardization formula in Equation (86) makes the decisive difference. It ensures large utility signals when getting closer to the optimizer. It would be interesting to implement this theoretical finding in a new NES version and evaluate its performance on standard testbeds. This new NES version would work without ranking, similar to the evolutionary gradient search of Arnold and Salomon (2007).
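The locally standardized fitness idea can be sketched directly on a finite population (an illustration of the principle behind Equation (86), using the sample mean and standard deviation as the local statistics; minimization of f is assumed, so smaller f values get larger utility):

```python
import numpy as np

def standardized_utility(fvals):
    """Locally standardized fitness utility: values better (smaller, for
    minimization) than the current population mean get positive utility,
    scaled by the local standard deviation. A sketch of the idea, not the
    asymptotic expression derived in the paper."""
    fvals = np.asarray(fvals, dtype=float)
    sd = fvals.std()
    if sd == 0.0:                          # degenerate population: no signal
        return np.zeros_like(fvals)
    return (fvals.mean() - fvals) / sd
```

Because the local standard deviation shrinks together with the population spread near the optimizer, the utility signal stays O(1) instead of vanishing, which is exactly the localization effect discussed above.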
As we have seen, the choice of the utility function has a strong impact on the performance of the IGO system. This performance also depends on the class of fitness functions considered. For example, Arnold’s weight scheme yields an exponentially fast approach to the optimizer of the ellipsoid model. However, as Equation (118) shows, it performs rather slowly on linear functions. Truncation selection as an alternative option yields an exponential growth in Equation (151), provided that the truncation ratio is less than 1/2. Therefore, it is better suited for these linear functions. It should be clear that it is impossible to draw general conclusions regarding the usefulness of specific utility functions without fixing the objective function class to be optimized.
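For comparison, the truncation selection utility amounts to a simple 0/1 weighting of the ranked samples. A minimal sketch (the function name and the normalization to weights summing to one are illustrative assumptions; minimization of f is assumed):

```python
import numpy as np

def truncation_weights(fvals, ratio=0.25):
    """(mu, lambda)-truncation selection weights as a sketch: the best
    ceil(ratio * lambda) samples get equal weight 1/mu, all others get
    zero. 'ratio' plays the role of the truncation ratio."""
    lam = len(fvals)
    mu = max(1, int(np.ceil(ratio * lam)))
    w = np.zeros(lam)
    w[np.argsort(fvals)[:mu]] = 1.0 / mu   # select the mu best (smallest f)
    return w
```

Unlike the standardized utility, these weights are invariant under any strictly monotonic transformation of f, which is the invariance property discussed in the next paragraph.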
While all utility functions considered are f-compliant, that is, they emphasize the selection of locally better f values, one can call this into question. As a matter of fact, f-compliance can imply an emphasis on local search. Even seemingly well-posed arguments regarding the advantages of truncation selection ensuring invariance under monotonic f transformations can be challenged when considering the demands of robust optimization, as shown in Beyer and Sendhoff (2007).
The theoretical investigations in this paper concerned the performance of IGO on linear and quadratic models. From this analysis one cannot draw reliable conclusions regarding optimization on multimodal objective functions. Considering global optimization, the question arises whether utility functions that allow for non-f-compliance to a certain degree could be a means to improve global search. To this end, IGO ODEs of simple multimodal test functions should be considered in a future research program.
As has been shown in this paper, the use of localized function evaluations yields strong enough utility signals to counteract the information-conserving property of the natural gradient. As a result, one can obtain exponentially fast convergence to the optimizer in the case of quadratic objective functions. Considering Equations (114) and (145), one sees that the exponential approach takes place with a time constant proportional to √N. The time constant does not depend on the Hessian of the objective function. This is in contrast to the naive expected value maximization dynamics given in Equation (14). Obviously, using the natural gradient approach makes the dynamics asymptotically independent of the shape of the ellipsoid model defined by that Hessian. That is, using the Fisher metric, the IGO framework asymptotically transforms (for t → ∞) the ellipsoidal problem into a spherical one. While we have considered the choice of the metric and the utility function as independent design choices, it seems remarkable that this transformation result is independent of the three different utilities chosen. It raises the question of how sensitive this result is w.r.t. other utility functions (e.g., non-f-compliant versions).
Let us consider the principal limitations of the IGO ODE theory. Due to the infinite population size, it is hard, if not impossible, to get meaningful assertions w.r.t. the behavior of real NES implementations derived from IGO. Obviously, this is a concern especially for properties that are related to the population size, as, for example, learning rates. However, it is also a concern considering the parameterization used. While the IGO ODE theory is invariant w.r.t. parameter transformations due to the differential geometry imposed by the Fisher metric, in real NES implementations the choice of an appropriate parameterization of the distribution family seems to have considerable influence on the performance of the NES (see, e.g., Glasmachers et al., 2010). Furthermore, in advanced NES versions, the direction of the IGO flow of the parameters departs from the natural gradient in that different step-size factors (learning parameters) are assigned to the m and the C gradients. That is, the flow vector no longer points in the natural gradient direction. However, such deviations could be incorporated into the IGO ODE framework. In the simplest case, this would lead to different factors in Equations (93) and (135). Finding closed-form solutions to those ODEs might be a challenge for future research.
While the IGO ODE theory presented provides useful insight into the dynamic behavior of such systems, one should be cautious when transferring these infinite population size results to real NES implementations. As already pointed out, real NES implementations usually deviate considerably from the IGO ODE such that the original ODE does not correctly describe the real ES. This concerns Monte Carlo gradient estimations, explicit methods of step-size control, evolution path cumulation (as used in CMA-ES; see Hansen et al., 2003), and different learning parameters. The current development of IGO theory cannot account for sampling aspects, path cumulations, and so on. Analyzing the convergence behavior of real NES poses the same problems as in the case of classical ES theory and needs similar approaches and techniques as have been developed for the ES in Beyer (2001). With regard to NES, a first treatise has been provided by Schaul (2012).
Probably the most important benefit of the IGO philosophy lies in its inspiring power for deriving new EA variants on a scientifically grounded base. Even though final implementations (most often) deviate considerably from pure theory, the IGO philosophy can provide new starting points for research. One such starting point (proposed by one of the reviewers of this paper) could concern the combination of evolutionary methods, system dynamics, and time-series analysis that tracks the evolution of the distribution parameters θ in order to make predictions for future generations. Future work will show whether this idea leads to improved ES algorithms.
This work was partially supported by the Austrian Science Fund FWF under grant P22649-N23. The author would like to thank the anonymous reviewers and Michael Hellwig for helpful discussions.
In this work, only real-valued optimization is considered, that is, y ∈ ℝ^N, and therefore, probability distributions will be described by their probability density functions (pdfs) parameterized by sets of distribution parameters.
This has the additional advantage that generating the samples can be done directly by a linear transformation of isotropic, standard normally distributed random vectors. That is, no matrix square root operations are needed.
Here, a calligraphic symbol has been used in order to distinguish this matrix from the identity matrix I.
The term “natural gradient” is put into quotes throughout this paper because in the author’s opinion there is nothing “natural” here. As has been shown, the “natural gradient” yields a special ascent direction.
NES inherits its “natural” from the “natural gradient” ascent, thus, being a rather special, that is, synthetic choice. A better characterizing term would be “synthetic ES” (SES) instead of NES.
The part of θ that corresponds to m is indexed by a single index, whereas the C-matrix-related part needs two indices.
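The sample-generation shortcut mentioned in the footnote on sampling above can be sketched as follows. Maintaining a factor A with C = A Aᵀ (for instance, a Cholesky factor; the factorization choice and the function name are illustrative assumptions) turns each sample into a single matrix-vector transformation:

```python
import numpy as np

def sample_gaussian(m, A, lam, rng=None):
    """Draw lam samples from N(m, A @ A.T) by linearly transforming
    isotropic standard-normal vectors z via y = m + A z. No matrix
    square root of the covariance matrix C itself is computed."""
    if rng is None:
        rng = np.random.default_rng()
    z = rng.standard_normal((lam, len(m)))   # isotropic samples, one per row
    return m + z @ A.T                       # row-wise y_k = m + A z_k
```

When only C is given, np.linalg.cholesky(C) yields such a factor A once; afterwards, every sample costs only one matrix-vector product.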