Limited-memory BFGS (L-BFGS; Liu and Nocedal, 1989) is often considered to be the method of choice for continuous optimization when first- or second-order information is available. However, the use of L-BFGS can be complicated in a black box scenario where gradient information is not available and therefore should be numerically estimated. The accuracy of this estimation, obtained by finite difference methods, is often problem-dependent and may lead to premature convergence of the algorithm.
This article demonstrates an alternative to L-BFGS, the limited memory covariance matrix adaptation evolution strategy (LM-CMA) proposed by Loshchilov (2014). LM-CMA is a stochastic derivative-free algorithm for numerical optimization of nonlinear, nonconvex optimization problems. Inspired by L-BFGS, LM-CMA samples candidate solutions according to a covariance matrix reproduced from m direction vectors selected during the optimization process. The decomposition of the covariance matrix into Cholesky factors allows reducing the memory complexity to , where n is the number of decision variables. The time complexity of sampling one candidate solution is also but scales as only about 25 scalar-vector multiplications in practice. The algorithm has an important property of invariance with respect to strictly increasing transformations of the objective function; such transformations do not compromise its ability to approach the optimum. LM-CMA outperforms the original CMA-ES and its large-scale versions on nonseparable ill-conditioned problems with a factor increasing with problem dimension. Invariance properties of the algorithm do not prevent it from demonstrating a comparable performance to L-BFGS on nontrivial large-scale smooth and nonsmooth optimization problems.
In a black box scenario, knowledge about an objective function , to be optimized on some space , is restricted to the handling of a device that delivers the value of for any input . The goal of black box optimization is to find solutions with small (in the case of minimization) value , using the least number of calls to the function f (Ollivier et al., 2011). In a continuous domain, f is defined as a mapping , where n is the number of variables. The increasing number of variables involved in typical optimization problems makes it ever harder to supply the search with useful problem-specific knowledge, such as gradient information or valid assumptions about problem properties. The use of large-scale black box optimization approaches would seem attractive providing that comparable performance can be achieved.
The use of well-recognized gradient-based approaches, such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm (Shanno, 1970), is complicated in the black box scenario because gradient information is not available and would have to be estimated by costly finite difference methods (e.g., function evaluations per gradient estimation for forward difference, and for central difference). The latter procedures are problem-sensitive and may require a priori knowledge about the problem, such as scaling of f, decision variables, and expected condition number (Li et al., 2007).
By the 1980s another difficulty became evident: the use of quasi-Newton methods such as BFGS is limited to small- and medium-scale optimization problems for which the approximate inverse Hessian matrix can be stored in memory. As a solution, it was proposed not to store the matrix but to reconstruct it using information from the last m iterations (Nocedal, 1980). The final algorithm, called the limited memory BFGS algorithm (L-BFGS or LM-BFGS) and proposed by Liu and Nocedal (1989), is still considered to be the state-of-the-art method for large-scale gradient-based optimization (Becker and Fadili, 2012). However, when a large-scale black box function is considered, L-BFGS is forced to deal both with scarce information coming from only m recent gradients and potentially numerically imprecise estimations of these gradients, which scale up the runtime in the number of function evaluations by a factor of n. It is reasonable to wonder whether L-BFGS and other derivative-based algorithms are still competitive in these settings or whether better performance and robustness can be achieved by derivative-free algorithms.
The covariance matrix adaptation evolution strategy (CMA-ES) seems to be a reasonable alternative. It is a derivative-free algorithm designed to learn dependencies between decision variables by adapting a covariance matrix that defines the sampling distribution of candidate solutions (Hansen et al., 2003). This algorithm constantly demonstrates good performance in various platforms for comparing continuous optimizers like those used at the black box optimization benchmarking (BBOB) workshop (Hansen et al., 2010; Auger et al., 2010; Loshchilov et al., 2013) and special sessions of the Congress on Evolutionary Computation (García et al., 2009; Loshchilov, 2013a). The CMA-ES was also extended to noisy (Hansen et al., 2009), expensive (Kern et al., 2006; Loshchilov et al., 2012), and multiobjective optimization (Igel et al., 2007).
The principal advantage of CMA-ES, the learning of dependencies between n decision variables, also forms its main practical limitations, such as memory storage required to run the algorithm and computational time complexity per function evaluation (Ros and Hansen, 2008). These limitations may preclude the use of CMA-ES for computationally cheap but large-scale optimization problems if the internal computational cost of CMA-ES is greater than the cost of one function evaluation. On nontrivial large-scale problems with , the internal computational cost of CMA-ES becomes prohibitive, and efficiently storing the covariance matrix in memory becomes very difficult. An open problem is how to extend efficient black box approaches such as CMA-ES to while keeping a reasonable trade-off between performance in terms of the number of function evaluations and the internal time and space complexity. The low-complexity methods such as separable CMA-ES (sep-CMA-ES; Ros and Hansen, 2008), linear time natural evolution strategy (R1-NES; Sun et al., 2011), and VD-CMA (Akimoto et al., 2014) are useful when the large-scale optimization problem at hand is separable or decision variables are weakly correlated, but otherwise the performance of these algorithms with respect to the original CMA-ES may deteriorate significantly.
This article presents a greatly improved version of LM-CMA (limited memory CMA-ES), an extension of CMA-ES to large-scale optimization (Loshchilov, 2014). Instead of storing the covariance matrix, LM-CMA stores m direction vectors in memory and uses them to generate solutions. The algorithm has space complexity, where m can be a function of n. The time complexity is linear in practice, with the smallest constant factor among the preceding evolutionary algorithms.
The article is organized as follows. Section 2 briefly describes L-BFGS, and Section 3 describes CMA-ES with its large-scale alternatives. Section 4 presents the improved version of LM-CMA, and Section 5 investigates its performance with respect to large-scale alternatives. The article is concluded in Section 6.
An early version of the L-BFGS method, at that time called the SQN method, was proposed by Nocedal (1980). During the first m iterations L-BFGS is identical to the BFGS method, but stores BFGS corrections separately until the maximum number of m is used up. Then the oldest corrections are replaced by the newest ones. The approximate of the inverse Hessian of f at iteration k, , is obtained by applying m BFGS updates to a sparse symmetric and positive definite matrix provided by the user (Liu and Nocedal, 1989).
The L-BFGS method works as follows (Liu and Nocedal, 1989):
Step 1. Choose , m, , , and a symmetric and positive definite starting matrix . Set .
The novelty introduced by Liu and Nocedal (1989) with respect to the version given in Nocedal (1980) is that the line search is not forced to perform at least one cubic interpolation, but the unit step length is always tried first, and if it satisfies the Wolfe conditions, it is accepted.
Step 4. Set and go to step 2.
The algorithm space and time complexity scales as per iteration (not per function evaluation), where m is on the order of 5–40, suggested in the original paper, is still the most common setting. An extension to bound constrained optimization called L-BFGS-B has the efficiency of the original algorithm but at the cost of a significantly more complex implementation (Byrd et al., 1995). Extensions to optimization with arbitrary constraints are currently not available. Satisfactory and computationally tractable handling of noise is at least problematic, often impossible.
Nevertheless, as mentioned, when gradient information is available, L-BFGS is competitive with other techniques (Becker and Fadili, 2012; Ngiam et al., 2011) and often can be viewed as a method of choice (Andrew and Gao, 2007) for large-scale continuous optimization. However, in the black box scenario, when gradient information is not available (direct search settings), the advantages of L-BFGS become less obvious, and derivative-free algorithms can potentially perform comparably. This scenario is investigated here in detail.
3 Evolution Strategies for Large-Scale Optimization
Historically, evolution strategies (Rechenberg, 1973) were first designed to perform search without learning dependencies between variables, which is a more recent development that gradually led to the CMA-ES algorithm (Hansen and Ostermeier, 1996; Hansen et al., 2003). This section discussed the CMA-ES algorithm and its state-of-the-art derivatives for large-scale optimization. For a recent comprehensive overview of evolution strategies, the reader is referred to Hansen et al. (2015). More specifically, analysis of the theoretical foundations of evolution strategies is provided by Wierstra et al. (2014); Ollivier et al. (2011); Akimoto and Ollivier (2013); Glasmachers (2012); Auger and Hansen (2013); Hansen and Auger (2014); Arnold (2014); and Beyer (2014).
These solutions then are evaluated with an objective function f (line 6). The old mean of the mutation distribution is stored in , and a new mean is computed as a weighted sum of the best parent individuals selected from among generated offspring individuals (line 7). The weights are used to control the impact of the selected individuals; the weights are usually higher for better ranked individuals (line 1).
The procedure of the adaptation of the step size in CMA-ES is inherited from the CSA-ES, cumulative step-size adaptation evolution strategy (Hansen and Ostermeier, 1996), and is controlled by evolution path . Successful mutation steps (line 8) are tracked in the space of sampling, that is, in the isotropic coordinate system defined by principal components of the covariance matrix . To update the evolution path a decay/relaxation factor is used to decrease the importance of the previously performed steps with time. The step size update rule increases the step size if the length of the evolution path is longer than the expected length of the evolution path under random selection , and decreases otherwise (line 13). The expectation of is approximated by . A damping parameter controls the change of the step size.
The covariance matrix update consists of two parts (line 12): rank-one update (Hansen and Ostermeier, 2001) and rank- update (Hansen et al., 2003). The rank-one update computes evolution path of successful moves of the mean of the mutation distribution in the given coordinate system (line 10) in a similar way as for the evolution path of the step size. To stall the update of when increases rapidly, a trigger is used (line 9).
The rank- update computes a covariance matrix as a weighted sum of covariances of successful steps of best individuals (line 11). The update of itself is a replacement of the previously accumulated information by a new one with corresponding weights of importance (line 12): c1 for covariance matrix of rank-one update and for of rank- update (Hansen et al., 2003) such that . It was also proposed to take into account unsuccessful mutations in the active rank- update (Hansen and Ros, 2010; Jastrebski and Arnold, 2006).
In CMA-ES the factorization of the covariance into is needed to sample the multivariate normal distribution (line 5). The eigendecomposition with complexity is used for the factorization. Already in the original CMA-ES it was proposed to perform the eigendecomposition every generations (not shown in Algorithm 1) to reduce the complexity per function evaluation to .
3.2 Large-Scale Variants
This update reduces the computational complexity to and allows exploiting problem separability. The algorithm demonstrated good performance on separable problems and even outperformed CMA-ES on the nonseparable Rosenbrock function for .
A version of CMA-ES with limited memory storage, also called limited memory CMA-ES (L-CMA-ES) was proposed by Knight and Lunacek (2007). The L-CMA-ES uses the m eigenvectors and eigenvalues spanning the m-dimensional dominant subspace of the -dimensional covariance matrix . The authors adapted a singular value decomposition updating algorithm developed by Brand (2006) that allowed avoiding the explicit computation and storage of the covariance matrix. For the performance in terms of the number of function evaluations gradually decreases while enabling the search in for . However, the computational complexity of practically (for m on order of , as suggested by Knight and Lunacek, 2007) leads to the same limitations of time complexity as the original CMA-ES.
The ()-Cholesky-CMA-ES is outlined in Algorithm 2. Like the original CMA-ES, the Cholesky-CMA-ES proceeds by sampling candidate solutions (lines 4–7) and taking into account the most successful out of solutions in the evolution path adaptation (lines 10 and 11). However, the eigendecomposition procedure is not required anymore because the Cholesky factor and its inverse are updated incrementally (lines 13 and 14). This simplifies the implementation of the algorithm a lot and keeps its time complexity as . A postponed update of the Cholesky factors every iterations would not reduce the asymptotic complexity further (as it does in the original CMA-ES); the quadratic complexity will remain because of matrix-vector multiplications needed to sample new individuals.
The non-elitist Cholesky-CMA is a good alternative to the original CMA-ES and demonstrates a comparable performance (Suttorp et al., 2009). While it has the same computational and memory complexity, the lack of a rank- update may deteriorate its performance on problems where it is essential.
The LM-CMA is inspired by the L-BFGS algorithm, but instead of storing m gradients for performing inverse Hessian requiring operations, it stores m direction vectors to reproduce the Cholesky factor and generate candidate solutions with a limited time and space cost (see Section 4.1). These m vectors are estimates of descent directions provided by evolution path vectors and should be stored with a particular temporal distance to enrich (see Section 4.2). An important innovation on the original LM-CMA (Loshchilov, 2014) is a procedure for sampling from a family of search representations defined by Cholesky factors reconstructed from vectors (see Section 4.3) and according to the Rademacher distribution (see Section 4.4). These modifications allow simultaneously reducing the internal time complexity of the algorithm and improving its performance in terms of the number of function evaluations. Before describing the algorithm itself, the following sections introduce all the necessary components.
4.1 Reconstruction of Cholesky Factors
The product of a random vector and the Cholesky factor thus can be directly computed. At iteration , and , the new updated Cholesky factor . At iteration , and . Thus, a very simple iterative procedure that scales as can be used to sample candidate solutions in according to the Cholesky factor reconstructed from m pairs of vectors and .
The pseudocode of the procedure to reconstruct from m direction vectors is given in Algorithm 3. At each iteration of reconstruction of (lines 3 and 4), is updated as a sum of a-weighted version of itself and bt-weighted evolution path (accessed from a matrix as ) scaled by the dot product of and . As can be seen, Algorithms 3 and 4 use indexation instead of t. This is a convenient way to have references to matrices and , which store and vectors, respectively. The next sections show how to efficiently manipulate these vectors.
A very similar approach can be used to reconstruct . For the sake of reproducibility the pseudocode is given in Algorithm 4 for and . The computational complexity of both procedures scales as .
It is important to note that when a vector from a set of m vectors stored in is replaced by a new vector (Algorithm 7, line 15), all inverse vectors for should be iteratively recomputed (Krause, 2014). This procedure corresponds to Algorithm 7, line 16.
4.2 Direction Vectors Selection and Storage
The choice to store only direction vectors to obtain a comparable amount of useful information as stored in the covariance matrix of the original CMA-ES requires a careful procedure of selection and storage. A simple yet powerful procedure proposed by Loshchilov (2014) is to preserve a certain temporal distance in terms of number of iterations between the stored direction vectors. The procedure tends to store unique information (in contrast to the case if the latest m evolution path vectors would be stored). The latter case is different from the storage of m gradients as in L-BFGS, since the evolution path is gradually updated at each iteration with a relatively small learning rate cc and from selected vectors.
The selection procedure is outlined in Algorithm 5, which outputs an array of pointers such that points to a row in matrices and with the oldest saved vectors and , which are taken into account during the reconstruction procedure. The higher the index i of the more recent is the corresponding direction vector. The index jcur points to the vector that will be replaced by the newest one in the same iteration when the procedure is called. The rule to choose a vector to be replaced is the following. Find a pair of consecutively saved vectors ( with the distance between them (in terms of indexes of iterations, stored in ) closest to a target distance Ni (line 6). If this distance is smaller than Ni, then the index will be swapped with last index of (lines 9–12) and the corresponding vector will be replaced by the new vector (line 15); otherwise, the oldest vector among m saved vectors will be removed (as a result of line 8). Thus, the procedure gradually replaces vectors in a way to keep them at the distance Ni and with the overall time horizon for all vectors of at most iterations. The procedure can be called periodically every iterations of the algorithm. The values of Ni are to be defined, for example, as a function of problem dimension n and direction vector index i. Here, however, Ni is set to n for all i, that is, the target distance equals the problem dimension.
4.3 Sampling from a Family of Search Representations
By varying , one can control the reconstruction of the Cholesky factor used for sampling and in this way explore a family of possible transformations of the coordinate system. The maximum number of vectors defined by m can be associated with the number of degrees of freedom of this exploration.
While in the original LM-CMA (Loshchilov, 2014) the value of m is set to to allow the algorithm scale up to millions of variables, greater values of m such as often lead to better performance (see Section 5.7 for a detailed analysis). Thus, when memory allows, a gain in performance can be achieved. However, because of an internal cost of , the time cost then would scale as , which is undesirable for . This is where the use of out of m vectors can drastically reduce the time complexity. Sample from a truncated half-normal distribution (Algorithm 6, line 4), and set to the latest vectors (line 5). For a constant , the time complexity of scales as . A new value of is generated for each new individual. Additionally, to exploit the oldest information, is generated with for one out of individuals. While for the new solution appears to be sampled from an isotropic normal distribution, the computation of inverses is performed using all m vectors.
4.4 Sampling from the Rademacher Distribution
Evolution strategies are mainly associated with the multivariate normal distribution used to sample candidate solutions. However, alternative distributions such as the Cauchy distribution can be used (Yao and Liu, 1997; Hansen, 2008). Moreover, the adaptive encoding procedure proposed by Hansen (2008) can be coupled with any sampling distribution, as in Loshchilov et al. (2011), where it was shown that completely deterministic adaptive coordinate descent on principal components obtained with the adaptive encoding procedure allows obtaining performance comparable to that of CMA-ES.
Inspired by Loshchilov et al. (2011), the present work replaces the original multivariate normal distribution used in LM-CMA by the Rademacher distribution, where a random variable has a 50% chance to be either −1 or +1 (also can be viewed as a Bernoulli distribution). Thus, a preimage vector of candidate solution contains n values, which are either −1 or +1. The intention in using this distribution is threefold: (1) to reduce the computation complexity by a constant but rather significant factor, (2) to demonstrate that the Rademacher distribution can potentially be an alternative to the Gaussian distribution, at least in large-scale settings, and (3) to demonstrate that the new step size adaptation rule (see Section 4.5), which does not make assumptions about the sampling distribution, can work well when used with non-Gaussian distributions. As a support for this substitution, recall that for an n-dimensional unit-variance spherical Gaussian, for any positive real number , all but at most of the mass lies within the annulus , where c is a fixed positive constant (Hopcroft and Kannan, 2013). Thus, when n is large, the mass is concentrated in a thin annulus of width at radius . It is interesting that the sampling from the Rademacher distribution reproduces this effect of large-dimensional Gaussian sampling, since the distance from the center of an n-dimensional hypercube to its corners is .
4.5 Population Success Rule
The step size used to define the scale of deviation of a sampled candidate solution from the mean of the mutation distribution can be adapted by the population success rule (PSR) proposed for LM-CMA by Loshchilov (2014). This procedure does not assume that candidate solutions should come from the multivariate normal distribution, as is often assumed in evolution strategies, including CMA-ES. Therefore, the PSR procedure is well suited for the Rademacher distribution.
4.6 Improved LM-CMA
The improved LM-CMA is given in Algorithm 7. At each iteration t, candidate solutions are generated by mutation, defined as a product of a vector sampled from the Rademacher distribution and a Cholesky factor reconstructed from out of m vectors (lines 4–11), as described in Sections 4.1–4.4. Mirrored sampling (Brockhoff et al., 2010) is introduced to generate the actual only every second time and thus to decrease the computation cost per function evaluation by a factor of 2 by evaluating and then its mirrored version . The latter approach sometimes also improves the convergence rate.
The best out of solutions are selected to compute the new mean of the mutation distribution in line 12. The new evolution path is updated (line 13) from the change of the mean vector and represents an estimate of descent direction. One vector among m vectors is selected and replaced by the new in UpdateSet() procedure (see Section 4.2). All inverses of evolution path vectors that are at least as recent as the direction vector to be replaced should be recomputed in the UpdateInverses() procedure as described in Section 4.1. The step size is updated according to the PSR rule (see Section 4.5).
5 Experimental Validation
The performance of the LM-CMA is compared to those of the L-BFGS (Liu and Nocedal, 1989), the active CMA-ES (Hansen and Ros, 2010), and the VD-CMA (Akimoto et al., 2014). The sep-CMA-ES is removed from the comparison because of its similar but inferior performance with respect to the VD-CMA, observed both in this study and by Akimoto et al. (2014).
The L-BFGS implemented in the MinFunc library (Schmidt, 2005) is used in its default parameter settings,1 the active CMA-ES (aCMA) without restarts in its default parametrization of CMA-ES MATLAB code version 3.61.2 For faster performance in terms of CPU time, the VD-CMA was (exactly) reimplemented in C language from the MATLAB code provided by the authors. For the sake of reproducibility, the source code of all algorithms is available online.3 The default parameters of LM-CMA are given in Algorithm 7.
A set of benchmark problems is used (see Table 1), as is common in evolutionary computation, more specifically in the BBOB framework (Hansen et al., 2010). Many problems are missing, including the ones where tested methods and LM-CMA fail to timely demonstrate reasonable performance in large-scale settings. Algorithm performance is gauged with respect to the number of function evaluations used to reach target values of f, CPU time spent per function evaluation, and the number of memory slots required to run algorithms. Any subset of these metrics can dominate search cost in large-scale settings, while in low-scale settings memory is typically of lesser importance.
|Name .||Function .||Target .||Init .||.|
|Rotated ellipsoid||fRotElli(x) = fElli(Rx)||3|
|Rotated Rosenbrock||fRotRosen(x) = fRosen(Rx)||3|
|Rotated discus||fRotDiscus(x) = fDiscus(Rx)||3|
|Rotated cigar||fRotCigar(x) = fCigar(Rx)||3|
|Rotated different powers||fRotDiffPow(x) = fDiffPow(Rx)||3|
|Name .||Function .||Target .||Init .||.|
|Rotated ellipsoid||fRotElli(x) = fElli(Rx)||3|
|Rotated Rosenbrock||fRotRosen(x) = fRosen(Rx)||3|
|Rotated discus||fRotDiscus(x) = fDiscus(Rx)||3|
|Rotated cigar||fRotCigar(x) = fCigar(Rx)||3|
|Rotated different powers||fRotDiffPow(x) = fDiffPow(Rx)||3|
This section examines the scalability of the proposed algorithm with respect to the existing alternatives. While both the computational time and space complexities scale moderately with problem dimension, the algorithm is able to preserve certain invariance properties of the original CMA-ES. Moreover, unexpectedly good results are obtained on some well-known benchmark problems, for instance, linear scaling of the budget of function evaluations to solve separable and rotated ellipsoid problems. The performance of LM-CMA is comparable to that of L-BFGS with exact estimation of gradient information. Importantly, LM-CMA is competitive with L-BFGS in very large-scale (for derivative-free optimization) settings with 100,000 variables. Finally, LM-CMA is investigated with respect to its sensitivity to some key internal parameters, such as the number of stored direction vectors m.
5.1 Space Complexity
The number of memory slots required to run optimization algorithms versus problem dimension n, referred to as space complexity, can limit applicability of certain algorithms to large-scale optimization problems. Here, the number of slots are listed up to constant and asymptotically constant terms.
The algorithms store generated solutions (LM-CMA, VD-CMA, and aCMA with the default and , L-BFGS with ) and some intermediate information (LM-CMA and L-BFGS with m pairs of vectors, aCMA with at least two matrices of size ) to perform the search. The implementation of VD-CMA requires slots compared to of the original MATLAB code. The LM-CMA requires slots, the L-BFGS requires slots, and aCMA requires slots.
Figure 1 shows that because of its quadratic space complexity aCMA requires about slots (respectively, slots) for 10,000-dimensional (respectively, 20,000-dimensional) problems, which with 8 bytes per double-precision floating-point numbers would correspond to about 1.6 GB (respectively, 6.4 GB) of computer memory. This simply precludes the use of CMA-ES and its variants with explicit storage of the full covariance matrix or Cholesky factors to large-scale optimization problems with . LM-CMA stores m pairs of vectors as well as L-BFGS. For (as the default population size in CMA-ES), L-BFGS is two times and LM-CMA is three times more expensive in memory than VD-CMA, but they all basically can be run for millions of variables.
This article argues that additional memory can be used while allowed and at no cost. Thus, while the default , is used if memory allows (see Section 5.7). In general, the user can provide a threshold on memory and if, for instance, by using this memory threshold is violated, the algorithm automatically reduces m to a feasible mf.
5.2 Time Complexity
The average amount of CPU time internally required by an algorithm per evaluation of some objective function (not per algorithm iteration), referred to as time complexity, also can limit applicability of certain algorithms to large-scale optimization problems. They simply can be too expensive to run, for example, much more expensive than performing function evaluations.
Figure 2 shows how fast CPU time per evaluation scales for different operations measured on a 1.8 GHz processor of an Intel Core i7-4500U. Scalar-vector multiplication of a vector with n variables scales linearly at about seconds; evaluation of the separable ellipsoid function is two to three times more expensive if temporary data is used. Sampling of n normally distributed variables scales as about 60 vectors-scalar multiplications, which defines the cost of sampling of unique candidate solutions of many evolution strategies such as separable CMA-ES and VD-CMA. However, the sampling of variables according to the Rademacher distribution is about ten times cheaper. The use of mirrored sampling also decreases the computational burden without worsening the convergence. Finally, the internal computation cost of LM-CMA scales linearly as about 25 scalar-vector multiplications per function evaluation. It is much faster than the lower bound for the original CMA-ES, defined by one matrix-vector multiplication required to sample one individual. The cost of one matrix-vector multiplication is about scalar-vector multiplications; the overhead is probably due to access to matrix members.
The improved version of LM-CMA is about ten times faster internally than the original version (Loshchilov, 2014) because of the use of mirrored sampling, the Rademacher sampling distribution, and sampling with instead of m direction vectors both for and . For 8,192-dimensional problems, it is about 1,000 times faster internally than CMA-ES algorithms with the full covariance matrix update (the cost of Cholesky-CMA-ES is given in Loshchilov, 2014).
5.3 Invariance under Rank-Preserving Transformations of the Objective Function
The LM-CMA belongs to a family of so-called comparison-based algorithms. The performance of these algorithms is unaffected by rank-preserving (strictly monotonically increasing) transformations of the objective function, whether the function f, f3, or is minimized (Ollivier et al., 2011). Moreover, this invariance property provides robustness to noise insofar as this noise does not impact a comparison of solutions of interest (Auger and Hansen, 2013).
In contrast, gradient-based algorithms are sensitive to rank-preserving transformations of f. While the availability of gradient information may mitigate the problem that objective functions with the same contours can be solved with a different number of functions evaluations, the lack of gradient information forces the user to estimate it with approaches whose numerical stability is subject to scaling of f. Here, L-BFGS is simulated in an ideal black box scenario when gradient information is estimated perfectly (exact gradients are provided) but at the cost of function evaluations per gradient, which corresponds to the cost of the forward difference method. Additionally, the performance of L-BFGS is investigated with the central difference method ( evaluations per gradient), which is twice more expensive but numerically more stable. This method is denoted as CL-BFGS.
5.4 Invariance under Search Space Transformations
Invariance properties under different search space transformations include translation invariance, scale invariance, rotational invariance, and general linear invariance under any full rank matrix when the algorithm performance on and is the same given that the initial conditions of the algorithm are chosen appropriately (Hansen et al., 2011). Thus, the lack of the latter invariance is associated with a better algorithm performance for some and worse for others. In practice, it often appears to be relatively simple to design an algorithm specifically for a set of problems with a particular , for instance, identity matrix, and then demonstrate its good performance. If this set contains separable problems, problems where the optimum can be found with a coordinatewise search, then even on highly multimodal functions, great results can be easily achieved (Loshchilov et al., 2013). Many derivative-free search algorithms in one or another way exploit problem separability and fail to demonstrate a comparable performance on rotated versions of the same problems. This would not be an issue if most real-world problems were separable; this is, however, unlikely to be the case and some partial separability or full nonseparability is more likely to be present.
The original CMA-ES is invariant with respect to any invertible linear transformation of the search space, , if the initial covariance matrix and the initial search point(s) are transformed accordingly (Hansen, 2006). However, the matrix is often unknown (otherwise, one could directly transform the objective function) and cannot be stored in memory in large-scale settings with . Thus, the covariance matrix adapted by LM-CMA has at most rank m, and so the intrinsic coordinate system cannot capture a full rank matrix entirely. Therefore, the performance of the algorithm on compared to depends on . However, in the experiments, differences in performance on axis-aligned and rotated ill-conditioned functions were marginal.
Here, LM-CMA, aCMA, L-BFGS, CL-BFGS are tested both on separable problems and their rotated versions (see Table 1). It is simply intractable to run algorithms on large-scale rotated problems with because of the quadratic cost of involved matrix-vectors multiplications (see Figure 2). Fortunately, there is no need to do this for algorithms that are invariant to rotations of the search space, since their performance is the same as on the separable problems, whose evaluation is cheap (linear in time). Figures 3 and 4 show that the performance of aCMA on 512-dimensional (the dimensionality still feasible to perform full runs of aCMA) separable and rotated problems is very similar, and the difference (if any) is likely due to a noninvariant initialization. The invariance in performance is not guaranteed but rather observed for LM-CMA, L-BFGS, and CL-BFGS. However, the performance of the VD-CMA degrades significantly on fRotElli, fRotDiscus, and fRotDiffPow functions because of the restricted form of the adapted covariance matrix of Equation (9). Both fRosen, fCigar and their rotated versions can be solved efficiently, since they have a Hessian matrix whose inverse can be well approximated by Equation (9) (Akimoto et al., 2014).
An important observation from Figures 3 and 4 is that even the exact gradient information is not sufficient for L-BFGS to avoid numerical problems that lead to an imprecise estimation of the inverse Hessian matrix and premature convergence on fElli and fRotElli. The L-BFGS with the central difference method (CL-BFGS) experiences an early triggering of stopping criteria on fRosen and fDiffPow. While numerical problems due to imprecise derivative estimations are quite natural for L-BFGS, especially on ill-conditioned problems, one can assume that with a better implementation of the algorithm (e.g., with high-precision arithmetic) one could obtain a more stable convergence. Therefore, one extrapolates the convergence curves of L-BFGS and CL-BFGS toward the target after removing the part of the curve that clearly belongs to the stagnation, namely, on fElli.
5.5 Scaling with Problem Dimension
The performance versus the increasing number of problem variables is given in Figure 5. The results of VD-CMA are excluded on some problems because, as can be seen from Figures 3 and 4, the algorithm does not find the optimum with a reasonable number of function evaluations, or it convergences prematurely. For algorithms demonstrating the same performance on separable and nonseparable problems (see Figures 3 and 4), Figure 5 plots results from separable problems as they were obtained on rotated problems, to avoid possible misunderstanding from designers of separability-oriented algorithms.
The results suggest that L-BFGS shows the best performance; this is not surprising given the form of the selected objective functions (see Table 1). Also keep in mind that the exact gradients were provided and this still led to premature convergence on some functions (see Figures 3 and 4). In the black box scenario, one would probably use L-BFGS with the forward or central (CL-BFGS) difference methods. The latter is often found to lead to a loss by a factor of 2 (as expected because of versus cost per gradient), except for the fRotDiffPow, where the loss is increasing with problem dimension.
Quite surprisingly, the LM-CMA outperforms VD-CMA and aCMA on fSphere. This performance is close to the one obtained for (1+1) evolution strategy with optimal step size. Bad performance on fSphere anyway would not directly mean that an algorithm is useless but could illustrate its performance in the vicinity of local optima when variable-metric algorithms (e.g., CMA-like algorithms) perform an isotropic search with respect to an adapted internal coordinate system. The obtained results are mainly due to the population success rule, which deserves independent study similar to the one by Hansen et al. (2014). A few key points of the PSR are mentioned here. By design, depending on the target success ratio , one can get either biased (for ) or unbiased (for ) random walk on random functions. It would be a bias to say that either biased or unbiased change of is better on random functions, since the latter depends on the context. Because the (weighted) mean of each new population is computed from the best out of individuals, the individuals of the new generation are typically as good as the (weighted) best individuals of the previous one, and thus if , one may expect from Equation (16). Typically, it is reasonable to choose lower-bounded by 0 because of random functions and upper-bounded by 0.5 because of linear functions. This study uses 0.3, which lies roughly in the middle of the interval. It is important to mention a striking similarity with the 1/5th success rule (Schumer and Steiglitz, 1968; Rechenberg, 1973). PSR can be considered its population-based version.
The performance of LM-CMA on fElli is probably the most surprising result of this work. In general, the scaling of CMA-ES is expected to be from superlinear to quadratic with n on fElli, since the number of parameters of the full covariance matrix to learn is (Hansen and Ostermeier, 2001). While aCMA demonstrates this scaling, LM-CMA scales linearly albeit with a significant constant factor. The performance of both algorithms coincides at ; then LM-CMA outperforms aCMA (given that the extrapolation is reasonable) with a factor increasing with n. It should be noted that aCMA is slower in terms of CPU time per function evaluation by a factor of (see Figure 2). Another interesting observation is that the L-BFGS is only slightly faster than LM-CMA, while CL-BFGS is actually outperformed by the latter. An insight to these observations can be found in Figure 3, where both LM-CMA and L-BFGS outperform aCMA by a factor of 10 in the initial part of the search, while aCMA compensates this loss by having the covariance matrix well adapted, which allows accelerating convergence close to the optimum. This might be explained as follows. A smaller number of internal parameters defining the intrinsic coordinate system can be learned faster and with greater learning rates; this allows a faster convergence but may slow down the search in vicinity of the optimum if the condition number cannot be captured by the reduced intrinsic coordinate system.
The LM-CMA is better or as good as VD-CMA on fRosen and fCigar, where it is expected to be outperformed by the latter because of presumably few principal components needing to be learned to solve these problems. The scaling on fRosen suggests that the problem is more difficult (e.g., more difficult than fElli) than one could expect, mainly because of an adaptation of the intrinsic coordinate system required while following the banana shaped valley of this function.
The results on 100,000-dimensional problems (see Figure 6) show that LM-CMA outperforms L-BFGS on the first 10n–20n function evaluations, which corresponds to the first 10–20 iterations of L-BFGS. This observation suggests that LM-CMA can be viewed as an alternative to L-BFGS when n is large and the available number of function evaluations is limited. While it can provide a competitive performance in the beginning, it is also able to learn dependencies between variables to approach the optimum.
5.6 Performance on a Nonsmooth Variant of Nesterov’s Function
This function is nonsmooth (though locally Lipschitz) as well as nonconvex; it has Clarke stationary points (Overton, 2015). Overton (2015) showed that for , BFGS starting from 1,000 randomly generated points finds all 16 Clarke stationary points (for the definition of Clarke stationary points, see Abramson and Audet, 2006), and the probability of finding the global minimizer is only by about a factor of 2 greater than of finding any of the Clarke stationary points. This probability dropped by a factor of 2 for while the number of Clarke stationary points doubled (Overton, 2015). Clearly, the problem becomes extremely difficult for BFGS when n is large.
LM-CMA and CL-BFGS (L-BFGS performed worse) were run on for and . Figure 7 shows that CL-BFGS performs better than LM-CMA; however, both algorithms in default settings, and with restarts do not perform well. Both LM-CMA and CL-BFGS were tuned but the results reported are only for LM-CMA, since the performance of CL-BFGS was not improved by more than one order of magnitude of the objective function value. The tuned parameters for LM-CMA are (1) doubled population size , (2) increased learning rate by 15 to , and (3) an extremely small learning rate for step size adaptation instead of . The last modification is probably the most important; practically, it defines the schedule of how step size decreases. A similar effect can be achieved by reducing or increasing . Faster learning of dependencies between variables and slower step size decrease drastically improve the convergence, and the problem can be solved both for and (Figure 7). Its interesting that the number of function evaluations scales almost linearly with problem dimension.
Tuning of CL-BFGS was expected to lead to similar improvements. Surprisingly, attempts to modify its parameters, often in order to slow down the convergence (e.g., type and number of line-search steps, Wolfe conditions parameters) failed. Certain modifications should improve CL-BFGS, and thus this question is left open. The settings tuned for function differ significantly from the default ones. It is of great interest to find an online procedure to adapt them. The next section is aimed at gaining some intuition on parameter importance in LM-CMA.
5.7 Sensitivity to Parameters
The black box scenario implies that the optimization problem at hand is not known; it is therefore hard if even possible to suggest a ”right” parametrization of the algorithm that works best on all problems. Offline tuning in large-scale optimization is also computationally expensive. It is rather optimistic to believe that one always can afford enough computational resources to run algorithms till reaching the optimum on very large real-world optimization problems. Nevertheless, one can focus on this scenario in order to gain an understanding about scalability on benchmark problems.
Experience with parameter selection by exclusion of nonviable settings suggests that there exists a dependency between the population size , number of stored vectors m, the target temporal distance between them Nsteps, the learning rate cc for the evolution path, and learning rate c1 for the Cholesky factor update. The main reason for this is that all of them impact how well the intrinsic coordinate system defined by the Cholesky factor reflects the current optimization landscape. A posteriori, if , it seems reasonable to store vectors with a temporal distance on the order of on problems where a global coordinate system is expected to be constant, for example, on a class of problems described by the general ellipsoid model (Beyer, 2014). The learning rate for the evolution path is related to both m and n; here it is set to , which is roughly inversely proportional to the (if affordable) suggested . The chosen cc was found to be still valid for the default ; there is no good interpretation for the learning rate . In general, one cannot strongly argue for some parameter settings versus others, since as mentioned, they are problem-dependent. A more appropriate approach would be to perform online adaptation of hyperparameters, as implemented for the original CMA-ES by Loshchilov et al. (2014).
The following analysis for m directly affects the amount of memory required to run the algorithm and thus is of special interest, since the user might be restricted in memory on very large-scale optimization problems with . Figure 8 shows that the greater the m, the better the performance. The results obtained for the default (the results demonstrated in the previous sections) can be improved with . The improvements are especially pronounced on fDiscus functions, where the factor is increasing with n and the overall cost to solve the function reaches the one extrapolated for aCMA at (see Figure 5). It is surprising to observe that and even are sufficient to solve fElli, fDiscus, and fDiffPow. The latter is not the case for fCigar, where small values of m lead to an almost quadratic growth of runtime. The overall conclusion would be that on certain problems the choice of m is not critical, while greater values of m are preferable in general.
For the eigenspectrum of the covariance matrix constructed as from the Cholesky factor , the results for single runs on different 1,024-dimensional functions and for different m are shown in Figure 9. The evolution of the eigenspectrum during the run is shown by gradually darkening blue lines with increasing t. Clearly, the number of eigenvalues is determined by m. The profiles, for instance, the one of fCigar, also reflect the structure of the problems (see Table 1). The greater the m, the greater condition number can be captured by the intrinsic coordinate system, as can be seen for fElli, fDiscus, and fDiffPow, which in turn leads to better performance. However, this is not always the case, as can be seen for fRosen, which again demonstrates that optimal hyperparameter settings are problem-dependent.
This work adapts an idea from derivative-based optimization to extend best-performing evolutionary algorithms such CMA-ES to large-scale optimization. This allows reducing the cost of optimization in terms of time by a factor of and memory by a factor between and n. It also often reduces the number of function evaluations required to find the optimum. Storing a limited number of vectors and using them to adapt an intrinsic coordinate system is not the only way, but one of probably very few ways, to efficiently search in large scale continuous domains. Two quite similar alternatives are proposed: (1) the storage of points and a later estimation of descent directions from differences of these points, and (2) the use of a reduced matrix as in Knight and Lunacek (2007) but with a modified sampling procedure to obtain linear time complexity as proposed for the adaptive coordinate descent (Loshchilov, 2013b).
The use of the population success rule is optional, and alternative step size adaptation procedures can be applied. However, the rule’s similarity to the 1/5th rule is interesting. The procedure does not make any assumptions about the sampling distribution; this allows using the Rademacher distribution. When n is large, the sampling from an n-dimensional Rademacher distribution resembles the sampling from an n-dimensional Gaussian distribution, since the probability mass of the latter is concentrated in a thin annulus of width at radius .
The comparison presented shows that LM-CMA outperforms other evolutionary algorithms and is comparable to L-BFGS on nontrivial large-scale optimization problems when the black box (derivative-free) scenario is considered. Clearly, the black box scenario is a pessimistic scenario, but a substantial part of works that use finite difference methods for optimization deal with this scenario, and thus LM-CMA can be considered as an alternative. Important, LM-CMA is invariant to rank-preserving transformations of the objective function and therefore is potentially more robust than L-BFGS. The results shown in Figure 7 suggest that the use of a smaller number of direction vectors m can still be efficient; that is, more efficient algorithms, for instance, with adaptive m (or an adaptive transformation matrix) can be designed. It seems both promising and feasible to extend the algorithm to constrained, noisy, or multiobjective optimization, the domains that are both hardly accessible for L-BFGS and keenly demanded by practitioners. As an important contribution to success in this direction, it would be helpful to implement online adaptation of internal hyperparameters as already implemented in the original CMA-ES (Loshchilov, 2014). This would ensure an additional level of invariance and robustness on large-scale black box optimization problems.
I am grateful to Michèle Sebag and Marc Schoenauer for many valuable discussions and insights. I also would like to thank Oswin Krause, Youhei Akimoto, and the anonymous reviewers whose interest and valuable comments helped to improve this work.