## Abstract

Estimations of distribution algorithms (EDAs) are a major branch of evolutionary algorithms (EA) with some unique advantages in principle. They are able to take advantage of correlation structure to drive the search more efficiently, and they are able to provide insights about the structure of the search space. However, model building in high dimensions is extremely challenging, and as a result existing EDAs may become less attractive in large-scale problems because of the associated large computational requirements. Large-scale continuous global optimisation is key to many modern-day real-world problems. Scaling up EAs to large-scale problems has become one of the biggest challenges of the field. This paper pins down some fundamental roots of the problem and makes a start at developing a new and generic framework to yield effective and efficient EDA-type algorithms for large-scale continuous global optimisation problems. Our concept is to introduce an ensemble of *random projections* to low dimensions of the set of fittest search points as a basis for developing a new and generic divide-and-conquer methodology. Our ideas are rooted in the theory of random projections developed in theoretical computer science, and in developing and analysing our framework we exploit some recent results in nonasymptotic random matrix theory.

## 1 Introduction

Estimations of distribution algorithms (EDAs) are population-based stochastic blackbox optimisation methods that are recognised as a major paradigm of evolutionary computation (EC) (Larrañaga and Lozano, 2002). Unlike most traditional EC approaches, which have no explicit mechanism to take advantage of any correlation structure in the sample of high-fitness individuals, EDAs guide the search for the global optimum by estimating the distribution of the fittest sample and drawing new candidates from this distribution. One of the unique advantages stemming from this approach is that the parameter estimates in EDA are often interpretable and may shed light on the problem structure.

However, it has been widely observed that as the search space dimensionality increases, model building becomes more difficult and declines in effectiveness (Omidvar and Li, 2011; Dong et al., 2013). Indeed, attempts to use the full power of multivariate model building, such as the estimation of multivariate normal algorithm (EMNA), when the search space exceeds 50–100 dimensions, have been scarce. The current practice of EDA most often resorts to independence models or models with some limited dependency structure (Wang and Li, 2008; Bosman et al., 2013; Dong et al., 2013; Ros and Hansen, 2008) in exchange for feasibility when the problem is high-dimensional. Some authors employ univariate heavy tail search distributions; for example, Wang and Li (2008) propose a univariate EDA (UMDAc) with Gaussian and Lévy search distribution for large-scale EDA, and while this improves the exploration ability to some extent, a univariate model unfortunately means that nonseparable problems cannot be tackled adequately—a fact both proved theoretically (Mühlenbein and Mahnig, 1999; Larrañaga and Lozano, 2002) and shown experimentally (Echegoyen et al., 2011).

More refined univariate methods are sep-CMA-ES (Ros and Hansen, 2008) and the univariate version of AMaLGaM (Bosman, 2009); these only estimate the diagonal entries of the sample covariance matrix to reduce the search cost of model building, although in a different way than UMDAc does. By construction, these methods are aimed at dealing with dimensionwise separable problems, and this serves as an approximation of nonseparable problems with few dependencies. In practice, these independence factorisations turn out to be more effective than fully dependent models in high dimensions (Ros and Hansen, 2008; Bosman, 2009), because estimating a reliable fully dependent model requires considerably larger population sizes, which then consumes the budget of function evaluations rapidly, in addition to an increased per-generation time and space complexity. In a different vein, L-CMA-ES (Knight and Lunacek, 2007) addresses the latter issue and obtains savings in terms of the time and the space complexity over the full covariance CMA-ES by employing a limited memory version. However, this does not reduce (but slightly increases) the number of function evaluations taken to reach a target value.

Large-scale continuous optimisation problems are one of the most important concerns in evolutionary computation research because they appear in many real-world problems (Tang et al., 2009; Molina et al., 2010; Omidvar and Li, 2011) such as data mining and biocomputing (Sun et al., 2012), robotics and computational vision (Simonyan et al., 2014). Competitions are organised each year at major conferences, notably the Congress on Evolutionary Computation (CEC), to promote research in this area. Indeed, many optimisation methods suffer from the curse of dimensionality and deteriorate quickly when the search space dimension increases to the thousands. The current target at these competitions is difficult nonseparable problems on 1,000-dimensional search spaces. The state-of-the-art best performers are EC methods that use cooperative co-evolution (Yang et al., 2008a), multilevel co-evolution (Yang et al., 2008b), and hybrid methods that include local searches (Molina et al., 2010). EDA approaches did not yet feature in these competitions.

Our motivation in this work is as follows. It is often infeasible to obtain the exact solution to complicated high-dimensional nonseparable problems. Hence, it is desirable to develop alternative approaches with differing search biases that are able to obtain approximate solutions with a limited budget. By reducing the degrees of freedom in the parametrisation of the search distribution, the preceding methods were able to achieve a better scaling in terms of the search costs, and various ways of doing this induce different search biases; for instance, UMDAc has a bias for separable problems, EDA-MCC (Dong et al., 2013) has a bias for block-diagonal dependency structures; sep-CMA-ES has a bias for separable problems. Each method is more likely to succeed on problems that are similar or not too different from their own search biases.

What these methods have in common in their way of restricting the covariance is a binary decision-making step by which they estimate certain dependencies and neglect certain others. The EDA-type approach we develop in this paper keeps with the general idea of reducing the degrees of freedom of the covariance, but without making such binary decisions on any of the individual dependencies. Instead of dependency selection, we use compression. More specifically, we simply work with a combination of compressed versions of EMNA’s full covariance estimate. This avoids rank deficiency and misestimation of the covariance when the sample size is small; it is by construction invariant to rotations of the search space; and from the perspective of optimisation it creates a different kind of search bias by which separable problems are no longer necessarily the easy type but in turn some of the more sophisticated nonseparable problems may become more manageable. Our goal is to find approximate solutions to difficult problems within a limited budget of function evaluations, and we demonstrate the effectiveness of our approach on the CEC’10 competition benchmark suite (Tang et al., 2009) that was designed with this goal in mind, mimicking real-world scenarios. In addition, despite our building on EMNA, the time complexity per generation of our method is only quadratic in the problem dimension whereas EMNA’s is cubic. Moreover, our approach lends itself naturally to parallel implementation, since the problem of estimating the search distribution can be split over several cores.

Section 2 discusses some fundamentals of the problem of model estimation in high-dimensional probability spaces that motivated our approach. Section 3 introduces our new approach along with an analysis of its working. In Section 4 we demonstrate that this approach is competitive with the state-of-the-art experiments on a battery of test functions. Section 5 concludes the paper. A preliminary version of this work appeared in Kabán et al. (2013). The MATLAB code for our present work is available from http://www.cs.bham.ac.uk/∼axk/rpm.zip

## 2 The Challenges of Model Estimation in High Dimensions

*d*-dimensional Euclidean ball with centre and radius . By definition, is the probability that a draw from the search distribution parametrised by and falls in the -neighbourhood of the global optimum.

In EMNA, the parameters and are maximum likelihood estimates (MLE) from selected search points of the population. Hence is a matrix-valued random variable, that is, a random matrix. Analytic tools are available from random matrix theory (RMT) to analyse random matrices, which were also used in statistics to analyse covariance estimation problems (Vershynin, 2012b; Srivastava and Vershynin, 2013) and which previously have never been exploited in EDA optimisation.

We start by noting that the eigenvalues of the covariance estimate used in EDA to generate the new generation of individuals play the role of some learning rates for the optimisation process. This observation is certainly not new; the widely successful covariance matrix adaptation evolutionary strategy (CMA-ES) (Hansen, 2006) builds on this observation also. Here we use this observation to highlight what goes wrong with EMNA in high dimensions, which then opens up new options to deal with the problem with the use of new tools.

*U*denotes the

_{i}*i*th eigenvector of , and is its associated eigenvalue.

if , then the probability in Eq. (1) is an increasing function of ,

if , then the probability in Eq. (1) is a decreasing function of ,

*i*th eigenvalue of is the squared length of the projection of onto the corresponding eigendirection, namely, . In other words, when , then the probability (1) of drawing a point in the -neighbourhood of can be increased by increasing . On the other hand, when , then (1) can be increased by decreasing . Hence the eigenvalues of play the role of learning rates in Gaussian EDA, and good estimates of these eigenvalues are essential.

Unfortunately, as is well known from RMT, in small sample conditions the smallest eigenvalue is *severely* underestimated, while the largest eigenvalue is overestimated. An example is shown in Figure 1, where we generated 100 points from a 100-dimensional Gaussian with identity covariance, yet the sample covariance of those 100 points has eigenvalues ranging all the way from close to 0 to around 4.

The extent of this misestimation is well understood in RMT, and indeed this theory has given rise to new methods of covariance estimation (Marzetta et al., 2011; Durrant and Kabán, 2015) that are able to remedy the problem effectively even when is singular, using an ensemble of random projections of the covariance estimate. In this work we make extensive use of these results.

These recent RMT-based covariance estimation methods have been found to have certain advantages, for example, in comparison with the Ledoit-Wolf estimator (Marzetta et al., 2011) in terms of approximating the true covariance; they also performed better in data classification (Durrant and Kabán, 2015). Furthermore, these RMT-based methods do not impose a predefined and possibly unjustified structural constraint such as sparsity, diagonality, block-diagonal structure, or limited dependency structure. Instead the reduction of the degrees of freedom comes from exploiting randomised compressions of the maximum likelihood covariance estimate. Moreover, for EDA-type search, these new estimators also have computational advantages, since we only have to sample from a number of small-dimensional multivariate Gaussians. Hence our approach lends itself to parallel implementation that fits well with the algorithmic structure of population-based search, which could potentially be further exploited when the problem scale requires it.

## 3 Approach to Large-Scale Stochastic Optimisation

The main goal of this paper is to develop a new approach to large-scale stochastic optimisation in application to EDA that is effective and computationally efficient in finding approximate solutions to difficult problems with limited search costs. The term *difficult* is of course relative; here we use it to refer to problems that cannot be solved by existing specialised optimisation methods and that have not yet been solved to a satisfactory extent by other existing heuristic optimisation methods.

Building on recent results in other areas, our concept is to introduce an ensemble of random projections that reduce the dimensionality of the fittest high-dimensional search points. Random projection (RP) is often termed a nonadaptive dimensionality reduction method, since it projects the data in directions that are uniformly randomly chosen (independently of the data being projected, as opposed to, for instance, principal component analysis, where the projection directions are determined as a function of the data). Perhaps surprisingly for common intuition, RPs enjoy nice theoretical properties, most notably a high probability guarantee of low distortion of the Euclidean geometry, and the reduced space makes subsequent computations, estimation, and sampling easier. This approach provides us a basis for developing a new and generic divide-and-conquer methodology rooted in the theory of RPs and exploiting recent advances of nonasymptotic random matrix theory and related fields.

At a high level, the rationale is as follows:

Random matrices that satisfy the Johnson-Lindenstrauss lemma (JLL) (Dasgupta, 1999) are approximate isometries. Hence, with appropriate choice of the target dimension, important structures such as Euclidean distances and dot products are approximately preserved in the reduced space. This makes it possible to capture correlations between the

*d*-dimensional search variables in the -dimensional space.In the low-dimensional projection space the distribution of the projected points can become “more Gaussian” as a consequence of the central limit theorem, in a sense made precise by Diaconis and Freedman (1984). Also, both parameter estimation and sampling become feasible and computationally affordable, so there is no need to overly restrict the parametric form of the search distribution and its covariance matrix.

There is a natural smoothing effect that emerges when appropriately combining the ensemble of estimates from several random subspaces (Mahoney, 2011; Marzetta et al., 2011; Durrant and Kabán, 2015). This ensures that the exploration ability of the search distribution can be maintained even with small population sizes.

Random projections have been used in approximation theory since the 1970s (Lorentz et al., 1996). In computer science, information theory, signal processing, and machine learning, random matrices provide a mechanism for dimensionality reduction while preserving the essential information in the data (Vempala, 2004). Compared with other methods in that context, they lead to (1) faster algorithms that are (2) simpler to analyse, (3) lend themselves to parallel implementation, and (4) exhibit robustness. The reader may refer to the recent review by Mahoney (2011). We aim to exploit these characteristics for high-dimensional optimisation.

### 3.1 New Search Operators for EDA

Let be a random matrix with entries drawn i.i.d. from a univariate Gaussian . When *d* is large, as a consequence of the measure concentration phenomenon in high dimensions, the rows of this matrix are almost orthogonal and have a Euclidean norm close to their expected value, which is (Dasgupta, 1999; Vempala, 2004). So if we choose , then *R* well approximates an orthonormal matrix to project from to , where *k* may be chosen much lower than *d*.

Further, let be a point in the search space. Denote by the unique affine subspace parallel to *R* that passes through *x*_{0}. We define new search operators as follows:

takes a (random) , an , and a sample , and projects onto , that is, returns .

takes a sample that lives in a subspace and computes the maximum likelihood parameter estimates (for Gaussian search distribution, ) of the search distribution , which is with respect to (w.r.t.) the restriction of the Lebesgue measure to the

*k*-dimensional affine subspace .takes parameter estimates obtained by and returns a sample of

*N*points drawn i.i.d. from with parameters . These points will live in a*k*-dimensional affine subspace of the search space.takes populations from several

*k*-dimensional subspaces and returns a population that lives in the full search space .

Using these operators, the high-level outline of our meta-algorithm is as follows:

Initialise population by generating

*N*individuals uniformly randomly.Let be the fittest individuals from .

For () randomly oriented (affine) -dimensional subspaces

Project onto .

Produce

*N*new individuals on the subspace using the sequence ; .

Create the new population using .

If stopping criteria are met then Stop; else Goto 2.

We instantiate this by taking the translation vector *x*_{0} of the consecutive set of subspaces (in consecutive generations) to be the mean of in the previous generation. Further, in this work we instantiate the operator as a scaled average of the individuals produced on the individual subspaces. Note, this simple combination scheme makes no appeal to fitness evaluation within subspaces.

The scaling just mentioned is important. An orthogonal projection from to shortens the lengths of vectors by a factor of , and averaging *M* i.i.d. points reduces their standard deviation by a factor of , hence a scaling factor of is needed to recover the original scale. This is the case when the entries of *R* were drawn with variance . With generic the appropriate scaling is .

In Algorithm 1, we now present the specific algorithm, which is an instantiation of the module for creating the new generation (steps 3 and 4 of the foregoing list).

Note that in practice the loop in step 3 of Algorithm 1 can be split over multiple cores, since each random subspace is both generated and sampled from, independently of all the others.

The working of this method is illustrated in Figure 2, with the caveat that high-dimensional geometry is hard to capture on a 2D figure and should be read as follows. In large-scale problems, in order to remain search cost effective, we would often like to work with a population size that is no larger than the problem dimensionality (Dong et al., 2013), since a large population size would consume the search budget rapidly. Then the number of fit points becomes smaller than the dimension of the search space *d*; hence the fit individuals live in the -dimensional subspace of the search space determined by their span. The leftmost plot in Figure 2 illustrates a situation where some points live in a subspace (here 1D) of the overall space (here 2D). Hence, the maximum likelihood (ML) covariance estimate of the fit points is singular. Sampling points from a distribution with a singular covariance means that the next generation is confined in the same subspace. The top plot in the central column of Figure 2 illustrates this. Now, to get around this problem, univariate methods like UMDAc impose a diagonality constraint, that is, estimate only the variances. Hence the next generation is allowed in the full search space; however, any connection between the orientation of the fitness density of the parent population and its estimate is lost as a result of neglecting the correlations. This is seen in the second plot in the central column of Figure 2. The remaining plots in the central column show what happens when we use a random projection ensemble. The upper one shows a case where the number of random subspaces is the smallest that still spans the full search space, and the lower one shows a case where a large number of random subspaces are used. In both cases, the fit points are projected onto each of the random subspaces, and a new generation is sampled within each subspace. The new individuals from these multiple worlds are then averaged to give the new generation shown in the rightmost column of Figure 2 together with the ML covariance estimate of this new population. We see that the new ML covariance estimate tends to respect the orientation of the fitness density of the parent population while it also eliminates degeneracy. It is also easy to picture that the probability recovering the correct orientation gets higher as the number of random projections gets larger. This is because the resulting outcome from a very small number of uniformly random directions have a higher variability, whereas this variation diminishes as we add more random directions into the combination.

### 3.2 Analysis of the Algorithm That Creates New Generations

To understand the effect of Algorithm 1, we analyse it by examining the new full-rank search distribution in the original search space that it implicitly implements. We stress, however, that all *estimation* and *sampling* take place in the *k*-dimensional projected spaces , and it is this fact that enables us to finesse both the computational issues associated with sampling in the high-dimensional search space and the degeneracy of the covariance estimate in the search space.

Fix the set of selected fit individuals , and denote by the maximum likelihood estimate of their sample covariance. This covariance estimate is never computed explicitly throughout the algorithm, but it is useful for the theoretical analysis of this section.

Now, it is straightforward to verify that, by construction, conditionally on the matrices , the new population, , obtained by Algorithm 1, is distributed i.i.d. as . However, while is singular when is smaller than *d*, the matrix is almost surely (a.s.) positive definite provided and *R _{i}* are Gaussian with independent entries, or Haar random matrices (i.e., matrices with orthonormal rows, drawn with a uniformly random orientation). So with this minimum number of RPs we already avoid degeneracy and the problem of getting stuck in a bad subspace as a result. Of course, in order to also recover the correct orientation of the covariance, we need to use a large enough number of random projections so that the finite average gets close to its infinite limit. Fortunately, the law of large numbers guarantees that this is feasible, and the concentration of measure for sums of random matrices lets us quantify the gap between them. This analysis is detailed in the next sections. The resulting full

*d*-dimensional covariance matrix provides information about the correlations between the original search variables, and may be used for learning about the problem structure in the usual way as is normally done in EDA.

#### 3.2.1 Infinitely Many Random Projections

*R*are drawn independent and identically distributed (i.i.d.). Therefore, fixing , by the law of large numbers, the ensemble may be thought of as a finite approximation of the expectation and we can understand the effect of the RP ensemble by computing this expectation.

_{i}For Haar random matrices, this expectation was computed by Marzetta et al. (2011).

*where denotes the trace of its argument, and I_{d} is the d-dimensional identity matrix*.

Observe that the covariance estimate in Lemma 1 is times a convex combination of the maximum likelihood covariance estimate and the spherical covariance estimate , since the combination coefficients sum to 1: . Hence this method interpolates between the unconstrained maximum likelihood covariance as in EMNA and the spherically constrained estimate, and the balance between these two components is controlled by the size of *k* relative to *d*. Observe also that the coefficient of the first term is (for a constant *k*), while that of the second term is . So for a constant *k*, the higher the problem dimension, the less weight is put on the full unconstrained maximum likelihood covariance estimate.

Now, from the observations at the beginning of Section 3.1, when *d* is large we may obtain a similar effect from using *R _{i}* with i.i.d. Gaussian entries. The following lemma computes the matrix expectation in Eq. (5) under such Gaussian

*R*, which also turns out to have a closed form. In fact, this matrix expectation is available in closed form for a much larger class of random matrices, too (Kabán, 2014).

_{i}Before starting the proof, we observe that for , we get , which is times a linear combination of the maximum likelihood covariance estimate and the spherical covariance estimate (now the coefficients sum to ), and again the coefficient of the first term is , while that of the second is . The Gaussian *R _{i}* is more convenient to use, since we do not need to orthogonalise its rows, and for large

*d*problems with small

*k*we indeed experienced no difference in their behaviour. However, for experiments aimed at quantifying the effect of

*k*, the Haar matrices are more appropriate to use so that the interpolation effect is captured precisely though the convex combination.

*R*. So we can absorb

*U*into

*R*and have the right-hand side of Eq. (9) further equal to Therefore it is enough to compute with being diagonal.

*r*the

_{i}*i*th column of

*R*, and by the rank of . Then we can rewrite: We first compute the diagonal elements of a generic term of this sum. These have the form . We need to take separately the cases when and when .

Hence we obtain that for diagonal , is a diagonal matrix, and in particular it follows that if is diagonalised as , then *U* also diagonalises .

In consequence, in the limit of the new population returned by Algorithm 1 will be distributed i.i.d. as . Of course, when *M* is finite, the covariance obtained will concentrate around its expectation; hence it will be close to the estimate just computed. This can be quantified precisely using matrix-valued tail bounds (Srivastava and Vershynin, 2013; Ahlswede and Winter, 2002).

#### 3.2.2 Finitely Many Random Projections

Here we bound the deviation of the assembled covariance with finite *M* from its computed expectation. This is summarised in the following result.

Proof: We use the following Ahlswede-Winter-type result from random matrix theory about sums of independent random matrices.

The proof of Theorem 2 is given in the Appendix for completeness.

Observe, we do not have bounded a.s. when *R _{i}* have Gaussian entries, so we cannot apply this result directly. However, this condition can be satisfied by exploiting concentration, as follows.

*R*that are not wiped out by the zeros of . Then we can write , and we can bound this with high probability (w.r.t. the random draws of

_{i}*R*): The following result bounds the largest singular value of a Gaussian matrix with i.i.d. entries:

_{i}Finally, we choose and denote , which yields the statement of Theorem 1.

This analysis shows that we can use a finite number of random subspaces, since we have control over the spectral distance between the resulting finite average of the *d*-dimensional rank-*k* covariances and the infinite limit of this sum. Hence, we may expect a similar behaviour from a finite ensemble, which is pleasing. The practical implication, as mentioned, is that an efficient parallel implementation can be realised where the estimation and sampling within each subspace is run on a separate core.

In closing, we should mention that although we used the truncation method here, a more direct route might exist if the a.s. boundedness condition could be relaxed in Theorem 2. In particular, we see from its proof (in the Appendix) that some suitable alternative to the Taylor expansion–based inequality in Eq. (24), along with the finiteness condition on the variances of the matrix summands, could possibly lead to a variation of Theorem 2 to eliminate the boundedness condition.

The finite sample analysis in Theorem 1 is too crude to give us the minimum order of *M* required for the matrix average to get close to its expectation with a given confidence. If we disregard the truncation step, then equating the right-hand side of Eq. (18) to some given and solving for *M* yields , and noting that , we get . Otherwise a more careful choice for , such as , with a small positive number bounded away from zero, would yield nearly the same order for *M* (raised to a power just slightly larger than 1). However, the oversampling factor is due to the relatively general conditions in the Ahlswede-Winter bound that we used, which holds for sums of generic positive semi-definite matrices and does not exploit the Gaussian (or sub-Gaussian) distribution of the entries of *R*. There are more refined analyses of analogous problems (Vershynin, 2012a) that we believe to be possible to adapt to our case to eliminate this log factor. Based on these considerations it is expected that is required.

### 3.3 Computational Complexity per Generation

Accounting for the cost of *M* different matrix multiplications on points, our approach has time complexity (per generation) of . As mentioned, the finite *M* implementation can be run in parallel on *M* separate cores, which is an advantageous structural property of our algorithm that could be exploited to gain further computational efficiency when needed. However, even on a single core, and taking as discussed, the complexity per generation becomes when . This is indeed a regime where the proposed method yields good performance also (see Section 4). In contrast, other full covariance EDA-type methods such as EMNA, ED-EDA (Dong and Yao, 2008), and CMA-ES (Hansen, 2006) have a time complexity per generation of .

In sum, the net effect of our RP ensemble approach is to get samples from the regularised covariance, without ever computing the maximum likelihood estimate, and without the need to explicitly sample from a covariance, which would be an operation.

### 3.4 Alternative Random Projections Matrices

Our analysis in the previous sections focused on RP matrices with i.i.d. Gaussian entries. These have the advantage of being full row rank a.s., and they made our analysis in Section 3.2.1 more straightforward, but there are several alternatives available in the random projections literature that have a faster matrix multiplication time and still have full rank with very high probability. Of these we mention two.

*R*are drawn i.i.d. as the following: and then normalised to have variance . Using these, the matrix multiplications take roughly one third of the time that they do in the case of the Gaussian RP matrices.

_{ij}Interesting to note, for this particular sparse RP, the limit at happens to be exactly the same as that for the Gaussian RP (see Kabán, 2014, Lemma 2), and one can also obtain a similar concentration guarantee as that we gave for an ensemble of Gaussian RP matrices in Theorem 1.

*R*are and normalised to have variance . So multiplication with this matrix is efficient, since it only involves bit flipping.

_{ij}Again, these binary RP matrices are full rank with high probability. However, it may be verified using Lemma 2 of Kabán (2014) that for this binary RP ensemble the limit at no longer coincides with that of the Gaussian and the sparse RP ensembles. Instead, for certain , the regularisation effect is diminished, which when it happens can reduce the exploration ability of the search. Therefore we may expect the binary RP matrices to perform worse, and we use them mainly to see the robustness of our algorithm to such deviations from the analysis presented earlier.

## 4 Experiments

To test the potential of our idea and the ability of our algorithm to find a near-optimal solution in large-scale problem settings, we tested it on all multimodal functions of the suite of benchmark test functions from the CEC’10 competition on large-scale global optimisation (Tang et al., 2009). There are 12 multimodal functions in this test suite, which were created from shifted and group-rotated versions of three multimodal base functions, and in each case the search space is 1,000-dimensional. The reason we have chosen to focus on multimodal functions is that this is the category where our methodology is expected to provide most benefits.

The test bed was designed to contain a couple of separable problems and a suite of problems with a varying degree of nonseparability to give insights into the behaviour and performance of optimisation methods.

A test function is called separable if its global optimum can be found by optimising it in each of its arguments separately. Otherwise it is called nonseparable. A nonseparable function is called *m*-nonseparable if at most *m* of its arguments are not independent. Finally, a nonseparable function is called fully nonseparable if any two of its arguments are not independent.

The test functions in our study are listed here, and they are all minimisation problems having their global optimal fitness value equal to zero:

Separable functions:

Shifted Rastrigin’s function

Shifted Ackley’s function

Partially separable functions having a small number of variables that are dependent and all the remaining ones independent ():

Single-group shifted and

*m*-rotated Rastrigin’s functionSingle-group shifted and

*m*-rotated Ackley’s functionSingle-group shifted and

*m*-dimensional Rosenbrock’s function

Partially separable functions consisting of multiple independent sub-components, each of which is

*m*-nonseparable (). This category includes two subtypes, -group*m*-nonseparable and -group*m*-nonseparable functions:-group shifted and

*m*-rotated Rastrigin’s function-group shifted and

*m*-rotated Ackley’s function-group shifted and

*m*-dimensional Rosenbrock’s function-group shifted and

*m*-rotated Rastrigin’s function-group shifted and

*m*-rotated Ackley’s function-group shifted and

*m*-dimensional Rosenbrock’s function

A fully nonseparable function:

Rosenbrock’s function

See Tang et al. (2009) for more details on these functions.

### 4.1 Performance Results on the CEC’10 Large-Scale Optimisation Competition Benchmark

We use a simple averaging combination of RP-EDAs as in Algorithm 1. We allow a fixed budget of function evaluations and use a population size of , and the number of retained individuals is set to . We use truncation selection with elitism. We take the random subspace dimension to be , and the number of subspaces is set to . We also experimented with other parameter settings and observed that the results are qualitatively unchanged as long as we set *k* low enough to get reliable covariance estimates from points and large enough to capture sufficient covariance structure. The number of random subspaces *M* must always be set above the minimum of in order to cover the search space, and it is preferable to set it larger so that the finite average that appears in the analysis of covariance construction, Eq. (5), gets closer to the expectation and hence recovers the correct directions of the covariance. Note that a larger *M* does not incur extra function evaluations and increases the per generation time complexity only linearly. Of course, we do not claim optimality of these parameter settings, and indeed this may be problem-dependent in principle. Guidelines with more detailed discussion on how to set these parameters are given in Section 4.2.

Figure 3 gives a visual summary of the results obtained in comparison with the competition winner (Molina et al., 2010)—a fairly sophisticated memetic algorithm based on local search chains—and two other state-of-the-art co-evolutionary methods referenced on the competition’s page, namely DECC-CG (Yang et al., 2008a) and MLCC (Yang et al., 2008b). A statistical analysis and further comparisons are given in Section 4.1.1. The bar chart in Figure 3 shows the averages of the best fitness values (in log scale) from 25 independent runs. We also included results from 25 independent runs of the limiting version of our algorithm, that is, , which we implemented using the analytic expression computed in Eq. (16) (with sampling done in the full *d*-dimensional space). The reason we included this is to assess how our algorithm with a finite *M* deviates from it. For DECC-CG and MLCC we used the results produced with function evaluations quoted from Molina et al. (2010); that is a considerably larger budget than we allowed for our method, as it is interesting to see that our results still compare well to these also.

Thus, we see that our simple RP ensemble EDA algorithm is highly competitive with the best state-of-the-art methods for large-scale optimisation and even slightly outperforms the CEC’10 competition winner on some of the functions on this difficult benchmark. Furthermore, it is worth noticing that the performance with is nearly indistinguishable from that with infinite *M*.

#### 4.1.1 Statistical Analysis and Further Comparisons with State-of-the-Art EDA-Type Methods

In Tables 1–4 we provide a detailed statistical analysis of the results shown in Figure 3, and in addition to the competition winner and the high-ranking co-evolutionary methods we also present further comparisons with recent and state-of-the-art EDA-type methods: EDA-MCC (Dong et al., 2013), sep-CMA-ES (Ros and Hansen, 2008), and AMaLGaM-Univariate (Bosman, 2009) on function T12. For our method we included results obtained with the alternative random projection matrices (see Section 3.4).

. | . | . | . | . | t-tests vs.
. |
---|---|---|---|---|---|

Func. . | Method . | max FE . | Mean . | Std . | . |

T1 | RP-Ens (g) k=3 M=1000 | 6e+05 | 784.21 | 76.017 | |

RP-Ens k=3 M= | 6e+05 | 868.96 | 49.715 | ||

RP-Ens (s) k=3 M=1000 | 6e+05 | 760.74 | 50.835 | ||

RP-Ens (b) k=3 M=1000 | 6e+05 | 818.36 | 40.7 | ||

CEC’10 winner | 6e+05 | 2670 | 163 | ||

DECC-CG | 3e+06 | 1310 | 32.6 | ||

MLCC | 3e+06 | 0.557 | 2.21 | ||

sep-CMA-ES | 3e+06 | 5677.9 | 476.52 | ||

EDA-MCC c=20 N=300 | 6e+05 | 1237 | 1237 | ||

T2 | RP-Ens (g) k=3 M=1000 | 6e+05 | 2.5366e-13 | 4.8104e-15 | |

RP-Ens k=3 M= | 6e+05 | 2.5267e-13 | 3.743e-15 | ||

RP-Ens (s) k=3 M=1000 | 6e+05 | 0.00013352 | 2.367e-06 | ||

RP-Ens (b) k=3 M=1000 | 6e+05 | 0.0001336 | 2.9135e-06 | ||

CEC’10 winner | 6e+05 | 3.84 | 0.213 | ||

DECC-CG | 3e+06 | 1.39 | 0.0973 | ||

MLCC | 3e+06 | 9.88e-13 | 3.7e-12 | ||

sep-CMA-ES | 3e+06 | 21.062 | 0.038944 | ||

EDA-MCC c=100 N=1500 | 6e+05 | 0.2025 | 0.012411 |

. | . | . | . | . | t-tests vs.
. |
---|---|---|---|---|---|

Func. . | Method . | max FE . | Mean . | Std . | . |

T1 | RP-Ens (g) k=3 M=1000 | 6e+05 | 784.21 | 76.017 | |

RP-Ens k=3 M= | 6e+05 | 868.96 | 49.715 | ||

RP-Ens (s) k=3 M=1000 | 6e+05 | 760.74 | 50.835 | ||

RP-Ens (b) k=3 M=1000 | 6e+05 | 818.36 | 40.7 | ||

CEC’10 winner | 6e+05 | 2670 | 163 | ||

DECC-CG | 3e+06 | 1310 | 32.6 | ||

MLCC | 3e+06 | 0.557 | 2.21 | ||

sep-CMA-ES | 3e+06 | 5677.9 | 476.52 | ||

EDA-MCC c=20 N=300 | 6e+05 | 1237 | 1237 | ||

T2 | RP-Ens (g) k=3 M=1000 | 6e+05 | 2.5366e-13 | 4.8104e-15 | |

RP-Ens k=3 M= | 6e+05 | 2.5267e-13 | 3.743e-15 | ||

RP-Ens (s) k=3 M=1000 | 6e+05 | 0.00013352 | 2.367e-06 | ||

RP-Ens (b) k=3 M=1000 | 6e+05 | 0.0001336 | 2.9135e-06 | ||

CEC’10 winner | 6e+05 | 3.84 | 0.213 | ||

DECC-CG | 3e+06 | 1.39 | 0.0973 | ||

MLCC | 3e+06 | 9.88e-13 | 3.7e-12 | ||

sep-CMA-ES | 3e+06 | 21.062 | 0.038944 | ||

EDA-MCC c=100 N=1500 | 6e+05 | 0.2025 | 0.012411 |

t-tests vs. | |||||

Func. | Method | max FE | Mean | Std | |

T3 | RP-Ens (g) k=3 M=1000 | 6e+05 | 1.2397e+07 | 3.1864e+06 | |

RP-Ens k=3 M= | 6e+05 | 1.3202e+07 | 3.1221e+06 | ||

RP-Ens (s) k=3 M=1000 | 6e+05 | 1.2675e+07 | 4.3228e+06 | ||

RP-Ens (b) k=3 M=1000 | 6e+05 | 1.2886e+07 | 4.3566e+06 | ||

CEC’10 winner | 6e+05 | 2.17e+08 | 8.56e+07 | ||

DECC-CG | 3e+06 | 2.63e+08 | 8.44e+07 | ||

MLCC | 3e+06 | 3.84e+08 | 6.93e+07 | ||

sep-CMA-ES | 3e+06 | 1.1867e+08 | 2.9231e+07 | ||

EDA-MCC c=100 N=1500 | 6e+05 | 1.0161e+07 | 1.7962e+06 | ||

T4 | RP-Ens (g) k=3 M=1000 | 6e+05 | 117.78 | 17.151 | |

RP-Ens k=3 M= | 6e+05 | 5049 | 754.24 | ||

RP-Ens (s) k=3 M=1000 | 6e+05 | 121.32 | 11.349 | ||

RP-Ens (b) k=3 M=1000 | 6e+05 | 82.249 | 2.1203 | ||

CEC’10 winner | 6e+05 | 81400 | 2.84e+05 | ||

DECC-CG | 3e+06 | 4.96e+06 | 8.02e+05 | ||

MLCC | 3e+06 | 1.62e+07 | 4.97e+06 | ||

sep-CMA-ES | 3e+06 | 6.5427e+06 | 3.762e+06 | ||

EDA-MCC c=100 N=1500 | 6e+05 | 12.379 | 0.34627 | ||

T5 | RP-Ens (g) k=3 M=1000 | 6e+05 | 1.3429e+08 | 2.5957e+08 | |

RP-Ens k=3 M= | 6e+05 | 1.216e+08 | 1.986e+08 | ||

RP-Ens (s) k=3 M=1000 | 6e+05 | 6.6642e+07 | 3.63e+07 | ||

RP-Ens (b) k=3 M=1000 | 6e+05 | 7.8953e+07 | 3.8096e+07 | ||

CEC’10 winner | 6e+05 | 6.13e+07 | 1.27e+08 | ||

DECC-CG | 3e+06 | 6.44e+07 | 2.89e+07 | ||

MLCC | 3e+06 | 4.38e+07 | 3.45e+07 | ||

sep-CMA-ES | 6e+05 | 7.8052e+06 | 1.6452e+06 | ||

EDA-MCC c=100 N=1500 | 6e+05 | 1.9746e+11 | 4.2577e+11 |

t-tests vs. | |||||

Func. | Method | max FE | Mean | Std | |

T3 | RP-Ens (g) k=3 M=1000 | 6e+05 | 1.2397e+07 | 3.1864e+06 | |

RP-Ens k=3 M= | 6e+05 | 1.3202e+07 | 3.1221e+06 | ||

RP-Ens (s) k=3 M=1000 | 6e+05 | 1.2675e+07 | 4.3228e+06 | ||

RP-Ens (b) k=3 M=1000 | 6e+05 | 1.2886e+07 | 4.3566e+06 | ||

CEC’10 winner | 6e+05 | 2.17e+08 | 8.56e+07 | ||

DECC-CG | 3e+06 | 2.63e+08 | 8.44e+07 | ||

MLCC | 3e+06 | 3.84e+08 | 6.93e+07 | ||

sep-CMA-ES | 3e+06 | 1.1867e+08 | 2.9231e+07 | ||

EDA-MCC c=100 N=1500 | 6e+05 | 1.0161e+07 | 1.7962e+06 | ||

T4 | RP-Ens (g) k=3 M=1000 | 6e+05 | 117.78 | 17.151 | |

RP-Ens k=3 M= | 6e+05 | 5049 | 754.24 | ||

RP-Ens (s) k=3 M=1000 | 6e+05 | 121.32 | 11.349 | ||

RP-Ens (b) k=3 M=1000 | 6e+05 | 82.249 | 2.1203 | ||

CEC’10 winner | 6e+05 | 81400 | 2.84e+05 | ||

DECC-CG | 3e+06 | 4.96e+06 | 8.02e+05 | ||

MLCC | 3e+06 | 1.62e+07 | 4.97e+06 | ||

sep-CMA-ES | 3e+06 | 6.5427e+06 | 3.762e+06 | ||

EDA-MCC c=100 N=1500 | 6e+05 | 12.379 | 0.34627 | ||

T5 | RP-Ens (g) k=3 M=1000 | 6e+05 | 1.3429e+08 | 2.5957e+08 | |

RP-Ens k=3 M= | 6e+05 | 1.216e+08 | 1.986e+08 | ||

RP-Ens (s) k=3 M=1000 | 6e+05 | 6.6642e+07 | 3.63e+07 | ||

RP-Ens (b) k=3 M=1000 | 6e+05 | 7.8953e+07 | 3.8096e+07 | ||

CEC’10 winner | 6e+05 | 6.13e+07 | 1.27e+08 | ||

DECC-CG | 3e+06 | 6.44e+07 | 2.89e+07 | ||

MLCC | 3e+06 | 4.38e+07 | 3.45e+07 | ||

sep-CMA-ES | 6e+05 | 7.8052e+06 | 1.6452e+06 | ||

EDA-MCC c=100 N=1500 | 6e+05 | 1.9746e+11 | 4.2577e+11 |

See Table 1 caption for explanation of symbols.

. | . | . | . | . | t-tests vs.
. |
---|---|---|---|---|---|

Func. . | Method . | max FE . | Mean . | Std . | . |

T6 | RP-Ens (g) k=3 M=1000 | 6e+05 | 832.18 | 62.543 | |

RP-Ens k=3 M= | 6e+05 | 900.08 | 79.14 | ||

RP-Ens (s) k=3 M=1000 | 6e+05 | 804.09 | 79.982 | ||

RP-Ens (b) k=3 M=1000 | 6e+05 | 856.02 | 77.542 | ||

CEC’10 winner | 6e+05 | 3220 | 185 | ||

DECC-CG | 3e+06 | 10600 | 295 | ||

MLCC | 3e+06 | 3430 | 872 | ||

sep-CMA-ES | 3e+06 | 6279.5 | 251.42 | ||

EDA-MCC c=20 N=300 | 6e+05 | 1376.1 | 1376.1 | ||

T7 | RP-Ens (g) k=3 M=1000 | 6e+05 | 41.664 | 8.7003 | |

RP-Ens k=3 M= | 6e+05 | 30.093 | 7.9886 | ||

RP-Ens (s) k=3 M=1000 | 6e+05 | 40.36 | 10.973 | ||

RP-Ens (b) k=3 M=1000 | 6e+05 | 43.469 | 8.6383 | ||

CEC’10 winner | 6e+05 | 38.3 | 7.23 | ||

DECC-CG | 3e+06 | 23.4 | 1.78 | ||

MLCC | 3e+06 | 198 | 0.698 | ||

sep-CMA-ES | 3e+06 | 212.06 | 6.0235 | ||

EDA-MCC c=100 N=1500 | 6e+05 | 14.658 | 0.45607 | ||

T8 | RP-Ens (g) k=3 M=1000 | 6e+05 | 1.4442e+06 | 53945 | |

RP-Ens k=3 M= | 6e+05 | 1.4528e+06 | 69153 | ||

RP-Ens (s) k=3 M=1000 | 6e+05 | 1.4281e+06 | 59947 | ||

RP-Ens (b) k=3 M=1000 | 6e+05 | 1.4842e+06 | 49239 | ||

CEC’10 winner | 6e+05 | 4340 | 3210 | ||

DECC-CG | 3e+06 | 5120 | 3950 | ||

MLCC | 3e+06 | 2080 | 727 | ||

sep-CMA-ES | 6e+05 | 596.05 | 173.43 | ||

EDA-MCC c=100 N=1500 | 6e+05 | 2.7149e+06 | 9.9543e+05 |

. | . | . | . | . | t-tests vs.
. |
---|---|---|---|---|---|

Func. . | Method . | max FE . | Mean . | Std . | . |

T6 | RP-Ens (g) k=3 M=1000 | 6e+05 | 832.18 | 62.543 | |

RP-Ens k=3 M= | 6e+05 | 900.08 | 79.14 | ||

RP-Ens (s) k=3 M=1000 | 6e+05 | 804.09 | 79.982 | ||

RP-Ens (b) k=3 M=1000 | 6e+05 | 856.02 | 77.542 | ||

CEC’10 winner | 6e+05 | 3220 | 185 | ||

DECC-CG | 3e+06 | 10600 | 295 | ||

MLCC | 3e+06 | 3430 | 872 | ||

sep-CMA-ES | 3e+06 | 6279.5 | 251.42 | ||

EDA-MCC c=20 N=300 | 6e+05 | 1376.1 | 1376.1 | ||

T7 | RP-Ens (g) k=3 M=1000 | 6e+05 | 41.664 | 8.7003 | |

RP-Ens k=3 M= | 6e+05 | 30.093 | 7.9886 | ||

RP-Ens (s) k=3 M=1000 | 6e+05 | 40.36 | 10.973 | ||

RP-Ens (b) k=3 M=1000 | 6e+05 | 43.469 | 8.6383 | ||

CEC’10 winner | 6e+05 | 38.3 | 7.23 | ||

DECC-CG | 3e+06 | 23.4 | 1.78 | ||

MLCC | 3e+06 | 198 | 0.698 | ||

sep-CMA-ES | 3e+06 | 212.06 | 6.0235 | ||

EDA-MCC c=100 N=1500 | 6e+05 | 14.658 | 0.45607 | ||

T8 | RP-Ens (g) k=3 M=1000 | 6e+05 | 1.4442e+06 | 53945 | |

RP-Ens k=3 M= | 6e+05 | 1.4528e+06 | 69153 | ||

RP-Ens (s) k=3 M=1000 | 6e+05 | 1.4281e+06 | 59947 | ||

RP-Ens (b) k=3 M=1000 | 6e+05 | 1.4842e+06 | 49239 | ||

CEC’10 winner | 6e+05 | 4340 | 3210 | ||

DECC-CG | 3e+06 | 5120 | 3950 | ||

MLCC | 3e+06 | 2080 | 727 | ||

sep-CMA-ES | 6e+05 | 596.05 | 173.43 | ||

EDA-MCC c=100 N=1500 | 6e+05 | 2.7149e+06 | 9.9543e+05 |

See Table 1 caption for explanation of symbols.

. | . | . | . | . | t-tests vs.
. |
---|---|---|---|---|---|

Func. . | Method . | max FE . | Mean . | Std . | . |

T9 | RP-Ens (g) k=3 M=1000 | 6e+05 | 807.11 | 49.957 | |

RP-Ens k=3 M= | 6e+05 | 880.23 | 62.453 | ||

RP-Ens (s) k=3 M=1000 | 6e+05 | 811.45 | 56.551 | ||

RP-Ens (b) k=3 M=1000 | 6e+05 | 852.96 | 59.161 | ||

CEC’10 winner | 6e+05 | 3190 | 146 | ||

DECC-CG | 3e+06 | 12200 | 897 | ||

MLCC | 3e+06 | 7110 | 1340 | ||

sep-CMA-ES | 3e+06 | 6763.5 | 275.75 | ||

EDA-MCC c=20 N=300 | 6e+05 | 1474.8 | 1474.8 | ||

T10 | RP-Ens (g) k=3 M=1000 | 6e+05 | 90.481 | 17.178 | |

RP-Ens k=3 M= | 6e+05 | 72.997 | 20.589 | ||

RP-Ens (s) k=3 M=1000 | 6e+05 | 94.305 | 16.164 | ||

RP-Ens (b) k=3 M=1000 | 6e+05 | 102.04 | 13.593 | ||

CEC’10 winner | 6e+05 | 102 | 14.2 | ||

DECC-CG | 3e+06 | 76.6 | 8.14 | ||

MLCC | 3e+06 | 376 | 47.1 | ||

sep-CMA-ES | 3e+06 | 413.83 | 10.971 | ||

EDA-MCC c=100 N=1500 | 6e+05 | 5.0668 | 0.70999 | ||

T11 | RP-Ens (g) k=3 M=1000 | 6e+05 | 54105 | 10877 | |

RP-Ens k=3 M= | 6e+05 | 40345 | 12018 | ||

RP-Ens (s) k=3 M=1000 | 6e+05 | 56828 | 14651 | ||

RP-Ens (b) k=3 M=1000 | 6e+05 | 1.3655e+05 | 52065 | ||

CEC’10 winner | 6e+05 | 5530 | 3940 | ||

DECC-CG | 3e+06 | 24600 | 10500 | ||

MLCC | 3e+06 | 7090 | 4770 | ||

sep-CMA-ES | 6e+05 | 1447.1 | 309.65 | ||

EDA-MCC c=100 N=1500 | 6e+05 | 4.2507e+07 | 5.3257e+06 | ||

T12 | RP-Ens (g) k=3 M=1000 | 6e+05 | 1614.7 | 249.44 | |

RP-Ens k=3 M= | 6e+05 | 988.13 | 0.75027 | ||

RP-Ens (s) k=3 M=1000 | 6e+05 | 1626.2 | 237.59 | ||

RP-Ens (b) k=3 M=1000 | 6e+05 | 4552.7 | 3204 | ||

CEC’10 winner | 6e+05 | 1210 | 142 | ||

DECC-CG | 3e+06 | 4060 | 366 | ||

MLCC | 3e+06 | 2050 | 180 | ||

sep-CMA-ES | 6e+05 | 1046.5 | 51.644 | ||

sep-CMA-ES | 3e+06 | 903.63 | 39.149 | ||

EDA-MCC c=100 N=1500 | 6e+05 | 6.1795e+07 | 7.7141e+06 | ||

AMaLGaM-Univ | 6e+05 | 8.2448e+08 | 2.5912e+08 | ||

AMaLGaM-Univ | 3e+06 | 990.73 | 15.174 |

. | . | . | . | . | t-tests vs.
. |
---|---|---|---|---|---|

Func. . | Method . | max FE . | Mean . | Std . | . |

T9 | RP-Ens (g) k=3 M=1000 | 6e+05 | 807.11 | 49.957 | |

RP-Ens k=3 M= | 6e+05 | 880.23 | 62.453 | ||

RP-Ens (s) k=3 M=1000 | 6e+05 | 811.45 | 56.551 | ||

RP-Ens (b) k=3 M=1000 | 6e+05 | 852.96 | 59.161 | ||

CEC’10 winner | 6e+05 | 3190 | 146 | ||

DECC-CG | 3e+06 | 12200 | 897 | ||

MLCC | 3e+06 | 7110 | 1340 | ||

sep-CMA-ES | 3e+06 | 6763.5 | 275.75 | ||

EDA-MCC c=20 N=300 | 6e+05 | 1474.8 | 1474.8 | ||

T10 | RP-Ens (g) k=3 M=1000 | 6e+05 | 90.481 | 17.178 | |

RP-Ens k=3 M= | 6e+05 | 72.997 | 20.589 | ||

RP-Ens (s) k=3 M=1000 | 6e+05 | 94.305 | 16.164 | ||

RP-Ens (b) k=3 M=1000 | 6e+05 | 102.04 | 13.593 | ||

CEC’10 winner | 6e+05 | 102 | 14.2 | ||

DECC-CG | 3e+06 | 76.6 | 8.14 | ||

MLCC | 3e+06 | 376 | 47.1 | ||

sep-CMA-ES | 3e+06 | 413.83 | 10.971 | ||

EDA-MCC c=100 N=1500 | 6e+05 | 5.0668 | 0.70999 | ||

T11 | RP-Ens (g) k=3 M=1000 | 6e+05 | 54105 | 10877 | |

RP-Ens k=3 M= | 6e+05 | 40345 | 12018 | ||

RP-Ens (s) k=3 M=1000 | 6e+05 | 56828 | 14651 | ||

RP-Ens (b) k=3 M=1000 | 6e+05 | 1.3655e+05 | 52065 | ||

CEC’10 winner | 6e+05 | 5530 | 3940 | ||

DECC-CG | 3e+06 | 24600 | 10500 | ||

MLCC | 3e+06 | 7090 | 4770 | ||

sep-CMA-ES | 6e+05 | 1447.1 | 309.65 | ||

EDA-MCC c=100 N=1500 | 6e+05 | 4.2507e+07 | 5.3257e+06 | ||

T12 | RP-Ens (g) k=3 M=1000 | 6e+05 | 1614.7 | 249.44 | |

RP-Ens k=3 M= | 6e+05 | 988.13 | 0.75027 | ||

RP-Ens (s) k=3 M=1000 | 6e+05 | 1626.2 | 237.59 | ||

RP-Ens (b) k=3 M=1000 | 6e+05 | 4552.7 | 3204 | ||

CEC’10 winner | 6e+05 | 1210 | 142 | ||

DECC-CG | 3e+06 | 4060 | 366 | ||

MLCC | 3e+06 | 2050 | 180 | ||

sep-CMA-ES | 6e+05 | 1046.5 | 51.644 | ||

sep-CMA-ES | 3e+06 | 903.63 | 39.149 | ||

EDA-MCC c=100 N=1500 | 6e+05 | 6.1795e+07 | 7.7141e+06 | ||

AMaLGaM-Univ | 6e+05 | 8.2448e+08 | 2.5912e+08 | ||

AMaLGaM-Univ | 3e+06 | 990.73 | 15.174 |

See Table 1 caption for explanation of symbols.

We chose the particular EDA-type methods to compare by the following reasoning. EDA-MCC (Dong et al., 2013) was chosen because it is a recent method specifically developed to scale up EDA to high dimensions. It assumes that the covariance of the selected points has a block-diagonal structure, so depending on the block size it interpolates between UMDAc and EMNA. At first sight this seems quite similar to our approach, since the blocks define subspaces of the search space; however, the subspaces in EDA-MCC are axis-aligned and disjoint, whereas in our approach they are not. The implication of this difference is that when we decrease the subspace dimension, the covariance becomes more spherical while still avoiding the independence assumption of UMDAc (see our analysis in the earlier sections). This results in a better capability to escape unwanted early convergence. We used our own implementation of EDA-MCC, since there is no publicly available implementation. Since the guidelines on the parameter setting of EDA-MCC are not prescriptive and were only tested up to 500 dimensions (Dong et al., 2013), for our 1,000-dimensional benchmark set we ran experiments with two different sensible parameter settings and chose the best of the two results to report. This was to ensure that we did not inadvertently disadvantage this method. With block size of EDA-MCC denoted by *c*, the two versions we ran were , , and , . In both cases we set the budget of maximum function evaluations equal to our proposed RP-Ens-EDA, namely, .

The sep-CMA-ES method (Ros and Hansen, 2008) was included in our comparison because it is a variant of CMA-ES developed to handle high-dimensional problems, and it currently represents the gold standard for comparisons in new EDA research. We used the MATLAB implementation available from the authors^{1} with the diagonal option, default parameter settings, and random initialisation. We ran sep-CMA to a maximum of function evaluations equal to that used for our RP-Ens-EDA in the first instance, namely, , but for functions on which it did not outperform our methods we further ran it to a maximum of function evaluations and reported that result instead. This was to avoid having the limited budget as the major obstacle for sep-CMA-ES.

Finally, the AMaLGaM-Univariate was chosen as a representative of AMaLGaM (Bosman, 2009; Bosman et al., 2013) because it was previously demonstrated to work up to 1,000-dimensional problems (Bosman, 2009), and among the 12 versions of the AMaLGaM package this version was found by the authors to work best in high-dimensional problem solving (Bosman, 2009), with results comparable to sep-CMA-ES. For this reason, and since the available software implementation of AMaLGaM that we used^{2} contains a preset list of test functions of which one (Rosenbrock) is in common with our test suite (function T12), we included a comparison with AMaLGaM-Univariate on this function only. We tested several versions of AMaLGAM on this function, and AMaLGaM-Univariate was indeed the variant that produced the best results, which is in line with the authors’ own finding.

Tables 1–4 give for each of the 12 test functions, and for each competing method, the mean and the standard deviation of the best fitness achieved, as computed from 25 independent repetitions. To determine the statistical significance of the differences in performance, we performed 2-tailed *t*-tests for each competing method against each of the four variants of our method. The symbols in the last column indicate whether the fitness achieved by a competing method is statistically significantly better (+) (that is, significantly lower, since we tackle minimisation problems) or worse (−) (i.e., higher) than that of a variant of RP ensemble EDA at the 95% confidence level. The symbol in the first position is a comparison with RP-Ens-EDA that uses Gaussian RP matrices (g); the second symbol is a comparison with RP-Ens-EDA that uses infinitely many Gaussian or sparse RP matrices () (their infinite ensemble limits coincide; see Kabán, 2014, Lemma 2), the third symbol is a comparison with RP-Ens-EDA that uses 1,000 sparse RP matrices (s); and the last symbol is a comparison with RP-Ens-EDA that uses 1,000 binary RP matrices (b). The symbol is placed where a comparison is not applicable (the method on a row and column coincide). The absence of any symbol in any of the four positions means that the associated comparison test detected no significant difference at the 95% confidence level.

One thing we notice from Tables 1–4 is that although some differences of statistical significance were detected in comparisons between some of the variants of our own method, these different variants behaved very similarly when looked at through comparisons with other methods. A second observation to be made is the great diversity in the performance behaviour among the competing methods versus ours; that is, on nearly every function (except T6 and T9, on which our proposed RP-Ens-EDA methods are the overall winners) there is at least one method that does better than ours and at least one that does worse than ours, but the methods that outperform RP-Ens-EDA on one function lose out on another function. This reflects a nice complementarity of the search biases of the methods, so although for the purpose of our comparison all these methods are treated as competitors, in reality in the toolbox of a practitioner they may be used to cooperate in solving difficult problems.

Let us first look at the details of the comparisons between pairs of our own methods. That is, we look at the first four rows for each test function and follow the markers in the last column (an antisymmetric matrix of markers). We notice the following. In the comparisons between the finite dimensional ensemble with Gaussian RPs versus its limit of infinite ensemble, we see that on only four out of twelve functions the infinite ensemble was significantly superior, and on another four functions the finite ensemble performed better. No statistically significant differences were detected on the remaining four functions. So we can conclude that our practical algorithm using a finite ensemble of just RPs is performing no worse that an infinite ensemble, and we also see the gap between them is small in practice.

In the comparisons between the Gaussian versus sparse RP ensembles, both taken with finite ensemble sizes, the 2-tailed *t*-test rejects the null of equal means only once out of 12 functions; in practical terms the two different RP ensembles appear equivalent. Indeed, apart from function T2, where in fact both algorithms got practically close to the global optimum, no statistically significant difference was detected between these two types of random projections. We have so far not identified any obvious reasons for the statistical difference observed on T2; it does appear consistent in all repeated runs. However, it is surprising to see the equivalent performance in all the 11 other functions, given that in stochastic search algorithms like ours there are many factors that may influence the dynamics. These results imply that taking advantage of the computational efficiency offered by the sparse RP matrices essentially comes for free.

The comparisons between the Gaussian and the binary finite ensemble turn out favourable for the former on six of the functions, whereas the binary ensemble wins only once. This was somewhat expected because of the reduced exploration capabilities of the binary RP ensemble (see Section 3.4). In answer to the question set out there, we should observe also that although these differences are statistically significant, the actual fitness values achieved are rather close on average for the two RP variants. By this we may conclude that our overall algorithm appears robust to deviations from the conditions of our analysis presented earlier, where we treated the case of Gaussian *R*.

It may also be interesting to comment on the comparisons between the finite binary RPs versus the infinite ensemble of (Gaussian or sparse) RPs. Despite the fact that the analytical form of the infinite ensemble covariance with the binary RPs is provably different from that with the Gaussian or the sparse RPs, for only five out of twelve functions is the latter superior with statistical significance, while the binary ensemble wins on one function. Four of those five functions coincide with those we observed in the case of the sparse RP ensemble.

Table 5 summarises the findings for the comparisons between the variants of our methods in terms of the overall number of test functions on which a variant wins / loses against another method. All counts are out of the 12 overall test functions, and only the differences that are significant with 95% confidence are counted. Based on these results we may then conclude that our new algorithmic framework appears to be quite robust and efficient, and is able to take advantage of speedups offered by alternative RP matrices.

. | No. Wins: No. Losses . | |||
---|---|---|---|---|

. | RP-Ens (g) . | RP-Ens () . | RP-Ens (sp) . | RP-Ens (b) . |

RP-Ens (g) | N/A | 4 : 4 | 1 : 0 | 6 : 1 |

RP-Ens () | 4 : 4 | N/A | 5 : 4 | 5 : 3 |

RP-Ens (sp) | 0 : 1 | 4 : 5 | N/A | 6 : 1 |

RP-Ens (b) | 1 : 6 | 3 : 5 | 1 : 6 | N/A |

. | No. Wins: No. Losses . | |||
---|---|---|---|---|

. | RP-Ens (g) . | RP-Ens () . | RP-Ens (sp) . | RP-Ens (b) . |

RP-Ens (g) | N/A | 4 : 4 | 1 : 0 | 6 : 1 |

RP-Ens () | 4 : 4 | N/A | 5 : 4 | 5 : 3 |

RP-Ens (sp) | 0 : 1 | 4 : 5 | N/A | 6 : 1 |

RP-Ens (b) | 1 : 6 | 3 : 5 | 1 : 6 | N/A |

We now look at the comparisons between the previously existing methods and the set of our own in the results given in Tables 1–4. Of the separable functions, MLCC is the only one that is better or not statistically worse than ours. Of the single-group nonseparable functions, EDA-MCC performed better than our proposed methods on T3 and T4 with all of the other competing methods being significantly worse than ours. However, on T5, the EDA-MCC turns out to be significantly worse, while sep-CMA-ES outperforms our RP-Ens-EDA instead. MLCC is also better or not significantly worse on this function.

In the -group *m*-nonseparable functions, our approach is the overall winner on T6, outperformed by DECC-GG and EDA-MCC on T7, and outperformed by four competitors on T8.

In the -group nonseparable functions, the algorithm we proposed is the overall winner on T9, outperformed by EDA-MCC and partly outperformed by DECC-CG on T10, and outperformed by four competitors on T11. Finally, on the nonseparable function T12, sep-CMA-ES outperformed our algorithms, AMaLGaM-Univariate outperformed our finite-M variants only when given a larger budget of function evaluations, and the CEC’10 winner also marginally (but with statistical significance) outperformed our finite-M versions.

An aggregated summary of these findings in terms of the number of functions (from the total of 12) on which our method wins/loses is provided in Table 6 for each of the competing methods. Only the differences significant with 95% confidence are counted. We see that the version with binary RPs is the least effective, as expected, but still the number of wins obtained by our proposed methods is larger than or equal to the number of losses in each individual comparison.

. | No. Wins: No. Losses . | ||||
---|---|---|---|---|---|

. | CEC10 . | DECC-CG . | MLCC . | sep-CMA-ES . | EDA-MCC . |

RP-Ens (g) | 6 : 3 | 7 : 4 | 7 : 3 | 8 : 4 | 8 : 4 |

RP-Ens () | 8 : 2 | 7 : 3 | 7 : 3 | 8 : 4 | 8 : 4 |

RP-Ens (sp) | 5 : 3 | 7 : 4 | 7 : 5 | 8 : 4 | 8 : 4 |

RP-Ens (b) | 5 : 4 | 6 : 4 | 6 : 6 | 8 : 4 | 8 : 4 |

. | No. Wins: No. Losses . | ||||
---|---|---|---|---|---|

. | CEC10 . | DECC-CG . | MLCC . | sep-CMA-ES . | EDA-MCC . |

RP-Ens (g) | 6 : 3 | 7 : 4 | 7 : 3 | 8 : 4 | 8 : 4 |

RP-Ens () | 8 : 2 | 7 : 3 | 7 : 3 | 8 : 4 | 8 : 4 |

RP-Ens (sp) | 5 : 3 | 7 : 4 | 7 : 5 | 8 : 4 | 8 : 4 |

RP-Ens (b) | 5 : 4 | 6 : 4 | 6 : 6 | 8 : 4 | 8 : 4 |

#### 4.1.2 Fitness Trajectories through the Generations

In order to gain more insight into the behaviour of our RP ensemble EDA algorithm it is useful to inspect its convergence behaviour by looking at the evolution of the best fitness across generations. This is plotted in Figures 4 and 5 comparatively for both the Gaussian RP and the sparse RP (Achlioptas, 2003) for four very different settings of the pair of parameters *k* and *M*: ; ; ; . The last two use the minimum number of random subspaces that cover the full search space a.s. (in case of the Gaussian RP) or with very high probability (in case of the sparse RP). As before, we use population sizes of . All results represent averages of the best fitness as obtained from 25 independent runs.

First, we see that the behaviour of the two different RP matrices is nearly indistinguishable even when the ensemble size is minimal. However, the minimal *M* that barely spans the search space tends to be a poor choice relative to a larger *M*, and since increasing *M* does not involve extra function evaluations, we recommend using a larger value of *M* (of the order of *d*) in order to work in the regime where the ensemble covariance is understood by the analytical treatment we presented in previous sections. A more systematic empirical study of the effects of the parameter choices follows in the next section.

Furthermore, from Figures 4 and 5, we can see a clear tendency of our RP ensemble EDA algorithms to escape early convergence (which is known to be typical of both UMDAc and EMNA in high dimensions) and to explore the search space. It is particularly pleasing that the versions with finite number of random subspaces also perform well, and as expected, a larger value of *M* gets the performance closer to that of the idealised version with .

### 4.2 Impact of the Parameters and Generic Guidelines for Setting the Parameters

Some observations already emerged regarding the setting of *M* and *k*, and the rule-of-thumb parameter settings we used in the large 1,000-dimensional experiments (see Section 4.1) turned out to perform well. Here we conduct a set of systematic experiments with parameter values varied on a grid and tested on four 100-dimensional functions. Two of these functions are among the few re-scalable ones of the same CEC’10 test suite that we used earlier—T2 (Ackley function, fully separable) and T12 (Rosenbrock, fully nonseparable)—and two others are toy problems that serve to demonstrate certain characteristics of the behaviour of the method: the sphere function () and the rotated ellipse (, where the rotation matrix *M* was uniformly randomly generated in each repeated run and then fixed).^{3} Throughout this section we use a population size fixed to , and we focus to study the influence of *k*, *M*, and the selection pressure . We set the maximum function evaluations to 3 million—much larger than previously—in order to count the number of fitness evaluations required to reach various target values. For each combination of parameter values we performed 10 independent repetitions. We varied , (making sure that we always had at least points to estimate a covariance) and .

Figures 6–13 show the average of the function evaluations needed to reach two chosen target values for the sphere, Ackley, ellipse, and Rosenbrock functions, respectively. If a target was not reached, then the count appears as (i.e., the maximum number of function evaluations cutoff). There are also error bars on the black markers on these surface plots that represent one standard deviation computed from the 10 repeated runs; however, these are rather small and hardly visible at the overall scale. But the trend of the average search costs against the various parameter settings is clearly visible. Two target values (chosen from powers of 10) are displayed for each of these functions, of which one was chosen such that at least for half of the various combinations of values for *k* and *M* the specified target was successfully reached, and the second threshold value was a relatively larger one in order to see the influence of the parameter settings at two different stages of the optimisation. We will refer to this latter target value as the coarser target.

We found, rather unsurprisingly, that the optimal parameter setting depends on the problem in a complex way in the case of tight targets. However, this is not as much the case for the coarser target. The latter is of interest in the practical cases where a “quick and dirty” solution is sought, that is, when we seek an approximate solution to a difficult problem with limited resources. Considering that evolutionary heuristics are most often used in this latter role (He and Yao, 2003; Yu et al., 2012), it may be of interest to look at the search costs involved at both scales.

We now go through the results obtained. We see in Figures 6 and 7 that for the sphere function the surface plots have nearly identical shapes for both target values, while the search costs on the vertical axes differ. The unchanged behaviour is most likely because at both coarse and fine-grained scales the fitness landscape has the same spherical shape, which is easily modelled by our ensemble covariance with a low value of *k*. So the search strategy of our approach works equally well on both scales. Looking at the three surface plots that correspond to different selection pressures , one notices that the smaller value of required lower search costs to reach the target in comparison with the larger value of . This observation is also consistently valid for the other functions. A larger value of means less selection pressure, which leads to slower convergence. Now, for each individual value, the associated surface plot indicates that a small *k* and (interestingly) a small *M* reaches the target quicker in the case of the sphere function. The small value of *M* works here, but we need to keep in mind that the analysis in Section 3.2.2 gives little guarantee for small values of *M* in terms of recovering the correct direction of the covariance, and indeed choosing *M* too small works poorly.

Next, the results on the Ackley function in Figures 8 and 9 display a striking similarity with those in the case of the sphere function. This is despite the fact that Ackley is a multimodal function with several local optima. The reason that our method has such similar behaviour on this function is most likely that the Ackley function has a spherical basin of attraction around its global optimum, and on a coarse scale the Ackley function resembles a spherical shape. So the close-to-spherical covariance induced in our approach when *k* is small turns out advantageous again in this case. Indeed, all the observations we have made about the results on the sphere function do carry over on Ackley. We did not reach a target of as we did on the sphere function, but we did reach below .

For the rotated ellipse function, the picture looks very different, as expected. Figures 10 and 11 show that small values of *k* and *M* all failed to reach the target of even . For this fitness landscape we need a higher value of *M* and *k* to have the flexibility to model an elongated covariance, and to recover the correct direction of the covariance (which is only guaranteed for a large *M*). What is interesting to note is that the picture looks almost symmetric in *k* versus *M* (Figure 10), that is, a less elongated covariance (due to small *k*) can still work on this ellipse function provided that we recover its orientation (with larger enough *M*). Since the time complexity per generation is only linear in *M* but cubic in *k*, it seems a good heuristic to increase *M* first, and increase *k* only if the increased *M* did not deliver satisfactory performance. Of course, a very ill-conditioned fitness function would be a very difficult problem for our approach because we would need a large *k* close to *d*—in which case the averaging ensemble approach itself becomes no longer profitable. Possibly a weighted averaging combination might be developed to handle this case. It is important to remember the no-free-lunch theorem (Wolpert and Macready, 1997), which implies that no method is best on all problems, but each method works well on certain problems, namely, the problems that match the method’s own search bias. In our case, we have just seen this at work, where the Ackley function is an easy problem for our method, whereas the ellipse is a difficult problem. In Section 4.3 we find that for the sep-CMA-ES method the relative difficulty of these two functions is exactly the opposite of what was true for our RP ensemble EDA.

Let us now look at the coarser target value for the same rotated ellipse function. It is interesting to observe in Figure 11 that the surface plots of the search costs to reach the target value of take the same shape as those in the case of the sphere and Ackley functions. This means that at a coarser scale our simple strategy still works with the same profitable parameter values as before. Hence apparently when an approximate solution is needed at low search costs, our method with the rule-of-thumb parameter settings is appropriate to use.

Finally, we show similar results on the Rosenbrock function in Figures 12 and 13. The search costs for the target value of display rather complicated shapes, although the differences on the vertical axes are not particularly large. However, when we inspect the surface plots of the search costs for the coarse target of , we recognise the same shape that we have seen for all the other three functions. This reinforces the previous conclusion.

Based on these results we can set the following guidelines:

A tight selection pressure, for instance, , worked best. This may be different from other approaches, such as EDA-MCC, which recommends , since in that approach a larger group size is important in order to avoid similarity with UMDAc. By contrast, the spherical component of our covariance favours exploration and has a better chance to avoid early convergence, which may be the reason that a higher selection pressure is more cost-efficient. However, these differences when varying have not been massive, and hence when the budget of function evaluations is not particularly tight, a larger may be justified, especially if we want to increase

*k*.Regarding the setting of

*M*, there is no substantial cost to setting it to a higher rather than a lower value. It does not incur any additional function evaluations, and it increases the per generation time complexity only linearly. A large enough*M*has the substantial benefit that we recover the orientation of the covariance from the ensemble, and in many cases this reduces the required function evaluations to reach tighter target values. Small values of*M*work well occasionally but can also lead to poor performance in some cases. We recommend setting*M*of the order of*d*.*k*is the parameter that controls the extent of regularisation. The smaller the value of*k*, the closer to spherical the ensemble covariance. A small*k*works very well on functions with a spherical basin of attraction around the optimum, and this also approximates other functions at a coarser scale. Therefore a small*k*can be effective when the goal is to get an approximate solution with limited resources. On the other hand, when the goal is to reach a target very close to the optimum and the fitness landscape is rather ill-conditioned, then*k*would need to be large. In that case we need to weigh the benefits against the much increased per generation computation time (cubic in*k*). The practitioner needs to weigh these trade-offs in making the appropriate choice for the problem at hand.

Finally, a comment is in order about the population size *N*. Since the estimation step is only required to estimate covariances, *N* only needs to be large enough to have of the order *k* (e.g., a minimum of ) selected points in order to get sufficiently good covariance estimates in the *k*-dimensional space.

### 4.3 Scalability Experiments

Our final set of experiments measures the search costs (number of function evaluations) to reach a specified target value as the problem dimension varies, and compares these with the search costs of sep-CMA-ES.

We use the same four functions as previously, namely sphere, Ackley, rotated ellipse, and Rosenbrock. We fix the value to reach (VTR) to and vary the dimensionality in . We count the number of fitness evaluations needed for our proposed RP ensemble EDA to reach the VTR. We repeated the experiment for three other choices of VTR: , and in order to make sure that the conclusions would not be specific to a particular choice of VTR. In all these experiments we used the rule-of-thumb parameter settings based on the previous observations. We set the maximum fitness evaluations to , so the algorithm stops either upon reaching the VTR or when the maximum function evaluations are exhausted.

The results are displayed in the log-log plots in Figure 14. The figure shows the average number of function evaluations as computed from the successful runs out of 10 independent repetitions for each problem dimension. When none of the 10 repeated runs reached the prespecified VTR, there are missing data in these plots.

From Figure 14 we observe that a linear fit matches tightly the obtained scalability measurements on the log-log plots (dashed lines). The slopes of these lines signify the degree of the polynomial that describes the scaling of our algorithm. The dotted line that corresponds to linear scaling (slope = 1) is also superimposed on these plots for visual comparison.

From this figure we see that our proposed algorithm displays a near-linear dependence of the search costs (number of function evaluations to reach a VTR) as the problem dimension varies. More precisely, the scaling is sublinear in the majority of the cases tested (slope of the best-fitting line on the log-log plot is smaller than 1) in all experiments with sphere and Ackley as well as in the two larger VTR values for ellipse. In the remaining cases the scaling is slightly superlinear but still very close to linear (the slopes are 1.08, and 1.09 on the two smaller VTR values for ellipse, and 1.21 for Rosenbrock).

These results match the best scaling known for sep-CMA-ES (Ros and Hansen, 2008), and considering that the volume of the search space grows exponentially with the dimension; this is indeed as good as one can hope for. Note that a larger VTR can sometimes be reached with very few costs, for instance, this is what we see on the Ackley function. Hence we see that the proposed method is most profitable for quickly finding approximate solutions.

Figure 15 gives a detailed comparison with sep-CMA-ES, using a comparison protocol similar to that utilised by Bosman (2009). Our proposed RP-Ens-EDA and sep-CMA-ES present comparable scaling efficiency overall but have very different search biases and perform well in different situations. We compare the average numbers of function evaluations used by our method to reach various prespecified VTRs against those required by sep-CMA-ES to reach the same VTRs. We included a wider range of VTRs here, equally spaced in the interval for a better visibility of the behaviour of the two methods comparatively. We can summarise the following observations and conclusions from these results:

- Our method gains advantage in higher dimensions, whereas sep-CMA-ES scales better in lower dimensions. We see that in 1,000 dimensions RP-Ens-EDA consistently scales better on three out of the four functions tested (sphere, Ackley, and Rosenbrock) and partly on ellipse. In 500 dimensions our method scales no worse on two out of four functions (sphere and Ackley) and partly on ellipse, and it scales worse on one function (Rosenbrock). In 100 and 50 dimensions RP-Ens-EDA only scales better on one function (Ackley), whereas sep-CMA-ES scales better on the remaining three.Figure 15: