## Abstract

We describe a parameter-free estimation-of-distribution algorithm (EDA) called the adapted maximum-likelihood Gaussian model iterated density-estimation evolutionary algorithm (AMaLGaM-IDA, or AMaLGaM for short) for numerical optimization. AMaLGaM is benchmarked within the 2009 black box optimization benchmarking (BBOB) framework and compared to a variant with incremental model building (iAMaLGaM). We study the implications of factorizing the Gaussian distribution so that only a few or no covariances are used. Further, AMaLGaM and iAMaLGaM are also evaluated on the noisy BBOB problems and we assess how well multiple evaluations per solution can average out noise. Experimental evidence suggests that parameter-free AMaLGaM can solve a wide range of problems efficiently with perceived polynomial scalability, including multimodal problems, obtaining the best or near-best results among all algorithms tested in 2009 on functions such as the step-ellipsoid and Katsuuras, but failing to locate the optimum within the time limit on skew Rastrigin-Bueche separable and Lunacek bi-Rastrigin in higher dimensions. AMaLGaM is found to be more robust to noise than iAMaLGaM due to the larger required population size. Using few or no covariances hinders the EDA from dealing with rotations of the search space. Finally, the use of noise averaging is found to be less efficient than the direct application of the EDA unless the noise is uniformly distributed. AMaLGaM was among the best performing algorithms submitted to the BBOB workshop in 2009.

## 1 Introduction

Estimation-of-distribution algorithms (EDAs) are an important strand of research on black box optimization (BBO). In BBO, either no prior knowledge on the function to be optimized is available, or the function cannot be expressed in closed form, for example, when solutions are evaluated using simulations. Still, the optimization problem can have exploitable structural features. One type of structure is dependencies between problem variables. Identifying and using dependencies during optimization can lead to more efficient search. EDAs provide a principled way of doing this in a stochastic manner.

EDAs belong to the class of evolutionary algorithms (EAs) and as such maintain a population of solutions that are subject to selection and variation. The first population is usually generated randomly. Each solution's fitness is evaluated and the better solutions are selected for variation. Selection pushes the population into promising regions of the search space. New candidate solutions are generated by estimating a probability distribution from the selected solutions and randomly drawing new samples from this distribution. These newly generated solutions replace parts of the old population or the entire old population. The new population is evaluated and the better solutions are kept. This process is iterated until a convergence criterion is met.
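The estimate-sample-select loop described above can be sketched as follows. This is a generic, hypothetical EDA skeleton with a univariate Gaussian model (the function names, parameter values, and sphere objective are illustrative, not from the article), not AMaLGaM itself:

```python
import numpy as np

def simple_gaussian_eda(f, l, n=100, tau=0.35, generations=200, seed=42):
    """Minimal univariate-Gaussian EDA: estimate, sample, select, repeat."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-5.0, 5.0, size=(n, l))          # random initial population
    n_sel = int(tau * n)                               # truncation selection size
    for _ in range(generations):
        fitness = np.apply_along_axis(f, 1, pop)
        selected = pop[np.argsort(fitness)[:n_sel]]    # keep the best solutions
        mean = selected.mean(axis=0)                   # ML estimates from the
        std = selected.std(axis=0) + 1e-12             # selected solutions
        pop = rng.normal(mean, std, size=(n, l))       # sample a new population
    fitness = np.apply_along_axis(f, 1, pop)
    return pop[np.argmin(fitness)]

best = simple_gaussian_eda(lambda x: float(np.sum(x**2)), l=3)
```

Note that this plain ML loop exhibits exactly the premature-convergence issues that motivate the adaptations discussed in Section 2.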

EDAs mark an important contribution to BBO research. The main operator is the estimation (also called learning) of a probability distribution and resampling the probability distribution to generate offspring. This classifies the EDA as a model-based optimization technique. Probability distributions provide a principled way of modeling dependencies between problem variables, allowing EDAs to successfully exploit problem dependencies during the search. As a result, they are often more flexible and efficient at performing BBO than other EAs. Different EDAs have been proposed that use different probability distributions. For an overview, we refer the interested reader to the literature (Lozano et al., 2006; Pelikan et al., 2006).

The BBOB (black box optimization benchmarking) framework for real-valued optimization was introduced in 2009. BBOB is an extensive benchmark consisting of both an experimental setup (Hansen et al., 2009a) and carefully chosen functions (Finck et al., 2009; Hansen et al., 2009b). The reader should be familiar with the BBOB framework prior to reading the experimental results provided in this article.

In this article, we give a detailed description of an EDA for real-valued optimization, benchmark it within the BBOB framework, and discuss our findings. The EDA uses a Gaussian probability distribution and is known as the adapted maximum-likelihood Gaussian model iterated density-estimation evolutionary algorithm (Bosman et al., 2008; Bosman, 2009; AMaLGaM-IDA, or AMaLGaM for short).

This article is organized as follows. In Section 2, we describe AMaLGaM and three variants using (1) a fully multivariate Gaussian distribution, (2) a Bayesian factorization of the Gaussian distribution, or (3) a univariately factorized Gaussian distribution. Section 3 explains how the Gaussian distribution can be learned incrementally over multiple generations rather than anew in each generation, the goal of which is to reduce the required population size. Section 4 describes our approach to removing all parameters and introduces a restart mechanism to improve the EDA's performance on multimodal problems. We provide pseudocode in Section 5 and present the results on the non-noisy and noisy functions in the BBOB framework in Section 6. The article ends with concluding remarks and an outlook on future research.

## 2 AMaLGaM

AMaLGaM represents the current state in a line of research on using Gaussian distributions in EDAs for numerical optimization. Some initial publications used maximum-likelihood (ML) parameter estimates (Bosman and Thierens, 2000; Larrañaga et al., 2000). Studies of the behaviour of such models revealed important shortcomings and led to the development of AMaLGaM. In the following, we review the basic principles of Gaussian distributions and the adaptations incorporated into AMaLGaM. For a more in-depth discussion of the design of various components, we refer the interested reader elsewhere (Bosman and Thierens, 2000; Grahl et al., 2006; Bosman et al., 2007; Bosman et al., 2008; Bosman, 2009).

### 2.1 Gaussian Probability Distribution

Let *X*_{i} denote a random variable for each real-valued problem variable *x*_{i}, *i* ∈ {1, 2, …, *l*}, where *l* is the problem dimensionality. Let **v** be a vector of |**v**| unique indices, *v*_{i} ∈ {1, 2, …, *l*}. Using **v** we can distinguish between the case of having a distribution over all variables (i.e., |**v**| = *l*) and a distribution over a subset of all variables, which is important in the context of factorizations (see below). The Gaussian probability distribution for a vector of random variables **X**^{v} is parameterized by a vector of means **μ**^{v} and a symmetric covariance matrix **Σ**^{v}, comprising a total of |**v**| + |**v**|(|**v**| + 1)/2 parameters to be estimated. It is defined as follows:

P(**X**^{v}) = (2π)^{−|**v**|/2} (det **Σ**^{v})^{−1/2} e^{−(1/2)(**x**^{v} − **μ**^{v})^{T}(**Σ**^{v})^{−1}(**x**^{v} − **μ**^{v})}

### 2.2 Maximum-Likelihood Estimates

Let S denote the set of selected solutions. ML estimates of the parameters of the Gaussian distribution are obtained from S by using the sample average for the mean vector and the sample covariance matrix for the covariance matrix:

**μ̂**^{v} = (1/|S|) ∑_{**x** ∈ S} **x**^{v},  **Σ̂**^{v} = (1/|S|) ∑_{**x** ∈ S} (**x**^{v} − **μ̂**^{v})(**x**^{v} − **μ̂**^{v})^{T}
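As an illustration (not from the article), the ML estimates of a Gaussian distribution from a set of selected solutions can be computed directly with NumPy; the data here is a hypothetical 50-by-3 selection matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical set of selected solutions: 50 samples in 3 dimensions.
S = rng.normal(loc=[1.0, -2.0, 0.5], scale=0.3, size=(50, 3))

mu_hat = S.mean(axis=0)                # ML estimate of the mean vector
diff = S - mu_hat
sigma_hat = (diff.T @ diff) / len(S)   # ML (biased) covariance estimate
# Equivalent to np.cov(S, rowvar=False, bias=True).
```

Note that the ML covariance estimate divides by |S|, not |S| − 1, which is why `bias=True` is needed for the `np.cov` equivalence.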

### 2.3 Random Sampling

To draw new random samples for a single Gaussian-distributed random variable with a mean parameter of 0 and a variance parameter of 1, elementary algorithms exist (see, e.g., Knuth, 1981, Section 3.4.1, Algorithm C).

To draw random samples from a multivariate Gaussian, the Cholesky decomposition **L** **L**^{T} = **Σ**^{v} of the covariance matrix can be computed. Then, the vector **μ**^{v} + **L** **z**, where each *z*_{i} is a single-dimensional Gaussian random variable with mean 0 and variance 1, is a |**v**|-dimensional random sample drawn from the Gaussian distribution with mean **μ**^{v} and covariance matrix **Σ**^{v}.
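This sampling procedure can be sketched in a few lines of NumPy (the mean and covariance values below are illustrative):

```python
import numpy as np

def sample_multivariate_gaussian(mu, sigma, n, rng):
    """Draw n samples from N(mu, sigma) via the Cholesky decomposition."""
    L = np.linalg.cholesky(sigma)           # L @ L.T == sigma
    z = rng.standard_normal((n, len(mu)))   # i.i.d. standard-normal draws
    return mu + z @ L.T                     # each row is mu + L z

rng = np.random.default_rng(1)
mu = np.array([1.0, -1.0])
sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
samples = sample_multivariate_gaussian(mu, sigma, 200_000, rng)
```

With enough samples, the empirical mean and covariance of `samples` closely match `mu` and `sigma`.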

### 2.4 Gaussian Factorization Selection

In the univariately factorized Gaussian distribution there are no covariances, leaving only *l* means and *l* variances. Sampling a new solution is also straightforward and entails sampling the associated one-dimensional Gaussian distribution for each of the *l* variables.

The vector of variables on which *X*_{i} is conditioned in the Bayesian factorization is called the vector of parents of *X*_{i}. A Bayesian factorization can now be written as a product of one-dimensional conditional Gaussian distributions, one for each variable given its parents. As described elsewhere (Pelikan et al., 1999; Mühlenbein and Mahnig, 1999; Bosman and Thierens, 2000; Larrañaga et al., 2000), greedy algorithms are the standard approach to computing Bayesian factorizations because this problem itself is NP-hard. The factorization expresses a subset of all dependencies between the variables. The density contours of a (factorized) Gaussian probability distribution are ellipsoids. Depending on the dependencies modeled by the factorization, these ellipsoids can be aligned along any axis. If there is no dependency between a set of random variables (i.e., the univariate factorization), the projected density contours in those dimensions are aligned along the main axes. In any case, a Gaussian distribution is only capable of modeling linear dependencies.

Let **W**^{v} be the inverse of the symmetric covariance matrix for the variables indexed by **v**, that is, **W**^{v} = (**Σ**^{v})^{−1}. Matrix **W**^{v} is commonly called the precision matrix. An ML estimate of each conditional distribution in the Bayesian factorization can be expressed in terms of the entries of **W**^{v} (Bosman and Thierens, 2000). Each such conditional is a single-dimensional Gaussian distribution. Sampling from Bayesian factorizations is thus straightforward once all computations have been performed.

### 2.5 AVS

The smaller the variance, the smaller the area of exploration for the EDA. The variance of the Gaussian distribution is stored in the covariance matrix **Σ**. Using ML estimates directly, the variance may decrease too rapidly for the EDA to have sufficient time for exploration. As a remedy, the variance can be scaled (adaptively) beyond its ML estimate (Ocenasek et al., 2004; Yuan and Gallagher, 2005; Grahl et al., 2006). The technique used in AMaLGaM is called adaptive variance scaling (AVS; Grahl et al., 2006).

A distribution multiplier *c*^{Multiplier} is maintained and upon sampling new solutions, the covariance matrix used is *c*^{Multiplier}**Σ** instead of just **Σ**. If the best fitness value improves in one generation, then the current size of the variance allows for progress. A further enlargement of the variance may allow further improvement in the next generation. To fight the variance-diminishing effect of selection, *c*^{Multiplier} is scaled up by a factor η^{INC} > 1. If, on the other hand, the best fitness does not improve, the range of exploration may be too large to be effective and the variance multiplier is decreased by a factor η^{DEC} < 1. For symmetry, η^{DEC} = 1/η^{INC}.

The AVS technique is also used to detect termination. To this end, *c*^{Multiplier}<1 is allowed if no improvement has been found in more than a predefined number of subsequent generations while *c*^{Multiplier}=1. When this happens, the mean of the distribution is set to the best solution. This allows the EDA to focus on a single peak: the one to which the currently best solution belongs. If an improvement is found during this phase, *c*^{Multiplier} is reset to 1. Termination is enforced if *c*^{Multiplier}<10^{-10}.

It has previously been observed that η^{DEC} = 0.9 gives good results (Grahl et al., 2006). The number of subsequent generations before *c*^{Multiplier}<1 is allowed, that is, the maximum no-improvement stretch NIS^{MAX}, is set to 25+*l*, which was empirically observed to give good results. This number varies with *l* since convergence speed typically decreases for increasing *l*.
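The multiplier update can be sketched as follows. This is a simplified, hypothetical rendering of the AVS rule described above (it omits the SDR trigger of the next section, and the reset-to-1 handling is our reading of the text):

```python
ETA_DEC = 0.9            # assumed decrease factor eta^DEC
ETA_INC = 1.0 / ETA_DEC  # symmetric increase factor eta^INC

def update_multiplier(c_mult, improved, nis, nis_max):
    """One generation of the AVS multiplier update (simplified sketch)."""
    if improved:
        # Improvement found: enlarge, or reset to 1 if in the shrinking phase.
        return (1.0, 0) if c_mult < 1.0 else (c_mult * ETA_INC, 0)
    nis += 1                       # extend the no-improvement stretch
    c_mult *= ETA_DEC              # shrink the range of exploration
    if nis <= nis_max:
        c_mult = max(c_mult, 1.0)  # c_mult < 1 not yet allowed
    return c_mult, nis

c, nis = 1.0, 0
c, nis = update_multiplier(c, True, nis, nis_max=30)  # improvement: c grows
```

Termination would then be checked against `c < 1e-10`, as described above.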

### 2.6 SDR

AVS increases *c*^{Multiplier} when fitness values improve. However, improved fitness values do not always mean that the variance needs to be enlarged. This is the case when the Gaussian kernel is already near the optimum. Increasing the variance will then only slow down convergence, as the EDA is unnecessarily forced to explore a larger area of the search space. It is therefore sensible to attempt to separate two cases: traversing a slope, and searching around an optimum.

To this end, the standard deviation ratio (SDR) approach (Bosman et al., 2007) relates the distance of the improvements to the mean in parameter space. If improvements mostly take place far away from the mean, then AVS is used. If most of the improvements are obtained near the mean, then the EDA with ML parameters is focused around the optimum and variance enlargement is not required.

Let **x̄**^{Imp} denote the average of all new samples that were an improvement over the set of selected solutions in the same generation. The distance between **x̄**^{Imp} and the current mean is decomposed along the *l* principal axes of the Gaussian distribution. If along any of these *l* axes the distance exceeds the standard deviation more than a threshold θ^{SDR} of times, further enlargement of *c*^{Multiplier} is triggered. Otherwise, the distribution multiplier remains unchanged. Note that this trigger is independent of the sample range and has a fixed, predefined notion of being “close” to the mean. From previous work, θ^{SDR} = 1 is known to give good results (Bosman et al., 2007).
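The SDR trigger can be sketched as follows (illustrative, not the article's implementation; the principal axes are obtained here from an eigendecomposition of the covariance matrix, and θ^{SDR} = 1 follows the setting above):

```python
import numpy as np

THETA_SDR = 1.0   # threshold on the standard-deviation ratio

def sdr_triggers_enlargement(x_avg_imp, mu, sigma):
    """True if the average improvement lies more than THETA_SDR standard
    deviations from the mean along any principal axis of the distribution."""
    eigvals, eigvecs = np.linalg.eigh(sigma)
    proj = eigvecs.T @ (x_avg_imp - mu)   # project onto the principal axes
    ratios = proj / np.sqrt(eigvals)      # distances in standard deviations
    return bool(np.any(np.abs(ratios) > THETA_SDR))
```

Only when this returns `True` is the multiplier enlargement of the previous section applied.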

### 2.7 AMS

Estimating the distribution using the selected solutions of the current generation only, the density contours can become aligned with directions that contain only solutions of similar quality (Hansen, 2006; Yunpeng et al., 2007; Bosman et al., 2008). Methods that scale the covariance matrix, such as SDR-AVS, do not help much as they almost solely increase search effort in the direction perpendicular to the direction of improvement.

One remedy is to use the anticipated mean shift (AMS; Bosman et al., 2008). The AMS is computed as the difference between the means of subsequent generations, that is, **μ**^{Shift}(*t*) = **μ**(*t*) − **μ**(*t*−1). A part, specifically a fraction α^{AMS}, of the newly sampled solutions is then moved in the direction of the AMS, that is, **x** ← **x** + δ^{AMS}**μ**^{Shift}(*t*). The rationale is that solutions changed by AMS are further down the slope (assuming minimization). Selecting those solutions as well as solutions not changed by AMS aligns the distribution better with the direction of improvement. In a population of size *n* (where ⌊τ*n*⌋ solutions are selected, *n*^{elitist} solutions are maintained and *n*−*n*^{elitist} new solutions are generated), proportioning the selected solutions perfectly between unaltered and AMS-altered solutions requires that α^{AMS}(*n*−*n*^{elitist}) = ½⌊τ*n*⌋ and thus α^{AMS} = ⌊τ*n*⌋/(2(*n*−*n*^{elitist})). From previous work, *n*^{elitist}=1 and δ^{AMS}=2 are known to give good results (Bosman, 2009).
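Applying the anticipated mean shift to a fraction of the newly sampled solutions can be sketched as follows (illustrative; the choice of which solutions to move and the value δ^{AMS} = 2 follow the description above, other details are assumptions):

```python
import numpy as np

DELTA_AMS = 2.0   # factor of mean shift, per the setting above

def apply_ams(new_solutions, mean_shift, alpha_ams, rng):
    """Move a fraction alpha_ams of the new solutions along the mean shift."""
    n = len(new_solutions)
    k = int(alpha_ams * n)                  # number of solutions to move
    idx = rng.choice(n, size=k, replace=False)
    shifted = new_solutions.copy()
    shifted[idx] += DELTA_AMS * mean_shift  # anticipate further improvement
    return shifted

rng = np.random.default_rng(2)
pop = rng.normal(size=(10, 3))
shift = np.array([0.1, 0.0, -0.2])
out = apply_ams(pop, shift, alpha_ams=0.5, rng=rng)
```

Selection then operates on the mixture of shifted and unshifted solutions, which is what realigns the estimated distribution with the direction of improvement.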

### 2.8 AMS-SDR-AVS = AMaLGaM

AMaLGaM combines AMS, SDR, and AVS. Traversing a slope is further sped up by multiplying the movement of solutions in the direction of the AMS by the same multiplier used for the covariance matrix, that is, **x** ← **x** + *c*^{Multiplier}δ^{AMS}**μ**^{Shift}(*t*). The same multiplier is used as for enlarging the values in the covariance matrix, because the multiplier relates to finding improvements far away from the mean, which is indicative that larger movements in the direction of the AMS are beneficial.

## 3 iAMaLGaM

Distribution estimation in AMaLGaM is done from scratch each generation. However, subsequent generations have much in common and therefore the required population size can be reduced by incremental learning, that is, combining the distribution estimated from the currently selected solutions with the distribution used in the previous generation. In iAMaLGaM, a memory-decay approach is taken to this end.

Most of the parameters are stored in the covariance matrix **Σ**. The covariance matrix is also the main reason why the population size has to grow faster with *l* than when only the variances are considered: more samples are required for the ML estimate of the covariance matrix to be well conditioned.

A memory for the AMS is likewise desirable. With less reliable mean estimates, the AMS may oscillate around its optimal direction. A memory can smooth these oscillations out. The equations for incrementally estimating the covariance matrix and the AMS are

**Σ**(*t*) = (1 − η^{Σ})**Σ**(*t*−1) + η^{Σ}**Σ̂**(*t*) and **μ**^{Shift}(*t*) = (1 − η^{Shift})**μ**^{Shift}(*t*−1) + η^{Shift}(**μ**(*t*) − **μ**(*t*−1)),

where **Σ̂**(*t*) is the ML covariance estimate obtained from the currently selected solutions and η^{Σ} and η^{Shift} are learning rates.

Values for the learning-rate parameters η^{Σ} and η^{Shift} were determined empirically in previous work as a function of the selection size ⌊τ*n*⌋ and the number of variables *l*. The parameters were determined by means of regression on the best results found on a set of test problems up to dimensionality *l* = 40 (Bosman, 2009). The resulting parameter settings are shown in Table 1.
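The memory-decay updates translate directly into code (a sketch; the learning-rate values used here are placeholders, not the regressed settings of Table 1):

```python
import numpy as np

def incremental_update(sigma_prev, sigma_ml, shift_prev, mu, mu_prev,
                       eta_sigma=0.3, eta_shift=0.3):
    """Blend the previous model with this generation's ML estimates."""
    sigma = (1.0 - eta_sigma) * sigma_prev + eta_sigma * sigma_ml
    mean_shift = (1.0 - eta_shift) * shift_prev + eta_shift * (mu - mu_prev)
    return sigma, mean_shift

sigma, shift = incremental_update(
    sigma_prev=np.eye(2), sigma_ml=2.0 * np.eye(2),
    shift_prev=np.zeros(2), mu=np.array([1.0, 0.0]), mu_prev=np.zeros(2))
```

With a learning rate of 1, the update degenerates to AMaLGaM's from-scratch estimation; smaller rates give the model a longer memory.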

## 4 Parameter-Free (i)AMaLGaM

EDAs and other EAs typically have many parameters. Removing them greatly eases their use, although it may come at the cost of losing the optimal settings for specific problems. Typically, and this has also been the case for AMaLGaM (see, e.g., Bosman, 2009), guidelines are derived by testing the algorithm on a set of problems and adopting worst-case or average-case parameter settings. In real-valued optimization, this set of problems typically consists of various forms of unimodal problems, most of which are also included in BBOB, and typically includes the Rosenbrock function (which is bimodal) as well, which is also in BBOB. In this section, we summarize the guidelines that have been developed for all parameters in iAMaLGaM, on the basis of which parameter-free versions can be designed. First, we summarize all parameters in iAMaLGaM, as shown in Table 2.

| Parameter | Part | Definition |
|---|---|---|
| τ | General | Selection percentile (select the ⌊τ*n*⌋ best solutions) |
| *n* | General | Population size |
| **Σ** | Distribution | Covariance matrix of the Gaussian distribution |
| **μ** | Distribution | Mean vector of the Gaussian distribution |
| η^{INC} | AVS | Distribution-multiplier increase factor |
| η^{DEC} | AVS | Distribution-multiplier decrease factor |
| NIS^{MAX} | AVS | Threshold on subsequent generations without improvement before *c*^{Multiplier}<1 is allowed |
| θ^{SDR} | SDR | Threshold on ratio of distance of improvements and standard deviation for triggering AVS |
| α^{AMS} | AMS | Percentile of generated solutions to apply AMS to |
| δ^{AMS} | AMS | Factor of mean shift for moving generated solutions |
| *n*^{elitist} | AMS/General | Number of elitist solutions (generate *n*−*n*^{elitist} anew) |
| η^{Σ} | Incremental | Learning rate for covariance matrix |
| η^{Shift} | Incremental | Learning rate for mean shift |


In AMaLGaM, the parameters **μ** and **Σ** of the Gaussian distribution are estimated with maximum likelihood. In iAMaLGaM these estimations are updated using memory decay with empirically determined functions for the learning rates η^{Σ} and η^{Shift}, as described in Section 3. Guidelines for setting all parameters in AMS, SDR, and AVS are given above and are based on previous work. The parameters for estimation and variation are thereby defined. For the selection step, truncation selection with a truncation percentile of τ = 0.35 (i.e., select the 35% best solutions) is known to give good results (Bosman, 2009).

What remains is an important parameter: the population size *n*. To overcome the burden of tuning its value, an EA can be restarted after convergence with an exponentially increasing population size (see, e.g., Auger and Hansen, 2005). With a small population the EA is expected to terminate quickly, wasting only a few evaluations if the optimum is not found. However, for real-valued optimization with a Gaussian EDA, enlarging the population size may not always be the most effective approach, because a single Gaussian essentially represents a centered search. Increasing the population size therefore does not necessarily increase the probability of locating a global optimum that lies far away from the mean of the Gaussian kernel. The population size may have to be greatly increased before initialization places solutions near the global optimum with better values than solutions at other local optima. Therefore, (i)AMaLGaM uses a different approach (Bosman, 2009).

The number of solutions is increased upon each restart by alternating between two approaches: a single run with a larger population and several parallel runs. To maximize the joint global effect of the parallel runs, they are started in separate regions obtained from clustering the search space. When increasing the number of parallel runs, the subpopulation size is also increased slightly so as to increase the robustness of the more localized searches. Let *t*^{Restart} denote the number of restarts so far (i.e., the first run has *t*^{Restart}=0) and let *m* denote the number of parallel executions of subpopulation size *n*^{Sub} (i.e., *mn*^{Sub}=*n*). The restart mechanism in (i)AMaLGaM is:

- if *t*^{Restart} is even, the number of parallel runs *m* is increased and the subpopulation size *n*^{Sub} is also enlarged slightly;
- if *t*^{Restart} is odd, a single run is performed (*m* = 1) with an enlarged population size.

Here, *n*^{Base} is a baseline population size. Previous work suggests the following values:

- Full covariance matrix — AMaLGaM: *n*^{Base} = 17 + 3*l*^{1.5}; iAMaLGaM: *n*^{Base} = 10*l*^{0.5}
- Bayesian factorization — AMaLGaM: *n*^{Base} = 12 + 8*l*^{0.7}; iAMaLGaM: *n*^{Base} = 7*l*^{0.5}
- Univariate factorization — AMaLGaM: *n*^{Base} = 10*l*^{0.5}; iAMaLGaM: *n*^{Base} = 4*l*^{0.5}
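The baseline population-size guidelines translate directly into a small lookup (a sketch; the function name and interface are ours, the formulas are the values listed above):

```python
def n_base(l, variant="full", incremental=False):
    """Baseline population size n^Base as a function of dimensionality l."""
    table = {
        ("full", False):       lambda l: 17 + 3 * l**1.5,
        ("full", True):        lambda l: 10 * l**0.5,
        ("bayesian", False):   lambda l: 12 + 8 * l**0.7,
        ("bayesian", True):    lambda l: 7 * l**0.5,
        ("univariate", False): lambda l: 10 * l**0.5,
        ("univariate", True):  lambda l: 4 * l**0.5,
    }
    return int(table[(variant, incremental)](l))

sizes = {l: n_base(l, "full") for l in (5, 20, 40)}
```

The restart wrapper then multiplies these baselines according to the even/odd alternation described above.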

## 5 Pseudocode

For technical completeness, pseudocode is presented for AMaLGaM itself (see Algorithm 1) as well as its parameter-free version (see Algorithm 2), which can be seen as a wrapper.

## 6 BBOB Results

### 6.1 Without Noise

BBOB experiments were performed with random initial solutions in [−5, 5]^{l} and at most 10^{6}*l* function evaluations. Here we describe the results on functions without noise.

#### 6.1.1 AMaLGaM-Full

| #FEs/D | Best | 10% | 25% | Median | 75% | 90% | #FEs/D | Best | 10% | 25% | Median | 75% | 90% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1.0 | 1.9 | 2.2 | 3.1 | 4.2 | 8.5 | 2 | 1.0 | 3.7 | 11 | 31 | 40 | 40 |
| 10 | 1.7 | 3.1 | 3.4 | 3.9 | 5.6 | 11 | 10 | 4.8 | 7.4 | 15 | 1.2e2 | 2.0e2 | 2.0e2 |
| 100 | 1.0 | 1.6 | 2.9 | 6.1 | 8.0 | 30 | 100 | 4.2 | 6.5 | 10 | 27 | 35 | 2.7e2 |
| 1e3 | 1.1 | 2.1 | 2.7 | 8.4 | 21 | 77 | 1e3 | 1.0 | 1.7 | 4.6 | 20 | 66 | 2.4e2 |
| 1e4 | 1.0 | 1.4 | 3.3 | 7.8 | 41 | 66 | 1e4 | 1.0 | 1.7 | 2.3 | 12 | 86 | 6.2e2 |
| 1e5 | 1.0 | 1.4 | 3.3 | 5.2 | 32 | 1.5e2 | 1e5 | 1.0 | 1.2 | 2.2 | 19 | 4.5e2 | 1.1e3 |
| 1e6 | 1.0 | 1.4 | 3.3 | 5.3 | 28 | 1.7e2 | 1e6 | 1.0 | 1.2 | 2.2 | 12 | 2.6e2 | 3.1e3 |
| RL_{US}/D | 1e6 | 1e6 | 1e6 | 1e6 | 1e6 | 1e6 | 1e7 | 1.0 | 1.2 | 2.2 | 12 | 3.5e2 | 2.7e4 |
|  |  |  |  |  |  |  | RL_{US}/D | 1e6 | 1e6 | 1e6 | 1e6 | 1e6 | 1e6 |


Left: *f*_{1}−*f*_{24} in 5D, maxFE/D = 1.00e6. Right: *f*_{1}−*f*_{24} in 20D, maxFE/D = 1.00e6.

Table 3 shows that, compared to the best results obtained in the BBOB-2009 workshop, AMaLGaM's median performance over all benchmark functions becomes less favorable with increasing problem size. The scalability of AMaLGaM in terms of function evaluations is worse than the best known results. Figure 1 underlines this result. The difference in scalability is particularly clear for the Sphere function.

On most benchmark functions up to *l*=40, AMaLGaM identifies the optimum within the highest measured precision. Its perceived scalability is polynomial, even on multimodal problems. It is known from a related recent experimental study, which includes some of the BBOB functions and dimensionalities up to *l* = 200, that the perceived scalability remains polynomial on the Sphere, Ellipsoid, Sharp ridge, and Rosenbrock functions (Bosman, 2009). These results lead us to conclude that AMaLGaM's scalability is robust. It was found experimentally that on a set of linear and quadratic functions the perceived scalability of AMaLGaM with a full covariance matrix is approximately quadratic (Bosman, 2009).

AMaLGaM's scalability on other functions is not always approximately quadratic and 10^{6}*l* evaluations are not always enough to obtain the optimum within the highest precision. The hardest problems for AMaLGaM are functions 4 (skew Rastrigin-Bueche separable) and 24 (Lunacek bi-Rastrigin). However, the results for lower precision again suggest polynomial scalability. This suggests that AMaLGaM-Full is a robust algorithm able to overcome rotation and, using restarts, multimodality, without losing polynomial scalability. Figure 2 shows the proportion of successfully tackled benchmark problems. AMaLGaM efficiently solves a wide range of problems and is among the most successful algorithms in the BBOB-2009 workshop.

#### 6.1.2 The Impact of Incremental Learning

We also tested AMaLGaM with incremental model building. Results for comparing non-incremental model building to incremental model building using a full covariance matrix (i.e., (i)AMaLGaM-Full) are shown in Figures 2 and 3. Further experiments were performed to study incremental model building using factorized Gaussian distributions. The impact of incremental model building was found to be similar.

Figure 3 shows that incremental model building impacts the performance of AMaLGaM, but not uniformly positively or negatively. On unimodal problems such as Sphere, Ellipsoid, and Linear slope, results for iAMaLGaM are better, and the difference increases with dimensionality and target quality. This is in agreement with the aforementioned study, in which the perceived scalability of iAMaLGaM-Full remains polynomial and is slightly better than that of AMaLGaM-Full on the Sphere, Ellipsoid, Sharp ridge, and Rosenbrock functions (Bosman, 2009).

For multimodal problems, for example, Rastrigin, Weierstrass, and Griewank-Rosenbrock, nonincremental AMaLGaM performs better. This is most likely due to its larger base-population size. iAMaLGaM is sometimes unable to find the optimum within the predetermined limit of function evaluations because it starts from a smaller base population size. Hence, if memory resources are not very important, a larger base-population size helps in optimizing multimodal problems.

AMaLGaM and iAMaLGaM solve a similar proportion of problems, as shown in Figure 2. If nothing is known about the problem, that is, in a true BBO setting, the nonincremental version may be preferred unless larger population sizes are problematic, because over all problems it obtains slightly better results using fewer function evaluations.

#### 6.1.3 The Impact of Factorization

We compared AMaLGaM-Full with AMaLGaM using a Bayesian factorization with few parents, or a univariate factorization (see Section 2.4). We allowed at most five parents per variable in the Bayesian factorization. The results comparing the full covariance matrix with the Bayesian factorization are shown in Figures 4 and 5. The results comparing the full covariance matrix with the univariate factorization are shown in Figures 6 and 7.

Without a full covariance matrix, not all pairwise dependencies can be processed and not all ellipsoid density contours can be aligned with an arbitrary direction. This can lead to an inefficient representation of promising areas of the search space and therefore to inefficient exploitation of the structure of the landscape. For sufficiently low-dimensional problems, this issue is absent for the Bayesian factorization, since five parents are allowed per variable. The cumulative results in Figure 5 show little difference between the two algorithms for *l* = 5. As the problem dimensionality increases, the proportion of dependencies that can be expressed decreases, making the Bayesian-factorization approach increasingly similar to the univariate-factorization approach in terms of performance. Whereas for *l* = 5 the Bayesian factorization is still capable of solving most problems, the univariate factorization can solve a smaller number of problems, as shown in Figure 6. The only significant difference between using a Bayesian factorization and a univariate factorization for higher-dimensional problems appears when the number of dependencies in the problem is not too high, which is the case for function 8, Rosenbrock. Here the Bayesian factorization outperforms the univariate factorization and even remains more efficient than the full covariance matrix.

Computing a Bayesian factorization is time-consuming even when using a greedy heuristic. Although less costly than working with a full covariance matrix, the expected added benefit quickly vanishes as problem dimensionality increases, unless the number of dependencies between problem variables is known to be small. Computing Bayesian factorizations for Gaussian distributions is thus arguably not worth the effort, and one can focus on either the full covariance matrix or the univariate factorization. The results in Figure 7 show that for problems with no dependencies, the univariate factorization is more efficient. Favorable polynomial scalability was also found on the Sphere, Ellipsoid separable, and Sharp ridge functions (Bosman, 2009). The univariate factorization is also much faster in terms of actual computation time, which then scales slightly above quadratically, compared to halfway between cubic and quartic for the full covariance matrix. This makes problem dimensionality and the actual time for a function evaluation the most important discriminators in choosing the most appropriate EDA. Considering only function evaluations, the use of the full covariance matrix is superior.

### 6.2 With Noise

#### 6.2.1 AMaLGaM-Full

| #FEs/D | Best | 10% | 25% | Median | 75% | 90% | #FEs/D | Best | 10% | 25% | Median | 75% | 90% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1.0 | 1.2 | 1.6 | 2.1 | 5.6 | 10 | 2 | 1.0 | 1.0 | 3.8 | 23 | 40 | 40 |
| 10 | 1.1 | 1.4 | 1.8 | 3.4 | 5.0 | 28 | 10 | 1.0 | 2.4 | 10 | 1.3e2 | 2.0e2 | 2.0e2 |
| 100 | 1.0 | 1.0 | 2.1 | 5.5 | 17 | 2.8e2 | 100 | 1.0 | 1.0 | 1.4 | 8.1 | 28 | 1.0e3 |
| 1e3 | 1.0 | 4.2 | 8.7 | 18 | 49 | 2.6e3 | 1e3 | 1.0 | 1.0 | 2.3 | 13 | 32 | 2.0e4 |
| 1e4 | 1.0 | 1.4 | 7.2 | 18 | 30 | 2.5e4 | 1e4 | 1.0 | 2.3 | 5.5 | 14 | 56 | 1.0e5 |
| 1e5 | 1.0 | 1.4 | 4.0 | 16 | 51 | 2.4e2 | 1e5 | 1.0 | 1.2 | 5.5 | 19 | 1.1e2 | 1.0e6 |
| 1e6 | 1.0 | 1.4 | 8.2 | 18 | 80 | 2.4e2 | 1e6 | 1.0 | 1.1 | 5.5 | 23 | 1.3e2 | 7.1e2 |
| RL_{US}/D | 1e6 | 1e6 | 1e6 | 1e6 | 1e6 | 1e6 | 1e7 | 1.0 | 1.1 | 4.2 | 29 | 1.3e2 | 1.3e3 |
|  |  |  |  |  |  |  | RL_{US}/D | 1e6 | 1e6 | 1e6 | 1e6 | 1e6 | 1e6 |


Left: *f*_{101}−*f*_{130} in 5D, maxFE/D = 1.00e6. Right: *f*_{101}−*f*_{130} in 20D, maxFE/D = 1.00e6.

Table 4 shows that, as in the noiseless case, the results of AMaLGaM relative to the best results from BBOB 2009 become less favorable with increasing dimensionality, although the decline appears to be smaller.

Figure 8 indicates that the perceived polynomial scalability is upheld even in the presence of noise, though at a higher degree. More evaluations are required and the optimum is not always found. AMaLGaM is robust and again among the best algorithms submitted to the BBOB-2009 workshop, as shown in Figure 9.

Problems with uniform noise are the hardest for AMaLGaM; here its scalability declines strongly. In the case of uniform noise, the noise strength becomes more severe near the optimum, which makes it harder for the algorithm to zoom in on the region of the best function values, so an increasing number of additional generations is required to overcome decision errors. Cauchy noise, by contrast, is applied only sparingly rather than on every sample, and is much less problematic for AMaLGaM.

#### 6.2.2 The Impact of Noise Averaging

EDAs, including AMaLGaM, are not specifically designed to handle noise. Function evaluations are assumed to be without noise so that selection always returns the better solutions. The lack of explicit noise modeling (in the distribution) is not necessarily a drawback of EDAs as arguably modeling noise can already be done in the selection phase, possibly including a memory for previous generations to get a more reliable ordering of solutions before selecting the better ones for probabilistic modeling.

Here we consider one of the simplest adaptations aimed at getting more robust results in the presence of noise: any evaluation of a solution is the sample average of multiple evaluations of the actual, noisy, function. This potentially smoothes out noise, but the question is whether such costly evaluation benefits the performance of AMaLGaM. We compare AMaLGaM with a full covariance matrix to the same EDA in which evaluations are averaged over 10 trials. The results are given in Figures 9 and 10. AMaLGaM with noise averaging is allowed 10 times more function evaluations. The resulting increase in computation time prohibited us from obtaining results for *l*=40.
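The noise-averaging adaptation amounts to a simple wrapper around the objective function. The sketch below (Python, with hypothetical names; `noisy_sphere` is an illustrative stand-in for a BBOB noisy function, not the actual benchmark code) averages each evaluation over a fixed number of trials:

```python
import numpy as np

def averaged(f, trials, rng):
    """Wrap a noisy objective f(x, rng) so that each evaluation is the
    sample average over `trials` independent calls. Note that every
    call to the wrapper costs `trials` function evaluations."""
    def f_avg(x):
        return sum(f(x, rng) for _ in range(trials)) / trials
    return f_avg

def noisy_sphere(x, rng):
    """Hypothetical noisy objective: true value plus Gaussian noise."""
    return float(np.dot(x, x) + rng.normal(scale=0.5))
```

Because every trial counts against the evaluation budget, averaging over 10 trials multiplies the evaluation cost per solution by 10, which is exactly the bookkeeping used in the comparisons below.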

AMaLGaM is hardly influenced on functions with moderate noise; see Figure 10. The ERT of the algorithm with averaging lies on the first diagonal above the main diagonal, which corresponds to a performance decrease of a factor of 10, exactly the number of trials used for averaging. For moderate Cauchy noise, the algorithm that employs averaging actually gets worse, and this holds for Cauchy noise in general. Since the sample average of Cauchy-distributed samples follows the same distribution as a single sample drawn from that Cauchy distribution, we would expect the performance of AMaLGaM with averaging over 10 trials in the presence of Cauchy noise to be exactly 10 times worse. However, Cauchy noise is applied only seldom in the BBOB framework. When averaging over multiple trials, the probability that the evaluation of a single solution contains at least one Cauchy-noise sample strongly increases. As a result, the search actually becomes more noisy with noise averaging.
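The contrasting behavior of Cauchy and uniform noise under averaging can be illustrated directly. In the small Python demonstration below (illustrative only, not BBOB code), the spread of a 10-trial average is essentially unchanged for Cauchy samples, while it shrinks markedly for uniform samples:

```python
import numpy as np

rng = np.random.default_rng(0)

def iqr(a):
    """Interquartile range: a spread measure that stays finite even
    for the heavy-tailed Cauchy distribution (whose variance is
    undefined)."""
    q1, q3 = np.percentile(a, [25, 75])
    return q3 - q1

n, trials = 20000, 10
# Cauchy: the mean of 10 standard-Cauchy samples is again standard Cauchy.
cauchy_single = rng.standard_cauchy(n)
cauchy_avg = rng.standard_cauchy((n, trials)).mean(axis=1)
# Uniform: by the central limit theorem the mean concentrates.
unif_single = rng.uniform(size=n)
unif_avg = rng.uniform(size=(n, trials)).mean(axis=1)

print(iqr(cauchy_single) / iqr(cauchy_avg))  # close to 1: no smoothing
print(iqr(unif_single) / iqr(unif_avg))      # well above 1: smoothing
```

This is the stability property of the Cauchy distribution referred to above: averaging buys nothing against pure Cauchy noise, while it visibly narrows the spread of bounded (uniform) noise.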

For Gaussian noise, averaging over 10 trials and no averaging yield similar results. In some cases, such as Griewank-Rosenbrock, averaging even surpasses no averaging. For uniform noise, which in the case of the BBOB framework is more severe, there is almost always an improvement with the use of averaging, especially for higher dimensionality. The uniform distribution obeys the conditions of the central limit theorem and therefore the sample average, in the limit, follows a Gaussian distribution (with shrinking variance as more trials are used in the averaging). From the results on Gaussian noise we already know that AMaLGaM is capable of handling this type of noise and that the impact of noise averaging becomes more substantial with increasing problem dimensionality. Hence, it is a matter of how severe the noise is. The larger the noise variance, the more need for noise averaging and the bigger the impact.

Regarding the entire benchmark, noise averaging does not improve overall performance, as shown in Figure 9. It slows the EDA down by almost the same factor as the number of trials averaged over, especially for large dimensionality. This is underlined by the results obtained using 100 trials, as shown in Figure 11. Given the number of required evaluations and the available time, experiments were performed up to *l*=10. The gap between AMaLGaM with and without noise averaging increases accordingly.

Tackling noise more efficiently can be important because many real-life problems are noisy. One issue is to detect the frequency of noise. Infrequent noise may lead to an increase in stochasticity as a result of averaging, which may be detected and used to reduce or disable noise averaging. In general, noise averaging was found only to be useful for uniform noise or for severe-enough Gaussian noise and for large-enough problem dimensionalities. A second issue that may be detected is whether the noise comes from a model that satisfies the conditions for the central limit theorem to hold and whether the variance for an evaluation decreases. We further suppose that noise can be addressed more effectively if it is explicitly modeled in EDAs. A separate model could be maintained which explicitly models uncertainty. This model could be used to guide selection and make fewer decision errors regarding which solution is better.

## 7 Conclusions

We have given an overview of a real-valued EDA called AMaLGaM that is based on the Gaussian distribution and benchmarked it on the noiseless and noisy functions within the BBOB framework. We tested variants in which the Gaussian distribution was factorized and in which incremental learning over multiple generations was used.

Experimental evidence suggests that AMaLGaM is a robust algorithm that solves the benchmark functions with polynomial scalability in terms of the required number of function evaluations, showing similar scalability on similar (quadratic or linear) functions regardless of rotations. Combined with a restart scheme that increases the population size and runs multiple instances of AMaLGaM in parallel, the algorithm is robust to noise and multimodality. In the presence of noise, AMaLGaM still appears to obtain polynomial scalability, albeit of higher degree. AMaLGaM was among the best performing algorithms submitted to the BBOB-2009 workshop.

Averaging evaluations over multiple trials to smooth out noise was found to make the EDA less efficient unless the problem is of high dimensionality and the noise satisfies the conditions of the central limit theorem. For future work, we have pointed out that explicitly modeling noise in EDAs may be beneficial.