## Abstract

In this paper we extend a previously proposed randomized landscape generator in combination with a comparative experimental methodology to study the behavior of continuous metaheuristic optimization algorithms. In particular, we generate two-dimensional landscapes with parameterized, linear ridge structure, and perform pairwise comparisons of algorithms to gain insight into what kind of problems are easy and difficult for one algorithm instance relative to another. We apply this methodology to investigate the specific issue of explicit dependency modeling in simple continuous estimation of distribution algorithms. Experimental results reveal specific examples of landscapes (with certain identifiable features) where dependency modeling is useful, harmful, or has little impact on mean algorithm performance. Heat maps are used to compare algorithm performance over a large number of landscape instances and algorithm trials. Finally, we perform a meta-search in the landscape parameter space to find landscapes that maximize the performance difference between algorithms. The results are related to some previous intuition about the behavior of these algorithms, but at the same time lead to new insights into the relationship between dependency modeling in EDAs and the structure of the problem landscape. The landscape generator and overall methodology are quite general and extendable and can be used to examine specific features of other algorithms.

## 1. Introduction

An important research direction in evolutionary and metaheuristic optimization is to improve our understanding of the relationship between algorithms and the optimization problems that they are applied to. From a theoretical point of view, all optimization algorithms perform equally when most reasonable measures of performance are averaged over the space of all possible optimization problems (Wolpert and Macready, 1997). While algorithm-performance relationships may not be of interest in this general case, in practice, the optimization problems that arise from real-world domains are only a small fraction of the space of all possible problems. Hence, experimental results that are reported in the literature typically show significant performance differences between algorithms. In a general sense, an algorithm can be expected to perform well if the assumptions that it makes, either explicit or implicit, are well-matched to the properties of the search landscape or solution space of a given problem or set of problems.

While it is possible to carry out theoretical investigations of the performance of specific algorithms and their behavior on certain (typically simple) problems, it is also useful to take a systematic and rigorous approach to the experimental analysis of algorithms. One class of tools that has been developed to assist in this approach is randomized problem (landscape) generators. Landscape generators have some favorable properties that can be used to gain insights into the behavior of metaheuristic optimizers with respect to underlying properties of the problem instances generated.

The main goal of this paper is to introduce a new experimental methodology for comparing continuous metaheuristic optimization algorithms. Specifically, we extend a previously proposed randomized landscape generator in combination with a methodology inspired by Langdon and Poli (2007). We analyze pairwise performance comparisons of algorithms on 2D test problems with linear ridge structure to gain insight into problem difficulty for different algorithm instances. The second goal of the paper is then to illustrate the use of this approach to investigate the specific issue of explicit dependency modeling in the estimation of multivariate normal algorithm (EMNA) compared to the univariate marginal distribution algorithm (UMDA_{c}) which does not model variable dependencies. The overall methodology is quite general and can be used to examine experimentally the specific features of other algorithms.^{1}

An outline of the paper is as follows. Section 2 gives an overview of the previous work that provides a basis for the methodology in this paper. The extension of the landscape generator to incorporate linear ridge structure and some illustrative experiments are presented in Section 3. In Section 4 the extended generator and methodology are used to study the relationship between dependencies in problem variables and the modeling in UMDA_{c} and EMNA. Section 5 proposes and conducts an active search over the space of landscape generator parameters for parameter value vectors that differentiate algorithms. Section 6 concludes the paper. This paper is based in part on the work presented in Morgan and Gallagher (2010) but contains significant revisions and extensions.

## 2. Background and Motivation

The methodology and practice of experimental research in metaheuristics is receiving increasing attention in the literature as a means of evaluating and comparing the performance of newly proposed and existing algorithms (Bartz-Beielstein, 2006; Bartz-Beielstein et al., 2010). Such directions in experimental algorithmics can also be seen in the broader computer science and optimization literature, to analyze algorithms where a full theoretical understanding is lacking (Johnson, 2002; McGeoch, 1996, 2007). While experiments have always been used to illustrate the performance of metaheuristics, there is some evidence to suggest that many researchers in the field are striving to increase the size, rigor, sophistication, and detail of the experimental studies they undertake. This includes the development of large-scale competitions and associated sets of benchmark test problems (e.g., the Special Session on Real-Parameter Optimization at the 2005 Congress on Evolutionary Computation, CEC, and subsequent competitions at CEC^{2}; the 2009 and 2010 workshops on black-box optimization benchmarking at the *Genetic and Evolutionary Computation Conference*, GECCO).^{3}

Recent research has also explored the use of machine learning, data mining, exploratory data analysis, and visualization techniques for interpreting metaheuristic experimental data (see, e.g., Smith-Miles et al., 2010; Mersmann et al., 2011; Corne and Reynolds, 2011). While this work is not specifically related to the main contributions of this paper, it shares the general aim of using such techniques to increase the value of experimental results, for example in the recognition of patterns and trends in large amounts of data.

Several different types of test problems have been used in the literature for the evaluation of metaheuristic optimization algorithms, including constructed analytical functions, real-world problem instances or simplified versions of real-world problems, and problem/landscape generators (Gallagher and Yuan, 2006; Addis and Locatelli, 2007; Gaviano et al., 2003; MacNish, 2007). Different problem types have their own characteristics; however, it is usually the case that complementary insights into algorithm behavior result from conducting larger experimental studies using a variety of different problem types (Rardin and Uzsoy, 2001).

Max-Set of Gaussians (MSG) is a randomized landscape generator that specifies test problems as a weighted max-sum of Gaussian functions (Gallagher and Yuan, 2006). MSG defines a distribution over the parameters of this function, including the number of Gaussian component functions and the mean and covariance parameters for each component. A variety of test landscape instances can then be generated by sampling from this distribution. The topological properties of the landscapes generated are intuitively related to (and vary smoothly with) the parameters of the generator. Two specific landscape instances based on the MSG generator are incorporated as part of the BBOB benchmark test functions (Finck et al., 2009).

Landscape or test-problem generators typically require some degree of manual specification of the landscape properties. In contrast, Langdon and Poli (2007) use genetic programming (GP) to evolve landscapes for the evaluation and comparison of metaheuristics. Individuals in the GP are candidate landscapes, represented and evolved as 2D polynomial functions. The fitness function for the GP is defined as the performance difference between two specified algorithms that are trialled on a landscape. Consequently, landscapes found by the GP are optimization problems where one of the algorithms significantly outperforms the other. The results show that considerable new insights can be gained into the behavior of the algorithms tested and their parameter settings.

Langdon and Poli's methodology is generally applicable for comparing metaheuristic optimization algorithms and for discovering interesting behavior of algorithms, particularly on individual landscape instances. However, evolving polynomial functions using GP does not allow for systematic control of the fitness landscape topology. The complexity of the polynomial function produced by the GP is also variable and biased toward simple functions because complex functions will tend to require polynomials with many terms (long expressions). Langdon and Poli use a simple GP (tinyGP) with a minimal function and terminal set and point out that they are “… using GP as a tool, it is the landscapes that it produces that are important” (Langdon and Poli, 2007, p. 562). In this sense, our paper proposes an alternative, additional technique for producing interesting landscapes.

A novel and interesting possibility that we explore in this paper is to combine the advantages of a randomized landscape generator with an active search or meta-optimization at the level of the landscape generator parameters in order to discover landscapes that maximize the performance difference between algorithms. This approach allows greater control over the types of landscapes generated through the parameterization of the MSG generator, compared to using a GP to evolve arbitrary polynomial functions. Experiments can be conducted while systematically and incrementally varying the landscape parameters, including repeated trials to examine trends in the distribution of performance. If a parameterization is found that produces a significant performance difference between two algorithms, a large number of problem instances can be generated with known topological features for analysis and further experimentation.

## 3. Randomized Landscapes with Dependency Structure

### 3.1. The MSG Landscape Generator

In the MSG landscape function, *g*_{i}(**x**) is the *i*th Gaussian component in a Gaussian mixture model defined over the search space. A specification of the MSG landscape generator can be written as a tuple that gives, in turn, the form of the problem generator (MSG), the dimensionality of the search space (*n*), the boundary constraints of the search space ([−*s*, +*s*]^{n}), the number of Gaussian components (*m*), the distribution used to generate mean vectors of components (*D*_{μ}), the distribution/procedures used to generate covariances of components (*D*_{Σ}), the threshold for local optima (*t*), and the fitness value of the global optimum (*G**). As an example, random MSG landscapes can be specified so as to generate 2D landscapes composed of 20 Gaussians with means uniformly distributed in [−1, 1]^{2}, component covariance values between 0.1 and 2.1, at uniform rotations and with a threshold of 0.8 for the maximum fitness of local optima. Another example specification is similar but generates big valley landscapes because the means of the components are drawn from a Gaussian distribution centered at the origin.
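To make the generator description concrete, the following is a minimal Python sketch of a random 2D MSG-style landscape. It is not the reference implementation: it assumes that the landscape value is the maximum over unnormalized Gaussian bumps whose peak heights play the role of the component weights, that the first component is the global optimum with height *G** = 1, and that the "covariance values" are variances along each component's principal axes. The function names and defaults are hypothetical.

```python
import math
import random


def make_component(mean, var1, var2, angle_deg, height):
    """Return a 2D Gaussian bump with peak value `height` at `mean`."""
    th = math.radians(angle_deg)
    c, s = math.cos(th), math.sin(th)
    # Inverse of the covariance R diag(var1, var2) R^T, written out for 2D.
    p11 = c * c / var1 + s * s / var2
    p22 = s * s / var1 + c * c / var2
    p12 = c * s * (1.0 / var1 - 1.0 / var2)

    def g(x):
        d1, d2 = x[0] - mean[0], x[1] - mean[1]
        q = p11 * d1 * d1 + 2.0 * p12 * d1 * d2 + p22 * d2 * d2
        return height * math.exp(-0.5 * q)

    return g


def random_msg_landscape(m=20, s=1.0, t=0.8, g_star=1.0, rng=None):
    """Build a fitness function as the max over m random Gaussian bumps.

    Assumption: component 0 is the global optimum (height g_star) and the
    remaining heights are uniform in [0, t].
    """
    rng = rng if rng is not None else random.Random()
    components, global_mean = [], None
    for i in range(m):
        mean = (rng.uniform(-s, s), rng.uniform(-s, s))
        height = g_star if i == 0 else rng.uniform(0.0, t)
        if i == 0:
            global_mean = mean
        components.append(
            make_component(mean, rng.uniform(0.1, 2.1), rng.uniform(0.1, 2.1),
                           rng.uniform(0.0, 180.0), height)
        )

    def fitness(x):
        return max(g(x) for g in components)

    return fitness, global_mean
```

Note that in this sketch a low component can be overshadowed by a taller neighbor, so the number of distinct local optima may be smaller than *m*.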

### 3.2. Constructing Linear Ridges in Randomized Landscapes

In the line equation, Equation (4), *a*, *b*, and *c* are the parameters of the line and *a* ∧ *b* ≠ 0. Our aim is to generate linear ridges positioned randomly in the search space, with a specified angle to the coordinate axes. A ridge can be formed by positioning a number of Gaussian components such that their means are distributed along a line. The means are generated by first producing two points within the bounds [−*s*, +*s*]^{2}, *e*_{1} and *e*_{2}, such that *e*_{2} is rotated *θ* degrees from *e*_{1}. The two points are then used to solve Equation (4) for *a*, *b*, and *c*. Then, the mean points of the *m* Gaussians are determined by generating values of *x*_{1} and *x*_{2} that satisfy Equation (4) and are within the search space bounds.

The orientation (rotation angle) of each Gaussian component on the ridge is determined via its covariance structure. Firstly, an orientation angle is specified. This would impose a homogeneous structure on the ridge with every local peak at the same orientation (between completely aligned with, or orthogonal to the linear ridge). It is possible that a given algorithm might profit from this specific structure, which is not a desirable property from the point of view of a randomized landscape generator. To remove this homogeneous structure, the orientation of each component is subsequently adjusted by a small amount of noise. Note that this is only one possible method of generating 2D ridge landscapes; to generalize the procedure to produce landscapes of higher dimensionality, different techniques may be utilized. Examples of ridge landscapes resulting from this method are shown in Figure 1. The parameterizations for Figure 1(a) and Figure 1(b) are and , respectively.
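The ridge construction can be sketched in Python as follows. This is an illustration under assumptions, not the authors' procedure: instead of solving the line equation from two points, it uses an equivalent point-and-direction parameterization (a random anchor point plus a direction at angle `theta_deg`), and it draws orientation noise uniformly from a hypothetical `noise_deg` range.

```python
import math
import random


def ridge_means(m, s, theta_deg, rng):
    """Place m component means on a line at angle theta_deg inside [-s, s]^2."""
    th = math.radians(theta_deg)
    dx, dy = math.cos(th), math.sin(th)
    # Anchor point of the line, drawn uniformly in the search space.
    x0, y0 = rng.uniform(-s, s), rng.uniform(-s, s)
    # Range of the line parameter u for which the point stays within bounds.
    lo, hi = -math.inf, math.inf
    for p, d in ((x0, dx), (y0, dy)):
        if abs(d) > 1e-12:
            a, b = (-s - p) / d, (s - p) / d
            lo, hi = max(lo, min(a, b)), min(hi, max(a, b))
    us = [rng.uniform(lo, hi) for _ in range(m)]
    return [(x0 + u * dx, y0 + u * dy) for u in us]


def ridge_orientations(m, angle_deg, noise_deg, rng):
    """Give each component the same base orientation plus a small random offset."""
    return [angle_deg + rng.uniform(-noise_deg, noise_deg) for _ in range(m)]
```

Each mean and orientation pair would then define one Gaussian component of the ridge.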

### 3.3. Illustrative Experiments

These illustrative experiments use the direct algorithm, UMDA_{c} (see Section 4), and an implementation of simulated annealing on random (i.e., with component mean values uniformly distributed in the feasible search space), big valley, and ridge structured landscapes over a varying number of components. In the parameterizations of the three landscape types, *m* is the number of components and is controlled for the experiment.

The direct algorithm is very proficient at 2D problems, and so only 100 function evaluations were allocated to it. The instance of UMDA_{c} had a population of 50 with a selection threshold of 0.8, and a total of 50 generations (2,500 function evaluations). The simulated annealing instance had a cooling schedule of *T* = 0.9*T*, with at most 30 function evaluations within each temperature. With the stop temperature of 10^{−8}, there are at most 5,250 function evaluations for our implementation of simulated annealing.
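The evaluation bound for simulated annealing can be checked with a few lines of Python. The sketch assumes an initial temperature of 1, which is not stated in the text but reproduces the quoted bound; the function name and parameter names are hypothetical.

```python
def sa_max_evaluations(t_start=1.0, t_stop=1e-8, cooling=0.9, evals_per_temp=30):
    """Upper bound on function evaluations for a geometric cooling schedule."""
    temperature, levels = t_start, 0
    while temperature > t_stop:
        levels += 1              # one more temperature level is visited
        temperature *= cooling   # T = 0.9 T
    return levels * evals_per_temp


print(sa_max_evaluations())  # 175 temperature levels x 30 evaluations = 5250
```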

The goal of this illustrative experiment is to compare the direct algorithm with UMDA_{c} and simulated annealing on a variety of landscapes to see if there is any variety in the performance difference between the algorithms across the landscape types or across the number of Gaussian components. The performance difference^{4} is calculated as follows. A landscape instance is initialized from a given parameterization, and both algorithms are run on the instance for 30 random restarts. The best fitness found by each algorithm is averaged across the 30 restarts, and the absolute value of their difference is recorded. This is done for 30 landscape instances of the given parameterization, and repeated across different parameterizations. Statistical tests could be performed to measure the significance of the mean best fitness found by each algorithm over the 30 restarts; however, as this experiment is purely illustrative, we have not performed such tests.
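The performance-difference procedure described above can be sketched as a small Python function. The callables `make_landscape`, `algorithm_a`, and `algorithm_b` are hypothetical placeholders: the first returns a landscape instance for a given parameterization, and each algorithm returns the best fitness found in one restart.

```python
import random
from statistics import mean


def performance_difference(make_landscape, algorithm_a, algorithm_b,
                           instances=30, restarts=30, seed=0):
    """Absolute difference in mean best fitness, one value per landscape instance."""
    rng = random.Random(seed)
    differences = []
    for _ in range(instances):
        landscape = make_landscape(rng)
        # Average the best fitness found over the random restarts of each algorithm.
        best_a = mean(algorithm_a(landscape, rng) for _ in range(restarts))
        best_b = mean(algorithm_b(landscape, rng) for _ in range(restarts))
        differences.append(abs(best_a - best_b))
    return differences
```

Repeating the call for each parameterization gives the set of points plotted per landscape type.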

Figure 2(a) shows the absolute mean performance difference between direct and UMDA_{c} over restarts for all problem instances. We see that direct and UMDA_{c} perform quite similarly on big valley, but quite differently on random landscapes. On ridge landscapes, the difference is somewhere in between. Performance is also not strongly related to the number of Gaussian components, except perhaps when the number of components equals 1. In this case, the difference is consistently very small for big valley landscapes, since the global peak will be biased toward the center of the search space. This is not true for random and ridge landscapes.

Figure 2(b) shows the absolute mean performance difference between direct and simulated annealing. There is some concentration of points close to zero performance difference, that is, trials where the two algorithms performed almost identically (e.g., both found the global optimum). However, a larger fraction of the results is distributed around a performance difference value of approximately 0.6. Not surprisingly, this is strongly related to the structure of the generated landscapes. The generator includes specification of a threshold between the maximum height of local optima (for these results 0.5) and the height of the global optimum (1.0). This threshold will appear in the results of many different algorithms for these landscapes, as most algorithms tend to converge to either the global or a local optimum.

### 3.4. Analyzing the Performance of an Ideal Local Optimizer

Consider measuring the mean best fitness of an ideal (fictitious) local optimizer, repeatedly initialized to a uniform random starting position on generated landscapes. We define an ideal local optimizer as an optimizer that will converge to a local optimum if it is initialized within the optimum's basin of attraction. We assume the algorithm will converge to one of the *m* peaks, with probability proportional to the size of the basin of attraction of that peak.

Let *b*_{g} and *b*_{i} denote the proportion of the search space that the basin of attraction of the global optimum and of local optimum *i* encompasses, respectively. Thus, these proportions sum to 1. The mean best fitness of an ideal local optimizer on a landscape with *m* optima is given by Equation (5). Now, let the basins of attraction for all optima be of equal size, and let *f*_{g} and *f*_{i} denote the fitness of the global optimum and local optimum *i*, respectively. Intuitively, the mean best fitness of an ideal local optimizer on a landscape with *m* optima with equal basins of attraction will be the sum of the fitnesses of each optimum divided by the total number of optima (Equation (6)). Equation (5) can only be used if both the fitness of all optima and the size of each basin of attraction are known. Landscapes generated by the MSG generator have local optimum fitness values ranging from 0 to a given threshold, *t*. If the fitness values of the local optima are not known, the expected fitness (*t*/2) can be used instead, which turns Equation (5) into Equation (7). Consider the case where both the fitness of local optima and all the basins of attraction are unknown. We hypothesize that all basins of attraction have equal size, in which case Equation (7) can be generalized to Equation (8).
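The equations referred to in this passage can be sketched from the verbal definitions alone. The block below is a reconstruction under stated assumptions rather than the original typesetting: it assumes that the basins partition the search space, that the local optima are indexed *i* = 1, …, *m* − 1, and that local optimum fitness is uniform on [0, *t*], so that its expected value is *t*/2. The symbol \bar{f} for the mean best fitness is introduced here for illustration.

```latex
\begin{align}
b_g + \sum_{i=1}^{m-1} b_i &= 1 \\
\bar{f} &= b_g f_g + \sum_{i=1}^{m-1} b_i f_i
  && \text{(known fitnesses and basin sizes)} \\
\bar{f} &= \frac{1}{m}\Bigl(f_g + \sum_{i=1}^{m-1} f_i\Bigr)
  && \text{(equal basins, } b_g = b_i = 1/m\text{)} \\
\bar{f} &\approx b_g f_g + (1 - b_g)\,\frac{t}{2}
  && \text{(unknown local fitnesses)} \\
\bar{f} &\approx \frac{1}{m}\Bigl(f_g + (m-1)\,\frac{t}{2}\Bigr)
  && \text{(unknown local fitnesses, equal basins)}
\end{align}
```

As a worked example of the last form, with *f*_{g} = 1, *t* = 0.5, and *m* = 10, the estimate is (1 + 9 × 0.25)/10 = 0.325.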

Figure 3 shows the estimated mean fitness for an ideal local optimizer using Equation (8), as well as the results of Nelder-Mead on random landscapes ranging from 1 to 50 Gaussian components (*m*), with component covariances of 0.25, rotation ranging between 0 and 90°, and a threshold value (*t*) of 0.5. Similar to the previous experiments, 30 landscape instances were generated for each parameterization, and the mean fitness of 30 algorithm trials on each landscape instance is shown in Figure 3. The estimate is reasonable but does not completely agree with the results of Nelder-Mead. One possible explanation for this is that the instance of Nelder-Mead used in these experiments was not a good approximation of an ideal local optimizer. Alternatively, the hypothesis of equal basin sizes may not have held exactly in this experiment.

Now let us examine the mean performance difference between two algorithms. If we assume all basins of attraction are of equal size, and the two algorithms are initialized randomly in the search space, there are *m*^{2} possible combinations of the two algorithms converging on the *m* optima. Let *f*_{g} denote the fitness of the global optimum, and let the fitness of all local peaks be uniformly distributed in the range [0, *t*]. Let *GG* denote the event where both algorithms converge to the same optimum. Of the *m*^{2} total possibilities, *m* correspond to this event. The event (denoted *GL*) where one algorithm converges to the global optimum while the other converges to a local optimum occurs for 2(*m* − 1) of the *m*^{2} possibilities. Finally, let *LL* be the event that the algorithms converge to different local optima. This event occurs for the remaining *m*^{2} − *m* − 2(*m* − 1) possibilities. Using these definitions, we can determine the fitness difference between two ideal local optimizers.
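The counts for the three events can be verified by brute-force enumeration. The sketch below assumes, purely for illustration, that optimum 0 is the global optimum and optima 1 to *m* − 1 are local; the function name is hypothetical.

```python
from itertools import product


def count_events(m):
    """Count GG, GL, and LL outcomes for two optimizers over m optima."""
    counts = {"GG": 0, "GL": 0, "LL": 0}
    for a, b in product(range(m), repeat=2):
        if a == b:
            counts["GG"] += 1   # both converge to the same optimum
        elif a == 0 or b == 0:
            counts["GL"] += 1   # exactly one converges to the global optimum
        else:
            counts["LL"] += 1   # two different local optima
    return counts


print(count_events(5))  # {'GG': 5, 'GL': 8, 'LL': 12}
```

For *m* = 5 the counts are *m* = 5, 2(*m* − 1) = 8, and *m*^{2} − *m* − 2(*m* − 1) = 12, which sum to *m*^{2} = 25.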

For *GG*, there is a fitness difference (*f*_{d}) of 0. For *GL* we subtract the mean fitness of a local peak (*t*/2) from the fitness of the global optimum (*f*_{g}), thus obtaining *f*_{g} − *t*/2. Again using the mean fitness of a local optimum, we arrive at the fitness difference for *LL*. Therefore, if the global optimum has a fitness of *f*_{g} and the local optima have a fitness in the range [0, *t*], the mean fitness difference follows by weighting the fitness differences for *GG*, *GL*, and *LL* by their shares of the *m*^{2} possibilities.

To summarize, the landscape generators used here produce landscapes with a global optimum of known fitness value and local optima with fitness values no greater than a specified threshold value. This structure will be evident in experimental results using these landscapes and is primarily a property of the generator rather than the algorithms used. The analysis above shows that it is possible to estimate the fitness value of this threshold level in experimental results.

## 4. Relating Landscape Dependency Structure to Dependency Modeling in EDAs

EDAs are a class of metaheuristic optimization algorithms that build and use a probabilistic model to direct the search process (Larrañaga and Lozano, 2002; Pelikan et al., 2002). For continuous problems, the most commonly used model is a Gaussian or normal distribution with a specified covariance structure. The continuous univariate marginal distribution algorithm (UMDA_{c}) uses a diagonal covariance matrix corresponding to a factorized product of univariate normal distributions. The estimation of multivariate normal algorithm (EMNA) uses a full covariance matrix corresponding to an unrestricted multivariate normal distribution (Larrañaga and Lozano, 2002).
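The difference between the two models can be illustrated with a short Python sketch of the model-fitting and sampling steps in 2D. It is a simplified illustration, not the algorithms as implemented in the experiments: selection is omitted, maximum-likelihood estimates are assumed, and the function names are hypothetical. Setting `full_covariance=False` gives the UMDA_{c}-style diagonal model, and `True` gives the EMNA-style full model.

```python
import math
import random
from statistics import mean


def fit_gaussian_2d(points, full_covariance):
    """Maximum-likelihood mean and 2x2 covariance of the selected points."""
    n = len(points)
    mu = (mean(p[0] for p in points), mean(p[1] for p in points))
    v11 = sum((p[0] - mu[0]) ** 2 for p in points) / n
    v22 = sum((p[1] - mu[1]) ** 2 for p in points) / n
    v12 = 0.0  # the diagonal model ignores the dependency between x1 and x2
    if full_covariance:
        v12 = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in points) / n
    return mu, ((v11, v12), (v12, v22))


def sample_gaussian_2d(mu, cov, rng):
    """Draw one point using a hand-written 2x2 Cholesky factor."""
    l11 = math.sqrt(cov[0][0])
    l21 = cov[0][1] / l11 if l11 > 0.0 else 0.0
    l22 = math.sqrt(max(cov[1][1] - l21 * l21, 0.0))
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (mu[0] + l11 * z1, mu[1] + l21 * z1 + l22 * z2)
```

On points that lie along a diagonal, the full model reports a nonzero off-diagonal covariance, while the diagonal model samples *x*_{1} and *x*_{2} independently.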

One of the major issues that has been explored across EDA research, and has motivated the work in Section 3, has been the incorporation of dependency modeling in the probabilistic model of the algorithms. The general assumption is that many real world optimization problems are defined over variables that have unknown dependency relationships between them. Therefore, a model that has the ability to capture and exploit dependencies between problem variables can be expected to provide good performance on such problems. This argument has been experimentally verified several times in the context of developing new algorithms for both continuous and binary problems. If such a model works well for a given optimization problem, it suggests that there are features present in the fitness landscape that the model is able to fit well, but there are few reported studies that specifically analyze the relationship between landscape properties and dependency modeling in EDAs.

### 4.1. Algorithm Performance with Respect to Degree of Landscape Component Dependency

In this section we use the ridge landscape generator described above to evaluate and compare the performance of UMDA_{c} and EMNA. Our assumption is that linear ridges on the landscape result from a very simple and direct dependency relationship between *x*_{1} and *x*_{2}. In two dimensions, a Gaussian component with 0° of rotation has no dependency between *x*_{1} and *x*_{2}, while a rotation of 45° has maximum dependency. An experiment to investigate the effect of component rotation was carried out as follows. The rotation angle of components in the landscapes (see Section 3) was varied between 0 and 90° with increments of 1° with random noise of ±5°. At each angle, 30 randomized landscapes were generated and 30 trials of each algorithm were conducted on each landscape. The best fitness found at the end of each algorithm trial is averaged over the 30 trials on an instance. Thus, the mean fitness difference for a given landscape instance is the mean (best) fitness of UMDA_{c} minus the mean (best) fitness of EMNA. Each algorithm used a population size of 50, a selection threshold of 0.8, and was run for 50 generations. Note that this repeats the set of experiments described in Morgan and Gallagher (2010), but while analyzing these previous results, we discovered an error in the implementation of the EMNA algorithm. The effect of this error was that EMNA was being initialized in [− 3, 1]^{2} while UMDA_{c} was (correctly) initialized in [− 1, 1]^{2}. We therefore corrected the error and repeated the set of experiments.
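The claim that dependency vanishes at 0° and peaks at 45° can be checked numerically. The sketch below assumes that dependency is measured by the correlation coefficient of a component's covariance matrix, which is one possible measure and is not stated in the text; the function name and the variances used are hypothetical.

```python
import math


def rotated_correlation(var_major, var_minor, angle_deg):
    """Correlation between x1 and x2 for an axis-aligned Gaussian rotated by angle_deg."""
    th = math.radians(angle_deg)
    c, s = math.cos(th), math.sin(th)
    # Entries of R diag(var_major, var_minor) R^T for a 2D rotation matrix R.
    v11 = var_major * c * c + var_minor * s * s
    v22 = var_major * s * s + var_minor * c * c
    v12 = (var_major - var_minor) * c * s
    return v12 / math.sqrt(v11 * v22)


for angle in (0, 15, 30, 45, 60, 75, 90):
    print(angle, round(rotated_correlation(1.0, 0.1, angle), 3))
```

Under this measure the correlation is zero at 0° and 90° and largest in magnitude at 45°, for any pair of unequal variances.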

Figure 4 shows the mean fitness difference between UMDA_{c} and EMNA in terms of best fitness values found on each landscape instance. Counter to our assumption and intuition, the results show no obvious trend between the angle of component rotation and the fitness difference between the two algorithms. This lack of trend was also evident in the results in Morgan and Gallagher (2010). However, the distribution of the results in Figure 4 shows that the majority of points have fitness differences concentrated between 0 and –0.05 and skewed in favor of EMNA (negative values), indicating that EMNA tends to slightly outperform UMDA_{c} on average. This agrees with intuition since the full covariance model of EMNA should be able to capture dependencies within ridge-structured landscapes, but includes as a special case the ability to produce a diagonal covariance model (equivalent to UMDA_{c}). The opposite was suggested in Morgan and Gallagher (2010) due to implementation error.

### 4.2. Landscape Instances Characterizing the Performance Difference Between EMNA and UMDA_{c}

The experimental results in Section 4.1 were analyzed to determine example landscapes where EMNA outperforms UMDA_{c} and where UMDA_{c} outperforms EMNA, as well as where the performance of the two algorithms is approximately equal. The landscape instances in Figure 4 were ranked according to the mean fitness difference between the two algorithms, and the four best instances were chosen for each of the three scenarios. A two-sample *t*-test was performed to determine whether the mean fitness for the 30 trials of one algorithm (e.g., EMNA) is significantly different from the mean fitness of the other algorithm (e.g., UMDA_{c}). The null hypothesis is that the two means are the same, while the alternative hypothesis is that they are different. A significance level of 5% was used. Contour diagrams of the landscape instances are provided in Figure 5, along with the mean fitness difference, the *p* value of the *t*-test and the variance of the mean fitness difference between algorithm trials.
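The test statistic can be sketched with the Python standard library. The text does not say which variant of the two-sample *t*-test was used, so the pooled-variance (equal variance) form is assumed here, and the function name is hypothetical. The sketch returns the statistic and degrees of freedom only; for two samples of 30 trials (58 degrees of freedom) the two-sided 5% critical value is approximately 2.002.

```python
import math
from statistics import mean, variance


def two_sample_t(a, b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    dof = na + nb - 2
    # Pool the two sample variances, weighted by their degrees of freedom.
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / dof
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1.0 / na + 1.0 / nb))
    return t, dof
```

The null hypothesis of equal means would be rejected at the 5% level when the absolute value of the statistic exceeds the critical value for the given degrees of freedom.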

Figure 5 shows the landscape instances found in the experiment in Section 4.1 where EMNA most strongly outperforms UMDA_{c}. The mean performances of the two algorithms are significantly different from each other, and so the mean fitness difference can be used to compare the algorithms. Each instance features a global ridge aligned away from the coordinate axes. Each ridge appears to be highly irregular, meaning that it exhibits significant variation in height along its top. In addition, the global peak tends to be highly elliptical and positioned away from the origin, though its orientation with respect to the ridge tends to vary.

Figure 6 shows landscape instances where UMDA_{c} outperforms EMNA. The *t*-test failed to reject the null hypothesis on two of the four landscape instances. Note that the performance difference values are much smaller than those in Figure 5, indicating that it is more difficult to find landscape parameter values that produce instances where UMDA_{c} strongly outperforms EMNA (indeed such values might not exist for any given parameterization). Ridges in these landscapes tend to be more axis aligned than those in Figure 5, and each global peak appears to be more aligned with its respective ridge. Global peaks are relatively narrow compared to other peaks in the landscape and are strongly positioned toward the origin of the search space. Ridges in these landscapes are much more smooth (regular) compared to the ridges in Figure 5.

Figure 7 shows landscape instances where the performance of EMNA and UMDA_{c} is almost identical. The *t*-test failed to reject the null hypothesis in all four cases. Global peaks within these landscapes are much more regular than those in Figures 5 and 6 in the sense that they have wider variance, are less elliptical, and are positioned closer to the center of the search space. Most of the landscapes in Figure 7 have ridges that are closely aligned to the coordinate axes.

It is clear from Figures 5–7 that the performance difference between the algorithms is strongly influenced by identifiable features of the landscape. The main topological features that we have observed above are the regularity, position, and orientation of the ridge, as well as the position and orientation of the global peak relative to the ridge, and how elliptical it is. It seems likely that a number of factors are responsible for the performance differences observed in the above experiments. The summary of all results in Figure 4 focuses on a single factor (i.e., orientation) in isolation, but no trend is observed because of the variability in the generated landscape instances contributed by other factors. When these factors are identified and controlled or constrained, clearer performance difference trends may be seen. From Figure 5, when EMNA outperforms UMDA_{c}, the landscape does tend to have a diagonal ridge in agreement with our initial assumption. But this is in combination with a global peak that is relatively small, located away from the origin, and highly elliptical. In contrast, the landscapes where UMDA_{c} outperforms EMNA (Figure 6) also tend to have a narrow, elliptical global peak, but the ridges are much more axis-aligned and smooth.

### 4.3. Comparison of Algorithm Dynamics on a Landscape Instance

Given that the above experiments have identified landscape instances that yield significant performance differences between EMNA and UMDA_{c}, we can then examine the dynamics of the algorithms on a landscape instance to gain insight into why the performance difference exists. For these EDAs, the dynamics are summarized by the changes in the probabilistic model (i.e., mean vectors and covariance matrices) used to generate the search points. Figure 8 shows again the landscape instance from the upper-left of Figure 5. Also shown are the final positions (for each of the 30 trials) of the mean vectors of the algorithm models. The means of the EMNA model are shown by crosses, ×, while the means of the UMDA_{c} model are shown by plus signs, +. It is clear that on almost all trials, the EMNA model mean has converged closely to the global optimum. In contrast, the UMDA_{c} model means never approach the global peak and are to some extent attracted by the local basins in the center of the search space and toward the lower-left corner.

Figure 9 shows the EMNA and UMDA_{c} model dynamics for a single representative trial from the 30 shown in Figure 8. The points represent the positions of the mean vectors over the 40 generations, and the ellipses, representing the model covariance over generations, are drawn at 1 *SD* from the mean. The means of the EMNA model are shown by boxes, □, while the means of the UMDA_{c} model are shown by circles, ○. Solid ellipses represent the EMNA model and dotted ellipses represent the UMDA_{c} model. It is clear that the EMNA model covariance aligns closely with the shape of the global basin, while the UMDA_{c} model (with a diagonal covariance matrix) fails to do so.

### 4.4. Summarizing Algorithm Performance Over Landscape Parameterizations

On further analysis of Figure 4 in Section 4, we see outlying points indicating that there exist landscape instances with a significant performance difference between algorithms. For example, EMNA outperforms UMDA_{c} on a landscape instance with angle 35°, while UMDA_{c} outperforms EMNA on a 78° instance. While there are outlying points for both of these parameterizations, there are also landscapes where the performance difference is negligible. Clearly, there is variation between landscape instances; however, Figure 4 does not contain any information about the variation of performance difference between the trials within individual landscape instances. Such variation is important to analyze as it can give insight into the stability of algorithms on the landscape parameterization.

To visualize the variation of experimental results within landscape instances, as well as variation between instances, we propose a heat map, that is, an *r* × *q* grid where each of the *r* rows represents a single landscape instance and each of the *q* columns represents an experimental result. The mean performance difference for each experiment is indicated by a color scale for the grid points. Because landscape instances (and their experiments) are independent of each other, the ordering of grid points is arbitrary. Overall, the heat map provides a visualization of the empirical difference between two algorithms. A heat map dominated by one end of the color scale shows that one algorithm consistently outperforms the other. A heat map dominated by the middle of the color scale shows that the two algorithms consistently have very similar performance. Finally, a heat map that contains significant variation of color shows that the two algorithms have large variability in their performance difference.

Figures 10(a) and 10(b) contain heat maps generated from the experimental results of angles 35° and 78°, found in Figure 4. Since each parameterization was tested using 30 landscape instances, with 30 algorithm trials in each, the heat maps form a 30 × 30 grid. White indicates EMNA outperforms UMDA_{c} while black indicates UMDA_{c} outperforms EMNA. The instances (rows) are ordered by the variance of their experimental results, and the columns within each row are sorted by mean fitness difference value.
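The grid construction described above can be sketched in a few lines of Python. This is a sketch under stated assumptions: the input is an *r* × *q* list of fitness differences (EMNA minus UMDA_{c}), rows are ordered by ascending variance (the text does not state the direction), and the gray level is a linear mapping chosen here for illustration. The names `build_heat_map_grid` and `to_gray` and the synthetic data are hypothetical.

```python
import random
import statistics

def build_heat_map_grid(differences):
    """Sort each row (instance) by value, then order rows by their variance."""
    rows = [sorted(row) for row in differences]
    rows.sort(key=statistics.pvariance)  # ascending variance: an assumption
    return rows

def to_gray(value, max_abs):
    """Map a difference in [-max_abs, max_abs] to a gray level in [0, 1].

    1.0 (white) means the first algorithm is better, 0.0 (black) the second,
    and 0.5 (mid gray) means no difference.
    """
    if max_abs == 0:
        return 0.5
    clipped = max(-max_abs, min(max_abs, value))
    return 0.5 + clipped / (2 * max_abs)

# Synthetic stand-in for 30 landscape instances with 30 results each.
rng = random.Random(1)
differences = [[rng.gauss(0.0, 0.02 * (i + 1)) for _ in range(30)]
               for i in range(30)]

grid = build_heat_map_grid(differences)
max_abs = max(abs(d) for row in grid for d in row)
gray = [[to_gray(d, max_abs) for d in row] for row in grid]
print(len(gray), "rows x", len(gray[0]), "columns")
```

The resulting `gray` matrix can be passed to any image-plotting routine with a grayscale color map.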

The two heat maps have many similarities, and yet also subtle differences. In Figure 10(a), we see there are landscape instances, such as instance 29, where the performance difference varies between experiments; that is, the grid color shifts from black, through gray, to white. It seems that for these landscape instances, neither algorithm is clearly favorable. This is quite different from the instances in Figure 10(b), where we see some instances with very strong dominance. For example, landscape instance 29 strongly shows EMNA outperforming UMDA_{c}, and instance 30 indicates UMDA_{c} outperforms EMNA. It seems the parameterization for Figure 10(a) produces more impartial landscapes compared to the parameterization in Figure 10(b).

Both Figure 10(a) and Figure 10(b) contain multiple rows where the entire row is gray. Completely gray rows indicate that for the corresponding landscape instance, the performance of the two algorithms on every experiment was approximately equal. Therefore, for these particular landscape instances, there is no clear advantage to modeling a full covariance matrix. Note that we cannot generalize this for the entire landscape parameterizations; there are instances in both Figure 10(a) and Figure 10(b) where EMNA is favorable. These instances are easily identifiable by large amounts of white or light gray within the instance row. In fact, excluding instances that are purely gray, we see that there are more experiments where EMNA is favorable as opposed to UMDA_{c}. Therefore, we can deduce that while modeling a full covariance matrix (as in EMNA) for this particular problem class may not always be advantageous in terms of the best fitness found, it is rarely detrimental.

## 5. Searching the Landscape Parameter Space to Compare and Evaluate Algorithms

The results presented in Section 4 show that there is little relationship between the performance difference of UMDA_{c} and EMNA and one parameter of the MSG ridge landscape generator, when that parameter alone is systematically varied across a large set of experiments. However, the experimental results gathered can still be examined at the level of individual landscape instances to gain insight into the dynamics of the algorithms. For example, outlying instances that produced large performance differences (positive and negative) can be visualized together with instances that produced approximately zero performance difference (Morgan and Gallagher, 2010). A more active approach is to conduct a meta-search over the space of landscape generator parameters for parameter value vectors that produce performance results according to some criterion of interest (e.g., maximum mean performance difference, maximum variance performance difference).

Each landscape consists of *m* Gaussian components. The means are uniformly distributed along a randomly positioned ridge rotated by θ degrees. The components are rotated by φ and have covariance values specified by *v*. Local optima have a maximum fitness of 0.5, while the global optimum has a fitness of 1.0. From this specification, we obtain an optimization problem with four variables:

- *m* ∈ {1, 5, 10, 50, 100, 500, 1000, 5000, 10000}, α_{m} increments to the next or previous element in the set (with wraparound).
- *v* ∈ [0, 0.25], α_{v} = ±0.001. (As covariances of a Gaussian component cannot actually be 0, precautions must be taken to ensure this parameter is never equal to 0. We implement a check in the algorithm that will replace covariances of 0 with 10^{−15}.)
- θ ∈ [0, 90] ⊂ ℝ, α_{θ} = ±0.1.
- φ ∈ [0, 90] ⊂ ℝ, α_{φ} = ±0.1.
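To make the four parameters concrete, the following Python sketch builds one landscape from (*m*, θ, φ, *v*) as a maximum over scaled Gaussian components. Several details are assumptions not given in the text: the search domain is taken as [−1, 1]², the ridge passes through a uniformly random point, component variances are drawn uniformly from (0, *v*], local peak heights are drawn uniformly from [0, 0.5], and the first component is arbitrarily made the global optimum. The function name `make_ridge_landscape` is hypothetical.

```python
import math
import random

def make_ridge_landscape(m, theta_deg, phi_deg, v, seed=0):
    """Return (fitness, components) for one randomly generated ridge landscape."""
    rng = random.Random(seed)
    theta, phi = math.radians(theta_deg), math.radians(phi_deg)
    cx, cy = rng.uniform(-1, 1), rng.uniform(-1, 1)  # assumed ridge position
    components = []
    for i in range(m):
        t = rng.uniform(-1, 1)  # position of the mean along the ridge
        mean = (cx + t * math.cos(theta), cy + t * math.sin(theta))
        # Variances along the component's own axes; 0 is replaced by 1e-15.
        s1 = max(rng.uniform(0, v), 1e-15)
        s2 = max(rng.uniform(0, v), 1e-15)
        height = 1.0 if i == 0 else rng.uniform(0, 0.5)
        components.append((mean, s1, s2, height))

    def fitness(x, y):
        best = 0.0
        for (mx, my), s1, s2, height in components:
            dx, dy = x - mx, y - my
            # Rotate the offset into the component frame (rotation by phi).
            u = math.cos(phi) * dx + math.sin(phi) * dy
            w = -math.sin(phi) * dx + math.cos(phi) * dy
            value = height * math.exp(-0.5 * (u * u / s1 + w * w / s2))
            best = max(best, value)
        return best

    return fitness, components

fitness, components = make_ridge_landscape(m=50, theta_deg=45.0,
                                           phi_deg=75.1, v=0.0144)
global_mean = components[0][0]
print(fitness(*global_mean))  # height of the global peak
```

Evaluating `fitness` on a grid over the assumed domain gives a surface that can be inspected visually for a given parameter vector.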

A (1+1)-EA was implemented as the metasearch algorithm over these landscape generator parameters. The mutation operator was a local perturbation of the current solution by steps α_{{m,v,θ,φ}} independently at each iteration. The algorithm was initialized randomly in the search space. The fitness of the metasearch was specified as the mean difference between the two algorithms on 10 landscape instances, each obtained as the mean fitness difference over 10 algorithm trials on each landscape. These experiments are computationally time-consuming. Hence, we only attempted six trials of the meta-search, with each trial being run for 60 iterations. Out of the six trials, three were maximizing the mean fitness difference between EMNA and UMDA_{c}, while the remaining three were maximizing the difference between UMDA_{c} and EMNA.
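The metasearch loop can be sketched as follows, again in Python rather than the Matlab used for the experiments. The step sizes, the parameter ranges, the wraparound on *m*, and the replacement of a zero covariance by 10^{−15} follow the description above. Clamping θ, φ, and *v* at their bounds and accepting a child whose fitness ties the parent are assumptions. The function `toy_fitness` is a hypothetical stand-in: in the actual experiments the fitness is the mean performance difference between the two EDAs over 10 landscape instances and 10 trials each, which is far more expensive to compute.

```python
import random

M_VALUES = [1, 5, 10, 50, 100, 500, 1000, 5000, 10000]

def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def random_solution(rng):
    """Random starting point in the landscape parameter space."""
    return {"m_index": rng.randrange(len(M_VALUES)),
            "v": rng.uniform(0.0, 0.25),
            "theta": rng.uniform(0.0, 90.0),
            "phi": rng.uniform(0.0, 90.0)}

def mutate(sol, rng):
    """Perturb every parameter independently by one step up or down."""
    child = dict(sol)
    # m moves to a neighboring set element, with wraparound.
    child["m_index"] = (sol["m_index"] + rng.choice((-1, 1))) % len(M_VALUES)
    child["v"] = clamp(sol["v"] + rng.choice((-0.001, 0.001)), 0.0, 0.25)
    child["theta"] = clamp(sol["theta"] + rng.choice((-0.1, 0.1)), 0.0, 90.0)
    child["phi"] = clamp(sol["phi"] + rng.choice((-0.1, 0.1)), 0.0, 90.0)
    if child["v"] == 0.0:
        child["v"] = 1e-15  # a covariance of 0 is not allowed
    return child

def one_plus_one_ea(fitness, iterations=60, seed=0):
    """(1+1)-EA: keep the child only if it is at least as fit as the parent."""
    rng = random.Random(seed)
    parent = random_solution(rng)
    parent_fit = fitness(parent)
    history = [parent_fit]
    for _ in range(iterations):
        child = mutate(parent, rng)
        child_fit = fitness(child)
        if child_fit >= parent_fit:
            parent, parent_fit = child, child_fit
        history.append(parent_fit)
    return parent, parent_fit, history

def toy_fitness(sol):
    # Hypothetical stand-in for the mean EMNA/UMDA_c performance difference.
    return -abs(sol["theta"] - 45.0) / 90.0

best, best_fit, history = one_plus_one_ea(toy_fitness)
print(M_VALUES[best["m_index"]], best["theta"], best["phi"], best["v"], best_fit)
```

Swapping the sign of the fitness function switches between maximizing the difference in favor of one algorithm and maximizing it in favor of the other.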

The progress of each metasearch trial is shown in Figure 11. The three solid lines show the progress of maximizing the difference between EMNA and UMDA_{c}, and the three dotted lines show the progress of maximizing the difference between UMDA_{c} and EMNA. It is clear from this graph that ridge landscapes maximizing the fitness difference between EMNA and UMDA_{c} are easier to find than ridge landscapes maximizing the difference between UMDA_{c} and EMNA.

The results of the six trials are summarized in Table 1. Additionally, significance testing was performed to determine whether the reported mean fitness difference is significant for the given landscape parameterization. This was done by first determining, for each landscape instance, whether the mean fitness for Algorithm 1 is significantly different from that of Algorithm 2, which can be done with a two-sample *t*-test. If the majority of the instances pass this test, then the means for all of the landscape instances must then be tested to see whether they form a distribution with a nonzero mean. A one-sample *t*-test was used to do this. If the parameterization passes both the two-sample *t*-test and the one-sample *t*-test, we can conclude that the result is statistically significant. Both tests were performed at the 5% significance level, and the results are summarized in Table 2.
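The two-stage procedure can be sketched in Python using only the standard library. The sketch rests on assumptions the text does not settle: the two-sample test is taken in its pooled-variance (Student) form, both tests are two-sided, and "majority" is read as strictly more than half of the instances. The *p*-values are obtained by numerically integrating the Student *t* density after the substitution x = √df · tan(u), which turns the integral into one over a bounded interval. All function names and the synthetic fitness values are hypothetical.

```python
import math
import random
import statistics

def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value of Student's t via Simpson's rule (steps must be even).

    P(0 < T < |t|) = K * integral_0^a cos(u)**(df - 1) du,
    with a = atan(|t| / sqrt(df)) and K = Gamma((df+1)/2) / (sqrt(pi) Gamma(df/2)).
    """
    a = math.atan(abs(t) / math.sqrt(df))
    k = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(math.pi)
    h = a / steps
    f = lambda u: math.cos(u) ** (df - 1)
    total = f(0.0) + f(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(i * h)
    return max(0.0, 1.0 - 2.0 * k * total * h / 3)

def two_sample_p(a, b):
    """Pooled-variance two-sample t-test (assumed form)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)) / df
    se = math.sqrt(pooled * (1 / na + 1 / nb))
    diff = statistics.mean(a) - statistics.mean(b)
    if se == 0:
        return 1.0 if diff == 0 else 0.0
    return t_two_sided_p(diff / se, df)

def one_sample_p(x, mu0=0.0):
    """One-sample t-test of the hypothesis that the mean of x equals mu0."""
    se = statistics.stdev(x) / math.sqrt(len(x))
    diff = statistics.mean(x) - mu0
    if se == 0:
        return 1.0 if diff == 0 else 0.0
    return t_two_sided_p(diff / se, len(x) - 1)

def parameterization_is_significant(results, alpha=0.05):
    """results: one (fitness_alg1, fitness_alg2) pair of trial lists per instance."""
    p_two = [two_sample_p(a, b) for a, b in results]
    if sum(p < alpha for p in p_two) <= len(results) / 2:
        return p_two, None, False  # second stage not carried out ("n/a")
    mean_diffs = [statistics.mean(a) - statistics.mean(b) for a, b in results]
    p_one = one_sample_p(mean_diffs)
    return p_two, p_one, p_one < alpha

# Synthetic example: 10 instances, 10 trials per algorithm on each instance.
rng = random.Random(0)
results = [([rng.gauss(0.7, 0.05) for _ in range(10)],
            [rng.gauss(0.5, 0.05) for _ in range(10)]) for _ in range(10)]
p_two, p_one, significant = parameterization_is_significant(results)
print(significant)
```

Returning `None` for the one-sample *p*-value mirrors the "n/a" entries that arise when fewer than a majority of instances pass the first stage.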

Table 1: Summary of the six metasearch trials.

| Metasearch | Best fitness | *m* | θ | φ | *v* |
|---|---|---|---|---|---|
| EMNA vs UMDA_{c} 1 | 0.1023 | 1000 | 73.1 | 0.1 | 0.0772 |
| EMNA vs UMDA_{c} 2 | 0.2528 | 10000 | 45.0 | 75.1 | 0.0144 |
| EMNA vs UMDA_{c} 3 | 0.0999 | 50 | 40.1 | 60.2 | 0.1204 |
| UMDA_{c} vs EMNA 1 | 0.0126 | 10 | 46.0 | 54.0 | 0.2074 |
| UMDA_{c} vs EMNA 2 | 0.0220 | 1 | 11.4 | 82.4 | 0.2264 |
| UMDA_{c} vs EMNA 3 | 0.0116 | 1 | 75.0 | 15.0 | 0.2084 |


The best solution found by the metasearch (EMNA vs UMDA_{c} 2 in Table 1) produced a mean fitness difference between EMNA and UMDA_{c} of 0.2528. The standard deviation of the mean fitness difference across landscape instances was relatively small (0.0698), indicating that these parameter values generate landscapes that consistently produce a sizeable performance difference. This is notable because out of the 2,700 landscapes used for the experiment shown in Figure 4, only one produced a fitness difference of this size. This demonstrates the value for algorithm researchers of actively searching a parameterized landscape generator in order to find problems that differentiate algorithms. Furthermore, this solution was found with a very simple metasearch algorithm after only 53 function evaluations.

Table 2: Significance test results (*p*-values) for the six metasearch trials.

| Metasearch | Two-sample *t*-test | One-sample *t*-test |
|---|---|---|
| EMNA versus UMDA_{c} 1 | p = [0.0289, 0.1032, 0.2877, 0.2493, 0.0002, 0.0008, 0.0000, 0.0773, 0.4034, 0.0644] | n/a |
| EMNA versus UMDA_{c} 2 | p = [0.0216, 0.0439, 0.0062, 0.0001, 0.0010, 0.0001, 0.0000, 0.0176, 0.0169, 0.0002] | p = 1.1468 × 10^{−6} |
| EMNA versus UMDA_{c} 3 | p = [0.0433, 0.2372, 0.0284, 0.3525, 0.0105, 0.0252, 0.0003, 0.0895, 0.0037, 0.7966] | p = 0.0024 |
| UMDA_{c} versus EMNA 1 | p = [0.6813, 0.0219, 0.7964, 0.3042, 0.8978, 0.2824, 0.0158, 0.2327, 0.0271, 0.0133] | n/a |
| UMDA_{c} versus EMNA 2 | p = [0.1490, 0.0235, 0.9842, 0.5992, 0.2254, 0.6252, 0.1450, 0.1756, 0.0375, 0.1022] | n/a |
| UMDA_{c} versus EMNA 3 | p = [0.5939, 0.9965, 0.4125, 0.3155, 0.3323, 0.3471, 0.7764, 0.2895, 0.0126, 0.0161] | n/a |


We can also interpret the parameter values in this solution to gain intuition about the difference between the algorithms. As discussed in Section 4.1, our expectation was that EMNA would be well-suited to problems containing significant correlation structure (i.e., component angles close to 45°). While our initial experiment (see Figure 4) was unable to show this, this is perhaps because we underestimated the importance of the angle of the ridge. The ridge angle of 45° in our metasearch solution corresponds to the maximum correlation between *x*_{1} and *x*_{2}. A component angle of 75.1° results in components that are aligned with neither the ridge nor the coordinate axes, but the significance of this value is unclear. The parameterization also includes the largest number of Gaussian components considered (10,000) together with a relatively small value (0.0144) for the maximum possible covariances of a component (i.e., all components are narrow). Intuitively, more complex landscape topologies can be produced from a (max) sum of a larger number of localized functions.

Further experimentation is naturally suggested by these results. While a ridge angle of 45° fits nicely with our intuition about EMNA capturing dependencies, we cannot draw general conclusions from the small number of instances produced by our metasearch. A follow-up set of experiments would be to systematically examine the effect of the ridge angle on performance over many instances and trials as was done for the Gaussian rotation angle in Figure 4. To dig deeper and investigate the 75.1° component angle, yet another experiment could be performed by holding both the ridge angle (45°) and Gaussian rotation angle constant and exploring a more specific range of values for the number of Gaussian components and maximum variance values.

The best parameterization found by the metasearch was used to generate a heat map similar to the heat maps in Figure 10. The results of the 30 landscape instances, and their respective 30 algorithm trials, can be seen in Figure 12. This heat map shows an extremely clear dominance of EMNA over UMDA_{c}; we see that the lower right triangle of the heat map is white, and the upper left triangle is mostly gray, with only a few trials that are black. These results show that for the given parameterization, there are landscapes where EMNA consistently outperforms UMDA_{c}, which is in contrast to the experiments shown in Figure 10.

### 5.1. Discussion and Related Work

The results above can be related to what is already known about the behavior of UMDA_{c} and EMNA. It has been shown that the modeling in EMNA does not lead to efficient progress on a linear correlated slope function: the variance of the model extends orthogonally to the direction of increasing fitness (i.e., in the worst possible direction; Hansen, 2006; Bosman et al., 2008). This is equally true of UMDA_{c} if the contours of the slope are axis-aligned, but since the UMDA_{c} model cannot completely capture a linear dependence, it should outperform EMNA on a correlated slope. Our general observation that UMDA_{c} tends to outperform EMNA on ridge landscapes (see Figure 4) supports this reasoning.

For univariate or factorizable problems (i.e., uncorrelated variables), UMDA_{c} is also known to converge prematurely on any monotonic or flat function (González et al., 2002; Grahl et al., 2005; Yuan and Gallagher, 2006, 2009) while convergence is fast on a unimodal function when the model is close enough to the optimum. In the landscapes generated in Figure 12, a ridge on the search space boundary is globally similar to a monotonic slope (in the direction orthogonal to the ridge), while a ridge through the center of the search space is more like a unimodal function along most directions through the search space. In the above experiments we have observed a tendency of both algorithms to become stuck on the side of a ridge close to the boundary (such as Figure 1b). However, in practice, the behavior of the algorithms is affected by the interaction of several different factors: ridge location and rotation, global peak size, orientation (with respect to the ridge direction) and eccentricity (see the discussion in Section 4.2). Additional experiments are required to study the interactions of these factors relative to algorithm performance.

## 6. Summary and Conclusion

This paper proposes a general experimental methodology, combining a randomized landscape generator with an active search for problems that differentiate between algorithms. More specifically, an existing generator was extended to incorporate global ridge dependency structure. This was used to investigate the relationship between problem variables and dependency modeling in simple continuous EDAs. Initial experiments across a single landscape parameter showed no significant trend with respect to mean algorithm performance difference. However, individual landscape instances exhibited clear topological features characterizing algorithm performance. Further analysis of the results was carried out to show the behavior of the algorithm leading to the performance differences observed. For a given set of parameter values, we show how heat maps can be used to provide a clear, visual comparison of the performance of two algorithms over a large number of landscape instances and algorithm trials.

Experiments were also conducted performing a metasearch in the landscape parameter space to find a parameterization maximizing the mean performance difference between two algorithms. The landscape parameters found produce landscapes with a consistently larger mean performance difference. The heat map generated from the results confirmed the effectiveness of the metasearch.

Overall, the methodology of this paper provides novel ways to gain insight into the relationship between algorithm performance and landscape structure. Although the methodology presented is general, the experiments described above were limited to 2D problems and the algorithmic parameter settings used. The ridge landscape generator could readily be extended to generate *n*-dimensional problems by defining the ridge as lying along a vector in the space with arbitrary orientation (using *n* − 1 rotations). Examining higher-dimensional problems and integrating the variation of algorithm parameters are avenues for future work. A longer-term goal of this type of exploratory work is to be able to more precisely categorize or quantify the relationship between landscape structure and algorithm behavior. One avenue that may further this goal is the use of different performance measures.

## Notes

^{1} Matlab code for the experiments conducted in this paper can be downloaded at http://itee.uq.edu.au/∼uqrmorg4/extended_msg.html

^{4} We consider absolute mean fitness difference in this paper, but any measure of interest could be used.

## References

_{c}algorithm with tournament selection. Behaviour on linear and quadratic functions

_{c}, with truncation selection on monotonous functions

_{c}with finite populations: A case study on flat landscapes

## Author notes

*Corresponding Author.