Abstract

We present methods to answer two basic questions that arise when benchmarking optimization algorithms. The first is: which algorithm is the “best” one? The second is: which algorithm should I use for my real-world problem? Both are connected and neither is easy to answer. We present a theoretical framework for designing and analyzing the raw data of such benchmark experiments. This represents a first step in answering the aforementioned questions. The 2009 and 2010 BBOB benchmark results are analyzed by means of this framework and we derive insight regarding the answers to the two questions. Furthermore, we discuss how to properly aggregate rankings from algorithm evaluations on individual problems into a consensus, its theoretical background and which common pitfalls should be avoided. Finally, we address the grouping of test problems into sets with similar optimizer rankings and investigate whether these are reflected by already proposed test problem characteristics, finding that this is not always the case.

1  Introduction

In the domain of stochastic optimization, the available theory is still too limited to predict the performance of different algorithms on more than the most simple objective functions. Therefore, benchmarking plays a vital role in developing and comparing these types of algorithms. The BBOB 2009 and 2010 results (Hansen et al., 2009a, 2009b, and the 2010 follow-ups) provide the most sophisticated benchmarking data currently available, and their specific strength lies in the inclusion of many classic and modern optimization algorithms from fields outside of evolutionary computation. Although a result summary has already been published by the organizers (Auger et al., 2010), much more can be learned from the available data when inspected from a slightly different angle, namely through benchmarking theory (see Section 2), a long-established branch of statistics. A first attempt in this direction has been made by Mersmann et al. (2010a), solely considering the BBOB 2009 data. In the present work, we aggregate the data from 2009 and 2010 and expand the portfolio of applied statistical techniques. Due to the large number of algorithms from both workshops (64), it became inevitable to group the algorithms according to their main idea and to choose only one representative of each group for comparison, as documented in Section 3.3. Such a grouping is of course subjective; however, we have tried to be as fair as possible in order to obtain a good overview of the capabilities of the different classes of algorithms currently available on the BBOB test problems.

Who is the winner of an optimization competition and does this question make any sense? Without clearly qualifying one method as the overall winner, Hansen et al. (2010) suggest looking at such measures as the number of problems solved satisfactorily over the number of evaluations. This perspective implies that the most successful algorithms can solve the largest possible fraction of problems in the shortest time, thereby suggesting a default method to be employed if no problem knowledge is available. There may, however, be algorithms which are especially well suited for problems that cannot be solved well by the winning method. One may further argue that some problem properties are available even for a black box problem of unknown structure, for example, its dimensionality. Or preliminary runs with standard (gradient based) methods might have hinted at a unimodal or multimodal structure of the function landscape.

Thus, we can state that (1) looking at different classes of problems makes sense, and (2) choosing a best algorithm (possibly for only a subgroup) requires finding a ranking that aggregates the data of the respective test functions. The latter is made possible by using consensus ranking methods, which we introduce in Section 2.2. Unfortunately, there cannot be a consensus ranking method that satisfies a complete set of reasonable, intuitive prerequisites simultaneously. Instead, we will have to make a subjective choice which will influence our interpretation of the resulting consensus. These methods, while not particularly well known in the EC community, are often used in other disciplines and should not be dismissed as exotic. Examples of consensus methods in everyday life are sporting events in which several rankings, produced by a panel of judges or by a series of races, are averaged to find the winner of the competition.

The ultimate goal of a comparison of methods on a benchmark suite should be to detect which optimization algorithm to use for a given practical problem, but clearly, not every imaginable problem can be represented by a benchmark. Additionally, benchmarking serves the purpose of algorithm development, as one may try to enhance a method that does not work too well on some problems. Therefore, it is essential that we are able to determine the difficulty that a problem poses for an algorithm. In any case, some generalization from single problems to problem groups is necessary. The BBOB test set is partitioned into five groups according to very general properties. But do these problem groups match the abilities of the optimization algorithms under test well? Or should they be restructured according to other criteria? We treat this question in Section 3.5, coming to the conclusion that the groups are largely well chosen, but with some exceptions.

The assignment of problems to groups is also interesting from another perspective: If one is confronted with a new problem of which not much is known, it would be desirable to automatically detect some of its properties and then sort it into an existing group for which good optimization algorithms are available. This is our long term goal, and we have coined the term exploratory landscape analysis (ELA) to describe this methodology. First approaches are presented in Mersmann et al. (2010a) and further work in this direction can be found in Mersmann et al. (2011). In this work, however, we concentrate on the manual analysis of the BBOB workshop data to obtain a ground truth to which future ELA approaches can be compared (Section 3). Finally, we provide a summary and some conclusions in Section 5.

2  Benchmarking

Benchmarking experiments and comparisons are set up in order to evaluate the performance of different algorithms on given problem classes. Deriving an overall ranking of all the algorithms in a benchmark experiment is one of the common goals. This has several uses, the most prominent one being the identification of an overall best algorithm. Without loss of generality we will focus on optimization algorithms in the following description. A specific position in an overall ranking will in general depend on the rank aggregation method employed, and there is no consensus method which can be considered optimal, as has been known in econometrics for several decades (Arrow, 1950). So all methods presented here will have to be a trade-off between three basic properties which any optimal consensus method should satisfy.

Figure 1 schematically visualizes the general setup of what we call a benchmark experiment. We will see that its outcome strongly depends on the chosen performance measures and ranking procedures (Hornik and Meyer, 2007; Mersmann, 2009; Mersmann et al., 2010b). The definition of a problem domain together with a set of t test functions is a key step. Benchmarking results can only be generalized to the considered domain, and even this is not admissible if the functions are not chosen systematically and therefore do not represent the problem domain properly. Ideally, this would be guaranteed by applying the design of experiments methodology to the most important features which characterize the function types, but this is a nontrivial task for several reasons. First of all, the relationship between these features and the measured performance of an algorithm is generally not linear. Secondly, and more importantly, this would require a method to randomly sample from the set of all functions which satisfy some constraints on their characteristics. This is currently an unsolved issue.

Figure 1: Flowchart describing the steps involved in a benchmark experiment.

To do this performance assessment, q (ideally stochastically independent) quality indicators or performance measures are chosen to judge different aspects of algorithm performance and optimization quality. Then, a set of k algorithms is chosen for the benchmark. This set should be diverse and care should be taken not to over-represent a certain class of algorithms since this may bias the consensus rankings derived later on.

Since the outcome of the algorithms under test is usually stochastic, they are compared by aggregating r independent runs. Appropriate choices of r depend on the expected difference in quality compared to the variance of the observed quality indicator values. A good rule of thumb would be a value between 10 and 25 repetitions, but during the analysis it may become apparent that more runs are required to adequately differentiate between the algorithms (Mersmann, 2009). Here the trade-off is between the speed of running such an experiment and the accuracy of the results.

Our k algorithms will now be run r times on each of the t test functions, resulting in r values of each of the q quality indicators per algorithm and test function. Using these values we can derive individual rankings of the algorithms for each combination of a quality indicator and a test function (see Section 2.1). An overall (consensus) ranking can then be generated from the individual rankings or from subsets of these. Different methods for this are introduced and discussed in Section 2.2.

2.1  Individual Ranking

Benchmarking theory is often based on the theoretical framework of relations and orders (Hunter, 2008), from which a formal definition of a ranking R of a set of items A = {a_1, ..., a_k} can be derived as a weak order ⪰ over the set A. For a_i, a_j ∈ A we say a_i is better than or equal to a_j and denote this by a_i ⪰ a_j iff (a_i, a_j) ∈ R. If a_i ⪰ a_j and a_j ⪰ a_i, then we say a_i and a_j are tied and denote this by a_i ∼ a_j. If R is a linear order, we refer to the corresponding ranking as a strict ranking.

Initially, without loss of generality, we will consider the case of a fixed test function f and quality criterion I to be maximized. For example, I could be the dominated hypervolume indicator (HV; Zitzler and Thiele, 1998) in multiobjective evolutionary optimization or the accuracy that a single-objective evolutionary algorithm has reached after a fixed number of function evaluations. If we only had two algorithms, a_1 and a_2, then we could define the ranking by saying a_1 ⪰ a_2 (a_1 is better than or equal to a_2) if I(a_1) ≥ I(a_2). Generalizing this to a higher number of algorithms is straightforward in that we use the order induced by I on the algorithms as our ranking. However, due to the inherent stochastic nature of the algorithms, I(a_i) becomes a random variable with unknown distribution. We will therefore estimate some properties of this distribution, usually the expected value or some quantile, from the r repetitions we performed. Choosing good summary statistics requires an initial study of the distribution of the quality indicator. An example of such an analysis is given in Mersmann et al. (2010c) where the distribution of the HV is studied for different evolutionary multiobjective algorithms. Among other characteristics it is shown to be unimodal in most cases.

The usage of a summary statistic of the indicator distribution together with the classic “greater than or equal” (≥) or “less than or equal” (≤) relation is a straightforward approach to generate a linear order on the algorithms under test. Table 1 lists suitable summary statistics, assuming a fixed sample of r quality indicator values I_1, ..., I_r per algorithm. However, any uncertainty in these statistics is neglected, which could lead to perturbed ranking results where the extent of the error depends on the variance of the indicator distribution. Another possibility is to use statistical hypothesis tests (Mood et al., 1974) to decide if the locations of two quality indicator distributions are different in some sense. Caution is in order here because this approach can lead to a relation that is neither transitive nor antisymmetric. More details are given in Mersmann (2009) and Mersmann et al. (2010b).

Table 1:
List of summary statistics and order relations R.

Scenario               Summary statistic (s_i)     R
Best case quality      max(I_1, ..., I_r)          ≥
Average case quality   mean(I_1, ..., I_r)         ≥
Median quality         median(I_1, ..., I_r)       ≥
Worst case quality     min(I_1, ..., I_r)          ≥
Consistent quality     sd(I_1, ..., I_r)           ≤
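
To make the ranking procedure of Section 2.1 and Table 1 concrete, the following sketch derives one linear order per scenario from r indicator values per algorithm. It is illustrative Python of our own; the data, the algorithm names, and the mapping of “consistent quality” to the standard deviation (taken from the reconstructed Table 1) are assumptions, not part of the BBOB tooling.

```python
import numpy as np

# Hypothetical quality-indicator samples: r = 10 runs for each of three algorithms
# on one fixed test function (larger indicator values are assumed to be better).
rng = np.random.default_rng(0)
samples = {
    "alg_A": rng.normal(loc=1.0, scale=0.3, size=10),
    "alg_B": rng.normal(loc=1.2, scale=0.8, size=10),
    "alg_C": rng.normal(loc=0.9, scale=0.1, size=10),
}

# Summary statistics as in Table 1: each maps the r values to a single number s_i;
# 'direction' says whether larger (+1) or smaller (-1) values of s_i are better.
scenarios = {
    "best case":    (np.max,    +1),
    "average case": (np.mean,   +1),
    "median":       (np.median, +1),
    "worst case":   (np.min,    +1),
    "consistent":   (np.std,    -1),   # small spread = consistent quality
}

for name, (stat, direction) in scenarios.items():
    scores = {a: stat(v) for a, v in samples.items()}
    # Sort so that the "better" end of the relation comes first (best to worst).
    ranking = sorted(scores, key=lambda a: direction * scores[a], reverse=True)
    print(f"{name:13s}:", ranking)
```

For indicators that are to be minimized, the directions would simply be flipped.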

2.2  Consensus Ranking

If we want to find the best algorithm out of a given set by using the individual rankings obtained using the methodology described above, we need to aggregate these into a single ranking. This is called finding a consensus among the individual rankings and the result is a consensus ranking. As mentioned earlier, there is no single best consensus method for rankings and we will elaborate on this here.

One can postulate several criteria that a “best” consensus method should fulfill (Arrow, 1950):

  1. A consensus method that takes into account all rankings instead of mimicking one predetermined ranking is said to be nondictatorial.

  2. A consensus method that, given a fixed set of rankings, deterministically returns a complete ranking is called a universal consensus method or is said to have a universal domain.

  3. A consensus method fulfills the independence of irrelevant alternatives criterion, or in short form the IIA criterion, if given two sets of rankings {r_1, ..., r_n} and {t_1, ..., t_n} in which for every i the order of two algorithms a_1 and a_2 in r_i and t_i is the same, the resulting consensus rankings rank a_1 and a_2 in the same order. IIA means that introducing a further algorithm does not lead to a rank reversal between any of the already ranked algorithms, which is a very strict requirement.

  4. A consensus method which ranks an algorithm higher than another algorithm if it is ranked higher in a majority of the individual rankings for which the consensus ranking is sought fulfills the majority criterion.

  5. A consensus method is called Pareto efficient if, given a set of rankings in which every ranking ranks an algorithm a_i higher than an algorithm a_j, the consensus also ranks a_i higher than a_j.

Unfortunately, not all criteria can be met simultaneously because the IIA criterion and the majority criterion are incompatible if we assume a nondictatorial consensus method. Thus, consensus approaches will yield different results with respect to the criteria chosen to be fulfilled. At this point one might ask why we even bother to find a consensus if it will always be a trade-off between the above criteria. The reason is that, while the consensus may not be optimal in some sense, it still gives us insight into which algorithms might be worth further investigation and which algorithms perform rather poorly. However, we will have to take care that no (accidental or even intentional) manipulation of the consensus takes place. This might easily happen if the IIA criterion is not fulfilled—which it usually is not. Then simply adding similar algorithms to the benchmark can increase the chance of a rank reversal.

Generally, we can differentiate between positional and optimization-based methods. Positional methods calculate sums of scores s for each algorithm a_i over all rankings and result in an ordering as follows, for t test functions and q quality indicators, where R_{mn} denotes the ranking induced by the mth function and the nth comparison or quality indicator:
\[ a_i \succeq a_j \iff \sum_{m=1}^{t} \sum_{n=1}^{q} s(a_i, R_{mn}) \;\ge\; \sum_{m=1}^{t} \sum_{n=1}^{q} s(a_j, R_{mn}) \tag{1} \]
The simplest score function assigns a value of one to the best algorithm in each ranking while all other algorithms get a value of zero. Although this is somewhat intuitive, undesirable consensus rankings can occur. Consider the situation with two directly opposed rankings of three algorithms, that is, a_1 ≻ a_2 ≻ a_3 as well as a_3 ≻ a_2 ≻ a_1. After calculating the scores, the consensus would be a_1 ∼ a_3 ≻ a_2 although the rankings are directly opposed and we would intuitively expect to have a_1 ∼ a_2 ∼ a_3.

The Borda count method (de Borda, 1781) accounts for this drawback and assigns an algorithm one point for each algorithm that is not better than the algorithm considered, that is, s(a_i, R) = #{a_j ∈ A : a_i ⪰ a_j}, which in the case of no ties reduces to the ranks of the data. Unfortunately, the Borda method fulfills neither the majority nor the IIA criterion. It is still a popular consensus method because it can be easily implemented and understood. The main criticism voiced in the literature is that it, like all positional consensus methods, implicitly assumes a distance between the positions of a ranking—which is equally spaced in the case of the Borda method.
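
The following minimal sketch contrasts the simple “one point for the winner” score with the Borda count on the opposed-rankings example above. It is illustrative Python of ours, not the procedure used to produce the tables in Section 3.

```python
from collections import defaultdict

# Two directly opposed strict rankings of three algorithms, given as ordered lists
# (best first): a1 > a2 > a3 and a3 > a2 > a1.
rankings = [["a1", "a2", "a3"], ["a3", "a2", "a1"]]
algorithms = sorted(set(a for r in rankings for a in r))

def plurality_scores(rankings):
    """Simplest positional score: one point for the best algorithm in each ranking."""
    s = defaultdict(float)
    for r in rankings:
        s[r[0]] += 1.0
    return {a: s[a] for a in algorithms}

def borda_scores(rankings):
    """Borda count: one point per algorithm that is not ranked better (no ties here)."""
    s = defaultdict(float)
    for r in rankings:
        k = len(r)
        for pos, a in enumerate(r):   # pos 0 = best
            s[a] += k - pos           # number of algorithms a is better than or tied with
    return {a: s[a] for a in algorithms}

print("plurality:", plurality_scores(rankings))  # a1 and a3 tie ahead of a2
print("Borda:    ", borda_scores(rankings))      # all three algorithms tie
```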

Optimization-based methods on the other hand require a distance function (see Cook and Kress, 1992, for an overview) between different rankings which quantifies how much two rankings deviate from each other. Central to this is a notion of betweenness, expressed by pairwise comparisons. That is, a ranking r2 lies between r1 and r3 if for all pairs of algorithms either r2 agrees with r1 or r3 on the relative order of the pair, or r1 and r3 have conflicting orderings for the pair and r2 declares the pair to be tied. From this, we obtain a geometry on the set of all possible rankings and can ask for something like a mean (minimizing squared loss) or median (minimizing absolute loss) consensus ranking.

If we add the requirement of non-negativity, symmetry, and the triangle inequality to the list of properties our distance function should fulfill, we can formalize the above notion of a mean or median consensus by taking the distance measure as our loss function and then minimizing over all admissible consensus rankings \(\hat{R}\):
\[ L_p(\hat{R}) = \sum_{m=1}^{t} \sum_{n=1}^{q} d\bigl(\hat{R}, R_{mn}\bigr)^{p} \tag{2} \]
The consensus ranking is then given by the ranking whose loss L_p is minimal. Setting p = 1 results in what is called a median consensus ranking and p = 2 results in a mean consensus ranking.
Kemeny and Snell (1972) postulated meaningful axioms for distance functions which can be proven to uniquely lead to the symmetric difference (SD) distance, which counts the cases where a pair (a_i, a_j) is contained in one of the relations but not the other:
\[ d_{SD}(R_1, R_2) = \mathbf{1}_k^{\top}\, \lvert C_1 - C_2 \rvert \, \mathbf{1}_k \tag{3} \]
where \(\mathbf{1}_k\) is the k-dimensional one vector, C_i is the incidence matrix belonging to the relation that corresponds to the ranking R_i, and \(\lvert \cdot \rvert\) denotes the elementwise absolute value. SD/L denotes the SD approach over the set of all linear orders and SD/O the SD approach over the set of all partial orders.
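
As a small illustration of the SD distance of Equation (3) and the resulting optimization-based consensus, the sketch below builds incidence matrices, computes pairwise SD distances, and returns the linear order with minimal total loss. It is our own Python, restricted to strict rankings (the SD/L variant) and to a brute-force search that is only feasible for a handful of algorithms.

```python
import itertools
import numpy as np

def incidence(order, algorithms):
    """Incidence matrix C of a strict ranking given as an ordered list (best first):
    C[i, j] = 1 iff algorithm i is ranked better than or equal to algorithm j."""
    idx = {a: n for n, a in enumerate(algorithms)}
    pos = {a: p for p, a in enumerate(order)}
    k = len(algorithms)
    C = np.zeros((k, k), dtype=int)
    for a in algorithms:
        for b in algorithms:
            if pos[a] <= pos[b]:
                C[idx[a], idx[b]] = 1
    return C

def sd_distance(order1, order2, algorithms):
    """Kemeny-Snell symmetric-difference distance, 1' |C1 - C2| 1 as in Eq. (3)."""
    C1, C2 = incidence(order1, algorithms), incidence(order2, algorithms)
    return int(np.abs(C1 - C2).sum())

def sd_l_consensus(rankings, algorithms, p=1):
    """Brute-force SD/L consensus: the linear order minimizing the summed
    (distance ** p) to all individual rankings (p=1: median, p=2: mean)."""
    best, best_loss = None, float("inf")
    for candidate in itertools.permutations(algorithms):
        loss = sum(sd_distance(list(candidate), r, algorithms) ** p for r in rankings)
        if loss < best_loss:
            best, best_loss = list(candidate), loss
    return best

rankings = [["a1", "a2", "a3"], ["a1", "a3", "a2"], ["a2", "a1", "a3"]]
print(sd_l_consensus(rankings, ["a1", "a2", "a3"]))   # -> ['a1', 'a2', 'a3']
```

An SD/O consensus would additionally have to search over orders with ties, which is omitted here.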

Unfortunately, we cannot give a general recommendation regarding the introduced consensus methods (Saari and Merlin, 2000) as each method offers a different trade-off of the consensus criteria. The SD/L and SD/O methods meet the majority criterion and thus cannot meet the IIA criterion simultaneously. However, on real data they rarely result in rank reversals if algorithms are added or dropped. The Borda count (BC) method does not fulfill either of these criteria. Saari and Merlin (2000) show that the SD method always ranks the Borda winner above the Borda loser and that the Borda method always ranks the SD winner above the SD loser.

It is important to note that consensus rankings generally do not admit nesting in a hierarchical structure. For example, separate consensus rankings could be of interest for test functions with specific features, for example, high multimodality or convexity. While this certainly is a valid and meaningful approach, one has to keep in mind that an overall consensus of these separate consensus rankings does not necessarily have to equal the consensus ranking directly generated based on all individual rankings.

3  BBOB Analysis

In the following sections we will apply the benchmarking framework which was presented in the previous section to the joint results of the 2009 (Hansen et al., 2010) and 2010 BBOB open benchmark (Auger et al., 2010). Our aim will not be to identify the “best” algorithm but to try to characterize the performance of different algorithm classes. This will allow us to deduce a small subset of algorithms in Section 3.3 which together might form the basis of a practitioner’s black box optimization toolbox.

3.1  Benchmark Setup

In this section we will give a short overview of the BBOB 2009 and 2010 open benchmarks. For a detailed description of the experimental setup see Hansen, Auger, et al. (2009). The general setup used by the BBOB team follows the methodology depicted in Figure 1 but instead of choosing the k algorithms themselves, since this is an open benchmark, researchers are invited to submit results for their algorithms. It uses a balanced and unbiased sample of the published set of test functions from the field of black box optimization. Their characteristics have been studied by the BBOB team to ensure that different aspects and difficulties are covered.

To assess the performance of an algorithm, the BBOB team proposes the use of the so-called expected running time (ERT). This measure estimates the expected number of function evaluations required to achieve an accuracy of ε. For a given ε the ERT is defined as
\[ \mathrm{ERT}(\varepsilon) = \mathrm{E}(T_s) + \frac{1 - p_s}{p_s}\,\mathrm{E}(T_{us}) \tag{4} \]
where T_s denotes the number of function evaluations until the algorithm reaches the desired accuracy of ε, T_{us} denotes the number of function evaluations until the algorithm terminates without reaching the desired accuracy level (unsuccessful run), and p_s is the probability of a successful run. We will estimate the ERT from the r runs performed by every algorithm on each test function by plugging in the empirical equivalents of the unknown parameters. For a thorough motivation of the ERT see Hansen et al. (2005). There are certainly other measures and ways to characterize algorithm performance. One could for example ask for the accuracy attained after a fixed budget of function evaluations. In this analysis we will however focus on the ERT and restrict ourselves to the performance metric originally suggested by the BBOB team.
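
A plug-in estimator of Equation (4) can be written in a few lines. The sketch below (Python, with made-up run data and a function name of our choosing) mirrors the description above by averaging the evaluation counts of successful and unsuccessful runs and estimating p_s by the empirical success rate; it is not the BBOB post-processing code.

```python
import numpy as np

def estimated_ert(evals, success):
    """Plug-in estimate of the expected running time (Eq. (4)) from r runs.

    evals[i]   -- function evaluations used in run i (until the target accuracy
                  was hit, or until termination for unsuccessful runs)
    success[i] -- True if run i reached the target accuracy
    """
    evals = np.asarray(evals, dtype=float)
    success = np.asarray(success, dtype=bool)
    p_s = success.mean()                       # empirical success probability
    if p_s == 0.0:
        return np.inf                          # target never reached
    exp_succ = evals[success].mean()           # mean evaluations of successful runs
    if success.all():
        return exp_succ
    exp_unsucc = evals[~success].mean()        # mean evaluations of unsuccessful runs
    return exp_succ + (1.0 - p_s) / p_s * exp_unsucc

# Toy example: 15 runs, 10 of which reach the target accuracy.
evals   = [1200, 1500, 900, 2000, 1100, 1800, 1300, 950, 1700, 1400,
           5000, 5000, 5000, 5000, 5000]
success = [True] * 10 + [False] * 5
print(round(estimated_ert(evals, success)))    # -> 3885
```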

In order to estimate the ERT, each contestant is required to submit 15 runs of his or her algorithm for each of the 24 test functions. It was required to submit results for 2, 3, 5, 10, and 20 dimensional parameter spaces. Results for 40 dimensional parameter spaces were optional. The experimenter may therefore have to perform up to 15 · 24 · 6 = 2160 runs. The results of each run are automatically stored in a file by the BBOB framework. From this file it is possible to infer the number of function evaluations used for almost any accuracy level ε. There is one small difference in the way the 15 runs are composed between the 2009 and 2010 BBOB instance. In 2009, the 15 runs were divided among five different test function instances1 for each of which three runs had to be performed. In 2010, instead of five instances, 15 instances with just one run per instance were required.

Table 2:
Algorithms for which there are deviations from the 15 runs per test function/dimension rule. The second column shows the minimal and maximal number of runs performed per test function/dimension combination.

Algorithm                Number of runs
ALPS                     15–45
BFGS                     15–30
(1+1)-CMA-ES             15–30
DIRECT                   5–5
AVGNEWUOA                15–30
MA-LS-CHAIN              15–29
MCS                      15–30
(1+1) ES                 15–26
Artificial bee colony    15–33

The way the BBOB team analyzes the results turned in by the contestants is different from what we propose in Section 2. We will therefore refer the reader to Hansen et al. (2010) for a description of their methodology and the results they obtain. Instead we will use the raw data, graciously provided by the BBOB team on their website2, to conduct an analysis based on the methods proposed in Mersmann et al. (2010b).

Before we begin with the analysis we would like to mention a few discrepancies between the actual data available and what should be included in a contestant’s submission. Not all contestants have turned in results for each dimension or test function. This is legitimate but hinders some analysis. For example, very few algorithm runs in 40D are available. We can only speculate if the other contestants did not turn in results because their algorithms did not perform well in this dimension or because they did not have enough time/resources to perform the additional runs. Another problem is that some results do not contain the required 15 runs per test function/dimension combination. In fact, some authors turned in more runs than required! These findings are summarized in Table 2. One should note that DIRECT is a deterministic method which was submitted in 2009 when three replications for each function were required by the BBOB rules. It therefore does abide by the rules since each repetition would have produced the same result. To avoid any biasing of the results we have opted to use a form of stratified sampling that either chose 15 unique instances (2010 submission) or three runs from five instances (2009 submission) for those submissions with more than 15 runs. Other issues encountered include backup files in the submitted archives and differing directory structures between contestants. In the future it would be desirable to standardize the layout of the submitted data to ease external analysis.
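
A sketch of the kind of stratified subsampling described above is given below (Python). The data layout, the function name, and the use of a fixed random seed are our own assumptions and not part of the BBOB post-processing tools.

```python
import random
from collections import defaultdict

def subsample_runs(runs, year, seed=0):
    """Reduce an over-complete submission to the required 15 runs.

    runs -- list of (instance_id, run_record) tuples for one function/dimension
    year -- 2009: keep 3 runs for each of the first 5 instances,
            2010: keep 1 run for each of 15 distinct instances
    """
    rng = random.Random(seed)
    by_instance = defaultdict(list)
    for inst, rec in runs:
        by_instance[inst].append(rec)

    selected = []
    if year == 2009:
        for inst in sorted(by_instance)[:5]:
            selected += rng.sample(by_instance[inst], 3)
    else:
        for inst in sorted(by_instance)[:15]:
            selected.append(rng.choice(by_instance[inst]))
    return selected
```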

3.2  Individual Rankings

Initially, algorithm rankings for each test problem and dimension combination are generated for all considered accuracy levels ε. An exemplary visualization of the rankings obtained for the maximum accuracy level of 10^-8 is shown in Figure 2. All rankings are obtained by ranking based on the ERT using the ≤ (i.e., smaller is better) relation.

Figure 2: Parallel coordinate plot for each function, showing the rank of each algorithm as the number of dimensions rises, for the accuracy level of 10^-8.
From the plot, we can see that it is not advisable to simultaneously analyze all competing algorithms. Obviously, figures are not very meaningful for 55 algorithms due to information overload. Due to this limitation and the inherent risk of including multiple variants of one algorithm in an analysis, as described in the previous section, algorithm groups are defined which consist of a number of very similar algorithms, better described as algorithm variants. For instance, there are 17 variants of the CMA-ES which differ only in population size and restart strategy. In Section 3.3 we discuss why an inclusion of all such algorithm variants can strongly bias the resulting consensus rankings. Therefore we will choose a best variant from each algorithm group and use this algorithm as a representative for further analysis and especially the final consensus rankings.

An aggregation on the test problem level would be possible as well. In Hansen, Finck, et al. (2009) the BBOB test problems are grouped into predefined problem classes with specific properties: (1) unimodal separable problems (UM-Sep., f1–f5), (2) unimodal low or moderate conditioned problems (UM-Low C, f6–f9), (3) high conditioned and unimodal problems (UM-High C, f10–f14), (4) multimodal problems with adequate global structure (MM-adeq. GS, f15–f19), and (5) multimodal problems with weak global structure (MM-weak GS, f20–f24). As already shown in Mersmann et al. (2010a) and obvious from Figure 2, the algorithm performance is not consistent enough within the specified function groups to justify the selection of a reduced set of representative test problems.

Some general statements about the algorithms’ performance can however be extracted from Figure 2. It is evident that there is a change in the general performance of the algorithms as the dimension rises, for example, not surprisingly the performance of gradient-based methods like BFGS in most cases deteriorates with increasing dimension. On the other hand, there is an opposite tendency for the CMA-ES variants. Only rarely is the set of the best-ranked algorithms quite stable across all dimensions, as for the sphere problem f1 and the rotated Rosenbrock problem f9, for instance. Interestingly, a few algorithms, for example, harmony search, perform even worse than Monte Carlo search for some combinations of dimension and test problem.

Specifically, the differences within dimensions 5–20 and within 2 and 3 are much smaller than the differences between these two groups. This has already been pointed out by Mersmann et al. (2010a). Therefore, detailed performance analysis of a carefully selected set of best performing algorithms should be conducted separately for these two dimension classes. In Section 3.5 we perform such an analysis for the higher dimensional class as this is the most interesting one in our view.

3.3  Algorithm Groups and Representatives

The high number of more than 50 algorithms (including variants) present in the combined BBOB data set of 2009 and 2010 overstrains the capacities of tables and figures, as can easily be seen in Figure 2. We therefore need to group the algorithms and continue only with the ones ranked best in each category. This requires a grouping criterion, for which we chose algorithmic similarity, so that all optimizers relying on the same base mechanism will be put into the same group. While this appears straightforward, for example, for the different variants of the CMA-ES, it is much less self-evident for the group of so-called hybrid optimization methods. In this case, the use of multiple search paradigms itself is the underlying principle. Another problem is the different size of the obtained groups. For this reason, and also because there is a difference in the use of nontrivial restart heuristics, we have cut the largest group (CMA-ES) in two: the simple CMA-ES variants and the more complex ones (dubbed CMA-hybrid). Even so, the former group still has 11 members. In stark contrast to that, the gradient and random search groups contain only one. This may seem a bit unfair, but it is a necessary step in the analysis to mitigate the possibility of inadvertently changing the ranking of two algorithms by just adding another variant of one of them. This risk was previously pointed out in the section on benchmarking. Moreover, we finally strive for a small number of best algorithms which is sufficient to, in some sense, cover all test problems. We expect that on very different problems, very different algorithms perform best, and not variants of the same algorithm. If this were the case, we would not call these algorithms variants but assume that the effect we see stems from tuning. Finally, we think that answers to more general questions such as: “Shall I use an estimation of distribution algorithm (EDA) or a gradient method on problem x?” are of higher interest than the choice of the exact variant, which may also be seen as a different parameterization in many cases. By selecting a representative out of each class we would like to support this view. This position should however not be misunderstood as discouragement of algorithm variant development. We would merely like to point out that a new variant should initially be benchmarked against the other available ones before competing against an algorithm from a different class.

Figure 3: Consensus rankings within the chosen algorithm groups, according to target value precision.

Figure 4: Consensus rankings within the chosen algorithm groups, according to function group.

Table 3:
Consensus ranking results for each algorithm group at the lowest/highest studied precision. Only the best algorithm is shown for each consensus method and as can be seen there are both groups in which the two consensus methods are in agreement and those where they do not agree.

Group           Consensus method   # Alg.   Best algorithm (lowest precision)   Best algorithm (highest precision)
CMA-ES          SD/L               11       (1+2_ms) CMA-ES                     (1+2_ms) CMA-ES
CMA-ES          Borda              11       (1+2_ms) CMA-ES                     (1+2_ms) CMA-ES
CMA-hybrid      SD/L               6        IPOP-ACTCMA-ES                      IPOP-ACTCMA-ES
CMA-hybrid      Borda              6        IPOP-ACTCMA-ES                      IPOP-ACTCMA-ES
DE              SD/L               4        DE-F-AUC                            DE-F-AUC
DE              Borda              4        DE-F-AUC                            DE-F-AUC
EDA             SD/L               4        iAMALGAM                            iAMALGAM
EDA             Borda              4        iAMALGAM                            iAMALGAM
GA/ES           SD/L               7        G3PCX                               G3PCX
GA/ES           Borda              7        G3PCX                               G3PCX
Global search   SD/L               6        AVGNEWUOA                           AVGNEWUOA
Global search   Borda              6        FULLNEWUOA                          FULLNEWUOA
Gradient        SD/L               1        BFGS                                BFGS
Gradient        Borda              1        BFGS                                BFGS
Hybrid          SD/L               7        MOS                                 VNS
Hybrid          Borda              7        MOS                                 MOS
Local search    SD/L               5        Nelder-Doerr                        Nelder-Doerr
Local search    Borda              5        Nelder-Doerr                        Nelder-Doerr
PSO             SD/L               3        PSO                                 PSO
PSO             Borda              3        PSO                                 PSO
Random          SD/L               1        RANDOMSEARCH                        RANDOMSEARCH
Random          Borda              1        RANDOMSEARCH                        RANDOMSEARCH

In the following, we list the chosen groups and their algorithms, together with a short explanation of the concept that unites each group. Figure 3 and Figure 4 display consensus rankings within the algorithm groups over the required target precision and the BBOB function groups as described in Section 3.2. Note that for selecting the representative, we rely on the Borda count consensus as the most intuitive approach. In case the rankings over precisions and function groups lead to different results, we allow for two representatives. Table 3 summarizes the Borda and SD/L consensus results for the algorithm groups for the lowest and highest precision value considered.

  • CMA-ES. (1,2_m) CMA-ES, (1,2_ms) CMA-ES, (1,2) CMA-ES, (1,2s) CMA-ES, (1,4_m) CMA-ES, (1,4_ms) CMA-ES, (1,4) CMA-ES, (1,4s) CMA-ES, (1+1) CMA-ES, (1+2_ms) CMA-ES, () CMA-ES. This group contains all simple variants of the CMA-ES that do not employ a specific restart heuristic (other than randomly placed restarts). They can be seen as reparameterizations of the original CMA-ES (which is not in the test set). The representative of this group is the (1+2_ms) CMA-ES, which strongly dominates the other CMA-ES versions on the unimodal function groups.

  • CMA-Hybrid. BIPOP-CMA-ES, CMA-EGS, IPOP-ACTCMA-ES, IPOP-CMA-ES, IPOP-SEP-CMA-ES, NBC-CMA. In contrast to the group above, these CMA variants all employ nontrivial restart heuristics that change the population size or determine the search space positions for restarts. The group is represented by the IPOP-ACTCMA-ES.

  • DE. DE-F-AUC, DE-PSO, DEuniform, PM-AdapSS-DE. Here we find all methods that may largely be considered as differential evolution (DE) algorithms. It is represented by DE-F-AUC which clearly dominates over most precision values.

  • EDA. AMALGAM, BAYEDA, Cauchy-EDA, iAMALGAM. This group is composed of the estimation of distribution (EDA) methods, and its representative is the iAMALGAM, which ranks best over four of the five function groups.

  • GA/ES. ALPS, DASA, G3PCX, One-Fifth Variant3, RCGA, Simple GA, SPSA. Algorithms which largely follow the design principles of a genetic algorithm or an evolution strategy and do not use the covariance matrix adaptation are collected here. This group is represented by G3PCX as it ranks best over three of the five problem groups.

  • Global Search. AVGNEWUOA, DIRECT, FULLNEWUOA, GLOBAL, MCS, NEWUOA. These algorithms take explicit measures to aim for a good covering of the search space. Its representative is FULLNEWUOA because it is a good average performer.

  • Gradient. BFGS. This group contains the only pure gradient method of all competing algorithms, BFGS, which also represents it.

  • Hybrid. EDA-PSO, MA-LS-CHAIN, MOS, nPOEMS, oPOEMS, POEMS, VNS. This group consists of methods that rely on multiple different search paradigms and can therefore be considered hybrid. We also put memetic algorithms here. The group is represented by the MOS algorithm which achieved the top rank for two of five cases.

  • Local Search. fminsearch, Line Search—fminbnd, Line Search—STEP, Local Search—Rosenbrock, Local Search: Nelder-Doerr. Here, the direct search and related methods are collected that explicitly aim for best possible local optimization and do not possess an explicit global search mechanism. However, Nelder-Doerr employs an evolutionary component by selecting the best of a population of local search instances. Nevertheless, it appears more similar to a (repeated) local search algorithm than to a hybrid or global optimization algorithm. It is also the representative of this group as it ranks best over three of the five problem groups.

  • PSO. Artificial bee colony, PSO, PSO Bounds (2010). This group consists of nature-inspired swarm algorithms, namely different particle swarm optimization methods (PSO) and the artificial bee colony (ABC) method. It is represented by the PSO method which ranks best for most precisions and problem groups.

  • Random. RANDOMSEARCH. Only one random search algorithm has entered the BBOB instances, and its search paradigm is different enough from all others to justify its own group. It also represents the group.

3.4  Representative Algorithms for Groups

As motivated in the previous section, we need to choose a few representatives from each algorithm group. This was done by calculating the Borda consensus over all algorithms in each group for the lowest and the highest studied accuracy level and choosing the best algorithm in each consensus as one of the representatives. All further analysis will also restrict itself to the two aforementioned precisions. Why do we use the Borda and not the SD/L consensus for this decision? The Borda consensus method captures our intent to find an algorithm that performs above average over all functions when compared to the other algorithms in the group. The SD/L method would prefer an algorithm that performs well on the majority of the test functions but may fail catastrophically on a minority. The chosen 11 algorithms for the further analysis are therefore (1+2_ms) CMA-ES, IPOP-ACTCMA-ES, DE-F-AUC, iAMALGAM, G3PCX, FULLNEWUOA, BFGS, MOS, Nelder-Doerr, PSO, and RANDOMSEARCH. We will call this group of algorithms the best algorithms in the following sections. Do not confuse this with the mythical best algorithm mentioned in Section 2.

To get an initial idea of how these best algorithms compare to each other we can look at the distribution of their ranks. For each test function and dimension we obtain two rankings of the best algorithms, one for each of the two studied accuracy levels. We can now extract the rank (i.e., the position) of each algorithm in each of these rankings and then look at this distribution. Two things we would want to see in a good algorithm are a low mean rank, which means it is often one of the better algorithms, and, as a secondary goal, a low variance of the ranks, implying that its performance does not vary much with the test function. Since we cannot expect the algorithms to perform equally well in 2D and 3D when compared to higher dimensions, we will look at these distributions separately. The results can be seen as box plots in Figure 5. We can instantly assess that there are differences between the 2D–3D and the 5D–20D distributions, as expected. In fact, in low dimensions it appears that a different set of algorithms performs well when compared to the high dimensional set.
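
The rank-distribution summary behind Figure 5 amounts to computing, per algorithm, the mean and spread of its positions over all individual rankings. A minimal sketch with placeholder data follows (Python; the names and numbers are purely illustrative, not the BBOB results).

```python
import numpy as np

# Hypothetical rank matrix: rows = individual rankings (one per test function,
# dimension, and accuracy level), columns = the selected "best" algorithms;
# entry = position of the algorithm in that ranking (1 = best).
rng = np.random.default_rng(42)
algorithms = ["alg_A", "alg_B", "alg_C", "alg_D"]
ranks = np.array([rng.permutation(len(algorithms)) + 1 for _ in range(50)])

mean_rank = ranks.mean(axis=0)   # affinely related to the Borda score of the algorithm
std_rank = ranks.std(axis=0)     # low spread = performance stable across functions

order = np.argsort(mean_rank)    # sort algorithms by mean rank, best first
for i in order:
    print(f"{algorithms[i]}: mean rank {mean_rank[i]:.2f}, sd {std_rank[i]:.2f}")
```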

Figure 5: Box plots showing the distribution of the rank of each of the best algorithms for all 2D and 3D (left) and 5D to 20D (right) rankings. The red dot marks the mean of the ranks, which coincides with the Borda score of the algorithm if a consensus was formed. The algorithms are therefore ordered by the Borda consensus. The vertical lines mark the median of the ranks.

Finally we can also look at an overall (Borda) consensus of the algorithms over all test functions, dimensions, and the two precision levels. This gives us

  • IPOP-ACTCMA-ES ≻ Nelder-Doerr ≻ (1+2_ms) CMA-ES ≻ iAMALGAM ≻ FULLNEWUOA ≻ BFGS ≻ MOS ≻ DE-F-AUC ≻ G3PCX ≻ PSO ≻ RANDOMSEARCH

More interesting are the Borda consensus rankings for the same scenario as above but not over all test functions. Instead, for each function group as defined in Hansen, Finck, et al. (2009), a consensus is calculated over all dimensions and accuracy levels. This gives the following consensus rankings:
  • Unimodal—Separable. Nelder-Doerr ≻ BFGS ≻ (1+2_ms) CMA-ES ≻ FULLNEWUOA ≻ IPOP-ACTCMA-ES ≻ iAMALGAM ≻ MOS ≻ DE-F-AUC ≻ PSO ≻ G3PCX ≻ RANDOMSEARCH

  • Unimodal—Low Conditioning. FULLNEWUOA ≻ IPOP-ACTCMA-ES ≻ BFGS ≻ (1+2_ms) CMA-ES ≻ Nelder-Doerr ≻ iAMALGAM ≻ G3PCX ≻ DE-F-AUC ≻ MOS ≻ PSO ≻ RANDOMSEARCH

  • Unimodal—High Conditioning. iAMALGAM ≻ IPOP-ACTCMA-ES ≻ (1+2_ms) CMA-ES ≻ Nelder-Doerr ≻ BFGS ≻ DE-F-AUC ≻ FULLNEWUOA ≻ MOS ≻ G3PCX ≻ PSO ≻ RANDOMSEARCH

  • Multimodal—Weak Global Structure. Nelder-Doerr ≻ FULLNEWUOA ≻ BFGS ≻ (1+2_ms) CMA-ES ≻ G3PCX ≻ MOS ≻ iAMALGAM ≻ IPOP-ACTCMA-ES ≻ DE-F-AUC ≻ PSO ≻ RANDOMSEARCH

  • Multimodal—Adequate Global Structure. IPOP-ACTCMA-ES ≻ MOS ≻ DE-F-AUC ≻ iAMALGAM ≻ Nelder-Doerr ≻ (1+2_ms) CMA-ES ≻ PSO ≻ FULLNEWUOA ≻ G3PCX ≻ BFGS ≻ RANDOMSEARCH

Here we can see that while Nelder-Doerr is only second in the overall ranking, it dominates two function groups, whereas the IPOP-ACTCMA-ES leads only one. Additionally, FULLNEWUOA and iAMALGAM, which do not play an important role in the overall ranking, each dominate one function group. It is obvious that knowing the problem type is very important for selecting the right algorithm.

After looking at these different ways to aggregate the results we might ask if there is a natural grouping of functions that arises from the performance of the algorithms. To answer this question and gain further insight into the performance characteristics of the set of best algorithms we will need some tools from statistics, which will be introduced next.

3.5  Statistical Methods

In the following section we apply different methods from data analysis and statistics to visualize and interpret the BBOB results. These statistical tools will be briefly introduced in this section. Much more detailed explanations can be found in the cited literature, and we recommend Hastie et al. (2001) as a general reference, which covers all the required material.

3.5.1  Multidimensional Scaling

Multidimensional scaling (MDS) is a visualization technique that embeds objects x_i, i = 1, ..., n, from a high-dimensional space into a lower dimensional Euclidean space (usually 2D or 3D). MDS operates only on a matrix of given distances d_ij (e.g., Euclidean) or dissimilarities between the objects. Note that MDS can therefore be applied even when only the d_ij are known, but not the x_i themselves. The embedding is performed in such a way that the distances are maintained as closely as possible by solving the following optimization problem:
\[ \min_{z_1, \ldots, z_n} \; \sum_{i < j} \bigl( d_{ij} - \lVert z_i - z_j \rVert \bigr)^2 \]
Here, the z_i are the low-dimensional mappings of the original x_i. The optimization problem is usually solved by a gradient descent algorithm.
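
A naive implementation of this stress minimization by gradient descent fits in a few lines. The sketch below (Python/NumPy) is our own simplistic implementation with an arbitrary learning rate and iteration count, not the MDS routine used for Figure 6.

```python
import numpy as np

def mds(D, dim=2, lr=0.01, iters=2000, seed=0):
    """Metric MDS by plain gradient descent on the stress
    sum_{i<j} (D[i, j] - ||z_i - z_j||)^2, as described above."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    Z = rng.normal(size=(n, dim))                  # random initial embedding
    for _ in range(iters):
        diff = Z[:, None, :] - Z[None, :, :]       # pairwise differences z_i - z_j
        dist = np.linalg.norm(diff, axis=-1)       # embedded distances
        np.fill_diagonal(dist, 1.0)                # avoid division by zero
        factor = 2.0 * (dist - D) / dist           # derivative of the stress w.r.t. dist
        np.fill_diagonal(factor, 0.0)
        grad = (factor[:, :, None] * diff).sum(axis=1)
        Z -= lr * grad                             # gradient descent step
    return Z

# Toy dissimilarity matrix for four objects (symmetric, zero diagonal).
D = np.array([[0.0, 1.0, 2.0, 3.0],
              [1.0, 0.0, 1.0, 2.0],
              [2.0, 1.0, 0.0, 1.0],
              [3.0, 2.0, 1.0, 0.0]])
print(mds(D))   # 2D coordinates whose pairwise distances approximate D
```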

3.5.2  Partitioning Around Medoids

Partitioning around medoids (PAM) is an unsupervised data mining method which clusters unlabeled objects x_1, ..., x_n into sets of neighboring items. Again, we assume that a matrix of distances d(x_i, x_j) between all objects x_i and x_j is available. At each stage of the algorithm a set of k representatives (the medoids) m_1, ..., m_k is maintained. As these are just a subset of the original data points, their set of indices suffices. The target function to minimize is defined as
\[ \sum_{i=1}^{n} d\bigl(x_i, m_{c(i)}\bigr), \]
where c(i) is the index of the nearest representative (medoid) to observation x_i.

In the initial phase the medoids are chosen in a greedy, iterative way so as to reduce the target function value. Afterward, in each step the algorithm exchanges a medoid with a non-medoid such that the target function is maximally reduced, and stops when no further improvement is possible.

The number of clusters can be decided by running PAM several times with different values for k and calculating the so-called average silhouette width. This width defines a numerical measure for each observation and reflects how well it fits into its selected cluster instead of any other cluster. For a formal definition see Kaufman and Rousseeuw (1990). The parameter k is then chosen by selecting the number of clusters with the highest silhouette width.
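
The sketch below shows how the number of clusters can be chosen on a precomputed distance matrix. It is a simplified PAM with a basic build and swap phase plus a straightforward average silhouette width, both our own illustrative Python rather than the implementation used for the BBOB analysis.

```python
import numpy as np
from itertools import product

def pam(D, k, max_iter=100):
    """Simplified PAM on a precomputed distance matrix D (build phase, then swaps)."""
    n = D.shape[0]
    cost = lambda med: D[:, med].min(axis=1).sum()   # total distance to nearest medoid
    medoids = [int(np.argmin(D.sum(axis=0)))]        # start with the most central object
    while len(medoids) < k:                          # greedy build phase
        cand = set(range(n)) - set(medoids)
        medoids.append(min(cand, key=lambda m: cost(medoids + [m])))
    for _ in range(max_iter):                        # swap phase
        improved = False
        for m, o in product(list(medoids), range(n)):
            if o in medoids:
                continue
            trial = [o if x == m else x for x in medoids]
            if cost(trial) < cost(medoids):
                medoids, improved = trial, True
                break
        if not improved:
            break
    return medoids, D[:, medoids].argmin(axis=1)     # medoids and cluster labels

def average_silhouette(D, labels):
    """Average silhouette width of a clustering, computed from the distance matrix."""
    n, sil = len(labels), []
    for i in range(n):
        own = (labels == labels[i]) & (np.arange(n) != i)
        a = D[i, own].mean() if own.any() else 0.0
        b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])
        sil.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return float(np.mean(sil))

# Toy data: two well-separated 2D blobs turned into a Euclidean distance matrix.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
widths = {k: average_silhouette(D, pam(D, k)[1]) for k in range(2, 6)}
print(max(widths, key=widths.get))   # the two-blob structure should yield k = 2
```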

3.5.3  Classification and Regression Trees

Breiman et al. (1984) construct a mapping between a d-dimensional input space (where the individual features X_1, ..., X_d are usually measured on a metric, ordinal, or nominal scale) and a finite set of class labels. The mapping takes the form of a binary tree, where every node represents a univariate rule X_i ≤ c (or X_i ∈ S if X_i is nominal). These rules are generated in a greedy, top-down fashion by analyzing a finite set of learning examples (x^(1), y^(1)), ..., (x^(n), y^(n)). The best rule for the current node k is found by first considering the so-called impurity
\[ i(k) = \sum_{g} p_{kg}\,\bigl(1 - p_{kg}\bigr) \]
of the node. This is maximal if all classes g in the data which reach node k occur with equal relative frequency p_{kg}. Note that our definition above is also called the Gini index, and alternative measures of node impurity are available. The rule for node k is now selected in such a way that the difference in impurity between k and its subnodes (created by this new rule) is maximal:
\[ \Delta i = i(k) - p_l\, i(l) - p_r\, i(r) \]
Here i(l) and i(r) are the impurities of the left and right subnode of k, and p_l and p_r are the fractions of the data at the parent node k which move to its children l and r, respectively.

Usually, a large tree is grown w.r.t. some stopping criterion, such as the minimal node size, and then simplified in a second stage by cost-complexity pruning.
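
The two quantities above are easy to compute. The sketch below (Python, with toy data of our own) evaluates the Gini impurity and searches for the threshold of a single numeric feature that maximizes the impurity reduction.

```python
import numpy as np

def gini(labels):
    """Gini impurity: maximal when all classes occur with equal relative frequency."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Best univariate rule x <= c for a numeric feature: maximize the impurity
    reduction  i(k) - p_l * i(l) - p_r * i(r)  over all candidate thresholds c."""
    parent = gini(y)
    best_c, best_gain = None, 0.0
    for c in np.unique(x)[:-1]:                   # thresholds between observed values
        left, right = y[x <= c], y[x > c]
        p_l, p_r = len(left) / len(y), len(right) / len(y)
        gain = parent - p_l * gini(left) - p_r * gini(right)
        if gain > best_gain:
            best_c, best_gain = c, gain
    return best_c, best_gain

# Toy data: the rule x <= 3 separates the two classes almost perfectly.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([0, 0, 0, 1, 1, 1, 1, 0])
print(best_split(x, y))   # the threshold 3 gives the largest impurity reduction
```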

3.6  Characterizing Performance of the Best Group

In Sections 3.3 and 3.4 we chose a subset of algorithms from the original dataset, called the best algorithms, and used consensus methods to find a best algorithm for some dimension or some subset of the test functions. In this section we will focus on understanding the structure the algorithms reveal in our set of test functions. Consequently, we will restrict ourselves to what we consider the current sweet spot of black box optimization, that is, to 5–20 dimensional problems.

For these problems, we would like to know if the similarity between test functions that is implied by the five function groups defined by the BBOB team is also present in the rankings. Therefore, we need means to characterize similarity or distance between rankings. Recall that we introduced such a distance metric for the optimization-based consensus method in Section 2. Using the SD metric proposed there we can calculate a dissimilarity matrix between all the individual rankings. We can visualize this matrix using multidimensional scaling and also try to cluster the rankings based on their relative distance to each other using PAM. Using the average silhouette width as the cluster index, the optimal number of clusters into which to partition the rankings is four. This is already a departure from the five function groups proposed. To see how well the two groupings of the test functions agree, we plot the 2D representation of the dissimilarity matrix as recovered by the MDS in Figure 6, and color the points according to their group memberships.

Figure 6: Plot of the two-dimensional projection recovered from the dissimilarity matrix of the rankings of the best algorithms in 5 to 20 dimensions. Each point represents a single test function/dimension combination. On the left the points are coded according to the function group the test function belongs to and on the right the coding shows which cluster was assigned to the test function by the PAM procedure.
Even though the MDS is only an approximate representation of the distance matrix, we can see that the clusters are also visible in the plots. On the other hand, the function groups do not seem to be reflected in the clustering or the MDS plot. We therefore infer that the function grouping which was provided by the BBOB team does not coincide with similar algorithm behavior. Instead we have empirically determined the four new groups shown in Table 4.

Table 4:
Table showing the cluster that was assigned to each function/dimension combination. Note that for f24 in 20 dimensions we do not have enough data to assign a cluster.

Function   5D   10D   20D
f01
f02
f03
f04
f05
f06
f07
f08
f09
f10
f11
f12
f13
f14
f15
f16
f17
f18
f19
f20
f21
f22
f23
f24              –

What is missing is a set of rules describing these clusters. From the list above it is almost impossible to deduce anything about what defines a cluster. Some functions are spread between different clusters, and all clusters contain some 5-, some 10-, and some 20-dimensional functions. So instead of trying to describe the clusters based on the function name and dimension, we will try to abstract away from the concrete test function and replace each test function by a set of features that describes the function. These were first proposed by Mersmann et al. (2010a) and are shown in Table 5. They are again a subjective way of categorizing each function. We then train a classification tree on the obtained dataset by using the function properties as the features and the cluster as our target class. The resulting tree is displayed in Figure 7 and permits two interesting observations.
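
A tree like the one in Figure 7 can be grown with any CART implementation. The sketch below uses scikit-learn's DecisionTreeClassifier (which employs the Gini criterion by default) on a made-up miniature of Table 5 with ordinally encoded properties and invented cluster labels; it only illustrates the procedure and does not reproduce Figure 7.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical miniature of Table 5: ordinal function properties encoded as
# integers (None=0, Low=1, Medium=2, High=3), plus the PAM cluster as target.
features = pd.DataFrame({
    "multimodality":    [0, 0, 3, 3, 1, 0],
    "variable_scaling": [0, 3, 1, 1, 0, 1],
    "global_structure": [0, 0, 3, 3, 0, 0],
})
cluster = [1, 2, 3, 3, 1, 2]     # made-up cluster labels for illustration only

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(features, cluster)
print(export_text(tree, feature_names=list(features.columns)))
```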

Figure 7: Decision tree describing the relationship between the features defined in Table 5 and the clusters found by PAM. Each node of the tree describes a decision. The variable on which the decision is based is given in the node and the values are the labels on each edge. The leaves of the tree describe the relative frequency of each cluster in the data that, according to the decision rules, belong in that leaf. High values mean that it is likely that a function with these properties would belong to the respective cluster. For example, all test functions which are in leaf node 4, that is, that have no multimodality, and no variable scaling, belong to cluster 1.

Table 5:
Classification of the noiseless functions based on their properties. Predefined groups are separated by extra space. Note that we did not assign global structure to the unimodal problems. This is debatable and contrary to the opinion of the BBOB organizers. However, it has not been formally defined in the BBOB setup.

Function, followed in order by: Multimodality, Global structure, Separability, Variable scaling, Homogeneity, Basin sizes, Global-local contrast, Plateaus

1: Sphere None None High None High None None None
2: Ellipsoidal separable None None High High High None None None
3: Rastrigin separable High Strong High Low High Low Low None
4: Büche-Rastrigin High Strong High Low High Medium Low None
5: Linear slope None None High None High None None None

6: Attractive sector None None None Low Medium None None None
7: Step ellipsoidal None None None Low High None None Small
8: Rosenbrock Low None None None Medium Low Low None
9: Rosenbrock rotated Low None None None Medium Low Low None

10: Ellipsoidal high conditioned None None None High High None None None
11: Discus None None None High High None None None
12: Bent cigar None None None High High None None None
13: Sharp ridge None None None Low Medium None None None
14: Different powers None None None Low Medium None None None

15: Rastrigin multimodal High Strong None Low High Low Low None
16: Weierstrass High Medium None Medium High Medium Low None
17: Schaffer F7 High Medium None Low Medium Medium High None
18: Schaffer F7 moderately ill-cond. High Medium None High Medium Medium High None
19: Griewank-Rosenbrock High Strong None None High Low Low None

20: Schwefel Medium Deceptive None None High Low Low None
21: Gallagher 101 peaks Medium None None Medium High Medium Low None
22: Gallagher 21 peaks Low None None Medium High Medium Medium None
23: Katsuura High None None None High Low Low None
24: Lunacek bi-Rastrigin High Weak None Low High Low Low None

First of all, we have no way of differentiating between cluster 3 and cluster 4 using the defined features. This is surprising and might lead to the discovery of new characteristics of the test functions which might explain the difference between the two clusters. The main difference between the remaining cluster 1 and cluster 2 is that cluster 1 has no variable scaling and cluster 2 is unimodal.

We conclude by presenting a Borda consensus over the four groups discovered using the cluster analysis. Recall that these four groups are already fairly homogeneous w.r.t. the algorithm rankings, since the individual rankings within each group are close to each other in the SD metric. The resulting consensus rankings are as follows.

  • Cluster 1. FULLNEWUOA ≻ BFGS ≻ Nelder-Doerr ≻ (1+2_ms) CMA-ES ≻ IPOP-ACTCMA-ES ≻ G3PCX ≻ iAMALGAM ≻ MOS ≻ DE-F-AUC ≻ PSO.

  • Cluster 2. IPOP-ACTCMA-ES ≻ (1+2_ms) CMA-ES ≻ iAMALGAM ≻ MOS ≻ DE-F-AUC ≻ Nelder-Doerr ≻ G3PCX ≻ FULLNEWUOA ≻ BFGS ≻ PSO.

  • Cluster 3. IPOP-ACTCMA-ES ≻ MOS ≻ DE-F-AUC ≻ iAMALGAM ≻ FULLNEWUOA ≻ PSO ≻ G3PCX ≻ (1+2_ms) CMA-ES ≻ Nelder-Doerr ≻ BFGS.

  • Cluster 4. MOS ≻ iAMALGAM ≻ IPOP-ACTCMA-ES ≻ Nelder-Doerr ≻ PSO ≻ (1+2_ms) CMA-ES ≻ BFGS ≻ DE-F-AUC ≻ FULLNEWUOA ≻ G3PCX.

The consensus rankings allow for making several interesting observations. We discover that the difference between clusters 1 and 2 is that classical optimization approaches work well on the functions in the first cluster, while evolution strategies outperform them on the second cluster of functions. These are the two clusters we can characterize fairly well, as can be seen in Figure 7. The last two clusters consist of a large number of functions that are among the hardest in the test function set, because on these RANDOMSEARCH outperforms some more advanced methods. Nearly all functions in clusters 3 and 4 are multimodal, and most of them are highly multimodal. Whereas the functions in cluster 3 can usually be solved by the IPOP-ACTCMA-ES, this is often not the case for cluster 4, so that many of these are not solved by any method.

Comparing the four clusters found here with the function groups defined by the BBOB team on the basis of human experience, we come to very different conclusions. However, we have to admit that we do not yet have adequate features to describe our groups.

4  Outlook on Future Work

Analyzing the benchmark data as done in Section 3 already generates many valuable insights into the performance ranking of optimizers and algorithmic groups under various problem conditions. However, it does not provide a satisfying answer to the urgent question of which optimization algorithm should be selected in practice for a given, unknown problem or problem domain. It is a well-known fact from practice, and also clearly visible in the results presented here, that no current optimization algorithm solves all problems equally well. Although the no-free-lunch theorem does not hold for continuous spaces, as shown by Auger and Teytaud (2007), it is very unlikely that one optimization algorithm will completely dominate all others in the near future. Therefore, a general set of rules that guides a practitioner in choosing an appropriate algorithm for the problem at hand from the vast pool of available optimizers would be a highly useful tool. At the same time, the knowledge used to construct such a ruleset might be used to construct even better or more robust optimizers.

Of course, constructing such a set of rules requires defining test problem characteristics and relating them in a meaningful way to the expected performance of an optimizer. This will in general be a very challenging problem. A very similar task has been considered in the machine learning community under the term metalearning (see, e.g., Brazdil et al., 1994), where one tries to predict which learner is most appropriate given a feature vector of dataset characteristics. In Mersmann et al. (2010a), we already proposed a set of manually constructed testset properties. These have two major disadvantages: they are discrete (e.g., low, medium, or high multimodality) and therefore somewhat ambiguous, and they must be defined by an expert, which limits their practical usefulness once one wants to move beyond solving and analyzing the BBOB problem set.

We are currently working on the definition of an extensive set of numerical, computable test problem characteristics. These include, among others, average gradients and convexity estimates, landmarking by simple and fast optimizers, and techniques from regression and classification to capture general landscape shapes. For first results, see Mersmann et al. (2011).
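To make this more concrete, the sketch below estimates two such numerical characteristics, a midpoint-convexity violation rate and an average finite-difference gradient norm, from random samples of a black-box function. These are illustrative stand-ins under simple assumptions (box constraints, cheap evaluations), not the exact feature definitions of Mersmann et al. (2011).

```python
# Rough sketch of two cheap, computable landscape features, estimated purely
# from function evaluations. These are illustrative stand-ins, not the exact
# feature definitions used in exploratory landscape analysis.
import numpy as np

def convexity_estimate(f, lower, upper, n_pairs=200, seed=None):
    """Fraction of random pairs (x, y) for which f violates midpoint convexity."""
    rng = np.random.default_rng(seed)
    dim = len(lower)
    x = rng.uniform(lower, upper, size=(n_pairs, dim))
    y = rng.uniform(lower, upper, size=(n_pairs, dim))
    mid = 0.5 * (x + y)
    fx = np.apply_along_axis(f, 1, x)
    fy = np.apply_along_axis(f, 1, y)
    fm = np.apply_along_axis(f, 1, mid)
    return float(np.mean(fm > 0.5 * (fx + fy)))  # close to 0 for a convex function

def mean_gradient_norm(f, lower, upper, n_points=100, h=1e-6, seed=None):
    """Average norm of forward finite-difference gradients at random points."""
    rng = np.random.default_rng(seed)
    dim = len(lower)
    norms = []
    for x in rng.uniform(lower, upper, size=(n_points, dim)):
        fx = f(x)
        grad = np.array([(f(x + h * np.eye(dim)[i]) - fx) / h for i in range(dim)])
        norms.append(np.linalg.norm(grad))
    return float(np.mean(norms))

# Toy usage on the 5-dimensional sphere function:
sphere = lambda x: float(np.sum(x ** 2))
lower, upper = np.full(5, -5.0), np.full(5, 5.0)
print(convexity_estimate(sphere, lower, upper), mean_gradient_norm(sphere, lower, upper))
```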

In our future work we will try to demonstrate that:

  1. These features can be constructed for any test problem without help from a human expert.

  2. For the BBOB set, these contain at least all the information gained by the manually crafted features.

  3. At least in an exploratory sense they can be used to relate test problem characteristics to optimizer performance or ranking.

The biggest remaining challenge will then be to efficiently calculate a relevant subset of these features in an online fashion and to use them to select, or switch to, an appropriate set of optimization algorithms for a given, unknown target problem.
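To fix ideas, a purely hypothetical shape of such an online selection step is sketched below: features computed from an initial design of evaluations are passed to a previously trained classifier that returns the name of a suggested algorithm. The features, the model, and the helper select_algorithm are illustrative assumptions, not part of any existing implementation.

```python
# Purely hypothetical sketch of the envisioned online selection step: cheap
# features are computed from an initial sample of the unknown problem and a
# previously trained model maps them to an algorithm name. The feature set
# and the model are placeholders, not an existing implementation.
import numpy as np

def select_algorithm(f, lower, upper, model, budget=250, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(lower, upper, size=(budget, len(lower)))
    y = np.apply_along_axis(f, 1, X)
    features = np.array([
        y.std() / (abs(y.mean()) + 1e-12),  # relative spread of observed values
        # fitness-distance correlation w.r.t. the best point sampled so far
        np.corrcoef(np.linalg.norm(X - X[y.argmin()], axis=1), y)[0, 1],
    ])
    return model.predict(features.reshape(1, -1))[0]  # suggested algorithm name

# Toy usage with a stand-in for a model trained on feature/algorithm pairs:
from sklearn.dummy import DummyClassifier
toy_model = DummyClassifier(strategy="most_frequent").fit(
    np.zeros((2, 2)), ["IPOP-ACTCMA-ES", "BFGS"])
print(select_algorithm(lambda x: float(np.sum(x ** 2)),
                       np.full(3, -5.0), np.full(3, 5.0), toy_model))
```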

5  Conclusions

In this article, we have shown how a combination of benchmarking methods and classical statistical exploratory data analysis can be used to gain insight into the performance characteristics of a set of algorithms under test. For this purpose, we introduced a novel approach to aggregating the results of black box optimization benchmarks. The approach requires a carefully chosen set of test functions and performance measures. We applied it to the combined 2009 and 2010 BBOB results. After partitioning the set of algorithms into groups of similar algorithm designs and reducing each group to one or two representative algorithms, we were able to show that the relative performance of these algorithms is far from uniform over all test functions. Even within the predefined groups of functions, algorithm performance varies widely. Using the similarity between the individual rankings, we applied cluster methods to find four groups of functions within which the relative performance of the algorithms is homogeneous. We then set out to explain the cluster memberships by using properties of the functions as features to build a decision tree that describes the relationship between clusters and function properties. We will continue this line of work by developing tools to automatically characterize functions using empirical features. Based upon these, rules can be constructed that allow practitioners to select a reasonable set of algorithms as a starting point for a new optimization problem. We call this new line of research exploratory landscape analysis (ELA).

Acknowledgments

This work was partly supported by the Collaborative Research Center SFB 823, the Graduate School of Energy Efficient Production and Logistics and the Research Training Group “Statistical Modelling” of the German Research Foundation.

Supplementary Material

The complete source code used to produce the figures, tables, and consensus rankings in this paper is available at http://ptr.p-value.net/ecj13, together with all figures in color and some additional figures that may help to clarify how some of the conclusions were derived, especially for Figure 2.

References

Arrow, K. J. (1950). A difficulty in the concept of social welfare. Journal of Political Economy, 58(4):328.

Auger, A., Finck, S., Hansen, N., and Ros, R. (2010). BBOB 2010: Comparison tables of all algorithms on all noiseless functions. INRIA. (Technical Report RT-388).

Auger, A., and Teytaud, O. (2007). Continuous lunches are free! In Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, GECCO '07, pp. 916–922.

Brazdil, P., Gama, J., and Henery, B. (1994). Characterizing the applicability of classification algorithms using meta-level learning. In ECML-94: Proceedings of the European Conference on Machine Learning, pp. 83–102.

Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and regression trees. London: Chapman & Hall.

Cook, W. D., and Kress, M. (1992). Ordinal information and preference structures. Upper Saddle River, NJ: Prentice Hall.

de Borda, J. C. (1781). Mémoire sur les élections au scrutin. Histoire de l'Académie Royale des Sciences, Paris.

Hansen, N., and Auger, A. (2005). Performance evaluation of an advanced local search evolutionary algorithm. In Proceedings of the IEEE Congress on Evolutionary Computation, pp. 1777–1784.

Hansen, N., Auger, A., Finck, S., and Ros, R. (2009a). Real-parameter black-box optimization benchmarking 2009: Experimental setup. INRIA. (Technical Report RR-6828).

Hansen, N., Auger, A., Ros, R., Finck, S., and Pošík, P. (2010). Comparing results of 31 algorithms from the black-box optimization benchmarking BBOB-2009. In Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation, GECCO '10, pp. 1689–1696.

Hansen, N., Finck, S., Ros, R., and Auger, A. (2009b). Real-parameter black-box optimization benchmarking 2009: Noiseless functions definitions. INRIA. (Technical Report RR-6829).

Hastie, T., Tibshirani, R., and Friedman, J. (2001). The elements of statistical learning. Berlin: Springer.

Hornik, K., and Meyer, D. (2007). Deriving consensus rankings from benchmarking experiments. In R. Decker and H.-J. Lenz (Eds.), Advances in Data Analysis, Proceedings of the 30th Annual Conference of the Gesellschaft für Klassifikation, pp. 163–170. Springer.

Hunter, D. J. (2008). Essentials of discrete mathematics. Boston, MA: Jones and Bartlett.

Kaufman, L., and Rousseeuw, P. (1990). Finding groups in data: An introduction to cluster analysis. New York: Wiley Interscience.

Kemeny, J. G., and Snell, J. L. (1972). Mathematical models in the social sciences. Cambridge, MA: MIT Press.

Mersmann, O. (2009). Benchmarking evolutionary multiobjective optimization algorithms using R. Bachelor's thesis, TU Dortmund. http://www.statistik.tu-dortmund.de/~olafm/files/ba.pdf

Mersmann, O., Bischl, B., Trautmann, H., Preuss, M., Weihs, C., and Rudolph, G. (2011). Exploratory landscape analysis. In Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, GECCO '11, pp. 829–836.

Mersmann, O., Preuss, M., and Trautmann, H. (2010a). Benchmarking evolutionary algorithms: Towards exploratory landscape analysis. In Proceedings of the 11th International Conference on Parallel Problem Solving from Nature, PPSN XI. Lecture notes in computer science, Vol. 6238 (pp. 71–80). Berlin: Springer-Verlag.

Mersmann, O., Trautmann, H., Naujoks, B., and Weihs, C. (2010b). Benchmarking evolutionary multiobjective optimization algorithms. In Proceedings of the Congress on Evolutionary Computation, CEC.

Mersmann, O., Trautmann, H., Naujoks, B., and Weihs, C. (2010c). On the distribution of EMOA hypervolumes. In C. Blum and R. Battiti (Eds.), LION. Lecture notes in computer science, Vol. 6073 (pp. 333–337). Berlin: Springer-Verlag.

Mood, A., Graybill, F., and Boes, D. (1974). Introduction to the theory of statistics. New York: McGraw-Hill.

Saari, D. G., and Merlin, V. R. (2000). A geometric examination of Kemeny's rule. Social Choice and Welfare, 17(3):403–438.

Zitzler, E., and Thiele, L. (1998). Multiobjective optimization using evolutionary algorithms: A comparative case study. In Proceedings of the 5th International Conference on Parallel Problem Solving from Nature, PPSN V. Lecture notes in computer science, Vol. 1498 (pp. 292–304). Berlin: Springer-Verlag.

Notes

1. Slight reparameterizations of the test function obtained by rescaling, rotating, or otherwise transforming the parameter vector before applying the function.

3. Note that because of the difficulty of visualizing the results at this stage on just one page, we have made larger individual plots available at http://ptr.p-value.net/ecj13