## Abstract

In this article, we build upon previous work on designing informative and efficient *Exploratory Landscape Analysis* features for characterizing problems' landscapes and show their effectiveness in automatically constructing algorithm selection models in continuous black-box optimization problems. Focusing on algorithm performance results of the COCO platform of several years, we construct a representative set of high-performing complementary solvers and present an algorithm selection model that, compared to the portfolio's single best solver, on average requires less than half of the resources for solving a given problem. Therefore, there is a huge gain in efficiency compared to classical ensemble methods combined with an increased insight into problem characteristics and algorithm properties by using informative features. The model acts on the assumption that the function set of the *Black-Box Optimization Benchmark* is representative enough for practical applications. The model allows for selecting the best suited optimization algorithm within the considered set for unseen problems *prior* to the optimization itself based on a small sample of function evaluations. Note that such a sample can even be reused for the initial population of an evolutionary (optimization) algorithm so that even the feature costs become negligible.

## 1 Introduction

Although the *Algorithm Selection Problem* (ASP, Rice, 1976) has been introduced more than four decades ago, there only exist few works (e.g., Bischl et al., 2012; Muñoz Acosta, Sun et al., 2015), which perform algorithm selection in the field of continuous optimization. Independent of the underlying domain, the goal of the ASP can be described as follows: given a set of optimization algorithms $A$, often denoted algorithm *portfolio*, and a set of problem instances $I$, one wants to find a model $m:I\u2192A$ that selects the best algorithm $A\u2208A$ from the portfolio for an *unseen* problem instance $I\u2208I$. Albeit there already exists a plethora of optimization algorithms, even when only considering single-objective, continuous optimization problems, none of them can be considered to be superior to all the other ones across all optimization problems. Hence, it is very desirable to find a sophisticated selection mechanism, which automatically picks the portfolio's best solver for a given problem. Of course hyper-heuristics (Burke et al., 2003) already internally combining several algorithmic approaches can be part of the solver-portfolio as well.

Within other optimization domains, such as the well-known *Traveling Salesperson Problem*, feature-based algorithm selectors have already shown their capability of outperforming the respective state-of-the-art optimization algorithm(s) by combining machine learning techniques and problem dependent features (Kotthoff et al., 2015; Kerschke et al., 2017). As schematized in Figure 1, we now transfer the respective idea of using instance-specific features to single-objective continuous optimization problems based on Exploratory Landscape Analysis (ELA, Mersmann et al., 2011) for leveraging solver complementarity of a well-designed algorithm portfolio.

As we show within this work, the integration of ELA results in strong performance improvements (averaged across the entire benchmark) over *any* of the portfolio's single solvers. More precisely, our selector requires only half of the resources needed by the portfolio's single best solver. Hence, our model strongly reduces the gap towards the idealistic, and thus, from a realistic point of view unreachable, virtual best solver. The latter is an oracle-like algorithm selector, which always predicts the best algorithm for a given problem, without using any additional information (such as problem features) and therefore without any additional costs.

To the best of our knowledge, this work is the first one of its kind in the domain of continuous black-box optimization, which successfully combines automatically computable features, by means of ELA, with sophisticated machine learning and feature selection techniques in order to model powerful algorithm selectors on a comprehensive, well-accepted algorithm portfolio and benchmark set. Our proposed approach is related to the field of *automated machine learning* (AutoML, see e.g., Feurer et al., 2015), which tries to tackle machine learning problems (such as optimization and classification) in a completely automated fashion. Within our work, we extend the idea of automated machine learning with an additional assessment of the trained algorithm selection models and the analysis of the included ELA features. This allows for a better understanding of (a) the algorithm selectors, (b) the differences between the different optimization algorithms, and (c) the considered problem instances.

A more detailed overview of Exploratory Landscape Analysis, as well as an introduction into flacco—an extensive R toolbox for computing a variety of such landscape features enhanced by a graphical user interface—is given in Section 2. In Section 3, we give more insights into the COCO platform and afterwards describe our experimental setup, including the generation of the considered algorithm portfolio, in Section 4. An analysis of our found algorithm selection models is given in Section 5, and Section 6 concludes our work.

## 2 Exploratory Landscape Analysis

While problem-dependent (landscape) features can in general be computed for any optimization problem (e.g., Mersmann et al., 2013; Hutter et al., 2014; Ochoa et al., 2014; Pihera and Musliu, 2014; Daolio et al., 2017), we will consider only single-objective, continuous optimization problems within this work.

For this domain, Mersmann et al. (2011) introduced a sophisticated approach for characterizing a problem's landscape by means of numerical feature values and called it *Exploratory Landscape Analysis*. Within their work, they designed a total of 50 numerical measures and grouped them into six categories of so-called “low-level” features, which aim at numerically characterizing the landscape properties of a problem to be solved. The individual features are based on systematic sampling of the decision space by for example, using a latin hypercube design. A wide range of mathematical and statistical measures, which can be calculated based on relatively few function evaluations, are included, and the respective composition of the features reflects the desired overall problem characteristics. Six low-level feature classes were introduced, that is, measures related to the distribution of the objective function values (*y*-*Distribution*), estimating meta-models such as linear or quadratic regression models on the sampled data (*Meta-Model*) and the level of convexity (*Convexity*). Furthermore, local searches are conducted starting at the initial design points (*Local Search*), the relative position of each objective value compared to the median of all values is investigated (*Levelset*), and numerical approximations of the gradient or the Hessian represent the class of curvature features (*Curvature*). Each class comprises a set of sub-features which result from the same experimental data generated from the initial design. Figure 2 visualizes the assumed main relationships between the low-level feature classes and so-called “high-level” properties (Mersmann et al., 2015), such as the degree of multimodality, the separability, the basin size homogeneity or the number of plateaus. However, given that these properties (a) require expert knowledge and as a consequence cannot be computed automatically, and (b) are categorical and thus, make it for instance impossible to distinguish problems by their class of multimodality (none, low, medium, high), the introduction of the low-level features can be seen as a major step towards automatically computable landscape features and hence automated algorithm selection.

### 2.1 Related Work

Already in the years before the term ELA was introduced, researchers have tried to characterize a problem's landscape by numerical values: Jones and Forrest (1995) assessed a problem's difficulty by means of a fitness distance correlation, Lunacek and Whitley (2006) introduced a dispersion metric, Malan and Engelbrecht (2009) quantified the landscape's ruggedness, and Müller and Sbalzarini (2011) performed fitness-distance analyses.

However, after Bischl et al. (2012) showed the potential of landscape analysis by using ELA features for algorithm selection, a manifold of new landscape features has been introduced. Abell et al. (2013) used hill climbing characteristics, Muñoz Acosta, Kirley et al. (2015) measured a landscape's information content, and Morgan and Gallagher (2015) analyzed the problems with the help of length scale features. More recently, Malan et al. (2015) characterized constrained optimization problems, and Shirakawa and Nagao (2016) introduced an entire bag of local landscape features.

In other works, research groups, which also include this article's authors, have designed features based on a discretization of the continuous search space into a grid of cells (Kerschke et al., 2014), and successfully employed the nearest better clustering approach for distinguishing funnel-shaped landscapes with a global structure from problems, whose local optima are aligned in a more random manner (Kerschke et al., 2015). This particularly facilitates the decision of the class of optimization algorithms that suits best for the problem at hand. We also showed that even low budgets of $50\xd7d$ observations (*d* being the problem dimensionality)—that is, a sample size that is close to the size of an evolutionary algorithm's initial population—is sufficient for such a distinction (Kerschke, Preuss et al. 2016). In consequence, the evaluation of this initial sample, which is required for the landscape features, would come without any additional costs, given that the evolutionary algorithm would have to evaluate those points anyway.

### 2.2 Flacco

In the previous subsection, we have provided an overview of numerous landscape features. Unfortunately, those features were usually, if available at all, implemented in different programming languages, such as Python (VanRossum and The Python Development Team, 2015), Matlab (MATLAB, 2013), or R (R Core Team, 2017), making it extremely complicated to use all of them within a single experiment. This obstacle has been solved (for R-users) with the development of flacco (Kerschke, 2017b), an R-package for **f**eature-based **l**andscape-**a**nalysis of **c**ontinuous and **c**onstrained **o**ptimization problems. The package (currently) provides a collection of more than 300 landscape features (including the majority of the ones from above), distributed across a total of 17 different feature sets. In addition, the package comes with several visualization techniques, which should help to facilitate the understanding of the underlying problem landscapes (Kerschke and Trautmann, 2016). One can either use the package's stable release from CRAN^{1} or its developmental version from GitHub.^{2} Note that the latter also provides further information on the usage of flacco, as well as a link to its online-tutorial.^{3}

Being aware of possible obstacles for non-R-users, Hanster and Kerschke (2017) recently developed a web-hosted and hence, platform-independent, graphical user interface (GUI)^{4} of flacco. A screenshot of that GUI is displayed in Figure 3. The GUI provides a slim, and thus user-friendly, version of flacco. In its left panel (highlighted in grey), the user needs to provide information on the function that should be optimized, either by selecting one of the single-objective optimization problems available in the R-package smoof (Bossek, 2017), configuring one of the BBOB-functions, manually defining a function, or by uploading an evaluated set of points. On the right side, one then can either compute (and download) a specific feature set or visualize certain aspects of the optimization problem and/or its features.

## 3 Exploratory Data Analysis of COCO

Instead of executing the optimization algorithms ourselves, we use the results from COCO^{5} (Hansen et al., 2016), which is a platform for **CO**mparing **C**ontinuous **O**ptimization algorithms on the Black-Box Optimization Benchmark (BBOB, Hansen et al., 2009; Finck et al., 2010). The platform provides a collection of the performance results of 129 optimization algorithms, which have been submitted to the BBOB workshops at the GECCO conference^{6} of the years 2009, 2010, 2012, 2013, and 2015. Using the results published on this platform comes with the advantage that over the years basically all important classes of optimization algorithms, including current state-of-the art optimizers, have been submitted. Thus, experiments are of representative character in a comprehensive setting.

### 3.1 General Setup of the COCO-Platform

The competition settings were quite similar across the years: per dimension *d*$\u2208$ {2, 3, 5, 10, 20, 40} and function $FID\u2208{1,\u2026,24}$, the participants had to submit the results for a total of 15 problem instances. As shown below, only five instances (IIDs 1 to 5) were part of *each* competition, while the remaining ten instances changed per year:

2009: IIDs 1 to 5 (with 3 replications each)

2010: IIDs 1 to 15 (1 replication each)

2012: IIDs 1 to 5, 21 to 30 (1 replication each)

2013: IIDs 1 to 5, 31 to 40 (1 replication each)

2015: IIDs 1 to 5, 41 to 50 (1 replication each)

*successful*to approximate the (known) global optimum of instance $i\u2208I$ up to a precision value $\u025b\u2208{101,100,\u2026,10-7}$ and also, how many function evaluations $FEi(\u025b)$ were performed until the solver (un)successfully terminated. Here, a solver is

*successful*with respect to the precision threshold $\u025b$; that is, $Successi(\u025b)=1$, if it found a solution $x*\u2208[-5,+5]d$, whose fitness value $f(x*)$ lies within $[f(xopt),f(xopt)+\u025b]$. Note that this definition implies that the found solution $x*$ does not necessarily have to be located in the (decision space's) close neighborhood of the global optimum $xopt$, as

*success*is exclusively defined via the proximity in the objective space. Then, $Successi(\u025b)$ and $FEi(\u025b)$ were used to compute the solver's

*Expected Runtime*(ERT, Hansen et al., 2009):

Although the aforementioned setup comes with some drawbacks, we will still use it as it has been well-established in the community and thus allows for wide comparability. Nevertheless, we would like to address its pitfalls as well: (a) Given the strong variability across the instances' objective values, it is not straightforward to use a fixed *absolute* precision value for comparing the solvers across all instances. Instead, it might be more reasonable to use a relative threshold. (b) Computing the ERT based on different instances, even if they belong to the same BBOB problem, is very questionable as the instances can have completely different landscapes: if the original instance is a highly multimodal function, the transformed instance^{7} could, if at all, consist of only a few peaks. Therefore, we strongly encourage to ask for several solver runs on the same instance in future setups, i.e., perform (at least five to ten) replicated runs, and then evaluate the performance results per instance rather than per function allowing for ERT computations on function *and* instance level.

### 3.2 Preprocessing COCO's Solver Performances

We will use the performance results of the 129 optimization algorithms available at COCO.^{8} However, in order to work with a valid data base, we first performed some sanity checks on the platform's data before combining it in a joint data base. For this purpose, we first combined the algorithms' performance results in a joint data base and afterwards checked whether they fulfill each competition's requirements regarding the required number of runs on the correct instances.

While the submitted results of all 29 (2009) and 23 (2010) solvers of the first two competitions passed these checks, in the following years, only 22 of 28 (2012), 16 of 31 (2013), and 13 of 18 (2015) submissions did so. The invalid submissions *partially* used settings of the previous years. However, in order to use the most general portfolio of solvers, we only considered the submissions for IIDs 1 to 5 with only one run per instance, as this is the only set of instances that was used across all five BBOB competitions. Fortunately, the performances of all 129 solvers could be considered for our experiments, because even the problematic submissions from above had valid data for the first five instances.

## 4 Experimental Setup

### 4.1 Algorithm Performance Data

For each of the 129 solvers we computed the ERT per tuple of problem dimension $d\u2208{2,3,5,10}$, BBOB problem respectively function $FID\u2208{1,\u2026,24}$ and problem instance $IID\u2208{1,\u2026,5}$ (if multiple runs exist, we only considered the first run), resulting in a total of 61 920 observations. The ERTs were computed for a precision threshold of $\u025b=10-2$ because smaller values led to too many unsuccessful runs. Even for this chosen precision, only approximately $67%$ of all (considered) runs terminated successfully. For better comparability across the different problems, a standardized version of the ERT (the *relative ERT*) will be used (see Section 4.6).

### 4.2 Instance Feature Data

Each of the ELA feature sets was computed using a so-called *improved Latin hypercube design* (Beachkofski and Grandhi, 2002) consisting of $50\xd7d$ observations, which were sampled across the decision space, that is, $[-5,+5]d$. The feature sets were then computed using the R-package flacco (Kerschke, 2017b) for all four problem dimensions, 24 BBOB problems and five problem instances that were used by the performance data (see Section 4.1). For each of these 480 problems, we calculated the six “classical” ELA feature sets from Mersmann et al. (2011) (convexity, curvature, levelset, local search, meta-model, and *y*-distribution), as well as the basic, (cell mapping) angle^{9} (Kerschke et al., 2014), dispersion (Lunacek and Whitley, 2006), information content (Muñoz Acosta, Kirley et al. 2015), nearest better clustering (Kerschke et al., 2015), and principal component features, resulting in a total of 102 features per problem instance.

Although being conscious of the resulting information loss, we aggregate each feature across the five problem instances (per BBOB problem) via the median of the respective feature values, in order to map our feature data to the 96 observations (24 problems, four dimensions) of the performance data.

### 4.3 Constructing the Algorithm Portfolio

For meaningfully addressing the algorithm selection task, the choice of the underlying algorithm portfolio is crucial. Ideally, the considered set should be as small and as complementary as possible and should include state-of-the art optimizers. For this purpose, we ranked the solvers per considered BBOB problem based on ERT performance. We then constructed four solver sets (one per dimension), each of them containing the solvers that ranked within the “Top 3” of at least one of the 24 functions of the respective dimension. Based on these four solver sets—that is, one per considered problem dimension, and each of them consisting of 37 to 41 solvers—a portfolio of 12 solvers was constructed by only considering optimizers that belonged to *each* of the four sets.

The 12 optimization algorithms from the found portfolio can be grouped into four categories and are summarized below.

#### 4.3.1 Deterministic Optimization Algorithms (2)

The two solvers of this category are variants of the Brent-STEP algorithm^{10} (Baudiš and Pošík, 2015). It performs axis-parallel searches and chooses the next iteration's search dimension either using a round-robin (**BSrr**, Pošík and Baudiš, 2015) or a quadratic interpolation strategy (**BSqi**, Pošík and Baudiš, 2015).

#### 4.3.2 Multi-Level Approaches (5)

The origin of most solvers belonging to this category is the *multi-level single linkage* method (**MLSL**, Pál, 2013; Rinnooy Kan and Timmer, 1987). It is a stochastic, multistart, global optimizer that relies on random sampling and local searches. Aside from MLSL itself, some of its variants also belong to our portfolio: an interior-point version for constrained nonlinear problems (**fmincon**, Pál, 2013), a quasi-Newton version, which approximates the Hessian using BFGS (Broyden, 1970) (**fminunc**, Pál, 2013), and a hybrid variant whose most important improvements are related to its sampling phase (**HMLSL**, Pál, 2013). The final optimizer belonging to this group is the *multilevel coordinate search* (**MCS**, Huyer and Neumaier, 2009), which splits the search space into smaller boxes, each containing a known observation, and then starts local searches from promising boxes.

#### 4.3.3 Variants of the CMA-ES (4)

Hansen and Ostermeier (2001) introduced one of the most popular evolution strategies: the *Covariance Matrix Adaption Evolution Strategy (CMA-ES)* with cumulative step-size adaptation (**CMA-CSA**, Atamna, 2015). It led to a plethora of variants (van Rijn et al., 2016), including the following three solvers from our portfolio: (1) **IPOP400D** (Auger et al., 2013), a restart version of the CMA-ES with an increasing population size (IPOP-CMA-ES, Auger and Hansen, 2005) and a maximum of $400\xd7(d+2)$ function evaluations. (2) A hybrid CMA (**HCMA**, Loshchilov et al., 2013a), which combines a *bi-population self-adaptive surrogate-assisted CMA-ES*^{11} (BIPOP-$s*$aACM-ES-k, Loshchilov et al., 2013b), STEP (Swarzberg et al., 1994) and NEWUOA (Powell, 2006) to benefit from surrogate models and line searches simultaneously. (3) A *sequential, model-based algorithm configuration* (SMAC, Hutter et al., 2011) procedure applied to the BBOB problems (**SMAC-BBOB**, Hutter et al., 2013). It uses Gaussian processes (GP, Rasmussen and Williams, 2006) to model the the *expected improvement function* and then performs one run of DIRECT (Jones et al., 1993) (with $10\xd7d$ evaluations) and ten runs of the classical CMA-ES (Hansen and Ostermeier, 2001) (with $100\xd7d$ evaluations) on the expected improvement function.

#### 4.3.4 Others (1)

The final optimizer from our portfolio is called OptQuest/NLP (**OQNLP**, Pál, 2013; Ugray et al., 2007). It is a commercial, heuristic, multistart algorithm that was designed to find the global optima of smooth constrained nonlinear programs (NLPs) and mixed integer nonlinear programs (MINLPs). The algorithm uses the *OptQuest Callable Library* (OCL, Laguna and Martí, 2003) to generate candidate starting points for a local NLP solver.

### 4.4 Machine Learning Algorithms

We considered three classes of supervised learning strategies for training our algorithm selection models: (1) A *classification* approach, which simply tries to predict the best-performing optimizer^{12} and hence, completely ignores the magnitude of performance differences between the best and the remaining portfolio solvers. (2) A *regression* approach, which trains a separate model for the performances of each optimizer and afterwards predicts the solver with the best predicted performance. (3) In addition to these well-known strategies, we also considered the so-called *pairwise regression*, which led to promising results in other works (e.g., Kotthoff et al., 2015; Kerschke et al., 2017). In contrast to modeling the performances straightforwardly (as in (2)), it models the performance differences for each solver pair and afterwards predicts the solver whose predicted performance difference was the highest, compared to all other solvers.

The algorithm selectors were trained in R (R Core Team, 2017) using the R-package mlr (Bischl, Lang et al., 2016). For each class of the considered supervised learning approaches (i.e., classification, regression and paired regression), we used recursive partitioning and regression trees (rpart, Therneau et al., 2017), kernel-based support vector machines (ksvm, Karatzoglou et al., 2004), random forests (randomForest, Liaw and Wiener, 2002), and extreme gradient boosting (xgboost, Chen et al., 2017). Additionally, we also tried multivariate adaptive regression splines (mars, Leisch et al., 2016) in case of the (paired) regression approaches.

Note that the SVM's inverse kernel width sigma was the only hyperparameter that was (automatically) configured using the sigest function from the R-package kernlab (Karatzoglou et al., 2004). All other hyperparameters were used in their default settings: the SVMs used Gaussian radial basis kernels and the random forests were constructed using 500 trees, whose split points were sampled random uniformly of $\u230ap\u230b$ (classification) or $max{\u230ap/3\u230b,1}$ (regression / paired regression) features with *p* being the data set's number of (ELA) features.

### 4.5 Feature Selection Strategies

Each of the aforementioned 14 algorithm selectors (see Section 4.4), that is, 4 using a classification-based approach, 5 using regression, and 5 using paired regression, are initially trained using the set of all 102 features (see Section 4.2). However, using all of the features simultaneously likely causes lots of noise and/or redundancy, which could lead to poorly performing algorithm selectors. Furthermore, some of the feature sets, namely, the convexity, curvature and local search features from the classical ELA features (Mersmann et al., 2011), require additional function evaluations on top of the costs for the initial design. In order to overcome these obstacles, we used the following four feature selection strategies—all of them are implemented in mlr—to train further algorithm selectors. The resulting 56 additional models (four feature selection strategies times 14 machine learning algorithms) consist of smaller and likely much more meaningful feature subsets, which on the one hand lead to better performing selectors and on the other hand improve the interpretability of the trained models.

*Greedy forward-backward selection (sffs):*This feature selection strategy starts with an empty feature set and iteratively alternates between greedily adding and/or removing features as long as the model's performance improves.*Greedy backward-forward selection (sfbs):*Analogously to sffs, this approach greedily adds and removes features per iteration. However, in contrast to the previous method, sfbs starts with the full set of all 102 features and afterwards alternates between removing and adding features.$(10+5)$

*-GA:*A genetic algorithm (GA, see e.g., Eiben and Smith, 2015) using mlr's default values for the population ($\mu =10$) and offspring ($\lambda =5$) sizes is used. For this approach, the selected features are represented as a 102-dimensional bit string, where a value of 1 at position*k*implies that the*k*-th feature is selected. The GA runs for a maximum of 100 generations and selects the features by performing random bit flips, with a (default) mutation rate of $5%$ and a crossover rate of $50%$.$(10+50)$

*-GA:*A slightly modified version of the previous GA, which uses the tenfold of offsprings (50 instead of 5) per generation in order to increase the selection pressure within the GA's evolutionary loop.

### 4.6 Performance Assessment

In lieu of using the ERT itself, we will use the *relative ERT (relERT)* for our experiments. While the former strongly biases the algorithm selectors towards multimodal and higher-dimensional problems due to much larger amounts of used function evaluations, the relERTs, which are also used within the BBOB competitions, allow a fair comparison of the solvers across the problems and dimensions by normalizing each solver's ERT with the ERT of the best solver for the respective BBOB problem (of that dimension). Instead of scaling each performance with the respective best ERT of all 129 solvers, we used the best performance from the 12 solvers of the considered portfolio as this is our set of interest in this study.

As some solvers did not even solve a single instance from some of the BBOB functions, the respective ERTs and relERTs were not defined. We thus imputed the missing relERTs using the well-known PAR10 score (e.g., Bischl, Kerschke et al., 2016). That is, each of the problematic values is replaced with a penalty value (36 690.3) that is the tenfold of the highest valid relERT; for all other values, the respective (finite) relERTs are used.

For each of the supervised learning approaches (see Section 4.4), the algorithm selectors are evaluated using the *mean relative ERT*, which averages the relERTs of the predictions (including the costs for the initial design) on the corresponding test data, i.e., a subset of the BBOB problems. In order to obtain more realistic and reliable estimates of the selectors' performances, they were assessed using leave-one-(function)-out cross-validation. That is, per algorithm selector we train a total of 96 submodels (24 BBOB problems in four problem dimensions each). Each of them uses only 95 of the 96 BBOB problems for training and the remaining one for testing. Note that each problem was used exactly once as test data. As a consequence, within each iteration/fold of the leave-one-(function)-out cross-validation, exactly one problem (i.e., the respective test data) is completely kept out of the modeling phase and only used for assessing the respective submodel's performance. The average of the resulting 96 relERTs is then used as our algorithm selector's performance.

Following common practices in algorithm selection (e.g., Bischl, Kerschke et al. 2016), we compare the performances of our algorithm selectors with two baselines: the *virtual best solver* (VBS) and the *single best solver* (SBS). The virtual best solver, sometimes also called *oracle* or *perfect selector*, provides a lower bound for the selectors as it shows the performance that one *could* (theoretically) achieve on the data, when always selecting the best performing algorithm per instance. Given that the relERT has a lower bound of 1 and that at least one solver achieves that perfect performance per instance, the VBS has to be 1 per definition. Nevertheless, it is quite obvious that algorithm selectors usually do not reach such an idealistic performance given their usually imperfect selections and/or the influence of the costs for computing the landscape features.

The second baseline, the SBS, represents the (aggregated) performance of the (single) best solver from the portfolio. Consequently, this baseline is much more important than the VBS, because an algorithm selector is only useful if its performance (including feature costs) is better than the SBS. Note that in principle, the SBS could either be determined on the entire data set or, for (leave-one-out) cross-validation, be computed per fold. However, within our experiments, both approaches result in identical performances.

### 4.7 Overview of the Interlinks Between Our Framework's Building Blocks

In the previous sections, we have outlined the separate building blocks for our considered algorithm selection framework. The connections and therefore the interactions between these blocks are displayed in Figure 4.

The benchmarked algorithm performances (Sections 4.1 and 4.3; visualized by means of a green box in the bottom left of Figure 4) and the corresponding automatically computable and problem-specific landscape features (Section 4.2; blue box in the top left) are used by different machine learning algorithms (Section 4.4; red box in the center) for training several possible algorithm selection models. Their performances are then assessed (Section 4.6; yellow in the bottom right) and in turn used as feedback mechanism to improve the previously trained algorithm selection models, e.g., by selecting more informative feature subsets (Section 4.5).

As shown within the grey area of Figure 4, further feedback mechanisms, such as tuning the considered machine learning techniques, are also an option and in general very promising. However, in this first study of its kind (within the domain of single-objective continuous optimization), we rely on the default configurations of the machine learning algorithms (except for one very important parameter of the SVM as detailed in Section 4.4), which already provides lots of interesting and promising results – as will be described in Section 5. The quite complex issue of optimally configuring the machine learning algorithms will be addressed in future work.

## 5 Results

### 5.1 Analyzing the Algorithm Portfolio

Here, we examine the constructed portfolio to get more insights into the data provided by COCO. In a first step, we compare the best performances of the portfolio solvers to the best performances of all 129 solvers. For 51 of the 96 considered BBOB functions over the dimensions, the portfolio's VBS is identical to the VBS of all COCO-solvers.

The violin plot on the right side of Figure 5, which can be understood as a boxplot, whose shape represents a kernel density estimation of the underlying observations, indicates that the ratios between the best ERTs from the portfolio and all COCO-solvers (per BBOB problem) is usually close to one. More precisely, only ten BBOB problems, depicted by black dots in the boxplot of Figure 5, have a ratio greater than two, and only three of them—the 2D-version of *Schaffers F7 Function* (FID 17, Finck et al., 2010), as well as the 5D- and 10D-versions of the *Lunacek bi-Rastrigin Function* (FID 24, Finck et al., 2010)—have ratios worse than three. Consequently, we can conclude that *if* we find a well-performing algorithm selector, its suggested optimization algorithm likely requires less than twice the number of function evaluations compared to the best algorithm from the entire COCO platform.

As mentioned in Section 4.6, we had to impute some of the performances, because not all optimizers solved at least one instance per BBOB problem. More precisely, HCMA was the only one to find at least one valid solution for all 96 problems, followed by HMLSL (88), CMA-CSA (87), OQNLP (72), fmincon (71), MLSL (69), fminunc (68), IPOP400D and MCS (67), BSqi (52), BSrr (50), and SMAC-BBOB (18). Hence, the mean relERT of HCMA (30.37) was by far the lowest of all considered solvers, making this solver the clear SBS.

However, as shown in Table 1, HCMA is not superior to the remaining solvers across all problems. Independent of the problem dimensions, the Brent-STEP variants BSqi and BSrr outperform the other optimizers on the separable problems (FIDs 1 to 5). Similarly, the multilevel methods MLSL, fmincon and HMLSL are superior on the unimodal functions with high conditioning (FIDs 10 to 14), and the multimodal problems with adequate global structure (FIDs 15 to 19) are solved fastest by MCS (in 2D) and CMA-CSA (3D to 10D). The remaining problems, i.e., functions with low or moderate conditioning (FIDs 6 to 9), as well as multimodal functions with a weak global structure (FIDs 20 to 24) do not show such obvious patterns. Nevertheless, one can see that HCMA, HMLSL, OQNLP and sometimes also MLSL perform quite promising on these function groups.

Relative Expected Runtime of the 12 Solvers from the Considered Algorithm Portfolio and the 2 Presented Algorithm Selection-Models | |||||||||||||||

Dim | BBOB-Group | BSqi | BSrr | CMA-CSA | fmincon | fminunc | HCMA | HMLSL | IPOP-400D | MCS | MLSL | OQNLP | SMAC-BBOB | AS-Model | |

# 1 | # 2 | ||||||||||||||

F1 - F5 | 1.2 | 1.3 | 54.8 | 11.0 | 11.8 | 3.7 | 14.6 | 18.4 | 5.8 | 15.5 | 17.0 | 22 014.9 | 16.6 | 20.3 | |

F6 - F9 | 18 516.7 | 9 708.2 | 7.4 | 18.6 | 19.2 | 5.8 | 1.7 | 5.7 | 11.3 | 24.2 | 1.5 | 27 518.6 | 3.1 | 3.5 | |

2 | F10 - F14 | 7 649.2 | 7 481.5 | 8.3 | 1.0 | 62.7 | 6.3 | 1.0 | 10.7 | 322.7 | 1.0 | 4.9 | 29 353.2 | 4.7 | 4.0 |

F15 - F19 | 7 406.6 | 14 710.3 | 14.7 | 7 392.0 | 7 367.7 | 25.3 | 8.1 | 15.5 | 7.7 | 7 391.7 | 7 351.2 | 29 354.8 | 26.2 | 10.1 | |

F20 - F24 | 84.8 | 14 768.5 | 7 351.9 | 4.1 | 14.5 | 44.9 | 3.9 | 14 679.3 | 11.4 | 2.1 | 2.7 | 22 014.6 | 42.5 | 3.0 | |

all | 6 240.7 | 9 318.4 | 1 549.1 | 1 546.5 | 1 556.7 | 17.7 | 6.0 | 3 068.4 | 74.3 | 1 547.9 | 1 536.9 | 25 990.1 | 19.3 | 8.4 | |

F1 - F5 | 1.3 | 1.3 | 7 367.9 | 85.2 | 132.1 | 356.1 | 6.8 | 14 686.6 | 45.9 | 55.9 | 7 347.6 | 22 015.1 | 58.4 | 94.9 | |

F6 - F9 | 331.2 | 9 527.4 | 4.7 | 38.5 | 9 173.7 | 4.5 | 1.9 | 6.5 | 31.4 | 9 173.4 | 2.5 | 36 690.3 | 3.3 | 39.9 | |

3 | F10 - F14 | 29 356.3 | 14 712.1 | 8.9 | 1.0 | 4.1 | 5.0 | 1.0 | 12.3 | 8 132.7 | 1.0 | 9.3 | 29 353.4 | 4.8 | 3.6 |

F15 - F19 | 14 698.2 | 22 026.2 | 1.6 | 14 701.2 | 14 699.5 | 2.6 | 11.4 | 7 339.4 | 7 346.9 | 14 700.0 | 14 686.2 | 36 690.3 | 2.8 | 7.1 | |

F20 - F24 | 14 741.8 | 14 758.7 | 7 389.4 | 7 339.6 | 14 677.4 | 66.8 | 2.3 | 22 015.1 | 7 342.4 | 7 339.8 | 1.9 | 22 014.8 | 67.0 | 3.4 | |

all | 12 304.7 | 12 316.7 | 3 077.4 | 4 616.2 | 7 677.5 | 90.4 | 4.8 | 9 178.9 | 4 769.4 | 6 132.4 | 4 593.1 | 29 047.1 | 28.3 | 29.4 | |

F1 - F5 | 1.4 | 1.4 | 7 533.6 | 14 678.4 | 14 679.2 | 12.0 | 17.5 | 14 688.7 | 14 678.1 | 14 678.5 | 14 678.0 | 22 015.1 | 22.7 | 22.9 | |

F6 - F9 | 27 597.4 | 36 690.3 | 5.6 | 9 173.5 | 9 173.8 | 3.9 | 2.4 | 4.9 | 28.8 | 9 173.4 | 9 173.5 | 36 690.3 | 4.8 | 4.8 | |

5 | F10 - F14 | 22 032.8 | 29 360.3 | 8.9 | 1.0 | 11.9 | 4.2 | 1.0 | 13.6 | 22 019.2 | 1.0 | 10.7 | 36 690.3 | 5.2 | 5.2 |

F15 - F19 | 36 690.3 | 36 690.3 | 3.1 | 36 690.3 | 36 690.3 | 4.3 | 7 346.1 | 29 352.5 | 36 690.3 | 36 690.3 | 29 352.5 | 36 690.3 | 4.4 | 4.4 | |

F20 - F24 | 22 053.6 | 22 050.8 | 7 400.0 | 14 678.9 | 22 014.9 | 7.7 | 7 339.8 | 22 017.4 | 14 681.0 | 22 015.0 | 14 676.8 | 22 014.9 | 7.8 | 7.8 | |

all | 21 428.3 | 24 469.8 | 3 114.6 | 15 289.0 | 16 819.9 | 6.5 | 3 063.8 | 13 765.8 | 18 352.4 | 16 817.4 | 13 761.8 | 30 575.6 | 9.1 | 9.2 | |

F1 - F5 | 1.6 | 1.6 | 14 691.0 | 14 679.9 | 14 682.7 | 2.7 | 7 365.5 | 14 698.8 | 14 680.0 | 14 679.9 | 14 678.3 | 22 015.7 | 16.3 | 16.3 | |

F6 - F9 | 36 690.3 | 27 563.9 | 4.3 | 9 173.4 | 9 173.8 | 2.2 | 4.1 | 9 181.9 | 9 188.1 | 9 173.4 | 9 173.9 | 36 690.3 | 2.7 | 2.7 | |

10 | F10 - F14 | 29 359.3 | 29 359.8 | 8.4 | 1.1 | 15.4 | 2.8 | 1.1 | 7 352.5 | 22 018.7 | 1.1 | 12.0 | 36 690.3 | 3.7 | 3.7 |

F15 - F19 | 36 690.3 | 36 690.3 | 1.7 | 36 690.3 | 36 690.3 | 2.0 | 22 028.5 | 29 352.5 | 36 690.3 | 36 690.3 | 36 690.3 | 36 690.3 | 2.1 | 2.1 | |

F20 - F24 | 36 690.3 | 29 367.0 | 14 685.9 | 22 015.2 | 22 015.0 | 23.6 | 14 677.1 | 29 352.8 | 22 018.9 | 22 014.6 | 22 014.9 | 36 690.3 | 23.7 | 23.7 | |

all | 27 519.5 | 24 472.9 | 6 123.0 | 16 817.8 | 16 821.3 | 6.9 | 9 182.4 | 18 354.6 | 21 408.0 | 16 817.6 | 16 819.7 | 33 633.1 | 10.0 | 10.0 | |

F1 - F5 | 1.4 | 1.4 | 7 411.8 | 7 363.6 | 7 376.5 | 93.6 | 1 851.1 | 11 023.1 | 7 352.4 | 7 357.4 | 9 180.2 | 22 015.2 | 28.5 | 38.6 | |

F6 - F9 | 20 783.9 | 20 872.4 | 5.5 | 4 601.0 | 6 885.1 | 4.1 | 2.5 | 2 299.8 | 2 314.9 | 6 886.1 | 4 587.9 | 34 397.4 | 3.5 | 12.7 | |

all | F10 - F14 | 22 099.4 | 20 228.4 | 8.7 | 1.0 | 23.5 | 4.6 | 1.0 | 1 847.3 | 13 123.3 | 1.0 | 9.3 | 33 021.8 | 4.6 | 4.1 |

F15 - F19 | 23 871.3 | 27 529.3 | 5.2 | 23 868.5 | 23 861.9 | 8.6 | 7 348.5 | 16 515.0 | 20 183.8 | 23 868.1 | 22 020.0 | 34 856.4 | 8.9 | 5.9 | |

F20 - F24 | 18 392.6 | 20 236.3 | 9 206.8 | 11 009.4 | 14 680.5 | 35.8 | 5 505.8 | 22 016.2 | 11 013.4 | 12 842.9 | 9 174.1 | 25 683.7 | 35.3 | 9.5 | |

all | 16 873.3 | 17 644.5 | 3 466.0 | 9 567.4 | 10 718.9 | 30.4 | 3 064.3 | 11 091.9 | 11 151.0 | 10 328.8 | 9 177.9 | 29 811.5 | 16.7 | 14.2 |

Relative Expected Runtime of the 12 Solvers from the Considered Algorithm Portfolio and the 2 Presented Algorithm Selection-Models | |||||||||||||||

Dim | BBOB-Group | BSqi | BSrr | CMA-CSA | fmincon | fminunc | HCMA | HMLSL | IPOP-400D | MCS | MLSL | OQNLP | SMAC-BBOB | AS-Model | |

# 1 | # 2 | ||||||||||||||

F1 - F5 | 1.2 | 1.3 | 54.8 | 11.0 | 11.8 | 3.7 | 14.6 | 18.4 | 5.8 | 15.5 | 17.0 | 22 014.9 | 16.6 | 20.3 | |

F6 - F9 | 18 516.7 | 9 708.2 | 7.4 | 18.6 | 19.2 | 5.8 | 1.7 | 5.7 | 11.3 | 24.2 | 1.5 | 27 518.6 | 3.1 | 3.5 | |

2 | F10 - F14 | 7 649.2 | 7 481.5 | 8.3 | 1.0 | 62.7 | 6.3 | 1.0 | 10.7 | 322.7 | 1.0 | 4.9 | 29 353.2 | 4.7 | 4.0 |

F15 - F19 | 7 406.6 | 14 710.3 | 14.7 | 7 392.0 | 7 367.7 | 25.3 | 8.1 | 15.5 | 7.7 | 7 391.7 | 7 351.2 | 29 354.8 | 26.2 | 10.1 | |

F20 - F24 | 84.8 | 14 768.5 | 7 351.9 | 4.1 | 14.5 | 44.9 | 3.9 | 14 679.3 | 11.4 | 2.1 | 2.7 | 22 014.6 | 42.5 | 3.0 | |

all | 6 240.7 | 9 318.4 | 1 549.1 | 1 546.5 | 1 556.7 | 17.7 | 6.0 | 3 068.4 | 74.3 | 1 547.9 | 1 536.9 | 25 990.1 | 19.3 | 8.4 | |

F1 - F5 | 1.3 | 1.3 | 7 367.9 | 85.2 | 132.1 | 356.1 | 6.8 | 14 686.6 | 45.9 | 55.9 | 7 347.6 | 22 015.1 | 58.4 | 94.9 | |

F6 - F9 | 331.2 | 9 527.4 | 4.7 | 38.5 | 9 173.7 | 4.5 | 1.9 | 6.5 | 31.4 | 9 173.4 | 2.5 | 36 690.3 | 3.3 | 39.9 | |

3 | F10 - F14 | 29 356.3 | 14 712.1 | 8.9 | 1.0 | 4.1 | 5.0 | 1.0 | 12.3 | 8 132.7 | 1.0 | 9.3 | 29 353.4 | 4.8 | 3.6 |

F15 - F19 | 14 698.2 | 22 026.2 | 1.6 | 14 701.2 | 14 699.5 | 2.6 | 11.4 | 7 339.4 | 7 346.9 | 14 700.0 | 14 686.2 | 36 690.3 | 2.8 | 7.1 | |

F20 - F24 | 14 741.8 | 14 758.7 | 7 389.4 | 7 339.6 | 14 677.4 | 66.8 | 2.3 | 22 015.1 | 7 342.4 | 7 339.8 | 1.9 | 22 014.8 | 67.0 | 3.4 | |

all | 12 304.7 | 12 316.7 | 3 077.4 | 4 616.2 | 7 677.5 | 90.4 | 4.8 | 9 178.9 | 4 769.4 | 6 132.4 | 4 593.1 | 29 047.1 | 28.3 | 29.4 | |

F1 - F5 | 1.4 | 1.4 | 7 533.6 | 14 678.4 | 14 679.2 | 12.0 | 17.5 | 14 688.7 | 14 678.1 | 14 678.5 | 14 678.0 | 22 015.1 | 22.7 | 22.9 | |

F6 - F9 | 27 597.4 | 36 690.3 | 5.6 | 9 173.5 | 9 173.8 | 3.9 | 2.4 | 4.9 | 28.8 | 9 173.4 | 9 173.5 | 36 690.3 | 4.8 | 4.8 | |

5 | F10 - F14 | 22 032.8 | 29 360.3 | 8.9 | 1.0 | 11.9 | 4.2 | 1.0 | 13.6 | 22 019.2 | 1.0 | 10.7 | 36 690.3 | 5.2 | 5.2 |

F15 - F19 | 36 690.3 | 36 690.3 | 3.1 | 36 690.3 | 36 690.3 | 4.3 | 7 346.1 | 29 352.5 | 36 690.3 | 36 690.3 | 29 352.5 | 36 690.3 | 4.4 | 4.4 | |

F20 - F24 | 22 053.6 | 22 050.8 | 7 400.0 | 14 678.9 | 22 014.9 | 7.7 | 7 339.8 | 22 017.4 | 14 681.0 | 22 015.0 | 14 676.8 | 22 014.9 | 7.8 | 7.8 | |

all | 21 428.3 | 24 469.8 | 3 114.6 | 15 289.0 | 16 819.9 | 6.5 | 3 063.8 | 13 765.8 | 18 352.4 | 16 817.4 | 13 761.8 | 30 575.6 | 9.1 | 9.2 | |

F1 - F5 | 1.6 | 1.6 | 14 691.0 | 14 679.9 | 14 682.7 | 2.7 | 7 365.5 | 14 698.8 | 14 680.0 | 14 679.9 | 14 678.3 | 22 015.7 | 16.3 | 16.3 | |

F6 - F9 | 36 690.3 | 27 563.9 | 4.3 | 9 173.4 | 9 173.8 | 2.2 | 4.1 | 9 181.9 | 9 188.1 | 9 173.4 | 9 173.9 | 36 690.3 | 2.7 | 2.7 | |

10 | F10 - F14 | 29 359.3 | 29 359.8 | 8.4 | 1.1 | 15.4 | 2.8 | 1.1 | 7 352.5 | 22 018.7 | 1.1 | 12.0 | 36 690.3 | 3.7 | 3.7 |

F15 - F19 | 36 690.3 | 36 690.3 | 1.7 | 36 690.3 | 36 690.3 | 2.0 | 22 028.5 | 29 352.5 | 36 690.3 | 36 690.3 | 36 690.3 | 36 690.3 | 2.1 | 2.1 | |

F20 - F24 | 36 690.3 | 29 367.0 | 14 685.9 | 22 015.2 | 22 015.0 | 23.6 | 14 677.1 | 29 352.8 | 22 018.9 | 22 014.6 | 22 014.9 | 36 690.3 | 23.7 | 23.7 | |

all | 27 519.5 | 24 472.9 | 6 123.0 | 16 817.8 | 16 821.3 | 6.9 | 9 182.4 | 18 354.6 | 21 408.0 | 16 817.6 | 16 819.7 | 33 633.1 | 10.0 | 10.0 | |

F1 - F5 | 1.4 | 1.4 | 7 411.8 | 7 363.6 | 7 376.5 | 93.6 | 1 851.1 | 11 023.1 | 7 352.4 | 7 357.4 | 9 180.2 | 22 015.2 | 28.5 | 38.6 | |

F6 - F9 | 20 783.9 | 20 872.4 | 5.5 | 4 601.0 | 6 885.1 | 4.1 | 2.5 | 2 299.8 | 2 314.9 | 6 886.1 | 4 587.9 | 34 397.4 | 3.5 | 12.7 | |

all | F10 - F14 | 22 099.4 | 20 228.4 | 8.7 | 1.0 | 23.5 | 4.6 | 1.0 | 1 847.3 | 13 123.3 | 1.0 | 9.3 | 33 021.8 | 4.6 | 4.1 |

F15 - F19 | 23 871.3 | 27 529.3 | 5.2 | 23 868.5 | 23 861.9 | 8.6 | 7 348.5 | 16 515.0 | 20 183.8 | 23 868.1 | 22 020.0 | 34 856.4 | 8.9 | 5.9 | |

F20 - F24 | 18 392.6 | 20 236.3 | 9 206.8 | 11 009.4 | 14 680.5 | 35.8 | 5 505.8 | 22 016.2 | 11 013.4 | 12 842.9 | 9 174.1 | 25 683.7 | 35.3 | 9.5 | |

all | 16 873.3 | 17 644.5 | 3 466.0 | 9 567.4 | 10 718.9 | 30.4 | 3 064.3 | 11 091.9 | 11 151.0 | 10 328.8 | 9 177.9 | 29 811.5 | 16.7 | 14.2 |

Interestingly, several solvers had extremely high PAR10-scores for multiple BBOB-groups. The most obvious one is SMAC-BBOB, whose relative ERT (aggregated across the instances of a BBOB group) was higher than 10 000 on all (!) BBOB groups. However, this finding is plausible: although this optimization algorithm belongs to the “Top 3” of at least one problem per dimension (according to the setup of the portfolio, see Section 4.3) and therefore to the “Top 3” of at least four problems across the entire benchmark, it only managed to solve 18 of the considered 96 problem instances (up to the required precision of $\u025b=10-2$). This in turn means that it failed on more than 80% of all instances and thus the corresponding performances were set to the penalty value (36 690.3, see Section 4.6). Based on the fact that Table 1 only displays performances that were aggregated across at least four problems each, SMAC-BBOB's scores look that extreme—and consequently might be misleading to a certain extent.

A further finding of Table 1 is that HCMA often is inferior to the other 11 solvers. This clearly indicates that the data contain sufficient potential for improving over the SBS, for example, with a reasonable algorithm selector. This is supported by the small amount of points located exactly on the diagonal of the scatterplots in Figure 6, which depict the ERT-pairs of the two baseline algorithms—the idealistic VBS and the SBS, that is, the best performing solver from the portfolio (HCMA)—for all 24 BBOB problems and per problem dimension. Noticeably, though not very surprising, the VBS and SBS always have higher ERTs on the multimodal problems (FIDs 15 to 24) than on the unimodal problems (FIDs 1 to 14). Also, the shape of the cloud of observations indicates that the problems' complexity grows with its dimension: while the 2D-observations are mainly clustered in the lower left corner (indicating rather easy problems), the cloud stretches along the entire diagonal for higher-dimensional problems. Furthermore it is obvious that two problems, FIDs 1 (*Sphere*, indicated by $\u2218$) and 5 (*Linear Slope*, $\u22c4$), are consistently solved very fast (independent of its dimensionality), whereas FID 24 (*Lunacek Bi-Rastrigin*, $\u25b5$) is always the most difficult problem for the two baseline algorithms.

Looking at the performances of HCMA across all 96 problems, one problem stands out: the three-dimensional version of the *Büche-Rastrigin* function (FID 4). On this (usually rather simple) function, HCMA has a relative ERT of 1 765.9, compared to at most 51.8 on the non-3D-versions of this problem and 249.1 as the worst relative ERT of all remaining 95 BBOB problems. We therefore had a closer look at the submitted data of HCMA across FID 4 and for all dimensions. The respective data are summarized in Table 2. Comparing the results of the 20 optimization runs with each other, three of them seem to be clear outliers. However, it remains unclear whether they were caused by the algorithm itself—it could for instance be stuck in a local optimum without terminating—or whether the outliers were caused by some technical issue on the platform.

Dim | IID | Gap to Optimum | # Function Evaluations | Success? | ERT of VBS (Solver) | ERT of SBS (relERT) |

1 | 8.44e-03 | 215 | TRUE | |||

2 | 6.91e-03 | 545 | TRUE | 98.8 | 405.2 | |

2 | 3 | 8.84e-05 | 440 | TRUE | (BSqi) | (4.1) |

4 | 5.44e-03 | 248 | TRUE | |||

5 | 4.29e-03 | 578 | TRUE | |||

1 | 4.48e-03 | 976 690 | TRUE | |||

2 | 9.99e-01 | 570 925 | FALSE | 219.6 | 387 784.5 | |

3 | 3 | 9.62e-04 | 781 | TRUE | (BSrr) | (1 765.9) |

4 | 1.72e-03 | 1 373 | TRUE | |||

5 | 8.24e-03 | 1 369 | TRUE | |||

1 | 6.52e-03 | 2 048 | TRUE | |||

2 | 8.09e-04 | 2 248 | TRUE | 486.2 | 25 197.0 | |

5 | 3 | 1.00e+00 | 91 882 | FALSE | (BSrr) | (51.8) |

4 | 2.16e-03 | 2 382 | TRUE | |||

5 | 8.54e-03 | 2 228 | TRUE | |||

1 | 3.75e-03 | 4 253 | TRUE | |||

2 | 6.72e-03 | 4 495 | TRUE | 1 067.8 | 4 286.6 | |

10 | 3 | 2.58e-03 | 4 632 | TRUE | (BSrr) | (4.0) |

4 | 2.79e-03 | 4 032 | TRUE | |||

5 | 5.43e-03 | 4 021 | TRUE |

Dim | IID | Gap to Optimum | # Function Evaluations | Success? | ERT of VBS (Solver) | ERT of SBS (relERT) |

1 | 8.44e-03 | 215 | TRUE | |||

2 | 6.91e-03 | 545 | TRUE | 98.8 | 405.2 | |

2 | 3 | 8.84e-05 | 440 | TRUE | (BSqi) | (4.1) |

4 | 5.44e-03 | 248 | TRUE | |||

5 | 4.29e-03 | 578 | TRUE | |||

1 | 4.48e-03 | 976 690 | TRUE | |||

2 | 9.99e-01 | 570 925 | FALSE | 219.6 | 387 784.5 | |

3 | 3 | 9.62e-04 | 781 | TRUE | (BSrr) | (1 765.9) |

4 | 1.72e-03 | 1 373 | TRUE | |||

5 | 8.24e-03 | 1 369 | TRUE | |||

1 | 6.52e-03 | 2 048 | TRUE | |||

2 | 8.09e-04 | 2 248 | TRUE | 486.2 | 25 197.0 | |

5 | 3 | 1.00e+00 | 91 882 | FALSE | (BSrr) | (51.8) |

4 | 2.16e-03 | 2 382 | TRUE | |||

5 | 8.54e-03 | 2 228 | TRUE | |||

1 | 3.75e-03 | 4 253 | TRUE | |||

2 | 6.72e-03 | 4 495 | TRUE | 1 067.8 | 4 286.6 | |

10 | 3 | 2.58e-03 | 4 632 | TRUE | (BSrr) | (4.0) |

4 | 2.79e-03 | 4 032 | TRUE | |||

5 | 5.43e-03 | 4 021 | TRUE |

### 5.2 Analyzing the Best Algorithm Selector

After gaining more insights into the underlying data, we use the performance and feature data for training a total of 70 algorithm selectors: 14 machine learning algorithms (see Section 4.4), each of them using either one of the four feature selection strategies described in Section 4.5 or no feature selection at all. The classification-based support vector machine, which employed a greedy forward-backward feature selection strategy (sffs), achieved the best performance. While the single best solver (HCMA) had a mean relative ERT of 30.37, the algorithm selector (denoted as “Model 1”) had a mean relative ERT of 16.67 (including feature costs) and 13.60 (excluding feature costs). Therefore, the selector is able to pick an optimizer from the portfolio, which solves the respective problem on average twice as fast.

Eight features resulted from the feature selection approach and were included in Model 1: three features from the y-distribution feature set (the skewness, kurtosis and number of peaks of a kernel-density estimation of the problems' objective values; see Mersmann et al., 2011), one levelset feature (the ratio of mean misclassification errors when using a linear (LDA) and mixed discriminant analysis (MDA); Mersmann et al., 2011), two information content features (the *maximum information content* and the *settling sensitivity*; see Muñoz Acosta, Kirley et al., 2015), one cell mapping feature (the standard deviation of the distances between each cell's center and worst observation; see Kerschke et al., 2014) and one of the basic features (the best fitness value within the sample).

Looking at the scatterplots shown in the left column of Figure 7, one can see that, with the exception of the 2D *Rastrigin* function (FID 3, +), Model 1 predicts on all instances an optimizer that performs at least as good as HCMA. Admittedly, this comparison is not quite fair, as the ERTs shown within the respective column do not consider the costs for computing the landscape features. Hence, a more realistic comparison of the selector and the SBS is shown in the second column of Figure 7.

The fact that Model 1 performs worse than HCMA on some of the problems is negligible as the losses only occur on rather simple, and thus, quickly solvable, problems. The allocated budget for computing the landscape features was $50\xd7d$, that is, between 100 and 500 function evaluations depending on the problem dimension. On the other hand, the ERT of the portfolio's VBS lies between 4.4 and 7 371 411. These numbers (a) explain why Model 1 performs poorly on rather simple problems, and (b) support the thesis that the costs for computing the landscape features only have a small impact on the total costs when solving rather complex problems. However, one should be aware of the fact that the number of required function evaluations for the initial design is in a range of the common size of initial (evolutionary) algorithm populations so that—given that the majority of (promising) optimization algorithms belongs to the group of evolutionary algorithms—in practice no additional feature costs would occur.

These findings are also visible in the upper row of Figure 8, which compares the relative ERTs of HCMA and Model 1 (on a log-scale) across all 96 BBOB problems (distinguished by FID and dimension). For better visibility of the performance differences, the area between the relative ERTs of the SBS (depicted by $\u2022$) and Model 1 ($\u25b4$) is highlighted in grey. One major contribution of our selector obviously is to successfully avoid using HCMA on the three-dimensional version of FID 4. Moreover, one can see that the SBS mainly dominates our selector on the separable BBOB problems (FIDs 1 to 5), whereas our selector achieved equal or better performances on the majority of the remaining problems.

### 5.3 Improving the Currently Best Algorithm Selector

Although Model 1 already performs better than the SBS, we tried to improve the previous model by making use of results from previous works. More precisely, we start with the set of the eight features that were employed by Model 1 and try to find a more informative feature subset by adding the meta-model and nearest better clustering features. The rational behind this approach is that (a) these feature sets, as shown in Kerschke et al. (2015), are very useful for characterizing a landscape's global structure, and (b) the feature set of Model 1 was derived from a greedy approach and therefore might have missed better feature subsets.

Based on the combination of the three feature sets from above (i.e., the features from Model 1, extended by the NBC and meta-model feature sets), additional feature selections were performed (using the four feature selection strategies described in Section 4.5). Two of the new selectors, namely the ones constructed using the greedy sfbs-approach (with a mean relative ERT of 14.27) and the $(10+50)$-GA (mean relative ERT: 14.24)—performed better than Model 1. The latter model, that is, the one based on the genetic algorithm, is from here on denoted as “Model 2.”

As displayed in Table 1, Model 2 not only has the best overall performance, but also performs best on the multimodal functions with a weak global structure (FIDs 20 to 24). This effect is also visible in the right column of Figure 7, where the problems of the respective BBOB group (colored in pink) are located either in the left half of the respective plots (for 2D and 3D) or at least on the diagonal itself (5D and 10D).

Also, Figures 7 and 8 reveal that Model 2 differs more often from the SBS than Model 1. This finding is also supported by Table 3, which summarizes the actual and predicted solvers. More precisely, it lists (in the left column) how often each of the 12 optimization algorithms from the portfolio performed best. The remaining columns list the predicted optimization algorithms. For instance, the first row of the table shows that for six problems, BSqi had the best performance among all 12 solvers from the portfolio. However, instead of predicting BSqi, Model 1 predicted fmincon and HCMA three times each, and Model 2 predicted fmincon, HCMA and HMLS (each of them twice) for these six problems. Interestingly, one can observe that, although HCMA only in roughly 15% of the problems (14 of 96) is the true best solver, both models show a strong bias towards the SBS: Model 1 selects it in approximately 80% and Model 2 in 46% of the problems. Another remarkable observation is that the two models only predict three and four different solvers, respectively.

True Best Solver | Predicted Solver (Model 1 / Model 2) | ||||

Solver | # | fmincon | HCMA | HMLSL | MLSL |

BSqi | 6 | 3 / 2 | 3 / 2 | 0 / 2 | 0 / 0 |

BSrr | 6 | 1 / 2 | 5 / 4 | 0 / 0 | 0 / 0 |

CMA-CSA | 7 | 0 / 1 | 7 / 3 | 0 / 3 | 0 / 0 |

fmincon | 12 | 0 / 4 | 8 / 4 | 0 / 1 | 4 / 3 |

fminunc | 6 | 1 / 2 | 4 / 3 | 0 / 0 | 1 / 1 |

HCMA | 14 | 0 / 3 | 14 / 11 | 0 / 0 | 0 / 0 |

HMLSL | 11 | 3 / 3 | 7 / 4 | 0 / 0 | 1 / 4 |

IPOP400D | 7 | 0 / 0 | 7 / 3 | 0 / 3 | 0 / 1 |

MCS | 4 | 2 / 1 | 2 / 0 | 0 / 3 | 0 / 0 |

MLSL | 12 | 4 / 2 | 8 / 6 | 0 / 2 | 0 / 2 |

OQNLP | 6 | 0 / 1 | 6 / 2 | 0 / 2 | 0 / 1 |

SMAC-BBOB | 5 | 0 / 0 | 5 / 2 | 0 / 1 | 0 / 2 |

Σ | 96 | 14 / 21 | 76 / 44 | 0 / 17 | 6 / 14 |

True Best Solver | Predicted Solver (Model 1 / Model 2) | ||||

Solver | # | fmincon | HCMA | HMLSL | MLSL |

BSqi | 6 | 3 / 2 | 3 / 2 | 0 / 2 | 0 / 0 |

BSrr | 6 | 1 / 2 | 5 / 4 | 0 / 0 | 0 / 0 |

CMA-CSA | 7 | 0 / 1 | 7 / 3 | 0 / 3 | 0 / 0 |

fmincon | 12 | 0 / 4 | 8 / 4 | 0 / 1 | 4 / 3 |

fminunc | 6 | 1 / 2 | 4 / 3 | 0 / 0 | 1 / 1 |

HCMA | 14 | 0 / 3 | 14 / 11 | 0 / 0 | 0 / 0 |

HMLSL | 11 | 3 / 3 | 7 / 4 | 0 / 0 | 1 / 4 |

IPOP400D | 7 | 0 / 0 | 7 / 3 | 0 / 3 | 0 / 1 |

MCS | 4 | 2 / 1 | 2 / 0 | 0 / 3 | 0 / 0 |

MLSL | 12 | 4 / 2 | 8 / 6 | 0 / 2 | 0 / 2 |

OQNLP | 6 | 0 / 1 | 6 / 2 | 0 / 2 | 0 / 1 |

SMAC-BBOB | 5 | 0 / 0 | 5 / 2 | 0 / 1 | 0 / 2 |

Σ | 96 | 14 / 21 | 76 / 44 | 0 / 17 | 6 / 14 |

The different behavior of the two selectors is likely caused by the differences in the used features. While three of the features are identical for both selectors—the previously described angle and levelset features, as well as the skewness of the objective values—the remaining six features of Model 2 are meta-model and nearest better clustering (NBC) features. The two meta-model features measure (1) the smallest absolute, non-intercept coefficient of a linear model (without interactions) and (2) the adjusted model fit ($Radj2$) of a quadratic model (without interactions; see Mersmann et al., 2011). The remaining four NBC features measure (1) the ratio of the standard deviations of the distances among the nearest neighbours and the nearest better neighbours, (2) the ratio of the respective arithmetic means, (3) the correlation between the distances of the nearest neighbours and the nearest better neighbours, and (4) the so-called *indegree*, i.e., the correlation between fitness value and the count of observations to whom the current observation is the nearest better neighbour (Kerschke et al., 2015).

### 5.4 Critical Discussion: Robustness of the Results

Based on the previous results, one might argue that the proposed approach performs that well only due to HCMA's problem on the three-dimensional version of the *Büche-Rastrigin* function (FID 4). Therefore, we repeated our experiments with a subset of all 95 BBOB problems leaving out the problematic instance. Following the approach that we already employed for constructing Model 2 (Section 5.3), we again found algorithm selection models that outperform HCMA on this subset. The best of them is a classification-based random forest using *y*-distribution (2), meta-model (5), basic, cell mapping, information content and nearest better clustering (1 each) features. As shown in Table 4, the random forest outperforms HCMA, despite the feature costs, in each of the dimensions and requires on average less than two-thirds of the function evaluations.

Problem Dimension | SBS | AS |

2D | 17.7 | 17.2 |

3D | 17.6 | 5.1 |

5D | 6.5 | 4.7 |

10D | 6.9 | 4.5 |

all | 12.1 | 7.9 |

Problem Dimension | SBS | AS |

2D | 17.7 | 17.2 |

3D | 17.6 | 5.1 |

5D | 6.5 | 4.7 |

10D | 6.9 | 4.5 |

all | 12.1 | 7.9 |

Moreover, analyzing Table 1, one might eventually draw the conclusion that the optimization algorithms mainly perform differently on lower dimensional problems (2D and 3D) and that our presented approach thus might not be effective enough on higher dimensional problems. One could even argue that HCMA outperforms Models 1 and 2 (Sections 5.2 and 5.3) on the latter ones. Although the results in Table 4 already disprove this hypothesis, we additionally repeated our experiments based on the 48 higher dimensional BBOB problems (5D and 10D) only to support the effectiveness of our approach.

Again, HCMA showed the best performance among all solvers within the portfolio, having very competitive relative ERTs of 6.5 (5D) and 6.9 (10D), respectively. Nevertheless, we trained new algorithm selection models and again found models that substantially outperform HCMA on this particular subset. The best of them is a classification-based random forest using two *y*-distribution, six meta-model, two information content and two nearest better clustering features.

Table 5 compares the performances of HCMA and the random forest per group of BBOB problems. For most of them, the relative ERTs of our algorithm selector (including feature costs) are below the ones of the SBS and in consequence the random forest also had a lower mean relative ERT (4.5), indicating that it (on average) requires only two-thirds of the number of function evaluations compared to HCMA.

BBOB-Group | 5D | 10D | all | |||

SBS | AS | SBS | AS | SBS | AS | |

F1 - F5 | 12.0 | 11.7 | 2.7 | 14.8 | 7.4 | 13.3 |

F6 - F9 | 3.9 | 2.1 | 2.2 | 1.6 | 3.0 | 1.9 |

F10 - F14 | 4.2 | 3.7 | 2.8 | 2.8 | 3.5 | 3.2 |

F15 - F19 | 4.3 | 3.1 | 2.0 | 2.1 | 3.2 | 2.6 |

F20 - F24 | 7.7 | 1.2 | 23.6 | 1.3 | 15.7 | 1.3 |

all | 6.5 | 4.4 | 6.9 | 4.6 | 6.7 | 4.5 |

BBOB-Group | 5D | 10D | all | |||

SBS | AS | SBS | AS | SBS | AS | |

F1 - F5 | 12.0 | 11.7 | 2.7 | 14.8 | 7.4 | 13.3 |

F6 - F9 | 3.9 | 2.1 | 2.2 | 1.6 | 3.0 | 1.9 |

F10 - F14 | 4.2 | 3.7 | 2.8 | 2.8 | 3.5 | 3.2 |

F15 - F19 | 4.3 | 3.1 | 2.0 | 2.1 | 3.2 | 2.6 |

F20 - F24 | 7.7 | 1.2 | 23.6 | 1.3 | 15.7 | 1.3 |

all | 6.5 | 4.4 | 6.9 | 4.6 | 6.7 | 4.5 |

## 6 Conclusion and Outlook

We showed that sophisticated machine learning techniques combined with informative exploratory landscape features form a powerful tool for automatically constructing algorithm selection models for unseen optimization problems. This work provides the first extensive study on feature-based algorithm selection in case of single-objective continuous black-box optimization. With our approach, we were able to reduce the mean relative ERT of the single best solver of our portfolio by half relying only on a quite small set of ELA features and function evaluations. A specific R-package enhanced by a user-friendly graphical user interface allows for transferring the presented methodology to other (continuous) optimization scenarios differing from the BBOB setting.

Of course, the quality and usability of such models heavily relies on the quality of the algorithm benchmark, the considered performance measure, as well as on the representativeness of the included optimization problems. While we considered the well-known and commonly accepted BBOB workshop setting, we are aware of possible shortcomings and will in future work extend our research in terms of including other comprehensive benchmarks and practical applications once available.

Moreover, within this work our experimental studies relied on the ERT, that is, the most commonly used performance measure for single-objective continuous optimization. However, despite the ERT's current status as gold standard within this domain, its unquestioned usage across any possible setting is at least debatable. In fact, other, also multi-objective, performance measures (see, e.g., van Rijn et al., 2017; Bossek and Trautmann, 2018; Kerschke, Bossek et al., 2018) might be more applicable. And although our approach is easily transferrable to other settings (including different performance measures), changing the underlying metric likely results in a different algorithm selection model. Within future research, we thus intend to investigate alternative performance measures in detail and assess their applicability for algorithm selection purposes.

In addition, we also intend to investigate the potential of more sophisticated feature selection strategies, as well as of automatically tuning the hyperparameters of the included machine learning techniques. Finally, we work on extending the methodological approach to constrained and multi-objective optimization problems (Kerschke, Wang et al., 2016; Kerschke and Grimme, 2017; Kerschke, Wang et al., 2018) in order to increase the applicability to real-world scenarios.

## Acknowledgments

The authors acknowledge support from the *European Research Center for Information Systems ^{13} (ERCIS)* and the DAAD PPP project No. 57314626.

## Notes

^{7}

Per definition, instances are identical up to rotation, scaling and shifts (Finck et al., 2010).

^{8}

An overview of all submitted optimization algorithms along with their descriptions can be found at http://coco.gforge.inria.fr/doku.php?id=algorithms.

^{9}

The cell mapping angle features were computed using three blocks per dimension, as larger values would result in too many empty cells due to the “curse of dimensionality.”

^{11}

A BIPOP-CMA-ES (Hansen, 2009) is a multistart CMA-ES with equal budgets for two interlaced restart strategies: one with an increasing population size and one with varying small population sizes.

^{12}

Ties are broken via random uniform sampling among the tied solvers.

## References

*Real-parameter black-box optimization benchmarking 2009: Experimental setup*