## Abstract

Exploratory Landscape Analysis provides sample-based methods to calculate features of black-box optimization problems in a quantitative and measurable way. Many problem features have been proposed in the literature in an attempt to provide insights into the structure of problem landscapes and for use in selecting an effective algorithm for a given optimization problem. While there has been some success, evaluating the utility of problem features in practice presents significant challenges. Machine learning models have been employed as part of the evaluation process, but they may require additional information about the problems as well as having their own hyper-parameters, biases, and experimental variability. As a result, extra layers of uncertainty and complexity are added into the experimental evaluation process, making it difficult to clearly assess the effect of the problem features. In this article, we propose a novel method for the evaluation of problem features which can be applied directly to individual or groups of features and does not require additional machine learning techniques or confounding experimental factors. The method is based on the feature's ability to detect a prior ranking of similarity in a set of problems. Analysis of Variance (ANOVA) significance tests are used to determine if the feature has successfully distinguished the successive problems in the set. Based on ANOVA test results, a percentage score is assigned to each feature for different landscape characteristics. Experimental results for twelve different features on four problem transformations demonstrate the method and provide quantitative evidence about the ability of different problem features to detect specific properties of problem landscapes.

## 1 Introduction

A continuous optimization problem can be defined as maximizing or minimizing a function $F: \mathbb{R}^D \rightarrow \mathbb{R}$ with respect to the input $X \in \mathbb{R}^D$, where $D$ is the dimensionality of the problem. Black-box optimization problems are those where the objective is optimized without knowledge of the internal details of the problem. Most real-world problems are considered black box because the algebraic expression of the optimization function is unknown. Many metaheuristics and algorithms have been developed for solving black-box continuous optimization problems (Weise et al., 2009; Smith-Miles, 2008). Theoretically, we know from the No Free Lunch Theorem (Wolpert and Macready, 1997) that no algorithm is better than any other over all problem classes. In practice, it is likely that different algorithms will perform better on different types of problems. Therefore, an important challenge is to understand the relationship between algorithms and problems. To select an appropriate algorithm for a given optimization problem, the structural properties of the problem landscape need to be explored. As a result, different features have been proposed in the literature to try to provide insight into this structure. The general framework for algorithm selection provided by Rice (1976) involves these problem characteristics as well.

Exploratory Landscape Analysis (ELA) (Mersmann et al., 2011) provides different sample-based features to gain insight into the complexity of black-box problems. For example, some ELA features aim to capture information about modality/global structure of the problem whereas some features measure the ruggedness of the landscape. Some commonly used problem features which are based on a set of data samples include fitness distance correlation (FDC) (Müller and Sbalzarini, 2011), dispersion (Lunacek and Whitley, 2006), ruggedness or neutrality (Marín, 2012), information content (Vassilev et al., 2000), length scale (Morgan and Gallagher, 2012), and cell mapping-based features (Kerschke et al., 2014). A description of existing problem features can be found in Muñoz et al. (2015) and Malan and Engelbrecht (2013b). An implementation of these features is easily available in the R-package FLACCO (Kerschke and Trautmann, 2016).

The fundamental goal of developing problem features is to quantitatively characterize and categorize problems based on their structural complexity using a minimum number of problem samples. This leads to the main focus of our study: How can we evaluate the performance/utility of problem features in a systematic, quantitative manner? Some previous approaches to feature evaluation are based on the feature's ability to predict the class label for problems in a standard benchmark problem set (that have been previously categorized). The benchmark problem sets used for classification are typically the Black-Box Optimization Benchmarking (BBOB) set (Hansen et al., 2009) and the Congress on Evolutionary Computation (CEC) numerical problem set (Liang et al., 2013). The features calculated on these problems are fed as an input to the classifier. The most informative features which improve the performance of the classifier are selected by applying the feature selection algorithms in the classification framework (Mersmann et al., 2010). The classifier's performance is taken as an indication of the ability of the feature set to distinguish different problem categories. This is a rather indirect approach to feature evaluation, since it depends on factors such as the type of classifier used, the choice of hyper-parameters for the classifier, and the feature selection algorithm. Therefore, it is difficult to evaluate the contribution of each individual feature.

In this article, we propose a new method for direct evaluation of problem features which can be applied to an individual feature and does not require additional machine learning techniques. There are no assumptions involved in this methodology and it can be applied to any problem feature. The method is based on the feature's ability to follow the trend in a set of problems which have a prior ranking of similarity among the problems. We create sets of ordered problems by transforming standard benchmark problems in a controlled and measurable way. The problem transformations suggested here are designed to capture different characteristics of problem landscapes. In each problem transformation one characteristic of a landscape is changed gradually and all the features are tested upon it. Feature values obtained for each problem in the transformation set are compared using ANOVA significance tests. Features are assigned a percentage score depending upon the number of problems distinguished correctly over the transformation. In our experiments we evaluate twelve problem features on four problem transformations with landscape properties including ruggedness, neutrality, ill-conditioning, and linearity. We also consider the effect of sample size on feature estimation. Clearly it is of interest to be able to estimate features with a small sample size (Kerschke et al., 2016); however, the dependence on sample size is an experimental detail that is often overlooked.

The article is organized as follows: Section 2 summarizes the utility and evaluation of features in the literature. Section 3 briefly describes the methods to calculate some commonly used features. The detailed methodology of feature evaluation criteria is provided in Section 4. An algorithm explaining the feature evaluation criteria is also discussed in this section. Results are discussed in Section 5 before the concluding Section 6.

## 2 Literature Review

Algorithm selection using sample-based features was introduced in Bischl et al. (2012). The results show that using cheap (low level) features we can predict optimal or close to optimal candidates from a portfolio of algorithms. In this research, a portfolio of four algorithms is created and a cost-sensitive classification approach is used for algorithm selection. The algorithm ranking has been developed for the BBOB problem set based on the data provided from the BBOB competition. Despite promising results there are some limitations associated with these algorithm selection frameworks. A critical assumption is that a labelled problem set exists, where the labels indicate a meaningful categorization of the problems. However, the organization of the BBOB problems into five categories is rather informal. In fact, recently it has been suggested that four problem classes could be a more natural categorization instead of five for the BBOB problems (Mersmann et al., 2015). It is also observed that the algorithm performance is not consistent within the problem groups (Bischl et al., 2012).

In a related approach, regression models using neural networks are built in Muñoz et al. (2012) to predict the performance of different configurations of algorithms in terms of expected run time (ERT). The landscape features and algorithm parameter values are fed into the model. The accuracy of the model depends upon the relevance of the features and precision in the feature values. This requires algorithm performance data on a set of problems to be available.

The algorithm selection problem using features is also analyzed in Muñoz et al. (2013). The authors have pointed out the effect of variations in the landscape features values and the bias incurred by sampling on the algorithm selection. A possible solution proposed is to use confidence intervals instead of a single value for a landscape measure.

Muñoz and Smith-Miles (2016) attempt to estimate an algorithm's footprint in the problem instance space. An algorithm's footprint is the region of all possible instances of problem where it is capable of performing well. ELA features are used to construct an instance space and an algorithm's footprint is associated with the region of common instance space. This framework reveals regions of algorithm strength and weakness in problem space. Hence, feature values for easy or hard problems can be identified.

Three landscape features (ruggedness based on entropy, dispersion, and a gradient estimate) are correlated with particle swarm optimization (PSO) performance measures (rate and speed), on a range of benchmark problems in Malan and Engelbrecht (2013a). Results show that these three features show some correlation with PSO performance but suggest that to predict algorithm performance more accurately, more features should be considered. A similar study was done in Malan and Engelbrecht (2014b) where fitness landscape measures are used to build a model to predict the failure of seven different variants of the PSO algorithm on a set of benchmark problems. Decision trees have been used in the prediction model. Using the decision trees helps in identifying the regions where the algorithm faced difficulty in solving the problem.

Problem features are also used in algorithm configuration, in addition to algorithm selection. A case study on feature-based algorithm configuration was carried out in Belkhir et al. (2016a) to understand the relationship between different configurations of differential evolution (DE) and problem characteristics for a set of benchmark problems. A framework capable of providing a suitable parameter configuration based on the feature values of an unknown problem was developed using random forest regression models. Another framework for algorithm configuration was built using the results produced by different runs of CMA-ES configurations on different instances (Belkhir et al., 2017). Per-instance algorithm configuration based on problem features is used for training the model, and the results are promising.

Benchmark functions and benchmark suites commonly used in algorithm selection are analyzed in Garden and Engelbrecht (2014). In this paper, two common benchmark problem sets are analyzed to determine whether these functions cover a wide range of problem characteristics. Results show that some characteristics are underrepresented by these suites. This can affect the performance of many algorithm selection models, as the training set is effectively incomplete.

Morgan and Gallagher (2014) have studied the effect of the sampling technique and distance metric on the calculation of the dispersion metric in the continuous domain. It is shown theoretically and experimentally that the normalized dispersion metric calculated using uniform random sampling with Euclidean distance converges to a constant value of $\sqrt{1/6}$ as the dimension approaches infinity. Hence, the sample size should be selected according to the exponential increase in volume with increasing dimension $D$.

The reliability of a problem feature depends strongly upon the quality of the data sample used to calculate the feature value (Pitzer and Affenzeller, 2012). It has also been shown that using an appropriate sample size for the feature computation is very important (Belkhir et al., 2017). Research is now focused on methods to evaluate these features using a minimum number of samples (Belkhir et al., 2016b). Selecting the right sample size to measure a feature value depends not only on the method of feature calculation but also on the complexity of the problem for which the feature is evaluated. However, in experimental results, little attention is typically given to how reliable these feature values are given the sample size used (Scott and De Jong, 2016).

The work most relevant to this research is Muñoz and Smith-Miles (2015). They explore the trend in feature values for different instances of the same problem generated by different function translations in a bounded region. The results show that some feature values change abruptly with the translations, or there is even a phase transition in the measure. This implies that we should not use averaged feature values over different instances as a representative of the generating function. This also affects algorithm selection models which are based on these ELA measures. The second important issue raised concerns the dimensionality reduction applied to visualize problems in feature space. It is reported that dimensionality reduction techniques may alter the neighborhood structure, which may result in hiding some important problem characteristics.

It is clear that improvements and alternative approaches to the evaluation of problem features is an important direction for research in exploratory landscape analysis. Our approach consists of two parts:

First, we explore the effect of different sample sizes on the calculation of commonly used features for a set of standard problems.

Second, we propose a simple and direct feature evaluation criteria based on a set of controlled problem transformations.

## 3 Black-Box Optimization Problem Features

A large number of landscape features have been proposed in the literature. Here we discuss some of those features which are commonly used in the problem understanding frameworks (Bischl et al., 2012; Muñoz and Smith-Miles, 2016, 2015). A brief description of the computation method for these features is also given because it is important to the behaviour of the estimated feature values.

### 3.1 Dispersion

Dispersion (Lunacek and Whitley, 2006) is an algorithm-independent measure based on samples of the search space, designed to capture the global topology of a function. It describes how the good solution points are distributed in the search space. Mathematically, it is calculated as the expected pairwise distance between sample points of good fitness value, where "good" fitness is determined by a threshold ($R\%$) applied to a random sample. Changing the threshold can change the estimated dispersion value of a given problem.
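As an illustration, dispersion can be estimated from a random sample as follows. This is a minimal sketch assuming minimization, a uniform sample in $[-5,5]^D$, and a threshold keeping the best 10% of points (the threshold is a hyper-parameter choice, as noted above):

```python
import numpy as np

def dispersion(X, y, threshold=0.1):
    """Mean pairwise Euclidean distance among the best `threshold`
    fraction of sample points (minimization assumed)."""
    n_best = max(2, int(len(y) * threshold))
    best = X[np.argsort(y)[:n_best]]                 # points with good fitness
    diff = best[:, None, :] - best[None, :, :]       # all pairwise differences
    dists = np.sqrt((diff ** 2).sum(axis=-1))
    i, j = np.triu_indices(n_best, k=1)              # count each pair once
    return dists[i, j].mean()

# 2-D Sphere function sampled uniformly in [-5, 5]^2
rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(1000, 2))
y = (X ** 2).sum(axis=1)
print(dispersion(X, y))   # low dispersion: the good points cluster at the origin
```

For the unimodal Sphere the best points form a single tight cluster, so the estimate is much smaller than the mean pairwise distance over the full sample.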

### 3.2 FDC

Fitness distance correlation (FDC) measures the sample correlation between the fitness values $Y$ and the distances $D_s$ of the sample points from the global optimum:
$$\mathrm{FDC} = \frac{1}{N\,S_F\,S_{D_s}} \sum_{i=1}^{N} (y_i - y^*)(d_i - d^*),$$
where $y^*$ and $d^*$ are the means of $Y$ and $D_s$; $S_F$ and $S_{D_s}$ are the respective sample standard deviations; and $N$ is the number of samples. FDC requires knowledge of the global optimum solution, which limits its application. The best sample found is sometimes used as a substitute, but this can significantly affect the estimated FDC value.
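FDC can be computed directly from a sample, assuming the global optimum is known (as its definition requires); a minimal sketch:

```python
import numpy as np

def fdc(X, y, x_opt):
    """Sample correlation between fitness values and the distances
    of each sample point to the known global optimum x_opt."""
    d = np.linalg.norm(X - x_opt, axis=1)        # D_s: distances to the optimum
    cov = np.mean((y - y.mean()) * (d - d.mean()))
    return cov / (y.std() * d.std())             # normalize by standard deviations

rng = np.random.default_rng(1)
X = rng.uniform(-5, 5, size=(2000, 2))
y = (X ** 2).sum(axis=1)                          # Sphere, optimum at the origin
print(fdc(X, y, np.zeros(2)))                     # close to 1: fitness tracks distance
```

Replacing `x_opt` with the best sampled point, as mentioned above, changes `d` and therefore the estimate.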

### 3.3 Information Content

Information content features (Vassilev et al., 2000) analyze a sequence of fitness values obtained along a walk over the sample by converting the successive fitness differences into a string $S(\epsilon) = s_1 s_2 \dots s_N$ of symbols $s_i \in \{\bar{1}, 0, 1\}$, where differences smaller in magnitude than the threshold $\epsilon$ map to $0$, and $N$ is the length of the original string $S(\epsilon)$.

The accuracy of the estimates of information content-based features depends on the parameter $\epsilon$, which acts as a magnifying glass over the landscape. For small values of $\epsilon$ the estimates are highly sensitive to changes in fitness values. If $\epsilon$ is very large, the information content features cannot capture the changes in fitness values and the landscape is observed as a flat surface. The minimum value of $\epsilon$ which makes the landscape look flat defines information stability.
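A sketch of the symbol-string construction and two of the resulting measures, assuming fitness values taken along a walk through the sample. The base-6 logarithm and the pair-counting follow Vassilev et al. (2000); the walk/ordering strategy is left out and the details are illustrative:

```python
import numpy as np
from math import log

def information_content(f_walk, eps):
    """H(eps): entropy over consecutive symbol pairs p != q in the
    string S(eps) built from fitness differences along a walk."""
    diff = np.diff(f_walk)
    s = np.zeros(len(diff), dtype=int)           # 0: |difference| <= eps
    s[diff > eps] = 1                            # 1: fitness increased
    s[diff < -eps] = -1                          # -1: fitness decreased
    pairs = list(zip(s[:-1], s[1:]))
    h = 0.0
    for pq in set(pairs):
        if pq[0] != pq[1]:                       # only "rugged" pairs p != q
            prob = pairs.count(pq) / len(pairs)
            h -= prob * log(prob, 6)             # base 6: number of p != q pair types
    return h

def partial_information_content(f_walk, eps):
    """M(eps): length of S(eps) after dropping zeros and repeated
    symbols, relative to the original string length."""
    diff = np.diff(f_walk)
    s = [1 if d > eps else -1 for d in diff if abs(d) > eps]
    mu = sum(1 for a, b in zip([None] + s[:-1], s) if a != b)
    return mu / len(diff)

f = [0, 1, 0, 1, 0, 1]                           # maximally rugged walk
print(information_content(f, 0.5))               # log_6(2), about 0.3869
print(partial_information_content(f, 0.5))       # 1.0
```

On a monotone walk every symbol is identical, so no $p \neq q$ pair occurs and the entropic measure is zero.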

### 3.4 Meta-Model-Based Features

Fitting a linear or quadratic regression model to the set of samples obtained from the problem gives an indication of the modality and global structure of the landscape. The goodness of fit of these models is used as a feature. Mathematically, these features are measured as the adjusted coefficient of determination $R^2$ of linear and quadratic regression models (Mersmann et al., 2011). The adjusted $R^2$ measures the proportion of the variance in the function values that is explained by the regression model, penalized for the number of model parameters.
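For example, the adjusted $R^2$ features can be computed with ordinary least squares. The quadratic model below includes square and interaction terms; whether interactions are included is a design choice, treated here as an assumption:

```python
import numpy as np

def adjusted_r2(X, y, quadratic=False):
    """Adjusted coefficient of determination of a linear (or full
    quadratic) least-squares model fitted to the sample."""
    n, d = X.shape
    cols = [np.ones(n), *X.T]                      # intercept + linear terms
    if quadratic:
        cols += [X[:, i] * X[:, j]                 # squares and interactions
                 for i in range(d) for j in range(i, d)]
    Z = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r2 = 1 - ((y - Z @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    p = Z.shape[1] - 1                             # number of predictors
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(2)
X = rng.uniform(-5, 5, size=(1000, 2))
y = (X ** 2).sum(axis=1)                           # Sphere is exactly quadratic
print(adjusted_r2(X, y))                           # near 0: a linear model fails
print(adjusted_r2(X, y, quadratic=True))           # essentially 1: perfect quadratic fit
```

The contrast between the two fits is the signal these features carry: a high quadratic but low linear $R^2$ suggests a bowl-shaped global structure.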

### 3.5 Objective Function Value Distribution

Structural information about a problem can be gained by characterizing the distribution of the sampled objective function values $y$ (also known as the density of states) through its skewness and kurtosis (Mersmann et al., 2011). The sample skewness and kurtosis indicate whether the distribution is flat or peaked compared to a normal distribution. These statistics can be used as features to make predictions about the global structure and multimodality of the landscape.
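These statistics are straightforward to estimate from the sampled objective values; the sketch below uses the convention in which a normal distribution has kurtosis 3:

```python
import numpy as np

def skewness(y):
    """Sample skewness of the objective values."""
    z = y - y.mean()
    return (z ** 3).mean() / (z ** 2).mean() ** 1.5

def kurtosis(y):
    """Sample kurtosis (non-excess: a normal distribution gives 3)."""
    z = y - y.mean()
    return (z ** 4).mean() / (z ** 2).mean() ** 2

rng = np.random.default_rng(3)
X = rng.uniform(-5, 5, size=(100_000, 2))
y = (X ** 2).sum(axis=1)              # 2-D Sphere objective values
print(skewness(y), kurtosis(y))       # right-skewed, kurtosis below 3
```

The large sample makes the estimates stable; they are comparable to the Sphere values reported in the tables below.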

### 3.6 Length Scale

Length scale (Morgan and Gallagher, 2012) measures the magnitude of change in objective function value per unit distance between pairs of sample points, $r_{ij} = |y_i - y_j| / \lVert x_i - x_j \rVert$. It is invariant to isometric transformations but sensitive to scaling and shearing. To summarize the information about the problem landscape, the probability density of $r_{ij}$ is estimated. The mean, variance, and entropy of the distribution $p(r_{ij})$ are used as problem-characterizing features.
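A sketch of estimating the length scale features from a sample. The entropy here is computed from a simple histogram of $r_{ij}$, which is just one possible density estimator; the estimator choice (and the number of bins) is an assumption of this sketch:

```python
import numpy as np

def length_scale_features(X, y, bins=100):
    """Mean, variance, and histogram-based entropy estimate of the
    length scale distribution r_ij = |y_i - y_j| / ||x_i - x_j||."""
    i, j = np.triu_indices(len(y), k=1)           # each point pair once
    r = np.abs(y[i] - y[j]) / np.linalg.norm(X[i] - X[j], axis=1)
    density, edges = np.histogram(r, bins=bins, density=True)
    mass = density * np.diff(edges)               # probability per bin
    mass = mass[mass > 0]
    entropy = -(mass * np.log(mass)).sum()        # discrete (binned) entropy
    return r.mean(), r.var(), entropy

rng = np.random.default_rng(4)
X = rng.uniform(-5, 5, size=(400, 2))
y = (X ** 2).sum(axis=1)                          # 2-D Sphere sample
print(length_scale_features(X, y))
```

Note the quadratic cost in the sample size: all pairs of points are used, which is one reason sample size matters for this feature.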

## 4 Methodology

### 4.1 Effect of Sample Size

In a continuous optimization problem the number of solutions is infinite. Hence, problem features $\Phi$ are calculated using a finite sample set from the solution space and the corresponding objective function values $(X,Y)$. The estimated feature value will clearly depend on the sample size (Abell et al., 2013). However, experimental studies in the literature often do not closely examine the effect of sample size on feature estimation. Therefore, we examined different sample sizes for the computation of features. The features used here are: Dispersion ($\Phi_1$), Fitness Distance Correlation ($\Phi_2$), Information Content ($\Phi_3$), Partial Information Content ($\Phi_4$), Information Stability ($\Phi_5$), Length Scale mean ($\Phi_6$), Length Scale variance ($\Phi_7$), Length Scale entropy ($\Phi_8$), adjusted $R^2$ for a linear model ($\Phi_9$), adjusted $R^2$ for a quadratic model ($\Phi_{10}$), sample skewness of $Y$ ($\Phi_{11}$), and sample kurtosis of $Y$ ($\Phi_{12}$). The estimated feature value also depends upon the complexity of the problem, so we calculated features for a set of standard problems including Sphere, Rastrigin, Ackley, Griewank, and Rosenbrock. All the functions used are defined in the standard bounded region of $[-5,5]^D$.

As $N \rightarrow \infty$ the feature $\Phi_i$ approaches its true value. According to the Central Limit Theorem, the standard deviation of the feature values would decrease as $1/\sqrt{N}$. The method used to calculate a feature value will also affect the trend in variance with increasing sample size. We calculated the values of the above-mentioned features on the standard problem set for a (single) very large sample size of $40000 \times D^2$ for the 2-D and 5-D problem sets; the results are shown in Tables 1 and 2. While this would not be practical for online algorithm selection, it is valuable for benchmarking and experimental verification. A few papers have reported feature values with different (usually small) sample sizes and different hyper-parameters (Malan and Engelbrecht, 2014a), which makes it hard to compare results. For example, the feature values reported in Bond et al. (2015) for the same problems seem very different. Such confusion can be avoided if accurate values of the features on specific problems are known. These feature values are also critical inputs to any algorithm selection framework. It would be very helpful to have highly accurate, published, and publicly available feature values when comparing different algorithm selection frameworks, which could be fed directly into a framework without recalculation. In this way, the variability in results due to feature computation methods would be reduced.
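The $1/\sqrt{N}$ behaviour is easy to verify empirically for a simple sample-based statistic. Here a stand-in "feature" (the mean objective value of the 2-D Sphere) is re-estimated many times at increasing sample sizes; this is illustrative only, as each feature above has its own constants and convergence behaviour:

```python
import numpy as np

rng = np.random.default_rng(5)

def feature_estimate(n):
    """Stand-in sample-based feature: mean Sphere value of n uniform points."""
    X = rng.uniform(-5, 5, size=(n, 2))
    return (X ** 2).sum(axis=1).mean()

for n in (100, 400, 1600):
    estimates = [feature_estimate(n) for _ in range(200)]
    # The standard deviation across repeats shrinks roughly as 1/sqrt(n):
    print(n, np.std(estimates))
```

Quadrupling the sample size roughly halves the spread of the estimates, which motivates reporting the sample size alongside every feature value.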

| | Sphere | Rastrigin | Rosenbrock | Griewank | Ackley |
| --- | --- | --- | --- | --- | --- |
| $\Phi_1$ | 1.1423 | 2.6267 | 2.4024 | 2.9072 | 1.1824 |
| $\Phi_2$ | 0.9746 | 0.7061 | 0.6207 | 0.6131 | 0.9674 |
| $\Phi_3$ | 0.4088 | 0.4088 | 0.4088 | 0.4088 | 0.4088 |
| $\Phi_4$ | 0.6664 | 0.6667 | 0.6665 | 0.6665 | 0.6670 |
| $\Phi_5$ | 49.0058 | 77.3671 | 8.9229e+04 | 236.9401 | 13.3828 |
| $\Phi_6$ | 2.7451 | 4.3632 | 3.8204e+03 | 10.3427 | 0.7001 |
| $\Phi_7$ | 4.5970 | 29.8226 | 2.08e+07 | 158.6029 | 0.3834 |
| $\Phi_8$ | 2.8579 | 3.5437 | 13.2860 | 4.7411 | 0.9229 |
| $\Phi_9$ | 0 | 3.33e-06 | 0.0722 | 0.0704 | 1.67e-05 |
| $\Phi_{10}$ | 1 | 0.5215 | 0.8714 | 0.8502 | 0.84022 |
| $\Phi_{11}$ | 0.4522 | 0.1430 | 1.6191 | 1.5603 | -0.68605 |
| $\Phi_{12}$ | 2.5710 | 2.6606 | 4.9492 | 4.8547 | 2.95389 |


| | Sphere | Rastrigin | Rosenbrock | Griewank | Ackley |
| --- | --- | --- | --- | --- | --- |
| $\Phi_1$ | 4.5487 | 6.0464 | 5.1150 | 5.5777 | 4.63291 |
| $\Phi_2$ | 0.9884 | 0.7150 | 0.7935 | 0.9688 | 0.9688 |
| $\Phi_3$ | 0.4088 | 0.4088 | 0.4088 | 0.4087 | 0.4087 |
| $\Phi_4$ | 0.6667 | 0.6667 | 0.6666 | 0.6670 | 0.6670 |
| $\Phi_5$ | 105.1258 | 151.4530 | 2.9452e+05 | 185.6551 | 10.0572 |
| $\Phi_6$ | 2.3022 | 3.2131 | 4.8704e+03 | 3.0914 | 0.2040 |
| $\Phi_7$ | 3.3316 | 7.4258 | 1.6351e+07 | 6.5718 | 0.03245 |
| $\Phi_8$ | 2.6051 | 3.1029 | 13.6650 | 3.0433 | -0.8537 |
| $\Phi_9$ | 0 | 0 | 0.0688 | 0.0671 | 1.67e-05 |
| $\Phi_{10}$ | 1 | 0.5218 | 0.8782 | 0.8580 | 0.89256 |
| $\Phi_{11}$ | 0.2854 | 0.0905 | 0.9043 | 0.8715 | -0.6441 |
| $\Phi_{12}$ | 2.8278 | 2.8634 | 3.8697 | 3.8238 | 3.56535 |


### 4.2 Problem Transformations for Feature Evaluation

The proposed evaluation criteria require a set of problems $F_1, F_2, \dots, F_t$ from the problem space $\mathcal{F}$ which can be ranked in terms of geometric or functional similarity. This gives an ordered problem set (a transformation) $T$ with $t$ problems such that problem $F_1$ is more similar to $F_2$ than to $F_3$, and so on. When a feature $\Phi_i$ is calculated on this set, we would like to be able to recover the ordering from the values $\Phi_i(F_1), \Phi_i(F_2), \dots, \Phi_i(F_t)$. Each feature value $\Phi_i(F_j)$ is calculated $K$ times using an independent sample of size $N$ for each problem in the set to compensate for sampling error. This provides a group of observations for each problem for every feature. A one-way ANOVA test is then used to check whether the differences between the feature values of different problems are significant. The complete algorithm for the feature evaluation criteria is given in Algorithm 1.
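A minimal sketch of this evaluation loop: an ordered problem set, $K$ repeated feature estimates per problem, and a one-way ANOVA F statistic computed by hand. The ellipsoid family, the choice of sample skewness as the feature, and the values of $K$ and $N$ are illustrative placeholders, not the transformations used in the experiments:

```python
import numpy as np

def skewness(y):                                  # feature used for illustration
    z = y - y.mean()
    return (z ** 3).mean() / (z ** 2).mean() ** 1.5

def one_way_anova_f(groups):
    """F statistic for H0: all groups share the same mean feature value."""
    grand = np.concatenate(groups).mean()
    ss_between = sum(len(g) * (np.mean(g) - grand) ** 2 for g in groups)
    ss_within = sum(((np.asarray(g) - np.mean(g)) ** 2).sum() for g in groups)
    df_b = len(groups) - 1
    df_w = sum(len(g) for g in groups) - len(groups)
    return (ss_between / df_b) / (ss_within / df_w)

# Ordered problem set: ellipsoids of increasing conditioning (illustrative).
problems = [lambda X, c=c: X[:, 0] ** 2 + c * X[:, 1] ** 2 for c in (1, 10, 100)]

rng = np.random.default_rng(6)
K, N = 30, 500                                    # repeats per problem, sample size
groups = [[skewness(f(rng.uniform(-5, 5, size=(N, 2)))) for _ in range(K)]
          for f in problems]

F = one_way_anova_f(groups)
print(F)   # far above the 5% critical value (about 3.1 for 2 and 87 df)
```

A large F indicates that the feature separates at least some of the problems in the ordered set; the post hoc tests described below identify which pairs.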

The sets of ordered problems used for feature evaluation are generated using controlled problem transformations. All the proposed problem transformations are shown in 1-D in Figure 1 to illustrate the structure of the successive problems. In the experiments involving controlled problem transformations, all features are calculated using 50 different sample sets for each problem to compensate for sampling error. The sample size used here is $500 \times D^2$.

In $t$ equal steps (we used a step size of 1 for this transformation), we vary the function from the Sphere to the Rastrigin function. Every problem in the transition has the same global structure, and the value of $A_t$ controls the magnitude of the periodic term in the Rastrigin function. As the value of $A_t$ increases, the size of the local basins increases. The whole problem transformation $T_{SR}$ changes the landscape from being smooth to highly rugged.
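As a concrete sketch, this transformation can be implemented as a parametrized family interpolating between the Sphere ($A_t = 0$) and the standard Rastrigin function ($A_t = 10$); the exact parametrization below is assumed from the description:

```python
import numpy as np

def sphere_to_rastrigin(A):
    """Member of the T_SR family: A = 0 gives the Sphere and A = 10 the
    standard Rastrigin function (assumed parametrization)."""
    def f(x):
        x = np.asarray(x, dtype=float)
        return (x ** 2).sum() + A * (1 - np.cos(2 * np.pi * x)).sum()
    return f

# Eleven problems, step size 1: smooth -> highly rugged.
T_SR = [sphere_to_rastrigin(A) for A in range(11)]
print(T_SR[0]([0.5, 0.5]))    # Sphere end: 0.5
print(T_SR[10]([0.5, 0.5]))   # rugged end: 0.5 + 10 * 2 * 2 = 40.5
```

Every member shares the same global optimum at the origin, so only the ruggedness of the landscape changes along the transformation.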

The level $L$ can be any value in the range of the function, depending upon the number of problems $t$ in one transformation; the function is flattened at this level, so the flat region starts building from the center and grows outwards. The transformation is stopped before the function becomes completely flat, since some features cannot be calculated for a perfectly flat function.

Note that the $y$-axis scale is different for each problem, which indicates the changing shape of the function.

### 4.3 ANOVA Test

To determine whether there is a significant difference between the mean feature values for each problem in every controlled problem transformation, we use one-way ANOVA tests (Hogg and Ledolter, 1987). The ANOVA test indicates whether the mean feature values for two or more optimization problems differ. ANOVA is an omnibus test: it can only indicate that at least two of the optimization problems have different (true) mean feature values. A post hoc test can then identify which specific groups differ significantly; here we use Tukey's range test (Tukey, 1949). Such post hoc tests are applicable only when the null hypothesis is rejected. The significance level ($\alpha$) used is 0.05. The detailed multiple comparison results are given in Tables 5, 6, 7, and 8; a pair of problems is considered distinguished when the null hypothesis is rejected for that pair. The information from each multiple comparison test is summarized as a percentage in Tables 3 and 4. The results are discussed in detail in Section 5. ANOVA assumes that observations within each group are normally distributed and that the variance of each group is the same. However, when used to compare means with equal group sizes, ANOVA is robust to mild non-normality and unequal variance, particularly when the number of groups is large (Scheffe, 1959; Bathke, 2004). In our experiments we found these to be reasonable assumptions: we examined histograms of each feature value, evaluated 50 times for every problem in the transformations, and the distributions appeared approximately normal. If there were strong non-normality or dissimilar variances, nonparametric alternatives (e.g., the Kruskal-Wallis test) could be used instead (Bhattacharyya, 2004).
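The percentage score can be sketched as the share of problem pairs whose mean feature values are judged different. Here Bonferroni-corrected pairwise Welch t-tests stand in for Tukey's range test (which needs the studentized range distribution); the groups below are synthetic feature observations, not values from the experiments:

```python
import numpy as np
from itertools import combinations
from scipy.stats import ttest_ind

def percentage_score(groups, alpha=0.05):
    """Percentage of problem pairs with significantly different mean
    feature values (pairwise Welch t-tests with Bonferroni correction
    used as a stand-in for Tukey's range test)."""
    pairs = list(combinations(range(len(groups)), 2))
    rejected = sum(ttest_ind(groups[i], groups[j], equal_var=False).pvalue
                   < alpha / len(pairs)
                   for i, j in pairs)
    return 100.0 * rejected / len(pairs)

# Three "problems": repeated feature observations with well-separated means.
rng = np.random.default_rng(7)
groups = [rng.normal(mu, 1.0, size=50) for mu in (0.0, 5.0, 10.0)]
print(percentage_score(groups))   # 100.0: every pair is distinguished
```

With $t$ problems there are $t(t-1)/2$ pairs, which matches the scores in the tables below (e.g., 10 of 55 pairs rejected gives 18.18).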

| | $T_{SR}$ | $T_{RF}$ | $T_{SE}$ | $T_{LS}$ |
| --- | --- | --- | --- | --- |
| $\Phi_1$ | 0 | 97.7 | 18.18 | 53.3 |
| $\Phi_2$ | 0 | 0 | 18.18 | 86.6 |
| $\Phi_3$ | 0 | 97.7 | 0 | 0 |
| $\Phi_4$ | 0 | 86.6 | 0 | 0 |
| $\Phi_5$ | 93.3 | 62.2 | 100 | 37.7 |
| $\Phi_6$ | 97.7 | 100 | 100 | 86.6 |
| $\Phi_7$ | 93.3 | 71.11 | 98.18 | 57.77 |
| $\Phi_8$ | 97.7 | 77.77 | 100 | 100 |
| $\Phi_9$ | 0 | 37.7 | 0 | 93.3 |
| $\Phi_{10}$ | 100 | 100 | 0 | 93.3 |
| $\Phi_{11}$ | 97.7 | 93.3 | 18.18 | 86.66 |
| $\Phi_{12}$ | 55.5 | 44.4 | 18.18 | 84.4 |


| | $T_{SR}$ | $T_{RF}$ | $T_{SE}$ | $T_{LS}$ |
| --- | --- | --- | --- | --- |
| $\Phi_1$ | 100 | 86.11 | 18.18 | 77.7 |
| $\Phi_2$ | 100 | 91.6 | 18.18 | 86.6 |
| $\Phi_3$ | 0 | 83.3 | 0 | 0 |
| $\Phi_4$ | 0 | 72.2 | 0 | 0 |
| $\Phi_5$ | 80 | 88.8 | 100 | 46.6 |
| $\Phi_6$ | 100 | 91.6 | 100 | 84.4 |
| $\Phi_7$ | 97.7 | 91.6 | 100 | 84.4 |
| $\Phi_8$ | 100 | 72.2 | 100 | 84.4 |
| $\Phi_9$ | 0 | 36.11 | 0 | 84.44 |
| $\Phi_{10}$ | 100 | 83.33 | 0 | 77.7 |
| $\Phi_{11}$ | 97.7 | 75 | 18.18 | 86.6 |
| $\Phi_{12}$ | 31.11 | 22.2 | 18.18 | 77.7 |


| Prob | $F_{LS2}$ | $F_{LS3}$ | $F_{LS4}$ | $F_{LS5}$ | $F_{LS6}$ | $F_{LS7}$ | $F_{LS8}$ | $F_{LS9}$ | $F_{LS10}$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $F_{LS1}$ | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.9171 | 0.9999 | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ |
| $F_{LS2}$ | | 1.0000 | 1.0000 | 1.0000 | 0.8651 | 1.0000 | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ |
| $F_{LS3}$ | | | 1.0000 | 1.0000 | 0.9441 | 0.9998 | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ |
| $F_{LS4}$ | | | | 1.0000 | 0.7783 | 1.0000 | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ |
| $F_{LS5}$ | | | | | 0.7800 | 1.0000 | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ |
| $F_{LS6}$ | | | | | | 0.6164 | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ |
| $F_{LS7}$ | | | | | | | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ |
| $F_{LS8}$ | | | | | | | | $<10^{-4}$ | $<10^{-4}$ |
| $F_{LS9}$ | | | | | | | | | $<10^{-4}$ |


| Prob | $F_{SE2}$ | $F_{SE3}$ | $F_{SE4}$ | $F_{SE5}$ | $F_{SE6}$ | $F_{SE7}$ | $F_{SE8}$ | $F_{SE9}$ | $F_{SE10}$ | $F_{SE11}$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $F_{SE1}$ | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ |
| $F_{SE2}$ | | 1 | 0.9999 | 1 | 0.9335 | 0.6708 | 1 | 0.512 | 1 | 0.9999 |
| $F_{SE3}$ | | | 1 | 1 | 0.9801 | 0.8179 | 1 | 0.6802 | 1 | 1 |
| $F_{SE4}$ | | | | 1 | 0.9989 | 0.9574 | 0.9981 | 0.8899 | 0.9982 | 1 |
| $F_{SE5}$ | | | | | 0.9592 | 0.7405 | 1 | 0.5876 | 1 | 1 |
| $F_{SE6}$ | | | | | | 1 | 0.8052 | 0.9997 | 0.8081 | 0.9992 |
| $F_{SE7}$ | | | | | | | 0.4555 | 1 | 0.4592 | 0.9631 |
| $F_{SE8}$ | | | | | | | | 0.3103 | 1 | 0.9976 |
| $F_{SE9}$ | | | | | | | | | 0.3134 | 0.9011 |
| $F_{SE10}$ | | | | | | | | | | 0.9977 |


Table 7: p-values of the post-hoc multiple comparison tests for $\Phi 8$ on the transformation $TRF$. Entries below the significance level indicate significantly different problem pairs.

| Prob | $FRF2$ | $FRF3$ | $FRF4$ | $FRF5$ | $FRF6$ | $FRF7$ | $FRF8$ | $FRF9$ | $FRF10$ |
|------|--------|--------|--------|--------|--------|--------|--------|--------|---------|
| $FRF1$ | 1 | 0.8241 | 0.02 | $<10^{-4}$ | 0.0006 | 0.0134 | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ |
| $FRF2$ | | 0.9749 | 0.0837 | 0.0004 | 0.0041 | 0.0023 | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ |
| $FRF3$ | | | 0.7347 | 0.0417 | 0.1736 | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ |
| $FRF4$ | | | | 0.9209 | 0.9966 | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ |
| $FRF5$ | | | | | 1 | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ |
| $FRF6$ | | | | | | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ |
| $FRF7$ | | | | | | | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ |
| $FRF8$ | | | | | | | | $<10^{-4}$ | $<10^{-4}$ |
| $FRF9$ | | | | | | | | | 1 |


Table 8: p-values of the post-hoc multiple comparison tests for $\Phi 12$ on the transformation $TSR$. Entries below the significance level indicate significantly different problem pairs.

| Prob | $FSR2$ | $FSR3$ | $FSR4$ | $FSR5$ | $FSR6$ | $FSR7$ | $FSR8$ | $FSR9$ | $FSR10$ |
|------|--------|--------|--------|--------|--------|--------|--------|--------|---------|
| $FSR1$ | 0.9224 | 1 | 0.9997 | 0.0617 | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ |
| $FSR2$ | | 0.6775 | 0.999 | 0.8093 | $<10^{-4}$ | 0.0005 | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ |
| $FSR3$ | | | 0.9822 | 0.0135 | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ |
| $FSR4$ | | | | 0.3028 | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ | $<10^{-4}$ |
| $FSR5$ | | | | | 0.0105 | 0.1764 | 0.0065 | 0.0001 | $<10^{-4}$ |
| $FSR6$ | | | | | | 0.9951 | 1 | 0.9703 | 0.7589 |
| $FSR7$ | | | | | | | 0.9877 | 0.4877 | 0.1719 |
| $FSR8$ | | | | | | | | 0.9859 | 0.8306 |
| $FSR9$ | | | | | | | | | 0.9999 |


### 4.4 Feature Evaluation Criteria

## 5 Results and Discussion

### 5.1 Effect of Sample Size on Problem Features

The normalized estimated feature values for different sample sizes for two representative problems (2D Sphere and 5D Rastrigin) are summarized in Figures 2 and 3. Box plots for the remaining problems are provided in the Appendix. Results are displayed in grouped box plots, where each group corresponds to a single feature. The three box plots in each group represent the results of 30 independent feature calculations for the same problem using three sample sizes: small ($100 \times D^2$), medium ($500 \times D^2$), and large ($2000 \times D^2$).

There is currently no standard sample size in the literature for problem feature estimation experiments. Our assumption here is that reliable estimation of possibly complex features may require a sample size growing quadratically in *D*. Refer to Section 2 for further discussion on sample size.

The feature values are normalized so that the variance of different features is comparable. We normalize the set of observations from all three sample sizes for one feature at a time, centring by the mean and scaling by the range: $(x - \operatorname{mean}(x)) / (\max(x) - \min(x))$, where the mean, maximum, and minimum are taken over all observed values of that feature across the three sample sizes. In the box plots, the circle marks the mean, the horizontal line is the median, the box represents the interquartile range, and the whiskers are marked with dashed lines. It is not surprising that the variance of the estimated values typically decreases with increased sample size; most features display this behaviour on the functions we have used. The main exceptions are $\Phi 9$ and $\Phi 10$, which have small ($\Phi 9$) or zero ($\Phi 10$) variance at all values of *N*. This can be explained by the definitions of these features. The Sphere function is, by definition, a quadratic function, and $\Phi 10$ measures the goodness of fit of a quadratic model. Hence, the model can fit any sample of points from the Sphere function perfectly, and there is no variability in the estimated feature value. Feature $\Phi 9$ measures the goodness of fit of a linear model; linear regression has a relatively low estimation variance, so for a sample of points from a reasonably well-behaved function $\Phi 9$ will be very stable.
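The normalization described above can be sketched as follows; `normalize_feature` is a hypothetical helper name, and the pooling of all three sample-size groups before taking the mean, maximum, and minimum follows the description in the text.

```python
import numpy as np

def normalize_feature(groups):
    """Normalize repeated estimates of one feature on one problem.

    `groups` holds one array of feature estimates per sample size
    (e.g. 30 repetitions each for small, medium, and large samples).
    The mean, maximum, and minimum are taken over the pooled
    observations, so the three box plots share a common scale.
    """
    pooled = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    scale = pooled.max() - pooled.min()
    return [(np.asarray(g, dtype=float) - pooled.mean()) / scale for g in groups]
```

After this transformation the pooled values have zero mean and unit range, which is what makes groups from different features directly comparable in one plot.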

For the Rastrigin function, feature $\Phi 5$ behaves unusually with increasing sample size. From the definition of $\Phi 5$ in Section 3.3, it is calculated as the maximum difference between any two fitness values in the sample. The maximum (and minimum) of a sample are among the least robust summary statistics. For a complex, structured landscape, a larger sample implies more exploration and a greater chance of discovering an extreme fitness value. Hence, we do not see much shrinkage in the variance of $\Phi 5$ values at higher sample sizes.
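A minimal sketch of this kind of range-based estimate is given below. The function names, the uniform sampling scheme, and the search bounds $[-5, 5]^D$ are illustrative assumptions, not the paper's exact experimental setup.

```python
import numpy as np

def rastrigin(X):
    """Rastrigin fitness, vectorized over the rows of X."""
    return 10 * X.shape[1] + np.sum(X**2 - 10 * np.cos(2 * np.pi * X), axis=1)

def fitness_range(f, dim, n, rng, bounds=(-5.0, 5.0)):
    """Phi_5-style estimate: the largest difference between any two
    fitness values in a uniform random sample of size n."""
    X = rng.uniform(bounds[0], bounds[1], size=(n, dim))
    y = f(X)
    return float(y.max() - y.min())

rng = np.random.default_rng(0)
small_est = fitness_range(rastrigin, 5, 100, rng)
large_est = fitness_range(rastrigin, 5, 10000, rng)
```

Because the estimate is driven by sample extremes, larger samples tend to uncover more extreme fitness values rather than converging the way averaged statistics do.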

The results also show that using a small sample size can have a radical effect on the estimated feature value in an individual experiment. Consider the median values shown within each group of box plots. For some features (e.g., $\Phi 3$ and $\Phi 4$ in Figure 3, and $\Phi 2$ and $\Phi 9$ in Figure 2), the difference between the medians obtained using small and large sample sizes is very obvious. Similarly, for some features a small sample size shifts the median line away from the mean value. The risk of using a small *N* is that important structural information about the landscape is lost, which in turn affects the feature estimate. A feature value calculated from a small sample of data can therefore be misleading.

The estimated feature values for the 2D test problems using a *very large* sample size ($40000 \times D^2$) are given in Table 1. The corresponding results for the 5D problems are provided in Table 2. By using a *very large* sample we can remove any significant estimation error due to sample size.

For both 2D and 5D, the values of $\Phi 3$ across the whole problem set are very similar. $\Phi 3$ depends strongly on its hyperparameter $\varepsilon$, which determines its sensitivity to changes in fitness values during the random walk over the landscape (Section 3.3). Here we have chosen $\varepsilon = 0$, meaning every change is tracked no matter how small; at the same time, this makes the feature insensitive to the magnitude of those changes. As a result we obtain similar information content values for different functions unless a function contains a completely flat region (Merkuryeva and Bolshakovs, 2010).
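The role of $\varepsilon$ can be illustrated with the symbol encoding that underlies information-content measures: successive fitness differences along a walk are mapped to increase/neutral/decrease symbols. The helper name `change_symbols` and the example walk are assumptions for illustration only.

```python
import numpy as np

def change_symbols(fitness_walk, eps):
    """Encode successive fitness differences along a random walk as
    symbols: 1 (increase), -1 (decrease), 0 (neutral). Changes whose
    magnitude does not exceed eps are treated as neutral."""
    diffs = np.diff(np.asarray(fitness_walk, dtype=float))
    symbols = np.zeros(diffs.shape, dtype=int)
    symbols[diffs > eps] = 1
    symbols[diffs < -eps] = -1
    return symbols
```

With $\varepsilon = 0$, as in our experiments, every nonzero change registers a symbol; a larger $\varepsilon$ collapses small fluctuations into neutral symbols, restoring sensitivity to the size of the changes at the cost of ignoring fine detail.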

To verify our experiments we compared the feature values obtained with previously reported results in the literature. For the test problems, some feature values can be found in Malan and Engelbrecht (2014a), where the sample size used is $500 \times D$. Comparing the reported feature values with our estimates using a sample size of $40000 \times D^2$, we found that the values are similar for the Sphere function. However, for the more complex functions there is a difference between feature values obtained using small and large sample sizes. Feature values for some other functions were not comparable because they are defined over different bounded regions.

In summary, these results demonstrate that sample size effects need to be considered when optimization problem features are estimated. The estimates depend on the sample size, the definition of the feature and the problem landscape in question. Subsequent analysis of features could be erroneous if incorrect assumptions are made about the estimated feature values.

### 5.2 Features Evaluated on Problem Transformations

In this section, we present the results of applying the methodology proposed in Section 4 to evaluate the ability of the problem features to identify the trend in the problem transformations described in Section 4.4. The percentages of correctly identified differences over each problem transformation for 2D are shown in Table 3; the corresponding results for the 10D problems are shown in Table 4.
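One plausible way to compute such a percentage score is sketched below, assuming the score counts how many successive problem pairs in a transformation a feature separates significantly. The function name `transformation_score` and the square p-value matrix layout are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def transformation_score(pvalues, alpha=0.05):
    """Percentage of successive problem pairs that a feature separates.

    `pvalues` is a k x k matrix of post-hoc comparison p-values, where
    entry [i, j] compares problem i with problem j.  Only the k - 1
    entries just above the diagonal (successive pairs) are scored.
    """
    p = np.asarray(pvalues, dtype=float)
    k = p.shape[0]
    successive = np.diag(p, k=1)  # p-values for the pairs (i, i+1)
    return 100.0 * np.count_nonzero(successive < alpha) / (k - 1)
```

A feature that distinguishes every successive pair scores 100%, while a feature whose values do not change significantly along the transformation scores 0%.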

For the first transformation, $TSR$, most of the features were very successful in identifying the differences between successive problems, with the exceptions of $\Phi 3$, $\Phi 4$, and $\Phi 9$. $\Phi 12$ only partly detected the different levels of ruggedness in the transformation. This happens when a problem in the transformation set is not distinguished from its immediate successor but is correctly distinguished from a more distant problem.

For the transformation $TRF$ none of the features failed completely. $\Phi 1$, $\Phi 2$, $\Phi 3$, $\Phi 6$, $\Phi 10$, and $\Phi 11$ perfectly tracked the changes in the transformation. Some features ($\Phi 4$, $\Phi 7$, and $\Phi 8$) identified the difference between problems most of the time. Features $\Phi 5$, $\Phi 9$, and $\Phi 12$ show a limited ability to track the problems in this transformation.

Most of the features failed to distinguish the problems in $TSE$, the exceptions being $\Phi 5$, $\Phi 6$, $\Phi 7$, and $\Phi 8$. The features $\Phi 1$, $\Phi 2$, $\Phi 11$, and $\Phi 12$ were able to distinguish the sphere from the ellipse problems but could not identify differently conditioned ellipse functions from each other.

For $TLS$ all the features performed reasonably well except $\Phi 3$, $\Phi 4$, and $\Phi 5$. $\Phi 9$ performs very well in identifying the different levels of linearity in successive problems, and $\Phi 2$, $\Phi 6$, $\Phi 7$, $\Phi 8$, $\Phi 11$, and $\Phi 12$ also perform well. $\Phi 10$ likewise performed well on $TLS$; the reason for this behaviour is the increasing quadratic structure of the problem. Kerschke et al. (2015) reported that $\Phi 9$ and $\Phi 10$ are very important features for identifying funnel structure in problems, which is consistent with our results, as both features score very highly on this transformation.

$\Phi 3$ captures the neutrality in the $TRF$ very well as there is a perfectly flat region introduced in this transformation. However, $\Phi 3$ is unable to distinguish different fitness levels during the transformations $TSR$, $TSE$, and $TLS$.

$\Phi 9$ is poor at identifying problems with different levels of ruggedness, as seen in the $TSR$ results. It measures the goodness of fit to a linear function, which explains its failure to distinguish differently rugged functions.

If we compare the score of each feature over all the transformations, then $\Phi 6$ has the highest values in both 2D and 10D. Other features that show good overall performance in 10D are $\Phi 1$, $\Phi 2$, $\Phi 5$, $\Phi 10$, and $\Phi 11$. It is important to note, however, that the objective here is not to determine the "best" feature. Different features will have some ability to detect certain structural changes in landscapes. The main point of the methodology proposed here is to measure this ability on specific problem transformations or sets of problem instances of interest.

More detailed analysis of the performance of the features can be carried out using post-hoc multiple comparison tests for all pairs of problems in one transformation set. This produces a table of results for each feature/transformation combination; here we present and discuss a few representative examples. Table 5 shows the p-values of the multiple comparison tests for $\Phi 1$ on $TLS$. In the first row, $FLS1$ is compared with every other problem in the transformation set; the second row contains the comparison of $FLS2$ with the remaining problems, and so on. Problem pairs with p-values below the significance level are judged significantly different. A p-value of 1 indicates that there was no difference in the sample mean feature values between the two problems. Looking at row 1 of Table 5, we find that $FLS1$ is confused with $FLS2$ through $FLS7$, while the later problems are correctly distinguished from $FLS1$. Similarly, problems $FLS2$ to $FLS7$ are confused with each other, but $FLS8$, $FLS9$, and $FLS10$ are correctly distinguished from the rest. This behaviour can be explained by the properties of $\Phi 1$: dispersion characterizes the funnel structure of a problem (Kerschke et al., 2015). As we transform the function from linear to sphere, the funnel structure starts to dominate, and by $FLS8$ a strong funnel has been created. So dispersion tracks the changing levels of global funnel structure very nicely.
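Pairwise tests of this kind can be reproduced with a standard Tukey HSD procedure; the sketch below uses SciPy's `scipy.stats.tukey_hsd` on synthetic feature samples. The group construction is purely illustrative and is not the paper's data.

```python
import numpy as np
from scipy import stats

# Thirty repeated feature estimates per "problem"; two problems share
# almost the same distribution, the third is clearly shifted.
rng = np.random.default_rng(42)
f_a = rng.normal(loc=0.0, scale=0.1, size=30)
f_b = f_a + rng.normal(scale=0.01, size=30)  # nearly identical problem
f_c = f_a + 5.0                              # clearly different problem

res = stats.tukey_hsd(f_a, f_b, f_c)
# res.pvalue[i, j] is the p-value comparing group i with group j:
# large for the confused pair (a, b), tiny for (a, c) and (b, c).
```

Reading the resulting p-value matrix row by row mirrors reading Tables 5 to 8: a large entry means the feature cannot separate that pair of problems.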

Table 6 contains the multiple comparison results for $\Phi 2$ on the $TSE$ transformation. Every entry in the first row is below the significance level, indicating that $FSE1$ is distinguished from all of the other problems. However, the problems from $FSE2$ to $FSE11$ are all confused with each other. This means that $\Phi 2$ distinguishes the sphere function from an elliptical function but cannot identify elliptical functions with different condition numbers.

Multiple comparison results for $\Phi 8$ on the $TRF$ transformation are given in Table 7. The first few columns contain few significant entries, whereas the last few columns contain many. In the transformation $TRF$ the flat region expands from the origin and grows outwards; the results show that it is easier for $\Phi 8$ to identify different problems once there is a major flat region. $FRF4$ is confused with every problem except $FRF8$, $FRF9$, and $FRF10$, whereas $FRF9$ is distinguished from every other problem except $FRF10$.

Table 8 contains the multiple comparison results for $\Phi 12$ on the transformation $TSR$. Here, the entries far from the diagonal are mostly significant: problems that are close to each other in the transformation are not distinguished properly, but problems that are further apart are identified easily. For example, $FSR9$ is clearly distinguished from $FSR1$ to $FSR5$ but is confused with $FSR6$, $FSR7$, and $FSR8$.

## 6 Conclusion

Problem features play an essential role in algorithm selection frameworks for continuous optimization problems. In this article, we have proposed a novel methodology to evaluate individual problem features. The direct feature evaluation criterion is based on problem sets that are ordered in terms of geometric or structural similarity. Here we have used controlled problem transformations to generate the sets of ordered problems. Multiple problem transformations were designed to exploit different landscape characteristics such as ruggedness, neutrality, ill-conditioning, and linearity. Features are tested on their ability to track the differences between successive problems during a transformation: each feature is calculated for every problem in the transformation, an ANOVA significance test checks whether there is a significant difference between the feature values for the problems, and a multiple comparison test is then used to summarize the ANOVA results as a single percentage score. The algorithm for the feature evaluation criterion is provided in the article. We observed that no single feature is capable of identifying all the landscape characteristics, but some features prove good at capturing certain characteristics of the landscape. Since feature computation methods are based on samples and their corresponding fitness function values, it is important to understand their variability with different sample sizes. We found that some features can be estimated reliably using a small sample size, while others are very sensitive to sample size.

The methodology proposed here is quite general and could be applied to any problem set where a ranking or ordering of problems can be defined. Interesting directions for future work are to explore the application of the methodology to real-world problem classes and to higher-dimensional problems.

## References

*Real-parameter black-box optimization benchmarking 2009: Noiseless functions definitions*