Surrogate modeling has become a valuable technique for black-box optimization tasks with expensive evaluation of the objective function. In this paper, we investigate the relationships between the predictive accuracy of surrogate models, their settings, and features of the black-box function landscape during evolutionary optimization by the covariance matrix adaptation evolution strategy (CMA-ES), a state-of-the-art optimizer for expensive continuous black-box tasks. This study aims to establish the foundation for specific rules and automated methods for selecting and tuning surrogate models by exploring relationships between landscape features and model errors, focusing on the behavior of a specific model within each generation, in contrast to selecting a specific algorithm at the outset. We perform a feature analysis process, identify a significant number of non-robust features, and cluster similar landscape features, resulting in the selection of 14 features out of 384, varying with the input data selection method. Our analysis explores the error dependencies of four models across 39 settings, utilizing three methods for input data selection, drawn from surrogate-assisted CMA-ES runs on the noiseless benchmarks of the comparing continuous optimizers (COCO) framework.

When solving a real-world optimization problem, we often have no information about the analytic form of the objective function. Evaluation of such black-box functions is frequently expensive in terms of time and money (Baerns and Holeňa, 2009; Lee et al., 2016; Thomaser et al., 2023; Zaefferer et al., 2016), which has been for two decades the driving force of research into surrogate modeling of black-box objective functions (Büche et al., 2005; Forrester and Keane, 2009; Jin, 2011). Given a set of observations, a surrogate model can be fitted to approximate the landscape of the black-box function.

The covariance matrix adaptation evolution strategy (CMA-ES) by Hansen (2006), which we consider the state-of-the-art evolutionary black-box optimizer, has been frequently combined with surrogate models. However, different surrogate models can show significant differences in the performance of the same optimizer on different data (Pitra et al., 2021). Such performance can also change during the algorithm run due to the varying landscape of the black-box function in different parts of the search space. Moreover, the predictive accuracy of individual models is strongly influenced by the choice of their parameters.

Our central goal in this investigation is to uncover and comprehend the intricate relationships between the configuration of individual models and data about properties and behavior of the objective function gathered during optimization runs and employed for surrogate model training. This data, representing a subset of solutions explored by the algorithm, constitutes the core of our analysis. The primary objective behind this endeavor is to facilitate the development of specific rules and automated approaches for selecting and tuning surrogate models. By gaining a deep understanding of these relationships, we aim to provide a foundation upon which adaptive and intelligent optimization strategies can be built.

Moreover, these relationships hold the promise of being valuable inputs for metalearning. Metalearning leverages insights from the performance of multiple optimization tasks to adapt and improve the optimization process. By characterizing the connections between surrogate model settings, model performance, and optimization outcomes, we lay the groundwork for the integration of these insights into the realm of metalearning, enabling the development of highly adaptable and efficient optimization strategies that learn and adapt from their experiences.

In recent years, research into landscape analysis of objective functions (cf. the overview by Kerschke, 2017) has emerged in the context of algorithm selection and algorithm configuration. Recently, Thomaser et al. (2023) have selected benchmark functions for tuning algorithm configuration based on their landscape features' similarity to real-world optimization problems in vehicle dynamics. However, to our knowledge, such features have been investigated in connection with surrogate models in the context of black-box optimization using static settings only, where the model is selected once at the beginning of the optimization process (Saini et al., 2019). Moreover, the analysis of landscape features themselves has so far received less attention than it deserves, and not in the context of surrogate models (Renau et al., 2019, 2020). Additionally, recent research by Saleem and Gallagher (2022) has proposed an alternative approach: while our emphasis lies in representing the objective function through landscape features, their approach suggests representing the same objective function using a GP regression model.

In this paper, we study properties of features representing the fitness landscape in the context of data from actual runs of a surrogate-assisted version of the CMA-ES on the noiseless part of the comparing continuous optimizers (COCO) benchmark function testbed (Hansen et al., 2016). From the large number of available features, we select a small number of representatives for subsequent research, based on their robustness to sampling and similarity to other features. A basic investigation of landscape features in connection with the CMA-ES assisted by a surrogate model based on Gaussian processes (Rasmussen and Williams, 2006) was presented in a conference paper by Pitra et al. (2019). Here, we investigate substantially more thoroughly the relationships between selected representatives of landscape features and the error of several kinds of surrogate models, using different settings of their parameters and different criteria for selecting points for their training. Such a study is crucial due to the completely different properties of data from runs of a surrogate-assisted algorithm compared with commonly utilized sampling strategies, which imply different values of landscape features, as emphasized in Renau et al. (2020).

The paper is structured as follows. Section 2 briefly summarizes the basics of surrogate modeling in the context of the CMA-ES, surrogate models used in this investigation, and landscape features from the literature. In Section 3, properties of landscape features are thoroughly tested and their connection to surrogate models is analyzed. Section 4 concludes the paper and suggests a few future research objectives.

The goal of black-box optimization is to find a feasible solution x* of some black-box function B: S → R, S ⊆ R^D, with the best value B(x*). For this function, we are only able to obtain the value B(x) of the objective function B, also known as fitness, at a point x, but no analytic expression is known, neither an explicit one as a composition of known mathematical functions, nor an implicit one as a solution of an equation (e.g., algebraic or differential). In the case of minimization,

x* ∈ arg min_{x ∈ S} B(x).

2.1  Surrogate Modeling in the Context of the CMA-ES

Black-box optimization problems frequently appear in the real world, where the values of a fitness function B(x) can be obtained only empirically through experiments or via computer simulations. As recalled in the Introduction, such empirical evaluation is sometimes very time-consuming or expensive. Thus, we assume that evaluating the fitness represents a considerably higher cost than training a regression model.

As a surrogate model, we understand any regression model B̂: X → R that is trained on the already available input–output value pairs stored in an archive A = {(x_i, y_i) | y_i = B(x_i), i = 1, ..., N}, and is used instead of the original expensive fitness to evaluate some of the points needed by the optimization algorithm.

2.1.1  The CMA-ES

The CMA-ES proposed by Hansen and Ostermeier (1996) has become one of the most successful algorithms in the field of continuous black-box optimization. Its pseudocode is outlined in Algorithm 1.

After generating λ new candidate solutions in generation g by adding to the mean m^(g) of the mutation distribution a random Gaussian mutation with covariance matrix (σ^(g))^2 C^(g) (Step 3), the fitness function B is evaluated for the new offspring (Step 4). The new mean m^(g+1) of the mutation distribution is computed as the weighted sum of the μ best points among the λ ordered offspring x_1, ..., x_λ (Step 5).
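For illustration, the following minimal sketch (not the original implementation) shows Steps 3–5 in Python for a minimization problem, assuming the standard log-decreasing recombination weights; the function name and the toy fitness are ours.

```python
import numpy as np

def sample_and_update_mean(m, sigma, C, B, lam, mu, rng):
    """Illustrative sketch of CMA-ES Steps 3-5: sample offspring from
    N(m, sigma^2 C), evaluate the fitness B, and recombine the mu best."""
    # Step 3: sample lambda offspring from the mutation distribution
    X = rng.multivariate_normal(mean=m, cov=sigma**2 * C, size=lam)
    # Step 4: evaluate the (expensive) fitness for every offspring
    y = np.array([B(x) for x in X])
    # Step 5: weighted recombination of the mu best offspring
    order = np.argsort(y)                      # minimization: best first
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()                               # standard log-decreasing weights
    m_new = w @ X[order[:mu]]
    return X, y, m_new

# usage on the 2-D sphere function
rng = np.random.default_rng(0)
X, y, m_new = sample_and_update_mean(
    m=np.zeros(2), sigma=0.5, C=np.eye(2),
    B=lambda x: float(np.dot(x, x)), lam=6, mu=3, rng=rng)
```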

The consecutive successful mutation steps of the algorithm in the search space, (m^(g+1) − m^(g))/σ^(g), are accumulated in two evolution path vectors p_σ and p_c (Step 6). Successful steps are tracked in the sampling space and stored in p_σ using the transformation (C^(g))^{−1/2}. The evolution path p_c is calculated similarly to p_σ; however, the coordinate system is not changed (Step 8). The two remaining elements of the evolution path updates are a decay factor c_σ, decreasing the impact of successful steps with increasing generations, and μ_w, used to normalize the variance of p_σ.

The covariance matrix adaptation (Step 10) is performed using rank-one and rank-μ updates. The rank-one update utilizes p_c through the outer product p_c p_c^⊤. The rank-μ update accumulates the successful steps of the μ best individuals in the matrix C_μ (Step 9).

The step-size σ (Step 11) is updated according to the ratio between the evolution path length ‖p_σ‖ and the expected length of a random evolution path. If the ratio is greater than 1, the step-size increases, otherwise it decreases.

The CMA-ES uses restart strategies to deal with multimodal fitness landscapes and to avoid being trapped in local optima. A multi-start strategy where the population size is doubled in each restart nr is referred to as IPOP-CMA-ES (Auger and Hansen, 2005).

2.1.2  Training Sets

The level of model precision is strongly determined by the selection of the points from the archive into the training set T. Considering the surrogate models we use in this paper, we list the following training set selection (TSS) methods:

  • TSS full, taking all the already evaluated points; that is, T=A,

  • TSS knn, selecting the union of the sets of k-nearest neighbors of all points for which the fitness should be predicted, where k is user defined (e.g., in Kern et al., 2006),

  • TSS nearest, selecting the union of the sets of k-nearest neighbors of all points for which the fitness should be predicted, where k is maximal such that the total number of selected points does not exceed a given maximum number of points N_max and no point is farther from the current mean m^(g) than a given maximal distance r_max (e.g., in Bajer et al., 2019).

The models in this paper use the Mahalanobis distance given by the CMA-ES matrix σ^2 C (see Equation 4) for selecting points into training sets, similarly to, for example, Kruisselbrink et al. (2010). Since TSS knn is the original TSS method of the model from Kern et al. (2006) (see Subsection 2.1.3), we list it here and utilize it only in connection with this particular model. The particular parameters of the TSS methods employed in this study can be found in Subsection 3.3.1.
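The following sketch illustrates the idea of TSS nearest under the assumptions of this section: distances are Mahalanobis distances with respect to σ^2 C, and the selection is capped by N_max points and a radius r_max around the current mean. It is a simplification that ranks archive points by their distance to the closest query point instead of forming explicit unions of k-nearest-neighbor sets; all names are illustrative.

```python
import numpy as np

def tss_nearest(X_archive, X_query, m, sigma, C, n_max, r_max):
    """Simplified sketch of TSS nearest: select archive points that are
    Mahalanobis-close (w.r.t. sigma^2 C) to the points to be predicted,
    limited by a maximal distance r_max from the mean and n_max points."""
    C_inv = np.linalg.inv(sigma**2 * C)

    def mahalanobis(a, b):
        d = a - b
        return float(np.sqrt(d @ C_inv @ d))

    # keep only archive points within r_max of the current mean
    in_range = [i for i, x in enumerate(X_archive)
                if mahalanobis(x, m) <= r_max]
    # distance of each remaining archive point to its closest query point
    dist = {i: min(mahalanobis(X_archive[i], q) for q in X_query)
            for i in in_range}
    # take the closest points, capped at n_max points in total
    return sorted(dist, key=dist.get)[:n_max]

# usage: 20 archive points, 3 query points in 2-D
rng = np.random.default_rng(1)
A = rng.normal(size=(20, 2))
Q = rng.normal(size=(3, 2))
idx = tss_nearest(A, Q, m=np.zeros(2), sigma=1.0, C=np.eye(2),
                  n_max=10, r_max=4.0)
```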

2.1.3  Response Surfaces

Basically, the purpose of surrogate modeling, to approximate an unknown functional dependence, coincides with the purpose of response surface modeling in the design of experiments (Hosder et al., 2001; Myers et al., 2009). Therefore, it is not surprising that typical response surface models, that is, low-order polynomials, also belong to the most traditional and most successful surrogate models (Rasheed et al., 2005; Kern et al., 2006; Auger et al., 2013; Hansen, 2019). Here, we present two surrogate models using at most a quadratic polynomial, utilized in two successful surrogate-assisted variants of the CMA-ES: the local-metamodel CMA (lmm-CMA) by Kern et al. (2006) and the linear-quadratic CMA-ES (lq-CMA-ES) by Hansen (2019).

The local metamodel is a full quadratic model f_j: R^D → R trained separately for each offspring x_j, j = 1, ..., λ, in the lmm-CMA (Auger et al., 2013). The model f_j is trained on the set N_k(x_j; A) of a given number k of nearest neighbors of x_j with respect to a given archive A. As the distance d in (3), the Mahalanobis distance for σ^2 C is used,

d(x, x_j) = sqrt( (x − x_j)^⊤ (σ^2 C)^{−1} (x − x_j) ).   (4)
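As an illustration of the local model, the sketch below fits a full quadratic model by ordinary least squares on the k Mahalanobis-nearest neighbors of x_j; note that the design matrix has D(D+3)/2 + 1 columns, the quantity m_dim used later in Subsection 3.3.1. The published lmm-CMA additionally weights the neighbors by distance, which is omitted here; the helper names are ours.

```python
import numpy as np
from itertools import combinations

def quad_design(X):
    """Full quadratic design matrix: 1, x_i, x_i^2, and all cross terms x_i x_j."""
    n, d = X.shape
    cols = [np.ones(n)] + [X[:, i] for i in range(d)] \
         + [X[:, i] ** 2 for i in range(d)] \
         + [X[:, i] * X[:, j] for i, j in combinations(range(d), 2)]
    return np.column_stack(cols)

def fit_local_quadratic(X_arch, y_arch, x_j, sigma, C, k):
    """Fit a full quadratic model on the k Mahalanobis-nearest neighbors of x_j."""
    C_inv = np.linalg.inv(sigma**2 * C)
    d2 = np.einsum('ij,jk,ik->i', X_arch - x_j, C_inv, X_arch - x_j)
    nn = np.argsort(d2)[:k]
    beta, *_ = np.linalg.lstsq(quad_design(X_arch[nn]), y_arch[nn], rcond=None)
    return beta

def predict_quadratic(beta, X):
    return quad_design(np.atleast_2d(X)) @ beta

# usage: local model around x_j = 0 on the sphere function, D = 2, k = 8
rng = np.random.default_rng(6)
Xa = rng.normal(size=(30, 2)); ya = np.sum(Xa**2, axis=1)
beta = fit_local_quadratic(Xa, ya, x_j=np.zeros(2), sigma=1.0, C=np.eye(2), k=8)
```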

The linear-quadratic model in the algorithm lq-CMA-ES differs from the lmm-CMA in an important aspect:

Whereas surrogate models in the lmm-CMA are always full quadratic, the lq-CMA-ES admits also pure quadratic or linear models. The employed kind of model depends on the number of points in T (according to the employed TSS method).

2.1.4  Gaussian Processes

A Gaussian process (GP) is a collection of random variables (f(x))_{x∈X}, X ⊆ R^D, any finite number of which has a joint Gaussian distribution (Rasmussen and Williams, 2006). The GP is completely defined by a mean function m: X → R, typically assumed to be some constant m_GP, and by a covariance function κ: X × X → R such that

m(x) = E[f(x)],    κ(x, x') = E[(f(x) − m(x))(f(x') − m(x'))].

The observed value at x is typically a noisy observation y = f(x) + ε, where ε is zero-mean Gaussian noise with a variance σ_n^2 > 0. Then

cov(y_p, y_q) = κ(x_p, x_q) + σ_n^2 I(p = q),

where I(p) equals 1 when a proposition p is true and 0 otherwise.
Consider now the prediction of the random variable f(x) at a point x ∈ X if we already know (B(x_1), ..., B(x_n)) = (f(x_1) + ε_1, ..., f(x_n) + ε_n) at points x_1, ..., x_n. Introduce the vectors X = (x_1, ..., x_n), y = (B(x_1), ..., B(x_n)), k = (κ(x_1, x), ..., κ(x_n, x)) and the matrix K ∈ R^{n×n} such that (K)_{i,j} = κ(x_i, x_j). Then the probability density of the vector y is

p(y) = (2π)^{−n/2} det(K + σ_n^2 I_n)^{−1/2} exp( −(1/2) (y − m_GP 1)^⊤ (K + σ_n^2 I_n)^{−1} (y − m_GP 1) ),

where det(A) denotes the determinant of a matrix A and 1 denotes the vector of ones. Further, as a consequence of the assumption of a joint Gaussian distribution, the conditional distribution of f(x) given y is also Gaussian, with mean and variance

E[f(x) | y] = m_GP + k^⊤ (K + σ_n^2 I_n)^{−1} (y − m_GP 1),
var[f(x) | y] = κ(x, x) − k^⊤ (K + σ_n^2 I_n)^{−1} k.
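The posterior mean and variance above can be computed directly; the sketch below does so for a squared-exponential covariance with fixed hyperparameters (hyperparameter fitting, as done in the studied models, is omitted); the function names are ours.

```python
import numpy as np

def se_kernel(A, B, sigma_f=1.0, length=1.0):
    """Squared-exponential covariance between row-wise point sets A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return sigma_f**2 * np.exp(-0.5 * d2 / length**2)

def gp_predict(X_train, y_train, X_test, sigma_n=0.1, m_gp=0.0):
    """GP posterior mean and variance at X_test (see the equations above)."""
    K = se_kernel(X_train, X_train) + sigma_n**2 * np.eye(len(X_train))
    Ks = se_kernel(X_train, X_test)
    Kss = se_kernel(X_test, X_test)
    L = np.linalg.cholesky(K)                       # numerically safer than inv(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train - m_gp))
    mean = m_gp + Ks.T @ alpha
    V = np.linalg.solve(L, Ks)
    var = np.diag(Kss) - np.sum(V**2, axis=0)
    return mean, var

# usage: noisy 1-D quadratic
rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(15, 1))
y = X[:, 0]**2 + 0.1 * rng.normal(size=15)
mu, var = gp_predict(X, y, np.linspace(-2, 2, 5)[:, None])
```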

2.1.5  Regression Forest

A regression forest (RF) (Breiman, 2001) is an ensemble of regression decision trees (Breiman, 1984). Recently, gradient tree boosting (Friedman, 2001) has been shown to be useful in connection with the CMA-ES (Pitra et al., 2018). Therefore, we will focus only on this method.

Let us consider binary regression trees, where each observation x passes through a series of binary split functions s associated with internal nodes and arrives at a leaf node containing a real-valued constant trained to be the prediction of the associated function value y. Let ŷ_i^(t) be the prediction of the i-th point after the t-th tree has been added. The t-th tree f_t is obtained in the t-th iteration of the boosting algorithm through optimization of the following objective:

obj^(t) = Σ_i l(y_i, ŷ_i^(t−1) + f_t(x_i)) + γ T_{f_t} + (α/2) ‖w_{f_t}‖^2,   (9)

where l is a differentiable convex loss function l: R^2 → R, T_f is the number of leaves in a tree f, w_f are the weights of its individual leaves, and α, γ ≥ 0 are penalization constants. The gain of a candidate split can be derived using (9) as follows (see Chen and Guestrin, 2016, for details):

gain = (1/2) [ (Σ_{y∈S_L} g(y))^2 / (Σ_{y∈S_L} h(y) + α) + (Σ_{y∈S_R} g(y))^2 / (Σ_{y∈S_R} h(y) + α) − (Σ_{y∈S_{L+R}} g(y))^2 / (Σ_{y∈S_{L+R}} h(y) + α) ] − γ,

where the set S_{L+R} is split into S_L and S_R, and g(y) = ∂_{ŷ^(t−1)} l(y, ŷ^(t−1)) and h(y) = ∂^2_{ŷ^(t−1)} l(y, ŷ^(t−1)) are the first- and second-order derivatives of the loss function.
The overall boosted forest prediction is obtained by averaging individual tree predictions, where each leaf j in the t-th tree has the weight

w_j = − Σ_{y∈S_j} g(y) / ( Σ_{y∈S_j} h(y) + α ),

where S_j is the set of all training inputs that end in leaf j. As a prevention of overfitting, random subsampling of input features or training data can be employed. The detailed parameters of the regression forest employed in this study are outlined in Subsection 3.4.1.
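For a concrete instance of these formulas, the sketch below computes the leaf weight and the split gain for the squared loss l(y, p) = (y − p)^2/2, for which g(y) = ŷ^(t−1) − y and h(y) = 1; the penalization constants α and γ are as in the text, and the function names are ours.

```python
import numpy as np

def grad_hess_squared_loss(y, pred_prev):
    """Derivatives of l(y, p) = (y - p)^2 / 2 w.r.t. the running prediction p."""
    return pred_prev - y, np.ones_like(y)

def leaf_weight(g, h, alpha):
    """Optimal weight of a leaf given the gradients/hessians of its points."""
    return -g.sum() / (h.sum() + alpha)

def split_gain(g, h, left_mask, alpha, gamma):
    """Gain of splitting one node into left/right children (cf. the gain formula above)."""
    gl, hl = g[left_mask].sum(), h[left_mask].sum()
    gr, hr = g[~left_mask].sum(), h[~left_mask].sum()
    score = lambda G, H: G**2 / (H + alpha)
    return 0.5 * (score(gl, hl) + score(gr, hr) - score(gl + gr, hl + hr)) - gamma

# usage: one boosting step from a constant prediction of 0
y = np.array([1.0, 1.2, 3.1, 2.9])
g, h = grad_hess_squared_loss(y, pred_prev=np.zeros_like(y))
print(split_gain(g, h, left_mask=np.array([True, True, False, False]),
                 alpha=0.0, gamma=0.0))
print(leaf_weight(g[:2], h[:2], alpha=0.0))   # weight of the left leaf
```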

2.2  Landscape Features

Let us consider a sample set S of N pairs of observations in the context of continuous black-box optimization, S = {(x_i, y_i) ∈ R^D × (R ∪ {∅}) | i = 1, ..., N}, where ∅ denotes a missing value of y_i (e.g., x_i has not been evaluated yet). The sample set can then be utilized to describe landscape properties using a landscape feature ϕ, a mapping that assigns to a sample set of arbitrary size N and dimension D a value in R ∪ {±∞, •}, where • denotes the impossibility of feature computation.

A large number of various landscape features have been proposed in recent years in literature. Mersmann et al. (2011) proposed six easy to compute feature sets (each containing a number of individual features for continuous domain) representing different properties: y-Distribution set with measures related to the distribution of the objective function values, Levelset features capturing the relative position of each value with respect to objective quantiles, Meta-Model features extracting the information from linear or quadratic regression models fitted to the sampled data, Convexity set describing the level of function landscape convexity, Curvature set with gradient and Hessian approximation statistics, and Local Search features related to local searches conducted from sampled points. The last three feature sets require additional objective function evaluations.

The cell-mapping (CM) approach proposed by Kerschke et al. (2014) discretizes the input space into a user-defined number of blocks (i.e., cells) per dimension. Afterwards, the corresponding features are based on the relations between the cells and the points within them. Six cell-mapping feature sets were defined: CM Angle, CM Convexity, CM Gradient Homogeneity, Generalized CM, Barrier, and SOO Trees by Flamm et al. (2002) and Derbel et al. (2019). It should be noted that the CM approach is less useful in higher dimensions, where the majority of cells are empty and feature computation can require a lot of time and memory.

Nearest better clustering (NBC) features proposed by Kerschke et al. (2015) are based on the detection of funnel structures. The calculation of such features relies on the comparison of distances from observations to their nearest neighbors and their nearest better neighbors, which are the nearest neighbors among the set of all observations with a better objective value. Lunacek and Whitley (2006) proposed the set of dispersion features comparing the dispersion among the data points and among subsets of these points from the sample set. The information content features of a continuous landscape by Muñoz, Kirley et al. (2015) are derived from methods for calculating the information content of discrete landscapes. In Kerschke (2017), three other feature sets were proposed: Basic set with features such as the number of points, search space boundaries, or dimension; the proportion of Principal Components for a given percentage of variance; and coefficients of Linear Model fitted in each cell. A comprehensive survey of landscape analysis can be found, for example, in Muñoz, Sun et al. (2015). A more detailed description of feature sets used in our research can be found in the online available supplementary material.1

In recent years, several surrogate-model approaches have been developed to increase the performance of the CMA-ES (cf. the overview in Pitra et al., 2017). Each such approach has two complementary aspects: the employed regression model and the so-called evolution control (Jin et al., 2001), managing when to evaluate the model and when the true objective function while replacing the fitness evaluation step (Step 4) in Algorithm 1. In Pitra et al. (2021), we have shown a significant influence of the surrogate model on the CMA-ES performance. Thus, we are interested in the relationships between the prediction error of surrogate models and various features of the fitness function landscape.

First, we state the problem and research question connected with the relations between surrogate models and landscape features. Then, we present our set of new landscape features based on the CMA-ES state variables. Afterwards, we investigate the properties of landscape features in the context of the training set selection methods and select the most convenient features for further research. Finally, we analyze relationships between the selected features and measured errors of the surrogate models with various settings.

3.1  Problem Statement

The problem can be formalized as follows: In a generation g of a surrogate-assisted version of the CMA-ES, a set of surrogate models M with hyperparameters θ is trained utilizing a particular choice of the training set T. The training set T is selected out of an archive A (T ⊆ A) containing all points in which the fitness has been evaluated so far, using some TSS method (see Section 2.1.2). Afterwards, M is tested on a set of points sampled from the CMA-ES distribution, X_te = {x_k | x_k ∼ N(m^(g), (σ^(g))^2 C^(g)), k = 1, ..., α}, where α ∈ N depends on the evolution control. The research question connected to this problem is: What relationships between the suitability of different models for predicting the fitness and the considered landscape features do the testing results indicate?

3.2  CMA-ES Landscape Features

To utilize additional information comprised in CMA-ES state variables, we have proposed a set of features based on the CMA-ES (Pitra et al., 2019).

In each CMA-ES generation g, during the fitness evaluation step (Step 4), the following additional features ϕ are obtained for a set of points X = {x_i}_{i=1}^N:

  • Generation number ϕ_generation^CMA = g indicates the phase of the optimization process.

  • Step-size ϕ_step_size^CMA = σ^(g) provides information about the extent of the approximated region.

  • Number of restarts ϕ_restart^CMA = n_r^(g) performed till generation g may indicate landscape difficulty.

  • Mahalanobis mean distance ϕ_cma_mean_dist^CMA of the CMA-ES mean m^(g) to the sample mean μ_X of X,
    ϕ_cma_mean_dist^CMA = sqrt( (m^(g) − μ_X)^⊤ Σ_X^{−1} (m^(g) − μ_X) ),
    where Σ_X is the sample covariance of X. This feature indicates the suitability of X for model training from the point of view of the current state of the CMA-ES algorithm.

  • Square of the p_c evolution path length, ϕ_evopath_c_norm^CMA = ‖p_c^(g)‖^2, is the only possible non-zero eigenvalue of the rank-one update covariance matrix p_c^(g+1) p_c^(g+1)⊤ (see Subsection 2.1.1). This feature, providing information about the correlations between consecutive CMA-ES steps, indicates a similarity of function landscapes among subsequent generations.

  • p_σ evolution path ratio ϕ_evopath_s_norm^CMA, that is, the ratio between the evolution path length ‖p_σ^(g)‖ and the expected length of a random evolution path, used to update the step-size. It provides useful information about distribution changes.

  • CMA similarity likelihood ϕ_cma_lik^CMA quantifies, on the logarithmic scale, the degree of agreement between the distribution of the set of points X and the current CMA-ES distribution; in particular, it is the log-likelihood of X with respect to the CMA-ES distribution in the present generation.
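A sketch of how such features can be computed from the CMA-ES state is given below; it follows the definitions above, with the CMA similarity likelihood taken as the Gaussian log-likelihood of the whole set X under N(m^(g), (σ^(g))^2 C^(g)), which is our reading of the definition. The function and dictionary keys are illustrative, not the authors' implementation.

```python
import numpy as np
from math import gamma, log, pi, sqrt

def cma_features(X, m, sigma, C, p_c, p_sigma, g, n_restarts):
    """Illustrative computation of the CMA-ES-based features defined above."""
    D = len(m)
    exp_norm = sqrt(2) * gamma((D + 1) / 2) / gamma(D / 2)   # E||N(0, I)||
    feats = {
        'generation': g,
        'step_size': sigma,
        'restarts': n_restarts,
        'evopath_c_norm': float(p_c @ p_c),                  # ||p_c||^2
        'evopath_s_norm': float(np.linalg.norm(p_sigma)) / exp_norm,
    }
    # Mahalanobis distance of the CMA-ES mean to the sample mean of X
    mu_X, Sigma_X = X.mean(axis=0), np.cov(X, rowvar=False)
    d = m - mu_X
    feats['cma_mean_dist'] = float(sqrt(d @ np.linalg.inv(Sigma_X) @ d))
    # log-likelihood of X under the current CMA-ES distribution N(m, sigma^2 C)
    cov = sigma**2 * C
    _, logdet = np.linalg.slogdet(cov)
    dev = X - m
    quad = np.einsum('ij,jk,ik->i', dev, np.linalg.inv(cov), dev)
    feats['cma_lik'] = float(-0.5 * (len(X) * (D * log(2 * pi) + logdet)
                                     + quad.sum()))
    return feats
```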

3.3  Landscape Features Investigation

Based on the studies in Bajer et al. (2019) and Pitra et al. (2021), we have selected the Doubly Trained Surrogate CMA-ES (DTS-CMA-ES) (Pitra et al., 2016) as a successful representative of surrogate-assisted CMA-ES algorithms.

During the model training procedure in the DTS-CMA-ES, shown in general in Algorithm 2, a set of training points T = (X_tr, y_tr) is selected, and transformations of X_tr using the matrix (1/σ) C^{−1/2} and of y_tr to zero mean and unit variance are calculated in Steps 2 and 3 before the model hyperparameters θ are fitted. In the case of a successful fitting procedure, the resulting model is tested for constancy on an extra generated population due to the CMA-ES restraints against stagnation. However, if the fitting procedure is unsuccessful, for example due to a method failure, the model is considered not trained. The GP model utilizes the MATLAB fmincon optimizer for hyperparameter fitting, and if fmincon fails due to an infeasible starting point or numerical problems, the CMA-ES is employed as an alternative (Bajer et al., 2019). Using the same transformation, the points in a testing set X_te are transformed before obtaining the model prediction of fitness values ŷ_te, which is then inversely transformed to the original output space.
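The transformations of Steps 2 and 3 can be sketched as follows, assuming the inputs are centered at the current mean m before applying (1/σ) C^{−1/2} (the exact centering is our assumption); the output transformation and its inverse are plain standardization.

```python
import numpy as np
from scipy.linalg import sqrtm

def make_transforms(m, sigma, C, y_train):
    """Input/output transformations used before model fitting (Steps 2-3):
    inputs are mapped by (1/sigma) C^{-1/2} (x - m), outputs to zero mean
    and unit variance; the output transform is inverted after prediction."""
    C_inv_sqrt = np.linalg.inv(np.real(sqrtm(C)))
    y_mu, y_std = y_train.mean(), y_train.std()

    def tx(X):                      # transform inputs
        return (X - m) @ C_inv_sqrt.T / sigma

    def ty(y):                      # transform outputs
        return (y - y_mu) / y_std

    def ty_inv(y_hat):              # inverse transform of predictions
        return y_hat * y_std + y_mu

    return tx, ty, ty_inv
```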

We accept hyperparameter fitting failures in our research because we prefer adopting established and published configurations to refining individual models. Specifically, for GPs and polynomial models, configurations were drawn from publications explicitly exploring model setups (Auger et al., 2013; Bajer et al., 2019; Hansen, 2019). For RFs, we followed Pitra et al. (2018), supplementing certain settings while actively investigating others (see Subsection 3.4.1). Although rerunning GP hyperparameter fitting could mitigate some failures, our intentional treatment of training as an automatic step aligns with our overarching goal: to comprehensively explore intricate relationships between the objective function features and the model.

3.3.1  Investigation Settings

To investigate landscape features in the context of surrogate-assisted optimization, we have generated2 a dataset D of sample sets using independent runs of the 8 model settings from Pitra et al. (2019) for the DTS-CMA-ES algorithm (Bajer et al., 2019; Pitra et al., 2016) on the 24 noiseless single-objective benchmark functions from the COCO framework (Hansen et al., 2016). All runs were performed in dimensions 2, 3, 5, 10, and 20 on instances 11–15. These instances all stem from the same base problem, and are obtained by translation and/or rotation in the input space and also translation of the objective values. The algorithm runs were terminated if the target fitness value 10-8 was reached or the budget of 250 function evaluations per dimension was depleted. Taking into account the double model training in DTS-CMA-ES, we have extracted archives A and testing sets X te only from the first model training. The DTS-CMA-ES was employed in the overall best non-adaptive settings from Bajer et al. (2019). The TSS methods were set with parameter values: k=min{2m dim ,|A|m dim }, where m dim =D(D+3)/2+1, for TSS knn as suggested by Auger et al. (2013), and r max =4Qχ2(0.99,D) and N max =20·D for TSS nearest according to the best settings for higher dimensions in (Bajer et al., 2019).

To obtain 100 comparable archives and testing sets for the landscape features investigation, we have generated a new collection of sample sets D100, where the points for new archives and new populations are created using a weighted sum of the original archive distributions from D. The g-th generated dataset uses the weight vector w^(g) = (1/9)(0, ..., 0_{g−3}, 1_{g−2}, 2_{g−1}, 3_g, 2_{g+1}, 1_{g+2}, 0_{g+3}, ..., 0), where the subscripts indicate the generations of the original distributions to which the weights are applied, which provides distribution smoothing across the available generations.3 Data from well-known benchmarks were also used by Saini et al. (2019) and Renau et al. (2020).

Because each TSS method in Subsection 2.1.2 results in a different training set T from an identical A, we have performed all the feature investigations for each TSS method separately. By combining the two basic sample sets for feature calculation, A and T, with a population P consisting of the points without a known value of the original fitness, to be evaluated by the surrogate model, we have obtained two new sets A∪P and T∪P. Step 2 of Algorithm 2, performing the transformation of the input, could also influence the landscape features. Thus, we have utilized both transformed and non-transformed sets for feature calculations, resulting in 8 different sample sets (4 in the case of TSS full due to T = A): A, A∪P, T, and T∪P, each either in the original or in the transformed coordinates.

From each generated run in D100, we have uniformly selected 100 generations. In each such generation, we have computed all features from the following feature sets for all three TSS methods on all sample sets: Φ_Basic, Φ_CMA, Φ_Dis (dispersion), Φ_Inf (information content), Φ_Lvl (Levelset), Φ_MM (Meta-Model), Φ_NBC (nearest better clustering), and Φ_y-D (y-Distribution).

These sets do not require additional evaluations of the objective function and can be computed even in higher dimensions. Some features were not computed, for the following reasons (see Section 2 in the supplementary material for feature definitions): From Φ_Basic, we have used only ϕ_dim and ϕ_obs because the remaining ones were constant on the whole D100 dataset; ϕ_dim is identical regardless of the sample set. ϕ_obs and all features from Φ_y-D were not computed using the transformation matrix because the transformation does not influence their resulting values. The feature ϕ_lin_simple_intercept^MM ∈ Φ_MM was excluded because it is useless if fitness normalization is performed (see Step 3 in Algorithm 2). The remaining features from Φ_MM, all features from Φ_NBC, and all three features from Φ_y-D were not computed on sample sets including P because P does not influence their resulting values.

For the rest of the paper, we will consider the features which are independent of the sample set (i.e., ϕ_dim and 5 features from Φ_CMA) as A-based features only. This results in total numbers of landscape features equal to 197 for TSS full (all of them A-based) and 388 for TSS nearest and TSS knn (of which 191 features were T-based).

3.3.2  Feature Analysis Process and Its Results

The results of experiments concerning landscape features are presented in Tables S1–S6 in the supplementary material. In the paper, we report only main highlights of the results in Tables 1 and 2.

Table 1:

Proportion of features for individual TSS methods with robustness greater than or equal to the threshold in the first column. The proportions in brackets represent T-based features for the given TSS (TSS full has only A-based features). The numbers in bold in the gray row are utilized for the following process.

threshold    TSS full    TSS nearest          TSS knn
0.5          125/195     244/384 (119/189)    188/366 (63/171)
0.6           82/195     158/384 (76/189)     131/366 (49/171)
0.7           54/195     102/384 (48/189)      93/366 (39/171)
0.8           43/195      80/384 (37/189)      73/366 (30/171)
0.9           33/195      60/384 (27/189)      59/366 (26/171)
0.99          28/195      50/384 (22/189)      30/366 (2/171)
Table 2:

Results of a medoid clustering of the features into 14 clusters for the individual TSS methods, according to their Schweizer-Wolff measure σ. The clusters are separated by horizontal lines and their medoid representatives are marked as gray lines. N denotes the minimal number of points for which the feature calculation resulted in • (i.e., the feature could not be computed) in at most 1% of cases. The third column shows the robustness: the proportion of measured cases in which the feature values of samples from the same CMA-ES distribution did not differ by more than 0.05.


First, we have investigated the impossibility of feature calculation (i.e., the feature value •) for each feature. Such information can be valuable, and we will consider it as a valid output of any feature. On the other hand, a large proportion of • values on the tested dataset suggests low usability of the respective feature. Therefore, we have excluded features yielding • in more than 25% of all measured values; these were, for all TSS methods, the feature ϕ_eps_s^Inf on A∪P and T∪P (both transformed and non-transformed), and for TSS knn also the features ϕ_ratio_mean_02^Dis, ϕ_ratio_median_02^Dis, ϕ_diff_mean_02^Dis, ϕ_diff_median_02^Dis on T and T∪P (both transformed and non-transformed), as well as ϕ_quad_w_interact_adj_r2^MM on T (both transformed and non-transformed). This decreased the numbers of features to 195 for TSS full, 384 for TSS nearest, and 366 for TSS knn. Many features are difficult to calculate using low numbers of points. Therefore, for each combination of feature and sample set we have measured the minimal number of points N for which the calculation resulted in • in at most 1% of cases. All measured values can be found in Tables S1–S6 in the supplementary material, and the values of the selected robust features (see below) in Table 2. For the calculation of most of the features, the CMA-ES default population size in 2D, N = λ_def = 4 + ⌊3 ln D⌋ = 6, or the initial point plus the doubled default population size in 2D, N = 1 + 2λ_def = 13, was sufficient in all sample sets including P. Considering sample-set-independent features, no points are needed, because their values concern the CMA-ES iteration or the CMA-ES run as a whole, not particular points. As can be seen from Tables S1–S6 in the supplementary material, the dispersion features Φ_Dis require more points to be computable as the quantile of function values used for splitting the sample set decreases (for the quantile values, see Section 2.3 in the supplementary material). Finally, ϕ_lin_w_interact_adj_r2^MM and ϕ_quad_w_interact_adj_r2^MM also have high values of N, which is plausible taking into account their definitions (see Section 2.6 in the supplementary material).

To enhance the comparability of the investigated features, we opted for a tailored normalization approach. Standardization, which scales features to have a mean of 0 and a standard deviation of 1, was deemed unsuitable for our features for several reasons. Firstly, standardization would lose the information about the overall measure of centrality and the overall variability of each landscape feature; this information is preserved when we choose normalization over standardization. Secondly, the features encompass a wide spectrum of ranges, from relatively narrow intervals such as [−1, 1] for ϕ_nb_cor^NBC to broad spans like [1.5·10^−10, 1.4·10^15] for ϕ_step_size^CMA. Additionally, some features even contain infinite values.

In response to these challenges, we adopted a nonlinear transformation using the sigmoid function to normalize all features into the [0, 1] interval. The chosen transformation is

ϕ_norm(ϕ) = 1 / (1 + e^{−(aϕ + b)}),

where the coefficients a and b are chosen such that ϕ_norm(Q_0.01) = 0.01 and ϕ_norm(Q_0.99) = 0.99, and Q_0.01 and Q_0.99 are the 0.01 and 0.99 quantiles of the feature ϕ on the whole D100 dataset, considering values ϕ ∈ R ∪ {±∞}.

Unlike standardization, the proposed way of nonlinear normalization, employing the sigmoid function, does not depend on the particular distribution of the feature, with only one exception—the 0.01 and 0.99 quantiles, which are mapped to ϕ norm (Q0.01)=0.01 and ϕ norm (Q0.99)=0.99. By using quantiles, we effectively cut off the tails of the feature distribution, mitigating the impact of outliers and infinite values. This approach transforms infinity values to 0 and 1, fostering a more comparable representation of features with substantial differences in value sets. This nonlinear transformation achieves a balance between trimming extreme values and retaining the core data, aligning with our objective to enhance feature comparability while accommodating the unique characteristics of the features.
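A minimal sketch of this normalization is given below, assuming a logistic sigmoid whose two coefficients are fixed by the conditions ϕ_norm(Q_0.01) = 0.01 and ϕ_norm(Q_0.99) = 0.99; other parameterizations satisfying the same conditions are possible, and the helper name is ours.

```python
import numpy as np

def make_sigmoid_normalizer(values, q_low=0.01, q_high=0.99):
    """Build a sigmoid normalizer phi_norm mapping the q_low/q_high quantiles
    of the observed feature values to q_low/q_high; infinities map to 0 or 1."""
    vals = np.asarray(values, dtype=float)
    finite = vals[np.isfinite(vals)]
    Q1, Q99 = np.quantile(finite, [q_low, q_high])
    # solve a*Q1 + b = logit(q_low) and a*Q99 + b = logit(q_high)
    logit = lambda p: np.log(p / (1 - p))
    a = (logit(q_high) - logit(q_low)) / (Q99 - Q1)
    b = logit(q_low) - a * Q1

    def phi_norm(phi):
        with np.errstate(over='ignore'):
            return 1.0 / (1.0 + np.exp(-(a * np.asarray(phi, dtype=float) + b)))
    return phi_norm

# usage: a heavy-tailed feature with an infinite value
vals = np.concatenate([np.random.default_rng(3).lognormal(size=1000), [np.inf]])
norm = make_sigmoid_normalizer(vals)
print(norm([np.min(vals[np.isfinite(vals)]), np.median(vals), np.inf]))
```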

We have tested the dependency of individual features on the dimension using feature medians from the 100 samples for each distribution from D100. The Friedman test rejected the hypothesis that the feature medians are independent of the dimension for all features at the family-wise significance level 0.05 using the Bonferroni-Holm correction. Moreover, for most of the features, the subsequently performed pairwise tests rejected the hypothesis of equality of feature medians for all pairs of dimensions. There were only a few features for which the hypothesis was not rejected for some pairs of dimensions (see Tables S1–S6 in the supplementary material). Therefore, the influence of the dimension on the vast majority of features is essential.

Our investigation into the impact of multiple landscape features on the predictive error of surrogate models necessitates a high level of robustness of these features against random sampling of points. Consistency is paramount in our analysis, as we aim to ensure that when the algorithm is in the same location under similar circumstances, the features provide comparable values or, at the very least, exhibit substantial similarity. This requirement stems from the problem statement, emphasizing the need for consistent feature behavior in identical or similar situations.

Additionally, it is essential to note that the features were initially designed for datasets generated using variations of uniform sampling and Latin hypercube design. Our current dataset, however, originates from a mixture of Gaussian distributions, each representing a distinct generation. This diversification in the sampling approach underscores the importance of robust features capable of maintaining consistency across different situations.

To have a robust set of k independent features, each returning identical values for the same input in 95% of cases, we would need all k features to be simultaneously identical in 100·0.95^k % of cases. Thus, for our dataset D100 with 100 samples for each distribution, even a small k would require all values to be identical, which is almost impossible to achieve for most of the investigated features. Therefore, we define feature robustness as the proportion of cases for which the difference between the 1st and 100th percentile, calculated after standardization on samples from the same CMA-ES distribution, is at most 0.05. Table 1 lists the numbers of features achieving different levels of robustness. We have selected the robustness threshold 0.9 for the subsequent analyses. The robustness calculated for individual features is listed in Tables S1–S6 in the supplementary material, and its values for features with robustness at least 0.9 in Table 2. The chosen level of robustness excluded from further computations all features from Φ_NBC and Φ_y-D for every TSS and all features from Φ_Inf for TSS full and TSS nearest. Both Φ_NBC and Φ_y-D are rather sensitive to the input data, where the influence of non-uniform sampling of the data is probably not negligible. Features from Φ_Inf have very varied robustness. Whereas ϕ_eps_max^Inf provides high robustness around 0.85 on transformed sample sets (up to 0.97 for TSS knn), computations of ϕ_m0^Inf and ϕ_eps_s^Inf resulted in robustness 0.004 and 0.016, respectively, which were the two lowest among all features. The majority of Φ_CMA features provided high robustness, caused mainly by the independence of most of these features of the sample set. Features from Φ_Lvl based on quadratic discriminant analysis (qda) showed nearly double the robustness of the remaining Φ_Lvl features. The transformation of the input increased the robustness of specific types of features from Φ_Dis and Φ_MM; in particular, it increased the robustness of difference-based features from Φ_Dis to more than 0.99 and also of features based on model coefficients from Φ_MM. TSS knn is more sample dependent; therefore, the number of points in a sample set can vary, which also decreases the robustness of some ratio-based features. On the other hand, coefficients of simple models from Φ_MM show robustness over 0.99 and ϕ_cma_mean_dist^CMA mostly over 0.9. The limited number of robust features supports the conclusion of Renau et al. (2020) that the distribution of the initial design notably influences landscape analysis. A noticeable dependence of robustness on which of the sets A, T, or P is used was not observed.
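The robustness measure can be sketched as follows for a single, already normalized feature, given its values on repeated samples from each CMA-ES distribution; the grouping of values per distribution and the function name are illustrative.

```python
import numpy as np

def feature_robustness(feature_values_per_distribution, tol=0.05):
    """Robustness of one (normalized) feature: the proportion of CMA-ES
    distributions for which the spread of the feature over repeated samples
    (1st to 100th percentile of the 100 values) is at most `tol`."""
    ok = 0
    for vals in feature_values_per_distribution:   # one array per distribution
        spread = np.percentile(vals, 100) - np.percentile(vals, 1)
        ok += spread <= tol
    return ok / len(feature_values_per_distribution)

# usage: a robust feature (tiny spread) vs. a non-robust one
rng = np.random.default_rng(4)
robust = [0.5 + 0.001 * rng.normal(size=100) for _ in range(50)]
noisy = [rng.uniform(size=100) for _ in range(50)]
print(feature_robustness(robust), feature_robustness(noisy))
```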

The large number of features suggests that, for the purpose of investigating their relationships to surrogate models, they should be clustered into a smaller number of groups of similar features. To this end, we have performed agglomerative hierarchical clustering according to 1 − σ_SW(ϕ_i, ϕ_j), where σ_SW is the measure σ by Schweizer and Wolff (1981) and ϕ_i, ϕ_j are the vectors of all medians from D100 for the features i and j. To compensate for the ordering-dependency of agglomerative hierarchical clustering, we have performed 5 runs of clustering for each TSS method to find the optimal number of clusters. The number of clusters obtained with a threshold of 0.9 on σ_SW, averaged over all 15 runs, was 14; that number was then used as the value of k for the subsequent k-medoid clustering, again using the Schweizer-Wolff measure σ as the similarity. The features that are medoids of those 14 clusters are listed in Table 2. Even such a small number of feature representatives can be sufficient for achieving excellent performance in a subsequent investigation (Hoos et al., 2018; Renau et al., 2021). A majority of clusters contain features from the same feature set. Sometimes, the whole cluster is composed of the same feature calculated only on different sample sets, which suggests that the influence of the sample set on feature calculation might be negligible in those clusters. On the other hand, feature clusters for TSS knn often all share the base sample set (A or T), or the same transformation, even if the features are not from the same feature set. Considering the large numbers of available features, it is worth noticing that most of the medoids are identical or at least very similar for all TSS methods. TSS full and TSS nearest medoids share identical features, where only 4 (ϕ_obs, ϕ_diff_median_02^Dis, ϕ_lda_qda_25^Lvl, ϕ_quad_simple_cond^MM) differ in the sample set (A vs. T, A vs. T, A∪P vs. T∪P, and A vs. T, respectively); notice that these sample sets differ only in using A or T. Moreover, 10 out of 14 representatives (ϕ_dim, ϕ_obs, ϕ_evopath_s_norm^CMA, ϕ_restart^CMA, ϕ_cma_lik^CMA, ϕ_diff_median_02^Dis, ϕ_diff_mean_05^Dis, ϕ_lda_qda_10^Lvl, ϕ_lda_qda_25^Lvl, ϕ_quad_simple_cond^MM) are the same for all considered TSS methods, whereas the sample sets utilized for feature calculation sometimes differ. Such similarity can indicate a great importance of those features for characterizing the fitness landscape in the CMA-ES surrogate modeling context.

3.4  Relationships of Landscape Features and Surrogate Models

To investigate the relationships between surrogate model errors and the 14 landscape features selected for each TSS method in the previous subsection, we have utilized 4 surrogate models: lq, lmm, GPs, and RFs, the latter two in 8 and 9 different settings, respectively. In Saini et al. (2019), Φ_MM features were utilized to investigate the connection with several surrogate models in default settings in a scenario where the model is selected once for a specific problem, regardless of the possible subsequent optimization process.

3.4.1  Settings

We have split the dataset D into validation and testing parts (D_val and D_test) on the covariance function level uniformly at random (considering the following levels of D: dimension, function, instance, covariance function). More specifically, for each dimension, function, and instance in D, we have uniformly selected runs using 7 covariance functions for D_test and 1 covariance function for D_val. In each of those runs, data from 25 uniformly selected generations were used.

The two error measures utilized in our research capture different aspects of model precision: The mean squared error (MSE) measures how much the model differs directly from the objective function landscape. The ranking difference error (RDE) (Bajer et al., 2019), on the other hand, reflects that the CMA-ES state variables are adjusted according to the ordering of the μ best points from the current population, due to the invariance of the CMA-ES with respect to monotonous transformations of the fitness. The RDE of y ∈ R^λ with respect to y' ∈ R^λ considering the μ best components is defined as the sum of the differences between the ranks given by y and the ranks given by y' over the μ best-ranked components, normalized by the maximal value this sum can attain, where (ρ(y))_i denotes the rank of y_i among the components of y.
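The sketch below computes an RDE under one possible convention: the sum of rank differences is taken over the μ points ranked best by the reference vector and normalized by the brute-force maximum of that sum; the exact convention of Bajer et al. (2019) may differ in details, so this is an illustration only.

```python
import numpy as np
from itertools import permutations

def rde(y_pred, y_true, mu):
    """Ranking difference error under one possible convention (an assumption,
    see the lead-in): sum of rank differences over the mu points ranked best
    by y_true, normalized by the maximum value this sum can attain."""
    lam = len(y_true)
    rank = lambda v: np.argsort(np.argsort(v)) + 1        # ranks 1..lambda, 1 = best
    r_pred, r_true = rank(np.asarray(y_pred)), rank(np.asarray(y_true))
    best = r_true <= mu
    err = int(np.abs(r_true[best] - r_pred[best]).sum())
    # maximal achievable value of the same sum (brute force; small lambda only)
    max_err = max(sum(abs((i + 1) - p[i]) for i in range(mu))
                  for p in permutations(range(1, lam + 1)))
    return err / max_err

# a perfect ranking gives 0, a strongly misranked prediction approaches 1
print(rde([0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4], mu=2))  # 0.0
print(rde([0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4], mu=2))  # 1.0
```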

The regression model from lmm-CMA was used in its improved version published by Auger et al. (2013). The lmm model operates with the transformation matrix in its own way, thus, the transformation step during training (Step 2 in Algorithm 2) is not performed for this model.

The linear-quadratic model was used in the version published by Hansen (2019). The original version utilizes all data without a transformation; therefore, the input data for lq model are not transformed for TSS full.

The GP regression model was employed in the version published by Bajer et al. (2019) using 8 different covariance functions described in Table 3.

Table 3:

Experimental settings of GP covariances: σ_0^2 – bias (constant) term, d – metric d(x_p, x_q), ℓ – characteristic length-scale, P – isotropic distance matrix P = ℓ^{−2} I, x̃_p, x̃_q – inputs augmented by a bias unit, and ℓ(x) – length-scale (linear) function of x.

name                                              covariance function
linear                                            κ_LIN(x_p, x_q) = σ_0^2 + x_p^⊤ x_q
quadratic                                         κ_Q(x_p, x_q) = σ_0^2 + (x_p^⊤ x_q)^2
squared-exponential                               κ_SE(d; σ_f, ℓ) = σ_f^2 exp(−d^2 / (2ℓ^2))
Matérn 5/2                                        κ_Mat(d; σ_f, ℓ) = σ_f^2 (1 + √5 d/ℓ + 5d^2/(3ℓ^2)) exp(−√5 d/ℓ)
rational quadratic                                κ_RQ(d; σ_f, ℓ) = σ_f^2 (1 + d^2/(2αℓ^2))^(−α), α > 0
neural network (Rasmussen and Williams, 2006)     κ_NN(x_p, x_q) = σ_f^2 arcsin( 2 x̃_p^⊤ P x̃_q / √((1 + 2 x̃_p^⊤ P x̃_p)(1 + 2 x̃_q^⊤ P x̃_q)) )
spatially varying length-scale (Gibbs, 1997)      κ_Gibbs(x_p, x_q) = σ_f^2 ( 2ℓ(x_p)ℓ(x_q) / (ℓ(x_p)^2 + ℓ(x_q)^2) )^(D/2) exp( −(x_p − x_q)^⊤(x_p − x_q) / (ℓ(x_p)^2 + ℓ(x_q)^2) )
composite                                         κ_SE+Q = κ_SE + κ_Q

The experimental settings of the RF model involved 5 splitting methods obtained from the algorithms CART (Breiman, 1984), SCRT (Dobra and Gehrke, 2002), OC1 (Murthy et al., 1994), PAIR (Chaudhuri et al., 1994), and SUPP (Hinton and Revow, 1996). These settings were determined using a Latin hypercube design with 100 samples from all combinations of the values in Table 4 for the number of trees in the RF (n_tree), the number of points bootstrapped out of the N training points (N^t), and the number of randomly subsampled dimensions used for training each individual tree (n_D). The maximum tree depth was set to 8, following Chen and Guestrin (2016). Additionally, α and γ were set to 0, indicating that no regularization is applied, in line with traditional gradient tree boosting (Friedman, 2001). The remaining decision tree parameters were kept identical to the settings from Pitra et al. (2018). The settings winning on the validation dataset D_val are shown in Table 5. Preliminary testing showed that RF performance in connection with the DTS-CMA-ES is higher when the input data are not transformed; thus, the transformation step is omitted during model training for both TSS methods.

Table 4:

Parameter values for Latin-hypercube design of experimental settings of RF: n tree —the number of trees in RF, Nt—the number of points bootstrapped out of N training points, and nD—the number of randomly subsampled dimensions used for training the individual tree.

parameter    values
n_tree       {2^6, 2^7, 2^8, 2^9, 2^10}
N^t          {1/4, 1/2, 3/4, 1} · N
n_D          {1/4, 1/2, 3/4, 1} · D
Table 5:

Experimental settings of RF for 5 splitting methods found using Latin-hypercube design on all combinations of the number of trees in RF n tree , the number of bootstrapped training points Nt, and the number of randomly subsampled dimensions nD. The winning settings on the validation dataset D val for two TSS methods and two error measures are shown in the format [n tree ,Nt,nD] (in case of Nt and nD are shown only the multipliers of N and D, respectively).

TSS       error measure    CART             SCRT              OC1            PAIR              SUPP
full      MSE              [2^10, 1, 3/4]   [2^8, 1/4, 1/2]   [2^7, 1, 1]    [2^7, 1, 3/4]     [2^7, 1, 1/4]
full      RDE              [2^6, 3/4, 1]    [2^8, 3/4, 1]     [2^7, 1, 1]    [2^10, 3/4, 1]    [2^6, 3/4, 1]
nearest   MSE              [2^8, 1, 3/4]    [2^9, 1/2, 1]     [2^7, 1, 1]    [2^7, 1, 3/4]     [2^10, 1/4, 1]
nearest   RDE              [2^10, 3/4, 1]   [2^8, 3/4, 1]     [2^7, 1, 1]    [2^10, 3/4, 1]    [2^6, 3/4, 1]

3.4.2  Results

The results of analyzing the relationships between landscape features and the two error measures for the selected surrogate models with the appropriate settings are presented in Figures 1–3, Tables 6 and 7, and Tables S7–S13 in the supplementary material.
Figure 1:

The visualization of the p-values of the Kolmogorov-Smirnov test comparing the equality of probability distributions of individual features on all data and on those data on which a particular model setting scored best. Non-red-colored squares denote KS test results rejecting the equality of distributions with the Holm correction at the family-wise significance level α=0.05; otherwise, p-values are visualized as red squares. TSS knn was employed only in connection with lmm model.

Table 6:

Percentages of cases when the model did not provide a usable prediction (model not trained, its prediction failed, or the prediction is constant). The cases when TSS knn was not utilized in connection with the particular model are denoted as "—" (see Subsection 2.1.2).

TSS       GP:  Gibbs  LIN   Mat   NN    Q     RQ   SE   SE+Q   RF:  CART (MSE/RDE)  SCRT (MSE/RDE)  OC1    PAIR (MSE/RDE)  SUPP (MSE/RDE)   lmm    lq
full           50.3   8.4   26.7  7.6   22.1  5.3  8.5  3.3         5.2 / 9.6        2.3 / 20.4      17.9   18.8 / 2.3      7.0 / 23.6       8.8    2.5
nearest        40.9   29.5  27.8  24.1  19.8  4.5  7.6  3.1         23.0 / 22.7      23.4 / 23.3     23.4   22.9 / 23.0     23.7 / 23.1      10.3   2.4
knn            —      —     —     —     —     —    —    —           —                —               —      —               —                5.5    —

For RF, the two values per splitting method correspond to the settings tuned for MSE and RDE, respectively (see Table 5).
Table 7:

A pairwise comparison of the RDE of the model settings in different TSS methods. The percentage of wins of the i-th model setting against the j-th model setting over all available data is given in the i-th row and j-th column. The numbers in bold mark that the row model setting is significantly better than the column model setting according to the two-sided Wilcoxon signed rank test with the Holm correction at the family-wise significance level α = 0.05.

Figure 2:

Classification tree analyzing the influence of landscape features on the most suitable model and its setting using TSS full.

Figure 3:

Classification tree analyzing the influence of landscape features on the most suitable model and its setting using TSS nearest.


We have summarized in Table 6 the cases when the model did not provide a prediction, that is, when the error value is not available. Such cases can occur when hyperparameter fitting fails, fitness prediction fails, or the model is considered constant (Steps 4, 7, and 9 in Algorithm 2, respectively).

The numbers in Table 6 basically confirm that more complex methods are more likely to fail. Whereas the lq model, the simplest among all tested, provided predictions in almost 98% of all cases, with the missing results attributable to constant predictions, the GP model with the Gibbs covariance function provided on average only 55% of the required predictions. Generally, the models were able to provide predictions more often when using TSS full than when using TSS nearest, probably due to the locality of the training set: it is easier to end up with a constant model on a smaller area, where the differences between objective values are more likely negligible. TSS knn was the most successful selection method for lmm, probably because it was designed directly for this model. In the following investigation, the cases where the error values were missing for a particular model were excluded from all comparisons involving that model.

We have tested the statistical significance of the MSE and RDE differences for the 19 surrogate model settings using the TSS full and TSS nearest methods and also the lmm surrogate model utilizing TSS knn, that is, 39 different combinations of model settings ψ and TSS methods (ψ, TSS), on all available sample sets using the Friedman test. The resulting p-values of the two Friedman tests, one for each error measure, are below the smallest double-precision number. A pairwise comparison of the model settings with respect to MSE and RDE revealed significant differences among the vast majority of pairs of model settings according to the non-parametric two-sided Wilcoxon signed rank test with the Holm correction for the family-wise error. To better illustrate the differences between individual settings, we also count the percentage of cases in which one setting had a lower error than the other. The pairwise score and the statistical significance of the pairwise differences are summarized in Table 7 for RDE and in Tables S7 and S8 in the supplementary material for both error measures. The results of the statistical tests confirmed that the obtained MSE and RDE values are sufficiently diverse for further investigation of model settings suitability. The best overall results were provided by GPs with all covariances except LIN. In particular, the GP using SE+Q as the covariance function with TSS nearest significantly outperformed all other (ψ, TSS) combinations. The polynomial models using TSS nearest take second place in this comparison, being outperformed only by the GP models mentioned above. Generally, models using TSS nearest provided better results than when using TSS full (in the case of lmm, also better than TSS knn). The percentages for RDE show smaller differences in precision than those for MSE due to the larger probability of error equality on the limited number of possible RDE values compared to the infinite number of possible MSE values. The significant variations in error values across all model settings, especially when employing different TSS methods for both error measures, emphasize the sensitivity of surrogate model performance to the selected configurations.

To compare the suitability of individual features as descriptors, we separated the areas where the surrogate model with a particular (ψ, TSS) combination has the best performance. Subsequently, we compared the distribution of each feature value on the whole D_test with its distribution on the selected part using the Kolmogorov-Smirnov test (KS test) for MSE and RDE separately. The hypothesis of distribution equality is tested at the family-wise significance level α = 0.05 after the Holm correction. The resulting p-values, summarized in Tables S9–S13 in the supplementary material and visualized in color in Figure 1, show significant differences in distribution among combinations of model settings and TSS methods (ψ, TSS) for the vast majority of considered features. Features from Φ_Dis and ϕ_step_size^CMA provided the most significant differences for almost all models. For the GP models and the sample sets leading to the lowest MSE, the p-values were often even below the smallest double-precision number. These features also provided very low p-values for the lmm and lq models. The only exception is ϕ_diff_median_02^Dis(T), providing notably higher p-values for all model settings. Moreover, features calculated on T-based sample sets for TSS nearest provided in most cases higher p-values than those calculated on A-based sample sets. From the model point of view, the differences among RF model settings are much smaller than among the rest of the settings for all the tested features. This might suggest a lower ability to distinguish between individual RF settings, possibly due to the low performance of these settings on the dataset. The p-values for the MSE and the RDE also differ notably, mainly due to the different ranges of values of these two error measures. As a consequence, p-values for MSE suggest the distinctive power of individual features on model precision more clearly, regardless of the fact that RDE is more convenient for CMA-ES-related problems. Overall, the results of the KS test have shown that there is no tested landscape feature representative for which the selection of the covariance function would be negligible for the vast majority of model settings. In essence, the individualized optimization approach, crucial for adapting model settings to observed distribution patterns, is strengthened by the persistent and robust influence of these patterns across diverse error measures. This consistency increases confidence in their universal applicability, extending across different methodologies and training set selection methods.

To perform a multivariate statistical analysis, we have built two classification trees (CTs): one for TSS full and one for TSS nearest. The data for each TSS method, described by the 14 features selected in Subsection 3.3, were divided into 19 classes according to which of the 19 considered surrogate model settings achieved the lowest RDE. In case of a tie, the model with the lowest MSE among the models with the lowest RDE was chosen. The CT trained on those data is the MATLAB implementation of CART (Breiman, 1984), where the minimal number of cases in a leaf was set to 5000, the twoing rule was used as the splitting criterion, ϕ_dim was treated as a discrete variable, and the remaining 13 features were treated as continuous. The resulting CTs are depicted in Figures 2 and 3.
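As an illustration of this multivariate analysis, the sketch below fits a classification tree with scikit-learn on placeholder data; note that scikit-learn does not offer the twoing splitting rule of the MATLAB CART used here, so the Gini criterion is used instead, and the feature matrix and class labels are random stand-ins.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# `features` stands in for the (n_samples, 14) matrix of landscape features and
# `best_setting` for the index (0..18) of the model setting with the lowest RDE.
rng = np.random.default_rng(5)
features = rng.normal(size=(20000, 14))          # placeholder feature matrix
best_setting = rng.integers(0, 19, size=20000)   # placeholder class labels

ct = DecisionTreeClassifier(criterion='gini',
                            min_samples_leaf=5000)  # minimal leaf size as in the text
ct.fit(features, best_setting)
print(export_text(ct, feature_names=[f'phi_{i}' for i in range(14)]))
```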

The most successful kind of model is the GP, present in most of the leaves, which confirms the leading role observed in the pairwise comparisons of model-setting errors. The Gibbs and Mat 5/2 covariances constitute the winning GP settings despite the fact that both provided the lowest numbers of predictions (see Table 6). This is probably caused by the removal of cases in which the predictions of any model setting were missing. The winning RF setting is the OC1 method, which labels the most frequently selected leaf of the CT for TSS full. The polynomial models are not present in the resulting CT for TSS nearest. The feature ϕdiff_mean_25 Dis (AP) possibly plays the most important role in the CT for TSS full, appearing in the root node as well as in 4 other nodes, and it is also very important in the TSS nearest CT, where it appears in the root node. This confirms the very strong decisive role of Φ Dis features indicated by the results of the KS test. The basic features ϕdim and ϕobs appear in a few nodes, where the decisions lead to a GP model if both feature values are small and to an RF model otherwise. This dependency is consistent with the better ability of RF models, compared to GP models, to preserve prediction quality with growing dimension. Features from Φ Lvl are also very important, at least in connection with TSS nearest. The Φ CMA features are present only in very few nodes. We can observe a connection between the feature ϕquad_simple_cond MM (A) and the lq model in the CT for TSS full, which suggests a possible ability of Φ MM features to detect when the use of a polynomial model is advantageous. In conclusion, these insights into the connections between features and model settings have the potential to notably improve the efficiency of surrogate model utilization, offering valuable contributions to the design of metalearning-based systems in the context of surrogate-assisted evolutionary optimization algorithms.

This paper presents a study of the relationships between the prediction error of surrogate models, their settings, and features describing the fitness landscape during evolutionary optimization by the CMA-ES, the state-of-the-art optimizer for expensive continuous black-box tasks. The ultimate aim of this study was to lay the groundwork for the creation of specific rules and automated methodologies for the selection and tuning of surrogate models. Through a comprehensive exploration of the relationships between landscape features and model errors, we attempt to establish a basis that enables the development of adaptive and intelligent optimization strategies. Four models in 39 different settings for the DTS-CMA-ES, a surrogate-assisted version of the CMA-ES, were compared using three sets of 14 landscape features selected according to their properties, especially their robustness and similarity to other features. We analyzed the dependence of the MSE and RDE of the various models and model settings on features calculated using three methods for selecting sample sets extracted from DTS-CMA-ES runs on noiseless benchmarks from the COCO framework.

To our knowledge, this analysis of landscape feature properties and their connection to surrogate models in evolutionary optimization is more profound than any published so far. The crucial distinction from landscape analysis conducted solely for the CMA-ES lies in our approach: whereas other publications typically select a specific algorithm at the outset and then let it run, we analyze the behavior of a particular surrogate model within each generation.

Although our approach has been developed in the context of the DTS-CMA-ES, most features, excluding those intricately tied to the CMA-ES, can be used in connection with black-box optimization in general. It is essential to acknowledge the influence of the surrogate model's evolution control, as highlighted in Pitra et al. (2021), particularly concerning dataset creation. Despite these nuances, the portability of our methodology remains preserved. The principles and insights gleaned from our study can likely be extended to various surrogate-assisted optimization contexts, laying a foundation for broader applications within this domain.

To facilitate the identification of the algorithm with the highest performance on the optimized function, most landscape features were designed to be calculated on some initial sample set drawn from a more or less uniform distribution covering the entire search space. In this paper, we operate on a more local level, using only data from actual runs of the optimization algorithm, generated in each generation from a Gaussian distribution, that is, data completely different from the original design of those features. Therefore, the low number of robust and mostly mutually similar features confirms the finding of Renau et al. (2020) that the distribution of the initial design has a notable impact on landscape analysis.

The influence of dimensionality on the majority of features proves to be essential. For no feature were more than four pairs of dimensions mutually independent, emphasizing the intricate relationship between feature behavior and dimensionality. Considering the dimensionality of the optimization problem thus becomes crucial and provides guidance for selecting features that consistently offer insights across different dimensions. Moreover, exploring methods to normalize or mitigate the impact of dimensionality on features emerges as a promising avenue for enhancing the reliability of landscape analysis in various optimization contexts.

The observation that numerous landscape features within the same feature group exhibit similarity aligns with expectations, given that features within a group often share conceptual or computational foundations. This suggests that utilizing representatives from these feature groups may be sufficient for analysis, simplifying the feature selection process. Additionally, the consistency among medoids for different TSS methods underscores the robustness of certain features. For instance, TSS full and TSS nearest medoids share identical features (ϕobs, ϕdiff_median_02 Dis , ϕlda_qda_25 Lvl , ϕquad_simple_cond MM ), with only slight variations related to sample sets. Furthermore, 10 out of 14 representatives (ϕdim, ϕobs, ϕevopath_s_norm CMA , ϕrestart CMA , ϕcma_lik CMA , ϕdiff_median_02 Dis , ϕdiff_mean_05 Dis , ϕlda_qda_10 Lvl , ϕlda_qda_25 Lvl , ϕquad_simple_cond MM ) remain consistent across all TSS methods, indicating their potential importance in characterizing the fitness landscape in the CMA-ES surrogate modeling context.

Additionally, a substantial number of features were found not to be robust, as indicated in Tables S1–S6 in the supplementary material. Interestingly, there was no noticeable dependence of robustness on whether the archive A, training set T, or prediction set P was used.

The observed substantial differences in error values among all 39 model settings, particularly with varying TSS methods for both MSE and RDE , underscore the sensitivity of surrogate model performance to chosen configurations. Notably, GP model settings demonstrated the highest performance, followed by polynomial models.

For most of the investigated features, the distributions on sample sets where a particular model setting achieved the lowest error values differed significantly from the overall distribution of all data, for both error measures and all three TSS methods. This relationship emphasizes the importance of tailoring model settings to the observed distribution patterns for an individualized approach to optimization tasks. The observed consistency of the distributions across both error measures increases confidence in their robustness, indicating a universal impact and offering a foundational understanding applicable to diverse methodologies and training set selection methods.

Finally, the overall results have shown the dispersion features to be highly influential on the error values of the model settings, followed by features based on CMA-ES state variables, features describing the similarity of the fitness function to some simple model, features splitting the space according to the black-box function values, and simple features such as the dimension and the number of observations. Prioritizing these features could enhance the efficiency of surrogate model usage.

The results of this research could be utilized to design a metalearning-based system (Kerschke et al., 2019; Saini et al., 2019) for the selection of surrogate models in the context of surrogate-assisted evolutionary optimization algorithms. The landscape features investigated in this paper are known in metalearning as metafeatures. While our focus has been on the CMA-ES due to its state-of-the-art status, the methodology is extensible to other algorithms. Exploring a broader array of models and algorithms, such as those proposed by Loshchilov et al. (2013), could enhance the scope of our findings. It is worth noting that the available information is constrained by the nature of black-box optimization: if additional evaluations could be obtained easily, surrogate modeling itself would become unnecessary, and features requiring such additional evaluations could then be employed.

The research reported in this paper has been supported by the German Research Foundation (DFG) funded project 467401796.

2. Source codes covering the generation of all mentioned datasets and all experiments are available at https://github.com/bajeluk/surrogate-cmaes/tree/meta.

3. The weighted sum of the original archive distributions satisfies
$$\sum_{n=0}^{g_{\max}} w_n^{(g)}\,\mathcal{N}\!\left(m^{(n)}, C^{(n)}\right) = \mathcal{N}\!\left(\sum_{n=0}^{g_{\max}} w_n^{(g)} m^{(n)},\ \sum_{n=0}^{g_{\max}} \bigl(w_n^{(g)}\bigr)^{2} C^{(n)}\right),$$
where $g_{\max}$ is the maximal generation reached by the considered original archive and $m^{(n)}$ and $C^{(n)}$ are the mean and the covariance matrix in CMA-ES generation $n$.
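As a sanity check of this identity, the following minimal sketch (illustrative values only; it assumes independent Gaussian draws per generation) compares the empirical mean and covariance of the weighted sum with the predicted ones by Monte Carlo simulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-D example with three generations and weights w_n.
means = [np.array([0.0, 0.0]), np.array([1.0, -1.0]), np.array([2.0, 0.5])]
covs = [np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.3])]
w = np.array([0.5, 0.3, 0.2])

# Draw independent samples X_n ~ N(m_n, C_n) and form the weighted sum.
samples = sum(w[n] * rng.multivariate_normal(means[n], covs[n], size=100_000)
              for n in range(3))

# Moments predicted by the identity in footnote 3.
mean_pred = sum(w[n] * means[n] for n in range(3))
cov_pred = sum(w[n] ** 2 * covs[n] for n in range(3))

# Both checks should pass up to Monte Carlo error.
print(np.allclose(samples.mean(axis=0), mean_pred, atol=1e-2))
print(np.allclose(np.cov(samples.T), cov_pred, atol=1e-2))
```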

Auger, A., Brockhoff, D., and Hansen, N. (2013). Benchmarking the local metamodel CMA-ES on the noiseless BBOB'2013 test bed. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pp. 1225–1232.
Auger, A., and Hansen, N. (2005). A restart CMA evolution strategy with increasing population size. Proceedings of the Congress on Evolutionary Computation, Vol. 2, pp. 1769–1776.
Baerns, M., and Holeňa, M. (2009). Combinatorial development of solid catalytic materials. Design of high-throughput experiments, data analysis, data mining. Imperial College Press/World Scientific, London.
Bajer, L., Pitra, Z., Repický, J., and Holeňa, M. (2019). Gaussian process surrogate models for the CMA evolution strategy. Evolutionary Computation, 27(4):665–697.
Breiman, L. (1984). Classification and regression trees. Chapman & Hall/CRC.
Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.
Büche, D., Schraudolph, N. N., and Koumoutsakos, P. (2005). Accelerating evolutionary algorithms with Gaussian process fitness function models. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 35(2):183–194.
Chaudhuri, P., Huang, M.-C., Loh, W.-Y., and Yao, R. (1994). Piecewise-polynomial regression trees. Statistica Sinica, 4(1):143–167.
Chen, T., and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Knowledge Discovery in Databases, pp. 785–794.
Derbel, B., Liefooghe, A., Verel, S., Aguirre, H., and Tanaka, K. (2019). New features for continuous exploratory landscape analysis based on the SOO tree. Foundations of Genetic Algorithms, pp. 72–86.
Dobra, A., and Gehrke, J. (2002). SECRET: A scalable linear regression tree algorithm. Knowledge Discovery in Databases, pp. 481–487.
Flamm, C., Hofacker, I. L., Stadler, P. F., and Wolfinger, M. T. (2002). Barrier trees of degenerate landscapes. Zeitschrift für Physikalische Chemie International Journal of Research in Physical Chemistry and Chemical Physics, 216(2):155–173.
Forrester, A., and Keane, A. (2009). Recent advances in surrogate-based optimization. Progress in Aerospace Sciences, 45:50–79.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232.
Gibbs, M. N. (1997). Bayesian Gaussian processes for regression and classification. PhD thesis, Department of Physics, University of Cambridge.
Hansen, N. (2006). The CMA evolution strategy: A comparing review. In Towards a new evolutionary computation, studies in fuzziness and soft computation, pp. 75–102. Springer.
Hansen, N. (2019). A global surrogate assisted CMA-ES. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pp. 664–672.
Hansen, N., Auger, A., Mersmann, O., Tusar, T., and Brockhoff, D. (2016). COCO: A platform for comparing continuous optimizers in a black-box setting. arXiv:1603.08785.
Hansen, N., and Ostermeier, A. (1996). Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation. Proceedings of the Congress on Evolutionary Computation, pp. 312–317.
Hinton, G. E., and Revow, M. (1996). Using pairs of data-points to define splits for decision trees. In Advances in neural information processing systems, Vol. 8, pp. 507–513. MIT Press.
Hoos, H. H., Peitl, T., Slivovsky, F., and Szeider, S. (2018). Portfolio-based algorithm selection for circuit QBFs. Lecture Notes in Computer Science Principles and Practice of Constraint Programming, pp. 195–209.
Hosder, S., Watson, L., and Grossman, B. (2001). Polynomial response surface approximations for the multidisciplinary design optimization of a high speed civil transport. Optimization and Engineering, 2:431–452.
Jin, Y. (2011). Surrogate-assisted evolutionary computation: Recent advances and future challenges. Swarm and Evolutionary Computation, 1:61–70.
Jin, Y., Olhofer, M., and Sendhoff, B. (2001). Managing approximate models in evolutionary aerodynamic design optimization. Proceedings of the Congress on Evolutionary Computation, pp. 592–599.
Kern, S., Hansen, N., and Koumoutsakos, P. (2006). Local meta-models for optimization using evolution strategies. Parallel Problem Solving from Nature, pp. 939–948.
Kerschke, P. (2017). Comprehensive feature-based landscape analysis of continuous and constrained optimization problems using the R-package flacco. arXiv:1708.05258.
Kerschke, P., Hoos, H. H., Neumann, F., and Trautmann, H. (2019). Automated algorithm selection: Survey and perspectives. Evolutionary Computation, 27(1):3–45.
Kerschke, P., Preuss, M., Hernández, C., Schütze, O., Sun, J.-Q., Grimme, C., Rudolph, G., Bischl, B., and Trautmann, H. (2014). Cell mapping techniques for exploratory landscape analysis. EVOLVE V, pp. 115–131.
Kerschke, P., Preuss, M., Wessing, S., and Trautmann, H. (2015). Detecting funnel structures by means of exploratory landscape analysis. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pp. 265–272.
Kruisselbrink, J., Emmerich, M. T. M., Deutz, A., and Back, T. (2010). A robust optimization approach using Kriging metamodels for robustness approximation in the CMA-ES. Proceedings of the Congress on Evolutionary Computation, pp. 1–8.
Lee, H., Jo, Y., Lee, D., and Choi, S. (2016). Surrogate model based design optimization of multiple wing sails considering flow interaction effect. Ocean Engineering, 121:422–436.
Loshchilov, I., Schoenauer, M., and Sebag, M. (2013). Intensive surrogate model exploitation in self-adaptive surrogate-assisted CMA-ES (saACM-ES). In Genetic and Evolutionary Computation Conference (GECCO), pp. 439–446.
Lunacek, M., and Whitley, D. (2006). The dispersion metric and the CMA evolution strategy. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pp. 477–484.
Mersmann, O., Bischl, B., Trautmann, H., Preuss, M., Weihs, C., and Rudolph, G. (2011). Exploratory landscape analysis. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pp. 829–836.
Muñoz, M. A., Kirley, M., and Halgamuge, S. K. (2015). Exploratory landscape analysis of continuous space optimization problems using information content. IEEE Transactions on Evolutionary Computation, 19(1):74–87.
Muñoz, M. A., Sun, Y., Kirley, M., and Halgamuge, S. K. (2015). Algorithm selection for black-box continuous optimization problems. Information Sciences, 317(C):224–245.
Murthy, S. K., Kasif, S., and Salzberg, S. (1994). A system for induction of oblique decision trees. Journal of Artificial Intelligence Research, 2(1):1–32.
Myers, R., Montgomery, D., and Anderson-Cook, C. (2009). Response surface methodology: Process and product optimization using designed experiments. John Wiley & Sons.
Pitra, Z., Bajer, L., and Holeňa, M. (2016). Doubly trained evolution control for the Surrogate CMA-ES. Parallel Problem Solving from Nature, pp. 59–68.
Pitra, Z., Bajer, L., Repický, J., and Holeňa, M. (2017). Overview of surrogate-model versions of covariance matrix adaptation evolution strategy. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pp. 1622–1629.
Pitra, Z., Hanuš, M., Koza, J., Tumpach, J., and Holeňa, M. (2021). Interaction between model and its evolution control in surrogate-assisted CMA evolution strategy. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pp. 528–536.
Pitra, Z., Repický, J., and Holeňa, M. (2018). Boosted regression forest for the doubly trained surrogate covariance matrix adaptation evolution strategy. Proceedings of the 18th Conference of Information Technologies—Applications and Theory, pp. 72–79.
Pitra, Z., Repický, J., and Holeňa, M. (2019). Landscape analysis of Gaussian process surrogates for the covariance matrix adaptation evolution strategy. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pp. 691–699.
Rasheed, K., Ni, X., and Vattam, S. (2005). Methods for using surrogate models to speed up genetic algorithm optimization: Informed operators and genetic engineering. In Knowledge Incorporation in Evolutionary Computation, pp. 103–123.
Rasmussen, C. E., and Williams, C. K. I. (2006). Gaussian processes for machine learning. MIT Press.
Renau, Q., Doerr, C., Dréo, J., and Doerr, B. (2020). Exploratory landscape analysis is strongly sensitive to the sampling strategy. Parallel Problem Solving from Nature, pp. 139–153.
Renau, Q., Dréo, J., Doerr, C., and Doerr, B. (2019). Expressiveness and robustness of landscape features. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pp. 2048–2051.
Renau, Q., Dréo, J., Doerr, C., and Doerr, B. (2021). Towards explainable exploratory landscape analysis: Extreme feature selection for classifying BBOB functions. EvoStar '21, pp. 17–33.
Saini, B. S., Lopez-Ibanez, M., and Miettinen, K. (2019). Automatic surrogate modelling technique selection based on features of optimization problems. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pp. 1765–1772.
Saleem, S., and Gallagher, M. (2022). Using regression models for characterizing and comparing black box optimization problems. Swarm and Evolutionary Computation, 68:100981.
Schweizer, B., and Wolff, E. F. (1981). On nonparametric measures of dependence for random variables. Annals of Statistics, 9(4):879–885.
Thomaser, A., Vogt, M.-E., Kononova, A. V., and Bäck, T. (2023). Transfer of multi-objectively tuned CMA-ES parameters to a vehicle dynamics problem. Proceedings of Evolutionary Multi-Criterion Optimization, pp. 546–560.
Zaefferer, M., Gaida, D., and Bartz-Beielstein, T. (2016). Multi-fidelity modeling and optimization of biogas plants. Applied Soft Computing, 48:13–28.