Surrogate modeling has become a valuable technique for black-box optimization tasks with expensive evaluation of the objective function. In this paper, we investigate the relationships between the predictive accuracy of surrogate models, their settings, and features of the black-box function landscape during evolutionary optimization by the covariance matrix adaptation evolution strategy (CMA-ES) state-of-the-art optimizer for expensive continuous black-box tasks. This study aims to establish the foundation for specific rules and automated methods for selecting and tuning surrogate models by exploring relationships between landscape features and model errors, focusing on the behavior of a specific model within each generation in contrast to selecting a specific algorithm at the outset. We perform a feature analysis process, identifying a significant number of non-robust features and clustering similar landscape features, resulting in the selection of 14 features out of 384, varying with input data selection methods. Our analysis explores the error dependencies of four models across 39 settings, utilizing three methods for input data selection, drawn from surrogate-assisted CMA-ES runs on noiseless benchmarks within the comparing continuous optimizers (COCO) framework.
1 Introduction
When solving a real-world optimization problem, we often have no information about the analytic form of the objective function. Evaluation of such black-box functions is frequently expensive in terms of time and money (Baerns and Holeňa, 2009; Lee et al., 2016; Thomaser et al., 2023; Zaefferer et al., 2016), which has been for two decades the driving force of research into surrogate modeling of black-box objective functions (Büche et al., 2005; Forrester and Keane, 2009; Jin, 2011). Given a set of observations, a surrogate model can be fitted to approximate the landscape of the black-box function.
The covariance matrix adaptation evolution strategy (CMA-ES) by Hansen (2006), which we consider the state-of-the-art evolutionary black-box optimizer, has been frequently combined with surrogate models. However, different surrogate models can show significant differences in the performance of the same optimizer on different data (Pitra et al., 2021). Such performance can also change during the algorithm run due to the varying landscape of the black-box function in different parts of the search space. Moreover, the predictive accuracy of individual models is strongly influenced by the choice of their parameters.
Our central goal in this investigation is to uncover and comprehend the intricate relationships between the configuration of individual models and data about properties and behavior of the objective function gathered during optimization runs and employed for surrogate model training. This data, representing a subset of solutions explored by the algorithm, constitutes the core of our analysis. The primary objective behind this endeavor is to facilitate the development of specific rules and automated approaches for selecting and tuning surrogate models. By gaining a deep understanding of these relationships, we aim to provide a foundation upon which adaptive and intelligent optimization strategies can be built.
Moreover, these relationships hold the promise of being valuable inputs for metalearning. Metalearning leverages insights from the performance of multiple optimization tasks to adapt and improve the optimization process. By characterizing the connections between surrogate model settings, model performance, and optimization outcomes, we lay the groundwork for the integration of these insights into the realm of metalearning, enabling the development of highly adaptable and efficient optimization strategies that learn and adapt from their experiences.
In recent years, research into landscape analysis of objective functions (cf. the overview by Kerschke, 2017) has emerged in the context of algorithm selection and algorithm configuration. Recently, Thomaser et al. (2023) have selected benchmark functions for tuning algorithm configuration based on their landscape features' similarity to real-world optimization problems in vehicle dynamics. However, to our knowledge, such features have been investigated in connection with surrogate models in the context of black-box optimization using static settings only, where the model is selected once at the beginning of the optimization process (Saini et al., 2019). Moreover, the analysis of landscape features also had less attention so far than it deserves and not in the context of surrogate models (Renau et al., 2019, 2020). Additionally, recent research by Saleem and Gallagher (2022) has proposed an alternative approach. While our emphasis lies in representing the objective function through landscape features, their approach suggests representing the same objective function using a GP regression model.
In this paper, we study properties of features representing the fitness landscape in the context of the data from actual runs of a surrogate-assisted version of the CMA-ES on the noiseless part of the comparing continuous optimizers (COCO) benchmark function testbed (Hansen et al., 2016). From the large number of available features, we select a small number of representatives for subsequent research, based on their robustness to sampling and similarity to other features. Basic investigation in connection with landscape features of CMA-ES assisted by a surrogate model based on Gaussian processes (Rasmussen and Williams, 2006) was presented in a conference paper by Pitra et al. (2019). Here, we substantially more thoroughly investigate the relationships between selected representatives of landscape features and the error of several kinds of surrogate models using different settings of their parameters and the criteria for selecting points for their training. Such study is crucial due to completely different properties of data from runs of a surrogate-assisted algorithm compared with generally utilized sampling strategies, which imply different values of landscape features as emphasized in Renau et al. (2020).
The paper is structured as follows. Section 2 briefly summarizes the basics of surrogate modeling in the context of the CMA-ES, surrogate models used in this investigation, and landscape features from the literature. In Section 3, properties of landscape features are thoroughly tested and their connection to surrogate models is analyzed. Section 4 concludes the paper and suggests a few future research objectives.
2 Background
2.1 Surrogate Modeling in the Context of the CMA-ES
Black-box optimization problems frequently appear in the real world, where the values of a fitness function can be obtained only empirically through experiments or via computer simulations. As recalled in the Introduction, such empirical evaluation is sometimes very time-consuming or expensive. Thus, we assume that evaluating the fitness represents a considerably higher cost than training a regression model.
As a surrogate model, we understand any regression model that is trained on the already available input–output value pairs stored in an archive, and is used instead of the original expensive fitness to evaluate some of the points needed by the optimization algorithm.
2.1.1 The CMA-ES
The CMA-ES proposed by Hansen and Ostermeier (1996) has become one of the most successful algorithms in the field of continuous black-box optimization. Its pseudocode is outlined in Algorithm 1.
After generating new candidate solutions using the mean of the mutation distribution added to a random Gaussian mutation with covariance matrix in generation (Step 3), the fitness function is evaluated for the new offspring (Step 4). The new mean of the mutation distribution is computed as the weighted sum of the best points among the ordered offspring (Step 5).
The sum of consecutive successful mutation steps of the algorithm in the search space is utilized to compute two evolution path vectors and (Step 6). Successful steps are tracked in the sampling space and stored in using the transformation . The evolution path is calculated similarly to ; however, the coordinate system is not changed (Step 8). The two remaining evolution path elements are a decay factor decreasing the impact of successful steps with increasing generations, and used to normalize the variance of .
The covariance matrix adaptation (Step 10) is performed using rank-one and rank- updates. The rank-one update utilizes to calculate the covariance matrix . The rank- update cumulates successful steps of best individuals in matrix (Step 9).
The step-size (Step 11) is updated according to the ratio between the evolution path length and the expected length of a random evolution path. If the ratio is greater than 1, the step-size is increasing, and decreasing otherwise.
The CMA-ES uses restart strategies to deal with multimodal fitness landscapes and to avoid being trapped in local optima. A multi-start strategy where the population size is doubled in each restart is referred to as IPOP-CMA-ES (Auger and Hansen, 2005).
2.1.2 Training Sets
The level of the model precision is highly determined by the selection of the points from the archive to the training set . Considering the surrogate models we use in this paper, we list the following training set selection (TSS) methods:
TSS full, taking all the already evaluated points; that is, ,
TSS knn, selecting the union of the sets of -nearest neighbors of all points for which the fitness should be predicted, where is user defined (e.g., in Kern et al., 2006),
TSS nearest, selecting the union of the sets of -nearest neighbors of all points for which the fitness should be predicted, where is maximal such that the total number of selected points does not exceed a given maximum number of points and no point is further from current mean than a given maximal distance (e.g., in Bajer et al., 2019).
The models in this paper use the Mahalanobis distance given by the CMA-ES matrix (see Equation 4) for selecting points into training sets, similarly to, for example, Kruisselbrink et al. (2010). Considering the fact that TSS knn is the original TSS method of the model from Kern et al. (2006) (see Subsection 2.1.3), we list it here and utilize it only in connection with this particular model. The particular parameters of the TSS methods employed in this study can be found in Subsection 3.3.1.
2.1.3 Response Surfaces
Basically, the purpose of surrogate modeling—to approximate an unknown functional dependence—coincides with the purpose of response surface modeling in the design of experiments (Hosder et al., 2001; Myers et al., 2009). Therefore, it is not surprising that typical response surface models, that is, low order polynomials, also belong to most traditional and most successful surrogate models (Rasheed et al., 2005; Kern et al., 2006; Auger et al., 2013; Hansen, 2019). Here, we present two surrogate models using at most a quadratic polynomial utilized in two successful surrogate-assisted variants of the CMA-ES: the local-metamodel-CMA (lmm-CMA) by Kern et al. (2006) and the linear-quadratic CMA-ES (lq-CMA-ES) by Hansen (2019).
The linear-quadratic model in the algorithm lq-CMA-ES differs from the lmm-CMA in an important aspect:
Whereas surrogate models in the lmm-CMA are always full quadratic, the lq-CMA-ES admits also pure quadratic or linear models. The employed kind of model depends on the number of points in (according to the employed TSS method).
2.1.4 Gaussian Processes
2.1.5 Regression Forest
Regression forest (Breiman, 2001) (RF) is an ensemble of regression decision trees (Breiman, 1984). Recently, the gradient tree boosting (Friedman, 2001) has been shown useful in connection with the CMA-ES (Pitra et al., 2018). Therefore, we will focus only on this method.
2.2 Landscape Features
Let us consider a sample set of pairs of observations in the context of continuous black-box optimization , where denotes missing value (e.g., was not evaluated yet). Then the sample set can be utilized to describe landscape properties using a landscape feature, where denotes impossibility of feature computation.
A large number of various landscape features have been proposed in recent years in literature. Mersmann et al. (2011) proposed six easy to compute feature sets (each containing a number of individual features for continuous domain) representing different properties: y-Distribution set with measures related to the distribution of the objective function values, Levelset features capturing the relative position of each value with respect to objective quantiles, Meta-Model features extracting the information from linear or quadratic regression models fitted to the sampled data, Convexity set describing the level of function landscape convexity, Curvature set with gradient and Hessian approximation statistics, and Local Search features related to local searches conducted from sampled points. The last three feature sets require additional objective function evaluations.
The cell-mapping (CM) approach proposed by Kerschke et al. (2014) discretizes the input space to a user-defined number of blocks (i.e., cells) per dimension. Afterwards, the corresponding features are based on the relations between the cells and points within. Six cell-mapping feature sets were defined: CM Angle, CM Convexity, CM Gradient Homogeneity, Generalized CM, Barrier, and SOO Trees by Flamm et al. (2002) and Derbel et al. (2019). It should be noted that CM approach is less useful in higher dimensions where the majority of cells is empty and feature computation can require a lot of time and memory.
Nearest better clustering (NBC) features proposed by Kerschke et al. (2015) are based on the detection of funnel structures. The calculation of such features relies on the comparison of distances from observations to their nearest neighbors and their nearest better neighbors, which are the nearest neighbors among the set of all observations with a better objective value. Lunacek and Whitley (2006) proposed the set of dispersion features comparing the dispersion among the data points and among subsets of these points from the sample set. The information content features of a continuous landscape by Muñoz, Kirley et al. (2015) are derived from methods for calculating the information content of discrete landscapes. In Kerschke (2017), three other feature sets were proposed: Basic set with features such as the number of points, search space boundaries, or dimension; the proportion of Principal Components for a given percentage of variance; and coefficients of Linear Model fitted in each cell. A comprehensive survey of landscape analysis can be found, for example, in Muñoz, Sun et al. (2015). A more detailed description of feature sets used in our research can be found in the online available supplementary material.1
3 Landscape Analysis of Surrogate Models
In recent years, several surrogate-model approaches have been developed to increase the performance of the CMA-ES (cf. the overview in Pitra et al., 2017). Each such approach has two complementary aspects: the employed regression model and the so called evolution control (Jin et al., 2001) managing when to evaluate the model and the true objective function while replacing the fitness evaluation step (Step 4) in Algorithm 1. In Pitra et al. (2021), we have shown a significant influence of the surrogate model on the CMA-ES performance. Thus, we are interested in relationships between the prediction error of surrogate models using various features of the fitness function landscape.
First, we state the problem and research question connected with the relations between surrogate models and landscape features. Then, we present our set of new landscape features based on the CMA-ES state variables. Afterwards, we investigate the properties of landscape features in the context of the training set selection methods and select the most convenient features for further research. Finally, we analyze relationships between the selected features and measured errors of the surrogate models with various settings.
3.1 Problem Statement
The problem can be formalized as follows: In a generation of a surrogate-assisted version of the CMA-ES, a set of surrogate models with hyperparameters are trained utilizing particular choices of the training set . The training set is selected out of an archive () containing all points in which the fitness has been evaluated so far, using some TSS method (see Section 2.1.2). Afterwards, is tested on the set of points sampled using the CMA-ES distribution , where depends on the evolution control. The research question connected to this problem is: What relationships between the suitability of different models for predicting the fitness and the considered landscape features do the testing results indicate?
3.2 CMA-ES Landscape Features
To utilize additional information comprised in CMA-ES state variables, we have proposed a set of features based on the CMA-ES (Pitra et al., 2019).
In each CMA-ES generation during the fitness evaluation step (Step 4), the following additional features are obtained for the set of points :
Generation number indicates the phase of the optimization process.
Step-size provides information about the extent of the approximated region.
Number of restarts performed till generation may indicate landscape difficulty.
- Mahalanobis mean distance of the CMA-ES mean to the sample mean ofwhere is the sample covariance of . This feature indicates suitability of for model training from the point of view of the current state of the CMA-ES algorithm.11
Square of the evolution path length is the only possible non-zero eigenvalue of rank-one update covariance matrix (see Subsection 2.1.1). That feature providing information about the correlations between consecutive CMA-ES steps indicates a similarity of function landscapes among subsequent generations.
- evolution path ratio, that is, the ratio between the evolution path length and the expected length of a random evolution path used to update step-size.It provides useful information about distribution changes.12
- CMA similarity likelihood. This feature quantifies the degree of agreement between the distribution of the set of points and the current CMA-ES distribution in the present generation, expressed on the logarithmic scale. In particular, it expresses on the logarithmic scale the likelihood of the set of points concerning the CMA-ES distribution.13
3.3 Landscape Features Investigation
Based on the studies in Bajer et al. (2019) and Pitra et al. (2021), we have selected the Doubly Trained Surrogate CMA-ES (DTS-CMA-ES) (Pitra et al., 2016) as a successful representative of surrogate-assisted CMA-ES algorithms.
During the model training procedure in the DTS-CMA-ES, shown in general in Algorithm 2, a set of training points is selected and transformations of using matrix and to zero mean and unit variance () are calculated in Steps 2 and 3 before the model hyperparameters are fitted. In case of a successful fitting procedure, the resulting model is tested for constancy on an extra generated population due to the CMA-ES restraints against stagnation. However, if the fitting procedure is unsuccessful, such as encountering an error due to method failure, the model is considered not trained. The GP model utilizes the Matlab fmincon optimizer for hyperparameter fitting, and if fmincon fails due to an infeasible starting point or numerical problems, the CMA-ES is employed as an alternative (Bajer et al., 2019). Using the same transformation, the points in a testing set are transformed before obtaining the model prediction of fitness values , which is then inversely transformed to the original output space.
We accept hyperparameter fitting failures in our research because we prefer adopting established and published configurations to refining individual models. Specifically, for GPs and polynomial models, configurations were drawn from publications explicitly exploring model setups (Auger et al., 2013; Bajer et al., 2019; Hansen, 2019). For RFs, we followed Pitra et al. (2018), supplementing certain settings while actively investigating others (see Subsection 3.4.1). Although rerunning GP hyperparameter fitting could mitigate some failures, our intentional treatment of training as an automatic step aligns with our overarching goal: to comprehensively explore intricate relationships between the objective function features and the model.
3.3.1 Investigation Settings
To investigate landscape features in the context of surrogate-assisted optimization, we have generated2 a dataset of sample sets using independent runs of the 8 model settings from Pitra et al. (2019) for the DTS-CMA-ES algorithm (Bajer et al., 2019; Pitra et al., 2016) on the 24 noiseless single-objective benchmark functions from the COCO framework (Hansen et al., 2016). All runs were performed in dimensions 2, 3, 5, 10, and 20 on instances 11–15. These instances all stem from the same base problem, and are obtained by translation and/or rotation in the input space and also translation of the objective values. The algorithm runs were terminated if the target fitness value was reached or the budget of 250 function evaluations per dimension was depleted. Taking into account the double model training in DTS-CMA-ES, we have extracted archives and testing sets only from the first model training. The DTS-CMA-ES was employed in the overall best non-adaptive settings from Bajer et al. (2019). The TSS methods were set with parameter values: , where , for TSS knn as suggested by Auger et al. (2013), and and for TSS nearest according to the best settings for higher dimensions in (Bajer et al., 2019).
To obtain 100 comparable archives and testing sets for landscape features investigation, we have generated a new collection of sample sets , where points for new archives and new populations are created using the weighted sum of original archive distributions from . The -th generated dataset uses the weight vector , which provides distribution smoothing across the available generations.3 The data from well-known benchmarks were also used by Saini et al. (2019) and Renau et al. (2020).
Because each TSS method in Subsection 2.1.2 results in a different training set using identical , we have performed all the feature investigations for each TSS method separately. By combining the two basic sample sets for feature calculation and with a population consisting of the points without a known value of the original fitness to be evaluated by the surrogate model, we have obtained two new sets and . Step 2 of Algorithm 2 performing the transformation of the input could also influence the landscape features. Thus, we have utilized either transformed and non-transformed sets for feature calculations, resulting in 8 different sample sets (4 in case of TSS full due to ): , , , , , , , , where denotes the transformation.
These sets do not require additional evaluations of the objective function and can be computed even in higher dimensions. Some features were not computed due to the following reasons (see Section 2 in the supplementary material for feature definitions): From , we have used only and because the remaining ones were constant on the whole dataset. is identical regardless the sample set. and all features from were not computed using the transformation matrix because it does not influence their resulting values. The feature was excluded because it is useless if fitness normalization is performed (see Step 3 in Algorithm 2). The remaining features from , all the features from , and all three features from were not computed on sample sets with because it also does not influence their resulting values.
For the rest of the paper, we will consider features which are independent of the sample set (i.e., and 5 features) as a part of -based features only. This results in the total numbers of landscape features equal to 197 for TSS full (all were -based) and 388 for TSS nearest and TSS knn (from which 191 features were -based).
3.3.2 Feature Analysis Process and Its Results
The results of experiments concerning landscape features are presented in Tables S1–S6 in the supplementary material. In the paper, we report only main highlights of the results in Tables 1 and 2.
Proportion of features for individual TSS with robustness greater or equal to the threshold in the first column. The proportions in brackets represent -based features for given TSS (TSS full has only -based). The numbers in bold in the gray row are utilized for the following process.
threshold . | TSS full . | TSS nearest . | TSS knn . |
---|---|---|---|
0.5 | 125/195 | 244/384 (119/189) | 188/366 (63/171) |
0.6 | 82/195 | 158/384 (76/189) | 131/366 (49/171) |
0.7 | 54/195 | 102/384 (48/189) | 93/366 (39/171) |
0.8 | 43/195 | 80/384 (37/189) | 73/366 (30/171) |
0.9 | 33/195 | 60/384 (27/189) | 59/366 (26/171) |
0.99 | 28/195 | 50/384 (22/189) | 30/366 (2/171) |
threshold . | TSS full . | TSS nearest . | TSS knn . |
---|---|---|---|
0.5 | 125/195 | 244/384 (119/189) | 188/366 (63/171) |
0.6 | 82/195 | 158/384 (76/189) | 131/366 (49/171) |
0.7 | 54/195 | 102/384 (48/189) | 93/366 (39/171) |
0.8 | 43/195 | 80/384 (37/189) | 73/366 (30/171) |
0.9 | 33/195 | 60/384 (27/189) | 59/366 (26/171) |
0.99 | 28/195 | 50/384 (22/189) | 30/366 (2/171) |
Results of a medoid clustering into 14 clusters of the features for individual TSS methods, according to their Schweizer-Wolf measure . The clusters are separated by horizontal lines and their medoid representatives are marked as gray lines. denotes the minimal number of points for which the feature calculation resulted in in at most 1% of cases. The third column shows the robustness—the proportion of measured cases such that the feature values of samples from the same CMA-ES distribution did not differ more than 0.05.
First, we have investigated the impossibility of feature calculation (i.e., the feature value ) for each feature. Such information can be valuable and we will consider it as a valid output of any feature. On the other hand, a large amount of values on the tested dataset suggests low usability of the respective feature. Therefore, we have excluded features yielding in more than 25% of all measured values, which were for all TSS methods the feature on , , , and and for the TSS knn also the features , , , on , , , and , as well as on and . This decreased the numbers of features to 195 for TSS full, 384 for TSS nearest, and 366 for TSS knn. Many features are difficult to calculate using low numbers of points. Therefore, for each feature we have measured the minimal number of points in a particular combination of feature and sample set, for which the calculation resulted in in at most 1% of cases. All measured values can be found in Tables S1–S6 in the supplementary material and values of selected robust features (see below) in Table 2. For the calculation of most of the features, the CMA-ES default population size value in : , or initial point plus doubled default population size in : was sufficient in all sample sets indexed with . Considering sample-set-independent features, no points are needed, because the values concern the CMA-ES iteration or CMA-ES run as a whole, not particular points. As can be seen from Tables S1–S6 in the supplementary material, the quantile of function values used for splitting the sample set decreases as the dispersion features require more points to be computable (). For quantile values, see Section 2.3 in the supplementary material. Finally, and have also high values of , which is plausible taking into account their descriptions (see Section 2.6 in the supplementary material).
To enhance the comparability of the investigated features, we opted for a tailored normalization approach. Standardization, which scales features to have a mean of 0 and a standard deviation of 1, was deemed unsuitable for our features due to several reasons. Firstly, the use of standardization would result in the loss of information about the overall measure of centrality and overall variability of each landscape feature. This information is preserved when we choose normalization over standardization. Secondly, the features encompass a wide spectrum of ranges, from relatively narrow intervals such as for to broader spans like for . Additionally, some features even contain infinite values.
Unlike standardization, the proposed way of nonlinear normalization, employing the sigmoid function, does not depend on the particular distribution of the feature, with only one exception—the 0.01 and 0.99 quantiles, which are mapped to and . By using quantiles, we effectively cut off the tails of the feature distribution, mitigating the impact of outliers and infinite values. This approach transforms infinity values to 0 and 1, fostering a more comparable representation of features with substantial differences in value sets. This nonlinear transformation achieves a balance between trimming extreme values and retaining the core data, aligning with our objective to enhance feature comparability while accommodating the unique characteristics of the features.
We have tested the dependency of individual features on the dimension using feature medians from 100 samples for each distribution from . The Friedman's test rejected the hypothesis that the feature medians are independent of the dimension for all features at the family-wise significance level 0.05 using the Bonferroni-Holm correction. Moreover, for most of the features, the subsequently performed pairwise tests rejected the hypothesis of equality of feature medians for all pairs of dimensions. There were only several features for which the hypothesis was not rejected for some pairs of dimensions (see Tables S1–S6 in the supplementary material). Therefore, the influence of the dimension on the vast majority of features is essential.
Our investigation into the impact of multiple landscape features on the predictive error of surrogate models necessitates a high level of robustness of these features against random sampling of points. Consistency is paramount in our analysis, as we aim to ensure that when the algorithm is in the same location under similar circumstances, the features provide comparable values or, at the very least, exhibit substantial similarity. This requirement stems from the problem statement, emphasizing the need for consistent feature behavior in identical or similar situations.
Additionally, it is essential to note that the features were initially designed for datasets generated using variations of uniform sampling and Latin hypercube design. Our current dataset, however, originates from a mixture of Gaussian distributions, each representing a distinct generation. This diversification in the sampling approach underscores the importance of robust features capable of maintaining consistency across different situations.
To have a robust set of independent features, which return identical values for input in 95% cases, we would like all features to be identical in cases. Thus, for our dataset with 100 samples for each distribution even a small requires all values to be identical, which is almost impossible to achieve for most of the investigated features. Therefore, we define feature robustness as a proportion of cases for which the difference between the 1st and 100th percentile calculated after standardization on samples from the same CMA-ES distribution is . Table 1 lists numbers of features achieving different levels of robustness. We have selected the robustness to be used for subsequent analyses. The robustness calculated for individual features is listed in Tables S1–S6 in the supplementary material and its values for features with robustness in Table 2. The chosen level of robustness excluded from further computations all features from and for every TSS and all features from for TSS full and TSS nearest. Both and are rather sensitive to the input data, where the influence of non-uniform sampling of data is probably not negligible. Features from have very varied robustness. Whereas provides high robustness around 0.85 on transformed sample sets (up to 0.97 for TSS knn), computations of and resulted in the robustness 0.004 and 0.016, respectively, which were the two lowest among all features. The majority of features provided high robustness caused mainly due to the independence of most of the features on the sample set. Features from based on quadratic discriminant analysis (qda) showed nearly double robustness compared to the rest of features. The transformation of the input increased robustness of specific types of features from and . In particular, it increased the robustness of difference-based features from to more than 0.99 and also of features based on model coefficients from . TSS knn is more sample dependent, therefore, the number of points in a sample set can vary. This also decreases the robustness of some ratio-based features. On the other hand, coefficients of simple models from show robustness over 0.99 and mostly over 0.9. The limited number of robust features supports the conclusion of Renau et al. (2020) that the distribution of the initial design notably influences landscape analysis. A noticeable dependence of robustness on which of the sets , , or is used was not observed.
The large number of features suggests that for the purpose of investigation of their relationships to surrogate models, they should be clustered into a smaller number of groups of similar features. To this end, we have performed agglomerative hierarchical clustering according to , where is the measure by Schweizer and Wolff (1981) and , are the vectors of all medians from for the features and . To compensate for the ordering-dependency of agglomerative hierarchical clustering, we have performed 5 runs of clustering for each TSS method to find the optimal number of clusters. The number of clusters exceeding a threshold 0.9 for , averaged over all 15 runs, was 14; that number was subsequently used as the value of for subsequent -medoid clustering using again Schweizer-Wolff measure as a similarity. The features that are medoids of those 14 clusters are listed in Table 2. Even such a small number of feature representatives can be sufficient for achieving excellent performance in a subsequent investigation (Hoos et al., 2018; Renau et al., 2021). A majority of clusters contain features from the same feature set. Sometimes, the whole cluster is composed of the same features calculated only on different sample sets which suggests that the influence of feature calculation on various sample sets might be negligible in those clusters. On the other hand, feature clusters for TSS knn often all share the base sample set ( or ), or the same transformation even if the features are not from the same set of features. Considering the large numbers of available features, it is worth noticing that most of the medoids are identical or at least very similar for all TSS methods. TSS full and TSS nearest medoids share identical features, where only 4 (, , , ) differ in the sample set ( vs. , vs. , vs. , and vs. , respectively). Notice that sample sets differ only in using or . Moreover, 10 out of 14 representatives (, , , , , , , , , ) are the same for all considered TSS methods, whereas sample sets utilized for feature calculation sometimes differ. Such similarity can indicate great importance of those features for characterizing the fitness landscape in the CMA-ES surrogate modeling context.
3.4 Relationships of Landscape Features and Surrogate Models
To investigate the relationships between surrogate model errors and 14 landscape features selected for each TSS in the previous subsection, we have utilized 4 surrogate models: lq, lmm, GPs, and RFs. The latter two models in 8 and 9 different settings respectively. In Saini et al. (2019), features were utilized to investigate connection with several surrogate models in default settings in the scenario, where the model is selected once for a specific problem regardless of the possible following optimization process.
3.4.1 Settings
We have split the dataset into validation and testing parts ( and ) on the covariance function level uniformly at random (considering following levels of : dimension, function, instance, covariance function). More specifically, for each dimension, function, and instance in , we have uniformly selected runs using 7 covariance functions to and 1 covariance to . In each of those runs, data from 25 uniformly selected generations were used.
The regression model from lmm-CMA was used in its improved version published by Auger et al. (2013). The lmm model operates with the transformation matrix in its own way, thus, the transformation step during training (Step 2 in Algorithm 2) is not performed for this model.
The linear-quadratic model was used in the version published by Hansen (2019). The original version utilizes all data without a transformation; therefore, the input data for lq model are not transformed for TSS full.
The GP regression model was employed in the version published by Bajer et al. (2019) using 8 different covariance functions described in Table 3.
Experimental settings of GP covariances: – bias constant term, – metric , – characteristic length-scale, – isotropic distance matrix , – inputs augmented by a bias unit, and – length-scale (linear) function of .
name . | covariance function . |
---|---|
linear | |
quadratic | |
squared-exponential | |
Matérn | |
rational quadratic | |
neural network (Rasmussen and Williams, 2006) | |
spatially varying length-scale (Gibbs, 1997) | |
composite |
The experimental settings of the RF model involved 5 splitting methods obtained from algorithms CART (Breiman, 1984), SCRT (Dobra and Gehrke, 2002), OC1 (Murthy et al., 1994), PAIR (Chaudhuri et al., 1994), and SUPP (Hinton and Revow, 1996). These settings were determined using Latin-hypercube design on 100 samples from all combinations of the values in Table 4 for the number of trees in RF (), the number of points bootstrapped out of training points (), and the number of randomly subsampled dimensions used for training each individual tree (). The maximum tree depth was set to 8, following (Chen and Guestrin, 2016). Additionally, and were set to 0, indicating no regularization is applied, aligning with traditional gradient tree boosting (Friedman, 2001). The remaining decision tree parameters were kept identical to settings from Pitra et al. (2018). The settings winning on the validation dataset are shown in Table 5. Preliminary testing showed that RFs performance in connection with DTS-CMA-ES is higher when the input data are not transformed. Thus, the transformation step is omitted during model training for both TSS methods.
Parameter values for Latin-hypercube design of experimental settings of RF: —the number of trees in RF, —the number of points bootstrapped out of training points, and —the number of randomly subsampled dimensions used for training the individual tree.
parameter . | values . |
---|---|
parameter . | values . |
---|---|
Experimental settings of RF for 5 splitting methods found using Latin-hypercube design on all combinations of the number of trees in RF , the number of bootstrapped training points , and the number of randomly subsampled dimensions . The winning settings on the validation dataset for two TSS methods and two error measures are shown in the format (in case of and are shown only the multipliers of and , respectively).
TSS . | error measure . | CART . | SCRT . | OC1 . | PAIR . | SUPP . |
---|---|---|---|---|---|---|
full | ||||||
nearest | ||||||
TSS . | error measure . | CART . | SCRT . | OC1 . | PAIR . | SUPP . |
---|---|---|---|---|---|---|
full | ||||||
nearest | ||||||
3.4.2 Results
The visualization of the p-values of the Kolmogorov-Smirnov test comparing the equality of probability distributions of individual features on all data and on those data on which a particular model setting scored best. Non-red-colored squares denote KS test results rejecting the equality of distributions with the Holm correction at the family-wise significance level ; otherwise, p-values are visualized as red squares. TSS knn was employed only in connection with lmm model.
The visualization of the p-values of the Kolmogorov-Smirnov test comparing the equality of probability distributions of individual features on all data and on those data on which a particular model setting scored best. Non-red-colored squares denote KS test results rejecting the equality of distributions with the Holm correction at the family-wise significance level ; otherwise, p-values are visualized as red squares. TSS knn was employed only in connection with lmm model.
Percentages of cases when the model did not provide usable prediction (model not trained, its prediction failed, or prediction is constant). The cases when the TSS knn was not utilized in connection with the particular model are denoted as ”—” (see Subsection 2.1.2).
. | GP . | RF . | lmm . | lq . | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
. | . | . | . | . | . | . | . | . | CART . | CART . | SCRT . | SCRT . | . | PAIR . | PAIR . | SUPP . | SUPP . | . | . |
TSS . | Gibbs . | LIN . | Mat . | NN . | Q . | RQ . | SE . | SEQ . | MSE . | RDE . | MSE . | RDE . | OC1 . | MSE . | RDE . | MSE . | RDE . | . | . |
full | 50.3 | 8.4 | 26.7 | 7.6 | 22.1 | 5.3 | 8.5 | 3.3 | 5.2 | 9.6 | 2.3 | 20.4 | 17.9 | 18.8 | 2.3 | 7.0 | 23.6 | 8.8 | 2.5 |
nearest | 40.9 | 29.5 | 27.8 | 24.1 | 19.8 | 4.5 | 7.6 | 3.1 | 23.0 | 22.7 | 23.4 | 23.3 | 23.4 | 22.9 | 23.0 | 23.7 | 23.1 | 10.3 | 2.4 |
knn | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | 5.5 | — |
. | GP . | RF . | lmm . | lq . | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
. | . | . | . | . | . | . | . | . | CART . | CART . | SCRT . | SCRT . | . | PAIR . | PAIR . | SUPP . | SUPP . | . | . |
TSS . | Gibbs . | LIN . | Mat . | NN . | Q . | RQ . | SE . | SEQ . | MSE . | RDE . | MSE . | RDE . | OC1 . | MSE . | RDE . | MSE . | RDE . | . | . |
full | 50.3 | 8.4 | 26.7 | 7.6 | 22.1 | 5.3 | 8.5 | 3.3 | 5.2 | 9.6 | 2.3 | 20.4 | 17.9 | 18.8 | 2.3 | 7.0 | 23.6 | 8.8 | 2.5 |
nearest | 40.9 | 29.5 | 27.8 | 24.1 | 19.8 | 4.5 | 7.6 | 3.1 | 23.0 | 22.7 | 23.4 | 23.3 | 23.4 | 22.9 | 23.0 | 23.7 | 23.1 | 10.3 | 2.4 |
knn | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | 5.5 | — |
A pairwise comparison of the model settings RDE in different TSS. The percentage of wins of -th model setting against -th model setting over all available data is given in the -th row and -th column. The numbers in bold mark the row model setting being significantly better than the column model setting according to the two-sided Wilcoxon signed rank test with the Holm correction at family-wise significance level .
Classification tree analyzing the influence of landscape features on the most suitable model and its setting using TSS full.
Classification tree analyzing the influence of landscape features on the most suitable model and its setting using TSS full.
Classification tree analyzing the influence of landscape features on the most suitable model and its setting using TSS nearest.
Classification tree analyzing the influence of landscape features on the most suitable model and its setting using TSS nearest.
We have summed up the cases when the model did not provide the prediction; that is, the error value is not available, in Table 6. Such cases can occur when hyperparameters fitting fails, fitness prediction fails, or the model is considered constant (Steps 4, 7, and 9 in Algorithm 2, respectively).
The numbers in Table 6 basically confirm that more complex methods are more likely to fail. Whereas the lq model, the most simple model among all tested, provided predictions in almost 98% of all cases, where the missing results can be attributed to constant predictions, the GP model with Gibbs covariance function provided on average only 55% of the required predictions. Generally, the models were able to provide predictions more often when using TSS full than when using TSS nearest, probably due to the locality of the training set, considering that it is easier to train a constant model on a smaller area, where the differences between objective values are more likely negligible. The TSS knn was the most successful selection method for lmm, probably because it was designed directly for this model. In the following investigation, the cases where the error values were missing for a particular model were excluded from all comparisons involving that model.
We have tested the statistical significance of the and differences for 19 surrogate model settings using TSS full and TSS nearest methods and also the lmm surrogate model utilizing TSS knn, that is, 39 different combinations of model settings and TSS methods , on all available sample sets using the Friedman test. The resulting p-values of the two Friedman tests, one for each error measure, are below the smallest double precision number. A pairwise comparison of the model settings with respect to and revealed significant differences among the vast majority of pairs of model settings according to the non-parametric two-sided Wilcoxon signed rank test with the Holm correction for the family-wise error. To better illustrate the differences between individual settings, we also count the percentage of cases at which one setting had the error lower than the other. The pairwise score and the statistical significance of the pairwise differences are summarized in Table 7 for and Tables S7 and S8 in the supplementary material for both error measures. Results of statistical tests confirmed that the obtained and values are sufficiently diverse for further investigation of model settings suitability. The best overall results were provided by GPs with all covariances except LIN. Especially, GP using SE+Q as a covariance function in TSS nearest significantly outperformed all other combinations. The polynomial models using TSS nearest take the second place in such comparison, being outperformed only by the GP models mentioned above. Generally, models using TSS nearest provided better results than when using TSS full (in case of lmm also better than TSS knn). The percentages of show smaller differences in precision than due to the larger probability of error equality on the limited number of possible values compared to the infinite number of possible values. The significant variations in error values across all model settings, especially when employing different TSS methods for both error measures, emphasize the sensitivity of surrogate model performance to the selected configurations.
To compare the suitability of individual features as descriptors, we separated areas where the surrogate model with a particular combination has the best performance. Subseqently, we compared the distributions of each feature value on the whole with a distribution of each feature value of the selected part using Kolmogorov-Smirnov test (KS test) for MSE and RDE separately. The hypothesis of distribution equality is tested at the family-wise significance level after the Holm correction. The resulting p-values summarized in Tables S9–S13 in the supplementary material and visualized in color in Figure 1 show significant differences in distribution among combinations of model settings and TSS method for the vast majority of considered features. Features from and provided the most significant differences for almost all models. For the GP models and the sample set leading to the lowest , the p-values were often even below the smallest double precision number. These features also provided very low p-values for lmm and lq model. The only exception is providing notably higher p-values for all model settings. Moreover, features calculated on -based sample sets for TSS nearest provided in most cases higher p-values than when calculated on -based. From the model point of view, differences of RF model settings are much lower than the rest of the settings for all the tested features. This might suggest the lower ability to distinguish between the individual RF settings maybe due to the low performance of these settings on the dataset. The p-values for the and the also differ notably mainly due to the different ranges of values of these two error measures. As its consequence, p-values for more clearly suggest the distinctive power of individual features on model precision, regardless the fact that is more convenient for CMA-ES related problems. Overall, the results of the KS test have shown that there is no tested landscape feature representative for which the selection of the covariance function would be negligible for the vast majority of model settings. In essence, the individualized optimization approach, crucial for adapting model settings to observed distribution patterns, is strengthened by the persistent and robust influence of these patterns across diverse error measures. This consistency increases confidence in their universal applicability, extending across different methodologies and training set selection methods.
To perform a multivariate statistical analysis, we have built two classification trees (CTs): one for TSS full and one for TSS nearest. The data for each TSS method described by 14 features selected in Subsection 3.3 were divided into 19 classes according to which of the 19 considered surrogate model settings achieved the lowest . In case of a tie, the model with the lowest among models with the lowest was the chosen one. The CT trained on those data is the MATLAB implementation of the CART (Breiman, 1984), where the minimal number of cases in leaf was set to 5000, the twoing rule was used as a splitting criterion, was considered as a discrete variable, and the remaining 13 features were considered as continuous. The resulting CTs are depicted in Figures 2 and 3.
The most successful kind of models are GPs, present in most of the leaves proving the leading role shown in pairwise comparisons of model setting errors. Gibbs and covariances of the GP models constitute the winning GP settings regardless of the fact that both provided the lowest numbers of predictions (see Table 6). This is probably caused by the removal of cases where predictions of any model setting were missing. The winning model setting of RF is OC1 method being the most often selected leaf of the CT for TSS full method. The polynomial models are not present in the resulting CT for TSS nearest. The feature plays possibly the most important role in the CT for TSS full, being in the root node as well as in 4 other nodes, and also very important in the TSS nearest CT, where it is in the root node. This confirms the very strong decisive role of features indicated in the results of KS test. Basic features and in few nodes provide the decisions resulting in GP model if both feature values are small and in RF model otherwise. Such dependency coincides with the better ability of RF models to preserve the prediction quality with the growing dimension than the ability of GP models. Features from the are also very important at least in connection with the TSS nearest. The features are present only in very few nodes. We can observe the connection between the feature and the lq model in the CT for TSS full showing the possible ability of features to detect convenience of polynomial model usage. In conclusion, these insights into the connections between features and model settings have the potential to notably improve the efficiency of surrogate model utilization, offering valuable contributions to the design of metalearning-based systems in the context of surrogate-assisted evolutionary optimization algorithms.
4 Conclusion
This paper presents a study of relationships between the prediction error of the surrogate models, their settings, and features describing the fitness landscape during evolutionary optimization by the CMA-ES state-of-the-art optimizer for expensive continuous black-box tasks. The ultimate aim of this study was to lay the groundwork for the creation of specific rules and automated methodologies for the selection and tuning of surrogate models. Through a comprehensive exploration of the relationships between landscape features and model errors, we attempt to establish a basis that enables the development of adaptive and intelligent optimization strategies. Four models in 39 different settings for the DTS-CMA-ES, a surrogate-assisted version of the CMA-ES, were compared using three sets of 14 landscape features selected according to their properties, especially according to their robustness and similarity to other features. We analyzed and dependence of various models and model settings on the features calculated using three methods selecting sample sets extracted from DTS-CMA-ES runs on noiseless benchmarks from the COCO framework.
To our knowledge, this analysis of landscape feature properties and their connection to surrogate models in evolutionary optimization is more profound than any other published so far. The crucial distinctions compared to conducting landscape analysis solely with CMA-ES lie in our approach. We perform landscape analysis on the behavior of a specific model in a given generation. In contrast to other publications where researchers typically selected a specific algorithm at the outset and then let it run, our focus is on observing the behavior of a particular model within each generation.
Although our approach has been developed in the context of the DTS-CMA-ES, most features, excluding those intricately tied to the CMA-ES, can be used in connection with black-box optimization in general. It is essential to acknowledge the influence of the surrogate models evolutionary control, as highlighted in Pitra et al. (2021), particularly concerning dataset creation. Despite these nuances, the portability of our methodology remains preserved. The principles and insights gleaned from our study can likely be extended to various surrogate-assisted optimization contexts, laying a foundation for broader applications within this domain.
To facilitate the identification of the algorithm with the highest performance on the optimized function, most landscape features were designed to be calculated on some initial sample set derived from a more or less uniform distribution covering the entire search space. In this paper, we operate on a more local level using only data from the actual runs of the optimization algorithm generated in each generation using Gaussian distribution, that is, data completely different from the original design of those features. Therefore, the low number of robust and mostly similar features confirms the findings of Renau et al. (2020) that the distribution of the initial design has a notable impact on landscape analysis.
The influence of dimensionality on the majority of features proves to be essential. It has never occurred that more than four pairs of dimensions for a single feature were independent of each other, emphasizing the intricate relationship between feature behavior and dimensionality. Considering the context of the optimization problem's dimensionality becomes crucial, providing guidance for selecting features that consistently offer insights across different dimensions. Moreover, exploring methods to normalize or mitigate the impact of dimensionality on features emerges as a promising avenue for enhancing the reliability of landscape analysis in various optimization contexts.
The observation that numerous landscape features within the same feature group exhibit similarity aligns with expectations, given that features within a group often share conceptual or computational foundations. This suggests that utilizing representatives from these feature groups may be sufficient for analysis, simplifying the feature selection process. Additionally, the consistency among medoids for different TSS methods underscores the robustness of certain features. For instance, TSS full and TSS nearest medoids share identical features (, , , ), with only slight variations related to sample sets. Furthermore, 10 out of 14 representatives (, , , , , , , , , ) remain consistent across all TSS methods, indicating their potential importance in characterizing the fitness landscape in the CMA-ES surrogate modeling context.
Additionally, a substantial number of features were found not to be robust, as indicated in Tables S1–S6 in the supplementary material. Interestingly, there was no noticeable dependence of robustness on whether the archive , training set , or prediction set was used.
The observed substantial differences in error values among all 39 model settings, particularly with varying TSS methods for both and , underscore the sensitivity of surrogate model performance to chosen configurations. Notably, GP model settings demonstrated the highest performance, followed by polynomial models.
Most of the investigated features had distributions on sample sets with particular model settings achieving the lowest error values significantly different from the overall distribution of all data for both error measures and all 3 TSS methods. This relationship emphasizes the importance of tailoring model settings based on observed distribution patterns for an individualized approach to optimization tasks. The observed consistency of distributions across both error measures increases confidence in their robustness, indicating a universal impact and offering a foundational understanding applicable to diverse methodologies and training set selection methods.
Finally, the overall results have shown the dispersion features as highly influential on the model settings error values, followed by features based on CMA-ES state variables, features describing the similarity of fitness function to some simple model, features splitting space according to the black-box function values, and simple features such as dimension and number of observations. Prioritizing these features could enhance the efficiency of surrogate model usage.
The results of this research could be utilized to design a metalearning-based system (Kerschke et al., 2019; Saini et al., 2019) for selection of surrogate models in surrogate-assisted evolutionary optimization algorithms context. The landscape features investigated in this paper are in metalearning known as metafeatures. While our focus has been on CMA-ES due to its state-of-the-art status, the methodology is extensible to other algorithms. Exploring a broader array of models and algorithms, such as those proposed by Loshchilov et al. (2013), could enhance the scope of our findings. It is worth noting that our information is constrained by the nature of black-box optimization; obtaining easily more evaluations might obviate the need for surrogate modeling. Additionally, the study suggests that with an abundance of evaluations, rendering surrogate models unnecessary, features requiring additional assessments could be employed.
Acknowledgments
The research reported in this paper has been supported by the German Research Foundation (DFG) funded project and 467401796.
Notes
Source codes covering all mentioned datasets generation and experiments are available on https://github.com/bajeluk/surrogate-cmaes/tree/meta.
The weighted sum of the original archive distributions satisfies , where is the maximal generation reached by the considered original archive and and are the mean and covariance matrix in CMA-ES generation .