## Abstract

Parameter tuning of evolutionary algorithms (EAs) is attracting more and more interest. In particular, the sequential parameter optimization (SPO) framework for the model-assisted tuning of stochastic optimizers has resulted in established parameter tuning algorithms. In this paper, we enhance the SPO framework by introducing transformation steps before the response aggregation and before the actual modeling. Based on design-of-experiments techniques, we empirically analyze the effect of integrating different transformations. We show that in particular, a rank transformation of the responses provides significant improvements. A deeper analysis of the resulting models and additional experiments with adaptive procedures indicates that the rank and the Box-Cox transformation are able to improve the properties of the resultant distributions with respect to symmetry and normality of the residuals. Moreover, model-based effect plots document a higher discriminatory power obtained by the rank transformation.

## 1. Introduction

Parameter tuning of evolutionary algorithms (EAs) represents a noisy optimization problem with expensive evaluations. An approach to this problem is provided by SPO of Bartz-Beielstein et al. (2005), in which usually Kriging models are applied in order to guide the optimization toward promising search points. Kriging is based on the assumption that the modeled data follow a deterministic, stationary, and centered Gaussian process (GP), which cannot be guaranteed in general. For example, distributions of responses often show an unwanted skewness.

Draper and Smith (1998) discussed different response transformations for regression. Bartz-Beielstein et al. (2010, p. 349) claimed that they tried transformations to improve the model fit, but dismissed the idea to improve the interpretability of the results. However, this is not a sound argument for optimization, because the original data can still be stored to build a human-interpretable model afterward. Hutter et al. (2010) demonstrated that it can be advantageous for the optimization performance to apply a logarithmic transformation to the data before modeling in SPO. In geostatistics, area Kriging originated from Cressie (1990), and this technique is known as lognormal Kriging (Marechal, 1974). The aim of this paper is to investigate whether other transformations can be as successful or even more successful in SPO applications. In particular, a rank transformation seems promising since it is established to process highly skewed data (Journel and Deutsch, 1997; Singh and Ananda, 2002). In contrast to these papers, which are concerned with modeling, it is not necessary to retransform the predictions of the model in SPO. The main aim of a parameter tuning procedure is the identification of the parameter setting with the best performance out of a set of possible solutions, that is, the incumbent solution (Hutter et al., 2010). The actual performance of this setting is then known based on the corresponding experiments. Another possible approach not considered here is to establish a ranking within the solutions merely using pair-wise comparisons and comparison-based surrogate models (Runarsson, 2006; Loshchilov et al., 2010). This approach can be used in evolutionary algorithms to decide whether a candidate solution is worth being evaluated on the expensive function, but it is not able to control the exploration of the search space by adding additional design points. This is a special feature of the Kriging-based infill criteria used in SPO.

The remainder of the paper is organized as follows. In Section 2, an introduction to the theoretical aspects of Kriging is given. The SPO framework with our proposed modifications is described in Section 3. Section 4 presents the main part of the paper. Based on a full factorial design of different transformation and aggregation methods, the benefit of the data transformations is analyzed. The best methods for each step are identified and reasons for the success or failure of the methods are discussed. In Section 5, we conclude the results and provide an outlook on future research topics.

## 2. Fundamentals

Parameter tuning of EAs represents an optimization problem min_{x∈X}*y*(**x**),^{1} where denotes the design space. A vector **x** ∈ **X** is called the design point. It contains the settings of the considered tuning parameters *x _{i}*,

*i*= 1, …,

*n*. Usually, the design space

**X**is bounded by box constraints

**l**and

**u**. The performance of the algorithm

*y*(

**x**) is evaluated by performing experiments on a test instance . In , information about the optimization task, that is the test function or simulator output of interest, and the desired target quality or an allowed runtime budget, are encoded. Based on the outcomes of the experiments for

**x**

_{i}, an aggregated response

*y*can be derived. Based on stochastic effects in the algorithm or simulator, the response of a design can only be evaluated as

_{i}*y*=

_{i}*y*(

**x**

_{i}) + ε

_{i}, where ε

_{i}can be denoted as random error or noise of the evaluation.

**D**= (

**x**

^{T}

_{1}, …,

**x**

^{T}

_{k})

^{T}and the corresponding responses

**y**= (

*y*

_{1}, …,

*y*)

_{k}^{T}, which approximate the actual objective function

*y*(

**x**). They consider the

*k*responses as , where the vector

**f**(

**x**) = (

*f*

_{1}(

**x**), …,

*f*(

_{p}**x**))

^{T}contains

*p*monomial regression functions, for example,

*x*

^{2}or tan(

*x*), is a

*p*-dimensional vector of corresponding regression coefficients, and Z(

**X**) is a zero mean (centered) stationary Gaussian process with dependencies specified by the covariance for a known correlation function

*r*and a constant process variance σ

^{2}

_{Z}. Consequently, it is assumed that the residuals to the regression function can be described as a multivariate Gaussian distribution.

**F**= (

**f**(

**x**

_{1}) ⋅ ⋅ ⋅

**f**(

**x**

_{k}))

^{T}is the

*k*×

*p*matrix of regression function values for each of the

*k*designs. Analogously, the vector of correlations between

**X**and the already evaluated designs

**r**(

**x**) = (

*r*(

**x**,

**x**

_{1}), …,

*r*(

**x**,

**x**

_{k}))

^{T}and the

*k*×

*k*intercorrelation matrix

**R**= (

**r**(

**x**

_{1}) ⋅ ⋅ ⋅

**r**(

**x**

_{k})) are defined in terms of the correlation function

*r*. Usually, a constant regression function

*f*(

**x**) = β

_{1}and an anisotropic Gaussian kernel are used. The model parameters control the activity of the Gaussian process in each dimension. The model based on the Gaussian correlation kernel is infinitely differentiable.

Based on the strength of the correlations, the corresponding prediction uncertainty can also be computed (Santner et al., 2003). The predictions can then be interpreted to follow a normal distribution with mean and standard deviation . By these means, the incorporation of the accuracy of an evaluation is possible (Huang et al., 2006; Picheny et al., 2010).

## 3. Approach

Our extended SPO framework is proposed in Algorithm 1. The initial design for the model generation is obtained by computing a Latin hypercube sample (LHS; McKay et al., 1979) within the specified box constraints. Then, *R*_{init} runs of the EA with different random seeds are performed for each design **x** ∈ **D** on the test instance defined by .

In the known SPO approaches, the aggregation (AG) into a single performance value for each design is carried out directly. In contrast, we introduce a local transformation (LT) step before the results of the different runs are aggregated. The aim of the LT is the preprocessing of the response distribution in order to improve the estimation of the AG, for example, a symmetric distribution of the residuals for the estimation of the mean or median, as well as the corresponding standard deviation.

For a rank transformation, the responses are ordered and ranks are assigned starting with 1 for the best response. Consequently, the data are transformed to a uniform distribution given a complete order on the response. If ties occur, all responses are substituted by the averaged rank, that is, the ordered responses **Y** = (0.1, 0.3, 0.3, 1)^{T} would be- come . Whereas the Box-Cox transformation can be performed designwise using a fixed exponent λ estimated based on the whole dataset, the rank transformation has to be computed based on ranks calculated from all available results. All these transformations can improve the characteristics of the result distribution, such as skewness, kurtosis, and the fit with the normal distribution.

An approximately normal distribution of the results allows the uncertainty of the evaluation to be incorporated into Kriging models with nugget effect (Cressie, 1993; Huang et al., 2006; Picheny et al., 2010). By these means, the robustness of a design point can also be considered. Moreover, the normal distribution is symmetrical, improving the estimation of most aggregation methods. Thus, we introduce an adaptive approach, which is also considered for the LT. It chooses between the Box-Cox and the rank transformation based on a Shapiro-Wilk test (Shapiro and Wilk, 1965), whereby the transformation obtaining a higher probability (*p* value) for H0, that is, the normality of the residuals, is applied. The residuals used in the test are computed by individual standardizations of the responses for each design vector. The Box-Cox LT is chosen because its range of transformations includes both logarithmic and no transformation. The rank LT seems suitable due to the fast convergence of the sum of uniformly distributed random numbers to the normal distribution (Weisstein, 2011), as a special case of the central limit theorem.

After the local transformation, the different results for each design are forwarded to the AG. Usually, the mean of the runs is used in SPO. The mean is defined as the average over the observed results, making its estimation sensitive to outliers. Thus, the use of the median—as a more robust performance index for AG—is additionally analyzed in this paper. As already observed by Hutter et al. (2010), the LT changes the actual aggregation of the individual responses. The combination of the mean and a logarithmic LT results in the logarithm of the geometric mean of the untransformed responses. This is particularly interesting with respect to often observed success or failure scenarios in multimodal optimization, because the geometric mean is well suited to aggregate binary random variables. Moreover, a local Box-Cox transformation can provide a slight shift from the arithmetic to the geometric mean.

The global transformation (GT) is performed after the aggregated responses have been computed. In this step, two aims have to be achieved. For the modeling, the prediction quality should be as good as possible. For the sequential optimization, a clearer identification of the optimal basin is desired. Hutter et al. (2009, 2010) have already shown that a logarithmic transformation can improve the performance of SPO for some selected test instances. We additionally consider the already mentioned Box-Cox and rank transformations. The Box-Cox transformation should improve the fit of the Kriging model which is based on a multivariate Gaussian process. The rank GT will rescale the objectives in a way that the discrimination of the good solutions is improved, in particular, if many solutions in the vicinity of the optimum are already evaluated. We also introduce an adaptive approach for the GT. Since the ability of identifying the optimal basin cannot easily be evaluated, the adaptive approach for the GT chooses the transformation of the aggregated responses that results in the best model with respect to a leave-one-out cross-validation (Tenne and Armfield, 2008). It thus focuses on the first aim of providing the best possible prediction quality based on the aggregated responses.

Finally, a Kriging model is computed from the transformed results, as described in Section 2. The model is used to predict the second moment of the expected improvement *E*(*I*^{2}) (Schonlau et al., 1997) for a huge set of parameter configurations. The best results of those candidate configurations are evaluated and the data are fed back into the model.

## 4. Experiments

The analysis of the experiments is divided into two parts. In the first part, the effects and interactions of the transformation and aggregation methods and their interactions with different problem properties are evaluated factually, based on an ANOVA on the complete experimental data. In the second part, these effects are further investigated, explained, and discussed. To accomplish this, selected subsets of the data are analyzed.

### 4.1. Research Questions

The first two research questions are of a factual nature. The latter two focus on the mechanisms behind the results:

Do local and global transformations of the responses, as well as the aggregation method, have significant effects and interactions with respect to the performance of SPO?

Are there significant interactions of the transformation and aggregation methods with the properties of the tuning problem to be solved?

Is it possible to derive general guidelines for the choice of LT, AG, and GT?

Why do specific transformations or combinations of them result in improved performance?

### 4.2. Pre-experimental Planning

The enhanced SPO framework will be applied for the tuning of stochastic search algorithms. Since we are interested in the effect of the transformations in a typical parameter tuning scenario, we use standard EAs on established test problems. Consequently, we do not have a requirement to define a theoretical noise distribution. The distribution is implicitly created by the characteristics of the EA and the test problem. To cover a wide field of possible applications, we want to consider single objective and multi-objective EAs and unimodal and multimodal test problems. These two characteristics are also used as problem features for the analysis.

For the LT, four static and our normality-test-based adaptive approach are considered. The static methods are no, logarithmic, Box-Cox, and rank transformation.^{2} As aggregation methods, the average response or mean and the median are applied. For the GT, again the four static approaches and our cross-validation-based adaptive approach are considered.

The outcome of an SPO run is the best design **d***. In order to evaluate **d***, we perform a high number of EA runs with a parameterization following **d*** and compute a performance index (mean, median; Hutter et al., 2010). Since we want to compare different aggregation methods, a choice of one of these performance indices would seriously bias the results. Thus, a nonparametric evaluation based on a grouping is performed. The EA runs of different SPO instances using the same seed are directly ranked based on the results for this seed. More specifically, *N*_{SPO} replications of an SPO instance are performed, each providing an approximation of the best design **d***. This design **d*** is evaluated based on *N*_{EA} replications of the respectively parameterized EA. Consequently, for the evaluation of each SPO instance *N*_{SPO} · *N*_{EA} EA runs with different seeds are available, which are all individually ranked. These ranks form the basis for the evaluation.

### 4.3. Setup

The considered test cases are summarized in Table 1. The multi-objective problems S_ZDT1 and R_ZDT4 are taken from the CEC 2007 competition (Huang et al., 2007). SPO is allowed a budget of *N* = 500 EA runs. As a single-objective EA, the (μ, λ)-evolutionary strategy (ES; Schwefel, 1995) is chosen. It uses Gaussian mutation with self-adapted step sizes. This algorithm is one of the most famous evolutionary algorithms and forms the basis for many recent approaches, such as the CMA-ES (Hansen and Ostermeier, 2001). We explicitly decided not to use the CMA-ES because it contains several heuristics to adapt its parameters online, which perturb the effect of the initial parameterization. For the multi-objective problems, the -metric-selection-based multi-objective EA (SMS-EMOA; Beume et al., 2007) is used. In this algorithm, polynomial mutation with fixed step size η_{m} = 15 and probability *p _{m}* = 1/

*n*and simulated binary crossover (SBX; Deb and Agrawal, 1995) are applied.

Problem . | Dim. . | Objectives . | Modality . | Algorithm . | Eval. . | Reference point . |
---|---|---|---|---|---|---|

Sphere | 30 | single | uni | (μ, λ)-ES | 10,000 | — |

Rastrigin | 10 | single | multi | (μ, λ)-ES | 50,000 | — |

S_ZDT1 | 30 | multi (2) | uni | SMS-EMOA | 10,000 | (2.05, 10.45)^{T} |

R_ZDT4 | 10 | multi (2) | multi | SMS-EMOA | 20,000 | (4.15, 524.95)^{T} |

Problem . | Dim. . | Objectives . | Modality . | Algorithm . | Eval. . | Reference point . |
---|---|---|---|---|---|---|

Sphere | 30 | single | uni | (μ, λ)-ES | 10,000 | — |

Rastrigin | 10 | single | multi | (μ, λ)-ES | 50,000 | — |

S_ZDT1 | 30 | multi (2) | uni | SMS-EMOA | 10,000 | (2.05, 10.45)^{T} |

R_ZDT4 | 10 | multi (2) | multi | SMS-EMOA | 20,000 | (4.15, 524.95)^{T} |

The corresponding tuning parameters and their box constraints are defined in Table 2. For each EA, three parameters are to be tuned: population size μ, selection pressure λ/μ, and initial step size σ_{init} in the single objective case; and μ, crossover distribution index η_{c}, and probability *p _{c}* in the multi-objective case. The logarithmic transformations of some of the design parameters are performed since DACE models rely on a stationary GP, that is, a constant activity of the response over the considered domain (see, e.g., Biermann et al., 2010). For example, a change of the population size from μ = 1 to μ = 10 surely has a greater effect on the response than a change from μ = 100 to μ = 110. The change of the magnitude of the parameter is more important than the change in the absolute value. The same holds for the selection pressure λ/μ and the initial step size σ

_{init}. Currently, there are no rules concerning when to apply these kinds of transformations in the tuning parameter space. The experimenter has to rely on expert knowledge and experience from practice.

. | (μ, λ)-ES . | SMS-EMOA . | ||||
---|---|---|---|---|---|---|

Parameter . | μ . | λ/μ . | σ_{init}
. | μ . | η_{c}
. | p
. _{c} |

Box constraints | {1, …, 100} | [2, 100] | [0.001, 5.12] | {1, …, 1000} | [1, 50] | [1/n, 1] |

Transformation | log_{10} | log_{10} | log_{10} | log_{10} | none | none |

. | (μ, λ)-ES . | SMS-EMOA . | ||||
---|---|---|---|---|---|---|

Parameter . | μ . | λ/μ . | σ_{init}
. | μ . | η_{c}
. | p
. _{c} |

Box constraints | {1, …, 100} | [2, 100] | [0.001, 5.12] | {1, …, 1000} | [1, 50] | [1/n, 1] |

Transformation | log_{10} | log_{10} | log_{10} | log_{10} | none | none |

We focus on the sequential model-based optimization in SPO. Here, the transformation and aggregation methods show effect. For the initial design **D**, *N*_{init} = 30 design points and *R*_{init} = 4 runs of each design point are used. Consequently, less than a quarter of the experimental budget is utilized for the initial design. On single objective problems, the best objective value found can be directly used to evaluate the corresponding run. In the multi-objective case, the difference in the hypervolume indicator with respect to a Pareto-optimal reference set is computed (Knowles et al., 2006). This is related to the measure internally optimized in the SMS-EMOA. The reference points for the hypervolume computation are listed in Table 1.

For the evaluation of the considered SPO instances, *N*_{SPO} = 50 replications of the complete SPO run are performed. Each **d*** is then evaluated *N*_{EA} = 50 times, resulting in ranks *r* ∈ [1, 50] based on the number of designs in the full factorial. Consequently, *N*_{SPO} · *N*_{EA}= 2,500 ranks are used as the basis for the evaluation of each design within the five-way ANOVA (Montgomery, 1997). The factors and factor levels considered are summarized in Table 3. For notational brevity, we assign a letter to each level so that a combination of LT, AG, and GT can be identified by its corresponding triple. For example, DAR means adaptive LT, mean AG, and rank GT. In summary, five LTs, two AGs, five GTs, and four test problems are considered, resulting in 500,000 experiments in total.

Factor . | Levels . |
---|---|

Local transformation (LT) | {none (N), Box-Cox (B), log (L), rank (R), adaptive (D)} |

Aggregation method (AG) | {mean (A), median (M)} |

Global transformation (GT) | {none (N), Box-Cox (B), log (L), rank (R), adaptive (D)} |

Modality of the problem (MOD) | {uni, multi} |

Objectives of the problem (OBJ) | {single, multi} |

Factor . | Levels . |
---|---|

Local transformation (LT) | {none (N), Box-Cox (B), log (L), rank (R), adaptive (D)} |

Aggregation method (AG) | {mean (A), median (M)} |

Global transformation (GT) | {none (N), Box-Cox (B), log (L), rank (R), adaptive (D)} |

Modality of the problem (MOD) | {uni, multi} |

Objectives of the problem (OBJ) | {single, multi} |

### 4.4. Results

The results are organized in a top-down approach. First, the ANOVA is analyzed and the overall results are factually presented in order to answer research questions 1 and 2. Based on the results, question 3 is answered by deriving guidelines for choosing the realizations of the different steps. Finally, subsets of the experiments providing very good or poor results are analyzed in order to discuss the reasons for the factual results. This discussion represents a first approach for answering question 4.

#### 4.4.1. Research Question 1: Significance of the Main and Interaction Effects

The ANOVA table of the complete full factorial experiment over all test functions is shown in Table 4. All analyzed main and interaction effects are significant with respect to standard critical levels, for example, α = 0.01. Moreover, the value of the *F* statistic allows the importance of the effects to be evaluated.

Source . | Sum of squares . | DOF . | Mean square . | F
. | Prob >F
. |
---|---|---|---|---|---|

LT | 2.6212e+5 | 4 | 6.5529e+4 | 323.2818 | 0 |

AG | 7.6166e+4 | 1 | 7.6166e+4 | 375.7595 | 0 |

GT | 4.3547e+5 | 4 | 1.0887e+5 | 537.0929 | 0 |

LT*AG | 4.1193e+3 | 4 | 1.0298e+3 | 5.0805 | 4.3142e–04 |

LT*GT | 1.0107e+6 | 16 | 6.3167e+4 | 311.6296 | 0 |

AG*GT | 1.7645e+4 | 4 | 4.4113e+3 | 21.7627 | 0 |

OBJ*LT | 1.3765e+4 | 4 | 3.4412e+3 | 16.9770 | 6.2839e–14 |

OBJ*AG | 2.6362e+5 | 1 | 2.6362e+5 | 1300.5003 | 0 |

OBJ*GT | 3.7972e+4 | 4 | 9.4930e+3 | 46.8329 | 0 |

MOD*LT | 8.8329e+4 | 4 | 2.2082e+4 | 108.9403 | 0 |

MOD*AG | 2.1828e+4 | 1 | 2.1828e+4 | 107.6884 | 0 |

MOD*GT | 2.8035e+5 | 4 | 7.0089e+4 | 345.7767 | 0 |

Error | 1.0134e+8 | 499,945 | 2.0270e+2 | ||

Total | 1.0385e+8 | 499,999 |

Source . | Sum of squares . | DOF . | Mean square . | F
. | Prob >F
. |
---|---|---|---|---|---|

LT | 2.6212e+5 | 4 | 6.5529e+4 | 323.2818 | 0 |

AG | 7.6166e+4 | 1 | 7.6166e+4 | 375.7595 | 0 |

GT | 4.3547e+5 | 4 | 1.0887e+5 | 537.0929 | 0 |

LT*AG | 4.1193e+3 | 4 | 1.0298e+3 | 5.0805 | 4.3142e–04 |

LT*GT | 1.0107e+6 | 16 | 6.3167e+4 | 311.6296 | 0 |

AG*GT | 1.7645e+4 | 4 | 4.4113e+3 | 21.7627 | 0 |

OBJ*LT | 1.3765e+4 | 4 | 3.4412e+3 | 16.9770 | 6.2839e–14 |

OBJ*AG | 2.6362e+5 | 1 | 2.6362e+5 | 1300.5003 | 0 |

OBJ*GT | 3.7972e+4 | 4 | 9.4930e+3 | 46.8329 | 0 |

MOD*LT | 8.8329e+4 | 4 | 2.2082e+4 | 108.9403 | 0 |

MOD*AG | 2.1828e+4 | 1 | 2.1828e+4 | 107.6884 | 0 |

MOD*GT | 2.8035e+5 | 4 | 7.0089e+4 | 345.7767 | 0 |

Error | 1.0134e+8 | 499,945 | 2.0270e+2 | ||

Total | 1.0385e+8 | 499,999 |

The most important main effect is observed for the GT. The best performance is obtained when using the rank GT (marginal mean rank 24.1, marginal median rank 24) followed by the adaptive (24.9, 24.5), no (25.5, 25.75), Box-Cox (26.3, 27), and logarithmic (26.7, 27.5) GT.^{3} Considering the second most important effect of AG, the median (25.1, 25) is slightly better than the mean (25.9, 26). The least important, but still considerable, main effect is the one of the LT. Again, the best performance is achieved by the rank LT (24.2, 24), followed by the adaptive (25.3, 25), Box-Cox (25.8, 26), logarithmic (26, 26), and no (26.2, 26) LT.

The most important interaction is the one of LT and GT. The ranks for evaluating the performance of each group are shown in Figure 1 (left). All combinations applying the rank LT and/or GT result in a mean and median rank better than the average. Interestingly, even the combinations of any actual LT and no GT result in an above-average performance. The bad performance of the logarithmic and Box-Cox GT is mainly caused by the results of the combinations of these two transformations, also when performed as part of the adaptive LT. Nevertheless, all combinations of transformations are better than the approach without any transformations (30.9, 33). The best results are obtained using the rank LT and no (23.4, 22) or a Box-Cox (23.5, 23) GT.

The interaction with the AG does not change the ranking of the different GTs. Nevertheless, the difference between the rank and the adaptive GT is much smaller for the median than for the mean—(23.9, 23) and (24.2, 23) compared to (24.3, 24) and (25.6, 25.5). The interaction of AG and LT does not show any changes in the already observed rankings of AG and LT.

#### 4.4.2. Research Question 2: Interactions with the Problem Instance

The most important interaction effect is the one of the AG and the number of the objectives. This is shown in Figure 1 (right). While for multi-objective problems, the mean (25.2, 25) results in a slightly better SPO performance than the median (25.8, 26), the median (24.4, 23.5) is much more suited for single-objective problems compared to the mean (26.6, 27). This interaction even exceeds the main effect of the AG. The interaction of the OBJ with the LT and GT does not change the ranking of the different approaches.

The interaction between the modality of the problem and the GT is the second most important interaction. It mainly consists of a damping of the GT effect for multimodal problem instances. In this case, only the difference between rank (25.2, 25) and logarithmic GT (25.9, 26) is significant, whereas the ranking described in the previous paragraph holds for the unimodal test problems. For the interaction of the LT and the modality of the problem, similar damping effects as for the GT can be observed. When multimodal test instances are considered, the rank (25, 25) and the adaptive LT (25, 25) show an almost comparable performance followed by the Box-Cox (25.5, 26), logarithmic (25.9, 26), and no (26, 26) LT. This reduction in the differences between the approaches may be based on a floor effect, that is, the test functions may be too hard to solve by any parameterization of the EA within the given budget (Cohen, 1995).

The modality of the problem also shows an important interaction effect with the AG. Whereas the median (25.3, 25) is only slightly better than the mean (25.7, 26) for unimodal problems, this effect is potentiated on multimodal problems leading to a clearer superiority of the median (24.9, 24.5) over the mean (26.1, 26).

#### 4.4.3. Research Question 3: Guidelines

For the practitioner, the best overall combination is of the main interest. With regard to all algorithmic factors, LT, AG, and GT, the combinations RMB (23.1, 22), BMN (23.2, 22), RMN (23.3, 22), DMN (23.4, 22), DMD (23.6, 22), LMR (23.6, 22), RAN (23.5, 23), and DMR (23.6, 23) obtain the best ranks in the experiments. The differences in the mean ranks of these eight combinations are not significant at the α = 0.05 level.

Recall that the use of the ranks in the evaluation of the approaches is necessary to cross-compare the results on the different test instances and to analyze the interactions of the algorithmic factors with the number of objectives and modality. The actual objective values obtained by the different combinations are shown in Figure 2. In this figure, the trade-offs to be paid between mean and median performance are shown depending on the underlying test problem. As expected, the confidence intervals are larger on multimodal problems. On unimodal problems, less conflict between mean and median performance exists, leading to small nondominated fronts. The almost perfect correlation on S_ZDT1 is caused by unimodal almost symmetric response distributions as already observed by Mersmann et al. (2010). The figures also show that configurations without transformations or combinations of Box-Cox, adaptive and logarithmic LT and GT always rank among the worst, while the best configurations mentioned before are always on or in the vicinity of the nondominated front. Moreover, the slight advantage for the median AG observed in the ANOVA seems to be based on the ranking approach. Depending on the aggregation preferred by the user, both the median and the mean AG can be successful. In particular, the combinations RAN, RAB, DAR, DAD, BAR, LAR, and RAR seem to work well based on Figure 2.

#### 4.4.4. Research Question 4: Analysis of Specific Approaches

The guidelines proposed in the previous paragraph are solely based on the observed results. In order to promote our recommendations, we now analyze and discuss why rank transformations (LT or GT) and a Box-Cox GT can be beneficial. The aggregation can be chosen based on the preference of the decision maker and is thus not further discussed. In the following analyses, we focus on the main reasons for using the transformations:

Improving the prediction quality of the surrogate model; and

Identifying the optimal basin.

For the first aim, we analyze the effect of the LT and GT on the distribution of the input data and visualize the prediction of the surrogate models. The latter is then also used for evaluating the second aim. More specifically, the identification of the optimal basin is visually analyzed by means of model-based effect plots.

#### 4.4.4.1 Improving the Prediction Quality

Histograms of the responses used for building the surrogate model (after LT, AG, and GT) are provided in Figure 3 for different GT. The other factors are fixed to the SPO default, that is, no LT and a mean AG. All untransformed distributions are heavily skewed to the right. By applying a logarithmic transformation, the distribution becomes slightly left-skewed on the sphere and approximately symmetric on the Rastrigin problem, whereas the response distributions of the multi-objective problems are still right-skewed. Consequently, this transformation can only sometimes improve the response distribution. The Box-Cox transform is able to reduce the skewness on all considered test instances. Only on the multi-objective problems is the distribution still slightly skewed to the right. The additional degree of freedom λ of the Box-Cox transformation (cf. Equation (3)) can thus improve the symmetry of the final response distribution. This is particularly important because Kriging as applied in SPO realizes a centered Gaussian process with constant variance σ^{2}_{Z} (cf. Equation (1)). A skewed distribution would result in problems in predicting the responses situated at the tail of the skewed distribution. We did not visualize the rank GT in Figure 3 because it produces an (almost) uniform distribution (cf. Section 3) which is always symmetric. Nevertheless, the rank GT introduces a stair structure to the data, which might be problematic to be modeled by a continuous approach such as Kriging.

The effect of the LT is analyzed in Figure 4 and Figure 5, where we show the whole sequence of transformations based on the median SPO runs of two selected combinations on S_ZDT1. In Figure 4, the effect of the different steps in the successful combination RAB is depicted. The bad combination LAL is shown in Figure 5. After the rank LT, the characteristic uniform distribution of responses can be observed. The extreme right-skewness of the original distribution is compensated for. The logarithmic LT is only able to slightly reduce the skewness of the distribution. In both cases, the individually standardized residuals show an improved fit with the normal distribution (second row). The right-skewness of the original distribution is compensated for and the estimated probability density (black line) comes close to the normal distribution (dashed line). After the AG, the distribution of the combination RAB again deviates from the uniform distribution, whereas the distribution of the combination LAL almost does not change. The skewness of the distribution resulting from RAB is not as extreme as in the LAL case. The final Box-Cox GT does not severely change the distribution. Thus, the main contribution to the performance of this combination is made by the rank LT, which is also confirmed by the experimental results in which the combination RAN is equally successful. Nevertheless, the Box-Cox GT slightly improves the symmetry. In contrast, the logarithmic GT results in a complete change of the distribution, from right-skewed to extremely left-skewed. A left-skewed distribution is the worst basis for building a surrogate model for optimization. Based on the limitations of the Kriging model mentioned before, the prediction quality in the vicinity of good responses is bad. The detection of promising design points becomes almost impossible.

The relative amounts of the realized LTs in the adaptive approach are shown in Figure 6. The adaptive approach chooses the LT based on the Shapiro-Wilk normality test on the residuals. It can thus be used in order to analyze the ability to produce approximately normally distributed residuals using the Box-Cox and rank distribution. The choice of the LT shows considerable interactions with the problem and the state of the sequential optimization. While the rank LT is only rarely chosen on S_ZDT1, it makes up a considerable amount in the beginning of the SPO run on the Sphere and R_ZDT4 and most of the SPO run on Rastrigin. In summary, both the rank and the Box-Cox transformation seem to be able to reduce the skewness of the distribution and provide symmetric response distributions for building the surrogate model. Nevertheless, the Box-Cox transformation seems to be more suited to produce almost normal residual distributions. A possible reason for this observation can be found in the violation of the two main conditions of the central limit theorem. The tuning parameters determine the performance of a design. Thus, the distribution of rank-transformed responses for a specific design is not necessarily uniform. As a result, the responses cannot be treated as independent and equally distributed random numbers.

The trajectories of the relative amounts in the adaptive GT are not shown because there are no interactions with the state of the SPO run. Interestingly, the most frequently chosen GT approaches differ for all test problems. On the sphere the logarithmic GT is chosen in 42%, on Rastrigin the rank GT results in the best model in 39%, on S_ZDT1 the Box-Cox GT is preferred in 32%, and on R_ZDT4 no transformation provides the best prediction quality in 44% of the considered cases. Summing up over all problems, the logarithmic GT is chosen in 28%, no GT in 27%, the Box-Cox GT in 23%, and the rank transform in 22%. Compared to the results of the ANOVA, this ranking is in contrast with the performance of the different GT when applied in SPO, where the logarithmic and no GT perform worse. Recalling that the transformation is chosen based on the cross-validation error of the model, the prediction quality seems not to be the most important factor for the SPO performance. This observation may explain why the adaptive approach is not the most successful GT. A similar observation has also been made by Lim et al. (2010), who mention a “bless of uncertainty” in model-assisted optimization.

#### 4.4.4.2 Identifying the Optimal Basin

In order to evaluate the identification of the optimal basin, model-based effect plots of the most important EA parameters are shown in Figure 7 and Figure 8.^{4} These effect plots show big differences in the shape and size of the optimal basin (area within the inner contour line colored in white). On S_ZDT1 (Figure 7), the model without transformations (left) predicts a very smooth main effect for the population size μ with a broad basin μ* ∈ [10, 100]. For the probability of crossover *p _{c}*, hardly any influence can be found. In contrast, the Box-Cox and rank LT lead to a much smaller basin of attraction, where the optimal parameter values for both, μ* ∈ [25, 40] and

*p**

_{c}∈ [0.4, 0.6], can be identified more precisely. This effect can also be seen in the placing of the sequentially sampled design points (black circles) of the corresponding SPO approach which is much more focused for the latter LTs. Nevertheless, the prediction quality of the model based on the rank LT seems to be poor. Outside the optimal basin, the Kriging model interpolates the responses by very local peaks. Consequently, no correlation information is available in between. This leads to predictions based solely on the regression function (cf. Equation (2)), which is the mean in our experiments. We expect this behavior to be based on the choice of an interpolating Kriging model in SPO. The dense sampling in the optimal region leads to many noisy observations in the direct vicinity of each other. This increases the estimation of the activity parameters θ

_{i}and results in a fast decrease of the correlations. The example shows that it is not necessary to have a good model in order to perform a successful optimization.

Interestingly, a completely opposing effect can be seen on Rastrigin in Figure 8. Without any transformations, no correlation structure of the data can be found. Technically, the complete intercorrelation matrix **R** is almost diagonal in such cases, that is, no correlations of neighboring observations can be detected and only the diagonal (self-correlation) is nonzero. There are many local optima which are not exploited in the sequential optimization. Even the detection of the optimal basin fails. Thus, the bad model is based on the bad distribution of response values instead of many noisy observations in a direct vicinity. A logarithmic GT again results in a successful detection of the optimal basin. The rank GT provides the smoothest landscape with the biggest basin of attraction. In both cases, the exploitation of the basin is successful.

Based on the effect plot, our thoughts on the implication of the skewed response distributions can be further refined. The response distributions of the NAN approach are depicted in Figure 3 (left column). These histograms suggest that the extremely right-skewed distributions of NAN result in a bad identification of the optimal basin. If the mean of all responses is close to the mode of the distribution (Sphere), the basin is too large to identify the best from the good. If the mean is situated on the right of the mode, as caused by the outliers with high response values on Rastrigin, a diagonal intercorrelation matrix **R** will likely result. In contrast, the almost symmetrical distributions of NAL, RAN, and NAR lead to a good approximation of the basin.

Nevertheless, the symmetry of the response distribution cannot be the only reason for improved performance. If this would be the case, the Box-Cox and the rank transformation would provide comparable results. The rank transformation always results in an almost uniform distribution, independent of the original one. If many solutions in the vicinity of the optimum exist, it is likely that these solutions have a similar response value. The differences in the evaluation of the response are mainly based on stochastic variation. The rank transformation stretches this variation to a significant portion of the transformed response space. This may improve the discrimination of the best from the good designs. This stretching effect can be seen by comparing the NAN and RAN effect plots in Figure 7. Whereas almost the complete area of predictions below the mean rank of approximately 430 is situated in the densely sampled basin of the optimum for RAN, no differences in the responses can be detected in the model of NAN. On Rastrigin, as shown in Figure 8, the sampling is not as dense as on S_ZDT1. Thus, the focusing is not as strong, and a smooth model results. As a result, the rank transformation adaptively increases the resolution around the optimum with increasing sampling density.

## 5. Conclusion and Outlook

Based on four test instances representing different problem classes, we analyzed the effect of applying different LT, AG, and GT approaches within the SPO framework. Generally, the results are in no way restricted to parameter tuning or SPO. They should be applicable to all model-assisted optimizers for noisy optimization relying on a normal or symmetric distribution of responses. In summary, the performance of the NAN and NMN approaches is the worst of all considered combinations. Thus, at least one transformation should be performed whereby the use of the LT is more important than the one of the GT. All combinations of Box-Cox and logarithmic LT and GT should also be avoided. The use of the rank transformation either as LT or as GT always results in an above-average performance. The adaptive approaches also work well, but cannot show significant improvements compared to the sole use of the Box-Cox or rank transformation. Consequently, the normality of the residuals and the prediction quality of the model are not the main contributors to the SPO performance. The computational overhead of the adaptive approaches is not necessary. We recommend using a rank LT because the problem of introducing discrete steps in the response is relaxed by the subsequent AG. A GT is not necessary, but in order to ensure a symmetric distribution of the results, a Box-Cox GT is recommended. Recall that the Box-Cox GT can also result in almost no transformation of the data.

The superiority of the rank transformation was shown based on an established and systematically chosen, but small set of test instances. Evidence for the generality of the result was provided by showing that symmetric response distributions and an adaptive increase of the resolution with respect to the sampling density can be obtained by the rank transformation. However, additional experiments on further test instances should be performed in order to substantiate these findings. Moreover, the analysis of LTs should also be performed in the context of Kriging models with homogeneous and heterogeneous nugget effect. In order to cope with different kinds of problems, an adaptive approach that selects the transformation function based on a normality test on the transformed responses was therefore introduced.

## Acknowledgments

This paper is based on investigations of project D5 “Synthesis and multi-objective model-based optimization of process chains for manufacturing parts with functionally graded properties” as part of the collaborative research center SFB/TR TRR 30, which is kindly supported by the Deutsche Forschungsgemeinschaft (DFG).

## Notes

^{1}

Minimization problems are considered in this paper. Maximization problems can be transformed to corresponding minimization problems min_{x∈X} − *y*(**x**).

^{2}

Logarithmic and Box-Cox transformations can only be computed in the domain . Thus, we use shifted responses and *transformGlobal* (**y** − min{**y**} + ε), where ε = 2.2204 · 10^{−16} denotes the machine precision.

^{3}

Marginal mean and median ranks denote the mean and median ranks of the groups taking the average performance over the currently unconsidered factors. The mean rank over all experiments is . We choose a rounding to one decimal place based on the width of the 95% confidence interval of the mean and median which is between 0.1 and 0.6 in our experiments (cf. Figure 1).

^{4}

Since the true performance of the EAs is not known and a higher density of observations would result in numerically unstable models, no better estimation of the true contours can be provided.

## References

*Proceedings of the 2nd International Conference on Engineering Optimization (EngOpt 2010)*, Lisbon, Portugal