## Abstract

All commonly used stochastic optimisation algorithms have to be parameterised to perform effectively. Adaptive parameter control (APC) is an effective method used for this purpose. APC repeatedly adjusts parameter values during the optimisation process for optimal algorithm performance. The assignment of parameter values for a given iteration is based on previously measured performance. In recent research, time series prediction has been proposed as a method of projecting the probabilities to use for parameter value selection. In this work, we examine the suitability of a variety of prediction methods for the projection of future parameter performance based on previous data. All considered prediction methods have assumptions the time series data has to conform to for the prediction method to provide accurate projections. Looking specifically at parameters of evolutionary algorithms (EAs), we find that all standard EA parameters with the exception of population size conform largely to the assumptions made by the considered prediction methods. Evaluating the performance of these prediction methods, we find that linear regression provides the best results by a very small and statistically insignificant margin. Regardless of the prediction method, predictive parameter control outperforms state of the art parameter control methods when the performance data adheres to the assumptions made by the prediction method. When a parameter's performance data does not adhere to the assumptions made by the forecasting method, the use of prediction does not have a notable adverse impact on the algorithm's performance.

## 1 Introduction

Evolutionary algorithms (EAs) have shown good performance and robustness in solving hard optimisation problems. In recent years, it has been acknowledged that the robustness of EAs can largely be attributed to the adjustability of the algorithm parameters (Bäck et al., 2000; Eiben and Schut, 2008; Michalewicz and Schmidt, 2007), which enable the optimisation procedure to adapt to different search spaces. That said, it comes as no surprise that the parameterisation of EAs greatly affects the performance of the algorithm (Aleti and Moser, 2013; Bäck et al., 2000; Eiben and Smit, 2011; Lobo, 2011; Michalewicz and Schmidt, 2007).

The parameterisation is usually the responsibility of the practitioner, who may not be an expert in EAs. Even after many years of research into EAs and other stochastic approaches, there are no straightforward parametrisation guidelines for interdisciplinary users of these methods (Arcuri and Fraser, 2011). General guidelines regarding the crossover rate recommend a divergent variety of values, such as 0.6 (De Jong, 1995), 0.95 (Grefenstette, 1986), or “any value in the range [0.75, 0.95]” (Schaffer et al., 1989). De Jong (1995) recommends a mutation rate equal to 0.001, whereas Grefenstette (1986) argues that the mutation rate should be less than 0.05.

Any one of these conflicting guidelines is the ideal configuration for a particular search space. As the search space of a newly arising problem is typically unknown, general principles about EA parameter configurations for universal applicability can hardly be formulated (Mitchell, 1996) and EA parameterisation presents itself as an optimisation problem in its own right (Eiben and Schut, 2008). Pressed for time, practitioners tend to choose parameter values based on few preliminary trials on the problem at hand. It has been demonstrated that different parameter settings are required also for different optimisation stages of the same problem instance (Bäck et al., 2000; Cervantes and Stephens, 2009; Thierens, 2002). The “optimal” values for the mutation rate of an EA were studied using a set of benchmark problems (Cervantes and Stephens, 2009). It became clear that these values depend on the state of the search space and any fixed mutation rate will produce suboptimal results. Time dependency has since been reported for several EA parameters—the mutation rate (Aleti and Moser, 2011; Aleti et al., 2012; Bäck and Schütz, 1996; Davis, 1991), the crossover rate (Aleti and Moser, 2011; Aleti et al., 2012; Davis, 1991), and the population size (Bäck et al., 2000; Lobo, 2011).

As prior tuning of an EA's parameter values does not provide optimal performance, a great number of researchers have since proposed to reset the parameter values of an EA repeatedly during the optimisation process (Bäck et al., 2000; De Jong, 2007; Eiben et al., 2007; Fialho et al., 2010; Michalewicz and Fogel, 2004), an approach known as parameter control. The intuitive motivation comes from the way the optimisation process unfolds from a more diffused global search, requiring parameter values responsible for the exploration of the search space, to a more focused local optimisation process, requiring parameter values which help with the convergence of the algorithm.

One of the most conspicuous ways of varying parameter values during the optimisation process is adaptive parameter control (Corne et al., 2002; Fialho et al., 2009; Hong et al., 2000; Igel and Kreutz, 2003; Thierens, 2002), in which properties of an EA optimisation process (typically the quality of the solutions produced) are monitored and the change in the properties is used as a signal to change the parameter values. The area of adaptive parameter control has been researched more actively in recent times (Aleti and Moser, 2011; Fialho et al., 2009; Hansen and Ostermeier, 2001; Thierens, 2005). Parameter values are assessed based on past performance and subsequently adapted for the next iteration of the algorithm.

In essence, adaptive parameter control methods assess the potential future success of parameter values based on the estimated effect of the parameter values in the last iterations (Fialho et al., 2010; Hong et al., 2002; Igel et al., 2005). Quality attribution mechanisms use this quality measure gained from recent performance (time *t*) to decide which value(s) to use in the immediate future. It is generally argued that the past effectiveness of parameter values provides the information necessary to project their chance of leading to good results at iteration *t*+1. In previous work (Aleti and Moser, 2011), we have suggested that using time series prediction to attribute a quality metric to the parameter values to choose from for the next iteration is beneficial to the performance of the algorithm. The predictive parameter control method used linear regression to forecast the quality of mutation and crossover rates, and outperformed best-performing approaches with the same functionality, AP (Thierens, 2005), PM (Thierens, 2005), and DMAB (Fialho et al., 2009).

However, time series prediction methods require the data series to have a number of statistical properties, such as linearity of the relationship between dependent and independent variables, normality of the error distribution, independence of the errors, homoscedasticity, and stationarity of the mean and standard deviation. Different methods have been made available to accommodate different types of data. Typical EA parameters whose adaptation is meaningful to automate using predictive parameter control are mutation rate, crossover rate, mutation and crossover operators, population, and mating pool sizes. In the current work, we investigate the statistical properties of the quality of these parameter values recorded over time. The analysis of the statistical properties is employed to examine the suitability of a number of forecasting techniques for adaptive parameter value optimisation and provide recommendations on the adequacy of these methods for forecasting the quality to be expected from the use of certain parameter values. To demonstrate the benefits of predictive parameterisation, we present a comparison with state of the art quality attribution strategies—average quality attribution (AQA) and extreme quality attribution (EQA).

## 2 Adaptive Parameter Control for Evolutionary Algorithms

Adaptive parameter control describes the application of separate meta-optimisation methods which use feedback from the optimisation process to evaluate the effect of parameter value choices and adjust the parameter values over the iterations. The steps of adaptive parameter control as applied to an EA are depicted in Figure 1.

The optimisation process of EAs begins with a set of initial solutions, which is evolved in iterations until a stopping criterion is reached. The new generations of solutions are created with the help of crossover, mutation, and selection operators. These operators, their probabilities, the size of the population, and the number of new solutions produced in every iteration, are examples of parameters that can be adjusted by a parameter control method. Every iteration, the solutions created are evaluated using the objective function(s) of the problem at hand. The results from this fitness evaluation process can be used by the adaptive parameter control methods to assess the effect of the parameter values on the performance of the optimisation algorithm. This feedback collection as a part of the parameter control method records the change in the properties of the population—usually fitness, occasionally also diversity—as the performance of the algorithm.

The feedback from the optimisation process is used by the parameter effect assessment strategy to estimate the effect of the parameter values on the change in solution properties. The approximated effect of the parameter values is used to estimate a quality value which is attributed to the parameter value to project the potential of these values in the next iteration(s). These quality attributions provide the probability distributions used to select the parameter values to be employed to create the next generation of solutions. The selection mechanism subsequently configures the components of an EA (variation, selection, and replacement operators).

State of the art adaptive parameter control methods can be classified into two main categories: methods that use (i) an AQA strategy (Igel et al., 2005; Thierens, 2005), where the quality of parameter values is estimated as the average of the qualities measured in the past iterations, or (ii) an EQA (Thierens, 2005; Fialho et al., 2010), in which the best quality measured so far is used to estimate the overall quality of parameter values.

Usually, adaptive parameter control methods use a predefined time window *w* to approximate the quality of parameter values for the following iteration. Some methods use this time window to estimate the quality of parameter values as the average effect on algorithm performance (Igel et al., 2005). The probability matching (PM) technique (Thierens, 2005) estimates the quality of a parameter value based on a running average of past rewards. Rewards are allocated on the basis of the outcome of the optimisation process that the parameter value was used in. A minimum probability is enforced on values that do not receive rewards in order to maintain a nonzero probability. The motivation of a minimum value is the assumption that parameter values that do not perform well at present might be optimal in the future. PM has been criticised for the fact that the probability values resulting from the reward allocations poorly reflect the relative differences in algorithm performance when using the values. Values with vastly superior performance may only be differentiated by a marginal increase of the probability of being chosen in the next step.

Adaptive pursuit (AP) (Thierens, 2005) was conceived with the goal of improving the performance of PM by ensuring an appropriate difference in probabilities depending on experienced performance. After an iteration of the optimisation process, AP establishes the respective rewards for the parameter values used, but only applies the maximum award to the value of the best-performing algorithm instance. All other values have their probabilities of future use diminished. A nonzero probability is enforced as a minimum probability.

Fialho et al. (2008, 2009, 2010) employ the maximum effect in *w* iterations to approximate the quality of parameter values in the preceding iteration *t*−1. A normalisation scheme is used, which divides parameter effects by the best effect achieved so far by any parameter value. The dynamic multi-armed bandit (DMAB) (Fialho et al., 2009) also employs a performance-based rewards approach, but here the probability of re-using a certain value is based on a tradeoff between the number of times the parameter value was used and the reward gained from its previous performance. Rather than using a rewards-based weighted adjustment of the probabilities, the DMAB completely recalculates the probabilities when a change in the rewards distribution is detected. DMAB uses the Page-Hinkley test to detect a change in the rewards distribution.

The most common way of attributing quality to the parameters is by using the improvement in specific properties of the generated solutions with respect to a reference solution *s _{ref}*. This reference point can be either the parents of the generated solutions (Fialho et al., 2009; Hong et al., 2002; Igel et al., 2003; Igel et al., 2005; Wong et al., 2003), the best solution so far (Davis, 1989; Eiben et al., 2004), or a specified quantile, for example, the median (Julstrom, 1995, 1997). For example, if a mutation rate 0.01 is applied to create a solution from its parent

*s*, the effect of the mutation rate on the performance of the algorithm is equal to the fitness gain .

Contemporary parameter control methods (Corne et al., 2002; Fialho et al., 2010; Igel et al., 2005; Whitacre et al., 2009) attribute the quality derived from recent performance directly to the parameter values as a basis for the choice to make in the next iteration (time *t*+1). One might argue that these approaches are one step behind, in that they represent the parameter value which is optimal for the previous iteration (time *t*). Ideally, we would use a setting optimised for the beginning iteration (time *t*+1). It follows that an ideal parameter control method would attempt to predict successful parameter values for time *t* based on previous performance.

## 3 Predictive Quality Attribution

Given a set , of algorithm parameters with values , , where *m* can be a discrete value or an interval of continuous values, the adaptive method has the task of deriving the optimal next value to optimise the influence of *v _{i}* on the performance of the algorithm. As an example, when the mutation rate is dynamically adjusted using four intervals (

*m*=4), stands for the second interval. In the discrete case of optimising the type of mutation operator , stands for the second operator type. The probability of choosing values from a particular interval depends on the previous performance of values sampled from this interval. Values from intervals which have not been used have a small initial probability of being chosen. When a value is used but does not produce any above-average outcome, its probability of being chosen reverts to the initial value.

*k*independent algorithm instances are started in parallel. Each optimisation process optimises the same problem instance. The processes set their adaptive parameters for an iteration at a time, choosing the respective value probabilistically from the distribution obtained from the performance-based prediction. At the end of an iteration, the average solution fitness among all instances is established. Each value interval keeps a history list of success rates from all iterations. Algorithm instances which produce above-average population fitness gain are considered successful. For each parameter value or interval, the number of times the value has been used and the number of times the value led to a success (above-average increase in population fitness) are established. The effect of each parameter value is calculated using Equation (1).

*e*of parameter intervals/types are recorded for each iteration, resulting in the time series , where

*t*is the current iteration. The time series is used to forecast the effect of the parameter intervals/types for the next iterations. The purpose of predictive modelling is to estimate the expected effect of a parameter interval/type in the future iterations. A predictive model in adaptive parameter control has the general form: where is the response, is a vector of predictors, and ε is a random error. The different predictive models vary in the type of the response, which can be discrete or continuous, and the function

*f*. In this paper, we consider predictive models with continuous response, where the response is the expected quality of parameter values/intervals. The probability of sampling from a range is the combined quality of its members , normalised by the summed qualities of all intervals.

Forecasting necessitates the use of analytical methods to identify the relationship between future (predicted) effects of parameter values and past effects. The forecasting techniques explored in this work are the most prominent methods available for time series prediction, ranging from simple models such as linear regression (LR), simple moving average (SMA), and exponentially weighted moving average (EWMA) to more sophisticated formulations such as the autoregressive integrated moving average (ARIMA).

LR fits a linear equation to the observed data; hence, it makes the assumption that the data points have a linear relationship with time. If this is not the case, LR cannot be expected to provide accurate predictions and other alternatives have to be explored. The SMA model measures trends by smoothing the data using the mean of the data values. As a result, the trend in the data is extrapolated into a forecast. The assumption is that prior trend movements will continue. EWMA, on the other hand, focuses mostly on recent effects and captures shorter trends in the data. As a result, an EWMA is expected to reduce any lags in the simple moving average such that the fitting line is closer to the real values of the effects compared to a simple moving average. EWMA is vulnerable to sudden changes in the data, and hence is easily affected by noise. If this problem arises, it can be counteracted by increasing the length of the moving average term or using SMA instead. The ARIMA model is a combination of moving average models and a special case of the linear regression model. The ARIMA model is considered more powerful than LR, but also more complex.

Using the most sophisticated model available does not guarantee better results, as all models make assumptions about the data used for projection. For an optimal outcome, the parameter performance data must conform to the assumptions made by the forecasting method. It is necessary to investigate the data with regard to the assumptions to make recommendations of which forecasting methods are likely to produce well-parameterised algorithms. To this end, the analysis of the properties of the data is complemented with the performance of the algorithm when parameterised using the respective forecasting methods.

### 3.1 Linear Regression

Linear regression builds a linear mathematical model of past observations that can be used to forecast future values of a discrete time series. Regression models are based on the notion of regression to the mean, that is, if a time step is *x* standard deviations from the mean, the model predicts that the next value will be standard deviations from the mean of the time series, where *r* is the correlation coefficient between time and parameter effects. This is an indication of the extent to which the linear model can be used to predict the deviation of the effect of a parameter value from the mean when the deviation of time *t* from the mean is known.

*t*is equal to: where

*t*is the total number of observations, is the mean of the effects in the last

*t*iterations, is the mean of the time steps and is its standard deviation at time

*t*. The correlation coefficient

*r*is equal to the average of the product of the standardized effect and time calculated as follows:

The current parameter effect and time are standardised by expressing them in units of standard deviations from their own mean. The best fit of the model is calculated by minimising the mean squared error.

### 3.2 Simple Moving Average

*k*is the “width” of the moving average and denotes the number of observations used to compute the average. The forecast is equal to the average of the last

*k*data points in the time series, centred at ; the forecast will thus be delayed by time steps. The selection of the window width

*k*depends on the type of prediction of interest, which can be short, intermediate, or long term. Short-term averages respond quickly to changes in the effect of the parameter values, while long-term averages are slow to react.

### 3.3 Exponentially Weighted Moving Average

### 3.4 Autoregressive Integrated Moving Average

The use of ARIMA models is appropriate when the time series data show a stable trend with time and do not have many outliers. ARIMA can perform better than exponential smoothing techniques given a sufficient number of observations and a stable correlation between past observations. The ARIMA model is composed of three parts, which are calculated individually: the integrated (I) part, the autoregressive part (AR) and the moving average part (MA).

The integrated part *I*(*d*) is used if the data show evidence of nonstationarity, that is, the time series is not stable with time and exhibits a trend, and *d* is the number of nonseasonal differences. The stationarity of the time series is measured using the stationarity in mean and variance. The first step in the identification of the appropriate ARIMA model is the identification of a value for *d*, that is, the order(s) of differencing needed to make the time series stationary. The process of differencing subtracts each observation at time *t* from the observation at time *t*−1.

The *AR*(*p*) model is used to describe the correlation between the current value of the time series and *p* past values. The order *p* of the *AR* model describes the number of lags (past values) included in the model prediction, that is, the number of the autoregressive terms. In adaptive parameter control, *AR*(1) describes the correlation between the prediction of the effect at time *t* and the immediately preceding value in the time series, equal to , where is the first lag (lag 1) autoregressive parameter, and the error.

The moving average (*MA*(*q*)) part of the ARIMA model considers the time series as an unevenly weighted moving average of random errors , capturing the deviation of the current effect as a finite weighted sum of *q* previous deviations of the effects, plus a random error. In the first order *MA*(1), the error of the current value is directly correlated with the error of the immediate past observation, equal to , where , are the errors at times *t* and *t*−1, and is the first order moving average coefficient. Both the AR model and the MA model can be extended to a higher order model in order to include more information than that of the immediate past observation.

## 4 Statistical Properties of Time Series Data

The field of statistics defines a number of properties which a data series has to have to be modelled by time series prediction methods. Different methods have been made available to accommodate different types of data. The relevant statistical properties are

Linearity of the relationship between dependent and independent variables,

Normality of the error (residual) distribution,

Independence of the errors (residuals),

Homoscedasticity, which defines the variance of the errors as constant with time, and

Stationarity of the mean and standard deviation.

Linear regression requires the first four properties; SMA and EWMA assume only the last two. It is generally accepted that ARIMA produces accurate predictions only after a threshold of 50 data points. In addition, ARIMA requires the data to meet all but the first assumption.

### 4.1 Testing for the Statistical Properties

For each of the five properties, we perform statistical tests which reveal to what degree the property exists in the given data series.

#### 4.1.1 Linearity

Linearity can be determined visually by plotting the data. A more reliable method is to check for a statistical significance of the linearity by fitting a linear model and reporting the *p* value which indicates the significance of the regression. A *p* value<.05 indicates that the coefficient belonging to the linear term in the model is nonzero, which implies that the time series is linear with all data points having equal influence on the fitted line. The hypothesis is formulated as follows:

*H*_{0}: The correlation coefficient is equal to zero.*H*_{1}: The correlation coefficient is not zero.

#### 4.1.2 Normality

The calculation of the confidence intervals for the model coefficients in linear regression is based on the assumptions of normally distributed errors. In terms of PQA, normality of errors implies that most of the effects of the parameter values are expected to fall close to the regression line and very few to fall far from the line. Since the estimation of model coefficients is based on the minimisation of the squared error, the presence of extreme data points can exert a disproportionate influence on the estimates. To check for the normality assumption, we use the Kolmogorov-Smirnov (KS) test (Gibbons and Chakraborti, 2003) which tests the null hypothesis that the errors of the effects are normally distributed. The null hypothesis is rejected if the *p* value is smaller than 0.05, which indicates that the data cannot be considered normally distributed. The hypothesis is formulated as follows:

*H*_{0}: The data follow a normal distribution.*H*_{1}: The data do not follow a normal distribution.

#### 4.1.3 Independence

Linear regression models make the assumption that the error terms are independent. One example of the violation of this assumption is autocorrelation, which occurs if each error term in the time series is related to its immediate predecessor. The Durbin-Watson (DW) statistical test measures the correlation between the error terms and their immediate predecessors. The *D* statistic tests the null hypothesis that the errors from a least-squares regression are not autocorrelated. The values of *D* vary in the interval [0, 4]. For independent error terms, the value of *D* is expected to be close to 2. The value of *D* tends to be smaller than 2 for positive autocorrelation between the terms, and greater than 2 for negative autocorrelation.

*H*_{0}: The autocorrelation of errors is zero.*H*_{1}: The autocorrelation of errors is less than zero.

The null hypothesis is rejected if the *p* value is less than .05.

#### 4.1.4 Homoscedasticity

Homoscedasticity refers to the assumption that the effects of parameter values exhibit a similar variance with time. If homoscedasticity is violated, it is difficult to measure the true standard deviation of the residuals, which can result in confidence intervals that are too wide or too narrow. In particular, if the variance of the errors increases with time, confidence intervals for predictions will tend to be unrealistically narrow. Heteroscedasticity may also have the effect of placing too much weight on a small subset of the data, that is, the subset where the error variance is the largest when estimating coefficients.

Residuals are tested for homoscedasticity using the Breusch-Pagan (BP) test, which regresses square residuals to independent variables. The null hypothesis for the BP test is homosedasticity. The alternative hypothesis is that the error variance varies with a set of regressors.

*H*_{0}: The data exhibit a similar variance with time.*H*_{1}: The error variance varies with a set of regressors.

If the *p* value is less than .05, the null hypothesis is rejected, which means that the data are not homoscedastic.

#### 4.1.5 Stationarity

A time-series is stationary if the mean and standard deviation remain the same with time. Formally, the effect of parameter values are said to be stationary if the distribution of the time-series measured during the time segment [1, *t*] is the same as the distribution of the effects shifted by some lag *k*, that is, in the time segment [1+*k*, *t*+*k*].

In the application of the PQA, this means that it does not matter when the effects are observed, since the distribution of the parameter effects does not depend on time. In the case of the mutation rate, the assumption is that the diffuse part of the search with large values is followed by a more focussed part, which favours lower mutation rate values. Hence, a decreasing trend in the time series is expected.

To check for the stationarity of the effects of parameter values, we use the Kwiatkowski-Phillips-Schmidt-Shin (KPSS; Kwiatkowski et al., 1992) test, which tests the null hypothesis that the data are stationary, formulated as follows:

*H*_{0}: The data are stationary.*H*_{1}: The data are not stationary.

If the *p* value is less than .05, the null hypothesis is rejected, which means that the data are not stationary.

## 5 Design of Experiments

We are interested in investigating statistical properties of the time-series data obtained from the effects of parameter values over time. Typical EA parameters whose adaptation is meaningful to automate using parameter control are the mutation rate, crossover rate, the mutation and crossover operators, the population, and mating pool sizes. More specialised EA implementations may have additional parameters. In this work we limit the investigation to the parameters that EAs most commonly use.

The real-valued parameters (mutation rate, crossover rate, population size, and mating pool size) are discretised into intervals. The quality attribution targets the intervals. The selection strategy samples the new parameter values from the intervals chosen according to the quality attribution mechanism. The definition of the interval bounds is likely to affect the performance of the algorithm and potentially the properties of the performance data to analyse. Therefore, we perform the experiments twice: with a total of two range intervals over the whole meaningful parameter range and a second time with intervals that divide the complete meaningful range into four intervals. All intervals are evenly spread within the range.

It is obvious that the practitioner is still left with the task of defining the range bounds for meaningful values of tunable parameters. For most well-known parameters, bounds are given in the literature. When practitioners customise algorithms and in the process define new parameters, the definition of range bounds is an unavoidable task even when using adaptive parameter control methods. Inexperienced practitioners would best avoid this dilemma by adhering to well-known EAs which permit the application of known bounds as listed in Table 1.

Parameter . | Ranges/values . |
---|---|

Mutation rate, 2 ranges | [0.001,0.249], [0.25,0.49] |

Mutation rate, 4 ranges | [0.001,0.1249], [0.125,0.249], [0.25,0.3749], [0.375,0.49] |

Crossover rate, 2 ranges | [0.6,0.79], [0.8,0.99] |

Crossover rate, 4 ranges | [0.6,0.69], [0.7,0.79], [0.8,0.89], [0.9,0.99] |

Population size, 2 ranges | [20,60], [61,100] |

Population size, 4 ranges | [20,40], [41,60], [61,80], [81,100] |

Mating pool size, 2 ranges | [0.1,0.39], [0.4,0.69] |

Mating pool size, 4 ranges | [0.1,0.249], [0.25,0.39], [0.4,0.549], [0.55,0.69] |

Mutation operator | Single-point, uniform |

Crossover operator | Single-point, uniform |

Parameter . | Ranges/values . |
---|---|

Mutation rate, 2 ranges | [0.001,0.249], [0.25,0.49] |

Mutation rate, 4 ranges | [0.001,0.1249], [0.125,0.249], [0.25,0.3749], [0.375,0.49] |

Crossover rate, 2 ranges | [0.6,0.79], [0.8,0.99] |

Crossover rate, 4 ranges | [0.6,0.69], [0.7,0.79], [0.8,0.89], [0.9,0.99] |

Population size, 2 ranges | [20,60], [61,100] |

Population size, 4 ranges | [20,40], [41,60], [61,80], [81,100] |

Mating pool size, 2 ranges | [0.1,0.39], [0.4,0.69] |

Mating pool size, 4 ranges | [0.1,0.249], [0.25,0.39], [0.4,0.549], [0.55,0.69] |

Mutation operator | Single-point, uniform |

Crossover operator | Single-point, uniform |

### 5.1 Benchmark Problems

The current work on parameterising EAs is necessarily experimental in nature. A set of complex benchmark problems ensures the results are representative of typical combinatorial optimisation processes in which EAs are often used. The quadratic assignment problem (QAP) is a well-known and very complex combinatorial optimisation problem. In the current work, we use the QAP in its single-objective and multi-objective formulations. Another difficult and equally well-known problem tailored to the optimisation behaviour of EAs is the royal road (RR) function.

#### 5.1.1 Quadratic Assignment Problem

*n*facilities have to be allocated to

*n*locations, such that the total cost is minimised and every location has only one assigned facility. The total cost is calculated as the flow between the facilities multiplied by the costs for placing the facilities at their respective locations. More formally, the problem is modelled with two , representing the cost and the flow. The aim is to assign

*n*utilities to

*n*locations with minimal cost. The candidate assignments are evaluated according to the following cost function: where

*n*is the number of facilities and locations,*D*is the distance between location_{ij}*i*and location*j*,*C*is the cost of the flow between neighbouring utilities_{kl}*k*and*l*,*u*is 1 if utility_{ik}*i*is assigned to location*k*, and 0 otherwise.

**Single-***Objective**Formulation*

*Objective*

*Formulation*

*f*

_{max}is the maximum value and

*f*

_{min}is the minimum value that the fitness function

*f*can take. Since QAP is a minimisation problem, the normalisation of the results converts it into a maximisation problem.

**Multi-***Objective**Formulation*

*Objective*

*Formulation*

*m*objectives according to Equation (12).

The instances used for the current work were published by Knowles and Corne (2003). File names ending in rl denote instances designed to be real-like with rugged fitness landscapes.

#### 5.1.2 The Royal Road Problem

The RR functions are parametrised landscapes, first introduced by Mitchell, Forrest, and Holland (Jones, 1994). They are especially devised to demonstrate that there exist problems which are easier to solve using a GA than a hill climbing algorithm.

The function of the form is used to define a search task in which one wants to locate strings that produce high fitness values. The string is composed of 2^{k} nonoverlapping contiguous sections each of length *b* + *g*, where *b* is known as the block and *g* is known as the gap. These sections form the first-level schemata, composed of short, low-order, highly fit schemata or building blocks. A schema *s _{i}* is a combination of ones, zeros, and the special character “*”, which is interpreted as “do not care.” Lower-level schemata are used to form intermediate-level higher-fitness schemata, which in turn are combined to create even higher-fitness schemata.

*x*proceeds in two steps, the part calculation and the bonus calculation. The overall fitness assigned to that string is the sum of these two calculations. The part calculation considers each block individually and assigns a fitness according to the number of 1s. The block fitnesses are later summed to form the total fitness of the part calculation. The fitness function

*RRP*(

*x*) for the part calculation of a string

*x*is defined as: where

The aim of the bonus calculation is to reward complete blocks and some combinations of complete blocks. Holland's approach assigns rewards for attaining a certain level. At the lowest level, rewards are given for complete blocks. If such a block exists, it receives a fitness equal to . Any additional complete blocks receive a fitness of *u*.

### 5.2 Data Collection

The data series we wish to analyse is the performance feedback data for the typical EA parameters identified. The data gathered are the success rate (Equation (1)) of the particular parameter or interval. The success rate is calculated as a ratio of the times the value or interval was used *successfully* to the number of times the value or interval was used. Successful usage is defined as the optimisation outcome producing above-average fitness values, where the average fitness is calculated across all new population members produced in the iteration under scrutiny.

To investigate the characteristics of the data, we ran as many parallel trials as we had discrete parameter values/intervals to investigate. In each trial, one of the parameters was the design factor and all others were set to fixed values which were identical in all parallel trials. It is clear that this procedure cannot eliminate possible interactions between certain parameter values. However, we can expect to reduce these interactions to a possible minimum. Every instance was granted 15, 000 function evaluations. Every 150 function evaluations, the effect of the parameter intervals was recorded. In total, each experiment produces 100 data points for each parameter range. Approximate algorithms are not expected to deliver exact and repeatable results. Therefore, for the current data analysis, all trials were repeated 30 times for each parameter value.

An example of two time series recording the success rate of mutation rate in the ranges [0.001,0.249) and (0.25,0.49] are shown in Table 2. The scatter plots of the data (shown in Figure 2, with time on the *x* axis) are useful for establishing trends in the time series. It can be observed that both time series are nonlinear. In fact, neither of them seems to exhibit any particular trend. Furthermore, the value-specific or interval-specific time series data from the optimisation process are tested for linearity, normality, independence, homoscedasticity, and stationarity using the statistical tests described in the previous sections. The results from the statistical tests that evaluate the linearity, normality of the residuals (KS test), independence of the residuals (DW test), homoscedasticity of the time series (BP test), and stationarity of the data (KPSS test) are shown in Table 2.

Parameter range . | Linearity test . | KS test . | DW test . | BP test . | KPSS test . |
---|---|---|---|---|---|

Mutation rate [0.001,0.249) | 0.1435 | 0.9684 | 1 | <0.01 | 0.04 |

Mutation rate (0.25,0.49] | 0.2366 | 0.4445 | 1 | <0.01 | 0.1 |

Parameter range . | Linearity test . | KS test . | DW test . | BP test . | KPSS test . |
---|---|---|---|---|---|

Mutation rate [0.001,0.249) | 0.1435 | 0.9684 | 1 | <0.01 | 0.04 |

Mutation rate (0.25,0.49] | 0.2366 | 0.4445 | 1 | <0.01 | 0.1 |

The *p* value from fitting a linear model to the data helps decide whether there is a significant relationship between the variables in the linear regression model of the time series at a .05 significance level. The *p* values are greater than .05 for both ranges of the mutation rate, hence the null hypothesis is not rejected and we can say that there is no significant relationship between the variables of the regression model. It can be concluded that the time series in these two examples are not linear.

The KS test is employed to test the assumption of normally distributed errors. The *p* values for both time series are greater than .05, which means that the null hypothesis stating that the data follow a normal distribution is not rejected. To test for the independence of error terms we record the *p* value from the DW test, which checks whether the errors from a least-squares regression are autocorrelated. The null hypothesis stating that the autocorrelation of errors is zero is rejected if the *p* value is less that .05. The *p* values from the DW test for the two time series are not less than .05, hence the null hypothesis is not rejected. This indicates that the error terms are not autocorrelated. Residuals are tested for homoscedasticity using the BP test, with a null hypothesis formulated as the data exhibit a similar variance with time, which is rejected with a *p* value less than .05 for both time series. Finally, the stationarity of the time series is analysed using the KPSS test. The *p* value from testing the time series of the mutation rate in the range [0.001, 0.249) is less than .05, which indicates that this time series is not stationary. On the contrary, the time series of the mutation rate in the range (0.25, 0.49] is stationary, with a *p* value equal to .1.

## 6 Evaluation of Statistical Properties

Table 3 summarises the results from the statistical tests when the continuous values have been divided into two ranges. Table 4 lists the same outcomes in the case where four intervals were used. The discrete parameters are listed in Table 5. The percentages represent the runs among the 30 trials in which H_{0} was not rejected. We use these results to derive conclusions regarding the appropriateness of using forecasting techniques to predict the effects of the parameter values.

. | . | Linear . | Normal . | Independent . | Homoscedastic . | Stationary . |
---|---|---|---|---|---|---|

Parameter . | Problem . | (%) . | (%) . | (%) . | (%) . | (%) . |

Mutation rate | QAP | 97 | 100 | 80 | 90 | 97 |

[0.0010, 0.25] | MQAP | 97 | 100 | 93 | 93 | 93 |

Royal Road | 93 | 90 | 87 | 80 | 80 | |

Mutation rate | QAP | 100 | 100 | 70 | 97 | 100 |

[0.2505, 0.5] | MQAP | 100 | 100 | 73 | 93 | 100 |

Royal Road | 93 | 93 | 93 | 87 | 100 | |

Crossover rate | QAP | 100 | 100 | 83 | 90 | 93 |

[0.6, 0.799] | MQAP | 100 | 100 | 93 | 90 | 90 |

Royal Road | 93 | 93 | 93 | 87 | 87 | |

Crossover rate | QAP | 100 | 100 | 83 | 97 | 93 |

[0.8, 1.0] | MQAP | 100 | 100 | 83 | 90 | 90 |

Royal Road | 100 | 97 | 80 | 87 | 90 | |

Population | QAP | 100 | 100 | 100 | 53 | 13 |

size | MQAP | 100 | 100 | 90 | 50 | 13 |

[20, 60] | Royal Road | 90 | 97 | 80 | 43 | 10 |

Population | QAP | 33 | 73 | 100 | 77 | 13 |

size | MQAP | 33 | 80 | 83 | 73 | 17 |

[61, 100] | Royal Road | 30 | 97 | 80 | 87 | 13 |

Mating pool | QAP | 100 | 100 | 77 | 93 | 97 |

[0.1, 0.4] | MQAP | 100 | 100 | 80 | 93 | 97 |

Royal Road | 100 | 97 | 73 | 87 | 90 | |

Mating pool | QAP | 100 | 100 | 80 | 87 | 100 |

[0.41, 0.7] | MQAP | 100 | 100 | 80 | 90 | 100 |

Royal Road | 100 | 97 | 80 | 87 | 100 |

. | . | Linear . | Normal . | Independent . | Homoscedastic . | Stationary . |
---|---|---|---|---|---|---|

Parameter . | Problem . | (%) . | (%) . | (%) . | (%) . | (%) . |

Mutation rate | QAP | 97 | 100 | 80 | 90 | 97 |

[0.0010, 0.25] | MQAP | 97 | 100 | 93 | 93 | 93 |

Royal Road | 93 | 90 | 87 | 80 | 80 | |

Mutation rate | QAP | 100 | 100 | 70 | 97 | 100 |

[0.2505, 0.5] | MQAP | 100 | 100 | 73 | 93 | 100 |

Royal Road | 93 | 93 | 93 | 87 | 100 | |

Crossover rate | QAP | 100 | 100 | 83 | 90 | 93 |

[0.6, 0.799] | MQAP | 100 | 100 | 93 | 90 | 90 |

Royal Road | 93 | 93 | 93 | 87 | 87 | |

Crossover rate | QAP | 100 | 100 | 83 | 97 | 93 |

[0.8, 1.0] | MQAP | 100 | 100 | 83 | 90 | 90 |

Royal Road | 100 | 97 | 80 | 87 | 90 | |

Population | QAP | 100 | 100 | 100 | 53 | 13 |

size | MQAP | 100 | 100 | 90 | 50 | 13 |

[20, 60] | Royal Road | 90 | 97 | 80 | 43 | 10 |

Population | QAP | 33 | 73 | 100 | 77 | 13 |

size | MQAP | 33 | 80 | 83 | 73 | 17 |

[61, 100] | Royal Road | 30 | 97 | 80 | 87 | 13 |

Mating pool | QAP | 100 | 100 | 77 | 93 | 97 |

[0.1, 0.4] | MQAP | 100 | 100 | 80 | 93 | 97 |

Royal Road | 100 | 97 | 73 | 87 | 90 | |

Mating pool | QAP | 100 | 100 | 80 | 87 | 100 |

[0.41, 0.7] | MQAP | 100 | 100 | 80 | 90 | 100 |

Royal Road | 100 | 97 | 80 | 87 | 100 |

. | . | Linear . | Normal . | Independent . | Homoscedastic . | Stationary . |
---|---|---|---|---|---|---|

Parameter . | Problem . | (%) . | (%) . | (%) . | (%) . | (%) . |

Mutation rate | QAP | 97 | 97 | 97 | 90 | 80 |

[0.001, 0.124] | MQAP | 97 | 97 | 97 | 90 | 80 |

Royal Road | 90 | 90 | 87 | 80 | 77 | |

Mutation rate | QAP | 100 | 100 | 100 | 80 | 63 |

[0.125, 0.249] | MQAP | 97 | 93 | 97 | 83 | 63 |

Royal Road | 100 | 97 | 97 | 80 | 60 | |

Mutation rate | QAP | 100 | 100 | 97 | 90 | 87 |

[0.25, 0.3749] | MQAP | 100 | 97 | 97 | 90 | 80 |

Royal Road | 97 | 100 | 100 | 93 | 83 | |

Mutation rate | QAP | 100 | 90 | 97 | 87 | 80 |

[0.375, 0.49] | MQAP | 100 | 90 | 93 | 87 | 83 |

Royal Road | 97 | 77 | 93 | 83 | 80 | |

Crossover rate | QAP | 100 | 100 | 93 | 83 | 80 |

[0.6, 0.69] | MQAP | 97 | 100 | 97 | 87 | 83 |

Royal Road | 100 | 93 | 90 | 83 | 83 | |

Crossover rate | QAP | 100 | 100 | 97 | 90 | 73 |

[0.7, 0.79] | MQAP | 100 | 100 | 97 | 93 | 73 |

Royal Road | 100 | 97 | 93 | 80 | 70 | |

Crossover rate | QAP | 100 | 100 | 97 | 87 | 80 |

[0.8, 0.89] | MQAP | 100 | 100 | 93 | 90 | 83 |

Royal Road | 100 | 90 | 93 | 80 | 77 | |

Crossover rate | QAP | 100 | 100 | 100 | 83 | 93 |

[0.9, 0.99] | MQAP | 100 | 97 | 97 | 90 | 90 |

Royal Road | 97 | 90 | 97 | 77 | 83 | |

Population | QAP | 100 | 97 | 100 | 40 | 10 |

size | MQAP | 100 | 100 | 97 | 50 | 13 |

[20, 40] | Royal Road | 97 | 80 | 97 | 47 | 10 |

Population | QAP | 100 | 100 | 100 | 40 | 3 |

size | MQAP | 97 | 97 | 100 | 53 | 7 |

[41, 60] | Royal Road | 97 | 100 | 97 | 40 | 10 |

Population | QAP | 100 | 100 | 100 | 57 | 17 |

size | MQAP | 100 | 97 | 97 | 50 | 13 |

[61, 80] | Royal Road | 93 | 93 | 97 | 53 | 10 |

Population | QAP | 100 | 100 | 100 | 53 | 47 |

size | MQAP | 100 | 97 | 100 | 47 | 50 |

[81, 100] | Royal Road | 97 | 100 | 90 | 50 | 43 |

Mating pool | QAP | 100 | 100 | 90 | 80 | 63 |

[0.1, 0.249] | MQAP | 100 | 97 | 93 | 83 | 70 |

Royal Road | 97 | 93 | 93 | 80 | 70 | |

Mating pool | QAP | 100 | 100 | 97 | 83 | 97 |

[0.25, 0.39] | MQAP | 100 | 97 | 97 | 90 | 90 |

Royal Road | 93 | 93 | 97 | 80 | 90 | |

Mating pool | QAP | 100 | 100 | 100 | 80 | 70 |

[0.4, 0.549] | MQAP | 100 | 97 | 100 | 83 | 83 |

Royal Road | 97 | 100 | 97 | 87 | 83 | |

Mating pool | QAP | 100 | 100 | 97 | 80 | 77 |

[0.55, 0.69] | MQAP | 100 | 97 | 97 | 80 | 77 |

Royal Road | 90 | 93 | 93 | 83 | 73 |

. | . | Linear . | Normal . | Independent . | Homoscedastic . | Stationary . |
---|---|---|---|---|---|---|

Parameter . | Problem . | (%) . | (%) . | (%) . | (%) . | (%) . |

Mutation rate | QAP | 97 | 97 | 97 | 90 | 80 |

[0.001, 0.124] | MQAP | 97 | 97 | 97 | 90 | 80 |

Royal Road | 90 | 90 | 87 | 80 | 77 | |

Mutation rate | QAP | 100 | 100 | 100 | 80 | 63 |

[0.125, 0.249] | MQAP | 97 | 93 | 97 | 83 | 63 |

Royal Road | 100 | 97 | 97 | 80 | 60 | |

Mutation rate | QAP | 100 | 100 | 97 | 90 | 87 |

[0.25, 0.3749] | MQAP | 100 | 97 | 97 | 90 | 80 |

Royal Road | 97 | 100 | 100 | 93 | 83 | |

Mutation rate | QAP | 100 | 90 | 97 | 87 | 80 |

[0.375, 0.49] | MQAP | 100 | 90 | 93 | 87 | 83 |

Royal Road | 97 | 77 | 93 | 83 | 80 | |

Crossover rate | QAP | 100 | 100 | 93 | 83 | 80 |

[0.6, 0.69] | MQAP | 97 | 100 | 97 | 87 | 83 |

Royal Road | 100 | 93 | 90 | 83 | 83 | |

Crossover rate | QAP | 100 | 100 | 97 | 90 | 73 |

[0.7, 0.79] | MQAP | 100 | 100 | 97 | 93 | 73 |

Royal Road | 100 | 97 | 93 | 80 | 70 | |

Crossover rate | QAP | 100 | 100 | 97 | 87 | 80 |

[0.8, 0.89] | MQAP | 100 | 100 | 93 | 90 | 83 |

Royal Road | 100 | 90 | 93 | 80 | 77 | |

Crossover rate | QAP | 100 | 100 | 100 | 83 | 93 |

[0.9, 0.99] | MQAP | 100 | 97 | 97 | 90 | 90 |

Royal Road | 97 | 90 | 97 | 77 | 83 | |

Population | QAP | 100 | 97 | 100 | 40 | 10 |

size | MQAP | 100 | 100 | 97 | 50 | 13 |

[20, 40] | Royal Road | 97 | 80 | 97 | 47 | 10 |

Population | QAP | 100 | 100 | 100 | 40 | 3 |

size | MQAP | 97 | 97 | 100 | 53 | 7 |

[41, 60] | Royal Road | 97 | 100 | 97 | 40 | 10 |

Population | QAP | 100 | 100 | 100 | 57 | 17 |

size | MQAP | 100 | 97 | 97 | 50 | 13 |

[61, 80] | Royal Road | 93 | 93 | 97 | 53 | 10 |

Population | QAP | 100 | 100 | 100 | 53 | 47 |

size | MQAP | 100 | 97 | 100 | 47 | 50 |

[81, 100] | Royal Road | 97 | 100 | 90 | 50 | 43 |

Mating pool | QAP | 100 | 100 | 90 | 80 | 63 |

[0.1, 0.249] | MQAP | 100 | 97 | 93 | 83 | 70 |

Royal Road | 97 | 93 | 93 | 80 | 70 | |

Mating pool | QAP | 100 | 100 | 97 | 83 | 97 |

[0.25, 0.39] | MQAP | 100 | 97 | 97 | 90 | 90 |

Royal Road | 93 | 93 | 97 | 80 | 90 | |

Mating pool | QAP | 100 | 100 | 100 | 80 | 70 |

[0.4, 0.549] | MQAP | 100 | 97 | 100 | 83 | 83 |

Royal Road | 97 | 100 | 97 | 87 | 83 | |

Mating pool | QAP | 100 | 100 | 97 | 80 | 77 |

[0.55, 0.69] | MQAP | 100 | 97 | 97 | 80 | 77 |

Royal Road | 90 | 93 | 93 | 83 | 73 |

. | . | Linear . | Normal . | Independent . | Homoscedastic . | Stationary . |
---|---|---|---|---|---|---|

Parameter . | Problem . | (%) . | (%) . | (%) . | (%) . | (%) . |

Single-point | QAP | 100 | 100 | 70 | 93 | 97 |

mutation | MQAP | 100 | 97 | 70 | 83 | 83 |

Royal Road | 90 | 90 | 73 | 87 | 87 | |

Uniform | QAP | 100 | 100 | 77 | 93 | 93 |

mutation | MQAP | 100 | 97 | 80 | 83 | 90 |

Royal Road | 97 | 87 | 73 | 87 | 90 | |

Single-point | QAP | 100 | 100 | 77 | 87 | 93 |

crossover | MQAP | 100 | 97 | 80 | 83 | 90 |

Royal Road | 97 | 93 | 73 | 87 | 87 | |

Uniform | QAP | 100 | 100 | 77 | 97 | 93 |

crossover | MQAP | 100 | 97 | 77 | 83 | 93 |

Royal Road | 100 | 90 | 73 | 87 | 83 |

. | . | Linear . | Normal . | Independent . | Homoscedastic . | Stationary . |
---|---|---|---|---|---|---|

Parameter . | Problem . | (%) . | (%) . | (%) . | (%) . | (%) . |

Single-point | QAP | 100 | 100 | 70 | 93 | 97 |

mutation | MQAP | 100 | 97 | 70 | 83 | 83 |

Royal Road | 90 | 90 | 73 | 87 | 87 | |

Uniform | QAP | 100 | 100 | 77 | 93 | 93 |

mutation | MQAP | 100 | 97 | 80 | 83 | 90 |

Royal Road | 97 | 87 | 73 | 87 | 90 | |

Single-point | QAP | 100 | 100 | 77 | 87 | 93 |

crossover | MQAP | 100 | 97 | 80 | 83 | 90 |

Royal Road | 97 | 93 | 73 | 87 | 87 | |

Uniform | QAP | 100 | 100 | 77 | 97 | 93 |

crossover | MQAP | 100 | 97 | 77 | 83 | 93 |

Royal Road | 100 | 90 | 73 | 87 | 83 |

In general, the time series data are linear with normally distributed errors. The lowest number of normally distributed residuals resulted from the experiment with the population size in the interval [61,100], in which the null hypothesis was rejected in 27% of the cases (cf. Table 3). The time series belonging to the population values in this range is also the only time series that was not linear in the majority of the cases (the null hypothesis was rejected in 70% of the runs).

The DW test used to verify the assumption of no correlation between the error terms of the time series shows that performance data of parameter values are independent for the majority of the cases, with the lowest percentage scored by the mutation rate in the interval [0.2505,0.5] (70% of the runs have independent errors).

The BP test was used to measure the homoscedasticity of the time series data. In general, homoscedasticity holds for the majority of the time series data, although the null hypothesis can only be confirmed in 80–90% of the runs for all parameter intervals. The lowest percentages of homoscedastic runs are in the experiments with the ranges [20,40] and [41,60] of the population size, in which only 40% of the runs are homoscedastic. Similar results were observed in the other ranges of the population size.

The KPSS test verifies the null hypothesis that the data are stationary. It can be observed that stationarity is a common feature in the effects of the single-point mutation, uniform mutation, single-point crossover, and uniform crossover operators. Similarly, in some of the intervals of the mutation rate ([0.0010, 0.25] and [0.2505, 0.5]), crossover rate ([0.6, 0.799], [0.8, 1.0], and [0.9, 0.99]) and mating pool size ([0.1, 0.4], [0.41, 0.7], and [0.25, 0.39]) the null hypothesis was not rejected in more than 90% of the runs with the mating pool size in the interval [0.41, 0.7] and mutation rate in the range [0.2505, 0.5] being the most stationary processes (100% of the runs are stationary).

However, the crossover rate in the interval [0.7, 0.79], the mutation rate in the interval [0.125, 0.249], and the mating pool size in the intervals [0.1, 0.249], [0.4, 0.549], and [0.55, 0.69] have a moderately low number of trials which confirm stationarity. Furthermore, less than 20% of the runs in all ranges of the population size confirm stationarity. The time series belonging to the population size in the interval [41, 60] has the lowest percentage of confirmation of the null hypothesis.

For the assessment of the suitability of forecasting methods LR, SMA, EWMA, and ARIMA based on statistical properties it is necessary to reach a conclusion regarding the parameters themselves instead of the individual value options. For the comparison, the assumptions made by the considered forecasting models have been summarised in Table 6.

Linearity | Normality | Independence | Homoscedasticity | Stationarity | |

LR | + | + | + | + | – |

SMA | – | – | – | + | + |

EWMA | – | – | – | + | + |

ARIMA | – | + | + | + | + |

Linearity | Normality | Independence | Homoscedasticity | Stationarity | |

LR | + | + | + | + | – |

SMA | – | – | – | + | + |

EWMA | – | – | – | + | + |

ARIMA | – | + | + | + | + |

Fortunately, the characteristics of parameter value effects are reasonably consistent over the values or intervals of the same parameter. We also find that dividing the parameter value ranges into two versus four intervals causes no significant differences in the behaviour of the statistical properties.

The results flag stationarity as the least satisfied condition, where the null hypothesis can be confirmed for some intervals in only around 70% of the trials. Excluding population size for its problematic outcomes on many assumptions, most parameters’ interval-specific results are mixed regarding stationarity. This would suggest that the best choice for all of these is LR, alternatively ARIMA with an appropriate differencing method. The condition of independence of errors seems threatened only when larger intervals are used (Table 3). Larger intervals seem to complicate the attribution of causality of parameter effects. If we assume smaller intervals, independence of errors seems unproblematic.

Homoscedasticity has low values only in the case of the intervals of the mating pool sizes. A decision has to be made whether a homoscedastic result in around 80% of the cases is sufficient to regard the mutation pool size effect data as homoscedastic. Since homoscedasticity is a prerequisite for the use of all considered models, we can only accept these low scores and investigate whether the use of the forecasting methods on the mating pool size leads to good overall results.

The parameter of population size scores poorly on both the test for homoscedasticity and stationarity. As all suggested forecasting models require homoscedasticity, we can only decide that none of these methods can be expected to provide accurate performance predictions on population size values. Perhaps the least unsuitable method for this parameter is LR for the reason that LR does not require stationarity.

Consequently, the resulting summary Table 7, which shows the suitability of forecasting methods for parameters, displays a “” sign against population size and all discussed forecasting methods. This indicates that the time series data of this parameter only partially follows the assumptions made by the forecasting model, which cannot be recommended for use without reserve. All other parameters have been marked as “+” for all forecasting methods considered. This decision has been made notwithstanding the limitations discussed above.

Parameter | LR | SMA | EWMA | ARIMA |

Mutation rate | + | + | + | + |

Crossover rate | + | + | + | + |

Population size | ||||

Mating pool size | + | + | + | + |

Mutation operator | + | + | + | + |

Crossover operator | + | + | + | + |

Parameter | LR | SMA | EWMA | ARIMA |

Mutation rate | + | + | + | + |

Crossover rate | + | + | + | + |

Population size | ||||

Mating pool size | + | + | + | + |

Mutation operator | + | + | + | + |

Crossover operator | + | + | + | + |

## 7 Performance Comparison of the Prediction Methods

The parameters mutation rate, crossover rate, mating pool size, mutation, and crossover operators are all similar in their adherence to the expectations made by the considered prediction models. From the outcome of the conformity tests we cannot definitively conclude which prediction model is clearly best suited for each of them. Therefore, in this section, we apply all considered forecasting methods to each of the parameters except the population size, which conforms too poorly to several of the assumptions.

To avoid overfitting of the forecasting models, we employ regularization techniques in which a part of the data is reserved to test whether the model generalises well. We randomly split the data into the training and the testing set, and fit the model to the training set by minimising the squared error. The model is then evaluated in the testing set, by measuring the squared error and comparing it to the training set.

### 7.1 Experimental Settings

In the first instance, mutation rate, crossover rate, mating pool size, crossover operator, and mutation operator are set adaptively using predictive parameter control. The population size is not discussed in this section due to its problematic statistical properties. The parameter values or intervals used for these trials are listed in Table 1.

We performed 30 runs for each optimisation scheme with different quality attribution strategies. The parameter value effect is measured based on the fitness improvement over the previous iteration. In the case of a multi-objective setting we use the hypervolume indicator, which has been identified as the best unary measure to reflect the fitness of a population in a multi-objective setting (Zitzler et al., 2003). The results are presented as the mean of 30 runs over 55 iterations. The problem instance used for QAP is BUR26E, for mQAP we used KC30-3fl-2rl. The Bur* problems’ first matrices are average typists’ typing times whereas the second are the frequencies of combinations of two letters. The multiobjective problem KC30-3fl-2rl has 30 facilities/locations, three objectives, and is the second real-like instance of this size created by Knowles and Corne (2003).

### 7.2 Results

The development of the population fitness averaged over 30 trials, plotted in Figure 3, shows that using LR for the parameter optimisation helps find better solutions at an earlier stage of the optimisation process. In the case of the QAP problems, single-objective and multi-objective, the quality achieved by LR is not attained by the other methods. The performances of all predictive models converge to the same quality on the RR problem. The optimisation scheme using ARIMA as a quality attribution scheme does not perform very well during the initial iterations, as could be expected given the requirement of a minimum of 50 data points.

On the other hand, certain ranges of mutation rate, crossover rate, and mating pool size meet the stationarity assumption, which is a prerequisite of SMA, EWMA, and ARIMA (see Table 7), only in a moderate number of runs. As a result, these models may not fit perfectly the time series data we are controlling. This explains the weaker performance of SMA, EWMA, and ARIMA models when compared with LR. As the differences in performance between the models are not statistically significant, the results suggest that LR can be regarded as representative of all considered prediction models.

## 8 Experimental Evaluation

The experimental validation of predictive parameter control using the considered techniques is carried out on all parameters regardless of adherence to statistical properties as established in a previous section. Practitioners will not, in general, have the leisure to perform tests on statistical properties of parameters. We are therefore obliged to provide recommendations how to proceed when assumptions are not met or the adherence to statistical assumptions is unknown. We provide a validation for all parameters using all forecasting techniques to establish the potential of adverse effects on the performance of the EA.

Having investigated the appropriate forecasting method to use for PQA, we investigate whether using prediction as part of adaptive parameter control is beneficial to the result quality. The PQA method using LR is compared to the average and extreme quality attribution schemes, which are currently the best-performing quality attribution strategies.

### 8.1 Benchmark Quality Attribution Strategies

*W*to approximate the quality of parameter values as the average effect on algorithm performance. The AQA is given in Equation 15. where is the effect of parameter value , assessed as the fitness difference between the solution produced by applying parameter value and its parent. The sum of the effect values of in the last

*w*iterations (i.e., ) is divided by the total number of solutions created using that parameter value () normalising the quality of .

*w*applications in a descending order. Each parameter effect is assigned a rank value

*w*−

*r*, where

*r*refers to the rank-position of the effect value. The higher the rank position, the lower is the rank value. The final rank value for each parameter effect is calculated as: where is the decay factor, which emphasizes the influence of the top-ranked effect values on the overall quality of that parameter value, hence it belongs to the class of EQA strategies. Next, the area under the curve is calculated by going down the sorted list, and drawing the receiving operator curve starting from the origin (Fialho et al., 2010). Since AUC uses only the ranks of the parameter effects, this method is quite robust with respect to hyperparameter and fitness transformations, as shown in the experimental evaluation by Fialho et al. (2010).

### 8.2 Experimental Settings

For the benefit of the comparison between PQA, AQA, and EQA, we control the mutation rate, the crossover rate, the mutation operator, the crossover operator, the population size, and the mating pool size with values/ranges depicted in Table 1.

The experimental study in Section 4 demonstrated that the effect of the population size during the optimisation time steps does not adhere to the assumptions made by the forecasting techniques. Hence we conduct a separate set of experiments for the population size and another for the mutation rate, the crossover rate, the mutation operator, the crossover operator, and the mating pool size. The latter set of experiments is performed to examine the performance of PQA when used to predict the quality of parameter values which conform to the assumptions of the forecasting techniques. The population size is investigated for an insight into the problem—whether the use of PQA can have adverse effects on the performance of the algorithm when the assumptions of the forecasting models are clearly not met.

Approximate algorithms are not expected to deliver exact and repeatable results, but to provide good approximate solutions where exact approaches cannot be devised. Hence, results concerning the performance of approximate algorithms, such as EAs, are usually reported as mean values over repeated trials. To obtain a fair comparison, the generally accepted approach is to allow the same number of function evaluations for each trial. Every trial instance is granted 90, 000 function evaluations. To obtain a fair comparison among the different quality attribution schemes we repeat each run 30 times. To check for a statistical difference of the results, the different effect attribution schemes of the optimisation methods are validated using the KS nonparametric test (Pettitt and Stephens, 1977). The final solutions of each optimisation scheme are recorded and the results are compared by using the normalised fitness for the single-objective problems and the hypervolume indicator for the multi-objective problem.

### 8.3 Hyperparameters

All adaptive algorithms involve hyperparameters, which need to be tuned depending on the optimisation problem at hand. This defines another optimisation problem, which can become quite computationally expensive if we attempt an exhaustive exploration of the search space. However, the number of hyperparameters is usually lower than the number of algorithm parameters that should be controlled.

The tuning of the hyperparameter values was performed using a sequential parameter optimisation (SPO) (Bartz-Beielstein et al., 2005), which tests each hyperparameter combination using several runs. To decrease the number of tests required, we employ a racing technique, which uses a variable number of runs depending on the performance of the hyperparameter configuration. Hyperparameter configurations are tested against the best configuration so far, using at least the same number of function evaluations as employed for the best configuration. The results from the tuning process are shown in Table 8.

Benchmark . | Hyperparameter . | Value . | Description . |
---|---|---|---|

EQA | 0.5 | Decay factor | |

PQA, EQA | p_{min} | 0.1 | Minimum selection probability |

EQA, AQA, PQA | W | 100 | Window size |

Benchmark . | Hyperparameter . | Value . | Description . |
---|---|---|---|

EQA | 0.5 | Decay factor | |

PQA, EQA | p_{min} | 0.1 | Minimum selection probability |

EQA, AQA, PQA | W | 100 | Window size |

The only hyperparameter for the PQA method is the minimum probability used for the exploration of new parameter values. For the minimum selection probability, we employ the same settings as AP and PM, as shown in Table 8.

### 8.4 Benchmark Problems

The optimisation schemes are employed to optimise the RR problem, described in Section 5.1.2, four well-known instances of the QAP (CHR20A, BUR26E, TAI30A, and STE36B) and three instances of the multi-objective QAP (KC10-2fl-5rl, KC30-3fl-1rl, and KC30-3fl-2rl). CHR20a is known to be among the hardest of the CHR* problems (Christofides and Benavent, 1989). The BUR* problems’ first matrices are average typists’ typing times; whereas the second are the frequencies of combinations of two letters. TAI*a problems are considered to have rugged fitness landscapes, whereas TAI*b instances are structured (Merz and Freisleben, 2000). STE36b is a variation of a computer component placement problem with the goal to minimise the lengths of the connecting wires (Steinberg, 1961). The multi-objective problem KC10-2fl-5rl has 10 facilities/locations, two objectives, and is the fifth real-like instance of this size created by Knowles and Corne (2003). By analogy, KC30-3fl-1rl and KC30-3fl-2rl have 30 facilities and three objectives.

### 8.5 Results

Figure 4 shows the results of the 30 trials in which the mutation rate, crossover rate, mating pool size, mutation operator, and crossover operator were controlled. The experimental results are clearly not normally distributed, but the mean and 25th percentile of the PQA method are consistently above the respective values of the benchmarks.

The means and standard deviations are listed in Table 9, which shows a significant difference between the result groups of AQA, EQA, and PQA. The mean performance of PQA is consistently above the averages of the benchmarks. The performance of EQA is clearly worse than that of the other two quality attribution strategies.

. | Mean . | Standard deviation . | ||||
---|---|---|---|---|---|---|

Problem . | AQA . | EQA . | PQA . | AQA . | EQA . | PQA . |

CHR20A | 0.4909 | 0.4845 | 0.4972 | 1.627E–02 | 1.306E–02 | 1.920E–02 |

BUR26E | 0.2416 | 0.2395 | 0.2422 | 2.049E–03 | 1.893E–03 | 2.365E–03 |

TAI30A | 0.4909 | 0.4845 | 0.4982 | 1.627E–02 | 1.306E–02 | 1.870E–02 |

STE36B | 0.7978 | 0.7984 | 0.8308 | 7.823E–03 | 9.038E–03 | 8.704E–03 |

RR | 0.0064 | 0.0063 | 0.0072 | 1.537E–04 | 6.3723E–03 | 1.473E–03 |

KC10-2fl-5rl | 0.8062 | 0.8061 | 0.8067 | 1.040E–03 | 8.284E–04 | 9.555E–04 |

KC30-3fl-1rl | 0.1984 | 0.1961 | 0.1995 | 3.255E–03 | 3.974E–03 | 3.806E–03 |

KC30-3fl-2rl | 0.5199 | 0.5175 | 0.5223 | 7.312E–03 | 5.439E–03 | 4.821E–03 |

. | Mean . | Standard deviation . | ||||
---|---|---|---|---|---|---|

Problem . | AQA . | EQA . | PQA . | AQA . | EQA . | PQA . |

CHR20A | 0.4909 | 0.4845 | 0.4972 | 1.627E–02 | 1.306E–02 | 1.920E–02 |

BUR26E | 0.2416 | 0.2395 | 0.2422 | 2.049E–03 | 1.893E–03 | 2.365E–03 |

TAI30A | 0.4909 | 0.4845 | 0.4982 | 1.627E–02 | 1.306E–02 | 1.870E–02 |

STE36B | 0.7978 | 0.7984 | 0.8308 | 7.823E–03 | 9.038E–03 | 8.704E–03 |

RR | 0.0064 | 0.0063 | 0.0072 | 1.537E–04 | 6.3723E–03 | 1.473E–03 |

KC10-2fl-5rl | 0.8062 | 0.8061 | 0.8067 | 1.040E–03 | 8.284E–04 | 9.555E–04 |

KC30-3fl-1rl | 0.1984 | 0.1961 | 0.1995 | 3.255E–03 | 3.974E–03 | 3.806E–03 |

KC30-3fl-2rl | 0.5199 | 0.5175 | 0.5223 | 7.312E–03 | 5.439E–03 | 4.821E–03 |

The gap between result qualities widens in favour of PQA as the problem difficulty increases. The smallest instances of the QAP (CHR20A) and mQAP (KC10-2fl-5rl) can be assumed to be the least challenging, and there the results are not as clearly in favour of PQA. The RR problem and the larger instances of QAP (STE36B) and mQAP (KC30-3fl-1rl, KC10-2fl-5rl) are evidently solved to better quality using PQA, as they are more complex than the smaller problem instances.

To check for a statistical significance of the results we used the KS non-parametric test (Pettitt and Stephens, 1977). PQA was compared with AQA and EQA with a null hypothesis of no difference between the performances. The 30 values of the hypervolume indicators of the repeated trials for each problem instance were submitted to the KS analysis (Pettitt and Stephens, 1977) and results are shown in Table 10. All KS tests used for establishing differences between independent datasets under the assumption that they are not normally distributed, result in a rejection of the null hypothesis with a minimum *d* value of 0.2333 at a 95% confidence level. Hence we conclude that the superior performance of the PQA method is statistically significant.

. | PQA versus AQA . | PQA versus EQA . | ||
---|---|---|---|---|

Problem . | d
. | p
. | d
. | p
. |

CHR20A | 0.2333 | 0.042 | 0.3333 | 0.045 |

BUR26E | 0.3000 | 0.048 | 0.3000 | 0.049 |

TAI30A | 0.3294 | 0.046 | 0.3517 | 0.039 |

STE36B | 0.9667 | 0.000 | 0.9667 | 0.000 |

RR | 0.6333 | 0.000 | 0.7333 | 0.000 |

KC10-2fl-5rl | 0.3333 | 0.045 | 0.4000 | 0.011 |

KC30-3fl-1rl | 0.2897 | 0.016 | 0.4333 | 0.005 |

KC30-3fl-2rl | 0.3433 | 0.035 | 0.4333 | 0.005 |

. | PQA versus AQA . | PQA versus EQA . | ||
---|---|---|---|---|

Problem . | d
. | p
. | d
. | p
. |

CHR20A | 0.2333 | 0.042 | 0.3333 | 0.045 |

BUR26E | 0.3000 | 0.048 | 0.3000 | 0.049 |

TAI30A | 0.3294 | 0.046 | 0.3517 | 0.039 |

STE36B | 0.9667 | 0.000 | 0.9667 | 0.000 |

RR | 0.6333 | 0.000 | 0.7333 | 0.000 |

KC10-2fl-5rl | 0.3333 | 0.045 | 0.4000 | 0.011 |

KC30-3fl-1rl | 0.2897 | 0.016 | 0.4333 | 0.005 |

KC30-3fl-2rl | 0.3433 | 0.035 | 0.4333 | 0.005 |

Given the outcomes of the tests for statistical properties, the population size was controlled individually. Figure 5 depicts the results of the 30 runs for each problem instance and quality attribution scheme. PQA is slightly outperformed by the EQA strategy in some of the problems. These results indicate that using forecasting techniques to predict the quality of the population choices for the future iterations may not always be beneficial.

The means and standard deviations of the three quality attribution schemes are shown in Table 11. It can be observed that the mean performance of PQA is usually not higher than that of the other two quality attribution schemes. To check for a statistical significance of the difference in the performances of the quality attribution schemes we submitted the results from the 30 runs to the KS nonparametric test (Pettitt and Stephens, 1977), with the null hypothesis that there is no significant difference between the result sets.

. | Mean . | Standard deviation . | ||||
---|---|---|---|---|---|---|

Problem . | AQA . | EQA . | PQA . | AQA . | EQA . | PQA . |

CHR20A | 0.4660 | 0.4669 | 0.4651 | 4.85E–03 | 3.51E–03 | 4.26E–03 |

BUR26E | 0.2460 | 0.2461 | 0.2459 | 8.92E–05 | 6.88E–05 | 1.31E–04 |

TAI30A | 0.4977 | 0.4982 | 0.4962 | 3.57E–03 | 3.23E–03 | 5.94E–03 |

STE36B | 0.7934 | 0.8164 | 0.7945 | 3.29E–03 | 4.22E–02 | 3.64E–03 |

RR | 6.56E–03 | 6.57E–03 | 6.56E–03 | 2.14E–05 | 2.27E–05 | 1.82E–05 |

KC10-2fl-5rl | 0.8065 | 0.8065 | 0.8065 | 2.74E–05 | 3.30E–06 | 2.74E–05 |

KC30-3fl-1rl | 0.1936 | 0.1993 | 0.1974 | 5.68E–03 | 4.53E–03 | 6.74E–03 |

KC30-3fl-2rl | 0.5754 | 0.5785 | 0.5745 | 4.92E–03 | 4.14E–03 | 3.67E–03 |

. | Mean . | Standard deviation . | ||||
---|---|---|---|---|---|---|

Problem . | AQA . | EQA . | PQA . | AQA . | EQA . | PQA . |

CHR20A | 0.4660 | 0.4669 | 0.4651 | 4.85E–03 | 3.51E–03 | 4.26E–03 |

BUR26E | 0.2460 | 0.2461 | 0.2459 | 8.92E–05 | 6.88E–05 | 1.31E–04 |

TAI30A | 0.4977 | 0.4982 | 0.4962 | 3.57E–03 | 3.23E–03 | 5.94E–03 |

STE36B | 0.7934 | 0.8164 | 0.7945 | 3.29E–03 | 4.22E–02 | 3.64E–03 |

RR | 6.56E–03 | 6.57E–03 | 6.56E–03 | 2.14E–05 | 2.27E–05 | 1.82E–05 |

KC10-2fl-5rl | 0.8065 | 0.8065 | 0.8065 | 2.74E–05 | 3.30E–06 | 2.74E–05 |

KC30-3fl-1rl | 0.1936 | 0.1993 | 0.1974 | 5.68E–03 | 4.53E–03 | 6.74E–03 |

KC30-3fl-2rl | 0.5754 | 0.5785 | 0.5745 | 4.92E–03 | 4.14E–03 | 3.67E–03 |

The null hypothesis was not rejected in any of the KS tests (see Table 12), which means that the superior performances of the AQA and EQA methods compared to the PQA strategy are not statistically significant.

. | PQA versus AQA . | PQA versus EQA . | ||
---|---|---|---|---|

Problem . | d
. | p
. | d
. | p
. |

CHR20A | 0.2000 | 0.537 | 0.2667 | 0.200 |

BUR26E | 0.2000 | 0.537 | 0.3000 | 0.109 |

TAI30A | 0.3750 | 0.162 | 0.3750 | 0.162 |

STE36B | 0.2500 | 0.633 | 0.3750 | 0.162 |

RR | 0.2667 | 0.200 | 0.2667 | 0.200 |

KC10-2fl-5rl | 0.3750 | 0.162 | 0.3750 | 0.162 |

KC30-3fl-1rl | 0.3500 | 0.121 | 0.3542 | 0.113 |

KC30-3fl-2rl | 0.2000 | 0.890 | 0.4000 | 0.136 |

. | PQA versus AQA . | PQA versus EQA . | ||
---|---|---|---|---|

Problem . | d
. | p
. | d
. | p
. |

CHR20A | 0.2000 | 0.537 | 0.2667 | 0.200 |

BUR26E | 0.2000 | 0.537 | 0.3000 | 0.109 |

TAI30A | 0.3750 | 0.162 | 0.3750 | 0.162 |

STE36B | 0.2500 | 0.633 | 0.3750 | 0.162 |

RR | 0.2667 | 0.200 | 0.2667 | 0.200 |

KC10-2fl-5rl | 0.3750 | 0.162 | 0.3750 | 0.162 |

KC30-3fl-1rl | 0.3500 | 0.121 | 0.3542 | 0.113 |

KC30-3fl-2rl | 0.2000 | 0.890 | 0.4000 | 0.136 |

This indicates that using PQA to forecast the effect of the population size does not have a negative effect on the performance of the algorithm, despite the fact that this parameter does not meet the assumptions made by the forecasting models. Hence, we expect that when new EA parameters are controlled, prediction can be used without a significant negative effect on the performance of the algorithm.

## 9 Conclusion

Controlling parameters adaptively relies on strategies to extract effects of parameter values from the optimisation process and extract information which is subsequently used to attribute a measure of quality to the parameter values used. In previous work (Aleti and Moser, 2011), we have suggested that using prediction to attribute a quality metric to the parameter values to choose from for the next iteration is beneficial to the performance of the algorithm. Many forecasting techniques exist for time series prediction. Four of these were considered for use in PQA.

The suitability of each forecasting technique for predicting the effect of six EA parameters (mutation rate, crossover rate, population size, mating pool size, the type of crossover operator, and the type of mutation operator) was assessed. It was established that all considered parameters, with the exception of the population size, adhered, within a certain tolerance, to the assumptions required for the use of each of the considered forecasting techniques. Hence, all forecasting methods could reasonably be considered for the projection of future parameter quality in PQA.

Despite being one of the simplest forecasting strategies, LR has proven to be the most effective model to predict the effect of EA parameters, albeit by a negligible margin. In the experimental studies we have used the PQA method with LR to adjust the mutation rate, crossover rate, mating pool size, mutation operator, and crossover operator throughout the optimisation process carried out using different problem instances. The trials have demonstrated that PQA outperforms the quality attribution methods currently considered most successful.

Although the population size was not found to conform to the assumptions made by the forecasting techniques to a reasonable tolerance, the results of the experiments have shown that using the PQA strategy to control the population size did not have a significant negative effect on the performance of the EA. Based on the experiments with the population size, we argue that if EA parameters are controlled without prior testing for statistical properties, the use of PQA has a potential benefit without the expectation of any negative effects on the performance of the algorithm. The exploration of the best possible prediction method is unlikely to be affected by the discretisation, given that we employ two different settings.