## Abstract

Selection hyper-heuristics (HHs) are randomised search methodologies which choose and execute heuristics during the optimisation process from a set of low-level heuristics. A machine learning mechanism is generally used to decide which low-level heuristic should be applied in each decision step. In this article, we analyse whether sophisticated learning mechanisms are always necessary for HHs to perform well. To this end we consider the most simple HHs from the literature and rigorously analyse their performance for the LeadingOnes benchmark function. Our analysis shows that the standard Simple Random, Permutation, Greedy, and Random Gradient HHs show no signs of learning. While the former HHs do not attempt to learn from the past performance of low-level heuristics, the idea behind the Random Gradient HH is to continue to exploit the currently selected heuristic as long as it is successful. Hence, it is embedded with a reinforcement learning mechanism with the shortest possible memory. However, the probability that a promising heuristic is successful in the next step is relatively low when perturbing a reasonable solution to a combinatorial optimisation problem. We generalise the “simple” Random Gradient HH so success can be measured over a fixed period of time $\tau $, instead of a single iteration. For LeadingOnes we prove that the *Generalised Random Gradient (GRG)* HH can learn to adapt the neighbourhood size of Randomised Local Search to optimality during the run. As a result, we prove it has the best possible performance achievable with the low-level heuristics (Randomised Local Search with different neighbourhood sizes), up to lower-order terms. We also prove that the performance of the HH improves as the number of low-level local search heuristics to choose from increases. In particular, with access to $k$ low-level local search heuristics, it outperforms the best-possible algorithm using any subset of the $k$ heuristics. 
Finally, we show that the advantages of GRG over Randomised Local Search and Evolutionary Algorithms using standard bit mutation increase if the anytime performance is considered (i.e., the performance gap is larger if approximate solutions are sought rather than exact ones). Experimental analyses confirm these results for different problem sizes (up to $n=10^8$) and shed some light on the best choices for the parameter $\tau $ in various situations.

## 1 Introduction

Many successful applications of randomised search heuristics to real-world optimisation problems have been reported. Despite these successes, it is still difficult to decide which particular search heuristic is a good choice for the problem at hand and what parameter settings should be used. In particular, while it is well understood that each heuristic will be efficient on some classes of problems and inefficient on others (Wolpert and Macready, 1997), very little guidance is available explaining how to choose an algorithm for a given problem. The high-level idea behind the field of hyper-heuristics (HHs) is to overcome this difficulty by evolving the search heuristic for the problem rather than choosing one in advance (or applying several arbitrary ones until a satisfactory solution is found). The overall goal is to automate the design and the tuning of the algorithm and parameters for the optimisation problem, hence to achieve a more generally applicable system.

HHs are usually classified into two categories: *generation* HHs and *selection* HHs (Burke et al., 2013). The former typically work offline and generate new heuristics from components of existing heuristics, while the latter typically work online and repeatedly select from a set of low-level heuristics which one to apply in the next state of the optimisation process. The low-level heuristics can be further classified as either *construction heuristics*, which build a solution incrementally, or *perturbation heuristics*, which start with a complete solution and try to iteratively improve it.

In this article, selection-perturbation HHs will be considered. Selection HHs belong to the wide field of *algorithm selection from algorithm portfolios* (Bezerra et al., 2018). A noteworthy example of such a system is the SATZilla approach for the SAT problem (Xu et al., 2008). We refer to Kotthoff (2016) for an overview of approaches to algorithm selection for combinatorial optimisation problems. It should be noted that HHs may also be used for other purposes. For example, a HH may be used as a parameter control method if it is asked to select at each state between different heuristics that only differ by one parameter value, rather than between different general purpose algorithms. See Doerr and Doerr (2018) for a comprehensive review of parameter control techniques. In general though, while the aim of parameter control mechanisms is to essentially choose between different parameter values for a given algorithm, selection HHs normally work at a higher level, since their aim is to learn which low-level heuristics (i.e., algorithms) work best at different states of the optimisation process; i.e., very different algorithms may perform better in different areas of the search space.

Selection HHs consist of two separate components: (1) a *heuristic selection method*, often referred to as the *learning mechanism*, to decide which heuristic is to be applied in the next step of the optimisation process, and (2) a *move acceptance operator* to decide whether the newly produced search points should be accepted.

Move acceptance operators are classified into *deterministic* ones, which make the same decisions independent of the state of the optimisation process, and *non-deterministic* ones, which might make different decisions for the same solutions at different times.

The majority of heuristic selection methods in the literature apply machine learning techniques that generate scores for each heuristic based on their past performance. A commonly used method for this purpose is reinforcement learning (Cowling et al., 2001; Nareyek, 2004; Burke et al., 2003). However, considerably simpler heuristic selection methods have also been successfully applied in the literature. These include, among others: *Simple Random* heuristic selection which just chooses heuristics uniformly at random; *Random Permutation* which generates a random ordering of the low-level heuristics and applies them in that order; *Greedy* heuristic selection which applies all the heuristics to the current search point and returns the best found solution; *Random Gradient* which chooses a random heuristic and continues to apply it as long as it is successful; and *Random Permutation Gradient* which is a combination of the Permutation and Random Gradient methodologies (Cowling et al., 2001, 2002; Berberoǧlu and Uyar, 2011).

Apart from the Greedy one, an obvious advantage of these simple methods is that they execute very quickly. On the other hand, it is unclear how effective they are at identifying (near) optimal solutions in a short time. While the Simple Random, Random Permutation, and Greedy mechanisms do not attempt to learn anything from the past performance of the applied heuristics, Random Gradient and Random Permutation Gradient can still be considered as intelligent selection mechanisms because they embed a reinforcement learning mechanism, albeit with the shortest memory length possible. Experimental work has suggested that such mechanisms may be useful for highly rugged search landscapes (Burke et al., 2013). In this article, we will rigorously evaluate the performance of simple heuristic selection methods for a unimodal benchmark problem and focus most of our attention on the Random Gradient method that uses “minimal” intelligence.

Numerous successes of selection HHs for NP-hard optimisation problems have been reported, including scheduling (Cowling et al., 2001, 2002; Gibbs et al., 2010), timetabling (Özcan et al., 2010), vehicle routing (Asta and Özcan, 2014), and cutting and packing (López-Camacho et al., 2014). However, their theoretical understanding is very limited. Some insights into the behaviour of selection HHs have been achieved via landscape analyses (Maden et al., 2009; Ochoa, Qu et al., 2009; Ochoa, Vázquez-Rodríguez et al., 2009). Concerning their performance, the most sophisticated HH that has been analysed is the one considered by Doerr et al. (2016), where the different neighbourhood sizes $k$ of the Randomised Local Search (RLS$_k$) algorithm were implicitly used as low-level heuristics and a reinforcement learning mechanism was applied as the heuristic selection method. The authors proved that this HH can track the best local search neighbourhood size (i.e., the best fitness-dependent number of distinct bits to be flipped) for OneMax, while they showed experimentally that it outperforms RLS$_1$ and the (1+1) Evolutionary Algorithm ((1+1) EA) for LeadingOnes. The few other available runtime analyses consider the less sophisticated heuristic selection methodologies which are the focus of this article.

Lehre and Özcan (2013) presented the first runtime analysis of selection-perturbation HHs. They considered the Simple Random HH (Cowling et al., 2001) that at each step randomly chooses between a 1-bit flip and a 2-bit flip operator as low-level heuristics, and presented an example benchmark function class, called GapPath, where it is necessary to use more than one of the low-level heuristics to optimise the problem. Similar example functions have also been constructed by He et al. (2012).

A comparative time-complexity analysis of selection HHs has been presented by Alanazi and Lehre (2014). They considered several of the common simple selection mechanisms, namely Simple Random, Permutation, Random Gradient, and Greedy (Cowling et al., 2001, 2002) and analysed their performance for the standard LeadingOnes benchmark function when using a low-level heuristic set consisting of a 1-bit flip and a 2-bit flip operator (i.e., the same set previously considered by Lehre and Özcan (2013)). Their runtime analyses show that the four simple selection mechanisms have the same asymptotic expected runtime, while an experimental evaluation indicates that the runtimes are indeed equivalent already for small problem dimensions. Recently, additive Reinforcement Learning selection was also shown (under mild assumptions) to often have asymptotically equivalent performance to Simple Random selection, including for the same problem setting (i.e., LeadingOnes selecting between 1-bit flip and 2-bit flip operators) (Alanazi and Lehre, 2016). In particular, the results indicate that selection mechanisms such as additive Reinforcement Learning and Random Gradient do not learn to exploit the more successful low-level heuristics and end up having the same performance as Simple Random selection.

The main idea behind Random Gradient is to continue to exploit the currently selected heuristic as long as it is successful. Unlike construction heuristics, where iterating a greedy move on a currently successful heuristic may work for several consecutive construction steps, the probability that a promising heuristic is successful in the next step is relatively low when perturbing a reasonable solution to a combinatorial optimisation problem.

To this end, in this article we propose to generalise the Random Gradient selection-perturbation mechanism such that success can be measured over a fixed period of time $\tau $, which we call the learning period, instead of doing so after each iteration. We refer to the generalised HH as Generalised Random Gradient (GRG). We use the LeadingOnes benchmark function and the 1-bit flip and 2-bit flip operators as low-level heuristics to show that with this simple modification Random Gradient can be surprisingly fast, even if no sophisticated machine learning mechanism is used to select from the set of low-level heuristics. We first derive the exact leading constants in the expected runtimes of the simple mechanisms for LeadingOnes, thus proving that they all have expected runtime $\frac{\ln(3)}{2}n^2+o(n^2)\approx 0.54931n^2+o(n^2)$, confirming what was implied by the experimental analysis of Alanazi and Lehre (2014). This result indicates that all the simple mechanisms essentially choose operators at random in each iteration. Thus they have worse performance than the single operator that always flips one bit (i.e., RLS$_1$), which takes $\frac{1}{2}n^2$ expected iterations to optimise LeadingOnes. We then provide upper bounds on the expected runtime of GRG for the same function. We rigorously prove that the generalised HH has a better expected running time than RLS$_1$ for appropriately chosen values for the parameter $\tau $. Furthermore, we prove that GRG can achieve an expected runtime of $\frac{1+\ln(2)}{4}n^2+o(n^2)\approx 0.42329n^2+o(n^2)$ for LeadingOnes, when $\tau $ satisfies both $\tau =\omega (n)$ and $\tau \le \left(\frac{1}{2}-\varepsilon\right)n\ln(n)$, for some constant $0<\varepsilon<\frac{1}{2}$. This is the best possible expected runtime for an unbiased (1+1) black box algorithm using only the same two mutation operators. We refer to such performance as *optimal* for the HH as no better expected runtime, up to lower order terms, may be achieved with any combination of the available low-level heuristics.

Afterwards we turn our attention to low-level heuristic sets that contain arbitrarily many operators (i.e., $\{1,\ldots,k\}$-bit flips, $k=\Theta (1)$) as commonly used in practical applications. We first show that including more operators is detrimental to the performance of the simple mechanisms for LeadingOnes. Then, we prove that the performance of GRG improves as the number of available operators to choose from increases. In particular, when choosing from $k$ operators as low-level heuristics, GRG is in expectation faster than the best possible performance achievable by any algorithm using any combination of $m<k$ of the operators, for any $k=\Theta (1)$.

We conclude the article with an experimental analysis of the HH for increasing problem sizes, up to $n=10^8$. The experiments confirm that GRG outperforms its low-level heuristics. For two operators, it is shown how proper choices for the parameter $\tau $ can lead to near-optimal performance already for these problem sizes. When the HH is allowed to choose between more than two operators, the experiments show that, already for the considered problem sizes, the performance of GRG improves with more operators for appropriate choices of the learning period $\tau $.

The article is structured as follows. In Section 2, we formally introduce the HH framework together with the simple selection HHs and GRG. In Section 3, we analyse the simple and generalised HHs using two low-level heuristics for the LeadingOnes benchmark function. In Section 4, we present the results for the HHs that choose between an arbitrary number of low-level heuristics. Section 5 presents the experimental analysis. In the conclusion, we present a discussion and some avenues for future work.

Compared with its conference version (Lissovoi et al., 2017), this article has been considerably extended. The results of Section 3 have been generalised to hold for any values of $\tau $ less than $\left(\frac{1}{2}-\varepsilon\right)n\ln n$, for some constant $0<\varepsilon<\frac{1}{2}$. In general, the results have been considerably strengthened. In particular, compared to the extended abstract, in the present manuscript we prove that GRG optimises LeadingOnes in the best possible expected runtime achievable, up to lower order terms. Section 4 is a completely new addition, while a more comprehensive set of experiments is included in Section 5. Due to space restrictions the proofs of the theorems and lemmata of this article are provided as supplementary material, available at https://www.mitpressjournals.org/doi/suppl/10.1162/evco_a_00258.

## 2 Preliminaries

### 2.1 The Hyper-Heuristic Framework and the Heuristic Selection Methods

Let $S$ be a finite search space, $H$ a set of low-level heuristics and $f: S\to \mathbb{R}^{+}$ a fitness function. Algorithm 1 shows the pseudocode representation for a simple selection HH as used in previous experimental and theoretical work (Cowling et al., 2001, 2002; Alanazi and Lehre, 2014). Throughout the article, the term runtime refers to the number of fitness evaluations used by Algorithm 1 until it finds the optimum. Note that the strict inequality in line 5 of Algorithm 1 is consistent with previous HH literature (Alanazi and Lehre, 2014), and using “$\geq$” instead would make no difference for the benchmark problem we consider in this work (i.e., LeadingOnes).

Different HHs are obtained from the framework described in Algorithm 1 according to which heuristic selection method is applied in Step 3. While sophisticated machine learning mechanisms are usually used, the following *learning mechanisms* have also been commonly considered in the literature to solve combinatorial optimisation problems (Cowling et al., 2001, 2002):

- **Simple Random**, which selects a low-level heuristic $h\in H$ independently with probability $p_h$ in each iteration. Each heuristic is usually selected uniformly at random; that is, $p_h=\frac{1}{|H|}$;
- **Permutation**, which generates a random ordering of the low-level heuristics in $H$ and returns them in that sequence when called by the HH;
- **Greedy**, which applies all the low-level heuristics in $H$ in parallel and returns the best found solution;
- **Random Gradient**, which randomly selects a low-level heuristic $h\in H$, and keeps using it as long as it obtains improvements.

Alanazi and Lehre (2014) derived upper and lower bounds on the expected runtime of Algorithm 1 for the LeadingOnes benchmark function when using these four simple heuristic selection mechanisms equipped with $H=\{\text{1BitFlip},\text{2BitFlip}\}$. The first low-level heuristic flips one bit chosen uniformly at random while the second one chooses two bits to flip with replacement (i.e., the same bit may flip twice).
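As an illustration, the framework with Simple Random selection over the two operators can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than a transcription of Algorithm 1: the function names (`leading_ones`, `m_bit_flip`, `simple_random_hh`), the uniformly random initial bit string, and the seeding are choices made for this example. New search points are accepted only on strict improvement, and the runtime is counted in fitness evaluations of offspring.

```python
import random


def leading_ones(x):
    """Number of consecutive 1-bits at the start of the bit string x."""
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count


def m_bit_flip(x, m, rng):
    """Flip m positions chosen uniformly at random WITH replacement.

    The same position may be drawn twice, in which case it is flipped back.
    """
    y = list(x)
    for _ in range(m):
        pos = rng.randrange(len(y))
        y[pos] ^= 1
    return y


def simple_random_hh(n, p1=0.5, seed=0):
    """Simple Random selection between 1BitFlip (prob. p1) and 2BitFlip.

    Returns the number of offspring evaluations until the optimum is found.
    The initialisation and the seed are assumptions of this sketch.
    """
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    fx = leading_ones(x)
    evaluations = 0
    while fx < n:
        m = 1 if rng.random() < p1 else 2  # heuristic selection step
        y = m_bit_flip(x, m, rng)
        fy = leading_ones(y)
        evaluations += 1
        if fy > fx:  # strict improvement is required for acceptance
            x, fx = y, fy
    return evaluations
```

Averaging `simple_random_hh(n)` over many seeds and dividing by $n^2$ gives an empirical estimate of the leading constant for small $n$.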

The runtime analysis performed by Alanazi and Lehre (2014) only provided bounds of the same asymptotic order on the expected runtime for all four mechanisms. An experimental analysis they carried out suggested, however, that all mechanisms have the same performance as just choosing the low-level heuristics at random for LeadingOnes. We will prove this by deriving the exact expected runtimes of the simple mechanisms in Section 3. Differently from the other three mechanisms, the idea behind Random Gradient is to try to learn from the past performance of the heuristics. However, by making a heuristic selection decision in every iteration, the mechanism does not have enough time to appreciate whether the selected heuristic is a good choice for the current optimisation state.

In this article, we generalise the Random Gradient mechanism to allow a longer period of time to appreciate whether a low-level heuristic is successful or not, before deciding whether to select a different low-level heuristic. Our aim is to maintain the intrinsic ideas of the simple learning mechanism while generalising it sufficiently to allow for learning to take place. The pseudocode of the resulting HH is given in Algorithm 2, while the proposed learning mechanism works as follows:

**Generalised Random Gradient (GRG)**: A low-level heuristic is chosen uniformly at random (*Decision Stage*) and run for a learning period of fixed time $\tau $. If an improvement is found before the end of the period, then a new period of time $\tau $ is immediately initialised (i.e., $c_t$ in Algorithm 2 is set to 0 immediately) (*Exploitation Stage*). If the chosen operator fails to provide an improvement in $\tau $ iterations, a new operator is chosen at random.
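The mechanism described above can be sketched in Python as follows. This is an illustrative sketch and not a transcription of Algorithm 2: the function names, the random initialisation, the seed, and the parameter `k` (the number of available operators, 1BitFlip up to $k$BitFlip) are assumptions of this example. The counter `c_t` plays the role described in the text: it is reset to 0 on every improvement, and a new operator is drawn once it reaches $\tau$.

```python
import random


def leading_ones(x):
    """Number of consecutive 1-bits at the start of the bit string x."""
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count


def m_bit_flip(x, m, rng):
    """Flip m positions chosen uniformly at random with replacement."""
    y = list(x)
    for _ in range(m):
        y[rng.randrange(len(y))] ^= 1
    return y


def generalised_random_gradient(n, tau, k=2, seed=0):
    """GRG on LeadingOnes with operators 1BitFlip, ..., kBitFlip.

    Returns the number of offspring evaluations until the optimum is found.
    """
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    fx = leading_ones(x)
    evaluations = 0
    while fx < n:
        m = rng.randint(1, k)  # Decision Stage: uniform choice of operator
        c_t = 0
        while c_t < tau and fx < n:
            c_t += 1
            y = m_bit_flip(x, m, rng)
            fy = leading_ones(y)
            evaluations += 1
            if fy > fx:
                x, fx = y, fy
                c_t = 0  # Exploitation Stage: start a fresh learning period
    return evaluations
```

With `tau=1` the sketch re-selects an operator after every unsuccessful step, which corresponds to the original Random Gradient behaviour; larger values of `tau` give the selected operator more time to demonstrate success.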

Our aim is to prove that the simple GRG HH runs in the best possible expected runtime achievable for LeadingOnes using mutation operators with different neighbourhood sizes (i.e., $\{\text{1BitFlip},\ldots,k\text{BitFlip}\}$ mutation operators) as low-level heuristics. We will first derive the best possible expected runtime achievable by applying the $k$ operators in any order, and then prove that GRG matches it up to lower order terms.

Throughout the article, when discussing the $m$BitFlip operator, we refer to the mutation operator which flips $m=\Theta (1)$ bits in the bit-string with replacement; that is, it is possible to flip and re-flip the same bit within the same mutation step. Since this has been used in previous literature on the topic (Lehre and Özcan, 2013; Alanazi and Lehre, 2014, 2016), we naturally continue with this choice. We will also prove that the presented results still hold if operators that flip $m$ bits without replacement are used (i.e., operators that select a new bit-string with Hamming distance $m$ from the original bit-string). In particular, we will show that any performance differences are limited to lower order terms in the expected runtimes. The latter operators are the well-known RLS algorithms with neighbourhood size $m$ (i.e., RLS$_m$).

### 2.2 The Pseudo-Boolean Benchmark Function LeadingOnes

The LeadingOnes (LO) benchmark function counts the number of consecutive one-bits in a bit string before the first zero-bit: $\text{LeadingOnes}(x):=\sum_{i=1}^{n}\prod_{j=1}^{i}x_j.$
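The following short Python sketch evaluates the function in two ways: by counting the leading one-bits directly, and by evaluating the sum-of-products definition literally. The function names are chosen for this example only; the second variant exists purely to make the correspondence with the formula explicit.

```python
def leading_ones(x):
    """Count the consecutive 1-bits at the start of the bit string x."""
    count = 0
    for bit in x:
        if bit == 0:
            break
        count += 1
    return count


def leading_ones_by_definition(x):
    """Evaluate sum_{i=1}^{n} prod_{j=1}^{i} x_j literally (quadratic time)."""
    total = 0
    for i in range(1, len(x) + 1):
        prefix_product = 1
        for j in range(i):
            prefix_product *= x[j]
        total += prefix_product
    return total


print(leading_ones([1, 1, 0, 1]))                # 2
print(leading_ones_by_definition([1, 1, 0, 1]))  # 2
```

Each product $\prod_{j=1}^{i}x_j$ equals 1 exactly when the first $i$ bits are all ones, so the sum counts the length of the all-ones prefix.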

The unrestricted black box complexity of LeadingOnes is $O(n\log\log n)$ (Afshani et al., 2013) and there exist randomised search heuristics with expected runtimes of $o(n^2)$. Recently some Estimation of Distribution Algorithms (EDAs) have been presented which have surprisingly good performance for the problem. Doerr and Krejca (2018) introduced a modified compact Genetic Algorithm (cGA) called sig-cGA that, rather than updating the frequency vector in every generation, does so only once it notices a significance in its history of samples (i.e., once a significant number of 1s or 0s are sampled in a certain bit position, as opposed to a uniform binomial distribution of samples). They proved that the algorithm optimises LeadingOnes, OneMax and BinVal in $O(n\log n)$ expected function evaluations. Two other algorithms have been proven to optimise LeadingOnes in the same asymptotic expected time: a stable compact Genetic Algorithm (scGA) that biases updates that favour frequencies that move towards $\frac{1}{2}$ (Friedrich et al., 2016) and a Convex Search Algorithm (CSA) using binary uniform convex hull recombination (for sufficiently large populations and with an appropriate restart strategy) (Moraglio and Sudholt, 2017). While the latter two algorithms have exceptional performance for LeadingOnes, their runtime is very poor for OneMax, respectively providing runtimes at least exponential and super-polynomial in the problem size with high probability (Doerr and Krejca, 2018).

The unbiased black box complexity of LeadingOnes is $\Theta (n^2)$ (Lehre and Witt, 2012). If biased mutation operators are allowed but truncation selection is imposed then no asymptotic improvement may be achieved over the unbiased black box complexity. Indeed, Doerr and Lengler (2018) recently proved that the best possible asymptotic performance of any (1+1) elitist black-box algorithm for LeadingOnes is $\Omega (n^2)$. This bound is matched by the performance of simple well studied heuristics. RLS has an expected runtime of $0.5n^2$ fitness function evaluations (Buzdalov and Buzdalova, 2015) and it is well known that the standard (1+1) EA (with mutation rate $\frac{1}{n}$) has an expected runtime of $\frac{e-1}{2}n^2-o(n^2)\approx 0.85914n^2-o(n^2)$ (Böttcher et al., 2010). Böttcher et al. (2010) also showed that the best static mutation rate for the (1+1) EA is $\approx \frac{1.5936}{n}$, which improves the expected runtime to approximately $0.77201n^2\pm O(n)$. Furthermore, they showed that the (1+1) EA with an appropriately chosen dynamic mutation rate (i.e., $\frac{1}{\text{LO}(x)+1}$) can outperform any static choice, giving an expected runtime of $\frac{e}{4}n^2\pm o(n^2)\approx 0.68n^2\pm o(n^2)$. Note that the (1+1) EA is slowed down by the approximately $37\%$ of the iterations in which no bits are flipped (Jansen and Zarges, 2011). The best possible (1+1) EA variant that uses standard bit mutation (discounting iterations in which no bits flip) with the best possible mutation rate at each step has an expected runtime of approximately $0.404n^2\pm o(n^2)$ (Doerr et al., 2019). This is still worse than the best expected performance achievable by an unbiased (1+1) black box algorithm, approximately $0.388n^2\pm o(n^2)$ (Doerr and Wagner, 2018; Doerr, 2018). In this article, we will prove that GRG equipped with a sufficient number of low-level heuristics can match this optimal expected runtime.

## 3 Hyper-Heuristics Are Faster than Their Low-Level Heuristics

In this section, we show that GRG can outperform its constituent low-level heuristics even when the latter are efficient for the problem at hand. To achieve this, we equip the HH with the minimal set of low-level heuristics: $H=\{\text{1BitFlip},\text{2BitFlip}\}$. We first introduce a lower bound on the expected runtime of all algorithms which use only the 1BitFlip and 2BitFlip operators for LeadingOnes. Afterwards we will show that GRG matches this expected runtime up to lower order terms.

To achieve the first result we will rely on the following theorem.

Theorem 1 holds for any unbiased (1+1) black-box algorithm for LeadingOnes (Doerr, 2018). Since we analyse unbiased (1+1) black-box algorithms, each differing by their mutation operator, we can apply Theorem 1 in our analysis.

$A_i$ is the expected time to find an improvement when the current solution $x$ consists of $i$ leading 1-bits. Hence, the theorem implies that the overall expected runtime is minimised if the heuristic $h\in H$ with lowest expected improvement time $A_i$ is applied whenever the current solution has $i$ leading 1-bits, for all $0\le i<n$. Naturally, the heuristic with lowest expected time $A_i$ is the one with the highest *success* probability (i.e., the probability of producing a fitness improvement).

The 1BitFlip operator has a success probability of $P(\text{Imp}_1\mid \text{LO}(x)=i)=\frac{1}{n}$. The success probability of the 2BitFlip operator is $P(\text{Imp}_2\mid \text{LO}(x)=i)=\frac{1}{n}\cdot \frac{n-i-1}{n}\cdot 2=\frac{2n-2i-2}{n^2}$. Hence, for $i\ge \frac{n}{2}$, the 1BitFlip operator is more effective (i.e., has a higher probability of success and hence a lower expected waiting time for each improvement $A_i$), while 2BitFlip is preferable before. Note that both are equally effective when $\text{LO}(x)=\frac{n}{2}-1$. The expected runtime of an algorithm that applies these operators in such a way is a lower bound on the expected runtime of all unbiased (1+1) black-box algorithms using only the same two operators.
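The crossover point between the two operators can be checked with exact rational arithmetic. The sketch below simply encodes the two success probabilities stated above; the function names and the choice $n=100$ are assumptions made for illustration.

```python
from fractions import Fraction


def p_improve_1bitflip(n, i):
    """Success probability of 1BitFlip with i leading ones: 1/n."""
    return Fraction(1, n)


def p_improve_2bitflip(n, i):
    """Success probability of 2BitFlip (with replacement): (2n-2i-2)/n^2."""
    return Fraction(2 * n - 2 * i - 2, n * n)


n = 100
# Fitness values at which both operators are equally likely to improve.
equal_points = [i for i in range(n)
                if p_improve_1bitflip(n, i) == p_improve_2bitflip(n, i)]
# Fitness values at which 2BitFlip is strictly better than 1BitFlip.
two_bit_better = [i for i in range(n)
                  if p_improve_2bitflip(n, i) > p_improve_1bitflip(n, i)]

print(equal_points)         # [49], i.e. n/2 - 1
print(max(two_bit_better))  # 48
```

For even $n$ the two probabilities coincide exactly at $i=\frac{n}{2}-1$, 2BitFlip is strictly better below that value, and 1BitFlip is strictly better from $i=\frac{n}{2}$ onwards.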

**Theorem 2.** The best-possible expected runtime of any unbiased (1+1) black-box algorithm using only the 1BitFlip and 2BitFlip operators for LeadingOnes is $\frac{1+\ln(2)}{4}n^2+O(n)\approx 0.42329n^2+O(n)$.

Thus, we have proven a theoretical lower bound of $\frac{1+\ln(2)}{4}n^2+O(n)\approx 0.42329n^2+O(n)$ for the expected runtime of any unbiased (1+1) black-box algorithm using the 1BitFlip and 2BitFlip operators for LeadingOnes, which is better in the leading constant than the expected runtime of RLS (i.e., $0.5n^2$).
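The leading constant can be recovered with a short calculation. The sketch below assumes that Theorem 1 takes the standard form $E[T]=\frac{1}{2}\sum_{i=0}^{n-1}A_i$ known for LeadingOnes (the factor $\frac{1}{2}$ accounting for the uniformly random bits following the leading ones), that $n$ is even, and that 2BitFlip is applied for $i<\frac{n}{2}$ and 1BitFlip for $i\ge \frac{n}{2}$, so that $A_i=\frac{n^2}{2(n-i-1)}$ and $A_i=n$ respectively:

$$
\begin{aligned}
E[T] &= \frac{1}{2}\left(\sum_{i=0}^{n/2-1}\frac{n^2}{2(n-i-1)}+\sum_{i=n/2}^{n-1}n\right)
= \frac{1}{2}\left(\frac{n^2}{2}\sum_{j=n/2}^{n-1}\frac{1}{j}+\frac{n^2}{2}\right)\\
&= \frac{1}{2}\left(\frac{n^2}{2}\ln(2)+\frac{n^2}{2}+O(n)\right)
= \frac{1+\ln(2)}{4}n^2+O(n).
\end{aligned}
$$

The second step substitutes $j=n-i-1$, and the third uses $\sum_{j=n/2}^{n-1}\frac{1}{j}=\ln(2)+O\left(\frac{1}{n}\right)$.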

The rest of the section is structured as follows. In Subsection 3.1, we show that the simple HH mechanisms all perform equivalently to Simple Random for LeadingOnes, up to lower-order terms. In Subsection 3.2, we show that GRG is much faster and matches the optimal expected runtime derived in Theorem 2, up to lower-order terms.

### 3.1 Simple Mechanisms

In this section, we show that the standard simple heuristic selection mechanisms (i.e., Simple Random, Permutation, Greedy, and Random Gradient) all have the same expected runtime for LeadingOnes, up to lower-order terms. Note that, from the experiments performed by Alanazi and Lehre (2014), we know that these lower-order terms do not have any visible impact on the average runtime already for small problem sizes.

The following theorem derives the expected runtime for the Simple Random mechanism. The subsequent corollary extends the result to the other mechanisms.

**Theorem 3.** Let $p_1$ be the probability of choosing the 1BitFlip mutation operator, and $1-p_1$ the probability of choosing the 2BitFlip mutation operator. Then, the expected runtime of the Simple Random mechanism using $H=\{\text{1BitFlip},\text{2BitFlip}\}$, with $p_1\in (0,1)$, for LeadingOnes is $\frac{1}{4(1-p_1)}\ln\left(\frac{2-p_1}{p_1}\right)n^2+o(n^2).$ If $p_1=0$ the expected runtime is infinite. If $p_1=1$, the expected runtime is $0.5n^2$.

When $p_1=0.5$ (i.e., there is an equal chance of choosing each operator in each iteration), the standard Simple Random mechanism has an expected improvement time of $A_i=\frac{2n^2}{3n-2i-2}$, and an expected runtime of $\frac{\ln(3)}{2}n^2+o(n^2)\approx 0.54931n^2+o(n^2)$. The expected runtime improves with increasing $p_1$, hence the optimal choice is $p_1=1$ (i.e., RLS).
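The leading constants above can be checked numerically. The sketch below assumes the standard form $E[T]=\frac{1}{2}\sum_{i=0}^{n-1}A_i$ for the expected runtime on LeadingOnes, with $A_i$ the reciprocal of the per-iteration success probability $\frac{p_1}{n}+(1-p_1)\frac{2n-2i-2}{n^2}$ of the Simple Random mechanism; the function names are chosen for this example.

```python
import math


def simple_random_constant(n, p1):
    """Evaluate (1/2) * sum_i A_i / n^2 for Simple Random selection.

    A_i is the expected waiting time for an improvement with i leading ones,
    i.e. the reciprocal of the mixed success probability.
    """
    total = 0.0
    for i in range(n):
        p_success = p1 / n + (1 - p1) * (2 * n - 2 * i - 2) / n ** 2
        total += 1.0 / p_success
    return 0.5 * total / n ** 2


def closed_form_constant(p1):
    """Leading constant ln((2 - p1) / p1) / (4 (1 - p1)) for 0 < p1 < 1."""
    return math.log((2 - p1) / p1) / (4 * (1 - p1))


print(simple_random_constant(10_000, 0.5))  # close to ln(3)/2
print(closed_form_constant(0.5))            # ln(3)/2
```

For finite $n$ the summed value differs from the closed form only by a term that vanishes as $n$ grows, in line with the $o(n^2)$ term in the statement.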

**Corollary 4.** The expected runtime of the Permutation, Greedy, and Random Gradient mechanisms using $H=\{\text{1BitFlip},\text{2BitFlip}\}$ for LeadingOnes is $\frac{\ln(3)}{2}n^2+o(n^2)\approx 0.54931n^2+o(n^2).$

We point out that the lower bound for the Random Gradient HH contradicts the lower bound of $\frac{n^2}{9}\left(4+3\ln\frac{10}{3}\right)\approx 0.846n^2+o(n^2)$ found by Alanazi and Lehre (2014). However, their bound results from a small mistake in their proof. They should have found a lower bound of $\frac{n^2}{9}\left(3\ln\frac{10}{3}\right)+o(n^2)\approx 0.401n^2+o(n^2)$, which agrees with our derived expected runtime of $\frac{\ln 3}{2}n^2+o(n^2)\approx 0.549n^2+o(n^2)$.
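The three leading constants in this comparison can be verified with a few lines of arithmetic. The variable names below are chosen for this example only.

```python
import math

# Leading constant of the lower bound originally reported for Random Gradient.
reported_bound = (4 + 3 * math.log(10 / 3)) / 9
# Leading constant of the corrected lower bound.
corrected_bound = (3 * math.log(10 / 3)) / 9
# Leading constant of the expected runtime derived for the simple mechanisms.
derived_runtime = math.log(3) / 2

print(round(reported_bound, 3))   # 0.846
print(round(corrected_bound, 3))  # 0.401
print(round(derived_runtime, 3))  # 0.549
```

Since $0.401<0.549<0.846$, the corrected lower bound is consistent with the derived runtime, whereas the originally reported one exceeds it.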

### 3.2 Generalised Random Gradient

In this subsection, we present a rigorous theoretical analysis of GRG for LeadingOnes. The main result of this subsection is that GRG is able to match, up to lower-order terms, the best-possible performance of any algorithm using the 1BitFlip and 2BitFlip operators for LeadingOnes, as presented in Theorem 2. We state the main result in Corollary 5 now.

**Corollary 5 (of Theorem 6).** The expected runtime of the Generalised Random Gradient hyper-heuristic using $H=\{\text{1BitFlip},\text{2BitFlip}\}$ for LeadingOnes, with $\tau $ that satisfies both $\tau =\omega (n)$ and $\tau \le \left(\frac{1}{2}-\varepsilon\right)n\ln(n)$, for some constant $0<\varepsilon<\frac{1}{2}$, is at most $\frac{1+\ln(2)}{4}n^2+o(n^2)\approx 0.42329n^2+o(n^2).$

This improves significantly upon the result of Lissovoi et al. (2017). While the previous result showed that the HH outperformed its low-level heuristics, Corollary 5 also shows that the expected runtime matches the optimal one up to lower-order terms. Our previous work only considered setting $\tau =cn$ for constant $c$, a learning period for which optimal expected runtime could not be proven.

We first present the necessary prerequisite results for the proof of Corollary 5. The following theorem is very general as it provides an upper bound on the expected runtime of GRG for any value of $\tau $ smaller than $\frac{1}{2}n\ln n$. In particular, Theorem 6 allows us to identify values of $\tau $ for which the expected runtime of the HH is the optimal one achievable using the two operators. This is the result highlighted in Corollary 5 and depicted in Figure 1.

Our proof technique partitions the search space into $w$ stages, each representing an equal range of fitness values. The expected runtime of the algorithm does not depend on $w$. However, $w$ does affect the upper bound on the runtime we obtain. The greater the number of stages, the tighter the upper bound the theorem statement provides as long as $w=o\left(\frac{n}{\exp(\tau /n)}\right)$.

Figure 1 presents the theoretical upper bounds from Theorem 6 for a range of linear $\tau $ values. For $\tau =5n$, GRG already outperforms RLS, giving an expected runtime of $0.46493n^2+o(n^2)$. For $\tau =30n$, the performance improves to $0.42385n^2+o(n^2)$, matching the best possible performance from Theorem 2 up to 3 decimal places. We have seen that GRG is able to exactly match this optimal performance, up to lower-order terms, for $\tau =\omega (n)$ in Corollary 5.

In particular, the main result of this section (i.e., Corollary 5 of Theorem 6) states that the expected runtime of GRG using $H=\{\text{1BitFlip},\text{2BitFlip}\}$ for LeadingOnes with $\tau $ that satisfies both $\tau =\omega (n)$ and $\tau \le \left(\frac{1}{2}-\varepsilon\right)n\ln(n)$, for some constant $0<\varepsilon<\frac{1}{2}$, is at most $\frac{1+\ln(2)}{4}n^2+o(n^2)\approx 0.42329n^2+o(n^2)$.

## 4 Increasing the Choice of Low-Level Heuristics Leads to Improved Performance

Runtime analyses of HHs are easier if the algorithms can choose between only two operators. However, in realistic contexts, HHs have to choose between many more operators.

In this section, we consider the previously analysed GRG HH, yet extend the set $H$ of low-level heuristics to be of size $k\ge 2$; that is, $|H|=k=\Theta (1)$. We extend the previous analysis by considering $k$ different mutation operators, where the $m$-th operator (for $1\le m\le k$) flips $m$ bits chosen uniformly at random with replacement. This more accurately represents HH approaches employed for real-world problems.

As before, we consider the LeadingOnes benchmark function. We will rigorously show that the performance of the simple mechanisms deteriorates as the number ($|H|=k=\Theta (1)$) of operators increases. In addition, we prove decreasing upper bounds on the expected runtime of GRG as the number of operators increases.

The main result of this section is the following theorem, which states that GRG choosing between $k$ stochastic mutation operators has better expected performance for LeadingOnes than any algorithm, including the best-possible one, that uses fewer than $k$ of these operators. We now present the statement of Theorem 7.

The expected runtime for LeadingOnes of the Generalised Random Gradient hyper-heuristic using $H=\{\text{1BitFlip},\dots,k\text{BitFlip}\}$ and $k=\Theta (1)$, with $\tau $ that satisfies both $\tau =\omega (n)$ and $\tau \le \left(\frac{1}{k}-\varepsilon\right)n\ln(n)$ for some constant $0<\varepsilon<\frac{1}{k}$, is smaller than the best-possible expected runtime of any unbiased (1 $+$ 1) black-box algorithm using any strict subset of $\{\text{1BitFlip},\dots,k\text{BitFlip}\}$.

Theorem 7 highlights the power of HHs as general-purpose problem solvers. The inclusion of more heuristics to the set of low-level heuristics is implied to be preferable, showcasing the impressive learning capabilities of even simple HHs. Figure 2 highlights the meaning of Theorem 7 for $k=3$ and $k=5$. The expected runtime of GRG with three operators is better than the best possible expected runtime achievable using the first two operators in any combination. The figure also highlights that with five operators the expected runtime is better than any one achievable using fewer operators.

We now present the necessary prerequisite analysis to obtain the result. We first derive the best-possible expected runtime achievable by any unbiased (1 $+$ 1) black-box algorithm using $\{\text{1BitFlip},\dots,k\text{BitFlip}\}$ mutation operators for LeadingOnes. Before this, we introduce the following two helpful lemmata (which hold for all problem sizes $n\ge 1$).

Lemma 9 follows from Lemma 8. In particular, an operator that flips $m$ bits (i.e., $m\text{BitFlip}$) has a higher probability of success than an operator that flips $m-1$ bits (i.e., $(m-1)\text{BitFlip}$) or fewer when $i<\frac{n}{m}-1$. It is worth noting that only an operator which flips an odd number of bits can make progress when $LO(x)=n-1$, and the 1BitFlip operator has the best success probability at this point (i.e., $P(\mathrm{Imp}_1\mid LO(x)=i)=\frac{1}{n}$). We additionally note through a simple calculation that the 1BitFlip operator has the highest probability of success in the second half of the search space (i.e., when $\frac{n}{2}\le i<n$).

In Theorem 10, we present the best-possible expected runtime for any unbiased (1 $+$ 1) black-box algorithm using $\{\text{1BitFlip},\dots,k\text{BitFlip}\}$ as mutation operators for LeadingOnes. Similar results have recently been presented by Doerr (2018) and Doerr and Wagner (2018) for mutation operators that flip bits without replacement. The expected runtimes are the same up to lower-order terms.

In particular, taking limits as $n\to\infty$, we have $E(T_{1,\mathrm{Opt}})=\frac{1}{2}n^2$, $E(T_{2,\mathrm{Opt}})=\frac{1+\ln(2)}{4}n^2\approx 0.42329n^2$, $E(T_{3,\mathrm{Opt}})=\left(\frac{1}{3}+\frac{\ln(2)}{2}-\frac{\ln(3)}{4}\right)n^2\approx 0.40525n^2$, and $E(T_{5,\mathrm{Opt}})=\left(\frac{3721}{11520}+\frac{\ln(2)}{2}-\frac{\ln(3)}{4}\right)n^2\approx 0.39492n^2$. A closed form result for $E(T_{k,\mathrm{Opt}})$ is difficult to find, as is the limit for the best-possible expected runtime as $k\to\infty$. A numerical analysis by Doerr and Wagner (2018) suggests an expected runtime of $E(T_{\infty,\mathrm{Opt}})\approx 0.388n^2\pm o(n^2)$.
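These leading constants can be checked numerically for a finite problem size. The sketch below is not part of the original analysis: it assumes the first-order improvement probability $m\cdot\frac{1}{n}\cdot\left(\frac{n-i-1}{n}\right)^{m-1}$ for the operator that flips $m$ bits (the expression used in the Simple Random bound of Section 4.3) and the standard LeadingOnes identity $E(T)=\frac{1}{2}\sum_i 1/p_i$. The function names are hypothetical.

```python
def improvement_probability(n: int, i: int, m: int) -> float:
    """Assumed first-order chance that flipping m bits (with replacement)
    improves a bit string with LeadingOnes value i: one flip hits bit i+1
    and the other m-1 flips all land behind it."""
    return m * (1.0 / n) * ((n - i - 1) / n) ** (m - 1)


def best_possible_constant(n: int, k: int) -> float:
    """Leading constant c of c*n^2 when, at every fitness level, the operator
    with the highest improvement probability among 1..k bit flips is used."""
    total = 0.0
    for i in range(n):
        best_p = max(improvement_probability(n, i, m) for m in range(1, k + 1))
        total += 1.0 / best_p  # expected waiting time at fitness level i
    return 0.5 * total / n**2  # on average every second level is a free rider


if __name__ == "__main__":
    for k in (1, 2, 3, 5):
        print(k, round(best_possible_constant(20_000, k), 4))
```

For moderately large $n$ the printed values should be close to the constants listed above, with deviations of order $1/n$ that come from replacing the integrals by finite sums.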

### 4.1 “Simple” Mechanisms

We will now see how the simple learning mechanisms (Simple Random, Permutation, Greedy, and Random Gradient) perform when having to choose between $k$ operators (i.e., $H=\{\text{1BitFlip},\dots,k\text{BitFlip}\}$) for LeadingOnes. We will show that incorporating more operators is detrimental to the performance of the simple mechanisms for LeadingOnes.

We start again by stating the expected runtime of the Simple Random mechanism, and use this as a basis for the other three mechanisms. Recall that the standard Simple Random mechanism chooses each operator uniformly at random in each iteration (i.e., with probability $\frac{1}{k}$ when using $k$ operators).

Note that the expected runtime of the Simple Random mechanism increases with $k$. Hence, incorporating more operators is detrimental to its performance. In particular, the expected runtimes when using 1, 2, and 3 operators are respectively (up to lower-order terms) $\frac{1}{2}n^2$, $\frac{\ln(3)}{2}n^2\approx 0.54931n^2$, and $\frac{3\sqrt{2}}{4}\arctan\left(\frac{\sqrt{2}}{2}\right)n^2\approx 0.65281n^2$.
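The three constants can be reproduced with a few lines of code. The sketch below evaluates the Simple Random bound $\frac{k}{2}\cdot\sum_{i=0}^{n-1}\left(\sum_{m=1}^{k}m\cdot\frac{1}{n}\cdot\left(\frac{n-i-1}{n}\right)^{m-1}\right)^{-1}$ (the fixed-target expression of Section 4.3 with target $X=n$) for a finite $n$ and normalises by $n^2$. The function name is hypothetical, and lower-order terms are ignored.

```python
def simple_random_constant(n: int, k: int) -> float:
    """Leading constant c of c*n^2 for Simple Random with operators
    1BitFlip..kBitFlip on LeadingOnes, evaluated for problem size n."""
    total = 0.0
    for i in range(n):
        remaining = (n - i - 1) / n  # fraction of bits behind position i+1
        # Sum of the improvement probabilities of the k operators at level i.
        p_sum = sum(m * (1.0 / n) * remaining ** (m - 1) for m in range(1, k + 1))
        total += 1.0 / p_sum
    # Factor k: each operator is drawn with probability 1/k.
    # Factor 1/2: on average every second fitness level is a free rider.
    return (k / 2.0) * total / n**2


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, round(simple_random_constant(20_000, k), 4))
```

The values grow with $k$, which is the deterioration described above.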

We now prove the same deteriorating performance for the other simple HHs.

### 4.2 The Generalised Random Gradient Hyper-Heuristic Has the Best Possible Performance Achievable

In this subsection, we present a rigorous theoretical analysis of the expected runtime of GRG using $k=\Theta (1)$ operators for LeadingOnes. The following general theorem provides an upper bound on the expected runtime of GRG using $k=\Theta (1)$ low-level stochastic mutation heuristics of different neighbourhood size for any value of $\tau $ smaller than $\frac{1}{k}n\ln n$. The theorem allows us to identify values of $\tau $ for which the expected runtime of GRG is the optimal expected runtime that may be achieved by using $k$ operators. This result will be highlighted in Corollary 14 for values of $\tau =\omega (n)$. The main result of this section has been presented in Theorem 7, which shows that increasing the number of operators (i.e., $|H|$) that GRG has access to leads to faster expected runtimes and, in particular, expected runtimes that are strictly smaller than that of any unbiased (1 $+$ 1) black-box algorithm using any strict subset of the same set of operators.

Similarly to Theorem 6, the proof technique partitions the search space into $w$ stages, each representing an equal range of fitness values. As before, the expected runtime of the algorithm does not depend on $w$. However, $w$ does affect the upper bound we obtain. The greater the number of stages, the tighter the upper bound the theorem provides, as long as $w=o\left(\frac{n}{\exp(k\tau /n)}\right)$.

Theorem 13 presents upper bounds on the expected runtime of GRG, with access to an arbitrary number $k=\Theta (1)$ of low-level stochastic mutation heuristics of different neighbourhood sizes for LeadingOnes using any learning period $\tau \le \left(\frac{1}{k}-\varepsilon\right)n\ln(n)$, for some constant $0<\varepsilon<\frac{1}{k}$. The bound on the expected runtime improves as $\tau $ increases up to the limit provided by the theorem. Figure 3 shows the relationship between the duration of the learning period and theoretical upper bounds for different $k$-operator variants of GRG from Theorem 13. The upper bounds obtained for GRG with more operators are better than the ones with fewer operators, as implied by Corollary 10. In particular, the upper bound for GRG with $k$ operators is smaller than the best possible expected runtime for any algorithm that has fewer than $k$ operators, as implied by Theorem 7. We depicted this result explicitly for $k=3$ and $k=5$ in Figure 2. When $\tau =\omega (n)$ and $\tau \le \left(\frac{1}{k}-\varepsilon\right)n\ln(n)$ for some constant $0<\varepsilon<\frac{1}{k}$, GRG is able to find the best possible runtime achievable with the low-level heuristics, up to lower-order terms, as will be presented in Corollary 14.

Doerr and Wagner (2018) calculated that the best expected runtime for any unbiased (1 $+$ 1) black-box algorithm using such mutation operators for LeadingOnes is (up to lower-order terms) $\approx 0.388n^2$. Corollary 14 states that GRG can match this theoretical performance limit up to one decimal place with 4 low-level heuristics, up to two decimal places with 11 low-level heuristics, and up to three decimal places with 18 low-level heuristics. In Table 1 we present some of the most interesting parameter combinations of $k$ and $\tau $.

| $k$ | $\tau =5n$ | $\tau =50n$ | $\tau =100n$ | $\tau =\frac{1}{10}n\ln(n)$ |
|---|---|---|---|---|
| 2 | 0.46493 | 0.42363 | 0.42329 | 0.42329 |
| 3 | 0.46802 | 0.40579 | 0.40525 | 0.40525 |
| 4 | 0.48102 | 0.39897 | 0.39830 | 0.39830 |
| 5 | 0.49630 | 0.39568 | 0.39492 | 0.39492 |
| 11 | $3.090\times 10^{23}$ | 8785.8 | 0.38987 | 0.38987 |
| 18 | $1.886\times 10^{44}$ | $5.363\times 10^{24}$ | 1034.8 | 0.38899 |

We now show that for appropriate values of the parameter $\tau $, the $k$-operator GRG runs in the best possible expected runtime achievable for a mechanism using $k$ operators, up to lower-order terms. The following corollary provides an upper bound on the expected runtime for sufficiently large learning periods (i.e., $\tau =\omega (n)$). The bound matches the best achievable expected runtime proven in Theorem 10.

### 4.3 Anytime Performance

In this subsection, we present the expected fixed target running times of the algorithms considered thus far. Fixed target running times have been previously presented by Carvalho Pinto and Doerr (2017). In order to achieve the runtimes presented in Corollary 15 (and depicted in Figure 4) we have adapted Theorems 11 and 13 to sum up to a fixed target $LO(x)=X\le n$, while keeping the problem size as $n$ (respectively, $w$) within the summands.

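Let $X\le n$. The expected time needed to reach for the first time a search point $x$ of LeadingOnes-value at least $X$ is at most:

$\frac{k}{2}\cdot\sum_{i=0}^{X-1}\frac{1}{\sum_{m=1}^{k}m\cdot\frac{1}{n}\cdot\left(\frac{n-i-1}{n}\right)^{m-1}}\pm o(n^2)$ for the Simple Random mechanism with $k$ operators;

$\frac{n^2}{2}\cdot\sum_{j=1}^{\lceil X\cdot w/n\rceil}\frac{k\cdot\frac{\tau}{n}+\sum_{m=1}^{k}\left(e^{\frac{m\tau}{n}\left(1-\frac{j}{w}\right)^{m-1}}\cdot M_m(j,w)\right)}{w\cdot\left(\sum_{m=1}^{k}e^{\frac{m\tau}{n}\left(1-\frac{j}{w}\right)^{m-1}}-k\right)}+o(n^2)$ for GRG using $H=\{\text{1BitFlip},\dots,k\text{BitFlip}\}$, with $\tau \le \left(\frac{1}{k}-\varepsilon\right)n\ln(n)$ for some constant $0<\varepsilon<\frac{1}{k}$, where $M_m(j,w)$ are as defined in Theorem 13.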

Figure 4 shows the expected fixed target running times using the set $H=\{\text{1BitFlip},\dots,k\text{BitFlip}\}$ for the Simple Random mechanism with $k=1,2,3$ and GRG with $\tau =10n$ and $k=2,3,4$ for LeadingOnes. We have used $w=1{,}000$ in Theorem 13 to produce the GRG plots.

We see that all the GRG variants outperform the respective Simple Random variants for any fixed target independent of how many operators are used by GRG. All GRG variants are close to matching the anytime performance of the optimal GRG and the greater the number of operators the HH has access to, the better the anytime performance. For example, GRG with 4 operators outperforms all other algorithms for every fixed target $0<X\le n$.

Interestingly, the Simple Random mechanisms with more operators outperform those with fewer operators for smaller targets. For example, for a fixed target of $LO(x)=\frac{n}{2}=5{,}000$, the expected first hitting time of Simple Random with 2 operators is $\approx 18.9\%$ smaller than that of RLS (i.e., Simple Random with 1 operator, Simple$_1$), while Simple Random with 3 operators is $\approx 26.0\%$ faster than RLS.

### 4.4 Extension of the Results to Standard Randomised Local Search Heuristics

In this subsection, we will extend the results presented so far in Section 4 such that they also hold for hyper-heuristics that select from the set $H=\{\text{RLS}_1,\dots,\text{RLS}_k\}$. In particular, we will show that the main result of this section (i.e., Theorem 7) also holds for RLS algorithms with different neighbourhood sizes (that flip distinct bits without replacement). For this purpose, we will show that the difference in the improvement probabilities of the $\text{RLS}_m$ algorithms and the $m\text{BitFlip}$ operators (from Lemma 8) is limited to lower-order $O\left(\frac{1}{n^2}\right)$ terms. A simple argument will then prove that the remaining results of this section also hold for $\text{RLS}_m$ operators.

Through a simple application of Lemma 16 (rather than applying Lemma 8), the results for the simple mechanisms presented in Subsection 4.1 (i.e., Theorem 11 and Corollary 12), and the general results for GRG (i.e., Theorem 13 and Corollary 14), also hold for the hyper-heuristics using the heuristic set $H=\{\text{RLS}_1,\dots,\text{RLS}_k\}$, up to lower-order $\pm o(n^2)$ terms.

We can also extend Theorems 10 and 7 to algorithms using the heuristic set $\{\text{RLS}_1,\dots,\text{RLS}_k\}$. We know from Theorem 10 that the $m\text{BitFlip}$ operator is optimal (i.e., it has the highest probability of improvement amongst all BitFlip operators) during the time when $\frac{n}{m+1}\le i\le \frac{n}{m}-1$. Doerr (2018) showed that $\text{RLS}_m$ is optimal (i.e., has the highest probability of improvement among all RLS operators) when $\frac{n-m}{m+1}\le i\le \frac{n}{m}-\frac{m-1}{m}$. The differences in the expected runtime of the best-possible algorithms using the two heuristic sets would therefore amount to the expected time spent improving from (at most) $m+1=\Theta (1)$ different LeadingOnes fitness values. Since the expected time to improve from each of these fitness values is $O(n)$, the difference between the expected runtime of the best-possible algorithm using $\{\text{RLS}_1,\dots,\text{RLS}_k\}$ operators and that using $H=\{\text{1BitFlip},\dots,k\text{BitFlip}\}$ operators for LeadingOnes is limited to lower-order $o(n^2)$ terms. Hence, Theorem 10 also holds for the heuristic set $\{\text{RLS}_1,\dots,\text{RLS}_k\}$, up to lower-order $\pm o(n^2)$ terms. With similar arguments, Theorem 7 also holds for sets of RLS operators.

## 5 Complementary Experimental Analysis

In the previous sections, we proved that GRG performs efficiently for the LeadingOnes benchmark function for large enough problem sizes $n$. In this section, we present some experimental results to shed light on its performance for different problem sizes up to $n=10^8$. All parameter combinations have been simulated $10{,}000$ times.

In order to efficiently handle larger problem dimensions experimentally, we do not simulate each individual mutation performed by the HH, but rather sample the waiting times for a fitness-improving mutation to occur using a geometric distribution (with the success probability $p$ depending on the current operator and LeadingOnes value of the current solution). Specifically, suppose $T_{\mathrm{im}}$ is a random variable denoting the number of mutations required to get one fitness-improving mutation. Since $T_{\mathrm{im}}$ counts the number of independent trials, each with probability $p$ of success, up to and including the first successful trial, $P(T_{\mathrm{im}}\le k)=1-(1-p)^k$ by the properties of the geometric distribution. Given access to a $\mathrm{uniform}(0,1)$-distributed random variable $U$, $T_{\mathrm{im}}$ can be sampled by computing $\frac{\log(1-U)}{\log(1-p)}$ and rounding up to the next integer.
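A minimal sketch of this inverse-transform sampling step is given below. It is an illustration rather than the code used for the experiments; the function name is hypothetical, and `math.log1p` is used only because it is numerically more accurate than `math.log(1 - p)` for the very small success probabilities that occur for large $n$.

```python
import math
import random


def sample_waiting_time(p: float, rng: random.Random) -> int:
    """Number of trials up to and including the first success, where each
    trial succeeds independently with probability p (0 < p <= 1)."""
    if p >= 1.0:
        return 1
    u = rng.random()  # uniform on [0, 1), so 1 - u lies in (0, 1]
    trials = math.ceil(math.log1p(-u) / math.log1p(-p))
    return max(1, trials)  # u == 0 would give 0; the first trial is trial 1


if __name__ == "__main__":
    rng = random.Random(2024)
    p = 1e-4
    samples = [sample_waiting_time(p, rng) for _ in range(50_000)]
    print(sum(samples) / len(samples))  # should be close to 1 / p
```

One draw replaces, in expectation, $1/p$ simulated mutations, which is what makes problem sizes far beyond the reach of a bit-by-bit simulation feasible.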

### 5.1 Two Low-Level Heuristics ($k=2$)

We first consider GRG using two operators only (i.e., $H=\{\text{1BitFlip},\text{2BitFlip}\}$) and look at the impact on the average runtime of the parameter $\tau $ and of the problem size $n$.

Figure 5 shows how the average runtimes of GRG for LeadingOnes vary with the duration of the learning period $\tau $ for problem sizes $n=10{,}000$ and $n=50{,}000$. The performance of GRG clearly depends on the choice of $\tau $. It is worth noting that as the problem size increases, for $\tau \approx 0.55n\ln(n)$, the runtime seems to be approaching the optimal performance proven in Corollary 5 (i.e., $\frac{1+\ln(2)}{4}n^2\approx 0.42329n^2$). For well chosen $\tau $ values, the HH beats the expected runtime of RLS and also the experimental runtime for the recently presented reinforcement learning HH that chooses between different neighbourhood sizes for $\text{RLS}_k$ (Doerr et al., 2016). They reported an average runtime of $0.450n^2$ for the parameter choices they used. As $\tau $ increases past $0.6n\ln(n)$, we see a detriment in the performance of the HH. It is worth noting, however, that for $n=50{,}000$, it is required that $\tau >1.5n\ln(n)=811{,}483$ to be worse than the RLS expected runtime of $0.5n^2$, indicating that the parameter is robust.
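A small-scale version of such an experiment can be sketched as follows. This is not the authors' experimental code. It assumes the GRG loop as summarised in the abstract (keep the current operator while it succeeds within $\tau$ steps, otherwise draw a new one uniformly at random), the first-order improvement probability $m\cdot\frac{1}{n}\cdot\left(\frac{n-i-1}{n}\right)^{m-1}$ for the operator flipping $m$ bits, and the usual LeadingOnes property that the bits behind the first zero are uniformly distributed. All function names are hypothetical.

```python
import math
import random


def improvement_probability(n: int, i: int, m: int) -> float:
    """Assumed first-order chance that mBitFlip improves LeadingOnes value i."""
    return m * (1.0 / n) * ((n - i - 1) / n) ** (m - 1)


def simulate_grg(n: int, k: int, tau: int, rng: random.Random) -> int:
    """Return the simulated number of mutations GRG needs on LeadingOnes."""
    # Random initial bit string: count its leading ones.
    fitness = 0
    while fitness < n and rng.random() < 0.5:
        fitness += 1
    steps = 0
    operator = rng.randint(1, k)  # operator m flips m bits
    while fitness < n:
        p = improvement_probability(n, fitness, operator)
        # Waiting time for an improvement, drawn from a geometric distribution.
        if p <= 0.0:
            wait = math.inf
        elif p >= 1.0:
            wait = 1
        else:
            wait = max(1, math.ceil(math.log1p(-rng.random()) / math.log1p(-p)))
        if wait <= tau:
            # Success inside the learning period: keep the operator, restart the period.
            steps += wait
            fitness += 1
            # Bits behind the flipped one are still uniformly random (free riders).
            while fitness < n and rng.random() < 0.5:
                fitness += 1
        else:
            # No success within tau steps: draw a new operator uniformly at random.
            steps += tau
            operator = rng.randint(1, k)
    return steps


if __name__ == "__main__":
    rng = random.Random(7)
    n = 1_000
    tau = round(0.55 * n * math.log(n))
    runs = [simulate_grg(n, 2, tau, rng) for _ in range(100)]
    print(sum(runs) / len(runs) / n**2)  # average runtime in units of n^2
```

Because failed learning periods cost $\tau$ steps each, the average for small $n$ is expected to lie noticeably above the asymptotic constant; varying `tau` in the usage example gives a rough picture of the trade-off discussed above.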

Figure 6 shows the effects of increasing the problem size $n$ for a variety of fixed values of $\tau $. We can see that an increased problem size leads to faster average runtimes. In particular, the HH requires a problem size of at least 200 before it outperforms RLS. The performance difference between the $\tau $ values decreases with increased $n$, indicating that further increasing $n$ would lead to similar, optimal performance for a large range of values, as implied by Corollary 5. For $n=10^8$, the runtime for $\tau =0.6n\ln n$ is $\approx 0.42716n^2$, deviating only slightly from the optimal value of $\approx 0.42329n^2$.

### 5.2 More than Two Low-Level Heuristics ($k\ge 2$)

We now consider GRG using $k$ operators, $H=\{\text{1BitFlip},\dots,k\text{BitFlip}\}$, and look at the impact of the parameter $\tau $.

Figure 7 shows the average fixed target running times using the set $H=\{\text{1BitFlip},\dots,k\text{BitFlip}\}$ for the Simple Random mechanism with $k=1,2,3$ operators, and GRG with $\tau =10n$ and $k=2,3,4$ for LeadingOnes. We have used a problem size of $n=10{,}000$. The conclusions drawn from the experiments match the theoretical results shown in Figure 4. We see that although the final average runtime of the Simple Random mechanisms with more operators increases, they are faster initially. In particular, while $LO(x)\le \frac{n}{2}=5{,}000$, Simple Random with 3 operators is preferable to the variant with 2 or 1 (i.e., RLS). This Simple Random variant even outperforms GRG (with our arbitrarily chosen learning period $\tau =10n$) for fixed targets smaller than $\approx 7{,}000$. For GRG we see, just like in Figure 4, that GRG with more operators is more effective than GRG with fewer operators for any fixed target. However, the difference is smaller than the theoretical results suggest, implying that while more operators are preferable, GRG with fewer operators is still effective for these problem sizes; that is, as the problem size increases so does the difference in performance in favour of larger sets $H$.

Figure 8 shows the average runtimes for GRG when it chooses between 2 and 5 mutation operators with different neighbourhood sizes for a LeadingOnes problem size $n=100{,}000$. We see that incorporating more operators can be beneficial to the performance of GRG. Whilst the 2-operator HH achieves a best average performance of $\approx 0.43742n^2$, the 5-operator HH achieves a best average performance of $\approx 0.42906n^2$. It is, however, important to set $\tau $ appropriately to achieve the best performance. For $0\le \tau \le 0.35n\ln(n)$ and $\tau \ge 1.5n\ln(n)$, the 2-operator HH outperforms the others. The results imply that the more operators are incorporated in the HH, the shorter the range of $\tau $ values for which it performs the best in comparison with the HH with fewer operators. We know from Corollary 14 that for a sufficiently large problem size, GRG with 5 operators will tend towards the theoretical optimal expected runtime of $\approx 0.39492n^2$. Furthermore, we know from Theorem 7 that GRG equipped with more operators will be faster.

Recently Doerr and Wagner (2018) analysed a (1 $+$ 1) EA for LeadingOnes, where the mutation rate is updated on the fly. The number of bits to be flipped is sampled from a binomial distribution $B(n,p)$ and, differently from standard bit mutation, they do not allow a 0BitFlip to occur by resampling until a non-zero bitflip occurs.^{2} Then a multiplicative comparison-based update rule, similar to the 1/5th rule from continuous optimisation (Kern et al., 2004), is applied to update the parameter $p$ (i.e., the mutation rate). A successful mutation increases the mutation rate by a multiplicative factor $A>1$, while an unsuccessful mutation multiplies the mutation rate by a multiplicative factor $b<1$.
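The two ingredients of this scheme, resampling the flip count until it is non-zero and the multiplicative update of $p$, can be sketched as follows. This is an illustration of the description above, not Doerr and Wagner's implementation: the function names and the clamping bounds `p_min` and `p_max` are assumptions, and the default factors are the configuration $A=1.2$, $b=0.85$ quoted in this section.

```python
import random


def sample_flip_count(n: int, p: float, rng: random.Random) -> int:
    """Draw the number of bits to flip from B(n, p), resampling until it is
    non-zero. Simple O(n) sampling; slow when n * p is very small."""
    while True:
        count = sum(rng.random() < p for _ in range(n))
        if count > 0:
            return count


def update_rate(p: float, success: bool, a: float = 1.2, b: float = 0.85,
                p_min: float = 1e-6, p_max: float = 0.5) -> float:
    """Multiply the rate by a > 1 after a success and by b < 1 after a failure.
    The clamping bounds p_min and p_max are illustrative assumptions."""
    p = p * a if success else p * b
    return min(p_max, max(p_min, p))


if __name__ == "__main__":
    rng = random.Random(0)
    p = 0.05
    flips = sample_flip_count(100, p, rng)
    p = update_rate(p, success=False)
    print(flips, p)
```

Since the rate shrinks by a constant factor after every failure and grows by a constant factor after every success, it settles around a value at which successes occur at a fixed frequency determined by $A$ and $b$.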

An experimental analysis is performed to tentatively identify the leading constants in the expected runtime for the best combinations of $A$ and $b$. The best leading constant that has been identified is 0.4063 (using the best parameter configuration for which an average of at least 1000 runs has been reported; i.e., $A=1.2$ and $b=0.85$). While it is unclear whether their identified value indeed scales with the problem size (they only test problem sizes up to $n=1500$), such a leading constant is worse than the $\approx 0.388n^2$ achieved by GRG. However, we note that the best possible expected runtime that such an EA (that learns to automatically adapt the neighbourhood size to optimality) can achieve is $\approx 0.404n^2$ (Doerr et al., 2019), due to the random sampling of the number of bits to flip; that is, the best adaptive RLS is faster than the best adaptive EA.

## 6 Conclusion

Our foundational understanding of the performance of hyper-heuristics (HHs) is improving. Algorithm selection from algorithm portfolio systems and selection HHs generally use sophisticated machine learning algorithms to identify online (during the run) which low-level mechanisms have better performance during different stages of the optimisation process. Recently it has been proven that a reinforcement learning mechanism, which assigns different scores to each low-level heuristic according to how well it performs, allows a simple HH to run in the best expected runtime achievable with RLS$_k$ low-level heuristics for the OneMax benchmark problem, up to lower-order terms (Doerr et al., 2016).

In this article, we considered whether sophisticated learning mechanisms are always necessary for a HH to effectively learn to apply the right low-level heuristic.

We considered four of the most simple learning mechanisms commonly applied in the literature to combinatorial optimisation problems, namely, Simple Random, Permutation, Greedy, and Random Gradient, and showed that they all have the same performance for the LeadingOnes benchmark function when equipped with RLS$_k$ low-level heuristics. While the former three mechanisms do not attempt to learn from the past performance of the low-level heuristics, the idea behind Random Gradient is to continue applying a heuristic so long as it is successful. We argued that looking at the performance of a heuristic after one single application is insufficient to appreciate whether it is a good choice or not. To this end, we generalised the existing Random Gradient learning mechanism to allow success to be measured over a longer period of time and called such time the *learning period*. We showed that the Generalised Random Gradient (GRG) HH can learn to adapt the neighbourhood size $k=\Theta (1)$ of $\text{RLS}_k$ optimally during the run for the LeadingOnes benchmark function. As a byproduct, we proved that, up to lower-order terms, GRG has the best possible runtime achievable by any algorithm that uses the same low-level heuristics. In particular, it is faster than well-known unary unbiased evolutionary and local-search algorithms, including the (1 $+$ 1) Evolutionary Algorithm ((1 $+$ 1) EA) and Randomised Local Search (RLS). Furthermore, we also showed that for targets smaller than $n$ (i.e., anytime performance), the advantages of GRG over RLS and EAs using standard bit mutation are even greater (i.e., they are even better as approximation algorithms for the problem).

To apply the generalised HH, a value for the learning period $\tau $ is required. Although our results indicate that $\tau $ is a fairly robust parameter (i.e., for $n=10{,}000$, GRG achieved faster experimental runtimes than that of the (1 $+$ 1) EA for all tested values of $\tau $ between 1 and $10^6$, and faster than RLS for all tested values of $\tau $ between $28{,}000$ and $120{,}000$), setting it appropriately will lead to optimal performance. Clearly $\tau $ must be large enough to have at least a constant expected number of successes within $\tau $ steps, if the HH has to learn about the operator performance. Naturally, setting too large values of $\tau $ may lead to large runtimes since switching operators requires $\Omega (\tau )$ steps.

We have also rigorously shown that the performance of the simple mechanisms deteriorates as the choice of operators with different neighbourhood sizes increases, while the performance of GRG improves with a larger choice, as desired for practical applications. In particular, GRG is able to outperform in expectation any unbiased (1 $+$ 1) black-box algorithm with access to strictly smaller sets of operators.

Recently, Doerr et al. (2018) have equipped the GRG HH with an adaptive update rule to automatically adapt the parameter $\tau $ throughout the run (i.e., the learning period can change its duration during the optimisation process). They proved that the HH is able to achieve the same optimal performance, up to lower-order terms, for the LeadingOnes benchmark function. This so-called Adaptive Random Gradient HH, equipped with two operators, experimentally outperforms the best setting of GRG, confirming that $\tau $ should not be fixed throughout the optimisation process.

Several further directions can be explored in future work. Firstly, the performance of GRG on a broader class of problems, including classical ones from combinatorial optimisation, should be rigorously studied. Secondly, more sophisticated HHs that have shown superior performance in practical applications should be analysed, such as the machine learning approaches which keep track of the historical performance of each low-level heuristic and use this information to decide which one to be applied next. In particular, it should be highlighted when more sophisticated learning mechanisms are required and the reasons behind the requirement. Thirdly, the understanding of HHs that switch between elitist and nonelitist low-level heuristics for multimodal functions should be improved (Lissovoi et al., 2019). Finally, considering more sophisticated low-level heuristics (e.g., with different population sizes) will bring a greater understanding of the general performance of selection HHs and their wider application in real-world optimisation. Another area where a foundational theoretical understanding is lacking is that of parameter tuning (Hall et al., 2019).

## Acknowledgments

This work was supported by EPSRC under grant EP/M004252/1.

## Notes

^{1}

That is, the expected number of fitness function evaluations performed by the best-possible black-box algorithm until the optimum is found; that is, there are no restrictions on which queries to the black box (i.e., which search points to be evaluated) the algorithm is allowed to make.

^{2}

Note that while not allowing 0-bit flips is a reasonable choice for an extremely simplified EA such as the (1 $+$ 1) EA, it is doubtful that it is a good idea in general. For instance, we conjecture that the expected runtime of the ($\mu +1$) EA (Witt, 2006) and ($\mu +1$) GA (Corus and Oliveto, 2018, 2019) would deteriorate for OneMax and LeadingOnes, while nonelitist EAs and GAs would require exponential time to optimise any function with a unique optimum (Corus et al., 2018; Oliveto and Witt, 2014, 2015). An alternative implementation that allows 0-bit flips to occur, but simply avoids evaluating offspring that are identical to their parents, would solve the issue while at the same time producing the results reported by Doerr and Wagner (2018). Such an idea was suggested by Carvalho Pinto and Doerr (2017).


## Author notes

^{*}

An extended abstract of this manuscript has appeared at the 2017 Genetic and Evolutionary Computation Conference (GECCO 2017) (Lissovoi et al., 2017).