## Abstract

We reinvestigate a fundamental question: How effective is crossover in genetic algorithms in combining building blocks of good solutions? Although this has been discussed controversially for decades, we are still lacking a rigorous and intuitive answer. We provide such answers for royal road functions and OneMax, where every bit is a building block. For the latter, we show that using crossover makes *every* $(\mu+\lambda)$ genetic algorithm at least twice as fast as the fastest evolutionary algorithm using only standard bit mutation, up to small-order terms and for moderate $\mu$ and $\lambda$. Crossover is beneficial because it can capitalize on mutations that have both beneficial and disruptive effects on building blocks: crossover is able to repair the disruptive effects of mutation in later generations. Compared to mutation-based evolutionary algorithms, this makes multibit mutations more useful. Introducing crossover changes the optimal mutation rate on OneMax from $1/n$ to $(1+\sqrt{5})/2 \cdot 1/n \approx 1.618/n$. This holds both for uniform crossover and *k*-point crossover. Experiments and statistical tests confirm that our findings apply to a broad class of building block functions.

## 1 Introduction

Ever since the early days of genetic algorithms (GAs), researchers have wondered when and why crossover is an effective search operator. In evolutionary biology it has been folklore that crossover can speed up adaptation by bringing together multiple beneficial changes that resulted from independent mutation events, famously illustrated by Muller (1932, diagram 1). The same view was taken in evolutionary computation, where building blocks were regarded as schemata of high fitness (see, e.g., Davis, 1991, p. 18; Mitchell, Forrest, and Holland, 1992; and De Jong and Spears, 1992). But as Watson and Jansen (2007) put it, *there has been a considerable difficulty in demonstrating this rigorously and intuitively*.

Many attempts at understanding crossover have been made in the past. Mitchell et al. (1992) presented so-called *royal road* functions as an example where supposedly genetic algorithms outperform other search algorithms due to the use of crossover. Royal roads divide a bit string into disjoint blocks. Each block makes a positive contribution to the fitness if all bits therein are set to 1. Blocks thus represent schemata, and all-1s configurations are building blocks of optimal solutions. However, the same authors later concluded that simple randomized hill climbers performed better than GAs (Forrest and Mitchell, 1993; Mitchell, Holland, and Forrest, 1994).
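The block structure described above can be made concrete with a short sketch. The function names, the `block_size` parameter, and the unit contribution per completed block are illustrative assumptions; this is not necessarily the exact definition used by Mitchell et al. (1992).

```python
def royal_road(x, block_size):
    """Count the blocks of `x` in which every bit is set to 1.

    Assumptions: blocks are consecutive, disjoint, and of equal length,
    and each completed block contributes exactly 1 to the fitness.
    """
    if len(x) % block_size != 0:
        raise ValueError("string length must be a multiple of block_size")
    return sum(
        all(x[start:start + block_size])
        for start in range(0, len(x), block_size)
    )


def onemax(x):
    """With blocks of length 1, every bit is its own building block."""
    return royal_road(x, 1)
```

With blocks of length 4, the string `[1, 1, 1, 0, 1, 1, 1, 1]` contains one completed block, even though seven of its eight bits are set to 1.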

The role of crossover has been studied from multiple angles, including algebra (Rowe, Vose, and Wright, 2002), Markov chain models (Vose, 1999), infinite population models and dynamical systems (see De Jong, 2006, ch. 6, for an overview), and statistical mechanics (see, e.g., Prügel-Bennett and Rogers, 2001; Shapiro, 2001, and the references therein).

Also in biology the role of crossover is far from settled. In population genetics, exploring the advantages of recombination, or sexual reproduction, is a famous open question (Barton and Charlesworth, 1998) and has been called “the queen of problems in evolutionary biology” by Bell (1982) and others. Evolutionary processes with recombination were found to be harder to analyze than those using only asexual reproduction, as they represent quadratic dynamical systems (Arora, Rabani, and Vazirani, 1994; Rabani, Rabinovich, and Sinclair, 1998).

Recent work in population genetics has focused on studying the “speed of adaptation,” which describes the efficiency of evolution, in a similar vein to research in evolutionary computation (Weissman and Barton, 2012; Weissman, Feldman, and Fisher, 2010). We refer the interested reader to Paixão, Badkobeh, Barton, Corus, Dang, Friedrich, Lehre, Sudholt, Sutton, and Trubenová (2015) and Paixao, Pérez Heredia, Sudholt, and Trubenova (2015) for steps toward unifying research in both fields. Furthermore, a new theory of mixability has been proposed from the perspective of theoretical computer science (Livnat, Papadimitriou, Dushoff, and Feldman, 2008; Livnat, Papadimitriou, Pippenger, and Feldman, 2010), arguing that recombination favors individuals that are good mixers, that is, individuals that create good offspring when being recombined with others.

Several researchers independently reported empirical observations that using crossover improves the performance of evolutionary algorithms (EAs) on the simple function OneMax (Lässig, 2009; Rowe, 2015) but were unable to explain why. The fact that even settings as simple as OneMax are not well understood demonstrates the need for a solid theory and serves as motivation for this work.

Runtime analysis has become a major area of research that can give rigorous evidence and proven theorems (Neumann and Witt, 2010; Auger and Doerr, 2011; Jansen, 2013). However, studies so far have eluded the most fundamental setting of building block functions. Crossover was proven to be superior to mutation only on constructed artificial examples like $\text{Jump}_k$ (Jansen and Wegener, 2002; Kötzing, Sudholt, and Theile, 2011) and “real royal road” functions (Jansen and Wegener, 2005; Storch and Wegener, 2004), the H-IFF problem (Dietzfelbinger, Naudts, Van Hoyweghen, and Wegener, 2003), coloring problems inspired by the Ising model from physics (Fischer and Wegener, 2005; Sudholt, 2005),^{1} computing unique input-output sequences for finite state machines (Lehre and Yao, 2011), selected problems from multiobjective optimization (Qian, Yu, and Zhou, 2013), and the all-pairs shortest path problem (Doerr, Happ, and Klein, 2012a; Sudholt and Thyssen, 2012; Neumann and Theile, 2010). H-IFF (Dietzfelbinger et al., 2003) and the Ising model on trees (Sudholt, 2005) consist of hierarchical building blocks. But none of these papers addressed single-level building blocks in a setting as simple as royal roads.

Watson and Jansen (2007) presented a constructed building block function and proved exponential performance gaps between EAs using only mutation and a GA. However, the definition of the internal structure of building blocks is complicated and artificial, and they used a tailored multideme GA to get the necessary diversity. With regard to how GAs combine building blocks, their approach does not give the intuitive explanation one is hoping for.

This paper presents such an intuitive explanation, supported by rigorous analyses. We consider royal roads and other functions composed of building blocks such as monotone polynomials. OneMax is a special case where every bit is a building block. We give rigorous proofs for OneMax and show how the main proof arguments transfer to broader classes of building block functions. Experiments support the latter.

Our main results are as follows.

We show in Section 3 that on OneMax *every* $(\mu+\lambda)$ GA with uniform crossover and standard bit mutation is at least twice as fast as *every* evolutionary algorithm (EA) that only uses standard bit mutations (up to small-order terms). More precisely, the dominating term in the expected number of function evaluations decreases from $e \cdot n \ln n$ to $e/2 \cdot n \ln n$. This holds provided that the parent population and offspring population sizes $\mu$ and $\lambda$ are moderate, so that the inertia of a large population does not slow down exploitation. The reason for this speedup is that the GA can store a neutral mutation (a mutation not altering the parent’s fitness) in the population, along with the respective parent. It can then use crossover to combine the good building blocks between these two individuals, improving the current best fitness. In other words, crossover can capitalize on mutations that have both beneficial and disruptive effects on building blocks, as crossover is able to repair the disruptive effects of mutation in later generations.

The use of uniform crossover leads to a shift in the optimal mutation rate on OneMax. Section 4 demonstrates this for a simple greedy $(2+1)$ GA that always selects parents among the current best individuals. While for mutation-based EAs $1/n$ is the optimal mutation rate (Witt, 2013), the greedy $(2+1)$ GA has an optimal mutation rate of $(1+\sqrt{5})/2 \cdot 1/n \approx 1.618/n$ (ignoring small-order terms). This is because introducing crossover makes neutral mutations more useful and larger mutation rates increase the chance of a neutral mutation. Optimality is proved by means of a matching lower bound on the expected optimization time of the greedy $(2+1)$ GA that applies to *all* mask-based crossover operators (where each bit value is taken from either parent). Using the optimal mutation rate, the expected number of function evaluations is $1.19\,n \ln n \pm O(n \log\log n)$.

These results are not limited to uniform crossover or the absence of linkage. Section 5 shows that the same results hold for GAs using *k*-point crossover, for arbitrary *k*, under slightly stronger conditions on $\mu$ and $\lambda$, if the crossover probability $p_c$ is set to an appropriately small value.

The reasoning for OneMax carries over to other functions with a clear building block structure. Experiments in Section 6 reveal similar performance differences as on OneMax for royal road functions and random polynomials with unweighted, positive coefficients. This is largely confirmed by statistical tests. There is evidence that findings also transfer to weighted building block functions like linear functions, provided that the population can store solutions with different fitness values and different building blocks until crossover is able to combine them. This is not the case for the greedy $(2+1)$ GA, but a simple () GA is significantly faster on random linear functions than the optimal mutation-based EA for this class of functions, the $(1+1)$ EA (Witt, 2013).

The first result, the analysis for uniform crossover, is remarkably simple and intuitive. It gives direct insight into the working principles of GAs. Its simplicity also makes it very well suited for teaching purposes.

This work extends a preliminary conference paper (Sudholt, 2012) that contained parts of the results, restricted to one particular GA, the greedy $(2+1)$ GA. This extended version presents a general analytical framework that applies to all $(\mu+\lambda)$ GAs, subject to mild conditions, and includes the greedy $(2+1)$ GA as a special case. To this end, we provide tools for analyzing parent and offspring populations in $(\mu+\lambda)$ GAs, which we believe are of independent interest.

Moreover, results for *k*-point crossover have been improved. The leading constant in the upper bound for *k*-point crossover in Sudholt (2012) was by an additive term of larger than that for uniform crossover, for mutation rates of . This left open the question whether *k*-point crossover is as effective as uniform crossover for assembling building blocks in OneMax. Here we provide a new and refined analysis, which gives an affirmative answer, under mild conditions on the crossover probability.

### 1.1 Related Work

The literature on recombination is too vast to be reviewed comprehensively. Sastry et al. (2005) reviewed early literature and gave recommendations on the design of competent genetic algorithms based on building blocks.

In more recent work, Prügel-Bennett (2010) presented five mechanisms that advantage populations with crossover, based on empirical evidence and nonrigorous theory:

Putting together building blocks from different solutions

Focusing search by crossover on variables where parents differ

The ability of a population to act as a low-pass filter of the landscape

Hedging against bad luck in the initialization and other decisions made

The opportunity of learning useful parameter values to balance exploration against exploitation

This work explicitly addresses the first mechanism, for which Prügel-Bennett (2010, sec. IIIA) notes “it is nontrivial to construct a toy problem which demonstrated how the building block hypothesis would work.” It is shown here that the best known toy problem, OneMax, serves this purpose. We also implicitly address the second benefit, focusing search, as our analysis reveals that crossover very quickly exploits diversity in the population to create improvements on OneMax.

In terms of rigorous runtime analysis, Kötzing et al. (2011) considered the search behavior of an idealized GA on OneMax to highlight the potential benefits of crossover under ideal circumstances. If a GA is able to recombine two individuals with equal fitness that result from independent evolutionary lineages, the fitness gain can be of order . The idealized GA would therefore be able to optimize OneMax in expected time (Kötzing et al., 2011). However, this idealization cannot reasonably be achieved in realistic EAs with common search operators; hence the result should be regarded as an academic study on the *potential* benefit of crossover.

A related strand of research deals with the analysis of the Simple GA on OneMax. The Simple GA is one of the best known and best researched GAs in the field. It uses a generational model where parents are selected using fitness-proportional selection and the generated offspring form the next population. Neumann, Oliveto, and Witt (2009) showed that the Simple GA without crossover with high probability cannot optimize OneMax in less than exponential time. The reason is that the population typically contains individuals of similar fitness, and then fitness-proportional selection is similar to uniform selection. Oliveto and Witt (2014) extended this result to uniform crossover: the Simple GA with uniform crossover and population size , , still needs exponential time on OneMax. It even needs exponential time to reach a solution of fitness larger than for an arbitrary constant . Oliveto and Witt (2013) relaxed their condition on the population size to . Their work does not exclude that crossover is advantageous, particularly since under the right circumstances crossover may lead to a large increase in fitness (Kötzing et al., 2011). But if there is an advantage, it is not noticeable, as the Simple GA with crossover still fails badly on OneMax (for the stated moderate population sizes).

One year after Sudholt (2012) was published, Doerr, Doerr, and Ebel (2013a) presented a groundbreaking result: they designed an EA that was proven to optimize OneMax (and any simple transformation thereof) in time $o(n \log n)$. This is a spectacular result, as all black-box search algorithms using only unbiased unary operators (operators modifying one individual only, and not exhibiting any inherent search bias) need time $\Omega(n \log n)$, as shown by Lehre and Witt (2012). So their EA shows that crossover can lower the expected running time by more than a constant factor. They call their algorithm a $(1+(\lambda,\lambda))$ EA: starting with one parent, it first creates $\lambda$ offspring by mutation, with a random and potentially high mutation rate. Then it selects the best mutant and crosses it $\lambda$ times with the original parent, using parameterized uniform crossover (the probability of taking a bit from the first parent is not always $1/2$, but a parameter of the algorithm). This leads to $O(n\sqrt{\log n})$ expected function evaluations. This bound was recently tightened to $\Theta\bigl(n\sqrt{\log(n)\,\log\log\log(n)/\log\log(n)}\bigr)$ (Doerr and Doerr, 2015b) and can be further decreased to $O(n)$ by self-adjusting $\lambda$ (Doerr and Doerr, 2015a).

The $(1+(\lambda,\lambda))$ EA from Doerr et al. (2013a) is very cleverly designed to work efficiently on OneMax and similar functions. It uses a nonstandard EA design because of its two phases of environmental selection. Other differences are that mutation is performed before crossover, and mutation is not fully independent for all offspring: the number of flipping bits is a random variable determined as for standard bit mutations, but the same number of flipping bits is then used in all offspring. The focus of this work is different, as our goal is to understand how standard EAs operate and how crossover can be used to speed up building block assembly in commonly used $(\mu+\lambda)$ EAs.

## 2 Preliminaries

We measure the performance of the algorithm with respect to the number of function evaluations performed until an optimum is found, referred to as *optimization time*. For steady-state algorithms this equals the number of generations (apart from the initialization), and for EAs with offspring populations such as $(\mu+\lambda)$ EAs or $(\mu+\lambda)$ GAs the optimization time is larger than the number of generations by a factor of $\lambda$. Note that the number of generations needed to optimize a fitness function can often be easily decreased by using offspring populations or parallel evolutionary algorithms (Lässig and Sudholt, 2014). But this significantly increases the computational effort within one generation, so the number of function evaluations is a fairer and widely used measure.

Looking at function evaluations is often motivated by the fact that this operation dominates the execution time of the algorithm. Then the number of function evaluations is a reliable measure for wall clock time. However, the wall clock time might increase when introducing crossover as an additional search operator. Also, when increasing the mutation rate, more pseudorandom numbers might be required. Jansen and Zarges (2011) point out a case where this effect leads to a discrepancy between the number of function evaluations and wall clock time. This concern must be taken seriously when aiming at reducing wall clock time. However, each implementation must be checked individually in this respect (Jansen and Zarges, 2011). Therefore, we keep this concern in mind but still use the number of function evaluations in the following.

## 3 Uniform Crossover Makes $(\mu+\lambda)$ EAs Twice as Fast

We show that, under mild conditions, every $(\mu+\lambda)$ GA is at least twice as fast as its counterpart without crossover. For the latter, that is, evolutionary algorithms using only standard bit mutation, the author proved the following lower bound on the running time of a very broad class of mutation-based EAs (Sudholt, 2013). It covers all possible selection mechanisms, parent or offspring populations, and even parallel evolutionary algorithms. We slightly rephrase this result.

^{13}). Also the mutation rate is the best possible choice for OneMax, leading to a lower bound of For the special case of , Doerr, Fouz, and Witt (2011) improved this bound toward .

We show that for a range of $(\mu+\lambda)$ EAs, as defined in the following, introducing uniform crossover can cut the dominant term of the running time in half, for the standard mutation rate $p = 1/n$.

*f*, This in particular implies that equally fit solutions are selected with the same probability. Condition (1) is satisfied for all common selection mechanisms: uniform selection, fitness-proportional selection, tournament selection, cut selection, and rank-based mechanisms.

The class of $(\mu+\lambda)$ EAs covered in this work is defined in Algorithm 1. All $(\mu+\lambda)$ EAs therein create $\lambda$ offspring through crossover and mutation, or just mutation, and then pick the best $\mu$ out of the $\mu$ previous search points and the $\lambda$ new offspring.

In the case of ties, we pick solutions that have the fewest duplicates among the considered search points. This strategy was used by Jansen and Wegener (2005) in their groundbreaking work on real royal roads; it ensures a sufficient degree of diversity whenever the population contains different search points of the same fitness.
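The scheme just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a transcription of Algorithm 1: uniform parent selection, the iterative removal used to realize the "fewest duplicates" tie-breaking, the `target` stopping criterion, and all function and parameter names are hypothetical choices made for this sketch.

```python
import random
from collections import Counter


def onemax(x):
    return sum(x)


def standard_bit_mutation(x, p, rng):
    # Flip each bit independently with probability p.
    return tuple(b ^ 1 if rng.random() < p else b for b in x)


def uniform_crossover(x, y, rng):
    # Take each bit from either parent with probability 1/2.
    return tuple(a if rng.random() < 0.5 else b for a, b in zip(x, y))


def select_survivors(pool, mu, fitness):
    """Keep the best mu points; among the worst, drop the most duplicated.

    Assumption: removing one point at a time is one way to realize the
    "fewest duplicates" tie-breaking rule.
    """
    pool = list(pool)
    while len(pool) > mu:
        counts = Counter(pool)
        worst = min(fitness(x) for x in pool)
        candidates = [x for x in pool if fitness(x) == worst]
        pool.remove(max(candidates, key=lambda x: counts[x]))
    return pool


def mu_plus_lambda_ga(n, mu, lam, p, pc, rng, fitness=onemax,
                      target=None, max_evals=10**6):
    """Return (function evaluations used, final population)."""
    target = n if target is None else target  # n is the OneMax optimum
    pop = [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(mu)]
    evals = mu
    while max(fitness(x) for x in pop) < target and evals < max_evals:
        offspring = []
        for _ in range(lam):
            if rng.random() < pc:  # crossover followed by mutation
                child = uniform_crossover(rng.choice(pop), rng.choice(pop), rng)
            else:                  # mutation only
                child = rng.choice(pop)
            offspring.append(standard_bit_mutation(child, p, rng))
        evals += lam
        pop = select_survivors(pop + offspring, mu, fitness)
    return evals, pop
```

For example, `mu_plus_lambda_ga(50, 2, 1, 1/50, 0.5, random.Random(0))` runs a small GA with two parents and one offspring per generation on OneMax with 50 bits. Removing one point at a time ensures that two distinct search points of equal fitness are not displaced by duplicates of either.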

Before stating the main result of this section, we provide two lemmas showing how to analyze population dynamics. Both lemmas are of independent interest and may prove useful in other studies of population-based EAs.

The following lemma estimates the expected time until individuals with fitness at least *i* take over the whole population. It generalizes Lemma 3 in Sudholt (2009), which in turn goes back to Witt’s (2006) analysis of the $(\mu+1)$ EA. Note that the lemma applies to arbitrary fitness functions, arbitrary values for $\mu$ and $\lambda$, and arbitrary crossover operators; it merely relies on fundamental and universal properties of cut selection and standard bit mutations.

*n*-bit fitness function. Assume the current population contains at least one individual of fitness

*i*. The expected number of function evaluations needed for the () GA before all individuals in its current population have fitness at least

*i*is at most This holds for any tie-breaking rule used in the environmental selection.

Call an individual *fit* if it has fitness at least *i*. Now estimate the expected number of generations until the population is taken over by fit individuals, called the *expected takeover time*. As fit individuals are always preferred to nonfit individuals in the environmental selection, the expected takeover time equals the expected number of generations until fit individuals have been created, starting with one fit individual.

Now divide the run of the () GA into phases in order to get a lower bound on the number of fit individuals at certain time steps. The *j*th phase, , starts with the first offspring creation in the first generation, where the number of fit individuals is at least . It ends in the first generation where this number is increased to . Let *T _{j}* describe the random number of generations spent in the

*j*th phase. Starting with a new generation with fit individuals in the parent population, consider a phase of offspring creations, disregarding generation bounds.

*N*denote the random number of new fit offspring created in the phase. Then and by classical Chernoff bounds (see, e.g., Mitzenmacher and Upfal, 2005, ch. 4) If , the phase is called unsuccessful and we consider another phase of offspring creations. The expected waiting time for a successful phase is at most , and the expected number of offspring creations until is at most .

_{i}The following simple but handy lemma relates success probabilities for created offspring to the expected number of function evaluations needed to complete a generation where such an event has first happened.

The expected number of trials for an event with probability *q* to occur is $1/q$. To complete the generation, at most $\lambda - 1$ further function evaluations are required.
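The argument can be checked numerically. The sketch below assumes that all offspring creations are independent and succeed with exactly the same probability *q* (a simplification of the lemma's setting), computes the exact expected number of evaluations until the end of the first generation containing a success, and compares it with the bound $1/q + \lambda - 1$ that results from the argument above. The function names are hypothetical.

```python
def expected_evals_until_generation_ends(q, lam):
    """Exact expectation under the independence assumption.

    A generation of lam trials contains a success with probability
    1 - (1 - q)**lam, and every generation costs lam evaluations.
    """
    success_per_generation = 1.0 - (1.0 - q) ** lam
    return lam / success_per_generation


def waiting_time_bound(q, lam):
    # 1/q expected trials until the event, plus at most lam - 1 evaluations
    # to complete the generation in which it occurs.
    return 1.0 / q + lam - 1
```

For a single offspring per generation the two quantities coincide; for larger offspring populations the exact value stays below the bound.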

Now we are able to prove the main result of this section.

The main difference between the upper bound for $(\mu+\lambda)$ GAs and the lower bound for all mutation-based EAs is an additional factor of $(1+np)$ in the denominator of the upper bound. This is a factor of 2 for $p = 1/n$ and an even larger gain for larger mutation rates.

For the default value of $p = 1/n$, this shows that introducing crossover makes EAs at least twice as fast as the fastest EA using only standard bit mutation. It also implies that introducing crossover makes EAs at least twice as fast as their counterparts without crossover (i.e., where $p_c = 0$).

^{4}:

In order to prove the general bound (2), we consider canonical fitness levels, that is, the *i*th fitness level contains all search points with fitness *i*. We estimate the time spent on each level *i*, that is, when the best fitness in the current population is *i*. For each fitness level we consider three cases. The first case applies when the population contains individuals on fitness levels less than *i*. The second case is when the population only contains copies of a single individual on level *i*. The third case occurs when the population contains more than one individual on level *i*; then the population contains different building blocks that can be recombined effectively by crossover.

All these cases capture the typical behavior of a $(\mu+\lambda)$ GA, although some of these cases, and even whole fitness levels, may be skipped. We obtain an upper bound on its expected optimization time by summing up the expected times the $(\mu+\lambda)$ GA may spend in all cases and on all fitness levels.
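The case distinction can be expressed as a small helper that, for a given population, returns the current level *i* together with the case number. The helper name `classify_case` and the representation of individuals as tuples of bits are assumptions made for illustration.

```python
def classify_case(pop, fitness=sum):
    """Return (i, case) for the three cases of the fitness-level argument.

    case 1: some individual lies on a fitness level below the best level i
    case 2: the population consists of copies of a single point on level i
    case 3: all individuals are on level i, but not all are identical
    """
    i = max(fitness(x) for x in pop)
    if any(fitness(x) < i for x in pop):
        return i, 1
    if len(set(pop)) == 1:
        return i, 2
    return i, 3
```

For instance, `classify_case([(1, 1, 0), (1, 0, 1)])` yields level 2 and case 3, the situation in which crossover can recombine different building blocks.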

*Case i.1*. The population contains an individual on level *i* and at least one individual on a lower fitness level.

A sufficient condition for leaving this case is that all individuals in the population obtain fitness at least *i*. Since the $(\mu+\lambda)$ GA never accepts worsenings, the case is left for good.

*i*has already been estimated in Lemma

^{2}. Applying this lemma to all fitness levels

*i*, the overall time spent in all cases is at most

*Case i.2*. The population contains $\mu$ copies of the same individual *x* on level *i*.

In this case, each offspring created by the $(\mu+\lambda)$ GA will be a standard mutation of *x*. This is obvious for offspring where the $(\mu+\lambda)$ GA decides not to use crossover. If crossover is used, the $(\mu+\lambda)$ GA will pick two copies of *x* as parents, create *x* again by crossover, and hence perform a mutation on *x*.

The $(\mu+\lambda)$ GA leaves this case for good if either a better search point is created or if it creates another search point with *i* ones. In the latter case, we will create a population with two different individuals on level *i*. Note that due to the choice of the tie-breaking rule in the environmental selection, the $(\mu+\lambda)$ GA will always maintain at least two individuals on level *i*, unless an improvement with larger fitness is found.

*n*−

*i*suitable 1-bit flips. The probability of creating a different search point on level

*i*is at least , as it is sufficient to flip one of

*i*1-bits, to flip one of

*n*−

*i*0-bits, and not to flip any other bit. The probability of either event happening in one offspring creation is thus at least By Lemma

^{3}, the expected number of function evaluations in case is at most The expected number of functions evaluations made in all cases is hence at most The last sum can be estimated as follows. Separating the summand for , We use Equation 3.3.20 in Abramowitz and Stegun (1964) to simplify the integral and get Plugging this into (4) yields that the expected time in all cases is at most

*Case i.3*. The population only contains individuals on level *i*, not all of which are identical.

In this case we can rely on crossover recombining two different individuals on level *i*. As they both have different building blocks, namely, different bits are set to 1, there is a good chance that crossover will generate an offspring with a higher number of 1-bits.

*X*denote the number of 1-bits among these positions in the offspring. Note that

*X*is binomially distributed with parameters and and its expectation is

*d*. We estimate the probability of getting a surplus of 1-bits, as this leads to an improvement in fitness. This estimate holds for any . Since , Mutation keeps all 1-bits with probability at least . Together, the probability of increasing the current best fitness in one offspring creation is at least By Lemma

^{3}, the expected number of function evaluations in case is at most The total expected time spent in all cases is hence at most as .
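The surplus probability used in this case can be verified exactly. The sketch below assumes that the two parents differ in $2d$ positions (two strings with the same number of 1-bits always have even Hamming distance) and that uniform crossover sets each of these positions to 1 independently with probability $1/2$, so that *X* follows a binomial distribution with parameters $2d$ and $1/2$. The function name is hypothetical.

```python
from math import comb


def surplus_probability(d):
    """P(X > d) for X ~ Bin(2d, 1/2), computed exactly from binomial counts."""
    favourable = sum(comb(2 * d, j) for j in range(d + 1, 2 * d + 1))
    return favourable / 4 ** d
```

By symmetry, this probability equals $(1 - \Pr(X = d))/2$. It takes its smallest value $1/4$ at $d = 1$ and grows toward $1/2$ as $d$ increases, so a constant fraction of crossovers between two different parents on the same level yields a surplus of 1-bits.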

The conditions on $\mu$ and $\lambda$ are fairly tight; see Remark 19 in the appendix. The conditions on $p_c$ can be relaxed to include $p_c = 1$; see Remark 20 in the appendix.

It is remarkable that the waiting time for successful crossovers in cases is only of order . For small values of and , for instance, , the time spent in all cases is , which is negligible compared to the overall time bound of order . This shows how effective crossover is in recombining building blocks.

Also note that the proof of Theorem 4 is relatively simple, as it uses only elementary arguments and, along with Lemmas 2 and 3, it is fully self-contained. The analysis therefore lends itself to teaching purposes on the behavior of evolutionary algorithms and the benefits of crossover.

The analysis has revealed that fitness-neutral mutations, that is, mutations creating a different search point of the same fitness, can help to escape from the case of a population with identical individuals. Even though these mutations do not immediately yield an improvement in terms of fitness, they increase the diversity in the population. Crossover is very efficient in exploiting this gained diversity by combining two different search points at a later stage. From Prügel-Bennett’s (2010) perspective, this corresponds to crossover focusing search on bits that differ between parents.

This means that crossover can capitalize on mutations that have both beneficial and disruptive effects on building blocks: crossover is able to repair the disruptive effects of mutation in later generations.

An interesting consequence is that this affects the optimal mutation rate on OneMax. For EAs using only standard bit mutations, Witt (2013) proved that $1/n$ is the optimal mutation rate for the $(1+1)$ EA on all linear functions. Recall that the $(1+1)$ EA is the optimal mutation-based EA (in the sense of Theorem 1) on OneMax (Sudholt, 2013).

For mutation-based EAs on OneMax, neutral mutations are neither helpful nor detrimental. With crossover acting as repair mechanism, neutral mutations now become helpful. Increasing the mutation rate increases the likelihood of neutral mutations. In fact, we can easily derive better upper bounds from Theorem 4 for slightly larger mutation rates, thanks to the additional term in the denominator of the upper bound.
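The trade-off can be explored numerically. The sketch below assumes a mutation rate $p = c/n$ and assumes that the leading constant of the $n \ln n$ term behaves like $e^c/c$ for mutation-only EAs and like $e^c/(c(1+c))$ once crossover is available. These closed forms are assumptions chosen to be consistent with the factor-2 gain at $p = 1/n$ discussed above; they are not quoted from the theorem statement. The function names are hypothetical.

```python
import math


def leading_constant(c, crossover=True):
    # Assumed form of the leading constant for mutation rate p = c/n.
    denominator = c * (1 + c) if crossover else c
    return math.exp(c) / denominator


def minimize_unimodal(f, lo, hi, iterations=200):
    """Ternary search for the minimizer of a unimodal function on [lo, hi]."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2


c_mutation_only = minimize_unimodal(lambda c: leading_constant(c, False), 0.1, 5.0)
c_with_crossover = minimize_unimodal(lambda c: leading_constant(c, True), 0.1, 5.0)
```

Under these assumptions the minimizer moves from $c = 1$ to the golden ratio $c = (1+\sqrt{5})/2 \approx 1.618$, the positive root of $c^2 - c - 1 = 0$, and the constant drops from $e \approx 2.72$ to about $1.19$.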

^{4}is obtained for . For this choice the dominant term in (3) becomes

## 4 The Optimal Mutation Rate

Corollary 5 gives the mutation rate that yields the best upper bound on the running time that can be obtained with the proof of Theorem 4. However, it does not establish that this mutation rate is indeed optimal for any GA. After all, another mutation rate might lead to a smaller expected optimization time.

In the following, we show for a simple $(2+1)$ GA (Algorithm 2) that the upper bound from Theorem 4 is indeed tight up to small-order terms, which establishes $p = (1+\sqrt{5})/(2n)$ as the optimal mutation rate for that $(2+1)$ GA. Proving lower bounds on expected optimization times is often a notoriously hard task; hence we restrict ourselves to a simple bare-bones GA that captures the characteristics of GAs covered by Theorem 4 and is easy to analyze. The latter is achieved by fixing as many parameters as possible.

As the upper bound from Theorem 4 grows with $\mu$ and $\lambda$, we pick the smallest possible values: $\mu = 2$ and $\lambda = 1$. The parent selection is made as simple as possible: we select parents uniformly at random from the current best individuals in the population. In other words, if we define the parent population as the set of individuals that have a positive probability to be chosen as parents, the parent population only contains individuals of the current best fitness. We call this parent selection “greedy” because it is a greedy strategy to choose the current best search points as parents.
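Greedy parent selection as described here can be sketched as follows. The function name and the choice to draw both parents independently (with replacement) from the current best individuals are assumptions; the text only specifies that parents are chosen uniformly at random among the current best.

```python
import random


def greedy_parent_selection(pop, fitness, rng):
    """Pick two parents uniformly at random among the current best points."""
    best = max(fitness(x) for x in pop)
    parent_pool = [x for x in pop if fitness(x) == best]
    # Assumption: parents are drawn independently, so both may coincide.
    return rng.choice(parent_pool), rng.choice(parent_pool)
```

When the two individuals have different fitness, both parents are necessarily the better one, so crossover returns that individual unchanged and the offspring is a plain mutation of it.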

In the context of the proof of Theorem 4, greedy parent selection implies that cases *i*.1 are never reached, as the parent population never spans more than one fitness level. So the time spent in these cases is 0. This also allows us to eliminate one further parameter by setting $p_c = 1$, as lower values for $p_c$ were only beneficial in cases *i*.1. Setting $p_c = 1$ minimizes our estimate for the time spent in cases *i*.3. So Theorem 4 extends toward this GA (see also Remark 20 in the appendix).

We call the resulting GA a “greedy $(2+1)$ GA” because its main characteristic is the greedy parent selection. The greedy $(2+1)$ GA is defined in Algorithm 2.^{2}

The following result applies to the greedy $(2+1)$ GA using any kind of mask-based crossover. A mask-based crossover is a recombination operator where each bit value is taken from either parent; that is, it is not possible to introduce a bit value that is not represented in any parent. All common crossovers are mask-based crossovers: uniform crossover, including parameterized uniform crossover, as well as *k*-point crossovers for any *k*. The following result even includes biased operators like a bitwise OR, which induces a tendency to increase the number of 1-bits.
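The notion of a mask-based crossover can be illustrated with a short sketch in which every operator is expressed as a Boolean mask that decides, per position, which parent supplies the bit. All function names and the `bias` parameter are hypothetical, and the cut-point convention for *k*-point crossover (distinct cut points strictly inside the string) is an assumption.

```python
import random


def apply_mask(x, y, mask):
    # Take the bit from x where the mask is True, otherwise from y.
    return tuple(a if m else b for a, b, m in zip(x, y, mask))


def uniform_mask(n, rng, bias=0.5):
    # bias = 1/2 gives uniform crossover; other values give the
    # parameterized variant.
    return [rng.random() < bias for _ in range(n)]


def k_point_mask(n, k, rng):
    # Choose k distinct cut points and switch parents at each of them.
    cuts = set(rng.sample(range(1, n), k))
    mask, take_first = [], True
    for pos in range(n):
        if pos in cuts:
            take_first = not take_first
        mask.append(take_first)
    return mask


def or_mask(x, y):
    # Bitwise OR as a (parent-dependent) mask: take x's bit unless only y has a 1.
    return [a >= b for a, b in zip(x, y)]
```

Whatever the mask, positions where both parents agree are inherited unchanged, which is the property the lower bound relies on.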

^{6}is This matches the upper bound (3) up to small-order terms, showing for the greedy () GA that the new term in the denominator of the bound from Theorem

^{4}was not a coincidence. For , the lower bound is at least Together, this establishes the optimal mutation rate for the greedy () GA on OneMax.

For the greedy $(2+1)$ GA with uniform crossover on OneMax, mutation rate $p = (1+\sqrt{5})/(2n)$ minimizes the expected number of function evaluations, up to small-order terms.

For the proof of Theorem 6 we use the following lower-bound technique based on fitness levels by the author.

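Consider a partition of the search space into nonempty sets $A_1, \dots, A_m$. A search algorithm is in $A_i$ or on level *i* if the best individual created so far is in $A_i$. If there are $\chi$, $u_i$, $\gamma_{i,j}$ for $i < j$ where the probability of traversing from level *i* to level *j* in one step is at most $u_i \cdot \gamma_{i,j}$ for all $i < j$, $\sum_{j=i+1}^{m} \gamma_{i,j} = 1$ for all *i*, and $\gamma_{i,j} \ge \chi \sum_{k=j}^{m} \gamma_{i,k}$ for all $i < j$ and some $0 < \chi \le 1$,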

Proof of Theorem 6:

We prove a lower bound for the following sped-up GA instead of the original greedy (2+1) GA. Whenever it creates a new offspring with the same fitness as the current best individual but a different bit string, we assume the following. The algorithm automatically performs a crossover between the two. Also, we assume that this crossover leads to the best possible offspring in the sense that all bits where the two parents differ are set to 1 (i.e., the algorithm performs a bitwise OR). That is, if both search points have *i* 1-bits and Hamming distance $2k$, then the resulting offspring has *i* + *k* 1-bits.
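The counting behind this assumption is elementary: two strings with the same number of 1-bits differ in an even number of positions, half of which are 1 in each parent. The following minimal sketch (the helper name is hypothetical) verifies this exhaustively for short strings.

```python
from itertools import product

def check_or_gain(n):
    """Verify that for all x, y of length n with the same number i of 1-bits
    and Hamming distance 2k, the bitwise OR has exactly i + k 1-bits."""
    for x in product((0, 1), repeat=n):
        for y in product((0, 1), repeat=n):
            if sum(x) != sum(y):
                continue
            dist = sum(a != b for a, b in zip(x, y))
            merged = sum(a | b for a, b in zip(x, y))
            if dist % 2 != 0 or merged != sum(x) + dist // 2:
                return False
    return True

print(check_or_gain(6))
```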

Because of these assumptions, at the end of each generation there is always a single best individual. For this reason we can model the algorithm by a Markov chain representing the current best fitness.

The analysis follows a lower bound for EAs on OneMax (Sudholt, 2013, Theorem 9). As in Sudholt (2013) we consider the following fitness-level partition that focuses only on the very last fitness values. Let . Let for and contain all remaining search points. We know from Sudholt (2013) that the GA is initialized in with probability at least if *n* is large enough.

*i* to fitness *i* + *k* equals According to Sudholt (2013, Lemma 2), for the considered fitness levels the former probability is bounded by The latter probability is bounded by

*u*_{i} and along with some such that all conditions of Theorem 8 are fulfilled. Define and Observe that, for every , In order to fulfill the second condition in Theorem 8, we consider the following normalized variables: and . As , this proves the first condition of Theorem 8.

Following the proof of Theorem 9 in Sudholt (2013), it is easy to show that for we get for all with [the calculations in Sudholt (2013, pp. 427–428) carry over by replacing with ]. This establishes the third and last condition.

Theorem 8 and recalling that the first fitness level is reached with probability at least , we get a lower bound of where the last step used that all factors , and are , and for any positive constants . Bounding as in Sudholt (2013) and absorbing all small-order terms in the term from the statement gives the claimed bound.

We also ran experiments to see whether the outcome matches our inspection of the dominating terms in the running time bounds for realistic problem dimensions. We chose bits and recorded the average optimization time over 1,000 runs. The mutation rate *p* was set to with . The result is shown in Figure 1.

One can see that for every mutation rate the greedy (2+1) GA has a lower average optimization time. As predicted, the performance difference becomes larger as the mutation rate increases. The optimal mutation rates for both algorithms match minimal average optimization times. Note also that the standard deviation was much lower for the GA for higher mutation rates. Preliminary runs for and bits gave very similar results. More experiments and statistical tests are given in Section 6.1.
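An experiment of this kind can be sketched in a few lines. The code below is an illustrative sketch under stated assumptions, not the implementation used for Figure 1: Algorithm 2 is not reproduced in this section, so the replacement step (remove a worst individual, preferring one that has a duplicate, otherwise at random) and all function names are assumptions.

```python
import random

def onemax(x):
    return sum(x)

def mutate(x, p, rng):
    """Standard bit mutation: flip each bit independently with probability p."""
    return [b ^ (rng.random() < p) for b in x]

def one_plus_one_ea(n, p, rng):
    """Number of function evaluations until the (1+1) EA finds the all-ones string."""
    x = [rng.randint(0, 1) for _ in range(n)]
    evals = 1
    while onemax(x) < n:
        y = mutate(x, p, rng)
        evals += 1
        if onemax(y) >= onemax(x):
            x = y
    return evals

def greedy_two_plus_one_ga(n, p, rng):
    """Greedy parent selection, uniform crossover, then mutation (assumed details)."""
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(2)]
    evals = 2
    while max(map(onemax, pop)) < n:
        best = max(map(onemax, pop))
        parents = [x for x in pop if onemax(x) == best]
        a, b = rng.choice(parents), rng.choice(parents)
        child = [ai if rng.random() < 0.5 else bi for ai, bi in zip(a, b)]
        child = mutate(child, p, rng)
        evals += 1
        pool = pop + [child]
        worst = min(map(onemax, pool))
        candidates = [i for i, x in enumerate(pool) if onemax(x) == worst]
        # Among the worst, prefer removing an individual that has a duplicate.
        duplicated = [i for i in candidates if pool.count(pool[i]) > 1]
        pool.pop(rng.choice(duplicated or candidates))
        pop = pool
    return evals

rng = random.Random(42)
n, runs = 50, 20
for c in (1.0, 1.618):
    ea = sum(one_plus_one_ea(n, c / n, rng) for _ in range(runs)) / runs
    ga = sum(greedy_two_plus_one_ga(n, c / n, rng) for _ in range(runs)) / runs
    print(c, ea, ga)
```

The problem size and number of runs are kept small so that the sketch finishes quickly; they are not the settings of the reported experiments.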

## 5 *k*-Point Crossover

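The *k*-point crossover operator picks *k* cutting points from $\{1, \dots, n-1\}$ uniformly at random without replacement. These cutting points divide both parents into segments that are then assembled from alternating parents. That is, for parents $x, y$ and cutting points $\ell_1 < \ell_2 < \dots < \ell_k$, the offspring will be $x_1 \dots x_{\ell_1} y_{\ell_1+1} \dots y_{\ell_2} x_{\ell_2+1} \dots$, the suffix being $y_{\ell_k+1} \dots y_n$ if *k* is odd and $x_{\ell_k+1} \dots x_n$ if *k* is even.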
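The operator described above can be sketched as follows. This is a minimal illustration that assumes cutting points are drawn from the positions 1 to *n* − 1 and that the prefix is taken from the first parent; the function name is hypothetical.

```python
import random

def k_point_crossover(x, y, k, rng=random):
    """Cut both parents at k distinct positions and alternate between them,
    starting with x."""
    n = len(x)
    cuts = sorted(rng.sample(range(1, n), k))  # k distinct cut points in {1, ..., n-1}
    child, source, start = [], 0, 0
    for cut in cuts + [n]:
        child.extend((x, y)[source][start:cut])
        source, start = 1 - source, cut
    return child

x = [0] * 8
y = [1] * 8
print(k_point_crossover(x, y, 1))
print(k_point_crossover(x, y, 2))
```

With an all-zeros and an all-ones parent, the offspring shows the segments directly: it always starts with a 0 and ends with a 1 exactly when *k* is odd.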

For uniform crossover we have seen that populations containing different search points of equal fitness are beneficial, as uniform crossover can easily combine the good "building blocks." This holds regardless of the Hamming distance between these different individuals and the position of bits where individuals differ.

The ($\mu$+$\lambda$) GA with *k*-point crossover is harder to analyze, as there the probability of crossover creating an improvement depends on the Hamming distance of parents and the position of differing bits.

Consider parents that differ in two bits, where these bit positions are quite close. Then 1-point crossover has a high probability of taking both bits from the same parent. In order to recombine those building blocks, the cutting point has to be chosen between the two bit positions. A similar effect occurs for 2-point crossover if the two bit positions are on opposite ends of the bit string.

The following lemma gives a lower bound on the probability that *k*-point crossover combines the right building blocks on OneMax if two parents are equally fit and differ in two bits. The lemma and its proof may be of independent interest.

We identify cutting points with bits such that cutting point *a* results in two strings $x_1 \dots x_a$ and $x_{a+1} \dots x_n$. We say that a cutting point *a* separates *i* and *i* + *d* if $i \le a < i + d$. Note that the prefix is always taken from *x*. The claim now follows from showing that the number of separating cutting points is odd with the claimed probability.

*i* and *i* + *d*. This variable follows a hypergeometric distribution , illustrated by the following urn model with red and white balls. The urn contains *N* balls, *d* of which are red. We draw *k* balls uniformly at random without replacement. Then describes the number of red balls drawn. We define the probability of being odd, for and , as Note that for and for For all and all the following recurrence holds. Imagine drawing the first cutting point separately. With probability $d/N$, the cutting point is a separating cutting point, and then an even number of further separating cutting points is needed among the remaining cutting points, drawn from a random variable . With the remaining probability $(N-d)/N$, the number of remaining separating cutting points must be odd, and this number is drawn from a random variable . Hence Assume for an induction that for all , This is true for as, using for , For , combining (6) and (7) yields The upper bound follows similarly: By induction, the claim follows.
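The urn argument can be checked with exact arithmetic. The sketch below computes the probability that the number of red balls drawn is odd in two ways: directly from the hypergeometric probabilities, and by conditioning on the color of the first ball drawn. Function names and the argument order (*N*, *d*, *k*) are assumptions made for illustration.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb

def prob_odd_direct(N, d, k):
    """P(X odd) for X hypergeometric: N balls, d of them red, k drawn."""
    total = comb(N, k)
    return sum(Fraction(comb(d, r) * comb(N - d, k - r), total)
               for r in range(1, min(d, k) + 1, 2))

@lru_cache(maxsize=None)
def prob_odd_recursive(N, d, k):
    """Same probability via conditioning on the first ball drawn."""
    if k == 0 or d == 0:
        return Fraction(0)
    # First ball red: the remaining k-1 draws must contain an even number of reds.
    result = Fraction(d, N) * (1 - prob_odd_recursive(N - 1, d - 1, k - 1))
    if N > d:
        # First ball white: the remaining k-1 draws must contain an odd number.
        result += Fraction(N - d, N) * prob_odd_recursive(N - 1, d, k - 1)
    return result

print(prob_odd_direct(10, 1, 3), prob_odd_recursive(10, 1, 3))
```

For a single red ball the probability reduces to *k*/*N*, the chance that the one red ball is among the *k* drawn.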

In the setting of Lemma 9, the probability of *k*-point crossover creating an improvement depends on the distance between the two differing bits. Fortunately, for search points that result from a mutation of one another, this distance has a favorable distribution. This is made precise in the following lemma.

We first show the following. For any fixed index *i* and any integer , there are exactly two positions *j* such that . If and are fixed, the only values for *j* that result in either or are , and . Note that at most two of these values are in . Hence, there are at most two feasible values for *j* for every . Similarly, for there is just one position such that .

*x*. If , assume that first the 0-bit is chosen uniformly at random, and then consider the uniform random choice of a corresponding 1-bit. As each bit has a probability of of being selected, and at most two choices lead to a particular value of , we have The case follows symmetrically by considering the uniform choice of the 0-bit among choices.

Taken together, Lemma 9 and Lemma 10 indicate that *k*-point crossover has a good chance of finding improvements through recombining the right "building blocks." However, this is based on the population containing potential parents of equal fitness that only differ in two bits.

The following analysis shows that the population is likely to contain such a favorable pair of parents. However, such a pair might get lost again if other individuals of the same fitness are being created, after all duplicates have been removed from the population. For parents that differ in more than 2 bits, Lemma 9 does not apply; hence we do not have an estimate of how likely such a crossover will find an improvement.

In order to avoid this problem, we consider a more detailed tie-breaking rule. As before, individuals with fewer duplicates are preferred. In case there are still ties after considering the number of duplicates, the ($\mu$+$\lambda$) GA will retain older individuals. This refined tie-breaking rule is shown in Algorithm 3. As shown in the remainder, it implies that once a favorable pair of parents with Hamming distance 2 has been created, this pair will never get lost.
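The preference order described above (fitness first, then fewer duplicates, then age) can be sketched as a survivor-selection routine. Algorithm 3 is not reproduced in this section, so the following is an assumption-laden illustration rather than the paper's pseudocode: individuals are hypothetical `(birth_time, bitstring)` pairs and are removed one at a time.

```python
def select_survivors(pool, mu, fitness):
    """Keep mu individuals from pool, a list of (birth_time, bitstring_tuple).
    Removal order: worst fitness, then most duplicates, then youngest."""
    pool = list(pool)
    while len(pool) > mu:
        counts = {}
        for _, x in pool:
            counts[x] = counts.get(x, 0) + 1
        victim = min(pool, key=lambda ind: (fitness(ind[1]), -counts[ind[1]], -ind[0]))
        pool.remove(victim)
    return pool

pool = [(0, (1, 1, 0)), (1, (1, 0, 1)), (2, (0, 1, 1)), (3, (1, 1, 0))]
print(select_survivors(pool, 2, sum))
```

In this example all individuals have equal fitness. The younger copy of the duplicated string is removed first, and then the youngest of the remaining distinct strings, so the older pair with Hamming distance 2 is retained.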

This tie-breaking rule, called "dup-old," differs from the one used for the experiments in Figure 1 and those in Section 6. There, we broke ties uniformly at random in case individuals are tied with respect to both fitness and the number of duplicates. We call the latter rule "dup-rnd." Experiments for the greedy (2+1) GA comparing tie-breaking rules dup-old and dup-rnd over 1,000 runs indicate that performance differences are very small (see Figure 2).^{3}

Note, however, that on functions with plateaus, like royal road functions, retaining the older individuals prevents the ($\mu$+$\lambda$) GA from performing random walks on the plateau, once the population has spread such that there are no duplicates of any individual. In this case performance is expected to deteriorate when breaking ties toward older individuals.

With the refined tie-breaking rule, the performance of ($\mu$+$\lambda$) GAs is as follows.

This bound equals the upper bound (3) for ($\mu$+$\lambda$) GAs with uniform crossover. It improves upon the previous upper bound for the greedy (2+1) GA (Sudholt, 2012, Theorem 8), whose dominant term was by an additive term of larger. The reason is that for that GA favorable parents could get lost, which is now prevented by the dup-old tie-breaking rule and conditions on $p_c$.

The conditions as well as are useful because they allow us to estimate the probability that a single good individual takes over the whole population with copies of itself.

In the remainder of this section we work toward proving Theorem 11 and assume that $n \ge n_0$ for some $n_0$ chosen such that all asymptotic statements that require a large enough value of *n* hold true. For $n < n_0$ there is nothing to prove, as the statement holds trivially for bounded *n*.

We again estimate the time spent on each fitness level *i*, that is, when the best fitness in the current population is *i*. To this end, the focus is on the higher fitness levels where the probability of creating an offspring on the same level can be estimated nicely. The time for reaching these higher fitness levels only constitutes a small-order term, compared to the claimed running time bound. The following lemma proves this claim in a more general setting than needed for the proof of Theorem 11. In particular, it holds for arbitrary tie-breaking rules and crossover operators.

For every ($\mu$+$\lambda$) GA implementing Algorithm 1 with , and for a constant , using any initialization and any crossover operator, the expected time until a fitness level is reached for the first time is .

A proof is given in the appendix.

In the remainder of the section we focus on higher fitness levels and specify the different cases on each such fitness level. The cases *i*.1, *i*.2, and *i*.3 are similar to the ones for uniform crossover, with additional conditions on the similarity of individuals in cases *i*.2 and *i*.3. We also have an additional error state that accounts for undesirable and unexpected behavior. We pessimistically assume that the error state cannot be left toward other cases on level *i*.

*Case i.1*. The population contains an individual on level *i* and at least one individual on a lower fitness level.

*Case i.2*. The population contains $\mu$ copies of an individual *x* on level *i*.

*Case i.3*. The population contains two search points *x*, *y* with current best fitness *i*, where *y* resulted from a mutation of *x* and the Hamming distance of *x* and *y* is 2.

*Case i.error*. An error state reached from any case when the best fitness is *i* and none of the prior cases applies.

The difference from the analysis of uniform crossover is that in case *i*.2 we rely on the population collapsing to copies of a single individual. This helps to estimate the probability of creating a favorable parent-offspring pair in case *i*.3, as the ($\mu$+$\lambda$) GA effectively only performs mutations of *x* while being in case *i*.2.

Theorem 4, we use Lemma 2 and get that the expected time spent in all cases *i*.1 is at most

Theorem 4, as both crossover operators are working on identical individuals. As before, case *i*.2 is left if either a better offspring is created or a different offspring with *i* ones is created. In the latter case, either case *i*.3 or the error state *i*.error is reached. By the proof of Theorem 4, we know that the expected time spent in cases *i*.2 across all levels *i* is bounded by

Now we estimate the total time spent in all cases *i*.3. As this time turns out to be comparably small, the fact that not all these cases are actually reached can be ignored.

*x* and *y* differ, then is a random variable with support . By the law of total expectation, We first bound the conditional expectation by considering probabilities for improvements. If then crossover is successful if crossover is performed (probability *p*_{c}), if the search point where bit *a* is 1 is selected as first parent (probability at least ), if the remaining search point in is selected as second parent (probability at least ), and if cutting points are chosen that lead to a fitness improvement. The latter event has probability at least by Lemma 9, with . Finally, we need to assume that the following mutation does not destroy any fitness improvements (probability at least ). The probability of a successful crossover is then at least, using , Another means of escaping from case *i*.3 is by not using crossover but having mutation create an improvement. The probability for this is at least for a constant . Applying Lemma 3, Note that this upper bound is nonincreasing with . We are therefore pessimistic when replacing by the pessimistic probability estimations from Lemma 10. Combining this with (8) and (10) yields The last sum is estimated as follows. Along with , , and , we get For the sum we then have the following: as the integral is . This completes the proof.

The remainder of the proof is devoted to estimating the expected time spent in the error state. To this end we need to consider events that take the ($\mu$+$\lambda$) GA "off course," that is, deviating from situations described in cases *i*.1, *i*.2, and *i*.3.

Since case *i*.3 is based on offspring with Hamming distance 2 to their parents, one potential failure is that an offspring with fitness *i* but Hamming distance greater than 2 to its parent is being created. This probability is estimated in the following lemma.

The proof is found in the appendix.

Another potential failure occurs if the population does not collapse to copies of a single search point, that is, the transition from case *i*.1 to case *i*.2 is not made. We first estimate the probability of mutation unexpectedly creating an individual with fitness *i*.

Note that for the special case $p = 1/n$, Doerr, Johannsen, and Winzen (2012b, Lemma 13) give an upper bound of . This is because the highest probability for a jump to fitness level *i* is attained when the parent is on level $i-1$. However, for larger mutation probabilities this is no longer true in general; there are cases where the probability of jumping to level *i* is maximized for parents on lower fitness levels. Hence, a closer inspection of transition probabilities between different fitness levels is required; see the proof in the appendix.

Using Lemma 15, we can now estimate the probability of the ($\mu$+$\lambda$) GA not collapsing to copies of a single search point as described in case *i*.2.

Theorem 11, with parameters , , and , for some constant , and fix a fitness level . The probability that the ($\mu$+$\lambda$) GA will reach a population containing different individuals with fitness *i* before either reaching a population containing only copies of the same individual on level *i* or reaching a higher fitness level is at most

We show that there is a good probability of repeatedly creating clones of individuals with fitness *i* (or finding an improvement) and avoiding the following *bad* event. A bad event happens if an individual on fitness level *i* is created in one offspring creation by means other than through cloning an existing individual on level *i*.

*p*_{c}, bound the probability of a bad event by the trivial bound 1. Otherwise, such an individual needs to be created through mutation, either from a worse fitness level or by mutating a parent on level *i*. The probability for the former is bounded from above by Lemma 15. The probability for the latter is at most , as it is necessary to flip one out of *n* − *i* 0-bits. Using , the probability of a bad event on level *i* is hence bounded from above by where is a constant. The ($\mu$+$\lambda$) GA will only reach a population containing different individuals with fitness *i* as stated if a bad event happens before the population has collapsed to copies of a single search point or moved on to a higher fitness level.

*i* is reached for the first time. Since it might be possible to create several such individuals in one generation, we consider all offspring creations being executed sequentially and consider the possibility of bad events for all offspring creations following the first offspring on level *i*. Let *X* be the number of function evaluations following this generation before all individuals in the population have fitness at least *i*. By Lemma 2, we have Considering up to further offspring creations in the first generation leading to level *i*, and completing the generation at the end of the *X* function evaluations, we have fewer than trials for bad events. The probability that one of these is bad is bounded by Absorbing in the *O*-term yields the claimed result.

Now we can estimate the expected time spent in all error states *i*.error for .

The ($\mu$+$\lambda$) GA only spends time in an error state if it is actually reached. So we first calculate the probability that state *i*.error is reached from case *i*.1, *i*.2, or *i*.3.

Lemma 16 states that the probability of reaching a population with different individuals on level *i* before reaching case *i*.2 or a better fitness level is We pessimistically ignore the possibility that case *i*.3 might be reached if this happens; thus the above is an upper bound for the probability of reaching *i*.error from case *i*.1.

Lemma 14 is in force. According to Lemma 14 the probability of leaving case *i*.2 by creating a different individual with fitness *i* is at least . The probability of doing this with an offspring of Hamming distance greater than 2 to its parent is at most (second statement of Lemma 14). So the conditional probability of reaching the error state when leaving case *i*.2 toward another case on level *i* is at most In case note that case *i*.3 is reached if there is a single offspring with fitness *i* and Hamming distance 2 to its parent. Such an offspring is guaranteed to survive, as we assume and offspring with many duplicates are removed first. Thus in case several offspring with fitness *i* and differing from their parent are created, *all* of them need to have Hamming distance larger than 2 in order to reach *i*.error from case *i*.2. This probability decreases with increasing ; hence the probability bound (11) also holds for .

Finally, case *i*.3 implies that there exists a parent-offspring pair with Hamming distance 2. In a new generation these two individuals, or at least one copy of each, will always survive: individuals with multiple duplicates are removed first, and if among current parents and offspring more than $\mu$ individuals exist with no duplicates, *x* and *y* will be preferred over newly created offspring. So the probability of reaching the error state from case *i*.3 is 0.

Lemma 3 as before, this translates to at most expected function evaluations. So the expected time spent in case *i*.error is at most as both and . The total expected time across all error states is at most

Now Theorem 11 follows from all previous lemmas.

Proof of Theorem 11:

The claimed upper bound now follows from adding the upper bounds on the expected time on the smaller fitness levels (Lemma 12) to the expected times spent in all considered cases (Lemma 13 and Lemma 17).

Some of the technical conditions from Theorem 11 involving $\mu$, $\lambda$, and $p_c$ could be relaxed if it is possible to generalize Lemmas 9 and 10 toward more than 2 differing bits between individuals of equal fitness.

## 6 Extensions to Other Building Block Functions

### 6.1 Royal Roads and Monotone Polynomials

So far, our theorems and proofs have been focused on OneMax only. This is because we do have very strong results about the performance of EAs on OneMax at hand. However, the insights gained stretch far beyond OneMax. Royal road functions generally consist of larger blocks of bits. All bits in a block need to be set to 1 in order to contribute to the fitness; otherwise the contribution is 0. All blocks contribute the same amount to the fitness, and the fitness is just the sum of all contributions.
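A royal road function of this kind can be written in a few lines. The sketch below assumes that each completed block contributes 1 to the fitness and that the blocks are consecutive, equally sized segments; both choices and the function name are assumptions made for illustration.

```python
def royal_road(x, block_size):
    """Number of blocks of consecutive bits in x that consist of 1s only."""
    n = len(x)
    if n % block_size != 0:
        raise ValueError("string length must be a multiple of the block size")
    return sum(all(x[s:s + block_size]) for s in range(0, n, block_size))

print(royal_road([1, 1, 1, 1, 1, 1, 0, 1, 1, 1], 5))
print(royal_road([1, 0, 1], 1))
```

With block size 1 every bit is its own block, so the function coincides with OneMax.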

The fundamental insight we have gained for neutral mutations also applies to royal road functions. If there is a mutation that completes one block but destroys another block, this is a neutral mutation and the offspring will be stored in the population of a ($\mu$+$\lambda$) GA. Then crossover can recombine all finished blocks in the same way as for OneMax. The only difference is that the destroyed block may evolve further. More neutral mutations can occur that only alter bits in the destroyed block. Then the population can be dominated by many similar solutions, and it becomes harder for crossover to find a good pair for recombination. However, as crossover has generally a very high probability of finding improvements, the last effect probably plays only a minor role.

A theoretical analysis of general royal roads up to the same level of detail as for OneMax is harder but not impossible. So far, results on royal roads and monotone polynomials have been mostly asymptotic (Wegener and Witt, 2005; Doerr, Sudholt, and Witt, 2013b). Only recently, Doerr and Künnemann (2013) presented a tighter runtime analysis of offspring populations for royal road functions, which may lend itself to a generalization of our results on OneMax in future work.

For now, we use experiments to see whether the performance is similar to that on OneMax. We use royal roads with $n = 1000$ bits and block size 5, that is, we have 200 pairwise disjoint blocks of 5 bits each. We also consider random monotone polynomials. Instead of using disjoint blocks, we use 1,000 monomials of degree 5 (conjunctions of 5 bits): each monomial is made up of 5 bit positions chosen uniformly at random, without replacement. This leads to a function similar to royal roads, but "blocks" are broken up and can share bits; bit positions are completely random. Figure 3 shows the average optimization times in 1,000 runs on all these functions, for the (1+1) EA and the greedy (2+1) GA with uniform, 1-point, and 2-point crossover. We chose the last two because *k*-point crossovers for odd *k* treat ends of bit strings differently from those for even *k*: for odd *k* two bits close to opposite ends of a bit string have a high probability of being taken from different parents, whereas for even *k* there is a high chance that both will be taken from the same parent (see Lemma 9 for and the special case of ).
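The random monotone polynomials described above can be generated as follows. This is a minimal sketch assuming that every monomial has coefficient 1 and that the fitness is the number of satisfied monomials; the function names are hypothetical.

```python
import random

def random_monomials(n, count, degree, rng):
    """Draw `count` monomials, each a list of `degree` distinct bit positions."""
    return [rng.sample(range(n), degree) for _ in range(count)]

def monotone_polynomial(x, monomials):
    """Number of monomials whose bit positions are all set to 1 in x."""
    return sum(all(x[i] for i in mono) for mono in monomials)

rng = random.Random(0)
monomials = random_monomials(1000, 1000, 5, rng)
print(monotone_polynomial([1] * 1000, monomials))
```

Unlike royal road blocks, two monomials drawn this way may share bit positions, and some positions may not appear in any monomial.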

For consistency and simplicity, we use $p_c = 1$ and the tie-breaking rule dup-rnd in all settings, that is, ties in fitness are broken toward minimum numbers of duplicates and any remaining ties are broken uniformly at random. For OneMax this does not perfectly match the conditions of Theorem 11, as they require a lower crossover probability, , and tie-breaking rule dup-old. But the experiments show that *k*-point crossover is still effective when these conditions are not met.

On OneMax both *k*-point crossovers are better than the (1+1) EA but slightly worse than uniform crossover. This is in accordance with the observation from our analyses that improvements with *k*-point crossover might be harder to find if the differing bits are in close proximity.

For royal roads the curves are very similar. The difference between the (1+1) EA and the greedy (2+1) GA is just a bit smaller. For random polynomials there are visible differences, albeit smaller. Mann–Whitney *U* tests confirm that wherever there is a noticeable gap between the curves, there is a statistically significant difference on a significance level of 0.001. The outcome of Mann–Whitney *U* tests is summarized in Table 1.

| Function | Crossover | (1+1) EA | Uniform | 1-point |
|---|---|---|---|---|
| OneMax | uniform | | | |
| | 1-point | for | | |
| | 2-point | for | (11 ex.) | |
| Royal road | uniform | for | | |
| | 1-point | for | for (1 ex.) | |
| | 2-point | for | for (5 ex.) | (6 ex.) |
| Random polynomial | uniform | (1 ex.) | | |
| | 1-point | for (3 ex.) | (13 ex.) | |
| | 2-point | for (1 ex.) | (6 ex.) | (6 ex.) |

For very small mutation rates the tests were not significant. For mutation rates no less than all differences between the (1+1) EA and all greedy (2+1) GAs were statistically significant, apart from a few exceptions on random polynomials. For OneMax the difference between uniform crossover and *k*-point crossover was significant for . For royal roads the majority of such comparisons showed statistical significance, with a number of exceptions. However, for random polynomials the majority of comparisons were not statistically significant. Most comparisons between 1-point and 2-point crossover did not show statistical significance.

These findings give strong evidence that the insights drawn from the analysis on OneMax transfer to broader classes of functions where building blocks need to be assembled.

### 6.2 Linear Functions

Doerr et al. (2013a) provided empirical evidence that their (1+($\lambda$, $\lambda$)) EA is faster than the (1+1) EA on linear functions with weights drawn uniformly at random from .

It is an open question whether this also holds for more common GAs, that is, those implementing Algorithm 1. Experiments in Doerr et al. (2013a) on the greedy (2+1) GA found that on random linear functions “no advantage of the (2+1) GA over the (1+1) EA is visible.” We provide an explanation for this observation and reveal why the (2+1) GA is not well suited for weighted building blocks, whereas other GAs might be.

The reason the (2+1) GA behaves like the (1+1) EA in the presence of weights is that in case the current population of the (2+1) GA contains two members with different fitness, the (2+1) GA ignores the inferior one. So it behaves as if the population only contained the fitter individual. Since the (2+1) GA will select the fitter individual twice for crossover, followed by mutation, it essentially just mutates the fitter individual. This behavior of the (2+1) GA then equals that of a (1+1) EA working on the fitter individual.

The (2+1) GA is more efficient than the (1+1) EA on OneMax (and other building-block functions where all building blocks are equally important) as it can easily generate and store individuals with equal fitness in the population and recombine their different building blocks. However, in the presence of weights, chances of creating individuals of equal fitness might be very slim, and then the (2+1) GA behaves like the (1+1) EA.

As long as the population of the (2+1) GA does not contain two different individuals with the same fitness, the (2+1) GA is equivalent to the (1+1) EA.
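How rare fitness ties become under random weights can be illustrated by exhaustive counting on short strings. The sketch below is an illustration under assumptions: the weights are large random integers from a hypothetical range (the interval used in the cited experiments is not restated here), and the helper name is made up.

```python
import random
from itertools import product

def count_ties(weights):
    """Number of unordered pairs of distinct bit strings with equal fitness
    under the linear function with the given weights."""
    n = len(weights)
    seen = {}
    for x in product((0, 1), repeat=n):
        f = sum(w * b for w, b in zip(weights, x))
        seen[f] = seen.get(f, 0) + 1
    return sum(c * (c - 1) // 2 for c in seen.values())

rng = random.Random(1)
n = 8
onemax_ties = count_ties([1] * n)
linear_ties = count_ties([rng.randrange(10**12, 2 * 10**12) for _ in range(n)])
print(onemax_ties, linear_ties)
```

With unit weights, any two strings with the same number of 1-bits are tied, whereas with random weights a tie requires two different subsets of weights to have exactly the same sum.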