We reinvestigate a fundamental question: How effective is crossover in genetic algorithms in combining building blocks of good solutions? Although this has been discussed controversially for decades, we are still lacking a rigorous and intuitive answer. We provide such answers for royal road functions and OneMax, where every bit is a building block. For the latter, we show that using crossover makes every $(\mu+\lambda)$ genetic algorithm at least twice as fast as the fastest evolutionary algorithm using only standard bit mutation, up to small-order terms and for moderate $\mu$ and $\lambda$. Crossover is beneficial because it can capitalize on mutations that have both beneficial and disruptive effects on building blocks: crossover is able to repair the disruptive effects of mutation in later generations. Compared to mutation-based evolutionary algorithms, this makes multibit mutations more useful. Introducing crossover changes the optimal mutation rate on OneMax from $1/n$ to $(1+\sqrt{5})/2 \cdot 1/n \approx 1.618/n$. This holds both for uniform crossover and k-point crossover. Experiments and statistical tests confirm that our findings apply to a broad class of building block functions.
Ever since the early days of genetic algorithms (GAs), researchers have wondered when and why crossover is an effective search operator. In evolutionary biology it has been folklore that crossover can speed up adaptation by bringing together multiple beneficial changes that resulted from independent mutation events, famously illustrated by Muller (1932, diagram 1). The same view was taken in evolutionary computation, where building blocks were regarded as schemata of high fitness (see, e.g., Davis, 1991, p. 18; Mitchell, Forrest, and Holland, 1992; and De Jong and Spears, 1992). But as Watson and Jansen (2007) put it, there has been a considerable difficulty in demonstrating this rigorously and intuitively.
Many attempts at understanding crossover have been made in the past. Mitchell et al. (1992) presented so-called royal road functions as an example where supposedly genetic algorithms outperform other search algorithms due to the use of crossover. Royal roads divide a bit string into disjoint blocks. Each block makes a positive contribution to the fitness if all bits therein are set to 1. Blocks thus represent schemata, and all-1s configurations are building blocks of optimal solutions. However, the same authors later concluded that simple randomized hill climbers performed better than GAs (Forrest and Mitchell, 1993; Mitchell, Holland, and Forrest, 1994).
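The block structure described above can be made concrete with a minimal sketch. It assumes that every complete block contributes exactly 1 to the fitness (the original royal road functions may weight blocks differently), and the names `royal_road`, `one_max`, and `block_size` are illustrative choices, not taken from the cited papers.

```python
from typing import Sequence


def one_max(x: Sequence[int]) -> int:
    """Number of 1-bits; every single bit acts as its own building block."""
    return sum(x)


def royal_road(x: Sequence[int], block_size: int) -> int:
    """Count the disjoint blocks of length block_size whose bits are all 1.

    Assumption: each complete block contributes exactly 1 to the fitness.
    """
    if block_size <= 0 or len(x) % block_size != 0:
        raise ValueError("block_size must be positive and divide len(x)")
    return sum(
        all(x[i:i + block_size])
        for i in range(0, len(x), block_size)
    )


x = [1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
print(royal_road(x, 4))  # 1: only the first block is complete
print(one_max(x))        # 7
```

With `block_size=1` the sketch coincides with OneMax, which is the sense in which every bit of OneMax is a building block.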
The role of crossover has been studied from multiple angles, including algebra (Rowe, Vose, and Wright, 2002), Markov chain models (Vose, 1999), infinite population models and dynamical systems (see De Jong, 2006, ch. 6, for an overview), and statistical mechanics (see, e.g., Prügel-Bennett and Rogers, 2001; Shapiro, 2001, and the references therein).
Also in biology the role of crossover is far from settled. In population genetics exploring the advantages of recombination, or sexual reproduction, is a famous open question (Barton and Charlesworth, 1998) and has been called “the queen of problems in evolutionary biology” by Bell (1982) and others. Evolutionary processes with recombination were found to be harder to analyze than those using only asexual reproduction, as they represent quadratic dynamical systems (Arora, Rabani, and Vazirani, 1994; Rabani, Rabinovich, and Sinclair, 1998).
Recent work in population genetics has focused on studying the “speed of adaptation,” which describes the efficiency of evolution, in a similar vein to research in evolutionary computation (Weissman and Barton, 2012; Weissman, Feldman, and Fisher, 2010). We refer the interested reader to Paixão, Badkobeh, Barton, Corus, Dang, Friedrich, Lehre, Sudholt, Sutton, and Trubenová (2015) and Paixão, Pérez Heredia, Sudholt, and Trubenová (2015) for steps toward unifying research in both fields. Furthermore, a new theory of mixability has been proposed from the perspective of theoretical computer science (Livnat, Papadimitriou, Dushoff, and Feldman, 2008; Livnat, Papadimitriou, Pippenger, and Feldman, 2010), arguing that recombination favors individuals that are good mixers, that is, individuals that create good offspring when being recombined with others.
Several researchers independently reported empirical observations that using crossover improves the performance of evolutionary algorithms (EAs) on the simple function OneMax (Lässig, 2009; Rowe, 2015) but were unable to explain why. The fact that even settings as simple as OneMax are not well understood demonstrates the need for a solid theory and serves as motivation for this work.
Runtime analysis has become a major area of research that can give rigorous evidence and proven theorems (Neumann and Witt, 2010; Auger and Doerr, 2011; Jansen, 2013). However, studies so far have eluded the most fundamental setting of building block functions. Crossover was proven to be superior to mutation only on constructed artificial examples like Jump (Jansen and Wegener, 2002; Kötzing, Sudholt, and Theile, 2011) and “real royal road” functions (Jansen and Wegener, 2005; Storch and Wegener, 2004), the H-IFF problem (Dietzfelbinger, Naudts, Van Hoyweghen, and Wegener, 2003), coloring problems inspired by the Ising model from physics (Fischer and Wegener, 2005; Sudholt, 2005), computing unique input-output sequences for finite state machines (Lehre and Yao, 2011), selected problems from multiobjective optimization (Qian, Yu, and Zhou, 2013), and the all-pairs shortest path problem (Doerr, Happ, and Klein, 2012a; Sudholt and Thyssen, 2012; Neumann and Theile, 2010). H-IFF (Dietzfelbinger et al., 2003) and the Ising model on trees (Sudholt, 2005) consist of hierarchical building blocks. But none of these papers addressed single-level building blocks in a setting as simple as royal roads.
Watson and Jansen (2007) presented a constructed building block function and proved exponential performance gaps between EAs using only mutation and a GA. However, the definition of the internal structure of building blocks is complicated and artificial, and they used a tailored multideme GA to get the necessary diversity. With regard to how GAs combine building blocks, their approach does not give the intuitive explanation one is hoping for.
This paper presents such an intuitive explanation, supported by rigorous analyses. We consider royal roads and other functions composed of building blocks such as monotone polynomials. OneMax is a special case where every bit is a building block. We give rigorous proofs for OneMax and show how the main proof arguments transfer to broader classes of building block functions. Experiments support the latter.
Our main results are as follows.
We show in Section 3 that on OneMax every $(\mu+\lambda)$ GA with uniform crossover and standard bit mutation is at least twice as fast as every evolutionary algorithm (EA) that only uses standard bit mutations (up to small-order terms). More precisely, the dominating term in the expected number of function evaluations decreases from $e \cdot n \ln n$ to $e/2 \cdot n \ln n$. This holds provided that the parent population and offspring population sizes $\mu$ and $\lambda$ are moderate, so that the inertia of a large population does not slow down exploitation. The reason for this speedup is that the GA can store a neutral mutation (a mutation not altering the parent’s fitness) in the population, along with the respective parent. It can then use crossover to combine the good building blocks between these two individuals, improving the current best fitness. In other words, crossover can capitalize on mutations that have both beneficial and disruptive effects on building blocks, as crossover is able to repair the disruptive effects of mutation in later generations.
The use of uniform crossover leads to a shift in the optimal mutation rate on OneMax. Section 4 demonstrates this for a simple greedy $(2+1)$ GA that always selects parents among the current best individuals. While for mutation-based EAs $1/n$ is the optimal mutation rate (Witt, 2013), the greedy $(2+1)$ GA has an optimal mutation rate of $(1+\sqrt{5})/2 \cdot 1/n \approx 1.618/n$ (ignoring small-order terms). This is because introducing crossover makes neutral mutations more useful and larger mutation rates increase the chance of a neutral mutation. Optimality is proved by means of a matching lower bound on the expected optimization time of the greedy $(2+1)$ GA that applies to all mask-based crossover operators (where each bit value is taken from either parent). Using the optimal mutation rate, the expected number of function evaluations is approximately $1.19\, n \ln n$, up to small-order terms.
These results are not limited to uniform crossover or the absence of linkage. Section 5 shows that the same results hold for GAs using k-point crossover, for arbitrary k, under slightly stronger conditions on $\mu$ and $\lambda$, if the crossover probability $p_c$ is set to an appropriately small value.
The reasoning for OneMax carries over to other functions with a clear building block structure. Experiments in Section 6 reveal similar performance differences as on OneMax for royal road functions and random polynomials with unweighted, positive coefficients. This is largely confirmed by statistical tests. There is evidence that findings also transfer to weighted building block functions like linear functions, provided that the population can store solutions with different fitness values and different building blocks until crossover is able to combine them. This is not the case for the greedy () GA, but a simple () GA is significantly faster on random linear functions than the optimal mutation-based EA for this class of functions, the () EA (Witt, 2013).
The first result, the analysis for uniform crossover, is remarkably simple and intuitive. It gives direct insight into the working principles of GAs. Its simplicity also makes it very well suited for teaching purposes.
This work extends a preliminary conference paper (Sudholt, 2012) with parts of the results, where results were restricted to one particular GA, the greedy $(2+1)$ GA. This extended version presents a general analytical framework that applies to all $(\mu+\lambda)$ GAs, subject to mild conditions, and includes the greedy $(2+1)$ GA as a special case. To this end, we provide tools for analyzing parent and offspring populations in $(\mu+\lambda)$ GAs, which we believe are of independent interest.
Moreover, results for k-point crossover have been improved. The leading constant in the upper bound for k-point crossover in Sudholt (2012) was by an additive term of larger than that for uniform crossover, for mutation rates of . This left open the question whether k-point crossover is as effective as uniform crossover for assembling building blocks in OneMax. Here we provide a new and refined analysis, which gives an affirmative answer, under mild conditions on the crossover probability.
1.1 Related Work
The literature on recombination is too vast to be reviewed comprehensively. Sastry et al. (2005) reviewed early literature and gave recommendations on the design of competent genetic algorithms based on building blocks.
In more recent work, Prügel-Bennett (2010) presented five mechanisms that advantage populations with crossover, based on empirical evidence and nonrigorous theory:
Putting together building blocks from different solutions
Focusing search by crossover on variables where parents differ
The ability of a population to act as a low-pass filter of the landscape
Hedging against bad luck in the initialization and other decisions made
The opportunity of learning useful parameter values to balance exploration against exploitation
This work explicitly addresses the first mechanism, for which Prügel-Bennett (2010, sec. IIIA) notes “it is nontrivial to construct a toy problem which demonstrated how the building block hypothesis would work.” It is shown here that the best known toy problem, OneMax, serves this purpose. We also implicitly address the second benefit, focusing search, as our analysis reveals that crossover very quickly exploits diversity in the population to create improvements on OneMax.
In terms of rigorous runtime analysis, Kötzing et al. (2011) considered the search behavior of an idealized GA on OneMax to highlight the potential benefits of crossover under ideal circumstances. If a GA is able to recombine two individuals with equal fitness that result from independent evolutionary lineages, the fitness gain can be of order . The idealized GA would therefore be able to optimize OneMax in expected time (Kötzing et al., 2011). However, this idealization cannot reasonably be achieved in realistic EAs with common search operators; hence the result should be regarded as an academic study on the potential benefit of crossover.
A related strand of research deals with the analysis of the Simple GA on OneMax. The Simple GA is one of the best known and best researched GAs in the field. It uses a generational model where parents are selected using fitness-proportional selection and the generated offspring form the next population. Neumann, Oliveto, and Witt (2009) showed that the Simple GA without crossover with high probability cannot optimize OneMax in less than exponential time. The reason is that the population typically contains individuals of similar fitness, and then fitness-proportional selection is similar to uniform selection. Oliveto and Witt (2014) extended this result to uniform crossover: the Simple GA with uniform crossover and population size , , still needs exponential time on OneMax. It even needs exponential time to reach a solution of fitness larger than for an arbitrary constant . Oliveto and Witt (2013) relaxed their condition on the population size to . Their work does not exclude that crossover is advantageous, particularly since under the right circumstances crossover may lead to a large increase in fitness (Kötzing et al., 2011). But if there is an advantage, it is not noticeable, as the Simple GA with crossover still fails badly on OneMax (for the stated moderate population sizes).
One year after Sudholt (2012) was published, Doerr, Doerr, and Ebel (2013a) presented a groundbreaking result: they designed an EA that was proven to optimize OneMax (and any simple transformation thereof) in time . This is a spectacular result, as all black-box search algorithms using only unbiased unary operators—operators modifying one individual only, and not exhibiting any inherent search bias—need time , as shown by Lehre and Witt (2012). So their EA shows that crossover can lower the expected running time by more than a constant factor. They call their algorithm a (1+(, )) EA: starting with one parent, it first creates offspring by mutation, with a random and potentially high mutation rate. Then it selects the best mutant and crosses it times with the original parent, using parameterized uniform crossover (the probability of taking a bit from the first parent is not always , but a parameter of the algorithm). This leads to a number of expected function evaluations. This bound was recently tightened to (Doerr and Doerr, 2015b) and can be further decreased to by self-adjusting (Doerr and Doerr, 2015a).
The $(1+(\lambda,\lambda))$ EA from Doerr et al. (2013a) is very cleverly designed to work efficiently on OneMax and similar functions. It uses a nonstandard EA design because of its two phases of environmental selection. Other differences are that mutation is performed before crossover, and mutation is not fully independent for all offspring: the number of flipping bits is a random variable determined as for standard bit mutations, but the same number of flipping bits is then used in all offspring. The focus of this work is different, as our goal is to understand how standard EAs operate and how crossover can be used to speed up building block assembly in commonly used $(\mu+\lambda)$ EAs.
We measure the performance of the algorithm with respect to the number of function evaluations performed until an optimum is found, referred to as optimization time. For steady-state algorithms this equals the number of generations (apart from the initialization), and for EAs with offspring populations such as $(\mu+\lambda)$ EAs or $(\mu+\lambda)$ GAs the optimization time is by a factor of $\lambda$ larger than the number of generations. Note that the number of generations needed to optimize a fitness function can often be easily decreased by using offspring populations or parallel evolutionary algorithms (Lässig and Sudholt, 2014). But this significantly increases the computational effort within one generation, so the number of function evaluations is a fairer and widely used measure.
Looking at function evaluations is often motivated by the fact that this operation dominates the execution time of the algorithm. Then the number of function evaluations is a reliable measure for wall clock time. However, the wall clock time might increase when introducing crossover as an additional search operator. Also, when increasing the mutation rate, more pseudorandom numbers might be required. Jansen and Zarges (2011) point out a case where this effect leads to a discrepancy between the number of function evaluations and wall clock time. This concern must be taken seriously when aiming at reducing wall clock time. However, each implementation must be checked individually in this respect (Jansen and Zarges, 2011). Therefore, we keep this concern in mind but still use the number of function evaluations in the following.
3 Uniform Crossover Makes $(\mu+\lambda)$ EAs Twice as Fast
We show that, under mild conditions, every $(\mu+\lambda)$ GA is at least twice as fast as its counterpart without crossover. For the latter, that is, evolutionary algorithms using only standard bit mutation, the author proved the following lower bound on the running time of a very broad class of mutation-based EAs (Sudholt, 2013). It covers all possible selection mechanisms, parent or offspring populations, and even parallel evolutionary algorithms. We slightly rephrase this result.
We show that for a range of $(\mu+\lambda)$ EAs, as defined in the following, introducing uniform crossover can cut the dominant term of the running time in half, for the standard mutation rate $p = 1/n$.
The class of $(\mu+\lambda)$ EAs covered in this work is defined in Algorithm 1. All $(\mu+\lambda)$ EAs therein create $\lambda$ offspring through crossover and mutation, or just mutation, and then pick the best $\mu$ out of the $\mu$ previous search points and the $\lambda$ new offspring.
In the case of ties, we pick solutions that have the fewest duplicates among the considered search points. This strategy was used by Jansen and Wegener (2005) in their groundbreaking work on real royal roads; it ensures a sufficient degree of diversity whenever the population contains different search points of the same fitness.
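Since Algorithm 1 is not reproduced in this text, the following is only a rough sketch of a GA in this class, not the author's exact definition. It assumes uniform parent selection, uniform crossover, OneMax as the default fitness function (so the optimum has fitness `n`), and it implements the tie-breaking rule by repeatedly discarding, among the worst individuals, one with the most duplicates. All function and parameter names are illustrative.

```python
import random
from collections import Counter


def one_max(x):
    return sum(x)


def uniform_crossover(a, b, rng):
    # Each bit is taken from either parent with probability 1/2.
    return tuple(rng.choice(pair) for pair in zip(a, b))


def mutate(x, p, rng):
    # Standard bit mutation: flip each bit independently with probability p.
    return tuple(bit ^ (rng.random() < p) for bit in x)


def select_survivors(pool, mu, fitness):
    """Keep mu individuals; ties among the worst are broken by removing
    an individual with the most duplicates (assumed reading of the rule)."""
    pool = list(pool)
    while len(pool) > mu:
        worst = min(fitness(x) for x in pool)
        counts = Counter(pool)
        victim = max((x for x in pool if fitness(x) == worst),
                     key=lambda x: counts[x])
        pool.remove(victim)
    return pool


def mu_plus_lambda_ga(n, mu, lam, p, pc, rng, fitness=one_max, max_evals=10**6):
    """Run until fitness n is reached (the OneMax optimum) or the budget ends.
    Returns the number of function evaluations and the final population."""
    pop = [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(mu)]
    evals = mu
    while max(map(fitness, pop)) < n and evals < max_evals:
        offspring = []
        for _ in range(lam):
            if rng.random() < pc:
                y = uniform_crossover(rng.choice(pop), rng.choice(pop), rng)
            else:
                y = rng.choice(pop)
            offspring.append(mutate(y, p, rng))
        evals += lam
        pop = select_survivors(pop + offspring, mu, fitness)
    return evals, pop


evals, pop = mu_plus_lambda_ga(n=50, mu=2, lam=1, p=1 / 50, pc=0.5,
                               rng=random.Random(42))
print(evals, max(map(one_max, pop)))
```

Removing individuals one at a time keeps the duplicate counts up to date, so two different search points of equal fitness are not both discarded while copies of a third remain.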
Before stating the main result of this section, we provide two lemmas showing how to analyze population dynamics. Both lemmas are of independent interest and may prove useful in other studies of population-based EAs.
The following lemma estimates the expected time until individuals with fitness at least i take over the whole population. It generalizes Lemma 3 in Sudholt (2009), which in turn goes back to Witt’s (2006) analysis of the $(\mu+1)$ EA. Note that the lemma applies to arbitrary fitness functions, arbitrary values for $\mu$ and $\lambda$, and arbitrary crossover operators; it merely relies on fundamental and universal properties of cut selection and standard bit mutations.
Call an individual fit if it has fitness at least i. Now estimate the expected number of generations until the population is taken over by fit individuals, called the expected takeover time. As fit individuals are always preferred to nonfit individuals in the environmental selection, the expected takeover time equals the expected number of generations until $\mu$ fit individuals have been created, starting with one fit individual.
Now divide the run of the () GA into phases in order to get a lower bound on the number of fit individuals at certain time steps. The jth phase, , starts with the first offspring creation in the first generation, where the number of fit individuals is at least . It ends in the first generation where this number is increased to . Let Tj describe the random number of generations spent in the jth phase. Starting with a new generation with fit individuals in the parent population, consider a phase of offspring creations, disregarding generation bounds.
The following simple but handy lemma relates success probabilities for created offspring to the expected number of function evaluations needed to complete a generation where such an event has first happened.
The expected number of trials for an event with probability q to occur is $1/q$. To complete the generation, at most $\lambda - 1$ further function evaluations are required.
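The counting behind this argument can be written out in one line. This is a sketch assuming that each offspring creation succeeds independently with probability exactly $q$; the symbol $T$ for the number of function evaluations until the end of the first successful generation is introduced here for illustration only.

```latex
\mathrm{E}[\text{trials until first success}]
  = \sum_{t \ge 1} t \, q \, (1-q)^{t-1} = \frac{1}{q},
\qquad
\mathrm{E}[T] \le \frac{1}{q} + (\lambda - 1).
```

The additive term accounts for the remaining offspring of the generation in which the first success occurs: a generation consists of $\lambda$ evaluations, so at most $\lambda - 1$ of them follow the successful one.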
Now we are able to prove the main result of this section.
The main difference between the upper bound for $(\mu+\lambda)$ GAs and the lower bound for all mutation-based EAs is an additional factor of $(1+np)$ in the denominator of the upper bound. This is a factor of 2 for $p = 1/n$ and an even larger gain for larger mutation rates.
For the default value of $p = 1/n$, this shows that introducing crossover makes EAs at least twice as fast as the fastest EA using only standard bit mutation. It also implies that introducing crossover makes EAs at least twice as fast as their counterparts without crossover (i.e., where $p_c = 0$).
In order to prove the general bound (2), we consider canonical fitness levels, that is, the ith fitness level contains all search points with fitness i. We estimate the time spent on each level i, that is, when the best fitness in the current population is i. For each fitness level we consider three cases. The first case applies when the population contains individuals on fitness levels less than i. The second case is when the population only contains copies of a single individual on level i. The third case occurs when the population contains more than one individual on level i; then the population contains different building blocks that can be recombined effectively by crossover.
All these cases capture the typical behavior of a $(\mu+\lambda)$ GA, although some of these cases, and even whole fitness levels, may be skipped. We obtain an upper bound on its expected optimization time by summing up the expected times the $(\mu+\lambda)$ GA may spend in all cases and on all fitness levels.
Case i.1. The population contains an individual on level i and at least one individual on a lower fitness level.
A sufficient condition for leaving this case is that all individuals in the population obtain fitness at least i. Since the $(\mu+\lambda)$ GA never accepts worsenings, the case is left for good.
Case i.2. The population contains $\mu$ copies of the same individual x on level i.
In this case, each offspring created by the $(\mu+\lambda)$ GA will be a standard mutation of x. This is obvious for offspring where the $(\mu+\lambda)$ GA decides not to use crossover. If crossover is used, the $(\mu+\lambda)$ GA will pick two copies of x as parents, create another copy of x by crossover, and hence perform a mutation on x.
The $(\mu+\lambda)$ GA leaves this case for good if either a better search point is created or if it creates another search point with i ones. In the latter case, we will create a population with two different individuals on level i. Note that due to the choice of the tie-breaking rule in the environmental selection, the $(\mu+\lambda)$ GA will always maintain at least two individuals on level i, unless an improvement with larger fitness is found.
Case i.3. The population only contains individuals on level i, not all of which are identical.
In this case we can rely on crossover recombining two different individuals on level i. As they both have different building blocks, namely, different bits are set to 1, there is a good chance that crossover will generate an offspring with a higher number of 1-bits.
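To make the "good chance" tangible, the following sketch computes the exact probability that uniform crossover of two given parents yields an offspring with more 1-bits than the first parent, by enumerating all choices on the differing positions. It looks at crossover alone (the subsequent mutation step is ignored), and the function name is an illustrative choice.

```python
from fractions import Fraction
from itertools import product


def improvement_probability(x, y):
    """Exact probability that a uniform crossover of x and y has more
    1-bits than x. Positions where the parents agree are copied as-is,
    so only the differing positions need to be enumerated."""
    assert len(x) == len(y)
    base = sum(x)
    differing = [i for i in range(len(x)) if x[i] != y[i]]
    better = 0
    for mask in product((0, 1), repeat=len(differing)):
        child = list(x)
        for take_from_y, i in zip(mask, differing):
            if take_from_y:
                child[i] = y[i]
        better += sum(child) > base
    return Fraction(better, 2 ** len(differing))


# Two parents on the same fitness level with Hamming distance 2.
x = (1, 1, 0, 0, 1, 0)
y = (1, 0, 1, 0, 1, 0)
print(improvement_probability(x, y))  # 1/4
```

For Hamming distance 2 the offspring improves exactly when it takes the 1-bit from each parent on the two differing positions, which happens with probability 1/4 regardless of the string length.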
It is remarkable that the waiting time for successful crossovers in cases is only of order . For small values of and , for instance, , the time spent in all cases is , which is negligible compared to the overall time bound of order . This shows how effective crossover is in recombining building blocks.
Also note that the proof of Theorem 4 is relatively simple, as it uses only elementary arguments and, along with Lemmas 2 and 3, it is fully self-contained. The analysis therefore lends itself for teaching purposes on the behavior of evolutionary algorithms and the benefits of crossover.
The analysis has revealed that fitness-neutral mutations, that is, mutations creating a different search point of the same fitness, can help to escape from the case of a population with identical individuals. Even though these mutations do not immediately yield an improvement in terms of fitness, they increase the diversity in the population. Crossover is very efficient in exploiting this gained diversity by combining two different search points at a later stage. From Prügel-Bennett’s (2010) perspective, this corresponds to crossover focusing search on bits that differ between parents.
This means that crossover can capitalize on mutations that have both beneficial and disruptive effects on building blocks: crossover is able to repair the disruptive effects of mutation in later generations.
An interesting consequence is that this affects the optimal mutation rate on OneMax. For EAs using only standard bit mutations, Witt (2013) proved that is the optimal mutation rate for the () EA on all linear functions. Recall that the () EA is the optimal mutation-based EA (in the sense of Theorem 1) on OneMax (Sudholt, 2013).
For mutation-based EAs on OneMax, neutral mutations are neither helpful nor detrimental. With crossover acting as repair mechanism, neutral mutations now become helpful. Increasing the mutation rate increases the likelihood of neutral mutations. In fact, we can easily derive better upper bounds from Theorem 4 for slightly larger mutation rates, thanks to the additional term in the denominator of the upper bound.
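A short calculation illustrates how a larger mutation rate can pay off. It rests on an assumption that is not restated in this text: for a mutation rate $p = c/n$ with constant $c$, the leading term of the mutation-only bound is taken to be $\frac{e^{c}}{c}\, n \ln n$, and the crossover bound is taken to divide this by an additional factor $1 + c$. The function $f$ is introduced here for illustration.

```latex
f(c) = c\,(1+c)\,e^{-c}, \qquad
f'(c) = \bigl(1 + c - c^{2}\bigr)\,e^{-c} = 0
\;\Longleftrightarrow\;
c = \frac{1+\sqrt{5}}{2} \approx 1.618,

\frac{e^{c}}{c\,(1+c)} \approx
\begin{cases}
  1.36 & \text{for } c = 1 \quad (= e/2),\\
  1.19 & \text{for } c = (1+\sqrt{5})/2 .
\end{cases}
```

Under this assumption, moving from $c = 1$ to $c \approx 1.618$ lowers the leading constant from about $1.36$ to about $1.19$, whereas without the factor $1 + c$ the expression $e^{c}/c$ is minimized at $c = 1$.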
4 The Optimal Mutation Rate
Corollary 5 gives the mutation rate that yields the best upper bound on the running time that can be obtained with the proof of Theorem 4. However, it does not establish that this mutation rate is indeed optimal for any GA. After all, another mutation rate might lead to a smaller expected optimization time.
In the following, we show for a simple $(2+1)$ GA (Algorithm 2) that the upper bound from Theorem 4 is indeed tight up to small-order terms, which establishes $(1+\sqrt{5})/2 \cdot 1/n \approx 1.618/n$ as the optimal mutation rate for that $(2+1)$ GA. Proving lower bounds on expected optimization times is often a notoriously hard task, hence we restrict ourselves to a simple bare-bones GA that captures the characteristics of GAs covered by Theorem 4 and is easy to analyze. The latter is achieved by fixing as many parameters as possible.
As the upper bound from Theorem 4 grows with $\mu$ and $\lambda$, we pick the smallest possible values: $\mu = 2$ and $\lambda = 1$. The parent selection is made as simple as possible: we select parents uniformly at random from the current best individuals in the population. In other words, if we define the parent population as the set of individuals that have a positive probability to be chosen as parents, the parent population only contains individuals of the current best fitness. We call this parent selection “greedy” because it is a greedy strategy to choose the current best search points as parents.
In the context of the proof of Theorem 4, greedy parent selection implies that cases i.1 are never reached, as the parent population never spans more than one fitness level. So the time spent in these cases is 0. This also allows us to eliminate one further parameter by setting $p_c = 1$, as lower values for $p_c$ were only beneficial in cases i.1. Setting $p_c = 1$ minimizes our estimate for the time spent in cases i.3. So Theorem 4 extends toward this GA (see also Remark 20 in the appendix).
We call the resulting GA a “greedy $(2+1)$ GA” because its main characteristic is the greedy parent selection. The greedy $(2+1)$ GA is defined in Algorithm 2.
The following result applies to the greedy $(2+1)$ GA using any kind of mask-based crossover. A mask-based crossover is a recombination operator where each bit value is taken from either parent; that is, it is not possible to introduce a bit value that is not represented in any parent. All common crossovers are mask-based crossovers: uniform crossover, including parameterized uniform crossover, as well as k-point crossovers for any k. The following result even includes biased operators like a bitwise OR, which induces a tendency to increase the number of 1-bits.
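The notion of a mask-based crossover can be sketched as follows: every operator only decides, per position, which parent supplies the bit. The helper names, the `bias` parameter, and the convention that a cut position switches the source parent before that index are illustrative assumptions, not definitions from the paper.

```python
import random


def apply_mask(x, y, mask):
    """Take bit i from x where mask[i] == 0 and from y where mask[i] == 1."""
    return tuple(b if m else a for a, b, m in zip(x, y, mask))


def uniform_mask(n, rng, bias=0.5):
    """Parameterized uniform crossover: each position independently takes
    its bit from y with probability bias (bias=0.5 is uniform crossover)."""
    return tuple(int(rng.random() < bias) for _ in range(n))


def k_point_mask(n, k, rng):
    """k-point crossover: choose k distinct cut positions among 1..n-1 and
    switch the source parent at each cut; the prefix comes from x."""
    cuts = set(rng.sample(range(1, n), k))
    mask, source = [], 0
    for i in range(n):
        if i in cuts:
            source ^= 1
        mask.append(source)
    return tuple(mask)


rng = random.Random(0)
x = (0, 0, 0, 0, 0, 0, 0, 0)
y = (1, 1, 1, 1, 1, 1, 1, 1)
print(apply_mask(x, y, uniform_mask(len(x), rng)))
print(apply_mask(x, y, k_point_mask(len(x), 2, rng)))
```

Because the offspring is always assembled from parental bits, a position on which both parents agree can never change under any of these operators; only the subsequent mutation can alter it.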
For the greedy $(2+1)$ GA with uniform crossover on OneMax, mutation rate $(1+\sqrt{5})/2 \cdot 1/n \approx 1.618/n$ minimizes the expected number of function evaluations, up to small-order terms.
For the proof of Theorem 6 we use the following lower-bound technique based on fitness levels by the author.
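Consider a partition of the search space into nonempty sets $A_1, \dots, A_m$. A search algorithm is in $A_i$ or on level $i$ if the best individual created so far is in $A_i$. If there are $\chi$, $u_i$, $\gamma_{i,j}$ for $i < j$ where
the probability of traversing from level $i$ to level $j$ in one step is at most $u_i \cdot \gamma_{i,j}$ for all $i < j$,
$\sum_{j=i+1}^{m} \gamma_{i,j} = 1$ for all $i$, and
$\gamma_{i,j} \ge \chi \sum_{k=j}^{m} \gamma_{i,k}$ for all $i < j$ and some $\chi \in [0, 1]$,
then the expected hitting time of $A_m$ is at least $\sum_{i=1}^{m-1} \Pr(\text{the algorithm starts in } A_i) \cdot \chi \sum_{j=i}^{m-1} \frac{1}{u_j}$ (Sudholt, 2013).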
We prove a lower bound for the following sped-up GA instead of the original greedy $(2+1)$ GA. Whenever it creates a new offspring with the same fitness, but a different bit string as the current best individual, we assume the following. The algorithm automatically performs a crossover between the two. Also, we assume that this crossover leads to the best possible offspring in the sense that all bits where both parents differ are set to 1 (i.e., the algorithm performs a bitwise OR). That is, if both search points have i 1-bits and Hamming distance $2k$, then the resulting offspring has i + k 1-bits.
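The effect of the idealized OR step is easy to check on a small instance. The following sketch uses hypothetical helper names; it relies on the fact that two strings with the same number of 1-bits always have an even Hamming distance, written here as 2k.

```python
def or_crossover(x, y):
    """Best possible mask-based offspring: a 1 wherever either parent has a 1."""
    return tuple(a | b for a, b in zip(x, y))


def hamming_distance(x, y):
    return sum(a != b for a, b in zip(x, y))


# Both parents have i = 4 ones and differ in 2k = 4 positions.
x = (1, 1, 1, 0, 0, 0, 1, 0)
y = (1, 0, 0, 1, 1, 0, 1, 0)
child = or_crossover(x, y)
print(hamming_distance(x, y))  # 4
print(sum(child))              # 6 = i + k
```

Half of the differing positions carry a 1 only in the first parent and the other half only in the second, so the OR gains exactly k ones over either parent.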
Because of these assumptions, at the end of each generation there is always a single best individual. For this reason we can model the algorithm by a Markov chain representing the current best fitness.
The analysis follows a lower bound for EAs on OneMax (Sudholt, 2013, Theorem 9). As in Sudholt (2013) we consider the following fitness-level partition that focuses only on the very last fitness values. Let . Let for and contain all remaining search points. We know from Sudholt (2013) that the GA is initialized in with probability at least if n is large enough.
Following the proof of Theorem 9 in Sudholt (2013), it is easy to show that for we get for all with [the calculations in Sudholt (2013, pp. 427–428) carry over by replacing with ]. This establishes the third and last condition.
We also ran experiments to see whether the outcome matches our inspection of the dominating terms in the running time bounds for realistic problem dimensions. We chose bits and recorded the average optimization time over 1,000 runs. The mutation rate p was set to with . The result is shown in Figure 1.
One can see that for every mutation rate the greedy () GA has a lower average optimization time. As predicted, the performance difference becomes larger as the mutation rate increases. The optimal mutation rates for both algorithms match minimal average optimization times. Note also that the variance/standard deviation was much lower for the GA for higher mutation rates. Preliminary runs for and bits gave very similar results. More experiments and statistical tests are given in Section 6.1.
5 k-Point Crossover
For uniform crossover we have seen that populations containing different search points of equal fitness are beneficial, as uniform crossover can easily combine the good “building blocks.” This holds regardless of the Hamming distance between these different individuals and the position of bits where individuals differ.
The $(\mu+\lambda)$ GA with k-point crossover is harder to analyze, as there the probability of crossover creating an improvement depends on the Hamming distance of parents and the position of differing bits.
Consider parents that differ in two bits, where these bit positions are quite close. Then 1-point crossover has a high probability of taking both bits from the same parent. In order to recombine those building blocks, the cutting point has to be chosen between the two bit positions. A similar effect occurs for 2-point crossover if the two bit positions are on opposite ends of the bit string.
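The dependence on the positions of the differing bits can be checked by enumerating all cutting points of 1-point crossover. The sketch assumes that the cutting point is chosen uniformly among the n-1 positions between adjacent bits and that the prefix is always taken from the first parent; the function names and the use of unit vectors as parents are illustrative simplifications.

```python
from fractions import Fraction


def one_point_children(x, y):
    """All offspring of 1-point crossover: prefix from x, suffix from y."""
    return [x[:a] + y[a:] for a in range(1, len(x))]


def improvement_probability(x, y):
    """Fraction of cutting points whose offspring has more 1-bits than x."""
    children = one_point_children(x, y)
    better = sum(sum(child) > sum(x) for child in children)
    return Fraction(better, len(children))


def unit(n, i):
    """Bit string of length n with a single 1 at index i."""
    return tuple(int(j == i) for j in range(n))


n = 10
print(improvement_probability(unit(n, 4), unit(n, 5)))  # 1/9: adjacent bits
print(improvement_probability(unit(n, 0), unit(n, 9)))  # 1: bits at both ends
```

Only a cut between the two differing positions keeps the 1-bit of the first parent and picks up the 1-bit of the second, so the success probability grows with the distance between the two positions.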
The following lemma gives a lower bound on the probability that k-point crossover combines the right building blocks on OneMax if two parents are equally fit and differ in two bits. The lemma and its proof may be of independent interest.
We identify cutting points with bits such that cutting point a results in two strings and . We say that a cutting point a separates i and i + d if . Note that the prefix is always taken from x. The claim now follows from showing that the number of separating cutting points is odd with the claimed probability.
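The parity argument can be checked on small instances with the following sketch. It rests on assumptions that are not taken from the lemma itself: the k cutting points are drawn uniformly without replacement from 1, ..., n-1, positions are 1-indexed, and a cutting point a is counted as separating positions i < j when i <= a < j. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def k_point_crossover(x, y, cuts):
    """Apply k-point crossover for sorted cutting points in 1..n-1.

    Cutting point a splits the string after its a-th bit (1-indexed);
    the prefix is taken from x and the source alternates at every cut.
    """
    child = []
    take_from_x = True
    prev = 0
    for a in list(cuts) + [len(x)]:
        source = x if take_from_x else y
        child.extend(source[prev:a])
        prev = a
        take_from_x = not take_from_x
    return child

def separation_probability(n, k, i, j):
    """Exact probability that positions i < j come from different parents.

    This happens exactly when an odd number of cutting points lies
    between the two positions.
    """
    total = 0
    odd = 0
    for cuts in combinations(range(1, n), k):
        total += 1
        separating = sum(1 for a in cuts if i <= a < j)
        if separating % 2 == 1:
            odd += 1
    return Fraction(odd, total)
```

Under these assumptions, adjacent bits are rarely separated by 1-point crossover, and two bits at opposite ends of the string are never taken from different parents by 2-point crossover, which mirrors the two effects described above.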
In the setting of Lemma 9, the probability of k-point crossover creating an improvement depends on the distance between the two differing bits. Fortunately, for search points that result from a mutation of one another, this distance has a favorable distribution. This is made precise in the following lemma.
We first show the following. For any fixed index i and any integer , there are exactly two positions j such that . If and are fixed, the only values for j that result in either or are , and . Note that at most two of these values are in . Hence, there are at most two feasible values for j for every . Similarly, for there is just one position such that .
Taken together, Lemma 9 and Lemma 10 indicate that k-point crossover has a good chance of finding improvements through recombining the right ``building blocks.'' However, this is based on the population containing potential parents of equal fitness that only differ in two bits.
The following analysis shows that the population is likely to contain such a favorable pair of parents. However, such a pair might get lost again if other individuals of the same fitness are being created, after all duplicates have been removed from the population. For parents that differ in more than 2 bits, Lemma 9 does not apply; hence we do not have an estimate of how likely such a crossover will find an improvement.
In order to avoid this problem, we consider a more detailed tie-breaking rule. As before, individuals with fewer duplicates are preferred. In case there are still ties after considering the number of duplicates, the () GA will retain older individuals. This refined tie-breaking rule is shown in Algorithm 3. As shown in the remainder, it implies that once a favorable pair of parents with Hamming distance 2 has been created, this pair will never get lost.
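The tie-breaking order described here can be sketched as a single removal step. This is an illustration of the rule as stated in the text, not a transcription of Algorithm 3: the representation of individuals as (bit tuple, birth generation) pairs, the use of OneMax as fitness, and the function name are assumptions.

```python
from collections import Counter

def remove_one_dup_old(pool):
    """Remove one individual from parents plus offspring.

    pool is a list of (bits, birth_generation) pairs. Preference order:
    worst fitness first, then most duplicates, then youngest individual,
    so that older individuals are retained.
    """
    fitness = [sum(bits) for bits, _ in pool]
    worst = min(fitness)
    counts = Counter(bits for bits, _ in pool)
    candidates = [idx for idx, f in enumerate(fitness) if f == worst]
    most_dups = max(counts[pool[idx][0]] for idx in candidates)
    candidates = [idx for idx in candidates
                  if counts[pool[idx][0]] == most_dups]
    # Largest birth generation means youngest individual.
    victim = max(candidates, key=lambda idx: pool[idx][1])
    return pool[:victim] + pool[victim + 1:]
```

With two old individuals of equal fitness and no duplicates, a newly created individual of the same fitness is the one removed, which is the property that keeps a favorable pair in the population.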
This tie-breaking rule, called ``dup-old,'' differs from the one used for the experiments in Figure 1 and those in Section 6. There, we broke ties uniformly at random in case individuals are tied with respect to both fitness and the number of duplicates. We call the latter rule ``dup-rnd.'' Experiments for the greedy () GA comparing tie-breaking rules dup-old and dup-rnd over 1,000 runs indicate that performance differences are very small (see Figure 2).
Note, however, that on functions with plateaus, like royal road functions, retaining the older individuals prevents the () GA from performing random walks on the plateau, once the population has spread such that there are no duplicates of any individual. In this case performance is expected to deteriorate when breaking ties toward older individuals.
With the refined tie-breaking rule, the performance of () GAs is as follows.
This bound equals the upper bound (3) for () GAs with uniform crossover. It improves upon the previous upper bound for the greedy () GA (Sudholt, 2012, Theorem 8), whose dominant term was by an additive term of larger. The reason is that for the () GA favorable parents could get lost, which is now prevented by the dup-old tie-breaking rule and conditions on pc.
The conditions as well as are useful because they allow us to estimate the probability that a single good individual takes over the whole population with copies of itself.
In the remainder of this section we work toward proving Theorem 11 and assume that for some n0 chosen such that all asymptotic statements that require a large enough value of n hold true. For there is nothing to prove, as the statement holds trivially for bounded n.
We again estimate the time spent on each fitness level i, that is, when the best fitness in the current population is i. To this end, the focus is on the higher fitness levels where the probability of creating an offspring on the same level can be estimated nicely. The time for reaching these higher fitness levels only constitutes a small-order term, compared to the claimed running time bound. The following lemma proves this claim in a more general setting than needed for the proof of Theorem 11. In particular, it holds for arbitrary tie-breaking rules and crossover operators.
For every () GA implementing Algorithm 1 with , and for a constant , using any initialization and any crossover operator, the expected time until a fitness level is reached for the first time is .
A proof is given in the appendix.
In the remainder of the section we focus on higher fitness levels and specify the different cases on each such fitness level. The cases , , and are similar to the ones for uniform crossover, with additional conditions on the similarity of individuals in cases and . We also have an additional error state that accounts for undesirable and unexpected behavior. We pessimistically assume that the error state cannot be left toward other cases on level i.
Case i.1. The population contains an individual on level i and at least one individual on a lower fitness level.
Case i.2. The population contains copies of an individual x on level i.
Case i.3. The population contains two search points with current best fitness i, where y resulted from a mutation of x and the Hamming distance of x and y is 2.
Case i.error. An error state reached from any case when the best fitness is i and none of the prior cases applies.
The difference from the analysis of uniform crossover is that in case we rely on the population collapsing to copies of a single individual. This helps to estimate the probability of creating a favorable parent-offspring pair in case , as the () GA effectively only performs mutations of x while being in case .
Now we estimate the total time spent in all cases . As this time turns out to be comparatively small, the fact that not all these cases are actually reached can be ignored.
The remainder of the proof is devoted to estimating the expected time spent in the error state. To this end we need to consider events that take the () GA ``off course,'' that is, deviating from situations described in cases , , and .
Since case is based on offspring with Hamming distance 2 to their parents, one potential failure is that an offspring with fitness i but Hamming distance greater than 2 to its parent is being created. This probability is estimated in the following lemma.
The proof is found in the appendix.
Another potential failure occurs if the population does not collapse to copies of a single search point, that is, the transition from case to case is not made. We first estimate the probability of mutation unexpectedly creating an individual with fitness i.
Note that for the special case , Doerr, Johannsen, and Winzen (2012b, Lemma 13) give an upper bound of . This is because the highest probability for a jump to fitness level i is attained when the parent is on level . However, for larger mutation probabilities this is no longer true in general; there are cases where the probability of jumping to level i is maximized for parents on lower fitness levels. Hence, a closer inspection of transition probabilities between different fitness levels is required; see the proof in the appendix.
Using Lemma 15, we can now estimate the probability of the () GA not collapsing to copies of a single search point as described in case .
We show that there is a good probability of repeatedly creating clones of individuals with fitness i (or finding an improvement) and avoiding the following bad event. A bad event happens if an individual on fitness level i is created in one offspring creation by means other than through cloning an existing individual on level i.
Now we can estimate the expected time spent in all error states error for .
The () GA only spends time in an error state if it is actually reached. So we first calculate the probability that state error is reached from case , , or .
Finally, case implies that there exists a parent-offspring pair with Hamming distance 2. In a new generation these two individuals, or at least one copy of each, will always survive: individuals with multiple duplicates are removed first, and if among current parents and offspring more than individuals exist with no duplicates, x and y will be preferred over newly created offspring. So the probability of reaching the error state from case is 0.
Now Theorem 11 follows from all previous lemmas.
Some of the technical conditions from Theorem 11 involving , and pc could be relaxed if it is possible to generalize Lemmas 9 and 10 toward more than 2 differing bits between individuals of equal fitness.
6 Extensions to Other Building Block Functions
6.1 Royal Roads and Monotone Polynomials
So far, our theorems and proofs have focused on OneMax only, because very strong results about the performance of EAs on OneMax are available. However, the insights gained stretch far beyond OneMax. Royal road functions generally consist of larger blocks of bits. All bits in a block need to be set to 1 in order to contribute to the fitness; otherwise the contribution is 0. All blocks contribute the same amount to the fitness, and the fitness is just the sum of all contributions.
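As a small illustration of this definition, the sketch below counts completed blocks. It assumes that each completed block contributes exactly 1 and that blocks are consecutive and of equal size; the function name is hypothetical.

```python
def royal_road(x, block_size):
    """Number of blocks of consecutive bits that are entirely set to 1."""
    n = len(x)
    if n % block_size != 0:
        raise ValueError("string length must be a multiple of block_size")
    return sum(
        1
        for start in range(0, n, block_size)
        if all(x[start:start + block_size])
    )

# A mutation that completes the second block but destroys the first
# leaves the fitness unchanged.
parent = [1, 1, 1, 1, 1, 0, 0, 1, 1]
offspring = [0, 1, 1, 1, 1, 1, 0, 1, 1]
same_fitness = royal_road(parent, 3) == royal_road(offspring, 3)
```

With `block_size=1` the function coincides with OneMax, where every bit is its own block.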
The fundamental insight we have gained for neutral mutations also applies to royal road functions. If there is a mutation that completes one block but destroys another block, this is a neutral mutation and the offspring will be stored in the population of a () GA. Then crossover can recombine all finished blocks in the same way as for OneMax. The only difference is that the destroyed block may evolve further. More neutral mutations can occur that only alter bits in the destroyed block. Then the population can be dominated by many similar solutions, and it becomes harder for crossover to find a good pair for recombination. However, as crossover generally has a very high probability of finding improvements, the last effect probably plays only a minor role.
A theoretical analysis of general royal roads up to the same level of detail as for OneMax is harder but not impossible. So far, results on royal roads and monotone polynomials have been mostly asymptotic (Wegener and Witt, 2005; Doerr, Sudholt, and Witt, 2013b). Only recently, Doerr and Künnemann (2013) presented a tighter runtime analysis of offspring populations for royal road functions, which may lend itself to a generalization of our results on OneMax in future work.
For now, we use experiments to see whether the performance is similar to that on OneMax. We use royal roads with n = 1000 bits and block size 5, that is, we have 200 pairwise disjoint blocks of 5 bits each. We also consider random monotone polynomials. Instead of using disjoint blocks, we use 1,000 monomials of degree 5 (conjunctions of 5 bits): each monomial is made up of 5 bit positions chosen uniformly at random, without replacement. This leads to a function similar to royal roads, but ``blocks'' are broken up and can share bits; bit positions are completely random. Figure 3 shows the average optimization times in 1,000 runs on all these functions, for the () EA and the greedy () GA with uniform, 1-point, and 2-point crossover. We chose the last two because k-point crossovers for odd k treat ends of bit strings differently from those for even k: for odd k two bits close to opposite ends of a bit string have a high probability to be taken from different parents, whereas for even k there is a high chance that both will be taken from the same parent (see Lemma 9 for and the special case of ).
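The random monotone polynomials described above could be generated along the following lines. This is a sketch under assumptions: every monomial is given weight 1, the random generator is passed in explicitly, and the function names are hypothetical.

```python
import random

def random_monomials(n, num_monomials, degree, rng):
    """Draw monomials as tuples of distinct bit positions.

    Positions within one monomial are chosen uniformly at random without
    replacement; different monomials may share positions.
    """
    return [
        tuple(sorted(rng.sample(range(n), degree)))
        for _ in range(num_monomials)
    ]

def monotone_polynomial(x, monomials):
    """Number of monomials whose bits are all set to 1 (unit weights assumed)."""
    return sum(1 for m in monomials if all(x[i] for i in m))
```

For instance, `random_monomials(1000, 1000, 5, random.Random(0))` produces an instance with the dimensions stated in the text; the seed is arbitrary.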
For consistency and simplicity, we use and the tie-breaking rule dup-rnd in all settings, that is, ties in fitness are broken toward minimum numbers of duplicates and any remaining ties are broken uniformly at random. For OneMax this does not perfectly match the conditions of Theorem 11, as they require a lower crossover probability, , and tie-breaking rule dup-old. But the experiments show that k-point crossover is still effective when these conditions are not met.
On OneMax both k-point crossovers are better than the () EA but slightly worse than uniform crossover. This is in accordance with the observation from our analyses that improvements with k-point crossover might be harder to find if the differing bits are in close proximity.
For royal roads the curves are very similar. The difference between the () EA and the greedy () GA is just a bit smaller. For random polynomials there are visible differences, albeit smaller. Mann--Whitney U tests confirm that wherever there is a noticeable gap between the curves, there is a statistically significant difference on a significance level of .001. The outcome of Mann--Whitney U tests is summarized in Table 1.
|.||.||() EA .||Uniform .||1-point .|
|1-point||for||for (1 ex.)|
|2-point||for||for (5 ex.)||(6 ex.)|
|polynomial||1-point||for (3 ex.)||(13 ex.)|
|2-point||for (1 ex.)||(6 ex.)||(6 ex.)|
For very small mutation rates the tests were not significant. For mutation rates no less than all differences between the () EA and all greedy () GAs were statistically significant, apart from a few exceptions on random polynomials. For OneMax the difference between uniform crossover and k-point crossover was significant for . For royal roads the majority of such comparisons showed statistical significance, with a number of exceptions. However, for random polynomials the majority of comparisons were not statistically significant. Most comparisons between 1-point and 2-point crossover did not show statistical significance.
These findings give strong evidence that the insights drawn from the analysis on OneMax transfer to broader classes of functions where building blocks need to be assembled.
6.2 Linear Functions
Doerr et al. (2013a) provided empirical evidence that their (1+(, )) EA is faster than the () EA on linear functions with weights drawn uniformly at random from .
It is an open question whether this also holds for more common GAs, that is, those implementing Algorithm 1. Experiments in Doerr et al. (2013a) on the greedy () GA found that on random linear functions “no advantage of the () GA over the () EA is visible.” We provide an explanation for this observation and reveal why the () GA is not well suited for weighted building blocks, whereas other GAs might be.
The reason the () GA behaves like the () EA in the presence of weights is that in case the current population of the () GA contains two members with different fitness, the () GA ignores the inferior one. So it behaves as if the population only contained the fitter individual. Since the () GA will select the fitter individual twice for crossover, followed by mutation, it essentially just mutates the fitter individual. This behavior of the () GA then equals that of a () EA working on the fitter individual.
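This effect can be reproduced in a few lines. The sketch assumes that greedy parent selection draws both parents uniformly at random from the fittest individuals in the population; the weights, the example population, and the function names are made up for illustration.

```python
import random

def linear_fitness(x, weights):
    return sum(w * b for w, b in zip(weights, x))

def select_parents_greedy(population, fitness, rng):
    """Pick both parents uniformly at random among the fittest individuals."""
    best = max(fitness(ind) for ind in population)
    fittest = [ind for ind in population if fitness(ind) == best]
    return rng.choice(fittest), rng.choice(fittest)

def uniform_crossover(x, y, rng):
    """Take each bit from either parent with probability 1/2."""
    return tuple(a if rng.random() < 0.5 else b for a, b in zip(x, y))

rng = random.Random(0)
weights = (1.3, 2.1, 0.7, 1.9)             # hypothetical distinct weights
population = [(1, 0, 1, 0), (0, 1, 0, 1)]  # two individuals, different fitness

def weighted(x):
    return linear_fitness(x, weights)

# With unequal fitness, both parents are the fitter individual, so
# crossover returns a copy of it and only mutation can change the offspring.
p1, p2 = select_parents_greedy(population, weighted, rng)
child = uniform_crossover(p1, p2, rng)
```

If the same population is evaluated with unit weights, both individuals are tied and the two parents can differ, which is the situation crossover exploits on OneMax.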
The () GA is more efficient than the () EA on OneMax (and other building-block functions where all building blocks are equally important) as it can easily generate and store individuals with equal fitness in the population and recombine their different building blocks. However, in the presence of weights, chances of creating individuals of equal fitness might be very slim, and then the () GA behaves like the () EA.
As long as the population of the () GA does not contain two different individuals with the same fitness, the () GA is equivalent to the () EA.