A decent number of lower bounds for non-elitist population-based evolutionary algorithms have been shown by now. Most of them are technically demanding due to the (hard to avoid) use of negative drift theorems—general results which translate an expected movement away from the target into a high hitting time. We propose a simple negative drift theorem for multiplicative drift scenarios and show that it can simplify existing analyses. We discuss in more detail Lehre's (2010) negative drift in populations method, one of the most general tools to prove lower bounds on the runtime of non-elitist mutation-based evolutionary algorithms for discrete search spaces. Together with other arguments, we obtain an alternative and simpler proof of this result, which also strengthens and simplifies the method. In particular, now only three of the five technical conditions of the previous result have to be verified. The lower bounds we obtain are explicit instead of only asymptotic. This allows us to compute concrete lower bounds for concrete algorithms, but also enables us to show that super-polynomial runtimes appear already when the reproduction rate is only slightly below the threshold. For the special case of algorithms using standard bit mutation with a random mutation rate (called uniform mixing in the language of hyper-heuristics), we prove the result stated by Dang and Lehre (2016b) and extend it to mutation rates other than $\Theta(1/n)$, which includes the heavy-tailed mutation operator proposed by Doerr et al. (2017). We finally use our method and a novel domination argument to show an exponential lower bound for the runtime of the mutation-only simple genetic algorithm on OneMax for arbitrary population size.
Lower bounds for the runtimes of evolutionary algorithms are important as they can warn the algorithm user that certain algorithms or certain parameter settings will not lead to good solutions in acceptable time. Unfortunately, the existing methods to obtain such results, for non-elitist algorithms in particular, are very technical and thus difficult to use.
One reason for this high complexity is the use of drift analysis, which seems hard to circumvent. Drift analysis (see Lengler, 2020) is a set of tools that all try to derive useful information on a hitting time (e.g., the first time a solution of a certain quality is found) from information on the expected progress in one iteration. The hope is that the progress in a single iteration can be analyzed with only moderate difficulty and then the drift theorem does the remaining work. While more direct analysis methods exist and have been successfully used for simple algorithms, for population-based algorithms and in particular non-elitist ones, it is hard to imagine that the complicated population dynamics can be captured in proofs not using more advanced tools such as drift analysis.
Drift analysis has been used with great success to prove upper bounds on runtimes of evolutionary algorithms. Tools such as the additive (He and Yao, 2001), multiplicative (Doerr et al., 2012), and variable drift theorems (Mitavskiy et al., 2009; Johannsen, 2010) all allow us to easily obtain an upper bound on a hitting time solely from the expected progress in one iteration. Unfortunately, proving matching lower bounds is much harder since here the drift theorems require additional technical assumptions on the distribution of the progress in one iteration. This is even more true in the case of so-called negative drift, where the drift is away from the target and we aim at proving a high lower bound on the hitting time.
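To make the upper-bound mechanics concrete, here is a small numerical sketch (our own illustration, not from the paper; the process and all parameter values are invented): a toy pure-death process whose state shrinks by an expected factor $1-\delta$ per round, compared against the guarantee $E[T] \le (1 + \ln(X_0/x_{\min}))/\delta$ of the multiplicative drift theorem of Doerr et al. (2012).

```python
import math
import random

def hitting_time(x0, delta, rng):
    """One run of a toy process with multiplicative drift toward 0:
    each of the x current 'items' survives a round independently with
    probability 1 - delta, so E[X_{t+1} | X_t] = (1 - delta) * X_t."""
    x, t = x0, 0
    while x >= 1:  # T = first time the target X_t < 1 is reached
        x = sum(1 for _ in range(x) if rng.random() > delta)
        t += 1
    return t

rng = random.Random(42)
x0, delta = 1000, 0.1
runs = [hitting_time(x0, delta, rng) for _ in range(200)]
empirical = sum(runs) / len(runs)

# Multiplicative drift theorem with x_min = 1: E[T] <= (1 + ln(x0)) / delta.
bound = (1 + math.log(x0)) / delta
print(f"empirical E[T] ~ {empirical:.1f}, drift bound {bound:.1f}")
assert empirical <= bound
```

The point of the example is only that the one-step expectation suffices for the upper bound; no distributional assumptions beyond the drift condition are needed.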
In this work, we propose a very simple negative drift theorem for the case of multiplicative drift (Lemma 3). We briefly show that this result can ease two classic lower bound analyses (also in Section 3).
In more detail, we use the new drift theorem (and some more arguments) to rework Lehre's negative drift in populations method (Lehre, 2010). This highly general analysis method can show exponential lower bounds on the runtime of a large class of evolutionary algorithms solely by comparing the so-called reproduction rate of individuals in the population with a threshold that depends only on the mutation rate.
The downside of Lehre's method is that both the result and its proof are very technical. To apply the general result (and not the specialization to algorithms using standard bit mutation), five technical conditions need to be verified, which requires the user to choose suitable values for six different constants; these have an influence on the lower bound one obtains. This renders the method of Lehre hard to use. Among the 54 citations to Lehre (2010) (according to Google Scholar on June 9, 2020), only the two works (Lehre, 2011) and (Dang and Lehre, 2016b) apply this method. To hopefully ease future analyses of negative drift in populations, we revisit this method and obtain the following improvements.
A simpler result: We manage to show essentially the same lower bound by only verifying three of the five conditions of Lehre's result (Theorems 4 and 5). This also reduces the number of constants one needs to choose from six to four.
A non-asymptotic result: Our result gives explicit lower bounds, that is, bounds free from asymptotic notation or unspecified constants. Consequently, our specialization to algorithms using standard bit mutation (Theorem 6) also gives explicit bounds. This allows us to prove concrete bounds for specific situations [e.g., that the $(\mu,\lambda)$ EA needs more than 13 million fitness evaluations to find the optimum of the OneMax problem; see the example following Theorem 6] and gives more fine-grained theoretical results [by choosing one of Lehre's constants as a suitable function of the problem size, we show that a super-polynomial runtime behavior is observed already when the reproduction rate is only slightly below the threshold; see Corollary 7]. With the absence of asymptotic notation, we can also analyze algorithms using standard bit mutation with a mutation rate chosen randomly from a discrete set of alternatives (Section 6). Such a result was stated by Dang and Lehre (2016b), however only for mutation rates that are $\Theta(1/n)$. Our result does not need this restriction and thus, for example, applies also to the heavy-tailed mutation operator proposed by Doerr et al. (2017).
A simple proof: Besides the important aspect that a proof guarantees the result to be mathematically correct, an understandable proof can also tell us why a result is correct and give further insights into working principles of algorithms. While every reader will have a different view on what the ideal proof looks like, we felt that Lehre's proof, combining several deep and abstract tools such as multi-type branching processes, eigenvalue arguments, and Hajek's drift theorem (Hajek, 1982), does not easily give a broader understanding of the proof mechanics and the working principles of the algorithms analyzed. We hope that our proof, based on a simple potential function argument together with our negative drift theorem, is more accessible.
Finally, we analyze an algorithm using fitness proportionate selection. The negative drift in populations method is not immediately applicable to such algorithms since it is hard to provide a general unconditional upper bound on the reproduction rate: If all but one individual have a very low fitness, then this best individual has a high reproduction rate. We therefore show that at all times all search points are at least as good (in a stochastic domination sense) as random search points. This allows us to argue that the reproduction rates are low and then gives a simple proof of an exponential lower bound for the mutation-only simple genetic algorithm (simple GA) with arbitrary population size optimizing the simple OneMax benchmark, improving over the mildly sub-exponential lower bound in Neumann et al. (2009) and the exponential lower bound for large population sizes only in Lehre (2011).
1.1 Related Works
A number of different drift theorems dealing with negative drift have been proven so far, among others, in Happ et al. (2008), Oliveto and Witt (2011, 2012a), Rowe and Sudholt (2014), Oliveto and Witt (2015), Kötzing (2016), Lengler and Steger (2018), and Witt (2019) (note that in early works, the name “simplified drift theorem” was used for such results). They all require some additional assumptions on the distribution of the one-step progress, which makes them non-trivial to use. We refer to Lengler (2020, Section 2.4.3) for more details. Another approach to negative drift was used in Antipov et al. (2019) and Doerr (2019b, 2020a). There the original process was transformed suitably (via an exponential function), but in a way that the drift of the new process still is negative or at most a small constant. To this transformed process the lower bound version of the additive drift theorem (He and Yao, 2001) was applied, which gave large lower bounds since the target, due to the exponential rescaling, was now far from the starting point of the process.
In terms of lower bounds for non-elitist algorithms, besides Lehre's general result (Lehre, 2010), the following results for particular algorithms exist (throughout, $n$ is the problem size, $\varepsilon$ can be any positive constant, and $e$ is the base of the natural logarithm). Jägersküpper and Storch (2007, Theorem 1) showed that the $(1,\lambda)$ EA with $\lambda \le \frac{1}{14} \ln n$ is inefficient on any pseudo-Boolean function with a unique optimum. The asymptotically tight condition $\lambda \le (1-\varepsilon) \log_{e/(e-1)} n$ to yield a super-polynomial runtime was given by Rowe and Sudholt (2014). Happ et al. (2008) showed that two simple (1+1)-type hillclimbers with fitness proportionate selection cannot optimize efficiently any linear function with positive weights. Neumann et al. (2009) showed that a mutation-only variant of the simple GA with fitness proportionate selection is inefficient on the OneMax function when the population size is at most polynomial, and that it is inefficient on any pseudo-Boolean function with a unique global optimum for small population sizes. The mildly subexponential lower bound for OneMax was improved to an exponential lower bound by Lehre (2011), but only for large population sizes. In a series of remarkable works up to Oliveto and Witt (2015), Oliveto and Witt showed that the true simple GA using crossover cannot optimize OneMax efficiently when the population size is not too large. None of these results gives an explicit lower bound or specifies the base of the exponential function. In Antipov et al. (2019), an explicit lower bound for the runtime of the $(\mu,\lambda)$ EA is proven (but stated only in the proof of Theorem 3.1 in Antipov et al., 2019). Section 3 of Antipov et al. (2019) bears some similarity to ours; in fact, one can argue that our work extends (Antipov et al., 2019, Section 3) from a particular algorithm to the general class of population-based processes regarded by Lehre (2010) (where, naturally, Antipov et al., 2019 did not have the negative multiplicative drift result and therefore did not obtain bounds that hold with high probability).
This work is an extended version of a paper that appeared in the proceedings of Parallel Problem Solving from Nature 2020 (Doerr, 2020c). This version contains all proofs (for reasons of space, in Doerr, 2020c, only Lemma 1 and Theorem 2 were proven), a new section on standard bit mutation with random mutation rates, and several additional details.
2 Notation and Preliminaries
The classic mutation operator standard bit mutation creates an offspring by flipping each bit of the parent independently with some probability $p$, which is called the mutation rate.
We shall twice need the notion of stochastic domination and its relation to standard bit mutation, so we quickly collect these ingredients of our proofs. We refer to Doerr (2019a) for more details on stochastic domination and its use in runtime analysis.
For two real-valued random variables $X$ and $Y$, we say that $Y$ stochastically dominates $X$, written as $X \preceq Y$, if for all $\lambda \in \mathbb{R}$ we have $\Pr[X \ge \lambda] \le \Pr[Y \ge \lambda]$. Stochastic domination is a very flexible way of saying that $Y$ is at least as large as $X$. It implies $E[X] \le E[Y]$, but not only this, we also have $E[f(X)] \le E[f(Y)]$ for any monotonically increasing function $f$.
Let $X \preceq Y$ be two random variables taking values in some set $\Omega \subseteq \mathbb{R}$. Let $f : \Omega \to \mathbb{R}$ be monotonically increasing, that is, we have $f(x) \le f(y)$ for all $x, y \in \Omega$ with $x \le y$. Then $f(X) \preceq f(Y)$.
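As a small illustration of these two facts (the two distributions are made up; only the definitions above are used), the following checks domination via tail probabilities and the resulting ordering of expectations under an increasing map:

```python
from fractions import Fraction

# Two toy distributions on {0, 1, 2, 3}: Y shifts mass upward relative to X.
X = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 4), 3: Fraction(0)}
Y = {0: Fraction(1, 4), 1: Fraction(1, 4), 2: Fraction(1, 4), 3: Fraction(1, 4)}

def tail(dist, lam):
    """Pr[value >= lam] for a finite distribution given as value -> prob."""
    return sum(p for v, p in dist.items() if v >= lam)

# X is dominated by Y: every tail probability of Y is at least that of X.
assert all(tail(X, lam) <= tail(Y, lam) for lam in range(4))

def expect(dist, f=lambda v: v):
    return sum(p * f(v) for v, p in dist.items())

# Domination is preserved under any monotonically increasing f,
# hence the expectations of f are ordered as well.
f = lambda v: v * v  # increasing on {0, 1, 2, 3}
assert expect(X) <= expect(Y) and expect(X, f) <= expect(Y, f)
print("X is dominated by Y, and E[f(X)] <= E[f(Y)] for increasing f")
```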
Significantly improving over previous related arguments in Droste et al. (2000, Section 5) and Doerr et al. (2012, Lemma 13), Witt (2013, Lemma 6.1) showed the following natural domination argument for offspring generated via standard bit mutation with mutation rate at most $\frac{1}{2}$. We note that the result is formulated in Witt (2013) only for a special case, but the proof in Witt (2013) or a symmetry argument immediately shows the following general version.
3 Negative Multiplicative Drift
The following elementary result allows us to prove lower bounds on the time to reach a target in the presence of multiplicative drift away from the target. While looking innocent, it has the potential to replace the more complicated lower bound arguments previously used in analyses of non-elitist algorithms. We discuss this briefly at the end of this section.
For all $x \in \mathbb{R}$, we have $1 + x \le e^x$.
- Let $X_0, X_1, \dots$ be a random process taking values in a finite set $S \subseteq \mathbb{R}_{\ge 0}$, let $D > 0$, and let $\delta \in (0,1)$ be such that $E[X_{t+1} \mid X_0, \dots, X_t] \le \delta X_t$ for all $t \ge 0$. Assume that $E[X_0] \le s_0$ for some $s_0 \ge 0$. Let $T := \min\{t \ge 0 \mid X_t \ge D\}$. Then for all integers $t \ge 0$, $\Pr[T \le t] \le \frac{s_0}{D} \cdot \frac{1 - \delta^{t+1}}{1 - \delta}$ and $\Pr[T < \infty] \le \frac{s_0}{(1-\delta)D}$.
The proof is an easy computation of expectations and an application of Markov's inequality similar to the direct proof of the multiplicative drift theorem in Doerr and Goldberg (2013). We do not see a reason why the result should not also hold for processes taking more than a finite number of values, but since we are only interested in the finite setting, we spare ourselves the more complicated world of continuous probability spaces.
We now argue that our negative multiplicative drift theorem is likely to find applications beyond ours to the negative drift in populations method in the following section. To this aim, we regard two classic lower bound analyses of non-elitist algorithms and point out where our drift theorem would have eased the analysis.
Neumann et al. (2009) show that the variant of the simple genetic algorithm (simple GA) not using crossover needs a mildly sub-exponential time to optimize the simple OneMax benchmark. The key argument in Neumann et al. (2009) is as follows. The potential of the population in iteration $t$ is defined by summing, over all individuals, an exponential transformation of their fitness. For this potential, it is shown (Neumann et al., 2009, Lemma 7) that, as long as no individual is close to the optimum, the potential shrinks in expectation by at least a constant factor in each iteration. By bluntly estimating the potential in the remaining case, this bound could easily be extended to a multiplicative drift statement valid for all populations. This suffices to employ our negative drift theorem and obtain the desired lower bound. Without our drift theorem at hand, in Neumann et al. (2009) the logarithm of this potential was considered; it was argued that it displays an additive drift away from the target and that it satisfies certain concentration statements necessary for the subsequent use of a negative drift theorem for additive drift.
A second example where we feel that our drift theorem can ease the analysis is the work of Oliveto and Witt (2014, 2015) on the simple GA with crossover optimizing OneMax. Due to the use of crossover, this work is much more involved, so we shall not go into detail and simply point the reader to the location where negative drift occurs. In Lemma 19 of Oliveto and Witt (2015), a multiplicative drift statement (away from the target) is proven. To use a negative drift theorem for additive drift (Oliveto and Witt, 2015, Theorem 2), in the proof of Lemma 20 the logarithm of the original process is regarded. So here again, we think that a direct application of our drift theorem would have eased the analysis.
4 Negative Drift in Populations Revisited
In this section, we use our negative multiplicative drift result and some more arguments to rework Lehre's negative drift in populations method (Lehre, 2010) and obtain Theorem 4 further below. This method allows an analysis of a broad class of evolutionary algorithms, namely all that can be described via the following type of population process.
4.1 Population Selection-Mutation Processes
A population selection-mutation (PSM) process (called population selection-variation algorithm in Lehre, 2010) is the following type of random process. Let $\mathcal{X}$ be a finite set. We call $\mathcal{X}$ the search space and its elements solution candidates or individuals. Let $\lambda \in \mathbb{Z}_{\ge 1}$ be called the population size of the process. An ordered multi-set of cardinality $\lambda$, in other words, a $\lambda$-tuple, over the search space is called a population. Let $\mathcal{P} = \mathcal{X}^{\lambda}$ be the set of all populations. For $P \in \mathcal{P}$, we write $P_1, \dots, P_\lambda$ to denote the elements of $P$. We also write $x \in P$ to denote that there is an $i \in [1..\lambda]$ such that $x = P_i$.
A PSM process starts with some, possibly random, population $P^{(0)}$. In each iteration $t = 1, 2, \dots$, a new population $P^{(t)}$ is generated from the previous one as follows. Via a (possibly) randomized selection operator, a $\lambda$-tuple of individuals of $P^{(t-1)}$ is selected and then each of them creates an offspring through the application of a randomized mutation operator.
The selection operator can be arbitrary except that it only selects individuals from $P^{(t-1)}$. In particular, we do not assume that the selected individuals are independent. Formally speaking, the outcome of the selection process is a random $\lambda$-tuple of indices $(i_1, \dots, i_\lambda) \in [1..\lambda]^{\lambda}$ such that $P^{(t-1)}_{i_1}, \dots, P^{(t-1)}_{i_\lambda}$ are the selected parents.
From each selected parent, a single offspring is generated via a randomized mutation operator. Formally speaking, for each $x \in \mathcal{X}$, $\mathrm{mut}(x)$ is a probability distribution on $\mathcal{X}$ and we write $y \sim \mathrm{mut}(x)$ to indicate that $y$ is sampled from this distribution. We assume that each sample, that is, each call of a mutation operator, uses independent randomness. With this notation, we can write the new population as $P^{(t)} = (y_1, \dots, y_\lambda)$ with $y_j \sim \mathrm{mut}(P^{(t-1)}_{i_j})$ for all $j \in [1..\lambda]$. From the definition it is clear that a PSM process is a Markov process with state space $\mathcal{P}$. A pseudocode description of PSM processes is given in Algorithm 1.
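A minimal sketch of one PSM iteration in code (the function names and the concrete selection and mutation operators are our own illustration, not part of the formal definition):

```python
import random

def psm_step(pop, select, mutate, rng):
    """One iteration of a PSM process: select lambda parents from the
    current population (arbitrary, possibly dependent selection), then
    create each offspring by an independent mutation of its parent."""
    parent_indices = select(pop, rng)              # lambda indices into pop
    return [mutate(pop[i], rng) for i in parent_indices]

# Example instantiation: uniform parent selection and one-bit flips.
def uniform_selection(pop, rng):
    return [rng.randrange(len(pop)) for _ in pop]

def one_bit_flip(x, rng):
    j = rng.randrange(len(x))
    return x[:j] + (1 - x[j],) + x[j + 1:]

rng = random.Random(0)
n, lam = 20, 8
pop = [tuple(rng.randrange(2) for _ in range(n)) for _ in range(lam)]
for _ in range(5):
    pop = psm_step(pop, uniform_selection, one_bit_flip, rng)
assert len(pop) == lam and all(len(x) == n for x in pop)
print("population size stays", lam, "over 5 iterations")
```

Note that the framework deliberately leaves `select` unconstrained; only the mutation calls must use independent randomness.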
The following characteristic of the selection operator was found to be crucial for the analysis of PSM processes in Lehre (2010). Let $P \in \mathcal{P}$ and $i \in [1..\lambda]$. Then the random variable $R_i(P)$, called the reproduction number of the $i$-th individual in $P$, denotes the number of times $P_i$ was selected from $P$ as parent. Its expectation $E[R_i(P)]$ is called the reproduction rate.
Example: We now describe how the $(\mu,\lambda)$ EA fits into this framework. That it fits into this framework and that the reproduction rate is at most $\lambda/\mu$ was already stated in Lehre (2010), but how exactly this works out, to the best of our knowledge, was never made precise so far, and is also not totally trivial.
We specify that when talking about the $(\mu,\lambda)$ EA, given in pseudocode in Algorithm 2, we mean the basic $(\mu,\lambda)$ EA which starts with a parent population of $\mu$ search points chosen independently and uniformly at random from $\{0,1\}^n$. In each iteration, $\lambda$ offspring are generated, each by selecting a parent individual uniformly at random (with repetition) and mutating it via standard bit mutation with mutation rate $p$. The next parent population is selected from these offspring by taking $\mu$ best individuals, breaking ties randomly.
This algorithm can be modeled as a PSM process with population size $\lambda$ (not $\mu$). To do so, we need a slightly non-standard initialization of the population. We generate $P^{(0)}$ by first taking $\mu$ random search points and then generating each $P^{(0)}_j$, $j \in [1..\lambda]$, by choosing (with replacement) a random one of the $\mu$ base individuals and mutating it. With this definition, each individual in $P^{(0)}$ is uniformly distributed in $\{0,1\}^n$, but these individuals are not independent.
Given a population $P = P^{(t-1)}$ consisting of $\lambda$ individuals, the selection operator first selects a set of $\mu$ best individuals from $P$, breaking ties randomly. Formally speaking, this is a tuple $(s_1, \dots, s_\mu)$ of indices in $[1..\lambda]$. Then a random vector $(r_1, \dots, r_\lambda) \in [1..\mu]^{\lambda}$ is chosen uniformly at random and the selected parents are taken as $P_{s_{r_1}}, \dots, P_{s_{r_\lambda}}$. The next population is obtained by applying the mutation operator to each of these, that is, $P^{(t)}_j \sim \mathrm{mut}(P_{s_{r_j}})$ for all $j \in [1..\lambda]$, where $\mathrm{mut}$ denotes standard bit mutation with mutation rate $p$.
From this description, it is clear that each individual of each population of the $(\mu,\lambda)$ EA has a reproduction rate of at most $\lambda/\mu$.
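This reproduction rate can be checked empirically. The following sketch (with made-up values of $\mu$ and $\lambda$) simulates the uniform parent choices of the offspring and confirms that each of the $\mu$ selected individuals is chosen about $\lambda/\mu$ times on average:

```python
import random
from collections import Counter

rng = random.Random(1)
mu, lam, trials = 5, 20, 20000

# Each of the lam offspring picks its parent uniformly among the mu
# selected best individuals, so the reproduction number of a selected
# individual is Binomial(lam, 1/mu) with expectation lam / mu.
counts = Counter()
for _ in range(trials):
    for _ in range(lam):
        counts[rng.randrange(mu)] += 1

rates = [counts[i] / trials for i in range(mu)]
print("empirical reproduction rates:", [round(r, 2) for r in rates])
assert all(abs(r - lam / mu) < 0.1 for r in rates)
```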
4.2 Our “Negative Drift in Populations” Result
We prove the following version of the negative drift in populations method.
Consider a PSM process with associated reproduction numbers as defined in Section 4.1. Let $g : \mathcal{X} \to \mathbb{Z}_{\ge 0}$, called the potential function, and let $a, b \in \mathbb{Z}_{\ge 0}$ with $a < b$. Assume that for all $x \in P^{(0)}$ we have $g(x) \ge b$. Let $T := \min\{t \ge 0 \mid \exists x \in P^{(t)} : g(x) \le a\}$ be the first time we have a search point with potential $a$ or less in the population. Assume that the following three conditions are satisfied.
There is an $\alpha_0 \ge 1$ such that for all populations $P$ containing no individual of potential at most $a$ and all $i \in [1..\lambda]$ with $g(P_i) < b$, we have $E[R_i(P)] \le \alpha_0$.
- There is a and a such that for all with we have
- There is a such that for all with , we have
for all , we have .
Before proceeding with the proof, we compare our result with Lehre (2010, Theorem 1). We first note that, apart from a technicality which we discuss toward the end of this comparison, the assumptions of our result are weaker than the ones in Lehre (2010) since we do not need the technical fourth and fifth assumption of Lehre (2010), which in our notation would read as follows.
- There is a such that for all and all with and all with and we have
- There is a such that for all with and and all with and we have
The one point where the result of Lehre (2010) potentially is stronger is that it needs assumptions only on the “average drift” from the random search point at time $t$ conditional on having a fixed potential, whereas we require the same bound on the “point-wise drift,” that is, conditional on the current search point being equal to a particular search point of this potential. Let us make this more precise. Lehre uses a notation for the Markov process on the search space associated with the mutation operator [unfortunately, it is not said in Lehre (2010) what the initial state of this process is, that is, how this process is started]. This defines the potential gain in one step when the current state has a given potential. With this notation, instead of our second and third conditions, Lehre (2010) requires only the following weaker conditions (here again translated into our notation).
For all and all , .
For all , .
So Lehre only requires that the random individual at time $t$, conditional on having a certain potential, gives rise to a certain drift, whereas we require that each particular individual with this potential gives rise to this drift. On the formal level, Lehre's condition is much weaker than ours (assuming that the unclear point of how the process is started can be fixed). That said, to exploit such weaker conditions, one would need to be able to compute such average drifts, and they would need to be smaller than the worst-case point-wise drift. We are not aware of many examples where average drift was successfully used in drift analysis (one is Jägersküpper's remarkable analysis of the linear functions problem; Jägersküpper, 2008), despite the fact that many classic drift theorems only require conditions on the average drift to hold.
We now prove Theorem 4. Before stating the formal proof, we describe on a high level its main ingredients and how it differs from Lehre's proof.
The main challenge when using drift analysis is designing a potential function that suitably measures the progress. For simple hillclimbers and optimization problems, the fitness of the current solution may suffice, but already the analysis of the (1+1) EA on linear functions resisted such easy approaches (He and Yao, 2001; Droste et al., 2002; Doerr et al., 2012; Witt, 2013). For population-based algorithms, the additional challenge is to capture the quality of the whole population in a single number. We note at this point that the notion of “negative drift in populations” was used in Lehre (2010) to informally describe the characteristic of the population processes regarded, but drift analysis as a mathematical tool was employed only on the level of single individuals and the resulting findings were lifted to the whole population via advanced tools like branching processes and eigenvalue arguments.
To prove upper bounds, in Witt (2006), Chen et al. (2009), Lehre (2011), Dang and Lehre (2016a), Corus et al. (2018), Antipov et al. (2018), and Doerr and Kötzing (2019), potential functions were used, implicitly or explicitly, that build on the fitness of the best individual in the population and the number of individuals having this fitness. Since they regard only the current-best individuals, these potential functions might not be suitable for lower bound proofs.
The lower bound proofs in Neumann et al. (2009), Oliveto and Witt (2014, 2015), and Antipov et al. (2019) all define a natural potential for single individuals, namely the Hamming distance to the optimum, and then lift this potential to populations by summing over all individuals an exponential transformation of their base potential (this ingenious definition was, to the best of our knowledge, not known in the theory of evolutionary algorithms before the work of Neumann et al., 2009). This is the type of potential we shall use as well, and given the assumptions of Theorem 4, it is not surprising that a potential of the type $\varphi(P) = \sum_{i=1}^{\lambda} e^{-\kappa g(P_i)}$, for a suitable constant $\kappa > 0$, is a good choice. For this potential, we shall then show with only mild effort that it satisfies the assumptions of our drift theorem, which yields the desired lower bounds on the runtime (using that a single good solution in the population already requires a very high potential due to the exponential scaling). We now give the details of this proof idea.
Let $X_t := \sum_{i=1}^{\lambda} e^{-\kappa g(P^{(t)}_i)}$ for all $t \ge 0$, where $\kappa$ is the constant from the second condition of the theorem. Since $P^{(0)}$ contains no individual with potential below $b$, we have $X_0 \le \lambda e^{-\kappa b}$. Hence, also the assumption of Lemma 3 is fulfilled.
We note that the proof above actually shows the following slightly stronger statement, which can be useful when working with random initial populations (as, for example, in the following section).
Theorem 4 remains valid when the assumption that all initial individuals have potential at least $b$ is replaced by the corresponding assumption on the expected initial potential of the population.
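The exponential scaling underlying this type of population potential can be illustrated numerically (the values of $\kappa$, the population size, and the distances below are invented for illustration): a single individual at the optimum dominates the potential of an otherwise far-away population.

```python
import math

def potential(distances, kappa):
    """Population potential: sum, over the individuals' distances g to the
    optimum, of the exponentially scaled value exp(-kappa * g)."""
    return sum(math.exp(-kappa * g) for g in distances)

kappa, lam = 0.5, 100
all_far = [40] * lam                       # every individual at distance 40
one_at_optimum = [40] * (lam - 1) + [0]    # one individual at the optimum

# The single optimal individual alone contributes exp(0) = 1, which
# dwarfs the total contribution of the 100 far-away individuals.
print(potential(all_far, kappa), potential(one_at_optimum, kappa))
assert potential(one_at_optimum, kappa) > 10**6 * potential(all_far, kappa)
```

This is exactly why a lower bound on the time until the potential becomes large translates into a lower bound on the time until a good solution appears in the population.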
5 Processes Using Standard Bit Mutation
Since many EAs use standard bit mutation, as in Lehre (2010) we now simplify our main result for processes using standard bit mutation and for the potential $g$ being the Hamming distance to a target solution. Hence, in this section, we have $\mathcal{X} = \{0,1\}^n$ and $y \sim \mathrm{mut}(x)$ is obtained from $x$ by flipping each bit of $x$ independently with probability $p$. Since our results are non-asymptotic, we can work with any mutation rate $p$.
Consider a PSM process (see Section 4.1) with search space $\{0,1\}^n$, using standard bit mutation with mutation rate $p$ as mutation operator, and such that $P^{(0)}_i$ is uniformly distributed in $\{0,1\}^n$ for each $i \in [1..\lambda]$ (possibly with dependencies among the individuals). Let $x^* \in \{0,1\}^n$ be the target of the process. For all $x \in \{0,1\}^n$, let $g(x) := H(x, x^*)$ denote the Hamming distance from the target.
Let and such that , that is, such that . Let . Let be integers such that and .
Selection condition: Assume that for all populations with and all with , we have .
The proof of this result is a reduction to Theorem 4. To show that the second and third condition of Theorem 4 are satisfied, one has to estimate the expectation of an exponential transformation of the change of the Hamming distance in one mutation step, which is not difficult since this change can be written as a sum of independent random variables. With a similar computation and some elementary calculus, we show that the weaker starting condition of Theorem 5 is satisfied.
We apply Theorem 4. To show the second and third condition of the theorem, let $x \in \{0,1\}^n$ and let $y \sim \mathrm{mut}(x)$ be the random offspring generated from $x$. We use the shorthand $d := g(x)$. We note that $g(y) - g(x)$ can be expressed as a sum $\sum_{j=1}^{n} Z_j$ of independent random variables such that for $j \in [1..d]$, we have $\Pr[Z_j = -1] = p$ and $\Pr[Z_j = 0] = 1-p$, and for $j \in [d+1..n]$, we have $\Pr[Z_j = 1] = p$ and $\Pr[Z_j = 0] = 1-p$.
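With this decomposition, independence gives the closed form $E[e^{\kappa(g(y)-g(x))}] = (1-p+pe^{-\kappa})^{d}\,(1-p+pe^{\kappa})^{n-d}$. The following Monte Carlo sketch (all parameter values invented for illustration) confirms this numerically:

```python
import math
import random

rng = random.Random(3)
n, d, p, kappa = 50, 10, 0.04, 0.5

def sample_delta():
    """Change g(y) - g(x) of the Hamming distance to the target under
    standard bit mutation: each of the d wrong bits is corrected (-1) with
    probability p, each of the n-d correct bits is broken (+1) with prob. p."""
    fixed = sum(1 for _ in range(d) if rng.random() < p)
    broken = sum(1 for _ in range(n - d) if rng.random() < p)
    return broken - fixed

# Closed form of E[exp(kappa * (g(y) - g(x)))] from independence:
exact = (1 - p + p * math.exp(-kappa)) ** d * (1 - p + p * math.exp(kappa)) ** (n - d)

trials = 200000
mc = sum(math.exp(kappa * sample_delta()) for _ in range(trials)) / trials
print(f"Monte Carlo {mc:.4f} vs exact {exact:.4f}")
assert abs(mc - exact) / exact < 0.02
```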
As a simple example for an application of this result, let us consider the classic $(\mu,\lambda)$ EA (with uniform selection for variation, truncation selection for inclusion into the next generation, and mutation rate $1/n$) optimizing some function $f : \{0,1\}^n \to \mathbb{R}$ with a unique global optimum. For simplicity, let us take as performance measure the number of fitness evaluations in all iterations up to the one in which the optimum was found. By applying Theorem 6 with suitable parameter values, we obtain a concrete lower bound of an expected number of more than 13 million fitness evaluations until the optimum is found.
Since Theorem 6 is slightly technical, we now formulate the following corollary, which removes one of its variables without significantly weakening the result. We note that the proof of this result applies Theorem 6 with a non-constant value of this variable, so we do not see how such a result could have been proven from Lehre (2010).
Consider a PSM process as in Theorem 6. Let $x^* \in \{0,1\}^n$ be the target of the process. For all $x \in \{0,1\}^n$, let $g(x) := H(x, x^*)$ denote the Hamming distance from the target. Assume that there is an $\varepsilon > 0$ such that
, which is equivalent to ;
there is an such that for all populations with and for all , we have .
The main argument is employing Theorem 6 with a small value of the free variable and computing that this small value has no significant influence on the exponential term of the bounds.
We note that since is always at most one. By assumption, . Assume first that . If , then , which is super-polynomial. If , then and is again super-polynomial. This shows the claimed super-polynomiality for .
For , we have and thus is exponential in .
The asymptotic statements of the with-high-probability claims follow analogously.
6 Standard Bit Mutation with Random Mutation Rate
To analyze a uniform mixing hyper-heuristic which uses standard bit mutation with a mutation rate randomly chosen from a finite set of alternatives, Dang and Lehre (2016b, Theorem 2) extend Theorem 6 to such mutation operators. They do not give a proof of their result, stating that it would be similar to the proof of the result for classic standard bit mutation (Lehre, 2010, Theorem 4). Since we did not find this so obvious, we reprove this result now with our methods. The non-asymptoticity of our result allows us to extend it to super-constant numbers of mutation rates and to mutation rates other than $\Theta(1/n)$. We note that such situations appear naturally with the heavy-tailed mutation operator proposed in Doerr et al. (2017).
We show the following result, which extends Theorem 6.
Let $k \in \mathbb{Z}_{\ge 1}$. Let $p_1, \dots, p_k$ be mutation rates and let $q_1, \dots, q_k \ge 0$ be such that $\sum_{i=1}^{k} q_i = 1$. Let $\mathrm{mut}$ be the mutation operator which, in each application independently, chooses an $i \in [1..k]$ with probability $q_i$ and then applies standard bit mutation with mutation rate $p_i$.
Consider a PSM process (see Section 4.1) with search space $\{0,1\}^n$, using this mutation operator $\mathrm{mut}$, and such that each initial individual is uniformly distributed in $\{0,1\}^n$ (not necessarily independently). Let $x^* \in \{0,1\}^n$ be the target of the process. For all $x \in \{0,1\}^n$, let $g(x) := H(x, x^*)$ denote the Hamming distance from the target.
Selection condition: Assume that for all populations with and all with , we have .
It is clear that when using standard bit mutation with a random mutation rate, the drift—regardless of whether we just regard the fitness or an exponential transformation of it—is a convex combination of the drift values of each of the individual mutation operators. The reason why this argument does not immediately extend Theorem 6 to random mutation rates is that the mutation rate also occurs in the exponential term in equation (4). Apart from this difficulty, however, we can reuse large parts of the proof of Theorem 6.
Equation (5), which defines the admissible values for the starting point of the negative drift regime, is not very convenient to work with in general. We stated it nevertheless because in particular situations it might be useful, for example, to show an inapproximability result, that is, that a certain algorithm cannot come closer to the optimum than by a certain margin in subexponential time. The following weaker assumption is easier to work with and should, in most cases, give satisfactory results as well.
If, in an asymptotic setting, the relevant parameters can be taken as constants, then this lemma and Theorem 8 show an exponential lower bound on the runtime. This proves (Dang and Lehre, 2016b, Theorem 2) and extends it to mutation rates that are not necessarily $\Theta(1/n)$.
As an example where mutation rates other than $\Theta(1/n)$ occur, we now regard the heavy-tailed mutation operator proposed in Doerr et al. (2017). This operator was shown to give a uniformly good performance of the (1+1) EA on all jump functions, whereas each fixed mutation rate was seen to be good only for a small range of jump sizes. The heavy-tailed operator and variations of it have shown a good performance also in other works, for example, Mironovich and Buzdalov (2017); Friedrich et al. (2018a, 2018b); Friedrich, Quinzan et al. (2018); Wu et al. (2018); Antipov et al. (2020a, 2020b); Antipov and Doerr (2020); and Ye et al. (2020). The heavy-tailed mutation operator is nothing else than standard bit mutation with a random mutation rate, chosen from a heavy-tailed distribution. Doerr et al. (2017) defined it as follows. Let $\beta > 1$ be a constant. This will be the only parameter of the mutation operator, however, one with not too much importance, so Doerr et al. (2017) simply propose to take $\beta = 1.5$. In each invocation of the mutation operator, a number $\alpha \in [1..n/2]$ is chosen from the power-law distribution with exponent $\beta$. Hence, $\Pr[\alpha = i] = (C_{n/2}^{\beta})^{-1} i^{-\beta}$ for all $i \in [1..n/2]$, where $C_{n/2}^{\beta} := \sum_{j=1}^{n/2} j^{-\beta}$ is the normalizing constant. Once $\alpha$ is determined, standard bit mutation with mutation rate $\alpha/n$ is employed.
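A compact implementation of this operator, following the description above (the helper name and the test values are our own):

```python
import random

def heavy_tailed_mutate(x, beta=1.5, rng=random):
    """Heavy-tailed ('fast') mutation in the spirit of Doerr et al. (2017):
    draw alpha in {1, ..., n/2} from a power law with exponent beta,
    then apply standard bit mutation with rate alpha / n."""
    n = len(x)
    m = n // 2
    weights = [i ** (-beta) for i in range(1, m + 1)]  # Pr[alpha = i] ~ i^-beta
    alpha = rng.choices(range(1, m + 1), weights=weights)[0]
    p = alpha / n
    return [b ^ (rng.random() < p) for b in x]

rng = random.Random(5)
n = 100
x = [0] * n
flips = [sum(heavy_tailed_mutate(x, rng=rng)) for _ in range(2000)]
avg = sum(flips) / len(flips)
print(f"average number of flipped bits: {avg:.2f}")
assert 0 < avg < n / 4  # most mass on small alpha, so few bits flip on average
```

The power law puts most probability on small $\alpha$ (hence rates close to $1/n$), while still producing large mutation rates with non-negligible, only polynomially small probability.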
For fixed , the expression on the left-hand side in Lemma 9 is . This is a convex combination of terms and by comparing the coefficients, we easily see that this expression is decreasing in . Computing , we see that for all , a reproduction number of at most is small enough to lead to exponential runtimes. This is higher than for standard bit mutation with mutation rate $1/n$, where only suffices to show exponential runtimes. This observation fits with our general intuition that larger mutation rates can be destructive, something from which non-elitist algorithms in particular suffer.
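The precise constants are not reproduced here, but the qualitative comparison can be checked numerically. The following sketch (all parameter choices are ours) compares the no-bit-flip probabilities of the two operators; their reciprocals govern the admissible reproduction numbers:

```python
import math

def no_flip_probability(n, beta):
    """E[(1 - alpha/n)^n] for alpha power-law distributed on [1..n//2]:
    a convex combination of the no-bit-flip probabilities (1 - i/n)^n."""
    weights = [i ** (-beta) for i in range(1, n // 2 + 1)]
    terms = [w * (1 - i / n) ** n for i, w in enumerate(weights, start=1)]
    return sum(terms) / sum(weights)

n = 10**4
p_heavy = no_flip_probability(n, beta=1.5)   # heavy-tailed operator
p_plain = (1 - 1 / n) ** n                   # fixed rate 1/n, roughly 1/e
# The admissible reproduction numbers scale like the reciprocals of these
# probabilities, so the heavy-tailed operator tolerates a larger threshold:
assert 1 / p_heavy > 1 / p_plain > math.e - 0.01
```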
For the limiting value we note that and . Hence, for sufficiently large, even a value of admits exponential lower bounds for runtimes.
Without going into details, and in particular without full proofs, we note that these estimates are tight, and this for all mutation operators of the type discussed in this section. Consider such a mutation operator such that the probability $p_0$ of flipping no bit is at least $c$ for some constant $c > 0$. We take the $(\mu,\lambda)$ EA optimizing OneMax as an example. Assume that at some time we have in our parent population some number $k$ of individuals on the highest non-empty fitness level. In expectation, each of them generates $\lambda/\mu$ offspring. Each of these offspring is an exact copy of its parent with probability $p_0$. Consequently, in the next generation the expected number of individuals on this level or higher (as long as the level is not full) is at least $k (\lambda/\mu) p_0$. This is enough to show polynomial runtimes via the level-based method (Lehre, 2011; Dang and Lehre, 2016a; Corus et al., 2018; Doerr and Kötzing, 2019) when $\lambda$ is large enough. For example, the computation just made shows that condition (G2) in Doerr and Kötzing (2019, Theorem 3.2) is satisfied.
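The bookkeeping behind this tightness argument can be made concrete in a few lines (a sketch with illustrative numbers of our own choosing): the expected number of individuals on the best fitness level changes per generation by the factor (reproduction number) times (copy probability), so the process grows or dies out depending on which side of the threshold this product lies.

```python
def expected_on_best_level(k, alpha0, p0, t):
    """Expected number of individuals on the current-best fitness level after
    t generations, if each has alpha0 offspring in expectation and each
    offspring is an unmutated copy with probability p0 (capacity ignored)."""
    return k * (alpha0 * p0) ** t

n = 1000
p0 = (1 - 1 / n) ** n   # no-bit-flip probability of standard bit mutation
assert expected_on_best_level(1, 1.1 / p0, p0, 50) > 100    # above 1/p0: grows
assert expected_on_best_level(1, 0.9 / p0, p0, 50) < 0.01   # below 1/p0: dies out
```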
We note that this tightness stems from the fact that the probability of flipping no bit appears both in the drift computation here and, as the probability to generate a copy of a parent, in the fitness level method for populations. We do not think that this is a coincidence, but we leave working out the details to future work.
7 Fitness Proportionate Selection
In this section, we apply our method to a mutation-only version of the simple genetic algorithm (simple GA). We note that this algorithm is traditionally used with crossover (Goldberg, 1989). The mutation-only version has been regarded in the runtime analysis community mostly because runtime analyses for crossover-based algorithms are extremely difficult. While the first runtime analysis of the mutation-only version (Neumann et al., 2009) appeared in 2009 and showed a near-exponential lower bound on OneMax for arbitrary polynomially bounded population sizes, the first analysis of the crossover-based version from 2012 (Oliveto and Witt, 2012b) could only show a significantly sub-exponential lower bound ( for a constant which is at most ) and this only for population sizes below . We note that the current best result (Oliveto and Witt, 2015) gives a similar lower bound for population sizes below . Both works call these runtimes exponential, and we acknowledge that this definition of exponential runtimes exists, but given the substantial difference between (which is less than 3.5 for all ) and we prefer to reserve the notion “exponential” for the latter.
The mutation-only version of the simple GA with population size $\mu$ is described in Algorithm 3. This algorithm starts with a population of $\mu$ random individuals from $\{0,1\}^n$. In each iteration, it computes from the previous population a new population by $\mu$ times independently selecting an individual from the previous population via fitness proportionate selection and mutating it via standard bit mutation with mutation rate $1/n$.
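A minimal Python sketch of one generation of this algorithm (all names are ours; the fallback to a uniform choice when the whole population has fitness zero is an assumption of this sketch, made so that selection is always well defined):

```python
import random

def onemax(x):
    return sum(x)

def select_proportionate(pop, rng):
    """Fitness proportionate selection; falls back to a uniform choice
    when all fitness values are zero (an assumption of this sketch)."""
    weights = [onemax(x) for x in pop]
    if sum(weights) == 0:
        return rng.choice(pop)
    return rng.choices(pop, weights=weights, k=1)[0]

def mutate(x, rng):
    """Standard bit mutation with rate 1/n."""
    n = len(x)
    return [b ^ (rng.random() < 1 / n) for b in x]

def simple_ga_generation(pop, rng):
    """One iteration: select a parent and mutate it, len(pop) times."""
    return [mutate(select_proportionate(pop, rng), rng) for _ in pop]

rng = random.Random(42)
n, mu = 50, 10
pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(mu)]
for _ in range(5):
    pop = simple_ga_generation(pop, rng)
```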
The precise known results for the performance of Algorithm 3 on the OneMax benchmark are the following. Neumann et al. (2009, Theorem 8) showed that with it needs with high probability more than iterations to find the optimum of the OneMax function or any search point in Hamming distance at most from it. This is only a subexponential lower bound. In Lehre (2011, Corollary 13), building on the lower bound method from Lehre (2010), a truly exponential lower bound is shown for the weaker task of finding a search point in Hamming distance at most from the optimum, but only for a relatively large population size of (and again ).
We now extend this result to arbitrary $\mu$, that is, we remove the conditions and . To obtain the best known constant 0.029 for how close to the optimum the algorithm cannot come in subexponential time, we have to compromise on the constants in the runtime, which consequently are of theoretical interest only. We therefore do not specify the base of the exponential function or the leading constant. We note that this would have been easily possible since we only use a simple additive Chernoff bound and Corollary 7. We further note that Lehre (2011) also shows lower bounds for a scaled version of fitness proportionate selection and a general mutation rate. This would also be possible with our approach and would again remove the conditions on , but we do not see that the additional effort is justified here.
There is a such that the mutation-only simple GA optimizing OneMax with any population size with probability does not find any solution with within fitness evaluations.
The main difficulty in proving lower bounds for algorithms using fitness proportionate selection is that the reproduction number is non-trivial to estimate. If all but one individual have a fitness of zero, then this individual is selected $\mu$ times. Hence, $\mu$ is the only general upper bound for the reproduction number. The previous works and ours overcome this difficulty by arguing that the average fitness in the population cannot significantly drop below the initial value of roughly $n/2$, which immediately yields that an individual with fitness $f$ has a reproduction number of roughly at most $f/(n/2) = 2f/n$.
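This estimate is easy to check directly (a small sanity computation of our own, with made-up numbers): the expected number of selections of an individual with fitness $f$ is the number of selections times $f$ divided by the total fitness.

```python
def reproduction_number(f, fitnesses):
    """Expected number of offspring of an individual with fitness f when
    len(fitnesses) parents are drawn by fitness proportionate selection."""
    return len(fitnesses) * f / sum(fitnesses)

mu, n = 100, 1000
# If the average fitness stays near n/2, even a perfect individual
# (fitness n) is selected only about n / (n/2) = 2 times in expectation:
assert abs(reproduction_number(n, [n // 2] * mu) - 2.0) < 1e-9
# Degenerate case: all fitness mass on one individual, selected mu times:
assert reproduction_number(5, [5] + [0] * (mu - 1)) == mu
```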
While it is natural that the typical fitness of an individual should not drop far below $n/2$, making this argument precise is not completely trivial. In Neumann et al. (2009, Lemma 6), it was informally argued that the situation with fitness proportionate selection cannot be worse than with uniform selection. For the latter situation, a union bound over all lineages of individuals is employed and a negative-drift analysis from Oliveto and Witt (2008, Section 3) is used for a single lineage. The analysis in Lehre (2011, Lemma 9) builds on the (positive) drift stemming from standard bit mutation when the fitness is below (this argument needs a mutation rate of at least ) and on the independence of the offspring (here the lower bound on the population size is needed to admit the desired Chernoff bound estimates).
Our proof relies on a natural domination argument which shows that at all times all individuals are at least as good as random individuals in the sense of stochastic domination of the fitness. This allows us to use a simple Chernoff and union bound argument to show that with high probability, for a long time all individuals have a fitness of at least . The remainder of the proof is an application of Corollary 7. Here the lower bound (Lehre, 2010, Theorem 4) would have been applicable as well, with the main difference that there one has to deal with the constant , which does not exist in Corollary 7.
We start by proving the key argument used in the proof of Theorem 10, namely that at each time the fitness of each individual stochastically dominates (see Section 2) the fitness of a uniformly random individual. We denote by $\mathrm{Bin}(n, p)$ the binomial distribution with parameters $n$ and $p$. With a slight abuse of notation, we write $X \succeq \mathrm{Bin}(n, p)$ to denote that the random variable $X$ stochastically dominates a random variable with distribution $\mathrm{Bin}(n, p)$.
In this notation, our goal is to show that at all times, the fitness of each individual of the current population stochastically dominates the $\mathrm{Bin}(n, \frac12)$ distribution. This statement appears easy to believe since fitness proportionate selection, favoring better individuals at least slightly, should not be able to make the population worse. To be on the safe side, we nevertheless prove this statement formally (after the following remark).
We note that another statement that might be easy to believe is not true, namely that the sum of the fitness values of a population at all times dominates the sum of the fitness values of a random population (such as the initial population), that is, that . As a counterexample, let $n$ be a multiple of 10 and let us consider the simple GA with after one iteration. Let . We estimate the probability of the event . With probability at least we have . For each , we have with probability 0.5 by the symmetry of the binomial distribution with parameter . These events are all independent, so with probability at least , we have all of them. In this case, for each independently, with probability at least the -th parent chosen in iteration 1 is and with probability at least the offspring generated from it equals the parent. All these events together occur with probability at least , recall that , which shows . Now for , a simple Chernoff bound argument, e.g., via the additive Chernoff bound (Doerr, 2020d, Theorem 1.10.7), shows that . Since this is (much) smaller than for sufficiently large ( suffices), we do not have .
Consider a run of the simple GA (Algorithm 3) on the OneMax benchmark. Then at each time, the fitness of each individual of the current population stochastically dominates the $\mathrm{Bin}(n, \frac12)$ distribution.
To prove this result, we use the following auxiliary result, which states that a number uniformly chosen from a collection of non-negative numbers is stochastically dominated by a number chosen from the same collection via an analogue of fitness proportionate selection. To define the latter formally, let $a_1, \dots, a_m \ge 0$. For a random variable $J$ with values in $[1..m]$ we write $J \sim \mathrm{fps}(a_1, \dots, a_m)$ if
$\Pr[J = i] = a_i / \sum_{j=1}^{m} a_j$ for all $i \in [1..m]$ in the case that $a_j > 0$ for at least one $j$, and
$\Pr[J = i] = \frac{1}{m}$ for all $i \in [1..m]$ in the case that $a_j = 0$ for all $j$.
Let $a_1, \dots, a_m \ge 0$. Let $I \in [1..m]$ be uniformly chosen and $X = a_I$. Let $J \sim \mathrm{fps}(a_1, \dots, a_m)$ and $Y = a_J$. Then $Y \succeq X$.
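For concrete values, this lemma can be verified mechanically by comparing cumulative distribution functions; the following self-contained check (our code, not from the paper) confirms that the fitness proportionate choice has a pointwise smaller CDF and hence dominates the uniform choice:

```python
from collections import defaultdict
from itertools import accumulate

def cdf(values, probs):
    """Cumulative distribution function over the sorted support."""
    mass = defaultdict(float)
    for v, p in zip(values, probs):
        mass[v] += p
    support = sorted(mass)
    return support, list(accumulate(mass[v] for v in support))

def fps_dominates_uniform(values):
    """Check that the fitness proportionate choice from non-negative numbers
    stochastically dominates the uniform choice (pointwise smaller CDF)."""
    m, total = len(values), sum(values)
    uniform = [1 / m] * m
    fps = [v / total for v in values] if total > 0 else uniform
    _, cu = cdf(values, uniform)
    _, cf = cdf(values, fps)
    return all(f <= u + 1e-12 for f, u in zip(cf, cu))

assert fps_dominates_uniform([1, 2, 3, 10])
assert fps_dominates_uniform([0, 0, 7])
assert fps_dominates_uniform([4, 4, 4, 4])
```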
We now show Lemma 11.
We show the claim via induction over time. For the random initial population, the claim is obviously true. Assume that in some iteration, the parent population satisfies the claim, that is, the fitness of each of its individuals stochastically dominates the $\mathrm{Bin}(n, \frac12)$ distribution. We show that the same is true for the next population. Since all individuals of the new population are identically distributed, we consider how one of them is generated. Let $x$ be a uniformly random individual and let $y$ be the parent individual selected for the generation of the offspring. By our inductive assumption and Lemma 12, the fitness of $y$ dominates the fitness of $x$. Let $x'$ and $y'$ be the results of applying standard bit mutation to $x$ and $y$. Since the fitness of $y$ dominates that of $x$, by Lemma 2 the fitness of $y'$ dominates that of $x'$. Now $y'$ is equal (in distribution) to the offspring we just regarded and $x'$ is (still) a uniformly random bit string. Hence the fitness of the offspring dominates the $\mathrm{Bin}(n, \frac12)$ distribution, as desired.
We are now ready to give the formal proof of Theorem 10.
Consider a run of the simple GA with population size . With the domination argument of Lemma 11, the fitness of a particular solution dominates a sum of $n$ independent uniform $\{0,1\}$-valued random variables. Hence, using the additive Chernoff bound (Doerr, 2020d, Theorem 1.10.7), we see that with probability at most for all .
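Numerically, this step looks as follows (illustrative parameters of our own choosing; the additive Chernoff bound used is the standard Hoeffding estimate $\Pr[\mathrm{Bin}(n,\frac12) \le (\frac12-\varepsilon)n] \le e^{-2\varepsilon^2 n}$, combined with a union bound over all fitness evaluations considered):

```python
import math

def hoeffding_lower_tail(n, eps):
    """Pr[ Bin(n, 1/2) <= (1/2 - eps) * n ] <= exp(-2 * eps**2 * n)."""
    return math.exp(-2 * eps**2 * n)

n, eps = 10**4, 0.05
single = hoeffding_lower_tail(n, eps)  # one individual in one generation
union = 2**20 * single                 # union bound over 2^20 fitness evaluations
assert union < 1e-15                   # overall failure probability stays tiny
```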
To avoid working in conditional probability spaces, let us consider a modification of the simple GA. It is identical to the original algorithm up to the point when the fitness of an individual for the first time drops below . From that time on, the algorithm selects the parents uniformly at random. Such an artificial continuation of a process beyond the time horizon of interest was, to the best of our knowledge, first used in the theory of evolutionary algorithms in Doerr et al. (2011). For our modified simple GA, the reproduction rate of any individual in a population with all elements having fitness less than , , is at most . Hence, we can apply Corollary 7 with this . Taking, similarly to Lehre (2011), and , we can work with and thus . For sufficiently large, this allows us to use .
While an exponential runtime on OneMax is not an exciting performance, for the sake of completeness we note that the runtime of the simple GA on OneMax is also not worse than exponential. A runtime of can be shown with the methods of Doerr (2020b) (with some adaptations). The key observation is that, similar to property (A) in Doerr (2020b, Theorem 3), if at some time the population contains an individual with some fitness at least , then in the next iteration this individual is chosen as parent at least once with at least constant probability and, conditional on this, with probability at least $\frac{1}{en}$ a particular better Hamming neighbor of this individual is generated from it.
8 Conclusion and Outlook
In this work, we have proven two technical tools which may ease future lower bound proofs in discrete evolutionary optimization. The negative multiplicative drift theorem has the potential to replace the more technical negative drift theorems used so far in various contexts. Our strengthening and simplification of the negative drift in populations method should help improve our still not very well developed understanding of population-based algorithms. Clearly, it is restricted to mutation-based algorithms; providing such a tool for crossover-based algorithms and extending our understanding of how to prove lower bounds for these beyond the few existing results (Doerr and Theile, 2009; Oliveto and Witt, 2015; Sutton and Witt, 2019; Doerr, 2020e) would be great progress.
This work was supported by a public grant as part of the Investissement d'avenir project, reference ANR-11-LABX-0056-LMH, LabEx LMH.
*The author can be contacted at doerr (at) lix.polytechnique (dot) fr.