## Abstract

A decent number of lower bounds for non-elitist population-based evolutionary algorithms have been shown by now. Most of them are technically demanding due to the (hard to avoid) use of negative drift theorems, that is, general results which translate an expected movement away from the target into a high hitting time. We propose a simple negative drift theorem for multiplicative drift scenarios and show that it can simplify existing analyses. We discuss in more detail Lehre's (2010) negative drift in populations method, one of the most general tools to prove lower bounds on the runtime of non-elitist mutation-based evolutionary algorithms for discrete search spaces. Together with other arguments, we obtain an alternative and simpler proof of this result, which also strengthens and simplifies this method. In particular, now only three of the five technical conditions of the previous result have to be verified. The lower bounds we obtain are explicit instead of only asymptotic. This allows us to compute concrete lower bounds for concrete algorithms, but also enables us to show that super-polynomial runtimes appear already when the reproduction rate is only a $(1-\omega(n^{-1/2}))$ factor below the threshold. For the special case of algorithms using standard bit mutation with a random mutation rate (called uniform mixing in the language of hyper-heuristics), we prove the result stated by Dang and Lehre (2016b) and extend it to mutation rates other than $\Theta(1/n)$, which includes the heavy-tailed mutation operator proposed by Doerr et al. (2017). We finally use our method and a novel domination argument to show an exponential lower bound for the runtime of the mutation-only simple genetic algorithm on OneMax for arbitrary population size.

## 1  Introduction

Lower bounds for the runtimes of evolutionary algorithms are important as they can warn the algorithm user that certain algorithms or certain parameter settings will not lead to good solutions in acceptable time. Unfortunately, the existing methods to obtain such results, for non-elitist algorithms in particular, are very technical and thus difficult to use.

One reason for this high complexity is the use of drift analysis, which seems hard to circumvent. Drift analysis (see Lengler, 2020) is a set of tools that all try to derive useful information on a hitting time (e.g., the first time a solution of a certain quality is found) from information on the expected progress in one iteration. The hope is that the progress in a single iteration can be analyzed with only moderate difficulty and then the drift theorem does the remaining work. While more direct analysis methods exist and have been successfully used for simple algorithms, for population-based algorithms and in particular non-elitist ones, it is hard to imagine that the complicated population dynamics can be captured in proofs not using more advanced tools such as drift analysis.

Drift analysis has been used with great success to prove upper bounds on runtimes of evolutionary algorithms. Tools such as the additive (He and Yao, 2001), multiplicative (Doerr et al., 2012), and variable drift theorem (Mitavskiy et al., 2009; Johannsen, 2010) all allow us to easily obtain an upper bound on a hitting time solely from the expected progress in one iteration. Unfortunately, proving matching lower bounds is much harder since here the drift theorems require additional technical assumptions on the distribution of the progress in one iteration. This is even more true in the case of so-called negative drift, where the drift is away from the target and we aim at proving a high lower bound on the hitting time.

In this work, we propose a very simple negative drift theorem for the case of multiplicative drift (Lemma 3). We briefly show that this result can ease two classic lower bound analyses (also in Section 3).

In more detail, we use the new drift theorem (and some more arguments) to rework Lehre's negative drift in populations method (Lehre, 2010). This highly general analysis method can show exponential lower bounds on the runtime of a large class of evolutionary algorithms solely by comparing the so-called reproduction rate of individuals in the population with a threshold that depends only on the mutation rate.

The downside of Lehre's method is that both the result and its proof are very technical. To apply the general result (and not the specialization to algorithms using standard bit mutation), five technical conditions need to be verified, which requires the user to choose suitable values for six different constants; these have an influence on the lower bound one obtains. This renders the method of Lehre hard to use. Among the 54 citations to Lehre (2010) (according to Google Scholar on June 9, 2020), only the two works Lehre (2011) and Dang and Lehre (2016b) apply this method. To hopefully ease future analyses of negative drift in populations, we revisit this method and obtain the following improvements.

A simpler result: We manage to show essentially the same lower bound by only verifying three of the five conditions of Lehre's result (Theorems 4 and 5). This also reduces the number of constants one needs to choose from six to four.

A non-asymptotic result: Our result gives explicit lower bounds, that is, free from asymptotic notation or unspecified constants. Consequently, our specialization to algorithms using standard bit mutation (Theorem 6) also gives explicit bounds. This allows one to prove concrete bounds for specific situations [e.g., that the $(\mu,\lambda)$ EA with $\lambda=2\mu$ needs more than 13 million fitness evaluations to find the optimum of the OneMax problem defined over bit strings of length $n=500$; see the example following Theorem 6] and gives more fine-grained theoretical results [by choosing Lehre's constant $\delta$ as a suitable function of the problem size, we show that a super-polynomial runtime behavior is observed already when the reproduction rate is only a $(1-\omega(n^{-1/2}))$ factor below the threshold; see Corollary 7]. With the absence of asymptotic notation, we can also analyze algorithms using standard bit mutation with a mutation rate chosen randomly from a discrete set of alternatives (Section 6). Such a result was stated by Dang and Lehre (2016b), however only for mutation rates that are $\Theta(1/n)$. Our result does not need this restriction and thus, for example, applies also to the heavy-tailed mutation operator proposed by Doerr et al. (2017).

A simple proof: Besides the important aspect that a proof guarantees the result to be mathematically correct, an understandable proof can also tell us why a result is correct and give further insights into working principles of algorithms. While every reader will have a different view on what the ideal proof looks like, we felt that Lehre's proof, combining several deep and abstract tools such as multi-type branching processes, eigenvalue arguments, and Hajek's drift theorem (Hajek, 1982), does not easily give a broader understanding of the proof mechanics and the working principles of the algorithms analyzed. We hope that our proof, based on a simple potential function argument together with our negative drift theorem, is more accessible.

Finally, we analyze an algorithm using fitness proportionate selection. The negative drift in populations method is not immediately applicable to such algorithms since it is hard to provide a general unconditional upper bound on the reproduction rate: If all but one individual have a very low fitness, then this best individual has a high reproduction rate. We therefore show that at all times all search points are at least as good (in a stochastic domination sense) as random search points. This allows us to argue that the reproduction rates are low and then gives a simple proof of an exponential lower bound for the mutation-only simple genetic algorithm (simple GA) with arbitrary population size optimizing the simple OneMax benchmark, improving over the mildly sub-exponential lower bound in Neumann et al. (2009) and the exponential lower bound for large population sizes only in Lehre (2011).

### 1.1  Related Works

A number of different drift theorems dealing with negative drift have been proven so far, among others, in Happ et al. (2008), Oliveto and Witt (2011, 2012a), Rowe and Sudholt (2014), Oliveto and Witt (2015), Kötzing (2016), Lengler and Steger (2018), and Witt (2019) (note that in early works, the name “simplified drift theorem” was used for such results). They all require some additional assumptions on the distribution of the one-step progress, which makes them non-trivial to use. We refer to Lengler (2020, Section 2.4.3) for more details. Another approach to negative drift was used in Antipov et al. (2019) and Doerr (2019b, 2020a). There the original process was transformed suitably (via an exponential function), but in a way that the drift of the new process still is negative or at most a small constant. To this transformed process the lower bound version of the additive drift theorem (He and Yao, 2001) was applied, which gave large lower bounds since the target, due to the exponential rescaling, now was far from the starting point of the process.

In terms of lower bounds for non-elitist algorithms, besides Lehre's general result (Lehre, 2010), the following results for particular algorithms exist (always, $n$ is the problem size, $\varepsilon$ can be any positive constant, and $e \approx 2.718$ is the base of the natural logarithm). Jägersküpper and Storch (2007, Theorem 1) showed that the $(1,\lambda)$ EA with $\lambda \le \frac{1}{14}\ln(n)$ is inefficient on any pseudo-Boolean function with a unique optimum. The asymptotically tight condition $\lambda \le (1-\varepsilon)\log_{\frac{e}{e-1}} n$ to yield a super-polynomial runtime was given by Rowe and Sudholt (2014). Happ et al. (2008) showed that two simple (1+1)-type hillclimbers with fitness proportionate selection cannot optimize efficiently any linear function with positive weights. Neumann et al. (2009) showed that a mutation-only variant of the simple GA with fitness proportionate selection is inefficient on the OneMax function when the population size $\mu$ is at most polynomial, and it is inefficient on any pseudo-Boolean function with unique global optimum when $\mu \le \frac14 \ln(n)$. The mildly subexponential lower bound for OneMax was improved to an exponential lower bound by Lehre (2011), but only for $\mu \ge n^3$. In a series of remarkable works up to Oliveto and Witt (2015), Oliveto and Witt showed that the true simple GA using crossover cannot optimize OneMax efficiently when $\mu \le n^{\frac14-\varepsilon}$. None of these results gives an explicit lower bound or specifies the base of the exponential function. In Antipov et al. (2019), an explicit lower bound for the runtime of the $(\mu,\lambda)$ EA is proven (but stated only in the proof of Theorem 3.1 in Antipov et al., 2019). Section 3 of Antipov et al. (2019) bears some similarity with ours; in fact, one can argue that our work extends Antipov et al. (2019, Section 3) from a particular algorithm to the general class of population-based processes regarded by Lehre (2010) (where, naturally, Antipov et al. (2019) did not have the negative multiplicative drift result and therefore did not obtain bounds that hold with high probability).

This work is an extended version of a paper that appeared in the proceedings of Parallel Problem Solving from Nature 2020 (Doerr, 2020c). This version contains all proofs (for reasons of space, in Doerr, 2020c, only Lemma 1 and Theorem 2 were proven), a new section on standard bit mutation with random mutation rates, and several additional details.

## 2  Notation and Preliminaries

In terms of basic notation, we write $[a..b] := \{z \in \mathbb{Z} \mid a \le z \le b\}$. We recall the definition of the OneMax benchmark function
$\mathrm{OneMax}: \{0,1\}^n \to \mathbb{R};\ x = (x_1, \dots, x_n) \mapsto \sum_{i=1}^n x_i,$
which counts the number of ones in the argument. We denote the Hamming distance of two bit strings $x, y \in \{0,1\}^n$ by
$H(x,y) = |\{i \in [1..n] \mid x_i \ne y_i\}|.$

The classic mutation operator standard bit mutation creates an offspring by flipping each bit of the parent independently with some probability $p$, which is called the mutation rate.
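The three notions just introduced translate directly into code. The following is a minimal Python sketch for illustration only; the function names `one_max`, `hamming`, and `standard_bit_mutation`, the tuple representation of bit strings, and the explicit random generator argument are assumptions made here and are not notation from this work.

```python
import random

def one_max(x):
    """Number of ones in the bit string x (a sequence of 0/1 values)."""
    return sum(x)

def hamming(x, y):
    """Hamming distance H(x, y): number of positions in which x and y differ."""
    if len(x) != len(y):
        raise ValueError("bit strings must have equal length")
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

def standard_bit_mutation(x, p, rng=random):
    """Flip each bit of x independently with probability p (the mutation rate)."""
    return tuple(1 - xi if rng.random() < p else xi for xi in x)

x = (1, 0, 1, 1, 0)
y = standard_bit_mutation(x, p=1 / len(x), rng=random.Random(1))
print(one_max(x), hamming(x, y))
```

Passing a seeded `random.Random` instance keeps experiments reproducible without touching the global generator.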

We shall twice need the notion of stochastic domination and its relation to standard bit mutation, so we quickly collect these ingredients of our proofs. We refer to Doerr (2019a) for more details on stochastic domination and its use in runtime analysis.

For two real-valued random variables $X$ and $Y$, we say that $Y$ stochastically dominates $X$, written as $X \preceq Y$, if for all $\lambda \in \mathbb{R}$ we have $\Pr[Y \le \lambda] \le \Pr[X \le \lambda]$. Stochastic domination is a very flexible way of saying that $Y$ is larger than $X$. It implies $E[X] \le E[Y]$, but not only this, we also have $E[f(X)] \le E[f(Y)]$ for any monotonically increasing function $f$.

Lemma 1:

Let $X, Y$ be two random variables taking values in some set $\Omega \subseteq \mathbb{R}$ such that $X \preceq Y$. Let $f: \Omega \to \mathbb{R}$ be monotonically increasing, that is, we have $f(x) \le f(y)$ for all $x, y \in \Omega$ with $x \le y$. Then $E[f(X)] \le E[f(Y)]$.
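For random variables with finite support, the domination condition can be checked mechanically by comparing the two distribution functions at all support points. The sketch below assumes that distributions are given as dictionaries mapping values to probabilities; the helper names `dominates` and `expectation` and the example distributions are hypothetical.

```python
def dominates(pmf_x, pmf_y, tol=1e-12):
    """Return True if Y stochastically dominates X, i.e. Pr[Y <= t] <= Pr[X <= t] for all t.

    pmf_x and pmf_y map values to probabilities (finite support), so it
    suffices to compare the distribution functions at the support points.
    """
    cdf_x = cdf_y = 0.0
    for t in sorted(set(pmf_x) | set(pmf_y)):
        cdf_x += pmf_x.get(t, 0.0)
        cdf_y += pmf_y.get(t, 0.0)
        if cdf_y > cdf_x + tol:
            return False
    return True

def expectation(pmf, f=lambda v: v):
    """E[f(V)] for a random variable V with the given probability mass function."""
    return sum(prob * f(v) for v, prob in pmf.items())

X = {0: 0.5, 1: 0.3, 2: 0.2}
Y = {0: 0.2, 1: 0.3, 2: 0.5}
f = lambda v: v ** 3  # monotonically increasing on the support

print(dominates(X, Y), expectation(X, f) <= expectation(Y, f))
```

In this example $Y$ dominates $X$, and the expectations of the increasing function are ordered accordingly, as Lemma 1 asserts.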

Significantly improving over previous related arguments in Droste et al. (2000, Section 5) and Doerr et al. (2012, Lemma 13), Witt (2013, Lemma 6.1) showed the following natural domination argument for offspring generated via standard bit mutation with mutation rate at most $\frac12$. We note that the result is formulated in Witt (2013) only for $x^* = (1, \dots, 1)$, but the proof in Witt (2013) or a symmetry argument immediately shows the following general version.

Lemma 2:
Let $x^*, x, y \in \{0,1\}^n$ with $H(x,x^*) \ge H(y,x^*)$. Let $x'$ and $y'$ be random search points obtained from $x$ and $y$ via standard bit mutation with mutation rate $p \le \frac12$. Then
$H(x',x^*) \succeq H(y',x^*).$

## 3  Negative Multiplicative Drift

The following elementary result allows us to prove lower bounds on the time to reach a target in the presence of multiplicative drift away from the target. While looking innocent, it has the potential to replace the more complicated lower bound arguments previously used in analyses of non-elitist algorithms. We discuss this briefly at the end of this section.

Lemma 3 (Negative Multiplicative Drift Theorem):
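Let $X_0, X_1, \dots$ be a random process in a finite subset of $\mathbb{R}_{\ge 0}$. Assume that there are $\Delta, \delta > 0$ such that for each $t \ge 0$, the following multiplicative drift condition with additive disturbance holds:
$E[X_{t+1}] \le (1-\delta) E[X_t] + \Delta. \qquad (1)$
Assume further that $E[X_0] \le \frac{\Delta}{\delta}$. Then the following two assertions hold.
• For all $t \ge 0$, $E[X_t] \le \frac{\Delta}{\delta}$.

• Let $M > \frac{\Delta}{\delta}$ and $T = \min\{t \ge 0 \mid X_t \ge M\}$. Then for all integers $L \ge 0$,
$\Pr[T \ge L] \ge 1 - L\frac{\Delta}{\delta M},$
and $E[T] \ge \frac{\delta M}{2\Delta} - \frac12$.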
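Since both bounds of Lemma 3 are explicit, they can be evaluated directly. The following Python sketch does so; the function name `negative_drift_bounds` and the decision to cap the probability bound at zero (a negative lower bound carries no information) are choices made here for illustration.

```python
def negative_drift_bounds(delta, Delta, M, L):
    """Evaluate the two bounds of Lemma 3 for T = min{t : X_t >= M}.

    Returns (lower bound on Pr[T >= L], lower bound on E[T]).
    Requires delta, Delta > 0 and M > Delta / delta.
    """
    if not (delta > 0 and Delta > 0 and M > Delta / delta):
        raise ValueError("need delta, Delta > 0 and M > Delta / delta")
    # Pr[T >= L] >= 1 - L * Delta / (delta * M), capped at 0
    prob_lower = max(0.0, 1.0 - L * Delta / (delta * M))
    # E[T] >= delta * M / (2 * Delta) - 1/2
    expected_lower = delta * M / (2.0 * Delta) - 0.5
    return prob_lower, expected_lower

prob_lower, expected_lower = negative_drift_bounds(delta=0.1, Delta=1.0, M=1000.0, L=10)
print(prob_lower, expected_lower)
```

For these hypothetical parameters, the process stays below $M$ for at least ten steps with probability at least $0.9$, and the expected hitting time is at least $49.5$.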

The proof is an easy computation of expectations and an application of Markov's inequality similar to the direct proof of the multiplicative drift theorem in Doerr and Goldberg (2013). We do not see a reason why the result should not also hold for processes taking more than a finite number of values, but since we are only interested in the finite setting, we spare ourselves the more complicated world of continuous probability spaces.

Proof of Lemma 3:
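If $E[X_t] \le \frac{\Delta}{\delta}$, then $E[X_{t+1}] \le (1-\delta)E[X_t] + \Delta \le (1-\delta)\frac{\Delta}{\delta} + \Delta = \frac{\Delta}{\delta}$ by (1). Hence, the first claim follows by induction. To prove the second claim, we compute
$\Pr[T < L] \le \Pr[X_0 + \dots + X_{L-1} \ge M] \le \frac{E[X_0 + \dots + X_{L-1}]}{M} \le \frac{L\Delta}{\delta M},$
where the middle inequality follows from Markov's inequality and the fact that the $X_t$ by assumption are all non-negative. From this estimate, using the shorthand $s = \lfloor \frac{\delta M}{\Delta} \rfloor$, we compute $E[T] = \sum_{t=1}^{\infty} \Pr[T \ge t] \ge \sum_{t=1}^{s} (1 - t\frac{\Delta}{\delta M}) = s - \frac12 s(s+1)\frac{\Delta}{\delta M} \ge \frac{\delta M}{2\Delta} - \frac12$, where the first equality is a standard way to express the expectation of a random variable taking non-negative integral values and the last inequality is an elementary estimate that can be verified as follows. Let $\varepsilon = \frac{\delta M}{\Delta} - s$. Then $s - \frac12 s(s+1)\frac{\Delta}{\delta M} \ge \frac{\delta M}{2\Delta} - \frac12$ is the same as $s - \frac{s(s+1)}{2(s+\varepsilon)} \ge \frac12(s+\varepsilon) - \frac12$, which is equivalent to
$2s(s+\varepsilon) - s(s+1) + (s+\varepsilon) \ge (s+\varepsilon)^2,$
since $s+\varepsilon \ge 0$. Now the left-hand side is equal to $s^2 + 2s\varepsilon + \varepsilon$, which is not smaller than $s^2 + 2s\varepsilon + \varepsilon^2$, since $\varepsilon \in [0,1)$, and this is just the right-hand side.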
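The elementary estimate at the end of this proof can also be sanity-checked numerically. The sketch below writes $r$ for the quantity $\delta M / \Delta$ and tests the inequality on a grid of values; the function name and the grid are arbitrary choices, and such a check of course does not replace the algebraic argument.

```python
import math

def elementary_estimate_holds(r, tol=1e-9):
    """Check s - s(s+1)/(2r) >= r/2 - 1/2 for r = delta*M/Delta > 0 and s = floor(r)."""
    s = math.floor(r)
    lhs = s - s * (s + 1) / (2.0 * r)
    rhs = r / 2.0 - 0.5
    # small tolerance, since equality holds whenever r is an integer
    return lhs >= rhs - tol

# r > 1 corresponds to the assumption M > Delta / delta of Lemma 3
print(all(elementary_estimate_holds(k / 100.0) for k in range(101, 5000)))
```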
We note that in the typical application of this result (as in the upcoming proof of Theorem 4), we expect to see the condition that for all $t \ge 0$,
$E[X_{t+1} \mid X_t] \le (1-\delta)X_t + \Delta. \qquad (2)$
Clearly, this condition implies (1) by the law of total expectation.

We now argue that our negative multiplicative drift theorem is likely to find applications beyond ours to the negative drift in populations method in the following section. To this aim, we regard two classic lower bound analyses of non-elitist algorithms and point out where our drift theorem would have eased the analysis.

Neumann et al. (2009) show that the variant of the simple genetic algorithm (simple GA) not using crossover needs time $2^{n^{1-O(1/\log\log n)}}$ to optimize the simple OneMax benchmark. The key argument in Neumann et al. (2009) is as follows. The potential $X_t$ of the population $P(t)$ in iteration $t$ is defined as $X_t = \sum_{x \in P(t)} 8^{\mathrm{OneMax}(x)}$. For this potential, it is shown (Neumann et al., 2009, Lemma 7) that if $X_t \ge 8^{0.996n}$, then $E[X_{t+1}] \le (1-\delta)X_t$ for some constant $\delta > 0$. By bluntly estimating $E[X_{t+1}]$ in the case that $X_t < 8^{0.996n}$, this bound could easily be extended to $E[X_{t+1} \mid X_t] \le (1-\delta)X_t + \Delta$ for some number $\Delta$. This suffices to employ our negative drift theorem and obtain the desired lower bound. Without our drift theorem at hand, in Neumann et al. (2009) the potential $Y_t = \log_8(X_t)$ was considered; it was argued that it displays an additive drift away from the target and that $Y_t$ satisfies certain concentration statements necessary for the subsequent use of a negative drift theorem for additive drift.

A second example where we feel that our drift theorem can ease the analysis is the work of Oliveto and Witt (2014, 2015) on the simple GA with crossover optimizing OneMax. Due to the use of crossover, this work is much more involved, so we shall not go into detail and simply point the reader to the location where negative drift occurs. In Lemma 19 of Oliveto and Witt (2015), a multiplicative drift statement (away from the target) is proven. To use a negative drift theorem for additive drift (Oliveto and Witt, 2015, Theorem 2), in the proof of Lemma 20 the logarithm of the original process is regarded. So here again, we think that a direct application of our drift theorem would have eased the analysis.

## 4  Negative Drift in Populations Revisited

In this section, we use our negative multiplicative drift result and some more arguments to rework Lehre's negative drift in populations method (Lehre, 2010) and obtain Theorem 4 further below. This method allows an analysis of a broad class of evolutionary algorithms, namely all that can be described via the following type of population process.

### 4.1  Population Selection-Mutation Processes

A population selection-mutation (PSM) process (called population selection-variation algorithm in Lehre, 2010) is the following type of random process. Let $\Omega$ be a finite set. We call $\Omega$ the search space and its elements solution candidates or individuals. Let $\lambda \in \mathbb{N}$ be called the population size of the process. An ordered multi-set of cardinality $\lambda$, in other words, a $\lambda$-tuple, over the search space $\Omega$ is called a population. Let $\mathcal{P} = \Omega^\lambda$ be the set of all populations. For $P \in \mathcal{P}$, we write $P_1, \dots, P_\lambda$ to denote the elements of $P$. We also write $x \in P$ to denote that there is an $i \in [1..\lambda]$ such that $x = P_i$.

A PSM process starts with some, possibly random, population $P(0)$. In each iteration $t = 1, 2, \dots$, a new population $P(t)$ is generated from the previous one $P(t-1)$ as follows. Via a (possibly) randomized selection operator $\mathrm{sel}(\cdot)$, a $\lambda$-tuple of individuals is selected and then each of them creates an offspring through the application of a randomized mutation operator $\mathrm{mut}(\cdot)$.

The selection operator can be arbitrary except that it only selects individuals from $P(t-1)$. In particular, we do not assume that the selected individuals are independent. Formally speaking, the outcome of the selection process is a random $\lambda$-tuple $Q = \mathrm{sel}(P(t-1)) \in [1..\lambda]^\lambda$ such that $P_{Q_1}(t-1), \dots, P_{Q_\lambda}(t-1)$ are the selected parents.

From each selected parent $P_{Q_i}(t-1)$, a single offspring $P_i(t)$ is generated via a randomized mutation operator, $P_i(t) = \mathrm{mut}(P_{Q_i}(t-1))$. Formally speaking, for each $x \in \Omega$, $\mathrm{mut}(x)$ is a probability distribution on $\Omega$ and we write $y = \mathrm{mut}(x)$ to indicate that $y$ is sampled from this distribution. We assume that each sample, that is, each call of a mutation operator, uses independent randomness. With this notation, we can write the new population as $P(t) = (\mathrm{mut}(P_{Q_1}(t-1)), \dots, \mathrm{mut}(P_{Q_\lambda}(t-1)))$ with $Q = \mathrm{sel}(P(t-1))$. From the definition it is clear that a PSM process is a Markov process with state space $\mathcal{P}$. A pseudocode description of PSM processes is given in Algorithm 1.
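The definition of a PSM process can be sketched in a few lines of Python. This is an illustrative reading of the definition and not a transcription of Algorithm 1: the names `psm_step`, `run_psm`, `uniform_selection`, and `bit_mutation` are hypothetical, indices are 0-based rather than in $[1..\lambda]$, and uniform selection with mutation rate $0.1$ is merely an example instantiation.

```python
import random

def psm_step(population, sel, mut, rng):
    """One iteration of a PSM process: select lambda parent indices, mutate each parent."""
    q = sel(population, rng)  # tuple Q of lambda indices into the current population
    if len(q) != len(population):
        raise ValueError("selection must return exactly lambda indices")
    # each call of mut uses fresh randomness from rng
    return tuple(mut(population[j], rng) for j in q)

def run_psm(initial_population, sel, mut, steps, rng):
    """Run the process for a fixed number of iterations and return the final population."""
    population = tuple(initial_population)
    for _ in range(steps):
        population = psm_step(population, sel, mut, rng)
    return population

# Hypothetical operators for illustration: uniform selection and standard bit mutation.
def uniform_selection(population, rng):
    lam = len(population)
    return tuple(rng.randrange(lam) for _ in range(lam))

def bit_mutation(x, rng, p=0.1):
    return tuple(1 - b if rng.random() < p else b for b in x)

rng = random.Random(0)
start = tuple(tuple(rng.randint(0, 1) for _ in range(8)) for _ in range(4))
final = run_psm(start, uniform_selection, bit_mutation, steps=5, rng=rng)
print(final)
```

Because the next population depends only on the current one and on fresh randomness, the Markov property noted in the text is visible directly in `psm_step`.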

The following characteristic of the selection operator was found to be crucial for the analysis of PSM processes in Lehre (2010). Let $P \in \mathcal{P}$ and $i \in [1..\lambda]$. Then the random variable $R(i,P) = |\{j \in [1..\lambda] \mid \mathrm{sel}(P)_j = i\}|$, called the reproduction number of the $i$-th individual in $P$, denotes the number of times $P_i$ was selected from $P$ as parent. Its expectation $E[R(i,P)]$ is called the reproduction rate.
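Given the outcome of one selection step, the reproduction numbers are simple occurrence counts. A minimal sketch, again with 0-based indices and a hypothetical function name:

```python
from collections import Counter

def reproduction_numbers(selected_indices, lam):
    """R(i, P) for i = 0..lam-1: how often index i occurs among the selected parents."""
    counts = Counter(selected_indices)
    return [counts.get(i, 0) for i in range(lam)]

r = reproduction_numbers((0, 2, 2, 3), lam=4)
print(r, sum(r))
```

The reproduction numbers of one selection step always sum to $\lambda$, a fact used in the proof of Theorem 4.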

Example: We now describe how the $(\mu,\lambda)$ EA fits into this framework. That it fits into this framework and that the reproduction rate is $\lambda/\mu$ was already stated in Lehre (2010), but how exactly this works out has, to the best of our knowledge, never been made precise, and it is not totally trivial.

We specify that when talking about the $(\mu,\lambda)$ EA, given in pseudocode in Algorithm 2, we mean the basic EA which starts with a parent population of $\mu$ search points chosen independently and uniformly at random from $\{0,1\}^n$. In each iteration, $\lambda$ offspring are generated, each by selecting a parent individual uniformly at random (with repetition) and mutating it via standard bit mutation with mutation rate $p$. The next parent population is selected from these $\lambda$ offspring by taking $\mu$ best individuals, breaking ties randomly.

This algorithm can be modeled as a PSM process with population size $\lambda$ (not $\mu$). To do so, we need a slightly non-standard initialization of the population. We generate $P(0)$ by first taking $\mu$ random search points and then generating each $P_i(0)$, $i \in [1..\lambda]$, by choosing (with replacement) a random one of the $\mu$ base individuals and mutating it. With this definition, each individual in $P(0)$ is uniformly distributed in $\{0,1\}^n$, but these individuals are not independent.

Given a population $P$ consisting of $\lambda$ individuals, the selection operator first selects a set $P_0$ of $\mu$ best individuals from $P$, breaking ties randomly. Formally speaking, this is a tuple $(i_1, \dots, i_\mu)$ of indices in $[1..\lambda]$. Then a random vector $(j_1, \dots, j_\lambda) \in [1..\mu]^\lambda$ is chosen and the selected parents are taken as $Q = (P_{i_{j_1}}, \dots, P_{i_{j_\lambda}})$. The next population $P'$ is obtained by applying the mutation operator to each of these, that is, $P' = (\mathrm{mut}(P_{i_{j_1}}), \dots, \mathrm{mut}(P_{i_{j_\lambda}}))$, where $\mathrm{mut}(\cdot)$ denotes standard bit mutation with mutation rate $p$.

From this description, it is clear that each individual of each population of the $(\mu,\lambda)$ EA has a reproduction rate of at most $\lambda/\mu$.
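The selection operator described in this example can be simulated to estimate reproduction rates empirically. In the sketch below, each of the $\mu$ selected individuals is chosen $\lambda/\mu$ times in expectation and a non-selected individual is never chosen. The function names, the fitness values, and the assumption that the random vector is drawn uniformly are illustrative choices made here.

```python
import random

def comma_selection(fitnesses, mu, rng):
    """Selection operator of the (mu, lambda) EA, returning lambda parent indices.

    Picks mu best indices (ties broken randomly), then draws lambda parents
    uniformly at random with replacement among them. Fitness is maximized.
    """
    lam = len(fitnesses)
    order = list(range(lam))
    rng.shuffle(order)  # random order, so that ties are broken randomly
    order.sort(key=lambda i: fitnesses[i], reverse=True)  # stable sort keeps the shuffled tie order
    best = order[:mu]
    return tuple(rng.choice(best) for _ in range(lam))

def estimate_reproduction_rates(fitnesses, mu, trials, seed=0):
    """Average reproduction number of each index over independent selection steps."""
    rng = random.Random(seed)
    totals = [0] * len(fitnesses)
    for _ in range(trials):
        for i in comma_selection(fitnesses, mu, rng):
            totals[i] += 1
    return [t / trials for t in totals]

# lambda = 6 offspring, mu = 3: the three best indices are 2, 4 and 0
rates = estimate_reproduction_rates([5, 3, 9, 1, 7, 2], mu=3, trials=20000)
print(rates)
```

With these hypothetical fitness values there are no ties, so the estimates for indices 0, 2, and 4 concentrate around $\lambda/\mu = 2$ and the remaining ones are exactly zero.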

### 4.2  Our “Negative Drift in Populations” Result

We prove the following version of the negative drift in populations method.

Theorem 4:

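Consider a PSM process $(P(t))_{t \ge 0}$ with associated reproduction numbers $R(\cdot,\cdot)$ as defined in Section 4.1. Let $g: \Omega \to \mathbb{Z}_{\ge 0}$, called potential function, and $a, b \in \mathbb{Z}_{\ge 0}$ with $a \le b$. Assume that for all $x \in P(0)$ we have $g(x) \ge b$. Let $T = \min\{t \ge 0 \mid \exists i \in [1..\lambda]: g(P_i(t)) \le a\}$ be the first time we have a search point with potential $a$ or less in the population. Assume that the following three conditions are satisfied.

• (i) There is an $\alpha \ge 1$ such that for all populations $P \in \mathcal{P}$ with $\min\{g(P_i) \mid i \in [1..\lambda]\} > a$ and all $i \in [1..\lambda]$ with $g(P_i) < b$, we have $E[R(i,P)] \le \alpha$.

• (ii) There is a $\kappa > 0$ and a $0 < \delta < 1$ such that for all $x \in \Omega$ with $a < g(x) < b$ we have
$E[\exp(-\kappa g(\mathrm{mut}(x)))] \le \frac{1}{\alpha}(1-\delta)\exp(-\kappa g(x)).$
• (iii) There is a $D \ge \delta$ such that for all $x \in \Omega$ with $g(x) \ge b$, we have
$E[\exp(-\kappa g(\mathrm{mut}(x)))] \le D\exp(-\kappa b).$

Then

• $E[T] \ge \frac{\delta}{2D\lambda}\exp(\kappa(b-a)) - \frac12$, and

• for all $L \ge 1$, we have $\Pr[T < L] \le L\lambda\frac{D}{\delta}\exp(-\kappa(b-a))$.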
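Because Theorem 4 is free of asymptotic notation, its two bounds, $E[T] \ge \frac{\delta}{2D\lambda}\exp(\kappa(b-a)) - \frac12$ and $\Pr[T < L] \le L\lambda\frac{D}{\delta}\exp(-\kappa(b-a))$, can be evaluated for concrete parameters. The following sketch assumes that the three conditions of the theorem have already been verified for the supplied constants; the function name and the example values are hypothetical.

```python
import math

def theorem4_bounds(lam, a, b, kappa, delta, D, L):
    """Evaluate the explicit bounds of Theorem 4.

    Returns (lower bound on E[T], upper bound on Pr[T < L]); the probability
    bound is capped at 1 because larger values carry no information.
    """
    if not (lam >= 1 and 0 <= a <= b and kappa > 0 and 0 < delta < 1 and D >= delta):
        raise ValueError("parameters violate the assumptions of Theorem 4")
    growth = math.exp(kappa * (b - a))
    expected_lower = delta / (2.0 * D * lam) * growth - 0.5
    prob_upper = min(1.0, L * lam * D / delta / growth)
    return expected_lower, prob_upper

expected_lower, prob_upper = theorem4_bounds(
    lam=10, a=0, b=20, kappa=math.log(2), delta=0.5, D=1.0, L=100
)
print(expected_lower, prob_upper)
```

The exponential dependence on $\kappa(b-a)$ dominates both bounds, so widening the gap between $a$ and $b$ is usually the most effective way to strengthen them.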

Before proceeding with the proof, we compare our result with Lehre (2010, Theorem 1). We first note that, apart from a technicality which we discuss toward the end of this comparison, the assumptions of our result are weaker than the ones in Lehre (2010) since we do not need the technical fourth and fifth assumption of Lehre (2010), which in our notation would read as follows.

• There is a $\delta_2 > 0$ such that for all $i \in [a..b]$ and all $k, \ell \in \mathbb{Z}$ with $1 \le k+\ell$ and all $x, y \in \Omega$ with $g(x) = i$ and $g(y) = i-\ell$ we have
$\Pr[g(\mathrm{mut}(x)) = i-\ell \wedge g(\mathrm{mut}(y)) = i-\ell-k] \le \exp(\kappa(1-\delta_2)(b-a))\Pr[g(\mathrm{mut}(x)) = i-k-\ell].$
• There is a $\delta_3 > 0$ such that for all $i, j, k, \ell \in \mathbb{Z}$ with $a \le i \le b$ and $1 \le k+\ell \le j$ and all $x, y \in \Omega$ with $g(x) = i$ and $g(y) = i-k$ we have
$\Pr[g(\mathrm{mut}(x)) = i-j] \le \delta_3 \Pr[g(\mathrm{mut}(y)) = i-k-\ell].$

The assertion of our result is of the same type as in Lehre (2010), but stronger in terms of numbers. For the probability $\Pr[T < L]$ to find a potential of at most $a$ in time less than $L$, a bound of
$O(\lambda L^2 D (b-a) \exp(-\kappa\delta_2(b-a)))$
is shown in Lehre (2010). Hence, our result is smaller by a factor of $\Omega(L(b-a)\exp(-\kappa(1-\delta_2)(b-a)))$. In addition, our result is non-asymptotic, that is, the lower bound contains no asymptotic notation or unspecified constants.

The one point where the result of Lehre (2010) potentially is stronger is that it needs assumptions only on the “average drift” from the random search point at time $t$ conditional on having a fixed potential, whereas we require the same bound on the “point-wise drift,” that is, conditional on the current search point being equal to a particular search point of this potential. Let us make this more precise. Lehre uses the notation $(X_t)_{t \ge 0}$ to denote the Markov process on $\Omega$ associated with the mutation operator [unfortunately, it is not said in Lehre (2010) what $X_0$ is, that is, how this process is started]. Then $\Delta_t(i) = (g(X_{t+1}) - g(X_t) \mid g(X_t) = i)$ defines the potential gain in step $t$ when the current state has potential $i$. With this notation, instead of our second and third conditions, Lehre (2010) requires only the weaker conditions (here again translated into our notation).

• (ii') For all $t \ge 0$ and all $a < i < b$, $E[\exp(-\kappa\Delta_t(i))] < \frac{1}{\alpha}(1-\delta)$.

• (iii') For all $t \ge 0$, $E[\exp(-\kappa(g(X_{t+1}) - b)) \mid g(X_t) \ge b] < D$.

So Lehre only requires that the random individual at time $t$, conditional on having a certain potential, gives rise to a certain drift, whereas we require that each particular individual with this potential gives rise to this drift. On the formal level, Lehre's condition is much weaker than ours (assuming that the unclear point of what is $X0$ can be fixed). That said, to exploit such weaker conditions, one would need to be able to compute such average drifts and they would need to be smaller than the worst-case point-wise drift. We are not aware of many examples where average drift was successfully used in drift analysis (one is Jägersküpper's remarkable analysis of the linear functions problem; Jägersküpper, 2008) despite the fact that many classic drift theorems only require conditions on the average drift to hold.

We now prove Theorem 4. Before stating the formal proof, we describe on a high level its main ingredients and how it differs from Lehre's proof.

The main challenge when using drift analysis is designing a potential function that suitably measures the progress. For simple hillclimbers and optimization problems, the fitness of the current solution may suffice, but already the analysis of the $(1+1)$ EA on linear functions resisted such easy approaches (He and Yao, 2001; Droste et al., 2002; Doerr et al., 2012; Witt, 2013). For population-based algorithms, the additional challenge is to capture the quality of the whole population in a single number. We note at this point that the notion of “negative drift in populations” was used in Lehre to informally describe the characteristic of the population processes regarded, but drift analysis as a mathematical tool was employed only on the level of single individuals and the resulting findings were lifted to the whole population via advanced tools like branching processes and eigenvalue arguments.

To prove upper bounds, in Witt (2006), Chen et al. (2009), Lehre (2011), Dang and Lehre (2016a), Corus et al. (2018), Antipov et al. (2018), and Doerr and Kötzing (2019), implicitly or explicitly potential functions were used that build on the fitness of the best individual in the population and the number of individuals having this fitness. Regarding only the current-best individuals, these potential functions might not be suitable for lower bound proofs.

The lower bound proofs in Neumann et al. (2009), Oliveto and Witt (2014, 2015), and Antipov et al. (2019) all define a natural potential for single individuals, namely the Hamming distance to the optimum, and then lift this potential to populations by summing over all individuals an exponential transformation of their base potential (this ingenious definition was, to the best of our knowledge, not known in the theory of evolutionary algorithms before the work of Neumann et al., 2009). This is the type of potential we shall use as well, and given the assumptions of Theorem 4, it is not surprising that $\sum_{x \in P} \exp(-\kappa g(x))$ is a good choice. For this potential, we shall then show with only mild effort that it satisfies the assumptions of our drift theorem, which yields the desired lower bounds on the runtime (using that a single good solution in the population already requires a very high potential due to the exponential scaling). We now give the details of this proof idea.
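The population potential described here is easy to compute, and a small example shows why one good individual already forces a large value. In the sketch below, the function names, the choice $\kappa = \ln 2$, and the all-ones target are assumptions made for illustration.

```python
import math

def population_potential(population, g, kappa):
    """X = sum over all individuals x of exp(-kappa * g(x))."""
    return sum(math.exp(-kappa * g(x)) for x in population)

def hitting_threshold(a, kappa):
    """Value exp(-kappa * a) that a single individual with g(x) <= a contributes at least."""
    return math.exp(-kappa * a)

# Illustration with g = Hamming distance to the all-ones string (a hypothetical target).
g = lambda x: len(x) - sum(x)
pop = [(1, 1, 0, 0), (1, 0, 0, 0), (1, 1, 1, 1)]
X = population_potential(pop, g, kappa=math.log(2))
print(X, hitting_threshold(0, math.log(2)))
```

Here the individual at distance $0$ alone contributes $1$ to the potential, while the two individuals at distances $2$ and $3$ together contribute only $0.375$.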

Proof of Theorem 4:
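We consider the process $(X_t)_{t \ge 0}$ defined by
$X_t = \sum_{i=1}^{\lambda} \exp(-\kappa g(P_i(t))).$
To apply drift arguments, we first analyze the expected state after one iteration, that is, $E[X_t \mid X_{t-1}]$. To this end, let us consider a fixed parent population $P = P(t-1)$ in iteration $t$. Let $Q = \mathrm{sel}(P)$ be the indices of the individuals selected for generating offspring.
We first condition on $Q$ (and as always on $P$), that is, we regard only the probability space defined via the mutation operator, and compute
$E[X_t \mid Q] = E\Big[\sum_{j=1}^{\lambda} \exp(-\kappa g(\mathrm{mut}(P_{Q_j})))\Big] = \sum_{i=1}^{\lambda} (R(i,P) \mid Q)\, E[\exp(-\kappa g(\mathrm{mut}(P_i)))].$
No longer conditioning on $Q$, using the law of total expectation, using the assumptions (ii) and (iii) on the drift from mutation, and finally using assumption (i) on the reproduction number and the trivial fact that $\sum_{i=1}^{\lambda} R(i,P) = \lambda$, we have
$E[X_t] = E_Q[E[X_t \mid Q]] = \sum_{i=1}^{\lambda} E[R(i,P)]\, E[\exp(-\kappa g(\mathrm{mut}(P_i)))] \le \sum_{i: g(P_i) < b} E[R(i,P)]\, \tfrac{1}{\alpha}(1-\delta)\exp(-\kappa g(P_i)) + \sum_{i: g(P_i) \ge b} E[R(i,P)]\, D\exp(-\kappa b) \le (1-\delta)\sum_{i: g(P_i) < b} \exp(-\kappa g(P_i)) + \lambda D\exp(-\kappa b) \le (1-\delta)X_{t-1} + \lambda D\exp(-\kappa b),$
and recall that this is conditional on $P(t-1)$, hence, also on $X_{t-1}$.

Let $\Delta = \lambda D\exp(-\kappa b)$. Since $P(0)$ contains no individual with potential below $b$, we have $X_0 \le \lambda\exp(-\kappa b) = \frac{\Delta}{D} \le \frac{\Delta}{\delta}$. Hence, also the assumption $E[X_0] \le \frac{\Delta}{\delta}$ of Lemma 3 is fulfilled.

Let $M = \exp(-\kappa a)$ and $T' := \min\{t \ge 0 \mid X_t \ge M\}$. Note that $T$, the first time to have an individual with potential at most $a$ in the population, is at least $T'$. Now the negative multiplicative drift theorem (Lemma 3) gives
$\Pr[T < L] \le \Pr[T' < L] \le \frac{L\Delta}{\delta M} = L\lambda\frac{D}{\delta}\exp(-\kappa(b-a))$ and $E[T] \ge E[T'] \ge \frac{\delta M}{2\Delta} - \frac12 = \frac{\delta}{2D\lambda}\exp(\kappa(b-a)) - \frac12$.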

We note that the proof above actually shows the following slightly stronger statement, which can be useful when working with random initial populations (as, for example, in the following section).

Theorem 5:

Theorem 4 remains valid when the assumption that all initial individuals have potential at least $b$ is replaced by the assumption $\sum_{i=1}^{\lambda} E[\exp(-\kappa g(P_i(0)))] \le \frac{\lambda D\exp(-\kappa b)}{\delta}$.

## 5  Processes Using Standard Bit Mutation

Since many EAs use standard bit mutation, as in Lehre (2010) we now simplify our main result for processes using standard bit mutation and for $g$ being the Hamming distance to a target solution. Hence, in this section, we have $\Omega = \{0,1\}^n$ and $y = \mathrm{mut}(x)$ is obtained from $x$ by flipping each bit of $x$ independently with probability $p$. Since our results are non-asymptotic, we can work with any $p \le \frac12$.

Theorem 6:

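Consider a PSM process (see Section 4.1) with search space $\Omega = \{0,1\}^n$, using standard bit mutation with mutation rate $p \in [0,\frac12]$ as mutation operator, and such that $P_i(0)$ is uniformly distributed in $\Omega$ for each $i \in [1..\lambda]$ (possibly with dependencies among the individuals). Let $x^* \in \Omega$ be the target of the process. For all $x \in \Omega$, let $g(x) := H(x,x^*)$ denote the Hamming distance from the target.

Let $\alpha \ge 1$ and $0 < \delta < 1$ such that $\ln(\frac{\alpha}{1-\delta}) < pn$, that is, such that $1 - \frac{1}{pn}\ln(\frac{\alpha}{1-\delta}) =: \varepsilon > 0$. Let $B = \frac{2}{\varepsilon}$. Let $a, b$ be integers such that $0 \le a < b$ and $b \le \tilde{b} := n\frac{1}{B^2-1}$.

Selection condition: Assume that for all populations $P \in \mathcal{P}$ with $\min\{g(P_i) \mid i \in [1..\lambda]\} > a$ and all $i \in [1..\lambda]$ with $g(P_i) < b$, we have $E[R(i,P)] \le \alpha$.

Then the first time $T := \min\{t \ge 0 \mid \exists i \in [1..\lambda]: g(P_i(t)) \le a\}$ that the population contains an individual in distance $a$ or less from $x^*$ satisfies
$E[T] \ge \frac{1}{2\lambda}\min\Big\{\frac{\delta\alpha}{1-\delta}, 1\Big\}\exp\Big(\ln\Big(\frac{2}{1-\frac{1}{pn}\ln(\frac{\alpha}{1-\delta})}\Big)(b-a)\Big) - \frac12, \qquad \Pr[T < L] \le L\lambda\max\Big\{\frac{1-\delta}{\delta\alpha}, 1\Big\}\exp\Big(-\ln\Big(\frac{2}{1-\frac{1}{pn}\ln(\frac{\alpha}{1-\delta})}\Big)(b-a)\Big).$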

The proof of this result is a reduction to Theorem 4. To show that the second and third condition of Theorem 4 are satisfied, one has to estimate $E[\exp(-\kappa(g(\mathrm{mut}(x))-g(x)))]$, which is not difficult since $g(\mathrm{mut}(x))-g(x)$ can be written as a sum of independent random variables. With a similar computation and some elementary calculus, we show that the weaker starting condition of Theorem 5 is satisfied.

Proof of Theorem 6:

We apply Theorem 4. To show the second and third condition of the theorem, let $x\in\Omega$ and let $y=\mathrm{mut}(x)$ be the random offspring generated from $x$. We use the shorthand $d=g(x)$. We note that $g(y)-g(x)=g(y)-d$ can be expressed as a sum of $n$ independent random variables $Z_1,\dots,Z_n$ such that for $i\in[1..d]$, we have $\Pr[Z_i=-1]=p$ and $\Pr[Z_i=0]=1-p$, and for $i\in[d+1..n]$, we have $\Pr[Z_i=+1]=p$ and $\Pr[Z_i=0]=1-p$.

Let $\kappa\ge0$ be arbitrary for the moment. We note that for $i\in[1..d]$, we have $E[\exp(-\kappa Z_i)]=(1-p)\cdot1+pe^{\kappa}=1+p(e^{\kappa}-1)$ and for $i\in[d+1..n]$, analogously, $E[\exp(-\kappa Z_i)]=(1-p)\cdot1+pe^{-\kappa}=1-p(1-e^{-\kappa})$ (formally speaking, we compute here the moment-generating function of a Bernoulli random variable). Using the independence of the $Z_i$, these elementary arguments, and the standard estimate $1+r\le\exp(r)$, we compute
$$E[\exp(-\kappa(g(y)-g(x)))]=E\Big[\prod_{i=1}^{n}\exp(-\kappa Z_i)\Big]=\prod_{i=1}^{n}E[\exp(-\kappa Z_i)]=(1+p(e^{\kappa}-1))^{d}(1-p(1-e^{-\kappa}))^{n-d}\le\exp(dp(e^{\kappa}-1))\cdot\exp(-(n-d)p(1-e^{-\kappa}))=\exp(dpe^{\kappa}+(n-d)pe^{-\kappa}-pn).\tag{3}$$
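Let now $\kappa=\ln(B)$. We consider first the case that $d\le b$, which implies $d\le\tilde b$. We continue the above computation via
$$E[\exp(-\kappa(g(y)-g(x)))]\le\exp\Big(\tilde bpB+(n-\tilde b)p\frac1B-pn\Big)=\exp\bigg(pn\bigg(\frac{B}{B^2-1}+\Big(1-\frac{1}{B^2-1}\Big)\frac1B-1\bigg)\bigg)=\exp\Big(pn\Big(-1+\frac2B\Big)\Big)=\exp\bigg(pn\cdot\Big(-\frac{1}{pn}\ln\frac{\alpha}{1-\delta}\Big)\bigg)=(1-\delta)\frac1\alpha.\tag{4}$$
This shows the second condition of Theorem 4 for $\kappa=\ln(B)$.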
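The computations in (3) and (4) can be checked numerically. The following Python sketch is an illustration only: the function names `exact_mgf`, `product_form`, and `exponential_bound` are hypothetical helpers, and the parameter values are assumptions chosen for the demonstration. It sums the expectation exactly over the number of flipped bits in the two groups, compares it with the product form from (3), and confirms that at $d=\tilde b$ and $\kappa=\ln(B)$ the exponential bound equals $(1-\delta)/\alpha$.

```python
import math


def exact_mgf(n, d, p, kappa):
    """Exact E[exp(-kappa * (g(y) - g(x)))] when g(x) = d.

    i counts flipped bits among the d bits that differ from the target
    (each lowers the distance by 1), j counts flipped bits among the
    other n - d bits (each raises the distance by 1).
    """
    total = 0.0
    for i in range(d + 1):
        prob_i = math.comb(d, i) * p**i * (1 - p) ** (d - i)
        for j in range(n - d + 1):
            prob_j = math.comb(n - d, j) * p**j * (1 - p) ** (n - d - j)
            total += prob_i * prob_j * math.exp(-kappa * (j - i))
    return total


def product_form(n, d, p, kappa):
    """Closed form from (3): product of the per-bit moment-generating functions."""
    return (1 + p * (math.exp(kappa) - 1)) ** d * (1 - p * (1 - math.exp(-kappa))) ** (n - d)


def exponential_bound(n, d, p, kappa):
    """Right-hand side of (3); d may be real-valued here."""
    return math.exp(d * p * math.exp(kappa) + (n - d) * p * math.exp(-kappa) - p * n)


# Assumed demo parameters: n = 500, p = 1/n, alpha = 2, delta = 0.01.
n, p, alpha, delta = 500, 1 / 500, 2.0, 0.01
eps = 1 - math.log(alpha / (1 - delta)) / (p * n)
B = 2 / eps
b_tilde = n / (B * B - 1)
print(exact_mgf(n, 11, p, math.log(B)), product_form(n, 11, p, math.log(B)))
print(exponential_bound(n, b_tilde, p, math.log(B)), (1 - delta) / alpha)
```

The first printed pair should agree up to rounding, and the second pair illustrates the chain of equalities in (4).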
To show that the third condition of Theorem 4 is satisfied, assume that $g(x)\ge b$. We first note the following. Let $x'\in\Omega$ with $g(x')=b$ and let $y'=\mathrm{mut}(x')$. By Lemma 2, $g(y)$ stochastically dominates $g(y')$. Consequently, by Lemma 1,
$$E[\exp(-\kappa(g(y)-b))]\le E[\exp(-\kappa(g(y')-b))]=E[\exp(-\kappa(g(y')-g(x')))]\le(1-\delta)\frac1\alpha,$$
where the last estimate exploits that we have shown the second condition also for $g(x)=b$. Hence, with $D=\max\{(1-\delta)\frac1\alpha,\delta\}$ we have also shown the third condition of Theorem 4 (including the requirement $D\ge\delta$).
We finally show that the starting condition in Theorem 5 is satisfied. Using the moment-generating function of a binomially distributed random variable (which is nothing more than the arguments used in (3)), this follows immediately from the following estimate, valid for a random search point $x$:
$$E[\exp(-\kappa g(x))]=\Big(\frac12+\frac12\exp(-\kappa)\Big)^{n}\le\exp\Big(-\kappa\frac{n}{B^2-1}\Big)\le\exp(-\kappa b)\le\frac{D}{\delta}\exp(-\kappa b).$$
The estimate above is easy to see apart from the first inequality, which requires some elementary calculus. Recalling $\kappa=\ln(B)$, this inequality is equivalent to $\frac12+\frac{1}{2B}\le B^{-1/(B^2-1)}$. The latter is satisfied for $B=2$. Since its left-hand side is decreasing in $B$, we now show that the right-hand side is increasing in $B$ and obtain that the inequality is satisfied for all $B\ge2$ (and we note that always $B\ge2$ since $\varepsilon\le1$). By the monotonicity of the logarithm, the function $B\mapsto B^{-1/(B^2-1)}$ is increasing (in $\mathbb{R}_{>0}$) if and only if $B\mapsto\ln(B^{-1/(B^2-1)})=-\frac{\ln B}{B^2-1}$ is increasing, which is easily seen to be true by noting that its derivative $B\mapsto\frac{B^2(2\ln(B)-1)+1}{B(B^2-1)^2}$ is positive for $B\ge2$.
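As a numerical sanity check of this calculus step (not a replacement for the argument), the following Python sketch evaluates both sides of $\frac12+\frac{1}{2B}\le B^{-1/(B^2-1)}$ on a grid and compares the stated derivative with a central finite difference. The grid range, step sizes, and function names are assumptions made for this illustration.

```python
import math


def lhs(B):
    """Left-hand side 1/2 + 1/(2B)."""
    return 0.5 + 0.5 / B


def rhs(B):
    """Right-hand side B^(-1/(B^2 - 1))."""
    return B ** (-1.0 / (B * B - 1.0))


def log_rhs(B):
    """ln of the right-hand side, that is, -ln(B) / (B^2 - 1)."""
    return -math.log(B) / (B * B - 1.0)


def log_rhs_derivative(B):
    """Derivative of log_rhs as stated in the text."""
    return (B * B * (2 * math.log(B) - 1) + 1) / (B * (B * B - 1) ** 2)


# Assumed grid: B from 2 to about 1002 in steps of 0.01.
grid = [2 + 0.01 * k for k in range(100_000)]
inequality_holds = all(lhs(B) <= rhs(B) for B in grid)
derivative_positive = all(log_rhs_derivative(B) > 0 for B in grid)

# Central finite difference at B = 3 as a check of the derivative formula.
h = 1e-5
finite_difference = (log_rhs(3 + h) - log_rhs(3 - h)) / (2 * h)
print(inequality_holds, derivative_positive, finite_difference, log_rhs_derivative(3))
```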
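Consequently, the random initial population $P^{(0)}$ satisfies
$$\sum_{i=1}^{\lambda}E[\exp(-\kappa g(P_i^{(0)}))]\le\frac{\lambda D\exp(-\kappa b)}{\delta},$$
as required in Theorem 5. From the conclusion of Theorem 4, we obtain
$$E[T]\ge\frac{\delta}{2D\lambda}\exp(\kappa(b-a))-\frac12=\frac{1}{2\lambda}\min\Big\{\frac{\delta\alpha}{1-\delta},1\Big\}\exp\bigg(\ln\bigg(\frac{2}{1-\frac{1}{pn}\ln(\frac{\alpha}{1-\delta})}\bigg)(b-a)\bigg)-\frac12,\qquad\Pr[T<L]\le L\lambda\frac{D}{\delta}\exp(-\kappa(b-a))=L\lambda\max\Big\{\frac{1-\delta}{\delta\alpha},1\Big\}\exp\bigg(-\ln\bigg(\frac{2}{1-\frac{1}{pn}\ln(\frac{\alpha}{1-\delta})}\bigg)(b-a)\bigg).$$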

As a simple example for an application of this result, let us consider the classic $(\mu,\lambda)$ EA (with uniform selection for variation, truncation selection for inclusion into the next generation, and mutation rate $p=\frac1n$) with $\lambda=2\mu$ optimizing some function $f:\{0,1\}^n\to\mathbb{R}$, $n=500$, with unique global optimum. For simplicity, let us take as performance measure $\lambda T$, that is, the number of fitness evaluations in all iterations up to the one in which the optimum was found. Since $\lambda=2\mu$, we have $\alpha=2$. By taking $\delta=0.01$, we obtain a concrete lower bound of an expected number of more than 13 million fitness evaluations until the optimum is found (regardless of $\mu$ and $f$).
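The number in this example can be reproduced with a few lines of Python. The sketch below is illustrative: `theorem6_main_term` is a hypothetical helper, and we assume $a=0$ and the largest admissible integer $b=\lfloor\tilde b\rfloor$. It returns the main term of the bound on $\lambda E[T]$, that is, $\frac12\min\{\frac{\delta\alpha}{1-\delta},1\}\exp(\ln(B)(b-a))$; the bound of Theorem 6 on $\lambda E[T]$ is this value minus $\lambda/2$.

```python
import math


def theorem6_main_term(n, p, alpha, delta, a=0):
    """Main term of the Theorem 6 lower bound on lambda * E[T].

    Uses b = floor(b_tilde), the largest integer allowed by the theorem.
    Returns (main_term, b, B).
    """
    eps = 1 - math.log(alpha / (1 - delta)) / (p * n)
    if eps <= 0:
        raise ValueError("need ln(alpha / (1 - delta)) < p * n")
    B = 2 / eps
    b = math.floor(n / (B * B - 1))
    if b <= a:
        raise ValueError("need a < b")
    factor = min(delta * alpha / (1 - delta), 1.0)
    return 0.5 * factor * math.exp(math.log(B) * (b - a)), b, B


main_term, b, B = theorem6_main_term(n=500, p=1 / 500, alpha=2.0, delta=0.01)
print(f"b = {b}, B = {B:.4f}, main term = {main_term:.4g}")
```

For these parameters, $b=11$ and the main term is slightly above $1.3\cdot10^7$, in line with the 13 million fitness evaluations stated above.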

Since Theorem 6 is slightly technical, we now formulate the following corollary, which removes the variable $δ$ without significantly weakening the result. We note that the proof of this result applies Theorem 6 with a non-constant $δ$, so we do not see how such a result could have been proven from Lehre (2010).

Corollary 7:

Consider a PSM process as in Theorem 6. Let $x*∈Ω$ be the target of the process. For all $x∈Ω$, let $g(x):=H(x,x*)$ denote the Hamming distance from the target. Assume that there is an $α≥1$ such that

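• $\ln(\alpha)\le p(n-1)$, which is equivalent to $\gamma:=1-\frac{\ln\alpha}{pn}\ge\frac1n$;

• there is an $a\le b:=\Big\lfloor\Big(1-\frac4n\Big)n\frac{1}{\frac{4}{\gamma^2}-1}\Big\rfloor$ such that for all populations $P\in\mathcal{P}$ with $\min\{g(P_i)\mid i\in[1..\lambda]\}>a$ and for all $i\in[1..\lambda]$, we have $E[R(i,P)]\le\alpha$.

Then the first time $T:=\min\{t\ge0\mid\exists i\in[1..\lambda]:g(P_i^{(t)})\le a\}$ that the population contains an individual in distance $a$ or less from $x^*$ satisfies
$$E[T]\ge\frac{p\alpha}{4\lambda n}\min\Big\{1,\frac{2n}{p\alpha}\Big\}\exp\Big(\ln\Big(\frac2\gamma\Big)(b-a)\Big)-\frac12,\qquad\Pr[T<L]\le L\lambda\frac{2n}{p\alpha}\max\Big\{1,\frac{p\alpha}{2n}\Big\}\exp\Big(-\ln\Big(\frac2\gamma\Big)(b-a)\Big).$$
In particular, if $a\le(1-\varepsilon)b$ for some constant $\varepsilon>0$, then $T\lambda$ is super-polynomial in $n$ (in expectation and with high probability) when $\gamma=\omega(n^{-1/2})$ and at least exponential when $\gamma=\Omega(1)$.

The main argument is employing Theorem 6 with $\delta=\frac{p}{2n}$ and computing that this small $\delta$ has no significant influence on the exponential term of the bounds.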

Proof of Corollary 7:
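We apply Theorem 6 with $\delta=\frac{p}{2n}$. Since $\delta\le\frac12$, we have $1-\delta\ge\exp(-2\delta)$ and thus $\ln(\frac{\alpha}{1-\delta})=\ln(\alpha)-\ln(1-\delta)\le\ln(\alpha)+2\delta$. Consequently, $\varepsilon:=1-\frac{1}{pn}\ln(\frac{\alpha}{1-\delta})$ defined as in Theorem 6 satisfies
$$\varepsilon\ge1-\frac{1}{pn}(\ln(\alpha)+2\delta)=1-\frac{\ln\alpha}{pn}-\frac{1}{n^2}\ge\Big(1-\frac{\ln\alpha}{pn}\Big)\Big(1-\frac1n\Big)=\gamma\Big(1-\frac1n\Big),$$
where the second inequality uses our assumption $\ln\alpha\le p(n-1)$. Now
$$\tilde b:=n\frac{1}{\frac{4}{\varepsilon^2}-1}\ge n\frac{1}{\frac{4}{\gamma^2(1-\frac1n)^2}-1}\ge n\frac{1}{\frac{4-\gamma^2(1-\frac2n)}{\gamma^2(1-\frac2n)}}=\frac{n\gamma^2(1-\frac2n)}{4-\gamma^2+\gamma^2\frac2n}\ge\frac{n\gamma^2(1-\frac2n)}{4-\gamma^2+(4-\gamma^2)\frac2n}=\frac{n\gamma^2(1-\frac2n)}{(4-\gamma^2)(1+\frac2n)}\ge\frac{n\gamma^2(1-\frac2n)^2}{4-\gamma^2}\ge\Big(1-\frac4n\Big)n\frac{1}{\frac{4}{\gamma^2}-1}.$$
With these estimates, $b\le\lfloor\tilde b\rfloor$, the observation that $\varepsilon\le\gamma$, and the definition of $\delta$, the bounds of Theorem 6 become
$$E[T]\ge\frac{1}{2\lambda}\min\Big\{\frac{\delta\alpha}{1-\delta},1\Big\}\exp\bigg(\ln\bigg(\frac{2}{1-\frac{1}{pn}\ln(\frac{\alpha}{1-\delta})}\bigg)(b-a)\bigg)-\frac12\ge\frac{p\alpha}{4\lambda n}\min\Big\{1,\frac{2n}{p\alpha}\Big\}\exp\Big(\ln\Big(\frac2\gamma\Big)(b-a)\Big)-\frac12,\qquad\Pr[T<L]\le L\lambda\max\Big\{\frac{1-\delta}{\delta\alpha},1\Big\}\exp\bigg(-\ln\bigg(\frac{2}{1-\frac{1}{pn}\ln(\frac{\alpha}{1-\delta})}\bigg)(b-a)\bigg)\le L\lambda\frac{2n}{p\alpha}\max\Big\{1,\frac{p\alpha}{2n}\Big\}\exp\Big(-\ln\Big(\frac2\gamma\Big)(b-a)\Big).$$
For the asymptotic statements, we observe first that $\frac{p\alpha}{4n}\min\{1,\frac{2n}{p\alpha}\}=\min\{\frac{p\alpha}{4n},\frac12\}\ge\min\{\frac{\alpha\ln(\alpha)}{4n(n-1)},\frac12\}$ since $p\ge\ln(\alpha)/(n-1)$ due to our assumption that $\ln(\alpha)\le p(n-1)$. Hence, $E[T]\lambda$ is super-polynomial or at least exponential if and only if the term $\exp(\ln(2/\gamma)(b-a))$ is. So it suffices to regard the latter term.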

We note that $b=\Theta(n\gamma^2)$ since $\gamma$ is always at most one. By assumption, $(b-a)=\Theta(b)$. Assume first that $\gamma=\omega(n^{-1/2})$. If $\gamma\le n^{-1/4}$, then $\exp(\ln(2/\gamma)(b-a))=(2/\gamma)^{b-a}\ge(2n^{1/4})^{\omega(1)}$, which is super-polynomial. If $\gamma\ge n^{-1/4}$, then $b-a=\Omega(n^{1/2})$ and $\exp(\ln(2/\gamma)(b-a))\ge2^{b-a}$ is again super-polynomial. This shows the claimed super-polynomiality for $\gamma=\omega(n^{-1/2})$.

For $γ=Ω(1)$, we have $b-a=Θ(b)=Θ(n)$ and thus $exp(ln(2/γ)(b-a))≥exp(ln(2)(b-a))=exp(Θ(n))$ is exponential in $n$.

The asymptotic statements of the with-high-probability claims follow analogously.
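To get a feeling for the quantities in Corollary 7, the following Python sketch computes $\gamma$, the largest admissible $b$, and the natural logarithm of the main term $\frac{p\alpha}{4\lambda n}\min\{1,\frac{2n}{p\alpha}\}\exp(\ln(\frac2\gamma)(b-a))$ of the bound on $E[T]$ (the logarithm avoids floating-point overflow). The function names and the example parameters ($n=10^4$, $p=1/n$, $\alpha=2$, $\lambda=100$, $a=0$) are assumptions made for this illustration.

```python
import math


def corollary7_parameters(n, p, alpha):
    """Return (gamma, b) with b the largest value allowed in Corollary 7."""
    gamma = 1 - math.log(alpha) / (p * n)
    if gamma < 1 / n:
        raise ValueError("need ln(alpha) <= p * (n - 1)")
    b = math.floor((1 - 4 / n) * n / (4 / gamma**2 - 1))
    return gamma, b


def corollary7_log_main_term(n, p, alpha, lam, a):
    """Natural log of the main term of the lower bound on E[T]."""
    gamma, b = corollary7_parameters(n, p, alpha)
    prefactor = p * alpha / (4 * lam * n) * min(1.0, 2 * n / (p * alpha))
    return math.log(prefactor) + math.log(2 / gamma) * (b - a)


gamma, b = corollary7_parameters(n=10_000, p=1e-4, alpha=2.0)
log_term = corollary7_log_main_term(n=10_000, p=1e-4, alpha=2.0, lam=100, a=0)
print(f"gamma = {gamma:.5f}, b = {b}, ln(main term) = {log_term:.1f}")
```

With a constant $\gamma$, the exponent $\ln(\frac2\gamma)(b-a)$ grows linearly in $n$, which is the exponential regime discussed above.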

## 6  Standard Bit Mutation with Random Mutation Rate

To analyze a uniform mixing hyper-heuristic which uses standard-bit mutation with a mutation rate randomly chosen from a finite set of alternatives, Dang and Lehre (2016b, Theorem 2) extend Theorem 6 to such mutation operators. They do not give a proof of their result, stating that it would be similar to the proof of the result for classic standard bit mutation (Lehre, 2010, Theorem 4). Since we did not find this so obvious, we reprove this result now with our methods. The non-asymptoticity of our result allows to extend it to super-constant numbers of mutation rates and to mutation rates other than $Θ(1/n)$. We note that such situations appear naturally with the heavy-tailed mutation operator proposed in Doerr et al. (2017).

We show the following result, which extends Theorem 6.

Theorem 8:

Let $n\in\mathbb{N}$. Let $m\in\mathbb{N}$, $p_1,\dots,p_m\in[0,\frac12]$, and $q_1,\dots,q_m\in[0,1]$ such that $\sum_{i=1}^{m}q_i=1$. Let $\mathrm{mut}$ be the mutation operator which, in each application independently, chooses an $I\in[1..m]$ with probability $\Pr[I=i]=q_i$ for all $i\in[1..m]$ and then applies standard bit mutation with mutation rate $p_I$.

Consider a PSM process (see Section 4.1) with search space $\Omega=\{0,1\}^n$, using this mutation operator $\mathrm{mut}(\cdot)$, and such that each initial individual is uniformly distributed in $\Omega$ (not necessarily independently). Let $x^*\in\Omega$ be the target of the process. For all $x\in\Omega$, let $g(x):=H(x,x^*)$ denote the Hamming distance from the target.

Let $\alpha\ge1$, $0<\delta<1$, and $B>2$ such that
$$\sum_{i=1}^{m}q_i\exp\Big(-p_in\Big(1-\frac2B\Big)\Big)\le(1-\delta)\frac1\alpha.\tag{5}$$
Let $a,b$ be integers such that $0\le a<b\le n\frac{1}{B^2-1}$.

Selection condition: Assume that for all populations $P\in\mathcal{P}$ with $\min\{g(P_i)\mid i\in[1..\lambda]\}>a$ and all $i\in[1..\lambda]$ with $g(P_i)<b$, we have $E[R(i,P)]\le\alpha$.

Then the first time $T:=\min\{t\ge0\mid\exists i\in[1..\lambda]:g(P_i^{(t)})\le a\}$ that the population contains an individual in distance $a$ or less from $x^*$ satisfies
$$E[T]\ge\frac{1}{2\lambda}\min\Big\{\frac{\delta\alpha}{1-\delta},1\Big\}\exp(\ln(B)(b-a))-\frac12,\qquad\Pr[T<L]\le L\lambda\max\Big\{\frac{1-\delta}{\delta\alpha},1\Big\}\exp(-\ln(B)(b-a)).$$

It is clear that when using standard bit mutation with a random mutation rate, the drift (regardless of whether we regard the fitness itself or an exponential transformation of it) is a convex combination of the drift values of the individual mutation operators. The reason why this argument does not immediately extend Theorem 6 to random mutation rates is that the mutation rate also occurs in the exponential term $\exp(pn(-1+\frac2B))$ in equation (4). Apart from this difficulty, however, we can reuse large parts of the proof of Theorem 6.

Proof of Theorem 8:
Let $\kappa:=\ln(B)$. Let $x\in\Omega$, $d:=g(x)$, and $y=\mathrm{mut}(x)$. Assuming $d\le b$, analogous to the proof of Theorem 6, we have
$$E[\exp(-\kappa(g(y)-g(x)))]=\sum_{i=1}^{m}q_i(1+p_i(e^{\kappa}-1))^{d}(1-p_i(1-e^{-\kappa}))^{n-d}\le\sum_{i=1}^{m}q_i\exp\Big(p_in\Big(-1+\frac2B\Big)\Big)\le(1-\delta)\frac1\alpha.$$
This shows the second condition of Theorem 4 and, with the same domination argument as in the proof of Theorem 6 and $D=\max\{(1-\delta)\frac1\alpha,\delta\}$, also the third condition of Theorem 4. The starting condition of Theorem 5 follows as in the proof of Theorem 6 by noting that again $B\ge2$. Now Theorem 4 is applicable and as in the last few lines of the proof of Theorem 6 we show our claim (note that now it suffices to simply replace $\kappa$ by $\ln B$ and not by the more complicated logarithmic term there).

Equation (5) defining the admissible values for $B$ and thus for the starting point $b$ of the negative drift regime is not very convenient to work with in general. We stated it nevertheless because in particular situations it might be useful, for example, to show an inapproximability result, that is, that a certain algorithm cannot come closer to the optimum than by a certain margin in subexponential time. The following weaker assumption is easier to work with and should, in most cases, give satisfactory results as well.

Lemma 9:
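Assume that in the notation of Theorem 8, we have
$$\sum_{i=1}^{m}q_i\exp(-p_in)\le(1-\gamma)\frac1\alpha,$$
for some $0<\gamma<1$. Then (5) and $B>2$ are satisfied for $\delta=\frac12\gamma$ and
$$B=2\bigg(1-\frac{\ln\frac{1-\gamma/2}{\alpha}}{\ln\frac{1-\gamma}{\alpha}}\bigg)^{-1}.$$
Proof:
Let $\varepsilon=1-\frac{\ln\frac{1-\gamma/2}{\alpha}}{\ln\frac{1-\gamma}{\alpha}}$ so that $\frac2B=\varepsilon$. We note that $0<\varepsilon<1$ and thus $B>2$. By the concavity of $z\mapsto z^{1-\varepsilon}$ (exponentiation with an exponent smaller than one), we have
$$\sum_{i=1}^{m}q_i\exp\Big(-p_in\Big(1-\frac2B\Big)\Big)\le\Big(\sum_{i=1}^{m}q_i\exp(-p_in)\Big)^{1-\varepsilon}\le\Big(\frac{1-\gamma}{\alpha}\Big)^{1-\varepsilon}=\Big(1-\frac\gamma2\Big)\frac1\alpha.$$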
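Lemma 9 is easy to evaluate numerically. The following Python sketch is an illustration under assumed inputs: the helper names are hypothetical, $\gamma$ is taken as the largest value satisfying the assumption of the lemma, and the two-rate example mixing $p=1/n$ and $p=2/n$ with equal probability is made up for the demonstration. It computes $\delta$ and $B$ and then evaluates both sides of (5).

```python
import math


def lemma9_parameters(rates, probs, n, alpha):
    """Return (gamma, delta, B) as in Lemma 9.

    gamma is chosen as large as the assumption of the lemma allows, that is,
    sum_i q_i exp(-p_i n) = (1 - gamma) / alpha.
    """
    s = sum(q * math.exp(-p * n) for p, q in zip(rates, probs))
    gamma = 1 - alpha * s
    if not 0 < gamma < 1:
        raise ValueError("assumption of Lemma 9 not satisfiable with 0 < gamma < 1")
    delta = gamma / 2
    eps = 1 - math.log((1 - gamma / 2) / alpha) / math.log((1 - gamma) / alpha)
    return gamma, delta, 2 / eps


def condition5_sides(rates, probs, n, alpha, delta, B):
    """Left- and right-hand side of condition (5) of Theorem 8."""
    lhs = sum(q * math.exp(-p * n * (1 - 2 / B)) for p, q in zip(rates, probs))
    return lhs, (1 - delta) / alpha


# Assumed example: n = 100, rates 1/n and 2/n with probability 1/2 each, alpha = 2.
n, alpha = 100, 2.0
rates, probs = [1 / n, 2 / n], [0.5, 0.5]
gamma, delta, B = lemma9_parameters(rates, probs, n, alpha)
lhs5, rhs5 = condition5_sides(rates, probs, n, alpha, delta, B)
print(f"gamma = {gamma:.4f}, B = {B:.4f}, (5): {lhs5:.5f} <= {rhs5:.5f}")
```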

If in an asymptotic setting $γ$ and $α$ can be taken as constants, then this lemma and Theorem 8 show an exponential lower bound on the runtime. This proves (Dang and Lehre, 2016b, Theorem 2) and extends it to mutation rates that are not necessarily $Θ(1/n)$.

As an example where mutation rates other than $\Theta(1/n)$ occur, we now regard the heavy-tailed mutation operator proposed in Doerr et al. (2017). This operator was shown to give a uniformly good performance of the $(1+1)$ EA on all jump functions, whereas each fixed mutation rate was seen to be good only for a small range of jump sizes. The heavy-tailed operator and variations of it have shown a good performance also in other works, for example, Mironovich and Buzdalov (2017); Friedrich et al. (2018a, 2018b); Friedrich, Quinzan et al. (2018); Wu et al. (2018); Antipov et al. (2020a, 2020b); Antipov and Doerr (2020); and Ye et al. (2020). The heavy-tailed mutation operator is nothing else than standard bit mutation with a random mutation rate, chosen from a heavy-tailed distribution. Doerr et al. (2017) defined it as follows. Let $\beta>1$ be a constant. This is the only parameter of the mutation operator; since its precise value is of little importance, Doerr et al. (2017) simply propose to take $\beta=1.5$. In each invocation of the mutation operator, a number $\alpha\in[1..N]$, $N:=\lfloor\frac12n\rfloor$, is chosen from the power-law distribution with exponent $\beta$. Hence, $\Pr[\alpha=i]=(C_N^\beta)^{-1}i^{-\beta}$, where $C_N^\beta$ is the normalizing constant $C_N^\beta:=\sum_{i=1}^{N}i^{-\beta}$. Once $\alpha$ is determined, standard bit mutation with mutation rate $p=\frac\alpha n$ is employed.

For fixed $N$, the expression on the left-hand side in Lemma 9 is $A_N=\sum_{i=1}^{N}(C_N^\beta)^{-1}i^{-\beta}e^{-i}$. This is a convex combination of the terms $e^{-i}$ and by comparing the coefficients, we easily see that this expression is decreasing in $N$. Computing $A_{100}<0.178$, we see that for all $n\ge200$, a reproduction number of at most $\alpha\le\frac{1}{0.178}\approx5.618$ is small enough to lead to exponential runtimes. This is higher than for standard bit mutation with mutation rate $p=\frac1n$, where only $\alpha\le e\approx2.718$ suffices to show exponential runtimes. This observation fits our general intuition that larger mutation rates can be destructive, from which in particular non-elitist algorithms suffer.

For the limiting value $A=\lim_{N\to\infty}A_N$ we note that $A\ge A^-:=\sum_{i=1}^{100}(C_\infty^\beta)^{-1}i^{-\beta}\exp(-i)\approx0.164004$ and $A\le A^+:=\sum_{i=1}^{100}(C_\infty^\beta)^{-1}i^{-\beta}\exp(-i)+\exp(-101)\approx0.164004$. Hence, for $n$ sufficiently large, even a value of $\alpha\approx\frac{1}{0.164004}=6.0974$ admits exponential lower bounds for runtimes.
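These constants can be recomputed with a short Python sketch, assuming $\beta=1.5$. The function names are hypothetical, and the limiting normalizing constant $C_\infty^\beta$ is approximated here by a partial sum plus an Euler-Maclaurin tail estimate, which is a choice made for this illustration and not part of the argument above.

```python
import math


def power_law_norm(N, beta):
    """Normalizing constant C_N^beta = sum_{i=1}^N i^(-beta)."""
    return sum(i ** -beta for i in range(1, N + 1))


def copy_term(N, beta=1.5):
    """A_N = sum_{i=1}^N (C_N^beta)^(-1) i^(-beta) e^(-i)."""
    weighted = sum(i ** -beta * math.exp(-i) for i in range(1, N + 1))
    return weighted / power_law_norm(N, beta)


def copy_term_limit(beta=1.5, K=10_000):
    """Approximation of A = lim A_N.

    C_infinity^beta is estimated as the partial sum up to K plus the
    Euler-Maclaurin tail K^(1-beta)/(beta-1) - K^(-beta)/2 + beta*K^(-beta-1)/12.
    Terms of the numerator beyond i = 200 are negligible and dropped.
    """
    tail = K ** (1 - beta) / (beta - 1) - K ** -beta / 2 + beta * K ** (-beta - 1) / 12
    zeta = power_law_norm(K, beta) + tail
    weighted = sum(i ** -beta * math.exp(-i) for i in range(1, 201))
    return weighted / zeta


a_100 = copy_term(100)
a_limit = copy_term_limit()
print(f"A_100 = {a_100:.6f}, 1/A_100 = {1 / a_100:.4f}")
print(f"A     = {a_limit:.6f}, 1/A     = {1 / a_limit:.4f}")
```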

Without going into details, and in particular without full proofs, we note that these estimates are tight, and this for all mutation operators of the type discussed in this section. Consider such a mutation operator such that $\sum_{i=1}^{m}q_i\exp(-p_in)\ge(1+\delta)\frac1\alpha$ for some $\delta>0$. We take the $(\mu,\lambda)$ EA optimizing OneMax as example. Assume that at some time we have in our parent population $k$ individuals on the highest non-empty fitness level $L$. In expectation, each of them generates $\alpha=\lambda/\mu$ offspring. Each of these offspring is an exact copy of the parent with probability $\sum_{i=1}^{m}q_i\exp(-p_in)\ge(1+\delta)\frac1\alpha$. Consequently, in the next generation the expected number of individuals on level $L$ or higher (as long as level $L$ is not full) is $(1+\delta)k$. This is enough to show polynomial runtimes via the level-based method (Lehre, 2011; Dang and Lehre, 2016a; Corus et al., 2018; Doerr and Kötzing, 2019) when $\lambda$ is large enough. For example, the computation just made shows that condition (G2) in Doerr and Kötzing (2019, Theorem 3.2) is satisfied.

We note that this tightness stems from the fact that the term $\sum_{i=1}^{m}q_i\exp(-p_in)$ appears both in the drift computation here and, via the probability to generate a copy of a parent, in the fitness level method for populations. We do not think that this is a coincidence, but leave working out the details to a future work.

## 7  Fitness Proportionate Selection

In this section, we apply our method to a mutation-only version of the simple genetic algorithm (simple GA). We note that this algorithm traditionally is used with crossover (Goldberg, 1989). The mutation-only version has been regarded in the runtime analysis community mostly because runtime analyses for crossover-based algorithms are extremely difficult. While the first runtime analysis for the mutation-only version (Neumann et al., 2009) appeared in 2009 and showed a near-exponential lower bound on OneMax for arbitrary polynomially bounded population sizes, the first analysis of the crossover-based version from 2012 (Oliveto and Witt, 2012b) could only show a significantly sub-exponential lower bound ($2^{n^c}$ for a constant $c$ which is at most $\frac{1}{80}$) and this for population sizes below $n^{1/8}$. We note that the current best result (Oliveto and Witt, 2015) gives a similar lower bound for population sizes below $n^{1/4}$. Both works call these runtimes exponential, and we acknowledge that this definition for exponential runtimes exists, but given the substantial difference between $2^{n^{1/80}}$ (which is less than 3.5 for all $n\le10^{20}$) and $\exp(\Theta(n))$ we prefer to reserve the notion “exponential” for the latter.

The mutation-only version of the simple GA with population size $\mu\in\mathbb{N}$ is described in Algorithm 3. This algorithm starts with a population $P^{(0)}$ of $\mu$ random individuals from $\{0,1\}^n$. In each iteration $t=1,2,3,\dots$, it computes from the previous population $P^{(t-1)}$ a new population $P^{(t)}$ by $\mu$ times independently selecting an individual from $P^{(t-1)}$ via fitness proportionate selection and mutating it via standard bit mutation with mutation rate $p=\frac1n$.
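The description above translates into only a few lines of code. The following Python sketch is a minimal illustration, not the reference implementation of Algorithm 3: the function names are hypothetical, bit strings are plain lists of 0/1 values, the fitness is OneMax, and in the degenerate case that all individuals have fitness zero we assume uniform selection.

```python
import random


def one_max(x):
    """OneMax fitness: the number of one-bits."""
    return sum(x)


def simple_ga_step(population, rng):
    """One iteration: mu times fitness proportionate selection plus mutation."""
    mu, n = len(population), len(population[0])
    weights = [one_max(x) for x in population]
    if sum(weights) == 0:
        weights = [1] * mu  # assumed fallback: uniform selection
    parents = rng.choices(population, weights=weights, k=mu)
    p = 1 / n  # standard bit mutation rate
    return [[bit ^ (rng.random() < p) for bit in parent] for parent in parents]


def run_simple_ga(n, mu, iterations, seed=0):
    """Run the sketch and return the final population and the best fitness seen."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(mu)]
    best = max(one_max(x) for x in population)
    for _ in range(iterations):
        population = simple_ga_step(population, rng)
        best = max(best, max(one_max(x) for x in population))
    return population, best


final_population, best_fitness = run_simple_ga(n=50, mu=20, iterations=100)
print(best_fitness)
```

Each call of `simple_ga_step` uses $\mu$ fitness evaluations, which is the cost measure used in Theorem 10.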

The precise known results for the performance of Algorithm 3 on the OneMax benchmark are the following. Neumann et al. (2009, Theorem 8) showed that with $\mu\le\mathrm{poly}(n)$ it needs with high probability more than $2^{n^{1-O(1/\log\log n)}}$ iterations to find the optimum of the OneMax function or any search point in Hamming distance at most $0.003n$ from it. This is only a subexponential lower bound. In Lehre (2011, Corollary 13), building on the lower bound method from Lehre (2010), a truly exponential lower bound is shown for the weaker task of finding a search point in Hamming distance at most $0.029n$ from the optimum, but only for a relatively large population size of $\mu\ge n^3$ (and again $\mu\le\mathrm{poly}(n)$).

We now extend this result to arbitrary $μ$, that is, we remove the conditions $μ≥n3$ and $μ≤poly(n)$. To obtain the best known constant 0.029 for how close to the optimum the algorithm cannot go in subexponential time, we have to compromise with the constants in the runtime, which consequently are only of a theoretical interest. We therefore do not specify the base of the exponential function or the leading constant. We note that this would have been easily possible since we only use a simple additive Chernoff bound and Corollary 7. We further note that Lehre (2011) also shows lower bounds for a scaled version of fitness proportionate selection and a general $Θ(1/n)$ mutation rate. This would also be possible with our approach and would again remove the conditions on $λ$, but we do not see that the additional effort is justified here.

Theorem 10:

There is a $T=exp(Ω(n))$ such that the mutation-only simple GA optimizing OneMax with any population size $μ$ with probability $1-exp(-Ω(n))$ does not find any solution $x$ with $OneMax(x)≥0.971n$ within $T$ fitness evaluations.

The main difficulty in proving lower bounds for algorithms using fitness proportionate selection is that the reproduction number is non-trivial to estimate. If all but one individual have a fitness of zero, then this individual is selected $\mu$ times. Hence, $\mu$ is the only general upper bound for the reproduction number. The previous works and ours overcome this difficulty by arguing that the average fitness in the population cannot significantly drop below the initial value of $n/2$, which immediately yields that an individual with fitness $k$ has a reproduction number of roughly at most $\frac{k}{n/2}$.

While it is natural that the typical fitness of an individual should not drop far below $n/2$, making this argument precise is not completely trivial. In Neumann et al. (2009, Lemma 6), it was informally argued that the situation with fitness proportionate selection cannot be worse than with uniform selection. For the latter situation a union bound over all lineages of individuals is employed and a negative-drift analysis from Oliveto and Witt (2008, Section 3) is used for a single lineage. The analysis in Lehre (2011, Lemma 9) builds on the (positive) drift stemming from standard bit mutation when the fitness is below $n/2$ (this argument needs a mutation rate of at least $\Omega(1/n)$) and the independence of the offspring (here the lower bound $\lambda\ge n^3$ is needed to admit the desired Chernoff bound estimates).

Our proof relies on a natural domination argument which shows that at all times all individuals are at least as good as random individuals in the sense of stochastic domination in fitness. This allows us to use a simple Chernoff and union bound to argue that with high probability, for a long time all individuals have a fitness of at least $(\frac12-\varepsilon)n$. The remainder of the proof is an application of Corollary 7. Here the lower bound (Lehre, 2010, Theorem 4) would have been applicable as well with the main difference that there one has to deal with the constant $\delta$, which does not exist in Corollary 7.

We start by proving the key argument used in the proof of Theorem 10, namely that at each time $t$ for each individual $i∈[1..μ]$ the fitness stochastically dominates (see Section 2) the one of a random individual. We denote by $Bin(n,p)$ the binomial distribution with parameters $n$ and $p$. With a slight abuse of notation, we write $Bin(n,p)⪯Y$ to denote that $Y$ stochastically dominates $X$ when $X$ is binomially distributed with parameters $n$ and $p$.

In this notation, our goal is to show that for all times $t$ and all $i\in[1..\mu]$, we have $\mathrm{Bin}(n,\frac12)\preceq\mathrm{OneMax}(P_i^{(t)})$. This statement appears easy to believe since fitness proportionate selection, favoring better individuals at least slightly, should not be able to make the population worse. To be on the safe side, we nevertheless prove this statement formally (after the following remark).

We note that another statement that might be easy to believe is not true, namely that the sum of the fitness values of a population at all times $t\ge1$ dominates the sum of the fitness values of a random population (such as the initial population), that is, that $\mathrm{Bin}(\mu n,\frac12)\preceq\sum_{i=1}^{\mu}\mathrm{OneMax}(P_i^{(t)})$. As a counterexample, let $n$ be a multiple of 10 and let us consider the simple GA with $\mu=n$ after one iteration. Let $Y=\sum_{i=1}^{\mu}\mathrm{OneMax}(P_i^{(1)})$. We estimate the probability of the event $Y\le0.4n\mu$. With probability at least $2^{-n}$ we have $\mathrm{OneMax}(P_1^{(0)})=0.4n$. For each $i=2,\dots,\mu$, we have $\mathrm{OneMax}(P_i^{(0)})\le0.5n$ with probability at least 0.5 by the symmetry of the binomial distribution with parameter $p=0.5$. These events are all independent, so with probability at least $2^{-n-\mu+1}$, all of them occur. In this case, for each $i=1,\dots,\mu$ independently, with probability at least $0.4n/(0.4n+0.5n(\mu-1))\ge0.8/\mu$ the $i$-th parent chosen in iteration 1 is $P_1^{(0)}$ and with probability at least $(1-1/n)^n\ge1/4$ the offspring generated from it equals the parent. All these events together occur with probability at least $2^{-n-\mu+1}(0.8/\mu)^{\mu}(1/4)^{\mu}\ge(20n)^{-n}$ (recall that $\mu=n$), which shows $\Pr[Y\le0.4n^2]\ge(20n)^{-n}$. Now for $X\sim\mathrm{Bin}(n\mu,\frac12)$, a simple Chernoff bound argument, e.g., via the additive Chernoff bound (Doerr, 2020d, Theorem 1.10.7), shows that $\Pr[X\le0.4n\mu]\le\exp(-2(0.1n\mu)^2/(n\mu))=\exp(-0.02n\mu)=\exp(-0.02n^2)$. Since this is smaller than $(20n)^{-n}$ for $n$ sufficiently large (namely when $0.02n>\ln(20n)$, so $n\ge460$ suffices), we do not have $X\preceq Y$.

Lemma 11:

Consider a run of the simple GA (Algorithm 3) on the OneMax benchmark. Then for each $t\ge0$ and each $i\in[1..\mu]$, we have $\mathrm{Bin}(n,\frac12)\preceq\mathrm{OneMax}(P_i^{(t)})$.

To prove this result, we use the following auxiliary result, which states that a number uniformly chosen from a collection of non-negative numbers is stochastically dominated by a number chosen from the same collection via an analogue of fitness proportionate selection. To define the latter formally, let $n_1,\dots,n_\mu\in\mathbb{R}_{\ge0}$. For a random variable $v$ we write $v\sim\mathrm{fp}(n_1,\dots,n_\mu)$ if

• in the case that $n_i>0$ for at least one $i\in[1..\mu]$, we have $\Pr[v=i]=\frac{n_i}{\sum_{j=1}^{\mu}n_j}$ for all $i\in[1..\mu]$, and

• in the case that $n_i=0$ for all $i\in[1..\mu]$, we have $\Pr[v=i]=\frac1\mu$ for all $i\in[1..\mu]$.

Lemma 12:

Let $n_1,\dots,n_\mu\in\mathbb{R}_{\ge0}$. Let $u\in[1..\mu]$ be uniformly chosen and $U=n_u$. Let $v\sim\mathrm{fp}(n_1,\dots,n_\mu)$ and $V=n_v$. Then $U\preceq V$.

Proof:
The claim follows immediately from the definition of $\mathrm{fp}(\cdot)$ when $n_i=0$ for all $i\in[1..\mu]$. Hence, let us assume that there is at least one $i\in[1..\mu]$ such that $n_i>0$. Let us for convenience assume that $n_1\le n_2\le\dots\le n_\mu$. Then clearly
$$\frac1i\sum_{j=1}^{i}n_j\le\frac1\mu\sum_{j=1}^{\mu}n_j,$$
and hence,
$$\Pr[V\le n_i]=\frac{\sum_{j=1}^{i}n_j}{\sum_{j=1}^{\mu}n_j}\le\frac i\mu=\Pr[U\le n_i],$$
for all $i\in[1..\mu]$. This suffices to show stochastic domination since both $U$ and $V$ only take the values $n_1,\dots,n_\mu$.
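The domination $U\preceq V$ can be verified exactly for concrete collections by comparing the two distribution functions at every value. The following Python sketch does this with exact rational arithmetic; the function names are hypothetical, and we assume non-negative integer inputs so that `fractions.Fraction` applies directly.

```python
from fractions import Fraction


def uniform_cdf(values, x):
    """Pr[U <= x] when U is a uniformly chosen element of values."""
    return Fraction(sum(1 for v in values if v <= x), len(values))


def fp_cdf(values, x):
    """Pr[V <= x] when V is chosen proportionally to its value.

    If all values are zero, the choice is uniform, as in the definition of fp.
    """
    total = sum(values)
    if total == 0:
        return uniform_cdf(values, x)
    return Fraction(sum(v for v in values if v <= x), total)


def fp_dominates_uniform(values):
    """Check Pr[V <= x] <= Pr[U <= x] at every value x of the collection."""
    return all(fp_cdf(values, x) <= uniform_cdf(values, x) for x in set(values))


print(fp_dominates_uniform([3, 1, 4, 1, 5]))
print(fp_cdf([3, 1, 4, 1, 5], 3), uniform_cdf([3, 1, 4, 1, 5], 3))
```

Checking the distribution functions only at the values of the collection suffices because both random variables take no other values.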

We now show Lemma 11.

Proof of Lemma 11:

We show the claim via induction over time. For the random initial population $P^{(0)}$, the claim is obviously true. Assume that in some iteration $t+1$, the parent population $P^{(t)}$ satisfies that for all $i\in[1..\mu]$, we have $\mathrm{Bin}(n,\frac12)\preceq\mathrm{OneMax}(P_i^{(t)})$. We show that the same is true for $P^{(t+1)}$. Since all individuals of $P^{(t+1)}$ are identically distributed, we consider how one of them is generated. Let $u\in[1..\mu]$ be random and let $v\sim\mathrm{fp}(\mathrm{OneMax}(P_1^{(t)}),\dots,\mathrm{OneMax}(P_\mu^{(t)}))$ be the parent individual selected for the generation of the offspring. By our inductive assumption and Lemma 12, we have $\mathrm{Bin}(n,\frac12)\preceq\mathrm{OneMax}(P_u^{(t)})\preceq\mathrm{OneMax}(P_v^{(t)})$. Let $x$ be a uniformly random individual and $y=P_v^{(t)}$ be the parent just selected. Let $x'$ and $y'$ be the results of applying standard bit mutation to $x$ and $y$. Since $\mathrm{OneMax}(x)\preceq\mathrm{OneMax}(y)$, by Lemma 2 we have $\mathrm{OneMax}(x')\preceq\mathrm{OneMax}(y')$. Now $y'$ is equal (in distribution) to the offspring under consideration and $x'$ is (still) a random bit string. Hence, $\mathrm{Bin}(n,\frac12)\sim\mathrm{OneMax}(x')\preceq\mathrm{OneMax}(y')$ as desired.

We are now ready to give the formal proof of Theorem 10.

Proof of Theorem 10:

Consider a run of the simple GA with population size $\mu$. With the domination argument of Lemma 11, the fitness of a particular solution $P_i^{(t)}$ dominates a sum of $n$ independent uniform $\{0,1\}$-valued random variables. Hence, using the additive Chernoff bound (Doerr, 2020d, Theorem 1.10.7), we see that $\mathrm{OneMax}(P_i^{(t)})\le(\frac12-\varepsilon)n=:s$ with probability at most $\exp(-2\varepsilon^2n)$ for all $\varepsilon>0$.

To avoid working in conditional probability spaces, let us consider a modification of the simple GA. It is identical with the original algorithm up to the point when the fitness of an individual $P_i^{(t)}$ for the first time goes below $s$. From that time on, the algorithm selects the parents uniformly. To the best of our knowledge, such an artificial continuation of a process beyond the time horizon of interest was first used in the theory of evolutionary algorithms in Doerr et al. (2011). For our modified simple GA, the reproduction rate of any individual in a population with all elements having fitness less than $n-a$, $a\le n-s$, is at most $\frac{n-a}{s}=:\alpha$. Hence, we can apply Corollary 7 with this $\alpha$. Taking, similar as in Lehre (2011), $\varepsilon=0.0001$ and $a=0.029n$, we can work with $\alpha=\frac{1-0.029}{0.5-\varepsilon}\approx1.942388$ and thus $\gamma\approx0.336082$. For $n$ sufficiently large, this allows us to use $b=\lceil0.02905n\rceil$.
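The constants in this paragraph follow from Corollary 7 with $p=\frac1n$ (so that $pn=1$) and can be recomputed as follows. The function name is a hypothetical helper, and the returned ratio ignores the factor $(1-\frac4n)$ and the rounding in the definition of $b$, which is why the statement above needs $n$ sufficiently large.

```python
import math


def simple_ga_constants(eps=0.0001, a_frac=0.029):
    """Return (alpha, gamma, b/n in the limit) for the proof of Theorem 10.

    alpha = (n - a) / s with a = a_frac * n and s = (1/2 - eps) * n,
    gamma = 1 - ln(alpha) / (p * n) with p = 1/n,
    b / n tends to 1 / (4 / gamma^2 - 1) as n grows.
    """
    alpha = (1 - a_frac) / (0.5 - eps)
    gamma = 1 - math.log(alpha)
    b_frac = 1 / (4 / gamma**2 - 1)
    return alpha, gamma, b_frac


alpha, gamma, b_frac = simple_ga_constants()
print(f"alpha = {alpha:.6f}, gamma = {gamma:.6f}, b/n -> {b_frac:.6f}")
```

Since the limiting ratio is slightly larger than $0.02905$, the choice $b=\lceil0.02905n\rceil$ is admissible for large $n$ and leaves a gap $b-a=\Theta(n)$.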

For the first time $T'$ that the modified algorithm finds a solution with fitness at least $n-a$ we thus obtain
$$\Pr[T'<L]\le L\mu\exp(-\Omega(n))$$
for all $L$. Since with probability at least $1-L\mu\exp(-2\varepsilon^2n)$ the modified and the true algorithm do not differ in the first $L$ iterations (union bound over all individuals generated in this time interval), we have $\Pr[T<L]\le L\mu\exp(-\Omega(n))$. With $L=\exp(\Theta(n))/\mu$ suitably chosen, we have shown the claim (note that each iteration takes $\mu$ fitness evaluations and note further that we can assume $\mu=\exp(O(n))$ sufficiently small as otherwise the evaluation of the initial search points already proves the claim).

While an exponential runtime on OneMax is not an exciting performance, for the sake of completeness we note that the runtime of the simple GA on OneMax is not worse than exponential. A runtime of $\exp(O(n))$ can be shown with the methods of Doerr (2020b) (with some adaptations). The key observation is that, similar to property (A) in Doerr (2020b, Theorem 3), if at some time $t$ the population contains an individual $x$ with fitness at least $n/3$, then in the next iteration this individual is chosen as parent at least once with at least constant probability and, conditional on this, with probability $\Omega(\frac1n)$ a particular better Hamming neighbor of $x$ is generated from $x$.

## 8  Conclusion and Outlook

In this work, we have proven two technical tools which might ease future lower bound proofs in discrete evolutionary optimization. The negative multiplicative drift theorem has the potential to replace the more technical negative drift theorems used so far in different contexts. Our strengthening and simplification of the negative drift in populations method should help to improve our still limited understanding of population-based algorithms. Clearly, the method is restricted to mutation-based algorithms. Providing such a tool for crossover-based algorithms and extending our understanding of how to prove lower bounds for these beyond the few existing results (Doerr and Theile, 2009; Oliveto and Witt, 2015; Sutton and Witt, 2019; Doerr, 2020e) would be great progress.

## Acknowledgments

This work was supported by a public grant as part of the Investissement d'avenir project, reference ANR-11-LABX-0056-LMH, LabEx LMH.

## References

Antipov
,
D.
,
Buzdalov
,
M.
, and
Doerr
,
B
. (
2020a
). Fast mutation in crossover-based algorithms. In
Genetic and Evolutionary Computation Conference (GECCO)
, pp.
1268
1276
.
Antipov
,
D.
,
Buzdalov
,
M.
, and
Doerr
,
B
. (
2020b
). First steps towards a runtime analysis when starting with a good solution. In
Parallel Problem Solving from Nature, Part II
, pp.
560
573
.
Antipov
,
D.
, and
Doerr
,
B
. (
2020
). Runtime analysis of a heavy-tailed (1 + (λ, λ)) genetic algorithm on jump functions. In
Parallel Problem Solving from Nature, Part II
, pp.
545
559
.
Antipov
,
D.
,
Doerr
,
B.
,
Fang
,
J.
, and
Hetet
,
T
. (
2018
). Runtime analysis for the (μ + λ) EA optimizing OneMax. In
Genetic and Evolutionary Computation Conference (GECCO)
, pp.
1459
1466
.
Antipov
,
D.
,
Doerr
,
B.
, and
Yang
,
Q
. (
2019
). The efficiency threshold for the offspring population size of the (μ, λ) EA. In
Genetic and Evolutionary Computation Conference (GECCO)
, pp.
1461
1469
.
Chen
,
T.
,
He
,
J.
,
Sun
,
G.
,
Chen
,
G.
, and
Yao
,
X.
(
2009
).
A new approach for analyzing average time complexity of population-based evolutionary algorithms on unimodal problems
.
IEEE Transactions on Systems, Man, and Cybernetics, Part B
,
39:1092
1106
.
Corus
,
D.
,
Dang
,
D.
,
Eremeev
,
A. V.
, and
Lehre
,
P. K.
(
2018
).
Level-based analysis of genetic algorithms and other search processes
.
IEEE Transactions on Evolutionary Computation
,
22:707
719
.
Dang
,
D.-C.
, and
Lehre
,
P. K.
(
2016a
).
Runtime analysis of non-elitist populations: From classical optimisation to partial information
.
Algorithmica
,
75:428
461
.
Dang
,
D.-C.
, and
Lehre
,
P. K
. (
2016b
). Self-adaptation of mutation rates in non-elitist populations. In
Parallel Problem Solving from Nature
, pp.
803
813
.
Doerr
,
B.
(
2019a
).
Analyzing randomized search heuristics via stochastic domination
.
Theoretical Computer Science
,
773:115
137
.
Doerr
,
B
. (
2019b
). An exponential lower bound for the runtime of the compact genetic algorithm on jump functions. In
Foundations of Genetic Algorithms
, pp.
25
33
.
Doerr
,
B.
(
2020a
).
Does comma selection help to cope with local optima?
In
Genetic and Evolutionary Computation Conference (GECCO)
, pp.
1304
1313
.
Doerr
,
B
. (
2020b
). Exponential upper bounds for the runtime of randomized search heuristics. In
Parallel Problem Solving from Nature, Part II
, pp.
619
633
.
Doerr
,
B
. (
2020c
). Lower bounds for non-elitist evolutionary algorithms via negative multiplicative drift. In
Parallel Problem Solving from Nature, Part II
, pp.
604
618
.
Doerr, B. (2020d). Probabilistic tools for the analysis of randomized optimization heuristics. In B. Doerr and F. Neumann (Eds.), Theory of evolutionary computation: Recent developments in discrete optimization, pp. 1–87. Berlin: Springer. https://arxiv.org/abs/1801.06733
Doerr, B. (2020e). Runtime analysis of evolutionary algorithms via symmetry arguments. CoRR, abs/2006.04663.
Doerr, B., and Goldberg, L. A. (2013). Algorithmica, 65:224–250.
Doerr, B., Happ, E., and Klein, C. (2011). Tight analysis of the (1 + 1)-EA for the single source shortest path problem. Evolutionary Computation, 19:673–691.
Doerr, B., Johannsen, D., and Winzen, C. (2012). Multiplicative drift analysis. Algorithmica, 64:673–697.
Doerr, B., and Kötzing, T. (2019). Multiplicative up-drift. In Genetic and Evolutionary Computation Conference (GECCO), pp. 1470–1478.
Doerr, B., Le, H. P., Makhmara, R., and Nguyen, T. D. (2017). Fast genetic algorithms. In Genetic and Evolutionary Computation Conference (GECCO), pp. 777–784.
Doerr, B., and Theile, M. (2009). Improved analysis methods for crossover-based algorithms. In Genetic and Evolutionary Computation Conference (GECCO), pp. 247–254.
Droste, S., Jansen, T., and Wegener, I. (2000). A natural and simple function which is hard for all evolutionary algorithms. In Conference of the IEEE Industrial Electronics Society, pp. 2704–2709.
Droste, S., Jansen, T., and Wegener, I. (2002). On the analysis of the (1+1) evolutionary algorithm. Theoretical Computer Science, 276:51–81.
Friedrich, T., Göbel, A., Quinzan, F., and Wagner, M. (2018a). Evolutionary algorithms and submodular functions: Benefits of heavy-tailed mutations. CoRR, abs/1805.10902.
Friedrich, T., Göbel, A., Quinzan, F., and Wagner, M. (2018b). Heavy-tailed mutation operators in single-objective combinatorial optimization. In Parallel Problem Solving from Nature, Part I, pp. 134–145.
Friedrich, T., Quinzan, F., and Wagner, M. (2018). Escaping large deceptive basins of attraction with heavy-tailed mutation operators. In Genetic and Evolutionary Computation Conference (GECCO), pp. 293–300.
Goldberg, D. E. (1989). Genetic algorithms in search, optimization and machine learning.
Hajek, B. (1982). Hitting-time and occupation-time bounds implied by drift analysis with applications. 13:502–525.
Happ, E., Johannsen, D., Klein, C., and Neumann, F. (2008). Rigorous analyses of fitness-proportional selection for optimizing linear functions. In Genetic and Evolutionary Computation Conference (GECCO), pp. 953–960.
He, J., and Yao, X. (2001). Drift analysis and average time complexity of evolutionary algorithms. Artificial Intelligence, 127:51–81.
Jägersküpper, J. (2008). A blend of Markov-chain and drift analysis. In Parallel Problem Solving from Nature, pp. 41–51.
Jägersküpper, J., and Storch, T. (2007). When the plus strategy outperforms the comma strategy and when not. In Foundations of Computational Intelligence, pp. 25–32.
Johannsen, D. (2010). Random combinatorial structures and randomized search heuristics. PhD thesis, Universität des Saarlandes.
Kötzing, T. (2016). Concentration of first hitting times under additive drift. Algorithmica, 75:490–506.
Lehre, P. K. (2010). Negative drift in populations. In Parallel Problem Solving from Nature, pp. 244–253.
Lehre, P. K. (2011). Fitness-levels for non-elitist populations. In Genetic and Evolutionary Computation Conference (GECCO), pp. 2075–2082.
Lengler, J. (2020). Drift analysis. In B. Doerr and F. Neumann (Eds.), Theory of evolutionary computation: Recent developments in discrete optimization, pp. 89–131. Berlin: Springer. https://arxiv.org/abs/1712.00964
Lengler, J., and Steger, A. (2018). Drift analysis and evolutionary algorithms revisited. Combinatorics, Probability & Computing, 27:643–666.
Mironovich, V., and Buzdalov, M. (2017). Evaluation of heavy-tailed mutation operator on maximum flow test generation problem. In Genetic and Evolutionary Computation Conference (GECCO), Companion Material, pp. 1423–1426.
Mitavskiy, B., Rowe, J. E., and Cannings, C. (2009). Theoretical analysis of local search strategies to optimize network communication subject to preserving the total number of links. International Journal on Intelligent Computing and Cybernetics, 2:243–284.
Neumann, F., Oliveto, P. S., and Witt, C. (2009). Theoretical analysis of fitness-proportional selection: Landscapes and efficiency. In Genetic and Evolutionary Computation Conference (GECCO), pp. 835–842.
Oliveto, P. S., and Witt, C. (2008). Simplified drift analysis for proving lower bounds in evolutionary computation. In Parallel Problem Solving from Nature, pp. 82–91.
Oliveto, P. S., and Witt, C. (2011). Simplified drift analysis for proving lower bounds in evolutionary computation. Algorithmica, 59:369–386.
Oliveto, P. S., and Witt, C. (2012a). Erratum: Simplified drift analysis for proving lower bounds in evolutionary computation. CoRR, abs/1211.7184.
Oliveto, P. S., and Witt, C. (2012b). On the analysis of the simple genetic algorithm. In Genetic and Evolutionary Computation Conference (GECCO), pp. 1341–1348.
Oliveto, P. S., and Witt, C. (2014). On the runtime analysis of the simple genetic algorithm. Theoretical Computer Science, 545:2–19.
Oliveto, P. S., and Witt, C. (2015). Improved time complexity analysis of the simple genetic algorithm. Theoretical Computer Science, 605:21–41.
Rowe, J. E., and Sudholt, D. (2014). The choice of the offspring population size in the (1, λ) evolutionary algorithm. Theoretical Computer Science, 545:20–38.
Sutton, A. M., and Witt, C. (2019). Lower bounds on the runtime of crossover-based algorithms via decoupling and family graphs. In Genetic and Evolutionary Computation Conference (GECCO), pp. 1515–1522.
Witt, C. (2006). Runtime analysis of the (μ + 1) EA on simple pseudo-Boolean functions. Evolutionary Computation, 14:65–86.
Witt, C. (2013). Tight bounds on the optimization time of a randomized search heuristic on linear functions. Combinatorics, Probability & Computing, 22:294–318.
Witt, C. (2019). Upper bounds on the running time of the univariate marginal distribution algorithm on OneMax. Algorithmica, 81:632–667.
Wu, M., Qian, C., and Tang, K. (2018). Dynamic mutation based Pareto optimization for subset selection. In Intelligent Computing Methodologies, Part III, pp. 25–35.
Ye, F., Wang, H., Doerr, C., and Bäck, T. (2020). Benchmarking a (μ + λ) genetic algorithm with configurable crossover probability. In Parallel Problem Solving from Nature, Part II, pp. 699–713.

## Author notes

*The author can be contacted at doerr (at) lix.polytechnique (dot) fr.