Abstract

There has been renewed interest in modelling the behaviour of evolutionary algorithms (EAs) by more traditional mathematical objects, such as ordinary differential equations or Markov chains. The advantage is that the analysis becomes greatly facilitated due to the existence of well established methods. However, this typically comes at the cost of disregarding information about the process. Here, we introduce the use of stochastic differential equations (SDEs) for the study of EAs. SDEs can produce simple analytical results for the dynamics of stochastic processes, unlike Markov chains which can produce rigorous but unwieldy expressions about the dynamics. On the other hand, unlike ordinary differential equations (ODEs), they do not discard information about the stochasticity of the process.

We show that these are especially suitable for the analysis of fixed budget scenarios and present analogues of the additive and multiplicative drift theorems from runtime analysis. In addition, we derive a new more general multiplicative drift theorem that also covers non-elitist EAs. This theorem simultaneously allows for positive and negative results, providing information on the algorithm's progress even when the problem cannot be optimised efficiently. Finally, we provide results for some well-known heuristics namely Random Walk (RW), Random Local Search (RLS), the (1+1) EA, the Metropolis Algorithm (MA), and the Strong Selection Weak Mutation (SSWM) algorithm.

1  Introduction

Nowadays, the theoretical study of randomised heuristics mainly focuses on rigorous runtime analysis; however, we can trace back literature about general models of evolutionary computation to Holland (1975). Roughly a decade later, Goldberg (1987) focused on modelling the evolving population of a genetic algorithm (GA). This was extended by Vose (1995) who used a Markov chain approach and fixed points analysis. In the same spirit of dynamical systems, we can find an analysis for steady state GAs by Wright and Rowe (2001).

However, Schema theory was proven wrong (Altenberg, 1994) and Markov chain approaches are, in general, intractable (He and Yao, 2003). The remaining mentioned techniques share a common rationale, the stochastic process describing an EA varies according to its expected change. The same idea is used in the so-called ordinary differential equation (ODE) method by Wormald (1995). Here, the EA is considered as a noisy version of an ODE that tracks the expected rate of change. This method has been applied several times in EC (see, e.g., Yin et al., 1995 or Akimoto et al., 2012). Due to this deterministic approach of the expected change, these methods disregard the random component of EAs.

In this article we follow up on a new method presented by Paixão and Pérez Heredia (2017) based on stochastic differential equations. This approach naturally includes the stochasticity implicit in randomised heuristics through the well established tools from Itô calculus. SDEs have already been used in EC for convergence analysis by Schaul (2012); however, the results did not improve the state of the art.

Other modelling attempts such as the application of statistical physics by Prügel-Bennett and Shapiro (1994) yielded highly accurate results when modelling finite populations and fluctuations were also modelled by Prügel-Bennett (1997). However, this method requires that the algorithm's selection operator weights each search point according to the Boltzmann distribution. This is the case for algorithms like Metropolis or SSWM but it will not hold in general. On the other hand, this method and some of the previously mentioned approaches were tackling more general and ambitious problems than what this article covers.

Stochastic differential equations have proven to be a powerful tool when dealing with stochastic processes (see Øksendal, 2003 for many examples). SDEs have accomplished great achievements in physics, starting with the explanation of Brownian motion by Einstein (1905). In finance, especially for Black and Scholes (1973) work on options prices that yielded a Nobel Prize in Economics in 1997. SDEs have also been used in the more related field of Population Genetics (PG) to track the frequency of genes in a population. This approach was used by Kimura (1957) and has ever since been an integral part of the so-called modern synthesis. Many classical results were obtained through this approach, including the steady state distribution of gene frequencies of a finite population under mutation and selection by Kimura (1964), and the probability of fixation of a gene by Kimura (1968). Our work also contributes to the renewed interest of unifying EC and PG (Paixo et al., 2015; Pérez Heredia et al., 2017).

Building on top of the so-called diffusion approximation, we derive an SDE that will be used to model the dynamics of EAs. For some scenarios we will be able to solve this equation analytically. We show that this is the case by obtaining equivalent statements to the well-known additive drift by He and Yao (2001) and multiplicative drift by Doerr et al. (2012). Furthermore, we will derive a new version of the multiplicative drift, extending its applicability to non-elitist algorithms even when the drift is pushing the algorithm away from the optimum. The work we present here has several aims:

  • improving the understating of EA dynamics,

  • deriving analytic expressions for the expectation and the variance,

  • introducing SDE to the field of EC and

  • translation of drift theorems from runtime analysis to fixed budget analysis.

This article focuses on a fixed budget analysis, first introduced by Jansen and Zarges (2012). In this perspective, the research question is to derive the state of the algorithm after some specific time, this approach gives more information about the process than the classical runtime analysis where the question is to find the time needed to optimise a problem. Despite being introduced recently, fixed budget is gaining importance in the theoretical community (Doerr et al., 2016, 2013; Jansen and Zarges, 2012; Lengler and Spooner, 2015; Nallaperuma, Neumann et al., 2017).

Although we build on top of the same mathematical tool, this article widely extends the extended abstract presented by Paixão and Pérez Heredia (2017). First, we will consider more elaborated algorithms such as Metropolis and SSWM. And secondly but more important, we present what to our best knowledge is the first fixed budget analysis of non-elitist EAs. Furthermore, our formulation considers any selection strength. This constitutes an important advantage since drift theorems require to bound the process by a positive and monotonic non-increasing distance function (He and Yao, 2001; Doerr et al., 2012; Johannsen, 2010). In the case of non-elitist EAs, this implies increasing the selection strength such that worsening moves are not very likely to be accepted. However, in our results the drift can even be negative and yet we will be able to provide the expected behaviour, and the variance.

The disadvantage of the SDE method, unlike runtime analysis or other fixed budget techniques, is that it requires performing two approximations. The first approximation consists of considering the system as a time-continuous process. In the second approximation, we will assume that the change of the process Xt+1-Xt follows a distribution without higher statistical moments than the variance. Hence, this method, although is mathematically based, is not completely rigorous. Due to this reason, the last section of the article presents an extensive validation of these approximations. We will compare the obtained results against rigorous literature for runtime and fixed budget. In addition, we present a graphical comparison against the simulated results, and we will experimentally measure the error of the approximation.

The article is structured as follows. In the preliminaries section, we introduce the studied algorithms, fitness functions, a crash course on SDEs, and the diffusion approximation. Then, we derive three drift theorems for fixed budget: additive, multiplicative, and non-elitist multiplicative. In the applications section, we will apply these theorems for five different algorithms on problems. Finally, we present a validation section where we compare our results against the literature and the experimental error.

2  Preliminaries

2.1  Algorithms

In this article, we consider the maximisation of pseudo-Boolean functions f:{0,1}nR. Although we will be studying five different algorithms, they all fit in the family of trajectory-based algorithms. These heuristics evolve a single individual rather than using a population. The two principal operators that will specify the algorithm used within this class are mutation and selection (see Table 1). We will consider algorithms with two mutation operators:

  • Local mutations: Flip a randomly chosen bit from x.

  • Global mutations: Flip each bit of x independently at random with probability 1/n.

As mentioned in the introduction, we are interested in algorithms that deviate from elitism. This property is obtained by the action of the selection operator through an acceptance probability pacc:R[0,1] (see Figure 1). For the sake of clarity in the analysis, when analysing non-elitist EAs, we will consider only local mutations yielding a simple but general family of algorithms. Within this class of algorithms we can recover well-known heuristics depending on the acceptance probability. In following sections we present results for this general scheme and we will also introduce four widely extended selection operators to compare our results against the literature.

Figure 1:

Acceptance probability for RW (orange loosely dashed line), RLS and (1+1) EA (blue dotted line), Metropolis (red solid line), and SSWM (green dashed line).

Figure 1:

Acceptance probability for RW (orange loosely dashed line), RLS and (1+1) EA (blue dotted line), Metropolis (red solid line), and SSWM (green dashed line).

First we will consider one extreme case of non-elitism which is the absence of selection from a Random Walk (RW) process
paccRW(Δf)=1Δf.
(1)
On the other side of the spectrum we find elitist algorithms for which the acceptance function is the step function. This is the case for RLS (local mutations) or the (1+1) EA (global mutations).
paccRLS(Δf)=1ifΔf00ifΔf<0.
(2)
All the other local search algorithms will be in between these two limit cases. If worsening moves are allowed with an exponentially decreasing probability, but elitism is kept for improving moves, we find the so-called Metropolis et al. (1953) algorithm
paccMA(Δf,αR+)=1ifΔf0eαΔfifΔf<0,
(3)
where R+ denotes the set of positive real numbers. Finally, we will consider the acceptance probability of the Strong Selection Weak Mutation (SSWM) regime. Within this regime, the condition on improving moves is relaxed allowing the rejection of search points with higher fitness (see, e.g., Paixão et al., 2017 or Oliveto et al., 2017).
paccSSWM(Δf,N,β)=pfix(Δf,N,β)=1-e-2βΔf1-e-2NβΔf.
(4)
This acceptance probability is obtained from a natural evolutionary regime where NN+ is the population size and βR+ is a scaling parameter. This probability is normally referred to as the fixation probability, since it represents the probability of a mutant genotype going extinct or fixating in the population (Kimura, 1962).

2.2  Fitness Functions

We will consider two of the most theoretically studied functions. First, the OneMax problem represents an easy hill climbing task. The goal is to maximise the number of bit matches with a specific bitstring which is the optimum. Since our algorithms are unbiased, they do not use problem knowledge; without loss of generality we can assume that the optimal search point is the all ones bitstring. Hence, OneMax just counts the number of 1-bits of a search point.

Definition 1:

Let x{0,1}n, then OneMax(x)=i=1nxi.

LeadingOnes is a function that counts the number of 1-bits at the beginning of the bitstring. Using again the landscape metaphor, this problem represents a fitness ridge since bit matches do not always increase the fitness and losing one bit match can yield a huge fitness loss.

Definition 2:

Let x{0,1}n, then LeadingOnes(x)=i=1nj=1ixj.

2.3  Stochastic Differential Equations

This subsection contains the minimal knowledge of SDEs to be able to understand this article. For more extended and rigorous details we refer the reader to the textbook by Øksendal (2003). Stochastic differential equations are equations that deal with the behaviour of stochastic processes. We can define these processes as a collection of random variables in R over time t (i.e., {Xt}t0), together with an underlying probability space (Ω,F,P) that from now on we will take for granted.

The simplest stochastic process is probably the well-known white noise Wt. Its signal integrated over time will produce another famous process Bt=Wsds known as Brownian motion or Wiener process. Some of the characteristics of these processes are that they have 0 mean and their increments are independent and stationary. Furthermore, all the moments higher than the second are 0 (see, e.g., Ross, 1996). We will focus on SDEs of the form
dXt=b(t,Xt)dt+σ(t,Xt)Wtdt=b(t,Xt)dt+σ(t,Xt)dBt.
This equation represents a process where the state change dXt depends on both the time and the current position (which can also depend on the time) through the terms b(t,Xt) and σ(t,Xt). These are usually referred to as the drift and diffusion terms or coefficients. We can say that b tracks the expected evolution of the process; note that if σ=0 (absence of noise), the drift coefficient simply becomes dXt/dt. And σ collects the random behaviour by amplifying a white noise or Brownian motion term. Applying the usual integration rules one obtains a simple expression for the state after t iterations.
Xt=X0+0tb(s,Xs)ds+0tσ(s,Xs)dBs.
(5)
The challenge now is to prove the existence of an integral over an stochastic process gdBs and develop its integration rules. This was done by Itô (1944, 1946, 1950, 1951): he proved the existence of this integral, worked on stochastic integral equations of the type 5, and applied his developments to the already existing field of SDEs. The key idea of Itô's work is a stochastic extension of the Riemann-Stieltjes (Stieltjes, 1894) integral, which is itself, an extension of the standard Riemann integral for when a function is integrated with respect to another function. The other traditional formulation was given by Stratonovich (1966). In this article, we will focus on stochastic processes that fit into the following definition of an Itô process (Definition 4.1.1 in Øksendal, 2003).
Definition 3
(1-dimensionalItprocess): A 1-dimensional Itô process is a stochastic process Xt on (Ω,F,P) of the form
Xt=X0+0tb(s,Xs)ds+0tσ(s,Xs)dBs,
where dBt denotes a Brownian motion process, b is absolute integrable in [0,t] and σ is square integrable in [0,t]. Alternatively, in differential form dXt=b(t,Xt)dt+σ(t,Xt)dBt.

In the following, we will refer as Itô integrals to integrals over a Brownian process gdBs. Again, for a formal definition of these integrals and the set of functions V(S,T) on where they are defined, we refer the reader to Øksendal (2003). These integrals will have the following properties (adapted from Theorem 3.2.1 in Øksendal, 2003).

Theorem 1

(PropertiesoftheItintegral): Let f,gV(0,T) and let 0S<U<T. Then

  • STfdBt=SUfdBt+UTfdBt

  • ST(cf+g)dBt=c·STfdBt+STgdBt (c constant)

  • ESTfdBt=0.

Performing a variable change on a stochastic integral is a bit more involved than for ordinary integrals. The usual rules do not apply and we will have to use the following statement (Theorem 4.1.2 in Øksendal, 2003) which is known as Itô's formula.

Theorem 2
(Itformula): Let Xt be a Itô process given by
dXt=b(t,Xt)dt+σ(t,Xt)dBt.
Let g(t,x) be twice continuously differentiable on xR and once in tR+. Then Yt=g(t,Xt) is again an Itô process, and
dYt=gt+bgx+σ222gx2·dt+σgxdBt.

Finally, another useful property specially when computing higher moments, such as the variance, is the Itô Isometry (Corollary 3.1.7 in Øksendal, 2003).

Theorem 3
(ItIsometry): For all fV(S,T)
ESTf(Xt,t)dBt2=ESTf(Xt,t)2dt.

2.4  The Diffusion Approximation

We refer to the diffusion approximation as the joint application of a time-continuous and Gaussian approximation. The time evolution of the probability density p(x,t) of any stochastic process Xt can be described by the Chapman-Kolmogorov equation (see, e.g., Feller, 1949)
p(x,t+Δt)=Δ(δx)·p(x-δ,t)·dδ,
(6)
where Δ(δx) is the transition probability of reaching x from x-δ. Let us consider the left hand term: after performing a first order Taylor expansion around Δt=0 we obtain
p(x,t+Δt)p(x,t)+Δt·p(x,t)t,
and by Taylor's theorem we know that the error of this approximation will be of order O(Δt). Since we are dealing with time-discrete algorithms, the time counter is increased every generation by just one unit, hence Δt=1. Similarly, for the right-hand term of (6) we can write the second order Taylor expansion of Δ(δx)p(x-δ,t) around x-δ=x yielding
Δ(δx)p(x-δ,t)Δ(δx)p(x,t)-δ·xΔ(δx)p(x,t)+δ22·2x2Δ(δx)p(x,t).
Again, by Taylor's theorem we can estimate the error by O(δ2). This approximation will give good results when δ is small, that is, when the system usually moves between close by states in the search space. Introducing both approximations in (6) and noticing that p no longer depends on δ we can write
p(x,t)+Δt·p(x,t)tp(x,t)·Δ(δx)dδ1-xp(x,t)·δ·Δ(δx)dδE(Δ)+12·2x2p(x,t)·δ2·Δ(δx)dδE(Δ2).
As outlined in the equation above, we can identify these integrals with the statistical moments of Δ leading to the well-known diffusion equation
Δt·p(x,t)t-xp(x,t)·EΔ+12·x2p(x,t)·EΔ2.
This equation is known as the Fokker-Planck equation or the Forward Kolmogorov equation (see, e.g., Karlin and Taylor, 1981). This is an equation for the time evolution of the probability density p(x,t) of the random variable Xt. In principle, one could solve this equation to obtain a probability distribution at any particular time point. However, this is often impossible in practice. Instead, we deal directly with the SDE associated with the Fokker-Planck equation and take that as a model for EAs. It can be shown (see, e.g., Øksendal, 2003), that the associated SDE is an Itô process of the form:
dXtEΔdt+EΔ2dBt.
This is the central equation for this article. We will use it to describe the dynamics of EAs by computing its drift coefficient EΔ and second moment or diffusion coefficient EΔ2. Finally, we present our model as an hypothesis where it is interesting to note that the first moment E(Δ) exactly corresponds to the drift as used in runtime analysis.
Hypothesis 1
(TheDiffusionApproximation): The dynamics of any algorithm described by Algorithm 1 with local or global mutations can be approximated by the following Itô process:
dXt=b(Xt,t)dt+σ(Xt,t)dBtb(Xt,t)=EXt+1-Xt|Xtσ2(Xt,t)=EXt+1-Xt2|Xt,
where Bt is a Brownian motion process.
formula

Are Evolutionary Algorithms Diffusive Processes?

The previous hypothesis would be a rigorous theorem if evolutionary algorithms were a pure diffusion process, but they are not. First of all, we used differential and not difference equations to model a discrete process. This approximation is intrinsic to the chosen method and carries an error of constant size. Secondly, the Gaussian approximation depends on each algorithm and the fitness function. The variable of interest is the transition probability Δ(δx) from Equation (6), which reads as the probability of reaching the state x from the state x-δ in one iteration. It was shown before that the Gaussian approximation will carry an error of size O(δ2). Hence, depending on the evolutionary operators of the EA this approximation will or will not be accurate.

To motivate the use of this approximation for the algorithms from Subsection 2.1, we can focus just on the mutation operator. The reason is that selection cannot increase the distance δ between the parent and child solutions. It will only decide if the new individual is accepted or rejected.

The question resides then in how big is the δ produced by mutation. To illustrate this, let us consider a random walk on the Boolean hypercube and its number of ones x. First, with local mutations, the RW will have a constant δ=1. On the other hand, if global mutations are used δ can be any integer number in {0,1,,n}. Fortunately, the probability of large mutations decays exponentially with their size δ. Actually, the two smallest values of δ (0 and 1) will carry together a probability mass of approximately 2/e. Hence, all the larger jumps together will only have a probability of 1-2/e. Since δ is concentrated, the higher order moments should be close to zero, suggesting that the Gaussian approximation is also accurate for global mutations. However, on some fitness problems all the relevant events could belong to the neglected probability mass of 1-2/e. In this article, we consider only problems (OneMax and LeadingOnes) where, as we will see, global mutations are decently Gaussian approximated. Nevertheless, at the end we provide an extensive validation for Hypothesis 7.

3  Drift Theorems for Fixed Budget Analysis

The Diffusion Approximation (Hypothesis 7) considers a scenario which in general does not have an analytic solution. This will be dependent on the mathematical expressions of the drift b(Xt,t) and diffusion σ(Xt,t) coefficients. In this article we will consider three simple scenarios where we can obtain equivalent results to drift theorems for runtime analysis.

The methods to solve SDEs are similar to those to solve ODE, but we have to use the Itô formula (Theorem 5) when performing a change of variables. In the following subsections our approach is to use a priori knowledge. We will find another process Yt, described by a simpler function g(x,t) that we know is the solution of the equivalent ODE (SDE with σ=0). For example, if our process is of the form dXt=Xtdt+XtdBt our candidate solution will be Yt=ln(Xt). When the coefficients are more complex, we will try to find a variable change that leads to an already known SDE.

Before starting the derivation of these theorems, we would like to emphasize that this section is purely rigorous. No approximations are used unlike in the derivation of Hypothesis 7.

3.1  Additive Drift

The simplest case is when the drift b(Xt,t) does not depend on the time t or the current state Xt, i.e., b(Xt,t)=b. This corresponds with the additive drift for runtime analysis (He and Yao, 2001).

Theorem 4:
Let Xt be an Itô process of the form
dXt=bdt+σdBt,
where Bt is a Brownian process, bR and σR0+. Then,
Xt=X0+bt+σBtEXt=EX0+btVarXt=VarX0+σ2t.
Proof:
This is a trivial case that can be solved by directly integrating
0tdXs=0tbds+0tσdBs
(7)
Xt=X0+bt+σBt(B0=0).
(8)
Taking expectations in Equation (7) we can compute the expected value
EXt=EX0+E0tbds+E0tσdBs=EX0+0tEbds=EX0+bt,
(9)
where we have used property (iii) from Theorem 4. Finally, to compute the variance we use the results obtained in (8) and (9) as follows
VarXt=EXt-EXt2=EX0-EX02+σ2Bt2=VarX0+σ2·E(Bt2).
Considering a normalised Brownian motion (E(Bt2)=t) yields the claimed statement.

As in runtime analysis it is not typically the case that we know exactly the drift's value, but we can bound it within some range. It follows directly that knowing those bounds we will be able to bound the expected progress after t iterations.

Corollary 5:
In the context of Theorem 8, if bbbu and σσσu, with b,buR and σ,σuR0+. Then
X0+bt+σlBtXtX0+but+σuBtEX0+btEXtEX0+butVarX0+σ2tVarXtVarX0+σu2t.

3.2  Multiplicative Drift

Analogous to runtime analysis, the assumption of additive drift is in general too loose since it does not take into account the current state of the process. This is solved by the multiplicative drift theorem from Doerr et al. (2012) when the drift is proportional to the current state Xt.

As discussed in the introduction, SDEs are a well established method in other research fields, allowing us to recycle already existing results. We will make use of a simplified version of the Cox-Ingersoll-Ross (CIR) model (Cox et al., 1985; Maghsoodi, 1996), which describes the evolution of interest rates with the following SDE
dXt=b(θ-Xt)dt+σXtdBt.
(10)
The coefficient b specifies the speed (drift) at which the process approaches its expected value θ provoking a mean reversion effect. The noise due to the market dBt is amplified by a term σXt to ensure that there are not negative interest rates (note that this effect is also obtained with any diffusion coefficient that goes to 0 when Xt=0). We will use a simplified version without the mean reversion effect (θ=0), which resembles an elitist multiplicative drift process.
Theorem 6
(CIRwithoutmeanreversion,Coxetal.,1985): Let Xt>0 be an Itô process of the form
dXt=-bXtdt+σXtdBt,
where Bt is a Brownian process and bR+ and σR0+. Then,
EXt=EX0·e-btVarXt=EX0σ2b·e-bt1-e-bt.
Proof:
The results for the expectation and variance of the original CIR process (10) are (Equation (19) in Cox et al., 1985)
EXt=EX0·e-bt+θ1-e-tVarXt=EX0σ2b·e-bt1-e-bt+θσ22b1-e-t2.
Our statement is just the special case when θ=0.

Once more, if we can only compute bounds rather than exact values it follows straightforward the next corollary. Notice that the result for the expectation is very similar to a previous fixed-budget multiplicative drift by Lengler and Spooner (2015) (Theorem 4).

Corollary 7:
In the context of Theorem 10, if bbbu and σσσu with b,buR+ and σ,σuR0+. Then,
EX0·e-btEXtEX0·e-butVarXtEX0σ2be-bt1-e-btVarXtEX0σu2bue-but1-e-but.

3.3  Non-Elitist Multiplicative Drift

Drift theorems for the runtime share a common condition: the drift must be bounded by a monotonic function towards the optimum (He and Yao, 2001; Doerr et al., 2012; Johannsen, 2010). This limitation is also shared by previous fixed budget drift theorems (Lengler and Spooner, 2015 and Theorem 10). In the following statement we relax this condition—the process also follows a monotonic behaviour in expectation, but towards the stationary distribution (which can be far away from the optimum). In fact, as we will see in the experimental section, even when the process starts near the optimum it will follow an exponential decay towards its equilibrium state, moving away from the optimum.

Although we have named it Non-Elitist Multiplicative Drift this theorem can also be applied for elitist algorithms. This will be the limit case when b=α=0 and β<0 where we recover Theorem 10.

Theorem 8:
Let Xt0 be an Itô process of the form
dXt=a(b-Xt)dt+α-βXtdBt,
with aR+, b,αR0+, βR, α-βXt0 for all Xt, α-bβ0 and Bt is a Brownian process. Then,
Xt=X0e-at+b1-e-at+0te-a(t-s)α-βXsdBsEXtX0=X0e-at+b1-e-atVarXtX0=α2a(1-e-2at)-βaX0e-at(1-e-at)-bβ2a1-e-at2.
Proof:
Let us use the integrating factor eat to perform the variable change Yt=eatXt, then by applying the Itô formula (Theorem 5) we obtain
dYt=(eatXt)t+a(b-Xt)(eatXt)Xt+α-βXt22(eatXt)Xt2·dt+α-βXt(eatXt)XtdBt=aeatXt+a(b-Xt)eat·dt+α-βXteatdBt=abeatdt+α-βXteatdBt.
Applying the usual integration rules leads to
Yt=Y0+ab0teasds+0tα-βXseasdBs.
Let us now compute just the first integral and also revert the variable change. Then,
eatXt=X0+beat-1+0tα-βXseasdBs.
Multiplying both sides by e-at leads to the first claimed result in the theorem. For the second result we take expectations leading to
EXt=EX0e-at+b1-e-at+E0te-a(t-s)α-βXsdBs.
Fortunately, expectations of integrals over a Brownian motion average out (see Property (iii) from Theorem 4) yielding the second result of the theorem. Finally, to compute the variance we will start from the formula Var(Xt)=E(Xt-E(Xt))2, which after introducing the two previous results yields
VarXtX0=E0tα-βXse-a(t-s)dBs2;
applying Itô isometry (Theorem 6) we obtain
VarXtX0=0tα-βEXse-2a(t-s)ds=αe-2at0te2asds-βe-2at0tEXse2asds;
introducing the result for the expectation leads to
VarXtX0=αe-2at0te2asds-βe-2at0tX0e-as+b1-e-ase2asds=αe-2at0te2asds-βe-2atX00teasds-bβe-2at0te2as-easds;
computing the integrals completes the proof
VarXtX0=αe-2at·e2at-12a-βe-2atX0·eat-1a-bβe-2at·eat-122a=α2a(1-e-2at)-βaX0e-at(1-e-at)-bβ2a1-e-at2.

Once more, if exact values for the drift and diffusion coefficients are hard to compute, but can be bounded we can make use of the following corollary.

Corollary 9:
In the context of Theorem 12, if a(b-Xt)a(b-Xt)au(bu-Xt)R+ for all Xt, and α-βXtα-βXtαu-βuXtR0+ for all Xt. Then,
EXtX0X0e-aut+bu1-e-autEXtX0X0e-at+b1-e-atVarXtX0αu2au(1-e-2aut)-βuauX0e-aut(1-e-aut)-buβu2au1-e-aut2VarXtX0α2a(1-e-2at)-βaX0e-at(1-e-at)-bβ2a1-e-at2.

4  Applications

Although the results derived in the previous section are rigorous, its application for evolutionary algorithms is just an approximation (see Subsection 2.4). In this section, we will take for granted that the diffusion approximation holds and we will relax our statements to the category of application rather than theorem. It will be finally in Section 5 where we will extensively validate this assumption (Hypothesis 7).

4.1  Elitist Algorithms on LeadingOnes

First, we start with a study case for the additive drift (Theorem 8). A good problem candidate, although not ideal, is the LeadingOnes function. Many algorithms like RLS or the (1+1) EA have an almost additive drift on this function; however, at the end of the optimisation process the drift depends on the state. The reason this fitness function is not ideal for an additive drift approach are the fitness boundary conditions (0xn). Our formulation of additive drift does not consider boundaries on the process. SDEs can cope with these conditions but this will highly increase the complexity and extension of the article, deviating from our introductory interests. To circumvent this problem, we will add a clause in the application's statement excluding our results after the optimum is reached. This approach has already been used in the literature as in Nallaperuma, Neumann et al. (2017).

Application 10:Under the Diffusion Approximation, after t iterations, RLS onLeading-Oneswill find the optimum or reach an expected value and variance of:
EX0+1ntEXtEX0+2ntEX0+1ntVarXtVarX0+4nt,
where X0 is the number of leading ones of the initial search point.
Proof:
Let us denote by Xt the number of leading ones in the bitstring at time t. And assume that Hypothesis 7 is a good approximation for the dynamics of Xt. As explained in the introduction, the drift and diffusion coefficients will be the first and second moments of the rate of change of the process. Or mathematically speaking
b=EXt+1-XtXt=xx·p(Xt+1-Xt=xXt)σ2=EXt+1-Xt2Xt=xx2·p(Xt+1-Xt2=xXt).
Recall that RLS flips exactly one bit per iteration; this bit can either be: (A) one of the Xt leading ones bits, (B) the first 0-bit, or (C) one of the remaining n-1-Xt bits (free riders). Since RLS is an elitist algorithm, we can neglect events of type A (fitness decrease). In the case of events of type C, although the algorithm will accept the movement towards a point of equal fitness, it will not yield a contribution to the drift since x=0. Hence, we focus just on event B and using the law of total expectation we obtain
b(Xt)=x=1n-Xtx·p(Xt+1-Xt=xXt,B)p(B).
Note that flipping the first 0-bit (probability 1/n) automatically increases the fitness by 1; however, it could even be the case that the remaining n-1-Xt bits (free riders) were already set to 1 and the problem is immediately optimised. As described before, events of type C are accepted and free riders bits are being flipped no matter its value, i.e, they are uniformly at random distributed (Lemma 1 in Lehre and Witt, 2012). Hence, the number of free riders set to 1 follows a geometric distribution with parameter 1/2. Therefore,
b(Xt)=1n·x=1n-Xt-1x·121-12x-1+(n-Xt)2-(n-Xt).
The reason for the term (n-Xt)2-(n-Xt) is that the problem size n is finite. Then, the event where all the free riders were already set to 1 absorbs the remaining probability mass. Computing the sum yields the already known result for the drift of RLS on LeadingOnes
b(Xt)=1n·2-2-(n-1-Xt).
This result can be understood as the probability of flipping the first 0-bit, times the drift impact of such mutation. We now bound the drift by bounding the impact in the range [1,2) yielding
bl=1n,bu=1n·2.
(11)
To compute the diffusion coefficient σ we use the same rationale but raising the impact to the power of 2; therefore
σl2=1nσu2=1n·22.
These coefficient bounds fit in our additive drift description (Corollary 9); using this corollary with the previously computed coefficients directly leads to the claimed results.

The application to the (1+1) EA is equivalent but we have to deal with global mutations, which makes computing the exact value for the drift and diffusion coefficients harder.

Application 11:Under the Diffusion Approximation, after t iterations, the (1+1) EA onLeadingOneswill find the optimum or reach an expected value and variance of:
EX0+1entEXtEX0+2ntVarX0+1entVarXtVarX0+4nt,
where X0 is the number of leading ones of the initial search point.
Proof:

Let us denote by Xt the number of leading ones in the bitstring at time t. And assume that Hypothesis 7 is a good approximation for the dynamics of Xt.

As in the previous application, the drift can be expressed with two terms: the probability of having a positive drift and its impact. The impact term remains identical, but the mutational term is different since we have to ensure that all the leading 1-bits of the current solution are preserved.
b(Xt)=1n1-1nXt·2-2-(n-1-Xt).
To obtain the bounds for the drift we will consider the extreme cases Xt=0 and Xt=n-1 for the mutational term and bound the impact in the range [1,2) yielding
b1n·2=bu
(12)
b1n1-1nn-1·11en=bl.
(13)
For the diffusion coefficient we can recycle the previous probabilities but we have to raise the impact to the power of 2 to obtain the second moment of the process
σl2=1en·12σ21n·22=σu2.
Finally, introducing these calculations in Corollary 9 proves the result.

4.2  Elitist Algorithms on OneMax

Secondly, we apply the multiplicative drift for the same algorithms but for the OneMax problem, which is ideal for this drift theorem. Unlike in LeadingOnes, the boundary conditions are not a problem since our multiplicative drift bounds cannot push the system out of the fitness interval [0,n]. Notice that the drift is always increasing the number of ones and vanishes when approaching the optimum.

Application 12:Under the Diffusion Approximation, the expected fitness reached by RLS inOneMaxafter t iterations is
EXt=n1-e-tn+EX0·e-tnVarXt=n-EX0·e-tn1-e-tn,
where EX0 is the expected number of ones of the initial search point.
Proof:
Let us denote by Zt the number of ones in the bitstring at time t. And assume that Hypothesis 7 is a good approximation for the dynamics of Zt. The drift coefficient, as always, will be the first moment (expectation) of the process. The diffusion coefficient will be the second moment.
b(Zt)=EZt+1-ZtZt=zz·p(Zt+1-Zt=zZt)σ2(Zt)=EZt+1-Zt2Zt=zz2·p(Zt+1-Zt2=zZt).
In the OneMax problem at state Zt there are Zt bit-flips out of n that lead to an improvement, with each of these events reducing the number of zeros by 1; therefore
b(Zt)=-Ztn.
For the diffusion coefficient we just have to repeat the calculations raising to 2 the impact of each event (-1)2, yielding
σ2(Zt)=(-1)2Ztn=Ztn.
Therefore, the approximated SDE for this process is of the form dZt=b(Zt)dt+σZtdBt, which fits in our multiplicative drift (Theorem 10). Applying this theorem with b=-1/n and σ=1/n we obtain the following results
EZt=EZ0·e-tnVarZt=EZ0·e-tn1-e-tn.
Finally, translating to the number of ones Xt=n-Zt leads to the theorem's statement.

The next application is regarding the (1+1) EA; again in this case we have computed bounds on the coefficients rather than exact values due to the difficulty to obtain exact results with global mutations.

Application 13:Under the Diffusion Approximation, the expected fitness reached by the (1+1) EA inOneMaxafter t iterations is bounded by
EXtn1-e-ten+EX0·e-tenEXtn1-e-3.1tn+EX0·e-3.1tnVarXtn-EX0·e-ten1-e-tenVarXtn-EX0·e-3.1tn1-e-3.1tn,
where EX0 is the expected number of ones of the initial search point.
Proof:
Let us denote by Zt the number of ones in the bitstring at time t. And assume that Hypothesis 7 is a good approximation for the dynamics of Zt. The drift and diffusion coefficients (first and second statistic moments of the process) can be expressed as
b(Zt)=j=1Zt-j·mut(Zt,Zt-j),σ2(Zt)=j=1Zt(-j)2·mut(Zt,Zt-j),
where mut(Zt,Zt-j) is the probability of mutation moving the process from Zt to Zt-j. This is obtained by k 1-bits flipped into 0-bits and vice-versa for k+j bits. Then,
mut(Zt,Zt-j)=k=0nn-ZtkZtk+j1nj+2k1-1n2n-j-2k.
The drift coefficient can be lower bounded using only the case when j=1 leading
b(Zt)-Ztn1-1nn-1-Ztne.
For the upper bound, we make use of Lemma 3 from Paixão et al. (2017) to upper bound mut(Zt,Zt+j) by Ztnj·1-1nn-j·1.14j! yielding
b(Zt)=-j=1ZtZtnj·1-1nn-j·1.14j!·j-1.14·Ztn·j=1jj!=-1.14·Ztn·e-3.1·Ztn.
Analogous calculations lead to the following bounds on the diffusion coefficient
Ztenσ2(Zt)3.1·Ztn.
Calling Corollary 11 with -1/enb-3.1/n and 1/enσ23.1/n leads to
EZ0·e-1entEZtEZ0·e-3.1ntVarZtEZ0·e-ten1-e-tenVarZtEZ0·e-3.1tn1-e-3.1tn.
Finally, translating to the number of ones Xt=n-Zt leads to the claimed results.

4.3  Non-Elitist Algorithms on OneMax

In this subsection, we will consider algorithms with local mutation but different selection operators. We keep general the acceptance probability, with the only assumption that it is monotonically non-decreasing. Afterwards, in the validation section we will plug in the expressions for a RW, MA, and SSWM.

Lemma 14:
Consider any Algorithm 1 with a monotonically non-decreasing acceptance function pacc(Δf) on OneMax. Then,
EXt+1-XtXt=p+pn·npp+p-XtE(Xt+1-Xt)2Xt=p-Xtnp-p,
where Xt denotes the number of 1-bits in the search point at time t, p=pacc(Δf=1) and p=pacc(Δf=-1).
Proof:
Let us start from EXt+1-XtXt=xx·p(Xt+1-Xt=xXt). Now we use the same arguments as for the analysis of RLS on OneMax (see the proof of Application 12) but considering the acceptance of deleterious mutations. Then,
EXt+1-XtXt=n-Xtnp-Xtnp=p-Xtnp+p=p+pn·npp+p-Xt,
where we have multiplied and divided by p+p/n. For the diffusion coefficient, we use the definition of the second statistical moment EXt+1-Xt2Xt=xx2·p(Xt+1-Xt=xXt). Using the same rationale as for the drift coefficient but noticing that the second moment can only change by (±1)2 yields
EXt+1-Xt2Xt=(+1)2·n-Xtnp+(-1)2·Xtnp=p-Xtnp-p.

Once the coefficients are known, the following application derives the fixed budget results by using the non-elitist multiplicative drift (Theorem 12).

Application 15:Consider any Algorithm 1 with a monotonically non-decreasing acceptance function pacc(Δf) onOneMax. Under the Diffusion Approximation the following statements hold
EXtX0=X0e-p+pnt+npp+p1-e-p+pntVarXtX0=np2(p+p)1-e-2tp+pn-p-pp+pX0e-p+pnt1-e-p+pnt-p(p-p)2n1-e-p+pnlt2,
where Xt denotes the number of 1-bits in the search point at time t, p=pacc(Δf=1) and p=pacc(Δf=-1).
Proof:
Assume that Hypothesis 7 is a good approximation for the dynamics of Xt. Then, the statement is proven after a direct application of Theorem 12 with the coefficients from Lemma 14.
a=p+pnb=npp+pα=pβ=p-pn.
(14)

5  Validation of the Diffusion Approximation

The previous section showed several case studies of the application of the three rigorous drift theorems for fixed budget analysis. However, this method was the result of an approximation which we were assuming to be accurate (Hypothesis 7). Finally, we provide an extensive validation to this assumption. We will tackle the accuracy of the method from several angles including experiments but also comparing the results against those obtained by rigorous methods.

5.1  Comparison with the Literature

We now compare the results obtained in Subsections 4.1 and 4.2 with some of the fixed budget literature. For the comparison on the LeadingOnes problem, we use the article that presented the first fixed budget analysis (Jansen and Zarges, 2012). In Table 2, we observe a high agreement for the upper bounds estimations (even an exact match for the RLS). Both approaches obtained the same growth term of 2t/n and the discrepancies for the (1+1) EA are present only on small terms. However, on the lower bounds the discrepancy is bigger, our method only obtained a growth term of t/n since we had to pessimistically use the smallest drift value (when Xt=n-1) to be able to apply the additive drift (recall the proof of Application 10). On the other hand, the results from the literature only hold for time budget values which are sufficiently smaller than the expected optimisation, whereas our results hold until the optimum is found.

Table 1:
Mutation and selection operator of each of the studied algorithms.
Algorithm 1 (1+1) EA RLS RW Metropolis SSWM 
Mutation Global Local Local Local Local 
Selection paccRLS paccRLS paccRW paccMA paccSSWM 
Algorithm 1 (1+1) EA RLS RW Metropolis SSWM 
Mutation Global Local Local Local Local 
Selection paccRLS paccRLS paccRW paccMA paccSSWM 
Table 2:
Comparison of results for LeadingOnes. The literature column is obtained from Theorems 7 and 12 in Jansen and Zarges (2012). The results for RLS hold for t=(1-β)n2 with 1/2+c<β<1. In the case of the (1+1) EA, the result holds for t=(1-β)n2α(n) with 1/2+c<β<1 and α(n)=w(1). The diffusion approximation column corresponds with Applications 10 and 11 and hold for t<min{t0Xt=n}. Random initialisation was considered in all the cases: EX0=1-2-n.
Algorithm EXt Literature Diffusion Approximation 
RLS  1+2t/n-2-Ω(1-β)n 1+t/n-2-n 
  1+2t/n-2-n 1+2t/n-2-n 
(1+1) EA  1+2t/n-ot/n 1+t/(en)-2-n 
  1+2t/n-ot/n 1+2t/n-2-n 
Algorithm EXt Literature Diffusion Approximation 
RLS  1+2t/n-2-Ω(1-β)n 1+t/n-2-n 
  1+2t/n-2-n 1+2t/n-2-n 
(1+1) EA  1+2t/n-ot/n 1+t/(en)-2-n 
  1+2t/n-ot/n 1+2t/n-2-n 

In Table 3 we consider the OneMax problem. Again, for RLS we compare against Jansen and Zarges (2012); we observe how both results (first two rows) can be recovered from our method noticing that e-t/n=(e-1/n)t(1-1/n)t, which is a good approximation for large n. In the case of the (1+1) EA we were able to obtain exactly the same lower bound as Lengler and Spooner (2015). The upper bounds are also in high agreement; however, the constant inside the exponential term is not as tight as in the literature.

Table 3:
Comparison of results for OneMax. The literature column is obtained from Theorems 8 and 5 in Jansen and Zarges (2012) and Theorem 5 in Lengler and Spooner (2015). The diffusion approximation column corresponds with Applications 12 and 13. Random initialisation was considered in all the cases apart from the first row (X0=0n).
Algorithm EXt Literature Diffusion Approximation 
RLS = n1-1-1/nt n1-e-t/n 
 = n-n/21-1/nt n-n/2e-t/n 
(1+1) EA  n1-12e-t/(en) n1-12e-t/(en) 
  n1-12e-t/n-Ot/n n1-12e-(3.1t)/n 
Algorithm EXt Literature Diffusion Approximation 
RLS = n1-1-1/nt n1-e-t/n 
 = n-n/21-1/nt n-n/2e-t/n 
(1+1) EA  n1-12e-t/(en) n1-12e-t/(en) 
  n1-12e-t/n-Ot/n n1-12e-(3.1t)/n 

Finally, since to our best knowledge, there are no previous fixed budget analysis of non-elitist algorithm we could not perform this sanity check for RW, MA, and SSWM. Nonetheless, the following subsections circumvents this problem by recurring to the concept of stationary distribution.

5.2  Stationary Distribution

One feature of non-elitist EAs is that although they can sample the optimum, they might leave it and move to a search point of lower fitness. Actually, in the long run the algorithm (under certain conditions) will reach the so-called stationary distributionπ(x) for when the probabilities of leaving each state cancel out with the probabilities of reaching that state (see e.g. Ross, 1996)
π(x)·P(xy)=π(y)·P(yx)x,y.
(15)
Once the process reaches this limit state, all the points in the search space will have a non-zero probability of being found when measuring the system. However, a smart algorithm will assign higher probability masses to points of higher fitness.

It is straightforward to notice that elitist algorithms do not have a stationary distribution. This is due to the fact that these algorithms always accept moves towards points of higher fitness but never accept moves towards points of lower fitness. This breaks the ergodicity required for a Markov Chain to have a stationary distribution. Hence, this sanity check for our method can only be applied for a RW, MA, and SSWM.

If we consider a RW with local mutations, it will follow the mutational bias towards bitstrings with n/2 zeros. Then, after enough time depending on the initialisation, the expected number of ones will be n/2. In the case of MA and SSWM, we find in the literature that they share the following stationary distribution (see Metropolis et al., 1953; Sella and Hirsh, 2005 or for a proof Nallaperuma, Oliveto et al., 2017):
π(x)=eγf(x)Z,
(16)
where γ=α for Metropolis, γ=2β(N-1) for SSWM and Z is the normalising constant.
Theorem 16:

Consider Metropolis and SSWM on OneMax once they have reached the stationary distribution. Then for any initialisation, the expected number of ones will be EXt=n1+e-γ, where γ=α for Metropolis and γ=2β(N-1) for SSWM.

Proof:
In the case of OneMax we can express EXt with the probabilities of each bit i being in each of the two possible values pi(1) and pi(0),
EXt=i=1npi(1)pi(1)+pi(0)=n·p(1)p(1)+p(0),
where in the last equality we have used the fact that each bit is optimised independently, thus we can apply linearity of expectations and consider that for any bit pi(1)=p(1) and pi(0)=p(0). Finally, introducing the probabilities at equilibrium from Equation (16) yields
EXt=n·eγeγ+1=n1+e-γ.
(17)

As discussed earlier, the long term (limit for t) expectation of the non-elitist multiplicative drift tends to the b term from the drift coefficient which according to Equation (14) has a value of b=(np)/(p+p). We can observe in Table 4 how this value matches exactly the real expected fitness at equilibrium EXt showing the strength of the SDE approach for this scenario.

Table 4:
Expected value at equilibrium EXt from Theorem 15 and long term expectation b from Application 15 on OneMax. With p=pacc(1) and p=pacc(-1).
 RW Metropolis SSWM 
p 1-e-2β1-e-2Nβ 
p e-α e2β-1e2Nβ-1 
b n/2 n1+e-α n1+e-2β(N-1) 
EXt n/2 n1+e-α n1+e-2β(N-1) 
 RW Metropolis SSWM 
p 1-e-2β1-e-2Nβ 
p e-α e2β-1e2Nβ-1 
b n/2 n1+e-α n1+e-2β(N-1) 
EXt n/2 n1+e-α n1+e-2β(N-1) 

5.3  Reconciling Fixed Budget with Runtime Analysis

As argued by Jansen and Zarges (2012), the fixed budget perspective provides more insight on the optimisation process. Runtime analysis focuses on the expected first hitting time of the optimum, this time is expressed as a function of the problem size ET=f(n). On the other hand, fixed budget considers the expected fitness obtained after t iterations. Hence, the result will also depend on the time EXt=f(n,t).

However, both approaches should be in concordance. If we know the first hitting time T(X) of any specific fitness value X, it looks natural to compute the inverse function to obtain the equivalent fixed budget result. As shown by Doerr et al. (2013) this is the case for deterministic algorithms, but it will not be true in general for randomised algorithms. The authors circumvent this problem by using sharp bounds on the probability of deviating from the expected times. Finally, they developed a method to recycle expected optimisation times to derive fixed budget results.

In this section, we consider the inverse approach, that is, deriving optimisation times from fixed budget results. Again, inverting the function EXt=f(n,t) for the time will not give, in general, the correct answer. However, we do not aim for a rigorous translation tool as achieved by Doerr et al. (2013). Our purpose is to obtain some time measurement that can be compared with the runtime literature (up to the use of Landau's notation).

Additive Drift

Before jumping into conclusions for the studied algorithms, let us consider an additive process. This is a collection of stochastic variables in time Xt0R whose expected change remains constant EXt-Xt-1Xt=b. When b is positive, the expectation EXt is constantly being increased and EXt always reaches any finite target value x (given enough, but finite time). Let us recall that this will not be the case for EAs reaching the optimum xopt of a fitness function, here Xtxopt and therefore Prob(Xt=xopt)=1 is required in order for EXt=xopt. We will consider first a pure additive process and afterwards we will relate it with RLS and the (1+1) EA optimising LeadingOnes.

We mentioned before that computing the inverse function of EXt for the time t will not yield, in general, the correct result. However, this approach is correct for an additive drift process. Here, we can solve EXt=f(n,t) for the time, obtaining a function of the form t=f(EX,n). Finally, if we introduce any target value for EX, it will yield the time when the expectation reaches the target value.

Theorem 17:
Consider a stochastic process Xt0R that follows an Additive Drift, that is, EXt-Xt-1Xt=b for some constant b>0. The time τ(x) is defined as the time when the expectation EXt reaches a value of x and is given by
τ(x)=x-EX0b.
Proof:
By using the additive drift condition, we can compute EXtX0 as follows
EXtX0=X0+i=0tEXi+1-XiXi=X0+i=0tb=X0+bt.
Using the law of total expectation, we obtain EXt=EX0+bt. Introducing EXt=x and t=τ leads to x=EX0+bτ, which solving for τ yields the claimed statement.

It is straightforward to notice that this theorem is not applicable for optimisation algorithms. If Xt describes the fitness of the EA, we observe that the boundary condition Xtxopt breaks the additive requirement: EXt-Xt-1Xt=xopt0b. As we mentioned earlier, we do not aim for a rigorous translation tool. However, as shown in Subsection 4.1, we can approximate the optimisation process of some algorithms on LeadingOnes with additive drift processes.

In Applications 10 and 11, we found additive drift bounds for the real drift of RLS and the (1+1) EA on LeadingOnes respectively. By introducing x=n (the optimal fitness of LeadingOnes) in Theorem 16, we will compare, in a non-rigorous way, the τ(n)-times for these processes with the runtimes from the literature. Table 5 presents these results showing a high agreement, up to smaller terms, of the SDE method with the literature. Since we used bounds for the drift coefficient b, the table includes an upper τu and a lower τ bound for the time τ(n).

Table 5:
Expected optimisation times ET from the literature (see e.g. Oliveto and Yao, 2011) and bounds on τ(n) from Theorem 16. The bounds on the drift bu, are taken from Equation (11) (RLS) and Equations (13) and (12), (lower and upper bound of the (1+1) EA respectively). Initialisation at 0n.
LeadingOnes RLS (1+1) EA 
b 1/n 1/en 
bu 2/n 2/n 
EX0 
τ(n) n2/2 n2/2 
τu(n) n2 en2 
ET Θ(n2) Θ(n2) 
LeadingOnes RLS (1+1) EA 
b 1/n 1/en 
bu 2/n 2/n 
EX0 
τ(n) n2/2 n2/2 
τu(n) n2 en2 
ET Θ(n2) Θ(n2) 

Multiplicative Drift

In the case of the multiplicative drift (Theorems 10 and 12) if we were to perform the same sanity check we would obtain an infinite time. This is due to the exponential decay of EXt through the term e-t. However, we can circumvent this problem by recurring to the concept of half-life from exponential decaying radioactive processes (see, e.g., Krane, 1987). The half-life is defined as the time taken until half of the atoms from a radionuclide have followed a decay process. In the same spirit, the following definition introduces the notion of δ-life.

Definition 4
(δ)-life: Consider a stochastic process Xt>0 whose expectation follows an exponential decay, that is, dEXt/dt=-aEXt for aR+. The δ-life namely τδ is defined as the time when EXt/EX0=δ and has a value of
τδ=ln(1/δ)a.

As used in Subsection 4, we will focus on the number of 1-bits on linear functions; then we establish our satisfiability criterion as δ=1/n. This implies that in the worst case, where the system starts at a distance n from the long term expectation we will be satisfied when there is only a difference of one 1-bit with the optimal solution.

Theorem 18:
The 1/n-life of the Non-Elitist Multiplicative Drift (Theorem 12) is given by
τ1/nnlnnp.
Proof:
The result for the expectation from Theorem 12 can be rewritten as EXt-b=EX0-b·e-at. It is straightforward to see that the process EXt-b follows an exponential decay; then using Definition 17 for δ=1/n leads to
τ1/n=lnna,
introducing the a coefficient from Equation (14) completes the proof.
τ1/n=nlnnp+pnlnnp.

Table 6 includes the 1/n-lives (τ1/n) derived from the SDE method with the above expression and expected optimisation times ET from the literature. Before making the comparison let us remember that optimisation times refer to hitting the optimum whereas the 1/n-life refers to approaching equilibrium. If the parameters of the algorithm allow the stationary distribution, to be close enough to the optimum this will be a fair comparison. This is the reason why the results for the RW highly differ in the table; a RW cannot optimise OneMax in polynomial time but can reach its stationary distribution. On the other hand, the results for the three other algorithms are in high concordance with the literature (up to smaller terms).

Table 6:
τ1/n from Theorem 18 and expected optimisation times ET on OneMax (RLS—e.g., Oliveto and Yao (2011), RW—Garnier et al. (1999), Metropolis—Jansen and Wegener (2007), and SSWM—Pérez Heredia et al. (2017)). Where p=pacc(1), γ=α for Metropolis and γ=2β(N-1) for SSWM.
OneMax RLS RW Metropolis SSWM 
p 1-e-2β1-e-2Nβ 
τ1/n nlnn nlnn nlnn nlnnp 
ET nlnn+O(n) Ω2n nlnn+O(n) nlnnp+O(n) 
γ lnn lnn 
OneMax RLS RW Metropolis SSWM 
p 1-e-2β1-e-2Nβ 
τ1/n nlnn nlnn nlnn nlnnp 
ET nlnn+O(n) Ω2n nlnn+O(n) nlnnp+O(n) 
γ lnn lnn 

5.4  Simulations

Another sanity check that we have considered to validate Hypothesis 7 is to perform numerical simulations. We built several graphical representations where the experimental results are represented by solid lines (expectation) and a shadowed region (one standard deviation around the expectation). On the other hand, theoretical results for the expectation are represented by dashed lines. For aesthetics reasons we have not included the variance estimations in the figures. This would have implied plotting two extra shadowed zones (upper and lower bounds estimations) per algorithm.

Firstly, the left graph from Figure 2 shows the application of the additive drift for LeadingOnes. We can observe that for RLS this approach is very precise until the last part of the optimisation process. Whereas for the (1+1) EA there is a large gap; nevertheless, the method properly represents the evolution of the (1+1) EA's fitness on this problem. In our opinion, the discrepancy gap is mainly due to two reasons:

  • choosing an additive drift approach for a process whose drift varies depending on the state and

  • pessimistic estimations for the additive constant b (see Theorem 8 and Applications 10 and 11).

Figure 2:

Comparison of results for LeadingOnes (left) and OneMax (right). Solid lines represent the experimental results (RLS—red and (1+1) EA—blue). Dashed lines represent the theoretical results (RLS—black and (1+1) EA—blue). For LeadingOnes, the theoretical upper bound is shared for both algorithms and the lower bound for RLS is shown in red. Initialisation at 0n. Problem size n=100. Time budget 10000 (Leading-Ones) and 1000 (OneMax). Results were averaged over 100 runs and the shadowed zones include one standard deviation around the expectation.

Figure 2:

Comparison of results for LeadingOnes (left) and OneMax (right). Solid lines represent the experimental results (RLS—red and (1+1) EA—blue). Dashed lines represent the theoretical results (RLS—black and (1+1) EA—blue). For LeadingOnes, the theoretical upper bound is shared for both algorithms and the lower bound for RLS is shown in red. Initialisation at 0n. Problem size n=100. Time budget 10000 (Leading-Ones) and 1000 (OneMax). Results were averaged over 100 runs and the shadowed zones include one standard deviation around the expectation.

This deviation from additivity can be seen at the end of the simulation when the experimental lines start curving, whereas the theoretical lines remain straight until they hit the optimum.

Secondly, the right graph in Figure 2 compares RLS and the (1+1) EA on the OneMax problem. The comparison between experiments and theory for RLS is excellent, even better than in LeadingOnes due to the absence of the boundary condition. Again for the (1+1) EA the theory collects the exponential decrease of the process. However, there is still a gap which in our opinion is mainly due to equivalent reasons:

  • using a multiplicative drift when the drift is not exactly multiplicative and

  • pessimistic estimations for the multiplicative constant b (see Theorem 10 and Application 13).

Figure 3 shows the application of the non-elitist drift for OneMax. The left graph in Figure 3 shows an excellent match between theoretical and experimental results for all the algorithms, describing the dynamics of these processes even when they are not able to optimise the problem. With the same accuracy, the right graph in Figure 3 shows a case study where two algorithms (Metropolis and RW) started in between their stationary expected value and the optimum but due to their weak selection strength, they move away the optimum but towards the equilibrium's value.

Figure 3:

Theoretical (dashed black lines) and experimental results (RLS—red, RW—green, Metropolis—blue, and SSWM—pink) on OneMax. Initialisation at 0n (left) and 020180 (right). Problem size n=100. Time budget 1000. Algorithms' parameters: α=1 for Metropolis and (β,N)=(1,2) for SSWM. Results were averaged over 100 runs and the shadowed zone includes one standard deviation around the expectation.

Figure 3:

Theoretical (dashed black lines) and experimental results (RLS—red, RW—green, Metropolis—blue, and SSWM—pink) on OneMax. Initialisation at 0n (left) and 020180 (right). Problem size n=100. Time budget 1000. Algorithms' parameters: α=1 for Metropolis and (β,N)=(1,2) for SSWM. Results were averaged over 100 runs and the shadowed zone includes one standard deviation around the expectation.

We would like to point out that our applications were just study cases to motivate the use of SDEs as an analysis tool for EAs. If we were aiming for precision, we could have recycled already known and better bounds for the drift such us b(Zt)-Zten1+16Ztn from Doerr et al. (2011), but then it would become a variable drift process (Johannsen, 2010).

5.5  Experimental Error

The previous sanity check was based on a visual representation of experimental and theoretical results. However, graphical representations can sometimes be misleading and we have to be sure that the results obtained were not just an aesthetic effect. In order to objectively validate the SDE method (Hypothesis 7) we perform an experimental measurement of the error. We will use the widely used concepts of mean absolute percentage error (MAPE) and root mean squared error (RMSE) (see, e.g., Hyndman and Koehler, 2006).

Definition 5:
Let Yi=1,2,,n be the vector of observed values and Y¯i=1,2,,n the vector of predicted values. Then, the mean absolute percentage error (MAPE) and the root mean squared error (RMSE) are given by
MAPE(Yi)=100%ni=1nYi¯-YiYi,RMSE(Yi)=1ni=1nYi¯-Yi2.

The advantage of the MAPE is that it takes into consideration the magnitude of the observed values. It produces a result in the form of a percentage that is interpreted as the magnitude of the deviation with reality. However, for small values of the observed values this measurement is not appropriate due to the singularity at Yi=0. This is a problem for measuring the error of the variance due to the deterministic initialisation used in the experiments and the concentration of the independent runs towards the same fitness value at the end of the simulation. For this reason, we have decided to use the MAPE for the expectations and the RMSE for the variance.

Tables 7 and 8 show agreement with Figures 2 and 3. The results for all the algorithms with local mutations, that is, all but the (1+1) EA, are excellent on OneMax. With MAPE of the expectation being even smaller than 1%. Only for the (1+1) EA we can see a significant difference, due to the reasons explained in the previous section.

Table 7:
Experimental errors for the expectation and variance of the five studied algorithms on OneMax. The observed values are obtained from experiments. The predicted values are those derived in applications 12, 13, and 15. Algorithms' parameters: α=1 for Metropolis and (β,N)=(1,2) for SSWM. Initialisation at 0n. Problem size n=100. Time budget 1000.
OneMax MAPE(EXt) RMSE(VarXt) 
RLS 0.15% 0.41 
(1+1) EA lower bound 16.34% 0.71 
(1+1) EA upper bound 22.18% 1.30 
RW 0.88% 0.42 
MA 0.62% 0.44 
SSWM 0.45% 0.37 
OneMax MAPE(EXt) RMSE(VarXt) 
RLS 0.15% 0.41 
(1+1) EA lower bound 16.34% 0.71 
(1+1) EA upper bound 22.18% 1.30 
RW 0.88% 0.42 
MA 0.62% 0.44 
SSWM 0.45% 0.37 
Table 8:
Experimental errors for the expectation and variance of RLS and the (1+1) EA on LeadingOnes. The observed values are obtained from experiments. The predicted values are those derived in applications 10 and 11. Initialisation at 0n. Problem size n=100. Time budget 10000.
LeadingOnes MAPE(EXt) RMSE(VarXt) 
RLS lower bound 47.90% 7.82 
RLS upper bound 3.39% 7.73 
(1+1) EA lower bound 72.27% 8.03 
(1+1) EA upper bound 23.74% 7.93 
LeadingOnes MAPE(EXt) RMSE(VarXt) 
RLS lower bound 47.90% 7.82 
RLS upper bound 3.39% 7.73 
(1+1) EA lower bound 72.27% 8.03 
(1+1) EA upper bound 23.74% 7.93 

Finally, the comparison against LeadingOnes (see Table 8) is not as successful as in the previous case. Here, the errors are quite large as shown in the left graph of Figure 2. But again, we think that the reasons for the discrepancy are not intrinsic to using SDEs; they are due to the chosen drift technique.

6  Conclusions

For the last decade, rigorous runtime analysis has become immensely popular, in part due to the emergence of the so-called “drift theorems” that inform about the time to reach a particular state or set of states. However, these results are not directly translatable for fixed budget scenarios which ask the opposite question: how much improvement can be obtained in a fixed number of fitness evaluations? This problem is relevant in practice: in real problems it is usually impossible to know whether the optimum has been found. In this case, it is arguably more useful to know how much improvement an algorithm can achieve in a fixed number of steps.

Here, we have introduced the use of SDEs to the study of EAs which seem particularly suited to obtain results for fixed budget scenarios. Even though SDEs are approximations of the dynamics of EAs, they can produce analytical insight which is not possible with other tools, such as Markov chains. At the same time, SDEs do not discard the stochasticity of the dynamics and allow for the estimation of the variance of relevant quantities, such as the state at which the algorithm will find itself after t iterations.

Here we made a simple preliminary exploration of the potential of these tools for the study of EAs. We did not make use of other techniques for the study of SDEs that allow for greater insight into the dynamics of a stochastic process. For example, the Fokker-Planck equation associated with an SDE can be used to track the full probability distribution of the process. Even though this is, in many instances, impossible to solve analytically it can reveal what are the crucial parameters of a stochastic process. Furthermore, it is often solvable for the steady state, allowing for the characterization of the stationary distribution of the process. In fact, these techniques are commonly used in the related field of population genetics where they have been used to provide foundational results. The results we presented here are in themselves interesting, especially when applied to fixed budget analysis, but they also hold the promise that SDEs can become a standard tool in EC.

On the other hand, it is fair to say that previous modelling attempts (those mentioned in the introduction) were aiming for more general and ambitious problems than what this article has covered. This is why we believe that the natural direction for future work will be considering more complex functions or more elaborated heuristics. But the main challenge would be a rigorous translation of the SDE method presented here.

Acknowledgments

The author would like to thank Dirk Sudholt and Tiago Paixão for valuable input and discussions on this manuscript. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 618091 (SAGE).

References

Akimoto
,
Y.
,
Auger
,
A.
, and
Hansen
,
N
. (
2012
). Convergence of the continuous time trajectories of isotropic evolution strategies on monotonic C2-composite functions. In
Proceedings of Parallel Problem Solving from Nature - PPSN XII: 12th International Conference, Part I
, pp.
42
51
.
Altenberg
,
L
. (
1994
). The schema theorem and Price's theorem. In
Proceedings of the Third Workshop on Foundations of Genetic Algorithms
, pp.
23
49
.
Black
,
F.
, and
Scholes
,
M
. (
1973
).
The pricing of options and corporate liabilities
.
Journal of Political Economy
,
81
(
3
):
637
654
.
Cox
,
J. C.
,
Ingersoll
,
J. E.
, and
Ross
,
S. A
. (
1985
).
A theory of the term structure of interest rates
.
Econometrica
,
53
(
2
):
385
407
.
Doerr
,
B.
,
Doerr
,
C.
, and
Yang
,
J
. (
2016
). Optimal parameter choices via precise black-box analysis. In
Proceedings of the Genetic and Evolutionary Computation Conference 2016 (GECCO '16)
, pp.
1123
1130
.
Doerr
,
B.
,
Fouz
,
M.
, and
Witt
,
C.
(
2011
).
Sharp bounds by probability-generating functions and variable drift
. In
Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation (GECCO '11)
, pp.
2083
2090
.
ACM
.
Doerr
,
B.
,
Jansen
,
T.
,
Witt
,
C.
, and
Zarges
,
C
. (
2013
). A method to derive fixed budget results from expected optimisation times. In
Proceedings of the Fifteenth Annual Conference on Genetic and Evolutionary Computation (GECCO '13)
, pp.
1581
1588
.
Doerr
,
B.
,
Johannsen
,
D.
, and
Winzen
,
C
. (
2012
).
Multiplicative drift analysis
.
Algorithmica
,
64
(
4
):
673
697
.
Einstein
,
A
. (
1905
).
Über die von der molekularkinetischen theorie der wärme geforderte bewegung von in ruhenden flüssigkeiten suspendierten teilchen
.
Annalen der Physik
,
322
(
8
):
549
560
.
Feller
,
W.
(
1949
).
On the theory of stochastic processes, with particular reference to applications
. In
Proceedings of the [First] Berkeley Symposium on Mathematical Statistics and Probability
, pp.
403
432
.
Berkeley, California
:
University of California Press
.
Garnier
,
J.
,
Kallel
,
L.
, and
Schoenauer
,
M
. (
1999
).
Rigorous hitting times for binary mutations
.
Evolutionary Computation
,
7
(
2
):
173
203
.
Goldberg
,
D.
(
1987
).
Simple genetic algorithms and the minimal, deceptive problem
. In
L.
Davis
(Ed.),
Genetic algorithms and simulated annealing
, pp.
74
88
.
He
,
J.
, and
Yao
,
X.
(
2001
).
Drift analysis and average time complexity of evolutionary algorithms
.
Artificial Intelligence
,
127:57
85
.
He
,
J.
, and
Yao
,
X
. (
2003
).
Towards an analytic framework for analysing the computation time of evolutionary algorithms
.
Artificial Intelligence
,
145
(
1
):
59
97
.
Holland
,
J. H
. (
1975
).
Adaptation in natural and artificial systems: An introductory analysis with applications to biology, control and artificial intelligence
.
Cambridge, MA
:
MIT Press
.
Reprint edition 1992 (originally published in 1975)
.
Hyndman
,
R. J.
, and
Koehler
,
A. B
. (
2006
).
Another look at measures of forecast accuracy
.
International Journal of Forecasting
,
22
(
4
):
679
688
.
Itô
,
K.
(
1944
).
Stochastic integral
. In
Proceedings of the Imperial Academy
,
20
(
8
):
519
524
.
Itô
,
K.
(
1946
).
On a stochastic integral equation
. In
Proceedings of the Japan Academy
,
22
(
2
):
32
35
.
Itô
,
K.
(
1950
).
Stochastic differential equations in a differentiable manifold
.
Nagoya Mathematical Journal
,
1:35
37
.
Itô
,
K.
(
1951
).
On a formula concerning stochastic differentials
.
Nagoya Mathematical Journal
,
3:55
65
.
Jansen
,
T.
, and
Wegener
,
I
. (
2007
).
A comparison of simulated annealing with a simple evolutionary algorithm on pseudo-Boolean functions of unitation
.
Theoretical Computer Science
,
386
(
1–2
):
73
93
.
Jansen
,
T.
, and
Zarges
,
C
. (
2012
). Fixed budget computations: A different perspective on run time analysis. In
Proceedings of the Genetic and Evolutionary Computation Conference (GECCO '12)
, pp.
1325
1332
.
Johannsen
,
D.
(
2010
).
Random combinatorial structures and randomized search heuristics
.
PhD thesis, Universität des Saarlandes, Saarbrücken, Germany and the Max-Planck-Institut für Informatik
.
Karlin
,
S.
, and
Taylor
,
H. M.
(
1981
).
A second course in stochastic processes
, 1st ed.
New York
:
Academic Press
.
Kimura
,
M
. (
1957
).
Some problems of stochastic processes in genetics
.
The Annals of Mathematical Statistics
,
28
(
4
):
882
901
.
Kimura
,
M
. (
1962
).
On the probability of fixation of mutant genes in a population
.
Genetics
,
47
(
6
):
713
719
.
Kimura
,
M
. (
1964
).
Diffusion models in population genetics
.
Journal of Applied Probabilitiy
,
1
(
2
):
177
232
.
Kimura
,
M
. (
1968
).
Genetic variability maintained in a finite population due to mutational production of neutral and nearly neutral isoalleles
.
Genetics Research
,
11
(
03
):
247
270
.
Krane
,
K. S
. (
1987
).
Introductory nuclear physics
.
New York
:
Wiley
.
Lehre
,
P. K.
, and
Witt
,
C
. (
2012
).
Black-box search by unbiased variation
.
Algorithmica
,
64
(
4
):
623
642
.
Lengler
,
J.
, and
Spooner
,
N
. (
2015
). Fixed budget performance of the (1+1) EA on linear functions. In
Proceedings of the 2015 ACM Conference on Foundations of Genetic Algorithms XIII
, pp.
52
61
.
Maghsoodi
,
Y
. (
1996
).
Solution of the extended cir term structure and bond option valuation
.
Mathematical Finance
,
6
(
1
):
89
109
.
Metropolis
,
N.
,
Rosenbluth
,
A. W.
,
Rosenbluth
,
M. N.
,
Teller
,
A. H.
, and
Teller
,
E
. (
1953
).
Equation of state calculations by fast computing machines
.
The Journal of Chemical Physics
,
21
(
6
):
1087
1092
.
Nallaperuma
,
S.
,
Neumann
,
F.
, and
Sudholt
,
D.
(
2017
).
Expected fitness gains of randomized search heuristics for the traveling salesperson problem
.
Evolutionary Computation
,
25
(
4
):
673
705
.
Nallaperuma
,
S.
,
Oliveto
,
P. S.
,
Heredia
,
J. P.
, and
Sudholt
,
D.
(
2017
).
When is it beneficial to reject improvements
? In
Proceedings of the Genetic and Evolutionary Computation Conference
(
GECCO '17
), pp.
1391
1398
.
Øksendal
,
B
. (
2003
).
Stochastic differential equations: An introduction with applications
.
Ann Arbor, MI
:
University of Michigan Press
.
Oliveto
,
P. S.
,
Paixão
,
T.
,
Pérez Heredia
,
J.
,
Sudholt
,
D.
, and
Trubenová
,
B.
(
2017
).
How to escape local optima in black box optimisation: When non-elitism outperforms elitism
.
Algorithmica
, pp.
1
300
.
Oliveto
,
P. S.
, and
Yao
,
X.
(
2011
).
Runtime analysis of evolutionary algorithms for discrete optimization
. In
Theory of Randomized Search Heuristics: Foundations and Recent Developments
.
World Scientific Publishing Co., Inc
.
Paixão
,
T.
, and
Pérez Heredia
,
J
. (
2017
). An application of stochastic differential equations to evolutionary algorithms. In
Proceedings of the 14th ACM/SIGEVO Conference on Foundations of Genetic Algorithms
, pp.
3
11
.
Paixão
,
T.
,
Pérez Heredia
,
J.
,
Sudholt
,
D.
, and
Trubenová
,
B
. (
2017
).
Towards a runtime comparison of natural and artificial evolution
.
Algorithmica
,
78
(
2
):
681
713
.
Paixão
,
T.
,
Badkobeh
,
G.
,
Barton
,
N.
,
Çörüş
,
D.
,
Dang
,
D.-C.
,
Friedrich
,
T.
,
Lehre
,
P. K.
,
Sudholt
,
D.
,
Sutton
,
A. M.
, and
Trubenová
,
B.
(
2015
).
Toward a unifying framework for evolutionary processes
.
Journal of Theoretical Biology
,
383:28
43
.
Pérez Heredia
,
J.
,
Trubenová
,
B.
,
Sudholt
,
D.
, and
Paixão
,
T
. (
2017
).
Selection limits to adaptive walks on correlated landscapes
.
Genetics
,
205
(
2
):
803
825
.
Prügel-Bennett
,
A
. (
1997
).
Modelling evolving populations
.
Journal of Theoretical Biology
,
185
(
1
):
81
95
.
Prügel-Bennett
,
A.
, and
Shapiro
,
J. L
. (
1994
).
Analysis of genetic algorithms using statistical mechanics
.
Physical Review Letters
,
72
(
9
):
1305
.
Ross
,
S
. (
1996
).
Stochastic processes
.
New York
:
John Wiley & Sons
.
Schaul
,
T
. (
2012
). Natural evolution strategies converge on sphere functions. In
Proceedings of the 14th Annual Conference on Genetic and Evolutionary Computation (GECCO '12)
, pp.
329
336
.
Sella
,
G.
, and
Hirsh
,
A. E.
(
2005
).
The application of statistical physics to evolutionary biology
. In
Proceedings of the National Academy of Sciences of the United States of America
,
102
(
27
):
9541
9546
.
Stieltjes
,
T. J
. (
1894
). Recherches sur les fractions continues. In
Annales de la Faculté des sciences de Toulouse: Mathématiques
, pp.
68
71
.
Stratonovich
,
R
. (
1966
).
A new representation for stochastic integrals and equations
.
SIAM Journal on Control
,
4
(
2
):
362
371
.
Vose
,
M. D
. (
1995
).
Modeling simple genetic algorithms
.
Evolutionary Computation
,
3
(
4
):
453
472
.
Wormald
,
N. C
. (
1995
).
Differential equations for random processes and random graphs
.
Annals of Applied Probability
,
5
(
4
):
1217
1235
.
Wright
,
A. H.
, and
Rowe
,
J. E.
(
2001
). Continuous dynamical system models of steady-state genetic algorithms. In
W. N.
Martin
and
W. M.
Spears
(Eds.),
Foundations of genetic algorithms 6
, pp.
209
225
.
San Francisco
:
Morgan Kaufmann
.
Yin
,
G.
,
Rudolph
,
G.
, and
Schwefel
,
H.-P
. (
1995
).
Analyzing (1,λ) evolution strategy via stochastic approximation methods
.
Evolutionary Computation
,
3
(
4
):
473
489
.

Note

*

An extended abstract of this article with preliminary results was presented at FOGA '17 by Paixão and Pérez Heredia (2017).