Extending previous analyses on function classes like linear functions, we analyze how the simple (1+1) evolutionary algorithm optimizes pseudo-Boolean functions that are strictly monotonic. These functions have the property that whenever only 0-bits are changed to 1, then the objective value strictly increases. Contrary to what one would expect, not all of these functions are easy to optimize. The choice of the constant c in the mutation probability p(n)=c/n can make a decisive difference.
We show that if c<1, then the (1+1) EA finds the optimum of every such function in Θ(n log n) iterations. For c=1, we can still prove an upper bound of O(n^{3/2}). However, for c≥16, we present a strictly monotonic function such that the (1+1) EA with overwhelming probability needs 2^{Ω(n)} iterations to find the optimum. This is the first time that we observe that a constant factor change of the mutation probability changes the runtime by more than a constant factor.
Rigorously understanding how randomized search heuristics solve optimization problems and proving guarantees for their performance remains a challenging task. The current state of the art is that we can analyze some heuristics for simple problems. Nevertheless, these works have yielded new insight, helped to dispel mistaken beliefs, and turned correct beliefs into proven facts.
For example, it was long believed that a pseudo-Boolean function f is easy to optimize if it is unimodal, that is, if each x that is not optimal has a Hamming neighbor y with f(y) > f(x) (Mühlenbein, 1992). Recall that y is called a Hamming neighbor of x if x and y differ in exactly one bit.
This belief was debunked in Droste, Jansen, and Wegener (1998). There the unimodal long k-path function (Horn, Goldberg, and Deb, 1994) was considered and it was proven that the simple (1+1) evolutionary algorithm ((1+1) EA) with high probability does not find the optimum within a polynomial number of iterations. Note that, as was seemingly overlooked for a long time, the Annals of Probability paper by Aldous (1983) also implies that unimodal functions are not necessarily easy for randomized algorithms. This classical episode shows how important it is to support an intuitive understanding of evolutionary algorithms with rigorous proofs.
It also shows that it is very difficult to identify problem classes that are easy for a particular randomized search heuristic. This, however, is needed for a successful application of such methods, because the no free lunch theorems (Igel and Toussaint, 2004) tell us, in simple words, that no randomized search heuristic can be superior to another if we do not restrict the problem class we are interested in.
1.1. Previous Work
In the following, we restrict ourselves to classes of pseudo-Boolean functions. We stress that the last 10 years also produced a number of results on combinatorial problems (cf. Oliveto, He, and Yao, 2007). At the same time, research on classical test functions and function classes continued, spurred by the many still open problems.
We also restrict ourselves to one of the simplest randomized search heuristics, the (1+1) EA. The first rigorous results on this heuristic were given by Mühlenbein (1992), who determined how long it takes to find the optimum of simple test functions like OneMax, counting the number of 1-bits. Quite some time later, and with much more technical effort necessary, Droste, Jansen, and Wegener (2002) extended the O(n log n) bound to all linear functions f(x) = a_1x_1 + … + a_nx_n. Without loss of generality, one can assume that all coefficients a_i are positive, so that the all 1s string 1^n is the global optimum. Since it was hard to believe that such a simple result should have such a complicated proof, this work initiated a sequence of follow-up results, in particular introducing the powerful drift analysis method to the community (He and Yao, 2001, 2002) (see, e.g., Jägersküpper, 2011; Doerr, Johannsen, and Winzen, 2010a; Doerr and Goldberg, 2010b, for recent extensions). However, not all promising-looking function classes are easy to optimize. As laid out in the first paragraphs of this paper, unimodal functions are already difficult.
Almost all of the results described above were proven for the standard mutation probability 1/n. It is easy to see from their proofs (or, in the case of linear functions, the more elaborate methods needed, Doerr and Goldberg, 2010a), that all results remain true for p(n)=c/n, where c can be an arbitrary constant.
We should add that the question of how to determine the right mutation probability is also far from being settled. Most theory results for simplicity take the value p(n)=1/n, but it is known that this is not always optimal (Jansen and Wegener, 2000). In practical applications, 1/n is the most recommended static choice for the mutation probability (Bäck, Fogel, and Michalewicz, 1997; Ochoa, 2002) in spite of known limitations of this choice (Cervantes and Stephens, 2009).
Only recently, more precise theoretical results on optimal mutation probabilities have appeared. Böttcher, Doerr, and Neumann (2010) showed that p(n)≈1.59/n is the best choice of the mutation probability for the LeadingOnes problem. Witt (2011) proved that p(n)=1/n is an optimal mutation probability for the (1+1) EA on linear functions. Sudholt (2011) proved the same for the (1+1) EA on long k-paths.
With the standard mutation probability 1/n the (1+1) EA is similar to a random local search where exactly one bit is flipped. The bit to be flipped is chosen uniformly at random. The resulting search point replaces the current one in case its fitness is not worse. For many functions, the (1+1) EA with mutation probability 1/n and random local search have equal performance. The general question where random local search and the (1+1) EA have asymptotically equal performance is difficult to answer (Doerr, Jansen, and Klein, 2008). But for the class of linear functions, both algorithms have expected optimization time Θ(n log n). For random local search, this also holds for monotonic functions for the very same reasons. Flipping a single bit from 0 to 1 implies that the fitness strictly increases and so this 1 will never be lost again. On average, after O(n log n) mutations, each bit has been flipped at least once, so that each bit has a value of 1. For the (1+1) EA, however, things are much more involved and become very different if the mutation probability is only slightly increased to c/n (with a sufficiently large constant c).
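The coupon collector behavior of random local search can be illustrated with a small simulation. The following is only a sketch and not part of the original analysis: the function name `rls_optimization_time`, the use of OneMax as a representative monotonic function, and the seed are assumptions made for illustration.

```python
import math
import random


def rls_optimization_time(n, rng):
    """Run random local search on OneMax; return the number of mutations.

    RLS flips exactly one uniformly chosen bit per step and keeps the
    offspring if its fitness is not worse. On a strictly monotonic
    function a 0 -> 1 flip is always accepted and a 1 -> 0 flip is
    always rejected, so only the bit that is hit matters.
    """
    x = [rng.randint(0, 1) for _ in range(n)]  # uniform random start
    steps = 0
    while sum(x) < n:
        i = rng.randrange(n)  # position of the single bit to flip
        steps += 1
        if x[i] == 0:
            x[i] = 1  # strict improvement: accepted
        # a 1 -> 0 flip strictly decreases the fitness: rejected
    return steps


if __name__ == "__main__":
    rng = random.Random(1)  # illustrative seed
    n = 200
    runs = [rls_optimization_time(n, rng) for _ in range(50)]
    print("mean number of mutations:", sum(runs) / len(runs))
    print("n ln n for comparison   :", n * math.log(n))
```

The empirical mean should be of the same order as n ln n, in line with the coupon collector argument; the simulation says nothing about the (1+1) EA, where several bits may flip at once.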
1.2. Our Work
In this work, we regard the class of strictly monotonic functions. A pseudo-Boolean function is called strictly monotonic (or simply monotonic in the following) if any mutation flipping at least one 0-bit into a 1-bit and no 1-bit into a 0-bit strictly increases the function value. Hence, much stronger than for unimodal functions, we not only require that each nonoptimal x has a Hamming neighbor with better f-value, but we even ask that this holds for all Hamming neighbors that have an additional 1-bit.
Obviously, the class of monotonic functions includes linear functions with all bit weights positive. On the other hand, each monotonic function is unimodal. Contrary to the long k-path function, there is always a short path of at most n search points with increasing f-value connecting a search point to the optimum.
It is easy to see that monotonic functions are just the ones where a simple coupon collector argument shows that random local search finds the optimum in time O(n log n). Surprisingly, we find that monotonic functions are not easy to optimize for the (1+1) EA in general. Moreover, our results show that for this class of functions, the mutation probability p(n)=c/n, where c is a constant, can make a crucial difference.
More precisely, we show that for c<1, the (1+1) EA with mutation probability c/n finds the optimum of any monotonic function in time Θ(n log n), which is the best possible given previous results on linear functions. For c = 1, the drift argument breaks down and we have to resort to an upper bound of O(n^{3/2}) based on a related model by Jansen (2007). We currently do not know the full truth. As the lower bound, we only have the general lower bound Ω(n log n) for all mutation-based evolutionary algorithms (Droste et al., 2002).
If c is sufficiently large, an unexpected change of regime happens. For c≥16, we show that there are monotonic functions such that the (1+1) EA with overwhelming probability needs an exponential time to find the optimum. The construction of such functions heavily uses probabilistic methods. To the best of our knowledge, this is the first time that problem instances are constructed this way in the theory of evolutionary computation.
It must be stressed that this is the first result where the mutation probability stays within Θ(1/n) while the expected optimization time changes from polynomial to exponential. Earlier results showed a similar drastic change only when the order of growth of the mutation probability changed, for example, from Θ(1/n) to Θ((log n)/n) (Jansen and Wegener, 2000). In addition, we show that this unexpected behavior may already take place in the class of monotonic functions, which is generally considered to be good natured. From a theoretical point of view, this is an important step toward much more precise results and a better understanding of mutation.
For some randomized search heuristics, our results are also of practical relevance. In memetic algorithms (Krasnogor and Smith, 2005), it is common practice to (occasionally) use very high mutation probabilities to escape regions of local optima. In artificial immune systems (Dasgupta and Niño, 2008) hypermutations using very high mutation probabilities are the most common variation operators. For these algorithms, it is not uncommon to have mutation probabilities that exceed 16/n. Our results demonstrate the possible danger of this approach. They indicate that in general it may not be a good idea to rely only on search by such disruptive variation operators. For artificial immune systems, this has already been pointed out based on theoretical findings (Jansen and Zarges, 2011). That these problems are real has led to the inclusion of less disruptive mutation operators in artificial immune systems in practical applications. For example, in the B-cell algorithm (Kelsey and Timmis, 2003) hypermutations are combined with standard bit mutations using mutation probability 1/n, leading to proven good performance (Jansen, Oliveto, and Zarges, 2011).
We consider the maximization of a pseudo-Boolean function f: {0,1}^n → ℝ by means of a simple evolutionary algorithm, the (1+1) EA. The results can easily be adapted for minimization. In this work, n always denotes the number of bits in the representation. We use common asymptotic notation (Cormen, Leiserson, Rivest, and Stein, 2001); all asymptotics are with respect to n.
In our analyses we denote by mut(x) the bit string that results from a mutation of x. We denote as x^+ the search point that results from a mutation of x and a subsequent selection. Formally, y=mut(x) and x^+=y if f(y)≥f(x) and x^+=x otherwise.
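The mutation and selection steps just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the names `mutate` and `one_plus_one_ea`, the uniform random initialization, the iteration cap `max_iters`, and the stopping test for the all 1s string (which is the optimum of a monotonic function, but not of an arbitrary f) are choices made here.

```python
import random


def mutate(x, p, rng):
    """Standard bit mutation: flip each bit independently with probability p."""
    return [1 - b if rng.random() < p else b for b in x]


def one_plus_one_ea(f, n, c, rng, max_iters=10**6):
    """Maximize f on {0,1}^n with the (1+1) EA and mutation probability c/n.

    Returns the number of mutations performed until the all 1s string is
    the current search point, or None if max_iters mutations did not
    suffice. Stopping at the all 1s string assumes that it is the optimum.
    """
    p = c / n
    x = [rng.randint(0, 1) for _ in range(n)]  # assumed uniform initialization
    fx = f(x)
    t = 0
    while not all(x):
        if t == max_iters:
            return None
        y = mutate(x, p, rng)  # y = mut(x)
        t += 1
        fy = f(y)
        if fy >= fx:  # selection: accept if the fitness is not worse
            x, fx = y, fy
    return t


if __name__ == "__main__":
    rng = random.Random(7)  # illustrative seed
    # OneMax (the built-in sum of the bit list) as a simple monotonic function
    print(one_plus_one_ea(sum, n=50, c=0.5, rng=rng))
```

The returned count corresponds to the number of mutations performed, which is the quantity bounded throughout this work.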
For x ∈ {0,1}^n let Z(x) describe the positions of all 0-bits in x, Z(x) ≔ {i ∈ [n] | x_i = 0}. By |x|_0 ≔ |Z(x)| we denote the number of 0-bits in x and by |x|_1 ≔ n−|x|_0 we denote the number of 1-bits. For k ∈ ℕ let [k] ≔ {1, …, k}. For a set I = {i_1, …, i_ℓ} ⊆ [n] with i_1 < … < i_ℓ we write x_I ≔ x_{i_1} … x_{i_ℓ} for the substring of x with the bits selected by I. To simplify notation, we assume that any time we consider some r ∈ ℝ but in fact need some r ∈ ℕ, then r is silently replaced by ⌊r⌋ or ⌈r⌉ as appropriate.
We are interested in the optimization time, defined as the number of mutations until a global optimum is found. For the (1+1) EA, this is an accurate measure of the actual runtime. For bounds on the optimization time, we use common asymptotic notation. Such a bound on the optimization time is called exponential if it is 2^{Ω(n)}. We also say that an event A occurs with overwhelming probability (w.o.p.) if Pr(A) = 1 − 2^{−Ω(n)}.
A function f is called linear if it can be written as f(x) = w_1x_1 + … + w_nx_n for weights w_1, …, w_n ∈ ℝ. The simplest linear function is the function OneMax(x) ≔ x_1 + … + x_n = |x|_1. Another intensively studied linear function is BinVal(x) ≔ Σ_{i=1}^{n} 2^{n−i} x_i. As 2^{n−i} > Σ_{j=i+1}^{n} 2^{n−j}, the bit value of some bit i dominates the effect of all bits i+1, …, n on the function value. Both functions will later be needed in our construction.
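For concreteness, the two linear functions can be written down directly. The sketch below assumes the common convention that BinVal gives weight 2^{n−i} to bit i (counting from 1), so the first bit is the most significant one; the function names are illustrative.

```python
def onemax(x):
    """OneMax: the number of 1-bits in the bit list x."""
    return sum(x)


def binval(x):
    """BinVal: x read as a binary number, first bit most significant.

    With 0-based index i the weight is 2^(n-1-i), which matches the
    weight 2^(n-i) for 1-based positions (assumed convention).
    """
    n = len(x)
    return sum(b << (n - 1 - i) for i, b in enumerate(x))


if __name__ == "__main__":
    # A single leading 1-bit outweighs all less significant bits together.
    print(binval([1, 0, 0, 0]), ">", binval([0, 1, 1, 1]))
    print(onemax([1, 0, 0, 0]), "<", onemax([0, 1, 1, 1]))
```

The last two lines show the dominance property: under BinVal the string 1000 beats 0111 although it has fewer 1-bits, which is exactly the kind of accepted net loss of 1-bits exploited later in the construction.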
For two search points x, y ∈ {0,1}^n, we write x≤y if x_i≤y_i holds for all i ∈ [n]. We write x<y if x≤y and x≠y hold. We call f a strictly monotonic function (usually called simply monotonic in the following) if for all x, y ∈ {0,1}^n with x<y it holds that f(x)<f(y). Observe that the above condition is equivalent to f(x)<f(y) for all x and y such that x and y only differ in exactly one bit and this bit has value 1 in y. In other words, every mutation that only flips bits with value 0 strictly increases the function value. Clearly, the all 1s bit string 1^n is the unique global optimum for a monotonic function.
Note that every linear function with strictly positive weights is a strictly monotonic function as flipping only 0-bits to 1 strictly increases the fitness. Also recall that every monotonic function is unimodal since for each nonoptimal search point, that is, for each x≠1^n, we can flip exactly one 0-bit and get a Hamming neighbor y with f(y)>f(x).
If, for two distinct search points x and y, neither x<y nor y<x holds, we say that x and y are incomparable. This happens if and only if there are two different bit positions i and j such that x_i=0 but y_i=1 and x_j=1 but y_j=0. Note that monotonicity does not by itself impose any restriction on the relative order of the fitness values of incomparable x and y. In other words, if f is monotonic, then any of the following cases can occur: f(x)>f(y), f(x)=f(y), or f(x)<f(y). When constructing a monotonic function, we can choose any of the above cases for f, as long as no monotonicity constraint involving other search points is violated. This in particular indicates that the class of monotonic functions contains much more complex functions than linear functions.
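For small n, these definitions can be checked by brute force. The sketch below uses the equivalent Hamming-neighbor characterization of strict monotonicity; the helper names `leq`, `incomparable`, and `is_strictly_monotonic`, as well as the example functions, are hypothetical and serve only as an illustration.

```python
from itertools import product


def leq(x, y):
    """Componentwise order: x <= y iff x_i <= y_i for every position i."""
    return all(a <= b for a, b in zip(x, y))


def incomparable(x, y):
    """True iff neither x <= y nor y <= x holds."""
    return not leq(x, y) and not leq(y, x)


def is_strictly_monotonic(f, n):
    """Check f(x) < f(y) for every x and every y that results from
    flipping a single 0-bit of x to 1 (exhaustive, so small n only)."""
    for x in product((0, 1), repeat=n):
        for i in range(n):
            if x[i] == 0:
                y = x[:i] + (1,) + x[i + 1:]
                if not f(x) < f(y):
                    return False
    return True


if __name__ == "__main__":
    print(is_strictly_monotonic(sum, 4))                            # OneMax
    print(is_strictly_monotonic(lambda x: sum(x) ** 2 + x[0], 4))   # nonlinear example
    print(is_strictly_monotonic(lambda x: x[0] - x[1], 2))          # negative weight
    print(incomparable((0, 1, 1, 1), (1, 1, 0, 0)))
```

The second example is a nonlinear function that still passes the check, which illustrates that the class is strictly larger than the linear functions with positive weights.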
3. Runtime Results for Monotonic Functions
For the (1+1) EA, the difficulty of monotonic functions strongly depends on the mutation probability p(n). We are interested in mutation probabilities p(n)=c/n for some constant c>0. If c is a constant with c<1, on average, less than one bit flips in a single mutation. If this is a 1-bit we have f(x)>f(mut(x)) and x=x^+ holds. Otherwise, f(x^+)>f(x) holds and we accept this move. This way the number of 0-bits is quickly reduced to 0 and the unique global optimum is found. Using drift analysis, this reasoning can easily be made precise.
Theorem 1. Let c<1 be a constant. For every monotonic function, the expected optimization time of the (1+1) EA with mutation probability p(n)=c/n is Θ(n log n).
The proof of the lower bound is not restricted to c<1. For any constant c>0, the number of steps considered, c^{−1}(n−c) ln n, is Ω(n log n). This implies the following corollary.
Corollary 1. Let c>0 be a constant. For every monotonic function, the expected optimization time of the (1+1) EA with mutation probability p(n)=c/n is Ω(n log n).
The proof of the upper bound in Theorem 1 breaks down for c=1. In this case, the drift in the number of 1-bits can be bounded pessimistically by a model due to Jansen (2007) where we consider a random process that mutates x to y with mutation probability p(n)=1/n and replaces x by y if either x≤y holds or we have neither x≤y nor y≤x but |y|_1<|x|_1 holds.
The model is pessimistic in the following sense. Every mutation that flips only 0-bits to 1-bits is guaranteed to lead to an improvement in the function value for every monotonic function and is accepted in this model, too. For the analysis of the model as well as the (1+1) EA on a monotonic function, drift analysis could be employed using the number of 0-bits as the drift function. With respect to this drift function, the model is more pessimistic than any monotonic function since each mutation that potentially decreases the number of 1-bits in a monotonic function is accepted. No single monotonic function can accept all of these mutations. To see this, consider, for example, n=4 and the following sequence of bit strings: s0=0111, s1=1100, s2=0001, s3=0011. In the pessimistic model, we could have s0, s1, s2, s3, and s0 as a sequence of current bit strings. This cannot be the case for the (1+1) EA with any monotonic function, since f(s2)<f(s3)<f(s0) holds by definition of monotonicity. Having s0, s1 as a sequence of current bit strings implies f(s1)≥f(s0), and since f(s0)>f(s2) we cannot have s2 as the next current bit string. Thus, the pessimistic model allows for cycles that are not possible for the (1+1) EA with any monotonic function.
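The example cycle can be checked mechanically. The sketch below assumes that the acceptance rule of the pessimistic model reads: replace x by y if x≤y, or if x and y are incomparable and y has fewer 1-bits than x. The function names and the string encoding of search points are illustrative choices.

```python
def leq(x, y):
    """Componentwise order on bit strings given as str of '0'/'1'."""
    return all(a <= b for a, b in zip(x, y))


def ones(x):
    """Number of 1-bits of a bit string."""
    return x.count("1")


def model_accepts(x, y):
    """Assumed acceptance rule of the pessimistic model (see lead-in)."""
    if leq(x, y):
        return True
    incomparable = not leq(y, x)
    return incomparable and ones(y) < ones(x)


# s0, s1, s2, s3 and back to s0, as in the example for n = 4
CYCLE = ["0111", "1100", "0001", "0011", "0111"]


def is_cycle_in_model(cycle):
    """True iff every consecutive transition is accepted by the model."""
    return all(model_accepts(a, b) for a, b in zip(cycle, cycle[1:]))


if __name__ == "__main__":
    print("cycle possible in the model:", is_cycle_in_model(CYCLE))
    # Monotonicity forces f(s2) < f(s3) < f(s0) because s2 < s3 < s0.
    print("s2 <= s3 <= s0:", leq("0001", "0011") and leq("0011", "0111"))
```

Every transition of the cycle passes the model's rule, while the chain s2 < s3 < s0 shows why a monotonic function that accepted the step from s0 to s1 cannot also accept the step from s1 to s2.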
Using the number of 0-bits as drift function, the worst case model can yield an upper bound for the expected optimization time of the (1+1) EA with mutation probability p(n)=1/n on monotonic functions. This way, we obtain the upper bound of O(n^{3/2}) for p(n)=1/n.
Theorem 2. For every monotonic function, the expected optimization time of the (1+1) EA with mutation probability p(n)=1/n is O(n^{3/2}).
Our main result is that using mutation probability p(n)=c/n, where c is a sufficiently large constant, the optimization of monotonic functions can become very difficult for the (1+1) EA. This is the first result where increasing the mutation probability by a constant factor increases the optimization time from polynomial to exponential with overwhelming probability.
Theorem 3. For every constant c≥16 the following holds. For all n ∈ ℕ, there exists a monotonic function f on {0,1}^n and a constant κ>0 such that, with probability 1 − 2^{−Ω(n)}, the (1+1) EA with mutation probability p(n)=c/n does not optimize f within 2^{κn} generations.
The remainder of this work is devoted to the formal proof of Theorem 3. We first present the construction of such a monotonic function f in the following section, and then prove that it has the desired properties in Section 5.
4. A Difficult to Optimize Monotonic Function
In this section, we describe a monotonic function that is difficult to optimize via a (1+1) EA with mutation probability p(n)=c/n, if c≥16 is constant.
The main idea is the construction of a kind of long path function, similar to the work by Horn et al. (1994). They defined a path of Hamming neighbors (i.e., bit strings differing in exactly one bit) of exponential length. The probability of taking a shortcut by mutation, that is, jumping forward a long distance on the path, is very small, as many bits have to flip simultaneously. All points that are not on the path have an unfavorable fitness, so an evolutionary algorithm is forced to follow the path to the end.
Here, we also have an exponentially long path such that shortcuts can only be taken if a large number of bits flip simultaneously, a very unlikely event. The construction is complicated by the fact that the function needs to be monotonic. Hence, we cannot forbid leaving the path by giving the boundary of the path an unfavorable fitness. We solve this problem, roughly speaking, by implementing the path on a level of bit strings having similar numbers of 1-bits. Monotonicity simply forbids leaving the level to strings having fewer 1-bits. The path is broad in a sense that the algorithm can gather some additional 1-bits without leaving the path. The crucial part of our construction is setting up the function in such a way that, in spite of monotonicity, not too many 1-bits are collected.
Our path will be located at a region where the number of 1-bits is already fairly large. If the mutation probability c/n is large, it is likely that more 1-bits are flipped to 0 than 0-bits are flipped to 1. So, when mutating a point on the path, it is likely that we have a net loss in terms of the number of 1-bits. This effect becomes more pronounced the more 1-bits the mutated search point has. The behavior of the (1+1) EA of course depends on whether such a net loss will be accepted. Monotonicity requires that whenever only 1-bits are flipped to 0, then the fitness must decrease. However, if, say, one 0-bit is flipped to 1 and three 1-bits are flipped to 0, the two search points are incomparable. Hence, even for a monotonic fitness function, such a transition might be accepted. Our long path function is constructed in such a way that operations leading to a net loss of 1-bits when moving to an incomparable offspring are often accepted, while the current search point is on the path. This prevents the algorithm from gathering too many 1-bits and hence leaving the path.
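The net loss of 1-bits under mutation alone can be quantified by a short calculation. As a sketch, with symbols m and z introduced here only for illustration: consider a set of m bits containing z 0-bits, flip each bit independently with probability c/n, and ignore selection.

```latex
\begin{align*}
\mathrm{E}[\text{number of bits flipping from 1 to 0}] &= (m - z)\,\frac{c}{n},\\
\mathrm{E}[\text{number of bits flipping from 0 to 1}] &= z\,\frac{c}{n},\\
\mathrm{E}[\text{change in the number of 1-bits}] &= \frac{c}{n}\,(2z - m) < 0
  \quad \text{whenever } z < \frac{m}{2}.
\end{align*}
```

The expected loss (c/n)(m − 2z) grows as z shrinks and scales linearly with c, which matches the remarks that the effect is more pronounced for search points with many 1-bits and for large mutation probabilities. Whether such a loss survives selection is a separate question and depends on the fitness function.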
These considerations particularly apply to a subset of bits that we call the window. The precise subset determines the position on the long path; the set of bits in the window changes as the algorithm moves along on the path. More formally, for a subset of indices B ⊆ [n] and for x ∈ {0,1}^n, the bits x_i with i ∈ B are referred to as the window. These indices need not form a block, that is, B can be any subset of [n] and need not necessarily be of the form {a, a+1, …, b} with a≤b. The bits x_i with i ∉ B are outside the window. Inside the window, the function value is given by BinVal. The weights for BinVal are ordered differently for each window in order to avoid correlation between windows. The window is placed such that there is only a small number of 0-bits outside the window. Reducing the number of 0-bits outside causes the window to be moved. This is a likely event that happens frequently. However, we manage to construct an exponentially long sequence of windows with the additional property that in order to come from one window to another one at a large distance (in the sense of this sequence), a large number of bits needs to be flipped simultaneously. Since this is highly unlikely, it is very likely that the sequence of windows is followed, that is, we do not jump from one window to another one at a large distance. Thus, following the path takes, with overwhelming probability, an exponential number of steps. Droste, Jansen, and Wegener (1998) embed the long path into a unimodal function in a way that the (1+1) EA reaches the beginning of the path with probability close to 1. We adopt this technique and extend it to our monotonic function.
The following Lemma 1 defines the sequence of windows of our function by defining the index sets Bi. Concrete values for the upcoming constants and will be given later on in Theorem 4. The property that windows with large distance have large Hamming distance is formally stated as for and some constant .
Letbe constants withand. Let, and. Finally, let. Then there existsuch that the following holds. Letfor all. Then
for allsuch that.
The proof invokes the probabilistic method (Alon and Spencer, 2008), that is, we describe a way to randomly choose the bi that ensures that properties (1) and (2) hold with positive probability. This necessarily implies the existence of such a sequence .
Let the be chosen uniformly at random subject to condition (1). More precisely, let be chosen uniformly at random. If are already chosen, then choose bi from uniformly at random.
Let with i < j and . By definition, the sets Bi and Bj do not share an index in [L]. Fix any outcome of Bi. For all let Xk be the indicator random variable for the event . Then . We have that, conditional on any outcomes of all other bj+k-t, , the probability that is at most .
One technical tool in the definition of the set of difficult monotonic functions is the (random) permutation of vectors that allow us to effectively reduce dependencies between bits. We define the notation for this tool in the following definition.
Let, , , L, , the biand Bi be as in Lemma 1. Letwith. Forlet. Let, ifis nonempty. Forletbe a permutation of Bi. We use the shorthandto denote the vector obtained from permuting the components ofaccording to. Consequently, .
The following definition introduces a set of monotonic functions most of which will turn out to be difficult to optimize. The definition assumes the sequence of windows Bi to be given. For we say that some is a potential position in the sequence of windows if the number of 0-bits outside the window Bi is limited by , some constant. We select the largest potential position i as actual position and have the function value for x depend mostly on this position. If no potential position i exists, we have not yet found the path of windows and lead the (1+1) EA toward it. If , that is, the end of the path is reached, the (1+1) EA is led toward the unique global optimum via OneMax.
In addition to the permutations from Definition 1, we define further permutations , , for the window B1. These permutations are used to lead the (1+1) EA toward the start of the path and the first window B1.
We state one observation concerning the function that is important in the following. It states that as long as the end of the path of windows is not found, the number of 0-bits outside is not only bounded by but equals . This property will be used later on to show that the window is moved frequently.
Letbe as in Definition 2. Letwithand. If, then.
By assumption we have . We consider and see that the set coincides with in all but two elements: we have and . Consequently, and differ by at most one. Thus, implies and we can replace by . This contradicts . We have by definition and thus follows.
Our first main claim is that is in fact monotonic. This is not difficult, but might not, due to the complicated definition of , be obvious.
For allas above, is monotonic.
Let . Let and such that xj=0. Let be such that yk=xk for all and yj=1−xj. That is, y is obtained from x by flipping the jth bit (which is 0 in x) to 1. To prove the lemma, it suffices to show f(x)<f(y).
Let first . If we have and so f(x)<f(y) follows. If we have either (in case j∉B1) or (in case ). In both cases, f(x)<f(y) holds.
Now assume and . By definition , hence . If , we conclude with Lemma 3 that , and f(y)>f(x) follows from . If , then f(y)>f(x). In all other cases, f(x)=L2^{3n}+|x|_1 and f(y)=L2^{3n}+|y|_1, hence f(y)>f(x).
5. Proof of Theorem 3
By means of the function defined in the previous section (Definition 2), we are now ready to prove Theorem 3. We start with the concrete statement we want to prove.
Consider the (1+1) EA with mutation probability c/n for a constanton the functionfrom Definition 2 whereis chosen uniformly at random and the parameters are chosen according to, , and. There is a constantsuch that with probability the (1+1) EA needs at leastgenerations to optimize f.
This result shows that if f is chosen randomly (according to the construction described), then the (1+1) EA w.o.p. needs an exponential time to find the optimum. Clearly, this implies that there exists a particular function f, that is, a choice of , such that the EA faces these difficulties. This is Theorem 3. In fact, there is even an exponential number of functions for which this holds. The parameters , and in Theorem 4 were chosen to obtain a small constant in the threshold 16/n for the mutation rate.
The proof of Theorem 4 is long and technical. Therefore, we first present an overview of the main proof ideas.
We shall show that both after a typical initialization, when , and afterward, when and , we have the following situation. There is a window of bits ( if is defined and B1 otherwise) such that the fitness of the search points depends mainly on the BinVal function inside the window. Moreover, the fitness is always increased in case the mutation decreases the number of 0-bits outside the window. If this is due to the term in the fitness function and otherwise it is because the current -value has increased. The gain in fitness is so large that it dominates any change of the bits inside the window.
We claim that with this construction it is very likely that the current window always contains at least 0-bits, where is some positive constant. This is proven by showing that in case the number of 0-bits in the window is in the interval , constant, then there is a tendency (drift) to increase the number of 0-bits again. Applying the drift theorem by Oliveto and Witt (2011) yields that even in an exponential number of generations the probability that the number of 0-bits in the window decreases below is exponentially small. We first elaborate on why this drift holds and then explain how the lower bound of 0-bits implies the claim.
If a mutation decreases the number of 0-bits outside the window, the bits inside the window are subject to random, unbiased mutations. Hence, if the number of 0-bits is at most , the expected number of bits flipping from 1 to 0 is larger than the expected number of bits flipping from 0 to 1. Note that a mutation flipping 0-bits to 1 outside the window and flipping 1-bits to 0 inside the window creates an incomparable offspring. If the mutation probability is large enough, the net gain of 0-bits inside the window makes up for the 0-bits lost outside the window. So we have a net gain in 0-bits in expectation, with regard to the whole bit string. Note that the window is moved during such a mutation. As by Lemma 3 the number of 0-bits outside the window is fixed to , we have a net gain in 0-bits for the window, regardless of its new position.
In case the number of 0-bits outside the window remains put, acceptance depends on a BinVal instance on the bits inside the window. For BinVal accepting the result of a mutation is completely determined by the flipping bit with the largest weight. In an accepted step, this bit must have flipped from 0 to 1. All bits with smaller weights have no impact on acceptance and therefore are subject to random, unbiased mutations. If, among all bits with smaller weights, there is a sufficiently small rate of 0-bits, more bits will flip from 1 to 0 than from 0 to 1. In this case, we again obtain a net increase in the number of 0-bits in the window, in expectation. Here we again require a large mutation probability since every increase of BinVal implies that one 0-bit has been lost and a surplus of flipping 1-bits has to make up for this loss. This surplus must be generated by flipping 1-bits inside the window that have a small weight. Recall that the window only represents a -fraction of all bits in the bit string. So, the mutation probability has to be large enough such that the expected number of flipping bits among the mentioned bits is still large enough to make up for the lost bit.
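The claim that acceptance under BinVal is decided by the flipping bit with the largest weight can be verified exhaustively for small n. The sketch below assumes the convention that the first bit carries the largest weight and that offspring of equal fitness are accepted; the function names are illustrative.

```python
from itertools import product


def binval(x):
    """BinVal with the first bit as the most significant one (assumed)."""
    n = len(x)
    return sum(b << (n - 1 - i) for i, b in enumerate(x))


def accepted_by_leading_flip(x, y):
    """Predict acceptance of y from x using only the flipped bit with
    the largest weight: the step is accepted iff that bit went 0 -> 1."""
    for a, b in zip(x, y):  # scan from the largest weight downwards
        if a != b:
            return b == 1
    return True  # no bit flipped: equal fitness, accepted


def rule_matches_binval(n):
    """Compare the rule with the comparison BinVal(y) >= BinVal(x)
    over all pairs of search points of length n."""
    points = list(product((0, 1), repeat=n))
    return all(
        (binval(y) >= binval(x)) == accepted_by_leading_flip(x, y)
        for x in points
        for y in points
    )


if __name__ == "__main__":
    print(rule_matches_binval(5))
    # One 0 -> 1 flip at the front pays for three 1 -> 0 flips behind it.
    print(accepted_by_leading_flip((0, 1, 1, 1), (1, 0, 0, 0)))
```

Since the bits behind the leading flip have no influence on acceptance, they behave like unbiased random flips in accepted steps, which is the property the drift argument relies on.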
For a fixed BinVal instance the bits tend to develop correlations between bit values and weights over time; bits with large weights are more likely to become 1 than bits with small weights. This development is disadvantageous since the above argument relies on many 1-bits with small weights. In order to break up these correlations we use random instances of BinVal wherever possible. Whenever a new random instance of BinVal is assigned, the bit weights for all bits in the window are reassigned, such that all correlations are lost.
New random instances are applied quickly. If and, by Lemma 3, also if we have exactly 0-bits outside the current window and every mutation that flips exactly one of these bits leads to a new BinVal instance. Since this happens with probability , this frequently breaks up correlations and prevents the algorithm from gathering 1-bits at bits inside the window with large BinVal-weights. Pessimistically dealing with bits that have been touched by mutation while optimizing the same BinVal instance, a positive expected increase in the number of 0-bits can be shown.
How does the lower bound of 0-bits inside the window imply Theorem 4? With overwhelming probability we start with and at least 0-bits in the window B1. We maintain at least 0-bits in B1, while the algorithm is encouraged to turn the 0-bits outside of B1 to 1 quickly. Once the number of 0-bits outside of B1 has decreased to or below , the path has been reached.
The 0-bits in B1 thereby ensure that the initial -value, that is, the initial position on the path, is at most . This is because B1 only has a small overlap to sets Bj that are far further on the path, that is, . In general, every two sets Bi, Bj with only intersect in at most bits. So 0-bits in Bi imply at least 0-bits outside of Bj. For j to become the new window, however, at most 0-bits outside of Bj are allowed. By choice of , , and , moving from B1 to Bj requires a linear number of 0-bits in B1 to flip to 1 if . The described mutation has probability . Hence the (1+1) EA finds the start of the long path with overwhelming probability.
The argument on small overlaps also implies that the probability of increasing by more than in one generation is . Hence, even when considering an exponential period of time, with overwhelming probability the (1+1) EA in each generation only makes progress at most on the path. As the path has exponential length, the claimed lower bound follows.
In the following, we prove our claim in three steps. We first show that it is very unlikely to take large shortcuts once the path is reached, implying that the algorithm is forced to follow the path (see Section 5.1). Afterward, we make use of drift arguments in order to show that there is always a linear fraction of 0-bits within the current window until the end of the path is reached or an exponential number of iterations have passed (see Section 5.2). The proof of this part is further separated into a part dealing with the case of a moving window and one part where the window stays put. Finally, we show that we hit the beginning of the path starting from a random initialization with overwhelming probability (see Section 5.3). Putting these results together in Section 5.4 then proves Theorem 4.
5.1. Unlikeliness of Shortcuts
We consider the (1+1) EA with mutation probability c/n and say that the (1+1) EA is on level if x is the current search point. We also speak of phase as the random time until the (1+1) EA increases its current level. Note that many phases can be empty. is called the current window of bits in situations where we are looking at a trajectory of these sets and want to emphasize that the bits we are considering might change over time.
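For reference, the algorithm under consideration can be sketched as follows. This is a minimal version of the standard (1+1) EA, assuming uniform random initialization, independent bit flips with probability c/n, and acceptance of the offspring whenever its fitness is not worse; the function names, the iteration budget `max_iters`, and the OneMax test function are choices made for this example only.

```python
import random

def one_plus_one_ea(f, n, c, max_iters, rng):
    """Maximize f over bit strings of length n with mutation probability c/n."""
    x = [rng.randint(0, 1) for _ in range(n)]
    fx = f(x)
    p = c / n
    for _ in range(max_iters):
        # Flip every bit independently with probability p.
        y = [1 - b if rng.random() < p else b for b in x]
        fy = f(y)
        # Accept the offspring if it is at least as good as the parent.
        if fy >= fx:
            x, fx = y, fy
    return x, fx

def onemax(x):
    """Number of 1-bits; a simple strictly monotonic function."""
    return sum(x)

best, best_value = one_plus_one_ea(onemax, n=20, c=1.0, max_iters=20000,
                                   rng=random.Random(1))
```

OneMax is used here only because it is easy to check; the hard function analyzed in this section is a different monotonic function.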
The main observation for our analysis is that the current window typically contains at least 0-bits for some positive constant . This property is maintained even during an exponential number of generations, with overwhelming probability. Under this condition, the probability of increasing the current level by a large value is very small. Intuitively speaking, the reason for this is that the sets Bi only have a small intersection and many bits have to change in order to move from to some set Bj with . This is made precise in the following lemma.
Lemma 5. Let and be constants such that and . Let , with respect to and , be constructed as in Definition 2, for arbitrary . Let c>0 be a constant and let x be the current search point of the (1+1) EA with mutation probability p(n)=c/n optimizing . Assume that and contains at least 0-bits. Then the probability that the (1+1) EA increases the level by more than in one generation is at most .
Since , it holds that contains more than 0-bits. Recall that for all . Thus, there are more than 0-bits outside of Bj. This implies, by the definition of , that a necessary condition for increasing to any value is thus that one mutation decreases the number of 0-bits in to a value below or equal to . This is a decrease of at least bits for some constant . The probability of flipping at least bits simultaneously is at most .
One conclusion from this lemma is that, with overwhelming probability, the (1+1) EA follows the path given by the sets Bi without jumping from one window to another one at a large distance. More precisely, each phase increases the current level by at most with overwhelming probability. This will establish the claimed time bound.
5.2. Proving an Invariance Property on the Number of 0-bits in the Current Window
This section deals with the proof of the invariance property on the number of 0-bits in the current window. For this proof we make use of the following drift theorem by Oliveto and Witt (2011).
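Theorem 5 (Simplified Drift Theorem; Oliveto and Witt, 2011): Let X_t, t ≥ 0, be the random variables describing a Markov process over a finite state space S and denote Δ_t(i) ≔ (X_{t+1} − X_t | X_t = i) for i ∈ S and t ≥ 0. Suppose there exists an interval [a, b] in the state space, two constants δ, ε > 0 and, possibly depending on I≔b−a, a function r(I) satisfying 1 ≤ r(I) = o(I/log I) such that for all t ≥ 0 the following two conditions hold:
(1) E(Δ_t(i)) ≥ ε for a < i < b,
(2) Prob(Δ_t(i) ≤ −j) ≤ r(I)/(1+δ)^j for i>a and j ∈ ℕ_0.
Then there is a constant c* > 0 such that for the time T* ≔ min{t ≥ 0 : X_t ≤ a | X_0 ≥ b} it holds that Prob(T* ≤ 2^{c*·I/r(I)}) = 2^{−Ω(I/r(I))}.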
A prerequisite for this theorem is that the number of 0-bits in the current window increases in expectation when the number of 0-bits is in a certain interval. We choose the interval , where , but establish lower bounds for the drift with respect to a larger interval . The larger interval will be used later on when proving that after initialization the (1+1) EA finds the start of the path (cf. Lemma 10).
The drift on the number of 0-bits will be bounded from below by positive constants in two cases: either the current level remains fixed in one generation or the current level is increased. We start with the latter case and give a lower bound for the number of 0-bits in the current window. At the end of this section, we apply the above stated drift theorem.
Before we formulate the main statements of this section, we need to introduce some notation. For any x, let denote the substring of x induced by , that is, the substring in the current window. Recall that |xB|0 denotes the number of 0-bits in the current window. That is, . For readability purposes, we write |x+B|0 instead of |(x+)B|0 for the number of 0-bits of x+ in its window.
5.2.1. Invariance for Sliding Windows
We first consider the case where the current level is increased, that is, a transition from to with happens. Note that here we deal with the case and thus, and hold. We show that in this situation the drift in the number of 0-bits within the current window is bounded below by a positive constant. Due to the transition, it is not sufficient to consider only changes within the current window. Furthermore, transitions are often triggered by changes outside the current window. Thus, we adopt a global view and take into account changes both within and outside the current window. We formalize this in the next lemma.
Lemma 6. Let , , and c be constants such that and . Let n be sufficiently large and let f, with respect to and , be constructed as in Theorem 4. Let x be the current search point of the (1+1) EA with mutation probability p(n)=c/n maximizing f. We denote by the event that a transition from level to with occurs in an iteration of the (1+1) EA maximizing f. Assume . Then there is a constant such that the drift in the number of 0-bits is at least , that is, .
Let , the indices not contained in the current window, and the corresponding induced substring of x. Analogously, we define . Due to Lemma 3, we have .
The main part of the proof is to derive a lower bound on . Afterward we show that this bound together with the given prerequisites on , , c, and |xB|0 yields a positive drift in the number of 0-bits.
5.2.2. Invariance for Nonsliding Windows
In the following, we deal with the case . We show that, whenever the number of 0-bits in the current window is in the interval , we observe a drift toward more 0-bits. This is formalized in the following lemma.
Lemma 7. Let , , and c be constants such that . Let n be sufficiently large and let f, with respect to and , be constructed as in Definition 2. Let x be the current search point of the (1+1) EA with mutation probability p(n)=c/n maximizing f. Assume . We denote by A the event that the (1+1) EA maximizing f and starting in x does not leave the current level, that is, . Then the following two statements hold.
(1) For every constant the number of different bits that are flipped during phase is at most , with probability .
(2) For small enough, assuming that the event from (1) holds, there exists a constant such that the drift in the number of 0-bits is at least .
The proof of this lemma depends heavily on the drift in the number of 0-bits induced by the random BinVal within the current window. In the proof of Lemma 7, we will have to deal with variable lengths of the considered bit string. Therefore, the following auxiliary lemma is formulated for a range of possible bit string lengths. One precondition is that the bit weights of BinVal are assigned uniformly at random. This is the case right after a new BinVal instance has been set.
In order to prevent confusion, let us remark that the expectation is taken both with respect to the random assignment of the function weights as well as with respect to the position of the 0-bits of .
As a first simple observation, let us recall the following. Whenever , it holds that . Thus, we are only interested in the case . Note that in this case, the construction of BinVal implies that the bit with the largest weight is one that flips from 0 to 1, as the (1+1) EA would otherwise not accept as a new search point. For all other bits that are flipped in this iteration, the direction of the flip (i.e., whether the bit is a 0-bit flipping to 1 or a 1-bit flipping to 0) is random and depends only on the proportions of 0- and 1-bits. This will be formalized in the following.
For readability purposes, let us introduce the following notations. For every we denote by pk the probability that the (1+1) EA flips exactly k bits in the mutation step. Clearly, for and .
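Since the (1+1) EA flips each of the n bits independently with probability c/n, the number of flipped bits is binomially distributed. The following sketch computes p_k on that basis and compares it with the elementary estimate c^k/k!; whether this is exactly the estimate intended in the text is an assumption, and the function names are hypothetical.

```python
from math import comb, factorial

def flip_probability(n, c, k):
    """Probability that exactly k of n bits flip when each flips with probability c/n."""
    p = c / n
    return comb(n, k) * p**k * (1 - p)**(n - k)

def factorial_bound(c, k):
    """Elementary upper bound c^k / k! on the probability of exactly k flips."""
    return c**k / factorial(k)

n, c = 100, 2.0
probs = [flip_probability(n, c, k) for k in range(n + 1)]
total = sum(probs)
```

The bound follows from comb(n, k) ≤ n^k/k! together with (1 − c/n)^(n−k) ≤ 1, so it holds for every k and every n ≥ c.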
Now, for any such k, it holds that equals the probability that the flipping bit with the largest weight flips from 0 to 1 (which occurs with probability ) times the drift conditional on k bit flips. The latter equals as outlined above.
We can now easily deduce Lemma 7.
Let us assume that event A (as defined in the statement) holds. That is, the acceptance of the mutated bit string mut(x) is fully determined by the random BinVal within the current window. Thus, we can restrict our attention to the current window.
Let us begin with proving the first claim. For this purpose, let be a constant. We prove an auxiliary claim stating that with probability 1- the time until the (1+1) EA exits level is at most . That is, we can assume that phase does not take longer than steps. We then show how to derive the original claim.
By construction, the (1+1) EA exits level if exactly one of the 0-bits outside the current window is being flipped. Thus, the probability of exiting the current level in one step is at least . It follows that the probability of not exiting level in steps is at most .
Now, the expected number of bits that have been flipped in steps is at most . We apply a standard Chernoff bound, and obtain that the probability that more than bits are being flipped in steps is at most .
5.2.3. Applying the Drift Theorem
Finally, we prove the claimed invariance property.
Lemma 9. Let , , , and c be constants such that , , and . Let , with respect to and , be constructed as in Definition 2, for chosen uniformly at random.
Assume that for the current search point x of the (1+1) EA with mutation probability p(n)=c/n it holds and the current window contains at least 0-bits. There is a constant such that with probability in the following generations the (1+1) EA always has at least 0-bits in the current window or the end of the path is reached.
First, observe that the event described in the first statement of Lemma 7 occurs with probability . By the union bound, the probability that the event occurs within phases is still if is a sufficiently small constant.
We apply the drift theorem (Theorem 5) to a potential that reflects the number of 0-bits in the current window. Consider the interval and observe that by assumption the algorithm starts with a potential of at least . Using Lemma 7 with the condition from the first paragraph and Lemma 6, if the current potential is within the interval and the end of the path is not reached, then the expected increase in the potential is bounded from below by a positive constant.
For the probability that the potential decreases by j is bounded from above by the probability that the (1+1) EA flips at least j bits. This probability is at most , where the last estimation is trivial for and obvious otherwise. Applying Theorem 5 with and r=2^{2ec} yields that with overwhelming probability in generations, if again is sufficiently small, the potential does not decrease below or the end of the path is reached.
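The tail estimate used here can be checked numerically. The sketch below assumes that the intended bound has the geometric form r/2^j with r = 2^{2ec}, so that it is trivial for j ≤ 2ec (the bound is at least 1) and follows from C(n,j)(c/n)^j ≤ (ec/j)^j ≤ 2^{-j} otherwise. This form, like the function names, is an assumption of the example rather than a statement taken from the text.

```python
from math import comb, e

def prob_at_least(n, c, j):
    """Probability that at least j of n bits flip when each flips with probability c/n."""
    p = c / n
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(j, n + 1))

def geometric_bound(c, j):
    """Assumed bound r / 2^j with r = 2^(2ec)."""
    return 2.0 ** (2 * e * c - j)

n, c = 200, 2.0
violations = [j for j in range(n + 1)
              if prob_at_least(n, c, j) > geometric_bound(c, j)]
```

For these parameters `violations` stays empty, which is what the argument predicts for every choice of n and constant c.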
5.3. Hitting the Path
All that is left to complete the proof of the main result is the fact that the path is reached from a random initialization, with overwhelming probability.
Our function is constructed such that after a typical initialization the fitness equals OneMax on all bits outside the window B1, multiplied by a huge weight of 2^n, plus BinVal on all bits inside the window. The OneMax-part encourages the (1+1) EA to turn the bits outside the window B1 to 1 quickly. Note that B1 becomes a potential window once the number of 0-bits outside B1 is no more than . The BinVal-part on the bits within B1 is used to maintain a certain number of 0-bits inside the window. This ensures that the algorithm reaches the path close to B1, for the same reason that prevents the (1+1) EA from taking shortcuts when climbing the path. This reasoning is made precise in the following lemma.
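The shape of the fitness in this regime can be illustrated as follows. The sketch covers only the situation after a typical initialization described above, not the full function from Definition 2, and it takes the OneMax weight to be 2^n, which is an assumption of this example. The window, the weights, and the search points are hypothetical.

```python
def off_path_fitness(x, window, weights):
    """2^n * OneMax on bits outside the window plus BinVal on bits inside it."""
    n = len(x)
    in_window = set(window)
    ones_outside = sum(b for i, b in enumerate(x) if i not in in_window)
    binval_inside = sum(w for pos, w in weights.items() if x[pos] == 1)
    return (1 << n) * ones_outside + binval_inside

window = [0, 1, 2]                 # hypothetical window B1
weights = {0: 4, 1: 1, 2: 2}       # hypothetical BinVal weights on B1
x = [1, 1, 1, 0, 1, 0, 1, 1]       # all window bits set, two 0-bits outside
y = [0, 0, 0, 1, 1, 0, 1, 1]       # one more 1-bit outside, window cleared
fx = off_path_fitness(x, window, weights)
fy = off_path_fitness(y, window, weights)
```

Because BinVal on the window is always smaller than 2^n, gaining a single 1-bit outside the window outweighs any change inside it, so `fy` exceeds `fx` even though `y` has lost every 1-bit in the window.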
Lemma 10. Let , , and be constants such that and , , and . Let , with respect to , and , be constructed as in Definition 2, for chosen uniformly at random. With probability the (1+1) EA with mutation probability p(n)=c/n optimizing at some point in time reaches some search point x with and .
The proof follows from reusing many previous arguments, as the situation of the (1+1) EA moving toward the path is very similar to climbing up the path. The outline of the proof is as follows. We first show that a minimum number of 0-bits in B1 (the same minimum number as in the setting of climbing the path) prevents the (1+1) EA from taking shortcuts. We then show that the path is reached within the first n^2 generations. Finally, we argue that, during these n^2 generations, we keep a minimum number of 0-bits inside the window, with overwhelming probability.
Let x be the current search point of the (1+1) EA. By the same reasoning as in Lemma 5, we observe that if , then for every , since and , we have . Hence, we only need to prove that the number of 0-bits in B1 does not decrease below until the set of potential window positions becomes nonempty for the first time.
Recall that we have and , constant. The initial search point contains an expected number of 0-bits in B1. The probability that the initial search point contains at least 0-bits in B1 is by Chernoff bounds. Assume that this happens and consider a situation where we have at least 0-bits outside of B1 and the number of 0-bits in B1 has decreased below . Arguing as in the proof of Lemma 7, if the number of 0-bits in B1 is within and , then there is a positive drift toward increasing the number of 0-bits again. (The only difference from the previous arguments is as follows. Instead of considering a new random BinVal instance when the current -value is increased, we obtain a new BinVal instance whenever the number of 1-bits outside the window is increased. The probability for the latter event can even be larger than the probability for the former.) This allows us to apply Lemma 8 in the same fashion as in the proof of Lemma 7. This results in a positive drift. Since we start with at least 0-bits in B1, we can apply the drift theorem as in Lemma 9 w.r.t. the interval . This proves that in n^2 generations the number of 0-bits in B1 does not drop to or below , with probability .
5.4. Putting Everything Together
Now we are prepared to prove Theorem 4.
Choose and . It is easily verified that for the chosen values , , , , and holds, satisfying all preconditions on these variables for Lemmas 5, 9, and 10. By Lemma 10, the (1+1) EA reaches some search point x with and with overwhelming probability. Lemma 9 then states that with probability , the number of 0-bits in the current window is always at least until the end of the path is reached or generations have passed for a sufficiently small constant (which would correspond to the claimed time bound).
Given the condition on the 0-bits, by Lemma 5 the (1+1) EA increases its current -value by at most in one generation, with probability . The probability that this always happens until an -value of is reached is at least since . This implies that the (1+1) EA spends at least generations on the path, with probability , if is chosen small enough. Since the sum of all error probabilities is , the claim follows.
Understanding which problems and problem classes are difficult for evolutionary algorithms remains a challenging task. We have made an important step forward by showing that even innocent-looking functions like monotonic ones can be surprisingly hard to optimize with evolutionary algorithms. We showed that the optimum of any monotonic function is found efficiently if the mutation probability is at most 1/n. Once the mutation probability exceeds 16/n, the situation changes drastically. In this case, there are monotonic functions such that the (1+1) EA with overwhelming probability needs exponential time to find their optimum.
This result indicates that, to a greater extent than expected, care has to be taken when choosing the mutation probability, even if restricting oneself to mutation probabilities c/n with a constant c. Contrary to previous observations, for example, for linear functions, it may well happen that constant factor changes in the mutation probability lead to more than constant factor changes in the efficiency.
Mutation probabilities of 16/n are not used in practice in evolutionary algorithms. However, in memetic algorithms and artificial immune systems they are actually applied. Therefore, our theoretical findings make a significant contribution toward practical applications of these randomized search heuristics.
Apart from generally suggesting more research on the right mutation probability, this work leaves two particular problems open. (1) For the mutation probability 1/n, give a sharp upper bound for the optimization time of monotonic functions (this order of magnitude is between and O(n^{3/2})). (2) Determine the largest constant c such that the expected optimization time of the (1+1) EA with mutation probability p(n)=c/n is n^{O(1)} on every monotonic function. Currently, we only know that 1<c<16 holds. We do not expect the pessimistic model that establishes the O(n^{3/2}) bound for c=1 (Jansen, 2007) to be particularly useful for this task. In this model it is harder to locate the unique optimum than for any monotonic function. Note that it is not even clear that the upper bound is tight for c=1 on monotonic functions.
The authors would like to thank Xin Yao for several useful discussions. The first author is thankful to Jon Rowe for pointing out this problem to him at the ThRaSH workshop in Birmingham. This material is based in part upon works supported by the Science Foundation Ireland under Grant No. 07/SK/I1205. Dirk Sudholt was partly supported by a postdoctoral fellowship from the German Academic Exchange Service while visiting the International Computer Science Institute in Berkeley, CA, USA and EPSRC grant EP/D052785/1. Carola Winzen is a recipient of the Google Europe Fellowship in Randomized Algorithms. This research is supported in part by this Google Fellowship. Christine Zarges was partly supported by a postdoctoral fellowship from the German Academic Exchange Service.