## Abstract

Understanding the relationship between a search algorithm and the space of problems is a fundamental issue in the optimization field. In this paper, we lay the foundations to elaborate taxonomies of problems under estimation of distribution algorithms (EDAs). By using an infinite population model and assuming that the selection operator is based on the rank of the solutions, we group optimization problems according to the behavior of the EDA. Through the definition of an equivalence relation between functions, it is possible to partition the space of problems into equivalence classes in which the algorithm has the same behavior. We show that only the probabilistic model is able to generate different partitions of the set of possible problems and hence, it predetermines the number of different behaviors that the algorithm can exhibit. As a natural consequence of our definitions, all the objective functions are in the same equivalence class when the algorithm does not impose restrictions on the probabilistic model. The taxonomy of problems, which is also valid for finite populations, is studied in depth for a simple EDA that considers independence among the variables of the problem. We provide the sufficient and necessary condition to decide the equivalence between functions and then we develop the operators to describe and count the members of a class. In addition, we show the intrinsic relation between univariate EDAs and the neighborhood system induced by the Hamming distance by proving that all the functions in the same class have the same number of local optima and that these optima occupy the same ranking positions. Finally, we carry out numerical simulations in order to analyze the different behaviors that the algorithm can exhibit for the injective functions defined over the search space.

## 1 Introduction

Estimation of distribution algorithms (EDAs; Mühlenbein and Paaß, 1996; Larrañaga and Lozano, 2002; Pelikan, 2005) are a population-based optimization paradigm in the field of evolutionary computation that has acquired special relevance in the last decade. Evidence of this popularity is the development of new and more complex EDAs (Bosman, 2010; Hauschild et al., 2012), the applications of these EDAs (Armañanzas et al., 2008; Santana et al., 2008; Brownlee et al., 2008), and the works which study fundamental issues in order to better understand how these algorithms perform (González et al., 2002; Zhang, 2004; Shapiro, 2005; Hauschild et al., 2009; Echegoyen et al., 2011).

The main characteristic of EDAs is the use of probabilistic models to lead the search toward promising areas of the space of solutions. By making use of a subset of promising solutions belonging to the population, the learning algorithm allows an estimate to be made of a new probability distribution over the search space at each step of the algorithm. Thus, each of the possible solutions has an associated probability of being sampled, which varies during the optimization process. The probability values assigned to the solutions are the main source of information in determining which solutions will be returned by the algorithm. Consequently, given a problem, the ideal objective of an EDA is to get higher probability values for the highest quality solutions throughout an iterative process. However, when an EDA is applied to search for the best solutions of a problem, the algorithm can exhibit a wide variety of behaviors (Echegoyen et al., 2011). In this regard, identifying and classifying the different behaviors of an EDA, and how these behaviors relate to the characteristics of the optimization problems, are fundamental issues to gain an understanding of the underlying mechanisms that govern this type of algorithm.

The behavior of an EDA can be specified by the sequence of probability distributions generated at each generation. Ultimately, the algorithm can have different executions depending on the objective function. However, with the aim of developing a taxonomy of problems, we start by considering the following question: Is it possible to group the optimization problems according to the behavior of the EDA? Therefore, we are questioning the existence of groups of problems in which the algorithm exhibits a similar performance regarding the optimization purpose. In order to start the study of this topic, we make the following three assumptions: (i) we consider EDAs with infinite population, (ii) the selection scheme is based on the rank of the solutions, and (iii) the algorithm is applied in the space of injective functions.

One important particularity of the approach carried out in the current paper is the use of ranking information in the selection step to guide the search. We argue that many black box algorithms only compare objective values instead of taking into account their absolute values. In addition, ranking-based algorithms play an important role in the theory of black box optimization. For instance, it has been shown in Doerr and Winzen (2012) that, for certain optimization problems, the use of ranking-based selection mechanisms facilitates the estimates of realistic time complexity results.

In order to elaborate taxonomies of optimization problems under EDAs that satisfy the aforementioned three assumptions, a crucial element is the definition of an equivalence relation between objective functions. The equivalence relation, which is based on the sequences of probability distributions generated during the search, partitions the space of functions into equivalence classes. We will see that it is possible to assert that the algorithm has the same behavior for the functions belonging to the same class. In turn, we show that only the probabilistic model is able to create partitions of the space of problems. From the definitions that we provide, it can be deduced that all the objective functions are equivalent when the probabilistic model is able to exactly recover the underlying distribution of the selected individuals. Therefore, we can say that all the optimization problems are equally difficult from the point of view of this type of EDA. This crucial role of the probabilistic model supports the importance that has been attributed to it in the literature regarding the classification of different EDAs and the development of new techniques and studies.

The partition of the space of problems under EDAs is studied in depth for a so-called univariate algorithm whose probabilistic model assumes independence among the variables of the problem. The key point to understanding the behavior of an EDA for a given problem is the way in which the objective function is related to the probabilistic model. We show that this relation can be expressed by means of a structure of sets. For this type of univariate EDA, the necessary and sufficient condition to identify equivalent functions is provided. Next, we develop the operators that allow us to describe the functions in the same class and count the number of elements per class. Another fundamental topic that we take into account is the relation between the behavior of the EDA and the properties of the objective function. In particular, we study the connection between the equivalence classes and the local optima of the objective functions belonging to each class. We show that the functions in the same class have the same number of local optima and that these optima occupy the same ranking positions. This fact reveals the intrinsic connection between local optima and any EDA that introduces a univariate probabilistic model. This link has only been shown for specific EDA implementations (González et al., 2001; Zhang, 2004).

Finally, we conduct numerical simulations of a univariate EDA that implements tournament selection (Zhang, 2004). The algorithm is applied to the injective functions defined over the search space. The partition of the space of problems into equivalence classes allows us to carry out a detailed analysis of the different EDA behaviors. Through the numerical analysis, we mainly study the complexity of the problems belonging to each class. Due to the relevant role that the local optima play in EDAs that assume independence among the variables, we present the difficulty of the problems in relation to the number of local optima. Moreover, the experiments are useful in order to illustrate the taxonomy of problems which is abstractly presented throughout the paper.

The rest of the paper is organized as follows. Section 2 formally introduces the problems to solve and the estimation of distribution algorithms. Section 3 presents EDAs with infinite population and provides the specific definitions involved in the framework of the current work. In Section 4, the concept of equivalence between functions is presented and discussed. Section 5 studies the equivalence classes under a simple EDA. Firstly, it establishes the sufficient and necessary condition to decide the equivalence between two functions. Secondly, the operators to describe the functions in the same class are deduced. Lastly, the relationship between the equivalence classes and local optima is presented. In Section 6, numerical experiments are conducted. Section 7 points out possible future work. Finally, Section 8 draws the conclusions obtained during the study.

## 2 Background

### 2.1 Optimization Problems

The optimization problem consists of finding the solution that maximizes the objective function:

**x*** = argmax_{**x** ∈ *S*} *f*(**x**),    (1)

where *S* = {0, 1}^{n} is the space of binary solutions of dimension *n* and *f* : *S* → ℝ is the objective function. The size |*S*| of the search space is therefore 2^{n}. Throughout the paper, we assume that the function *f*(**x**) is injective. Thus, for each *z* ∈ *f*(*S*), there is only one **x** ∈ *S* such that *f*(**x**) = *z*. The results could be generalized to noninjective functions because the most basic definitions that we present do not depend on the injectivity of the function.

An injective function *f*(**x**) naturally induces a permutation σ of the solutions of the search space *S*. This permutation is thought of as a 2^{n}-tuple (**x**_{1}, **x**_{2}, …, **x**_{2^{n}}) of the elements in *S* ordered according to the function values *f*(*S*). The first solution of σ, namely **x**_{1}, has the highest function value, **x**_{2} has the second highest, and so on, with the last solution being the one with the lowest function value:

*f*(**x**_{1}) > *f*(**x**_{2}) > … > *f*(**x**_{2^{n}}).

Of course, the first solution **x**_{1} of σ corresponds to the solution that solves Equation (1). Independent of the specific function values *f*(*S*), whenever they provide the same ranking of the solutions, the function is represented by the same permutation σ. In the case of injective functions, we have 2^{n}! permutations and the set containing all possible σ is denoted by Σ. A permutation σ provides a ranking of the solutions of the search space. From now on, the objective function *f*(**x**) and the corresponding σ can be used indistinctly as synonyms. We use the notation σ(*i*) to index the permutation. Thus, σ(*i*) returns the solution which is at position *i* in the ranking.

In Table 1, we provide an example of the permutation induced by a function *f*(**x**). In the first column, the original function values are shown. In the second column, the corresponding 2^{n}-tuple is presented; and in the third column, we indicate the rank of the solutions.
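As a concrete illustration (a sketch of ours, not code from the paper), the permutation σ induced by an injective function can be obtained by sorting the search space by decreasing function value; the function `f` below is a hypothetical injective example:

```python
# Building the permutation sigma induced by an injective function f
# over S = {0,1}^n, as described in Section 2.1.
from itertools import product

def induced_permutation(f, n):
    """Return the 2^n-tuple of solutions sorted by decreasing f-value."""
    solutions = list(product((0, 1), repeat=n))
    return sorted(solutions, key=f, reverse=True)

# Hypothetical injective function: OneMax plus a small tie-breaking term
# based on the integer encoding, which makes the function injective.
def f(x):
    return sum(x) + 0.1 * int("".join(map(str, x)), 2) / 2 ** len(x)

sigma = induced_permutation(f, 3)
# sigma[0] is the optimum x_1; sigma[i - 1] is the solution with rank i.
rank = {x: i + 1 for i, x in enumerate(sigma)}
```

Here `rank` is the inverse mapping of σ: it returns the ranking position of each solution, as used by the tables throughout the paper.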

### 2.2 Estimation of Distribution Algorithms

Let **X** = (*X*_{1}, …, *X*_{n}) be a vector of *n* binary random variables. We define *p*(**X** = **x**), or simply *p*(**x**), as the joint probability distribution of the variables **X** taking values from the search space *S*. The general scheme of the EDA approach is shown in Algorithm 1.

The procedure of an EDA can be explained in terms of the probability distributions involved in the search (Santana et al., 2009; see Figure 1). Firstly, *D*_{t} denotes the EDA population at generation *t* and *p*(**x**, *t*) is the underlying probability distribution of this sample. Secondly, *p*^{s}(**x**, *t*) is the probability distribution of the selected individuals *D*^{s}_{t}. Finally, *p*^{a}(**x**, *t*) is the distribution given by the probabilistic model chosen to approximate *p*^{s}(**x**, *t*). Once we have *p*^{a}(**x**, *t*), the next generation *D*_{t+1} is constructed by sampling this distribution.

In any evolutionary algorithm (EA), selection is one of the fundamental operators and therefore a wide variety of proposals can be found in the literature. According to Lee and El-Sharkawi (2008), the selection schemes can be classified into two groups: proportionate selection and ordinal-based selection. Proportionate-based procedures select individuals based on their specific function values. Ordinal-based selection procedures select individuals based on their rank within the population. Thus, this class of selection only takes into account qualitative information about the function, that is, it only uses the fact that *f*(**x**)>*f*(**y**) instead of the real value given by the function. In this work, we assume that the algorithm uses this last type of selection. We must say that assuming ordinal-based selection is not an insurmountable restriction, because the definitions that we present could be extended to selections that take into account the specific function values. Common examples of selection schemes based only on the rank of the solutions are truncation, tournament, linear ranking, and exponential ranking selection. These schemes are considered in a wide variety of EAs, both in solving real problems and in theoretical studies (Blickle and Thiele, 1996; Prügel-Bennett, 2000; Zhang, 2004; Doerr and Winzen, 2012).
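To make the proportionate/ordinal distinction concrete, the following sketch contrasts the two families on a toy finite population; all names and values are illustrative, not taken from the paper:

```python
# Contrasting the two selection families of Lee and El-Sharkawi (2008).

def proportionate_selection_probs(fitnesses):
    # Proportionate: probability proportional to the actual function value.
    total = sum(fitnesses)
    return [f / total for f in fitnesses]

def truncation_selection(population, fitnesses, tau=0.5):
    # Ordinal-based: keep the top tau fraction by rank; the exact fitness
    # values are irrelevant, only the ordering f(x) > f(y) matters.
    ranked = sorted(zip(population, fitnesses), key=lambda p: p[1], reverse=True)
    k = max(1, int(tau * len(population)))
    return [x for x, _ in ranked[:k]]

pop = [(0, 0), (0, 1), (1, 0), (1, 1)]
fit = [1.0, 2.0, 3.0, 10.0]
selected = truncation_selection(pop, fit, tau=0.5)   # [(1, 1), (1, 0)]
# Any monotone rescaling of fit leaves `selected` unchanged, while the
# proportionate probabilities change: that is the ordinal/proportionate split.
```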

## 3 The Infinite Population EDA Model

The application of the EDA scheme presented in Algorithm 1 to face optimization problems can involve an unapproachable variety of situations and behaviors. With the aim of dealing with all possible EDA behaviors, we consider the concept of infinite population (Vose, 1999), which is commonly used to analyze EAs. Regarding the field of EDAs, several works have assumed an infinite population in order to study and provide the fundamental properties of this type of algorithm (e.g., Mühlenbein et al., 1999; Zhang and Mühlenbein, 2004; Zhang, 2004).

In EDAs with infinite population, it is assumed that the empirical probability distributions induced by the solutions in *D*_{t} and *D*^{s}_{t} (see Figure 1) will converge to the underlying probability distributions *p*(**x**, *t*) and *p*^{s}(**x**, *t*), respectively, as the size of the sample tends to infinity. Therefore, *p*(**x**, *t*) and *p*^{s}(**x**, *t*) could be thought of as the population and the selected individuals at iteration *t* (Zhang and Mühlenbein, 2004). Consequently, we can assume that *p*(**x**, *t*+1) = *p*^{a}(**x**, *t*). In EDAs with infinite population, the random errors caused by sampling finite sets of solutions from the probability distributions managed by the algorithm are canceled and hence, we no longer need *D*_{t} and *D*^{s}_{t}.

By assuming infinite population, the EDA is seen as a sequence of probability distributions. Therefore, we concentrate solely on the evolution of these probability distributions during the search instead of dealing with the specific candidate solutions. Nevertheless, it is important to note that, although the analysis is carried out by assuming infinite populations, the taxonomy of problems developed throughout the paper is also valid for finite populations.

### 3.1 The Algorithm and the Population

In Algorithm 2 we formally describe the infinite population EDA procedure that we consider in the current work. This algorithm will be simply denoted by *A*.

The first step in algorithm *A* is to create the permutation σ of the solutions. The permutation that the objective function provides is a crucial element of the algorithm.

The behavior of the algorithm at time *t* could be described through the probability distribution *p*(**x**, *t*) that it manages at that time. Nevertheless, algorithm *A* does not explicitly work with the distributions *p*(**x**, *t*). Instead, at each generation, this algorithm manages a probability vector **p** = (*p*_{1}, …, *p*_{2^{n}}) where each probability value *p*_{i} in **p** corresponds to the probability of the solution **x**_{i}, which is in the position *i* of the permutation σ. It is always interpreted that the first value *p*_{1} of the vector **p** is the probability of the optimal solution, *p*_{2} is the probability of the second best solution, and so on, with the last probability corresponding to the solution with the worst function value. Accordingly, we have the probability vectors **p**^{s} and **p**^{a} representing the selected population and the approximation respectively. To be absolutely precise, the probability vectors at time *t* should be represented as **p**_{t}, **p**^{s}_{t}, and **p**^{a}_{t}. Note that, unlike the probability distribution *p*(**x**), a vector **p** implies no specific assignment of probabilities to solutions. In order to obtain the probability distribution *p*(**x**, *t*) that algorithm *A* manages at time *t*, we link the vector **p**_{t} and the permutation σ through the pair (σ, **p**_{t}). The pair induces a probability function *p*(**x**) such that *p*(σ(*i*)) = *p*_{i}. Therefore, we have that *p*(σ(1)) is the probability of the optimum, *p*(σ(2)) is the probability of the second best solution, and so on.

The arrangement of the probability vector **p** and its relationship with the permutation σ is illustrated in Table 2. In this example, we consider two different permutations σ_{1} and σ_{2}. We assume that algorithm *A* was applied to both functions and that it manages the same probability vector **p**_{t} at time *t*. We can see that, for example, *p*_{1} is associated with the solution (1, 1, 1) according to σ_{1}. However, according to σ_{2}, *p*_{1} is associated with the solution (0, 0, 0). Therefore, *p*_{1} is the probability of the optimum in both cases, independent of the specific configuration of this solution. In the third column of Table 2, we can see that the same vector induces different probability distributions depending on σ. Note that, since we have 2^{n}! different permutations σ, a vector **p** can generate 2^{n}! different probability distributions.

**σ_{1}**

| **x** | **p**_{t} | *p*(**x**, *t*) | **p**^{s}_{t} |
|---|---|---|---|
| (1,1,1) | *p*_{1} | *p*_{1} | *p*^{s}_{1} |
| (0,1,0) | *p*_{2} | *p*_{2} | *p*^{s}_{2} |
| (0,0,1) | *p*_{3} | *p*_{3} | *p*^{s}_{3} |
| (1,0,0) | *p*_{4} | *p*_{4} | *p*^{s}_{4} |
| (0,1,1) | *p*_{5} | *p*_{5} | *p*^{s}_{5} |
| (1,0,1) | *p*_{6} | *p*_{6} | *p*^{s}_{6} |
| (1,1,0) | *p*_{7} | *p*_{7} | *p*^{s}_{7} |
| (0,0,0) | *p*_{8} | *p*_{8} | *p*^{s}_{8} |

**σ_{2}**

| **x** | **p**_{t} | *p*(**x**, *t*) | **p**^{s}_{t} |
|---|---|---|---|
| (0,0,0) | *p*_{1} | *p*_{1} | *p*^{s}_{1} |
| (1,0,1) | *p*_{2} | *p*_{2} | *p*^{s}_{2} |
| (1,1,0) | *p*_{3} | *p*_{3} | *p*^{s}_{3} |
| (0,1,1) | *p*_{4} | *p*_{4} | *p*^{s}_{4} |
| (1,0,0) | *p*_{5} | *p*_{5} | *p*^{s}_{5} |
| (0,1,0) | *p*_{6} | *p*_{6} | *p*^{s}_{6} |
| (0,0,1) | *p*_{7} | *p*_{7} | *p*^{s}_{7} |
| (1,1,1) | *p*_{8} | *p*_{8} | *p*^{s}_{8} |

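The pairing of a probability vector with a permutation can be sketched as follows, using the two permutations of Table 2 and an illustrative probability vector (the values are ours, chosen only to sum to one):

```python
# The (sigma, p) pairing of Section 3.1: the same probability vector p_t
# induces different distributions p(x, t) under different permutations.

sigma1 = [(1,1,1), (0,1,0), (0,0,1), (1,0,0), (0,1,1), (1,0,1), (1,1,0), (0,0,0)]
sigma2 = [(0,0,0), (1,0,1), (1,1,0), (0,1,1), (1,0,0), (0,1,0), (0,0,1), (1,1,1)]
p = [0.30, 0.20, 0.15, 0.12, 0.10, 0.07, 0.04, 0.02]  # illustrative values

def induced_distribution(sigma, p):
    # p(sigma(i)) = p_i: position i in the ranking gets probability p_i.
    return {x: p_i for x, p_i in zip(sigma, p)}

d1 = induced_distribution(sigma1, p)
d2 = induced_distribution(sigma2, p)
# p_1 is always the probability of the optimum, but the optimum is
# (1,1,1) under sigma1 and (0,0,0) under sigma2.
```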

The initial vector **p**_{0} satisfies 0 < *p*_{i} < 1 for all *i*. This condition is established in order to avoid the trivial case in which the algorithm starts from a degenerate probability distribution that assigns 1 to a solution of the search space. In this case, the algorithm would converge when the initial population is generated. This situation will not be considered.

Given an initial vector **p**_{0}, the application of algorithm *A* to a function *f* induces a deterministic sequence of probability vectors:

**p**_{0}, **p**_{1}, **p**_{2}, …

We use this sequence to describe the behavior of the algorithm. Note that if we took into account the probability distributions *p*(**x**, *t*) to describe the EDA behavior, the algorithm could always generate different sequences of distributions for each possible σ. The use of probability vectors **p** provides a higher level of abstraction, which is essential in order to group EDA behaviors.

### 3.2 The Selection Scheme *φ*

As indicated previously, we assume selection schemes based on the rank of the solutions within the population. Since in algorithm *A* any probability value *p*_{i} is interpreted as the probability of the solution with position *i* in the permutation σ, the selection can be simply defined by a function φ over probability vectors, **p**^{s} = φ(**p**). This function modifies the probability values of the individuals according to the rank in which they are.

In order to create the partition of the space of problems, we do not need to consider any specific implementation of the selection operator φ. However, it will be essential for φ to satisfy two basic properties or axioms.

**Property 1 (Impartiality).** The selection operator is independent of the configuration **x** of a solution. This operator only takes into account the fitness value *f*(**x**) of the solution. Although this property is implicit in the definition of φ, we believe that it is worth a brief discussion. Thus, since we assume an ordinal-based selection scheme, given **p**, we always obtain the same probability vector **p**^{s} after selection, independent of σ. This fact is illustrated in the last column of Table 2, where we indicate that the probability vector after selection is the same for both functions.

**Property 2 (No degeneration).** The selection operator cannot assign extreme probabilities 1 or 0 to solutions whose probabilities are in the interval (0, 1). More formally, the vector **p**^{s}_{t}, computed as **p**^{s}_{t} = φ(**p**_{t}) at generation *t*, satisfies that if 0 < *p*_{i} < 1 then 0 < *p*^{s}_{i} < 1 for all *i* in every generation *t*. In addition, if *p*_{i} = 0 then *p*^{s}_{i} = 0. The convergence of the algorithm can only take place as a result of the iterative process when *t* tends to infinity.

The conditions mentioned above are needed in order to guarantee that the taxonomies of problems are valid for any implementation of φ. Therefore, the selection will play no role in the partition of the space of optimization problems.

### 3.3 The Approximation Step

In algorithm *A*, the approximation step deals with the probability distribution *p*^{s}(**x**) of the selected individuals. Therefore, the probability vector **p**^{s} has to be related to the corresponding solutions by means of the pair (σ, **p**^{s}). The approximation step is defined as a function of this pair, and it is computed as the vector **p**^{a} in the algorithm.

Note that the approximation is the only operator in *A* that takes into account the permutation σ. Therefore, it can translate the difference between functions into different behaviors of the algorithm.

## 4 Equivalent Functions and Equivalence Classes

At each generation, algorithm *A* maps the probability vector **p**_{t} to **p**_{t+1} through the composition of the selection and approximation steps.

The properties of functions similar to this iteration map have usually been studied by means of discrete dynamical systems. In González et al. (2001) and Zhang (2004), important insights and results about the convergence of some EDAs were provided using this approach. In the current paper, the definition of equivalence does not consider specific implementations for φ or for the approximation step and hence, we do not have a closed formulation with which to study the dynamics of the algorithms. Although the iterations of *A* can be modeled as a dynamical system, we will take a more general perspective to describe the behavior of EDAs.

The definition of equivalence between objective functions under EDAs can be expressed as follows.

**Definition 1.** *Let σ_{1} and σ_{2} be the permutations induced by the objective functions f_{1}(x) and f_{2}(x) respectively. Let A be an EDA with any given selection φ and approximation step. We say that σ_{1} and σ_{2} are equivalent under A, and by extension f_{1} and f_{2}, if for any initial vector **p**_{0}, the sequences of probability vectors generated for σ_{1} and σ_{2} are equal for all t ≥ 0*.

In a less formal way, we say that two functions are equivalent under *A* if the corresponding sequences of probability vectors induced by the algorithm are equal from any initial point. The equivalence between functions means that we have the same EDA behavior if we focus on the rank of the solutions instead of their specific configurations. Therefore, if two functions are equivalent, we can say that the algorithm will have the same performance in terms of solving Equation (1). In Table 3, a very simple example with *n* = 2 variables illustrates the equality of vector sequences. Departing from the uniform distribution, algorithm *A* induces exactly the same sequence of vectors **p**_{t} for the permutations σ_{1} and σ_{2}. If both sequences are equal from any initial **p**_{0}, then σ_{1} and σ_{2} will be equivalent. Note that the sequences of probability distributions *p*(**x**, *t*) are different.

**σ_{1}**

| **x** | **p**_{0} | **p**_{1} | **p**_{2} | **p**_{3} | … | *t* → ∞ |
|---|---|---|---|---|---|---|
| (1,1) | 0.25 | 0.4375 | 0.6836 | 0.8999 | … | 1 |
| (0,1) | 0.25 | 0.3125 | 0.2539 | 0.0962 | … | 0 |
| (0,0) | 0.25 | 0.1875 | 0.0586 | 0.0039 | … | 0 |
| (1,0) | 0.25 | 0.0625 | 0.0039 | 0.0000 | … | 0 |

**σ_{2}**

| **x** | **p**_{0} | **p**_{1} | **p**_{2} | **p**_{3} | … | *t* → ∞ |
|---|---|---|---|---|---|---|
| (1,0) | 0.25 | 0.4375 | 0.6836 | 0.8999 | … | 1 |
| (1,1) | 0.25 | 0.3125 | 0.2539 | 0.0962 | … | 0 |
| (0,1) | 0.25 | 0.1875 | 0.0586 | 0.0039 | … | 0 |
| (0,0) | 0.25 | 0.0625 | 0.0039 | 0.0000 | … | 0 |

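As an illustration, the sequences of Table 3 can be reproduced by one concrete rank-based scheme, the binary tournament selection of Zhang (2004) under the infinite population model, assuming the approximation step recovers **p**^{s} exactly: two solutions are drawn independently from **p** and the better-ranked one is kept, so the solution at rank *i* survives with probability *p*_{i}(*p*_{i} + 2Σ_{j>i} *p*_{j}). A sketch under these assumptions:

```python
# Binary tournament selection on a probability vector ordered by rank
# (rank 1 = optimum). Iterating the map from the uniform vector for n = 2
# reproduces the columns of Table 3.

def tournament(p):
    tails = [sum(p[i + 1:]) for i in range(len(p))]
    return [p_i * (p_i + 2 * t) for p_i, t in zip(p, tails)]

p = [0.25, 0.25, 0.25, 0.25]          # uniform p_0 for n = 2
for t in range(3):
    p = tournament(p)
    print([round(v, 4) for v in p])
# p_1: [0.4375, 0.3125, 0.1875, 0.0625]
# p_2: [0.6836, 0.2539, 0.0586, 0.0039]
# p_3: [0.8999, 0.0962, 0.0039, 0.0]
```

Note that the map preserves the total probability (the sum after selection is (Σ*p*_{i})² = 1) and never maps a value in (0, 1) to 0 or 1, in agreement with the no-degeneration property.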

Definition 1 provides an equivalence relation because it is reflexive, symmetric, and transitive. Given this equivalence relation, for each σ ∈ Σ, we have the equivalence class of σ, denoted by [σ]. The equivalence relation partitions the space of functions into equivalence classes under an algorithm *A*. The sequences of probability vectors generated by the algorithm uniquely identify the functions in a class.

The equivalence between functions is defined under a given algorithm *A* which implements a certain φ and a certain approximation step. However, as we discussed in the previous section, the two operators do not play the same role. If two functions are equivalent under *A*, then both functions will be equivalent for any other selection φ satisfying Properties 1 and 2. However, two functions that are equivalent under *A* might not be so under an algorithm that implements a different probabilistic model. We can conclude that only the factorization used to approximate *p*^{s}(**x**) can create different partitions of the space of problems. This partition is independent of the implementation of φ.

By taking into account all the aforementioned definitions of algorithm, function, and equivalence, we can deduce the following result assuming that φ satisfies the properties of impartiality and no degeneration.

**Theorem 1.** *Let A be an EDA whose approximation step satisfies p^{a}(x, t) = p^{s}(x, t) for all t. All the objective functions are in the same equivalence class under A*.

The proof of the theorem is straightforward. Two functions σ_{1} and σ_{2} are equivalent if *A* generates the same sequence of probability vectors departing from any initial **p**_{0}. Due to the impartiality of φ, we obtain the same vector **p**^{s}_{0} after selection for σ_{1} and σ_{2}. Next, if the approximation satisfies *p*^{a}(**x**) = *p*^{s}(**x**), then we have the same vector **p**^{a}_{0} for both functions. Since **p**_{t+1} = **p**^{a}_{t}, the algorithm obtains the same probability vector **p**_{1}, and consequently it will obtain the same vectors **p**_{t} at any time *t* > 0.

Therefore, if it is assumed that *p*(**x**, *t*+1) = *p*^{s}(**x**, *t*), the algorithm has the same behavior for all the injective functions and hence the same properties of convergence to the optimum (Mühlenbein et al., 1999; Zhang and Mühlenbein, 2004). As commented above, only the probabilistic model is able to create partitions of the space of problems. Moreover, if this model is able to exactly represent the distribution of the selected individuals, all the functions are in a single class. These results support the usual classification of EDAs, which is carried out according to the probabilistic model implemented.

### 4.1 Descriptors of the Behavior of EDAs

Once the space of problems is divided into equivalence classes, we could study the behavior that the algorithm exhibits in each class. In this regard, we discuss the role of two descriptors of the behavior of the EDA which are relevant for the purposes of the current paper. On the one hand, the most basic descriptor is the sequences of probability vectors. We know that the algorithm induces the same set of sequences for any function in the same class and different sets of sequences for functions in different classes. Therefore, the set of sequences associated with any function of a class can be used to unequivocally represent the behavior of the corresponding class.

On the other hand, the behavior of an EDA can be described by using the basins of attraction of the solutions in *S*. Basin of attraction is a term used in dynamical systems that we adopt in a simplified manner. Roughly speaking, the basin of attraction of a point **x** in a dynamical system is the set of initial points that converge to **x**. In our setting, the basin of attraction of a solution **x**_{i} is the set of initial probability vectors **p**_{0} for which the sequence generated by the algorithm converges to the degenerate distribution that assigns probability 1 to the ranking position of **x**_{i} in σ. According to this definition, the basins of attraction related to the solutions generate a partition of the set of valid initial vectors **p**_{0}.

The sequences of probability vectors express the complete process of convergence, whereas the basins of attraction represent only the final convergence of the algorithm. From Definition 1, we can deduce that if functions and are equivalent, the algorithm generates the same basins of attraction for both functions. Therefore, all the functions belonging to the same class have the same basins of attraction. However, two different classes could have the same basins of attraction although they always have different sequences of probability vectors. Informally speaking, we might say that, for functions in different classes, the algorithm can reach the same places although it uses different roads.

## 5 Case Study: Univariate EDA

The concepts and definitions about the taxonomy of problems that were abstractly discussed in the previous sections will be studied in depth for a simple EDA that assumes independence among the variables of the problem. Different algorithms such as population-based incremental learning (PBIL), compact genetic algorithm (cGA), and univariate marginal distribution algorithm (UMDA) introduce this type of model (Larrañaga and Lozano, 2002).

This type of algorithm approximates *p*^{s}(**x**) by means of the following univariate factorization:

*p*^{a}(**x**) = ∏_{i=1}^{n} *p*^{s}(*x*_{i}),    (3)

where the parameters associated with the factorization specify the marginal probability distributions *p*^{s}(*x*_{i}); each parameter is the probability value *p*^{s}(*X*_{i} = *x*_{i}). In EDAs with finite population, each parameter can be estimated from the selected individuals by using different approaches such as relative frequencies or more sophisticated update rules. Nevertheless, in the model of infinite population, the marginal probabilities can be exactly calculated as

*p*^{s}(*X*_{i} = *x*_{i}) = Σ_{**x**' ∈ *S* : *x*'_{i} = *x*_{i}} *p*^{s}(**x**').    (4)

From now on, we will ignore the explicit reference to the parameters in Equation (3). Specifically, the approximation step is implemented in algorithm *A* as the product of the marginal distributions calculated from the pair (σ, **p**^{s}).
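A minimal sketch of this approximation step (helper names are ours, not the paper's): the marginals are computed exactly from the pair (σ, **p**^{s}) and multiplied to obtain *p*^{a}(**x**). As input we reuse the permutation σ_{1} of Table 3 and its vector **p**^{s} at *t* = 0:

```python
# Univariate approximation of Equation (3) under the infinite population
# model: exact marginals from (sigma, p^s), then their product over S.
from itertools import product
from math import prod

def univariate_approximation(sigma, p_s):
    n = len(sigma[0])
    # Marginal p^s(X_i = v): sum of the probabilities of the ranked
    # solutions whose i-th component equals v (Equation (4)).
    marg = [{v: sum(p for x, p in zip(sigma, p_s) if x[i] == v) for v in (0, 1)}
            for i in range(n)]
    # p^a(x) is the product of the marginals, for every solution in S.
    return {x: prod(marg[i][x[i]] for i in range(n))
            for x in product((0, 1), repeat=n)}

sigma = [(1, 1), (0, 1), (0, 0), (1, 0)]       # sigma_1 of Table 3
p_s = [0.4375, 0.3125, 0.1875, 0.0625]         # p^s at t = 0
p_a = univariate_approximation(sigma, p_s)
# marginals: p(X_1=1) = 0.4375 + 0.0625 = 0.5, p(X_2=1) = 0.4375 + 0.3125 = 0.75
```

Note that *p*^{a} generally differs from *p*^{s} (here *p*^{a}(1, 1) = 0.5 · 0.75 = 0.375 while *p*^{s}(1, 1) = 0.4375): the univariate model cannot represent every distribution, which is precisely why it can separate functions into different classes.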

### 5.1 Equivalence Condition

Each marginal probability *p*^{s}(*x*_{i}) can be put into relation with the ranking of the solutions which are involved in it. Thus, we can rewrite Equation (4) as

$$p^{s}(x_i) = \sum_{j \in Q_i^{x_i}} p_j^{s} \qquad (6)$$

where *Q*^{x_i}_{i} is the set of ranking positions corresponding to the points whose probabilities have been used to calculate the marginal distribution. Note that the cardinality of any *Q*^{x_i}_{i} is always 2^{n-1}. For each marginal probability, we have the set of ranking positions associated to *p*(*X*_{i}=0), denoted by *Q*^{0}_{i}, and the set associated to *p*(*X*_{i}=1), denoted by *Q*^{1}_{i}. Note that the sets are intrinsically related to the parameters θ_{i}. Then, we associate the set *O*_{i} = {*Q*^{0}_{i}, *Q*^{1}_{i}} with the marginal distribution *p*(*x*_{i}). Note that for all *i*, *Q*^{0}_{i} ∩ *Q*^{1}_{i} = ∅ and *Q*^{0}_{i} ∪ *Q*^{1}_{i} = {1, …, 2^{n}}. In addition, all the sets involved in the factorization have to be different. Finally, we define the set *O* = {*O*_{1}, …, *O*_{n}}, which includes all the information needed to link the function and the factorization. The ranking positions belonging to each subset depend on the function to which the algorithm is applied. In summary, the relationship between the probabilistic model and the function is expressed by the structure of sets represented in Figure 2(a). An illustrative example of the definitions mentioned above is presented in Figure 2(b). We can see that, by using Equation (6), the marginal probability *p*(*X*_{1}=0) is calculated through the solutions with the rankings in *Q*^{0}_{1}. In turn, *p*(*X*_{1}=1) is associated with the set of rankings *Q*^{1}_{1}. Thus, we have the set *O*_{1} = {*Q*^{0}_{1}, *Q*^{1}_{1}} related to the marginal distribution *p*(*x*_{1}). In this example, we also have the set *O*_{2} associated with *p*(*x*_{2}) and *O*_{3} for *p*(*x*_{3}). Finally, we can create the corresponding set *O* = {*O*_{1}, *O*_{2}, *O*_{3}} associated with the factorization of the probability distribution and the function being optimized.
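The construction of these sets can be sketched in a few lines; the representation below (a ranking as a tuple of solutions, sets of 1-based ranking positions, `O_of` as a helper name) is our own illustrative choice:

```python
def O_of(sigma):
    """Build O = {O_1, ..., O_n} for a ranking sigma, where sigma[j] is the
    solution of rank j+1 and each O_i = {Q_i^0, Q_i^1} is an unordered pair
    of frozensets of ranking positions."""
    n = len(sigma[0])
    O = set()
    for i in range(n):
        q0 = frozenset(r + 1 for r, x in enumerate(sigma) if x[i] == 0)
        q1 = frozenset(r + 1 for r, x in enumerate(sigma) if x[i] == 1)
        O.add(frozenset((q0, q1)))
    return frozenset(O)

# The ranking sigma of Table 4: sigma[0] is the best solution (rank 1).
sigma = [(1,1,1), (0,1,0), (0,0,1), (1,0,0),
         (0,1,1), (1,0,1), (1,1,0), (0,0,0)]
O = O_of(sigma)
```

Each Q set has cardinality 2^{n-1} = 4, and every pair {Q_i^0, Q_i^1} partitions the ranking positions {1, …, 8}, as stated in the text.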

In general, we can prove the following necessary and sufficient condition of equivalence between objective functions when the algorithm implements the approximation step according to Equation (3). By using this condition, we will carry out the partition of the space of functions into equivalence classes.

**Theorem 2.** *Let an EDA implement the approximation step as the univariate factorization of Equation (3). Two functions σ and σ′ are equivalent under this EDA if and only if the corresponding sets O and O′ are equal.*

Two functions σ and σ′ are equivalent if the algorithm generates the same sequence of probability vectors departing from any initial **p**_{0}. Therefore, we need to prove that **p**_{t} = **p**′_{t} in every generation *t* for all initial **p**_{0} if and only if *O* = *O*′. Firstly, we show that if *O* = *O*′, then **p**_{t} = **p**′_{t}. In order to do that, and taking into account that the selection operator is independent of the function, it is enough to prove that the approximation step produces the same result for both functions for any given **p**^{s}. *O* = *O*′ if and only if for any *O*_{i} ∈ *O* there exists *O*′_{j} ∈ *O*′ such that *O*_{i} = *O*′_{j}, and vice versa. In turn, *O*_{i} = *O*′_{j} if and only if for any *Q*^{x_i}_{i} ∈ *O*_{i} there exists *Q*′^{x_j}_{j} ∈ *O*′_{j} such that *Q*^{x_i}_{i} = *Q*′^{x_j}_{j}, and vice versa. Therefore, according to Equation (6), if *O* = *O*′, then we are calculating the same set of probability values for both functions and we will obtain the same probability vector **p**^{a} in both cases.

Secondly, we prove that if **p**_{t} = **p**′_{t} in every generation *t* for all initial **p**_{0}, then *O* = *O*′. To prove this part of the theorem, it suffices to consider a specific set of initial probability vectors containing values greater than 0 only in the desired positions. Thus, let **p**_{0} be any initial vector such that *p*_{i}>0 if *i* ∈ *Q*^{1}_{1}, otherwise *p*_{i}=0. These initial vectors have probability values greater than 0 only in the positions associated with solutions that begin with 1 in σ. Due to Property 2 of the selection, we obtain a vector **p**^{s}_{0} after selection with nonzero probabilities in the same positions as in **p**_{0}. Then, we have that *p*^{s}(*X*_{1}=1, *t*=0)=1 for σ after selection. Since we know that the sequences coincide, it necessarily implies that there exists *j* such that *p*^{s}(*X*_{j}=*x*_{j}, *t*=0)=1 for σ′ and therefore *Q*^{1}_{1} = *Q*′^{x_j}_{j}. Otherwise, we would have *p*_{i}>0 for more than one index *i* in the marginal of *X*_{j} and hence, the sequence would be different. By the same argument, any initial vector satisfying *p*_{i}>0 if *i* ∈ *Q*^{1}_{2} implies that there exists *k* such that *O*_{2} = *O*′_{k}. The same process is repeated for the remaining indices until *n*, where we consider any initial point such that *p*_{i}>0 if *i* ∈ *Q*^{1}_{n}. Since we have already matched *n*−1 sets *O*_{i} with the corresponding *n*−1 sets *O*′_{j}, there remains a last index *l* such that *O*_{n} = *O*′_{l}. Therefore, if the algorithm generates the same sequences of probability vectors from every initial point for σ and σ′, then *O* = *O*′.

### 5.2 Characterization of Equivalence Classes

Before addressing in detail the description of the functions belonging to a class, we show in Table 4 an example of three equivalent functions σ, σ′, and σ″. We applied two different operations to obtain these equivalent functions. Firstly, we generated the function σ′ by negating the values *x*_{1} for all **x**=(*x*_{1}, *x*_{2}, *x*_{3}) belonging to σ. Secondly, the function σ″ was obtained by swapping the values *x*_{2} and *x*_{3} for all **x**=(*x*_{1}, *x*_{2}, *x*_{3}) belonging to σ′. This operation can be seen as a swapping of the columns *x*_{2} and *x*_{3}. In Figure 3, we present the corresponding sets *O*, *O*′, and *O*″ related to the functions of Table 4. Since these sets are equal, we know by Theorem 2 that the functions are equivalent.

| Rank | σ: (*x*_{1}, *x*_{2}, *x*_{3}) | σ′: (¬*x*_{1}, *x*_{2}, *x*_{3}) | σ″: (¬*x*_{1}, *x*_{3}, *x*_{2}) |
|---|---|---|---|
| 1 | (1,1,1) | (0,1,1) | (0,1,1) |
| 2 | (0,1,0) | (1,1,0) | (1,0,1) |
| 3 | (0,0,1) | (1,0,1) | (1,1,0) |
| 4 | (1,0,0) | (0,0,0) | (0,0,0) |
| 5 | (0,1,1) | (1,1,1) | (1,1,1) |
| 6 | (1,0,1) | (0,0,1) | (0,1,0) |
| 7 | (1,1,0) | (0,1,0) | (0,0,1) |
| 8 | (0,0,0) | (1,0,0) | (1,0,0) |

In general, taking into account that σ is defined as a 2^{n}-tuple of the solutions in *S*, the following two operations permit the description of the functions in the class of σ:

**Operator M1 (Negation).** Given a permutation σ = (**x**^{1}, …, **x**^{2^n}) and an index *i*, the operator M1 returns a function σ′ = (**y**^{1}, …, **y**^{2^n}) verifying that *y*_{i} = ¬*x*_{i} for all ranking positions. Through this operation we change zeros with ones, and vice versa, in the corresponding variable *X*_{i}. This operator can be successively applied until all combinations of variable negations are obtained. Thus, including σ in the count, we can generate 2^{n} different functions by means of M1.

**Operator M2 (Swapping).** Given a permutation σ = (**x**^{1}, …, **x**^{2^n}) and two indexes *i*, *j*, the operator M2 returns a function σ′ = (**y**^{1}, …, **y**^{2^n}) verifying that *y*_{i} = *x*_{j} and *y*_{j} = *x*_{i} for all ranking positions. If we consider the elements of σ disposed in columns, then we can say that the permutation of the columns associated with the variables *X*_{i} and *X*_{j} produces an equivalent function. From this interpretation, it is easy to see that *n*! different functions can be generated by applying M2.
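A minimal sketch of the two operators (helper names are our own), applied to the ranking of Table 4 to regenerate its second and third columns:

```python
def m1_negate(sigma, i):
    """Operator M1: negate variable X_i in every solution, keeping the ranks."""
    return [tuple(1 - v if k == i else v for k, v in enumerate(x)) for x in sigma]

def m2_swap(sigma, i, j):
    """Operator M2: exchange the columns of variables X_i and X_j."""
    out = []
    for x in sigma:
        y = list(x)
        y[i], y[j] = y[j], y[i]
        out.append(tuple(y))
    return out

sigma = [(1,1,1), (0,1,0), (0,0,1), (1,0,0),
         (0,1,1), (1,0,1), (1,1,0), (0,0,0)]
sigma_p  = m1_negate(sigma, 0)     # negate x_1: sigma' of Table 4
sigma_pp = m2_swap(sigma_p, 1, 2)  # then swap x_2 and x_3: sigma'' of Table 4
```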

According to Theorem 2, we can assert that, given a function σ, the aforementioned operators return equivalent functions σ′ because *O* = *O*′. In fact, these two operators are a consequence of this equality of sets. We have previously shown how the set *O* is created from σ according to the probabilistic model used by the algorithm. Inversely, we can also read the permutation σ from the set *O*. Thus, a set *Q*^{x_i}_{i} is telling us the positions of the permutation in which *X*_{i} = *x*_{i} (cf. Figure 2). If these positions are read from every set, then we will obtain σ. Note that if we modify the ranking positions belonging to *Q*^{x_i}_{i}, we are modifying the positions in which *X*_{i} = *x*_{i}, and therefore we will read a different permutation σ′.

Specifically, departing from a given *O*, we can move the ranking positions that this set contains in two different ways in order to read different functions σ′ such that *O*′ = *O*. The first type of movement is as follows. Given any *i*, the ranking positions belonging to *Q*^{0}_{i} and *Q*^{1}_{i} are swapped together between both subsets to create an equal set *O*′_{i} in which *Q*′^{0}_{i} = *Q*^{1}_{i} and *Q*′^{1}_{i} = *Q*^{0}_{i}. From *O*′, we can read a new function σ′ equivalent to σ. In Figure 3, this type of movement is applied to the set *O*_{1}, obtaining an equal set *O*′_{1}. From this movement, we deduce that the negation operator M1 produces equivalent functions.

In the second type of movement, given any pair *i*, *j*, we can exchange the ranking positions belonging to both sets *O*_{i} and *O*_{j} as follows. Let *Q*^{x_i}_{i} ∈ *O*_{i} and *Q*^{x_j}_{j} ∈ *O*_{j}; we swap the elements between the sets *Q*^{x_i}_{i} and *Q*^{x_j}_{j} to create the set *O*′ in which *Q*′^{x_i}_{i} = *Q*^{x_j}_{j} and *Q*′^{x_j}_{j} = *Q*^{x_i}_{i}. All the ranking positions belonging to *Q*^{x_i}_{i} have to be moved at the same time; otherwise, the associated marginal probability will make no sense. In Figure 3, this type of movement is applied to the sets *O*_{2} and *O*_{3}, producing an equal set *O*″. From this movement, we deduce that the swapping operator M2 produces equivalent functions.

By combining the operators M1 and M2, the number of functions belonging to each equivalence class is *n*!·2^{n}. Therefore, the number of classes is

$$\frac{2^{n}!}{n!\,2^{n}}$$

This is the number of different behaviors that a univariate EDA can show in solving Equation (1).
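A quick computation (the function name is our own) confirms the counts used later in Section 6:

```python
from math import factorial

def num_classes(n):
    """Number of equivalence classes of injective functions on {0,1}^n:
    (2^n)! injective functions in total, n! * 2^n functions per class."""
    return factorial(2 ** n) // (factorial(n) * 2 ** n)

print(num_classes(3))   # 8! / 48 = 40320 / 48 = 840
```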

The operators of negation and swapping presented in this section are closely related to the xor-invariance and the permutation-invariance introduced in Lehre and Witt (2012) in order to conduct time complexity analysis of randomized search heuristics using unbiased variation operators. Although the operators presented in the current paper and the properties defined in Lehre and Witt have technical differences, both approaches provide a scenario in which the algorithm is not biased by the specific configuration of the solutions. In the terms used in Lehre and Witt, and setting aside the differences, we could say that the univariate EDA is an unbiased algorithm.

### 5.3 Equivalence Classes and Local Optima

The impact that different problem characteristics have on the performance of search algorithms, and hence on the difficulty of the problem, is a fundamental topic in the optimization field. Properties such as the number of local optima, additive decomposability of the function (Mühlenbein and Mahnig, 1999; Gao and Culberson, 2005), fitness landscape networks (Liu et al., 2012), and different difficulty measures (Naudts and Kallel, 2000; Liu et al., 2012) have been proposed and studied in order to better understand the performance of EAs. In fact, problem difficulty analysis is also useful to classify optimization problems (Jansen, 2001). Although this type of classification is performed in a very different way from the taxonomy of problems presented in the current paper, it could be possible to create links between both approaches. Thus, the equivalence classes can be connected to characteristics of the problems belonging to them with the aim of assisting in the prediction of problem difficulty for EDAs.

The search space *S* can be endowed with the neighborhood system induced by the Hamming distance. Thus, the distance *H*(**x**, **y**) between two solutions **x** and **y** is given by

$$H(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} |x_i - y_i|$$

The neighbors of a solution **x** are those solutions **y** such that *H*(**x**, **y**)=1. In terms of the permutation σ, a solution **x** is called a local optimum if it precedes in σ every solution **y** such that *H*(**x**, **y**)=1. Figure 4 illustrates how the neighborhood system and the local optima can be seen for a given σ. Each vertex of the cube represents a solution **x** ∈ *S*. Below each solution, we have the corresponding position of each **x** in σ. It can be checked that the first four solutions in the permutation are local optima because they are better solutions than their neighbors. These solutions are marked in bold.

The relation between the equivalence classes and the Hamming neighborhood system in *S* is established by the following theorem.

**Theorem 3.** *All the functions in the same equivalence class have the same number of local optima, and these local optima occupy the same ranking positions.*

A local optimum can be defined according to the position that it holds in the permutation and the Hamming distance to the preceding solutions. Then, the solution **x**^{i} in position *i* is a local optimum for σ if *H*(**x**^{i}, **x**^{j})>1 for all *j*<*i*. Given a function σ, it is easy to see that any function σ′ obtained by applying the operators M1 and M2 of negation and swapping verifies *H*(**y**^{i}, **y**^{j}) = *H*(**x**^{i}, **x**^{j}) for all *i*, *j*. Therefore, if there is a local optimum in the position *i* of σ, then there is a local optimum in the same position *i* of σ′, and vice versa. Since the operators M1 and M2 allow for a description of all the functions of a class, these functions always have the same number of local optima and they are in the same ranking positions.
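As an illustration of Theorem 3 (a sketch with our own helper names), the ranking of Table 4 has four local optima occupying ranks 1 to 4, and this is preserved when a negation is applied:

```python
def local_optima_ranks(sigma):
    """Ranking positions (1-based) of the local optima of sigma under the
    Hamming-1 neighborhood; sigma[j] is the solution of rank j+1."""
    rank = {x: r + 1 for r, x in enumerate(sigma)}
    n = len(sigma[0])
    optima = []
    for x in sigma:
        neighbors = (tuple(1 - v if k == i else v for k, v in enumerate(x))
                     for i in range(n))
        if all(rank[x] < rank[y] for y in neighbors):
            optima.append(rank[x])
    return sorted(optima)

sigma = [(1,1,1), (0,1,0), (0,0,1), (1,0,0),
         (0,1,1), (1,0,1), (1,1,0), (0,0,0)]
print(local_optima_ranks(sigma))   # → [1, 2, 3, 4]

# Negating x_1 (operator M1) preserves Hamming distances, hence the optima.
sigma_neg = [(1 - a, b, c) for (a, b, c) in sigma]
```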

From the results presented in Lehre and Witt (2012), it is also possible to deduce that arbitrary applications of M1 and M2 preserve Hamming distance. This is due to the property of Hamming-invariance that the unbiased variation operators satisfy.

Theorem 3 agrees with the results presented by González et al. (2001) and Zhang (2004). According to these papers, all the local optima are attractive fixed points (Scheinerman, 1996) of the dynamical systems that were used to model the corresponding univariate EDA implementations. Therefore, since we say that the algorithm has the same behavior for all the functions belonging to a class, all those functions should necessarily have the same number of local optima and in the same ranking positions to support González et al. and Zhang. Nevertheless, Theorem 3 also provides a more general perspective because it implies that the relation between univariate EDAs and the local optima of the function is an intrinsic property of the probabilistic model and is independent of the implementation of the selection scheme.

## 6 Numerical Experiments in *S* = {0,1}^{3}

In this section, we use the previously elaborated partition into classes to carry out a more detailed analysis of the different behaviors the univariate EDA can exhibit on the injective functions defined over *S* = {0,1}^{3}. Note that conducting numerical experiments to analyze the behavior of the algorithm in all equivalence classes is only feasible for very small *n* (e.g., *n*=3). For greater *n*, conducting an exhaustive analysis of the different EDA behaviors becomes intractable, and then only theoretical studies or specific experiments can be carried out. We conduct numerical simulations of an EDA with infinite population which implements tournament selection and computes the approximation step according to Equation (5). The local dynamical behavior of this algorithm was theoretically studied in Zhang (2004).

Particularly, we will concentrate on the complexity of the classes for the algorithm. The complexity is measured by two descriptors: (i) the size of the basin of attraction of the global optimum, and (ii) the generated sequence of probability vectors. We will consider that the smaller the size of the basin of attraction in a class, the more difficult the problems in that class are. Moreover, for two classes with the same basin size, the more time the algorithm takes to converge, the more difficult the function is.

In order to add more information to this analysis, and taking into account the relevant role that the local optima play in the EDA that assumes independence (see Theorem 3), we will put the previous complexity results in relation to this problem characteristic. We will see that the complexity for the EDA in terms of the size of the basin of attraction is closely related to the number of local optima and their positions in the function ranking. To the best of our knowledge, this is the first time that such an analysis appears in the literature.

### 6.1 Experimental Design

We take into account all the possible injective functions that can be constructed over the search space *S* = {0,1}^{3}. Therefore, we consider (2^{3})! = 40320 functions. By creating the set *O* for each function and applying Theorem 2 for each pair of sets, we group the functions by equivalence classes. We have 3!·2^{3} = 48 functions per class and hence 840 classes. We only need to consider one function per class because all the functions in the same class behave equivalently. The selection of the function which represents the class is arbitrary.
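This grouping can be reproduced with a short enumeration (a sketch; `O_of` is our own name for the construction of the sets of Theorem 2):

```python
from itertools import permutations, product

def O_of(sigma):
    """The set O of Theorem 2 for a ranking sigma (1-based rank positions)."""
    n = len(sigma[0])
    return frozenset(
        frozenset((frozenset(r + 1 for r, x in enumerate(sigma) if x[i] == 0),
                   frozenset(r + 1 for r, x in enumerate(sigma) if x[i] == 1)))
        for i in range(n))

points = list(product((0, 1), repeat=3))
# One distinct O set per equivalence class: 40320 rankings collapse into 840.
classes = {O_of(sigma) for sigma in permutations(points)}
print(len(classes))
```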

To carry out the EDA simulations, we need to specify four elements: the initial points, the selection mechanism, the approximation step (Equation (5)), and the stopping condition. We create 10,000 initial probability vectors which try to be representative of the simplex. These initial points have been randomly generated by sampling a Dirichlet distribution with all the parameters equal to 1. In addition, we also take into account the uniform distribution as an initial point. Then, for each function, we launch 10,001 EDA runs, one from each initial probability vector previously generated. All these runs try to represent the EDA behavior for the corresponding optimization problem.

The probability vector **p**^{s} after binary tournament selection can be computed from the vector **p** as follows. The solution with rank *i* is selected when it is drawn against itself or against any solution with a worse ranking:

$$p^{s}_{i} = p_{i}^{2} + 2\,p_{i} \sum_{j=i+1}^{2^{n}} p_{j}$$

Tournament selection obeys the properties that we imposed on the selection scheme in Section 3.2. Therefore, the partition induced from the equivalence condition of Theorem 2 is valid.
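The exact selection equation is not reproduced in this extract; as a sketch under that assumption, the standard rank-based expression for binary tournament with replacement can be coded as follows (the function name is our own):

```python
def tournament(p):
    """Binary tournament with replacement over a rank-ordered probability
    vector p (p[0] is the best solution): a solution survives when drawn
    against itself or against a worse-ranked rival."""
    ps, tail = [0.0] * len(p), 0.0
    for i in range(len(p) - 1, -1, -1):   # accumulate the mass of worse ranks
        ps[i] = p[i] ** 2 + 2.0 * p[i] * tail
        tail += p[i]
    return ps

p = [1.0 / 8] * 8      # uniform vector over the 8 rankings of {0,1}^3
ps = tournament(p)     # selection shifts probability mass toward better ranks
```

The output is still a probability vector (the terms sum to (Σ p_i)² = 1), and the best-ranked solution always gains mass, e.g. from 1/8 to 15/64 here.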

The stopping condition of the algorithm is a maximum of 40 iterations. This number of generations provides a satisfactory trade-off between accuracy in the numerical results and computational cost. The highest probability value in the vector converges to 1 in 93% of all the runs conducted. In the remaining runs, the highest probability value after 40 generations is always greater than 0.9998. In these cases, it is assumed that the algorithm has converged to the corresponding solution. The numerical precision that we have used is double-precision floating point, requiring 64 bits per stored value.

In the numerical analysis, the sizes of the basins of attraction can be stored in a vector **b** = (*b*_{1}, …, *b*_{2^n}), where each *b*_{i} is the number of initial points that have converged to the solution with rank *i*.
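An end-to-end sketch of one such experiment follows (all names are our own; binary tournament with replacement is assumed as the selection scheme): sample Dirichlet(1, …, 1) initial vectors, iterate the infinite-population univariate EDA, and record in `b[i]` how many runs concentrate their final mass on the solution of rank *i*+1.

```python
import random

def tournament(p):
    """Binary tournament with replacement over a rank-ordered vector p."""
    ps, tail = [0.0] * len(p), 0.0
    for i in range(len(p) - 1, -1, -1):
        ps[i] = p[i] ** 2 + 2.0 * p[i] * tail
        tail += p[i]
    return ps

def step(p, sigma):
    """One generation: selection, exact marginals, univariate resampling."""
    n = len(sigma[0])
    ps = tournament(p)
    marg = [sum(q for x, q in zip(sigma, ps) if x[i] == 1) for i in range(n)]
    out = []
    for x in sigma:
        v = 1.0
        for xi, mi in zip(x, marg):
            v *= mi if xi == 1 else 1.0 - mi
        out.append(v)
    return out

def basin_vector(sigma, runs=200, iters=40, seed=0):
    rng = random.Random(seed)
    b = [0] * len(sigma)
    for _ in range(runs):
        w = [rng.expovariate(1.0) for _ in sigma]   # Dirichlet(1,...,1) sample
        s = sum(w)
        p = [wi / s for wi in w]
        for _ in range(iters):
            p = step(p, sigma)
        b[max(range(len(p)), key=p.__getitem__)] += 1
    return b

sigma = [(1,1,1), (0,1,0), (0,0,1), (1,0,0),
         (0,1,1), (1,0,1), (1,1,0), (0,0,0)]
b = basin_vector(sigma)
```

The paper uses 10,001 initial vectors per function; the smaller `runs` here only keeps the sketch fast.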

### 6.2 Results

First of all, in Figure 5, we show the proportion of functions with different numbers of local optima in order to provide a general perspective of this problem characteristic. By considering classes instead of specific functions, we have 180 classes with just one local optimum (the global optimum), 510 classes with two local optima, 120 classes with three local optima, and finally 30 classes with four local optima.

#### 6.2.1 Analysis Based on Basins of Attraction

In order to provide a first general picture of the equivalence classes, we represent in Figure 6 the basins of attraction of the optimum by means of different colors. The sizes of these basins of attraction are interpreted in terms of problem difficulty. The color bar on the right of this picture indicates the relation between the colors and the size of the basin. At the top of the spectrum, the white color is assigned to the largest basins of attraction, which indicate easy problems. At the bottom, the dark colors represent small basins and hence reveal the hardest problems. In addition, the classes have been grouped by the number of local optima. Thus, the picture has been divided into four parts separated by vertical dashed lines. From left to right, we have the classes with one, two, three, and four local optima, respectively. In each of these parts, the classes are ordered according to the size of the basins of attraction.

Note that we have assigned the white color only to the classes in which the size of the basin is 10,001 or very close to that number. We start to use light gray colors when approximately 150 initial points do not converge to the optimum. We try to highlight all the small variations between classes because they could imply dramatic differences in EDAs with finite populations and problems with greater dimension. According to Figure 6, we could say that the white classes are easy. These white classes cover all the problems with one local optimum and a small number of problems with two local optima. It can be observed that a higher number of local optima does not necessarily imply more difficult problems. In fact, the darkest colors are in the area corresponding to classes with two local optima. It is the zone in which we can see the widest range of colors.

As discussed in Section 4, different classes can have the same basins of attraction. Thus, Figure 6 shows how the classes can be grouped according to these basins of attraction. This fact could be related to the existence of a second level of grouping among classes. Nevertheless, it deserves a specific study.

Through Figure 7, we analyze the basins of attraction of both the global optimum and the rest of local optima. We consider all the ranking positions at which the local optima can be allocated. In this case, when we refer to local optima, the global optimum is not included. The different scenarios are indicated in the legends and expressed by means of different markers. Note that in these plots only the basins of the global optimum and the best local optimum are explicitly indicated. The dashed line *y*=10001−*x* is used as a reference to illustrate the proportion of initial points that have converged to the rest of the local optima.

In Figure 7(a), we show the classes with one local optimum. This solution can be in different positions of the permutation σ as indicated in the figure. Note that the difficulty of the problem strongly depends on the ranking position of the local optimum. When this solution changes from the second position to the third position of the ranking, the difficulty of the problem decreases dramatically. In fact, when the local optimum has rank 5, the complexity of the problems is very similar to that of a problem without local optima. Figures 7(b) and 7(c) show the classes with two and three local optima, respectively. We can see how the final convergence of the algorithm is distributed among the different local optima. Analogously to the previous situation, as the local optima occupy worse ranking positions, their basins of attraction clearly decrease. Particularly, when the rank of the worst local optimum changes from 3 to 4 in Figure 7(b), the difficulty of the problems dramatically decreases. In Figure 7(c) the changes are more subtle. Nevertheless, we can observe that the classes with a local optimum in the fifth position never reach the highest complexities of the classes with a local optimum in the fourth position.

#### 6.2.2 Analysis Based on Sequences

We know that the sequences of probability vectors generated by the algorithm uniquely identify the functions in a class. In this section, we use this fact in order to distinguish the different EDA behaviors for the classes with only one local optimum (the global optimum). According to the basins of attraction (Figure 6), all these classes have the same complexity for the algorithm. However, in Figure 8 we can observe different convergence behaviors. In Figure 8(a), we show the curves that the probability of the optimum depicts throughout the generations. Alternatively, Figure 8(b) represents the same probabilities of the optimum by means of colors. The specific probability values are shown in the color bar on the right. Particularly in Figure 8(b), we can clearly see how several classes on the bottom of the chart have a slower convergence. This slower convergence to the optimum can also be interpreted as a consequence of facing harder problems. From this point of view, we could say that not all the functions with one local optimum have the same complexity. This fact could have implications in EDAs with finite samples.

## 7 Discussion and Future Work

There are a number of directions in which we plan to extend the results presented in this paper. In the following paragraphs, the most important points are discussed.

An important generalization is the introduction of noninjective functions in the analysis. In general terms, instead of dealing with only one permutation σ to represent a function, we could consider a set of permutations to represent noninjective functions. An adequate definition and management of noninjectivity in the algorithm would be essential in order to extend Theorems 1 and 2 to this type of function.

The approximation step of the algorithm can be extended to deal with a set of different factorizations, or this operator could be restricted to work with a maximum complexity for the probabilistic model. By means of this generalization, the algorithm could represent the behaviors of complex EDAs such as those that implement learning of Bayesian networks.

The sets *O* can also be extended to include any type of factorization expressed by means of marginal and conditional probability functions. In this regard, it would be necessary to develop new properties to guarantee valid equivalences and to characterize equivalent functions. An interesting related issue is to know how the number of classes decreases as the complexity allowed in the probabilistic model increases.

In turn, the definition of equivalence can be directly extended to include noninjective functions or more complex algorithms because, in essence, we only demand equal sequences of probability vectors. This is independent of the way in which the noninjectivity or the probabilistic models are managed by the algorithm.

In relation to the selection scheme, we distinguish the following two important open issues. Firstly, alternative definitions of equivalence between problems should be developed to include selection schemes that take into account the specific function values of the solutions. Secondly, a crucial point is to take into account the convergence of the algorithm to the optimal solution. In order to deal with convergence issues in the framework of the equivalence classes, we need to take into account the properties that the selection should obey (Section 3).

In the current paper, we have shown the connection between the equivalence classes and the neighborhood system induced by Hamming distance for univariate EDAs. We hypothesize that it is possible to discover other links with problem characteristics and descriptors. The equivalence classes can be connected, on the one hand, with problem descriptors and, on the other hand, with the difficulty of the problems. Therefore, the information that we could have about the problem at hand can be used to identify the class to which it belongs and then try to advance how the algorithm will perform.

## 8 Conclusions

This work can be divided into three main parts. Firstly, we have laid the foundations to create taxonomies of problems under EDAs by providing the needed definitions regarding the optimization problems, the algorithm, and the equivalence relation. From these definitions, it has been deduced that all the problems are in the same class when the probabilistic model does not impose restrictions to approximate the distribution of the selected individuals. Secondly, we have studied the taxonomy of problems that arises under a univariate EDA. To express the relation between the probabilistic model and the function, we have defined the sets *O*. Based on these sets, we are able to provide a necessary and sufficient condition to decide the equivalence between functions and to partition the space of problems. Through the operators of negation and swapping, it is possible to describe all the functions in a class and count its members. By taking into account the aforementioned elements, we reveal an intrinsic connection between the univariate probabilistic model and the neighborhood system induced by Hamming distance. In the third and last part, we have conducted numerical simulations of a univariate EDA which implements tournament selection. We can extract the following main conclusions from the experiments. (i) A higher number of local optima does not necessarily imply more difficult problems. In this regard, the difficulty of the problem strongly depends on the ranking positions of the local optima. (ii) We have observed how the classes can be grouped according to the basins of attraction, showing a second level of grouping. (iii) We have shown that not all the functions whose only local optimum is the global optimum have the same complexity.

In general terms, this paper introduces a framework to formally study the relationship between EDAs and the space of optimization problems. The results that we have presented can be generalized and extended in many directions. Specifically, once the partition of the space of problems has been created, we consider the following questions particularly important:

How can we describe and identify the classes of easy and hard problems for EDAs? (Chen et al., 2010)

Which problem descriptors allow us to identify the class to which a given problem belongs?

How can we study the convergence of the algorithm (Zhang, 2004; Zhang and Mühlenbein, 2004) for the problems in a class?

What is the minimum complexity that should be introduced in the probabilistic model in order to converge to the optimum? (Echegoyen et al., 2012)

How can we extrapolate these results to algorithms that use finite samples?

These types of issues could be translated to any EA. We believe that working in the direction given by these questions is important to reach an in-depth understanding of the underlying mechanisms that govern EAs.

## Acknowledgments

This work was partially supported by the Saiotek and Research Groups 2007-2012 (IT-242-07) programs (Basque Government), TIN2010-14931 (Spanish Ministry of Science and Innovation), and COMBIOMED network in computational biomedicine (Carlos III Health Institute). Carlos Echegoyen holds a fellowship from UPV-EHU.