## Abstract

Estimation of distribution algorithms (EDAs) are stochastic optimization techniques that explore the space of potential solutions by building and sampling explicit probabilistic models of promising candidate solutions. While the primary goal of applying EDAs is to discover the global optimum or at least its accurate approximation, besides this, any EDA provides us with a sequence of probabilistic models, which in most cases hold a great deal of information about the problem. Although using problem-specific knowledge has been shown to significantly improve performance of EDAs and other evolutionary algorithms, this readily available source of problem-specific information has been practically ignored by the EDA community. This paper takes the first step toward the use of probabilistic models obtained by EDAs to speed up the solution of similar problems in the future. More specifically, we propose two approaches to biasing model building in the hierarchical Bayesian optimization algorithm (hBOA) based on knowledge automatically learned from previous hBOA runs on similar problems. We show that the proposed methods lead to substantial speedups and argue that the methods should work well in other applications that require solving a large number of problems with similar structure.

## 1. Introduction

One of the key goals in genetic and evolutionary computation is the design of competent genetic algorithms (GAs) that can solve hard problems scalably and reliably (Goldberg, 2002). Much work has been done in this regard, resulting in algorithms that can solve broad classes of problems in a robust and scalable manner. Yet this is not the complete story because even if we can solve a problem in low-order polynomial time, the problem may still be intractable. This can happen for many reasons, among them extremely complex fitness functions and an enormously large number of variables to optimize. To solve such challenging problems, practitioners are often forced to use additional efficiency enhancement techniques (Goldberg, 2002; Sastry et al., 2006)—such as parallelization and hybridization—to speed up GAs, and a number of efficiency enhancement techniques have been proposed for this purpose in the past.

Estimation of distribution algorithms (EDAs; Baluja, 1994; Larrañaga and Lozano, 2002; Pelikan, Goldberg et al., 2002; Pelikan, Sastry, and Cantú-Paz, 2006) are among the most powerful and generally applicable genetic and evolutionary algorithms, in which traditional variation operators of genetic algorithms—such as crossover and mutation—are replaced by building a probabilistic model of promising solutions and sampling the built model to generate new candidate solutions. While EDAs have many advantages over standard genetic algorithms (Larrañaga and Lozano, 2002; Pelikan, Sastry, and Cantú-Paz, 2006), one in particular this paper will focus on is that at the end of an EDA run, a series of probabilistic models of our solution space have been built, which hold a great deal of information about the problem. Although such information should be useful for effective efficiency enhancement and it has often been argued that using problem-specific knowledge should significantly improve EDA performance (Schwarz and Ocenasek, 2000; Baluja, 2006), the use of this readily available source of problem-specific information has been practically ignored by the EDA community. This paper takes the first step toward the design of automated techniques for learning from experience and exploiting information included in probabilistic models learned in the past to speed up future EDA runs. This is done by biasing model building using information gathered from probabilistic models obtained on similar problems in previous runs. While the techniques discussed in this paper are designed for the hierarchical Bayesian optimization algorithm (hBOA), the proposed techniques can be adapted to other EDAs based on multivariate probabilistic models in a straightforward manner.

The paper is organized as follows. Section 2 outlines the hBOA. Section 3 outlines efficiency enhancement techniques for genetic and evolutionary algorithms. Section 4 proposes two methods for automatically biasing model building in hBOA using the probabilistic models learned in previous hBOA runs on similar problems. Section 5 presents experimental results. Section 6 discusses related work in both EDAs and machine learning. Finally, Section 7 summarizes and concludes the paper.

## 2. Hierarchical BOA (hBOA)

EDAs (Baluja, 1994; Mühlenbein and Paaß, 1996; Larrañaga and Lozano, 2002; Pelikan, Goldberg et al., 2002)—also called probabilistic model building genetic algorithms (PMBGAs; Pelikan, Goldberg et al., 2002; Pelikan, 2005) and iterated density estimation algorithms (IDEAs; Bosman and Thierens, 2000)—replace standard crossover and mutation operators of genetic and evolutionary algorithms by building a probabilistic model of selected solutions and sampling the built model to generate new candidate solutions. It is beyond the scope of this paper to provide an overview of EDAs and their strengths and limitations; the interested reader should consult (Pelikan, Goldberg et al., 2002; Larrañaga and Lozano, 2002; Pelikan, Sastry, and Goldberg, 2006; Lozano et al., 2006; Mühlenbein and Mahnig, 2002). The hBOA (Pelikan and Goldberg, 2001, 2003b; Pelikan, 2005) is an EDA that uses Bayesian networks to represent the probabilistic model and incorporates restricted tournament replacement (Harik, 1995) for effective diversity maintenance. This section outlines hBOA and briefly discusses Bayesian networks, which are used to guide the exploration of the search space in hBOA.

### 2.1. Basic hBOA Procedure.

HBOA evolves a population of candidate solutions represented by fixed-length strings over a finite alphabet (e.g., binary strings). The initial population is generated at random according to the uniform distribution over the set of all potential solutions. Each iteration (generation) starts by selecting promising solutions from the current population using any standard selection method of genetic and evolutionary algorithms.

After selecting the promising solutions, hBOA builds a Bayesian network (Howard and Matheson, 1981; Pearl, 1988) with local structures as a model for these solutions. New solutions are generated by sampling the built network. The new solutions are then incorporated into the original population.

The next iteration is executed unless some predefined termination criteria are met. For example, the run can be terminated when the maximum number of generations is reached or the entire population consists of copies of the same candidate solution. For more details about the basic hBOA procedure, see Pelikan and Goldberg (2001) or Pelikan (2005).
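The select/model/sample loop above can be sketched with a univariate model standing in for hBOA's Bayesian network. This is a deliberate simplification for illustration (hBOA's actual model building and restricted tournament replacement are more involved); the function names and parameters are ours.

```python
import random

def umda(fitness, n, pop_size=200, max_gens=60, seed=1):
    """Minimal univariate EDA (UMDA) sketching the hBOA-style loop:
    select promising solutions, build a model, sample new candidates.
    In hBOA the model is a Bayesian network with local structures; here a
    vector of independent bit probabilities stands in for it."""
    rng = random.Random(seed)
    # initial population: uniform over all binary strings of length n
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(max_gens):
        # truncation selection: keep the better half of the population
        pop.sort(key=fitness, reverse=True)
        selected = pop[: pop_size // 2]
        # "model building": marginal probability of a 1 at each position
        p = [sum(ind[i] for ind in selected) / len(selected)
             for i in range(n)]
        # "model sampling": generate new candidate solutions
        pop = [[1 if rng.random() < p[i] else 0 for i in range(n)]
               for _ in range(pop_size)]
        cand = max(pop, key=fitness)
        if fitness(cand) > fitness(best):
            best = cand
    return best
```

For example, `umda(sum, 20)` optimizes onemax on 20-bit strings.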

### 2.2. Bayesian Networks.

Bayesian networks (Pearl, 1988; Howard and Matheson, 1981) combine graph theory, probability theory, and statistics to provide a flexible and practical tool for probabilistic modeling and inference. HBOA uses Bayesian networks to model promising solutions and generate new candidate solutions. A Bayesian network consists of two components:

**Structure**. Structure is defined by an acyclic directed graph with one node per variable and the edges corresponding to conditional dependencies between the variables.

**Parameters**. Parameters consist of the conditional probabilities of each variable given the variables that this variable depends on.

In addition to encoding direct conditional dependencies, a Bayesian network may also encode a number of conditional independence assumptions. More specifically, a Bayesian network encodes the assumption that each variable *X*_{i} is conditionally independent of its predecessors in an ancestral ordering of variables given the set Π_{i} of its parents (the variables that *X*_{i} depends on), where the ancestral ordering orders the variables so that for all *i* the variables in Π_{i} precede *X*_{i}.

HBOA uses Bayesian networks with local structures in the form of dependency trees (Chickering et al., 1997; Friedman and Goldszmidt, 1999). That means that the conditional probabilities for each variable are stored in a decision tree, allowing a more efficient representation of conditional dependencies and a more powerful model building procedure. For more details on learning and sampling Bayesian networks with local structures, see Chickering et al. (1997) and Pelikan (2005).
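The conditional independence assumptions above imply the standard Bayesian-network factorization of the encoded distribution, where Π_{i} denotes the parents of *X*_{i}:

```latex
p(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} p\left(X_i \mid \Pi_i\right)
```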

## 3. Efficiency Enhancement Techniques

There are two primary computational bottlenecks in EDAs: (1) fitness evaluation and (2) model building. To alleviate these bottlenecks, numerous efficiency enhancement techniques (Goldberg, 2002; Sastry et al., 2006; Pelikan, 2005) can be incorporated into an EDA, many of which can be adopted from genetic algorithms or other evolutionary algorithms. Efficiency enhancement techniques for EDAs can be divided into the following categories (Pelikan, 2005):

Parallelization (Cantú-Paz, 2000; Ocenasek, 2002; Ocenasek et al., 2006; Mendiburu et al., 2006).

Evaluation relaxation (Smith et al., 1995; Sastry et al., 2001; Pelikan and Sastry, 2004; Sastry and Goldberg, 2004).

Hybridization (Hartmann, 1996; Pelikan and Goldberg, 2003a; Pelikan and Hartmann, 2006).

Time continuation (Goldberg, 1999, 2002; Srivastava and Goldberg, 2001).

Sporadic and incremental model building (Ocenasek and Schwarz, 2000; Pelikan, 2005; Pelikan, Sastry, and Goldberg, 2006; Etxeberria and Larrañaga, 1999).

Incorporating problem-specific knowledge and learning from experience (Schwarz and Ocenasek, 2000; Sastry, 2001a; Pelikan and Goldberg, 2003a; Mühlenbein and Mahnig, 2002; Baluja, 2006).

For a more detailed discussion of efficiency enhancement techniques in EDAs, please see Sastry et al. (2006).

## 4. Biasing the Model Building

One advantage of EDAs over more conventional evolutionary algorithms is that they construct a fairly global model of the problem landscape and areas worthy of further exploration. The models obtained by EDAs can be analyzed to learn more about the problem being solved and to improve EDA performance on similar problems. One way to exploit information from previous EDA models is to bias the model building procedure in some way as is discussed in this section. More specifically, this section describes the following two approaches to biasing the model building based on probabilistic models obtained in previous runs on similar problems.

Bias model building using the probability coincidence matrix.

Bias model building using the distance threshold.

Both proposed techniques start with a number of sequences of probabilistic models built by hBOA on one or more instances of the considered problem class. The probabilistic models are then processed to derive a bias for model building in hBOA, which can be used to improve hBOA performance on future instances of the same problem type.

To provide examples of the application of these two approaches, we use three optimization problems: (1) Concatenated traps of order 5, (2) 2D Ising spin glass with ±*J* couplings and periodic boundary conditions, and (3) MAXSAT.

The section starts by describing the test problems. We then describe how to bias model building using the probability coincidence matrix. Next, we focus on biasing the model building using a distance threshold. Finally, the benefits of restricting model structure in hBOA, as well as some pitfalls if the method is chosen incorrectly, are discussed.

### 4.1. Test Problems.

This section describes the test problems considered in this paper.

#### 4.1.1. Trap-5: Concatenated 5-Bit Trap.

Trap-5 is composed of concatenated 5-bit trap functions. The input string is partitioned into disjoint groups of 5 bits each, and each group contributes to the fitness through a trap function of *u*, where *u* is the number of 1s in the group: the trap returns 5 if *u* = 5 and 4 − *u* otherwise. The task is to maximize the sum of the contributions of all groups. An *n*-bit trap-5 function has one global optimum in the string of all 1s and (2^{n/5} − 1) other local optima. Traps of order 5 necessitate that all bits in each group are treated together, because statistics of lower order are misleading. Since hBOA performance is invariant with respect to the ordering of string positions (Pelikan, 2005), it does not matter how the partitioning into 5-bit groups is done, and thus, to make some of the results easier to understand, we assume that trap partitions are located in contiguous blocks of bits.
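As a concrete sketch, the concatenated trap-5 function over contiguous 5-bit blocks can be written as follows (the helper names are ours):

```python
def trap5(u):
    """5-bit trap: deceptive everywhere except at the optimum u = 5."""
    return 5 if u == 5 else 4 - u

def trap5_fitness(bits):
    """Concatenated trap-5 over contiguous 5-bit blocks (the partitioning
    assumed in the text). The string length must be a multiple of 5."""
    assert len(bits) % 5 == 0
    return sum(trap5(sum(bits[i:i + 5])) for i in range(0, len(bits), 5))
```

Note the deception: the all-zeros string scores 4 per block, just below the optimum's 5, while strings one bit away from the optimum score 0.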

#### 4.1.2. 2D Ising Spin Glass with ±*J* Couplings and Periodic Boundary Conditions.

Ising spin glasses are prototypical models for disordered systems and have played a central role in statistical physics during the last three decades (Binder and Young, 1986; Mezard et al., 1987; Fischer and Hertz, 1991; Young, 1998). A simple model to describe a finite-dimensional Ising spin glass arranges the spins on a regular 2D or 3D grid, where each node *i* corresponds to a spin *s*_{i} and each edge 〈*i*, *j*〉 corresponds to a coupling between two spins *s*_{i} and *s*_{j}. Each edge has a real value *J*_{i,j} associated with it that defines the relationship between the two connected spins. In this paper, the spins are arranged on a 2D grid, with each spin connected to its four neighbors. To approximate the behavior of large-scale systems, periodic boundary conditions are used that introduce a coupling between the first and the last elements in each row and column.

Here the task is to find a spin configuration for a given set of coupling constants that minimizes the energy of the spin glass. The states with minimum energy are called *ground states*. The spin configurations are encoded with binary strings where each bit specifies the value of one spin (0 for a spin +1, 1 for a spin –1).
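A sketch of the energy computation on the periodic 2D lattice follows. The sign convention E = −Σ_{〈i,j〉} *J*_{i,j} *s*_{i} *s*_{j} and the data layout are our assumptions, since the formula is not reproduced in this section:

```python
def spin_glass_energy(spins, J):
    """Energy of a 2D Ising spin glass with periodic boundary conditions.
    spins: L x L list of +1/-1 values; J: dict mapping ((r1, c1), (r2, c2))
    to the coupling J_ij, with one entry per right and down neighbor pair.
    The sign convention E = -sum J_ij * s_i * s_j is an assumption."""
    L = len(spins)
    e = 0.0
    for r in range(L):
        for c in range(L):
            for dr, dc in ((0, 1), (1, 0)):  # right and down neighbors
                r2, c2 = (r + dr) % L, (c + dc) % L  # periodic wrap-around
                e -= J[((r, c), (r2, c2))] * spins[r][c] * spins[r2][c2]
    return e
```

Minimizing this energy over the 2^{L·L} spin configurations is the optimization task hBOA faces.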

All instances of sizes up to 18 × 18 with ground states were obtained from S. Sabhapandit and S. N. Coppersmith from the University of Wisconsin who identified the ground states using flat-histogram Markov chain Monte Carlo simulations (Dayal et al., 2004). The ground states of the remaining instances were obtained from the Spin Glass Ground State Server at the University of Cologne (Spin Glass Ground State Server, 2004).

#### 4.1.3. MAXSAT.

In MAXSAT the task is to find the maximum number of clauses which can be satisfied in a given propositional logic formula in conjunctive normal form. MAXSAT is an important problem in complexity theory and artificial intelligence because it is NP-complete and many other important problems can be mapped to MAXSAT in a straightforward manner. MAXSAT has also become popular in evolutionary computation and a number of researchers studied performance of various genetic and evolutionary algorithms on this class of problems (Rana and Whitley, 1998; Gottlieb et al., 2002; Pelikan and Goldberg, 2003a; Boughaci and Drias, 2004; Brownlee et al., 2007).

MAXSAT is often restricted to formulas in which each clause contains *k* literals; formulae in this form are called *k*-CNF formulae. A CNF formula is a logical AND of clauses, where each clause is a logical OR of literals. Each literal can either be a proposition or a negation of a proposition. An example of a 3-CNF formula with four propositions *X*_{1}, *X*_{2}, *X*_{3}, *X*_{4} is given in Equation (4). An interpretation of propositions assigns each proposition to be either true or false. The task in MAXSAT is to find an interpretation that maximizes the number of satisfied clauses in the formula. In the example from Equation (4), the assignment setting all propositions to true would satisfy all the clauses in the formula and therefore it would be one of the global optima of the corresponding MAXSAT problem. HBOA encodes the interpretations with binary strings where each bit specifies the assignment of one proposition (0 for false, 1 for true). Of note is that MAXSAT is NP-complete for *k*-CNF if *k* ≥ 2.
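Counting satisfied clauses for a given interpretation can be sketched as follows; the DIMACS-style clause representation (signed, 1-indexed integers) is our assumption:

```python
def count_satisfied(clauses, assignment):
    """Number of satisfied clauses in a CNF formula.
    clauses: list of clauses, each a list of nonzero ints where +i means
    proposition X_i and -i means its negation (DIMACS-style, 1-indexed).
    assignment: list of 0/1 values; assignment[i-1] is the value of X_i."""
    def lit_true(lit):
        value = assignment[abs(lit) - 1]
        return value == 1 if lit > 0 else value == 0
    # a clause is satisfied if any one of its literals evaluates to true
    return sum(any(lit_true(l) for l in clause) for clause in clauses)
```

An interpretation maximizing this count is a MAXSAT optimum; for a satisfiable formula the maximum equals the number of clauses.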

In this paper we will consider instances of combined-graph coloring translated into MAXSAT (Gent et al., 1999), which are especially interesting because they are often difficult for standard MAXSAT heuristics such as WalkSAT (Pelikan, 2005) and because they represent a class of problems where randomness is combined with structure (Gent et al., 1999).

The graph-coloring problem instances were created by combining regular ring lattices and random graphs with a specified number of neighbors (Gent et al., 1999). Combining two graphs consists of selecting (1) all edges that overlap in the two graphs, (2) a random fraction (1 – *p*) of the remaining edges from the first graph, and (3) a random fraction *p* of the remaining edges from the second graph. By combining regular graphs in this way, the amount of structure and randomness in the resulting graph can be controlled, since as *p* decreases the graphs become more regular (for *p*= 0, we are left with a regular ring lattice). This allows us to test methods against MAXSAT instances with varying amounts of structure. The ring lattice is constructed by ordering the vertices cyclically and connecting each vertex to the eight closest neighbors. All tested instances of combined-graph coloring were downloaded from the satisfiability library SATLIB (Hoos and Stutzle, 2000) and more details on these instances can be found elsewhere (Gent et al., 1999).
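The graph-combination step above can be sketched as follows. Selecting each non-shared edge independently with probability (1 − *p*) or *p*, rather than an exact fraction, is our simplification:

```python
import random

def combine_graphs(edges_a, edges_b, p, seed=0):
    """Combine two graphs on the same vertex set: keep all edges shared by
    both graphs, a fraction (1 - p) of the remaining edges of the first
    graph, and a fraction p of the remaining edges of the second graph."""
    rng = random.Random(seed)
    a, b = set(edges_a), set(edges_b)
    combined = a & b  # edges that overlap in the two graphs are always kept
    combined |= {e for e in a - b if rng.random() < 1 - p}
    combined |= {e for e in b - a if rng.random() < p}
    return combined
```

As in the text, *p* = 0 recovers the first (regular) graph and *p* = 1 recovers the second (random) graph; intermediate *p* interpolates between structure and randomness.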

### 4.2. Biasing Models Using the Probability Coincidence Matrix.

For this method we first compute what we will call a probability coincidence matrix (PCM) of size *n* × *n*, where *n* is the string length. We denote the matrix by *P* and its elements by *P*_{ij}, where *i*, *j* ∈ {1, …, *n*}. The value of *P*_{ij} is defined as the proportion of probabilistic models in which the *i*th and *j*th string positions are connected (in either of the two possible directions), with the elements of *P* computed by parsing all available probabilistic models.
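The PCM computation can be sketched as follows, assuming each past model is summarized by its set of edges between distinct (0-indexed) string positions:

```python
def probability_coincidence_matrix(models, n):
    """Compute the PCM from a collection of learned network structures.
    models: list of edge sets, each containing (i, j) pairs (0-indexed,
    i != j) for one probabilistic model from a past hBOA run.
    Returns P where P[i][j] is the fraction of models connecting i and j."""
    P = [[0.0] * n for _ in range(n)]
    for edges in models:
        # collapse edge direction: (i, j) and (j, i) count as one edge
        for e in {frozenset(edge) for edge in edges}:
            i, j = tuple(e)
            P[i][j] += 1
            P[j][i] += 1
    m = len(models)
    for i in range(n):
        for j in range(n):
            P[i][j] /= m
    return P
```

The matrix is symmetric by construction, and each entry is directly interpretable as the percentage of past models containing that edge.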

While we do not consider the time *t* at which each model was discovered, for some problems it may be necessary to incorporate time as well and compute a sequence of matrices *P*^{t}_{ij}, where *t* denotes the iteration (generation) at which the probabilistic model was built. This may be beneficial if the models are expected to change significantly over time; otherwise, it should be advantageous to use all available models so that we end up with a more accurate estimate of the structures seen in past hBOA runs. We may also consider subsets of the available models or weigh models based on various criteria.

Of note is that runs of longer length (with respect to the number of iterations of the main hBOA loop) will be represented more strongly in the PCM than runs of shorter length. While we could have implemented a method that weighed all runs equally, there is no indication that this would have been beneficial. Not all runs are equal, and some similar problems are harder than others, so it could be beneficial to have certain runs more strongly represented. Additionally, with this method, *P*_{ij} represents the actual percentage of models that connected *i* and *j*, whereas with other methods the meaning of *P*_{ij} would be more abstract. To illustrate the method we use to generate the PCM, Figure 1 shows three example coincidence graphs for a 4-bit problem and the resulting PCM generated by our method.

To gain a better understanding of the information provided by the PCM, let us look at a simple example. We computed the PCM from probabilistic models obtained by hBOA for trap-5 of *n* = 50 bits. The trap-5 problem has adjacent nonoverlapping blocks of epistatic genes, and this should be clear from the obtained PCM. The resulting PCM is visualized in Figure 2(a). We can clearly see that the majority of the models contain an edge between any pair of positions in the same trap partition, whereas the percentages of edges between pairs of positions that are not located in the same trap partition are extremely small. Of note is that unlike our experiments on MAXSAT and 2D Ising spin glasses (see Section 5), deterministic hill-climbing (DHC) was not used with this example. However, even on trap-5, DHC is known to substantially improve hBOA performance and make models much more accurate (Radetic et al., 2009), providing even better inputs for speeding up future hBOA runs. To visualize the effects of DHC on the PCM built for trap-5, Figure 2(b) shows the PCM computed from 30 independent bisection runs of trap-5 using DHC. The figure shows that when using DHC on trap-5, the vast majority of dependencies are simple chain dependencies among trap partitions. This, however, does not mean that the models are inaccurate; to the contrary, on trap-5, hBOA with DHC can find a correct model with much smaller populations, and chain dependencies suffice to correctly capture any trap partition (Radetic et al., 2009).

Once we have computed the PCM, probably the simplest approach to using this matrix for biasing model building in future problems of similar type is to only allow edges between nodes that are connected in at least some percentage *p*_{min} of the provided sample models. For example, we can restrict the models by allowing only edges that appear in at least 40% of the sample models, disallowing a direct conditional dependence between any *i* and *j* for which *P*_{ij} < *p*_{min} = 0.4. At the algorithmic level, when hBOA is examining whether to add an edge between a pair of nodes, if *P*_{ij} < *p*_{min} then that edge is completely ignored and the possible gain in network score from adding such an edge is not calculated. If our PCM were the one in Figure 1, we can see that for *p*_{min} = 0.4 we would only allow edges between positions 0 and 1 and between positions 2 and 3.
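The resulting restriction amounts to a simple filter over candidate edges (the helper name is ours):

```python
def allowed_edges(P, p_min):
    """Edges permitted during model building under the PCM bias: only
    pairs of positions connected in at least a fraction p_min of the
    past models. P is the (symmetric) PCM; positions are 0-indexed."""
    n = len(P)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if P[i][j] >= p_min}
```

During model building, edges outside this set are skipped entirely, so their score gains are never computed.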

Revisiting the example PCM for trap-5 of 50 bits (see Figure 2(a)), it is easy to note that for all but extremely small values of *p*_{min}, the probabilistic models would be restricted to contain only the edges between pairs of positions located in the same trap partition. As we will explain later on in this section, using the PCM to restrict the probabilistic model would only work with trap-5 if the partitions were located in the same sets of positions in all cases. For example, if the PCM was built from trap-5 instances where the partitions were adjacent and then this PCM was used to bias model building when solving trap-5 instances where the partitions were interleaved, we would expect poor performance.

Another example of a PCM is shown in Figure 2(c), which shows a PCM for 100 random instances of the 10 × 10 Ising spin glass. For each instance, we first used the bisection method (Sastry, 2001b; Pelikan, 2005) to ensure that hBOA finds a true ground state in five out of five independent runs. Then, models from all 500 successful runs (five for each of the 100 instances) were processed to provide the PCM. The resulting PCM for the 10 × 10 spin glass indicates that dependencies are most frequent between connected spins and the percentage of dependencies decreases with spin distance, as is expected based on previous analyses of probabilistic models for 2D spin glasses (Hauschild et al., 2007). Unlike in trap-5, the influence of *p*_{min} on the resulting model restriction is much stronger. Nonetheless, even for small *p*_{min}, the number of possible edges shrinks dramatically. Experimental results with various values of *p*_{min} are shown in Section 5.

The main advantage of using the PCM to restrict model structures over user-defined model restrictions based on prior problem-specific knowledge is that the approach proposed here can be applied automatically, and the only parameter that needs to be specified by the user is the threshold *p*_{min}. However, the PCM method only gathers valid information when the level of interaction between two specific bit positions in one instance corresponds directly to the level of interaction between the same bit positions in another instance. To help illustrate this, Figure 2(d) shows a PCM gathered from models generated when hBOA was used to solve 100 instances of the combined-graph coloring problem translated into MAXSAT with *p* = 2^{−8}. The only visible structure consists of the edges that correspond to the regular ring lattice, with all other values being extremely small. For MAXSAT problems in the general case, there is no guarantee that the relationship between two specific propositions in an instance will be in any way similar to their relationship in another instance. For this reason the PCM method is not used when solving the MAXSAT problem instances in this paper. This suggests the need for another approach to encoding and using information from prior hBOA models. One such approach is discussed in the next section.

### 4.3. Biasing Models Using Distance.

As mentioned previously, the PCM-based approach presented above has one main limitation which restricts its utility in practice. Specifically, it is only applicable when the structure of the underlying problem does not change much between different problem instances. While this assumption is clearly satisfied in trap-5 (with fixed trap partitions), 2D Ising spin glasses, and many other problems, it does not hold in other important classes of problems. Nonetheless, both in the finite-dimensional Ising spin glasses as well as in other important classes of challenging problems—for example in MAXSAT and the minimum vertex cover—we may define a distance metric between different decision variables in the problem which should loosely correspond to the strength of the dependence between these variables. For example, in 2D spin glasses, we know that the shorter the shortest path between two spins is, the more frequent the dependencies between these spins are (Hauschild et al., 2007); therefore, when studying models obtained by hBOA, it should be beneficial to use the distance metric based on the shortest distance between two spins in the 2D lattice. In MAXSAT, the distance between two Boolean variables may be defined as 1 for any two variables located in the same clause; for any other pair of variables, the distance may be defined as the length of the shortest path between them, where each step of the path moves between two variables that share a clause. Similarly as for the Ising spin glass, variables located in the same clause can be expected to be related much more frequently than other variables. Distance metrics can also be defined for pairs of variables in other important classes of problems across a variety of disciplines, including the quadratic assignment problem, facility location problems, scheduling, different types of graph problems, atomic cluster optimization, and many others.

Given the distance metric, which defines distance between any pair of string positions, we can process the given probabilistic models with respect to the distance. In this paper, we deal with problems for which the distances range from 1 to some maximum value with step 1. Of course, sometimes the distance metric is defined in the continuous space; in that case, the range of possible distances can be typically discretized in a straightforward manner in order to provide a finite set of possible distance categories, allowing us to apply the same technique even to such problems.

The result of such an analysis may be encoded by a vector **Q** that stores the proportion **Q**_{d} of dependencies at a specific distance *d* or a shorter one when averaging over all available models. Of course, other statistics of similar type may be computed as well. It is easy to see that the values in **Q** may never decrease with distance and the entry corresponding to the maximum achievable distance is equal to 1. Table 1 shows an example **Q** vector derived from data.

Distance | Dep ≤ Dist | Q
---|---|---
1 | 500 | 0.5
2 | 700 | 0.7
3 | 900 | 0.9
4 | 1000 | 1.0

One way to use the resulting vector **Q** for biasing probabilistic models is to use **Q** to provide an upper bound for the distance of variables that are allowed to be connected. Specifically, we may specify a threshold *q*_{max} and disallow edges between variables at distances for which the *q* value is greater than this threshold. For example, if *q*_{max} = 0.7, then only dependencies spanning distances at or below those distances that occurred in 70% of all total dependencies found in the sample models will be allowed. Referring back to Table 1, we see that setting *q*_{max} = 0.7 would also be equivalent to restricting our model building to disallow any dependency that spanned a distance greater than 2 in our distance metric. Although setting *q*_{max} to a particular value is equivalent to providing an upper bound on the distance, using *q*_{max} can be expected to be more robust in practice. This is mainly because it may not be clear what distance restrictions mean with respect to the particular problem, yet one may typically expect that models covering a large proportion of dependencies will have approximately the same effect as the unrestricted ones.
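Computing **Q** from per-distance dependency counts and converting *q*_{max} into a distance cutoff can be sketched as follows (the helper names are ours):

```python
def q_vector(distance_counts):
    """Cumulative proportion Q_d of dependencies at distance <= d.
    distance_counts: dict mapping distance d (1, 2, ...) to the number of
    dependencies spanning exactly that distance, over all past models."""
    total = sum(distance_counts.values())
    q, running = {}, 0
    for d in sorted(distance_counts):
        running += distance_counts[d]
        q[d] = running / total
    return q

def max_allowed_distance(q, q_max):
    """Largest distance whose cumulative proportion does not exceed q_max;
    edges spanning longer distances are disallowed during model building."""
    return max(d for d, v in q.items() if v <= q_max)
```

With the counts from Table 1 (500, 200, 200, and 100 dependencies at distances 1 through 4), setting *q*_{max} = 0.7 yields a cutoff of 2, matching the example in the text.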

To visualize the relationship between the distance and the number of dependencies of this distance, Figure 3 shows histograms of the number of dependencies of a given distance in five hBOA runs on one instance of the 20 × 20 spin glass and one instance of the 28 × 28 spin glass. As we can see, most of the dependencies found correspond to shorter distances (Hauschild et al., 2007). However, to use the aforementioned approach to model restriction, we must still decide on the threshold *q*_{max}. Typically, we would expect the maximum distance of dependencies to grow with problem size; this would most likely result in a decreasing value of *q*_{max} with problem size.

### 4.4. Why and When Are Model Restrictions Beneficial?.

There are two main benefits of restricting model structure in hBOA and other multivariate EDAs. First of all, if we restrict model structures by disallowing some dependencies, the model building becomes significantly faster because the model building algorithm has to examine many fewer dependencies. This can lead to substantial reduction in time complexity of model building, as is supported by experimental results in Section 5. Second, if we restrict the probabilistic models by disallowing edges which are indeed unnecessary, then the search may become more effective because simpler yet more accurate models are discovered and the mixing effects of model sampling are maximized.

Using the correct bias should lead to an improvement of EDA performance. However, the practitioner must ensure that a valid method is applied. While both discussed methods work by attempting to represent common strengths of interaction between problem variables shared by similar problem instances, the statistics gathered are quite different. Care must be taken that the appropriate method is used for a particular problem type. The PCM method should only be used when the absolute position of the problem variables determines the relative strength of interactions between different problem instances. This holds for problems such as Ising spin glasses and deceptive trap problems. On the other hand, using the PCM on MAXSAT or the graph bisection problem would result in little useful information, as the absolute position of problem variables does not determine the relative strength of interaction between them. Instead, for those problems where it is possible to define a distance metric that can capture the relative strength between dependencies, the distance bias method should be used.

The following section provides a thorough empirical analysis of the above two approaches to automatically restricting model structure on 2D Ising spin glasses and MAXSAT.

## 5. Experiments

This section covers the experiments using the two proposed approaches to biasing hBOA model building on 2D Ising spin glasses and MAXSAT. Since in MAXSAT, problem structure of different problem instances may vary substantially, for this class of problems we only consider the approach based on the distance metric. First the parameter settings and the experimental setup are discussed. The results are then presented.

In many of the following results, we use the term speedup. We formally define speedup as the ratio of the original execution time and the execution time after incorporating model restrictions. For example, if the original execution time was 6 s and the execution time after applying model restrictions was 3 s, the speedup would be 2; in other words hBOA was able to solve the problem twice as fast due to the model restrictions.

### 5.1. Experimental Setup.

To select promising candidate solutions we use truncation selection with threshold τ = 50%, which selects the best half of the current population based on the objective function.
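A minimal sketch of truncation selection with τ = 50% follows; the function name is ours, and maximization of the objective function is assumed:

```python
def truncation_selection(population, fitness, tau=0.5):
    """Keep the best tau fraction of the population (maximization):
    rank all candidates by fitness and return the top tau portion."""
    ranked = sorted(population, key=fitness, reverse=True)
    return ranked[: max(1, int(tau * len(population)))]
```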

To incorporate candidate solutions back into the population we use restricted tournament replacement (RTR; Harik, 1995) which ensures effective diversity maintenance. RTR with window size *w*>1 incorporates each new candidate solution *X* into the original population using the following three steps: (1) Randomly select a subset *W* of *w* candidate solutions from the original population. (2) Let *Y* be a solution from *W* that is most similar to *X* (based on genotypic distance). (3) Replace *Y* with *X* if *X* is better; otherwise, discard *X*.
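The three steps above can be sketched as follows. This is an illustrative sketch rather than the hBOA implementation, with Hamming distance standing in for the genotypic distance on binary strings:

```python
import random

def rtr_insert(population, fitness, x, w):
    """Insert candidate x by restricted tournament replacement:
    (1) pick a random window W of w members, (2) find the member of W
    most similar to x (Hamming distance), (3) replace it with x if x
    is better; otherwise discard x."""
    def hamming(a, b):
        return sum(ai != bi for ai, bi in zip(a, b))
    window = random.sample(range(len(population)), w)
    closest = min(window, key=lambda i: hamming(population[i], x))
    if fitness(x) > fitness(population[closest]):
        population[closest] = x
```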

To improve the performance of hBOA, we incorporate a deterministic hill climber (DHC; see Section 3) to improve the quality of each evaluated solution. DHC takes a candidate solution represented by a binary string and repeatedly performs the single-bit flip that leads to the greatest improvement in fitness, until no single-bit flip improves the solution further. The local searcher is applied to every solution before it is evaluated.
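A minimal sketch of such a steepest-ascent hill climber on binary strings (the naming is ours):

```python
def dhc(solution, fitness):
    """Steepest-ascent hill climber: repeatedly apply the single-bit
    flip that yields the greatest fitness improvement, stopping when
    no flip improves the solution."""
    solution = list(solution)
    while True:
        base = fitness(solution)
        best_gain, best_i = 0, None
        for i in range(len(solution)):
            solution[i] ^= 1                 # try flipping bit i
            gain = fitness(solution) - base
            solution[i] ^= 1                 # undo the flip
            if gain > best_gain:
                best_gain, best_i = gain, i
        if best_i is None:
            return solution                  # local optimum reached
        solution[best_i] ^= 1                # keep the best flip
```

On onemax (fitness = number of ones), `dhc([0, 1, 0], sum)` climbs to `[1, 1, 1]`.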

### 5.2. Parameter Settings.

To build Bayesian networks, the greedy algorithm was used with the Bayesian-Dirichlet scoring metric for Bayesian networks with local structures (Cooper and Herskovits, 1992; Chickering et al., 1997) with an additional penalty for model complexity (Friedman and Goldszmidt, 1999; Pelikan, 2005). In RTR, the window size was set as *w* = min{*n*, *N*/20} (Pelikan, 2005).

For all problem instances, bisection (Sastry, 2001b; Pelikan, 2005) was used to determine the minimum population size to ensure convergence to the global optimum in five out of five independent runs, with the results averaged over the five runs. The number of generations was upper bounded according to preliminary experiments and hBOA scalability theory (Pelikan, Sastry et al., 2002) by *n*/4 where *n* is the number of bits in the problem. Each run of hBOA is terminated when the global optimum has been found (success) or when the upper bound on the number of generations has been reached without discovering the global optimum (failure).
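The bisection approach to population sizing can be sketched roughly as follows. The doubling phase, the 10% stopping tolerance, and the function names are our assumptions, not the exact procedure of Sastry (2001b):

```python
def bisection_population_size(succeeds):
    """Roughly locate the minimum population size N for which
    succeeds(N) holds (e.g. five out of five runs find the optimum):
    double N until success, then bisect to within about 10%."""
    lo, hi = 2, 4
    while not succeeds(hi):            # doubling phase
        lo, hi = hi, 2 * hi
    while hi - lo > max(1, lo // 10):  # bisection phase
        mid = (lo + hi) // 2
        if succeeds(mid):
            hi = mid
        else:
            lo = mid
    return hi
```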

### 5.3. Experiments with PCM Model Bias on Ising Spin Glass.

While building a PCM is relatively straightforward, determining a threshold *p*_{min} for cutting off edges is not. If the threshold is too small, models will not be restricted much and the benefits of using PCM to restrict model structures may be negligible. If the threshold is too large, the restrictions may be too severe and hBOA performance may suffer by using inadequate models.
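For illustration, assuming the PCM is stored as a matrix of edge frequencies observed in models from previous runs (a hypothetical representation; the names are ours), cutting off edges below *p*_{min} amounts to:

```python
def allowed_edges(pcm, p_min):
    """Return the set of directed edges (i, j) whose frequency in the
    probabilistic coincidence matrix meets the cutoff p_min; all other
    dependencies would be disallowed during model building."""
    n = len(pcm)
    return {(i, j)
            for i in range(n) for j in range(n)
            if i != j and pcm[i][j] >= p_min}
```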

To get a better idea of how *p*_{min} influences the benefits of restricting models with PCM, we considered a range of problem sizes from 16 × 16 (256 spins) to 32 × 32 (1,024 spins). For each problem size, we used 100 random instances. In order not to use the same problem instances both to learn the PCM and to validate the resulting bias on model structure, we used 10-fold cross-validation. For each problem size, the 100 instances were divided into 10 equally sized subsets and in each step of the cross-validation, we used nine of these 10 subsets to learn the PCM and tested the resulting bias on model building on the remaining subset of instances. This was repeated 10 times, always leaving one subset of instances for validation. In this manner, the validation was done on different instances than those used for learning the PCM, and each subset of instances was used for validation exactly once.
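The cross-validation splits can be sketched as follows (illustrative only; the shuffling and naming are our assumptions):

```python
import random

def ten_fold_splits(instances, seed=0):
    """Yield (learning, validation) splits for 10-fold cross-validation:
    each fold serves once as the validation set while the other nine
    folds are used to learn the PCM."""
    items = list(instances)
    random.Random(seed).shuffle(items)
    k = len(items) // 10
    folds = [items[i * k:(i + 1) * k] for i in range(10)]
    for i in range(10):
        learning = [x for j, fold in enumerate(folds) if j != i
                    for x in fold]
        yield learning, folds[i]
```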

Figure 4 shows the average execution time speedup obtained from the 10-fold cross-validation for four different problem sizes with varying threshold *p*_{min}. Note that the execution time is not limited to the time spent in model building; it includes the overall time required for the entire execution of hBOA. The threshold for cutting off model dependencies was varied from *p*_{min} = 0 (no restrictions) to some maximum value, where the maximum value was set to ensure that no instances in the validation set became infeasible within reasonable computational time (which would indicate too severe model restrictions). First of all, we see that for all problem sizes, execution-time speedups of about 4–4.5 were obtained, which is substantial. It can also be seen that the optimal value of *p*_{min} decreases with problem size. Nonetheless, even when the value of *p*_{min} is twice as large as the optimal one, the speedups remain substantial. As problem size increases, the optimal speedup stays nearly constant (in the range of 4–4.5). Note that while these graphs show that all levels of restriction displayed led to overall speedup, this is not the case for all possible restrictions. A sufficiently large *p*_{min} can overly restrict the hBOA model and slow down the overall execution time due to an increase in the number of generations or fitness evaluations, or even make the search for the optimum intractable.

But exactly how are our restrictions affecting model building efficiency? To quantify the effects of model building restrictions on the complexity of model building, we record the number of bits that must be checked to update model parameters during the entire model building procedure, as this is the primary factor affecting the time complexity of model building in hBOA. The fewer bits we must examine, the faster the model building should be. If we restrict the set of potential edges ending in any particular node, the number of bits examined when evaluating candidate edges into that node is reduced correspondingly. For Bayesian networks with local structures, limiting the model structure leads to a decrease in the number of potential splits on decision variables in the network.

Figure 5 shows the average factor of decrease in the number of bits examined during the model building with various values of the PCM cutoff threshold *p*_{min}. The results show that model restrictions based on PCM dramatically decrease the number of bits that must be examined. One thing of note is that even after we reach the point of maximum execution time speedup, the number of bits examined further decreases with increasing *p*_{min}. This indicates that after the optimal cutoff, other factors start to weigh more heavily on the execution time. For example, as the model-building bias becomes too restrictive, convergence can slow down, the number of fitness evaluations can increase, or the bias could require a larger population.

Table 2 shows the best speedups obtained, the percentage of total possible dependencies considered by hBOA, and the corresponding cutoff values for all problem sizes examined. The results show nearly the same maximum speedup of about 4–4.5 for all problem sizes. The results also indicate that as the problem size increases, the cutoff values must be slightly decreased to achieve optimal speedup. We believe that the reason for this is that for larger problems, the total number of dependencies increases, and to ensure that a sufficient fraction of dependencies is considered, the cutoff threshold must decrease. We also see that even as the cutoff threshold increases, hBOA only needs to consider a small percentage of the total dependencies. In fact this percentage is remarkably similar for the different problem sizes.

| Size | Execution time speedup | *p*_{min} | Percent of total dependencies |
|---|---|---|---|
| 256 (16 × 16) | 3.89 | 0.020 | 6.4% |
| 324 (18 × 18) | 4.37 | 0.011 | 8.7% |
| 400 (20 × 20) | 4.34 | 0.020 | 7.0% |
| 484 (22 × 22) | 4.61 | 0.010 | 6.3% |
| 576 (24 × 24) | 4.63 | 0.013 | 4.6% |
| 676 (26 × 26) | 4.62 | 0.011 | 4.7% |
| 784 (28 × 28) | 4.45 | 0.009 | 5.4% |
| 900 (30 × 30) | 4.93 | 0.005 | 8.1% |
| 1024 (32 × 32) | 4.14 | 0.007 | 5.5% |


### 5.4. Experiments with Distance-Based Bias on 2D Ising Spin Glass.

This section tests the distance-based bias approach on the 2D Ising spin glass with the distance metric defined as the minimum number of couplings one must pass to get from one spin to the other.
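Under the assumption that the spins form a two-dimensional lattice with periodic boundary conditions (common for these benchmark instances, though not stated here), this metric reduces to a wrapped Manhattan distance; the following sketch uses our own naming:

```python
def spin_distance(i, j, side):
    """Minimum number of couplings between spins i and j on a
    side-by-side toroidal grid: Manhattan distance with wrap-around
    in each dimension (spins indexed row by row)."""
    xi, yi = i % side, i // side
    xj, yj = j % side, j // side
    dx = min(abs(xi - xj), side - abs(xi - xj))
    dy = min(abs(yi - yj), side - abs(yi - yj))
    return dx + dy
```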

While it is clear that the probabilistic models discovered by hBOA contain mostly dependencies at shorter distances (see Figure 3), setting an appropriate threshold for the maximum distance of dependencies remains a challenge. If the distances are restricted too severely, the bias on the model building may be too strong to allow for sufficiently complex models; this was also supported by the results in Hauschild et al. (2007). On the other hand, if the distances are not restricted sufficiently, the benefits of this approach may be negligible.

To explore this trade-off, we considered spin glass instances of sizes 16 × 16 to 28 × 28, with 100 random instances for each problem size. Then, for each instance, we ran a series of experiments with dependencies restricted by the maximum distance, which was varied from 1 to half the maximum achievable distance (for example, for 20 × 20 spin glasses, we ran experiments only allowing dependencies between spins of a maximum distance from 1 to 10). For the larger problems, some of the instances would not converge even for extremely large population sizes (*N* = 512,000) when the restrictions on model structure were too strong; the results in these cases are omitted.

Figure 6 shows the execution time speedup with model complexity restricted by the maximum distance. The horizontal axis is the ratio of the number of dependencies in the original runs that matched that restriction and the overall number of dependencies. The maximum distance allowed is shown as a label for each of the points in the graph. For example, in Figure 6(a) we see that in the original runs of spin glasses of size 16 × 16, about 50% of the dependencies were neighbor dependencies and more than 80% were dependencies of distance 7 or less. The results show that restricting models by maximum distance results in significant speedups of about 4.3–5.2; in fact, the optimal speedups obtained are better than those obtained with the PCM-based model bias. In agreement with previous results (Hauschild et al., 2007), we also see that as the problem size increases, dependencies at larger distance should be allowed for maximum speedup. Nonetheless, the speedups obtained seem to be again nearly independent of problem size, just like for the results obtained with PCM.

Just as we did for the PCM-based approach, we also examined the effects of the distance-based model restriction on the number of bits that had to be examined during the entire model building procedure. Figure 7 shows the factor by which this quantity decreases with various thresholds on the maximum distance. The distance restrictions are labeled on the graph with arrows. The results show a dramatic drop in the number of bits that must be examined with small values of the maximum distance, by as much as a factor of 30. They also show that this maximum decrease is maintained as problem size increases.

Table 3 shows the best speedups, the corresponding maximum distance threshold and *q*_{min}, and the percentage of total possible dependencies that were considered by hBOA for all problem sizes. We can see that hBOA considers only a small percentage of the possible dependencies for these cutoffs. We can also clearly see that as problem size increases, the maximum speedup stays nearly the same, indicating that the speedups are maintained even as we scale to larger problems.

| Size | Execution time speedup | Max distance allowed | *q*_{min} | Percent of total dependencies |
|---|---|---|---|---|
| 256 (16 × 16) | 4.2901 | 2 | 0.62 | 4.7% |
| 400 (20 × 20) | 4.9288 | 3 | 0.64 | 6.0% |
| 576 (24 × 24) | 5.2156 | 3 | 0.60 | 4.1% |
| 784 (28 × 28) | 4.9007 | 5 | 0.63 | 7.6% |


A comparison of Tables 2 and 3 reveals that while the speedups in the two cases are similar, the speedups obtained with the distance-based model restriction are slightly better than those obtained with the PCM-based approach. We also see the same pattern of hBOA considering similar percentages of total dependencies using each of the methods.

### 5.5. Experiments with Distance-Based Bias on MAXSAT.

In Section 5.4 we saw that restricting model structure by distance leads to significant speedups of hBOA on 2D Ising spin glasses. Can this approach be applied to other problems? In this section we will attempt to answer this question by looking at combined-graph coloring problems encoded as instances of the MAXSAT problem. The distance metric for MAXSAT used in the experiments was described in Section 4.3.
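The exact definition of the metric appears in Section 4.3; a plausible sketch, assuming two propositions are at distance 1 when they share a clause and at shortest-path distance otherwise, with unreachable pairs assigned the maximum distance (consistent with the value 500 reported below for some *p* = 1 instances), could look as follows:

```python
from collections import deque

def proposition_distances(n_props, clauses, source):
    """Breadth-first distances from `source` in the graph in which two
    propositions are adjacent iff they occur together in some clause;
    unreachable propositions get the maximum distance n_props."""
    adjacent = [set() for _ in range(n_props)]
    for clause in clauses:
        for a in clause:
            for b in clause:
                if a != b:
                    adjacent[a].add(b)
    dist = [n_props] * n_props
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacent[u]:
            if dist[v] == n_props:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist
```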

As with the spin glass problems, the distance metric is relatively straightforward to implement, but it is difficult to predict what threshold on distances should be used to obtain optimal speedups. We would certainly expect many dependencies between propositions that share a clause, but restricting hBOA to consider only dependencies between propositions in the same clauses would likely be too severe a restriction. Yet, just as with spin glasses, if we do not restrict model structure substantially, the gains will be negligible.

To examine this trade-off, we looked at instances of MAXSAT for graph coloring of combined graphs (Gent et al., 1999) with *p* = 1, *p* = 2^{−4}, and *p* = 2^{−8}. For each value of *p*, we considered 100 random instances, all of which were 5-colorable and contained 500 propositions and 3,100 clauses. Then, for each of these instances, we ran experiments with dependencies restricted by the maximum distance, which was varied from 1 to the maximum distance found between any two propositions (for example, for *p* = 2^{−4} we ran experiments using a maximum distance from 1 to 9). For some instances with *p* = 1, the maximum distance was 500, indicating that there was no path between some pairs of propositions. On the tested problems, small distance restrictions (to only distance 1 or 2) were sometimes too severe and some instances would not be solved even with extremely large population sizes (*N* = 512,000); such restrictions were not used and the corresponding results are omitted. For *p* = 1, the maximum distance of 3 was also too restrictive and these results were likewise omitted.

Figure 8 shows the execution time speedup of hBOA on MAXSAT with model complexity restricted by the maximum distance. As in the results with distance restrictions on 2D spin glasses, the horizontal axis shows the ratio of the number of dependencies within a given distance bound to the total number of dependencies in the original, unrestricted runs. The maximum distance allowed is shown as a label for each of the points in the graph.

The results show that the speedups obtained by restricting model structures by distance vary with the amount of structure in the considered problem instances (represented by parameter *p*). For *p* = 1, problem instances have very little structure and we see that only one distance threshold led to a speedup of about 1.5; the speedups obtained in the remaining cases were negligible. For *p* = 2^{−4} we see a maximum speedup of about 2.5 and for the most structured problems with *p* = 2^{−8} we see a maximum speedup of about 2.2.

As the amount of structure in the problem increases (by decreasing *p*), we also see that a wider variety of distance thresholds lead to speedups. For *p* = 1, only one threshold leads to a substantial improvement. However, as *p* decreases, we get a wider band of thresholds that lead to noticeable improvements. One thing to note about the graphs is that as *p* increases, the proportion of dependencies at small distances increases rapidly: for *p* = 1, over 98% of the dependencies were of distance 4 or less, while for *p* = 2^{−8}, only 72% of the dependencies were at distance 4 or less.

We also examined the effects of the MAXSAT distance-based restrictions on the number of bits that had to be examined during the entire model building procedure. Figure 9 shows the factor by which this quantity decreases with various thresholds on the maximum distance. As in the previous figure, the distance restrictions are labeled on the graph with arrows. As in the execution time results, we see that our gains are smaller for *p* = 1 but are much higher for the other two values of *p*.

Table 4 shows the best speedups, the corresponding maximum distance threshold and *q*_{min} and the percentage of total possible dependencies that were considered by hBOA for all values of *p*. While the results are less impressive than for the 2D Ising spin glasses, speedups are obtained for all values of *p* examined, indicating that even for MAXSAT instances with some structure, distance restrictions improve the performance of hBOA. In contrast to the results on 2D spin glasses, we see that hBOA needs to consider a much larger proportion of possible dependencies. In fact, in the *p* = 1 case, almost all dependencies were considered, and yet the results still showed an almost 50% improvement in efficiency. In the remaining two cases, the results were very similar, with approximately 30% of the original dependencies considered.

| *p* | Execution time speedup | Max distance allowed | *q*_{min} | Percent of total dependencies |
|---|---|---|---|---|
| *p* = 1 | 1.53 | 4 | 0.99 | 97.7% |
| *p* = 2^{−4} | 2.67 | 3 | 0.79 | 29.6% |
| *p* = 2^{−8} | 2.20 | 4 | 0.83 | 28.4% |


## 6. Related Work

This section discusses relevant prior work on using problem-specific knowledge and learning from experience in EDAs and machine learning.

### 6.1. Related Work in Evolutionary Computation.

There are several basic approaches to speeding up EDAs and other evolutionary algorithms by incorporating problem-specific knowledge: (1) bias the procedure for generating the initial population (Schwarz and Ocenasek, 2000; Sastry, 2001a; Pelikan and Goldberg, 2003a), (2) bias the representation to more strongly represent good solutions (Rothlauf, 2006), and (3) bias or restrict the model building procedure (Schwarz and Ocenasek, 2000; Mühlenbein and Mahnig, 2002; Baluja, 2006). For all these approaches, one may either (1) hard code the modifications based on prior problem-specific knowledge or (2) develop automated procedures to improve EDA performance by learning from previous EDA runs on problems of similar type (learning from experience). Since the main focus of this paper is on biasing model building in EDAs, in this section we will review only prior work relevant to this topic.

One of the first attempts to bias model building in EDAs based on prior problem-specific knowledge used a prior network to bias model building toward structures containing edges between pairs of nodes connected in the underlying graph of a graph bipartitioning problem (Schwarz and Ocenasek, 2000). While other dependencies were also allowed, the dependencies between nodes connected in the underlying graph were given higher priority. Mühlenbein and Mahnig (2002) also considered graph bipartitioning, but they restricted the models to allow only connections between nodes connected in the underlying graph.

Baluja (2006) discussed how to bias EDA model building on the problem of graph coloring. As in the study of Mühlenbein and Mahnig (2002), the probabilistic models were restricted to contain only edges consistent with the underlying graph structure; all dependencies between unconnected nodes were simply disallowed. However, this study used dependency-tree models, whereas Mühlenbein and Mahnig (2002) used Bayesian networks.

Most experimental results on using prior knowledge to bias the model building in EDAs suggested that prior knowledge led to significant speedups. Nonetheless, there are several limitations of the aforementioned techniques that we addressed in this paper. First, all of the aforementioned techniques were tailored to one specific problem type and it was unclear how one could apply similar techniques to other problem types. Second, the aforementioned studies introduced a handcrafted bias based on prior problem-specific knowledge, but none of these studies considered automated learning from experience based on previous EDA runs on similar problems.

### 6.2. Related Work in Machine Learning.

In machine learning, the use of bias in learning and the transfer of learned structures between related learning tasks are commonplace. In this section we review some of the work in machine learning that directly relates to the topics discussed in this paper.

In machine learning, the area that focuses on transferring learned structures between related learning tasks is called inductive transfer or transfer learning (Pratt et al., 1991; Pratt, 1992; Sharkey and Sharkey, 1992; Caruana, 1997). The early work on inductive transfer focused mainly on speeding up learning (Pratt et al., 1991; Pratt, 1992; Sharkey and Sharkey, 1992), although in later studies the transfer of learned structures was shown to also improve generalization on sequences of learned tasks (Thrun and Mitchell, 1994; Thrun, 1995, 1996). Improvements in both the speed of learning as well as generalization performance would also be beneficial in the context of transferring learned structures in EDAs. However, to the best of our knowledge none of the early studies in inductive transfer considered Bayesian networks and their variants, which are used in most EDAs based on graphical models, such as hBOA.

A related area in inductive transfer is multitask learning (Caruana, 1997), in which multiple learning tasks are executed in parallel using a shared representation. The goal in multitask learning is to use progress in each of the learning tasks to bias other tasks in order to allow them to be learned better. The general concept of multitask learning is discussed in Caruana (1997) along with examples of multitask learning in learning neural networks using backpropagation, *k*-means clustering, and decision tree learning.

The use of inductive transfer in learning Bayesian networks is discussed by Niculescu-Mizil and Caruana (2007), who describe the first multitask structural learning algorithm for Bayesian networks. Since Bayesian networks and their variants are the most common probabilistic models used in EDAs, the study of Niculescu-Mizil and Caruana (2007) and the subsequent work in this area are among the most relevant results in machine learning that relate to this paper. However, their study does not consider the use of a distance metric and their approach can only be adapted to problems in which the structure of the problem does not change from one instance to another; this approach will thus have a similar set of limitations as the PCM-based approach discussed in Section 4.

In summary, past research on inductive transfer in machine learning provides a wealth of prior work, some of which is closely related to our study presented in this paper. One of the important topics for future work will be to investigate this prior work and its potential for improving the approaches presented in this paper. For more information on inductive transfer and multitask learning in machine learning, please consult Caruana (1997), Thrun (1996), Jebara (2004), and Lawrence and Platt (2004).

## 7. Summary and Conclusions

Apart from providing the user with the global optimum or at least its good approximation, EDAs also provide a series of probabilistic models that hold a great deal of information about the problem. If one wants to solve more problem instances of similar type, using this readily available source of problem-specific knowledge should allow the user to improve performance of EDAs on future problem instances of this type. This study takes the first step in exploiting this information and shows that restricting model building in hBOA based on the results from previous runs on similar problems leads to substantial speedups. Two approaches to restricting probabilistic models based on previous runs were proposed and tested, yielding a speedup of about 4–5 on the 2D Ising spin glass and a speedup of about 1.5–2.5 on MAXSAT depending on the amount of structure in the problem.

While this study considered only two specific classes of problems—2D Ising spin glasses and MAXSAT—the proposed approaches can be adapted to any problem where either (1) problem structure does not vary significantly between different problem instances or (2) one can impose a distance metric on problem variables so that variables located at greater distances are less likely to be connected in the probabilistic model. Some examples of such problem classes include 3D spin glasses and various problems from graph theory (such as minimum vertex cover, facility location, and graph coloring). The proposed approaches provide a principled way to control model bias based on previously discovered probabilistic models without requiring the user to manually design such a bias. The proposed techniques can also be adapted to other EDAs based on multivariate probabilistic models, such as ECGA.

While any efficiency enhancement technique is useful in its own right, combining multiple efficiency enhancement techniques often yields multiplicative speedups (Goldberg, 2002; Sastry et al., 2006). For example, sporadic model building or parallel model building can be used in combination with the techniques proposed in this paper, further improving hBOA performance. This should allow practitioners to further increase the size and difficulty of problems feasible with EDAs and to solve problems unsolvable with current methods.

There are many key areas for future research on this topic. First of all, the proposed techniques should be tested on other classes of important problems, such as graph coloring or minimum vertex cover. Second, the process of choosing appropriate thresholds to achieve maximum speedups should be made more automatic, so that the user does not have to rely on trial and error to set these parameters, with the end goal of eliminating all additional parameters. Additional theoretical and empirical work should be done to gain a better understanding of the expected benefits of different efficiency enhancement techniques with respect to the properties of the problem landscape. Finally, the proposed approaches should be improved by exploring other techniques to bias model building and extended to deal with other problem types.

## Acknowledgments

This project was sponsored by the National Science Foundation under CAREER grant ECS-0547013, by the Air Force Office of Scientific Research, Air Force Materiel Command, USAF, under grant FA9550-06-1-0096, and by the University of Missouri in St. Louis through the High Performance Computing Collaboratory sponsored by Information Technology Services, and the Research Award and Research Board programs.

The US Government is authorized to reproduce and distribute reprints for government purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation, the Air Force Office of Scientific Research, or the US government. Some experiments were done using the hBOA software developed by Martin Pelikan and David E. Goldberg at the University of Illinois at Urbana-Champaign and most experiments were performed on the Beowulf cluster maintained by ITS at the University of Missouri in St. Louis.
