In this paper, we present AMS-DEMO, an asynchronous master-slave implementation of DEMO, an evolutionary algorithm for multi-objective optimization. AMS-DEMO was designed for solving time-intensive problems efficiently on both homogeneous and heterogeneous parallel computer architectures. The algorithm is used as a test case for the asynchronous master-slave parallelization of multi-objective optimization that has not yet been thoroughly investigated. Selection lag is identified as the key property of the parallelization method, which explains how its behavior depends on the type of computer architecture and the number of processors. It is arrived at analytically and from the empirical results. AMS-DEMO is tested on a benchmark problem and a time-intensive industrial optimization problem, on homogeneous and heterogeneous parallel setups, providing performance results for the algorithm and an insight into the parallelization method. A comparison is also performed between AMS-DEMO and generational master-slave DEMO to demonstrate how the asynchronous parallelization method enhances the algorithm and what benefits it brings compared to the synchronous method.
Real-world optimization problems are often high-dimensional, requiring the use of stochastic optimization techniques, such as evolutionary algorithms (EAs). Multi-objective optimization (MO) is the process of simultaneous optimization of two or more conflicting objectives, resulting in a set of solutions that represent various trade-offs between the objectives. As population-based methods, EAs can be extended to return multiple solutions in a single run, which makes them suitable for multi-objective optimization.
In addition, real-world problems are frequently time-intensive, requiring the use of parallel computer architectures for the optimization to have any practical value. Before parallelizing an EA, properties of the parallel computer architecture should be considered, since the selection of the most appropriate parallelization method depends on them. The most important is the distinction between homogeneous and heterogeneous computer architectures; the components of the latter differ in their processing power as well as communication speed, making them more difficult to use efficiently and narrowing the choice of applicable parallelization methods. The goal of this paper is to present the AMS-DEMO algorithm, a parallel evolutionary algorithm for solving time-intensive multi-objective optimization problems, that is able to utilize various parallel computer architectures, both homogeneous and heterogeneous, regardless of the properties of interconnections, and with no limits on the number of processors. The algorithm is parallelized using an asynchronous master-slave method. Although similar asynchronous master-slave methods have been used for parallelization of EAs (Stanley and Mudge, 1995; Talbi and Meunier, 2006) and particle swarm optimization (Kennedy and Eberhart, 1995), and displayed good performance on heterogeneous computer architectures, no analysis of their asynchronism exists to our knowledge. This is possibly due to EAs being experimentally evaluated, paired with the difficulty of generalizing the experiments performed on heterogeneous computer architectures. We combine the experimental evaluation of the algorithm on a homogeneous computer architecture and heterogeneous grid architecture with a theoretical analysis of the algorithm behavior that provides an insight into the parallelization method.
This paper is further organized as follows. Section 2 presents the background of multi-objective optimization evolutionary algorithms, their representative used in this paper—the differential evolution for multi-objective optimization (DEMO)—and a review of the possible methods of parallelization. Section 3 presents the main contribution of the paper, on both the conceptual and the implementation level, with all the important details explained. Experiments with the proposed algorithm on a benchmark problem and a real-world industrial problem follow in Section 4. The paper concludes with Section 5, where possible limitations of the algorithm and future work are discussed.
2.1. Multi-Objective Optimization
2.2. DE and DEMO
DE is an evolutionary algorithm for solving single-objective optimization problems defined over continuous domains (Price and Storn, 1997; Storn and Price, 1997; Price et al., 2005). Variation operators of DE are differential mutation and uniform crossover. In every iteration, an individual, called a parent, is selected from the population at random, but in such a way that every population member acts as a parent in one generation. Differential mutation takes three or more members of the population, $x_{r_1}, x_{r_2}, x_{r_3}, \ldots$, to help construct a mutation vector $v$ by vector addition and scalar multiplication. A simple way of calculating a mutation vector is by $v = x_{r_1} + F \cdot (x_{r_2} - x_{r_3})$, where $F$ is the scaling factor and is often from the interval (0, 1]. The mutation vector is then recombined with the parent by uniform crossover, creating a trial solution. The trial solution and the parent then undergo selection, after which the better of the two is selected for the next generation and the other is discarded.
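The variation and selection steps described above can be sketched as follows. This is a minimal illustration that assumes the common DE/rand/1/bin scheme, minimization, and list-based real vectors. The function names `de_rand_1_bin` and `de_select`, the forced crossover index `j_rand`, and the tie rule in selection are assumptions made for the example, not details taken from the cited DE papers.

```python
import random

def de_rand_1_bin(population, parent_idx, F=0.5, CR=0.1, rng=random):
    """Create one trial vector for the parent at `parent_idx` (hypothetical helper)."""
    # Pick three distinct population members, all different from the parent.
    candidates = [i for i in range(len(population)) if i != parent_idx]
    r1, r2, r3 = rng.sample(candidates, 3)
    x1, x2, x3 = population[r1], population[r2], population[r3]
    # Differential mutation: v = x1 + F * (x2 - x3).
    v = [a + F * (b - c) for a, b, c in zip(x1, x2, x3)]
    # Uniform crossover; j_rand guarantees that at least one gene comes from v.
    parent = population[parent_idx]
    j_rand = rng.randrange(len(parent))
    return [v[j] if (rng.random() < CR or j == j_rand) else parent[j]
            for j in range(len(parent))]

def de_select(parent, trial, fitness):
    """Single-objective selection (minimization): keep the better vector.

    Ties go to the trial here; that choice is an assumption of this sketch.
    """
    return trial if fitness(trial) <= fitness(parent) else parent
```

With `CR=1.0` the trial equals the mutation vector, and with `CR=0.0` exactly one gene (the one at `j_rand`) is taken from it, which makes the two operators easy to inspect separately.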
Robič and Filipič (2005) presented DEMO, which was devised from DE by implementation of the following changes:
The algorithm was changed from generational to steady state. Unlike generational algorithms, steady state algorithms (Eiben and Smith, 2003) immediately evaluate the solutions they create (most often one or two new solutions are created by variation operators) and replace their parents in the population, before variation operators are applied again to the population. This change is straightforward to implement in DEMO: after each selection, the surviving solutions (the trial or the parent or both) immediately replace the parent solution in the current population instead of being placed into a new population.
When the parent is compared to the trial solution, the Pareto dominance is used, resulting in three possible outcomes. The first two are that one solution dominates the other, and is thus used for the next generation, while the other is discarded, as in DE. The third option is that no solution dominates the other, and in this case both solutions are kept, increasing the population size by one.
A mechanism was added to counterbalance the increases in population size. Every n evaluations, where n is the starting population size, the population is truncated using the nondominated sorting and the crowding distance metric from the NSGA-II algorithm by Deb et al. (2002).
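The Pareto-based selection with its three outcomes can be sketched as below. The sketch assumes that all objectives are minimized and that `objectives` is a hypothetical callable returning a tuple of objective values. The periodic truncation with nondominated sorting and crowding distance is left out to keep the example short.

```python
def dominates(a, b):
    """True if objective vector `a` Pareto-dominates `b` (minimization assumed)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def demo_select(population, parent_idx, trial, objectives):
    """Apply the three-way selection in place on a list of decision vectors."""
    f_parent = objectives(population[parent_idx])
    f_trial = objectives(trial)
    if dominates(f_trial, f_parent):
        population[parent_idx] = trial   # trial replaces the parent
    elif dominates(f_parent, f_trial):
        pass                             # trial is discarded
    else:
        population.append(trial)         # no dominance: keep both, population grows
```

In a full implementation the objective values would be cached with each solution rather than recomputed, since evaluation is the expensive step.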
DEMO has been extensively tested by Robič and Filipič (2005) on ZDT test problems (Zitzler et al., 2000); and compared by Tušar and Filipič (2007) with IBEA (Zitzler and Künzli, 2004), SPEA-2 (Zitzler et al., 2001), and NSGA-II on DTLZ test problems (Deb et al., 2005) and WFG test problems (Huband et al., 2005). DEMO was found to outperform other algorithms on a large subset of these problems. A DEMO variant with a mechanism for self-adaptation of DE parameters, DEMOwSA by Zamuda et al. (2007), participated in CEC 2007 Competition on Performance Assessment of Multi-objective Optimization Algorithms (Suganthan, 2007). Based on the previous extensive comparisons, DEMO behavior is well known, and DEMO was therefore selected as the base algorithm for the parallelization.
2.3. Parallelization of EAs
In solving real-life optimization problems, fitness evaluation is sometimes very computationally expensive as, for example, in microprocessor design (Stanley and Mudge, 1995) and airplane wing design (Quagliarella and Vicini, 1998; Sasaki et al., 2001). It is therefore beneficial to parallelize the algorithm for use on multiple processors and thus shorten the execution time. A two-level hierarchical parallelization first parallelizes fitness evaluation on $p_1$ processors, and then parallelizes optimization on $p_2$ groups of processors. A total of $p_1 \times p_2$ processors can be used in this kind of highly efficient parallelization. The problems of parallelizing fitness evaluation and optimization are disjoint, and since the options for efficient parallelization of fitness evaluation are problem dependent, in this paper, we only work on parallelization of the optimization algorithm.
There are several categorizations of parallel metaheuristics for multi-objective optimization (Nebro et al., 2005; Talbi et al., 2008). Here, we present one that is most commonly used when dealing with EAs. There are four EA parallelization methods (Cantú-Paz, 1997; Alba and Troya, 2002; van Veldhuizen et al., 2003; Luna et al., 2006), three basic, that is, the master-slave (also called the global parallelization), the island model, and the diffusion model (also known as the cellular model), and the hybrid model that encompasses combinations of the basic types, usually in a hierarchical structure.
Master-slave EAs are the most straightforward type of parallel EAs because they build on the inherent parallelism of EAs. Consequently, they traverse the search space identically to their serial counterparts. They can be visualized as a master node running a serial EA, but with a modification in creation and evaluation of solutions, which happen on all available processors in parallel. This, however, does not apply to steady state algorithms, in which the creation and evaluation of a single solution depend on the previously generated solution and the result of its evaluation. Steady state algorithms can therefore not be parallelized using the master-slave type without prior modification.
The highest efficiency of the master-slave parallelization type can be achieved on computers with homogeneous processors and in problem domains where the fitness evaluation time is long, constant, and independent of the solution. When these criteria are fulfilled, near-linear speedup (Akl, 1997; i.e., speedup that is close to the upper theoretical limit) is possible. The master-slave parallelization is popular with MOEAs, with implementations ranging from simple ones, as in the case of Oliveira et al. (2003), where the master runs on a separate processor, to the cases of Radtke et al. (2003) and Nebro and Durillo (2010), where the master node also runs one slave.
There are also implementations for heterogeneous computer architectures where load-balancing has to be implemented. Examples can be found in Eberhard et al. (2003) with the pool-of-tasks load balancing algorithm, in Lim et al. (2007), where a grid-enabled algorithm combines the island model with the master-slave model, and in Stanley and Mudge (1995), and Talbi and Meunier (2006), with an asynchronous master-slave parallelization of a steady state algorithm where load balancing is implicit. Similar methods have been used to parallelize particle swarm optimization (PSO) methods with good results. The asynchronous master-slave model is compared to the synchronous model on various numbers of processors and using varying evaluation time in Koh et al. (2006). Multi-objective PSOs were parallelized in Scriven et al. (2008), Mostaghim et al. (2008), and Lewis et al. (2009) with a mixed asynchronous and synchronous master-slave model, resulting in an algorithm that works well on unreliable heterogeneous computer systems. Yet the connection between the number of processors and the behavior of any of the asynchronous master-slave algorithms has not been analyzed.
3. The AMS-DEMO Algorithm
3.1. Parallelizing DEMO
By means of parallelization, the proposed asynchronous master-slave DEMO (AMS-DEMO) attempts to speed up solving real-world optimization problems with expensive evaluation functions. It is designed to be used on problems where the evaluation of solutions, including constraint checks, takes several orders of magnitude longer than other parts of the algorithm, and to run efficiently on both homogeneous and heterogeneous computer architectures. Because it is not designed to be efficient on computationally inexpensive problems, its use in combination with surrogate models is not envisioned. The parallelization seeks to modify the algorithm in a way that maximizes speedup (Akl, 1997), that is, the factor of how much shorter the execution time is on multiple processors than on a single processor.
We used DEMO as a base algorithm (more specifically: the variant DEMO/parent described in Robič and Filipič, 2005, using the DE/rand/1/bin scheme) because it already proved successful in solving numerical multi-objective optimization problems (Tušar and Filipič, 2007), outperforming other state of the art algorithms. Its application in the optimization of a real-world steel casting process (Filipič et al., 2007), however, proved very time-consuming, with a single run taking more than three days to complete on an average PC. To speed up the DEMO algorithm, it was parallelized to run on a computer cluster. Given the properties of the available cluster (32 identical processors with fast interconnections) and problem properties (solution evaluation time is slightly variable but input independent and orders of magnitude longer than variation operators of DEMO), the master-slave parallelization model was selected as the most appropriate.
Initially, the DEMO algorithm was transformed into a generational algorithm and parallelized with the standard master-slave model in Filipič and Depolli (2009), because it offers reasonable speedups while being easy to implement. We call this algorithm generational DEMO and use it as a baseline for comparison in Section 4 of this paper. Once every generation, generational DEMO synchronizes all of the processors. This happens when the master calculates the population of the next generation while the slaves wait. In the best case scenario, all the slaves finish at the same time and only wait for the master to calculate and distribute the next generation, a task that is negligible in duration compared to a solution evaluation. This is, however, only possible if all the slaves are load balanced, which in practice translates to all the slaves being equally fast, performing the same number of evaluations per generation (i.e., requiring the population size to be a multiple of the number of processors), and all evaluations being equally time-intensive. Although these conditions are often met to a high degree, greater flexibility is desired and, as we show next, can be achieved at the cost of a minor drop in the algorithm convergence rate.
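The requirement that the population size be a multiple of the number of processors can be illustrated with a small calculation. The sketch below is an idealized estimate, not a result from the experiments: it assumes equally fast slaves, constant evaluation time, and negligible master and communication time, so that one generation takes a whole number of evaluation rounds. The function name `generational_efficiency` is hypothetical.

```python
import math

def generational_efficiency(pop_size, n_procs):
    """Fraction of slave time spent evaluating during one generation (idealized)."""
    # Every slave waits until the slowest round of the generation is finished.
    rounds = math.ceil(pop_size / n_procs)
    return pop_size / (rounds * n_procs)
```

Under these assumptions a population of 32 keeps 16 or 32 processors fully busy, while 24 processors need the same two rounds as 16 and are busy only two thirds of the time.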
3.2. Asynchronous Master-Slave
By shifting the main task of the algorithm away from traversing the search space in the same way as on a single processor, and toward keeping the slaves constantly occupied, we created AMS-DEMO, which greatly exceeds the flexibility of generational DEMO. The master-slave model is used for parallelization, with the slaves running on all the p processors and the master running as an additional process on one of the processors. Conceptually, AMS-DEMO operates as p asynchronous and independent DEMO algorithms processing a shared population, as shown on the right-hand side of Figure 1. The population is stored on the master, where the variation operators and selection are also applied, while the slaves only evaluate solutions supplied to them by the master, as seen on the left-hand side of Figure 1. Input parameter bounds are checked and enforced on the master while all problem-dependent constraint checks are performed on the slaves, within the evaluation (evaluation time includes all the constraint checks). Because the slaves are asynchronous, there is no need to explicitly load-balance them. In practice, this means that AMS-DEMO is efficient in using heterogeneous computer architectures, computers with varying background load, and dynamic numbers of processors, and that there are no performance-based restrictions on the population size or the number of processors.
The slaves only wait a minimum amount of time between evaluations, while the master performs operations that are orders of magnitude shorter, and spends most of the time waiting. This, however, does not decrease the efficiency of the algorithm, because the master shares a processor with one of the slaves, and both processes are implemented with nonblocking wait. The shared processor is thus either executing the master or the slave and is never idle.
The communication between the master and the slaves is in the form of asynchronous message passing. Message passing means that communication consists of a sender sending a message and a receiver receiving the message. The asynchronous nature of the communication is manifested as the ability of the sender and the receiver to handle messages independently of each other. In contrast, the synchronous message passing used in generational DEMO requires the sender and the receiver to participate in the communication simultaneously, in effect synchronizing them. Generational DEMO requires the processors to be synchronized at the time the messages are either gathered from the slaves or sent to the slaves, making the synchronous message passing perfect for the task. For AMS-DEMO, the asynchronous message passing is an advantage because it requires no unnecessary synchronization or associated wait times.
To use asynchronous communication to its full extent, AMS-DEMO introduces FIFO (first in, first out) queues of solutions pending evaluation on the slaves. A slave with a local queue is able to start evaluation of a solution from the front of its queue immediately after it completes its previous evaluation. It only briefly interrupts the chain of continuous evaluations by sending the last evaluation results to the master, and by checking for and processing any pending messages from the master. Both of these operations are fast because of the asynchronous communication.
Note that the queue of length one is equivalent to no queue at all, because the solution at the front of the queue is the one being evaluated—a slave only removes it after it has evaluated it. In most cases, a queue of length two should suffice to eliminate all the wait time for the slave, because the slave does not require more than one solution waiting in the queue. There are exceptions, such as the case where the communication time is of the same order of magnitude as the evaluation time, and the case where it is beneficial to send more than one solution per message because of expensive communication.
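The master's role in this scheme can be sketched on a single machine as follows. This is only a shared-memory stand-in under stated assumptions: Python threads replace the MPI slave processes, the thread pool's internal task queue stands in for the per-slave FIFO queues, and `create`, `evaluate`, and `select` are hypothetical callables supplied by the caller. What the sketch preserves is the control flow: the master keeps `n_slaves * queue_len` evaluations outstanding and reacts to each result as it arrives, without any generation-wide synchronization.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def _evaluate_pair(evaluate, solution):
    """Runs on a slave: evaluate and return the solution with its result."""
    return solution, evaluate(solution)

def async_master(evaluate, create, select, n_slaves, queue_len, max_evals):
    """Process results one by one; return the number of completed evaluations."""
    submitted = completed = 0
    pending = set()
    with ThreadPoolExecutor(max_workers=n_slaves) as pool:
        # Fill the queues before waiting for any result.
        while submitted < min(n_slaves * queue_len, max_evals):
            pending.add(pool.submit(_evaluate_pair, evaluate, create()))
            submitted += 1
        while pending:
            finished, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in finished:
                solution, result = future.result()
                select(solution, result)      # selection runs on the master only
                completed += 1
                if submitted < max_evals:     # each selection triggers one creation
                    pending.add(pool.submit(_evaluate_pair, evaluate, create()))
                    submitted += 1
    return completed
```

Because `create` and `select` are only ever called from the master thread, they can touch the population without locking, which mirrors the design in which the population lives on the master.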
3.3. Selection Lag
We explore an important difference between AMS-DEMO and the original DEMO—the difference in the way solutions are related to the population. The difference can easily be demonstrated if we observe a typical solution from its creation to its selection. While in the original DEMO, the population does not change in this observed time period, it may change in AMS-DEMO. The change happens because, in AMS-DEMO, while the observed solution is being evaluated on one processor, some number of other solutions complete their evaluation on other processors, and are sent to the master, undergo selection, and may thus change the population. This causes a lag in exploitation of good solutions. We will call it selection lag, denote it with l, and define it, per solution, as the number of solutions that undergo selection in the time between the observed solution's creation and selection. The selection lag therefore counts the number of possible changes to the population (the number of replaced solutions) that are not known to AMS-DEMO when it creates the observed solution, but would be known to the original DEMO. Because every selection is coupled with the creation of a new solution, the selection lag can also be thought of as the number of solutions created while an observed solution is being evaluated, in other words, as the number of solutions that could be created differently in the original DEMO than they are in AMS-DEMO, because of the different processing of the observed solution. It should be stressed that the changes to the population counted by the selection lag are possible, but not necessary. Furthermore, although their probabilities depend linearly on the selection lag, they also depend on the population size and on the probability of the offspring surviving selection.
Defined per solution, in a single run of AMS-DEMO, the selection lag becomes a statistical variable which characterizes the behavior of AMS-DEMO. We can assume, however, that as long as the selection lags of solutions deviate little from the mean, the errors made by observing only the mean selection lag, which is easily calculated as $\bar{l} = pq - 1$, where $p$ is the number of processors and $q$ is the queue length, are negligible.
The proof for $\bar{l} = pq - 1$ is given through the following example. In the simplest case, with queue size 1 and equal evaluation times, slaves work as follows. They are assigned solutions and start evaluating them in an orderly fashion. Slave 1 is assigned solution 1, slave 2 is then assigned solution 2, and so on until the last slave $p$ is assigned solution $p$. Equally fast slaves finish evaluations and receive the next $p$ solutions in the same order as the first $p$ solutions. Slave 1 evaluates solution 1 and is assigned solution $p + 1$. Therefore, the selection lag for solution 1, given as the number of solutions created during its evaluation, equals $p - 1$, which extends to all other solutions. The mean selection lag is then also $p - 1$. In a more realistic case, where the evaluation times vary, the selection lag may no longer equal $p - 1$ for all solutions. Any increase in one solution's selection lag, however, must produce an equivalent decrease in selection lags of other solutions. This can be demonstrated with an example of two solutions, $a$ and $b$, evaluated in parallel, with $a$ undergoing selection just prior to $b$. If the evaluation time of $a$ were increased just enough for it to undergo selection just after $b$, its selection lag $l_a$ would increase by 1. But this would automatically decrease the selection lag of $b$ by 1, because $a$ would no longer undergo selection while $b$ was being evaluated. Thus, any transposition of the evaluation order of two solutions changes their selection lags symmetrically, so that their mean does not change. This rule can also be extended to all possible permutations of the evaluation order, since any permutation can be represented as a composition of transpositions. With the introduction of queues, the time between creation and selection of a solution lengthens by the time the solution waits in the queue.
For queues of length $q$, the number of solutions generated during this waiting time equals the number of solutions already in the queue, $q-1$, plus the number of solutions generated on the other processors in the same time, $(q-1)(p-1)$. The selection lag of an average solution is the total number of solutions generated while this solution is waiting or being evaluated: $(q-1)+(q-1)(p-1)+(p-1)$, which simplifies to $pq-1$.
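The value $pq - 1$ can be checked with a small event-driven simulation. The sketch below is an idealization and not the authors' code: it assumes a constant evaluation time per slave, ignores communication and master overhead, and breaks ties between simultaneously finishing slaves by slave index. The function name `mean_selection_lag` is hypothetical. Solutions from the initial population are excluded from the mean because they are all created before any selection takes place.

```python
import heapq
from collections import deque

def mean_selection_lag(eval_times, queue_len, n_selections):
    """Mean selection lag of non-initial solutions.

    eval_times[s] is the constant evaluation time of slave s;
    n_selections must exceed len(eval_times) * queue_len.
    """
    p = len(eval_times)
    # Each queue entry stores how many selections were done when the solution
    # was created; None marks a member of the initial population.
    queues = [deque([None] * queue_len) for _ in range(p)]
    events = [(eval_times[s], s) for s in range(p)]   # (finish time, slave)
    heapq.heapify(events)
    selections = 0
    lags = []
    while selections < n_selections:
        t, s = heapq.heappop(events)          # next slave to finish an evaluation
        created_at = queues[s].popleft()
        if created_at is not None:
            lags.append(selections - created_at)
        selections += 1
        queues[s].append(selections)          # master creates a replacement solution
        heapq.heappush(events, (t + eval_times[s], s))
    return sum(lags) / len(lags)
```

For four equally fast slaves the simulated mean is 3 with `queue_len=1` and 7 with `queue_len=2`, matching $pq - 1$. With two slaves of which one is twice as slow, individual lags differ but the mean stays close to 1, in line with the transposition argument.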
3.4. Implementation Details
The AMS-DEMO algorithm consists of the master and slave processes. Both communicate with each other using the message passing interface (MPI; Snir et al., 1996) communication standard. The algorithm does not depend on this standard, so any asynchronous communication protocol should be adequate.
If the parent is marked as unevaluated, then the offspring replaces it, bypassing the selection. There are two possible scenarios that result in the parent being marked as unevaluated. The first, more common one, is that the solution is from the initial population, having no real parent, and is marked as the parent of itself (line 4 of Algorithm 3). In the second, rarer scenario, the parent is a member of the initial population and has not yet been evaluated. In such a case, the two related solutions (the parent and the offspring) simply switch roles. The offspring skips the selection and replaces the parent in the population. Then, after the parent is evaluated, the parent undergoes selection in which it competes against the offspring. Because of the asynchronous nature of AMS-DEMO, the parent of an observed solution might not be found in the population, because it was already replaced by some other solution or eliminated by the population truncation. When this happens, a random solution from the population is selected as the parent to compete against the observed solution.
4. Algorithm Evaluation
The performance of the proposed algorithm was assessed on a benchmark problem and a real-world multi-objective optimization problem. The benchmark problem is used to discover the relation between the computational complexity of the evaluation function and the efficiency of AMS-DEMO. The real-world problem was used for testing the algorithm convergence, parallel speedup, and the ability to run on heterogeneous systems with numbers of processors larger than the population size. Experiments on both problems were performed on a cluster of 16 dual AMD Opteron 244 processor computers, each with 1024 MB of memory, and six gigabit Ethernet ports, connected with a gigabit Ethernet switch. The software used includes Fedora Core 2 Linux with kernel 2.6.8-1.521smp, MPICH 1.2.7 communication library (Gropp et al., 1996) and GCC 3.3.3 compiler.
4.1. Experiments with Evaluation Time
We call it weak speedup because it is based on the execution time of an algorithm with the number of evaluations set as the stopping condition. True speedup, in contrast, would have solution quality as the stopping condition, but that would make it more problem dependent. In this analysis, we are solely interested in the response of AMS-DEMO to various evaluation time lengths and therefore we chose the weak speedup over the true speedup to make the results as widely applicable as possible. We develop a simple model for estimating the weak speedup of AMS-DEMO on p processors. We examine the time required for p evaluations and all the algorithm overhead. For simplicity, we intentionally leave out some factors that are hard to determine in advance for any given computer system. Nevertheless, we need to introduce three specific times. The algorithm overhead time is ta, defined per evaluation. The algorithm overhead includes all algorithm tasks but evaluation, for example, application of variation operators and selection, output of solutions and evaluation results to a file, and so on. The value of tn is the network latency, that is, the time it takes for one processor to fully transfer a message to a remote processor, or to fully receive a message from the remote processor. During the network latency, the processor may perform other tasks, because all of the work is done by either networking hardware or the remote processor. There is, however, an overhead for the processor associated with the communication, namely the time required to prepare data and pack it into a message, or to unpack the data from the message and copy it to local variables (depending on which side of the communication the said processor is). We refer to this time as communication overhead time, tc.
In Figure 2 we plot the experimentally measured Se (mean of 25 runs) for various p. Interestingly, weak speedups are less than 1 for te<0.1 and all p. We can see that for ms, AMS-DEMO weak speedup rises above 1. At te=10 ms, AMS-DEMO already reaches nearly full speedup () at low numbers of processors (). For higher numbers of processors, the evaluation time must be even higher for AMS-DEMO to achieve full speedup. We also modeled and plotted Se for 32 processors in the same figure (labeled as p=32, modeled) using values for tn=0.22 ms, tc=0.03 ms, and ta=0.20 ms, measured during the experiments. The simple model is accurate on most of the values for te, and only fails to predict the weak speedup at te=10 ms. Even with one inaccurate prediction, the simple model should be useful for deciding on when to use AMS-DEMO.
4.2. The Real-Life Optimization Problem
Next, AMS-DEMO was tested on a real-world multi-objective optimization problem of tuning coolant flows in industrial continuous casting of steel. Continuous casting is widely used at steel plants to produce semi-products of various shapes and dimensions. The process starts by pouring liquid steel, melted in a furnace, into the mold, a bottomless vessel cooled by water flow in its walls. Cooling in the mold, also referred to as primary cooling, extracts heat from the steel and initiates the formation of a solid shell on its surface. The shell formation is crucial for the support of the steel slab after it exits the mold and enters into the secondary cooling area where it is cooled by water sprays. Led through the secondary cooling area by support rolls, the slab progressively solidifies and finally exits the casting machine. At its outlet, it is cut into pieces of predetermined length.
Water flows cannot be set arbitrarily, but according to the technological constraints. For each water flow, minimum and maximum values are prescribed. Table 1 shows an example of the prescribed target temperatures and maximum water flows for continuous casting of the steel grade analyzed in this study. Minimum water flows are 0 for all zones and both center and corner positions.
| Zone number | Center positions: target [°C] | Center positions: max flow [m³/h] | Corner positions: target [°C] | Corner positions: max flow [m³/h] |
As we can see, coolant flow tuning in continuous casting of steel is a constrained two-objective optimization problem. To solve it by means of parallel multi-objective optimization, we integrated the optimization algorithm with a numerical simulator of the casting process. Given the coolant flow values, the simulator calculates the temperature field in the slab and extracts the values of objectives f1 and f2. For this purpose, we use a numerical model of the process with finite element method (FEM) discretization of the temperature field, and the corresponding nonlinear equations are solved with relaxation iterative methods. The model has previously been used in a single-objective optimization study of the casting process (Filipič and Laitinen, 2005), and in preliminary multi-objective optimization studies with the serial variant of the DEMO algorithm (Filipič et al., 2007).
Optimization calculations were performed for a selected steel grade with a slab cross section of 1.70 m × 0.21 m. The assumed casting speed was 1.6 m/min, and the target core length was 27 m.
4.3. Initial Experiments and Results
Initial parallel optimization experiments were performed to compare generational DEMO and AMS-DEMO. As shown in previous work (Filipič et al., 2007), solving the continuous casting optimization problem, DEMO appears to work best with population sizes between 20 and 40, which coincides well with the 32 processors available in the cluster. Therefore, a population size of 32 was chosen for the initial experiments. The remaining parameters for both parallel algorithms were set as follows: scaling factor F to 0.5, crossover probability to 0.1, DE scheme to rand/1/bin, and the stopping criterion to 9,600 evaluations.
It turned out that both parallel algorithms were able to discover the solutions known from previous applications of the original DEMO (Filipič et al., 2007), demonstrating conformance with it. All nondominated fronts reached the size equal to the population size (set to 32). To illustrate the results, Figure 3 shows the resulting nondominated front (approximating the Pareto optimal front) found by generational DEMO. The conflicting nature of the two objectives (that is, improving the coolant flow settings with respect to one objective makes them worse with respect to the other) is evident from the presented nondominated front. In addition, a systematic analysis of the solutions confirms that the actual slab surface temperatures are, in most cases, higher than the target temperatures, while the core length is shorter than or equal to the target core length. For example, the temperature differences for three solutions from the front displayed in Figure 3, two at the boundaries of the front and one from the middle of the front, are shown in Figure 4.
4.4. Convergence of AMS-DEMO
Convergence of AMS-DEMO was tested experimentally on 1, 2, 4, 8, 16, and 32 processors, where for each number of processors p, the algorithm was run 25 times. Generational DEMO was tested under the same conditions as AMS-DEMO, with the difference that each experiment was a batch of only five runs. Fewer runs were used than the 25 runs per AMS-DEMO experiment because the number of processors p does not influence the way in which the generational algorithm works. Because the difference in the generational DEMO runtimes is always below 1%, five runs per p were sufficient to calculate mean runtimes as a function of p. To compare the convergence rates of generational DEMO and the original DEMO, an additional batch of 25 runs of generational DEMO was performed on 32 processors. The presented comparisons of algorithms and of various numbers of processors were all performed using the mean number of evaluations required to reach the same solution quality.
The convergence of the tested DEMO variants was evaluated using the hypervolume indicator IH (Zitzler et al., 2002), also called the S metric (Zitzler and Thiele, 1998), which is a measure of the hypervolume of objective space that is dominated by a set of solutions. The properties of the hypervolume indicator (Knowles and Corne, 2002) enable observation of the convergence of solutions toward the optimum within a single run, and comparison of the achieved solutions between two or more runs. On the other hand, the hypervolume indicator is sensitive to the properties of the nondominated front (Auger et al., 2009), such as the evenness of the distribution of solutions along the front, which makes comparison between different algorithms less reliable. The differences between the DEMO variants that we compare, however, are not in the variation operators, the truncation of solutions, or related functions of the algorithm, and therefore have no influence on the properties of the nondominated front, making comparison between the algorithms possible.
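For two objectives, the hypervolume indicator reduces to the area dominated by the front and bounded by a reference point. The following is a minimal sketch of that computation, assuming both objectives are minimized; the function name `hypervolume_2d`, the reference point, and the sample front are illustrative and are not taken from the experiments.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]


def hypervolume_2d(points: Iterable[Point], ref: Point) -> float:
    """Area dominated by `points` and bounded by `ref` (both objectives minimized)."""
    # Only points that are better than the reference point in both objectives count.
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    area = 0.0
    best_f2 = ref[1]  # lowest f2 seen so far while sweeping f1 upward
    for f1, f2 in pts:
        if f2 < best_f2:
            # The point is nondominated so far: it adds a horizontal strip
            # of width (ref_f1 - f1) and height (best_f2 - f2).
            area += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area


# Hypothetical three-point front; the dominated point (3, 3) adds nothing.
front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (3.0, 3.0)]
print(hypervolume_2d(front, ref=(4.0, 4.0)))  # 6.0
```

Because the value depends on the chosen reference point, runs are comparable only when the same reference point is used for all of them.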
First, AMS-DEMO is evaluated against the original DEMO, with regard to the convergence of the algorithm, characterized by the convergence of the IH. It should be noted that when the number of processors drops to one, AMS-DEMO reverts to original DEMO; given the same random generator and seed, the AMS-DEMO algorithm traverses the same path through the search space as the original DEMO, with no calculation overhead. Therefore, experiments with AMS-DEMO on a single processor are also taken as the experiments of the original DEMO. With the exception of the number of processors, no other algorithm parameter varied between the experiments. The population size was set to 32 and the runs were terminated after 9,600 (population size times 300) evaluations. Each run was carefully timed, and the hypervolume indicator of the nondominated set of solutions was measured every 32 evaluations, that is, after every population truncation. The values of the hypervolume indicator for the experiment on a single processor are shown in Figure 5.
Due to the changes in the algorithm required by parallelization, the convergence rate of AMS-DEMO is expected to slow down when the number of processors increases. The numbers of evaluations required to reach specific IH values were examined and compared between the experiments. The mean values in Figure 6 indicate that increasing the number of processors does slow down the convergence. The confidence intervals, which were calculated using the basic percentile bootstrap method (Efron, 1982), are quite large, indicating low statistical confidence in this conclusion.
The statistical significance of the differences in the number of evaluations performed by AMS-DEMO for relevant values of IH is determined using the two-sample Kolmogorov–Smirnov test. As can be seen from Figure 7, the differences are statistically significant (p value < .05) only for the largest numbers of processors (16 and 32), and only over part of the range of IH values. This is consistent with our expectations. The difference between DEMO and AMS-DEMO increases with an increasing number of processors and is more important in the early stage of search when convergence is faster, and less important in the later stage of search when convergence is slower. The difference is also not detected immediately but rather after the initial random population is significantly improved.
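The percentile bootstrap interval for a mean can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used in the experiments: the function name, the number of resamples, and the sample of evaluation counts are all hypothetical.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def percentile_bootstrap_ci(sample: Sequence[float],
                            n_resamples: int = 2000,
                            alpha: float = 0.05,
                            seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement and record the mean of every resample.
    means = sorted(mean(rng.choices(sample, k=n)) for _ in range(n_resamples))
    # Take the alpha/2 and 1 - alpha/2 percentiles of the resampled means.
    lo_idx = max(0, round((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, round((1 - alpha / 2) * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]


# Hypothetical numbers of evaluations needed to reach a target IH in ten runs.
evals = [3200, 3360, 3104, 3520, 3296, 3424, 3168, 3392, 3264, 3456]
low, high = percentile_bootstrap_ci(evals)
print(f"mean = {mean(evals):.1f}, 95% CI = [{low:.1f}, {high:.1f}]")
```

With only 25 runs per experiment, intervals obtained this way are naturally wide, which matches the observation above.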
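The two-sample Kolmogorov–Smirnov test compares the empirical distribution functions of two samples. The sketch below computes the test statistic exactly and approximates the p value with the standard asymptotic series, which is only an approximation for samples as small as 25 runs. The function name and the sample data are hypothetical.

```python
import math
from bisect import bisect_right
from typing import Sequence, Tuple


def ks_two_sample(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (D, approximate p value) of the two-sample Kolmogorov-Smirnov test."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    # Both empirical CDFs are right-continuous step functions, so the largest
    # difference between them is attained at one of the sample points.
    d = 0.0
    for x in a + b:
        d = max(d, abs(bisect_right(a, x) / n - bisect_right(b, x) / m))
    # Asymptotic Kolmogorov distribution with a small-sample correction.
    ne = n * m / (n + m)
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.2:
        # The series converges slowly here, and the p value is effectively 1.
        return d, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(1.0, max(0.0, p))


# Hypothetical evaluation counts for two settings of the number of processors.
runs_p1 = [3000 + 32 * i for i in range(25)]
runs_p32 = [3400 + 32 * i for i in range(25)]
d, p_value = ks_two_sample(runs_p1, runs_p32)
print(d, p_value < 0.05)
```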
The convergence rate of generational DEMO is also compared against the convergence rate of the original DEMO and the differences are found to be insignificant. Because generational DEMO runs equivalently on any number of processors p, we can conclude that it has an advantage over AMS-DEMO for larger numbers of p, where the convergence of AMS-DEMO is noticeably slower.
We finally analyze the selection lag. Figure 8 shows the distributions of selection lag on the performed experiments. Means of distributions equal pq−1, as expected. Modes (peaks in distributions) also equal pq−1 and distributions appear only slightly asymmetric. Although the total range of measured values is wide, for example, from 6 to 58 for p=32, standard deviations are small, for example, 2.2 for p=32. Therefore, mean selection lag seems adequate to explain the difference in convergence between AMS-DEMO and DEMO.
Convergence tests can be summarized as follows. A bit surprisingly, generational DEMO converges at the same rate as the original DEMO, which should result in good parallel speedups. AMS-DEMO convergence slows down, as expected, as the number of processors increases. The change is only statistically significant at 16 and 32 processors, and on less than the whole range of target solution qualities. Given much more than 25 runs per test, the change in AMS-DEMO convergence might be significant at other numbers of processors, but we are limited in the number of tests we make by the long execution times of tests, particularly those on smaller numbers of processors. Because selection lag is the driver of changes in convergence rate, we measure it on the performed tests. Its mean equals pq−1 and its variability is low, both as expected. Because variability is low, the mean selection lag adequately explains the difference in convergence between AMS-DEMO and DEMO.
From Table 2, the average difference between AMS-DEMO and generational DEMO can be observed. The expected decrease in AMS-DEMO efficiency as a result of an increasing number of processors is quantified in the column for Sc under AMS-DEMO. AMS-DEMO therefore becomes progressively less efficient than the original DEMO with every additional processor. On the other hand, high Sp, nearly equal to the number of processors in all the experiments, implies very good utilization of the available processors. The opposite holds for generational DEMO. While it is as efficient as the original DEMO at using evaluations, it is less efficient at utilizing additional processors, which is reflected in smaller Sp.
| | AMS-DEMO | Generational DEMO |
| p | Sc(p) | Sp(p) | S(p) | E(p) | Sc(p) | Sp(p) | S(p) | E(p) |
Looking at the speedups calculated from the performed experiments, an initial conclusion could be that the algorithms are closely matched, with generational DEMO having a slight advantage. The presented experiments, however, are biased, since they encompass only scenarios that particularly suit generational DEMO. Therefore, we prepare two additional scenarios and analyze them both analytically and through additional experiments.
4.6. Analytical Comparison
AMS-DEMO would exhibit a clear advantage over generational DEMO if the population size were not a multiple of the number of processors. Generational evolutionary algorithms that implement the basic master-slave parallelization evaluate one population at a time in parallel, but they cannot proceed to the next generation until the whole population is evaluated. Consequently, assuming that the evaluation of a single solution requires one processor, the number of processors used cannot exceed the size of the population. Furthermore, it is inefficient to use a number of processors that does not divide the size of the population. Consider the worst case scenario in which the population size n is larger than the number of processors by one: n=p+1. Evaluating such a population would require p−1 processors to each evaluate one solution and one processor to evaluate two solutions, resulting in considerable idle time of the processors. Processors would also be idle if equal numbers of solutions were assigned to them for evaluation, but the evaluation times of these solutions were not equal. Two possible (nonexclusive) causes of differences in evaluation times are a heterogeneous set of processors and an evaluation function with nonconstant runtime, either dependent on the solution or simply random. The noted caveats of the master-slave parallelization apply to generational DEMO without exception. AMS-DEMO circumvents both of those caveats, with an important and rather surprising consequence: AMS-DEMO is able to utilize more processors than there are solutions in the population. We will explore how effective it is in this respect in Section 4.7.
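The idle time in the worst case scenario can be illustrated with a simplified model that assumes a constant evaluation time and ignores communication. This is a sketch for intuition only and is not one of the execution time equations referred to in this section; the function names are hypothetical.

```python
import math


def generation_rounds(n: int, p: int) -> int:
    """Sequential evaluation rounds needed to evaluate n solutions on p processors."""
    return math.ceil(n / p)


def generational_utilization(n: int, p: int) -> float:
    """Fraction of processor time spent evaluating during one generation."""
    return n / (p * generation_rounds(n, p))


# Worst case n = p + 1: one processor evaluates two solutions, the others one each.
print(generational_utilization(n=33, p=32))  # 0.515625

# With n = 32, moving from 16 to 30 processors does not shorten a generation.
print(generation_rounds(32, 16), generation_rounds(32, 30))  # 2 2
```

In this model the time per generation is proportional to the number of rounds, so adding processors helps only when it reduces the ceiling of n/p.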
To estimate the behavior of AMS-DEMO and (for reference) generational DEMO, equations for the execution time have been devised for both algorithms.
The computed execution times were tested against the experimentally measured execution times and were found to be within the experimental confidence intervals. Using the given equations for the execution time of generational DEMO and AMS-DEMO and the measured evaluation times, the execution times were estimated for the number of processors on interval [1, 100] and are shown in Figure 10.
From the execution times, speedups can be derived accurately for generational DEMO, and less accurately for AMS-DEMO, because the latter behaves differently for the different numbers of processors. Nevertheless, it can be argued that, for example, increasing the number of processors from 16 to 30, the execution time and the behavior of generational DEMO would not change, causing the speedup to drop significantly. AMS-DEMO, on the other hand, would experience two counteracting effects: the ability to run the same number of evaluations in less time, balanced out to some extent by the requirement for more evaluations to reach the same solution quality. As the experiments show, when increasing the number of processors, the first effect always outweighs the second, yielding an increase in speedup. Therefore, increasing the number of processors is always beneficial for the performance of AMS-DEMO, while it often degrades the performance of generational DEMO.
4.7. Varying the Queue Size
Although queues have been implemented to reduce the slave idle time to a minimum, they also allow for simulating more processors than are available on the experimental architecture. This allows for a simulation of an interesting algorithm property, the ability to run on a number of processors that is larger than the population size. Although there are other possibilities of simulating additional processors, for example, running multiple processes on a single processor, using the queues is chosen because it simultaneously provides an example of the drawback of queues.
The algorithm running on p slaves, each having a queue of size q, explores the objective space in a similar fashion as if it were running on pq slaves, each having a queue of size 1. This is because the algorithm behavior changes with the selection lag, for which it was shown that its mean equals pq−1. The same mean selection lag may be obtained through different values of p and q; therefore, increasing the queue size emulates the use of additional processors. Although settings of the algorithm that produce the same mean selection lag produce very similar behavior, running the algorithm on fewer processors with longer queues differs from running it on more processors with shorter queues. If a number of solutions is inserted into a single queue, they undergo selection in the order of insertion. On the other hand, if the same number of solutions is distributed between different processors and the evaluation time varies, they are likely to undergo selection in a different order. The additional out-of-order selection manifests as an increase in the selection lag variance, and, although difficult to quantify, has some influence on the algorithm behavior. In our experiments, for example on 32 processors, the mean selection lag is 31, while its standard deviation is 2.65, which we believe is small enough to be negligible.
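The relation between p, q, and the selection lag can be explored with a small event-driven simulation. The sketch below is not the AMS-DEMO implementation. It assumes that the selection lag of a candidate is the number of other candidates that undergo selection between its creation and its own selection, that each slave processes its queue in first-in, first-out order, and that communication is instantaneous. The evaluation time distribution is hypothetical.

```python
import heapq
import random
from collections import deque
from statistics import mean, pstdev
from typing import Callable, List


def simulate_selection_lag(p: int, q: int, n_selections: int,
                           eval_time: Callable[[random.Random], float],
                           seed: int = 0) -> List[int]:
    """Selection lags observed on p slaves, each holding a queue of q candidates."""
    rng = random.Random(seed)
    # A queue entry stores the selection counter at creation time;
    # None marks the initial candidates, which are excluded as warm-up.
    queues = [deque([None] * q) for _ in range(p)]
    events = [(eval_time(rng), slave) for slave in range(p)]  # (finish time, slave)
    heapq.heapify(events)
    done = 0  # number of selections performed so far
    lags: List[int] = []
    while done < n_selections:
        t, slave = heapq.heappop(events)     # next slave to return a result
        created = queues[slave].popleft()
        if created is not None:
            lags.append(done - created)      # selections since this candidate was created
        done += 1                            # master performs selection
        queues[slave].append(done)           # and sends a new candidate to the same slave
        heapq.heappush(events, (t + eval_time(rng), slave))
    return lags


def varying(rng: random.Random) -> float:
    return rng.uniform(0.8, 1.2)  # hypothetical evaluation times


for p, q in [(32, 1), (4, 8)]:  # both settings give pq - 1 = 31
    lags = simulate_selection_lag(p, q, 20000, varying)
    print(p, q, round(mean(lags), 2), round(pstdev(lags), 2))
```

Under these assumptions every selection event counts toward the lag of each of the pq−1 other candidates in flight, which is why the mean stays at pq−1 while the spread depends on how the candidates are distributed between slaves.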
Queue sizes of 10 and 20 are used on 32 processors, simulating 320 and 640 processors, respectively, and the results are compared to the original DEMO. All the experiments are performed using the same algorithm parameters as before, including the population size of 32. The speedup calculated from Equations (8) and (14) is plotted in Figure 11.
First, we see from Figure 11 that the speedup grows much more slowly than the number of processors, but it grows nevertheless. Because the number of processors is not limited by the population size, the algorithm is able to produce speedups far greater than the population size. The drop in efficiency, however, should be taken into consideration when this property of the algorithm is used. The second observation is that the speedup improves as the target IH value rises. This follows from the property of AMS-DEMO that its traversal through the search space deviates the most from the original DEMO traversal when the convergence is fastest. Because the convergence of the algorithm slows at higher IH, AMS-DEMO behaves more like the original DEMO and thus becomes more efficient, causing the speedup to increase.
4.8. Experiments on Grid'5000
To demonstrate the flexibility of AMS-DEMO, we performed additional experiments on the Grid'5000 (Bolze et al., 2006) computer setup. Grid'5000 is a research tool for studying large-scale distributed systems and high-performance scientific computing. It is distributed between nine sites, with each site hosting one or more clusters composed of several hundred processors. We used clusters Bordereau and Bordeplage, located at the Bordeaux site, to perform four experiments with varying p, each repeated 25 times. The queue length was set to 1 and other algorithm settings were set as for the previous experiments.
The convergence of AMS-DEMO on Grid'5000 is compared to AMS-DEMO running on a single processor. The results are shown in Figure 12. While the convergence rate gets slower with increasing p, the solutions reach about the same average IH both on p=100 and p=1 within the performed number of evaluations. For p>100, this does not happen, which confirms the results of the experiments with emulated processors.
The clusters used in the experiments are composed of various hardware and perform differently on the evaluation function. As seen from Figure 13, where the distribution of evaluation time te is plotted for all evaluations performed during experiments, Bordeplage performs evaluations faster than Bordereau. Its mean te, 21.8 s, is about 22% shorter than Bordereau's 27.9 s. If generational DEMO were considered, the difference in mean performance alone would cause poor utilization of the faster processors. Observing both clusters together, there is also a wide spread of te, between 16 s and 40 s, that would further reduce processor utilization on generational DEMO. Another obstacle to the generational DEMO efficiency is a relatively small population size n, compared to the number of available processors. If n were fixed to 32, then generational DEMO would not be able to use additional processors of Grid’5000 at all; but if n were equal to p, its convergence would slow down, and more importantly, because of different population sizes, the results of generational DEMO could not be compared to the results of AMS-DEMO using the hypervolume indicator. Therefore we do not experiment with generational DEMO on Grid’5000.
| Total | Bordeplage | Bordereau | Sc(p) | Sp(p) | S(p) | Sw(p) | E(p) |
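The effect of the difference in mean evaluation time on a synchronous algorithm can be illustrated with a short calculation. The sketch assumes one evaluation per processor per generation and a barrier at the end of each generation, and it uses only the two mean evaluation times reported above. The split of processors between the clusters is hypothetical.

```python
from typing import Sequence


def relative_gain(fast_te: float, slow_te: float) -> float:
    """Relative reduction of evaluation time on the faster processors."""
    return (slow_te - fast_te) / slow_te


def barrier_utilization(eval_times: Sequence[float]) -> float:
    """Busy fraction of all processors when each generation waits for the slowest one."""
    generation_time = max(eval_times)
    return sum(eval_times) / (len(eval_times) * generation_time)


TE_BORDEPLAGE = 21.8  # mean te in seconds, as reported above
TE_BORDEREAU = 27.9   # mean te in seconds, as reported above

print(round(relative_gain(TE_BORDEPLAGE, TE_BORDEREAU), 3))  # 0.219

# Hypothetical split of 32 processors between the two clusters.
times = [TE_BORDEPLAGE] * 10 + [TE_BORDEREAU] * 22
print(round(barrier_utilization(times), 3))
```

Each faster processor is busy for only 21.8/27.9 of every generation in this model, so the loss grows with the share of faster processors, before the spread of te within each cluster is even taken into account.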
Using the data from all performed AMS-DEMO runs, we also analyzed the selection lag on the heterogeneous computer architecture of Grid’5000. We show the distribution of selection lag for tested values of p in Figure 15 and the relevant statistics in Table 4. The measured selection lag distributions typically have a two-peak shape and are skewed to the right, that is, distribution means are to the right of both peaks. The two peaks correspond to the two types of processors in the used portion of Grid’5000. The smaller peak at lower selection lags is produced by the smaller number of Bordeplage processors, while the larger peak at higher selection lags is produced on the higher number of Bordereau processors. The most likely reason for skewness is the similarly skewed distribution of te on the Bordereau cluster (see Figure 13). Distributions are also very wide, as seen from their ranges and standard deviations compared to their means. Therefore, the mean selection lag may no longer provide enough information to fully understand the changes in the AMS-DEMO convergence rate; further experiments are required to determine the effects that large variations in the selection lag have on AMS-DEMO convergence.
| p | Range | Mean | SD | Peaks |
The steady-state DEMO algorithm was parallelized using an asynchronous master-slave parallelization type, creating AMS-DEMO. AMS-DEMO utilizes queues for each slave, which reduce the slave idle time to a negligible amount. Because of its asynchronous nature, the algorithm is able to fully utilize heterogeneous computer architectures and is not slowed down, even if the evaluation times are not constant.
Unlike the more common synchronous master-slave parallelization of generational algorithms, which traverse the decision space identically on any number of processors, the asynchronous master-slave parallelization changes the trajectory in which the algorithm traverses the decision space. Selection lag, a property that fully characterizes this change, was identified. Selection lag depends directly on the number of processors and queue sizes, and has an adverse effect on the algorithm, increasing the number of evaluations required to find optimal solutions. Experiments on a real-world problem indicate that the effect of selection lag is negligible for a number of processors lower than about half the population size. Although the AMS-DEMO convergence rate appears to deteriorate slightly on this number of processors, we did not find this deterioration statistically significant. Only for larger numbers of processors does the increase in the number of evaluations become statistically significant. Nevertheless, we find that, when increasing the number of processors, the requirement for additional evaluations caused by the increased selection lag is outweighed by the additional computational resources provided by the processors, resulting in shorter optimization times and larger speedups. This finding is robust, and holds for all the performed experiments, even with numbers of processors up to several times the population size.
The constraints on the number of processors were also reduced, compared to the constraints imposed by the synchronous master-slave parallelization. The number of processors is not required to divide the population size and may even exceed it. Experiments on the computers of Grid’5000 showed that, with numbers of processors exceeding the population size, AMS-DEMO achieves speedups larger than the population size, and therefore larger than the theoretical limit for generational algorithms. Although we used no grid-computing middleware for our tests on Grid’5000, extending AMS-DEMO to use it should be possible. The asynchronous nature of AMS-DEMO makes it robust to communication failures and able to handle dynamic allocation of processing resources, and thus suitable for grid computing. However, additional work is needed to explore the behavior of AMS-DEMO in the presence of failures and on the grid.
We tested the AMS-DEMO algorithm on a benchmark problem and a real-life problem. On the benchmark problem SYMPART, the evaluation function is simple to compute, and we inserted a variable delay to observe how the weak speedup changes with evaluation time. As expected, AMS-DEMO was slower than the original DEMO in tests with an extremely short evaluation time. When the evaluation time was comparable to the communication time, AMS-DEMO performance improved, and in tests with an evaluation time several orders of magnitude longer than the communication time, AMS-DEMO achieved near-linear weak speedup. We also devised a simplified model of weak speedup for approximating the performance of AMS-DEMO.
The majority of the experiments were performed on the real-life multi-objective optimization problem of continuous steel casting, which requires the optimization of parameters of the industrial procedure according to two objectives. As a result of a computationally demanding and time-consuming evaluation function, which is based on a computer simulation, this problem is difficult to solve. Therefore, the parallelization of the optimization algorithm was used to make solving more manageable. The efficiency of the proposed AMS-DEMO algorithm was contrasted with the simpler and more straightforward synchronous master-slave parallelization method. The experiments reveal that the synchronous master-slave parallelism can be equally fast or slightly faster on a homogeneous architecture, even when the evaluation times are not constant. When conditions unfavorable to synchronous parallelism accumulate, however, AMS-DEMO gains an advantage, as the experiments on a heterogeneous Grid’5000 architecture show.
Although the predictions based on the analysis and the experimental results so far agree, AMS-DEMO should be further tested on other problems before making firm conclusions. Since the parallel properties of AMS-DEMO depend largely on the proposed asynchronous master-slave parallelization method and less so on the original DEMO algorithm, a sensible next step would be to investigate the proposed parallelization type independently. Its applicability to other algorithms, both single-objective and multi-objective, would be of special interest. Finally, a more in-depth understanding of the selection lag and new ways to minimize its negative effects remain topics for further work. The ways in which the selection lag distribution influences the algorithm convergence would be of great interest.