Abstract
Thirty years, 1993–2023, is a huge time frame in science. We address some major developments in the field of evolutionary algorithms, with applications in parameter optimization, over these 30 years. These include the covariance matrix adaptation evolution strategy and some fast-growing fields such as multimodal optimization, surrogate-assisted optimization, multiobjective optimization, and automated algorithm design. Moreover, we also discuss particle swarm optimization and differential evolution, which did not yet exist 30 years ago. One of the key arguments made in the paper is that we need fewer algorithms, not more; the current trend, however, points in the opposite direction, with paradigms continuously being claimed from nature and suggested to be useful as new optimization algorithms. Moreover, we argue that we need proper benchmarking procedures to sort out whether a newly proposed algorithm is useful or not. We also briefly discuss automated algorithm design approaches, including configurable algorithm design frameworks, as the proposed next step toward designing optimization algorithms automatically, rather than by hand.
1 Introduction
When asked to write a paper 30 years after the original work, “An Overview of Evolutionary Algorithms for Parameter Optimization” (Bäck and Schwefel, 1993), was published, I (Thomas Bäck) immediately thought this would be too big a challenge. In the early 1990s, the “big unification” was achieved in the field that formerly consisted of relatively strictly separated algorithmic branches: Genetic Algorithms, Genetic Programming, Evolution Strategies, and Evolutionary Programming, which were then united into what we today call Evolutionary Computation. Due to Evolutionary Computation's success in a huge range of application domains and its importance as a foundation for many subfields of computer science, the development of the field over the past 30 years has been tremendous, going far beyond what we can report in this paper. In the late 1990s, the Handbook of Evolutionary Computation (Bäck et al., 1997) represented an attempt at summarizing the state of the art based on a unified perspective of the field, and the editors took up this challenge since there was at least some hope of making it happen.
In the following, we provide a list of what we subjectively consider important developments over the past 30 years, and we elaborate on some of those in detail across the paper.
The development of the Covariance Matrix Adaptation Evolution Strategy, in short CMA-ES (Hansen and Ostermeier, 1996), has defined a new state of the art and a whole branch of algorithms derived from the CMA-ES (see Bäck et al., 2013, for an overview). Section 2.1 provides more details about this development.
Genetic Algorithms have been specialized into many powerful variants for combinatorial optimization problems, which are beyond the scope of this paper. When it comes to our problem definition (see Equation 1), the Genetic Algorithms often still exhibit strong similarities with the canonical Genetic Algorithm described by Bäck and Schwefel (1993), as discussed in Section 2.2.
Parameter control methods, dating back to the early days of Evolution Strategies with Rechenberg's 1/5-success rule (Rechenberg, 1973) and Schwefel's self-adaptation methods (Schwefel, 1981), have now become a widespread technique in Evolutionary Computation. Section 2.2 discusses this in the context of Genetic Algorithms, and Section 3.1 provides a more general perspective.
When it comes to identifying multiple local optima at the same time in so-called multimodal optimization, and its generalization to find solutions that exhibit behavioral properties as diverse as possible, new techniques in Novelty Search and Quality-Diversity Search have been developed. These algorithms are briefly discussed in Section 3.2.
Constraint handling methods have been developed well beyond penalty-function and re-evaluation-based approaches that were popular in the early 1990s; see Section 3.3 for details.
For expensive objective function evaluations, Surrogate-Assisted Evolutionary Algorithms use nonlinear regression methods from the field of machine learning to learn fast-to-evaluate proxy functions for objective(s) and constraints (Jin, 2011). Section 3.4 provides a few of the corresponding developments in this area of research.
Multiobjective optimization (Deb, 2009; Coello Coello, 2006) has developed from only a few publications into a huge field within Evolutionary Computation due to the powerful ability of a population to approximate a Pareto-optimal frontier in its entirety by using concepts such as Pareto dominance or performance indicators. Section 3.5 explains some of the key developments and state of the art.
Automated algorithm design using configurable frameworks, discussed in Section 3.6, is a promising new approach towards automatically creating new algorithms for parameter optimization and dynamically configuring algorithms.
Particle swarm optimization and differential evolution, two relevant classes of evolutionary algorithms that did not exist yet in 1993, are discussed briefly in Sections 3.7 and 3.8.
It is impossible to discuss the wealth of new theoretical results in the field, so we can only give a very brief list of relevant works in Section 4.
In Section 5, we briefly discuss the flood of “new” paradigms from nature that are proposed as optimization algorithms, as well as the urgent need for appropriate benchmarking procedures for qualifying new methods; in Section 6, a brief outlook is provided.
Most recently, we have seen many tremendously important developments in the field, for instance, new benchmarking standards and test function sets, statistical analysis methods, and software tools for experimentation, visualization, and analysis of experimental results (Bäck et al., 2022). In Section 5.2, we address some key developments in benchmarking and argue that proper benchmarking and empirical analysis of optimization algorithms is critical to assessing their performance and their value to the community and, more importantly, to the domain expert who wants to solve an optimization problem but is not an expert in optimization algorithms. Related to this, we also need to stress that it is of paramount importance to properly benchmark any proposed “natural metaphor-based” optimization algorithm against established state-of-the-art methods and to refrain from an unjustified use of non-mathematical terminology to describe such algorithms. As outlined in Section 5.1, we fully concur with critical voices such as Sörensen (2015) and Campelo and Aranha (2018, 2021), and would like to encourage the research community to support the publication of new algorithms only if they are adequately defined using mathematical notation and appropriately benchmarked against well-known state-of-the-art algorithms.
Going beyond proposing expert-designed optimization algorithms, a fascinating idea is to automatically design optimization algorithms that are best-in-class for a given (set of) optimization problems. Approaches such as hyper-heuristics (Pillay and Qu, 2018; Drake et al., 2020), which can be based on genetic programming (Burke et al., 2009), have been proposed for such automated algorithm design. More recently, algorithm configuration methods have also been proposed very successfully for genetic algorithms (Ye et al., 2022), modern CMA-ES-based evolution strategies (van Rijn et al., 2016), and differential evolution and particle swarm optimization (Boks et al., 2020; Camacho-Villalón et al., 2022b). Although these ideas are not completely new in the domain of genetic algorithms (see Grefenstette, 1986, for hyperparameter optimization; Bäck, 1994, for algorithm configuration and hyperparameter optimization), they can now be applied at a massive scale, with proper benchmarking procedures for comparing the algorithmic variants, and also facilitating both configuration and hyperparameter optimization as a systematic next step.
There are many important aspects of evolutionary computation that are not covered in this paper simply because we had to make choices.
2 Evolutionary Computation for Parameter Optimization—30 Years Later
We assume familiarity with the underlying ideas of evolutionary computation, including a set of candidate solutions (the population), a given optimization problem (as in Equation 1), variation operators such as mutation and recombination, and one or more selection operators—and an iteration of the operator loop until a given termination criterion is satisfied. In Bäck and Schwefel (1993), we have presented the general algorithmic loop of evolutionary algorithms and how it can be used to describe the various instances. In this section, we briefly summarize some of the key developments since 1993 concerning parameter optimization variants.
2.1 Evolution Strategies
Evolution strategies, like all other EAs, have seen tremendous development over more than 60 years, starting from the early work at the Technical University of Berlin (Schwefel, 1965; Rechenberg, 1965, 1973). The original algorithm, a (1+1)-ES, was inspired by the process of biological evolution and used a simple iterative procedure that mutates the search point $x \in \mathbb{R}^n$ of a single individual by a multivariate normal distribution, that is, $x' = x + \sigma z$ with $z \sim N(0, I)$, and assigns $x := x'$ iff $f(x') \le f(x)$. Experimenting with engineering optimization problems, Schwefel and Rechenberg quickly discovered the need for adaptively learning the mutation step size $\sigma$, a concept that then became central to evolution strategies: The self-adaptation of strategy parameters.
For the (1+1)-ES, it started with the so-called 1/5-success rule for updating $\sigma$, which was derived from theoretical results on the sphere and corridor models (Schwefel, 1977). With multimembered evolution strategies, for example, the $(\mu + \lambda)$-ES and the $(\mu, \lambda)$-ES, which have $\mu$ parents and $\lambda$ offspring, mutative self-adaptation was introduced. With these methods, individuals are represented as tuples $(x, \sigma)$, where in addition to the search point $x$, a set of endogenous strategy parameters $\sigma$ is evolved by the ES, based on the intuition that good search points sprout from good strategy parameters (Beyer and Schwefel, 2002). Schwefel's strategy extensions additionally allowed for the self-adaptation of coordinate-specific step sizes and covariances for mutations, using a lognormal distribution for their variation (Schwefel, 1981). In 1993, this $(\mu, \lambda)$-mutational step size control (MSC)-ES (Schwefel, 1981) was considered state of the art and was the first evolution strategy that was able to adapt the parameters of the mutation distribution to an arbitrary covariance matrix and generate correlated mutations.
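To make this concrete, the following minimal Python sketch implements a (1+1)-ES with 1/5-success-rule step-size control on the sphere function; the adaptation window of 20 iterations and the factor 0.85 are illustrative choices rather than values prescribed by the original publications.

```python
import numpy as np

def one_plus_one_es(f, x0, sigma=1.0, budget=2000, window=20, factor=0.85):
    """Minimal (1+1)-ES with 1/5-success-rule step-size control (illustrative)."""
    rng = np.random.default_rng(42)
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    successes = 0
    for t in range(1, budget + 1):
        y = x + sigma * rng.standard_normal(x.shape)   # mutate with N(0, sigma^2 I)
        fy = f(y)
        if fy <= fx:                                   # accept only non-worsening steps
            x, fx = y, fy
            successes += 1
        if t % window == 0:                            # adapt sigma every `window` steps
            rate = successes / window
            sigma = sigma / factor if rate > 1 / 5 else sigma * factor
            successes = 0
    return x, fx

sphere = lambda x: float(np.sum(np.asarray(x) ** 2))
best, value = one_plus_one_es(sphere, x0=np.full(10, 5.0))
print(value)
```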
However, as pointed out by Hansen et al. (1995), this strategy strongly depends on the chosen coordinate system. It is not invariant to search space rotations—an insight that led to a development that again, after the discovery of self-adaptation, has significantly advanced the state of the art through derandomization (Section 2.1.1) and the Covariance Matrix Adaptation ES and its derivatives (Section 2.1.2), which are now state of the art in the field.
2.1.1 Derandomized (DR) Evolution Strategies
Derandomized evolution strategies moved away from mutative self-adaptation and used global strategy parameters for all the individuals in the population. This was motivated by the observation that mutative self-adaptation of individual step sizes is unsatisfactory for small populations (Schwefel, 1987), which is attributed to two key reasons: (i) good strategy parameters do not necessarily cause a successful mutation of the search points, and (ii) there is a conflict between maintaining a diverse set of strategy parameters within the population and the stability of the search behaviour (Ostermeier et al., 1994a). Derandomized evolution strategies aim to achieve self-adaptation without any independent stochastic variation of the strategy parameters. This is addressed in DR1 by using the length of the most successful mutation vector to update $\sigma$ and by applying a dampening factor in order to reduce strong oscillations of $\sigma$ over time (Ostermeier et al., 1994a).
With DR2 (Ostermeier et al., 1994b), in addition to the global step size $\sigma$, local step sizes are included, which control the variability of each component of the decision variables. The update of both the local and global step sizes is not only based on the best mutation vector in the current generation, but also on the best moves of previous generations. This information is collected in an accumulation vector, which is updated based on a global learning rate.
DR3 (Hansen et al., 1995), or the Generating Set Adaptation Evolution Strategy (GSA-ES), includes a representation of the covariance matrix as a global strategy parameter and can produce mutations from an arbitrary multivariate normal distribution, similar to the $(\mu, \lambda)$-MSC-ES. Moreover, the GSA-ES is able to learn problem scaling and is invariant to search space rotations. Instead of using a direct representation for the covariance matrix, a matrix $B \in \mathbb{R}^{n \times m}$ is used, of which the $m$ column vectors form a so-called generating set. Correlated mutation vectors are generated by multiplying a random normal vector with $B$. The number of columns $m$ of the matrix can be used to control the speed of adaptation, where a smaller value of $m$ yields faster adaptation and a larger value increases the accuracy.
2.1.2 Completely Derandomized Self-Adaptation in Evolution Strategies
The algorithms described in the previous section, DR1-3, introduce the first level of derandomization of parameter control in evolution strategies, which reduces selection disturbance on the strategy level. The second level of derandomization, or complete derandomization, completely removes this disruption and aims to explicitly realize the original objective of mutative strategy parameter control: to favor strategy parameters that produce selected mutation steps with high probability (Hansen and Ostermeier, 2001). In addition, the second level of derandomization also aims to provide a direct control mechanism for the rate of change of strategy parameters and requires that strategy parameters are left unchanged in the case of random selection.
The Covariance Matrix Adaptation Evolution Strategy (CMA-ES), introduced in Hansen and Ostermeier (1996), satisfies these conditions through two key techniques, Covariance Matrix Adaptation (CMA) and Cumulative Step length Adaptation (CSA). A series of papers (Hansen and Ostermeier, 1996, 1997, 2001) refined the CMA-ES to use weighted recombination for $\mu > 1$ parents. In this section, we discuss the $(\mu/\mu_W, \lambda)$-CMA-ES as introduced in Hansen and Ostermeier (2001), focusing on CMA and CSA. For a complete tutorial on the CMA-ES, we refer the interested reader to Hansen (2016).
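For readers who want to experiment, a typical ask-and-tell loop with the widely used pycma implementation of the CMA-ES looks roughly as follows; this is a sketch assuming the third-party `cma` package is installed, and the test function, dimensionality, and option values are arbitrary choices.

```python
import cma  # third-party package: pip install cma

def sphere(x):
    return sum(xi ** 2 for xi in x)

# start from x0 = (5, ..., 5) in 10-D with initial global step size sigma0 = 2.0
es = cma.CMAEvolutionStrategy(10 * [5.0], 2.0, {'maxfevals': 5000, 'verbose': -9})
while not es.stop():
    candidates = es.ask()                                  # sample offspring from N(m, sigma^2 C)
    es.tell(candidates, [sphere(x) for x in candidates])   # rank-based update of m, sigma, and C
print(es.result.xbest, es.result.fbest)
```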
Extensions. Several modifications to the CMA-ES have been proposed over the years, but the core algorithm remains the current state of the art in ES, ranking high in benchmarks for numerical optimization (Hansen et al., 2010), with many successful applications in real-world optimization problems (Hansen, Niederberger et al., 2009; Bredèche, 2008; Winter et al., 2008). Early extensions to the CMA-ES involved the introduction of local restart strategies (Auger et al., 2004), which use multiple criteria to determine whether to restart the algorithm at certain points during the optimization run. This gave rise to the introduction of the IPOP-CMA-ES (Auger and Hansen, 2005), which upon a restart increases the size of the population, and the BIPOP-CMA-ES (Hansen et al., 2010), which balances between larger and smaller population sizes during restarts. The (1+1)-Cholesky CMA-ES (Igel et al., 2006) uses an implicit method for adapting the covariance matrix, without using an eigendecomposition but by leveraging the so-called Cholesky decomposition, which has the effect of reducing the runtime complexity in each generation from $O(n^3)$ to $O(n^2)$. This was later extended to the $(\mu/\mu_W, \lambda)$-Cholesky CMA-ES (Krause et al., 2016), which uses triangular Cholesky factorization instead of the inverse square root of the covariance matrix. Since, in general, the CMA-ES scales poorly with increasing dimensionality $n$, several variants have been proposed for large-scale optimization such as the sep-CMA-ES (Ros and Hansen, 2008) and dd-CMA-ES (Akimoto and Hansen, 2020); see Varelas et al. (2018) for a review. The $(\mu/\mu_W, \lambda)$-CMA-ES uses the information of the $\mu$ best individuals in the population to compute the update of the covariance matrix. The Active-CMA-ES improves upon this idea by also including the information from the worst individuals and scaling them with negative weights. Alternative strategies for updating the global step size $\sigma$ have been proposed, such as the two-point step-size adaptation (Hansen, 2008) and the median success rule (ElHara et al., 2013). Mirrored sampling and pairwise selection (Auger et al., 2011) seek to improve the mutation operator by generating pairs of mirrored mutation steps and only selecting the best out of each mirrored pair for updating the strategy parameters. This was later extended to use orthogonal sampling (Wang et al., 2014). Many of the extensions mentioned in this section can be combined arbitrarily to produce novel CMA-ES variants (van Rijn et al., 2016; de Nobel et al., 2021a), even ones that were not originally proposed for use with the CMA-ES, such as the usage of quasi-random mutations (Auger et al., 2005). Hybrid CMA-ES algorithms, such as the HCMA, BIPOP-Active-CMA-STEP (Loshchilov et al., 2013), and IPOP-Active-CMA-ES (Faury et al., 2019), currently rank at the top of the well-known BBOB single-objective noiseless benchmark (Hansen et al., 2021).
This brief overview clarifies how much the subfield of evolution strategies has diversified after the introduction of the CMA-ES in the second half of the 1990s. All variants based on the CMA-ES use historical information about successful mutations to support continual learning of the mutation distribution. Even for this branch of evolutionary computation, gaining a complete overview of the state of the art in the field becomes very difficult, as exemplified by the subset of algorithm variants that have been collected and empirically compared by Bäck et al. (2013).
2.2 Genetic Algorithms
The genetic algorithm (GA) is likely still the most widely known variant of evolutionary algorithms. It was initially conceived in Holland's work on general adaptive processes (Holland, 1975). The commonly known GA, referred to as the canonical GA or simple GA (SGA), was the first example of applying genetic algorithms to parameter optimization (De Jong, 1975). The SGA applies a binary representation and proportional selection, which is based on the fitness values of individuals. It emphasizes the use of recombination, that is, crossover, as the essential variator for generating offspring, while preferring a low mutation probability. After the initialization step, the procedure of an SGA can be briefly described as an optimization loop that successively applies the operators selection, crossover, mutation, and evaluation until a stopping criterion is met. Following years of development, the GA variants that have been proposed for applications in different domains still primarily follow this algorithmic skeleton, but depending on the application domain (and, correspondingly, the search space), numerous variations have been proposed. A complete overview is beyond the scope of this paper, which focuses on continuous parameter optimization.
2.2.1 Decoding Is Not a Unique Option
A GA used to be characterized by its encoding and decoding system. The optimization operators operate on the genotype, while the problem-specific solutions are represented as the phenotype. The SGA applies a binary genotype representation, that is, $\{0,1\}^{\ell}$, where $\ell$ is the length of the bit string. When dealing with continuous and discrete parameter optimization problems, encoding and decoding methods are applied to transfer solutions between genotype and phenotype.
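For illustration, one common genotype-to-phenotype decoding maps each block of bits linearly to a real value in a given interval; the block length and interval in the following sketch are arbitrary choices.

```python
def decode(bits, lower, upper, n_vars):
    """Decode a binary genotype into n_vars real values in [lower, upper]."""
    k = len(bits) // n_vars                          # bits per variable
    values = []
    for i in range(n_vars):
        block = bits[i * k:(i + 1) * k]
        integer = int("".join(map(str, block)), 2)   # binary block -> integer
        values.append(lower + integer / (2 ** k - 1) * (upper - lower))
    return values

genotype = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1]      # 12 bits, 2 variables of 6 bits each
print(decode(genotype, lower=-5.0, upper=5.0, n_vars=2))
```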
Nowadays, the GA supports representations for arbitrary types of variables, such as real, integer, and categorical values, so that an explicit encoding and decoding system is unnecessary for some variants, for example, in complex GA applications such as engineering (Dasgupta and Michalewicz, 2013), neural networks (Leung et al., 2003), and permutation-based problems (Whitley, 2019; Larrañaga et al., 1999; Reeves and Yamada, 1998; Costa et al., 2010). Also, novel operators have been proposed to deal with non-binary representations.
2.2.2 Real-Coded Operators
For continuous optimization, GA variants with non-binary representations usually apply the following collection of real-coded crossover operators (a short sketch of the first two operators appears after this list):
Blend crossover (BLX) (Eshelman and Schaffer, 1992) samples values of offspring from a uniform distribution, of which the range depends on the values of the two parents.
Simulated binary crossover (SBX) (Deb et al., 1995) simulates the one-point crossover of the binary representation. It creates two offspring based on the linear combination of two parents, and the scale of each parent's value depends on a distribution with a given spread factor.
Unimodal normal distribution crossover (UNDX) (Ono et al., 1999) samples offspring values as linear combinations of realizations of two normal distributions. The first distribution's mean is the average of two parents' values, and its standard deviation is proportional to their distance. The second distribution's mean is 0, and its variance depends on the distance from a third parent to the line connecting the first two parents.
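A minimal sketch of the first two operators, BLX and SBX, may make the sampling mechanics concrete; the parameter values $\alpha = 0.5$ and $\eta = 15$ are common illustrative defaults, not prescriptions from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def blx_alpha(p1, p2, alpha=0.5):
    """BLX-alpha: sample each offspring gene uniformly from an extended parent interval."""
    lo, hi = np.minimum(p1, p2), np.maximum(p1, p2)
    d = hi - lo
    return rng.uniform(lo - alpha * d, hi + alpha * d)

def sbx(p1, p2, eta=15.0):
    """Simulated binary crossover: spread factor beta mimics one-point binary crossover."""
    u = rng.uniform(size=p1.shape)
    beta = np.where(u <= 0.5,
                    (2 * u) ** (1 / (eta + 1)),
                    (1 / (2 * (1 - u))) ** (1 / (eta + 1)))
    c1 = 0.5 * ((1 + beta) * p1 + (1 - beta) * p2)
    c2 = 0.5 * ((1 - beta) * p1 + (1 + beta) * p2)
    return c1, c2

p1, p2 = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 4.0])
print(blx_alpha(p1, p2))
print(sbx(p1, p2))
```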
Recent works mostly focus on the modification and self-adaptivity of the distributions used to sample offspring. For example, the simplex crossover (SPX) (Tsutsui et al., 1999) samples offspring values as combinations of values sampled uniformly at random based on pairs formed by multiple parents, which can be seen as an extension of BLX to multiple parents. Moreover, self-adaptive and dynamic models have been proposed for SBX (Deb et al., 2007; Chacón and Segura, 2018; Pan et al., 2021). The parent-centric crossover (PCX) (Deb, Anand, et al., 2002) modified the UNDX by introducing a parent-centric concept, and the PNX (Ballester and Carter, 2004) followed the idea of PCX and simplified the normal distribution used to sample offspring based on the deviation of the two parents.
For the real-coded mutation operators, one of the earliest related works (Janikow et al., 1991; Neubauer, 1997), known as non-uniform mutation, increases and decreases variable values with equal probability. The mutation distance, that is, the deviation between the parent and offspring values, is proportional to the deviation between the boundary and the parent values, and this deviation approaches 0 after a large number of generations. Also, the more recently proposed wavelet mutation (Ling and Leung, 2007) samples the mutation distance using the Morlet wavelet, which is distributed symmetrically with an expected value of zero. Moreover, novel mutation operators (Korejo et al., 2010; Temby et al., 2005) search towards promising regions of the space based on progress feedback from previous updates of the population.
2.2.3 Developments in Binary Representation
At the same time, GAs with the binary representation are still applied in many domains, for example, pseudo-Boolean optimization. For the binary representation, mutation flips the chosen bits of a parent, and crossover swaps the chosen bits of the two parents.
The classic standard bit mutation flips each bit of an individual with probability $p_m$. It was seen as a “background operator” with a small $p_m$, later $p_m = 1/n$ (with $n$ the length of the bit string), in early GAs (Holland, 1975; Bäck and Schwefel, 1993). We can rephrase the mutation procedure as flipping $k$ bits chosen uniformly at random, where standard bit mutation samples $k$ from the binomial distribution $\mathrm{Bin}(n, p_m)$. Recent work on mutation-only EAs studies the choice of $k$. For example, the so-called fast GA applies a heavy-tailed power-law distribution to sample $k$ (Doerr et al., 2017), and a normalized bit mutation samples $k$ using a normal distribution in Ye et al. (2019). Moreover, it has been shown that the optimal value of $p_m$ depends on the problem dimensionality (Witt, 2013; Doerr et al., 2019).
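The following sketch contrasts standard bit mutation, which samples the number of flipped bits from a binomial distribution, with a heavy-tailed variant in the spirit of the fast GA; the power-law exponent $\beta = 1.5$ is an illustrative choice within the range studied in the cited work.

```python
import numpy as np

rng = np.random.default_rng(1)

def flip_k_bits(x, k):
    """Flip k distinct bit positions chosen uniformly at random."""
    y = x.copy()
    idx = rng.choice(len(x), size=k, replace=False)
    y[idx] ^= 1
    return y

def standard_bit_mutation(x, p=None):
    """Sample k ~ Bin(n, p) with the classical default p = 1/n, then flip k bits."""
    n = len(x)
    k = rng.binomial(n, 1.0 / n if p is None else p)
    return flip_k_bits(x, k)

def heavy_tailed_mutation(x, beta=1.5):
    """Fast-GA-style mutation: k is drawn from a power law on {1, ..., n/2}."""
    n = len(x)
    ks = np.arange(1, n // 2 + 1)
    probs = ks ** (-beta)
    probs /= probs.sum()
    k = rng.choice(ks, p=probs)
    return flip_k_bits(x, k)

x = rng.integers(0, 2, size=20)
print(standard_bit_mutation(x))
print(heavy_tailed_mutation(x))
```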
Crossover, another key operator of the GA, has important characteristics that mutation lacks (Spears, 1993). Theoretical studies have tried to explain how crossover works in a GA and indicate that crossover can capitalize on the population and on mutation (Doerr et al., 2015; Dang et al., 2017; Corus and Oliveto, 2017; Sudholt, 2017; Pinto and Doerr, 2018; Antipov et al., 2020). A recommended value of the crossover probability, found in earlier empirical studies, was mentioned in Bäck and Schwefel (1993). However, a recent study shows that this suggested value is not optimal in some cases; for example, the optimal value changes with population size and dimensionality for the LeadingOnes problem (Ye et al., 2020).
Note that many theoretical studies exist about GAs for discrete optimization (Doerr and Neumann, 2020). Also, much evidence indicates that the optimal parameters (i.e., mutation rate, crossover probability, and population size) are dynamic, which relates to the topic of parameter control (Karafotias et al., 2014; Aleti and Moser, 2016).
2.2.4 Learning in Genetic Algorithms
The schema theorem plays an important role in the early theoretical study of genetic algorithms. It considers the hypothesis that good GAs combine building blocks, which are short, low-order, and above-average schemata, to form better solutions (Goldberg, 1989; White, 2014). The schema theorem has some drawbacks (Eiben and Rudolph, 1999): for example, the building-block hypothesis implicitly assumes that problems are separable, and the theorem cannot explain the dynamic behavior and limits of GAs. Nevertheless, it highlights that one key issue for GAs is the destruction of building blocks. Furthermore, to take “correct” actions, contemporary methods extract and apply useful information learned from the population of GAs, though this information is not necessarily about building blocks.
Linkage learning was introduced to form a novel crossover operator that avoids destroying building blocks (Harik and Goldberg, 1996). Later on, hierarchical clustering was applied to learn a linkage tree, creating building blocks for the crossover of the linkage tree GA (Thierens, 2010), which belongs to the family of gene-pool optimal mixing EAs (GOMEAs) (Thierens and Bosman, 2011). GOMEA learns a general linkage model of the maintained population, called the Family of Subsets (FOS). A FOS is a set of subsets of variables, and for each subset, GOMEA varies the values of the corresponding variables. Many GOMEA variants exist, and the algorithms have been applied in both discrete and continuous optimization (Bosman and Thierens, 2013; Bouter et al., 2017).
The compact GA (CGA) (Harik et al., 1999) is another GA extension inspired by the idea of building blocks. However, the CGA represents the population as a distribution over the set of solutions. New solutions are sampled based on the distribution, and the distribution is updated in each generation. Nowadays, this technique is associated with the family of estimation of distribution algorithms (EDAs) (Hauschild and Pelikan, 2011). EDAs maintain a stochastic model of the search space, and this model can be built for discrete, continuous, and permutation-based solutions.
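To illustrate the EDA principle in its simplest form, here is a minimal sketch of a compact-GA-style algorithm on the OneMax function; the virtual population size $K$ and the probability clipping are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def compact_ga(f, n, budget=5000, K=50):
    """Compact GA sketch: evolve a probability vector instead of an explicit population."""
    p = np.full(n, 0.5)                        # model: independent bit-wise probabilities
    best, best_f = None, -np.inf
    for _ in range(budget // 2):
        a = (rng.random(n) < p).astype(int)    # sample two candidate solutions
        b = (rng.random(n) < p).astype(int)
        winner, loser = (a, b) if f(a) >= f(b) else (b, a)
        p += (winner - loser) / K              # shift the model toward the winner
        p = np.clip(p, 1.0 / n, 1.0 - 1.0 / n) # avoid premature fixation of the model
        if f(winner) > best_f:
            best, best_f = winner, f(winner)
    return best, best_f

onemax = lambda x: int(np.sum(x))
print(compact_ga(onemax, n=30))
```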
2.2.5 Outlook
As before, the GA can still be described by the iterative sequence of selection, variation, and evaluation. A significant difference from the SGA introduced in Bäck and Schwefel (1993) is that contemporary GAs do not necessarily follow the generational model with a constant population size. Moreover, more techniques are available for the variators to create new solutions. Novel operators have been proposed for mutation and crossover. In addition, contemporary variators learn to create new solutions intelligently using linkage information. GAs have been applied in various domains, such as permutation-based problems (Oliver et al., 1987; Mathias and Whitley, 1992; Nagata and Kobayashi, 1997; Whitley et al., 2010; Goldberg and Lingle, 1985), although we cannot list all the applications here. While more and more techniques appear in various problem-specific domains, theoretical work is developing to help us understand how the GA works. However, providing guidelines for selecting the optimal operators for a given problem remains challenging, while the GA community is putting effort towards this goal (Bartz-Beielstein et al., 2020; Ye, 2022).
2.3 Evolutionary Programming
The development of evolutionary programming (EP) also dates back to the mid-1960s. Initial work was presented by Fogel et al. (1965, 1966) for evolving finite state machines (FSM) with the aim of solving prediction tasks generated from Markov processes and non-stationary time series. EP was originally proposed as an alternative method for generating machine intelligence (Fogel, 1994). Genetic variation was performed by uniform random perturbations over the corresponding discrete, finite alphabet. Each candidate machine in the parent population produced one offspring by mutation. The best-performing half of the union of parents and offspring was then selected for the following generation. This selection method is analogous to the $(\mu + \lambda)$ strategy commonly used in ES. In addition, the original work for evolving finite state machines also offered the potential for exchanging parts of finite state machines, which one might consider as a proposal for a “crossover” operator (Fogel et al., 1966).
D. B. Fogel later enhanced EP by adapting genetic variation to real-valued variables with normally distributed mutations (see, e.g., Fogel et al., 1990; Fogel, 1992, 1993). EP traditionally did not use any form of crossover; therefore, the evolutionary process typically relied on a set of mutation techniques that have been proposed over the last decades. In general, EP viewed each candidate solution as a reproducing population, whereby the mutation operator used simulates all changes that occur between the current and the subsequent generation. Like evolution strategies, EP stressed the role of mutation in the evolutionary process. Traditionally, EP used normally distributed mutation as the primary search operator. The annual conference on EP was held during the 1990s, until 1998. Several studies in the mid- and late 1990s showed that using a Cauchy random variable can enhance performance for certain parameter optimization problems (Yao and Liu, 1996). Moreover, a multi-operator approach to EP was proposed in this period. It was shown that this multi-mutational self-adaptive EP strategy was able to adjust the use of both the Gaussian and Cauchy mutations, and was thereby able to provide greater rates of optimization when compared to the sole use of Cauchy mutations (Saravanan and Fogel, 1997). A runtime analysis of EP using Cauchy mutations has been presented by Huang et al. (2010).
EP has been very successful in evolving strategies for certain games (see, e.g., work by Chellapilla and Fogel, 1999a). Most remarkable is the use of EP to evolve neural networks that were able to play checkers without relying on expert knowledge (Chellapilla and Fogel, 1999b; Fogel, 2000). In this approach, EP created a population of 15 artificial neural networks. Each candidate network created an offspring with Gaussian mutation that was controlled by a self-adaptive step size. To validate the strength of the best-evolved networks, the created artificial checkers players operated within a computer program called Blondie24 on the online board gaming site Zone.com. Over a two-month period, 165 games against human opponents were played. The estimated rating of Blondie24 was on average 2045.85 with a standard error of 0.48, whereas the master level starts at a rating of 2200. The best result from the 165 games was seen when the network defeated a player with a rating of 2173, which is only 27 points away from the master level. The story of Blondie24 is described in David B. Fogel's book Blondie 24—Playing at the Edge of AI (Fogel, 2002). Later on, Fogel applied the approach to chess and created Blondie25, a chess program that was able to win over a nationally ranked human chess player and also showed some wins (one out of twelve games for black, two of twelve for white) over Fritz 8.0, which was the number five ranked program in the world at that time (Fogel et al., 2006; see also Fogel, 2023).
When considering parameter optimization problems as defined in Equation (1), early versions of ES and EP shared quite a few similarities, as we already outlined in Bäck and Schwefel (1993). Since around the year 2000, algorithms based on the concept of the CMA-ES have been more prominent for this type of application than EP-based approaches, and today one finds that the term evolutionary programming is sometimes used to denote algorithms for continuous parameter optimization that are solely based on mutation (Abido and Elazouni, 2021), or the term is confused with genetic programming (Jamei et al., 2022).
3 Other Important Developments
In this section, a few developments are discussed that we consider of strong importance for the field; however, this is again a subjective selection, and a wide range of topics cannot be covered in this paper.
3.1 Parameter Control
It is truly amazing to see that early parameter control techniques such as the 1/5-th success rule (Rechenberg, 1973) and Hans-Paul's early approaches for adapting variance(s) and covariances of an $n$-dimensional Gaussian distribution (Schwefel, 1981), later on proposed in a similar way for genetic algorithms and binary search spaces (Bäck and Schütz, 1996; Kruisselbrink et al., 2011), as well as mixed search spaces (Li et al., 2013), have now turned into a standard approach in evolutionary computation. Thanks to the wonderful progress that was achieved in the theoretical analysis of evolutionary algorithms in the past 30 years, we also now have a fairly good understanding of why such approaches are helpful, how to analyze them, and what the time complexity and optimal parameter settings for some (classes of) test functions look like—both for specific pseudo-Boolean optimization problems and for specific continuous parameter optimization problems. This field of research is huge now, and giving an overview is impossible, so a selection of just a few research works is incomplete and subjective, and therefore we abstain from trying to do so here.
3.2 Multimodal Optimization, Novelty Search, and Quality Diversity Algorithms
In this subsection, we consider the development of multimodal optimization over the last 30 years and a relatively new generalization of such optimization referred to as Novelty Search and Quality-Diversity algorithms. First, we introduce and motivate multimodal optimization, highlight the development of niching as one of the most important concepts in multimodal optimization, and then describe several algorithms that use this concept, including the most promising state-of-the-art algorithm. Second, we show how the idea of multimodal optimization developed into a more general setting by introducing and motivating Novelty Search and Quality-Diversity. Third, we introduce the Multidimensional Archive of Phenotypic Elites (MAP-Elites) algorithm as one of the historically first and most influential methods in quality diversity and discuss its limitations and how they are overcome. We finish with a discussion of state-of-the-art quality-diversity algorithms and the scenarios of their application.
3.2.1 Multimodal Optimization
Traditional single-objective optimization focuses on obtaining a solution with the best possible quality with regard to the given objective function in the smallest possible time. However, this exact setup may not be what engineers and other practitioners target when they apply optimization at the beginning of the construction process. For this kind of application, it is relevant to explore possible trade-offs that give sufficiently good performance in acceptable time (Bradner et al., 2014). Such exploration may be seen as the task of obtaining a diverse set of solutions with a fitness value above a given threshold. This task is called multimodal optimization (MMO) since it is common for the considered objective functions to have many local optima or modes.
To approach MMO, a number of algorithms were developed. The first algorithms of this type were proposed back in the 1970s (De Jong, 1975). They were extensions of genetic algorithms by niching methods that aim to maintain the diversity in the population (Preuss, 2015). Works in the mid-1990s followed in which niching methods were improved. A clearing method was proposed (Pétrowski, 1996) where similar individuals are penalized when their number exceeds a given threshold. Restricted tournament selection modifies the selection procedure for creating the new population (Harik, 1995) by restricting the competition to take place only between similar offspring. This ensures that the population has a diverse set of individuals even if the quality of some of them is worse than others. Both of these niching methods use distance in the search space as the criterion of similarity. The threshold (or niche radius) for this criterion in particular, and other parameters of those niching algorithms in general, cannot be efficiently determined a priori without assumptions about the optimized function. A series of works followed to solve this problem; in particular, Shir and Bäck (2006) and Shir et al. (2007) proposed adaptively controlling this threshold during the optimization process. The latter work also introduced niching into the CMA-ES for the first time. A comprehensive survey on variations of niching can be found in Das et al. (2011).
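As an illustration of the niching principle, the following self-contained sketch implements a clearing-style procedure; the niche radius and capacity are assumptions that, as noted above, generally cannot be fixed a priori.

```python
import numpy as np

def clearing(population, fitness, radius=0.5, capacity=1):
    """Clearing-style niching sketch: within each niche of the given radius, keep the
    fitness of the `capacity` best individuals and reset ("clear") the rest to -inf."""
    fitness = np.asarray(fitness, dtype=float).copy()
    order = np.argsort(-fitness)                       # best first (maximization)
    for i, idx in enumerate(order):
        if not np.isfinite(fitness[idx]):              # already cleared
            continue
        winners = 1
        for jdx in order[i + 1:]:
            if not np.isfinite(fitness[jdx]):
                continue
            if np.linalg.norm(population[idx] - population[jdx]) < radius:
                if winners < capacity:
                    winners += 1
                else:
                    fitness[jdx] = -np.inf             # cleared: loses all selection pressure
    return fitness

pop = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 2.0], [2.05, 2.0]])
fit = [3.0, 2.5, 4.0, 3.9]
print(clearing(pop, fit))   # only the best individual per niche keeps its fitness
```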
Apart from niching in evolutionary algorithms, the same idea of diversity in the population can be applied to other metaheuristics. Particle swarm optimization was extended with niching by Qu et al. (2012); their method uses the Euclidean distance-based neighborhood of every particle and picks several locally best-performing ones to guide the search for every particle. Thomsen (2004) proposed extending differential evolution with niching using the classical idea of penalizing individuals that are close to each other. Surveys by Das et al. (2011) and Li et al. (2016) thoroughly describe other important methods and ideas for extending existing algorithms to MMO settings. According to Preuss et al. (2021), one of the leaders in the MMO competitions held at CEC/GECCO from 2016 to 2019, and the winner among niching algorithms when a greater budget is given, is HillVallEA (Maree et al., 2018). This algorithm makes extra evaluations of the objective function to check the existence of worse solutions (hills) between a given pair of solutions in order to decide whether they belong to the same niche (valley).
3.2.2 Novelty Search
In the optimization community, mainly in evolutionary robotics, diversity with regard to the behavior of the solutions started to be considered around 2010 (Lehman and Stanley, 2008). In such domains, the individuals do not describe the complete solution, but the complete description may be obtained during the evaluation of the objective function, for example, through the simulation of robot movements. The set of all possible descriptions is called the behavior space (BS). The first algorithms that approached this modified goal aimed to explore this space, because the connection with the genotype space is not known to the optimizer. Such approaches, known as novelty search algorithms (Lehman and Stanley, 2011a), changed the goal from obtaining high-quality solutions to the exploration of the BS. Instead of optimizing the single given value of the objective function, they optimize the distance in BS between the sampled solution and the solutions in the previously collected archive. Illumination of the landscape of the BS produces a diverse set of solutions with different qualities. However, only the solutions of sufficiently good quality are interesting for practical applications.
3.2.3 Quality-Diversity
The family of algorithms that search for a distributed set of sufficiently good solutions in BS is known as Quality-Diversity (QD) algorithms (Pugh et al., 2015). Following Chatzilygeroudis et al. (2021), the goal of QD algorithms may be seen as a generalization of MMO, such that the objective function is maximized as in the MMO case, but diversity is targeted with regard to the given behavior descriptor function. One of the first QD algorithms, proposed in 2011, is an extension of novelty search and optimizes two objectives: the novelty objective (the same as in the original novelty search) and the local competition objective (the number of nearby solutions with lower objective value) (Lehman and Stanley, 2011b). Optimization of those objectives allows the generation of a set of diverse solutions with good local quality. Approximately at the same time, the MAP-Elites algorithm (Mouret and Clune, 2015) was proposed. Due to the simplicity of its idea and its efficiency in practice, this algorithm gained a lot of attention in the optimization area and became a classical QD optimizer. Because of its great influence on QD optimization, we describe it in more detail in the next paragraph.
MAP-Elites maintains an archive of previously obtained solutions in a special container. The container discretizes the BS and splits it into cells. Every solution in the container belongs to the cell that corresponds to its behavior descriptor and eliminates from the container a worse solution (with respect to the objective function) that shares the same cell. To update the container, MAP-Elites follows the general template of evolutionary algorithms. In the original version, random selection is applied to choose a number of solutions from the container, then mutation and crossover are applied to the selected individuals to create a new generation, then the objective function is evaluated for each of them to obtain their behavior descriptors and performance, and then the container is updated.
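A compact sketch of this loop may help; the behavior descriptor (here simply the first two coordinates of the solution), the grid resolution, the bootstrap phase, and the Gaussian mutation are illustrative assumptions rather than parts of the original specification.

```python
import numpy as np

rng = np.random.default_rng(3)

def map_elites(f, descriptor, dim, bins=10, budget=5000, sigma=0.2):
    """Minimal MAP-Elites sketch: a grid-shaped archive keeps the best solution per cell."""
    archive = {}                                         # cell index -> (fitness, solution)
    for t in range(budget):
        if t < 100 or not archive:                       # bootstrap with random solutions
            x = rng.uniform(-1.0, 1.0, dim)
        else:                                            # select a random elite and mutate it
            keys = list(archive.keys())
            _, parent = archive[keys[rng.integers(len(keys))]]
            x = np.clip(parent + sigma * rng.standard_normal(dim), -1.0, 1.0)
        fit = f(x)
        # map the behavior descriptor (assumed to lie in [-1, 1]^2) to a grid cell
        cell = tuple(np.floor((descriptor(x) + 1.0) / 2.0 * bins).astype(int).clip(0, bins - 1))
        if cell not in archive or fit > archive[cell][0]:
            archive[cell] = (fit, x)                     # elite replacement within the cell
    return archive

f = lambda x: -float(np.sum(x ** 2))                     # maximize: peak at the origin
descriptor = lambda x: x[:2]                             # behavior = first two coordinates
archive = map_elites(f, descriptor, dim=5)
print(len(archive), "cells filled")
```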
The standard version of MAP-Elites was applied in a number of domains (see Section 3.1 of Chatzilygeroudis et al., 2021). However, it exhibits two limitations: (i) a high dimensionality of the BS does not allow for efficient storage of the container; and (ii) the expensiveness of the objective function restricts the number of evaluations that the QD algorithm can afford to make in a reasonable time. Different modifications of this algorithm were proposed to overcome those problems; here we mention only the first ones. Vassiliades et al. (2017) addressed the problem of high-dimensional BS by using Voronoi diagrams to store the individuals in the container instead of the traditional grid, which occupies exponentially more memory as the number of dimensions grows. Gaier et al. (2017) addressed the problem of optimizing an expensive objective function by building a surrogate of the function and applying EGO (see Section 3.4) for optimization.
QD algorithms are actively being developed to satisfy the needs of modern applications. For the case when the BS is not simple to define before the optimization, Cully and Demiris (2018) proposed applying autoencoders to automatically learn the latent space of digit images and using it as the BS of a robotic hand that learns to write digits. For the case of a noisy function, Flageat and Cully (2020) proposed storing multiple individuals in every cell of the MAP-Elites container, selecting only the individuals with the highest quality, and replacing a randomly selected individual in the cell with the freshly added individual. This procedure ensures that the search is not misled by the deceptiveness of a noisy environment. Pierrot et al. (2022) considered the case of multiobjective functions and proposed maintaining a Pareto front in every cell of the MAP-Elites grid. The benchmarking study of this algorithm showed that it maintains a diverse set of individuals and at the same time manages to outperform (with respect to the global hypervolume) traditional multiobjective algorithms in many cases while not significantly losing on the rest of the considered functions. Urquhart and Hart (2018) first proposed a MAP-Elites-based QD algorithm for discrete settings and applied it to workforce scheduling and routing problems, resulting in new best-known solutions for several problem instances.
All of these areas are strongly related, as analyzed and compared also in Hagg et al. (2020). Their use to support engineers in finding diverse solutions to engineering design problems (Hagg et al., 2018) and in involving the user experience in the ideation process (Hagg et al., 2019) is highly relevant to work in real-world optimization.
3.3 Constrained Optimization
Another highly relevant topic for parameter optimization is the handling of constraints, as these typically occur in real-world optimization settings. Consequently, the past 30 years have seen tremendous development in constraint-handling approaches (Coello, 2022), which, 30 years ago, were often limited to the simple “death penalty” approach of discarding infeasible offspring. Ordered from the most straightforward to more complex ones, we categorize contemporary approaches into six groups; a short code sketch of the first two groups follows the list.
(i) Feasible solution preference is the most straightforward strategy. It works by always preferring feasible solutions over infeasible solutions (in the extreme case discarding infeasible solutions altogether, known as the death penalty; Kramer, 2010) and by preferring solutions with a small constraint violation over solutions with large constraint violations (Deb, 2000; Mezura-Montes and Coello, 2005). A downside of this method is that little information from infeasible solutions is preserved in the population during the optimization process. Consequently, algorithms using this method can get stuck in local optima.
(ii) Penalized unconstrained optimization uses a so-called penalty function that assigns a cost to constraint violations (Mezura-Montes and Coello, 2011). Penalty functions can be static, dynamic, adaptive, or even based on self-adaptive fitness formulations and stochastic ranking approaches. Penalty functions for constraints, however, lead to a strong bias toward feasible solutions. This is not always ideal, as for example, a high cost assigned by a bad penalty function could cause a selection operator to prefer a feasible solution far away from the optimum over a solution very close to the optimum which is only slightly infeasible. All the information from this promising solution would be lost and not maintained in the population. Unfortunately, the best penalty function cannot be known upfront, and tuning the hyper-parameters of the penalty function for each constraint requires a lot of additional function evaluations (Richardson et al., 1989).
(iii) Repair mechanisms aim to repair solutions that violate the constraints, and are often based on constraint-specific characteristics. Since they are often problem-dependent, a range of different repair mechanisms has been proposed (Salcedo-Sanz, 2009), for example, gradient-based repair mechanisms (Chootinan and Chen, 2006; Zahara and Kao, 2009) and strategies that reduce the probability to generate infeasible offspring (Liepins and Vose, 1990; Arnold and Hansen, 2012).
(iv) Constraint separation considers both the constraints and the original problem as separate optimization problems and aims to solve each of these independently. This can, for example, be achieved via coevolution, where two populations interact with each other (Paredis, 1994): one population focuses on the fitness of solutions, and the other on constraint satisfaction. Another frequently used approach considers the constraints as additional objectives (Surry et al., 1997; Coello and Montes, 2002). After optimizing the multiobjective problem (see Section 3.5), the solution with the best true objective function value and without any constraint violation can be selected. A downside of this method is an unnecessarily complex objective space and a strong bias towards the regions where the constraints are violated during the optimization process.
(v) Model-assisted constraint optimization uses machine learning to identify feasible solutions, by training a classification model or fitting a surrogate model with the already evaluated solutions of the constraint function. The trained models can then be used to classify or predict which solutions are feasible and which are not. Simple classification algorithms (such as SVMs; see Poloczek and Kramer, 2013) can be used in this approach, but more advanced regression or surrogate models (see Section 3.4) can also be used. A benefit of model-assisted constraint optimization is that no information is lost from evaluated solutions and that a lot of function evaluations can be saved. However, this often comes at the cost of a substantial increase in computational resources.
(vi) Hybrid methods exist that make use of a combination of the above-mentioned techniques. Noteworthy mentions are, for example, SACOBRA, which uses a repair mechanism in combination with model-assisted constraint optimization techniques (Bagheri et al., 2017), and automatic constraint-handling technique selection based on problem characteristics (Mallipeddi and Suganthan, 2010).
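The sketch below illustrates the two simplest groups, feasibility rules in the spirit of group (i) and a static penalty as in group (ii), for inequality constraints of the form $g(x) \le 0$; the penalty coefficient is an arbitrary illustrative value.

```python
import numpy as np

def violation(x, constraints):
    """Total constraint violation for inequality constraints g_i(x) <= 0."""
    return sum(max(0.0, g(x)) for g in constraints)

def feasibility_rule_better(x1, x2, f, constraints):
    """Group (i): feasible beats infeasible; otherwise compare fitness or total violation."""
    v1, v2 = violation(x1, constraints), violation(x2, constraints)
    if v1 == 0.0 and v2 == 0.0:
        return f(x1) < f(x2)          # both feasible: compare objective (minimization)
    if v1 == 0.0 or v2 == 0.0:
        return v1 == 0.0              # exactly one feasible: it wins
    return v1 < v2                    # both infeasible: smaller violation wins

def penalized(x, f, constraints, coeff=1e3):
    """Group (ii): a static penalty added to the objective."""
    return f(x) + coeff * violation(x, constraints)

f = lambda x: float(np.sum(np.asarray(x) ** 2))
g = lambda x: 1.0 - x[0]              # constraint: x[0] >= 1
print(feasibility_rule_better([1.2, 0.0], [0.5, 0.0], f, [g]))
print(penalized([0.5, 0.0], f, [g]))
```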
These constraint-handling strategies are not necessarily limited to being used solely in single-objective optimization, but can also be applied to find feasible solutions for multiobjective problems. The only clear difference between them is a different way of handling the objectives (see Section 3.5 for more information).
3.4 Surrogate-Assisted Approaches
Optimization based on computer simulations or real-world experiments can be very expensive in terms of time and money. Therefore, the automotive, marine, aviation, and other engineering fields that rely on computer simulations increasingly make use of surrogate-assisted optimization algorithms.
Surrogate-assisted evolutionary computation is a variant of EC that utilizes a surrogate model, also known as a metamodel or response surface model, to guide the search process. In traditional EC, the objective function, which maps a set of input parameters to a single output value, is evaluated directly to determine the fitness of candidate solutions. In surrogate-assisted EC, the objective function is approximated using a surrogate model, which is trained on a subset of the evaluated points and can be used to predict the objective function value at any point in the search space. The use of a surrogate model allows the search process to be more efficient by reducing the number of function evaluations required to find good solutions. This is particularly useful when the objective function is computationally expensive to evaluate or when the search space is large and complex.
The original idea of using surrogate models in optimization comes from the Efficient Global Optimization (EGO) framework (Bayesian optimization) by Jones et al. (1998). In EGO, a Gaussian Process Regression (GPR) model is used as a surrogate model where the GPR is exploited to select a promising candidate solution. At each iteration, the most promising (yet unseen) candidate solution is evaluated on the real objective function. In surrogate-assisted EC, similar techniques are used to reduce the number of real objective evaluations by using the surrogate models as a filter to remove the most likely bad individuals from the population before the real evaluation. Surrogate-assisted EC was mainly motivated by reducing the time-complexity of expensive problems such as aerodynamic design optimization (e.g., Jin and Sendhoff, 2009) or drug design (e.g., Douguet, 2010).
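A rough sketch of such an EGO-style loop, using a Gaussian process surrogate and the expected-improvement criterion, could look as follows; the Matérn kernel, the random candidate sampling, and all numerical settings are illustrative assumptions (scikit-learn and SciPy are assumed to be available).

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(4)

def expected_improvement(mu, std, f_min):
    """Expected improvement for minimization under a Gaussian predictive distribution."""
    std = np.maximum(std, 1e-12)
    z = (f_min - mu) / std
    return (f_min - mu) * norm.cdf(z) + std * norm.pdf(z)

def ego(f, dim, n_init=10, budget=40, n_candidates=1000):
    """EGO-style loop sketch: fit a GPR surrogate, evaluate the candidate maximizing EI."""
    X = rng.uniform(-5.0, 5.0, size=(n_init, dim))
    y = np.array([f(x) for x in X])
    for _ in range(budget - n_init):
        gpr = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)
        cand = rng.uniform(-5.0, 5.0, size=(n_candidates, dim))
        mu, std = gpr.predict(cand, return_std=True)
        x_next = cand[np.argmax(expected_improvement(mu, std, y.min()))]
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next))        # only this point costs a real evaluation
    return X[np.argmin(y)], y.min()

sphere = lambda x: float(np.sum(x ** 2))
print(ego(sphere, dim=2))
```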
Different strategies for using surrogate models are often known as model management or evolution control. This is specifically challenging when the problems are of high dimension. The first works in this field date from the mid-1990s (e.g., Ratle, 1998; Yang and Flockton, 1995). Surrogate models can be applied to almost all operations of EAs, such as population initialization, mutation (Abboud and Schoenauer, 2001), crossover (Anderson and Hsu, 1999), and fitness evaluations (Rasheed and Hirsh, 2000).
3.4.1 Advances in Model Management Strategies
Model management strategies for fitness evaluations can generally be divided into population-based, individual-based, and generation-based strategies (Jin, 2005). Population-based means that more than one subpopulation coevolves. Each subpopulation uses its own surrogate model for fitness evaluations. Individual-based methods use a surrogate for some of the individuals in a generation and use the real function on the rest of the individuals (Jin et al., 2000). Generation-based means that the surrogate is used for fitness evaluations in some of the generations, and the real objective function in the rest (Ratle, 1998; Jin et al., 2002).
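As a minimal illustration of the individual-based strategy, the sketch below ranks all offspring with a cheap surrogate and passes only the most promising ones to the expensive true evaluation; the nearest-neighbor surrogate used here is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

def surrogate_predict(x, X_archive, y_archive):
    """Toy surrogate: predict the fitness of the nearest already-evaluated point."""
    d = np.linalg.norm(X_archive - x, axis=1)
    return y_archive[np.argmin(d)]

def preselect(offspring, X_archive, y_archive, n_real):
    """Individual-based model management sketch: keep only the n_real most promising
    offspring (according to the surrogate) for evaluation on the expensive objective."""
    predictions = np.array([surrogate_predict(x, X_archive, y_archive) for x in offspring])
    return offspring[np.argsort(predictions)[:n_real]]   # minimization: smallest predicted first

X_archive = rng.uniform(-5, 5, size=(20, 3))
y_archive = np.sum(X_archive ** 2, axis=1)
offspring = rng.uniform(-5, 5, size=(40, 3))
print(preselect(offspring, X_archive, y_archive, n_real=5).shape)
```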
3.4.2 Advances in Surrogate Models
The original EGO framework uses Gaussian Process Regression, or Kriging (Rasmussen and Williams, 2006), as the default surrogate model due to its capability of providing uncertainty quantification. However, for surrogate-assisted EC this uncertainty is not always required, and other machine-learning models can be used as well. There are also many advances in EGO surrogate models that use Kriging approximation methods to deal with larger numbers of dimensions or samples (Van Stein et al., 2020; Raponi et al., 2020; Antonov et al., 2022). Next to Kriging(-like) methods, popular choices are radial basis functions (Sun et al., 2018), artificial neural networks (Jin et al., 2002), polynomial regression (Zhou et al., 2005), and support vector regression (Wang et al., 2015). Recent advances also include the use of surrogate ensembles and incorporate automated hyperparameter optimization methods (AutoML) for selecting a well-fitting surrogate model (Yu et al., 2020, 2022; Huang et al., 2022; de Winter et al., 2022).
3.5 Multiobjective Optimization
While multiobjective optimization was still in its infancy 30 years ago, nowadays it is almost impossible to imagine evolutionary computation without this important subfield. Evolutionary multiobjective optimization now defines its own research field and requires different approaches than single-objective optimization. The main reason for this is that in multiobjective optimization, we are often not interested in a single solution, but in finding the set of Pareto-optimal solutions (Konak et al., 2006). The first multiobjective optimization algorithm (VEGA, see Schaffer, 1985) splits the population into multiple subpopulations, each of which is optimized for a different objective. This strategy, however, tends to converge only to the extremes of each objective. In the 30 years following, researchers realized that there was a need for diversity mechanisms, elitism in the population, and other fitness functions to find well-spread solutions along the entire Pareto frontier. Three evolutionary multiobjective optimization algorithms that had these characteristics quickly started to define the state of the art (Deb, 2007), and a sketch of the Pareto-dominance computation they all build upon follows the list:
Nondominated Sorting Genetic Algorithm II (NSGA-II) by Deb, Agrawal et al. (2002) became popular because it uses an elite-preservation mechanism, a diversity-preservation mechanism, and an emphasis on nondominated solutions. The emphasis on nondominated solutions is realized by a selection operator that selects only the top nondominated solutions for crossover to generate new individuals.
Strength Pareto Evolutionary Algorithm 2 (SPEA2) by Zitzler et al. (2001) keeps a fixed archive of nondominated solutions where nondominated solutions are replaced with solutions that dominate more individuals. The search for new solutions is guided by a $k$-th nearest neighbor density estimation technique.
Pareto Envelope-based Selection Algorithm (PESA2) by Corne et al. (2001) is an evolutionary strategy that compares a parent selected from a hyperbox with its offspring. If the child dominates the parent, the child is accepted as the next parent. All nondominated solutions are put in an archive, and to keep the archive small, only the solutions in the least crowded regions are kept.
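For reference, the Pareto-dominance relation and the extraction of the first nondominated front, which the three algorithms above all build upon, can be sketched as follows (minimization of all objectives is assumed):

```python
import numpy as np

def dominates(a, b):
    """a dominates b iff a is no worse in all objectives and strictly better in at least one."""
    return np.all(a <= b) and np.any(a < b)

def first_front(objectives):
    """Return the indices of the nondominated (Pareto-optimal) solutions in the population."""
    n = len(objectives)
    return [i for i in range(n)
            if not any(dominates(objectives[j], objectives[i]) for j in range(n) if j != i)]

F = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 4.0], [4.0, 1.0]])
print(first_front(F))   # [2.0, 3.0] dominates [3.0, 4.0]; the remaining three are nondominated
```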
In the years following, the multiobjective optimization research field was enhanced with strategies for many-objective optimization (e.g., Wagner et al., 2007; Deb and Jain, 2013; Ishibuchi et al., 2008), for problems with expensive function evaluations (e.g., Ponweiser et al., 2008; Knowles, 2006; Santana-Quintero et al., 2010), interactive multiobjective optimization (e.g., Miettinen, 2014; Xin et al., 2018), and constrained multiobjective optimization (e.g., Qu and Suganthan, 2011; Blank and Deb, 2021; de Winter et al., 2022).
3.6 Automated Algorithm Design
As the number of available algorithms to solve parameter optimization problems keeps growing, as evidenced by the many developments in the algorithms described in Section 2, the question of how to choose one of them is gaining considerable importance. While the problem of algorithm selection (see Kerschke et al., 2019; Muñoz et al., 2015 for surveys), where we want to pick one algorithm to solve a particular problem, has been around for decades (Rice, 1976), extensions of this original framework have gained more traction in the last decade.
The question of algorithm configuration or meta-optimization (see Huang et al., 2019 for a survey), where we want to tune parameters for a specific algorithm, has become more feasible as computational resources are more readily available. Simultaneously, the number of parameters of most methods in evolutionary computation is growing as more modifications are proposed. As such, tuning of parameters is becoming more prevalent (Eiben and Smit, 2011), and tools for algorithm configuration are gaining popularity (e.g., see López-Ibáñez et al., 2016; Lindauer et al., 2022).
While tuning parameters for an overall performance gain on a specific function or set of functions is natural, further benefits are expected when the choice of algorithms or their parameters can be changed during the optimization process. In contrast to self-adaptation, this notion of Dynamic Algorithm Configuration (DAC; see Biedenkapp et al., 2020; Adriaensen et al., 2022) often relies on methods such as reinforcement learning, which enable a more general form of adaptation: not just of a single parameter, but potentially switching multiple operators at once (see Eimer et al., 2021; Sun et al., 2021). Results with DAC have been promising, showing the potential to mimic complex traditional self-adaptation schemes (Shala et al., 2020).
These online configuration tasks are feasible only if we consider our algorithms to be modular in nature, so changes between operators are as simple as swapping a module, as we have recently shown (de Nobel et al., 2021a; Vermetten et al., 2019). However, even when this modularity is not present, advantages could be gained by switching between algorithms to exploit differences in their behavior (Vermetten et al., 2020). While this notion of switching between algorithms has shown promise in the context of hybrid algorithm design, online policies to switch between algorithms on a per-run basis, for example, based on the internal state and the landscape as seen by the algorithm, are promising future directions (e.g., Kostovska, Jankovic et al., 2022).
3.7 Particle Swarm Optimization
3.8 Differential Evolution
Originally proposed in 1995 for parameter optimization problems, Differential Evolution (DE) (Storn and Price, 1995) has rapidly gained popularity over the subsequent years due to the simplicity of its implementation, low number of parameters, low space complexity, and competitive performance on a wide range of problems (Das and Suganthan, 2011).
The general framework of DE is as follows: the algorithm starts with a population of solutions, typically initialized at random as real-valued vectors inside the problem boundaries. Then, within the generational loop of DE, all solutions in the population sequentially undergo a multi-stage perturbation made up of: (i) mutation: first, for each "current" solution (called the "target"), a number of "donor" solutions is selected uniformly at random from the population; then a new "mutant" solution is generated as the sum of a "base" solution, chosen according to the mutation strategy, and a scaled difference between pairs of the selected donor solutions; (ii) crossover/recombination: components of a new "trial" solution are produced by exchanging components, with some probability, between the mutant and target solutions; (iii) selection: the trial and target solutions compete deterministically, in terms of their objective function values, for a spot in the next generation. Similar to other EC algorithms, the termination criterion can be based either on precision or on a budget of function evaluations.
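For concreteness, the following minimal numpy sketch implements these three stages for the classic DE/rand/1/bin variant; the fixed parameter settings and the clipping of mutants to the box boundaries are our simplifying assumptions rather than part of the original specification.

    import numpy as np

    def de_rand_1_bin(f, bounds, pop_size=30, F=0.5, CR=0.9, budget=10_000, seed=0):
        rng = np.random.default_rng(seed)
        lo, hi = bounds
        dim = len(lo)
        pop = rng.uniform(lo, hi, size=(pop_size, dim))    # random initialization
        fit = np.array([f(x) for x in pop])
        evals = pop_size
        while evals + pop_size <= budget:
            for i in range(pop_size):                       # one-by-one perturbation
                # Mutation: three mutually distinct donors, all different from the target.
                r1, r2, r3 = rng.choice([j for j in range(pop_size) if j != i],
                                        size=3, replace=False)
                mutant = pop[r1] + F * (pop[r2] - pop[r3])
                mutant = np.clip(mutant, lo, hi)            # feasibility preservation
                # Binomial crossover: inherit each component from the mutant with
                # probability CR, with one component forced from the mutant.
                mask = rng.random(dim) < CR
                mask[rng.integers(dim)] = True
                trial = np.where(mask, mutant, pop[i])
                # Selection: one-to-one deterministic competition of trial and target.
                f_trial = f(trial)
                if f_trial <= fit[i]:
                    pop[i], fit[i] = trial, f_trial
            evals += pop_size
        return pop[np.argmin(fit)], fit.min()

    # Example: minimize the sphere function in five dimensions.
    best_x, best_f = de_rand_1_bin(lambda x: float(np.sum(x**2)),
                                   (np.full(5, -5.0), np.full(5, 5.0)))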
All classic DE variants can be described via the general notation DE/a/b/c, where a stands for the mutation strategy, b denotes the number of difference vectors employed in the mutation strategy (examples of a/b combinations are rand/1, Storn and Price, 1995; rand/2, best/1, and best/2, originally proposed in Price et al., 2005; and more advanced versions like target-to-pbest/1, Tanabe and Fukunaga, 2013, and Zhang and Sanderson, 2007; target-to-best/2, Gong and Cai, 2013; target-to-rand/1, Qijang, 2014, Sharma et al., 2020, Qin et al., 2009; and 2-Opt/1, Chiang et al., 2010), and c specifies the crossover strategy (e.g., the originally proposed bin, Storn and Price, 1995; exp, Price et al., 2005; or more advanced options from Das et al., 2016). Initialization (uniform, Halton, Gaussian, etc.) and feasibility-preserving (Boks et al., 2021) operators are traditionally not included in this notation, typically leaving the algorithm's specification somewhat ambiguous (Kononova et al., 2022).
3.8.1 Differential Evolution in the Context of EC
DE is closely related to a number of other heuristic algorithms, namely: (i) the standard EA, due to the same computational steps involved in both algorithms; (ii) annealing-inspired algorithms (see Note 1; Kirkpatrick et al., 1983), due to the built-in scaling down of the "step size" in both algorithms; (iii) the Nelder-Mead algorithm (Nelder and Mead, 1965) and controlled random search (CRS) (Price, 1977), both of which also rely heavily on difference vectors to perturb the current trial solutions.
At the same time, DE is remarkably different from traditional EC algorithms (Das and Suganthan, 2011): (i) the population-derived differences used for perturbation in DE automatically adapt to the natural scaling of the problem, unlike, for example, an ES, which requires explicit rules for adapting the step size per variable over time; (ii) perturbations to solutions are generated from evolving nonzero population difference vectors rather than from a predefined density function as in, for example, Estimation of Distribution Algorithms (EDAs)—the so-called contour matching property of DE (Price et al., 2005), which allows both a mutation step's size and its orientation to adapt automatically to the landscape of the objective function; (iii) donor solutions in the mutation step of DE are selected without any reference to their fitness values, unlike in a GA, where typically some sort of fitness-based parent selection takes place; (iv) while both algorithms ensure elitism, the one-to-one competition for survival in DE between parent and child most resembles that of a (μ + λ)-ES, where surviving solutions are selected from the joint pool of parents and children; (v) survivor selection in DE is local, as each solution is compared only against a (relatively local) mutated version of itself and not against solutions that have evolved in distant parts of the solution space—apart from multi-population versions, no standard EC algorithm behaves in a similar fashion (Price et al., 2005). Finally, the latter property helps prevent premature convergence and makes DE well suited for parallelization.
3.8.2 Parameter Control
Population size. To exploit the full potential of the chosen mutation operator, the population size should allow the choice of non-repeating donor indices (e.g., at least four individuals for rand/1, which requires three mutually distinct donors that also differ from the target). At the same time, excessively large populations should be avoided, as they can be deleterious for the convergence process and come at the expense of an adequate exploitation phase that refines promising solutions locally.
Control parameters. DE mutation is parameterized via the scale factor F, whose value in practice should be sufficiently high to counteract selection pressure and to generate sufficiently diverse trial solutions to avoid premature convergence (Zaharie, 2002). While small values of F are generally less reliable, their occasional use can be beneficial within the adaptive schemes described below. DE crossover is parameterized via the crossover rate CR, which controls the proportion of components of the mutant solution copied into the trial solution and, counterintuitively at first, thereby defines a mutation rate—the probability that a component is inherited from the mutant (Price et al., 2005).
(Self-)adaptation. Population size clearly influences population diversity: a larger population size means higher diversity and therefore better exploration of the search space (Zaharie and Micota, 2017). Such reasoning gave rise to algorithms with adaptive schemes that control a gradual reduction of the population size over the optimization run, to lower the chances of stagnation (Brest et al., 2008; Zamuda and Brest, 2012; Tanabe and Fukunaga, 2013), following similar developments in other subfields of EC; a minimal sketch of such a reduction scheme is given below.
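The following sketch shrinks the population linearly with the number of consumed function evaluations and truncates the worst individuals, in the spirit of the gradual population-size reduction schemes cited above; the rounding rule and the minimum population size of four are our own simplifying choices.

    import numpy as np

    def reduce_population(pop, fit, evals_used, budget, init_size, min_size=4):
        # Target size decreases linearly from init_size to min_size over the budget.
        target = max(min_size,
                     round(init_size - (init_size - min_size) * evals_used / budget))
        if len(pop) > target:
            keep = np.argsort(fit)[:target]   # keep the best, truncate the worst
            pop, fit = pop[keep], fit[keep]
        return pop, fit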
4 Theoretical Aspects
Huge efforts have been devoted in the past 30 years to developing a rigorous theoretical analysis of evolutionary computation, in particular for genetic algorithms and evolution strategies. Following the pioneering work of John Holland's schema theorem (Holland, 1975) on genetic algorithms, Wegener (2001) used a fitness-level-based method to analyze the expected running time of EAs. This work presented the first proof that crossover can reduce the running time of EAs on the so-called royal road functions (Garnier et al., 1999). The analysis of the (1+1) EA for further theoretical problems (e.g., linear functions and LeadingOnes) was presented in the work of Droste et al. (2002). Recently, attention has also been drawn to the impact of hyperparameters on the running time, for instance, for EAs with parameterized population sizes (Jansen et al., 2005; Witt, 2006; Rowe and Sudholt, 2014; Antipov et al., 2019) and mutation rates (Böttcher et al., 2010; Doerr et al., 2013; Gießen and Witt, 2017). Furthermore, extensive work on the analysis of the crossover operator can be found in Doerr and Doerr (2015); Doerr et al. (2015); Dang et al. (2017); Corus and Oliveto (2017); Sudholt (2017); Pinto and Doerr (2018); and Antipov et al. (2020).
For continuous optimization, many contributions focus on the asymptotic analysis of evolution strategies, such as the inspiring and fundamental analyses of Beyer (2001) and Auger and Hansen (2011). Moreover, the well-known drift analysis, introduced by He and Yao (2001, 2003), serves as a general theoretical framework and has been employed for several challenging theoretical questions. For instance, Doerr et al. (2012) provide an elegant proof for the running time of the (1+1) EA on any linear function, Witt (2013) presents tight runtime bounds for the (1+1) EA, and Akimoto et al. (2018) use drift analysis to derive a runtime bound for the (1+1)-ES with the 1/5-success rule. Lengler (2020) has summarized several examples of using drift analysis for EAs.
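To give the flavor of this technique, the multiplicative drift theorem (in the form popularized by Doerr et al., 2012; stated here informally, omitting some technical conditions) can be summarized as follows: if a nonnegative process $(X_t)_{t \ge 0}$, measuring the distance of the current search state from the optimum, satisfies $X_t \ge x_{\min}$ whenever $X_t > 0$ and
\[
  \mathbb{E}\left[ X_t - X_{t+1} \mid X_t = x \right] \;\ge\; \delta x \quad \text{for some } \delta > 0,
\]
then the expected number of steps until $X_t = 0$ is at most $\bigl(1 + \ln(X_0 / x_{\min})\bigr) / \delta$. Runtime bounds such as the $O(n \log n)$ bound for the (1+1) EA on linear functions then follow by identifying a suitable distance measure and bounding its expected one-step decrease.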
For multiobjective evolutionary algorithms (MOEAs), theoretical results are mainly derived for simple MOEAs on discrete problems, for example, runtime analyses of the (global) simple evolutionary multiobjective optimizer on classic bi-objective pseudo-Boolean optimization problems (Laumanns et al., 2004; Giel and Lehre, 2006; Osuna et al., 2020). Very recently, the first theoretical results were presented for the famous NSGA-II algorithm (Zheng et al., 2022; Doerr and Qu, 2022a, 2022b).
5 The More the Better? Not Always
In this section, we briefly address the excessive number of optimization algorithms with potentially similar performance in the field, referring to the ever-growing number of paradigms gleaned from nature that are claimed to be distinct models for optimization algorithms (Section 5.1). One way to investigate this problem is proper benchmarking, as we argue in Section 5.2.
5.1 The “Bestiary” and the Problem of Too Many Algorithms
Although the rising popularity of evolutionary computation has led to a vast number of positive contributions in the optimization domain at large, there have also been some rather negative developments over the past 30 years. In particular, the trend of devising "novel" nature-inspired optimization heuristics has been a concern over the last decades, and it does not seem to slow down (Hussain et al., 2019).
While early evolutionary computation techniques effectively use metaphors to motivate the algorithmic concepts, the use of metaphor has expanded to the point where it is (un)intentionally obfuscating the underlying algorithmic components (Sörensen, 2015). There is now a veritable zoo of metaphor-based optimization heuristics (Campelo and Aranha, 2018, 2021), each with its own terminology, which is hard to decipher for those with a more standard background in optimization heuristics, and for those who are looking for a well-defined formal algorithmic or mathematical description. These "novel" metaphor-based methods still gain attention, and even though many of them have been exposed as providing no novelty, new ones are still devised and promoted (Weyland, 2010; Camacho Villalón et al., 2020; Camacho-Villalón et al., 2022a). For practitioners who are not familiar with the field but are looking for the best possible optimization algorithm for a real-world problem, the ever-growing zoo of metaphor-based heuristics is confusing and makes a sensible choice for the specific problem at hand nearly impossible.
We therefore strongly advocate (i) the use of mathematical language and formal pseudocode to describe new algorithms; (ii) avoiding the excessive use of analogies with nature and the corresponding language; and (iii) proper benchmarking comparisons (see Section 5.2) against state-of-the-art optimization algorithms to justify any proposal for new ones.
5.2 Benchmarking for Algorithm Performance Comparison
With an ever-increasing variety of optimization algorithms, the question of when a particular algorithm is useful becomes difficult to answer. Despite numerous advances in the theoretical analysis of optimization algorithms, this question cannot be answered by theory alone, due to the sheer number of variations of evolutionary algorithms and the fact that modern runtime analyses apply only to a limited number of theoretical problems and algorithms. There is, however, still a need to gain insight into the behavior of algorithms to decide when they might be applicable to a particular real-world scenario. As such, benchmarking has grown from an ad-hoc procedure into a field with many toolboxes that help researchers perform their experiments (Bartz-Beielstein et al., 2020).
The standards for experimentation have improved a lot in the last 30 years, and the importance of not just benchmarking, but standardized and reproducible benchmarking, has become increasingly apparent to the community (López-Ibáñez et al., 2021). Various benchmarking frameworks have been developed, providing access to curated sets of problems (Bennet et al., 2021; Hansen et al., 2021; de Nobel et al., 2021b). This way, performance data can be shared, promoting data reuse and enabling straightforward comparisons to state-of-the-art algorithms. One particular example of benchmarking tools promoting this idea is the IOHprofiler (Doerr et al., 2020), a collaboration between Leiden University, Sorbonne Université, Tel-Hai College, and others, which collects performance data from a wide range of sources and makes it directly accessible via a web service,2 such that anyone can analyze the data (Wang et al., 2022).
As noted in Section 3, there is a great variety in the types of optimization problems that can be formulated, and each of them can be applicable to specific real-world use cases (van der Blom et al., 2020). As such, benchmark problems have to be developed for each of these domains. The most widely used suites in the context of parameter optimization address single-objective, noiseless optimization without additional constraints, where benchmark collections have largely become standardized in the last decade (Hansen, Finck et al., 2009; Suganthan et al., 2005), but other areas (e.g., constrained optimization) have also seen significant efforts to ensure reliable benchmark setups (see, e.g., Hellwig and Beyer, 2019). Some tools, such as the COCO framework, have expanded a single suite of commonly used problems (Hansen et al., 2010) to include variations with constraints, noise, discrete variables, and multiple objectives (Hansen et al., 2021).
While the specific demands of different communities within EC lead to variations in benchmarking problems, the overall procedures have much in common (Beiranvand et al., 2017). To ensure fair comparisons, evaluation counts provide a hardware-independent measure of computational effort, which allows data generated decades ago to still be used today. While performance data formats have not yet been standardized, moves towards interoperability of benchmark tools are gaining momentum, with, for example, analysis pipelines that inherently support data from many different sources (Wang et al., 2022). Further developments in terms of data representation and reuse are promising future directions (Kostovska et al., 2021).
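In practice, counting evaluations is typically achieved by wrapping the objective function so that every call is counted and the best-so-far value is logged; the following minimal sketch is our own illustration and is not tied to any particular benchmarking tool.

    class CountingObjective:
        """Wrap an objective function, counting evaluations and logging best-so-far values."""

        def __init__(self, f):
            self.f = f
            self.evaluations = 0
            self.history = []          # list of (evaluation count, best-so-far value)

        def __call__(self, x):
            self.evaluations += 1
            y = self.f(x)
            best = y if not self.history else min(y, self.history[-1][1])
            self.history.append((self.evaluations, best))
            return y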
While benchmarking plays a key role in understanding the strengths of optimization algorithms, benchmarking data is useful only when the problem being benchmarked can be clearly understood. While many benchmark problems are built from the ground up to have specific high-level properties (e.g., strong global structure), the way in which the problem and algorithm interact in a lower-level context is often not as easily described. To aid with this process, the field of Exploratory Landscape Analysis (ELA) has been developed (Mersmann et al., 2011). This field aims to characterize problems based on sets of low-level features, such as information content of a set of samples. This type of analysis (Kerschke and Trautmann, 2019) has proven useful in areas like algorithm selection (Bischl et al., 2012), but also for benchmarking and understanding the impact of algorithmic components of modular algorithms (Kostovska, Vermetten, et al., 2022).
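As a small illustrative example of such a low-level, sample-based feature, consider the fitness-distance correlation, a classic landscape measure that is not part of the original ELA feature sets but serves the same purpose of characterizing a problem from a set of evaluated samples.

    import numpy as np

    def fitness_distance_correlation(X, y, x_best):
        # X: sample points (n, d); y: their objective values; x_best: best known solution.
        d = np.linalg.norm(X - x_best, axis=1)        # distances to the best point
        return float(np.corrcoef(y, d)[0, 1])         # Pearson correlation

    # Example: on the sphere function, fitness and distance to the optimum are
    # strongly positively correlated, indicating an easy, unimodal landscape.
    rng = np.random.default_rng(1)
    X = rng.uniform(-5, 5, size=(200, 3))
    y = np.sum(X**2, axis=1)
    print(fitness_distance_correlation(X, y, np.zeros(3)))   # close to 1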
6 A Subjective Outlook (Not for 30 Years)
In the domain of continuous parameter optimization problems, as defined in Equation (1), it is extremely difficult today for practitioners and researchers to pick the best possible optimization algorithm for a given problem and to parameterize it correctly. First steps in the direction of problem characterization, for example, by means of Exploratory Landscape Analysis, and of problem-specific Combined Algorithm Selection and Hyperparameter Optimization have been made recently, with promising results. Methods such as dynamic algorithm configuration, meta-optimization, and modular algorithm frameworks, as discussed in Section 3.6, clearly indicate one path towards the future of the field—namely, automatically designing the optimal optimization algorithm for a given problem or problem class.
Related to this, we would like to emphasize that there are too many nature-inspired optimization algorithms around, with new ones being proposed continuously. For the future, we strongly advocate consolidating the field, understanding which algorithms are good for which optimization problem characteristics, and developing configurable algorithm frameworks that allow for the generation of, experimentation with, and proper benchmarking of existing and new algorithms. A few more topics we propose are the following:
In addition to the common "competitive" form of benchmarking, a focus should also be placed on the explainability of achieved results, by better understanding the interaction between algorithmic components and problem properties.
Where possible, reproducibility should become a requirement for all evolutionary computation publications. We should, however, also leave room for publications about real-world applications, which often rely on proprietary data or simulators but facilitate a better understanding of the requirements of real-world applications.
It would be good to use configurable algorithm frameworks with straightforward interfaces to allow easy use of state-of-the-art algorithms.
We need to get rid of "novel" nature-inspired algorithms and use a standardized description and formalization approach for new algorithms, so that it is easy to see whether they are similar or identical to existing methods.
Acknowledgments
It would be a paper on its own to acknowledge all the wonderful people we have had the pleasure to work with during the last 30 years, at the Technical University of Dortmund, at Leiden University, and within our worldwide research network. I (Thomas Bäck) would like to mention one person, Hans-Paul Schwefel, who was the reason that I joined this exciting field of research, and I have always enjoyed and continue to enjoy this journey. Thank you very much for this, Hans-Paul, and for your inspiration and continuous support, especially of course during the first years of my journey as a young PhD student under your guidance. Our special thanks also go to David B. Fogel for his very useful input on Section 2.3, and to Emma Hart for encouraging this project and providing suggestions that helped improve the paper.
Notes
1. In fact, the first version of DE originated as a floating-point-arithmetic modification of a genetic annealing algorithm (Price et al., 2005) proposed in 1994 in a popular programmer's magazine, Dr. Dobb's Journal.