## Abstract

A recent comparison of well-established multiobjective evolutionary algorithms (MOEAs) has helped better identify the current state-of-the-art by considering (i) parameter tuning through automatic configuration, (ii) a wide range of different setups, and (iii) various performance metrics. Here, we automatically devise MOEAs with verified state-of-the-art performance for multi- and many-objective continuous optimization. Our work is based on two main considerations. The first is that high-performing algorithms can be obtained from a configurable algorithmic framework in an automated way. The second is that multiple performance metrics may be required to guide this automatic design process. In the first part of this work, we extend our previously proposed algorithmic framework, increasing the number of MOEAs, underlying evolutionary algorithms, and search paradigms that it comprises. These components can be combined following a general MOEA template, and an automatic configuration method is used to instantiate high-performing MOEA designs that optimize a given performance metric and present state-of-the-art performance. In the second part, we propose a multiobjective formulation for the automatic MOEA design, which proves critical for the context of many-objective optimization due to the disagreement of established performance metrics. Our proposed formulation leads to an automatically designed MOEA that presents state-of-the-art performance according to a set of metrics, rather than a single one.

## 1 Introduction

Multiobjective optimization problems (MOPs) involve determining the best tradeoff solutions between various, typically conflicting objectives. In the most general case, MOPs are treated in the Pareto-sense and a best possible approximation of the Pareto-optimal front is desired. MOPs arise in many real-world situations and are more difficult to solve than their single-objective counterparts. Given the enormous challenge posed by MOPs, a large number of solution approaches for tackling them have been proposed (Paquete and Stützle, 2007; Ehrgott and Gandibleux, 2004; Deb, 2001; Coello Coello et al., 2007). As one of the main alternatives for tackling MOPs, multiobjective evolutionary algorithms (MOEAs) have become very popular (Deb, 2001; Coello Coello et al., 2007; Mezura-Montes et al., 2008). Relevant contributions from the MOEA community to solving MOPs include effective algorithmic components like dominance sorting (Goldberg, 1989; Deb, 2001), as well as the theoretical basis for the performance assessment of multiobjective algorithms (Zitzler et al., 2003).

Among the open challenges for research on MOEAs are *many-objective optimization problems* (MaOPs), that is, MOPs with a large number of objectives (Fleming et al., 2005), which can be found in practical applications. Early MOEAs have shown performance limitations on MaOPs, which has motivated the proposal of a number of new algorithms over the last decade (Chand and Wagner, 2015; Li et al., 2015). In what follows, we will refer to MOEAs that have specifically been designed to tackle many-objective problems as many-objective evolutionary algorithms (EAs). Although several many-objective EAs have been proposed, their effectiveness is not yet fully understood, as only a few experimental comparisons exist (Bezerra et al., 2018; Tanabe et al., 2017). In general, such comparisons show that the improvements of the many-objective EAs considered over established MOEAs are smaller than expected once a rigorous experimental setup is adopted (e.g., parameters are properly configured).

Several issues help explain the results observed in these assessments. One is the set of difficulties posed by the increase in the number of objectives, such as increased dominance resistance, that is, the growth in the number of nondominated solutions on problems with specific characteristics (Schütze et al., 2011). A second is that recent empirical analyses have reported disagreements between performance metrics (Jiang et al., 2014; Bezerra et al., 2018), which again vary as a function of problem characteristics. Another important issue is the rigor in the assessment of novel MOEAs. Specifically, the literature contains a large number of multi- and many-objective EAs, too large for researchers to properly assess their (dis)advantages in a practical way. In addition, MOEAs are commonly proposed as monolithic blocks, assuming that their components are equally effective and need to be jointly used. Hence, researchers rarely investigate how the proposed algorithmic components interact, or whether components from existing MOEAs could provide more benefits than components from the new MOEAs being proposed. This monolithic approach is reflected by various MOEA frameworks that offer limited composability of components (Bleuler et al., 2003; Igel et al., 2008; Biscani et al., 2010).

In previous work, we have helped address many of these issues. In Bezerra et al. (2016), we proposed a general MOEA template from which a researcher can easily instantiate several existing MOEAs, but also generate a much larger number of novel MOEAs by combining the algorithmic components available from a component-wise algorithmic framework. Moreover, by applying an automatic algorithm design methodology (KhudaBukhsh et al., 2009; López-Ibáñez and Stützle, 2012; Dubois-Lacoste et al., 2011) to this $AutoMOEA$ framework, we have shown that it is possible to generate several automatically designed MOEAs ($AutoMOEAs$) that consistently outperform established MOEAs on both continuous and combinatorial optimization problems. In another work (Bezerra et al., 2018), we have experimentally compared the performance of 14 MOEAs from the literature, after automatically tuning their parameter settings, on a variety of application scenarios considering four numbers of objectives (2, 3, 5, and 10 objectives), different termination criteria based on the maximum number of function evaluations ($2500$, $10000$, and $40000$ FEs), and three different performance metrics (relative hypervolume, additive $\epsilon $-indicator, and inverted generational distance, IGD). From that analysis, we have obtained various insights that should be taken into account by MOEA designers, for example, that using different underlying EA operators can significantly boost the performance of MOEAs for continuous optimization. In addition, we have noticed that some recently proposed MOEAs do not present an empirical performance matching what their proposers had expected. That study not only gives us a solid basis for comparing the performance of automatically generated MOEAs to the performance of current state-of-the-art MOEAs, but it also indicates directions to extend our $AutoMOEA$ framework.

In this work, we propose a strongly extended framework, hereon called $AutoMOEA+$. The main extensions to obtain $AutoMOEA+$ are the following. First, we integrate a further level of composability that allows us to separate the multiobjective-related aspects of the search from the underlying EA. Effectively, this composability refinement greatly expands the design space provided by our template, as any existing MOEA can be coupled with the most relevant EAs from the literature. More importantly, by doing so we contemplate the potential interactions between multiobjective components and underlying EAs. Second, we extend our template to comprise decomposition-based algorithms (Zhang and Li, 2007; Zhang et al., 2009; Hughes, 2003; Deb and Jain, 2014), in addition to the originally comprised dominance- (Deb et al., 2002; Zitzler et al., 2002) and indicator-based (Beume et al., 2007; Zitzler and Künzli, 2004; Bader and Zitzler, 2011) algorithms. Specifically, we implement components from relevant decomposition-based MOEAs such as MOPSO (Hughes, 2003) and NSGA-III (Deb and Jain, 2014), and we allow the free hybridization between all three design paradigms considered. In fact, this is the first work to consider such a possibility, and it is one of the major contributions of our study. Third, we further exploit our unified definition of populations and archives to demonstrate how metrics that had been originally proposed as archive truncation techniques can be used as components of our general preference relations. As an example, we take the metric proposed for the adaptive grid archiver of PAES (Knowles and Corne, 2000), and recast it as a diversity component that can be used in combination with any of the other preference components available in our template.

The automatically devised MOEAs produced in this work (dubbed $AutoMOEA+$ algorithms) present better and/or more robust performance than the state-of-the-art results identified in Bezerra et al. (2018). Specifically, the $AutoMOEA+$ algorithms consistently outperform the 9 MOEAs (and their variants) used for that investigation, among which we highlight NSGA-II (Deb et al., 2002), SPEA2 (Zitzler et al., 2002), IBEA (Zitzler and Künzli, 2004), SMS (Beume et al., 2007),^{1} MOEA/D (Zhang and Li, 2007; Zhang et al., 2009), MO-CMA-ES (Igel et al., 2007; Voß et al., 2010), HypE (Bader and Zitzler, 2011), and NSGA-III (Deb and Jain, 2014). Interestingly, almost all novel components implemented in this work appear in the automatically generated designs, and often in ways that are very different from what human designers would tend to do. These results further evidence the need for flexible approaches that can be explored in a systematic, automated, and effective way, as we propose in this article.

On some many-objective scenarios, the challenge posed by the disagreements between performance metrics deceives the automatic methodology into selecting designs that are high performing according to some metrics, but not according to others. More precisely, the traditional approaches to automated algorithm design consider a single performance metric to be optimized, typically runtime or solution quality (Birattari, 2009; Hoos, 2012). The latter is typically assessed through a unary performance metric when dealing with the automatic design of a multiobjective algorithm (Dubois-Lacoste et al., 2011; López-Ibáñez and Stützle, 2012; Bezerra et al., 2016). However, when the number of objectives is large, the disagreements between performance metrics become strong, and designing with a single metric in mind will inevitably lead to a design that is well performing according to some metrics, but poor-performing according to others (Jiang et al., 2014; Bezerra et al., 2018). To overcome this issue, we propose a multiobjective formulation of the design process, following a recent research trend on multiobjective configuration of algorithms (Bezerra et al., 2017a). Specifically, instead of evaluating solution quality through a single performance metric, we consider that the set of metrics used in the assessment should be jointly optimized during design, as in a multiobjective problem. Effectively, we propose a multiobjective design of multiobjective algorithms. Our experimental evaluation, which focuses on the most challenging experimental scenario, demonstrates the effectiveness of this approach. In particular, the newly devised algorithm shows robustness and effectiveness for all metrics considered.

The main contributions of this article can be summarized as follows:

- An augmented framework for instantiating MOEAs, which comprises the most relevant underlying EAs and design paradigms (dominance-, indicator-, and decomposition-based).
- An empirical demonstration that state-of-the-art MOEAs for continuous optimization can be automatically designed under different experimental scenarios, and that these designs combine elements from different MOEAs/design paradigms.
- The proposal of a multiobjective formulation of the automatic MOEA design problem, from which one can automatically devise a state-of-the-art MOEA for MaOPs robust across a set of disagreeing performance metrics.

The remainder of this article is organized as follows. In Section 2, we review the original $AutoMOEA$ framework and detail how we augment it in this work. In Section 3, we automatically design a set of MOEAs that display state-of-the-art performance on most experimental scenarios considered. Next, Section 4 details the multiobjective design formulation adopted for specific many-objective scenarios, and presents the experimental results confirming its effectiveness. We conclude in Section 5.

## 2 An Augmented MOEA Template

Automated algorithm design approaches observed in the literature can be broadly split into two main categories: (i) *bottom-up* approaches (e.g., hyperheuristics, Ross (2005)), where heuristics are crafted using little human insight and relying heavily on automatically discovered knowledge, and (ii) *top-down* approaches, where human knowledge provides a structural basis (e.g., a template or a grammar) and the automated design process attempts to design the best possible algorithm based on this structure. The scope of the former has been traditionally restricted to the design of heuristics. By contrast, top-down approaches have increasingly proven effective for designing both simple heuristics and complex algorithm portfolios (e.g., ensemble methods).^{2}

In previous work (Bezerra et al., 2016), we studied the most popular MOEA designs found in the literature and proposed a template of the general design of a MOEA, shown in Algorithms 1 and 2. Each *abstract* component of this template represents a choice between different algorithmic components found in the literature. A component-wise algorithmic framework that implements this template and provides a set of options for each abstract component can instantiate diverse MOEAs. Our original $AutoMOEA$ framework could instantiate at least six of the most relevant dominance- and indicator-based MOEAs from the literature (Fonseca and Fleming, 1993; Deb et al., 2002; Zitzler et al., 2002; Zitzler and Künzli, 2004; Beume et al., 2007; Bader and Zitzler, 2011), in addition to a large number of valid and novel MOEA designs.

In this work, we augment our previously proposed $AutoMOEA$ framework in several directions; we refer to the resulting framework as $AutoMOEA+$. First, we distinguish between multiobjective components and underlying EAs, and allow coupling the same set of MO components with different underlying EAs. This enables the $AutoMOEA+$ framework to instantiate many MOEAs from the literature that are based on differential evolution (Abbass et al., 2001; Abbass, 2002; Madavan, 2002; Robič and Filipič, 2005; Kukkonen and Lampinen, 2005; Tušar and Filipič, 2007; Tagawa et al., 2011). Second, we incorporate decomposition-based algorithmic components, and model these components in a manner that allows designers to combine, within a single algorithm, components originally proposed for dominance-, indicator-, and decomposition-based MOEAs. Finally, we also recast techniques originally proposed for archive truncation as options available for our preference components.

### 2.1 Original Template

The main abstract components that characterize a MOEA depicted in Algorithms 1 and 2 are listed in Table 1, where both *atomic* and *composite* components are given. A composite component such as $Mating$ may comprise composite and/or atomic components. Algorithmic components (options for the abstract components) are listed in Table 2, where we make a distinction between components already available in the $AutoMOEA$ framework and components implemented in this work for $AutoMOEA+$. Below we briefly summarize abstract components of the original $AutoMOEA$ framework; for further details on the original components, we refer to Bezerra et al. (2016).

$Preference$ is a *composite* component that models general preference relations (Zitzler et al., 2010), and comprises a sequence of three *atomic* components in the following order: (1) $SetPart$ partitions solutions into dominance-equivalent subsets; (2) $Refinement$ ranks solutions within each partition, e.g., using quality indicators; and (3) $Diversity$ is a Pareto-noncompliant metric used to keep the population well-spread across the objective space. A $Preference$ component can also contain fewer than three atomic components since $SetPart$, $Refinement$, and/or $Diversity$ can be set to *none*.
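The sequential composition described above can be read as a lexicographic comparison. The following is a minimal sketch under several assumptions that are not taken from the framework itself: objectives are minimized, each atomic component is modeled as a function returning one score per solution (lower is preferred), and the names `dominance_count`, `objective_sum`, and `preference_order` are hypothetical. In particular, `objective_sum` is only a placeholder and is not one of the quality indicators used by the framework.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, ...]
Scorer = Callable[[Sequence[Point]], List[float]]  # lower score = preferred


def dominates(a: Point, b: Point) -> bool:
    """Pareto dominance for minimization."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def dominance_count(pop: Sequence[Point]) -> List[float]:
    """Illustrative SetPart option: how many solutions dominate each one."""
    return [float(sum(dominates(q, p) for q in pop)) for p in pop]


def objective_sum(pop: Sequence[Point]) -> List[float]:
    """Placeholder Refinement option (an assumption, not a framework indicator)."""
    return [float(sum(p)) for p in pop]


def preference_order(
    pop: Sequence[Point],
    set_part: Optional[Scorer] = None,
    refinement: Optional[Scorer] = None,
    diversity: Optional[Scorer] = None,
) -> List[int]:
    """Indices of pop from most to least preferred.

    Components set to None play the role of the *none* option: they simply
    contribute nothing to the lexicographic key.
    """
    keys: List[List[float]] = [[] for _ in pop]
    for component in (set_part, refinement, diversity):
        if component is None:
            continue
        for key, score in zip(keys, component(pop)):
            key.append(score)
    return sorted(range(len(pop)), key=lambda i: tuple(keys[i]))
```

With `pop = [(1, 4), (2, 2), (3, 3), (4, 1)]`, using `dominance_count` and `objective_sum` places the dominated point `(3, 3)` last, while ties in the key keep their original order because Python's sort is stable.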

$Mating$ uses traditional $Selection$ operators to select individuals to undergo variation. In the case of tournaments, solutions are compared based on a preference relation $PreferenceMat$.

The $Replacement$ and $ReplacementExt$ components define environmental selection and external archive truncation (if a bounded external archive is used), respectively. Both $Replacement$ components ensure elitism, and comprise two other components, namely, a preference relation used to compare solutions ($PreferenceRep$ and $PreferenceExt$, respectively), and a parameter that determines the frequency with which the preference relation is computed (the removal policies $RemovalRep$ and $RemovalExt$, respectively): with *one-shot* removal, preferences are computed once and replacement takes place; *sequential* removal recomputes preferences every time a solution is discarded.
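The difference between the two removal policies can be illustrated with a small sketch. It assumes a generic `score` function that returns one value per solution (lower is preferred) and uses a toy one-dimensional crowding score; the function names are hypothetical and the scoring is not one of the framework's preference relations.

```python
from typing import Callable, List, Sequence


def one_shot_removal(pop: Sequence[float],
                     score: Callable[[Sequence[float]], List[float]],
                     target_size: int) -> List[float]:
    """Compute preferences once, then keep the target_size best solutions."""
    scores = score(pop)
    keep = sorted(range(len(pop)), key=lambda i: scores[i])[:target_size]
    return [pop[i] for i in sorted(keep)]


def sequential_removal(pop: Sequence[float],
                       score: Callable[[Sequence[float]], List[float]],
                       target_size: int) -> List[float]:
    """Recompute preferences after every single discarded solution."""
    current = list(pop)
    while len(current) > target_size:
        scores = score(current)
        worst = max(range(len(current)), key=lambda i: scores[i])
        del current[worst]
    return current


def crowding(pop: Sequence[float]) -> List[float]:
    """Toy score: negated distance to the nearest neighbour (crowded = worse)."""
    return [-min(abs(p - q) for j, q in enumerate(pop) if j != i)
            for i, p in enumerate(pop)]


points = [0, 1, 5, 6, 20]
print(one_shot_removal(points, crowding, 3))    # [0, 1, 20]
print(sequential_removal(points, crowding, 3))  # [1, 6, 20]
```

In this toy case the one-shot policy keeps a clustered pair because all scores are computed before anything is removed, whereas the sequential policy notices that removing one point of a pair makes its neighbour less crowded. The price is one preference computation per discarded solution.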

**pop** and **$popext$** are sets of solutions that represent either populations or archives. pop takes part in the evolutionary process and can be configured as a regular *fixed-size* population that may contain dominated solutions or as a *bounded-size* archive that only contains nondominated solutions. $popext$ is an optional external archive that does not participate in the evolutionary process and may be either *bounded* or *unbounded*.

$Initialization$ and $Variation$ comprise problem-specific components, namely, the generation of an initial population and the variation operators that produce new solutions from existing ones.

### 2.2 Underlying EAs

Most MOEA proposals either specify the underlying EA as an integral part of a MOEA design or treat it as an irrelevant detail. However, our recent experimental assessment has shown that a proper choice of the underlying EA operators is critical to the effectiveness of a MOEA, given the strong interactions between MO-components and EA operators (Bezerra et al., 2018). Therefore, our augmented $AutoMOEA+$ framework allows the combination of MO-components with two different underlying EAs, namely *genetic algorithms* (Goldberg, 1989) and *differential evolution* (Price et al., 2005). We model the underlying EA as a composite component $UnderlyingEA$ that comprises composite components $Mating$ and $Variation$ (Table 1), since the choice of EA not only affects variation operators but also the selection of the individuals that undergo variation. These two options are further explained next.

**Genetic algorithms (GAs)** were the only option in our original $AutoMOEA$ framework, and are described in Algorithm 2. When this underlying EA is chosen (option *GA* in Table 2), a mating pool of solutions is built as described by component $MatingGA$. Component $VariationGA$ comprises the sequential application of (domain-specific) crossover and mutation operators. In this article, we use the well-known SBX crossover and polynomial mutation operators.

**Differential evolution (DE)** is detailed in Algorithm 3, which generalizes the structure of most DE-based MOEAs. Component $MatingDE$ selects target and donor vectors, and component $VariationDE$ creates a trial vector through differential mutation and binomial crossover. A distinguishing feature of multi-objective DE algorithms is the $OnlineReplacement$ component (Madavan, 2002; Robič and Filipič, 2005; Kukkonen and Lampinen, 2005). When this component is active, a newly created trial solution can immediately replace the target vector if a given acceptance criterion is satisfied. Some popular DE algorithms differ exactly in this acceptance criterion: DEMO (Robič and Filipič, 2005) uses Pareto dominance, whereas GDE3 (Kukkonen and Lampinen, 2005) uses weak Pareto dominance. We have added both options to our $AutoMOEA+$ framework (Table 2). If the target vector is replaced, then the size of $popnew$ does not increase; else, if trial and target are nondominated (or if online replacement is not adopted at all), *trial* is added to $popnew$. This is represented in Algorithm 3 by the set $S$ produced by $OnlineReplacement$ and later added to $popnew$: if *trial* replaces *target*, $S$ is an empty set and $popnew$ remains unchanged; else, $S$ is a singleton containing only *trial*, which is added to $popnew$. Online replacement would be redundant together with steady state selection ($\lambda =1$) since it becomes equivalent to $Replacement$.
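The bookkeeping around the set $S$ can be sketched as follows. This is an illustration only: solutions are represented by their objective vectors, objectives are minimized, and the function `online_replacement` and its signature are hypothetical. The sketch follows the set-based description (either $S$ is empty or it contains the trial solution) and does not model any variant that discards trial solutions dominated by the target.

```python
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """Pareto dominance for minimization (acceptance criterion ascribed to DEMO)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def weakly_dominates(a: Point, b: Point) -> bool:
    """Weak Pareto dominance (acceptance criterion ascribed to GDE3)."""
    return all(x <= y for x, y in zip(a, b))


def online_replacement(pop: List[Point], target_idx: int, trial: Point,
                       accept: Optional[Callable[[Point, Point], bool]]) -> List[Point]:
    """Return the set S that is later added to pop_new.

    If the trial is accepted it overwrites the target in place and S is empty;
    otherwise (or when online replacement is disabled, accept=None) S = [trial].
    """
    if accept is not None and accept(trial, pop[target_idx]):
        pop[target_idx] = trial
        return []
    return [trial]


pop = [(2.0, 2.0), (1.0, 3.0)]
pop_new: List[Point] = []
pop_new += online_replacement(pop, 0, (1.0, 1.0), dominates)  # replaces target
pop_new += online_replacement(pop, 1, (3.0, 1.0), dominates)  # kept as offspring
```

After these two calls, `pop` holds `(1.0, 1.0)` in place of the first target and `pop_new` contains only `(3.0, 1.0)`. The two acceptance criteria differ only when trial and target have identical objective vectors or the trial is equal in some objectives and never worse.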

Another novelty of our work is the possibility of using multiple DE schemes (Price et al., 2005). Specifically, so far only the *DE/rand/1* scheme has been adopted in the literature. Here, we also implement a preference-based selection scheme, which is an adaptation of the *DE/target-to-best/1* scheme (*TtoB*, for short). Concretely, designers may configure $MatingDE$ to select target and donor vectors using any $PreferenceMat$ and $Selection$ options, thus increasing the odds of producing a better trial vector. However, this scheme cannot be adopted in combination with online replacement, since $PreferenceMat$ is computed before variation starts, and a trial vector replacing a target vector during variation would lead to an inconsistently evaluated population.^{3}

### 2.3 Deconstructing Decomposition

Decomposition (Hughes, 2003; Zhang and Li, 2007; Zhang et al., 2009; Deb and Jain, 2014) is a search paradigm originally considered by the decision making community and adapted for MOEA research. The basic principle behind this paradigm is to decompose the original MOP into subproblems and optimize them in parallel. Each subproblem is a single-objective projection of the original MOP, which can be obtained using several different methods (Zhang and Li, 2007). An analysis of the decomposition-based MOEA literature reveals that most proposals can be classified as $Refinement$ components. Specifically, most decomposition-based algorithms are able to simultaneously evaluate the convergence of a population (its closeness to the Pareto front) and its diversity (how well the front is covered), with the latter being ensured by the existence of multiple subproblems and the former by optimizing each subproblem. More importantly, decomposition approaches are able to distinguish between dominance-equivalent solutions, the baseline definition for our $Refinement$ components. One exception is NSGA-III (Deb and Jain, 2014), which uses decomposition only for diversity purposes and ensures convergence using the same $SetPart$ component as NSGA-II (Deb et al., 2002). In this work, we implement two components from decomposition-based MOEAs:

**Weighted ranking** was originally proposed in MOPSO (Hughes, 2003). In our framework, it is provided as an option of component $Refinement$, and works as follows. Solutions are ranked according to their performance on each subproblem defined by a weight vector $\lambda \in \Lambda$, where $\Lambda$ is a given set of weight vectors. The overall quality of a solution equals its aggregated performance considering the ranks from each subproblem. We use here the algebraic sum to aggregate the performance on the subproblems, but other aggregation functions are possible.
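A minimal sketch of this rank aggregation is given below. Two choices are assumptions made for illustration and are not stated in the text: each subproblem is scalarized with a weighted sum of the (minimized) objectives, and tied solutions share the same rank. The function name `weighted_ranking` is hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def weighted_ranking(pop: Sequence[Point], weights: Sequence[Point]) -> List[int]:
    """Sum, over all weight vectors, of each solution's rank on that subproblem.

    Lower totals are better. The rank of a solution on a subproblem is the
    number of solutions with a strictly smaller scalarized value, so ties
    share a rank (an assumption of this sketch).
    """
    totals = [0] * len(pop)
    for w in weights:
        # Weighted-sum scalarization is assumed here for simplicity.
        values = [sum(wi * fi for wi, fi in zip(w, p)) for p in pop]
        for i, v in enumerate(values):
            totals[i] += sum(other < v for other in values)
    return totals


pop = [(0, 4), (1, 1), (4, 0), (3, 3)]
weights = [(1, 0), (0, 1), (1, 1)]
print(weighted_ranking(pop, weights))  # [4, 2, 4, 7]
```

The compromise solution `(1, 1)` receives the lowest total even though it is not the best on either single-objective subproblem, while the two extreme solutions tie because each is best on one subproblem and worst on another.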

**Reference lines** correspond to the method used in NSGA-III to keep the population spread along the Pareto front. Thus, in our framework, it is an option of component $Diversity$. A reference line is the line intersecting the origin of the axes and a reference point defined by a weight vector $\lambda \in \Lambda$. When ranking solutions, each solution is first associated to its nearest reference line in terms of perpendicular distance in the objective space. Next, niche counts are computed for each reference line considering only solutions already selected for the next iteration by previous preference components. For example, when dominance depth is used as $SetPart$ in NSGA-III, the selected solutions are those from the lowest depth fronts that fit in the next population. Finally, the procedure iteratively selects the reference line with lowest niche count and adds one of its associated solutions to the next population. Since we did not find a straight-forward way to differentiate between a one-shot and a sequential $Removal$ policy for this reference-lines procedure, we only combine it with the one-shot policy.
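The association and niche-filling steps can be sketched as follows. This is a simplified illustration and not the NSGA-III procedure in full: objective normalization is omitted, ties between lines are broken by line index, and the first associated candidate is taken instead of applying any closeness-based or random choice. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def perpendicular_distance(f: Point, w: Point) -> float:
    """Distance from objective vector f to the line through the origin and w."""
    scale = sum(fi * wi for fi, wi in zip(f, w)) / sum(wi * wi for wi in w)
    return math.sqrt(sum((fi - scale * wi) ** 2 for fi, wi in zip(f, w)))


def associate(pop: Sequence[Point], weights: Sequence[Point]) -> List[int]:
    """Index of the nearest reference line for every solution."""
    return [min(range(len(weights)),
                key=lambda k: perpendicular_distance(f, weights[k]))
            for f in pop]


def fill_by_niche(selected: Sequence[Point], candidates: Sequence[Point],
                  weights: Sequence[Point], n_needed: int) -> List[Point]:
    """Pick n_needed candidates, always serving the least crowded line first."""
    niche = [0] * len(weights)
    for k in associate(selected, weights):
        niche[k] += 1  # only already-selected solutions count initially
    line_of = associate(candidates, weights)
    remaining = list(range(len(candidates)))
    chosen: List[Point] = []
    while len(chosen) < n_needed and remaining:
        lines = sorted({line_of[i] for i in remaining})
        k = min(lines, key=lambda j: niche[j])
        i = next(i for i in remaining if line_of[i] == k)
        chosen.append(candidates[i])
        remaining.remove(i)
        niche[k] += 1
    return chosen
```

For example, with reference lines `(1, 0)`, `(1, 1)`, and `(0, 1)`, and two already-selected solutions close to the first axis, the candidates associated with the two empty lines are chosen before the candidate that shares the crowded line.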

Two additional parameters are crucial in decomposition-based algorithms: the cardinality and the distribution of the generated weight set $\Lambda$. In our framework, the cardinality of $\Lambda$ is upper-bounded by $\Lambda r \cdot \mu$, where $\Lambda r$ is a numerical parameter and $\mu$ is the population size. If the number of weights generated by some method exceeds the upper bound, excessive weights are discarded at random. For the distribution of $\Lambda$, our framework provides methods for generating weights with a *uniform* distribution (Das and Dennis, 1997), the *dichotomic* method proposed by Aneja and Nair (1979) for bi-objective problems, and the *two-layer* method of NSGA-III (Deb and Jain, 2014) for many-objective problems. This latter method uses two numerical parameters $H1$ and $H2$ to determine how many weights will be generated using a uniform distribution in the outer and inner layers, respectively.^{4} However, the effective goal of these parameters is to determine the *search focus* a designer wants to use, i.e., the proportion between weight vectors in the two layers. Rather than configuring these parameters independently, we provide a set of options $\Lambda focus$, as follows:

- *Peripheral* focus favors the outer layer by setting $H1$ to the maximum value feasible so that there exists an $H2$ value for which $|\Lambda| \le \Lambda r \cdot \mu$.
- *Central* focus favors the inner layer by setting $H2$ to the maximum value feasible so that there exists an $H1$ value for which $|\Lambda| \le \Lambda r \cdot \mu$.
- *Balanced* focus tries to balance the importance of both layers. Concretely, $H1$ and $H2$ are set to a maximum feasible value $h$ so that $|\Lambda| \le \Lambda r \cdot \mu$. If, however, it is still possible to increase either $H1$ or $H2$, that parameter is increased to prevent wasting weights.
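The cardinality bound and the three focus options can be explored with a small sketch. It relies on the standard count of a simplex lattice with $H$ divisions and $M$ objectives, $\binom{H+M-1}{M-1}$, and on several assumptions that the text leaves open: each layer uses at least one division, the non-favored layer receives the largest value that still fits, the balanced option tries to enlarge the outer layer before the inner one, $M \ge 2$, and the bound is at least $2M$. All function names are hypothetical, and `bound` stands for the upper limit on $|\Lambda|$.

```python
import random
from math import comb
from typing import Iterator, List, Tuple


def lattice_size(h: int, m: int) -> int:
    """Number of uniformly spread weights with h divisions and m objectives."""
    return comb(h + m - 1, m - 1)


def uniform_weights(h: int, m: int) -> List[Tuple[float, ...]]:
    """All weights with components in {0, 1/h, ..., 1} that sum to 1."""
    def compositions(total: int, parts: int) -> Iterator[Tuple[int, ...]]:
        if parts == 1:
            yield (total,)
            return
        for first in range(total + 1):
            for rest in compositions(total - first, parts - 1):
                yield (first,) + rest
    return [tuple(c / h for c in comp) for comp in compositions(h, m)]


def truncate(weights: list, bound: int, rng: random.Random) -> list:
    """Discard excess weights at random when the bound is exceeded."""
    return list(weights) if len(weights) <= bound else rng.sample(weights, bound)


def largest_h(m: int, budget: int) -> int:
    """Largest h >= 1 with lattice_size(h, m) <= budget (0 if none); needs m >= 2."""
    h = 0
    while lattice_size(h + 1, m) <= budget:
        h += 1
    return h


def layer_sizes(focus: str, m: int, bound: int) -> Tuple[int, int]:
    """Return (H1, H2) for the outer and inner layers under the stated assumptions."""
    smallest = lattice_size(1, m)  # cheapest layer when at least one division is used
    if focus == "peripheral":
        h1 = largest_h(m, bound - smallest)
        return h1, largest_h(m, bound - lattice_size(h1, m))
    if focus == "central":
        h2 = largest_h(m, bound - smallest)
        return largest_h(m, bound - lattice_size(h2, m)), h2
    if focus == "balanced":
        h1 = h2 = largest_h(m, bound // 2)
        while lattice_size(h1 + 1, m) + lattice_size(h2, m) <= bound:
            h1 += 1  # assumption: the outer layer is enlarged first
        while lattice_size(h1, m) + lattice_size(h2 + 1, m) <= bound:
            h2 += 1
        return h1, h2
    raise ValueError(f"unknown focus: {focus}")
```

For three objectives and a bound of 30 weights, this sketch yields $(H1, H2) = (5, 2)$ for the peripheral focus, $(2, 5)$ for the central focus, and $(4, 4)$ for the balanced focus; with a bound of 40, the balanced focus can still enlarge one layer and returns $(5, 4)$.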

### 2.4 Archive Truncation Techniques

Many different archive truncation techniques, or *archivers*, for short, have been proposed in the literature, as reviewed by López-Ibáñez et al. (2011). These authors consider that metrics proposed for environmental selection and metrics specifically proposed to keep an external archive bounded both serve the same purpose, and should altogether be considered archivers. In the original $AutoMOEA$ framework we have followed this formulation and, if configured as bounded-size archives, the archiving of both pop and $popext$ is commonly defined by $Replacement$ components. In Bezerra et al. (2016), we have demonstrated the benefits of this formulation by recasting metrics originally proposed for environmental selection into archivers.

In this work, we further demonstrate the benefits from this formulation by recasting metrics originally proposed to keep external archives bounded into $Preference$ components that can be used for any preference-based selection. In addition, this formulation allows the free hybridization of archivers with other metrics. We take as example the *adaptive grid archiver* (AGA) from PAES (Knowles and Corne, 2000), which discretizes the objective space into grid cells that are dynamically computed as a function of the extreme solutions found during the run, and of a numerical parameter that specifies the number of cells per objective. Solutions are compared based on the crowdedness of the grid cell to which they belong, with less crowded regions being favored. In our framework, we model the adaptive grid approach as a $Diversity$ component. As part of a $Preference$ component, it can be used as the selection criterion for building the mating pool, as the replacement strategy of pop or as an archive truncation technique. For example, one could configure $Mating$ to use a $Preference$ relation that combines the decomposition-based weighted ranking as $Refinement$ component with the adaptive grid approach as a $Diversity$ component. Indeed, this is an improvement over the original applications of the AGA (Knowles and Corne, 2000), which had already been considered for mating selection, but in a more simplified $Preference$ component.
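As an illustration of how grid crowdedness can serve as a diversity score, the sketch below assigns each solution to a cell of a uniform grid whose bounds are the extreme objective values of the current set, and scores it by the number of solutions sharing its cell. It is a simplification: the parameter `cells` is taken directly as the number of cells per objective, which may differ from how the discretization parameter is mapped to cells in PAES or in the framework, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def grid_cell(f: Point, lower: Sequence[float], upper: Sequence[float],
              cells: int) -> Tuple[int, ...]:
    """Cell coordinates of f in a grid with `cells` divisions per objective."""
    index = []
    for fi, lo, hi in zip(f, lower, upper):
        if hi == lo:
            index.append(0)  # degenerate objective: a single cell
        else:
            # The upper extreme would fall outside the last cell, so clamp it.
            index.append(min(int((fi - lo) / (hi - lo) * cells), cells - 1))
    return tuple(index)


def grid_crowding(pop: Sequence[Point], cells: int) -> List[int]:
    """Number of solutions in each solution's cell (lower = less crowded)."""
    lower = [min(col) for col in zip(*pop)]
    upper = [max(col) for col in zip(*pop)]
    where = [grid_cell(f, lower, upper, cells) for f in pop]
    counts = Counter(where)
    return [counts[c] for c in where]


pop = [(0, 8), (1, 7), (4, 4), (8, 0)]
print(grid_crowding(pop, 2))  # [2, 2, 1, 1]
```

Because the result is one score per solution with lower values preferred, it can be plugged in wherever a diversity score is expected, for instance as the last key of a lexicographic preference.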

The extensions detailed above over the original $AutoMOEA$ framework greatly improve the flexibility of the resulting $AutoMOEA+$ framework, enabling us to automatically design state-of-the-art MOEAs for multi- and many-objective continuous optimization, as we discuss next.

## 3 Automatically Designing Effective MOEAs

In this section, we automatically design MOEAs by configuring our $AutoMOEA+$ framework using $irace$ (López-Ibáñez et al., 2016), an automatic algorithm configuration tool. Our experimental analysis of the automatically designed MOEAs (hereon called $AutoMOEA+$ algorithms) has three main goals. First, we analyze to what extent the $AutoMOEA+$ algorithms match what human designers would choose as effective components, and whether the new components added to the framework appear in the $AutoMOEA+$ algorithms. Second, we assess whether the $AutoMOEA+$ algorithms can outperform the ones generated from our previous $AutoMOEA$ framework (Bezerra et al., 2016), and how they differ. Third, we assess whether the $AutoMOEA+$ algorithms are able to outperform the state-of-the-art MOEAs identified in Bezerra et al. (2018).^{5}

### 3.1 Parameter Space of $AutoMOEA+$

The parameter space of the $AutoMOEA+$ framework contains, besides numerical MOEA parameters such as population size ($\mu $) and number of offspring ($\lambda $) shown in Table 3, parameters for selecting options for abstract algorithmic components that define the MOEA design (Tables 1 and 2), and conditional parameters that need to be set if particular options are selected. Conditional parameters are listed in Table 2, already explained in the previous section, and in Table 3, which we detail next. A first group of parameters concerns archives. When pop is configured as a bounded-size archive instead of a fixed-size population, $\mu $ is interpreted as its maximum capacity and the initial number of solutions is given by $\mu 0 = \mu r \cdot \mu$. When a bounded-size external archive $popext$ is used, its capacity is given by $Next$. The next group of parameters concerns the underlying EA. When GA is selected, $pm$ and $pc$ give the probability of applying polynomial mutation and SBX crossover, respectively, for each individual or pair thereof. These operators have associated distribution indices $\eta m$ and $\eta c$ that must be configured. Our framework implements two different mutation schemes for real-parameter optimization (Deb and Deb, 2014): *bitwise* sets the mutation probability per variable $pv$ to $1/nvar$; *fixed* leaves $pv$ as a parameter to be configured. When the underlying EA is set to DE, only two parameters must be set, namely, the crossover probability ($CR$) and the scaling factor ($F$) of the DE operators. Whatever the underlying EA, when the $Selection$ component of $Mating$ is set to deterministic tournament (*DT*), then $TournamSize$ controls the tournament size; if it is set to stochastic tournament (*ST*), then $\gamma $ controls the probability of selecting the best contestant as the winner of a binary tournament. The last group of conditional parameters concerns $Preference$ components. When the *sharing* diversity metric is selected, the radius of the niches ($\sigma share$) must be configured. When the diversity metric is based on nearest neighbors (*NN*), mating considers the distance to the $k$-th nearest one, whereas replacement behaves as a nearest neighbor density estimation (see Bezerra et al., 2016 for details). When the AGA diversity metric is adopted, the number of grid cells is computed as a function of a discretization parameter $l$. Finally, if a decomposition-based preference component is selected, the upper bound size for the weight set $\Lambda $ is a function of a numerical parameter $\Lambda r$.
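The conditional structure described above can be summarized in a short sketch that lists which numerical parameters are active for a given design. The dictionary keys and option strings are hypothetical names chosen for illustration, and a single preference relation is assumed for simplicity, whereas the framework configures separate preference relations for mating, replacement, and the external archive.

```python
from typing import Dict, List


def active_parameters(design: Dict[str, str]) -> List[str]:
    """Numerical parameters that must be configured for a given design."""
    active = ["mu", "lambda"]  # always present
    # Archive-related parameters.
    if design.get("pop_type") == "bounded_archive":
        active.append("mu_r")  # initial size is mu_r * mu
    if design.get("pop_ext") == "bounded":
        active.append("N_ext")
    # Parameters of the underlying EA.
    if design.get("underlying_ea") == "GA":
        active += ["p_m", "p_c", "eta_m", "eta_c"]
        if design.get("mutation_scheme") == "fixed":
            active.append("p_v")  # bitwise mutation fixes p_v to 1 / n_var
    elif design.get("underlying_ea") == "DE":
        active += ["CR", "F"]
    # Mating selection.
    if design.get("selection") == "DT":
        active.append("TournamSize")
    elif design.get("selection") == "ST":
        active.append("gamma")
    # Preference components.
    diversity = design.get("diversity")
    if diversity == "sharing":
        active.append("sigma_share")
    elif diversity == "NN":
        active.append("k")
    elif diversity == "AGA":
        active.append("l")
    if design.get("refinement") == "weighted_ranking" or diversity == "reference_lines":
        active.append("Lambda_r")
    return active
```

For instance, a DE-based design with deterministic tournaments and reference lines activates `CR`, `F`, `TournamSize`, and `Lambda_r` in addition to the population parameters, while none of the GA-specific parameters needs to be set.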

We remark that the parameter space used to configure the $AutoMOEA+$ algorithms ensures fairness in the comparisons to the original $AutoMOEAs$ and also to the state-of-the-art MOEAs. Specifically, the domains we adopt in this work are reused from the original proposal of the $AutoMOEA$ framework when the given component was already available in that work (Bezerra et al., 2016). In addition, MOEAs in the state-of-the-art assessment were given the same underlying EA choice and their associated parameters were configured using the same domains also adopted here (Bezerra et al., 2018).

### 3.2 Automatic Design Setup

The idea of automatic MOEA design by coupling a component-wise MOEA framework with an automatic configuration tool was originally proposed in Bezerra et al. (2016); thus, we refer to the original publication for the general details of the proposal. Here, we focus on the setup used in the present article, and later we briefly highlight relevant differences between the setups used for configuring $AutoMOEA$ and $AutoMOEA+$. Several elements are needed to set up an automatic design scenario: the benchmark problems, the stopping criteria for the MOEAs, the parameters of the automatic configurator, and the unary performance metric that guides the configurator. Since we wish to compare with the state-of-the-art MOEAs from the literature previously identified in Bezerra et al. (2018), we use the same setup we adopted for tuning MOEAs in that work, summarized in Table 4 and detailed below. The values of $M$ and $FEmax$ and the $IIGD$ metric that had not been considered for the design of the original $AutoMOEAs$ are underlined.

**Benchmark problems.** We consider the box-constrained WFG (Huband et al., 2006) and most DTLZ (Deb et al., 2005) benchmark problems with $M\in \{2,3,5,10\}$ objectives and $nvar\in \{20,21,\ldots ,60\}$ variables; problems DTLZ1 and DTLZ3 are excluded due to ceiling effects (Bezerra et al., 2018). We reserve sizes $ntesting\in \{30,40,50\}$ for comparing algorithms and use only the remaining sizes, $nvar\setminus ntesting$, for the automatic design process, so that training and testing sets are kept separate.
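As a small illustration of this split, the sketch below derives the training sizes from the two sets given above. The constant names are ours and only serve the example.

```python
ALL_SIZES = set(range(20, 61))       # n_var in {20, 21, ..., 60}
TESTING_SIZES = {30, 40, 50}         # reserved for comparing algorithms

# Sizes available to the automatic design process (training set).
TRAINING_SIZES = sorted(ALL_SIZES - TESTING_SIZES)
```

Under this reading, 38 of the 41 problem sizes are used for training and the three reserved sizes are never seen by the configurator.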

**Stopping criterion of MOEAs.** Each MOEA run is stopped after using a maximum of $FEmax$ function evaluations. Here, we evaluate three values, $FEmax\in \{2500,10000,40000\}$. In addition, we set a maximum time limit per run to prevent very long runs from making the automatic design process infeasible. The time limit is long ($tmax$ = 1 hour) for $FEmax=2500$, assuming that such a value represents an expensive evaluation scenario. Otherwise, the maximum time limit is $tmax$ = 10 minutes.

**Configurator setup.** We use $irace$ (López-Ibáñez et al., 2016) as configurator. Given a set of training benchmark functions and a parameter space description, $irace$ searches for good parameter configurations by evaluating configurations of the target algorithm (in our case, instantiations of the $AutoMOEA+$ framework) on the benchmark functions according to a given unary quality metric. Describing $irace$ in detail is outside the scope of the article and more details can be found in López-Ibáñez et al. (2016). We run $irace$ with its default settings and each run of $irace$ has a budget of $20000$ MOEA runs.

**Unary quality metrics.** Since $irace$ is a single-objective optimizer, it uses unary quality metrics to evaluate the performance of a MOEA run. In particular, we use the unary $\epsilon $-metric ($I\epsilon +$) for a large number of objectives ($M=10$) and the relative deviation from an approximation of the optimal hypervolume ($IHrd$) otherwise.^{6} Before computing a metric, we discard objective vectors that exceed the bounds ($u$) given in Table 4 to avoid strong outliers that would skew results. Finally, the nadir point in the computation of the $IHrd$ is $r=1.1\cdot u$ to ensure extreme solutions contribute to the hypervolume.
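The preprocessing and the additive $\epsilon $-indicator can be sketched as follows. This is a minimal illustration assuming minimization and the standard definition of the unary additive $\epsilon $-indicator with respect to a reference set; the function names are hypothetical, and we assume a vector is discarded as soon as any of its coordinates exceeds the corresponding bound.

```python
def discard_outliers(points, upper_bounds):
    """Keep only objective vectors within the bounds u in every objective."""
    return [p for p in points if all(x <= u for x, u in zip(p, upper_bounds))]


def hypervolume_reference(upper_bounds, factor=1.1):
    """Reference point r = 1.1 * u used when computing the hypervolume."""
    return [factor * u for u in upper_bounds]


def additive_epsilon(approximation, reference_set):
    """Smallest eps such that every reference point is weakly dominated
    by some approximation point shifted by eps (minimization)."""
    return max(
        min(max(a_i - r_i for a_i, r_i in zip(a, r)) for a in approximation)
        for r in reference_set
    )


# Usage: the vector (9, 9) exceeds u = (4, 4) and is dropped before evaluation.
kept = discard_outliers([(1, 3), (3, 1), (9, 9)], upper_bounds=(4, 4))
eps = additive_epsilon(kept, reference_set=[(0, 2), (2, 0)])
```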

With the above setup, we run $irace$ once for each combination of $M$ and $FEmax$, resulting in 12 different $AutoMOEA+$ algorithms. For brevity, we refer to a particular scenario using the $\langle M,FEmax\rangle $ notation, e.g., $\langle 2,10k\rangle $ refers to a scenario where problems present two objectives and MOEAs are allowed to use 10 000 FEs. In order to assess their quality, we run each $AutoMOEA+$ algorithm on each of the benchmark functions with sizes $ntesting$. We perform 25 repetitions of each run and compute the mean value for each quality metric $I\epsilon +$, $IHrd$, and inverted generational distance ($IIGD$).^{7} All experiments are run on a single core of Intel Xeon E5410 CPUs @ 2.33 GHz with 6 MB cache size under Cluster Rocks Linux version 6.2/CentOS 6.2.
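The twelve design scenarios and their scenario-dependent settings can be enumerated with a short sketch. The dictionary layout and the names `scenario_setup` and `SCENARIOS` are illustrative assumptions; only the values of $M$, $FEmax$, the time limits, and the choice of tuning metric are taken from the setup described above.

```python
from itertools import product

M_VALUES = (2, 3, 5, 10)
FE_MAX_VALUES = (2500, 10000, 40000)


def scenario_setup(m, fe_max):
    """Collect the settings that depend on the scenario <M, FEmax>."""
    return {
        "label": f"<{m},{fe_max / 1000:g}k>",
        # I_eps+ guides the configurator for M = 10, I_H^rd otherwise.
        "tuning_metric": "I_eps+" if m == 10 else "I_H^rd",
        # 1 hour for the expensive-evaluation scenario, 10 minutes otherwise.
        "t_max_seconds": 3600 if fe_max == 2500 else 600,
    }


SCENARIOS = [scenario_setup(m, fe) for m, fe in product(M_VALUES, FE_MAX_VALUES)]
```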

### 3.3 Trends from the Generated $AutoMOEA+$ Algorithms

The designs of the automatically designed $AutoMOEA+$ algorithms are given in Table 5. Although it is not possible to tell whether a particular component significantly contributes to the performance of a MOEA design without a component-by-component analysis (Fawcett and Hoos, 2016) and possibly various repetitions of the configuration process for each scenario, some trends appear in the designs that have been obtained for the 12 different scenarios.

We first focus on the general trends among the $AutoMOEA+$ designs for scenarios with a number of objectives $M\in \{2,3,5\}$ (top three blocks of Table 5, with a grey background), as these designs share a number of similarities. First, the underlying EA choice differs from what human designers have typically adopted in the literature, as DE was always selected instead of GA. Even concerning the multiobjective DE literature we see a contrast to the traditional designs, since $irace$ always chooses the preference-based scheme (i.e., DE/target-to-best/1, which we propose here) and, as a consequence, never the online replacement. However, it is difficult to find an overall pattern for $Mating$, though deterministic tournaments are used more often than stochastic ones. Second, in contrast to $PreferenceMat$, patterns are clear for component $Replacement$, where hypervolume-based $Refinement$ is almost always used, and scenarios are nearly evenly split between using steady-state selection or sequential replacement. Finally, external archives are more frequently used for lower $M$ values, and the same pattern is observed for $Refinement$ components in $ReplacementExt$.^{8}

We next discuss the similarities and differences in the structure of the $AutoMOEA+$ algorithms focusing on the experimental factors that constitute scenarios:

$M$: First, while all $M<10$ designs use $UnderlyingEA$ = *DE*, two of three $M=10$ designs adopt $UnderlyingEA$ = *GA*. A second affected component is $Refinement$. The $IHh$ component is clearly frequent in bi-objective scenarios, whereas the $IH1$ indicator is chosen more frequently when $M\in \{3,5\}$. Likely, this is explained by the computational overhead of the $IHh$ indicator since, when $M=3$, no $Refinement$ component is used for the external archives except for the scenario with a larger cutoff time. When $M=10$, $I\epsilon +$ becomes the standard refinement option for $PreferenceRep$, but further investigation would be required to determine if this is a consequence of changing the performance metric for $irace$ when $M=10$. Finally, the occurrences of decomposition-based components increase as $M$ grows: component weighted rank is the most selected $Refinement$ option for mating selection when $M=5$, and all algorithms for $M=10$ use at least one decomposition-based component.

$FEmax$: The most evident insight we observe is that external archives tend to become prohibitive when this budget is increased but the maximum runtime is kept constrained. This is initially observed for scenarios with $M=3$, where only $Diversity$ components are used when $FEmax\in \{10000,40000\}$, and made worse on scenarios with $M=5$, where external archives are not used at all when $FEmax\in \{10000,40000\}$. The extreme situation is observed for $AutoMOEA+\langle 10,40k\rangle $, where neither refinement metrics nor external archives are used. An exception to this pattern is $AutoMOEA+\langle 10,10k\rangle $, which uses an external archive with hypervolume estimation.^{9} Yet, this $AutoMOEA+$ is the only algorithm for $M=10$ scenarios that does not use steady-state selection, so the adoption of a more costly external archive may be a design compromise between expensive components.

Finally, we observe that the design of $AutoMOEA+\langle 10,40k\rangle $ differs the most from all other $AutoMOEA+$ designs. Specifically, this algorithm resembles NSGA-III in its randomized mating selection, absence of refinement metrics, and use of GA. However, given that no external archive is used, the population size is rather small for a many-objective scenario. As we will discuss later, this design conducts a search that is too restricted in the objective space, and yet it performs well according to the $I\epsilon +$ indicator.

### 3.4 Comparison between Designs from $AutoMOEA+$ and $AutoMOEA$

To assess the improvements provided by the extensions proposed in this work, we first compare the $AutoMOEA+$ algorithms to the $AutoMOEAs$ designed in Bezerra et al. (2016). In particular, the $AutoMOEAs$ were created for scenarios with $FEmax=10000$ and $M\in \{2,3,5\}$, and we therefore only compare the algorithms on these scenarios; that is, in Table 5, only the rows $\langle 2,10k\rangle $, $\langle 3,10k\rangle $, and $\langle 5,10k\rangle $ are considered in the following discussion. The $AutoMOEAs$ have been tuned separately for the DTLZ and WFG benchmarks, while $AutoMOEA+$ is tuned across the two benchmark sets. Hence, the $AutoMOEAs$ potentially benefit more from tuning than the $AutoMOEA+$ algorithms (or the state-of-the-art MOEAs, which use the same setup as the $AutoMOEA+$ algorithms). Thus, the results in favor of the $AutoMOEA+$ algorithms are even more remarkable.

We start with a structural comparison of the $AutoMOEA$ and $AutoMOEA+$ designs. To aid this analysis we show in Table 6 the structure of the $AutoMOEAs$ from Bezerra et al. (2016). The main trends are the following. As previously discussed, DE is always used in the $AutoMOEA+$ algorithms. This design choice highlights the importance of providing different underlying EAs for a component-wise design, as for the $AutoMOEAs$ only the GA operators have been available. $Selection$ approaches are similar between $AutoMOEA$ and $AutoMOEA+$ designs, as tournaments are always used. However, the tournaments from the original $AutoMOEAs$ are deterministic and enforce greater convergence pressure due to the choice of four-ary (once) and eight-ary (five times) tournaments. This design difference is likely explained by the different underlying EAs used. While environmental selection is similar in all designs, they differ in the usage of external archives, which $AutoMOEAs$ select more often.

We proceed to a performance comparison, and to this end a rank sum analysis is given in Table 7 for all $FEmax=10000$ scenarios. The analysis in this section focuses on the first three scenarios, for which the $AutoMOEAs$ were originally designed. Moreover, for a particular scenario, the entry labeled as $Auto$ represents an aggregation of results from the $AutoMOEAs$ designed for each benchmark on that scenario. For instance, the rank sum entry labeled as $Auto$ on scenario $\langle 2,10k\rangle $ considers runs from $AutoMOEAD2$ on the DTLZ benchmark and from $AutoMOEAW2$ on the WFG benchmark. Thus, the $AutoMOEAs$ have the advantage of being two separate MOEAs tuned for each specific benchmark, while the $AutoMOEA+$ algorithms are a single design that must generalize over both benchmark sets.

Nevertheless, as seen in Table 7, the rank sums achieved by the $AutoMOEA+$ algorithms are statistically significantly better than the sums achieved by the original $AutoMOEAs$ when $M\in \{2,3\}$, whichever metric is considered. When $M=5$, all automatically designed MOEAs reach statistically equivalent results, but the original $AutoMOEAs$ achieve the lowest rank sums for the $I\epsilon +$ and $IIGD$ metrics. Two factors can help explain this result. First, the disagreements between performance metrics are known to increase with the increase in $M$ (Jiang et al., 2014). Second, the $AutoMOEAs$ were custom-designed for each of the benchmarks. With the increase in $M$, it is natural that the difficulty posed by the benchmarks also increases (Bezerra et al., 2018). Given that each benchmark comprises problems with different characteristics, it seems natural that the need for specialized algorithmic components becomes stronger.
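To make the notion of rank sums concrete, the sketch below shows one common way to compute them: within each block (e.g., one run on one problem instance), algorithms are ranked by their metric value, and ranks are then summed per algorithm. This is an illustrative assumption about the procedure, not a reproduction of the exact protocol of Bezerra et al. (2018); all three metrics are taken as minimized, ties share their mean rank, and the function names are ours.

```python
def average_ranks(values):
    """Rank values in ascending order (1 = best); ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks


def rank_sums(results):
    """results: dict algorithm -> list of metric values, one per block."""
    names = list(results)
    n_blocks = len(next(iter(results.values())))
    sums = dict.fromkeys(names, 0.0)
    for b in range(n_blocks):
        ranks = average_ranks([results[name][b] for name in names])
        for name, rank in zip(names, ranks):
            sums[name] += rank
    return sums
```

Lower rank sums indicate better overall performance; whether differences between sums are statistically significant must still be assessed with an appropriate test, such as the Friedman test mentioned later in the text.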

### 3.5 State-of-the-Art Comparison

To assess whether the performance of the $AutoMOEA+$ algorithms can match that of the state-of-the-art algorithms identified in Bezerra et al. (2018), we start with the aggregative analysis depicted by the rank sums given in Table 7, focusing on the comparison of $AutoMOEA+$ algorithms to the best-performing MOEAs from the literature.

Results shown in Table 7 lead to different conclusions depending on the given performance metric. In general, the $AutoMOEA+$ algorithms are either the top-ranking algorithms or at least statistically equivalent to the best-performing manually designed MOEA, which may vary according to the scenario and/or metric. Differences in favor of the $AutoMOEA+$ algorithms are always statistically significant according to $IIGD$, only for $M\in \{2,3\}$ scenarios according to $I\epsilon +$, and never according to $IHrd$. Indeed, it never happens that an $AutoMOEA+$ algorithm is statistically significantly better than the best-performing manually designed MOEA according to the metric used for tuning ($IHrd$ when $M<10$; $I\epsilon +$, otherwise). Overall, these results lead to two important conclusions. First, they effectively mean that a single MOEA, automatically designed for a given scenario, is able to perform at least as well as, and in many cases better than, the best manually designed MOEAs, even after their parameters have been properly tuned. Second, we see that the disagreements between metrics like $IHrd$ and $IIGD$ can be overcome to some extent by the automatic design process given that, even if differences in performance according to $IHrd$ are not statistically significant, differences for $IIGD$ are. Altogether, the results confirm that the $AutoMOEA+$ algorithms present either the most effective or the most robust performance for the scenarios considered so far.

We proceed to a more fine-grained analysis of the results to better visualize the effects of problem characteristics. Figures 1–3 show boxplots depicting the performance of these algorithms on selected problems from both benchmark sets and increasing $M$. Specifically, problems DTLZ2 and WFG4 represent concave problems, although the difficulty posed by the WFG concave problems tends to be higher than that of the DTLZ ones. Concerning the remaining DTLZ problems, DTLZ6 represents problems with a strong presence of local Pareto fronts, whereas DTLZ7 represents problems with disconnected Pareto fronts. Regarding non-concave WFG problems, we take WFG1 as illustrative. In general, the boxplots evidence the effects of problem characteristics and performance metric disagreements, which become ever stronger with the increase in $M$. Next, we discuss results from each scenario.

$\langle 2,10k\rangle $: Figure 1 (top) depicts $IHrd$ results, with which the remaining metrics agree. Effects from problem characteristics are seen in the differences between the relative performance of the algorithms when optimizing DTLZ or WFG problems. In more detail, for the DTLZ benchmark, the best-performing MOEAs present equivalent performance. Conversely, on the WFG set the $AutoMOEA+$ algorithm clearly achieves better and more consistent results for the non-concave problems represented by WFG1, and competitive performance on the concave ones, illustrated by WFG4. In addition, for no problem are the original $AutoMOEAs$ able to outperform $AutoMOEA+\langle 2,10k\rangle $, whereas the opposite often happens. We also note the rather different performances of both automatically designed algorithms on DTLZ6. In previous works, we have demonstrated that GA-based MOEAs struggle to solve this problem for a moderate number of variables, whereas DE-based ones can solve it far more easily (Bezerra et al., 2016, 2018); these considerations illustrate the importance of allowing configuration of the underlying EA and/or the variation operators.

$\langle 3,10k\rangle $: A similar pattern is observed in this scenario, shown in Figure 1 (bottom), where results from $IHrd$ are depicted. The performances on problems DTLZ2 and WFG1 repeat the pattern discussed for the previous scenario. Yet, results for the remaining problems become more spread, indicating that a few algorithms such as $AutoMOEA+\langle 3,10k\rangle $ perform better and more consistently than the remaining MOEAs. Once again, the $AutoMOEA+$ algorithm outperforms the $AutoMOEAs$ for all problems.

$\langle 5,10k\rangle $: Figure 2 shows results from $IHrd$ (top), $I\epsilon +$ (middle), and $IIGD$ (bottom), where one observes two different situations on DTLZ problems. For the problem characteristics represented by DTLZ2 and DTLZ6, we notice that some state-of-the-art MOEAs are able to present performance equivalent to $AutoMOEA+\langle 5,10k\rangle $, but the latter stands out according to $IIGD$. By contrast, on DTLZ7 $AutoMOEA+\langle 5,10k\rangle $ performs much worse than the best-performing MOEAs. This drawback is understandable when one observes that the automatic design methodology considers benchmarks as a whole, and specific functions may constitute exceptions to a broader picture. Regarding WFG problems, we see that performance metrics strongly disagree. For instance, the performance of $AutoMOEA+\langle 5,10k\rangle $ on WFG1 is remarkable according to $IHrd$, the metric used for tuning, but it is surpassed by SMS according to $IIGD$ and also by other MOEAs according to $I\epsilon +$. As for the concave WFG problems, the $AutoMOEA+$ algorithm is always the best-performing, but the gap w.r.t. other MOEAs is very different depending on the metric considered. Finally, we see that $AutoMOEA+\langle 5,10k\rangle $ improves over the original $AutoMOEA$ for all problems except DTLZ7 according to all metrics.

$\langle 10,10k\rangle $: As previously reported in Bezerra et al. (2018), the performance of MOEAs in this scenario is quite different from the remaining scenarios, in that metric ranges for given problems become much more spread. For this reason, our scaling is unable to fit results from many MOEAs; yet, if boxplots were to render all results visible, the differences between the best-performing algorithms would become difficult to distinguish, concealing the most important part of our analysis. In general, no single MOEA is considered consistent across all problems and metrics. $AutoMOEA+\langle 10,10k\rangle $, for instance, performs poorly on DTLZ6 and DTLZ7. The extreme situations are observed for $IIGD$ results: a poor relative performance on non-concave problems, contrasting with a very good relative performance on concave problems. Altogether, in comparison to other MOEAs, we see a generally competitive performance from $AutoMOEA+\langle 10,10k\rangle $.

### 3.6 Stopping Criteria Effects

The analysis above has focused only on scenarios with $FEmax=10000$, for which we have identified two factors that appear critical for the trends observed in the results: (i) the heterogeneity of the problem characteristics that comprise the benchmark sets, and (ii) the disagreements between performance metrics, which become stronger with increasing number of objectives. We next further investigate this latter factor, with a focus on the role played by the stopping criterion in this interaction.

A summary of the rank sum analyses conducted on all scenarios is given in Table 8. Each entry summarizes the result of the statistical tests applied to the rank sums produced by each metric, respectively $IHrd$, $I\epsilon +$, and $IIGD$, and denotes that the given $AutoMOEA+$ algorithm was considered better than (+), equivalent to (=), or worse than (-) the best state-of-the-art MOEA. In addition, if a MOEA is considered better than or equivalent to the $AutoMOEA+$ algorithm for all metrics, it is indicated in parentheses. Results are provided in full as supplementary material, for brevity (Bezerra et al., 2017b).

A few important patterns are noticeable from Table 8. First, the $AutoMOEA+$ algorithms are generally able to match or improve over the performance of the state-of-the-art MOEAs for each scenario. Specifically, improvements are often observed on $IIGD$ analyses, and occasionally on $I\epsilon +$ ones, whereas equivalence is far more often observed on $IHrd$ analyses. Second, in more than half of the scenarios the $AutoMOEA+$ algorithms improve over the state-of-the-art according to at least one metric, and only rarely do they fail to match the performance of the state-of-the-art MOEAs for all metrics at the same time. Indeed, the only scenarios where this latter situation occurs are $\langle 5,40k\rangle $ and $\langle 10,40k\rangle $, both scenarios with a larger $FEmax$ value. Third, it is far more likely that an $AutoMOEA+$ algorithm improves over the state-of-the-art according to a given metric when $M$ and/or $FEmax$ are low; the only exception to this pattern is scenario $\langle 2,2.5k\rangle $. Finally, only two manually designed MOEAs are able to either match or outperform a given $AutoMOEA+$ algorithm according to all metrics at the same time: (i) SMS, which matches the performance of $AutoMOEA+\langle 2,2.5k\rangle $, and (ii) IBEA, which outperforms $AutoMOEA+\langle 10,40k\rangle $. We next further discuss results grouped by $FEmax$.

**2 500 FEs:** When only a limited number of FEs is given to MOEAs, the differences between the best-performing algorithms are reduced. In fact, in many scenarios it is not possible to identify statistically significant differences between the performances of the $AutoMOEA+$ algorithms, SMS, and IBEA according to Friedman's test. Yet, only once do we observe that a single manually designed MOEA is able to match the performance of an $AutoMOEA+$ algorithm for all metrics at the same time, namely SMS when $M=2$. Indeed, for all remaining $M$ values we observe that the $IIGD$ performance of the $AutoMOEA+$ algorithms is considered statistically significantly better than that of the remaining MOEAs. Altogether, these results indicate that it is possible to design MOEAs that obtain good results according to all metrics at the same time even for scenarios where few FEs are available.

**40 000 FEs:** When a larger FE budget is considered, the best-performing MOEAs are able to approximate well the Pareto front for most bi-objective problems. Nonetheless, the rank sum analysis demonstrates that the performance of $AutoMOEA+\langle 2,40k\rangle $ is statistically significantly better than that of the state-of-the-art MOEAs for all metrics. For $M=3$, only the results according to $IIGD$ are statistically significantly in favor of $AutoMOEA+\langle 3,40k\rangle $. Differences in conclusions depending on the metric used become stronger for even larger numbers of objectives, as shown in Tables 8 and 9. When $M=5$, we see a different set of best-performing MOEAs depending on the metric considered. $AutoMOEA+\langle 5,40k\rangle $ is either the best ranking (for $IHrd$) or statistically not significantly different from the top-ranking MOEA (for $I\epsilon +$). This is different for $IIGD$, where it ranks fourth. Finally, for $M=10$ the $AutoMOEA+$ algorithm is unable to outperform or even match the overall performance of the best-performing MOEAs for any of the metrics. In fact, the performance of $AutoMOEA+\langle 10,40k\rangle $ resembles the performance previously reported for MOEA/D on this scenario (Bezerra et al., 2018), being considered good only according to $I\epsilon +$, even if not competitive with the best-performing MOEAs. A surprising result from this scenario is that IBEA displays very good performance according to all metrics. Considering, however, that the tuning of IBEA and the design of the $AutoMOEA+$ are given the same budget and the tuning of $AutoMOEA+$ does not use any initial configurations, the much larger configuration space of $AutoMOEA+$ may explain why a design such as IBEA is not (yet) found by $irace$. Yet, this is the only scenario in which having a smaller configuration space translates into a clear advantage for the manually designed MOEAs.

From a more general perspective, the automatic MOEA design is strongly affected by the disagreements between metrics, and this challenge grows not only as a function of $M$, but also of $FEmax$. More precisely, when MOEAs are given more resources, it is natural that their search converges to the region of interest of the performance metric used to guide tuning or the implicit preferences from the algorithm designer. In other words, due to the known disagreements of the metrics, further optimizing the metric that guides the automatic tuning by increasing $FEmax$ leads to worse results on the other metrics. From the rank sum analyses, we see that the automatic design is sensitive to this issue.

### 3.7 Tuning Metric Effects on Single-Objective MOEA Design

The disagreements between performance metrics have evidenced the importance of the tuning metric for the effectiveness of the automatically designed algorithms. We next investigate the effects of designing an $AutoMOEA+$ algorithm for scenario $\langle 5,40k\rangle $, optimizing $I\epsilon +$ instead of $IHrd$. The structure of this $AutoMOEA+$ algorithm (henceforth called $AutoMOEA+\langle 5,40k,I\epsilon +\rangle $) is given in Table 10. The most significant structural change w.r.t. $AutoMOEA+\langle 5,40k\rangle $ depicted in Table 5 concerns the adoption of an external archive with a $PreferenceExt$, which comprises an indicator-based $Refinement$ ($I\epsilon +$) and a decomposition-based $Diversity$ (reference lines). This preference relation is complementary to $PreferenceMat$ and $PreferenceRep$, which both use the $IH1$ indicator as $Refinement$ component. This change in the design seems to reflect the change in tuning metric, as $AutoMOEA+\langle 5,40k\rangle $ was heavily $IH1$-based, whereas the search of $AutoMOEA+\langle 5,40k,I\epsilon +\rangle $ is more balanced w.r.t. metrics; however, more repetitions of the design process would be required to corroborate this hypothesis.

The rank sum analysis given in Table 11 shows the results from a comparison that includes all MOEAs considered so far. (For brevity, $AutoMOEA+\langle 5,40k,I\epsilon +\rangle $ is abbreviated as $Auto-\epsilon $ in the table, and only the top-performing MOEAs are given.) The performance of $AutoMOEA+\langle 5,40k,I\epsilon +\rangle $ reflects the balance between metrics discussed above. Specifically, its performance is considered statistically significantly better than the state-of-the-art MOEAs according to $IIGD$, and equivalent to the best MOEA according to the remaining metrics. In fact, the results according to $IHrd$ are remarkable, given that the performance of $AutoMOEA+\langle 5,40k,I\epsilon +\rangle $ is considered equivalent to that of $AutoMOEA+\langle 5,40k\rangle $, which was configured for $IHrd$. More importantly, results from $AutoMOEA+\langle 5,40k,I\epsilon +\rangle $ evidence that the challenge posed by metric disagreements can be alleviated even for larger $FEmax$ values, at least for a moderate number of objectives.

## 4 A Multiobjective Formulation to Automatic MOEA Design

The disagreement between performance metrics is most evident in the many-objective scenarios, where the best-ranked manually designed MOEA strongly depends on the metric used to measure quality. One possible way to overcome this disagreement is to optimize all metrics simultaneously during the automatic design process. In this section, we propose such a multiobjective formulation, following related work on multiobjective configuration of algorithms (Dréo, 2009; Bezerra et al., 2017a). First, we briefly define the concept of multiobjective configuration and detail our proposal. We then present an experimental investigation to evaluate this formulation on scenario $\langle 10,40k\rangle $, the most challenging scenario we have identified so far.

### 4.1 Multiobjective MOEA Design

The fields of automatic algorithm configuration and multiobjective optimization intersect in two main ways (Bezerra et al., 2017a). The first one concerns the automatic configuration of multiobjective algorithms (López-Ibáñez and Stützle, 2012; Dubois-Lacoste et al., 2011; Bezerra et al., 2016), that is, the target algorithm tackles multiobjective problems and, hence, returns a set of mutually nondominated solutions. This is the context of the work presented so far in this article. The second one concerns multiobjective configuration of algorithms (Dréo, 2009; Blot et al., 2016), that is, the configuration of algorithms according to several metrics simultaneously.

In this section, we consider the *multiobjective design of MOEAs*, where the configurator searches for a MOEA design that optimizes multiple performance metrics simultaneously. In particular, we propose to aggregate the various performance metrics that we have used before to evaluate the performance of the MOEAs; that is, we consider an aggregation of the metrics $C=\{IHrd,I\epsilon +,IIGD\}$. For this aggregation we use the hypervolume ($IH$) metric (Zitzler et al., 2002) to make the multiobjective nature of the configuration problem transparent to $irace$. Concretely, candidate evaluation is done in two stages. First, each metric in $C$ is computed, following the same setup described in the previous section. Second, the $IH$ of the subspace dominated by the objective vector representing the performance of the candidate in the *metric space* is computed. To ensure each metric is equally assessed by the $IH$ metric, we use a two-stage normalization approach, as follows. First, we discard points outside the upper bounds defined for each metric,^{10} to avoid strong outliers. Next, we normalize each metric value to the $[1,2]$ interval. The $IH$ metric is then computed using 2.2 in every dimension of the metric space as the reference point.
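The two-stage evaluation can be sketched as follows. Since a single candidate corresponds to a single point in the metric space, the hypervolume it dominates reduces to the volume of a box between that point and the reference point. The lower bounds used for normalization, the treatment of a discarded point as contributing zero hypervolume, and the function names are assumptions made for this illustration; the text only specifies the upper bounds, the $[1,2]$ interval, and the reference value 2.2.

```python
def normalize(value, lower, upper):
    """Map a metric value linearly from [lower, upper] onto [1, 2]."""
    return 1.0 + (value - lower) / (upper - lower)


def aggregated_hypervolume(metrics, bounds, reference=2.2):
    """Hypervolume of the box dominated by one candidate in metric space.

    metrics: dict metric name -> value (all metrics are minimized)
    bounds:  dict metric name -> (lower, upper); lower bounds are assumed
    """
    volume = 1.0
    for name, value in metrics.items():
        lower, upper = bounds[name]
        if value > upper:
            # Assumption: a discarded point dominates nothing.
            return 0.0
        volume *= reference - normalize(value, lower, upper)
    return volume
```

Under these assumptions, a candidate that attains the lower bound on all three metrics obtains $1.2^3=1.728$, whereas one sitting exactly on all upper bounds obtains $0.2^3=0.008$. Because the hypervolume is to be maximized, a configurator that minimizes its cost function would need to receive, for example, the negated value.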

We use an aggregation for two main reasons. First, to simplify the choice of the final $AutoMOEA+$ to be compared to other MOEAs from several, possibly mutually nondominated $AutoMOEA+$ designs (nondominated w.r.t. the metrics in $C=\{IHrd,I\epsilon +,IIGD\}$). Second, to be able to directly use the $irace$ configurator that expects configurations to be evaluated w.r.t. a single metric. Given the large configuration space, we believe that this approach also helps to better direct the search of $irace$ to very high-performing $AutoMOEA+$ designs.

### 4.2 Empirical Assessment

The experimental setup we adopt to evaluate our approach is similar to the setup adopted in the previous section. The only differences are (i) the number of scenarios, as only scenario $\langle 10,40k\rangle $ is considered, and (ii) the way candidate configurations are evaluated, as we adopt the multiobjective formulation described above.

The structure of the $AutoMOEA+$ algorithm designed using the proposed multiobjective formulation (henceforth called $AutoMOEA+\langle 10,40k,MO\rangle $) is given in Table 12. Compared to the structure of $AutoMOEA+\langle 10,40k\rangle $ given in Table 5, designed to optimize the $I\epsilon +$, we notice significant structural differences. In fact, the only component in common between both algorithms is dominance count for $PreferenceMat$. We also remark how interesting it is that a configuration designed to optimize multiple metrics selects the $I\epsilon +$ as $Refinement$ component for every $Preference$ relation adopted, when this component had not been used at all in the algorithm meant to optimize the $I\epsilon +$ metric. In addition, it is surprising that a large external archive using NN as $Diversity$ and sequential removal gets selected in a runtime-constrained setup. Yet, the computationally most expensive component of $AutoMOEA+\langle 10,40k\rangle $, steady-state selection, is not present in $AutoMOEA+\langle 10,40k,MO\rangle $. Finally, DE is selected in place of GA as underlying EA. Altogether, one can understand these structural changes as a trade-off between computationally demanding components, with $irace$ favoring the combination of refinement metrics and an external archive over steady-state replacement.

The rank sum analysis in Table 13 shows that this change in structure leads to remarkable performance. Not only does $AutoMOEA+$$\langle 10,40k,MO\rangle$ rank first according to all metrics, it is also considered statistically significantly better than all other MOEAs according to $IIGD$. This achievement is all the more important in light of what had been observed in the state-of-the-art assessment conducted in Bezerra et al. (2018), and also in Section 3. Specifically, some algorithms are able to excel according to given metrics at the cost of others, such as NSGA-III for $IIGD$ (interestingly, the metric used in the original article to evaluate NSGA-III performance). Still, $AutoMOEA+$$\langle 10,40k,MO\rangle$ is able to excel according to all metrics, even outperforming NSGA-III for $IIGD$. In fact, the only manually designed MOEA that is able to achieve a balanced, yet effective performance on this scenario is IBEA. Yet, $AutoMOEA+$$\langle 10,40k,MO\rangle$ is considered statistically significantly better than IBEA for all metrics but $IHrd$. Finally, it is interesting to observe the large difference in rank sums between $AutoMOEA+$$\langle 10,40k,MO\rangle$ and $AutoMOEA+$$\langle 10,40k\rangle$: the single-objective design guided by $I\epsilon +$ produces an algorithm that is unable to excel for the very metric it was designed to optimize. By contrast, the multiobjective formulation leads to an algorithm that is balanced, yet effective for all metrics.
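For readers unfamiliar with rank sums, the following minimal Python sketch shows the generic computation: within each block (for instance, a problem instance), algorithms are ranked by metric value, tied values share the average rank, and ranks are summed across blocks. The algorithm names, block labels, and values are hypothetical, and the exact blocking and statistical test used for Table 13 are not reproduced here.

```python
def average_ranks(values):
    """Ranks (1 = smallest value); tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_sums(results):
    """results[block][algorithm] -> metric value (lower is better)."""
    algorithms = sorted(next(iter(results.values())))
    totals = {name: 0.0 for name in algorithms}
    for block in results.values():
        ranks = average_ranks([block[name] for name in algorithms])
        for name, rank in zip(algorithms, ranks):
            totals[name] += rank
    return totals


# Hypothetical metric values for three algorithms on three blocks.
results = {
    "p1": {"X": 0.1, "Y": 0.2, "Z": 0.3},
    "p2": {"X": 0.2, "Y": 0.2, "Z": 0.5},
    "p3": {"X": 0.4, "Y": 0.1, "Z": 0.3},
}
sums = rank_sums(results)  # lower rank sum = better overall
```

In this toy example, Y obtains the lowest rank sum even though X wins on the first block, which is the kind of aggregate view that a rank sum table provides.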

We conclude with a problem-wise assessment, using the boxplots depicted in Figure 4. For clarity, only MOEAs that rank fourth or better according to any of the metrics in the rank sum analyses are included in the comparison. The only problem for which all metrics agree is DTLZ2; for all others, at least one metric strongly disagrees with the remaining ones. This pattern also applies to $AutoMOEA+$$\langle 10,40k,MO\rangle$, which is never outperformed by other MOEAs according to all metrics at the same time. The largest gaps between $AutoMOEA+$$\langle 10,40k,MO\rangle$ and the remaining MOEAs are seen for $IIGD$, whereas the smallest are observed for $IHrd$. Indeed, the contrasting results between $IHrd$ and $IIGD$ further corroborate the need for algorithm engineering approaches that simultaneously consider multiple metrics, especially given that $IIGD$ has been so widely employed in the design and assessment of many-objective MOEAs. Finally, even when comparing according to $I\epsilon +$, the $AutoMOEA+$ designed to optimize $I\epsilon +$ outperforms $AutoMOEA+$$\langle 10,40k,MO\rangle$ on only a few functions, confirming that a design which balances the importance of different metrics can lead to a better overall performance.

## 5 Conclusion

In this work, we have automatically designed state-of-the-art multi- and many-objective evolutionary algorithms (MOEAs) for box-constrained continuous optimization problems. Specifically, we have considered a range of experimental factors such as benchmark problems, number of variables and objectives, stopping criteria, and performance metrics. The $AutoMOEA+$ algorithms designed in this work have demonstrated performance that is remarkably robust to all of these factors, especially the latter three. These results were only made possible through a series of investigations upon which this article builds, namely our proposal of the automatic MOEA design methodology (Bezerra et al., 2016), our review of multiobjective algorithm configuration (Bezerra et al., 2017a), and our assessment of the state-of-the-art in MOEAs for box-constrained continuous optimization (Bezerra et al., 2018).

The convergence of the insights obtained from those studies was translated into two major proposals in this work. The first is the $AutoMOEA+$ framework, which comprises the most relevant MOEA design paradigms (dominance-, indicator-, and decomposition-based), underlying evolutionary algorithms (genetic algorithms and differential evolution), and archive truncation techniques. From this framework, we have automatically designed state-of-the-art MOEAs for all experimental scenarios considered in multiobjective optimization, and for nearly all scenarios considered in many-objective optimization. Many of the design choices present in the $AutoMOEA+$ algorithms differ considerably from what human designers have so far considered; some designs couple components from entirely different design paradigms to produce high-performing MOEA designs. Indeed, all novel components implemented in this article have been used in one or more automatically designed MOEAs, except for the online replacement component. Performance improvements from the $AutoMOEA+$ algorithms over the $AutoMOEAs$ produced from the original framework corroborate the benefits of the extensions we propose in this work.

The second proposal focuses on many-objective scenarios, and consists of a multiobjective formulation of the automatic MOEA design. Using this formulation, one can automatically design MOEAs that simultaneously optimize a set of relevant, yet disagreeing metrics ($IHrd$, $I\epsilon +$, and $IIGD$). Perhaps surprisingly, we have shown that an algorithm designed to optimize a set of metrics can even outperform, according to a given metric, algorithms created with only that metric in mind. Overall, the performance of the resulting $AutoMOEA+$$\langle 10,40k,MO\rangle$ algorithm is remarkable for several reasons: (1) it ranks first according to all metrics in rank sum analyses; (2) it is considered statistically significantly better than the state-of-the-art MOEAs according to $IIGD$; and (3) it is considered statistically significantly better than IBEA, the best-performing MOEA for this scenario according to all metrics, for all but the $IHrd$ metric.

The implications of this work are many and its applications are numerous. First, although we have produced a number of novel state-of-the-art algorithms for the main application domain of MOEAs, our actual contribution is the empirical demonstration that this task is likely feasible for any application domain. The only imperative requirement is the a priori identification of effective domain-specific components, to which the automatic design approach is flexible enough to adapt. Second, our multiobjective formulation of MOEA design is an elegant solution to the disagreement between Pareto-compliant performance metrics, and yet its major contribution is the empirical demonstration that a MOEA should not be engineered with a single metric in mind, regardless of the scenario considered. In addition, our multiobjective formulation of MOEA design has so far only been tested for solution quality assessment. Yet, it seems imperative to also account for runtime in the search for algorithms with better anytime behavior.

A final implication of our work concerns the limitations of MOEAs that our approach did not set out to solve, but rather to highlight. Specifically, our previous investigation on manually designed state-of-the-art MOEAs had indicated that a few factors pose challenges that MOEAs are yet to overcome: (i) having too few function evaluations available; (ii) accounting for very heterogeneous problem characteristics; and (iii) scaling to deal with a significant number of variables and/or objectives. In all of these scenarios, the automatically designed MOEAs match or surpass the performance of the manually designed state-of-the-art MOEAs. Still, it becomes evident that the MOEA research community needs to keep pushing in these directions if effective algorithms are to be designed, either manually or automatically.

## Acknowledgments

The research presented in this article has received funding from the COMEX project (P7/36) within the IAP programme of BelSPO. Thomas Stützle acknowledges support from the Belgian F.R.S.-FNRS, of which he is the research director.

## Notes

^{1}

This algorithm is generally referred to as SMS-EMOA in the literature. Here, we dub it SMS for brevity.

^{3}

It would also be possible to use a similar modeling to encompass other underlying EAs, such as CMA-ES and even *particle swarm optimization* (PSO; Eberhart and Kennedy, 1995). However, this would require operators aware of population-related aspects, such as neighborhood topology in PSO. We leave such investigation for future work.

^{4}

A weight vector belongs to the outer layer if at least one of its components equals zero; otherwise it belongs to the inner layer.

^{5}

We have empirically verified that the performance of the implementations used in this work matches what has been reported in the original papers. Specifically, we primarily adopt MOEA implementations from well-established frameworks such as ParadisEO-MOEO (Cahon et al., 2004), Shark (Igel et al., 2008), PaGMO (Biscani et al., 2010), and PISA (Bleuler et al., 2003). When unavailable from those sources, we adopt either official implementations (Zhang, 2007) or implement MOEAs ourselves reusing available code (DE-based and NSGA-III). In the case of NSGA-III, the core implementation is provided by Chiang (2014), in which we fixed existing issues.

^{6}

All metrics considered in this work and also in our state-of-the-art assessment (Bezerra et al., 2018) are unary formulations of binary metrics, using reference fronts for a common comparison. We adopt this formulation to avoid the large number of comparisons that truly binary metrics would require. For further details on the metrics and reference fronts used for their computation, see Bezerra et al. (2018).
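As an illustration of such unary formulations, the following minimal Python sketch computes the additive epsilon indicator and the inverted generational distance of an approximation front against a fixed reference front, using their standard textbook definitions for minimization. The function names and the two small fronts are hypothetical; normalization and the construction of the reference fronts follow Bezerra et al. (2018) and are not reproduced here.

```python
import math


def additive_epsilon(approx, reference):
    """Smallest shift eps such that every reference point is weakly dominated
    by some approximation point translated by -eps (minimization)."""
    return max(
        min(max(a_i - r_i for a_i, r_i in zip(a, r)) for a in approx)
        for r in reference
    )


def igd(approx, reference):
    """Mean Euclidean distance from each reference point to its closest
    approximation point."""
    return sum(min(math.dist(a, r) for a in approx) for r in reference) / len(reference)


# Hypothetical bi-objective fronts: the approximation is the reference
# front shifted by 0.5 in both objectives.
reference_front = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
approximation = [(0.5, 1.5), (1.0, 1.0), (1.5, 0.5)]

eps_value = additive_epsilon(approximation, reference_front)
igd_value = igd(approximation, reference_front)
```

Because both functions take the reference front as a fixed second argument, each algorithm receives one value per run, and comparing n algorithms requires n evaluations rather than one per ordered pair.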

^{7}

We are aware that IGD is not Pareto-compliant under some conditions. Yet, we use it to reproduce the setup used in a major share of the literature on many-objective EAs and also in our state-of-the-art assessment (Bezerra et al., 2018).

^{8}

One may assume that using an external archive is always better than not using it when function evaluations are used as stopping criterion, but the time-constrained setups we adopt help explain the cases in which no external archive is used.

^{9}

When $M>3$, component $IHh$ uses a Monte Carlo estimation instead of an exact computation.
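A generic way to estimate a hypervolume by Monte Carlo sampling is sketched below in Python: points are drawn uniformly in the box between a lower bound and the reference point, and the fraction dominated by the front is scaled by the box volume. This is only an illustrative sketch; the function name, sample size, seed, and bounds are assumptions, and the sampling scheme actually used by the component is not described in this note.

```python
import random


def mc_hypervolume(front, reference, lower, samples=50000, seed=0):
    """Estimate the hypervolume dominated by `front` (minimization) inside
    the box [lower, reference] by uniform sampling."""
    rng = random.Random(seed)
    box_volume = 1.0
    for lo, hi in zip(lower, reference):
        box_volume *= hi - lo
    hits = 0
    for _ in range(samples):
        sample = [rng.uniform(lo, hi) for lo, hi in zip(lower, reference)]
        # A sample counts if at least one front point weakly dominates it.
        if any(all(a_i <= s_i for a_i, s_i in zip(a, sample)) for a in front):
            hits += 1
    return box_volume * hits / samples


# Hypothetical four-objective case with a known exact value of 0.5**4 = 0.0625.
estimate = mc_hypervolume(
    front=[(0.5, 0.5, 0.5, 0.5)],
    reference=(1.0, 1.0, 1.0, 1.0),
    lower=(0.0, 0.0, 0.0, 0.0),
)
```

The cost of the estimate grows linearly with the number of samples and front points, in contrast to exact hypervolume computation, whose cost grows quickly with the number of objectives.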

^{10}

For $I\epsilon +$ and $IIGD$, the bound is set to 100. For $IHrd$, the bound is set to 1.0.

## References

*Studies in computational intelligence*.

*Advanced information and knowledge processing*