## Abstract

We continue recent work on the definition of multimodality in multiobjective optimization (MO) and the introduction of a test bed for multimodal MO problems. This goes beyond well-known diversity maintenance approaches but instead focuses on the landscape topology induced by the objective functions. More general multimodal MO problems are considered by allowing ellipsoid contours for single-objective subproblems. An experimental analysis compares two MO algorithms, one that explicitly relies on hypervolume gradient approximation, and one that is based on local search, both on a selection of generated example problems. We do not focus on performance but on the interaction induced by the problems and algorithms, which can be described by means of specific characteristics explicitly designed for the multimodal MO setting. Furthermore, we widen the scope of our analysis by additionally applying visualization techniques in the decision space. This strengthens and extends the foundations for Exploratory Landscape Analysis (ELA) in MO.

## 1 Introduction

Multiobjective optimization is increasingly applied in domains where the single-objective functions are of complex, nonlinear nature, and therefore most likely multimodal. A demonstrative example is given by the problem of antenna placement. If multiple antennas transmitting the same signal are employed, the strength of the signals is a multimodal function over space. We may also consider multiple types of signals, that is, the mobile phone network and a signal from a local sensor network to be maximized. In this case, the maximization of the signal strength is a multiobjective multimodal optimization problem. Due to the radial decay of signal strength around each antenna, its structure resembles that of the multiobjective multisphere problem visualized on the left of Figure 1, which was recently introduced in Kerschke, Wang et al. (2016). Similarly, *multiobjective, multimodal* problems occur also in high energy physics and quantum control (Laforge et al., 2011), drug design by docking considering energy and contact (Nicolaou and Brown, 2013), and in urban planning problems, when we want to choose a location near to different types of facilities (Maulana et al., 2015).

In either of the aforementioned scenarios, further information on the underlying problem is of high importance. In single-objective optimization, *Exploratory Landscape Analysis* (ELA, Mersmann et al., 2011) is known as a sophisticated technique for characterizing various properties of a continuous landscape by means of numerical features (e.g., the landscape's curvature, or the distribution of the local optima). These are initially computed on a small sample of evaluated points, and may be used to derive valuable information about a problem's landscape. Among others, Kerschke, Preuss et al. (2016) designed topological features that were used for detecting funnel structures. Using ELA features (in general) one can effectively enhance algorithm selection and/or configuration models (Liefooghe et al., 2015) as shown in the scheme on the right side of Figure 1. However, the generalization of such techniques to the multiobjective domain remains an open research problem, and first requires a thorough understanding of landscape features. In Kerschke, Wang et al. (2016), formal definitions for multimodality in multiobjective optimization problems are introduced, which provide a first step towards generalizing the ELA framework to multiobjective optimization problems.

Being able to select the right algorithm on the basis of a small sample via applying ELA (as done for the single-objective case in Bischl et al., 2012) is our vision. However, setting up the necessary features is not trivial, because we have to explicitly target the interaction between the single-objective functions, about which little is known, especially when the functions themselves are all multimodal. As we rely on an existing, highly configurable problem generator, we follow a *bottom-up* approach and first try to understand the combined effect the objective functions have on different types of algorithms, especially under specific variations of a similar problem composition. We treat the problem instances as white boxes, such that we can determine, for example, how many local fronts an algorithm is able to find and how many solutions are distributed on each one. This enables a much more informed view of the algorithm–problem interactions than would be possible otherwise. However, in a real-world setting, this knowledge would not be available. Our general idea is thus to find out which measurable characteristics describe the observed algorithm behavior well and then, later, use this knowledge to come up with ELA features that allow choosing the right algorithm for an unknown black-box problem on the basis of a small sample.

After summarizing the related work in Section 2, we extend the foundation laid in Kerschke, Wang et al. (2016) in different ways:

- The necessary topological definitions for treating multimodality in the multiobjective context are developed further in Section 3.
- In Section 4, we look into the treated problems from an analytical perspective, and especially derive the Pareto fronts and efficient sets.
- As the shape of the generated problems is generalized to ellipsoids, new visualization techniques explicitly take the decision space into account and help us grasp the interactions between problem and algorithm characteristics (Section 5).
- Section 6 describes the two employed algorithms, especially the Hypervolume Indicator Gradient Ascent (HIGA-MO), in more detail.
- New problem and algorithm characteristics, so to speak the white-box predecessors of new MO-related ELA features, are set up in Section 7.
- In Section 8, we experimentally analyze the behavior of the two algorithms on the new problems, and explain it with the newly introduced characteristics.

## 2 Related Work

In the past, the analysis of local properties of multiobjective optimization problems focused mainly on single-point methods. The Fritz John and Karush-Kuhn-Tucker conditions form necessary and sufficient conditions for Pareto optimality in the continuous case, given the regularity conditions of differentiability and convexity of the objective functions (Miettinen, 1998). Such conditions can easily be restated to provide single-point landscapes, for example, by minimizing the residual of the angle between the objective function vectors in the unconstrained (bi-objective) case. Moreover, if full knowledge of the search landscape is available, the normalized dominance rank can be considered as a measure of closeness to the efficient set (Fonseca, 1995). More general conditions on local efficiency can be stated on level sets (Ehrgott, 2005). A set-oriented view of multimodality is, however, new; it seems to better support the analysis of population-based algorithms for approximating the Pareto front.

In discrete optimization, the problem of analyzing local properties of Pareto fronts has been advanced further. Single-point analysis is classically done by stating non-dominance in some neighborhood. Following this, Stadler and Flamm (2003) generalized barrier trees from single-objective optimization to barrier forests of partially ordered landscapes, of which multiobjective optimization landscapes are a special case. The so-called barrier forest allows visualizing the structure of the basins of attraction for local search algorithms that accept only dominating points, and how these basins are separated by barriers. Moves to nondominated points are not allowed, which might limit the usability of these barrier forests in the analysis of multiobjective optimization. A priori landscape analysis for discrete problems was also proposed by Tantar et al. (2008), with a focus on visualizing design space boundaries of combinatorial problems. Verel et al. (2011, 2013) instead proposed set-oriented definitions of multimodality and local optimality, inspired by ideas of indicator-based multiobjective optimization and set-dominance expressed in earlier work. In recent work, discrete landscape features and neighborhood-based search heuristics on binary search spaces are discussed, considering the $\epsilon$-indicator as a measure of proximity to the Pareto front (Liefooghe et al., 2015; Daolio et al., 2016). The generalization of these concepts to continuous domains is still new. Preuss et al. (2006) motivated the need for such studies by a detailed analysis of synthetic problems in low dimensions. However, in this work we do not explicitly focus on diversity issues of optimizers in multimodal settings (e.g., Ulrich et al., 2010; Zadorojniy et al., 2012) but rather investigate the general search behavior of the respective solvers. The previous work of Kerschke, Wang et al. (2016), on which this article is based, presents a first step towards understanding multimodal landscapes in multiobjective optimization in a more systematic way. There, the definition of local optimality of a point was generalized to the definition of a locally efficient set, which can be viewed as an attractor for population-based local search. Moreover, scalable test problems for multimodal single-objective optimization, originally introduced by Wessing (2015), were generalized to the multiobjective case. However, the rich structure of the objective space, as compared to single-objective optimization, allows extending these basic definitions to a more comprehensive framework for reasoning about landscape features.

## 3 Multimodality

In this section, we introduce the definition of multimodality for multiobjective landscapes. The search and objective spaces of the multiobjective functions studied here are subsets of $\mathbb{R}^n$. Most of our definitions can also be generalized to other spaces; however, due to space limitations, this is not part of this work.

Let $A \subseteq \mathbb{R}^s$. The subset $A$ is called *connected* if and only if there do not exist two open subsets $U_1$ and $U_2$ of $\mathbb{R}^s$ such that $A \subseteq (U_1 \cup U_2)$, $(U_1 \cap A) \neq \emptyset$, $(U_2 \cap A) \neq \emptyset$, and $(U_1 \cap U_2 \cap A) = \emptyset$; or, equivalently, there do not exist two non-empty subsets $A_1$ and $A_2$ of $A$, open in the relative topology of $A$, such that $(A_1 \cup A_2) = A$ and $(A_1 \cap A_2) = \emptyset$. Let $B$ be a non-empty subset of $\mathbb{R}^s$. A subset $C$ of $B$ is a *connected component* of $B$ iff $C$ is non-empty, connected, and there exists no strict superset of $C$ within $B$ that is connected.

Now, let $f : X \to \mathbb{R}^m$ be a multiobjective function (which we want to "minimize") with component functions $f_i : X \to \mathbb{R}$, $i = 1, \dots, m$, and $X \subseteq \mathbb{R}^d$. Given a totally ordered set $(T, \leq)$ with total order $\leq$, the Pareto order $\prec$ on $T^k$ for any $k \in \mathbb{N}$ is defined as follows: let $t^{(1)} = (t_1^{(1)}, \dots, t_k^{(1)})$, $t^{(2)} = (t_1^{(2)}, \dots, t_k^{(2)}) \in T^k$. We say $t^{(1)} \prec t^{(2)}$ iff $t_i^{(1)} \leq t_i^{(2)}$ for $i = 1, \dots, k$, and $t^{(1)} \neq t^{(2)}$. Specializing this to the reals with their natural total order, we obtain the Pareto order on $\mathbb{R}^m$. A point $x \in X$ is called *Pareto efficient*, *globally efficient*, or, for short, *efficient* iff there does not exist an $\tilde{x} \in X$ such that $f(\tilde{x}) \prec f(x)$. The set of all (globally) efficient points of $X$ is denoted by $X_E$ and is called the *efficient subset* of $X$ (or *efficient set* of $f$). The image of $X_E$ under $f$ is called the Pareto front of $f$.

Defining a locally efficient point in $X$ (or of $f$) is as straightforward as defining local minimizers (maximizers) of single-objective functions. This is in contrast to defining local efficient *sets*, which are needed in the multi-criteria setting.

A point $x \in X$ is called a *locally efficient point* of $X$ (or of $f$) if there is an open set $U \subseteq \mathbb{R}^d$ with $x \in U$ such that there is no point $\tilde{x} \in (U \cap X)$ with $f(\tilde{x}) \prec f(x)$. The set of all locally efficient points of $X$ is denoted by $X_{LE}$.

A point $x \in X$ is called a *globally efficient point* of $X$ (or of $f$) if there is no point $\tilde{x} \in X$ such that $f(\tilde{x}) \prec f(x)$. The set of all globally efficient points of $X$ is termed the *(global) efficient set* (or Pareto set) of $f$ and denoted by $X_E$.

A subset $A \subseteq X$ is a *local efficient set* of $f$ if $A$ is a connected component of $X_{LE}$ (the set of locally efficient points of $X$).

A subset $P$ of the image of $f$ is a *local Pareto front* of $f$, if there exists a local efficient set $E$ such that $P=f(E)$.

Note that the local efficient set has already been defined for combinatorial search domains in Paquete et al. (2004). Furthermore, the (global) Pareto front of $f$ is obtained by taking the image under $f$ of the union of the connected components of $X_E$. If $X_E$ is connected and $f$ is continuous on $X_E$, the Pareto front is also connected. In this work, we use the notion of connectedness to define the local efficient sets. There remains the task of extending the notion of efficient set by looking at connectedness in the objective space. For instance, two different local efficient sets could be mapped onto the same set in the objective space. This raises many questions, which need to be addressed in future work.

With a view towards algorithms that compute approximations to (local) efficient sets and/or (local) Pareto fronts, one needs to be able to tell whether a finite set is a subset of a connected component (i.e., whether a finite subset of $X_{LE}$ is a subset of some local efficient set). A finite subset of a Euclidean space is never connected unless it consists of a single point. Of course, if a set $S$ is connected and a subset of the locally efficient points of $X$, then $S$ is a subset of some local efficient set. When dealing with neighborhood systems, however, finite sets can very well be connected (or even path-connected).

Let $\epsilon \in \mathbb{R}_{>0}$ and $S \subseteq \mathbb{R}^q$ for some $q$. $S$ is $\epsilon$-*connected* if for any two points $s_i, s_k \in S$ there is a finite set of points $\{s_{i+1}, \dots, s_{k-1}\} \subseteq S$ such that $d(s_i, s_{i+1}) \leq \epsilon$, $\dots$, $d(s_{k-1}, s_k) \leq \epsilon$, where $d$ is the Euclidean distance function on $\mathbb{R}^q$.

A finite subset $S$ of $X$ is considered a subset of $X_{LE}$ if it consists of locally efficient points of $X$ and $S$ is $\epsilon$-connected, with $\epsilon$ below a relatively small threshold $\epsilon_0 > 0$.

Let $S$ be a *finite* subset of $X_{LE}$. Then $S$ is an $\epsilon$-*local efficient set* if $S \neq \emptyset$ and $S$ is $\epsilon$-connected.
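For a finite approximation set, $\epsilon$-connectivity can be checked directly by a breadth-first search on the graph that links points at Euclidean distance at most $\epsilon$. The following sketch is our own illustrative helper, not part of the cited work:

```python
import numpy as np

def is_eps_connected(S, eps):
    """Check eps-connectivity of a finite point set S via BFS on the eps-neighbor graph."""
    S = np.asarray(S, dtype=float)
    n = len(S)
    if n == 0:
        return False  # an eps-local efficient set must be non-empty
    # pairwise Euclidean distances and the eps-neighborhood adjacency matrix
    dist = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=-1)
    adj = dist <= eps
    # breadth-first search from the first point
    seen, frontier = {0}, [0]
    while frontier:
        i = frontier.pop()
        for j in np.nonzero(adj[i])[0]:
            if j not in seen:
                seen.add(j)
                frontier.append(j)
    return len(seen) == n
```

A finite approximation set returned by an optimizer can thus be grouped into candidate $\epsilon$-local efficient sets by extracting the connected components of this graph.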

## 4 Analysis on Simple Mixed-Peak Problems

In this section, the bi-objective problem that is used as our benchmark is introduced with detailed discussions on its properties. To facilitate the later analysis of the multiobjective landscapes, the analytical Pareto fronts and corresponding efficient sets are derived for this problem class.

### 4.1 Mixed-Peak Functions

*Multiple Peaks Model 2* (MPM2, Wessing, 2015) is adopted to illustrate the proposed topological definitions and to further analyze the behavior of explorative algorithms. This function class is a mixture of similar unimodal functions, the peaks, which have *convex* local level sets, a property that is typically combined with the well-known Karush-Kuhn-Tucker theorem to identify locally efficient points. In addition, the complexity of the problem can easily be controlled via the number of peaks. The mixed-peak function is defined as an unconstrained function $f : \mathbb{R}^d \to \mathbb{R}$ that is subject to minimization:

$$f(x) = 1 - \max_{1 \leq i \leq N} g_i(x), \qquad (1)$$

where each $g_i$ is a *quasi-concave* unimodal peak, whose negative leads to *quasi-convex* valleys of $f$. According to the optproblems package (Wessing, 2016), $f$ has the following parameters: (1) the number of peaks $N \in \mathbb{Z}_{>0}$; (2) a center $c_i \in \mathbb{R}^d$, a height $h_i \in [0, 1]$, and a radius $r_i \in [0.25d, 0.5d]$ per peak, with decision space dimension $d$; (3) a "shape" parameter $s_i \in [1.5, 2.5]$ per peak, controlling the landscape's steepness; and (4) a rotation of the elliptical level sets based on a *positive definite* matrix $\Sigma_i$. In the following, we will use the norm notation $\lVert x - c_i \rVert_{\Sigma_i} := (x - c_i)^\top \Sigma_i (x - c_i)$, as it can be considered the Mahalanobis distance w.r.t. $\Sigma_i$.

**Ridges:** As a result of the definition of $f$ (Eq. 1), the landscape can contain ridges, that is, points at which the maximum in Eq. 1 is attained by more than one peak. The set of all ridges of $f$ can be represented by:

$$R = \left\{ x \in \mathbb{R}^d \mid \exists\, i \neq j : g_i(x) = g_j(x) = \max_{1 \leq k \leq N} g_k(x) \right\}. \qquad (2)$$

At every non-ridge point, exactly one peak function attains the maximum and is called *active*. From now on, the active peak function at $x$ is denoted as $g_\tau$ w.r.t. $\tau = \operatorname{argmax}_{1 \leq i \leq N} \{g_i(x)\}$. In fact, the ridges separate the decision space into many *active regions* $A_i$, on each of which only the single peak function $g_i$ is active:

$$A_i = \left\{ x \in \mathbb{R}^d \mid g_i(x) > g_j(x) \ \text{for all} \ j \neq i \right\}. \qquad (3)$$

**Convex Local Level Sets:** Given the quasi-concavity of each peak $g_i$, the function $1 - g_i$ has convex local level sets in $\mathbb{R}^d$. If $1 - g_i$ is restricted to an $\epsilon$-Euclidean ball $B_\epsilon(x^*) = \{x \in \mathbb{R}^d \mid \lVert x - x^* \rVert < \epsilon\}$ for any $x^* \in \mathbb{R}^d$ and any $\epsilon > 0$, the resulting function $1 - g_i|_{B_\epsilon(x^*)} : B_\epsilon(x^*) \to \mathbb{R}$ also has convex local level sets. Moreover, due to the fact that the active regions $A_i$ are disjoint and open, for every non-ridge point $x^*$ it is possible to find a $\delta > 0$ (depending on $x^*$) such that $B_\delta(x^*) \subset A_\tau$ and $(B_\delta(x^*) \cap A_i) = \emptyset$ for all $i \neq \tau$ (where $\tau$ is the unique index of the active peak function at $x^*$). The restriction of $f$ to $B_\delta(x^*)$, $f|_{B_\delta(x^*)}$, then equals $1 - g_\tau|_{B_\delta(x^*)}$ and thus has convex local level sets. Therefore, we conclude that $f$ has convex local level sets at every non-ridge point.
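To make this structure concrete, the following sketch evaluates a mixed-peak function of the form $f(x) = 1 - \max_i g_i(x)$ and reports the index $\tau$ of the active peak. The concrete peak shape used here is an illustrative Mahalanobis-based bump, not the exact MPM2/optproblems formula:

```python
import numpy as np

def make_peak(c, h, r, s, Sigma):
    """Illustrative quasi-concave peak with elliptical level sets.

    NOT the exact MPM2/optproblems peak formula; c = center, h = height,
    r = radius-like scale, s = shape/steepness, Sigma = positive definite matrix.
    """
    c, Sigma = np.asarray(c, float), np.asarray(Sigma, float)
    def g(x):
        d2 = (x - c) @ Sigma @ (x - c)          # squared Mahalanobis distance
        return h / (1.0 + d2 ** (s / 2.0) / r)  # decreases monotonically with distance
    return g

def mixed_peak(peaks):
    """f(x) = 1 - max_i g_i(x); also report the index tau of the active peak."""
    def f(x):
        vals = [g(x) for g in peaks]
        tau = int(np.argmax(vals))
        return 1.0 - vals[tau], tau
    return f
```

Since each `g` is monotonically decreasing in the Mahalanobis distance to its center, its level sets are ellipsoids, which gives the quasi-concavity assumed above.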

### 4.2 Mixed-Peak Bi-Objective Problem

**One Peak Scenario:** We first consider a simple case in which each objective consists of a single peak, so that no ridges occur in the domain and each objective degenerates to $f_j = 1 - g_j$ (Eq. 4). In this setting, a *necessary condition* for $x^* \in \mathbb{R}^d$ being efficient is that the objective gradients are anti-parallel, $\nabla f_1(x^*) = -k\, \nabla f_2(x^*)$ for some multiplier $k \geq 0$ (Eq. 5). The multiplier $k$, viewed as a function of $x^*$, is *continuous* in $\mathbb{R}^d$. Therefore, it must take every value between its minimum and maximum, resulting in $0 \leq k < \infty$. Taking the range of $k$ into account, every point that satisfies Eq. 5 can be written in closed form as a curve parameterized by $k$ (Eq. 6). Each such point is *locally Pareto efficient*, and the efficient set of the problem is expressed as the image of this parameterization over $k \in [0, \infty)$ (Eq. 7).

**Multiple Peaks:** If each objective function consists of multiple peak functions, namely $N > 1$, the efficient set derived in Eq. 7 can be adapted in the following manner: suppose functions $f_1$ and $f_2$ contain $N$ and $N'$ peaks, respectively. For each pair of peaks from the two objective functions (e.g., $g_i$ and $g'_j$), a *pseudo-efficient set* can be calculated according to Eq. 7, as if the remaining peaks in both objective functions did not exist (Eq. 8).

## 5 Visualizing the Decision Space of Multiobjective Landscapes

Within recent work, a new approach for visualizing the decision space of multimodal multiobjective landscapes based on a scalar combination of its gradients was introduced (Kerschke and Grimme, 2017). It depicts the interaction of overlapping multiobjective local optima and provides a first understanding of a problem's landmarks, such as ridges and valleys, in rather unexplored multiobjective settings. Figuratively speaking, the method visualizes the behavior of a “multi-objective” ball, which behaves like a gradient descent optimization algorithm, on multiobjective landscapes.

The local efficient sets^{1} lie on connections between any pair of peaks (from the different objectives). Note that the connections are straight lines for the mixed-sphere problems and curved lines for the mixed-ellipse problems, respectively. As a path along the gradients leads to the (attracting) local efficient set, we use the cumulated path length as the objective value (or "height") of our scalar representation of the multiobjective landscape. The scalarized problem can then be visualized within a two-dimensional heatmap or a three-dimensional surface plot, as shown in Figure 3. We also enhanced our heatmap by adding the contour lines of the objective-wise mixed-sphere (or mixed-ellipse) problems, the combined gradient vectors (only every 50th value per dimension, for better readability), and the true local efficient sets.
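The scalarization can be sketched as follows: from a starting point we repeatedly step against the sum of the normalized objective gradients and accumulate the traveled distance. This is a rough stand-in for the combined-gradient rule of Kerschke and Grimme (2017); the step size, tolerance, and stopping rule are our assumptions:

```python
import numpy as np

def combined_gradient(grads):
    """Sum of the normalized single-objective gradients (vanishes near local efficiency)."""
    return sum(g / (np.linalg.norm(g) + 1e-12) for g in grads)

def path_length(x0, grad_fns, step=1e-3, tol=1e-6, max_iter=10_000):
    """Cumulated length of the descent path along the combined gradient ("height")."""
    x, total = np.asarray(x0, float), 0.0
    for _ in range(max_iter):
        g = combined_gradient([gf(x) for gf in grad_fns])
        if np.linalg.norm(g) < tol:
            break                      # (numerically) locally efficient: gradients cancel
        dx = -step * g                 # move against the combined gradient
        x, total = x + dx, total + np.linalg.norm(dx)
    return total
```

Evaluating `path_length` on a grid and plotting the ($\log_{10}$-transformed) values reproduces the kind of heatmap discussed here; on the bi-sphere problem the normalized gradients cancel exactly on the segment between the two centers, so the height is zero there.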

The mixed-sphere problem shown in Figure 3 contains two peaks within the first (indicated by orange contour lines) and one peak within the second objective (white dashed lines). The coloring of the plots represents the ($\log_{10}$-transformed) cumulated length of the path of gradients to reach an at least locally efficient point and the corresponding path-length-to-color mapping is shown in the color bar on the right. In the following, we highlight some of the peculiarities that we detected within the plots.

**Basins of attraction:** Within the shown example, two basins of attraction, each comprising one locally attracting connected set, are visible. The existence of these basins (along with their included connected sets) supports our thesis of the “multiobjective ball,” which follows the combined gradients until it converges in a local efficient set. Interestingly, an area of attraction can comprise several disjoint local efficient sets—for example, the light blue and orange segments in the left connected set within the heatmap of Figure 3—simply because (adjacent) parts of this connected set belong to different dominance layers. In this scenario, the right segment of the left connected set (i.e., the orange line) is dominated by the entire right connected set (dark blue line) as shown within the plot of the (theoretically) true local fronts in the objective space, Figure 4.

**Discontinuities:** We furthermore discovered overlays of basins of attraction, resulting in "cliffs" within the landscape. These abrupt changes within the landscape are created by competing peaks within the single objectives. Even in the rather simple scenario from Figure 3, which contains two peaks in the first and only one peak in the second objective, this behavior becomes visible. The two competing peaks of the first objective cause a shift in the gradient landscape—leading to a cliff along the line of equal height of its two peaks, that is, the position where the spherical contour lines of the two peaks (from the first objective) intersect. If the two objectives contained more peaks, the competition among the peaks would lead to an even more rugged landscape. As a result of this "discontinuity" in the landscape, minor changes in the (starting) position can cause the "multiobjective ball" to "roll" towards a completely different locally efficient set.

**Expression of basins of attraction in the objective space:** The coloring of the gradient landscape in Figure 3 represents the ($\log_{10}$-transformed) cumulated lengths of the gradient paths towards the nearest attracting local efficient set. As each observation exists in the decision *and* objective space, we are able to transfer the coloring scheme to the objective space by coloring the image in objective space according to each sample point's color within the decision space. The result is shown in the right plot of Figure 4. While both connected fronts (i.e., the images of the connected sets) become visible, one cannot identify the local fronts purely based on the coloring. However, by comparing the dominance relationship between the connected fronts, one could of course (manually) split the connected fronts into local fronts. Also, points within the vicinity of a local front are dominated, supporting our definition of a local front. Furthermore, one can see that points with an increased distance to the local front are colored in a darker shade of red; that is, their cumulated path lengths towards the local optima are longer.

**Interpretation of 3D visualization of multiobjective landscapes:** Three-dimensional surface plots and the respective heatmaps are usually a good and helpful tool for detecting valleys, ridges or other characteristics of the analyzed problem landscape. However, in this case the objective function, which actually defines the “height” of the landscape, is the cumulated length of the path of the gradients; that is, it is a mapping from the original two objectives into a scalar objective. Consequently, when interpreting plots, we have to keep in mind that they do not describe the landscape in the common single-objective sense but rather the interaction of the objectives.

Also, there are some technical limitations to this visualization approach: Ideally, all points belonging to an efficient set should have a combined gradient length of zero. Thus, all local efficient sets, as well as the corresponding local fronts, should have the same color. However, as the coloring within the plots indicates, none of the one million discrete points from the grid has a value below $10^{-5}$. Due to numerical imprecision, none of these points has a gradient of exactly zero. By comparing heatmaps of multiple scenarios (e.g., the ones that are shown within Sect. 8), one can also detect that the sizes of the valleys (i.e., their lengths and widths) vary and thus, the probability of detecting these valleys, including their comprised local efficient sets, likely varies as well.

In general, we suggest always considering multiple visualizations of the landscape, as each plot explains a different aspect of the problem; consequently, one gets a better understanding of the entire problem by looking at it from different angles.

## 6 Explorative Algorithm

In this section, two multiobjective optimizers are introduced that follow entirely opposing search dynamics: a gradient-based method that is able to converge accurately to local and especially global efficient sets, and a naïve stochastic hill-climbing approach in which each search point performs a simple (1+1)-selection. Neither of the two algorithms uses an external archive, and the population size was chosen to balance the algorithm's running time against the reliability of the induced algorithmic features.

**Hypervolume Indicator Gradient Ascent (HIGA-MO):** This algorithm computes the steepest ascent direction of the hypervolume indicator w.r.t. the decision vectors. Such a direction, called the *hypervolume indicator gradient*, was proposed in Emmerich et al. (2007) and Emmerich and Deutz (2012), and the practical gradient ascent algorithm based on it, *Hypervolume Indicator Gradient Ascent Multiobjective Optimization (HIGA-MO)*, was improved in Wang, Deutz et al. (2017) and Wang, Ren et al. (2017). We first denote the approximation to the Pareto efficient set as a set of decision vectors $X = \{x^{(1)}, x^{(2)}, \dots, x^{(\mu)}\}$, $x^{(i)} \in \mathbb{R}^d$, $i = 1, 2, \dots, \mu$. HIGA-MO takes a set-based view on differentiation: the decision vectors are concatenated into $\mathbf{X} = [x^{(1)\top}, x^{(2)\top}, \dots, x^{(\mu)\top}]^\top \in \mathbb{R}^{\mu \cdot d}$, using the same ordering as in $X$. In this treatment, an approximation set $X$ is considered a single point in the product space $\mathbb{R}^{\mu \cdot d}$. Analogously, the corresponding objective values are encapsulated in $\mathbf{Y} = [y^{(1)\top}, y^{(2)\top}, \dots, y^{(\mu)\top}]^\top \in \mathbb{R}^{\mu \cdot m}$ with $y^{(i)} = f(x^{(i)}) \in \mathbb{R}^m$, $i = 1, 2, \dots, \mu$. Thus, one can explicitly define a vector-valued mapping $F : \mathbb{R}^{\mu \cdot d} \to \mathbb{R}^{\mu \cdot m}$ as $\mathbf{Y} := F(\mathbf{X})$. Furthermore, the hypervolume indicator $H$ can be expressed as a continuous mapping from $\mathbb{R}^{\mu \cdot d}$ to $\mathbb{R}$, $H_F(\mathbf{X}) := H \circ F(\mathbf{X}) = H(F(\mathbf{X}))$, whose gradient decomposes, per decision vector, into *sub-gradients*: the local hypervolume change rate obtained by moving each decision vector infinitesimally. Moreover, one can calculate the sub-gradients by applying the chain rule (Emmerich and Deutz, 2012; Wang, Ren et al., 2017):

$$\frac{\partial H_F(\mathbf{X})}{\partial x^{(i)}} = \sum_{k=1}^{m} \frac{\partial H(\mathbf{Y})}{\partial y_k^{(i)}}\, \nabla f_k(x^{(i)}).$$

Note that the gradients of the objective functions, $\nabla f_k(x^{(i)})$, can be approximated numerically if no analytical knowledge of the functions is available. In addition, the terms $\partial H(\mathbf{Y})/\partial y_k^{(i)}$ are the partial derivatives of the hypervolume indicator $H$ w.r.t. the components of $y^{(i)}$, which are calculated as the lengths of the steps of the attainment curve (see Emmerich and Deutz, 2012 for details). Consequently, the hypervolume indicator gradient is a *linear combination* of the objective-wise gradients.
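For the bi-objective minimization case with reference point $r$, these attainment-curve step lengths admit a compact implementation. The following is an illustrative sketch (our own helper, not the authors' code) for a nondominated set $Y$:

```python
import numpy as np

def hv_2d(Y, ref):
    """Hypervolume (minimization) of a nondominated 2-D set Y w.r.t. reference point ref."""
    Y = Y[np.argsort(Y[:, 0])]            # ascending in f1 -> descending in f2
    y2_prev, hv = ref[1], 0.0
    for y1, y2 in Y:
        hv += (ref[0] - y1) * (y2_prev - y2)
        y2_prev = y2
    return hv

def hv_partials(Y, ref):
    """dH/dy per point: negative attainment-curve step lengths."""
    order = np.argsort(Y[:, 0])
    Ys = Y[order]
    n = len(Ys)
    grad = np.empty_like(Ys)
    for i in range(n):
        y2_left = ref[1] if i == 0 else Ys[i - 1, 1]
        y1_right = ref[0] if i == n - 1 else Ys[i + 1, 0]
        grad[i, 0] = -(y2_left - Ys[i, 1])   # dH/dy1: vertical step of the attainment curve
        grad[i, 1] = -(y1_right - Ys[i, 0])  # dH/dy2: horizontal step of the attainment curve
    out = np.empty_like(grad)
    out[order] = grad                        # restore the original point ordering
    return out
```

Each partial derivative is negative, since worsening an objective value shrinks the dominated hypervolume; its magnitude equals the corresponding step length of the attainment curve.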

The hypervolume indicator gradient is well-defined for nondominated subsets of the approximation set $X$, since the image of each such decision vector contributes to the hypervolume. For any *strictly dominated* point, the associated sub-gradient is zero because such a point contributes nothing to the hypervolume. To nevertheless move such points towards the (global) Pareto efficient set, the well-known *nondominated sorting* technique (Srinivas and Deb, 1994) is adopted: the Pareto set approximation is partitioned into multiple locally nondominated *layers*, and the hypervolume indicator sub-gradient at a point is computed with respect to the layer to which the point belongs.

The nondominated sorting-based approach is of particular interest to our landscape exploration task. To explore a multimodal multiobjective landscape, it is important to search for local efficient sets. In the nondominated sorting-based approach, each layer has its own local hypervolume indicator and thus is *locally* optimized, which will not necessarily converge to the global efficient set. In this sense, the layers can be treated as candidate approximation sets to local efficient sets.
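The partition into locally nondominated layers can be sketched with the naive $O(\mu^2)$ dominance test per layer; `nondominated_layers` is our illustrative helper, not the HIGA-MO implementation:

```python
import numpy as np

def dominates(a, b):
    """Pareto dominance for minimization: a is no worse everywhere and better somewhere."""
    return bool(np.all(a <= b) and np.any(a < b))

def nondominated_layers(Y):
    """Partition objective vectors (rows of Y) into nondominated layers (fronts)."""
    idx = list(range(len(Y)))
    layers = []
    while idx:
        front = [i for i in idx
                 if not any(dominates(Y[j], Y[i]) for j in idx if j != i)]
        layers.append(front)
        idx = [i for i in idx if i not in front]
    return layers
```

Each resulting layer can then be optimized with its own local hypervolume indicator, as described above.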

**Stochastic Local Search (SLS):** For comparison of local search behavior, we implement a simple local search strategy based on parallel perturbations. Essentially, each decision point of the current approximation set is perturbed once per round. According to a simple (1+1)-selection scheme, within each iteration, the original decision point is replaced when dominated by the perturbed one. Initially, $\mu $ decision points are generated using Latin hypercube sampling (LHS). In every iteration, each decision vector is mutated by a standard normal distribution that is truncated to $[-\sigma ,\sigma ]$. After the elitist and parallel selection process based on domination, $\mu $ decision points are available for the next iteration. The loop is repeated until a termination criterion (here: maximum number of rounds) is reached. The rationale of using this simple approach is to contrast HIGA-MO with a local search representative that is unable to traverse along local Pareto fronts. We expect this approach to get stuck in local efficient solutions.
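A minimal sketch of such a parallel (1+1) local search might look as follows. Function and parameter names are ours; the initialization is a simple stratified stand-in for Latin hypercube sampling, and the truncated normal step is approximated by clipping:

```python
import numpy as np

def dominates(a, b):
    """Pareto dominance for minimization."""
    return bool(np.all(a <= b) and np.any(a < b))

def sls(f, dim, mu=10, sigma=0.05, rounds=100, bounds=(0.0, 1.0), rng=None):
    """Parallel (1+1) stochastic local search: each of the mu points keeps its own lineage."""
    rng = np.random.default_rng(rng)
    lo, hi = bounds
    # LHS-style initialization: stratified samples per dimension, shuffled per column
    X = (rng.permuted(np.arange(mu)[:, None] + rng.random((mu, dim)), axis=0)
         / mu) * (hi - lo) + lo
    FX = np.array([f(x) for x in X])
    for _ in range(rounds):
        # normal step, clipped to [-sigma, sigma] (approximation of truncation)
        step = np.clip(rng.normal(0.0, sigma, X.shape), -sigma, sigma)
        Xp = np.clip(X + step, lo, hi)
        FXp = np.array([f(x) for x in Xp])
        for i in range(mu):
            if dominates(FXp[i], FX[i]):   # (1+1)-selection: accept only dominating offspring
                X[i], FX[i] = Xp[i], FXp[i]
    return X, FX
```

Because a point is only replaced by a *dominating* offspring, this search cannot traverse along a local Pareto front, which is exactly the contrast to HIGA-MO intended here.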

## 7 Problem and Algorithm Characteristics

In the preceding sections, we have introduced the selected optimization problems (Sect. 4) and the applied algorithms (Sect. 6), along with their expected differences regarding the problem complexity or algorithm behavior. Such (detailed) knowledge of the problem landscapes and algorithm performances is crucial when trying to solve an *Algorithm Selection Problem* (Rice, 1976). That is, based on information of the problem landscapes and the algorithm performances, one can train a so-called *algorithm selection model*, which tries to select the best suited solver (out of a given portfolio of optimization algorithms) for an unseen optimization problem.

Unfortunately, the aforementioned "information" is often derived manually, based on the authors' observations and/or knowledge. However, in the future, that process should be automated, for example, by using exploratory landscape features as is already common practice within single-objective optimization. As a first step towards the development of such features, we propose some characteristics, which can be seen as a numerical representation of the problem landscapes and algorithms. However, while sophisticated ELA features (Bischl et al., 2012; Kerschke et al., 2015; Mersmann et al., 2011) are computed based on a small sample of observations from the entire landscape, we consider the problems within this article as white-box problems and thus can use information from the entire landscape to compute some representative numbers.^{2}

### 7.1 Problem Characteristics

The first group of characteristics aims at quantifying the problem landscapes. While the (connected) count characteristics should also apply to combinatorial landscapes, the length characteristics require a specific metric for computing the respective lengths.

**Count Characteristics:** These characteristics describe the landscapes by the number of local efficient sets (count.les), connected sets (count.sets), domination layers (count.layer), or peaks per objective (count.peaks1, count.peaks2). The characteristic count.ps_rel computes the ratio of local efficient sets that actually are global efficient sets (= Pareto sets). Here, a value of one means that all local efficient sets are nondominated and thus form the Pareto set. Note that it is sufficient to compute the latter ratio for the local efficient sets, as there exists a bijective function between the local efficient sets (in the decision space) and the local fronts (in the objective space).
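The ratio count.ps_rel can be illustrated with a short sketch operating on sampled point representations of the local fronts. The representation and the function name are our own; the paper derives these values analytically from the white-box problems.

```python
import numpy as np

def dominates(a, b):
    """True iff objective vector a Pareto-dominates b (minimization)."""
    return bool(np.all(a <= b) and np.any(a < b))

def count_ps_rel(local_fronts):
    """Ratio of local efficient sets that are globally nondominated.

    local_fronts: list of (n_i, m) arrays, each holding objective vectors
    sampled along one local front. A front counts as global if none of
    its points is dominated by any sampled point of the landscape.
    """
    all_pts = np.vstack(local_fronts)
    def is_global(front):
        return all(not dominates(q, p) for p in front for q in all_pts)
    return sum(is_global(front) for front in local_fronts) / len(local_fronts)
```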

**Length Characteristics:** As the pure number of fronts or sets might be misleading due to varying lengths—for example, the light blue front within Figure 4 is much shorter than the dark blue line—the next six characteristics focus on the actual lengths^{3} of the local fronts and efficient sets. Here, length.les_total represents the total length of all local efficient sets and length.ps_rel measures the relative length of the Pareto sets; that is, the total length of all global efficient sets divided by the total length of all local efficient sets (including the global ones). Thus, a value of one is equivalent to a landscape in which all points from the local efficient sets are globally nondominated. The third characteristic of this group (length.ps_ratio) standardizes the former characteristic (length.ps_rel) by its analogue among the count characteristics (count.ps_rel).^{4}

The remaining three characteristics of this group measure the analogous properties for the local fronts; that is, the images of the local efficient sets: length.lf_total measures the total length of all local fronts, length.pf_rel is the ratio of the total length of the Pareto fronts compared to the total length of all local fronts, and length.pf_ratio standardizes the latter by the count ratio of Pareto fronts and local fronts.
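If the efficient sets and fronts are represented as polylines, the length characteristics reduce to simple arithmetic. The following sketch (function names hypothetical) computes length.ps_rel as defined above:

```python
import numpy as np

def polyline_length(points):
    """Length of a polyline given its ordered (n, d) vertices."""
    return float(np.sum(np.linalg.norm(np.diff(points, axis=0), axis=1)))

def length_ps_rel(global_sets, local_sets):
    """Total length of the global efficient sets divided by the total
    length of all local efficient sets (global ones included)."""
    return (sum(polyline_length(s) for s in global_sets)
            / sum(polyline_length(s) for s in local_sets))
```

The analogous front-based characteristics (length.lf_total, length.pf_rel) follow by applying the same functions to the images of the sets in the objective space.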

**Connected Front/Set Characteristics:** As some algorithms (e.g., HIGA-MO) are able to travel along the local efficient sets (or local fronts), we also need information on sets or fronts that are connected to any of the Pareto sets or fronts. Thus, conn_ps.count_abs counts the number of local efficient sets that have a direct or indirect connection to at least one of the Pareto sets and conn_ps.length measures the total length of all of these sets. These two characteristics are also the foundation for the remaining ones: conn_ps.count_rel gives the proportion of the number of sets that are (somehow) connected to any of the Pareto sets and conn_ps.length_rel provides the same information based on the length of these sets. Analogously to the previous four characteristics, conn_pf.count_abs lists the number of local fronts that are connected to any of the Pareto fronts, conn_pf.length measures the total length of the aforementioned fronts and conn_pf.count_rel and conn_pf.length_rel provide the corresponding count and length ratios (compared to all local fronts).
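Since "direct or indirect connection" is a graph reachability property, conn_ps.count_abs can be sketched as a breadth-first search over an adjacency structure of the sets. The adjacency encoding and function name are our own illustration; whether the Pareto sets themselves are counted is a convention (here they are included).

```python
from collections import deque

def conn_ps_count_abs(adjacency, is_pareto):
    """Number of local efficient sets with a direct or indirect connection
    to at least one Pareto set (here including the Pareto sets themselves).

    adjacency: dict mapping a set index to the indices of directly
    connected sets; is_pareto: boolean flag per set.
    """
    seen = {i for i, p in enumerate(is_pareto) if p}
    queue = deque(seen)
    while queue:  # breadth-first search starting from all Pareto sets
        i = queue.popleft()
        for j in adjacency.get(i, []):
            if j not in seen:
                seen.add(j)
                queue.append(j)
    return len(seen)
```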

### 7.2 Algorithm Characteristics

We also propose characteristics describing the distribution of an algorithm's final population across the problems' local fronts and efficient sets. Based on these, we want to get a better understanding of the algorithms' behavior across the different problems.

**Population Characteristics:** These characteristics measure the percentage of individuals from an algorithm's final population that are actually located in the vicinity of specific local efficient sets or local fronts. More precisely, pop_glob.single_front measures the ratio of individuals that are located in the proximity of the global Pareto front and pop_glob.single_set measures the analogue within the decision space. Similarly, the percentage of individuals from the final population that are located in the vicinity of any local front at all is measured by pop_loc.single_front, whereas pop_loc.single_set again measures the analogue in the decision space. The final characteristics of this category measure the ratio of individuals that are located close to a local front (pop_glob.conn_front) or set (pop_glob.conn_set) that is connected to any of the Pareto fronts or sets, respectively.
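A minimal version of such a "vicinity" measure, assuming a discretized front and an illustrative distance tolerance eps (both our own simplifications, not values from the paper), could look as follows:

```python
import numpy as np

def pop_near_front(pop_obj, front_pts, eps=1e-2):
    """Fraction of population members lying within eps of a given front,
    measured by the nearest-neighbour distance to a discretization of
    the front (eps is an illustrative tolerance)."""
    pop_obj = np.asarray(pop_obj, dtype=float)
    front_pts = np.asarray(front_pts, dtype=float)
    d = np.linalg.norm(pop_obj[:, None, :] - front_pts[None, :, :], axis=2).min(axis=1)
    return float(np.mean(d <= eps))
```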

**Coverage Characteristics:** The remaining proposed characteristics describe the relation of fronts (or sets) and the final “population” from the opposite perspective. That is, they measure the percentage of fronts (or sets) that are covered^{5} by at least one individual from the population. The first two characteristics, cov_loc.single_set and cov_loc.single_front, measure the ratio of local efficient sets or fronts that are covered by (at least) one individual. Analogously, cov_glob.single_set and cov_glob.single_front measure the percentage of covered global efficient sets or Pareto fronts, respectively. The final characteristics (cov_loc.conn_front, cov_glob.conn_front, cov_loc.conn_set and cov_glob.conn_set) describe the connected fronts or sets; that is, all fronts that are connected to each other are regarded as a single front, and then the previous four characteristics are computed for those aggregated fronts and sets, respectively. Note that as the number of fronts (or sets) might be larger than the population size (i.e., the number of considered individuals), we standardize each characteristic by its corresponding highest achievable value (i.e., the minimum of population size and considered fronts or sets).
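The standardization by min(population size, number of fronts) can be made concrete with a small sketch. Here each individual is assumed to already be matched to a front index (or to -1 if it matches none); the function name is hypothetical.

```python
def coverage(matched_front_ids, n_fronts, mu):
    """Fraction of fronts covered by at least one individual, standardized
    by the highest achievable value min(mu, n_fronts).

    matched_front_ids: the front index assigned to each individual, with
    -1 for individuals that match no front.
    """
    covered = {i for i in matched_front_ids if i >= 0}
    return len(covered) / min(mu, n_fronts)
```

With fewer individuals than fronts, even a perfectly spread population cannot cover everything, which is exactly what the min(...) denominator corrects for.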

**Feature Computation:** Obviously, some features (such as conn_ps.count_rel) require more computational resources than others. Nevertheless, by far the largest share of the resources is needed for matching the individuals to the correct set or front, respectively. These computations take even longer when points from different sets (or fronts) lie so close to each other that individuals belonging to different line segments alternate along the matched sequence. Here, the worst-case scenario consists of alternations with irregular frequencies between line segments.

## 8 Experiments

The main goal of this article is an improved understanding of multimodality in the context of multiobjective optimization. In order to perform experiments, we first created a benchmark consisting of easily configurable multiobjective and multimodal test problems. To be more precise, we manipulated the MPM2-generator (Wessing, 2015) in a way that it produces bi-objective (instead of single-objective) multiple peak problems.^{6}

The rationale behind using the mixed-sphere and mixed-ellipse problems rather than other well-known multiobjective benchmarks, such as DTLZ (Deb et al., 2005) or ZDT (Zitzler et al., 2000), is the fact that the former allow control of the multimodality and thus the hardness of the problems, whereas the latter are (at least for now) too complicated for our purpose, which is a (hopefully) complete understanding of the interactions of the multimodal objectives. As it turns out, even in this rather simple setting, the problems quickly become (with an increasing number of peaks per objective) very difficult and highly multimodal.

### 8.1 Setup of Benchmark Problems

We created a benchmark with two groups of two-dimensional problems: 20 mixed-sphere problems and 20 mixed-ellipse problems. Each of the objectives of the 40 instances contains between one and five peaks. Furthermore, for a specific problem the contour lines of all peaks (of both objectives) are either spheres or rotated ellipses. Consequently, the local efficient sets can either be found somewhere on (a) a line segment (in case of sphere-shaped peaks) or (b) a curve (in case of ellipse-shaped peaks) between each pair of peaks from the different objectives.
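To make the structure of such problems concrete, a minimal bi-objective mixed-sphere surrogate can be written down directly. This is an illustrative stand-in, not the actual MPM2 generator: the peak positions and heights below are arbitrary choices of our own.

```python
import numpy as np

def multisphere(peaks, heights):
    """One 'multiple peaks' objective: the minimum of several scaled
    sphere functions, one per peak centre (to be minimized)."""
    peaks = np.asarray(peaks, dtype=float)
    heights = np.asarray(heights, dtype=float)
    def f(x):
        d2 = np.sum((peaks - np.asarray(x, dtype=float)) ** 2, axis=1)
        return float(np.min(heights * d2))
    return f

# A bi-objective mixed-sphere instance: two peaks in f1, one peak in f2.
f1 = multisphere([[0.2, 0.2], [0.8, 0.8]], [1.0, 1.0])
f2 = multisphere([[0.5, 0.9]], [1.0])
F = lambda x: (f1(x), f2(x))
```

For sphere-shaped peaks, the locally efficient points of such an instance lie on the line segments between the peak centres of the two objectives, as described above.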

Also, the generator was configured to place the peaks randomly within the decision space to account for multimodal landscapes. The alternative “default” option leads to closely aligned peaks that would result in a funnel-like landscape, which is much more similar to a unimodal optimization problem than to a multimodal one. The complete setup of this benchmark is given within Table 1.

| | | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| #Peaks (n.peaks) | $f_1$ | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 |
| | $f_2$ | 1 | 2 | 2 | 2 | 3 | 5 | 2 | 4 | 2 | 2 | 2 | 3 | 5 | 2 | 1 | 2 | 2 | 2 | 3 | 5 |
| Seed (seed) | $f_1$ | 1 | 1 | 2 | 3 | 4 | 5 | 6 | 6 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 1 | 2 | 3 | 4 | 5 |
| | $f_2$ | 3 | 5 | 6 | 7 | 8 | 9 | 8 | 7 | 5 | 6 | 7 | 8 | 9 | 4 | 4 | 5 | 6 | 7 | 8 | 9 |


### 8.2 Setup of Multiobjective Optimization Algorithms

We tested two conceptually different optimizers in order to investigate the challenges imposed by different degrees of multimodality on the optimization progress. Specifically, a stochastic local search (SLS) technique is contrasted with a gradient-based strategy (HIGA-MO). The population size is set to $\mu =50$ for both algorithms, while the initial step size is set to 0.01 for SLS and 0.001 for HIGA-MO. While the step size remains constant in case of the naïve SLS, HIGA-MO actively performs a step-size adaptation^{7} (Wang, Deutz et al., 2017). Thus, in the beginning it will rather explore the landscape by making larger steps, and towards the end it will exploit promising areas with rather small steps. The maximal number of iterations has been set to 500 for HIGA-MO and, accounting for the rather naïve structural concept of the second solver, to 800 for SLS. These budgets were chosen according to the ratio of the expected running times (i.e., the time consumed to achieve a target convergence measure) of HIGA-MO and SLS under the settings above. Note that the focus of this work is not a systematic benchmark of solvers, but rather the explanation of algorithm behavior in general.

### 8.3 Experimental Results

In the following, we will first have a look at four scenarios, which we considered to be rather easy to grasp and among the most representative of the 40 benchmark problems, and visually study the behavior of HIGA-MO and SLS on those instances. Afterwards, we will analyze the problem and algorithm characteristics across the entire benchmark in more detail.^{8} This should give us some first insights on whether our suggested problem characteristics can actually be used for distinguishing the problems from each other and also, which problem property might cause which algorithm behavior.

In addition to the results that we can show here, we provide more material, including various tables, figures, as well as videos for all 40 benchmark problems online.^{9}

#### 8.3.1 Exemplary Scenarios

As a first (explorative) step, we analyze four scenarios by visualizing their multi-objective gradient landscapes (as introduced in Sect. 5), as well as the trace of the population from the two algorithms.

The visualizations of the first scenario (ID 35 from the benchmark) are shown within the previous sections. More precisely, Figures 3 and 4 show the multiobjective gradient landscape of this problem within the decision and objective space, whereas Figure 5 depicts the differences within the behavior of HIGA-MO and SLS. As Figure 3 reveals, the analyzed scenario consists of two peaks in the first (visualized by orange contour lines) and one peak in the second objective (white dotted lines), resulting in two basins of attraction, comprising a set of connected local efficient sets each. Those connected sets are also visible by the yellow/green/blue valleys that lead towards the local efficient sets. Note that the coloring represents the depth of the valley. Due to the competition between the two peaks from the first objective, the right part of the left connected set (light orange line) is dominated by the entire connected set of the right basin (dark blue line). When looking at the traces of HIGA-MO (upper row of Figure 5) and SLS (lower row), one can see that both algorithms succeed in finding the connected sets. But while SLS gets stuck in these local optima (as indicated by the blue points^{10} within Figure 5), HIGA-MO follows its goal, i.e., maximizing the dominated hypervolume, and consequently leaves the dominated part (light orange line) of the two sets in the left basin in order to travel towards the global efficient part (light blue line).

The scenario within Figure 6 (ID 32) represents a slightly more difficult sphere-problem with two peaks in the first and three peaks in the second objective. In addition to this, the corresponding three-dimensional surface plot and the mapping from the decision into the objective space are shown in the top row of Figure 9. The problem landscape also consists of two big basins of attraction. However, each one of them actually also contains a smaller basin. When looking at the objective space, one can also see that the corresponding local fronts (each of them colored in two shades of blue) are always closely located to local fronts from their respective surrounding bigger basins. Nevertheless, HIGA-MO again is able to find all Pareto fronts (including the dark blue front). It is also visible that HIGA-MO “pushes” a lot of individuals^{11} from one basin towards the other one, or more precisely from the local efficient sets located near the peak at $\mu_1 \approx (0.4, 0.4)^T$, that is, the turquoise and light green lines, towards the other peak at $\mu_2 \approx (1.0, 0.5)^T$ and from there along the adjacent global efficient set (yellow line). The few individuals from HIGA-MO's final population that are located along the light blue/middle blue and especially along the turquoise/light green sets might be caused by the limited number of generations. In case of the SLS, one would not expect any bigger changes with additional generations. Its points are located along all the local fronts (including the turquoise and light green fronts/sets), but it is not able to leave these (globally) dominated areas.

The two objectives from the next scenario (ID 7) are based on two and one ellipse-shaped peaks, respectively. As indicated by the multi-objective gradients within Figure 7, this landscape also consists of two basins of attraction. Furthermore, one can see the ridge, that is, the bent line starting approximately at $(0.0, 0.6)^T$ and ending at about $(0.9, 0.0)^T$, between the two basins. The problem basically contains two connected sets, but due to a partial overlap of their corresponding fronts (in the objective space), the intermediate section of the right connected set (red line) is globally dominated by the dark blue line. Analogously, the left connected set is cut in half (at least in the decision space), because its upper part (blue) is dominated by the light blue section. When looking at the traces of the algorithms' populations, it is striking that SLS barely finds any of the local fronts at all, while HIGA-MO again finds all Pareto sets and thus obviously ignores the globally dominated sections of the connected sets.

Figure 8 shows the final scenario (ID 5), which is based on two objectives with (slightly) ellipse-shaped peaks. The first objective consists of a single peak, the second contains three peaks. Again, the results of the algorithm runs are quite interesting: although the majority of individuals from HIGA-MO finds the global efficient set, not all of them succeed. In general, all of its individuals quickly converge to one of the local fronts, but while the individuals that reach the green local efficient sets rather quickly travel towards the orange global efficient set—leading to the so-called *channeling effect*^{12}—individuals that come across the light or dark blue local efficient sets do not manage to leave that area completely and instead focus on spreading along the “better” half of the efficient set (according to the *true* local fronts)—that is, the dark blue segment—as depicted by the blue points within the HIGA-MO plots in the middle row of Figure 8.

Summarizing the findings across all four analyzed scenarios, we were able to describe the behavior of the two algorithms based on our visualization approaches. Furthermore, we could show that HIGA-MO, as a multiobjective global optimizer, in most cases finds the global optima, whereas SLS, as a (multiobjective) local search algorithm, usually converges in a local optimum. Thus, although not surprising, multimodality has different impacts with respect to conceptually different search strategies.

#### 8.3.2 Analyzing the Problem and Algorithm Characteristics

As shown within the first part of this section, our benchmark allows us to distinguish between different algorithms. So far, these differences were mainly based on visual inspections across a subset of the benchmark. In the following, we will analyze the landscape and algorithm characteristics, which were introduced in Section 7, to get a basic idea of possible causes for certain algorithm behavior and show the relevance of our benchmark in that it covers a broad range of problem instances within our chosen classes. Such findings would be a good indication for possible multiobjective landscape features. Note that the following investigations are based on sophisticated visualization techniques, which in most cases successfully reduce the underlying dimensionality. However, for completeness, we also provided the exact values for each of the 20 problem and 28 algorithm characteristics (14 per algorithm) across all 40 benchmark problems within the additional material.^{9}

In a first step, we visualized the correlation matrix, which is based on all pairwise (Pearson) correlations among all 48 characteristics, within Figure 10. The colors of the boxes represent the correlations: blue boxes correspond to positive correlations, red boxes to negative ones, and the intensity of the color indicates the magnitude of the correlation, that is, highly correlated characteristics yield darker boxes (Friendly, 2002). Furthermore, the characteristics are aligned based on a (complete linkage) hierarchical clustering approach (Jobson, 1992). This clustering revealed two clusters: a smaller cluster, consisting entirely of eleven problem characteristics (five out of six count characteristics, four out of six length characteristics, and two out of eight connected front/set characteristics), and a bigger cluster, comprising the remaining 37 characteristics. The corresponding correlation matrices of these two clusters are highlighted by red squares within Figure 10.

By analyzing the magnitude of the correlations in more detail, one can find indications for possible redundancy among the characteristics. For instance, the number of local efficient sets that have a direct or indirect connection to the Pareto set (prob.conn_ps.count_abs) and its counterpart in the objective space (prob.conn_pf.count_abs) are highly correlated (0.990) with each other. Likewise, the characteristics measuring the corresponding proportions (instead of the absolute numbers) of these sets/fronts (prob.conn_ps.count_rel and prob.conn_pf.count_rel) are highly correlated (0.989). An overview of the ten highest correlated problem characteristics can be found in Table 2.

| | Characteristic 1 | Characteristic 2 | Correl. |
|---|---|---|---|
| 1 | prob.conn_ps.count_abs | prob.conn_pf.count_abs | 0.990 |
| 2 | prob.conn_ps.count_rel | prob.conn_pf.count_rel | 0.989 |
| 3 | prob.conn_pf.length | prob.conn_pf.count_rel | 0.925 |
| 4 | prob.count.les | prob.count.sets | 0.911 |
| 5 | prob.conn_ps.count_rel | prob.conn_pf.length | 0.903 |
| 6 | prob.count.ps_rel | prob.conn_pf.length_rel | 0.896 |
| 7 | prob.conn_ps.length_rel | prob.conn_pf.length_rel | 0.893 |
| 8 | prob.count.layers | prob.length.lf_total | 0.884 |
| 9 | prob.count.les | prob.count.layers | 0.874 |
| 10 | prob.conn_ps.length | prob.conn_pf.length | 0.872 |


Focusing on the algorithm characteristics, it is noticeable that HIGA-MO shows a strong positive correlation (0.982) between the percentage of individuals that are located in the neighborhood of a local front that has a connection to any of the Pareto fronts (HIGA-MO.pop_glob.conn.front) and the individuals that are located close to a Pareto front (HIGA-MO.pop_glob.single.front). This effect is plausible: as explained within the description of the four exemplary problems, HIGA-MO is able to travel along adjacent fronts towards a better front (w.r.t. the dominance relationship) once it finds a local front at all. As a result, fronts that are connected to the Pareto fronts, but which are not Pareto optimal themselves, will eventually be left uncovered. Consequently, both characteristics basically measure the coverage of the Pareto fronts (for HIGA-MO). Interestingly, SLS also shows a strong positive correlation (0.892) between these two characteristics, but it is caused by a completely different behavior: the likelihood of SLS actually finding a global efficient set is—as already described for the four exemplary problems—similar to that of SLS finding a local efficient set that is connected to any of the global efficient sets. Note that in contrast to HIGA-MO, the naïve SLS does not profit from these connections and instead converges to those multiobjective local optima—remember, it is just a local search algorithm and not a global optimizer. Further details about the ten strongest correlations between pairs of algorithm characteristics can be found within Table 3.

| | Characteristic 1 | Characteristic 2 | Correl. |
|---|---|---|---|
| 1 | HIGA-MO.pop_glob.conn.front | HIGA-MO.pop_glob.single.front | 0.982 |
| 2 | HIGA-MO.cov_glob.conn.front | SLS.cov_glob.conn.front | 0.969 |
| 3 | HIGA-MO.pop_glob.conn.front | HIGA-MO.pop_loc.single.front | 0.935 |
| 4 | HIGA-MO.pop_loc.single.front | SLS.pop_loc.single.front | 0.932 |
| 5 | HIGA-MO.pop_glob.single.front | HIGA-MO.pop_loc.single.front | 0.930 |
| 6 | SLS.pop_glob.conn.front | SLS.pop_glob.single.front | 0.892 |
| 7 | SLS.pop_glob.conn.set | SLS.pop_glob.single.set | 0.888 |
| 8 | HIGA-MO.cov_glob.single.front | SLS.cov_glob.single.front | 0.886 |
| 9 | HIGA-MO.pop_glob.single.front | SLS.pop_loc.single.front | 0.880 |
| 10 | HIGA-MO.pop_glob.conn.front | SLS.pop_loc.single.front | 0.873 |


Aside from detecting possible redundancy among the characteristics or explaining certain algorithmic behavior, it is of interest to find out which problem characteristics might cause specific algorithm characteristics, independent of influences from any of the other characteristics. Thus, we are interested in the strongest correlations between problem and algorithm characteristics as listed within Table 4. Unsurprisingly, the nine strongest correlations between problem and algorithm characteristics are, without exception, negative, and seven out of these nine pairs state that an increasing number of sets, fronts, or layers leads to a reduced coverage of global or local sets/fronts. The two exceptions are correlations based on the percentage of HIGA-MO's final population that is located in the proximity of any local efficient set (HIGA-MO.pop_loc.single.set) and the number of fronts/sets that have a connection to the Pareto sets/fronts (prob.conn_ps.count_abs and prob.conn_pf.count_abs); that is, the more fronts/sets are connected to the Pareto fronts/sets, the less likely HIGA-MO is to find any of the fronts/sets at all. Looking at the strongest *positive* correlation between problem and algorithm characteristics, it is plausible that an increase in the relative length of the Pareto set (prob.conn_ps.length_rel) causes a higher percentage of SLS's final population to be located near the Pareto set (SLS.pop_glob.single.set).

| | Characteristic 1 | Characteristic 2 | Correl. |
|---|---|---|---|
| 1 | prob.count.les | HIGA-MO.cov_loc.single.set | −0.828 |
| 2 | prob.count.sets | HIGA-MO.cov_loc.conn.set | −0.818 |
| 3 | prob.conn_pf.count_abs | HIGA-MO.pop_loc.single.set | −0.811 |
| 4 | prob.count.sets | HIGA-MO.cov_loc.single.set | −0.805 |
| 5 | prob.conn_ps.count_abs | HIGA-MO.pop_loc.single.set | −0.791 |
| 6 | prob.count.layers | HIGA-MO.cov_loc.single.set | −0.772 |
| 7 | prob.count.les | SLS.cov_loc.single.set | −0.765 |
| 8 | prob.count.sets | SLS.cov_loc.conn.set | −0.745 |
| 9 | prob.count.les | SLS.cov_loc.conn.set | −0.741 |
| 10 | prob.conn_ps.length_rel | SLS.pop_glob.single.set | 0.732 |


These findings are also supported by Figure 11, which displays a *biplot* (Gabriel, 1971; Gower and Hand, 1995) of a *principal component analysis* (PCA; Härdle and Simar, 2015) based on the correlation of all 48 characteristics. The goal of such a PCA is to reduce the number of dimensions of the original data set. This is achieved by representing the data by so-called *principal components* (PCs), which basically are a linear combination of the variables from the original data set. Each of these PCs is constructed in such a way that it explains the highest amount of variance (and thus information) within the hyperplane that is orthogonal to all the previously constructed PCs. Consequently, the first PC explains the most variance of the original data set, the second PC the second most, etc. As shown within Figure 11, the two PCs derived from a PCA based on all characteristics already explain 55%, and thus, more than half, of the variance of all 48 characteristics. As the name of the figure (“biplot”) indicates, it actually summarizes two figures within one: (a) a projection of the 40 benchmark problems onto the (hyper)plane that is constructed by the first two PCs, showing problems with sphere-shaped peaks as green dots and instances with ellipse-shaped peaks as red dots and (b) the explanatory input of each characteristic on the first two PCs as arrows, colored according to the group that they represent: problem characteristics are shown as green arrows, whereas the algorithm characteristics are colored in blue (HIGA-MO) and pink (SLS). Note that arrows of characteristics, which perfectly explain at least one of the two PCs, would actually touch the shown circle.
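The explained-variance statement can be checked with a few lines of correlation-based PCA. The sketch below (function name our own) operates on a hypothetical (problems x characteristics) matrix.

```python
import numpy as np

def pca_explained_variance(X, k=2):
    """Share of the total variance captured by the first k principal
    components of a correlation-based PCA.

    X: (problems x characteristics) matrix; columns are standardized
    so that the PCA operates on the correlation matrix.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    eigvals = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[::-1]  # descending
    return float(np.sum(eigvals[:k]) / np.sum(eigvals))
```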

As one can see, the characteristics basically form the same two groups as within the visualization of the correlation matrix: the eleven problem characteristics, which formed the smaller cluster in Figure 10, point in the negative direction of PC1, whereas the remaining characteristics once again form a second (bigger) cluster. Also, some “sub-clusters,” such as prob.conn_ps.count_abs and prob.conn_pf.count_abs (bottom left of the biplot), are even more visible than within the correlation matrix plot (top left). While the majority of problem characteristics, with the exception of those describing the connectedness towards the Pareto fronts/sets (prob.conn_*), are mostly horizontally aligned and thus only have a small influence on the second PC, the latter is mainly influenced by algorithm characteristics. Furthermore, although the problems seem to be separable by the shape of their peaks, they are mainly distinguished by the second PC, that is, rather by the algorithm than by the problem characteristics. This is due to the different algorithm concepts and thus their different behavior with respect to the degree of multimodality. These differences tend to be larger for ellipse-shaped peaks.

When looking at the biplots that are purely based on the 20 landscape characteristics or on the 28 algorithm characteristics, the aforementioned discoveries become even more obvious. Although, in case of the landscape-based PCA (see Figure 12), the first two PCs explain roughly 70% of the variance, they are not able to clearly separate problems with sphere-shaped peaks from those with ellipse-shaped peaks. Instead, the projection can rather be used to group the landscape characteristics into two (or maybe three) clusters. The left group, with the exception of the two characteristics that measure the total lengths of the fronts and sets that are connected to the Pareto fronts/sets, comprises characteristics relating the counts or lengths of the corresponding fronts/sets to the number/lengths of all fronts/sets (indicated by the suffix rel), whereas the remaining characteristics form a second cluster. Eventually, one could divide the latter once again, given that prob.conn_ps.count_abs and prob.conn_pf.count_abs, that is, the number of local efficient sets or local fronts that are connected to the Pareto sets/fronts, form a potential third cluster. Therefore, we can conclude that the chosen landscape characteristics manage to cover different aspects of the problem landscape.

In contrast to that, the PCA based on the algorithm characteristics (for either one of the two algorithms) is able to split the data according to the shape of the peaks. In case of the stochastic local search algorithm, the knowledge of SLS.pop_loc.single.front is already sufficient to correctly classify 95% of the problems.^{13} For HIGA-MO, it is much harder to find a classification model of comparable accuracy. For instance, Figure 13 displays a classification tree that satisfies this condition, but it requires five algorithm characteristics.

Summarizing the previous findings, one can say that the problem characteristics capture different properties of a problem's landscape than the shape of its peaks, whereas the algorithm characteristics indicate that the peak shape actually influences the behavior of HIGA-MO and SLS. Moreover, the benchmark problems cover various landscape characteristics and degrees of multimodality. By correlating problem and algorithm characteristics, the basic algorithm behaviors could be explained and conceptual differences were detected. Certainly, for both algorithms the problem gets harder with an increasing number of (local) fronts, but in general HIGA-MO is able to cope with this much better. The next step will be the construction of exploratory landscape features based on our findings, which will make it possible to assess the multimodality aspects prior to optimization and thereby allow building algorithm selection models. For this purpose, a larger set of (state-of-the-art) optimization algorithms will also have to be applied in a systematic way.

## 9 Discussion and Outlook

This article provides concepts for thoroughly understanding multimodality in the context of multiobjective optimization problems, both theoretically and experimentally. A specifically designed benchmark set, constructed by means of a sophisticated multiple-peaks generator, is introduced and used as a testbed for contrasting two conceptually different search strategies: hypervolume gradient ascent and stochastic local search. Mixed sphere- and elliptically shaped variants reveal different degrees of multimodality and levels of problem hardness. Problem landscapes, and specifically basins of attraction, are visualized based on a scalar combination of gradients in order to substantially increase problem understanding, which is further enhanced by deriving local front specifications analytically.

Moreover, algorithm characteristics are introduced that allow assessing algorithm behavior with respect to the detection of global and local Pareto fronts and that can further be used for performance assessment. These are related to the respective problem characteristics in a systematic way using multivariate analytical techniques. Evidently, the problem characteristics reflect much more information than just the number of peaks and the spherical or elliptical structure. Basic algorithm behaviors could be explained and conceptual differences detected: the problem certainly becomes harder with an increasing number of (local) fronts, but HIGA-MO generally copes with this much better.

This forms the basis for systematically constructing multiobjective Exploratory Landscape Features, which hold great potential with respect to algorithm benchmarking, selection, and design, also for higher-dimensional problems. A thorough and systematic benchmark of state-of-the-art multiobjective optimizers will be conducted while simultaneously extending the problem generator to varying problem topologies. Moreover, the specific design of optimizers addressing both diversity in decision space and multimodality of the landscape will be pursued.

## Acknowledgments

The authors acknowledge support by the European Research Center for Information Systems (ERCIS) and H. Wang from NWO PROMIMMOOC, project no. 650.002.001.

## Notes

^{1}

We used $\delta = 10^{-6}$ for the gradient approximation and considered points for which the length of the respective summed (normalized) gradient vectors was below $10^{-3}$ to be locally efficient. As we discretize the search space, we might only end up at a point that lies in the vicinity of the (true) efficient set.
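The local-efficiency criterion described in this note can be sketched as follows for a biobjective problem; the two sphere objectives are only illustrative stand-ins, not the paper's benchmark functions:

```python
import numpy as np

def num_grad(f, x, delta=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        e = np.zeros_like(x, dtype=float)
        e[i] = delta
        g[i] = (f(x + e) - f(x - e)) / (2.0 * delta)
    return g

def is_locally_efficient(f1, f2, x, delta=1e-6, tol=1e-3):
    """A point is flagged locally efficient when the normalized
    single-objective gradients (approximately) cancel out, i.e., the
    length of their sum falls below the tolerance."""
    g1, g2 = num_grad(f1, x, delta), num_grad(f2, x, delta)
    n1, n2 = np.linalg.norm(g1), np.linalg.norm(g2)
    if n1 < tol or n2 < tol:          # a single-objective optimum
        return True
    return bool(np.linalg.norm(g1 / n1 + g2 / n2) < tol)

# Two spheres with optima at 0 and 1: points on the connecting segment
# are efficient, points off the segment are not.
f1 = lambda x: np.sum(x**2)
f2 = lambda x: np.sum((x - 1.0)**2)
print(is_locally_efficient(f1, f2, np.array([0.5, 0.5])))  # True
print(is_locally_efficient(f1, f2, np.array([0.9, 0.1])))  # False
```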

^{2}

One could, for instance, use the R-package flacco (Kerschke, 2017) for computing such features.

^{3}

Note that all the length features are approximated numerically by calculating the cumulative chordal distance of the samples on the curve. This distance converges asymptotically with rate $O(1/N^2)$ (where $N$ is the number of samples) for uniformly spaced samples (Kozera et al., 2003).
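A minimal sketch of this cumulative chordal distance, checked here against a curve of known length (the quarter unit circle):

```python
import numpy as np

def chordal_length(points):
    """Cumulative chordal distance: the sum of Euclidean distances
    between consecutive samples on a curve, which approximates the
    curve length with error O(1/N^2) for uniformly spaced samples."""
    points = np.asarray(points, dtype=float)
    return float(np.sum(np.linalg.norm(np.diff(points, axis=0), axis=1)))

# Quarter of the unit circle; exact arc length is pi/2 ~ 1.5708.
t = np.linspace(0.0, np.pi / 2, 1000)
approx = chordal_length(np.column_stack((np.cos(t), np.sin(t))))
print(round(approx, 4))  # 1.5708
```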

^{4}

From the points sampled along the theoretical Pareto front/efficient set for drawing the curves, we approximate the lengths by summing the Euclidean distances between neighboring samples.

^{5}

A front or set is “covered” if an individual is located in its $\varepsilon$-environment.

^{7}

The step-size adaptation uses cumulative step-size control with parameters $\alpha = 0.5$ and $c = 0.2$.

^{8}

For the algorithm characteristics, an individual was considered to be in a set's (or front's) vicinity if the difference between the individual and the closest point of the closest respective set (or front) was at most $5 \cdot 10^{-3}$ in each of the two dimensions (or objectives). In contrast, a much more detailed grid could be used for the computation of the landscape characteristics, which allowed a much smaller threshold of $10^{-3}$.
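A possible (purely illustrative) implementation of this per-dimension vicinity check, here reporting the fraction of reference samples of a set/front that are covered by a population; both arrays are synthetic:

```python
import numpy as np

def coverage(individuals, reference_set, eps=5e-3):
    """Fraction of reference points (samples of a local set or front)
    that have some individual within eps in every dimension."""
    individuals = np.asarray(individuals, dtype=float)
    reference_set = np.asarray(reference_set, dtype=float)
    hits = 0
    for r in reference_set:
        diffs = np.abs(individuals - r)            # (n_individuals, dim)
        if np.any(np.all(diffs <= eps, axis=1)):   # any individual close in all dims
            hits += 1
    return hits / len(reference_set)

pop = np.array([[0.100, 0.200], [0.500, 0.500]])
ref = np.array([[0.102, 0.199], [0.900, 0.900]])
print(coverage(pop, ref))  # 0.5
```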

^{10}

The coloring of the points represents the dominance relation among the final population; that is, red points represent the first, blue points the second, and green points the third layer.

^{11}

The “push” is caused by the fact that HIGA-MO performs local searches along the front and once an individual crosses a ridge, it strives for the “better” front.

^{12}

By *channeling* we refer to the effect, in which multiple individuals walk along the same path, ultimately showing darker paths connecting the local fronts. Such “channels” result from local efficient sets that are connected to ridges.

^{13}

19 out of the 20 sphere-shaped problems have a value of at most 0.74, whereas 19 of the 20 ellipse-shaped problems have a value of at least 0.82.

## References
