## Abstract

Several test function suites are used for the numerical benchmarking of multiobjective optimization algorithms. While they have some desirable properties, such as well-understood Pareto sets and Pareto fronts of various shapes, most of the currently used functions possess characteristics that are arguably underrepresented in real-world problems, such as separability, optima located exactly at the boundary constraints, and the existence of variables that solely control the distance between a solution and the Pareto front. Via the alternative construction of combining existing single-objective problems from the literature, we describe the bbob-biobj test suite with 55 bi-objective functions in the continuous domain, and its extended version with 92 bi-objective functions (bbob-biobj-ext). Both test suites have been implemented in the COCO platform for black-box optimization benchmarking, and various visualizations of the test functions are shown to reveal their properties. Besides providing details on the construction of these problems and presenting their (known) properties, this article also gives the rationale behind our approach in terms of groups of functions with similar properties, objective space normalization, and problem instances. The latter allows us to easily compare the performance of deterministic and stochastic solvers, an often overlooked issue in benchmarking.

## 1 Introduction

Numerical benchmarking is an important part of (black-box) optimization that helps to understand algorithm behavior and recommend algorithms. In order to obtain meaningful results, a benchmarking experiment should be (i) based on a thorough, well-documented and well-understood methodology and (ii) either be conducted on real-world problems of interest or a collection of artificial test functions that possess comprehensible difficulties observed in practical optimization problems. This holds true for both single- and multiobjective problems but for the latter, the methodology is less advanced at the moment.

Many artificial test functions that are frequently used in multiobjective optimization have been derived by setting up the Pareto front shape first without relating it to the intrinsic difficulties of the objective functions. Such an approach has the advantage that the analytical forms of the Pareto front and the Pareto set can be exploited to facilitate the performance assessment. Another aspect of state-of-the-art test suites for multiobjective optimization is the fact that not much progress has been made to avoid the overrepresentation of functions that are too simple or have questionable properties. Several existing (and still frequently used) multiobjective test suites, for example, contain a large share of functions that are separable, have the Pareto set on the domain boundary, or contain distance and position variables,^{1} artificial features not reflecting well the difficult black-box problems observed in practice.

In the context of single-objective algorithm benchmarking, much progress has been made in recent years in the design of artificial test functions that represent a wide range of difficulties observed in practice. The black-box optimization benchmarking test suite (bbob, Hansen et al., 2009) in particular has received wide acceptance as its 24 test functions have various advantages over previous test suites. The functions are well understood and expose algorithms to a variety of real-world difficulties such as multimodality, ill-conditioning, nonseparability of the variables, and nonlinearities. The bbob functions are grouped into five function groups with functions within a group sharing similar difficulties (such as multimodality with weak global structure) and with the aim of not overemphasizing certain difficulties. Each function has one or several concrete scientific questions associated with it that can be answered by looking at algorithm performance results on that function (or in combination with another function). General statements beyond the tested concrete functions are possible by testing invariance properties of algorithms such as scaling, rotation, and affine invariance. In contrast to the previously mentioned approaches to building multiobjective test suites, we suggest focusing on introducing the known difficulties of real-world problems into the test suite. This is analogous to the single-objective case, but has the disadvantage that analytical formulas for the Pareto front and Pareto set might not be available. The motivation behind this approach is that in practice, multiobjective problems are constructed in exactly this way—with each objective corresponding to a separate single-objective function. Concretely, we propose a generic way to combine the well-established and -understood single-objective functions from the bbob test suite (Hansen et al., 2009).
Using the bbob test functions as building blocks allows us to build upon a careful statistical choice of the functions (without overrepresenting a certain type of problem) as well as comprehensive difficulties. In turn, the proposed multiobjective functions are more likely to be practically relevant than previous benchmarks. In particular, our proposal fulfills all five recommendations for benchmark suites mentioned by Huband et al. (2006, p. 485); see also Section 3. We showcase our idea by implementing two bi-objective test suites within the COCO platform (Hansen et al., 2021) that supports automated benchmarking. The implementation within the COCO platform has two important advantages over many other existing test suites: (1) numerical experiments can be done more easily by automation, and, even more importantly, (2) it is possible to compare new results with a large variety of already benchmarked algorithms and to share the results effortlessly. As of the end of 2020, the results of 32 algorithm implementations^{2} have been collected and made available online.^{3}

The lack of analytical expressions for the Pareto sets and fronts in our approach is addressed by collecting and visualizing approximations from numerical experiments with a large variety of algorithms. The collected hypervolume values are available online for performance assessment.^{4} To understand the quality of the reference Pareto set approximations, we investigate necessary optimality conditions in Section 6. We also provide dominance- and gradient-based plots to gain additional insight into the problem properties.

The nonexistence of analytical forms of the Pareto set and Pareto front in our approach can even be seen as an advantage: the combination of existing single-objective test functions allows us, in a controlled way, to mimic the typical constructions of real-world problems and to empirically investigate the resulting Pareto set and Pareto front shapes from such constructions. Moreover, the proposed multiobjective benchmark functions come in the form of pseudo-random *instances*, which allows us to more easily compare deterministic and stochastic approaches and makes it possible to investigate the variance of properties like a connected or convex Pareto front emanating from the combination of single-objective test functions.

The contributions of this article are: (i) the purposeful choice of single-objective functions and their combinations into bi-objective problems, (ii) cherry-picking instances to avoid pathological combinations, (iii) a classification of the problems, (iv) implementation of two suites for the COCO platform, (v) identifying problem attributes and properties, (vi) provision of Pareto set approximations with an analysis of their quality, (vii) various visualizations of the constructed functions (see also the supplementary material^{5}), (viii) empirical identification of target values, and (ix) a guide to performance assessment.

The article is organized as follows. We start by outlining the fundamental definitions in multiobjective optimization and benchmarking in Section 2 and review multiobjective benchmark suites and their properties in Section 3. Section 4 introduces the main concepts behind the single-objective bbob test suite with more details given in the appendix. Section 5 proposes the bbob-biobj and bbob-biobj-ext test suites which are then analyzed visually in Section 6. The article concludes with a discussion of how to report algorithm performance data in Section 7 and final remarks in Section 8.

## 2 Preliminaries

We consider unconstrained bi-objective **minimization** problems of the form

$$\min_{x \in \mathbb{R}^n} F(x) = \left(f_\alpha(x), f_\beta(x)\right),$$

where $n$ denotes the search space dimension. Solutions are compared via the *dominance relation*. A solution $x \in \mathbb{R}^n$ is thereby said to *dominate* another solution $y \in \mathbb{R}^n$ if $f_\alpha(x) \leq f_\alpha(y)$ and $f_\beta(x) \leq f_\beta(y)$ hold and at least one of the inequalities is strict. We adopt the notation $f_\alpha$ for the first objective (resp. $f_\beta$ for the second objective) instead of $f_1$ and $f_2$ to avoid confusion with notations adopted within the single-objective bbob test suite.
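As a minimal sketch (for illustration only, not part of any benchmarking API), the dominance relation for minimization can be written as:

```python
def dominates(fx, fy):
    """True iff objective vector fx dominates fy under minimization:
    fx is nowhere worse and strictly better in at least one objective."""
    return (all(a <= b for a, b in zip(fx, fy))
            and any(a < b for a, b in zip(fx, fy)))
```

For example, `dominates((1, 2), (2, 2))` holds, while two identical objective vectors do not dominate each other.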

Solutions which are not dominated by any other solution are called *Pareto-optimal* or *efficient solutions*. All Pareto-optimal solutions constitute the *Pareto set* of which an approximation is sought. The Pareto set's image in the objective space $F(\mathbb{R}^n)$ is called the *Pareto front*. Two specific points in the objective space are important to mention:

- The *ideal point*, defined as the vector in objective space that contains the optimal $F$-value for each objective *independently*. More precisely, if $f_\alpha^{opt} := \inf_{x \in \mathbb{R}^n} f_\alpha(x)$ and $f_\beta^{opt} := \inf_{x \in \mathbb{R}^n} f_\beta(x)$, the ideal point is given by $z^{ideal} = (f_\alpha^{opt}, f_\beta^{opt})$.
- The *nadir point* (in objective space), consisting in each objective of the worst value obtained by a Pareto-optimal solution. More precisely, if we denote the Pareto set by $P$, the nadir point satisfies $z^{nadir} = \left(\sup_{x \in P} f_\alpha(x), \sup_{x \in P} f_\beta(x)\right)$. In the specific case where each of the two objective functions has a unique global minimum (that is, a single point in the search space which maps to the global minimum function value), $z^{nadir} = \left(f_\alpha(\operatorname{argmin} f_\beta(x)), f_\beta(\operatorname{argmin} f_\alpha(x))\right)$.
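Given a finite approximation of the Pareto front, both points can be estimated empirically (a sketch under our own naming; the exact points require knowledge of the true Pareto set):

```python
def estimate_ideal_and_nadir(front):
    """Estimate the ideal and nadir points from a list of bi-objective
    vectors (f_alpha, f_beta) assumed to be mutually nondominated."""
    f_alpha = [f[0] for f in front]
    f_beta = [f[1] for f in front]
    ideal = (min(f_alpha), min(f_beta))
    nadir = (max(f_alpha), max(f_beta))
    return ideal, nadir
```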

All given definitions generalize trivially to problems with more than two objectives. When solving an unconstrained multiobjective problem such as the one above, the goal is usually to find, with as few evaluations of $F$ as possible, a set of nondominated solutions which is (i) as large as possible and (ii) has objective values as close to the Pareto front as possible.^{6} Here, we define this optimization goal via the maximization of a quality indicator, namely, the hypervolume (Zitzler et al., 2003) of the set of all nondominated solutions found so far (Brockhoff et al., 2016).
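In the bi-objective case, the hypervolume of a mutually nondominated set with respect to a reference point reduces to a sum of rectangle areas. A minimal sketch for illustration (COCO's actual implementation is not shown here):

```python
def hypervolume_2d(points, ref):
    """Area dominated by a set of mutually nondominated 2-D objective
    vectors, measured with respect to the reference point ref."""
    # keep only points that strictly improve on the reference point
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:  # f1 ascending implies f2 descending here
        hv += (ref[0] - f1) * (prev_f2 - f2)
        prev_f2 = f2
    return hv
```

For instance, the two points $(1, 2)$ and $(2, 1)$ with reference point $(3, 3)$ dominate two unit-overlapping rectangles of total area 3.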

In this context, an optimization algorithm addressing the above minimization problem is given an *instance* of $F$: we implement each generic multiobjective function $F$ as a parametrized function $F_\theta: \mathbb{R}^n \to \mathbb{R}^m$ with a problem dimension $n$ and a parameter value $\theta \in \Theta$ that might depend on $n$ and $m=2$, the number of objectives. The parameter value $\theta$ determines a *function instance*. For example, $\theta$ can encode the location of the optimum, $x_\alpha^{opt}$, of the first objective $f_\alpha^\theta: \mathbb{R}^n \to \mathbb{R}$ with $f_\alpha^{x_\alpha^{opt}}(x) = \|x - x_\alpha^{opt}\|^2$. Search space rotations and shifts in the objective values are other generic transformations that can be encoded in instances (Hansen et al., 2021).
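The instance concept can be sketched with the sphere example above, where the optimum location plays the role of $\theta$ (an illustrative factory function of our own, not COCO's internal implementation):

```python
def make_sphere_instance(x_opt):
    """Return an instance of the sphere objective f(x) = ||x - x_opt||^2;
    the optimum location x_opt is the instance parameter theta."""
    def f(x):
        return sum((xi - oi) ** 2 for xi, oi in zip(x, x_opt))
    return f
```

Two calls with different `x_opt` values yield two instances of the same parametrized function.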

## 3 Review of Existing Multiobjective Test Suites

Many multiobjective test suites have been proposed throughout the years. Here, we in particular discuss those that are scalable in the problem dimension and that are unconstrained or box-constrained and defined in the continuous domain—the focus of our proposal for a new benchmark suite.

The (evolutionary) multiobjective optimization field first performed numerical comparisons of algorithms on single, independently proposed test problems like those by Kursawe (1990) and Fonseca and Fleming (1995), see for example, Tan et al. (2002), or in real-world studies, see Van Veldhuizen and Lamont (1998) for an early overview. A first attempt to create a consistent multiobjective test *suite* with several problems with desired properties was, to the best of our knowledge, the work of Van Veldhuizen and Lamont (1999b, 1999a). The authors clearly stated the need for scalable test suites and emphasized that problems of a test suite should possess practically relevant features. In the years that followed, several other scalable test suites have been proposed, of which the most established ones are (i) the ZDT suite of Zitzler et al. (2000), scalable in the number of variables but with only two objectives, (ii) its rotated version, the IHR problem suite of Igel et al. (2007), (iii) the DTLZ suite of Deb et al. (2005), with seven problems, all scalable in the number of variables and objectives, (iv) the WFG suite of Huband et al. (2006) with nine scalable problems of various difficulties, (v) the LZ suite of Li and Zhang (2009) containing problems with more complicated Pareto sets, (vi) the CEC2007 suite, combining and extending 13 existing test functions from the literature (Huang et al., 2007), (vii) the CEC2009 suite with 13 problems overall (Zhang et al., 2009), and finally (viii) the CEC2017 suite with 15 collected problems, tailored toward many-objective optimization (Cheng et al., 2017).

Most of these test suites have some desirable properties such as well-understood Pareto sets and Pareto fronts with shapes of various kinds (linear, convex, concave, discontinuous). But they also possess artificial characteristics that stem from the easier construction of such problems—overrepresenting properties such as no or only few dependencies among variables, Pareto sets located exactly at the boundary constraints, and the differentiation between position and distance variables. Although the importance of nonseparable test functions in single-objective test suites is unquestioned, and even Deb (2001, p. 353f) states its significance, most proposed multiobjective test problems are still separable or mostly separable (a function being separable if it can be optimized variable by variable).

Another way of generating multiobjective test problems is to decide on the single objectives and simply combine them into a multiobjective problem. Examples of this are the double sphere or Schaffer function number 1 (Schaffer, 1985), the combination of convex quadratic functions such as ellipsoid and cigtab functions by Igel et al. (2007), or the multimodal functions based on spherical functions by Emmerich and Deutz (2007); the latter construction has been extended by Kerschke et al. (2016).

Even though all test suites in the above list are scalable in the problem dimension, we rarely see performance studies that investigate the scaling of the algorithms with the problem dimension.

The arguably most complete paper on the topic of multiobjective benchmark problems to date is still the work of Huband et al. (2006) where the authors (i) identify important properties test functions should have, (ii) discuss in detail all other available test suites at that time with respect to these properties, and (iii) finally propose a new, well-motivated test suite that avoids many pitfalls of other test suites. In particular, Huband et al. (2006, p. 485) recommend that multiobjective test suites should, in addition to recommendations for single-objective test suites:

- contain a few unimodal test problems to test convergence velocity relative to different Pareto optimal geometries and bias conditions,
- cover the three core types of Pareto optimal geometries: degenerate Pareto optimal fronts, disconnected Pareto optimal fronts, and disconnected Pareto optimal sets,
- have a majority of its test problems multimodal with a few deceptive problems,
- have the majority of problems nonseparable, and
- contain problems that are both nonseparable and multimodal to be representative of real-world problems.

All five recommendations are fulfilled for the test suites proposed in this article. Similar to the single-objective bbob functions, the WFG suite of Huband et al. (2006) employs problem transformations that change the properties like (non-)separability, bias, and the shape of the Pareto front of underlying *raw* objective functions.

One common property of the above mentioned test suites is that their Pareto sets can be described in analytical form. This certainly has an advantage for performance assessment but it also restricts the types of real-world problem characteristics that can be captured with such functions.

However, in practice, multiobjective optimization problems are typically constructed by combining objective functions that are defined (and understood) independently, such as cost and performance of a new product. The objective functions might thereby come from different domains and may or may not share common properties such as uni-/multimodality, (non-)separability, asymmetry, etc.

The idea of defining multiobjective test problems by combining single-objective ones is therefore straightforward and has been proposed before, as mentioned above, for example, in Schaffer (1985), Igel et al. (2007), Emmerich and Deutz (2007), or Horn et al. (2015). To create a benchmark suite with challenging properties observed in practice, we follow here the same path and combine some of the existing, well-established, well-understood functions of the bbob test suite to create new multiobjective suites.

## 4 The Single-Objective bbob Functions

Our multiobjective test suites build on the single-objective bbob function suite (Hansen et al., 2009; Finck et al., 2009). The bbob suite includes unimodal and multimodal test functions, separable and nonseparable functions, and well-conditioned and ill-conditioned problems; see Appendix A for a short definition of those properties. Each function is parametrized (see Section 2) and is scalable with respect to the dimension. Some function pairs allow one to answer specific questions: for instance, the ellipsoid and the Rastrigin functions are present in a separable and a nonseparable version such that one can investigate how much an algorithm exploits separability. The functions are unbounded but the search domain of interest is $[-5,5]^n$. Except for the linear slope function $f_5$, the optimum of each function lies in the hypercube $[-4,4]^n$ for all instances and all dimensions. The optimum of the linear function is attained at a (randomly chosen) corner of the hypercube $[-5,5]^n$ and in the orthant beyond this point.

The bbob test suite has 24 functions, split into five groups. The **Separable** group contains five separable functions, the **Moderate** group consists of four functions with low or moderate conditioning, including multimodal functions, the **Ill-conditioned** functions group contains five unimodal functions with high conditioning, the **Multimodal** functions group comprises five multimodal functions with adequate global structure, and the **Weakly-structured** group consists of five multimodal functions with weak global structure. Problem difficulty is typically increasing from the first to the last group, but there are exceptions, for example, solving the Büche-Rastrigin function from the first group is difficult for most algorithms. The manually assigned function groups are also used to aggregate performance, allowing meaningful statements to be made about the performance on functions with specific properties reflected by the groupings.

## 5 The Proposed Bi-Objective Test Suites

Given the well-motivated bbob functions from Hansen et al. (2009), it is natural to construct a multiobjective test suite from these single-objective functions. For the bi-objective case, a pairwise combination of all 24 functions results in $24^2 = 576$ bi-objective functions. We however assume that multiobjective optimization algorithms are not sensitive to permutations of the objectives, so that we do not include $F=(f_\beta, f_\alpha)$ if $(f_\alpha, f_\beta)$ is already present.^{7} This results in a total of $\binom{25}{2} = 300$ function combinations—the number of 2-combinations drawn with replacement.
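The counting argument can be checked directly with the standard library:

```python
from itertools import combinations_with_replacement
from math import comb

bbob_functions = range(1, 25)  # the 24 bbob functions
ordered_pairs = [(fa, fb) for fa in bbob_functions for fb in bbob_functions]
unordered_pairs = list(combinations_with_replacement(bbob_functions, 2))

assert len(ordered_pairs) == 24 ** 2 == 576
# 2-combinations with replacement of 24 items: C(24 + 1, 2) = 300
assert len(unordered_pairs) == comb(25, 2) == 300
```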

While a benchmarking suite should contain a large number of different problems to avoid overfitting, first tests in Brockhoff et al. (2015) showed that having 300 functions is impracticable in terms of the overall running time of a benchmarking experiment. Therefore, a subset of these 300 functions was selected.

We present two such selections—the bbob-biobj test suite with 55 functions and its extension, the bbob-biobj-ext test suite with 92 functions. We also provide visualizations for some of the functions, showing different Pareto set and front shapes. Plots for all 92 functions are provided via the supplementary material.

### 5.1 The bbob-biobj Test Suite

The bi-objective bbob-biobj test suite is created by exploiting the organization of the bbob functions into groups. More precisely, only two (representative) functions from each of the bbob function groups are chosen. This way, we do not introduce any bias toward a specific group. In addition, within each group, the functions are chosen to be the most representative without repeating similar functions. For example, only one ellipsoid, one Rastrigin, and one Gallagher function are included in the bbob-biobj suite although they appear in multiple versions in the bbob suite.

These ten bbob functions are chosen to create the bbob-biobj test suite:

- From the separable functions group: the sphere function $f_1$ and the separable ellipsoid function $f_2$.
- From the functions with low or moderate conditioning: the attractive sector function $f_6$ and the original Rosenbrock function $f_8$.
- From the unimodal functions with high conditioning: the sharp ridge function $f_{13}$ and the sum of different powers function $f_{14}$.
- From the multimodal functions with adequate global structure: the Rastrigin function $f_{15}$ and the Schaffer F7 function with condition number 10, $f_{17}$.
- From the multimodal functions with weak global structure: the Schwefel $x \sin(x)$ function $f_{20}$ and the Gallagher 101 peaks function $f_{21}$.

All bbob-biobj functions are scalable in the search space dimension and come in the form of instances, as is the case with the original bbob suite.

In the following, we specify the common properties of the bbob-biobj functions and the main rationale behind them while concrete details on each of the 55 functions are provided via the supplementary material.

#### 5.1.1 Function Domain

Since we use the single-objective bbob functions to construct the bbob-biobj suite, all functions are unbounded and the extreme solutions of the Pareto set are guaranteed to lie within $[-5,5]^n$.

Note that the Pareto set can partially lie outside of this area but that the major part of the Pareto set is expected to lie within it. Initial experiments found Pareto set approximations to be partially outside of the hypercube $[-5,5]^n$ on a few functions in low dimensions; see also Section 6.5.

#### 5.1.2 Normalization of Objectives

None of the 55 bbob-biobj functions is explicitly normalized and the optimization algorithms therefore have to cope with objective values in different ranges. Typically, different orders of magnitude between the objective values can be observed.

However, to assess performance on the test suites, we normalize the objectives based on the ideal and nadir points before calculating the hypervolume indicator (Brockhoff et al., 2016). Both points can be computed, because the unique global optimum is known for the used 10 bbob base functions. In the black-box optimization benchmarking setup of the COCO platform, the algorithm is allowed to use the values of the nadir point as an upper bound on a region of interest in objective space but no other normalization is directly available to the algorithm.
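This normalization amounts to an affine map of the objective space; a minimal sketch (our own helper, not the COCO code):

```python
def normalize(f_value, ideal, nadir):
    """Affinely map a bi-objective vector so that the ideal point becomes
    (0, 0) and the nadir point becomes (1, 1).  Values outside [0, 1] are
    possible for solutions outside the region of interest."""
    return tuple((f - z_i) / (z_n - z_i)
                 for f, z_i, z_n in zip(f_value, ideal, nadir))
```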

#### 5.1.3 Instances

Our proposed test functions are parametrized and their instances are instantiations of the underlying parameters of the bbob functions (see Hansen et al., 2021).^{8} In addition, we assert two conditions:

1. The Euclidean distance between the two single-objective optimal solutions (also called the extreme solutions) in the search space is at least $10^{-4}$.

2. The Euclidean distance between the ideal and the nadir point in the non-normalized objective space is at least $10^{-1}$.

Each function instance has an integer ID. The relation between the instance ID, $K_{ID}^F$, of a bi-objective function $F=(f_\alpha, f_\beta)$ and the instance IDs, $K_{ID}^{f_\alpha}$ and $K_{ID}^{f_\beta}$, of its underlying single-objective functions is $K_{ID}^{f_\alpha} = 2 K_{ID}^F + 1$ and $K_{ID}^{f_\beta} = K_{ID}^{f_\alpha} + 1$. If the two above conditions are not satisfied for all dimensions and functions in the bbob-biobj suite, we increase the instance ID of the second objective successively until both properties are fulfilled. Exceptions to the above rule are, for historical reasons, the bbob-biobj instance IDs 1 and 2 which contain the single-objective instance IDs 2 and 4 and the IDs 3 and 5, respectively. This results in the same instances 1 to 5 here and in Brockhoff et al. (2015).
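The base rule and the two historical exceptions can be sketched as follows (the successive increments applied when the distance conditions fail are data-dependent and omitted here; the function name is ours):

```python
def single_objective_instance_ids(biobj_id):
    """Map a bbob-biobj instance ID to the instance IDs of its two
    underlying bbob objectives (base rule plus historical exceptions;
    condition-driven increments of the second ID are not modeled)."""
    if biobj_id == 1:
        return 2, 4
    if biobj_id == 2:
        return 3, 5
    k_alpha = 2 * biobj_id + 1
    return k_alpha, k_alpha + 1
```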

An implementation of the above procedure is in effect a test function generator. Yet we emphasize that the generated instances of a parametrized function are meant to represent problems with similar difficulties. In the following, we only use a small number of fixed instances: for each bi-objective function and given dimension, the bbob-biobj suite contains by default 15 instances,^{9} which are chosen in advance and will remain fixed for a long period of time for comparability reasons.^{10}

Problem instances in the multiobjective case tend to differ more widely than in the single-objective case. Even when for each single objective the different instances are similar, different combinations of them can result in very different shapes of the Pareto front (for example, continuous vs. discontinuous) or in different difficulties to solve such problems (the orientation of level sets, for example, might be in accordance between the objectives or perpendicular—resulting in significantly different multiobjective problems when two ill-conditioned functions are combined).^{11} Additionally, for very different instances the target levels defined for the performance assessment may become incomparable. Consequently, we estimate the Pareto set's hypervolume per instance and do not adopt the technique from the single-objective case of aggregating results from different instances by simulated restarts (Hansen et al., 2021) to estimate runtimes to target values unattained on a given instance. The second reason to discard simulated restarts in the multiobjective case is the performance evaluation setup in COCO: we measure performance based on the entire archive of all so-far-evaluated solutions. Runs on different instances, however, produce incompatible archives, hence simulated restarts cannot account for the hypervolume attained by previous runs.

### 5.2 The bbob-biobj-ext Test Suite

Having all combinations of only a subset of the single-objective bbob functions in a test suite like the above bbob-biobj one also has a few disadvantages. Using only a subset of the 24 bbob functions introduces a bias toward the chosen functions and reduces the variety of difficulties a bi-objective algorithm is exposed to. Allowing all combinations of (a subset of the) bbob functions also increases the percentage of problems for which both objectives are from different bbob function groups. In practice, however, both objective functions may often come from a similar “function domain.”

The rationale behind the extended test suite, denoted as bbob-biobj-ext, is to reduce the mentioned effects. To this end, we add all within-group combinations of bbob functions which are not already in the bbob-biobj suite and which do not combine a function with itself. For technical reasons, we also remove the Weierstrass function $f_{16}$ because its optimum is not necessarily unique and computing the nadir point is more challenging. This extension adds $3 \cdot (4+3+2+1-1) + 2 \cdot (3+2+1-1) = 3 \cdot 9 + 2 \cdot 5 = 37$ functions, resulting in 92 functions overall.

#### 5.2.1 The bbob-biobj-ext Functions and Function Groups

The concrete function definitions are given in the supplementary material.

#### 5.2.2 Normalization and Instances

Normalization of the objectives and instances for the bbob-biobj-ext test suite are handled in the same manner as for the bbob-biobj suite, that is, the objective functions are not normalized and 15 instances are prescribed for a typical experiment.

### 5.3 Reference Implementation

Both the bbob-biobj and the bbob-biobj-ext suites have been implemented in the COCO platform which is freely available and supports access in various programming languages (Python, C/C++, Java, MATLAB/Octave). After installing COCO,^{12} using the suites is as easy as typing (here in an IPython shell):

```python
In [1]: import cocoex

In [2]: biobj_suite = cocoex.Suite('bbob-biobj', '', '')

In [3]: fun = biobj_suite.get_problem_by_function_dimension_instance(17, 10, 1)

In [4]: fun([0] * 10)  # evaluate at the origin
Out[4]: array([3.33606646e+06, 5.31128506e+01])
```

## 6 Insights into Problem Properties

In order to better understand the functions in the bbob-biobj and bbob-biobj-ext suites, we visualize and analyze their properties. We display their best known Pareto set and Pareto front approximations together with 1-dimensional search space cuts in Section 6.1, investigate a necessary optimality condition in Section 6.2, and provide 2-dimensional dominance-based (Section 6.3) and gradient-based plots (Section 6.4) known from the literature for all proposed functions. All discussed plots are available in the supplementary material.

### 6.1 Plots of Pareto Set and Front Approximations

In this section, we plot the best currently known approximations of the Pareto set and the Pareto front, obtained from running many algorithms and collecting the nondominated solutions evaluated during those runs.^{13}

We provide four different plots for each function: the top row of Figure 3 shows the projection onto an axes-parallel cut of the search space defined by two variables ($x1$ and $x4$ here). The second row shows the projection onto a random plane that contains both single-objective optima^{14} and gives contour lines for each objective function. The coordinate system is chosen such that the two optima lie on the first axis and their geometric center is the origin. The third row depicts the same data in the objective space in original scaling (as seen by the algorithm), while the last row shows the same in log-scale, normalized so that the ideal point is at (0, 0) and the nadir point is at (1, 1).

In addition to the best known Pareto set/Pareto front approximations (in black), the plots of Figure 3 show various 1-dimensional cuts (“lines”) through the search space: (i) along a random direction through each single-objective optimum (in blue), (ii) along each coordinate axis through each single-objective optimum (blue dotted lines), (iii) along the line connecting both single-objective optima (in red), (iv) two fully random lines^{15} (in yellow), and (v) a random line in the random projection plane going through both optima^{16} (in green). All those straight lines have the same length of 20 with the support vector in its exact middle. Ticks along the lines indicate line segments of the same length in search space. Thicker points on the lines depict solutions that are non-dominated with respect to all points on the same line.

The shape of these lines in objective space and the distribution of nondominated solutions along these lines (in objective and search space) indicate the difficulty of the bbob-biobj-ext problems for algorithms based on line searches. Sometimes smooth, sometimes rather chaotic looking lines in the objective space indicate that the proposed multiobjective problems show a wide range of different landscapes.
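Such line cuts follow a simple recipe (a sketch under the stated conventions: total length 20, support vector at the middle; the function name is ours):

```python
import numpy as np

def line_cut(support, direction, length=20.0, num_points=201):
    """Sample num_points equally spaced points on a straight line of the
    given length in search space, with the support vector at its middle."""
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)
    steps = np.linspace(-length / 2, length / 2, num_points)
    return np.asarray(support, dtype=float) + steps[:, None] * direction
```

Evaluating both objectives along the returned points and marking the solutions that are nondominated within the line reproduces the thicker points described above.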

Furthermore, the first two rows of Figure 3 highlight the projected region $[-5,5]^n$ as a gray-shaded area, while the gray-shaded areas in the objective space plots in the last two rows denote the regions of interest between the ideal ($\times$) and nadir ($+$) points.

Pareto set and Pareto front approximations of Figure 3 are downsampled such that only one solution per grid point is shown—with a precision of 2 decimals for the search space plots and 3 decimals for the objective space plots to define the grid. The number of all available and actually displayed solutions is indicated in the legend of each plot. Due to this downsampling, the number of displayed points ($\sim 20\,000$ or less) in Figure 3 is much smaller than the number of non-dominated solutions contained in the Pareto set approximation (about $6\times 10^6$ for $F_1$, $3\times 10^6$ for $F_{10}$, and $6\times 10^5$ for $F_{46}$). For links to the illustrations of *all* functions, we refer to the supplementary material.^{5}
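The grid-based downsampling can be sketched as follows (a minimal illustration; the function name and details are our own):

```python
import numpy as np

def downsample_by_grid(points, decimals=2):
    """Keep at most one point per grid cell, where the grid is defined by
    rounding each coordinate to `decimals` decimal places."""
    seen = {}
    for p in points:
        key = tuple(np.round(p, decimals))
        seen.setdefault(key, p)  # keep the first point falling into each cell
    return np.array(list(seen.values()))
```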

### 6.2 Investigations on Necessary Optimality Conditions

An important question to ask is whether our collected Pareto set approximations are qualitatively good, that is, how close they are to the true Pareto front. Assuming differentiable objective functions, a necessary condition for optimality of a point $x$ is that the two objective gradients are collinear and point in opposite directions (Miettinen, 1999).

For each solution in the best known Pareto set approximations, we numerically approximate the two objective gradients^{17} and compute the angle between them. The angle between the gradients is closely related to the *normalized multiobjective gradient* by Kerschke and Grimme (2017), in the bi-objective case defined as $\frac{v}{\|v\|}+\frac{w}{\|w\|}$, where $v$ and $w$ denote the two objective gradients. Its squared norm satisfies

$$\left\|\frac{v}{\|v\|}+\frac{w}{\|w\|}\right\|^2 = 2\,\bigl(1+\cos(v,w)\bigr), \tag{2}$$

where $\cos(v,w)=\frac{v^\top w}{\|v\|\,\|w\|}$. The proof is given in Appendix B. A value of 0, 1, or 2 for $1+\cos(v,w)$ in (2) means that the gradients point in the exact opposite direction (indicating local non-dominance), are orthogonal, or point in the very same direction, respectively.
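The quantity $1+\cos(v,w)$ in (2) is readily computed from two (approximated) gradient vectors; a minimal sketch:

```python
import numpy as np

def one_plus_cos(v, w):
    """Return 1 + cos(v, w) for two gradient vectors: 0 means the gradients
    point in opposite directions (local non-dominance), 1 means they are
    orthogonal, and 2 means they point in the very same direction."""
    v, w = np.asarray(v, dtype=float), np.asarray(w, dtype=float)
    return 1.0 + v @ w / (np.linalg.norm(v) * np.linalg.norm(w))
```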

Figure 4 shows the empirical cumulative distributions of the $1+\cos(\alpha)$ values over the Pareto set approximations, denoted *gradient angle plots* in the following. Shown are the first five instances (using the same color) of three different bbob-biobj functions in dimensions 2, 3, and 5.^{18} For the gradient angle plots of *all* functions, we refer to the supplementary material.

Depending on the function, two different observations can be made. For simpler, smooth functions such as the double sphere function $F_1=(f_1,f_1)$, almost all values of $1+\cos(\alpha)$ are smaller than $10^{-4}$ for all instances and dimensions—indicating that the gradients point in nearly opposite directions. For smooth functions, this also suggests that our Pareto set approximations are reasonably close to a true local Pareto set. This also holds for other combinations of objectives such as the sphere/sum of different powers function $F_6=(f_1,f_{14})$, the sphere/Schwefel function $F_9=(f_1,f_{20})$, the attractive sector/sum of different powers function $F_{23}=(f_6,f_{14})$, the double sum of different powers function $F_{41}=(f_{14},f_{14})$, and the double Schwefel $x\sin x$ function $F_{53}=(f_{20},f_{20})$, and with slightly higher cosine values for several other functions, for example the attractive sector/Gallagher 101 peaks function $F_{27}=(f_6,f_{21})$ and the double original Rosenbrock function $F_{28}=(f_8,f_8)$. Both lists are non-exhaustive.

On the other hand, there are functions with higher values of $1+\cos(\alpha)$, indicating (at first sight) that the Pareto set approximations are less well converged. Two examples, the sphere/sharp ridge function $F_5=(f_1,f_{13})$ and the original Rosenbrock/sharp ridge function $F_{29}=(f_8,f_{13})$, are shown in Figure 4. A closer look at the functions with the (strongest) right-shift in the ECDF reveals that often one of the objectives is multimodal, flat, or its gradient changes non-smoothly (which is the case for the sharp ridge function of Figure 4). In the latter case in particular, we do not expect the values of $1+\cos(\alpha)$ to be close to zero. We have verified that for the combination of sphere and sharp ridge, all solutions in our Pareto set approximation lie along a line between the two single-objective optima, but that depending on which side of the sharp ridge a point is located, the two objectives' gradients point in the same or the opposite direction, resulting in large and small values of $1+\cos(\alpha)$, respectively.

Although we do not know the exact Pareto set for the proposed functions (except for the double sphere function $F_1=(f_1,f_1)$), the generated nondominated Pareto set approximations seem accurate enough also based on consistency: both the large number of nondominated solutions from several algorithms, implemented by various authors, as presented here, and the nondominated solutions on a 2-dimensional grid, as presented in the next subsection, show qualitatively the same sets.^{19}

### 6.3 Dominance-Based Plots

The first row of Figure 5 presents plots of the dominance ratio (Fonseca, 1995, p. 71ff.). Based on solutions on a regular grid in search space, the dominance ratio of a grid point is the fraction of all grid points that dominate it. The plot uses a logarithmic color scale with the overall nondominated points shown in yellow.
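The dominance ratio over a finite set of grid points can be sketched as follows (assuming minimization of both objectives; an illustration, not the original implementation):

```python
import numpy as np

def dominance_ratios(objs):
    """For each objective vector, return the fraction of all grid points
    that dominate it (weakly better in all objectives, strictly in one)."""
    objs = np.asarray(objs, dtype=float)
    n = len(objs)
    ratios = np.empty(n)
    for i in range(n):
        better_eq = np.all(objs <= objs[i], axis=1)  # at least as good everywhere
        strictly = np.any(objs < objs[i], axis=1)    # strictly better somewhere
        ratios[i] = np.sum(better_eq & strictly) / n
    return ratios
```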

The nondominated grid points are presented in the second row of Figure 5 in black, together with the level sets of the first objective in blue and the second in red.

The last row of Figure 5 shows the so-called local dominance landscape plots by Fieldsend et al. (2019). These plots use, for each grid point, one of three different colors. In dark green, we show all grid points $P$ for which *all* 8 neighboring grid points (in the Moore neighborhood) are mutually nondominated to $P$. In landscape analysis terms, we can call the green regions in the plots “dominance-neutral local optima regions” (Fieldsend et al., 2019) in the sense that a dominance-based hill-climber will be able to explore a green region by single nondominated moves. The pink areas consist of grid points for which *at least one* neighboring grid point dominates them and all dominating movement paths from neighbors in pink regions lead to the same green region (Fieldsend et al., 2019). Each pink area is considered a “basin of attraction” of a green area in the sense that a local dominance-based hill climber can only move toward the included green area. Last, a grid point is assigned a white color if at least two of its dominating neighbors belong to two different basins of attraction—white areas in the plots therefore show the boundaries between the basins. The existence of multiple, distinct basins of attraction can be interpreted as multimodality of a multiobjective problem: a problem is denoted unimodal if and only if a single basin of attraction is present (and thus no two distinct local Pareto-optimal sets exist). Otherwise, the problem is called multimodal.
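The dominance-neutral (“green”) classification based on the Moore neighborhood can be sketched as follows; boundary points are left unmarked for simplicity, and the basin (pink) and boundary (white) colors would require an additional flood-fill step not shown here:

```python
import numpy as np

def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimization)."""
    return np.all(a <= b) and np.any(a < b)

def neutral_mask(F1, F2):
    """Mark interior grid points whose 8 Moore neighbors are all mutually
    nondominated with the point itself. F1 and F2 are 2-D arrays holding
    the two objective values on the grid."""
    rows, cols = F1.shape
    green = np.zeros((rows, cols), dtype=bool)
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            p = np.array([F1[i, j], F2[i, j]])
            ok = True
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    if di == 0 and dj == 0:
                        continue
                    q = np.array([F1[i + di, j + dj], F2[i + di, j + dj]])
                    if dominates(q, p) or dominates(p, q):
                        ok = False
            green[i, j] = ok
    return green
```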

### 6.4 Gradient-Based Plots

The gradient angle plots of Section 6.2 gave insight into the angle between the gradients of solutions contained in the best known Pareto set approximations. We now present bi-objective gradient plots based on 2-dimensional grids (Figure 6). For the plots of *all* functions, we again refer to the supplementary material.

The second row in Figure 6 shows the length of the path from each grid point toward the next local optimum. Inspired by the cumulated gradient field landscapes (Kerschke and Grimme, 2017), the length is (recursively) defined as the Euclidean distance to the Moore neighbor to which the bi-objective gradient points^{20} plus the path length of this neighbor.^{21} Cumulated gradient field landscapes sum gradient lengths instead of Euclidean distances. We use the latter because they better quantify the actual distance to the local optimum; however, both approaches lead to qualitatively very similar figures. Plotting the lengths of the normalized grid-based local search paths as in Figure 6 allows a visualization of the difficulty of multiobjective landscapes, at least for problems with 2 variables: although not as clearly as in the previous dominance-based plots, we can infer the boundaries of the basins of attraction of the local Pareto-optimal sets as the places in the search space where the path lengths of neighboring grid points decrease in two different directions; see the examples in Figure 6. With the normalized bi-objective gradient plots, we can therefore see how many distinct local Pareto-optimal sets exist.
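The recursive path-length accumulation can be sketched as follows, assuming the successor of each grid point (the Moore neighbor its normalized bi-objective gradient points to) has already been computed; cycles are assumed absent, as for a well-defined gradient field:

```python
import numpy as np

def path_lengths(next_idx, coords):
    """Accumulate the Euclidean path length from each grid point to its
    local optimum by following the successor map `next_idx`;
    next_idx[i] == i marks a local optimum with zero path length."""
    n = len(next_idx)
    lengths = np.full(n, np.nan)

    def length(i):
        if np.isnan(lengths[i]):
            if next_idx[i] == i:
                lengths[i] = 0.0
            else:
                step = np.linalg.norm(coords[i] - coords[next_idx[i]])
                lengths[i] = step + length(next_idx[i])
        return lengths[i]

    for i in range(n):
        length(i)
    return lengths
```

Memoizing the already-computed lengths keeps the recursion linear in the number of grid points.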

### 6.5 Summary of Observed Problem Properties

From the above visualizations, we report basic properties of the bbob-biobj-ext test functions and the corresponding Pareto set/front approximations that we provide for performance assessment. We investigate properties like continuous vs. discontinuous Pareto fronts and sets, the existence of nondominated points outside a certain region, and uni- or multimodality. Table 1 details these findings for the 2-variable instances 1–5, where search space related numbers correspond to the region $[-5,5]^n$.

We briefly summarize the main findings from Table 1. First, different instances of the same function share the same properties most of the time—especially for the number of basins of attraction and the Pareto front convexity. Similarly, 78 of the 92 bbob-biobj-ext functions do not differ among the instances 1–5 in terms of Pareto set/front (dis-)continuity—only the number of connected Pareto set/front parts changes in case discontinuities are observed.

Out of the 92 bbob-biobj-ext functions, only 27 (of which 18 belong to the 55 bbob-biobj suite) have a continuous Pareto front and a continuous Pareto set in all five instances (i.e., 29% resp. 33% of the functions). Only 9 functions show a convex Pareto front for all five instances. Similarly, all five instances of 59 functions are multimodal, as indicated by more than one basin of attraction. Hence we conclude that most of the proposed functions possess challenging Pareto set and Pareto front shapes.

Furthermore, the Pareto set of most problem instances of the proposed test suites lies within the hyperbox $[-5,5]^n$. However, 51 of the investigated $92\cdot 5=460$ instances (from 28 functions) show nondominated points outside this box. For this reason, the COCO implementation defines the region of interest as $[-100,100]^n$, although the single-objective optima are always in $[-4,4]^n$ (with the exception of the linear function $f_5$).

## 7 On Reporting Algorithm Performance

The definition of (test) problem instances is only the first step toward benchmarking optimization algorithms. Further steps entail choosing, recording, and displaying the actual performance measure(s). As the proposed test problems have been implemented in the COCO software, the postprocessing module of COCO can be directly used for the performance assessment (up to LaTeX templates for publication of the results), and we briefly discuss what this comprises. As best practice, we usually conduct seven steps to analyze the performance of an algorithm, here also illustrated with an example.

**Benchmarking Experiment** First, we run the algorithm on a chosen benchmark suite. COCO offers example code in all supported languages. A minimal code example in Python is shown in Hansen et al. (2021, Figure 2). Here, we have run the COMO-CMA-ES (Touré et al., 2019) with population size 100 on the bbob-biobj suite.

**Choosing Algorithms for Comparison** We decide on (baseline) algorithms to compare with. As of May 2021, data for browsing^{3} and downloading^{2} from 32 algorithms run on the bbob-biobj test suite are available online. Here, we choose SMS-EMOA and NSGA-II as baselines.

**Postprocessing** The postprocessing of COCO, here invoked by the command “python -m cocopp COMO-100 NSGA-II-Matlab SMS-EMOA-DE”, then downloads the corresponding baseline data and displays the algorithms' performance in both HTML and LaTeX/PDF format.

For multiobjective problems, COCO measures the value of an extended hypervolume indicator of all non-dominated solutions evaluated so far.^{22} Within a benchmarking experiment, COCO records the number of function evaluations (“runtime”) to reach certain indicator values, given as precisions to a reference value. For each instance, function, and dimension, the runtime to 58 target precisions is displayed (amongst other visualizations) in the form of empirical cumulative distribution functions (ECDF), see Hansen, Auger et al. (2016) and Hansen et al. (2010).^{23}
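In the bi-objective case, the hypervolume of a nondominated set with respect to a reference point reduces to a simple sweep over one objective; a minimal sketch (not COCO's implementation):

```python
def hypervolume_2d(points, ref):
    """Area dominated by a set of bi-objective points with respect to the
    reference point `ref`, assuming minimization. Points that do not
    dominate `ref` contribute nothing."""
    pts = sorted({(a, b) for a, b in points if a < ref[0] and b < ref[1]})
    hv, prev_b = 0.0, ref[1]
    for a, b in pts:
        if b < prev_b:  # skip points dominated within the set
            hv += (ref[0] - a) * (prev_b - b)
            prev_b = b
    return hv
```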

Test suites that contain many function instances in several dimensions require the aggregation of data when analyzing and sharing results. Conveniently, runtimes from different instances or functions can be meaningfully displayed in a single ECDF graph.

**Displaying and Discussing Summary Results** In publications, we show summary ECDFs over all functions in 5-D and 20-D and the results for each function group in 10-D.

For budgets larger than about $5\cdot 10^3$ times dimension evaluations, COMO-100 overall outperforms SMS-EMOA-DE and NSGA-II-Matlab,^{24} where the latter two are also more affected by increasing dimension. The qualitative differences in runtime behavior between the algorithms remain similar within function groups, while the differences are smaller in the groups involving multimodal objective functions.

**Investigating and Discussing Complementary Results** We always inspect all results on all single functions. Depending on the outcome of this inspection, we display further graphs. They may show the dependency on dimension or any other remarkable observation.

Figure 9 shows ECDF graphs for three single functions. We observe exceptionally good behavior of SMS-EMOA-DE on the double sphere function (Figure 9, left). The middle plot of Figure 9 shows an example where NSGA-II is better than SMS-EMOA-DE for all budgets. Finally, in the right plot, we observe that COMO-100 matches and exceeds the performance of the best 2016 reference algorithm on the ellipsoid/Gallagher 101 peaks function $F_{19}=(f_2,f_{21})$ for budgets between $5\cdot 10^3$ and $10^4$ times dimension evaluations but stops improving shortly thereafter.

**Processed Data Sharing** We put the HTML output from the postprocessing online, see *Postprocessed data* in the supplementary material. During scientific collaborations, we often share such data repeatedly with collaborators early in the process.

**Raw Data Sharing** Finally, we publish the raw data sets by creating a submission issue on GitHub.^{25} Raw data can also be “archived” with the cocopp.archiving module. Archived data that are then put online can be used in the postprocessing (cocopp) by everyone who knows the corresponding URL.

## 8 Conclusions

Designing test suites is a crucial part of benchmarking optimization algorithms. Arguably, the most problematic aspect of using *artificial* test functions to assess performance is their *representativeness* regarding difficulties observed in real-world problems. In this article, we suggest addressing the problem of representativeness in the multiobjective case by combining established single-objective test functions with known difficulties observed in practice. Following the concepts of the single-objective bbob test suite, we propose two concrete bi-objective test suites based on the idea of combining (subsets of) the existing bbob single-objective functions.

Our approach contrasts with most of the existing test suites for multiobjective optimization. These are based on the desirable property of having well-understood Pareto sets and Pareto fronts with analytical forms but have, on the other hand, artificial characteristics that are arguably underrepresented in real-world problems. Examples of such properties are separability, optima located exactly at the boundary constraints, and the existence of variables that solely control the distance between a solution and the Pareto front.

The disadvantage of unknown analytical forms of the Pareto sets and Pareto fronts in our proposal is addressed by collecting the nondominated solutions from extensive experiments with dozens of different optimization algorithms and providing and visualizing the Pareto set and Pareto front approximations for each problem. These visualizations lead to new insights into how such non-analytical Pareto sets and Pareto fronts may look in practice.

Our proposal is currently restricted to two objectives. With a growing number of objectives, the number of arbitrary combinations of single-objective functions grows quickly. This leads to long running times of the benchmarking experiment and, more importantly, discourages routinely scrutinizing new results on each function individually. Hence, we think that further pruning choices should be made to define test suites with more objectives.

## Acknowledgments

This work was supported by the grant ANR-12-MONU-0009 (NumBBO) of the French National Research Agency. We also thank Ilya Loshchilov and Oswin Krause for their initial suggestions on how to extend the bbob-biobj test suite. Tea Tušar acknowledges financial support from the Slovenian Research Agency (research project No. Z2-8177 and research program No. P2-0209) and the European Commission's Horizon 2020 research and innovation program (grant agreement No. 692286).

## Notes

^{1}

A function is said to have a *distance variable* if changing this variable only results in dominating or dominated solutions. In other words, a distance variable determines solely the distance of a solution from the Pareto front. A *position variable*, in turn, only results in incomparable solutions when changed, see Huband et al. (2006) for details.

^{2}

Updated list of algorithm data sets: https://numbbo.github.io/data-archive/bbob-biobj/.

^{3}

Postprocessed performances of most available algorithms are displayed online at https://numbbo.github.io/ppdata-archive/bbob-biobj/2016-all/ and https://numbbo.github.io/ppdata-archive/bbob-biobj/2019-all/.

^{4}

The best known hypervolume values for all supported test instances are available via the COCO platform at https://github.com/numbbo/coco/blob/master/code-experiments/src/suite_biobj_best_values_hyp.c. To create these hypervolume “reference” values, we relied on data from a large variety of numerical experiments—for details, see Section 6.1.

^{5}

Webpage with supplementary material: https://numbbo.github.io/bbob-biobj.

^{6}

The distance in objective space is defined here in such a way that the nadir and ideal points are at distance one in each coordinate. Note also that finding a set of nondominated solutions as large as possible might not always be the ultimate goal, in particular if the number of objective functions is large.

^{7}

Note that this assumption is not true for simple classical strategies such as successively optimizing a weighted sum $w\cdot f_1(x)+(1-w)\cdot f_2(x)$ with increasing weight $0\le w\le 1$.

^{8}

Performing numerical benchmarking experiments on a set of different instances of a parametrized function instead of experiments on a single fixed function has an immediate advantage: deterministic algorithms and stochastic algorithms can be compared easily in the same way stochastic algorithms are naturally compared. Running a deterministic algorithm on different instances of the same parametrized function introduces stochasticity of the runtime to reach certain target difficulties among runs in the same way as the combined stochasticity from the instance generation and the random events within a stochastic algorithm. Care, however, has to be taken that the variation of problem difficulty among instances is relatively low compared with the variation of difficulty between the actual benchmark functions. This assumption does not always hold for all instances of highly multimodal functions or for instances of most combinatorial optimization problems.

^{9}

In principle, as for the instance generation for the bbob suite, the number of possible instances for the bbob-biobj suite is unlimited (Hansen et al., 2021). However, even the tiniest performance difference can be made statistically significant with a high enough number of instances and repetition.

^{10}

Another reason to fix the instances for a long(er) time period is that no analytical form of the Pareto sets and Pareto fronts is available such that (good enough) Pareto set approximations need to be available for all instances for a proper performance assessment of algorithms.

^{11}

We have observed performance differences of up to two orders of magnitude (in the number of function evaluations to reach a certain hypervolume indicator precision) between instances in some cases. Differences of more than one order of magnitude occur in at most 30% of the function/dimension pairs with typical algorithms on the proposed bbob-biobj test suite. Due to the higher proportion of multimodal functions in the bbob-biobj-ext suite, differences among instances are more common there.

^{12}

Besides downloading the code, installation of the experiments part with access to the function suites is a single call `python do.py build-LANGUAGEOFINTEREST`.

^{13}

The nondominated solutions have been obtained from extensive numerical experiments with a wide range of optimization algorithms. In particular, all 32 algorithm data sets^{2} submitted to the BBOB workshops in the years 2016–2019 contributed, together with data from additional runs of random search and of weighted sum and Chebyshev scalarization functions optimized by the quasi-Newton BFGS method and by CMA-ES (Hansen and Ostermeier, 2001). Moreover, other well-known multiobjective algorithms such as NSGA-II (Deb et al., 2002), SMS-EMOA (Beume et al., 2007), HMO-CMA-ES (Loshchilov and Glasmachers, 2016), and COMO-CMA-ES (Touré et al., 2019) have been run as well. Non-dominated solutions evaluated during all these experiments have been collected to form the displayed Pareto set approximations. Those Pareto set approximations have also been used to calculate reference target precision values for the performance assessment—for details, see Brockhoff et al. (2016).

^{14}

To obtain the second axis, a standard normally distributed vector is orthogonalized with respect to the first axis (via Gram-Schmidt).

^{15}

of random direction through a random point drawn uniformly in $[-4,4]^n$

^{16}

with a random direction within the plane and through a random point drawn uniformly in $[-4,4]^2$

^{17}

Using a step length of $\epsilon=10^{-8}$ to approximate the $i$-th coordinate of the gradient in $x\in\mathbb{R}^n$ by $\frac{f(x+\epsilon\cdot e_i)-f(x-\epsilon\cdot e_i)}{2\epsilon}$, where $e_i\in\mathbb{R}^n$ is the Cartesian unit vector in the $i$-th variable.
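This central-difference scheme can be sketched as follows (an illustrative helper, not the original implementation):

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-8):
    """Central-difference approximation of the gradient of f at x:
    grad[i] = (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps)."""
    x = np.asarray(x, dtype=float)
    grad = np.empty_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad
```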

^{18}

Only every hundredth point of the Pareto set approximations is displayed to limit computation time and file sizes. We checked for some cases that the display of all solutions does not change the plots significantly. Reducing the number of displayed points further, for example to every thousandth point, however, starts to visibly roughen the plots. The legends in Figure 4 show the number of displayed points compared with the entire size of the Pareto set approximation for each dimension and for the last instance.

^{19}

In the cases where our Pareto set approximations go beyond the region $[-5,5]^n$, the grid-based approach shows additional artifacts for solutions connected to the grid boundary.

^{20}

The 360 degrees of possible directions are assigned evenly to the eight neighbors.

^{21}

A grid point whose normalized bi-objective gradient is below $10^{-6}$ in length is considered to be at a (local) Pareto optimum and assigned zero path length.

^{22}

More specifically, the (multiobjective) performance measure used in COCO makes a case distinction. When at least one solution dominates the hypervolume reference point, the hypervolume indicator of all non-dominated solutions is used, normalized so that the ideal point is at the origin and the nadir point at $(1,1)$, with this nadir point as the (normalized) reference point. Otherwise, the performance is measured as the negative of the minimal distance of any found solution (in objective space and normalized as above) to the box $[0,1]^2$; for details, see Brockhoff et al. (2016).
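The case distinction of this footnote can be sketched as follows (a simplified illustration restricted to two normalized objectives; not the COCO source code):

```python
import numpy as np

def indicator_value(normalized_objs):
    """If some solution dominates the reference point (1, 1), return the
    hypervolume of the nondominated solutions with respect to (1, 1);
    otherwise return the negative of the smallest distance of any
    solution to the box [0, 1]^2."""
    objs = np.asarray(normalized_objs, dtype=float)
    inside = objs[np.all(objs < 1.0, axis=1)]
    if len(inside) > 0:
        # 2-D hypervolume by sweeping over the first objective
        pts = sorted(map(tuple, inside))
        hv, prev_b = 0.0, 1.0
        for a, b in pts:
            if b < prev_b:
                hv += (1.0 - a) * (prev_b - b)
                prev_b = b
        return hv
    # distance of the closest solution to the box [0, 1]^2
    clipped = np.clip(objs, 0.0, 1.0)
    return -np.min(np.linalg.norm(objs - clipped, axis=1))
```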

^{23}

All plots shown in this article have been prepared with COCO version 2.4 and the corresponding hypervolume reference values. Improved hypervolume reference values as well as the bbob-biobj-ext suite are available from version 3.0.

^{24}

Displayed is the performance of the three algorithms COMO-CMA-ES with a population size of 100 (“COMO-100”, Touré et al., 2019; Dufossé and Touré, 2019), SMS-EMOA with differential evolution as search operator (“SMS-EMOA-DE”, Beume et al., 2007; Auger et al., 2016b) and the MATLAB implementation of NSGA-II (“NSGA-II-Matlab”, Deb et al., 2002; Auger et al., 2016a).

## References

*Genetic and Evolutionary Computation Conference Companion (GECCO)*

*Genetic and Evolutionary Computation Conference Companion (GECCO)*

*European Journal of Operational Research*

*Genetic and Evolutionary Computation Conference (GECCO)*

*CoRR*

*Benchmark functions for the CEC'2017 competition on many-objective optimization*

*Multidisciplinary design optimization in computational mechanics*

*Multi-objective optimization using evolutionary algorithms*

*IEEE Transactions on Evolutionary Computation*

*Evolutionary multiobjective optimization: Theoretical advances and applications*

*Genetic and Evolutionary Computation Conference Companion (GECCO)*

*Evolutionary Multi-Criterion Optimization*

*Genetic and Evolutionary Computation Conference (GECCO)*

*Real-parameter black-box optimization benchmarking 2009: Presentation of the noiseless functions*

*Evolutionary Computation*

*ACM Transactions on Mathematical Software*

*Real-parameter black-box optimization benchmarking 2009: Experimental setup*

*Genetic and Evolutionary Computation Conference Companion (GECCO)*

*Optimization Methods and Software*

*Evolutionary Computation*

*CoRR*

*Evolutionary Multi-Criterion Optimization*

*Problem definitions for performance assessment of multiobjective optimization algorithms*. Technical Report. Nanyang Technological University.

*IEEE Transactions on Evolutionary Computation*

*Evolutionary Computation*

*International Conference on Evolutionary Multi-Criterion Optimization*

*Parallel Problem Solving from Nature*

*Parallel Problem Solving from Nature*

*IEEE Transactions on Evolutionary Computation*

*Genetic and Evolutionary Computation Conference Companion (GECCO)*

*International Conference on Genetic Algorithms and Their Applications*

*Artificial Intelligence Review*

*Genetic and Evolutionary Computation Conference (GECCO)*

*Multiobjective evolutionary algorithm research: A history and analysis*

*Genetic and Evolutionary Computation Conference (GECCO) Workshop Program*

*Symposium on Applied Computing*

*Evolutionary Computation*

*IEEE Transactions on Evolutionary Computation*

## Appendix A On the Single-Objective bbob Functions

This section discusses properties of real-world problems and how the bbob test suite models different problem difficulties. It then gives more details about the bbob functions, their function groups and instances.

### A.1 Real-World Function Properties

We present here in short the general properties of objective functions that are related to difficulties observed in real-world problems. It depends on these properties whether an optimization problem is easy or hard to solve. These properties build the basis of the five function groups described in Section 4.

A *separable* function does not have any dependencies among its variables and can therefore be optimized by applying $n$ independent one-dimensional optimizations along each coordinate axis while keeping the other variables fixed. Difficult optimization problems are typically not separable and thus, *nonseparable* optimization problems should be considered. The typical well-established technique to generate nonseparable benchmark functions from separable ones is the application of a rotation matrix: if $g(x)$ is a separable function with respect to $x$ and $R\in\mathbb{R}^{n\times n}$ is a rotation matrix, then $g(Rx)$ will generally be nonseparable with respect to $x$.
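Pseudo-random orthogonal matrices for this purpose are commonly generated via the QR decomposition of a Gaussian matrix; a minimal sketch (function names and the illustrative ellipsoid are our own, not the bbob definitions):

```python
import numpy as np

rng = np.random.default_rng(42)

def random_orthogonal(n):
    """Pseudo-random orthogonal matrix via QR decomposition of a Gaussian
    matrix; the sign correction makes the factorization unique."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

def separable_ellipsoid(x):
    """A separable, ill-conditioned quadratic (illustrative only)."""
    return sum(10.0 ** i * xi ** 2 for i, xi in enumerate(x))

R = random_orthogonal(3)

def rotated_ellipsoid(x):
    """g(Rx): generally nonseparable with respect to x."""
    return separable_ellipsoid(R @ np.asarray(x, dtype=float))
```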

A *unimodal* function has only one local minimum, which is at the same time also its global one. A *multimodal* function has multiple local minima, which is highly common in practical optimization problems. We consider a multimodal function to have *weak global structure* if the qualities (the $f$-values) of the local optima are only weakly related to their locations in the search space, for example, when neighboring optima do not generally have similar quality values.

*Ill-conditioning* is another typical challenge of real-parameter optimization and, besides multimodality, probably the most common one. The condition number measures, loosely speaking, how strongly the steepness of the gradient depends on the position within a level set. A small condition number, close to one, indicates a well-conditioned function with little dependency. A large condition number indicates a more difficult, ill-conditioned function with strong dependency between steepness and position. The condition number of convex quadratic functions is the ratio between largest and smallest eigenvalue of the Hessian matrix. Geometrically, these eigenvalues correspond, respectively, to the shortest and longest principal axis of the contour ellipsoids.
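For a convex quadratic, the condition number is readily computed from the eigenvalues of the Hessian; a small worked example:

```python
import numpy as np

# Convex quadratic f(x) = 0.5 * x^T H x with Hessian H;
# the contour ellipses have axis ratio sqrt(cond).
H = np.array([[100.0, 0.0],
              [0.0, 1.0]])
eigvals = np.linalg.eigvalsh(H)          # sorted ascending
cond = eigvals[-1] / eigvals[0]          # largest over smallest eigenvalue
```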

The bbob test suite contains ill-conditioned functions with a typical conditioning of $10^6$. We believe this to be a realistic level of difficulty, although we have seen practical problems with conditioning as large as $10^{10}$ (Collange et al., 2010).

### A.2 Balancing Problem Difficulties

It is worth noting that in several existing single-objective test suites, some of the easier properties are overrepresented. For example, in the CUTEr/CUTEst test suite (Gould et al., 2005), 202 (54%) out of the 375 functions that are labeled as unconstrained or bound constrained are of the “sum of squares” type; a further 58 (15%) are quadratic. Furthermore, out of the 191 problems with a fixed dimension, there are 49 (26%) that have only two variables while only 31 (16%) have a dimension larger than 10.

Such an overrepresentation is not a big problem per se, but when making statements on algorithm performance aggregated over all functions in a suite, one has to keep in mind that the performance of the better algorithms might simply come from the fact that they are tailored toward simpler problems.

With the bbob test suite, all problems are scalable in dimension and belong to a certain problem group, sharing similar difficulties. It is therefore possible to aggregate performance data only over a subset of the functions sharing the same properties. Having all problem groups of similar size also avoids problems of overfitting to certain difficulties if aggregated results are presented.

### A.3 Function Instances

In an abstract manner, the functions optimized are instances of a parametrized function $F_\theta$ (as introduced in Section 2); the parameter $\theta$ is instantiated pseudo-randomly from an integer, the so-called instance number, as well as potentially from the function number. We refer to a function class as a set of functions $\{F_\theta : \theta\in\Theta\}$ and we often name the function class after its raw function.

Transformations that are shared by all bbob functions are shifts in the optimal function value and a pseudo-random location of the optimum. In addition, several of the nonseparable functions are created by pseudo-random rotations of the search space and many of the simpler functions are made less regular by non-linear transformations in both search and objective space. See Hansen et al. (2009) for more details.

Though the potential set of instances for a given bbob function is unbounded (and can be indexed by any positive integer), numerical benchmarking experiments are typically run on 10–15 of these instances. Default instances in the COCO implementation may change from year to year to avoid overfitting. Note also that some instances may be more difficult or easier to solve than others. However, in general, the differences in difficulty among instances of the same bbob function are smaller than among different functions.

### A.4 Normalization and Target Difficulties

All bbob functions are normalized in the sense that the given target function values/difficulties around the optimal function value are comparable over functions and instances. Functions are provided with an $f$-offset such that the optimal function value is, loosely speaking, a realization of a Cauchy distribution with median zero and interquartile range 200. The optimal function value is furthermore rounded to two decimal places and set to $\pm 1000$ if its absolute value exceeds 1000 (Hansen et al., 2009). The target difficulties are computed as a set of differences to the optimal function value. The differences are equally spaced on the log scale and the same for all functions and instances. Algorithms, however, are not allowed to use or exploit any of this information (Hansen, Tušar et al., 2016).
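Log-equally-spaced target precisions can be generated as follows (the bounds and count here are illustrative, not COCO's exact values):

```python
import numpy as np

def target_precisions(largest=100.0, smallest=1e-8, num=51):
    """Target precisions equally spaced on the log scale; adding them to
    the optimal function value defines the targets."""
    return np.logspace(np.log10(largest), np.log10(smallest), num)
```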