## Abstract

In test-based problems, commonly approached with competitive coevolutionary algorithms, the fitness of a candidate solution is determined by the outcomes of its interactions with multiple tests. Usually, fitness is a scalar aggregate of interaction outcomes, and as such imposes a complete order on the candidate solutions. However, passing different tests may require unrelated “skills,” and candidate solutions may vary with respect to such capabilities. In this study, we provide theoretical evidence that scalar fitness, inherently incapable of capturing such differences, is likely to lead to premature convergence. To mitigate this problem, we propose disco, a method that automatically identifies the groups of tests for which the candidate solutions behave similarly and which thereby define the above skills. Each such group gives rise to a *derived objective*, and these objectives together guide the search algorithm in a multi-objective fashion. When applied to several well-known test-based problems, the proposed approach significantly outperforms the conventional two-population coevolution. This opens the door to efficient and generic countermeasures to premature convergence for both coevolutionary and evolutionary algorithms applied to problems featuring aggregating fitness functions.

## 1 Introduction

Many optimization and learning problems approached in evolutionary computation involve objective functions that reward the candidate solutions by counting the number of *tests* they pass. When evolving computer programs or controllers, passing a test requires producing the desired output for a given input. When learning game strategies, tests are embodied by opponents, and a candidate solution passes a test if it wins a game against it. In these problems, known in the context of coevolutionary algorithms (Axelrod, 1997; Hillis, 1992; Reynolds, 1994; Rosin, 1997) as *test-based problems* (Bucci et al., 2004; de Jong and Pollack, 2004; Jaśkowski, 2011), candidate solutions need to *interact* with multiple “environments” in order to be evaluated.

An objective function that counts the number of passed tests usually forms an inherent part of the problem and makes it amenable to many conventional algorithms that expect a scalar objective. However, the aggregation of interaction outcomes inevitably leads to information loss, primarily due to compensation: two solutions that pass *k* tests each are considered equally valuable, no matter *which* particular tests they pass. Also, aggregation neglects the fact that some tests can be inherently more difficult than others, or more or less hard to reach for a given search algorithm. Algorithms that rely on aggregation are oblivious to these aspects, and thus are prone to inferior performance.

This problem is more common than it may appear, because aggregating objective functions prevail in practice. Problems that involve a single test are few and far between. In most applications, candidate solutions need to be tested in various environments, due to their complexity and/or the presence of noise in an evaluation process. Also, compensation is not exclusive to domains with binary interaction outcomes, where a test can be only passed or failed, but will occur also in the case of continuous outcomes.

As was argued by Krawiec and O’Reilly (2014a) and Kaplan and Hafner (2006), the aggregation of test outcomes causes an unnatural “evaluation bottleneck” in the communication between an evaluation function and a search algorithm. However, this is more a matter of routine than a necessity. Wherever possible, a search algorithm should be provided with richer information on solution characteristics, which will enable it to perform better. Several past studies followed that intuition (see Section 5), but little has been done so far to propose a generic, principled approach to address this issue.

Following the conceptual framework laid down by Bucci et al. (2004) and de Jong and Bucci (2006), in Liskowski and Krawiec (2014) we started investigating the possibility of heuristic discovery of the *underlying objectives* of a problem. The generic approach we proposed therein addresses the compensation problem by clustering the interaction outcomes into *derived objectives* and so capturing the most prominent “skills” exhibited by candidate solutions in the context of a population. The derived objectives are more informative than a scalar evaluation and lend themselves to multi-objective evolutionary algorithms (Deb et al., 2002).

Here we extend Liskowski and Krawiec (2014) with several new contributions. We lay down formal underpinnings, proving that derived objectives preserve a great deal of dominance between candidate solutions. We provide additional motivation, delineating the class of scenarios in which scalar evaluation is inevitably misleading, while our approach still has a chance to succeed. We propose several new variants of the method, including automatic adjustment of the number of derived objectives, rendering the method virtually parameter-free. The resulting configurations undergo a thorough experimental analysis in three unrelated test-based domains: iterated prisoner’s dilemma (Chong and Yao, 2005), number games (de Jong and Pollack, 2004), and density classification (Ficici and Pollack, 2001). We analyze the algorithm performance and search dynamics, the number of discovered objectives, and the correlations between them. The outcomes vote in favor of our approach and have interesting implications.

## 2 Test-Based Problems

Optimization problems are traditionally formulated as a search for a candidate solution that maximizes (or minimizes) a given objective function. However, in some of the most interesting problems that evolutionary computation might address, obtaining the exact value of an objective function is computationally intractable. This may stem from many causes; here, we are interested in those pertaining to *test-based problems* (Bucci et al., 2004; de Jong, 2004; Jaśkowski and Krawiec, 2011; Popovici et al., 2012), where the performance of a candidate solution is determined by the outcomes of multiple *interactions* with *tests*. An interaction between a candidate solution and a test produces a scalar outcome that reflects the capability of the former to *pass* the latter. Typically, the set of tests is large, making it infeasible to evaluate candidate solutions on all of them.

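Formally, we define a test-based problem as a tuple $(\mathcal{S}, \mathcal{T}, g, Q)$, where:

- $\mathcal{S}$ is a set of candidate solutions (*solutions* for short),
- $\mathcal{T}$ is a set of tests with which the candidate solutions interact,
- $g: \mathcal{S} \times \mathcal{T} \to \mathbb{R}$ is an *interaction function*, and
- $Q: \mathcal{S} \to \mathbb{R}$ is a candidate solution *quality function*.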

In general, solving a test-based problem consists in finding a candidate solution with certain properties captured by a *solution concept* (Ficici, 2004), which articulates the goal of search. A solution concept is a formalism that specifies which elements of a search space are solutions to a problem. A solution may be a single candidate solution or comprise a *subset* of them; in either case, a solution concept defines the properties such a formal object must meet. Much of the theoretical work has analyzed coevolutionary algorithms in terms of the solution concepts they implement (Ficici, 2004, 2005; Popovici et al., 2012), showing the importance of understanding and correctly choosing the right solution concept for the problem at hand.

In this study, we focus on the solution concept of maximization of expected utility (MEU), in which the quality function $Q$ is the *expected utility*, that is, the average outcome against all tests:

$$Q(s) = \mathbb{E}_{t \in \mathcal{T}}\,[g(s,t)]. \tag{1}$$

A solution in the sense of MEU is an $s^* \in \mathcal{S}$ that maximizes $Q$ in $\mathcal{S}$. Even though MEU has been proven to be globally nonmonotonic (Ficici, 2004) and $Q$ tends to be sensitive to the distribution of tests’ characteristics (*behaviors*, *phenotypes*) (Jaśkowski et al., 2016), MEU is very common in the literature, often presented as the concept implemented by competitive coevolution, where tests evolve to challenge the candidate solutions and thus force them to improve in quality. It is also popular in real-world studies on, for instance, games (even when not explicitly referred to), because average performance against opponents is often the characteristic sought for in practice (all of the above reservations notwithstanding).

Computing $Q$ is challenging in many test-based problems, because the number of tests in $\mathcal{T}$ is usually large or infinite. This can be mitigated by estimating utility by confronting the solution with a *sample* $T \subset \mathcal{T}$ of tests of a computationally manageable size. This leads to an approximate quality function, which is commonly used as a *fitness function* in evolutionary computation:

$$f_T(s) = \frac{1}{|T|} \sum_{t \in T} g(s,t). \tag{2}$$

Notice that $f_T$ is an unbiased estimator of $Q$ when $T$ is a uniform sample of $\mathcal{T}$. In the following discussion, we refer to this as “scalar evaluation.”
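As a minimal sketch of scalar evaluation, the snippet below averages interaction outcomes over a uniform sample of tests. The toy domain (integer thresholds as solutions, integers as tests) and all names are assumptions made for illustration only; they are not taken from the experiments in this paper.

```python
import random


def scalar_fitness(solution, tests, g):
    """Average interaction outcome of `solution` over the given `tests`."""
    return sum(g(solution, t) for t in tests) / len(tests)


# Hypothetical toy domain (an assumption): a solution is an integer threshold,
# a test is an integer, and the solution passes the test when the test value
# does not exceed the threshold.
def g(solution, test):
    return 1 if test <= solution else 0


universe = list(range(100))        # stands in for the full set of tests
rng = random.Random(0)
sample = rng.sample(universe, 20)  # uniform sample of manageable size

exact = scalar_fitness(50, universe, g)   # exact utility, feasible only in toy cases
estimate = scalar_fitness(50, sample, g)  # sample-based scalar evaluation
```

In realistic test-based problems only the sample-based estimate is available, because the full set of tests cannot be enumerated.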

The algorithms that naturally match the class of test-based problems are two-population coevolutionary algorithms (Axelrod, 1997; Hillis, 1992; Reynolds, 1994; Rosin, 1997). A typical coevolutionary algorithm maintains a population of candidate solutions $S \subset \mathcal{S}$ and a separate population of tests $T \subset \mathcal{T}$. In every generation, each candidate solution $s \in S$ interacts with every test $t \in T$, producing an interaction outcome *g*(*s*,*t*). The interaction outcomes are gathered in an *interaction matrix* *G*, from which the fitness values of individuals in *S* and *T* are calculated. Despite the risk of falling victim to so-called *coevolutionary pathologies* (Blair and Pollack, 1997; Paredis, 1997; Watson and Pollack, 2001), coevolutionary algorithms have proved effective at solving many nontrivial instances of test-based problems, including learning game strategies (Chellapilla and Fogel, 2001) and evolving controllers (Stanley and Miikkulainen, 2004). The method proposed in this study is a variant of a coevolutionary algorithm.

Test-based problems are typically associated with applications in games. Indeed, a game-playing agent must usually be tested against many opponents in order to assess its quality, and the number of such opponents is often very large (Popovici et al., 2012). The class of test-based problems, however, is much wider. Programs evolved in genetic programming are usually applied to multiple *fitness cases*. When evolving controllers that, for example, maintain the balance of an inverted pendulum or drive a vehicle, it is common to perform multiple simulations that vary in initial conditions or other parameters (see Poli et al., 2008, for a review). Also, whenever the evaluation of candidate solutions is stochastic or involves noise, their robustness needs to be assessed in multiple scenarios that vary in the realization of the underlying random variable(s).

For the sake of this study, it will be sufficient to consider a subclass of test-based problems with a binary interaction function ($g: \mathcal{S} \times \mathcal{T} \to \{0, 1\}$). We say that *s* *passes* (or *solves*) *t* iff $g(s,t) = 1$, and that *s* *fails* *t* iff $g(s,t) = 0$. With such a definition of interaction, the expected utility *Q* counts the tests passed by *s*.

## 3 Motivations

Scalar evaluation suffers from *compensation* (Section 1): if two solutions pass the same number of tests, they are considered equally valuable, regardless of *which* particular tests they pass. Otherwise, one of the compared solutions is deemed better, but, again, disregarding the behavior on particular tests. Compensation can be avoided by comparing the candidate solutions using the *dominance relation*:

$$s_1 \succ s_2 \iff \big(\forall t \in T: g(s_1,t) \ge g(s_2,t)\big) \;\wedge\; \big(\exists t \in T: g(s_1,t) > g(s_2,t)\big). \tag{3}$$

Dominance compares the behavior of solutions on particular tests and in this sense is more scrupulous than the expected utility. However, it is a partial relation and fails to provide a useful search gradient whenever none of the compared solutions passes a superset of tests solved by the other solution (cf. Krawiec, 2002). This is unfortunately common: for two unrelated solutions, it is much more likely that they are mutually nondominated than that one of them dominates the other, and that likelihood grows with the number of tests in *T*.
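The contrast between dominance and scalar evaluation can be sketched in a few lines. The outcome vectors below are hypothetical and serve only to show that two solutions with equal scalar fitness may be mutually nondominated.

```python
def dominates(row_a, row_b):
    """True if row_a is at least as good on every test and strictly better on one."""
    at_least_as_good = all(x >= y for x, y in zip(row_a, row_b))
    strictly_better = any(x > y for x, y in zip(row_a, row_b))
    return at_least_as_good and strictly_better


# Hypothetical outcome vectors over four tests (1 = pass, 0 = fail).
s1 = [1, 1, 0, 0]
s2 = [1, 0, 0, 0]
s3 = [0, 0, 1, 1]

print(dominates(s1, s2))                     # True
print(dominates(s1, s3), dominates(s3, s1))  # False False
print(sum(s1) == sum(s3))                    # True: same number of passed tests
```

Here `s1` and `s3` pass the same number of tests, so scalar evaluation treats them as equal, while dominance reports them as incomparable and gives no preference either way.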

Expected utility and dominance thus occupy two extremes in scrutinizing interaction outcomes, and the method proposed in Section 4 is a compromise between them. To motivate its design, we discuss the vices of scalar evaluation in more detail.

1. Tests are often characterized by different *difficulties*, which are not known a priori. Coming across a problem instance with all tests equally difficult is much less likely than difficulty varying across the tests. In our previous work (Jaśkowski et al., 2013), we estimated the difficulty of various game strategies, meaning the number of opponents they defeat. We observed highly nonuniform distributions of difficulty among the tests. Scalar evaluation measures are oblivious to that: a search algorithm driven by them considers solving *k* easy tests as valuable as solving *k* difficult tests.

2. Any stochastic search algorithm (other than a purely random search) has a *search bias*, that is, it is more likely to visit some candidate solutions than others. For instance, the search bias of a single-bit mutation in a genetic algorithm inclines it to visit, in a given search step, solutions that are similar (in the sense of Hamming distance) to the solutions in the current population.

As a consequence of diverse test difficulties and search biases, a search algorithm driven by a scalar evaluation measure tends to converge to candidate solutions that solve the tests that are easier and better “reachable.” In parallel search techniques like evolutionary or coevolutionary algorithms, the probable aftermath is premature convergence, which we illustrate with the following example.

Consider a test-based problem with two tests, $T = \{t_1, t_2\}$. There are thus four possible combinations of interaction outcomes: 00, 01, 10, and 11. The worst candidate solutions fail both tests (00) and thus receive fitness 0. The best (optimal) candidate solutions pass both tests (11), and their fitness is 2.

These combinations describe the *behaviors* of candidate solutions on tests; the representations of candidate solutions and the underlying search space are arbitrary. Given a problem instance and a search operator $o$, we could verify which moves in $\mathcal{S}$ are realizable by $o$, and based on that produce a directed graph reflecting the corresponding realizable transitions between the combinations of test outcomes. In that graph, nodes correspond to combinations of interaction outcomes, and edges correspond to moves. For instance, the edge connecting the node 01 to the node 11 in Figure 1a indicates that there exists at least one pair of candidate solutions $s_1, s_2 \in \mathcal{S}$ such that $o$ can produce $s_2$ from $s_1$ and $g(s_1, t_1) = 0$, $g(s_1, t_2) = 1$, $g(s_2, t_1) = 1$, and $g(s_2, t_2) = 1$.

Consider a hypothetical problem instance with a graph such as that shown in Figure 1a, with nodes arranged in layers that correspond to values of expected utility. Assume a search algorithm equipped with a search operator *o*, which starts with one or more candidate solutions in 00—that is, such that they fail both tests. It does not take long to realize that such a problem is easy to solve: the transitions are aligned along the fitness gradient, so the algorithm will likely traverse the path from 00 to 11. The problem in Figure 1b is also solvable, as a path from 00 to 11 exists. Nevertheless, the transition from 01 to 10 is not accompanied by an improvement (the fitness remains 1). A search algorithm that accepts only strict improvements will get stuck at 01.

This is not much of a problem for stochastic search (e.g., an evolutionary algorithm), which can still move from 01 to 10 by pure chance. Consider, however, the problem shown in Figure 1c, this time with three tests (and thus fitness varying from 0 to 3 inclusive). Once search reaches the combination 011, further progress can be made only by moving to the combination 100, which implies decreasing fitness from 2 to 1. Only search algorithms that accept deteriorations can escape this trap and so avoid premature convergence.
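The trap can be illustrated with a small reachability check. The transition graph below is a hypothetical stand-in, not a reproduction of Figure 1c; it only preserves the property described above, namely that the sole way forward from 011 leads through 100.

```python
from collections import deque

# Hypothetical transition graph (an assumption): nodes are combinations of
# outcomes on three tests, edges are moves realizable by a search operator.
EDGES = {
    "000": ["001"],
    "001": ["011"],
    "011": ["100"],
    "100": ["110"],
    "110": ["111"],
    "111": [],
}


def fitness(node):
    """Scalar fitness: the number of passed tests."""
    return node.count("1")


def reachable(start, accept):
    """Nodes reachable from `start` using only moves with accept(old, new) true."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in EDGES[node]:
            if nxt not in seen and accept(fitness(node), fitness(nxt)):
                seen.add(nxt)
                queue.append(nxt)
    return seen


no_worsening = reachable("000", lambda old, new: new >= old)
any_move = reachable("000", lambda old, new: True)
```

Under these assumptions, a search that never accepts a fitness decrease stays within 000, 001, and 011, whereas a search that tolerates deterioration can reach 111.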

The above example is simple for the sake of clarity. In practice, the transitions between combinations of test outcomes, rather than being possible or impossible, will be more or less *likely*. Nevertheless, the problem will persist and manifest in the *likelihood* rather than the *possibility* of reaching an optimum.

The presence and absence of various paths in Figure 1 is closely related to fitness landscapes (Wright, 1932). In particular, the case in Figure 1a can be associated with a unimodal fitness landscape, the one in Figure 1b with a *plateau*, and the one in Figure 1c with a *trap* (*deception*). This is, however, where the analogy ends. Fitness landscapes visualize a scalar objective function and stretch over the space of solutions arranged with respect to the actions of search operators. The nodes in our graphs correspond not to candidate solutions, but to the behavioral equivalence classes determined by combinations of interaction outcomes.

In nontrivial problems, such graphs will be not only large, but also very “tangled.” This is because the mapping from the “genotype” of candidate solutions (the elements of $\mathcal{S}$) to the phenotype/behavior (the combinations of interaction outcomes) can be particularly complex. A minute modification of the former may cause a dramatic change in the latter. On the other hand, even a major change in genotype can be phenotypically neutral. The domains of game playing and program synthesis are good examples here. In games, the complexity of the genotype–phenotype mapping stems from the usually sequential nature of games, where rewards for players are known only after they have made a series of moves. In program synthesis, complexity of the genotype–phenotype mapping results primarily from the interactions between instructions within a program. In Hu et al. (2011), a weighted graph similar to those in Figure 1 was constructed, with nodes corresponding to combinations of outputs of GP programs, and the weights of edges reflecting the likelihood of moving from one behavior to another. The graph was strongly asymmetric, with some transitions very common and some extremely unlikely (see Fig. 2 in Hu et al., 2011). As a consequence, some nodes were almost isolated from the remaining part of the graph, which made them particularly difficult to arrive at.

Evolutionary algorithms, by performing more or less global parallel search, are *in principle* resistant to compensation of the interaction outcomes, because their stochastic nature allows them to visit (albeit only in the limit) all points in the reachable search space. This, however, does not mean that they cannot be extended to address the above issues, which we attempt in the next section.

## 4 Discovery of Search Objectives (disco)

The previous section showed that achieving certain *combinations* of interaction outcomes can be the key to a more effective search. Based on this observation, we propose Discovery of Search Objectives (disco), a method that identifies the combinations that prevail in the current interaction matrix and uses them to gauge the candidate solutions. This causes novel combinations of test outcomes to be protected from being “forgotten” in the search, even if they are inferior in terms of expected utility.

disco replaces the conventional evaluation stage of a two-population coevolutionary algorithm (Section 2). Given a population *S* of *m* candidate solutions and a population *T* of *n* tests, it proceeds as follows:

1. Calculate the interaction matrix *G* between the candidate solutions from *S* and the tests from *T* using the interaction function *g*.
2. Cluster the tests. We treat every column of *G*, that is, the vector of the interaction outcomes of all solutions from *S* with a test *t*, as a point in an *m*-dimensional space. A clustering algorithm of choice is applied to the *n* points obtained in this way. The outcome of this step is a partition of the original *n* tests in *T* into *k* subsets/clusters $\{T_1, \ldots, T_k\}$, $1 \le k \le n$, where the $T_j$ are nonempty and pairwise disjoint and $\bigcup_{j} T_j = T$.
3. Define the derived objectives. For each cluster $T_j$, average the corresponding columns of *G*. This results in an $m \times k$ *derived interaction matrix* $G'$ with elements

   $$g'_{i,j} = \frac{1}{|T_j|} \sum_{t \in T_j} g(s_i, t). \tag{4}$$

The columns of $G'$ implicitly define the *k* *derived objectives* that characterize the candidate solutions in *S*. The value of the *j*th such objective for a candidate solution $s_i$ is $g'_{i,j}$. Note, however, that these objectives are derived from the interactions between a specific *S* and a specific *T*, and as such are undefined outside *S*. Nevertheless, to emphasize the analogy between the derived objectives and the original interaction function *g*, we will alternatively denote $g'_{i,j}$ as $g'_j(s_i)$.

The derived objectives form a multi-objective characterization of the candidate solutions in the context of the current population of tests. They can be subsequently employed by a multi-objective selection method—for instance, nsga-ii (Deb et al., 2002).
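The evaluation stage can be sketched as follows. The method leaves the clustering algorithm open, so the tiny k-means below (with the first distinct columns as initial centroids) is only one possible choice and an assumption of this sketch, as are the function name and the example matrix.

```python
def derive_objectives(G, k, iterations=20):
    """Cluster the columns of interaction matrix G and average them per cluster.

    G is a list of m rows (solutions), each holding n outcomes (tests).
    Returns (clusters, G_prime): clusters is a list of lists of test indices,
    G_prime is the derived interaction matrix with one column per cluster.
    """
    m, n = len(G), len(G[0])
    columns = [[G[i][t] for i in range(m)] for t in range(n)]

    # Assumed initialization: the first k distinct columns act as centroids.
    centroids = []
    for col in columns:
        if col not in centroids:
            centroids.append(list(col))
        if len(centroids) == k:
            break

    def nearest(col):
        dists = [sum((a - b) ** 2 for a, b in zip(col, c)) for c in centroids]
        return dists.index(min(dists))

    for _ in range(iterations):
        assignment = [nearest(col) for col in columns]
        for j in range(len(centroids)):
            members = [columns[t] for t in range(n) if assignment[t] == j]
            if members:
                centroids[j] = [sum(v) / len(members) for v in zip(*members)]

    assignment = [nearest(col) for col in columns]
    clusters = [[t for t in range(n) if assignment[t] == j]
                for j in range(len(centroids))]
    clusters = [c for c in clusters if c]  # drop empty clusters, if any

    # Average the outcomes within each cluster to obtain the derived objectives.
    G_prime = [[sum(row[t] for t in c) / len(c) for c in clusters] for row in G]
    return clusters, G_prime


# Hypothetical interaction matrix: 4 solutions (rows) x 4 tests (columns).
G = [
    [1, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
]
clusters, G_prime = derive_objectives(G, k=2)
```

In this example the first and third tests have identical columns, as do the second and fourth, so each pair ends up in one cluster and the rows of `G_prime` can be passed to any multi-objective selection scheme.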

Let us illustrate disco with the following example.

Consider the matrix of interactions between the population of candidate solutions $S = \{a, b, c, d\}$ and the population of tests $T = \{t_1, t_2, t_3, t_4\}$, shown in Figure 2a. The four-dimensional space of interaction outcomes is shown in two two-dimensional scatterplots (Figs. 2b and 2c), each spanning a pair of tests. The performance of candidate solutions on the tests $t_1$ and $t_3$ is quite correlated, and so is performance on $t_2$ and $t_4$. Assume the clustering algorithm (step 2 of disco) notices these similarities and produces clusters that partition *T* into $\{T_1, T_2\}$, with $T_1 = \{t_1, t_3\}$ and $T_2 = \{t_2, t_4\}$. Averaging the interaction outcomes within the $T_j$s (step 3 of disco) results in the derived interaction matrix $G'$ shown in Figure 2d. Figure 2e presents the locations of the candidate solutions in the space of derived objectives.

If interaction outcomes are to be maximized, the only dominance holding in the original space is $b \succ a$. In the space of derived objectives (Figure 2e), *b* still dominates *a*. However, now also *c* dominates *d*, though originally these two solutions were mutually nondominated (incomparable). As a result of “compressing” the original interaction matrix, some information about the dominance structure has been lost.

In the particular case of *c* and *d*, introducing dominance in favor of *c* may be desirable, as *c* outperforms *d* on two of the original tests (*t*_{3}, *t*_{4}), while only one test (*t*_{1}) supports the opposite relation (and *t*_{2} is neutral in this respect). disco trades the lower number of resulting objectives for a certain inconsistency with the original interaction outcomes. Nevertheless, we posit that this imprecision may be a price worth paying for obtaining a potentially useful search gradient. In Section 4.2, we present a detailed analysis of dominance preservation.

disco broadens the bottleneck of evaluation in characterizing the candidate solutions with *k* objectives rather than with a single one (cf. the motivations in Section 3). On the other hand, keeping *k* small ensures that the dominance relation in the derived space is dense enough to provide a reasonably strong search gradient (as opposed to the dominance on all tests, discussed next to Eq. 3 back in Section 3). Candidate solutions that feature different “skills” (embodied by particular derived objectives) can coexist in a population even if some of them are clearly better than others in terms of scalar evaluation. For instance, solution *c* in the above example is not dominated in $G'$, although its scalar fitness is lower than that of the fittest solution *b*.

Because disco drives selection using multiple derived objectives, it might seem appropriate to pair it not with the solution concept of MEU, but with that of Pareto optimality (Section 2). We claim, however, that this convergence is only apparent. Notice first that the derived objectives are transient, derived in every generation independently, from a usually different interaction matrix. Therefore, one cannot claim that disco optimizes for any specific objectives throughout an entire run. Moreover, the solution concept of Pareto optimality comes in handy when one aims at finding a *set* of candidate solutions that approximate the nondominated ones as closely as possible and exploit the trade-off between the objectives. This is not the case in the class of problems we consider here (and in the experimental part): our goal is to find *one* solution that maximizes the odds of passing *any test*, for which MEU is the most appropriate solution concept.

### 4.1 Properties of disco

The evaluation conducted by disco is contextual: as in all coevolutionary algorithms, the outcome of the evaluation of a candidate solution in *S* depends on the current tests in *T*. However, that outcome depends also on the other candidate solutions in *S*, because they together determine the result of clustering. This interaction between candidate solutions is not common among the two-population coevolutionary algorithms.

Because clustering partitions the population of tests *T*, rather than, for instance, selecting some of the tests, none of the tests are discarded. Also, clustering guarantees that the tests that are mutually redundant (i.e., identical columns in *G*) will support the same derived objective. In general, the more tests are similar in terms of the solutions’ performance on them, the more likely they will end up in the same cluster and contribute to the same derived objective.

For $k = 1$, disco degenerates to a single-objective approach: all tests form one cluster, and $G'$ has a single column that contains the solutions’ (normalized) estimates of expected utility defined in (1). Setting $k = n$ implies $G' = G$, and every derived objective is associated with a single test.

The derived objectives, weighted by cluster size, add up to the scalar fitness of Eq. 2 (up to normalization by $|T|$), that is, $\sum_{j=1}^{k} |T_j|\, g'_j(s) = \sum_{t \in T} g(s,t)$, and each of them is a linear combination of selected columns in *G*. This feature is essential for dominance preservation, as we demonstrate in the next subsection.

### 4.2 Preservation of Dominance

As we demonstrated in Example 2, the objectives derived by disco are lossy, that is, the dominance relation in the space of derived objectives is in general different from the dominance relation in the original space *G*. In this section, we investigate the nature of those differences.

Consider a pair of candidate solutions $s_1$, $s_2$ with different interaction outcomes, that is, $\exists t \in T: g(s_1,t) \ne g(s_2,t)$. Table 1 lists the combinations of the possible relations between $s_1$ and $s_2$ in the original space of interaction outcomes ($G$) and in the space of the derived objectives ($G'$). The cases in which the relation is consistent in both spaces are marked by “Preserved.” In the remaining combinations, three types of distortions may take place, two of which can be likened to errors in statistical tests:

Table 1: Relations between $s_1$ and $s_2$ in $G$ (rows) and in $G'$ (columns).

| | $s_1 \succ s_2$ in $G'$ | incomparable in $G'$ | $s_2 \succ s_1$ in $G'$ |
|---|---|---|---|
| $s_1 \succ s_2$ in $G$ | Preserved | FN | Inversion |
| incomparable in $G$ | FP | Preserved | FP |
| $s_2 \succ s_1$ in $G$ | Inversion | FN | Preserved |

- False positive distortions (FP), when one of the solutions dominates the other in $G'$, although they were incomparable in $G$,
- False negative distortions (FN), when a dominance in $G$ ceases to hold in $G'$, and
- Inversions of dominance.

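Let us start with inversions, and assume that $s_1 \succ s_2$ in $G$. For dominance inversion to happen, at least one derived objective $g'_j$ must exist such that $g'_j(s_1) < g'_j(s_2)$, that is (cf. Eq. 4),

$$\frac{1}{|T_j|} \sum_{t \in T_j} g(s_1, t) < \frac{1}{|T_j|} \sum_{t \in T_j} g(s_2, t),$$

which is equivalent to

$$\sum_{t \in T_j} \big(g(s_1, t) - g(s_2, t)\big) < 0.$$

For this to hold, at least one of the summed terms has to be negative. This is impossible, however, as $s_1 \succ s_2$ in $G$, and thus there is no test $t$ such that $g(s_1, t) < g(s_2, t)$. Therefore, disco will never cause dominance inversion, and is in this sense not deceptive.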

A false negative error is excluded by the same token. For $s_1$ and $s_2$ to be incomparable in $G'$ while $s_1 \succ s_2$ in $G$, at least one derived objective has to reverse the ordering of the solutions in question (i.e., $g'_j(s_1) < g'_j(s_2)$), which, as we have shown above, is impossible.

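Assume now that $s_1$ and $s_2$ are incomparable in $G$, so that there is at least one test $t$ such that $g(s_1, t) < g(s_2, t)$ and at least one test $t'$ such that $g(s_1, t') > g(s_2, t')$. Let us assume that these are the only tests in $T$ and that a single derived objective is built from them, that is,

$$g'_1(s) = \tfrac{1}{2}\big(g(s, t) + g(s, t')\big).$$

The presence and direction of dominance between $s_1$ and $s_2$ in the derived space will depend on the sign of $g'_1(s_1) - g'_1(s_2) = \tfrac{1}{2}\big[(g(s_1, t) - g(s_2, t)) + (g(s_1, t') - g(s_2, t'))\big]$. Clearly, depending on how much $s_1$ is worse than $s_2$ on $t$ and better than $s_2$ on $t'$, the sign of that expression can be arbitrary. This holds also if there are other tests in $T$.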

Therefore, disco can commit false positive errors, that is, posit a dominance in $G'$ for a pair of solutions when it is factually absent in *G*. This distortion is unavoidable, given the reduced dimensionality of $G'$. On the face of it, this may be considered undesirable. However, note that *G* does not fully characterize the candidate solutions in the first place, because *T* is only a sample from a universe of tests (cf. Section 2). Attempting to perfectly preserve the dominance in $G'$ may not be worth the effort, given that *G* captures only partial characteristics of the solutions in *S*. By the same token, we do not find it critical that the clustering algorithms employed by disco are heuristic, and thus may produce suboptimal groupings of tests.
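The analysis above can be checked empirically with a short sketch. The helper names, the fixed partition into two clusters, and the restriction to binary outcomes over four tests are all assumptions of this illustration; the check enumerates every pair of distinct outcome vectors and records which kinds of distortion occur.

```python
from itertools import product


def dominates(a, b):
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def relation(a, b):
    """'>' if a dominates b, '<' if b dominates a, '||' if neither dominates."""
    if dominates(a, b):
        return ">"
    if dominates(b, a):
        return "<"
    return "||"


def average_clusters(row, clusters):
    """Derived objectives of one solution: mean outcome within each cluster."""
    return [sum(row[t] for t in c) / len(c) for c in clusters]


def classify(row1, row2, clusters):
    original = relation(row1, row2)
    derived = relation(average_clusters(row1, clusters),
                       average_clusters(row2, clusters))
    if original == derived:
        return "Preserved"
    if original == "||":
        return "FP"
    if derived == "||":
        return "FN"
    return "Inversion"


clusters = [[0, 2], [1, 3]]  # hypothetical partition of four tests
rows = list(product([0, 1], repeat=4))
observed = {classify(r1, r2, clusters) for r1 in rows for r2 in rows if r1 != r2}

# A hypothetical incomparable pair that becomes comparable after averaging.
example = classify([0, 0, 1, 1], [1, 0, 0, 0], clusters)
```

For this partition, only preserved relations and false positives appear, which agrees with the argument that averaging within clusters rules out false negatives and inversions.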

## 5 Related Work

The hypothesis that test-based problems may feature certain internal structures originated in the DEMO lab of Jordan Pollack and emerged in the early works by Bucci and de Jong (Bucci et al., 2004; de Jong and Bucci, 2006; Jaśkowski and Krawiec, 2011), who proposed formal methods for framing such structures as coordinate systems (CSs). A CS arranges the candidate solutions with respect to axes, each of them being an ordering of tests. If the outcomes of the interactions between all candidate solutions and all tests are known for a given test-based problem, a CS can be constructed that exactly reproduces the original dominance relation $\succ$, that is, arranges the candidate solutions so that their spatial relationships in the CS are consistent with $\succ$. Interestingly, for every test-based problem there exists a CS of minimal dimensionality, and that dimensionality may reflect the problem difficulty.

Such exact CSs suffer, however, from several problems. First, by exactly reproducing $\succ$, a CS does not provide any additional information that could help drive the search process. In particular, if $\succ$ is sparse, that is, few solutions dominate other solutions, there are no grounds for preferring some solutions to the others. disco, in contrast, may introduce “false positive” dominances as shown in Section 4.2, and is thus more likely to guide the search. The second challenge is the exponential complexity of the algorithms that construct an exact CS from an interaction matrix (Jaśkowski, 2011; Jaśkowski and Krawiec, 2011), which prevents them from being used online within an evolutionary process. Also, an interaction matrix built from current populations that are only transient samples of $\mathcal{S}$ and $\mathcal{T}$ in an evolutionary run is by definition incomplete, so applying a costly exact algorithm seems wasteful. For this reason, CSs are interesting tools for studying the internal structure of test-based problems, but not necessarily useful “search drivers” and, except for de Jong and Bucci (2006), no past work has reported using a CS to support a search process.

In implicit fitness sharing (IFS), the fitness of a candidate solution *s* in the context of a population *S* is defined as:

$$f_{\mathrm{IFS}}(s) = \sum_{t \in T:\, g(s,t) = 1} \frac{1}{|S(t)|},$$

where $S(t)$ is the subset of candidate solutions in *S* that solve test *t*, that is, $S(t) = \{s' \in S : g(s', t) = 1\}$. Therefore, candidate solutions *share* the rewards for solving particular tests, each of which can vary from $1/|S|$ to 1 inclusive. Higher rewards are provided for solving the tests that are rarely solved by population members (small $|S(t)|$). The rewards change as *S* evolves, which can help with escaping local minima.
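A minimal sketch of reward sharing for a binary interaction matrix is given below. The function name and the example matrix are hypothetical; each solved test contributes the reciprocal of the number of population members that solve it.

```python
def ifs_fitness(G):
    """Shared-reward fitness for each row of a binary interaction matrix G."""
    n = len(G[0])
    # Number of population members that solve each test.
    solved_by = [sum(row[t] for row in G) for t in range(n)]
    # A test solved by a row is solved by at least one member, so no division by zero.
    return [sum(1 / solved_by[t] for t in range(n) if row[t] == 1) for row in G]


# Hypothetical population of three solutions evaluated on three tests.
G = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
]
scores = ifs_fitness(G)
```

The first test is solved by everyone and is worth only a third of a point, while the second and third tests are each solved by a single solution and are worth a full point, so the first and third solutions score higher than the second.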

More sophisticated methods that reward solutions for having rare characteristics have been proposed, some of them also for GP. Lasarczyk et al. (2004) designed a method that maintains a weighted graph that spans tests, where an edge weight reflects the historical frequency of a pair of tests being solved simultaneously. The graph allows then to select a subset of tests that are used to evaluate the candidate solutions. In a similar spirit, the co-solvability proposed by Krawiec and Lichocki (2010) identifies skills with individuals’ ability to pass *pairs* of fitness cases, and as such can be considered a “second-order” IFS.

In a sense, these methods reward the solutions for being “original” in their capabilities of passing particular tests. A more recent contribution with similar motivations is novelty search (Lehman and Stanley, 2011; Mouret and Doncieux, 2012), where a measure of behavioral similarity substitutes for the traditional objective function. The measure is used to reward the individuals who differ significantly from the other individuals in the population and from selected past individuals stored in an archive. Novelty promotes the emergence and coexistence of different skills in a population, and can be seen as an analogue to *curiosity* and *intrinsic motivation* in reinforcement learning and developmental robotics (Kaplan and Hafner, 2006; Oudeyer et al., 2007). In practice, it must be combined with a complexifying algorithm like neat to ensure a principled order in which behaviors are discovered (Lehman and Stanley, 2011), usually discovering the simple behaviors before the more complex ones. In contrast, the approach proposed in this article constructs search objectives that are *related* to the original objective and so promotes the emergence of behaviors that pertain to a search goal.

disco transforms a single-objective test-based problem into a multi-objective one. On the other hand, if we consider every test as an “elementary” objective, disco can be seen as a method that transforms a *many-objective* problem into a multi-objective one. The recent interest in such transformation techniques (Brockhoff and Zitzler, 2006; López Jaimes et al., 2008; Singh et al., 2011) is justified by the inferior performance of some multi-objective evolutionary algorithms (MOEAs) when more than three objectives are involved (Khare et al., 2003; Knowles and Corne, 2007). Such *objective reduction* approaches assume the existence of redundant objectives in a given *M*-objective problem and aim to identify the smallest set of *m* (*m* ≤ *M*) conflicting objectives that generates the same Pareto-optimal front as the original problem, by preserving either the dominance relation or the correlation structure of the original problem. In the former case, the redundant objectives are removed (Brockhoff and Zitzler, 2006); in the latter, the eliminated objectives are those that are nonconflicting along the significant eigenvectors of the matrix of correlations between the original objectives (Saxena et al., 2013).

In contrast to disco, which derives the objectives online in every generation of an evolutionary run, some of the above methods work *offline*: a series of ordinary MOEA runs is carried out, each of them followed by the reduction of objectives. This process completes when the objective set does not change in two successive iterations. The applicability of the above methods is limited to problems in which the objectives, each preferably multivalued, form an inherent part of the problem statement. disco, on the other hand, does not require any predefined objectives, but derives them from an interaction matrix (cf. Section 2), where each column conveys little information on its own (binary interaction outcomes, in the simplest case). The class of test-based problems is thus substantially different from the class of many-objective problems, and disco is designed to handle the former rather than the latter.

## 6 Experimental Analysis

The following computational experiment is intended to quantify the impact of derived objectives on the efficiency of coevolutionary learning in test-based problems representing various domains. Our secondary objective is to inspect the resulting objectives and cast some light on the dynamics of the search process driven by them. The code used to conduct the experimental analysis is available online.^{3}

### 6.1 Basic Coevolutionary Configurations

All configurations described in the following implement the generational two-population coevolutionary algorithm, with separate populations of candidate solutions *S* and tests *T*. In every generation, each candidate solution *s* ∈ *S* interacts with every test *t* ∈ *T*, producing an interaction matrix *G* (cf. Section 2). The configurations differ only in the way the fitness values of the individuals in *S* are calculated from *G*.

In conventional coevolutionary learning (cel), our reference method, the fitness of a candidate solution *s* is the estimate of its expected utility (Eq. 2), calculated from the interactions with all tests in *T*—that is, the average of the corresponding row of *G*. This fitness is then used by a conventional scalar selection operator (tournament of size 5 or or , depending on the domain; Section 6.4).

disco (Section 4) clusters the columns of the interaction matrix *G* and produces objectives *g*_{j} that characterize the candidate solutions in *S*. To avoid fixing *k* in advance, we decided to employ x-means (Pelleg and Moore, 2000), an extension of the popular *k*-means algorithm. Given the admissible range of *k*, x-means picks the *k* that produces the clustering that maximizes the Bayesian information criterion. In this experiment, we allow x-means to consider and employ the Euclidean metric to measure the distances between the observations (the columns of *G*). For a detailed description of this method, the reader is referred to Pelleg and Moore (2000).
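To make the column clustering concrete, the sketch below groups the columns of an interaction matrix and averages each row within the resulting groups. It is an illustration only: plain *k*-means with a fixed *k* is substituted for x-means (so no Bayesian information criterion is involved), and the names `kmeans_columns` and `derived_objectives` are hypothetical.

```python
import random

def kmeans_columns(G, k, iters=20, seed=0):
    """Cluster the columns of G (one point per test) with plain k-means.

    Assumption: k is fixed here, whereas x-means would select it automatically.
    """
    rng = random.Random(seed)
    cols = [list(c) for c in zip(*G)]            # each test becomes a point
    centroids = [list(c) for c in rng.sample(cols, k)]
    labels = [0] * len(cols)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(col, centroids[c])))
            for col in cols
        ]
        # Update step: mean of the member columns (unchanged if a cluster is empty).
        for c in range(k):
            members = [col for col, lab in zip(cols, labels) if lab == c]
            if members:
                centroids[c] = [sum(v) / len(members) for v in zip(*members)]
    return labels

def derived_objectives(G, labels, k):
    """Average the outcomes of each candidate solution within every cluster of tests."""
    result = []
    for row in G:
        objs = []
        for c in range(k):
            vals = [v for v, lab in zip(row, labels) if lab == c]
            objs.append(sum(vals) / len(vals) if vals else 0.0)
        result.append(objs)
    return result
```

Each row of the returned matrix holds the *k* objective values of one candidate solution; with *k* = 1 the single column reduces to the row average of *G*.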

The resulting *k* objectives preclude selection methods that assume scalar fitness values. Therefore, disco employs nsga-ii, arguably the most popular multi-objective selection algorithm (Deb et al., 2002). nsga-ii first merges the current candidate solutions *S* and their children *C* into one set. The values of the *k* objectives *g*_{j} for *S* directly follow from *G′*, the result of disco’s clustering (Eq. 4). The objective values for the children in *C* are obtained by calculating the interaction matrix *G*_{C} between *C* and the tests in *T*, grouping the columns of *G*_{C} in the same way as in *G′*, and averaging the interaction outcomes within the partitions analogously to Equation 4.

Next, nsga-ii builds the Pareto ranking of the elements of the merged set based on the dominance relation (cf. Eq. 3) in the space of derived objectives *g*_{j}. The layers of the ranking that host the worse half of individuals are discarded. Then, the parents are selected from the remaining solutions via a tournament selection on Pareto ranks (tournament size 5 in our case). The ties on ranks are resolved using a *crowding* measure, which favors objective values that are distant from those of the other candidate solutions. A detailed description of nsga-ii can be found in Deb et al. (2002).
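The Pareto ranking step can be illustrated with the simple layer-peeling sketch below. It assumes that all objectives are maximized and omits the crowding measure and the tournament; it is not the fast nondominated sorting procedure of Deb et al. (2002), and the function names are hypothetical.

```python
def dominates(a, b):
    """a dominates b if it is no worse on every objective and better on at least one
    (maximization assumed)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_ranks(points):
    """Return the Pareto rank of every objective vector (0 = nondominated layer)."""
    remaining = set(range(len(points)))
    ranks = [None] * len(points)
    rank = 0
    while remaining:
        # The current layer holds the points not dominated by any remaining point.
        layer = {
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        }
        for i in layer:
            ranks[i] = rank
        remaining -= layer
        rank += 1
    return ranks
```

Discarding the worse half of the merged set then amounts to keeping the individuals with the lowest ranks.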

In all configurations, the tests in *T* are evaluated using *distinctions* (Ficici and Pollack, 2001). The fitness of a test is the number of pairs of candidate solutions in *S* that it differentiates. This measure promotes the tests that *inform* the candidate solutions about the differences between them, rather than about how they *perform*. Distinctions have proved effective at maintaining the *coevolutionary gradient* in many nontrivial test-based problems (de Jong and Pollack, 2004; Stanley and Miikkulainen, 2004; Bongard and Lipson, 2005).
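A minimal sketch of distinction-based test fitness is given below. It assumes that a test differentiates a pair of candidate solutions whenever their interaction outcomes on that test differ, and it counts unordered pairs; the function name is hypothetical.

```python
from itertools import combinations

def distinctions(G):
    """Return, for every test (column of G), the number of pairs of candidate
    solutions (rows of G) whose outcomes on that test differ."""
    n_tests = len(G[0])
    return [
        sum(1 for a, b in combinations(G, 2) if a[j] != b[j])
        for j in range(n_tests)
    ]
```

A test that all candidate solutions pass (or all fail) makes no distinctions and thus receives a fitness of zero.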

### 6.2 Additional Control Configurations

Comparing disco to cel cannot fully explain the anticipated differences, because *two* aspects differentiate these methods. First, nsga-ii as employed by disco is more sophisticated than the scalar selection used in cel: it not only operates on a combined pool of parents and offspring (and is thus quite elitist), but essentially performs a two-stage selection by first rejecting the lower ranks of the Pareto ranking and then employing tournaments on ranks and crowding to draw the parents (Deb et al., 2002). Second, the sole fact of having *any* multiple objectives causes selection to operate differently, because the likelihood of dominance between solutions decreases with the number of objectives, and many inconclusive comparisons result in weaker selection pressure.

To isolate these aspects, we design two additional control setups. To control for nsga-ii selection, we design the 1-means setup, which is simply disco with the number of clusters fixed at *k* = 1. Technically, there is no need to run clustering in this setup, as all tests by definition belong to the same cluster. In this configuration, the selection of candidate solutions is driven by a single derived objective *g*_{1} that, by averaging the outcomes on all tests in *T*, is equivalent to the expected utility (Eq. 1; cf. Section 4.1), that is, to the fitness used by cel. However, contrary to cel, which relies on scalar selection, 1-means employs nsga-ii, and thus involves ranking.

To control for the presence of multiple objectives, we provide the rand configuration, which replaces disco’s clustering of the interaction matrix *G* with the following steps. First, rand draws *k* from at random. Then it randomly partitions the columns of *G* into *k* derived objectives, which are then treated in the same way as in disco. Thus, contrary to 1-means, rand relies on multiple objectives (unless the drawn *k* happens to be
1), but those objectives do not reflect any similarities between the columns of *G*. This configuration should help us verify whether discovering *meaningful* clusters is essential.^{4}
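The random partitioning used by rand can be sketched as follows. The admissible range of *k* is expressed here through a hypothetical parameter `k_max`, and the sketch additionally assumes that every group must receive at least one test, which requires `k_max` not to exceed the number of tests.

```python
import random

def random_partition(n_tests, k_max, rng=random):
    """Draw k uniformly from 1..k_max and assign the tests to k nonempty groups."""
    k = rng.randint(1, k_max)
    # Seed every group with one test, then place the remaining tests at random.
    labels = list(range(k)) + [rng.randrange(k) for _ in range(n_tests - k)]
    rng.shuffle(labels)
    return k, labels
```

The resulting labels can be treated exactly like cluster assignments, so the derived objectives are obtained by averaging the interaction outcomes within each random group.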

### 6.3 Extensions of disco

The basic disco algorithm clusters the interaction outcomes using the Euclidean distance, which allows handling continuous as well as multivalued interaction outcomes (e.g., the ipd benchmark presented below features ternary interaction outcomes). However, using the Euclidean distance may render it difficult to capture the *combinations* of skills exhibited by particular solutions.

Consider a population hosting two candidate solutions . In such a case, clustering takes place in a two-dimensional space with dimensions corresponding to *s*_{1} and *s*_{2}. Assume that clustering leads to two clusters with centroids in and . Now consider a test *t* with the interaction outcomes —that is, and . The Euclidean distance between *t* and *m*_{1} amounts to ( iterates over dimensions; we omit the square root for simplicity), while . Therefore, *t* will be assigned to the second cluster. However, that cluster groups tests that are more often solved by *s*_{1} than by *s*_{2} (). One may argue that the first cluster is more suitable for *t*, as it hosts the tests with complementary characteristics ().

To address this problem, we consider two extensions of disco. In disco-bin, we binarize the centroids—that is, the distance is defined as , where denotes rounding to the closest integer. For the example above, and , so *t* would be assigned to the first cluster.^{5}

Binarization thus replaces a centroid coordinate in the *i*th dimension by 0 if less than 50 percent of the tests in the cluster are solved by *s*_{i}, and by 1 otherwise. This reasoning ignores the fact that a strong solution will solve *most* (and not half) of the tests in *T* (and, conversely, a weak solution will solve hardly any tests). Setting the threshold to 50 percent thus seems rather arbitrary; a more adequate value can be estimated from the performance of *s*_{i} on all tests. Therefore, in the last configuration, disco-avg, the distances are calculated according to the formula , where

### 6.4 Test Problems

The suite of benchmarks consists of iterated prisoner’s dilemma (ipd) (Axelrod and Hamilton, 1981), numbers games (NGs) (Watson and Pollack, 2001), and the density classification task (dct) (Das et al., 1995), elegant and well-defined problems that have been widely used to analyze evolutionary learning algorithms. In particular, they are excellent testbeds for coevolutionary algorithms, due to their test-based nature and large (for NGs, infinite) number of tests. The popularity of these problems as benchmarks for competitive coevolution stems primarily from their nontrivial definition and high difficulty, manifested in the failure to obtain quality solutions with generic metaheuristics. Another particularly appealing feature of these benchmarks is their kinship to complex real-world scenarios. For instance, ipd is widely used to model systems in biology (Nowak and Sigmund, 2004), psychology (Roy, 2000), and economics (Hemesath, 1994). Recently, it has gained even more popularity as a demanding task for competitive environments (Miller, 1996; Darwen and Yao, 2002; Chong and Yao, 2005; Chong et al., 2012). The cellular automata used in dct have applications in many fields, including CPUs, cryptography (Wolfram, 2002), and real-world biological and chemical systems (Wolfram, 1986; Kier et al., 2005). dct is also very difficult, as witnessed by the relatively slow improvement of its best known solutions over time (Juillé and Pollack, 1998; Pollack, 1998; Ficici and Pollack, 2001; Wolz and De Oliveira, 2008). Finally, NGs were designed to objectively measure the performance of coevolutionary algorithms and to determine whether they are vulnerable to coevolutionary pathologies.

In this study we refrain from testing disco on real-world problems, as this would render the interaction function much more expensive to evaluate. In fact, the experiments were computationally demanding even for the benchmarks considered here. Besides, true real-world applications would require that the interactions take place in a physical, controlled environment like those typical for robotics and artificial life—for instance, the predator-and-prey domain (Nolfi and Floreano, 1998).

As ipd and NGs are symmetric games, we employ the same strategy representation for candidate solutions and tests. In the asymmetric dct, different representations are necessary. We describe the details below.

#### 6.4.1 Iterated Prisoner’s Dilemma

ipd is an abstract two-player game involving a series of interactions, each of which is a prisoner’s dilemma (PD) game. ipd is primarily used to study cooperation in social, economic, and biological interactions. It is considered nontrivial in being a non-zero-sum game and in its iterative character, making it an attractive playground for competitive environments (Chong et al., 2012).

In a PD, a player can make one of two choices: cooperate or defect. If both players cooperate, they receive a payoff *R*, whereas if they both defect they get a smaller payoff *P*. Defecting against a cooperator gives a payoff *M* that is higher than *R*, and the cooperator in such a case receives the lowest possible payoff *m*. In sum, the PD payoff matrix must satisfy two conditions: *M* > *R* > *P* > *m* and 2*R* > *M* + *m* (Poundstone, 1992).

Following other studies (Frean, 1996; Darwen and Yao, 2001; Chong and Yao, 2005; Harrald and Fogel, 1996), we consider ipd in a version extended to multiple choices (moves or *levels of cooperation*) and employ the *memory-one* form of ipd, in which players remember their moves from the previous PD iteration only and represent strategies as look-up tables (Axelrod, 1997). A *c*-choice ipd strategy is a *c* × *c* matrix *M*, where *m*_{ij} specifies a player’s move to be made, given his previous move *i* and the opponent’s previous move *j*. The other element of a player’s strategy is the initial move *m*_{00}, so in total a strategy is represented by *c*² + 1 numbers, each denoting one of the *c* choices.

An interaction between a candidate solution *s* and a test *t* involves a series of PD *episodes*. In each PD episode, *s* makes move *i*, *t* makes move *j*, and each of them receives the payoff determined by this pair of moves. The outcome of an ipd game is determined by comparing the total payoffs *p*_{s} and *p*_{t} accumulated over PD episodes, which yields one of three possible results for *s* (cf. the ternary interaction outcomes mentioned in Section 6.3).

We use ipd with choices, which we found to be much more demanding than the three-choice ipd used in Chong et al. (2012). Each strategy is a look-up table of moves, and the size of the search space is . Each ipd game consists of PD episodes.
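A single ipd interaction between two memory-one strategies can be sketched as below. The sketch makes several assumptions that are not taken from the text: a strategy is encoded as a pair (initial move, look-up table), the payoff function is supplied by the caller, and the ternary outcome is encoded as 1, 0.5, or 0 from the perspective of *s*. All names are hypothetical.

```python
def play_ipd(s, t, payoff, episodes):
    """Play an ipd game between memory-one strategies s and t.

    Each strategy is (initial_move, table), where table[i][j] is the next move
    given the player's own previous move i and the opponent's previous move j.
    payoff(a, b) is the payoff for playing a against b (caller-supplied assumption).
    """
    move_s, move_t = s[0], t[0]
    total_s = total_t = 0.0
    for _ in range(episodes):
        total_s += payoff(move_s, move_t)
        total_t += payoff(move_t, move_s)
        # Both players update simultaneously from the previous pair of moves.
        move_s, move_t = s[1][move_s][move_t], t[1][move_t][move_s]
    # Ternary outcome for s; the numeric encoding is an assumption.
    if total_s > total_t:
        return 1.0
    if total_s < total_t:
        return 0.0
    return 0.5
```

For example, with two choices (0 for defection, 1 for cooperation) and any payoff function satisfying the PD conditions, an unconditional defector wins against an unconditional cooperator, while two identical strategies draw.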

For ipd, all configurations maintain populations of candidate solutions and tests. However, because nsga-ii effectively merges the parents and the offspring prior to selection, we set the size of the candidate solution population to for cel. This provides for fair comparison: every method engages ipd games per generation. With runs lasting for generations, the total effort per run amounts to games.

Both populations are initialized with uniformly randomized look-up tables. For selection, a tournament of size 5 is used. The only source of genetic variation is a mutation that iterates over all elements of a look-up table and with probability .2 replaces the original choice with one of the remaining choices selected at random. This operator has been found to provide sufficient variation for multiple-choice ipd (Chong and Yao, 2005).

#### 6.4.2 Numbers Games

compare-on-all (coa) and compare-on-one (coo) are variants of the NG (Watson and Pollack, 2001) proposed in de Jong and Pollack (2004). Candidate solutions and tests are points in an *l*-dimensional space represented as real-number vectors of length *l*.

Straightforward formulation notwithstanding, both problems are well-known coevolutionary benchmarks (de Jong, 2005, 2007; Service and Tauritz, 2008; Bucci et al., 2004) and enable the objective and precise measurement of search progress. Furthermore, having explicit underlying objectives (corresponding to the *l* dimensions of strategy vectors) makes them a particularly suitable testbed for disco. coo is more challenging, in being designed to induce *overspecialization* (Watson and Pollack, 2001), a coevolutionary pathology in which candidate solutions and tests focus on some underlying objectives (or even a single one) while ignoring the remaining ones. To make progress on this problem, a coevolutionary algorithm has to maintain the tests that support all underlying objectives from the very beginning of the run.

We consider -dimensional variants of coa and coo. Following de Jong and Pollack (2004), for both populations, the initial values in each dimension are uniformly sampled from . Offspring individuals are created from the parents by picking at random two dimensions from and adding a random value *x* chosen uniformly from . Both populations host 200 individuals each. For selection, we use the evolution strategy (Beyer and Schwefel, 2002), with and . Each evolutionary run lasts for 1,000 generations.

#### 6.4.3 Density Classification Task

In the density classification task (dct), the objective is to find a one-dimensional binary cellular automaton (CA) that performs majority voting. A CA is a discrete model studied, among others, in computability theory, mathematics, and theoretical biology (Kier et al., 2005; Wolfram, 1986, 2002). In dct, the candidate solutions are *rules* that govern the state transitions of CAs, while tests are bit vectors of length *l* that determine the *initial configurations* (IC) of the automata. The next state of the *i*th bit is determined by applying the rule to the window that comprises the previous bits at positions *i* − *r* through *i* + *r*. A rule is represented as a look-up table. A window of size 2*r* + 1 implies 2^{2*r*+1} possible combinations of bits in a window and the same number of entries in the look-up table. Therefore, the search space comprises 2^{2^{2*r*+1}} rules, and there are 2^{*l*} possible initial conditions.
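One synchronous update of such an automaton can be sketched as follows. The sketch assumes a circular lattice (periodic boundary conditions) and reads the window as a binary number with the leftmost cell as the most significant bit; both conventions are assumptions, and the function name is hypothetical.

```python
def ca_step(rule, config, r):
    """Apply a binary CA rule once to a configuration.

    rule is a look-up table with 2 ** (2 * r + 1) entries; config is a list of bits.
    Assumptions: circular lattice, leftmost window cell is the most significant bit.
    """
    l = len(config)
    nxt = []
    for i in range(l):
        index = 0
        for offset in range(-r, r + 1):
            index = (index << 1) | config[(i + offset) % l]
        nxt.append(rule[index])
    return nxt
```

An interaction between a rule and an initial configuration then consists of calling `ca_step` repeatedly and checking whether the final configuration is uniform.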

The objective is to construct a rule *s* that causes the CA to converge, within a prescribed number of iterations, to the state of all ones if the percentage (*density*) of ones in the IC *t* is greater than or equal to .5. Otherwise, the rule should cause the CA to converge to the state of all zeros. An interaction between *s* and *t* starts with the CA in the initial state determined by *t* and consists in iteratively applying *s* to all elements of the current configuration (cf. Fig 3c).

The initial population of CA rules contains look-up tables drawn at random. Because ICs drawn bit by bit are usually very difficult (the expected number of ones is close to *l*/2), they require a more sophisticated initialization. First, a number *d* is uniformly sampled from the interval [0, *l*]. Then, a vector of length *l* is filled with ones on the first *d* positions and zeros on the remaining positions. Finally, the vector is randomly shuffled. As a result, the number of ones per IC is uniformly distributed in the population of tests.
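The initialization of a single IC can be sketched as below, assuming that the number of ones is drawn uniformly from the integers 0 through *l* inclusive; the function name is hypothetical.

```python
import random

def random_ic(l, rng=random):
    """Generate an initial configuration whose number of ones is uniform over 0..l."""
    d = rng.randint(0, l)            # number of ones (inclusive bounds assumed)
    ic = [1] * d + [0] * (l - d)     # ones first, zeros after
    rng.shuffle(ic)                  # randomize the positions of the ones
    return ic
```

Drawing every bit independently would instead concentrate the densities around one half, which is exactly the situation the procedure above avoids.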

Rules and ICs are varied by a 2% and a 5% per-bit mutation rate, respectively. Both populations host 200 individuals each. For selection, the evolution strategy (Beyer and Schwefel, 2002) is used. An evolutionary run lasts for generations, resulting in the total effort of 8,000,000 interactions. We consider three dct instances of various difficulties: an easy one (dct-1, , , ), studied in Olsson (1998); a medium one (dct-2, , , ), used among others in Jaśkowski (2011); and a hard one (dct-3, , , ), investigated in Juillé and Pollack (1998).

### 6.5 Performance

The assessment of candidate solutions in *S*, whether single-objective in cel or multi-objective in disco, depends on the current state of the population of tests *T*, and is thus *subjective*. As a result, a candidate solution’s fitness may strongly differ from its true performance. The *objective* performance measure for all test problems considered here is the expected utility (Eq. 1). To estimate it, we let a candidate solution interact with 50,000 random tests, generated by the domain-specific procedures used for initializing the population of tests (see the previous three sections). In this way we assess the best-of-run individuals—that is, the candidate solution in the last population with the highest subjective fitness. This measurement does not affect the algorithms’ behavior.

Table 2 presents the expected utility of the best-of-run solutions for particular benchmarks and methods averaged over 60 coevolutionary runs, accompanied by 95% confidence intervals. To compare the methods on all benchmarks simultaneously, we employ the Friedman’s test for multiple achievements of multiple subjects (Kanji, 2006). Compared to analysis of variance, it does not require the distributions of the variables in question to be normal. Friedman’s test operates on average ranks, which for the considered methods are as follows:

| Benchmark | cel | 1-means | rand | disco | disco-bin | disco-avg |
|---|---|---|---|---|---|---|
| ipd | | | | | | |
| dct-1 | | | | | | |
| dct-2 | | | | | | |
| dct-3 | | | | | | |
| coo-3 | | | | | | |
| coo-4 | | | | | | |
| coo-5 | | | | | | |
| coa-3 | | | | | | |
| coa-4 | | | | | | |
| coa-5 | | | | | | |


| disco-avg | disco-bin | disco | cel | 1-means | rand |
|---|---|---|---|---|---|
| 1.9 | 2.2 | 2.3 | 4.7 | 4.9 | 5.0 |

The *p* value is , which indicates that at least one method performs significantly differently from the remaining ones. Bold font marks the methods that are outranked at the .05 significance level by *all* disco variants (the first three methods in the ranking) according to post-hoc analysis using a symmetry test (Hollander et al., 2013).

The derived objectives allow disco to outperform the standard coevolutionary search (cel) and the other two control setups (1-means, rand) not only on aggregated ranks, but also on every benchmark, regardless of problem difficulty. This result is often further improved by disco-bin and disco-avg. For the conceptually more challenging benchmarks—that is, ipd and dct—disco-avg fares consistently the best. For the abstract NGs, the ranking of disco variants is less predictable.

Interestingly, for some benchmarks we also observe a positive influence of decomposing the scalar fitness function by random clustering (rand). Nevertheless, the inferior overall performance of rand suggests that the objectives that result from random partitioning of the interaction matrix do not provide as efficient a search gradient. In NGs, such a blind extraction of objectives even misleads the search algorithm and ultimately leads to worse performance than simple coevolution (cel).

The other control setup, 1-means, aimed at isolating the impact of nsga-ii selection, achieves the worst performance in almost every benchmark. Thus, while nsga-ii selection is the crucial part of disco, using it in a single-objective setting does not translate into an improved search performance.

The obtained results support our claim that disco is capable of meaningful grouping of tests *and* exploiting the resulting derived objectives in a multi-objective setting. The superiority of all disco variants with respect to all three control setups (cel, 1-means, rand) corroborates our hypothesis that better performance can be achieved only by the simultaneous involvement of these two capabilities.

### 6.6 Number of Derived Objectives

In disco, the x-means clustering algorithm dynamically adjusts the number of clusters (and derived objectives) *k* to the actual interaction matrix. This number may convey certain information about problem characteristics. In Figure 4, we present histograms of *k* for every benchmark, gathered from generations of all runs of all three disco variants. Given that the disco configurations overall outperformed the control ones, the observed values of *k* should be considered as having positive impact on the method’s performance. The graphs reveal that for easier problems such as ipd and dct-1, a lower number of objectives is sufficient to effectively improve the search performance, while the harder ones typically benefit from greater values of *k*.

Figure 4 shows that x-means rarely sets *k* = 1; in fact, we observe this happening only for coa. Apparently, interaction outcomes can usually be captured better using more than one cluster (at least in terms of the Bayesian information criterion used by x-means). Also, as this is accompanied by high performance of disco, we may say that operating in a multidimensional objective space is in a sense favored by the approach. At the other extreme, the maximum admissible number of objectives is never derived, suggesting that this upper limit was a good choice for the problems studied here.

Finally, it may be interesting to note that both disco and disco-bin have the tendency to maintain a greater number of extracted objectives, while disco-avg has the opposite property. We should, however, admit that these considerations have to be taken with a grain of salt, as the particular *k*s observed in Figure 4 result not only from the coevolutionary dynamics, but also from the particular measure (Bayesian information criterion) used by x-means.

### 6.7 Correlation of Objectives

As we argued in Section 3, we anticipate that nontrivial problems would feature mutually exclusive underlying objectives (“skills”)—that is, such that it is difficult to simultaneously make progress on all of them. It is thus interesting to ask whether such “polarization” becomes reflected in the objectives derived by disco.

To quantify the dissimilarity between any two objectives *g*_{i} and *g*_{j} discovered by disco for a given population of candidate solutions *S* and a population of tests *T*, we employ the Pearson linear correlation coefficient calculated for the candidate solutions in *S*, that is, the correlation between the two corresponding columns in the compressed matrix *G′* of interactions between *S* and *T*. In Figure 5, we present the correlations calculated in this way, averaged over all pairs of objectives (columns of *G′*) in a given generation of an evolutionary process and across all evolutionary runs. The correlation of the objectives discovered by disco is usually much lower than the correlation for rand. Because rand partitions *T* randomly, each objective it defines is based on a random sample of *T*, and the averages calculated from such samples tend to be similar, and thus exhibit high correlation. disco, on the other hand, attempts to find a partitioning of *G* that minimizes the within-cluster variation, that is, the amount by which the columns of *G* within a cluster differ from each other. The objectives it discovers are thus likely to be significantly different from each other and to capture diversified aspects of solutions’ capabilities. This is particularly evident for dct and coa, where the correlation of objectives discovered by disco gradually decreases in the early stage of evolution and then stabilizes. Interestingly, for dct, the correlation even becomes negative, in which case an improvement on one objective causes a deterioration on the other. These observations corroborate our hypothesis regarding the mutually exclusive character of the derived objectives.
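The averaged pairwise correlation of derived objectives can be computed as in the sketch below. It assumes that the compressed matrix is given as a list of rows with at least two columns, and it returns 0 for a pair in which one objective is constant (where the coefficient is undefined); the function names are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson linear correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # coefficient undefined for a constant column; 0 is an assumption
    return cov / sqrt(vx * vy)

def mean_objective_correlation(G_prime):
    """Average the correlation over all pairs of columns (derived objectives)."""
    cols = list(zip(*G_prime))
    pairs = [(i, j) for i in range(len(cols)) for j in range(i + 1, len(cols))]
    return sum(pearson(cols[i], cols[j]) for i, j in pairs) / len(pairs)
```

A value close to −1 indicates that the candidate solutions trade one objective off against the other, while a value close to 1 indicates largely redundant objectives.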

### 6.8 Intra- and Inter-Cluster Variance

The above analysis concerns the relationships *between* the derived objectives, without actually considering how “well defined” they are internally. To investigate this aspect, we inspect the derived objectives using the tools characteristic for cluster analysis: the within-cluster variance and between-cluster variance of the clusters associated with the derived objectives. We define the within-cluster sum of squares (WSS) as $\mathrm{WSS} = \sum_{i=1}^{k} \sum_{t \in T_i} \lVert t - m_i \rVert^2$, where *T*_{i} is the *i*th cluster and *m*_{i} is its centroid (calculated using the arithmetic mean). Lower WSS implies greater similarity between the observations in clusters.

In Figure 6, we present WSS averaged across all evolutionary runs, plotted against the generation number. The clusters discovered by disco are typically more compact than those resulting from the random partitioning performed by rand. This is not surprising, since random grouping is condemned to a larger variance. What is more interesting, though, is the prevailing decreasing trend over the course of evolution, which suggests that the objectives gradually converge toward specific skills revealed by the candidate solutions. WSS also decreases for rand; however, Table 2 showed that this trend is not accompanied by good performance. This indicates that the objectives discovered by disco are indeed meaningful, that is, capable of creating a useful search gradient for the candidate solutions.

The decrease of WSS is not a rule, however: we observe just the opposite trend for dct, for instance. We hypothesize that this may be attributed to the candidate solutions initially exhibiting very similar behaviors (when, e.g., the tests in the first generations turn out to be too difficult for most candidate solutions in the population).

Analogously, we define the between-cluster sum of squares (BSS) as $\mathrm{BSS} = \sum_{i=1}^{k} n_i \lVert m_i - m \rVert^2$, where *n*_{i} is the size of the *i*th cluster and *m* is the global mean of the data. Figure 7 presents the BSS of the derived objectives averaged across evolutionary runs. The BSS for rand is close to zero most of the time. As for WSS, this was expected given the probabilistic nature of the partitioning performed by rand. For all variants of disco, we observe relative stabilization of BSS, typically preceded by a gradual decrease. However, for dct, we observe a rapid increase of BSS in the early stages of evolution. For dct-1, the easiest instance of the problem, this is followed by a slight drop. In both domains, pure disco achieves the highest BSS, though disco-avg and disco-bin are not far behind. Interestingly, in the case of the NGs the situation is quite different: BSS is initially high, suggesting that the objectives are very diverse. Over time it gradually decreases, causing the objectives to lose their distinct character. The decrease of BSS co-occurs with a slight rise of the correlation between objectives (cf. Fig. 5).
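Both quantities can be computed together, as in the sketch below. It uses the standard definitions (squared Euclidean distances of the observations to their cluster centroids for WSS, and size-weighted squared distances of the centroids to the global mean for BSS) and assumes that the observations are the columns of the interaction matrix supplied as points; the function names are hypothetical.

```python
def centroid(points):
    """Arithmetic mean of a nonempty list of equally long vectors."""
    return [sum(v) / len(points) for v in zip(*points)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def wss_bss(points, labels):
    """Return (WSS, BSS) for the clustering given by labels."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    global_mean = centroid(points)
    wss = bss = 0.0
    for members in clusters.values():
        m_i = centroid(members)
        wss += sum(sq_dist(p, m_i) for p in members)       # within-cluster spread
        bss += len(members) * sq_dist(m_i, global_mean)    # separation of centroids
    return wss, bss
```

For a fixed data set, the two values sum to the total sum of squares around the global mean, so a partitioning with lower WSS necessarily has higher BSS.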

### 6.9 Visualization of the Derived Objectives

Correlations and within- and between-cluster variance provide only cursory information about the derived objectives. A deeper insight can be provided by presenting them graphically. Figures 8 and 9 visualize the objectives derived by disco in the last generation of a selected single run for every benchmark. The following procedure was used to create the graphs. First, we scanned the final populations of evolutionary runs in search of compressed interaction matrices *G′* that had two columns. For each row in *G′*, a green point marks the performance of a candidate solution on the two derived objectives. The labels on the axes indicate the numbers of tests that contributed to the corresponding objective (i.e., the sizes of the corresponding clusters).

In Figure 8 we group the graphs for the runs that ended with highly decorrelated derived objectives. To identify such runs, we use the Pearson correlation coefficient of the performance of candidate solutions on the derived objectives. Due to the coevolutionary nature of disco, the final candidate solutions in *S* are adapted to the tests in the final *T*, so the green marks in Figure 8 reflect only certain combinations of performances on the derived objectives. To illustrate the characteristics of the derived objectives in a more unbiased way, we also plot the performance of random candidate solutions. To this end, 5,000 random candidate solutions are generated using the problem-specific procedures described in Section 6.4. The performance of each such solution on the derived objectives is measured by performing interactions with the tests that gave rise to the two objectives in *G′* and averaging the outcomes within the two clusters (see Eq. 4). The points obtained in this way are then plotted in red. Where the marks overlap, color saturation reflects their density.

Each panel in Figure 8 corresponds to a different pair of derived objectives, specific to a problem being solved, a run, and the states of both coevolving populations at the end of the run. The common feature of all graphs is that the evolved solutions stretch between the axes of objectives, each of them differently exploiting the trade-off between them. In doing so, they clearly attempt to approximate the Pareto front and adopt different shapes, from widely stretched to much more centralized. disco is clearly able to maintain diversity in a population until the very end of evolutionary runs, and so mitigates premature convergence.

The random solutions, on the other hand, typically perform well only on one objective each. Furthermore, they are usually dominated by the evolved candidate solutions, and only some of them come close to the Pareto fronts of the evolved candidate solutions. The number of random individuals who manage to achieve a nonzero performance on both derived objectives is relatively small.

The spatial arrangements of the solutions shown in Figure 8 are characteristic of runs that ended with decorrelated objectives, and as such meet our expectations about the behavior of the method. Nevertheless, occasionally disco derives uncorrelated or positively correlated objectives, which lead to the “anomalous” distributions presented in Figure 9. Another type of anomaly is when the numbers of tests that support particular objectives become highly imbalanced. In some cases that imbalance may strongly distort the distribution of solutions. This happens for ipd, where the objective plotted on the ordinate consists of only two tests, causing both the random and the evolved solutions to align in five horizontal layers, corresponding to the possible aggregated outcomes of interactions with the tests. In the case of coa-3 in Figure 9, we observe dense clusters of both candidate and random solutions near the axes, indicating overspecialization to one of the objectives. In such a case, neither the evolved nor the random solutions trade off the skills identified by the objectives particularly well.

Let us emphasize, however, that our distinction of “normal” (Fig. 8) and “anomalous” (Fig. 9) behaviors of disco is rather subjective and intended only to illustrate the spectrum of possible outcomes. In general, we did not observe significant correlation between the “esthetics” of solutions’ arrangements and the performance. Also, these graphs present the states of derived objectives in the final population of the run, when good solutions have usually already been found and it is difficult to make further progress.

## 7 Discussion

The experiment demonstrated that disco is able to identify meaningful derived objectives (Section 6.9) that are often internally cohesive (Section 6.8) and mutually nonredundant (Section 6.7). The method autonomously adjusts the number of derived objectives to the problem characteristics and the dynamics of evolutionary search (Section 6.6) and systematically improves the performance in comparison to the conventional coevolutionary algorithm driven by scalar evaluation (Section 6.5). The method maintains these features across diversified benchmarks of various difficulties.

The derived objectives can be arranged into *coordinate systems* that have natural graphical interpretations (Figs. 8 and 9). In this respect, they are similar to the coordinate systems of *underlying objectives* studied in the past works on coevolutionary algorithms (Bucci et al., 2004; de Jong and Bucci, 2006). The apparent similarity notwithstanding, the derived objectives cannot be expected to correspond to the underlying objectives, for several reasons. First, as we have shown in Section 4.2, disco can introduce additional (i.e., not backed up by an interaction matrix) dominance relationships between the candidate solutions; the coordinate systems of underlying objectives, on the contrary, are *exact* in perfectly preserving the dominance. Second, the clustering conducted by disco is heuristic and thus not guaranteed to optimally assign the tests to the derived objectives. Finally, disco allows the interaction outcomes to be arbitrarily valued, while the exact coordinate systems assume binary interaction outcomes.
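The first of these reasons can be made concrete with a small, hypothetical example: two candidate solutions that are mutually non-dominated on the original tests become ordered once the outcomes are averaged within clusters. The outcome vectors, the cluster assignment, and the helper names `dominates` and `cluster_means` are assumptions made for illustration.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Pareto dominance for maximized objectives: a is no worse everywhere
    and strictly better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def cluster_means(row: Sequence[float], labels: Sequence[int]) -> List[float]:
    """Average the outcomes in row within each cluster of tests."""
    k = max(labels) + 1
    return [
        sum(o for o, c in zip(row, labels) if c == j) / list(labels).count(j)
        for j in range(k)
    ]


# Outcomes of two candidate solutions on four tests (hypothetical data).
a = [1, 1, 0, 1]
b = [0, 0, 1, 1]
labels = [0, 0, 0, 1]  # the first three tests form one derived objective

a_derived = cluster_means(a, labels)  # [2/3, 1.0]
b_derived = cluster_means(b, labels)  # [1/3, 1.0]

# Original space: b wins on the third test, so neither dominates the other.
original_relation = (dominates(a, b), dominates(b, a))
# Derived space: a dominates b, a relationship absent from the original matrix.
derived_relation = dominates(a_derived, b_derived)
```

The example shows only that such additional dominance can arise from aggregation; how often it does so depends on the clustering and on the interaction matrix.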

The heuristic character of disco is advantageous in several respects. First, it entails only moderate computational overhead (discussed in more detail below), while the problem of constructing an exact coordinate system has been proven to be NP-hard (Jaśkowski and Krawiec, 2011; Jaśkowski, 2011). Second, by being based on the outcomes of interactions with a transient population of tests *T*, the derived objectives match the current capabilities of the candidate solutions in *S*. In other words, the derived objectives evolve along with the candidate solutions and may adapt to their capabilities, creating a suitable search gradient while avoiding overspecialization.

The tests in *T* are rewarded for distinctions calculated directly from the original interaction matrix *G* (and not from *G*′). Interestingly, they are not explicitly affected by disco’s multi-objective evaluation. This implies that the first iteration of the evolutionary loop in disco produces the same second population of tests as in cel (and in all other configurations considered here). However, the candidate solutions selected in that iteration from *S* in disco are likely to be different from those selected in the first iteration of cel. This results in a different interaction matrix in the second generation, and consequently other selection outcomes in *T*. In this indirect way, the multi-objective evaluation of disco affects the dynamics of evolution in the population of tests.
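To illustrate why the evaluation of tests is independent of the derived objectives, the sketch below scores tests by distinctions using only the original matrix. It assumes a simple, unweighted notion of distinction (a test distinguishes an ordered pair of candidate solutions if the first obtains a strictly better outcome than the second); the exact formula used in this article is defined earlier and may differ, for example by weighting. The function name `distinctions` is hypothetical.

```python
from typing import List


def distinctions(G: List[List[float]]) -> List[int]:
    """Count, for each test (column of G), the ordered pairs of candidate
    solutions (rows of G) that the test tells apart.

    Assumption: unweighted count; G[i][j] is the outcome of candidate i on test j.
    """
    m = len(G)
    n = len(G[0])
    return [
        sum(1 for i in range(m) for k in range(m) if G[i][j] > G[k][j])
        for j in range(n)
    ]


# Three candidate solutions, three tests (hypothetical outcomes).
G = [[1, 1, 0],
     [1, 0, 0],
     [0, 0, 0]]
test_fitness = distinctions(G)
```

The function never consults cluster assignments or derived objectives, so any two methods that share the same interaction matrix in a given generation would score the tests identically under this assumption.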

The process of discovering the derived objectives obviously incurs an additional computational cost, which for *k*-means-like heuristic clustering algorithms is of the order of , where *m* and *n* are respectively the sizes of *S* and *T*. Multi-objective selection is also more computationally demanding than the traditional selection operators based on scalar fitness, due to the complexity of the nsga-ii algorithm (Deb et al., 2002). The total overhead is thus linear in *n*, which encourages using disco with relatively large populations of tests and moderately sized populations of candidate solutions.

These expenses result, however, from the postprocessing of interaction outcomes, while in many applications the interactions are what consume the majority of the computational budget. This particularly applies to many test-based problems where there are *multiple* tests to interact with, and a single interaction outcome may require running a possibly complex program (in GP), performing a costly simulation, or playing a game that involves multiple turns. In such cases, the cost of clustering and multi-objective selection may be an insignificant fraction of the overall computation time.

The empirical evidence gathered from our experiments confirms the moderate overhead of the derivation process. Table 3 presents the runtimes for particular methods and benchmarks averaged over 60 coevolutionary runs, accompanied by 95% confidence intervals. The times are clearly higher for disco when compared to cel across all the benchmarks, but the overhead is only percent on average with respect to cel, and it never exceeds percent. These numbers could be further reduced by using more efficient algorithms or, for instance, limiting the number of internal iterations of x-means clustering (which normally proceeds until data points stop migrating between clusters). This would not necessarily deteriorate the quality of the evolved solutions, because clustering optimally is probably not essential here, given that the evolutionary search is by nature stochastic.

| Benchmark | cel | 1-means | rand | disco | disco-bin | disco-avg |
|---|---|---|---|---|---|---|
| ipd | | | | | | |
| dct-1 | | | | | | |
| dct-2 | | | | | | |
| dct-3 | | | | | | |
| coo-3 | | | | | | |
| coo-4 | | | | | | |
| coo-5 | | | | | | |
| coa-3 | | | | | | |
| coa-4 | | | | | | |
| coa-5 | | | | | | |

The objectives derived by disco are expected to be more informative than a scalar evaluation by providing otherwise unavailable grounds for preferring some solutions over the others. However, the proposed method relies heavily on the possibility of identifying any prevailing and exploitable patterns in the interaction outcomes. The more similar the behavior of candidate solutions on the tests is, the harder it becomes for a clustering algorithm to discover the groups of tests that could be attributed to skills. For instance, when disengagement (Cartlidge, 2004) occurs—that is, all maintained tests are solved by all candidate solutions (or all are failed)—all tests end up in the same cluster, causing the method to degenerate to a single-objective approach and lose its upper hand. Scenarios close to disengagement (a very large fraction of solved or failed tests) may also cause the numbers of tests supporting different objectives to become highly unbalanced, leading to disrupted approximation of the Pareto front. Furthermore, disco may also underperform when the original objective function is inherently single-objective (e.g., in simple, “max ones” types of problems) or hard to automatically decompose into derived objectives. Finally, disco can discover the underlying objectives only if their existence is manifested behaviorally—that is, reflected in the outcomes of interactions between the candidate solutions and tests. If the skills do not manifest in interaction outcomes, disco has no means to discover them. This may be the case in problems where passing any test requires all (or most) skills.
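The degenerate cases described above can be detected directly from the interaction matrix before clustering. The following sketch assumes binary outcomes (1 for a solved test, 0 for a failed one); the function names `is_disengaged` and `distinct_test_columns` are hypothetical and not part of disco.

```python
from typing import List


def is_disengaged(G: List[List[int]], solved: int = 1, failed: int = 0) -> bool:
    """True if every candidate solves every test, or every candidate fails
    every test. G[i][j] is the outcome of candidate i on test j."""
    outcomes = [o for row in G for o in row]
    return (all(o == solved for o in outcomes)
            or all(o == failed for o in outcomes))


def distinct_test_columns(G: List[List[int]]) -> int:
    """Number of distinct outcome vectors among the tests (columns of G).

    If this is 1, all tests coincide as data points and no clustering
    can separate them into more than one meaningful group.
    """
    columns = {tuple(row[j] for row in G) for j in range(len(G[0]))}
    return len(columns)


all_solved = [[1, 1], [1, 1]]
mixed = [[1, 0], [0, 1]]
```

For `all_solved`, the population is disengaged and all tests collapse onto a single point, which corresponds to the single-objective degeneration mentioned above. For `mixed`, the two tests have different outcome vectors and could support separate derived objectives.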

## 8 Conclusions

The disco algorithm proposed here is a means to widen the evaluation bottleneck between the fitness function and a search algorithm. By providing the search process with multiple characteristics of candidate solutions, disco makes a search algorithm better informed. As we argued elsewhere (Krawiec and Swan, 2013), we postulate that treating the fitness function as a black box is unjustified, especially when more detailed information on the solutions’ characteristics, like the interaction outcomes, is easily available. Such information may deserve more effort in conceptual analysis, implementation, and computational expense to harness it, but, as we showed in this study, these costs may pay off with a more effective search method.

In replacing the original objective function with heuristic and transient derived objectives, disco subscribes to the concept of a *search driver* (Krawiec and O’Reilly, 2014a, 2014b; Krawiec, 2015). A single derived objective is a search driver in the sense that it conveys only partial information about the quality of candidate solutions. Relying on such search drivers is not necessarily less efficient than using the original objective (expected utility, in this article). In a rugged and multimodal fitness landscape, the original objective may turn out to be more deceptive than an imperfect search driver. This becomes particularly true in disco, where multiple diversified search drivers are used simultaneously, and so mitigate premature convergence.

There are several directions in which this research can be taken further. The technical improvements of clustering were already discussed in Section 7; however, clustering is just one possible way of deriving a multi-objective search gradient from an interaction matrix. Other techniques can be considered, like building derived objectives from arbitrary, not necessarily disjoint, subsets of tests (Liskowski and Krawiec, 2016). Last but not least, it would be interesting to combine this approach with the other solution concepts mentioned in Section 2 (Popovici et al., 2012).

## Acknowledgments

P. Liskowski acknowledges support from Grant No. 2014/15/N/ST6/04572, and K. Krawiec acknowledges support from Grant No. 2014/15/B/ST6/05205, both funded by the National Science Centre, Poland.

## Notes

^{1} To simplify notation, we use the same symbols *G* and *G*′ to denote the interaction *matrices* and the associated *spaces* of objectives.

^{2} In general, the likelihood of committing such errors grows with the ratio of the number of tests (dimensionality of the original dominance space) to the number of derived objectives.

^{3} https://github.com/pliskowski/ECJ-2015

^{4} Note that the objectives derived by rand, by being selected at random, are estimates of expected utility in the same sense as (Eq. 2), albeit based on smaller samples than the entire population of tests *T*. Also, similarly to disco, they sum up to scalar fitness.

^{5} For binary domains, disco-bin with the Euclidean distance is equivalent to disco with the Hamming distance (assuming centroids were determined by the mode in place of the arithmetic mean). However, this is not true for multivalued interaction functions like in the ipd benchmark.