## Abstract

Co-optimization problems often involve settings in which the quality (*utility*) of a potential solution is dependent on the scenario within which it is evaluated, and many such scenarios exist. Maximizing expected utility is simply the goal of finding the potential solution whose expected utility value over all possible scenarios is best. Such problems are often approached using coevolutionary algorithms. We are interested in the design of generally well-performing *black-box* algorithms for this problem, that is, algorithms which have access to the utility function only via input–output queries. We research this matter by focusing on three main questions: 1) are some algorithms strictly better than others when judged in aggregation over *all* possible instances of the problem? that is, is there “free lunch”? 2) do optimal algorithms exist? and 3) if so, do they have a tractable implementation? For a specific expected-utility maximization context, involving several assumptions and performance choices, we answer all three questions affirmatively and concretely: we provide examples of free lunch; we describe the general operation of optimal algorithms; we characterize situations when this operation has a very simple and efficient implementation, situations when the computational cost can be significantly reduced, and situations when tractability of optimal algorithms might be out of reach.

## 1 Introduction

Consider the real-world application of designing electrical power transmission systems (power grids) that are resilient to faults, whether malicious or accidental. Control devices can be strategically placed so as to isolate the damage and redistribute system resources; resilience is measured as the portion of the system that can still be provided with power (Service, 2008). Both the space of designs (device placements) and the space of possible damages are very large. A design can have different resilience to different damages and different designs can have different resilience to the same damage. Assessing the resilience of a design to a damage is typically done by running a time-consuming simulation, as there is no known closed-form mathematical expression for it. Loosely, the goal is to find designs that are very resilient over a broad range of damages. This goal could be made precise in several different ways, leading to different solution concepts. For instance, ideally we would like a design whose resilience to any given damage is better than the resilience of any other design to that damage (*simultaneous maximization of all outcomes*). But such a design may not exist, as there may be trade-off situations where one design is more resilient than another for one damage, but less so for another damage. So we may have to be content to find a set of nondominated designs, where a design is nondominated if any other design is less resilient than it for at least one damage (*Pareto optimization*). Still, there may be a very large number of nondominated designs and we may want to further refine our goal. If we suspect malicious intent, we may want to find the design whose worst resilience over all damages is as good as possible (*worst-case optimization*). Or, if accidental damage is common, we may want to find the design whose average resilience over all possible damages is as good as possible (*maximization of expected utility*).

All of the above are examples of so-called *test-based co-optimization*—the damages can be thought of as tests. We are interested in algorithms that treat the utility (resilience) assessment in *black-box* fashion: they can query its value for any design-damage input pair, but do not have any other knowledge about its nature. The reason for this latter restriction can be three-fold: utility may truly be black-box (e.g., if it is the outcome of a real-world experiment, rather than a computation); we may have some knowledge about the nature of utility, but transforming it into something that can be used in an algorithm is a time-consuming human activity with no guarantee of success; we may want to design broadly applicable algorithms that do not depend on specific properties of a particular set of utility functions.

Such situations are common in the field of engineering design in general and critical infrastructure protection in particular, since designs need to behave appropriately across a wide range of situations they might encounter once deployed, and assessment of behavior is often done via simulation. Test-based co-optimization is also a useful framing for the design of task-specific algorithms (e.g., for sorting networks the tests are the input sequences to be sorted (Hillis, 1990)) or of game-playing strategies (the strategies serve both as the object of the search and as tests).

As such, the study of black-box test-based co-optimization is an important area of research. It has emerged over the past decade or so as a means of formalizing the types of problems that coevolutionary algorithms were used to tackle (Popovici et al., 2010). This has led to the recognition that the evolutionary metaphor is not the only approach to such problems: other examples include co-optimization adapted simulated annealing (Service and Tauritz, 2008a) and reinforcement learning (Sutton and Barto, 1998; Tesauro, 1995). All of these fields were rich with empirical and theoretical study of their own breed of algorithms. Bringing them under the co-optimization umbrella has sparked generic theoretical research into the structure of co-optimization problems and what may or may not be possible from an algorithmic standpoint, without assuming any particular metaphor (whether biological-, physics-, or human-inspired). Examples include “global” monotonicity of solution concepts (Ficici, 2004) and order-theoretic and geometric characterizations of Pareto co-optimization (Bucci, 2007).

The present work is concerned with whether some algorithms can be *generally* better than others for a specific co-optimization solution concept, in other words, whether there is “free lunch,” and if so, how to take advantage of it. Wolpert and Macready (2005) first posed these questions in a context involving worst-case optimization and showed that free lunch exists both for an algorithm's *exploration mechanism* (the component deciding at which points to evaluate the utility function) and for its *output mechanism* (the component deciding what potential solution to output given what has been evaluated so far). Exploration mechanism free lunch has subsequently been shown to exist for a slightly different worst-case optimization context (Service, 2009b), simultaneous maximization of all outcomes (Service and Tauritz, 2008b), and Pareto co-optimization (Service, 2009a). Whether a solution concept allows free lunch was shown to be orthogonal to whether that solution concept is monotonic (Service, 2009b; Popovici and De Jong, 2009).

When free lunch exists, it is meaningful to ask whether optimal algorithms exist and whether they can actually be implemented. Optimal output mechanisms have been defined for a variety of co-optimization contexts (Wolpert and Macready, 2005; Service and Tauritz, 2008b; Popovici and De Jong, 2009) and shown to be tractable, under certain conditions, for worst-case optimization (Popovici et al., 2011). Optimal exploration mechanisms have only been investigated, and shown to exist and sometimes be practical, for that same worst-case optimization context (Popovici and Winston, 2015).

Our solution concept of interest, maximization of expected utility, is probably the most common goal in practice, and thus popular in coevolution. It is also of interest due to its similarities with supervised machine learning, where the goal is finding predictors maximizing accuracy over many data points. From a theoretical standpoint, it was shown to be globally nonmonotonic by Ficici (2004),^{1} yet to allow for “locally” or “internally” monotonic algorithms (De Jong, 2005); but in spite of its importance, it has received little study from the perspective of general performance: Popovici and De Jong (2009) studied output mechanisms and showed *strict* optimality (and therefore free lunch) is possible, and only briefly discussed tractability of optimal output mechanisms. No studies exist on free lunch, optimality, and tractability of exploration mechanisms for maximizing expected utility—this is the gap the present article fills.

The contents are structured as follows. In Sections 2.1 and 2.2, we review the mathematical formulations of co-optimization problems and algorithms. In Section 2.3, we make precise the notions of performance, “better” and “in general” from the perspective of maximizing expected utility. In Section 3, we study and show the existence of free lunch and optimal algorithms for this problem. By identifying commonalities and differences between expected utility optimization and worst-case optimization, we are able to transfer, with some adjustments, results from Popovici and Winston (2015). Section 4, which constitutes the bulk of the article, is highly specific to the problem of maximizing expected utility and shows under which conditions it is tractable to implement the optimal exploration mechanism and how to reduce its computational complexity—all novel results for this problem. The findings are very positive for some domains and less encouraging for others. This should help guide the focus of future research into designing co-optimization algorithms for maximizing expected utility. All proofs are relegated to the accompanying supplementary materials.^{2}

## 2 Co-Optimization Background

### 2.1 Domains and Problems

The **interactive domain** of resilient power grid design described in Section 1 is characterized by two **roles**, each with an associated **entity set**: $X$ for designs and $Y$ for damages. Resilience of a design to a damage (utility) is given by a **metric** $M$. $V$ is the set of all possible resilience values. A tuple $(x, y) \in X \times Y$ is called an **interaction** and $((x, y), M(x, y))$ is called a **measured interaction**. At present we consider only deterministic metrics; that is, $M$ is a function $M : X \times Y \to V$.

Our **co-optimization problem** consists of searching a **space of potential solutions**, in this case $X$, for an element (a design) that conforms to a certain criterion called a **solution concept**, in this case optimizing expected utility. Specifically, given a metric-dependent **quality function** $q_M : X \to \mathbb{R}$, a potential solution $x^*$ is an actual or true **solution** if and only if it optimizes $q_M$, that is, if $q_M(x^*) = \mathrm{opt}_{x \in X}\, q_M(x)$, where $\mathrm{opt}$ is one of $\max$ or $\min$, and we assume that the domain is such that $\mathrm{opt}_{x \in X}\, q_M(x)$ is guaranteed to exist. The quality function for expected utility is given by $q_M(x) = \mathbb{E}_y[M(x, y)]$, where $y$ ranges over $Y$ and $\mathbb{E}$ represents expectation, which we assume is well defined for our domain. Since utility (resilience in the power grid domain) is typically such that higher values are better, for the rest of the article we use $\mathrm{opt} = \max$ and talk about **maximum expected utility** (MEU); however, should $M$ be such that lower values are better (for instance if $M$ represents cost), then $\mathrm{opt}$ would be $\min$. A quality function like $q_M$ “collapses” the metric $M$, which assigns real values to interactions, into a function which assigns real values to potential solutions (representing how good they are).^{3} The elements of $Y$ (e.g., the damages) are referred to as **tests**, because they help test which elements of $X$ are solutions and which are not.

We further assume $X$ and $Y$ to be finite. Together, these conditions are sufficient (though not necessary) to guarantee that the expectation and the maximum exist. Moreover, for finite $Y$, the expectation is equal to the average, and maximizing the average is further equivalent to maximizing the sum, given by $q_M(x) = \sum_{y \in Y} M(x, y)$. The solution concept of maximizing expected utility has therefore also been referred to as **maxiavg** (Popovici and Winston, 2015) or **maxisum** (Popovici and De Jong, 2009). For simplicity, in the rest of this article we will use maxisum.
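As a concrete illustration of this equivalence, the following sketch (a toy domain; all names and values are ours, not from the article) checks that the same design maximizes both the average and the sum:

```python
# Toy illustration: with a finite test set, the argmax of average utility
# and of summed utility coincide, since they differ by a constant factor.
designs = ["d1", "d2"]
damages = ["y1", "y2", "y3"]

# A toy deterministic metric M(design, damage) -> resilience value.
metric = {
    ("d1", "y1"): 0.9, ("d1", "y2"): 0.2, ("d1", "y3"): 0.7,
    ("d2", "y1"): 0.5, ("d2", "y2"): 0.6, ("d2", "y3"): 0.4,
}

def q_sum(x):
    return sum(metric[(x, y)] for y in damages)

def q_avg(x):
    return q_sum(x) / len(damages)

best_by_sum = max(designs, key=q_sum)
best_by_avg = max(designs, key=q_avg)
print(best_by_sum, best_by_avg)  # the same design wins under both criteria
```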

The worst-case optimization problem from Popovici and Winston (2015), upon which we draw, is simply obtained by instantiating $q_M$ with $q_M(x) = \min_{y \in Y} M(x, y)$ (the respective solution concept is called **maximin**).

In this framework we use the word “problem” to denote the pairing of a potential solution set $X$ with a solution concept applicable to that $X$. We call a **problem instance** the coupling of a problem with a specific metric $M$ that is given to us in black-box manner: we do not know which $M$ we are presented with, its analytical or algebraic form is unknown or nonexistent, but its value can be queried for any input.^{4} One such query is typically seen as the unit of cost for any algorithm attempting to solve the problem.

### 2.2 Algorithms

In light of the above, the generic structure and operation of algorithms for black-box co-optimization problems have been refined to the following, reproduced with minor adjustments from Popovici and Winston (2015).

We call a **history** a finite *sequence* of measured interactions, that is, tuples $((x, y), v)$ with $(x, y) \in X \times Y$ and $v = M(x, y) \in V$. We denote by $\mathcal{H}$ the set of all possible histories, that is, the set of all finite sequences with elements from $(X \times Y) \times V$. Then an **algorithm** consists of two functions:

- the **exploration mechanism**, $E : \mathcal{H} \times \mathbb{N} \to X \times Y$; it determines what interaction to assess next, given the interactions assessed so far, their measurements, and the number of additional interactions that the algorithm is allowed to evaluate after the current one, called the **budget**; we denote the set of all exploration mechanisms by $\mathcal{E}$;
- the **output mechanism**, $O : \mathcal{H} \to X$; it determines what potential solution to return, given the assessed interactions and their measurements; we denote the set of all output mechanisms by $\mathcal{O}$.

We write $a = (E, O)$ and denote the set of all co-optimization algorithms by $\mathcal{A} = \mathcal{E} \times \mathcal{O}$. The operation of an algorithm $(E, O)$ on metric $M$ starting from history $h$ with initial budget $b$ is equivalent to the following loop: while budget remains, query $E$ for the next interaction, evaluate it by calling $M$, append the measured interaction to the history, and decrement the budget; once the budget is exhausted, return the potential solution chosen by $O$ for the accumulated history.

Since we use only deterministic metrics, we must additionally require that the metric be **consistent with the history**, meaning that if the algorithm were to decide to re-evaluate an interaction already present in the history $h$ (by calling $M$ again), it should observe the same value as that associated with the interaction in $h$. We denote by $\mathcal{M}_h$ the set of metrics consistent with history $h$.
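The generic operation described above can be sketched as follows (a minimal sketch; the function names and the toy domain are ours, not the article's):

```python
def run_algorithm(explore, output, metric, history, budget):
    """Generic black-box co-optimization loop: repeatedly ask the
    exploration mechanism for an interaction, query the metric (the unit
    of cost), and append the measured interaction to the history; once
    the budget is exhausted, the output mechanism picks the solution."""
    for remaining in range(budget - 1, -1, -1):
        x, y = explore(history, remaining)  # 'remaining' = evaluations allowed after this one
        history = history + [((x, y), metric(x, y))]
    return output(history)

# Toy usage: two designs, two tests, a scan-order explorer, greedy output.
def metric(x, y):
    return {("a", "t1"): 1, ("a", "t2"): 0,
            ("b", "t1"): 0, ("b", "t2"): 0}[(x, y)]

def explore(history, remaining):
    seen = {xy for (xy, _v) in history}
    for x in ["a", "b"]:
        for y in ["t1", "t2"]:
            if (x, y) not in seen:
                return (x, y)  # nonrepeating: skip already-evaluated interactions

def output(history):
    sums = {}
    for ((x, _y), v) in history:
        sums[x] = sums.get(x, 0) + v
    return max(sums, key=sums.get)

print(run_algorithm(explore, output, metric, [], 3))  # 'a'
```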

### 2.3 Performance

To investigate whether some co-optimization algorithms perform better than others in some general sense, one must make precise the notions of algorithm performance, “better” and “general.” We do so for maximizing expected utility, by following the steps laid out in Popovici and Winston (2015) and making almost exactly the same choices as made there for worst-case optimization, since these two co-optimization problems have very similar structure.

In the above framework an algorithm has three input parameters: the metric $M$, the budget $b$, and the starting history $h$. So we investigate algorithmic performance for the maxisum solution concept across *multiple* metrics, budgets, and starting histories. But to start, we must define performance with respect to a single one of each: we call **single-metric algorithm performance**, and denote by $P(a, M, b, h)$, the performance of algorithm $a$ when running for $b$ steps starting from history $h$ on metric $M$. $P$ must help us decide, for a given metric, budget, and starting history, whether one algorithm is better than another, and do so in a way that “respects” the solution concept's quality function: loosely speaking, for one algorithm to have better performance than another, the former algorithm should generally output potential solutions of better quality than those outputted by the latter algorithm.

Judging algorithms by the quality of what they output once the budget is exhausted is known as *fixed-budget performance* (Jansen and Zarges, 2012) and is the main interest of practitioners in the field of black-box optimization.

Mathematically, the potential solution outputted at the end is given by $O(h_{E,M,b,h})$, where $h_{E,M,b,h}$ stores the history accumulated during the $b$ iterations of the loop from the algorithm's operation. We denote this history by $h_{E,M,b,h}$, as it is fully determined by these four parameters and not dependent on the output mechanism $O$. Then we define $P$ as the quality on metric $M$ of said potential solution:

$$P((E, O), M, b, h) = q_M\big(O(h_{E,M,b,h})\big). \tag{2.1}$$

To separate the contributions of the two components of the algorithm, we define a function assigning performance values to output mechanisms independent of exploration mechanisms, $P_{\mathcal{O}}$, denoting the performance of output mechanism $O$ on metric $M$ given history $h$ as the quality on $M$ of the potential solution outputted by $O$ for $h$:

$$P_{\mathcal{O}}(O, M, h) = q_M\big(O(h)\big). \tag{2.2}$$

Then the performance of the whole algorithm can be expressed as the performance of its output mechanism for the history produced by its exploration mechanism:

$$P((E, O), M, b, h) = P_{\mathcal{O}}(O, M, h_{E,M,b,h}). \tag{2.3}$$

We now have a set of functions specifying what we mean by “better” performance when dealing with a specific problem instance (metric) $M$: higher values for $P$ and $P_{\mathcal{O}}$ are better. To be able to make “general” performance statements, we need a way of aggregating over metrics. **Aggregated performance** is important because often in real-world applications a certain problem must be solved multiple times under varying circumstances (i.e., for different metrics) and it would be costly, or perhaps infeasible due to time constraints, to design a new algorithm each time. Instead, one would like to design the algorithm once and know that it performs well in multiple circumstances.

So we define aggregated performance for potential solutions, output mechanisms, and complete algorithms as follows:

$$\bar{q}_h(x) = \operatorname*{avg}_{M \in \mathcal{M}_h} q_M(x), \qquad \bar{P}_{\mathcal{O}}(O, h) = \operatorname*{avg}_{M \in \mathcal{M}_h} P_{\mathcal{O}}(O, M, h), \qquad \bar{P}(a, b, h) = \operatorname*{avg}_{M \in \mathcal{M}_h} P(a, M, b, h).$$

Since all performances are defined with respect to some history, averaging is done only over metrics consistent with that history. To ensure these averages are well defined we also further assume that $V$ is finite: together with the assumption that $X$ and $Y$ are also finite, this guarantees that $\mathcal{M}_h$ is finite and we can average over it.^{7}
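To make this uniform aggregation concrete, the following sketch (a tiny toy domain; names and sizes are ours) enumerates every metric consistent with a one-interaction history and averages a potential solution's summed quality over them:

```python
from itertools import product

# Enumerate all metrics over a toy domain that agree with a given history,
# then average a solution's summed quality uniformly over those metrics.
X, Y, V = ["a", "b"], ["t1", "t2"], [0, 1]
history = [(("a", "t1"), 1)]  # one measured interaction

def consistent_metrics():
    pairs = list(product(X, Y))
    for values in product(V, repeat=len(pairs)):
        M = dict(zip(pairs, values))
        if all(M[xy] == v for (xy, v) in history):
            yield M  # consistent: agrees with every measured interaction

metrics = list(consistent_metrics())
avg_q_a = sum(sum(M[("a", y)] for y in Y) for M in metrics) / len(metrics)
print(len(metrics), avg_q_a)  # 8 consistent metrics; aggregated quality of 'a' is 1.5
```

Note that the result agrees with intuition: the measured value 1 is kept, while each unmeasured value contributes its average over $V$.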

From here on, we assume $X$, $Y$, and $V$ are finite, and let $n = |Y|$ denote the total number of tests.

A few important remarks are in order at this junction. The above formulas use uniform averaging, whose underlying assumption is that all metrics are equally likely. In practice, this is an unrealistic assumption, as experience teaches us that real-world problem classes tend to have structure to them. The present work focuses on uniform averaging as a necessary first step for maxisum, since there is no prior work on free lunch and optimality of exploration mechanisms for this problem. Interestingly though, as we will see in Section 3, *nontrivial free lunch exists even under this uniformity assumption*, a notable difference compared to the no free lunch of traditional optimization (Wolpert and Macready, 1997). Note also that the generic framework of Popovici and Winston (2015) does not impose such an assumption; other means of aggregating over problems could be investigated, including ones reflecting nonuniform distributions, in which case free lunch should be even more prevalent. The challenge in extending the present work to other distributions will lie in the ability to formalize them; this matter is further discussed in the concluding remarks of Section 5.

## 3 Optimality and Free Lunch

Using these notions of performance, we can now talk in precise terms about aggregate optimality and free lunch. We can use these concepts with respect to algorithms or their components, the output mechanism and the exploration mechanism. An algorithm is aggregately optimal over some set of algorithms $A$ if its $\bar{P}$ value is greater than or equal to that of any other algorithm in $A$. If all inequalities between pairs of algorithms in $A$ are in fact equalities, then the above optimality is trivial and we say that there is no free lunch. Otherwise, if at least two algorithms have different $\bar{P}$ values, then there is free lunch, and finding an optimal algorithm, if one exists, is a meaningful endeavor. Note that aggregate optimality is a stronger result the larger the set of algorithms it holds over, while free lunch is stronger the smaller the set.

Since $\bar{P}$ takes two additional parameters, $b$ and $h$, the above notions can be investigated with respect to a single budget and/or starting history, or multiple ones. Similar notions can be defined for output mechanisms by using $\bar{P}_{\mathcal{O}}$ values for one or multiple histories, thus independent of any exploration mechanism or budget. For exploration mechanisms, we cannot define aggregate optimality and free lunch in isolation; we must do so with respect to some fixed output mechanism and use the $\bar{P}$ values for the resulting algorithm.

Wolpert and Macready (2005) were the first to describe an output mechanism aggregately optimal with respect to a given history, which they named *Bayes*. Subsequent works extended this notion to other solution concepts (Service and Tauritz, 2008b; Service, 2009b) and formalized it for any solution concept for which we can define an aggregated performance for potential solutions (Popovici and De Jong, 2009; Popovici and Winston, 2015); namely:

$$O^{\mathrm{Bayes}}(h) \in \operatorname*{arg\,max}_{x \in X} \bar{q}_h(x). \tag{3.1}$$

Assumption 1 together with the choice of defining $P_{\mathcal{O}}$ by means of a quality function in the manner of Equation (2.2) provide sufficient conditions for the existence of output mechanisms $O^{\mathrm{Bayes}}$ and the fact that they are aggregately optimal over $\mathcal{O}$ with respect to *any* history (Popovici and Winston, 2015). The aggregate performance of such output mechanisms is:

$$\bar{P}_{\mathcal{O}}(O^{\mathrm{Bayes}}, h) = \max_{x \in X} \bar{q}_h(x). \tag{3.2}$$

That same assumption together with the choice of defining $P$ by means of a $P_{\mathcal{O}}$ in the manner of Equation (2.3) lead to the first result concerning exploration mechanisms for maxisum: for any output mechanism (including $O^{\mathrm{Bayes}}$) and any budget, there exist exploration mechanisms aggregately optimal over $\mathcal{E}$ with respect to them for *any* history (Popovici and Winston, 2015).

The above results apply not just to maxisum, but to any solution concept defined by means of a quality function $q_M$, as long as we make the finiteness Assumption 1 and the performance choices described by Equations (2.2) and (2.3) for that $q_M$. We now further refine our analysis to take into account the actual definition of $q_M$.

### 3.1 Maxisum: Output Mechanisms

Popovici and De Jong (2009) investigated optimality and free lunch of output mechanisms for maxisum and showed that $O^{\mathrm{Bayes}}$ is *strictly* aggregately optimal in some situations (meaning there is output mechanism free lunch over $\mathcal{O}$) and that the tractability of implementing $O^{\mathrm{Bayes}}$ hinges on the ability to know or compute the average of the values in $V$; the investigation concerned histories with evaluations for only two potential solutions. We review the key elements of those results, as they are also necessary for our study of aggregately-optimal exploration mechanisms in Section 3.2. We also extend the analysis to include generic histories and further considerations on tractability, by adapting some maximin notions introduced in Popovici and Winston (2015).

Central to these results is the **expected sum** ($ES$) of potential solution $x$ given history $h$, that is, its aggregated quality $\bar{q}_h(x)$. It depends on $h$ only via those interactions in $h$ that involve $x$, and can be expressed in closed form. Specifically, it is a function of the number of *distinct* tests that $x$ has interacted with in $h$, which we call the **number of tests seen so far**, and the sum of metric values over the interactions with those tests, which we call the **current sum**. We refer to the combination of these two pieces of information as the **type** of potential solution $x$ with respect to history $h$, and write $\tau_h(x) = (k, s)$, where $k$ is the number of tests seen and $s$ is the current sum.

**Completely unevaluated** potential solutions, that is, ones that have not seen any tests as part of the history, have type $(0, 0)$.

**Previously evaluated** potential solutions have $k \geq 1$. The expected sum is equal to the current sum plus the product of the expected value from an interaction with an unseen test, $\overline{V} = \operatorname{avg}(V)$, and the number of unseen tests, $n - k$:

$$ES_h(x) = s + (n - k) \cdot \overline{V}.$$

This formula was first loosely derived by Popovici and De Jong (2009). The supplementary materials of the present article show a formal proof. Its key enablers are the finiteness Assumption 1, the associativeness of averaging, and the fact that $ES_h(x)$ depends on $h$ only via those interactions in $h$ that involve $x$. The latter of these, in particular, may not hold for certain nonuniform distributions over problems.
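The closed form above can be sketched directly (function and argument names are ours, purely illustrative):

```python
def expected_sum(k, s, n, avg_v):
    """Closed-form expected sum of a potential solution of type (k, s):
    its current sum plus the average metric value times the number of
    unseen tests."""
    return s + (n - k) * avg_v

# A solution that has seen 2 of 5 tests, accumulating sum 1.4, when values
# are equally likely over V = {0, 1} (so avg_v = 0.5):
print(expected_sum(2, 1.4, 5, 0.5))  # 1.4 + 3 * 0.5 = 2.9
```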

Many of the results in the remainder of this article rely on properties of the $ES$ function. In particular, the relationship between the $ES$ values of two potential solutions is preserved if the same value is observed when each of them sees an additional test (details can be found in the supplementary materials). For simplicity, from here on we write $ES(k, s)$ to mean $ES_h(x)$ for any potential solution $x$ whose type with respect to $h$ is $(k, s)$.

The free lunch results of Popovici and De Jong (2009) consisted of examples of histories with pairs of potential solutions having different $ES$ values for those histories. In particular, it was shown that $O^{\mathrm{Bayes}}$ is *strictly* better than the *greedy* output mechanism often used in practice, which returns the previously evaluated potential solution with maximum current sum.

Let us investigate more generic histories. In practice, the set $X$ is so large that we do not expect to ever be able to evaluate each potential solution in $X$ with at least one test; therefore, for the remainder of this article we consider only histories such that there exists at least one completely unevaluated potential solution. $O^{\mathrm{Bayes}}$ decides what to output based on expected sums; it considers all previously evaluated potential solutions as well as one “representative” completely unevaluated potential solution—chosen uniformly at random, since all of them have type $(0, 0)$ and consequently the same $ES$. Thus the maximization over $X$ depicted by Equation (3.2) is in fact a maximization over a much smaller set of potential solutions (at most one more than the number of previously evaluated ones) and it needs to know only the types of those potential solutions.

To express this, we use the notion of a **compressed history**. Let the notation $[i..j]$ denote the set of all integers between $i$ and $j$, including both. Let $T = [0..n] \times \mathbb{R}$ denote the set of all possible tuples $(k, s)$. Then a compressed history is a multiset with elements from $T$. We denote the set of all compressed histories by $\mathcal{C}$. The **compression** of a history $h$, denoted by $c(h)$, is a compressed history consisting of the multiset of types for all distinct potential solutions in the history (different potential solutions can have the same type). The compression of the empty history is the empty multiset, $\emptyset$. For a nonempty history with $m$ distinct potential solutions, the compression will be a multiset of the form $\{(k_1, s_1), \ldots, (k_m, s_m)\}$.
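Computing the compression of a history is straightforward; a minimal sketch (using a `Counter` as the multiset; names are ours):

```python
from collections import Counter

def compression(history):
    """Compression of a history: the multiset (here a Counter) of types
    (distinct tests seen, current sum) over the distinct potential
    solutions appearing in the history. Re-evaluations of the same
    interaction do not change a solution's type."""
    seen = {}  # solution -> {test: measured value}
    for ((x, y), v) in history:
        seen.setdefault(x, {})[y] = v
    return Counter((len(tests), sum(tests.values())) for tests in seen.values())

h = [(("a", "t1"), 1), (("a", "t2"), 0), (("b", "t1"), 1)]
print(compression(h))  # 'a' has type (2, 1), 'b' has type (1, 1)
```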

Note that the first element in tuples from $T$ can be 0 and the second element can freely vary over $\mathbb{R}$. However, for a given $V$, not all elements of $T$ are realizable as types with respect to some actual history. Completely unevaluated potential solutions always have type $(0, 0)$, and any type $(k, s)$ of a previously evaluated potential solution must obey the constraints that $1 \leq k \leq n$ and that $s$ can be expressed as a sum of $k$ elements from $V$. We allow for “degenerate” types and compressed histories (that aren't actually the compression of a real history) because they are useful in performance computations.^{9}

We can now express the performance of $O^{\mathrm{Bayes}}$ for a given history as a function of the compression of that history as follows:

$$\bar{P}_{\mathcal{O}}(O^{\mathrm{Bayes}}, h) = \max\Big( ES(0, 0),\ \max_{(k, s) \in c(h)} ES(k, s) \Big).$$

Consequently, comparing the $ES$ values of different potential solutions, whether previously evaluated or not, does not actually require knowing the exact value of $n$. It does, however, require knowing or being able to compute $\overline{V}$. Additionally, in order to be able to implement $O^{\mathrm{Bayes}}$, the algorithm must store, at a minimum, all previously evaluated potential solutions, the number $k$ of *distinct* tests that each such potential solution has interacted with, and the respective current sums $s$. Note that it is not necessary to store the individual measurements that produced the respective sums. To be able to accurately count distinct tests, the algorithm may actually need to store the tests themselves in order to perform identity comparisons—unless the exploration mechanism has some characteristics allowing such counts to be maintained without the need for comparisons.

With regard to $\overline{V}$, some real-world domains may in fact have the property that we know or can easily compute this value. For instance, this is the case when the outcome of evaluating a potential solution with a test is merely a pass/fail (e.g., 1/0) or perhaps a ranking (e.g., from very bad to very good on a scale from 1 to some small integer). Or, $V$ may be an equally spaced discretization of some continuous interval; we do not need to actually perform a summation over the distinct values in $V$, but merely compute the midpoint between the minimum and maximum of said interval. If $V$ does not have equally spaced values, but we do know the values in it, then the average can be computed via actual summation and we need to do this only once; thus, it is not of great concern in terms of computational resources.

The difficult case is when we know $V$ is included in some other set, but not all values in that set are actually possible, yet we do not know which are and which aren't. Then we cannot completely implement $O^{\mathrm{Bayes}}$, and the issue is not one of computational expense, but of lack of information. Note the difference in tractability of $O^{\mathrm{Bayes}}$ for maxisum compared to maximin: there, even when we had complete information about the domain, the ability to implement $O^{\mathrm{Bayes}}$ could be hindered by high computational costs; for maxisum, if we know $\overline{V}$ then implementing $O^{\mathrm{Bayes}}$ is easy in all situations the algorithm might encounter.

If we cannot get $\overline{V}$, we may still be able to implement $O^{\mathrm{Bayes}}$ in some situations by making use of bounds we can establish on $\overline{V}$ (for instance, any bounds on the values in $V$ are also bounds on $\overline{V}$). Comparing some $ES(k_1, s_1)$ and $ES(k_2, s_2)$ is equivalent to comparing $s_1 - s_2$ with $(k_1 - k_2) \cdot \overline{V}$. Depending on the values of $k_1$, $s_1$, $k_2$, $s_2$, the bounds on $\overline{V}$ might (or might not) tell us the outcome of the comparison.

In light of these considerations, and since the performance of a complete algorithm depends on that of its output mechanism, from this point on we concern ourselves only with the case where $\overline{V}$ is known. This allows us to transform the set $V$ by subtracting $\overline{V}$ from each of its values and obtain an equivalent optimization problem. The average of the transformed set is 0, which leads to $ES(k, s) = s$ (i.e., the expected sum is the same as the current sum), and therefore $\bar{P}_{\mathcal{O}}(O^{\mathrm{Bayes}}, h) = \max\big(0, \max_{(k, s) \in c(h)} s\big)$. So in this situation $O^{\mathrm{Bayes}}$ differs from *greedy* only through the presence of the 0 inside the $\max$; this becomes relevant when all previously evaluated potential solutions have negative current sums, in which case $O^{\mathrm{Bayes}}$ will output a completely unevaluated potential solution whereas *greedy* will output the previously evaluated one with the largest current sum, even though it is negative.
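The difference between the two output mechanisms over a centered value set can be made concrete with a toy sketch (all names are ours):

```python
# With V centered so that avg(V) = 0, every expected sum equals the current
# sum, and a completely unevaluated solution has expected sum 0.
def greedy_output(sums, fresh):
    return max(sums, key=sums.get)  # ignores unevaluated solutions

def bayes_output(sums, fresh):
    best = max(sums, key=sums.get)
    return best if sums[best] >= 0 else fresh  # the extra 0 inside the max

sums = {"a": -0.4, "b": -1.2}    # all evaluated solutions look worse than average
print(greedy_output(sums, "c"))  # 'a', despite its negative expected sum
print(bayes_output(sums, "c"))   # 'c', an unevaluated solution with expected sum 0
```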

### 3.2 Maxisum: Exploration Mechanisms

We now turn to investigating aggregate optimality and free lunch for exploration mechanisms for maxisum, an area which has not been researched by previous works.

The similarity of maxisum to maximin allows us to transfer from Popovici and Winston (2015) results concerning existence and generic operation and performance of aggregately optimal exploration mechanisms. However, when it comes to tractability of implementation, the differences between the solution concepts come into play and we prove new results specific to maxisum, which we present in Section 4. Here we describe the transferable results.^{10} Like for output mechanisms, the key enablers of these results, and therefore of the transferability, are the finiteness Assumption 1, the associativeness of averaging, and the fact that $ES_h(x)$ depends on $h$ only via those interactions in $h$ that involve $x$.

Traditionally, the study of free lunch has concerned itself only with algorithms whose exploration mechanisms return only **nonrepeating** interactions, that is, ones not already evaluated as part of the history (not in $h$).^{11} We call such exploration mechanisms **nonrevisiting** and denote their set by $\mathcal{E}^{NR}$. All of the free lunch results we present in the remainder of the article hold over $\mathcal{E}^{NR}$ and thus are nontrivial. Nonetheless, the framework we use here allows talking about free lunch and aggregate optimality over any set of algorithms; the reason for this is to enable future studies of subclasses of $\mathcal{E}$ that simply cannot guarantee nonrepeating interactions, for instance, due to stochasticity or finite memory.

Popovici and Winston (2015) showed that exploration mechanisms aggregately optimal over $\mathcal{E}^{NR}$ with respect to $O^{\mathrm{Bayes}}$ exist for *any* budget and *any* history. Moreover, an exploration mechanism $E$ is aggregately optimal over $\mathcal{E}^{NR}$ if and only if its operation is equivalent to a recursive maximization of $\bar{P}$ over nonrepeating interactions:

$$E(h, b) \in \operatorname*{arg\,max}_{(x, y) \notin h} \ \operatorname*{avg}_{v \in V} \ \bar{P}\big((E, O^{\mathrm{Bayes}}),\, b,\, h \oplus ((x, y), v)\big), \tag{3.9}$$

where $h \oplus ((x, y), v)$ denotes history $h$ lengthened with the measured interaction $((x, y), v)$.

In fact, this maximization over the extremely large set $X \times Y$ is further equivalent to a maximization over the much smaller set of types of these interactions with respect to the history $h$. The type of an interaction with respect to a history is simply the type of the potential solution involved in that interaction with respect to that history; that is, if $\tau_h(x) = (k, s)$ then the type of any interaction $(x, y)$ with respect to $h$ is also $(k, s)$.

The set of possible types a nonrepeating interaction can have with respect to a history is completely determined by the compression of that history alone. Note that if a potential solution is **fully evaluated**, so its type is , then will not contain any interaction involving . Consequently, for any , if we must have that . Thus we define the set of types possible with respect to the compressed history , , as .
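Under the simplifying assumption that a type can be encoded as a (tests seen, observed sum) pair, with `M` the total number of tests (both conventions are ours, for illustration only, since the article's own notation is not reproduced here), the set of possible types can be sketched as:

```python
def possible_types(chist, M):
    """Types a nonrepeating interaction can have w.r.t. a compressed history:
    every type that is not fully evaluated, plus the brand-new-solution type
    (0, 0); fully evaluated types admit no further interactions. Types are
    (tests_seen, observed_sum) pairs (our illustrative encoding)."""
    return sorted({t for t in chist if t[0] < M} | {(0, 0)})
```

Note that the fully evaluated type `(4, 2)` below is excluded, matching the observation that no nonrepeating interaction can involve a fully evaluated potential solution.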

Therefore, it is useful to express the compression of a history lengthened with one measured interaction as a function of the compression of the original history, the type of the interaction with respect to that history, and the measurement of the interaction:^{12}
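A minimal sketch of such an update function, under the same illustrative (tests seen, observed sum) encoding of types (an assumption of ours, not the article's notation):

```python
def update(chist, t, v):
    """Compressed history after evaluating one interaction of type t and
    observing measurement v: remove one copy of t (absent when t = (0, 0),
    i.e., a brand-new solution) and add its successor type. Types are
    (tests_seen, observed_sum) pairs (our illustrative encoding)."""
    rest = list(chist)
    if t in rest:
        rest.remove(t)
    return tuple(sorted(rest + [(t[0] + 1, t[1] + v)]))
```

Keeping the result sorted makes multiset equality of compressed histories a plain tuple comparison, which matters for the caching discussed in Section 4.1.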

With these notions in place, the aggregate performance and operation of optimal explorations can be described by means of the following function:

With this, the recursive maximization of Equation (3.9) can be refined as follows:

Let Assumption ^{1} hold. An exploration mechanism is aggregately optimal over if and only if its operation is equivalent to a recursive maximization of over interaction types, where has the property: The performance of any such exploration mechanism is related to via: and such an exploration mechanism is *also* aggregately optimal over for any budget and any history, *and* the complete algorithm is aggregately optimal over for any budget and any history.

The proof of Theorem ^{5} also hints at the possible existence of exploration mechanism free lunch over : for a given and , if there exists a type that does not maximize the described average, then any exploration mechanism satisfying is *not* aggregately optimal for budget and history ; that is, we have free lunch over the set of exploration mechanisms returning nonrepeating interactions.

*strictly* better aggregate performance for and than those returning nonrepeating interactions of type or , and we have free lunch in this scenario.

## 4 Tractability

We now set out to investigate whether optimal exploration mechanisms could actually be implemented, given that their generic operation as described by Theorem ^{5} consists of a recursive optimization of . One very positive finding is that for a certain class of domains of practical interest, implementing optimal exploration mechanisms is equivalent to a simple, efficient procedure and therefore always tractable (Theorem ^{18} in Section 4.5). For the remaining domains, we present several different means of reducing the computational cost of the recursive optimization; but we also show some less-than-optimistic evidence that completely avoiding recursive optimization might not always be possible.

Theorem ^{5} describes maximizing an average of values of , and itself is defined recursively as a maximization problem. This process can be represented as a complete tree of depth with alternating maximization and averaging layers (examples in Sections 4.1 and 4.2); nodes performing averaging have a branching factor of ; nodes performing maximization have a branching factor equal to where is the history at that node; each leaf node computes an value. For budget and a history with distinct types, the number of nodes in this tree is a large multiple of . The exact number depends on actual measurements in the history and their relationship to values in , since is not constant as we go down the tree: while the size of the history always increases by one every other layer, the number of distinct types can either increase by one (e.g., when new potential solutions are evaluated), or decrease by one (e.g., if after the evaluation the new type is the same as another type; or if the respective potential solution is now fully evaluated), or it can stay the same. Thus, at first glance, solving the maximization in Theorem ^{5} appears computationally prohibitive even for modest values of , and .

It turns out though that there are situations for which we can mathematically prove what the solution to the recursive optimization is, and thus do not need to perform said optimization computationally by evaluating the tree. Such sufficient conditions for the optimal exploration mechanism to be tractable are characterized by properties of the domain—namely, the nature of —and the state of the algorithm—specifically, the relationship between the remaining budget, the total number of tests , and the number of tests seen by potential solutions evaluated as part of the history.

And, if at some point during the run of the optimal exploration mechanism on some metric we do actually need to perform recursive optimization to determine the optimal choice by evaluating the tree, this actually involves determining the optimal choices for all future choice points that could possibly be encountered during that run, as well as for choice points that may be encountered when running the algorithm on other metrics—since the averaging nodes represent the fact that we do not know which metric we are running on: each root-to-leaf path in the tree represents a different set of metrics. Storing the results for the inner nodes of the tree will eliminate or reduce the need to run recursive optimizations in the future.

Moreover, the branching factor of the recursion can be reduced by means of mathematically proven properties, a process which we call *pruning*. Pruning may be possible both for maximization nodes and for averaging nodes. We use the following shorthand notation for the quantity computed by averaging nodes:

### 4.1 Pruning by Caching

To better understand the nature of recursion trees^{13} and see some of the pruning in action, let us look at a simple example. Consider a domain with pass/fail test outcomes; that is, and . Suppose we start from the empty history with budget n=5 and wish to implement the optimal exploration mechanism, which, in combination with , would guarantee us expected performance . At the first step, since , we have only one choice: evaluate a new interaction of some potential solution with some test. In math-speak, , since afterwards we will have 4 interactions left. Suppose when evaluating the new interaction we observe measurement 0. The history becomes , whose compression is . We now have two choices: evaluate with a new test, or evaluate a new potential solution with some test; that is, . To decide which choice is better, we need to compute for each and a remaining budget of 3, and this launches the recursive maximization whose tree is depicted in Figure 1. Due to space constraints, the tree is expanded only up to the level containing . This is also the level where we first start to see opportunities for pruning. For instance, by the time we need to compute , we have already computed , and since multisets are order independent, the former is the same as the latter and if we have cached its value we need not recompute it. This is depicted in the tree by bold font and a lack of further branching out. Of the 18 nodes on this level, a total of 7 are such repeats; thus we need to continue the recursion for only 11 of them. Additional such pruning opportunities occur on the level of .

Upon computing the values for the entire tree, we find . Thus we evaluate a new potential solution with some test. Regardless of the measurement we observe, the for the resulting compressed history has already been computed, along with the -s it depends on and which of them is maximum; thus the optimal choice for the exploration mechanism is readily available. And the same is true for the remainder of the algorithm's run, as we just follow a path through the tree, dictated by the measurements we observe, and do not need to run any new recursive maximization.
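The bold-font caching in the figures amounts to memoizing the recursive value on multiset-equal compressed histories. A minimal Python sketch, under our own illustrative assumptions: pass/fail outcomes `V = (0, 1)`, a total of `M = 4` tests, and types encoded as (tests seen, observed sum) pairs (none of this is the article's notation):

```python
from functools import lru_cache
from statistics import mean

V = (0, 1)   # pass/fail outcome set (assumption)
M = 4        # total number of tests (assumption)

def esum(t):
    """Expected sum of a type t = (tests_seen, observed_sum): observed sum
    plus the mean outcome for every test not yet seen (assumed encoding)."""
    seen, s = t
    return s + (M - seen) * mean(V)

@lru_cache(maxsize=None)
def q(chist, budget):
    """Recursive max/average value for a compressed history (a sorted tuple
    of types) and remaining budget; lru_cache plays the role of the bold-font
    pruning: multiset-equal compressed histories are computed only once."""
    if not budget:
        # no interactions left: report the best expected sum available
        return max((esum(t) for t in chist), default=M * mean(V))
    best = float("-inf")
    # choices: retest a not-fully-evaluated type, or a new solution (0, 0)
    for t in {t for t in chist if t[0] < M} | {(0, 0)}:
        rest = list(chist)
        if t in rest:
            rest.remove(t)
        # average over the measurement we might observe
        kids = (tuple(sorted(rest + [(t[0] + 1, t[1] + v)])) for v in V)
        best = max(best, mean(q(k, budget - 1) for k in kids))
    return best
```

Because the cache keys on sorted tuples, histories differing only in multiset order hit the cache, and repeated runs on new metrics reuse everything computed so far, mirroring the amortization described above.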

Now suppose we need to run the algorithm again on a new problem instance (metric), still starting from the empty history and with budget . If for the first interaction we choose to evaluate (which could be the same as or different from the one we chose on our run on the previous metric) and we once again observe a 0, then everything we need to know to make the optimal exploration decisions has already been computed and we just follow a (possibly different) path through the same tree in Figure 1. However, if for the first interaction we observe a 1, then we are in a previously unassessed situation and need to determine which is larger, or . The recursive maximization tree for this is shown in Figure 2, also expanded only to the level of . It has fewer branches than the tree in Figure 1. This is because some of its subtrees are actually the same (modulo multiset order) as ones already assessed as part of the tree in Figure 1. These pruning points are once again marked in bold font and without further branching out. Once we perform the recursive maximization for this tree as well, any future runs of our algorithm on new metrics but still starting from the empty history and budget 5 will no longer require any such recursive maximization.

Thus, the cost of the optimal exploration mechanism gets amortized the more instances of the problem we need to solve. Nonetheless, for larger budgets and larger -s, the upfront cost may still be prohibitive. We thus investigate further ways to prune the recursive optimization tree.

### 4.2 Other Generic Pruning

The next set of results concerns pruning related to previously evaluated potential solutions whose expected sum is less than that of completely unevaluated potential solutions (i.e., worse than what we would expect before having seen any tests).

Such potential solutions are never a strictly better choice for the next interaction to evaluate. To express this, note that since the type of an interaction depends only on the potential solution involved in that interaction, we can extend the expected sum function to apply to types: . With that, we have:

The algorithmic implication is that at maximization nodes we can prune the child subtrees corresponding to types with expected sum at most .

Additionally, potential solutions with expected sum at most do not contribute anything to optimal performance:

Let Assumption ^{1} hold. We denote by the multiset formed only of those types in with expected sum strictly greater than , appearing in with same multiplicity as in .

From an algorithmic standpoint, this means that to the child subtrees that didn't get pruned we can pass down shorter compressed histories, which means lower branching factor for all the maximization nodes in those subtrees. Applying this type of pruning to the tree in Figure 1 we obtain the much smaller tree in Figure 3. Up to level , the number of internal nodes has gone down from 24 to 11; and instead of 29 placeholders () for there are only 7; the reduction factor for the full tree will further be impacted by pruning on the level . In general, this factor is likely to vary with and ; investigating this relationship is a subject for future work.

In addition to reducing the size of the recursive maximization tree, we can also reduce the algorithm's memory footprint: whenever the measurements observed so far make the expected sum of a potential solution less than or equal to , we can “drop from consideration” that potential solution. What we mean by this is that, in the process of dropping its type from our running compressed history, we can also drop all information associated with the solution: drop records of the tests it has seen, how many there were, their measurements, as well as the expected sum; we (may) still need to store the potential solution itself in a list of potential solutions not to attempt to evaluate again. In subsequent sections we will see additional situations in which we can take such memory-saving actions.
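A sketch of this baseline pruning, under our own illustrative assumptions (pass/fail outcomes, `M = 4` tests, types encoded as (tests seen, observed sum) pairs):

```python
from statistics import mean

V = (0, 1)   # outcome set (assumption)
M = 4        # total number of tests (assumption)
BASELINE = M * mean(V)   # expected sum of a completely unevaluated solution

def esum(t):
    """Expected sum of a type t = (tests_seen, observed_sum) (our encoding)."""
    seen, s = t
    return s + (M - seen) * mean(V)

def prune(chist):
    """Drop every type whose expected sum is at most the baseline: such types
    are never a strictly better choice and contribute nothing to optimal
    performance, so the subtrees below them can be skipped."""
    return tuple(t for t in chist if esum(t) > BASELINE)
```

Dropping a type here corresponds to the "drop from consideration" memory saving: once a solution's expected sum falls to the baseline, only its identity needs to be remembered, to avoid re-evaluating it.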

A subtler consequence of Propositions ^{7} and ^{8} is that some of the child subtrees of averaging nodes may be equivalent (in that they solve the same maximization), so we need to evaluate only one of them and use that value multiple times in the average. To see why that is, recall that . If is such that there are multiple -s in for which , then all the nodes for those -s solve the same maximization problem of computing .

For an example, consider , so and . Let be a history containing , (the specific value of is not important for this example). We have and consequently for any we have ; thus and instead of computing 7 respective subtrees we can compute only one.
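This collapsing of equivalent averaging-node children can be sketched as follows; the outcome set `V = (0, 1, 2, 3)`, the test count `M = 3`, and the (tests seen, observed sum) encoding of types are all our own illustrative assumptions:

```python
from collections import Counter
from statistics import mean

V = (0, 1, 2, 3)   # outcome set (assumption)
M = 3              # total number of tests (assumption)
BASELINE = M * mean(V)

def esum(t):
    seen, s = t
    return s + (M - seen) * mean(V)

def child(chist, t, v):
    """Pruned compressed history after testing type t and observing v; types
    falling to the baseline or below are dropped, as in Propositions 7-8."""
    rest = list(chist)
    if t in rest:
        rest.remove(t)
    new_t = (t[0] + 1, t[1] + v)
    if esum(new_t) > BASELINE:
        rest.append(new_t)
    return tuple(sorted(rest))

def distinct_children(chist, t):
    """Group the |V| children of an averaging node by the pruned history they
    produce; each group needs one subtree evaluation, weighted by its size."""
    return Counter(child(chist, t, v) for v in V)
```

In the test below, two of the four measurements push the new solution to the baseline or below, so they share a single pruned child and only three distinct subtrees remain.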

The pruning opportunities described so far may occur in any domain and for any algorithmic inputs (budget and starting history). Next, we investigate situations when even more drastic pruning can be done if certain properties hold for our combination of domain, budget, and history. We find that different kinds of pruning can be performed depending on the relationship between the budget and the total number of tests .

### 4.3 Budget Considerations

The initial interest in co-optimization was sparked by domains where the number of tests, , is so large and/or the metric so expensive to compute that we couldn't hope to fully evaluate even a single potential solution; that is, our budget would always be smaller than . But there are plenty of domains of interest where the number of tests is moderate and/or the computational complexity of the metric is manageable, so that we could afford budgets allowing full evaluation of multiple potential solutions; that is, would be larger than . Still, we would like to make the best use of our budget, and fully evaluating every potential solution we sample is likely suboptimal given the previously discussed free lunch and the description of the optimal exploration mechanism. This would be particularly important for domains with a very large potential solution space , where sampling many potential solutions may be desirable. Here are some domain examples illustrating the possible relationships between and .

Popovici et al. (2007) studied a ship-design domain with , as tests consisted of damage to combinations of 3 out of 42 ship compartments. A design was a placement of smart valves in the ship's piping network, and the resilience of a design to a damage, , was given by the portion of the ship still serviceable with water after valve closures isolated the damage. It was assumed that a control algorithm could perfectly determine the damage location based on sensor data, thus a “perfect-response resilience” could be computed via a fast graph traversal algorithm, and evaluating one design with *all* damages took under a second on a typical core. Since there are 86,400 seconds in one day, one core-day would allow full evaluation of about designs, the equivalent of about interactions. This is a case where is considerably larger than . The low cost of means that we could use full evaluation to reduce this problem to traditional optimization.^{14} However, designs is still a very small fraction of the full design space, which in that domain was ; by using more time and more cores, we could increase the number of designs by two or three orders of magnitude, yet still not make a significant dent in the search space. So we may still want to treat this as co-optimization in order to take advantage of free lunch and be able to explore more designs without fully evaluating at least some of them.

While perfect-response resilience is an insightful metric, control algorithms are often imperfect. Computing more realistic resilience in the presence of an imperfect control algorithm requires running a fluid dynamics simulation. This is a much more expensive type of metric: evaluating a design with a *single* damage can take from one to several minutes. Assuming one minute per interaction, one core-day gives us interactions, which is almost 10 times smaller than . If we have more time and more cores, we could shift into a situation with . Then again, if the simulation actually takes many more minutes, we would likely have to settle for .

Consider also the domain of designing sorting networks with as few gates as possible (Hillis, 1990). For networks handling sequences of length , the total number of tests is given by the number of all possible binary sequences of that length, .^{15} The metric is cheap: running an input sequence through a network and checking whether the output is sorted is very fast; a single core should be able to perform such evaluations at a rate several orders of magnitude higher than the simulation-based metrics discussed above. So if is small, we can afford a budget much larger than . But as grows, inevitably will become much smaller than .

We start by presenting pruning results for the case when is larger than .

### 4.4 Pruning for Large Budgets

If our initial budget was larger than , then at some point during the run of the algorithm we may have in our history some fully evaluated potential solutions. While these do not impact the branching factor of the recursive optimization tree, since they cannot be further tested, there are some algorithmic optimizations we can make with regard to memory requirements. For instance, note that for a fully evaluated potential solution there is no point in storing the tests it has seen: we know it has seen all of them and there won't be any more test-identity checks to be performed for it. One additional result is that if there are multiple such fully evaluated potential solutions, at most one of them—the one with the best expected sum—influences the performance and future decisions of the optimal exploration mechanism (Proposition ^{13}), so the rest of them can be dropped from consideration, as described in the previous section. To express this, we define the following ordering of types based on their expected sum:

We also need notation to split a compressed history into types for fully evaluated, no longer testable potential solutions and types for **partially evaluated**, still-testable potential solutions:

We partition a compressed history into two multisets, , where and , and the multiplicity of elements in is preserved in or , respectively.

Then our result of pruning fully evaluated potential solutions is as follows:

In words, of all fully evaluated potential solutions we drop all but the one with best expected sum, and we drop this one as well if there are at least still-testable potential solutions with a greater or equal expected sum. Note that if all fully evaluated potential solutions have an expected sum at most , then the second branch applies (due to (4.8)) and we drop them all, which is consistent with Proposition ^{8}. In fact, because of Corollary ^{10}, on the right-hand side of the equalities in Proposition ^{13} we can replace with .

Note that such memory-saving pruning may occur regardless of whether the *remaining* budget is small or large, but to reach a state with multiple fully evaluated potential solutions when starting from an empty history, the *initial* budget must have been large. We now turn our attention to pruning we can do at any point during the run of the algorithm if the budget still available at that point is large. Specifically, we are able to show that if the remaining budget is strictly greater than what's needed to fully evaluate all the partially evaluated potential solutions in the current history, then the next optimal choice is to evaluate an interaction of a new potential solution with some test.

The number of interactions needed to fully evaluate all the partially evaluated potential solutions in can be expressed as and our first “complete pruning” result is:

However, if we start from the empty history with a budget such that can be factorized as , , , then we will be able to run the optimal exploration mechanism for at least steps without the need for recursive optimization, before the remaining budget no longer satisfies the condition of Theorem ^{15}. We say “at least” rather than “exactly” due to Propositions ^{7} and ^{8}: if for some potential solution we observe measurements , which implies , we cannot simply add this to the running compressed history before proceeding to the next step. The remaining budget still goes down by 1, but the amount stays the same; thus it is more likely the inequality condition still holds. For an example, see Table 1.

Interaction . | . | . | . | . | . | . | . | . | . |
---|---|---|---|---|---|---|---|---|---|

1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | ||

Optimized | [] | ||||||||

0 | 1 | 1 | 1 | 2 | 3 | 3 | 4 | 4 | |

0 | 1 | 1 | 1 | 2 | 3 | 3 | 4 | 4 | |

0 | 3 | 3 | 3 | 6 | 9 | 9 | 12 | 12 | |

Remaining budget | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 |

Theorem ^{15} condition | true | true | true | true | true | true | true | true | false |
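The condition tracked in the last rows of Table 1 can be checked mechanically. A Python sketch, assuming types are encoded as (tests seen, observed sum) pairs and `M` denotes the total number of tests (both conventions are ours, for illustration):

```python
def needed_to_finish(chist, M):
    """Interactions required to fully evaluate every partially evaluated type,
    with types encoded as (tests_seen, observed_sum) pairs (our encoding)."""
    return sum(M - seen for seen, _ in chist if seen < M)

def large_budget_choice(chist, budget, M):
    """Theorem-15-style rule (sketch): if the remaining budget strictly
    exceeds the cost of fully evaluating all partial types, the optimal next
    interaction evaluates a NEW potential solution."""
    if budget > needed_to_finish(chist, M):
        return "new"
    return None  # condition fails: fall back to the recursive optimization
```

As in the table, each evaluated interaction decreases the budget by one while the "needed" quantity can grow more slowly, so the rule may apply for many consecutive steps.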


While this is a very handy result, after starting from the empty history with an initial budget greater than , we will eventually reach a state characterized by a compressed history and remaining budget such that . What then? As we show in the next section, there may be further situations for which we can perform pruning.

### 4.5 Pruning for Small Budgets

We now consider the other end of the spectrum, where the remaining budget is at most sufficient to fully evaluate a single one of the partially evaluated potential solutions (so definitely not enough to fully evaluate all of them and have budget left over, as in the previous section). The partially evaluated potential solutions that have seen the most tests are the ones that need the fewest additional tests to become fully evaluated, so we define:

If the remaining budget is strictly less than we do not have enough to fully evaluate even one partially evaluated potential solution; if exactly equal, then we can fully evaluate one but have no leftover budget. In such “small budget” cases, for domains where has a certain symmetry property, we can prune the choices down to a single one (Theorem ^{18}), so the optimal exploration mechanism need not assess the recursive optimization tree. For other domains, we show such pruning can still be done for nodes associated with certain algorithmic states (Theorem ^{19}). For the rest of the cases, we show results that help reduce the computational complexity of the recursive optimization (Propositions ^{20} and ^{22}). We start with the first and strongest result, applicable to domains whose set has the symmetry property described below:

We say the finite set is symmetric (around its own mean), and write it as , if .

For domains with symmetric around 0, in “small budget” situations the best choice is to evaluate a new interaction for a still-testable potential solution with the highest positive current (and expected) sum (this potential solution could be a partially evaluated one or a completely unevaluated one):

Suppose the domain has symmetric around 0 and the number of tests so large that even our initial budget is less than . If we start from an empty history, since , the condition of Theorem ^{18} is satisfied. Moreover, it will continue to be satisfied for the remainder of the algorithm: at each step the remaining budget decreases by one, and can increase by at most one (it can also stay the same or even decrease). Thus Theorem ^{18} will be applicable all throughout. This means that *in such domains, implementing the optimal algorithm is always fully tractable*, as it never requires running a recursive optimization!
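A sketch of this tractable rule, under our own illustrative assumptions: an outcome set `V = (-1, 0, 1)` symmetric around 0, `M = 10` tests, and types encoded as (tests seen, observed sum) pairs:

```python
from statistics import mean

V = (-1, 0, 1)   # outcome set, symmetric around 0 (assumption)
M = 10           # total number of tests (assumption)

def small_budget_choice(chist, budget):
    """Theorem-18-style rule (sketch): with V symmetric around 0 and a budget
    too small to fully evaluate even one partial solution with leftovers,
    test the solution with the highest positive expected sum, else a new one.
    Types are (tests_seen, observed_sum) pairs (our encoding)."""
    assert mean(V) == 0 and sorted(V) == sorted(-v for v in V)
    partial = [t for t in chist if t[0] < M]
    fewest_needed = min((M - seen for seen, _ in partial), default=M)
    if budget > fewest_needed:
        return None  # not a "small budget" state; the rule does not apply
    # with mean(V) = 0, the expected sum equals the observed sum
    best = max(partial, key=lambda t: t[1], default=None)
    if best is not None and best[1] > 0:
        return best      # keep testing this partially evaluated solution
    return "new"         # otherwise evaluate a completely new solution
```

Because the rule needs only a maximum over current sums, each decision costs a single linear scan of the compressed history, with no recursion at all.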

As for how common these domains may be, it was the very characteristic of an extremely large that initially sparked interest in co-optimization, so this situation is in fact of great practical interest. Additionally, note that a large is also what set co-optimization apart as more difficult than traditional optimization. It is thus interesting to see that, at least under the uniformity assumption, it is for very large that we may be able to guarantee the optimal algorithm has a tractable implementation.

Concerning symmetry around 0, as pointed out in Section 3.1, whenever we know we can subtract it from and obtain an equivalent problem but with , such that . So if the original is symmetric around its mean, then is symmetric around 0 and thus we can take advantage of Theorem ^{18}. Domains with a symmetric are not that uncommon: for instance, any equally spaced discretization of some continuous interval is symmetric around its mean. Of course, Theorem ^{18} does not require equal spacing; however, it is unclear how often we might encounter domains with a that is symmetric but does not have equally spaced values. Symmetric is also a characteristic of domains with binary or rank-type outcomes, which are of practical interest. Thus, Theorem ^{18} is an important positive result.
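The mean-centering transformation mentioned above is straightforward; a minimal sketch:

```python
from statistics import mean

def center(V):
    """Shift the outcome set so its mean becomes 0; this yields an equivalent
    problem, and a set symmetric around its mean becomes symmetric around 0."""
    m = mean(V)
    return tuple(v - m for v in V), m
```

For instance, the equally spaced set (0, 1, 2, 3, 4) centers to (-2, -1, 0, 1, 2), which is symmetric around 0, so Theorem ^{18} applies to the transformed problem.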

What of domains with asymmetric ? We show that we can still perform the same type of pruning down to one choice for some specific algorithmic states: if there is a large enough difference between the expected sums of the first two testable types in the compressed history's -ordering, then testing the first such type is the best choice.

Note we also require , but this is not much of a restriction; as noted above, we can transform by subtracting its average from each of its elements to obtain an equivalent optimization problem where the average is 0. More problematic is the condition on the difference of expected sums, which may not be satisfied very often. In fact, when starting from an empty history, the condition will most likely become false pretty soon: the optimal exploration mechanism keeps evaluating a new potential solution until we observe ; then the compressed history becomes and we have , if the remaining budget at that point is still at least 1.^{16} Nonetheless, Theorem ^{19} may still be useful for reducing computational complexity while performing recursive optimization, as it may lead to pruning some subtrees.

We present two additional results concerning pruning for the purpose of reducing computational complexity when recursive optimization cannot be avoided. The first does not directly perform pruning, but rather increases the chances of being able to prune due to caching of results, as described in Section 4.1. Specifically, for domains with , the exact number of tests a potential solution has seen is irrelevant if it is small enough:

This means that certain compressed histories that otherwise appear distinct are in fact equivalent from the perspective of their value, and thus we can avoid computing their respective subtrees and instead take advantage of cached results. Note also that we do not need all still-testable types to have ; we can apply Proposition ^{20} selectively, only to those that do.

Last but not least, we show that we can prune the number of choices competing for optimality, as well as the compressed history, to at most types, namely those types with largest expected sums in the -ordering, forming a *prefix* of a compressed history as follows:

With that, pruning to at most choices is expressed as follows:
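The prefix operation can be sketched as follows, again under our illustrative assumptions (outcome set `(0, 1)`, `M = 6` tests, (tests seen, observed sum) type encoding):

```python
from statistics import mean

V = (0, 1)   # outcome set (assumption)
M = 6        # total number of tests (assumption)

def esum(t):
    """Expected sum of a type t = (tests_seen, observed_sum) (our encoding)."""
    seen, s = t
    return s + (M - seen) * mean(V)

def prefix(chist, k):
    """First k types of the compressed history in decreasing expected-sum
    order; Proposition-22-style pruning keeps only these as candidate choices
    (ties broken arbitrarily here, for illustration)."""
    return tuple(sorted(chist, key=esum, reverse=True)[:k])
```

Only the retained prefix competes for optimality, so both the choice set and the history passed to subtrees shrink.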

Of course, it is advantageous to combine the “small-budget” type of pruning presented in this section with the pruning of tuples whose expected sum is too small, as previously described in Sections 4.2 and 4.4. Proposition ^{13} means we may be able to completely drop the nontestable tuple and Corollary ^{10} means we can replace all occurrences of with (i.e., drop the testable tuples whose expected sum is at most ). Consequently, it is in fact only the maximum number of tests among *testable tuples whose expected sum is strictly greater than * (i.e., ) that needs to be less than or equal to in order for Theorems ^{18} and ^{19} and Proposition ^{22} to apply.

Proposition ^{22} leads to further pruning of choices only if the number of testable tuples left, which equals , is greater than .^{17} If is exactly , then we are pruning only the choice