We study the problem of stochastic multiple-arm identification, where an agent sequentially explores a size-$K$ subset of arms (also known as a super arm) from $n$ given arms and tries to identify the best super arm. Most work so far has considered the semi-bandit setting, where the agent can observe the reward of each pulled arm, or has assumed that each individual arm can be queried at each round. However, in real-world applications, it is costly, or sometimes impossible, to observe a reward of individual arms. In this study, we tackle the full-bandit setting, where only a noisy observation of the total sum of a super arm is given at each pull. Although our problem can be regarded as an instance of best arm identification in linear bandits, a naive approach based on linear bandits is computationally infeasible since the number of super arms is exponential. To cope with this problem, we first design a polynomial-time approximation algorithm for a 0-1 quadratic programming problem arising in confidence ellipsoid maximization. Based on our approximation algorithm, we propose a bandit algorithm that runs in polynomial time, thereby achieving an exponential speedup over naive linear bandit algorithms. We provide a sample complexity upper bound that is still worst-case optimal. Finally, we conduct experiments on large-scale data sets with an enormous number of super arms, demonstrating the superiority of our algorithms in terms of both the computation time and the sample complexity.
The stochastic multiarmed bandit (MAB) is a classical decision-making model that characterizes the trade-off between exploration and exploitation in stochastic environments (Lai & Robbins, 1985). While the best-studied objective is to minimize the cumulative regret or maximize the cumulative reward (Bubeck & Cesa-Bianchi, 2012; Cesa-Bianchi & Lugosi, 2006), another popular objective is to identify the best arm with the maximum expected reward from the given arms. This problem, called pure exploration or best arm identification in the MAB, has received much attention (Audibert & Bubeck, 2010; Chen & Li, 2015; Even-Dar, Mannor, & Mansour, 2002, 2006; Jamieson, Malloy, Nowak, & Bubeck, 2014; Kaufmann, Cappé, & Garivier, 2016).
An important variant of the MAB is the multiple-play MAB problem (MP-MAB), in which the agent pulls $K$ different arms at each round. In many application domains, we need to make a decision to take multiple actions among a set of all possible choices. For example, in online advertisement auctions, companies want to choose multiple keywords to promote their products to consumers based on their search queries (Rusmevichientong & Williamson, 2006). From millions of available choices, a company aims to find the most effective set of keywords by observing the historical performance of the chosen keywords. This decision making is formulated as the MP-MAB, where each arm corresponds to a keyword. In addition, the MP-MAB has further applications, such as channel selection in cognitive radio networks (Huang, Liu, & Ding, 2008), ranking web documents (Radlinski, Kleinberg, & Joachims, 2008), and crowdsourcing (Zhou, Chen, & Li, 2014). Owing to these various applications, the MP-MAB has received much attention, and several algorithms have been proposed for regret minimization (Agrawal, Hegde, & Teneketzis, 1990; Anantharam, Varaiya, & Walrand, 1987; Komiyama, Honda, & Nakagawa, 2015; Lagrée, Vernade, & Cappe, 2016). The adversarial case has also been studied in the literature (Cesa-Bianchi & Lugosi, 2012; Combes, Talebi Mazraeh Shahi, Proutiere, & Lelarge, 2015).
In this paper, we study the multiple-arm identification problem, which corresponds to pure exploration in the MP-MAB. In this problem, the goal is to find the size-$K$ subset (a super arm) with the maximum expected reward. The problem is also called top-$K$ selection or $K$-best arm identification and has been extensively studied recently (Bubeck, Wang, & Viswanathan, 2013; Cao, Li, Tao, & Li, 2015; Gabillon, Ghavamzadeh, & Lazaric, 2012; Gabillon, Ghavamzadeh, Lazaric, & Bubeck, 2011; Kalyanakrishnan & Stone, 2010; Kalyanakrishnan, Tewari, Auer, & Stone, 2012; Kaufmann & Kalyanakrishnan, 2013; Chaudhuri & Kalyanakrishnan, 2017, 2019; Zhou et al., 2014). This prior work has considered the semi-bandit setting, in which we can observe a reward of each single arm in the pulled super arm, or has assumed that a single arm can be queried. However, in many application domains, it is costly to observe a reward of individual arms, or we cannot access feedback from individual arms at all. For example, in crowdsourcing, we often obtain many labels given by crowdworkers, but it is costly to compile the labels by labeler. Furthermore, in software projects, an employer may have complicated tasks that need multiple workers, in which case the employer can evaluate only the quality of a completed task rather than a single worker's performance (Retelny et al., 2014; Tran-Thanh, Stein, Rogers, & Jennings, 2014). In such scenarios, we wish to extract expert workers who can perform the task with high quality from sequential access to the quality of tasks completed by multiple workers.
In this study, we tackle multiple-arm identification with full-bandit feedback, where only a noisy observation of the total sum of a super arm is given at each pull rather than a reward of each pulled single arm. This setting is more challenging since estimators of the expected rewards of single arms are no longer independent of each other. To solve this problem, one might use an algorithm for the noncombinatorial top-$K$ identification problem based on the idea that the difference in mean reward between two single arms $i$ and $j$ can be estimated by pulling the super arms $S \cup \{i\}$ and $S \cup \{j\}$ for a fixed size-$(K-1)$ set $S$. However, such an approach fails to reduce the number of samples, as shown in section 6, since it cannot fully exploit the information from $K$ arms at each pull.
We can see our problem as an instance of pure exploration in linear bandits, which has received increasing attention (Lattimore & Szepesvari, 2017; Soare, Lazaric, & Munos, 2014; Tao, Blanco, & Zhou, 2018; Xu, Honda, & Sugiyama, 2018). In linear bandits, each arm has its own feature vector, while in our problem, each super arm $S$ can be associated with its indicator vector $\chi_S$. Most linear bandit algorithms, however, have time complexity at least proportional to the number of arms. Therefore, a naive use of them is computationally infeasible, since the number of super arms is exponential. A modicum of research on linear bandits has addressed the time complexity (Jun, Bhargava, Nowak, & Willett, 2017; Tao et al., 2018). Jun et al. (2017) proposed efficient algorithms for regret minimization whose per-round time complexity is sublinear in the number of arms. Nevertheless, in our setting, they still have to spend time polynomial in the number of super arms, which is exponential in $K$. Thus, to perform multiple-arm identification with full-bandit feedback in practice, the computational infeasibility needs to be overcome, since fast decisions are required in real-world applications.
In this study, we design algorithms that are efficient in terms of both the time complexity and the sample complexity. Our contributions are summarized as follows:
1. We propose a polynomial-time approximation algorithm (algorithm 1) for an NP-hard 0-1 quadratic programming problem arising in confidence ellipsoid maximization. In the design of the approximation algorithm, we utilize algorithms for a classical combinatorial optimization problem called the densest $k$-subgraph problem (DkS) (Feige, Peleg, & Kortsarz, 2001). Importantly, we provide a theoretical guarantee for the approximation ratio of our algorithm (theorem 2).
2. Based on our approximation algorithm, we propose a bandit algorithm (algorithm 2) that runs in polynomial time (theorem 3) and provide an upper bound of its sample complexity (theorem 4) that is still worst-case optimal. This result means that our algorithm achieves an exponential speedup over naive linear bandit algorithms while keeping statistical efficiency. Moreover, we design two heuristic algorithms that empirically perform well: algorithm 3, which employs the first-order approximation of confidence ellipsoids, and algorithm 4, which is based on the lower-upper confidence-bound algorithm.
3. We conduct a series of experiments on both synthetic and real-world data sets. First, we run our proposed algorithms on synthetic data sets and verify that they give good approximations compared with an exhaustive search algorithm. Next, we evaluate our algorithms on large-scale crowdsourcing data sets with an enormous number of super arms, demonstrating the superiority of our algorithms in terms of both time complexity and sample complexity.
Note that the multiple-arm identification problem is a special class of the combinatorial pure exploration, where super arms follow certain combinatorial constraints such as paths, matchings, or matroids (Cao & Krishnamurthy, 2017; Chen, Gupta, & Li, 2016; Chen, Gupta, Li, Qiao, & Wang, 2017; Chen, Lin, King, Lyu, & Chen, 2014; Gabillon, Lazaric, Ghavamzadeh, Ortner, & Bartlett, 2016; Huang, Ok, Li, & Chen, 2018; Perrault, Perchet, & Valko, 2019). We can also design a simple algorithm (algorithm 5 in appendix A) for the combinatorial pure exploration under general constraints with full-bandit feedback, which results in a looser but general sample complexity bound. All proofs in this paper are given in appendix F.
2.1 Problem Definition
Let $[n] = \{1, \ldots, n\}$ for an integer $n$. For a vector $x \in \mathbb{R}^n$ and a positive-definite matrix $A \in \mathbb{R}^{n \times n}$, let $\|x\|_A = \sqrt{x^\top A x}$. For a vector $\theta \in \mathbb{R}^n$ and a subset $S \subseteq [n]$, we define $\theta(S) = \sum_{s \in S} \theta_s$. Now we describe the problem formulation formally. Suppose that there are $n$ single arms associated with unknown reward distributions $\phi_1, \ldots, \phi_n$. The reward from $\phi_s$ for each single arm $s$ is expressed as $X_s = \theta_s + \epsilon_s$, where $\theta_s$ is the expected reward and $\epsilon_s$ is zero-mean noise bounded in a known interval. The agent chooses a size-$K$ subset from the $n$ single arms at each round, for an integer $K < n$. In the well-studied semi-bandit setting, the agent pulls a subset $S$ and then can observe $X_s$ for each $s \in S$, independently sampled from the associated unknown distribution $\phi_s$. However, in the full-bandit setting, she can observe only the sum of rewards $\sum_{s \in S} X_s$ at each pull, which means that estimators of the expected rewards of single arms are no longer independent of each other.
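The two feedback models can be contrasted with a toy simulator (the means, noise range, and function names below are hypothetical illustrations, not the paper's code):

```python
import random

def pull_semi_bandit(theta, S, sigma=0.1, rng=random):
    """Semi-bandit feedback: one noisy reward per single arm in S."""
    return {s: theta[s] + rng.uniform(-sigma, sigma) for s in S}

def pull_full_bandit(theta, S, sigma=0.1, rng=random):
    """Full-bandit feedback: only the noisy sum of rewards over S."""
    return sum(theta[s] for s in S) + rng.uniform(-sigma, sigma)

theta = [0.9, 0.8, 0.3, 0.2]    # hypothetical expected rewards
S = {0, 1}                      # a super arm of size K = 2
observation = pull_full_bandit(theta, S)   # a single scalar per pull
```

In the full-bandit case a pull reveals one number for the whole super arm, which is why per-arm estimates become correlated.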
We call a size-$K$ subset of single arms a super arm. We define a decision class $\mathcal{C}$ as the finite set of super arms that satisfy the size constraint, $\mathcal{C} = \{S \subseteq [n] : |S| = K\}$; thus, the size of the decision class is given by $\binom{n}{K}$. Let $S^*$ be the optimal super arm in the decision class $\mathcal{C}$: $S^* = \operatorname{arg\,max}_{S \in \mathcal{C}} \theta(S)$. In this letter, we focus on the $(\epsilon, \delta)$-PAC setting, where the goal is to design an algorithm that outputs a super arm $\hat{S} \in \mathcal{C}$ satisfying $\theta(S^*) - \theta(\hat{S}) \le \epsilon$ with probability at least $1 - \delta$, for $\epsilon > 0$ and $\delta \in (0, 1)$. An algorithm is called $(\epsilon, \delta)$-PAC if it satisfies this condition. In this fixed confidence setting, the agent's performance is evaluated by her sample complexity: the number of rounds until the agent terminates.
2.2 Confidence Bound
In order to handle full-bandit feedback, we utilize approaches for best arm identification in linear bandits. In best arm identification problems, the agent sequentially estimates the unknown parameter $\theta$ from past observations and checks whether the estimation error is small. We introduce the necessary notation as follows. Let $(S_1, S_2, \ldots)$ be a sequence of super arms and $(r_1, r_2, \ldots)$ be the corresponding sequence of observed rewards. Let $\chi_S \in \{0, 1\}^n$ denote the indicator vector of super arm $S$; for each $s \in [n]$, $\chi_S(s) = 1$ if $s \in S$ and $\chi_S(s) = 0$ otherwise.
In our problem, a confidence bound for the least-squares estimator of $\theta$ (proposition 1) holds. Two allocation strategies, named G-allocation and $\mathcal{XY}$-allocation, are discussed in Soare et al. (2014). Approximating the optimal G-allocation can be done via convex optimization and an efficient rounding procedure, and the $\mathcal{XY}$-allocation can be computed in a similar manner (see appendix D for details).
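As a toy illustration of the least-squares machinery behind these strategies, the sketch below recovers $\theta$ from full-bandit observations of indicator vectors; the small ridge term `lam` is our own assumption, added only for numerical invertibility:

```python
def estimate_theta(pulls, rewards, n, lam=1e-6):
    """Ridge-regularized least squares over indicator vectors (pure Python).

    pulls: list of super arms (each a set of arm indices)
    rewards: observed reward sums, one scalar per pull
    """
    # A = lam*I + sum_t chi_t chi_t^T ; b = sum_t r_t chi_t
    A = [[lam * (i == j) for j in range(n)] for i in range(n)]
    b = [0.0] * n
    for S, r in zip(pulls, rewards):
        for i in S:
            b[i] += r
            for j in S:
                A[i][j] += 1.0
    # Solve A x = b by Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda row: abs(A[row][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for row in range(col + 1, n):
            f = A[row][col] / A[col][col]
            for j in range(col, n):
                A[row][j] -= f * A[col][j]
            b[row] -= f * b[col]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))) / A[i][i]
    return x

# Noiseless demo: the pulls {0,1}, {0,2}, {1,2} span R^3, so theta is identified.
est = estimate_theta([{0, 1}, {0, 2}, {1, 2}], [0.7, 0.6, 0.3], n=3)
```

The point is that super-arm pulls alone determine $\theta$ as long as the pulled indicator vectors span the space, which is exactly what allocation strategies aim to achieve efficiently.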
2.3 Computational Hardness
3 Confidence Ellipsoid Maximization
Notice that QP can be seen as an instance of the uniform quadratic knapsack problem, which is known to be NP-hard (Taylor, 2016); few polynomial-time approximation results are known even for special cases (see appendix C for details).
In this study, by utilizing algorithms for a classical combinatorial optimization problem called the densest $k$-subgraph problem (DkS), we design an approximation algorithm that admits a theoretical performance guarantee for QP with a positive-definite matrix $A$. The definition of the DkS is as follows. Let $G = (V, E)$ be an undirected graph with nonnegative edge weights $w(e) \ge 0$ for $e \in E$.
For a vertex set $V' \subseteq V$, let $E(V')$ be the set of edges in the subgraph induced by $V'$. We denote by $w(V')$ the sum of the edge weights in the subgraph induced by $V'$: $w(V') = \sum_{e \in E(V')} w(e)$. In the DkS, given $G$ and a positive integer $k$, we are asked to find $V' \subseteq V$ with $|V'| = k$ that maximizes $w(V')$. Although the DkS is NP-hard, there is a variety of polynomial-time approximation algorithms (Asahiro, Iwama, Tamaki, & Tokuyama, 2000; Bhaskara, Charikar, Chlamtac, Feige, & Vijayaraghavan, 2010; Feige et al., 2001). The current best approximation result for the DkS achieves an approximation ratio of $n^{-(1/4 + \epsilon)}$ for any constant $\epsilon > 0$ (Bhaskara et al., 2010). A direct reduction of QP to the DkS results in an instance with arbitrary (possibly negative) edge weights. Existing algorithms cannot be used for such an instance, since they assume that all edge weights are nonnegative.
Now we present our algorithm for QP, which is detailed in algorithm 1. The algorithm operates in two steps. In the first step, it constructs an $n$-vertex complete graph $G$ from the given symmetric matrix $A = (a_{ij})$: for each pair of vertices $i, j$, the edge weight $w(\{i, j\})$ is determined from the corresponding entries of $A$. Note that if $A$ is positive definite, $w(\{i, j\}) \ge 0$ holds for every pair, which means that $G$ is an instance of the DkS (lemma 10 in appendix F). In the second step, the algorithm accesses the densest $k$-subgraph oracle (DkS-Oracle), which accepts $G$ as input and returns in polynomial time an approximate solution for the DkS. Note that we can use any polynomial-time approximation algorithm for the DkS as the DkS-Oracle. Let $\gamma$ be the approximation ratio of the algorithm employed by the DkS-Oracle. By a careful analysis of the approximation ratio of algorithm 1, we have the following theorem.
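The precise edge weights used by algorithm 1 are specified in the paper; as a purely illustrative choice (our assumption, not necessarily the paper's), one can set $w(\{i, j\}) = a_{ii} + a_{jj} + 2a_{ij}$, which equals the quadratic form $(e_i + e_j)^\top A\, (e_i + e_j)$ and is therefore positive whenever $A$ is positive definite:

```python
def qp_to_dks_graph(A):
    """Build a complete weighted graph from a symmetric matrix A.

    Hypothetical weight choice: w({i,j}) = a_ii + a_jj + 2*a_ij,
    which equals (e_i + e_j)^T A (e_i + e_j) and hence is positive
    whenever A is positive definite.
    """
    n = len(A)
    return {(i, j): A[i][i] + A[j][j] + 2 * A[i][j]
            for i in range(n) for j in range(i + 1, n)}

A = [[2.0, -0.5], [-0.5, 1.0]]   # a positive-definite example
w = qp_to_dks_graph(A)           # all weights nonnegative
```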
For QP with any positive-definite matrix $A$, algorithm 1 with a $\gamma$-approximation DkS-Oracle is an approximation algorithm whose ratio depends only on $\gamma$ and the ratio of $\lambda_{\min}$ to $\lambda_{\max}$, where $\lambda_{\min}$ and $\lambda_{\max}$ represent the minimum and maximum eigenvalues of $A$, respectively.
4 Main Algorithm
Based on the approximation algorithm proposed in the previous section, we propose two algorithms for multiple-arm identification with full-bandit feedback. Note that we assume $K \ge 2$, since multiple-arm identification with $K = 1$ is the same as the best arm identification problem of the MAB.
4.1 Proposed Algorithm Based on Static Allocation
First, we deal with static allocation strategies, which sequentially sample super arms from a fixed sequence. In general, adaptive strategies will perform better than static ones, but due to the computational hardness, we focus on static ones to analyze the worst-case optimality in terms of the minimum gap $\Delta_{\min}$. In static algorithms, the agent pulls a super arm from a fixed set of super arms until a certain stopping condition is satisfied. Therefore, it is important to construct a stopping condition guaranteeing, as quickly as possible, that the estimate belongs to a set of parameters for which the empirical best super arm is an optimal super arm.
Let $\tau$ be the computation time of the DkS-Oracle. Then at any round $t$, SAQM (algorithm 2) runs in time polynomial in $n$ and $\tau$.
Most existing approximation algorithms for the DkS are computationally efficient. For example, if we employ the algorithm by Feige et al. (2001) as the DkS-Oracle in algorithm 1, the running time of SAQM is dominated by matrix multiplication, whose exponent is the matrix multiplication constant (see Le Gall, 2014). If we employ the greedy algorithm by Asahiro et al. (2000), the running time of SAQM is also polynomial.
It is worth mentioning that if we have an approximation algorithm for CEM with a more general decision class (such as paths, matchings, or matroids), we obtain the same form of sample complexity bound as in theorem 4 for combinatorial pure exploration (CPE) with general constraints. For the top-$K$ identification setting considered here, theorem 4 holds with the approximation ratio given in theorem 2.
Soare et al. (2014) considered the oracle sample complexity of the linear best arm identification problem. The oracle complexity, which is based on the optimal allocation strategy derived from the true parameter $\theta$, scales as $1/\Delta_{\min}^2$ if we ignore the terms that do not depend on the gap. Soare et al. (2014) showed that the sample complexity of the G-allocation strategy matches the oracle sample complexity up to constants in the worst case. The sample complexity of SAQM is also worst-case optimal in the sense that it matches the oracle sample complexity up to constants, while SAQM runs in polynomial time.
Note that if we use proposition 2, which is specified in section 5, instead of proposition 1, we obtain a better sample complexity bound in some respects. However, the sample complexity with proposition 2 becomes complicated, since it depends on the regularization parameter $\lambda$. Therefore, we analyze the sample complexity bound based on proposition 1 to clarify how the sample complexity depends on the key problem parameters such as $K$ and $\Delta_{\min}$.
Although the quantity is unbounded in the worst case, it is bounded, as pointed out in Soare et al. (2014), if we use the G-allocation strategy as the sampling strategy. However, the exact G-allocation might be hard to compute if $n$ is large, since the number of variables can be exponential.
We also note that under some mild conditions, we have an upper bound on the condition number of the design matrix.
Suppose that SAQM (algorithm 2) employs a sampling strategy in which the agent pulls a size-$K$ set containing single arm $s$ with equal probability for every $s \in [n]$. Then the condition number of the resulting design matrix satisfies an upper bound depending only on $n$ and $K$.
5 Heuristic Algorithms
In this section, we propose two algorithms that output the optimal super arm but do not have a sample complexity bound. The first algorithm employs the first-order approximation of confidence ellipsoids to check the stopping condition efficiently. The second one is based on an adaptive sampling strategy. We evaluated both algorithms empirically and observed that they perform well.
5.1 First-Order Approximation for Confidence Ellipsoid Maximization
5.2 Adaptive Algorithm
In the previous section, using our approximation scheme for confidence ellipsoid maximization, we proposed algorithms based on static allocation strategies. However, in order to design a near-optimal sampling algorithm, we should focus on adaptive algorithms, which adaptively change the arm selection strategy based on past observations at every round. In this section, we propose an adaptive algorithm making use of existing classical methods.
We propose an algorithm named CLUCB-QM that employs an adaptive strategy based on the combinatorial lower-upper confidence bound (CLUCB) algorithm (Chen et al., 2014). The CLUCB algorithm was originally designed for semi-bandit settings. We can modify it for full-bandit settings: whereas the original algorithm queries one single arm at a time, we instead query two super arms that differ in exactly one single arm. However, with the original stopping condition used in Chen et al. (2014), the algorithm works poorly in terms of the number of samples, as we will observe in our experiments. In the CLUCB-QM we propose, we use the same stopping criterion as in SAQM, which maintains the least-squares estimator and involves confidence ellipsoid maximization.
The entire procedure of CLUCB-QM is detailed in algorithm 4. The algorithm maintains a confidence radius for each single arm $s$. The adjusted estimate penalizes single arms belonging to the current empirical best super arm and encourages exploring single arms outside it. CLUCB-QM chooses the single arm $p_t$ that has the largest confidence radius in the symmetric difference between the empirical best super arm and its strongest contender. The algorithm then queries two super arms with a one-arm difference: CLUCB-QM pulls a super arm $S$ such that $p_t \in S$, and then pulls the super arm obtained from $S$ by replacing $p_t$ with a fixed single arm. Since CLUCB-QM employs stopping condition 4.1 as in SAQM, we can prove that the CLUCB-QM algorithm outputs an $\epsilon$-optimal super arm with probability at least $1 - \delta$. Although we do not have a sample complexity bound for CLUCB-QM, it performs better than SAQM in our experimental results.
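The one-arm-difference trick behind this query scheme can be sketched as follows (toy code with hypothetical means; `estimate_gap` and the averaging scheme are illustrative, not the paper's implementation):

```python
import random

def estimate_gap(pull, s, s0, base, num_samples=1000):
    """Estimate theta_s - theta_{s0} using only full-bandit feedback,
    by pulling base|{s} and base|{s0}: two super arms that differ in
    exactly one single arm, so the common part cancels in expectation."""
    a = sum(pull(base | {s}) for _ in range(num_samples)) / num_samples
    b = sum(pull(base | {s0}) for _ in range(num_samples)) / num_samples
    return a - b

theta = [0.9, 0.6, 0.4, 0.1]   # hypothetical expected rewards
noisy_pull = lambda S: sum(theta[i] for i in S) + random.uniform(-0.05, 0.05)
gap = estimate_gap(noisy_pull, s=0, s0=3, base={1, 2})   # close to 0.8
```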
In this section, we evaluate the empirical performance of our algorithms: SAQM (algorithm 2), SA-FOA (algorithm 3), and CLUCB-QM (algorithm 4). We implement a baseline algorithm, ICB (algorithm 5 in appendix A), which works in polynomial time. ICB employs simplified confidence bounds obtained by diagonal approximation of confidence ellipsoids. Note that ICB can solve the combinatorial pure exploration problem with general constraints and admits its own sample complexity bound (lemma 9 in appendix A).
To verify the effectiveness of our novel stopping rule that uses confidence ellipsoids, we implement CLUCB (algorithm 6 in appendix E.1); its stopping rule is different from CLUCB-QM's, but its sampling rule is the same. CLUCB estimates the gap between each single arm and a fixed arm by pulling two super arms with a one-arm difference. Like CLUCB-QM, CLUCB is based on the lower-upper confidence-bound algorithm for general constraints proposed by Chen et al. (2014), but it employs a stopping condition that does not require confidence ellipsoid maximization. Notice that CLUCB is $(\epsilon, \delta)$-PAC (see its sample complexity in appendix E.1).
We implement two baseline algorithms that invoke noncombinatorial top-$K$ identification algorithms: one is the elimination-based algorithm ME (algorithm 7 with ME-Subroutine algorithm 8 in appendix E.2); the other is the confidence-bound-based algorithm LUCB (algorithm 7 with LUCB-Subroutine algorithm 9 in appendix E.2). ME employs the median elimination algorithm by Kalyanakrishnan and Stone (2010) for the noncombinatorial setting as a subroutine with a simple modification: we sample $S \cup \{s\}$ when base arm $s$ should be sampled, for a fixed size-$(K-1)$ subset $S$. LUCB is a counterpart of ME that employs the lower-upper confidence-bound algorithm proposed by Kalyanakrishnan et al. (2012). Note that LUCB and ME are also $(\epsilon, \delta)$-PAC (see appendix E for their sample complexity).
We compare our algorithms with two exponential time algorithms, SA-Ex and CLUCB-Ex, which reduce our problem to the pure exploration problem in the linear bandit (see appendix E.3 for details). We conduct the experiments on small synthetic data sets and large-scale real-world data sets.
All experiments were conducted on a MacBook with a 1.3 GHz Intel Core i5 and 8 GB of memory. All code was implemented in Python. In all experiments, we employed the approximation algorithm called greedy peeling (Asahiro et al., 2000) as the DkS-Oracle. Specifically, the greedy peeling algorithm iteratively removes a vertex with the minimum weighted degree in the currently remaining graph until we are left with a subset of vertices of size $k$. The algorithm runs in polynomial time.
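The greedy peeling procedure can be written in a few lines (a sketch with a graph stored as a dict of edge weights; variable names are ours):

```python
def greedy_peeling(n, w, k):
    """Iteratively remove the vertex with minimum weighted degree until
    k vertices remain (Asahiro et al., 2000).

    w: dict mapping pairs (i, j) with i < j to nonnegative edge weights.
    """
    remaining = set(range(n))
    deg = {v: 0.0 for v in remaining}
    for (i, j), wij in w.items():
        deg[i] += wij
        deg[j] += wij
    while len(remaining) > k:
        v = min(remaining, key=deg.get)    # minimum weighted degree
        remaining.remove(v)
        for u in remaining:                # update degrees of neighbors
            key = (min(u, v), max(u, v))
            deg[u] -= w.get(key, 0.0)
    return remaining

# A heavy triangle {0, 1, 2} plus a lightly connected vertex 3.
w = {(0, 1): 3.0, (0, 2): 3.0, (1, 2): 3.0,
     (0, 3): 0.1, (1, 3): 0.1, (2, 3): 0.1}
dense = greedy_peeling(4, w, k=3)   # the heavy triangle survives
```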
6.1 Synthetic Data Sets
To see the dependence of the performance on the minimum gap $\Delta$, we generate synthetic instances as follows. We first set the expected rewards of the top-$K$ single arms uniformly at random from $[0, 1]$. Let $\theta_K$ be the minimum expected reward among the top-$K$ single arms. We set the expected reward of the $(K+1)$th best single arm to $\theta_K - \Delta$ for a predetermined parameter $\Delta$. Then we generate the expected rewards of the remaining single arms by uniform samples from $[0, \theta_K - \Delta]$, so that the expected reward of the best super arm is larger than those of the rest of the super arms by at least $\Delta$. The additive noise is zero-mean. In all instances, the parameters $\epsilon$ and $\delta$ are fixed. For the regularization parameter $\lambda$ in CLUCB-QM, we set it as in Xu et al. (2018). SAQM, SA-FOA, and ICB employ the G-allocation strategy. In SA-FOA, the approximation parameter is set to 2 for all experiments.
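The generation procedure above can be sketched as follows (a simplified sketch; the exact parameter values and noise used in the experiments are not reproduced, and `make_instance` is our own name):

```python
import random

def make_instance(n, K, gap, rng=random):
    """Expected rewards where the K-th best arm exceeds the (K+1)-th best
    by exactly `gap`; the remaining arms are drawn below that level."""
    top = sorted((rng.uniform(gap, 1.0) for _ in range(K)), reverse=True)
    floor = top[-1] - gap                  # reward of the (K+1)-th best arm
    rest = sorted((rng.uniform(0.0, floor) for _ in range(n - K)),
                  reverse=True)
    rest[0] = floor                        # pin the (K+1)-th best reward
    return top + rest                      # rewards in descending order

theta = make_instance(n=8, K=3, gap=0.1)
```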
6.2 Approximation Error
First, we examine the approximation precision of our approximation algorithms. The results are reported in Figure 1. SAQM and SA-FOA employ approximation mechanisms to test the stopping condition in polynomial time. Recall that SAQM approximately solves CEM in equation 2.4, and SA-FOA approximately solves the maximization problem in equation 5.1. We set up small experiments in which the optimal objective value can be computed exactly by exhaustive search, and run them for a small-gap setting and a large-gap setting. We plot the approximation ratio and the additive approximation error of SAQM and SA-FOA over the first 100,000 rounds. From the results, we can see that their approximation ratios are almost always greater than 0.9, which is far better than the worst-case guarantee proved in theorem 2. In particular, the approximation ratio of SA-FOA in the small-gap case is surprisingly good (around 0.95) and grows as the number of rounds increases. This result implies that there is only a slight increase in the sample complexity caused by the approximation, especially when the expected rewards of single arms are close to each other.
6.3 Running Time
Next, we conduct experiments to compare the running time of the algorithms on synthetic data sets. We report the computation time per round in Figure 2. Since ME is an elimination-based algorithm, we report its per-round time as the overall time divided by the number of samples required by the algorithm. As can be seen, SA-Ex and CLUCB-Ex are prohibitive on instances with a large number of super arms, while our algorithms run fast even as the number of super arms grows, which matches our theoretical analysis. The results indicate that polynomial-time algorithms are of crucial importance for practical use.
6.4 Number of Samples
Finally, we evaluate the number of samples required to identify the best super arm for varying $\Delta$. The result is shown in Figure 3. We observed that our algorithms always output the optimal super arm. The result indicates that the numbers of samples of our algorithms are comparable to those of SA-Ex and CLUCB-Ex. Notice that LUCB does not show a decreasing trend in Figure 3. The reason may be that it reduces the problem to two instances, in which the gap between the $K$th and $(K+1)$th best arms is no longer guaranteed to be $\Delta$.
6.5 Performance on Real-World Crowdsourcing Data Sets
We use the crowdsourcing data sets compiled by Li, Baba, and Kashima (2017), whose basic information is shown in Table 1. The task is to identify the top-$K$ workers with the highest accuracy, only from sequential access to the accuracy of subsets of labels given by $K$ workers. Notice that the number of super arms is enormous in all experiments. All data sets are hard instances, as the gap $\Delta$ is less than 0.05. Since SA-Ex and CLUCB-Ex are prohibitive, we compare the other algorithms. SAQM, SA-FOA, and ICB employ the uniform allocation strategy.
Table 1: Data Set | Number of Tasks | Number of Workers | Average | Best
Note: “Average” and “Best” give the average and the best accuracy rate among the workers, respectively.
The result is shown in Table 2, which indicates the applicability of our algorithms to instances with a massive number of super arms. Moreover, all algorithms found the optimal subset of crowdworkers. In all data sets, SA-FOA outperformed the other algorithms. Recall that ICB uses a simplified confidence bound for the gap between two super arms, whereas SA-FOA uses the approximation of the confidence ellipsoids for that gap, which results in better performance than ICB. SAQM approximately computes the maximal confidence ellipsoid bound for the reward of one super arm rather than for the gap between two super arms, which may explain its worse performance than SA-FOA. CLUCB-QM, which employs the same sampling rule as CLUCB and the same stopping rule as SAQM, performed better than both CLUCB and SAQM. This result may indicate that an adaptive sampling rule is more desirable than a static one, and that using a confidence ellipsoid is more desirable than an individual confidence bound. ME, LUCB, and CLUCB discard the information from $K$ arms at each pull, which may cause their unfavorable results. LUCB worked better than CLUCB, since the original version of CLUCB was designed for very general combinatorial constraints, while LUCB was designed only for the top-$K$ setting. Notice that ME is phased adaptive while LUCB is fully adaptive; ME performed poorly in all instances, although it is the counterpart of LUCB.
Table 2: Data Set | SAQM | SA-FOA | CLUCB-QM | CLUCB | ME | LUCB | ICB
Note: Each value is an average over 10 realizations.
We studied multiple-arm identification with full-bandit feedback, where we can observe only the sum of the rewards, not the reward of each single arm. Although we can regard our problem as a special case of pure exploration in linear bandits, an approach based on linear bandits is not computationally feasible, since the number of super arms may be exponential. To overcome the computational challenge, we designed a novel approximation algorithm with a theoretical guarantee for a 0-1 quadratic programming problem arising in confidence ellipsoid maximization. Based on our approximation algorithm, we proposed the $(\epsilon, \delta)$-PAC algorithm SAQM, which runs in polynomial time, and provided an upper bound of its sample complexity that is still worst-case optimal; the result indicates that our algorithm provides an exponential speedup over an exhaustive search algorithm while keeping the statistical efficiency. We also designed two heuristic algorithms that empirically perform well: SA-FOA, using first-order approximation, and CLUCB-QM, based on the lower-upper confidence-bound algorithm. Finally, we conducted experiments on synthetic and real-world data sets with an enormous number of super arms, demonstrating the superiority of our algorithms in terms of both computation time and sample complexity. There are several directions for future research. It remains open to design adaptive algorithms with a problem-dependent optimal sample complexity. Another interesting question is to seek a lower bound for any $(\epsilon, \delta)$-PAC algorithm that works in polynomial time. Extending our approach to combinatorial pure exploration with full-bandit feedback under general constraints is another direction.
Appendix A: Simplified Confidence Bounds for the Combinatorial Pure Exploration
In this appendix, we present the fundamental observation behind employing a simplified confidence bound to obtain a computationally efficient algorithm for the combinatorial pure exploration problem. We consider any decision class in which super arms satisfy a constraint for which a linear maximization problem is polynomial-time solvable. Examples of the decision classes considered here are paths, matchings, and matroids (see appendix B for the definition of matroids). The purpose of this appendix is to give a polynomial-time algorithm for combinatorial pure exploration with general constraints by using the simplified confidence bound, and to examine the trade-off between statistical efficiency and computational efficiency. The $(\epsilon, \delta)$-PAC algorithm proposed here, named ICB, is also evaluated as a simple benchmark strategy in our experiments.
For a matrix $A$, let $A_{ij}$ denote the $(i, j)$th entry of $A$. We construct a simplified confidence bound, named an independent confidence bound, which is obtained by diagonal approximation of confidence ellipsoids. We start with the following lemma, which shows that $\theta$ lies in an independent confidence region centered at the least-squares estimate with high probability.
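The diagonal approximation replaces the full ellipsoid with per-coordinate intervals; a minimal sketch (the radius `beta` is a placeholder, not the paper's exact confidence width):

```python
import math

def independent_bounds(theta_hat, a_inv_diag, beta):
    """Per-coordinate confidence intervals built from the diagonal of the
    inverse design matrix: theta_hat_i +/- beta * sqrt((A_t^{-1})_{ii})."""
    return [(m - beta * math.sqrt(d), m + beta * math.sqrt(d))
            for m, d in zip(theta_hat, a_inv_diag)]

intervals = independent_bounds([0.5, 0.2], [0.25, 0.04], beta=1.0)
```

Because only the diagonal is kept, each interval can be computed in constant time, which is what makes the resulting algorithm run in polynomial time at the cost of a looser bound.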
Given any instance of combinatorial pure exploration with full-bandit feedback with decision class , ICB (algorithm 5) at each round runs in polynomial time.
For example, ICB runs in polynomial time for matroid constraints whenever the time to check whether a given super arm is contained in the decision class is polynomial in $n$, which is the case for many matroid constraints. For example, each super arm may correspond to a spanning tree of a graph $G$, with the decision class being the set of spanning trees of $G$; membership can then be checked simply by testing acyclicity and connectivity.
The proof is given in appendix F. Notice that in the MAB, this diagonal approximation is tight, since the design matrix becomes diagonal. However, for combinatorial settings where the size of super arms is $K \ge 2$, there is no guarantee that this approximation is tight; the approximation may degrade the sample complexity. Although the algorithm proposed here empirically performs well when the number of single arms is not large, as seen in Figure 3, it is still unclear whether the simplified confidence bound should be preferred to confidence ellipsoids. This is the reason we focus on the approach with confidence ellipsoids.
Appendix B: Definition of Matroids
A matroid is a combinatorial structure that abstracts many notions of independence, such as linearly independent vectors in a set of vectors (the linear matroid) and spanning trees in a graph (the graphical matroid) (Whitney, 1935). Formally, a matroid is a pair $\mathcal{M} = (E, \mathcal{I})$, where $E$ is a finite set called a ground set and $\mathcal{I}$ is a family of subsets of $E$ called independent sets, which satisfies the following axioms:
(i) $\emptyset \in \mathcal{I}$; (ii) if $X \subseteq Y$ and $Y \in \mathcal{I}$, then $X \in \mathcal{I}$; (iii) if $X, Y \in \mathcal{I}$ and $|X| < |Y|$, then there exists $e \in Y \setminus X$ such that $X \cup \{e\} \in \mathcal{I}$.
A weighted matroid is a matroid that has a weight function $w: E \to \mathbb{R}_{>0}$. For $X \subseteq E$, we define the weight of $X$ as $w(X) = \sum_{e \in X} w(e)$.
Let us consider the following problem: given a weighted matroid $(E, \mathcal{I})$ with weight function $w$, we are asked to find an independent set with the maximum weight, that is, $X^* \in \arg\max_{X \in \mathcal{I}} w(X)$. This problem can be solved exactly by the following simple greedy algorithm (Karger, 1998). The algorithm initially sets $X$ to the empty set. It then sorts the elements of $E$ in decreasing order of weight, and for each element $e$ in this order, it adds $e$ to $X$ if $X \cup \{e\} \in \mathcal{I}$. Letting $\tau$ be the computation time for checking whether $X \cup \{e\}$ is independent, the running time of the above algorithm is $O(n \log n + n\tau)$, where $n = |E|$.
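The greedy procedure above can be sketched as follows. The graphic-matroid independence oracle (`acyclic`, implemented with union-find) is an illustrative choice of decision class, and all names are ours.

```python
def greedy_max_weight(elements, weight, is_independent):
    """Matroid greedy: scan elements in decreasing weight, keep if independent."""
    X = []
    for e in sorted(elements, key=weight, reverse=True):
        if is_independent(X + [e]):
            X.append(e)
    return X

# Example oracle: graphic matroid (edge sets containing no cycle).
def acyclic(edges):
    parent = {}
    def find(u):
        while parent.setdefault(u, u) != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u
    for u, v, _ in edges:
        ru, rv = find(u), find(v)
        if ru == rv:          # adding (u, v) would close a cycle
            return False
        parent[ru] = rv
    return True

edges = [("a", "b", 5.0), ("b", "c", 3.0), ("a", "c", 4.0), ("c", "d", 2.0)]
tree = greedy_max_weight(edges, weight=lambda e: e[2], is_independent=acyclic)
# picks the edges with weights 5.0, 4.0, 2.0 (a maximum-weight spanning tree)
```

Note that each call to the oracle here rebuilds the union-find structure, so the sketch trades the $O(n \log n + n\tau)$ bound for simplicity; an incremental oracle recovers it.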
Appendix C: Uniform Quadratic Knapsack Problem
Appendix D: Allocation Strategies
In this appendix, we briefly introduce possible allocation strategies and describe how to convert a continuous allocation into a discrete allocation for any sample size $n$; we use the efficient rounding procedure introduced in Pukelsheim (2006). The G-allocation strategy chooses the sequence of selections so as to minimize the worst-case variance of the estimator, which is an NP-hard optimization problem. Many studies in the experimental design literature have proposed approximate solutions to it (Bouhtou, Gaubert, & Sagnol, 2010; Sagnol, 2013). We can optimize the continuous relaxation of the problem by a projected gradient algorithm, a multiplicative algorithm, or an interior point algorithm. From the resulting optimal continuous allocation $p$, we wish to design a discrete allocation for a fixed sample size $n$.
Given a continuous allocation $p$, let $n_i$ be the number of pulls for arm $i$ and $l$ be the size of the support of $p$. Setting the frequencies $n_i = \lceil (n - l/2)\, p_i \rceil$ results in $\sum_i n_i$ samples. If $\sum_i n_i = n$, this allocation is a desired solution. Otherwise, we repeat the following procedure until the discrepancy $\sum_i n_i - n$ becomes 0: increase a frequency $n_j$ that attains $\min_i n_i / p_i$, or decrease some $n_j$ that attains $\max_i (n_i - 1) / p_i$. The resulting allocation lies in the efficient design apportionment (see Pukelsheim, 2006). Note that since the relaxation problem has an exponential number of variables in our setting, we restrict attention to a manageable number of super arms instead of dealing with all of them.
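A minimal sketch of this rounding procedure, assuming the ceiling initialization $n_i = \lceil (n - l/2) p_i \rceil$ and the increment/decrement rules of Pukelsheim's efficient apportionment; the function name is ours.

```python
import math

def efficient_rounding(p, n):
    """Round a continuous design p (nonnegative, summing to 1) into n pulls."""
    support = [i for i, pi in enumerate(p) if pi > 0]
    l = len(support)
    counts = [0] * len(p)
    for i in support:
        counts[i] = math.ceil((n - l / 2) * p[i])
    # Adjust until the frequencies sum exactly to n.
    while sum(counts) != n:
        if sum(counts) > n:
            # decrease a frequency attaining max (n_i - 1) / p_i
            j = max(support, key=lambda i: (counts[i] - 1) / p[i])
            counts[j] -= 1
        else:
            # increase a frequency attaining min n_i / p_i
            j = min(support, key=lambda i: counts[i] / p[i])
            counts[j] += 1
    return counts

alloc = efficient_rounding([0.5, 0.3, 0.2], 10)
# -> [5, 3, 2]: the ceiling step already sums to 10, so no adjustment is needed
```

In this example $\lceil 8.5 \cdot 0.5 \rceil = 5$, $\lceil 8.5 \cdot 0.3 \rceil = 3$, and $\lceil 8.5 \cdot 0.2 \rceil = 2$ already sum to $n = 10$; for other inputs the while loop repairs the discrepancy one pull at a time.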
Appendix E: Details of Baseline Algorithms
E.1 Details of CLUCB
The CLUCB algorithm is efficient in terms of computation time since it runs in polynomial time. However, as observed in our experiments, this naive baseline does not work well, especially on real-world data sets, since it cannot fully exploit the information obtained from the pulled arms at each round.
E.2 Details of ME and LUCB
The entire procedure of ME is detailed in algorithm 7 with the ME subroutine (algorithm 8). First, we choose a fixed subset of arms, and we find the best subset of arms by the median elimination algorithm proposed by Kalyanakrishnan and Stone (2010). Whenever a single arm should be pulled by the subroutine, we pull a super arm containing it instead. By this procedure, the arms that maximize the observed super-arm rewards are exactly the arms that maximize the single-arm rewards, so the best subset can clearly be obtained through the subroutine. ME is $(\varepsilon, \delta)$-PAC, and its sample complexity is inherited from that of the median elimination subroutine.
LUCB is a counterpart of ME, detailed in algorithm 7 with the LUCB subroutine (algorithm 9). LUCB employs the lower-upper confidence bound algorithm proposed by Kalyanakrishnan et al. (2012), whose gap-dependent sample complexity involves the quantity defined by equation E.1. Since this subroutine is $(\varepsilon, \delta)$-PAC, LUCB is also $(\varepsilon, \delta)$-PAC. However, the sample complexity can be very large when the gap between the best arms in the instance for its first subroutine is very small.
ME and LUCB are computationally efficient since they run in polynomial time. However, they must invoke their subroutine twice, which increases the number of samples.
E.3 Details of Exponential Algorithms
The entire procedure of SA-Ex is detailed in algorithm 10. The only difference between this algorithm and SAQM is the stopping condition: SAQM approximately solves the confidence ellipsoid maximization, whereas SA-Ex conducts an exhaustive search to obtain the exact solution. Thus, SA-Ex runs in time exponential in $k$, since it must examine all $\binom{n}{k}$ super arms. The entire procedure of CLUCB-Ex is detailed in algorithm 11. This algorithm also reduces our problem to the pure exploration problem in the linear bandit and thus runs in exponential time. The stopping condition used in CLUCB-Ex is the same as that of SA-Ex. CLUCB-Ex adaptively pulls super arms based on the CLUCB strategy, as in CLUCB and CLUCB-QM.
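To make the exponential cost concrete, here is a hypothetical sketch of an exhaustive search over all $\binom{n}{k}$ super arms. The objective is simplified to the quadratic form $x^\top A^{-1} x$ over size-$k$ indicator vectors; the exact stopping-condition objective of SA-Ex may differ, and all names are ours.

```python
import itertools
import numpy as np

def exhaustive_qp_max(A_inv, n, k):
    """Enumerate all C(n, k) super arms and maximize x^T A^{-1} x."""
    best_val, best_S = -np.inf, None
    for S in itertools.combinations(range(n), k):
        x = np.zeros(n)
        x[list(S)] = 1.0
        val = float(x @ A_inv @ x)
        if val > best_val:
            best_val, best_S = val, S
    return best_val, best_S

# Sanity check on the identity matrix: every size-k subset has value k.
best_val, best_S = exhaustive_qp_max(np.eye(5), 5, 2)
# best_val == 2.0
```

Even this toy loop touches $\binom{n}{k}$ candidates, which is why SAQM replaces the exact search with a polynomial-time approximation oracle.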
Appendix F: Proofs
First, we introduce some notation. For any pair of super arms $S, S'$, let $\Delta(S, S')$ denote the value gap between them, that is, the difference of their expected total rewards. Also, let $\hat{\Delta}(S, S')$ denote the corresponding empirical gap, the difference of their empirical total rewards.
F.1 Proof of Lemma 6
F.2 Proof of Lemma 9
F.3 Proof of Theorem 2
We begin by showing the following three lemmas.
Let $A$ be any positive-definite matrix. Then the weighted graph constructed from $A$ by algorithm 1 has nonnegative edge weights.
The diagonal entries of $A$ are positive since $A$ is a positive-definite matrix. If the relevant off-diagonal entry is nonnegative, the claim is obvious. In the remaining case, the required inequality follows from the definition of a positive-definite matrix. Thus, we obtain the desired result.
Let be any positive-definite matrix and be the adjacency matrix of the complete graph constructed by algorithm 1. Then, for any such that , we have .
Let be any positive-definite matrix and be the adjacency matrix of the complete graph constructed in algorithm 1. Then for any subset of vertices , we have , where and represent the minimum and maximum eigenvalues of , respectively.
We consider the following two cases: case i, , and case ii, .
Since is positive definite, we have . That gives us the desired result.
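The eigenvalue bounds in the lemma rest on the standard Rayleigh quotient inequality $\lambda_{\min}(A)\,\|y\|^2 \le y^\top A y \le \lambda_{\max}(A)\,\|y\|^2$, which for a subset indicator vector $y$ gives bounds proportional to $|S|$. The following sketch checks this numerically; the instance and names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6))
A = B @ B.T + np.eye(6)              # a positive-definite matrix
lam = np.linalg.eigvalsh(A)          # eigenvalues in ascending order
lam_min, lam_max = lam[0], lam[-1]

y = np.zeros(6)
y[[0, 2, 5]] = 1.0                   # indicator vector of a subset S, |S| = 3
q = float(y @ A @ y)

# Rayleigh quotient bound: lam_min * |S| <= y^T A y <= lam_max * |S|
assert lam_min * 3.0 <= q <= lam_max * 3.0
```

Since $\|y\|^2 = |S|$ for a 0-1 indicator vector, the bound scales linearly with the subset size, which is how the lemma relates the quadratic form to the extreme eigenvalues.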
We are now ready to prove theorem 2.
Therefore, we obtain .
F.4 Proof of Theorem 3
Updating the design matrix can be done in polynomial time, and computing the empirical best super arm can also be done in polynomial time. Moreover, the confidence ellipsoid maximization (CEM) can be approximately solved in polynomial time, since the quadratic maximization (QP) is solved in polynomial time as long as we employ a polynomial-time algorithm as the DS-Oracle. Letting $T_{\mathrm{DS}}$ denote the computation time of the DS-Oracle, we can guarantee that SAQM runs in time polynomial in $n$ and $T_{\mathrm{DS}}$.
F.5 Proof of Theorem 4
Before stating the proof of theorem 4, we give some technical lemmas.
Given any , assume that occurs. Then if SAQM (algorithm 2) terminates at round , we have .
We are now ready to prove theorem 4.
We define the event $\mathcal{E}$ as the event that the confidence region of proposition 1 contains the true parameter at every round. We can see from proposition 1 that $\mathcal{E}$ occurs with probability at least $1 - \delta$. In the rest of the proof, we assume that this event holds. By lemma 15 and the assumption on the accuracy parameter, we see that the output is an $\varepsilon$-optimal super arm. Next, we focus on bounding the sample complexity.
We thank the anonymous reviewers for their comments and suggestions. Y.K. was supported by a Grant-in-Aid for JSPS Fellows (No. 18J23034) and JST CREST grant number JPMJCR1403, including AIP challenge program. A.M. was supported by a Grant-in-Aid for Research Activity Start-up (No. 17H07357) and a Grant-in-Aid for Early-Career Scientists (No. 19K20218). J.H. was supported by a Grant-in-Aid for Scientific Research on Innovative Areas (No. 16H00881). M.S. was supported by KAKENHI 17H00757.
L.X. is now at Gatsby Computational Neuroscience Unit, University College London.