## Abstract

We study the problem of stochastic multiple-arm identification, where an agent sequentially explores a size-$k$ subset of arms (also known as a *super arm*) from given $n$ arms and tries to identify the best super arm. Most work so far has considered the semi-bandit setting, where the agent can observe the reward of each pulled arm, or has assumed that each arm can be queried at each round. However, in real-world applications, it is costly or sometimes impossible to observe the rewards of individual arms. In this study, we tackle the full-bandit setting, where only a noisy observation of the total sum of a super arm is given at each pull. Although our problem can be regarded as an instance of best arm identification in linear bandits, a naive approach based on linear bandits is computationally infeasible since the number of super arms $K$ is exponential. To cope with this problem, we first design a polynomial-time approximation algorithm for a 0-1 quadratic programming problem arising in confidence ellipsoid maximization. Based on our approximation algorithm, we propose a bandit algorithm whose computation time is $O(\log K)$, thereby achieving an exponential speedup over linear bandit algorithms. We provide a sample complexity upper bound that is still worst-case optimal. Finally, we conduct experiments on large-scale data sets with more than $10^{10}$ super arms, demonstrating the superiority of our algorithms in terms of both the computation time and the sample complexity.

## 1 Introduction

The stochastic multiarmed bandit (MAB) is a classical decision-making model that characterizes the trade-off between exploration and exploitation in stochastic environments (Lai & Robbins, 1985). While the best-studied objective is to minimize the cumulative regret or maximize the cumulative reward (Bubeck & Cesa-Bianchi, 2012; Cesa-Bianchi & Lugosi, 2006), another popular objective is to identify the best arm with the maximum expected reward from given $n$ arms. This problem, called *pure exploration* or *best arm identification* in the MAB, has received much attention (Audibert & Bubeck, 2010; Chen & Li, 2015; Even-Dar, Mannor, & Mansour, 2002, 2006; Jamieson, Malloy, Nowak, & Bubeck, 2014; Kaufmann, Cappé, & Garivier, 2016).

An important variant of the MAB is the multiple-play MAB problem (MP-MAB), in which the agent pulls $k\ (\geq 1)$ different arms at each round. In many application domains, we need to make a decision to take multiple actions among a set of all possible choices. For example, in online advertisement auctions, companies want to choose multiple keywords to promote their products to consumers based on their search queries (Rusmevichientong & Williamson, 2006). From millions of available choices, a company aims to find the most effective set of keywords by observing the historical performance of the chosen keywords. This decision making is formulated as the MP-MAB, where each arm corresponds to a keyword. In addition, the MP-MAB has further applications, such as channel selection in cognitive radio networks (Huang, Liu, & Ding, 2008), ranking web documents (Radlinski, Kleinberg, & Joachims, 2008), and crowdsourcing (Zhou, Chen, & Li, 2014). Owing to these various applications, the MP-MAB has received much attention, and several algorithms have been proposed for regret minimization (Agrawal, Hegde, & Teneketzis, 1990; Anantharam, Varaiya, & Walrand, 1987; Komiyama, Honda, & Nakagawa, 2015; Lagrée, Vernade, & Cappe, 2016). The adversarial case has also been studied in the literature (Cesa-Bianchi & Lugosi, 2012; Combes, Talebi Mazraeh Shahi, Proutiere, & Lelarge, 2015).

In this paper, we study the multiple-arm identification problem, which corresponds to pure exploration in the MP-MAB. In this problem, the goal is to find the size-$k$ subset (a super arm) with the maximum expected reward. The problem is also called *top-$k$ selection* or *$k$-best arm identification* and has been extensively studied recently (Bubeck, Wang, & Viswanathan, 2013; Cao, Li, Tao, & Li, 2015; Gabillon, Ghavamzadeh, & Lazaric, 2012; Gabillon, Ghavamzadeh, Lazaric, & Bubeck, 2011; Kalyanakrishnan & Stone, 2010; Kalyanakrishnan, Tewari, Auer, & Stone, 2012; Kaufmann & Kalyanakrishnan, 2013; Chaudhuri & Kalyanakrishnan, 2017, 2019; Zhou et al., 2014). This prior work has considered the *semi-bandit* setting, in which we can observe a reward of each single arm in the pulled super arm, or has assumed that a single arm can be queried. However, in many application domains, it is costly to observe the rewards of individual arms, or sometimes we cannot access feedback from individual arms at all. For example, in crowdsourcing, we often obtain many labels given by crowdworkers, but it is costly to compile the labels by labeler. Furthermore, in software projects, an employer may have complicated tasks that need multiple workers, in which case the employer can evaluate only the quality of a completed task rather than a single worker's performance (Retelny et al., 2014; Tran-Thanh, Stein, Rogers, & Jennings, 2014). In such scenarios, we wish to extract expert workers who can perform the task with high quality from sequential access to the quality of tasks completed by multiple workers.

In this study, we tackle multiple-arm identification with *full-bandit* feedback, where only a noisy observation of the total sum of a super arm is given at each pull rather than a reward of each pulled single arm. This setting is more challenging since estimators of the expected rewards of single arms are no longer independent of each other. To solve this problem, one might use an algorithm for the noncombinatorial top-$k$ identification problem based on the idea that the difference in mean reward between two single arms $i$ and $j$ can be estimated by pulling super arms $\{i\}\cup A$ and $\{j\}\cup A$ for any size-$(k-1)$ set $A\subseteq[n]$. However, such an approach fails to reduce the number of samples, as shown in section 6, since it cannot fully exploit the information from $k-1$ arms at each pull.
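To make this baseline concrete, here is a minimal sketch of the difference-estimation idea (the function and variable names are our own illustration, not the paper's pseudocode):

```python
import numpy as np

def naive_gap_estimate(pull, i, j, A, t):
    """Sketch of the naive baseline described above: estimate
    theta(i) - theta(j) by pulling the super arms {i} u A and {j} u A,
    t times each, and differencing the empirical means. The shared k-1
    arms in A cancel in expectation, but the information they carry is
    otherwise wasted."""
    mean_i = np.mean([pull(set(A) | {i}) for _ in range(t)])
    mean_j = np.mean([pull(set(A) | {j}) for _ in range(t)])
    return mean_i - mean_j
```

Each pull spends $k$ queries' worth of observation to learn about a single pairwise gap, which is why this reduction cannot match the sample efficiency of methods that exploit the full linear structure.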

We can see our problem as an instance of pure exploration in *linear bandits*, which has received increasing attention (Lattimore & Szepesvari, 2017; Soare, Lazaric, & Munos, 2014; Tao, Blanco, & Zhou, 2018; Xu, Honda, & Sugiyama, 2018). In linear bandits, each arm has its own feature $x\in\mathbb{R}^n$, while in our problem, each super arm can be associated with a vector $x\in\{0,1\}^n$. Most linear bandit algorithms have, however, time complexity at least proportional to the number of arms. Therefore, a naive use of them is computationally infeasible since the number of super arms $K=\binom{n}{k}$ is exponential. A modicum of research on linear bandits has addressed the time complexity (Jun, Bhargava, Nowak, & Willett, 2017; Tao et al., 2018). Jun et al. (2017) proposed efficient algorithms for regret minimization, which achieve sublinear time complexity $O(K^\rho)$ for a constant $\rho\in(0,1)$. Nevertheless, in our setting, they still have to spend $O(n^{\rho k})$ time, which is exponential. Thus, to perform multiple-arm identification with full-bandit feedback in practice, the computational infeasibility needs to be overcome, since fast decisions are required in real-world applications.

In this study, we design algorithms that are efficient in terms of both the time complexity and the sample complexity. Our contributions are summarized as follows:

1. We propose a polynomial-time approximation algorithm (algorithm 1) for an NP-hard 0-1 quadratic programming problem arising in confidence ellipsoid maximization. In the design of the approximation algorithm, we utilize algorithms for a classical combinatorial optimization problem called the *densest $k$-subgraph problem* (D$k$S) (Feige, Peleg, & Kortsarz, 2001). Importantly, we provide a theoretical guarantee for the approximation ratio of our algorithm (theorem 2).

2. Based on our approximation algorithm, we propose a bandit algorithm (algorithm 2) that runs in $O(\log K)$ time (theorem 3) and provide an upper bound of the sample complexity (theorem 4) that is still worst-case optimal. This result means that our algorithm achieves an exponential speedup over linear bandit algorithms while keeping statistical efficiency. Moreover, we design two heuristic algorithms, which empirically perform well. We propose algorithm 3, which employs a first-order approximation of confidence ellipsoids, and algorithm 4, which is based on the lower-upper confidence-bound algorithm.

3. We conduct a series of experiments on both synthetic and real-world data sets. First, we run our proposed algorithms on synthetic data sets and verify that our algorithms give a good approximation to an exhaustive search algorithm. Next, we evaluate our algorithms on large-scale crowdsourcing data sets with more than $10^{10}$ super arms, demonstrating the superiority of our algorithms in terms of both time complexity and sample complexity.

Note that the multiple-arm identification problem is a special class of the combinatorial pure exploration, where super arms follow certain combinatorial constraints such as paths, matchings, or matroids (Cao & Krishnamurthy, 2017; Chen, Gupta, & Li, 2016; Chen, Gupta, Li, Qiao, & Wang, 2017; Chen, Lin, King, Lyu, & Chen, 2014; Gabillon, Lazaric, Ghavamzadeh, Ortner, & Bartlett, 2016; Huang, Ok, Li, & Chen, 2018; Perrault, Perchet, & Valko, 2019). We can also design a simple algorithm (algorithm 5 in appendix A) for the combinatorial pure exploration under general constraints with full-bandit feedback, which results in a looser but general sample complexity bound. All proofs in this paper are given in appendix F.

## 2 Preliminaries

### 2.1 Problem Definition

Let $[n]=\{1,2,\ldots,n\}$ for an integer $n$. For a vector $x\in\mathbb{R}^n$ and a matrix $B\in\mathbb{R}^{n\times n}$, let $\|x\|_B=\sqrt{x^\top Bx}$. For a vector $\theta\in\mathbb{R}^n$ and a subset $S\subseteq[n]$, we define $\theta(S)=\sum_{e\in S}\theta(e)$. Now we describe the problem formulation formally. Suppose that there are $n$ single arms associated with unknown reward distributions $\{\varphi_1,\ldots,\varphi_n\}$. The reward from $\varphi_e$ for each single arm $e\in[n]$ is expressed as $X_t(e)=\theta(e)+\epsilon_t(e)$, where $\theta(e)$ is the expected reward and $\epsilon_t(e)$ is zero-mean noise bounded in $[-R,R]$ for some $R>0$. The agent chooses a size-$k$ subset from the $n$ single arms at each round $t$ for an integer $k>0$. In the well-studied semi-bandit setting, the agent pulls a subset $M_t$ and then observes $X_t(e)$ for each $e\in M_t$, independently sampled from the associated unknown distribution $\varphi_e$. In the full-bandit setting, however, she can observe only the sum of rewards $r_{M_t}=\theta(M_t)+\sum_{e\in M_t}\epsilon_t(e)$ at each pull, which means that estimators of the expected rewards of single arms are no longer independent of each other.

We call a size-$k$ subset of single arms a *super arm*. We define a *decision class* $\mathcal{M}$ as the finite set of super arms that satisfy the size constraint: $\mathcal{M}=\{M\subseteq[n] : |M|=k\}$; thus, the size of the decision class is given by $K=\binom{n}{k}$. Let $M^*$ be the optimal super arm in the decision class $\mathcal{M}$: $M^*=\arg\max_{M\in\mathcal{M}}\theta(M)$. In this letter, we focus on the *$(\epsilon,\delta)$-PAC* setting, where the goal is to design an algorithm that outputs a super arm $\mathrm{Out}\in\mathcal{M}$ satisfying, for $\delta\in(0,1)$ and $\epsilon>0$, $\Pr[\theta(M^*)-\theta(\mathrm{Out})\leq\epsilon]\geq 1-\delta$. An algorithm is called *$(\epsilon,\delta)$-PAC* if it satisfies this condition. In the fixed confidence setting, the agent's performance is evaluated by her sample complexity: the number of rounds until the agent terminates.
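The feedback model above can be rendered as a small simulator (the class and method names are our own illustration, not the paper's code):

```python
import numpy as np

class FullBanditEnv:
    """Toy simulator of the full-bandit feedback model: pulling a size-k
    super arm M returns only the noisy sum of the k single-arm rewards."""

    def __init__(self, theta, k, R=1.0, rng=None):
        self.theta = np.asarray(theta, dtype=float)  # expected rewards theta(e)
        self.k = k
        self.R = R  # per-arm noise bound
        self.rng = rng or np.random.default_rng(0)

    def pull(self, M):
        assert len(M) == self.k
        # zero-mean noise in [-R, R] per pulled arm; only the total is observed
        noise = self.rng.uniform(-self.R, self.R, size=self.k)
        return self.theta[list(M)].sum() + noise.sum()

env = FullBanditEnv(theta=[0.9, 0.8, 0.1, 0.05], k=2)
r = env.pull({0, 1})  # observe only theta(0) + theta(1) plus the summed noise
```

The observation is a single scalar per pull, so the per-arm rewards cannot be separated without aggregating many pulls of different super arms.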

### 2.2 Confidence Bound

In order to handle full-bandit feedback, we utilize approaches for best arm identification in linear bandits. In best arm identification problems, the agent sequentially estimates $\theta$ from past observations and checks whether the estimation error is small. We introduce the necessary notation as follows. Let $\mathbf{M}_t=(M_1,M_2,\ldots,M_t)\in\mathcal{M}^t$ be a sequence of super arms and $(r_{M_1},\ldots,r_{M_t})\in\mathbb{R}^t$ be the corresponding sequence of observed rewards. Let $\chi_M\in\{0,1\}^n$ denote the indicator vector of super arm $M\in\mathcal{M}$; for each $e\in[n]$, $\chi_M(e)=1$ if $e\in M$ and $\chi_M(e)=0$ otherwise.
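Given the indicator vectors and observed sums, $\theta$ can be estimated by least squares; here is a minimal sketch, where the small ridge term `lam` is our own addition for numerical stability (not part of the paper's definition):

```python
import numpy as np

def least_squares_estimate(chis, rewards, lam=1e-6):
    """Estimate theta from full-bandit observations by least squares.
    chis: (t, n) array whose rows are indicator vectors chi_{M_s};
    rewards: (t,) array of observed sums r_{M_s}.
    Solves (X^T X + lam I) theta = X^T r."""
    A = chis.T @ chis + lam * np.eye(chis.shape[1])
    b = chis.T @ rewards
    return np.linalg.solve(A, b)

# noiseless sanity check with n = 3, k = 2: the three size-2 indicator
# vectors span R^3, so theta is recovered (up to the tiny ridge bias)
theta = np.array([0.5, 0.2, 0.3])
chis = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1]], dtype=float)
theta_hat = least_squares_estimate(chis, chis @ theta)
```

Note that identifiability requires the pulled indicator vectors to span $\mathbb{R}^n$, which is exactly what the allocation strategies below are designed to ensure efficiently.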

In our problem, the proposition holds for $\sigma=kR$. Two allocation strategies, named *G-allocation* and *$XY$-allocation*, are discussed in Soare et al. (2014). Approximating the optimal G-allocation can be done via convex optimization and an efficient rounding procedure, and the $XY$-allocation can be computed in a similar manner (see appendix D for details).

### 2.3 Computational Hardness

## 3 Confidence Ellipsoid Maximization

An algorithm is called an *$\alpha$-approximation algorithm* if it returns a solution whose objective value is greater than or equal to the optimal value times $\alpha\in(0,1]$ for any instance. Let $W\in\mathbb{R}^{n\times n}$ be a symmetric matrix. CEM, introduced in equation 2.4, can be naturally represented by the following 0-1 quadratic programming problem:

Notice that QP can be seen as an instance of the *uniform quadratic knapsack problem*, which is known to be NP-hard (Taylor, 2016), and few polynomial-time approximation results are known even for special cases (see appendix C for details).

In this study, by utilizing algorithms for a classical combinatorial optimization problem called the *densest $k$-subgraph problem* (D$k$S), we design an approximation algorithm that admits a theoretical performance guarantee for QP with a positive-definite matrix $W$. The definition of the D$k$S is as follows. Let $G=(V,E,w)$ be an undirected graph with nonnegative edge weights $w=(w_e)_{e\in E}$.

For a vertex set $S\subseteq V$, let $E(S)=\{\{u,v\}\in E : u,v\in S\}$ be the subset of edges in the subgraph induced by $S$. We denote by $w(S)$ the sum of the edge weights in the subgraph induced by $S$: $w(S)=\sum_{e\in E(S)}w_e$. In the D$k$S, given $G=(V,E,w)$ and a positive integer $k$, we are asked to find $S\subseteq V$ with $|S|=k$ that maximizes $w(S)$. Although the D$k$S is NP-hard, there is a variety of polynomial-time approximation algorithms (Asahiro, Iwama, Tamaki, & Tokuyama, 2000; Bhaskara, Charikar, Chlamtac, Feige, & Vijayaraghavan, 2010; Feige et al., 2001). The current best approximation result for the D$k$S has an approximation ratio of $\Omega(1/|V|^{1/4+\epsilon})$ for any $\epsilon>0$ (Bhaskara et al., 2010). The direct reduction of QP to the D$k$S results in an instance with arbitrary edge weights. Existing algorithms cannot be used for such an instance, since they assume that all edge weights are nonnegative.

Now we present our algorithm for QP, which is detailed in algorithm 1. The algorithm operates in two steps. In the first step, it constructs an $n$-vertex complete graph $\tilde{G}=(V,E,\tilde{w})$ from a given symmetric matrix $W\in\mathbb{R}^{n\times n}$. For each $\{i,j\}\in E$, the edge weight $\tilde{w}_{ij}$ is set to $w_{ij}+w_{ii}+w_{jj}$. Note that if $W$ is positive definite, $\tilde{w}_{ij}\geq 0$ holds for every $\{i,j\}\in E$, which means that $\tilde{G}$ is an instance of the D$k$S (lemma 10 in appendix F). In the second step, the algorithm accesses the *densest $k$-subgraph oracle* (D$k$S-Oracle), which accepts $\tilde{G}$ as input and returns in polynomial time an approximate solution for the D$k$S. Note that we can use any polynomial-time approximation algorithm for the D$k$S as the D$k$S-Oracle. Let $\alpha_{\mathrm{D}k\mathrm{S}}$ be the approximation ratio of the algorithm employed by the D$k$S-Oracle. By a careful analysis of the approximation ratio of algorithm 1, we obtain the following theorem.
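The first step of algorithm 1 is easy to sketch in code (an illustrative rendering, not the paper's implementation):

```python
import numpy as np

def qp_to_dks_graph(W):
    """Step 1 of algorithm 1: build the complete graph ~G from a symmetric
    matrix W with edge weights ~w_ij = w_ij + w_ii + w_jj, returned as a
    dense weight matrix with zero diagonal. When W is positive definite,
    every 2x2 principal minor gives |w_ij| <= (w_ii + w_jj) / 2, so all
    edge weights are nonnegative and ~G is a valid DkS instance."""
    d = np.diag(W)
    W_tilde = W + d[:, None] + d[None, :]
    np.fill_diagonal(W_tilde, 0.0)
    return W_tilde

W = np.array([[2.0, -0.5], [-0.5, 1.0]])  # positive definite
W_tilde = qp_to_dks_graph(W)  # off-diagonal weight: -0.5 + 2.0 + 1.0 = 2.5
```

The construction shifts each quadratic cross term by the corresponding diagonal entries, which is what makes the weights nonnegative and the D$k$S machinery applicable.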

For QP with any positive-definite matrix $W\in\mathbb{R}^{n\times n}$, algorithm 1 with an $\alpha_{\mathrm{D}k\mathrm{S}}$-approximation D$k$S-Oracle is a $\frac{1}{k-1}\frac{\lambda_{\min}(W)}{\lambda_{\max}(W)}\alpha_{\mathrm{D}k\mathrm{S}}$-approximation algorithm, where $\lambda_{\min}(W)$ and $\lambda_{\max}(W)$ represent the minimum and maximum eigenvalues of $W$, respectively.

## 4 Main Algorithm

Based on the approximation algorithm proposed in the previous section, we propose two algorithms for multiple-arm identification with full-bandit feedback. Note that we assume $k\geq 2$, since multiple-arm identification with $k=1$ is the same as the best arm identification problem in the MAB.

### 4.1 Proposed Algorithm Based on Static Allocation

First, we deal with static allocation strategies, which sequentially sample a super arm from a fixed sequence of super arms. In general, adaptive strategies will perform better than static ones, but owing to the computational hardness, we focus on static ones to analyze the worst-case optimality in terms of the minimum gap $\Delta_{\min}=\min_{M\in\mathcal{M}\setminus\{M^*\}}\left(\theta(M^*)-\theta(M)\right)$. In static algorithms, the agent pulls a super arm from a fixed set of super arms until a certain stopping condition is satisfied. Therefore, it is important to construct a stopping condition guaranteeing, as quickly as possible, that the estimate $\hat{\theta}_t$ belongs to a set of parameters that admits the empirical best super arm $\hat{M}^*_t=\arg\max_{M\in\mathcal{M}}\hat{\theta}_t(M)$ as the optimal super arm $M^*$.

This stopping condition allows the output to be $\epsilon$-optimal with high probability. As the following theorem states, SAQM provides an exponential speedup over exhaustive search algorithms.

Let $\mathrm{poly}_{\mathrm{D}k\mathrm{S}}(n)$ be the computation time of the D$k$S-Oracle. Then at any round $t>0$, SAQM (algorithm 2) runs in $O(\max\{n^2,\mathrm{poly}_{\mathrm{D}k\mathrm{S}}(n)\})$ time.

Most existing approximation algorithms for the D$k$S run efficiently. For example, if we employ the algorithm by Feige et al. (2001) as the D$k$S-Oracle in algorithm 1, which runs in $O(n^\omega)$ time, the running time of SAQM becomes $O(n^\omega)$, where the exponent $\omega\leq 2.373$ is that of matrix multiplication (see Le Gall, 2014). If we employ the algorithm by Asahiro et al. (2000), which runs in $O(n^2)$ time, the running time of SAQM also becomes $O(n^2)$.

It is worth mentioning that if we have an $\alpha$-approximation algorithm for CEM with a more general decision class $\mathcal{M}$ (such as paths, matchings, or matroids), we have the same sample complexity bound in theorem 4 for combinatorial pure exploration (CPE) with general constraints. For the top-$k$ identification setting, we have $\alpha=\Omega\left(k^{-\frac{1}{2}}n^{-\frac{1}{8}}\sqrt{\frac{\lambda_{\min}(\Lambda_p)}{\lambda_{\max}(\Lambda_p)}}\right)$ in theorem 4.

Soare et al. (2014) considered the oracle sample complexity of the linear best arm identification problem. The oracle complexity, which is based on the optimal allocation strategy $p$ derived from the true parameter $\theta$, is $O(H_\epsilon\log(1/\delta))$ if we ignore the terms that are not related to $H_\epsilon$ and $\delta$. Soare et al. (2014) showed that the sample complexity of the G-allocation strategy matches the oracle sample complexity up to constants in the worst case. The sample complexity of SAQM is also worst-case optimal in the sense that it matches $O(H_\epsilon\log(1/\delta))$, while SAQM runs in polynomial time.

Note that if we use proposition 2, specified in section 5, instead of proposition 1, we obtain a better sample complexity bound in terms of $k$ (or $n$). However, the sample complexity with proposition 2 becomes complicated, since it depends on the regularization parameter $\lambda$. Therefore, we analyze the sample complexity bound based on proposition 1 to clarify how the sample complexity depends on $\alpha$, $H_\epsilon$, and $\delta$ rather than on $k$ (or $n$).

Although the quantity $\rho(p)$ is unbounded in the worst case, it is upper-bounded by $n$, as pointed out in Soare et al. (2014), if we use the G-allocation strategy as the sampling strategy $p$. However, the exact G-allocation might be hard to compute if $n$ is large, since the number of variables can be exponential.

We also note that under some mild conditions, we have an upper bound of the condition number of $\Lambda p$.

Suppose that SAQM (algorithm 2) employs a sampling strategy $p\in\mathcal{P}$ in which the agent pulls any size-$k$ set including single arm $i$ with probability $\Omega(1/n)$ for all $i\in[n]$: $\min_{i\in[n]}\sum_{M\in\mathrm{supp}(p):\,i\in M}p(M)=\Omega(1/n)$. Then the condition number of $\Lambda_p$ satisfies $\frac{\lambda_{\max}(\Lambda_p)}{\lambda_{\min}(\Lambda_p)}=O(nk)$.

## 5 Heuristic Algorithms

In this section, we propose two algorithms that output the optimal super arm but do not have a sample complexity bound. The first algorithm employs a first-order approximation of confidence ellipsoids to check the stopping condition efficiently. The second is based on an adaptive sampling strategy. We empirically evaluated both algorithms and observed that they perform well in our experiments.

### 5.1 First-Order Approximation for Confidence Ellipsoid Maximization

### 5.2 Adaptive Algorithm

In the previous section, using our approximation scheme for confidence ellipsoid maximization, we proposed algorithms based on static allocation strategies. However, in order to design a near-optimal sampling algorithm, we should focus on adaptive algorithms, which adaptively change the arm selection strategy based on past observations at every round. In this section, we propose an adaptive algorithm making use of existing classical methods.

(proposition 3). Let $\epsilon_t$ be $R$-sub-Gaussian noise. If the $\ell_2$-norm of the parameter $\theta$ is less than $S$, then the following statement holds.

We propose an algorithm named CLUCB-QM that employs an adaptive strategy based on the combinatorial lower-upper confidence bound (CLUCB) algorithm (Chen et al., 2014). The CLUCB algorithm is originally designed for semi-bandit settings, in which it queries one single arm per round. We can adapt it to the full-bandit setting by replacing the one single arm with two super arms that differ in exactly one single arm. However, with the original stopping condition used in Chen et al. (2014), the algorithm performs poorly in terms of the number of samples, as we will observe in our experiments. In the CLUCB-QM we propose, we instead use the same stopping criterion as in SAQM, which maintains the least-squares estimator and involves confidence ellipsoid maximization.

The entire procedure of CLUCB-QM is detailed in algorithm 4. The algorithm maintains a confidence radius $\mathrm{rad}_t(e)$ for each single arm $e\in[n]$. The vector $\tilde{\theta}_t$ penalizes single arms belonging to the current empirical best super arm $\hat{M}^*_t$ and encourages exploring single arms outside $\hat{M}^*_t$. CLUCB-QM chooses the single arm $e_t$ that has the largest confidence radius in the symmetric difference of $\hat{M}^*_t$ and $\tilde{M}_t=\arg\max_{M\in\mathcal{M}}\tilde{\theta}_t(M)$. The algorithm then queries two super arms that differ in one arm: CLUCB-QM pulls $M_t\in\mathcal{M}$ such that $e_t\in M_t$ and $e'\notin M_t$, and then pulls $M_t'=(M_t\setminus\{e_t\})\cup\{e'\}$, where $e'$ is a fixed single arm. Since CLUCB-QM employs the stopping condition 4.1 as in SAQM, we can prove that CLUCB-QM outputs an $\epsilon$-optimal super arm with probability at least $1-\delta$. Although we do not have a sample complexity bound for CLUCB-QM, it performs better than SAQM in our experimental results.

## 6 Experiments

In this section, we evaluate the empirical performance of our algorithms: SAQM (algorithm 2), SA-FOA (algorithm 3), and CLUCB-QM (algorithm 4). We implement a baseline algorithm, ICB (algorithm 5 in appendix A), which works in polynomial time. ICB employs simplified confidence bounds obtained by a diagonal approximation of confidence ellipsoids. Note that ICB can solve the combinatorial pure exploration problem with general constraints and admits another sample complexity bound (lemma 9 in appendix A).

To verify the effectiveness of our novel stopping rule that uses confidence ellipsoids, we implement CLUCB (algorithm 6 in appendix E.1); its stopping rule differs from CLUCB-QM's, but its sampling rule is the same. CLUCB estimates the gap between each single arm and a fixed arm by pulling two super arms with a one-single-arm difference. Like CLUCB-QM, CLUCB is based on the lower-upper confidence-bound algorithm for general constraints proposed by Chen et al. (2014), but it employs a stopping condition that does not require confidence ellipsoid maximization. Notice that CLUCB is $(\epsilon,\delta)$-PAC (see its sample complexity in appendix E.1).

We implement two baseline algorithms that invoke noncombinatorial top-$k$ identification algorithms: one is the elimination-based algorithm ME (algorithm 7 with ME-Subroutine, algorithm 8, in appendix E.2); the other is the confidence-bound-based algorithm LUCB (algorithm 7 with LUCB-Subroutine, algorithm 9, in appendix E.2). ME employs the median elimination algorithm by Kalyanakrishnan and Stone (2010) for the noncombinatorial setting as a subroutine with a simple modification: we sample $\{i\}\cup A$ when base arm $i\in[n]$ should be sampled, for a fixed size-$(k-1)$ subset $A\subseteq[n]$. LUCB is a counterpart of ME that employs the lower-upper confidence-bound algorithm proposed by Kalyanakrishnan et al. (2012). Note that LUCB and ME are also $(\epsilon,\delta)$-PAC (see appendix E for their sample complexities).

We compare our algorithms with two exponential time algorithms, SA-Ex and CLUCB-Ex, which reduce our problem to the pure exploration problem in the linear bandit (see appendix E.3 for details). We conduct the experiments on small synthetic data sets and large-scale real-world data sets.

All experiments were conducted on a MacBook with a 1.3 GHz Intel Core i5 and 8 GB of memory. All code was implemented in Python. In all experiments, we employed the approximation algorithm called *greedy peeling* (Asahiro et al., 2000) as the D$k$S-Oracle. Specifically, the greedy peeling algorithm iteratively removes a vertex with the minimum weighted degree in the currently remaining graph until we are left with a subset of vertices of size $k$. The algorithm runs in $O(n^2)$ time.
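For reference, a minimal sketch of the greedy peeling heuristic as described above (our own rendering, not the experiment code):

```python
import numpy as np

def greedy_peeling(W, k):
    """Greedy peeling heuristic for the DkS (after Asahiro et al., 2000):
    repeatedly remove the vertex of minimum weighted degree from the
    remaining graph until exactly k vertices are left. W is a symmetric
    nonnegative weight matrix with zero diagonal. Each peel costs O(n)
    for the minimum and O(n) for the degree update, so O(n^2) overall."""
    remaining = set(range(W.shape[0]))
    deg = W.sum(axis=1).astype(float)  # weighted degree of each vertex
    while len(remaining) > k:
        v = min(remaining, key=lambda u: deg[u])
        remaining.remove(v)
        for u in remaining:
            deg[u] -= W[u, v]  # removing v lowers its neighbors' degrees
    return remaining
```

On a graph with a planted dense subgraph, the low-degree peripheral vertices are peeled first, so the dense core survives.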

### 6.1 Synthetic Data Sets

To see the dependence of the performance on the minimum gap $\Delta_{\min}$, we generate synthetic instances as follows. We first set the expected rewards of the top-$k$ single arms uniformly at random from $[0,1]$. Let $\theta_{\min\text{-}k}$ be the minimum expected reward among the top-$k$ single arms. We set the expected reward of the $(k+1)$-th best single arm to $\theta_{\min\text{-}k}-\Delta_{\min}$ for a predetermined parameter $\Delta_{\min}\in[0,1]$. Then we generate the expected rewards of the remaining single arms by uniform samples from $[-1,\theta_{\min\text{-}k}-\Delta_{\min}]$, so that the expected reward of the best super arm exceeds that of every other super arm by at least $\Delta_{\min}$. We set the additive noise distribution to $\mathcal{N}(0,1)$. In all instances, we set $\delta=0.05$ and $\epsilon=0.5$. For the regularization parameter in CLUCB-QM, we set $\lambda=1$ as in Xu et al. (2018). SAQM, SA-FOA, and ICB employ the G-allocation strategy. In SA-FOA, $\ell$ is set to 2 for all experiments.
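The generation procedure above can be sketched as follows (the function name and RNG choice are our own illustration):

```python
import numpy as np

def generate_instance(n, k, gap, rng=None):
    """Synthetic reward generation as described above. Top-k means are
    uniform on [0, 1], the (k+1)-th best is the minimum of those minus
    `gap`, and the rest are uniform on [-1, that value], so the best
    super arm wins by at least `gap`."""
    rng = rng or np.random.default_rng(0)
    top = np.sort(rng.uniform(0.0, 1.0, size=k))[::-1]  # top-k, descending
    next_best = top.min() - gap
    rest = rng.uniform(-1.0, next_best, size=n - k - 1)
    return np.concatenate([top, [next_best], rest])

theta = generate_instance(n=10, k=5, gap=0.1)
```

Since every non-optimal super arm swaps at least one top-$k$ arm for an arm whose mean is at most $\theta_{\min\text{-}k}-\Delta_{\min}$, the super-arm gap is indeed at least $\Delta_{\min}$.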

### 6.2 Approximation Error

First, we examine the approximation precision of our approximation algorithms. The results are reported in Figure 1. SAQM and SA-FOA employ approximation mechanisms to test the stopping condition in polynomial time. Recall that SAQM approximately solves CEM in equation 2.4 to attain an objective value of $Z_t$, and SA-FOA approximately solves the maximization problem in equation 5.1 to attain an objective value of $Z_t'$. We set up the experiments with $n=10$ single arms and $k=5$. We run the experiments for a small gap ($\Delta_{\min}=0.1$) and a large gap ($\Delta_{\min}=1.0$). We plot the approximation ratio and the additive approximation error of SAQM and SA-FOA over the first 100,000 rounds. From the results, we can see that their approximation ratios are almost always greater than 0.9, which is far better than the worst-case guarantee proved in theorem 2. In particular, the approximation ratio of SA-FOA in the small-gap case is surprisingly good (around 0.95) and grows as the number of rounds increases. This result implies that there is only a slight increase in the sample complexity caused by the approximation, especially when the expected rewards of single arms are close to each other.

### 6.3 Running Time

Next, we conduct experiments to compare the running times of the algorithms. We set $n=10,12,\ldots,24$ and $k=n/2$ on synthetic data sets. We report the computation time per round in Figure 2. Since ME is an elimination-based algorithm, we report its per-round time as the overall running time divided by the number of samples required by the algorithm. As can be seen, SA-Ex and CLUCB-Ex are prohibitive on instances with a large number of super arms, while our algorithms remain fast even as $n$ becomes larger, which matches our theoretical analysis. The results indicate that polynomial-time algorithms are of crucial importance for practical use.

### 6.4 Number of Samples

Finally, we evaluate the number of samples required to identify the best super arm for varying $\Delta_{\min}$. Based on the above observations, we set $\alpha=0.9$. The result is shown in Figure 3. We observed that our algorithms always output the optimal super arm. The result indicates that the numbers of samples of our algorithms are comparable to those of SA-Ex and CLUCB-Ex. Notice that LUCB does not show a decreasing trend in Figure 3. The reason may be that it reduces the problem to two instances, in which the gap between the $k$-th and $(k+1)$-th best arms may no longer be $\Delta_{\min}$.

### 6.5 Performance on Real-World Crowdsourcing Data Sets

We use the crowdsourcing data sets compiled by Li, Baba, and Kashima (2017), whose basic information is shown in Table 1. The task is to identify the top-$k$ workers with the highest accuracy only from sequential access to the accuracy of part of the labels given by some workers. Notice that the number of super arms is more than $10^{10}$ in all experiments. All data sets are hard instances, as $\Delta_{\min}$ is less than 0.05. We set $k=10$ and $\epsilon=0.5$. Since SA-Ex and CLUCB-Ex are prohibitive, we compare the other algorithms. SAQM, SA-FOA, and ICB employ the uniform allocation strategy.

| Data Set | Number of Tasks | Number of Workers | Average | Best | $\Delta_{\min}$ |
|---|---|---|---|---|---|
| IT | 25 | 36 | 0.54 | 0.84 | 0.04 |
| Medicine | 36 | 45 | 0.48 | 0.92 | 0.03 |
| Chinese | 24 | 50 | 0.37 | 0.79 | 0.04 |
| Pokémon | 20 | 55 | 0.28 | 1.00 | 0.05 |
| English | 30 | 63 | 0.26 | 0.70 | 0.03 |
| Science | 20 | 111 | 0.29 | 0.85 | 0.05 |


Note: “Average” and “Best” give the average and the best accuracy rate among the workers, respectively.

The result is shown in Table 2, which indicates the applicability of our algorithms to instances with a massive number of super arms. Moreover, all algorithms found the optimal subset of crowdworkers. In all data sets, SA-FOA outperformed the other algorithms. Recall that ICB uses a simplified confidence bound for the gap between two super arms. SA-FOA, on the other hand, uses an approximation of the confidence ellipsoids for the gap between two super arms, which results in better performance than ICB. SAQM approximately computes the maximal confidence ellipsoid bound for the reward of one super arm rather than the gap between two super arms, which may explain its worse performance compared to SA-FOA. CLUCB-QM, which employs the same sampling rule as CLUCB and the same stopping rule as SAQM, performed better than both CLUCB and SAQM. This result may indicate that an adaptive sampling rule is more desirable than a static one, and that using a confidence ellipsoid is more desirable than considering individual confidence bounds. ME, LUCB, and CLUCB discard the information from $k-1$ arms at each pull, which may cause their unfavorable results. LUCB worked better than CLUCB, since the original version of CLUCB was designed for very general combinatorial constraints, while LUCB was designed only for the top-$k$ setting. Notice that ME is phased adaptive while LUCB is fully adaptive; ME performed poorly in all instances, although it is the counterpart of LUCB.

| Data Set | SAQM | SA-FOA | CLUCB-QM | CLUCB | ME | LUCB | ICB |
|---|---|---|---|---|---|---|---|
| IT | 62,985 | 1437 | 9896 | 405,313 | 111,603 | 91,442 | 43,773 |
| Medicine | 96,174 | 865 | 15,678 | 400,953 | 139,504 | 109,124 | 66,468 |
| Chinese | 88,209 | 1060 | 19,438 | 754,439 | 301,635 | 129,795 | 99,424 |
| Pokémon | 83,209 | 328 | 1994 | 151,748 | 331,799 | 89,674 | 19,705 |
| English | 121,890 | 1023 | 31,300 | 671,274 | 380,060 | 117,611 | 114,406 |
| Science | 276,325 | 1505 | 100,950 | 1,825,106 | 1,292,074 | 224,494 | 418,155 |

Note: Each value is an average over 10 realizations.

## 7 Conclusion

We studied multiple-arm identification with full-bandit feedback, where we can observe only the sum of the rewards, not the reward of each single arm. Although our problem can be regarded as a special case of pure exploration in linear bandits, an approach based on linear bandits is not computationally feasible since the number of super arms may be exponential. To overcome this computational challenge, we designed a novel approximation algorithm with a theoretical guarantee for a 0-1 quadratic programming problem arising in confidence ellipsoid maximization. Based on our approximation algorithm, we proposed the $(\epsilon, \delta)$-PAC algorithm SAQM, which runs in $O(\log K)$ time, and provided an upper bound of the sample complexity that is still worst-case optimal; the result indicates that our algorithm provides an exponential speedup over an exhaustive search while keeping the statistical efficiency. We also designed two heuristic algorithms that empirically perform well: SA-FOA, based on a first-order approximation, and CLUCB-QM, based on the lower-upper confidence bound algorithm. Finally, we conducted experiments on synthetic and real-world data sets with more than $10^{10}$ super arms, demonstrating the superiority of our algorithms in terms of both computation time and sample complexity. There are several directions for future research. It remains open to design adaptive algorithms with a problem-dependent optimal sample complexity. Another interesting question is to seek a lower bound for any $(\epsilon, \delta)$-PAC algorithm that works in polynomial time. Extending our approach to combinatorial pure exploration with full-bandit feedback is another direction.

## Appendix A: Simplified Confidence Bounds for Combinatorial Pure Exploration

In this appendix, we present the fundamental observation behind employing a simplified confidence bound to obtain a computationally efficient algorithm for the combinatorial pure exploration problem. We consider any decision class $\mathcal{M}$ whose super arms satisfy a constraint under which the linear maximization problem is polynomial-time solvable; examples of such decision classes are paths, matchings, and matroids (see appendix B for the definition of matroids). The purpose of this appendix is to give a polynomial-time algorithm for combinatorial pure exploration with general constraints by using the simplified confidence bound, and to examine the trade-off between statistical efficiency and computational efficiency. The $(\epsilon, \delta)$-PAC algorithm proposed in this appendix, named ICB, is also evaluated as a simple benchmark strategy in our experiments.

For a matrix $B \in \mathbb{R}^{n \times n}$, let $B(i,j)$ denote the $(i,j)$th entry of $B$. We construct a simplified confidence bound, named an *independent confidence bound*, which is obtained by a diagonal approximation of the confidence ellipsoids. We start with the following lemma, which shows that $\theta$ lies in an independent confidence region centered at $\hat{\theta}_t$ with high probability.

Given any instance of combinatorial pure exploration with full-bandit feedback with decision class $\mathcal{M}$, ICB (algorithm 5) runs in polynomial time at each round $t \in \{1, 2, \ldots\}$.

For example, ICB runs in $O(\max\{n^2, n\,g(n)\})$ time for matroid constraints, where $g(n)$ is the computation time needed to check whether a given super arm is contained in the decision class. Note that $g(n)$ is polynomial in $n$ for any matroid constraint. For example, $g(n) = O(n)$ if we consider the case where each super arm corresponds to a spanning tree of a graph $G = (V, E)$ and the decision class corresponds to the set of spanning trees of $G$.

The proof is given in appendix F. Notice that in the MAB, this diagonal approximation is tight, since $A_{x_t}$ is a diagonal matrix. However, for combinatorial settings where the size of super arms is $k \geq 2$, there is no guarantee that this approximation is tight, and the approximation may degrade the sample complexity. Although the algorithm proposed here empirically performs well when the number of single arms is not large, as seen in Figure 3, it remains unclear whether the simplified confidence bound should be preferred to ellipsoidal confidence bounds. This is the reason we focus on the approach with confidence ellipsoids.
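To make the diagonal approximation concrete, here is a minimal numerical sketch (the function names are ours, and the confidence scaling that multiplies each per-arm radius in ICB is omitted): it compares the exact ellipsoidal width $\|y\|_{A^{-1}}$ with the independent bound obtained by keeping only the diagonal of $A^{-1}$.

```python
import numpy as np

def ellipsoid_width(A_inv, y):
    """Exact confidence width ||y||_{A^{-1}} = sqrt(y^T A^{-1} y)."""
    return float(np.sqrt(y @ A_inv @ y))

def independent_width(A_inv, y):
    """Diagonal (independent) approximation: drop the off-diagonal
    correlations and sum per-coordinate widths |y_i| sqrt((A^{-1})_{ii}).
    For a positive-semidefinite A_inv this never underestimates the
    exact width, since |m_ij| <= sqrt(m_ii m_jj)."""
    return float(np.abs(y) @ np.sqrt(np.diag(A_inv)))
```

The looseness discussed above is visible here: for $k \geq 2$ the vector $y$ has several nonzero coordinates, and the independent width can be strictly larger than the ellipsoidal one.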

## Appendix B: Definition of Matroids

A *matroid* is a combinatorial structure that abstracts many notions of independence, such as linearly independent vectors in a set of vectors, called the *linear matroid*, and spanning trees in a graph, called the *graphical matroid* (Whitney, 1935). Formally, a matroid is a pair $J = (E, \mathcal{I})$, where $E = \{1, 2, \ldots, n\}$ is a finite set called a *ground set* and $\mathcal{I} \subseteq 2^E$ is a family of subsets of $E$, called *independent sets*, that satisfies the following axioms:

1. $\emptyset \in \mathcal{I}$.

2. $X \subseteq Y \in \mathcal{I} \implies X \in \mathcal{I}$.

3. For all $X, Y \in \mathcal{I}$ such that $|X| < |Y|$, there exists $e \in Y \setminus X$ such that $X \cup \{e\} \in \mathcal{I}$.

A *weighted matroid* is a matroid equipped with a weight function $w : E \to \mathbb{R}$. For $F \subseteq E$, we define the weight of $F$ as $w(F) = \sum_{e \in F} w(e)$.

Let us consider the following problem: given a weighted matroid $J = (E, \mathcal{I})$ with $w : E \to \mathbb{R}$, find an independent set with maximum weight, that is, $\operatorname{argmax}_{F \in \mathcal{I}} w(F)$. This problem can be solved exactly by the following simple greedy algorithm (Karger, 1998). The algorithm initially sets $F$ to the empty set, sorts the elements of $E$ in decreasing order of weight, and, for each element $e$ in this order, adds $e$ to $F$ if $F \cup \{e\} \in \mathcal{I}$. Letting $g(n)$ be the computation time for checking whether a set is independent, the running time of the above algorithm is $O(n \log n + n\,g(n))$.
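The greedy procedure above can be sketched as follows (a minimal sketch with names of our choosing; `is_independent` stands for the membership oracle whose per-call cost is $g(n)$, and nonnegative weights are assumed):

```python
def greedy_max_weight(ground_set, weight, is_independent):
    """Greedy maximum-weight independent set of a matroid.

    weight: maps an element to its (assumed nonnegative) weight.
    is_independent: oracle deciding membership in the family I; the
    matroid axioms guarantee that this greedy order is optimal.
    Running time: O(n log n) for sorting plus n oracle calls.
    """
    F = set()
    for e in sorted(ground_set, key=weight, reverse=True):
        if is_independent(F | {e}):
            F.add(e)
    return F
```

For the rank-$k$ uniform matroid, where a set is independent if and only if it has at most $k$ elements, this reduces to picking the top-$k$ elements by weight.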

## Appendix C: Uniform Quadratic Knapsack Problem

## Appendix D: Allocation Strategies

In this appendix, we briefly introduce possible allocation strategies and describe how to convert a continuous allocation $p$ into a discrete allocation $x_t$ for any sample size $t$. We report the efficient rounding procedure introduced in Pukelsheim (2006). In the *G-allocation* strategy, we choose the sequence of selections $x_t$ as $x_t^G = \operatorname{argmin}_{x_t \in \mathbb{R}^{n \times t}} \max_{x \in \mathcal{X}} \|x\|_{A_{x_t}^{-1}}$ for $\mathcal{X} \subseteq \mathbb{R}^n$, which is an NP-hard optimization problem. Many studies in the experimental design literature have proposed approximate solutions (Bouhtou, Gaubert, & Sagnol, 2010; Sagnol, 2013). We can optimize the continuous relaxation of the problem by a projected gradient algorithm, a multiplicative algorithm, or an interior point algorithm. From the resulting optimal allocation $p$, we wish to design a discrete allocation for a fixed sample size $t$.

Given an allocation $p \in \mathcal{P}$, recall that $\operatorname{supp}(p) = \{j \in [K] : p_j > 0\}$. Let $t_i$ be the number of pulls for arm $i \in \operatorname{supp}(p)$ and let $s$ be the size of $\operatorname{supp}(p)$. Then, setting the *frequency* $t_i = \lceil (t - \frac{1}{2}s) p_i \rceil$ results in $\sum_{i \in \operatorname{supp}(p)} t_i$ samples. If $\sum_{i \in \operatorname{supp}(p)} t_i = t$, this allocation is the desired solution. Otherwise, we repeat the following procedure until $\sum_{i \in \operatorname{supp}(p)} t_i - t$ becomes 0: increase a frequency $t_j$ attaining $t_j / p_j = \min_{i \in \operatorname{supp}(p)} t_i / p_i$ to $t_j + 1$, or decrease some $t_j$ with $(t_j - 1)/p_j = \max_{i \in \operatorname{supp}(p)} (t_i - 1)/p_i$ to $t_j - 1$. Then $(t_1, \ldots, t_s)$ lies in the efficient design apportionment (see Pukelsheim, 2006). Note that since the relaxed problem has an exponential number of variables in our setting, we restrict attention to $\operatorname{supp}(p)$ instead of dealing with all super arms.
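The rounding procedure above can be sketched as follows (a sketch under our reading of Pukelsheim's efficient apportionment; variable names are ours):

```python
import math

def efficient_rounding(p, t):
    """Turn a continuous allocation p (entries summing to 1) into
    integer pull counts summing to t.

    Start from t_i = ceil((t - s/2) * p_i) on the support, then adjust
    one count at a time until the total equals t.
    """
    support = [j for j, pj in enumerate(p) if pj > 0]
    s = len(support)
    counts = {j: math.ceil((t - s / 2) * p[j]) for j in support}
    while sum(counts.values()) < t:
        # increase the count that most lags its proportional share
        j = min(support, key=lambda i: counts[i] / p[i])
        counts[j] += 1
    while sum(counts.values()) > t:
        # decrease the count that most exceeds its proportional share
        j = max(support, key=lambda i: (counts[i] - 1) / p[i])
        counts[j] -= 1
    return counts
```

Restricting the loop to `support` mirrors the remark above: only the super arms with positive mass in $p$ need to be tracked, not all $K$ of them.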

## Appendix E: Details of Baseline Algorithms

### E.1 Details of CLUCB

With probability at least $1-\delta$, the CLUCB algorithm (algorithm 6) returns an $\epsilon$-optimal super arm $\hat{M}^*$, and the number of samples $T$ is bounded as

The CLUCB algorithm is computationally efficient since it runs in polynomial time. However, as observed in our experiments, this naive baseline does not work well, especially on real-world data sets, since it cannot exploit the information from $k-1$ arms at each pull.

### E.2 Details of ME and LUCB

The entire procedure of ME is detailed in algorithm 7 with the ME subroutine in algorithm 8. First, we choose any subset $A$ of $k-1$ arms and find the best size-$k$ subset $B^*$ of $B = [n] \setminus A$ by the median elimination algorithm proposed by Kalyanakrishnan and Stone (2010). When $i \in B$ should be pulled, we pull the super arm $\{i\} \cup A$ instead of $i$. By this procedure, we can find the $k$ arms $i \in B$ maximizing $\theta(\{i\} \cup A)$, which are exactly the arms $i \in B$ maximizing $\theta(i)$. Clearly, the best size-$k$ subset of $[n]$ can then be obtained by finding the best size-$k$ subset of $A \cup B^*$. ME is $(\epsilon, \delta)$-PAC, and its sample complexity is $O\!\left(\frac{nk^2}{\epsilon^2} \log \frac{k}{\delta}\right)$.
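The reduction from single-arm queries to full-bandit pulls can be sketched as follows (a hypothetical helper of ours, not the paper's code; shown noiseless for clarity):

```python
def make_padded_pull(pull_super_arm, A):
    """ME's reduction: to query a single arm i outside A under
    full-bandit feedback, pull the super arm {i} | A instead.  Since A
    is fixed, E[reward({i} | A)] = theta(i) + theta(A), so ranking the
    arms of B = [n] \\ A by these padded pulls ranks them by theta(i).
    """
    A = frozenset(A)
    return lambda i: pull_super_arm(A | {i})
```

Any single-arm identification subroutine (here, median elimination) can then be run unchanged on the padded pulls.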

LUCB is a counterpart of ME, detailed in algorithm 7 with the LUCB subroutine in algorithm 9. LUCB employs the lower-upper confidence bound algorithm proposed by Kalyanakrishnan et al. (2012), whose sample complexity is $O\!\left(\sum_{e \in [n]} \frac{1}{\max\{\Delta_e, \epsilon\}^2} \log \frac{1}{\max\{\Delta_e, \epsilon\}^2\, \delta}\right)$, where $\Delta_e$ is defined by equation E.1. Since this subroutine is $(\epsilon, \delta)$-PAC, LUCB is also $(\epsilon, \delta)$-PAC. However, the sample complexity can be very large when the gap between the best $k$th and $(k+1)$th arms in the instance for its first subroutine is much smaller than $\Delta_{\min}$.

ME and LUCB are computationally efficient since they run in polynomial time. However, they must invoke a subroutine twice, which increases the number of samples.

### E.3 Details of Exponential Algorithms

The entire procedure of SA-Ex is detailed in algorithm 10. The only difference between this algorithm and SAQM is the stopping condition: SAQM approximately solves the confidence ellipsoid maximization, whereas SA-Ex conducts an exhaustive search to obtain the exact solution. Thus, SA-Ex runs in exponential time, $O(n^k)$. The entire procedure of CLUCB-Ex is detailed in algorithm 11. This algorithm also reduces our problem to pure exploration in the linear bandit and thus runs in exponential time. The stopping conditions used in CLUCB-Ex and SA-Ex are the same. CLUCB-Ex adaptively pulls super arms based on the CLUCB strategy, as in CLUCB and CLUCB-QM.
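For contrast, the exhaustive enumeration that SA-Ex must perform over all $\binom{n}{k}$ super arms can be sketched as follows (illustrative only; the actual stopping rule also involves the confidence term, which we omit):

```python
from itertools import combinations

def exhaustive_best_super_arm(theta_hat, k):
    """Enumerate all size-k super arms and return the one maximizing
    the empirical sum -- O(n^k) work per check, the bottleneck that
    SAQM's approximate oracle avoids."""
    n = len(theta_hat)
    return max(combinations(range(n), k),
               key=lambda M: sum(theta_hat[i] for i in M))
```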

## Appendix F: Proofs

First, we introduce notation. For $M, M' \in \mathcal{M}$, let $\Delta(M, M')$ be the *value gap* between two super arms: $\Delta(M, M') = |\theta(M) - \theta(M')|$. Also, let $\hat{\Delta}(M, M')$ be the *empirical gap* between two super arms: $\hat{\Delta}(M, M') = |\hat{\theta}_t(M) - \hat{\theta}_t(M')|$.

### F.1 Proof of Lemma 6

### F.2 Proof of Lemma 9

By lemma 6, we see that the probability that event $E = \bigcap_{t=1}^{\infty} E_t$ occurs is at least $1 - \delta$. Under event $E$, the output $\hat{M}^*$ is an $\epsilon$-optimal super arm. In the rest of the proof, we assume that event $E$ holds. Next, we focus on bounding the sample complexity $T$. Recalling the stopping condition A.2, a sufficient condition for stopping is that for $M^*$ and for $t > n$,

By lemma 6 with $x = \chi_{M^*} - \chi_{\bar{M}}$, with probability at least $1 - \delta$, we have

### F.3 Proof of Theorem 2

We begin by showing the following three lemmas.

Let $W \in \mathbb{R}^{n \times n}$ be any positive-definite matrix. Then $\tilde{G} = (V, E, \tilde{w})$, constructed by algorithm 1, is a nonnegatively weighted graph.

For any $(i, j) \in V^2$, we have $w_{ii} \geq 0$ and $w_{jj} \geq 0$ since $W$ is a positive-definite matrix. If $w_{ij} \geq 0$, it is obvious that $\tilde{w}_{ij} = w_{ij} + w_{ii} + w_{jj} \geq 0$. We consider the case $w_{ij} < 0$. In this case, we have $w_{ij} + w_{ii} + w_{jj} > 2 w_{ij} + w_{ii} + w_{jj} \geq 0$, where the last inequality holds by the definition of a positive-definite matrix $W$ (evaluate the quadratic form at $e_i + e_j$). Thus, we obtain the desired result. $\square$
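The reweighting in this lemma admits a quick numerical sanity check (our sketch of algorithm 1's edge weights $\tilde{w}_{ij} = w_{ij} + w_{ii} + w_{jj}$, as read from the proof above; the function name is ours):

```python
import numpy as np

def nonnegative_reweight(W):
    """Edge weights of the graph G~ built by algorithm 1 (as we read
    it): w~_ij = w_ij + w_ii + w_jj for i != j.  Positive definiteness
    gives (e_i + e_j)^T W (e_i + e_j) = 2 w_ij + w_ii + w_jj >= 0,
    hence every off-diagonal w~_ij is nonnegative."""
    d = np.diag(W)
    W_tilde = W + d[:, None] + d[None, :]
    np.fill_diagonal(W_tilde, 0.0)  # keep only off-diagonal edges
    return W_tilde
```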

Let $W \in \mathbb{R}^{n \times n}$ be any positive-definite matrix and $\tilde{W} = (\tilde{w}_{ij})$ be the adjacency matrix of the complete graph constructed by algorithm 1. Then, for any $S \subseteq V$ such that $|S| \geq 2$, we have $w(S) \leq \tilde{w}(S)$.

Let $W \in \mathbb{R}^{n \times n}$ be any positive-definite matrix and $\tilde{W} = (\tilde{w}_{ij})$ be the adjacency matrix of the complete graph constructed in algorithm 1. Then for any subset of vertices $S \subseteq V$, we have $\frac{\tilde{w}(S)}{w(S)} \leq (|S| - 1) \frac{\lambda_{\max}(W)}{\lambda_{\min}(W)}$, where $\lambda_{\min}(W)$ and $\lambda_{\max}(W)$ represent the minimum and maximum eigenvalues of $W$, respectively.

We consider the following two cases: case i, $\sum_{\{i,j\} \in E(S) : i \neq j} w_{ij} \geq 0$, and case ii, $\sum_{\{i,j\} \in E(S) : i \neq j} w_{ij} < 0$.

*Case i*. Since $W = (w_{ij})_{1 \leq i, j \leq n}$ is a positive-definite matrix, every diagonal component $w_{ii}$, $i \in V$, is positive. Thus, we have

Since $W$ is positive definite, we have $w(S)>0$. That gives us the desired result.

*Case ii*. In this case, we see that

We are now ready to prove theorem 2.

Recall that lemmas 10, 12, and 13 hold for $W$. We have

Therefore, we obtain $Z \geq \frac{1}{k-1} \frac{\lambda_{\min}(W)}{\lambda_{\max}(W)} \alpha_{\mathrm{D}k\mathrm{S}}\, \mathrm{OPT}$. $\square$

### F.4 Proof of Theorem 3

Updating $A_{x_{t-1}}$ can be done in $O(n^2)$ time, and computing the empirical best super arm can be done in $O(n)$ time. Moreover, the confidence maximization CEM can be approximately solved in polynomial time, since the quadratic maximization QP is solved in polynomial time as long as we employ a polynomial-time algorithm as the D$k$S-oracle. Let $\mathrm{poly}(n)_{\mathrm{D}k\mathrm{S}}$ be the computation time of the D$k$S-oracle. Then we can guarantee that SAQM runs in $O(\max\{n^2, \mathrm{poly}(n)_{\mathrm{D}k\mathrm{S}}\})$ time. $\square$

### F.5 Proof of Theorem 4

Before stating the proof of theorem 4, we give some technical lemmas.

Given any $t > n$, assume that $E_t'$ occurs. Then if SAQM (algorithm 2) terminates at round $t$, we have $\theta(M^*) - \theta(\hat{M}^*) \leq \epsilon$.

We see that the approximation ratio of CEM satisfies $\alpha_\tau = \Omega\!\left(k^{-1/2} n^{-1/8} \frac{\lambda_{\min}(\Lambda_p)}{\lambda_{\max}(\Lambda_p)}\right)$ for sufficiently large $\tau$ from theorem 2 and lemma 6 below if we use the best approximation algorithm for the D$k$S as the D$k$S-oracle (Bhaskara et al., 2010).

We are now ready to prove theorem 4.

We define event $E'$ as $\bigcap_{t=1}^{\infty} E_t'$. By proposition 1, the probability that event $E'$ occurs is at least $1 - \delta$. In the rest of the proof, we assume that this event holds. By lemma 15 and the assumption on $E'$, the output $\hat{M}^*$ is an $\epsilon$-optimal super arm. Next, we focus on bounding the sample complexity.

## Acknowledgments

We thank the anonymous reviewers for their comments and suggestions. Y.K. was supported by a Grant-in-Aid for JSPS Fellows (No. 18J23034) and JST CREST grant number JPMJCR1403, including AIP challenge program. A.M. was supported by a Grant-in-Aid for Research Activity Start-up (No. 17H07357) and a Grant-in-Aid for Early-Career Scientists (No. 19K20218). J.H. was supported by a Grant-in-Aid for Scientific Research on Innovative Areas (No. 16H00881). M.S. was supported by KAKENHI 17H00757.


## Author notes

L.X. is now at Gatsby Computational Neuroscience Unit, University College London.