## Abstract

We study the problem of stochastic multiple-arm identification, where an agent sequentially explores a size-$k$ subset of arms (also known as a super arm) from $n$ given arms and tries to identify the best super arm. Most work so far has considered the semi-bandit setting, where the agent can observe the reward of each pulled arm, or has assumed that each arm can be queried at each round. However, in real-world applications, it is costly or sometimes impossible to observe the rewards of individual arms. In this study, we tackle the full-bandit setting, where only a noisy observation of the total sum of a super arm is given at each pull. Although our problem can be regarded as an instance of best arm identification in linear bandits, a naive approach based on linear bandits is computationally infeasible since the number of super arms $K$ is exponential. To cope with this problem, we first design a polynomial-time approximation algorithm for a 0-1 quadratic programming problem arising in confidence ellipsoid maximization. Based on our approximation algorithm, we propose a bandit algorithm whose computation time is $O(\log K)$, thereby achieving an exponential speedup over linear bandit algorithms. We provide a sample complexity upper bound that is still worst-case optimal. Finally, we conduct experiments on large-scale data sets with more than $10^{10}$ super arms, demonstrating the superiority of our algorithms in terms of both the computation time and the sample complexity.

## 1  Introduction

The stochastic multiarmed bandit (MAB) is a classical decision-making model that characterizes the trade-off between exploration and exploitation in stochastic environments (Lai & Robbins, 1985). While the best-studied objective is to minimize the cumulative regret or maximize the cumulative reward (Bubeck & Cesa-Bianchi, 2012; Cesa-Bianchi & Lugosi, 2006), another popular objective is to identify the best arm with the maximum expected reward from $n$ given arms. This problem, called pure exploration or best arm identification in the MAB, has received much attention (Audibert & Bubeck, 2010; Chen & Li, 2015; Even-Dar, Mannor, & Mansour, 2002, 2006; Jamieson, Malloy, Nowak, & Bubeck, 2014; Kaufmann, Cappé, & Garivier, 2016).

An important variant of the MAB is the multiple-play MAB problem (MP-MAB), in which the agent pulls $k\ (\geq 1)$ different arms at each round. In many application domains, we need to take multiple actions among a set of all possible choices. For example, in online advertisement auctions, companies want to choose multiple keywords to promote their products to consumers based on their search queries (Rusmevichientong & Williamson, 2006). From millions of available choices, a company aims to find the most effective set of keywords by observing the historical performance of the chosen keywords. This decision making is formulated as the MP-MAB, where each arm corresponds to a keyword. In addition, the MP-MAB has further applications, such as channel selection in cognitive radio networks (Huang, Liu, & Ding, 2008), ranking web documents (Radlinski, Kleinberg, & Joachims, 2008), and crowdsourcing (Zhou, Chen, & Li, 2014). Owing to these various applications, the MP-MAB has received much attention, and several algorithms have been proposed for regret minimization (Agrawal, Hegde, & Teneketzis, 1990; Anantharam, Varaiya, & Walrand, 1987; Komiyama, Honda, & Nakagawa, 2015; Lagrée, Vernade, & Cappe, 2016). The adversarial case has also been studied in the literature (Cesa-Bianchi & Lugosi, 2012; Combes, Talebi Mazraeh Shahi, Proutiere, & Lelarge, 2015).

In this paper, we study the multiple-arm identification problem, which corresponds to pure exploration in the MP-MAB. In this problem, the goal is to find the size-$k$ subset (a super arm) with the maximum expected reward. The problem is also called top-$k$ selection or $k$-best arm identification and has been extensively studied recently (Bubeck, Wang, & Viswanathan, 2013; Cao, Li, Tao, & Li, 2015; Gabillon, Ghavamzadeh, & Lazaric, 2012; Gabillon, Ghavamzadeh, Lazaric, & Bubeck, 2011; Kalyanakrishnan & Stone, 2010; Kalyanakrishnan, Tewari, Auer, & Stone, 2012; Kaufmann & Kalyanakrishnan, 2013; Chaudhuri & Kalyanakrishnan, 2017, 2019; Zhou et al., 2014). This prior work has considered the semi-bandit setting, in which we can observe the reward of each single arm in the pulled super arm, or has assumed that a single arm can be queried. However, in many application domains, it is costly to observe the rewards of individual arms, or we cannot access feedback from individual arms at all. For example, in crowdsourcing, we often obtain many labels from crowdworkers, but it is costly to compile the labels by labeler. Furthermore, in software projects, an employer may have complicated tasks that need multiple workers, in which case the employer can evaluate only the quality of a completed task rather than a single worker's performance (Retelny et al., 2014; Tran-Thanh, Stein, Rogers, & Jennings, 2014). In such scenarios, we wish to extract expert workers who can perform the task with high quality from sequential access to the quality of tasks completed by multiple workers.

In this study, we tackle the multiple-arm identification with full-bandit feedback, where only a noisy observation of the total sum of a super arm is given at each pull rather than the reward of each pulled single arm. This setting is more challenging since estimators of the expected rewards of single arms are no longer independent of each other. To solve this problem, one might use an algorithm for the noncombinatorial top-$k$ identification problem based on the idea that the difference in mean reward between two single arms $i$ and $j$ can be estimated by pulling super arms $\{i\}\cup A$ and $\{j\}\cup A$ for any size-$(k-1)$ set $A\subseteq[n]$. However, such an approach fails to reduce the number of samples, as shown in section 6, since it cannot fully exploit the information from the other $k-1$ arms at each pull.

We can see our problem as an instance of pure exploration in linear bandits, which has received increasing attention (Lattimore & Szepesvari, 2017; Soare, Lazaric, & Munos, 2014; Tao, Blanco, & Zhou, 2018; Xu, Honda, & Sugiyama, 2018). In linear bandits, each arm has its own feature $x\in\mathbb{R}^{n}$, while in our problem, each super arm can be associated with a vector $x\in\{0,1\}^{n}$. Most linear bandit algorithms, however, have time complexity at least proportional to the number of arms. Therefore, naively using them is computationally infeasible, since the number of super arms $K=\binom{n}{k}$ is exponential. Only a little research on linear bandits has addressed the time complexity (Jun, Bhargava, Nowak, & Willett, 2017; Tao et al., 2018). Jun et al. (2017) proposed efficient algorithms for regret minimization, which achieve sublinear time complexity $O(K^{\rho})$ for a constant $\rho\in(0,1)$. Nevertheless, in our setting, they still have to spend $O(n^{\rho k})$ time, which is exponential. Thus, to perform multiple-arm identification with full-bandit feedback in practice, the computational infeasibility needs to be overcome, since fast decisions are required in real-world applications.

In this study, we design algorithms, which are efficient in terms of both the time complexity and the sample complexity. Our contributions are summarized as follows:

1. We propose a polynomial-time approximation algorithm (algorithm 1) for an NP-hard 0-1 quadratic programming problem arising in confidence ellipsoid maximization. In the design of the approximation algorithm, we utilize algorithms for a classical combinatorial optimization problem called the densest $k$-subgraph problem (D$k$S) (Feige, Peleg, & Kortsarz, 2001). Importantly, we provide a theoretical guarantee for the approximation ratio of our algorithm (theorem 1).

2. Based on our approximation algorithm, we propose a bandit algorithm (algorithm 2) that runs in $O(\log K)$ time (theorem 2) and provide an upper bound on its sample complexity (theorem 3) that is still worst-case optimal. This result means that our algorithm achieves an exponential speedup over linear bandit algorithms while keeping statistical efficiency. Moreover, we design two heuristic algorithms, which empirically perform well. We propose algorithm 3, which employs a first-order approximation of confidence ellipsoids, and algorithm 4, which is based on the lower-upper confidence-bound algorithm.

3. We conduct a series of experiments on both synthetic and real-world data sets. First, we run our proposed algorithms on synthetic data sets and verify that our algorithms give good approximations to an exhaustive search algorithm. Next, we evaluate our algorithms on large-scale crowdsourcing data sets with more than $10^{10}$ super arms, demonstrating the superiority of our algorithms in terms of both time complexity and sample complexity.

Note that the multiple-arm identification problem is a special class of the combinatorial pure exploration, where super arms follow certain combinatorial constraints such as paths, matchings, or matroids (Cao & Krishnamurthy, 2017; Chen, Gupta, & Li, 2016; Chen, Gupta, Li, Qiao, & Wang, 2017; Chen, Lin, King, Lyu, & Chen, 2014; Gabillon, Lazaric, Ghavamzadeh, Ortner, & Bartlett, 2016; Huang, Ok, Li, & Chen, 2018; Perrault, Perchet, & Valko, 2019). We can also design a simple algorithm (algorithm 5 in appendix A) for the combinatorial pure exploration under general constraints with full-bandit feedback, which results in a looser but general sample complexity bound. All proofs in this paper are given in appendix F.

## 2  Preliminaries

### 2.1  Problem Definition

Let $[n]=\{1,2,\dots,n\}$ for an integer $n$. For a vector $x\in\mathbb{R}^{n}$ and a positive-definite matrix $B\in\mathbb{R}^{n\times n}$, let $\|x\|_{B}=\sqrt{x^{\top}Bx}$. For a vector $\theta\in\mathbb{R}^{n}$ and a subset $S\subseteq[n]$, we define $\theta(S)=\sum_{e\in S}\theta(e)$. Now we describe the problem formulation formally. Suppose that there are $n$ single arms associated with unknown reward distributions $\{\phi_{1},\dots,\phi_{n}\}$. The reward from $\phi_{e}$ for each single arm $e\in[n]$ is expressed as $X_{t}(e)=\theta(e)+\varepsilon_{t}(e)$, where $\theta(e)$ is the expected reward and $\varepsilon_{t}(e)$ is zero-mean noise bounded in $[-R,R]$ for some $R>0$. The agent chooses a size-$k$ subset of the $n$ single arms at each round $t$ for an integer $k>0$. In the well-studied semi-bandit setting, the agent pulls a subset $M_{t}$ and then observes $X_{t}(e)$ for each $e\in M_{t}$, independently sampled from the associated unknown distribution $\phi_{e}$. In the full-bandit setting, however, she observes only the sum of rewards $r_{M_{t}}=\theta(M_{t})+\sum_{e\in M_{t}}\varepsilon_{t}(e)$ at each pull, which means that estimators of the expected rewards of single arms are no longer independent of each other.
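The feedback model can be made concrete with a minimal simulation (all names and constants below are ours for illustration, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

n, k, R = 6, 3, 0.5                  # arms, super-arm size, noise bound (illustrative)
theta = rng.uniform(0.0, 1.0, n)     # unknown expected rewards, hidden from the agent

def pull(super_arm):
    """Full-bandit feedback: only the noisy *sum* over the k pulled arms is observed."""
    assert len(super_arm) == k
    noise = rng.uniform(-R, R, size=k)   # zero-mean noise bounded in [-R, R], one per arm
    return theta[list(super_arm)].sum() + noise.sum()

reward = pull({0, 1, 2})  # one scalar; individual arm rewards are never revealed
```

Note that the single observation `reward` deviates from $\theta(M_t)$ by at most $kR$, since each of the $k$ noise terms is bounded.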

We call a size-$k$ subset of single arms a super arm. We define a decision class $\mathcal{M}$ as the finite set of super arms that satisfies the size constraint: $\mathcal{M}=\{M\in 2^{[n]}:|M|=k\}$; thus, the size of the decision class is $K=\binom{n}{k}$. Let $M^{*}$ be the optimal super arm in the decision class $\mathcal{M}$: $M^{*}=\arg\max_{M\in\mathcal{M}}\theta(M)$. In this letter, we focus on the $(\epsilon,\delta)$-PAC setting, where the goal is to design an algorithm whose output $\mathrm{Out}\in\mathcal{M}$ satisfies, for $\delta\in(0,1)$ and $\epsilon>0$, $\Pr[\theta(M^{*})-\theta(\mathrm{Out})\leq\epsilon]\geq 1-\delta$. An algorithm is called $(\epsilon,\delta)$-PAC if it satisfies this condition. In the fixed-confidence setting, the agent's performance is evaluated by her sample complexity: the number of rounds until the agent terminates.

### 2.2  Confidence Bound

In order to handle full-bandit feedback, we utilize approaches for best arm identification in linear bandits, in which the agent sequentially estimates $\theta$ from past observations and checks whether the estimation error is small. We introduce the necessary notation as follows. Let $(M_{1},M_{2},\dots,M_{t})\in\mathcal{M}^{t}$ be a sequence of super arms and $(r_{M_{1}},\dots,r_{M_{t}})\in\mathbb{R}^{t}$ be the corresponding sequence of observed rewards. Let $\chi_{M}\in\{0,1\}^{n}$ denote the indicator vector of super arm $M\in\mathcal{M}$; for each $e\in[n]$, $\chi_{M}(e)=1$ if $e\in M$ and $\chi_{M}(e)=0$ otherwise.

Given the sequence of super arm selections $x_{t}=(\chi_{M_{1}},\dots,\chi_{M_{t}})$, an unbiased least-squares estimator for $\theta\in\mathbb{R}^{n}$ can be obtained by
$\hat{\theta}_{t}=A_{x_{t}}^{-1}b_{x_{t}}\in\mathbb{R}^{n},$
(2.1)
where
$A_{x_{t}}=\sum_{i=1}^{t}\chi_{M_{i}}\chi_{M_{i}}^{\top}\in\mathbb{R}^{n\times n}\quad\text{and}\quad b_{x_{t}}=\sum_{i=1}^{t}\chi_{M_{i}}r_{M_{i}}\in\mathbb{R}^{n}.$
(2.2)
It suffices to consider the case where $A_{x_{t}}$ is invertible, since we exclude a redundant feature whenever no sampling strategy can make $A_{x_{t}}$ invertible. For $x_{t}$ fixed beforehand, Soare et al. (2014) provided the following proposition on the confidence ellipsoid for the ordinary least-squares estimator $\hat{\theta}_{t}$.
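As a quick numerical illustration of equations 2.1 and 2.2, the following sketch (names ours) pulls every size-$k$ super arm once with noiseless rewards, in which case the least-squares estimator recovers $\theta$ exactly:

```python
import numpy as np
from itertools import combinations

n, k = 5, 2
theta = np.array([0.9, 0.7, 0.5, 0.3, 0.1])   # illustrative true parameter

# Pull every size-k super arm once (noiseless here, to check unbiasedness).
pulls = [np.isin(np.arange(n), M).astype(float) for M in combinations(range(n), k)]
A = sum(np.outer(x, x) for x in pulls)        # A_{x_t} in eq. 2.2
b = sum(x * (x @ theta) for x in pulls)       # b_{x_t} in eq. 2.2, with zero noise
theta_hat = np.linalg.solve(A, b)             # eq. 2.1: unbiased least squares

# The matrix norm ||chi_M||_{A^{-1}} appearing in the confidence bound (eq. 2.3):
width = np.sqrt(pulls[0] @ np.linalg.inv(A) @ pulls[0])
```

With this all-subsets design, $A_{x_t}$ is invertible and `theta_hat` coincides with `theta`; with noisy rewards, the same two equations give the unbiased estimator used throughout.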
Proposition 1
(Soare et al., 2014, proposition 1). Let $\varepsilon_{t}$ be a noise variable bounded as $\varepsilon_{t}\in[-\sigma,\sigma]$ for $\sigma>0$. Let $c=2\sqrt{2}\sigma$ and $c'=6/\pi^{2}$, and fix $\delta\in(0,1)$. Then, for any fixed sequence $x_{t}$, with probability at least $1-\delta$, the inequality
$|x^{\top}\theta-x^{\top}\hat{\theta}_{t}|\leq C_{t}\|x\|_{A_{x_{t}}^{-1}}$
(2.3)
holds for all $t\in\{1,2,\dots\}$ and $x\in\mathbb{R}^{n}$, where $C_{t}=c\sqrt{\log(c't^{2}K/\delta)}$.

In our problem, the proposition holds for $\sigma=kR$. Two allocation strategies, named G-allocation and $XY$-allocation, are discussed in Soare et al. (2014). The optimal G-allocation can be approximated via convex optimization and an efficient rounding procedure, and the $XY$-allocation can be computed in a similar manner (see appendix D for details).

### 2.3  Computational Hardness

The agent continues sampling a super arm until a certain stopping condition is satisfied. In order to check the stopping condition, existing algorithms for best arm identification in linear bandits involve the following confidence ellipsoid maximization,
$\mathrm{CEM}:\ \max.\ \|\chi_{M}\|_{A_{x_{t}}^{-1}}\quad\mathrm{s.t.}\ M\in\mathcal{M},$
(2.4)
where we recall that $\|\chi_{M}\|_{A_{x_{t}}^{-1}}=\sqrt{\chi_{M}^{\top}A_{x_{t}}^{-1}\chi_{M}}$. Existing algorithms in linear bandits implicitly assume that an optimal solution to CEM can be exhaustively searched (Soare et al., 2014; Xu et al., 2018). However, since the number of super arms $K$ is exponential in our setting, it is computationally intractable to solve CEM exactly. Therefore, we need an approximation or a totally different approach for solving the multiple-arm identification with full-bandit feedback.

## 3  Confidence Ellipsoid Maximization

In this section, we design an approximation algorithm for the confidence ellipsoid maximization CEM. In the combinatorial optimization literature, an algorithm is called an $\alpha$-approximation algorithm if, for any instance, it returns a solution whose objective value is at least $\alpha\in(0,1]$ times the optimal value. Let $W\in\mathbb{R}^{n\times n}$ be a symmetric matrix. CEM, introduced in equation 2.4, can then be naturally represented by the following 0-1 quadratic programming problem:
$\mathrm{QP}:\ \max.\ \sum_{i=1}^{n}\sum_{j=1}^{n}w_{ij}x_{i}x_{j}\quad\mathrm{s.t.}\ \sum_{i=1}^{n}x_{i}=k,\ x_{i}\in\{0,1\},\ \forall i\in[n].$
(3.1)

Notice that QP can be seen as an instance of the uniform quadratic knapsack problem, which is known to be NP-hard (Taylor, 2016), and few polynomial-time approximation results are known even for special cases (see appendix C for details).
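For small $n$, QP can still be solved exactly by enumerating all $\binom{n}{k}$ supports; the sketch below (our own baseline, not part of the paper's algorithms) makes the formulation concrete and shows why enumeration is hopeless at scale:

```python
import numpy as np
from itertools import combinations

def qp_exhaustive(W, k):
    """Exact QP solution by enumerating all C(n, k) supports -- small n only."""
    n = W.shape[0]
    best_val, best_S = -np.inf, None
    for S in combinations(range(n), k):
        x = np.zeros(n)
        x[list(S)] = 1.0
        val = x @ W @ x               # sum_{i,j} w_ij x_i x_j restricted to S
        if val > best_val:
            best_val, best_S = val, set(S)
    return best_val, best_S
```

With $K=\binom{n}{k}$ candidates, this exhaustive baseline is only usable as ground truth for checking the approximation algorithms on toy instances.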

In this study, by utilizing algorithms for a classical combinatorial optimization problem called the densest $k$-subgraph problem (D$k$S), we design an approximation algorithm that admits a theoretical performance guarantee for QP with a positive-definite matrix $W$. The D$k$S is defined as follows. Let $G=(V,E,w)$ be an undirected graph with nonnegative edge weights $w=(w_{e})_{e\in E}$.

For a vertex set $S\subseteq V$, let $E(S)=\{\{u,v\}\in E:u,v\in S\}$ be the set of edges in the subgraph induced by $S$. We denote by $w(S)$ the sum of the edge weights in the subgraph induced by $S$: $w(S)=\sum_{e\in E(S)}w_{e}$. In the D$k$S, given $G=(V,E,w)$ and a positive integer $k$, we are asked to find $S\subseteq V$ with $|S|=k$ that maximizes $w(S)$. Although the D$k$S is NP-hard, there is a variety of polynomial-time approximation algorithms (Asahiro, Iwama, Tamaki, & Tokuyama, 2000; Bhaskara, Charikar, Chlamtac, Feige, & Vijayaraghavan, 2010; Feige et al., 2001). The current best approximation for the D$k$S has a ratio of $\Omega(1/|V|^{1/4+\varepsilon})$ for any $\varepsilon>0$ (Bhaskara et al., 2010). A direct reduction of QP to the D$k$S, however, results in an instance with arbitrary (possibly negative) edge weights. Existing algorithms cannot be used for such an instance, since they assume that all edge weights are nonnegative.

Now we present our algorithm for QP, which is detailed in algorithm 1. The algorithm operates in two steps. In the first step, it constructs an $n$-vertex complete graph $\tilde{G}=(V,E,\tilde{w})$ from the given symmetric matrix $W\in\mathbb{R}^{n\times n}$. For each $\{i,j\}\in E$, the edge weight $\tilde{w}_{ij}$ is set to $w_{ij}+w_{ii}+w_{jj}$. Note that if $W$ is positive definite, $\tilde{w}_{ij}\geq 0$ holds for every $\{i,j\}\in E$, which means that $\tilde{G}$ is a valid instance of the D$k$S (lemma 10 in appendix F). In the second step, the algorithm calls the densest $k$-subgraph oracle (D$k$S-Oracle), which accepts $\tilde{G}$ as input and returns, in polynomial time, an approximate solution for the D$k$S. Note that we can use any polynomial-time approximation algorithm for the D$k$S as the D$k$S-Oracle. Let $\alpha_{\mathrm{D}k\mathrm{S}}$ be the approximation ratio of the algorithm employed by the D$k$S-Oracle. Through a careful analysis of the approximation ratio of algorithm 1, we obtain the following theorem.
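The two steps above can be sketched as follows (our own minimal rendering, not the paper's implementation; the exhaustive oracle merely stands in for a polynomial-time D$k$S approximation such as greedy peeling):

```python
import numpy as np
from itertools import combinations

def qp_via_dks(W, k, dks_oracle):
    """Step 1: build the complete graph with weights w~_ij = w_ij + w_ii + w_jj.
    Step 2: hand it to any DkS approximation oracle and return its vertex set."""
    n = W.shape[0]
    w_tilde = {(i, j): W[i, j] + W[i, i] + W[j, j]
               for i, j in combinations(range(n), 2)}   # nonnegative when W is PD
    return dks_oracle(n, k, w_tilde)

def exact_dks(n, k, w):
    """Brute-force DkS, standing in for a polynomial-time approximation oracle."""
    return max((set(S) for S in combinations(range(n), k)),
               key=lambda S: sum(w[e] for e in combinations(sorted(S), 2)))
```

The nonnegativity claimed in the lead-in holds because, for positive-definite $W$, $(e_i+e_j)^\top W(e_i+e_j)=w_{ii}+w_{jj}+2w_{ij}>0$ implies $w_{ij}+w_{ii}+w_{jj}\ge(w_{ii}+w_{jj})/2>0$.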

Theorem 1.

For QP with any positive-definite matrix $W\in\mathbb{R}^{n\times n}$, algorithm 1 with an $\alpha_{\mathrm{D}k\mathrm{S}}$-approximation D$k$S-Oracle is a $\frac{1}{k-1}\frac{\lambda_{\min}(W)}{\lambda_{\max}(W)}\alpha_{\mathrm{D}k\mathrm{S}}$-approximation algorithm, where $\lambda_{\min}(W)$ and $\lambda_{\max}(W)$ represent the minimum and maximum eigenvalues of $W$, respectively.

## 4  Main Algorithm

Based on the approximation algorithm proposed in the previous section, we propose two algorithms for the multiple-arm identification with full-bandit feedback. Note that we assume $k\geq 2$, since the multiple-arm identification with $k=1$ is the same as the best arm identification problem in the MAB.

### 4.1  Proposed Algorithm Based on Static Allocation

First, we deal with static allocation strategies, which sequentially sample super arms from a fixed sequence. In general, adaptive strategies perform better than static ones, but because of the computational hardness, we focus on static ones to analyze the worst-case optimality in terms of the minimum gap $\Delta_{\min}=\min_{M\in\mathcal{M}\setminus\{M^{*}\}}\left(\theta(M^{*})-\theta(M)\right)$. In static algorithms, the agent pulls super arms from a fixed set until a certain stopping condition is satisfied. Therefore, it is important to construct a stopping condition guaranteeing, as quickly as possible, that the estimate $\hat{\theta}_{t}$ belongs to the set of parameters under which the empirical best super arm $\hat{M}_{t}^{*}=\arg\max_{M\in\mathcal{M}}\hat{\theta}_{t}(M)$ coincides with the optimal super arm $M^{*}$.

Now we propose an algorithm named SAQM, which is detailed in algorithm 2. Let $\mathcal{P}$ be the $K$-dimensional probability simplex. We define an allocation strategy $p=(p_{M})_{M\in\mathcal{M}}\in\mathcal{P}$, where $p_{M}$ prescribes the proportion of pulls to super arm $M$, and let $\mathrm{supp}(p)=\{M\in\mathcal{M}:p_{M}>0\}$ be its support. Let $T_{M}(t)$ be the number of times that $M$ has been pulled before the $(t+1)$th round. At each round $t$, SAQM samples a super arm $M_{t}=\arg\min_{M\in\mathrm{supp}(p)}T_{M}(t)/p_{M}$ and updates the statistics $A_{x_{t}}$, $b_{x_{t}}$, and $\hat{\theta}_{t}$. Then the algorithm computes the empirical best super arm $\hat{M}_{t}^{*}$ and approximately solves CEM in equation 2.4 using algorithm 1 as a subroutine. Note that any $\alpha$-approximation algorithm for QP yields a $\sqrt{\alpha}$-approximation algorithm for CEM. SAQM employs the following stopping condition,
$\hat{\theta}_{t}(\hat{M}_{t}^{*})-C_{t}\|\chi_{\hat{M}_{t}^{*}}\|_{A_{x_{t}}^{-1}}\geq\max_{M\in\mathcal{M}\setminus\{\hat{M}_{t}^{*}\}}\hat{\theta}_{t}(M)+\frac{1}{\alpha_{t}}C_{t}Z_{t}-\epsilon,$
(4.1)
where $Z_{t}$ denotes the objective value of an approximate solution $M_{t}'$ for CEM and $\alpha_{t}$ denotes the approximation ratio of our algorithm for CEM at round $t$. Note that we can specify the value of $\alpha_{t}$ using the guarantee in theorem 1, and this stopping condition guarantees that the output is $\epsilon$-optimal with high probability. As the following theorem states, SAQM provides an exponential speedup over exhaustive search algorithms.
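A schematic of one SAQM-style round, in our own hypothetical notation (the names `support`, `counts`, and `chi_second_val`, and the inflation of $Z_t$ by $1/\alpha_t$, follow the description above; this is a sketch, not the paper's algorithm 2 verbatim):

```python
import numpy as np

def choose_next(support, counts):
    """Tracking rule: pull the super arm whose empirical frequency lags its
    target proportion p_M the most, i.e., argmin_M T_M(t) / p_M.
    `support` is a list of (chi_M, p_M) pairs; `counts` tracks T_M(t)."""
    return min(range(len(support)), key=lambda m: counts[m] / support[m][1])

def should_stop(theta_hat, chi_best, chi_second_val, A_inv, C_t, Z_t, alpha_t, eps):
    """Stopping rule in the spirit of eq. 4.1: the lower confidence bound of the
    empirical best must beat the best rival reward plus the (approximately
    maximized) confidence width Z_t, inflated by 1/alpha_t to cover the
    approximation error, minus the slack eps."""
    lcb_best = theta_hat @ chi_best - C_t * np.sqrt(chi_best @ A_inv @ chi_best)
    return lcb_best >= chi_second_val + C_t * Z_t / alpha_t - eps
```

Here `chi_second_val` plays the role of $\max_{M\neq\hat M_t^*}\hat\theta_t(M)$, and `Z_t` is the value returned by the approximate CEM solver (algorithm 1 applied to $W=A_{x_t}^{-1}$).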
Theorem 2.

Let $\mathrm{poly}_{\mathrm{D}k\mathrm{S}}(n)$ be the computation time of the D$k$S-Oracle. Then at any round $t>0$, SAQM (algorithm 2) runs in $O(\max\{n^{2},\mathrm{poly}_{\mathrm{D}k\mathrm{S}}(n)\})$ time.

Most existing approximation algorithms for the D$k$S run efficiently. For example, if we employ the algorithm by Feige et al. (2001), which runs in $O(n^{\omega})$ time, as the D$k$S-Oracle in algorithm 1, the running time of SAQM becomes $O(n^{\omega})$, where the exponent $\omega\leq 2.373$ is that of matrix multiplication (see Le Gall, 2014). If we instead employ the algorithm by Asahiro et al. (2000), which runs in $O(n^{2})$ time, the running time of SAQM becomes $O(n^{2})$.

Let $\Lambda_{p}=\sum_{M\in\mathcal{M}}p_{M}\chi_{M}\chi_{M}^{\top}$ be a design matrix. We define the problem complexity $H_{\epsilon}$ as
$H_{\epsilon}=\frac{\rho(p)}{(\Delta_{\min}+\epsilon)^{2}},$
where $\rho(p)=\max_{M\in\mathcal{M}}\|\chi_{M}\|^{2}_{\Lambda_{p}^{-1}}$, which also appeared in Soare et al. (2014). The next theorem shows that SAQM is $(\epsilon,\delta)$-PAC and gives a problem-dependent sample complexity bound.
Theorem 3.
Given any instance of the multiple-arm identification with full-bandit feedback, with probability at least $1-\delta$, SAQM (algorithm 2) with an $\alpha$-approximation of CEM returns an $\epsilon$-optimal set $\hat{M}^{*}$, and the total number of samples $T$ is bounded as follows:
$T\leq\left(\frac{8}{3}+\frac{1}{\alpha^{2}}\right)\sigma^{2}H_{\epsilon}\log\frac{K}{\delta}+C(H_{\epsilon},\delta),$
where
$C(H_{\epsilon},\delta)=O\left(\sigma^{2}H_{\epsilon}\log\left(\frac{\sigma^{2}}{\alpha^{2}}H_{\epsilon}+\log\frac{K}{\delta}\right)\right).$

It is worth mentioning that if we have an $\alpha$-approximation algorithm for CEM with a more general decision class $\mathcal{M}$ (such as paths, matchings, or matroids), we obtain the same sample complexity bound as in theorem 3 for combinatorial pure exploration (CPE) with general constraints. For the top-$k$ identification setting, we have $\alpha=\Omega\left(k^{-\frac{1}{2}}n^{-\frac{1}{8}}\sqrt{\frac{\lambda_{\min}(\Lambda_{p})}{\lambda_{\max}(\Lambda_{p})}}\right)$ in theorem 3.

Soare et al. (2014) considered the oracle sample complexity of the linear best arm identification problem. The oracle complexity, which is based on the optimal allocation strategy $p$ derived from the true parameter $\theta$, is $O(H_{\epsilon}\log(1/\delta))$ if we ignore the terms that are not related to $H_{\epsilon}$ and $\delta$. Soare et al. (2014) showed that the sample complexity of the G-allocation strategy matches the oracle sample complexity up to constants in the worst case. The sample complexity of SAQM is also worst-case optimal in the sense that it matches $O(H_{\epsilon}\log(1/\delta))$, while SAQM runs in polynomial time.

Note that if we use proposition 2, which is given in section 5, instead of proposition 1, we obtain a better sample complexity bound in terms of $k$ (or $n$). However, the sample complexity with proposition 2 becomes complicated, since it depends on the regularization parameter $\lambda$. Therefore, we analyze the sample complexity bound based on proposition 1 to clarify how the sample complexity depends on $\alpha$, $H_{\epsilon}$, and $\delta$ rather than on $k$ (or $n$).

Although the quantity $\rho(p)$ is unbounded in the worst case, it is upper-bounded by $n$, as pointed out in Soare et al. (2014), if we use the G-allocation strategy as the sampling strategy $p$. However, the exact G-allocation might be hard to compute if $n$ is large, since the number of variables can be exponential.

Remark 1.

Suppose that SAQM (algorithm 2) employs the G-allocation strategy $p^{G}=\arg\min_{p\in\mathcal{P}}\max_{M\in\mathcal{M}}\chi_{M}^{\top}\Lambda_{p}^{-1}\chi_{M}$. Then, from the theorem in Kiefer and Wolfowitz (1960, section 2), we have $\rho(p^{G})=n$.

We also note that under some mild conditions, we have an upper bound of the condition number of $Λp$.

Lemma 1.

Suppose that SAQM (algorithm 2) employs a sampling strategy $p\in\mathcal{P}$ in which the agent pulls some size-$k$ set including single arm $i$ with probability $\Omega(1/n)$ for all $i\in[n]$; that is, $\min_{i\in[n]}\sum_{M\in\mathrm{supp}(p):i\in M}p_{M}=\Omega(1/n)$. Then the condition number of $\Lambda_{p}$ satisfies $\frac{\lambda_{\max}(\Lambda_{p})}{\lambda_{\min}(\Lambda_{p})}=O(nk)$.

## 5  Heuristic Algorithms

In this section, we propose two algorithms that still output an $\epsilon$-optimal super arm with high probability but do not come with a sample complexity bound. The first employs a first-order approximation of confidence ellipsoids to check the stopping condition efficiently; the second is based on an adaptive sampling strategy. We evaluated both empirically and observed that they perform well in our experiments.

### 5.1  First-Order Approximation for Confidence Ellipsoid Maximization

In SAQM, we compute an upper confidence bound of the expected reward of each super arm. However, in order to reduce the number of required samples, we wish to directly construct a tight confidence bound for the gap in reward between two super arms. For this reason, we propose another algorithm, SA-FOA, whose procedure is shown in algorithm 3. Given an allocation strategy $p$, the algorithm continues sampling until the stopping condition $\frac{\epsilon}{2}\geq Z_{t}'-\hat{\theta}_{t}(\hat{M}_{t}^{*})$ is satisfied, where $Z_{t}'$ denotes the objective value of an approximate solution of the following maximization problem:
$\max_{M\in\mathcal{M}\setminus\{\hat{M}_{t}^{*}\}}\hat{\theta}_{t}(M)+C_{t}\|\chi_{M}-\chi_{\hat{M}_{t}^{*}}\|_{A_{x_{t}}^{-1}}.$
(5.1)
The second term of equation 5.1 can be regarded as the confidence interval of the estimated gap $\hat{\theta}_{t}(M)-\hat{\theta}_{t}(\hat{M}_{t}^{*})$. In order to simultaneously maximize the estimated reward $\hat{\theta}_{t}(M)$ and the matrix norm $\|\chi_{M}-\chi_{\hat{M}_{t}^{*}}\|_{A_{x_{t}}^{-1}}$, we employ a first-order approximation technique. For a fixed super arm $M_{i}$, we approximate $\|\chi_{M}-\chi_{\hat{M}_{t}^{*}}\|_{A_{x_{t}}^{-1}}$ using the following bound:
$\|\chi_{M}-\chi_{\hat{M}_{t}^{*}}\|_{A_{x_{t}}^{-1}}\leq\frac{\|\chi_{M}-\chi_{\hat{M}_{t}^{*}}\|^{2}_{A_{x_{t}}^{-1}}}{2\|\chi_{M_{i}}-\chi_{\hat{M}_{t}^{*}}\|_{A_{x_{t}}^{-1}}}+\frac{\|\chi_{M_{i}}-\chi_{\hat{M}_{t}^{*}}\|_{A_{x_{t}}^{-1}}}{2},$
(5.2)
which follows from $\sqrt{a+x}\leq\sqrt{a}+\frac{x}{2\sqrt{a}}$ for any $a,x>0$. Using equation 5.2, the objective function in equation 5.1 can be bounded as follows:
$\hat{\theta}_{t}(M)+C_{t}\|\chi_{M}-\chi_{\hat{M}_{t}^{*}}\|_{A_{x_{t}}^{-1}}\leq\hat{\theta}_{t}^{\top}\chi_{M}+C_{t}\left(\frac{\|\chi_{M}-\chi_{\hat{M}_{t}^{*}}\|^{2}_{A_{x_{t}}^{-1}}}{2\|\chi_{M_{i}}-\chi_{\hat{M}_{t}^{*}}\|_{A_{x_{t}}^{-1}}}+\frac{\|\chi_{M_{i}}-\chi_{\hat{M}_{t}^{*}}\|_{A_{x_{t}}^{-1}}}{2}\right)$
(5.3)
$=\hat{\theta}_{t}^{\top}\chi_{M}+C_{t}\frac{\|\chi_{M}\|^{2}_{A_{x_{t}}^{-1}}-2(A_{x_{t}}^{-1}\chi_{\hat{M}_{t}^{*}})^{\top}\chi_{M}+\|\chi_{\hat{M}_{t}^{*}}\|^{2}_{A_{x_{t}}^{-1}}}{2\|\chi_{M_{i}}-\chi_{\hat{M}_{t}^{*}}\|_{A_{x_{t}}^{-1}}}+C_{t}\frac{\|\chi_{M_{i}}-\chi_{\hat{M}_{t}^{*}}\|_{A_{x_{t}}^{-1}}}{2}.$
(5.4)
For any $y\in\mathbb{R}^{n}$, let $\mathrm{Diag}(y)$ be the diagonal matrix whose $i$th diagonal entry is $y(i)$ for $i\in[n]$. Note that for a fixed round $t$ and fixed $M_{i}$, the terms $\|\chi_{M_{i}}-\chi_{\hat{M}_{t}^{*}}\|_{A_{x_{t}}^{-1}}$ and $\|\chi_{\hat{M}_{t}^{*}}\|^{2}_{A_{x_{t}}^{-1}}$ do not depend on $\chi_{M}$. Therefore, the above first-order approximation allows us to transform the original problem into QP, where the objective function is
$\chi_{M}^{\top}\left(\gamma A_{x_{t}}^{-1}-\mathrm{Diag}\left(2\gamma A_{x_{t}}^{-1}\chi_{\hat{M}_{t}^{*}}\right)+\mathrm{Diag}(\hat{\theta}_{t})\right)\chi_{M},$
with the positive constant $\gamma=\frac{C_{t}}{2\|\chi_{M_{i}}-\chi_{\hat{M}_{t}^{*}}\|_{A_{x_{t}}^{-1}}}$. We can approximately solve this problem by algorithm 1 and choose the best approximate solution, that is, the one maximizing the original objective, among the $\ell n$ super arms. Notice that SA-FOA is an $(\epsilon,\delta)$-PAC algorithm: since we compute an upper bound of the objective function in equation 5.1, it will not stop too early. It works well in our experiments, although we have no theoretical results on its sample complexity. We will also observe in the experiments that the approximation error of SA-FOA for equation 5.1 becomes smaller as the number of rounds increases.
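The linearization in equation 5.2 is easy to check numerically. The sketch below (with arbitrary vectors standing in for the indicator differences; all names ours) confirms that the first-order bound always dominates the true norm:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
B = rng.normal(size=(n, n))
A_inv = B @ B.T + 0.1 * np.eye(n)   # a positive-definite matrix playing A_{x_t}^{-1}

def norm(v):
    """The matrix norm ||v||_B = sqrt(v^T B v) used in the confidence bounds."""
    return np.sqrt(v @ A_inv @ v)

v_ref = rng.normal(size=n)          # plays chi_{M_i} - chi_{M_t^*} (linearization point)
v = rng.normal(size=n)              # plays chi_M    - chi_{M_t^*}

# First-order upper bound of eq. 5.2: linearize sqrt around ||v_ref||^2.
ub = norm(v) ** 2 / (2 * norm(v_ref)) + norm(v_ref) / 2
assert norm(v) <= ub + 1e-9         # equivalent to (||v|| - ||v_ref||)^2 >= 0
```

The bound is tight when `v` and `v_ref` have equal norms, which is why SA-FOA's approximation improves as the linearization point approaches the maximizer.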

### 5.2  Adaptive Algorithm

In the previous section, using our approximation scheme for confidence ellipsoid maximization, we proposed algorithms based on static allocation strategies. However, in order to design a near-optimal sampling algorithm, we should focus on adaptive algorithms, which adaptively change the arm selection strategy based on past observations at every round. In this section, we propose an adaptive algorithm that makes use of existing classical methods.

If $x_{t}$ is adaptively determined based on past observations, we consider the regularized least-squares estimator given by
$\hat{\theta}_{t}=(A_{x_{t}}^{\lambda})^{-1}b_{x_{t}},$
(5.5)
where $A_{x_{t}}^{\lambda}$ is defined by
$A_{x_{t}}^{\lambda}=\lambda I+\sum_{i=1}^{t}\chi_{M_{i}}\chi_{M_{i}}^{\top},$
for a regularization parameter $\lambda>0$ and the identity matrix $I$. In the remainder of this section, we assume that the reward distribution $\phi_{e}$ is $R$-sub-gaussian for all $e\in[n]$. For the regularized least-squares estimator, we have another confidence bound, which is also valid for adaptive strategies. Abbasi-Yadkori, Pál, and Szepesvári (2011) proposed, for linear bandits, a confidence bound that can be used even if $x_{t}$ is adaptively defined. Plugging their confidence bound into our setting, we have the following proposition.
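A minimal sketch of the regularized estimator in equation 5.5 (illustrative names; noiseless rewards and a tiny $\lambda$ so the estimate essentially recovers $\theta$):

```python
import numpy as np

def ridge_estimate(pulls, rewards, lam):
    """Regularized least squares of eq. 5.5: well defined even before the pulled
    super arms span R^n, at the cost of an O(lam) bias."""
    n = pulls[0].shape[0]
    A_lam = lam * np.eye(n) + sum(np.outer(x, x) for x in pulls)   # A_{x_t}^lambda
    b = sum(x * r for x, r in zip(pulls, rewards))                 # b_{x_t}
    return np.linalg.solve(A_lam, b)

# Tiny sanity check with noiseless rewards and a vanishing regularizer.
theta = np.array([1.0, 2.0, 3.0])
pulls = [np.array(x, dtype=float) for x in ([1, 1, 0], [0, 1, 1], [1, 0, 1])]
est = ridge_estimate(pulls, [x @ theta for x in pulls], lam=1e-8)
```

Unlike the unregularized estimator of equation 2.1, this one never requires invertibility of the raw design matrix, which is what makes it usable under adaptive sampling.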
Proposition 2
(Adapted from Abbasi-Yadkori et al., 2011, theorem 3). Let $\varepsilon_{t}$ be $R$-sub-gaussian noise. If the $\ell_{2}$-norm of the parameter $\theta$ is at most $S$, then the statement
$|x^{\top}\theta-x^{\top}\hat{\theta}_{t}|\leq C_{t}\|x\|_{(A_{x_{t}}^{\lambda})^{-1}}$
(5.6)
holds for all $t∈{1,2,…}$ and $x∈Rn$ with probability at least $1-δ$, where
$C_{t}=R\sqrt{2k\log\frac{\det(A_{x_{t}}^{\lambda})^{\frac{1}{2}}}{\lambda^{\frac{n}{2}}\delta}}+\lambda^{\frac{1}{2}}S.$
(5.7)
Moreover, if $\|x\|\leq\sqrt{k}$ holds for all $t>0$, then
$C_{t}\leq R\sqrt{2kn\log\frac{1+tk/\lambda}{\delta}}+\lambda^{\frac{1}{2}}S.$
(5.8)

We propose an algorithm named CLUCB-QM that employs an adaptive strategy based on the combinatorial lower-upper confidence bound (CLUCB) algorithm (Chen et al., 2014). The CLUCB algorithm was originally designed for the semi-bandit setting. We can modify it for the full-bandit setting: whereas the original algorithm queries one single arm per round, we replace that single-arm query with queries to two super arms that differ in exactly one single arm. However, with the original stopping condition used in Chen et al. (2014), the algorithm performs poorly in terms of the number of samples, as we will observe in our experiments. In CLUCB-QM, we instead use the same stopping criterion as in SAQM, which maintains the least-squares estimator and involves confidence ellipsoid maximization.

The entire procedure of CLUCB-QM is detailed in algorithm 4. The algorithm maintains a confidence radius $\mathrm{rad}_{t}(e)$ for each single arm $e\in[n]$, together with a vector $\tilde{\theta}_{t}$ that penalizes single arms belonging to the current empirical best super arm $\hat{M}_{t}^{*}$ and encourages exploring single arms outside $\hat{M}_{t}^{*}$. CLUCB-QM chooses the single arm $e_{t}$ that has the largest confidence radius in the symmetric difference of $\hat{M}_{t}^{*}$ and $\tilde{M}_{t}=\arg\max_{M\in\mathcal{M}}\tilde{\theta}_{t}(M)$. The algorithm then queries two super arms that differ in one arm: CLUCB-QM pulls $M_{t}\in\mathcal{M}$ such that $e_{t}\in M_{t}$ and $e'\notin M_{t}$, and then pulls $M_{t}'=(M_{t}\setminus\{e_{t}\})\cup\{e'\}$, where $e'$ is a fixed single arm. Since CLUCB-QM employs the stopping condition in equation 4.1, as in SAQM, we can prove that it outputs an $\epsilon$-optimal super arm with probability at least $1-\delta$. Although we do not have a sample complexity bound for CLUCB-QM, it performs better than SAQM in our experiments.

## 6  Experiments

In this section, we evaluate the empirical performance of our algorithms: SAQM (algorithm 2), SA-FOA (algorithm 3), and CLUCB-QM (algorithm 4). We implement a baseline algorithm, ICB (algorithm 5 in appendix A), which runs in polynomial time. ICB employs simplified confidence bounds obtained by a diagonal approximation of confidence ellipsoids. Note that ICB can solve the combinatorial pure exploration problem with general constraints and admits another sample complexity bound (lemma 9 in appendix A).

To verify the effectiveness of our novel stopping rule based on confidence ellipsoids, we implement CLUCB (algorithm 6 in appendix E.1); its stopping rule differs from that of CLUCB-QM, but its sampling rule is the same. CLUCB estimates the gap between each single arm and a fixed arm by pulling two super arms that differ in one single arm. As with CLUCB-QM, CLUCB is based on the lower-upper confidence-bound algorithm for general constraints proposed by Chen et al. (2014), but it employs a stopping condition that does not require confidence ellipsoid maximization. Notice that CLUCB is $(\epsilon,\delta)$-PAC (see its sample complexity in appendix E.1).

We implement two baseline algorithms that invoke noncombinatorial top-$k$ identification algorithms: one is the elimination-based algorithm ME (algorithm 7 with ME-Subroutine, algorithm 8, in appendix E.2); the other is the confidence-bound-based algorithm LUCB (algorithm 7 with LUCB-Subroutine, algorithm 9, in appendix E.2). ME employs the median elimination algorithm by Kalyanakrishnan and Stone (2010) for the noncombinatorial setting as a subroutine with a simple modification: we sample $\{i\}\cup A$ whenever base arm $i\in[n]$ should be sampled, for a fixed size-$(k-1)$ subset $A\subseteq[n]$. LUCB is a counterpart of ME that employs the lower-upper confidence-bound algorithm proposed by Kalyanakrishnan et al. (2012). Note that LUCB and ME are also $(\epsilon,\delta)$-PAC (see appendix E for their sample complexities).

We compare our algorithms with two exponential-time algorithms, SA-Ex and CLUCB-Ex, which reduce our problem to the pure exploration problem in linear bandits (see appendix E.3 for details). We conduct the experiments on small synthetic data sets and large-scale real-world data sets.

All experiments were conducted on a MacBook with a 1.3 GHz Intel Core i5 and 8 GB of memory. All code was implemented in Python. In all experiments, we employed the approximation algorithm called greedy peeling (Asahiro et al., 2000) as the D$k$S-Oracle. Specifically, the greedy peeling algorithm iteratively removes a vertex with the minimum weighted degree from the currently remaining graph until the subset of remaining vertices has size $k$. The algorithm runs in $O(n^{2})$ time.
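Following the description above, the greedy peeling oracle can be sketched as follows (our own rendering, assuming the graph is given as a dense symmetric weight matrix):

```python
import numpy as np

def greedy_peeling(adj, k):
    """Greedy peeling (Asahiro et al., 2000): repeatedly delete the vertex of
    minimum weighted degree until exactly k vertices remain. O(n^2) time."""
    adj = np.array(adj, dtype=float)   # symmetric nonnegative weight matrix
    alive = set(range(adj.shape[0]))
    deg = adj.sum(axis=1)              # weighted degrees in the current graph
    while len(alive) > k:
        v = min(alive, key=lambda u: deg[u])   # peel the lightest vertex
        alive.remove(v)
        for u in alive:                # its removal lowers the neighbors' degrees
            deg[u] -= adj[u, v]
    return alive
```

The per-iteration cost is $O(n)$ for the minimum search plus $O(n)$ for the degree updates, over at most $n-k$ iterations, matching the stated $O(n^2)$ bound.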

### 6.1  Synthetic Data Sets

To see the dependence of the performance on the minimum gap $Δ_{\min}$, we generate synthetic instances as follows. We first set the expected rewards of the top-$k$ single arms uniformly at random from $[0,1]$. Let $θ_{\min\text{-}k}$ be the minimum expected reward among the top-$k$ single arms. We set the expected reward of the $(k+1)$th best single arm to $θ_{\min\text{-}k}-Δ_{\min}$ for a predetermined parameter $Δ_{\min}∈[0,1]$. Then we generate the expected rewards of the remaining single arms by uniform samples from $[-1,θ_{\min\text{-}k}-Δ_{\min}]$, so that the expected reward of the best super arm exceeds those of all other super arms by at least $Δ_{\min}$. We set the additive noise distribution to $N(0,1)$. In all instances, we set $δ=0.05$ and $ɛ=0.5$. For the regularization parameter in CLUCB-QM, we set $λ=1$ as in Xu et al. (2018). SAQM, SA-FOA, and ICB employ the G-allocation strategy. In SA-FOA, $ℓ$ is set to 2 for all experiments.
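The instance generation above can be sketched in a few lines of Python; this is our own illustration (the function and variable names are ours), not the authors' experimental code:

```python
import random

def generate_instance(n, k, gap, seed=0):
    """Expected rewards: top-k arms uniform in [0,1]; the (k+1)th best arm
    sits exactly `gap` below the weakest top-k arm; the rest are uniform
    in [-1, theta_min_k - gap], so the best super arm wins by >= gap."""
    rng = random.Random(seed)
    top = sorted((rng.random() for _ in range(k)), reverse=True)
    theta_min_k = top[-1]                      # weakest of the top-k arms
    rest_hi = theta_min_k - gap                # (k+1)th best single arm
    rest = [rest_hi] + [rng.uniform(-1.0, rest_hi) for _ in range(n - k - 1)]
    return top + rest                          # expected rewards, best arms first
```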

### 6.2  Approximation Error

First, we examine the approximation precision of our approximation algorithms. The results are reported in Figure 1. SAQM and SA-FOA employ approximation mechanisms to test the stopping condition in polynomial time. Recall that SAQM approximately solves CEM in equation 2.4 to attain an objective value $Z_t$, and SA-FOA approximately solves the maximization problem in equation 5.1 to attain an objective value $Z_t'$. We set up the experiments with $n=10$ single arms and $k=5$, and run them for a small gap ($Δ_{\min}=0.1$) and a large gap ($Δ_{\min}=1.0$). We plot the approximation ratio and the additive approximation error of SAQM and SA-FOA over the first 100,000 rounds. From the results, we can see that their approximation ratios are almost always greater than 0.9, far better than the worst-case guarantee proved in theorem 2. In particular, the approximation ratio of SA-FOA in the small-gap case is surprisingly good (around 0.95) and grows as the number of rounds increases. This result implies that there is only a slight increase in the sample complexity caused by the approximation, especially when the expected rewards of single arms are close to each other.

Figure 1:

Approximation precision for synthetic data sets with $(n,k)=(10,5)$. Each point corresponds to an average over 10 realizations.


### 6.3  Running Time

Next, we conduct experiments to compare the running time of the algorithms. We set $n=10,12,…,24$ and $k=n/2$ on synthetic data sets. We report the computation time per round in Figure 2. Since ME is an elimination-based algorithm, we report its per-round computation time as the overall running time divided by the number of samples required by the algorithm. As can be seen, SA-Ex and CLUCB-Ex are prohibitive on instances with a large number of super arms, while our algorithms run fast even as $n$ grows, which matches our theoretical analysis. The results indicate that polynomial-time algorithms are of crucial importance for practical use.

Figure 2:

Run time in each round for synthetic data sets. Each point is an average over 10 realizations.


### 6.4  Number of Samples

Finally, we evaluate the number of samples required to identify the best super arm for varying $Δ_{\min}$. Based on the above observation, we set $α=0.9$. The result is shown in Figure 3. We observed that our algorithms always output the optimal super arm. The result indicates that the numbers of samples of our algorithms are comparable to those of SA-Ex and CLUCB-Ex. Notice that LUCB does not show a decreasing trend in Figure 3. The reason may be that it reduces the problem to two instances, in which the gap between the $k$th and $(k+1)$th best arms may no longer be $Δ_{\min}$.

Figure 3:

Number of samples for synthetic data sets with $(n,k)=(10,5)$. Each point is an average over 10 realizations.


### 6.5  Performance on Real-World Crowdsourcing Data Sets

We use the crowdsourcing data sets compiled by Li, Baba, and Kashima (2017), whose basic information is shown in Table 1. The task is to identify the top-$k$ workers with the highest accuracy only from sequential access to the accuracy of a part of the labels given by the workers. Notice that the number of super arms is more than $10^{10}$ in all experiments. All data sets are hard instances, as $Δ_{\min}$ is less than 0.05. We set $k=10$ and $ɛ=0.5$. Since SA-Ex and CLUCB-Ex are prohibitive, we compare only the other algorithms. SAQM, SA-FOA, and ICB employ the uniform allocation strategy.

Table 1:
Real-World Data Sets on Crowdsourcing.
| Data Set | Number of Tasks | Number of Workers | Average | Best | $Δ_{\min}$ |
| --- | --- | --- | --- | --- | --- |
| IT | 25 | 36 | 0.54 | 0.84 | 0.04 |
| Medicine | 36 | 45 | 0.48 | 0.92 | 0.03 |
| Chinese | 24 | 50 | 0.37 | 0.79 | 0.04 |
| Pokémon | 20 | 55 | 0.28 | 1.00 | 0.05 |
| English | 30 | 63 | 0.26 | 0.70 | 0.03 |
| Science | 20 | 111 | 0.29 | 0.85 | 0.05 |

Note: “Average” and “Best” give the average and the best accuracy rate among the workers, respectively.

The result is shown in Table 2, which indicates the applicability of our algorithms to instances with a massive number of super arms. Moreover, all algorithms found the optimal subset of crowdworkers. In all data sets, SA-FOA outperformed the other algorithms. Recall that ICB uses the simplified confidence bound for the gap between two super arms, whereas SA-FOA uses an approximation of the confidence ellipsoid for that gap, which results in better performance than ICB. SAQM approximately computes the maximal confidence ellipsoid bound for the reward of one super arm rather than the gap between two super arms, which may explain its worse performance compared with SA-FOA. CLUCB-QM, which employs the same sampling rule as CLUCB and the same stopping rule as SAQM, performed better than both CLUCB and SAQM. This result may indicate that an adaptive sampling rule is more desirable than a static one and that using a confidence ellipsoid is more desirable than an individual confidence bound. ME, LUCB, and CLUCB discard the information from $k-1$ arms at each pull, which may cause their unfavorable results. LUCB worked better than CLUCB, since the original version of CLUCB was designed for very general combinatorial constraints, while LUCB was designed only for the top-$k$ setting. Notice that ME is phased adaptive while LUCB is fully adaptive; ME performed poorly in all instances, even though it is the counterpart of LUCB.

Table 2:
Number of Samples ($×102$) on Real-World Crowdsourcing Data Sets.
| Data Set | SAQM | SA-FOA | CLUCB-QM | CLUCB | ME | LUCB | ICB |
| --- | --- | --- | --- | --- | --- | --- | --- |
| IT | 62,985 | 1437 | 9896 | 405,313 | 111,603 | 91,442 | 43,773 |
| Medicine | 96,174 | 865 | 15,678 | 400,953 | 139,504 | 109,124 | 66,468 |
| Chinese | 88,209 | 1060 | 19,438 | 754,439 | 301,635 | 129,795 | 99,424 |
| Pokémon | 83,209 | 328 | 1994 | 151,748 | 331,799 | 89,674 | 19,705 |
| English | 121,890 | 1023 | 31,300 | 671,274 | 380,060 | 117,611 | 114,406 |
| Science | 276,325 | 1505 | 100,950 | 1,825,106 | 1,292,074 | 224,494 | 418,155 |

Note: Each value is an average over 10 realizations.

## 7  Conclusion

We studied multiple-arm identification with full-bandit feedback, where we can observe only the sum of the rewards, not the reward of each single arm. Although our problem can be regarded as a special case of pure exploration in linear bandits, an approach based on linear bandits is not computationally feasible since the number of super arms may be exponential. To overcome the computational challenge, we designed a novel approximation algorithm with a theoretical guarantee for a 0-1 quadratic programming problem arising in confidence ellipsoid maximization. Based on our approximation algorithm, we proposed the $(ε,δ)$-PAC algorithm SAQM, which runs in $O(\log K)$ time, and provided an upper bound on its sample complexity that is still worst-case optimal; the result indicates that our algorithm achieves an exponential speedup over an exhaustive search algorithm while keeping statistical efficiency. We also designed two heuristic algorithms that empirically perform well: SA-FOA, which uses a first-order approximation, and CLUCB-QM, which is based on the lower-upper confidence-bound algorithm. Finally, we conducted experiments on synthetic and real-world data sets with more than $10^{10}$ super arms, demonstrating the superiority of our algorithms in terms of both computation time and sample complexity. There are several directions for future research. It remains open to design adaptive algorithms with a problem-dependent optimal sample complexity. Another interesting question is to seek a lower bound for any $(ε,δ)$-PAC algorithm that works in polynomial time. Extension to combinatorial pure exploration with full-bandit feedback under more general constraints is another direction.

## Appendix A:  Simplified Confidence Bounds for the Combinatorial Pure Exploration

In this appendix, we present the fundamental observation behind employing a simplified confidence bound to obtain a computationally efficient algorithm for the combinatorial pure exploration problem. We consider any decision class $M$ whose super arms satisfy a constraint for which the linear maximization problem is polynomial-time solvable. Examples of such decision classes are paths, matchings, or matroids (see appendix B for the definition of matroids). The purpose of this appendix is to give a polynomial-time algorithm for combinatorial pure exploration with general constraints by using the simplified confidence bound and to examine the trade-off between statistical efficiency and computational efficiency. The $(ε,δ)$-PAC algorithm proposed here, named ICB, is also evaluated as a simple benchmark strategy in our experiments.

For a matrix $B∈Rn×n$, let $B(i,j)$ denote the $(i,j)$th entry of $B$. We construct a simplified confidence bound, named an independent confidence bound, which is obtained by diagonal approximation of confidence ellipsoids. We start with the following lemma, which shows that $θ$ lies in an independent confidence region centered at $θ^t$ with high probability.

Lemma 2.
Let $c'=6/π^2$. Let $ε_t$ be a noise variable bounded as $ε_t∈[-σ,σ]$ for $σ>0$. Then for any fixed sequence $x_t$, any $t∈\{1,2,…\}$, and $δ∈(0,1)$, with probability at least $1-δ$, the inequality
$|x^⊤θ-x^⊤\hat{θ}_t|≤C_t∑_{i=1}^n|x_i|\sqrt{A_{x_t}^{-1}(i,i)}$
(A.1)
holds for all $x∈\{-1,0,1\}^n$, where
$C_t=σ\sqrt{2\log(c't^2n/δ)}.$
This lemma can be derived from proposition 1 and the triangle inequality. The right-hand side of equation A.1 has only linear terms in $\{x_i\}_{i∈[n]}$, whereas that of equation 3 in proposition 1 has the matrix norm $∥x∥_{A_{x_t}^{-1}}$, which makes the maximization hard. As long as a linear maximization oracle is available, maximization of the right-hand side of equation A.1 can also be done in polynomial time. For example, under matroid constraints it can be solved by the simple greedy procedure (Karger, 1998) described in appendix B. Based on the independent confidence bounds, we propose ICB, which is detailed in algorithm 5. At each round $t$, ICB computes the empirical best super arm $\hat{M}_t^*$ and then solves the following maximization problem:
$P1:\;\max\;\hat{θ}_t(M)+C_t∑_{i=1}^n|χ_M(i)-χ_{\hat{M}_t^*}(i)|\sqrt{A_{x_t}^{-1}(i,i)}\quad\text{s.t.}\;M∈\mathcal{M}∖\{\hat{M}_t^*\}.$
The second term in the objective of $P1$ can be regarded as the confidence interval of the estimated gap $θ^t(M)-θ^t(M^t*)$. ICB continues sampling a super arm until the following stopping condition is satisfied:
$Z_t^*-\hat{θ}_t(\hat{M}_t^*)<ɛ,$
(A.2)
where $Zt*$ represents the optimal value of $P1$. Note that $P1$ is solvable in polynomial time because it is an instance of linear maximization problems. As the following lemma states, ICB is an efficient algorithm in terms of the computation time.
Lemma 3.

Given any instance of combinatorial pure exploration with full-bandit feedback with decision class $M$, ICB (algorithm 5) at each round $t∈{1,2,…}$ runs in polynomial time.

For example, ICB runs in $O(max{n2,ng(n)})$ time for matroid constraints, where $g(n)$ is the computation time to check whether a given super arm is contained in the decision class. Note that $g(n)$ is polynomial in $n$ for any matroid constraints. For example, $g(n)=O(n)$ if we consider the case where each super arm corresponds to a spanning tree of a graph $G=(V,E)$ and a decision class corresponds to a set of spanning trees in a given graph $G$.
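For concreteness, the width of the independent confidence interval on an estimated gap (the second term in the objective of $P1$) can be computed directly from the diagonal of $A_{x_t}^{-1}$. The following Python sketch is ours, with illustrative names:

```python
import math

def gap_confidence_width(M, M_star, a_inv_diag, C_t):
    """C_t * sum_i |chi_M(i) - chi_M*(i)| * sqrt(A^{-1}(i, i)):
    the independent (diagonal) confidence width of the gap estimate.
    M and M_star are sets of base arms; a_inv_diag is diag(A_{x_t}^{-1})."""
    n = len(a_inv_diag)
    chi = lambda S, i: 1.0 if i in S else 0.0
    return C_t * sum(abs(chi(M, i) - chi(M_star, i)) * math.sqrt(a_inv_diag[i])
                     for i in range(n))
```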

From the definition, we have $A_t=∑_{M∈\mathcal{M}}T_M(t)χ_Mχ_M^⊤$, where $T_M(t)$ denotes the number of times that $M$ is pulled before round $t+1$. Let $Λ_p'=∑_{M∈\mathcal{M}}p_Mχ_Mχ_M^⊤$. We define $ρ'(p)$ as
$ρ'(p)=\max_{M,M'∈\mathcal{M}}\left(∑_{i=1}^n|χ_M(i)-χ_{M'}(i)|\sqrt{(Λ_p')^{-1}(i,i)}\right)^2.$
(A.3)
Now, we give a problem-dependent sample complexity bound of ICB with allocation strategy $p$ as follows.
Lemma 4.
Given any instance of combinatorial pure exploration with decision class $M$ in the full-bandit setting, with probability at least $1-δ$, ICB (algorithm 5) returns an $ɛ$-optimal super arm $\hat{M}^*$, and the total number of samples $T$ is bounded as follows:
$T=O\left(k^2R^2H'_ɛ\log\left(\frac{n}{δ}\,kR\sqrt{H'_ɛ}\left(k^2R^2H'_ɛ+\log\frac{n}{δ}\right)\right)\right),\quad\text{where }H'_ε=\frac{ρ'(p)}{(Δ_{\min}+ɛ)^2}.$

The proof is given in appendix F. Notice that in the MAB, this diagonal approximation is tight since $A_{x_t}$ becomes a diagonal matrix. However, for combinatorial settings where the size of super arms is $k≥2$, there is no guarantee that this approximation is tight; the approximation may degrade the sample complexity. Although the algorithm proposed here empirically performs well when the number of single arms is not large, as seen in Figure 3, it is still unclear whether the simplified confidence bound should be preferred to confidence ellipsoids. This is the reason we focus on the approach with confidence ellipsoids.

## Appendix B:  Definition of Matroids

A matroid is a combinatorial structure that abstracts many notions of independence such as linearly independent vectors in a set of vectors, called the linear matroid, and spanning trees in a graph, called the graphical matroid (Whitney, 1935). Formally, a matroid is a pair $J=(E,I)$, where $E={1,2,…,n}$ is a finite set called a ground set, and $I⊆2E$ is a family of subsets of $E$ called independent sets that satisfies the following axioms:

1. $∅∈I$.

2. $X⊆Y∈I⟹X∈I$.

3. $∀X,Y∈I$ such that $|X|<|Y|$, $∃e∈Y∖X$ such that $X∪{e}∈I$.

A weighted matroid is a matroid equipped with a weight function $w:E→\mathbb{R}$. For $F⊆E$, we define the weight of $F$ as $w(F)=∑_{e∈F}w(e)$.

Let us consider the following problem: given a weighted matroid $J=(E,I)$ with $w:E→\mathbb{R}$, we are asked to find an independent set with the maximum weight, that is, $\mathrm{argmax}_{F∈I}w(F)$. This problem can be solved exactly by the following simple greedy algorithm (Karger, 1998). The algorithm initially sets $F$ to the empty set. Then it sorts the elements of $E$ in decreasing order of weight, and for each element $e$ in this order, it adds $e$ to $F$ if $F∪\{e\}∈I$. Letting $g(n)$ be the computation time for checking whether $F$ is independent, we see that the running time of the above algorithm is $O(n\log n+ng(n))$.
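The greedy procedure above can be sketched as follows; a minimal Python illustration of ours, in which the function `is_independent` stands in for the matroid's independence oracle:

```python
def matroid_greedy(weights, is_independent):
    """Maximum-weight independent set of a matroid: scan elements in
    decreasing order of weight, adding each one that keeps F independent."""
    F = []
    for e in sorted(range(len(weights)), key=lambda i: -weights[i]):
        if is_independent(F + [e]):
            F.append(e)
    return F
```

With the uniform matroid of rank 2 (`is_independent = lambda S: len(S) <= 2`) and weights `[5, 1, 3]`, the algorithm picks elements 0 and 2.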

## Appendix C:  Uniform Quadratic Knapsack Problem

Assume that we have $n$ items, each of which has weight 1. In addition, we are given an $n×n$ nonnegative integer matrix $W=(w_{ij})$, where $w_{ii}$ is the profit achieved if item $i$ is selected and $w_{ij}+w_{ji}$ is the profit achieved if both items $i$ and $j$ are selected, for $i<j$. The uniform quadratic knapsack problem (UQKP) calls for selecting a subset of items whose overall weight does not exceed a given knapsack capacity $k$ so as to maximize the overall profit. The UQKP can be formulated as the following 0-1 integer quadratic program:
$\max\;∑_{i=1}^n∑_{j=1}^nw_{ij}x_ix_j\quad\text{s.t.}\;∑_{i=1}^nx_i≤k,\;x_i∈\{0,1\}\;(i∈[n]).$
The UQKP is an NP-hard problem. Indeed, the maximum clique problem, which is also NP-hard, can be reduced to it. Given a graph $G=(V,E)$, we set $w_{ii}=0$ for all $i$ and $w_{ij}=1$ for all $\{i,j\}∈E$. Then $G$ contains a clique of size $k$ if and only if the optimal solution of the problem has value $k(k-1)$ (Taylor, 2016).
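The reduction above can be checked on toy instances with a brute-force UQKP solver; the following exponential-time sketch (ours, for illustration only) enumerates all size-$k$ subsets:

```python
from itertools import combinations

def uqkp_brute_force(W, k):
    """Maximum of sum_{i,j in S} w_ij over all size-k subsets S.
    For the clique reduction (w_ii = 0, w_ij = 1 on edges), the optimum
    equals k*(k-1) exactly when the graph contains a k-clique."""
    n = len(W)
    return max(sum(W[i][j] for i in S for j in S)
               for S in combinations(range(n), k))
```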

## Appendix D:  Allocation Strategies

In this appendix, we briefly introduce possible allocation strategies and describe how to convert a continuous allocation $p$ into a discrete allocation $x_t$ for any sample size $t$. We use the efficient rounding procedure introduced in Pukelsheim (2006). In the G-allocation strategy, we choose the sequence of selections $x_t$ as $x_t^G=\mathrm{argmin}_{x_t∈\mathbb{R}^{n×t}}\max_{x∈X}∥x∥_{A_{x_t}^{-1}}$ for $X⊆\mathbb{R}^n$, which is an NP-hard optimization problem. Many studies in the experimental design literature have proposed approximate solutions (Bouhtou, Gaubert, & Sagnol, 2010; Sagnol, 2013). We can optimize the continuous relaxation of the problem by the projected gradient algorithm, the multiplicative algorithm, or an interior point algorithm. From the obtained optimal allocation $p$, we wish to design a discrete allocation for a fixed sample size $t$.

Given an allocation $p∈P$, recall that $\mathrm{supp}(p)=\{j∈[K]:p_j>0\}$. Let $t_i$ be the number of pulls for arm $i∈\mathrm{supp}(p)$ and $s$ be the size of $\mathrm{supp}(p)$. Then, letting the frequency $t_i=\lceil(t-\frac{1}{2}s)p_i\rceil$ results in $∑_{i∈\mathrm{supp}(p)}t_i$ samples. If $∑_{i∈\mathrm{supp}(p)}t_i=t$, this allocation is a desired solution. Otherwise, we conduct the following procedure until $∑_{i∈\mathrm{supp}(p)}t_i-t$ becomes 0: increase a frequency $t_j$ that attains $t_j/p_j=\min_{i∈\mathrm{supp}(p)}t_i/p_i$ to $t_j+1$, or decrease some $t_j$ with $(t_j-1)/p_j=\max_{i∈\mathrm{supp}(p)}(t_i-1)/p_i$ to $t_j-1$. Then $(t_1,…,t_s)$ lies in the efficient design apportionment (see Pukelsheim, 2006). Note that since the relaxed problem has an exponential number of variables in our setting, we restrict ourselves to $\mathrm{supp}(p)$ instead of dealing with all super arms.
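The rounding procedure above can be sketched as follows; this is our own Python illustration, assuming `p` lists the positive entries of the continuous allocation:

```python
import math

def round_allocation(p, t):
    """Efficient rounding (Pukelsheim, 2006): start from
    t_i = ceil((t - s/2) * p_i) and adjust the frequency whose ratio
    t_i / p_i is most extreme until the counts sum to t."""
    s = len(p)
    counts = [math.ceil((t - s / 2) * pi) for pi in p]
    while sum(counts) != t:
        if sum(counts) < t:
            j = min(range(s), key=lambda i: counts[i] / p[i])
            counts[j] += 1
        else:
            j = max(range(s), key=lambda i: (counts[i] - 1) / p[i])
            counts[j] -= 1
    return counts
```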

## Appendix E:  Details of Baseline Algorithms

### E.1  Details of $CLUCB$

A baseline strategy, CLUCB, is given in algorithm 6. CLUCB was proposed for the combinatorial pure exploration problem in which each single arm can be queried (Chen et al., 2014). With a simple modification, we can use the original algorithm in the full-bandit setting. For a fixed single arm $e'$, CLUCB estimates the gap between each single arm and $e'$ by comparing two super-arm queries that differ in a single arm. The vector $\hat{θ}_t'$ represents the empirical gap between each arm and $e'$. The stopping condition employed by CLUCB is
$\tilde{θ}_t'(\tilde{M}_t)-\tilde{θ}_t'(\hat{M}_t^*)≤ɛ,$
as in the original algorithm (Chen et al., 2014). The variance becomes $O(k)$ times larger than in the setting where single arms can be queried. Therefore, CLUCB is an $(ɛ,δ)$-PAC algorithm with the following sample complexity bound.
Proposition 3
(Chen et al., 2014, theorem 2). With probability at least $1-δ$, the CLUCB algorithm (algorithm 6) returns an $ɛ$-optimal super arm $\hat{M}^*$, and the number of samples $T$ is bounded as
$T≤O\left(kR^2∑_{e∈[n]}\min\left\{\frac{2}{Δ_e^2},\frac{k^2}{ɛ^2}\right\}\log\left(\frac{kR^2}{δ}∑_{e∈[n]}\min\left\{\frac{2}{Δ_e^2},\frac{k^2}{ɛ^2}\right\}\right)\right),$
where for each base arm $e∈[n]$, the gap $Δ_e$ is defined as
$Δ_e=\begin{cases}θ(M^*)-\max_{M∈\mathcal{M}:e∈M}θ(M)&\text{if }e∉M^*,\\θ(M^*)-\max_{M∈\mathcal{M}:e∉M}θ(M)&\text{if }e∈M^*.\end{cases}$
(E.1)

The CLUCB algorithm is efficient in terms of computation time since it runs in polynomial time. However, as observed in our experiments, this naive baseline does not work well, especially on real-world data sets, since it cannot exploit the information from the other $k-1$ arms at each pull.

### E.2  Details of $ME$ and $LUCB$

The entire procedure of ME is detailed in algorithm 7 with ME-Subroutine algorithm 8. First, we choose any subset $A$ of $k-1$ arms and find the best size-$k$ subset $B^*$ of $B=[n]∖A$ by the median elimination algorithm proposed by Kalyanakrishnan and Stone (2010). When $i∈B$ should be pulled, we pull the super arm $\{i\}∪A$ instead of $i$. By this procedure, we obtain the $k$ arms $i∈B$ that maximize $θ(\{i\}∪A)$, which are exactly the arms $i∈B$ that maximize $θ(i)$. Clearly, the best size-$k$ subset of $[n]$ can then be obtained by finding the best size-$k$ subset of $A∪B^*$. ME is $(ε,δ)$-PAC, and its sample complexity is $O\left(\frac{nk^2}{ε^2}\log\frac{k}{δ}\right)$.

LUCB is a counterpart of ME, detailed in algorithm 7 with LUCB-Subroutine algorithm 9. LUCB employs the lower-upper confidence-bound algorithm proposed by Kalyanakrishnan et al. (2012), whose sample complexity is
$O\left(∑_{e∈[n]}\frac{1}{\max\{Δ_e,ε\}^2}\log\left(\frac{1}{\max\{Δ_e,ε\}^2\,δ}\right)\right),$
where $Δ_e$ is defined by equation E.1. Since this subroutine is $(ε,δ)$-PAC, we see that LUCB is also $(ε,δ)$-PAC. However, the sample complexity can be very large when the gap between the $k$th and $(k+1)$th best arms in the instance for its first subroutine is much smaller than $Δ_{\min}$.

ME and LUCB are computationally efficient since they run in polynomial time. However, they must invoke a subroutine two times, which will increase the number of samples.

### E.3  Details of Exponential Algorithms

The entire procedure of SA-Ex is detailed in algorithm 10. The only difference between this algorithm and SAQM is the stopping condition: SAQM approximately solves the confidence ellipsoid maximization, whereas SA-Ex conducts an exhaustive search to obtain the exact solution. Thus, SA-Ex runs in exponential time, $O(n^k)$. The entire procedure of CLUCB-Ex is detailed in algorithm 11. This algorithm also reduces our problem to the pure exploration problem in the linear bandit and thus runs in exponential time. The stopping condition used in CLUCB-Ex is the same as that of SA-Ex. CLUCB-Ex adaptively pulls super arms based on the CLUCB strategy, as in CLUCB and CLUCB-QM.

## Appendix F:  Proofs

First, we introduce the notation. For $M,M'∈M$, let $Δ(M,M')$ be the value gap between two super arms: $Δ(M,M')=|θ(M)-θ(M')|$. Also, let $Δ^(M,M')$ be the empirical gap between two super arms: $Δ^(M,M')=|θ^t(M)-θ^t(M')|$.

### F.1  Proof of Lemma 6

Proof.
Let $0≤λ_1≤λ_2≤⋯≤λ_n$ be the eigenvalues of $Λ_p$. For $λ_{\max}(Λ_p)$, we have
$λ_{\max}(Λ_p)≤∑_{i=1}^nλ_i=\mathrm{tr}(Λ_p)=∑_{M∈\mathrm{supp}(p)}p(M)\,\mathrm{tr}(χ_Mχ_M^⊤)=∑_{M∈\mathrm{supp}(p)}p(M)k=k.$
Next, we consider a lower bound of $λ_{\min}(Λ_p)$. Since $Λ_p$ is a positive-definite matrix, $Λ_p-\min_{i∈[n]}Λ_p(i,i)I$ is positive semidefinite, where $I$ is the identity matrix. Therefore, we have
$λ_{\min}(Λ_p)=\min_{∥x∥=1}x^⊤Λ_px≥\min_{i∈[n]}Λ_p(i,i)×\min_{∥x∥=1}x^⊤Ix=\min_{i∈[n]}Λ_p(i,i).$
Recall that $Λ_p(i,i)$ is the probability that single arm $i$ is pulled under the sampling strategy $p$. We have
$\min_{i∈[n]}Λ_p(i,i)=\min_{i∈[n]}∑_{M∈\mathrm{supp}(p):i∈M}p(M)=Ω\left(\frac{1}{n}\right).$
Therefore, we have the lemma. $□$
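As a numerical sanity check of the trace argument above (our own illustration, not from the paper), one can verify that $\mathrm{tr}(Λ_p)=k$ for any allocation over size-$k$ super arms:

```python
def trace_lambda_p(support, p, n):
    """tr(Lambda_p) for Lambda_p = sum_M p(M) chi_M chi_M^T:
    each diagonal entry Lambda_p(i, i) is the probability that
    single arm i is contained in the pulled super arm."""
    diag = [sum(pM for M, pM in zip(support, p) if i in M) for i in range(n)]
    return sum(diag)
```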

### F.2  Proof of Lemma 9

Proof.
First, we define the random event $E_t$ as
$E_t=\left\{∀M,M'∈\mathcal{M},\;|\hat{θ}_t(M)-\hat{θ}_t(M')|≤C_t∑_{i=1}^n|χ_M(i)-χ_{M'}(i)|\sqrt{A_{x_t}^{-1}(i,i)}\right\}.$
The random event $E_t$ is the event that the confidence intervals of all super arms $M∈\mathcal{M}$ are valid at round $t$. From lemma 2, the probability that the event $E=⋂_{t=1}^∞E_t$ occurs is at least $1-δ$. On the event $E$, the output $\hat{M}^*$ is an $ɛ$-optimal super arm. In the rest of the proof, we assume that event $E$ holds and focus on bounding the sample complexity $T$. Recalling the stopping condition A.2, a sufficient condition for stopping is that for $M^*$ and for $t>n$,
$ɛ>\max_{M∈\mathcal{M}∖\{M^*\}}\left(\hat{θ}_t(M)+C_t∑_{i=1}^n|χ_M(i)-χ_{M^*}(i)|\sqrt{A_{x_t}^{-1}(i,i)}\right)-\hat{θ}_t(M^*).$
(F.1)
Let $\bar{M}=\mathrm{argmax}_{M∈\mathcal{M}∖\{M^*\}}\left(\hat{θ}_t(M)+C_t∑_{i=1}^n|χ_M(i)-χ_{M^*}(i)|\sqrt{A_{x_t}^{-1}(i,i)}\right)$. Equation F.1 is satisfied if
$\hat{Δ}(M^*,\bar{M})>C_t\sqrt{\frac{ρ(p)}{t}}-ɛ.$
(F.2)
From lemma 2 with $x=χ_{M^*}-χ_{\bar{M}}$, with probability at least $1-δ$, we have
$\hat{Δ}(M^*,\bar{M})≥Δ(M^*,\bar{M})-C_t∑_{i=1}^n|χ_{M^*}(i)-χ_{\bar{M}}(i)|\sqrt{A_{x_t}^{-1}(i,i)}≥Δ(M^*,\bar{M})-C_t\sqrt{\frac{ρ(p)}{t}}.$
(F.3)
Combining equations F.2 and F.3, we see that a sufficient condition for stopping is given by $Δ(M^*,\bar{M})≥Δ_{\min}≥2C_t\sqrt{ρ(p)/t}-ɛ$. Therefore, $t≥4C_t^2H_ɛ$ is a sufficient condition to stop. Let $τ>n$ be the stopping time of the algorithm. From the above discussion, we see that $τ≤4C_τ^2H_ɛ$. Recalling that $C_t=σ\sqrt{2\log(c't^2n/δ)}$, we have $τ≤8σ^2\log(c'τ^2n/δ)H_ɛ$. Let $τ'$ be a parameter that satisfies
$τ=8σ^2\log(c'τ'^2n/δ)H_ɛ.$
(F.4)
Then it is obvious that $τ'≤τ$ holds. For $N$ defined as $N=8σ^2\log(c'n/δ)H_ɛ$, we have
$τ'≤τ=16σ^2\log(τ')H_ɛ+N≤16σ^2\sqrt{τ'}H_ɛ+N.$
Transforming this inequality, we obtain
$\sqrt{τ'}≤8σ^2H_ɛ+\sqrt{64σ^4H_ɛ^2+N}≤2\sqrt{64σ^4H_ɛ^2+N}.$
(F.5)
Let $L=2\sqrt{64σ^4H_ɛ^2+N}$, which equals the right-hand side of equation F.5. We see that
$\log L=O\left(\log\left(σ\sqrt{H_ɛ}\left(σ^2H_ɛ+\log\frac{n}{δ}\right)\right)\right).$
Then, using this upper bound of $τ'$ in equation F.4, we have
$τ≤16σ^2H_ɛ\log\frac{c'n}{δ}+C(H_ɛ,δ),$
where
$C(H_ɛ,δ)=O\left(σ^2H_ɛ\log\left(σ\sqrt{H_ɛ}\left(σ^2H_ɛ+\log\frac{n}{δ}\right)\right)\right).$
Recalling that $σ=kR$, we obtain
$τ=O\left(k^2R^2H_ɛ\log\left(\frac{n}{δ}\,kR\sqrt{H_ɛ}\left(k^2R^2H_ɛ+\log\frac{n}{δ}\right)\right)\right).$
$□$

### F.3  Proof of Theorem 2

We begin by showing the following three lemmas.

Lemma 5.

Let $W∈Rn×n$ be any positive-definite matrix. Then $G˜=(V,E,w˜)$, constructed by algorithm 1, is a nonnegative weighted graph.

Proof.

For any $(i,j)∈V^2$, we have $w_{ii}≥0$ and $w_{jj}≥0$ since $W$ is a positive-definite matrix. If $w_{ij}≥0$, it is obvious that $\tilde{w}_{ij}=w_{ij}+w_{ii}+w_{jj}≥0$. We consider the case $w_{ij}<0$. In this case, we have $w_{ij}+w_{ii}+w_{jj}>2w_{ij}+w_{ii}+w_{jj}=(e_i+e_j)^⊤W(e_i+e_j)≥0$, where the last inequality holds from the definition of a positive-definite matrix $W$. Thus, we obtain the desired result. $□$

Lemma 6.

Let $W∈Rn×n$ be any positive-definite matrix and $W˜=(w˜ij)$ be the adjacency matrix of the complete graph constructed by algorithm 1. Then, for any $S⊆V$ such that $|S|≥2$, we have $w(S)≤w˜(S)$.

Proof.
We have
$w(S)=∑_{e∈E(S)}w_e=∑_{\{i,j\}⊆S:i≠j}w_{ij}+∑_{i∈S}w_{ii}≤∑_{\{i,j\}⊆S:i≠j}w_{ij}+(|S|-1)∑_{i∈S}w_{ii}=\tilde{w}(S),$
where the inequality holds since each diagonal component $w_{ii}$ is positive for all $i∈V$ by the definition of a positive-definite matrix and $|S|≥2$. $□$
Lemma 7.

Let $W∈\mathbb{R}^{n×n}$ be any positive-definite matrix and $\tilde{W}=(\tilde{w}_{ij})$ be the adjacency matrix of the complete graph constructed in algorithm 1. Then for any subset of vertices $S⊆V$, we have $\frac{\tilde{w}(S)}{w(S)}≤(|S|-1)\frac{λ_{\max}(W)}{λ_{\min}(W)}$, where $λ_{\min}(W)$ and $λ_{\max}(W)$ represent the minimum and maximum eigenvalues of $W$, respectively.

Proof.

We consider the following two cases: case i, $∑{i,j}∈E(S):i≠jwij≥0$, and case ii, $∑{i,j}∈E(S):i≠jwij<0$.

Case i. Since $W=(w_{ij})_{1≤i,j≤n}$ is a positive-definite matrix, each diagonal component $w_{ii}$ is positive. Thus, we have
$\tilde{w}(S)=∑_{\{i,j\}⊆S:i≠j}w_{ij}+(|S|-1)∑_{i∈S}w_{ii}≤(|S|-1)\left(∑_{\{i,j\}⊆S:i≠j}w_{ij}+∑_{i∈S}w_{ii}\right)=(|S|-1)w(S).$

Since $W$ is positive definite, we have $w(S)>0$, which gives the desired result.

Case ii. In this case, we see that
$\tilde{w}(S)=∑_{\{i,j\}⊆S:i≠j}w_{ij}+(|S|-1)∑_{i∈S}w_{ii}≤(|S|-1)∑_{i∈S}w_{ii}.$
For any diagonal component, we have $w_{ii}≤\max_{1≤i,j≤n}w_{ij}$. For the largest component, we have
$\max_{1≤i,j≤n}w_{ij}≤\max_{1≤i,j≤n}\left(\frac{1}{2}e_i^⊤We_i+\frac{1}{2}e_j^⊤We_j\right)≤λ_{\max}(W),$
where the first inequality is satisfied since $W$ is positive definite. Thus, we obtain
$\tilde{w}(S)<|S|(|S|-1)λ_{\max}(W).$
(F.6)
For the lower bound of $w(S)$, we have
$w(S)=χ_S^⊤Wχ_S=\frac{χ_S^⊤Wχ_S}{∥χ_S∥_2^2}|S|≥λ_{\min}(W)|S|.$
(F.7)
Combining equations F.6 and F.7, we obtain
$\frac{\tilde{w}(S)}{w(S)}<(|S|-1)\frac{λ_{\max}(W)}{λ_{\min}(W)},$
which completes the proof. $□$
which completes the proof. $□$

We are now ready to prove theorem 2.

Proof.
For any round $t>n$, let $χ_{S_t}$ be the approximate solution obtained by algorithm 1 and $Z$ be its objective value. Let $\tilde{w}:2^{[n]}→\mathbb{R}$ be the weight function defined by algorithm 1. We denote the optimal value of QP by OPT and the optimal solution of the D$k$S on $\tilde{G}=([n],E,\tilde{w})$ by $\tilde{S}_{OPT}$. The matrix $W$ is a symmetric positive-definite matrix; thus, lemmas 5, 6, and 7 hold for $W$. We have
$\tilde{w}(S_t)≥α_{DkS}\,\tilde{w}(\tilde{S}_{OPT})≥α_{DkS}\,\tilde{w}(S_{OPT})≥α_{DkS}\,w(S_{OPT}),$
(F.8)
where the first inequality holds since $S_t$ is an $α_{DkS}$-approximate solution for D$k$S$(\tilde{G})$ and the last inequality follows from lemma 6. Thus, we obtain
$Z=χ_{S_t}^⊤A_{x_t}^{-1}χ_{S_t}=w(S_t)≥\frac{1}{|S_t|-1}\frac{λ_{\min}(W)}{λ_{\max}(W)}\tilde{w}(S_t)\quad(\text{by lemma 7})$
$≥\frac{1}{|S_t|-1}\frac{λ_{\min}(W)}{λ_{\max}(W)}α_{DkS}\,w(S_{OPT})\quad(\text{by equation F.8})$
$=\frac{1}{|S_t|-1}\frac{λ_{\min}(W)}{λ_{\max}(W)}α_{DkS}\,\mathrm{OPT}.$

Therefore, we obtain $Z≥\frac{1}{k-1}\frac{λ_{\min}(W)}{λ_{\max}(W)}α_{DkS}\,\mathrm{OPT}$. $□$

### F.4  Proof of Theorem 3

Proof.

Updating $A_{x_t}^{-1}$ can be done in $O(n^2)$ time, and computing the empirical best super arm can be done in $O(n)$ time. Moreover, the confidence ellipsoid maximization CEM can be approximately solved in polynomial time, since the quadratic maximization QP is solved in polynomial time as long as we employ a polynomial-time algorithm as the D$k$S-Oracle. Let $\mathrm{poly}_{DkS}(n)$ be the computation time of the D$k$S-Oracle. Then we can guarantee that SAQM runs in $O(\max\{n^2,\mathrm{poly}_{DkS}(n)\})$ time. $□$

### F.5  Proof of Theorem 4

Before stating the proof of theorem 4, we give the technical lemmas.

For any $t>0$, let us define the random event $E_t'$ as
$∀M∈\mathcal{M},\;|θ(M)-\hat{θ}_t(M)|≤C_t∥χ_M∥_{A_{x_t}^{-1}}.$
(F.9)
We note that the random event $E_t'$ characterizes the event that the confidence bounds of all super arms $M∈\mathcal{M}$ are valid at round $t$. The next lemma indicates that if the confidence bounds are valid, then SAQM always outputs an $ɛ$-optimal super arm $\hat{M}^*$ when it stops.
Lemma 8.

Given any $t>n$, assume that $E_t'$ occurs. Then if SAQM (algorithm 2) terminates at round $t$, we have $θ(M^*)-θ(\hat{M}^*)≤ɛ$.

Proof.
If $\hat{M}^*=M^*$, we have the desired result; thus we assume $\hat{M}^*≠M^*$. We have the following chain of inequalities:
$θ(\hat{M}^*)≥\hat{θ}_t(\hat{M}_t^*)-C_t∥χ_{\hat{M}^*}∥_{A_{x_t}^{-1}}≥\max_{M∈\mathcal{M}∖\{\hat{M}^*\}}\hat{θ}_t(M)+\frac{1}{α_t}Z_t-ɛ≥\max_{M∈\mathcal{M}∖\{\hat{M}^*\}}\hat{θ}_t(M)+\max_{M∈\mathcal{M}}C_t∥χ_M∥_{A_{x_t}^{-1}}-ɛ≥\hat{θ}_t(M^*)+C_t∥χ_{M^*}∥_{A_{x_t}^{-1}}-ɛ≥θ(M^*)-ɛ.$
The first inequality is from the event $E'$ and the second from the stopping condition. The third inequality holds since $\frac{1}{α_t}Z_t≥\max_{M∈\mathcal{M}}C_t∥χ_M∥_{A_{x_t}^{-1}}$. The last inequality is again from the event $E'$. $□$

We see that the approximation ratio of CEM satisfies $α_τ=Ω\left(k^{-1/2}n^{-1/8}\frac{λ_{\min}(Λ_p)}{λ_{\max}(Λ_p)}\right)$ for sufficiently large $τ$ from theorem 2 and lemma 9 below if we use the best approximation algorithm for the D$k$S as the D$k$S-Oracle (Bhaskara et al., 2010).

Lemma 9.
For $t≫|\mathrm{supp}(p)|$, the condition number of $A_{x_t}$ satisfies
$\frac{λ_{\max}(A_{x_t})}{λ_{\min}(A_{x_t})}≃\frac{λ_{\max}(Λ_p)}{λ_{\min}(Λ_p)}.$
Proof.
Recall that we assume the arm selection strategy
$M_t=\mathrm{argmin}_{M∈\mathrm{supp}(p)}T_M(t-1)/p_M,$
which makes $T_M(t-1)$ proportional to the fixed arm selection ratio $p_M$. It is easy to see that $p_Mt+1≥T_M(t)≥p_Mt-|\mathrm{supp}(p)|$ when this strategy is employed. If $T_M(t)≃p_Mt$, then
$\frac{λ_{\max}(A_{x_t})}{λ_{\min}(A_{x_t})}≃\frac{λ_{\max}(tΛ_p)}{λ_{\min}(tΛ_p)}=\frac{λ_{\max}(Λ_p)}{λ_{\min}(Λ_p)},$
where $Λ_p=∑_{M∈\mathrm{supp}(p)}p_Mχ_Mχ_M^⊤$. $□$
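The static tracking rule assumed in the proof can be sketched in one line of Python (our illustration; `counts` holds $T_M(t-1)$ over $\mathrm{supp}(p)$):

```python
def next_super_arm(counts, p):
    """Pull the super arm whose empirical frequency lags its target
    allocation the most, i.e., argmin_M T_M(t-1) / p_M."""
    return min(range(len(p)), key=lambda i: counts[i] / p[i])
```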

We are now ready to prove theorem 4.

Proof.

We define event $E'$ as $⋂_{t=1}^∞E_t'$. We can see from proposition 1 that the probability that event $E'$ occurs is at least $1-δ$. In the rest of the proof, we assume that this event holds. By lemma 8 and the assumption on $E'$, we see that the output $\hat{M}^*$ is an $ɛ$-optimal super arm. Next, we focus on bounding the sample complexity.

A sufficient condition for stopping is that for $M^*$ and $t>n$,
$\hat{θ}_t(M^*)-C_t∥χ_{M^*}∥_{A_{x_t}^{-1}}≥\max_{M∈\mathcal{M}∖\{M^*\}}\hat{θ}_t(M)+\frac{1}{α_t}Z_t-ɛ.$
(F.10)
From the definition of $A_{x_t}$, we have $Λ_p=A_{x_t}/t$. Using $∥χ_{M^*}∥_{A_{x_t}^{-1}}≤\max_{M∈\mathcal{M}}∥χ_M∥_{A_{x_t}^{-1}}$ and $Z_t≤\max_{M∈\mathcal{M}}C_t∥χ_M∥_{A_{x_t}^{-1}}$, a sufficient condition for equation F.10 is
$\hat{Δ}(M^*,\bar{M})≥\left(1+\frac{1}{α_t}\right)C_t\sqrt{\frac{ρ(p)}{t}}-ɛ,$
(F.11)
where $\bar{M}=\mathrm{argmax}_{M∈\mathcal{M}}\hat{Δ}_t(M^*,M)$. On the other hand, we have
$∥χ_{M^*}-χ_{\bar{M}}∥_{A_{x_t}^{-1}}≤2\max_{M∈\mathcal{M}}∥χ_M∥_{A_{x_t}^{-1}}≤2\sqrt{\frac{ρ(p)}{t}}.$
Therefore, from proposition 1 with $x=χ_{M^*}-χ_{\bar{M}}$, with probability at least $1-δ$, we have
$\hat{Δ}_t(M^*,\bar{M})≥Δ(M^*,\bar{M})-C_t∥χ_{M^*}-χ_{\bar{M}}∥_{A_{x_t}^{-1}}≥Δ(M^*,\bar{M})-2C_t\sqrt{\frac{ρ(p)}{t}}.$
(F.12)
Combining equations F.11 and F.12, we see that a sufficient condition for stopping becomes
$Δ_{\min}-2C_t\sqrt{\frac{ρ(p)}{t}}≥\left(1+\frac{1}{α_t}\right)C_t\sqrt{\frac{ρ(p)}{t}}-ɛ.$
Therefore, a sufficient condition to stop is $t≥\left(3+\frac{1}{α_t}\right)^2C_t^2H_ɛ$, where $H_ɛ=\frac{ρ(p)}{(Δ_{\min}+ɛ)^2}$. Let $τ>n$ be the stopping time of the algorithm. From the above discussion, we see that
$τ≤\left(3+\frac{1}{α_τ}\right)^2C_τ^2H_ɛ.$
(F.13)
Recalling that $C_τ=c\sqrt{\log(c'τ^2K/δ)}$, we have
$τ≤\left(3+\frac{1}{α_τ}\right)^2C_τ^2H_ɛ=\left(3+\frac{1}{α_τ}\right)^2c^2\log(c'τ^2K/δ)H_ɛ.$
Let $τ'$ be a parameter that satisfies
$τ=\left(3+\frac{1}{α_τ}\right)^2c^2\log(c'τ'^2K/δ)H_ɛ.$
(F.14)
Then it is obvious that $τ'≤τ$ holds. For $N$ defined as $N=\left(3+\frac{1}{α_τ}\right)^2c^2\log(c'K/δ)H_ɛ$, we have
$τ'≤τ=\left(3+\frac{1}{α_τ}\right)^2c^2\log(τ')H_ɛ+N≤\left(3+\frac{1}{α_τ}\right)^2c^2\sqrt{τ'}H_ɛ+N.$
By solving this inequality with $c=2\sqrt{2}σ$, we obtain
$\sqrt{τ'}≤4\left(3+\frac{1}{α_τ}\right)^2σ^2H_ɛ+\sqrt{16\left(3+\frac{1}{α_τ}\right)^4σ^4H_ɛ^2+N}≤2\sqrt{16\left(3+\frac{1}{α_τ}\right)^4σ^4H_ɛ^2+N}.$
Let $L=2\sqrt{16\left(3+\frac{1}{α_τ}\right)^4σ^4H_ɛ^2+N}$, which is equal to the right-hand side of the inequality. We see that $\log L=O\left(\log\left(\frac{σ^2}{α_τ^2}H_ɛ+\log\frac{K}{δ}\right)\right)$. Then, using this upper bound of $τ'$ in equation F.14, we have
$τ≤8\left(3+\frac{1}{α_τ}\right)^2σ^2H_ɛ\log\frac{c'K}{δ}+C(H_ɛ,δ),$
where
$C(H_ɛ,δ)=16\left(3+\frac{1}{α_τ}\right)^2σ^2H_ɛ\log L=O\left(σ^2H_ɛ\log\left(\frac{σ^2}{α_τ^2}H_ɛ+\log\frac{K}{δ}\right)\right).$
$□$

## Acknowledgments

We thank the anonymous reviewers for their comments and suggestions. Y.K. was supported by a Grant-in-Aid for JSPS Fellows (No. 18J23034) and JST CREST Grant Number JPMJCR1403, including the AIP Challenge Program. A.M. was supported by a Grant-in-Aid for Research Activity Start-up (No. 17H07357) and a Grant-in-Aid for Early-Career Scientists (No. 19K20218). J.H. was supported by a Grant-in-Aid for Scientific Research on Innovative Areas (No. 16H00881). M.S. was supported by KAKENHI 17H00757.

## References

Abbasi-Yadkori, Y., Pál, D., & Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, & K. Weinberger (Eds.), Advances in neural information processing systems, 24 (pp. 2312–2320). Red Hook, NY: Curran.

Agrawal, R., Hegde, M., & Teneketzis, D. (1990). Multi-armed bandit problems with multiple plays and switching cost. Stochastics and Stochastic Reports, 29, 437–459.

Anantharam, V., Varaiya, P., & Walrand, J. (1987). Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays—Part I: I.I.D. rewards. IEEE Transactions on Automatic Control, 32, 968–976.

Asahiro, Y., Iwama, K., Tamaki, H., & Tokuyama, T. (2000). Greedily finding a dense subgraph. Journal of Algorithms, 34(2), 203–221.

Audibert, J.-Y., & Bubeck, S. (2010). Best arm identification in multi-armed bandits. In Proceedings of the 23rd Annual Conference on Learning Theory (pp. 41–53). Madison, WI: Omnipress.
Bhaskara, A., Charikar, M., Chlamtac, E., Feige, U., & Vijayaraghavan, A. (2010). Detecting high log-densities: An $O(n^{1/4})$ approximation for densest $k$-subgraph. In Proceedings of the 42nd ACM Symposium on Theory of Computing (pp. 201–210). New York: ACM.

Bouhtou, M., Gaubert, S., & Sagnol, G. (2010). Submodularity and randomized rounding techniques for optimal experimental design. Electronic Notes in Discrete Mathematics, 36, 679–686.

Bubeck, S., & Cesa-Bianchi, N. (2012). Regret analysis of stochastic and non-stochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5, 1–122.

Bubeck, S., Wang, T., & Viswanathan, N. (2013). Multiple identifications in multiarmed bandits. In Proceedings of the 30th International Conference on Machine Learning (pp. 258–265).

Cao, T., & Krishnamurthy, A. (2017). Disagreement-based combinatorial pure exploration: Efficient algorithms and an analysis with localization. arXiv:1711.08018.

Cao, W., Li, J., Tao, Y., & Li, Z. (2015). On top-$k$ selection in multi-armed bandits and hidden bipartite graphs. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural information processing systems, 28 (pp. 1036–1044). Red Hook, NY: Curran.

Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, learning, and games. Cambridge: Cambridge University Press.

Cesa-Bianchi, N., & Lugosi, G. (2012). Combinatorial bandits. Journal of Computer and System Sciences, 78(5), 1404–1422.

Chaudhuri, A. R., & Kalyanakrishnan, S. (2017). PAC identification of a bandit arm relative to a reward quantile. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (pp. 1977–1985). Palo Alto, CA: AAAI.

Chaudhuri, A. R., & Kalyanakrishnan, S. (2019). PAC identification of many good arms in stochastic multi-armed bandits. In Proceedings of the 36th International Conference on Machine Learning (pp. 991–1000).

Chen, L., Gupta, A., & Li, J. (2016). Pure exploration of multi-armed bandit under matroid constraints. In Proceedings of the 29th Annual Conference on Learning Theory (pp. 647–669).

Chen, L., Gupta, A., Li, J., Qiao, M., & Wang, R. (2017). Nearly optimal sampling algorithms for combinatorial pure exploration. In Proceedings of the 30th Annual Conference on Learning Theory (pp. 482–534).

Chen, L., & Li, J. (2015). On the optimal sample complexity for best arm identification. arXiv:1511.03774.
Chen, S., Lin, T., King, I., Lyu, M. R., & Chen, W. (2014). Combinatorial pure exploration of multi-armed bandits. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 27 (pp. 379–387). Red Hook, NY: Curran.

Combes, R., Talebi Mazraeh Shahi, M. S., Proutiere, A., & Lelarge, M. (2015). Combinatorial bandits revisited. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural information processing systems, 28 (pp. 2116–2124). Red Hook, NY: Curran.

Even-Dar, E., Mannor, S., & Mansour, Y. (2002). PAC bounds for multi-armed bandit and Markov decision processes. In Proceedings of the 15th Annual Conference on Learning Theory (pp. 255–270). Berlin: Springer.

Even-Dar, E., Mannor, S., & Mansour, Y. (2006). Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7, 1079–1105.

Feige, U., Peleg, D., & Kortsarz, G. (2001). The dense $k$-subgraph problem. Algorithmica, 29(3), 410–421.

Gabillon, V., Ghavamzadeh, M., & Lazaric, A. (2012). Best arm identification: A unified approach to fixed budget and fixed confidence. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 25 (pp. 3212–3220). Red Hook, NY: Curran.

Gabillon, V., Ghavamzadeh, M., Lazaric, A., & Bubeck, S. (2011). Multi-bandit best arm identification. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 24 (pp. 2222–2230). Red Hook, NY: Curran.

Gabillon, V., Lazaric, A., Ghavamzadeh, M., Ortner, R., & Bartlett, P. (2016). Improved learning complexity in combinatorial pure exploration bandits. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (pp. 1004–1012). San Francisco: Morgan Kaufmann.
Huang, S., Liu, X., & Ding, Z. (2008). Opportunistic spectrum access in cognitive radio networks. In Proceedings of the 27th IEEE International Conference on Computer Communications (pp. 1427–1435). Piscataway, NJ: IEEE.

Huang, W., Ok, J., Li, L., & Chen, W. (2018). Combinatorial pure exploration with continuous and separable reward functions and its applications. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (pp. 2291–2297). San Francisco: Morgan Kaufmann.

Jamieson, K., Malloy, M., Nowak, R., & Bubeck, S. (2014). lil'UCB: An optimal exploration algorithm for multi-armed bandits. In Proceedings of the 27th Annual Conference on Learning Theory (pp. 423–439).

Jun, K., Bhargava, A., Nowak, R., & Willett, R. (2017). Scalable generalized linear bandits: Online computation and hashing. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems, 30 (pp. 99–109). Red Hook, NY: Curran.

Kalyanakrishnan, S., & Stone, P. (2010). Efficient selection of multiple bandit arms: Theory and practice. In Proceedings of the 27th International Conference on Machine Learning (pp. 511–518). Madison, WI: Omnipress.

Kalyanakrishnan, S., Tewari, A., Auer, P., & Stone, P. (2012). PAC subset selection in stochastic multi-armed bandits. In Proceedings of the 29th International Conference on Machine Learning (pp. 655–662). Madison, WI: Omnipress.

Karger, D. R. (1998). Random sampling and greedy sparsification for matroid optimization problems. Mathematical Programming, 82(1), 41–81.

Kaufmann, E., Cappé, O., & Garivier, A. (2016). On the complexity of best-arm identification in multi-armed bandit models. Journal of Machine Learning Research, 17, 1–42.

Kaufmann, E., & Kalyanakrishnan, S. (2013). Information complexity in bandit subset selection. In Proceedings of the 26th Annual Conference on Learning Theory (pp. 228–251).

Kiefer, J., & Wolfowitz, J. (1960). The equivalence of two extremum problems. Canadian Journal of Mathematics, 12, 363–366.
Komiyama, J., Honda, J., & Nakagawa, H. (2015). Optimal regret analysis of Thompson sampling in stochastic multi-armed bandit problem with multiple plays. In Proceedings of the 32nd International Conference on Machine Learning (pp. 1152–1161).

Lagrée, P., Vernade, C., & Cappé, O. (2016). Multiple-play bandits in the position-based model. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems, 29 (pp. 1597–1605). Red Hook, NY: Curran.

Lai, T., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1), 4–22.

Lattimore, T., & Szepesvari, C. (2017). The end of optimism? An asymptotic analysis of finite-armed linear bandits. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (pp. 728–737).

Le Gall, F. (2014). Powers of tensors and fast matrix multiplication. In Proceedings of the 39th ACM International Symposium on Symbolic and Algebraic Computation (pp. 296–303). New York: ACM.

Li, J., Baba, Y., & Kashima, H. (2017). Hyper questions: Unsupervised targeting of a few experts in crowdsourcing. In Proceedings of the 26th ACM International Conference on Information and Knowledge Management (pp. 1069–1078). New York: ACM.

Perrault, P., Perchet, V., & Valko, M. (2019). Exploiting structure of uncertainty for efficient matroid semi-bandits. In Proceedings of the 36th International Conference on Machine Learning (pp. 5123–5132).

Pukelsheim, F. (2006). Optimal design of experiments. Philadelphia: Society for Industrial and Applied Mathematics.

Radlinski, F., Kleinberg, R., & Joachims, T. (2008). Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning (pp. 784–791). New York: ACM.
Retelny, D., Robaszkiewicz, S., To, A., Lasecki, W. S., Patel, J., Rahmati, N., & Bernstein, M. S. (2014). Expert crowdsourcing with flash teams. In Proceedings of the 27th Annual ACM Symposium on User Interface Software and Technology (pp. 75–85). New York: ACM.

Rusmevichientong, P., & Williamson, D. P. (2006). An adaptive algorithm for selecting profitable keywords for search-based advertising services. In Proceedings of the 7th ACM Conference on Electronic Commerce (pp. 260–269). New York: ACM.

Sagnol, G. (2013). Approximation of a maximum-submodular-coverage problem involving spectral functions, with application to experimental designs. Discrete Applied Mathematics, 161, 258–276.

Soare, M., Lazaric, A., & Munos, R. (2014). Best-arm identification in linear bandits. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 27 (pp. 828–836). Red Hook, NY: Curran.

Tao, C., Blanco, S., & Zhou, Y. (2018). Best arm identification in linear bandits with linear dimension dependency. In Proceedings of the 35th International Conference on Machine Learning (pp. 4877–4886).

Taylor, R. (2016). Approximation of the quadratic knapsack problem. Operations Research Letters, 44(4), 495–497.

Tran-Thanh, L., Stein, S., Rogers, A., & Jennings, N. R. (2014). Efficient crowdsourcing of unknown experts using bounded multi-armed bandits. Artificial Intelligence, 214, 89–111.

Whitney, H. (1935). On the abstract properties of linear dependence. American Journal of Mathematics, 57(3), 509–533.

Xu, L., Honda, J., & Sugiyama, M. (2018). A fully adaptive algorithm for pure exploration in linear bandits. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (pp. 843–851).

Zhou, Y., Chen, X., & Li, J. (2014). Optimal PAC multiple arm identification with applications to crowdsourcing. In Proceedings of the 31st International Conference on Machine Learning (pp. 217–225).

## Author notes

L.X. is now at Gatsby Computational Neuroscience Unit, University College London.