Abstract

Linear submodular bandits has been proven effective in solving the diversification and feature-based exploration problem in information retrieval systems. Considering that there is inevitably a budget constraint in many web-based applications, such as news article recommendation and online advertising, we study the problem of diversification under a budget constraint in a bandit setting. We first introduce a budget constraint to each exploration step of linear submodular bandits as a new problem, which we call per-round knapsack-constrained linear submodular bandits. We then define an α-approximation unit-cost regret, considering that submodular function maximization is NP-hard. To solve this new problem, we propose two greedy algorithms based on a modified UCB rule. We prove different regret bounds and computational complexities for these two algorithms. Inspired by the lazy evaluation process in submodular function maximization, we also prove that a modified lazy evaluation process can be used to accelerate our algorithms without losing their theoretical guarantee. We conduct a number of experiments, and the experimental results confirm our theoretical analyses.

1  Introduction

The trade-off between exploration and exploitation is one of the challenges arising in reinforcement learning (Sutton & Barto, 1998; Shen, Tobia, Sommer, & Obermayer, 2014). The multiarmed bandit (MAB) problem is the simplest instance of the exploration-versus-exploitation dilemma (Auer, Cesa-Bianchi, & Fischer, 2002), which is a trade-off between exploring the environment to find a better action (exploration) and adopting the current best action as often as possible (exploitation). The classical MAB is formulated as a system of arms. At each time step, we choose an arm to pull and obtain a reward from an unknown distribution. The goal is to maximize the cumulative rewards by optimally balancing exploration and exploitation within a finite number of time steps. The most popular measure of an algorithm's success is regret, which is the cumulative loss of failing to pull the optimal arm. A variant of the classical MAB, usually called combinatorial multiarmed bandits (Gai, Krishnamachari, & Jain, 2012; Chen, Wang, & Yuan, 2013), allows multiple arms to be chosen at each time step. Many web-based applications can be modeled as combinatorial bandits, for example, the personalized recommendation of news articles (Li, Chu, Langford, & Schapire, 2010; Fang & Tao, 2014), in which multiple news articles are recommended to a user.

Diversification is a key problem in information retrieval systems, such as the ranking of documents (Radlinski, Kleinberg, & Joachims, 2008), product recommendation (Ziegler, McNee, Konstan, & Lausen, 2005), and news article recommendation. It has been observed in practice that user satisfaction improves when a diverse set of items is recommended (Ziegler et al., 2005). Submodularity is an intuitive notion of diminishing returns, which states that adding a new item to a larger set helps less than adding the same item to a smaller set. It turns out that diversification can be well captured by a submodular function (Krause & Golovin, 2012), and linear submodular bandits (Yue & Guestrin, 2011) has thus been proposed to handle the diversification problem in a bandit setting.

There is nevertheless often a budget constraint in real-world scenarios, where limited resources are consumed during the process of taking actions. For example, in dynamic procurement (Badanidiyuru, Kleinberg, & Slivkins, 2013), the budget for buying items is limited. In clinical trials, experiments on alternative medical treatments are limited by the cost of materials. However, a budget constraint imposed on each time step, rather than on the entire process, is more reasonable for other applications. For example, in online advertising (Chakrabarti, Kumar, Radlinski, & Upfal, 2009), the size of a web page is limited, while the ads change each time the user visits the web page. In news article recommendation, several articles are recommended to a user and feedback is obtained each time, but users have only a limited time to read those articles (e.g., if we recommend three short news articles, the user may read all of them, but if we recommend three long articles, the user might not be patient enough to read all of them). We thus formulate a per-round budget constraint as follows: for all , let denote the cost of pulling arm , where is the set of all arms. The total cost of arm pulling is limited by a budget . At each time step , we choose a subset of arms under a budget constraint (i.e., ), which is known as a knapsack constraint (Sviridenko, 2004).
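Because the inline symbols above were lost in typesetting, the per-round constraint can be restated in placeholder notation (A for the set of all arms, c_a for the cost of arm a, S_t for the subset chosen at time step t, and B for the per-round budget); the exact symbols used in the letter may differ:

\[ \sum_{a \in S_t} c_a \le B, \qquad S_t \subseteq A, \qquad t = 1, \dots, T. \]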

In order to improve user satisfaction by considering the problem of diversification under a budget constraint, we introduce the per-round budget constraint to linear submodular bandits as a new problem, which we refer to as per-round knapsack-constrained linear submodular bandits. To solve this new problem, we construct an upper confidence bound (UCB) under a budget constraint, called the unit-cost upper confidence bound, to control the trade-off between exploration and exploitation. Inspired by classical knapsack solutions, we try to obtain the maximum reward from each budget unit. Specifically, in our algorithms we greedily choose the arm that has the maximum upper confidence bound on the per-budget-unit utility gain to construct a subset of arms.

In this letter, we first briefly review related work. We then describe the new problem of per-round knapsack-constrained linear submodular bandits (or linear submodular bandits with a knapsack constraint; Yu, Fang, & Tao, 2016) and the definition of regret. After that, we propose two greedy algorithms based on a modified UCB rule and prove theoretical regret bounds for both algorithms. We also show that a modified lazy evaluation can be used to accelerate both algorithms without losing the theoretical guarantee. Finally, we use news article recommendation as a case study, which requires us to recommend multiple news articles under a per-round budget constraint. We conduct a number of experiments and compare our two algorithms with the baselines for linear submodular bandits, such as LSBGreedy (Yue & Guestrin, 2011) and Epsilon-Greedy.

2  Related Work

Multiarmed bandits addresses the exploration-versus-exploitation dilemma in statistics and reinforcement learning (Lai & Robbins, 1985; Berry & Fristedt, 1985; Sutton & Barto, 1998; Shen et al., 2014). There are two types of multiarmed bandit problems: adversarial bandits and stochastic bandits. Adversarial bandits, in which an adversary controls the arms and tries to defeat the learning process, was initiated by Auer, Cesa-Bianchi, Freund, and Schapire (2002) and followed by Audibert and Bubeck (2009); Kakade, Kalai, and Ligett (2009); and Streeter, Golovin, and Krause (2009) in different settings. For stochastic bandits, in which the rewards of arms are sampled from an unknown distribution, a uniform sublinear regret bound has been provided by Auer, Cesa-Bianchi, and Fischer (2002). They propose several algorithms based on an ε-greedy rule or a UCB rule, which have since been widely used in multiarmed bandit problems to control the trade-off between exploration and exploitation (Auer, Cesa-Bianchi, & Fischer, 2002; Auer, 2003). Our work lies in stochastic bandits.

A budget constraint on the entire process has been well studied in the classical multiarmed bandit problem. In the budget-limited multiarmed bandit problem (Tran-Thanh, Chapman, Munoz de Cote, Rogers, & Jennings, 2010; Tran-Thanh, Chapman, Rogers, & Jennings, 2012), the costs of pulling different arms are different, and the total cost is limited by a fixed budget. A series of works has followed this problem with different assumptions on the cost of an arm (Tran-Thanh, Chapman, Munoz de Cote, Rogers, & Jennings, 2010; Devanur, Jain, Sivan, & Wilkens, 2011; Tran-Thanh, Chapman, Rogers, & Jennings, 2012; Ding, Qin, Zhang, & Liu, 2013; Badanidiyuru, Kleinberg, & Slivkins, 2013). More specifically, in the setting proposed by Tran-Thanh et al. (2010), the cost of each arm is fixed. This problem was first solved by a simple budgeted ε-first algorithm (Tran-Thanh et al., 2010) and subsequently by an improved algorithm called KUBE (Tran-Thanh, Chapman, Rogers, & Jennings, 2012). Another setting, in which the cost of an arm is variable, has been studied by Ding et al. (2013). The most important budget constraint is known as a knapsack constraint (Devanur et al., 2011; Badanidiyuru et al., 2013), in which the total cost is the sum of the costs of the pulled arms.

Combinatorial multiarmed bandits, which allows multiple arms to be pulled at each time step, is a variant of the classical multiarmed bandit problem (Kalyanakrishnan & Stone, 2010; Cesa-Bianchi & Lugosi, 2012). Several specific instances of combinatorial multiarmed bandits have been studied by Caro and Gallien (2007) and Liu, Liu, and Zhao (2011), and also in different applications, such as resource allocation (Lin & Gen, 2008), cognitive radio networks (Gai, Krishnamachari, & Jain, 2010), adaptive shortest-path routing (Liu & Zhao, 2012), and pure exploration bandits (Chen, Lin, King, Lyu, & Chen, 2014). A general combinatorial bandit framework with an approximation oracle has been studied by Gai, Krishnamachari, and Jain (2012) in a linear rewards setting and extended by Chen et al. (2013) to a nonlinear rewards setting. Recently, combinatorial multiarmed bandits with a matroid constraint, a notion of independence in combinatorial optimization, has been proposed by Kveton, Wen, Ashkan, and Eydgahi (2014). Another interesting study considers a multiple-choice knapsack bandit (Tran-Thanh, Xia, Qin, & Jennings, 2015), in which each arm is chosen from one of several subsets of arms. However, all previous work has focused on a constraint over the entire process and has not considered a per-round budget constraint.

In information retrieval systems, combinatorial multiarmed bandits has been used to identify user interests (Li et al., 2010), especially when multiple items need to be recommended simultaneously, such as news articles (Kohli, Salek, & Stoddard, 2013) and online advertising (Chakrabarti et al., 2009). In these multiple-item retrieval settings, the notion of diversity has been addressed in many studies, such as diverse rankings of documents (Radlinski et al., 2008), diverse search results (Rafiei, Bharat, & Shukla, 2010; Welch, Cho, & Olston, 2011; Angel & Koudas, 2011), and topic diversification in recommendation systems (Ziegler, McNee, Konstan, & Lausen, 2005; Clarke, Kolla, Cormack, Vechtomova, Ashkan, Büttcher, & MacKinnon, 2008; Küçüktunç, Saule, Kaya, & Çatalyürek, 2013). The problem of diversification in information retrieval systems is known to be an NP-hard combinatorial optimization problem (Carterette, 2011). Submodularity is a notion of diminishing returns (Golovin & Krause, 2010; Krause & Golovin, 2012), which means that adding a new item to a smaller set achieves more utility gain than adding the same item to a larger set. It has been proven that submodularity is helpful in approximately solving the diversification problem, and submodularity-based algorithms have been used in many real-world applications (Lin & Bilmes, 2011, 2012; Das, Dasgupta, & Kumar, 2012). A greedy algorithm solves the monotone submodular function maximization problem under a cardinality constraint with a (1 − 1/e)-approximation guarantee (Nemhauser, Wolsey, & Fisher, 1978). Algorithms for the submodular maximization problem with different constraints, such as a knapsack constraint (Sviridenko, 2004; Leskovec et al., 2007) or a matroid constraint (El-Arini, Veda, Shahaf, & Guestrin, 2009), have also been proposed.

Recently, linear submodular bandits has been proposed by Yue and Guestrin (2011) as a typical combinatorial bandit model to handle diversification in information retrieval systems. Linear submodular bandits introduces submodularity to linear bandits, a well-studied bandit model (Dani, Hayes, & Kakade, 2008; Filippi, Cappe, Garivier, & Szepesvári, 2010; Abbasi-Yadkori, Pál, & Szepesvári, 2011; Chu, Li, Reyzin, & Schapire, 2011). A budget constraint nevertheless exists in real-world applications; for example, in news article recommendation, a user has only limited time to read all recommended news articles (El-Arini et al., 2009), and in online advertising (Chakrabarti et al., 2009), the size of a web page for ads is limited. However, a budget constraint has not been considered in linear submodular bandits, whose cardinality constraint can be seen as only a special case of a budget constraint. In order to solve the diversification problem under a budget constraint, we thus introduce a per-round knapsack constraint to linear submodular bandits. Unlike previous knapsack-constrained bandit problems (Devanur et al., 2011; Badanidiyuru et al., 2013), in which the knapsack constraint applies to the full time sequence, our per-round knapsack constraint is imposed on each time step separately.

3  Problem Definition

We formulate the per-round knapsack-constrained linear submodular bandits as follows. Let denote a set of arms and be a set of costs for arms. At each time step , we greedily choose each arm in under a budget constraint and obtain rewards , a random variable with the martingale assumption (Abbasi-Yadkori et al., 2011). The expected rewards of are measured by a monotone submodular utility function , where is a parameter vector.

Definition 1
(submodularity). Let be a nonempty finite set and be a collection of all subsets of . Let be a submodular function, that is,
formula
3.1

where .
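Since the displayed inequality of equation 3.1 was lost in conversion, a standard statement of submodularity, written here with placeholder notation (a set function F on a ground set E, subsets A ⊆ B ⊆ E, and an element e ∉ B), is:

\[ F(A \cup \{e\}) - F(A) \;\ge\; F(B \cup \{e\}) - F(B). \]

The letter's own symbols may differ, but this is the diminishing-returns property the definition describes.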

Definition 2
(monotonicity). Let be a nonempty finite set and a collection of all subsets of . Let be a monotone function, that is,
formula
3.2
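In the same placeholder notation, monotonicity (equation 3.2) states that adding items never decreases utility:

\[ F(A) \le F(B) \quad \text{for all } A \subseteq B \subseteq E. \]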
Linear submodular bandits is a feature-based bandit model, an instance of contextual bandits (Li et al., 2010), in which we can acquire a context (side information) before making a decision. In a bandit setting, the parameter vector is unknown to us; we use to represent the real value of and assume that , where is a positive constant. The utility function of linear submodular bandits is a linear combination of submodular functions,
formula
3.3
where and is a submodular function with (). The function can be constructed by a probabilistic coverage model in news article recommendation (El-Arini et al., 2009; Yue & Guestrin, 2011). At each time step, we choose a subset of arms under the per-round knapsack constraint, and the goal is to obtain the maximum cumulative rewards,
formula
3.4
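In placeholder notation (weight vector w with components w_i, topic coverage functions F_i, chosen subset S_t, costs c_a, and budget B), the linear submodular utility of equation 3.3 and the budgeted objective of equation 3.4 take the following standard form; the letter's own symbols may differ:

\[ F_w(S) = \sum_{i=1}^{d} w_i F_i(S), \qquad \max_{S_1, \dots, S_T} \; \mathbb{E}\Big[\sum_{t=1}^{T} F_{w^*}(S_t)\Big] \quad \text{subject to} \quad \sum_{a \in S_t} c_a \le B \;\; \text{for all } t. \]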

4  α-Regret

Regret, which indicates the loss of not always pulling the optimal arm, has been widely used in bandit problems as a measure of an algorithm's success. Considering that submodular function maximization is NP-hard, we can find only approximate solutions, , in polynomial time (Sviridenko, 2004; Leskovec et al., 2007). As a result, we can guarantee only an α-approximation solution for per-round knapsack-constrained linear submodular bandits, even if we know the parameter . In a bandit setting, we use , estimated from the feedback of the previous time steps, as an estimate of to help us make decisions.

For per-round knapsack-constrained linear submodular bandits, we therefore define an α-approximation unit-cost regret (for simplicity, called α-Regret) as follows. Let denote the optimal subset of arms,
formula
4.1
Let denote the subset of arms we chose at time step . The α-Regret is then defined as
formula
4.2
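One common way to write such an approximation regret, in the placeholder notation above (S_t* the per-round optimal subset from equation 4.1, S_t the subset chosen by the algorithm, and α the approximation rate), is:

\[ \mathrm{Regret}_{\alpha}(T) \;=\; \alpha \sum_{t=1}^{T} F_{w^*}(S_t^{*}) \;-\; \sum_{t=1}^{T} F_{w^*}(S_t). \]

The letter's definition additionally accounts for arm costs (hence "unit-cost"), so equation 4.2 may contain per-budget-unit terms not shown in this sketch.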

5  Algorithms

In this section, we introduce the evaluation of as well as a modified UCB rule. We then propose two greedy algorithms based on the modified UCB rule to solve the problem of per-round knapsack-constrained linear submodular bandits. Both of these algorithms can be seen as extensions of submodular greedy algorithms (Sviridenko, 2004; Leskovec et al., 2007) to bandit settings. There is a regret versus computational complexity trade-off between our two algorithms.

5.1  Evaluation of

Let denote a subset of arms chosen at time step (see note 1). For simplicity, we define for all . Then the feature vector of the arm is
formula
5.1
where
formula
5.2
and . At each time step , we greedily choose each arm to construct and then acquire all rewards where denotes the rewards of . The expected rewards of are
formula
5.3
The -regularized least-squares estimation has been widely used in linear bandits to estimate parameters (Abbasi-Yadkori et al., 2011; Filippi et al., 2010; Dani et al., 2008). Considering that the utility function is a linear function, is derived by -regularized least-squares estimate with regularization parameter ,
formula
5.4
where indicates the features of all chosen arms through the last time steps and forms all corresponding rewards. The row of matrix is the arm feature,
formula
5.5
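As a concrete illustration of equation 5.4, the following sketch computes an l2-regularized least-squares estimate from the matrix of chosen arm features and the corresponding rewards. The function and variable names (estimate_w, X, y, lam) are illustrative, not the letter's notation.

import numpy as np

def estimate_w(X, y, lam=1.0):
    # X: (n, d) matrix whose rows are the features of all arms chosen so far
    # y: (n,) vector of the corresponding observed rewards
    # lam: regularization parameter of the ridge estimator
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)       # regularized Gram matrix
    return np.linalg.solve(A, X.T @ y)  # ridge solution (X^T X + lam*I)^{-1} X^T y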

5.2  Modified UCB Rule

The UCB rule can be used elegantly to manage the trade-off between exploration and exploitation in a multiarmed bandit problem (Auer, 2003), especially in linear bandit problems (Yue & Guestrin, 2011; Filippi et al., 2010). Inspired by this, we construct a modified UCB rule under a budget constraint. Let and . From the results of linear bandits (Abbasi-Yadkori et al., 2011), we have, with probability ,
formula
5.6
where
formula
5.7
formula
5.8
and is a positive constant. The confidence interval is
formula
5.9
Considering the budget constraint, we denote the unit-cost confidence interval by the confidence interval on each budget unit:
formula
5.10
Following the principle of optimism in the face of uncertainty, we construct a unit-cost upper confidence bound as
formula
5.11
We use this modified UCB rule to deal with the trade-off between exploration and exploitation in the problem of per-round knapsack-constrained linear submodular bandits.
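Because equations 5.9 to 5.11 were lost in conversion, a generic form of the unit-cost upper confidence bound for a candidate arm a given the already chosen set S is sketched below, in placeholder notation (ŵ_t the current estimate, Δ_a(S) the marginal feature vector of equation 5.1, M_t the regularized design matrix, β_t the confidence radius of equation 5.7, and c_a the arm cost); the exact constants are given by the lost equations:

\[ \mathrm{UCB}_t(a \mid S) \;=\; \frac{\hat{w}_t^{\top} \Delta_a(S) \;+\; \beta_t \sqrt{\Delta_a(S)^{\top} M_t^{-1} \Delta_a(S)}}{c_a}. \]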

5.3  Algorithm 1: MCSGreedy

In our first algorithm, we greedily choose each arm with the maximum unit-cost upper confidence bound:
formula
5.12
We also need a partial enumeration () to achieve the (1 − 1/e)-approximation guarantee, which is essential for submodular function maximization under a knapsack constraint (Sviridenko, 2004). As a result, at each time step , we first choose the initial best set of arms with cardinality equal to three through a partial enumeration:
formula
5.13
We then greedily choose each arm according to equation 5.12 until the budget is exhausted. Considering the partial enumeration process, this algorithm is called the modified cost-sensitive UCB-based greedy algorithm (MCSGreedy; see algorithm 1). If , we can always find the estimated optimal set through an enumeration process
formula
formula
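The selection step of MCSGreedy can be sketched as follows. This is a simplified illustration, not the full algorithm: it omits the size-three partial enumeration of equation 5.13, and the helpers score(a, S) and cost(a) are assumed to return the unit-cost UCB of equation 5.12 for arm a given the already chosen set S, and the arm's cost, respectively.

def greedy_select(arms, budget, score, cost):
    # Greedy selection under a per-round knapsack constraint: repeatedly add
    # the affordable arm with the largest score until the budget is exhausted.
    chosen, spent = [], 0.0
    remaining = set(arms)
    while remaining:
        affordable = [a for a in remaining if spent + cost(a) <= budget]
        if not affordable:
            break
        best = max(affordable, key=lambda a: score(a, chosen))
        chosen.append(best)
        spent += cost(best)
        remaining.remove(best)
    return chosen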

5.4  Algorithm 2: CGreedy

Algorithm 1 requires a partial enumeration, which is time-consuming: it needs evaluations of the utility function at each time step ( is the number of arms). We therefore propose another greedy algorithm (see algorithm 2), which provides a (1/2)(1 − 1/e)-approximation guarantee without a partial enumeration.

In algorithm 2, we first choose two subsets of arms at each time step: is greedily selected according to the UCB rule:
formula
5.14
is greedily selected by the modified UCB rule, equation 5.12. After that, we choose the best subset from and as our final choice. The second algorithm is called the competitive UCB-based greedy algorithm (CGreedy; see algorithm 2).
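Reusing the greedy_select sketch above, CGreedy can be illustrated as two greedy passes followed by a comparison. Again this is only a sketch under assumed helpers: ucb_gain(a, S) stands for the plain UCB gain of equation 5.14, dividing it by cost(a) gives the unit-cost rule of equation 5.12, and set_value(S) evaluates the UCB value of a whole subset.

def cgreedy_select(arms, budget, ucb_gain, cost, set_value):
    # Pass 1: rank arms by the plain UCB gain (equation 5.14).
    s_plain = greedy_select(arms, budget, lambda a, S: ucb_gain(a, S), cost)
    # Pass 2: rank arms by UCB gain per unit of cost (equation 5.12).
    s_unit = greedy_select(arms, budget, lambda a, S: ucb_gain(a, S) / cost(a), cost)
    # Keep whichever candidate subset has the larger estimated value.
    return max([s_plain, s_unit], key=set_value)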

6  Theoretical Analysis

In this section, we use the α-Regret to analyze per-round knapsack-constrained linear submodular bandits. We prove different α-Regret bounds and computational complexities for algorithms 1 and 2. Also, a modified lazy evaluation is proven to accelerate our two algorithms without losing the theoretical guarantee (see the details in the appendix).

6.1  α-Regret Bounds

α-Regret is the difference between the reward derived from our algorithm and the α-approximation of the optimal reward. In algorithm 1, we have α = 1 − 1/e, which means that algorithm 1 is at least a (1 − 1/e)-approximation of the optimal solution. We prove the regret bound for algorithm 1 in theorem 1.

Theorem 1.
For and , with probability at least , the α-Regret of algorithm 1 is bounded by
formula
where

The proof is in the appendix.

For , we prove the regret bound for algorithm 2 in theorem 2:

Theorem 2.
For and , with probability at least , the α-Regret of algorithm 2 is bounded by
formula
where

The proof is in the appendix.

We simplify the regret bounds by ignoring the constants and some trivial terms as follows. Let and denote the subset of arms chosen at time step . We have
formula
6.1
We then have
formula
6.2
Finally, the regret bounds for algorithms 1 and 2 can be simplified as
formula
6.3
which means that the cumulative loss of rewards increases sublinearly with . That is, the average loss decreases at a rate of .

6.2  Regret versus Computational Complexity

Our two algorithms have sublinear α-Regret bounds with different approximation rates: (1 − 1/e) for algorithm 1 and (1/2)(1 − 1/e) for algorithm 2. As for computational complexity, algorithm 1 runs in time because of the partial enumeration procedure at each time step, where is the number of arms, while algorithm 2 runs in time. Overall, algorithm 1 enjoys a better regret bound, while algorithm 2 is computationally more efficient.

Compared with LSBGreedy, our two algorithms solve the problem in a more general setting (i.e., the knapsack-constrained setting). LSBGreedy can be seen as a special case of our algorithms in an equal-cost setting, and we prove comparable regret bounds for our algorithms. A detailed comparison of the approximation rate, constraint, and complexity of our algorithms and the baselines is given in Table 1.

Table 1:
Comparison of MCSGreedy, CGreedy, LSBGreedy, and Epsilon-Greedy.
Algorithm  Approximation Rate  Constraint  Complexity
MCSGreedy
CGreedy
LSBGreedy
Epsilon-Greedy

7  Experiment

In this section, we empirically evaluate our algorithms by using news article recommendation (Li et al., 2010) as a case study. We first formulate news article recommendation as a linear submodular bandit problem. We then introduce a knapsack constraint to the news article recommendation, where the cost of an arm indicates the length of the news article (or the reading time). Finally, we compare our algorithms with several baselines on two data sets.

7.1  News Article Recommendation

Let denote a set of news articles, and each news article is represented by a feature vector that gives its information coverage of different topics. For a subset of arms , the information coverage of the different topics is a feature vector where
formula
7.1
and for all .
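A small sketch of the probabilistic coverage model mentioned in section 3 (El-Arini et al., 2009; Yue & Guestrin, 2011), which is one way to realize equation 7.1: the coverage of a set of articles on topic i is the probability that at least one selected article covers that topic. The array and function names are illustrative.

import numpy as np

def coverage_features(selected_indices, article_topics):
    # article_topics: (N, d) array; entry (j, i) is the probability that
    # article j covers topic i. selected_indices: indices of chosen articles.
    chosen = article_topics[selected_indices]    # (k, d) sub-array of chosen articles
    return 1.0 - np.prod(1.0 - chosen, axis=0)   # coverage of each topic by the set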

7.2  Competing Methods and Data Sets

We compare our two algorithms with the following baselines:

  • The LSBGreedy algorithm, which is proposed by Yue and Guestrin (2011) to solve the problem of linear submodular bandits.

  • The Epsilon-Greedy algorithm (in the appendix), which chooses either the arm with the maximum estimated unit-cost reward or a random arm, according to the ε-greedy rule (Auer, Cesa-Bianchi, & Fischer, 2002).

All experiments are performed on two data sets:

  • The simulation data set, which follows the previous work of linear submodular bandits (Yue & Guestrin, 2011).

  • The 20 Newsgroups data set, a popular data set for text analysis in machine learning.

7.3  Results on the Simulation Data Set

In the simulation experiment, we randomly generate a -dimensional vector to represent a news article, , where indicates the information coverage of topic . For each news article, we assume that it has only a limited number of main topics and noise topics . The number of main topics is , and the number of noise topics is .

We use a randomly sampled to indicate a user’s interest level on each topic. We assume that a user will like some topics very much and dislike other topics . We also assume that the user has limited time to read all news articles (El-Arini et al., 2009). We first demonstrate the results in Figure 1, in which the cost of the arm is sampled from a uniform distribution.

Figure 1:

Results comparing MCSGreedy, CGreedy, LSBGreedy, and Epsilon-Greedy on the simulation data set. The cost of the arm is sampled from a uniform distribution.

In Figure 1a, we show what the learned look like compared with and find that the achieved by MCSGreedy has the smallest difference from . In Figure 1b, we show that MCSGreedy and CGreedy always obtain more average rewards. We compare the average rewards under different budgets in Figure 1c, which shows that our algorithms work well with the budget constraint. We compare the average rewards under different cost intervals, that is, different maximum differences between the costs of two arms, in Figure 1d. It is clear that MCSGreedy and CGreedy outperform LSBGreedy and Epsilon-Greedy, especially when the cost interval is large (a large cost interval is assumed to represent a complex setting).

In other settings, it is more reasonable for the cost of an arm to be sampled from a gaussian distribution; for example, most news articles are of medium length, and extremely long (or short) news articles are rare. We observe almost the same results in Figure 2 when the cost is sampled from a gaussian distribution.

Figure 2:

Results comparing MCSGreedy, CGreedy, LSBGreedy, and Epsilon-Greedy on the simulation data set. The cost of the arm is sampled from a gaussian distribution.

7.4  Results on the 20 Newsgroups Data Set

The 20 Newsgroups data set contains approximately 20,000 news article documents, and each group corresponds to a specific topic (see Table 2). In order to perform experiments on this data set, we first train a multiclass classifier (e.g., a random forest; Breiman, 2001) and then perform text classification. We use the classification score to indicate the information coverage of each topic. The cost of an arm is the length of the news document (i.e., the number of words), and the budget is denoted by , where is the mean length of the news article documents.
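One way to reproduce this preprocessing (not necessarily the authors' exact pipeline) is sketched below: a TF-IDF representation, a multinomial logistic-regression classifier whose predicted class probabilities serve as per-topic coverage scores, and word counts as arm costs.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

news = fetch_20newsgroups(subset="train")
tfidf = TfidfVectorizer(max_features=20000)
features = tfidf.fit_transform(news.data)                # sparse TF-IDF features
clf = LogisticRegression(max_iter=1000).fit(features, news.target)
topic_coverage = clf.predict_proba(features)             # scores over the 20 groups
costs = [len(doc.split()) for doc in news.data]          # arm cost = length in words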

Table 2:
Description of Groups for the 20 Newsgroups Data Set.
Index  Group Description         Index  Group Description
1      alt.atheism               11     rec.sport.hockey
2      comp.graphics             12     sci.crypt
3      comp.os.ms-windows.misc   13     sci.electronics
4      comp.sys.ibm.pc.hardware  14     sci.med
5      comp.sys.mac.hardware     15     sci.space
6      comp.windows.x            16     soc.religion.christian
7      misc.forsale              17     talk.politics.guns
8      rec.autos                 18     talk.politics.mideast
9      rec.motorcycles           19     talk.politics.misc
10     rec.sport.baseball        20     talk.religion.misc

We choose the top five topics for each document according to the classification score. We use a randomly sampled to indicate a user’s interest level on each topic. The experimental results are shown in Figure 3.

Figure 3:

Results comparing MCSGreedy, CGreedy, LSBGreedy, and Epsilon-Greedy on the 20 Newsgroups data set. The cost of the arm is defined as the length of a news document (i.e., number of words).

In Figure 3a, we find that MCSGreedy and CGreedy learn more quickly than LSBGreedy and Epsilon-Greedy. In Figure 3b, we show that MCSGreedy and CGreedy obtain more average rewards than LSBGreedy and Epsilon-Greedy. In Figure 3d, we show the distribution of document lengths, which indicates that most of the documents are of medium length. We compare the average rewards under different budgets in Figure 3c, in which MCSGreedy and CGreedy always obtain more average rewards.

8  Conclusion and Future Work

In this letter we introduce a new problem: per-round knapsack-constrained linear submodular bandits. To solve this problem, we define a modified UCB rule to control the trade-off between exploration and exploitation. We propose two algorithms with different regret bounds and computational complexities. To analyze our algorithms, we define an -Regret and prove that both of our algorithms have sublinear regret bounds. The experimental results on two data sets demonstrate that our algorithms outperform the baselines for per-round knapsack-constrained linear submodular bandits.

Considering that there are many different constraints in real-world applications, linear submodular bandits with more complex constraints, such as multiple knapsack constraints or a matroid constraint, will be the subject of future study.

Appendix:  Extended Algorithms and Proofs

A.1  Epsilon-Greedy Algorithm

The ε-greedy rule is a simple and well-known policy for controlling the trade-off between exploration and exploitation in the classical multiarmed bandit problem (Auer, Cesa-Bianchi, & Fischer, 2002). Considering a budget constraint, we choose the arm that has the maximum estimated unit-cost reward with probability . Otherwise, we choose another arm randomly with probability . We describe the Epsilon-Greedy algorithm in algorithm 3, which is used as a baseline in our experiments.

formula
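A sketch of one selection step of this baseline, with assumed helpers est_gain(a, S) (estimated marginal reward of arm a given the chosen set S) and cost(a); the exploration probability eps is illustrative:

import random

def epsilon_greedy_pick(candidates, chosen, est_gain, cost, eps=0.1):
    # With probability eps explore a random arm; otherwise exploit the arm
    # with the largest estimated reward per unit of cost.
    if random.random() < eps:
        return random.choice(candidates)
    return max(candidates, key=lambda a: est_gain(a, chosen) / cost(a))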

In our experiments, the performance of the Epsilon-Greedy algorithm is even better than that of LSBGreedy. However, it is not easy to prove a regret bound for Epsilon-Greedy in the setting of per-round knapsack-constrained linear submodular bandits.

A.2  Proofs of Theorems 1 and 2

Let denote a set of arms and denote a subset of arms chosen one by one in top-down fashion (Yue & Guestrin, 2011). Let denote a set of costs for arms in . At each time step , we choose a subset of arms under a budget constraint . Let denote the unit-cost rewards of for budget unit , where . The expected value of is
formula
A.1
We then have
formula
A.2
and
formula
A.3
Let denote a subset of arms chosen at time step , denote the estimated optimal subset of arms according to , and denote the optimal subset of arms:
formula
A.4
formula
A.5
The feature vector of the subset is then defined as
formula
A.6
Also, let and denote the feature vectors of and :
formula
A.7
formula
A.8
Let denote the feature of at time step :
formula
A.9
We use as an estimate of and is derived by the -regularized least-squares estimate with a regularization parameter . From the results of linear bandits (Abbasi-Yadkori et al., 2011), we have:
Lemma 1
(theorem 8 in Abbasi-Yadkori et al., 2011). Let and . Then for any , with probability at least ,
formula
A.10
where
formula
A.11
formula
A.12
and is a positive constant.
Lemma 2.
For each , we have
formula
Proof.
From theorem 3 in Sviridenko (2004), we have
formula
A.13
According to the definition of and , we have
formula
A.14
Then
formula
A.15
That is,
formula
A.16
Lemma 3.
and are achieved by equations 5.12 and 5.14 without considering the confidence interval. Then we have
formula
Proof.
From theorem 3 in Leskovec et al. (2007), we have
formula
A.17
Then from the definition of and , we have
formula
A.18
As a result,
formula
A.19
That is,
formula
A.20
Proof of Theorem 1. Let , where . From the definition of the α-Regret and lemma 2, we have
formula
A.21
From lemma 1, we know that, with probability ,
formula
A.22
and
formula
A.23
where
formula
Then, with probability at least , we have
formula
A.24
Lemma 4.
If , where , we then have
formula
Proof.
From lemma 9 in Yue and Guestrin (2011), we have
formula
A.25
and
formula
A.26
We then have
formula
A.27
According to the Cauchy-Schwarz inequality, we have
formula
A.28
Following lemma 4, with probability at least , we can bound the α-Regret as follows:
formula
A.29
From theorem 8 in Abbasi-Yadkori et al. (2011), we have
formula
A.30
We replace with , and then with probability , we have
formula
A.31
Proof of Theorem 2. From lemmas 2 and 3, we have
formula
A.32
From algorithm 2, we know that
formula
A.33
Then, with probability at least , we can bound the α-Regret as follows:
formula
A.34
We replace with , and then with probability at least , we have
formula
A.35
where

A.3  Modified Lazy Evaluation

We can directly use lazy evaluation to speed up the Epsilon-Greedy algorithm (see the details in algorithm 3). The idea of lazy evaluation can lead to orders-of-magnitude speed-ups for submodular function maximization algorithms (Leskovec et al., 2007). However, for UCB-based algorithms, the objective function is nonsubmodular due to the confidence interval term . Although lazy evaluation still works in the LSBGreedy algorithm, as mentioned in Yue and Guestrin (2011), it no longer keeps the theoretical guarantee. As a result, we propose a modified lazy evaluation procedure that speeds up our algorithms without losing the theoretical guarantee.

In CGreedy and MCSGreedy, we need to evaluate the value of every possible arm , and we choose the arm
formula
A.36
In the lazy evaluation setting, by using a priority queue, we do not need to evaluate the value of all possible arms. Without loss of generality, we introduce the modified lazy evaluation into the CSGreedy algorithm (see algorithm 4); both MCSGreedy and CGreedy can be seen as different variants of CSGreedy, so we can easily extend it to both of them.
formula
The modified lazy evaluation is described as follows. Let denote the priority queue according to the upper bound of the utility function:
formula
A.37
Let keep the upper bound of for all :
formula
A.38
We first initialize by
formula
A.39
After that, the priority queue keeps property A.37. We then choose an arm at each time step through the following steps:
  1. Evaluate the unit-cost confidence intervals for all . Then let .

  2. Evaluate for all . Then choose .

  3. Update in for all . Then adjust according to .

  4. Update the one by one until (each time we update ; then we adjust the immediately).

We use Lazy-Greedy-Choose() to refer to this four-step procedure and apply it in CSGreedy to construct the Lazy CSGreedy (see algorithm 5).

formula
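A generic sketch of the lazy-evaluation idea behind Lazy-Greedy-Choose(), using a max-heap keyed by upper bounds on the unit-cost scores (equations A.36 to A.38). It is not the exact four-step procedure proven in lemmas 5 and 6, and the helpers ucb_gain(a, S), conf_width(a, S), and cost(a) are assumptions standing for the estimated gain, the unit-cost confidence interval, and the arm cost.

import heapq

def lazy_greedy_choose(candidates, chosen, ucb_gain, conf_width, cost):
    # Build a max-heap of (negated) upper bounds on each arm's unit-cost score.
    # In practice this heap persists across greedy iterations so that most stale
    # bounds are reused; it is rebuilt here only to keep the sketch short.
    heap = [(-(ucb_gain(a, chosen) + conf_width(a, chosen)) / cost(a), i, a)
            for i, a in enumerate(candidates)]
    heapq.heapify(heap)
    while heap:
        neg_bound, i, a = heapq.heappop(heap)
        exact = (ucb_gain(a, chosen) + conf_width(a, chosen)) / cost(a)
        # If no remaining bound can beat the freshly evaluated score, return a.
        if not heap or exact >= -heap[0][0]:
            return a
        heapq.heappush(heap, (-exact, i, a))  # bound was stale: refresh and retry
    return None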

A.4  Correctness of Modified Lazy Evaluation

Next, we show the correctness of the modified lazy evaluation. In lemmas 5 and 6, we prove that the modified lazy evaluation speeds up the CSGreedy algorithm without losing the theoretical guarantee. We first prove that the arm chosen by equation A.36 is exactly the arm chosen by the four-step procedure Lazy-Greedy-Choose() (in lemma 5). We then prove that the priority queue keeps property A.37 after performing Lazy-Greedy-Choose() (in lemma 6).

Lemma 5.
Let = Lazy-Greedy-Choose(). Then we have
formula
Proof.
For the initial state of , we have
formula
A.40
After step 1 in Lazy-Greedy-Choose(), we also have
formula
A.41
After step 2, we have
formula
A.42
formula
A.43
where is the upper bound of . As a result,
formula
A.44
For chosen by step 2, we have
formula
A.45
Finally, from equations A.44 and A.45, we have
formula
A.46
That is, the arm chosen by Lazy-Greedy-Choose() is the same as the arm chosen in equation A.36.
Lemma 6.

At the beginning of each loop (after removing from ), the priority queue also keeps property A.37.

Proof.
After step 3 in Lazy-Greedy-Choose(), we also have:
formula
A.47
In step 4, we assume that it stops at . We then have
formula
A.48
As a result,
formula
A.49
Finally, if we remove from the priority queue, we know from equation A.49 that the priority queue also keeps property A.37.

Acknowledgments

This research is supported by Australian Research Council Projects: FT-130101457, DP-140102164 and LE-140100061. A preliminary version of this work was published in AAAI 2016.

References

Abbasi-Yadkori, Y., Pál, D., & Szepesvári, C. (2011). Online least squares estimation with self-normalized processes: An application to bandit problems. arXiv:1102.2670.
Angel, A., & Koudas, N. (2011). Efficient diversity-aware search. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (pp. 781–792). New York: ACM.
Audibert, J.-Y., & Bubeck, S. (2009). Minimax policies for adversarial and stochastic bandits. In Proceedings of the Conference on Learning Theory (pp. 217–226). Madison, WI: Omnipress.
Auer, P. (2003). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3, 397–422.
Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3), 235–256.
Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. E. (2002). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1), 48–77.
Badanidiyuru, A., Kleinberg, R., & Slivkins, A. (2013). Bandits with knapsacks. In Proceedings of the IEEE 54th Annual Symposium on Foundations of Computer Science (pp. 207–216). Piscataway, NJ: IEEE.
Berry, D. A., & Fristedt, B. (1985). Bandit problems: Sequential allocation of experiments. New York: Springer.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Caro, F., & Gallien, J. (2007). Dynamic assortment with demand learning for seasonal consumer goods. Management Science, 53(2), 276–292.
Carterette, B. (2011). An analysis of NP-completeness in novelty and diversity ranking. Information Retrieval, 14(1), 89–106.
Cesa-Bianchi, N., & Lugosi, G. (2012). Combinatorial bandits. Journal of Computer and System Sciences, 78(5), 1404–1422.
Chakrabarti, D., Kumar, R., Radlinski, F., & Upfal, E. (2009). Mortal multi-armed bandits. In D. Koller, D. Schuurmans, L. Bottou, & Y. Bengio (Eds.), Advances in neural information processing systems, 20 (pp. 273–280). Cambridge, MA: MIT Press.
Chen, S., Lin, T., King, I., Lyu, M. R., & Chen, W. (2014). Combinatorial pure exploration of multi-armed bandits. In M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 26 (pp. 379–387). Red Hook, NY: Curran.
Chen, W., Wang, Y., & Yuan, Y. (2013). Combinatorial multi-armed bandit: General framework and applications. In Proceedings of the 30th International Conference on Machine Learning (pp. 151–159). Madison, WI: Omnipress.
Chu, W., Li, L., Reyzin, L., & Schapire, R. E. (2011). Contextual bandits with linear payoff functions. In Proceedings of the International Conference on Artificial Intelligence and Statistics (pp. 208–214). Cambridge, MA: MIT Press.
Clarke, C. L., Kolla, M., Cormack, G. V., Vechtomova, O., Ashkan, A., Büttcher, S., & MacKinnon, I. (2008). Novelty and diversity in information retrieval evaluation. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 659–666). New York: ACM.
Dani, V., Hayes, T. P., & Kakade, S. M. (2008). Stochastic linear optimization under bandit feedback. In Proceedings of the Conference on Learning Theory (pp. 355–366). Madison, WI: Omnipress.
Das, A., Dasgupta, A., & Kumar, R. (2012). Selecting diverse features via spectral regularization. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 24 (pp. 1583–1591). Red Hook, NY: Curran.
Devanur, N. R., Jain, K., Sivan, B., & Wilkens, C. A. (2011). Near optimal online algorithms and fast approximation algorithms for resource allocation problems. In Proceedings of the 12th ACM Conference on Electronic Commerce (pp. 29–38). New York: ACM.
Ding, W., Qin, T., Zhang, X.-D., & Liu, T.-Y. (2013). Multi-armed bandit with budget constraint and variable costs. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence. AAAI Press.
El-Arini, K., Veda, G., Shahaf, D., & Guestrin, C. (2009). Turning down the noise in the blogosphere. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 289–298). New York: ACM.
Fang, M., & Tao, D. (2014). Networked bandits with disjoint linear payoffs. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1106–1115). New York: ACM.
Filippi, S., Cappe, O., Garivier, A., & Szepesvári, C. (2010). Parametric bandits: The generalized linear case. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, & A. Culotta (Eds.), Advances in neural information processing systems, 22 (pp. 586–594). Red Hook, NY: Curran.
Gai, Y., Krishnamachari, B., & Jain, R. (2010). Learning multiuser channel allocations in cognitive radio networks: A combinatorial multi-armed bandit formulation. In Proceedings of the 2010 IEEE Symposium on New Frontiers in Dynamic Spectrum (pp. 1–9). Piscataway, NJ: IEEE.
Gai, Y., Krishnamachari, B., & Jain, R. (2012). Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking, 20(5), 1466–1478.
Golovin, D., & Krause, A. (2010). Adaptive submodularity: A new approach to active learning and stochastic optimization. In Proceedings of the Conference on Learning Theory (pp. 333–345). Madison, WI: Omnipress.
Kakade, S. M., Kalai, A. T., & Ligett, K. (2009). Playing games with approximation algorithms. SIAM Journal on Computing, 39(3), 1088–1106.
Kalyanakrishnan, S., & Stone, P. (2010). Efficient selection of multiple bandit arms: Theory and practice. In Proceedings of the 27th International Conference on Machine Learning (pp. 511–518). Cambridge, MA: MIT Press.
Kohli, P., Salek, M., & Stoddard, G. (2013). A fast bandit algorithm for recommendation to users with heterogeneous tastes. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence. AAAI Press.
Krause, A., & Golovin, D. (2012). Submodular function maximization. In L. Bordeaux, Y. Hamadi, & P. Kohli (Eds.), Tractability: Practical approaches to hard problems. Cambridge: Cambridge University Press.
Küçüktunç, O., Saule, E., Kaya, K., & Çatalyürek, Ü. V. (2013). Diversified recommendation on graphs: Pitfalls, measures, and algorithms. In Proceedings of the 22nd International Conference on World Wide Web (pp. 715–726). International World Wide Web Conferences Steering Committee.
Kveton, B., Wen, Z., Ashkan, A., & Eydgahi, H. (2014). Matroid bandits: Practical large-scale combinatorial bandits. In Workshops at the Twenty-Eighth AAAI Conference on Artificial Intelligence. AAAI Press.
Lai, T. L., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1), 4–22.
Leskovec, J., Krause, A., Guestrin, C., Faloutsos, C., VanBriesen, J., & Glance, N. (2007). Cost-effective outbreak detection in networks. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 420–429). New York: ACM.
Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web (pp. 661–670). New York: ACM.
Lin, C.-M., & Gen, M. (2008). Multi-criteria human resource allocation for solving multistage combinatorial optimization problems using multiobjective hybrid genetic algorithm. Expert Systems with Applications, 34(4), 2480–2490.
Lin, H., & Bilmes, J. (2011). A class of submodular functions for document summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Vol. 1 (pp. 510–520). Stroudsburg, PA: Association for Computational Linguistics.
Lin, H., & Bilmes, J. A. (2012). Learning mixtures of submodular shells with application to document summarization. arXiv:1210.4871.
Liu, H., Liu, K., & Zhao, Q. (2011). Logarithmic weak regret of non-Bayesian restless multi-armed bandit. In Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 1968–1971). Piscataway, NJ: IEEE.
Liu, K., & Zhao, Q. (2012). Adaptive shortest-path routing under unknown and stochastically varying link states. In Proceedings of the 10th International Symposium on Modeling and Optimization in Mobile, Ad Hoc and Wireless Networks (pp. 232–237). Piscataway, NJ: IEEE.
Nemhauser, G. L., Wolsey, L. A., & Fisher, M. L. (1978). An analysis of approximations for maximizing submodular set functions. Mathematical Programming, 14(1), 265–294.
Radlinski, F., Kleinberg, R., & Joachims, T. (2008). Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning (pp. 784–791). New York: ACM.
Rafiei, D., Bharat, K., & Shukla, A. (2010). Diversifying web search results. In Proceedings of the 19th International Conference on World Wide Web (pp. 781–790). New York: ACM.
Shen, Y., Tobia, M. J., Sommer, T., & Obermayer, K. (2014). Risk-sensitive reinforcement learning. Neural Computation, 26(7), 1298–1328.
Streeter, M., Golovin, D., & Krause, A. (2009). Online learning of assignments. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), Advances in neural information processing systems, 21 (pp. 1794–1802). Red Hook, NY: Curran.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction (Vol. 1). Cambridge, MA: MIT Press.
Sviridenko, M. (2004). A note on maximizing a submodular set function subject to a knapsack constraint. Operations Research Letters, 32(1), 41–43.
Tran-Thanh, L., Chapman, A., Munoz de Cote, E., Rogers, A., & Jennings, N. R. (2010). Epsilon-first policies for budget-limited multi-armed bandits. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence. AAAI Press.
Tran-Thanh, L., Chapman, A., Rogers, A., & Jennings, N. R. (2012). Knapsack based optimal policies for budget-limited multi-armed bandits. arXiv:1204.1909.
Tran-Thanh, L., Xia, Y., Qin, T., & Jennings, N. R. (2015). Efficient algorithms with performance guarantees for the stochastic multiple-choice knapsack problem. In Proceedings of the 24th International Joint Conference on Artificial Intelligence. AAAI Press.
Welch, M. J., Cho, J., & Olston, C. (2011). Search result diversity for informational queries. In Proceedings of the 20th International Conference on World Wide Web (pp. 237–246). New York: ACM.
Yu, B., Fang, M., & Tao, D. (2016). Linear submodular bandits with a knapsack constraint. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. AAAI Press.
Yue, Y., & Guestrin, C. (2011). Linear submodular bandits and their application to diversified retrieval. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 23 (pp. 2483–2491). Red Hook, NY: Curran.
Ziegler, C.-N., McNee, S. M., Konstan, J. A., & Lausen, G. (2005). Improving recommendation lists through topic diversification. In Proceedings of the 14th International Conference on World Wide Web (pp. 22–32). New York: ACM.

Notes

1. might be different for different .