## Abstract

Linear submodular bandits has been proven to be effective in solving the diversification and feature-based exploration problem in information retrieval systems. Considering there is inevitably a budget constraint in many web-based applications, such as news article recommendations and online advertising, we study the problem of diversification under a budget constraint in a bandit setting. We first introduce a budget constraint to each exploration step of linear submodular bandits as a new problem, which we call per-round knapsack-constrained linear submodular bandits. We then define an $\alpha$-approximation unit-cost regret, considering that submodular function maximization is NP-hard. To solve this new problem, we propose two greedy algorithms based on a modified UCB rule. We prove different regret bounds and computational complexities for these two algorithms. Inspired by the lazy evaluation process in submodular function maximization, we also prove that a modified lazy evaluation process can be used to accelerate our algorithms without losing their theoretical guarantee. We conduct a number of experiments, and the experimental results confirm our theoretical analyses.

## 1 Introduction

The trade-off between exploration and exploitation is one of the challenges arising in reinforcement learning (Sutton & Barto, 1998; Shen, Tobia, Sommer, & Obermayer, 2014). The multiarmed bandit (MAB) problem is the simplest instance of the exploration-versus-exploitation dilemma (Auer, Cesa-Bianchi, & Fischer, 2002), which is a trade-off between exploring the environment to find a better action (exploration) and adopting the current best action as often as possible (exploitation). The classical MAB is formulated as a system of arms. At each time step, we choose an arm to pull and obtain a reward from an unknown distribution. The goal is to maximize the cumulative rewards by optimally balancing exploration and exploitation over a finite number of time steps. The most popular measure of an algorithm’s success is regret, which is the cumulative loss of failing to pull the optimal arm. A variant of the classical MAB, which is usually called combinatorial multiarmed bandits (Gai, Krishnamachari, & Jain, 2012; Chen, Wang, & Yuan, 2013), allows multiple arms to be chosen at each time step. Many web-based applications can be modeled as combinatorial bandits; for example, in the personalized recommendation of news articles (Li, Chu, Langford, & Schapire, 2010; Fang & Tao, 2014), multiple news articles are recommended to a user.

Diversification is a key problem in information retrieval systems, such as ranking of documents (Radlinski, Kleinberg, & Joachims, 2008), products recommendation (Ziegler, McNee, Konstan, & Lausen, 2005), and news article recommendation. Implications affecting user satisfaction have been observed in practice: recommendation requires the proposal of a diverse set of items (Ziegler et al., 2005). Submodularity is an intuitive notion of diminishing returns, which states that adding a new item to a larger set helps less than adding the same item to a smaller set. It has turned out that diversification can be well captured by a submodular function (Krause & Golovin, 2012), and linear submodular bandits (Yue & Guestrin, 2011) has thus been proposed to handle the diversification problem in a bandit setting.
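The diminishing-returns property described above can be checked numerically. The following is a minimal sketch, assuming a probabilistic topic-coverage utility; the item names, the coverage probabilities, and the helper functions are hypothetical and are used only for illustration.

```python
from typing import Dict, Iterable, List

# Hypothetical data: probability that each item covers each of two topics.
COVERAGE: Dict[str, List[float]] = {
    "a": [0.8, 0.1],
    "b": [0.7, 0.2],
    "c": [0.1, 0.9],
}
N_TOPICS = 2


def coverage(items: Iterable[str]) -> List[float]:
    """Per-topic probabilistic coverage: 1 - prod(1 - p) over the chosen items."""
    uncovered = [1.0] * N_TOPICS
    for item in items:
        for j, p in enumerate(COVERAGE[item]):
            uncovered[j] *= 1.0 - p
    return [1.0 - u for u in uncovered]


def utility(items: Iterable[str]) -> float:
    """Total coverage summed over topics (monotone and submodular)."""
    return sum(coverage(items))


def marginal_gain(item: str, chosen: List[str]) -> float:
    """Utility gained by adding `item` to the set `chosen`."""
    return utility(chosen + [item]) - utility(chosen)


# Adding "b" to the empty set helps more than adding "b" to the larger set ["a"].
gain_small = marginal_gain("b", [])
gain_large = marginal_gain("b", ["a"])
```

Here `gain_large` is smaller than `gain_small` because item "a" already covers most of the first topic, which is the behavior a diversification objective is meant to reward.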

There is nevertheless always a budget constraint in real-world scenarios, where limited resources are consumed during the process of actions. For example, in dynamic procurement (Badanidiyuru, Kleinberg, & Slivkins, 2013), the budget for buying items is limited. In clinical trials, experiments on alternative medical treatments are limited by the cost of materials. However, a budget constraint imposed on each time step, not the entire process, is more reasonable for other applications. For example, in online advertising (Chakrabarti, Kumar, Radlinski, & Upfal, 2009), the size of a web page is limited, while the ads are changing each time the user visits the web page. In news article recommendations, several articles are recommended to a user, and feedback is obtained each time, but users will have only a limited time to read those articles (e.g., if we recommend three short news articles, the user may read all of them, but if we recommend three long articles, the user might not be patient enough to read all of them). We thus formulate a per-round budget constraint as follows: for all $a \in \mathcal{A}$, let $c_a$ denote the cost of pulling arm $a$, where $\mathcal{A}$ is the set of all arms. The total cost of arm pulling is limited by a budget $B$. At each time step $t$, we choose a subset of arms $A_t \subseteq \mathcal{A}$ under a budget constraint (i.e., $\sum_{a \in A_t} c_a \le B$), which is known as a knapsack constraint (Sviridenko, 2004).

In order to improve user satisfaction by considering the problem of diversification under a budget constraint, we introduce the per-round budget constraint to linear submodular bandits as a new problem, which we refer to as per-round knapsack-constrained linear submodular bandits. To solve this new problem, we construct upper confidence bounds (UCB) under a budget constraint, which we call unit-cost upper confidence bounds, to control the trade-off between exploration and exploitation. Inspired by other knapsack solutions, we try to obtain the maximum reward on each budget unit. Specifically, our algorithms construct a subset of arms by greedily choosing, at each step, the arm with the maximum upper confidence bound on per-budget utility gain.

In this letter, we first briefly review related work. We then describe the new problem of per-round knapsack-constrained linear submodular bandits (or linear submodular bandits with a knapsack constraint; Yu, Fang, & Tao, 2016) and the definition of regret. After that, we propose two greedy algorithms based on a modified UCB rule and prove theoretical regret bounds for both algorithms. We also show that a modified lazy evaluation can be used to accelerate both algorithms without losing the theoretical guarantee. Finally, we use news article recommendation as a case study, which requires us to recommend multiple news articles under a per-round budget constraint. We conduct a number of experiments and compare our two algorithms with the baselines for linear submodular bandits, such as LSBGreedy (Yue & Guestrin, 2011) and Epsilon-Greedy.

## 2 Related Work

Multiarmed bandits addresses the exploration-versus-exploitation dilemma in statistics and reinforcement learning (Lai & Robbins, 1985; Berry & Fristedt, 1985; Sutton & Barto, 1998; Shen et al., 2014). There are two types of multiarmed bandit problems: adversarial bandits and stochastic bandits. Adversarial bandits, in which an adversary controls the arms and tries to defeat the learning process, was initiated by Auer, Cesa-Bianchi, Freund, and Schapire (2002) and followed by Audibert and Bubeck (2009); Kakade, Kalai, and Ligett (2009); and Streeter, Golovin, and Krause (2009) in different settings. For stochastic bandits, in which the rewards of arms are sampled from an unknown distribution, a uniform sublinear regret bound has been provided by Auer, Cesa-Bianchi, and Fischer (2002). They propose several algorithms based on an $\epsilon$-greedy rule or a UCB rule, which are then widely used in multiarmed bandit problems to control the trade-off between exploration and exploitation (Auer, Cesa-Bianchi, & Fischer, 2002; Auer, 2003). Our work lies in stochastic bandits.

A budget constraint on the entire process has been well studied in the classical multiarmed bandit problem. In the budget-limited multiarmed bandit problem (Tran-Thanh, Chapman, Munoz de Cote, Rogers, & Jennings, 2010; Tran-Thanh, Chapman, Rogers, & Jennings, 2012), the costs of pulling different arms are different and the total costs are limited by a fixed budget. A series of work has followed this problem with different assumptions on the cost of the arm (Tran-Thanh, Chapman, Munoz de Cote, Rogers, & Jennings, 2010; Devanur, Jain, Sivan, & Wilkens, 2011; Tran-Thanh, Chapman, Rogers, & Jennings, 2012; Ding, Qin, Zhang, & Liu, 2013; Badanidiyuru, Kleinberg, & Slivkins, 2013). More specifically, in the setting proposed by Tran-Thanh et al. (2010), the cost of each arm is fixed. This problem was first solved by a simple budgeted $\epsilon$-first algorithm (Tran-Thanh et al., 2010) and subsequently by an improved algorithm called KUBE (Tran-Thanh, Chapman, Rogers, & Jennings, 2012). Another setting, in which the cost of an arm is variable, has been studied by Ding et al. (2013). The most important budget constraint is known as a knapsack constraint (Devanur et al., 2011; Badanidiyuru et al., 2013), in which the total cost is the sum of the costs of the pulled arms.

Combinatorial multiarmed bandits, which allows multiple arms to be pulled at each time step, is a variant of the classical multiarmed bandit problem (Kalyanakrishnan & Stone, 2010; Cesa-Bianchi & Lugosi, 2012). Several specific instances of combinatorial multiarmed bandits have been studied by Caro and Gallien (2007) and Liu, Liu, and Zhao (2011) and also in different applications, such as resource allocation (Lin & Gen, 2008), cognitive radio networks (Gai, Krishnamachari, & Jain, 2010), adaptive shortest-path routing (Liu & Zhao, 2012), and pure exploration bandits (Chen, Lin, King, Lyu, & Chen, 2014). A general combinatorial bandit framework with an approximation oracle has been studied by Gai, Krishnamachari, and Jain (2012) in a linear rewards setting and extended by Chen et al. (2013) to a nonlinear rewards setting. Recently, combinatorial multiarmed bandits with a matroid constraint, a notion of independence in combinatorial optimization, has been proposed by Kveton, Wen, Ashkan, and Eydgahi (2014). Another interesting study considers a multiple-choice knapsack bandit (Tran-Thanh, Xia, Qin, & Jennings, 2015), in which the arm is chosen from several subsets of arms. However, all previous work has focused on the constraint on the entire process and has not considered a per-round budget constraint.

In information retrieval systems, combinatorial multiarmed bandits has been used to identify user interests (Li et al., 2010), especially when multiple items need to be recommended simultaneously, such as news articles (Kohli, Salek, & Stoddard, 2013) and online advertising (Chakrabarti et al., 2009). In these multiple-item retrieval settings, the notion of diversity has been addressed in many studies, such as diverse rankings of documents (Radlinski et al., 2008), diverse search results (Rafiei, Bharat, & Shukla, 2010; Welch, Cho, & Olston, 2011; Angel & Koudas, 2011), and topic diversification in recommendation systems (Ziegler, McNee, Konstan, & Lausen, 2005; Clarke, Kolla, Cormack, Vechtomova, Ashkan, Büttcher, & MacKinnon, 2008; Küçüktunç, Saule, Kaya, & Çatalyürek, 2013). The problem of diversification in information retrieval systems is known to be an NP-hard combinatorial optimization problem (Carterette, 2011). Submodularity is a notion of diminishing returns (Golovin & Krause, 2010; Krause & Golovin, 2012), which means that adding a new item to a smaller set will achieve more utility gain than adding the same item to a larger set. It has been proven that submodularity is helpful in approximately solving the diversification problem, and submodularity-based algorithms have been used in many real-world applications (Lin & Bilmes, 2011, 2012; Das, Dasgupta, & Kumar, 2012). Under a cardinality constraint, a greedy algorithm solves the monotone submodular function maximization problem with a $(1 - 1/e)$-approximation guarantee (Nemhauser, Wolsey, & Fisher, 1978). Algorithms for the submodular maximization problem with different constraints, such as a knapsack constraint (Sviridenko, 2004; Leskovec et al., 2007) or a matroid constraint (El-Arini, Veda, Shahaf, & Guestrin, 2009), also have been proposed.

Recently, linear submodular bandits has been proposed by Yue and Guestrin (2011) as a typical combinatorial bandit model to handle diversification in information retrieval systems. Linear submodular bandits introduces submodularity to linear bandits, a well-studied bandit model (Dani, Hayes, & Kakade, 2008; Filippi, Cappe, Garivier, & Szepesvári, 2010; Abbasi-Yadkori, Pál, & Szepesvári, 2011; Chu, Li, Reyzin, & Schapire, 2011). A budget constraint nevertheless exists in real-world applications; for example, in news article recommendation, a user has a limited time to read all recommended news articles (El-Arini et al., 2009), and in online advertising (Chakrabarti et al., 2009), the size of a web page for ads is limited. However, a budget constraint has not been considered in linear submodular bandits, and the cardinality constraint can be seen only as a special case of a budget constraint. In order to solve the diversification problem under a budget constraint, we thus introduce a per-round knapsack constraint to linear submodular bandits. Unlike previous knapsack-constrained bandit problems (Devanur et al., 2011; Badanidiyuru et al., 2013), in which the knapsack constraint works on the full time sequence, our per-round knapsack constraint is imposed on each time step separately.

## 3 Problem Definition

We formulate the per-round knapsack-constrained linear submodular bandits as follows. Let $\mathcal{A}$ denote a set of arms and $\{c_a\}_{a \in \mathcal{A}}$ be a set of costs for arms. At each time step $t$, we greedily choose each arm in $A_t \subseteq \mathcal{A}$ under a budget constraint $\sum_{a \in A_t} c_a \le B$ and obtain rewards $r_t$, a random variable with the martingale assumption (Abbasi-Yadkori et al., 2011). The expected rewards of $A_t$ are measured by a monotone submodular utility function $F(A_t \mid w^*)$, where $w^*$ is a parameter vector.
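To make the formulation concrete, the following is a minimal sketch of a linear submodular utility together with the per-round budget check. It assumes the utility is a weighted sum of per-topic probabilistic coverage functions, in the style of Yue and Guestrin (2011); the `Arm` class, the arm data, and the weights are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Arm:
    name: str
    cost: float                    # cost of pulling the arm, e.g. reading time
    topic_probs: Sequence[float]   # hypothetical per-topic coverage probabilities


def topic_coverage(arms: Sequence[Arm], d: int) -> List[float]:
    """Per-topic coverage of a set of arms: 1 - prod(1 - p)."""
    uncovered = [1.0] * d
    for arm in arms:
        for j in range(d):
            uncovered[j] *= 1.0 - arm.topic_probs[j]
    return [1.0 - u for u in uncovered]


def utility(arms: Sequence[Arm], w: Sequence[float]) -> float:
    """Linear combination of the per-topic coverage values with weights w."""
    cov = topic_coverage(arms, len(w))
    return sum(wj * cj for wj, cj in zip(w, cov))


def within_budget(arms: Sequence[Arm], budget: float) -> bool:
    """Per-round knapsack constraint: total cost must not exceed the budget."""
    return sum(arm.cost for arm in arms) <= budget


w_true = [0.9, 0.1]  # assumed user preference over two topics
short = Arm("short", cost=1.0, topic_probs=(0.5, 0.0))
long_ = Arm("long", cost=3.0, topic_probs=(0.9, 0.2))

feasible = within_budget([short, long_], budget=4.0)
value = utility([short], w_true)
```

In a bandit setting the weights are unknown, so `w_true` would be replaced by an estimate learned from feedback.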

## 4 $\alpha$-Regret

Regret, which indicates the loss of not always pulling the optimal arm, has been widely used in bandit problems as a measure of an algorithm’s success. Considering that submodular function maximization is NP-hard, we can find only approximate solutions in polynomial time (Sviridenko, 2004; Leskovec et al., 2007). As a result, we can guarantee only an $\alpha$-approximation solution for per-round knapsack-constrained linear submodular bandits, even if we know the parameter $w^*$. In a bandit setting, we use $w_t$, which is evaluated according to the previous time steps’ feedback, as an estimate of $w^*$ to help us make decisions.
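As a rough guide, an $\alpha$-regret of this kind is commonly written as follows. This is a sketch under assumed notation, not the exact definition used in this letter: $A_t^*$ stands for an optimal budget-feasible subset at time step $t$, $A_t$ for the subset chosen by the algorithm, $F(\cdot \mid w^*)$ for the utility function, and $T$ for the number of time steps.

```latex
\mathrm{Reg}_{\alpha}(T)
  \;=\; \alpha \sum_{t=1}^{T} F\!\left(A_t^{*} \mid w^{*}\right)
  \;-\; \sum_{t=1}^{T} F\!\left(A_t \mid w^{*}\right)
```

An algorithm with sublinear $\mathrm{Reg}_{\alpha}(T)$ therefore approaches, on average, at least an $\alpha$ fraction of the optimal per-round utility.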

## 5 Algorithms

In this section, we introduce the evaluation of $w_t$ as well as a modified UCB rule. We then propose two greedy algorithms based on the modified UCB rule to solve the problem of per-round knapsack-constrained linear submodular bandits. Both of these algorithms can be seen as extensions of submodular greedy algorithms (Sviridenko, 2004; Leskovec et al., 2007) to bandit settings. There is a regret versus computational complexity trade-off between our two algorithms.

### 5.1 Evaluation of $w_t$

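Let $A_t = \{a_1, \dots, a_k\}$^{1} denote a subset of arms chosen at time step $t$. For simplicity, we define $A_t^{1:i} = \{a_1, \dots, a_i\}$ for all $i \in \{1, \dots, k\}$, with $A_t^{1:0} = \emptyset$. Then the feature vector of the arm $a_i$ is $\Delta(a_i \mid A_t^{1:i-1})$, where $\Delta(a \mid A) = [F_1(A \cup \{a\}) - F_1(A), \dots, F_d(A \cup \{a\}) - F_d(A)]^\top$ and $F_1, \dots, F_d$ are the basis submodular functions. At each time step $t$, we greedily choose each arm to construct $A_t$ and then acquire all rewards $r_t = [r_t(a_1), \dots, r_t(a_k)]$, where $r_t(a_i)$ denotes the rewards of $a_i$. The expected rewards of $a_i$ are

$$\mathbb{E}[r_t(a_i)] = {w^*}^\top \Delta(a_i \mid A_t^{1:i-1}).$$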

### 5.2 Modified UCB Rule

### 5.3 Algorithm 1: MCSGreedy

### 5.4 Algorithm 2: CGreedy

Algorithm 1 has a partial enumeration, which is time-consuming: times evaluation of the utility function in each time step ( is the number of arms). We therefore propose another greedy algorithm (see algorithm 2), which is able to provide -approximation guarantee without a partial enumeration.
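The greedy step without partial enumeration can be sketched as follows. This is a simplified illustration, not the algorithm as specified in this letter: `ucb_gain` is a hypothetical callable that stands in for the upper confidence bound on the utility gain of an arm given the arms already chosen, costs are assumed to be strictly positive, and any additional safeguards of the actual algorithm are omitted.

```python
from typing import Callable, Dict, List, Sequence


def unit_cost_greedy(
    arms: Sequence[str],
    costs: Dict[str, float],
    budget: float,
    ucb_gain: Callable[[str, List[str]], float],
) -> List[str]:
    """Repeatedly add the affordable arm with the largest UCB gain per unit cost."""
    chosen: List[str] = []
    remaining = budget
    candidates = set(arms)
    while True:
        # Sort for a deterministic scan order; only arms that still fit are considered.
        affordable = [a for a in sorted(candidates) if costs[a] <= remaining]
        if not affordable:
            break
        best = max(affordable, key=lambda a: ucb_gain(a, chosen) / costs[a])
        chosen.append(best)
        remaining -= costs[best]
        candidates.remove(best)
    return chosen


# Toy example with fixed (modular) gains; a real UCB gain would depend on `chosen`.
toy_costs = {"a": 1.0, "b": 2.0, "c": 4.0}
toy_gains = {"a": 1.0, "b": 3.0, "c": 4.4}
picked = unit_cost_greedy(
    ["a", "b", "c"], toy_costs, budget=4.0,
    ucb_gain=lambda arm, chosen: toy_gains[arm],
)
```

In the toy example, arm "b" has the best gain-to-cost ratio and is chosen first; arm "c" no longer fits in the remaining budget, so arm "a" is added next.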

## 6 Theoretical Analysis

In this section, we use $\alpha$-Regret for the analysis of per-round knapsack-constrained linear submodular bandits. We prove different $\alpha$-Regret bounds and computational complexities for algorithms 1 and 2. Also, a modified lazy evaluation is proven to accelerate our two algorithms without losing the theoretical guarantee (see the details in the appendix).

### 6.1 $\alpha$-Regret Bounds

$\alpha$-Regret is the difference between the reward derived from our algorithm and the $\alpha$-approximation of the optimal reward. In algorithm 1, we have , which means that algorithm 1 is at least a -approximation of the optimal solution. We prove the regret bound for algorithm 1 in theorem 3.

The proof is in the appendix.

For , we prove the regret bound for algorithm 2 in theorem 4:

The proof is in the appendix.

### 6.2 Regret versus Computational Complexity

Our two algorithms have sublinear $\alpha$-Regret bounds with different approximation rates: for algorithm 1 and for algorithm 2. Algorithm 1 enjoys a better regret bound. For computational complexity, algorithm 1 runs in time because of a partial enumeration procedure in each time step, where is the number of arms. Algorithm 2 runs in time and is thus computationally more efficient than algorithm 1. Overall, algorithm 1 enjoys a better regret bound, while algorithm 2 is computationally more efficient.

Compared with LSBGreedy, our two algorithms solve the problem in a more general setting (i.e., the knapsack-constrained setting). LSBGreedy can be seen as a special case of our algorithms in an equal-cost setting, and we prove comparable regret bounds for our algorithms. A detailed comparison of approximation rate, constraint, and complexity between our algorithms and the baselines is shown in Table 1.

## 7 Experiment

In this section, we empirically evaluate our algorithms by using news article recommendation (Li et al., 2010) as a case study. We first formulate news article recommendation into the linear submodular bandits. We then introduce a knapsack constraint to the news article recommendation where the costs of arms indicate the length of the news article (or the reading time). Finally, we compare our algorithms with several baselines on two data sets.

### 7.1 News Article Recommendation

### 7.2 Competing Methods and Data Sets

We compare our two algorithms with the following baselines:

- The LSBGreedy algorithm, which is proposed by Yue and Guestrin (2011) to solve the problem of linear submodular bandits.
- The Epsilon-Greedy algorithm (in the appendix), which randomly chooses the arm with the maximum unit-cost reward according to the $\epsilon$-greedy rule (Auer, Cesa-Bianchi, & Fischer, 2002).

All experiments are performed on two data sets:

- The simulation data set, which follows the previous work on linear submodular bandits (Yue & Guestrin, 2011).
- The 20 Newsgroups data set, a popular data set for text analysis in machine learning.

### 7.3 Results on the Simulation Data Set

In a simulation experiment, we randomly generate a -dimensional vector to represent a news article, , where indicates the information coverage on the topic . For each news article, we assume that it has only a limited number of main topics and noisy topics . The number of main topics is , and the number of noise topics is .

We use a randomly sampled to indicate a user’s interest level on each topic. We assume that a user will like some topics very much and dislike other topics . We also assume that the user has limited time to read all news articles (El-Arini et al., 2009). We first demonstrate the results in Figure 1, in which the cost of the arm is sampled from a uniform distribution.

In Figure 1a, we show how the learned parameter vector $w_t$ compares with the true parameter vector $w^*$ and find that the $w_t$ achieved by MCSGreedy has the smallest differences with $w^*$. In Figure 1b, we show that MCSGreedy and CGreedy always obtain more average rewards. We compare the average rewards under different budgets in Figure 1c, in which our algorithms work well with the budget constraint. We compare the average rewards under different cost intervals, that is, the maximum differences between the costs of two arms, in Figure 1d. It is clear that MCSGreedy and CGreedy outperform LSBGreedy and Epsilon-Greedy, especially when the cost interval is large (it is assumed that a large cost interval is representative of a complex setting).

In other settings, it is more reasonable to sample the cost of an arm from a Gaussian distribution; for example, most news articles are of medium length, and extremely long (or short) news articles are rare. We obtain almost the same results in Figure 2 when the cost is sampled from a Gaussian distribution.

### 7.4 Results on the 20 Newsgroups Data Set

The 20 Newsgroups data set contains approximately 20,000 news article documents, and each group corresponds to a specific topic (see Table 2). In order to perform experiments on this data set, we first train a softmax classifier (e.g., random forest; Breiman, 2001) and then perform text classification. We use the classification score to indicate the information coverage of each topic. The cost of the arm is the length of the news document (i.e., number of words), and the budget is denoted by , where is the mean length of news article documents.

| Index | Group Description | Index | Group Description |
|---|---|---|---|
| 1 | alt.atheism | 11 | rec.sport.hockey |
| 2 | comp.graphics | 12 | sci.crypt |
| 3 | comp.os.ms-windows.misc | 13 | sci.electronics |
| 4 | comp.sys.ibm.pc.hardware | 14 | sci.med |
| 5 | comp.sys.mac.hardware | 15 | sci.space |
| 6 | comp.windows.x | 16 | soc.religion.christian |
| 7 | misc.forsale | 17 | talk.politics.guns |
| 8 | rec.autos | 18 | talk.politics.mideast |
| 9 | rec.motorcycles | 19 | talk.politics.misc |
| 10 | rec.sport.baseball | 20 | talk.religion.misc |

We choose the top five topics for each document according to the classification score. We use a randomly sampled to indicate a user’s interest level on each topic. The experimental results are shown in Figure 3.

In Figure 3a, we find that MCSGreedy and CGreedy learn more quickly than LSBGreedy and Epsilon-Greedy. In Figure 3b, we demonstrate that MCSGreedy and CGreedy obtain more average rewards than LSBGreedy and Epsilon-Greedy. In Figure 3d, we show the distribution of the length of documents, which indicates that most documents are of medium length. We compare the average rewards under different budgets in Figure 3c, in which MCSGreedy and CGreedy always obtain more average rewards.

## 8 Conclusion and Future Work

In this letter, we introduce a new problem: per-round knapsack-constrained linear submodular bandits. To solve this problem, we define a modified UCB rule to control the trade-off between exploration and exploitation. We propose two algorithms with different regret bounds and computational complexities. To analyze our algorithms, we define an $\alpha$-Regret and prove that both of our algorithms have sublinear regret bounds. The experimental results on two data sets demonstrate that our algorithms outperform the baselines for per-round knapsack-constrained linear submodular bandits.

Considering that there are many different constraints in real-world applications, the linear submodular bandits with a more complex constraint, such as multiple knapsack constraints or a matroid constraint, will be the subject of future study.

## Appendix: Extended Algorithms and Proofs

### A.1 Epsilon-Greedy Algorithm

The $\epsilon$-greedy rule is a simple and well-known policy to control the trade-off between exploration and exploitation in a classical multiarmed bandit problem (Auer, Cesa-Bianchi, & Fischer, 2002). Considering a budget constraint, we choose the arm that has the maximum estimated unit-cost rewards, with probability $1 - \epsilon$. Otherwise, we choose another arm randomly with probability $\epsilon$. We describe the Epsilon-Greedy algorithm in algorithm 3, which is used as a baseline in our experiments.

The performance of the Epsilon-Greedy algorithm in experiments is even better than that of LSBGreedy. However, it is not easy to prove a regret bound for Epsilon-Greedy in the setting of per-round knapsack-constrained linear submodular bandits.

### A.2 Proofs of Theorems 3 and 4
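A single selection step of this rule can be sketched as follows. The function name, the reward estimates, and the costs are hypothetical; for simplicity, the exploration branch draws uniformly from all candidates (so it may also return the greedy arm), and costs are assumed to be strictly positive.

```python
import random
from typing import Dict, Sequence


def epsilon_greedy_pick(
    candidates: Sequence[str],
    est_reward: Dict[str, float],
    costs: Dict[str, float],
    epsilon: float,
    rng: random.Random,
) -> str:
    """Exploit the best estimated reward per unit cost, or explore with probability epsilon."""
    if rng.random() < epsilon:
        # Exploration: a uniformly random candidate.
        return rng.choice(list(candidates))
    # Exploitation: the arm with the maximum estimated unit-cost reward.
    return max(candidates, key=lambda a: est_reward[a] / costs[a])


rng = random.Random(0)  # fixed seed so the sketch is reproducible
est = {"a": 1.0, "b": 3.0}
cost = {"a": 1.0, "b": 2.0}
greedy_choice = epsilon_greedy_pick(["a", "b"], est, cost, epsilon=0.0, rng=rng)
explore_choice = epsilon_greedy_pick(["a", "b"], est, cost, epsilon=1.0, rng=rng)
```

With `epsilon=0.0` the step is purely greedy, and with `epsilon=1.0` it is purely random; a full algorithm would repeat this step until no remaining arm fits in the budget.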

^{3}in Sviridenko (2004), we have According to the definition of and , we have Then That is,

^{9}in Yue and Guestrin (2011), we have and We then have According to the Cauchy-Schwarz inequality, we have

^{8}, with probability at least , we can bound the -Regret as follows: From theorem 8 in Abbasi-Yadkori et al. (2011), we have We replace with , and then with probability , we have

### A.3 Modified Lazy Evaluation

We can directly use lazy evaluation to speed up the Epsilon-Greedy algorithm (see the details in algorithm 3). Lazy evaluation can lead to orders-of-magnitude speed-ups for submodular function maximization algorithms (Leskovec et al., 2007). However, for UCB-based algorithms, the objective function is nonsubmodular due to the confidence interval term. Although lazy evaluation still works in the LSBGreedy algorithm, as mentioned in Yue and Guestrin (2011), it no longer keeps the theoretical guarantee. As a result, we propose a modified lazy evaluation procedure that speeds up our algorithms without losing the theoretical guarantee.

Evaluate the unit-cost confidence intervals for all . Then let .

Evaluate for all . Then choose .

Update in for all . Then adjust according to .

Update the one by one until (each time we update ; then we adjust the immediately).

We use Lazy-Greedy-Choose() to refer to this four-step procedure and apply it in CSGreedy to construct the Lazy CSGreedy (see algorithm 5).
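For comparison, the classic lazy evaluation for submodular objectives can be sketched with a priority queue. This is the standard technique, not the modified four-step procedure of this letter: it is valid only when `gain` has diminishing returns, so that a stale gain is an upper bound on the current gain. The function names and the toy data are hypothetical, and costs are assumed to be strictly positive.

```python
import heapq
from typing import Callable, Dict, List, Sequence


def lazy_greedy(
    arms: Sequence[str],
    costs: Dict[str, float],
    budget: float,
    gain: Callable[[str, List[str]], float],
) -> List[str]:
    """Unit-cost greedy that recomputes a gain only when the arm reaches the queue top."""
    chosen: List[str] = []
    remaining = budget
    # Max-heap via negated ratios; the third field records len(chosen) at evaluation time.
    heap = [(-gain(a, chosen) / costs[a], a, 0) for a in arms]
    heapq.heapify(heap)
    while heap:
        _, arm, stamp = heapq.heappop(heap)
        if costs[arm] > remaining:
            continue  # the remaining budget only shrinks, so this arm can never fit
        if stamp == len(chosen):
            # The ratio is up to date and beats every (possibly stale) upper bound.
            chosen.append(arm)
            remaining -= costs[arm]
        else:
            # Stale entry: re-evaluate against the current set and reinsert.
            heapq.heappush(heap, (-gain(arm, chosen) / costs[arm], arm, len(chosen)))
    return chosen


lazy_costs = {"a": 1.0, "b": 2.0, "c": 4.0}
lazy_gains = {"a": 1.0, "b": 3.0, "c": 4.4}
lazy_picked = lazy_greedy(
    ["a", "b", "c"], lazy_costs, budget=4.0,
    gain=lambda arm, chosen: lazy_gains[arm],
)
```

Because the confidence interval term breaks submodularity, this plain version would not carry the same guarantee for UCB-based objectives, which is the motivation for the modified procedure described above.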

### A.4 Correctness of Modified Lazy Evaluation

Next, we show the correctness of modified lazy evaluation. In lemmas 9 and 10, we prove that the modified lazy evaluation speeds up the CSGreedy algorithm without losing the theoretical guarantee. We first prove that the arm chosen by equation A.36 is exactly the arm chosen by the four-step procedure Lazy-Greedy-Choose() (in lemma 9). We then prove that the priority queue keeps property A.37 after performing Lazy-Greedy-Choose() (in lemma 10).

At the beginning of each loop (after removing from ), the priority queue also keeps property A.37.

## Acknowledgments

This research is supported by Australian Research Council Projects: FT-130101457, DP-140102164 and LE-140100061. A preliminary version of this work was published in AAAI 2016.

## References

## Notes

^{1} The number of chosen arms might be different for different time steps.