## Abstract

We consider the learning problem under an online Markov decision process (MDP), aimed at learning the time-dependent decision-making policy of an agent that minimizes the regret—the difference from the best fixed policy. The difficulty of online MDP learning is that the reward function changes over time. In this letter, we show that a simple online policy gradient algorithm achieves regret $O(\sqrt{T})$ for *T* steps under a certain concavity assumption and $O(\log T)$ under a strong concavity assumption. To the best of our knowledge, this is the first work to present an online MDP algorithm that can handle continuous state, action, and parameter spaces with guarantee. We also illustrate the behavior of the proposed online policy gradient method through experiments.

## 1 Introduction

The Markov decision process (MDP) is a popular framework of reinforcement learning for sequential decision making (Sutton & Barto, 1998), where an agent takes an action depending on the current state, moves to the next state, receives a reward based on the last transition, and this process is repeated *T* times. The goal is to find an optimal decision-making policy (i.e., a conditional probability density of action given state) that maximizes the expected sum of rewards over *T* steps.
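The interaction protocol described above can be sketched as follows (a toy discrete MDP with an arbitrary fixed policy; all quantities here are illustrative assumptions, not the letter's setup):

```python
import numpy as np

# Toy MDP interaction loop: the agent observes a state, samples an action from
# its policy, receives a reward, and transitions to the next state, T times.
rng = np.random.default_rng(0)
n_states, n_actions, T = 3, 2, 1000

# p[s, a] is a distribution over next states; r[s, a] is the reward.
p = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.uniform(size=(n_states, n_actions))
policy = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform policy pi(a|s)

s, total = 0, 0.0
for _ in range(T):
    a = rng.choice(n_actions, p=policy[s])  # act according to the policy
    total += r[s, a]                        # receive reward for the transition
    s = rng.choice(n_states, p=p[s, a])     # move to the next state

print(f"average reward over {T} steps: {total / T:.3f}")
```

An optimal policy would replace the uniform `policy` with one maximizing the expected sum of rewards.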

In the standard MDP formulation, the reward function is fixed over iterations. However, this assumption is often violated in reality. In this letter, we consider an online MDP scenario where the reward function is allowed to change over time. Such an online MDP problem is an extension of both online decision making and reinforcement learning (Yu, Mannor, & Shimkin, 2009):

- In an online decision-making problem, the agent needs to make a decision at each time step without knowledge of the future environment (Kalai & Vempala, 2005). A certain cost function is observed only after the decision is made at each time step, and the goal is to minimize the regret against the best single decision. There is no assumption on the dynamics in the online decision-making problem, and thus the decision can switch from one to another abruptly.

- In reinforcement learning, the dynamics are assumed to be Markovian. The reward function and transition dynamics are fixed but unknown to the agent, and thus the estimated reward and transition functions converge to the true ones if sufficient samples are observed. The goal is to find the optimal policy that maximizes the cumulative reward without full information about the environment.
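The regret notion used in online decision making can be made concrete with a toy example (illustrative assumptions throughout): a follow-the-leader learner measured against the best single decision in hindsight.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K = 1000, 3                                 # time steps, candidate decisions
rewards = rng.uniform(0.0, 1.0, size=(T, K))   # revealed only after acting

# Follow the leader: pick the decision with the best cumulative reward so far.
cum = np.zeros(K)
learner_reward = 0.0
for t in range(T):
    choice = int(np.argmax(cum))
    learner_reward += rewards[t, choice]
    cum += rewards[t]                          # full-information feedback

# Regret against the best single (fixed) decision in hindsight.
best_fixed = rewards.sum(axis=0).max()
regret = best_fixed - learner_reward
print(f"regret after {T} steps: {regret:.1f}")
```

With i.i.d. rewards the regret stays small relative to `T`; adversarial reward sequences are what make the problem hard.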

The goal of the online MDP problem is to find the best time-dependent policy that minimizes the regret, the difference from the best fixed policy. We expect the regret to be sublinear in *T* (e.g., $O(\sqrt{T})$), so that the per-step difference from the best fixed policy vanishes as *T* goes to infinity.

The MDP expert algorithm (MDP-E), which chooses the current best action at each state, was shown to achieve regret $O(\sqrt{T\log|\mathcal{A}|})$ (Even-Dar, Kakade, & Mansour, 2004, 2009), where $|\mathcal{A}|$ denotes the cardinality of the action space. Although this bound does not explicitly depend on the cardinality of the state space, the algorithm itself needs an expert algorithm for each state, and thus a large state space may not be handled in practice. Another algorithm, called the lazy follow-the-perturbed-leader (lazy-FPL), divides the time steps into short periods and updates policies only at the end of each period using the average reward function (Yu et al., 2009). This lazy-FPL algorithm was shown to have regret $O(T^{3/4+\epsilon})$ for $\epsilon \in (0, 1/3)$. Another online MDP algorithm, called online relative entropy policy search, is considered in Zimin and Neu (2013), which was shown to have regret $O(\sqrt{LT\log(|\mathcal{S}||\mathcal{A}|/L)})$ for a state space with an *L*-layered structure. However, the regret bounds of these algorithms explicitly depend on $|\mathcal{S}|$ and $|\mathcal{A}|$, and the algorithms cannot be directly implemented for problems with continuous state and action spaces. The online algorithm for Markov decision processes of Abbasi-Yadkori, Bartlett, Kanade, Seldin, and Szepesvari (2013) was shown to have regret $O(\sqrt{T\log|\Pi|})$ with changing transition probability distributions, where $|\Pi|$ is the cardinality of the policy set. Although sublinear bounds still hold for continuous policy spaces, the algorithm cannot be used with infinitely many policy candidates directly. The online MDP problem is formulated as an online linear optimization problem in Dick, György, and Szepesvári (2014). By introducing stationary occupation measures, mirror descent with approximate projections was shown to have regret $O(\sqrt{T})$. However, the algorithm assumes that both the state and action spaces are finite. Yu et al. (2009), Abbasi-Yadkori et al. (2013), and Neu, György, and Szepesvári (2012) considered even more challenging online MDP problems under unknown or changing transition dynamics.

In practice, full information of the reward function may be hard to acquire, and only the value of the reward function at the current state and action may be available. Such a setup, called the bandit feedback scenario, has attracted a great deal of attention recently. An extension of the lazy-FPL method to the bandit-feedback scenario, called the exploratory-FPL algorithm (Yu et al., 2009), was also shown to achieve sublinear regret. Neu, György, Szepesvári, and Antos (2010) proposed a method based on MDP-E that uses an unbiased estimator of the reward function and showed that its regret is $O(T^{2/3})$. Neu, György, Szepesvári, and Antos (2014) further improved the regret bound to $O(\sqrt{T})$. However, this algorithm cannot be used in continuous state and action problems.

In this letter, we propose a simple online policy gradient (OPG) algorithm that can be implemented in a straightforward manner for problems with continuous state and action spaces and could be seen as an extension of Dick et al. (2014).^{1} Under the assumption that the expected average reward function is concave, we prove that the regret of our OPG algorithm with respect to a compact and convex parametric policy set is $O(\sqrt{T})$, which is independent of the cardinality of the state and action spaces but depends on the diameter *F* and dimension *N* of the parameter space. Furthermore, $O(\log T)$ regret is proved under a strong concavity assumption on the expected average reward function. We also extend the proposed algorithm to the bandit-feedback scenario and theoretically prove that the regret bound of the proposed algorithm is $O(\sqrt{T})$ under the concavity assumption. We numerically illustrate the superior behavior of the proposed OPG algorithm in continuous problems over MDP-E with different discretization schemes.

The remainder of this letter is organized as follows. In section 2, we give a formal definition of the online MDP problem. Our proposed algorithm is given in section 3, and regret analyses in the full-information and bandit-feedback scenarios are given in sections 4 and 5, with proofs presented in the appendixes.

## 2 Online Markov Decision Process

In this section, we formulate the problem of online MDP learning. An online MDP is specified by:

- State space $\mathcal{S}$, which could be either continuous or discrete.

- Action space $\mathcal{A}$, which contains all possible actions $a$ and could be either continuous or discrete.

- Transition density $p(s' \mid s, a)$, which represents the conditional probability density of the next state $s'$ given the current state $s$ and the action $a$ to be taken. We assume that the transition density is fully available to the agent.

- Reward function sequence $\{r_t\}_{t=1}^{T}$, which is a fixed sequence of real-valued functions determined in advance and does not change no matter what action is taken.

The policy of the agent at time step *t* is a conditional probability density $\pi_{\theta_t}(a \mid s)$ parameterized by $\theta_t \in \Theta$, where $\Theta$ is a convex and compact parameter set. Thus, algorithm $\mathcal{A}$ gives a sequence of policies
$$\{\pi_{\theta_1}, \pi_{\theta_2}, \ldots, \pi_{\theta_T}\}.$$

The performance of algorithm $\mathcal{A}$ is measured by the expected cumulative reward over *T* time steps of algorithm $\mathcal{A}$, which can be denoted as
$$S(\mathcal{A}) = \mathbb{E}\Biggl[\sum_{t=1}^{T} r_t(s_t, a_t) \,\Bigg|\, \mathcal{A}\Biggr]. \tag{2.1}$$
In the above definition, $\mathbb{E}[\,\cdot \mid \mathcal{A}]$ denotes the expectation over the joint state-action distribution given that algorithm $\mathcal{A}$ has been followed at each time step. The state-action distribution induced by $\mathcal{A}$ and the transition density at time step *t* can be expressed as
$$p_t(s, a) = d_t(s)\,\pi_{\theta_t}(a \mid s),$$
where the state distribution $d_t$ induced by $\mathcal{A}$ at time step *t* is defined as
$$d_t(s') = \iint d_{t-1}(s)\,\pi_{\theta_{t-1}}(a \mid s)\,p(s' \mid s, a)\,\mathrm{d}s\,\mathrm{d}a.$$
However, maximizing the objective defined in equation 2.1 is not possible, since we cannot observe all *T* reward functions during the process of an online decision-making problem. Here, we instead design algorithm $\mathcal{A}$ to minimize the regret against the baseline, which is the best parametric offline policy, defined by
$$\mathrm{Regret}(T) = \mathbb{E}\Biggl[\sum_{t=1}^{T} r_t(s_t, a_t) \,\Bigg|\, \pi_{\theta^*}\Biggr] - \mathbb{E}\Biggl[\sum_{t=1}^{T} r_t(s_t, a_t) \,\Bigg|\, \mathcal{A}\Biggr]. \tag{2.2}$$
In this definition of the regret, we suppose that there exists $\theta^* \in \Theta$ such that policy $\pi_{\theta^*}$ maximizes the expected cumulative rewards. The best offline parameter is given by
$$\theta^* = \mathop{\mathrm{argmax}}_{\theta \in \Theta}\, \mathbb{E}\Biggl[\sum_{t=1}^{T} r_t(s_t, a_t) \,\Bigg|\, \pi_\theta\Biggr],$$
where $\mathbb{E}[\,\cdot \mid \pi_\theta]$ denotes the expectation over the state-action distribution given that the policy $\pi_\theta$ has been followed at each time step.

We assume here that all candidate policies are parameterized by the parameter $\theta$, which differs from related works with finite states and actions (Even-Dar et al., 2004, 2009; Neu, György, Szepesvári, et al., 2010; Yu et al., 2009; Dick et al., 2014). For continuous problems, it is a common choice to use a parametric policy (e.g., the gaussian policy), which has been demonstrated to work well (Sutton & Barto, 1998; Peters & Schaal, 2006). For this reason, the best offline policy defined in equation 2.2 is a suitable baseline, since the best policy within the class of all Markovian policies is not a suitable baseline for continuous problems. If the regret is bounded by a sublinear function of *T*, the algorithm is asymptotically as powerful as the best offline policy.
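For illustration, a minimal gaussian policy with a linear-in-state mean (an assumed form for this sketch; the letter does not fix a particular model) and its score function, the quantity that policy gradient methods average over samples, can be written as:

```python
import numpy as np

# A sketch of a gaussian policy for continuous actions: pi_theta(a|s) is a
# normal density with state-dependent mean theta * s and fixed deviation sigma.
def score(theta, s, a, sigma):
    """d/dtheta log pi_theta(a|s), the score used in REINFORCE-style gradients."""
    return (a - theta * s) * s / sigma**2

rng = np.random.default_rng(1)
theta, sigma, s = 0.5, 1.0, 2.0
a = rng.normal(theta * s, sigma, size=100_000)   # a ~ pi_theta(.|s)
print(a.mean())                                   # close to theta * s = 1.0
print(score(theta, s, a, sigma).mean())           # close to 0 since E[score] = 0
```

The zero-mean property of the score is what makes reward-weighted averages of it unbiased gradient estimates.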

## 3 Online Policy Gradient Algorithm

In this section, we introduce an online policy gradient algorithm for solving the online MDP problem.

### 3.1 Algorithm

Unlike previous work (Even-Dar et al., 2004, 2009; Neu, György, Szepesvári, et al., 2010), we do not use an expert algorithm in our method because it is not suitable for handling continuous state and action problems. Instead, we consider a gradient-based algorithm that updates the parameter of the policy along the gradient direction of the expected average reward function at each time step *t*.

More specifically, the expected average reward function of policy $\pi_\theta$ under the reward function at time step *t* is defined as
$$\eta_t(\theta) = \iint d^{\pi_\theta}(s)\,\pi_\theta(a \mid s)\,r_t(s, a)\,\mathrm{d}s\,\mathrm{d}a, \tag{3.1}$$
where the expectation is taken over the stationary state-action distribution of policy $\pi_\theta$, with $d^{\pi_\theta}$ denoting the stationary state distribution.

Then our online policy gradient (OPG) algorithm is given as follows:

- Initialize policy parameter $\theta_1$.
- for $t = 1$ to $T$:
  - Observe current state $s_t$.
  - Take action $a_t$ according to current policy $\pi_{\theta_t}(a \mid s_t)$.
  - Observe reward *r _{t}* from the environment.
  - Move to next state $s_{t+1}$.
  - Update the parameter by projected gradient ascent,
    $$\theta_{t+1} = \Pi_{\Theta}\bigl(\theta_t + \varepsilon_t \nabla_\theta \eta_t(\theta_t)\bigr), \tag{3.2}$$
    where $\Pi_{\Theta}$ denotes the projection onto the parameter set $\Theta$ and $\varepsilon_t$ is the step size.

In practice, the gradient $\nabla_\theta \eta_t$ can be estimated from samples, for example, by the REINFORCE algorithm, which averages the policy score weighted by the average reward $R_m$ obtained by trajectory $m$ over $M$ sampled trajectories of length *L*. With theoretical guarantee, the REINFORCE algorithm has been shown to converge to the true gradient as $M$ and *L* tend to infinity. In the following analysis, we ignore the approximation error since it can be made arbitrarily small by collecting a large enough number of samples.

When the reward function does not change over time, the OPG algorithm reduces to the ordinary policy gradient algorithm (Williams, 1992), an efficient and natural algorithm for continuous state and action MDPs. The OPG algorithm can also be regarded as an extension of the online gradient descent algorithm (Zinkevich, 2003), with the difference that it maximizes the expected average reward $\eta_t(\theta)$, not the expected reward at time step *t*. As shown in the definition of $\eta_t(\theta)$, the stationary state distribution of policy $\pi_\theta$ is used, which is different from the time-dependent state distribution $d_t$ used in the expected reward. As we will prove in section 4, the regret bound of the OPG algorithm is $O(\sqrt{T})$ under a certain concavity assumption and $O(\log T)$ under a strong concavity assumption on the expected average reward function. Unlike previous work (Even-Dar et al., 2004, 2009; Yu et al., 2009; Neu, György, Szepesvári, et al., 2010), these bounds do not depend on the cardinality of the state and action spaces since a parameterized policy space is considered. Therefore, the OPG algorithm is suitable for handling continuous state and action online MDPs.
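A minimal sketch of the update is given below (a toy with a known concave $\eta_t$ and exact gradients; in the real algorithm the gradient would be estimated from trajectories as described above):

```python
import numpy as np

# Projected online gradient ascent on per-step expected average rewards.
# Toy assumption: eta_t(theta) = -(theta - c_t)^2 with a drifting target c_t.
rng = np.random.default_rng(0)
T = 5000
F = 2.0                                   # diameter bound: theta in [-F/2, F/2]
c = 0.3 + 0.1 * rng.standard_normal(T)    # reward functions change over time

theta = 0.0
thetas = np.empty(T)
for t in range(T):
    thetas[t] = theta
    grad = -2.0 * (theta - c[t])          # exact gradient of eta_t at theta
    theta = theta + grad / np.sqrt(t + 1) # step size O(1/sqrt(t))
    theta = np.clip(theta, -F / 2, F / 2) # projection onto the parameter set

def eta(th, ct):
    return -(th - ct) ** 2

best_theta = c.mean()                     # best fixed parameter in hindsight
regret = sum(eta(best_theta, c[t]) - eta(thetas[t], c[t]) for t in range(T))
print(f"regret: {regret:.2f}, regret/T: {regret / T:.4f}")
```

The per-step regret `regret / T` shrinks as `T` grows, consistent with a sublinear regret bound.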

### 3.2 Bandit Feedback

In the bandit-feedback scenario, only the value of the reward function at the visited state-action pair is observed. We therefore replace the true reward function $r_t$ in the OPG algorithm with a random reward function $\hat{r}_t$ constructed from the single observed value $r_t(s_t, a_t)$: the observed reward is importance-weighted by the probability (density) of visiting $(s_t, a_t)$, which can be calculated recursively from the policies and the transition density. The reward function constructed in this way is an unbiased estimator of $r_t$ for all state-action pairs (Yu et al., 2009), where the expectation is taken over the joint state-action distribution induced by the policies picked by algorithm $\mathcal{A}$ up to time step *t*. By this definition, the estimated expected average reward function $\hat{\eta}_t(\theta)$ is unbiased as well, and its gradient with respect to the parameter can be obtained by passing the derivative through the integral. We thus replace the gradient of the expected average reward function in equation 3.2 with its unbiased estimator $\nabla_\theta \hat{\eta}_t(\theta_t)$.

As will be proved in section 5, the regret bound of the OPG method with bandit feedback is still $O(\sqrt{T})$, although the bound is looser than that in the full-feedback case. If it is not possible to calculate the state distribution directly, an estimate obtained by reinforcement learning may be employed in practice (Ng, Parr, & Koller, 1999).
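A discrete toy sketch of the importance-weighting idea behind the bandit estimator (the continuous construction is analogous; all distributions here are illustrative assumptions):

```python
import numpy as np

# The single observed reward is divided by the probability of visiting that
# state-action pair, giving an estimator whose expectation equals the true
# reward function at every (s, a).
rng = np.random.default_rng(0)
n_states, n_actions, n = 3, 4, 100_000
d = np.full(n_states, 1.0 / n_states)                 # state distribution
pi = np.full((n_states, n_actions), 1.0 / n_actions)  # policy pi(a|s)
r = rng.uniform(size=(n_states, n_actions))           # true (unobserved) rewards

s = rng.choice(n_states, size=n, p=d)                 # visited states
a = rng.integers(0, n_actions, size=n)                # actions (pi is uniform)

r_hat = np.zeros((n_states, n_actions))
np.add.at(r_hat, (s, a), r[s, a] / (d[s] * pi[s, a])) # importance weighting
r_hat /= n                                            # Monte Carlo average

print(np.max(np.abs(r_hat - r)))                      # small: unbiased estimate
```

The price of unbiasedness is variance: the weights `1 / (d[s] * pi[s, a])` grow large when a pair is rarely visited, which is why the bandit bound is looser.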

## 4 Regret Analysis with Full Feedback

In this section, we provide a regret bound for the OPG algorithm in the full-feedback case.

### 4.1 Assumptions

First, we introduce the assumptions required in the proofs. Some assumptions have already been used in related works for discrete state and action MDPs, and we extend them to continuous state and action MDPs.

For example, commonly used stochastic policy models such as the gaussian policy satisfy assumption 2 with appropriate constants. Note that we do not specify any policy model in the analysis, and therefore the following theoretical analysis is valid for other stochastic policy models as long as the assumptions are satisfied.

Assumption 4 means that the expected average reward function is concave, which is currently our sufficient condition to guarantee the $O(\sqrt{T})$-regret bound for the OPG algorithm. This assumption can be relaxed to locally concave expected average reward functions, in which case all the results still hold locally. More specifically, the standard policy gradient algorithm (Sutton & Barto, 1998; Peters & Schaal, 2006) has been shown to converge to a locally optimal solution, and we then use a locally optimal policy as the baseline in the definition of the regret instead of the globally optimal solution.

### 4.2 Regret Bound with Concavity

We have the following theorem, which states that the regret of the OPG algorithm is $O(\sqrt{T})$ under the concavity assumption.

Note that the constant $C_1$ depends on the specific policy model involved, as stated in assumption 2.

To prove theorem 5, we decompose the regret in the same way as previous work (Even-Dar et al., 2004, 2009; Neu, György, & Szepesvári, 2010; Neu, György, Szepesvári, et al., 2010). In the OPG method, the expected average reward $\eta_t$ is used for optimization, and the sum of the expected average reward functions is calculated based on the stationary state distribution of the policy parameterized by $\theta_t$. However, the sum of the expected rewards is calculated based on $d_t$, the state distribution at time step *t* following policy $\pi_{\theta_t}$. A similar argument applies to $\theta^*$ and $\pi_{\theta^*}$. These differences affect the first and third terms of the decomposed regret, equation 4.3.
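In our notation, the decomposition described above takes the standard three-term form (a sketch; the middle term is the online optimization error, while the outer terms account for the mismatch between stationary and time-dependent state distributions):

```latex
\sum_{t=1}^{T}\mathbb{E}\bigl[r_t(s_t,a_t)\,\big|\,\pi_{\theta^*}\bigr]
 - \sum_{t=1}^{T}\mathbb{E}\bigl[r_t(s_t,a_t)\,\big|\,\mathcal{A}\bigr]
 = \underbrace{\sum_{t=1}^{T}\Bigl(\mathbb{E}\bigl[r_t(s_t,a_t)\,\big|\,\pi_{\theta^*}\bigr]-\eta_t(\theta^*)\Bigr)}_{\text{(i)}}
 + \underbrace{\sum_{t=1}^{T}\bigl(\eta_t(\theta^*)-\eta_t(\theta_t)\bigr)}_{\text{(ii)}}
 + \underbrace{\sum_{t=1}^{T}\Bigl(\eta_t(\theta_t)-\mathbb{E}\bigl[r_t(s_t,a_t)\,\big|\,\mathcal{A}\bigr]\Bigr)}_{\text{(iii)}}
```

Terms (i) and (iii) are controlled by mixing arguments, and term (ii) by online convex optimization.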

Below, we bound each of the three terms in lemmas 6, 7, and 8, which are proved in appendixes A, B, and C, respectively.

The first term has already been analyzed for discrete state and action online MDPs in Even-Dar et al. (2004, 2009), Neu et al. (2014), and Dick et al. (2014), and we extend it to continuous state and action spaces in lemma 6.

Lemma 7 is obtained by using the result of Zinkevich (2003).

Lemma 8 is similar to lemma 5.2 in Even-Dar et al. (2009), but our bound does not depend on the cardinality of the state and action spaces.

Combining lemmas 6 to 8, we immediately obtain theorem 5.

### 4.3 Regret Analysis under Strong Concavity

Next we derive a sharper regret bound for the OPG algorithm under a strong concavity assumption.

In theorem 9, the constant depends on the specific policy model through $C_1$. We again consider the same decomposition as equation 4.3, and the first term of the regret bound is exactly the same as lemma 6.

The second term is bounded by the following proposition, given the strong concavity assumption, equation 4.4, and a step size decreasing as $1/t$:

The proof of proposition 10 is given in appendix D, which follows the same line as Hazan, Agarwal, and Kale (2007).

From the proof of lemma 8, the bound of the third term with the strong concavity assumption, equation 4.4, is given by proposition 11:

The result of proposition 11 is obtained by following the same line as the proof of lemma 8 with a different step size. Combining lemma 6 and propositions 10 and 11, we obtain theorem 9.

## 5 Regret Analysis with Bandit Feedback

In this section, we prove a regret bound for the OPG algorithm in the bandit-feedback case.

### 5.1 Regret Bound with Concavity in the Bandit Scenario

Then we have the following theorem:

Theorem 12 can be proved by extending the proof of theorem 5 as follows.

The same regret decomposition as equation 4.3 is still possible in the bandit-feedback setting. The first term can be bounded in the same way as in the full-information case; lemma 6 still holds. However, the bounds for the second and third terms, originally given in lemmas 7 and 8, should be modified as follows:

The bound of the second part is still $O(\sqrt{T})$, but it is looser than the bound in the full-information scenario, which is caused by the estimated gradient of the expected average reward function.

Proofs of lemmas 13 and 14 are given in appendix G. From these lemmas, we immediately obtain theorem 12.

## 6 Experiments

In this section, we illustrate the behavior of the OPG algorithm through experiments.

### 6.1 Target Tracking

In the target-tracking problem, the agent tries to keep track of a target whose position changes at each time step *t*, and the reward depends on the distance between the agent and the target. The mechanism for moving the target is set as the uniform distribution over a fixed interval.

The policy is parameterized by a scalar parameter $\theta$ restricted by projection to $[-1.99, -0.01]$.^{2} Then for all *t*, the expected average reward functions can be obtained in closed form, which implies that $\eta_t$ is concave with respect to the parameter $\theta$ and thus satisfies the concavity assumption for all *t*.^{3}

We compare the OPG algorithm with the MDP-E algorithm, where the continuous state and action spaces are discretized with a discretization constant *c*.

Figure 2 shows the average rewards and average regrets for full-information and bandit-feedback cases, which substantiate the theoretical results.^{4}

### 6.2 Linear-Quadratic Regulator

The linear-quadratic regulator (LQR) is a simple system, where the transition dynamics are linear and the reward function is quadratic. This system is instructive because we can compute the best offline parameter and the gradient directly (Peters & Schaal, 2006). Here, an online LQR system is simulated to illustrate the parameter update trajectory of the OPG algorithm.

We use the gaussian policy with a mean parameter and a standard deviation parameter in the full-information and bandit-feedback experiments. The best offline parameter can be computed analytically, and the initial parameter for the OPG algorithm is drawn uniformly at random.

Here, *P* is the positive-definite solution of the modified Riccati equation. The second-order derivative of the expected average reward function with respect to the policy parameter can be expressed in terms of *P*; because *P* is positive definite, this second-order derivative is negative. This means that the expected average reward function of the target LQR system is always concave with respect to the policy parameter.

In Figure 3a, a parameter update trajectory of OPG with full information in the online LQR problem is plotted by the solid line, and the best offline parameter is denoted by the dashed line. This shows that the OPG solution quickly approaches the best offline parameter.

Next, we also include the gaussian standard deviation $\sigma$ in the policy parameter. When $\sigma$ takes a value below a small positive threshold during gradient update iterations, we project it back to the threshold so that the policy remains valid. A parameter update trajectory is plotted in Figure 3b, showing again that the OPG solution smoothly approaches the best offline parameter.

In Figure 4a, the solid line shows the trajectory of the OPG algorithm with bandit feedback in the online LQR system simulation. The result validates that the OPG solution converges to the best offline parameter with a slightly slower speed compared with the full-information result.

The parameter trajectory is shown in Figure 4b, when the standard deviation is included in the parameter. The OPG solution still approaches the best offline mean parameter as we expect.
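A self-contained toy version of this experiment (our own simplified scalar LQR, not the paper's exact setup) can be sketched as follows; the expected average reward has a closed form here, so the OPG update can use a finite-difference approximation of its exact gradient:

```python
import numpy as np

# Toy online LQR: dynamics s' = s + a + noise, gaussian policy a = theta*s + sigma*xi,
# time-varying reward r_t(s, a) = -q_t * s^2 - a^2. All constants are illustrative.
sigma, noise_var, T = 0.1, 0.01, 3000
q = 1.0 + 0.5 * np.sin(np.arange(T) / 50.0)   # reward function changes over time

def eta(theta, qt):
    # Stationary variance of s under the closed loop s' = (1 + theta) s + noise;
    # requires |1 + theta| < 1, which the projection below enforces.
    V = (sigma**2 + noise_var) / (1.0 - (1.0 + theta) ** 2)
    return -qt * V - (theta**2 * V + sigma**2)

theta, lo, hi = -0.5, -1.9, -0.1              # parameter set keeps the loop stable
for t in range(T):
    h = 1e-5                                   # central finite-difference gradient
    grad = (eta(theta + h, q[t]) - eta(theta - h, q[t])) / (2 * h)
    theta = np.clip(theta + grad / np.sqrt(t + 1), lo, hi)

# Best fixed parameter in hindsight, found by grid search on sum_t eta(theta, q_t).
grid = np.linspace(lo, hi, 1001)
V = (sigma**2 + noise_var) / (1.0 - (1.0 + grid) ** 2)
totals = -(q.sum() * V) - T * (grid**2 * V + sigma**2)
best = grid[np.argmax(totals)]
print(f"OPG theta: {theta:.3f}, best offline theta: {best:.3f}")
```

As in Figure 3, the OPG iterate settles close to the best offline parameter despite the time-varying reward.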

## 7 Conclusion

In this letter, we proposed an online policy gradient method for continuous state and action online MDPs and showed that the regret of the proposed method is $O(\sqrt{T})$ under a certain concavity assumption on the expected average reward function. A notable fact is that the regret bound does not depend on the cardinality of the state and action spaces, which makes the proposed algorithm suitable for handling continuous states and actions. We further extended our method to the bandit-feedback scenario and showed that the regret of the extended method is still $O(\sqrt{T})$. Furthermore, we established an $O(\log T)$ regret bound under a strong concavity assumption in the full-information setup. Through experiments, we illustrated that directly handling continuous state and action spaces by the proposed method is more advantageous than discretizing them and applying an existing method.

Our future work will extend the current theoretical analysis to nonconcave expected average reward functions, where gradient-based algorithms suffer from the problem of local optima. A difficulty in this situation is that the regret bound with bandit feedback becomes trivial when the lower bounds of the policy and state distributions are too small. Thus, improving our current result in the bandit-feedback scenario is important future work. Another important challenge is to develop an effective method for estimating the stationary state distribution, which is required in our algorithm.

### Appendix A: Proof of Lemma 6

The following proposition holds, which can be obtained by recursively using assumption 1:

### Appendix B: Proof of Lemma 7

The following proposition is a continuous extension of lemma 6.3 in Even-Dar et al. (2009):

Then we have the following proposition, which is proved in appendix E:

From proposition 17, we have the following proposition:

From proposition 18, we have the following proposition:

From proposition 19, the result of online convex optimization (Zinkevich, 2003) is applicable to the current setup; applying it concludes the proof.

### Appendix C: Proof of Lemma 8

From propositions 44 and 20, we have the following proposition:

Then the following proposition holds, which is proved in appendix F following the same line as lemma 5.1 in Even-Dar et al. (2009):

Although the original bound given in Even-Dar et al. (2004, 2009) depends on the cardinality of the action space, that is not the case in the current setup.

### Appendix D: Proof of Proposition 10

Proposition 10 can be obtained following Hazan et al. (2007): by the Taylor approximation, the expected average reward function can be decomposed around the current parameter, with the second-order term evaluated at some point between $\theta_t$ and $\theta^*$. The last inequality comes from the strong concavity assumption, equation 4.4. Given the parameter updating rule, summing up all *T* terms of equation D.1 and setting the step size appropriately yields the result.
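For reference, the standard form of this argument is as follows (a sketch; $H$ denotes the strong concavity constant of equation 4.4 and $G$ a bound on the gradient norm):

```latex
\eta_t(\theta^*) - \eta_t(\theta_t)
  \;\le\; \nabla \eta_t(\theta_t)^{\top}(\theta^* - \theta_t)
          \;-\; \frac{H}{2}\,\lVert \theta^* - \theta_t \rVert^2 ,
```

and with the step size $\varepsilon_t = 1/(Ht)$, summing over $t$ and telescoping gives

```latex
\sum_{t=1}^{T}\bigl(\eta_t(\theta^*) - \eta_t(\theta_t)\bigr)
  \;\le\; \frac{G^2}{2H}\,(1 + \log T).
```

This is the source of the $O(\log T)$ rate in the strongly concave case.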

### Appendix E: Proof of Proposition 17

Assumptions 2 and 3 imply that equation E.1 can be bounded as follows. The second equality comes from the definition of the stationary state distribution, and the third inequality can be obtained from the triangle inequality. The last inequality follows from assumption 1 and proposition 16. Thus, we obtain the stated bound, which concludes the proof.

### Appendix F: Proof of Proposition 22

The claim follows from assumption 1. Recursively using equation F.1, we obtain the stated bound, which concludes the proof.

### Appendix G: Proofs of Lemmas 13 and 14

As shown in section 5, an unbiased estimator of the reward function is used for updating the parameter; the corresponding estimated gradient is also unbiased and can be bounded by the following lemma, which is proved in appendix H.

We obtain lemma 13 by using the result of lemma 23. Similarly, using lemma 23 in the proof of lemma 8, we obtain lemma 14.

### Appendix H: Proof of Lemma 23

From proposition 17, the bound for the gradient of the stationary distribution follows. Similarly, from assumption 2, the bound for the gradient of the policy follows. Combining these two bounds gives the result.

### Appendix I: Concavity Analysis for Target Tracking

## Acknowledgments

Y.M. was supported by the MEXT scholarship and the JST CREST program. T.Z. was supported by NSFC 61502339 and SRF for ROCS, SEM. K.H. was supported by MEXT KAKENHI 25330261 and 24106010. M.S. was supported by KAKENHI 23120004.

## References

## Notes

^{1}

Our OPG algorithm can also be seen as an extension of the online gradient descent algorithm (Zinkevich, 2003) to online MDP problems.

^{2}

Note that the parameter space is not closed in this experiment. When the parameter takes a value less than −1.99 or more than −0.01 during gradient update iterations, we project it back to −1.99 or −0.01, respectively.

^{3}

The analysis of concavity is presented in appendix I.

^{4}

The state and action spaces are bounded to $[-2, 2]$ in the bandit-feedback experiment.

^{5}

The reward function is not bounded, which violates assumption 3. However, it is interesting to illustrate that the parameter updated by the OPG algorithm still converges to the best offline parameter.