## Abstract

We consider the learning problem under an online Markov decision process (MDP), which aims to learn the time-dependent decision-making policy of an agent that minimizes the regret, the difference from the best fixed policy. The difficulty of online MDP learning is that the reward function changes over time. In this letter, we show that a simple online policy gradient algorithm achieves $O(\sqrt{T})$ regret for $T$ steps under a certain concavity assumption and $O(\log T)$ regret under a strong concavity assumption. To the best of our knowledge, this is the first work to present an online MDP algorithm that can handle continuous state, action, and parameter spaces with a regret guarantee. We also illustrate the behavior of the proposed online policy gradient method through experiments.

## 1  Introduction

The Markov decision process (MDP) is a popular framework of reinforcement learning for sequential decision making (Sutton & Barto, 1998), where an agent takes an action depending on the current state, moves to the next state, receives a reward based on the last transition, and this process is repeated T times. The goal is to find an optimal decision-making policy (i.e., a conditional probability density of action given state) that maximizes the expected sum of rewards over T steps.

In the standard MDP formulation, the reward function is fixed over iterations. However, this assumption is often violated in reality. In this letter, we consider an online MDP scenario where the reward function is allowed to change over time. Such an online MDP problem is an extension of both online decision making and reinforcement learning (Yu, Mannor, & Shimkin, 2009):

• In an online decision-making problem, the agent needs to make a decision at each time step without knowledge of the future environment (Kalai & Vempala, 2005). A certain cost function will be observed only after the decision is made at each time step, and the goal is to minimize the regret against the best single decision. There is no assumption on the dynamics in the online decision making problem, and thus the decision can switch from one to another abruptly.

• In reinforcement learning, the dynamics are assumed to be Markovian. The reward function and transition dynamics are fixed but unknown to the agent, and thus the estimated reward function and transition function will converge to the true ones if sufficient samples are observed. The goal is to find the optimal policy that maximizes the cumulative reward without full information about the environment.

The goal of the online MDP problem is to find the best time-dependent policy that minimizes the regret, the difference from the best fixed policy. We expect the regret to be sublinear in $T$, by which the average difference from the best fixed policy vanishes as $T$ goes to infinity.

The MDP expert algorithm (MDP-E), which chooses the current best action at each state, was shown to achieve $O(\sqrt{T\log|\mathcal{A}|})$ regret (Even-Dar, Kakade, & Mansour, 2004, 2009), where $|\mathcal{A}|$ denotes the cardinality of the action space. Although this bound does not explicitly depend on the cardinality of the state space, the algorithm itself needs an expert algorithm for each state, and thus a large state space may not be handled in practice. Another algorithm, called the lazy follow-the-perturbed-leader (lazy-FPL), divides the time steps into short periods, and policies are updated only at the end of each period using the average reward function (Yu, Mannor, & Shimkin, 2009). This lazy-FPL algorithm was shown to have $O(T^{3/4+\epsilon})$ regret for $\epsilon \in (0, 1/3)$. Another online MDP algorithm, called online relative entropy policy search, was considered in Zimin and Neu (2013) and shown to have $O(\sqrt{T})$ regret for state spaces with an $L$-layered structure. However, the regret bounds of these algorithms explicitly depend on $|\mathcal{S}|$ and $|\mathcal{A}|$, and the algorithms cannot be directly implemented for problems with continuous state and action spaces. The online algorithm for Markov decision processes of Abbasi-Yadkori, Bartlett, Kanade, Seldin, and Szepesvari (2013) was shown to have $O(\sqrt{T\log|\Pi|})$ regret under changing transition probability distributions, where $|\Pi|$ is the cardinality of the policy set. Although sublinear bounds still hold for continuous policy spaces, the algorithm cannot directly handle infinitely many policy candidates. The online MDP problem was formulated as an online linear optimization problem in Dick, György, and Szepesvári (2014); by introducing stationary occupation measures, mirror descent with approximate projections was shown to have $O(\sqrt{T})$ regret. However, the algorithm assumes that both the state and action spaces are finite. Yu et al. (2009), Abbasi-Yadkori et al. (2013), and Neu, György, and Szepesvári (2012) considered even more challenging online MDP problems under unknown or changing transition dynamics.

In practice, full information of the reward function may be hard to acquire; often only the value of the reward function at the current state and action is available. Such a setup, called the bandit feedback scenario, has attracted a great deal of attention recently. An extension of the lazy-FPL method to the bandit feedback scenario, called the exploratory-FPL algorithm (Yu et al., 2009), was shown to have $O(T^{3/4+\epsilon})$ regret. Neu, György, Szepesvári, and Antos (2010) proposed a method based on MDP-E that uses an unbiased estimator of the reward function and showed that its regret is $O(T^{2/3})$. Neu, György, Szepesvári, and Antos (2014) further improved the regret bound to $O(\sqrt{T})$. However, this algorithm cannot be used in continuous state and action problems.

In this letter, we propose a simple online policy gradient (OPG) algorithm that can be implemented in a straightforward manner for problems with continuous state and action spaces, which can be seen as an extension of Dick et al. (2014). Under the assumption that the expected average reward function is concave, we prove that the regret of our OPG algorithm with respect to a compact and convex set of parametric policies is $O(\sqrt{T})$, which is independent of the cardinality of the state and action spaces but depends on the diameter $F$ and the dimension $N$ of the parameter space. Furthermore, $O(\log T)$ regret is also proved under a strong concavity assumption on the expected average reward function. We also extend the proposed algorithm to the bandit feedback scenario and theoretically prove that its regret bound is still $O(\sqrt{T})$ under the concavity assumption. We numerically illustrate the superior behavior of the proposed OPG algorithm in continuous problems over MDP-E with different discretization schemes.

The remainder of this letter is organized as follows. In section 2, we give a formal definition of the online MDP problem. Our proposed algorithm is given in section 3, and regret analyses in the full-information and bandit-feedback scenarios are given in sections 4 and 5, with proofs presented in the appendix. Section 6 reports experimental results, and section 7 concludes.

## 2  Online Markov Decision Process

In this section, we formulate the problem of online MDP learning. An online MDP is specified by:

• State space $\mathcal{S}$, which could be either continuous or discrete.

• Action space $\mathcal{A}$, which contains all possible actions $a$. $\mathcal{A}$ could be either continuous or discrete.

• Transition density $p(s'|s,a)$, which represents the conditional probability density of the next state $s'$ given the current state $s$ and the action $a$ to be taken. We assume that the transition density is fully available to the agent.

• Reward function sequence $\{r_t\}_{t=1}^{T}$, a sequence of real-valued functions fixed in advance that does not change no matter what action is taken.

An online MDP algorithm $\mathcal{A}$ produces a stochastic time-dependent policy, a conditional probability density of the action to be taken given the current state at each time step. In this letter, we suppose that the online MDP algorithm outputs the parameter $\theta_t$ of a stochastic policy $\pi(a|s;\theta_t)$ at each time step $t$, where $\theta_t$ belongs to a convex and compact parameter set $\Theta \subset \mathbb{R}^N$. Thus, algorithm $\mathcal{A}$ gives a sequence of policies:
$$\pi(a|s;\theta_1),\ \pi(a|s;\theta_2),\ \ldots,\ \pi(a|s;\theta_T).$$
Ideally, the objective is to maximize the expected cumulative reward over $T$ time steps of algorithm $\mathcal{A}$, which can be denoted as
$$\mathbb{E}_{\mathcal{A}}\left[\sum_{t=1}^{T} r_t(s_t, a_t)\right]. \tag{2.1}$$
In the above definition, $\mathbb{E}_{\mathcal{A}}$ denotes the expectation over the joint state-action distribution given that algorithm $\mathcal{A}$ has been followed at each time step. The state-action distribution induced by $\mathcal{A}$ and the transition density at time step $t$ can be expressed as
$$p_t(s, a) = d_t(s)\,\pi(a|s;\theta_t),$$
where the state distribution $d_t$ induced by $\mathcal{A}$ at time step $t$ is defined as
$$d_{t+1}(s') = \iint d_t(s)\,\pi(a|s;\theta_t)\,p(s'|s,a)\,\mathrm{d}s\,\mathrm{d}a.$$
However, maximizing the objective defined in equation 2.1 is not possible, since we cannot observe all $T$ reward functions during the process of an online decision-making problem. Instead, we design algorithm $\mathcal{A}$ to minimize the regret against the baseline, the best parametric offline policy. We suppose that there exists $\theta^* \in \Theta$ such that the policy $\pi(a|s;\theta^*)$ maximizes the expected cumulative reward. The best offline parameter is given by
$$\theta^* = \operatorname*{argmax}_{\theta \in \Theta}\, \mathbb{E}_{\theta}\left[\sum_{t=1}^{T} r_t(s_t, a_t)\right], \tag{2.2}$$
where $\mathbb{E}_{\theta}$ denotes the expectation over the state-action distribution given that the policy $\pi(a|s;\theta)$ has been followed at each time step. The regret of algorithm $\mathcal{A}$ is then
$$\mathrm{Regret}(T) = \mathbb{E}_{\theta^*}\left[\sum_{t=1}^{T} r_t(s_t, a_t)\right] - \mathbb{E}_{\mathcal{A}}\left[\sum_{t=1}^{T} r_t(s_t, a_t)\right].$$

We assume here that all candidate policies are parameterized by the parameter $\theta$, which is different from related works with finite states and actions (Even-Dar et al., 2004, 2009; Neu, György, Szepesvári, et al., 2010; Yu et al., 2009; Dick et al., 2014). For continuous problems, it is a common choice to use a parametric policy (e.g., the gaussian policy), which has been demonstrated to work well (Sutton & Barto, 1998; Peters & Schaal, 2006). For this reason, the best offline policy defined in equation 2.2 is a suitable baseline, given that the best policy within the class of all Markovian policies is not a practical baseline for continuous problems. If the regret is bounded by a sublinear function of $T$, the algorithm is asymptotically as powerful as the best offline policy.

## 3  Online Policy Gradient Algorithm

In this section, we introduce an online policy gradient algorithm for solving the online MDP problem.

### 3.1  Algorithm

Unlike previous work (Even-Dar et al., 2004, 2009; Neu, György, Szepesvári, et al., 2010), we do not use the expert algorithm in our method because it is not suitable for handling continuous state and action problems. Instead, we consider a gradient-based algorithm that updates the parameter of policy along the gradient direction of the expected average reward function at each time step t.

More specifically, we assume that all the MDPs are ergodic, where state transitions are induced by the transition density $p(s'|s,a)$ and the parametric policy $\pi(a|s;\theta)$. Then every policy parameter $\theta$ has a unique stationary state distribution $d^{\theta}(s)$.
Note that the stationary state distribution satisfies
$$d^{\theta}(s') = \iint d^{\theta}(s)\,\pi(a|s;\theta)\,p(s'|s,a)\,\mathrm{d}s\,\mathrm{d}a.$$
Let $\eta_t(\theta)$ be the expected average reward function of policy $\pi(a|s;\theta)$ at time step $t$:
$$\eta_t(\theta) = \iint d^{\theta}(s)\,\pi(a|s;\theta)\,r_t(s,a)\,\mathrm{d}s\,\mathrm{d}a, \tag{3.1}$$
where the expectation is taken over the stationary state-action distribution of policy $\pi(a|s;\theta)$.

Then our online policy gradient (OPG) algorithm is given as follows:

• Initialize policy parameter $\theta_1$.

• for $t = 1$ to $T$

1. Observe current state $s_t$.

2. Take action $a_t$ according to current policy $\pi(a|s_t;\theta_t)$.

3. Observe the reward function $r_t$ from the environment.

4. Move to next state $s_{t+1}$.

5. Update the policy parameter as
$$\theta_{t+1} = P_{\Theta}\big(\theta_t + \alpha_t \nabla_{\theta}\eta_t(\theta_t)\big), \tag{3.2}$$
where $P_{\Theta}(\theta') = \operatorname*{argmin}_{\theta \in \Theta}\|\theta - \theta'\|$ is the projection onto the parameter space $\Theta$, $\|\cdot\|$ denotes the Euclidean norm, $\alpha_t$ is the step size, and $\nabla_{\theta}\eta_t(\theta_t)$ is the gradient of $\eta_t$:
$$\nabla_{\theta}\eta_t(\theta) = \iint d^{\theta}(s)\,\pi(a|s;\theta)\,\big(\nabla_{\theta}\log d^{\theta}(s) + \nabla_{\theta}\log\pi(a|s;\theta)\big)\,r_t(s,a)\,\mathrm{d}s\,\mathrm{d}a. \tag{3.3}$$

In equation 3.3, the facts $\nabla_{\theta} d^{\theta}(s) = d^{\theta}(s)\nabla_{\theta}\log d^{\theta}(s)$ and $\nabla_{\theta}\pi(a|s;\theta) = \pi(a|s;\theta)\nabla_{\theta}\log\pi(a|s;\theta)$ are used. Here we assume that $d^{\theta}(s)$ and $\pi(a|s;\theta)$ are differentiable with respect to the policy parameter $\theta$. If obtaining the exact stationary state distribution is time-consuming, gradients estimated by a reinforcement learning algorithm may be used instead in practice. Since the transition and reward functions are known to the agent, it is straightforward to estimate the gradient efficiently by using a reinforcement learning technique (e.g., REINFORCE or policy gradients with parameter-based exploration; Sutton & Barto, 1998; Williams, 1992; Sehnke et al., 2010). Furthermore, some reinforcement learning techniques provide a convergence guarantee for the gradient estimate. In particular, in the REINFORCE algorithm, the gradient is approximated by the empirical average over collected trajectories,
where a rollout sample is a trajectory of states and actions, the set of collected trajectories has length-$L$ members, and the weight of each trajectory is the average reward it obtains. The REINFORCE gradient estimate has been shown to converge to the true gradient as the number of trajectories and $L$ tend to infinity. In the following analysis, we ignore the approximation error, since it can be made arbitrarily small by collecting a large enough number of samples.
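A minimal sketch of such a score-function (REINFORCE-style) gradient estimate, for an assumed one-dimensional gaussian policy and a synthetic reward chosen so that the true gradient is known in closed form (the function names, sample size, and constants below are illustrative, not from the letter):

```python
import numpy as np

def reinforce_gradient(theta, reward, sigma=1.0, n=200_000, seed=0):
    # Score-function estimate of d/dtheta E_{a ~ N(theta, sigma^2)}[reward(a)]:
    # average of (reward(a) - baseline) * grad log-density.
    rng = np.random.default_rng(seed)
    a = theta + sigma * rng.standard_normal(n)
    r = reward(a)
    score = (a - theta) / sigma**2           # d/dtheta log N(a; theta, sigma^2)
    return np.mean((r - r.mean()) * score)   # mean baseline reduces variance

# Synthetic check: for reward(a) = -(a - 2)^2 the true gradient is -2*(theta - 2),
# so at theta = 0 the estimate should be close to 4.
reward = lambda a: -(a - 2.0) ** 2
g = reinforce_gradient(0.0, reward)
```

Subtracting the empirical mean reward as a baseline leaves the estimator essentially unbiased while reducing its variance substantially.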

When the reward function does not change over time, the OPG algorithm reduces to the ordinary policy gradient algorithm (Williams, 1992), an efficient and natural algorithm for continuous state and action MDPs. The OPG algorithm can also be regarded as an extension of the online gradient descent algorithm (Zinkevich, 2003): at each step it ascends the gradient of the expected average reward function $\eta_t(\theta)$, which is defined through the stationary state distribution of the current policy rather than the actual state distribution at time step $t$. As we will prove in section 4, the regret bound of the OPG algorithm is $O(\sqrt{T})$ under a certain concavity assumption and $O(\log T)$ under a strong concavity assumption on the expected average reward function. Unlike in previous work (Even-Dar et al., 2004, 2009; Yu et al., 2009; Neu, György, Szepesvári, et al., 2010), these bounds do not depend on the cardinality of the state and action spaces, since a parameterized policy space is considered. Therefore, the OPG algorithm is suitable for handling continuous state and action online MDPs.
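To make this connection concrete, the update rule can be sketched as projected online gradient ascent. The code below is a minimal sketch under assumed ingredients: a Euclidean ball standing in for the convex, compact parameter set, a $1/\sqrt{t}$ step size, and synthetic concave per-step objectives standing in for $\eta_t$; none of these specifics are prescribed by the letter.

```python
import numpy as np

def project(theta, radius=1.0):
    # Euclidean projection onto the ball {theta : ||theta|| <= radius},
    # a simple instance of projection onto a convex, compact set.
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def opg(grad_fns, theta0, radius=1.0):
    # Projected online gradient ascent: theta_{t+1} = P(theta_t + (1/sqrt(t)) * grad_t).
    theta = np.asarray(theta0, dtype=float)
    for t, grad in enumerate(grad_fns, start=1):
        theta = project(theta + grad(theta) / np.sqrt(t), radius)
    return theta

# Example: concave objectives eta_t(theta) = -||theta - c_t||^2 with slowly
# varying targets c_t; the iterate should track the average target.
rng = np.random.default_rng(0)
targets = [np.array([0.5, 0.2]) + 0.01 * rng.standard_normal(2) for _ in range(2000)]
grads = [lambda th, c=c: -2.0 * (th - c) for c in targets]
theta_T = opg(grads, np.zeros(2))
```

The decaying step size trades off responsiveness to new reward functions against stability of the iterate, matching the role the step size plays in the analysis of section 4.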

### 3.2  Bandit Feedback

Here we extend the OPG algorithm to the bandit feedback scenario, where the entire reward function is not available; only the value of the reward function at the current state and action, $r_t(s_t, a_t)$, is observed.
Due to the lack of the entire reward function, we replace the reward function $r_t$ in the OPG algorithm with a random reward function $\hat{r}_t$ given by
3.4
where the state distribution induced by the algorithm can be calculated recursively using the following equation:
Note that the above reward function is an unbiased estimator of $r_t(s,a)$ for all $(s,a)$ (Yu et al., 2009):
In the previous equation, the expectation is taken over the joint state-action distribution induced by the policies chosen by algorithm $\mathcal{A}$ up to time step $t$. By the definition of $\hat{r}_t$, the estimated expected average reward function $\hat{\eta}_t(\theta)$ satisfies
where
The gradient of with respect to the parameter can be obtained by passing the derivative through the integral as
As the previous equation shows, we replace the gradient of the expected average reward function in equation 3.2 with its unbiased estimator $\nabla_{\theta}\hat{\eta}_t(\theta)$.

As will be proved in section 5, the regret bound of the OPG method with bandit feedback is still $O(\sqrt{T})$, although the bound is looser than that in the full-feedback case. If it is not possible to calculate the state distribution directly, an estimate obtained by reinforcement learning may be employed in practice (Ng, Parr, & Koller, 1999).
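The importance-weighting idea behind the unbiased reward estimator can be checked numerically. The sketch below uses a small discrete state-action space for convenience (the letter works with continuous spaces): the estimator is nonzero only at the visited pair and is rescaled by the inverse probability of visiting it, so its expectation recovers the full reward function. All sizes and distributions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n_s, n_a = 3, 4
q = rng.dirichlet(np.ones(n_s))         # state distribution q_t(s)
pi = rng.dirichlet(np.ones(n_a), n_s)   # policy pi(a|s), one row per state
r = rng.uniform(size=(n_s, n_a))        # reward function r_t, unknown to the learner

def estimate(s_t, a_t):
    # Importance-weighted reward estimate: nonzero only at the visited
    # pair (s_t, a_t), scaled by 1 / P(visit that pair).
    r_hat = np.zeros((n_s, n_a))
    r_hat[s_t, a_t] = r[s_t, a_t] / (q[s_t] * pi[s_t, a_t])
    return r_hat

# Taking the exact expectation over (s_t, a_t) ~ q(s) * pi(a|s)
# recovers the full reward function everywhere.
expected = sum(q[s] * pi[s, a] * estimate(s, a)
               for s in range(n_s) for a in range(n_a))
```

Unbiasedness holds pointwise: each pair $(s,a)$ contributes $q(s)\pi(a|s) \cdot r(s,a)/(q(s)\pi(a|s)) = r(s,a)$.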

## 4  Regret Analysis with Full Feedback

In this section, we provide a regret bound for the OPG algorithm in the full-feedback case.

### 4.1  Assumptions

First, we introduce the assumptions required in the proofs. Some assumptions have already been used in related works for discrete state and action MDPs, and we extend them to continuous state and action MDPs.

Assumption 1.
There exists a positive number $\tau$ such that, for two arbitrary distributions $d$ and $d'$ over $\mathcal{S}$ and for every policy parameter $\theta$,
$$\big\| d P^{\theta} - d' P^{\theta} \big\|_1 \le e^{-1/\tau} \big\| d - d' \big\|_1,$$
where $P^{\theta}$ denotes the state transition kernel induced by policy $\pi(a|s;\theta)$ and the transition density $p(s'|s,a)$. The constant $\tau$ is called the mixing time (Even-Dar et al., 2004, 2009).
Assumption 2.
There exists a positive constant $C_1$, depending on the specific policy model, such that for two arbitrary policy parameters $\theta$ and $\theta'$ and for every $s \in \mathcal{S}$,
$$\big\| \pi(\cdot|s;\theta) - \pi(\cdot|s;\theta') \big\|_1 \le C_1 \big\| \theta - \theta' \big\|,$$
where $\|\cdot\|_1$ denotes the $L_1$ norm.
The gaussian policy is a common choice in continuous state and action MDPs. Below, we consider the gaussian policy with mean $\theta^{\top}\phi(s)$ and standard deviation $\sigma$, where $\theta$ is the policy parameter and $\phi(s)$ is the basis function. The KL divergence between two such policies is given by
$$\mathrm{KL}\big(\pi(\cdot|s;\theta)\,\|\,\pi(\cdot|s;\theta')\big) = \frac{\big((\theta - \theta')^{\top}\phi(s)\big)^2}{2\sigma^2}.$$
By Pinsker's inequality, the following inequality holds:
$$\big\| \pi(\cdot|s;\theta) - \pi(\cdot|s;\theta') \big\|_1 \le \sqrt{2\,\mathrm{KL}\big(\pi(\cdot|s;\theta)\,\|\,\pi(\cdot|s;\theta')\big)} = \frac{\big|(\theta - \theta')^{\top}\phi(s)\big|}{\sigma}. \tag{4.1}$$
This implies that the gaussian policy model satisfies assumption 2 with $C_1 = B_{\phi}/\sigma$, where $B_{\phi} = \sup_{s}\|\phi(s)\|$. Note that we do not specify any policy model in the analysis, and therefore the following theoretical analysis is valid for other stochastic policy models as long as the assumptions are satisfied.
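The Pinsker-based bound for gaussian policies can be verified numerically. The sketch below compares the $L_1$ distance between two equal-variance gaussians against the $\sqrt{2\,\mathrm{KL}}$ bound; the integration grid and mean shifts are illustrative choices.

```python
import numpy as np
from math import sqrt, pi as PI

def l1_gauss(m1, m2, s):
    # L1 distance between N(m1, s^2) and N(m2, s^2) by dense numerical integration.
    x = np.linspace(min(m1, m2) - 8 * s, max(m1, m2) + 8 * s, 200_001)
    p = np.exp(-(x - m1) ** 2 / (2 * s * s)) / (s * sqrt(2 * PI))
    q = np.exp(-(x - m2) ** 2 / (2 * s * s)) / (s * sqrt(2 * PI))
    return float(np.sum(np.abs(p - q)) * (x[1] - x[0]))

def kl_gauss(m1, m2, s):
    # KL(N(m1, s^2) || N(m2, s^2)) for equal variances.
    return (m1 - m2) ** 2 / (2 * s * s)

# Pinsker: ||p - q||_1 <= sqrt(2 KL(p || q)); check for several mean shifts.
checks = [(l1_gauss(0.0, dm, 1.0), sqrt(2 * kl_gauss(0.0, dm, 1.0)))
          for dm in (0.1, 0.5, 1.0)]
```

For a mean shift equal to $\sigma$, the bound evaluates to exactly 1, while the actual $L_1$ distance is smaller, illustrating that the inequality is not tight.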
Assumption 3.
All the reward functions in online MDPs are bounded. For simplicity, we assume that the reward functions satisfy $0 \le r_t(s,a) \le 1$ for all $s$, $a$, and $t$.
Assumption 4.
For all $t$, the second derivative of the expected average reward function satisfies
$$\nabla^2_{\theta}\,\eta_t(\theta) \preceq 0 \quad \text{for all } \theta \in \Theta, \tag{4.2}$$
where $\Theta \subset \mathbb{R}^N$ is the parameter set, which is convex and compact.

Assumption 4 means that the expected average reward function is concave, which is currently our sufficient condition to guarantee the $O(\sqrt{T})$ regret bound for the OPG algorithm. This assumption can be relaxed to locally concave expected average reward functions, in which case all the results still hold locally. More specifically, the standard policy gradient algorithm (Sutton & Barto, 1998; Peters & Schaal, 2006) has been shown to converge to a locally optimal solution, and we can then use a locally optimal policy as the baseline in the definition of the regret instead of the globally optimal solution.

### 4.2  Regret Bound with Concavity

We have the following theorem.

Theorem 1.
The regret against the best offline policy of the OPG algorithm is bounded as
$$\mathrm{Regret}(T) = O\big(\sqrt{T}\big),$$
where the hidden constant depends on the diameter $F$ of the parameter space $\Theta$, its dimension $N$, the mixing time $\tau$, and the policy-model constant $C_1$.

Note that the constant $C_1$ depends on the specific policy model, as stated in assumption 2.

To prove theorem 1, we decompose the regret in the same way as previous work (Even-Dar et al., 2004, 2009; Neu, György, & Szepesvári, 2010; Neu, György, Szepesvári, et al., 2010):
4.3
In the OPG method, the expected average reward function $\eta_t(\theta)$ is used for optimization, and the sum of the expected average reward functions is calculated based on the stationary state distribution of the policy parameterized by $\theta_t$. However, the sum of the expected rewards is calculated using $d_t$, the state distribution actually induced at time step $t$. A similar discrepancy arises for the best offline parameter $\theta^*$. These differences affect the first and third terms of the decomposed regret, equation 4.3.
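For readability, the three-way decomposition referred to as equation 4.3 can be written out as follows; this is a reconstruction from the definitions of $\eta_t$, $\theta^*$, and the expectations in section 2, with notation following those definitions:

$$
\mathrm{Regret}(T)
= \underbrace{\mathbb{E}_{\theta^*}\Big[\sum_{t=1}^{T} r_t(s_t,a_t)\Big] - \sum_{t=1}^{T}\eta_t(\theta^*)}_{\text{first term}}
\;+\; \underbrace{\sum_{t=1}^{T}\eta_t(\theta^*) - \sum_{t=1}^{T}\eta_t(\theta_t)}_{\text{second term}}
\;+\; \underbrace{\sum_{t=1}^{T}\eta_t(\theta_t) - \mathbb{E}_{\mathcal{A}}\Big[\sum_{t=1}^{T} r_t(s_t,a_t)\Big]}_{\text{third term}}.
$$

The first and third terms compare actual state distributions with stationary ones, while the middle term is an online convex optimization regret.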

Below, we bound each of the three terms in lemmas 1, 2, and 3, which are proved in appendixes A, B, and C, respectively.

Lemma 1.
The difference between the expected cumulative reward and the sum of the expected average reward functions of the best offline policy parameter $\theta^*$ satisfies

The first term has already been analyzed for discrete state and action online MDPs in Even-Dar et al. (2004, 2009), Neu et al. (2014), and Dick et al. (2014); we extend it to continuous state and action spaces in lemma 1.

Lemma 2.
The expected average reward function satisfies

Lemma 2 is obtained by using the result of Zinkevich (2003).

Lemma 3.
The difference between the expected cumulative reward and the sum of the expected average reward functions of the parameters $\theta_t$ given by the OPG algorithm satisfies

Lemma 3 is similar to lemma 5.2 in Even-Dar et al. (2009), but our bound does not depend on the cardinality of the state and action spaces.

Combining lemmas 1 to 3, we immediately obtain theorem 1.

### 4.3  Regret Analysis under Strong Concavity

Next we derive a sharper regret bound for the OPG algorithm under a strong concavity assumption.

Theorem 1 gives the theoretical guarantee of the OPG algorithm under the concavity assumption. If the expected average reward function is strongly concave, that is,
$$\nabla^2_{\theta}\,\eta_t(\theta) \preceq -H I_N \quad \text{for all } \theta \in \Theta, \tag{4.4}$$
where $H$ is a positive constant and $I_N$ is the $N \times N$ identity matrix, we have the following theorem:
Theorem 2.
The regret against the best offline policy of the OPG algorithm is bounded as
$$\mathrm{Regret}(T) = O(\log T),$$
with step size $\alpha_t = 1/(Ht)$.

In theorem 2, the hidden constant depends on the specific policy model through $C_1$. We again consider the same decomposition as equation 4.3, and the first term of the regret bound is exactly the same as in lemma 1.

The second term is bounded by the following proposition, given the strong concavity assumption, equation 4.4, and step size $\alpha_t = 1/(Ht)$:

Proposition 1.

The proof of proposition 1 is given in appendix D; it follows the same line as Hazan, Agarwal, and Kale (2007).

From the proof of lemma 3, the bound on the third term under the strong concavity assumption, equation 4.4, is given by proposition 2:

Proposition 2.
4.5

The result of proposition 2 is obtained by following the same line as the proof of lemma 3 with a different step size. Combining lemma 1 and propositions 1 and 2, we obtain theorem 2.

## 5  Regret Analysis with Bandit Feedback

In this section, we prove a regret bound for the OPG algorithm in the bandit-feedback case.

### 5.1  Regret Bound with Concavity in the Bandit Scenario

Suppose that there exist positive constants such that the policy and the state distribution are bounded from below:
Note that these assumptions require the state and action spaces to be compact, in which case the gaussian policy cannot be used directly.

Then we have the following theorem:

Theorem 3.
The regret of the OPG algorithm with bandit feedback is
$$\mathrm{Regret}(T) = O\big(\sqrt{T}\big),$$
where the hidden constant depends on the lower bounds of the policy and the state distribution and on $C_1$, which depends on the specific policy model as in assumption 2.

Theorem 3 can be proved by extending the proof of theorem 1 as follows.

The same regret decomposition as equation 4.3 is still possible in the bandit-feedback setting. The first term can be bounded in the same way as in the full-information case; lemma 1 still holds. However, the bounds for the second and third terms, originally given in lemmas 2 and 3, should be modified as follows:

Lemma 4.
The expected average reward function given by the online policy gradient algorithm with bandit feedback satisfies

The bound on the second term is still $O(\sqrt{T})$, but it is looser than the bound in the full-information scenario, which is caused by the estimated gradient of the expected average reward function.

Lemma 5.
The third term of the regret of the online policy gradient algorithm with bandit feedback is bounded as

Proofs of lemmas 4 and 5 are given in appendix G. From these lemmas, we immediately obtain theorem 3.

## 6  Experiments

In this section, we illustrate the behavior of the OPG algorithm through experiments.

### 6.1  Target Tracking

The task is to let an agent track an abruptly moving target located in one-dimensional real space $\mathcal{S} = \mathbb{R}$. The action space is also one-dimensional real space $\mathcal{A} = \mathbb{R}$, and the position of the agent changes as $s_{t+1} = s_t + a_t$. The reward function evaluates the distance between the agent and the target as
6.1
where the offset denotes the position of the target at time step $t$. The target moves according to the uniform distribution over a fixed interval.
We use the gaussian policy with a mean parameter and a standard deviation parameter in this experiment. From the standard argument (Peters & Schaal, 2006), the stationary state distribution is a gaussian distribution whose mean and standard deviation are determined by the policy parameters. Then, for all $t$, the expected average reward functions are given in closed form,
which implies that the expected average reward function is concave with respect to the parameter and thus satisfies assumption 4 for all $t$ (see appendix I).
As a baseline method for comparison, we consider the MDP-E algorithm (Even-Dar et al., 2004, 2009), with the exponentially weighted average algorithm used as the expert. Since MDP-E can handle only discrete states and actions, we discretize the state and action spaces. More specifically, the state space is discretized as
and the action space is discretized as
We consider the following five setups for c:
In the experiment, the state distribution and the gradient are estimated by the REINFORCE policy gradient estimator introduced in Peters and Schaal (2006). Independent experiments are run over the given number of time steps, and the average return is used for evaluating the performance:
The results are plotted in Figure 1, showing that the OPG algorithm outperforms the MDP-E algorithm even with the best discretization resolution. This illustrates the advantage of directly handling continuous state and action spaces without discretization. The MDP-E algorithm performs poorly when the discretization is too fine: the regret of MDP-E increases as the cardinalities of the state and action spaces increase. On the other hand, the performance of the MDP-E algorithm is limited when the discretization is too coarse. Moreover, it is difficult to design the best discretization without knowledge of the target movement.
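A minimal simulation in the spirit of this experiment can be sketched as follows. All constants, the assumed stationary-state sampling, the squared-distance reward, and the clipping interval are illustrative choices, not the paper's exact setup; the gradient is estimated from sampled actions via the score function, as in section 3.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.3        # policy standard deviation (illustrative)
theta = -1.0       # initial mean parameter (illustrative)
for t in range(1, 5001):
    target = rng.uniform(0.0, 1.0)             # target jumps abruptly at every step
    s = rng.normal(theta, sigma)               # state from an assumed stationary distribution
    a = rng.normal(theta - s, sigma, size=20)  # gaussian policy steering the agent toward theta
    r = -((s + a) - target) ** 2               # reward: negative squared distance to the target
    # Minibatch score-function gradient with a mean baseline:
    # d log pi(a|s; theta) / d theta = (a - (theta - s)) / sigma^2.
    grad = np.mean((r - r.mean()) * (a - (theta - s)) / sigma ** 2)
    theta = float(np.clip(theta + grad / np.sqrt(t), -3.0, 3.0))  # projected OPG step
theta_final = theta  # should approach the mean target position, 0.5
```

Since the targets are drawn uniformly, the best fixed mean parameter is the center of the target interval, and the OPG iterate settles near it despite the abrupt target jumps.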
Figure 1:

Average and standard deviation of returns of the OPG algorithm and the MDP-E algorithm with different discretization resolution c.


Figure 2 shows the average rewards and average regrets for the full-information and bandit-feedback cases, which substantiate the theoretical results.

Figure 2:

Average rewards and average regrets of the OPG algorithm with full information and bandit feedback


### 6.2  Linear-Quadratic Regulator

The linear-quadratic regulator (LQR) is a simple system in which the transition dynamics are linear and the reward function is quadratic. This system is instructive because we can compute the best offline parameter and the gradient directly (Peters & Schaal, 2006). Here, an online LQR system is simulated to illustrate the parameter update trajectory of the OPG algorithm.

Let the state and action spaces be one-dimensional real space: $\mathcal{S} = \mathcal{A} = \mathbb{R}$. The transitions are performed deterministically as
The reward function is defined as
where the quadratic coefficients are drawn uniformly at random at each time step. Thus, the reward function changes abruptly.

We use the gaussian policy with mean parameter and standard deviation parameter and in full-information and bandit-feedback experiments, respectively. The best offline parameter is given by , and the initial parameter for the OPG algorithm is drawn uniformly at random.

From the standard argument (Peters & Schaal, 2006), the expected average reward function of the above LQR system is given in closed form,
where $P_t$ is the positive-definite solution of the modified Riccati equation. Then the second-order derivative of the expected average reward function is given by a corresponding closed-form expression.
Given that $P_t$ is the positive-definite solution, the second-order derivative is negative. This means that the expected average reward function of the target LQR system is always concave with respect to the policy parameter.
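This concavity claim can be probed numerically on an assumed scalar instance. The sketch below takes dynamics $s' = s + a$ with a gaussian policy $a \sim \mathcal{N}(\theta s, \sigma^2)$ and reward $-(q s^2 + w a^2)$, for which the stationary state variance has the closed form $\sigma^2 / (1 - (1+\theta)^2)$; the coefficient values are illustrative, not the paper's.

```python
import numpy as np

def eta(theta, q=1.0, w=1.0, sigma=1.0):
    # Expected average reward of s' = s + a, a ~ N(theta * s, sigma^2),
    # with reward r(s, a) = -(q s^2 + w a^2). Requires |1 + theta| < 1
    # for the stationary variance V to exist.
    V = sigma**2 / (1.0 - (1.0 + theta) ** 2)
    return -((q + w * theta**2) * V + w * sigma**2)

# Check concavity on a grid inside the stability region theta in (-2, 0):
thetas = np.linspace(-1.9, -0.1, 181)
vals = eta(thetas)
second_diff = vals[:-2] - 2 * vals[1:-1] + vals[2:]
concave = bool(np.all(second_diff < 0))
```

All second differences come out negative on this instance, and the maximizer lies strictly inside the stability region, consistent with the closed-form argument above.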

In Figure 3a, a parameter update trajectory of OPG with full information in the online LQR problem is plotted by the solid line, and the best offline parameter is denoted by the dashed line. This shows that the OPG solution quickly approaches the best offline parameter.

Figure 3:

Trajectory of the OPG solution with full information and the best offline parameter.


Next, we also include the gaussian standard deviation in the policy parameter. When the standard deviation falls below a lower threshold during gradient update iterations, we project it back to the threshold. A parameter update trajectory is plotted in Figure 3b, showing again that the OPG solution smoothly approaches the best offline parameter.

In Figure 4a, the solid line shows the trajectory of the OPG algorithm with bandit feedback in the online LQR simulation. The result validates that the OPG solution converges to the best offline parameter, at a slightly slower speed than in the full-information result.

Figure 4:

Trajectory of the OPG solution with bandit feedback and the best offline parameter.


The parameter trajectory is shown in Figure 4b, when the standard deviation is included in the parameter. The OPG solution still approaches the best offline mean parameter as we expect.

## 7  Conclusion

In this letter, we proposed an online policy gradient method for continuous state and action online MDPs and showed that the regret of the proposed method is $O(\sqrt{T})$ under a certain concavity assumption on the expected average reward function. A notable fact is that the regret bound does not depend on the cardinality of the state and action spaces, which makes the proposed algorithm suitable for handling continuous states and actions. We further extended our method to the bandit-feedback scenario and showed that the regret of the extended method is still $O(\sqrt{T})$. Furthermore, we also established an $O(\log T)$ regret bound under a strong concavity assumption for the full-information setup. Through experiments, we illustrated that directly handling continuous state and action spaces with the proposed method is more advantageous than discretizing them and applying an existing method.

Our future work will extend the current theoretical analysis to nonconcave expected average reward functions, where gradient-based algorithms suffer from the local optimum problem. A difficulty in this situation is that the regret bound with bandit feedback becomes trivial when the lower bounds of the policy and state distributions are too small. Thus, improving our current result in the bandit-feedback scenario is important future work. Another important challenge is to develop an effective method to estimate the stationary state distribution, which is required in our algorithm.

### Appendix A:  Proof of Lemma 1

The following proposition holds, which can be obtained by recursively applying assumption 1:

The first part of the regret bound in theorem 1 arises from the difference between the state distribution at time $t$ and the stationary state distribution under the best offline policy parameter $\theta^*$,
where the second inequality is obtained by assumption 1.

### Appendix B:  Proof of Lemma 2

The following proposition is a continuous extension of lemma 6.3 in Even-Dar et al. (2009):

Then we have the following proposition, which is proved in appendix  E:

From proposition 17, we have the following proposition:

From proposition 18, we have the following proposition:

From proposition 19, the result of online convex optimization (Zinkevich, 2003) is applicable to the current setup. More specifically we have
which concludes the proof.

### Appendix C:  Proof of Lemma 3

The following proposition holds, which can be obtained from assumption 2:

From the two propositions above, we have the following proposition:

Then the following proposition holds, which is proved in appendix  F following the same line as lemma 5.1 in Even-Dar et al. (2009):

Although the original bound given in Even-Dar et al. (2004, 2009) depends on the cardinality of the action space, that is not the case in the current setup.

Then the third term of the decomposed regret, equation 4.3, is expressed as
which concludes the proof.

### Appendix D:  Proof of Proposition 1

The proof of proposition 1 can be obtained from Hazan et al. (2007): by a Taylor expansion, the expected average reward function can be decomposed as
D.1
where $\tilde{\theta}$ is some point between $\theta_t$ and $\theta^*$. The last inequality comes from the strong concavity assumption, equation 4.4. Given the parameter update rule,
summing up all $T$ terms of equation D.1 and setting the step size as $\alpha_t = 1/(Ht)$ yield

### Appendix E:  Proof of Proposition 17

For two different parameters $\theta$ and $\theta'$, we have
E.1
The first equality comes from equation 3.1, and the second inequality is obtained from the triangle inequality. Since assumptions 2 and 3 imply
and also
equation E.1 can be written as
The second equality comes from the definition of the stationary state distribution, and the third inequality can be obtained from the triangle inequality. The last inequality follows from assumption 1 and proposition 16. Thus, we have
which concludes the proof.

### Appendix F:  Proof of Proposition 22

This proof follows the same line as lemma 5.1 in Even-Dar et al. (2009):
F.1
The first equality comes from the definition of the stationary state distribution, and the second inequality can be obtained by the triangle inequality. The third inequality holds from assumption 1 and
Recursively using equation F.1, we have
which concludes the proof.

### Appendix G:  Proofs of Lemmas 4 and 5

As shown in section 5, an unbiased estimator of the reward function is used for updating the parameter; the corresponding estimated gradient is also unbiased and can be bounded by the following lemma, which is proved in appendix H.

Following the same line as the proof of lemma 3.1 in Flaxman, Kalai, and McMahan (2005), we first define the auxiliary functions for all as
where . It is observed that
and the unbiased estimation satisfies
where the above equation follows from the fact that , and . Thus, we can obtain
which concludes the proof of lemma 13 by using the result of lemma 23. Similarly, using lemma 23 in the proof of lemma 8, we obtain lemma 14.
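Since the smoothed auxiliary functions and the estimator appear here only in words, the following Python sketch illustrates the one-point (bandit) gradient estimator of Flaxman et al. (2005) that the proof builds on; the quadratic test function and all variable names are illustrative, not taken from the letter.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_point_gradient_estimate(f, x, delta, rng):
    """One-point gradient estimate of Flaxman et al. (2005):
    (d / delta) * f(x + delta * u) * u, with u uniform on the unit sphere,
    is an unbiased estimate of the gradient of the smoothed function
    f_hat(x) = E_v[ f(x + delta * v) ], v uniform in the unit ball."""
    d = x.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)          # uniform direction on the unit sphere
    return (d / delta) * f(x + delta * u) * u

# Sanity check: for the quadratic f(x) = -||x||^2 the smoothing only adds a
# constant, so averaging many estimates recovers grad f(x0) = -2 * x0.
f = lambda x: -np.dot(x, x)
x0 = np.array([0.5, -0.3])
est = np.mean([one_point_gradient_estimate(f, x0, 0.1, rng)
               for _ in range(200000)], axis=0)
print(est)                          # approximately [-1.0, 0.6]
```

Note that only a single (bandit) function evaluation per step is needed, which is exactly why this construction appears in the bandit-feedback analysis.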

### Appendix H:  Proof of Lemma 23

The estimated gradient is expressed as
Consider the stationary distribution as a function of the parameter for all . Then, from proposition 17, the bound for the gradient of the stationary distribution is given by
Similarly, from assumption 2, the bound for the gradient of policy is given by
Then we have

### Appendix I:  Concavity Analysis for Target Tracking

The reward function in the target tracking experiment is defined as
Then for all , the expected average reward function is given by
where and . To verify the concavity of , we obtain the derivative of with respect to by plugging in as
We observe that is monotonically nonincreasing, as shown in Figure 5. Thus, the expected average reward functions defined above are concave with respect to the parameter .
Figure 5: The derivative of with respect to .

## Acknowledgments

Y.M. was supported by the MEXT scholarship and the JST CREST program. T.Z. was supported by NSFC 61502339 and SRF for ROCS, SEM. K.H. was supported by MEXT KAKENHI 25330261 and 24106010. M.S. was supported by KAKENHI 23120004.

## References

Abbasi-Yadkori, Y., Bartlett, P., Kanade, V., Seldin, Y., & Szepesvári, C. (2013). Online learning in Markov decision processes with adversarially chosen transition probability distributions. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 26 (pp. 2508–2516). Red Hook, NY: Curran.

Dick, T., György, A., & Szepesvári, C. (2014). Online learning in Markov decision processes with changing cost sequences. In Proceedings of the 31st International Conference on Machine Learning (pp. 512–520). JMLR.

Even-Dar, E., Kakade, S. M., & Mansour, Y. (2004). Experts in a Markov decision process. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 401–408). Cambridge, MA: MIT Press.

Even-Dar, E., Kakade, S. M., & Mansour, Y. (2009). Online Markov decision processes. Mathematics of Operations Research, 34(3), 726–736.

Flaxman, A., Kalai, A., & McMahan, B. (2005). Online convex optimization in the bandit setting: Gradient descent without a gradient. In Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms (pp. 385–394). SIAM.

Hazan, E., Agarwal, A., & Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2–3), 169–192.

Kalai, A., & Vempala, S. (2005). Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3), 291–307.

Ma, Y., Zhao, T., Hatano, K., & Sugiyama, M. (2014). An online policy gradient algorithm for Markov decision processes with continuous states and actions. In Proceedings of the Machine Learning and Knowledge Discovery in Databases—European Conference (pp. 354–369). New York: Springer-Verlag.

Neu, G., György, A., & Szepesvári, C. (2010). The online loop-free stochastic shortest-path problem. In Proceedings of the 23rd Conference on Learning Theory (pp. 231–243).

Neu, G., György, A., & Szepesvári, C. (2012). The adversarial stochastic shortest path problem with unknown transition probabilities. In Proceedings of the International Conference on Artificial Intelligence and Statistics (pp. 805–813). JMLR.

Neu, G., György, A., Szepesvári, C., & Antos, A. (2010). Online Markov decision processes under bandit feedback. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, & A. Culotta (Eds.), Advances in neural information processing systems, 23 (pp. 1804–1812). Red Hook, NY: Curran.

Neu, G., György, A., Szepesvári, C., & Antos, A. (2014). Online Markov decision processes under bandit feedback. IEEE Transactions on Automatic Control, 59(3), 676–691.

Ng, A. Y., Parr, R., & Koller, D. (1999). Policy search via density estimation. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 1022–1028). Cambridge, MA: MIT Press.

Peters, J., & Schaal, S. (2006). Policy gradient methods for robotics. In 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems (pp. 2219–2225). Piscataway, NJ: IEEE.

Sehnke, F., Osendorfer, C., Rückstiess, T., Graves, A., Peters, J., & Schmidhuber, J. (2010). Parameter-exploring policy gradients. Neural Networks, 23(4), 551–559.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4), 229–256.

Yu, J. Y., Mannor, S., & Shimkin, N. (2009). Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34(3), 737–757.

Zimin, A., & Neu, G. (2013). Online learning in episodic Markovian decision processes by relative entropy policy search. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 26 (pp. 1583–1591). Red Hook, NY: Curran.

Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In T. Fawcett & N. Mishra (Eds.), Proceedings of the 20th International Conference on Machine Learning (pp. 928–936). Menlo Park, CA: AAAI Press.

## Notes

1

Our OPG algorithm can also be seen as an extension of the online gradient descent algorithm (Zinkevich, 2003) to online MDP problems.
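As a concrete illustration of that connection, here is a minimal sketch of projected online gradient ascent in the spirit of Zinkevich (2003), written for reward maximization to match the letter's setting; the alternating quadratic rewards and all names are illustrative and not taken from the letter.

```python
import numpy as np

def online_gradient_ascent(grad_fns, theta0, etas, project):
    """At each step t, move along the gradient of the current reward
    r_t, then project back onto the feasible parameter set."""
    theta = float(theta0)
    for grad, eta in zip(grad_fns, etas):
        theta = project(theta + eta * grad(theta))
    return theta

# Alternating rewards r_t(theta) = -(theta - c_t)^2 with c_t in {0, 1}:
# the best fixed parameter is 0.5, and with diminishing step sizes
# eta_t = 1/sqrt(t) the final iterate ends up close to it.
T = 500
targets = [1.0 if t % 2 == 0 else 0.0 for t in range(T)]
grads = [lambda th, c=c: -2.0 * (th - c) for c in targets]
etas = [1.0 / np.sqrt(t + 1.0) for t in range(T)]
theta_T = online_gradient_ascent(grads, 0.0, etas,
                                 lambda th: min(max(th, -2.0), 2.0))
print(round(theta_T, 2))            # close to 0.5, the best fixed parameter
```

The only difference in the online MDP setting is that the gradient must account for the policy-dependent stationary distribution rather than a directly observed reward function.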

2

Note that the parameter space is not closed in this experiment. When takes a value less than −1.99 or more than −0.01 during gradient update iterations, we project it back to −1.99 or −0.01, respectively.
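The projection described above is a simple clamp of the scalar parameter to the interval; a one-line sketch (the interval endpoints are taken from this note, the function name is ours):

```python
def project_to_interval(theta, lo=-1.99, hi=-0.01):
    """Clamp the parameter back into the feasible interval, as
    described in note 2 of the letter."""
    return min(max(theta, lo), hi)

print(project_to_interval(-2.50))   # -1.99 (below the interval)
print(project_to_interval(-1.00))   # -1.0  (inside: unchanged)
print(project_to_interval(0.30))    # -0.01 (above the interval)
```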

3

The analysis of concavity is presented in appendix I.

4

The state and action spaces are bounded to [−2, 2] in the bandit-feedback experiment.

5

The reward function is not bounded, which violates assumption 3. However, it is interesting to observe that the parameter updated by the OPG algorithm still converges to the best offline parameter.