We consider the learning problem under an online Markov decision process (MDP), aimed at learning the time-dependent decision-making policy of an agent that minimizes the regret, the difference from the best fixed policy. The difficulty of online MDP learning is that the reward function changes over time. In this letter, we show that a simple online policy gradient algorithm achieves O(√T) regret for T steps under a certain concavity assumption and O(log T) regret under a strong concavity assumption. To the best of our knowledge, this is the first work to present an online MDP algorithm that can handle continuous state, action, and parameter spaces with a regret guarantee. We also illustrate the behavior of the proposed online policy gradient method through experiments.
The Markov decision process (MDP) is a popular framework of reinforcement learning for sequential decision making (Sutton & Barto, 1998), where an agent takes an action depending on the current state, moves to the next state, receives a reward based on the last transition, and this process is repeated T times. The goal is to find an optimal decision-making policy (i.e., a conditional probability density of action given state) that maximizes the expected sum of rewards over T steps.
In the standard MDP formulation, the reward function is fixed over iterations. However, this assumption is often violated in reality. In this letter, we consider an online MDP scenario where the reward function is allowed to change over time. Such an online MDP problem is an extension of both online decision making and reinforcement learning (Yu, Mannor, & Shimkin, 2009):
In an online decision-making problem, the agent needs to make a decision at each time step without knowledge of the future environment (Kalai & Vempala, 2005). A certain cost function will be observed only after the decision is made at each time step, and the goal is to minimize the regret against the best single decision. There is no assumption on the dynamics in the online decision making problem, and thus the decision can switch from one to another abruptly.
In reinforcement learning, the dynamics are assumed to be Markovian. The reward function and transition dynamics are fixed but unknown to the agent, and thus the estimated reward function and transition function will converge to the true ones if sufficient samples are observed. The goal is to find the optimal policy that maximizes the cumulative reward without full information about the environment.
The goal of the online MDP problem is to find the best time-dependent policy that minimizes the regret, the difference from the best fixed policy. We expect the regret to be o(T), by which the per-step difference from the best fixed policy vanishes as T goes to infinity.
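In symbols, writing Rt(θ) for the expected average reward of the policy with parameter θ under the time-t reward function (this notation is introduced here for illustration and matches the quantities described below), the regret against the best fixed policy can be written as:

```latex
\mathrm{Regret}(T)
  = \max_{\theta \in \Theta} \sum_{t=1}^{T} R_{t}(\theta)
    - \sum_{t=1}^{T} R_{t}(\theta_{t})
```

A sublinear bound on this quantity means the average performance of the time-dependent policy sequence θ1, …, θT approaches that of the best fixed parameter as T grows.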
The MDP expert algorithm (MDP-E), which chooses the current best action at each state, was shown to achieve O(√(T ln |A|)) regret (Even-Dar, Kakade, & Mansour, 2004, 2009), where |A| denotes the cardinality of the action space. Although this bound does not explicitly depend on the cardinality of the state space, the algorithm itself needs an expert algorithm for each state, and thus a large state space may not be handled in practice. Another algorithm, called the lazy follow-the-perturbed-leader (lazy-FPL), divides the time steps into short periods and updates policies only at the end of each period using the average reward function (Yu et al., 2009). This lazy-FPL algorithm was shown to have O(T^(3/4+ε)) regret for 0 < ε < 1/3. Another online MDP algorithm, called online relative entropy policy search, was considered in Zimin and Neu (2013) and shown to have O(√T) regret for a state space with an L-layered structure. However, the regret bounds of these algorithms explicitly depend on the cardinalities |S| and |A| of the state and action spaces, and the algorithms cannot be directly implemented for problems with continuous state and action spaces. The online algorithm for Markov decision processes of Abbasi-Yadkori, Bartlett, Kanade, Seldin, and Szepesvári (2013) was shown to have O(√(T log |Π|)) regret even under changing transition probability distributions, where |Π| is the cardinality of the policy set. Although sublinear bounds still hold for continuous policy spaces, the algorithm cannot directly be used with infinitely many policy candidates. The online MDP problem is formulated as an online linear optimization problem in Dick, György, and Szepesvári (2014). By introducing stationary occupation measures, mirror descent with approximate projections was shown to have O(√T) regret. However, the algorithm assumes that both the state and action spaces are finite. Yu et al. (2009), Abbasi-Yadkori et al. (2013), and Neu, György, and Szepesvári (2012) considered even more challenging online MDP problems under unknown or changing transition dynamics.
In practice, full information of the reward function may be hard to acquire; often only the value of the reward function at the current state and action is available. Such a setup, called the bandit feedback scenario, has attracted a great deal of attention recently. An extension of the lazy-FPL method to the bandit feedback scenario, called the exploratory-FPL algorithm (Yu et al., 2009), was shown to have O(T^(3/4+ε)) regret. Neu, György, Szepesvári, and Antos (2010) proposed a method based on MDP-E that uses an unbiased estimator of the reward function and showed that its regret is O(T^(2/3)). Neu, György, Szepesvári, and Antos (2014) further improved the regret bound to O(√T). However, this algorithm cannot be used in continuous state and action problems.
In this letter, we propose a simple online policy gradient (OPG) algorithm that can be implemented in a straightforward manner for problems with continuous state and action spaces and that can be seen as an extension of Dick et al. (2014).1 Under the assumption that the expected average reward function is concave, we prove that the regret of our OPG algorithm with respect to a compact and convex set of parametric policies is O(√T), which is independent of the cardinality of the state and action spaces but depends on the diameter F and the dimension N of the parameter space. Furthermore, O(log T) regret is proved under a strong concavity assumption on the expected average reward function. We also extend the proposed algorithm to the bandit feedback scenario and theoretically prove that its regret bound is still O(√T) under the concavity assumption. We numerically illustrate the superior behavior of the proposed OPG algorithm in continuous problems over MDP-E with different discretization schemes.
The remainder of this letter is organized as follows. In section 2, we give a formal definition of the online MDP problem. Our proposed algorithm is described in section 3, and regret analyses for the full-information and bandit-feedback scenarios are given in sections 4 and 5, respectively, with proofs presented in the appendixes.
2 Online Markov Decision Process
In this section, we formulate the problem of online MDP learning. An online MDP is specified by:
State space S, which could be either continuous or discrete.
Action space A, which contains all possible actions a. A could be either continuous or discrete.
Transition density p(s′|s, a), which represents the conditional probability density of the next state s′ given the current state s and the action a to be taken. We assume that the transition density is fully available to the agent.
Reward function sequence {rt}, which is a sequence of real-valued functions fixed in advance; it does not change no matter what actions are taken.
We assume here that all candidate policies are parameterized by a parameter θ, which differs from related works with finite states and actions (Even-Dar et al., 2004, 2009; Neu, György, Szepesvári, et al., 2010; Yu et al., 2009; Dick et al., 2014). For continuous problems, it is a common choice to use a parametric policy (e.g., the gaussian policy), which has been demonstrated to work well (Sutton & Barto, 1998; Peters & Schaal, 2006). For this reason, the best offline policy defined in equation 2.2 is a suitable baseline, given that the best policy within the class of all Markovian policies is not a suitable baseline for continuous problems. If the regret is bounded by a sublinear function of T, the algorithm is asymptotically as powerful as the best offline policy.
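As a concrete illustration of such a parametric policy, a one-dimensional gaussian policy with a linear-in-state mean can be sketched as follows; the linear mean θs and the fixed standard deviation are our illustrative choices, not necessarily the exact model used in the letter.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_policy_sample(s, theta, sigma=0.5):
    """Sample an action from the gaussian policy N(a; theta * s, sigma^2)."""
    return rng.normal(theta * s, sigma)

def gaussian_policy_density(a, s, theta, sigma=0.5):
    """Conditional density pi(a | s; theta) of the gaussian policy."""
    return np.exp(-(a - theta * s) ** 2 / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))
```

Here the parameter space is the real line for θ; in the analysis it is restricted to a compact and convex set.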
3 Online Policy Gradient Algorithm
In this section, we introduce an online policy gradient algorithm for solving the online MDP problem.
Unlike previous work (Even-Dar et al., 2004, 2009; Neu, György, Szepesvári, et al., 2010), we do not use the expert algorithm in our method because it is not suitable for handling continuous state and action problems. Instead, we consider a gradient-based algorithm that updates the parameter of policy along the gradient direction of the expected average reward function at each time step t.
Our online policy gradient (OPG) algorithm is given as follows:
Initialize the policy parameter θ1.
Observe the current state st.
Take an action at according to the current policy π(a|s; θt).
Observe the reward rt from the environment.
Update the parameter along the gradient of the expected average reward, θt+1 = θt + ηt ∇θ Rt(θt), where ηt is the step size, projecting back onto the parameter space if necessary.
Move to the next state st+1 and repeat from the second step.
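The loop above can be sketched in code. The toy dynamics, the fixed quadratic reward, the score-function gradient estimate, the O(1/√t) step size, and the projection interval are all illustrative assumptions standing in for the letter's exact quantities (in particular, the letter's gradient is that of the expected average reward under the stationary state distribution):

```python
import numpy as np

def opg(T=200, eta0=0.1, theta_bounds=(-2.0, 2.0), sigma=0.5, seed=1):
    """Sketch of the online policy gradient (OPG) loop on a toy problem."""
    rng = np.random.default_rng(seed)
    theta, s = 0.0, 0.0
    trajectory = []
    for t in range(1, T + 1):
        a = rng.normal(theta * s, sigma)              # action from the current policy
        r = -(s - 1.0) ** 2 - 0.1 * a ** 2            # toy reward for (s_t, a_t)
        grad = r * (a - theta * s) * s / sigma ** 2   # score-function gradient estimate
        theta = float(np.clip(theta + eta0 / np.sqrt(t) * grad, *theta_bounds))  # ascent + projection
        s = float(np.clip(0.8 * s + a + 0.1 * rng.normal(), -5.0, 5.0))          # next state
        trajectory.append(theta)
    return trajectory
```

The projection step keeps the parameter inside the compact parameter space, mirroring the projected update in the analysis.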
When the reward function does not change over time, the OPG algorithm reduces to the ordinary policy gradient algorithm (Williams, 1992), an efficient and natural algorithm for continuous state and action MDPs. The OPG algorithm can also be regarded as an extension of the online gradient descent algorithm (Zinkevich, 2003), with the difference that it maximizes the expected average reward Rt(θ), not the instantaneous reward at the currently visited state. As shown in the definition of Rt(θ), the stationary state distribution of the policy with parameter θ is used, which differs from the state distribution actually visited at time t. As we will prove in section 4, the regret bound of the OPG algorithm is O(√T) under a certain concavity assumption and O(log T) under a strong concavity assumption on the expected average reward function. Unlike previous work (Even-Dar et al., 2004, 2009; Yu et al., 2009; Neu, György, Szepesvári, et al., 2010), these bounds do not depend on the cardinality of the state and action spaces since a parameterized policy space is considered. Therefore, the OPG algorithm is suitable for handling continuous state and action online MDPs.
3.2 Bandit Feedback
As will be proved in section 5, the regret bound of the OPG method with bandit feedback is still O(√T), although the bound is looser than that in the full-feedback case. If it is not possible to calculate the stationary state distribution directly, an estimate obtained by reinforcement learning may be employed in practice (Ng, Parr, & Koller, 1999).
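A minimal sketch of a gradient estimate built from a single observed reward value, assuming a gaussian policy with linear-in-state mean and, purely for simplicity, ignoring the dependence of the stationary state density on the parameter (the letter's actual estimator accounts for the stationary distribution):

```python
def bandit_gradient_estimate(r_t, s_t, a_t, theta, sigma=0.5):
    """Score-function gradient estimate from one bandit observation.

    Uses g_hat = r_t * d/dtheta log pi(a_t | s_t; theta), where the policy
    is gaussian with mean theta * s_t, so the log-derivative is
    (a_t - theta * s_t) * s_t / sigma^2.  This is a simplifying sketch,
    not the letter's exact estimator.
    """
    return r_t * (a_t - theta * s_t) * s_t / sigma ** 2
```

Only the scalar reward actually received at (s_t, a_t) enters the update, which is exactly what the bandit feedback scenario permits.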
4 Regret Analysis with Full Feedback
In this section, we provide a regret bound for the OPG algorithm in the full-feedback case.
First, we introduce the assumptions required in the proofs. Some assumptions have already been used in related works for discrete state and action MDPs, and we extend them to continuous state and action MDPs.
Assumption 4 means that the expected average reward function is concave, which is currently our sufficient condition to guarantee the O(√T) regret bound for the OPG algorithm. This assumption can be relaxed to locally concave expected average reward functions, in which case all the results still hold locally. More specifically, the standard policy gradient algorithm (Sutton & Barto, 1998; Peters & Schaal, 2006) has been shown to converge to a locally optimal solution, and we then use a locally optimal policy as the baseline in the definition of the regret instead of the globally optimal solution.
4.2 Regret Bound with Concavity
We have the following theorem.
Note that the constant C1 depends on the specific policy model used, as stated in assumption 2.
The first term has already been analyzed for discrete state and action online MDPs in Even-Dar et al. (2004, 2009), Neu et al. (2014), and Dick et al. (2014); we extend this analysis to continuous state and action spaces in lemma 6.
4.3 Regret Analysis under Strong Concavity
Next we derive a sharper regret bound for the OPG algorithm under a strong concavity assumption.
The second term is bounded by the following proposition, given the strong concavity assumption, equation 4.4, and a step size ηt proportional to 1/t:
5 Regret Analysis with Bandit Feedback
In this section, we prove a regret bound for the OPG algorithm in the bandit-feedback case.
5.1 Regret Bound with Concavity in the Bandit Scenario
Then we have the following theorem:
The same regret decomposition as equation 4.3 is still possible in the bandit-feedback setting. The first term can be bounded in the same way as the full-information case; lemma 6 still holds. However, the bounds for the second and third terms, originally given in lemmas 7 and 8, should be modified as follows:
The bound of the second part is still O(√T), but it is looser than the bound in the full-information scenario, which is caused by the use of an estimated gradient of the expected average reward function.
6 Experiments

In this section, we illustrate the behavior of the OPG algorithm through experiments.
6.1 Target Tracking
6.2 Linear-Quadratic Regulator
The linear-quadratic regulator (LQR) is a simple system where the transition dynamics are linear and the reward function is quadratic. This system is instructive because we can compute the best offline parameter and the gradient directly (Peters & Schaal, 2006). Here, an online LQR system is simulated to illustrate the parameter update trajectory of the OPG algorithm.
We use the gaussian policy, whose mean and standard deviation parameters are set separately in the full-information and bandit-feedback experiments. The best offline parameter is computed analytically, and the initial parameter for the OPG algorithm is drawn uniformly at random.
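As a sanity check on what "best offline parameter" means here, one can approximate it by simulation on an assumed scalar LQR model; the dynamics and cost coefficients below are our illustrative stand-ins, not the letter's settings:

```python
import numpy as np

def avg_reward(theta, T=2000, sigma=0.5, seed=0):
    """Average reward of the gaussian policy a ~ N(theta * s, sigma^2) in a
    toy scalar LQR: s' = 0.9 s + a + 0.1 * noise, r = -(s^2 + 0.1 a^2).
    All coefficients here are illustrative, not the letter's settings."""
    rng = np.random.default_rng(seed)
    s, total = 0.0, 0.0
    for _ in range(T):
        a = rng.normal(theta * s, sigma)
        total += -(s ** 2 + 0.1 * a ** 2)
        s = float(np.clip(0.9 * s + a + 0.1 * rng.normal(), -1e6, 1e6))
    return total / T

# A crude stand-in for the analytically computed best offline parameter:
# maximize the simulated average reward over a grid of candidate gains.
grid = np.linspace(-1.5, 0.5, 41)
best_theta = max(grid, key=avg_reward)
```

The grid maximizer plays the role of the best offline gain; in the letter this quantity is obtained in closed form via the LQR solution (Peters & Schaal, 2006).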
In Figure 3a, a parameter update trajectory of OPG with full information in the online LQR problem is plotted by the solid line, and the best offline parameter is denoted by the dashed line. This shows that the OPG solution quickly approaches the best offline parameter.
Next, we also include the gaussian standard deviation in the policy parameter, so that both the mean and the standard deviation are updated. When the standard deviation falls below a small positive threshold during the gradient update iterations, we project it back to that threshold. A parameter update trajectory is plotted in Figure 3b, showing again that the OPG solution smoothly approaches the best offline parameter.
In Figure 4a, the solid line shows the trajectory of the OPG algorithm with bandit feedback in the online LQR system simulation. The result validates that the OPG solution converges to the best offline parameter with a slightly slower speed compared with the full-information result.
The parameter trajectory when the standard deviation is included in the parameter is shown in Figure 4b. The OPG solution still approaches the best offline mean parameter, as expected.
7 Conclusion

In this letter, we proposed an online policy gradient method for continuous state and action online MDPs and showed that the regret of the proposed method is O(√T) under a certain concavity assumption on the expected average reward function. A notable fact is that the regret bound does not depend on the cardinality of the state and action spaces, which makes the proposed algorithm suitable for handling continuous states and actions. We further extended our method to the bandit-feedback scenario and showed that the regret of the extended method is still O(√T). Furthermore, we established an O(log T) regret bound under a strong concavity assumption in the full-information setup. Through experiments, we illustrated that directly handling continuous state and action spaces by the proposed method is more advantageous than discretizing them and applying an existing method.
Our future work will extend the current theoretical analysis to nonconcave expected average reward functions, where gradient-based algorithms suffer from the problem of local optima. A difficulty in this situation is that the regret bound with bandit feedback becomes trivial when the lower bounds of the policy and state distributions are too small. Thus, improving our current result in the bandit feedback scenario is an important future direction. Another important challenge is to develop an effective method for estimating the stationary state distribution, which is required in our algorithm.
Appendix A: Proof of Lemma 6
The following proposition holds, which can be obtained by recursively using assumption 1:
Appendix B: Proof of Lemma 7
The following proposition is a continuous extension of lemma 6.3 in Even-Dar et al. (2009):
Then we have the following proposition, which is proved in appendix E:
From proposition 17, we have the following proposition:
From proposition 18, we have the following proposition:
Appendix C: Proof of Lemma 8
From propositions 44 and 20, we have the following proposition:
Appendix D: Proof of Proposition 10
Appendix E: Proof of Proposition 17
Appendix F: Proof of Proposition 22
As we show in section 5, an unbiased estimator of the reward function is used for updating the parameter. We also show that the corresponding estimated gradient is unbiased; it can be bounded by the following lemma, which is proved in appendix H.
Appendix H: Proof of Lemma 23
Appendix I: Concavity Analysis for Target Tracking
Acknowledgments

Y.M. was supported by the MEXT scholarship and the JST CREST program. T.Z. was supported by NSFC 61502339 and SRF for ROCS, SEM. K.H. was supported by MEXT KAKENHI 25330261 and 24106010. M.S. was supported by KAKENHI 23120004.
Our OPG algorithm can also be seen as an extension of the online gradient descent algorithm (Zinkevich, 2003) to online MDP problems.
Note that the parameter space is not closed in this experiment. When the parameter takes a value less than −1.99 or more than −0.01 during the gradient update iterations, we project it back to −1.99 or −0.01, respectively.
The analysis of concavity is presented in appendix I.
The state and action spaces are bounded to [−2, 2] in the bandit-feedback experiment.
The reward function is not bounded, which violates assumption 3. However, it is interesting to illustrate that the parameter updated by the OPG algorithm still converges to the best offline parameter.