## Abstract

Most conventional policy gradient reinforcement learning (PGRL) algorithms neglect (or do not explicitly make use of) a term in the average reward gradient with respect to the policy parameter. That term involves the derivative of the stationary state distribution that corresponds to the sensitivity of its distribution to changes in the policy parameter. Although the bias introduced by this omission can be reduced by setting the forgetting rate γ for the value functions close to 1, these algorithms do not permit γ to be set exactly at γ = 1. In this article, we propose a method for estimating the log stationary state distribution derivative (LSD) as a useful form of the derivative of the stationary state distribution through backward Markov chain formulation and a temporal difference learning framework. A new policy gradient (PG) framework with an LSD is also proposed, in which the average reward gradient can be estimated by setting γ = 0, so it becomes unnecessary to learn the value functions. We also test the performance of the proposed algorithms using simple benchmark tasks and show that these can improve the performances of existing PG methods.

## 1.  Introduction

Policy gradient reinforcement learning (PGRL) is a popular family of algorithms in reinforcement learning (RL). It improves a policy parameter to maximize the average reward (also called the expected return) by using the average reward gradients with respect to the policy parameter, which are called the policy gradients (PGs) (Gullapalli, 1990; Williams, 1992; Kimura & Kobayashi, 1998; Baird & Moore, 1999; Sutton, McAllester, Singh, & Mansour, 2000; Baxter & Bartlett, 2001; Konda & Tsitsiklis, 2003; Peters & Schaal, 2006). However, most conventional PG algorithms for infinite-horizon problems neglect (or do not explicitly make use of) the term associated with the derivative of the stationary (state) distribution in the PGs, with the exception of Ng, Parr, and Koller (2000), since to date there is not an efficient algorithm to estimate this derivative. This derivative is an indicator of how sensitive the stationary distribution is to changes in the policy parameter. While the biases introduced by this omission can be reduced by using a forgetting (or discounting)1 rate “γ” for the value functions close to 1, that tends to increase the variance of the PG estimates, and for γ = 1, the variance can become infinite, which violates the conditions of these algorithms. This trade-off makes it difficult to find an appropriate γ in practice. Furthermore, while the solution to discounted reinforcement learning is well defined if the optimal control solution can be perfectly represented by the policy, this is no longer true in the case where function approximation is employed. For approximations of the policies, the solution will always be determined by the start-state distributions and thus is in general an ill-defined problem. Average reward RL, on the other hand, is a well-posed problem, as it depends only on the stationary distribution.

Here, we propose a new PG framework for estimating the derivative of the logarithmic stationary state distribution (log stationary distribution derivative, or LSD) as an alternative and useful form of the derivative of the stationary distribution for estimating the PG. 2 It is our main result and contribution of this article that a method for estimating the LSD is derived through backward Markov chain formulation and a temporal difference learning method. Then the learning agent estimates the LSD instead of estimating the value functions in this PG framework. In addition, the use of LSD estimation will open other possibilities for RL. In particular, it allows implementing state-of-the-art natural gradient learning for RL (Morimura, Uchibe, Yoshimoto, & Doya, 2008, in press), which was reported to be especially effective in the randomly synthesized large-scale MDPs. The Fisher information matrix including the LSD can be used as the Riemannian metric, which defines the transformation of an ordinary gradient to a natural gradient.

This article is an extended version of an earlier technical report (Morimura, Uchibe, Yoshimoto, & Doya, 2007), with new results, and is organized as follows. In section 2, we review the conventional PGRL methods and describe a motivation to estimate LSD. In section 3, we propose an LSD(λ) algorithm for the estimation of LSD by a east quares temporal difference method based on the backward Markov chain formulation. In section 4, the LSD(λ)-PG algorithm is instantly derived as a novel PG algorithm utilizing LSD(λ). We also propose a baseline function for LSD(λ)-PG, which decreases the variance of the PG estimate. To verify the performances of the proposed algorithms, numerical results for simple Markov decision processes (MDPs) are shown in section 5. In section 6, we review existing (stationary) state distribution derivative estimating and average reward PG methods. In section 7, we give a summary and discussion. We also suggest other significant possibilities brought by the realization of the LSD estimation.

## 2.  Policy Gradient Reinforcement Learning

We briefly review conventional PGRL methods and present the motivation to estimate LSD. A discrete-time MDP with finite sets of states and actions is defined by a state transition probability and a (bounded) reward function r+1 = r(s, a, s+1), where s+1 is the state followed by the action a at the state s and r+1 is the observed immediate reward at s+1 (Bertsekas, 1995; Sutton & Barto, 1998). The state s+k and the action a+k denote a state and an action after k time steps from the state s and the action a, respectively, and backward for −k. The decision-making rule follows a stochastic policy , parameterized by . We assume the policy is always differentiable with respect to . We also posit the following assumption:

Assumption 1.
The Markov chain is ergodic (irreducible and aperiodic) for all policy parameters . Then there exists a unique stationary state distribution , which is independent of the initial state and satisfies the recursion
2.1
where .

The goal of PGRL is to find a policy parameter θ* that maximizes the time average of immediate rewards called the average reward or expected return:
where denotes the expectation over the Markov chain . Under assumption 1, the average reward is independent of the initial state s and can be shown to be equal to (Bertsekas, 1995),
2.2
where does not depend on the policy parameter. The policy gradient RL algorithms update the policy parameters in the direction of the gradient of the average reward with respect to so that
where ⊤ denotes the transpose. This derivative is often referred to as the policy gradient (PG). Using equation 2.2, the PG is directly determined to be
2.3
However, since the derivation of the gradient of the log stationary state distribution is nontrivial, the conventional PG algorithms (Baxter & Bartlett, 2001; Kimura & Kobayashi, 1998) utilize an alternative representation of the PG as (we give this derivation in appendix A)
2.4
where
are the cumulative rewards defined with forgetting rate γ ∈ [0, 1), known as the state-action and state value functions, respectively (Sutton & Barto, 1998). Since the contribution of the second term in equation 2.4 shrinks as γ approaches 1 (Baxter & Bartlett, 2001), the conventional algorithms omitted the second term and approximated the PG as a biased estimate by taking γ ≈ 1: the PG estimate was composed of only the first term in equation 2.4. Although the bias introduced by this omission shrinks as γ approaches 1, the variance of the estimate becomes larger (Baxter & Bartlett, 2001; Baxter, Bartlett, & Weaver, 2001). In addition, these algorithms prohibit γ from being set exactly at 1, though the bias disappears in the limit of γ → 1.

In this article, we propose an estimation approach for the log stationary distribution derivative (LSD), . The realization of the LSD estimation instantly makes an alternative PG approach feasible, which uses equation 2.3 for computing the PG estimate with the LSD estimate. An important feature is that this approach does not need to learn value function Qπγ or Vπγ, and therefore the resulting algorithms are free from the bias-variance trade-off in the choice of the forgetting rate γ.

## 3.  Estimation of the Log Stationary Distribution Derivative

In this section, we propose an LSD estimation algorithm based on least squares, LSD(λ). For this purpose, we formulate the backwardness of the ergodic Markov chain and show that LSD can be estimated on the basis of the temporal difference learning (Sutton, 1988; Bradtke & Barto, 1996; Boyan, 2002).

### 3.1.  Properties of Forward and Backward Markov Chains.

According to Bayes's theorem, a backward probability q of a previous state-action pair (s−1, a−1) leading to the current state s is given by
The posterior q(s−1, a−1s) depends on the prior distribution . When the prior distribution follows a stationary distribution under a fixed policy π, that is, , the posterior is called the stationary backward probability denoted by and satisfies
3.1
If a Markov chain follows , we call it the backward Markov chain associated with M(θ) following . Both Markov chains— and —are closely related, as described in the following two propositions:

Proposition 1.
Let a Markov chain characterized by a transition probability , which is irreducible and ergodic. Then the backward Markov chain characterized by the backward (stationary) transition probability with respect to is also ergodic and has the same unique stationary distribution as M(θ):
3.2
where and are the stationary distributions of and , respectively.

Proof.
By multiplying both sides of equation 3.1 by and summing over all possible , we obtain a “detailed balance equations” (MacKay, 2003):
3.3
Then
holds by summing both sides of equation 3.3 over all possible , indicating that (i) has the same stationary distribution as . By assumption 1, (i) directly proves that (ii) is irreducible. Equation 3.3 is reformulated by the matrix notation: both transition probabilities and are assembled into and , respectively,3 and the stationary distribution into dθ4
We can easily see that the diagonal components of are equal to those of for any natural number n. This implies that (iii) B(θ) has the same aperiodic property as M(θ). Proposition 1, equation 3.2, is directly proven by (i), (ii), and (iii) (Schinazi, 1999).

Proposition 2.
Let the distribution of sK follow , and let f(sk, ak) be an arbitrary function of a state-action pair at time k. Then
3.4
where and are the expectations over the backward and forward Markov chains, and , respectively, and . Equation 3.4 holds even at the limit K → ∞.

Proof.
By utilizing the Markov property and substituting equation 3.1, we have the following relationship:
This directly implies the proposition in the case of finite K. Since the following equations are derived from proposition 1, the proposition in the limit case K → ∞ is also instantly proven:

Propositions 1 and 2 are significant as they indicate that the samples from the forward Markov chain can be used directly for estimating the statistics of the backward Markov chain .

### 3.2.  Temporal Difference Learning for LSD from the Backward to Forward Markov Chains.

Using equation 3.1 the LSD, , can be decomposed into
3.5
There exist and in equation 3.5, so its recursion can be rewritten as
3.6
Equation 3.6 implies that the LSD of a state s is the infinite-horizon cumulation of the log policy distribution derivative (LPD), , along the backward Markov chain B(θ) from state s. From equations 3.5 and 3.6, LSD could be estimated using an approach similar to temporal difference (TD) learning (Sutton, 1988) for the following backward TD-error δ on the backward Markov chain B(θ) rather than :
where the first two terms of the right-hand side describe the one-step actual observation of the policy eligibility and the one-step ahead LSD on , respectively, while the last term is the current LSD. Interestingly, while the well-known TD error for the value-function estimation uses the reward r(s, a, s+1) on (Sutton & Barto, 1998), this TD error for the LSD estimation uses on .

While is a random variable, holds for all states , which comes from equation 3.5. This is a motivation for minimizing the mean squares of the backward TD error, , for the estimation of LSDs,5 where is composed of the LSD estimate rather than (exact) LSD . Here, denotes the Euclidean norm .

With an eligibility decay rate λ ∈ [0, 1] and a back-trace time step , equation 3.6 is generalized, where denotes the set of natural numbers:
Accordingly, the backward TD is extended into the backward TD(λ), ,
where the unbiased property, , is still retained. The minimization of at λ = 1 and the limit K → ∞ is regarded as the Widrow-Hoff supervised learning procedure. Even if λ and K are not set in the above values, the minimization of in a large λ ∈ [0, 1) and K would be less sensitive to a non-Markovian effect as in the case of the conventional TD(λ) learning for the value functions (Peng & Williams, 1996).
In order to minimize as the estimation of the LSD, we need to gather many samples drawn from the backward Markov chain . However, the actual samples are drawn from a forward Markov chain . Fortunately, by using propositions 1 and 2, we can derive the following exchangeable property:
3.7
In particular, the actual samples can be reused to minimize , provided . In real problems, however, the initial state is rarely distributed according to the stationary distribution . To interpolate the gap between theoretical assumption and realistic applicability, we would need to adopt either of the following two strategies: (i) K is not set at such a large integer if λ ≈ 1, or (ii) λ is not set at 1 if Kt, where t is the current time step of the actual forward Markov chain .

### 3.3.  LSD Estimation Algorithm: Least Squares on Backward TD(⁠⁠) with Constraint.

In the previous sections, we introduced the theory for estimating LSD by minimizing of the mean squares of on , . However, LSD also has the following constraint derived from :
3.8
In this section, we propose an LSD estimation algorithm, LSD(λ), based on the least-squares TD approach (Young, 1984; Bradtke & Barto, 1996; Boyan, 2002; Lagoudakis & Parr, 2003), which simultaneously attempts to decrease the mean squares and satisfy the constraint. We consider the situation where the LSD estimate is represented by a linear vector function approximator
3.9
where is a basis function and is an adjustable parameter matrix, and we assume that the optimal parameter satisfies . If the estimator cannot represent the LSD exactly, LSD(λ) would behave as suggested by Sutton (1988) and Peng and Williams (1996), which means the estimation error for the LSD would get smaller as the value of λ ∈ [0, 1) approaches 1. This will be confirmed in our numerical experiments in section 5.
For simplicity, we focus on only the ith element θi of the policy parameter θ, denoting , , and as the ith element of . Accordingly, the objective function to be minimized is given by
3.10
where the second term of the right side comes from the constraint of equation 3.8.6 Thus, the derivative is
3.11
where
and
Although the conventional least-squares method aims to find the parameter satisfying as the true parameter , it induces estimation bias if a correlation exists between the error and its derivative concerning the first term of the right-hand side in equation 3.10. That is, if
holds because in the second term of the left-hand side in equation 3.10 is always a zero vector due to equation 3.8 and . Since this correlation exists in general RL tasks, we apply the instrumental variable method to eliminate the bias (Young, 1984; Bradtke & Barto, 1996). This requires that be replaced by the instrumental variable ι(s), which has a correlation with but not . This condition is obviously satisfied when as well as LSTD(λ) (Bradtke & Barto, 1996; Boyan, 2002). Instead of equation 3.11 we aim to find the parameter making the equation
3.12
be equal to zero in order to compute the true parameter , so that .
For the remainder of this article, we denote the current state at time step t by st to clarify the time course on the actual Markov chain . In the proposed LSD estimation algorithm, LSD(λ), the back-trace time step K is set equal to the time step t, of the current state st, while the eligibility decay rate λ can be set in [0, 1), that is,
where and . The expectations in equation 3.12 are estimated without bias by Bradtke and Barto (1996) and Boyan (2002):
where and and
Therefore, by substituting these estimators into equation 3.12, the estimate at time step T is computed as
The LSD(λ) for the matrix parameter rather than is shown in algorithm 1, where the notation ≔ denotes the right-to-left substitution:7

It is intriguing that LSD(λ) has a relationship to a model-based method, as noted by Boyan (2002) and Lagoudakis and Parr (2003) in the references for LSTD(λ) and LSTDQ(λ), but LSD(λ) is concerned with the “backward” model instead of the forward model . This is due to the fact that the sufficient statistics A in LSD(λ) can be regarded as a compressed backward model, since A is equivalent to one of the sufficient statistics to estimate the backward state transition probability when λ = 0 and the feature vector corresponding to ; ; and so on. We give a detailed explanation of this in appendix B.

## 4.  Policy Gradient Algorithms with the LSD Estimate

We propose a PG algorithm as a straightforward application with the LSD estimates in section 4.1. In section 4.2, we introduce baseline functions to reduce the variance of the PG as estimated by our PG algorithm.

### 4.1.  Policy Update with the LSD Estimate.

Now let us define the PGRL algorithm based on the LSD estimate. The realization of the estimation for by LSD(λ) directly leads to the following estimate for the PG (see equation 2.3), due to its independence from the forgetting factor γ for the value functions:
4.1
4.2
where rt+1 is the immediate reward defined by the reward function r(st, at, st+1). The policy parameter can then be updated through the stochastic gradient method with an appropriate step size α (Bertsekas and Tsitsiklis, 1996):8
LSD(λ)-PG without a baseline function is shown in algorithm 2 as one of the simplest realizations of PG algorithm that uses LSD(λ). In algorithm 2, the forgetting rate parameter β ∈ [0, 1) is introduced to discard the past estimates given by old values of θ:

Another important topic for function approximation is the choice of the basis function of the approximator, particularly in continuous state problems. For the PG algorithm, the objective of the LSD estimate is to provide one term of the PG estimate, such as , but not to provide a precise estimate of the LSD , where . Therefore, the following proposition would be useful:

Proposition 3.
Let the basis function of the LSD estimator be
and then the function estimator, , can represent the second term of the PG, , where the adjustable parameter is a d-dimensional vector:
where minimizes the mean error, .

Proof.
The proposition follows directly from

### 4.2.  Baseline Function for Variance Reduction of Policy Gradient Estimates with LSD.

As the variance of the PG estimates using the LSD, equation 4.2, might be large, we consider variance reduction using a baseline function for immediate reward r. The following proposition provides the kind of functions that can be used as the baseline function for PG estimation using the LSD:9

Proposition 4.
With the following function of the state s and the following state s+1 on ,
4.3
where c and g(s) are an arbitrary constant and an arbitrary bounded function of the state, respectively. The derivative of the average reward with respect to the policy parameter (see equation 2.3), , is then transformed to
4.4

Proof.

See appendix C.

Proposition 4 implies that any ρ(s, s+1) defined in equation 4.3 can be used as the baseline function of the immediate reward r+1r(s, a, s+1) for computing the PG, as in equation 4.4. Therefore, the PG can be estimated with the baseline function ρ(st, st+1) with a large time step T,
4.5
In view of the form of the baseline function in equation 4.3, we use the following linear function as a representation of the baseline function,
where and are its coefficient parameter and feature vector function of the state.
When we consider the trace of the covariance matrix of the PG estimates as the variance of and use the results of Greensmith, Bartlett, and Baxter (2004), an upper bound for the variance is derived as
where h(a) is a monotonically increasing function of its argument a. Accordingly, since the optimal coefficient parameter for the optimal (linear) baseline function satisfies
the optimal coefficient parameter is computed as10
4.6
There is also an alternative decent baseline function , which minimizes the residual sum of squares about the expectation of the reward function r(s, a, s+1) given s and then works to reduce the variance of the PG estimates. If the rank of is equal to the number of states , this decent baseline function satisfies
4.7
and has the following statistical interpretation: this function is a solution of Poisson's equation
thus, υd and are equal to the average reward and the (undiscounted) state value function, respectively (Konda & Tsitsiklis, 2003). The parameter of the decent baseline function can be computed as11
4.8
where (Ueno, Kawanabe, Mori, Maeda, & Ishii, 2008.

By equations 4.6 and 4.8, both the coefficient parameters and for the optimal b*(s, s+1) and decent b(s, s+1) baseline functions can be estimated by least squares and LSTD(λ), respectively, though the estimation for b* requires LSD estimates. The LSD(λ)-PG algorithms with both baseline functions are shown in algorithms 3 and 4:

## 5.  Numerical Experiments

We verify the performance of our proposed algorithms in stochastic “torus” MDPs in section 5.1. The proposed algorithms and other existing PG algorithms are also applied to a pendulum balancing problem as a continuous state-action problem in section 5.2.

### 5.1.  Torus -State MDP.

We tested the performance of our proposed algorithms in a stochastic one-dimensional torus grid-world with a finite set of grids and a set of two possible actions . This is a typical -state MDP task where the state transition probabilities p are given by
otherwise, p = 0, where s = 0 and (s = 1 and ) are the identical states and qs ∈ [0, 1] is a task-dependent constant. In this experiment, a stochastic policy is a so-called Boltzmann or Gibbs policy represented by a sigmoidal function,
Here, all of the elements of the state-feature vectors were independently drawn from the gaussian distribution for each episode (simulation run). This was to assess how the parameterization of the stochastic policy affected the performance of our algorithms. The state-feature vectors were also used as the basis function for the LSD estimate (see equation 3.9).

#### 5.1.1.  Performance of LSD(λ) Algorithm.

First, we verified how precisely the LSD(λ) algorithm estimated without regard to the setting of qs and the policy parameter . Each element of and the task-dependent constant qs were initialized according to and , respectively, where is the uniform distribution over the interval of [a, b]. These values were fixed during each episode.

Figure 1A shows a typical time course of the LSD estimates for -state MDP, where nine different gray scales indicate all of the different elements of the LSD, respectively. The solid lines denote the values estimated by LSD(0), and the dotted lines denote the analytical solution of the LSD. This result shows that the proposed algorithm, LSD(λ), can estimate the LSD . Besides this result, we have confirmed that the estimates by LSD(0) always converged to the analytical solution for as in Figure 1A.

Figure 1:

Performances of LSD(λ) for the estimation of the LSD . (A) A typical time course of LSD estimates in a three-state MDP. (B, C) The relative errors averaged over independent 200 episodes in seven-state MDPs for various λ (B) with a proper basis function and (C) with an improper basis function .

Figure 1:

Performances of LSD(λ) for the estimation of the LSD . (A) A typical time course of LSD estimates in a three-state MDP. (B, C) The relative errors averaged over independent 200 episodes in seven-state MDPs for various λ (B) with a proper basis function and (C) with an improper basis function .

Second, we investigated the effect of the eligibility decay rate λ using seven-state MDPs. In order to evaluate the average performance over various settings, we employed a relative error criterion that is defined by , where is the optimal parameter defined in proposition 3. Figures 1B and 1C show the time courses of the relative error averages over independent 200 episodes for λ = 0, 0.3, 0.9, and 1. The only difference between these two figures was the number of elements of the feature vectors . The feature vectors used in Figure B were appropriate and sufficient to distinguish all of the different states, while the feature vectors used in Figure 1C were inappropriate and insufficient. These results were consistent with the theoretical prospects. In particular, we could set λ arbitrarily in [0, 1) if the basis function was appropriate (see Figure 1B); otherwise, we would need to set λ close but not equal to 1 (see Figure 1C).

#### 5.1.2.  Comparison to Other PG Methods.

We compared the LSD(λ=0)-PG algorithm with the other PG algorithms for three-state MDPs, concerned with the estimation of PG and the optimization of the policy parameter . The policy and the state transition probability were set as and for every i ∈ {1, 2, 3}, respectively.

Figure 2 shows the reward setting in the MDP. There are two types of rewards: r = (±)2/Z(c) and r = (±)c/Z(c), where the variable c was initialized using the uniform distribution over [0.95, 1) for each episode (simulation run) and the function Z(c) was the normalizing constant to ensure .12 Note that the reward c defines the infimum value of γ to find the optimal policy: . Therefore, the setting of γ is important but difficult in this task. From the performance baselines of the existing PG methods, we adopted two algorithms: GPOMDP (Baxter & Bartlett, 2001) and Konda's actor-critic (Konda & Tsitsiklis, 2003). In these algorithms, the state values were estimated by LSTD(λ) (Bradtke & Barto, 1996; Boyan, 2002; Yu & Bertsekas, 2006) and these estimates were used as the baseline functions to reduce the variance of PG estimates, while the baseline functions were not used in the original algorithms.

Figure 2:

Reward setting of three-state MDPs used in our comparative studies. The value of c is selected from the uniform distribution for each episode. Z(c) is a normalizing function to ensure .

Figure 2:

Reward setting of three-state MDPs used in our comparative studies. The value of c is selected from the uniform distribution for each episode. Z(c) is a normalizing function to ensure .

Figure 3 shows the results for the estimation of PG from equation 4.1. The forgetting rate for the statistics and the eligibility decay rate were set as β = 1 and λ = 0 for all of the algorithms. Figures 3A and 3B represent the mean and the standard deviation of the angles between the estimates and the exact PG, respectively. These results show that LSD-PG, when estimating the optimal baseline function b*(s, s+1), termed LSD-PG:b*(s, s+1), worked best to estimate the PG. LSD-PG with b(s, s+1) or b*(s, s+1) drastically improved the PG estimation performance for LSD-PG without a baseline function, LSD-PG:None, which was even worse than GPOMPD:V(s), though LSD-PG:None worked better than GPOMDP:None. Thus, we confirmed that these baseline functions for LSD-PG are important.

Figure 3:

Comparison with various PG algorithms for the estimation of the PG over independent 2500 episodes: (A, B) The mean and the standard deviation of the angles between the estimates and the exact PG, respectively.

Figure 3:

Comparison with various PG algorithms for the estimation of the PG over independent 2500 episodes: (A, B) The mean and the standard deviation of the angles between the estimates and the exact PG, respectively.

Finally, we examined the optimization of the policy parameter , that is, the average reward, with these PG methods. In this experiment, the forgetting rate and the eligibility decay rate were set as β = 0.99 and λ = 0. In order to avoid the effect from poor estimations of the functions for the PG estimate, there was a prelearning period of 50 time steps, where the learning rate α was set to zero. It means that the policy remained unchanged in the first 50 time steps. Figure 4 shows the means and the standard deviations of the average rewards at an earlier stage (500 time step) and a later stage (104 time step) over 1000 independent simulations of various learning rates α in order to give comparisons among the PG algorithms for the optimization of the policy parameter. It was confirmed that LSD-PG:b*(s, s+1) worked best, except for the high learning rate, where the learning speed of b*(s, s+1) could not properly follow the changes of the policy rather than that of b(s, s+1). Figure 5 shows the time courses of the average reward, where we chose appropriate learning rates for the PG algorithms by drawing on the previous results; α = 0.16 in LSD-PG:b(s, s+1), α = 0.08 in LSD-PG:b*(s, s+1), α = 0.08 in Actor-Critic:V(s), and α = 0.007 in GPOMDP:V(s). This result also indicates that our LSD-PG algorithm with the optimal baseline function b*(s, s+1) outperformed the other PG algorithms in the sense of realizing both the highest average and the lowest standard deviation of the average rewards.

Figure 4:

Comparisons with various PG algorithms for the means of the average rewards in the three-state torus MDPs with independent 1000 episodes about various learning rates. (A, B) The mean and the standard deviation of the average rewards at 500 time steps, respectively. (C, D) At 104 time step.

Figure 4:

Comparisons with various PG algorithms for the means of the average rewards in the three-state torus MDPs with independent 1000 episodes about various learning rates. (A, B) The mean and the standard deviation of the average rewards at 500 time steps, respectively. (C, D) At 104 time step.

Figure 5:

Comparison with various PG algorithms for the optimization of the policy parameters with the appropriate learning rate in the three-state torus MDPs over independent 1000 episodes. (A, B) Time courses of the mean and the standard deviation of the average rewards, respectively.

Figure 5:

Comparison with various PG algorithms for the optimization of the policy parameters with the appropriate learning rate in the three-state torus MDPs over independent 1000 episodes. (A, B) Time courses of the mean and the standard deviation of the average rewards, respectively.

### 5.2.  Continuous State-Action Problem.

The LSD(λ)-PG and the other existing PG algorithms were also applied to a continuous state-action problem. This task is to balance an inverted pendulum.

#### 5.2.1.  Interpretation of Continuous State Problem.

Although this task obviously violates assumption 1, on which most policy gradient algorithms (including LSD(λ)-PG) have been based, it is valuable to acquire insights into the feasibility of PG algorithms. The reason is due to the following interpretations: a continuous problem to (i) a numerous-state MDP problem with some structures, or (ii) a partially observable MDP (POMDP) problem with belief states (Aberdeen, 2003). While the interpretation of (i) is straightforward, (ii) comes from the fact that when the policy (or the baseline function) in a continuous state problem is represented by a linear function with finite basis functions that output bounded activation values, then the activations biased as nonnegative values and normalized can be regarded as the belief states of a finite-state POMDP (Aberdeen, 2003).

#### 5.2.2.  Pendulum Balancing Problem.

A pendulum balancing problem is a well known benchmark in continuous RL problems (Peters et al., 2005; Morimura, Uchibe, & Doya, 2005). The state consists of the angle and the angular speed of the pendulum, which are limited to the ranges [ − π/6, π/6] and [ − π/2, π/2], respectively, as shown in Figure 6. Its dynamics is given by
where a is the torque as an action selected by a learning agent. The physical parameters are set as m = l = 1, g = 9.8, and μ = 0.01. The reward function is
The state st is initialized with the uniform distributions as and , at the beginning of each episode or when the previous state st−1 deviates from the permissible ranges, that is, |xt−1|>π/6 or . The state is also initialized with the probability 0.01 at each time step in order to explore the state space more efficiently.
Figure 6:

Pendulum balancing problem near the top ranges: x ∈ [ − π/6, π/6] and .

Figure 6:

Pendulum balancing problem near the top ranges: x ∈ [ − π/6, π/6] and .

Since this problem has a potentially infinite number of states, we use a normalized radial basis function (nRBF) model (Doya, 2000) for the following policy and the linear baseline function:
where is the output of the nRBF model with 3 × 3 RBFs as follows. The centers of the RBFs are x ∈ {−π/9, 0, π/9} and . The standard deviations of the RBFs are π/9 and π/3 for x and , respectively. The policy parameter was set to at the beginning of each episode.

We used our LSD(λ)-PGs with optimal baseline function b*(s, s+1) and decent baseline function b(s, s+1), (Konda's) actor-critic:V(s), and GPOMDP:V(s), as in the previous experiment (see section 5.1.2). We set the meta-parameters of those algorithms appropriately: α = 0.2 and β = 0.9995 for the LSD-PGs and actor-critic, and α = 0.04, γ = 0.99, and β = 0.9995 for the GPOMDP. Also, λ for the LSD-PGs with b*(s, s+1) and b(s, s+1), the actor-critic, and the GPOMDP were set to 0.7, 0.5, 0.5, and 0. There was a pre learning period of 103 time steps to avoid any effects from poor PG estimates, where α was set to zero.

Figure 7A shows the time courses of the mean and Figure 7B the standard deviations for the average rewards. We confirmed that the performance of LSD-PG:b(s, s+1) was better than that of LSD-PG:b*(s, s+1). Because its order in the previous MDP problem (see Figure 5) was reversed, it must be caused by the difficulty in learning the parameter of the “optimal” baseline function b*(s, s+1) in equation 4.6 in contrast to of the “decent” one b(s, s+1) in equations 4.8 with the nRBF model. This difference of the difficulties indicates that b*(s, s+1) could be more complex than b(s, s+1) and b*(s, s+1) would need more representation capability for the baseline function approximator. From the comparison of these forms in equations 4.6 and 4.8, these observations would come from the existence of the weights , that is, the weights would often make the estimation of b*(s, s+1) with a function approximation difficult. We also confirmed from Figure 7 that the performance of the LSD-PG:b(s, s+1) was much better than the GPOMDP:V(s) and slightly better than the actor-critic:V(s) algorithm.

Figure 7:

Comparisons with various PG algorithms for the optimization of the policy parameters in the pendulum balancing problem over independent 500 episodes. (A, B) Time courses of the mean and the standard deviation of the average awards, respectively.

Figure 7:

Comparisons with various PG algorithms for the optimization of the policy parameters in the pendulum balancing problem over independent 500 episodes. (A, B) Time courses of the mean and the standard deviation of the average awards, respectively.

## 6.  Related Work

There are two alternative methods that estimate the derivative of the (stationary) state distribution and have already been proposed in Glynn (1991) or Rubinstein (1991) and Ng et al. (2000). However, these are different from our approach and have the following problems. The method in Glynn or Rubinstein from operations research, is called the likelihood ratio gradient or the score function. This method can be problematic as to how to design the recurrence state, since the applicability is limited to regenerative processes (see Baxter & Bartlett, 2001, in detail). The method proposed in Ng et al. is not a direct estimation of the derivative of the state distribution but is done by estimating of the state distribution with density propagation. Accordingly, both methods require knowledge of which state the agent is in, while our method needs only to observe the feature vector of the state.

Meanwhile, there is an average reward PG algorithm (Tsitsiklis & Van Roy, 1999; Konda & Tsitsiklis, 2003; Sutton et al., 2000) that eliminates the use of the forgetting rate by introducing a differential cost function as a solution of Poisson's equation (also known as the average reward Bellman equation in RL). However, since to date this is a unique PG framework proposed for maximizing the average reward,13 more studies of the average reward optimization would be needed and they would have to be significant. At least one possible advantage of the proposed framework over the existing average reward PG method is that a closed-form solution of an optimal baseline function for minimizing the variance bound of PG estimate can be computed by least-squares approaches, while such a solution in conventional PG frameworks has not been explored and would be intractable (Greensmith et al., 2004).

## 7.  Conclusion

Our propositions show that the actual forward and virtual backward Markov chains are closely related and have common properties. Utilizing these properties, we proposed LSD(λ) as an estimation algorithm for the log stationary distribution derivative (LSD) and LSD(λ)-PG as a PG algorithm utilizing the LSD estimate. The experimental results also demonstrated that LSD(λ) worked for λ ∈ [0, 1) and LSD(λ)-PG could learn regardless of the task's requirements for the value of γ to optimize the average reward. At the same time, it has been suggested that there is theoretically no significant difference in performances between the average-reward-based PG methods and the alternative PG methods with forgetting rate γ for the value functions close to one (Tsitsiklis & Van Roy, 2002). This might be true for the case of our proposed PG, LSD-PG, which would mean that LSD-PG might not drastically improve the performances of the PG methods with γ as does the algorithm proposed by Kimura and Kobayashi (1998) when γ was set appropriately. However, it is noted that LSD-PG is free of the setting of γ. In contrast, the learning performances of Konda's actor-critic and the LSD-PG approaches do not seem to be the significantly different, as confirmed in our numerical experiments. This would seem to be because the LSD-PG is just a dual approach compared to Konda's actor-critic approach. In dynamic programming (DP), Wang, Bowling, & Schuurmans (2007) formalize the dual approaches, in which probability distributions are maintained instead of the value functions (primal approaches). In contrast, in PGRL, the LSD-PG approach maintains the LSD instead of the value functions to estimate the policy gradient. Considering the relationships between DP and PGRL, Konda's actor-critic and our LSD-PG approaches correspond to primal and dual approaches in PGRL, respectively14. Since Wang, Lizotte, Bowling, & Schuurmans (2008) also show the advantages of the dual DP approaches in that the dual updates will not diverge even with function approximations because of the constraints caused by the probability distributions, the LSD-PG approaches may share these advantages due to the constraint of equation 3.8. More theoretical and experimental work is necessary to understand the effectiveness further.

Meanwhile, in our numerical experiments, the LSD-PG approaches greatly surpassed the performances of GPOMDP. This may be because LSD-PG and GPOMDP are fundamentally different approaches, since GPOMDP can be regarded as an approach based on a fully model-free estimation of the PG, while LSD-PG can be regarded as a model-based estimation approach under certain conditions (see appendix B for the conditions). Although the latent difficulty of the PG estimation problem is unchanged, the LSD-PG utilizes model information or prior knowledge of the model. Also, since the policy is usually a parametric model and the LSD approximator would be defined as a model similar to the policy model, the utilization of the policy model in LSD-PG may be useful in many cases. Accordingly, it is important for future work to discuss and define a necessary and sufficient basis function for the (linear) LSD approximator based on the parameterization of the policy.

On the other hand, the use of LSD estimation will open up new possibilities for the natural policy gradient learning (Kakade, 2001; Peters, Vijayakumar, & Schaal, 2005; Peters & Schaal, 2008; Morimura, Uchibe, Yoshimoto, & Doya, 2008). It enables us to compute a valid Riemannian metric matrix for the natural gradient, which is effective especially in the large-scale MDPs,
where ι ∈ [0, 1] interpolates the natural policy gradient (ι = 0) and the natural state-action gradient (ι = 1) (Morimura, Uchibe, Yoshimoto et al., 2008).

In addition, the use of LSD estimation might offer novel methods for addressing the trade-offs between exploration and exploitation. This is because LSD gives statistical information about how much of a change of the state stationary distribution is caused by the perturbation of each element of each policy parameter, while the stationary distribution with low entropy would make the exploration difficult.

## Appendix A:  Derivation of Equation 2.4

Using the relation between the forgetting (or discounted) state-value function Vπγ(s) and the average reward (Singh, Jaakkola, & Jordan, 1994)
the derivative of average reward with respect to is given by
A.1
where ∇αAB implies (∇αA)B. The second term is modified as follows:
A.2
A.3
Equation A.2 is given by the property of stationary distribution, equation 2.1. Substituting equation A.3 in equation A.1, we can derive equation 2.4:
A.4

## Appendix B:  as a Model-Based Learning

Here, we show that LSD(λ) can be regarded as a model-based estimation of the LSD under certain conditions.15 First, we consider a model-based LSD estimation. With the following matrix notations,
the recursive equation of the LSD, equation 3.5 is rewritten as
where and . Although this equation is transformed to
B.1
the rank of is equal to , and thus its inverse does not exist, because the backward transition probability matrix has a unique stationary distribution from proposition 1, , that is, . However, is always invertible because of . By utilizing equation 3.8 as the matrix notation, , equation B.1 is transformed to
B.2
Therefore, in a model-based approach for the LSD estimation, the model parameters , , and will be estimated as sufficient statistics , , and , respectively. Then the LSD estimate is computed as
B.3
Under the conditions that the feature vectors are , , …, and and the eligibility decay rate λ is equal to 0, the statistics , , and in the LSD(λ = 0) (see algorithm 1) can be the sufficient statistics for these model parameters as
where T is the time-step in Algorithm 1. Then using equation B.3, the LSD estimate in this LSD algorithm, , is computed as
Therefore, the LSD(λ = 0) algorithm with these table-lookup features of the states is equivalent to the model-based LSD estimation. It is noted that while the LSTD(λ) by Boyan (2002) is regarded as a forward model-based approach for the value function estimation, our LSD(λ) can be regarded as a backward model-based approach for the LSD function estimation.

## Appendix C:  Proof of Proposition 4

If the following equation holds,
C.1
then the transformation to equation 4.4 is obviously true. Because of equation 4.3 and
we know that
C.2
Since a time average is equivalent to a state-action space average in an ergodic Markov chain by assumption 1, equation C.2 is transformed to
C.3
C.4
where equation 3.4 (proposition 2) and equation 3.5 are used for the transformations to equations C.3 and C.4. Therefore, equation C.1 holds.

## Notes

1

Note that the parameter γ has two different meanings: discounting and forgetting. γ is sometimes interpreted as a discounting rate to define the objective function. On the other hand, the role of γ can be regarded as the forgetting rate to enforce a horizon change for the approach of Baxter and Bartlett (2001), where the objective function is the average reward. That is, while the discounting rate is seen as part of the problem, the forgetting rate is part of algorithm. Since we focus on the average reward as the infinite-horizon problem, we use the term forgetting rate for γ in this article.

2

While the log stationary distribution derivative with respect to the policy parameter is sometimes referred to as the likelihood ratio gradient or score function, we call it the LSD in this article.

3

The bold has no relationship with the state-action value function Qπ(s, a).

4

The function diag(a) for a vector denotes the diagonal matrix of A, so .

5

Actually, the classical least-squares approach to would make the LSD estimate biased, because has the different time step LSDs, and . Instead, is minimized for an unbiased LSD estimation, where is a instrumental variable (Young, 1984; Bradtke & Barto, 1996). Such a detailed discussion is given in section 3.3. Before that, we use to enhance readability.

6

While LSD(λ) weighs the two objectives equally, we can instantly extend it to the problem minimizing subject to the constraint of equation 3.8 with the Lagrange multiplier method.

7

Incidentally, although there is calculation of an inverse matrix in the algorithms, a pseudo-inverse matrix may be used instead of direct calculation of the inverse matrix so as to secure stability in numeric calculation.

8

Alternatively, θ can also be updated through the bath gradient method: .

9

Though a baseline might be a constant from a traditional perspective, we call the function defined in equation 4.3 a baseline function for equation 2.3 because it does not add any bias to .

10

The optimal baseline function is computed directly with as . Note that there is a similarity to the optimal baseline in Peters and Schaal (2006).

11

is given by solving the estimating function .

12

The normalizing constant Z(c) was computed analytically with the unnormalized reward function as .

13

Although R-learning also maximizes the average reward, it is based on the value function not the PG algorithm (Sutton & Barto, 1998).

14

However, this is unclear, because the conversion operation on PGRL from the primal to the dual (or from the dual to the primal) is not obvious, while it should be automatous.

15

Model based indicates that target values are estimated through the estimation of the model parameters for these target values.

## References

Aberdeen
,
D.
(
2003
).
Policy-gradient algorithms for partially observable Markov decision processes.
.
Unpublished doctoral dissertation, Australian National University
.
Baird
,
L.
, &
Moore
,
A.
(
1999
).
Gradient descent for general reinforcement learning
. In
M. S. Kearns, S. Solla, & D. Cohn
(Eds.),
Advances in neural information processing systems
,
11
.
Cambridge, MA
:
MIT Press
.
Baxter
,
J.
, &
Bartlett
,
P.
(
2001
).
.
Journal of Artificial Intelligence Research
,
15
,
319
350
.
Baxter
,
J.
,
Bartlett
,
P.
, &
Weaver
,
L.
(
2001
).
.
Journal of Artificial Intelligence Research
,
15
,
351
381
.
Bertsekas
,
D. P.
(
1995
).
Dynamic programming and optimal control
.
Belmont, MA
:
Athena Scientific
.
Bertsekas
,
D. P.
, &
Tsitsiklis
,
J. N.
(
1996
).
Neuro-dynamic programming
.
Belmont, MA
:
Athena Scientific
.
Boyan
,
J. A.
(
2002
).
Technical update: Least-squares temporal difference Learning
.
Machine Learning
,
49
,
233
246
.
,
S. J.
, &
Barto
,
A. G.
(
1996
).
Linear least-squares algorithms for temporal difference learning
.
Machine Learning
,
22
,
33
57
.
Doya
,
K.
(
2000
).
Reinforcement learning in continuous time and space
.
Neural Computation
,
12
,
219
245
.
Glynn
,
P. W.
(
1991
).
Likelihood ratio gradient estimation for stochastic systems
.
Communications of the ACM
,
33
(
10
),
75
84
.
Greensmith
,
E.
,
Bartlett
,
P.
, &
Baxter
,
J.
(
2004
).
Variance reduction techniques for gradient estimates in reinforcement learning
.
Journal of Machine Learning Research
,
5
,
1471
1530
.
Gullapalli
,
V.
(
1990
).
A stochastic reinforcement learning algorithm for learning real-valued functions
.
Neural Networks
,
3
,
671
692
.
,
S.
(
2001
).
Optimizing average reward using discounted rewards
. In
D. Helmbold & B. Williamson
(Eds.),
Annual Conference on Computational Learning Theory
,
14
.
Cambridge, MA
:
MIT Press
.
Kimura
,
H.
, &
Kobayashi
,
S.
(
1998
).
An analysis of actor/critic algorithms using eligibility traces: Reinforcement learning with imperfect value function
.
International Conference on Machine Learning
(pp.
278
286
).
San Franciso
:
Morgan Kaufmann
.
Konda
,
V. S.
, &
Tsitsiklis
,
J. N.
(
2003
).
On actor-critic algorithms
.
SIAM Journal on Control and Optimization
,
42
,
1143
1166
.
Lagoudakis
,
M. G.
, &
Parr
,
R.
(
2003
).
Least-squares policy iteration
.
Journal of Machine Learning Research
,
4
,
1107
1149
.
MacKay
,
D.
(
2003
).
Information theory, inference, and learning algorithms
.
Cambridge
:
Cambridge University Press
.
Morimura
,
T.
,
Uchibe
,
E.
, &
Doya
,
K.
(
2005
).
Utilizing natural gradient in temporal difference reinforcement learning with eligibility traces
. In
International Symposium on Information Geometry and Its Applications
(pp.
256
263
).
Morimura
,
T.
,
Uchibe
,
E.
, &
Doya
,
K.
(
2008
).
Natural actor-critic with baseline adjustment for variance reduction
.
Artificial Life and Robotics
,
13
,
275
279
.
Morimura
,
T.
,
Uchibe
,
E.
,
Yoshimoto
,
J.
, &
Doya
,
K.
(
2007
).
Reinforcement learning with log stationary distribution gradient (Tech. Rep.)
.
Nara
:
Nara Institute of Science and Technology
.
Morimura
,
T.
,
Uchibe
,
E.
,
Yoshimoto
,
J.
, &
Doya
,
K.
(
2008
).
A new natural policy gradient by stationary distribution metric
. In
European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases
.
Berlin
:
Springer-Verlag
.
Morimura
,
T.
,
Uchibe
,
E.
,
Yoshimoto
,
J.
, &
Doya
,
K.
(in press).
A generalized natural actor-critic algorithm
. In
Advances in neural information processing systems
.
Cambridge, MA
:
MIT Press
.
Ng
,
A. Y.
,
Parr
,
R.
, &
Koller
,
D.
(
2000
).
Policy search via density estimation
. In
S. Solla, T. K. Leen, & K.-R. Müller
(Eds.),
Advances in neural information processing systems
,
Cambridge, MA
:
MIT Press
.
Peng
,
J.
, &
Williams
,
R. J.
(
1996
).
Incremental multi-step Q-learning
.
Machine Learning
,
22
,
283
290
.
Peters
,
J.
, &
Schaal
,
S.
(
2006
).
. In
Proceedings of the IEEE International Conference on Intelligent Robots and Systems
.
Piscataway, NJ
:
IEEE Press
.
Peters
,
J.
, &
Schaal
,
S.
(
2008
).
Natural actor-critic
.
Neurocomputing
,
71
,
1180
1190
.
Peters
,
J.
,
Vijayakumar
,
S.
, &
Schaal
,
S.
(
2005
).
Natural actor-critic
. In
European Conference on Machine Learning
.
Berlin
:
Springer-Verlag
.
Rubinstein
,
R. Y.
(
1991
).
How to optimize discrete-event system from a single sample path by the score function method
.
Annals of Operations Research
,
27
,
175
212
.
Schinazi
,
R. B.
(
1999
).
Classical and spatial stochastic processes
.
Boston
:
Birkhauser
.
Singh
,
S. P.
,
Jaakkola
,
T.
, &
Jordan
,
M. I.
(
1994
).
Learning without state-estimation in partially observable Markovian decision processes
. In
International Conference on Machine Learning
(pp.
284
292
).
San Francisco
:
Morgan Kaufmann
.
Sutton
,
R. S.
(
1988
).
Learning to predict by the methods of temporal differences
.
Machine Learning
,
3
,
9
44
.
Sutton
,
R. S.
, &
Barto
,
A. G.
(
1998
).
Reinforcement learning
.
Cambridge, MA
:
MIT Press
.
Sutton
,
R. S.
,
McAllester
,
D.
,
Singh
,
S.
, &
Mansour
,
Y.
(
2000
).
Policy gradient methods for reinforcement learning with function approximation
. In
S. A. Solla, T. K. Leen, & K.-R. Müller
(Eds.),
Advances in neural information processing systems
,
12
.
Cambridge, MA
:
MIT Press
.
Tsitsiklis
,
J. N.
, &
Van Roy
,
B.
(
1999
).
Average cost temporal-difference learning
.
Automatica
,
35
,
1799
1808
.
Tsitsiklis
,
J. N.
, &
Van Roy
,
B.
(
2002
).
On average versus discounted reward temporal-difference learning
.
Machine Learning
,
49
,
179
191
.
Ueno
,
T.
,
Kawanabe
,
M.
,
Mori
,
T.
,
Maeda
,
S.
, &
Ishii
,
S.
(
2008
).
A semiparametric statistical approach to model-free policy evaluation
.
International Conference on Machine Learning
(pp.
857
864
).
New York
:
ACM
.
Wang
,
T.
,
Bowling
,
M.
, &
Schuurmans
,
D.
(
2007
).
Dual representations for dynamic programming and reinforcement learning
. In 2
Proceedings of the IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning
(pp.
44
51
).
Piscataway, NJ
:
IEEE
.
Wang
,
T.
,
Lizotte
,
D.
,
Bowling
,
M.
, &
Schuurmans
,
D.
(
2008
).
Stable dual dynamic programming
. In
J. C. Platt, D. Koller, Y. Singer, & S. Roweis
(Eds.),
Advances in neural information processing systems
,
20
.
Cambridge, MA
:
MIT Press
.
Williams
,
R. J.
(
1992
).
Simple statistical gradient-following algorithms for connectionist reinforcement learning
.
Machine Learning
,
8
,
229
256
.
Young
,
P.
(
1984
).
Recursive esimation and time-series analysis
.
Berlin
:
Springer-Verlag
.
Yu
,
H.
, &
Bertsekas
,
D. P.
(
2006
).
Convergence results for some temporal difference methods based on least squares
(
Tech. Rep. LIDS 2697
).
Cambridge, MA
:
MIT Press
.