A reinforcement learning agent that needs to pursue different goals across episodes requires a goal-conditional policy. In addition to their potential to generalize desirable behavior to unseen goals, such policies may also enable higher-level planning based on subgoals. In sparse-reward environments, the capacity to exploit information about the degree to which an arbitrary goal has been achieved while another goal was intended appears crucial to enabling sample efficient learning. However, reinforcement learning agents have only recently been endowed with such capacity for hindsight. In this letter, we demonstrate how hindsight can be introduced to policy gradient methods, generalizing this idea to a broad class of successful algorithms. Our experiments on a diverse selection of sparse-reward environments show that hindsight leads to a remarkable increase in sample efficiency.

1  Introduction

In a traditional reinforcement learning setting, an agent interacts with an environment in a sequence of episodes, observing states and acting according to a policy that ideally maximizes expected cumulative reward. If an agent is required to pursue different goals across episodes, its goal-conditional policy may be represented by a probability distribution over actions for every combination of state and goal. This distinction between states and goals is particularly useful when the probability of a state transition given an action is independent of the goal pursued by the agent.

Learning such goal-conditional behavior has received significant attention in machine learning and robotics, especially because a goal-conditional policy may generalize desirable behavior to goals that were never encountered by the agent (Schmidhuber & Huber, 1990; Da Silva, Konidaris, & Barto, 2012; Kupcsik, Deisenroth, Peters, & Neumann, 2013; Deisenroth, Englert, Peters, & Fox, 2014; Schaul, Horgan, Gregor, & Silver, 2015; Zhu et al., 2017; Kober, Wilhelm, Oztop, & Peters, 2012; Ghosh, Singh, Rajeswaran, Kumar, & Levine, 2018; Mankowitz et al., 2018; Pathak et al., 2018; Schmidhuber, 2019). Consequently, developing goal-based curricula to facilitate learning has also attracted considerable interest (Fabisch & Metzen, 2014; Florensa, Held, Wulfmeier, Zhang, & Abbeel, 2017; Sukhbaatar et al., 2018; Srivastava, Steunebrink, & Schmidhuber, 2013; Schmidhuber, 2013). In hierarchical reinforcement learning, goal-conditional policies may enable agents to plan using subgoals, which abstracts the details involved in lower-level decisions (Oh, Singh, Lee, & Kohli, 2017; Vezhnevets et al., 2017; Kulkarni, Narasimhan, Saeedi, & Tenenbaum, 2016; Levy, Platt, & Saenko, 2019).

In a typical sparse-reward environment, an agent receives a nonzero reward only upon reaching a goal state. Besides being natural, this task formulation avoids the potentially difficult problem of reward shaping, which often biases the learning process toward suboptimal behavior (Ng, Harada, & Russell, 1999). Unfortunately, sparse-reward environments remain particularly challenging for traditional reinforcement learning algorithms (Andrychowicz et al., 2017; Florensa et al., 2017). For example, consider an agent tasked with traveling between cities. In a sparse-reward formulation, if reaching a desired destination by chance is unlikely, a learning agent will rarely obtain reward signals. At the same time, it seems natural to expect that an agent will learn how to reach the cities it visited regardless of its desired destinations.

In this context, the capacity to exploit information about the degree to which an arbitrary goal has been achieved while another goal was intended is called hindsight. This capacity was recently introduced by Andrychowicz et al. (2017) to off-policy reinforcement learning algorithms that rely on experience replay (Lin, 1992). In earlier work, Karkus, Kupcsik, Hsu, and Lee (2016) introduced hindsight to policy search based on Bayesian optimization (Metzen, Fabisch, & Hansen, 2015). This work was recently extended by Pinsler, Karkus, Kupcsik, Hsu, and Lee (2019).

In this letter, we demonstrate how hindsight can be introduced to policy gradient methods (Williams, 1986, 1992; Sutton, McAllester, Singh, & Mansour, 1999), generalizing this idea to a successful class of reinforcement learning algorithms (Peters & Schaal, 2008; Duan, Chen, Houthooft, Schulman, & Abbeel, 2016).

In contrast to previous work on hindsight, our approach relies on importance sampling (Bishop, 2013). In reinforcement learning, importance sampling has been traditionally employed in order to efficiently reuse information obtained by earlier policies during learning (Precup, Sutton, & Singh, 2000; Peshkin & Shelton, 2002; Jie & Abbeel, 2010; Thomas, Theocharous, & Ghavamzadeh, 2015; Munos, Stepleton, Harutyunyan, & Bellemare, 2016). In comparison, our approach attempts to efficiently learn about different goals using information obtained by the current policy for a specific goal. This approach leads to multiple formulations of a hindsight policy gradient that relate to well-known policy gradient results.

In comparison to conventional (goal-conditional) policy gradient estimators, our proposed estimators lead to remarkable sample efficiency on a diverse selection of sparse-reward environments.

2  Preliminaries

We denote random variables by uppercase letters and assignments to these variables by corresponding lowercase letters. We let Val(X) denote the set of valid assignments to a random variable X. We also omit the subscript that typically relates a probability function to random variables when there is no risk of ambiguity. For instance, we may use p(x) to denote pX(x) and p(y) to denote pY(y).

Consider an agent that interacts with its environment in a sequence of episodes, each of which lasts exactly $T$ time steps. The agent receives a goal $g \in \text{Val}(G)$ at the beginning of each episode. At every time step $t$, the agent observes a state $s_t \in \text{Val}(S_t)$, receives a reward $r(s_t, g) \in \mathbb{R}$, and chooses an action $a_t \in \text{Val}(A_t)$. For simplicity of notation, suppose that $\text{Val}(G)$, $\text{Val}(S_t)$, and $\text{Val}(A_t)$ are finite for every $t$.

In our setting, a goal-conditional policy defines a probability distribution over actions for every combination of state and goal. The same policy is used to make decisions at every time step.

Let $\tau = s_1, a_1, s_2, a_2, \ldots, s_{T-1}, a_{T-1}, s_T$ denote a trajectory. We assume that the probability $p(\tau \mid g, \theta)$ of trajectory $\tau$ given goal $g$ and a policy parameterized by $\theta \in \text{Val}(\Theta)$ is given by
$$p(\tau \mid g, \theta) = p(s_1) \prod_{t=1}^{T-1} p(a_t \mid s_t, g, \theta)\, p(s_{t+1} \mid s_t, a_t).$$
(2.1)

In contrast to a Markov decision process, this formulation allows the probability of a state transition given an action to change across time steps within an episode. More important, it implicitly states that the probability of a state transition given an action is independent of the goal pursued by the agent, which we denote by $S_{t+1} \perp G \mid S_t, A_t$. For every $\tau$, $g$, and $\theta$, we also assume that $p(\tau \mid g, \theta)$ is nonzero and differentiable with respect to $\theta$.

Assuming that $G \perp \Theta$, the expected return $\eta(\theta)$ of a policy parameterized by $\theta$ is given by
$$\eta(\theta) = \mathbb{E}\left[\sum_{t=1}^{T} r(S_t, G) \,\Big|\, \theta\right] = \sum_{g} p(g) \sum_{\tau} p(\tau \mid g, \theta) \sum_{t=1}^{T} r(s_t, g).$$
(2.2)

The action-value function is given by $Q_t^{\theta}(s, a, g) = \mathbb{E}\left[\sum_{t'=t+1}^{T} r(S_{t'}, g) \mid S_t = s, A_t = a, g, \theta\right]$, the value function by $V_t^{\theta}(s, g) = \mathbb{E}\left[Q_t^{\theta}(s, A_t, g) \mid S_t = s, g, \theta\right]$, and the advantage function by $A_t^{\theta}(s, a, g) = Q_t^{\theta}(s, a, g) - V_t^{\theta}(s, g)$.

3  Goal-Conditional Policy Gradients

This section presents results for goal-conditional policies that are analogous to well-known results for conventional policies (Peters & Schaal, 2008). They establish the foundation for the results presented in the next section. Additional proofs are included in appendix A for completeness.

The objective of policy gradient methods is finding policy parameters that achieve maximum expected return. When combined with Monte Carlo techniques (Bishop, 2013), the following result allows pursuing this objective using gradient-based optimization.

Theorem 1
(Intermediary Goal-Conditional Policy Gradient). The gradient $\nabla \eta(\theta)$ of the expected return with respect to $\theta$ is given by
$$\nabla \eta(\theta) = \sum_{g} p(g) \sum_{\tau} p(\tau \mid g, \theta) \left[\sum_{t=1}^{T-1} \nabla \log p(a_t \mid s_t, g, \theta)\right] \left[\sum_{t=1}^{T} r(s_t, g)\right].$$
(3.1)
Proof.
The partial derivative $\partial \eta(\theta)/\partial \theta_j$ of the expected return $\eta(\theta)$ with respect to $\theta_j$ is given by
$$\frac{\partial}{\partial \theta_j} \eta(\theta) = \sum_{g} p(g) \sum_{\tau} \left[\frac{\partial}{\partial \theta_j} p(\tau \mid g, \theta)\right] \sum_{t=1}^{T} r(s_t, g).$$
(3.2)
The likelihood-ratio trick allows rewriting the previous equation as
$$\frac{\partial}{\partial \theta_j} \eta(\theta) = \sum_{g} p(g) \sum_{\tau} p(\tau \mid g, \theta) \left[\frac{\partial}{\partial \theta_j} \log p(\tau \mid g, \theta)\right] \sum_{t=1}^{T} r(s_t, g).$$
(3.3)
Note that
$$\log p(\tau \mid g, \theta) = \log p(s_1) + \sum_{t=1}^{T-1} \log p(a_t \mid s_t, g, \theta) + \sum_{t=1}^{T-1} \log p(s_{t+1} \mid s_t, a_t).$$
(3.4)
Therefore,
$$\frac{\partial}{\partial \theta_j} \eta(\theta) = \sum_{g} p(g) \sum_{\tau} p(\tau \mid g, \theta) \left[\sum_{t=1}^{T-1} \frac{\partial}{\partial \theta_j} \log p(a_t \mid s_t, g, \theta)\right] \sum_{t=1}^{T} r(s_t, g).$$
(3.5)

More conveniently, the following result can be obtained by noting that an action is independent of any previous state given the current state, the goal, and the policy parameters (see section A.2).

Theorem 2
(Goal-Conditional Policy Gradient). The gradient $\nabla \eta(\theta)$ of the expected return with respect to $\theta$ is given by
$$\nabla \eta(\theta) = \sum_{g} p(g) \sum_{\tau} p(\tau \mid g, \theta) \sum_{t=1}^{T-1} \nabla \log p(a_t \mid s_t, g, \theta) \sum_{t'=t+1}^{T} r(s_{t'}, g).$$
(3.6)
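
As a concrete illustration, the following sketch (in Python with NumPy, for a hypothetical tabular softmax policy rather than the networks used in our experiments) computes the Monte Carlo estimate suggested by theorem 2 from a batch of episodes; all names and the toy data are ours.

import numpy as np

# Monte Carlo estimate of the goal-conditional policy gradient (theorem 2) for
# a hypothetical tabular softmax policy with parameters theta[s, g, a].

def action_probs(theta, s, g):
    logits = theta[s, g]
    e = np.exp(logits - logits.max())
    return e / e.sum()

def grad_log_prob(theta, s, g, a):
    # Gradient of log p(a | s, g, theta) for the tabular softmax policy.
    grad = np.zeros_like(theta)
    grad[s, g] = -action_probs(theta, s, g)
    grad[s, g, a] += 1.0
    return grad

def goal_conditional_policy_gradient(episodes, theta, reward_fn):
    # episodes: list of (goal, states, actions), len(states) == len(actions) + 1.
    grad = np.zeros_like(theta)
    for g, states, actions in episodes:
        rewards = [reward_fn(s, g) for s in states]
        returns = np.cumsum(rewards[::-1])[::-1]  # returns[k] = sum of rewards from k onward
        for t, (s, a) in enumerate(zip(states[:-1], actions)):
            # The score function at time step t multiplies the rewards from t + 1 to T.
            grad += grad_log_prob(theta, s, g, a) * returns[t + 1]
    return grad / len(episodes)

# Toy usage with random data (2 states that double as goals, 2 actions).
rng = np.random.default_rng(0)
theta = np.zeros((2, 2, 2))
reward_fn = lambda s, g: float(s == g)
episodes = [(int(g), list(rng.integers(0, 2, size=4)), list(rng.integers(0, 2, size=3)))
            for g in rng.integers(0, 2, size=8)]
print(goal_conditional_policy_gradient(episodes, theta, reward_fn))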

In order to reduce the variance of the gradient estimator, the following result allows employing a so-called baseline (see section A.4).

Theorem 3
(Goal-Conditional Policy Gradient, Baseline Formulation). For every $t$, $\theta$, and associated real-valued (baseline) function $b_t^{\theta}$, the gradient $\nabla \eta(\theta)$ of the expected return with respect to $\theta$ is given by
$$\nabla \eta(\theta) = \sum_{g} p(g) \sum_{\tau} p(\tau \mid g, \theta) \sum_{t=1}^{T-1} \nabla \log p(a_t \mid s_t, g, \theta) \left[\sum_{t'=t+1}^{T} r(s_{t'}, g) - b_t^{\theta}(s_t, g)\right].$$
(3.7)

Section A.7 presents the constant baselines that minimize the (elementwise) variance of the corresponding estimator. However, such baselines are usually impractical to compute (or estimate), and the variance of the estimator may be reduced further by a baseline function that depends on state and goal. Although generally suboptimal, it is typical to let the baseline function btθ approximate the value function Vtθ (Greensmith, Bartlett, & Baxter, 2004).

The action-value function is related to the goal-conditional policy gradient by the following result (see section A.5).

Lemma 1
(Goal-Conditional Policy Gradient, Action-Value Formulation). The gradient $\nabla \eta(\theta)$ of the expected return with respect to $\theta$ is given by
$$\nabla \eta(\theta) = \sum_{g} p(g) \sum_{\tau} p(\tau \mid g, \theta) \sum_{t=1}^{T-1} \nabla \log p(a_t \mid s_t, g, \theta)\, Q_t^{\theta}(s_t, a_t, g).$$
(3.8)

Finally, actor-critic methods may rely on the following result for goal-conditional policies (see section A.6).

Theorem 4
(Goal-Conditional Policy Gradient, Advantage Formulation). The gradient $\nabla \eta(\theta)$ of the expected return with respect to $\theta$ is given by
$$\nabla \eta(\theta) = \sum_{g} p(g) \sum_{\tau} p(\tau \mid g, \theta) \sum_{t=1}^{T-1} \nabla \log p(a_t \mid s_t, g, \theta)\, A_t^{\theta}(s_t, a_t, g).$$
(3.9)

4  Hindsight Policy Gradients

This section presents the novel ideas that introduce hindsight to policy gradient methods. Additional proofs can be found in appendix B.

Importance sampling is a traditional technique used to obtain estimates related to a random variable $X \sim p$ using samples from an arbitrary positive distribution $q$. This technique relies on the following equalities:
$$\mathbb{E}_{p(X)}\left[f(X)\right] = \sum_{x} p(x) f(x) = \sum_{x} \frac{q(x)}{q(x)}\, p(x) f(x) = \mathbb{E}_{q(X)}\left[\frac{p(X)}{q(X)} f(X)\right].$$
(4.1)
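
The identity in equation 4.1 can be checked numerically; the following minimal sketch (NumPy, with toy distributions of our own choosing) recovers an expectation under $p$ from samples drawn from $q$.

import numpy as np

# Numeric check of equation 4.1: the expectation of f(X) under p is recovered
# from samples drawn from q by reweighting each sample with p(x) / q(x).
rng = np.random.default_rng(0)
values = np.array([0, 1, 2])
p = np.array([0.7, 0.2, 0.1])   # distribution of interest
q = np.array([0.2, 0.3, 0.5])   # positive sampling distribution
f = lambda x: x ** 2

exact = np.sum(p * f(values))
samples = rng.choice(values, size=100_000, p=q)
estimate = np.mean(f(samples) * p[samples] / q[samples])
print(exact, estimate)  # the two values should be close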

Suppose that the reward r(s,g) is known for every combination of state s and goal g, as in previous work on hindsight (Andrychowicz et al., 2017; Karkus et al., 2016; Pinsler et al., 2019). In that case, it is possible to evaluate a trajectory obtained while trying to achieve an original goal g' for an alternative goal g. This information can be exploited using a central result based on importance sampling.

Theorem 5
(Every-decision Hindsight Policy Gradient). For an arbitrary (original) goal $g'$, the gradient $\nabla \eta(\theta)$ of the expected return with respect to $\theta$ is given by
$$\nabla \eta(\theta) = \sum_{\tau} p(\tau \mid g', \theta) \sum_{g} p(g) \sum_{t=1}^{T-1} \nabla \log p(a_t \mid s_t, g, \theta) \sum_{t'=t+1}^{T} \left[\prod_{k=1}^{T-1} \frac{p(a_k \mid s_k, g, \theta)}{p(a_k \mid s_k, g', \theta)}\right] r(s_{t'}, g).$$
(4.2)
Proof.
Starting from theorem 2, importance sampling allows rewriting the partial derivative $\partial \eta(\theta)/\partial \theta_j$ as
$$\frac{\partial}{\partial \theta_j} \eta(\theta) = \sum_{g} p(g) \sum_{\tau} \frac{p(\tau \mid g', \theta)}{p(\tau \mid g', \theta)}\, p(\tau \mid g, \theta) \sum_{t=1}^{T-1} \frac{\partial}{\partial \theta_j} \log p(a_t \mid s_t, g, \theta) \sum_{t'=t+1}^{T} r(s_{t'}, g).$$
(4.3)
Using equation 2.1,
$$\frac{\partial}{\partial \theta_j} \eta(\theta) = \sum_{g} p(g) \sum_{\tau} p(\tau \mid g', \theta) \left[\prod_{k=1}^{T-1} \frac{p(a_k \mid s_k, g, \theta)}{p(a_k \mid s_k, g', \theta)}\right] \sum_{t=1}^{T-1} \frac{\partial}{\partial \theta_j} \log p(a_t \mid s_t, g, \theta) \sum_{t'=t+1}^{T} r(s_{t'}, g).$$
(4.4)

In the formulation presented above, every reward is multiplied by the ratio between the likelihood of the corresponding trajectory under an alternative goal and the likelihood under the original goal (see equation 2.1). Intuitively, every reward should instead be multiplied by a likelihood ratio that only considers the corresponding trajectory up to the previous action. This intuition underlies the following important result, named after an analogous result for action-value functions by Precup et al. (2000).

Theorem 6
(Per-decision Hindsight Policy Gradient). For an arbitrary (original) goal $g'$, the gradient $\nabla \eta(\theta)$ of the expected return with respect to $\theta$ is given by
$$\nabla \eta(\theta) = \sum_{\tau} p(\tau \mid g', \theta) \sum_{g} p(g) \sum_{t=1}^{T-1} \nabla \log p(a_t \mid s_t, g, \theta) \sum_{t'=t+1}^{T} \left[\prod_{k=1}^{t'-1} \frac{p(a_k \mid s_k, g, \theta)}{p(a_k \mid s_k, g', \theta)}\right] r(s_{t'}, g).$$
(4.5)
Proof.
Starting from equation 4.4, the partial derivative $\partial \eta(\theta)/\partial \theta_j$ can be rewritten as
$$\frac{\partial}{\partial \theta_j} \eta(\theta) = \sum_{g} p(g) \sum_{t=1}^{T-1} \sum_{t'=t+1}^{T} \sum_{\tau} p(\tau \mid g', \theta) \left[\prod_{k=1}^{T-1} \frac{p(a_k \mid s_k, g, \theta)}{p(a_k \mid s_k, g', \theta)}\right] \frac{\partial}{\partial \theta_j} \log p(a_t \mid s_t, g, \theta)\, r(s_{t'}, g).$$
(4.6)
If we split every trajectory into states and actions before and after $t'$, then $\partial \eta(\theta)/\partial \theta_j$ is given by
$$\sum_{g} p(g) \sum_{t=1}^{T-1} \sum_{t'=t+1}^{T} \sum_{s_{1:t'-1}} \sum_{a_{1:t'-1}} p(s_{1:t'-1}, a_{1:t'-1} \mid g', \theta) \left[\prod_{k=1}^{t'-1} \frac{p(a_k \mid s_k, g, \theta)}{p(a_k \mid s_k, g', \theta)}\right] \frac{\partial}{\partial \theta_j} \log p(a_t \mid s_t, g, \theta)\, z,$$
(4.7)
where $z$ is defined by
$$z = \sum_{s_{t':T}} \sum_{a_{t':T-1}} p(s_{t':T}, a_{t':T-1} \mid s_{1:t'-1}, a_{1:t'-1}, g', \theta) \left[\prod_{k=t'}^{T-1} \frac{p(a_k \mid s_k, g, \theta)}{p(a_k \mid s_k, g', \theta)}\right] r(s_{t'}, g).$$
(4.8)
Using lemma 25 (see section D.2) and canceling terms,
$$z = \sum_{s_{t':T}} \sum_{a_{t':T-1}} p(s_{t'} \mid s_{t'-1}, a_{t'-1}) \left[\prod_{k=t'}^{T-1} p(a_k \mid s_k, g, \theta)\, p(s_{k+1} \mid s_k, a_k)\right] r(s_{t'}, g).$$
(4.9)
Using lemma 25 once again,
$$z = \sum_{s_{t':T}} \sum_{a_{t':T-1}} p(s_{t':T}, a_{t':T-1} \mid s_{1:t'-1}, a_{1:t'-1}, g, \theta)\, r(s_{t'}, g).$$
(4.10)
Using the fact that $S_{t'} \perp G \mid S_{1:t'-1}, A_{1:t'-1}, \Theta$,
$$z = \sum_{s_{t'}} r(s_{t'}, g)\, p(s_{t'} \mid s_{1:t'-1}, a_{1:t'-1}, g, \theta) = \sum_{s_{t'}} r(s_{t'}, g)\, p(s_{t'} \mid s_{1:t'-1}, a_{1:t'-1}, g', \theta).$$
(4.11)
Substituting $z$ into expression 4.7 and returning to an expectation over trajectories,
$$\frac{\partial}{\partial \theta_j} \eta(\theta) = \sum_{\tau} p(\tau \mid g', \theta) \sum_{g} p(g) \sum_{t=1}^{T-1} \frac{\partial}{\partial \theta_j} \log p(a_t \mid s_t, g, \theta) \sum_{t'=t+1}^{T} \left[\prod_{k=1}^{t'-1} \frac{p(a_k \mid s_k, g, \theta)}{p(a_k \mid s_k, g', \theta)}\right] r(s_{t'}, g).$$
(4.12)
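
For intuition, the following sketch computes the per-decision likelihood ratios that appear in theorem 6 for a single trajectory collected under an original goal and evaluated for an alternative goal. The tabular softmax policy and all names are hypothetical.

import numpy as np

# Per-decision likelihood ratios from theorem 6 for a single trajectory that
# was collected under an original goal and is being evaluated for an
# alternative goal (hypothetical tabular softmax policy).

def action_probs(theta, s, g):
    logits = theta[s, g]
    e = np.exp(logits - logits.max())
    return e / e.sum()

def per_decision_ratios(theta, states, actions, goal, original_goal):
    # ratios[j] = prod_{k=1}^{j} p(a_k | s_k, goal) / p(a_k | s_k, original_goal),
    # so the reward r(s_{t'}, goal) is weighted by ratios[t' - 1].
    step = np.array([action_probs(theta, s, goal)[a] /
                     action_probs(theta, s, original_goal)[a]
                     for s, a in zip(states[:-1], actions)])
    return np.concatenate(([1.0], np.cumprod(step)))

rng = np.random.default_rng(1)
theta = rng.normal(size=(3, 3, 2))               # 3 states, 3 goals, 2 actions
states, actions = [0, 1, 2, 1, 0], [1, 0, 1, 0]
print(per_decision_ratios(theta, states, actions, goal=2, original_goal=0))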

The following lemma allows introducing baselines to hindsight policy gradients (see section B.4).

Lemma 2.
For every $g'$, $t$, $\theta$, and associated real-valued (baseline) function $b_t^{\theta}$,
$$\sum_{\tau} p(\tau \mid g', \theta) \sum_{g} p(g) \sum_{t=1}^{T-1} \nabla \log p(a_t \mid s_t, g, \theta) \left[\prod_{k=1}^{t} \frac{p(a_k \mid s_k, g, \theta)}{p(a_k \mid s_k, g', \theta)}\right] b_t^{\theta}(s_t, g) = 0.$$
(4.13)

Section B.7 presents the constant baselines that minimize the (elementwise) variance of the corresponding gradient estimator. By analogy with the conventional practice, we suggest letting the baseline function btθ approximate the value function Vtθ instead.

The action-value function is related to the hindsight policy gradient by the following result (see section B.5).

Lemma 3
(Hindsight Policy Gradient, Action-Value Formulation). For an arbitrary goal $g'$, the gradient $\nabla \eta(\theta)$ of the expected return with respect to $\theta$ is given by
$$\nabla \eta(\theta) = \sum_{\tau} p(\tau \mid g', \theta) \sum_{g} p(g) \sum_{t=1}^{T-1} \nabla \log p(a_t \mid s_t, g, \theta) \left[\prod_{k=1}^{t} \frac{p(a_k \mid s_k, g, \theta)}{p(a_k \mid s_k, g', \theta)}\right] Q_t^{\theta}(s_t, a_t, g).$$
(4.14)

Importantly, the choice of likelihood ratio in lemma 3 is far from unique. However, besides leading to straightforward estimation, it also underlies the advantage formulation presented below.

Theorem 7
(Hindsight Policy Gradient, Advantage Formulation). For an arbitrary (original) goal $g'$, the gradient $\nabla \eta(\theta)$ of the expected return with respect to $\theta$ is given by
$$\nabla \eta(\theta) = \sum_{\tau} p(\tau \mid g', \theta) \sum_{g} p(g) \sum_{t=1}^{T-1} \nabla \log p(a_t \mid s_t, g, \theta) \left[\prod_{k=1}^{t} \frac{p(a_k \mid s_k, g, \theta)}{p(a_k \mid s_k, g', \theta)}\right] A_t^{\theta}(s_t, a_t, g).$$
(4.15)

Fortunately, the following result allows approximating the advantage under a goal using a state transition collected while pursuing another goal (see section D.4).

Theorem 8.
For every $t$ and $\theta$, the advantage function $A_t^{\theta}$ is given by
$$A_t^{\theta}(s, a, g) = \mathbb{E}\left[r(S_{t+1}, g) + V_{t+1}^{\theta}(S_{t+1}, g) - V_t^{\theta}(s, g) \,\Big|\, S_t = s, A_t = a\right].$$
(4.16)
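
In practice, theorem 8 suggests estimating the advantage under an arbitrary goal from a single observed transition, as in the following sketch; the value function approximator value_fn (taking state, goal, and time step) is a hypothetical placeholder.

# One-step advantage estimate suggested by theorem 8: a transition
# (s_t, a_t, s_{t+1}) collected while pursuing any goal yields an estimate of
# the advantage under an arbitrary goal g.

def advantage_estimate(s, s_next, g, t, reward_fn, value_fn):
    return reward_fn(s_next, g) + value_fn(s_next, g, t + 1) - value_fn(s, g, t)

reward_fn = lambda s, g: float(s == g)
value_fn = lambda s, g, t: 0.0                   # e.g., a fitted baseline network in practice
print(advantage_estimate(s=3, s_next=5, g=5, t=2, reward_fn=reward_fn, value_fn=value_fn))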

5  Hindsight Gradient Estimators

This section details gradient estimation based on the results presented in the previous section. The corresponding proofs can be found in appendix C.

Consider a data set (batch) $D = \{(\tau^{(i)}, g^{(i)})\}_{i=1}^{N}$, where each trajectory $\tau^{(i)}$ is obtained using a policy parameterized by $\theta$ in an attempt to achieve a goal $g^{(i)}$ chosen by the environment.

The following result points to a straightforward estimator based on theorem 7 (see section C.1).

Theorem 9.
The per-decision hindsight policy gradient estimator, given by
$$\frac{1}{N} \sum_{i=1}^{N} \sum_{g} p(g) \sum_{t=1}^{T-1} \nabla \log p\!\left(A_t^{(i)} \mid S_t^{(i)}, g, \theta\right) \sum_{t'=t+1}^{T} \left[\prod_{k=1}^{t'-1} \frac{p\!\left(A_k^{(i)} \mid S_k^{(i)}, g, \theta\right)}{p\!\left(A_k^{(i)} \mid S_k^{(i)}, G^{(i)}, \theta\right)}\right] r\!\left(S_{t'}^{(i)}, g\right),$$
(5.1)
is a consistent and unbiased estimator of the gradient $\nabla \eta(\theta)$ of the expected return.
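
The following sketch (Python with NumPy) implements expression 5.1 for a hypothetical tabular softmax policy, assuming that the reward function and the goal distribution are known; it is not the implementation used in our experiments (see section 6.2).

import numpy as np

# Per-decision hindsight policy gradient estimator (expression 5.1) for a
# hypothetical tabular softmax policy theta[s, g, a].

def action_probs(theta, s, g):
    logits = theta[s, g]
    e = np.exp(logits - logits.max())
    return e / e.sum()

def grad_log_prob(theta, s, g, a):
    grad = np.zeros_like(theta)
    grad[s, g] = -action_probs(theta, s, g)
    grad[s, g, a] += 1.0
    return grad

def hindsight_policy_gradient(episodes, theta, reward_fn, goal_probs):
    # episodes: list of (original_goal, states, actions), len(states) == len(actions) + 1.
    grad = np.zeros_like(theta)
    for original_goal, states, actions in episodes:
        T = len(states)
        for g in range(len(goal_probs)):
            # ratios[j] = prod_{k=1}^{j} p(a_k | s_k, g) / p(a_k | s_k, original goal).
            step = np.array([action_probs(theta, s, g)[a] /
                             action_probs(theta, s, original_goal)[a]
                             for s, a in zip(states[:-1], actions)])
            ratios = np.concatenate(([1.0], np.cumprod(step)))
            for t in range(T - 1):               # index t corresponds to time step t + 1
                weighted_return = sum(ratios[tp] * reward_fn(states[tp], g)
                                      for tp in range(t + 1, T))
                grad += (goal_probs[g] * weighted_return
                         * grad_log_prob(theta, states[t], g, actions[t]))
    return grad / len(episodes)

# Toy usage with random data (4 states that double as goals, 2 actions).
rng = np.random.default_rng(0)
theta = np.zeros((4, 4, 2))
reward_fn = lambda s, g: float(s == g)
episodes = [(int(g), list(rng.integers(0, 4, size=5)), list(rng.integers(0, 2, size=4)))
            for g in rng.integers(0, 4, size=4)]
print(hindsight_policy_gradient(episodes, theta, reward_fn, np.full(4, 0.25)))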

In preliminary experiments, we found that this estimator leads to unstable learning progress, which is probably due to its potentially high variance. The following result, inspired by weighted importance sampling (Bishop, 2013), represents our attempt to trade variance for bias (see section C.2).

Theorem 10.
The weighted per-decision hindsight policy gradient estimator, given by
$$\sum_{i=1}^{N} \sum_{g} p(g) \sum_{t=1}^{T-1} \nabla \log p\!\left(A_t^{(i)} \mid S_t^{(i)}, g, \theta\right) \sum_{t'=t+1}^{T} \frac{\prod_{k=1}^{t'-1} \dfrac{p\!\left(A_k^{(i)} \mid S_k^{(i)}, g, \theta\right)}{p\!\left(A_k^{(i)} \mid S_k^{(i)}, G^{(i)}, \theta\right)}}{\sum_{j=1}^{N} \prod_{k=1}^{t'-1} \dfrac{p\!\left(A_k^{(j)} \mid S_k^{(j)}, g, \theta\right)}{p\!\left(A_k^{(j)} \mid S_k^{(j)}, G^{(j)}, \theta\right)}}\, r\!\left(S_{t'}^{(i)}, g\right),$$
(5.2)
is a consistent estimator of the gradient $\nabla \eta(\theta)$ of the expected return.
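
The normalization that distinguishes expression 5.2 from expression 5.1 can be sketched as follows: for every combination of alternative goal and time step, each trajectory's per-decision likelihood ratio is divided by the sum of the corresponding ratios across the batch. The array layout below is our own choice.

import numpy as np

# Normalization used by the weighted per-decision estimator (expression 5.2),
# applied to an array of per-decision ratios with layout (batch, goal, time step).

def normalize_ratios(ratios):
    return ratios / ratios.sum(axis=0, keepdims=True)

batch_ratios = np.array([[[1.0, 0.5, 0.25]],     # trajectory 1, one alternative goal, T = 3
                         [[1.0, 2.0, 4.0]]])     # trajectory 2
print(normalize_ratios(batch_ratios))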

In simple terms, the likelihood ratio for every combination of trajectory, (alternative) goal, and time step is normalized across trajectories by this estimator. In section C.3, we present a result that enables the corresponding consistency-preserving weighted baseline.

Consider a set $G^{(i)} = \{g \in \text{Val}(G) \mid \text{there exists a } t \text{ such that } r(s_t^{(i)}, g) \neq 0\}$ composed of so-called active goals during the $i$th episode. The feasibility of the proposed estimators relies on the fact that only active goals correspond to nonzero terms inside the expectation over goals in expressions 5.1 and 5.2. In many natural sparse-reward environments, active goals will correspond directly to states visited during episodes (e.g., the cities visited while trying to reach other cities), which enables computing said expectation exactly when the goal distribution is known.
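
The following minimal sketch (names ours) computes the active goals for one episode directly from this definition; when goals correspond to states, it reduces to collecting the visited states that yield nonzero reward.

# Active goals for one episode: a goal is active exactly if some visited state
# yields a nonzero reward for it.

def active_goals(states, goals, reward_fn):
    return {g for g in goals if any(reward_fn(s, g) != 0 for s in states)}

reward_fn = lambda s, g: float(s == g)           # reward only when the state matches the goal
print(active_goals(states=[0, 3, 5, 5, 7], goals=range(10), reward_fn=reward_fn))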

The proposed estimators have remarkable properties that differentiate them from previous (weighted) importance sampling estimators for off-policy learning. For instance, although a trajectory is often more likely under the original goal than under an alternative goal, in policies with strong optimal substructure, a high probability of a trajectory between the state a and the goal (state) c that goes through the state b may naturally allow for a high probability of the corresponding (sub)trajectory between the state a and the goal (state) b. In other cases, the (unnormalized) likelihood ratios may become very small for some (alternative) goals after a few time steps across all trajectories. After normalization, in the worst case, this may even lead to equivalent ratios for such goals for a given time step across all trajectories. In any case, it is important to note that only likelihood ratios associated with active goals for a given episode will affect the gradient estimate. Additionally, an original goal will always have (unnormalized) likelihood ratios equal to one for the corresponding episode.

Under mild additional assumptions, the proposed estimators also allow using a data set containing goals chosen arbitrarily (instead of goals drawn from the goal distribution). Although this feature is not required by our experiments, we believe that it may be useful to circumvent catastrophic forgetting during curriculum learning (McCloskey & Cohen, 1989; Kirkpatrick et al., 2017).

6  Experiments

This section reports results of an empirical comparison between goal-conditional policy gradient estimators and hindsight policy gradient estimators. Because there are no well-established sparse-reward environments intended to test agents under multiple goals, our experiments focus on our own selection of environments, which is described in section 6.1. Section 6.2 details the implementation of estimators, policies, and baselines. Section 6.3 documents our experimental protocol. Section 6.4 analyzes the results of the corresponding experiments. Unabridged results are presented in section 6.5. Section 6.6 provides a supplementary empirical study of likelihood ratios, and section 6.7 contains an empirical comparison with hindsight experience replay.

6.1  Environments

The environments presented in this section are diverse in terms of stochasticity, state-space dimensionality and size; relationship between goals and states; and number of actions. In every one of these environments, the agent receives the remaining number of time steps plus one as a reward for reaching the goal state, which also ends the episode. In every other situation, the agent receives no reward.

6.1.1  Bit Flipping Environment

The agent starts every episode in the same state (0, represented by k bits), and its goal is to reach a randomly chosen state. The actions allow the agent to toggle (flip) each bit individually. The maximum number of time steps is k+1. Despite its apparent simplicity, this environment is an ideal test bed for reinforcement learning algorithms intended to deal with sparse rewards, since obtaining a reward by chance is unlikely even for a relatively small k. Andrychowicz et al. (2017) employed a similar environment to evaluate their hindsight approach.
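
A minimal sketch of such an environment is given below; the class name, interface, and the exact timing of the reward are our own reading of the description above, not the code used in the experiments.

import numpy as np

# Minimal sketch of a bit flipping environment: the agent starts at the
# all-zeros state, actions toggle individual bits, and reaching the goal yields
# the remaining number of time steps plus one as reward and ends the episode.

class BitFlippingEnv:
    def __init__(self, k=8, seed=None):
        self.k = k
        self.horizon = k + 1                     # maximum number of time steps
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.state = np.zeros(self.k, dtype=int) # the agent always starts at state 0
        self.goal = self.rng.integers(0, 2, size=self.k)
        self.t = 1
        return self.state.copy(), self.goal.copy()

    def step(self, action):
        self.state[action] ^= 1                  # toggle the chosen bit
        self.t += 1
        reached = np.array_equal(self.state, self.goal)
        reward = self.horizon - self.t + 1 if reached else 0
        done = reached or self.t == self.horizon
        return self.state.copy(), reward, done

env = BitFlippingEnv(k=8, seed=0)
state, goal = env.reset()
state, reward, done = env.step(action=3)         # toggle bit 3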

6.1.2  Grid World Environments

The agent starts every episode in a (possibly random) position on an 11×11 grid, and its goal is to reach a randomly chosen (noninitial) position. Some of the positions on the grid may contain impassable obstacles (walls). The actions allow the agent to move in the four cardinal directions. Moving toward walls causes the agent to remain in its current position. A state or goal is represented by a pair of integers between 0 and 10. The maximum number of time steps is 32. In the empty room environment, the agent starts every episode in the upper left corner of the grid, and there are no walls. In the four rooms environment (Sutton, Precup, & Singh, 1999), the agent starts every episode in one of the four corners of the grid (see Figure 1). There are walls that partition the grid into four rooms, such that each room provides access to two other rooms through single openings (doors). With probability 0.2, the action chosen by the agent is ignored and replaced by a random action.

6.1.3  Ms. Pac-man Environment

In this variant of the homonymous game for ATARI 2600 (see Figure 2), the agent starts every episode close to the center of the map, and its goal is to reach a randomly chosen (noninitial) position on a 14×19 grid defined on the game screen. The actions allow the agent to move in the four cardinal directions for 13 game ticks. A state is represented by the result of preprocessing a sequence of game screens (images) as described in section 6.2. A goal is represented by a pair of integers. The maximum number of time steps is 28, although an episode will also end if the agent is captured by an enemy. In comparison to the grid world environments considered in the previous section, this environment is additionally challenging due to its high-dimensional states and the presence of enemies.
Figure 1: Four rooms.

Figure 2: Ms. Pac-man.

6.1.4  FetchPush Environment

This is a variant of the environment recently proposed by Plappert et al. (2018) to assess goal-conditional policy learning algorithms in a challenging task of practical interest (see Figure 3). In a simulation, a robotic arm with seven degrees of freedom is required to push a randomly placed object (block) toward a randomly chosen position. The arm starts every episode in the same configuration. In contrast to the original environment, the actions in our variant allow increasing the desired velocity of the gripper along each of two orthogonal directions by ±0.1 or ±1, leading to a total of eight actions. A state is represented by a 28-dimensional real vector that contains the following information: positions of the gripper and block; rotational and positional velocities of the gripper and block; relative position of the block with respect to the gripper; state of the gripper; and current desired velocity of the gripper along each direction. A goal is represented by three coordinates. The maximum number of time steps is 50.
Figure 3: FetchPush.

6.2  Implementation

Importantly, the weighted per-decision hindsight policy gradient estimator used in our experiments (HPG) does not precisely correspond to expression 5.2. First, the original estimator requires a constant number of time steps T, which would often require the agent to act beyond the end of an episode in the environments that we consider. Second, although it is feasible to compute expression 5.2 exactly when the goal distribution is known (as explained in section 5), we sometimes subsample the sets of active goals per episode. Furthermore, when including a baseline that approximates the value function, we again consider only active goals, which by itself generally results in an inconsistent estimator (HPG+B). As will become evident in the following sections, these compromised estimators still lead to remarkable sample efficiency.

In every experiment, a policy is represented by a feedforward neural network with a softmax output layer. The input to such a policy is a pair composed of state and goal. A baseline function is represented by a feedforward neural network with a single (linear) output neuron. The input to such a baseline function is a triple composed of state, goal, and time step. The baseline function is trained to approximate the value function using the mean squared (one-step) temporal difference error (Sutton & Barto, 1998). Parameters are updated using Adam (Kingma & Ba, 2014). The networks are given by the following.

6.2.1  Bit Flipping Environments and Grid World Environments

Both policy and baseline networks have two hidden layers, each with 256 hyperbolic tangent units. Every weight is initially drawn from a gaussian distribution with mean 0 and standard deviation 0.01 (and redrawn if far from the mean by two standard deviations), and every bias is initially zero.
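
A sketch of these networks is given below in PyTorch, which is an assumption: the text does not name the framework, and the gaussian weight initialization described above is omitted for brevity.

import torch
import torch.nn as nn

# Policy and baseline networks for the bit flipping and grid world experiments:
# two hidden layers of 256 hyperbolic tangent units each, a softmax output
# layer for the policy, and a single linear output neuron for the baseline.

def make_policy(state_dim, goal_dim, n_actions):
    # Input: concatenated state and goal; output: action probabilities.
    return nn.Sequential(
        nn.Linear(state_dim + goal_dim, 256), nn.Tanh(),
        nn.Linear(256, 256), nn.Tanh(),
        nn.Linear(256, n_actions), nn.Softmax(dim=-1))

def make_baseline(state_dim, goal_dim):
    # Input: concatenated state, goal, and time step; output: value estimate.
    return nn.Sequential(
        nn.Linear(state_dim + goal_dim + 1, 256), nn.Tanh(),
        nn.Linear(256, 256), nn.Tanh(),
        nn.Linear(256, 1))

policy = make_policy(state_dim=16, goal_dim=16, n_actions=16)   # e.g., 16-bit flipping
baseline = make_baseline(state_dim=16, goal_dim=16)
optimizer = torch.optim.Adam(list(policy.parameters()) + list(baseline.parameters()))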

6.2.2  Ms. Pac-Man Environment

The policy network is represented by a convolutional neural network. The network architecture is given by a convolutional layer with 32 filters (8×8, stride 4); convolutional layer with 64 filters (4×4, stride 2); convolutional layer with 64 filters (3×3, stride 1); and three fully connected layers, each with 256 units. Every unit uses a hyperbolic tangent activation function. Every weight is initially set using variance scaling (Glorot & Bengio, 2010), and every bias is initially zero. These design decisions are similar to the ones made by Mnih et al. (2015).
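
The following PyTorch sketch (the framework is again an assumption) mirrors the architecture described above, with the goal coordinates concatenated with the flattened convolutional features.

import torch
import torch.nn as nn

# Convolutional policy for Ms. Pac-Man: three convolutional layers over an
# 84x84x4 input, goal coordinates concatenated with the flattened features,
# three fully connected layers of 256 units, and a softmax output.

class ConvPolicy(nn.Module):
    def __init__(self, n_actions=4, goal_dim=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.Tanh(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.Tanh(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.Tanh())
        self.fc = nn.Sequential(
            nn.Linear(64 * 7 * 7 + goal_dim, 256), nn.Tanh(),
            nn.Linear(256, 256), nn.Tanh(),
            nn.Linear(256, 256), nn.Tanh(),
            nn.Linear(256, n_actions), nn.Softmax(dim=-1))

    def forward(self, frames, goal):
        # frames: (batch, 4, 84, 84); goal: (batch, goal_dim)
        features = self.conv(frames).flatten(start_dim=1)
        return self.fc(torch.cat([features, goal], dim=1))

policy = ConvPolicy()
probs = policy(torch.zeros(1, 4, 84, 84), torch.zeros(1, 2))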

A sequence of images obtained from the Arcade Learning Environment (Bellemare, Naddaf, Veness, & Bowling, 2013) is preprocessed as follows. Individually for each color channel, an elementwise maximum operation is employed between two consecutive images to reduce rendering artifacts. Such a 210×160×3 preprocessed image is converted to grayscale, cropped, and rescaled into an 84×84 image $x_t$. A sequence of images $x_{t-12}, x_{t-8}, x_{t-4}, x_t$ obtained in this way is stacked into an 84×84×4 image, which is an input to the policy network (recall that each action is repeated for 13 game ticks). The goal information is concatenated with the flattened output of the last convolutional layer.

6.2.3  FetchPush Environment

The policy network has three hidden layers, each with 256 hyperbolic tangent units. Every weight is initially set using variance scaling (Glorot & Bengio, 2010), and every bias is initially zero.

6.3  Evaluation

We assess sample efficiency through learning curves and average performance scores, obtained as follows. After collecting a number of batches (composed of trajectories and goals), each of which enables one step of gradient ascent, an agent undergoes evaluation. During evaluation, the agent interacts with the environment for a number of episodes, selecting actions with maximum probability according to its policy. A learning curve shows the average return obtained during each evaluation step, averaged across multiple runs (independent learning procedures). The curves presented in this text also include a 95% bootstrapped confidence interval. The average performance is given by the average return across evaluation steps, averaged across runs. During both training and evaluation, goals are drawn uniformly at random. Note that there is no held-out set of goals for evaluation, since we are interested in evaluating sample efficiency instead of generalization.
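
As an illustration of the evaluation statistic, the following sketch computes an average return across runs together with a 95% bootstrapped confidence interval by resampling runs with replacement; the exact bootstrap procedure behind the figures may differ.

import numpy as np

# Average return across runs at one evaluation step, with a 95% bootstrapped
# confidence interval obtained by resampling runs with replacement.

def bootstrap_ci(returns_per_run, n_resamples=10_000, seed=0):
    rng = np.random.default_rng(seed)
    returns_per_run = np.asarray(returns_per_run)
    means = [rng.choice(returns_per_run, size=len(returns_per_run), replace=True).mean()
             for _ in range(n_resamples)]
    return returns_per_run.mean(), np.percentile(means, [2.5, 97.5])

mean, (low, high) = bootstrap_ci([4.2, 4.7, 4.5, 4.9, 4.1])
print(mean, low, high)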

For every combination of environment and batch size, grid search is used to select hyperparameters for each estimator according to average performance scores (after the corresponding standard deviation across runs is subtracted, as suggested by Duan et al., 2016). Definitive results are obtained by using the best hyperparameters found for each estimator in additional runs. In most cases, we present definitive results for small (2) and medium (16) batch sizes.

Tables 1, 2, and 3 document the experimental settings. The number of runs, training batches, and batches between evaluations are reported separately for hyperparameter search and definitive runs. The number of training batches is adapted according to how soon each estimator leads to apparent convergence. Note that it is very difficult to establish this setting before hyperparameter search. The number of batches between evaluations is adapted so that there are 100 evaluation steps in total.

Table 1:

Experimental Settings for the Bit Flipping Environments.

                                          Bit Flipping (8 bits)        Bit Flipping (16 bits)
                                          Batch Size 2  Batch Size 16  Batch Size 2  Batch Size 16
Runs (definitive)                         20            20             20            20
Training batches (definitive)             5000          1400           15000         1000
Batches between evaluations (definitive)  50            14             150           10
Runs (search)                             10            10             10            10
Training batches (search)                 4000          1400           4000          1000
Batches between evaluations (search)      40            14             40            10
Policy learning rates                     R1            R1             R1            R1
Baseline learning rates                   R1            R1             R1            R1
Episodes per evaluation                   256           256            256           256
Maximum active goals per episode
Table 2:

Experimental Settings for the Grid World Environments.

                                          Empty Room                   Four Rooms
                                          Batch Size 2  Batch Size 16  Batch Size 2  Batch Size 16
Runs (definitive)                         20            20             20            20
Training batches (definitive)             2200          200            10,000        1700
Batches between evaluations (definitive)  22            2              100           17
Runs (search)                             10            10             10            10
Training batches (search)                 2500          800            10,000        3500
Batches between evaluations (search)      25            8              100           35
Policy learning rates                     R1            R1             R1            R1
Baseline learning rates                   R1            R1             R1            R1
Episodes per evaluation                   256           256            256           256
Maximum active goals per episode
Table 3:

Experimental Settings for the Ms. Pac-Man and FetchPush Environments.

                                          Ms. Pac-Man                  FetchPush
                                          Batch Size 2  Batch Size 16  Batch Size 2  Batch Size 16
Runs (definitive)                         10            10             10            10
Training batches (definitive)             40,000        12,500         40,000        12,500
Batches between evaluations (definitive)  400           125            400           125
Runs (search)
Training batches (search)                 40,000        12,000         40,000        15,000
Batches between evaluations (search)      800           120            800           300
Policy learning rates                     R2            R2             R2            R2
Episodes per evaluation                   240           240            512           512
Maximum active goals per episode

Other settings include the sets of policy and baseline learning rates under consideration for hyperparameter search, and the number of active goals subsampled per episode. In Tables 1, 2, and 3, $R_1 = \{\alpha \times 10^{-k} \mid \alpha \in \{1, 5\} \text{ and } k \in \{2, 3, 4, 5\}\}$ and $R_2 = \{\beta \times 10^{-5} \mid \beta \in \{1, 2.5, 5, 7.5, 10\}\}$.

As already mentioned, the definitive runs use the best combination of hyperparameters (learning rates) found for each estimator. Every setting was carefully chosen during preliminary experiments to ensure that the best result for each estimator is representative. In particular, the best-performing learning rates rarely lie on the extrema of the corresponding search range. In the single case where the best-performing learning rate found by hyperparameter search for a goal-conditional policy gradient estimator was such an extreme value (FetchPush, for a small batch size), evaluating one additional learning rate led to decreased average performance.

6.4  Analysis

This section summarizes the unabridged results presented in section 6.5 (Table 4, Figures 4 to 27).

6.4.1  Bit Flipping Environments

Figure 4 presents the learning curves for k=8. Goal-conditional policy gradient estimators with and without an approximate value function baseline (GCPG+B and GCPG, respectively) obtain excellent policies and lead to comparable sample efficiency. HPG+B obtains excellent policies more than 400 batches earlier than these estimators, but its policies degrade upon additional training. Additional experiments strongly suggest that the main cause of this issue is the fact that the value function baseline is still very poorly fit by the time that the policy exhibits desirable behavior. In comparison, HPG obtains excellent policies as early as HPG+B, but its policies remain remarkably stable upon additional training.

The learning curves for k=16 are presented in Figure 5. Clearly, both GCPG and GCPG+B are unable to obtain policies that perform better than chance, which is explained by the fact that they rarely incorporate reward signals during training. Confirming the importance of hindsight, HPG leads to stable and sample efficient learning. Although HPG+B also obtains excellent policies, they deteriorate upon additional training.

Similar results can be observed for a small batch size (see section 6.5.3). The average performance results documented in section 6.5.1 confirm that HPG leads to remarkable sample efficiency. Importantly, sections 6.5.4 and 6.5.5 present hyperparameter sensitivity plots suggesting that HPG is less sensitive to hyperparameter settings than the other estimators. A hyperparameter sensitivity plot displays the average performance achieved by each hyperparameter setting (sorted from best to worst along the horizontal axis). Section 6.5.5 also documents an ablation study where the likelihood ratios are removed from HPG, which notably promotes increased hyperparameter sensitivity. This study confirms the usefulness of the correction prescribed by importance sampling.

6.4.2  Grid World Environments

Figure 6 shows the learning curves for the empty room environment. Clearly, every estimator obtains excellent policies, although HPG and HPG+B improve sample efficiency by at least 200 batches. The learning curves for the four-rooms environment are presented in Figure 7. In this surprisingly challenging environment, every estimator obtains unsatisfactory policies. However, it is still clear that HPG and HPG+B improve sample efficiency. In contrast to the experiments presented in the previous section, HPG+B does not give rise to instability, which we attribute to easier value function estimation. Similar results can be observed for a small batch size (see section 6.5.3). HPG achieves the best average performance in every grid world experiment except for a single case, where the best average performance is achieved by HPG+B (see section 6.5.1). The hyperparameter sensitivity plots presented in sections 6.5.4 and 6.5.5 once again suggest that HPG is less sensitive to hyperparameter choices and that ignoring likelihood ratios promotes increased sensitivity (at least in the four-rooms environment).

6.4.3  Ms. Pac-Man Environment

Figure 8 presents the learning curves for a medium batch size. Approximate value function baselines are excluded from this experiment due to the significant cost of systematic hyperparameter search. Although HPG obtains better policies during early training, GCPG obtains better final policies. However, for such a medium batch size, only 3 active goals per episode (out of potentially 28) are subsampled for HPG. Although this harsh subsampling brings computational efficiency, it also appears to handicap the estimator. This hypothesis is supported by the fact that HPG outperforms GCPG for a small batch size, when all active goals are used (see sections 6.5.1 and 6.5.3). Policies obtained using each estimator are illustrated by videos included on the project website.

6.4.4  FetchPush Environment

Figure 9 presents the learning curves for a medium batch size. HPG obtains good policies after a reasonable number of batches, in sharp contrast to GCPG. For such a medium batch size, only 3 active goals per episode (out of potentially 50) are subsampled for HPG, showing that subsampling is a viable alternative to reduce the computational cost of hindsight. Similar results are observed for a small batch size, when all active goals are used (see sections 6.5.1 and 6.5.3). Policies obtained using each estimator are illustrated by videos included on the project website.

6.5  Results

6.5.1  Average Performance Results

Table 4:

Definitive Average Performance Results.

         Bit Flipping (8 bits)         Bit Flipping (16 bits)
         Batch Size 2   Batch Size 16  Batch Size 2   Batch Size 16
HPG      4.60 ± 0.06    4.72 ± 0.02    7.11 ± 0.12    7.39 ± 0.24
GCPG     1.81 ± 0.61    3.44 ± 0.30    0.00 ± 0.00    0.00 ± 0.00
HPG+B    3.40 ± 0.46    4.04 ± 0.10    5.35 ± 0.40    6.09 ± 0.29
GCPG+B   0.64 ± 0.58    3.31 ± 0.58    0.00 ± 0.00    0.00 ± 0.00

         Empty Room                    Four Rooms
         Batch Size 2   Batch Size 16  Batch Size 2   Batch Size 16
HPG      20.22 ± 0.37   16.83 ± 0.84   7.38 ± 0.16    8.75 ± 0.12
GCPG     12.54 ± 1.01   10.96 ± 1.24   4.64 ± 0.57    6.12 ± 0.54
HPG+B    19.90 ± 0.29   17.12 ± 0.44   7.28 ± 1.28    8.08 ± 0.18
GCPG+B   12.69 ± 1.16   10.68 ± 1.36   4.26 ± 0.55    6.61 ± 0.49

         Ms. Pac-Man                   FetchPush
         Batch Size 2   Batch Size 16  Batch Size 2   Batch Size 16
HPG      6.58 ± 1.96    6.80 ± 0.64    6.10 ± 0.34    13.15 ± 0.40
GCPG     5.29 ± 1.67    6.92 ± 0.58    3.48 ± 0.15    4.42 ± 0.28

6.5.2  Learning Curves (Batch Size 16)

Figure 4: Bit flipping (k=8).

Figure 5: Bit flipping (k=16).

Figure 6: Empty room.

Figure 7: Four rooms.

Figure 8: Ms. Pac-man.

Figure 9: FetchPush.

6.5.3  Learning Curves (Batch Size 2)

Figure 10: Bit flipping (k=8).

Figure 11: Bit flipping (k=16).

Figure 12: Empty room.

Figure 13: Four rooms.

Figure 14: Ms. Pac-man.

Figure 15: FetchPush.

6.5.4  Hyperparameter Sensitivity Plots (Batch Size 16)

Figure 16: Bit flipping (k=8).

Figure 17: Bit flipping (k=16).

Figure 18: Empty room.

Figure 19: Four rooms.

Figure 20: Ms. Pac-man.

Figure 21: FetchPush.

6.5.5  Hyperparameter Sensitivity Plots (Batch Size 2)

Figure 22: Bit flipping (k=8).

Figure 23: Bit flipping (k=16).

Figure 24: Empty room.

Figure 25: Four rooms.

Figure 26: Ms. Pac-man.

Figure 27: FetchPush.

Figure 28: Bit flipping (k=8, batch size 16).

Figure 29: Bit flipping (k=16, batch size 16).

Figure 30: Empty room (batch size 16).

6.6  Likelihood Ratio Study

This section presents a study of the active (normalized) likelihood ratios computed by agents during training. A likelihood ratio is considered active if and only if it multiplies a nonzero reward (see expression 5.2). Note that only these likelihood ratios affect gradient estimates based on HPG.

This study is conveyed through plots that encode the distribution of active likelihood ratios computed during training, individually for each time step within an episode. Each plot corresponds to an agent that employs HPG and obtains the highest definitive average performance for a given environment (see Figures 28 to 33). Note that the length of the largest bar for a given time step is fixed to aid visualization.

Figure 31: Four rooms (batch size 16).

Figure 32: Ms. Pac-Man (batch size 16).

Figure 33: FetchPush (batch size 16).

The most important insight provided by these plots is that likelihood ratios behave very differently across environments, even for equivalent time steps (e.g., compare bit flipping environments to grid world environments). In contrast, after the first time step, the behavior of likelihood ratios changes slowly across time steps within the same environment. In any case, alternative goals have a significant effect on gradient estimates, which agrees with the results presented in the previous sections.

6.7  Hindsight Experience Replay Study

This section documents an empirical comparison between goal-conditional policy gradients (GCPG), hindsight policy gradients (HPG), deep Q-networks (Mnih et al., 2015, DQN), and a combination of DQN and hindsight experience replay (Andrychowicz et al., 2017, DQN+HER).

6.7.1  Experience Replay

Our implementations of both DQN and DQN+HER are based on OpenAI Baselines (Dhariwal et al., 2017) and use mostly the same hyperparameters that Andrychowicz et al. (2017) used in their experiments on environments with discrete action spaces, all of which resemble our bit flipping environments. The only notable differences in our implementations are the lack of both Polyak averaging and temporal difference target clipping.

Concretely, a cycle begins when an agent collects a number of episodes (16) by following an ε-greedy policy derived from its deep Q-network (ε=0.2). The corresponding transitions are included in a replay buffer, which contains at most $10^6$ transitions. In the case of DQN+HER, hindsight transitions derived from a final strategy are also included in this replay buffer (Andrychowicz et al., 2017, sec. 4.5). At the end of a cycle, for each of a total of 40 different batches, a batch composed of 128 transitions chosen at random from the replay buffer is used to define a loss function and allow one step of gradient-based minimization. The targets required to define these loss functions are computed using a copy of the deep Q-network from the start of the corresponding cycle. Parameters are updated using Adam (Kingma & Ba, 2014). A discount factor of γ=0.98 is used and seems necessary to improve the stability of both DQN and DQN+HER.
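
As an illustration of the final strategy, the following sketch (with a hypothetical transition layout) stores each transition twice: once with the original goal and once with the goal achieved at the end of the episode, with the reward recomputed accordingly.

# Hindsight transitions under the "final" strategy (hypothetical layout; see
# Andrychowicz et al., 2017, sec. 4.5): every transition is also stored with
# the goal achieved at the end of the episode.

def add_episode_with_hindsight(buffer, episode, goal, achieved_goal, reward_fn):
    for s, a, s_next, t in episode:
        buffer.append((s, goal, t, a, reward_fn(s_next, goal), s_next))
        buffer.append((s, achieved_goal, t, a, reward_fn(s_next, achieved_goal), s_next))

buffer = []
episode = [(0, 1, 3, 1), (3, 0, 5, 2)]           # (state, action, next state, time step)
reward_fn = lambda s, g: float(s == g)
add_episode_with_hindsight(buffer, episode, goal=7, achieved_goal=5, reward_fn=reward_fn)
print(len(buffer))                               # four transitions: two original, two hindsight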

6.7.2  Network Architectures

In every experiment, the deep Q-network is implemented by a feedforward neural network with a linear output neuron corresponding to each action. The input to such a network is a triple composed of state, goal, and time step. The network architectures are the same as those described in section 6.2, except that every weight is initially set using variance scaling (Glorot & Bengio, 2010), and all hidden layers use rectified linear units (Nair & Hinton, 2010). For the Ms. Pac-Man environment, the time step information is concatenated with the flattened output of the last convolutional layer (together with the goal information). In comparison to the architecture employed by Andrychowicz et al. (2017) for environments with discrete action spaces, our architectures have one or two additional hidden layers (besides the convolutional architecture used for Ms. Pac-Man).

6.7.3  Experimental Protocol

The experimental protocol employed in our comparison is very similar to the one described in section 6.3. Each agent is evaluated periodically, after a number of cycles that depends on the environment. During this evaluation, the agent collects a number of episodes by following a greedy policy derived from its deep Q-network.

For each environment, grid search is used to select the learning rates for both DQN and DQN+HER according to average performance scores (after the corresponding standard deviation across runs is subtracted, as described in section 6.3). The candidate sets of learning rates are the following: bit flipping and grid world environments: $\{\alpha \times 10^{-k} \mid \alpha \in \{1, 5\} \text{ and } k \in \{2, 3, 4, 5\}\}$; FetchPush: $\{10^{-2}, 5 \times 10^{-3}, 10^{-3}, 5 \times 10^{-4}, 10^{-4}\}$; and Ms. Pac-Man: $\{10^{-3}, 5 \times 10^{-4}, 10^{-4}, 5 \times 10^{-5}, 10^{-5}\}$. These sets were carefully chosen such that the best-performing learning rates do not lie on their extrema.

Definitive results for a given environment are obtained by using the best hyperparameters found for each method in additional runs. These definitive results are directly comparable to our previous results for GCPG and HPG (batch size 16), since every method will have interacted with the environment for the same number of episodes before each evaluation step. For each environment, the number of runs, the number of training batches (cycles), the number of batches (cycles) between evaluations, and the number of episodes per evaluation step are the same as those listed in Tables 1, 2, and 3.

6.7.4  Analysis

The definitive results for the different environments are represented by learning curves (see Figures 34 to 39). In the bit flipping environment for k=8 (see Figure 34), HPG and DQN+HER lead to equivalent sample efficiency, while GCPG lags far behind and DQN is completely unable to learn. In the bit flipping environment for k=16 (see Figure 35), HPG surpasses DQN+HER in sample efficiency by a small margin, while both GCPG and DQN are completely unable to learn. In the empty room environment (see Figure 36), HPG is arguably the most sample-efficient method, although DQN+HER is more stable upon obtaining a good policy. GCPG eventually obtains a good policy, whereas DQN exhibits instability. In the four-rooms environment (see Figure 37), DQN+HER outperforms all other methods by a large margin. Although DQN takes much longer to obtain good policies, it would likely surpass both HPG and GCPG given additional training cycles. In the Ms. Pac-man environment (see Figure 38), DQN+HER once again outperforms all other methods, which achieve equivalent sample efficiency (although DQN appears unstable by the end of training). In the FetchPush environment (see Figure 39), HPG dramatically outperforms all other methods. Both DQN+HER and DQN are completely unable to learn, while GCPG appears to start learning by the end of the training process. Note that active goals are harshly subsampled to increase the computational efficiency of HPG for both Ms. Pac-Man and FetchPush (see sections 6.3 and 6.4).
Figure 34: Bit flipping (k=8).

Figure 35: Bit flipping (k=16).

Figure 36: Empty room.

Figure 37: Four rooms.

Figure 38: Ms. Pac-Man.

Figure 39: FetchPush.

6.7.5  Discussion

Our results suggest that the decision between applying HPG or DQN+HER in a particular sparse-reward environment requires experimentation. In contrast, the decision to apply hindsight was always successful.

Note that we have not employed heuristics that are known to sometimes increase the performance of policy gradient methods (such as entropy bonuses, reward scaling, learning rate annealing, and simple statistical baselines) to avoid introducing confounding factors. We believe that such heuristics would allow both GCPG and HPG to achieve good results in both the four-rooms environment and Ms. Pac-Man. Furthermore, whereas hindsight experience replay is directly applicable to state-of-the-art techniques, our work can probably benefit from being extended to state-of-the-art policy gradient approaches, which we intend to explore in future work. Similarly, we believe that additional heuristics and careful hyperparameter settings would allow DQN+HER to achieve good results in the FetchPush environment. This is evidenced by the fact that Andrychowicz et al. (2017) achieve good results using the deep deterministic policy gradient (Lillicrap et al., 2016, DDPG) in a similar environment (with a continuous action space and a different reward function). The empirical comparisons between either GCPG and HPG or DQN and DQN+HER are comparatively more conclusive, since the similarities between the methods minimize confounding factors.

Regardless of these empirical results, policy gradient approaches constitute one of the most important classes of model-free reinforcement learning methods, which by itself warrants studying how they can benefit from hindsight. Our approach is also complementary to previous work, since it is entirely possible to combine a critic trained by hindsight experience replay with an actor that employs hindsight policy gradients. Although hindsight experience replay does not require a correction analogous to importance sampling, indiscriminately adding hindsight transitions to the replay buffer is problematic, which has mostly been tackled by heuristics (Andrychowicz et al., 2017, sec. 4.5). In contrast, our approach seems to benefit from incorporating all available information about goals at every update, which also avoids the need for a replay buffer.

7  Conclusion

We introduced techniques that enable learning goal-conditional policies using hindsight. In this context, hindsight refers to the capacity to exploit information about the degree to which an arbitrary goal has been achieved while another goal was intended. Prior to our work, hindsight had been limited to off-policy reinforcement learning algorithms that rely on experience replay (Andrychowicz et al., 2017) and policy search based on Bayesian optimization (Karkus et al., 2016; Pinsler et al., 2019).

In addition to the fundamental hindsight policy gradient, our technical results include its baseline and advantage formulations. These results are based on a self-contained goal-conditional policy framework that is also introduced in this text. Besides the straightforward estimator built on the per-decision hindsight policy gradient, we also presented a consistent estimator inspired by weighted importance sampling, together with the corresponding baseline formulation. A variant of this estimator leads to remarkable comparative sample efficiency on a diverse selection of sparse-reward environments, especially in cases where direct reward signals are extremely difficult to obtain. This crucial feature allows natural task formulations that require just trivial reward shaping.

The main drawback of hindsight policy gradient estimators appears to be their computational cost, which is directly related to the number of active goals in a batch. This issue may be mitigated by subsampling active goals, which generally leads to inconsistent estimators. Fortunately, our experiments suggest that this is a viable alternative. Note that the success of hindsight experience replay also depends on an active goal subsampling heuristic (Andrychowicz et al., 2017, sec. 4.5). The inconsistent hindsight policy gradient estimator with a value function baseline employed in our experiments sometimes leads to unstable learning, which is likely related to the difficulty of fitting such a value function without hindsight. This hypothesis is consistent with the fact that such instability is observed only in the most extreme examples of sparse-reward environments. Although our preliminary experiments in using hindsight to fit a value function baseline have been successful, this may be accomplished in several ways and requires a careful study of its own. Further experiments are also required to evaluate hindsight on dense-reward environments.

There are many possibilities for future work besides integrating hindsight policy gradients into systems that rely on goal-conditional policies: deriving additional estimators; implementing and evaluating hindsight (advantage) actor-critic methods; assessing whether hindsight policy gradients can successfully circumvent catastrophic forgetting during curriculum learning of goal-conditional policies; approximating the reward function to reduce required supervision; analyzing the variance of the proposed estimators; studying the impact of active goal subsampling; and evaluating every technique on continuous action spaces.

Appendix A: Goal-Conditional Policy Gradients

This appendix contains proofs related to the results presented in section 3: theorems 1 to 4 and lemma 1. Section A.7 presents optimal constant baselines for goal-conditional policies. The remaining subsections contain auxiliary results.

A.1  Theorem 1

Theorem 1
(Intermediary Goal-Conditional Policy Gradient). The gradient $\nabla \eta(\theta)$ of the expected return with respect to $\theta$ is given by
$$\nabla \eta(\theta) = \sum_{g} p(g) \sum_{\tau} p(\tau \mid g, \theta) \left[\sum_{t=1}^{T-1} \nabla \log p(a_t \mid s_t, g, \theta)\right] \left[\sum_{t=1}^{T} r(s_t, g)\right].$$
(3.1)
Proof.
The partial derivative $\partial \eta(\theta)/\partial \theta_j$ of the expected return $\eta(\theta)$ with respect to $\theta_j$ is given by
$$\frac{\partial}{\partial \theta_j} \eta(\theta) = \sum_{g} p(g) \sum_{\tau} \left[\frac{\partial}{\partial \theta_j} p(\tau \mid g, \theta)\right] \sum_{t=1}^{T} r(s_t, g).$$
(A.1)
The likelihood-ratio trick allows rewriting the previous equation as
$$\frac{\partial}{\partial \theta_j} \eta(\theta) = \sum_{g} p(g) \sum_{\tau} p(\tau \mid g, \theta) \left[\frac{\partial}{\partial \theta_j} \log p(\tau \mid g, \theta)\right] \sum_{t=1}^{T} r(s_t, g).$$
(A.2)
Note that
$$\log p(\tau \mid g, \theta) = \log p(s_1) + \sum_{t=1}^{T-1} \log p(a_t \mid s_t, g, \theta) + \sum_{t=1}^{T-1} \log p(s_{t+1} \mid s_t, a_t).$$
(A.3)
Therefore,
$$\frac{\partial}{\partial \theta_j} \eta(\theta) = \sum_{g} p(g) \sum_{\tau} p(\tau \mid g, \theta) \left[\sum_{t=1}^{T-1} \frac{\partial}{\partial \theta_j} \log p(a_t \mid s_t, g, \theta)\right] \sum_{t=1}^{T} r(s_t, g).$$
(A.4)

A.2  Theorem 2

Theorem 2
(Goal-Conditional Policy Gradient). The gradient $\nabla \eta(\theta)$ of the expected return with respect to $\theta$ is given by
$$\nabla \eta(\theta) = \sum_{g} p(g) \sum_{\tau} p(\tau \mid g, \theta) \sum_{t=1}^{T-1} \nabla \log p(a_t \mid s_t, g, \theta) \sum_{t'=t+1}^{T} r(s_{t'}, g).$$
(3.6)
Proof.
Starting from equation A.4, the partial derivative $\partial \eta(\theta)/\partial \theta_j$ of $\eta(\theta)$ with respect to $\theta_j$ is given by
$$\frac{\partial}{\partial \theta_j} \eta(\theta) = \sum_{g} p(g) \sum_{\tau} p(\tau \mid g, \theta) \left[\sum_{t=1}^{T} r(s_t, g)\right] \left[\sum_{t'=1}^{T-1} \frac{\partial}{\partial \theta_j} \log p(a_{t'} \mid s_{t'}, g, \theta)\right].$$
(A.5)
The previous equation can be rewritten as
$$\frac{\partial}{\partial \theta_j} \eta(\theta) = \sum_{t=1}^{T} \sum_{t'=1}^{T-1} \mathbb{E}\left[r(S_t, G)\, \frac{\partial}{\partial \theta_j} \log p(A_{t'} \mid S_{t'}, G, \theta) \,\Big|\, \theta\right].$$
(A.6)
Let $c$ denote an expectation inside equation A.6 for $t' \geq t$. In that case, $A_{t'} \perp S_t \mid S_{t'}, G, \Theta$, and so
$$c = \sum_{s_t} \sum_{s_{t'}} \sum_{g} \sum_{a_{t'}} p(a_{t'} \mid s_{t'}, g, \theta)\, p(s_t, s_{t'}, g \mid \theta)\, r(s_t, g)\, \frac{\partial}{\partial \theta_j} \log p(a_{t'} \mid s_{t'}, g, \theta).$$
(A.7)
Reversing the likelihood-ratio trick,
$$c = \sum_{s_t} \sum_{s_{t'}} \sum_{g} p(s_t, s_{t'}, g \mid \theta)\, r(s_t, g)\, \frac{\partial}{\partial \theta_j} \sum_{a_{t'}} p(a_{t'} \mid s_{t'}, g, \theta) = 0.$$
(A.8)
Therefore, the terms where $t' \geq t$ can be dismissed from equation A.6, leading to
$$\frac{\partial}{\partial \theta_j} \eta(\theta) = \mathbb{E}\left[\sum_{t=1}^{T} r(S_t, G) \sum_{t'=1}^{t-1} \frac{\partial}{\partial \theta_j} \log p(A_{t'} \mid S_{t'}, G, \theta) \,\Big|\, \theta\right].$$
(A.9)
The previous equation can be conveniently rewritten as
$$\frac{\partial}{\partial \theta_j} \eta(\theta) = \mathbb{E}\left[\sum_{t=1}^{T-1} \frac{\partial}{\partial \theta_j} \log p(A_t \mid S_t, G, \theta) \sum_{t'=t+1}^{T} r(S_{t'}, G) \,\Big|\, \theta\right].$$
(A.10)

A.3  Lemma 4

Lemma 4.
For every $j$, $t$, $\theta$, and associated real-valued (baseline) function $b_t^{\theta}$,
$$\sum_{t=1}^{T-1} \mathbb{E}\left[\frac{\partial}{\partial \theta_j} \log p(A_t \mid S_t, G, \theta)\, b_t^{\theta}(S_t, G) \,\Big|\, \theta\right] = 0.$$
(A.11)
Proof.
Letting $c$ denote an expectation inside equation A.11,
$$c = \sum_{s_t} \sum_{g} \sum_{a_t} p(a_t \mid s_t, g, \theta)\, p(s_t, g \mid \theta)\, \frac{\partial}{\partial \theta_j} \log p(a_t \mid s_t, g, \theta)\, b_t^{\theta}(s_t, g).$$
(A.12)
Reversing the likelihood-ratio trick,
$$c = \sum_{s_t} \sum_{g} p(s_t, g \mid \theta)\, b_t^{\theta}(s_t, g)\, \frac{\partial}{\partial \theta_j} \sum_{a_t} p(a_t \mid s_t, g, \theta) = 0.$$
(A.13)

A.4  Theorem 3

Theorem 3
(Goal-Conditional Policy Gradient, Baseline Formulation). For every $t$, $\theta$, and associated real-valued (baseline) function $b_t^{\theta}$, the gradient $\nabla \eta(\theta)$ of the expected return with respect to $\theta$ is given by
$$\nabla \eta(\theta) = \sum_{g} p(g) \sum_{\tau} p(\tau \mid g, \theta) \sum_{t=1}^{T-1} \nabla \log p(a_t \mid s_t, g, \theta) \left[\sum_{t'=t+1}^{T} r(s_{t'}, g) - b_t^{\theta}(s_t, g)\right].$$
(3.7)
Proof.

The result is obtained by subtracting equation A.11 from A.10. Importantly, for every combination of θ and t, it would also be possible to have a distinct baseline function for each parameter in θ.

A.5  Lemma 1

Lemma 1
(Goal-Conditional Policy Gradient, Action-Value Formulation). The gradient η(θ) of the expected return with respect to θ is given by
η(θ)=gp(g)τp(τg,θ)t=1T-1logp(atst,g,θ)Qtθ(st,at,g).
(3.8)
Proof.
Starting from equation A.10 and rearranging terms,
\frac{\partial}{\partial\theta_j}\eta(\theta) = \sum_{t=1}^{T-1} \sum_{g} \sum_{s_t} \sum_{a_t} p(s_t, a_t, g \mid \theta) \left[ \frac{\partial}{\partial\theta_j} \log p(a_t \mid s_t, g, \theta) \right] \sum_{s_{t+1:T}} p(s_{t+1:T} \mid s_t, a_t, g, \theta) \sum_{t'=t+1}^{T} r(s_{t'}, g).
(A.14)
By the definition of action-value function,
\frac{\partial}{\partial\theta_j}\eta(\theta) = E\left[ \sum_{t=1}^{T-1} \frac{\partial}{\partial\theta_j} \log p(A_t \mid S_t, G, \theta)\, Q_t^\theta(S_t, A_t, G) \mid \theta \right].
(A.15)

A.6  Theorem 4

Theorem 4
(Goal-Conditional Policy Gradient, Advantage Formulation). The gradient \nabla\eta(\theta) of the expected return with respect to \theta is given by
\nabla\eta(\theta) = \sum_{g} p(g) \sum_{\tau} p(\tau \mid g, \theta) \sum_{t=1}^{T-1} \nabla \log p(a_t \mid s_t, g, \theta)\, A_t^\theta(s_t, a_t, g).
(3.9)
Proof.

The result is obtained by choosing b_t^\theta = V_t^\theta and subtracting equation A.11 from equation A.15.

A.7  Theorem 11

For arbitrary j and \theta, consider the following definitions of f and h:
f(\tau, g) = \sum_{t=1}^{T-1} \left[ \frac{\partial}{\partial\theta_j} \log p(a_t \mid s_t, g, \theta) \right] \sum_{t'=t+1}^{T} r(s_{t'}, g),
(A.16)
h(\tau, g) = \sum_{t=1}^{T-1} \frac{\partial}{\partial\theta_j} \log p(a_t \mid s_t, g, \theta).
(A.17)
For every b_j \in \mathbb{R}, using theorem 2 and the fact that E[h(T, G) \mid \theta] = 0 by lemma 4,
\frac{\partial}{\partial\theta_j}\eta(\theta) = E[f(T, G) \mid \theta] = E[f(T, G) - b_j h(T, G) \mid \theta].
(A.18)
Theorem 11.
Assuming \mathrm{Var}[h(T, G) \mid \theta] > 0, the (optimal constant baseline) b_j that minimizes \mathrm{Var}[f(T, G) - b_j h(T, G) \mid \theta] is given by
b_j = \frac{E[f(T, G)\, h(T, G) \mid \theta]}{E[h(T, G)^2 \mid \theta]}.
(A.19)
Proof.

The result is an application of lemma 8.

Appendix B: Hindsight Policy Gradients

This appendix contains proofs related to the results presented in section 4: theorems 5, 6, 12, and 7 (in, respectively, sections B.1, B.2, B.4, and B.6) and lemma 2 (in section B.3). Section B.7 presents optimal constant baselines for hindsight policy gradients. Section B.5 contains an auxiliary result.

B.1  Theorem 5

The following theorem relies on importance sampling, a traditional technique used to obtain estimates related to a random variable X \sim p using samples from an arbitrary positive distribution q. This technique relies on the following equalities:
E_{p(X)}[f(X)] = \sum_{x} p(x) f(x) = \sum_{x} q(x)\, \frac{p(x)}{q(x)}\, f(x) = E_{q(X)}\left[ \frac{p(X)}{q(X)} f(X) \right].
(B.1)
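Equation B.1 can also be illustrated numerically. In the following toy Python example (again, an illustration only, not part of the letter's experiments), an expectation under a target distribution p is estimated from samples drawn from a different positive distribution q, reweighted by p/q.

import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.1, 0.2, 0.3, 0.4])       # target distribution
q = np.array([0.25, 0.25, 0.25, 0.25])   # positive proposal distribution
f = np.array([1.0, -2.0, 0.5, 3.0])      # arbitrary function of the outcome

exact = np.sum(p * f)                     # E_{p(X)}[f(X)]
x = rng.choice(4, size=200_000, p=q)      # samples from q
estimate = np.mean((p[x] / q[x]) * f[x])  # E_{q(X)}[(p(X)/q(X)) f(X)]
assert abs(exact - estimate) < 0.05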
Theorem 5
(Every-decision Hindsight Policy Gradient). For an arbitrary (original) goal g', the gradient \nabla\eta(\theta) of the expected return with respect to \theta is given by
\nabla\eta(\theta) = \sum_{\tau} p(\tau \mid g', \theta) \sum_{g} p(g) \sum_{t=1}^{T-1} \nabla \log p(a_t \mid s_t, g, \theta) \sum_{t'=t+1}^{T} \left[ \prod_{k=1}^{T-1} \frac{p(a_k \mid s_k, g, \theta)}{p(a_k \mid s_k, g', \theta)} \right] r(s_{t'}, g).
(4.2)
Proof.
Starting from theorem 2, importance sampling allows rewriting the partial derivative \partial\eta(\theta)/\partial\theta_j as
\frac{\partial}{\partial\theta_j}\eta(\theta) = \sum_{g} p(g) \sum_{\tau} p(\tau \mid g', \theta)\, \frac{p(\tau \mid g, \theta)}{p(\tau \mid g', \theta)} \sum_{t=1}^{T-1} \left[ \frac{\partial}{\partial\theta_j} \log p(a_t \mid s_t, g, \theta) \right] \sum_{t'=t+1}^{T} r(s_{t'}, g).
(B.2)
Using equation 2.1,
\frac{\partial}{\partial\theta_j}\eta(\theta) = \sum_{g} p(g) \sum_{\tau} p(\tau \mid g', \theta) \left[ \prod_{k=1}^{T-1} \frac{p(a_k \mid s_k, g, \theta)}{p(a_k \mid s_k, g', \theta)} \right] \sum_{t=1}^{T-1} \left[ \frac{\partial}{\partial\theta_j} \log p(a_t \mid s_t, g, \theta) \right] \sum_{t'=t+1}^{T} r(s_{t'}, g).
(B.3)

B.2  Theorem 6

Theorem 6
(Per-decision Hindsight Policy Gradient). For an arbitrary (original) goal g', the gradient \nabla\eta(\theta) of the expected return with respect to \theta is given by
\nabla\eta(\theta) = \sum_{\tau} p(\tau \mid g', \theta) \sum_{g} p(g) \sum_{t=1}^{T-1} \nabla \log p(a_t \mid s_t, g, \theta) \sum_{t'=t+1}^{T} \left[ \prod_{k=1}^{t'-1} \frac{p(a_k \mid s_k, g, \theta)}{p(a_k \mid s_k, g', \theta)} \right] r(s_{t'}, g).
(4.5)
Proof.
Starting from equation B.3, the partial derivative \partial\eta(\theta)/\partial\theta_j can be rewritten as
\frac{\partial}{\partial\theta_j}\eta(\theta) = \sum_{g} p(g) \sum_{t=1}^{T-1} \sum_{t'=t+1}^{T} \sum_{\tau} p(\tau \mid g', \theta) \left[ \prod_{k=1}^{T-1} \frac{p(a_k \mid s_k, g, \theta)}{p(a_k \mid s_k, g', \theta)} \right] \left[ \frac{\partial}{\partial\theta_j} \log p(a_t \mid s_t, g, \theta) \right] r(s_{t'}, g).
(B.4)
If we split every trajectory into states and actions before and after t', then \partial\eta(\theta)/\partial\theta_j is given by
\sum_{g} p(g) \sum_{t=1}^{T-1} \sum_{t'=t+1}^{T} \sum_{s_{1:t'-1}} \sum_{a_{1:t'-1}} p(s_{1:t'-1}, a_{1:t'-1} \mid g', \theta) \left[ \prod_{k=1}^{t'-1} \frac{p(a_k \mid s_k, g, \theta)}{p(a_k \mid s_k, g', \theta)} \right] \left[ \frac{\partial}{\partial\theta_j} \log p(a_t \mid s_t, g, \theta) \right] z,
(B.5)
where z is defined by
z = \sum_{s_{t':T}} \sum_{a_{t':T-1}} p(s_{t':T}, a_{t':T-1} \mid s_{1:t'-1}, a_{1:t'-1}, g', \theta) \left[ \prod_{k=t'}^{T-1} \frac{p(a_k \mid s_k, g, \theta)}{p(a_k \mid s_k, g', \theta)} \right] r(s_{t'}, g).
(B.6)
Using lemma 6 and canceling terms,
z = \sum_{s_{t':T}} \sum_{a_{t':T-1}} p(s_{t'} \mid s_{t'-1}, a_{t'-1}) \left[ \prod_{k=t'}^{T-1} p(a_k \mid s_k, g, \theta)\, p(s_{k+1} \mid s_k, a_k) \right] r(s_{t'}, g).
(B.7)
Using lemma 6 once again,
z = \sum_{s_{t':T}} \sum_{a_{t':T-1}} p(s_{t':T}, a_{t':T-1} \mid s_{1:t'-1}, a_{1:t'-1}, g, \theta)\, r(s_{t'}, g).
(B.8)
Using the fact that S_{t'} \perp G \mid S_{1:t'-1}, A_{1:t'-1}, \Theta,
z = \sum_{s_{t'}} r(s_{t'}, g)\, p(s_{t'} \mid s_{1:t'-1}, a_{1:t'-1}, g, \theta) = \sum_{s_{t'}} r(s_{t'}, g)\, p(s_{t'} \mid s_{1:t'-1}, a_{1:t'-1}, g', \theta).
(B.9)
Substituting z into expression B.5 and returning to an expectation over trajectories,
\frac{\partial}{\partial\theta_j}\eta(\theta) = \sum_{\tau} p(\tau \mid g', \theta) \sum_{g} p(g) \sum_{t=1}^{T-1} \left[ \frac{\partial}{\partial\theta_j} \log p(a_t \mid s_t, g, \theta) \right] \sum_{t'=t+1}^{T} \left[ \prod_{k=1}^{t'-1} \frac{p(a_k \mid s_k, g, \theta)}{p(a_k \mid s_k, g', \theta)} \right] r(s_{t'}, g).
(B.10)

B.3  Lemma 2

Lemma 2.
For every g', t, \theta, and associated real-valued (baseline) function b_t^\theta,
\sum_{\tau} p(\tau \mid g', \theta) \sum_{g} p(g) \sum_{t=1}^{T-1} \nabla \log p(a_t \mid s_t, g, \theta) \left[ \prod_{k=1}^{t} \frac{p(a_k \mid s_k, g, \theta)}{p(a_k \mid s_k, g', \theta)} \right] b_t^\theta(s_t, g) = 0.
(4.13)
Proof.
Let c denote the jth element of the vector on the left-hand side of equation 4.13, such that
c = \sum_{g} p(g) \sum_{t=1}^{T-1} E\left[ \frac{\partial}{\partial\theta_j} \log p(A_t \mid S_t, g, \theta) \left[ \prod_{k=1}^{t} \frac{p(A_k \mid S_k, g, \theta)}{p(A_k \mid S_k, g', \theta)} \right] b_t^\theta(S_t, g) \mid g', \theta \right].
(B.11)
Using lemma 5 and writing the expectations explicitly,
c = \sum_{g} p(g) \sum_{t=1}^{T-1} \sum_{s_{1:t}} \sum_{a_{1:t}} p(s_{1:t}, a_{1:t} \mid g', \theta) \left[ \frac{\partial}{\partial\theta_j} \log p(a_t \mid s_t, g, \theta) \right] \frac{p(s_{1:t}, a_{1:t} \mid g, \theta)}{p(s_{1:t}, a_{1:t} \mid g', \theta)}\, b_t^\theta(s_t, g).
(B.12)
Canceling terms, using lemma 5 once again, and reversing the likelihood-ratio trick,
c = \sum_{g} p(g) \sum_{t=1}^{T-1} \sum_{s_{1:t}} \sum_{a_{1:t}} \left[ \frac{\partial}{\partial\theta_j} p(a_t \mid s_t, g, \theta) \right] p(s_1) \left[ \prod_{k=1}^{t-1} p(a_k \mid s_k, g, \theta)\, p(s_{k+1} \mid s_k, a_k) \right] b_t^\theta(s_t, g).
(B.13)
Pushing constants outside the summation over actions at time step t,
c = \sum_{g} p(g) \sum_{t=1}^{T-1} \sum_{s_{1:t}} \sum_{a_{1:t-1}} p(s_1) \left[ \prod_{k=1}^{t-1} p(a_k \mid s_k, g, \theta)\, p(s_{k+1} \mid s_k, a_k) \right] b_t^\theta(s_t, g)\, \frac{\partial}{\partial\theta_j} \left[ \sum_{a_t} p(a_t \mid s_t, g, \theta) \right] = 0.
(B.14)

B.4  Theorem 12

Theorem 12
(Hindsight Policy Gradient, Baseline Formulation). For every g', t, \theta, and associated real-valued (baseline) function b_t^\theta, the gradient \nabla\eta(\theta) of the expected return with respect to \theta is given by
\nabla\eta(\theta) = \sum_{\tau} p(\tau \mid g', \theta) \sum_{g} p(g) \sum_{t=1}^{T-1} \nabla \log p(a_t \mid s_t, g, \theta)\, z,
(B.15)
where
z = \sum_{t'=t+1}^{T} \left[ \prod_{k=1}^{t'-1} \frac{p(a_k \mid s_k, g, \theta)}{p(a_k \mid s_k, g', \theta)} \right] r(s_{t'}, g) - \left[ \prod_{k=1}^{t} \frac{p(a_k \mid s_k, g, \theta)}{p(a_k \mid s_k, g', \theta)} \right] b_t^\theta(s_t, g).
(B.16)
Proof.

The result is obtained by subtracting equation 4.13 from equation 4.5. Importantly, for every combination of θ and t, it would also be possible to have a distinct baseline function for each parameter in θ.

B.5  Lemma 3

Lemma 3
(Hindsight Policy Gradient, Action-Value Formulation). For an arbitrary goal g', the gradient \nabla\eta(\theta) of the expected return with respect to \theta is given by
\nabla\eta(\theta) = \sum_{\tau} p(\tau \mid g', \theta) \sum_{g} p(g) \sum_{t=1}^{T-1} \nabla \log p(a_t \mid s_t, g, \theta) \left[ \prod_{k=1}^{t} \frac{p(a_k \mid s_k, g, \theta)}{p(a_k \mid s_k, g', \theta)} \right] Q_t^\theta(s_t, a_t, g).
(4.14)
Proof.
Starting from equation A.15, the partial derivative \partial\eta(\theta)/\partial\theta_j can be written as
\frac{\partial}{\partial\theta_j}\eta(\theta) = \sum_{t=1}^{T-1} \sum_{g} p(g) \sum_{s_{1:t}} \sum_{a_{1:t}} p(s_{1:t}, a_{1:t} \mid g, \theta) \left[ \frac{\partial}{\partial\theta_j} \log p(a_t \mid s_t, g, \theta) \right] Q_t^\theta(s_t, a_t, g).
(B.17)
Using importance sampling, for an arbitrary goal g',
\frac{\partial}{\partial\theta_j}\eta(\theta) = \sum_{g} p(g) \sum_{t=1}^{T-1} \sum_{s_{1:t}} \sum_{a_{1:t}} p(s_{1:t}, a_{1:t} \mid g', \theta)\, \frac{p(s_{1:t}, a_{1:t} \mid g, \theta)}{p(s_{1:t}, a_{1:t} \mid g', \theta)} \left[ \frac{\partial}{\partial\theta_j} \log p(a_t \mid s_t, g, \theta) \right] Q_t^\theta(s_t, a_t, g).
(B.18)
Using lemma 5 and rewriting the previous equation using expectations,
\frac{\partial}{\partial\theta_j}\eta(\theta) = \sum_{g} p(g)\, E\left[ \sum_{t=1}^{T-1} \frac{\partial}{\partial\theta_j} \log p(A_t \mid S_t, g, \theta) \left[ \prod_{k=1}^{t} \frac{p(A_k \mid S_k, g, \theta)}{p(A_k \mid S_k, g', \theta)} \right] Q_t^\theta(S_t, A_t, g) \mid g', \theta \right].
(B.19)

B.6  Theorem 7

Theorem 7
(Hindsight Policy Gradient, Advantage Formulation). For an arbitrary (original) goal g', the gradient \nabla\eta(\theta) of the expected return with respect to \theta is given by
\nabla\eta(\theta) = \sum_{\tau} p(\tau \mid g', \theta) \sum_{g} p(g) \sum_{t=1}^{T-1} \nabla \log p(a_t \mid s_t, g, \theta) \left[ \prod_{k=1}^{t} \frac{p(a_k \mid s_k, g, \theta)}{p(a_k \mid s_k, g', \theta)} \right] A_t^\theta(s_t, a_t, g).
(4.15)
Proof.

The result is obtained by choosing b_t^\theta = V_t^\theta and subtracting equation B.11 from equation B.19.

B.7  Theorem 13

For arbitrary g', j, and \theta, consider the following definitions of f and h:
f(\tau) = \sum_{g} p(g) \sum_{t=1}^{T-1} \left[ \frac{\partial}{\partial\theta_j} \log p(a_t \mid s_t, g, \theta) \right] \sum_{t'=t+1}^{T} \left[ \prod_{k=1}^{t'-1} \frac{p(a_k \mid s_k, g, \theta)}{p(a_k \mid s_k, g', \theta)} \right] r(s_{t'}, g),
(B.20)
h(\tau) = \sum_{g} p(g) \sum_{t=1}^{T-1} \left[ \frac{\partial}{\partial\theta_j} \log p(a_t \mid s_t, g, \theta) \right] \prod_{k=1}^{t} \frac{p(a_k \mid s_k, g, \theta)}{p(a_k \mid s_k, g', \theta)}.
(B.21)
For every b_j \in \mathbb{R}, using theorem 6 and the fact that E[h(T) \mid g', \theta] = 0 by lemma 2,
\frac{\partial}{\partial\theta_j}\eta(\theta) = E[f(T) \mid g', \theta] = E[f(T) - b_j h(T) \mid g', \theta].
(B.22)
Theorem 13.
Assuming \mathrm{Var}[h(T) \mid g', \theta] > 0, the (optimal constant baseline) b_j that minimizes \mathrm{Var}[f(T) - b_j h(T) \mid g', \theta] is given by
b_j = \frac{E[f(T)\, h(T) \mid g', \theta]}{E[h(T)^2 \mid g', \theta]}.
(B.23)
Proof.

The result is an application of lemma 8.

Appendix C: Hindsight Gradient Estimators

This appendix contains proofs related to the estimators presented in section 5: theorems 9 and 10. Section C.3 presents a result that enables a consistency-preserving weighted baseline.

In this appendix, we consider a data set D = \{(\tau^{(i)}, g^{(i)})\}_{i=1}^{N}, where each trajectory \tau^{(i)} is obtained using a policy parameterized by \theta in an attempt to achieve a goal g^{(i)} chosen by the environment. Because D is an independent and identically distributed (i.i.d.) data set given \Theta,
p(D \mid \theta) = p(\tau^{(1:N)}, g^{(1:N)} \mid \theta) = \prod_{i=1}^{N} p(\tau^{(i)}, g^{(i)} \mid \theta) = \prod_{i=1}^{N} p(g^{(i)})\, p(\tau^{(i)} \mid g^{(i)}, \theta).
(C.1)

C.1  Theorem 9

Theorem 9.
The per-decision hindsight policy gradient estimator, given by
\frac{1}{N} \sum_{i=1}^{N} \sum_{g} p(g) \sum_{t=1}^{T-1} \nabla \log p(A_t^{(i)} \mid S_t^{(i)}, G^{(i)} = g, \theta) \sum_{t'=t+1}^{T} \left[ \prod_{k=1}^{t'-1} \frac{p(A_k^{(i)} \mid S_k^{(i)}, G^{(i)} = g, \theta)}{p(A_k^{(i)} \mid S_k^{(i)}, G^{(i)}, \theta)} \right] r(S_{t'}^{(i)}, g),
(5.1)
is a consistent and unbiased estimator of the gradient \nabla\eta(\theta) of the expected return.
Proof.
Let I_j^{(N)} denote the jth element of the estimator, which can be written as
I_j^{(N)} = \frac{1}{N} \sum_{i=1}^{N} I(T^{(i)}, G^{(i)}, \theta)_j,
(C.2)
where
I(\tau, g', \theta)_j = \sum_{g} p(g) \sum_{t=1}^{T-1} \left[ \frac{\partial}{\partial\theta_j} \log p(a_t \mid s_t, g, \theta) \right] \sum_{t'=t+1}^{T} \left[ \prod_{k=1}^{t'-1} \frac{p(a_k \mid s_k, g, \theta)}{p(a_k \mid s_k, g', \theta)} \right] r(s_{t'}, g).
(C.3)
Using theorem 6, the expected value E[I_j^{(N)} \mid \theta] is given by
E[I_j^{(N)} \mid \theta] = \frac{1}{N} \sum_{i=1}^{N} \sum_{g^{(i)}} p(g^{(i)})\, E[I(T^{(i)}, g^{(i)}, \theta)_j \mid g^{(i)}, \theta] = \frac{1}{N} \sum_{i=1}^{N} \sum_{g^{(i)}} p(g^{(i)})\, \frac{\partial}{\partial\theta_j}\eta(\theta) = \frac{\partial}{\partial\theta_j}\eta(\theta).
(C.4)

Therefore, I_j^{(N)} is an unbiased estimator of \partial\eta(\theta)/\partial\theta_j.

Conditionally on \Theta, the random variable I_j^{(N)} is an average of i.i.d. random variables with expected value \partial\eta(\theta)/\partial\theta_j (see equation C.4). By the strong law of large numbers (Sen & Singer, 1994, theorem 2.3.13),
I_j^{(N)} \xrightarrow{\text{a.s.}} \frac{\partial}{\partial\theta_j}\eta(\theta).
(C.5)

Therefore, I_j^{(N)} is a consistent estimator of \partial\eta(\theta)/\partial\theta_j.
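For concreteness, the following Python sketch evaluates the estimator of equation 5.1 from precomputed per-step quantities. The array layout and the function name are assumptions made for this illustration; a practical implementation, such as the one referenced in the note, would batch these computations and work directly with a differentiable policy.

import numpy as np

def hpg_estimate(p_goal, logp, logp_orig, score, reward):
    """Per-decision hindsight policy gradient estimate (equation 5.1).

    p_goal:    (G,)           probabilities p(g) of the goals considered.
    logp:      (N, G, T-1)    log p(a_t | s_t, g, theta) for each episode and goal.
    logp_orig: (N, T-1)       log p(a_t | s_t, g_original, theta).
    score:     (N, G, T-1, D) gradient of log p(a_t | s_t, g, theta) w.r.t. theta.
    reward:    (N, G, T)      r(s_t, g) for t = 1, ..., T.
    Returns a (D,) gradient estimate.
    """
    N, G, Tm1, D = score.shape
    grad = np.zeros(D)
    for i in range(N):
        for gi in range(G):
            # lw[k] is the log of the product of action-probability ratios
            # for steps 1, ..., k + 1 (zero-based index k).
            lw = np.cumsum(logp[i, gi] - logp_orig[i])
            for t in range(Tm1):                  # zero-based index of step t + 1
                for tp in range(t + 1, Tm1 + 1):  # zero-based index of state s_{tp+1}
                    w = np.exp(lw[tp - 1])        # product over k = 1, ..., t' - 1
                    grad += p_goal[gi] * w * reward[i, gi, tp] * score[i, gi, t]
    return grad / N

The nested loops make the cost explicit: it grows with the number of goals considered and quadratically with the episode length, which is the computational drawback discussed in the conclusion.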

C.2  Theorem 10

Theorem 10.
The weighted per-decision hindsight policy gradient estimator, given by
\sum_{i=1}^{N} \sum_{g} p(g) \sum_{t=1}^{T-1} \nabla \log p(A_t^{(i)} \mid S_t^{(i)}, G^{(i)} = g, \theta) \sum_{t'=t+1}^{T} \frac{\prod_{k=1}^{t'-1} \frac{p(A_k^{(i)} \mid S_k^{(i)}, G^{(i)} = g, \theta)}{p(A_k^{(i)} \mid S_k^{(i)}, G^{(i)}, \theta)}}{\sum_{j=1}^{N} \prod_{k=1}^{t'-1} \frac{p(A_k^{(j)} \mid S_k^{(j)}, G^{(j)} = g, \theta)}{p(A_k^{(j)} \mid S_k^{(j)}, G^{(j)}, \theta)}}\, r(S_{t'}^{(i)}, g),
(5.2)
is a consistent estimator of the gradient \nabla\eta(\theta) of the expected return.
Proof.

Let W_j^{(N)} denote the jth element of the estimator, which can be written as
W_j^{(N)} = \sum_{g} p(g) \sum_{t=1}^{T-1} \sum_{t'=t+1}^{T} \frac{X(g, t, t')_j^{(N)}}{Y(g, t, t')_j^{(N)}},
(C.6)
where
X(g, t, t')_j^{(N)} = \frac{1}{N} \sum_{i=1}^{N} X(T^{(i)}, G^{(i)}, g, t, t', \theta)_j,
(C.7)
Y(g, t, t')_j^{(N)} = \frac{1}{N} \sum_{i=1}^{N} Y(T^{(i)}, G^{(i)}, g, t, t', \theta)_j,
(C.8)
X(\tau, g', g, t, t', \theta)_j = \left[ \prod_{k=1}^{t'-1} \frac{p(a_k \mid s_k, g, \theta)}{p(a_k \mid s_k, g', \theta)} \right] \left[ \frac{\partial}{\partial\theta_j} \log p(a_t \mid s_t, g, \theta) \right] r(s_{t'}, g),
(C.9)
Y(\tau, g', g, t, t', \theta)_j = \prod_{k=1}^{t'-1} \frac{p(a_k \mid s_k, g, \theta)}{p(a_k \mid s_k, g', \theta)}.
(C.10)
Consider the expected value E[X_i] = E[X(T^{(i)}, G^{(i)}, g, t, t', \theta)_j \mid \theta], which is given by
E[X_i] = \sum_{g^{(i)}} p(g^{(i)})\, E\left[ \left[ \prod_{k=1}^{t'-1} \frac{p(A_k \mid S_k, g, \theta)}{p(A_k \mid S_k, G = g^{(i)}, \theta)} \right] \left[ \frac{\partial}{\partial\theta_j} \log p(A_t \mid S_t, g, \theta) \right] r(S_{t'}, g) \mid G = g^{(i)}, \theta \right].
(C.11)
Using the fact that t' > t, lemma 5, and canceling terms, E[X_i] can be written as
\sum_{g^{(i)}} p(g^{(i)}) \sum_{s_{1:t'}} \sum_{a_{1:t'-1}} p(s_{t'} \mid s_{1:t'-1}, a_{1:t'-1}, G = g^{(i)}, \theta)\, p(s_{1:t'-1}, a_{1:t'-1} \mid g, \theta) \left[ \frac{\partial}{\partial\theta_j} \log p(a_t \mid s_t, g, \theta) \right] r(s_{t'}, g).
(C.12)
Because S_{t'} \perp G \mid S_{1:t'-1}, A_{1:t'-1}, \Theta,
E[X_i] = E\left[ \frac{\partial}{\partial\theta_j} \log p(A_t \mid S_t, g, \theta)\, r(S_{t'}, g) \mid g, \theta \right].
(C.13)

Conditionally on \Theta, the variable X(g, t, t')_j^{(N)} is an average of i.i.d. random variables with expected value E[X_i]. By the strong law of large numbers (Sen & Singer, 1994, theorem 2.3.13), X(g, t, t')_j^{(N)} \xrightarrow{\text{a.s.}} E[X_i].

Using lemma 5, the expected value E[Y_i] = E[Y(T^{(i)}, G^{(i)}, g, t, t', \theta)_j \mid \theta] is given by
E[Y_i] = \sum_{g^{(i)}} p(g^{(i)})\, E\left[ \frac{p(S_{1:t'-1}^{(i)}, A_{1:t'-1}^{(i)} \mid G^{(i)} = g, \theta)}{p(S_{1:t'-1}^{(i)}, A_{1:t'-1}^{(i)} \mid g^{(i)}, \theta)} \mid g^{(i)}, \theta \right] = 1.
(C.14)

Conditionally on \Theta, the variable Y(g, t, t')_j^{(N)} is an average of i.i.d. random variables with expected value 1. By the strong law of large numbers, Y(g, t, t')_j^{(N)} \xrightarrow{\text{a.s.}} 1.

Because both X(g, t, t')_j^{(N)} and Y(g, t, t')_j^{(N)} converge almost surely to real numbers (Thomas, 2015, chap. 3, property 2),
\frac{X(g, t, t')_j^{(N)}}{Y(g, t, t')_j^{(N)}} \xrightarrow{\text{a.s.}} E\left[ \frac{\partial}{\partial\theta_j} \log p(A_t \mid S_t, g, \theta)\, r(S_{t'}, g) \mid g, \theta \right].
(C.15)
By theorem 2 and the fact that W_j^{(N)} is a linear combination of terms X(g, t, t')_j^{(N)} / Y(g, t, t')_j^{(N)},
W_j^{(N)} \xrightarrow{\text{a.s.}} \sum_{g} p(g) \sum_{t=1}^{T-1} \sum_{t'=t+1}^{T} E\left[ \frac{\partial}{\partial\theta_j} \log p(A_t \mid S_t, g, \theta)\, r(S_{t'}, g) \mid g, \theta \right] = \frac{\partial}{\partial\theta_j}\eta(\theta).
(C.16)
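The weighted estimator of equation 5.2 replaces the 1/N factor by a per-goal, per-t' normalization over the batch. A minimal Python sketch under the same assumed array layout as in the previous sketch:

import numpy as np

def weighted_hpg_estimate(p_goal, logp, logp_orig, score, reward):
    """Weighted per-decision hindsight policy gradient estimate (equation 5.2).

    Arrays follow the same hypothetical layout as in the previous sketch:
    p_goal (G,), logp (N, G, T-1), logp_orig (N, T-1),
    score (N, G, T-1, D), reward (N, G, T).
    """
    N, G, Tm1, D = score.shape
    # lw[i, gi, k] is the log of the product of action-probability ratios
    # for steps 1, ..., k + 1 of episode i under hindsight goal gi.
    lw = np.cumsum(logp - logp_orig[:, None, :], axis=2)
    grad = np.zeros(D)
    for gi in range(G):
        for tp in range(1, Tm1 + 1):           # zero-based index of state s_{tp+1}
            w = np.exp(lw[:, gi, tp - 1])      # (N,) weights up to step t' - 1
            normalizer = w.sum()               # replaces the 1/N factor of equation 5.1
            if normalizer == 0.0:
                continue
            for i in range(N):
                for t in range(tp):            # actions at steps t = 1, ..., t' - 1
                    grad += p_goal[gi] * (w[i] / normalizer) \
                            * reward[i, gi, tp] * score[i, gi, t]
    return grad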

C.3  Theorem 14

Theorem 14.
The weighted baseline estimator, given by
\sum_{i=1}^{N} \sum_{g} p(g) \sum_{t=1}^{T-1} \nabla \log p(A_t^{(i)} \mid S_t^{(i)}, G^{(i)} = g, \theta)\, \frac{\prod_{k=1}^{t} \frac{p(A_k^{(i)} \mid S_k^{(i)}, G^{(i)} = g, \theta)}{p(A_k^{(i)} \mid S_k^{(i)}, G^{(i)}, \theta)}}{\sum_{j=1}^{N} \prod_{k=1}^{t} \frac{p(A_k^{(j)} \mid S_k^{(j)}, G^{(j)} = g, \theta)}{p(A_k^{(j)} \mid S_k^{(j)}, G^{(j)}, \theta)}}\, b_t^\theta(S_t^{(i)}, g),
(C.17)
converges almost surely to zero.
Proof.
Let B_j^{(N)} denote the jth element of the estimator, which can be written as
B_j^{(N)} = \sum_{g} p(g) \sum_{t=1}^{T-1} \frac{X(g, t)_j^{(N)}}{Y(g, t)_j^{(N)}},
(C.18)
where
X(g, t)_j^{(N)} = \frac{1}{N} \sum_{i=1}^{N} X(T^{(i)}, G^{(i)}, g, t, \theta)_j,
(C.19)
Y(g, t)_j^{(N)} = \frac{1}{N} \sum_{i=1}^{N} Y(T^{(i)}, G^{(i)}, g, t, \theta)_j,
(C.20)
X(\tau, g', g, t, \theta)_j = \left[ \prod_{k=1}^{t} \frac{p(a_k \mid s_k, g, \theta)}{p(a_k \mid s_k, g', \theta)} \right] \left[ \frac{\partial}{\partial\theta_j} \log p(a_t \mid s_t, g, \theta) \right] b_t^\theta(s_t, g),
(C.21)
Y(\tau, g', g, t, \theta)_j = \prod_{k=1}^{t} \frac{p(a_k \mid s_k, g, \theta)}{p(a_k \mid s_k, g', \theta)}.
(C.22)
Using equations B.11 and B.14, the expected value E[X_i] = E[X(T^{(i)}, G^{(i)}, g, t, \theta)_j \mid \theta] is given by
E[X_i] = \sum_{g^{(i)}} p(g^{(i)})\, E[X(T^{(i)}, g^{(i)}, g, t, \theta)_j \mid g^{(i)}, \theta] = 0.
(C.23)

Conditionally on \Theta, the variable X(g, t)_j^{(N)} is an average of i.i.d. random variables with expected value zero. By the strong law of large numbers (Sen & Singer, 1994, theorem 2.3.13), X(g, t)_j^{(N)} \xrightarrow{\text{a.s.}} 0.

The fact that Y(g, t)_j^{(N)} \xrightarrow{\text{a.s.}} 1 is already established in the proof of theorem 10. Because both X(g, t)_j^{(N)} and Y(g, t)_j^{(N)} converge almost surely to real numbers (Thomas, 2015, chap. 3, property 2),
\frac{X(g, t)_j^{(N)}}{Y(g, t)_j^{(N)}} \xrightarrow{\text{a.s.}} 0.
(C.24)

Because B_j^{(N)} is a linear combination of terms X(g, t)_j^{(N)} / Y(g, t)_j^{(N)}, B_j^{(N)} \xrightarrow{\text{a.s.}} 0.

Clearly, if E^{(N)} is a consistent estimator of some quantity given \theta, then so is E^{(N)} - B_j^{(N)}, which allows using this result in combination with theorem 10.
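Combining this result with theorem 10 amounts to subtracting the weighted baseline term of equation C.17 from the weighted per-decision estimate. A minimal Python sketch of the baseline term alone, under the same assumed array layout as in the previous sketches, where baseline[i, gi, t] stands for a value b_t(s_t, g) produced by some approximate baseline function:

import numpy as np

def weighted_baseline_term(p_goal, logp, logp_orig, score, baseline):
    """Weighted baseline estimator (equation C.17); converges almost surely to zero.

    baseline: (N, G, T-1) values b_t(s_t, g); the remaining arrays follow the
    hypothetical layout used in the previous sketches.
    """
    N, G, Tm1, D = score.shape
    lw = np.cumsum(logp - logp_orig[:, None, :], axis=2)
    term = np.zeros(D)
    for gi in range(G):
        for t in range(Tm1):               # zero-based index of the action at step t + 1
            w = np.exp(lw[:, gi, t])       # product of ratios for k = 1, ..., t
            normalizer = w.sum()
            if normalizer == 0.0:
                continue
            for i in range(N):
                term += p_goal[gi] * (w[i] / normalizer) \
                        * baseline[i, gi, t] * score[i, gi, t]
    return term

Because this term converges almost surely to zero, subtracting it from a consistent estimator preserves consistency while potentially reducing variance.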

Appendix D: Fundamental Results

This appendix presents results required by previous sections: lemmas 5 and 6 (in, respectively, sections D.1 and D.2), theorem 8 (in section D.4), and lemma 8 (in section D.5). Appendix D.3 contains an auxiliary result.

D.1  Lemma 5

Lemma 5.
For every \tau, g, \theta, and 1 \le t \le T-1,
p(s_{1:t}, a_{1:t} \mid g, \theta) = p(s_1)\, p(a_t \mid s_t, g, \theta) \prod_{k=1}^{t-1} p(a_k \mid s_k, g, \theta)\, p(s_{k+1} \mid s_k, a_k).
(D.1)
Proof.
In order to employ backward induction, consider the case t = T-1. By marginalization,
p(s_{1:T-1}, a_{1:T-1} \mid g, \theta) = \sum_{s_T} p(\tau \mid g, \theta) = \sum_{s_T} p(s_1) \prod_{k=1}^{T-1} p(a_k \mid s_k, g, \theta)\, p(s_{k+1} \mid s_k, a_k)
(D.2)
= p(s_1)\, p(a_{T-1} \mid s_{T-1}, g, \theta) \prod_{k=1}^{T-2} p(a_k \mid s_k, g, \theta)\, p(s_{k+1} \mid s_k, a_k),
(D.3)
which completes the proof of the base case.
Assuming the inductive hypothesis is true for a given 2 \le t \le T-1 and considering the case t-1,
p(s_{1:t-1}, a_{1:t-1} \mid g, \theta) = \sum_{s_t} \sum_{a_t} p(s_1)\, p(a_t \mid s_t, g, \theta) \prod_{k=1}^{t-1} p(a_k \mid s_k, g, \theta)\, p(s_{k+1} \mid s_k, a_k)
(D.4)
= p(s_1)\, p(a_{t-1} \mid s_{t-1}, g, \theta) \prod_{k=1}^{t-2} p(a_k \mid s_k, g, \theta)\, p(s_{k+1} \mid s_k, a_k).
(D.5)

D.2  Lemma 6

Lemma 6.
For every \tau, g, \theta, and 1 \le t \le T,
p(s_{t:T}, a_{t:T-1} \mid s_{1:t-1}, a_{1:t-1}, g, \theta) = p(s_t \mid s_{t-1}, a_{t-1}) \prod_{k=t}^{T-1} p(a_k \mid s_k, g, \theta)\, p(s_{k+1} \mid s_k, a_k).
(D.6)
Proof.
The case t = 1 can be inspected easily. Consider 2 \le t \le T. By definition,
p(s_{t:T}, a_{t:T-1} \mid s_{1:t-1}, a_{1:t-1}, g, \theta) = \frac{p(s_{1:T}, a_{1:T-1} \mid g, \theta)}{p(s_{1:t-1}, a_{1:t-1} \mid g, \theta)}.
(D.7)
Using lemma 5,
p(s_{t:T}, a_{t:T-1} \mid s_{1:t-1}, a_{1:t-1}, g, \theta) = \frac{p(s_1) \prod_{k=1}^{T-1} p(a_k \mid s_k, g, \theta)\, p(s_{k+1} \mid s_k, a_k)}{p(s_1)\, p(a_{t-1} \mid s_{t-1}, g, \theta) \prod_{k=1}^{t-2} p(a_k \mid s_k, g, \theta)\, p(s_{k+1} \mid s_k, a_k)}
(D.8)
= \frac{\prod_{k=t-1}^{T-1} p(a_k \mid s_k, g, \theta)\, p(s_{k+1} \mid s_k, a_k)}{p(a_{t-1} \mid s_{t-1}, g, \theta)}.
(D.9)

D.3  Lemma 7

Lemma 7.
For every t and \theta, the action-value function Q_t^\theta is given by
Q_t^\theta(s, a, g) = E[r(S_{t+1}, g) + V_{t+1}^\theta(S_{t+1}, g) \mid S_t = s, A_t = a].
(D.10)
Proof.
From the definition of action-value function and using the fact that S_{t+1} \perp G, \Theta \mid S_t, A_t,
Q_t^\theta(s, a, g) = E[r(S_{t+1}, g) \mid S_t = s, A_t = a] + E\left[ \sum_{t'=t+2}^{T} r(S_{t'}, g) \mid S_t = s, A_t = a, g, \theta \right].
(D.11)
Let z denote the second term on the right-hand side of the previous equation, which can also be written as
z = \sum_{s_{t+1}} \sum_{a_{t+1}} \sum_{s_{t+2:T}} p(s_{t+1}, a_{t+1}, s_{t+2:T} \mid S_t = s, A_t = a, g, \theta) \sum_{t'=t+2}^{T} r(s_{t'}, g).
(D.12)
Consider the following three independence properties:
S_{t+1} \perp G, \Theta \mid S_t, A_t,
(D.13)
A_{t+1} \perp S_t, A_t \mid S_{t+1}, G, \Theta,
(D.14)
S_{t+2:T} \perp S_t, A_t \mid S_{t+1}, A_{t+1}, G, \Theta.
(D.15)
Together, these properties can be used to demonstrate that
z = \sum_{s_{t+1}} p(s_{t+1} \mid S_t = s, A_t = a) \sum_{a_{t+1}} p(a_{t+1} \mid s_{t+1}, g, \theta) \sum_{s_{t+2:T}} p(s_{t+2:T} \mid s_{t+1}, a_{t+1}, g, \theta) \sum_{t'=t+2}^{T} r(s_{t'}, g).
(D.16)

From the definition of value function, z = E[V_{t+1}^\theta(S_{t+1}, g) \mid S_t = s, A_t = a].

D.4  Theorem 8

Theorem 8.
For every t and \theta, the advantage function A_t^\theta is given by
A_t^\theta(s, a, g) = E[r(S_{t+1}, g) + V_{t+1}^\theta(S_{t+1}, g) - V_t^\theta(s, g) \mid S_t = s, A_t = a].
(4.6)
Proof.

The result follows from the definition of advantage function and lemma 7.
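Theorem 8 suggests a simple way to estimate the advantage from a single observed transition when an approximate value function is available, replacing the expectation over S_{t+1} by the sampled next state. A minimal Python sketch, assuming a hypothetical value function value(s, g, t) and reward function reward(s, g):

def advantage_estimate(value, reward, s_t, s_next, g, t):
    """One-sample estimate of the advantage in equation 4.6.

    The sampled next state s_next plays the role of S_{t+1}; the action is
    implicit in the observed transition from s_t to s_next.
    """
    return reward(s_next, g) + value(s_next, g, t + 1) - value(s_t, g, t)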

D.5  Lemma 8

Consider a discrete random variable X and real-valued functions f and h. Suppose also that E[h(X)] = 0 and \mathrm{Var}[h(X)] > 0. Clearly, for every b \in \mathbb{R}, we have E[f(X) - b\, h(X)] = E[f(X)].

Lemma 8.
The constant b \in \mathbb{R} that minimizes \mathrm{Var}[f(X) - b\, h(X)] is given by
b = \frac{E[f(X)\, h(X)]}{E[h(X)^2]}.
(D.17)
Proof.
Let v = \mathrm{Var}[f(X) - b\, h(X)]. Using our assumptions and the definition of variance,
v = E[(f(X) - b\, h(X))^2] - E[f(X) - b\, h(X)]^2 = E[(f(X) - b\, h(X))^2] - E[f(X)]^2
(D.18)
= E[f(X)^2] - 2b\, E[f(X)\, h(X)] + b^2 E[h(X)^2] - E[f(X)]^2.
(D.19)

The first and second derivatives of v with respect to b are given by dv/db = -2 E[f(X)\, h(X)] + 2b\, E[h(X)^2] and d^2 v/db^2 = 2 E[h(X)^2]. Our assumptions guarantee that E[h(X)^2] > 0. Therefore, by Fermat's theorem, if b is a local minimum, then dv/db = 0, leading to the desired equality. By the second derivative test, b must be a local minimum.
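In practice, the optimal constant baseline of lemma 8 can be approximated from paired samples of f(X) and h(X). A minimal Python sketch (a toy illustration, not part of the letter's experiments):

import numpy as np

def optimal_constant_baseline(f_samples, h_samples):
    """Estimate b = E[f(X) h(X)] / E[h(X)^2] from paired samples (lemma 8)."""
    f_samples = np.asarray(f_samples, dtype=float)
    h_samples = np.asarray(h_samples, dtype=float)
    return np.mean(f_samples * h_samples) / np.mean(h_samples ** 2)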

Note

1

An open-source implementation of these estimators is available at http://paulorauber.com/hpg.

Acknowledgments

We thank Sjoerd van Steenkiste, Klaus Greff, Imanol Schlag, and the anonymous reviewers of previous versions of this work for their valuable feedback. This research was supported by the Swiss National Science Foundation (grant 200021_165675/1), the European Research Council (Advanced Grant 742870), and CAPES (Filipe Mutz, PDSE, 88881.133206/2016-01). We are grateful to Nvidia Corporation for donating a DGX-1 machine and to IBM for donating a Minsky machine.

References

Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., … Zaremba, W. (2017). Hindsight experience replay. In I. Guyon, Y. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems, 30 (pp. 5048–5058). Red Hook, NY: Curran.
Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, 253–279.
Bishop, C. M. (2013). Pattern recognition and machine learning. Berlin: Springer.
Da Silva, B. C., Konidaris, G., & Barto, A. G. (2012). Learning parameterized skills. In Proceedings of the International Conference on Machine Learning. Red Hook, NY: Curran.
Deisenroth, M. P., Englert, P., Peters, J., & Fox, D. (2014). Multi-task policy search for robotics. In Proceedings of the IEEE International Conference on Robotics and Automation, 2014 (pp. 3876–3881). Piscataway, NJ: IEEE.
Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., … Zhokhov, P. (2017). OpenAI baselines. https://github.com/openai/baselines.
Duan, Y., Chen, X., Houthooft, R., Schulman, J., & Abbeel, P. (2016). Benchmarking deep reinforcement learning for continuous control. In Proceedings of the International Conference on Machine Learning (pp. 1329–1338). Red Hook, NY: Curran.
Fabisch, A., & Metzen, J. H. (2014). Active contextual policy search. Journal of Machine Learning Research, 15(1), 3371–3399.
Florensa, C., Held, D., Wulfmeier, M., Zhang, M., & Abbeel, P. (2017). Reverse curriculum generation for reinforcement learning. In Proceedings of the First Annual Conference on Robot Learning (pp. 482–495).
Ghosh, D., Singh, A., Rajeswaran, A., Kumar, V., & Levine, S. (2018). Divide-and-conquer reinforcement learning. In International Conference on Learning Representations. https://openreview.net/forum?id=rJwelMbR.
Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (pp. 249–256). Palo Alto, CA: AAAI.
Greensmith, E., Bartlett, P. L., & Baxter, J. (2004). Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5, 1471–1530.
Jie, T., & Abbeel, P. (2010). On a connection between importance sampling and the likelihood ratio policy gradient. In J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, & A. Culotta (Eds.), Advances in neural information processing systems, 23 (pp. 1000–1008). Red Hook, NY: Curran.
Karkus, P., Kupcsik, A., Hsu, D., & Lee, W. S. (2016). Factored contextual policy search with Bayesian optimization. arXiv:1612.01746.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. In Proceedings of the Third International Conference on Learning Representations.
Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., … Hadsell, R. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13), 3521–3526.
Kober, J., Wilhelm, A., Oztop, E., & Peters, J. (2012). Reinforcement learning to adjust parameterized motor primitives to new situations. Autonomous Robots, 33(4), 361–379.
Kulkarni, T. D., Narasimhan, K., Saeedi, A., & Tenenbaum, J. (2016). Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems, 29 (pp. 3675–3683). Red Hook, NY: Curran.
Kupcsik, A. G., Deisenroth, M. P., Peters, J., & Neumann, G. (2013). Data-efficient generalization of robot skills with contextual policy search. In Proceedings of the 27th AAAI Conference on Artificial Intelligence (pp. 1401–1407). Palo Alto, CA: AAAI Press.
Levy, A., Platt, R., & Saenko, K. (2019). Hierarchical reinforcement learning with hindsight. In Proceedings of the International Conference on Learning Representations.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2016). Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations.
Lin, L. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3/4), 69–97.
Mankowitz, D. J., Žídek, A., Barreto, A., Horgan, D., Hessel, M., Quan, J., … Schaul, T. (2018). Unicorn: Continual learning with a universal, off-policy agent. arXiv:1802.08294.
McCloskey, M., & Cohen, N. J. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation: Advances in Research and Theory, 24(C), 109–165.
Metzen, J. H., Fabisch, A., & Hansen, J. (2015). Bayesian optimization for contextual policy search. In Proceedings of the Second Machine Learning in Planning and Control of Robot Motion Workshop.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., … Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529.
Munos, R., Stepleton, T., Harutyunyan, A., & Bellemare, M. (2016). Safe and efficient off-policy reinforcement learning. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems, 29 (pp. 1054–1062). Red Hook, NY: Curran.
Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of the International Conference on Machine Learning. New York: ACM.
Ng, A. Y., Harada, D., & Russell, S. (1999). Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the International Conference on Machine Learning, 99 (pp. 278–287). New York: ACM.
Oh, J., Singh, S., Lee, H., & Kohli, P. (2017). Zero-shot task generalization with multi-task deep reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning. New York: ACM.
Pathak, D., Mahmoudieh, P., Luo, M., Agrawal, P., Chen, D., Shentu, F., … Darrell, T. (2018). Zero-shot visual imitation. In International Conference on Learning Representations.
Peshkin, L., & Shelton, C. R. (2002). Learning from scarce experience. In Proceedings of the Nineteenth International Conference on Machine Learning (pp. 498–505). New York: ACM.
Peters, J., & Schaal, S. (2008). Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4), 682–697.
Pinsler, R., Karkus, P., Kupcsik, A., Hsu, D., & Lee, W. S. (2019). Factored contextual policy search with Bayesian optimization. In Proceedings of the 2019 International Conference on Robotics and Automation (pp. 7242–7248). Piscataway, NJ: IEEE.
Plappert, M., Andrychowicz, M., Ray, A., McGrew, B., Baker, B., Powell, G., … Zaremba, W. (2018). Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv:1802.09464.
Precup, D., Sutton, R. S., & Singh, S. P. (2000). Eligibility traces for off-policy policy evaluation. In International Conference on Machine Learning (pp. 759–766).
Schaul, T., Horgan, D., Gregor, K., & Silver, D. (2015). Universal value function approximators. In Proceedings of the International Conference on Machine Learning (pp. 1312–1320). New York: ACM.
Schmidhuber, J. (2013). PowerPlay: Training an increasingly general problem solver by continually searching for the simplest still unsolvable problem. Frontiers in Psychology, June.
Schmidhuber, J. (2019). Reinforcement learning upside down: Don't predict rewards—just map them to actions. arXiv:1912.02875.
Schmidhuber, J., & Huber, R. (1990). Learning to generate focus trajectories for attentive vision. Bonn: Institut für Informatik.
Sen, P., & Singer, J. (1994). Large sample methods in statistics: An introduction with applications. London: Chapman & Hall/CRC.
Srivastava, R. K., Steunebrink, B. R., & Schmidhuber, J. (2013). First experiments with PowerPlay. Neural Networks, 41, 130–136.
Sukhbaatar, S., Lin, Z., Kostrikov, I., Synnaeve, G., Szlam, A., & Fergus, R. (2018). Intrinsic motivation and automatic curricula via asymmetric self-play. In Proceedings of the International Conference on Learning Representations.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: Bradford Book.
Sutton, R. S., McAllester, D. A., Singh, S. P., & Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. In S. Solla, T. Leen, & K. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 1057–1063). Cambridge, MA: MIT Press.
Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1–2), 181–211.
Thomas, P. (2015). Safe reinforcement learning. PhD diss., University of Massachusetts.
Thomas, P., Theocharous, G., & Ghavamzadeh, M. (2015). High confidence policy improvement. In Proceedings of the International Conference on Machine Learning (pp. 2380–2388). New York: ACM.
Vezhnevets, A. S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., & Kavukcuoglu, K. (2017). FeUdal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (pp. 3540–3549). New York: ACM.
Williams, R. J. (1986). Reinforcement-learning in connectionist networks: A mathematical analysis (Technical Report 8605). San Diego: Institute for Cognitive Science, University of California, San Diego.
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4), 229–256.
Zhu, Y., Mottaghi, R., Kolve, E., Lim, J. J., Gupta, A., Fei-Fei, L., & Farhadi, A. (2017). Target-driven visual navigation in indoor scenes using deep reinforcement learning. In Proceedings of the IEEE International Conference on Robotics and Automation (pp. 3357–3364). Piscataway, NJ: IEEE.