A reinforcement learning agent that needs to pursue different goals across episodes requires a goal-conditional policy. In addition to their potential to generalize desirable behavior to unseen goals, such policies may also enable higher-level planning based on subgoals. In sparse-reward environments, the capacity to exploit information about the degree to which an arbitrary goal has been achieved while another goal was intended appears crucial to enabling sample efficient learning. However, reinforcement learning agents have only recently been endowed with such capacity for hindsight. In this letter, we demonstrate how hindsight can be introduced to policy gradient methods, generalizing this idea to a broad class of successful algorithms. Our experiments on a diverse selection of sparse-reward environments show that hindsight leads to a remarkable increase in sample efficiency.

## 1 Introduction

In a traditional reinforcement learning setting, an agent interacts with an environment in a sequence of episodes, observing states and acting according to a policy that ideally maximizes expected cumulative reward. If an agent is required to pursue different goals across episodes, its goal-conditional policy may be represented by a probability distribution over actions for every combination of state and goal. This distinction between states and goals is particularly useful when the probability of a state transition given an action is independent of the goal pursued by the agent.

Learning such goal-conditional behavior has received significant attention in machine learning and robotics, especially because a goal-conditional policy may generalize desirable behavior to goals that were never encountered by the agent (Schmidhuber & Huber, 1990; Da Silva, Konidaris, & Barto, 2012; Kupcsik, Deisenroth, Peters, & Neumann, 2013; Deisenroth, Englert, Peters, & Fox, 2014; Schaul, Horgan, Gregor, & Silver, 2015; Zhu et al., 2017; Kober, Wilhelm, Oztop, & Peters, 2012; Ghosh, Singh, Rajeswaran, Kumar, & Levine, 2018; Mankowitz et al., 2018; Pathak et al., 2018; Schmidhuber, 2019). Consequently, developing goal-based curricula to facilitate learning has also attracted considerable interest (Fabisch & Metzen, 2014; Florensa, Held, Wulfmeier, Zhang, & Abbeel, 2017; Sukhbaatar et al., 2018; Srivastava, Steunebrink, & Schmidhuber, 2013; Schmidhuber, 2013). In hierarchical reinforcement learning, goal-conditional policies may enable agents to plan using subgoals, which abstracts the details involved in lower-level decisions (Oh, Singh, Lee, & Kohli, 2017; Vezhnevets et al., 2017; Kulkarni, Narasimhan, Saeedi, & Tenenbaum, 2016; Levy, Platt, & Saenko, 2019).

In a typical sparse-reward environment, an agent receives a nonzero reward only upon reaching a goal state. Besides being natural, this task formulation avoids the potentially difficult problem of reward shaping, which often biases the learning process toward suboptimal behavior (Ng, Harada, & Russell, 1999). Unfortunately, sparse-reward environments remain particularly challenging for traditional reinforcement learning algorithms (Andrychowicz et al., 2017; Florensa et al., 2017). For example, consider an agent tasked with traveling between cities. In a sparse-reward formulation, if reaching a desired destination by chance is unlikely, a learning agent will rarely obtain reward signals. At the same time, it seems natural to expect that an agent will learn how to reach the cities it visited regardless of its desired destinations.

In this context, the capacity to exploit information about the degree to which an arbitrary goal has been achieved while another goal was intended is called *hindsight*. This capacity was recently introduced by Andrychowicz et al. (2017) to off-policy reinforcement learning algorithms that rely on experience replay (Lin, 1992). In earlier work, Karkus, Kupcsik, Hsu, and Lee (2016) introduced hindsight to policy search based on Bayesian optimization (Metzen, Fabisch, & Hansen, 2015). This work was recently extended by Pinsler, Karkus, Kupcsik, Hsu, and Lee (2019).

In this letter, we demonstrate how hindsight can be introduced to policy gradient methods (Williams, 1986, 1992; Sutton, McAllester, Singh, & Mansour, 1999), generalizing this idea to a successful class of reinforcement learning algorithms (Peters & Schaal, 2008; Duan, Chen, Houthooft, Schulman, & Abbeel, 2016).

In contrast to previous work on hindsight, our approach relies on importance sampling (Bishop, 2013). In reinforcement learning, importance sampling has been traditionally employed in order to efficiently reuse information obtained by earlier policies during learning (Precup, Sutton, & Singh, 2000; Peshkin & Shelton, 2002; Jie & Abbeel, 2010; Thomas, Theocharous, & Ghavamzadeh, 2015; Munos, Stepleton, Harutyunyan, & Bellemare, 2016). In comparison, our approach attempts to efficiently learn about different goals using information obtained by the current policy for a specific goal. This approach leads to multiple formulations of a hindsight policy gradient that relate to well-known policy gradient results.

In comparison to conventional (goal-conditional) policy gradient estimators, our proposed estimators lead to remarkable sample efficiency on a diverse selection of sparse-reward environments.

## 2 Preliminaries

We denote random variables by uppercase letters and assignments to these variables by corresponding lowercase letters. We let $Val(X)$ denote the set of valid assignments to a random variable $X$. We also omit the subscript that typically relates a probability function to random variables when there is no risk of ambiguity. For instance, we may use $p(x)$ to denote $p_X(x)$ and $p(y)$ to denote $p_Y(y)$.

Consider an agent that interacts with its environment in a sequence of episodes, each of which lasts exactly $T$ time steps. The agent receives a goal $g \in Val(G)$ at the beginning of each episode. At every time step $t$, the agent observes a state $s_t \in Val(S_t)$, receives a reward $r(s_t, g) \in \mathbb{R}$, and chooses an action $a_t \in Val(A_t)$. For simplicity of notation, suppose that $Val(G)$, $Val(S_t)$, and $Val(A_t)$ are finite for every $t$.

In our setting, a goal-conditional policy defines a probability distribution over actions for every combination of state and goal. The same policy is used to make decisions at every time step.

In contrast to a Markov decision process, this formulation allows the probability of a state transition given an action to change across time steps within an episode. More important, it implicitly states that the probability of a state transition given an action is independent of the goal pursued by the agent, which we denote by $S_{t+1} \perp\!\!\!\perp G \mid S_t, A_t$. For every $\tau$, $g$, and $\theta$, we also assume that $p(\tau \mid g, \theta)$ is nonzero and differentiable with respect to $\theta$.

The action-value function is given by $Q_t^\theta(s, a, g) = \mathbb{E}\left[\sum_{t'=t+1}^{T} r(S_{t'}, g) \mid S_t = s, A_t = a, g, \theta\right]$, the value function by $V_t^\theta(s, g) = \mathbb{E}\left[Q_t^\theta(s, A_t, g) \mid S_t = s, g, \theta\right]$, and the advantage function by $A_t^\theta(s, a, g) = Q_t^\theta(s, a, g) - V_t^\theta(s, g)$.
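To make the indexing in these definitions concrete, the following minimal sketch computes single-trajectory Monte Carlo estimates of the action-value function from a list of rewards. The function name `returns_to_go` and the list layout (`rewards[t]` standing for $r(s_t, g)$) are assumptions made for illustration only. Because the sum starts at $t' = t + 1$, the reward received at step $t$ itself is excluded.

```python
def returns_to_go(rewards):
    """For each index t, sum the rewards at indices t+1, ..., end.

    rewards[t] stands for r(s_t, g) along one sampled trajectory, so the
    value at index t is a single-sample estimate of Q_t(s_t, a_t, g).
    """
    out = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        out[t] = running       # excludes the reward at index t itself
        running += rewards[t]  # now includes index t for earlier steps
    return out


# A sparse-reward trajectory: a single nonzero reward at the third state.
estimates = returns_to_go([0, 0, 3, 0])
```

In this example, `estimates` is `[3.0, 3.0, 0.0, 0.0]`: only the steps that precede the rewarded state receive credit.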

## 3 Goal-Conditional Policy Gradients

This section presents results for goal-conditional policies that are analogous to well-known results for conventional policies (Peters & Schaal, 2008). They establish the foundation for the results presented in the next section. Additional proofs are included in appendix A for completeness.

The objective of policy gradient methods is finding policy parameters that achieve maximum expected return. When combined with Monte Carlo techniques (Bishop, 2013), the following result allows pursuing this objective using gradient-based optimization.

**Theorem 1**

**Proof.**

More conveniently, the following result can be obtained by noting that an action is independent of any previous state given the current state, the goal, and the policy parameters (see section A.2).

**Theorem 2**

In order to reduce the variance of the gradient estimator, the following result allows employing a so-called baseline (see section A.4).

**Theorem 3**

Section A.7 presents the constant baselines that minimize the (elementwise) variance of the corresponding estimator. However, such baselines are usually impractical to compute (or estimate), and the variance of the estimator may be reduced further by a baseline function that depends on state and goal. Although generally suboptimal, it is typical to let the baseline function $b_t^\theta$ approximate the value function $V_t^\theta$ (Greensmith, Bartlett, & Baxter, 2004).

The action-value function is related to the goal-conditional policy gradient by the following result (see section A.5).

**Lemma 1**

Finally, actor-critic methods may rely on the following result for goal-conditional policies (see section A.6).

**Theorem 4**

## 4 Hindsight Policy Gradients

This section presents the novel ideas that introduce hindsight to policy gradient methods. Additional proofs can be found in appendix B.

Suppose that the reward $r(s,g)$ is known for every combination of state $s$ and goal $g$, as in previous work on hindsight (Andrychowicz et al., 2017; Karkus et al., 2016; Pinsler et al., 2019). In that case, it is possible to evaluate a trajectory obtained while trying to achieve an original goal $g'$ for an alternative goal $g$. This information can be exploited using a central result based on importance sampling.
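As a hedged illustration of this observation, the sketch below re-evaluates the states of one trajectory under an alternative goal. The reward function `reached` (one when state and goal coincide, zero otherwise) and the city names are hypothetical; any known $r(s, g)$ could be substituted.

```python
def reached(state, goal):
    """Hypothetical sparse reward: 1 if the state is the goal, 0 otherwise."""
    return 1.0 if state == goal else 0.0


def rewards_under_goal(states, goal, reward_fn=reached):
    """Rewards r(s_t, goal) along a trajectory, whichever goal was pursued."""
    return [reward_fn(s, goal) for s in states]


# Trajectory collected while pursuing the original goal "Rome".
states = ["Paris", "Lyon", "Turin", "Milan"]
original = rewards_under_goal(states, "Rome")    # no reward signal at all
hindsight = rewards_under_goal(states, "Turin")  # reward at the third state
```

The same trajectory that carries no reward signal for the original goal becomes informative once it is evaluated for a goal that was actually reached.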

**Theorem 5**

**Proof.**

^{2}, importance sampling allows rewriting the partial derivative $\partial \eta(\theta) / \partial \theta_j$ as

In the formulation presented above, every reward is multiplied by the ratio between the likelihood of the corresponding trajectory under an alternative goal and the likelihood under the original goal (see equation 2.1). Intuitively, every reward should instead be multiplied by a likelihood ratio that only considers the corresponding trajectory up to the previous action. This intuition underlies the following important result, named after an analogous result for action-value functions by Precup et al. (2000).

**Theorem 6**

**Proof.**

^{25}(see section D.2) and canceling terms,

^{25}once again,

The following lemma allows introducing baselines to hindsight policy gradients (see section B.4).

**Lemma 2.**

Section B.7 presents the constant baselines that minimize the (elementwise) variance of the corresponding gradient estimator. By analogy with the conventional practice, we suggest letting the baseline function $b_t^\theta$ approximate the value function $V_t^\theta$ instead.

The action-value function is related to the hindsight policy gradient by the following result (see section B.5).

**Lemma 3**

Importantly, the choice of likelihood ratio in lemma ^{8} is far from unique. However, besides leading to straightforward estimation, it also underlies the advantage formulation presented below.

**Theorem 7**

Fortunately, the following result allows approximating the advantage under a goal using a state transition collected while pursuing another goal (see section D.4).

**Theorem 8.**

## 5 Hindsight Gradient Estimators

This section details gradient estimation based on the results presented in the previous section. The corresponding proofs can be found in appendix C.

Consider a data set (batch) $D = \{(\tau^{(i)}, g^{(i)})\}_{i=1}^{N}$ where each trajectory $\tau^{(i)}$ is obtained using a policy parameterized by $\theta$ in an attempt to achieve a goal $g^{(i)}$ chosen by the environment.

The following result points to a straightforward estimator based on theorem ^{7} (see section C.1).

**Theorem 9.**

In preliminary experiments, we found that this estimator leads to unstable learning progress, which is probably due to its potentially high variance. The following result, inspired by weighted importance sampling (Bishop, 2013), represents our attempt to trade variance for bias (see section C.2).

**Theorem 10.**

In simple terms, the likelihood ratio for every combination of trajectory, (alternative) goal, and time step is normalized across trajectories by this estimator. In section C.3, we present a result that enables the corresponding consistency-preserving weighted baseline.
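The normalization described above can be sketched as follows for a single alternative goal. This is only an illustration of normalizing ratios across trajectories, not a reproduction of the full estimator; the function name and the nested-list layout `ratios[i][t]` (trajectory `i`, time step `t`) are assumptions.

```python
def normalize_across_trajectories(ratios):
    """Normalize likelihood ratios across trajectories, per time step.

    ratios[i][t] is the unnormalized likelihood ratio of trajectory i at
    time step t for one fixed alternative goal. The result satisfies
    weights[i][t] = ratios[i][t] / sum_j ratios[j][t].
    """
    n = len(ratios)
    horizon = len(ratios[0])
    weights = [[0.0] * horizon for _ in range(n)]
    for t in range(horizon):
        total = sum(row[t] for row in ratios)
        for i in range(n):
            # Guard against a zero total, although ratios are assumed positive.
            weights[i][t] = ratios[i][t] / total if total > 0 else 0.0
    return weights


weights = normalize_across_trajectories([[1.0, 2.0], [1.0, 6.0]])
```

After normalization, the weights for a given goal and time step sum to one across the batch, which bounds the contribution of any single trajectory.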

Consider a set $G^{(i)} = \{g \in Val(G) \mid \text{exists a } t \text{ such that } r(s_t^{(i)}, g) \neq 0\}$ composed of so-called active goals during the $i$th episode. The feasibility of the proposed estimators relies on the fact that only active goals correspond to nonzero terms inside the expectation over goals in expressions 5.1 and 5.2. In many natural sparse-reward environments, active goals will correspond directly to states visited during episodes (e.g., the cities visited while trying to reach other cities), which enables computing said expectation exactly when the goal distribution is known.
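The set of active goals can be computed directly from its definition when the goal set is small enough to enumerate. The sketch below assumes a hypothetical grid of positions and a reward that is nonzero only when state and goal coincide; the function and variable names are illustrative.

```python
def active_goals(states, goals, reward_fn):
    """Goals g for which r(s_t, g) is nonzero for at least one visited state."""
    return {g for g in goals if any(reward_fn(s, g) != 0 for s in states)}


# Hypothetical positions visited during one episode on a 3 x 3 grid.
visited = [(0, 0), (0, 1), (1, 1)]
all_goals = [(x, y) for x in range(3) for y in range(3)]
active = active_goals(visited, all_goals, lambda s, g: float(s == g))
```

Here only three of the nine candidate goals are active, so the expectation over goals reduces to three terms for this episode.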

The proposed estimators have remarkable properties that differentiate them from previous (weighted) importance sampling estimators for off-policy learning. For instance, although a trajectory is often more likely under the original goal than under an alternative goal, in policies with strong optimal substructure, a high probability of a trajectory between the state $a$ and the goal (state) $c$ that goes through the state $b$ may naturally allow for a high probability of the corresponding (sub)trajectory between the state $a$ and the goal (state) $b$. In other cases, the (unnormalized) likelihood ratios may become very small for some (alternative) goals after a few time steps across all trajectories. After normalization, in the worst case, this may even lead to equivalent ratios for such goals for a given time step across all trajectories. In any case, it is important to note that only likelihood ratios associated with active goals for a given episode will affect the gradient estimate. Additionally, an original goal will always have (unnormalized) likelihood ratios equal to one for the corresponding episode.
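The unnormalized per-decision likelihood ratios discussed above can be sketched as a running product of action probability ratios. The function name and the use of log-probabilities for numerical stability are assumptions; `ratios[t]` is meant to multiply the reward observed after the first `t` actions, so that it only covers the trajectory up to the previous action.

```python
import math


def per_decision_ratios(log_probs_alt, log_probs_orig):
    """Cumulative likelihood ratios along one trajectory.

    log_probs_alt[k] and log_probs_orig[k] are the log-probabilities of the
    k-th action taken, under the alternative and original goals.
    ratios[t] is the product of the first t action probability ratios.
    """
    ratios = [1.0]  # no action has been taken before the first state
    log_ratio = 0.0
    for alt, orig in zip(log_probs_alt, log_probs_orig):
        log_ratio += alt - orig
        ratios.append(math.exp(log_ratio))
    return ratios


# Evaluating the original goal against itself: every ratio equals one.
log_probs = [math.log(0.5), math.log(0.25)]
same_goal = per_decision_ratios(log_probs, log_probs)
```

This reproduces the last property mentioned above: `same_goal` contains only ones, since every action ratio is exactly one when the alternative goal is the original goal.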

Under mild additional assumptions, the proposed estimators also allow using a data set containing goals chosen arbitrarily (instead of goals drawn from the goal distribution). Although this feature is not required by our experiments, we believe that it may be useful to circumvent catastrophic forgetting during curriculum learning (McCloskey & Cohen, 1989; Kirkpatrick et al., 2017).

## 6 Experiments

This section reports results of an empirical comparison between goal-conditional policy gradient estimators and hindsight policy gradient estimators.^{1} Because there are no well-established sparse-reward environments intended to test agents under multiple goals, our experiments focus on our own selection of environments, which is described in section 6.1. Section 6.2 details the implementation of estimators, policies, and baselines. Section 6.3 documents our experimental protocol. Section 6.4 analyzes the results of the corresponding experiments. Unabridged results are presented in section 6.5. Section 6.6 provides a supplementary empirical study of likelihood ratios, and section 6.7 contains an empirical comparison with hindsight experience replay.

### 6.1 Environments

The environments presented in this section are diverse in terms of stochasticity; state-space dimensionality and size; relationship between goals and states; and number of actions. In every one of these environments, the agent receives the remaining number of time steps plus one as a reward for reaching the goal state, which also ends the episode. In every other situation, the agent receives no reward.

#### 6.1.1 Bit Flipping Environment

The agent starts every episode in the same state ($0$, represented by $k$ bits), and its goal is to reach a randomly chosen state. The actions allow the agent to toggle (flip) each bit individually. The maximum number of time steps is $k+1$. Despite its apparent simplicity, this environment is an ideal test bed for reinforcement learning algorithms intended to deal with sparse rewards, since obtaining a reward by chance is unlikely even for a relatively small $k$. Andrychowicz et al. (2017) employed a similar environment to evaluate their hindsight approach.
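A minimal sketch of such an environment is given below; it is not the implementation used in the experiments. The class and method names are hypothetical, and two details are assumptions: time steps are counted from 1, so that "remaining time steps" is read as `max_steps - t`, and goals are drawn uniformly over all $2^k$ states.

```python
import random


class BitFlipping:
    """Illustrative k-bit flipping environment with a sparse reward."""

    def __init__(self, k, rng=None):
        self.k = k
        self.max_steps = k + 1
        self.rng = rng or random.Random()

    def reset(self):
        self.state = (0,) * self.k
        # Assumption: the goal is uniform over all 2**k bit strings.
        self.goal = tuple(self.rng.randint(0, 1) for _ in range(self.k))
        self.t = 1  # assumption: the initial state is observed at t = 1
        return self.state, self.goal

    def step(self, action):
        """Flip the bit at index `action` and return (state, reward, done)."""
        bits = list(self.state)
        bits[action] ^= 1
        self.state = tuple(bits)
        self.t += 1
        reached = self.state == self.goal
        # Reward: remaining number of time steps plus one, only at the goal.
        reward = float(self.max_steps - self.t + 1) if reached else 0.0
        done = reached or self.t >= self.max_steps
        return self.state, reward, done
```

With $k = 16$ there are $2^{16} = 65{,}536$ possible states, which illustrates why a randomly acting agent is unlikely to observe any reward.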

#### 6.1.2 Grid World Environments

The agent starts every episode in a (possibly random) position on an $11 \times 11$ grid, and its goal is to reach a randomly chosen (noninitial) position. Some of the positions on the grid may contain impassable obstacles (walls). The actions allow the agent to move in the four cardinal directions. Moving toward walls causes the agent to remain in its current position. A state or goal is represented by a pair of integers between 0 and 10. The maximum number of time steps is 32. In the empty room environment, the agent starts every episode in the upper left corner of the grid, and there are no walls. In the four rooms environment (Sutton, Precup, & Singh, 1999), the agent starts every episode in one of the four corners of the grid (see Figure 1). There are walls that partition the grid into four rooms, such that each room provides access to two other rooms through single openings (doors). With probability 0.2, the action chosen by the agent is ignored and replaced by a random action.
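The transition rule described above can be sketched as a single function. The names `grid_step` and `MOVES`, the coordinate convention, the treatment of the grid boundary as a wall, and the uniform choice of the replacement action are all assumptions; setting `slip_prob=0.2` corresponds to the random action replacement.

```python
import random

# Assumed convention: (x, y) with y increasing downward.
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def grid_step(position, action, walls, size=11, slip_prob=0.0, rng=random):
    """Return the next position after attempting `action` from `position`."""
    if rng.random() < slip_prob:
        # The chosen action is ignored and replaced by a random one.
        action = rng.choice(list(MOVES))
    dx, dy = MOVES[action]
    x, y = position[0] + dx, position[1] + dy
    blocked = not (0 <= x < size and 0 <= y < size) or (x, y) in walls
    return position if blocked else (x, y)


# Empty room: moving left from the upper left corner leaves the agent in place.
stay = grid_step((0, 0), "left", walls=set())
move = grid_step((0, 0), "right", walls=set())
```

Passing a seeded `random.Random` instance as `rng` makes the stochastic variant reproducible.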

#### 6.1.3 Ms. Pac-Man Environment

#### 6.1.4 FetchPush Environment

### 6.2 Implementation

Importantly, the weighted per-decision hindsight policy gradient estimator used in our experiments (*HPG*) does not precisely correspond to expression 5.2. First, the original estimator requires a constant number of time steps $T$, which would often require the agent to act beyond the end of an episode in the environments that we consider. Second, although it is feasible to compute expression 5.2 exactly when the goal distribution is known (as explained in section 5), we sometimes subsample the sets of active goals per episode. Furthermore, when including a baseline that approximates the value function, we again consider only active goals, which by itself generally results in an inconsistent estimator (*HPG+B*). As will become evident in the following sections, these compromised estimators still lead to remarkable sample efficiency.

In every experiment, a policy is represented by a feedforward neural network with a softmax output layer. The input to such a policy is a pair composed of state and goal. A baseline function is represented by a feedforward neural network with a single (linear) output neuron. The input to such a baseline function is a triple composed of state, goal, and time step. The baseline function is trained to approximate the value function using the mean squared (one-step) temporal difference error (Sutton & Barto, 1998). Parameters are updated using Adam (Kingma & Ba, 2014). The networks are given by the following.

#### 6.2.1 Bit Flipping Environments and Grid World Environments

Both policy and baseline networks have two hidden layers, each with 256 hyperbolic tangent units. Every weight is initially drawn from a gaussian distribution with mean 0 and standard deviation 0.01 (and redrawn if it lies more than two standard deviations from the mean), and every bias is initially zero.
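The initialization scheme can be sketched with rejection sampling, as below. The function names and the list-of-lists weight layout are assumptions made for illustration; a deep learning library would typically provide an equivalent truncated normal initializer.

```python
import random


def truncated_normal(mean=0.0, std=0.01, rng=random):
    """Draw from a gaussian, redrawing until within two standard deviations."""
    while True:
        x = rng.gauss(mean, std)
        if abs(x - mean) <= 2.0 * std:
            return x


def init_layer(n_in, n_out, rng=random):
    """Weights from the truncated gaussian above; biases set to zero."""
    weights = [[truncated_normal(rng=rng) for _ in range(n_out)]
               for _ in range(n_in)]
    biases = [0.0] * n_out
    return weights, biases


weights, biases = init_layer(4, 3, rng=random.Random(0))
```

Every weight produced this way lies in the interval $[-0.02, 0.02]$.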

#### 6.2.2 Ms. Pac-Man Environment

The policy network is represented by a convolutional neural network. The network architecture is given by a convolutional layer with 32 filters ($8 \times 8$, stride 4); convolutional layer with 64 filters ($4 \times 4$, stride 2); convolutional layer with 64 filters ($3 \times 3$, stride 1); and three fully connected layers, each with 256 units. Every unit uses a hyperbolic tangent activation function. Every weight is initially set using variance scaling (Glorot & Bengio, 2010), and every bias is initially zero. These design decisions are similar to the ones made by Mnih et al. (2015).

A sequence of images obtained from the Arcade Learning Environment (Bellemare, Naddaf, Veness, & Bowling, 2013) is preprocessed as follows. Individually for each color channel, an elementwise maximum operation is employed between two consecutive images to reduce rendering artifacts. Such a $210 \times 160 \times 3$ preprocessed image is converted to grayscale, cropped, and rescaled into an $84 \times 84$ image $x_t$. A sequence of images $x_{t-12}, x_{t-8}, x_{t-4}, x_t$ obtained in this way is stacked into an $84 \times 84 \times 4$ image, which is an input to the policy network (recall that each action is repeated for 13 game ticks). The goal information is concatenated with the flattened output of the last convolutional layer.
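The size of the flattened convolutional output can be worked out from the layer specification. The short calculation below assumes convolutions without padding, which the text does not state explicitly; the helper name `conv_output_size` is illustrative.

```python
def conv_output_size(size, kernel, stride):
    """Spatial output size of a convolution without padding (an assumption)."""
    return (size - kernel) // stride + 1


size = 84  # width and height of the stacked input image
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_output_size(size, kernel, stride)

# The last convolutional layer has 64 filters.
flattened = size * size * 64
```

Under this assumption the spatial size shrinks from 84 to 20, 9, and finally 7, so the flattened output has $7 \times 7 \times 64 = 3136$ entries before the goal information is concatenated.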

#### 6.2.3 FetchPush Environment

The policy network has three hidden layers, each with 256 hyperbolic tangent units. Every weight is initially set using variance scaling (Glorot & Bengio, 2010), and every bias is initially zero.

### 6.3 Evaluation

We assess sample efficiency through learning curves and average performance scores, obtained as follows. After collecting a number of batches (composed of trajectories and goals), each of which enables one step of gradient ascent, an agent undergoes evaluation. During evaluation, the agent interacts with the environment for a number of episodes, selecting actions with maximum probability according to its policy. A learning curve shows the average return obtained during each evaluation step, averaged across multiple runs (independent learning procedures). The curves presented in this text also include a $95\%$ bootstrapped confidence interval. The average performance is given by the average return across evaluation steps, averaged across runs. During both training and evaluation, goals are drawn uniformly at random. Note that there is no held-out set of goals for evaluation, since we are interested in evaluating sample efficiency instead of generalization.
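A bootstrapped confidence interval for one evaluation step can be sketched as follows. The text does not specify which bootstrap variant is used, so the percentile method, the number of resamples, and the function name are assumptions.

```python
import random


def bootstrap_ci(values, n_resamples=1000, level=0.95, rng=None):
    """Percentile bootstrap interval for the mean of `values` across runs."""
    rng = rng or random.Random()
    means = []
    for _ in range(n_resamples):
        # Resample the runs with replacement and record the resampled mean.
        sample = [rng.choice(values) for _ in values]
        means.append(sum(sample) / len(sample))
    means.sort()
    lower = means[int((1 - level) / 2 * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 + level) / 2 * n_resamples))]
    return lower, upper


# Hypothetical average returns of four runs at one evaluation step.
lower, upper = bootstrap_ci([1.0, 2.0, 3.0, 4.0], rng=random.Random(0))
```

Applying this function to the per-run returns at each evaluation step yields the band drawn around a learning curve.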

For every combination of environment and batch size, grid search is used to select hyperparameters for each estimator according to average performance scores (after the corresponding standard deviation across runs is subtracted, as suggested by Duan et al., 2016). Definitive results are obtained by using the best hyperparameters found for each estimator in additional runs. In most cases, we present definitive results for small (2) and medium (16) batch sizes.
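The selection criterion can be sketched as follows, assuming that `results` maps each hyperparameter setting to the average performance scores of its runs. The names, the example scores, and the use of the sample standard deviation are assumptions.

```python
import statistics


def selection_score(run_scores):
    """Average performance minus its standard deviation across runs."""
    return statistics.mean(run_scores) - statistics.stdev(run_scores)


def select_best(results):
    """Return the hyperparameter setting with the highest selection score."""
    return max(results, key=lambda setting: selection_score(results[setting]))


# Hypothetical scores: both learning rates have the same mean performance,
# but the first one is far more consistent across runs.
results = {0.01: [4.0, 4.0, 4.0], 0.001: [5.0, 1.0, 6.0]}
best = select_best(results)
```

Subtracting the standard deviation favors settings that perform consistently across runs over settings that are good only on average.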

Tables 1, 2, and 3 document the experimental settings. The number of runs, training batches, and batches between evaluations are reported separately for hyperparameter search and definitive runs. The number of training batches is adapted according to how soon each estimator leads to apparent convergence. Note that it is very difficult to establish this setting before hyperparameter search. The number of batches between evaluations is adapted so that there are 100 evaluation steps in total.

| Setting | Bit Flipping (8 bits), Batch Size 2 | Bit Flipping (8 bits), Batch Size 16 | Bit Flipping (16 bits), Batch Size 2 | Bit Flipping (16 bits), Batch Size 16 |
|---|---|---|---|---|
| Runs (definitive) | 20 | 20 | 20 | 20 |
| Training batches (definitive) | 5000 | 1400 | 15,000 | 1000 |
| Batches between evaluations (definitive) | 50 | 14 | 150 | 10 |
| Runs (search) | 10 | 10 | 10 | 10 |
| Training batches (search) | 4000 | 1400 | 4000 | 1000 |
| Batches between evaluations (search) | 40 | 14 | 40 | 10 |
| Policy learning rates | $R_1$ | $R_1$ | $R_1$ | $R_1$ |
| Baseline learning rates | $R_1$ | $R_1$ | $R_1$ | $R_1$ |
| Episodes per evaluation | 256 | 256 | 256 | 256 |
| Maximum active goals per episode | $\infty$ | $\infty$ | $\infty$ | $\infty$ |


| Setting | Empty Room, Batch Size 2 | Empty Room, Batch Size 16 | Four Rooms, Batch Size 2 | Four Rooms, Batch Size 16 |
|---|---|---|---|---|
| Runs (definitive) | 20 | 20 | 20 | 20 |
| Training batches (definitive) | 2200 | 200 | 10,000 | 1700 |
| Batches between evaluations (definitive) | 22 | 2 | 100 | 17 |
| Runs (search) | 10 | 10 | 10 | 10 |
| Training batches (search) | 2500 | 800 | 10,000 | 3500 |
| Batches between evaluations (search) | 25 | 8 | 100 | 35 |
| Policy learning rates | $R_1$ | $R_1$ | $R_1$ | $R_1$ |
| Baseline learning rates | $R_1$ | $R_1$ | $R_1$ | $R_1$ |
| Episodes per evaluation | 256 | 256 | 256 | 256 |
| Maximum active goals per episode | $\infty$ | $\infty$ | $\infty$ | $\infty$ |


| Setting | Ms. Pac-Man, Batch Size 2 | Ms. Pac-Man, Batch Size 16 | FetchPush, Batch Size 2 | FetchPush, Batch Size 16 |
|---|---|---|---|---|
| Runs (definitive) | 10 | 10 | 10 | 10 |
| Training batches (definitive) | 40,000 | 12,500 | 40,000 | 12,500 |
| Batches between evaluations (definitive) | 400 | 125 | 400 | 125 |
| Runs (search) | 5 | 5 | 5 | 5 |
| Training batches (search) | 40,000 | 12,000 | 40,000 | 15,000 |
| Batches between evaluations (search) | 800 | 120 | 800 | 300 |
| Policy learning rates | $R_2$ | $R_2$ | $R_2$ | $R_2$ |
| Episodes per evaluation | 240 | 240 | 512 | 512 |
| Maximum active goals per episode | $\infty$ | 3 | $\infty$ | 3 |


Other settings include the sets of policy and baseline learning rates under consideration for hyperparameter search, and the number of active goals subsampled per episode. In Tables 1, 2, and 3, $R_1 = \{\alpha \times 10^{-k} \mid \alpha \in \{1, 5\} \text{ and } k \in \{2, 3, 4, 5\}\}$ and $R_2 = \{\beta \times 10^{-5} \mid \beta \in \{1, 2.5, 5, 7.5, 10\}\}$.
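For concreteness, the two learning rate sets can be enumerated as follows. The variable names mirror the notation above, and sorting is applied only for readability.

```python
# Learning rates of the form alpha * 10**(-k), alpha in {1, 5}, k in {2, ..., 5}.
R1 = sorted(alpha * 10.0 ** -k for alpha in (1, 5) for k in (2, 3, 4, 5))

# Learning rates of the form beta * 10**(-5).
R2 = [beta * 1e-5 for beta in (1, 2.5, 5, 7.5, 10)]
```

The first set contains eight learning rates between $10^{-5}$ and $5 \times 10^{-2}$, and the second contains five learning rates between $10^{-5}$ and $10^{-4}$. If policy and baseline learning rates are searched jointly over the first set (an assumption about how the grid is formed), an estimator with a baseline has $8 \times 8 = 64$ candidate settings.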

As already mentioned, the definitive runs use the best combination of hyperparameters (learning rates) found for each estimator. Every setting was carefully chosen during preliminary experiments to ensure that the best result for each estimator is representative. In particular, the best-performing learning rates rarely lie on the extrema of the corresponding search range. In the single case where the best-performing learning rate found by hyperparameter search for a goal-conditional policy gradient estimator was such an extreme value (FetchPush, for a small batch size), evaluating one additional learning rate led to decreased average performance.

### 6.4 Analysis

#### 6.4.1 Bit Flipping Environments

Figure 4 presents the learning curves for $k=8$. Goal-conditional policy gradient estimators with and without an approximate value function baseline (*GCPG+B* and *GCPG*, respectively) obtain excellent policies and lead to comparable sample efficiency. HPG+B obtains excellent policies more than 400 batches earlier than these estimators, but its policies degrade upon additional training. Additional experiments strongly suggest that the main cause of this issue is the fact that the value function baseline is still very poorly fit by the time that the policy exhibits desirable behavior. In comparison, *HPG* obtains excellent policies as early as HPG+B, but its policies remain remarkably stable upon additional training.

The learning curves for $k=16$ are presented in Figure 5. Clearly, both GCPG and GCPG+B are unable to obtain policies that perform better than chance, which is explained by the fact that they rarely incorporate reward signals during training. Confirming the importance of hindsight, HPG leads to stable and sample efficient learning. Although HPG+B also obtains excellent policies, they deteriorate upon additional training.

Similar results can be observed for a small batch size (see section 6.5.3). The average performance results documented in section 6.5.1 confirm that HPG leads to remarkable sample efficiency. Importantly, sections 6.5.4 and 6.5.5 present hyperparameter sensitivity plots suggesting that HPG is less sensitive to hyperparameter settings than the other estimators. A hyperparameter sensitivity plot displays the average performance achieved by each hyperparameter setting (sorted from best to worst along the horizontal axis). Section 6.5.5 also documents an ablation study where the likelihood ratios are removed from HPG, which notably promotes increased hyperparameter sensitivity. This study confirms the usefulness of the correction prescribed by importance sampling.

#### 6.4.2 Grid World Environments

Figure 6 shows the learning curves for the empty room environment. Clearly, every estimator obtains excellent policies, although HPG and HPG+B improve sample efficiency by at least 200 batches. The learning curves for the four-rooms environment are presented in Figure 7. In this surprisingly challenging environment, every estimator obtains unsatisfactory policies. However, it is still clear that HPG and HPG+B improve sample efficiency. In contrast to the experiments presented in the previous section, HPG+B does not give rise to instability, which we attribute to easier value function estimation. Similar results can be observed for a small batch size (see section 6.5.3). HPG achieves the best average performance in every grid world experiment except for a single case, where the best average performance is achieved by HPG+B (see section 6.5.1). The hyperparameter sensitivity plots presented in sections 6.5.4 and 6.5.5 once again suggest that HPG is less sensitive to hyperparameter choices and that ignoring likelihood ratios promotes increased sensitivity (at least in the four-rooms environment).

#### 6.4.3 Ms. Pac-Man Environment

Figure 8 presents the learning curves for a medium batch size. Approximate value function baselines are excluded from this experiment due to the significant cost of systematic hyperparameter search. Although HPG obtains better policies during early training, GCPG obtains better final policies. However, for such a medium batch size, only 3 active goals per episode (out of potentially 28) are subsampled for HPG. Although this harsh subsampling brings computational efficiency, it also appears to handicap the estimator. This hypothesis is supported by the fact that HPG outperforms GCPG for a small batch size, when all active goals are used (see sections 6.5.1 and 6.5.3). Policies obtained using each estimator are illustrated by videos included on the project website.

#### 6.4.4 FetchPush Environment

Figure 9 presents the learning curves for a medium batch size. HPG obtains good policies after a reasonable number of batches, in sharp contrast to GCPG. For such a medium batch size, only 3 active goals per episode (out of potentially 50) are subsampled for HPG, showing that subsampling is a viable alternative to reduce the computational cost of hindsight. Similar results are observed for a small batch size, when all active goals are used (see sections 6.5.1 and 6.5.3). Policies obtained using each estimator are illustrated by videos included on the project website.

### 6.5 Results

#### 6.5.1 Average Performance Results

| | Bit Flipping (8 bits), Batch Size 2 | Bit Flipping (8 bits), Batch Size 16 | Bit Flipping (16 bits), Batch Size 2 | Bit Flipping (16 bits), Batch Size 16 |
|---|---|---|---|---|
| HPG | $4.60 \pm 0.06$ | $4.72 \pm 0.02$ | $7.11 \pm 0.12$ | $7.39 \pm 0.24$ |
| GCPG | $1.81 \pm 0.61$ | $3.44 \pm 0.30$ | $0.00 \pm 0.00$ | $0.00 \pm 0.00$ |
| HPG+B | $3.40 \pm 0.46$ | $4.04 \pm 0.10$ | $5.35 \pm 0.40$ | $6.09 \pm 0.29$ |
| GCPG+B | $0.64 \pm 0.58$ | $3.31 \pm 0.58$ | $0.00 \pm 0.00$ | $0.00 \pm 0.00$ |

| | Empty Room, Batch Size 2 | Empty Room, Batch Size 16 | Four Rooms, Batch Size 2 | Four Rooms, Batch Size 16 |
|---|---|---|---|---|
| HPG | $20.22 \pm 0.37$ | $16.83 \pm 0.84$ | $7.38 \pm 0.16$ | $8.75 \pm 0.12$ |
| GCPG | $12.54 \pm 1.01$ | $10.96 \pm 1.24$ | $4.64 \pm 0.57$ | $6.12 \pm 0.54$ |
| HPG+B | $19.90 \pm 0.29$ | $17.12 \pm 0.44$ | $7.28 \pm 1.28$ | $8.08 \pm 0.18$ |
| GCPG+B | $12.69 \pm 1.16$ | $10.68 \pm 1.36$ | $4.26 \pm 0.55$ | $6.61 \pm 0.49$ |

| | Ms. Pac-Man, Batch Size 2 | Ms. Pac-Man, Batch Size 16 | FetchPush, Batch Size 2 | FetchPush, Batch Size 16 |
|---|---|---|---|---|
| HPG | $6.58 \pm 1.96$ | $6.80 \pm 0.64$ | $6.10 \pm 0.34$ | $13.15 \pm 0.40$ |
| GCPG | $5.29 \pm 1.67$ | $6.92 \pm 0.58$ | $3.48 \pm 0.15$ | $4.42 \pm 0.28$ |

#### 6.5.2 Learning Curves (Batch Size 16)

#### 6.5.3 Learning Curves (Batch Size 2)

#### 6.5.4 Hyperparameter Sensitivity Plots (Batch Size 16)

#### 6.5.5 Hyperparameter Sensitivity Plots (Batch Size 2)

### 6.6 Likelihood Ratio Study

This section presents a study of the active (normalized) likelihood ratios computed by agents during training. A likelihood ratio is considered active if and only if it multiplies a nonzero reward (see expression 5.2). Note that only these likelihood ratios affect gradient estimates based on HPG.

This study is conveyed through plots that encode the distribution of active likelihood ratios computed during training, individually for each time step within an episode. Each plot corresponds to an agent that employs HPG and obtains the highest definitive average performance for a given environment (see Figures 28 to 33). Note that the length of the largest bar for a given time step is fixed to aid visualization.

The most important insight provided by these plots is that likelihood ratios behave very differently across environments, even for equivalent time steps (e.g., compare bit flipping environments to grid world environments). In contrast, after the first time step, the behavior of likelihood ratios changes slowly across time steps within the same environment. In any case, alternative goals have a significant effect on gradient estimates, which agrees with the results presented in the previous sections.

### 6.7 Hindsight Experience Replay Study

#### 6.7.1 Experience Replay

Our implementations of both DQN and DQN+HER are based on OpenAI Baselines (Dhariwal et al., 2017) and use mostly the same hyperparameters that Andrychowicz et al. (2017) used in their experiments on environments with discrete action spaces, all of which resemble our bit flipping environments. The only notable differences in our implementations are the lack of both Polyak averaging and temporal difference target clipping.

Concretely, a cycle begins when an agent collects a number of episodes (16) by following an $\epsilon$-greedy policy derived from its deep Q-network ($\epsilon = 0.2$). The corresponding transitions are included in a replay buffer, which contains at most $10^6$ transitions. In the case of DQN+HER, hindsight transitions derived from a final strategy are also included in this replay buffer (Andrychowicz et al., 2017, sec. 4.5). The cycle is completed by 40 gradient steps: for each step, a batch composed of 128 transitions chosen at random from the replay buffer is used to define a loss function and allow one step of gradient-based minimization. The targets required to define these loss functions are computed using a copy of the deep Q-network from the start of the corresponding cycle. Parameters are updated using Adam (Kingma & Ba, 2014). A discount factor of $\gamma = 0.98$ is used and seems necessary to improve the stability of both DQN and DQN+HER.
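The data-collection half of such a cycle can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the bit flipping dynamics, the reward convention (one when the next state equals the goal, zero otherwise), the fixed horizon, and every function name are assumptions made for the example, and a constant stand-in replaces the deep Q-network.

```python
import random
from collections import deque

def step(state, action):
    """Hypothetical bit flipping dynamics: the action flips one bit."""
    flipped = list(state)
    flipped[action] = 1 - flipped[action]
    return tuple(flipped)

def reward(next_state, goal):
    """Assumed sparse reward: 1 if the goal is reached, 0 otherwise."""
    return 1.0 if next_state == goal else 0.0

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def collect_episode(q_function, start, goal, horizon, epsilon, rng):
    """Return the (state, action, next_state) triples of one episode."""
    episode, state = [], start
    for _ in range(horizon):
        action = epsilon_greedy(q_function(state, goal), epsilon, rng)
        next_state = step(state, action)
        episode.append((state, action, next_state))
        state = next_state
    return episode

def store_with_hindsight(buffer, episode, goal):
    """Store each transition for the original goal and for the final state."""
    achieved = episode[-1][2]  # "final" strategy: last state as alternative goal
    for state, action, next_state in episode:
        buffer.append((state, action, reward(next_state, goal), next_state, goal))
        buffer.append((state, action, reward(next_state, achieved), next_state, achieved))

rng = random.Random(0)
num_bits = 8
buffer = deque(maxlen=10**6)  # replay buffer with at most 10^6 transitions

def constant_q(state, goal):
    """Stand-in for the deep Q-network (assumption for this sketch)."""
    return [0.0] * num_bits

for _ in range(16):  # 16 episodes per cycle
    start = tuple(rng.randint(0, 1) for _ in range(num_bits))
    goal = tuple(rng.randint(0, 1) for _ in range(num_bits))
    episode = collect_episode(constant_q, start, goal, num_bits, 0.2, rng)
    store_with_hindsight(buffer, episode, goal)

# One of the 40 batches of 128 transitions sampled during the cycle.
minibatch = rng.sample(list(buffer), 128)
```

Each stored tuple carries its own goal, so the same sampled batch can mix original and hindsight transitions when the loss is defined.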

#### 6.7.2 Network Architectures

In every experiment, the deep Q-network is implemented by a feedforward neural network with a linear output neuron corresponding to each action. The input to such a network is a triple composed of state, goal, and time step. The network architectures are the same as those described in section 6.2, except that every weight is initially set using variance scaling (Glorot & Bengio, 2010), and all hidden layers use rectified linear units (Nair & Hinton, 2010). For the Ms. Pac-Man environment, the time step information is concatenated with the flattened output of the last convolutional layer (together with the goal information). In comparison to the architecture employed by Andrychowicz et al. (2017) for environments with discrete action spaces, our architectures have one or two additional hidden layers (besides the convolutional architecture used for Ms. Pac-Man).

#### 6.7.3 Experimental Protocol

The experimental protocol employed in our comparison is very similar to the one described in section 6.3. Each agent is evaluated periodically, after a number of cycles that depends on the environment. During this evaluation, the agent collects a number of episodes by following a greedy policy derived from its deep Q-network.

For each environment, grid search is used to select the learning rates for both DQN and DQN+HER according to average performance scores (after the corresponding standard deviation across runs is subtracted, as described in section 6.3). The candidate sets of learning rates are the following: bit flipping and grid world environments: $\{\alpha \times 10^{-k} \mid \alpha \in \{1, 5\} \text{ and } k \in \{2, 3, 4, 5\}\}$; FetchPush: $\{10^{-2}, 5 \times 10^{-3}, 10^{-3}, 5 \times 10^{-4}, 10^{-4}\}$; and Ms. Pac-Man: $\{10^{-3}, 5 \times 10^{-4}, 10^{-4}, 5 \times 10^{-5}, 10^{-5}\}$. These sets were carefully chosen such that the best-performing learning rates do not lie on their extrema.
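The candidate sets and the selection criterion can be written down compactly. The sketch below is illustrative only: the function names are hypothetical, the scores in the usage example are made up, and the use of the sample standard deviation (rather than the population one) is an assumption, since the exact definition is given in section 6.3.

```python
import statistics

def bit_flipping_and_grid_world_rates():
    """Candidate rates alpha * 10^-k for alpha in {1, 5}, k in {2, 3, 4, 5}."""
    return sorted(alpha * 10.0 ** -k for alpha in (1, 5) for k in (2, 3, 4, 5))

FETCH_PUSH_RATES = [1e-2, 5e-3, 1e-3, 5e-4, 1e-4]
MS_PACMAN_RATES = [1e-3, 5e-4, 1e-4, 5e-5, 1e-5]

def select_learning_rate(scores_by_rate):
    """Return the rate with the largest mean score minus standard deviation.

    scores_by_rate maps each learning rate to a list of average performance
    scores, one per run (at least two runs per rate are required).
    """
    def penalized(rate):
        scores = scores_by_rate[rate]
        return statistics.mean(scores) - statistics.stdev(scores)
    return max(scores_by_rate, key=penalized)

# Made-up scores: the second rate has the same mean but varies across runs.
example_scores = {1e-3: [5.0, 5.0, 5.0], 1e-2: [6.0, 2.0, 7.0]}
best_rate = select_learning_rate(example_scores)
```

Subtracting the standard deviation favors learning rates that perform well consistently across runs over rates that occasionally perform very well.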

Definitive results for a given environment are obtained by using the best hyperparameters found for each method in additional runs. These definitive results are directly comparable to our previous results for GCPG and HPG (batch size 16), since every method will have interacted with the environment for the same number of episodes before each evaluation step. For each environment, the number of runs, the number of training batches (cycles), the number of batches (cycles) between evaluations, and the number of episodes per evaluation step are the same as those listed in Tables 1, 2, and 3.

#### 6.7.4 Analysis

#### 6.7.5 Discussion

Our results suggest that the decision between applying HPG or DQN+HER in a particular sparse-reward environment requires experimentation. In contrast, the decision to apply hindsight was always successful.

Note that we have not employed heuristics that are known to sometimes increase the performance of policy gradient methods (such as entropy bonuses, reward scaling, learning rate annealing, and simple statistical baselines) to avoid introducing confounding factors. We believe that such heuristics would allow both GCPG and HPG to achieve good results in both the four-rooms environment and Ms. Pac-Man. Furthermore, whereas hindsight experience replay is directly applicable to state-of-the-art techniques, our work can probably benefit from being extended to state-of-the-art policy gradient approaches, which we intend to explore in future work. Similarly, we believe that additional heuristics and careful hyperparameter settings would allow DQN+HER to achieve good results in the FetchPush environment. This is evidenced by the fact that Andrychowicz et al. (2017) achieve good results using the deep deterministic policy gradient (Lillicrap et al., 2016, DDPG) in a similar environment (with a continuous action space and a different reward function). The empirical comparisons between either GCPG and HPG or DQN and DQN+HER are comparatively more conclusive, since the similarities between the methods minimize confounding factors.

Regardless of these empirical results, policy gradient approaches constitute one of the most important classes of model-free reinforcement learning methods, which by itself warrants studying how they can benefit from hindsight. Our approach is also complementary to previous work, since it is entirely possible to combine a critic trained by hindsight experience replay with an actor that employs hindsight policy gradients. Although hindsight experience replay does not require a correction analogous to importance sampling, indiscriminately adding hindsight transitions to the replay buffer is problematic, which has mostly been tackled by heuristics (Andrychowicz et al., 2017, sec. 4.5). In contrast, our approach seems to benefit from incorporating all available information about goals at every update, which also avoids the need for a replay buffer.

## 7 Conclusion

We introduced techniques that enable learning goal-conditional policies using hindsight. In this context, hindsight refers to the capacity to exploit information about the degree to which an arbitrary goal has been achieved while another goal was intended. Prior to our work, hindsight had been limited to off-policy reinforcement learning algorithms that rely on experience replay (Andrychowicz et al., 2017) and policy search based on Bayesian optimization (Karkus et al., 2016; Pinsler et al., 2019).

In addition to the fundamental hindsight policy gradient, our technical results include its baseline and advantage formulations. These results are based on a self-contained goal-conditional policy framework that is also introduced in this text. Besides the straightforward estimator built on the per decision hindsight policy gradient, we also presented a consistent estimator inspired by weighted importance sampling, together with the corresponding baseline formulation. A variant of this estimator leads to remarkable comparative sample efficiency on a diverse selection of sparse-reward environments, especially in cases where direct reward signals are extremely difficult to obtain. This crucial feature allows natural task formulations that require just trivial reward shaping.

The main drawback of hindsight policy gradient estimators appears to be their computational cost, which is directly related to the number of active goals in a batch. This issue may be mitigated by subsampling active goals, which generally leads to inconsistent estimators. Fortunately, our experiments suggest that this is a viable alternative. Note that the success of hindsight experience replay also depends on an active goal subsampling heuristic (Andrychowicz et al., 2017, sec. 4.5). The inconsistent hindsight policy gradient estimator with a value function baseline employed in our experiments sometimes leads to unstable learning, which is likely related to the difficulty of fitting such a value function without hindsight. This hypothesis is consistent with the fact that such instability is observed only in the most extreme examples of sparse-reward environments. Although our preliminary experiments in using hindsight to fit a value function baseline have been successful, this may be accomplished in several ways and requires a careful study of its own. Further experiments are also required to evaluate hindsight on dense-reward environments.

There are many possibilities for future work besides integrating hindsight policy gradients into systems that rely on goal-conditional policies: deriving additional estimators; implementing and evaluating hindsight (advantage) actor-critic methods; assessing whether hindsight policy gradients can successfully circumvent catastrophic forgetting during curriculum learning of goal-conditional policies; approximating the reward function to reduce required supervision; analyzing the variance of the proposed estimators; studying the impact of active goal subsampling; and evaluating every technique on continuous action spaces.

## Appendix A: Goal-Conditional Policy Gradients

### A.1 Theorem 1

**Theorem 1**

**Proof.**

### A.2 Theorem 2

**Theorem 2**

**Proof.**

### A.3 Lemma 4

**Lemma 4.**

**Proof.**

### A.4 Theorem 3

**Theorem 3**

### A.5 Lemma 1

**Lemma 1**

**Proof.**

### A.6 Theorem 4

**Theorem 4**

### A.7 Theorem 11

^{2} and the fact that $\mathbb{E}[h(T, G) \mid \theta] = 0$ by lemma

^{16},

**Theorem 11.**

**Proof.**

The result is an application of lemma 8.

## Appendix B: Hindsight Policy Gradients

This appendix contains proofs related to the results presented in section 4: theorems 5, 6, 12, and 7 (in, respectively, sections B.1, B.2, B.4, and B.6) and lemma 2 (in section B.3). Section B.5 contains an auxiliary result. Section B.7 presents optimal constant baselines for hindsight policy gradients.

### B.1 Theorem 5

**Theorem 5**

**Proof.**

^{2}, importance sampling allows rewriting the partial derivative $\partial \eta(\theta) / \partial \theta_j$ as

### B.2 Theorem 6

**Theorem 6**

**Proof.**

^{25}and canceling terms,

### B.3 Lemma 2

**Lemma 2.**

**Proof.**

^{31}and writing the expectations explicitly,

^{31}once again and reversing the likelihood-ratio trick,

### B.4 Theorem 12

**Theorem 12**

### B.5 Lemma 6

**Lemma 6**

**Proof.**

^{31}and rewriting the previous equation using expectations,

### B.6 Theorem 7

**Theorem 7**

### B.7 Theorem 13

^{7} and the fact that $\mathbb{E}[h(T) \mid g', \theta] = 0$ by lemma

^{16},

**Theorem 13.**

**Proof.**

The result is an application of lemma 8.

## Appendix C: Hindsight Gradient Estimators

This appendix contains proofs related to the estimators presented in section 5: theorems 9 and 10 (in, respectively, sections C.1 and C.2). Section C.3 presents a result that enables a consistency-preserving weighted baseline.

### C.1 Theorem 9

**Theorem 9.**

**Proof.**

^{7}, the expected value $\mathbb{E}[I_j^{(N)} \mid \theta]$ is given by

Therefore, $I_j^{(N)}$ is an unbiased estimator of $\partial \eta(\theta) / \partial \theta_j$.

Therefore, $I_j^{(N)}$ is a consistent estimator of $\partial \eta(\theta) / \partial \theta_j$.

### C.2 Theorem 10

**Theorem 10.**

**Proof.**

^{31}, and canceling terms, $\mathbb{E}[X_i]$ can be written as

Conditionally on $\Theta$, the variable $X(g, t, t')_j^{(N)}$ is an average of i.i.d. random variables with expected value $\mathbb{E}[X_i]$. By the strong law of large numbers (Sen & Singer, 1994, theorem 2.3.13), $X(g, t, t')_j^{(N)} \overset{\text{a.s.}}{\longrightarrow} \mathbb{E}[X_i]$.

^{31}, the expected value $\mathbb{E}[Y_i] = \mathbb{E}[Y(T^{(i)}, G^{(i)}, g, t, t', \theta)_j \mid \theta]$ is given by

Conditionally on $\Theta$, the variable $Y(g, t, t')_j^{(N)}$ is an average of i.i.d. random variables with expected value 1. By the strong law of large numbers, $Y(g, t, t')_j^{(N)} \overset{\text{a.s.}}{\longrightarrow} 1$.

^{2} and the fact that $W_j^{(N)}$ is a linear combination of terms $X(g, t, t')_j^{(N)} / Y(g, t, t')_j^{(N)}$,

### C.3 Theorem 14

**Theorem 14.**

**Proof.**

Conditionally on $\Theta$, the variable $X(g, t)_j^{(N)}$ is an average of i.i.d. random variables with expected value zero. By the strong law of large numbers (Sen & Singer, 1994, theorem 2.3.13), $X(g, t)_j^{(N)} \overset{\text{a.s.}}{\longrightarrow} 0$.

^{13}. Because both $X(g, t)_j^{(N)}$ and $Y(g, t)_j^{(N)}$ converge almost surely to real numbers (Thomas, 2015, chap. 3, property 2),

Because $B_j^{(N)}$ is a linear combination of terms $X(g, t)_j^{(N)} / Y(g, t)_j^{(N)}$, $B_j^{(N)} \overset{\text{a.s.}}{\longrightarrow} 0$.

Clearly, if $E^{(N)}$ is a consistent estimator of some quantity given $\theta$, then so is $E^{(N)} - B_j^{(N)}$, which allows using this result in combination with theorem 10.
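The consistency arguments in this appendix rest on ratios of sample averages whose denominators converge almost surely to one. The following sketch illustrates that mechanism on a generic self-normalized (weighted) importance sampling problem. It is not the hindsight estimator itself: the distributions `p` and `q`, the function values `f`, and all names are hypothetical choices made for the example.

```python
import random

def importance_sampling_estimates(p, q, f, num_samples, rng):
    """Estimate the expectation of f under p from samples drawn from q.

    Returns the ordinary estimate (an average of i.i.d. terms) and the
    weighted estimate (a ratio of two sample averages).
    """
    outcomes = list(range(len(q)))
    samples = rng.choices(outcomes, weights=q, k=num_samples)
    ratios = [p[x] / q[x] for x in samples]
    weighted_sum = sum(r * f[x] for r, x in zip(ratios, samples))
    ordinary = weighted_sum / num_samples  # denominator is the sample count
    weighted = weighted_sum / sum(ratios)  # denominator is the sum of ratios
    return ordinary, weighted

p = [0.2, 0.3, 0.5]  # target distribution (hypothetical)
q = [0.5, 0.3, 0.2]  # sampling distribution (hypothetical)
f = [1.0, 2.0, 4.0]  # function values (hypothetical)
true_value = sum(pi * fi for pi, fi in zip(p, f))

ordinary, weighted = importance_sampling_estimates(p, q, f, 50000, random.Random(0))
```

Because the average of the likelihood ratios converges to one, the weighted estimate approaches the same limit as the ordinary one, even though it is generally biased for a finite number of samples.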

## Appendix D: Fundamental Results

### D.1 Lemma 5

**Lemma 5.**

**Proof.**

### D.2 Lemma 6

**Lemma 6.**

**Proof.**

^{16},

### D.3 Lemma 7

**Lemma 7.**

**Proof.**

From the definition of value function, $z = \mathbb{E}[V_{t+1}^{\theta}(S_{t+1}, g) \mid S_t = s, A_t = a]$.

### D.4 Theorem 8

**Theorem 8.**

**Proof.**

The result follows from the definition of advantage function and lemma 7.

### D.5 Lemma 8

Consider a discrete random variable $X$ and real-valued functions $f$ and $h$. Suppose also that $\mathbb{E}[h(X)] = 0$ and $\operatorname{Var}[h(X)] > 0$. Clearly, for every $b \in \mathbb{R}$, we have $\mathbb{E}[f(X) - b h(X)] = \mathbb{E}[f(X)]$.

**Lemma 8.**

**Proof.**

The first and second derivatives of $v$ with respect to $b$ are given by $dv/db = -2\mathbb{E}[f(X) h(X)] + 2b\mathbb{E}[h(X)^2]$ and $d^2v/db^2 = 2\mathbb{E}[h(X)^2]$. Our assumptions guarantee that $\mathbb{E}[h(X)^2] > 0$. Therefore, by Fermat's theorem, if $b$ is a local minimum, then $dv/db = 0$, leading to the desired equality. By the second derivative test, $b$ must be a local minimum.
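Setting the first derivative above to zero gives $b = \mathbb{E}[f(X) h(X)] / \mathbb{E}[h(X)^2]$. The sketch below checks this numerically on a small discrete example, assuming that $v(b)$ denotes the variance of $f(X) - b h(X)$, which is consistent with the derivatives given in the proof. The probabilities and function values are hypothetical and chosen so that $\mathbb{E}[h(X)] = 0$.

```python
def expectation(values, probs):
    """Expected value of a function of a discrete random variable."""
    return sum(v * p for v, p in zip(values, probs))

def variance_with_baseline(f, h, probs, b):
    """Variance of f(X) - b * h(X) for a discrete X."""
    g = [fi - b * hi for fi, hi in zip(f, h)]
    mean = expectation(g, probs)
    return expectation([(gi - mean) ** 2 for gi in g], probs)

def optimal_baseline(f, h, probs):
    """Stationary point b = E[f(X) h(X)] / E[h(X)^2] of the variance."""
    numerator = expectation([fi * hi for fi, hi in zip(f, h)], probs)
    denominator = expectation([hi ** 2 for hi in h], probs)
    return numerator / denominator

probs = [0.25, 0.25, 0.5]
f = [1.0, 3.0, 2.0]
h = [2.0, -4.0, 1.0]  # chosen so that the expected value of h(X) is zero

b_star = optimal_baseline(f, h, probs)
v_star = variance_with_baseline(f, h, probs, b_star)
v_without_baseline = variance_with_baseline(f, h, probs, 0.0)
```

In this example, the variance at `b_star` is smaller than the variance without a baseline and than the variance at nearby values of `b`, as the second derivative test predicts.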

## Note

1. An open-source implementation of these estimators is available at http://paulorauber.com/hpg.

## Acknowledgments

We thank Sjoerd van Steenkiste, Klaus Greff, Imanol Schlag, and the anonymous reviewers of previous versions of this work for their valuable feedback. This research was supported by the Swiss National Science Foundation (grant 200021_165675/1), the European Research Council (Advanced Grant 742870), and CAPES (Filipe Mutz, PDSE, 88881.133206/2016-01). We are grateful to Nvidia Corporation for donating a *DGX-1* machine and to IBM for donating a Minsky machine.

## References

*Advances in neural information processing systems*

*30*(pp.

*Journal of Artificial Intelligence Research*

*Pattern recognition and machine learning*

*Proceedings of International Conference of Machine Learning*

*Proceedings of the IEEE International Conference on Robotics and Automation, 2014*

*Openai baselines*

*Proceedings of International Conference on Machine Learning*

*Journal of Machine Learning Research*

*Proceedings of the First Annual Conference on Robot Learning*

*International Conference on Learning Representations*

*Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics*

*Journal of Machine Learning Research*

*Advances in neural information processing systems*

*23*

*Factored contextual policy search with Bayesian optimization*

*Proceedings of the Third International Conference on Learning Representations*

*Proceedings of the National Academy of Sciences*

*Autonomous Robots*

*Advances in neural information processing systems*

*Proceedings of the 27th AAAI Conference on Artificial Intelligence*

*Proceedings of the International Conference on Learning Representations*

*Proceedings of the International Conference on Learning Representations*

*Machine Learning*

*Unicorn: Continual learning with a universal, off-policy agent*

*Psychology of Learning and Motivation: Advances in Research and Theory*

*Proceedings of the Second Machine Learning in Planning and Control of Robot Motion Workshop.*

*Nature*

*Advances in neural information processing systems*

*Proceedings of the International Conference on Machine Learning*

*Proceedings of the International Conference on Machine Learning*

*Proceedings of the 34th International Conference on Machine Learning*

*International Conference on Learning Representations*

*Proceedings of the Nineteenth International Conference on Machine Learning*

*Neural Networks*

*Proceedings of the 2019 International Conference on Robotics and Automation*

*Multi-goal reinforcement learning: Challenging robotics environments and request for research*

*International Conference on Machine Learning*

*Proceedings of the International Conference on Machine Learning*

*Frontiers in Psychology*

*Reinforcement learning upside down: Don't predict rewards—just map them to actions*

*Learning to generate focus trajectories for attentive vision*

*Large sample methods in statistics: An introduction with applications*

*Neural Networks*

*Proceedings of the International Conference on Learning Representations*

*Reinforcement learning: An introduction*

*Advances in neural information processing systems*

*Artificial Intelligence*

*Safe reinforcement learning*

*Proceedings of the International Conference on Machine Learning*

*Proceedings of the 34th International Conference on Machine Learning*

*Reinforcement-learning in connectionist networks: A mathematical analysis*

*Machine Learning*

*Proceedings of the IEEE International Conference on Robotics and Automation*