Abstract
Adaptive behavior often requires predicting future events. The theory of reinforcement learning prescribes what kinds of predictive representations are useful and how to compute them. This review integrates these theoretical ideas with work on cognition and neuroscience. We pay special attention to the successor representation and its generalizations, which have been widely applied as both engineering tools and models of brain function. This convergence suggests that particular kinds of predictive representations may function as versatile building blocks of intelligence.
1. Introduction
The ability to make predictions has been hailed as a general feature of both biological and artificial intelligence, cutting across disparate perspectives on what constitutes intelligence (Ciria et al., 2021; Clark, 2013; Friston & Kiebel, 2009; Ha & Schmidhuber, 2018; Hawkins & Blakeslee, 2004; Littman & Sutton, 2001; Lotter et al., 2016). Despite this general agreement, attempts to formulate the idea more precisely raise many questions: Predict what, and over what timescale? How should predictions be represented? How should they be used, evaluated, and improved? These normative “should” questions have corresponding empirical questions about the nature of prediction in biological intelligence. Our goal is to provide systematic answers to these questions. We will develop a small set of principles that have broad explanatory power.
Our perspective is based on an important distinction between predictive models and predictive representations. A predictive model is a probability distribution over the dynamics of a system’s state. A model can be “run forward” to generate predictions about the system’s future trajectory. This offers a significant degree of flexibility: an agent with a predictive model can, given enough computation time, answer virtually any query about the probabilities of future events. However, the “given enough computation time” proviso places a critical constraint on what can be done with a predictive model in practice. An agent that needs to act quickly under stringent computational constraints may not have the luxury of posing arbitrarily complex queries to its predictive model. Predictive representations, however, cache the answers to certain queries, making them accessible with limited computational cost.1 The price paid for this efficiency gain is a loss of flexibility: only certain queries can be accurately answered.
Caching is a general solution to ubiquitous flexibility-efficiency trade-offs facing intelligent systems (Dasgupta & Gershman, 2021). Key to the success of this strategy is caching representations that make task-relevant information directly accessible to computation. We will formalize the notion of task-relevant information, as well as what kinds of computations access and manipulate this information, in the framework of reinforcement learning (RL) theory (Sutton & Barto, 2018). In particular, we will show how one family of predictive representations, the successor representation (SR) and its generalizations, distills information that is useful for efficient computation across a wide variety of RL tasks. These predictive representations facilitate exploration, transfer, temporal abstraction, unsupervised pretraining, multi-agent coordination, creativity, and episodic control. On the basis of such versatility, we argue that these predictive representations can serve as fundamental building blocks of intelligence.
Converging support for this argument comes from cognitive science and neuroscience. We review a body of data indicating that the brain uses predictive representations for a range of tasks, including decision making, navigation, and memory. We also discuss biologically plausible algorithms for learning and computing with predictive representations. This convergence of biological and artificial intelligence suggests that predictive representations may be a widely used tool for intelligent systems.
Several previous surveys on predictive representations have scratched the surface of these connections (Gershman, 2018; Momennejad, 2020). The purpose of this survey is to approach the topic in much greater detail, yielding a comprehensive reference on both technical and scientific aspects. Despite this broad scope, the survey’s focus is restricted to predictive representations in the domain of RL; we do not review predictive representations that have been developed for language modeling, vision, and other problems. An important long-term goal will be to fully synthesize the diverse notions of predictive representations across these domains.
2. Theory
In this section, we introduce the general problem setup and a classification of solution techniques. We then formalize the SR and discuss how it fits into the classification scheme. Finally, we describe two key extensions of the SR that make it much more powerful: the successor model and successor features. Due to space constraints, we omit some more exotic variants such as the first-occupancy representation (Moskovitz et al., 2022) or the forward-backward representation (Touati et al., 2022).
2.1 The Reinforcement Learning Problem
2.2 Classical Solution Methods
Algorithmic solutions to the RL problem. An agent solving a three-armed maze (bottom) can adopt different classes of strategies (top). Model-based strategies (left) learn an internal model of the environment, including the transition function, the reward function, and (optionally) the state features. At decision time, the agent can run forward simulations to predict the outcomes of different actions. Model-free strategies (middle) learn action values and/or a policy. At decision time, the agent can consult the cached action values and/or policy in the current state. Strategies relying on predictive representations (right) learn the successor representation (SR) matrix mapping states to future states and/or the successor features mapping states to future features, as well as the reward function. At decision time, the agent can consult the cached predictions and cross-reference them with its task (specified by the reward function) to choose an action.
These definitions allow us to be precise about what we mean by predictive model and predictive representation. A predictive model corresponds to an internal model of the transition distribution together with an internal model of the reward function. An agent equipped with such a model can simulate state trajectories and answer arbitrary queries about the future. Of principal relevance to solving MDPs is policy evaluation, an answer to the query, “How much reward do I expect to earn in the future under my current policy?” A simple (but inefficient) way to do this, known as Monte Carlo policy evaluation, is to run many simulations from each state (roll-outs) and then average the discounted return. The basic problem with this approach stems from the curse of dimensionality (Bellman, 1957): the trajectory space is very large, requiring a number of roll-outs that is exponential in the trajectory length.
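To make the cost of Monte Carlo policy evaluation concrete, here is a minimal sketch in Python. The `step` and `policy` callables, the number of roll-outs, and the horizon are illustrative assumptions rather than any published implementation; the point is simply that every value estimate requires averaging over many full roll-outs.

```python
import numpy as np

def mc_policy_evaluation(step, policy, states, gamma=0.95,
                         n_rollouts=1000, horizon=200):
    """Estimate V(s) under a fixed policy by averaging discounted returns.

    `step(s, a) -> (next_state, reward)` and `policy(s) -> action` are
    assumed to be supplied by the environment and agent, respectively.
    """
    V = {}
    for s0 in states:
        returns = []
        for _ in range(n_rollouts):
            s, G, discount = s0, 0.0, 1.0
            for _ in range(horizon):          # truncated roll-out
                a = policy(s)
                s, r = step(s, a)
                G += discount * r
                discount *= gamma
            returns.append(G)
        V[s0] = float(np.mean(returns))       # Monte Carlo estimate of V(s0)
    return V
```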
The approximation becomes an equality when the agent’s internal model is accurate.
Value iteration is powerful, but still too cumbersome for large state spaces, since each iteration requires a full sweep over the state space. The basic problem is that algorithms like value iteration attempt to compute the optimal policy for every state, but in an online setting, an agent only needs to worry about what action to take in its current state. This problem is addressed by tree search algorithms, which rely on roll-outs (as in Monte Carlo policy evaluation) but only from the current state. When combined with heuristics for determining which roll-outs to perform (e.g., learned value functions; see below), this approach can be highly effective (Silver et al., 2016).
Despite their effectiveness for certain problems (e.g., games like Go and chess), model-based algorithms have had only limited success in a wider range of problems (e.g., video games) due to the difficulty of learning a good model and planning in complex (possibly infinite/continuous) state spaces.4 For this reason, much of the work in modern RL has focused on model-free algorithms.
A model-free agent by definition has no access to a transition model (and sometimes not even a reward model), but it can still answer certain queries about the future if it has cached a predictive representation. For example, an agent could cache an estimate of the state-action value function. This predictive representation does not afford the same flexibility as a model of the MDP, but it has the advantage of caching, in a computationally convenient form, exactly the information about the future that an agent needs to act optimally.
We have briefly discussed the dichotomy of model-based versus model-free algorithms for learning an optimal policy. Model-based algorithms are more flexible—capable of generating predictions about future trajectories—while model-free algorithms are more computationally efficient—capable of rapidly computing the approximate value of an action. The flexibility of model-based algorithms is important for transfer: when the environment changes locally (e.g., a route is blocked or the value of a state is altered), an agent’s model will typically also change locally, allowing it to transfer much of its previously learned knowledge without extensive new learning. In contrast, a cached value function approximation (due to its long-term dependencies) will change nonlocally, necessitating more extensive learning to update all the affected cached values.
One of the questions we aim to address is how to get some aspects of model-based flexibility without learning and computing with a predictive model. This leads us to another class of predictive representations: the SR. In this section, we describe the SR (see section 2.3), its probabilistic variant (the successor model; see section 2.4), and an important generalization (successor features; see section 2.5). A visual overview of these predictive representations is shown in Figure 2. Applications of these concepts are covered in section 4.
Three kinds of predictive representations: the successor representation (see section 2.3), the successor model (see section 2.4), and successor features (see section 2.5). Their computations are summarized in Table 1. Each of these predictive representations describes a state by a prediction of what will happen when a policy is followed. With the successor representation, one gets a description of how much all states will be visited in the near future when beginning at a given state. One limitation is that this does not scale well to large state spaces, since it is impractical to maintain predictions about every state in the state space. Successor models circumvent this challenge by framing learning as a density estimation problem. By leveraging methods from density estimation, an agent can efficiently learn successor models and scale to high-dimensional state and action spaces (including continuous spaces) with amortized learning procedures (see section 3.2). Successor features are another method for circumventing the challenge of representing large state spaces: states are described with a shared set of state features, and rather than predicting how much all states will be visited, the agent predicts how much each feature will be experienced. Both successor models and successor features have their own pros and cons. Successor models are useful because they open up new possibilities, such as temporally abstract sampling of future states under a policy. Additionally, methods for learning successor models typically subsume learning of state features, whereas successor features typically need a separate mechanism for learning state features. On the other hand, successor features are easier to learn and more readily enable stitching together policies concurrently (see section 2.5.1) and sequentially (see section 2.5.2) in time—though there is progress on doing this with successor models (see section 4.2.3).
2.3 The Successor Representation
The SR, denoted $M$, was introduced to address the transfer problem described in the previous section (Dayan, 1993; Gershman, 2018). In particular, the SR is well suited for solving sets of tasks that share the same transition structure but vary in their reward structure; we delve more into this problem setting later when we discuss applications.
The successor representation (SR). (a) A schematic of an environment where the agent is a red box at one state and the goal is a green box at another. In general, the SR (see equation 2.13) describes the discounted occupancy of each state when beginning at a given state and following a policy. In panels b and c, we show the SR for a random policy and an optimal policy. (b) The SR under a random policy assigns high state occupancy near the agent’s current state and low state occupancy to points farther away from the agent. (c) The SR under the optimal policy has the highest state occupancy along the shortest path to the goal, fading with distance from the current state. In contrast to a random policy, states not along that path have zero occupancy. Once we know a reward function, we can efficiently evaluate both policies (see equation 2.19). (d) An example reward function with a small cost for each state except the goal state, where the reward is 1. The SR allows us to efficiently compute (e) the value function under a random policy and (f) the value function under the optimal policy.
Second, the SR can, like model-based algorithms, adapt quickly to certain kinds of environmental changes. In particular, local changes to an environment’s reward structure induce local changes in the reward function, which immediately propagate to the value estimates when combined with the SR.6 Thus, SR-based value computation enjoys flexibility comparable to model-based algorithms, at least for changes to the reward structure. Changes to the transition structure, however, require more substantial nonlocal changes to the SR due to the fact that an internal model of the detailed transition structure is not available.
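A tabular sketch illustrates this flexibility. Assuming a finite state space and a fixed policy, the SR has the closed form $M = (I - \gamma T^\pi)^{-1}$, and values follow from a single matrix-vector product $V = Mr$, so a local change to the reward vector updates all values without relearning $M$. The toy transition matrix and reward vectors below are illustrative.

```python
import numpy as np

def successor_representation(T_pi, gamma=0.95):
    """Closed-form tabular SR: M = (I - gamma * T_pi)^(-1).

    T_pi[s, s'] is the state-to-state transition matrix under the policy.
    """
    n = T_pi.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * T_pi)

# Toy 3-state chain under a random-walk policy (illustrative numbers).
T_pi = np.array([[0.5, 0.5, 0.0],
                 [0.5, 0.0, 0.5],
                 [0.0, 0.5, 0.5]])
M = successor_representation(T_pi, gamma=0.9)

r_old = np.array([0.0, 0.0, 1.0])   # reward only in the last state
r_new = np.array([1.0, 0.0, 0.0])   # reward moves after a local change

V_old = M @ r_old                   # V = M r: one matrix-vector product
V_new = M @ r_new                   # new values without relearning M
```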
Our discussion has already indicated several limitations of the SR. First, the policy-dependence of its predictions limits its generalization ability. Second, the SR assumes a finite, discrete state space. Third, it does not generalize to new environment dynamics: when the transition structure changes, equation 2.15 no longer holds. We discuss in section 3 how the first and second challenges can be addressed and in section 4.2.3 some attempts to address the third challenge.
2.4 Successor Models: A Probabilistic Perspective on the SR
As we’ve discussed, the SR buys efficiency by caching transition structure while maintaining some model-based flexibility. One thing that is lost, however, is the ability to simulate trajectories through the state space. In this section, we introduce a generalization of the SR—the successor model (SM; Janner et al., 2020; Eysenbach et al., 2020)—that defines an explicit distribution over temporally abstract trajectories (see Figure 4). Here, “temporal abstraction” means a conditional distribution over future states within some time horizon rather than only the next time-step captured by the transition model.
The successor model (SM). A cartoon schematic of a robot leg that can hop forward. Left: A single-step model can only compute likelihoods for states at the next time-step. Right: Multistep successor models can compute likelihoods for states over some horizon into the future. One key difference between the SM and the SR is that the SM defines a valid probability distribution. This means that we can leverage density estimation techniques for learning it over continuous state and action spaces. Additionally, as this figure suggests, we can use it to sample potentially distal states (see section 4.4). Adapted with permission from Janner et al. (2020).
Since the SM integrates to 1, a key difference from the SR is that it defines a valid probability distribution. This is important because it allows the SM to generalize to continuous state and action spaces, where we can leverage density estimation techniques to estimate this quantity on a per-state basis. As we will discuss in section 3.2, we can estimate the SM with density estimation techniques such as generative adversarial learning (Janner et al., 2020), variational inference (Thakoor et al., 2022), and contrastive learning (Eysenbach et al., 2020; Zheng et al., 2023).
SMs are interesting because they are a different kind of environment model. Rather than defining transition probabilities over next states, they describe the probability of reaching a given state within a horizon determined by the discount factor when following a policy. While we don’t know exactly when that state will be reached, we can answer queries about whether it will be reached within some relatively long time horizon with less computation than rolling out the base transition model. Additionally, depending on how the SM is learned, we can also sample from it. This can be useful for policy evaluation and model-based control (Thakoor et al., 2022). We discuss this in more detail in section 4.4.
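In the tabular case, the relation between the SR and the SM can be sketched directly: scaling the rows of the SR by $(1 - \gamma)$ yields a valid distribution over discounted future state occupancy, from which the agent can sample a potentially distal state in a single "jumpy" step. The sketch below assumes the convention that occupancy includes the current state; the toy transition matrix is illustrative.

```python
import numpy as np

def successor_model(T_pi, gamma=0.9):
    """Tabular successor model: scale each row of the SR by (1 - gamma) so
    that it becomes a valid probability distribution over future states."""
    n = T_pi.shape[0]
    M = np.linalg.inv(np.eye(n) - gamma * T_pi)   # the SR
    return (1.0 - gamma) * M

def sample_future_state(mu, s, rng=np.random.default_rng()):
    """Jumpy sampling: draw a (potentially distal) future state in one step."""
    p = mu[s] / mu[s].sum()                       # guard against rounding drift
    return int(rng.choice(mu.shape[1], p=p))

# Toy 3-state chain under a random-walk policy (illustrative numbers).
T_pi = np.array([[0.5, 0.5, 0.0],
                 [0.5, 0.0, 0.5],
                 [0.0, 0.5, 0.5]])
mu = successor_model(T_pi, gamma=0.9)
s_future = sample_future_state(mu, s=0)
```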
2.5 Successor Features: A Feature-Based Generalization of the SR
2.5.1 Generalized Policy Improvement: Adaptively Combining Policies
One limitation of equations 2.19 and 2.26 is that they only enable us to recompute the value of states for a new reward function under a known policy. However, we may want to synthesize a new policy from the other policies we have learned so far. We can accomplish this with SFs by combining them with generalized policy improvement (GPI; Barreto et al., 2017), illustrated in Figure 5.
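The core computation in GPI is small enough to sketch directly: given SFs for a library of known policies and a task encoding, compute Q-values for every known policy as a dot product and act greedily with respect to their maximum. The array shapes and names below are assumptions for illustration.

```python
import numpy as np

def gpi_action(sf_library, state, w):
    """Generalized policy improvement (sketch).

    sf_library: a list of arrays, one per known policy, each of shape
                (n_states, n_actions, feature_dim) holding that policy's SFs.
    w:          task encoding such that reward is (approximately) phi . w.
    Returns the action maximizing max_i psi_i(state, a) . w over policies i.
    """
    # Q-values per known policy: shape (n_policies, n_actions).
    q_per_policy = np.stack([psi[state] @ w for psi in sf_library])
    q_gpi = q_per_policy.max(axis=0)     # best known policy for each action
    return int(np.argmax(q_gpi))

# Illustrative usage: two random SF tables over 5 states, 3 actions, 4 features.
rng = np.random.default_rng(0)
sf_library = [rng.random((5, 3, 4)) for _ in range(2)]
w = np.array([1.0, 0.0, 0.0, -1.0])      # hypothetical task encoding
action = gpi_action(sf_library, state=2, w=w)
```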
A schematic of successor features (SFs; see section 2.5) and generalized policy improvement (GPI; see section 2.5.1). We use a shorthand for the state features that describe what is visible to the agent at each time step, and for the set of policies that the agent knows how to perform. (a) Examples of SFs (see equation 2.30) for the “open drawer” and “open fridge” policies. In this hypothetical scenario, the state features that the agent holds describe whether an apple, milk, fork, or knife is present. Beginning from the first time step, the SFs for these policies encode predictions about which of these features will be present when the policies are executed—the apple and milk are predicted to be present for the open fridge policy, and the fork and knife for the open drawer policy. (b) The agent can reuse these known policies with GPI (see equation 2.35). When given a new task, say “get milk,” it can leverage the SFs for its known policies to decide which behavior will enable it to get milk. In this example, the policy for opening the fridge will also lead to milk. The agent selects actions with GPI by computing Q-values for each known behavior as the dot product between the current task encoding and each known SF; the highest Q-value is then used to select actions. If the agent executes the option keyboard (OK; see section 2.5.2), it can adaptively set the task encoding based on the current state. For example, at some states the agent may want to pursue getting milk, while at others it may want to pursue getting a fork. Adapted from Carvalho et al. (2024) with permission.
2.5.2 Option Keyboard: Chaining Together Policies
One advantage of equation 2.35 is that it facilitates adaptation to linear combinations of training task encodings. However, when transferring to a new task encoding, the preferences it specifies are constant over time. This becomes problematic when dealing with complex tasks that necessitate different preferences for different states—for example, tasks that require both avoidance and approach behaviors at different times.
2.6 Summary
The predictive representations introduced above can be concisely organized in terms of particular cumulants, as summarized in Table 1. These cumulants have different strengths and weaknesses. Value functions (reward cumulants) directly represent the key quantity for RL tasks, but they suffer from poor flexibility. The SR (state occupancy cumulant) and its variations (feature occupancy and state probability cumulants) can be used to compute values but also retain useful information for generalization to new tasks (e.g., using generalized policy improvement).
Summary of Predictive Representations That We Focus On.
| Predictive representation | Cumulant | On-policy TD update for a transition $(s, a, s', a')$ under $\pi$ |
| --- | --- | --- |
| $Q^\pi$ (sec. 2.2) | reward $r$ | $Q^\pi(s,a) \leftarrow Q^\pi(s,a) + \alpha\big[r + \gamma Q^\pi(s',a') - Q^\pi(s,a)\big]$ |
| $M^\pi$ (SR; sec. 2.3) | state occupancy $\mathbb{1}[s_t = \tilde{s}]$ | $M^\pi(s,\tilde{s}) \leftarrow M^\pi(s,\tilde{s}) + \alpha\big[\mathbb{1}[s = \tilde{s}] + \gamma M^\pi(s',\tilde{s}) - M^\pi(s,\tilde{s})\big]$ |
| $\psi^\pi$ (SFs; sec. 2.5) | state features $\phi(s_t)$ | $\psi^\pi(s,a) \leftarrow \psi^\pi(s,a) + \alpha\big[\phi(s) + \gamma \psi^\pi(s',a') - \psi^\pi(s,a)\big]$ |
| $\mu^\pi$ (SM; sec. 2.4) | state probability $(1-\gamma)\,\mathbb{1}[s_{t+1} = \tilde{s}]$ | $\mu^\pi(\tilde{s}\mid s,a) \leftarrow \mu^\pi(\tilde{s}\mid s,a) + \alpha\big[(1-\gamma)\,\mathbb{1}[s' = \tilde{s}] + \gamma \mu^\pi(\tilde{s}\mid s',a') - \mu^\pi(\tilde{s}\mid s,a)\big]$ |
Notes: For each predictive representation, we also describe the cumulant that it forms predictions over, along with a corresponding on-policy Bellman (TD) update one can use to learn the representation for a policy $\pi$. $Q^\pi$ is the action-value function, which forms predictions about future reward. $M^\pi$ is the successor representation (SR), which forms predictions about how much a state will be visited. $\psi^\pi$ are the successor features (SFs), which form predictions about how much state features will be experienced. $\mu^\pi$ is the successor model (SM), which predicts the likelihood of experiencing a state in the future.
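The rows of Table 1 differ only in their cumulant, which the following tabular sketch makes explicit: one generic on-policy TD update serves for Q-values, the SR, and SFs, depending on what `cumulant` returns. The array layout, learning rate, and example are illustrative assumptions.

```python
import numpy as np

def td_update(G, s, a, s_next, a_next, cumulant, alpha=0.1, gamma=0.95):
    """One generic on-policy TD update for a tabular predictive representation.

    G[s, a] stores the prediction: a scalar for Q, a vector over states for
    the SR, or a feature vector for SFs.  `cumulant(s, a, s_next)` returns the
    quantity whose discounted sum G predicts (the reward for Q, a one-hot
    state indicator for the SR, state features for SFs; for the SM, the
    cumulant is additionally scaled by 1 - gamma, as in Table 1).
    """
    target = cumulant(s, a, s_next) + gamma * G[s_next, a_next]
    G[s, a] += alpha * (target - G[s, a])
    return G

# Example: learning the SR over 4 states and 2 actions with a one-hot cumulant.
n_states, n_actions = 4, 2
M = np.zeros((n_states, n_actions, n_states))
sr_cumulant = lambda s, a, s_next: np.eye(n_states)[s]
M = td_update(M, s=0, a=1, s_next=2, a_next=0, cumulant=sr_cumulant)
```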
3 Practical Learning Algorithms and Associated Challenges
Of the predictive representations discussed in section 2, only SFs and SMs have been successfully scaled to environments with large, high-dimensional state spaces (including continuous state spaces). Thus, these will be our focus of discussion. We first discuss learning SFs in section 3.1 and then learning successor models in section 3.2.
3.1 Learning Successor Features
3.1.1 Discovering Cumulants
3.1.2 Estimating Successor Features
Another strategy to stabilize SF learning is to learn individual SF dimensions with separate modules (Carvalho et al., 2023, 2024). Beyond stabilizing learning, this modularity also enables approximating SFs that generalize better to novel environment configurations (i.e., which are more robust to novel environment dynamics).
Estimating successor features with changing cumulants. In some cases, the cumulant itself will change over time (e.g., when the environment is nonstationary). This is challenging for SF learning because the prediction target is nonstationary (Barreto et al., 2018). This is an issue even when the environment is stationary but the policy is changing over time: different policies induce different trajectories and different state features induce different descriptions of those trajectories.
Prior work has found that a two-hot representation is a good choice for this purpose (Carvalho et al., 2024; Schrittwieser et al., 2020). In general, estimating predictive representations such as SFs with distributional losses such as equation 3.6 has been shown to reduce the variance of learning updates (Imani & White, 2018). This is particularly important when cumulants are being learned, as learned cumulants can induce high variance in the learning targets.
3.2 Learning Successor Models
In this section, we focus on how to estimate the SM in practice. In a tabular setting, one can leverage TD learning with the Bellman equation in equation 2.25. However, for very large state spaces (such as continuous state spaces of infinite size), this is intractable or impractical. Depending on one’s use case, different options exist for learning. First, we discuss the setting where one wants to learn an SM that can be sampled from (see section 3.2.1). Then we discuss the setting where one only wants to evaluate an SM for different actions given a target state (see section 3.2.2).
3.2.1 Learning Successor Models That One Can Sample From
While this is a simple strategy, it has several challenges. For values of the discount factor close to 1, this becomes a challenging learning problem requiring predictions over very long time horizons. Another challenge is that the data are obtained under a particular behavior policy. In practice, we may want to leverage data collected under a different policy. This happens when, for example, we want to learn from a collection of different data sets, or when we are updating our policy over the course of learning. Learning from such off-policy data can lead to high bias, or to high-variance learning updates when off-policy corrections are applied (Precup et al., 2000).
3.2.2 Learning Successor Models That One Can Evaluate
We can define the SM estimate as the dot product between a predictive representation $\psi(s, a)$ and a label representation $\phi(\tilde{s})$. Here, $\phi$ can be thought of as state features analogous to those used for SFs (see section 2.5), and $\psi$ is a prediction of these future features, similar to SFs, with labels coming from future states; however, it does not necessarily have the same semantics as a discounted sum (that is, equation 2.32). We use similar notation because of their conceptual similarity. We can then understand equation 3.15 as doing the following: the first term in the objective pushes the prediction toward the features at the next time step, and the second term pushes it toward the features of (potentially distal) future states. Both terms repel the prediction from arbitrary “negative” state features.
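A minimal sketch of this dot-product parameterization and a contrastive-style objective is shown below. The binary cross-entropy form is one common choice and is an assumption here, not the exact objective of the cited works (equation 3.15); the feature dimensions and random inputs are illustrative.

```python
import numpy as np

def sm_score(psi_sa, phi_target):
    """Evaluate the (unnormalized) successor model as a dot product between
    the predictive representation psi(s, a) and label features phi(s_target)."""
    return float(psi_sa @ phi_target)

def contrastive_loss(psi_sa, phi_pos, phi_negs):
    """Binary NCE-style sketch: raise the score of a positive (an observed
    future state) and lower the scores of negatives (arbitrary states).
    This is an illustrative stand-in, not the exact published loss."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    pos_term = np.log(sigmoid(psi_sa @ phi_pos) + 1e-8)
    neg_term = np.mean(np.log(1.0 - sigmoid(phi_negs @ psi_sa) + 1e-8))
    return -(pos_term + neg_term)

# Illustrative shapes: feature_dim = 8, with 16 negative samples.
rng = np.random.default_rng(0)
loss = contrastive_loss(rng.random(8), rng.random(8), rng.random((16, 8)))
```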
4 Artificial Intelligence Applications
In this section, we discuss how the SR and its generalizations have enabled advances in artificial agents that learn and transfer policies.
4.1 Exploration
4.1.1 Pure Exploration
Learning to explore and act in the environment before exposure to reward. In the “pure exploration” setting, an agent can explore its environment for some period of time without external reward. In some cases, the goal is to learn a policy that can transfer to an unknown task. SFs can be used to achieve such transfer.
Hansen et al. (2019) leveraged this strategy to develop agents that could explore Atari games without any reward for 250 million time steps and then have 100,000 time steps to earn reward. They showed that this strategy was able to achieve superhuman performance across most Atari games, despite not observing any rewards for most of its experience. Liu and Abbeel (2021) improved on this algorithm by adding an intrinsic reward function that favors exploring parts of the state space that are surprising (i.e., that induce high entropy) given a memory of the agent’s experience. This dramatically improved sample efficiency for many Atari games.
where $\pi$ is a uniform policy.7
4.1.2 Balancing Exploration and Exploitation
4.2 Transfer
We’ve already introduced the idea of cross-task transfer in our discussion of GPI. We now review the broader range of ways in which the challenges of transfer have been addressed using predictive representations.
4.2.1 Transferring Policies between Tasks
We first consider transfer across tasks that are defined by different reward functions. In the following two sections, we consider other forms of transfer.
Few-shot transfer between pairs of tasks. SFs can enable transferring policies from one reward function to another by exploiting equation 2.34 with a learned cumulant and SFs for a source task. At transfer time, one freezes each set of parameters and solves for the task encoding (e.g., with equation 3.1). Kulkarni et al. (2016) showed that this enabled an RL agent that learned to play an Atari game to transfer to a new version of that game where the reward function was scaled. Later, Zhu et al. (2017) showed that this enabled transfer to new “go to” tasks in a photorealistic 3D household environment.
Continual learning across a set of tasks. Beyond transferring across task pairs, an agent may want to continually transfer its knowledge across a set of tasks. Barreto et al. (2017) showed that SFs and GPI provide a natural framework for doing this. Consider learning a sequence of tasks. As the agent learns new tasks, it maintains a growing library of SFs, one per task learned so far. When learning a new task, the agent can select actions with GPI according to equation 2.35, using that task’s encoding as the current transfer task. The agent learns SFs for the current task according to equation 3.5. Zhang et al. (2017) extended this approach to enable continual learning when the environment state space and dynamics were changing across tasks but still relatively similar. Transferring SFs to an environment with a different state space requires leveraging new state features. Their solution involved reusing old state features by mapping them to the new state space with a linear projection. By exploiting linearity in the Q-value decomposition (see equation 2.34), this allowed reusing SFs in new environments.
Zero-shot transfer to task combinations. Another benefit of SFs and GPI is that they facilitate transfer to task conjunctions when these are defined as weighted sums of training task encodings. A clear example is combining “go to” tasks (Barreto et al., 2018, 2020; Borsa et al., 2019; Carvalho et al., 2023). For example, consider four tasks defined by collecting different object types; the weighted combination $\mathbf{w} = (-1, 2, 1, 0)$ defines a new task that avoids collecting objects of type 1 while trying to collect objects of type 2 twice as much as objects of type 3. This approach has been extended to combining policies with continuous state and action spaces (Hunt et al., 2019), though it has so far been limited to combining only two policies. Another important limitation of this general approach is that it can only specify which tasks to prioritize but cannot specify an ordering over those tasks. For example, there is no way to specify collecting object type 1 before object type 2. One can address this limitation by learning a state-dependent transfer task encoding, as with the Option Keyboard (see section 2.5.2).
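As a concrete illustration of the object-collection example, a new task encoding can be assembled as a weighted sum of the training task encodings and handed directly to GPI. The one-hot training encodings below are an illustrative assumption, and `gpi_action` refers to the GPI sketch shown earlier.

```python
import numpy as np

# One-hot training-task encodings for collecting object types 1-4 (assumed).
w_type1, w_type2, w_type3, w_type4 = np.eye(4)

# New task: avoid type 1, collect type 2 twice as much as type 3, ignore type 4.
w_new = -1.0 * w_type1 + 2.0 * w_type2 + 1.0 * w_type3 + 0.0 * w_type4

# Given a library of SFs for the training policies, act greedily via GPI,
# e.g., a = gpi_action(sf_library, state, w_new)   # see the GPI sketch above
```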
4.2.2 Learning about Nontask Goals
In the previous section, we discussed the transfer setting where an agent learns about one task and then subsequently wants to transfer this knowledge to another task. In this section, we consider an agent that is learning one task and wants to concurrently learn policies for other tasks (defined by their own task encodings). That is, each experience gathered while trying to accomplish the current task is reused to learn how to accomplish the others. This can broadly be categorized as off-task learning.
Borsa et al. (2019) showed that one can reuse experiences from the current task to learn control policies for tasks that are not too far away in the task-encoding space by leveraging universal SFs. In particular, nearby nontask goals can be sampled from a gaussian ball around the current task encoding with some standard deviation. An SF loss following equation 3.5 is then applied for each nontask goal. Key to this is that the optimal action for each nontask goal is taken to be the action that maximizes the features determined by that goal’s encoding at the next time step. This enabled an agent to concurrently learn a policy not only for the current task but also for nontask goals with no direct experience on those tasks, in a simple 3D navigation environment.
Another example where off-task learning is useful is hindsight experience replay. Typically, experiences that don’t accomplish a task don’t contribute to learning unless some form of reward shaping is employed. Hindsight experience replay provides a strategy for automating reward shaping. In this setting, when the agent fails to accomplish its task, it relabels one of the states in its experience as a fictitious goal for that experience (Andrychowicz et al., 2017). This strategy is particularly effective when tasks have sparse rewards, as it creates a dense reward signal. When learning a policy with SMs (see section 2.4), hindsight experience replay naturally arises as part of the learning objective. It has been shown to improve sample efficiency in sparse-reward virtual robotic manipulation domains and long-horizon navigation tasks (Eysenbach et al., 2020, 2022; Zheng et al., 2023). Despite their potential, learning and exploiting SMs is still in its infancy, whereas SFs have been more thoroughly studied. Recently, Schramm et al. (2023) developed an asymptotically unbiased importance sampling algorithm that leverages SFs to remove bias when estimating value functions with hindsight experience replay. This enabled learning for both simulated and real-world robotic manipulation tasks in environments with large state and action spaces.
4.2.3 Other Advances in Transfer
Generalization to new environment dynamics. One limitation of SFs is that they’re tied to the environment dynamics with which they were learned. Lehnert and Littman (2020) and Han and Tschiatschek (2021) both attempt to address this limitation by learning SFs over state abstractions that respect bisimulation relations (Li et al., 2006). Abdolshah et al. (2021) attempt to address this by integrating SFs with gaussian processes such that they can be quickly adapted to new dynamics given a small amount of experience in the new environment.
Synthesizing new predictions from sets of SFs. While GPI enables combining a set of SFs to produce a novel policy, it does not generate a novel prediction of what features will be experienced from a combination of policies. Some methods attempt to address this by taking convex combinations of SFs (Brantley et al., 2021; Alegre et al., 2022).
Alternatives to generalized policy improvement. Madarasz and Behrens (2019) develop the gaussian SF, which learns a set of reward maps for different environments that can be adaptively combined to adjudicate between different policies. While this compared favorably to GPI, these results were in toy domains; it is currently unclear whether their method scales to more complex settings as gracefully as GPI. A potentially more promising alternative to GPI is geometric policy composition (GPC; Thakoor et al., 2022), which enables estimating Q-values when one follows an ordered sequence of policies. Whereas GPI evaluates how much reward will be obtained by the best of a set of policies, GPC is a form of model-based control in which the agent evaluates the path obtained from following a sequence of policies. We discuss this in more detail in section 4.4.
4.3 Hierarchical Reinforcement Learning
Many tasks have multiple timescales, which can be exploited using a hierarchical architecture in which the agent learns and acts at multiple levels of temporal abstraction. The classic example of this is the options framework (Sutton et al., 1999). An option is a temporally extended behavior defined by (1) an initiation function that determines when it can be activated, (2) a policy, and (3) a termination function that determines when the option should terminate. In this section, we discuss how predictive representations can be used to discover useful options and transfer them across tasks.
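As a minimal sketch, an option can be packaged as a triple of callables; the state and action types below are placeholders, not part of any published implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Option:
    """A temporally extended behavior in the sense of Sutton et al. (1999).
    State and action types are placeholders in this sketch."""
    initiation: Callable[[Any], bool]    # can the option be activated in state s?
    policy: Callable[[Any], Any]         # which action to take in state s
    termination: Callable[[Any], float]  # probability of terminating in state s
```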
4.3.1 Discovering Options to Efficiently Explore the Environment
Collect samples with a random policy that selects between primitive actions and options. The set of options is initially empty.
Learn the successor representation from the gathered samples.
Get a new exploration option whose policy maximizes the intrinsic reward function in equation 4.7 using the current SR. The initiation function is 1 for all states. The termination function is 1 when the intrinsic reward becomes negative (i.e., when the agent begins to move toward more frequently visited states). This option is added to the overall set of options.
Agents endowed with this strategy were able to discover meaningful options and improve sample efficiency in the four-rooms domain, as well as on challenging Atari games such as Montezuma’s Revenge.
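The loop described above can be sketched as follows. The TD update estimates the SR from the gathered transitions; the `intrinsic_reward` callable stands in for equation 4.7, whose exact form is not reproduced here, and all names and constants are illustrative.

```python
import numpy as np

def discover_exploration_option(transitions, n_states, intrinsic_reward,
                                gamma=0.95, alpha=0.1):
    """Sketch of one iteration of the option-discovery loop above.

    `transitions` is a list of (s, s_next) pairs gathered by the random
    policy; `intrinsic_reward(M, s, s_next)` stands in for equation 4.7
    (assumed positive when moving toward less frequently visited states).
    """
    # Step 2: learn the SR from the gathered samples by TD learning.
    M = np.eye(n_states)
    for s, s_next in transitions:
        indicator = np.eye(n_states)[s]           # one-hot cumulant for the SR
        M[s] += alpha * (indicator + gamma * M[s_next] - M[s])

    # Step 3: the new option's policy maximizes the intrinsic reward
    # (policy learning itself is omitted); it can start anywhere and
    # terminates wherever the intrinsic reward turns negative.
    def termination(s, s_next):
        return intrinsic_reward(M, s, s_next) < 0

    return M, termination
```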
4.3.2 Transferring Options with the SR
Instant synthesis of combinations of options. One of the benefits of leveraging SFs is that they enable transfer to tasks that are linear combinations of known tasks (see section 2.5.1). At transfer time, an agent can exploit this when transferring options by defining subgoals using this space of tasks (see section 2.5.2). In continuous control settings where an agent has learned to move in a set of directions (e.g., up, down, left, right), this has enabled generalization to combinations of these policies (Barreto et al., 2019). For example, the agent could instantly move in novel directions (e.g., up right, down left) as needed to complete a task.
4.4 Jumpy Model-Based Reinforcement Learning
The successor model (SM) is interesting because it offers a novel way to do model-based RL. Traditionally, a model-based agent simulates trajectories with a single-step model. While this is flexible, it is also expensive. SMs enable an alternative strategy, in which the agent instead samples and evaluates likely (potentially distal) states that will be encountered when following some policy. As mentioned in section 4.2.3, Thakoor et al. (2022) leverage this property to develop geometric policy composition (GPC), a novel algorithm that enables a jumpy form of model-based RL. Rather than simulating trajectories defined over next states, agents simulate trajectories by using SMs to jump between states using a given set of policies. While this is not as flexible as simulating trajectories with a single-step model, it is much more efficient.
In RL, one typically uses a large discount factor (close to 1). When learning an SM, this is useful because one can learn likelihoods over potentially very distal states. However, it also makes learning an SM more challenging. GPC mitigates this challenge by composing a shorter-horizon SM (smaller discount) with a longer-horizon SM (larger discount). Composing two separate SMs with different horizons has the following benefits. The shorter-horizon SM is easier to learn but cannot sample futures that are as distal; the longer-horizon SM is harder to learn but can make very long-horizon predictions and better avoids compounding errors. By combining the two, Thakoor et al. (2022) studied how these two sources of error can be traded off.
Intuitively, GPC works as follows. Given a starting state-action pair and a sequence of policies, the agent samples a short chain of next states with the shorter-horizon SM, one jump per policy. The agent then samples a (potentially more distal) state from the longer-horizon SM. The reward estimates for the sampled state-action pairs can be combined as a weighted sum to compute Q-values, analogous to equation 2.28 (see Thakoor et al., 2022, for technical details). Leveraging GPC enabled convergence with an order of magnitude fewer samples in the four-rooms domain and in a continuous-control maze navigation domain.
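The sampling pattern can be sketched in the tabular setting, reusing the tabular SM construction shown earlier: a few jumps from a shorter-horizon SM followed by one jump from a longer-horizon SM. For simplicity, the same short-horizon SM is reused for every jump, whereas GPC lets each jump use a different policy's SM; the reward weighting of the sampled states is omitted.

```python
import numpy as np

def gpc_rollout(mu_short, mu_long, s0, n_short_jumps=3,
                rng=np.random.default_rng()):
    """Sketch of the sampling pattern behind geometric policy composition:
    a few jumps from a shorter-horizon successor model followed by one jump
    from a longer-horizon successor model.  Both are tabular SMs whose rows
    are distributions over future states."""
    states = [s0]
    s = s0
    for _ in range(n_short_jumps):              # one jump per policy in the sequence
        p = mu_short[s] / mu_short[s].sum()
        s = int(rng.choice(mu_short.shape[1], p=p))
        states.append(s)
    p = mu_long[s] / mu_long[s].sum()           # final, potentially distal jump
    states.append(int(rng.choice(mu_long.shape[1], p=p)))
    return states
```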
4.5 Multiagent Reinforcement Learning
As we’ve noted, the SR is conditioned on a policy. In a single-agent setting, the SR provides predictions about what that agent can expect to experience when executing the policy. In multiagent settings, one can parameterize this prediction with another agent’s policy to form predictions about what one can expect to see in the environment when other agents follow their own policies. This is the basis for numerous algorithms that aim to learn about, from, and with other agents (Rabinowitz et al., 2018; Kim et al., 2022; Filos et al., 2021; Gupta et al., 2021; Lee et al., 2019).
4.5.1 Learning about Other Agents
Rabinowitz et al. (2018) showed that an AI agent could learn aspects of theory of mind (including passing a false-belief test) by meta-learning SFs that described other agents. While this work did not explicitly compare against humans (and was not trying to directly model human cognition), it remains an exciting direction for exploring scalable algorithms for human-like theory of mind.
4.5.2 Learning from Other Agents
One nice property of SFs is that they can be learned with TD learning using off-policy data (i.e., data collected from a policy different from the one currently being executed). This can be leveraged to learn SFs for the policies of other agents just as an agent learns SFs for their own policy. Filos et al. (2021) exploited this to design an agent that simultaneously learned SFs for both its own policy and for multiple other agents. They were then able to generalize effectively to new tasks via a combination of all of their policies by exploiting GPI (see equation 2.35).
4.5.3 Learning with Other Agents
4.6 Other Artificial Intelligence Applications
The SR and its generalizations have been broadly applied within other areas of AI. For example, they have been used to define an improved similarity metric in episodic control settings (Emukpere et al., 2021): by leveraging SFs, one can incorporate information from previously experienced states with dynamics similar to the current state. The SR has also been applied to improving importance sampling in off-policy learning. If the agent learns a density ratio similar to equation 3.13, this enables simpler marginalized importance sampling algorithms (Liu et al., 2018) that improve off-policy evaluation (Fujimoto et al., 2021). In addition to these examples, we highlight the following applications of the SR.
4.6.1 Representation Learning
Learning SMs can obviate the need for separate representation losses. In many applications, the reward signal is not enough to drive learning of useful representations. Some strategies to address this challenge include data augmentation and learning of auxiliary tasks. Learning the SM has been shown to enable representation learning with superior sample efficiency without these additions (Eysenbach et al., 2020, 2022; Zheng et al., 2023). Predictive representations can also be used to define an auxiliary task for representation learning, which has been shown to be helpful in several settings. A simple example comes from inspecting the loss for learning SFs (see equation 3.5). In standard Q-learning, the agent learns only about achieving task-specific reward. When learning SFs, the agent also learns representations that support achieving state features that are potentially not relevant for the current task (i.e., the agent is by default learning auxiliary tasks). This ability persists even in a continual learning setting where the distribution of state features is nonstationary (McLeod et al., 2021). Another interesting example comes from proto-value networks (Farebrother et al., 2023). The authors show that learning a successor measure (a set-inclusion-based generalization of the SR) over random sets enables the discovery of predictive representations that support very fast learning in the Atari learning environment.
4.6.2 Learning Diverse Policies
A final application of the SR has been in learning diverse policies. In the MuJoCo environment, Zahavy et al. (2021) showed that SFs enabled discovering a set of diverse policies for controlling a simulated dog avatar. Their approach used SFs to prospectively summarize trajectories. A set of policies was then incrementally learned so that each new policy would differ in its expected features from all policies learned so far. This idea was then generalized to diversify chess-playing strategies based on their expected future features (Zahavy et al., 2023).
5 Neuroscience Applications
In this section, we discuss how the computational ideas reviewed above have been used to understand a variety of brain systems. Medial temporal lobe regions, and in particular the hippocampus, appear to encode predictive representations. We review evidence for this claim and efforts to formalize its mechanistic basis in neurobiology. We also discuss how vector-valued dopamine signals may provide an appropriate learning signal for these representations.
5.1 A Brief Introduction to the Medial Temporal Lobe
Before discussing evidence for predictive representations, it is important to take a sufficiently broad view of the medial temporal lobe’s functional organization. Not everything we know about these regions fits neatly into a theory of predictive representations. Indeed, classical views are quite different, emphasizing spatial representation and episodic memory.
Extensive evidence identifies the hippocampus and associated cortical regions as providing a neural-level representation of space, often conceptualized as a cognitive map (O’Keefe & Nadel, 1978; Morris et al., 1982, see also section 6.4). Integral to this framework are the distinct firing patterns of various cell types found in structures across the hippocampal formation (see Figure 6). Place cells, in regions CA3 and CA1, offer a temporally stable, sparse representation of self-location able to rapidly reorganize in novel environments—the phenomenon of remapping (O’Keefe & Dostrovsky, 1971; Muller & Kubie, 1987; Bostock et al., 1991).
Spatial representations in the medial temporal lobe. As a rodent navigates space (e.g., a rectangular arena; top left), place cells recorded in regions CA1 and CA3 of hippocampus fire in stable, sparse representations of self-location (bottom left; hot colors indicate increased neuronal activity). Conversely, grid cells in neighboring medial entorhinal cortex (EC; bottom right), which provides input to both CA1 and CA3 directly, as well as CA3 via dentate gyrus (DG; solid arrows imply directional connectivity between regions), have spatially periodic hexagonal firing patterns tiling the entire environment. Additionally, boundary-responsive neurons in both medial entorhinal cortex and subiculum (SUB) fire when the animal occupies specific positions relative to external and internal environmental boundaries.
Subiculum and dentate gyrus also contain spatially modulated neurons with broadly similar characteristics, the former tending to be diffuse and elongated along environmental boundaries (Lever et al., 2009) while the latter are extremely sparse (Jung & McNaughton, 1993; Leutgeb et al., 2007). In contrast, in medial entorhinal cortex (mEC), the primary cortical partner of hippocampus, the spatially periodic firing patterns of grid cells effectively tile the entire environment (see Figure 6) and are organized into discrete functional modules of different scale (Hafting et al., 2005; Barry et al., 2007; Stensola et al., 2012). The highly structured activity of grid cells has provoked a range of theoretical propositions pointing to roles in path integration (McNaughton et al., 2006; Burgess et al., 2007), vector-based navigation (Bush et al., 2015; Banino et al., 2018), and as an efficient basis set for spatial generalization (Whittington et al., 2020). Notably, mEC also contains a “zoo” of other cell types with functional characteristics related to self-location, including head direction cells (Sargolini et al., 2006), border cells (Solstad et al., 2008), speed cells (Kropff et al., 2015), and multiplexed conjunctive responses (Sargolini et al., 2006; Hardcastle et al., 2017).
In summary, the medial temporal lobe exhibits remarkable functional diversity. We now turn to the claim that predictive principles offer a unifying framework for understanding aspects of this diversity.
5.2 The Hippocampus as a Predictive Map
Accumulating evidence indicates that neurons within the hippocampus and its surrounding structures, particularly place and grid cells, demonstrate predictive characteristics consistent with a predictive map of spatial states. Stachenfeld et al. (2017) were the first to systematically explore this perspective, establishing a connection between the responses of hippocampal neurons and the SR.8 They argued that place cells were not inherently representing the animal’s spatial location, but rather its expectations about future spatial locations. Specifically, they argued that the receptive fields of place cells correspond to columns of the SR matrix from section 2.3 (see Figure 7, left). This implies that each receptive field is a retrodictive code, in the sense that the cells are more active in locations that tend to precede the cell’s “preferred” location (i.e., the location of the peak firing). The population activity of place cells at a given time corresponds to a row of the SR matrix; this is a predictive code, in the sense that they collectively encode expectations about upcoming states.
Successor representation model of the hippocampus and medial entorhinal cortex. As an agent explores a linear track environment in a unidirectional manner, the SR skews backward down the track opposite to the direction of motion (top left; hot colors indicate increased predicted occupancy of the agent’s depicted state), as observed in hippocampal place cells (Mehta et al., 2000). In a 2D arena, the SR forms place cell–like sparse representations of self-location (bottom left), while the eigenvectors of the SR form spatially periodic activity patterns reminiscent of entorhinal grid cells (top right). Similar to real grid cells (Krupic et al., 2015), the periodicity of these eigenvectors is deformed in polarized environments such as trapezoids (bottom right).
In line with this hypothesis, the tuning of place fields (see Figure 7, bottom left) is influenced by the permissible transitions afforded by an environment: fields do not typically cross environmental boundaries like walls, tending instead to extend along them, mirroring the trajectories animals follow (Alvernhe et al., 2011; Tanni et al., 2022). Alterations to an environment’s layout that affect the available paths influence the activity of adjacent place cells, consistent with the SR (Stachenfeld et al., 2017). Notably, even changes in policy alone, such as training rats to switch between foraging and directed behavior, can markedly alter place cell firing (Markus et al., 1995), also broadly consistent with the SR. Further, when animals are trained to generate highly stereotyped trajectories, for example, repeatedly traversing a track in one direction, CA1 place fields increasingly exhibit a backward skew, opposite the direction of travel (Mehta et al., 2000), thereby anticipating the animal’s future state. This arises naturally from learning the SR, since upcoming spatial states become highly predictable when agents consistently move in one direction, resulting in a backward skew of the successor states (see Figure 7, top left). The basic effect is captured in simple grid worlds like those used by Stachenfeld et al. (2017), but when the tabular state representation is replaced with continuous feature-based methods, successor features also capture the backward shift in field peaks observed in neural data (Mehta et al., 2000; George et al., 2023).
While the properties of place cells are consistent with encoding the SR, grid cells appear to resemble the eigenvectors of the SR (see Figure 7, right). Specifically, eigendecomposition of the SR matrix yields spatially periodic structures of varying scales, heavily influenced by environmental geometry (Stachenfeld et al., 2017) while being relatively robust to the underlying policy (De Cothi & Barry, 2020). Broadly, these resemble grid cells, but notably lack the characteristic hexagonal periodicity except when applied to hexagonal environments. This discrepancy, however, is likely not significant because subsequent work indicates that biological constraints, such as nonnegative firing rates (Dordek et al., 2016; Sorscher et al., 2019), efficiency considerations (Dorrell et al., 2023), and neurobiologically plausible state features (De Cothi & Barry, 2020) tend to move these solutions closer to the expected hexagonal activity patterns (see Figure 8A). The key point then is that environmental geometries that polarize the transitions available to an animal produce SR eigenvectors with commensurate distortions, matching observations that grid firing patterns are also deformed under such conditions (Derdikman et al., 2009; Krupic et al., 2015). Notably, this phenomenon is also observed in open-field environments with straight boundaries, where both biological grid cells and SR-derived eigenvectors exhibit a tendency to orient relative to walls (Krupic et al., 2015; De Cothi & Barry, 2020). Complementary evidence comes from virtual reality studies of human subjects, where errors in distance estimates made by participants mirrored distortions in eigenvector-derived grid cells (Bellmund et al., 2020).
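The eigenvector account can be sketched directly from the SR of a random-walk agent in a square open field. The grid size, discount factor, and symmetrization step below are illustrative choices, not details of any specific model cited above.

    # Sketch: leading eigenvectors of an open-field SR form spatially periodic
    # maps (grid-cell-like under further biological constraints).
    import numpy as np

    N, gamma = 20, 0.95
    n = N * N
    T = np.zeros((n, n))
    for x in range(N):
        for y in range(N):
            s = x * N + y
            nbrs = [(x + dx, y + dy) for dx, dy in [(-1, 0), (1, 0), (0, -1), (0, 1)]
                    if 0 <= x + dx < N and 0 <= y + dy < N]
            for (x2, y2) in nbrs:
                T[s, x2 * N + y2] = 1.0 / len(nbrs)

    M = np.linalg.inv(np.eye(n) - gamma * T)
    # T is not exactly symmetric at boundaries, so eigendecompose the symmetrized SR
    evals, evecs = np.linalg.eigh((M + M.T) / 2)
    periodic_map = evecs[:, -10].reshape(N, N)   # one of the leading nonconstant eigenvectors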
A successor representation with neurobiological state features. (A) Boundary responsive cells, present in subiculum and entorhinal cortex, are used as state features for learning successor features and their eigenvectors (De Cothi & Barry, 2020). These can capture more nuanced characteristics of both place and grid cells compared to one-hot spatial features derived from a grid world. For example, the eigenvectors have increased hexagonal periodicity relative to a grid world SR model. (B) The successor features can convey duplicated fields when additional walls are inserted in the environment, as observed in real place cells (Barry et al., 2006).
Although place and grid cells are predominantly conceptualized as spatial representations, it is increasingly clear that these neurons also represent nonspatial state spaces (Constantinescu et al., 2016; Aronov et al., 2017); in some cases, activity can be interpreted as encoding an SR over such state spaces. For example, a study by Garvert et al. (2017) showed human participants a series of objects on a screen in what appeared to be a random order. However, unknown to the participants, the sequence was derived from a network of nonspatial relationships, where each object followed certain others in a predefined pattern. Brain imaging found that hippocampal and entorhinal activity mirrored the nonspatial predictive relationships of the objects, as if encoded by an SR (see Brunec & Momennejad, 2022).
5.3 Learning a Biologically Plausible SR
In much of the neuroscience work, SRs are formulated over discrete state spaces, facilitating analysis and enabling direct calculation for diffusive trajectories. In spatial contexts, this corresponds to a grid world with one-hot-location encoding, a method that can produce neurobiologically plausible representations (Stachenfeld et al., 2017). However, the brain must use biologically plausible learning rules and features derived from sensory information, with the choice of state features exerting significant influence on the resultant SFs (described in section 2.5).
De Cothi and Barry (2020) employed idealized boundary vector cells (BVCs)—neurons coding for distance and direction to environmental boundaries—as a basis over which to calculate a spatial SR. BVCs have been hypothesized as inputs to place cells (Hartley et al., 2000). They resemble the boundary-responsive cells found in the mEC (Solstad et al., 2008) and subiculum (Barry et al., 2006; Lever et al., 2009); hence, they are plausibly available to hippocampal circuits (see Figure 8A). The SFs of these neurobiological state features and their eigendecomposition resembled place and grid fields as before, but also captured more of the nuanced characteristics of these spatially tuned neurons—for example, the way in which place fields elongate along environmental boundaries (Tanni et al., 2022) and duplicate when additional walls are introduced (see Figure 8B; Barry et al., 2006). Geerts et al. (2020) employed a complementary approach, using an SR over place cell state features in parallel with a model-free framework trained on egocentric features. The dynamics of the interaction between these two elements mirrored the behavioral preference of rodents, which initially favor a map-based navigational strategy before switching to one based on body turns (Geerts et al., 2020; Packard & McGaugh, 1996).
Subsequent models expanded on these ideas. While they differ in implementation and focus, they employ the common idea of embedding transition probabilities into the weights of a network using biologically plausible learning rules. Fang et al. (2023) advanced a framework using a bespoke learning rule acting on a recurrent neural network supplied with features derived from experimental recordings. The network was sufficient to calculate an SR and could do so at different temporal discounts, producing SFs that resembled place cells. Bono et al. (2023) applied spike-time-dependent plasticity (STDP; Bi & Poo, 1998; Kempter et al., 1999), a Hebbian learning rule sensitive to the precise ordering of pre- and postsynaptic spikes, to a single-layer spiking network. Because the ordering of spikes from spatial inputs inherently reflects the sequence of transitions between them, this configuration also learns an SR. Indeed, the authors were able to show that the synaptic weights learned by this algorithm are mathematically equivalent to those learned by TD learning with an eligibility trace. Furthermore, temporally accelerated biological sequences, such as replay (Wilson & McNaughton, 1994; Ólafsdóttir et al., 2016), provide a means to quickly acquire SRs for important or novel routes. Finally, George et al. (2023) followed a similar approach, showing that STDP (Bi & Poo, 1998) applied to theta sweeps—the highly ordered sequences of place cell spiking observed within hippocampal theta cycles (O’Keefe & Recce, 1993; Foster & Wilson, 2007)—was sufficient to rapidly learn place field SFs that were strongly modulated by agent behavior, consistent with empirical observations. Additionally, because the speed and range of theta sweeps are directly linked to the size of the underlying place fields, the authors also noted that the gradient of place field sizes observed along the dorsal-ventral axis of the hippocampus inherently approximates SFs with decreasing temporal discounts (Kjelstrup et al., 2008; Momennejad & Howard, 2018).
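Abstracting away from the specific biological implementations, the common computational core of these models is a TD-style update of successor features over some feature basis. A minimal sketch, assuming a linear parameterization psi(s) ≈ W phi(s) and a generic feature map phi (which could stand in for, e.g., boundary-vector-cell activity), is shown below; the learning rate, discount, and function name are our own choices.

    # Sketch of TD(0) learning of successor features with linear function approximation.
    import numpy as np

    def sf_td_update(W, phi_t, phi_next, alpha=0.1, gamma=0.95):
        """One update of the SF weight matrix W (n_features x n_features)."""
        psi_t = W @ phi_t
        psi_next = W @ phi_next
        td_error = phi_t + gamma * psi_next - psi_t   # vector-valued SF prediction error
        W += alpha * np.outer(td_error, phi_t)        # gradient-style weight update
        return W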
The formulation used by George et al. (2023) highlights a paradox: while place fields can serve as state features, they are also generated as SFs. This dual role might suggest a functional distinction between areas such as CA3 and CA1, with CA3 potentially providing the spatial basis and CA1 representing SFs. Alternatively, spatial bases could originate from upstream circuits, such as mEC, as proposed in the Fang et al. (2023) model. Furthermore, it is conceivable that the initial rapid formation of place fields is governed by a distinct plasticity mechanism, such as behavioral-timescale plasticity (Bittner et al., 2017). Once established, these fields would then serve as a basis for subsequent SF learning. Such a perspective is compatible with observations that populations of place fields in novel environments do not immediately generate theta sweeps (Silva et al., 2015).
These algorithms learn SFs under the premise that the spatial state is fully observable, for example, by a one-hot encoding in a grid world or the firing of BVCs computed across the distances and directions to nearby walls. However, in reality, states are often only partially observable and inherently noisy due to sensory and neural limitations. Vértes and Sahani (2019) present a mechanism for how the SR can be learned in partially observable noisy environments, where state uncertainty is represented in a feature-based approximation via distributed, distributional codes. This method supports RL in noisy or ambiguous environments, for example, navigating a corridor of identical rooms.
In summary, the SR framework can be generalized to a state space comprising continuous, nonidentical, overlapping state features, and acquired with biological learning rules. As such, it is plausible that hippocampal circuits could in principle instantiate a predictive map, or close approximation, based on the mixed population of spatially modulated neurons available from its inputs. The models reviewed above demonstrate ways in which the SR could be learned online during active experience. Nonetheless, much learning in the brain is achieved offline, during periods of wakeful rest or sleep. One candidate neural mechanism for this is replay, rapid sequential activity during periods of quiescence.
5.4 Replay
During periods of sleep and awake rest, hippocampal place cells activate in rapid sequences that recapitulate past experiences (Wilson & McNaughton, 1994; Foster & Wilson, 2007). These reactivations, known as replay, often coordinate with activity in the entorhinal (Ólafsdóttir et al., 2016) and sensory cortices (Ji & Wilson, 2007; Rothschild et al., 2017), and are widely thought to be a core mechanism supporting system-level consolidation of experiential knowledge (Girardeau et al., 2009; Ego-Stengel & Wilson, 2010; Ólafsdóttir et al., 2015). In linear track environments, hippocampal place cells typically exhibit directional tuning with firing fields that disambiguate travel in either direction (Navratilova et al., 2012). This directionality enables two functional classes of replay to be distinguished: forward replay where the sequence reflects the order in which the animal experienced the world, and reverse replay where the behavioral sequence is temporally reversed (analogous to the animal walking tail-first down the track). Intriguingly, the phase of a navigation task has been shown to influence the type of replay that occurs; for example, reverse replay is associated with receipt of reward at the end of trials, while forward replay is more abundant at the start of trials prior to active navigation (Diba & Buzsáki, 2007).
Mattar and Daw (2018) modeled the emergence of forward and reverse replays using a reinforcement learning agent that accesses memories of locations in an order determined by their expected utility. Specifically, the agent prioritizes replaying memories as a balance of two requirements: the need to evaluate imminent choices versus the gain from propagating newly encountered information to preceding locations. The need term for a spatial state corresponds to its expected future occupancy given the agent’s current location, thus utilizing the definition of the SR (see section 2.3 and equation 2.13) to provide a measure for how often in the near future that state will tend to be visited. The gain term represents the expected increase in reward from visiting that state. This mechanism produces sequences that favor adjacent backups: upon discovery of an unexpected reward, the last action executed will have a positive gain, making it a likely candidate for replay. Thus, value information can be propagated backward by chaining successive backups in the reverse direction, simulating reverse replay. Conversely, at the beginning of a trial, when the gain differences are small and the need term dominates, sequences that propagate value information in the forward direction will be the likely candidates for replay, prioritizing nearby backups that extend forward through the states the agent is expected to visit in the near future.
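A minimal sketch of this prioritization scheme, with the need term read directly from a row of the SR, is given below. The gain values are treated as given, rather than derived as in the original model, and the function name is ours.

    # Sketch: expected utility of replaying each state = need (from the SR) x gain.
    import numpy as np

    def replay_priorities(M, current_state, gains):
        """M: SR matrix; gains: per-state expected improvement from a backup."""
        need = M[current_state, :]   # expected discounted future occupancy of each state
        return need * gains          # states with high priority are replayed first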
Sequential neural activity in humans has similarly been observed to exhibit orderings consistent with sampling based on future need. Using a statistical learning task with graph-like dependencies between visual cues (Schapiro et al., 2013; Garvert et al., 2017; Lynn et al., 2020), Wittkuhn et al. (2022) showed participants a series of animal images drawn from a ring-like graph structure in either a uni- or bidirectional manner. Using fMRI data recorded during 10 s pauses between trials, Wittkuhn et al. (2022) found forward and reverse sequential activity patterns in visual and sensorimotor cortices, a pattern well captured by an SR model learned from the graph structure participants experienced.
5.5 Dopamine and Generalized Prediction Errors
As described in section 2.3, TD learning rules provide a powerful algorithm for value estimation. The elegant simplicity of this algorithm led neuroscientists to explore if, and how, TD learning might be implemented in the brain. Indeed, one of the celebrated successes of neuroscience has been the discovery that the activity of midbrain dopamine neurons appears to report reward prediction errors (Schultz et al., 1997) consistent with model-free RL algorithms (see section 2.2, equation 2.11). This successfully accounts for many aspects of dopamine responses in classical and instrumental conditioning tasks (Starkweather & Uchida, 2021).
While elegant, the classical view that dopamine codes for a scalar reward prediction error does not explain more heterogeneous aspects of dopamine responses. For example, the same dopamine neurons also respond to novel and unexpected stimuli (Ljungberg et al., 1992; Horvitz, 2000) and to errors in predicting the features of rewarding events, even when value remains unchanged (Chang et al., 2017; Takahashi et al., 2017; Stalnaker et al., 2019; Keiflin et al., 2019). Russek et al. (2017) highlighted the biological plausibility of the SR TD learning rule (see equations 2.17 and 2.18) in light of its similarity to the model-free TD learning rule (see equation 2.11), while refraining from making any explicit connections between vector-valued SR TD errors and dopamine. Gardner et al. (2018) took this idea further and proposed an extension to the classic view of dopamine, suggesting that it also encodes prediction errors related to sensory features. According to this model, dopamine reports vector-valued TD errors suitable for updating SFs (see section 2.5), using the fact that SFs obey a Bellman equation and hence are learnable by TD. This model explains a number of phenomena that are puzzling under a classic TD model based only on scalar reward predictions.
First, it explains why the firing rate of dopamine neurons increases after a change in reward identity, even when reward magnitude is held fixed (Takahashi et al., 2017): changes in reward identity induce a sensory prediction error that shows up as one component of the error vector. Second, it explains, at least partially, why subpopulations of dopamine neurons encode a range of different nonreward signals (Engelhard et al., 2019; de Jong et al., 2022; Gonzalez et al., 2023). Third, it explains why optogenetic manipulations of dopamine influence conditioned behavior even in the absence of reward (Chang et al., 2017; Sharpe et al., 2017).
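A small sketch may help fix ideas: under a linear reward model, the scalar reward prediction error of the classic account is recovered by projecting the vector-valued SF error onto the reward weights, so the classic signal is one readout of a richer error vector. The function names below are ours, not from the cited work.

    # Sketch: vector-valued TD error for successor features, and the scalar
    # reward prediction error it implies under a linear reward model r = w . phi.
    import numpy as np

    def sf_td_error(phi_t, psi_t, psi_next, gamma=0.95):
        return phi_t + gamma * psi_next - psi_t        # one error component per feature

    def scalar_rpe_from_sf_error(delta_vec, w):
        return float(w @ delta_vec)                    # classic reward prediction error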
How do dopamine neurons encode a vector-valued error signal? One possibility is that the errors are distributed across population activity. Pursuing this hypothesis, Stalnaker et al. (2019) analyzed the information content of dopamine neuron ensembles. They showed that reward identity can be decoded from these ensembles, but not from single neuron activity. Moreover, they showed that this information content disappeared over the course of training following an identity switch, consistent with the idea that error signals go to 0 as learning proceeds. The question remains how a vector-valued learning system is implemented biophysically. Some progress in this direction (albeit within a different theoretical framework) has been made by Wärnberg and Kumar (2023).
6 Cognitive Science Applications
A rich body of work dating back over a century has linked RL algorithms to reward-based learning processes in humans and nonhuman animals (Niv, 2009). Empirical findings align with the theoretical properties of model-based and model-free control (described in section 2.2), suggesting that model-based control underlies reflective, goal-directed behaviors, while model-free control underlies reflexive, habitual behaviors. The existence of both systems in the brain and their synergistic operation has received extensive support from a wide range of behavioral and neural studies across a number of species and experimental paradigms (see Dolan & Dayan, 2013, for a detailed review).
Recall that the SR (described in section 2.3) occupies an intermediate ground between model-based and model-free algorithms. This can make it advantageous when flexibility and efficiency are both desirable, which is the case in most real-world decision-making scenarios. In the field of cognitive science, several lines of research suggest that human learning and generalization are indeed consistent with the SR and related predictive representations. In this section, we examine studies showing that patterns of responding to changes in the environment (Momennejad et al., 2017), transfer of knowledge across tasks (Tomov et al., 2021), planning in spatial domains (Geerts et al., 2024), and contextual memory and generalization (Gershman et al., 2012; Smith et al., 2013; Zhou et al., 2023) exhibit signature characteristics of the SR-like predictive representations that cannot be captured by pure model-based or model-free strategies.
6.1 Revaluation
Some of the key findings pointing to a balance between a goal-directed system and a habitual system in the brain came from studies of reinforcer revaluation (Adams & Dickinson, 1981; Adams, 1982; Dickinson, 1985; Holland, 2004). In a typical revaluation paradigm, an animal (e.g., a rat) is trained to associate a neutral action (e.g., a lever press) with an appetitive outcome (e.g., food). The value of that outcome is subsequently reduced (e.g., the rat is satiated, so food is less desirable), and the experimenter measures whether the animal keeps taking the action in the absence of reinforcement. Goal-directed control predicts that the animal would not take the action, since the outcome is no longer valuable, while habitual control predicts that the animal would keep taking the action, since the action itself was not devalued. Experimenters found that under some conditions—such as moderate training, complex tasks, or disruptions to dopamine inputs to striatum—behavior appears to be goal directed (e.g., lever pressing is reduced), while under other conditions—such as extensive training or disruptions to prefrontal cortex—behavior appears to be habitual (e.g., the rat keeps pressing the lever).
A modeling study by Daw et al. (2005) interpreted these findings through the lens of RL (see section 2.2). The authors formalized the goal-directed system as a model-based controller putatively implemented in prefrontal cortex and the habitual system as a model-free controller putatively implemented in dorsolateral striatum. They proposed that the brain arbitrates dynamically between the two controllers based on the uncertainty of their value estimates, preferring the more certain (and hence likely more accurate) estimate for a given action. Under this account, moderate training or complex tasks would favor the model-based estimates, since the model-free estimates may take longer to converge and hence be less reliable. On the other hand, extensive training would favor the model-free estimates, since they will likely have converged and hence be more reliable than the noisy model-based simulations.
One limitation of this account is that it explains sensitivity to outcome revaluation in terms of a predictive model, but it does not rule out the possibility that the animals may instead be relying on a predictive representation. A hallmark feature of predictive representations is that they allow an agent to adapt quickly to changes in the environment that keep its cached predictions relevant, but not to changes that require updating them. In particular, an agent equipped with the SR (see section 2.3) should adapt quickly to changes in the reward structure of the environment but not to changes in the transition structure. Since the earlier studies on outcome revaluation effectively only manipulated reward structure, both model-based control and the SR could account for them, leaving open the question of whether outcome revaluation effects could be fully explained by the SR instead.
This question was addressed in a study by Momennejad et al. (2017), which examined how changes in either the reward structure or the transition structure experienced by human participants affect their subsequent choices. The authors used a two-step task consisting of three phases: a learning phase, a relearning (or revaluation) phase, and a test phase (see Figure 9A). During the learning phase, participants were presented with two distinct two-step sequences of stimuli and rewards corresponding to two distinct trajectories through state space. The first trajectory terminated with a high reward ($10), while the second trajectory terminated with a low reward ($1), leading participants to prefer the first one over the second one.
Predictive representations in cognitive science. (A) Revaluation paradigm and predictions from Momennejad et al. (2017, sec. 6.1). (B) Multitask learning paradigm from Tomov et al. (2021, sec. 6.2). A person is trained on tasks where they are either hungry or groggy, and then tested on a task in which they are hungry, groggy, and looking to have fun. (C) Context-dependent Bayesian SR from Geerts et al. (2024, sec. 6.3), in which each inferred context is associated with its own SR matrix. sCRP, sticky Chinese restaurant process; LDS, linear-gaussian dynamical system. (D) Spatial navigation paradigm from de Cothi et al. (2022, sec. 6.4). (E) TCM-SR: using the temporal context model (TCM) to learn the SR for decision making (Zhou et al., 2023, sec. 6.5). SR, successor representation.
During the revaluation phase, participants had to relearn the second half of each trajectory. Importantly, the structure of the trajectories changed differently depending on the experimental condition. In the reward revaluation condition, the transitions between states remained unchanged, but the rewards of the two terminal states were swapped. In contrast, in the transition revaluation condition, the rewards remained the same, but the transitions to the terminal states were swapped.
Finally, in the test phase, participants were asked to choose between the initial states of the two trajectories. Note that under both revaluation conditions, participants should now prefer the initial state of the second trajectory, as it now leads to the higher reward.
Unlike previous revaluation studies, this design clearly disambiguates between the predictions of model-free, model-based, and SR learners. Since the initial states never appear during the revaluation phase, a pure model-free learner would not update the cached values associated with those states and would still prefer the initial state of the first trajectory. On the other hand, a pure model-based learner would update its reward or transition estimates during the corresponding revaluation phase, allowing it to simulate the new outcomes from each initial state and make the optimal choice during the test phase. Critically, both model-free and model-based learners (and any hybrid between them, such as a convex combination of their outputs, as in Daw et al., 2011) would exhibit the same preferences during the test phase in both revaluation conditions.
In contrast, an SR learner would show differential responding in the test phase depending on the revaluation condition, adapting and choosing optimally after reward revaluation but not after transition revaluation. Specifically, during the learning phase, an SR learner would learn the successor states for each initial state (the SR itself). In the reward revaluation condition, it would then update its reward estimates for the terminal states during the revaluation phase, much like the model-based learner. Then, during the test phase, it would combine the updated reward estimates with the SR to compute the updated values of the initial states, allowing it to choose the better one. In contrast, in the transition revaluation condition, the SR learner would not have an opportunity to update the SR of the initial states, since they are never presented during the revaluation phase, much like the model-free learner. Then, during the test phase, it would combine the unchanged reward estimates with its old but now incorrect SR to produce incorrect value estimates for the initial states and choose the worse one.
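These predictions can be illustrated with a toy version of the task; the state indices and the idealized SR update below are our own simplifications, not the original design.

    # Toy two-step task: trajectory 1 is 0 -> 2 -> 4 (terminal, $10);
    # trajectory 2 is 1 -> 3 -> 5 (terminal, $1).
    import numpy as np

    def sr_from_transitions(T, gamma=1.0):
        return np.linalg.inv(np.eye(T.shape[0]) - gamma * T)

    T = np.zeros((6, 6))
    T[0, 2] = T[2, 4] = T[1, 3] = T[3, 5] = 1.0
    M = sr_from_transitions(T)                      # SR cached during the learning phase
    R = np.array([0, 0, 0, 0, 10.0, 1.0])

    # Reward revaluation: rewards swap; the cached SR immediately yields correct values.
    R_reval = np.array([0, 0, 0, 0, 1.0, 10.0])
    values = M @ R_reval                            # state 1 now valued above state 0 (correct)

    # Transition revaluation: transitions swap (2 -> 5, 3 -> 4), but only rows 2 and 3
    # of the SR are relearned (idealized), because states 0 and 1 are never revisited.
    T_new = np.zeros((6, 6))
    T_new[0, 2] = T_new[1, 3] = 1.0
    T_new[2, 5] = T_new[3, 4] = 1.0
    M_stale = M.copy()
    M_stale[[2, 3]] = sr_from_transitions(T_new)[[2, 3]]
    stale_values = M_stale @ R                      # state 0 still valued above state 1 (wrong)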
The pattern of human responses showed evidence of both model-based and SR learning: participants were sensitive to both reward and transition revaluations, consistent with model-based learning, but they performed significantly better after reward revaluations, consistent with SR learning. To rule out the possibility that this effect can be attributed to pure model-based learning with different learning rates for reward versus transition estimates, the researchers extended this Pavlovian design to an instrumental design in which participants’ choices (i.e., their policy) altered the trajectories they experienced. Importantly, this would correspondingly alter the learned SR: unrewarding states would be less likely under a good policy and hence not be prominent (or not appear at all) in the SR for that policy. Such states could thus get overlooked by an SR learner if they suddenly became rewarding. This subtle kind of reward revaluation (dubbed policy revaluation by the researchers) also relies on changes in the reward structure, but induces predictions similar to the transition revaluation condition: SR learners would not adapt quickly, while model-based learners would adapt just as quickly as in the regular reward revaluation condition.
Human responses on the test phase after policy revaluation were similar to responses after transition revaluation but significantly worse than responses after reward revaluation, thus ruling out a model-based strategy with different learning rates for rewards and transitions. Overall, the authors interpreted their results as evidence of a hybrid model-based-SR strategy, suggesting that the human brain can adapt to changes in the environment by both updating its internal model of the world and by learning and leveraging its cached predictive representations (see also Kahn & Daw, 2023, for additional human behavioral data leading to similar conclusions).
6.2 Multitask Learning
In the previous section, we saw that humans can adapt quickly to changes in the reward structure of the environment (reward revaluation), as predicted by the SR (Momennejad et al., 2017). However, that theoretical account alone does not fully explain how the brain can take advantage of the SR to make adaptive choices. Here we take this idea further and propose that humans learn successor features (SFs; see section 2.5) for different tasks and use something like the GPI algorithm (see section 2.5.1) to generalize across tasks with different reward functions (Barreto et al., 2017, 2018, 2020).
In Tomov et al. (2021), participants were presented with different two-step tasks that shared the same transition structure but had different reward functions determined by the reward weights (see Figure 9B). Each state was associated with a different set of features, which were valued differently depending on the reward weights for a particular task. On each training trial, participants were first shown the weight vector for the current trial and then asked to navigate the environment in order to maximize reward. At the end of the experiment, participants were presented with a single test trial on a novel task.
The main dependent measure was participant behavior on the test task, which was designed (along with the training tasks, state features, and transitions) to distinguish among several possible generalization strategies. Across several experiments, Tomov et al. (2021) found that participant behavior was consistent with SF and GPI. In particular, on the first (and only) test trial, participants tended to prefer the training policy that performed better on the new task, even when this was not the optimal policy. This “policy reuse” is a key behavioral signature of GPI. This effect could not be explained by model-based or model-free accounts. Their results suggest that humans rely on predictive representations from previously encountered tasks to choose promising actions on novel tasks.
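The GPI computation that this behavior is consistent with can be sketched in a few lines: evaluate each cached policy's SFs under the new task's reward weights, then act greedily with respect to the best of them. The array shapes and function name below are illustrative assumptions.

    # Sketch of generalized policy improvement (GPI) over successor features.
    import numpy as np

    def gpi_action(sf_per_policy, w_new):
        """sf_per_policy: (n_policies, n_actions, n_features); w_new: (n_features,)."""
        q = sf_per_policy @ w_new              # values of each cached policy on the new task
        return int(np.argmax(q.max(axis=0)))   # max over policies, then greedy over actions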
6.3 Associative Learning
RL provides a normative account of associative learning, explaining how and why agents ought to acquire long-term reward predictions based on their experience. It also provides a descriptive account of a myriad of phenomena in the associative learning literature (Sutton & Barto, 1990; Niv, 2009; Ludvig et al., 2012). Two recent ideas have added nuance to this story:
Bayesian learning: Animals represent and use uncertainty in their estimates.
Context-dependent learning: Animals partition the environment into separate contexts and maintain separate estimates for each context.
We examine each idea in turn and then explore how they can be combined with the SR (see section 2.3). In brief, the key idea is that animals learn a probability distribution over context-dependent predictive representations.
6.3.1 Bayesian RL
While standard RL algorithms (see section 2.2) learn point estimates of different unknown quantities like the transition function or the value function, Bayesian RL posits that agents treat such unknown quantities as random variables and represent beliefs about them as probability distributions.
In effect, this means that each feature (or stimulus dimension) is assigned a certain reward weight. The weights are initialized randomly around zero (with some prior covariance) and evolve according to a random walk (with volatility governed by the transition noise variance). Observed rewards are given by the linear model (see equation 6.1) plus zero-mean gaussian observation noise.
The resulting Kalman filter update of the weights is driven by three quantities: the reward prediction error, the residual variance, and the Kalman gain.
This learning algorithm generalizes the seminal Rescorla-Wagner model of associative learning (Rescorla & Wagner, 1972), and its update rule bears resemblance to the error-driven TD update (see equation 2.11). However, there are a few notable distinctions from its non-Bayesian counterparts:
Uncertainty-modulated updating: The learning rate corresponds to the Kalman gain, which increases with posterior uncertainty (the diagonal of the posterior covariance).
Nonlocal updating: The update is multivariate, affecting the weights of all features simultaneously. This means that the weights of “absent” features (i.e., features whose value on the current trial is zero) can be updated, provided those features have nonzero covariance with observed features.
These properties allow the Kalman filter to explain a number of phenomena in associative learning that elude non-Bayesian accounts (Dayan & Kakade, 2000; Kruschke, 2008; Gershman, 2015). Uncertainty-modulated updating implies that nonreinforced exposure to stimuli, or even just the passage of time, can affect future learning. For example, in a phenomenon known as latent inhibition, preexposure to a stimulus reduces uncertainty, which in turn reduces the Kalman gain, retarding subsequent learning for that stimulus. Nonlocal updating implies that learning could occur even for unobserved stimuli if they covary with observed stimuli. For example, in a phenomenon known as backward blocking, reinforcing two stimuli simultaneously (as a compound stimulus) and subsequently reinforcing only one of them reduces the reward expectation for the absent stimulus, due to the learned negative covariance between the two reward weights.
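For concreteness, a minimal sketch of this kind of Kalman filter update for linear reward weights is shown below. Variable names and default noise parameters are illustrative assumptions, and the exact notation of the original equations is not reproduced.

    # Sketch of a Kalman-filter (Kalman Rescorla-Wagner-style) update of linear
    # reward weights; x is the stimulus/feature vector on the current trial.
    import numpy as np

    def kalman_rw_update(w, S, x, r, tau2=0.01, sigma2=1.0):
        """w: mean weights; S: posterior covariance; tau2: drift variance; sigma2: noise."""
        S_pred = S + tau2 * np.eye(len(w))         # weights diffuse between trials
        delta = r - x @ w                          # reward prediction error
        residual_var = x @ S_pred @ x + sigma2     # predicted variance of the outcome
        k = S_pred @ x / residual_var              # Kalman gain: uncertainty-scaled learning rate
        w_new = w + k * delta                      # nonlocal update: all weights can change
        S_new = S_pred - np.outer(k, x) @ S_pred   # posterior covariance shrinks with learning
        return w_new, S_new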
6.3.2 Context-Dependent Learning
If each context is assigned its own probability distribution over observations, then inferring a given context is driven by two factors:
How likely is the current observation under that context?
How likely is that context given the previous context assignments (see equation 6.9)?
In particular, a given context is more likely to be inferred if the current observations are more likely under its observation distribution and/or if it has been inferred more frequently in the past (i.e., if past observations have also been consistent with its observation distribution). Conversely, if the current observations are unlikely under any previously encountered context, a new context is induced with its own observation distribution.
This formulation has accounted for a number of phenomena in the animal learning literature (Gershman et al., 2010). For example, it explains why associations that have been extinguished sometimes reappear when the animal is returned to the context in which the association was first learned, a phenomenon known as renewal. It also provides an explanation of why the latent inhibition effect is attenuated if preexposure to a stimulus occurs in one context and reinforcement occurs in a different context.
Recently, a similar formulation has been used to explain variability in the way hippocampal place cells change their firing patterns in response to changes in contextual cues, a phenomenon known as remapping (Sanders et al., 2020). On this view, the hippocampus maintains a separate cognitive map of the environment for each context, and hippocampal remapping reflects inferences about the current context.
6.3.3 Bayesian Learning of Context-Dependent Predictive Maps
As discussed in section 5.2, Stachenfeld et al. (2017) showed that many aspects of hippocampal place cell firing patterns are consistent with the SR, suggesting that the hippocampus encodes a predictive map of the environment. In this light, the view that the hippocampus learns different maps for different contexts (Sanders et al., 2020) naturally points to the idea of a context-dependent predictive map.
Thus, a given context is more likely to be inferred if the current observations are consistent with its successor features (i.e., SF prediction errors are small) and/or if it has been inferred frequently in the past. Conversely, a new context is more likely to be inferred if observations are inconsistent with the current SFs (i.e., when there are large SF prediction errors).
The authors show that this model can account for a number of puzzling effects in the animal learning literature that pose problems for both point estimation (TD learning) of the SR/SF (see equations 2.17 and 2.18 in section 2.3) and Bayesian RL (see equations 6.6 and 6.7 in section 6.3.1). One example is the opposing effect that preexposure to a context can have on learning. Brief exposure to a context can facilitate learning (context preexposure facilitation), while prolonged exposure can inhibit learning (latent inhibition; Kiernan & Westbrook, 1993). Context preexposure facilitation by itself can be accounted for by TD learning of the SR alone (Stachenfeld et al., 2017): during preexposure, the animal learns a predictive representation that facilitates propagation of newly learned values. Latent inhibition by itself can be accounted for by Bayesian RL alone (Gershman, 2015), as discussed previously: prolonged exposure reduces value uncertainty, in turn reducing the Kalman gain (the effective learning rate) and inhibiting learning of new values. Kalman learning of SFs combines these two processes and can thus resolve the apparent paradox: initially, the animal learns a predictive representation of the context, which facilitates learning, whereas after prolonged exposure, this effect is offset by a reduction in value uncertainty, which inhibits learning.
Another puzzling effect is the partial transition revaluation observed in Momennejad et al. (2017) and discussed in section 6.1, which cannot be accounted for by TD learning of the SR. This led Momennejad et al. to propose a hybrid model-based-SR strategy that relies on offline simulations. Kalman TD offers a more parsimonious account based on nonlocal updating that does not appeal to model-based simulations. In particular, the covariance matrix learned during the learning phase captures the relationship between the initial states and the subsequent states. Updating the transitions from those subsequent states during the transition revaluation phase therefore also updates the SR for the initial states, even though they are not encountered during the revaluation phase.
Nonlocal updating can similarly explain reward devaluation of preconditioned stimuli, a hallmark of model-based learning (Hart et al., 2020). This is similar to reward devaluation, discussed in section 6.1, except with an additional preconditioning phase during which an association is learned between two neutral stimuli (e.g., light → tone). During the subsequent conditioning phase, the second stimulus is associated with a rewarding outcome (e.g., tone → food), which is then devalued (e.g., by inducing taste aversion) during the devaluation phase. Finally, the animal is tested on the first neutral stimulus (e.g., the light). Note that since the first stimulus is never present during the conditioning phase, TD learning would not acquire an association between the first stimulus and the reward and would thus not exhibit sensitivity to reward devaluation (Gardner et al., 2018). In contrast, during the preconditioning phase, Kalman TD learns that the two stimuli covary, allowing it to update the SF for both stimuli during the conditioning phase and consequently propagate the updated value to both stimuli during the devaluation phase.
Note that the phenomena so far can be explained without appealing to context-dependent learning (see section 6.3.2), since the experiments take place in the same context. Context-dependent Kalman TD can additionally explain a number of intriguing phenomena when multiple contexts are introduced.
One such phenomenon is the context specificity of learned associations (Winocur et al., 2009). In this paradigm, an animal learns an association (e.g., tone shock) in one context (e.g., context A) and is then tested in the same context or in another context (e.g., context B). The amount of generalization of the association across contexts was found to depend on elapsed time: if testing occurs soon after training, the animal responds only in the training context (A), indicating context specificity. However, if testing occurs after a delay, the animal responds equally in both contexts A and B, indicating contextual generalization. Even more intriguing, this effect is reversed if the animal is briefly reintroduced to the training context (A) before testing, in which case responding is once again context specific—a hippocampus-dependent reminder effect.
The context-dependent model readily accounts for these effects. Shortly after training, the uncertainty of the SR assigned to context A is low (i.e., the animal is confident in its predictive representation of context A). Introduction to context B therefore results in a large prediction error, leading the animal to (correctly) infer a new context with a new SR, leading to context-specific responding. However, as more time elapses, the uncertainty of the SR assigned to context A gradually increases (i.e., the animal becomes less confident in its predictive representation of context A, a kind of forgetting). Introduction to context B then results in a smaller prediction error, making it likely that the new observations are also assigned to context A, leading to generalization across contexts. Brief exposure to context A reverses this effect by reducing the uncertainty of the SR assigned to context A (i.e., the animal’s confidence in its predictive representation of context A is restored, a kind of remembering), leading once again to context-specific responding.
Recall that the duration of context preexposure has opposing effects on learning, initially facilitating but subsequently inhibiting learning (Kiernan & Westbrook, 1993). But what if the animal is tested in a different context? In a follow-up experiment, Kiernan and Westbrook (1993) showed that longer preexposure to the training context leads to less responding in the test context, indicating that the learned association is not generalized. That is, longer context preexposure has a monotonic inhibitory effect on generalization across contexts. The context-dependent model can account for this with the same mechanism that accounts for context preexposure facilitation: longer preexposure to the training context reduces the uncertainty of its predictive estimate, leading to greater prediction errors when presented with the test context and increasing the probability that the animal will infer a new context, leading to context-specific responding.
Overall, the results of Geerts et al. (2024) suggest that rather than encoding a single monolithic predictive map of the environment, the hippocampus encodes multiple separate predictive maps, each associated with its own context. Both the context and the predictive map are inferred using Bayesian inference: learning of the predictive map is modulated by uncertainty and supports nonlocal updating. A new context is inferred when the current predictive map fails to account for current observations.
6.4 Spatial Navigation
A rich body of work points to the hippocampus as encoding a kind of cognitive map of the environment that mammals rely on for navigation in physical and abstract state spaces (O’Keefe & Dostrovsky, 1971; O’Keefe & Nadel, 1978). As we discussed in the previous section and in section 5.2, this cognitive map can be usefully interpreted as a predictive map in which states predict future expected states, consistent with the SR (see section 2.3; Stachenfeld et al., 2017). Yet despite many studies of navigation in humans and rodents—key model species used to study spatial navigation (Epstein et al., 2017; Ekstrom & Ranganath, 2018; Ekstrom et al., 2018; Gahnstrom & Spiers, 2020; Nyberg et al., 2022; Spiers & Barry, 2015)—until recently there was no direct comparison of human and rodent navigation in a homologous task. This left open the question of whether spatial navigation across mammalian species relies on an evolutionarily conserved strategy supported by such a predictive map.
A recent study by de Cothi et al. (2022) filled this gap by designing a homologous navigation task for humans and rats. They devised a configurable open-field maze that could be reconfigured between trials, allowing experimenters to assess a hallmark aspect of spatial navigation: the ability to efficiently find detours and shortcuts. The maze consisted of a 10-by-10 grid in which squares could be blocked off by the experimenters. The maze was instantiated in a physical environment for rats and in a virtual reality environment for humans.
On each trial, the participant was placed at a starting location and had to navigate to a goal location to receive a reward (see Figure 9D). The starting location varied across trials, while the goal remained hidden at a fixed location throughout the experiment. Keeping the goal location unobservable ensured that participants could not rely on simple visual heuristics (e.g., proximity to the goal). At the same time, keeping the goal location fixed ensured that once it is identified, the key problem becomes navigating to it rather than rediscovering it. During the training phase of the experiment, all squares of the grid were accessible, allowing participants to learn an internal map of the environment. During the test phase, participants were sequentially presented with 25 different maze configurations in which various sections of the maze were blocked off. Participants completed 10 trials of each configuration before moving on to the next.
Using this task, the authors compared human and rat navigation with three types of RL algorithms:
Model-free agent (section 2.2). No internal map of the environment; the optimal policy is based on a state-action value function (Q-function), which is learned from experience using Q-learning (see equation 2.11) with eligibility traces.
Model-based agent (section 2.2). A full internal map of the environment (transition structure and reward function) is learned from experience; the optimal policy is computed using tree search at decision time—specifically, A* search.
SR agent (section 2.3). A predictive map of the environment (SR matrix and reward function) is learned from experience; the optimal policy is computed by combining the SR and the reward function (see equation 2.19).
The key question that the authors sought to answer was which RL strategy best explains human and rat navigation across the novel test configurations. Across a wide range of analyses, the authors observed a consistent trend: both human and rodent behavior was most consistent with the SR agent. Humans also showed some similarity to the model-based agent, but neither species was consistent with the model-free agent.
First, the authors simulated each RL agent generatively on the same trials as the participants: they let the RL agent navigate and solve each trial as a kind of simulated participant, learning from its own experience along the way. These closed-loop11 simulations show what overall participant behavior would look like according to each RL strategy. This revealed that:
Model-free agents struggle on new maze configurations due to the slow learning of the Q-function, which takes many trials to propagate values from the goal location to possible starting locations.
Model-based agents generalize quickly to new maze configurations, since local updates to the transition structure can be immediately reflected in the tree search algorithm.
SR agents generalize faster than model-free agents but more slowly than model-based agents, since updates to the SR matrix reach farther than updates to the Q-function, but still require several trials to propagate all the way to the possible starting locations.
Second, the authors clamped each RL agent to participant behavior—that is, they fed the agent the same sequence of states and actions experienced by a given participant. These open-loop simulations show what the participant would do at each step if they were following a given RL strategy.12 By matching these predictions with participant behavior using maximum likelihood estimation of model parameters, the authors quantified how consistent step-by-step participant behavior is with each RL strategy. For both humans and rats, this analysis revealed the greatest similarity (i.e., highest likelihood) with the SR agent, followed by the model-based agent, with the model-free agent showing the least similarity (i.e., lowest likelihood).
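Although the authors' exact analysis pipeline is not reproduced here, the core of such an open-loop comparison can be sketched as scoring each observed action under a softmax over the agent's action values; the softmax choice rule and the inverse temperature beta are assumptions on our part.

    # Sketch: log-likelihood of one observed action under a softmax over an
    # agent's action values at that step.
    import numpy as np

    def action_log_likelihood(q_values, chosen_action, beta=1.0):
        logits = beta * np.asarray(q_values, dtype=float)
        logits -= logits.max()                           # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum())
        return log_probs[chosen_action]

Summing these log-likelihoods over a participant's steps and maximizing over beta (and any agent parameters) would then yield the kind of per-agent fit used for model comparison.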
Third, the authors combined the above approaches by first training each fitted RL agent with the state-action sequences observed by a participant on several maze configurations and then simulating it generatively on another configuration. This hybrid open-loop training (on past configurations)/closed-loop evaluation (on a new configuration) provides a more global view than the step-by-step analysis above by allowing comparison of predicted and participant trajectories rather than individual actions. This led to several findings:
Configurations that were challenging for the SR agent were also challenging for biological agents, and vice versa. This pattern was less consistent for model-based and model-free agents.
Overall directedness and direction of participant trajectories (quantified by linear and angular diffusivity) were most similar to SR trajectories.
The step-by-step distance between participant and SR trajectories was consistently lower compared to model-based and model-free trajectories.
All of these analyses show that the SR agent best explains both human and rat behavior. Overall, the results of de Cothi et al. (2022) indicate that spatial navigation across mammalian species relies on a predictive map that is updated from experience in response to changes in the environment.
6.5 Memory
The hippocampus and the adjacent medial temporal lobe structures are also involved in another high-level cognitive function: episodic memory (Ranganath, 2010). In this section, we review an influential model of episodic memory, the temporal context model (TCM; Howard & Kahana, 2002), through the lens of RL, and show that it can be partially understood as an estimator for the SR (see section 2.3; Gershman et al., 2012). We then discuss how this property can be used in a powerful decision-making algorithm that bridges episodic memory and reinforcement learning systems (Zhou et al., 2023).
6.5.1 The Temporal Context Model
TCM is an influential model of memory encoding and retrieval originally designed to account for a number of phenomena in free recall experiments (Howard & Kahana, 2002). In these experiments, participants are asked to study a list of items and then recall as many of them as they can, in any order. Experimenters observed that recall order is often not, in fact, arbitrary: participants show better recall for recently studied items (the recency effect) and tend to recall adjacent items in the list one after the other (the contiguity effect).
TCM accounts for these phenomena by positing that the brain maintains a temporal context: a slowly drifting internal representation of recent experience that gets bound to specific experiences (memories) during encoding and can serve as a cue to bring those experiences to mind during retrieval. When participants begin recalling the studied items, the temporal context is most similar to the context associated with recently studied items (due to the slow drift), which is why they are recalled better (the recency effect). Recalling items reactivates the context associated with those items, which is similar to the context for adjacent items (again, due to the slow drift), which is why they tend to be recalled soon after (the contiguity effect). Neural evidence for drifting context comes from the finding that lingering brain activity patterns of recent stimuli predicted whether past and present stimuli are later recalled together (Chan et al., 2017). Human brain recordings have also provided evidence for temporal context reactivation during recall (Gershman et al., 2013; Folkerts et al., 2018).
6.5.2 TCM as an Estimator for the SR
Note the similarity between the TCM learning rule (equation 6.16) and the SR TD learning rule (equation 2.17 in section 2.3). There are two main distinctions:
The first distinction can be removed by setting the drift rate in equation 6.15 to its maximum value, which ensures that the context is always updated to the latest stimulus vector. For a one-hot encoding, this means that only the entries associated with the current state are updated in the TCM update (equation 6.16), just as in the SR TD update (equation 2.17). Conversely, introducing a context term in the SR TD update (equation 2.17) results in a generalization of TD learning using an eligibility trace (Sutton & Barto, 2018), a running average of recently visited states. This is mathematically equivalent to temporal context (equation 6.15) and can sometimes lead to faster convergence. In this way, the TCM learning rule is a generalization of the vanilla SR TD learning rule that additionally incorporates eligibility traces.
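A minimal sketch of this correspondence, in which a drifting context vector plays the role of an eligibility trace in the SR TD update, is given below; the one-hot state encoding, the specific drift parameterization, and the function name are our assumptions.

    # Sketch: SR TD update with a drifting context acting as an eligibility trace.
    import numpy as np

    def sr_td_with_context(M, c, s_t, s_next, drift=0.5, alpha=0.1, gamma=0.9):
        n = M.shape[0]
        x_t = np.eye(n)[s_t]                      # one-hot stimulus/state vector
        c = (1 - drift) * c + drift * x_t         # context drifts toward the current stimulus
        delta = x_t + gamma * M[s_next] - M[s_t]  # SR TD error for the current state
        M += alpha * np.outer(c, delta)           # credit assigned to recently active states
        # With drift = 1, c equals x_t and the rule reduces to the vanilla SR TD update.
        return M, c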
This new interpretation of TCM posits that the role of temporal context is to learn predictions of future stimuli rather than to merely form associations. This makes several predictions that are distinct from those of the original version of TCM, one of which is the context repetition effect: repeating the context in which a stimulus was observed should strengthen memory for that stimulus, even if the stimulus itself was not repeated. This prediction was validated in a study by Smith et al. (2013). The authors showed participants triplets of stimuli (images in one experiment and words in another experiment), with the first two stimuli in each triplet serving as context for the third stimulus. Participants were then presented with an item-recognition test in which they had to indicate whether different stimuli were either “old” or “new.” Memory performance was quantified as the proportion of test items correctly recognized as old. The key finding was that memory was better for stimuli whose context was presented repeatedly, even if the stimuli themselves were only presented once. This held for different modalities (images and words) and did not occur when context was generally not predictive of stimuli. These findings (see also Manns et al., 2015) substantiate the predicted context repetition effect and lend credence to the idea that TCM learns predictions rather than mere associations.
6.5.3 Combining TCM and the SR for Decision Making
The theoretical links between TCM (see section 6.5.1) and the SR (see section 2.3) point to a broader role for episodic memory in RL. Previous studies have implicated the hippocampus in prediction and imagination (Buckner, 2010), as well as replay of salient events (Momennejad et al., 2018), consistent with some form of model-based RL or a successor model (SM; see section 2.4). More generally, episodic memory is thought to support decision making by providing the ingredients for simulating possible futures (Schacter et al., 2015). This idea is corroborated by studies of patients with episodic memory deficits, who also tend to show deficits on decision-making tasks (Gupta et al., 2009; Gutbrod et al., 2006; Bakkour et al., 2019). A related body of work focuses on decision-by-sampling algorithms, according to which humans approximate action values by sampling from similar past experiences stored in memory (Stewart et al., 2006; Plonsky et al., 2015; Bornstein et al., 2017; Lieder et al., 2018; Bhui & Gershman, 2018).
These loosely connected ideas were knitted together in a theoretical proposal by Zhou et al. (2023) that builds on the links between TCM and the SR, showing precisely how a predictive version of TCM can support adaptive decision making. Their model incorporates two key ideas (see Figure 9E):
During encoding (see Figure 9E, top), incoming feature vectors update a slowly drifting context (equation 6.15). This context vector serves as an eligibility trace in a TD update rule (equation 6.17) that learns the SR estimate (Gershman et al., 2012).
During retrieval (i.e., at decision time; see Figure 9E, middle and bottom), possible future stimuli are sampled for each action using a tree-search algorithm that uses the SR as a world model, effectively turning it into an SM (see section 2.4). The retrieval process is initialized with the query stimulus (the root of the tree) and unfolds by recursively sampling successor states from the normalized SR. The corresponding rewards are then averaged to compute a Monte Carlo value estimate for each action.
During the tree search, the context vector can be updated with the sampled feature vector to varying degrees, dictated by the drift rate (see equation 6.15). This spans a continuum of retrieval regimes, ranging from no context updating to full context updating, as described next.
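Before turning to the two extremes of this continuum, the encoding step can be made concrete with a short Python sketch. This is an illustration rather than the implementation of Zhou et al. (2023): the one-hot state features, the convex-combination form of the context update, and the names omega (drift rate), alpha (learning rate), and gamma (discount factor) are simplifying assumptions made here.

    import numpy as np

    # Minimal sketch of the encoding step: a drifting context vector serves as an
    # eligibility trace for a TD update that learns an estimate M of the SR.
    # Assumed setup: tabular states represented as one-hot feature vectors.

    n_states = 10
    gamma, alpha, omega = 0.9, 0.1, 0.5   # discount, learning rate, drift rate

    M = np.zeros((n_states, n_states))    # SR estimate
    c = np.zeros(n_states)                # slowly drifting context vector

    def encode_transition(s, s_next):
        """Update the context and the SR estimate after observing s -> s_next."""
        global c
        x = np.eye(n_states)[s]                  # feature vector of the current stimulus
        c = (1 - omega) * c + omega * x          # context drift (cf. equation 6.15)
        # TD error for the SR: observed successor plus its discounted successor
        # occupancy, minus the current prediction for s.
        delta = np.eye(n_states)[s_next] + gamma * M[s_next] - M[s]
        # The context acts as an eligibility trace, crediting the TD error to all
        # recently active stimuli in proportion to their weight in the context
        # (cf. equation 6.17).
        M[:] += alpha * np.outer(c, delta)

With omega = 1 the context equals the current stimulus and the rule reduces to the standard tabular TD update for the SR; smaller values of omega spread credit over a longer window of recently active stimuli.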
At one extreme, if the drift rate during retrieval is set to its lowest value (zero), the context is never updated after being initialized with the query stimulus. This results in independent and identically distributed samples from the normalized SR (i.e., the SM). In the limit of infinite samples, this reduces to simply computing action values by combining the reward function and the SR (see equation 2.19 in section 2.3), as discussed in sections 6.1 and 6.2. For finite samples, this produces an unbiased estimator of action values. Note that this estimate is only as accurate as the SR matrix $\hat{M}$, which is itself an estimate of the true SR matrix $M$. Hence, this regime inherits all the pros and cons of using the SR (see sections 2.3 and 6.1): it can be used to efficiently compute action values, and it can adapt quickly to changes in the reward structure but not the transition structure of the environment.
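To spell out why this sampling scheme is unbiased, the value computation can be rewritten as an expectation under the SM. The derivation below is a standard identity rather than the paper's notation: the action-conditioned SR is written as $M(s,a,\tilde{s})$, the number of samples as $N$, the $n$th sampled stimulus as $\tilde{s}^{(n)}$, and the row sum of the SR is taken to be $1/(1-\gamma)$, as in the usual definition:
\[
Q(s,a) \;=\; \sum_{\tilde{s}} M(s,a,\tilde{s})\, r(\tilde{s})
\;=\; \frac{1}{1-\gamma} \sum_{\tilde{s}} \mathrm{SM}(s,a,\tilde{s})\, r(\tilde{s})
\;\approx\; \frac{1}{(1-\gamma)N} \sum_{n=1}^{N} r\big(\tilde{s}^{(n)}\big),
\qquad \tilde{s}^{(n)} \sim \mathrm{SM}(s,a,\cdot).
\]
The sample average therefore has the correct expectation for any $N$, with variance shrinking as $1/N$.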
At the other extreme, if the drift rate during retrieval is set to its highest value (one), the context is always updated to the latest sampled stimulus. If the discount factor is minimal ($\gamma \to 0$), the SR reduces to the one-step transition matrix (i.e., $M \to T$), and the sampled stimuli are no longer independent and identically distributed but instead form a trajectory through state space that follows the transition structure and corresponds to a single Monte Carlo rollout. Averaging rewards from such Monte Carlo rollouts also produces an unbiased estimator of value (Sutton & Barto, 2018). This regime thus corresponds to a fully model-based algorithm (see section 2.2) and inherits all the pros and cons of that approach: it takes longer to compute action values (since trajectories need to be fully rolled out to produce unbiased estimates, requiring more samples), but it can adapt quickly to changes in both the reward structure and the transition structure of the environment.
Between these extremes lies a continuum of intermediate drift rates that trades off between a sampling approximation of the SR (low drift rates) and model-based Monte Carlo rollouts (high drift rates). Indeed, results from free recall experiments are consistent with such an intermediate regime (Howard & Kahana, 2002), indicating that context is partially updated during retrieval. This also raises the intriguing possibility that the brain navigates this continuum by dynamically adjusting the drift rate in a way that balances the pros and cons of both regimes, similar to the way the brain arbitrates between model-based and model-free RL systems (Kool et al., 2018).
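The way a single retrieval drift rate spans this continuum can be sketched in Python, continuing the encoding sketch above under the same simplifying assumptions. The function name evaluate, the horizon and sample counts, and the row-normalization of the SR estimate into a sampling distribution are illustrative choices that stand in for (and greatly simplify) the tree search described above.

    import numpy as np

    def evaluate(M, rewards, query, omega_retrieval, n_samples=100, horizon=10,
                 rng=np.random.default_rng(0)):
        """Monte Carlo evaluation of a query stimulus by sampling from the SM.

        omega_retrieval = 0: the context stays fixed at the query, so each sample
        is an independent draw from the SM (the row-normalized SR).
        omega_retrieval = 1: the context jumps to each sampled stimulus, so the
        samples chain together; with a small discount factor this approximates a
        rollout through the one-step transition structure.
        """
        n_states = M.shape[0]
        total, count = 0.0, 0
        for _ in range(n_samples):
            c = np.eye(n_states)[query]            # initialize context with the query
            for _ in range(horizon):
                pred = np.maximum(c @ M, 0.0)      # SR prediction; clip learning noise
                if pred.sum() == 0:
                    break
                probs = pred / pred.sum()          # the SM: a distribution over successors
                s_tilde = rng.choice(n_states, p=probs)
                total += rewards[s_tilde]
                count += 1
                # Partial drift of the context toward the sampled stimulus.
                c = (1 - omega_retrieval) * c + omega_retrieval * np.eye(n_states)[s_tilde]
        # Average sampled reward; this is the value up to the 1/(1 - gamma)
        # normalization discussed above.
        return total / max(count, 1)

Intermediate values of omega_retrieval interpolate between independent SM samples and chained rollouts, which is exactly the knob that dynamic arbitration would adjust.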
The authors also demonstrate how emotionally salient stimuli, such as high rewards, can modulate learning by producing a higher learning rate for the SR update (see equation 6.17). This introduces a kind of bias-variance trade-off: the resulting SR skews toward stimuli that were previously rewarded, which could speed up convergence (lower variance) but also induce potentially inaccurate action values (higher bias). Finally, the authors show how initiating the tree search with a retrieval context that is associated with but different from the query stimulus feature vector can lead to bidirectional retrieval. This is consistent with bidirectional recall in human memory experiments and can be advantageous in problems where state transitions can be bidirectional, such as spatial navigation.
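As a small illustration of the salience-modulated learning rate, one could scale the base learning rate of the encoding sketch by reward magnitude; the functional form and the name salience_gain below are assumptions made here for illustration and are not taken from Zhou et al. (2023).

    def salience_modulated_lr(alpha_base, reward, salience_gain=0.5):
        # Larger rewards produce a larger learning rate, skewing the learned SR
        # toward previously rewarded stimuli (the bias-variance trade-off noted above).
        return alpha_base * (1.0 + salience_gain * abs(reward))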
In summary, the modeling and simulation results of Zhou et al. (2023) demonstrate how a variant of TCM can be viewed as an estimator of the SR and can serve as the basis for a flexible sampling-based decision algorithm that spans the continuum between vanilla SR and fully model-based search. This work illustrates how episodic memory can be integrated with predictive representations to explain cognitive aspects of decision making.
7 Conclusion
The goal of this survey was to show how predictive representations can serve as the building blocks for intelligent computation. Modern work in AI has demonstrated the power of good representations for a variety of downstream tasks; our contribution builds on this insight, focusing on what makes representations useful for RL tasks. The idea that representing predictions is particularly useful has appeared in many different forms over the past few decades, but has really blossomed only in the past few years. We now understand much better why predictive representations are useful, what they can be used for, and how to learn them.
Acknowledgments
This work was supported by the Kempner Institute for the Study of Natural and Artificial Intelligence, ARO MURI under grant W911NF-23-1-0277, and the Wellcome Trust under grant 212281/Z/18/Z.
While we adhere to this definition consistently throughout this review, we recognize that other uses of the phrase “predictive representation” appear in the literature.
For notational convenience, we will assume that the state and action spaces are both discrete, but this assumption is not essential for many of the algorithms described here.
To simplify notation, we will sometimes leave implicit the distributions over which the expectation is being taken.
Recent work on applying model-based approaches to video games has seen some success (Tsividis et al., 2021), but progress toward scalable and generally applicable versions of such algorithms is still in its infancy.
Note that we overload to also accept actions to reduce the amount of new notation. In general, .
At least initially, these are not exactly the correct value estimates, because the SR is policy dependent, and the policy itself requires updating, which may not happen instantaneously (depending on how the agent is optimizing its policy). Nonetheless, these value estimates will typically be an improvement—a good first guess. As we will see, human learning exhibits similar behavior.
Note that to avoid excessive notation, we’ve overloaded the definition of .
Note that the linear reward model is the same as assumed in much of the SF work reviewed above (see equation 2.29).
Under the CRP, the expected number of contexts after $t$ time steps is approximately $\alpha \log t$, where $\alpha$ is the concentration parameter.
We refer to them as “closed-loop” since, at each step, the output of the RL agent (its action) is fed back to change its position on the grid, influencing the agent’s input at the following step, and so on.
We refer to them as “open-loop” since at each step, the output of the RL agent has no effect on its subsequent inputs or outputs.
References
Author notes
Wilka Carvalho, Momchil S. Tomov, and William de Cothi contributed equally.