## Abstract

The hypothesis that the phasic dopamine response reports a reward prediction error has become deeply entrenched. However, dopamine neurons exhibit several notable deviations from this hypothesis. A coherent explanation for these deviations can be obtained by analyzing the dopamine response in terms of Bayesian reinforcement learning. The key idea is that prediction errors are modulated by probabilistic beliefs about the relationship between cues and outcomes, updated through Bayesian inference. This account can explain dopamine responses to inferred value in sensory preconditioning, the effects of cue preexposure (latent inhibition), and adaptive coding of prediction errors when rewards vary across orders of magnitude. We further postulate that orbitofrontal cortex transforms the stimulus representation through recurrent dynamics, such that a simple error-driven learning rule operating on the transformed representation can implement the Bayesian reinforcement learning update.

## 1 Introduction

The phasic firing of dopamine neurons in the midbrain has long been thought to report a reward prediction error—the discrepancy between observed and expected reward—whose purpose is to correct future reward predictions (Eshel et al., 2015; Glimcher, 2011; Montague, Dayan, & Sejnowski, 1996; Schultz, Dayan, & Montague, 1997). This hypothesis can explain many key properties of dopamine, such as its sensitivity to the probability, magnitude, and timing of reward; its dynamics over the course of a trial; and its causal role in learning. Despite its success, the prediction error hypothesis faces a number of puzzles. First, why do dopamine neurons respond under some conditions, such as sensory preconditioning (Sadacca, Jones, & Schoenbaum, 2016) and latent inhibition (Young, Joseph, & Gray, 1993), where the prediction error should theoretically be zero? Second, why do dopamine responses appear to rescale with the range or variance of rewards (Tobler, Fiorillo, & Schultz, 2005)? These phenomena appear to require a dramatic departure from the normative foundations of reinforcement learning that originally motivated the prediction error hypothesis (Sutton & Barto, 1998).

This letter provides a unified account of these phenomena, expanding the prediction error hypothesis in a new direction while retaining its normative foundations. The first step is to reconsider the computational problem being solved by the dopamine system; instead of computing a single point estimate of expected future reward, the dopamine system recognizes its own uncertainty by computing a probability distribution over expected future reward. This probability distribution is updated dynamically using Bayesian inference, and the resulting learning equations retain the important features of earlier dopamine models. Crucially, the Bayesian theory goes beyond earlier models by explaining why dopamine responses are sensitive to sensory preconditioning, latent inhibition, and reward variance.

The theory presented here was first developed to explain a broad range of associative learning phenomena within a unifying framework (Gershman, 2015). We extend this theory further by equipping it with a mechanism for updating beliefs about cue-specific volatility (i.e., how quickly associations between particular cues and outcomes change over time). This mechanism harkens back to the classic Pearce-Hall theory of attention in associative learning (Pearce & Hall, 1980; Pearce, Kaye, & Hall, 1982), as well as to more recent Bayesian incarnations (Behrens, Woolrich, Walton, & Rushworth, 2007; Mathys, Daunizeau, Friston, & Stephan, 2011; Nassar, Wilson, Heasly, & Gold, 2010; Yu & Dayan, 2005). As we show, volatility estimation is important for understanding the effect of reward variance on the dopamine response.

## 2 Temporal Difference Learning

In the complete serial compound (CSC) representation, each cue is broken down into a cascade of temporal elements, such that each feature corresponds to a binary variable indicating whether a stimulus is present or absent at a particular point in time. This allows the model to generate temporally precise predictions, which have been systematically compared to phasic dopamine signals. While the original work by Schultz et al. (1997) showed good agreement between prediction errors and dopamine using the CSC, later work called into question its adequacy (Daw, Courville, & Touretzky, 2006; Gershman, Moustafa, & Ludvig, 2014; Ludvig, Sutton, & Kehoe, 2008). Nonetheless, we will adopt this representation for its simplicity, noting that our substantive conclusions are unlikely to be changed with other temporal representations.
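As a concrete illustration of the CSC representation, the following minimal sketch builds the feature matrix for one trial. The function name `csc_trial` and its arguments are hypothetical; the defaults of four time steps per cue and a six-step intertrial interval follow the modeling details reported in section 3.3.

```python
import numpy as np

def csc_trial(cue_onsets, n_cues, cue_len=4, iti=6):
    """Build a complete serial compound feature matrix for one trial.

    cue_onsets maps a cue index to the time step at which that cue turns on.
    Each cue owns cue_len binary features; feature k of a cue is active only
    at the k-th time step after that cue's onset.
    """
    trial_len = max(cue_onsets.values(), default=0) + cue_len + iti
    X = np.zeros((trial_len, n_cues * cue_len))
    for cue, onset in cue_onsets.items():
        for k in range(cue_len):
            X[onset + k, cue * cue_len + k] = 1.0
    return X

# One cue that turns on at the first time step of the trial.
X = csc_trial({0: 0}, n_cues=1)
```

Each row of `X` is the feature vector at one time step: the first four rows contain one active feature each, and the remaining six rows (the intertrial interval) are all zeros.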

## 3 Reinforcement Learning as Bayesian Inference

The TD model is a point estimation algorithm, updating a single weight vector over time. Gershman (2015) argued that associative learning is better modeled as Bayesian inference, where a probability distribution over all possible weight vectors is updated over time. This idea was originally explored by Dayan and colleagues (Dayan & Kakade, 2001; Dayan, Kakade, & Montague, 2000) using a simple Bayesian extension of the Rescorla-Wagner model (the Kalman filter). This model can explain retrospective revaluation phenomena like backward blocking that posed notorious difficulties for classical models of associative learning (Miller, Barnet, & Grahame, 1995). Gershman (2015) illustrated the explanatory range of the Kalman filter by applying it to numerous other phenomena. However, the Kalman filter is still fundamentally limited by the fact that it is a trial-level model and hence cannot explain the effects of intratrial structure like the interstimulus interval or stimulus duration. It was precisely this structure that motivated real-time frameworks like the TD model (Sutton & Barto, 1990).

The same logic that transforms Rescorla-Wagner into the Kalman filter can be applied to transform the TD model into a Bayesian model (Geist & Pietquin, 2010). Gershman (2015) showed how the resulting unified model (Kalman TD) can explain a range of phenomena that neither the Kalman filter nor the TD model can explain in isolation. In this section, we describe Kalman TD and its extension to incorporate volatility estimation. We then turn to studies of the dopamine system, showing how the same model can provide a more complete account of dopaminergic prediction errors.

### 3.1 Kalman Temporal Difference Learning

Like the original TD model, the Kalman TD model posits updating of weights by prediction error, and the core empirical foundation of the TD model (see Glimcher, 2011) also applies to Kalman TD. Unlike the original TD model, the learning rates change dynamically in Kalman TD, a property important for explaining phenomena like latent inhibition, as discussed below. In particular, learning rates increase with the posterior variance, reflecting the intuition that new data should influence the posterior more when the agent is more uncertain. At each time step, the posterior variance increases due to unobserved stochastic changes in the weights, but this increase may be compensated by reductions due to observed outcomes. Another deviation from the original TD model is the fact that weight updates may not be independent across cues; if there is nonzero covariance between cues, then observing novel information about one cue will change beliefs about the other cue. This property is instrumental to the explanation of various revaluation phenomena (Gershman, 2015), which we explore using the sensory preconditioning paradigm.
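For reference, the verbal description above corresponds to the standard Kalman filter recursion applied to TD learning (Geist & Pietquin, 2010; Gershman, 2015). The notation below is assumed for illustration and may differ from the original equations: $x_t$ is the feature vector, $\gamma$ the discount factor, $\hat{w}_t$ the posterior mean weight vector, $\Sigma_t$ the posterior covariance, $\tau^2$ the diffusion variance, and $\sigma_r^2$ the reward noise variance.

$$
\begin{aligned}
h_t &= x_t - \gamma x_{t+1}, \\
\delta_t &= r_t - h_t^\top \hat{w}_{t-1}, \\
\lambda_t &= h_t^\top \left(\Sigma_{t-1} + \tau^2 I\right) h_t + \sigma_r^2, \\
\kappa_t &= \frac{\left(\Sigma_{t-1} + \tau^2 I\right) h_t}{\lambda_t}, \\
\hat{w}_t &= \hat{w}_{t-1} + \kappa_t \delta_t, \\
\Sigma_t &= \Sigma_{t-1} + \tau^2 I - \lambda_t \kappa_t \kappa_t^\top .
\end{aligned}
$$

Under this form, the vector $\kappa_t$ plays the role of a cue-specific learning rate: it grows with the posterior variance, and its entries for absent cues are nonzero whenever the off-diagonal entries of the covariance are nonzero.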

### 3.2 Volatility Estimation

One unsatisfying, counterintuitive property of the Kalman TD model is that the learning rates do not depend on the reward history. This means that the model will not be able to capture changes in learning rate due to variability in the reward history. In fact, a considerable literature suggests that learning rate changes as a function of reward history, though the precise nature of such changes is controversial (Le Pelley, 2004; Mitchell & Le Pelley, 2010). For example, learning is slower when the cue was previously a reliable predictor of reward (Hall & Pearce, 1979). Pearce and Hall (1980) interpreted this and other findings as evidence that learning rate declines with cue-outcome reliability. They formalized this idea by assuming that learning rate is proportional to the absolute prediction error (see Roesch, Esber, Li, Daw, & Schoenbaum, 2012, for a review of the behavioral and neural evidence).
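The Pearce-Hall idea can be illustrated with a minimal sketch in which the associability of a single cue is set to the absolute prediction error from the most recent trial. This is a simplified hybrid written for illustration, not the exact 1980 formulation; the function name `pearce_hall_run` and the parameters `eta` and `alpha0` are hypothetical.

```python
def pearce_hall_run(rewards, eta=0.5, alpha0=1.0):
    """Single-cue learner whose associability tracks the absolute prediction error."""
    w, alpha = 0.0, alpha0
    alphas = []
    for r in rewards:
        delta = r - w                    # prediction error on this trial
        w += eta * alpha * delta         # learning rate is proportional to associability
        alpha = min(1.0, abs(delta))     # associability used on the next trial
        alphas.append(alpha)
    return w, alphas

# A cue that reliably predicts a reward of 1 on every trial.
w, alphas = pearce_hall_run([1.0] * 20)
```

As the cue becomes a reliable predictor, the prediction error shrinks and associability declines with it, so later changes in the outcome are learned more slowly.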

### 3.3 Modeling Details

We use the same parameters as in our earlier paper (Gershman, 2015): , , and . The volatilities were initialized to and then updated using a metalearning rate of . Stimuli were modeled with a four-time-step CSC representation and an intertrial interval of six time steps.

## 4 Applications to the Dopamine System

We are now in a position to resolve the puzzles with which we started, focusing on two empirical implications of the TD model. First, the model updates only the weights of present cues, and hence cues that have not been paired directly with reward or with reward-predicting cues should not elicit a dopamine response. This implication disagrees with findings from a sensory preconditioning procedure (Sadacca et al., 2016) where cue A is sequentially paired with cue B and cue C is sequentially paired with cue D (see Figure 1). If cue B is subsequently paired with reward and cue D is paired with nothing, cue A comes to elicit both a conditioned response and elevated dopamine activity compared to cue C. The TD model predicts no dopamine response to either A or C. The Kalman TD model, in contrast, learns a positive covariance between the sequentially presented cues. As a consequence, the learning rates will be positive for both cues whenever one of them is presented alone, and hence conditioning one cue in a pair will cause the other cue to inherit value.
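The covariance argument can be checked with a small simulation. The sketch below is a simplification under stated assumptions: each cue is a single feature occupying one time step (not the CSC representation), the update is the standard Kalman TD recursion, and the parameter values (`gamma`, `tau2`, `sigma_r2`, ten trials per phase) are illustrative rather than those used for the reported simulations.

```python
import numpy as np

def kalman_td_step(w, S, x, x_next, r, gamma=0.98, tau2=0.01, sigma_r2=1.0):
    """One Kalman TD update of the posterior mean w and covariance S."""
    h = x - gamma * x_next                   # discounted temporal difference of features
    S_prior = S + tau2 * np.eye(len(w))      # variance grows through weight diffusion
    lam = h @ S_prior @ h + sigma_r2         # posterior predictive variance of reward
    kappa = S_prior @ h / lam                # Kalman gain (vector of learning rates)
    delta = r - h @ w                        # prediction error
    return w + kappa * delta, S_prior - lam * np.outer(kappa, kappa)

A, B, C, D = np.eye(4)                       # one-hot feature vector per cue
none = np.zeros(4)                           # no cue present
w, S = np.zeros(4), np.eye(4)

# Phase 1 (preconditioning): A is followed by B, C is followed by D, no reward.
for _ in range(10):
    for first, second in [(A, B), (C, D)]:
        w, S = kalman_td_step(w, S, first, second, 0.0)
        w, S = kalman_td_step(w, S, second, none, 0.0)
cov_AB = S[0, 1]

# Phase 2 (conditioning): B is paired with reward, D with nothing.
for _ in range(10):
    w, S = kalman_td_step(w, S, B, none, 1.0)
    w, S = kalman_td_step(w, S, D, none, 0.0)
```

After the first phase, `cov_AB` is positive, so rewarding B in the second phase also raises the weight of the absent cue A, while the weight of the control cue C stays at zero.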

In the latent inhibition procedure, preexposing a cue prior to pairing it with reward should have no effect on dopamine responses during conditioning in the original TD model (since the prediction error is 0 throughout preexposure), but experiments show that preexposure results in a pronounced decrease in both conditioned responding and dopamine activity during conditioning (Young et al., 1993). The Kalman TD model predicts that the posterior variance will decrease with repeated preexposure presentations (Gershman, 2015) and, hence, the learning rate will decrease as well. This means that the prediction error signal will propagate more slowly to the cue onset for the preexposed cue compared to the non-preexposed cue (see Figure 2).
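The effect of preexposure on the learning rate can be seen in a trial-level, single-cue simplification, in which the Kalman gain reduces to a scalar. This is a minimal sketch: the function name `run_trials`, the parameter values, and the trial counts are assumptions chosen for illustration, and the intratrial dynamics shown in Figure 2 are not modeled.

```python
def run_trials(rewards, w=0.0, s=1.0, tau2=0.01, sigma_r2=1.0):
    """Scalar Kalman filter for one cue that is present on every trial."""
    gains = []
    for r in rewards:
        s_prior = s + tau2                    # variance grows through diffusion
        gain = s_prior / (s_prior + sigma_r2) # learning rate on this trial
        w += gain * (r - w)                   # error-driven update of the weight
        s = s_prior * (1.0 - gain)            # variance shrinks after each observation
        gains.append(gain)
    return w, s, gains

# Preexposed cue: 20 unrewarded presentations, then 5 rewarded trials.
w_pre, s_pre, _ = run_trials([0.0] * 20)
w_pre, _, gains_pre = run_trials([1.0] * 5, w=w_pre, s=s_pre)

# Non-preexposed cue: 5 rewarded trials only.
w_naive, _, gains_naive = run_trials([1.0] * 5)
```

Because preexposure reduces the posterior variance without changing the weight, the preexposed cue enters conditioning with a smaller gain and acquires value more slowly than the non-preexposed cue.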

A second implication of the TD model is that dopamine responses at the time of reward should scale with reward magnitude. This implication disagrees with the work of Tobler et al. (2005), who paired different cues half the time with a cue-specific reward magnitude (liquid volume) and half the time with no reward. Although dopamine neurons increased their firing rate whenever reward was delivered, the size of this increase was essentially unchanged across cues despite the reward magnitudes varying over an order of magnitude. Tobler et al. (2005) interpreted this finding as evidence for a form of adaptive coding, whereby dopamine neurons adjust their dynamic range to accommodate different distributions of prediction errors (see also Diederen & Schultz, 2015; Diederen, Spencer, Vestergaard, Fletcher, & Schultz, 2016, for converging evidence from humans). Adaptive coding has been found throughout sensory areas as well as in reward-processing areas (Louie & Glimcher, 2012). While adaptive coding can be motivated by information-theoretic arguments (Atick, 1992), the question is how to reconcile this property with the TD model.

The Kalman TD model resolves this puzzle if one views dopamine as reporting the variance-scaled prediction error instead of the raw prediction error (see Figure 3).^{1} Critical to this explanation is volatility updating: the scaling term increases with the diffusion variance, which itself scales with the reward magnitude in the experiment of Tobler et al. (2005). In the absence of volatility updating, the diffusion variance would stay fixed, and hence the scaling term would no longer be a function of reward history.^{2}
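The dependence of the scaling term on the diffusion variance can be made concrete with a short sketch. It assumes that the scaling term is the posterior predictive variance of reward from the standard Kalman filter and that the scaled signal is the prediction error divided by this term; the function name and all numerical values are hypothetical.

```python
import numpy as np

def predictive_variance(h, S, tau2, sigma_r2):
    """Posterior predictive variance of reward, used here as the scaling term."""
    return float(h @ (S + tau2 * np.eye(len(h))) @ h + sigma_r2)

h = np.array([1.0])        # a single active feature
S = np.array([[0.5]])      # posterior variance of its weight
delta = 1.0                # the same raw prediction error in both cases

lam_low = predictive_variance(h, S, tau2=0.01, sigma_r2=1.0)   # low estimated volatility
lam_high = predictive_variance(h, S, tau2=1.0, sigma_r2=1.0)   # high estimated volatility

scaled_low = delta / lam_low
scaled_high = delta / lam_high
```

For a fixed raw prediction error, a larger diffusion variance produces a larger scaling term and therefore a smaller scaled response, which is the direction of compression needed for adaptive coding.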

## 5 Representational Transformation in the Orbitofrontal Cortex

Dayan and Kakade (2001) described a neural circuit that approximates the Kalman filter but did not explore its empirical implications. This section reconsiders the circuit implementation applied to the Kalman TD model and then discusses experimental data relevant to its neural substrate.

Figure 4 presents a neuroanatomical gloss on the original proposal by Dayan and Kakade (2001). We suggest that the intermediate units correspond to the orbitofrontal cortex (OFC), with feedforward synapses to reward prediction neurons in the ventral striatum (Eblen & Graybiel, 1995). This interpretation offers a new, albeit not comprehensive, view of the OFC's role in reinforcement learning. Wilson, Takahashi, and Schoenbaum (2014) have argued that the OFC represents a “cognitive map” of task space, providing the state representation over which TD learning operates. The circuit described above can be viewed as implementing one form of state representation based on a whitening transform.

If this interpretation is correct, then OFC damage should be devastating for some kinds of associative learning (namely, those that entail nonzero covariance between cues) while leaving other kinds of learning intact (namely, those that entail uncorrelated cues). A particularly useful example of this dissociation comes from work by Jones et al. (2012), which demonstrated that OFC lesions eliminate sensory preconditioning while leaving first-order conditioning intact. This pattern is reproduced by the Kalman TD model if the intermediate units are “lesioned” such that no input transformation occurs (i.e., inputs are mapped directly to rewards; see Figure 1). In other words, the lesioned model is reduced to the original TD model with fixed learning rates.
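The logic of the lesion can be sketched for a single update step. In the sketch below, the input transformation is taken to be a Cholesky factor of the prior covariance, which is an assumption made for convenience and is not the recurrent circuit of Dayan and Kakade (2001). A plain delta rule applied to the transformed input then reproduces the Kalman update, whereas replacing the transformation with the identity (the "lesion") updates only the cue that is present.

```python
import numpy as np

def kalman_update(w, S_prior, h, r, sigma_r2=1.0):
    """Reference Kalman update of the weight mean for one observation."""
    lam = h @ S_prior @ h + sigma_r2
    delta = r - h @ w
    return w + S_prior @ h * delta / lam

def transformed_delta_rule(w, S_prior, h, r, sigma_r2=1.0, lesion=False):
    """Delta rule on a linearly transformed input; lesion=True removes the transform."""
    L = np.eye(len(w)) if lesion else np.linalg.cholesky(S_prior)
    z = L.T @ h                       # transformed input (intermediate units)
    lam = z @ z + sigma_r2            # normalization computed from the transformed input
    delta = r - h @ w                 # prediction error is unchanged by the transform
    dv = z * delta / lam              # simple error-driven update in transformed space
    return w + L @ dv                 # map the update back to the original weights

S_prior = np.array([[1.0, 0.4],
                    [0.4, 1.0]])      # two cues with positive covariance
w0 = np.zeros(2)
h = np.array([0.0, 1.0])              # only the second cue is present

w_kalman = kalman_update(w0, S_prior, h, r=1.0)
w_intact = transformed_delta_rule(w0, S_prior, h, r=1.0)
w_lesion = transformed_delta_rule(w0, S_prior, h, r=1.0, lesion=True)
```

With the transformation intact, the absent first cue acquires value through its covariance with the second cue; with the transformation removed, its weight does not change, which mirrors the loss of sensory preconditioning alongside preserved first-order conditioning.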

## 6 Discussion

The twin roles of Bayesian inference and reinforcement learning have a long history in animal learning theory, but only recently were these ideas unified into a single theory, known as Kalman TD (Gershman, 2015). In this letter, we applied the theory to several puzzling phenomena in the dopamine system: the sensitivity of dopamine neurons to posterior variance (latent inhibition), covariance (sensory preconditioning), and posterior predictive variance (adaptive coding). These phenomena could be explained by making two principled modifications to the prediction error hypothesis of dopamine. First, the learning rate, which drives updating of values, is vector-valued in Kalman TD, with the result that the associative weight for a cue can be updated even when that cue is not present, provided it has nonzero covariance with another cue. Furthermore, the learning rates can change over time, modulated by the agent's uncertainty. Second, Kalman TD posits that dopamine neurons report a normalized prediction error, such that greater uncertainty suppresses dopamine activity (see also Preuschoff & Bossaerts, 2007).

How are the probabilistic computations of Kalman TD implemented in the brain? We modified the proposal of Dayan and Kakade (2001), according to which recurrent dynamics produce a transformation of the stimulus inputs that effectively whitens (decorrelates) them. Standard error-driven learning rules operating on the decorrelated input are then mathematically equivalent to the Kalman TD updates. One potential neural substrate for this stimulus transformation is the OFC, a critical hub for state representation in reinforcement learning (Wilson et al., 2014). We showed that lesioning the OFC forces the network to fall back on a standard TD update (i.e., ignoring the covariance structure). This prevents the network from exhibiting sensory preconditioning, as has been observed experimentally (Jones et al., 2012). The idea that recurrent dynamics in OFC play an important role in stimulus representation for reinforcement learning and reward expectation has also figured in earlier models (Deco & Rolls, 2005; Frank & Claus, 2006).

Kalman TD is closely related to the hypothesis that dopaminergic prediction errors operate over belief state representations. These representations arise when an agent has uncertainty about the hidden state of the world. Bayes's rule prescribes that this uncertainty be represented as a posterior distribution over states (the belief state), which can then feed into standard TD learning mechanisms. Several authors have proposed that belief states could explain some anomalous patterns of dopamine responses (Daw et al., 2006; Rao, 2010), and experimental evidence has recently accumulated for this proposal (Lak, Nomoto, Keramati, Sakagami, & Kepecs, 2017; Starkweather, Babayan, Uchida, & Gershman, 2017; Takahashi, Langdon, Niv, & Schoenbaum, 2016). One way to understand Kalman TD is to think of the weight vector as part of the hidden state. A similar conceptual move has been studied in computer science, in which the parameters of a Markov decision process are treated as unknown, thereby transforming it into a partially observable Markov decision process (Duff, 2002; Poupart, Vlassis, Hoey, & Regan, 2006). Kalman TD is a model-free counterpart to this idea, treating the parameters of the function approximator as unknown. This view allows one to contemplate more complex versions of the model proposed here, for example, with nonlinear function approximators or structure learning (Gershman, Norman, & Niv, 2015), although inference quickly becomes intractable in these cases.

A number of other authors have suggested that dopamine responses are related to Bayesian inference in various ways. Friston and colleagues have developed a theory grounded in a variational approximation of Bayesian inference, whereby phasic dopamine reports changes in the estimate of inverse variance (FitzGerald, Dolan, & Friston, 2015; Friston et al., 2012; Schwartenbeck, FitzGerald, Mathys, Dolan, & Friston, 2014). This theory fits well with the modulatory effects of dopamine on downstream circuits, but it is currently unclear to what extent this theoretical framework can account for the body of empirical data on which the prediction error hypothesis of dopamine is based. Other authors have suggested that dopamine is involved in specifying a prior probability distribution (Costa, Tran, Turchi, & Averbeck, 2015) or influencing uncertainty representation in the striatum (Mikhael & Bogacz, 2016). These different possibilities are not necessarily mutually exclusive, but more research is necessary to bridge these varied roles of dopamine in probabilistic computation.

Of particular relevance here is the finding that sustained dopamine activation during the interstimulus interval of a Pavlovian conditioning task appears to code reward uncertainty, with maximal activation to cues that are the least reliable predictors of upcoming reward (Fiorillo, Tobler, & Schultz, 2003). Although it has been argued that this finding may be an averaging artifact (Niv, Duff, & Dayan, 2005), subsequent research has confirmed that uncertainty coding is a distinct signal (Hart, Clark, & Phillips, 2015). This suggests that dopamine may convey multiple signals, only some of which can be explained in terms of prediction errors as pursued here.

The Kalman TD model makes several new experimental predictions. First, it predicts that a host of posttraining manipulations, identified as problematic for traditional associative learning (Gershman, 2015; Miller et al., 1995), should have systematic effects on dopamine responses. For example, extinguishing the blocking cue in a blocking paradigm causes recovery of responding to the blocked cue in a subsequent test (Blaisdell, Gunther, & Miller, 1999); the Kalman TD model predicts that this extinction procedure should cause a positive dopaminergic response to the blocked cue. Note that this prediction does not follow from the probabilistic interpretation of dopamine in terms of changes in inverse variance (FitzGerald et al., 2015; Friston et al., 2012; Schwartenbeck et al., 2014), which reflects beliefs about policies (whereas we have restricted our attention to Pavlovian state values). A second prediction is that the OFC should exhibit dynamic cue competition and facilitation (depending on the paradigm). For example, in the sensory preconditioning paradigm (where facilitation prevails), neurons selective for one cue should be correlated with the neurons selective for another cue, such that presenting one cue will activate neurons selective for the other cue. By contrast, in a backward blocking paradigm (where competition prevails), neurons selective for different cues should be anticorrelated. Finally, OFC lesions in these same paradigms should eliminate the sensitivity of dopamine neurons to posttraining manipulations.

One general limitation of Kalman TD is that it imposes strenuous computational costs. For $D$ stimulus dimensions, a $D \times D$ covariance matrix must be maintained and updated. This representation thus does not scale well to high-dimensional spaces, but there are a number of ways the cost can be reduced. In many real-world domains, the intrinsic dimensionality of the state space is lower than the dimensionality of the ambient stimulus space. This suggests that a dimensionality reduction step could be combined with Kalman TD so that the covariance matrix is defined over a low-dimensional state space. Several lines of evidence suggest that this is indeed what the brain does. First, cortical inputs into the striatum are massively convergent, with an order of magnitude reduction in the number of neurons from cortex to striatum (Zheng & Wilson, 2002). Bar-Gad, Morris, and Bergman (2003) have argued that this anatomical organization is well suited for reinforcement-driven dimensionality reduction. Second, the evidence that dopamine reward prediction errors exhibit signatures of belief states (Lak et al., 2017; Starkweather et al., 2017; Takahashi et al., 2016) is consistent with the view that value functions are defined over low-dimensional hidden states. Third, many behavioral phenomena suggest that animals are learning about hidden states (Courville, Daw, & Touretzky, 2006; Gershman et al., 2015). Computational models of hidden state inference could be productively combined with Kalman TD in future work.

The theory presented here does not pretend to be a complete account of dopamine; there remain numerous anomalies that will keep RL theorists busy for a long time (Dayan & Niv, 2008). The contribution of this work is to chart a new avenue for thinking about the function of dopamine in probabilistic terms, with the aim of building a bridge between reinforcement learning and Bayesian approaches to learning in the brain.

## Notes

^{1}

Preuschoff and Bossaerts (2007) made an essentially identical suggestion, but did not provide a mechanistic proposal for how the scaling term would be computed.

^{2}

Eshel, Tian, Bukwich, and Uchida (2016) have reported that dopamine neurons in the ventral tegmental area exhibit homogeneous prediction error responses that differ only in scaling. One possibility is that these neurons have different noise levels or volatility estimates, which would influence the normalization term.

## Acknowledgments

This research was supported by the NSF Collaborative Research in Computational Neuroscience program grant IIS-1207833.

## References
