Temporal difference learning models of dopamine assert that phasic levels of dopamine encode a reward prediction error. However, this hypothesis has been challenged by recent observations of gradually ramping striatal dopamine levels as a goal is approached. This note describes conditions under which temporal difference learning models predict dopamine ramping. The key idea is representational: a quadratic transformation of proximity to the goal implies approximately linear ramping, as observed experimentally.
Temporal difference (TD) learning is arguably the most successful account of dopamine function in the basal ganglia (Glimcher, 2011; Niv & Schoenbaum, 2008; Schultz, Dayan, & Montague, 1997). According to this account, phasic dopamine signals a reward prediction error—the discrepancy between observed and predicted reward—and this signal is used to improve future predictions. Recently, Howe, Tierney, Sandberg, Phillips, & Graybiel (2013) reported a form of dopaminergic activity that appears (at first glance) to fly in the face of the prediction error hypothesis: as a rat approaches the goal in a maze, dopamine levels in the striatum gradually ramp up, peaking when the rat arrives at the goal. As Niv (2013) pointed out, this observation is unanticipated by existing TD models.
In this note, we describe conditions under which the TD model predicts ramping. The essential assumption pertains to the representation of space: provided that proximity to the goal is encoded by a convex transformation, ramping will be observed. In particular, the near-linear ramping that Howe et al. (2013) observed occurs when the proximity transformation is quadratic.
2. Temporal Difference Learning
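TD learning estimates the value function, defined as the expected discounted sum of future rewards, $V(s_t) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k}\right]$, where $s_t$ is the state at time $t$, $t$ indexes time, $r_t$ is the reward delivered at time $t$, and $\gamma$ is a discount factor. Here the expectation is taken over possibly random sequences of rewards; in the remainder of this note, we assume for simplicity that rewards and transitions are deterministic.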
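In the deterministic case, the discounted value of each time step and the resulting prediction errors can be computed directly. The following is a minimal sketch rather than part of the model: the reward sequence, the discount value of 0.9, and the function name `discounted_values` are illustrative assumptions.

```python
import numpy as np

def discounted_values(rewards, gamma):
    """Return V[t] = sum_k gamma**k * rewards[t + k] for a deterministic sequence."""
    values = np.zeros(len(rewards))
    future = 0.0
    # Work backward from the last step: V[t] = r[t] + gamma * V[t + 1].
    for t in reversed(range(len(rewards))):
        future = rewards[t] + gamma * future
        values[t] = future
    return values

# Hypothetical trial: a single unit reward at the final step.
rewards = np.array([0.0, 0.0, 0.0, 1.0])
gamma = 0.9  # assumed discount factor, chosen only for illustration

values = discounted_values(rewards, gamma)

# Prediction error: observed reward plus discounted next value, minus the
# current prediction. The value after the final step is taken to be zero.
next_values = np.append(values[1:], 0.0)
td_errors = rewards + gamma * next_values - values
```

When the values are exact, as here, every prediction error is zero. Nonzero errors arise only when the predictions are still being learned or when the value function is approximated rather than represented exactly.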
3. Modeling Spatial Navigation
4. Why Does Ramping Occur?
The predicted ramping behavior is illustrated in Figure 1 using and . For comparison, we also show the results for linear and exponential proximity transformations. Although there is a slight ramping predicted by the exponential transformation (and this ramping can be made stronger by increasing the slope of the exponential transformation), the ramping is always convex, which is inconsistent with the near-linear (and sometimes slightly concave) ramping observed by Howe et al. (2013).
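The comparison among transformations can be sketched numerically. The following is a minimal sketch, assuming that the value at proximity $x$ is a fixed nonnegative weight times $f(x)$ and considering only the steps before reward delivery, where the prediction error reduces to $\gamma V(x_{t+1}) - V(x_t)$. The discount factor, number of positions, weight, and exponential slope are illustrative assumptions, not the values used for Figure 1.

```python
import numpy as np

GAMMA = 0.98   # discount factor (illustrative assumption)
N_STEPS = 21   # evenly spaced positions along the track (illustrative assumption)
W = 1.0        # asymptotic weight, assumed nonnegative and already learned

# Proximity transformations, each increasing from 0 to 1 on [0, 1].
TRANSFORMS = {
    "linear": lambda x: x,
    "quadratic": lambda x: x ** 2,
    # The slope k = 3 is a hypothetical choice.
    "exponential": lambda x, k=3.0: (np.exp(k * x) - 1.0) / (np.exp(k) - 1.0),
}

def prereward_td_errors(f, gamma=GAMMA, n_steps=N_STEPS, w=W):
    """Prediction errors gamma * V(x[t+1]) - V(x[t]) on the unrewarded steps."""
    x = np.linspace(0.0, 1.0, n_steps)  # proximity to the goal at each step
    value = w * f(x)
    return gamma * value[1:] - value[:-1]

errors = {name: prereward_td_errors(f) for name, f in TRANSFORMS.items()}
```

Under these settings the quadratic errors rise with nearly constant increments and are very slightly concave, because discounting contributes a small negative quadratic term. The exponential errors also rise, but convexly, and the linear errors decline slightly rather than ramp.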
Several assumptions were made in these simulations for convenience rather than mathematical necessity. First, the proximity transformation was configured to monotonically increase from 0 to 1, but ramping will occur for any monotonically increasing convex transformation. Second, we assumed a one-dimensional proximity representation, but this can be generalized: any nonnegative combination of convex functions is convex, and therefore ramping will occur as long as the asymptotic weights are nonnegative and each feature is computed by a convex transformation. A corollary of this assumption is that the spatial representation is not a form of table look-up (Sutton & Barto, 1998), since table look-up is incompatible with a graded representation of space. Although earlier TD models of dopamine used a form of table look-up (e.g., Daw, Courville, & Touretzky, 2006; Schultz et al., 1997), more recent models have emphasized the importance of graded, distributed representations (e.g., Gustafson & Daw, 2011; Kurth-Nelson & Redish, 2009; Ludvig, Sutton, & Kehoe, 2008).
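The generalization to several features can be checked with a small numerical sketch. The two convex features, the nonnegative weights, the discount factor, and the grid of positions below are hypothetical choices made only for illustration.

```python
import numpy as np

GAMMA = 0.98                     # discount factor (illustrative assumption)
x = np.linspace(0.0, 1.0, 21)    # proximity to the goal at each step

# Two convex, increasing features of proximity (hypothetical choices).
features = np.stack([x ** 2, x ** 4])
weights = np.array([0.7, 0.3])   # nonnegative asymptotic weights (hypothetical)

# The value is a nonnegative combination of convex features.
value = weights @ features

# Second differences measure discrete curvature; positive means convex.
curvature = np.diff(value, n=2)

# Prediction errors on the unrewarded steps before the goal.
errors = GAMMA * value[1:] - value[:-1]
```

Here `curvature` is positive everywhere, so the combined value function remains convex, and `errors` increases monotonically toward the goal. Replacing a weight with a negative number can break both properties, which is why the nonnegativity condition matters.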
Howe et al. (2013) made a number of other observations that are consistent with this model: (1) ramps leading to similar rewards peaked at similar levels despite differences in running speed; (2) ramps leading to large rewards exceeded ramps leading to small rewards; and (3) the ramps dynamically changed when large and small rewards switched locations. The insensitivity to running speed (and hence time until the goal is reached) arises because the prediction errors near the goal will be the same regardless of how long it took to get there. The model is reward sensitive because the asymptotic weight scales with reward, as stipulated by equation 3.1. When rewards switch locations, the corresponding asymptotic weights will switch, leading to the observed ramp dynamics.
The gradual ramping of dopamine activity as a rat approaches a goal is consistent with the basic predictions of TD models. The special ingredient is a convex transformation of proximity to the goal. This transformation implies a spatial compression of the value function similar to Weber's law, such that values of locations far from the goal are closer together than values of locations near the goal. Interestingly, landmark-based compression of space has been reported in several species (Cheng, 1990; Cheng, Srinivasan, & Zhang, 1999), as well as in the hippocampal representation of space (O'Keefe & Burgess, 1996). We may speculate that the source of ramping lies in the hippocampal inputs to the striatum, which are thought to provide the features for value functions defined over space (Foster, Morris, & Dayan, 2000; Gustafson & Daw, 2011).
This work was supported by a postdoctoral fellowship from the MIT Intelligence Initiative. I thank Yael Niv and Ann Graybiel for helpful discussions.