Two fundamental questions underlie the expression of behavior, namely what to do and how vigorously to do it. The former is the topic of an overwhelming wealth of theoretical and empirical work particularly in the fields of reinforcement learning and decision-making, with various forms of affective prediction error playing key roles. Although vigor concerns motivation, and so is the subject of many empirical studies in diverse fields, it has suffered a dearth of computational models. Recently, Niv et al. [Niv, Y., Daw, N. D., Joel, D., & Dayan, P. Tonic dopamine: Opportunity costs and the control of response vigor. Psychopharmacology (Berlin), 191, 507–520, 2007] suggested that vigor should be controlled by the opportunity cost of time, which is itself determined by the average rate of reward. This coupling of reward rate and vigor can be shown to be optimal under the theory of average return reinforcement learning for a particular class of tasks but may also be a more general, perhaps hard-wired, characteristic of the architecture of control. We, therefore, tested the hypothesis that healthy human participants would adjust their RTs on the basis of the average rate of reward. We measured RTs in an odd-ball discrimination task for rewards whose magnitudes varied slowly but systematically. Linear regression on the subjects' individual RTs using the time varying average rate of reward as the regressor of interest, and including nuisance regressors such as the immediate reward in a round and in the preceding round, showed that a significant fraction of the variance in subjects' RTs could indeed be explained by the rate of experienced reward. This validates one of the key proposals associated with the model, illuminating an apparently mandatory form of coupling that may involve tonic levels of dopamine.
Maximizing rewards and minimizing punishments requires choosing the best action among the set of available options. Reinforcement learning (Sutton & Barto, 1998) offers ways of formalizing this process that resonate closely with the psychology and the neuroscience of decision-making (Daw & Doya, 2006; Montague & Berns, 2002; Montague, Dayan, & Sejnowski, 1996). In particular, phasic responses of macaque and rodent midbrain dopaminergic neurons to rewards and reward-associated stimuli are akin to the reward prediction error signal from reinforcement learning (Roesch, Calu, & Schoenbaum, 2007; Morris, Nevet, Arkadir, Vaadia, & Bergman, 2006; Bayer & Glimcher, 2005; Schultz, Dayan, & Montague, 1997). Moreover, abundant fMRI studies show that BOLD signal in the striatum, a major target of the dopaminergic system, correlates with the prediction error signals derived from fitting subjects' behavior with reinforcement learning models (Rutledge, Dean, Caplin, & Glimcher, 2010; O'Doherty et al., 2004; McClure, Berns, & Montague, 2003; O'Doherty, Dayan, Friston, Critchley, & Dolan, 2003). However, until recently, reinforcement learning models overlooked the important behavioral observation that animals systematically vary the vigor with which they execute their near-optimal choices (Phillips, Walton, & Jhou, 2007; Salamone & Correa, 2002). This omission is especially troubling considering the substantial data implicating the dopaminergic system in different aspects of response vigor (Lex & Hauber, 2008; Parkinson et al., 2002; Salamone & Correa, 2002; Taylor & Robbins, 1986; Langston, Forno, Rebert, & Irwin, 1984; Ungerstedt, 1971).
Niv, Daw, Joel, and Dayan (2007) recently developed a reinforcement learning model in which the vigor (defined as the inverse latency) of action can be optimized. The model realizes a trade-off between two costs: one stemming from the harder work assumed necessary to emit faster actions and the other from the opportunity cost inherent in acting more slowly. The latter arises because acting slowly delays the next reward and, indeed, all subsequent rewards. Niv et al. (2007) suggested that agents should choose latencies (and actions) to maximize the rate of accumulated reward per unit time and showed that the resulting optimal latencies would be inversely proportional to the average reward rate. On the basis of existing experimental evidence, Niv et al. (2007) proposed that tonic levels of dopamine report the average rate of reward and, thus, tied together prediction error (McClure et al., 2003; Schultz et al., 1997; Montague et al., 1996), incentive salience (Berridge & Robinson, 1998), and invigoration (Salamone & Correa, 2002) theories of dopamine. Furthermore, Cools, Nakamura, and Daw (2011) recently formulated an integrative model of opponency between dopamine and serotonin, which has at its heart the average rate of reward and the opportunity cost of time. In paradigms such as Pavlovian instrumental transfer, vigor appears to be at least partially under mandatory Pavlovian rather than wholly instrumental control. We, therefore, hypothesized that healthy human volunteers would adjust their response vigor on the basis of estimates of the average reward rate, irrespective of any instrumentality.
Here, we tested this hypothesis using a novel variant of a monetary incentive delay task (Adcock, Thangavel, Whitfield-Gabrieli, Knutson, & Gabrieli, 2006; Knutson, Adams, Fong, & Hommer, 2001). We induced changes in the average reward rate by varying the rewards offered on each trial and studied how the ensuing RTs varied. We deliberately made the task simple to avoid any issues associated with a speed–accuracy trade-off. Following conventional practice (Sutton & Barto, 1998), Niv et al. (2007) suggested that subjects might estimate the average reward rate using the delta or Rescorla–Wagner rule. Consequently, we also tested whether the dependence of RT on recent past rewards was consistent with the operation of such a rule.
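The delta (Rescorla–Wagner) rule mentioned above can be illustrated with a minimal sketch. It assumes one update per trial, a fixed learning rate, and an initial estimate of zero; the function name, the parameter values, and the example rewards are hypothetical and are not taken from the study.

```python
def running_average_reward(rewards, learning_rate=0.1, initial=0.0):
    """Delta-rule estimate of the average reward available at the start of each trial.

    Assumption: the estimate is updated once per trial (not per unit of time)
    with a constant learning rate.
    """
    estimate = initial
    estimates = []
    for reward in rewards:
        estimates.append(estimate)                      # estimate before this trial's outcome
        estimate += learning_rate * (reward - estimate)  # delta-rule update
    return estimates


# Hypothetical experienced rewards in pence (0 = missed trial).
experienced = [0, 50, 50, 0, 100]
print(running_average_reward(experienced, learning_rate=0.5))
# prints [0.0, 0.0, 25.0, 37.5, 18.75]
```

Under this rule, the weight given to a past reward decays exponentially with its age, so the most recent outcome contributes most to the running estimate.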
Subjects and Behavioral Paradigm
Thirty-nine subjects were recruited from the University College London Psychology Department's recruitment pool, received full written instructions, and provided written consent in accordance with the University College London Research Ethics Committee. The experiment used a regular PC monitor and keyboard. The layout of a trial is depicted in Figure 1A. At the beginning of each round, subjects were presented visually with a number representing the potential payout of that round Rt, in the range of 1–100 pence. The potential payouts, Rt, were varied across trials according to a prespecified function that was fixed across subjects and designed to vary over time in a way that was minimally correlated with other potential variables. The potential payout function used is displayed in Figure 1B. After a variable period (750–1250 msec), subjects were shown three visual figures and had to indicate the “odd one out” by pressing a button. For a trial to be counted as successful, subjects had to respond within 500 msec by pressing the button corresponding to the deviant stimulus. In 20% of the trials, this time constraint was lowered to 400 msec to ensure that the task would lead to unexpected misses and, therefore, to keep the participants engaged throughout the whole task. After being shown a blank screen for 500 msec, subjects were informed as to the success of the trial. This feedback was followed by another blank screen and the beginning of the next round.
Subjects performed as many trials as they could within the time limit of 27 min; this varied from 383 to 467 trials. At the end of the study, 10% of the trials were chosen randomly, and subjects were paid the sum of the value of the successful subset of those trials, plus a fixed show up fee of £5.
We fitted a log-normal distribution to each individual's RTs and, thus, obtained a set of associated z scores. We were, therefore, able to study individual and average RTs using a common analysis method, which we describe below. Missed trials (trials without any behavioral response) were not included in the analysis. For the averaged data, we did not analyze any trials after trial number 400, as the number of trials completed varied across subjects. For both types of analysis, we ignored the first 20 trials to allow subjects to get used to the task.
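One way to obtain such z scores is sketched below. It assumes that the log-normal fit amounts to taking the mean and the sample standard deviation of the log RTs for one subject; the function name and the example RTs are hypothetical.

```python
import math
import statistics


def lognormal_z_scores(rts_msec):
    """Z-score one subject's RTs under a log-normal fit.

    Assumption: the fit uses the mean and sample standard deviation of log(RT).
    """
    logs = [math.log(rt) for rt in rts_msec]
    mu = statistics.mean(logs)
    sigma = statistics.stdev(logs)
    return [(value - mu) / sigma for value in logs]


# Hypothetical RTs (msec) for a single subject.
z = lognormal_z_scores([380.0, 402.0, 416.0, 431.0, 455.0, 512.0])
```

Because every subject's z scores then have zero mean and unit variance, they can be averaged across subjects on a common scale.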
Given the log-normalized data, we performed a linear regression on the subject RTs using the following regressors:
Rt: available reward for the subjects to win in a given round.
rt − 1: experienced reward on the previous trial.
Repetition of stimulus: binary vector indicating whether the stimulus in the last round was the same as in this round.
Linear: linear function.
Too late: binary return indicating whether the response was too late in the previous round.
Intertrial interval: pretrial interval while waiting for the stimulus to be presented.
A constant term.
The available reward Rt, the immediately experienced reward rt − 1, and the averaged reward were the regressors of interest (see Figure 1B); the other regressors were included as nuisance variables.
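A regression of this form can be sketched with ordinary least squares as follows. This is an illustration only: it assumes NumPy is available, the regressor names are hypothetical, only two of the regressors are shown, and the data are synthetic and noise-free rather than the study's data.

```python
import numpy as np


def fit_rt_regression(z_rt, regressors):
    """OLS of z-scored log RTs on the named regressors plus a constant term."""
    y = np.asarray(z_rt, dtype=float)
    names = list(regressors) + ["constant"]
    columns = [np.asarray(regressors[name], dtype=float) for name in regressors]
    X = np.column_stack(columns + [np.ones(len(y))])
    betas, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ betas
    # Fraction of variance explained by the full set of regressors.
    r_squared = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
    return dict(zip(names, betas)), r_squared


# Synthetic, noise-free illustration (made-up values).
rng = np.random.default_rng(0)
n_trials = 200
synthetic = {
    "available_reward": rng.uniform(1, 100, n_trials),
    "average_reward": rng.uniform(0, 100, n_trials),
}
z_rt = 1.0 - 0.02 * synthetic["average_reward"]  # RT depends only on average reward
betas, r_squared = fit_rt_regression(z_rt, synthetic)
```

In this toy example, the fitted beta for the average reward is negative and the beta for the available reward is close to zero, which is the qualitative pattern the hypothesis predicts.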
Subjects performed an “odd-one-out” task with a response being required within 500 msec (400 msec in 20% of the trials, randomly chosen) for the chance of receiving a monetary reward whose magnitude was announced at the beginning of each trial (Cools et al., 2005; Knutson et al., 2001; see Figure 1A). Following each response, subjects were informed whether they chose correctly and sufficiently quickly. As their eventual payout was directly related to performance, subjects had an incentive to be both fast and accurate. Subjects chose the correct response in 92.8% of trials (standard deviation of 5.5% across subjects; only three subjects had less than 85% correct), suggesting that there was no substantial speed–accuracy trade-off. On average, subjects made their choices within 416 msec (standard deviation of 24 msec across subject means, mean standard deviation of 48 msec) and acted too slowly in 19.1% of trials on average (standard deviation of 8.8). In total, subjects responded correctly and within the allocated time on 73.7% of the trials (with standard deviation of 10.3 across subjects).
To vary the perceived average reward rate, the potential reward (Rt on trial t) was changed across trials (see Figure 1B) in a pseudorandom way. Although all subjects were presented with the same sequence, their individual errors implied that each subject would have his or her own individual actual immediate reward, actual previous reward rt − 1, and actual average reward rate (see Methods). To study the effect of these three quantities on subjects' RTs, we performed a linear regression on the logarithm of the RTs including Rt, rt − 1, and the average reward rate as regressors.
When performing this regression on the average RTs (Figure 2A), we were able to explain 21.4% of the variance, finding that the average rate of reward (t(37.9) = −3.44), the repetition of stimulus (t(37.9) = −5.75), and the time to the oddball (t(37.9) = −4.13) also contributed significantly (p < .05) to the variance. Rather surprisingly neither the available reward, Rt, nor the immediately experienced reward, r_(t − 1), significantly influenced the RTs (t(37.9) = 0.16 and t(37.9) = −0.09, respectively; see Figure 2B). The negative sign of the beta value for the average reward rate indicated that the regressor had a negative effect on the RTs, that is, causing subjects to speed up, as predicted by our original hypothesis.
We found similar results when performing the regression analysis on individual subjects' data, while explaining 7.7% of the variance on average (range of 1.6–22.2%). We performed a random effects two-tailed t test over beta values for each regressor across subjects. The beta values for the average reward rate (t(38) = −4.68, p < .0001), the repetition of stimulus (t(38) = −7.24, p < .0001), and the intertrial interval (t(38) = −5.48, p < .0001) regressors were all significantly different from zero (indicating their contribution towards the variance in the RTs), whereas the available reward, Rt, was not significantly different from zero (t(38) = 1.13, p = .264). Again, the negative sign for the beta value for the average reward rate indicated that increases in the average reward led to subjects increasing their speed, in accordance with the hypothesis derived from the model. Unlike the analysis of the averages across subjects, the immediate reward obtained in the previous trial, rt − 1 (t(38) = −2.38, p = .0226), and the too late regressor (t(38) = −3.08, p = .0038) also accounted significantly for more modest percentages of the variance in RTs across participants.
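The random effects test across subjects corresponds to a one-sample t test of the per-subject beta values against zero. A minimal sketch is given below; the function name and the five beta values are hypothetical, and the p value is omitted because it would additionally require the cumulative t distribution from a statistics library.

```python
import math
import statistics


def one_sample_t(betas):
    """t statistic and degrees of freedom for testing mean(betas) against zero."""
    n = len(betas)
    standard_error = statistics.stdev(betas) / math.sqrt(n)
    return statistics.mean(betas) / standard_error, n - 1


# Hypothetical per-subject betas for one regressor.
t_value, df = one_sample_t([-0.3, -0.1, -0.2, -0.4, 0.0])
print(round(t_value, 3), df)
# prints -2.828 4
```

With 39 subjects, the same computation yields 38 degrees of freedom, matching the t(38) values reported above.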
Finally, we considered whether subjects might have strategically slowed their responses on trials with large available rewards to optimize a speed–accuracy trade-off. Inconsistent with this possibility, we found no significant correlation between the number of participants who performed correctly on a given trial and the available reward on that trial (r = −0.056, p = .276).
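The correlation reported here can be illustrated with a plain Pearson coefficient computed across trials. The sketch below assumes one value per trial for the available reward and for the number of participants who responded correctly; the function name and the four example trials are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-trial values.
available_reward = [10, 40, 70, 100]   # pence offered on each trial
n_correct = [36, 37, 35, 36]           # participants correct on each trial
r = pearson_r(available_reward, n_correct)
```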
Exactly in line with our hypothesis, we showed that healthy human participants adjusted the vigor of motor behavior on the basis of an estimate of the average reward. This was seen both when the data were averaged across all participants and also when each participant's RT was analyzed individually.
It is well known that the RTs of humans and other animals (i.e., our definition of vigor) are influenced by the incentive motivational value of the goal toward which their actions are directed (e.g., Cools et al., 2005; Wittmann et al., 2005; Satoh, Nakai, Sato, & Kimura, 2003; Takikawa, Kawagoe, & Hikosaka, 2002; Watanabe et al., 2001). The evidence supporting an involvement of the midbrain dopaminergic system in the regulation of response vigor is also broad (Niv et al., 2007; Phillips et al., 2007; Satoh et al., 2003; Salamone & Correa, 2002; Berridge & Robinson, 1998). However, the computational basis of this process is rather less well studied (see Niv et al., 2007; Phillips et al., 2007; McClure et al., 2003). In the computational account of vigor suggested by Niv et al. (2007), vigor is proportional to the average rate of reinforcement.
The model of Niv et al. (2007) is based on a slightly different task, which lacks the immediate link between RT and reward that we employed here. That is, in the model, the only virtue of vigor is the opportunity cost of being slothful. However, in our task, subjects actually have to react quickly to win. It is straightforward to include the penalty of missing a reward because of too slow responses into the model, although we would then need to model more accurately the minimum possible RT. However, if anything, the addition of an imperative to respond quickly in our task should have strengthened the importance of the immediate reward Rt on RTs; thus, our experiment tests the stronger version of the underlying hypothesis, that is, that the coupling between average reward and RT is effectively mandatory, even when it is not instrumental. A less noisy measure of vigor than RT would in any case be needed to enable a more fully quantitative test of the model. We did not find any influence on response vigor of the reward available in a given trial. This came as a surprise, given previous observations that the motivational value of an outcome influences the latency of the associated response (see e.g., Cools et al., 2005; Wittmann et al., 2005). One further possibility is that RTs in our task could have been driven by a temporally local prediction error between the offered reward and the average rate of reward (Rt −2008). The positive excursions of this reward prediction error would be coded by the phasic activity of dopamine neurons (Schultz et al., 1997), which have been shown to influence vigor in monkeys (Satoh et al., 2003). However, the dependence of RT on the average reward rate
At a single subject level, we did find a modestly significant relationship between the vigor on a trial and the reward obtained on the previous trial. Of course, this previous reward is the most significant single contributor to the running estimate of the average reward. This link was not significant in the overall average data. The discrepancy may, for instance, reflect a positively skewed distribution of learning rates for the average reward across the subjects, leaving a net explanatory gap for this effect of the previous trial.
Niv and colleagues suggested that the effects of average reward on vigor would be mediated by tonic dopamine levels. Obviously, the present experiment did not test this possibility, and therefore, future research should directly address the relationship between the computation of the average rate of reward, vigor, and dopaminergic neurotransmission in humans and experimental animals. However, in relation to dopamine, it is possible that the correlation between phasic dopamine release and vigor observed in monkeys (Satoh et al., 2003) arose as a result of influences on dopamine activity that are associated with control over tonic rather than phasic firing. Certainly, the mechanisms controlling tonic levels of extrasynaptic dopamine and its relationship with phasic dopamine appear complex (Floresco, West, Ash, Moore, & Grace, 2003; Grace, 1991), with the two signals possibly being independently regulated (Grace, Floresco, Goto, & Lodge, 2007; Lodge & Grace, 2006). Niv et al. (2007) explicitly discuss the association between such influences and their normative account; it is reminiscent of other apparent influences as in appetitive Pavlovian to instrumental transfer.
According to some influential models of decision-making such as the drift–diffusion or the LATER models, observed behavioral responses result from the accumulation of evidence for execution until a certain threshold is reached (Ratcliff & Rouder, 2000; Reddi & Carpenter, 2000). In this framework, changes in RT may arise because of changes either in the rate of evidence accumulation or in the threshold (Ratcliff & Rouder, 2000; Reddi & Carpenter, 2000). Previous research has shown that the parameters of the drift–diffusion model may be modified depending on whether participants are instructed to perform as quickly or as accurately as they can, suggesting that the RT may be affected by a speed–accuracy trade-off (Ratcliff & McKoon, 2008). It could, thus, be argued that the lack of effect of available reward on RT could be a result of strategic slowing down to increase accuracy when the available reward was high. However, such an interpretation is unlikely because our participants showed a high accuracy in their performance across the whole task and we did not observe any between-subjects correlation between the available reward and the percentage of correct responses. Whether slow adjustments in the threshold of a decision process as specified in these models may have contributed to the observed effects of average reward on RT remains unclear. Such a possibility is orthogonal to our hypothesis, and our design does not permit a test of this possibility.
If the average rate of rewards enhances the vigor of responding, it is natural to consider whether the average rate of punishments enhances sloth and how this might be realized in the brain. In fact, one major pillar of the current version of the computational proposal that serotonin acts as an opponent to dopamine (Boureau & Dayan, 2011; Deakin & Graeff, 1991) is that serotonin is directly implicated in behavioral inhibition and behavioral quiescence (Dayan & Huys, 2009) as a contrast to dopamine's involvement in behavioral activation (Cools et al., 2011). However, there is a key asymmetry to consider: The average reward is an opportunity cost for time, because not acting swiftly postpones rewards that can be earned. By contrast, at least some forms of punishment cannot be postponed by sloth, and the formalism would need to be extended to capture this asymmetry fully.
In conclusion, we found that human vigor in an RT task was directly influenced by the local average reward rate, with higher rates leading to faster reactions. This is in direct contradiction of the intuitive notion that subjects speed up on the basis of immediate fluctuations in potential rewards but supports a recent normative theory of vigor.
Reprint requests should be sent to Marc Guitart-Masip, ICN, University College London, 17 Queen Square, London, WC1N 3AR, United Kingdom, or via e-mail: email@example.com.
These authors contributed equally to this study.