An essential component of skill acquisition is learning the environmental conditions in which that skill is relevant. This article proposes and tests a neurobiologically detailed theory of how such learning is mediated. The theory assumes that a key component of this learning is provided by the cholinergic interneurons in the striatum known as tonically active neurons (TANs). The TANs are assumed to exert a tonic inhibitory influence over cortical inputs to the striatum that prevents the execution of any striatal-dependent actions. The TANs learn to pause in rewarding environments, and this pause releases the striatal output neurons from this inhibitory effect, thereby facilitating the learning and expression of striatal-dependent behaviors. When rewards are no longer available, the TANs cease to pause, which protects striatal learning from decay. A computational version of this theory accounts for a variety of single-cell recording data and some classic behavioral phenomena, including fast reacquisition after extinction.
During skill learning, a response elicited by a specific stimulus might be rewarded, but if this same stimulus is encountered outside the training session, why doesn't the absence of reward extinguish the skill response? This article proposes and tests a computational theory of such context-sensitive learning. Briefly, we propose that a key component of this learning is provided by the tonically active cholinergic interneurons in the striatum (tonically active neurons [TANs]).
The striatum is known to contribute to many aspects of motor, cognitive, and limbic processing, and a huge literature suggests that the striatum is critically important in skill learning (for reviews, see e.g., Ashby & Ennis, 2006; Yin & Knowlton, 2006; Doyon & Ungerleider, 2002; Packard & Knowlton, 2002). In humans, approximately 96% of all striatal neurons are medium spiny neurons (MSNs; Yelnik, Francois, Percheron, & Tande, 1991), which receive cortical input and send axons out of the striatum to basal ganglia (BG) output structures. The TANs are cholinergic striatal interneurons that have extensive axon fields allowing them to project to large striatal regions (e.g., Calabresi, Centonze, Gubellini, Pisani, & Bernardi, 2000; Kawaguchi, Wilson, Augood, & Emson, 1995).
TANs are tonically active in their resting state, and they have a prominent modulatory effect on MSNs (Pakhotin & Bracci, 2007; Gabel & Nisenbaum, 1999; Akins, Surmeier, & Kitai, 1990; Akaike, Sasa, & Takaori, 1988; Dodt & Misgeld, 1986). These effects are both pre- and postsynaptic.1 With respect to cortical input, however, the predominant effect of TAN activity on MSN activation is inhibitory. For example, Pakhotin and Bracci (2007) reported that a single TAN spike caused a significant reduction in the excitatory postsynaptic current induced by cortical (glutamatergic) input. On the basis of these and other results, they concluded that after a TAN pause, MSNs “will transiently become much more responsive to cortical inputs” (p. 399) and that the resumption of TAN firing “will cause an abrupt reduction of MSN excitation” (p. 399).
Thus, MSNs are especially responsive to cortical input during TAN pauses. To understand the behavioral significance of this phenomenon, it is therefore critical to study the environmental conditions that cause TANs to pause. In fact, it is well established that TANs pause to the delivery of reward and to stimuli that predict the delivery of reward (Apicella, Legallet, & Trouche, 1997; Aosaki, Tsubokawa, et al., 1994; Kimura, 1992; Apicella, Scarnati, & Schultz, 1991). They also pause to novel stimuli (Blazquez, Fujii, Kojima, & Graybiel, 2002). Another important result is that whereas most MSNs fire to a restricted set of stimuli from a single sensory modality (e.g., Caan, Perrett, & Rolls, 1984), many TANs respond to stimuli from a number of different modalities (Matsumoto, Minamimoto, Graybiel, & Kimura, 2001). Thus, a TAN might respond to the discriminative cue associated with reward, but it is also likely to respond to other visual, auditory, and olfactory cues (for example) that occur incidentally at the time of reward delivery.
The TANs receive their strongest excitatory glutamatergic input from the caudal intralaminar nuclei of the thalamus (Smith, Raju, Pare, & Sidibe, 2004; Sadikot, Parent, & Francois, 1992; Cornwall & Phillipson, 1988), which includes the center-median (CM) and the parafascicular (Pf) nuclei. The CM/Pf complex receives input from a number of places, including the orbitofrontal cortex (OFC), the pedunculopontine tegmental nucleus, and the ascending reticular activating system (Van der Werf, Witter, & Groenewegen, 2002)—structures that are well known to participate in reward processing and arousal.
The TANs are also prominent targets of substantia nigra dopamine cells. Two features of this dopaminergic input are relevant to the model proposed here. First, dopamine cell responses and TAN pauses are temporally coincident (Cragg, 2006; Morris, Arkadir, Nevet, Vaadia, & Bergman, 2004). Second, long-term potentiation (LTP) in TANs requires elevated levels of dopamine (Suzuki, Miura, Nishimura, & Aosaki, 2001; Aosaki, Graybiel, & Kimura, 1994). These results suggest that TANs may learn to pause to cues that signal reward via reinforcement learning at CM/Pf–TAN synapses. In support of this idea, simultaneous single-unit recordings from CM/Pf neurons and TANs show that although an intact CM/Pf response is required for the TANs to pause, the CM/Pf response to environmental cues is relatively constant, regardless of the reward contingencies of the task, whereas the TANs pause primarily to reward-predicting cues (Matsumoto et al., 2001). Because TAN pauses are primarily driven by the CM/Pf complex, it therefore seems reasonable that plasticity at CM/Pf–TAN synapses allows the TANs to learn to pause in the presence of cues that predict reward.
To test these ideas formally, we constructed a computational model with the overall architecture shown in Figure 1. The idea is that, in the absence of CM/Pf input, the TAN's high spontaneous firing tonically inhibits the MSN response to cortical input.2 When cells in the CM/Pf complex fire, reinforcement learning at the CM/Pf–TAN synapse quickly causes the TAN to pause when in a rewarding environment. This releases the MSNs from tonic inhibition, thereby allowing them to respond to cortical inputs and thus to gate learning at cortical–striatal synapses.
The model described by Equations 1 and 2 accurately accounts for patch–clamp data collected from MSNs in the rat (i.e., see Figure 8.37 of Izhikevich, 2007) in the sense that it displays both the up and the down states that characterize MSN firing patterns, and it displays realistic spiking behavior.
Note that we modeled the effects of CM/Pf activation as purely excitatory. In fact, the evidence is good that glutamate inputs from the CM/Pf also synapse on GABAergic interneurons in the striatum, which then synapse on TANs. As a result, CM/Pf activation also can induce an inhibitory input to the TANs (Zackheim & Abercrombie, 2005; Suzuki et al., 2001). We chose not to model this indirect inhibitory effect because TANs pause when positive current is injected directly into the cell (e.g., see Figure 4). Thus, whereas the inhibitory input may potentiate the TAN pause, it is apparently not necessary to induce the pause.
Following standard models, we assumed that synaptic plasticity at all cortical–striatal synapses and at the CM/Pf–TAN synapse is modified according to reinforcement learning that requires three factors: (1) strong presynaptic activation, (2) postsynaptic activation that is strong enough to activate N-methyl d-aspartate (NMDA) receptors, and (3) dopamine levels above baseline (e.g., Reynolds & Wickens, 2002; Arbuthnott, Ingham, & Wickens, 2000; Calabresi, Pisani, Mercuri, & Bernardi, 1996). If postsynaptic activation is strong but dopamine is below baseline, then the synapse is weakened. The synapse is also weakened, regardless of dopamine level, if postsynaptic activation is below the NMDA threshold but above the α-amino-3-hydroxy-5-methyl-4-isoxazole propionic acid (AMPA) threshold (i.e., the AMPA receptor is a low-threshold glutamate receptor).
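The three-factor rule just described can be sketched as a trial-level update. This is only an illustrative caricature, not the article's actual Equation 9; the threshold and rate values are placeholders.

```python
# Sketch of the three-factor reinforcement-learning rule described above.
# All numerical values (thresholds, rates, baseline) are illustrative
# placeholders, not the article's fitted parameters.

def weight_change(pre, post, dopamine,
                  d_base=0.2,        # assumed baseline dopamine level
                  theta_nmda=0.5,    # assumed NMDA activation threshold
                  theta_ampa=0.1,    # assumed AMPA activation threshold
                  alpha=0.1, beta=0.05, gamma=0.01):
    """Return the change in synaptic strength for one trial."""
    if post > theta_nmda and dopamine > d_base:
        # strong pre- and postsynaptic activity plus dopamine above
        # baseline -> the synapse is strengthened (LTP)
        return alpha * pre * (post - theta_nmda) * (dopamine - d_base)
    if post > theta_nmda and dopamine < d_base:
        # strong activity but dopamine below baseline -> weakening (LTD)
        return -beta * pre * (post - theta_nmda) * (d_base - dopamine)
    if theta_ampa < post <= theta_nmda:
        # postsynaptic activity between the AMPA and NMDA thresholds
        # weakens the synapse regardless of dopamine level
        return -gamma * pre * (post - theta_ampa)
    # postsynaptic activity below the AMPA threshold: no change
    return 0.0
```

Note that the final branch is what protects striatal learning when the MSN is tonically inhibited: if postsynaptic activation never clears the AMPA threshold, the synapse simply does not change.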
Note that these learning equations do not depend on striatal acetylcholine (ACh) levels. The evidence is good, however, that ACh does modulate corticostriatal LTP and LTD (Bonsi et al., 2008; Wang et al., 2006; Centonze, Gubellini, Bernardi, & Calabresi, 1999). In vitro results seem to suggest that (1) steady-state ACh levels are required for normal corticostriatal LTP and (2) reduced ACh levels are required for LTD. An obvious assumption is that a TAN pause is associated with reduced ACh levels, so these results seem to imply that corticostriatal LTP cannot occur during a TAN pause, only LTD. This creates a paradox, however, because the environmental conditions known to cause TANs to pause (e.g., the appearance of cues that predict reward) are the same as the conditions thought to promote corticostriatal LTP. For example, in conditioning tasks, an animal is rewarded for associating a motor response with a sensory cue. Many such studies have shown that MSNs learn to fire a burst to the presence of the cue (e.g., Carelli, Wolske, & West, 1997; another example occurs in the Figure 8 data of Barnes, Kubota, Hu, Jin, & Graybiel, 2005), and presumably this increase in MSN activation is mediated by LTP at corticostriatal synapses.
One possibility is that a TAN pause may not cause a simple reduction in striatal ACh levels. The TAN response to sensory cues associated with reward is multiphasic. Frequently, the TAN pause is preceded by an initial burst (as, e.g., in Figure 4) and also followed by a rebound burst (as, e.g., in Figure 6). Thus, ACh levels may fluctuate rapidly during the course of a TAN pause. As a result, an informed model of the role that ACh plays in corticostriatal LTP (and LTD) may require a better understanding of the temporal dynamics of the ACh signal and of corticostriatal LTP and LTD. Lacking such data, we opted for a simpler model that ignores the role of ACh. Even so, as the next section will show, for the applications we considered, this simplified model was sufficient.
We assumed that learning at both cortical–MSN synapses and at CM/Pf–TAN synapses is mediated by this same model. We allowed the learning rates to differ at these two synapse types, but we assumed the same numerical values for θNMDA and θAMPA. The numerical values for all parameters are given in the Appendix (i.e., see Table A1).
| Parameter | Value | Value |
| --- | --- | --- |
| Equations 4 and 5 | | |
| v(n), n = 0 | 0.2 | 0.2 |
| Equation 9 (MSN) | | |
| αw | 0.07 × 10⁻⁹ | 1.0 × 10⁻⁹ |
| βw | 0.02 × 10⁻⁹ | 0.9 × 10⁻⁹ |
| γw | 0.005 × 10⁻⁹ | 0.005 × 10⁻⁹ |
| Equation 9 (TAN) | | |
| αw | 0.6 × 10⁻⁷ | 0.8 × 10⁻⁷ |
| βw | 0.1 × 10⁻⁷ | 0.2 × 10⁻⁷ |
| γw | 0.005 × 10⁻⁷ | 0.005 × 10⁻⁷ |
| Equation 9 (General) | | |
| wK,J, initial value | 0.2 | Uniform (0.2, 0.225) |
The Equation 9 model of reinforcement learning requires that we specify the amount of dopamine released on every trial in response to the feedback signal [the D(n) term]. The more that the dopamine level increases above baseline (Dbase), the greater the increase in synaptic strength, and the more it falls below baseline, the greater the decrease.
A simple model of dopamine release can be built by specifying how to compute obtained reward, how to compute predicted reward, and how the amount of dopamine release is related to the reward prediction error (RPE). Our solution to these three problems is as follows.
Computing Obtained Reward
None of the applications considered in this article vary reward valence. Thus, we can use a simple model to compute obtained reward. Specifically, we defined the obtained reward Rn on trial n as +1 if correct or reward feedback is received, 0 in the absence of feedback, and −1 if error feedback is received.
Computing Predicted Reward
Computing Dopamine Release from the RPE
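The three computations named above can be summarized in a short sketch. The obtained-reward definition follows the text exactly; the predicted-reward update and the release function are assumed standard forms (a delta rule and a bounded, roughly linear mapping of the RPE), not the article's exact equations.

```python
# Hedged sketch of the trial-level dopamine signal. obtained_reward()
# follows the definition given in the text; update_predicted_reward()
# and dopamine_release() use assumed standard forms, and the baseline
# and rate values are illustrative.

D_BASE = 0.2  # assumed baseline dopamine level


def obtained_reward(feedback):
    """R(n): +1 for reward/correct feedback, -1 for error feedback, 0 for none."""
    return {"reward": 1.0, "none": 0.0, "error": -1.0}[feedback]


def update_predicted_reward(p, r, rate=0.1):
    """Delta-rule estimate of predicted reward (assumed form)."""
    return p + rate * (r - p)


def dopamine_release(rpe):
    """Map the RPE onto dopamine release: above baseline for positive
    RPEs, below baseline for negative RPEs, bounded in [0, 1]."""
    return min(1.0, max(0.0, D_BASE + 0.8 * rpe))
```

On this scheme, an unexpected reward (positive RPE) pushes dopamine above baseline and strengthens active synapses, while an omitted but expected reward (negative RPE) pushes dopamine below baseline and weakens them.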
Figures 2 and 3 illustrate an application of the model to a simple conditioning task in which the participant must execute some specific response (e.g., button press) when a certain sensory cue is presented (e.g., a tone) to receive a reward. Figure 2 shows activation in each brain region in the model during one trial early in training—before the model has learned to reliably respond to the sensory cue. Note that the CM/Pf and the sensory cortex activations are both modeled as simple square waves that are assumed to coincide with the stimulus presentation. Because the TAN has not yet learned that the cue is associated with reward, it fails to pause when the stimulus is presented. As a result of the tonic inhibition from the TAN, the MSN does not fire to the stimulus, although stimulus presentation does move it from the down state to the up state. In the absence of any inhibitory input from the striatum, the globus pallidus does not slow its high spontaneous firing rate, and therefore the thalamus is prevented from firing to other excitatory inputs. The premotor unit fires at a slow tonic rate, but note that this rate does not increase during stimulus presentation. As a result, the model does not respond on this trial.
Figure 3 illustrates a trial in this same experiment, but later in training. Now the TAN pauses when the stimulus is presented, which allows the MSN to fire a vigorous burst, which inhibits the globus pallidus. The pause in pallidal firing allows the thalamus to respond to its other excitatory inputs, and the resulting burst from the thalamus drives the firing rate in the premotor unit above the response threshold. The model now responds to the sensory cue.
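The disinhibition chain traversed in these two trials can be caricatured with firing rates rather than spiking units. The sketch below is only illustrative (the article's model uses spiking neurons), and all rates and thresholds are assumed values.

```python
# Rate-based caricature of the gating chain in Figures 2 and 3:
# cortex -> MSN (gated by the TAN) -| globus pallidus -| thalamus -> premotor.
# All numerical values are illustrative placeholders.

def trial(cortex_drive, tan_pauses):
    gp_tonic, thal_drive, resp_threshold = 1.0, 1.0, 0.5

    # The MSN responds to cortical input only during a TAN pause.
    msn = cortex_drive if tan_pauses else 0.0

    # MSN output inhibits the pallidum below its tonic rate ...
    gp = max(0.0, gp_tonic - msn)

    # ... which disinhibits the thalamus ...
    thal = max(0.0, thal_drive - gp)

    # ... letting the thalamic burst drive premotor cortex past threshold.
    premotor = thal
    return premotor > resp_threshold
```

Early in training the TAN has not learned to pause, so `trial(1.0, False)` produces no response; once the pause is learned, `trial(1.0, True)` releases the chain and the model responds.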
Single-unit Recordings from TANs
We begin by testing the model of the TANs against some basic single-unit recording data. The goal is to test whether our model of TAN activity is qualitatively consistent with spiking behavior recorded from real TANs. See the Appendix for technical details of all simulations.
The Patch–Clamp Data of Reynolds et al. (2004)
Reynolds et al. (2004) collected in vivo intracellular recordings from single TANs of anesthetized rats. The results of one such recording are shown in the top panel of Figure 4. In this experiment, a suprathreshold positive current of 100-msec duration was injected into the cell (denoted by the small gray bar in the figure). Figure 4 shows that the TAN responded with an initial burst followed by a prolonged after-hyperpolarization that caused a pause in firing that persisted for approximately 900 msec. Note that these data show that excitatory input alone is enough to induce a TAN pause. In other words, although CM/Pf activation may have both excitatory and inhibitory effects on TANs, the Figure 4 data suggest that the excitatory inputs by themselves may be sufficient to induce a pause.
The bottom panel of Figure 4 shows the response of the model's TAN to these same experimental conditions. Note that the model also fires a burst to the injected current and then pauses for roughly 900 msec. Thus, the model displays the same temporal dynamics as real TANs. Figure 5 shows the phase portraits from the Figure 4 application, which explain why the model exhibits its pronounced pause to excitatory input. When the input is turned off, the voltage resetting mechanism in the model moves the model's state to a region where the derivative on voltage is negative (bottom panel). Voltage then decreases until it reaches the v-nullcline (the set of all (u, v) pairs where the derivative of voltage is zero). The state then slowly drifts down the v-nullcline until eventually it breaks free. This prolonged period during which voltage changes very little produces the pause in firing.
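The burst-then-pause dynamics can be reproduced qualitatively with a minimal Izhikevich-type simulation. The parameters below are generic "regular spiking" values rather than the article's fitted TAN parameters; the point is only that a slow recovery variable built up during the pulse-evoked burst suppresses firing for a while after the pulse ends.

```python
# Minimal Izhikevich-type neuron driven by a 100-msec current pulse.
# Parameter values are generic regular-spiking values, not the
# article's TAN parameters.

def simulate_tan_pulse(t_max=2000.0, dt=0.25, pulse=(500.0, 600.0)):
    a, b, c, d = 0.02, 0.2, -65.0, 8.0   # recovery dynamics and reset
    v, u = -65.0, b * -65.0
    spikes = []
    t = 0.0
    while t < t_max:
        i_inj = 10.0                      # tonic background drive
        if pulse[0] <= t < pulse[1]:
            i_inj = 30.0                  # suprathreshold pulse
        v += dt * (0.04 * v * v + 5.0 * v + 140.0 - u + i_inj)
        u += dt * a * (b * v - u)
        if v >= 30.0:                     # spike, then reset
            spikes.append(t)
            v, u = c, u + d               # u accumulates with each spike
        t += dt
    return spikes
```

Because u decays slowly relative to the firing rate, it piles up during the pulse-evoked burst and transiently opposes the tonic drive afterward, before tonic firing resumes.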
The Learning Data of Aosaki, Tsubokawa, et al. (1994)
The Figure 4 data of Reynolds et al. (2004) clearly show the characteristic short-term dynamics of TANs, but they fail to show several other well-documented features of the TANs that will be critical to our later modeling. First, and most obviously, they do not show the high spontaneous firing rate of TANs that inspired their name, and second, they do not show the ability of the TANs to learn to pause to a stimulus that predicts reward. Figure 6 shows data from Aosaki, Tsubokawa, et al. (1994) that illustrate both of these properties. In this experiment, monkeys received a juice reward when a click occurred. This click-reward pairing was repeated many times while extracellular recordings were collected from 858 TANs in two animals. At the beginning of training, the animals ignored the clicks, and only a small percentage of TANs responded to the clicks (i.e., 17%). During training, the number of TANs that paused to the clicks gradually increased, until eventually well over half of the TANs were pausing. Individual TANs learned to pause after as little as 10 min of training, and the pause response was maintained over the course of a 4-week intermission. In addition, when other sensory cues were substituted for the click, TANs that paused to the clicks quickly learned to pause to the new stimulus.
The top panel of Figure 6 shows the spike histogram for a single TAN before and after conditioning. Note the high spontaneous firing rate before the click is presented and that before conditioning the TAN does not respond to the click. After conditioning, however, the TAN pauses about 90 msec after the click for a duration of 189 msec. The bottom panel of Figure 6 shows the responses of the model's TAN under these same conditions. Note that the model's TAN has a high spontaneous firing rate, that it initially does not respond to the click, and that after training, it pauses to the click with roughly the same lag and duration as the monkey's TAN.5
These applications suggest that our model of TAN firing mimics the most important properties of real TANs.
A wide variety of evidence implicates the striatum in instrumental conditioning (e.g., Yin, Ostlund, Knowlton, & Balleine, 2005; O'Doherty et al., 2004). In a typical experiment, a reward-neutral environment is suddenly altered so that rewards become available when certain instrumental behaviors are emitted (acquisition phase). During extinction, the environment is changed again so that any potential to retrieve rewards is eliminated. Finally, during reacquisition, the environment is changed once more so that the instrumental behavior is again rewarded. The conditioning literature has naturally focused on how the strength of the association between the instrumental behavior and the reward varies during these different conditions, but it is widely recognized that secondary associations are also learned to a variety of environmental cues (e.g., Kamin, 1969). In this section, we examine the role that the TANs might play in these phenomena.
As a model experiment, we considered a task in which an animal must produce a single instrumental response (e.g., a lever press) at the onset of a sensory cue (e.g., a tone) to retrieve a reward. To model initial learning, extinction, and reacquisition in this task, we constructed a version of the model with a single unit in sensory cortex, which was either active or not depending on whether the sensory cue was present. Similarly, because only one response was possible, there was only one unit in all other brain regions. We assumed that a response was emitted when the integrated premotor unit activity crossed a threshold. Because there is only one choice for the model to make, feedback is never negative. Indeed, because the model can either respond and collect a reward or fail to respond and thereby fail to collect a reward, feedback is always either positive or neutral. Figures 2 and 3 show the architecture of this version of the model and predicted neural activations in each unit during a typical trial early in learning (Figure 2) or much later after the instrumental behavior is well learned (Figure 3).
The behavioral performance of the model in this experiment is shown in the top panel of Figure 7. Note that the model learns to respond reliably to the cue during initial conditioning, that the cue is gradually ignored during extinction, and that the behavior is quickly reacquired after the reward is reinstated. The most important result in Figure 7, however, is that reacquisition is considerably faster than the initial learning of the behavior. This is one of the most widely known results in the conditioning literature. It is famous because it is a ubiquitous empirical phenomenon that is seen in almost all conditioning–extinction–reacquisition experiments (for an exception, see Ricker & Bouton, 1996) and because it has posed a difficult challenge for learning theories. For example, fast reacquisition has long been known to disconfirm any theory that assumes learning is purely a process of strengthening associations between stimuli and responses (e.g., Redish, Jensen, Johnson, & Kurth-Nelson, 2007). In such models, response rate is completely determined by the strength of these associations. Conditioning increases the strength from some initial value, and extinction decreases it back to its starting point (if the extinction phase is long enough). Thus, at the beginning of the reacquisition phase, the strength of the stimulus–response association is the same as at the beginning of the initial conditioning phase. As a result, relearning must follow the same course as initial learning.
The bottom panel of Figure 7 shows how the model accounts for fast reacquisition. This graph shows the strengths of the CM/Pf–TAN synapse (broken line) and of the sensory cortex–MSN synapse (solid line) for each trial in the experiment. Note that the CM/Pf–TAN synaptic strength increases before the cortex–MSN synaptic strength. Of course, it must rise earlier because the cortical–medium spiny cell synapse cannot be strengthened until the TAN has begun to pause. In addition to increasing sooner, however, note that the CM/Pf–TAN synaptic strength also rises at a greater rate. We hypothesize that this is because TANs are more broadly tuned than MSNs.6 For example, consider an experiment where an animal must press a lever when a tone is presented to retrieve a food reward. The MSN that is conditioned in this experiment will fire to the tone, but it is unlikely to fire to other cues that are present, especially those from other sensory modalities (e.g., visual and olfactory cues from the testing chamber). As mentioned previously, however, TANs are so broadly tuned that they will respond to stimuli from multiple sensory modalities (Matsumoto et al., 2001). For this reason, the TANs will have many more opportunities to experience synaptic plasticity than the MSNs, and as a result, we hypothesize that they learn more quickly when placed in a rewarding environment.
Note next in Figure 7 that during extinction, the CM/Pf–TAN synaptic strength drops all the way to its preconditioning baseline level, but the sensory cortex–MSN synaptic strength drops only a small amount, where it remains throughout the extinction period. As the CM/Pf–TAN synaptic strength weakens, it becomes less and less likely that CM/Pf activation will induce the TAN to pause. In the absence of a TAN pause, Figure 2 shows that the MSN will not fire. In Equation 9, this corresponds to a trial in which the postsynaptic activation is below the AMPA receptor activation threshold. As a result, under these conditions, synaptic strength does not change. Thus, the TANs have the desirable property that they protect prior cortical–striatal learning during periods when the environment has changed in such a way that rewards are no longer available.
Figure 7 also shows that reacquisition time is essentially equal to the time it takes the TANs to learn that rewards are again available. At that point, the TANs begin pausing again, and the protected cortical–MSN synaptic strengths allow performance to jump nearly to its preextinction level. Finally, note that during reacquisition, the cortical–MSN synaptic strength grows to an even larger level than it reached during initial acquisition. As a result, the model predicts that after the end of the reacquisition period, the neural representation of the learned behavior is stronger than it was after initial acquisition.
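This account of fast reacquisition can be illustrated with a trial-level toy simulation in which a fast-learning CM/Pf–TAN weight gates a slow-learning cortex–MSN weight. All rates, thresholds, and phase lengths below are illustrative choices, not the model's parameters.

```python
# Trial-level caricature of the account given above: the TAN weight
# tracks reward availability quickly, and the cortex-MSN weight changes
# only while the TAN pauses. All numbers are illustrative.

def simulate_conditioning(phases=(("reward", 40),
                                  ("extinction", 50),
                                  ("reward", 40))):
    w_tan, w_msn = 0.1, 0.1            # initial synaptic strengths
    d_base = 0.5
    pause_thresh, resp_thresh = 0.5, 0.5
    responses = []
    for phase, n_trials in phases:
        for _ in range(n_trials):
            dopamine = 1.0 if phase == "reward" else 0.0
            tan_pauses = w_tan >= pause_thresh
            # The TAN learns quickly about the reward environment ...
            w_tan = min(1.0, max(0.1, w_tan + 0.2 * (dopamine - d_base)))
            # ... but the cortex-MSN synapse changes only when the MSN
            # is released from inhibition (i.e., during a TAN pause).
            if tan_pauses:
                w_msn = min(1.0, max(0.0, w_msn + 0.05 * (dopamine - d_base)))
            responses.append(tan_pauses and w_msn > resp_thresh)
    return responses
```

In this sketch, extinction drives the TAN weight back to baseline after a few trials, which freezes the cortex–MSN weight well above its starting value; reacquisition then requires only that the TAN relearn to pause, so responding resumes much sooner than it first appeared during acquisition.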
We know of no single-unit recording data from exactly this experiment. Even so, Barnes et al. (2005) reported single-unit recording results from a similar experiment in which seven rats were trained to run down a T-maze to obtain a food reward. When the animals reached the intersection point, either a high- or a low-pitch tone sounded, which instructed them whether to turn right or left for the reward. Barnes et al. recorded from single (striatal) MSNs during sessions of acquisition, extinction, and reacquisition. The top panel of Figure 8 shows relative firing rates averaged across the MSNs that responded to the auditory tone.7
Using the same version of the model that was used to generate Figure 7, we computed this same relative firing rate statistic (averaged across 70 simulated animals) from the spikes elicited from the MSN of the model in response to the stimulus cue (see the Appendix for modeling details). Results are shown in the bottom panel of Figure 8. Note that the model correctly captures many properties of the data. These include (1) an increase in firing rate during acquisition, (2) a reduction in firing rate during extinction, (3) increasing firing rates during reacquisition, (4) lower relative firing rates during extinction than during reacquisition or the end of acquisition but higher than baseline (i.e., Sessions 1 and 2), (5) higher relative firing rates during reacquisition than during acquisition, and (6) numerical firing rates during the entire experiment that are close to the observed relative firing rates. Only one parameter was estimated during this simulation process (baseline firing rate in the absence of any cues), and this parameter value only affected the last of these six properties.
Next we focus on the ability of the model to account for behavioral and single-unit recording data from the category-learning experiment of Merchant, Zainos, Hernandez, Salinas, and Romo (1997). In this experiment, a rod was dragged against a monkey's finger at one of 10 different speeds. The animals were trained to push one button if one of the five low speeds occurred and to press a different button if they received one of the five high speeds. After extended feedback training, the animals reliably learned these categories.
After training, the animals completed an additional session during which single-unit recordings were collected from the putamen. Within the putamen, the responses of 695 cells were characterized in detail. Of these, 196 responded to the movement onset of the rod, regardless of category membership, 258 responded to the animal's arm movement, regardless of response, and 165 responded to the category membership of the stimulus. The neurons in this latter category responded to all stimuli in one category but not to any stimuli in the contrasting category. The responses of two such cells are shown in the left column of Figure 9.
These same neurons, however, exhibited dramatically different behavior when the monkeys were presented with the same stimuli under passive conditions—that is, when no rewards were available and when their arms were restrained to prevent a response and the device housing the response keys was removed. Under these conditions, as illustrated in the right column of Figure 9, these same category-specific neurons showed no response to stimulus presentation.
According to the theory proposed here, in the passive conditions, the TANs quickly learned that there are no rewards available and therefore failed to pause when the categorization stimuli were presented. In the absence of such a pause, the MSNs were tonically inhibited and therefore unable to respond to any cortical stimulation. To test this prediction rigorously, we constructed a version of the model with 10 sensory cortical units,8 one tuned to each stimulus, and two pathways through the striatum, globus pallidus, thalamus, and premotor cortex (i.e., as in Figure 1). To model the passive condition, we simply removed feedback delivery from the model (i.e., we set Rn = 0 in Equation 11).
The model easily learned the tactile categories. Figure 10 describes the behavioral performance of the model and the monkeys. The left column of Figure 11 shows the category-specific firing in the model's striatal output units.9 Comparing Figure 11 with the Merchant et al. (1997) data shown in Figure 9 suggests that the major discrepancy between the model and the data is in the timing of onset and offset of spiny cell firing relative to the stimulus onset and offset. However, it is important to note that we made no attempt to model these temporal dynamics. For example, Merchant et al. varied the duration of each stimulus so that the distance the rod traveled across the monkey's finger was constant. We used the same duration for all stimuli. Also, we made no attempt to model the delay between the time when the rod was first applied to the monkey's finger and the time when cells in somatosensory cortex began to fire to this stimulation. Making these two changes would improve the correspondence between Figures 9 and 11.
Our main goal in this article is not to construct the most accurate model possible of the active categorization condition but instead to account for the striking difference between the active and the passive conditions. The right column of Figure 11 shows that the model provides a reasonable account of this difference. During active categorization, the TAN learns to pause after stimulus presentation. This allows the medium spiny unit to respond to sensory input and the model to eventually make a response. In the passive condition, the TAN quickly unlearns its pause response. The medium spiny unit is consequently tonically inhibited and cannot respond to sensory input. Thus, the difference between the two conditions is driven by the model's TAN unit. These results support the hypothesis of Ashby, Ennis, and Spiering (2007) that the TANs might be responsible for mediating the difference in firing properties of category-specific neurons in the Merchant et al. (1997) data.
Many sensory cues are typically present during skill acquisition. It is quite common to encounter some of these in later contexts where the skilled behavior is no longer appropriate. In this article, we showed how cholinergic interneurons in the striatum might protect cortical–striatal synapses during these periods when rewards for the skilled behavior are not available. The idea is that the TANs exert a tonic inhibitory influence over cortical input to striatal MSNs that prevents the execution of striatal-dependent actions. However, the TANs learn to pause in rewarding environments, and this pause releases the striatal output neurons from inhibition, thereby facilitating the learning and expression of striatal-dependent behaviors. When rewards are no longer available, the TANs cease to pause, which protects striatal learning from decay. We showed that the resulting model was consistent with a variety of single-cell recording data and that it also predicted some classic behavioral phenomena, including fast reacquisition after extinction.
Relations to Other Theories
There have been a number of other proposals that the TANs learn to pause in environments associated with reward (e.g., Apicella, 2007; Shimo & Hikosaka, 2001; Sardo, Ravel, Legallet, & Apicella, 2000). However, to our knowledge, none of these have been developed into predictive theories.
There have also been several computational models that account for acquisition and extinction using a model of dopamine release similar to the one used here (Redish et al., 2007; Kakade & Dayan, 2002; O'Reilly & Munakata, 2000). The first two of these models assume that extinction is exclusively an unlearning phenomenon, and they therefore do not account for fast reacquisition. In the O'Reilly and Munakata (2000) model, however, once the strength of what in our model is the cortical–striatal synapse falls low enough for the behavior to disappear, that synaptic strength is no longer weakened by further extinction trials. This allows the model to predict fast reacquisition because the first rewarded trial after extinction brings the synaptic strength back above threshold and therefore reinstates the behavior. The TANs endow our model with a similar property.
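The threshold mechanism can be illustrated with a toy weight-update sketch. This is our own minimal illustration, not the actual learning rule of either model: the threshold, learning rate, and update equations below are all assumed for the example.

```python
# Toy illustration (not the model's actual learning equations): a
# synaptic weight w drives behavior only while it exceeds a response
# threshold. Unlearning occurs only while the behavior is still being
# expressed, so extinction stalls just below threshold and a single
# rewarded trial can restore above-threshold responding.

THRESHOLD = 0.5   # behavior is expressed only when w > THRESHOLD
ALPHA = 0.2       # illustrative learning/unlearning rate

def trial(w, rewarded):
    responded = w > THRESHOLD
    if rewarded:
        w += ALPHA * (1.0 - w)   # strengthen toward ceiling
    elif responded:
        w -= ALPHA * w           # weaken only while still responding
    return w

w = 0.9                          # strength after acquisition
for _ in range(50):              # extinction: no rewards
    w = trial(w, rewarded=False)

# The weight stalls just below threshold instead of decaying to zero,
# so one rewarded trial reinstates the behavior.
w = trial(w, rewarded=True)
print(w > THRESHOLD)   # True: fast reacquisition
```

Because extinction leaves the weight just under threshold rather than at zero, reacquisition here takes one rewarded trial, whereas original acquisition from a weak initial weight would take several.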
Tan and Bullock (2008) recently developed a Hodgkin–Huxley-type computational model of TAN firing that is more detailed than the model proposed here. For example, Tan and Bullock modeled changes in several specific ion concentrations, along with the effects on TAN activation of dopamine and of GABAergic interneurons. On the other hand, they did not model activity in striatal MSNs, nor did they model activity in any cells outside the striatum. Thus, their empirical applications are limited to data collected from single TAN units (e.g., no behavioral data were modeled). The major difference between their model and ours is that they account for intrinsic and learned TAN responses via modulation of TAN activity by GABAergic and dopaminergic input rather than by synaptic plasticity (as we assume). Because the evidence for LTP at TAN synapses is good (Suzuki et al., 2001; Aosaki, Graybiel, et al., 1994), it seems likely that TAN pauses are modulated by all of these factors. In any case, Tan and Bullock's model may best be seen as a detailed theory of TAN responses within classical conditioning paradigms. Our model, in contrast, is primarily concerned with how TAN responses gate learning at corticostriatal synapses and with how this function influences behavior in instrumental conditioning paradigms and more general striatal-dependent behaviors.
It is important to acknowledge the rather severe limitations of the theory proposed here. First, with respect to interactions with cortex, note that the striatum is organized into a set of functionally separate, parallel loops (Alexander, DeLong, & Strick, 1986). Which loop a particular subregion of striatum belongs to is determined primarily by the cortical regions that project to it. The theory developed here concerns the learning of stimulus–response associations and therefore applies to striatal regions receiving input from sensory areas of cortex (e.g., see Figure 1). This excludes anterior regions of striatum, which are innervated primarily by areas of frontal cortex. For example, tasks that activate pFC also commonly activate the head of the caudate nucleus because pFC projects strongly to this anterior region of the striatum. pFC and its striatal targets (e.g., dorsal striatum, head of the caudate nucleus) have their own role in context processing, which is beyond the scope of this article. For example, there is considerable evidence that a circuit linking pFC with the head of the caudate plays a critical role in attentional switching between different contexts (e.g., Robbins, 2007). There is also evidence that the TANs play a critical role in this process (Ragozzino, 2003), so a theory that attempts to account for such switching might postulate a role for the TANs similar to the one proposed here.
It is also important to note that the theory proposed here is meant to apply only to initial skill (or habit) learning. With overtraining, skills eventually come to be executed automatically. Following similar suggestions in the literature (Ashby, Alfonso-Reese, Turken, & Waldron, 1998; Miller, 1981, 1988), Ashby et al. (2007) proposed a model in which the development of automaticity is mediated by a transfer of control from the cortical–striatal pathways emphasized here to cortical–cortical pathways from the relevant areas of sensory cortex to the areas of premotor and motor cortices that mediate the selection and execution of the appropriate motor program. For example, several studies have reported evidence that with overtraining, skills of the type modeled here become independent of both dopamine and the striatum (e.g., Bespalov, Harich, Jongen-Rêlo, van Gaalen, & Gross, 2007; Choi, Balsam, & Horvitz, 2005; Turner, McCairn, Simmons, & Bar-Gad, 2005; Carelli et al., 1997). It would be straightforward to augment the present model with the cortical–cortical pathways proposed by Ashby et al. (with cortical–cortical plasticity mediated by Hebbian learning). This augmented model should be used to make predictions about the effects of overtraining on the tasks considered in this article.
Third, we have greatly oversimplified the neuroanatomy of the BG, omitting, for example, the GABAergic interneurons, the striosomes (i.e., patch compartments), the ventral striatum, and the indirect and hyperdirect pathways. However, our goal was not to build the most complete possible model of the BG but rather to focus on the effects of the TANs on MSNs. For all regions downstream of the striatum, we simply tried to construct the simplest reasonable model that could account for the limited behavioral phenomena considered in this article. It seems likely that if the model were extended to more complex behaviors, then more biological detail would be needed in downstream areas. This generalization will be a goal of future research.
APPENDIX: SIMULATION METHODS
Each modeling application was based on the network structure illustrated in Figure 1 or Figure 2, depending on whether there were two response alternatives or only one, respectively. Solutions to all differential equations were estimated using Euler's method. In single-response applications, the model made a response whenever the output of the premotor unit crossed a threshold. In the two-response application, the model responded A or B depending on whether the output from premotor unit A or premotor unit B crossed a threshold first. If the output from neither premotor unit crossed the threshold during the trial, then the model responded A if the maximum output value of the premotor A unit was greater than the maximum output value of the premotor B unit. The behavioral simulations were replicated 100 times, and the results were averaged.
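The two-response decision procedure can be sketched as follows. This is a minimal illustration only: the leaky-integrator dynamic, time constant, step size, and threshold are stand-ins for the model's actual unit equations and fitted parameters.

```python
import numpy as np

# Sketch of the decision procedure described above, using Euler's
# method (x(t + dt) ~ x(t) + dt * dx/dt) with a stand-in leaky-
# integrator dynamic in place of the model's actual unit equations.
# DT, TAU, and THRESHOLD are illustrative values, not fitted parameters.

DT = 0.001         # integration step (sec)
TAU = 0.1          # stand-in time constant (sec)
THRESHOLD = 0.8    # premotor response threshold

def simulate_trial(input_a, input_b, duration=1.0):
    """Respond 'A' or 'B' via the first premotor unit to cross the
    threshold; if neither crosses, choose the larger peak output."""
    x = np.zeros(2)                      # premotor A and B activations
    peak = np.zeros(2)
    drive = np.array([input_a, input_b])
    for _ in range(int(duration / DT)):
        x += DT * (drive - x) / TAU      # one Euler step
        peak = np.maximum(peak, x)
        if (x >= THRESHOLD).any():       # first-crossing rule
            return 'A' if x[0] >= x[1] else 'B'
    return 'A' if peak[0] >= peak[1] else 'B'  # fallback: larger peak

print(simulate_trial(1.2, 0.6))   # 'A'
```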
We began by estimating the intrinsic firing properties of each cell (e.g., the numerical values of 80 and 25 in Equation 2) followed by the synaptic strengths between units (e.g., βS in Equation 2). The parameters in the learning equations were estimated last. The parameter estimates from all applications are listed in Table A1.
It is important to point out that although the model includes many numerical parameters, its performance is qualitatively inflexible. There is a range on each parameter that allows the model to make responses and to learn. For numerical values outside this range, the behavior of the network collapses (e.g., a unit never fires, no matter what its input, or it always fires). Within the range of reasonable parameter values, the model tends to always make the same qualitative predictions. For example, the TANs always inhibit the MSNs, so when the TAN activity decreases, the MSNs become more responsive to cortical input. Different numerical values of the parameters within the reasonable range tend to change the predictions of the model only slightly. For example, learning and extinction rates may change, but not whether the model learns or extinguishes. Thus, we believe that all of the predictions derived in this article follow in a necessary fashion from the general architecture of the model and do not depend in any critical way on our ability to find exactly the right set of parameter values.
To verify these observations more formally, we implemented the following sensitivity analysis for the most complex empirical application reported in this article: our demonstration that the model accounts for fast reacquisition after extinction of an instrumental behavior (i.e., top panel of Figure 7). The analysis proceeded as follows. For each of the nine most important parameters in the model (the response threshold from Equation 8; θAMPA and θNMDA from Equation 9; the Equation 9 values of αw, βw, and γw for corticostriatal learning; and the Equation 9 values of αw, βw, and γw for CM/Pf–TAN learning), we successively changed the parameter estimate from the value used to generate the predictions shown in the top panel of Figure 7 by −1%, −10%, +1%, and +10%. After each change, we simulated the behavior of the model in the same conditions used to generate Figure 7. Next, after each new simulation, we computed the correlation between the learning curve shown in the top panel of Figure 7 and the learning curve produced by the new version of the model. In all except one case, these correlations exceeded .99, suggesting that the model makes the same qualitative predictions for a wide range of each of its parameters. The only exception occurred for a +10% increase in the response threshold parameter (i.e., the threshold for a cortical unit to initiate a motor response). In this case, the correlation was .74. Importantly, however, even in this case, the model predicted that reacquisition was faster than original acquisition.
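The perturbation-and-correlation procedure can be sketched as follows. Here `run_model` is a hypothetical stand-in for the full network simulation (it returns a generic exponential learning curve governed by one assumed parameter rather than the model's actual nine), so only the structure of the analysis, not its numbers, reflects what was done.

```python
import numpy as np

# Sketch of the sensitivity analysis described above. `run_model` is a
# hypothetical stand-in for the full simulation: it returns an
# exponential learning curve governed by a generic 'learning_rate'
# parameter rather than the model's actual parameters.

def run_model(params):
    blocks = np.arange(50)
    return 1.0 - np.exp(-params['learning_rate'] * blocks)

baseline = {'learning_rate': 0.1}
reference_curve = run_model(baseline)

results = {}
for name in baseline:                         # each key parameter
    for pct in (-0.10, -0.01, 0.01, 0.10):    # the four perturbations
        perturbed = dict(baseline)
        perturbed[name] *= 1.0 + pct
        curve = run_model(perturbed)
        # correlation between reference and perturbed learning curves
        r = np.corrcoef(reference_curve, curve)[0, 1]
        results[(name, pct)] = r

print(min(results.values()))   # all correlations are close to 1
```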
Specific Notes on Model Fitting
When updating cortical–striatal synaptic strengths, we summed the total positive medium spiny activation during stimulus presentation to obtain the postsynaptic activity sum. When updating CM/Pf–TAN synapses, we only summed the first 200 msec after stimulus presentation to obtain the postsynaptic activity sum. This was necessary because the TAN pauses for most of the stimulus presentation. By limiting our sum to the first 200 msec after stimulus presentation, we were essentially capturing the short burst of spikes that tends to precede each pause.
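The two windowed sums can be sketched as follows. The 200-msec window is from the text; the 1-sample-per-msec resolution and the example traces are assumed for illustration.

```python
import numpy as np

# Sketch of the two postsynaptic-activity sums described above, with
# hypothetical activation traces sampled at 1 sample per msec.

TAN_WINDOW_MS = 200   # only the first 200 msec counts for CM/Pf-TAN updates

def corticostriatal_sum(msn_trace):
    """Total positive MSN activation over the whole stimulus period."""
    return float(np.sum(np.maximum(msn_trace, 0.0)))

def tan_sum(tan_trace):
    """TAN activation summed over only the first 200 msec, capturing
    the brief burst that precedes the pause."""
    return float(np.sum(np.maximum(tan_trace[:TAN_WINDOW_MS], 0.0)))

# A hypothetical TAN trace: a 150-msec burst followed by a long pause
tan_trace = np.concatenate([np.full(150, 2.0), np.full(850, -1.0)])
print(tan_sum(tan_trace))   # 300.0: the pause does not cancel the burst
```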
The data from the Barnes et al. (2005) experiment were based on an average of 38 trials per block for six blocks during acquisition, 33 trials per block for five blocks during extinction, and 38 trials per block for six blocks during reacquisition. Acquisition and reacquisition occurred with continuous reinforcement. During extinction, four animals received rewards on 3–9% of trials in each block, and three animals never received a reward. We simulated this experiment with the single-response version of the model (as in Figures 2 and 3) to ensure that the same model was used to generate both Figures 7 and 8. To generate predictions, the model was run through the entire experiment for 70 iterations (i.e., 228 trials of acquisition, 165 trials of extinction, and 228 trials of reacquisition). In all iterations, the model received continuous reinforcement during acquisition and reacquisition. For 40 iterations, the model received a reward on 5% of extinction trials, and for the other 30 iterations, it never received a reward during extinction. Spike frequencies were converted to relative firing rates in the following way. First, note that Barnes et al. defined relative firing rate as the proportion of total spikes recorded while the animal was in the maze that were produced in response to the auditory cue. We assumed that the Session 1 data from the acquisition period could be used to estimate baseline firing levels (i.e., before significant learning had occurred). Figure 8 shows that this value is roughly .10 (actually slightly less), which means that 10% of the total spikes before learning are produced to the tone. If the absolute number of spikes produced to the tone before learning was B0, then the total number of spikes produced while the animal is in the maze (before learning) is 10B0. Let N equal the number of spikes produced by the model in response to the tone. As mentioned before (see footnote 9), our model of MSNs has a baseline firing level of 0.
So before learning, N = 0 (or a very small number). Thus, on each trial, we assumed the number of spikes recorded in response to the tone was N + B0, and the total number of spikes recorded for the entire trial was N + 10B0. Figure 8 was produced with a value of B0 = 6.5.
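The conversion above amounts to a single formula, which can be checked directly in code (the value B0 = 6.5 is the one reported in the text; the function name is ours):

```python
# The conversion described above: if the model produces N spikes in
# response to the tone and B0 is the pre-learning baseline count, the
# relative firing rate is (N + B0) / (N + 10 * B0).

B0 = 6.5   # baseline spike count used to produce Figure 8

def relative_firing_rate(n, b0=B0):
    return (n + b0) / (n + 10.0 * b0)

# Before learning N = 0, so the relative rate reduces to
# B0 / (10 * B0) = 0.10, matching the Session 1 baseline in Figure 8.
print(relative_firing_rate(0))   # 0.1
```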
This research was supported by NIH Grants R01 MH3760-2 and P01 NS044393 and by support from the U.S. Army Research Office through the Institute for Collaborative Biotechnologies under contract DAAD19-03-D-0004. The authors thank John Ennis for his help in developing an earlier version of the model proposed here and Aaron Ettenberg for his helpful comments.
Reprint requests should be sent to F. Gregory Ashby, Department of Psychology, University of California, Santa Barbara, CA 93106, or via e-mail: firstname.lastname@example.org.
Evidence suggests that the postsynaptic effect of ACh is to stabilize the MSN membrane potential while it is in either the up state or the down state (Gabel & Nisenbaum, 1999). In contrast, the presynaptic effects seem to be mostly inhibitory (mediated by muscarinic M2 receptors on the axons of cortical pyramidal neurons; e.g., Calabresi et al., 2000).
We modeled the inhibitory effect of TANs on activation at cortical–striatal synapses as postsynaptic. As mentioned above, the most significant inhibitory effect may be presynaptic (Pakhotin & Bracci, 2007; Calabresi et al., 2000). Modeling the inhibitory effects as postsynaptic simplifies the model because it allows us to model cortical input as a simple square wave. We are confident that none of the simulations reported in this article would change in any significant way if we changed the model by replacing the square-wave model of cortical input with a more realistic spiking model and making the TAN inhibition presynaptic rather than postsynaptic. Note also that our model ignores postsynaptic excitatory effects of ACh. These are poorly understood, and it is not clear what role they play in cortical–striatal dynamics or how they should be modeled.
This is the classical view of the basal ganglia (i.e., as providing a brake on cortex). Thalamic neurons frequently fire a rebound burst when released from inhibition, however (Sherman & Guillery, 2006), so another possibility may be that striatal firing initiates an excitatory input from thalamus to cortex. We believe that the qualitative behavior of our model would not change if we had included rebound spiking in our model of thalamus.
Bayer and Glimcher (2007) recently reported evidence that negative RPEs may be coded by the duration of the pause in dopamine cell firing. This suggests that the dynamic range of positive and negative RPEs may be more balanced than we have modeled. However, we also constructed a model with equal dynamic range for positive and negative RPEs and found that the model's qualitative behavior and ability to account for the data were unaffected.
Note that in the data of Aosaki, Tsubokawa, et al. (1994), the TANs fire a burst at the end of the pause and then quickly reduce their firing to baseline levels. We chose not to model this feature of those data because the data of Reynolds et al. (2004) shown in Figure 4 do not display this property. If we had incorporated a rebound burst into the model, then Figures 4 and 6 would, of course, change, but none of the other predictions reported in this article would change in any way.
We did not explicitly model this broad tuning. Instead, we mimicked the effects of broad tuning by setting the learning rates higher on the CM/Pf–TAN synapse than on the sensory cortex–MSN synapses (i.e., see Table A1 for specific numerical values of all parameters).
The relative firing rate plotted in Figure 8 is defined as the number of spikes elicited by the auditory tone divided by the total number of spikes recorded during the entire time the animal was running in the maze.
We assumed that the response of each unit decreased as a Gaussian function of the distance in stimulus space between the stimulus preferred by that unit and the presented stimulus. Specifically, if a stimulus with stylus speed xs mm/sec was presented, then the response of the unit maximally tuned to speed x mm/sec was exp[−(x − xs)2 / 2.5].
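The tuning function given above, restated in code (the function name and example speeds are ours; the formula is from the text):

```python
import math

# The Gaussian tuning function described above: the response of a unit
# maximally tuned to stylus speed x (mm/sec) to a presented stimulus
# of speed xs.

def tuned_response(x, xs):
    return math.exp(-(x - xs) ** 2 / 2.5)

print(tuned_response(10.0, 10.0))   # 1.0 at the preferred speed
```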
The Izhikevich (2007) model of MSN activation used in this article gives a good account of patch-clamp experiments, but the model predicts that MSNs never fire spontaneously. In fact, MSNs do not have a high spontaneous firing rate (e.g., Wilson, 1995). Nevertheless, they do sporadically fire in the absence of significant stimulation. This is easily seen in Figure 9. We chose to model this spontaneous activity by adding a Poisson process to the spike trains that were generated from Equations 1 and 2. In the present application, this noise process added, on average, three spikes per second. We augmented the model in this way only for the two applications where we fit the model to spike trains from MSNs (i.e., this application and the application to the data of Barnes et al., 2005). Note, however, that even without adding these extra random spikes, the model still accounts for the most important qualitative properties of the data of Merchant et al. (1997)—namely, category-specific responding during the active condition and no response to these same stimuli in the passive condition. It is also important to note that adding or not adding this extra noise source would not affect any other applications in this article.
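The augmentation can be sketched as follows. The 3 spikes/sec rate is from the text; the spike times, trial duration, and function name are assumed for illustration.

```python
import numpy as np

# Sketch of the spontaneous-activity augmentation described above:
# superimpose a homogeneous Poisson process (mean 3 spikes/sec) on the
# spike times produced by the deterministic MSN model. The example
# spike times and trial duration below are hypothetical.

RATE_HZ = 3.0                         # mean spontaneous spikes per second
rng = np.random.default_rng(0)

def add_spontaneous_spikes(spike_times, duration_s):
    """Merge model spike times (in sec) with Poisson background spikes."""
    n_extra = rng.poisson(RATE_HZ * duration_s)          # Poisson count
    extra = rng.uniform(0.0, duration_s, size=n_extra)   # uniform times
    return np.sort(np.concatenate([np.asarray(spike_times), extra]))

augmented = add_spontaneous_spikes([0.10, 0.35, 0.40], duration_s=1.0)
print(len(augmented) >= 3)   # True: original spikes are always retained
```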