Appetitive goal-directed behavior can be associated with a cue-triggered expectancy that it will lead to a particular reward, a process thought to depend on the OFC and basolateral amygdala complex. We developed a biologically informed neural network model of this system to investigate the separable and complementary roles of these areas as the main components of a flexible expectancy system. These areas of interest are part of a neural network with additional subcortical areas, including the central nucleus of amygdala, ventral (limbic) and dorsomedial (associative) striatum. Our simulations are consistent with the view that the amygdala maintains Pavlovian associations through incremental updating of synaptic strength and that the OFC supports flexibility by maintaining an activation-based working memory of the recent reward history. Our model provides a mechanistic explanation for electrophysiological evidence that cue-related firing in OFC neurons is nonselective early after a contingency change and for why this nonselective firing is critical for promoting plasticity in the amygdala. This ambiguous activation results from the simultaneous maintenance of recent outcomes and obsolete Pavlovian contingencies in working memory. Furthermore, at the beginning of reversal, the OFC is critical for supporting responses that are no longer inappropriate. This result is inconsistent with an exclusive inhibitory account of OFC function.
Deciding on which course of action to take critically depends on which rewards are expected to be available in the current situation. A reward may be expected because of the presence of a sensory stimulus that has reliably preceded reward delivery before or because reward was recently received in the current context. Existing data suggest that the amygdala learns which conditioned stimuli (CSs) and unconditioned stimuli (USs) are associated (Schoenbaum, Chiba, & Gallagher, 1999; Kita, Nishijo, Eifuku, Terasawa, & Ono, 1995) and that the orbital frontal cortex (OFC) can keep track of recent reward history in the current context (Wallis, 2007; Frank & Claus, 2006). We propose a model in which learning (and memory) in amygdala (specifically its basolateral complex, BLA) is solely weight-based, but memory in OFC is activation-based, and therefore does not depend on synaptic plasticity over relatively short time scales (O'Reilly & Munakata, 2000; see O'Reilly, Mozer, Munakata, & Miyake, 1999, for a theoretical discussion). Hence, OFC is capable of dynamically updating to new reward expectancy representations very quickly (despite no weight changes), but the amygdala changes more slowly because it is dependent on adapting its synaptic weights. As a result of this dynamic, OFC supports flexible decision-making when and if environmental contingencies change.
To investigate the separable contributions of these different brain areas to reinforcement expectancies, we developed a biologically informed neural network model. This model captures empirical data on the effects of lesions of OFC, BLA, and simultaneous lesions of both areas on the ability to acquire the initial Pavlovian contingencies and the ability to adapt to a reversal of the Pavlovian contingencies.
Our simulations provide a mechanistic account for how OFC supports behavioral flexibility when Pavlovian contingencies change: Associations in BLA only change relatively slowly, but OFC can actively maintain a working memory of the recent reward history (Wallis, 2007; Frank & Claus, 2006). Thus, at the beginning of reversal, OFC can promote flexibility by biasing approach behavior associated with a recently experienced and now maintained US, even in the face of a CS that had previously predicted an aversive US. This role in Pavlovian reversal is in contrast with perspectives that have proposed that the primary role of OFC is the inhibition of inappropriate behavior (e.g., Elliott, Dolan, & Frith, 2000; Dias, Robbins, & Roberts, 1996; Damasio, 1994; Mishkin, 1964; Ferrier, 1886).
In addition to accounting for behavioral effects of lesions, the model provides an explicit mechanistic explanation for electrophysiological data that suggest that, when Pavlovian contingencies change, OFC promotes behavioral flexibility by providing an ambiguous reinforcement expectancy. Under these circumstances, OFC shows nonselective cue-evoked activity (Schoenbaum, Roesch, Stalnaker, & Takahashi, 2009). The results of our simulations suggest that this nonselective activity could arise because the OFC simultaneously maintains a working memory of the recent reward history and the now-obsolete reinforcement expectancies driven by the lagging BLA. Furthermore, consistent with empirical data, the speed at which the BLA can acquire and update Pavlovian associations is severely impaired if the OFC is lesioned (Saddoris, Gallagher, & Schoenbaum, 2005). The model provides an explicit mechanistic explanation for this phenomenon as well, and it makes several related predictions.
In our model, acquisition and performance of a Pavlovian approach/avoid task is the result of a division of labor among an expectancy system, an actor, and a critic. Different groups of layers in our model contribute to these three systems. The expectancy system produces a CS-evoked expectation for future reinforcement. It comprises the OFC, BLA, and ventral (limbic) striatum (VS). The actor system executes approach behavior toward appetitive reinforcers and avoidance behavior away from aversive reinforcers and comprises dorsomedial striatum (DMS) and motor cortices. The midbrain dopamine system and the central nucleus of amygdala (CNA) cause a phasic modulation of striatal dopamine and act as a critic that mediates feedback about the success of behavior (see Figure 1).
As a part of the expectancy system, the BLA has preexisting representations of USs and learns which CSs and USs are associated with each other (LeDoux, 2000; Schoenbaum et al., 1999; Kita et al., 1995; Quirk, Repa, & LeDoux, 1995). It is believed that this learning depends on synaptic plasticity (Fanselow & LeDoux, 1999) and is simulated by incremental updates of synaptic weights of projections from the CS to the BLA layer of the model. As in the Rescorla–Wagner model, learning in the BLA occurs when the US received at the end of a trial does not match the expectation developed at the time of the presentation of the CS (Rescorla & Wagner, 1972). Activation in BLA at CS onset is a combination of the result of this learning process and, to a limited extent, also the result of a top–down bias from the OFC (Corbit, Muir, & Balleine, 2003), which maintains a working memory of the recent reward history (Wallis, 2007; Frank & Claus, 2006).
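The error-driven learning described above can be illustrated with a minimal delta-rule sketch in the spirit of Rescorla and Wagner (1972). This is not the model's actual learning rule: the function name, the weight layout, and the learning rate are assumptions made for illustration, and the top-down bias from the OFC is omitted.

```python
def rescorla_wagner_update(weights, cs, us, learning_rate=0.1):
    """One trial of delta-rule learning for CS -> US weights.

    weights[i][j] is the association from CS i to US j (hypothetical layout).
    cs and us are lists of 0/1 activations for the presented CS and received US.
    """
    n_cs, n_us = len(cs), len(us)
    # The expectation for each US is the weighted sum over the active CSs.
    expected = [sum(cs[i] * weights[i][j] for i in range(n_cs)) for j in range(n_us)]
    # Learning is driven by the mismatch between received and expected US.
    errors = [us[j] - expected[j] for j in range(n_us)]
    new_weights = [
        [weights[i][j] + learning_rate * cs[i] * errors[j] for j in range(n_us)]
        for i in range(n_cs)
    ]
    return new_weights, errors


# Two CSs, two USs (index 0: appetitive, index 1: aversive); CS1 is paired with US 0.
weights = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(50):
    weights, errors = rescorla_wagner_update(weights, cs=[1, 0], us=[1, 0])
```

In this sketch, the weight from CS1 to the appetitive US approaches 1 incrementally while the error shrinks toward 0, which is the slow, weight-based character attributed to the BLA.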
The OFC of the model also has preexisting representations of USs (Ongür & Price, 2000). In contrast to the BLA, however, activation of US representations in the OFC is not a result of CS–US associations by neurons in this area (Holland & Gallagher, 2004). Instead, this area is specialized for representing objects as USs and can also act as “working memory for USs,” including which US was predicted by the BLA at CS onset (Wallis, 2007). It receives this information over ascending projections from the BLA (Corbit et al., 2003). In addition, the OFC of the model can maintain a working memory of which USs have recently been received (Wallis, 2007; Frank & Claus, 2006). That is, if, for example, sucrose is presented to the model at the end of a trial, OFC can maintain this in working memory. Thus, US representations can enter OFC working memory in two ways: (1) recent experience of a US, particularly if unexpected, and (2) expected USs based on BLA-driven inputs. Finally, over the time scale of the tasks modeled, we consider neither passive decay of working memory representations nor the integration with previous reward history used in previous models of OFC function (e.g., Frank & Claus, 2006).
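The two routes into OFC working memory can be sketched as a small gated store. The class and method names are hypothetical, the gate is reduced to a Boolean, and the policy for removing items is deliberately left out because the text does not specify it at this point; the sketch only illustrates gated loading without passive decay.

```python
class OFCWorkingMemory:
    """Toy activation-based store of US representations (no weights, no decay)."""

    def __init__(self):
        self.maintained = []  # USs currently held active

    def gate_in(self, us, gate_open):
        """Load a US only when the gate is open; otherwise keep the current state."""
        if gate_open and us not in self.maintained:
            self.maintained.append(us)
        return list(self.maintained)

    def is_ambiguous(self):
        """More than one maintained US means several outcomes are signaled as possible."""
        return len(self.maintained) > 1


ofc = OFCWorkingMemory()
ofc.gate_in("sucrose", gate_open=True)    # route 2: US predicted by the BLA at CS onset
ofc.gate_in("aversive", gate_open=False)  # gate closed: nothing is loaded
ofc.gate_in("aversive", gate_open=True)   # route 1: US actually received at trial end
```

After these three calls the store holds both the predicted and the received US, a toy analogue of the ambiguous expectancy discussed in the text.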
The OFC working memory mechanisms were developed using the PFC BG Working Memory framework (PBWM; Hazy, Frank, & O'Reilly, 2006; O'Reilly & Frank, 2006). The central tenet of the PBWM model is that the BG provides an adaptive, dynamic gating signal for controlling the active maintenance, updating, and output of information in frontal cortex (O'Reilly, 2006). The BG is interconnected with frontal cortex through a series of parallel loops (Postuma & Dagher, 2006; Middleton & Strick, 2000; Alexander, DeLong, & Strick, 1986). These loops enable the BG to exert a gating-like modulation of representations in frontal areas (see Figure 2). This kind of gating mechanism is consistent with a wide range of empirical data, and similar implementations of dynamic gating were included in previous computational models (e.g., Cisek, 2007; Houk et al., 2007; Humphries, Stewart, & Gurney, 2006; Frank, 2005; Brown, Bullock, & Grossberg, 2004; Gurney, Prescott, & Redgrave, 2001; Berns & Sejnowski, 1998; Mink, 1996; Dominey, Arbib, & Joseph, 1995; Houk, Adams, & Barto, 1995; Houk & Wise, 1995; Wickens, Kotter, & Alexander, 1995).
In our model, the VS layer provides a dynamic gating mechanism for the OFC (Frank & Claus, 2006). The VS learns to update memory in OFC based on excitatory input from the BLA and input from the CS layer (Cardinal, Parkinson, Hall, & Everitt, 2002; Gray, 1999). Learning when to update OFC depends on phasic dopamine release in VS by neurons in ventral tegmental area (VTA)/substantia nigra pars compacta (SNc; see below). Although OFC and BLA both receive sensory information from higher-level sensory areas of the temporal lobe (Ghashghaei & Barbas, 2002), the OFC of the model does not receive direct sensory input from the CS layer. Our model does not include these projections, because we believe that the extremely slow learning rate at OFC synapses precludes the formation of CS–US associations within the timescale of the experiments we simulated (Holland & Gallagher, 2004). That is, if the expectation for a particular reward is activated in OFC when a CS is presented, this is because BLA is sending its US prediction to OFC, but not because OFC itself learned the CS–US association (Schoenbaum, Setlow, Saddoris, & Gallagher, 2003).
With sufficient training, we do believe that OFC can eventually acquire multisensory feature–US associations, so that a US representation in OFC will include information about all the features reliably associated with the core sensory experience (Holland & Gallagher, 2004), which may be the core process underlying stimulus substitution. Thus, the OFC can come to represent all of the multisensory aspects associated with a reward, such as its size, shape, texture, and flavor (Rolls & Grabenhorst, 2008; Schoenbaum & Roesch, 2005), and in our view, when it pairs sensory features with USs, it does so as unitary US representations and not as CS–US pairings per se. We do not address these aspects of OFC function in our simulations.
In our simulations of the function of the OFC, we focus on a lateral region for which there are strong anatomical and functional parallels between rodents and primates (see also Schoenbaum et al., 2009). This region encompasses lateral orbital regions, anterior parts of the agranular insular cortex and the dorsal bank of the rhinal sulcus in rodents. These areas are heavily interconnected with the BLA, VS, mediodorsal thalamus, and sensory cortices. These areas in rodent OFC correspond to Areas 11, 12, and 13 in the primate OFC (Schoenbaum & Roesch, 2005; Ongür, Ferry, & Price, 2003; Ongür & Price, 2000; Preuss, 1995). The role of this area in decision-making is to determine which stimulus outcomes are possible in the current context, but not what is necessary to achieve that outcome. More medial aspects of ventral frontal cortex are involved in learning and representing action–outcome values, including costs (Rushworth, Behrens, Rudebeck, & Walton, 2007). The division of labor among regions of OFC has been discussed elsewhere (Noonan et al., 2010; Hare, O'Doherty, Camerer, Schultz, & Rangel, 2008; Ongür et al., 2003; Ongür & Price, 2000).
The actor system of the model consists of a simulated DMS and motor cortices. It is well accepted that the DMS is involved in the initiation of the motor gating of behavior in motor cortices (Mink, 1996; Wickens, 1993). The DMS is believed to guide goal-directed behavior according to the expectancy information it receives from the BLA and OFC (Pauli, Atallah, & O'Reilly, 2010; Pauli, Hazy, & O'Reilly, 2009; Balleine, Delgado, & Hikosaka, 2007). Lesions of this region have been found to lead to similar reversal deficits as lesions of the OFC itself (Clarke, Robbins, & Roberts, 2008). In our model, OFC and BLA can independently promote a go response (e.g., approach the food well) if they predict an appetitive US (Frank, Seeberger, & O'Reilly, 2004). If OFC and BLA expect an aversive US, they bias the DMS medium spiny neurons of the indirect (no-go) pathway to prevent approaching the food well. The DMS also receives sensory input from the CS layer (McGeorge & Faull, 1989). Because of this connection, the DMS can acquire CS–response associations (Everitt & Robbins, 2005), so that the conditioned response is spared even if both BLA and OFC are lesioned (Stalnaker, Franz, Singh, & Schoenbaum, 2007), although in that case Pavlovian CS–US associations are most likely not acquired.
For the above gating mechanism to work successfully, the striatum has to learn when to update representations in frontal areas. This learning is dopamine-based and allows each striatal projection neuron (medium spiny neuron) to develop its own unique pattern of input weights that determine its actions. Dopamine release in the striatum of our model is determined by projections from the dopaminergic neurons of the SNc/VTA, captured by the PVLV model (primary value, learned value; Hazy, Frank, & O'Reilly, 2010; Hazy et al., 2007; O'Reilly, Frank, Hazy, & Watz, 2007; O'Reilly & Frank, 2006).
It is well established that the midbrain dopamine neurons in the SNc/VTA of the mammalian brain are driven by inputs from the CNA, the lateral hypothalamus, and the patch-like neurons of the VS (Ahn & Phillips, 2003; Floresco, West, Ash, Moore, & Grace, 2003; Fudge & Haber, 2000; Joel & Weiner, 2000; Rouillard & Freeman, 1995; Semba & Fibiger, 1992). The contributions of these inputs are described by the PVLV model as follows (Hazy et al., 2007, 2010; O'Reilly et al., 2007; O'Reilly & Frank, 2006). The lateral hypothalamus delivers primary reward information and contributes to the phasic dopamine release in response to unexpected reward delivery. The patch-like neurons in the VS learn to expect such rewards and thereby block the dopamine spike that would otherwise occur to them. This is the PV system of PVLV. The LV system, involving the CNA, is important for learning reward associations for CSs, which can then drive dopamine firing at the time of CS onset. These two interacting systems provide a good account of the extant neural recording data from the SNc (Schultz, 1998; Schultz, Apicella, & Ljungberg, 1993). In many learning paradigms, the PVLV algorithm can be considered as a biologically informed version of the temporal differences algorithm (Sutton & Barto, 1998; Sutton, 1988), although there are also important differences between these two models in some specific learning paradigms (Hazy et al., 2010).
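The cancellation of the dopamine burst by a learned reward expectation can be sketched as follows. This is a strong simplification of the PV system only, with hypothetical function names and an arbitrary learning rate; the LV system and the details of the PVLV algorithm are not represented.

```python
def pv_dopamine(reward, expectation):
    """Phasic dopamine at US time: primary reward minus the learned expectation."""
    return reward - expectation


def train_pv(n_trials, reward=1.0, learning_rate=0.2):
    """Repeatedly deliver the same reward and let the expectation catch up."""
    expectation = 0.0
    bursts = []
    for _ in range(n_trials):
        da = pv_dopamine(reward, expectation)
        bursts.append(da)
        expectation += learning_rate * da  # the expectation learns to predict the reward
    return bursts, expectation


bursts, expectation = train_pv(30)
# Omitting an expected reward yields a negative value, that is, a dopamine dip.
dip = pv_dopamine(0.0, expectation)
```

In this sketch the first reward produces a full burst, later rewards produce almost none, and omitting the now-expected reward produces a dip.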
The functional contribution of the PVLV system is to provide positive dopamine bursts for successful behavior and CSs associated therewith and negative dopamine dips for unsuccessful behavior and associated CSs. The positive dopamine bursts cause go pathway neurons in the striatum to become more active (because of a preponderance of dopamine D1 receptors, which are excitatory) and no-go pathway neurons to become less active (from D2 receptors, which are inhibitory; Shen, Flajolet, Greengard, & Surmeier, 2008; Frank, 2005; Frank et al., 2004). The opposite case holds for negative dopamine dips. This shapes the gating firing in ways that lead to successful learning of complex working memory tasks in the PBWM model (Hazy et al., 2006, 2007; O'Reilly & Frank, 2006).
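The opposite effects of dopamine on the two striatal pathways can be sketched with two scalar weights. The update rule, the learning rate, and the gating comparison are illustrative assumptions rather than the equations used in PBWM.

```python
def update_go_nogo(go_w, nogo_w, dopamine, learning_rate=0.1):
    """Bursts (dopamine > 0) strengthen go and weaken no-go; dips do the opposite."""
    go_w = go_w + learning_rate * dopamine      # D1-like, excitatory effect
    nogo_w = nogo_w - learning_rate * dopamine  # D2-like, inhibitory effect
    return go_w, nogo_w


def gate_opens(go_w, nogo_w):
    """The gate opens when the go pathway outweighs the no-go pathway."""
    return go_w > nogo_w


go_after_burst, nogo_after_burst = update_go_nogo(0.5, 0.5, dopamine=1.0)
go_after_dip, nogo_after_dip = update_go_nogo(0.5, 0.5, dopamine=-1.0)
```

Starting from balanced weights, a burst tips the balance toward gating and a dip tips it away from gating.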
Because the main focus of our model was on the acquisition and reversal of Pavlovian contingencies, we only simulated the effect of phasic dopamine on plasticity in striatal areas but did not simulate modulations of tonic dopamine levels in the BLA and frontal cortex. Modulation of tonic dopamine levels in BLA is thought to be critical for motivation tone (Niv, Daw, Joel, & Dayan, 2007) and the generalized form of Pavlovian-to-instrumental transfer (Hazy et al., 2010). In PFC, dopamine has been proposed to affect the amount of information held in working memory buffers in PFC networks (Seamans & Yang, 2004).
Separable Functional Roles of BLA and CNA
The amygdala has long been recognized for its critical role in emotional processing (e.g., LeDoux, 2000; Adolphs, Tranel, Damasio, & Damasio, 1995; Quirk et al., 1995). Despite the dominant interest in the role of amygdala in fear and anxiety (Fanselow & Gale, 2003; Fanselow & LeDoux, 1999; Davis, 1992), its role in representing positive affect has started to receive more attention as well (Murray, 2007; Paton, Belova, Morrison, & Salzman, 2006; Gottfried, O'Doherty, & Dolan, 2003). The CNA and BLA of the amygdala have been shown to be highly dissociable across many experimental paradigms and, in our model, make separable contributions to goal-directed behavior. As a key component of the PVLV reinforcement learning system, the CNA (LVe in PVLV) learns to control the release of dopamine in striatal layers at the onset of the CS (Hazy et al., 2007, 2010; O'Reilly et al., 2007; O'Reilly & Frank, 2006). The BLA, on the other hand, does not have direct access to the dopamine cells but does project very densely to the VS and associative striatum (i.e., DMS), which CNA does not. Thus, the BLA is in a position to influence the learning and performance of goal-directed behaviors by signaling its expectancies about particular USs to downstream areas (Hatfield, Han, Conley, & Holland, 1996).
Each experimental trial corresponds to three discrete steps in our simulations. Each simulation step consists of one minus phase and one plus phase (O'Reilly, 1996b). In the first “stimulus sampling” step, the CS is presented to the network until settling finishes. During this step, working memory in OFC can be updated. In the second “response” step, the model decides whether to approach or avoid the food well and receives simulated dopamine feedback for its choice. In the third “feedback” step, USs are presented according to which response the model chose in the “response” step. Working memory in OFC is updated to maintain the history of recent rewards. The distinction between “response” and “feedback” steps in the model is required to accommodate computational constraints associated with the PBWM mechanisms.
The model first had to learn to associate one conditioned stimulus (CS1+) with an appetitive US and another CS (CS2−) with an aversive US. The model was trained until it had correctly performed 95 trials of each type without any errors. After the model had acquired the initial associations, contingencies were reversed so that the first CS was now associated with a negative US (CS1−) and the other with a positive US (CS2+). The model was trained on the reversed contingencies until it had performed 100 trials of each type without an error. The model was run under each lesion condition (BLA only, BLA+OFC lesion, and no lesion) for 50 runs to acquire a good estimate of the average performance.
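One way to read the training criterion is as a count of consecutive error-free trials per trial type. The sketch below makes that reading explicit; the function name is hypothetical, and the assumption that any error resets the streaks of both trial types is ours, since the text does not state how errors affect the count.

```python
def trials_to_criterion(outcomes, trial_types, criterion):
    """Return the 1-based trial number at which the criterion is met, or None.

    outcomes is a sequence of (trial_type, correct) pairs in presentation order.
    Assumption: a single error resets the error-free streak of every trial type.
    """
    streak = {t: 0 for t in trial_types}
    for n, (trial_type, correct) in enumerate(outcomes, start=1):
        if correct:
            streak[trial_type] += 1
        else:
            streak = {t: 0 for t in trial_types}
        if all(count >= criterion for count in streak.values()):
            return n
    return None


outcomes = [
    ("CS1", True), ("CS2", True), ("CS1", False),
    ("CS1", True), ("CS2", True), ("CS1", True), ("CS2", True),
]
reached_at = trials_to_criterion(outcomes, ["CS1", "CS2"], criterion=2)
```

A small criterion is used here only to keep the example short; the same function applies to the criteria reported above.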
No systematic attempt was made to fit the exact quantitative pattern of the rat behavioral data. To capture the effects of lesions on acquisition and reversal performance, we adjusted the following parameters:
We reduced the weight scale between the US and BLA layers so that BLA would learn more slowly which CSs and USs are associated if the OFC was lesioned. With the reduced value, a US representation in the BLA was less active at the time of the US presentation without the additional excitatory input from the OFC.
We increased the amount of Hebbian learning to make sure that the CS–US associations in BLA would strengthen further even when there was no error in the US expectation.
We increased the learning rate of the CS layer to DMS projections so that the model would acquire the initial contingencies at the same rate if it was intact or if BLA and OFC were lesioned simultaneously.
We increased the weight scale of the BLA-to-DMS and OFC-to-DMS projections so that the US expectations of these two areas were able to exert a strong bias on the DMS and overcome net input from the CS layer.
We increased the random go firing in the DMS so that the model would start exploring faster at the beginning of reversal after not receiving reward in either trial type for several trials in a row.
We developed a biologically informed neural network model to investigate the role of the OFC and BLA in Pavlovian acquisition and reversal. To test the contributions of the two areas to the acquisition and reversal of Pavlovian contingencies, we trained the model to associate two CSs with two different USs. The model first had to learn to associate one CS with a positive US and another CS with a negative US. After the model had acquired the initial associations, contingencies were reversed so that the first CS was now associated with a negative US and the other with a positive US.
Reversal Deficit after OFC Lesions
OFC lesions have been repeatedly found to cause learning impairments if contingencies are reversed after acquisition in Pavlovian conditioning studies. As displayed in Figure 3, the model also exhibited a reversal deficit after inactivation of the OFC. Reversal deficits seem to be caused by perseverative encoding of the original Pavlovian CS–US associations in the BLA, which would normally be compensated for by activation-based working memory for recent outcomes in the OFC, as described earlier. Stalnaker et al. (2007) confirmed this idea: simultaneously ablating the OFC and the BLA in rats abolished the reversal deficit. Simultaneous lesions of BLA and OFC in our model also abolished the reversal deficits found after OFC lesions (Figure 3). In the case of simultaneous inactivation of OFC and BLA, phasic dopamine release in response to unexpected delivery of the positive US and phasic reductions of dopamine in response to the delivery of a negative US support the acquisition of CS–response associations in the DMS. With simultaneous OFC and BLA lesions, the model produces approach and avoid behavior without the expectancy for a particular US.
Ambiguous CS-evoked Activity in OFC
How does the OFC support behavioral flexibility? Rolls (1996) originally suggested that the OFC was fast and flexible at encoding CS–US associations and was therefore particularly critical when Pavlovian contingencies changed. Although Rolls (1996) originally attributed this flexibility to rapid weight-based learning, we have reframed this flexibility in terms of activation-based memory, as described earlier. According to either general framework, the OFC provides this updated associative information to other brain areas to guide appropriate behavior. Several single-unit studies provided evidence in support of this hypothesis (Schoenbaum et al., 1999; Rolls, Critchley, & Treves, 1997; Thorpe, Rolls, & Madison, 1983).
However, although the OFC learns to fire selectively in anticipation of a particular US (Schoenbaum, Chiba, & Gallagher, 1998), selective firing of OFC neurons neither develops particularly rapidly in comparison with other brain areas nor is it very pervasive (Paton et al., 2006; Stalnaker, Roesch, Franz, Burke, & Schoenbaum, 2006; Schoenbaum et al., 1999). The OFC and BLA layers in our model exhibit this same behavior. As shown in Figure 4, selective cue-related firing occurred earlier during acquisition and reversal in the BLA than in the OFC. Activation in the OFC layer represents a combination of the current expectation by the BLA and the recent reward history. That is, as long as performance is low and unexpected USs are periodically received, the OFC will maintain both the received US and the (now incorrect) expected US in working memory, signaling that both of these USs are possible in the current context. In contrast to the slow development of selective anticipatory OFC activity, OFC already fires selectively at the moment a US is received early during acquisition and reversal (Figure 5).
OFC Modulates Plasticity in BLA
Acquisition of Pavlovian associations by the BLA has been found to depend on a functioning OFC (Saddoris et al., 2005). In particular, the lateral OFC seems to be more critical for learning than for decision-making directly (Noonan et al., 2010). If we lesioned the OFC in the model, anticipatory firing would only develop slowly, if at all, in the BLA of the model because of a reduced excitatory input to BLA neurons representing the delivered US (Figure 6). On the other hand, the Pavlovian associations in BLA developed more strongly if OFC activity at CS onset was ambiguous or incorrect (Figure 7). If OFC represents an incorrect US outcome expectation or that both US outcomes are possible, when a CS is presented, the top–down bias from OFC onto the BLA will also cause BLA to have this incorrect expectation. When the actual US is presented at the end of the trial, the difference between the US expectation and the actual US will increase the amount of plasticity in BLA. That is, the expectancy error in the OFC activation is not only proportional to the amount of learning in BLA, it actually causes a modulation of plasticity in BLA. This is consistent with the finding that animals are better at adapting to a reversal of Pavlovian contingencies if selective firing in response to a CS in OFC is slow at reversing (Stalnaker et al., 2006).
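The way an incorrect or ambiguous OFC expectancy can enlarge the expectancy error in the BLA can be sketched numerically. The additive top-down gain, its value, and the use of summed absolute error as a stand-in for the amount of plasticity are all assumptions made for illustration; the sketch covers only the CS-onset bias, not the effect of OFC lesions on BLA activity at the time of the US.

```python
def bla_errors(cs_weights, ofc_activity, received_us, top_down_gain=0.3):
    """Expectancy error per US: received US minus the BLA expectation.

    The BLA expectation combines its own CS-driven weights with a top-down
    bias from the US representations currently maintained in the OFC.
    """
    expectation = [w + top_down_gain * o for w, o in zip(cs_weights, ofc_activity)]
    return [r - e for r, e in zip(received_us, expectation)]


def plasticity(errors, learning_rate=0.1):
    """Total weight change, taken here to scale with the summed absolute error."""
    return learning_rate * sum(abs(e) for e in errors)


# Early reversal: CS1 still predicts the appetitive US (index 0) in the BLA,
# but the aversive US (index 1) is delivered.
cs1_weights = [0.8, 0.0]
received = [0, 1]
correct = plasticity(bla_errors(cs1_weights, [0, 1], received))    # OFC holds the new US
ambiguous = plasticity(bla_errors(cs1_weights, [1, 1], received))  # OFC holds both USs
incorrect = plasticity(bla_errors(cs1_weights, [1, 0], received))  # OFC holds the old US
```

Under these assumptions the total weight change is smallest when the OFC expectancy is already correct and larger when it is ambiguous or incorrect, in line with the relationship described above.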
This modulation of plasticity by the expectancy error is similar to the finding that phasic changes in dopamine release, proportional to reward prediction error, modulate plasticity in striatal areas. However, unlike the activity in midbrain dopamine neurons, activity in OFC in response to the delivery of a US is not modulated by how much this US had been expected (Takahashi et al., 2009; Schoenbaum et al., 2003; see Figure 5).
We were able to develop a computational model that captures various findings of studies that looked at electrophysiological changes in the BLA and the OFC and the effects of lesions to either area on the ability to acquire and adapt to the reversal of Pavlovian contingencies. These simulations were based on two simple assumptions. The first assumption was that the BLA learns CS–US associations via purely weight-based learning and, thus, predicts USs on the basis of CS cues; the second was that the OFC acts as “working memory for USs” based on activation-based memory, with US representations loaded into OFC in two ways: (1) when USs occur and (2) when BLA predicts them.
We were able to account for the finding that lesions of neither the OFC nor the BLA, nor simultaneous lesions of both areas, would impair initial performance in this category of tasks (Stalnaker et al., 2007). Furthermore, our simulations captured the empirical finding that lesions of the OFC alone would greatly impair reversal performance whereas simultaneous lesions of both areas would abolish this reversal deficit (Stalnaker et al., 2007). Although neither lesion affected behavior during the acquisition phase, the BLA only acquired the initial Pavlovian contingencies very slowly if the OFC was lesioned in the model. At the beginning of the reversal phase, the BLA then contributed a bias to behavior according to the initial contingencies, which were no longer appropriate. The OFC supported rapid reversal in two different ways. First, it supported rapid reversal of associative weights in the BLA. Second, it maintained recent trial USs in working memory and, therefore, biased responding in DMS against the no-longer-appropriate Pavlovian associations stored in the BLA.
The model could solve Pavlovian acquisition and reversal without any contributions from the expectancy system when OFC and BLA were lesioned simultaneously, which is consistent with empirical data (Stalnaker et al., 2007). With simultaneous lesions, the model solved the task because the DMS would acquire stimulus–response associations through reinforcement learning. In a sense, the model was producing the correct behavior without anticipating a particular US to result from it, that is, without acquiring Pavlovian CS–US associations. Although the expectancy system would normally be involved in this task, this demonstrates that the task can be solved without the acquisition of CS–US associations. That the simultaneous lesions did not affect the speed of acquisition or reversal appears rather perplexing, because it implies that the expectancy system is not really very useful. However, we attribute this to the extremely impoverished environment, in which there are only two things to do (avoid and approach). This is similar to findings that simple instrumental tasks (e.g., T-maze) can be learned without the DMS, because the task is so simple that the dorsal striatum can just acquire S–R associations (Palencia & Ragozzino, 2005; Featherstone & McDonald, 2004). We believe that the expectancy system becomes more critical when there are multiple options to choose from within the same valence category (e.g., R1–sugar, R2–food pellet). Animals cannot learn those tasks without, for example, the DMS, as the actor of the expectancy system (Yin, Ostlund, Knowlton, & Balleine, 2005). More generally, the expectancy system is critical for modulating behavior as a function of changing needs and goals, as explored, for example, in devaluation and related paradigms (Ostlund & Balleine, 2007).
In addition to capturing these behavioral findings, our model also accounted for various electrophysiological findings. Consistent with empirical data, BLA acquired and reversed Pavlovian associations more slowly if the OFC layer was lesioned in the model (Saddoris et al., 2005). Furthermore, if OFC was slow to adapt to the contingency reversal or failed to do so altogether, the BLA would acquire the reversed Pavlovian associations more readily (Stalnaker et al., 2006).
It has previously been proposed that the OFC inhibits inappropriate responses (Elliott et al., 2000; Dias et al., 1996; Damasio, 1994; Mishkin, 1964; Ferrier, 1886). This inhibitory role of the OFC is consistent with deficits in detour reaching tasks (Wallis, Dias, Robbins, & Roberts, 2001) and stop signal tasks (Eagle et al., 2008). However, other studies have also produced results inconsistent with an inhibitory role of OFC. For example, rhesus monkeys with orbito-frontal lesions were still capable of inhibiting a prepotent response to pick up a small reward to receive a larger reward later (Chudasama, Kralik, & Murray, 2007). The contribution of the OFC in our model is also inconsistent with an exclusive role of OFC in response inhibition. At the beginning of reversal, the model continues to approach in response to the presentation of one CS (CS1), because it learned to associate it with the positive US during acquisition. Every time it approaches the food well in response to the CS1, the model receives the aversive US and finally stops responding to CS1. Because it had also previously learned that CS2 is associated with an aversive US, it never approaches the food well in response to CS2 and, in fact, stops behaving completely. Thus, it never gets an opportunity to experience the new contingencies until the model eventually starts to explore again after not receiving any positive US for several trials. As soon as this happens, OFC holds on to the positive US and exerts a bias onto the DMS that makes this approach behavior more likely to be expressed again. Taken together, therefore, we believe that converging evidence supports the “working memory for USs” model of OFC function. Although our simulations focused on the role of lateral OFC in Pavlovian acquisition and reversal, we believe that they provide a more general and comprehensive description of the OFC's role in supporting flexible behavior that goes beyond inhibition of inappropriate behaviors (Schoenbaum et al., 2009).
Predictions from the Model
The model makes several testable predictions:
Stalnaker et al. (2007) found that simultaneous lesions of the BLA and the OFC abolished the reversal deficits associated with OFC lesions alone. If both layers were inactivated in our model, it would solve the task according to stimulus–response associations in the DMS. We predict that, if plasticity is blocked in the DMS throughout the reversal period, animals with simultaneous lesions to OFC and BLA should be significantly impaired because the DMS would be unable to reverse the stimulus–response associations.
The second prediction is based on the fact that the model was able to account for the above-discussed empirical findings without requiring any synaptic plasticity in OFC. Thus, blocking plasticity in OFC at any point during the experiments without blocking neuronal activity, for example, by injecting the selective PKMzeta inhibitor ZIP (see Sacktor, 2011), should not affect the results, in particular, the speed of reversal learning.
If we lesioned the OFC of the model, the BLA would very slowly acquire the initial Pavlovian contingencies and provide an inappropriate response bias onto the DMS at the beginning of the reversal period that would impair the ability to adapt to the changed contingencies. Because lesions to the BLA did not affect the ability of rats and the model to acquire the initial Pavlovian contingencies (because of CNA being intact), blocking plasticity in the BLA should not affect the ability of animals to acquire Pavlovian contingencies either (as already shown for BLA lesions) and may actually facilitate the adaptation to a reversal of Pavlovian contingencies because the BLA would not contribute the now-inappropriate Pavlovian contingencies.
Finally, blocking plasticity in BLA should prevent selective firing in the OFC at the CS onset, because the OFC would then be lacking the CS–US associations normally acquired by the BLA and the OFC does not learn fast enough on its own.
Our simulations suggest a division of labor within an expectancy system between the OFC and the BLA. The BLA acquires Pavlovian associations based on long-term synaptic plasticity. The OFC supports flexibility by maintaining activation-based memories for USs, including the recent reward history. This memory does not require synaptic plasticity. Therefore, the OFC is a source of flexibility, and the BLA is a source of continuity. Ambiguous reward expectancies in OFC at the time of the CS presentation promote behavioral flexibility and synaptic plasticity in the BLA. When contingencies change, OFC supports responses that are no longer inappropriate, which is inconsistent with an exclusive inhibitory role of OFC function.
APPENDIX: IMPLEMENTATIONAL DETAILS
The model was implemented using the Leabra framework, which is described in detail in O'Reilly (2001) and O'Reilly and Munakata (2000) and summarized here. See Table 1 for a listing of parameter values, nearly all of which are at their default settings. These same parameters and equations have been used to simulate over 40 different models in O'Reilly and Munakata (2000) and a number of other research models. Thus, the model can be viewed as an instantiation of a systematic modeling framework using standardized mechanisms instead of constructing new mechanisms for each model. The model can be obtained by emailing the first author at firstname.lastname@example.org.
|to PFC khebb||.001*|
|to PFC ϵ||.001*|
See equations in text for explanations of parameters. All are standard default parameter values except for those with an asterisk. The slower learning rate of PFC connections produced better results and is consistent with a variety of converging evidence, suggesting that PFC learns more slowly than the rest of cortex (Morton & Munakata, 2002).
The pseudocode for Leabra is given here, showing exactly how the pieces of the algorithm described in more detail in the subsequent sections fit together.
Outer loop: Iterate over events (trials) within an epoch. For each event:
  Iterate over minus (−), plus (+), and update (++) phases of settling for each event.
    (a) At start of settling:
      i. For non-PFC/BG units, initialize state variables (activation, v_m, etc.).
      ii. Apply external patterns (clamp input in minus, input and output, external reward based on minus-phase outputs).
    (b) During each cycle of settling, for all nonclamped units:
      ii. For striatum go/no-go units in ++ phase, compute additional excitatory and inhibitory currents based on dopamine inputs from SNc (Equation 20).
      iii. Compute kWTA inhibition for each layer, based on $g_i^{\Theta}$ (Equation 6):
        A. Sort units into two groups based on $g_i^{\Theta}$: top k and remaining k + 1 to n.
        B. If basic, find kth and (k + 1)th highest; if average based, compute average of 1 → k and k + 1 → n.
        C. Set inhibitory conductance $g_i$ from $g_k^{\Theta}$ and $g_{k+1}^{\Theta}$ (Equation 5).
      iv. Compute point neuron activation combining excitatory input and inhibition (Equation 1).
    (c) After settling, for all units:
      i. Record final settling activations by phase ($y_j^-$, $y_j^+$, $y^{++}$).
      ii. At end of + and ++ phases, toggle PFC maintenance currents for stripes with SNr/Thal act > threshold (.1).
  After these phases, update the weights (based on linear current weight values):
    (a) For all non-BG connections, compute error-driven weight changes (Equation 8) with soft weight bounding (Equation 9), Hebbian weight changes from plus-phase activations (Equation 7), and overall net weight change as weighted sum of error-driven and Hebbian (Equation 10).
    (b) For PV units, weight changes are given by delta rule computed as difference between plus phase external reward value and minus phase expected rewards (Equation 11).
    (c) For LV units, only change weights (using Equation 13) if PV expectation > $\theta_{pv}$ or external reward/punishment actually delivered.
    (d) For striatum units, weight change is the delta rule on dopamine-modulated second-plus phase activations minus unmodulated plus phase acts (Equation 19).
    (e) Increment the weights according to net weight change.
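Weight-update steps (b) and (c) above can be illustrated with a minimal sketch. The function names, the learning rate, and the value used for the PV threshold are assumptions made for illustration only; they are not taken from the model's code or parameter table.

```python
from typing import List


def pv_weight_change(reward_plus: float, expected_minus: float,
                     pre_acts: List[float], lrate: float = 0.05) -> List[float]:
    """Delta rule for PV units (step b).

    The error is the plus-phase external reward minus the minus-phase
    expected reward; each weight changes in proportion to that error and
    to its presynaptic activation. `lrate` is a hypothetical value.
    """
    delta = reward_plus - expected_minus
    return [lrate * delta * x for x in pre_acts]


def lv_should_learn(pv_expectation: float, outcome_delivered: bool,
                    theta_pv: float = 0.5) -> bool:
    """Gating condition for LV learning (step c).

    LV weights change only if the PV expectation exceeds the threshold
    or an external reward/punishment was actually delivered.
    `theta_pv` is a hypothetical threshold value.
    """
    return pv_expectation > theta_pv or outcome_delivered
```

With a fully unexpected reward (`expected_minus = 0`), the PV error is maximal and only the weights from active inputs change. The LV gate keeps the LV system from learning on time steps that carry no information about primary outcomes.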
Point Neuron Activation Function
In the basic version of the kWTA function, which is relatively rigid about the kWTA constraint and is therefore used for output layers, $g_k^{\Theta}$ and $g_{k+1}^{\Theta}$ are set to the threshold inhibition value for the kth and (k + 1)th most excited units, respectively. In the average-based kWTA version, $g_k^{\Theta}$ is the average $g_i^{\Theta}$ value for the top k most excited units, and $g_{k+1}^{\Theta}$ is the average of $g_i^{\Theta}$ for the remaining n − k units. This version allows for more flexibility in the actual number of active units, depending on the nature of the activation distribution in the layer.
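The two kWTA variants can be sketched as follows. The sketch assumes the standard Leabra placement of the layer's inhibitory conductance a fraction q of the way between the two threshold values, g_i = g_{k+1}^Θ + q (g_k^Θ − g_{k+1}^Θ); that formula and the function name `kwta_inhibition` are assumptions here, because the corresponding equation is not reproduced in this section.

```python
from typing import Sequence


def kwta_inhibition(g_theta: Sequence[float], k: int, q: float,
                    average_based: bool = False) -> float:
    """Return a layer-wide inhibitory conductance under kWTA.

    g_theta: per-unit threshold inhibition values (one per unit).
    k: number of units that should remain active (0 < k < n).
    q: placement of inhibition between the two reference values.
    average_based: if True, use the means of the top k and the
        remaining n - k units instead of the kth and (k + 1)th values.
    """
    ranked = sorted(g_theta, reverse=True)
    n = len(ranked)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of units")
    if average_based:
        g_k = sum(ranked[:k]) / k
        g_k1 = sum(ranked[k:]) / (n - k)
    else:
        g_k = ranked[k - 1]   # kth most excited unit
        g_k1 = ranked[k]      # (k + 1)th most excited unit
    # Assumed standard form: inhibition lies between the two references.
    return g_k1 + q * (g_k - g_k1)
```

In the basic variant exactly k units sit above the resulting inhibition whenever the kth and (k + 1)th values differ, whereas the average-based variant lets the number of active units vary with the shape of the distribution.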
Hebbian and Error-driven Learning
For learning, Leabra uses a combination of error-driven and Hebbian learning. The error-driven component is the symmetric midpoint version of the GeneRec algorithm (O'Reilly, 1996a), which is functionally equivalent to the deterministic Boltzmann machine and contrastive Hebbian learning. The network settles in two phases, an expectation (minus) phase where the network's actual output is produced and an outcome (plus) phase where the target output is experienced, and then computes a simple difference of a pre- and postsynaptic activation product across these two phases. For Hebbian learning, Leabra uses essentially the same learning rule used in competitive learning or mixtures of Gaussians, which can be seen as a variant of the Oja normalization (Oja, 1983). The error-driven and Hebbian learning components are combined additively at each connection to produce a net weight change.
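As an illustration of how the two learning components combine at a single connection, the sketch below assumes the commonly published Leabra forms: a contrastive Hebbian error term, soft weight bounding of that term, a Hebbian term computed from plus-phase activations, and a mixture weighted by k_hebb. The function name and the default parameter values are assumptions for illustration, not the settings used in the model.

```python
def net_weight_change(x_minus: float, y_minus: float,
                      x_plus: float, y_plus: float,
                      w: float, lrate: float = 0.01,
                      k_hebb: float = 0.01) -> float:
    """Net weight change for one connection (sender x, receiver y).

    Assumed forms:
      error-driven (CHL):  err  = x+ * y+  -  x- * y-
      soft bounding:       increases scaled by (1 - w), decreases by w
      Hebbian:             hebb = y+ * (x+ - w)
      net:                 lrate * (k_hebb * hebb + (1 - k_hebb) * err_sb)
    """
    err = x_plus * y_plus - x_minus * y_minus
    # Soft weight bounding keeps weights within the 0-1 range.
    err_sb = err * (1.0 - w) if err > 0 else err * w
    hebb = y_plus * (x_plus - w)
    return lrate * (k_hebb * hebb + (1.0 - k_hebb) * err_sb)
```

Setting `k_hebb` to 0 yields purely error-driven learning and setting it to 1 yields purely Hebbian learning; intermediate values blend the two additively, as described above.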
See O'Reilly et al. (2007) for further details on the PVLV system. We assume that time is discretized into steps that correspond to environmental events (e.g., the presentation of a CS or US). All of the following equations operate on variables that are a function of the current time step t; we omit the t in the notation because it would be redundant. PVLV is composed of two systems, PV and LV, each of which in turn is composed of two subsystems (excitatory and inhibitory). Thus, there are four main value representation layers in PVLV (PVe, PVi, LVe, LVi), which then drive the dopamine layers (VTA/SNc).
The PVLV value layers use standard Leabra activation and kWTA dynamics, as described above, with the following modifications. They have a three-unit distributed representation of the scalar values they encode, where the units have preferred values of (0, .5, 1). The overall value represented by the layer is the weighted average of the unit's activation times its preferred value, and this decoded average is displayed visually in the first unit in the layer. The activation function of these units is a “noisy” linear function (i.e., without the x/(x + 1) nonlinearity to produce a linear value representation, but still convolved with Gaussian noise to soften the threshold, as for the standard units, Equation 4), with gain γ = 220, noise variance σ = .01, and a lower threshold Θ = .17. The k for kWTA (average based) is 1, and the q value is .9 (instead of the default of .6). These values were obtained by optimizing the match for value represented with varying frequencies of 0–1 reinforcement (e.g., the value should be close to .4 when the layer is trained with 40% of 1 values and 60% of 0 values). Note that having different units for different values, instead of the typical use of a single unit with linear activations, allows much more complex mappings to be learned. For example, units representing high values can have completely different patterns of weights than those encoding low values, whereas a single unit is constrained by virtue of having one set of weights to have a monotonic mapping onto scalar values.
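The decoding of a scalar value from the three-unit representation can be sketched as below. Normalizing by the summed activation and returning 0 for a silent layer are assumptions of this sketch; the preferred values (0, .5, 1) are those given in the text.

```python
from typing import Sequence

PREFERRED_VALUES = (0.0, 0.5, 1.0)  # preferred values stated in the text


def decode_value(acts: Sequence[float],
                 preferred: Sequence[float] = PREFERRED_VALUES) -> float:
    """Decode the scalar value represented by a value layer.

    The result is the activation-weighted average of the units'
    preferred values. A silent layer decodes to 0.0 (an assumption).
    """
    total = sum(acts)
    if total == 0:
        return 0.0
    return sum(a * p for a, p in zip(acts, preferred)) / total
```

For example, a layer in which the 0-preferring unit carries 60% of the total activation and the 1-preferring unit carries 40% decodes to approximately .4, matching the 40% reinforcement case mentioned above.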
Special Basal Ganglia Mechanisms
Striatal Learning Function
SNr and Thalamus Units
Random Go Firing
When a random go fires, we set the SNrThal unit activation to be above go threshold, and we apply a positive dopamine signal to the corresponding striatal stripe so that it has an opportunity to learn to fire for this input pattern on its own in the future.
PFC active maintenance is supported in part by excitatory ionic conductances that are toggled by go firing from the SNrThal layers. This is implemented with an extra excitatory ion channel in the basic Vm update equation (Equation 1). This channel has a conductance value of .5 when active. See Frank, Loughry, and O'Reilly (2001) for further discussion of this kind of maintenance mechanism, which has been proposed by several researchers, for example, Durstewitz, Seamans, and Sejnowski (2000), Gorelova and Yang (2000), Lewis and O'Donnell (2000), Dilmore, Gutkin, and Ermentrout (1999), Lisman, Fellous, and Wang (1999), and Wang (1999). The first opportunity to toggle PFC maintenance occurs at the end of the first plus phase and then again at the end of the second plus phase (third phase of settling). Thus, a complete update can be triggered by two gos in a row, and it is almost always the case that if a go fires the first time, it will also fire the next time, because striatum firing is primarily driven by sensory inputs, which remain constant.
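The toggling logic can be sketched per stripe as follows. The threshold (.1) and the conductance (.5) are the values stated in the text; the function names and the list-of-booleans representation of stripe state are assumptions for illustration.

```python
from typing import List, Sequence

GO_THRESHOLD = 0.1       # SNrThal activation threshold stated in the text
MAINT_CONDUCTANCE = 0.5  # maintenance channel conductance stated in the text


def toggle_maintenance(maintaining: Sequence[bool],
                       snr_thal_acts: Sequence[float],
                       threshold: float = GO_THRESHOLD) -> List[bool]:
    """Flip the maintenance state of every stripe whose SNrThal unit fired go."""
    return [(not m) if act > threshold else m
            for m, act in zip(maintaining, snr_thal_acts)]


def maintenance_conductance(maintaining: Sequence[bool]) -> List[float]:
    """Conductance of the extra excitatory channel for each stripe."""
    return [MAINT_CONDUCTANCE if m else 0.0 for m in maintaining]


# Two gos in a row on stripe 0: the first (end of plus phase) clears the
# old maintenance, the second (end of ++ phase) starts maintaining anew.
state = [True, False]
state = toggle_maintenance(state, [0.8, 0.0])
state = toggle_maintenance(state, [0.8, 0.0])
```

This makes explicit why a complete update takes two consecutive go signals: a single toggle only switches the existing maintenance current off.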
Assessment of Selective Firing in BLA and OFC
We assessed whether a neuron in OFC or BLA that represented a particular US was more active after the presentation of the associated CS, relative to the presentation of the other CS.
The selectivity of firing in layer w equals the sum over the units k in layer w of the difference of activations x of a unit representing the US k in response to the associated CS i minus activation in response to nonassociated stimulus j, divided by the number of neurons $n_w$ in this layer: $\mathrm{sel}_w = \frac{1}{n_w} \sum_k (x_{ki} - x_{kj})$.
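A minimal sketch of this selectivity measure is given below. The function name and the use of plain lists of unit activations are assumptions for illustration.

```python
from typing import Sequence


def selectivity(acts_associated: Sequence[float],
                acts_nonassociated: Sequence[float]) -> float:
    """Mean difference in activation across the units of one layer.

    acts_associated[k]: activation of unit k to its associated CS (x_ki).
    acts_nonassociated[k]: activation of unit k to the other CS (x_kj).
    """
    if len(acts_associated) != len(acts_nonassociated):
        raise ValueError("both activation vectors must cover the same units")
    n_w = len(acts_associated)
    if n_w == 0:
        raise ValueError("layer must contain at least one unit")
    return sum(a - b for a, b in zip(acts_associated, acts_nonassociated)) / n_w
```

Positive values indicate that units fire more to their associated CS, values near zero indicate nonselective firing, and negative values indicate that firing still follows the other CS.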
This study was supported by ONR Grants N00014-07-1-0651 and N00014-03-1-0428 and NIH Grants MH069597 and MH079485.
Reprint requests should be sent to Wolfgang M. Pauli, Department of Psychology, University of Colorado at Boulder, 345 UCB, Boulder, CO 80309, or via e-mail: email@example.com.