Reinforcement learning models have proven highly effective for understanding learning in both artificial and biological systems. However, these models have difficulty in scaling up to the complexity of real-life environments. One solution is to incorporate the hierarchical structure of behavior. In hierarchical reinforcement learning, primitive actions are chunked together into more temporally abstract actions, called “options,” that are reinforced by attaining a subgoal. These subgoals are capable of generating pseudoreward prediction errors, which are distinct from reward prediction errors that are associated with the final goal of the behavior. Studies in humans have shown that pseudoreward prediction errors positively correlate with activation of ACC. To determine how pseudoreward prediction errors are encoded at the single neuron level, we trained two animals to perform a primate version of the task used to generate these errors in humans. We recorded the electrical activity of neurons in ACC during performance of this task, as well as neurons in lateral prefrontal cortex and OFC. We found that the firing rate of a small population of neurons encoded pseudoreward prediction errors, and these neurons were restricted to ACC. Our results provide support for the idea that ACC may play an important role in encoding subgoals and pseudoreward prediction errors to support hierarchical reinforcement learning. One caveat is that neurons encoding pseudoreward prediction errors were relatively few in number, especially in comparison to neurons that encoded information about the main goal of the task.
Reinforcement learning (RL) is one of the most influential learning models to date and has had a dramatic impact on both artificial intelligence (Mnih et al., 2015) and our understanding of neural computation (Schultz, Dayan, & Montague, 1997). RL uses discrepancies between expected and actual reward outcomes to drive learning (Sutton & Barto, 1998). This estimation, known as a reward prediction error (RPE), is encoded by midbrain dopamine neurons (Hollerman & Schultz, 1998; Schultz et al., 1997) and is thought to underlie how animals and humans learn behaviors necessary to acquire rewards from the environment (Lee, Seo, & Jung, 2012; Dayan & Niv, 2008). RPE-related neural signals are also found in lateral prefrontal cortex (LPFC; Asaad & Eskandar, 2011) and ACC (Kennerley, Behrens, & Wallis, 2011). However, RL suffers from a problem of scaling (Botvinick, Niv, & Barto, 2009). Although it performs well in relatively constrained learning environments, when the number of environmental states and actions increases, the amount of sampling required by the agent—hence the amount of training time needed to acquire a behavior—scales as a positively accelerating function. Thus, an environment can quickly become too complex for RL to be a feasible learning solution.
Computational theoretical studies have proposed modifications to the conventional RL models to allow them to accommodate the more complex hierarchical behavioral structure that is typical of the real world (Sutton, Precup, & Singh, 1999). Instead of reinforcing individual actions, hierarchical RL (HRL) allows the chunking of actions into more temporally abstract behaviors, referred to as “options.” Each option terminates when a particular subgoal is attained, which generates an option-specific prediction error, referred to as a pseudoreward prediction error (PPE). For example, when making a cup of coffee, one option might be adding milk, but individual actions that contribute to that option (e.g., getting the milk out of the fridge, opening the milk carton) would contribute solely to the PPE rather than the RPE generated by drinking the coffee.
The notion that complex behavior is organized hierarchically also has a long history in neuroscience. Hughlings Jackson, for example, emphasized the notion that the frontal lobe represented behaviors in a hierarchical manner (Phillips, 1973). Neuroimaging and neuropsychology studies have shown that progressively more complex behaviors are controlled by progressively more anterior regions of prefrontal cortex (Badre, Hoffman, Cooney, & D'Esposito, 2009; Badre & D'Esposito, 2007; Koechlin, Ody, & Kouneiher, 2003). Recent efforts have focused on determining the neural substrates of the algorithmic processes derived from computational theories of HRL (Holroyd & McClure, 2015; Badre & Frank, 2012; Frank & Badre, 2012; Ribas-Fernandes et al., 2011). However, to date there has been little attempt to study HRL at the level of individual neurons, which could provide insights into the specific computations performed by prefrontal neurons that support HRL. Therefore, we trained two monkeys to perform a primate version of a task that has been used in humans to study HRL (Ribas-Fernandes et al., 2011). The task required performing a sequence of lever movements to move a stimulus from a start position to a goal position, by way of an intermediate subgoal position. On a fraction of trials, the position of the subgoal changed, thereby generating a PPE. In the human version of the task, the BOLD response in ACC positively correlated with the magnitude of the PPE. To examine whether this information was encoded at the level of single neurons, we recorded the electrical activity of single neurons in LPFC, ACC, and OFC while animals performed the HRL task.
Subjects and Behavioral Task
Two male rhesus monkeys (Macaca mulatta) served as subjects (Q and R). Subjects were 5 and 6 years old and weighed approximately 7 and 9 kg at the time of recording. We regulated the daily fluid intake of our subjects to maintain motivation on the task. Subjects sat in a primate chair and viewed a computer screen. We used the MonkeyLogic system (Asaad & Eskandar, 2008) to control the presentation of the stimuli and the task contingencies. Eye movements were tracked with an infrared system (ISCAN). All procedures were in accord with the National Institutes of Health guidelines and the recommendations of the University of California at Berkeley Animal Care and Use Committee.
Our behavioral task has previously been used to measure PPEs in humans (Ribas-Fernandes et al., 2011). The delivery task requires subjects to take the perspective of a delivery driver that has to choose between two jobs involving picking up a package (the subgoal) and delivering it to a customer (goal). After the subject selects one of the jobs, the position of the package sometimes changes, which generates a PPE. We trained two animals to perform a version of this task (Figure 1A). Subjects were required to fixate a central cue to initiate a trial, after which two stimulus configurations appeared on the left and right of the screen. Each configuration consisted of three colored dots, which represented the start position (green), subgoal position (white), and goal position (blue). Subjects selected one of the configurations with a joystick movement. Once one of the two configurations was chosen, the other one disappeared and the subject had to make a series of joystick movements back-and-forth between the center location and the chosen side to move the green dot step-by-step from the start position to the goal position via the subgoal position. Each movement outwards caused the cursor to disappear, and then the movement back to the center caused the cursor to reappear 1° of visual angle closer to the subgoal or goal. The animal was allowed to make these movements as quickly as they desired. A juice reward was delivered once the green dot reached the goal position. The optimal choice was to select the shortest route, because this would lead to reward more quickly and with less physical effort.
The start and goal positions in each original configuration were placed on the circumference of a circle 8° of visual angle in diameter. This circle was not visible to the animal. We manipulated two variables in each configuration: total steps (TS), the number of steps from the start position to the goal via the subgoal, and subgoal steps (SG), the number of steps from the start position to the subgoal. We also calculated the straight line distance (SD), which is the degrees of visual angle in a straight line from the start position to the goal.
Once the animals had been trained on the choice task, we implanted the neurophysiological recording equipment and recorded neural activity. During recording sessions, only 10% of the trials were choice trials. The other 90% of trials, which we collectively refer to as “jump” trials, began with the presentation of a single stimulus configuration for 500 msec in the center of the screen (prejump configuration), followed by a second configuration (postjump configuration) for 500 msec. On 56% of the jump trials, the postjump configuration contained no new information, either because it was identical to the prejump configuration or because the subgoal changed position but remained the same distance from the start and goal positions (Figure 1B, “mirror” condition). On the other 44% of the jump trials, the postjump configuration generated a PPE (because the distance from the start position to the subgoal changed) and/or an RPE (because the total number of steps to the goal changed). The sign of these errors was the inverse of the change in the number of steps, because fewer steps meant the animal would attain the reward with less effort. In other words, moving goals or subgoals closer would generate positive prediction errors, whereas moving them further away would generate negative prediction errors. The fixation cue then changed color, indicating to the subject whether they should make rightward or leftward joystick movements to move the green dot to the goal position. Table 1 describes the different combinations of experimental conditions, and Figure 1B illustrates example configurations.
|9||↓||↓||−||+||+||RPEp and PPEp|
|10||↑||↑||−||−||−||RPEn and PPEn|
|11||↓||↑||−||+||−||RPEp and PPEn|
|12||↓||↓||−||+||+||RPEp and PPEp|
|13||↑||↑||−||−||−||RPEn and PPEn|
|14||↑||↓||−||−||+||RPEn and PPEp|
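The sign convention for the two prediction errors can be illustrated with a minimal sketch. The function names and the step counts used below are hypothetical; the sketch only recovers the sign of each error from the change in total steps (TS) and subgoal steps (SG), and does not apply the behaviorally fitted weights used in the neural analysis.

```python
def prediction_error_signs(pre_ts, pre_sg, post_ts, post_sg):
    """Return (rpe_sign, ppe_sign), each +1, 0, or -1, for one jump trial.

    Illustrative helper, not the authors' code. Fewer steps after the
    jump gives a positive error; more steps gives a negative error.
    """
    def sign_of_decrease(before, after):
        return (before > after) - (before < after)

    rpe_sign = sign_of_decrease(pre_ts, post_ts)  # change in total steps
    ppe_sign = sign_of_decrease(pre_sg, post_sg)  # change in subgoal steps
    return rpe_sign, ppe_sign


def condition_label(rpe_sign, ppe_sign):
    """Build a label in the style of the last column of Table 1."""
    suffix = {1: "p", -1: "n"}
    parts = []
    if rpe_sign:
        parts.append("RPE" + suffix[rpe_sign])
    if ppe_sign:
        parts.append("PPE" + suffix[ppe_sign])
    return " and ".join(parts) or "no prediction error"
```

For example, a jump that shortens the total route while moving the subgoal further from the start yields a positive RPE together with a negative PPE.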
Our methods for neurophysiological recording have been reported in detail previously (Lara, Kennerley, & Wallis, 2009). Briefly, we implanted both subjects with a titanium head positioner for restraint and one recording chamber over each hemisphere, the position of which was determined using a 1.5-T MRI scanner. One recording chamber was positioned at an angle to allow access to LPFC and ACC, and the other was a vertical chamber to allow access to OFC. We recorded simultaneously from LPFC, ACC, and OFC using arrays of 6–14 tungsten microelectrodes (FHC Instruments). We determined the approximate distance to lower the electrodes from the MRI scans and advanced the electrodes using custom-built, manual microdrives until they were located just above the cell layer. We then slowly lowered the electrodes into the cell layer until we obtained neuronal waveforms, which were digitized and analyzed offline (Plexon Instruments). We randomly sampled neurons; we did not attempt to select neurons based on responsiveness. This procedure aimed to reduce any bias in our estimate of neuronal activity, thereby allowing a fairer comparison of neuronal properties between the different brain regions. We reconstructed our recording locations by measuring the position of the recording chambers using stereotactic methods. We plotted the positions onto the MRI sections using commercial graphics software (Adobe Illustrator). We confirmed the correspondence between the MRI sections and our recording chambers by mapping the position of sulci and gray and white matter boundaries using neurophysiological recordings. We traced and measured the distance of each recording location along the cortical surface from the lip of the ventral bank of the principal sulcus. We also measured the positions of the other sulci in this way, allowing the construction of unfolded cortical maps.
Behavioral Data Analysis
We estimated the weight of each parameter in the model by finding the values that minimized the negative log-likelihood of the model (equivalently, maximized its likelihood). To fit the weights (w1 to w5), we used maximum likelihood fitting (“fmincon” function in MATLAB) to find the set of parameters that best predicted the experimental data. To obtain fitted weights, we ran the maximum likelihood fitting function 100 times for each of 10 different randomly determined initial weights and then calculated the mean of the fitted weights. This helps to avoid accepting weights that reflect a local minimum in the fitting function. We compared models using Akaike's information criterion (AIC; Akaike, 1974).
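The fitting procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: the choice model is a hypothetical logistic rule in which the value of a configuration is the negative weighted sum of its features (e.g., TS, SG, SD), plain gradient descent stands in for MATLAB's fmincon, only three weights are fitted, and the sketch keeps the best of the random starts rather than averaging the fitted weights. The AIC formula itself is standard.

```python
import math
import random


def _p_choose_a(weights, feat_a, feat_b):
    # Hypothetical value model: value = -(w . features), so the value
    # difference V_a - V_b equals w . (feat_b - feat_a).
    diff = sum(w * (b - a) for w, a, b in zip(weights, feat_a, feat_b))
    diff = max(-30.0, min(30.0, diff))  # keep exp() and log() well behaved
    return 1.0 / (1.0 + math.exp(-diff))


def neg_log_likelihood(weights, trials):
    """trials: list of (feat_a, feat_b, chose_a) with chose_a in {0, 1}."""
    nll = 0.0
    for feat_a, feat_b, chose_a in trials:
        p_a = _p_choose_a(weights, feat_a, feat_b)
        nll -= math.log(p_a if chose_a else 1.0 - p_a)
    return nll


def fit_once(trials, start, lr=0.1, n_iter=2000):
    """Gradient descent from one starting point; returns (weights, nll)."""
    w = list(start)
    best_w, best_nll = list(w), neg_log_likelihood(w, trials)
    for _ in range(n_iter):
        grad = [0.0] * len(w)
        for feat_a, feat_b, chose_a in trials:
            p_a = _p_choose_a(w, feat_a, feat_b)
            for i, (a, b) in enumerate(zip(feat_a, feat_b)):
                grad[i] += (p_a - chose_a) * (b - a)
        w = [wi - lr * gi / len(trials) for wi, gi in zip(w, grad)]
        nll = neg_log_likelihood(w, trials)
        if nll < best_nll:
            best_w, best_nll = list(w), nll
    return best_w, best_nll


def fit_multistart(trials, n_weights=3, n_starts=10, seed=0):
    """Refit from several random initial weights and keep the best fit."""
    rng = random.Random(seed)
    fits = []
    for _ in range(n_starts):
        start = [rng.uniform(0.0, 2.0) for _ in range(n_weights)]
        fits.append(fit_once(trials, start))
    return min(fits, key=lambda fit: fit[1])


def aic(nll, n_params):
    """Akaike's information criterion: 2k - 2 ln(L) = 2k + 2 * NLL."""
    return 2 * n_params + 2 * nll
```

Models with different parameter subsets can then be compared by fitting each one and preferring the model with the lowest AIC.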
Our other behavioral measure was the lever movement time, which we defined as the time taken to move the joystick from the center position to the chosen side and then back again following the movement of the green dot.
Neural Data Analysis
All data for the neural analysis were from the jump trials. We visualized single neuron activity by constructing spike density histograms. We calculated the mean firing rate of the neuron across the appropriate experimental conditions using a sliding window of 100 msec. We then analyzed neuronal activity in two predefined epochs of 50–500 msec each, corresponding to the presentation of pre- and postjump configurations. For each neuron, we calculated its mean firing rate on each trial during each epoch. To determine whether a neuron encoded an experimental factor, we used linear regressions to quantify how well the experimental manipulation predicted the neuron's firing rate. Before conducting the regression, we standardized our dependent variable (i.e., firing rate) by subtracting the mean of the dependent variable from each data point and dividing each data point by the standard deviation of the distribution. The standardization of firing rate was performed across all trials, pooling across conditions. We evaluated the significance of selectivity at the single neuron level using an alpha level of p < .05.
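The standardization step can be illustrated with a short sketch. The function names are hypothetical, the use of the sample standard deviation is an assumption (the text does not specify sample versus population), and a single-predictor least-squares slope stands in for the multiple regressions used in the study.

```python
import statistics


def standardize(rates):
    """Z-score firing rates across all trials, pooling across conditions."""
    mean = statistics.fmean(rates)
    sd = statistics.stdev(rates)  # assumption: sample standard deviation
    return [(r - mean) / sd for r in rates]


def regression_slope(predictor, response):
    """Ordinary least-squares slope for a single predictor (illustrative)."""
    mean_x = statistics.fmean(predictor)
    mean_y = statistics.fmean(response)
    sxx = sum((x - mean_x) ** 2 for x in predictor)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(predictor, response))
    return sxy / sxx
```

Because the firing rates are standardized before the regression, the resulting slope is expressed in standard deviations of firing rate per unit of the predictor, which makes coefficients comparable across neurons with different baseline rates.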
SV and LR are defined as for Equation 4. Another four predictors represented positive or negative RPEs or PPEs. We defined PPE as the difference between the original position of the subgoal and its position following the jump. Thus, we calculated this difference using the weighting that was ascribed to this parameter from the animal's initial choice behavior (Equations 1 and 2). Selective neurons were then defined in the same way as for Equation 4.
To examine the time course of the contribution of each predictor, we performed a “sliding” regression analysis to calculate the coefficient of partial determination (CPD) at each time point for each neuron. We fit each regression model (Equation 4 for the prejump configuration and Equation 5 for the postjump configuration) to neuronal firing for overlapping 200-msec windows, beginning with the 200 msec immediately before the task epoch and then shifting the window in 10-msec steps until we reached the end of the task epoch. The sliding regression analysis requires a correction for multiple comparisons, because it involves performing a statistical test at each time point. We derived this correction by estimating the false alarm rate: we applied the same statistical criterion to an equivalent analysis of shuffled neural data, in which significant parameters can only reflect noise. For the shuffle, we preserved the firing patterns of individual neurons on individual trials but shuffled the experimental conditions. This analysis showed that a criterion of three consecutive time bins in which the regression parameter was significant at p < .005 yielded a false alarm rate of less than 5%.
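The window layout and the consecutive-bin criterion can be sketched as follows. All function names are hypothetical, and the per-window p values are taken as given; the sketch does not perform the regressions themselves.

```python
import random


def window_starts(epoch_start_ms, epoch_end_ms, width_ms=200, step_ms=10):
    """Start times of overlapping windows.

    The first window covers the width_ms immediately before the epoch;
    the last window ends at the end of the epoch.
    """
    first = epoch_start_ms - width_ms
    last = epoch_end_ms - width_ms
    return list(range(first, last + 1, step_ms))


def first_significant_bin(p_values, alpha=0.005, n_consecutive=3):
    """Index of the first bin that begins a run of n_consecutive bins
    with p < alpha, or None if no such run exists."""
    run = 0
    for i, p in enumerate(p_values):
        run = run + 1 if p < alpha else 0
        if run == n_consecutive:
            return i - n_consecutive + 1
    return None


def shuffle_conditions(conditions, rng):
    """Permute condition labels across trials, leaving spike data intact."""
    shuffled = list(conditions)
    rng.shuffle(shuffled)
    return shuffled


def false_alarm_rate(shuffled_p_series):
    """Fraction of shuffled p-value series that still meet the criterion."""
    hits = sum(first_significant_bin(series) is not None
               for series in shuffled_p_series)
    return hits / len(shuffled_p_series)
```

Under these assumptions, a 500-msec epoch starting at time 0 gives windows starting at −200, −190, ..., 300 msec, and the onset of selectivity for a neuron can be read off as the start of the first qualifying run of bins.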
Behavioral Task Performance
To examine the influence of the stimulus configurations, we performed a model comparison, as described in detail in the Methods. The full model included parameters for TS, SG, and SD (w1, w2, and w3). Against this model, we compared other models in which we tested subsets of these parameters. In addition, we evaluated whether choice behavior relied on linear or logarithmic estimates of distances. In both animals, the full model was clearly favored, although Subject R favored logarithmic estimates of distance whereas Subject Q favored linear estimates (Table 2 and Figure 2A). For Subject R, w1 = 3.2, w2 = 0.5, and w3 = 1.4, whereas for Subject Q, w1 = 0.5, w2 = 0.4, and w3 = 1.1. Thus, for both subjects, the subgoal position had the smallest effect on choice behavior, although Subject R based his choices more on TS whereas Subject Q relied more on SD. These two variables were positively correlated (correlation coefficient = .91), which likely accounted for why either variable could be used to solve the task. Overall, our models provided an excellent fit to choice behavior (Figure 2B), explaining 93% of the variance in Subject R's choice behavior and 90% of the variance in Subject Q's.
|Subject||Distance Estimate||Parameters Included in the Model|
Although the subgoal position only had a small effect on choice behavior, the model in which it was included clearly performed better than the model in which it was omitted in both animals. This indicated that the animals were not simply ignoring the subgoal. Further evidence was apparent in the lever movement times. Both animals showed a tendency to slow down as they approached both the subgoal and the goal and to speed up again once the subgoal had been acquired (Figure 3A). This was evident when we looked at the change in movement time from one step to the next (Figure 3B). We found that subjects slowed down (positive values on the y axis) on approaching the subgoal and sped up (negative values on the y axis) immediately after its attainment (one-way ANOVA, F(5, 227) = 17.62, p < 1 × 10−13 for Subject R; F(5, 179) = 22.85, p < 1 × 10−16 for Subject Q). In other words, subjects did pay attention to the subgoal position in the series of lever movements.
We recorded the activity of 308 neurons from LPFC (Subject R 132; Subject Q 176), 249 neurons from OFC (R 130; Q 119), and 212 neurons from ACC (R 106; Q 106). Recording locations are illustrated in Figure 4. We collected the data across 38 recording sessions for Subject R and 30 sessions for Subject Q. To obtain sufficient statistical power, the neurons from the two subjects were pooled. For all significant results, there were no qualitative differences between the two subjects (i.e., the effects were in the same direction), unless otherwise noted.
During the presentation of the prejump configuration, the most prevalent encoding was the value of the configuration, and this was more prevalent in ACC relative to the other two areas (Figure 5). Figure 6A illustrates example neurons that encoded selectively the SV of the prejump configurations. Fewer neurons encoded sensory information about the stimulus configuration, that is, LR, which is whether the start position was to the right or left of fixation. Because sensory encoding is not the focus of this report, we will not discuss LR encoding further. The time course of SV selectivity across the neural population is illustrated in Figure 7. There was no difference between the areas with respect to the onset of SV selectivity (median LPFC = 171 msec, median OFC = 166 msec, median ACC = 151 msec, one-way ANOVA, F(2, 122) = 0.45, p > .05).
During the postjump configuration, we also found that the most prevalent encoding was the encoding of the configuration's value (Figures 5 and 6B). Note that some neurons encoded SV before the onset of the postjump configuration, which indicates that they also encoded the SV of the prejump configuration. Some neurons also encoded RPE, and these neurons were most prevalent in ACC. In our task, rewards were fixed and delivered with certain probability, and so the RPE reflected changes in the amount of work that the animal needed to do, because the goal had moved either closer (positive RPE) or further (negative RPE) from the start position. In contrast, very few neurons encoded PPE, although the prevalence of such neurons did exceed chance in ACC (20/212 or 9.4%, binomial test, p < .01). However, the weak encoding of PPE relative to the other variables is evident in the population plots shown in Figure 8, where the robust encoding of SV contrasts with the weaker encoding of RPE and the virtually absent encoding of PPE.
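The binomial test on the prevalence of PPE-encoding neurons can be illustrated with a short sketch. It assumes a one-sided test against a chance rate of 5%, matching the single-neuron alpha level; the text does not state the exact form of the test, so both choices are assumptions, and the function name is hypothetical.

```python
import math


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly."""
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i)
               for i in range(k, n + 1))


# 20 of 212 ACC neurons were PPE selective; assumed chance rate is .05.
p_value = binomial_upper_tail(20, 212, 0.05)
```

With about 10.6 selective neurons expected by chance, observing 20 lies well into the upper tail, and the resulting probability should fall below .01, in line with the reported result.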
We developed a primate version of a task that has been used to study HRL in humans (Ribas-Fernandes et al., 2011). Animals had to use a lever to move a dot to a subgoal position and then on toward a goal position. Many prefrontal neurons encoded the value of the presented task configuration, as defined by the parameters that individual animals used to guide their choice behavior. This replicates our previous results (Kennerley, Dahmubed, Lara, & Wallis, 2009; Kennerley & Wallis, 2009), in which neurons encoded the number of lever presses that animals needed to make to earn a reward. Prefrontal neurons also encoded RPE, particularly in ACC, which is also consistent with our previous work (Kennerley et al., 2011). A novel aspect of our results is that these RPEs appeared to be driven by changes in effort rather than reward.
Both rodent lesion (Rudebeck, Walton, Smyth, Bannerman, & Rushworth, 2006) and human neuroimaging (Prevost, Pessiglione, Metereau, Clery-Melin, & Dreher, 2010) suggest that ACC may be particularly involved in effort-based decisions. Furthermore, neurophysiology studies have shown a stronger dynamical interaction between ACC and LPFC for effort-based decisions compared with delay-based decisions, whereas the opposite is true for OFC and LPFC (Hunt, Behrens, Hosokawa, Wallis, & Kennerley, 2015). These ideas have been extended to include decisions about cognitive effort, which could be used to determine whether to exert cognitive control (Shenhav, Botvinick, & Cohen, 2013). If ACC is responsible for incorporating effort into value calculations, this would include calculating value prediction errors based on effort. An outstanding question is the role that dopamine plays in this process. Although dopamine neurons have long been associated with encoding reward predictions, the evidence for their involvement in effort calculations is more mixed. In an effort-based decision-making task, only a small minority of dopamine neurons incorporated effort information (Pasquereau & Turner, 2013). Future research should examine the precise role of dopamine in ACC prediction error calculations.
Very few neurons encoded PPEs, although those that did appear to be located in ACC. Neuroimaging studies in humans that used the same task showed that PPE correlated with increased BOLD activation in ACC (Ribas-Fernandes et al., 2011). Our data therefore appear to provide convergent evidence to support the HRL theoretical framework and a role for ACC in this process. However, an important caveat is that the degree of neural encoding that we observed in ACC was not particularly compelling. Only a handful of neurons showed significant encoding, and none of those neurons were particularly strongly tuned to PPE.
One possible explanation for the weak effects is that the animals were not paying sufficient attention to the task configurations, because the majority of trials did not require a choice. This explanation seems inadequate. In previous tasks where we have interleaved trials requiring a choice with those that did not require a choice, we have seen little difference in the response of prefrontal neurons to the two types of trial (Rich & Wallis, 2016). In addition, in the current study, we observed robust encoding of the value of the stimulus configuration. Finally, both animals showed changes in lever movement time on attaining the subgoal. Taken together, these results suggest that the animals were appropriately attending to the task and the subgoal.
Differences might also have arisen due to the way the task was represented across the two species. Humans bring context to the task in a way that monkeys cannot. For example, in humans, the description of the task involved a driver picking up a package and delivering it to a customer. This real-world knowledge might have contributed to humans approaching the task in a more hierarchical fashion compared with the relatively abstract representation that the animals experience. An additional difference between the two species relates to the value of acquiring the subgoal. In humans, there was no evidence that the subgoal influenced choice behavior, suggesting that acquiring the subgoal was not rewarding. In contrast, in the current study, the subgoal did influence the animals' choice behavior, albeit to a smaller extent than the other stimulus parameters. Thus, we cannot rule out the possibility that the PPE that was generated in ACC simply reflected an RPE that was generated by the acquisition of the subgoal.
This raises a broader issue with the HRL framework. The original study examining HRL in humans emphasized that pseudorewards are distinct from primary rewards because attaining subgoals is not necessarily rewarding in and of itself (Ribas-Fernandes et al., 2011). An example is adding milk to coffee: The subgoal brings one closer to the first sip of coffee, but the act itself is not rewarding. However, traditional RL models can also account for the influence of nonrewarding subgoals on behavior, because reward values become progressively associated with earlier reward-predictive events, which would include attaining subgoals that are not in themselves necessarily rewarding. Thus, the critical difference between HRL and RL rests, not so much in the distinction between pseudorewards and primary rewards, but rather in the way in which behavior is organized, in particular, the unit of behavior that is reinforced. In RL, prediction errors are calculated for each individual action, whereas in HRL, individual actions are chunked into a subroutine that generates its own prediction error on completion. It is not clear whether the prediction error generated by the subroutine necessarily needs to involve a distinct neural signal compared with the prediction error generated by the primary reward.
How the brain determines the appropriate behavioral unit for RL mechanisms is an area of active investigation. One idea is that the brain tends to group together mutually predictive stimuli and actions into a single event (Schapiro, Rogers, Cordova, Turk-Browne, & Botvinick, 2013). For example, driving to a restaurant and ordering a meal are both actions that can achieve the ultimate goal of eating a tasty meal, but behaviorally, the agent experiences a continuous stream of stimuli and actions. However, the act of driving involves many mutually predictive stimuli (e.g., steering wheel, traffic lights, seat belt) but only weakly predicts going to a restaurant, because one can drive to many alternate destinations. Likewise, ordering a meal involves many mutually predictive stimuli (e.g., server, menu, water) but may only weakly predict driving, because one could have also walked or caught the subway. Thus, the act of driving is grouped as a separate event from ordering the meal. The responses of prefrontal neurons are consistent with organizing behavior into these high-level events. For example, one of the major determinants of prefrontal firing rates is the part of the task in which the agent is currently engaged (Sigala, Kusunoki, Nimmo-Smith, Gaffan, & Duncan, 2008). Prefrontal neurons also encode events at an abstract, high level, incorporating categories (Freedman, Riesenhuber, Poggio, & Miller, 2001) and rules (Wallis, Anderson, & Miller, 2001). It may be that standard RL mechanisms operating on these high-level, behavioral events are sufficient to account for hierarchical behavior.
In summary, our results provide partial support for the involvement of ACC in HRL. In a task designed to use hierarchical behavior, we observed neurons in ACC whose firing rate correlated with PPE. However, there were caveats to this support, including the weak encoding, particularly in comparison to other signals that have been more firmly associated with ACC, such as predicted value and RPE, and whether HRL even requires a PPE signal distinct from RPE.
This work was funded by NIMH R01 MH097990 (to J. D. W.) and by Taiwan Top University Strategic Alliance Graduate Fellowship USA-UCB-100-S01 (to F.-K. C.).
Reprint requests should be sent to Joni D. Wallis, Department of Psychology, University of California at Berkeley, 132 Barker Hall, Berkeley, CA 94720, or via e-mail: email@example.com.