## Abstract

Reinforcement learning models have proven highly effective for understanding learning in both artificial and biological systems. However, these models have difficulty in scaling up to the complexity of real-life environments. One solution is to incorporate the hierarchical structure of behavior. In hierarchical reinforcement learning, primitive actions are chunked together into more temporally abstract actions, called “options,” that are reinforced by attaining a subgoal. These subgoals are capable of generating pseudoreward prediction errors, which are distinct from reward prediction errors that are associated with the final goal of the behavior. Studies in humans have shown that pseudoreward prediction errors positively correlate with activation of ACC. To determine how pseudoreward prediction errors are encoded at the single neuron level, we trained two animals to perform a primate version of the task used to generate these errors in humans. We recorded the electrical activity of neurons in ACC during performance of this task, as well as neurons in lateral prefrontal cortex and OFC. We found that the firing rate of a small population of neurons encoded pseudoreward prediction errors, and these neurons were restricted to ACC. Our results provide support for the idea that ACC may play an important role in encoding subgoals and pseudoreward prediction errors to support hierarchical reinforcement learning. One caveat is that neurons encoding pseudoreward prediction errors were relatively few in number, especially in comparison to neurons that encoded information about the main goal of the task.

## INTRODUCTION

Reinforcement learning (RL) is one of the most influential learning models to date and has had a dramatic impact on both artificial intelligence (Mnih et al., 2015) and our understanding of neural computation (Schultz, Dayan, & Montague, 1997). RL uses discrepancies between expected and actual reward outcomes to drive learning (Sutton & Barto, 1998). This estimation, known as a reward prediction error (RPE), is encoded by midbrain dopamine neurons (Hollerman & Schultz, 1998; Schultz et al., 1997) and is thought to underlie how animals and humans learn behaviors necessary to acquire rewards from the environment (Lee, Seo, & Jung, 2012; Dayan & Niv, 2008). RPE-related neural signals are also found in lateral prefrontal cortex (LPFC; Asaad & Eskandar, 2011) and ACC (Kennerley, Behrens, & Wallis, 2011). However, RL suffers from a problem of scaling (Botvinick, Niv, & Barto, 2009). Although it performs well in relatively constrained learning environments, when the number of environmental states and actions increases, the amount of sampling required by the agent—hence the amount of training time needed to acquire a behavior—scales as a positively accelerating function. Thus, an environment can quickly become too complex for RL to be a feasible learning solution.

Computational theoretical studies have proposed modifications to the conventional RL models to allow them to accommodate the more complex hierarchical behavioral structure that is typical of the real world (Sutton, Precup, & Singh, 1999). Instead of reinforcing individual actions, hierarchical RL (HRL) allows the chunking of actions into more temporally abstract behaviors, referred to as “options.” Each option terminates when a particular subgoal is attained, which generates an option-specific prediction error, referred to as a pseudoreward prediction error (PPE). For example, when making a cup of coffee, one option might be adding milk, but individual actions that contribute to that option (e.g., getting the milk out of the fridge, opening the milk carton) would contribute solely to the PPE rather than the RPE generated by drinking the coffee.

The notion that complex behavior is organized hierarchically also has a long history in neuroscience. Hughlings Jackson, for example, emphasized the notion that the frontal lobe represented behaviors in a hierarchical manner (Phillips, 1973). Neuroimaging and neuropsychology studies have shown that progressively more complex behaviors are controlled by progressively more anterior regions of prefrontal cortex (Badre, Hoffman, Cooney, & D'Esposito, 2009; Badre & D'Esposito, 2007; Koechlin, Ody, & Kouneiher, 2003). Recent efforts have focused on determining the neural substrates of the algorithmic processes derived from computational theories of HRL (Holroyd & McClure, 2015; Badre & Frank, 2012; Frank & Badre, 2012; Ribas-Fernandes et al., 2011). However, to date there has been little attempt to study HRL at the level of individual neurons, which could provide insights into the specific computations performed by prefrontal neurons that support HRL. Therefore, we trained two monkeys to perform a primate version of a task that has been used in humans to study HRL (Ribas-Fernandes et al., 2011). The task required performing a sequence of lever movements to move a stimulus from a start position to a goal position, by way of an intermediate subgoal position. On a fraction of trials, the position of the subgoal changed, thereby generating a PPE. In the human version of the task, the BOLD response in ACC positively correlated with the magnitude of the PPE. To examine whether this information was encoded at the level of single neurons, we recorded the electrical activity of single neurons in LPFC, ACC, and OFC while animals performed the HRL task.

## METHODS

### Subjects and Behavioral Task

Two male rhesus monkeys (Macaca mulatta) served as subjects (Q and R). Subjects were 5 and 6 years old and weighed approximately 7 and 9 kg at the time of recording. We regulated the daily fluid intake of our subjects to maintain motivation on the task. Subjects sat in a primate chair and viewed a computer screen. We used the MonkeyLogic system (Asaad & Eskandar, 2008) to control the presentation of the stimuli and the task contingencies. Eye movements were tracked with an infrared system (ISCAN). All procedures were in accord with the National Institutes of Health guidelines and the recommendations of the University of California at Berkeley Animal Care and Use Committee.

Our behavioral task has previously been used to measure PPEs in humans (Ribas-Fernandes et al., 2011). The delivery task requires subjects to take the perspective of a delivery driver that has to choose between two jobs involving picking up a package (the subgoal) and delivering it to a customer (goal). After the subject selects one of the jobs, the position of the package sometimes changes, which generates a PPE. We trained two animals to perform a version of this task (Figure 1A). Subjects were required to fixate a central cue to initiate a trial, after which two stimulus configurations appeared on the left and right of the screen. Each configuration consisted of three colored dots, which represented the start position (green), subgoal position (white), and goal position (blue). Subjects selected one of the configurations with a joystick movement. Once one of the two configurations was chosen, the other one disappeared and the subject had to make a series of joystick movements back-and-forth between the center location and the chosen side to move the green dot step-by-step from the start position to the goal position via the subgoal position. Each movement outwards caused the cursor to disappear, and then the movement back to the center caused the cursor to reappear 1° of visual angle closer to the subgoal or goal. The animal was allowed to make these movements as quickly as they desired. A juice reward was delivered once the green dot reached the goal position. The optimal choice was to select the shortest route, because this would lead to reward more quickly and with less physical effort.

Figure 1.

(A) Timeline of the behavioral task. On choice trials, subjects chose one of two stimulus configurations and then moved a joystick back-and-forth to move the green dot forwards on a green-white-blue route. The three dots were arranged on or within a circle of 8° (dashed black line) that was not visible to the subject. The optimal choice was to select the shortest route, because this would lead to reward more quickly and with less physical effort. On jump trials, a single configuration was presented, followed by a second configuration, which sometimes required updating the expectancy of how much work would be required to earn the reward because the position of the goal and/or subgoal changed. (B) Sample postjump configurations. The original configuration is shown in the top left. Numbers above the configuration indicate the number of steps for TS, SG, and SD. The subscript p and n refer to positive and negative prediction errors, respectively.

The start and goal positions in each original configuration were placed on the circumference of a circle 8° of visual angle in diameter. This circle was not visible to the animal. We manipulated two variables in each configuration: total steps (TS), the number of steps from the start position to the goal via the subgoal, and subgoal steps (SG), the number of steps from the start position to the subgoal. We also calculated the straight line distance (SD), which is the degrees of visual angle in a straight line from the start position to the goal.

Once the animals had been trained on the choice task, we implanted the neurophysiological recording equipment and recorded neural activity. During recording sessions, only 10% of the trials were choice trials. The other 90% of trials, which we collectively refer to as “jump” trials, began with the presentation of a single stimulus configuration for 500 msec in the center of the screen (prejump configuration), followed by a second configuration (postjump configuration) for 500 msec. On 56% of the jump trials, the postjump configuration contained no new information, either because it was identical to the prejump configuration or because the subgoal changed position but remained the same distance from the start and goal positions (Figure 1B, “mirror” condition). On the other 44% of the jump trials, the postjump configuration generated a PPE (because the distance from the start position to the subgoal changed) and/or an RPE (because the total number of steps to the goal changed). These errors were the inverse of the number of steps, because fewer steps meant the animal would attain the reward with less effort. In other words, moving goals or subgoals closer would generate positive prediction errors, whereas moving them further away would generate negative prediction errors. The fixation cue then changed color, indicating to the subject whether they should make rightward or leftward joystick movements to move the green dot to the goal position. Table 1 describes the different combinations of experimental conditions, and Figure 1B illustrates example configurations.

Table 1.

Relationship of Jump Configurations to Parameters in the HRL Model

| Configuration No. | TS | SG | SD | RPE | PPE | Notes |
|---|---|---|---|---|---|---|
| 1 | − | − | − | = | = | original |
| 2 | − | − | − | = | = | mirror |
| 3 | ↓ | ↓ | ↓ | + | + | min |
| 4 | ↑ | ↑ | ↑ | − | − | max |
| 5 | − | ↓ | − | = | + | PPEp only |
| 6 | − | ↑ | − | = | − | PPEn only |
| 7 | ↓ | − | − | + | = | RPEp only |
| 8 | ↑ | − | − | − | = | RPEn only |
| 9 | ↓ | ↓ | − | + | + | RPEp and PPEp |
| 10 | ↑ | ↑ | − | − | − | RPEn and PPEn |
| 11 | ↓ | ↑ | − | + | − | RPEp and PPEn |
| 12 | ↓ | ↓ | − | + | + | RPEp and PPEp |
| 13 | ↑ | ↑ | − | − | − | RPEn and PPEn |
| 14 | ↑ | ↓ | − | − | + | RPEn and PPEp |

### Neurophysiological Procedures

Our methods for neurophysiological recording have been reported in detail previously (Lara, Kennerley, & Wallis, 2009). Briefly, we implanted both subjects with a titanium head positioner for restraint and one recording chamber over each hemisphere, the position of which was determined using a 1.5-T MRI scanner. One recording chamber was positioned at an angle to allow access to LPFC and ACC, and the other was a vertical chamber to allow access to OFC. We recorded simultaneously from LPFC, ACC, and OFC using arrays of 6–14 tungsten microelectrodes (FHC Instruments). We determined the approximate distance to lower the electrodes from the MRI scans and advanced the electrodes using custom-built, manual microdrives until they were located just above the cell layer. We then slowly lowered the electrodes into the cell layer until we obtained neuronal waveforms, which were digitized and analyzed offline (Plexon Instruments). We randomly sampled neurons; we did not attempt to select neurons based on responsiveness. This procedure aimed to reduce any bias in our estimate of neuronal activity, thereby allowing a fairer comparison of neuronal properties between the different brain regions. We reconstructed our recording locations by measuring the position of the recording chambers using stereotactic methods. We plotted the positions onto the MRI sections using commercial graphics software (Adobe Illustrator). We confirmed the correspondence between the MRI sections and our recording chambers by mapping the position of sulci and gray and white matter boundaries using neurophysiological recordings. We traced and measured the distance of each recording location along the cortical surface from the lip of the ventral bank of the principal sulcus. We also measured the positions of the other sulci in this way, allowing the construction of unfolded cortical maps.

### Statistical Methods

#### Behavioral Data Analysis

We conducted all statistical analyses using MATLAB. All data for behavioral analyses were from the choice trials. To determine how the parameters of the stimulus configurations affected choice behavior, we performed a formal model comparison. We predicted that configurations with fewer TS should be considered more valuable than configurations with more steps and consequently should be chosen preferentially by the animals. We expected the position of the subgoal to have a smaller or negligible influence on choice behavior. We also included the SD between the start and goal position, because this provided a complete description of the triangular arrangement of start, subgoal, and goal positions. We tested logarithmic transformations of the distances, in addition to linear distances, because we have previously observed a better fit between visual stimuli and reward value using logarithmic transformations (Rich & Wallis, 2014). We used these parameters to estimate the subjective value (SV) of the left and right choice options:
$$SV_L = 1 - w_1 TS_L - w_2 SG_L - w_3 SD_L \tag{1}$$

$$SV_R = 1 - w_1 TS_R - w_2 SG_R - w_3 SD_R \tag{2}$$

where TS is the total steps from the start position to the goal position by way of the subgoal, SG is the distance from the start position to the subgoal position, and SD is the straight line distance between the start and goal position. We then fit a logistic regression model using the discounted values ($SV_L - SV_R$) to predict $P_L$, the probability that the subject chose the left configuration. We included a bias term, b, which accounted for any tendency of the subject to select the leftward configuration that was independent of the configurations' values:

$$P_L = \frac{1}{1 + e^{w_4 (SV_L - SV_R) - w_5 b}} \tag{3}$$
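As a rough illustration of Equations 1–3, the following Python sketch computes the subjective value of two configurations and the resulting probability of a leftward choice. The function names, the weights, the step counts, and the default value of the bias input `b` are all hypothetical choices made for illustration; the exponent follows Equation 3 as written.

```python
import math

def subjective_value(ts, sg, sd, w1, w2, w3):
    """Equations 1-2: value falls as total steps (TS), subgoal steps (SG)
    and straight-line distance (SD) grow."""
    return 1.0 - w1 * ts - w2 * sg - w3 * sd

def p_left(sv_left, sv_right, w4, w5, b=1.0):
    """Equation 3 as written: logistic function of the value difference
    plus a weighted side bias. b = 1.0 is an assumed default."""
    return 1.0 / (1.0 + math.exp(w4 * (sv_left - sv_right) - w5 * b))

# Hypothetical weights and configurations (not fitted values).
w1, w2, w3 = 0.05, 0.01, 0.02
sv_l = subjective_value(ts=6, sg=2, sd=4.0, w1=w1, w2=w2, w3=w3)
sv_r = subjective_value(ts=9, sg=3, sd=4.0, w1=w1, w2=w2, w3=w3)

# With the exponent written as in Equation 3, a negative w4 makes the
# higher-valued (shorter) route the more likely choice.
print(p_left(sv_l, sv_r, w4=-10.0, w5=0.0))
```

Equal subjective values with no bias give a choice probability of exactly 0.5, which is a convenient sanity check on any implementation.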

We estimated the weights of each parameter in the model by determining the values that minimized the negative log-likelihood of the model. To fit the weights (w1 to w5), we used maximum likelihood fitting (“fmincon” function in MATLAB) to find the set of parameters that best predicted the experimental data. To obtain fitted weights, we ran the maximum likelihood fitting function 100 times for each of 10 different randomly determined initial weights and then calculated the mean of the fitted weights. This helps to avoid accepting weights that reflect a local minimum in the fitting function. We compared models using Akaike's information criterion (AIC; Akaike, 1974).
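The objective used in this kind of fit can be sketched in a few lines of Python. This is a minimal illustration rather than the analysis code: it assumes a Bernoulli likelihood over left/right choices, uses the standard definition of AIC, and the function names and example data are hypothetical.

```python
import math

def negative_log_likelihood(p_left, chose_left):
    """Sum of -log P(observed choice) over trials (Bernoulli likelihood)."""
    eps = 1e-12  # guard against log(0)
    total = 0.0
    for p, c in zip(p_left, chose_left):
        p = min(max(p, eps), 1.0 - eps)
        total -= math.log(p) if c else math.log(1.0 - p)
    return total

def aic(nll, n_params):
    """Akaike's information criterion: 2k - 2 ln(L) = 2k + 2 * NLL."""
    return 2.0 * n_params + 2.0 * nll

# Hypothetical model predictions and observed choices for four trials.
predicted = [0.9, 0.8, 0.3, 0.6]
choices = [1, 1, 0, 1]  # 1 = chose left, 0 = chose right
nll = negative_log_likelihood(predicted, choices)
print(nll, aic(nll, n_params=5))
```

In a full fit, this objective would be handed to a numerical optimizer from several random starting points, and models with different parameter subsets would be compared on their AIC values.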

Our other behavioral measure was the lever movement time, which we defined as the time taken to move the joystick from the center position to the chosen side and then back again following the movement of the green dot.

#### Neural Data Analysis

All data for the neural analysis were from the jump trials. We visualized single neuron activity by constructing spike density histograms. We calculated the mean firing rate of the neuron across the appropriate experimental conditions using a sliding window of 100 msec. We then analyzed neuronal activity in two predefined epochs of 50–500 msec each, corresponding to the presentation of pre- and postjump configurations. For each neuron, we calculated its mean firing rate on each trial during each epoch. To determine whether a neuron encoded an experimental factor, we used linear regressions to quantify how well the experimental manipulation predicted the neuron's firing rate. Before conducting the regression, we standardized our dependent variable (i.e., firing rate) by subtracting the mean of the dependent variable from each data point and dividing each data point by the standard deviation of the distribution. The standardization of firing rate was performed across all trials, pooling across conditions. We evaluated the significance of selectivity at the single neuron level using an alpha level of p < .05.
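The epoch firing rate and its standardization can be sketched as follows. This is an illustrative Python example under stated assumptions: spike times are expressed in milliseconds relative to configuration onset, the sample standard deviation (n − 1) is used, and the function names and data are hypothetical.

```python
import statistics

def epoch_rate(spike_times_ms, start=50.0, end=500.0):
    """Mean firing rate (spikes/s) in the analysis epoch after stimulus onset."""
    n_spikes = sum(1 for t in spike_times_ms if start <= t < end)
    return n_spikes / ((end - start) / 1000.0)

def standardize(rates):
    """Z-score firing rates across all trials, pooling across conditions."""
    mu = statistics.mean(rates)
    sd = statistics.stdev(rates)  # sample SD (n - 1); an assumption
    return [(r - mu) / sd for r in rates]

# Hypothetical spike times (ms) for one trial, then per-trial rates.
print(epoch_rate([20, 60, 120, 300, 480, 510]))
rates = [4.0, 6.0, 8.0, 10.0, 12.0]
z = standardize(rates)
print(z)
```

After standardization the regression coefficients are expressed in units of standard deviations of firing rate, which makes them comparable across neurons with different baseline rates.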

We examined how neurons encoded information about the prejump configuration by performing a linear regression on the neuron's mean firing rate (F) during the prejump configuration presentation:
$$F = b_0 + b_1 SV + b_2 LR \tag{4}$$
where SV denotes the subjective value of the prejump configuration calculated according to the weights derived from our behavioral model and LR was a dummy variable that indicated whether the start position was to the left or right of fixation. Selective neurons were defined as those in which Equation 4 significantly predicted the neuron's firing rate (F test evaluated at p < .05) and one or more of the beta coefficients (excluding b0) was significant (coefficient t test evaluated at p < .05).
We examined how neurons encoded the postjump configuration by performing a linear regression on the neuron's mean firing rate (F) during the postjump event with six predictors:
$$F = b_0 + b_1 SV + b_2 LR + b_3 RPE_p + b_4 RPE_n + b_5 PPE_p + b_6 PPE_n \tag{5}$$

$$RPE = SV_{post} - SV_{pre} \tag{6}$$

$$PPE = w_2 (SG_{post} - SG_{pre}) \tag{7}$$

SV and LR are defined as for Equation 4. Another four predictors represented positive or negative RPEs or PPEs. We defined PPE as the difference between the original position of the subgoal and its position following the jump. Thus, we calculated this difference using the weighting that was ascribed to this parameter from the animal's initial choice behavior (Equations 1 and 2). Selective neurons were then defined in the same way as for Equation 4.
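The construction of the four prediction-error regressors can be illustrated with a short Python sketch. Several points are assumptions: the positive and negative regressors are coded here as rectified magnitudes, which is one common choice but is not specified in the text; Equation 7 is taken literally, without any additional sign flip; and the function names and trial values are hypothetical.

```python
def prediction_errors(sv_pre, sv_post, sg_pre, sg_post, w2):
    """Signed RPE (Equation 6) and PPE (Equation 7) for one jump trial."""
    rpe = sv_post - sv_pre
    ppe = w2 * (sg_post - sg_pre)  # Equation 7 taken literally
    return rpe, ppe

def split_by_sign(x):
    """Assumed coding: rectified magnitudes as separate positive and
    negative regressors, so each gets its own beta in Equation 5."""
    return max(x, 0.0), max(-x, 0.0)

# Hypothetical trial: value rises after the jump, subgoal steps go from 3 to 5.
rpe, ppe = prediction_errors(sv_pre=0.40, sv_post=0.55,
                             sg_pre=3, sg_post=5, w2=0.5)
rpe_p, rpe_n = split_by_sign(rpe)
ppe_p, ppe_n = split_by_sign(ppe)
print(rpe_p, rpe_n, ppe_p, ppe_n)
```

Splitting each error into two regressors allows a neuron to respond to positive and negative errors with different slopes, which a single signed regressor could not capture.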

To quantify the strength of neural encoding, for each neuron, we calculated the coefficient of partial determination (CPD) for each parameter. This is the amount of variance in the neuron's firing rate that can be explained by one predictor over and above the variance explained by other predictors included in the model. The CPD for predictor i is defined as
$$CPD_i = \frac{SSE(X_{-i}) - SSE(X)}{SSE(X_{-i})} \tag{8}$$

where $SSE(X_{-i})$ is the sum of squared errors in a regression model that includes all of the relevant predictor variables except i and $SSE(X)$ is the sum of squared errors in a regression model that includes all of the relevant predictor variables.
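To make the CPD calculation concrete, the sketch below fits a full and a reduced least-squares model and compares their residual sums of squares. It is a minimal standard-library Python illustration, not the analysis code: the small normal-equations solver, the function names, and the toy data are all hypothetical.

```python
def ols_sse(X, y):
    """Sum of squared errors of a least-squares fit of y on the columns of X."""
    n, k = len(X), len(X[0])
    # Normal equations (X'X) beta = X'y, stored as an augmented matrix.
    A = [[sum(X[t][i] * X[t][j] for t in range(n)) for j in range(k)]
         + [sum(X[t][i] * y[t] for t in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((y[t] - sum(b * x for b, x in zip(beta, X[t]))) ** 2
               for t in range(n))

def cpd(X, y, i):
    """Variance explained by column i over and above the other columns."""
    reduced = [[v for j, v in enumerate(row) if j != i] for row in X]
    sse_reduced = ols_sse(reduced, y)
    return (sse_reduced - ols_sse(X, y)) / sse_reduced

# Toy data: columns are intercept, x1, x2; y depends mainly on x1.
x1 = [0, 1, 2, 3, 4, 5]
x2 = [1, 0, 1, 0, 1, 0]
noise = [0.1, -0.1, 0.2, -0.2, 0.1, -0.1]
X = [[1.0, a, b] for a, b in zip(x1, x2)]
y = [2.0 * a + e for a, e in zip(x1, noise)]
print(cpd(X, y, 1), cpd(X, y, 2))
```

Because the reduced model is nested within the full model, the CPD lies between 0 and 1 up to floating-point error.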

To examine the time course of the contribution of each predictor, we performed a “sliding” regression analysis to calculate the CPD at each time point for each neuron. We fit each regression model (Equation 4 for the prejump configuration and Equation 5 for the postjump configuration) to neuronal firing for overlapping 200-msec windows, beginning with the 200 msec immediately before the task epoch and then shifting the window in 10-msec steps until we reached the end of the task epoch. The sliding regression analysis requires a correction for multiple comparisons, because it involves performing a statistical test for each time point. We derived this correction by calculating a false alarm rate. We applied the same statistical criterion to an equivalent analysis using shuffled neural data, where significant parameters can only reflect noise. We preserved the firing patterns of individual neurons on individual trials, but shuffled the experimental conditions. The results of this analysis showed that a statistical criterion of three consecutive time bins where the regression parameter was significant at p < .005 yielded a false alarm rate of less than 5%.
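The windowing scheme and the consecutive-bin criterion can be sketched in Python as follows. The function names and the example p values are hypothetical, and the stopping rule for the last window (its trailing edge reaching the end of the epoch) is an assumption about how "the end of the task epoch" is interpreted.

```python
def window_starts(epoch_start, epoch_end, width=200, step=10):
    """Start times (ms) of overlapping windows, from `width` ms before the
    epoch until the window's trailing edge reaches the epoch end."""
    starts = []
    t = epoch_start - width
    while t + width <= epoch_end:
        starts.append(t)
        t += step
    return starts

def first_selective_bin(p_values, alpha=0.005, run=3):
    """Index of the first bin that begins `run` consecutive bins with
    p < alpha, or None if the criterion is never met."""
    streak = 0
    for i, p in enumerate(p_values):
        streak = streak + 1 if p < alpha else 0
        if streak == run:
            return i - run + 1
    return None

starts = window_starts(0, 500)
print(len(starts), starts[0], starts[-1])

# Hypothetical per-window p values for one regressor in one neuron.
p_vals = [0.2, 0.001, 0.002, 0.3, 0.001, 0.001, 0.004, 0.5]
print(first_selective_bin(p_vals))
```

Running the same criterion on data with shuffled condition labels gives the proportion of neurons that reach it by chance, which is the false alarm rate.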

## RESULTS

### Behavioral Task Performance

To examine the influence of the stimulus configurations, we performed a model comparison, as described in detail in the Methods. The full model included parameters for TS, SG, and SD (w1, w2, and w3). Against this model, we compared other models in which we tested subsets of these parameters. In addition, we evaluated whether choice behavior relied on linear or logarithmic estimates of distances. In both animals, the full model was clearly favored, although Subject R favored logarithmic estimates of distance whereas Subject Q favored linear estimates (Table 2 and Figure 2A). For Subject R, w1 = 3.2, w2 = 0.5, and w3 = 1.4, whereas for Subject Q, w1 = 0.5, w2 = 0.4, and w3 = 1.1. Thus, for both subjects, the subgoal position had the smallest effect on choice behavior, although Subject R based his choices more on TS whereas Subject Q used SD. These two variables were positively correlated (correlation coefficient = .91), which likely accounted for why either variable could be used to solve the task. Overall, our models provided an excellent fit to choice behavior (Figure 2B), explaining 93% of the variance in Subject R's choice behavior and 90% of the variance in Subject Q's.

Table 2.

AIC Values for Both Subjects across All Tested Models

Column headings give the parameters included in the model.

| Subject | Distance Estimate | TS, SG, SD | TS, SG | TS, SD | SG, SD | TS | SG | SD |
|---|---|---|---|---|---|---|---|---|
| R | Logarithmic | 700 | 706 | 703 | 735 | 709 | 902 | 763 |
| R | Linear | 706 | 710 | 709 | 737 | 713 | 902 | 759 |
| Q | Logarithmic | 955 | 981 | 959 | 964 | 984 | 1112 | 979 |
| Q | Linear | 952 | 975 | 956 | 958 | 980 | 1107 | 972 |

Figure 2.

(A) AIC weights across the 14 tested models. The AIC weight is the relative likelihood of a given model within the set of tested models. The full model was clearly favored in both subjects, although a logarithmic transformation of distance was favored by Subject R, whereas Subject Q estimated distances linearly. (B) Behavioral performance during the choice trials. The probability of selecting the left configuration as a function of the difference in value of the left and right configurations as determined by Equations 1 and 2. Gray circles indicate actual data, and green lines indicate the best fitting model as determined by a formal model comparison.
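As a worked illustration of AIC weights, the Python sketch below applies the standard Akaike-weight formula to the 14 AIC values listed for Subject R in Table 2. That this exact formula was used to produce the figure is an assumption, and the function name is hypothetical.

```python
import math

def akaike_weights(aic_values):
    """Relative likelihood of each model, normalized over the tested set."""
    best = min(aic_values)
    rel = [math.exp(-0.5 * (a - best)) for a in aic_values]
    total = sum(rel)
    return [r / total for r in rel]

# AIC values for Subject R from Table 2 (logarithmic row, then linear row).
aic_r = [700, 706, 703, 735, 709, 902, 763,
         706, 710, 709, 737, 713, 902, 759]
weights = akaike_weights(aic_r)
best_index = weights.index(max(weights))
print(best_index, max(weights))
```

The weights sum to 1, so a single model with a weight well above one half is favored over all alternatives combined.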

Although the subgoal position only had a small effect on choice behavior, the model in which it was included clearly performed better than the model in which it was omitted in both animals. This indicated that the animals were not simply ignoring the subgoal. Further evidence was apparent in the lever movement times. Both animals showed a tendency to slow down as they approached both the subgoal and the goal and to speed up again once the subgoal had been acquired (Figure 3A). This was evident when we looked at the change in movement time from one step to the next (Figure 3B). We found that subjects slowed down (positive values on the y axis) on approaching the subgoal and sped up (negative values on the y axis) immediately after its attainment (one-way ANOVA, F(5, 227) = 17.62, $p < 1 \times 10^{-13}$ for Subject R; F(5, 179) = 22.85, $p < 1 \times 10^{-16}$ for Subject Q). In other words, subjects did pay attention to the subgoal position in the series of lever movements.

Figure 3.

(A) Lever movement times for steps relative to subgoal or goal positions. (B) Movement times relative to the previous steps. The diagram indicates the specific movements referenced by the x axis. Asterisks indicate that the values were significantly lower than any other values, using appropriate pairwise comparisons (p < .01, Bonferroni-corrected).

### Neural Encoding

We recorded the activity of 308 neurons from LPFC (Subject R 132; Subject Q 176), 249 neurons from OFC (R 130; Q 119), and 212 neurons from ACC (R 106; Q 106). Recording locations are illustrated in Figure 4. We collected the data across 38 recording sessions for Subject R and 30 sessions for Subject Q. To obtain sufficient statistical power, the neurons from the two subjects were pooled. For all significant results, there were no qualitative differences between the two subjects (i.e., the effects were in the same direction), unless otherwise noted.

Figure 4.

(A) Coronal MRI scans illustrating potential electrode paths. Red, green, and blue target areas indicate LPFC, ACC, and OFC, respectively. (B) Flattened reconstructions of the cortex indicating the locations of recorded neurons. The size of the circles indicates the number of neurons recorded at that location. We measured the anterior–posterior position from the interaural line (x axis) and the dorsoventral position relative to the lip of the ventral bank of the principal sulcus (0 point on y axis). Gray shading indicates unfolded sulci. LPFC recording locations were located within the principal sulcus. ACC recording locations were located within the cingulate sulcus. OFC recording locations were largely located within and between the lateral and medial orbital sulci. All recording locations are plotted relative to the ventral bank of the principal sulcus, which is a consistent landmark across animals. PS = principal sulcus; CS = cingulate sulcus; LOS = lateral orbital sulcus; MOS = medial orbital sulcus.

During the presentation of the prejump configuration, the most prevalent encoding was the value of the configuration, and this was more prevalent in ACC relative to the other two areas (Figure 5). Figure 6A illustrates example neurons that encoded selectively the SV of the prejump configurations. Fewer neurons encoded sensory information about the stimulus configuration, that is, LR, which is whether the start position was to the right or left of fixation. Because sensory encoding is not the focus of this report, we will not discuss LR encoding further. The time course of SV selectivity across the neural population is illustrated in Figure 7. There was no difference between the areas with respect to the onset of SV selectivity (median LPFC = 171 msec, median OFC = 166 msec, median ACC = 151 msec, one-way ANOVA, F(2, 122) = 0.45, p > .05).

Figure 5.

Percentage of neurons in LPFC, OFC, and ACC that encode different predictors during the prejump and postjump configurations. Shading indicates the proportion of neurons that encoded the variable with a given relationship: dark shading = positive, light shading = negative. For the postjump configuration, gray color indicates the proportion encoding both positive and negative predictors, which was possible since we included these as separate regressors. None of the proportions significantly differed from the 50:50 split expected by chance (binomial test, p < .05, Bonferroni-corrected for multiple comparisons). Asterisks indicate that the prevalence of neurons is significantly different between areas (chi-squared test, *p < .05, **p < .01). Dotted line indicates the percentage of selective neurons expected by chance given our statistical threshold for selectivity.

Figure 6.

Spike density histograms illustrating selective neurons from the three recording areas encoding (A) SV during the prejump configuration, (B) SV during the postjump configuration, and (C) RPE during the postjump configuration. Each plot shows the mean firing rate (top) and CPD (bottom) as a function of the relevant experimental parameter. The CPD indicates the amount of variance in the neuron's firing rate that is accounted for by the experimental parameter independently of the other parameters in the regression model (see Methods). Magenta datapoints indicate that the experimental parameter significantly predicts the neuron's firing rate. The gray vertical line indicates the onset of the configuration.

Figure 7.

Encoding of the SV of the prejump configuration across the population in three prefrontal areas. Each horizontal line on the plot indicates the selectivity of a single neuron as measured using the CPD (see Methods). Neurons have been sorted according to the latency at which they first show selectivity. The vertical white lines indicate the onset and offset of the prejump configuration.

During the postjump configuration, we again found that the most prevalent signal was the configuration's value (Figures 5 and 6B). Note that some neurons encoded SV before the onset of the postjump configuration, which indicates that they also encoded the SV of the prejump configuration. Some neurons also encoded RPE, and these neurons were most prevalent in ACC. In our task, rewards were fixed and delivered with certain probability, and so the RPE reflected changes in the amount of work that the animal needed to do, because the goal had moved either closer to (positive RPE) or farther from (negative RPE) the start position. In contrast, very few neurons encoded PPE, although the prevalence of such neurons did exceed chance in ACC (20/212 or 9.4%, binomial test, p < .01). However, the weak encoding of PPE relative to the other variables is evident in the population plots shown in Figure 8, where the robust encoding of SV contrasts with the weaker encoding of RPE and the virtually absent encoding of PPE.
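The prevalence test reported for PPE-selective ACC neurons can be illustrated with a short calculation. The following is a minimal sketch, not the analysis code used in the study: the chance rate of 5% is an assumption made only for illustration (the actual chance level follows from the selectivity threshold described in the Methods), and `binomial_tail` is a hypothetical helper name.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p), summed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_neurons = 212    # ACC neurons recorded (from the text)
n_selective = 20   # ACC neurons encoding PPE (from the text)
chance_rate = 0.05  # ASSUMPTION: illustrative false-positive rate, not taken from the paper

prevalence = n_selective / n_neurons
p_value = binomial_tail(n_selective, n_neurons, chance_rate)

print(f"prevalence = {prevalence:.1%}, one-sided p = {p_value:.4f}")
```

Under this assumed chance rate, the one-sided tail probability falls below .01, in line with the reported result; a different chance rate would change the exact value.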

Figure 8.

Encoding of predictors related to the postjump configuration across the population in three prefrontal areas. Conventions are as in Figure 7.

## DISCUSSION

We developed a primate version of a task that has been used to study HRL in humans (Ribas-Fernandes et al., 2011). Animals had to use a lever to move a dot to a subgoal position and then on toward a goal position. Many prefrontal neurons encoded the value of the presented task configuration, as defined by the parameters that individual animals used to guide their choice behavior. This replicates our previous results (Kennerley, Dahmubed, Lara, & Wallis, 2009; Kennerley & Wallis, 2009), in which neurons encoded the number of lever presses that animals needed to make to earn a reward. Prefrontal neurons also encoded RPE, particularly in ACC, which is also consistent with our previous work (Kennerley et al., 2011). A novel aspect of our results is that these RPEs appeared to be driven by changes in effort rather than reward.

Both rodent lesion (Rudebeck, Walton, Smyth, Bannerman, & Rushworth, 2006) and human neuroimaging (Prevost, Pessiglione, Metereau, Clery-Melin, & Dreher, 2010) suggest that ACC may be particularly involved in effort-based decisions. Furthermore, neurophysiology studies have shown a stronger dynamical interaction between ACC and LPFC for effort-based decisions compared with delay-based decisions, whereas the opposite is true for OFC and LPFC (Hunt, Behrens, Hosokawa, Wallis, & Kennerley, 2015). These ideas have been extended to include decisions about cognitive effort, which could be used to determine whether to exert cognitive control (Shenhav, Botvinick, & Cohen, 2013). If ACC is responsible for incorporating effort into value calculations, this would include calculating value prediction errors based on effort. An outstanding question is the role that dopamine plays in this process. Although dopamine neurons have long been associated with encoding reward predictions, the evidence for their involvement in effort calculations is more mixed. In an effort-based decision-making task, only a small minority of dopamine neurons incorporated effort information (Pasquereau & Turner, 2013). Future research should examine the precise role of dopamine in ACC prediction error calculations.

Very few neurons encoded PPEs, although those that did so appeared to be located in ACC. Neuroimaging studies in humans that used the same task showed that PPE correlated with increased BOLD activation in ACC (Ribas-Fernandes et al., 2011). Our data therefore appear to provide convergent evidence to support the HRL theoretical framework and a role for ACC in this process. However, an important caveat is that the degree of neural encoding that we observed in ACC was not particularly compelling. Only a handful of neurons showed significant encoding, and none of those neurons were particularly strongly tuned to PPE.

One possible explanation for the weak effects is that the animals were not paying sufficient attention to the task configurations, because the majority of trials did not require a choice. This explanation seems inadequate. In previous tasks where we have interleaved trials requiring a choice with those that did not require a choice, we have seen little difference in the response of prefrontal neurons to the two types of trial (Rich & Wallis, 2016). In addition, in the current study, we observed robust encoding of the value of the stimulus configuration. Finally, both animals showed changes in RT on attaining the subgoal. Taken together, these results suggest that the animals were appropriately attending to the task and the subgoal.

Differences might also have arisen due to the way the task was represented across the two species. Humans bring context to the task in a way that monkeys cannot. For example, in humans, the description of the task involved a driver picking up a package and delivering it to a customer. This real-world knowledge might have contributed to humans approaching the task in a more hierarchical fashion compared with the relatively abstract representation that the animals experience. An additional difference between the two species relates to the value of acquiring the subgoal. In humans, there was no evidence that the subgoal influenced choice behavior, suggesting that acquiring the subgoal was not rewarding. In contrast, in the current study, the subgoal did influence the animals' choice behavior, albeit to a smaller extent than the other stimulus parameters. Thus, we cannot rule out the possibility that the PPE that was generated in ACC simply reflected an RPE that was generated by the acquisition of the subgoal.

This raises a broader issue with the HRL framework. The original study examining HRL in humans emphasized that pseudorewards are distinct from primary rewards because attaining subgoals is not necessarily rewarding in and of itself (Ribas-Fernandes et al., 2011). An example is adding milk to coffee: The subgoal brings one closer to the first sip of coffee, but the act itself is not rewarding. However, traditional RL models can also account for the influence of nonrewarding subgoals on behavior, because reward values become progressively associated with earlier reward-predictive events, which would include attaining subgoals that are not in themselves necessarily rewarding. Thus, the critical difference between HRL and RL rests, not so much in the distinction between pseudorewards and primary rewards, but rather in the way in which behavior is organized, in particular, the unit of behavior that is reinforced. In RL, prediction errors are calculated for each individual action, whereas in HRL, individual actions are chunked into a subroutine that generates its own prediction error on completion. It is not clear whether the prediction error generated by the subroutine necessarily needs to involve a distinct neural signal compared with the prediction error generated by the primary reward.
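The contrast between a prediction error tied to the primary goal and one tied to a subroutine can be made concrete with a small sketch. This is an illustration under stated assumptions, not the model fitted in this study: the linear distance-based value functions, the absence of discounting, and the names `td_error`, `goal_value`, `subgoal_value`, and `jump_errors` are all hypothetical. Each error uses the standard temporal-difference form (Sutton & Barto, 1998), applied once to the value of the overall goal and once to the value of the subgoal.

```python
GAMMA = 1.0  # ASSUMPTION: no temporal discounting, chosen to keep the arithmetic simple


def td_error(reward, value_next, value_current, gamma=GAMMA):
    """Temporal-difference prediction error: r + gamma * V(s') - V(s)."""
    return reward + gamma * value_next - value_current


def goal_value(total_distance):
    """Hypothetical top-level value: falls linearly with total distance to the goal."""
    return -float(total_distance)


def subgoal_value(distance_to_subgoal):
    """Hypothetical option-level value: falls linearly with distance to the subgoal."""
    return -float(distance_to_subgoal)


def jump_errors(before, after):
    """Return (RPE, PPE) for a jump between two configurations.

    Each configuration is (distance start->subgoal, distance subgoal->goal).
    No reward or pseudoreward is delivered at the moment of the jump.
    """
    d_sub_before, d_rest_before = before
    d_sub_after, d_rest_after = after
    rpe = td_error(0.0,
                   goal_value(d_sub_after + d_rest_after),
                   goal_value(d_sub_before + d_rest_before))
    ppe = td_error(0.0,
                   subgoal_value(d_sub_after),
                   subgoal_value(d_sub_before))
    return rpe, ppe


# Subgoal moves farther away, total path length unchanged: PPE without RPE.
print(jump_errors(before=(4, 6), after=(6, 4)))  # (0.0, -2.0)

# Goal moves closer, subgoal unchanged: RPE without PPE.
print(jump_errors(before=(4, 6), after=(4, 3)))  # (3.0, 0.0)
```

In this toy setting the two errors dissociate only because the subroutine keeps its own value function. If the subgoal value were simply folded into the top-level value, a single prediction error would suffice, which is the possibility raised in the paragraph above.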

How the brain determines the appropriate behavioral unit for RL mechanisms is an area of active investigation. One idea is that the brain tends to group together mutually predictive stimuli and actions into a single event (Schapiro, Rogers, Cordova, Turk-Browne, & Botvinick, 2013). For example, driving to a restaurant and ordering a meal are both actions that can acquire an ultimate goal of eating a tasty meal, but behaviorally, the agent experiences a continuous stream of stimuli and actions. However, the act of driving involves many mutually predictive stimuli (e.g., steering wheel, traffic lights, seat belt) but only weakly predicts going to a restaurant, because one can drive to many alternate destinations. Likewise, ordering a meal involves many mutually predictive stimuli (e.g., server, menu, water) but may only weakly predict driving, because one could have also walked or caught the subway. Thus, the act of driving is grouped as a separate event from ordering the meal. The responses of prefrontal neurons are consistent with organizing behavior into these high-level events. For example, one of the major determinants of prefrontal firing rates is the part of the task in which the agent is currently engaged (Sigala, Kusunoki, Nimmo-Smith, Gaffan, & Duncan, 2008). Prefrontal neurons also encode events at an abstract, high level, incorporating categories (Freedman, Riesenhuber, Poggio, & Miller, 2001) and rules (Wallis, Anderson, & Miller, 2001). It may be that standard RL mechanisms operating on these high-level behavioral events are sufficient to account for hierarchical behavior.

In summary, our results provide partial support for the involvement of ACC in HRL. In a task designed to elicit hierarchical behavior, we observed neurons in ACC whose firing rate correlated with PPE. However, this support comes with two caveats: the encoding was weak, particularly in comparison to other signals that have been more firmly associated with ACC, such as predicted value and RPE, and it remains unclear whether HRL even requires a PPE signal distinct from RPE.

## Acknowledgments

This work was funded by NIMH R01 MH097990 (to J. D. W.) and by Taiwan Top University Strategic Alliance Graduate Fellowship USA-UCB-100-S01 (to F.-K. C.).

Reprint requests should be sent to Joni D. Wallis, Department of Psychology, University of California at Berkeley, 132 Barker Hall, Berkeley, CA 94720, or via e-mail: wallis@berkeley.edu.

## REFERENCES

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.

Asaad, W. F., & Eskandar, E. N. (2008). A flexible software tool for temporally-precise behavioral control in Matlab. Journal of Neuroscience Methods, 174, 245–258.

Asaad, W. F., & Eskandar, E. N. (2011). Encoding of both positive and negative reward prediction errors by neurons of the primate lateral prefrontal cortex and caudate nucleus. Journal of Neuroscience, 31, 17772–17787.

Badre, D., & D'Esposito, M. (2007). Functional magnetic resonance imaging evidence for a hierarchical organization of the prefrontal cortex. Journal of Cognitive Neuroscience, 19, 2082–2099.

Badre, D., & Frank, M. J. (2012). Mechanisms of hierarchical reinforcement learning in cortico-striatal circuits 2: Evidence from fMRI. Cerebral Cortex, 22, 527–536.

Badre, D., Hoffman, J., Cooney, J. W., & D'Esposito, M. (2009). Hierarchical cognitive control deficits following damage to the human frontal lobe. Nature Neuroscience, 12, 515–522.

Botvinick, M. M., Niv, Y., & Barto, A. C. (2009). Hierarchically organized behavior and its neural foundations: A reinforcement learning perspective. Cognition, 113, 262–280.

Dayan, P., & Niv, Y. (2008). Reinforcement learning: The good, the bad and the ugly. Current Opinion in Neurobiology, 18, 185–196.

Frank, M. J., & Badre, D. (2012). Mechanisms of hierarchical reinforcement learning in corticostriatal circuits 1: Computational analysis. Cerebral Cortex, 22, 509–526.

Freedman, D. J., Riesenhuber, M., Poggio, T., & Miller, E. K. (2001). Categorical representation of visual stimuli in the primate prefrontal cortex. Science, 291, 312–316.

Hollerman, J. R., & Schultz, W. (1998). Dopamine neurons report an error in the temporal prediction of reward during learning. Nature Neuroscience, 1, 304–309.

Holroyd, C. B., & McClure, S. M. (2015). Hierarchical control over effortful behavior by rodent medial frontal cortex: A computational model. Psychological Review, 122, 54–83.

Hunt, L. T., Behrens, T. E., Hosokawa, T., Wallis, J. D., & Kennerley, S. W. (2015). Capturing the temporal evolution of choice across prefrontal cortex. eLife, 4, e11945.

Kennerley, S. W., Behrens, T. E., & Wallis, J. D. (2011). Double dissociation of value computations in orbitofrontal and anterior cingulate neurons. Nature Neuroscience, 14, 1581–1589.

Kennerley, S. W., Dahmubed, A. F., Lara, A. H., & Wallis, J. D. (2009). Neurons in the frontal lobe encode the value of multiple decision variables. Journal of Cognitive Neuroscience, 21, 1162–1178.

Kennerley, S. W., & Wallis, J. D. (2009). Evaluating choices by single neurons in the frontal lobe: Outcome value encoded across multiple decision variables. European Journal of Neuroscience, 29, 2061–2073.

Koechlin, E., Ody, C., & Kouneiher, F. (2003). The architecture of cognitive control in the human prefrontal cortex. Science, 302, 1181–1185.

Lara, A. H., Kennerley, S. W., & Wallis, J. D. (2009). Encoding of gustatory working memory by orbitofrontal neurons. Journal of Neuroscience, 29, 765–774.

Lee, D., Seo, H., & Jung, M. W. (2012). Neural basis of reinforcement learning and decision making. Annual Review of Neuroscience, 35, 287–308.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529–533.

Pasquereau, B., & Turner, R. S. (2013). Limited encoding of effort by dopamine neurons in a cost-benefit trade-off task. Journal of Neuroscience, 33, 8288–8300.

Phillips, C. G. (1973). Proceedings: Hughlings Jackson Lecture. Cortical localization and “sensori motor processes” at the “middle level” in primates. Proceedings of the Royal Society of Medicine, 66, 987–1002.

Prevost, C., Pessiglione, M., Metereau, E., Clery-Melin, M. L., & Dreher, J. C. (2010). Separate valuation subsystems for delay and effort decision costs. Journal of Neuroscience, 30, 14080–14090.

Ribas-Fernandes, J. J., Solway, A., Diuk, C., McGuire, J. T., Barto, A. G., Niv, Y., et al. (2011). A neural signature of hierarchical reinforcement learning. Neuron, 71, 370–379.

Rich, E. L., & Wallis, J. D. (2014). Medial-lateral organization of the orbitofrontal cortex. Journal of Cognitive Neuroscience, 26, 1347–1362.

Rich, E. L., & Wallis, J. D. (2016). Decoding subjective decisions from orbitofrontal cortex. Nature Neuroscience, 19, 973–980.

Rudebeck, P. H., Walton, M. E., Smyth, A. N., Bannerman, D. M., & Rushworth, M. F. (2006). Separate neural pathways process different decision costs. Nature Neuroscience, 9, 1161–1168.

Schapiro, A. C., Rogers, T. T., Cordova, N. I., Turk-Browne, N. B., & Botvinick, M. M. (2013). Neural representations of events arise from temporal community structure. Nature Neuroscience, 16, 486–492.

Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275, 1593–1599.

Shenhav, A., Botvinick, M. M., & Cohen, J. D. (2013). The expected value of control: An integrative theory of anterior cingulate cortex function. Neuron, 79, 217–240.

Sigala, N., Kusunoki, M., Nimmo-Smith, I., Gaffan, D., & Duncan, J. (2008). Hierarchical coding for sequential task events in the monkey prefrontal cortex. Proceedings of the National Academy of Sciences, U.S.A., 105, 11969–11974.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction (adaptive computation and machine learning). Cambridge, MA: MIT Press.

Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112, 181–211.

Wallis, J. D., Anderson, K. C., & Miller, E. K. (2001). Single neurons in prefrontal cortex encode abstract rules. Nature, 411, 953–956.