Although much is known about decision making under uncertainty when only a single step is required in the decision process, less is known about sequential decision making. We carried out a stochastic sequence learning task in which subjects had to use noisy feedback to learn sequences of button presses. We compared flat and hierarchical behavioral models and found that although both models predicted the choices of the group of subjects equally well, only the hierarchical model correlated significantly with learning-related changes in the magneto-encephalographic response. The significant modulations in the magneto-encephalographic signal occurred 83 msec before button press and 67 msec after button press. We also localized the sources of these effects and found that the early effect localized to the insula, whereas the late effect localized to the premotor cortex.
Coherent sequences of decisions are an important feature of organized behavior, and real-world decisions are often made with uncertainty. Given the importance of coherent behavior and the fact that it is often disrupted by brain disorders (McKenna & Oh, 2005; Schwartz, Reed, Montgomery, Palmer, & Mayer, 1991; Andreasen, 1979a, 1979b; Penfield & Evans, 1935), there is a need to understand the neural processes involved in the learning and orchestration of sequences of decisions. However, most work on decision making has focused on single-step decisions (Wittmann, Daw, Seymour, & Dolan, 2008; Fellows & Farah, 2007; Daw, O'Doherty, Dayan, Seymour, & Dolan, 2006; Hsu, Bhatt, Adolphs, Tranel, & Camerer, 2005; Samejima, Ueda, Doya, & Kimura, 2005; Glimcher & Rustichini, 2004).
In the current study, we addressed several aspects of sequential decision making. The first questions we addressed were when and where the cognitive processes relevant to learning sequences from stochastic feedback take place. We used magneto-encephalographic (MEG) imaging, which has high temporal precision, to record brain activity during the task. This allowed us to define in time when cognitive processes became active and allowed us to analyze data on a movement-by-movement basis, without making movements artificially slow. Furthermore, we used source localization tools to estimate where in space these processes were taking place. As such, we examined how the signals related to learning in the task evolve over time and space.
In addition, we were interested in the extent to which subjects used optimal strategies to learn in our task. In an interesting contradiction, two dominant fields of enquiry separately describe behavioral processes as either optimal (Knill & Saunders, 2003) or heuristic, which implies suboptimality (Kahneman, Slovic, & Tversky, 1982). To a large extent, these perspectives come from studying different classes of behaviors (Trommershauser, Maloney, & Landy, 2008). Many who study sensory–motor integration have found that subject performance can have features of optimality (Kording & Wolpert, 2006; Knill & Saunders, 2003; Trommershauser, Maloney, & Landy, 2003a, 2003b; Ernst & Banks, 2002; Todorov & Jordan, 2002; Jacobs, 1999; Kersten, 1999; Knill & Richards, 1996; Poggio, Torre, & Koch, 1985), whereas those who study decision under risk have consistently shown that subject performance is not optimal and instead heuristic (Gilovich, Griffin, & Kahneman, 2002; Payne, Bettman, & Johnson, 1992; Tversky & Kahneman, 1986; Kahneman et al., 1982), the neural correlates of which are currently being explored (Tobler, Christopoulos, O'Doherty, Dolan, & Schultz, 2008). Thus, the difference may have to do with the class of behavioral process being studied as well as other features, including whether models are updated with learning (Trommershauser et al., 2008).
The stochastic sequence learning task we used has elements of both sensory–motor integration and decision making, as the uncertainty in our task is external and not due to noise in the sensory–motor system. Furthermore, by bringing decision making into a sequential framework, we examined aspects of the cognitive processes that underlie decision-making behaviors that do not exist in single-step decision-making paradigms. Specifically, we asked whether subjects were able to learn optimal sequences of decisions when information at other points in the sequence affected the current choice. This relationship among parts of a sequence can be modeled using a hierarchical model. Many important human behaviors, including speech, contain hierarchical structure between sequence elements, but it is not clear if these processes are solved by a domain general system or by a domain-specific language system (Fiebach & Schubotz, 2006).
Subjects and Task
Fourteen subjects (7 men) carried out a stochastic sequence learning task while being scanned. Informed consent was obtained from each subject in accordance with procedures approved by the Joint Ethics Committee of the National Hospital for Neurology and Neurosurgery and the Institute of Neurology, London. Each trial began with the presentation of a green outline circle at the center of the screen, which cued the subjects to execute a movement (Figure 1A). They responded by executing a button press with either their left or right thumb. After each response, they were given feedback about whether they had pressed the correct button for that movement of the sequence. If they were correct, the outline circle filled green (positive feedback), and if they were incorrect, the outline circle filled red (negative feedback). After the feedback had been given for 200 msec, they were again presented with a green outline circle that cued the next movement of the sequence. They again pressed one of two buttons and were given feedback. This was repeated four times, such that each trial was composed of four button presses, with each button press followed by feedback. In 15% of the cases, the wrong feedback was given. In other words, if they had pressed the correct button, they were given red (negative) feedback, and if they had pressed the incorrect button, they were given green (positive) feedback. Thus, the feedback from any individual button press did not necessarily allow the subjects to correct their mistakes and execute the correct sequence in the subsequent trial. Integrated over trials, however, the subjects could infer the correct button press sequence.
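The feedback rule described above can be illustrated with a short simulation. This is a minimal sketch, not the task code used in the experiment: the function names `give_feedback` and `run_trial` and the use of a seeded random generator are assumptions made for illustration. Only the 15% error rate and the green/red mapping come from the task description.

```python
import random

# From the task description: the wrong feedback was given on 15% of presses.
FEEDBACK_ERROR_RATE = 0.15


def give_feedback(pressed, correct, rng):
    """Return 'green' or 'red' for one press, flipping the truthful outcome 15% of the time."""
    truthful = "green" if pressed == correct else "red"
    if rng.random() < FEEDBACK_ERROR_RATE:
        # Misleading feedback: report the opposite of the truthful outcome.
        return "red" if truthful == "green" else "green"
    return truthful


def run_trial(presses, sequence, rng):
    """Feedback for one four-press trial (hypothetical helper)."""
    return [give_feedback(p, c, rng) for p, c in zip(presses, sequence)]


rng = random.Random(0)  # seeded only to make the illustration repeatable
n = 20000
greens = sum(give_feedback("L", "L", rng) == "green" for _ in range(n))
print(greens / n)  # fraction of green feedback for a correct press, close to 0.85
```

Because any single feedback event is unreliable, a learner in this sketch has to accumulate evidence over several trials before the correct button at each position becomes clear.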
Exact task timing depended on the RTs of the subjects and in the case of visual feedback, on where the screen refresh cycle was when we initiated draw commands. The mean RT following the pacing cue was 385 msec (SD = 284 msec). Visual feedback followed the button press by 40 msec (SD = 6 msec). The mean time between the feedback and the presentation of the subsequent pacing cue was 226 msec (SD = 9 msec), and the mean time between button presses was 602 msec (SD = 275 msec). The break between trials was indicated with a blank screen that lasted 1 sec.
We used 6 of the 16 possible sequences (i.e., 4 button presses with either the left or the right thumb gives 16 possible sequences), which were balanced for first-order button press probabilities and contained two left and two right button presses. The sequences used were LLRR, RRLL, LRLR, RLRL, LRRL, and RLLR. Each subject executed eight sets of six blocks, where each block was one sequence (Figure 1B). A single set consisted of all six sequence blocks where the order of the blocks was chosen pseudorandomly. For example, a block of Sequence 4 might be followed by a block of Sequence 1, and so forth. In each block, subjects had to determine the correct sequence using the stochastic feedback and execute it correctly eight times before they advanced to the next block. Subjects were informed when the block switched, and thus they knew when they had to start learning a new sequence. If they failed to complete the sequence correctly eight times by 20 trials, they were advanced to the next sequence. If the subjects completed all blocks of trials, each sequence was executed correctly 64 times within a session, resulting in a total of 384 correct sequences and 1,536 correct button presses for each subject. In addition, subjects received at least one set of training on all six sequences before entering the MEG and were instructed that they only had to determine which of the six sequences was correct in the current block. The training familiarized them with the six sequences, the stochastic feedback, and the other aspects of the task. The subjects were not told specifically that all sequences would require two left and two right button presses. Thus, within the experiment, the subjects were familiar with the mechanics of the task and the sequences, and their task was to use the stochastic feedback for each sequence to estimate which sequence was correct and then execute that sequence.
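The sequence set and the trial counts above can be checked by enumeration. The sketch below is illustrative only; the variable names are assumptions. It shows that the six sequences used are exactly the four-press sequences containing two left and two right presses, that left and right are equally frequent at every serial position, and that the stated totals follow from the design.

```python
from itertools import product

# All four-press sequences over the two buttons.
all_sequences = ["".join(p) for p in product("LR", repeat=4)]

# Keep the sequences with two left and two right presses.
balanced = [s for s in all_sequences if s.count("L") == 2]

# The six sequences listed in the text.
used = {"LLRR", "RRLL", "LRLR", "RLRL", "LRRL", "RLLR"}

# First-order balance: number of sequences with a left press at each position.
left_counts = [sum(s[pos] == "L" for s in balanced) for pos in range(4)]

# Design totals: 8 sets x 6 blocks x 8 correct executions, 4 presses each.
sets, blocks_per_set, criterion, presses_per_sequence = 8, 6, 8, 4
correct_sequences = sets * blocks_per_set * criterion
correct_presses = correct_sequences * presses_per_sequence

print(len(all_sequences), sorted(balanced), left_counts)
print(correct_sequences, correct_presses)
```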
The task was challenging, and we found post hoc that in about 9% of the blocks, the 12 subjects retained for analysis failed to learn the sequence. Specifically, they went through 20 trials without learning to criterion. With respect to the MEG analysis described below, we carried out the analysis both with and without the blocks of data in which they failed to reach criterion. Although including or not including these data affected the exact F values of the MEG analysis, the rank order of the fit of the three models was the same (see Results), and the plots showing the spatial distribution of the effects were similar. We report results for the analyses in which we excluded the blocks in which subjects did not learn.
Analyses proceeded in a series of steps (Figure 1C). First, we fit behavioral models to each subject's learning data. Then, we extracted the movement-by-movement estimates of learning from each subject's behavioral model. This learning estimate (quantified as the probability that the subject knew which button they should press at each point in time) was then regressed on the data in sensor space to see if the button-press-related activity was modulated by learning. We then used source localization on a subject-by-subject basis to localize the significant effects that we found in sensor space. After localization was done for each subject, the results were used to carry out SPM statistics on whether the localizations were significant across subjects.
We fit Bayesian statistical models to the subjects' behavior. The models allowed us to quantify trial by trial how much the subjects had learned about which sequence or which button was correct in the current block. The subjects could press either the left or the right button at each point in the sequence, and therefore they had a binary decision. The model assumes that the subjects were trying to learn the sequence and therefore that they were trying to optimize the number of times green feedback was received. Statistically, this can be accomplished by remembering how often green feedback was given for the left (or right) button at each point in the sequence. For example, if green feedback was given more often for the left button, then the left button should be pressed. Thus, the model integrates information about red versus green feedback given for left and right button presses individually for each of the four button presses in the sequence.
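The idea of integrating feedback separately at each serial position can be sketched as a simple Bayesian update. This is an illustration under stated assumptions, not a reproduction of the model equations used in the study: it assumes a uniform prior over the two buttons, treats the 85% feedback reliability as known to the observer, and omits the feedback-weighting parameters that were fit to individual subjects. The function name `flat_posterior_left` is hypothetical.

```python
# Assumption: the observer knows that feedback is truthful 85% of the time.
RELIABILITY = 0.85


def flat_posterior_left(observations):
    """Posterior probability that the left button is correct at one serial position.

    observations: list of (button, feedback) pairs seen at this position,
    with button in {"L", "R"} and feedback in {"green", "red"}.
    """
    like_left = 1.0
    like_right = 1.0
    for button, feedback in observations:
        # Green after a left press, or red after a right press, supports "left is correct".
        supports_left = (button == "L") == (feedback == "green")
        like_left *= RELIABILITY if supports_left else 1.0 - RELIABILITY
        like_right *= 1.0 - RELIABILITY if supports_left else RELIABILITY
    # A uniform prior over the two buttons cancels in the normalization.
    return like_left / (like_left + like_right)


print(flat_posterior_left([]))                                # no evidence yet
print(flat_posterior_left([("L", "green")]))                  # one supporting observation
print(flat_posterior_left([("L", "green"), ("L", "red")]))    # conflicting evidence cancels
print(flat_posterior_left([("L", "green"), ("R", "red")]))    # two supporting observations
```

In this sketch each position is updated independently, so feedback at one position carries no information about any other position.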
Equations 1–5 describe an ideal observer model, but it is likely that the subjects' behavior will deviate from this model. To better predict their behavior, we added two parameters to the basic model that allowed for differential weighting of positive and negative feedback. These parameters affect both the flat and the hierarchical model. When the models were fit to the subject behavior, different parameters were fit under each model to individual subjects, as shown in the Results section.
Two of the subjects performed very poorly in the task (Subjects 5 and 8), and their parameter values reflected this as they had values near zero for both α and β, implying no integration of the feedback. Once the parameters that maximized the likelihood were found, either Equation 2 (PF(B)) or Equation 5 (PH(B)) was used to generate the probability, on a trial-by-trial basis, that the subject knew which button to press at each point in the sequence, and Equation 4 was used to estimate the sequence probability (P(S)), given the feedback from the current block.
We also examined three other models that were extensions of the basic model. None of them fit the subject behavior better than the flat model, and therefore we will not go into the formal description of the models. However, it is worth describing them briefly as we did explore various possibilities, beyond the flat and hierarchical models, for how subjects might have been performing the task. We were trying to model the possibility that subjects were integrating information over the first few trials of the block, but at some point they decided which sequence they thought was correct in the block. In that case, the subjects' probability estimate would jump from its current value to 1. Effectively, we were trying to model the possibility that when the probability estimate crossed a threshold the subjects were sufficiently confident of which sequence they were executing and their belief estimate went to 1 or some very high value. We did this in three ways. First, we passed the button press probabilities (Equation 2) generated by the basic model through a soft-max function (Bishop, 1995). This function is used to convert value estimates in reinforcement learning models into probabilities. It contains a temperature parameter that controls how much probabilities are amplified. In effect, it causes the probabilities during learning to go much more quickly to 1. We estimated the temperature as a free parameter but found that it did not improve the fit. The second approach was to implement a threshold. In this case, when the belief value crossed that threshold it was set to 1. In the third case, we switched from the button probability generated by Equation 2 to the sequence button probability generated by Equation 5 when the probability passed a threshold. The threshold was always allowed to vary as a free parameter. As stated, however, none of these three approaches worked better than the basic button model.
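The first two extensions can be illustrated with a short sketch. The exact functional forms used in the study are not given in the text, so everything here is an assumption: `sharpen` applies a two-option soft-max to a probability and its complement, and `apply_threshold` sets a belief to 1 once it crosses a free threshold. Both names are hypothetical.

```python
import math


def sharpen(p, temperature):
    """Two-option soft-max over p and 1 - p (assumed form).

    Algebraically equal to exp(p/T) / (exp(p/T) + exp((1 - p)/T)).
    Low temperatures push probabilities above 0.5 toward 1.
    """
    return 1.0 / (1.0 + math.exp((1.0 - 2.0 * p) / temperature))


def apply_threshold(p, threshold):
    """Set the belief to 1 once it crosses the threshold (assumed form)."""
    return 1.0 if p >= threshold else p


for temperature in (10.0, 1.0, 0.1):
    print(temperature, sharpen(0.7, temperature))

print(apply_threshold(0.7, 0.8), apply_threshold(0.9, 0.8))
```

In this sketch a low temperature drives a moderately confident belief close to 1, which mimics a subject committing to a sequence early, whereas a high temperature flattens beliefs toward 0.5.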
MEG Data Acquisition and Preprocessing
MEG data were recorded using 275 third-order axial gradiometers with the Omega275 CTF MEG system (VSM MedTech, Vancouver, Canada) located in a magnetically shielded room. The signals were recorded at a sampling rate of 480 Hz. Visual stimulus lag was estimated using a photodiode to measure the onset time at the screen relative to the signal sent from the task control computer to the data acquisition computer and was found to be 25 msec. This delay was used in calculating all of our timing values. Data analysis was carried out using SPM5 (Wellcome Department of Imaging Neuroscience, London) and custom written Matlab routines to implement the behavioral model.
We began the analyses by low-pass filtering the MEG signal at 50 Hz and downsampling to 120 Hz. The data were then epoched into a 400-msec window centered on the button press. Trials with a response that had an absolute value greater than 3000 fT were discarded as outliers. Statistical analysis and source localization were carried out using SPM. Details of the general linear model (GLM) that was fit are given below. Gaussian random field theory was used to control for multiple comparisons in either 2-D space × 1-D time (sensor space) or 3-D space (source space) (Kiebel, Kilner, & Friston, 2007; Kilner, Kiebel, & Friston, 2005). Sensors were converted into a 2-D space using Gaussian interpolation. Smoothing was never done across both time and space. Thus, the signals were first filtered in time and then filtered in space at a single time point.
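The timing grid implied by these preprocessing steps can be worked out directly. The sketch below is illustrative and does not use SPM; the names `latency_ms` and `keep_epoch` are assumptions. At 120 Hz the sample period is about 8.33 msec, so a window of 200 msec on either side of the button press spans 24 samples per side, and latencies are naturally reported as rounded multiples of the sample period (for example, 10 samples before the press is about 83 msec and 8 samples after is about 67 msec).

```python
SAMPLE_RATE_HZ = 120          # after downsampling
HALF_WINDOW_MS = 200          # 400-msec epoch centered on the button press
OUTLIER_THRESHOLD_FT = 3000.0

sample_period_ms = 1000 / SAMPLE_RATE_HZ
samples_per_side = round(HALF_WINDOW_MS / sample_period_ms)


def latency_ms(sample_offset):
    """Latency relative to button press, in msec, for a signed sample offset."""
    return round(sample_offset * sample_period_ms)


def keep_epoch(samples_ft):
    """Keep an epoch only if no sample exceeds the outlier threshold in absolute value."""
    return max(abs(x) for x in samples_ft) <= OUTLIER_THRESHOLD_FT


print(samples_per_side)
print([latency_ms(k) for k in (-14, -13, -10, -8, 7, 8, 12, 18)])
print(keep_epoch([120.0, -850.0, 400.0]), keep_epoch([120.0, -3500.0, 400.0]))
```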
When relevant sensor space effects were identified, we estimated the sources of these effects using source reconstruction algorithms in SPM5. For each subject, we constructed a forward model describing the transformation between distributed dipole sources and the magnetic field distribution measured by the MEG sensors. Sources were modeled using the 7204-vertex template cortical mesh available in SPM5, defined in Talairach and Tournoux coordinates. It was coregistered to the sensor locations via three fiducial marker positions (Mattout, Henson, & Friston, 2007), and the gain matrix of the lead-field model was then computed using a spherical head model. Source estimates were computed using restricted maximum likelihood estimation to invert the forward model within a parametric empirical Bayes framework (Mattout, Phillips, Daunizeau, & Friston, 2007; Mattout, Phillips, Penny, Rugg, & Friston, 2006). This inversion proceeded using multiple sparse empirical priors for covariance components (Friston, Harrison, et al., 2008). The greedy search algorithm (Friston, Chu, et al., 2008) provided an optimal mixture of sparse prior components. This produced source reconstructions for each experimental condition and for each subject.
We then compared activation levels on the mesh across subjects using a random effects model (i.e., a second level model in SPM). Voxels were accepted at an uncorrected p value of .01, and all significance values are reported at cluster level corrected for whole brain. Sources were estimated in a 2 × 2 design, Probability × Button, where probability was low or high (i.e., early or late learning). Significance of the probability factor was calculated by first computing main effect contrasts subject by subject and then doing univariate t tests.
Fourteen subjects carried out the stochastic sequence learning task (Figure 1), in which they had to learn sequences of four left/right button presses (e.g., LLRR, LRLR, etc.). Explicit feedback was given after each button press, but 15% of the time inaccurate feedback was given. Two of the subjects failed to learn the sequences to criterion, so their results will not be discussed further. It took the remaining 12 subjects, on average, 3.2 trials of learning with each new sequence before they executed a complete trial correctly, where a correct trial was defined as pressing all four buttons in the sequence correctly (Figure 2A). The subjects' performance reached a plateau by about three trials correct (about six total, i.e., combined correct and error trials) in each block, and they executed the sequence in the remaining trials with few errors. When the performance was examined as the serial position of the movement, results were similar. It could be seen, however, that the early movements of the sequence were learned slightly faster (Figure 2B). There was also a small bias to perform better on the early and last movements of the sequence between correct Trials 3 and 4 (Figure 2B), consistent with the primacy and recency gradients seen in most sequence tasks (Averbeck, Chafee, Crowe, & Georgopoulos, 2002). However, later in the block, the performance reached a ceiling, and this effect could not be seen.
In the task, the subjects made a sequence of four left or right button presses. Thus, there were 16 possible sequences (2^4). However, we only used 6 of the 16 possible sequences in the experiment. Because we only used a subset of the possible sequences, there were correlations between the correct button presses, and feedback about button presses at other points in the sequence could be used to better infer the correct button at the current point in the sequence. For example, if in a particular block the subject was certain that the first two button presses were LL, they would predict that the subsequent button presses were RR even if the evidence for RR was equivocal because LLRR was the only sequence we used that started with LL. These correlations between button presses can be represented with a hierarchical structure.
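The LL example above can be made concrete with a small sketch of inference over whole sequences. This is an illustration under stated assumptions rather than the hierarchical model fit in the study: it assumes a uniform prior over the six sequences, a known feedback reliability of 85%, and equal weighting of green and red feedback. The function names are hypothetical.

```python
SEQUENCES = ["LLRR", "RRLL", "LRLR", "RLRL", "LRRL", "RLLR"]
RELIABILITY = 0.85  # assumption: the observer knows feedback is truthful 85% of the time


def sequence_posterior(observations):
    """Posterior over the six sequences under a uniform prior.

    observations: list of (position, button, feedback) triples, with position in 0..3.
    """
    weights = {}
    for seq in SEQUENCES:
        likelihood = 1.0
        for position, button, feedback in observations:
            # Feedback is "truthful" if green follows a correct press or red an incorrect one.
            truthful = (button == seq[position]) == (feedback == "green")
            likelihood *= RELIABILITY if truthful else 1.0 - RELIABILITY
        weights[seq] = likelihood
    total = sum(weights.values())
    return {seq: w / total for seq, w in weights.items()}


def button_probability_left(posterior, position):
    """Marginal probability that left is correct at a position, summed over sequences."""
    return sum(p for seq, p in posterior.items() if seq[position] == "L")


# Three green responses to L at each of the first two positions; nothing seen at positions 2 and 3.
evidence = [(0, "L", "green")] * 3 + [(1, "L", "green")] * 3
posterior = sequence_posterior(evidence)
print(posterior["LLRR"])                      # most of the probability mass
print(button_probability_left(posterior, 2))  # small: the third press is very likely R
```

With no feedback at all for the third and fourth presses, a position-by-position observer would remain at 0.5 for those presses, whereas the sequence-level observer in this sketch already strongly favors RR.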
We explored the hypothesis that subjects took this hierarchical or correlational structure into account when they were learning the sequences. To do this, we predicted the subjects' behavior with two different models, one which did not take the hierarchical structure into account (Figure 3A, flat model) and one which did (Figure 3B, hierarchical model). Behavioral parameters were optimized for each subject under each model. We found that subjects learned more from positive feedback than negative feedback under both models (Figure 3C). The parameter values were also lower for the hierarchical model than the flat model. This is due to the fact that the hierarchical model is more efficient with the data because it correctly models the actual stochastic process used in the experiment. Because parameters were optimized under each model, learning rates for the two models were similar, although the hierarchical model tended to learn more smoothly across trials because it integrated information across buttons, as can be seen in a single example block from a single subject (Figure 4).
We next examined whether we could find differences in the ability of the two models to predict the behavior of the subjects. We did this by computing a t test on the log-likelihood ratio of the two models across subjects. A negative log-likelihood ratio for a single subject favored the hierarchical model. There was, however, no significant difference between the models, across our subjects, t(11) = 1.8, p = .09. We did find that the behavior of the subjects who learned better was better described by the hierarchical model. Thus, the total number of trials it took subjects to finish the task, which is a measure of how efficiently they learned, was negatively correlated with the log-likelihood ratio (r = −0.76, p < .01, n = 12). In other words, the hierarchical model better predicted the behavior of subjects who learned more efficiently.
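The group-level comparison can be sketched as a one-sample t test of the per-subject log-likelihood ratios against zero. Whether the reported test took exactly this form is an assumption, and the values below are hypothetical placeholders, not the study's data. They serve only to show that 12 subjects give 11 degrees of freedom.

```python
import math
from statistics import mean, stdev


def one_sample_t(values, mu=0.0):
    """Return the t statistic and degrees of freedom for a one-sample t test against mu."""
    n = len(values)
    standard_error = stdev(values) / math.sqrt(n)
    return (mean(values) - mu) / standard_error, n - 1


# Hypothetical log-likelihood ratios, one per subject (illustration only).
hypothetical_llr = [-2.1, 0.4, -1.3, -0.2, 1.1, -3.0, -0.8, 0.6, -1.9, -0.5, 0.2, -1.4]

t_stat, df = one_sample_t(hypothetical_llr)
print(t_stat, df)
```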
MEG Responses—Parametric Effect of Probability
We measured MEG responses while subjects learned and executed the sequences. We began by examining effects in sensor space, that is, by looking at task effects on the temporal response in the interpolated scalp map around the time of button press (200 msec before to 200 msec after button press). All analyses in sensor space were carried out by converting the 2-D sensor × 1-D time data into a 3-D volume, which allowed us to carry out corrections for multiple comparisons using the tools developed for fMRI data (see Methods). Reported statistical results are based on clusters of samples that exceeded a threshold.
In our analyses of the MEG data, we compared the relative ability of three different learning-related variables, taken directly from the behavioral models discussed above, to predict the change in the MEG response with learning (Figure 4). We examined the button probabilities under the flat (PF(B)) and hierarchical models (PH(B)), and we also considered the probability of the sequence (P(S)). In all cases, these probabilities were derived on a movement-by-movement basis with the models fit to the behavioral data of the individual subjects, including separate modeling of positive and negative feedback, as described above. These probabilities represent how much has been learned about the sequence before the current button press. Thus, they represent the subjects' knowledge of which button should be pressed or which sequence is correct in the current block. We also included the button that was pressed and the interaction between the probability and the button that was pressed in the analysis (see Equation 12).
As we were interested in studying the learning effect, we only carried out our analysis on blocks in which the subjects completed eight sequences correctly, although results in sensor space were similar when all blocks were used (data not shown). There were no significant clusters for the button probability under the flat model (PF(B); p = .24, t test, cluster level, df = 11), and there was one cluster that just missed significance under the button probability for the hierarchical model (PH(B); p = .052, t test, cluster level, df = 11). The sequence effect, however, had two significant clusters (P(S); p < .05, t test, cluster level, df = 11). One had a maximum at 83 msec before button press (Figure 5) and extended above threshold from 108 to 67 msec before button press. The other had a maximum at 67 msec after button press (Figure 6) and extended above threshold from 58 msec after until 150 msec after button press. There was also a significant interaction between the sequence probability and the button that was pressed in the cluster that followed the button press (p < .05, t test, cluster level, df = 11, max at 67 msec, above threshold from 58 to 100 msec after button press), but not for the cluster that preceded the button press.
These modulations of the MEG response manifested as changes in the temporal evolution of the signal just before and after button press (Figure 5B); that is, the temporal response was different depending on how well the sequence had been learned in the current block. In all cases, there was an increase in the signal early in the block, when the sequence was not well learned (P(S) = 0.33), leading to a decrease later in the block when the sequence was well learned (P(S) = 1.0). This could also be seen in sensor space, as the parametric regressor was negative in the region with significant effects (Figure 5C). Thus, as the probability increased across the block (Figure 4), the MEG signal decreased. Interestingly, these learning-related differences occurred before and after the peak response (Figure 5B). This suggests that the peak response reflects a purely motor effect, which is not modified by learning, whereas the learning effects are related to preparatory and postmotor processing.
As there were differences in how well the hierarchical and nonhierarchical models predicted the behavior of individual subjects, we were interested in whether there would be correlations between the relative fit of the models and how well the sequence model fit the behavior of individual subjects. To test this, we examined the correlation of the relative fit of the models, measured with the log-likelihood ratio, and the contrast estimates for individual subjects. We did not, however, find that there were significant correlations, after corrections for multiple comparisons, in the fit of the models and in the contrast estimates for the sequence model.
Given that these effects were only significant under the sequence model and not the flat model, we next assessed whether the contrast estimates (i.e., the parametric regressors from the GLM) were significantly larger for the sequence model than for the flat model by comparing the distribution of contrast values across subjects between the two models at the peak location, 83 msec before and 67 msec after button press. Neither of these distributions was significantly different (p = .55, unequal variance t test, n = 24). We did, however, find that there was a significant difference in the variance of the second level contrast distribution at 83 msec before (p < .05, F test, df = 11, 11) and an almost significant difference at 67 msec after (p = .056, F test, df = 11, 11) between the sequence model and the flat model. Thus, the increased significance under the sequence model is due to lower variance in the second level contrasts as opposed to a larger mean of the parametric modulator.
Next, we used source localization to identify the possible locations of the sequence probability effects. Significant effects in source space were assessed by estimating distributed activation levels on a cortical mesh for individual subjects and then carrying out second level statistics in SPM on these activation levels. We first assessed the source of the probability effect seen at −83 msec, using a window from −117 to −67 msec. We used a window slightly larger than the window over which the sensor effect was significant as the algorithm rarely converged for small windows. We found a significant source (p < .05) bilaterally in the insula, just lateral to the striatum (Figure 5D).
Next, we carried out source localization for the significant effect 67 msec after button press, using a window from 50 to 150 msec after button press. We found a significant source for the probability effect at this time in premotor cortex (Figure 6D; p < .05). There was an additional significant source (p < .05) in early visual cortex bilaterally (left side x = −28, y = −82, z = −14), perhaps reflecting an attentional effect on the assessment of the feedback and a source at the frontal pole (x = 12, y = 64, z = −12).
We examined the behavioral and neural correlates of learning in a stochastic sequence learning paradigm. Comparing flat and hierarchical behavioral models suggested that both predicted the subjects' decisions equally well. However, only the sequence probability from the hierarchical model resulted in significant correlations with the MEG signal, and this significance was due to less variance in the contrast estimates across subjects. When we localized the learning-related signals, we found a cluster of activity in the insula that preceded the button press and a cluster in the premotor cortex that followed the press.
Our sequence task had important features that allowed us to assess how well subjects were learning and whether they were optimal. The learning coefficients show that subjects were not optimal, as positive and negative feedback should have been weighted similarly, whereas in fact subjects relied more on positive feedback, as has been seen previously in sequence learning (Averbeck, Sohn, & Lee, 2006). In the task, a hierarchical structure is optimal, whereas a flat structure is not. The behavioral data suggested that subjects who learned better tended to learn in a more hierarchical manner. Perhaps additional training on the sequences would have benefitted the subjects who did not learn as well. Future experiments could clarify this point.
Two different learning-related signals emerged in the MEG sensor data, one just before button press and one just after button press. The early signal was near the midline, whereas the later signal was lateralized over the right side, although the left side signal may have been just below significance. When we carried out source localization on these two signals, we found a source in the insula for the early signal and a source in premotor cortex for the later signal. Some caution in interpreting these results is necessary, however, as it is difficult to know how precisely the MEG sensor signals can be localized.
Previous fMRI studies have shown activation in the insula during motor learning (Floyer-Lea & Matthews, 2004), and this area has a direct projection to the striatum (Chikama, McFarland, Amaral, & Haber, 1997) and as such it likely takes part in a network of areas related to updating actions on the basis of feedback. Previous work has also shown activity in this area during outcome anticipation that is either negatively or positively valenced (Knutson & Greer, 2008; Volz, Schubotz, & von Cramon, 2004; Critchley, Mathias, & Dolan, 2001). Although the MEG signal that was localized to the insula preceded the button press by 83 msec, it is likely that the press has already been initiated at the cortical level at this time. Therefore, this signal may represent anticipation of either a red or a green outcome, where the anticipation is scaled by how much the subject has learned in the block. Once the sequence is well learned, green feedback is highly likely. It is also interesting that, unlike the signal that follows the button press, this signal was not modulated by the button that was pressed as we did not find an interaction effect in the sensors. This makes it unlikely that this signal was directly involved in learning, as there was no information about the action. This signal may be more related to one's subjective sense of progression through the block, as many studies implicate the anterior insula in subjective interoception (Craig, 2009).
Activation in premotor cortex has been seen in tasks with hierarchical structure (Koechlin & Jubault, 2006; Schubotz & von Cramon, 2003, 2004), and the sequence probability is a hierarchical quantity, as it represents the entire series of button presses in the order in which they unfold. Because the later, premotor signal showed an interaction between sequence probability and the button that was pressed, it may have a more direct role in updating the probability information on the basis of the feedback.
A series of studies by Koechlin and colleagues has also suggested that when tasks have explicit hierarchical structure, task factors that map to different levels of the cognitive hierarchy map to different locations in frontal cortex (Badre & D'Esposito, 2007; Koechlin & Summerfield, 2007; Koechlin & Jubault, 2006; Koechlin, Ody, & Kouneiher, 2003). The behavior being studied in their work, however, differs in important ways from the behavior we have studied. Specifically, the studies by Koechlin et al. did not examine decision making in a framework where subjects had to deal with uncertainty about the relationship between actions and outcomes. Rather, the previous studies used rule-based cognitive tasks where the link between stimulus/action/feedback was deterministic given the behavioral rule. Stochasticity in these tasks was implied by the frequency with which the rule changed across blocks. However, all of the information was always provided by the task, and the mapping between actions and outcomes was deterministic. This is very different from the task we have used, which required subjects to deal with uncertainty in an effort to infer the sequence (rule) that was in operation. In our task, even if one knew which sequence was correct in a particular block, one would not be able to predict the feedback on an individual button press. As such, different cognitive processes are likely required to solve our task. The advantage of our approach is that we were able to test directly whether subjects were using hierarchical or flat statistical models when learning the sequences. Thus, we have provided imaging evidence for hierarchical control in a task that could have been solved using a flat model, although the behavioral data were more equivocal. It is not clear what the alternative model would be in the tasks used by Koechlin et al. In their experiments, however, this is less of an issue, as they were not studying learning but rather performance of a complex cognitive task.
One important caveat to the model comparison approach we have taken, with respect to the imaging data, is that we have examined significance by linearly correlating the model prediction with the MEG signal (Behrens, Hunt, Woolrich, & Rushworth, 2008; Wittmann et al., 2008; Behrens, Woolrich, Walton, & Rushworth, 2007; Daw et al., 2006). Linear correlations, however, do not allow one to infer conclusively that the underlying neural responses necessarily favor one model over the other. More specifically, the sequence probability is a nonlinear function of the button probability under the flat model. As such, there cannot be more information in the underlying neural response about the sequence variable than there is about the button probability because of the data processing inequality (Cover & Thomas, 1991). Thus, our inference relates only to the specific functional form that we have examined, the linear relationship, and does not tell us about the detailed neural representation of this probability. For this, single-unit studies can be more valuable, partly for practical reasons. It is in many cases possible to examine the relationship between single-unit firing rate responses and various behavioral variables graphically and fit models accordingly. Also, there are often more trials available for fitting more complex models. The high dimensionality of MEG data makes examining the relationship between time-varying signals and task variables highly complex. Interestingly, the data from the single-unit studies have consistently shown, in many brain areas and in many similar tasks, that sequence information is explicitly represented in the brain (Averbeck et al., 2006; Tanji, 2001; Nakamura, Sakai, & Hikosaka, 1998).
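The caveat about linear correlation can be illustrated with a toy simulation. The sketch below is not the analysis used in this study; it is a minimal example under stated assumptions. It assumes a four-press sequence, assumes that the sequence probability is the product of the individual button probabilities (the text states only that the relationship is nonlinear), and uses arbitrary probability ranges and noise levels. All names in it are hypothetical. A simulated signal is generated from the sequence probability and then correlated linearly with both the sequence probability and a single button probability.

```python
import math
import random


def pearson(x, y):
    """Pearson linear correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n_trials=2000, n_presses=4, noise_sd=0.05, seed=0):
    """Return (r_seq, r_button) for a signal driven by sequence probability.

    Assumptions (illustrative only, not taken from the study):
    - button probabilities are drawn uniformly from [0.25, 1.0]
    - sequence probability is the product of the button probabilities
    - the simulated signal is sequence probability plus Gaussian noise
    """
    rng = random.Random(seed)
    button_prob, sequence_prob, signal = [], [], []
    for _ in range(n_trials):
        probs = [rng.uniform(0.25, 1.0) for _ in range(n_presses)]
        seq = math.prod(probs)  # nonlinear function of the button probabilities
        button_prob.append(probs[0])  # probability of one button in the sequence
        sequence_prob.append(seq)
        signal.append(seq + rng.gauss(0.0, noise_sd))
    return pearson(signal, sequence_prob), pearson(signal, button_prob)


r_seq, r_button = simulate()
print(f"linear correlation with sequence probability: {r_seq:.2f}")
print(f"linear correlation with one button probability: {r_button:.2f}")
```

In this toy setting the signal correlates more strongly with the sequence probability than with a single button probability, even though the sequence probability is computed entirely from the button probabilities. A difference in linear correlation therefore reflects the functional form being tested, not a difference in the information available in the underlying variables.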
Comparison with Single-unit Studies
One of the goals of the present study was to address, in humans and with an imaging approach, a question we have already examined in macaques at the single-cell level (Averbeck et al., 2006). We had originally intended to use a task that was as similar as possible to the task used in the macaque study. The first difference between the two studies, however, is the task itself. Unlike macaques, which require several trials to learn a three-movement sequence with explicit feedback, human participants given a four-movement sequence learn it in about one trial (unpublished data). This rapid learning makes studying the learning process difficult, which is why we made the task more difficult by introducing stochastic feedback.
The second difference has to do with the nature of the information that can be extracted from single-unit data versus MEG imaging data. Specifically, in the macaque study, we were able to track learning by following the emergence of a signal in single neurons that explicitly represented the sequence that was correct in the current block. However, we were not able to extract sequence-specific information from the MEG signal (unpublished data). Thus, we had to use a different approach to examine the learning-related changes in neural activity. Despite this difference, the premotor source of the signal that follows the button press is in many respects comparable with the location we studied in the macaque, as the activity in the macaque was just anterior to the frontal eye fields (FEFs), and we used eye movements as our behavioral output in the macaque. Thus, premotor cortex and caudal area 46 may have similar functions for different effectors.
In conclusion, we found that when subjects learned efficiently, they learned hierarchically. Furthermore, the learning-related variable that most strongly correlated with the imaging data was the probability of the sequence, a parameter that is present in the hierarchical model but not in the flat model. Thus, the imaging data and, to some extent, the behavioral data suggest that when efficient subjects were faced with learning a sequence that had hierarchical structure, they were able to take advantage of that structure.
Reprint requests should be sent to Dr. Bruno B. Averbeck, UCL Institute of Neurology, Sobell Department, Box 28, Queen Square, London WC1N 3BG, UK, or via e-mail: firstname.lastname@example.org.