The dual-system model of sequence learning posits that during early learning there is an advantage for encoding sequences in sensory frames; however, it remains unclear whether this advantage extends to long-term consolidation. Using the serial RT task, we set out to distinguish the dynamics of learning sequential orders of visual cues from learning sequential responses. On each day, most participants learned a new mapping between a set of symbolic cues and responses made with one of four fingers, after which they were exposed to trial blocks of either randomly ordered cues or deterministic ordered cues (12-item sequence). Participants were randomly assigned to one of four groups (n = 15 per group): Visual sequences (same sequence of visual cues across training days), Response sequences (same order of key presses across training days), Combined (same serial order of cues and responses on all training days), and a Control group (a novel sequence each training day). Across 5 days of training, sequence-specific measures of response speed and accuracy improved faster in the Visual group than any of the other three groups, despite no group differences in explicit awareness of the sequence. The two groups that were exposed to the same visual sequence across days showed a marginal improvement in response binding that was not found in the other groups. These results indicate that there is an advantage, in terms of rate of consolidation across multiple days of training, for learning sequences of actions in a sensory representational space, rather than as motoric representations.
Many complex skills require learning to bind temporally distinct movements into a unified sequence of actions. For example, when learning to play a novel arpeggio on the piano, a student begins by playing each individual note in successive fashion. With long-term practice, she can eventually master executing whole phrases and melodies as a singular, unified action. This mastery of complex sequential skills can arise from learning at multiple levels. For example, the student may learn the piece by serially predicting the high-level action goals of the piece, as would occur if she memorized the written notes on the sheet of music or the sequence of spatial locations along the keyboard. In contrast, the student may learn action-specific motoric synergies such that individual finger movements become encoded as a singular, sequential action. Of course, these are not mutually exclusive, and learning can occur at many levels of representation.
The idea that multiple systems are recruited during sequential skill learning dates back to some of the first studies of sensorimotor sequence learning (for review, see Abrahamse, Jiménez, Verwey, & Clegg, 2010; Ashe, Lungu, Basford, & Lu, 2006). In their initial experiments with the serial RT task (SRTT), Nissen and Bullemer (1987) found that explicit awareness afforded an advantage when learning a novel sensorimotor sequence (see also Curran & Keele, 1993). More recent studies suggest that providing explicit knowledge of a cued sequence of movements immediately improves motor vigor and accuracy (Wong, Lindquist, Haith, & Krakauer, 2015), alluding to the possibility that high-level concept knowledge of action goals improves basic motoric efficiency in action execution.
Hikosaka, Nakamura, Sakai, and Nakahara (2002) formally proposed a dual-system model for sequence learning. According to their model, prefrontal cortico-basal ganglia and cortico-cerebellar networks learn sequences of high-level planning features, such as perceptually driven spatial goals. This system learns quickly, is effector independent, and has a relatively short time scale of retention, on the order of a few seconds or minutes. In contrast, cortico-basal ganglia and cortico-cerebellar circuits, routed through the primary motor cortex, learn sequences of motoric actions. This motor sequence system learns slowly, is effector specific, and has a very long time scale of retention. According to their model, the basal ganglia circuits evaluate reward likelihoods of individual actions, whereas the cerebellar circuits monitor execution errors. Several behavioral (Albouy et al., 2013; Witt, Margraf, Bieber, Born, & Deuschl, 2010; Cohen, Pascual-Leone, Press, & Robertson, 2005; Willingham, Wells, Farrell, & Stemwedel, 2000; Willingham, 1999) and neuroimaging (Albouy et al., 2015; Rose, Haider, Salari, & Buchel, 2011) studies have also validated the general dichotomy proposed by Hikosaka and colleagues (2002). This dual-system model is also consistent with more recent models of general response planning, where potential decisions compete on two separate levels: a competition between the relative values of action goals and a competition between low-level execution response costs (Cisek, 2012).
Along with representing sequential goals or actions, another key aspect of skill learning is the ability to bind individual responses together into a unified action (Lashley, 1951), and this binding results in movements that can be parceled into meaningful chunks (Verwey, 1994). In the context of cued motor actions, chunking is typically studied by training a participant to learn a simple (i.e., three to nine items) sequence. Here, the repetition structure of specific elements in the sequence is manipulated so as to be easily detectable (e.g., “1-2-3” or “1-3-1”). With practice, the first item in the concatenated set of actions exhibits a slower RT than the rest of the elements in the set. A larger RT difference between the first and the second action or between the first, second, and third key press reveals the bound responses. This slowing is used as an index of the segmentation of the learned chunk (Verwey, Abrahamse, Ruitenberg, Jiménez, & de Kleine, 2011; Verwey, Abrahamse, & Jiménez, 2009; Kennerley, Sakai, & Rushworth, 2004; Verwey & Eikelboom, 2003; Verwey, Lammens, & Van Honk, 2002; Verwey, 1996, 2001). Several lines of evidence suggest that this type of response binding may happen upstream from motor execution systems: Chunking is correlated with working memory capacity, but not simple motor production abilities (Bo, Jennett, & Seidler, 2011; Bo & Seidler, 2009; Bapi, Doya, & Harner, 2000), chunking efficiency is context-specific (i.e., chunks of one sequence do not transfer to another sequence with similar structure; Verwey, 2001), the structure of chunked responses is not affected by manipulations of execution parameters (e.g., target distance, effector; Sakai, Kitaguchi, & Hikosaka, 2003), and chunking performance is impaired by disruptions of striatal dopamine pathways, suggesting that reinforcement learning mechanisms contribute to binding actions together (Tremblay et al., 2009, 2010). 
More nuanced signatures of response binding emerge only after prolonged training, particularly as behavioral complexity increases (Wymbs, Bassett, Mucha, Porter, & Grafton, 2012). For example, in both humans (Verstynen et al., 2012) and nonhuman primates (Acuna et al., 2014), only after many days of practice do sequentially produced actions develop a temporally correlative structure in their response dynamics that reflects being part of a common command that persists across individual cue–response trials. These findings are consistent with the observation that motoric sequence learning uses coarticulation mechanisms to tune activity patterns of trained effectors (e.g., fingers) when repeatedly paired together (Verwey & Clegg, 2005).
Although aspects of skill crystallization, such as binding and chunking, require several days of training, to date it remains unclear whether one level of representation has an advantage for binding sequential representations over other levels of representation. Diedrichsen and Kornysheva (2015) proposed that response chunking during movement sequencing occurs at lower, motoric levels of representation. According to this hypothesis, long-term skill learning entrains synergies of motor primitives together so that they are triggered by a singular upstream motor command. This hierarchical model of sequence chunking is effector-dependent and predicts a specific advantage for repeating sequences of individual actions together, regardless of how they are cued. An alternative model is one in which sequences of independent action plans or sensory goals get bound with training, leaving low-level motor synergies contingent on high-level action plans. This is consistent with the fact that learning is facilitated by explicit awareness of sequence structure (Curran & Keele, 1993; Nissen & Bullemer, 1987). It predicts that response binding will be effector-independent and that serial ordering of cues will be advantageous for chunking actions together. Alternatively, response binding could happen between levels of representation. For example, previous behavioral models have proposed that learning of event sequences in the SRTT is mediated by learning a relationship between a current manual response and the stimulus cue that follows it (Ziessler & Nattkemper, 2001; Ziessler, 1998).
Here we set out to disambiguate long-term (i.e., multiday) sequence learning at sensory levels of representation (i.e., visual cues) from representations at motoric levels (i.e., manual responses). Specifically, we wanted to evaluate which level of representation better facilitates learning rate across multiple days of training and temporal correlation measures of binding. The focus on multiday learning, as opposed to learning within one or two training sessions, is necessary because stable response binding is only observed after multiple days of training (Acuna et al., 2014; Verstynen et al., 2012; Wymbs et al., 2012). To do this, we used a version of the SRTT to train participants to learn an embedded 12-item sequence. By remapping the cue–key associations on each day of training, we could independently train sequences of cues from sequences of responses across five consecutive days. Multiple measures of learning, including measures of response binding, show that sequences are learned faster across days when they are presented in a visual modality, but that learning may interact across multiple levels of representation as sequential actions are bound into a unified chunk.
Healthy college-aged adults (n = 62) were recruited from the Carnegie Mellon University population and gave written consent to participate. All participants were financially compensated for their time. Inclusion criteria for participation required unimpeded use of the right hand, no history of carpal tunnel syndrome, and no familiarity with the Cyrillic alphabet. Two participants dropped out before completing the full 5 days of training, and these participants were not included in the final analysis. Carnegie Mellon University's institutional review board approved all testing procedures.
Serial RT Task
All participants were tested in a version of the SRTT. At the beginning of each training day, participants were explicitly told the mapping of a specific cue to an individual finger and given three practice repetitions of cue-to-digit mapping with guided instruction. The cues consisted of four Cyrillic symbols (“Ж,” “Є,” “Њ,” and “Л”) presented individually in white font on a black background (Figure 1A) on a 24-in. Asus monitor. Participants were instructed to respond as quickly as possible to the cue using one of four keys on the keyboard corresponding to the right index (1; “j” key), middle (2; “k” key), ring (3; “l” key), and pinky (4; “:/;” key) fingers, respectively (see group definitions below). After a familiarization period to get participants accustomed to the cue-to-digit mapping, training commenced. The experimental blocks were divided into two types: random blocks and sequence blocks. The random blocks (Trial Blocks 1, 2, and 5) were composed of 264 stimulus–response pairings with the constraint that each element occur 25% of the time. The trial-by-trial order of cues was selected pseudorandomly with the restriction to not allow the repetition of two contiguous stimuli. The first two random blocks were designed to consolidate the cue-to-digit remapping on each day so that this was well learned by the start of the sequence blocks. The sequence blocks (Trial Blocks 3, 4, and 6) were composed of 22 repetitions of stimuli from a 12-trial sequence: 1, 3, 2, 1, 4, 3, 1, 4, 2, 3, 4, 2. Each block began at a random part of the sequence so as to minimize immediate identification of the sequential pattern. Participants had 600 msec to respond to the symbol once it was shown on the screen. Not responding fast enough or responding with the wrong key resulted in the symbol turning red for 50 msec. A correct response within the allotted time resulted in the symbol turning green for 50 msec. 
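The block structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' stimulus code: the function names (`make_random_block`, `make_sequence_block`) and the particular way of sampling a no-repeat order with exactly balanced counts are hypothetical choices, since the text does not specify how the pseudorandom order was drawn.

```python
import random
from collections import Counter

# 12-item sequence reported in the text (elements are digit/cue indices 1-4).
SEQUENCE = [1, 3, 2, 1, 4, 3, 1, 4, 2, 3, 4, 2]


def _can_finish(counts, prev):
    """True if the remaining items can still be ordered with no immediate
    repeats, given that the next item may not equal `prev`."""
    n = sum(counts.values())
    if n == 0:
        return True
    if any(c > (n + 1) // 2 for c in counts.values()):
        return False
    return counts.get(prev, 0) <= n // 2


def make_random_block(n_trials=264, elements=(1, 2, 3, 4), rng=random):
    """Pseudorandom block: each element occurs equally often (25% for four
    elements) and no element appears twice in a row.
    ASSUMPTION: the sampling scheme is illustrative, not the original one."""
    per_element, remainder = divmod(n_trials, len(elements))
    if remainder:
        raise ValueError("n_trials must be divisible by the number of elements")
    counts = Counter({e: per_element for e in elements})
    block, prev = [], None
    for _ in range(n_trials):
        candidates = []
        for e in elements:
            if e == prev or counts[e] == 0:
                continue
            counts[e] -= 1            # tentatively place e ...
            if _can_finish(counts, e):
                candidates.append(e)  # ... and keep it if the rest is solvable
            counts[e] += 1
        prev = rng.choice(candidates)
        counts[prev] -= 1
        block.append(prev)
    return block


def make_sequence_block(n_reps=22, rng=random):
    """Sequence block: 22 repetitions of the 12-item sequence, entered at a
    random position within the sequence."""
    start = rng.randrange(len(SEQUENCE))
    n = len(SEQUENCE)
    return [SEQUENCE[(start + i) % n] for i in range(n_reps * n)]
```

Both block types contain 264 trials (22 repetitions of 12 items), so random and sequence blocks are matched in length.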
All testing procedures were run using Psychtoolbox (version 3.0; Brainard, 1997) on an Ubuntu Linux system.
For most participants, the mapping from cue-to-digit was reset at the beginning of each training day (Figure 1B). This mapping was randomly determined on each day and allowed us to disambiguate sequences of cues from sequences of movements across all training days. Participants were assigned to one of four groups that defined the temporal trial structure of the sequence blocks. For the Visual group (n = 15, 6 men), the 12-item sequence defined the order of visually presented cues in the sequence block on all days; however, because of the remapping, the order of key presses varied from day to day. For the Response group (n = 15, 6 men), the sequence defined the order of key presses performed in sequence blocks on all days, but the trial-by-trial order of the visual cues changed from one day to the next. The Combined group (n = 15, 5 men) was not exposed to remapping on each day, allowing them to be exposed to the same sequence of cues and manual responses on all training days. Finally, as a control for within-day learning rates, the Control group (n = 15, 7 men) learned a novel sequence on each day by remapping both the cue-to-key mapping and the cue ID mapping (i.e., which symbol is mapped to which cue). Because of this, it was possible for the Control group to have repeating cues during sequence blocks, whereas no other group could have the same symbol or key press in a row. This difference between groups is adjusted for in the data analysis (next section).
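To make the remapping logic concrete, the following Python sketch shows how a daily cue-to-key mapping dissociates the order of cues from the order of key presses. It is a hypothetical illustration: the function names and the convention that sequence element 1 refers to the first listed symbol are assumptions, not details given in the text.

```python
import random

SEQUENCE = [1, 3, 2, 1, 4, 3, 1, 4, 2, 3, 4, 2]  # 12-item sequence from the text
CUES = ["Ж", "Є", "Њ", "Л"]                       # the four Cyrillic cues
KEYS = [1, 2, 3, 4]                               # index, middle, ring, pinky


def daily_mapping(rng=random):
    """Draw a random one-to-one cue-to-key mapping for one training day."""
    keys = KEYS[:]
    rng.shuffle(keys)
    return dict(zip(CUES, keys))


def visual_group_day(mapping):
    """Visual group: the cue order is fixed by the sequence, so the key
    presses depend on that day's mapping.
    ASSUMPTION: element i of the sequence indexes the i-th listed cue."""
    cues = [CUES[i - 1] for i in SEQUENCE]
    keys = [mapping[c] for c in cues]
    return cues, keys


def response_group_day(mapping):
    """Response group: the key order is fixed by the sequence, so the cues
    shown depend on that day's mapping."""
    key_to_cue = {k: c for c, k in mapping.items()}
    keys = SEQUENCE[:]
    cues = [key_to_cue[k] for k in keys]
    return cues, keys
```

Calling `visual_group_day` with two different daily mappings returns the same list of cues but different key presses; `response_group_day` shows the reverse pattern.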
Because sequences were somewhat easier to identify in the Control group (see Figure 3), sequence learning scores were normalized to Day 1 performance to measure across day learning. This removes any biases in sequence structure between groups. A mixed between- and within-factor ANOVA was applied to these normalized learning scores to measure main effects of and interactions between Training day and Groups.
To examine the intertrial binding of RTs during both probe blocks, the first 12 trials (i.e., the first sequence run) were excluded from analysis, because these trials often exhibited an exponential decrease in RTs during the sequence probe block on later training days (see Verstynen et al., 2012). The linear trend in the remaining trials was removed using an ordinary least squares linear regression, and the time series was mean-centered. Autocorrelation analysis was then performed on this detrended vector using the xcorr.m function in Matlab. The autocorrelation was estimated independently for the random and sequence probe blocks on each day for each participant. The number of consecutive lags with significant positive correlations was then used as the signature for binding (see Verstynen et al., 2012). This autocorrelation analysis was only performed on the last two test blocks.
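The analysis steps above can be sketched in Python as follows. This is a minimal illustration rather than the original Matlab pipeline. In particular, the significance criterion is not stated in the text, so the sketch assumes an approximate 95% bound of 1.96/√N for a white-noise series; the function names and the `max_lag` default are likewise hypothetical.

```python
import math


def detrend(rts):
    """Remove an ordinary least squares linear trend and mean-center."""
    n = len(rts)
    t_mean = (n - 1) / 2
    y_mean = sum(rts) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(rts))
    slope = sxy / sxx
    resid = [y - (y_mean + slope * (t - t_mean)) for t, y in enumerate(rts)]
    resid_mean = sum(resid) / n
    return [r - resid_mean for r in resid]


def autocorrelation(x, max_lag):
    """Autocorrelation for lags 0..max_lag, normalized so lag 0 equals 1."""
    n = len(x)
    denom = sum(v * v for v in x)
    return [sum(x[t] * x[t + k] for t in range(n - k)) / denom
            for k in range(max_lag + 1)]


def binding_lags(rts, n_skip=12, max_lag=12):
    """Count consecutive lags (starting at lag 1) with a significant
    positive autocorrelation, after dropping the first `n_skip` trials."""
    x = detrend(rts[n_skip:])
    acf = autocorrelation(x, max_lag)
    # ASSUMPTION: approximate 95% bound for a white-noise series; the
    # criterion used in the original analysis is not given in the text.
    bound = 1.96 / math.sqrt(len(x))
    count = 0
    for r in acf[1:]:
        if r <= bound:
            break
        count += 1
    return count
```

A slowly drifting RT series yields a run of positive early lags, whereas a series that alternates fast and slow responses yields a count of zero because its lag-1 correlation is negative.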
Finally, at the end of each training session, a post hoc verbal questionnaire was administered to gauge explicit learning of embedded sequences. Each participant was asked the same questions, starting with “Did any of the stimuli appear different than each other?” If the participant answered “no” or did not state any of the keywords (“pattern,” “sequence,” “sequential,” “order,” or “ordering”), they were given a score of “0” and the session was concluded. If the participant answered “yes” and was capable of successfully replicating the entire sequence, they were given a score of “4” and the session was concluded. If the participant was aware of the existence of a pattern without full explicit knowledge of the sequence, they were given a score ranging from 1 to 3: awareness of a pattern and use of any of the mentioned keywords received a score of “1”; awareness that the pattern was always present, a score of “2”; and replication of any part of the sequence, a score of “3.” A score of “2” served as the threshold for explicit awareness of the nested sequences.
Speed and Accuracy
With training, all groups showed improvements in both RT (Figure 2A, C, E, G) and accuracy (Figure 2B, D, F, H) during the sequence blocks relative to the random blocks. To better understand this sequence-specific learning on each day, a learning score (δ) was calculated for both RT and accuracy using performance in the last random probe block as a control for general increases in response speed (see Methods, Equations 1–2). Modest gains in both RT and accuracy were observed during the sequence blocks on the first day of training across all groups, as seen in the positive δRT and δACC scores (all one-sample ts > 2.26, ps < .02; Figure 3A, C). However, there was a significant main effect of Group in the learning scores for both δRT (F(3, 8) = 8.80, p = .006) and δACC (F(3, 8) = 10.33, p = .004) on the first day of training. Post hoc t tests revealed that learning scores were substantially higher for the Control group than any of the other groups. This was due in large part to the regular presence of repeats in the sequences presented to this group (average of two repeated response pairs on each training day). Therefore, to account for this within-session bias between the groups, all across-day learning patterns were normalized to performance on the first day of training.
After normalizing to first day performance, all groups showed an increase in sequence-specific response speeds across the remaining training days (Figure 3B; F(3, 168) = 21.03, p < .001). However, there was a significant main effect of Group on the across-day RT learning (F(3, 56) = 3.84, p = .014), with an advantage for the Visual group in sequence-specific learning. Although the interaction between Group and Training day was not significant (F(9, 168) = 1.02, p = .427), a significant main effect of Group was found on Training Days 3 (F(3, 56) = 4.05, p = .011) and 4 (F(3, 56) = 3.17, p = .031), with a marginal effect of Group on Days 2 (F(3, 56) = 2.26, p = .060) and 5 (F(3, 56) = 2.21, p = .097).
Accuracy during sequence blocks also improved across training days, with a significant main effect of Training day on accuracy learning scores (Figure 3D; F(3, 168) = 3.92, p = .009). Although the day-by-day means were more variable than what was observed in the RT learning score, there was still a significant main effect of Group on accuracy learning scores (F(3, 56) = 4.36, p = .007). There was also a trend for an interaction between Group and Training day for accuracy (F(9, 168) = 1.88, p = .057). The main effect of Group on accuracy learning was not significant on Training Day 2 (F(3, 56) = 1.76, p = .165) and was marginal on Day 4 (F(3, 56) = 2.66, p = .057). The main effect of Group was significant on Days 3 (F(3, 56) = 6.54, p < .001) and 5 (F(3, 56) = 3.39, p = .024).
To get a more direct estimate of the rate of consolidation across training days, we calculated the average between-day change in sequence learning (Δ, Methods, Equation 3). There was a significant main effect of Group on RT learning rates (Figure 4A; F(1, 58) = 6.23, p = .015). Post hoc t tests revealed that this main effect was driven largely by the Visual group, which learned significantly faster than the Response group (t(28) = 2.18, p = .038) and marginally faster than the Combined (t(28) = 1.88, p = .071) and Control (t(28) = 1.95, p = .061) groups. There was not a significant main effect of Group on accuracy learning rates (Figure 4B; F(1, 58) = 2.64, p = .110).
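As a rough illustration of this rate measure, the sketch below computes the mean of the day-to-day differences in a participant's learning scores. Equation 3 is not reproduced in this section, so this reading of "average between-day change" is an assumption, and the function name is hypothetical.

```python
def average_between_day_change(scores):
    """Mean of successive day-to-day differences in a learning score.
    ASSUMPTION: this is one plausible reading of the rate measure; the
    exact formula (Methods, Equation 3) is not shown in this section."""
    if len(scores) < 2:
        raise ValueError("need scores from at least two training days")
    diffs = [later - earlier for earlier, later in zip(scores, scores[1:])]
    return sum(diffs) / len(diffs)


# Hypothetical learning scores for Days 1-5 (arbitrary units).
print(average_between_day_change([0, 10, 15, 30, 40]))  # 10.0
```

Under this definition, the measure reduces to the difference between the last and first scores divided by the number of between-day intervals, so it summarizes overall gain per day rather than the shape of the learning curve.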
Temporal binding of RTs was estimated by calculating the autocorrelation of trial-wise RTs in the last random (Block 5) and sequence (Block 6) blocks (Verstynen et al., 2012; see Methods). Bound responses should exhibit a significant positive correlation in RTs for early lags. During the random probe block, we observed significant peaks at Lags 4 and 8 that strengthened with training (Figure 5A). This means that participants exhibited a mild correlation in response speeds every fourth key press, indicating rhythmicity in their natural production regardless of whether the sequence was present or not. During the final sequence block, this phasic four-lag pattern was also present (Figure 5B); however, we also detected a secondary increase in autocorrelation values at early lags across training days.
To estimate sequence-specific response binding, we subtracted the first three lags of the autocorrelation function of the sequence probe block from the first three lags of the random probe block (Methods, Equation 4). There was a marginal effect of Group on changes in sequence-specific binding between the first and last days of training (Figure 5C; F(1, 58) = 3.04, p = .087). Post hoc t tests revealed that only the Visual (t(14) = 4.23, p < .001) and Combined (t(14) = 2.03, p = .031) groups had a significant change in binding with training.
Awareness of the Sequence
Explicit knowledge of the sequence can provide a performance advantage in response speed and accuracy (Wong et al., 2015; Curran & Keele, 1993; Nissen & Bullemer, 1987). To see whether the groups differed in explicit awareness of the sequence, we analyzed the posttraining questionnaire scores for all participants and all days (Figure 6A). As expected, questionnaire scores increased with training, confirming that explicit awareness of the sequence improved. To test for group differences in awareness, we ran a logistic regression to estimate when participants were able to replicate any part of the sequence (i.e., when their posttest questionnaire score passed a value of 2) on each training day (Figure 6B). Although there was a trend for the Combined and Control groups to be more explicitly aware of the sequence than the Response and Visual groups on the first 2 days of training, we did not detect a significant Group effect on any training day (all Fs < 2.17, all ps > .102). Thus, despite robust differences in the performance measures, the groups did not differ in their explicit awareness of the sequence.
Here we were able to disambiguate long-term, multiday learning of serial orders of visual cues from serial orders of responses, within the same experimental paradigm, by reordering the cue–response mapping on each day of training. We found that being exposed to the same order of visual response cues confers an advantage for long-term sequential skill learning across training days, particularly for response speed and accuracy, whereas learning to execute serially ordered actions has a much slower learning rate. We also found that binding of temporally adjacent actions occurs primarily when visual sequences are repeated across days, but not when manual response sequences are repeated across days. These findings are consistent with the dual-system model of sequence learning (Hikosaka et al., 2002), where sequences of response selections can be represented in both sensory and motor reference frames; however, our results extend this work by showing that the representation of sequential orders of cues is learned faster across days than sequences of motoric responses, and this difference does not relate to differences in explicit awareness of the sequence structure.
At first glance, it is tempting to think of the difference in learning rates between the Visual and Response groups as reflecting the differential recruitment of explicit and implicit learning systems, respectively. Previous studies have found that explicit awareness of a trained sequence improves task performance and the efficiency of learning relative to conditions where participants are not aware of the sequential structure (Hazeltine, Grafton, & Ivry, 1997; Willingham, Peterson, Manning, & Brashear, 1997; Grafton, Hazeltine, & Ivry, 1995; Curran & Keele, 1993; Nissen & Bullemer, 1987). Indeed, this behavioral distinction between explicit and implicit learning is associated with differences in neural circuits that are recruited during learning. Frontal, parietal, striatal, and medial-temporal lobe circuits have all been associated with the efficiency of explicit sequence learning (Reber & Squire, 1998; Willingham, 1998; Hazeltine et al., 1997; Grafton et al., 1995), whereas striatal, cerebellar, and premotor circuits are commonly associated with implicit sequence learning (Peigneux et al., 2000; Vakil, Kahan, Huberman, & Osimani, 2000; Owen, Doyon, Dagher, Sadikot, & Evans, 1998; Berns, Cohen, & Mintun, 1997; Doyon, 1997; Rauch & Savage, 1997; Grafton et al., 1995; Knopman & Nissen, 1991). Some areas, such as the striatum and medial-temporal lobe, have been implicated in both explicit and implicit sequence learning (e.g., see Schendan, Searl, Melrose, & Stern, 2003; Turk-Browne, Yi, & Chun, 2006); however, the observation of at least partially distinct cortical and subcortical networks is consistent with a dual-system model of sequence learning.
It is important to point out that our posttest awareness assessment did not reveal a difference in explicit awareness between any of the four groups, although it did show that explicit awareness was generally high for all groups. It is entirely possible that the questionnaire itself encouraged participants to attend to the sequence and thus quickly become fully aware of it by the second day of testing. If so, it is possible that awareness confers a specific advantage for detecting visual sequences, but not response sequences. In this way, explicit awareness could bias learning when visual sequences are repeated across days. Thus, it is possible that a strictly implicit version of our task would show a different long-term learning advantage. It should be noted, however, that there is some evidence that sequences can be implicitly encoded in a perceptual reference frame, suggesting that awareness and level of representation may be independent aspects of sequence learning (Cohen et al., 2005).
The current results are also largely consistent with previous findings on the level of sensorimotor sequence representation. For example, Bapi and colleagues (2000) showed that, during short-term learning, sequences are encoded in an effector-independent reference frame. These findings were crucial to the initial development of the dual-system model proposed by Hikosaka et al. (2002). In a similar vein, Willingham (1999) found that, within a single training day, passive viewing of the visual sequence did not afford an advantage for future learned sequences and that transfer learning was facilitated when the sequence was maintained in the motoric reference frame rather than in the visual reference frame. Although at first this may seem to suggest that short-term sequences are stored as motoric representations, follow-up experiments (Willingham et al., 2000) showed an advantage for sequence learning when response location, rather than motoric action, was held constant. Thus, participants may learn the sequence of spatial locations rather than motor articulations. Taken all together, this previous work suggests that sensorimotor sequences may be encoded at the interface of stimulus (i.e., visual) and response (i.e., motoric) representations. Indeed, this is consistent with findings from a visual search variant of the SRTT showing that consolidation of motor sequences is stronger when a more consistent stimulus–response mapping is used (Ziessler & Nattkemper, 2001; Ziessler, 1998).
We should point out, however, that the performance of our Combined group is inconsistent with the hypothesis that sequences are learned as serial orders of stimulus–response pairings, as this hypothesis predicts that their overall performance should be as good as or better than that of the Visual group. Part of this discrepancy may be driven by differences between the indirectly cued SRTT, used here, and the visual search version of the SRTT (Ziessler & Nattkemper, 2001; Ziessler, 1998). The emerging consensus, instead, is that sequence learning happens at multiple levels of representation. For example, Dirnberger and Novak-Knollmueller (2013) used a stimulus–response remapping procedure to probe the nature of sequence representations within a single training day. They found that the degree of motoric and visual sequence learning was highly correlated within an individual and that motoric sequences were consolidated more quickly during training than perceptual sequences. The current results fit nicely within this multisystem framework by showing that long-term consolidation is possible when sequences are presented reliably as either serial orders of responses or visual cues, but that representing sequences in a visual reference frame affords an advantage for long-term, multiday sequence learning. In addition, we show that learning sequences of visual cues is advantageous not only for typical speed and accuracy measures but also for measures of temporal response binding.
Beyond the level of representation, the acquisition of sequential skills also appears to rely on multifaceted feedback signals to drive learning. Reinforcement learning algorithms, like the upper confidence bound algorithm (Auer, 2003) and temporal difference learning (Sutton & Barto, 1998), have had great success at solving serial order problems in machine learning (see Littman, 2015). This reinforcement learning hypothesis of sequence learning is consistent with observations of cortico-basal ganglia networks being recruited during long-term sequence learning (Wymbs & Grafton, 2015; Wymbs et al., 2012; Bassett et al., 2011; Bischoff-Grethe, Goedert, Willingham, & Grafton, 2002; Hazeltine et al., 1997) and with findings that impairments in dopamine function also impair sequence learning (Pendt, Reuter, & Müller, 2011; Shin & Ivry, 2003). Thus, dopaminergic reward pathways play a clear role in learning sequential actions. However, along with reward outcome signals, production errors are also essential to learning sequential actions. For example, in nonhuman primates, muscimol injections of the dentate nucleus, the primary output nucleus of the cerebellum, impair the production of overlearned sequences but not the production of novel sequences (Hikosaka, Miyashita, Miyachi, Sakai, & Lu, 1998). In humans, patients with lesions of the cerebellum are impaired at learning spatial-temporal sensorimotor sequences (Shin & Ivry, 2003). Some have argued, however, that cerebellar-dependent error correction only supports fast adaptive corrections during sequence production and does not directly contribute to learning the sequences themselves (Doyon, Penhune, & Ungerleider, 2003). Nonetheless, this would count as a contribution of production error signals to the efficiency of sequence learning. Within the context of the current study, it is possible that the learning of serial orders of cues relies on different feedback signals than learning sequential orders of responses.
For example, visual sequence acquisition may rely on cortico-striatal reinforcement learning systems, with very long time scales of consolidation, whereas sequence learning at the response level may rely on faster cerebellar-dependent error correction. Characterizing the precise feedback signals that contribute to cue and response learning is left to future studies.
More recent studies have suggested that feedback-free associative learning may also contribute to sequence learning (Acuna et al., 2014; Verstynen et al., 2012). For example, using a state space model, Verstynen and colleagues (2012) suggested that temporal pairing of two actions increases the likelihood that the current action primes the upcoming response. This associative process has many similarities with statistical learning (Saffran, Aslin, & Newport, 1996), whereby temporally associated events become bound or chunked together (for a review, see Perruchet & Pacton, 2006). Statistical learning has been proposed as a mechanism for naturalistic language acquisition by using passive experience to identify sequential or hierarchical structure in grammatical systems (Saffran, 2002). At the neural level, statistical learning has been associated with activity in both the hippocampus and early visual areas (Turk-Browne, Scholl, Johnson, & Chun, 2010; Turk-Browne, Scholl, Chun, & Johnson, 2009; Breitenstein et al., 2005), suggesting that the medial-temporal lobe may serve as a temporal association detector for high-level perceptual areas to bind representations for stimuli that follow regular temporal patterns. Indeed, as with other forms of hippocampal-dependent learning, the consolidation of complex response sequences is improved following a normal sleep cycle (Spencer, Sunm, & Ivry, 2006; Walker, Brakefield, Morgan, Hobson, & Stickgold, 2002). This common association with the hippocampus suggests that passive associative learning mechanisms may be at play during sequence learning, particularly when regularities of statistical structure occur in the perceptual domain.
The Visual group's learning advantage, particularly in binding, is also at odds with recent models of complex skill learning. Diedrichsen and Kornysheva (2015) proposed that binding multiple responses into a single chunk happens in levels of the motor planning hierarchy that are closer to motoric execution. According to this hierarchical model, the binding of low-level motor synergies reduces computational complexity by allowing for the recruitment of a single high-level motor command to produce a single spatiotemporal sequence of low-level muscle commands. This would predict that the Response group should exhibit an advantage for learning over the Visual group. In fact, our results show the opposite pattern. In both measures of response speed and response binding, the Visual group exhibited the greatest degree of learning. However, from a behavioral flexibility standpoint, it makes sense to have the binding process occur above the execution level. Pure binding of motor synergies, such that engaging one action automatically executes other bound movements, would lead to behavioral entrenchment. In many instances, this would prove problematic for adapting to changes in environmental context. For example, automatically typing the letters “h” and “e” every time the letter “t” is pressed on the computer keyboard, just because the word “the” is one of the most frequently typed words in the English language, would prove inefficient every time you desired to type any other word beginning with the letter “t.” But binding high-level motor plans or even representations of sensory cues that guide actions would allow for the efficiency of chunking actions together while maintaining flexibility at the motoric level to adapt to changes in context. In this way, the letters “t,” “h,” and “e” may be conceptually bound together into the word “the,” triggering a cascade of motor actions that type the target word, but just typing the letter “t” would not trigger the automatic execution of “the.” Thus, from this perspective, it would make sense that the group exposed to serially presented cues, presumably encoded at higher levels of representation, would learn sequences faster than the group exposed to consistent response sequences but different visual cues.
One unexpected result in the current study is that the Combined group, which was exposed to the same sequence at the cue and response level, had across-day learning rates that were lower than those of the Visual-only group and in a similar range to those of the Control group, which learned a new sequence each day. If learning sequences of responses were a pure top–down process, where high-level cue representations or action plans become bound together, then the Visual and Combined groups should have equal performance across training. On the other hand, if sequence learning happens via multiple learning systems and at multiple levels of representation that sum together, then the Combined group should outperform the Visual group because the sequential pattern is expressed at multiple levels of representation. Our results suggest an alternative, bottlenecking hypothesis, where learning happens at multiple levels; however, the slower learning rate at the response level constrains the efficiency of the overall output. Indeed, neuroimaging evidence suggests that, during sequence learning, striatal systems, implicated in learning response contingencies, and hippocampal systems, implicated in learning temporal associations of events, interact and possibly compete (Albouy et al., 2013, 2015). If the Visual group primarily learns temporal associations of visual cues, that is, relies mostly on hippocampal systems and less on striatal response sequencing systems that might have a slower learning rate, this would predict an advantage for learning in the Visual group over the Combined group. Although this hypothesis is intriguing from a theoretical perspective, more experimental and modeling work is needed to both validate this effect and understand its mechanistic underpinnings.
When considering how to interpret the present results, we should point out that, although we assume that the difference in performance between the Visual and Response groups reflects differences in learning rates at two levels of representation, we cannot rule out the possibility that learning only occurs at a single representational level. For example, the design of our experiment does not preclude the possibility that the entirety of sequence learning occurs at a high level of the sensorimotor hierarchy. According to this alternative hypothesis, the performance improvements observed in the Response group may simply reflect improvements in that group's ability to pick up on the sequential cue pattern on each consecutive day. Distinguishing single from multiple representational system models will require neuroimaging or neurophysiological measures that allow for explicitly measuring representational patterns in sequential coding (see Kriegeskorte, Mur, & Bandettini, 2008) and looking for where signatures of learning are expressed along the hierarchy from visual cues to motoric responses in frontoparietal networks. Finally, Franz and McCormick (2010) showed how the nature of verbal instruction biases performance in a seemingly procedural bimanual coordination task. It is possible that our instruction to respond quickly to the visual cue biased attention toward visual coordinate frames over motoric coordinates and thus afforded an advantage for visual learning. Future work should focus on how minor variations in instructions could impact this form of long-term skill learning.
Nonetheless, our current results highlight an explicit divide in how sequential information can be represented during long-term skill learning. Learning serial orders of visual cues occurs faster than learning sequential orders of responses, particularly for response speed and measures of response binding. This provides critical insights into the level of representation at which complex skills are encoded over long time scales of training and raises interesting questions about how learning at multiple levels may interact during complex skill learning.
This research was sponsored by the Pennsylvania Department of Health Formula Award SAP4100062201 and National Science Foundation CAREER Award 1351748. We would like to thank Alison Ting, Daniel Marchetto, Amira Millette, and Sophia Wilhelmi for assistance with data collection.
Reprint requests should be sent to Timothy Verstynen, Department of Psychology and Center for the Neural Basis of Cognition, Carnegie Mellon University, Pittsburgh, PA 15213, or via e-mail: firstname.lastname@example.org.