The dual-system model of sequence learning posits that during early learning there is an advantage for encoding sequences in sensory frames; however, it remains unclear whether this advantage extends to long-term consolidation. Using the serial reaction time (RT) task, we set out to distinguish the dynamics of learning sequential orders of visual cues from those of learning sequential responses. On each day, most participants learned a new mapping between a set of symbolic cues and responses made with one of four fingers, after which they were exposed to trial blocks of either randomly ordered cues or deterministically ordered cues (a 12-item sequence). Participants were randomly assigned to one of four groups (n = 15 per group): Visual sequences (same sequence of visual cues across training days), Response sequences (same order of key presses across training days), Combined (same serial order of cues and responses on all training days), and Control (a novel sequence each training day). Across 5 days of training, sequence-specific measures of response speed and accuracy improved faster in the Visual group than in any of the other three groups, despite no group differences in explicit awareness of the sequence. The two groups that were exposed to the same visual sequence across days showed a marginal improvement in response binding that was not found in the other groups. These results indicate that, in terms of the rate of consolidation across multiple days of training, there is an advantage for learning sequences of actions in a sensory representational space rather than as motoric representations.