Speech is perhaps the most sophisticated example of a species-wide movement capability in the animal kingdom, requiring split-second sequencing of approximately 100 muscles in the respiratory, laryngeal, and oral movement systems. Despite the unique role speech plays in human interaction and the debilitating impact of its disruption, little is known about the neural mechanisms underlying speech motor learning. Here, we studied the behavioral and neural correlates of learning new speech motor sequences. Participants repeatedly produced novel, meaningless syllables comprising illegal consonant clusters (e.g., GVAZF) over 2 days of practice. Following practice, participants produced the sequences with fewer errors and shorter durations, indicative of motor learning. Using fMRI, we compared brain activity during production of the learned illegal sequences and novel illegal sequences. Greater activity was noted during production of novel sequences in brain regions linked to non-speech motor sequence learning, including the BG and pre-SMA. Activity during novel sequence production was also greater in brain regions associated with learning and maintaining speech motor programs, including lateral premotor cortex, frontal operculum, and posterior superior temporal cortex. Measures of learning success correlated positively with activity in left frontal operculum and white matter integrity under left posterior superior temporal sulcus. These findings indicate speech motor sequence learning relies not only on brain areas involved generally in motor sequence learning but also those associated with feedback-based speech motor learning. Furthermore, learning success is modulated by the integrity of structural connectivity between these motor and sensory brain regions.
Producing a novel speech sound sequence, such as an unfamiliar cluster of consonants, is difficult even for fully developed, fluent speakers. Initial attempts are typically slow and error filled. With practice, however, coordinating the complex articulator movements becomes easier and learners produce the sequence more quickly and accurately and with less variability (Smits-Bandstra & De Nil, 2009; Namasivayam & van Lieshout, 2008; Smits-Bandstra, De Nil, & Saint-Cyr, 2006). This process, which we will call “speech motor sequence learning,” results in more stable and efficient articulator movements.
Despite its importance in the development and maintenance of fluent speech production, little is known about the neural mechanisms that underlie speech motor sequence learning. In comparison, a large body of results from neuroimaging, single-unit recording, and pharmacological lesion studies has reliably established the neural correlates of learning motor sequences of finger or eye movements. In general, learning is associated with activity changes in the lateral and medial premotor cortex (PMC), BG, and cerebellum (cf. Hikosaka, Miyashita, Miyachi, Sakai, & Lu, 1998).
In one of the few neuroimaging experiments examining speech motor sequence learning, Rauschecker, Pringle, and Watkins (2008) found that, over several covert repetitions of novel syllable sequences, activity decreased in lateral PMC, pre-SMA, superior temporal cortex, inferior frontal gyrus, and cerebellar cortex. This finding is largely consistent with those of non-speech motor sequence learning studies. However, covert motor practice has been shown to yield significantly smaller behavioral gains than overt practice (Feltz & Landers, 1983). Moreover, neural activity is significantly different during overt and covert speech, particularly in regions associated with motor sequence learning such as the lateral PMC, SMA, BG, and cerebellum (Pei et al., 2011; Shuster & Lemieux, 2005; Palmer et al., 2001). These findings suggest that learning from covert practice and learning from repeated articulation are not equivalent.
Here, we combined a behavioral learning paradigm with functional and structural neuroimaging to characterize the behavioral and neural correlates of speech motor sequence learning. We asked speakers to produce monosyllabic nonwords that contained consonant clusters that were either phonotactically legal (e.g., BLERK) or illegal (e.g., GVAZF). Prior studies of speech motor sequence learning have used legal phoneme sequences (Smits-Bandstra & De Nil, 2009; Namasivayam & van Lieshout, 2008; Rauschecker et al., 2008). Such sequences are relatively easy to produce even on the first attempt. In contrast, illegal phoneme sequences are initially difficult for speakers to produce (Altenberg, 2005; Hansen, 2001; Major, 1999), and performance improves substantially with practice (Hansen, 2001; Major, 1999), making illegal sequences well-suited for investigating speech motor sequence learning. To efficiently produce phonotactically illegal sequences, speakers must learn new speech motor programs—stored neural representations that encode the sequence of movements required to produce the utterance—that include the novel consonant clusters. In contrast, speakers can produce phonotactically legal sequences using existing consonant cluster speech motor programs.
Behavioral measures tested for performance improvements over two practice sessions during which participants repeatedly produced the utterances. fMRI was then used to compare brain activity during production of sequences that had been practiced to activity during production of equivalent sequences that had not been practiced. We also explored whether individual differences in speech motor sequence learning success were correlated with measures of brain structure and function as has been shown for speech and non-speech motor learning (Tomassini et al., 2011; Golestani & Pallier, 2007). To do so, we correlated participant performance with brain activity and with an estimate of white matter integrity derived from diffusion tensor imaging (DTI).
Eighteen right-handed native speakers of American English (10 women, aged 20–43 years, mean = 25.6 years) participated. All participants reported normal or corrected-to-normal vision and no history of hearing, speech, language, or neurological disorders. Informed consent was obtained according to the Boston University institutional review board and the Massachusetts General Hospital human research committee. Two participants (1 woman, ages 22 and 34 years) were removed from imaging analysis because of a large percentage of non-response errors (>25%).
Participants produced two types of monosyllabic pseudoword speech sequences that contained bi- or triconsonantal initial (onset) and final (coda) consonant clusters. Legal syllables (e.g., BLERK, THRIMF, TRALP) contained consonant clusters that are phonotactically legal in English, and illegal syllables (e.g., FPESHCH, GVAZF, TPIPF) contained consonant clusters that are illegal or highly infrequent in English but legal in some other natural language. None of the participants had prior experience with any languages in which these consonant clusters are legal. Each consonant cluster was used in only one syllable; no two syllables contained the same consonant cluster.
The number of phonemes per syllable was balanced across conditions. None of the syllables formed an orthographic or a phonological word found in the MRC Psycholinguistic Database. Stimuli were constructed to ensure participants perceived and produced targets as single syllables. Stimulus duration and amplitude were normalized using Praat (www.praat.org).
Before scanning, participants completed two practice sessions over consecutive days in which they repeatedly produced 15 legal syllables and 15 illegal syllables. Participants were divided into four groups, each of which practiced a different subset of the legal and illegal syllables. The illegal syllables that were not learned during the practice sessions were used as novel illegal stimuli during the imaging session. Assignment of illegal syllables to the learned and novel categories was counterbalanced across participants. Practice sessions occurred 1–2 days before scanning to allow for memory consolidation (Davis, Di Betta, Macdonald, & Gaskell, 2009; Fenn, Nusbaum, & Margoliash, 2003; Brashers-Krug, Shadmehr, & Bizzi, 1996). Each syllable was produced 32 times per practice session, with a total of 1920 utterances across both practice sessions. Syllables were presented in pseudorandom order.
Because participants were being asked to produce novel, illegal consonant clusters, they were presented with an auditory model of the target syllables. However, categorical judgments and electrophysiological responses have shown that listeners have difficulty distinguishing monosyllables (e.g., /lbIf/), which contain an illegal consonant cluster, from disyllables formed when a schwa, or neutral vowel, is inserted into that same cluster (e.g., /ləbIf/) when both are presented auditorily (Berent, Steriade, Lennertz, & Vaknin, 2007; Dehaene-Lambertz, Dupoux, & Gout, 2000). To ensure that the participant understood the speech target to be produced, we therefore coupled the auditory model with an orthographic representation of the syllable. Thus, during each practice trial, participants simultaneously saw an orthographic representation of the syllable to be produced for 1450 msec and heard a 480-msec recording of the syllable; visual and auditory stimulus onsets were aligned. Following stimulus presentation and a jittered pause of 500–1000 msec, a tone acted as a go signal to cue participants to produce the target syllable. Participant utterances were recorded for 1 sec with a Samson (Hauppauge, NY) C01U USB studio condenser microphone. Participants were asked to produce the syllables as quickly and accurately as possible, to replicate the auditory stimulus, and to produce all the sounds seen in the orthographic cue. Participants were also instructed to avoid inserting schwas within the consonant clusters. After instruction, but before the practice sessions, participants practiced five repetitions of two legal and two illegal syllables that were not used for the rest of the study. During these introductory trials, an experimenter provided feedback about production accuracy.
Practice session data from four participants were collected as part of a separate study in which the intertrial interval was 1 sec longer than that described above. However, the imaging paradigm was identical. No differences in learning-related behavioral measures or brain activity were associated with this longer intertrial interval according to two-sample t tests.
Behavioral Data Analysis
To evaluate speech motor sequence learning, we measured changes in the following three learning success indices over the practice sessions: (i) error rate, (ii) RT, and (iii) utterance duration. These indices are generally believed to quantify the ease or difficulty with which a participant produces a speech sequence (Sternberg, Monsell, Knoll, & Wright, 1978) and are commonly used in both motor learning and second language learning literatures as measures of learning extent (e.g., Rauschecker et al., 2008; Nakamura, Sakai, & Hikosaka, 1998).
Error rates were given by the percentage of incorrect productions among the first five productions of each syllable during each practice session. This was done for the first session to characterize performance early in the learning process and for the second session to provide a fair comparison of performance between the two sessions. Errors were defined as phoneme additions (including schwa insertions), deletions, and substitutions, and utterance repetitions, omissions, and restarts. A single rater judged errors for all trials. A subset of recordings (including recordings from the fMRI session) was also rated for errors by a second rater; the interrater reliability (Cohen's κ) was 0.7708 (Cohen, 1960). RT (time from the go signal to utterance onset) and duration (time from utterance onset to offset) measurements were based on the first five error-free trials of each syllable during each session. Utterance onset and offset were automatically labeled based on sound pressure level thresholds individually chosen for each practice session and manually adjusted when necessary. Less than 10% of utterances required manual intervention.
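As an illustration of the interrater reliability statistic reported above, the following minimal sketch computes Cohen's kappa for two raters who label the same trials. The function name `cohen_kappa` and the example ratings are hypothetical and are not taken from the study data.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: proportion of items given the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical ratings for ten trials (1 = error, 0 = correct production).
first_rater = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
second_rater = [1, 0, 0, 0, 0, 0, 1, 0, 1, 0]
kappa = cohen_kappa(first_rater, second_rater)
print(round(kappa, 4))
```

Kappa discounts the agreement expected by chance, which matters here because most trials are error free and two raters would agree often even if they rated at random.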
To assess learning-related changes due to practice, we compared the error rate, utterance duration, and RT changes from the first practice session to the second with paired t tests. Each behavioral measure was averaged first within each syllable, then within each condition, and then within each participant. We hypothesized that we would see greater learning in the illegal condition because those syllables included both novel syllables and novel consonant clusters, whereas the legal condition included novel syllables with familiar consonant clusters. Paired t tests comparing the mean error rate, duration, and RT in the illegal and legal conditions were performed to test this hypothesis. t Tests were corrected for multiple comparisons using a Bonferroni threshold of 0.05.
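The session comparison described above can be sketched as a paired t statistic on per-participant means, followed by a Bonferroni adjustment of the resulting p values. This is a minimal sketch using only the Python standard library; the helper names and the example values are hypothetical, and obtaining a p value from t would additionally require a t-distribution routine that is not shown.

```python
import math
import statistics


def paired_t(session1, session2):
    """Return (t, df) for a paired t test on session2 - session1."""
    if len(session1) != len(session2) or len(session1) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    diffs = [b - a for a, b in zip(session1, session2)]
    n = len(diffs)
    # Standard error of the mean difference uses the sample SD (n - 1).
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1


def bonferroni(p_values, alpha=0.05):
    """Return Bonferroni-adjusted p values and the per-test threshold."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values], alpha / m


# Hypothetical per-participant error rates (%) in each practice session.
session1 = [30.0, 20.0, 25.0, 35.0]
session2 = [10.0, 5.0, 15.0, 10.0]
t_value, df = paired_t(session1, session2)

# Hypothetical uncorrected p values for a family of three tests.
adjusted, per_test_alpha = bonferroni([0.01, 0.2, 0.5])
```

A negative t here means the measure was lower in the second session, the direction expected if practice reduced errors.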
During functional imaging, participants produced the 15 legal and 15 learned illegal syllables that they had learned during the practice sessions and 15 novel illegal syllables that they had not been exposed to previously. Thus, there were three syllable production conditions: learned legal, learned illegal, and novel illegal. To ensure that participants had undergone learning of the learned illegal syllables, only participants who demonstrated significant reduction in two of the three learning indices across the practice sessions were included in the neuroimaging portion of the experiment; all participants met this requirement. A baseline condition was also intermixed during imaging in which participants viewed a series of asterisks on the screen instead of the orthographic stimulus and rested quietly instead of producing a syllable.
We acquired fMRI data using a sparse sampling paradigm that allowed participants to hear auditory cues and produce target syllables in the absence of scanner noise (Hall et al., 1999). Participants followed the same behavioral paradigm used during the practice session but with an additional pause after the syllable production to temporally align the image acquisition to the expected peak of the hemodynamic response (Belin, Zatorre, Hoge, Evans, & Pike, 1999). A single trial lasted 10 sec. Each run consisted of 40 trials and lasted 7 min. Participants completed eight runs, 80 trials per condition, and five or six productions of each syllable. Conditions were pseudorandomly distributed across the eight runs with at least eight instances of each condition appearing in each run.
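The trial counts in this paragraph follow from simple arithmetic, which the short sketch below works through. The constant names are illustrative, and the even split of trials across the four conditions (three speaking conditions plus baseline) and across syllables is an assumption consistent with the reported totals rather than a stated detail of the design.

```python
TRIAL_SEC = 10                # duration of a single trial
TRIALS_PER_RUN = 40
RUNS = 8
CONDITIONS = 4                # learned legal, learned illegal, novel illegal, baseline
SYLLABLES_PER_CONDITION = 15

# 40 trials x 10 sec = 400 sec, i.e. about 6.7 min, reported as 7 min.
run_minutes = TRIALS_PER_RUN * TRIAL_SEC / 60

total_trials = RUNS * TRIALS_PER_RUN                  # 320
trials_per_condition = total_trials // CONDITIONS     # 80

# 80 trials over 15 syllables: every syllable gets `base` productions
# and `extra` of the syllables get one more.
base, extra = divmod(trials_per_condition, SYLLABLES_PER_CONDITION)
print(run_minutes, total_trials, trials_per_condition, base, extra)
```

Under these assumptions, five syllables per condition would be produced six times and the remaining ten would be produced five times, matching the "five or six productions of each syllable" stated above.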
Instructions and visual stimuli were projected onto a screen viewed from within the scanner via a mirror attached to the head coil. Auditory stimuli were played over Sensimetrics model S-14 MRI-compatible earphones. Participants' productions were transduced by a Fibersound model FOM1-MR-30m fiber-optic microphone, sent to a Lenovo ThinkPad X61s, and recorded using Matlab (The MathWorks, Natick, MA) at 44.1 kHz.
MRI data were acquired on a 3-T Siemens Trio Tim scanner with a 32-channel head coil. For each participant, a high-resolution T1-weighted volume was acquired (MPRAGE, voxel size = 1 mm³, 256 sagittal images, repetition time [TR] = 2530 msec, echo time [TE] = 3.44 msec, flip angle = 7°). Functional gradient-echo EPI scans (41 horizontal slices, in plane resolution = 3.1 mm, slice thickness = 3 mm, gap = 25%, TR = 10 sec, acquisition time = 2.5 sec, TE = 20 msec) were automatically registered to the AC–PC line and were collected sparsely with 10 sec between scan onsets. Diffusion-weighted images were also acquired with a single-shot spin-echo echo-planar sequence (64 slices, voxel size = 2 mm³, TR = 8020 msec, TE = 83 msec, GRAPPA parallel reconstruction). Diffusion weighting was performed along 60 independent directions with a b-value of 700 sec/mm². A reference image with no diffusion weighting was also acquired.
fMRI Behavioral Data Analysis
For each syllable production, RT, utterance duration, and error rate were calculated following the removal of noise associated with the scanner bore echo and peripheral equipment using a Wiener filter (Wiener, 1949). Raters were blind to the condition (learned or novel) of the illegal syllables. Each behavioral measure was averaged within each condition and within each participant. One-way ANOVAs (pFWE < .05, Bonferroni-corrected) were used to test for significant differences across the three conditions for each of the three behavioral measures. Bonferroni-corrected paired t tests (pFWE < .05) were then used to test for behavioral differences between each pair of conditions. A paired t test (p < .05) was also performed to test for significant differences in behavioral measures between the first five trials of each (learned) illegal syllable during the practice session and the novel illegal syllables during the imaging session.
fMRI Data Analysis
The Nipype (Gorgolewski et al., 2011) neuroimaging software interface was used to analyze imaging data. Nipype permits the use of preferred processing routines from various analysis packages. Using SPM8 image processing tools (www.fil.ion.ucl.ac.uk/spm/software/spm8), functional images were motion-corrected and realigned to the participant's anatomical volume and high-pass filtered with a standard 128-sec cutoff period. Error trials, intensity-related outliers (>3 standard deviations from participant mean), and motion-related outliers (>2 mm) were removed from the analysis; approximately 10% of all trials were removed because of these criteria.
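The two outlier criteria above can be sketched as a simple per-volume screen. This is a minimal illustration, not the Nipype or SPM8 routine used in the study: the function `flag_outliers` is hypothetical, and it assumes one global intensity value and one displacement value (in mm) per acquired volume, which the text does not specify.

```python
import statistics


def flag_outliers(intensities, displacements_mm,
                  sd_limit=3.0, motion_limit_mm=2.0):
    """Return indices of volumes failing the intensity or motion criterion."""
    if len(intensities) != len(displacements_mm) or len(intensities) < 2:
        raise ValueError("need two equal-length series with n >= 2")
    mean = statistics.mean(intensities)
    sd = statistics.stdev(intensities)
    flagged = []
    for index, (value, motion) in enumerate(zip(intensities, displacements_mm)):
        intensity_outlier = abs(value - mean) > sd_limit * sd
        motion_outlier = motion > motion_limit_mm
        if intensity_outlier or motion_outlier:
            flagged.append(index)
    return flagged


# Hypothetical series: volume 19 has an intensity spike, volume 3 moved 2.5 mm.
intensities = [100.0] * 19 + [130.0]
displacements = [0.1] * 20
displacements[3] = 2.5
flagged = flag_outliers(intensities, displacements)
print(flagged)
```

Because the mean and standard deviation are computed from the same series that contains the outlier, a short series can never exceed a 3 SD limit; the screen is only meaningful with enough volumes per participant.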
BOLD responses were estimated using a general linear model, and the hemodynamic response function for each stimulus event was modeled as a finite impulse response. The model included four condition-specific variables (learned illegal, novel illegal, learned legal, and baseline) and additional covariates (utterance duration measures, linear detrending covariates, and motion parameters). Trials rated as having behavioral errors (e.g., phoneme additions or utterance restarts) were removed and not included in the analysis of the corresponding speaking conditions. The model was estimated for each participant. Model estimates for the novel illegal and learned illegal conditions were contrasted (novel illegal–learned illegal) at each voxel as were those for the learned illegal and learned legal conditions (learned illegal–learned legal). Group statistics were then calculated separately for cortical and subcortical regions.
A surface-based analysis was used to assess group BOLD response differences for each contrast in the cerebral cortex. T1 volume segmentation and cortical surface reconstruction for each participant were performed with the FreeSurfer image analysis suite (Fischl et al., 2002; Dale, Fischl, & Sereno, 1999; Fischl, Sereno, & Dale, 1999). The activity of cortical voxels in each contrast volume for each participant was then mapped to that participant's cortical surface. Participant data were aligned by inflating each individual surface to a sphere and registering it to a template representing the average surface curvature of a set of neurologically normal adult brains (Fischl, Sereno, Tootell, & Dale, 1999). The surface-based contrast data were smoothed with a 6-mm FWHM kernel and then averaged across participants. Group-level t statistics were calculated at each vertex. Vertex-wise statistics were first thresholded at p < .001 (uncorrected). Cluster-level significance thresholds were then estimated separately for each hemisphere using a Monte Carlo simulation over 10,000 iterations in which each iteration measured the maximum cluster size in smoothed random noise data (Hayasaka & Nichols, 2003). Results were cluster-thresholded in each hemisphere at cluster-wise probability (CWP) < .0167 to correct for both surface-based tests in each hemisphere and one subcortical volume-based test. This resulted in a family-wise error-corrected threshold of CWP < .05 across all fMRI analyses.
Group differences in subcortical BOLD responses were assessed by normalizing and aligning individual T1 volumes to the MNI152 template using SPM8's DARTEL image registration toolbox (Ashburner, 2007). Each participant's voxel-based contrast data were smoothed using a 6-mm FWHM kernel; these smoothed contrasts were then averaged across participants. Group-level t statistics were calculated at each voxel and thresholded at p < .001 (uncorrected). After a subcortical mask was applied, the results were thresholded at the cluster level at CWP < .0167 (corrected) using a separate Monte Carlo simulation with 10,000 iterations.
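The cluster-level correction used for the surface and subcortical analyses can be illustrated with a deliberately simplified sketch. It works on a one-dimensional strip of noise with a moving-average smoother, whereas the study used cortical surfaces, a subcortical volume, and a 6-mm FWHM kernel; all function names and parameter values below are hypothetical. The per-test threshold of .05/3 ≈ .0167 corresponds to the two hemispheres plus one subcortical test.

```python
import random
import statistics


def max_cluster_size(values, threshold):
    """Length of the longest run of consecutive values above threshold."""
    longest = current = 0
    for value in values:
        current = current + 1 if value > threshold else 0
        longest = max(longest, current)
    return longest


def smooth(values, width):
    """Moving-average smoothing (a stand-in for a Gaussian kernel)."""
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def null_max_clusters(n_points, width, z_threshold, n_iter, rng):
    """Maximum cluster size in smoothed random noise, once per iteration."""
    maxima = []
    for _ in range(n_iter):
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
        smoothed = smooth(noise, width)
        sd = statistics.pstdev(smoothed)
        z_scores = [value / sd for value in smoothed]  # restore unit variance
        maxima.append(max_cluster_size(z_scores, z_threshold))
    return maxima


def size_threshold(null_maxima, cwp):
    """Smallest cluster size k with P(max null cluster >= k) < cwp."""
    if not 0 < cwp <= 1:
        raise ValueError("cwp must be in (0, 1]")
    n = len(null_maxima)
    k = 1
    while sum(m >= k for m in null_maxima) / n >= cwp:
        k += 1
    return k


PER_TEST_CWP = 0.05 / 3  # family-wise .05 split over three tests
maxima = null_max_clusters(200, 5, 2.0, 200, random.Random(0))
print(size_threshold(maxima, PER_TEST_CWP))
```

The key idea carried over from the full procedure is that the null distribution is built from the largest cluster in each simulated map, so an observed cluster exceeding the resulting size is unlikely anywhere in the map under the null.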
DTI Data Analysis
Using the FMRIB Diffusion Toolbox (www.fmrib.ox.ac.uk/fsl), the diffusion-weighted raw data were first corrected for eddy-current distortions and motion artifacts. Diffusion tensors were then fitted at each voxel within a cortical mask. Data from one participant were not included in this analysis because of excessive head motion during collection of the DTI volume that caused a failure in the DTI analysis software. DTI volumes were coregistered with participants' anatomical T1-weighted volume using FreeSurfer. FreeSurfer was also used to identify white matter voxels that lie 2 mm below the cortical gray–white boundary (Kang, Herron, Turken, & Woods, 2012). The fractional anisotropy (FA) within each of these voxels was then calculated. These FA values were used in brain–behavior correlation analyses described in the next subsection.
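For reference, FA is computed from the three eigenvalues of the fitted diffusion tensor. The sketch below uses the standard FA definition and is not code from the FMRIB Diffusion Toolbox; the function name and the example eigenvalues are hypothetical.

```python
import math
import statistics


def fractional_anisotropy(l1, l2, l3):
    """FA from the three eigenvalues of a diffusion tensor."""
    denominator = l1 * l1 + l2 * l2 + l3 * l3
    if denominator == 0:
        return 0.0  # no diffusion signal; treat as isotropic
    numerator = (l1 - l2) ** 2 + (l2 - l3) ** 2 + (l3 - l1) ** 2
    return math.sqrt(0.5 * numerator / denominator)


# Hypothetical eigenvalues (mm^2/sec) for three white matter voxels.
voxels = [
    (1.7e-3, 0.3e-3, 0.3e-3),
    (1.5e-3, 0.4e-3, 0.3e-3),
    (1.2e-3, 0.6e-3, 0.5e-3),
]
fa_values = [fractional_anisotropy(*eigenvalues) for eigenvalues in voxels]
mean_fa = statistics.mean(fa_values)  # one summary value per cluster
print(round(mean_fa, 3))
```

FA is 0 when diffusion is equal in all directions and approaches 1 when diffusion is confined to a single axis, which is why it is commonly used as an estimate of white matter integrity.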
Brain–Behavior Correlation Analyses
Correlation analyses were used to identify relationships between behavioral measures and brain activity or white matter structure. First, we tested for correlations between BOLD activity clusters from the novel illegal–learned illegal contrast and each of the three learning success measures (utterance duration, error rate, and RT). Each learning measure was normalized by the learned illegal syllable measure. For instance, the utterance duration learning measure was the mean duration difference between the novel illegal and learned illegal productions divided by mean duration of the learned illegal productions. For each participant, a set of significant clusters from the novel illegal–learned illegal contrast was calculated using the same group-level method described above but excluding that participant's own data. Although this leave-one-out cross validation technique may reduce statistical power, it is a necessary loss to avoid biases from nonindependence of cluster selection and the BOLD–behavioral correlation measures (Esterman, Tamber-Rosenau, Chiu, & Yantis, 2010). Moreover, using this procedure does not substantially alter voxel selection. For instance, the activation peak of each left posterior superior temporal sulcus (pSTs) leave-one-out cluster is less than 1 mm from that of the cluster derived from all participants' data (calculated by the L2 norm of the Montreal Neurological Institute [MNI] coordinates).
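The normalized learning measure, the leave-one-out grouping, and the peak-distance check described above can be sketched as follows. The function names, participant labels, and numeric values are hypothetical; the cluster-finding step itself is not reproduced here.

```python
import math


def learning_measure(novel_mean, learned_mean):
    """(novel - learned) / learned for one participant and one index."""
    return (novel_mean - learned_mean) / learned_mean


def leave_one_out(participants):
    """Yield (held_out, remaining) so clusters never use the held-out data."""
    for index, held_out in enumerate(participants):
        yield held_out, participants[:index] + participants[index + 1:]


def peak_distance_mm(peak_a, peak_b):
    """Euclidean (L2) distance between two MNI coordinate triples."""
    return math.dist(peak_a, peak_b)


# Hypothetical mean durations (msec) for one participant.
duration_learning = learning_measure(novel_mean=600.0, learned_mean=500.0)

# Each participant is paired with the group used to define their clusters.
folds = list(leave_one_out(["p01", "p02", "p03"]))

# Hypothetical peaks from a leave-one-out map and the full-group map.
shift = peak_distance_mm((-52.0, -40.0, 6.0), (-52.0, -40.5, 6.0))
print(duration_learning, folds[0], shift)
```

Dividing by the learned-syllable value expresses the gain as a proportion, so a participant who simply speaks slowly overall does not appear to have learned more.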
Each learning success measure was then correlated with the mean beta coefficient within each significant cluster from the novel illegal–learned illegal contrast as determined by this leave-one-out method. In other words, an individual's functional data did not contribute to their cluster selection in the correlational analysis. We expected to find positive correlations in areas of the brain associated with speech motor learning (Tomassini et al., 2011; Zhang et al., 2009; Golestani & Zatorre, 2004) and report significance values of Bonferroni corrected pFWE < .05 for a one-tailed (positive) correlation (Pearson's R).
We then tested for correlations between the mean FA of white matter voxels underlying each active cluster from the novel illegal–learned illegal contrast and each participant's three learning success measures. On the basis of past evidence, we expected to find positive correlations between learning success and brain structure integrity in brain regions associated with speech motor learning (Tomassini et al., 2011; Golestani & Pallier, 2007; Gaser & Schlaug, 2003). We report significance values of pFWE < .05 for a one-tailed (positive) correlation (Pearson's R).
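The correlation tests in this subsection rest on Pearson's R, which the minimal sketch below computes directly and converts to a t statistic with n − 2 degrees of freedom. The function names and data are hypothetical; a one-tailed p value would come from the upper tail of the t distribution, which is not computed here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need two equal-length samples with n >= 3")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_to_t(r, n):
    """Convert r to (t, df); a positive t supports a positive correlation."""
    return r * math.sqrt((n - 2) / (1 - r * r)), n - 2


# Hypothetical learning measures and mean cluster values for four participants.
learning = [1.0, 2.0, 3.0, 4.0]
cluster_mean = [1.0, 3.0, 2.0, 4.0]
r = pearson_r(learning, cluster_mean)
t_value, df = r_to_t(r, len(learning))
print(round(r, 3), round(t_value, 3), df)
```

Because the hypothesis is directional, only positive values of t count as evidence, which is what the one-tailed test encodes.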
Next, the same methods were used to identify correlations between BOLD activity in significant clusters from the learned illegal–learned legal contrast and differences in duration, error rate, and RT between the learned illegal and learned legal stimuli, including normalization of performance measures as described above. We then tested for correlations between mean FA values for voxels beneath these active clusters and these same performance measures.
Finally, we investigated whether brain–behavior correlations identified using these methods were related to general performance rather than speech motor sequence learning specific to the practiced syllables. For instance, successful learners might have spoken with shorter utterance durations across all conditions compared with less successful learners; subsequent brain–behavior correlations might then reflect the relationship between utterance duration and brain measures, instead of the intended relationship between motor sequence learning changes and brain measures. Therefore, we tested for correlations between learning success measures involved in significant neural–behavioral correlations and the corresponding average behavioral measure across all utterances produced during the imaging session (p < .05, Pearson's R).
Behavioral Measures of Learning
Across-participant measures of error rate and utterance duration indicated significant improvement in performance between practice sessions for the illegal but not the legal syllables (Figure 1A). Illegal syllables had a significantly lower error rate on the second practice session compared with the first (Session 1 mean error rate = 25.8%, Session 2 = 0%, mean decrease = 25.8%, SD = 10.1, t(15) = −8.34, pFWE = .0002). In contrast, legal syllables showed no significant change from the first to second session (Session 1 mean error rate = 3.5%, Session 2 = 0%, mean decrease = 3.5%, SD = 3.8, t(15) = −3.51, pFWE > .05). Furthermore, error rate decreased significantly more for the illegal syllables than legal syllables (mean difference = 22.3%, SD = 11.4, t(15) = −7.39, pFWE = .0002). The duration of illegal syllables was significantly shorter in the second session compared with the first session (Session 1 mean duration = 579.3 msec, Session 2 = 523.5 msec, mean decrease = 54.9 msec, SD = 33.9, t(15) = 6.05, pFWE = .0004), but legal syllables showed no significant duration change from the first to the second session (Session 1 mean duration = 487.7 msec, Session 2 = 499.8 msec, mean decrease = −12.2 msec, SD = 25.2, t(15) = −1.81, pFWE > .05). Duration decreased significantly more for illegal than legal syllables (mean difference = 110.4 msec, SD = 59.9, t(15) = −7.45, pFWE = .0002). RT did not significantly change from the first to the second practice session for either the legal (Session 1 mean RT = 384.8 msec, Session 2 = 299.8, mean decrease = 49.1 msec, SD = 52.9, t(15) = 3.47, pFWE > .05) or illegal syllables (Session 1 mean RT = 344.4 msec, Session 2 = 283.9, mean decrease = 60.9 msec, SD = 58.9, t(15) = 3.87, pFWE > .05), and changes in RT were not significantly different between legal and illegal syllable conditions (mean difference = −11.9 msec, SD = 28.8, t(15) = −1.54, pFWE > .05).
During the fMRI session, significant differences in behavioral learning success indices across the three speaking conditions (learned legal, learned illegal, and novel illegal) were noted (Figure 1B). One-way ANOVAs tested for significant differences in learning indices across conditions; post hoc paired t tests compared pairs of conditions. Error rate (Figure 1B, left) was significantly different across conditions, F(2, 13) = 33.99, pFWE = 1 × 10⁻¹¹. No difference in accuracy was noted between learned legal and learned illegal syllables (mean difference = 6.1%, SD = 9.9, t(15) = 2.41, pFWE > .05), but participants committed more errors during novel illegal syllables compared with both the learned legal (mean difference = 38.3%, SD = 21.6, t(15) = 6.88, pFWE = 1 × 10⁻⁵) and learned illegal syllables (mean difference = 32.2%, SD = 17.8, t(15) = 7.01, pFWE = 7 × 10⁻⁶). Utterance durations (Figure 1B, right) were also significantly different between conditions, F(2, 13) = 8.39, pFWE = .001. Learned illegal syllables were significantly shorter in duration than novel illegal syllables (mean difference = 55 msec, SD = 36.9, t(15) = 5.78, pFWE = 7 × 10⁻⁵) and learned legal syllables were even shorter (mean difference = 50 msec, SD = 21.6, t(15) = 8.98, pFWE = 3 × 10⁻⁷). RTs were not significantly different between conditions, F(2, 13) = 0.04, pFWE > .05.
The error rate and duration associated with novel illegal syllables performed during the fMRI session did not significantly differ from those of illegal syllables produced at the onset of the first practice session (mean error rate difference = 7.9%, t(15) = 1.9363, p > .05; mean duration difference = 30 msec, t(15) = 1.8851, p > .05).
Figure 2 and Table 1 show the brain regions that were significantly more active for novel illegal than learned illegal syllables (vertex/voxel-level p < .001, uncorrected; cluster-level p < .0167). The production of novel illegal syllables resulted in greater BOLD response in the frontal operculum and adjacent anterior insula (referred to as the frontal operculum [FO] cluster hereafter) and lateral superior parietal lobule (SPL) bilaterally. In the left hemisphere, additional cortical clusters were noted with peaks in the lateral PMC (including a ventral cluster extending into the inferior frontal gyrus pars opercularis and a dorsal cluster extending into the inferior frontal sulcus), posterior superior temporal gyrus (pSTg), pSTs, and inferior temporal-occipital cortex (ITO). Subcortical activity was found in the left globus pallidus (GP). In the right hemisphere, the production of novel illegal syllables resulted in greater activity in the pre-SMA. No region in either hemisphere was found to be significantly more active for the learned illegal than the novel illegal syllables.
| Region Name | MNI Coordinates (x, y, z) | t | Size | CWP |
From left to right, the columns show the region name, MNI stereotactic coordinates, t value, cluster size, and CWP value.
Figure 3 and Table 2 show the cortical brain regions that were significantly more active for learned illegal than learned legal syllables (vertex/voxel-level p < .001, uncorrected; cluster-level p < .0167). The production of learned illegal syllables resulted in greater BOLD response bilaterally in PMC, pSTg, supramarginal gyrus (SMg), SPL, and occipital cortex (OC). PMC activity was more widespread in the left than the right hemisphere, comprising three distinct clusters: a ventral cluster extending into the inferior frontal gyrus, a posterior dorsal cluster near the border with primary motor cortex (MC), and an anterior dorsal cluster. SMg activity was also more widespread in the left hemisphere, including a ventral cluster in the opercular portion of SMg in addition to a dorsal cluster that was found in both hemispheres. In the left hemisphere, additional clusters were noted in ITO and SMA. In the right hemisphere, an additional cluster was found in the superior lateral cerebellar cortex.
| Region Name | MNI Coordinates (x, y, z) | t | Size | CWP |
From left to right, the columns show the region name, MNI stereotactic coordinates, t value, cluster size, and CWP value. CBM = cerebellum.
Brain–Behavior Correlation Analysis
Learning success, as measured by the participant-normalized difference in utterance duration between novel illegal and learned illegal syllables, was positively correlated (r = .709, pFWE = .029; Figure 4, left) with the mean response in the left FO cluster identified in the novel illegal–learned illegal leave-one-out cross-validation contrast. No other significant correlations between behavioral measures and BOLD response were found in any of the significant clusters identified by the functional imaging analysis of either the novel illegal–learned illegal contrast or the learned illegal–learned legal contrast.
The participant-normalized difference in utterance duration between novel illegal and learned illegal syllables was also positively correlated with the mean FA of white matter voxels 2 mm under the cluster of left pSTs activity in the novel illegal–learned illegal contrast (r = .670, pFWE = .040; Figure 4, middle). No other significant correlations between behavioral measures and mean FA under the significant cortical clusters identified by the functional imaging analyses were noted.
To verify that the two identified brain–behavior correlations reflected speech motor sequence learning specific to the practiced syllables rather than general performance, we tested for a correlation between the utterance duration learning success measure and the average utterance duration across all productions during the imaging session. Across participants, the learning success measure was not significantly correlated with the average utterance duration (r = −.0749, p > .05).
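The brain–behavior analyses above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis code used in the study. The exact normalization behind the "participant-normalized difference" is not given in this section, so the sketch assumes the duration difference is divided by the participant's mean duration across the two conditions. The function names `learning_success` and `pearson_r` are hypothetical, and the family-wise error correction reported above is not reproduced.

```python
from math import sqrt


def learning_success(novel_durations, learned_durations):
    """Normalized duration difference for one participant.

    ASSUMPTION: normalization divides by the participant's mean duration
    across both conditions; the source does not spell out the formula.
    """
    novel_mean = sum(novel_durations) / len(novel_durations)
    learned_mean = sum(learned_durations) / len(learned_durations)
    overall_mean = (novel_mean + learned_mean) / 2
    return (novel_mean - learned_mean) / overall_mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)
```

Under these assumptions, one would compute `learning_success` per participant and pass the resulting list to `pearson_r` together with each participant's mean cluster response, mean FA, or (for the control analysis) average utterance duration.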
This study has provided novel evidence for speech motor sequence learning with repeated practice in healthy adults and identified neural correlates of this learning process. With practice, participants produced pseudoword syllables containing phonotactically illegal consonant clusters more accurately and with shorter utterance durations, indicative of sequence learning (Figure 1). Greater gains with practice occurred for these illegal syllables compared with legal syllables that contained phonotactically legal consonant clusters. These learning gains were specific to the practiced syllables. This was revealed by comparing performance measures for a new set of illegal syllables during a subsequent neuroimaging session to the production of the illegal syllables produced in the practice session. During the imaging session, which occurred 1–2 days following practice, participants produced learned illegal syllables faster and more accurately than similar novel illegal syllables to which they had not been exposed. Moreover, participants made similar numbers of errors and spoke for similar lengths of time for novel illegal syllables during imaging and illegal syllables during their initial exposure of the practice session. Learning therefore remained stable across days and did not extend generally to any novel consonant cluster, but only to those that were practiced. Others have similarly reported that speech motor learning is specific to practiced utterances and does not generalize to similar novel utterances (Rochet-Capellan, Richer, & Ostry, 2012; Rochet-Capellan & Ostry, 2011).
The fact that performance improvements were specific to those stimuli encountered during practice indicates that they were not due to learning of new phonological rules, because such learning would have generalized to nonpracticed illegal syllables that followed the same phonological rules. Instead, we hypothesize that, with practice, new motor programs or “chunks” (Cowan, 2001; Miller, 1956) representing novel consonant clusters were learned by concatenating smaller existing motor programs. The larger programs enabled selection and feedforward production of an entire cluster as a single unit, resulting in faster, more accurate performance. Kinematic studies of articulatory overlap support the notion that well-learned consonant clusters are produced as a single motor program. Overlap of articulatory movements increases with speaking rate for consonants that would otherwise form an illegal cluster and straddle a syllable boundary (e.g., /kp/ in “jackpot”). Overlap within a legal consonant cluster, however, is unchanged by speaking rate (Lœvenbruck, Collins, Beckman, Krishnamurthy, & Ahalt, 1999; Byrd, 1996; Byrd & Tan, 1996).
The hypothesized learning of new speech motor programs is schematized in Figure 5. According to the Directions into Velocities of Articulators model of speech motor control (Guenther, Ghosh, & Tourville, 2006), neurons in a speech sound map located in left lateral PMC represent individual sound “chunks” that have their own optimized motor programs for production (for similar proposals, see Shalom & Poeppel, 2008; Levelt, 2001). Each speech sound map node projects to nodes in the primary MC that are responsible for the articulatory gestures (indicated by G1, G2, etc.) needed to produce the sound chunk. Together, the speech sound map node and the articulatory gesture nodes, along with their interconnections (including subcortically mediated connections not shown in Figure 5), can be considered the motor program for producing the sound chunk.
The left half of Figure 5 schematizes the production of a novel consonant cluster, in this case /zv/. Previously learned motor programs (labeled MP1 and MP2 and indicated by dashed lines) exist for the phonemes /z/ and /v/, but not for the cluster /zv/. The transitions between gestures in a well-learned motor program (e.g., G1 and G2 for the phoneme /z/) are assumed to be optimized to allow maximally rapid production. To produce the novel cluster /zv/, the motor programs for /z/ and /v/ must be activated in sequence, thereby producing the required articulatory gestures (G1 through G4) in order. Although the sequences G1 → G2 and G3 → G4 are optimized, the transition between G2 and G3 is not, and thus, the overall gestural sequence for the cluster G1 → G2 → G3 → G4 is not yet fully optimized.
The right half of Figure 5 schematizes the situation after the consonant cluster has been learned through practice. Now, a speech map node in PMC represents the cluster /zv/, and the cluster has its own motor program (MP3). All transitions between gestures in the cluster are optimized, resulting in faster and more accurate performance compared with the situation in the left half of the figure.
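The bookkeeping behind the two halves of Figure 5 can be made concrete with a toy sketch. It is not an implementation of the Directions into Velocities of Articulators model. It only counts how many stored motor programs must be activated for /zv/ and how many gesture transitions fall outside any single stored program. The gesture labels G1 through G4 and program labels MP1 through MP3 follow the figure description; the function names and the greedy longest-match rule are assumptions made for illustration.

```python
def programs_needed(gestures, programs):
    """Number of stored programs activated to cover the gesture sequence.

    ASSUMPTION: at each position the longest matching stored program is chosen.
    """
    count = 0
    i = 0
    while i < len(gestures):
        best = 0
        for prog in programs:
            n = len(prog)
            if n > best and tuple(gestures[i:i + n]) == tuple(prog):
                best = n
        if best == 0:
            raise ValueError(f"no stored program starts with {gestures[i]}")
        i += best
        count += 1
    return count


def unoptimized_transitions(gestures, programs):
    """Consecutive gesture pairs that do not occur inside any stored program."""
    def covered(a, b):
        return any(
            prog[k] == a and prog[k + 1] == b
            for prog in programs
            for k in range(len(prog) - 1)
        )
    return sum(1 for a, b in zip(gestures, gestures[1:]) if not covered(a, b))


CLUSTER_ZV = ["G1", "G2", "G3", "G4"]             # gestures for /zv/
BEFORE = [("G1", "G2"), ("G3", "G4")]              # MP1 (/z/) and MP2 (/v/)
AFTER = BEFORE + [("G1", "G2", "G3", "G4")]        # adds MP3 (/zv/)

print(programs_needed(CLUSTER_ZV, BEFORE), unoptimized_transitions(CLUSTER_ZV, BEFORE))
print(programs_needed(CLUSTER_ZV, AFTER), unoptimized_transitions(CLUSTER_ZV, AFTER))
```

Before learning, the sketch reports two programs and one unoptimized transition (G2 → G3). After learning, it reports one program and no unoptimized transitions, matching the verbal account of the figure.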
Inherent to the account in Figure 5 are several predictions regarding brain activity that were tested in the fMRI portion of the study. First, the production of learned illegal syllables should result in less left PMC activity than production of novel illegal syllables because fewer speech sound map nodes need to be activated. This hypothesis is supported by our finding of left PMC activity in the novel illegal–learned illegal contrast. Prior studies have identified an analogous reduction of activity in the more dorsal premotor region that encodes hand movements when novel hand movement sequences are learned (Orban et al., 2010; Jenkins, Brooks, Nixon, Frackowiak, & Passingham, 1994; Doyon et al., 2002). Second, activity in primary MC should be approximately the same for novel illegal and learned illegal syllables because they require the same articulatory gestures (G1–G4 in Figure 5). This is supported by the lack of activity in primary MC in the novel illegal–learned illegal contrast. Third, brain regions responsible for sequencing speech “chunks” should be more active for novel than learned illegal syllables because of the higher number of “chunks” that must be sequenced. This hypothesis is supported by the findings of reduced activity for learned illegal compared with novel illegal syllables in the right pre-SMA and left GP—regions believed to form part of a BG–thalamocortical loop that selects and initiates motor chunks (Kotz & Schwartze, 2010; Haggard, 2008; Contreras-Vidal, 1999). These findings mirror those of non-speech motor sequence learning studies (Lehericy et al., 2005; Poldrack et al., 2005; Hikosaka et al., 1996), and we have previously proposed that this network interacts with the PMC to sequentially execute a series of speech motor programs (Bohland, Bullock, & Guenther, 2010). Notably, the reduction in pre-SMA activity occurred in the right hemisphere in the novel illegal–learned illegal contrast. 
Right pre-SMA, in conjunction with right lateral inferior frontal gyrus and anterior insula, has been implicated in selecting and inhibiting unwanted movements (Rae, Hughes, Weaver, Anderson, & Rowe, 2014; Aron, 2011). A common error during initial productions of illegal syllables is schwa insertion—the addition of a neutral vowel—in the consonant cluster (Davidson, 2006). Right pre-SMA activity in the novel illegal–learned illegal contrast may reflect suppression of this extraneous vowel movement to accurately produce novel illegal syllables. This suppression was unnecessary for learned illegal productions for which consonant cluster motor programs were established.
We also noted greater activity in left PT and pSTs for the novel illegal syllables relative to the learned illegal syllables. These auditory regions are thought to be involved in guiding speech movements based on auditory feedback (Hickok, 2012; Guenther et al., 2006). During speech production, activity in this area is greater when there is a mismatch between expected and realized auditory feedback (Tourville, Reilly, & Guenther, 2008; Toyomura et al., 2007). In the Directions into Velocities of Articulators (Guenther et al., 2006) and State Feedback Control (Hickok, 2012) models of speech production, error signals arising from these regions are used to fine-tune speech motor programs. Thus, learning relies on the transmission of these signals to frontal regions involved in motor planning and execution. In keeping with this view, Parker Jones et al. (2013) noted stronger functional connectivity between motor and auditory areas in nonnative speakers than native speakers during an overt production task. The correlation we found between learning success and white matter FA underlying pSTs (Figure 4, middle) supports this interpretation. The finding reveals a potential physiological constraint on sensorimotor learning: Reduced white matter integrity underlying pSTs may interfere with the transmission of auditory error signals to premotor regions (Axer, Klingner, & Prescher, 2013; Saur et al., 2008), thereby hindering the formation and updating of finely tuned speech motor programs.
Production of novel illegal syllables also elicited significantly greater activity than production of learned illegal syllables in FO, including adjoining parts of the anterior insula, bilaterally. Like pSTs, this area has been implicated in error and feedback processing (Ullsperger, Harsay, Wessel, & Ridderinkhof, 2010; Menon, Adleman, White, Glover, & Reiss, 2001) and, more specifically, in monitoring auditory feedback during speech production (Christoffels, Formisano, & Schiller, 2007; Hashimoto & Sakai, 2003). It is anatomically connected to both the posterior superior temporal cortex and lateral PMC (Axer et al., 2013; Saur et al., 2008; Augustine, 1996). Greater activity in this region during production (Moser et al., 2009) and perception of novel speech sounds (Golestani & Zatorre, 2004; Raboyeau et al., 2004) compared with familiar speech sounds has been noted previously, consistent with the current findings. Golestani and Zatorre (2004) also reported a correlation between activity in this region and the degree of success in learning to distinguish novel phonetic contrasts, analogous to the correlation between FO activity and learning success noted herein (Figure 4, left). Moreover, Golestani and Pallier (2007) found higher white matter density under FO for speakers who were more successful at learning to produce a novel phoneme.
FO has also been associated with phonological processing, including translation of phonetic codes into articulatory scores (Dogil et al., 2002), translation from orthographic and auditory stimuli to phonological representations (Steinbrink, Ackermann, Lachmann, & Riecker, 2009; Paulesu et al., 1996), and phonological retrieval (Price, 1998). Combined, this previous work suggests that mappings between language-related sensory inputs and corresponding motor representations occur via phonological representations encoded in FO. The novel sequences of speech sounds in the current study required participants to create and hone new auditory–motor and orthography-to-motor mappings in the FO. The correlation between learning success and the reduction of activity in FO for the learned stimuli noted here suggests that speech motor sequence learning depends on the efficiency with which novel phonological representations and sensorimotor mappings are established and fine-tuned.
In addition to comparing novel and learned illegal syllables, we compared performance and brain activity during production of learned illegal syllables to learned legal syllables. Participants produced learned legal syllables faster than learned illegal syllables. Furthermore, production of learned legal syllables was accompanied by reduced activity compared with learned illegal syllables in many of the same areas that were more active for novel illegal syllables compared with learned illegal syllables. On the basis of these results, we hypothesize that the consolidation of individual phoneme motor programs into larger motor program “chunks” is more complete for the learned legal syllables than the learned illegal syllables because of more experience producing the phoneme combinations in the legal syllables, which occur in the speakers' native language. In effect, learned legal syllables may be produced with fewer speech motor programs than learned illegal syllables.
Notably, there was no significant difference between the learned illegal and learned legal activity in pSTs or FO (unlike the novel illegal–learned illegal contrast), suggesting that the auditory–motor mappings described above already exist for both sets of stimuli. The auditory–motor mappings for the learned illegal stimuli are apparently learned during the practice sessions, and the learned legal mappings are learned as part of the speakers' native language. (Note that we do not suggest that these existing mappings are stagnant representations. We believe that auditory feedback is used continuously to fine-tune speech motor programs—stored in the lateral PMC—including those of learned illegal and learned legal syllables.) Some researchers have suggested that FO modulates attention or coordinates articulation (Baldo, Wilkins, Ogar, Willock, & Dronkers, 2011; Sterzer & Kleinschmidt, 2010; Eckert et al., 2009); however, there is also a disparity in these processes between the learned illegal and learned legal conditions, and BOLD activity was not significantly different between these conditions.
Brain regions implicated in reading (Peyrin, Demonet, N'Guyen-Morel, Le Bas, & Valdois, 2011; Hoeft et al., 2007; Cohen & Dehaene, 2004), including left ITO and bilateral SPL, were also more active for novel illegal compared with learned illegal syllables, and for learned illegal compared with learned legal syllables. This increased activity may reflect greater reliance on the orthographic stimulus for novel compared with learned illegal stimuli as well as for learned illegal compared with learned legal stimuli.
One alternative interpretation of the brain activity differences we find between novel and learned syllables is that these differences result primarily from the fact that novel illegal syllables are uttered more slowly than learned illegal syllables. To alleviate this concern, we included utterance duration as a covariate of noninterest in the fMRI analyses. Furthermore, inspection of brain activity that covaried with duration revealed small clusters in left PMC, MC, and SMg; no clusters were found in FO or pSTs, supporting the view that activity differences in these areas for learned versus novel illegal syllables reflect the effects of speech motor sequence learning.
It is somewhat surprising that we did not find significant differences in cerebellar activity in the novel illegal–learned illegal contrast. A great deal of the motor sequence learning literature has implicated the cerebellar cortex and nuclei in forming new motor programs in parallel with cerebral cortical motor areas (cf. Doyon et al., 2002). We did find differences in cerebellar activity in the learned illegal–learned legal contrast. This comparison possessed greater statistical power because of fewer removed error trials and a larger difficulty disparity between conditions. We speculate that our whole-brain normalization technique may not have provided good enough anatomical alignment between participants to find small learning-related activity differences in the novel illegal–learned illegal contrast (Diedrichsen, 2006).
Our results demonstrated behavioral improvements attributable to speech motor sequence learning and identified the network of brain regions involved in this process. Learning resulted in reduced activity in speech-related frontal and posterior superior temporal cortex as well as brain regions known to be involved more generally in motor sequence planning and execution, including the pre-SMA and BG. Reduced activity within the motor sequence learning network supports the notion that motor sequence learning involves merging individual motor programs into larger units. This allows the motor system to use fewer, larger motor programs, thereby reducing cognitive demands during planning and performance. A significant correlation was found between learning success and activity in FO, supporting the view that motor sequence learning relies on mapping sensory representations of novel speech sound sequences to the motor system via phonological representations in FO. White matter FA underlying pSTs was also significantly correlated with learning success, indicating that white matter integrity modulates learning by constraining the efficiency of sensory-to-motor signal transmission.
Research reported in this publication was supported by the National Institute on Deafness and Other Communication Disorders of the National Institutes of Health under award number R01DC007683. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Reprint requests should be sent to Dr. Jennifer A. Segawa, Center for Computational Neuroscience and Neural Technology, Boston University, 677 Beacon Street, Boston, MA 02215, or via e-mail: firstname.lastname@example.org.