Abstract

Visual speech influences the perception of heard speech. A classic example is the McGurk effect, whereby an auditory /pa/ overlaid onto a visual /ka/ induces the fusion percept of /ta/. Recent behavioral and neuroimaging research has highlighted the importance of both articulatory representations and motor speech regions of the brain, particularly Broca's area, in audiovisual (AV) speech integration. Alternatively, AV speech integration may be accomplished by the sensory system through multisensory integration in the posterior STS. We assessed the claims regarding the involvement of the motor system in AV integration in two experiments: (i) examining the effect of articulatory suppression on the McGurk effect and (ii) determining whether motor speech regions show an AV integration profile. The hypothesis for experiment (i) was that if the motor system plays a role in McGurk fusion, distracting the motor system through articulatory suppression should reduce McGurk fusion. The results of experiment (i) showed no such reduction, suggesting that the motor system is not responsible for the McGurk effect. The hypothesis for experiment (ii) was that if brain activation to AV speech in motor regions (such as Broca's area) reflects AV integration, the profile of activity should match an AV integration profile: AV > AO (auditory only) and AV > VO (visual only). The results of experiment (ii) demonstrate that motor speech regions do not show this integration profile, whereas the posterior STS does. Instead, activity in motor regions is task dependent. The combined results suggest that AV speech integration does not rely on the motor system.

INTRODUCTION

Visible mouth movements provide information regarding the phonemic identity of auditory speech sounds and have been demonstrated to improve the perception of heard speech (Sumby & Pollack, 1954). That auditory and visual speech signals interact is further exemplified by McGurk fusion, whereby, for example, auditory /pa/ overlaid onto visual /ka/ produces the percept of /ta/ (McGurk & MacDonald, 1976).

Given that the two modalities interact, researchers have attempted to determine the neural correlates of this multisensory processing. Some researchers have highlighted the importance of the posterior STS (pSTS) in audiovisual (AV) speech integration in humans (Beauchamp, Argall, Bodurka, Duyn, & Martin, 2004; Callan et al., 2004; Campbell et al., 2001). The pSTS is a good candidate for multisensory integration, given its position between auditory and visual association cortex (Beauchamp, Lee, Argall, & Martin, 2004) and anatomical studies of the macaque brain that have shown strong anatomical connectivity of this area with different sensory cortices (Yeterian & Pandya, 1985; Mesulam & Mufson, 1982; Seltzer & Pandya, 1978, 1980; Jones & Powell, 1970). fMRI studies have shown that the BOLD response of this region is consistent with AV integration. Calvert, Campbell, and Brammer (2001), for example, found that the pSTS exhibited a greater response to a multisensory stimulus than to stimuli from the individual modalities: AV > AO (auditory only) and AV > VO (visual only), and Beauchamp, Argall, et al. (2004) demonstrated a patchy organization in the pSTS, with some voxels favoring unisensory stimuli and some voxels maximally sensitive to AV stimuli. Additional studies have suggested a causal role for the pSTS in AV integration. Beauchamp, Nath, and Pasalar (2010) localized the pSTS multisensory area in individual participants with the conjunction of AO and VO in fMRI, then presented participants with McGurk stimuli while simultaneously pulsing the pSTS with TMS. Compared with baseline, participants reported significantly fewer fusion responses, implicating this region in successful McGurk fusion. Nath and Beauchamp (2012) exploited individual differences in susceptibility to McGurk fusion; susceptibility to the illusion was positively correlated with the BOLD response in the functionally localized pSTS region (AO conjunction VO).
These studies converge on the pSTS as an important region in integrating a multisensory stimulus into a unified percept.

However, neuroimaging data also show activation of motor speech regions to lipreading and AV speech (Skipper, van Wassenhove, Nusbaum, & Small, 2007; Miller & D'Esposito, 2005; Ojanen et al., 2005; Skipper, Nusbaum, & Small, 2005; Callan et al., 2003; Calvert & Campbell, 2003; Paulesu et al., 2003; Sekiyama, Kanno, Miura, & Sugita, 2003; Campbell et al., 2001; MacSweeney et al., 2000). Ojanen et al. (2005) presented AV vowels that were either congruent or incongruent in an fMRI study and found activation in both the STS and Broca's area (BA 44 and BA 45), with only Broca's area showing increased activation for conflicting stimuli. Inspired by the motor theory of speech perception (Liberman & Mattingly, 1985), in which the perceptual units of speech sounds are motor commands, the authors suggested that Broca's area performs AV integration through overlapping activation of articulatory representations by the two modalities, resulting in increased activation for incongruent stimuli as a wider range of articulatory representations is activated. Skipper et al. (2007) found activation in a frontal-motor network including posterior Broca's area (BA 44), dorsal premotor cortex (right), ventral premotor cortex (left), and primary motor cortex (left) using both congruent and McGurk AV stimuli. BOLD time courses to McGurk /ta/ correlated most strongly with those of congruent /ta/ in motor regions, whereas time courses in auditory and visual cortex at first correlated with the congruent stimulus for each respective modality (/pa/ for auditory, /ka/ for visual) while later correlating with congruent /ta/. The authors proposed a network for AV speech in an analysis-by-synthesis framework (Stevens & Halle, 1967) reliant upon frontal-motor brain structures (primarily the pars opercularis of Broca's area and dorsal premotor cortex) to activate articulatory representations that constrain perception through feedback to sensory regions.

The hypothesis that the motor system contributes to AV speech integration is further supported by TMS studies. Watkins, Strafella, and Paus (2003) showed that motor-evoked potentials recorded from the lips during stimulation of face motor cortex were significantly enhanced when participants viewed speech lip movements compared with nonspeech facial movements, whereas Sato et al. (2010) found that motor-evoked potentials recorded from the tongue during stimulation of tongue motor cortex were significantly enhanced when perceiving tongue-related AV syllables compared with lip-related syllables. These results support a somatotopic response of the motor system during the processing of AV speech.

In addition to the neuroimaging data, there is behavioral evidence that supports the notion that the motor system contributes to AV speech integration. Sams, Mottonen, and Sihvonen (2005) presented participants with a standard McGurk paradigm but included a condition in which participants did not view a visual speech stimulus but instead silently mouthed congruent or incongruent syllables along with auditory speech. They found that incongruent self-articulation (i.e., audio /pa/, articulate /ka/) produced an interference effect, with the proportion of correctly identified auditory /pa/ reduced from 68% to 33%. The authors posited that, given this effect, AV integration is driven through the activation of articulatory representations.

Given these two broad sources of evidence, Okada and Hickok (2009) hypothesized that both the pSTS and the motor system contribute to the processing of visual speech. However, although both systems may contribute to AV integration, a closer look suggests rather different roles. Ojanen et al. (2005) found that Broca's area generates more activity for incongruent than congruent AV stimuli; the same contrast revealed no activity in the pSTS. Miller and D'Esposito (2005) found more activity for AV stimuli that are perceptually unfused than for fused stimuli in the inferior frontal gyrus (IFG) and the reverse pattern in the pSTS. Fridriksson et al. (2008) found more activity for speech videos with a reduced compared with a smooth frame rate in the motor system, including Broca's area, without seeing these effects in the pSTS. Although the activity of the pSTS does show effects of perceptual fusion and stimulus synchrony (Stevenson, VanDerKlok, Pisoni, & James, 2011), it is important to note that activations to AV speech in the motor system and the pSTS tend to dissociate such that Broca's area is more active when AV integration fails or results in conflicting cues and pSTS is more active when AV integration succeeds. This argues strongly for different roles of the two regions and hints that Broca's area may be more involved in conflict resolution as suggested in other linguistic domains (Novick, Trueswell, & Thompson-Schill, 2005), whereas the pSTS may be more involved in cross-sensory integration per se. The observation that prelinguistic infants show the McGurk effect (Rosenblum, Schmuckler, & Johnson, 1997) is broadly consistent with this view in that it demonstrates that the ability to articulate speech is not necessary for AV integration.

These results suggest that activations in the motor system during experiments may not reflect AV integration per se, but something else. One alternative explanation for these activations is that the motor system responds because of demands on response selection, contingent upon the particular task in the experiment. A recent study by Venezia, Saberi, Chubb, and Hickok (2012) found that during an auditory syllable discrimination task, motor speech regions showed a negative correlation with response bias, whereas no regions in the temporal lobe showed such a correlation. Response bias is the threshold at which participants select one response over another, independent from perceptual analysis, suggesting that activations in the motor system during auditory speech perception may reflect response selection rather than perceptual analysis. This finding from auditory speech perception may account for motor activations to AV speech as well.

The goal of this study was to assess the claim that the motor system plays a primary and necessary role in AV integration, but it is also set up to assess a weaker claim that the motor system plays a secondary, modulatory role in AV integration. We refer to these hypotheses generally as “the motor hypothesis” and distinguish variants in the strength of motor involvement as needed. The motor hypothesis generates a number of predictions, including the following: (i) engaging the motor speech system with a secondary task should modulate the perception of AV speech, strongly if the motor system plays a primary role and more weakly if it plays a secondary role, and (ii) motor speech regions should exhibit a physiological response that is characteristic of cross-modal integration. Previous literature (Calvert, Brammer, & Campbell, 2001) had emphasized the importance of supra-additivity as a defining feature of multisensory integration, requiring that the multisensory response be larger than the sum of the individual unisensory responses. However, Beauchamp (2005) suggested more relaxed criteria for identifying multisensory areas, such as requiring that the multisensory response be greater than the larger of the unisensory responses; that is, greater than the response to each modality in isolation rather than to their sum (AV > AO and AV > VO, rather than AV > AO + VO). We used this relaxed criterion to assess whether activity in the motor system reflects AV integration in speech.
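The two criteria can be made concrete in a small sketch. The function and the per-condition response values below are hypothetical, chosen only to illustrate the logic of the comparison; this is not the analysis code used in the study.

```python
import numpy as np

def is_multisensory(av, ao, vo, supra_additive=False):
    """Classify a voxel's response profile.

    av, ao, vo: mean responses (e.g., beta estimates) for the
    audiovisual, auditory-only, and visual-only conditions.
    supra_additive=True applies the stricter supra-additivity criterion
    (AV > AO + VO); otherwise the relaxed "max" criterion
    (AV > AO and AV > VO).
    """
    if supra_additive:
        return av > ao + vo
    return av > ao and av > vo

# A voxel responding at 1.2 to AV, 1.0 to AO, and 0.6 to VO passes the
# relaxed criterion but fails supra-additivity (1.2 < 1.0 + 0.6):
print(is_multisensory(1.2, 1.0, 0.6))                       # True
print(is_multisensory(1.2, 1.0, 0.6, supra_additive=True))  # False
```

Note that the relaxed criterion classifies many more voxels as multisensory than supra-additivity does, which is why a region failing even the relaxed test provides relatively strong evidence against an integration role.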

Experiment 1 was designed to assess the first prediction, that modulating articulation should modulate perception of AV speech. We build on Sams et al. (2005), who found evidence in support of this prediction by demonstrating that a participant's own articulation (/ka/) during the presentation of an auditory stimulus (/pa/) can produce McGurk-like interference. In our study, we presented participants with a McGurk mismatch (MM) stimulus (auditory /pa/, visual /ka/) while modulating the motor system by having participants articulate subvocally throughout stimulus presentation in a manner that should interfere with and therefore reduce the frequency of McGurk fusion, if the motor system were a critical component of the effect. To do this, we chose a syllable sequence for participants to articulate that was congruent with the auditory stimulus and incongruent with the visual stimulus in terms of place-of-articulation. Put differently, the visual signal in an AV MM stimulus tends to pull the percept away from the auditory signal. If this pull is mediated by the motor system, then aligning the listener's motor articulation with the auditory signal should minimize the pull. If the pull away from the auditory signal is mediated by cross-sensory interaction between auditory and visual signals (rather than sensorimotor interaction), then motor modulation should have no effect.

Experiment 2 was designed to test the second prediction of the motor hypothesis: that motor speech regions will show an AV integration activation profile. If motor speech regions are involved in AV integration, then they should show (i) a response to both auditory and visual speech and (ii) a larger response to multisensory speech than to auditory and visual speech in isolation (AV > AO and AV > VO). In an fMRI study utilizing a block design, we presented participants with AO, VO, AV, and McGurk MM speech as well as an articulatory rehearsal condition (ART) to identify areas involved in speech production. If the motor system contributes to AV speech integration, then motor speech areas, particularly the pars opercularis of Broca's area and premotor cortex (implicated by previous research), should show this AV integration activation profile. If the motor system does not contribute to AV speech integration, then these areas would show a profile inconsistent with AV integration.

EXPERIMENT 1

The objective of Experiment 1 was to assess whether direct modulation of the listener's motor system via concurrent speech articulation would modulate the strength of the McGurk effect. Two versions of Experiment 1 are reported.

Methods—Experiment 1a

Participants

Thirteen right-handed, native speakers of English (aged 18–30 years, 11 women) volunteered for participation. Participants had normal or corrected-to-normal vision and no hearing impairment. Participants were given course credit for their participation. Consent was acquired from each participant before participation in the study, and all procedures were approved by the Institutional Review Board of University of California, Irvine.

Stimuli

Auditory stimuli (AO) consisted of recordings of a native speaker of English producing the speech sounds /pa/, /ta/, and /ka/. The duration of each recording was 1000 msec, and the duration of the auditory speech was ∼300 msec for each syllable, digitized at 44,100 Hz. Each stimulus consisted of four repetitions of the same syllable. We presented participants with four repetitions because we wanted to ensure that articulation had the maximal opportunity to impact perception. Low-amplitude continuous white noise (level set to 10% RMS of speech) was added to each stimulus to ensure McGurk fusion as well as mask any sounds inadvertently produced during suppression. We created video recordings of the same speaker articulating the same speech sounds at a frame rate of 30 fps. Congruent AV stimuli were generated by overlaying the auditory stimuli onto the corresponding visual stimuli and aligning the onset of the consonant burst with the audio captured in the video recordings. In addition, one MM video was generated by overlaying auditory /pa/ onto visual /ka/ (McGurk-inducing). During stimulus presentation, stimulus loudness was set at a level that was clearly audible and comfortable for each participant.
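Setting the noise level to a fixed fraction of the speech RMS can be sketched as follows. The stimuli themselves were prepared for presentation in Matlab, so this Python version is illustrative only; the function name and assumption of a 1-D float sample array are ours.

```python
import numpy as np

def add_noise_at_rms_fraction(speech, fraction=0.10, seed=0):
    """Add continuous white noise whose RMS is a fixed fraction of the
    speech signal's RMS (e.g., fraction=0.10 for the 10% level used
    here). `speech` is a 1-D float array of audio samples."""
    rng = np.random.default_rng(seed)
    speech_rms = np.sqrt(np.mean(speech ** 2))
    noise = rng.standard_normal(len(speech))
    # Rescale the noise so its RMS equals fraction * speech_rms.
    noise *= (fraction * speech_rms) / np.sqrt(np.mean(noise ** 2))
    return speech + noise
```

Because the noise is rescaled exactly, the resulting noise-to-speech RMS ratio equals the requested fraction regardless of the amplitude of the input signal.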

Procedure

Participants were informed that they would be viewing videos of a speaker articulating syllables and were asked to make decisions regarding the identity of the acoustic stimuli in a 3AFC design among /pa/, /ta/, and /ka/. Specifically, they were instructed to “report the sound that they heard.” Participants made these judgments while simultaneously performing a secondary task, adapted from Baddeley, Eldridge, and Lewis (1981), which consisted either of continuously articulating the sequence “/pa/…/ba/” without producing sound or continuously performing a finger-tapping sequence, 1-2-3-4-5-5-4-3-2-1 (1 = thumb, 5 = pinky). For the suppression task, we chose the sounds /pa/ and /ba/ because /pa/ is identical to the auditory portion of our MM stimulus and /ba/ differs from it only in voicing, that is, the onset time of vocal fold vibration. Otherwise, the configuration of the vocal tract above the larynx during the articulation of these two consonants is identical. If visual speech influences heard speech via activation of motor representations, then saturating the motor system with auditory-congruent representations should strengthen activation in favor of the auditory stimulus, lessening the effect of the incongruent visual stimulus. Participants were instructed to perform both tasks at 2 Hz and were cued at that rate by an onscreen flickering fixation point that disappeared during stimulus presentation. Participants were instructed to continuously perform the task throughout stimulus presentation. Participants performed the same task (articulation or finger tapping) throughout a given experimental run.

Stimuli were blocked by modality (AO, AV) and task (articulatory suppression, finger-tapping) in four experimental runs. MM stimuli were presented during AV runs. AV runs were presented first to prevent participants from guessing the incongruent nature of the MM stimuli. Order of task was counterbalanced across participants. Ten trials of each stimulus were presented in random order in each run. Participants made their responses by indicating their decision on answer sheets provided to them. Once the participant completed a trial, she or he cued the onset of the next trial in a self-paced fashion. Each trial began by cueing the participant to get ready and then began with a button press when the participant was ready to begin the secondary task. The task cue flickered for 4 sec, at which point the stimulus was presented followed by a prompt to respond. Stimuli were delivered through a laptop computer with Matlab software (Mathworks, Inc., Natick, MA) utilizing Psychtoolbox (Kleiner, Brainard, & Pelli, 2007; Brainard, 1997; Pelli, 1997) and headphones (Sennheiser HD280, Wedemark, Lower Saxony).

To determine whether there was a McGurk effect, we compared performance on auditory identification of the AO /pa/ stimulus to the MM stimulus, with the expectation that successful interference results in reduced auditory identification performance in the MM condition. Therefore, we analyzed the data in a 2 × 2 design, crossing Stimulus (AO /pa/, MM) × Task (finger-tapping, articulatory suppression) to determine if task had an effect on the strength of the McGurk effect. Although only AO /pa/ and MM trials were included in the analysis, we included the other stimuli in the experiment so that participants gave a range of responses throughout the experiment to prevent them from guessing the nature of the MM stimulus.

Results—Experiment 1a

The average correct identification of the AO and congruent AV stimuli was at or near ceiling across /pa/, /ta/, and /ka/ for both secondary tasks. Figure 1 illustrates that the McGurk effect was equally robust during both secondary tasks. Given the presence of ceiling and floor effects, we performed nonparametric statistical examinations of the data (Kruskal–Wallis) that are less sensitive to outliers than parametric tests. Participants reported significantly more /pa/ responses during the AO /pa/ condition than the MM condition, an effect of Condition, χ2(1, n = 52) = 40.94, p < .001, indicating a successful McGurk effect. There was no effect of Task on /pa/ responses in the MM condition, χ2(1, n = 26) = 0.131, p = .718, nor was there an effect of Task on /ta/ (fusion) responses in the MM condition, χ2(1, n = 26) = 1.783, p = .182, indicating no effect of articulatory suppression on the McGurk effect. All reported Kruskal–Wallis tests are Bonferroni-corrected for multiple comparisons with a family-wise error rate of p < .05 (per-comparison error rates of p < .0167). The majority of responses in the MM condition during articulatory suppression were fusion responses (/ta/, 95%), rather than the visual capture response (/ka/, 2%). Consistent with a cross-sensory model of the source of AV integration and against the predictions of the motor hypothesis, these results strongly suggest that articulatory suppression does not affect McGurk fusion.
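The statistical approach can be illustrated with a short sketch. The per-participant response counts below are invented for illustration (they are not the study's data); only the test and the Bonferroni threshold follow the analysis described above.

```python
import numpy as np
from scipy.stats import kruskal

# Hypothetical counts of /pa/ responses (out of 10 MM trials) for 13
# participants under each secondary task; values are made up.
tapping     = np.array([0, 1, 0, 0, 2, 0, 0, 1, 0, 0, 0, 1, 0])
suppression = np.array([0, 0, 1, 0, 1, 0, 2, 0, 0, 0, 1, 0, 0])

# Kruskal-Wallis test (rank-based, robust to ceiling/floor effects).
h, p = kruskal(tapping, suppression)

# Bonferroni correction over three planned comparisons: an effect is
# significant only if p < .05 / 3, i.e., a per-comparison alpha of ~.0167.
alpha_per_comparison = 0.05 / 3
print(h, p, p < alpha_per_comparison)
```

With two groups, the Kruskal-Wallis test is equivalent to a Mann-Whitney comparison; the rank-based statistic is what makes it tolerant of the ceiling and floor effects noted above.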

Figure 1. 

Average number of responses for each alternative during Experiment 1a for the AO /pa/ and MM (A-/pa/, V-/ka/; McGurk) conditions during the finger-tapping task (black bars) and the articulatory suppression task (white bars). ns = not significant.


Discussion—Experiment 1a

Participants correctly identified AO and congruent AV syllables, but performance changed dramatically during perception of the incongruent stimuli. This is a classic McGurk effect (McGurk & MacDonald, 1976).

Against the predictions of the motor hypothesis, we did not see any difference between participants' responses during the articulatory suppression task and the finger-tapping task. In a framework that highlights the importance of articulatory representations in integrating AV speech, one would expect any distracting articulation to reduce McGurk fusion. In our experiment, participants' own articulations were congruent with the auditory stimulus, which should have the strongest possible effect. Instead, the articulatory suppression task showed no effect. This suggests that the McGurk effect is not mediated or even modulated by the motor system.

One possible issue with our results is that participants may have failed to articulate simultaneously with the auditory stimulus. This is unlikely given that participants were cued to begin articulation before the onset of the stimulus and continue throughout its duration at a fairly rapid rate (2 Hz) and because participants fused at nearly 100% during the articulation task, implying that this would have had to happen on nearly every trial. In a previous version of the experiment that was run as a pilot study, we used a single stimulus presentation with a single simultaneous articulation, in accordance with Sams et al. (2005), and did not observe the reported interference effect. This led us to adopt the current design with four stimulus repetitions and rapid articulatory suppression in an attempt to afford the motor system the most chance to influence perception. However, one might still argue that the motor system was not sufficiently driven by this task. A second issue concerns our usage of /pa/ and /ba/ during the articulatory suppression task. It is possible that the use of more than one syllable caused some form of confusion and led participants to rely more on the visual stimulus, contaminating the results. A third potential issue with our design is the presence of four stimulus repetitions, which may have somehow altered the results because of participants making a collective judgment on multiple stimuli rather than a single stimulus. To address these concerns, we ran a second experiment in which we employed a rapid articulatory suppression of /pa/ alone without cueing (i.e., as fast as possible) and trials that consisted of only a single stimulus presentation.

Methods—Experiment 1b

Participants

Seventeen right-handed, native speakers of English (aged 18–39 years, mean = 21 years, 10 women) volunteered for participation. Participants had normal or corrected-to-normal vision and no hearing impairment. Participants were given course credit for their participation. Consent was acquired from each participant before participation in the study, and all procedures were approved by the Institutional Review Board of University of California, Irvine.

Stimuli

Stimuli were identical to Experiment 1a, with the following modifications: The duration of each stimulus was lengthened to 2000 msec (syllable duration the same), white noise level was increased to 20% RMS of speech, and each stimulus consisted of only a single presentation.

Procedure

The experimental procedure was identical to Experiment 1a, with the following modifications. We altered the articulatory suppression task such that participants were instructed to articulate /pa/ silently and as rapidly as possible from when the trial began and throughout stimulus presentation, instead of being cued to articulate /pa/…/ba/ at 2 Hz. We replaced the finger-tapping task with a baseline condition with no secondary task. Stimuli were blocked by modality (AO, AV) and condition (baseline, articulatory suppression) in four experimental runs. Order of condition and modality was different (partially counterbalanced) for each participant. Ten trials of each stimulus were presented in random order in each run. Participants made their responses by pressing the appropriate key on the keyboard. Once the participant completed a trial, she or he cued the onset of the next trial in a self-paced fashion. Each trial began by cueing the participant to get ready and then began with a button press when the participant was ready to begin the trial. A fixation “x” appeared for 1.5 sec, followed by the stimulus.

The data were analyzed in the same manner as Experiment 1a, replacing the finger-tapping condition with the baseline condition.

Results—Experiment 1b

The results are consistent with Experiment 1a. Average correct identification of the AO and congruent AV stimuli was at ceiling across /pa/, /ta/, and /ka/ in both conditions. Figure 2 illustrates that the McGurk effect was equally robust during baseline and articulatory suppression. As in Experiment 1a, we performed nonparametric statistical examinations of the data (Kruskal–Wallis). Participants reported significantly more /pa/ responses during the AO /pa/ condition than the MM condition, an effect of Condition, χ2(1, n = 68) = 45.80, p < .001, indicating a successful McGurk effect. There was no effect of Task on /pa/ responses in the MM condition, χ2(1, n = 34) = 0.30, p = .584, nor was there an effect of Task on /ta/ (fusion) responses in the MM condition, χ2(1, n = 34) = 0.09, p = .762, indicating no effect of articulatory suppression on the McGurk effect. All reported Kruskal–Wallis tests are Bonferroni-corrected for multiple comparisons with a familywise error rate of p < .05 (per-comparison error rates of p < .0167). There were some differences from Experiment 1a in the response rate for each alternative, but the results were qualitatively similar, with a majority of fusion responses (/ta/, 68%).

Figure 2. 

Average number of responses for each alternative during Experiment 1b for the AO /pa/ and MM (A-/pa/, V-/ka/; McGurk) conditions during baseline (black bars) and the articulatory suppression task (white bars). ns = not significant.


Discussion—Experiment 1b

As in Experiment 1a, participants correctly identified AO and congruent AV syllables, but performance changed dramatically during perception of the incongruent stimuli, confirming the presence of a McGurk effect (McGurk & MacDonald, 1976) under conditions of articulatory suppression. There were some differences in the overall fusion rate between the two experiments (∼90% in Experiment 1a and ∼65% in Experiment 1b) and concomitant differences in visual capture and auditory perceptions. The difference in fusion rates may be largely explained by the difference in presentation: Four repetitions of the same stimulus were used in Experiment 1a, whereas only a single presentation was used in Experiment 1b. In addition, the noise level increase may have affected some participants' judgments. However, the alterations in the experimental design did not qualitatively change the results: McGurk fusion rate does not change from baseline during articulatory suppression. This allays the noted concerns from Experiment 1a.

The result that the McGurk effect is not weakened under articulatory suppression conflicts with the results of Sams et al. (2005). However, the discrepancy can be explained by closely examining their results. Considering the type of response in their study (fusion /ta/ or visual/articulatory capture /ka/), the proportion of fusion responses was the same in the baseline condition as during their articulation condition (23%), with only “capture responses” (percept is congruent with the visual stimulus) increasing with articulation (46% vs. 9%). This trend held for all of their experimental conditions, including for a written /ka/ (26% /ka/). This is different from most McGurk paradigms, in which the bulk of the interference effect derives from fusion rather than visual capture. The interference effect of the written stimulus, along with their effects being driven by capture rather than fusion, may be partially explained by the high-amplitude noise added to the auditory stimulus to drive baseline /pa/ identification down to ∼68%. In this light, the interference effect obtained in their study is relatively weak and may have resulted from response bias induced by the noisy auditory stimulus.

In summary, we found no evidence that behavioral interference involving the motor speech system modulates the McGurk effect, casting doubt on both strong and weak versions of the motor hypothesis of AV integration.

EXPERIMENT 2

The goal of Experiment 2 was to use fMRI to examine the profile of activation in motor speech regions in response to auditory, visual, and AV speech and to compare this profile with the one observed in the STS. We were particularly interested in determining whether speech motor areas (specifically, the pars opercularis of Broca's area and premotor cortex) exhibit an AV integration profile (AV > AO and AV > VO).

Methods—Experiment 2

Participants

Twenty right-handed, native speakers of English (aged 20–30 years, eight men) volunteered for participation. Participants had normal or corrected-to-normal vision and no hearing impairment and reported no history of neurological disorder. Participants were paid $30 an hour for their participation. Consent was acquired from each participant before participation in the study, and all procedures were approved by the Institutional Review Board of University of California, Irvine.

Stimuli and Design

The stimuli for Experiment 2 were identical to Experiment 1, except for the following: All stimuli had a duration of 1000 msec, and the noise level was set to 25% RMS of speech. VO stimuli were added, consisting of the same videos as the congruent AV stimuli with no sound. In addition, an articulatory rehearsal condition (ART) was added, cued by a flickering fixation cross. In summary, the experiment consisted of a 3 × 3 design, Condition (AO, VO, AV) × Stimulus (/pa/, /ta/, /ka/), plus two additional conditions, MM and ART.

Procedure

Participants were informed that they would view videos of a talker articulating the speech sounds /pa/, /ta/, and /ka/ and were instructed to make decisions regarding the identity of the stimuli. Trials consisted of a block of 10 sequential identical speech sounds followed by 2.5 sec of fixation. Participants were instructed to pay attention throughout the duration of the trial and at the end of the block to identify the speech sound in audio and AV trials and the intended speech sound in visual trials. As in the behavioral experiment, participants were not informed of the incongruent nature of the MM stimulus, although two participants were aware of the presence of a MM stimulus. Responses were made with a response box using the left hand. Each alternative in a 3AFC design among /pa/, /ta/, and /ka/ was assigned a distinct button, with three fingers assigned to the respective buttons. Participants were instructed to make their response within 2 sec of stimulus offset. AO trials were presented alongside a still image of the speaker's face, whereas VO trials were presented in silence (aside from the background scanner noise). During ART trials, the cue to articulate was a fixation cross that flickered at 2 Hz, and participants were instructed to produce the sequence /pa/…/ta/…/ka/ repeatedly throughout the duration of flickering (10 sec) without producing sound or opening their mouth while still making movements internal to the vocal tract, including tongue movements. Participants stopped articulating when the fixation cross stopped flickering. Stimuli were delivered with Matlab software (Mathworks, Inc., Natick, MA) utilizing Cogent (vislab.ucl.ac.uk/cogent_2000.php) and MR-compatible headphones. The experiment consisted of nine runs—one practice run, six functional runs, and two localizer runs—and one anatomical scan. The practice run was utilized to familiarize participants with the stimuli and task, and no data were analyzed from this run.
Four trials of each condition, along with four rest trials (still image of the speaker's face), were presented in random order within each functional run (24 trials total). Because of a coding error, two participants received slightly unequal numbers of AO and VO trials, and one of these participants also received slightly fewer MM trials. The localizer runs consisted solely of VO and rest trials (12 VO, 6 rest per run) to obtain functionally independent ROIs for further analysis. The stimuli and task remained the same throughout these two localizer runs. Following this, we collected a high-resolution anatomical scan. In all, participants were in the scanner for less than an hour.

fMRI Data Collection and Preprocessing

MR images were obtained with a Philips Achieva 3T scanner (Philips Medical Systems, Andover, MA) fitted with an eight-channel radio frequency receiver head coil at the high-field scanning facility at the University of California, Irvine. We first collected a total of 1110 T2*-weighted EPI volumes over nine runs using fast echo EPI with ascending slice acquisition (repetition time = 2.5 sec, echo time = 25 msec, flip angle = 90°, in-plane resolution = 1.95 × 1.95 mm, slice thickness = 3 mm with 0.5 mm gap). The first four volumes of each run were collected before stimulus presentation and discarded to control for saturation effects. After the functional scans, a high-resolution T1-weighted anatomical image was acquired in the axial plane (repetition time = 8 msec, echo time = 3.7 msec, flip angle = 8°, voxel size = 1 mm isotropic).

Slice-timing correction, motion correction, and spatial smoothing were performed using AFNI software (afni.nimh.nih.gov/afni). Motion correction was achieved by using a six-parameter rigid body transformation, with each functional volume in a run first aligned to a single volume in that run. Functional volumes were aligned to the anatomical image and subsequently aligned to Talairach space (Talairach & Tournoux, 1988). Functional images were resampled to 2.5 mm isotropic voxels and spatially smoothed using a Gaussian kernel of 6 mm FWHM.
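The relation between a smoothing kernel's FWHM and the standard deviation of the Gaussian that implements it can be sketched as follows. This is an illustrative Python snippet (the function name is ours), not the AFNI implementation:

```python
import math

def fwhm_to_sigma(fwhm_mm: float) -> float:
    """Convert a Gaussian kernel's full width at half maximum to its
    standard deviation: sigma = FWHM / (2 * sqrt(2 * ln 2))."""
    return fwhm_mm / (2.0 * math.sqrt(2.0 * math.log(2.0)))

# The 6 mm FWHM kernel used here corresponds to sigma of roughly 2.55 mm.
print(round(fwhm_to_sigma(6.0), 2))  # prints 2.55
```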

First-level analyses were performed on each individual participant's data using AFNI's 3dDeconvolve function. The regression analysis was performed to find parameter estimates that best explained variability in the data. Each predictor variable representing the time course of stimulus presentation was convolved with the hemodynamic response function and entered into the general linear model. The following five regressors of interest were used in the experimental analysis: AO speech, VO speech, congruent AV speech, MM, and ART. The six motion parameters were included as regressors of no interest. The independent localizer data were analyzed in the same fashion, with the single regressor of interest, the VO condition. A second-level analysis was then performed on the parameter estimates, using AFNI's 3dANOVA2 function. Using a false discovery rate (FDR) correction for multiple comparisons, a threshold of q < 0.05 was used to examine activity above baseline for each condition and for the following contrasts: [AV > AO], [AV > VO], and [MM > AV].
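The FDR correction referenced above is typically the Benjamini–Hochberg step-up procedure; a minimal Python sketch of the general method (an illustration, not the exact code used by AFNI) is:

```python
def fdr_bh(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure: find the largest rank k such
    that p_(k) <= (k / m) * q, then declare the k smallest p values
    significant, controlling the expected false discovery rate at q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= (rank / m) * q:
            k = rank  # largest rank passing the step-up criterion
    sig = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k:
            sig[i] = True
    return sig
```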

To compare the profile of activation in motor areas with that of the pSTS, we split our functional data into even and odd runs. The even runs were used to localize the pSTS multisensory region using a conjunction analysis of AO and VO (Nath & Beauchamp, 2012), with an individual uncorrected p < .05 for each condition. We justify the use of this liberal threshold because (a) the threshold was used only to select the ROIs, (b) the conjunction analysis produces a combined statistical threshold much more stringent than the individual thresholds, and (c) power is greatly reduced because the data are split between localization and analysis. The conjunction analysis from the even runs yielded a pSTS ROI in each hemisphere. The data from odd runs were averaged within each ROI, and the means were entered into t tests. No motor speech areas were localized by the conjunction analysis, so we used the results of the localizer analysis (VO > rest, q < 0.01) to define two frontal-motor ROIs previously reported to be engaged in AV speech integration: posterior Broca's area (pars opercularis) and the dorsal premotor cortex of the precentral gyrus (Skipper et al., 2007; Ojanen et al., 2005). The parameter estimates for each participant for each condition from the functional runs were averaged within each ROI, and the means were entered into a statistical analysis.
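The conjunction localizer logic can be illustrated with a short sketch (hypothetical Python with made-up variable names): a voxel enters the ROI only if it passes the threshold in both unimodal maps, so under independence the combined false-positive rate is roughly the product of the individual thresholds (.05 × .05 = .0025).

```python
def conjunction_mask(p_ao, p_vo, alpha=0.05):
    """Voxelwise conjunction: a voxel is retained only if p < alpha in BOTH
    the auditory-only and visual-only maps (combined rate ~ alpha**2
    if the two maps are independent)."""
    return [(a < alpha) and (v < alpha) for a, v in zip(p_ao, p_vo)]

# Illustrative per-voxel p values for three voxels:
print(conjunction_mask([0.01, 0.20, 0.04], [0.03, 0.01, 0.30]))
# prints [True, False, False]
```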

Results—Experiment 2

Behavioral Performance

Participants accurately identified 93% of stimuli in the AO condition, 68% in the VO condition, 97% in the AV condition, and 4% in the MM condition (86% fusion, 8% visual capture). Analyzing behavioral performance during the AV, AO, and VO conditions showed a significant main effect of Condition, F(2, 38) = 74.607, p < .001, a significant main effect of Syllable, F(2, 38) = 24.239, p < .001, and a significant interaction, F(4, 76) = 41.229, p < .001. There was a significant effect of Syllable in the AO condition, F(2, 38) = 6.766, p = .003, no effect of Syllable in the AV condition, F(2, 38) = 1.401, p = .259, and a significant effect of Syllable in the VO condition, F(2, 38) = 63.004, p < .001.

Individual comparisons in the AO condition revealed that identification of AO /pa/ was lower than AO /ta/, t(19) = 2.613, p = .017 (two-tailed), AO /pa/ was lower than AO /ka/, t(19) = 2.626, p = .017 (two-tailed), with no difference between AO /ta/ and AO /ka/, t(19) = 0.567, p = .577 (two-tailed). Individual comparisons in the VO condition revealed that identification of VO /pa/ was greater than VO /ta/, t(19) = 3.796, p = .001 (two-tailed), VO /pa/ was greater than VO /ka/, t(19) = 14.312, p < .001 (two-tailed), and VO /ta/ was greater than VO /ka/, t(19) = 5.998, p < .001 (two-tailed). All reported t tests are Bonferroni-corrected for multiple comparisons with a family-wise error rate of p < .05, with per-comparison error rates of p < .0167.
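The Bonferroni correction applied here simply divides the family-wise error rate by the number of comparisons; a one-line illustrative Python sketch:

```python
def bonferroni_alpha(family_alpha: float, n_comparisons: int) -> float:
    """Per-comparison alpha under Bonferroni correction:
    family-wise alpha divided by the number of tests."""
    return family_alpha / n_comparisons

# Three pairwise t tests at a family-wise rate of .05 -> per-comparison .0167.
print(round(bonferroni_alpha(0.05, 3), 4))  # prints 0.0167
```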

We suspected that the poor performance in the VO condition was because of the similarity between /ta/ and /ka/. The difference between these two stimuli resides in the place of articulation of the tongue, which is difficult to see when viewing the face. The visual similarity of these two stimuli suggests that the consonants belong to the same viseme (the visual counterpart to the phoneme; Fisher, 1968). By contrast, bilabial /pa/ is easily discriminated from /ta/ and /ka/ because of the involvement of lip closure. However, our statistical examination of percent correct indicated that /ta/ was identified significantly more accurately than /ka/. This suggested that participants, when faced with an ambiguous /ta/ or /ka/, were biased to respond /ta/, resulting in higher-than-chance accuracy for /ta/ and lower-than-chance accuracy for /ka/. Thus, we decided to examine these data using signal detection theory, which allowed us to account for response bias and obtain a true measure of participants' ability to discriminate these two stimuli. We analyzed only the /ta/ and /ka/ data using a standard 2AFC calculation of d′ (Macmillan & Creelman, 2005), treating /ta/ responses to /ta/ as hits and /ta/ responses to /ka/ as false alarms. We are justified in excluding /pa/ from the decision space because participants falsely identified VO /pa/ as /ta/ or /ka/ 0% of the time and falsely identified VO /ta/ and /ka/ as /pa/ 0% of the time, indicating that participants never considered /pa/ as a possibility when identifying /ta/ and /ka/. Our results showed a d′ of 0.11, indicating that participants were effectively at chance discriminating VO /ta/ and /ka/, confirming our expectation.
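The 2AFC d′ computation from Macmillan & Creelman (2005) can be sketched as follows (Python, standard library only; the rates passed in below are illustrative, not the observed data):

```python
from statistics import NormalDist

def dprime_2afc(hit_rate: float, fa_rate: float) -> float:
    """2AFC sensitivity (Macmillan & Creelman, 2005):
    d' = (z(H) - z(F)) / sqrt(2), where z is the inverse normal CDF.
    Here H = P(/ta/ response | /ta/) and F = P(/ta/ response | /ka/)."""
    z = NormalDist().inv_cdf
    return (z(hit_rate) - z(fa_rate)) / (2.0 ** 0.5)

# Equal hit and false-alarm rates mean chance performance (d' = 0).
print(dprime_2afc(0.5, 0.5))  # prints 0.0
```

A d′ near zero, as observed here (0.11), indicates that responses track bias rather than genuine discrimination of the two stimuli.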

fMRI Analyses

Activation relative to “rest” (still image of speaker's face; no auditory stimulus) for each condition is shown in Figure 3. AV generated greater activity than AO in lateral occipital lobe bilaterally, right hemisphere IFG, left premotor cortex, and right parietal lobe (Figure 4; Table 1). AV generated greater activity than VO in the superior temporal lobe bilaterally and throughout the default network (Buckner, Andrews-Hanna, & Schacter, 2008; Figure 4; Table 2). MM activated the motor speech network significantly more than AV speech, including the pars opercularis of Broca's area, anterior insula, and left premotor cortex (Figure 4; Table 3). The VO localizer activated a similar set of brain regions as the VO condition in the experiment (Figure 3); we selected the left and right hemisphere pars opercularis and dorsal precentral gyrus activations as our ROIs for further analysis.

Figure 3. 

Activations above baseline during the functional runs from each condition during Experiment 2. All activations are shown with an FDR-corrected threshold of q < 0.05. Auditory, AV, and MM speech activated a peri-sylvian language network including superior temporal lobes, inferior/middle frontal gyrus, dorsal precentral gyrus, and inferior parietal lobe. Visual speech activated lateral occipital lobe, posterior middle temporal lobe, inferior/middle frontal gyrus, dorsal precentral gyrus, and inferior parietal lobe. ART activated posterior IFG, precentral gyrus, and inferior parietal lobe.


Figure 4. 

Contrasts from Experiment 2. All activations are positive with an FDR-corrected threshold of q < 0.05.


Table 1. 

Talairach Coordinates of Clusters Activated by the Contrast AV > AO

Region | Hemisphere | Talairach x, y, z | Cluster Size (mm³)
Lateral occipital lobe | Left | −34, −78, −1 | 22,422
Lateral occipital lobe | Right | 37, −74, −1 | 21,594
Precentral gyrus | Right | 51, 36 | 1,500
Superior parietal lobule | Right | 31, −55, 46 | 1,000
Inferior frontal gyrus (pars opercularis) | Right | 55, 21 | 890
Inferior frontal gyrus (pars triangularis) | Right | 51, 29, 13 | 344
Precentral gyrus | Left | −55, −1, 35 | 344
Inferior parietal lobule | Right | 38, −36, 42 | 297
Middle OFC | Right | 20, 43, −9 | 250

n = 20, cluster threshold = 10 voxels, FDR q < 0.05.

Table 2. 

Talairach Coordinates of Clusters Activated by the Contrast AV > VO

Region | Hemisphere | Talairach x, y, z | Cluster Size (mm³)
Medial occipital lobe | Left/right | −79, 14 | 22,891
Superior temporal lobe | Left | −51, −20 | 21,781
Superior temporal lobe | Right | 56, −17 | 19,422
Angular gyrus | Left | −42, −64, 26 | 2,156
Hippocampus | Left/right | −33, −5 | 1,891
Anterior medial frontal cortex | Left/right | −3, 54 | 1,219
Medial superior frontal gyrus | Right | 55, 30 | 640
Anterior cingulate | Left/right | −30 | 313
Angular gyrus | Right | 46, −61, 25 | 297
Cerebellum | Left | −19, −48, −19 | 266
Anterior inferior cingulate | Left/right | 12, −3 | 234
Superior medial gyrus | Right | 46, 35 | 234
BG | Right | 25 | 188
Orbital gyrus | Right | 13, 41 | 172
Cerebellum | Left | −19, −50, −46 | 156

n = 20, cluster threshold = 10 voxels, FDR q < 0.05.

Table 3. 

Talairach Coordinates of Clusters Activated by the Contrast MM > AV

Region | Hemisphere | Talairach x, y, z | Cluster Size (mm³)
SMA | Left/right | −12, 46 | 4,688
Anterior insula | Left | −32, 19 | 3,547
Anterior insula | Right | 34, 19 | 2,703
Middle frontal gyrus | Right | 32, 49, 13 | 2,188
Middle/IFG (pars triangularis) | Left | −46, 21, 27 | 1,766
Middle frontal gyrus | Left | −31, 46, 20 | 1,188
Inferior frontal gyrus | Left | −54, −6, 20 | 672
Precentral gyrus | Left | −47, −5, 50 | 422
Middle/IFG (pars triangularis) | Right | 46, 20, 28 | 188
Middle/IFG (pars triangularis) | Right | 54, 24, 26 | 172
Inferior frontal/precentral gyrus | Left | −35, 30 | 156

n = 20, cluster threshold = 10 voxels, FDR q < 0.05.

Figure 5 (top) illustrates the results of the ROI analyses in the motor areas localized through the independent VO runs. Both left hemisphere regions were strongly activated by the ART condition, confirming that they were indeed motor speech areas; in the right hemisphere, only the premotor ROI was activated by the ART condition. All ROIs were strongly activated by the VO condition. However, comparisons among the conditions in both regions revealed an activation profile inconsistent with AV integration: AO produced little activation, and VO produced significantly greater activation than both the AV and MM conditions. By contrast, Figure 6 illustrates the results of the analyses in the pSTS. Bilateral pSTS, localized through the conjunction of AO and VO using independent runs, exhibited the expected AV integration pattern, with the AV conditions (AV and MM) producing more activation than either AO or VO alone. We were unable to localize motor regions effectively using this conjunction analysis.

Figure 5. 

Analyses of frontal-motor ROIs from Experiment 2. (top) Pars opercularis and dorsal premotor cortex ROIs localized during the VO localizer runs. Percent signal change values for each condition within each ROI are reported in bar graphs. Left pars opercularis: VO activated this region significantly more than AV, t(19) = 6.054, p < .001, one-tailed, and MM, t(19) = 2.852, p = .005, one-tailed. Left dorsal premotor cortex: VO activated this region significantly more than AV, t(19) = 4.944, p < .001, one-tailed, and MM, t(19) = 3.081, p = .002, one-tailed. Right pars opercularis: VO activated this region significantly more than AV, t(19) = 4.25, p < .001, one-tailed, and MM, t(19) = 2.539, p = .010, one-tailed. Right dorsal premotor cortex: VO activated this region significantly more than AV, t(19) = 4.316, p < .001, one-tailed, and MM, t(19) = 3.406, p = .002, one-tailed. (bottom) Pars opercularis and dorsal premotor cortex ROIs localized from AV trials during even experimental runs. Percent signal change values for each condition from odd runs within each ROI are reported in bar graphs. Left pars opercularis: VO speech activated this region significantly more than AV, t(19) = 4.859, p < .001, one-tailed, and MM, t(19) = 3.004, p = .004, one-tailed. Left dorsal premotor cortex: VO speech activated this region significantly more than AV, t(19) = 4.878, p < .001, one-tailed, and MM, t(19) = 2.923, p = .004, one-tailed. Right pars opercularis: VO speech activated this region significantly more than AV, t(19) = 2.308, p = .016, one-tailed; VO was not significantly greater than MM, t(19) = 1.538, p = .065, one-tailed. Right dorsal premotor cortex: VO speech activated this region significantly more than AV, t(19) = 4.011, p = .001, one-tailed, and MM, t(19) = 2.559, p = .010, one-tailed. ns = not significant. Error bars indicate standard error for each condition.
All reported t tests are Bonferroni corrected for multiple comparisons with a family-wise error rate of p < .05 (per-comparison error rates of p < .025).


Figure 6. 

Analyses of pSTS ROIs from Experiment 2. (top) Left pSTS ROI localized through a conjunction of AO and VO trials during even experimental runs. Percent signal change values from odd runs for each condition within the ROI are reported in the bar graph to the right. The contrast AO: −2 VO: 0 AV: 1 MM: 1 ART: 0 revealed that the multisensory conditions (AV and MM) produced significantly greater activity than AO, F(1, 19) = 6.045, p = .024, and the contrast AO: 0 VO: −2 AV: 1 MM: 1 ART: 0 revealed that the multisensory conditions (AV and MM) produced marginally significantly greater activity than VO, F(1, 19) = 4.748, p = .042. (bottom) Right pSTS ROI localized through the conjunction of AO and VO trials during even experimental runs. Percent signal change values from odd runs for each condition within the ROI are reported in the bar graph to the right. The contrast AO: −2 VO: 0 AV: 1 MM: 1 ART: 0 revealed that the multisensory conditions (AV and MM) produced significantly greater activity than AO, F(1, 19) = 6.045, p = .025, and the contrast AO: 0 VO: −2 AV: 1 MM: 1 ART: 0 revealed that the multisensory conditions (AV and MM) produced marginally significantly greater activity than VO, F(1, 19) = 4.748, p = .030. Error bars indicate standard error for each condition. All reported contrasts are Bonferroni corrected for multiple comparisons with a family-wise error rate of p < .05 (per-comparison error rates of p < .025).


One could argue that using a unimodal localizer (VO) to select the ROIs biased the analysis against finding an AV integration region in motor cortex. To rule out this possibility, we redefined the ROIs in the pars opercularis and premotor cortex using AV versus rest from the even functional runs and reran the analysis on the odd functional runs. The results from this analysis are therefore biased in favor of AV integration. Regardless, the same response profile resulted from the analysis as indicated by Figure 5 (bottom).

Given the disparity in behavioral performance for individual stimuli during the VO condition (/pa/ at ceiling, /ta/ and /ka/ at chance), we decided to perform a post hoc analysis to determine if activation in the motor system during VO followed this pattern. We reran the individual participant deconvolution analyses, replacing the regressor of interest for the VO condition with three individual regressors for /pa/, /ta/, and /ka/. The parameter estimates for each participant for each stimulus were averaged within both frontal ROIs obtained from the VO localizer, and the means were entered into t tests. Consistent with our prediction, Figure 7 shows that activity in both the pars opercularis of Broca's area and dorsal premotor cortex was higher for /ta/ and /ka/ than for /pa/, with no difference between /ta/ and /ka/, confirming that activity was lowest when participants were at ceiling (/pa/) and highest when at chance (/ta/, /ka/).
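The ROI-level t tests described above are paired comparisons across participants; the core computation can be sketched as follows (illustrative Python, standard library only):

```python
from statistics import mean, stdev

def paired_t(cond_a, cond_b):
    """Paired t statistic across participants: the mean of the per-participant
    differences divided by the standard error of those differences (df = n - 1)."""
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / n ** 0.5)
```

With 20 participants, the resulting statistic is evaluated against a t distribution with 19 degrees of freedom, matching the t(19) values reported above.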

Figure 7. 

Analyses of left hemisphere frontal-motor ROIs for the individual stimuli from the VO condition from Experiment 2. (top) Pars opercularis ROI localized during the VO localizer runs. Percent signal change values for each VO stimulus within this ROI are reported in the bar graph to the right. VO /pa/ activated this region significantly less than VO /ta/, t(19) = −3.392, p = .003, two-tailed, as well as VO /ka/, t(19) = −4.029, p = .001, two-tailed. There was no significant difference between VO /ta/ and VO /ka/, t(19) = −0.326, p = .748, two-tailed. (bottom) Dorsal premotor cortex ROI localized during the VO localizer runs. Percent signal change values for each VO stimulus within this ROI are reported in the bar graph to the right. VO /pa/ activated this region significantly less than VO /ta/, t(19) = −3.634, p = .002, two-tailed, as well as VO /ka/, t(19) = −3.623, p = .002, two-tailed. There was no significant difference between VO /ta/ and VO /ka/, t(19) = 0.173, p = .864, two-tailed. Error bars indicate standard error for each condition. ns = not significant. All reported t tests are Scheffé corrected for post hoc multiple comparisons with a family-wise error rate of p < .05 (per-comparison error rates of p < .008).


Discussion—Experiment 2

Consistent with previous research, the whole-brain contrasts AV > AO and AV > VO each resulted in activations in the vicinity of the pSTS. The ROI analysis confirmed that this region displayed the expected AV integration profile: AV > AO and AV > VO. Although the whole-brain contrast AV > AO resulted in activity in the posterior left IFG (Broca's area), the contrast AV > VO did not. The ROI analysis of posterior left IFG and left dorsal premotor cortex, two motor speech regions implicated in AV integration (Skipper et al., 2007), confirmed that the motor system did not display the AV integration profile: VO speech activated these areas significantly more than AV speech (VO > AV). Post hoc analysis of the individual stimuli from the VO condition revealed that the ambiguous stimuli /ta/ and /ka/ drove much of the activation in these areas. These results suggest that AV integration for speech involves the pSTS but not the speech motor system.

GENERAL DISCUSSION

Neither experiment found evidence that the motor speech system is involved in AV integration, even in a weak modulatory capacity. Using behavioral measures, Experiment 1 found that strongly modulating the activity of the motor speech system via articulatory suppression did not correspondingly modulate the strength of the McGurk effect; in fact, it had no effect at all. If the motor speech system mediates the AV integration processes that underlie the McGurk effect, then we should have seen a significant effect of motor modulation on McGurk fusion, yet we did not. Using fMRI, Experiment 2 found that the response of motor speech areas did not show the characteristic signature of AV integration (AV > AO and AV > VO). Instead, AV stimuli activated motor speech areas significantly less than visual speech stimuli alone. Consistent with previous reports, the response properties of the STS were more in line with a region critically involved in AV integration (AV > AO and AV > VO). Taken together, these studies substantially weaken the position that motor speech areas play a significant role in AV speech integration and strengthen the view that the STS is the critical site.

If motor speech areas are not involved in AV integration, why do these regions activate under some speech-related conditions, such as VO speech? One view is that motor speech circuits are needed for perception and therefore recruited for perceptual purposes under noisy or ambiguous conditions. There is no doubt that motor regions are indeed more active when the perceptual signal is degraded, as shown by previous studies (Fridriksson, 2008; Miller & D'Esposito, 2005). This was evident in this study with partially ambiguous VO speech generating more motor-related activity than relatively perceptible AV speech. But to say that the motor system is recruited under demanding perceptual conditions only restates the facts. The critical question is: does the motor system actually aid or improve perception in any way?

A recent fMRI study suggests that this is not the case. Venezia et al. held perceptibility (d′) constant in an auditory CV syllable discrimination task and varied performance by manipulating response bias using different ratios of same to different trials (Venezia et al., 2012). Neural activity in motor speech areas was significantly negatively correlated with behaviorally measured response bias, although perceptual discriminability was held constant. This suggests that, although motor regions are recruited under some task conditions, their involvement does not necessarily result in better perceptual performance. Similar results were obtained in a purely behavioral experiment in which use-induced motor plasticity of speech articulators modulated bias but not discriminability of auditory syllables (Sato et al., 2011). The results are consistent with the motor system interacting with participant responses, but not aiding in perception, as d′ would be expected to vary if this were the case.

One alternative interpretation of motor activity during speech tasks is that it is purely epiphenomenal, deriving from associations between auditory and motor speech systems that are critical for speech production but not for speech perception. Existing models of speech motor control provide a mechanism for such an association in the form of feedback control architectures (Hickok, 2012; Hickok, Houde, & Rong, 2011; Houde & Nagarajan, 2011; Tourville & Guenther, 2011). However, this view fails to explain why the motor system is more active in some conditions than others. If pure association were the only mechanism driving the motor activations, one would expect equal activation under all conditions; clearly, the activity differs across modalities (AO, VO, AV), stimulus quality, and task. In addition, the findings of Venezia et al. (2012) and Sato et al. (2011) point toward some role of the motor system in generating participants' responses, as response bias correlates with activity in the motor system. Since epiphenomenal (i.e., purely association-related) activation alone cannot account for these effects, another mechanism must be driving the activations.

A third possibility is that the motor system somehow participates in response selection. As the response selection demands increase, so does activity in motor speech systems. This accounts for the correlation between response bias and motor speech region activity as well as the tendency for these regions to be more active when the perceptual stimuli are degraded and ambiguous (thus increasing the load on response selection). Previous work has implicated motor-related regions in the IFG in response selection (Snyder, Feigenson, & Thompson-Schill, 2007; Novick et al., 2005; Thompson-Schill, D'Esposito, Aguirre, & Farah, 1997), which is broadly consistent with this view. What remains unclear is the role that lower-level motor speech areas (including the dorsal premotor cortex) play in response selection. One possibility is that they contribute via their involvement in phonological working memory, which clearly has an articulatory component (Buchsbaum, Olsen, Koch, & Berman, 2005; Hickok, Buchsbaum, Humphries, & Muftuler, 2003). During syllable identification or discrimination tasks, participants may utilize verbal working memory resources in difficult processing environments, resulting in activation in the motor system. What is clear from the evidence is that the activation of these regions does not track with speech perception or AV integration. More work is needed to determine precisely the role that these motor regions play in response selection during speech comprehension.

Conclusions

The results of our experiments suggest that the motor system does not play a role in AV speech integration. First, articulatory suppression (ART) had no effect on the McGurk effect, showing that engaging articulatory representations does not inhibit McGurk fusion and suggesting that the motor speech network and the AV integration network do not interact during McGurk fusion. Second, motor speech regions (including the pars opercularis of Broca's area and the dorsal premotor cortex) exhibited an activation profile inconsistent with AV integration. Demands on response selection likely account for much of the activity in these regions during speech perception, whether unisensory or multisensory. By contrast, the pSTS does exhibit the expected integration profile, consistent with previous accounts of its role in AV integration.

Acknowledgments

The authors would like to thank J. Venezia for useful comments and suggestions throughout this investigation. This work was supported by a grant (DC009659) from the U.S. National Institutes of Health.

Reprint requests should be sent to William Matchin, Department of Cognitive Sciences, University of California, Irvine, 3151 Social Science Plaza, Irvine, CA 92697, or via e-mail: wmatchin@uci.edu, wmatchin@gmail.com.

REFERENCES

Baddeley, A., Eldridge, M., & Lewis, V. (1981). The role of subvocalization in reading. Quarterly Journal of Experimental Psychology: Section A, 33, 439–454.
Beauchamp, M. S. (2005). Statistical criteria in fMRI studies of multisensory integration. Neuroinformatics, 3, 93–113.
Beauchamp, M. S., Argall, B. D., Bodurka, J., Duyn, J. H., & Martin, A. (2004). Unraveling multisensory integration: Patchy organization within human STS multisensory cortex. Nature Neuroscience, 7, 1190–1192.
Beauchamp, M. S., Lee, K. E., Argall, B. D., & Martin, A. (2004). Integration of auditory and visual information about objects in superior temporal sulcus. Neuron, 41, 809–823.
Beauchamp, M. S., Nath, A. R., & Pasalar, S. (2010). fMRI-guided transcranial magnetic stimulation reveals that the superior temporal sulcus is a cortical locus of the McGurk effect. Journal of Neuroscience, 30, 2414–2417.
Brainard, D. H. (1997). The psychophysics toolbox. Spatial Vision, 10, 433–436.
Buchsbaum, B. R., Olsen, R. K., Koch, P., & Berman, K. F. (2005). Human dorsal and ventral auditory streams subserve rehearsal-based and echoic processes during verbal working memory. Neuron, 48, 687–697.
Buckner, R. L., Andrews-Hanna, J. R., & Schacter, D. L. (2008). The brain's default network. Annals of the New York Academy of Sciences, 1124, 1–38.
Callan, D. E., Jones, J. A., Munhall, K., Callan, A. M., Kroos, C., & Vatikiotis-Bateson, E. (2003). Neural processes underlying perceptual enhancement by visual speech gestures. NeuroReport, 14, 2213–2218.
Callan, D. E., Jones, J. A., Munhall, K., Kroos, C., Callan, A. M., & Vatikiotis-Bateson, E. (2004). Multisensory integration sites identified by perception of spatial wavelet filtered visual speech gesture information. Journal of Cognitive Neuroscience, 16, 805–816.
Calvert, G., Brammer, M., & Campbell, R. (2001). Cortical substrates of seeing speech: Still and moving faces. Neuroimage, 13, S513.
Calvert, G. A., & Campbell, R. (2003). Reading speech from still and moving faces: The neural substrates of visible speech. Journal of Cognitive Neuroscience, 15, 57–70.
Calvert, G. A., Campbell, R., & Brammer, M. J. (2000). Evidence from functional magnetic resonance imaging of crossmodal binding in the human heteromodal cortex. Current Biology, 10, 649–657.
Campbell, R., MacSweeney, M., Surguladze, S., Calvert, G., McGuire, P., Suckling, J., et al. (2001). Cortical substrates for the perception of face actions: An fMRI study of the specificity of activation for seen speech and for meaningless lower-face acts (gurning). Cognitive Brain Research, 12, 233–243.
Fisher, C. G. (1968). Confusions among visually perceived consonants. Journal of Speech, Language and Hearing Research, 11, 796.
Fridriksson, J., Moss, J., Davis, B., Baylis, G. C., Bonilha, L., & Rorden, C. (2008). Motor speech perception modulates the cortical language areas. Neuroimage, 41, 605–613.
Hickok, G. (2012). Computational neuroanatomy of speech production. Nature Reviews Neuroscience, 13, 135–145.
Hickok, G., Buchsbaum, B., Humphries, C., & Muftuler, T. (2003). Auditory-motor interaction revealed by fMRI: Speech, music, and working memory in area Spt. Journal of Cognitive Neuroscience, 15, 673–682.
Hickok, G., Houde, J., & Rong, F. (2011). Sensorimotor integration in speech processing: Computational basis and neural organization. Neuron, 69, 407–422.
Houde, J. F., & Nagarajan, S. S. (2011). Speech production as state feedback control. Frontiers in Human Neuroscience, 5, 82.
Jones, E. G., & Powell, T. P. S. (1970). An anatomical study of converging sensory pathways within the cerebral cortex of the monkey. Brain, 93, 793–820.
Kleiner, M., Brainard, D., & Pelli, D. (2007). What's new in Psychtoolbox-3? Perception, 36, 1.
Liberman, A. M., & Mattingly, I. G. (1985). The motor theory of speech perception revised. Cognition, 21, 1–36.
Macmillan, N. A., & Creelman, C. D. (2005). Detection theory: A user's guide. Mahwah, NJ: Erlbaum.
MacSweeney, M., Amaro, E., Calvert, G. A., Campbell, R., David, A. S., McGuire, P., et al. (2000). Silent speechreading in the absence of scanner noise: An event-related fMRI study. NeuroReport, 11, 1729–1733.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748.
Mesulam, M., & Mufson, E. J. (1982). Insula of the old world monkey: III. Efferent cortical output and comments on function. Journal of Comparative Neurology, 212, 38–52.
Miller, L. M., & D'Esposito, M. (2005). Perceptual fusion and stimulus coincidence in the cross-modal integration of speech. Journal of Neuroscience, 25, 5884–5893.
Nath, A. R., & Beauchamp, M. S. (2012). A neural basis for interindividual differences in the McGurk effect, a multisensory speech illusion. Neuroimage, 59, 781–787.
Novick, J. M., Trueswell, J. C., & Thompson-Schill, S. L. (2005). Cognitive control and parsing: Reexamining the role of Broca's area in sentence comprehension. Cognitive, Affective, & Behavioral Neuroscience, 5, 263–281.
Ojanen, V., Möttönen, R., Pekkola, J., Jääskeläinen, I. P., Joensuu, R., Autti, T., et al. (2005). Processing of audiovisual speech in Broca's area. Neuroimage, 25, 333–338.
Okada, K., & Hickok, G. (2009). Two cortical mechanisms support the integration of visual and auditory speech: A hypothesis and preliminary data. Neuroscience Letters, 452, 219–223.
Paulesu, E., Perani, D., Blasi, V., Silani, G., Borghese, N. A., De Giovanni, U., et al. (2003). A functional-anatomical model for lipreading. Journal of Neurophysiology, 90, 2005–2013.
Pelli, D. G. (1997). The VideoToolbox software for visual psychophysics: Transforming numbers into movies. Spatial Vision, 10, 437–442.
Rosenblum, L. D., Schmuckler, M. A., & Johnson, J. A. (1997). The McGurk effect in infants. Perception & Psychophysics, 59, 347–357.
Sams, M., Möttönen, R., & Sihvonen, T. (2005). Seeing and hearing others and oneself talk. Cognitive Brain Research, 23, 429–435.
Sato, M., Buccino, G., Gentilucci, M., & Cattaneo, L. (2010). On the tip of the tongue: Modulation of the primary motor cortex during audiovisual speech perception. Speech Communication, 52, 533–541.
Sato, M., Grabski, K., Glenberg, A. M., Brisebois, A., Basirat, A., Ménard, L., et al. (2011). Articulatory bias in speech categorization: Evidence from use-induced motor plasticity. Cortex, 47, 1001–1003.
Sekiyama, K., Kanno, I., Miura, S., & Sugita, Y. (2003). Auditory-visual speech perception examined by fMRI and PET. Neuroscience Research, 47, 277–287.
Seltzer, B., & Pandya, D. N. (1978). Afferent cortical connections and architectonics of superior temporal sulcus and surrounding cortex in the rhesus monkey. Brain Research, 149, 1–24.
Seltzer, B., & Pandya, D. N. (1980). Converging visual and somatic sensory cortical input to the intraparietal sulcus of the rhesus monkey. Brain Research, 192, 339–351.
Skipper, J. I., Nusbaum, H. C., & Small, S. L. (2005). Listening to talking faces: Motor cortical activation during speech perception. Neuroimage, 25, 76–89.
Skipper, J. I., van Wassenhove, V., Nusbaum, H. C., & Small, S. L. (2007). Hearing lips and seeing voices: How cortical areas supporting speech production mediate audiovisual speech perception. Cerebral Cortex, 17, 2387–2399.
Snyder, H. R., Feigenson, K., & Thompson-Schill, S. L. (2007). Prefrontal cortical response to conflict during semantic and phonological tasks. Journal of Cognitive Neuroscience, 19, 761–775.
Stevens, K. N., & Halle, M. (1967). Remarks on analysis by synthesis and distinctive features. In W. Wathen-Dunn (Ed.), Models for the perception of speech and visual form (pp. 88–102). Cambridge, MA: MIT Press.
Stevenson, R. A., VanDerKlok, R. M., Pisoni, D. B., & James, T. W. (2011). Discrete neural substrates underlie complementary audiovisual speech integration processes. Neuroimage, 55, 1339–1345.
Sumby, W. H., & Pollack, I. (1954). Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America, 26, 212–215.
Talairach, J., & Tournoux, P. (1988). Co-planar stereotaxic atlas of the human brain. New York: Thieme.
Thompson-Schill, S. L., D'Esposito, M., Aguirre, G. K., & Farah, M. J. (1997). Role of left inferior prefrontal cortex in retrieval of semantic knowledge: A reevaluation. Proceedings of the National Academy of Sciences, U.S.A., 94, 14792–14797.
Tourville, J. A., & Guenther, F. H. (2011). The DIVA model: A neural theory of speech acquisition and production. Language and Cognitive Processes, 26, 952–981.
Venezia, J. H., Saberi, K., Chubb, C., & Hickok, G. (2012). Response bias modulates the speech motor system during syllable discrimination. Frontiers in Psychology, 3, 157.
Watkins, K. E., Strafella, A. P., & Paus, T. (2003). Seeing and hearing speech excites the motor system involved in speech production. Neuropsychologia, 41, 989–994.
Yeterian, E. H., & Pandya, D. N. (1985). Corticothalamic connections of the posterior parietal cortex in the rhesus monkey. Journal of Comparative Neurology, 237, 408–426.