Multisensory integration of visual mouth movements with auditory speech is known to offer substantial perceptual benefits, particularly under challenging (i.e., noisy) acoustic conditions. Previous work characterizing this process has found that ERPs to auditory speech are of shorter latency and smaller magnitude in the presence of visual speech. We sought to determine the dependency of these effects on the temporal relationship between the auditory and visual speech streams using EEG. We found that reductions in ERP latency and suppression of ERP amplitude are maximal when the visual signal precedes the auditory signal by a small interval and that increasing amounts of asynchrony reduce these effects in a continuous manner. Time–frequency analysis revealed that these effects are found primarily in the theta (4–8 Hz) and alpha (8–12 Hz) bands, with a central topography consistent with auditory generators. Theta effects also persisted in the lower portion of the band (3.5–5 Hz), and this late activity was more frontally distributed. Importantly, the magnitude of these late theta oscillations not only differed with the temporal characteristics of the stimuli but also predicted participants' task performance. Our analysis thus reveals that suppression of single-trial brain responses by visual speech depends strongly on the temporal concordance of the auditory and visual inputs. It further illustrates that processes in the lower theta band, which we suggest as an index of incongruity processing, may reflect the neural correlates of individual differences in multisensory temporal perception.
Audiovisual Integration of Speech Signals
We live in a complex environment in which events frequently generate signals in multiple sensory modalities. Multisensory integration, the process of combining these sensory inputs to form a single coherent percept, has been shown to offer numerous behavioral and perceptual advantages in a variety of tasks (Murray & Wallace, 2012). A particularly striking and ecologically important example of this process is the integration of visual speech (i.e., mouth movements) with auditory speech. In acoustically challenging environments, the integration of these signals has been shown to substantially facilitate speech comprehension (Ross, Saint-Amour, Leavitt, Javitt, & Foxe, 2007; Cherry, 1953). The presence of visual inputs is also known to play an important role in assisting in the process of stream segregation, in which features of selectively attended auditory signals are grouped together while unattended signals are filtered out (Shinn-Cunningham, 2008).
An important factor facilitating these integrative processes is that auditory and visual speech signals share an obligatory temporal correlation due to the nature of speech production. This correlation is highly intuitive and well quantified in natural speech (Schwartz & Savariaux, 2014; Chandrasekaran, Trubanova, Stillittano, Caplier, & Ghazanfar, 2009)—when the mouth is open, the speech envelope (i.e., the sound amplitude) is large, and when the mouth is closed, the speech envelope is small. This temporal correlation is primarily found at relatively low frequencies (1–5 Hz), which correspond with the temporal structure imposed by the basic units of speech (syllables and words; Chandrasekaran et al., 2009). Given this seemingly useful temporal structure, a number of studies have aimed to investigate the degree to which temporal concordance is important for multisensory speech processing (Ten Oever, Sack, Wheat, Bien, & van Atteveldt, 2013; van Wassenhove, Grant, & Poeppel, 2007; Munhall, Gribble, Sacco, & Ward, 1996). These studies have elucidated that integration, as measured through a number of psychophysical tasks, occurs with a degree of temporal tolerance around true simultaneity. To capture the temporal interval within which auditory and visual signals can be perceptually integrated and bound, the construct of a temporal binding window (TBW) has been put forth (Wallace & Stevenson, 2014). The TBW has been shown to be asymmetric for speech stimuli and larger for temporal asynchronies in which vision leads audition. This is consistent both with the statistics of the natural environment, in which sound travels more slowly than light, and with the causal structure of speech signals, in which articulatory movements of the vocal apparatus generally precede sounds (Schwartz & Savariaux, 2014; Chandrasekaran et al., 2009).
Neural Manifestations of Audiovisual Speech Integration
The recognized ecological importance of vision for bolstering the processing of speech signals has spurred investigations into how the presence of visual inputs modifies neural processing of the acoustic signal. Early investigations revealed that the presence of visual speech attenuates the magnitude of early brain responses to auditory speech (van Wassenhove, Grant, & Poeppel, 2005; Besle, Fort, Delpuech, & Giard, 2004) and may reduce the latency of individual processing stages (van Wassenhove et al., 2005). A meta-analytical approach has indicated that these changes in processing magnitude and latency are present over a wide range of experimental designs and task demands (Baart, 2016), reinforcing that reduction in neural response and reduced onset latency are robust and generalized indicators of audiovisual speech integration. A notable feature of this integrative effect is the subadditive nature of the interaction, in which the absolute magnitude of the multisensory neural response is substantially smaller than the sum of its constituent components (i.e., AV < A + V). This subadditivity contrasts starkly with the superadditive neural responses often seen for nonspeech stimuli (Cappe, Thut, Romei, & Murray, 2010), suggesting important mechanistic differences in the processing and integration of multisensory stimuli with informative anticipatory information, such as speech (Stekelenburg & Vroomen, 2007).
The presence of visual speech has also been shown to contribute to the ability of the brain to entrain to the auditory speech envelope. Speech is constructed of elements (syllables and words) that are produced in a semirhythmic stream. The brain has been shown to utilize this information by locking the phase of neural activity to the phase of the speech rhythm. This process, known as entrainment, is believed to form the backbone of temporal attention (Jones, Moynihan, MacKenzie, & Puente, 2002) and of processing “to the beat” (Breska & Deouell, 2016) and, for speech, to selectively amplify future speech signals occurring at the correct phase (Giraud & Poeppel, 2012). The presence of visual speech has also been shown to facilitate this entrainment process when multiple speakers are present (Zion Golumbic, Cogan, Schroeder, & Poeppel, 2013; Zion Golumbic, Ding, et al., 2013) as well as to directly entrain neural oscillations (Park, Kayser, Thut, & Gross, 2016), indicating that the visual rhythm can be used to disambiguate which portions of the acoustic envelope should be entrained to (Schroeder, Lakatos, Kajikawa, Partan, & Puce, 2008). Importantly, the phase of these rhythms has been shown to causally affect speech perception under challenging conditions. When the auditory signal is perceptually ambiguous, the final speech percept depends on the phase of ongoing spontaneous oscillations (Ten Oever & Sack, 2015). A further entrainment experiment established that acoustic entrainment generates rhythmic fluctuations in perceptual outcomes at the entrained frequency (Ten Oever & Sack, 2015). These experiments establish that low-frequency oscillations are more than simple neural resonance or a byproduct of meaningful acoustic processing and further indicate that visual influences on these oscillations likely have perceptual consequences.
For audiovisual integration, these cortical entrainment processes have been shown to occur over an extended window of time (>400 msec; Crosse, Di Liberto, & Lalor, 2016). Entrainment also exhibits the greatest amount of audiovisual integration at the low frequencies capable of capturing the slow cycle times present in naturalistic visual speech (Crosse, Butler, & Lalor, 2015). Importantly, this neural integration time is substantially greater than the temporal integration time seen for auditory speech processing under challenging acoustic conditions (∼200 msec; Ding & Simon, 2013a). This extended audiovisual window strongly suggests that audiovisual speech integration functions as a temporally privileged operation, which may be critical given that audiovisual speech has a variable temporal structure for the constituent auditory and visual components (Schwartz & Savariaux, 2014; Ten Oever et al., 2013).
Motivations for the Current Study
Despite the robustness of the finding that visual speech reduces the magnitude of neural responses to auditory speech, the degree to which this depends on the temporal structure of the auditory and visual signals has not been fully explored. Previous work has established that an ecologically implausible 200-msec auditory lead precludes this effect (Pilling, 2009) but did not systematically determine how changes in temporal structure influenced the effect. Given the strong characterizations of temporal tolerance in the behavioral domain (Wallace & Stevenson, 2014; Vroomen & Keetels, 2010), we hypothesized that a similar temporal window for response reduction would be present in the associated neural measures. We further hypothesized that these effects would be strongest at a small visual lead rather than true synchrony. This would be consistent with the natural statistics of audiovisual speech signals, predictive coding accounts of audiovisual integration (Talsma, 2015), and numerous behavioral accounts in which participants judge slight visual leads to be “most synchronous,” a point often referred to as the point of subjective simultaneity (PSS; Vroomen & Keetels, 2010).
We thus sought to determine the degree to which temporal coincidence between auditory and visual inputs mediates the multisensory integration of speech signals measured via reduction in neural response amplitude. To do so, we recorded EEG from human participants while they performed a psychophysical simultaneity judgment (SJ) task featuring audiovisual speech stimuli. Our results indicate that reductions in response amplitude afforded by the presence of visual speech operate within an asymmetric temporal window with striking similarities to the TBW reported in behavioral studies. The width of this window also strongly aligns with the cycle times of the frequencies with strongest temporal correlation in natural audiovisual speech. We further identify a novel theta band incongruity signal present in later stages of processing, which is of greatest amplitude in conditions in which temporal misalignment is present. Crucially, we link the strength of this signal to participants' ability to identify asynchronous stimuli correctly. Our results shed new light on the neural correlates of the temporal tolerance for audiovisual speech integration by elucidating the nature of temporal integration in multiple frequency bands and time windows corresponding with distinct stages of cortical processing.
Twenty-eight typically developing adults participated in the study. All participants reported that they were right-handed and had normal or corrected-to-normal vision and normal hearing. Two participants were excluded from the analysis because of behavioral performance indicating that they did not correctly perform the task, and one participant did not complete the task, leaving 25 analyzed participants (16 women) with a mean age of 22.08 (±4.21) years. The study was conducted in accordance with the Declaration of Helsinki, and informed written consent was obtained from all participants. All procedures were approved by the Vanderbilt University institutional review board.
Participants performed a speeded two-alternative forced choice SJ task (Figure 1). The experimental stimuli consisted of an audiovisual movie of a woman saying the syllable “BA,” including all prearticulatory movements, with a resolution of 720 × 1280 and a duration of 2000 msec. We selected “BA” as our stimulus because it is a highly visually specified syllable (i.e., it is easily lip-read), which may generate stronger integration effects than less visually specified syllables (van Wassenhove et al., 2005). The movie was presented on a 24-in. monitor (ASUS VG248QE) with a refresh rate of 60 Hz at a distance of 1 m. The woman's face was central on the monitor and occupied an area of approximately 12 cm high × 8.5 cm wide (approximately 6.8° × 4.8° of visual angle), whereas the open mouth occupied an area of approximately 1.75 cm high × 3 cm wide (approximately 1° × 1.7° of visual angle). The auditory portion of the movie was presented at a normal conversational volume (∼65 dB) through bilateral speakers 1 m from the participant's head. Trials began with presentation of a still face consisting of the first video frame between 1700 and 2000 msec with a uniform distribution. This was followed by the audiovisual movie, with a duration of 2000 msec. After the movie, a still face consisting of the last video frame was presented for 750 msec. If no response was given by the end of the still face period, a response screen appeared for a maximum of 2500 msec or until a response was given. Participants were instructed to fixate on the mouth and to use their right hand to indicate whether the stimuli were perceived to occur at the same time (i.e., synchronously) or at different times (i.e., asynchronously) via keyboard button press. Participants were also explicitly told to respond as quickly and accurately as possible and that the appearance of the response screen was an indicator that their responses were too slow. 
All participants completed a practice block before the main experiment.
To create the experimental temporal asynchronies, we manipulated the audiovisual stimulus by delaying either the visual stimulus (to create an AV trial) or the auditory stimulus (to create a VA trial). We created six asynchronies ranging from audition leading vision by 450 msec (A450V) to vision leading audition by 450 msec (V450A) in steps of 150 msec, resulting in seven conditions including the original movie featuring synchronized stimuli. Blocks consisted of 105 stimuli presented in a random order, and participants completed 13 or 14 blocks, for a total of 1,365 or 1,470 trials. Stimulus onset for all stimuli was considered relative to the leading stimulus. That is, for auditory leads, stimulus onset was at the time of auditory onset, whereas for visual leads, stimulus onset was the onset of the video frame associated with auditory onset in the original video. These events occurred simultaneously in the synchronous video. In other words, Time 0 corresponded with the first point at which task relevant information was present.
Behavioral Data Analysis
We began data analysis by first excluding trials in which no response was given and trials in which response times were less than 150 msec. We then excluded, on a per-condition-and-participant basis, trials with response times more than 3 SDs above that participant's mean response time. We also excluded EEG data using these same response time criteria. Together, these procedures resulted in an exclusion of 10.36 ± 3.19 trials per participant. We then calculated for each participant the percentage of trials in each condition in which they indicated that the stimulus occurred synchronously. For each participant, we then fit a Gaussian distribution to the reported rate of synchronies in all seven conditions using the MATLAB fit.m function with free parameters of amplitude, mean, and standard deviation. The standard deviation of this distribution was taken as the TBW width; and its mean, as the PSS. We note that TBWs are known to be asymmetric, but we utilized a symmetric Gaussian fit given the limited number of data points (seven total SOAs), which impairs the reliability of asymmetric fitting procedures. We also calculated the full width at 75% maximum (FW75M), equivalent to 1.517 SDs, for each Gaussian distribution.
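The relationship between the fitted standard deviation and the FW75M follows directly from the Gaussian form: solving exp(−x²/2σ²) = 0.75 for x gives a full width of 2σ√(2 ln(4/3)) ≈ 1.517σ. A minimal Python sketch of this relationship follows (the study itself used MATLAB's fit.m; the function names here are ours):

```python
import math

def fw75m_factor():
    # Solve exp(-x^2 / (2 * sigma^2)) = 0.75 for x with sigma = 1:
    # x = sqrt(2 * ln(4/3)); the full width is twice that (~1.517).
    return 2.0 * math.sqrt(2.0 * math.log(4.0 / 3.0))

def gaussian(x, amp, mu, sigma):
    """Symmetric Gaussian used to model reported synchrony rates;
    mu plays the role of the PSS and sigma of the TBW width."""
    return amp * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Converting a fitted SD (TBW width, in msec) to an FW75M:
fw75m = 296.1 * fw75m_factor()
```

Applying the factor to the group-mean TBW reported in the Results (296.1 msec) recovers the reported FW75M of roughly 449 msec.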
In addition, we calculated mean participant response time for each condition. Response time was calculated from onset of the auditory stimulus for conditions with auditory leads (A450V, A300V, and A150V) and from stimulus onset for the synchronous condition (AV). For visual leads, response times were calculated from onset of the video frame at which auditory onset would have occurred had it not been delayed. In all conditions, response times thus began at the point at which task-relevant information was first available. We compared response times across conditions using repeated-measures ANOVA with follow-up paired sample t tests.
EEG Recording and Processing
Continuous EEG was recorded from 128 electrodes referenced to the vertex (Cz) using a Net Amps 400 amplifier and Hydrocel GSN 128 EEG cap (EGI Systems Inc., Eugene, OR). Data were acquired with NetStation 5.3 with a sampling rate of 1000 Hz and were further processed using MATLAB and EEGLAB (Delorme & Makeig, 2004). Continuous EEG data were band-pass filtered from 0.15 to 50 Hz with a 6-dB roll-off of 0.075–50.075 Hz using the EEGLab firfiltnew.m function, which implements a bidirectional zero-phase finite impulse response filter. Epochs 3 sec long from 1000 msec before to 2000 msec after onset of the first stimulus were then extracted. Artifact-contaminated trials and bad channels were identified and removed through a combination of automated identification of trials in which any channel exceeded ±100 μV and rigorous visual inspection. Data were then recalculated to the average reference and submitted to ICA using the Infomax algorithm (0.5E-7 stopping weight, 768 maximum steps) (Jung, Makeig, Humphries, et al., 2000; Jung, Makeig, Westerfield, et al., 2000). Last, bad channels were reconstructed using spherical spline interpolation (Perrin, Pernier, Bertrand, Giard, & Echallier, 1987), and data were reinspected for residual artifacts. Overall, a mean of 1,081 (79 ± 9.5%) trials were retained, and 4.17 (SD = 2.42) channels and 10.56 (SD = 4.14) independent components were removed per participant. There was no difference in the number of trials accepted per condition across participants, F(6, 144) = 1.46, p = .196.
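The automated portion of the artifact screen (a ±100 μV amplitude criterion applied to every channel of every epoch) can be illustrated with a minimal Python sketch; the study additionally relied on visual inspection, and the function name here is ours:

```python
def reject_artifact_trials(epochs, threshold_uv=100.0):
    """Flag epochs in which any channel exceeds +/- threshold_uv.

    epochs: list of trials, each a list of channels, each a list of
    voltage samples in microvolts. Returns indices of retained trials.
    """
    keep = []
    for i, trial in enumerate(epochs):
        # A trial survives only if every sample on every channel is
        # within the amplitude criterion.
        if all(abs(v) <= threshold_uv for ch in trial for v in ch):
            keep.append(i)
    return keep
```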
ERPs were calculated by averaging trials for each condition in the time domain. To reduce the possibility of brain responses to the prearticulatory mouth movements contaminating the baseline, we baseline corrected ERPs using the period from 300 to 100 msec before onset of the first stimulus. We focused our ERP analysis on auditory ERPs based on previous literature showing that they are moderated by concurrent visual speech (Baart, 2016; van Wassenhove et al., 2005; Besle et al., 2004).
Peak Amplitude of Auditory ERPs
We extracted the amplitude of ERPs by defining windows based on canonical brain responses and averaging amplitude within those windows. For the N1 component, we used a window of 90–130 msec after auditory stimulus onset. For the P2 component, we used a window of 160–240 msec after auditory stimulus onset. For the N2 component, we used a window of 250–350 msec after auditory stimulus onset. For peak-to-peak voltage differences and topographies, we subtracted either the N1 or N2 component from the P2 component. Positive values thus indicate greater prominence of the P2 relative to the preceding or following ERP component.
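The window-averaging and peak-to-peak measures described above can be sketched as follows (a Python illustration with hypothetical function names; the study's analysis was performed in MATLAB):

```python
def window_mean(erp, times, t_start, t_end):
    """Mean ERP amplitude within [t_start, t_end] msec after auditory onset."""
    vals = [v for t, v in zip(times, erp) if t_start <= t <= t_end]
    return sum(vals) / len(vals)

# Component windows (msec after auditory stimulus onset) from the text.
WINDOWS = {"N1": (90, 130), "P2": (160, 240), "N2": (250, 350)}

def peak_to_peak(erp, times, pos="P2", neg="N1"):
    """P2 - N1 (or P2 - N2): positive values indicate a more
    prominent P2 relative to the adjacent negative component."""
    return (window_mean(erp, times, *WINDOWS[pos])
            - window_mean(erp, times, *WINDOWS[neg]))
```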
Peak Latency of Auditory ERPs
We calculated the latency of the N1 and P2 peaks based on previous work indicating that they should occur earlier in the presence of a visual stimulus (Baart, 2016; van Wassenhove et al., 2005). We identified peaks using the findpeaksG.m function (mathworks.com/matlabcentral/fileexchange/11755) with slope threshold = 0.001, amplitude threshold = −1.5, and a 5-point boxcar smoothing kernel. The N1 peak was searched in the range of 70–150 msec, and the P2 peak was searched in the range of 160–240 msec. In cases where more than one peak was found or the peak was within 5 msec of the edges of the search range, the primary peak was selected if unambiguous by the first author (D. S.) and otherwise treated as missing data. For statistical analysis of peak latency, missing data were imputed using the MDI toolbox (Folch-Fortuny, Arteaga, & Ferrer, 2016) via trimmed data regression, and analyses were repeated with only complete records to confirm results.
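A rough Python analog of this peak-selection step (boxcar smoothing followed by a local-maximum search within the component window) is sketched below; it simplifies the slope-based criterion of findpeaksG.m, and all names are ours:

```python
def boxcar_smooth(x, width=5):
    """5-point boxcar smoothing, shrinking the kernel at the edges."""
    half = width // 2
    out = []
    for i in range(len(x)):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out

def find_peaks(signal, times, t_min, t_max, min_amp=-1.5):
    """Return latencies of local maxima of the smoothed signal within
    [t_min, t_max] msec that exceed the permissive amplitude threshold."""
    s = boxcar_smooth(signal)
    peaks = []
    for i in range(1, len(s) - 1):
        if (t_min <= times[i] <= t_max and s[i] > min_amp
                and s[i - 1] < s[i] >= s[i + 1]):
            peaks.append(times[i])
    return peaks
```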
Frequency Domain Analysis
We further examined whether visual inputs modulated brain responses without signal averaging using time–frequency analysis. Time–frequency decomposition of single-trial EEG data was accomplished using convolution with Morlet wavelets with frequencies from 3.5 to 35 Hz in 0.5-Hz steps. Wavelets had 2.5 cycles at the lowest frequency rising to 9.3 cycles at the highest, and convolution was performed with a temporal resolution of 10 msec. Power was then decibel transformed relative to the −600 to −200 msec prestimulus period.
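The wavelet parameterization and baseline normalization can be summarized as follows (a Python sketch; the linear interpolation of cycle counts between 2.5 and 9.3 is our assumption about the ramp, and the function names are ours):

```python
import math

# Analysis frequencies: 3.5 to 35 Hz in 0.5-Hz steps (64 frequencies).
freqs = [3.5 + 0.5 * i for i in range(64)]

def n_cycles(f, f_lo=3.5, f_hi=35.0, c_lo=2.5, c_hi=9.3):
    """Cycle count for a Morlet wavelet at frequency f, rising from
    2.5 cycles at 3.5 Hz to 9.3 cycles at 35 Hz (linear ramp assumed);
    more cycles trade temporal for spectral precision."""
    return c_lo + (c_hi - c_lo) * (f - f_lo) / (f_hi - f_lo)

def db_transform(power, baseline_power):
    """Decibel change relative to mean prestimulus baseline power."""
    return 10.0 * math.log10(power / baseline_power)
```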
Realignment to Auditory Stimulus Onset
We present ERPs and time–frequency representations (TFRs) aligned such that Time 0 is onset of the leading stimulus and task-relevant information. For statistical analysis of auditory-locked brain responses to stimuli in which visual inputs occurred first, we employed a stimulus realignment procedure. Data from visual-lead conditions were realigned with the auditory-lead conditions by subtracting the audiovisual delay from the data time points (i.e., for the V150A condition, 150 msec was subtracted from the time point values of all data points). This yields a new time series in which onset of the auditory stimulus occurs at Time 0. Note that this realignment was performed after baseline correction to prevent potential baseline contamination, as the realigned time series have visual stimuli occurring before Time 0.
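The realignment amounts to a constant shift of each epoch's time stamps, as in the following Python sketch (function name ours):

```python
def realign_to_auditory_onset(times_ms, visual_lead_ms):
    """Shift epoch time stamps so auditory onset falls at time 0.

    For a visual-lead condition (e.g., V150A, visual_lead_ms=150),
    the origin at the leading (visual) stimulus is moved back by the
    audiovisual delay; applied after baseline correction, as in the text.
    """
    return [t - visual_lead_ms for t in times_ms]
```

For the V150A condition, a sample originally stamped 150 msec after the leading visual stimulus becomes Time 0, i.e., auditory onset.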
We performed statistical analysis using paired sample t tests and repeated-measures ANOVAs when time windows of interest were known. For analysis of time series data without a predefined window of interest, we utilized nonparametric randomization testing to control for multiple comparisons. For the amplitude of evoked potentials, we utilized repeated-measures ANOVA and performed follow-up paired sample t tests when appropriate. For TFRs of single-trial data collapsed in the theta and alpha bands, we then performed time series testing using nonparametric randomization testing with cluster-based correction for multiple comparisons (Maris & Oostenveld, 2007). We used the implementation in Fieldtrip (Oostenveld, Fries, Maris, & Schoffelen, 2011) with 10,000 randomizations, cluster inclusion alpha = 0.05, and a permutation alpha = 0.025. The test statistic used for cluster inclusion was the dependent-samples multivariate F, although we note that the choice of test statistic and cluster inclusion threshold does not affect the false-positive rate of the test, which is controlled by comparison against the permutation distribution (Maris & Oostenveld, 2007). Data from 600 msec before to 1050 msec after auditory stimulus onset were included in this process. For the theta band, we then performed follow-up repeated-measures ANOVA testing by collapsing in two separate time windows and then using follow-up paired sample t tests as appropriate. The first time window was the a priori selected canonical N1 to P2 time region (90–240 msec). The second time window was 300–650 msec and was selected to account for the remaining significant theta power differences while reducing data contamination from the earlier period due to the temporal precision of the wavelet transform. We restricted follow-up analysis of this second period to the lower portion of the theta band (3.5–5 Hz) based on visual inspection of TFRs, which indicated that most of the effect was found in the lowest frequencies.
For completeness, we further analyzed whether temporal windows of 350–650 or 400–650 msec yielded different theta band results than our initial analytical selection. For induced alpha power, we took an additional step by calculating the first temporal derivative of alpha power to isolate the auditory-locked power increase from slow induced power decreases clearly attributable to the presence of a visual stimulus.
To estimate the temporal extent of early power effects, we fit inverted Gaussian functions to the group-averaged power data using the MATLAB fit.m function with free parameters of amplitude, mean, and standard deviation and report the temporal window within 1.517 SDs (equivalent to the point at which power attenuation effects were at 75% of maximum, the FW75M). Finally, we analyzed brain–behavior correlations using linear regression (Pearson correlation) between participant behavior and time–frequency data averaged over both time and frequency.
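The brain–behavior analysis reduces to a standard Pearson coefficient computed over participants, as in this Python sketch (function name ours):

```python
def pearson_r(x, y):
    """Pearson correlation between per-participant behavior (x) and
    time/frequency-averaged power (y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5
```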
We began our analysis by compiling reported rates of synchrony for each of the seven SOAs. Consistent with previous experiments, synchronously presented stimuli (i.e., SOA of 0 msec) were reported as synchronous most frequently, and increasing amounts of asynchrony led to increasing rates of reported asynchrony (Conrey & Pisoni, 2006). Gaussian functions were then fit to the reported rates of synchrony and were found to fit the individual participant data well (96.11 ± 3.33% of variance explained). Mean TBW size across participants, measured as the standard deviation of these individual Gaussian functions, was 296.1 ± 97.7 msec. In terms of FW75M, this window is 449.2 msec in width. We also replicated the frequently reported asymmetric nature of the TBW as stimuli with a visual lead were reported as synchronous more frequently than their auditory-first counterparts (mean PSS = 75.2 ± 33.5 msec; t test vs. 0; t(24) = 11.226, p = 4.902 × 10−11; Figure 2A). Importantly, the smallest visual lead (V150A) was perceived as synchronous on most trials by most participants. Response times were found to differ significantly across conditions (repeated-measures ANOVA, F(6, 144) = 24.93, p = 3.8288 × 10−20; Figure 2B) and were faster for synchronous trials when compared with any of the asynchronous conditions (all six pairwise comparisons, p < .0015 uncorrected). Response times also changed asymmetrically across conditions: Increasing the asynchrony for visually leading trials resulted in increasing response times, whereas increasing the asynchrony for auditory leading trials resulted in decreasing response times. We present all pairwise response time comparisons, masked for significance of p < .05 uncorrected, in Figure 2C.
The first step in our EEG analysis focused on the time domain by averaging the signal to derive ERPs (Figure 3). We chose to focus our analysis on electrode Cz as this electrode is optimally positioned to capture auditory ERPs and proved to be relatively isolated from contamination due to ongoing ERPs from the visual stimulus. Distinct auditory ERPs were clearly present in all seven SOA conditions at this electrode (Figure 3A and B). Despite its relative isolation from visual ERPs, strong positive voltage trends, likely a result of parietal voltage buildup related to decisional processes, were clearly present in conditions with visual leads (Figure 3B). We thus began our analysis by focusing on the four conditions without a visual lead (A450V, A300V, A150V, and AV). Across these four conditions, the amplitude of the N1 ERP component (90–130 msec) was significantly different (repeated-measures ANOVA, F(3, 72) = 25.54, p = 2.3218 × 10−11; Figure 3C). Follow-up paired sample t tests (Figure 3D) indicated that differences occurred because the smallest auditory lead, A150V, had a smaller N1 (i.e., a less negative voltage) than the two larger auditory leads (A450V vs. A150V: t(24) = 3.412, p = .0023; A300V vs. A150V: t(24) = 5.056, p = 3.6 × 10−5). Furthermore, this effect increased for these large auditory leads compared with synchronous presentation (A450V vs. AV: t(24) = 5.915, p = 4.12 × 10−6; A300V vs. AV: t(24) = 6.431, p = 1.19 × 10−6). Importantly, the two larger auditory leads (A450V and A300V) were consistent with one another (t(24) = 0.439, p = .665) as expected (Pilling, 2009), whereas the small auditory lead (A150V) represented an intermediate step between these SOAs and the synchronous condition (A150V vs. AV: t(24) = 3.833, p = .0008). Voltage for the P2 ERP component (160–240 msec) was found to not be significantly different across conditions (repeated-measures ANOVA, F(3, 72) = 0.4574, p = .7129).
These findings were found to be qualitatively similar when using N1 and P2 amplitudes drawn from individualized peaks (see peak latency analysis below) rather than averaging over predetermined windows (N1 amplitude: F(3, 72) = 27.2, p = 7.19 × 10−12; P2 amplitude: F(3, 72) = 1.22, p = .3072). We thus not only replicate previous findings of reduced N1 amplitude due to the presence of visual speech and the lack of this effect for a sufficiently large auditory lead but also establish the level of temporal asynchrony associated with an intermediate level of N1 amplitude suppression.
We then extended our ERP analysis to include all seven SOAs by focusing on peak-to-peak voltage differences (i.e., peak prominence), an approach previously applied in similar studies (Pilling, 2009). One advantage of this peak prominence measure is that it cancels the observed parietal buildup effects. We first subtracted N1 voltage from P2 voltage (P2 − N1; Figure 3E). This voltage difference was found to be largest for the large auditory leads and differed significantly across conditions (repeated-measures ANOVA, F(6, 144) = 8.089, p = 1.5396 × 10−7). Similar to the single ERP peak analysis, this peak-to-peak effect was found to be robust when using individualized ERP peak amplitudes (F(6, 144) = 11.27, p = 2.679 × 10−10). Large auditory leads (A450V, A300V) had a significantly larger N1-to-P2 voltage change than all other conditions, whereas the smallest auditory lead (A150V) had a larger N1-to-P2 voltage change than synchronous presentation, but not larger than visual leads. Because of the number of unique pairwise comparisons (21), we present the absolute t value for all comparisons, masked for significance of p < .05 uncorrected (Figure 3F). We then repeated this analytical approach for peak-to-peak differences between P2 and N2 (P2 − N2; Figure 3G). Voltage differences between these peaks were also found to be significantly different across conditions (repeated-measures ANOVA, F(6, 144) = 15.166, p = 2.0512 × 10−13). Auditory leads (A450V, A300V, and A150V) and the largest visual lead (V450A) were found to have the largest P2-to-N2 voltage differences, whereas this difference was small for synchronously presented stimuli and for the small-visual-lead condition (V150A). We present all pairwise comparisons, masked for significance of p < .05 uncorrected, in Figure 3H. Topographies for the ERP components of interest (N1, P2, and N2) as well as difference topographies (P2 − N1 and P2 − N2) are presented in Figure 4.
We further analyzed the latency of the N1 and P2 peaks based on previous literature that has shown these components to be accelerated by visual speech (Baart, 2016; van Wassenhove et al., 2005). For the N1 analysis, 19 of 25 participants had identifiable peak latencies in all seven conditions, whereas the latency of 10 individual ERP peaks (of 175 peaks [25 participants × 7 conditions] or 5.71%) could not be identified. We imputed these 10 data points using the MDI toolbox (Folch-Fortuny et al., 2016) via trimmed scores regression with four principal components. We then compared N1 latencies across conditions and found that they differed significantly (repeated-measures ANOVA, F(6, 144) = 13.03, p = 9.6727 × 10−12; Figure 5A). Follow-up t tests indicated that this occurred because visual leads reduced the latency of the N1, with the shortest latency occurring at the smallest visual lead (V150A). We present all follow-up t tests, masked for p < .05 uncorrected, in Figure 5B. We repeated this analysis excluding the six participants with incomplete data and found that the differences were still significant (F(6, 114) = 9.73, p = 1.1219 × 10−8). For the analysis of the P2 component, 22 of 25 participants had identifiable peaks in all conditions, whereas three individual ERP peaks (1.71%) could not be identified and were imputed using trimmed scores regression and three principal components. We then compared P2 latencies across conditions (repeated-measures ANOVA, F(6, 144) = 4.95, p = .0001; Figure 5C). Effects for the P2 illustrated a monotonically decreasing latency as the amount of time from visual onset increased. We present all follow-up t tests, masked for p < .05 uncorrected, in Figure 5D. We repeated this analysis excluding the three participants with incomplete data and found that differences were still significant (F(6, 126) = 3.71, p = .002).
To recapture dynamics lost during signal averaging (i.e., induced brain oscillations) and to better separate auditory-linked activity from ongoing visual and/or cognitive activity (i.e., the centro-parietal positivity [O'Connell, Dockree, & Kelly, 2012], which is clearly visible in the ERPs and occurs at a lower frequency [∼1–2 Hz]), we utilized time–frequency analysis. In this analysis, the wavelets effectively function as band-pass filters, removing slow activity related to decisional buildup. TFRs resulting from wavelet convolution at electrode Cz for all seven conditions are presented in Figure 6. We focused our analysis on the theta (4–8 Hz) and alpha (8–12 Hz) frequency bands as these frequency ranges have established functional significance and exhibited clear activity in the TFRs. We did not analyze the pronounced beta desynchronization as it exhibited a left-lateralized topography consistent with initiation of motor responses. We averaged data within the alpha and theta bands and then used nonparametric randomization testing with cluster-based correction for multiple comparisons to determine the period during which conditions differed. In the theta band, a sustained period of significant difference was found from 60 to 660 msec (p < .0001, randomization test; Figure 7A). In the alpha band, we noted that slow drifts in induced alpha power were present beginning with the onset of the visual stimuli. These appeared to be distinct from the more localized sharp deflections associated with auditory stimulus onset (Figure 7B). We corrected for these drifts and associated spurious statistical differences by taking the first temporal derivative (i.e., the rate of change) of alpha power, thus isolating sharp deflections in power. These differences were found to be significant from 5 to 145 msec via randomization testing (p < .0001, randomization test; Figure 7C). 
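The wavelet convolution described above can be sketched in a few lines. The parameters here (sampling rate, number of cycles, frequency grids) are illustrative assumptions rather than the study's exact settings:

```python
import numpy as np

def morlet_power(signal, sfreq, freqs, n_cycles=5):
    """Single-trial power via Morlet wavelet convolution.

    signal : 1-D EEG trace (one trial, one electrode); must be longer
             than the lowest-frequency wavelet
    sfreq  : sampling rate in Hz
    freqs  : iterable of frequencies of interest (Hz)
    Returns an array of shape (len(freqs), len(signal)).
    """
    power = np.empty((len(freqs), len(signal)))
    for i, f in enumerate(freqs):
        sigma_t = n_cycles / (2 * np.pi * f)            # temporal width
        t = np.arange(-4 * sigma_t, 4 * sigma_t, 1 / sfreq)
        wavelet = np.exp(2j * np.pi * f * t) * np.exp(-t**2 / (2 * sigma_t**2))
        wavelet /= np.sqrt(np.sum(np.abs(wavelet)**2))  # unit energy
        analytic = np.convolve(signal, wavelet, mode="same")
        power[i] = np.abs(analytic)**2
    return power

# Band-averaged power, and the first temporal derivative of alpha power
# used above to remove slow drifts and isolate sharp deflections:
#   theta  = morlet_power(x, 500, np.arange(4, 8.5, 0.5)).mean(axis=0)
#   alpha  = morlet_power(x, 500, np.arange(8, 12.5, 0.5)).mean(axis=0)
#   d_alpha = np.gradient(alpha) * 500    # dP/dt, per second
```

Because each wavelet has finite bandwidth, this convolution acts as the band-pass filter described above, attenuating the slow (∼1–2 Hz) decisional buildup.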
Importantly, the duration of significant differences in theta power was far more sustained than that for alpha power. Furthermore, theta power demonstrated a topography that evolved between distinct central and fronto-central distributions. Although statistical significance in the theta band was continuous, changes in topography and bandwidth suggested that theta activity might represent multiple consecutive processes. We thus opted to analyze theta power in an early window (90–240 msec, corresponding with the beginning of our a priori N1 window and ending with our a priori P2 window) and a late window (300–650 msec). Given the continuous nature of significant time points, we also confirmed that alternative selections for the late time window (i.e., 350–650 or 400–650 msec) were consistent with our initial selection (see Results section). Change in alpha power was averaged across the entire 5- to 145-msec period deemed significant by the randomization test.
Early theta power exhibited a central topography consistent with auditory cortical generators (Figure 7D), whereas alpha power was similarly centrally distributed (Figure 7E). Early theta power differed significantly across conditions (F(6, 144) = 24.23, p = 1.0513 × 10−19), with the highest power found for large auditory leads (A450V and A300V). These large auditory leads were not different from one another (t(24) = −1.238, p = .228) but were different from all other conditions with a more temporally proximate visual stimulus. Strikingly, with the sole exception of V150A compared with V300A, all individual steps of changing temporal synchrony were significantly different from their nearest neighbors (Figure 7F). We present all pairwise comparisons for early theta power in Figure 7G. Change in alpha power was also found to be significantly different across conditions (F(6, 144) = 13.25, p = 6.447 × 10−12). Similarly to theta power, change in alpha power was high for auditory leads, attenuated by synchronous visual stimuli and small visual leads, and recovered for larger visual leads (Figure 7H). We present all pairwise comparisons for changes in alpha power in Figure 7I.
We next sought to determine the overall temporal tolerance of our participants, as indexed by the change in suppression across SOAs. To estimate the temporal extent of the effects at the group level, we thus fit a Gaussian distribution (i.e., a TBW) to group-averaged power in each frequency band. For theta power, a Gaussian function fit the group data extremely well (96.26% of variance explained) and had a standard deviation of 325.4 msec, a mean of 193.25 msec, and an FW75M of 493.6 msec. For alpha power, a Gaussian function was similarly found to fit the group data well (95.80% of variance explained) and had a standard deviation of 257.9 msec, a mean of 102 msec, and an FW75M of 391.2 msec. Our time–frequency analysis of the initial auditory responses thus strongly indicates that the temporal structure of the audiovisual stimulus relationship modulates the degree to which visual speech suppresses single-trial auditory speech responses. Furthermore, this process occurs in a highly continuous manner with a temporal tuning function remarkably similar to that found in participant behavior, although we note that we did not find a direct relationship between neural response suppression and participant temporal acuity.
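The FW75M values above follow directly from the fitted standard deviations: for a Gaussian tuning curve, setting exp(−x²/(2σ²)) = 0.75 gives x = σ√(2 ln(4/3)), so the full width at 75% of maximum is twice that. A quick check against the reported fits:

```python
import numpy as np

def fw75m(sigma):
    """Full width at 75% of maximum for a Gaussian tuning curve.

    exp(-x^2 / (2 sigma^2)) = 0.75  =>  x = sigma * sqrt(2 * ln(4/3)),
    and the full width is 2x (the curve is symmetric about its mean).
    """
    return 2 * sigma * np.sqrt(2 * np.log(4 / 3))

# Reported fits: theta sigma = 325.4 msec -> FW75M ~ 493.6 msec
#                alpha sigma = 257.9 msec -> FW75M ~ 391.2 msec
```

Both reported FW75M values are thus internally consistent with their fitted standard deviations.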
Having examined the frequency representation of the initial auditory response, we then proceeded to analyze the later period of theta band significance. Visual inspection of results indicated that these late theta power differences were primarily driven by frequencies at the lower bound of the TFRs (3.5–5 Hz). We thus initially restricted our analysis of the late (300–650 msec) effects to these frequencies (see below for analysis in other theta band portions). Topographies of low theta (3.5–5 Hz) activity averaged across this period are presented in Figure 8A. Notably, the topography in this time–frequency range presented a more fronto-central distribution than observed for early alpha or theta band activity, less consistent with auditory cortical generators. This late theta activity also demonstrated differences across conditions (F(6, 144) = 11.87, p = 8.5026 × 10−11), but notably, these differences were less continuous than for the early effects (i.e., within “categories” of auditory lead, synchronous or small visual lead, and large visual lead, no significant differences were found; Figure 8B). We present all pairwise comparisons for late theta power in Figure 8C. The fronto-central distribution of this late theta activity is consistent with previous reports of neural responses to visual stimulus incongruence (Hanslmayr et al., 2008) and with more recent reports of cross-modal stimulus incongruence (Roa Romero, Keil, Balz, Gallinat, & Senkowski, 2016). To determine if this potentially congruence-related activity corresponded with participants' ability to perceive temporal incongruence (i.e., stimulus asynchrony) in our experiment, we correlated late theta power with individual perceptual report separately for each condition. 
We found that, for small and moderate auditory leads, the level of individual theta power negatively correlated with the rate at which participants reported synchrony in each condition (A150V: r = −.5920, p = .0018; A300V: r = −.4349, p = .0298; Figure 8D and E). For the largest auditory lead (A450V), this correlation approached significance (r = −.3890, p = .0546). This weaker correlation for the A450V condition is potentially explained by the high number of participants rarely reporting simultaneity (9/25 participants < 1% reported rate of synchrony, 19/25 participants < 5% reported rate of synchrony). Correlations between reported rate of synchrony and late theta oscillations were similarly negative for synchronous or visually leading conditions but were not significant (all ps > .163). We further correlated late theta oscillations in each of these conditions to overall TBW size. This correlation was found to be significant for theta power in the A450V (r = −.4397, p = .0279) and A150V (r = −.5318, p = .0062) conditions but was not significant for the A300V condition (r = −.3141, p = .126). Similar to correlations for perceptual accuracy, correlations between theta power and binding window size were found to be nonsignificant for synchronous and visually leading stimuli (all ps > .221).
For completeness, we repeated these analyses in the upper portion of the theta band (5.5–8 Hz) and further examined whether relationships held for the earlier (90–240 msec) time window. In the upper portion of the theta band (5.5–8 Hz), correlations between perceptual accuracy and theta power were substantially weaker, reaching at most trend levels of significance, a pattern potentially attributable to spectral imprecision (A450V: r = −.2277, p = .2737; A300V: r = −.3977, p = .049; A150V: r = −.3736, p = .0658). Theta power in the early time window (90–240 msec) was not found to correlate with perceptual accuracy in any condition for low theta (3.5–5 Hz, all ps > .109) or high theta (5.5–8 Hz, all ps > .32). In addition, we investigated the degree to which the temporal window selected for "late" oscillations contributed to the results. We found that utilizing narrower, and thus less conservative, analytical windows yielded qualitatively similar results. For example, in the A150V condition, a 400- to 650-msec window yielded a correlation between reported synchrony and theta power of r = −.5943, p = .0017. Given the strong similarity across windows and the need to make a temporal division between early and late power, we opted to utilize the most conservative temporal window and do not report on the narrower analytical windows further. Last, we confirmed that results in the later time window were not entirely phase locked by repeating the TFR analysis using the ERPs instead of single-trial data. This analysis yielded qualitatively dissimilar results, which indicates that the findings in this window require consideration of single trials and are likely at least partially oscillatory. Results in the early time window were found to be phase locked and thus primarily represent the frequency domain version of the ERP.
We sought to elucidate the effects of temporal concordance between visual speech and auditory speech on amplitude suppression, a well-established neural measure of multisensory integration for speech signals (Baart, 2016; Pilling, 2009; van Wassenhove et al., 2005; Besle et al., 2004). To do so, we utilized an SJ task that requires participants to attend to both vision and audition and manipulated the temporal relationship between the sensory inputs. In terms of ERPs, we partially replicated previous work (Baart, 2016; Pilling, 2009; van Wassenhove et al., 2005) by demonstrating suppression of the N1 component by synchronous visual signals. Importantly, we also demonstrate the presence of an intermediate step, in which a sufficiently small auditory lead results in partial amplitude suppression. This intermediate step offers confirmation that audiovisual speech integration operates in an efficient manner in which any visual input makes contributions proportional to its information content. Furthermore, through time–frequency analyses designed to reduce the influence of ongoing visual and, in particular, decisional activity occurring at lower frequencies, we established that this amplitude suppression exhibits substantial temporal tuning. Effects were found to be maximal at synchrony and small visual leads, both of which correspond with the greatest reports of perceptual synchrony. We also found that effects in the alpha band occurred more rapidly (∼50 msec earlier) than those seen in the theta band. These same time–frequency analyses also further indicated that, for auditory leads, low theta (3.5–5 Hz) oscillations persisted well after the auditory stimulus. These more persistent oscillations presented with a topographical pattern consistent with congruence processing and, most importantly, were found to correspond with task performance.
Changes in ERP Amplitude Are Limited to the N1 and N2 Components
When comparing ERP amplitudes for auditory leading stimuli and for true AV synchronous presentations, a highly significant reduction in absolute N1 amplitude was present when visual speech was presented synchronously or with a 150-msec delay relative to the auditory signal (A150V). This replicates and extends previous results indicating that a reduction in the magnitude of auditory cortical responses by visual speech can occur even when vision is slightly lagged and the prearticulatory motion occurs concurrent with the auditory signal. Importantly, we also establish that the neural response to a stimulus with a small visual lag differs both from audiovisual responses with large visual lags and from responses to truly synchronous stimuli, thus establishing for the first time an intermediate level of suppression. This partial reduction by concurrent prearticulation is consistent with the predictive coding account of audiovisual integration (Talsma, 2015), in which the articulatory movements are still informative but to a lesser degree.
Interestingly, we did not replicate findings of reduced P2 amplitude in the presence of audiovisual speech. It is possible that the lack of P2 amplitude reduction in our results is due to the nature of the task participants were performing. The SJ task requires participants to segregate the stimuli, keeping their timing as separate as possible. Previous work has established that the P2 may be less automatic and more amenable to top–down regulation than the earlier P1 and N1 cortical responses. Specifically, in a multisensory speech task featuring the McGurk illusion (McGurk & MacDonald, 1976), P2 modulation by visual speech was reduced by the presence of an "incongruent context" before stimulus onset (Ganesh, Berthommier, Vilain, Sato, & Schwartz, 2014). In this study, if the period preceding the experimental stimulus contained mismatched auditory and visual stimuli, then the level of P2 integration (i.e., the amount of P2 suppression) for the experimental stimulus was reduced. This indicates that the top–down factor of whether stimulus modalities are appropriate to integrate can modulate the degree of integration measured in P2. We thus speculate that the lack of P2 reduction we observe may be a manifestation of the task demands, which ask our participants to segregate the stimuli as much as possible. Establishing empirically whether this is the case will require additional future work.
Despite the lack of P2 attenuation, peak-to-peak measures indicated that highly asynchronous visual speech accentuates the relative prominence of the P2 compared with the N2. This indicates enhanced N2 negativity when the stimulus has a large degree of temporal asynchrony. The N2 has been associated with conflict processing in a number of tasks (Iannaccone et al., 2015; Larson, Clayson, & Clawson, 2014; Yeung, Botvinick, & Cohen, 2004), and the enhanced negativity here indicates that the N2 has sensitivity to temporal congruence. Time and frequency domain representations of error processing, which may have similar monitoring circuit substrates, have been shown to be partially independent and carry complementary information (Munneke, Nap, Schippers, & Cohen, 2015). Given the substantial temporal overlap between ERPs and our time–frequency results, we thus discuss temporal misalignment as a form of stimulus conflict further in the context of theta oscillations below.
Visual Speech Accelerates the N1 and P2
Previous reports have indicated that the presence of visual speech also accelerates the onset of early ERP components (and thus, presumably, their associated processing stages). We partially replicated this result and found that, for the N1, this acceleration was greatest when vision slightly precedes audition (V150A). Although an ∼11-msec acceleration might seem small, it represents a roughly 10% speeding of peak latency. However, given the highly visually specified token that was used (the syllable "ba," which is easily lip-read and carries relatively well-specified temporal information), the acceleration present for synchronous presentation (∼5 msec) is only about half of that expected based on previous reports linking the amount of acceleration to visual intelligibility (van Wassenhove et al., 2005). This relative reduction may be a result of task demands, similar to our lack of P2 amplitude reduction, or may reflect the fact that even subtle differences in experimental design can determine whether this effect is found (Baart, 2016). We also found acceleration of the P2 component, despite not finding amplitude reduction in this same component. Our P2 finding, however, consisted of a monotonic latency reduction, which reached significance in only the largest visual leads and does not replicate findings of fairly substantial (∼20 msec) P2 latency facilitation afforded by the presence of a synchronous and readily recognized viseme (van Wassenhove et al., 2005). Given the apparent lack of temporal tuning and the vulnerability of peak latency measures to effects such as entrainment of alpha oscillations by visual inputs, we do not believe that our P2 acceleration result serves as an indicator of multisensory integration in this task. Rather, we interpret this finding similarly to our P2 amplitude finding, to indicate that the degree of P2 latency acceleration present for audiovisual speech may depend on task demands and context.
Temporal Constraints on Multisensory Speech Integration
Time–frequency analysis indicated that reductions in the magnitude of early brain responses occurred in both the theta and alpha frequency bands and were strongly mediated by the degree of temporal concordance between the auditory and visual signals. As expected, large auditory leads resulted in a robust brain response consistent with auditory-only processing. As the temporal lag of the visual stimulus decreased, there were corresponding and continuous decreases in response magnitude, with the smallest neural response at a small visual lead (V150A). Critically, further increases in visual lead resulted in a recovery of response magnitude. This neural distribution, with its visual lead bias and highly Gaussian shape, bears striking resemblance to the TBW found both in our behavioral results as well as in other similar reports (for reviews, see Wallace & Stevenson, 2014; Vroomen & Keetels, 2010). This extended temporal window for audiovisual speech integration is also consistent with reports that the auditory system integrates speech signals over a relatively protracted period, particularly in challenging acoustic conditions (Ding & Simon, 2013c). Furthermore, these results are also highly consistent with recent reports of a similarly extended temporal integration window in the reconstruction of continuous audiovisual speech (Crosse et al., 2016). Last, the extent of the temporal integration window we found (∼500 msec) corresponds well with the cycle time of the ∼2-Hz lower-frequency bound in which auditory speech temporally correlates with visual speech (Chandrasekaran et al., 2009). Similar slow frequencies have also been shown to offer the most robust audiovisual gain in speech stimulus reconstruction (Crosse et al., 2015), suggesting not only that integration is occurring over a large time window but also that integration over longer temporal epochs may result in more robust processing and encoding.
Intriguingly, we observed notable differences in the modulatory effects of visual speech on power in the alpha and theta bands. Specifically, differences in the alpha band emerged substantially earlier, whereas theta band effects were found later and to be more sensitive in representing differences between small ecologically plausible temporal offsets (i.e., AV vs. V150A). In addition, the positive deflection in alpha power was completely removed by a synchronous visual stimulus, whereas theta power was only attenuated. These differences are unlikely to be related to limitations in spectral resolution as symmetric spectral transforms such as Morlet wavelets spread lower frequencies further backward in time. Instead, they likely reflect neural activity originating in functionally distinct but anatomically overlapping cortical circuits in auditory cortical regions. This is particularly relevant for the alpha band, which has been associated with a “working” frequency that determines the temporal resolution of the visual system (Samaha & Postle, 2015) and similarly affects audiovisual multisensory processing (Cecere, Rees, & Romei, 2015). Cortical alpha activity has also been proposed as an important tool for selective inhibition in challenging listening conditions, when integration of visual speech is most valuable (Strauss, Wostmann, & Obleser, 2014). Last, phase reset across sensory systems has been shown to be a generalized context and attention-sensitive mechanism for multisensory neural interaction (Lakatos et al., 2009; Lakatos, Chen, O'Connell, Mills, & Schroeder, 2007; for a review, see van Atteveldt, Murray, Thut, & Schroeder, 2014). 
In light of this previous work, the observed differences in timing and effect magnitude indicate that integrative processing of audiovisual speech stimuli may impact the efficacy of alpha reset mechanisms earlier than neural circuits with responses in the theta band, resulting in rapid selective inhibition of neural populations that might otherwise contribute to the response. Such rapid dampening may then propagate to the slower theta band, which tracks the speech envelope and, in our results, carries a more continuous and precise representation of the temporal offset. This proposed interaction between frequency bands in speech processing is also supported by evidence that intrinsic theta oscillations shape syllable perception whereas endogenous alpha oscillations do not (Ten Oever & Sack, 2015). Our findings of band-specific latency differences can thus be well accounted for by a two-stage model of multisensory temporal integration for speech signals. In this model, neural populations operating at higher alpha frequencies activate more rapidly but carry less precise temporal information. These fast alpha circuits then refine cortical responses occurring in slower theta frequencies and thus allow these theta circuits to carry a more precise temporal and envelope representation. Such a model is well aligned with invasive physiological work indicating that subadditive neural interactions are associated with enhanced information content in cortical signals (Angelaki, Gu, & DeAngelis, 2009). We thus suggest that frequency domain analyses such as those conducted here are able to, at least partially, disentangle the temporal dynamics of visual influences on auditory cortical responses.
Theta Oscillations as a Marker of Cross-modal Incongruence Processing
In addition to early theta band effects, we also noted theta oscillations enduring long after auditory stimulus onset. These persistent theta oscillations occurred primarily at the lower end of the theta band (3.5–5 Hz) and had a more fronto-central distribution when compared with the early alpha and theta band effects. These oscillations not only differed across levels of temporal asynchrony, virtually vanishing in conditions where participants report the stimuli as synchronous, but also directly corresponded with the accuracy of participants' perceptual report in auditory leading conditions. Given the nature of the task, in which participants are asked to detect temporal incongruence in the stimulus, we believe that these oscillations index processing of cross-modal temporal incongruence in the brain. This is consistent with previous work, which has indicated that similar theta oscillations are active during reconciliation of incongruent stimulus features (also known as conflict detection or stimulus–stimulus conflict; Cohen, 2014). A well-known example of such processing is the Stroop task, in which written color words and the color they are written in are mismatched (e.g., the word "blue" written in red). During performance of this task, fronto-central theta oscillations are observed on trials with conflicting information (Hanslmayr et al., 2008). Another example of such oscillations is during trials with conflicting information in a flanker task (Nigbur, Ivanova, & Sturmer, 2011). The topography of theta oscillations seen in such tasks, as well as in our experiment, is also consistent with anterior cingulate and other medial frontal generators, which have been linked to stimulus error processing (Cavanagh & Frank, 2014).
The late timing of differences is also consistent with recent work examining the formation of large-scale functional brain networks associated with multisensory speech perception (Kumar et al., 2016) and may indicate that frontal monitoring circuits play a critical role in such networks. In addition, we believe that the presence of this relationship for high-level (i.e., late frontal) activity and its absence for low-level (i.e., early auditory cortex) activity may stem from the dissociation between perceived simultaneity and low-level multisensory integration (Harrar, Harris, & Spence, 2017). Our report thus additionally serves as evidence that perceptual simultaneity may emerge from higher cognitive processes occurring relatively late in time.
Importantly, it has been shown that similar theta oscillations with medial frontal generators are attenuated during cross-modal conflict in populations with reduced top–down error control such as patients with schizophrenia (Roa Romero et al., 2016). Atypical multisensory temporal processing is increasingly being recognized in a number of neurological and neuropsychiatric disorders, including schizophrenia (Martin, Giersch, Huron, & van Wassenhove, 2013) and autism (Stevenson et al., 2014). This finding suggests that feedforward activity from sensory processing regions to error monitoring systems may form a neurophysiological basis for multisensory temporal dysfunction in these disorders. In addition, we believe that it is important to highlight the consistency of this relationship across conditions with auditory leads. This consistency indicates that superior temporal acuity is associated with stronger incongruence signaling regardless of relative perceptual difficulty. In other words, stronger conflict signaling during multisensory temporal incongruence is an individual trait and at least somewhat independent of perceptual threshold. Future work relating conflict signaling to individual differences in multisensory integration may yield further insights into the importance of this process.
That we do not find a relationship between theta oscillations linked to incongruence processing and behavior in synchronous or visually leading trials is not surprising. For synchrony and the smallest visual lead, there is both little behavioral variability across participants and little perceived conflict to be signaled, as participants report the stimulus as occurring synchronously most of the time. In conditions with larger visual leads, theta band conflict signaling would be expected to occur 300–650 msec after onset of the leading visual stimulus, during which frontal theta activity is obscured by the much larger auditory cortical response. Alternatively, temporal incongruence in visually leading conditions might be processed by different neural networks than auditory leading stimuli. This possibility is specifically raised by recent work elucidating that the neural networks engaged during SJ depend on stimulus ordering (Cecere, Gross, Willis, & Thut, 2017).
Multisensory Temporal Integration as a Fundamental Feature of Speech Processing
Taken together, our time and frequency domain analysis points to a substantial temporal window in which visual speech reduces the amplitude and speeds the onset of the neural processing associated with speech signals. This window forms a substrate for the integration of relatively slowly occurring mouth movements and envelope fluctuations and further supports accounts that delta band (1–4 Hz, cycle time = 250 msec–1 sec) brain activity may serve a role in integrating temporal information in speech signals (Schroeder et al., 2008). Given the nature of audiovisual speech, in which individual syllables have variable visual–auditory onset timing, the presence of such a tolerant mechanism may form a fundamental component of the ability to correctly incorporate visual speech to enhance auditory perception. Our findings are also consistent with work indicating that multisensory temporal integration may serve as a gain control mechanism (Crosse et al., 2016). Our theta band tuning profile in particular, although quite broad, is also quite deep (∼1.5 dB), giving it a great deal of dynamic range to impact processing of signals differently depending on the degree of temporal alignment with visual inputs. This temporal weighting may serve to facilitate neural entrainment in particular, by providing strong weighting to near-concurrent events and a more moderate weighting to events with ambiguous temporal concordance. In a rich visual environment, such a process would serve as a temporal “filter” on visual inputs. Last, we establish that temporal discordance between stimuli generates activity consistent with systems that respond to stimulus incongruence. In the context of naturalistic speech, continuous monitoring of temporal congruity and appropriate feedback to sensory systems may make crucial contributions to sharpening the influences visual inputs have on auditory speech processing. 
The combination of these factors indicates that temporal integration is a fundamental feature of speech processing and that neural systems are strongly adapted to take advantage of temporal structure in speech signals.
We establish that the temporal relationship between auditory and visual stimuli is critically important to the degree to which visual inputs attenuate auditory brain responses and accelerate the onset of early ERP components. Importantly, this attenuation operates in an asymmetric temporal window, in strong agreement with both behavioral and electrophysiological measures of multisensory temporal integration for speech signals. Furthermore, the perceived temporal relationship of the stimuli is more categorically reflected by late theta oscillations, in that these oscillations are present for stimuli frequently reported as asynchronous and virtually absent for stimuli predominantly reported to be synchronous. In conditions in which audition leads, the strength of these categorical theta oscillations directly corresponded with participant performance, and this relationship was particularly strong when the stimulus was perceptually ambiguous. These findings contribute to a growing body of literature indicating that auditory and visual speech signals are integrated over a surprisingly wide window of time, while further indicating that temporal mismatch between sensory modalities is processed in a manner similar to other types of stimulus conflict. The band-specific nature of the neural processing differences also suggests that distinct neural populations contribute to temporally distinct stages of integration of audiovisual speech signals. Further investigation of these temporal dynamics may make substantial contributions to the refinement of circuit models that account for visual enhancement of information content in cortical speech representations.
Although we believe that this work sheds important light on multisensory temporal processing, it is not without limitations. The evoked design, which by nature is highly repetitive, is less naturalistic than normal speech. Similarly, the SJ task is somewhat removed from normal speech task demands such as comprehension and stream segregation, and in particular, our lack of P2 modulation and unusual P2 latency facilitation may be specific to our experimental design. Last, because substantial phase resetting of ongoing neural processes contributes to evoked responses (Makeig et al., 2002) and because phase and amplitude measures are inherently nonindependent in noisy signals such as EEG (Ding & Simon, 2013b), our design precludes a robust analysis of neural phase. Neural phase is known to play a fundamental role in speech processing (Giraud & Poeppel, 2012; Schroeder et al., 2008), multisensory timing (Kosem, Gramfort, & van Wassenhove, 2014), and the maintenance of ongoing oscillatory dynamics at speech frequencies (Simon & Wallace, 2017; Herrmann, Henry, Haegens, & Obleser, 2016), and the inability to analyze phase limits our ability to assess a potentially important processing dynamic contributing to multisensory integration.
The current study suggests several avenues of potential future research. One such approach is examining temporal modulation of neural responses to audiovisual speech across development. Multisensory temporal integration is known to have a developmental trajectory (Hillock-Dunn, Grantham, & Wallace, 2016; Hillock, Powers, & Wallace, 2011), and utilizing a similar approach in children may serve to extend existing findings of reduced audiovisual speech integration in childhood (Kaganovich & Schumaker, 2014; Knowland, Mercure, Karmiloff-Smith, Dick, & Thomas, 2014) by determining the degree to which temporal integration sharpens during maturation. Similarly, given the known deficits in multisensory temporal integration in a number of disorders such as autism spectrum disorder (Stevenson et al., 2014), dyslexia (Hairston, Burdette, Flowers, Wood, & Wallace, 2005), and schizophrenia (Ross, Saint-Amour, Leavitt, Molholm, et al., 2007), extension of this study to these populations may shed light on the nature of dysfunctional multisensory temporal processing. Recent approaches to temporal processing have also elucidated that neural processing of time is highly variable based on the existing temporal context (Simon, Noel, & Wallace, 2017). Given the associations that we establish between temporal acuity and congruity processing, examining trial-by-trial variability may yield important insights into information transfer between sensory systems and performance monitoring circuits in the brain. The use of such approaches may serve to elucidate the integrity and developmental trajectory of multisensory temporal integration in both the typically and atypically developing brain.
This work was supported by NIH U54 HD083211, NIH DC010927, NIH CA183492, and NIH HD83211 to M. T. W.
Reprint requests should be sent to Mark T. Wallace, 7110 MRB III BioSci Bldg., 465 21st Ave. South, Nashville, TN 3740, or via e-mail: firstname.lastname@example.org.