Predictions about forthcoming auditory events can be established on the basis of preceding visual information. Sounds that are incongruent with predictive visual information have been found to elicit an enhanced negative ERP in the latency range of the auditory N1 compared with physically identical sounds preceded by congruent visual information. This so-called incongruency response (IR) is interpreted as reduced prediction error for predicted sounds at a sensory level. The main purpose of this study was to examine the impact of probability manipulations on the IR. We manipulated the probability with which particular congruent visual–auditory pairs were presented (83/17 vs. 50/50 condition). This manipulation led to two conditions with different strengths of the association of visual with auditory information. A visual cue was presented either above or below a fixation cross and was followed by either a high- or low-pitched sound. In 90% of trials, the visual cue correctly predicted the subsequent sound. In one condition, one of the sounds was presented more frequently (83% of trials), whereas in the other condition both sounds were presented with equal probability (50% of trials). Therefore, in the 83/17 condition, one congruent combination of visual cue and corresponding sound was presented more frequently than the other combinations, presumably leading to a stronger visual–auditory association. A significant IR for unpredicted compared with predicted but otherwise identical sounds was observed only in the 83/17 condition, but not in the 50/50 condition, where both congruent visual cue–sound combinations were presented with equal probability. We also tested whether the processing of the prediction violation depends on the task relevance of the visual information. To this end, we contrasted a visual–auditory matching task with a pitch discrimination task. 
The task affected only behavioral performance, not the prediction error signals. The results suggest that the generation of visual-to-auditory sensory predictions is facilitated by a strong association between the visual cue and the predicted sound (83/17 condition) but is not influenced by the task relevance of the visual information.
In our daily lives, we are surrounded by a variety of information from interacting sensory input. In human communication especially, visual information is used to facilitate auditory processing. More precisely, an early integration of audiovisual information has been observed, which is indicated by auditory response suppression for audiovisual compared with auditory-only stimulations (e.g., Stekelenburg & Vroomen, 2007; van Wassenhove, Grant, & Poeppel, 2005). These effects have been attributed to predictive processes (Stekelenburg & Vroomen, 2007, 2012; Vroomen & Stekelenburg, 2010; Besle, Fort, & Giard, 2005), because information from lip movements generally precedes speech sounds (on average, 150 msec earlier, Chandrasekaran, Trubanova, Stillittano, Caplier, & Ghazanfar, 2009). In general, whenever visual information precedes an auditory stimulus, the preceding visual cue is thought to be used to predict the upcoming auditory stimulus. In this study, we focus on the modulation of these underlying prediction mechanisms. We present visual symbolic information preceding auditory stimuli (visual cue–sound combinations), which is assumed to generate visually based prediction effects. In the current study, we manipulate the strength of the association of the visual symbolic information with auditory information and examine its impact on prediction effects.
Widmann, Kujala, Tervaniemi, Kujala, and Schröger (2004) investigated visually based prediction effects by using a symbol-to-sound mapping paradigm. They presented a sequence of score-like visual symbols before a corresponding sound sequence, which was either congruent or incongruent to the visual sequence (maximum of one incongruent sound per sequence). The ERPs to the incongruent events (contradicting a possibly established prediction) were compared with the ERPs to the congruent events, in which the difference between the events was in the visual information (i.e., identical sounds were presented). This revealed a negative deflection of the difference wave (incongruent-minus-congruent) about 100–130 msec after sound onset at frontolateral sites. The amplitude difference was interpreted in the context of the predictive coding framework (Friston, 2005). The predictive coding framework describes extrapolations that are made on the basis of previous experiences. The interaction between the current sensory input and prior knowledge is used to predict an upcoming event. A mismatch between the prediction and the actual experienced stimulus is termed “prediction error” (for a review, see Bendixen, SanMiguel, & Schröger, 2012). Hence, the amplitude difference (incongruent-minus-congruent) was interpreted as an enhanced prediction error signal for incongruent (i.e., unpredicted) sounds compared with predicted sounds and termed “incongruency response” (IR). Elicitation of the IR is thought to mark a sensory prediction error, because it has auditory generators in the superior temporal gyrus (Pieszek, Widmann, Gruber, & Schröger, 2013). The presence of an IR therefore leads to the suggestion that the auditory system established a certain expectation of the upcoming auditory sound based on the preceding visual information (Widmann et al., 2004). 
More precisely, the IR is thought to be elicited by a discrepancy between the established prediction of an auditory stimulus and the actually occurring sound. Meanwhile, the IR has been replicated in several studies (Pieszek, Schröger, & Widmann, 2014; Pieszek et al., 2013).
The literature shows a diverse picture of what information from visual cues is used to facilitate the processing of upcoming auditory events. For instance, Stekelenburg and Vroomen (2007) argue that the effect is driven by induced temporal predictions and that the congruency of the information does not play a role. In contrast, Paris, Kim, and Davis (2017) and van Laarhoven, Stekelenburg, and Vroomen (2017) claim that temporal and content information induce a prediction of the upcoming auditory event. Interestingly, these studies (Paris et al., 2017; Stekelenburg & Vroomen, 2007) did not find processing differences between sounds preceded by congruent and sounds preceded by incongruent visual information within 100 msec after sound onset, whereas studies reporting the IR did find such differences (e.g., Pieszek et al., 2013). A main difference between these studies is that, for instance, Pieszek et al. (2013) used two visual cue–sound combinations, one of which was presented more frequently than the other. In contrast, Paris et al. (2017) prepared three different visual cue–sound combinations that were presented with equal probability, and Stekelenburg and Vroomen (2007) presented four. This comparison suggests that studies that find a difference between the sensory processing of congruent and incongruent visual cue–sound combinations (i.e., an IR) present one congruent combination more frequently than the other visual cue–sound combination, whereas those that do not find such a modulation present all combinations with equal probability. The frequent presentation of one congruent visual cue–sound combination probably facilitates the associative learning of the relation between the visual and auditory stimuli. Furthermore, this could lead to a strong association of visual with auditory information, which might induce a visually based sensory prediction and hence a modulation of sensory auditory processing based on the congruency of the preceding visual information.
In addition, the modulation of sensory auditory processing could be influenced by the task requirements, namely, the participant's focus of attention. Stekelenburg and Vroomen (2007) asked their participants to react toward a catch trial in the presented video sequence. Hence, to fulfill the task, participants had to pay attention just to the visual stimulus, but the task was independent of the visual content. In Paris et al. (2017), the participants did not have a task during the EEG recording and were simply instructed to pay attention to the stimuli. In contrast to these studies, Pieszek et al. (2013) instructed participants to discriminate the pitch of the two presented sounds and additionally to use the visual information to answer as fast and accurately as possible. It could be that the task requirements influence the formation of visually based predictions and thereby the facilitation of the processing of upcoming auditory events. We would assume facilitated processing based on visual preceding information if both the information from the visual and auditory stimuli are task relevant.
In this study, we were interested in the effects of the strength of the association of visual with auditory information on the formation of visually based sensory predictions as well as in the influence of task requirements. Our experimental paradigm was adapted from Pieszek et al. (2013), who presented visual note symbols either above or below a fixation cross before either high- or low-pitched sounds. The resulting visual cue–sound combinations were either congruent (both stimuli high or both stimuli low) or incongruent. In our setup, high- and low-pitched sounds were either presented with the same probability (50/50 condition) or one auditory stimulus (counterbalanced across participants) was presented more frequently than the other tone (83/17 condition). Overall, the congruent visual cue–sound combinations were presented more frequently (90%) than the incongruent combinations (10%). For the 83/17 condition, this results in a more frequent presentation of one congruent visual cue–sound combination and hence probably leads to a stronger association of the predictive visual cue with the auditory information than in the 50/50 condition. In addition, two different task conditions were used. Participants had to perform either a visual–auditory matching task or a pitch discrimination task. In other words, we aimed to systematically test which of these manipulations (sound probability vs. task requirements) leads to the effects of visually induced predictions on auditory processing (apparent in the IR elicitation) in a trial-by-trial design. In terms of the participants' behavioral performance, we expected faster RTs and higher accuracy in the congruent compared with the incongruent trials. This has been referred to as the congruency effect and has been attributed to participants' ability to control interference induced by incongruent information (Sarmiento, Shore, Milliken, & Sanabria, 2012; Egner & Hirsch, 2005).
The data (EEG and behavioral data) of 19 participants were acquired. One participant was excluded because of the quality of the EEG data (only 60% of trials remained after artifact correction), and two participants were excluded because of poor performance in the behavioral task (accuracy close to chance level). The remaining 16 participants (eight women; age range = 18–32 years, mean age = 23.2 years; 13 right-handed) all reported normal hearing, normal or corrected-to-normal vision, and not taking any medication affecting the CNS. All participants gave written consent in accordance with the Declaration of Helsinki and received either credit points or modest financial compensation for their participation. The project was approved by the local ethical committee of the Medical Faculty of the University of Leipzig.
Apparatus and Stimuli
The experiment was conducted within two sessions that were at least 20 hr apart, whereby on average they were 6 days apart. Before the first session, all participants completed a shortened German version of the Edinburgh handedness inventory (Oldfield, 1971). The participants were seated in an acoustically attenuated and electrically shielded chamber. During the experimental task, EEG was continuously recorded. The stimulation was presented using Psychophysics Toolbox (Version 3; Kleiner et al., 2007; Brainard, 1997) on a Linux-based system using GNU Octave (Version 4.0.0). A fixation cross (0.3° × 0.3° of visual angle) was permanently presented on the screen, 60 cm in front of the participant. Each trial started with the presentation of a white eighth note symbol (0.7° × 0.9° of visual angle), which was presented for 100 msec and was positioned either above or below the fixation cross. The eccentricity between fixation cross and the outer bound of the visual stimulus was 1° of the visual angle. The visual stimulus was followed by an auditory stimulus (tone with either 440- or 352-Hz frequency), which was presented 600 msec after the onset of the visual stimulus. The auditory stimulus was presented for 100 msec (including 5 msec rise and fall times, respectively; with an intensity of 73 dB SPL) via loudspeakers (Bose Companion 2 Series II, Bose Corporation) that were positioned at the left and right sides of the screen. Thereby, the presented visual cue–sound combinations were more frequently congruent (90% of the trials) than incongruent (10% of the trials). A congruent visual cue–sound combination is defined as the high-pitched tone being preceded by the white eighth note symbol presented above the fixation cross or the low-pitched tone being preceded by the white eighth note symbol presented below the fixation cross.
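The stimulus sizes above are given in degrees of visual angle. As a rough aid for reproducing the display, they can be converted to on-screen extents with the standard relation size = 2 · d · tan(θ/2). The following Python sketch applies this relation to the reported 60-cm viewing distance; the function and variable names are illustrative and not taken from the original stimulation code.

```python
import math

def visual_angle_to_size_cm(angle_deg, distance_cm):
    """Extent (cm) that subtends `angle_deg` at a viewing distance of `distance_cm`."""
    return 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)

VIEWING_DISTANCE_CM = 60.0  # reported distance between participant and screen

# Reported stimulus sizes in degrees of visual angle (width, height).
fixation_cm = tuple(visual_angle_to_size_cm(a, VIEWING_DISTANCE_CM) for a in (0.3, 0.3))
note_cm = tuple(visual_angle_to_size_cm(a, VIEWING_DISTANCE_CM) for a in (0.7, 0.9))

print("fixation cross: %.2f x %.2f cm" % fixation_cm)
print("note symbol:    %.2f x %.2f cm" % note_cm)
```

At such small angles the result is close to the small-angle approximation d · θ (θ in radians), so the fixation cross spans roughly 3 mm on the screen.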
We investigated two task conditions (match vs. pitch task) and two conditions regarding the sound probability (50/50 vs. 83/17) over two experimental sessions. In each session, participants performed either the visual–auditory matching task or the pitch discrimination task (task order was counterbalanced across participants). In the visual–auditory matching task, the participants had to indicate whether the auditory stimulus was congruent or incongruent to the visual stimulus. In the pitch discrimination task, by contrast, they had to indicate whether the auditory stimulus was high- or low-pitched by pressing the corresponding button (balanced across participants). We introduced these two tasks to investigate whether the task relevance of the visual information (match task) has an influence on the auditory processing. In both task conditions, the participants were additionally instructed to pay attention to the visual stimuli and use the informational content to be able to respond fast and accurately. This instruction was given in both task conditions to ensure that the visual stimulus was also processed in the pitch discrimination task, that is, even when the visual information was not task relevant. The response window started after the onset of the auditory stimulus and lasted 900 msec. The SOA was jittered: the next trial (onset of the visual stimulus) started either 1550, 1700, or 1850 msec (on average, 1700 msec) after the onset of the previous trial (onset of the visual stimulus). Within one session, participants were presented with nine blocks per sound probability condition. In the 50/50 condition, high- and low-pitched sounds (i.e., congruent high/low visual cue–sound combinations) were presented with equal (50%) probability, and each block consisted of 100 trials (∼2.8-min duration). Overall, 810 congruent and 90 incongruent visual cue–sound combinations were presented. 
In the 83/17 condition, one of the tones was a frequent tone (similar to a standard tone in an oddball paradigm). The frequent tone was presented in 83% of the trials, whereas the infrequent tone (similar to a deviant tone in an oddball paradigm) was presented in 17% of the trials. Which tone, high- or low-pitched, was considered as the frequent tone was balanced across participants. This means that one congruent visual cue–sound combination was presented more frequently than the other congruent combination. Identical to the 50/50 condition, 90% of the trials were congruent and 10% were incongruent visual cue–sound combinations. In the 83/17 condition, this leads to further differentiations (see Figure 1): when the frequent tone is preceded by a congruent visual stimulus (CON [83/17-cond.], 75%), when the frequent tone is preceded by an incongruent visual stimulus (INC [83/17-cond.], 8%), when the infrequent tone is preceded by a congruent visual stimulus (15%), and when the infrequent tone is preceded by an incongruent visual stimulus (2%). For further investigation, we focused on the first two conditions (CON and INC) to investigate the influence of frequent presentation of one congruent visual cue–sound combination on the elicitation of the IR. In our overall experiment, 810 trials for condition CON (83/17-cond.) and 90 trials for condition INC (83/17-cond.) were acquired within nine blocks. Each block lasted about 3.4 min. For both sound probability conditions, we pseudorandomized the trial order within each block. The first two trials of each block were always congruent visual cue–sound combinations, and there was at least one trial with a congruent visual cue–sound combination between two incongruent trials. In the 83/17 condition, we specified that these trials had to be not only congruent but also present the tone that was overall presented more frequently than the other tone. 
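The sequencing constraints for the 83/17 condition can be illustrated with a short Python sketch. It is not the original stimulation code. The per-block counts are an assumption derived from the reported totals (810 CON and 90 INC frequent-tone trials in nine blocks, plus roughly 15% and 2% infrequent-tone trials, giving 120 trials per block). The sketch also makes a simplifying assumption that is stricter than the text requires: every incongruent trial is placed directly after a frequent congruent trial. All names are hypothetical.

```python
import random

# Assumed per-block trial counts for the 83/17 condition (120 trials per block).
BLOCK_COUNTS = {
    ("frequent", "congruent"): 90,
    ("frequent", "incongruent"): 10,
    ("infrequent", "congruent"): 18,
    ("infrequent", "incongruent"): 2,
}
CF = ("frequent", "congruent")  # frequent tone preceded by a congruent cue

def make_block(counts=BLOCK_COUNTS, seed=None):
    """Return one pseudorandomized block as a list of (tone, congruency) tuples."""
    rng = random.Random(seed)
    congruent = [t for t, n in counts.items() if t[1] == "congruent" for _ in range(n)]
    incongruent = [t for t, n in counts.items() if t[1] == "incongruent" for _ in range(n)]

    # Reserve two frequent congruent trials for the start of the block.
    congruent.remove(CF)
    congruent.remove(CF)
    rng.shuffle(congruent)
    base = [CF, CF] + congruent
    rng.shuffle(incongruent)

    # Choose distinct frequent congruent "host" trials (second trial or later);
    # each incongruent trial is inserted directly after its host (assumption).
    hosts = [i for i, t in enumerate(base) if t == CF and i >= 1]
    chosen = set(rng.sample(hosts, len(incongruent)))

    block = []
    for i, trial in enumerate(base):
        block.append(trial)
        if i in chosen:
            block.append(incongruent.pop())
    return block

block = make_block(seed=1)
print(len(block), block[:4])
```

Because each incongruent trial follows its own host, two incongruent trials can never be adjacent, and the first incongruent trial can appear at the third position at the earliest.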
Before the experimental blocks of one sound probability condition, the participants performed a training block to familiarize themselves with the stimulation (same length and probability distribution as experimental blocks). Within the whole experiment, the participants received occasional feedback about their performance to keep them motivated and to remind them to use the content of the visual information in the pitch discrimination task. The experimental task itself had a duration of 56.1 min per session.
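The reported block and session durations follow from the mean SOA and the trial counts. The Python sketch below reproduces this arithmetic as a consistency check. The value of 120 trials per block in the 83/17 condition is not stated explicitly; it is derived here from the 810 CON trials that make up 75% of all trials in nine blocks. Training blocks are not included.

```python
MEAN_SOA_S = 1.7            # mean of the three SOAs (1.55, 1.70, 1.85 s)
BLOCKS_PER_CONDITION = 9    # nine blocks per sound probability condition

trials_5050 = 100  # reported trials per block in the 50/50 condition
# Derived (assumption): 810 CON trials are 75% of all trials -> 1080 trials / 9 blocks.
trials_8317 = round(810 / 0.75) // BLOCKS_PER_CONDITION

def block_minutes(n_trials, soa_s=MEAN_SOA_S):
    """Expected block duration in minutes for a given number of trials."""
    return n_trials * soa_s / 60.0

session_minutes = BLOCKS_PER_CONDITION * (
    block_minutes(trials_5050) + block_minutes(trials_8317)
)

print("50/50 block: %.1f min" % block_minutes(trials_5050))
print("83/17 block: %.1f min" % block_minutes(trials_8317))
print("session:     %.1f min" % session_minutes)
```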
Data Recording and Analysis
EEG was recorded from 32 active Ag/AgCl electrodes using a BrainAmp amplifier and the Vision Recorder 1.20 software (Brain Products). The active electrodes were placed on an EEG cap (actiCAP, Brain Products) in accordance with the extended international 10–20 system (Chatrian, Lettich, & Nelson, 1985). In addition, electrodes were placed on the tip of the nose (reference electrode), on the center of the forehead (ground electrode), and on the left and right mastoid sites. To detect eye movements, two electrodes were placed on the outer canthi of the left and right eyes, and one electrode was positioned below the left eye. During the first session, to ensure the electrodes were positioned at the same locations across both sessions, the electrode positions were 3-D localized with an ultrasound-measuring instrument from Zebris (Software ElGuide Version 1.6). In the second session, electrodes were placed at the same location using the template from the first session. EEG was continuously recorded at a sampling rate of 500 Hz. For data analyses, the EEGLAB toolbox (v13; Delorme & Makeig, 2004) for MATLAB was used. The data were high-pass filtered with a 0.1-Hz cutoff (finite impulse response [FIR] filter; Kaiser-windowed; Kaiser beta = 5.65; filter length = 9056 points) and low-pass filtered with a 48-Hz cutoff (FIR filter; Kaiser-windowed; Kaiser beta = 5.65; filter length = 1812 points). Epochs were generated from 100 msec before the onset of the auditory stimulus until 500 msec after the onset. For the data from the 50/50 condition, a baseline correction from −100 to 0 msec was performed. In the 83/17 condition, contingent negative variation (CNV) potentials were observed that were more pronounced for the rarely presented visual stimuli (INC; see Figure 2). 
The negative drift started after the onset of the visual stimulus and lasted until the onset of the auditory stimulus, thereby contaminating the auditory prestimulus phase that would usually be used for baseline correction (−100 msec to 0 msec). To correct for the influence of the CNVs, the data from the 83/17 condition was high-pass filtered, with a 1.3-Hz cutoff (FIR filter; Kaiser-windowed; Kaiser beta = 5.65; filter length = 9056 points) as used in Pieszek et al. (2013). It is common to use even higher cutoff values (2 Hz) to correct for CNV artifacts (Brown, Clarke, & Barry, 2007; Teder-Sälejärvi, McDonald, Di Russo, & Hillyard, 2002), but this could lead to an invalid interpretation of later ERP components (Widmann, Schröger, & Maess, 2015; Widmann & Schröger, 2012; Luck, 2005). Because we did not expect ERP condition differences within the first 50 msec after stimulus onset, we used this interval (0 to +50 msec after onset of auditory stimulus) for baseline correction, as described in Pieszek et al. (2013). Bad channels were identified according to the deviation criterion (Bigdely-Shamlo, Mullen, Kothe, Su, & Robbins, 2015), which detects channels with unusually high- or low-amplitude deviation. For each channel, the robust z score of the robust standard deviation is calculated, and the channel is marked as a bad channel if the value is greater than 3. Selected channels were removed from analysis and interpolated after independent component analysis (ICA). An extended ICA was trained on 1-Hz filtered data and nonoverlapping epochs with maximal length (−600 to 800 msec). Bad independent components (ICs) were selected based on faster and adjust criteria (Chaumon, Bishop, & Busch, 2015; Mognon, Jovicich, Bruzzone, & Buiatti, 2011; Nolan, Whelan, & Reilly, 2010) as well as on visual identification of lateral eye movements, eye blinks, heartbeat, or drift (Debener, Thorne, Schneider, & Viola, 2010). 
All selected bad ICs were removed (on average, 3 per participant). In addition, the first two epochs of each block and all epochs that followed an incongruent epoch were rejected to remove possible attention reorientation effects. Furthermore, epochs with signal changes exceeding thresholds of 150 μV after ICA artifact removal were excluded, and bad channels that were identified before the ICA were interpolated.
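The deviation criterion used for bad-channel detection can be sketched in a few lines of Python. This is a simplified illustration, not the implementation of Bigdely-Shamlo et al. (2015). It assumes that the robust standard deviation is estimated as 0.7413 times the interquartile range and that the absolute robust z score is compared with the threshold of 3. The channel data are synthetic.

```python
import math
import statistics

def robust_sd(values):
    """Robust SD estimate (assumption): 0.7413 times the interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return 0.7413 * (q3 - q1)

def find_bad_channels(data, threshold=3.0):
    """Flag channels whose robust amplitude deviation is an outlier.

    data: dict mapping channel name -> list of samples.
    Returns (list of bad channel names, dict of robust z scores).
    """
    names = list(data)
    deviations = [robust_sd(data[name]) for name in names]
    center = statistics.median(deviations)
    spread = robust_sd(deviations)
    z = {name: (dev - center) / spread for name, dev in zip(names, deviations)}
    bad = [name for name in names if abs(z[name]) > threshold]
    return bad, z

# Synthetic example: seven channels with similar amplitude, one noisy channel.
amplitudes = {"Fz": 1.0, "Cz": 1.05, "Pz": 1.1, "FC5": 0.95,
              "C3": 0.9, "FC6": 1.02, "C4": 0.98, "T7": 20.0}
data = {name: [amp * math.sin(0.1 * k) for k in range(1000)]
        for name, amp in amplitudes.items()}

bad, z = find_bad_channels(data)
print(bad)
```

Using the median and interquartile range instead of the mean and standard deviation keeps a single extreme channel from inflating the reference distribution against which it is judged.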
For all further analyses, only incongruent epochs and their congruent siblings were considered: For each incongruent epoch, the preceding congruent epoch with the same tone frequency was identified and kept, whereas the other (nonsibling) congruent epochs were rejected. This was done to adjust trial numbers to achieve a similar signal-to-noise ratio in both (incongruent and congruent) conditions. Hence, for the ERP analyses, 90 congruent and 90 incongruent epochs were considered for each task and sound probability condition. An average of 89.2 of 90 trials (SD = 1.4) were included for each condition. There were no significant differences in the number of included trials between conditions.
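The sibling selection can be illustrated as follows. The Python sketch walks through the epochs in presentation order and pairs each incongruent epoch with the nearest preceding congruent epoch of the same tone frequency. The epoch representation and the rule that a congruent epoch is used at most once are assumptions made for this illustration.

```python
def select_siblings(epochs):
    """Pair each incongruent epoch with its preceding congruent sibling.

    epochs: list of dicts with keys 'congruent' (bool) and 'tone_hz' (int),
            in presentation order.
    Returns a list of (incongruent_index, sibling_index) tuples.
    """
    pairs = []
    used = set()  # assumption: each congruent epoch serves as sibling only once
    for i, epoch in enumerate(epochs):
        if epoch["congruent"]:
            continue
        # Search backward for the closest matching congruent epoch.
        for j in range(i - 1, -1, -1):
            candidate = epochs[j]
            if (candidate["congruent"]
                    and candidate["tone_hz"] == epoch["tone_hz"]
                    and j not in used):
                pairs.append((i, j))
                used.add(j)
                break
    return pairs

def ep(congruent, tone_hz):
    return {"congruent": congruent, "tone_hz": tone_hz}

# Toy sequence: indices 3 and 6 are incongruent epochs.
epochs = [ep(True, 440), ep(True, 352), ep(True, 440), ep(False, 440),
          ep(True, 440), ep(True, 352), ep(False, 352), ep(True, 440)]

pairs = select_siblings(epochs)
print(pairs)
```

All congruent epochs that do not appear as a sibling in `pairs` would be dropped, which yields equal trial numbers for the congruent and incongruent averages.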
In a final step, grand averages were calculated for both task conditions and both probability conditions, whereby for the latter, only the ERPs that were elicited by the frequent tones were considered. In addition, the respective difference waves were calculated, resulting in two difference waves for each task condition: INC–CON (50/50-cond.: match/pitch task) and INC–CON (83/17-cond.: match/pitch task).
Behavioral data and ERP mean amplitudes within time window and ROIs were tested with Bayesian repeated-measures ANOVAs estimating Bayes factors (BF10) computed in JASP (JASP Team, 2018). Fifty thousand Monte Carlo sampling iterations and a scaling factor r = .5 for fixed effects (corresponding to the default “medium” effect size prior for fixed effects in the R Bayes factor package; Morey & Rouder, 2015) and r = 1 for the participant random effect (corresponding to the default “nuisance” prior for random effects in the R Bayes factor package) were used. We compared all models (constrained by the principle of marginality) with the null model (BF10) and additionally evaluated main effects and interactions by comparing the models containing a main effect or interaction to the equivalent models stripped of the effect excluding higher order interactions (“Baws factor” or “inclusion Bayes factor based on matched models”, reported as BFIncl; Mathôt, 2017). Data were interpreted as moderate evidence in favor of the alternative (or null) hypothesis if BF10 was larger than 3 (or lower than 0.33) or strong evidence if BF10 was larger than 10 (or lower than 0.1; Lee & Wagenmakers, 2013). BF10 between 0.33 and 3 were reported as anecdotal evidence.
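The interpretation scheme for Bayes factors described above can be written as a small helper. The Python sketch below encodes only the thresholds stated in the text (3 and 10 for the alternative hypothesis, 0.33 and 0.1 for the null hypothesis). The function name and the handling of values that fall exactly on a threshold are assumptions.

```python
def classify_bf10(bf10):
    """Map a Bayes factor BF10 to the evidence categories used in the text."""
    if bf10 <= 0:
        raise ValueError("A Bayes factor must be positive.")
    if bf10 > 10:
        return "strong evidence for H1"
    if bf10 > 3:
        return "moderate evidence for H1"
    if bf10 < 0.1:
        return "strong evidence for H0"
    if bf10 < 0.33:
        return "moderate evidence for H0"
    return "anecdotal evidence"  # 0.33 <= BF10 <= 3

for value in (12.980, 1.849, 0.224):
    print(value, "->", classify_bf10(value))
```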
Bayesian ANOVAs were complemented by frequentist repeated-measures ANOVAs with identical designs. An alpha level of .05 was defined for all frequentist statistical tests. Statistically significant results were reported, including the η2 effect size measure. Follow-up ANOVAs and two-tailed t tests were computed for statistically significant interactions.
For the behavioral data, the first two trials of each block and all trials that directly followed an incongruent trial were removed from the data set. To investigate the behavioral effects, a 2 × 2 × 2 repeated-measures ANOVA was conducted that included the factors Task (match vs. pitch task), Congruency (congruent vs. incongruent), and Sound Probability (50/50 vs. 83/17).
To compute ERP mean amplitudes, we applied the canonical ROIs and time windows previously defined by Pieszek et al. (2013). For the IR, two ROIs for the left (mean of FC5 and C3) and right hemisphere (mean of FC6 and C4) and an analysis time window from 105 to 130 msec after the onset of the auditory stimulus were used. A 2 × 2 × 2 × 2 repeated-measures ANOVA was performed with the factors Task (match vs. pitch task), Congruency (congruent vs. incongruent), Sound Probability (50/50 vs. 83/17), and Hemisphere (left vs. right) for the ROI mean amplitudes in the IR range. For the N2 and P3 components, a midline ROI (mean of Fz, Cz, and Pz) in a 185–225 msec (N2) and a 235–355 msec (P3) time window was considered (Pieszek et al., 2013). Repeated-measures ANOVAs (2 × 2 × 2) with the factors Task (match vs. pitch task), Congruency (congruent vs. incongruent), and Sound Probability (50/50 vs. 83/17) were performed for the ROI mean amplitudes in the N2 and P3 range.
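As an illustration of how an ROI mean amplitude and the corresponding INC-minus-CON difference can be obtained, consider the following Python sketch. It assumes epochs sampled at 500 Hz that start 100 msec before sound onset, stored as plain lists per channel. The data are synthetic, and the names are hypothetical rather than taken from the original analysis scripts.

```python
SRATE_HZ = 500
EPOCH_START_MS = -100
STEP_MS = 1000 // SRATE_HZ  # 2 msec between samples

def roi_mean_amplitude(erp, channels, t_min_ms, t_max_ms):
    """Mean amplitude across `channels` and all samples within the time window.

    erp: dict mapping channel name -> list of amplitudes (one averaged epoch).
    """
    n_samples = len(erp[channels[0]])
    window = [k for k in range(n_samples)
              if t_min_ms <= EPOCH_START_MS + k * STEP_MS <= t_max_ms]
    values = [erp[ch][k] for ch in channels for k in window]
    return sum(values) / len(values)

# Synthetic ERPs: a negative deflection from 100 to 140 msec in the INC condition.
times = range(EPOCH_START_MS, 500, STEP_MS)
inc = {"FC5": [-2.0 if 100 <= t <= 140 else 0.0 for t in times],
       "C3": [-1.0 if 100 <= t <= 140 else 0.0 for t in times]}
con = {"FC5": [0.0 for _ in times], "C3": [0.0 for _ in times]}

LEFT_ROI = ["FC5", "C3"]
IR_WINDOW_MS = (105, 130)

ir_left = (roi_mean_amplitude(inc, LEFT_ROI, *IR_WINDOW_MS)
           - roi_mean_amplitude(con, LEFT_ROI, *IR_WINDOW_MS))
print(ir_left)
```

With a 2-msec sampling step, the 105-msec window edge falls between two samples, so the first sample included in this sketch is the one at 106 msec.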
In 95.6% of trials, a behavioral response was given within the RT window. The averages of the RT and accuracy data are displayed in Figure 3. RTs were faster in response to congruent sounds compared with incongruent sounds, faster in the pitch compared with the match task, and also faster in the 83/17 compared with the 50/50 sound probability condition. The effect of congruency was larger in the match compared with the pitch task and also larger in the 50/50 compared with the 83/17 sound probability condition. The Bayesian ANOVA (2 × 2 × 2 including the factors Task [match vs. pitch task], Sound Probability [50/50 vs. 83/17], and Congruency [congruent vs. incongruent]) favored the model including the Task, Sound Probability, and Congruency main effects, and the Congruency × Sound Probability and Congruency × Task interactions (BF10 = 2.77 × 10^54). The data provided strong evidence for the Task (BFIncl = 7.77 × 10^12), Sound Probability (BFIncl = 3.63 × 10^7), and Congruency main effects (BFIncl = 8.10 × 10^44) and the Congruency × Sound Probability (BFIncl = 4.370) and Congruency × Task (BFIncl = 4.75 × 10^6) interactions. The data provided moderate evidence against a Task × Sound Probability interaction effect (BFIncl = 0.224). In the follow-up Bayesian t tests, the data provided strong evidence for the congruency effect in both Task (match: BF10 = 1.96 × 10^8; pitch: BF10 = 1.07 × 10^7) and Sound Probability conditions (50/50: BF10 = 2.92 × 10^9; 83/17: BF10 = 1.70 × 10^8). The frequentist repeated-measures ANOVA of the RTs mirrored the effects reported for the Bayesian ANOVA. 
We observed significant main effects of all factors (Task: F(1, 15) = 91.589, p < .001, η2 = .859; Sound Probability: F(1, 15) = 66.177, p < .001, η2 = .815; Congruency: F(1, 15) = 405.356, p < .001, η2 = .964), a significant interaction of the factors Sound Probability × Congruency (F(1, 15) = 16.795, p < .001, η2 = .528), and a significant interaction of the factors Task × Congruency (F(1, 15) = 26.237, p < .001, η2 = .636). Responses to congruent sounds were significantly faster than responses to incongruent sounds in both Task (match: t(15) = −16.78, p < .001; pitch task: t(15) = −13.41, p < .001) and Sound Probability conditions (50/50: t(15) = −20.54, p < .001; 83/17: t(15) = −16.60, p < .001; see Figure 3).
Response accuracy was higher in response to congruent compared with incongruent sounds and higher in the 83/17 compared with the 50/50 sound probability condition. The effect of Congruency was larger in the 50/50 compared with the 83/17 Sound Probability condition. The Bayesian ANOVA (2 × 2 × 2 including the factors Task [match vs. pitch task], Sound Probability [50/50 vs. 83/17], and Congruency [congruent vs. incongruent]) favored the model including the Sound Probability and Congruency main effects and the Congruency × Sound Probability interaction (BF10 = 3.96 × 10^24). The data provided strong evidence for the Sound Probability (BFIncl = 203,938) and Congruency main effects (BFIncl = 2.14 × 10^16) and the Congruency × Sound Probability interaction (BFIncl = 838,940). In the follow-up Bayesian t tests, the data provided strong evidence for the Congruency effect in both Sound Probability conditions (50/50: BF10 = 10,347; 83/17: BF10 = 1393). The frequentist repeated-measures ANOVA of the accuracy data largely mirrored the effects reported for the Bayesian ANOVA. We observed significant main effects of the factors Sound Probability (F(1, 15) = 35.955, p < .001, η2 = .706) and Congruency (F(1, 15) = 59.154, p < .001, η2 = .798), whereas the factor Task did not have a significant main effect (F(1, 15) = 3.261, p = .091, η2 = .179). Furthermore, significant interactions of Sound Probability × Congruency (F(1, 15) = 35.363, p < .001, η2 = .702) and Task × Sound Probability (F(1, 15) = 8.804, p = .01, η2 = .370) and the three-way interaction of the factors Task × Sound Probability × Congruency (F(1, 15) = 4.604, p = .049, η2 = .235) were observed. Responses to congruent sounds were significantly more accurate than responses to incongruent sounds in both Sound Probability conditions (50/50: t(15) = −7.529, p < .001; 83/17: t(15) = −6.208, p < .001). 
We did not perform frequentist follow-up tests for the significant Task × Sound Probability and Task × Sound Probability × Congruency interactions as the Bayesian ANOVA indicated stronger support by the data for the null hypothesis than for both interaction effects (BFIncl = 0.469 and BFIncl = 0.407, respectively).
An IR component (time window 105–130 msec) was observed in the 83/17 sound probability condition, but not in the 50/50 condition (Figure 4). IR amplitude was unaffected by the task and peaked 122 msec after sound onset with a bilateral temporal scalp distribution (Figure 7). The Bayesian ANOVA (2 × 2 × 2 × 2 including the factors Task [match vs. pitch task], Sound Probability [50/50 vs. 83/17], Congruency [congruent vs. incongruent], and Hemisphere [left vs. right]) favored the model including the Congruency, Sound Probability, and Hemisphere main effects and the Congruency × Sound Probability interaction (BF10 = 491.027). The data only provided anecdotal evidence for the Hemisphere main effect (BFIncl = 1.849) but strong evidence for the Congruency × Sound Probability interaction (BFIncl = 12.980; main effect Congruency BFIncl = 165.628; main effect Sound Probability BFIncl = 0.147; all other BFIncl < 0.5). The data provided moderate evidence against an effect of the Task on the IR (Task × Congruency: BFIncl = 0.208; Task × Sound Probability × Congruency: BFIncl = 0.316). In follow-up Bayesian t tests, the data provided moderate evidence against an effect of Congruency in the 50/50 Sound Probability condition (BF10 = 0.272) and very strong evidence for an effect of Congruency—more negative ERPs in response to incongruent compared with congruent sounds—in the 83/17 Sound Probability condition (BF10 = 87.792). The frequentist repeated-measures ANOVA of the IR mean amplitudes mirrored the effects reported for the Bayesian ANOVA. We observed a significant main effect of Congruency (F(1, 15) = 6.312, p = .024, η2 = .296) and no other significant main effects (Sound Probability: F(1, 15) = 0.047, p = .832, η2 = .003; Task: F(1, 15) = 1.430, p = .250, η2 = .087; Hemisphere: F(1, 15) = 3.632, p = .076, η2 = .195). Importantly, a significant interaction of the factors Sound Probability × Congruency (F(1, 15) = 5.264, p = .036, η2 = .260) was observed. 
An effect of Congruency (IR) was only observed in the 83/17 Sound Probability condition (t(15) = 4.550, p < .001) but not in the 50/50 Sound Probability condition (t(15) = 0.373, p = .714).
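The verbal labels attached to the Bayes factors above (anecdotal, moderate, strong, very strong) can be reproduced with a small helper. This is a minimal sketch: the thresholds 1, 3, 10, 30, and 100 are an assumed Jeffreys-type convention, inferred from the labels and values reported here rather than stated in this section, and the function name `evidence_label` is hypothetical.

```python
def evidence_label(bf: float) -> str:
    """Map a Bayes factor (BF10 or BFIncl) to a verbal evidence label.

    The thresholds (3, 10, 30, 100) are an assumed convention, not taken
    from the text. Values below 1 are read as evidence against the effect.
    """
    if bf <= 0:
        raise ValueError("A Bayes factor must be positive.")
    if bf == 1:
        return "no evidence either way"
    direction = "for" if bf > 1 else "against"
    # Fold BF < 1 onto the same scale by taking the reciprocal.
    strength = bf if bf > 1 else 1.0 / bf
    if strength < 3:
        label = "anecdotal"
    elif strength < 10:
        label = "moderate"
    elif strength < 30:
        label = "strong"
    elif strength < 100:
        label = "very strong"
    else:
        label = "extreme"
    return f"{label} evidence {direction} the effect"


# Values reported for the IR analysis.
for name, bf in [
    ("Hemisphere (BFIncl)", 1.849),
    ("Congruency x Sound Probability (BFIncl)", 12.980),
    ("Task x Congruency (BFIncl)", 0.208),
    ("Congruency, 50/50 (BF10)", 0.272),
    ("Congruency, 83/17 (BF10)", 87.792),
]:
    print(f"{name}: {evidence_label(bf)}")
```

Under these assumed thresholds, the helper returns the same labels that the text uses for each of the listed values.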
The observed effects of Sound Probability could be potentially confounded by differences in the preprocessing (filter parameters 0.1 Hz vs. 1.3 Hz high-pass filter and baseline correction) introduced to compensate for the CNVs observed in the 83/17 condition. To exclude the possibility that these effects were due to the differences in preprocessing, we performed two additional analyses: First, to demonstrate that the IR was not due to filter artifacts, we additionally analyzed the data of the 50/50 condition using the identical parameters used in the 83/17 condition (1.3 Hz high-pass filter cutoff and 0 to +50 msec baseline correction). As depicted in Figure 5 (first and second row), contrasting both preprocessing approaches for the 50/50 condition did not reveal a significant enhancement of the difference wave in the statistical testing window where an IR elicitation is expected (105–130 msec).
Second, it has been shown that early ERP components (N1, N2 peak amplitudes) within deviant waveforms may be distorted (i.e., enhanced) when applying higher high-pass filter cutoff frequencies (Widmann et al., 2015). This can happen because, in noncausal filters, the information from later components can be transferred to earlier components—that is, the later P3 amplitude can have an influence on preceding N1 and N2 amplitudes. To rule out that such effects could have affected the IR in our setting, we additionally tested a minimum-phase causal filter (still high-pass filter with 1.3 Hz cutoff) when preprocessing the data from the 83/17 condition. Furthermore, no baseline correction was performed. All other data processing steps were identical to the original processing of the 83/17 condition data. Results (see Figure 5, fourth row) display a significant elicitation of the IR also with a causal filter. The additional analyses demonstrate that the IR in our study does not result from filter artifacts that are induced by a large high-pass filter cutoff parameter. It is important to mention that the causal filter considerably distorts the temporal dynamics of the later ERP components (Widmann et al., 2015). Hence, for the sake of interpretability of all elicited components, we adhered to the original data processing (noncausal filtering).
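The concern about noncausal filters can be illustrated with a toy simulation. The sketch below is not the filter used in the study: it substitutes a first-order RC high-pass (1.3 Hz cutoff, with an assumed 500 Hz sampling rate) for the actual filters and applies it either forward only (causal) or forward and backward (zero phase, noncausal). The synthetic signal contains only a late positive deflection between 250 and 450 msec; all function names and signal parameters are hypothetical.

```python
import math


def highpass_causal(x, cutoff_hz, fs_hz):
    """First-order RC high-pass, run forward in time only (causal)."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / fs_hz
    a = rc / (rc + dt)
    y, prev_x, prev_y = [], 0.0, 0.0
    for sample in x:
        prev_y = a * (prev_y + sample - prev_x)
        prev_x = sample
        y.append(prev_y)
    return y


def highpass_zero_phase(x, cutoff_hz, fs_hz):
    """Same filter applied forward and then backward (noncausal, zero phase)."""
    forward = highpass_causal(x, cutoff_hz, fs_hz)
    backward = highpass_causal(forward[::-1], cutoff_hz, fs_hz)
    return backward[::-1]


def window_mean(y, times, start_s, end_s):
    """Mean amplitude of y within [start_s, end_s]."""
    values = [v for v, t in zip(y, times) if start_s <= t <= end_s]
    return sum(values) / len(values)


def late_bump(t, onset=0.25, offset=0.45, amplitude=10.0):
    """Raised-cosine positive deflection, exactly zero outside [onset, offset]."""
    if not onset <= t <= offset:
        return 0.0
    phase = (t - onset) / (offset - onset)
    return amplitude * 0.5 * (1.0 - math.cos(2.0 * math.pi * phase))


FS = 500.0  # assumed sampling rate in Hz
times = [n / FS for n in range(400)]  # 0 to 0.798 s after sound onset
signal = [late_bump(t) for t in times]

causal = highpass_causal(signal, 1.3, FS)
zero_phase = highpass_zero_phase(signal, 1.3, FS)

# Mean amplitude in the IR window (105-130 msec), where the input is flat.
n1_causal = window_mean(causal, times, 0.105, 0.130)
n1_zero_phase = window_mean(zero_phase, times, 0.105, 0.130)
print(f"causal filter:     {n1_causal:.4f}")
print(f"zero-phase filter: {n1_zero_phase:.4f}")
```

In this toy case the causal output stays at zero in the 105–130 msec window, whereas the zero-phase output is negative there even though the input contains no early activity. This is the kind of backward leakage from a later positivity that the causal-filter control analysis was meant to exclude.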
An N2 component was only elicited in response to incongruent sounds. ERP mean amplitudes in the N2 range (185–225 msec) were unaffected by the task and sound probability conditions. Figure 6 displays an exemplary plot at Cz. The Bayesian ANOVA favored the model with a main effect of Congruency only (BF10 = 8.09 × 10^20). The data did not provide evidence for any other effect than Congruency (all other BFIncl < 1). The corresponding frequentist 2 × 2 × 2 repeated-measures ANOVA revealed a significant main effect of the factor Congruency (F(1, 15) = 44.428, p < .001, η2 = .748), whereas no statistically significant other main or interaction effects were observed (Task: F(1, 15) = 1.962, p = .182, η2 = .116; Sound Probability: F(1, 15) = 2.461, p = .138, η2 = .141).
ROI mean amplitudes in the P3 range (235–355 msec) were more positive in response to incongruent compared with congruent sounds and also more positive in the 50/50 compared with the 83/17 sound probability conditions, indicating enhanced P3 component amplitudes (see Figure 6 for exemplary plot). The Bayesian ANOVA favored the model including the Sound Probability and Congruency main effects (BF10 = 8.83 × 10^10). The data did not provide evidence for any other effect than Sound Probability (BFIncl = 3.02 × 10^9) and Congruency (BFIncl = 362.833; all other BFIncl < 1). The corresponding frequentist 2 × 2 × 2 repeated-measures ANOVA of the mean amplitude differences of the conditions revealed a significant main effect of the factors Congruency (F(1, 15) = 16.004, p = .001, η2 = .516) and Sound Probability (F(1, 15) = 18.750, p < .001, η2 = .556), whereas the factor Task did not have a significant main effect (F(1, 15) = 1.286, p = .275, η2 = .079). No statistically significant interaction effects were observed.
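As a consistency check on the reported effect sizes, the η2 values can be recomputed from the F statistics and their degrees of freedom. The sketch below assumes that the reported η2 is partial eta squared, which is an inference from the numbers rather than something stated in this section; the function name is hypothetical.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared recovered from an F statistic and its dfs.

    Assumes eta_p^2 = (F * df_effect) / (F * df_effect + df_error).
    """
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# (label, F, reported eta squared); all tests have df = (1, 15).
reported = [
    ("IR: Congruency", 6.312, 0.296),
    ("IR: Sound Probability x Congruency", 5.264, 0.260),
    ("N2: Congruency", 44.428, 0.748),
    ("P3: Congruency", 16.004, 0.516),
    ("P3: Sound Probability", 18.750, 0.556),
]

for label, f_value, eta_reported in reported:
    eta = partial_eta_squared(f_value, 1, 15)
    print(f"{label}: computed {eta:.3f}, reported {eta_reported:.3f}")
```

Under this assumption, each recomputed value rounds to the effect size given in the text.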
We measured brain responses elicited by auditory stimuli that either matched (i.e., congruent) or violated (i.e., incongruent) visual-based predictions. We systematically investigated factors that potentially modulate sensory auditory processing. A significant modulation was observed only in the 83/17 condition: with a frequent presentation of one sound and thereby a more frequent presentation of one congruent visual cue–sound combination. Only in this condition, unpredicted compared with predicted but otherwise identical sounds elicited enhanced negativity around 105–130 msec after tone onset (IR). This shows that the higher relative probability of one congruent visual cue–sound combination facilitates the formation of a sensory prediction of the upcoming auditory event that is indicated by the elicitation of IR. We attribute the modulation of sensory auditory processing observed within the IR time range to predictive coding mechanisms.
In line with the predictive coding theory (Friston, 2012; Friston & Kiebel, 2009; Garrido et al., 2008), we suppose that the auditory representation of the expected sound is “preactivated” in the predictive layer by the predictive visual information via feedback connections. The preactivation of the neurons by the preceding visual information results in a reduced prediction error response when the expected sound is presented compared with when an unexpected sound is presented. In other words, if the prediction is violated, the prediction error-related ERP in response to the unexpected auditory stimulus is enhanced relative to the response to the predicted stimulus.
In the 50/50 condition, no IR component but only the subsequent N2b component was observed. This leads to the conclusion that in the 50/50 trial-by-trial presentation, the simple precedence of visual symbolic information is able to generate an expectation at higher cognitive levels (as N2b demonstrates); however, apparently it was not able to implement a prediction of the upcoming auditory event at sensory levels of processing (as the absence of IR demonstrates). Why does only the 83/17 sound probability condition trigger such a sensory prediction? In the 83/17 condition, one congruent visual cue–sound combination is presented more frequently than all the other combinations. This facilitates the associative learning of the relation between the visual and auditory stimuli. In this sense, the high probability of the congruent visual cue–sound combination establishes a strong relationship between the two stimuli that is not present or at least not strong enough in the 50/50 condition. It seems to be important that one visual cue–sound combination is dominant and that there are not many alternatives to it. Predictive coding apparently requires a strong association between the visual and auditory stimuli to classify the visual information as a valid predictor of the upcoming auditory event.
An alternative explanation for the IR elicitation, besides prediction effects, could be repetition suppression. The frequent presentation of one congruent visual cue–sound combination (83/17 condition) leads to its frequent repetition and could lead to a suppression of both visual and auditory ERP responses, whereby any difference in the visual part of the visual cue–sound combination (resulting in the expectation of a different sound) might result in a disinhibition of subsequent auditory processing even in case of sound repetition. A modulation of repetition suppression by expected versus unexpected stimuli has been reported previously for faces (Summerfield, Trittschuh, Monti, Mesulam, & Egner, 2008) and also for sounds (Todorovic, van Ede, Maris, & de Lange, 2011). However, repetitions also occur in the 50/50 condition, so a (possibly smaller) IR would be expected there as well. Thus, either repetition suppression is not the only relevant mechanism or a higher number of repetitions, as in the 83/17 condition, is required.
Another explanation would be mechanisms of associative learning. A more frequent presentation of the congruent visual cue–sound combination compared with the incongruent combination could lead to object-adapted suppression of the N1. The observed modulation could be explained merely by learning the association and hence the differentiation of congruent and incongruent visual cue–sound combinations.
Nevertheless, the various explanations (predictive coding, repetition suppression, associative learning) share commonalities in that they illustrate modulation of sensory processing. The alternative explanations imply that, based on the visual information, the system generates an expectation about the upcoming tone and shows an enhanced response for stimuli violating the expectation relative to the response for stimuli confirming the expectation. We would like to note that our paradigm does not allow us to unequivocally distinguish whether the observed relative difference reflects an attenuation of the response to expected stimuli (as in repetition suppression) or an enhancement of the response to unexpected stimuli (as in prediction error) or both. Importantly, the modulation of processing at such an early stage allows information to be processed as fast and efficiently as possible, reducing the amount of energy that our brain needs to interact with our environment. This is in line with the free energy principle (for a review, see Friston, 2010), which suggests that the brain always tries to optimize information processing.
Furthermore, research from related fields supports the predictive coding explanation: The auditory sensory memory is sensitive to visual information (Besle et al., 2005), and preceding visual information is an important factor in the processing of speech (e.g., Treille, Vilain, & Sato, 2014; Stekelenburg & Vroomen, 2007) and nonspeech stimuli (Stekelenburg & Vroomen, 2007, 2012; Vroomen & Stekelenburg, 2010). More precisely, Stekelenburg and Vroomen (2007) claim that the N1 suppression effect, which is observed for visual–auditory compared with unimodal auditory stimulation, depends on anticipatory visual motion that induces temporal predictions about the upcoming auditory event. A more recent study (Paris et al., 2017) claims that temporal and identity predictions modulate the effect of N1 facilitation for predicted auditory stimuli, which is in line with studies that found an IR for prediction violations of visually based sensory predictions. This is supported by van Laarhoven et al. (2017), who found in an omission-N1 paradigm that visual information about timing and content is essential to form an expectation of the upcoming auditory event. Paris et al. (2017) go even further and claim that the N1 facilitation in an N1 suppression paradigm is based on prediction rather than on multisensory integration mechanisms. It has to be noted that these studies used different paradigms and stimuli. Nevertheless, all of them induce a certain form of auditory prediction and contrast the processing of prediction confirmations and prediction violations. The findings of the current study additionally support the view that content information is used to form predictions about upcoming events but only if there is a strong association of visual with auditory information (83/17 condition).
In contrast to the current study, Widmann et al. (2004) showed that the IR can also be elicited in a condition with equal sound presentation probability (as here in the 50/50 condition). They presented a pattern of score-like visual symbols and serially corresponding sounds with equal probability and showed that the IR is elicited when one sound violates the visual pattern. The main difference, compared with our study, is that Widmann et al. (2004) presented a sequence of score-like visual symbols before a corresponding sound sequence, whereas we used a trial-by-trial design. Apparently, the contextual (i.e., visual) information about the transition of one tone to the next tone facilitated the generation of visual-to-auditory sensory predictions. In addition, an intensive training session on a separate day preceded the experimental session. Furthermore, the task in the Widmann et al. (2004) study was similar to the reading of a simplified musical score because visual symbol sequences were continuously presented on the screen before sound onset. The learning phase, together with the high similarity of the task with maximally trained reading processes, potentially facilitates automatized top–down predictive processing that is probably used in similar reading conditions.
In conclusion, the frequent presentation of one congruent visual cue–sound combination (83/17 condition) is sufficient to modulate sensory processing mechanisms. Nevertheless, it is not a necessary prerequisite, because contextual information also facilitates the generation of visual-to-auditory sensory predictions (see Widmann et al., 2004). A more natural association of visual and auditory stimuli (e.g., reading words, speech processing, a ball hitting a wall) is probably also able to establish visually based sensory predictions, even in an equal probability condition, because it requires a strong relationship between visual and auditory information. However, Stekelenburg and Vroomen (2007) used ecologically valid audiovisual events and did not observe processing differences at the N1 level between sounds preceded by congruent and sounds preceded by incongruent visual information. At the later P2 processing level, they observed effects of stimulus congruency and attributed them to multisensory integration at a higher processing level (associative, semantic, or phonetic). In the current study, visual inspection also reveals a reduction of P2 for incongruent compared with congruent visual cue–sound combinations (visible in Figure 6), but a possible P2 attenuation was not tested statistically. The reduction is observed in both the 50/50 and 83/17 conditions and hence does not appear to be modulated by the probability of visual cue–sound combinations (a direct comparison between probability conditions is not possible because of the different preprocessing). The fact that Stekelenburg and Vroomen (2007) observed similar effects at the P2 but not at the N1 level could be explained by the high proportion of incongruent trials (25%) that they presented, which could imply that natural associations might also be sensitive to the probability of congruent visual cue–sound combinations.
The topographical distribution (see Figure 7, bottom) that was observed for the IR is in line with previous observations (Pieszek et al., 2013, 2014). Pieszek et al. (2013) performed a VARETA source analysis and reported evidence of the IR having auditory generators in the superior temporal gyrus. They also admitted that they could not exclude audiovisual integration processes. Likewise, our study was not designed to precisely localize the generators of the IR component.
In both 50/50 and 83/17 conditions, N2 and P3 components were elicited. In the 50/50 condition, the P3 seems to be larger than in the 83/17 condition. They cannot be reliably contrasted, because the preprocessing was different in both conditions. Nevertheless, this shows that, in both conditions, the incongruent visual cue–sound combinations were detected and elicited enhanced signals.
In the context of an oddball paradigm, an elicited N2 component is usually associated with attentive deviant detection (Patel & Azzam, 2005; Novak, Ritter, Vaughan, & Wiznitzer, 1990; Näätänen, Simpson, & Loveless, 1982). In addition, go/no-go studies found N2 elicitation and hence ascribed it to cognitive control processes (Gruendler, Ullsperger, & Huster, 2011; Dimoska, Johnstone, & Barry, 2006). In our investigation, the congruent response is more frequently correct than the incongruent response. Hence, in an incongruent trial, the participant has to inhibit the frequent response. This holds mainly for the visual–auditory matching task condition, but it also holds for the pitch task condition if the participant learned to rely on the visual information and to prepare the response accordingly. The P3 component is mainly thought to reflect the involuntary switch of attention (Escera, Alho, Schröger, & Winkler, 2000) or mobilization for action (Nieuwenhuis, De Geus, & Aston-Jones, 2011). In the current study, it is more probable that we actually observed a P3a, which has been reported to be elicited in response to infrequent distinct tones or distractors (Polich, 2007). Hence, it reflects a switch of attention toward unexpected stimuli. The relatively short peak latency as well as the frontocentral maximum of the peak amplitude support this assumption.
The finding of N2 and P3/P3a, which presumably also reflects prediction error-related processes, in both conditions contrasts with the finding that the IR is confined to the 83/17 condition and absent in the 50/50 condition. It shows that the ability to use a visual stimulus as a reliable predictor of an upcoming auditory event at higher hierarchical levels is potentially independent of the strength of the association of visual with auditory information. However, the generation of visual-to-auditory sensory predictions, which is reflected by the elicitation of a sensory prediction error signal at the N1 level in response to violations of the visual predictions (i.e., IR), seems to be facilitated by the frequent presentation of one congruent visual cue–sound combination.
For all the tested conditions, we were able to show that the presentation of congruent visual cue–sound combinations was beneficial for RT and accuracy. This is in line with previous investigations (e.g., Sarmiento et al., 2012), showing that incongruent visual cue–sound combinations induced interference that had to be compensated for at the cost of RT and accuracy. With regard to the different sound probability conditions, we observed that the participants responded faster in the 83/17 condition compared with the 50/50 condition. This is probably due to the frequent presentation of one specific visual cue–sound combination that facilitates the behavioral response.
Influence of Task Requirements
The visual–auditory matching task and pitch discrimination task were contrasted to investigate whether task relevance of the preceding visual information influences visually based predictions. Our findings show that there is no significant influence of the factor task on the elicitation of the IR. Hence, the task did not lead to differences in the generation of the visual-to-auditory sensory prediction of the upcoming auditory event.
Behaviorally, participants responded faster and more accurately in the pitch discrimination task than in the visual–auditory matching task. This can be explained by the differences in task complexity. When performing the visual–auditory matching task, it was important to keep the visual information in mind and compare the information obtained from the visual and auditory domains. In contrast, in the pitch discrimination task, only the auditory information had to be classified. At this point, it should be noted that, in both tasks, the participants were instructed to pay attention to the visual stimulus and use it to respond fast and accurately. In other words, the two situations were similar until the onset of the auditory stimulus. This could explain why there are differences in participants' behavioral responses but not at the sensory processing level.
Based on the existing literature, we assumed that preceding visual information would be used to form auditory sensory predictions and that a violation of the prediction by the auditory stimulus would already be efficiently detected during early sensory processing (at the level of the N1 auditory evoked potential). We showed that, besides other factors such as contextual information, frequent presentation of one congruent visual cue–sound combination is a sufficient condition for the formation of a sensory prediction of the upcoming auditory event. In addition, our findings revealed that this effect on early sensory processing even occurs when the visual information is not relevant for solving the auditory pitch discrimination task.
M. V. S. was funded by the International Max Planck Research School for Neuroscience of Communication.
Reprint requests should be sent to Maria V. Stuckenberg, Institute of Psychology, University of Leipzig, Germany, or via e-mail: email@example.com, firstname.lastname@example.org.
This paper is part of a Special Focus deriving from a symposium at the 2017 International Multisensory Research Forum (IMRF).