It is still a matter of debate whether visual aids improve learning of music. In a multisession study, we investigated the neural signatures of novel music sequence learning with or without visual aids (auditory-only: AO, audiovisual: AV). During three training sessions on three separate days, participants (nonmusicians) reproduced (note by note on a keyboard) melodic sequences generated by an artificial musical grammar. The AV group (n = 20) had each note color-coded on screen, whereas the AO group (n = 20) had no color indication. We evaluated learning of the statistical regularities of the novel music grammar before and after training by presenting melodies ending on correct or incorrect notes and by asking participants to judge the correctness and surprisal of the final note, while EEG was recorded. We found that participants successfully learned the new grammar. Although the AV group, as compared to the AO group, reproduced longer sequences during training, there was no significant difference in learning between groups. At the neural level, after training, the AO group showed a larger N100 response to low-probability compared with high-probability notes, suggesting an increased neural sensitivity to statistical properties of the grammar; this effect was not observed in the AV group. Our findings indicate that visual aids might improve sequence reproduction while not necessarily promoting better learning, which points to a potential dissociation between sequence reproduction and learning. We suggest that the difficulty induced by auditory-only input during music training might enhance cognitive engagement, thereby improving neural sensitivity to the underlying statistical properties of the learned material.
Music forms a vital part of the school curriculum in much of the Western world. During the first years of music education, teaching music usually takes the form of a game (Bowles, 1998; Aronoff, 1983): Different colors represent different pitches, imaginary stairs symbolize musical scales, and claps represent rhythms. A widely used method is to put colorful stickers on the keys of a piano keyboard or on the violin fingerboard (Abler, 2002) to indicate finger positions. Guitar Hero (www.guitarhero.com/uk/en/), a music computer game, makes people feel empowered by being able to reproduce popular songs on a toy guitar using visual cues; however, do players really learn music? To the best of our knowledge, no research has tested whether such methods improve learning. Over a multisession musical training experiment, conducted on separate days, we examined whether visual aids would lead to better learning of an unfamiliar music grammar and investigated the respective electrophysiological correlates of statistical music learning.
Musical learning depends not only on developing abilities for singing or playing a musical instrument but also on learning a musical grammar, that is, the statistical properties of a particular musical style. Musical experts have typically internalized the rules or probabilistic regularities that govern a specific music style and can form expectations for subsequent events while listening (Jonaitis & Saffran, 2009; Meyer, 1956). The fulfillment or violation of these expectations plays a crucial role in the emotional experience of music (Juslin & Västfjäll, 2008; Huron, 2006). Importantly, the formation of expectations can be used as an index of learning: The greater the knowledge of a learned musical style, the larger the degree of unexpectedness when a rule is violated (Steinbeis, Koelsch, & Sloboda, 2006).
Humans can acquire knowledge of the statistical regularities of auditory structures even after short exposure (Rohrmeier & Cross, 2014; Loui, 2012; Rohrmeier & Rebuschat, 2012; Misyak, Christiansen, & Tomblin, 2010; Pothos, 2007; Lieberman, Chang, Chiao, Bookheimer, & Knowlton, 2004; Saffran, Johnson, Aslin, & Newport, 1999; Saffran, Aslin, & Newport, 1996; Saffran, Newport, & Aslin, 1996; Reber, 1993). Statistical learning and recognition of grammatical patterns through passive exposure has been demonstrated in tone (Saffran, Reeck, Niebuhr, & Wilson, 2005; Saffran et al., 1999) and timbre (Tillmann & McAdams, 2004) sequences as well as in unfamiliar musical systems (e.g., use of the Bohlen–Pierce scale: Loui, Wessel, & Kam, 2010; Loui & Wessel, 2008). Participants can also perform accurate predictions on other types of stimuli based on their temporal statistics, such as on sequences of visual stimuli (e.g., abstract visual shapes: Fiser & Aslin, 2002; Gabor patches: Luft, Baker, Goldstone, Zhang, & Kourtzi, 2016; Luft, Meeson, Welchman, & Kourtzi, 2015; tones in oddball task: Debener, Makeig, Delorme, & Engel, 2005; semantics: Proverbio, Leoni, & Zani, 2004). Studies with infants demonstrate statistical learning of both auditory and visual information, providing evidence for an underlying domain-general mechanism (Kirkham, Slemmer, & Johnson, 2002; Saffran, Aslin, et al., 1996). However, Conway and Christiansen (2006) found that adult participants can simultaneously learn the statistical regularities of two different artificial grammars, one presented with auditory stimuli and one presented with visual stimuli, therefore suggesting modality-specific statistical learning.
Electrophysiological recording, especially the ERP response, has routinely been used to study the neural correlates of learning because of its excellent temporal resolution (Rugg & Coles, 1995) and has been associated with sensory and perceptual processing modulated by expectation and familiarity (Tremblay & Kraus, 2002; Näätänen, Gaillard, & Mäntysalo, 1978). Violation, as compared to fulfillment, of pitch expectations is robustly associated with a larger N100 component, a frontocentral negativity around 100 msec after the onset of a melodically unexpected note (Koelsch & Jentschke, 2010; Pearce, Ruiz, Kapasi, Wiggins, & Bhattacharya, 2010). The N100 has also been used as an index of statistical learning of auditory sequences, with studies showing increased N100 in response to tones with lower transitional probability compared to tones with higher probability (Moldwin, Schwartz, & Sussman, 2017; Paraskevopoulos, Kuchenbuch, Herholz, & Pantev, 2012b; Abla, Katahira, & Okanoya, 2008). Koelsch, Busch, Jentschke, and Rohrmeier (2016) found that the amplitude of this response was negatively related to the probability of an auditory event. It has been suggested that this early component reflects the magnitude of prediction errors in statistical learning contexts (Tsogli, Jentschke, Daikoku, & Koelsch, 2019).
The P200, a positive ERP component peaking around 200 msec after the onset of an event, has been linked to stimulus familiarity. For example, familiar speech variants of syllables (Tremblay & Kraus, 2002) and familiar words (Stuellein, Radach, Jacobs, & Hofmann, 2016; Liu, Perfetti, & Wang, 2006) have been associated with a larger P200 than unfamiliar syllables and words. Furthermore, musicians demonstrate larger P200 in auditory tasks compared with nonmusicians, which is usually attributed to their long-term musical training, inducing greater familiarity with the stimuli (Atienza, Cantero, & Dominguez-Marin, 2002; Tremblay, Kraus, McGee, Ponton, & Otis, 2001).
The aforementioned studies on statistical learning mostly focused on learning by training a single perceptual modality (e.g., auditory/visual). Previous studies have demonstrated the beneficial effects of multimodality on learning (e.g., Tierney, Bergeson-Dana, & Pisoni, 2008; Brünken, Plass, & Leutner, 2004; Cleary, Pisoni, & Geers, 2001). There is behavioral and neurophysiological evidence demonstrating that adults are faster at detecting a target when correlated information is presented to multiple sensory modalities than when information is presented unimodally (e.g., Sinnett, Soto-Faraco, & Spence, 2008; Colonius & Diederich, 2006; Molholm, Ritter, Javitt, & Foxe, 2004). The “Simon” task has been widely used to study statistical learning: It uses a game device with four colored buttons corresponding to different tones. Every time a tone is played, the respective button lights up. Tierney and colleagues (2008) asked participants to reproduce random sequences of colored lights by pressing the keys on the Simon device. Results showed that longer sequences were reproduced in the audiovisual condition (color names spoken and buttons lighting up simultaneously) compared to the auditory-only or visual-only condition. Beneficial effects of audiovisual presentation on learning have also been found in other tasks, such as presentation of biological textbook material with versus without verbal instruction, in addition to pictorial presentation (Brünken et al., 2004).
Our study is the first (to our knowledge) to investigate the effect of visual aids on statistical learning of music with interleaved passive exposure to and active reproduction of music. Alternating different methods is efficient for learning and generalization of knowledge (Richland, Bjork, Finley, & Linn, 2005) as well as more ecologically valid compared to mere passive exposure to the learned material. In contrast to previous studies that assessed learning just after exposure, we performed a 1-day follow-up test to ensure that we measured learning rather than immediate effects of exposure. We introduced a novel experimental paradigm combining behavioral, electrophysiological, and computational methods. Specifically, nonmusicians were trained on an unfamiliar artificial music grammar (AMG; taken from Rohrmeier, Rebuschat, & Cross, 2011) through passive exposure and active reproduction of melodic sequences on a sound keyboard with or without visual aids, over 3 separate days. An AMG was ideal for our investigation because it represented a completely novel musical style for all participants. Participants' knowledge of the novel grammar was assessed before and after training by taking judgments of the perceived correctness and surprisal of high-probability (HP), low-probability (LP), incorrect (INC), and random notes. The ERPs in response to these notes were also analyzed.
We used a computational model of auditory expectation (Information Dynamics of Music [IDyOM]: Pearce, 2018) to quantify the conditional probability of each note in every sequence, reflecting the degree of expectedness of a particular note given the preceding musical context. IDyOM uses variable-order Markov models (Begleiter, El-Yaniv, & Yona, 2004) to generate the conditional probability of a note given its preceding context based on the frequency with which each note has followed the context in a given corpus of music. IDyOM embodies the hypothesis that listeners base their expectations on learning the statistical regularities in the musical environment, with listeners perceiving HP notes as expected and LP notes as unexpected. Previous behavioral, physiological, and EEG studies have demonstrated that IDyOM successfully predicts listeners' expectations (Hansen & Pearce, 2014; Egermann, Pearce, Wiggins, & McAdams, 2013; Omigie, Pearce, Williamson, & Stewart, 2013; Pearce, Müllensiefen, & Wiggins, 2010; Pearce, Ruiz, et al., 2010). The probability of each event according to the model can be log-transformed to yield its information content (IC), which reflects how unpredictable the model finds a note in a particular context. We used IDyOM to analyze each melodic sequence generated by the AMG and manipulated these sequences to construct melodies terminating on HP, LP, INC, and random notes. Participants' learning was evaluated in terms of their accuracy in recognizing notes belonging to the grammar.
Previous studies have demonstrated a distinction between performance during training and learning (e.g., Kantak & Winstein, 2012; Schmidt & Bjork, 1992; Lee & Genovese, 1988). In a review, Soderstrom and Bjork (2015) argued in favor of differentiating learning from performance during training, as the former refers to a long-term change in behavior or knowledge that supports retention and transfer and the latter refers to temporary fluctuations in behavior or knowledge that are observed close to the acquisition period. In our study, “performance during training” (according to Soderstrom & Bjork, 2015) corresponds to sequence reproduction in the training sessions, and “learning” refers to acquisition of the statistical regularities of the AMG. On the basis of the aforementioned studies, our hypothesis was twofold. First, we hypothesized that multimodality would aid sequence reproduction because the visual cues would signal to the participants which exact keys to press. Second, we expected that the presence of visual aids would have a negative impact on learning, as the auditory input is modality-appropriate for learning music and visual information in this context might act as a distractor that hinders encoding. At the neural level, we predicted that the N100 component would be larger in response to LP and INC notes (compared to HP notes) after training. Because a larger N100 in response to LP notes indicates better learning of the statistical regularities of the grammar, we hypothesized that this effect would be larger in the auditory-only (AO) group. Furthermore, we expected that the P200 component, as an index of familiarity (Tremblay & Kraus, 2002), would be enhanced after training in both groups. Finally, we explored how the early right anterior negativity (ERAN), an ERP component previously associated with syntactical violations in music (Pearce & Rohrmeier, 2018; Koelsch, Gunter, Friederici, & Schröger, 2000), would be modulated in our statistical learning paradigm.
Forty neurologically healthy human adults (24 women) aged between 20 and 32 years (mean = 22.42 years, SD = 3.04 years) participated in the experiment. Participants were randomly assigned to one of two groups that differed in the training method: audiovisual (AV) group (n = 20, 12 women, age range: 20–32 years, mean = 22.25 years, SD = 3.37 years) and AO group (n = 20, 12 women, age range: 20–30 years, mean = 22.60 years, SD = 3.52 years). All participants self-reported that they were nonmusicians, and this was validated by the Goldsmiths Musical Sophistication Index (Gold-MSI) questionnaire (Müllensiefen, Gingras, Musil, & Stewart, 2014): Mean ± SD Gold-MSI musical training scores were 12.08 ± 3.63 for the AV group and 12.10 ± 5.51 for the AO group from a possible range of 7–49 points (higher values indicating more musical training). The scores were not significantly different between groups, t(38) = 0.017, p = .987. Two participants were excluded because they did not sufficiently engage with the task (gave the same response throughout the pretest and posttest), leaving 19 participants per group. All participants reported normal hearing and normal or corrected-to-normal vision. Participants gave written informed consent and received financial compensation at a rate of £7 per hour for their participation. The study was approved by the ethics board at Queen Mary University of London.
Gold-MSI Musical Training Questionnaire
The musical training factor (Dimension 3) of the Gold-MSI comprises a self-report measure including seven statements regarding formal musical training experience and musical skill. Each statement (e.g., “I have never been complimented for my talents as a musical performer”) requires a response from 1 (completely disagree) to 7 (completely agree). This measure was used to validate that all participants were nonmusicians.
Our melodic stimuli were sequences generated by an AMG taken from Rohrmeier and colleagues (2011; Figure 1A). This grammar consists of eight different tone pairs, and the tones belong to the Western diatonic major scale (C4, D4, E4, F4, G4, A4, B4). The AMG generated 18 different melodic sequences, ranging from 8 to 22 notes long (length: mean = 14.56, SD = 3.87). Melodic sequences with circular paths were excluded, as they were too long to be used in our paradigm. Twelve of these sequences were used for the training and test sessions (“old-grammatical”), and the remaining six were only presented in the last session to test generalization to unheard melodies of the grammar (“new-grammatical”). Please refer to the supplementary materials for the 18 melodic sequences in musical notation (Figure S1).
IDyOM Analyses of Melodic Sequences
An information theoretic model of music expectation, IDyOM (Pearce, 2005, 2018), was used to analyze the statistical properties of the melodic sequences generated by the AMG. We conducted leave-one-out cross-validations, while IDyOM generated predictions for each sequence after pretraining on the other 17 sequences. IDyOM uses viewpoints to generate predictions. We evaluated different sets of viewpoints and selected the viewpoint chromatic pitch and chromatic interval (cross-entropy = 0.986), which outperformed the single-viewpoint chromatic pitch (cross-entropy = 1.007), and the viewpoint set chromatic pitch, chromatic interval, and contour (cross-entropy = 1.043).
Furthermore, IDyOM was used to make predictions combining a long-term model, which was pretrained on the 17 other melodies and incrementally on the current melody, as well as a short-term model, which was only trained incrementally on the current melody. This combination of long- and short-term models has been found to reflect listeners' expectations well (Pearce, 2005). IDyOM estimates the probability for each note in each of the 18 AMG melodies. We calculated IC by taking the negative logarithm (Base 2) of this probability estimate. Low IC corresponds to HP (i.e., predictable) notes, whereas high IC corresponds to LP (i.e., unpredictable) notes based on a given grammar.
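To make the relation between note probability, IC, and cross-entropy concrete, the following is a minimal sketch and not the IDyOM implementation. It assumes a fixed first-order (bigram) model over pitch only, with add-one smoothing, and it omits the short-term model and the viewpoint system; the function names and the toy corpus are hypothetical.

```python
import math
from collections import defaultdict

NOTES = ["C4", "D4", "E4", "F4", "G4", "A4", "B4"]  # alphabet named in the text

def train_bigram(melodies):
    """Count how often each note follows each preceding note."""
    counts = defaultdict(lambda: defaultdict(int))
    for melody in melodies:
        for prev, nxt in zip(melody, melody[1:]):
            counts[prev][nxt] += 1
    return counts

def probability(counts, prev, nxt, alphabet=NOTES):
    """Estimate P(nxt | prev); add-one smoothing is an assumption of this sketch."""
    total = sum(counts[prev].values())
    return (counts[prev][nxt] + 1) / (total + len(alphabet))

def information_content(p):
    """IC in bits: the negative base-2 logarithm of the probability."""
    return -math.log2(p)

def leave_one_out_ic(melodies):
    """For each melody, train on all other melodies and score notes 2..n."""
    result = []
    for i, melody in enumerate(melodies):
        counts = train_bigram(melodies[:i] + melodies[i + 1:])
        result.append([information_content(probability(counts, a, b))
                       for a, b in zip(melody, melody[1:])])
    return result

def cross_entropy(ic_per_melody):
    """Mean IC over all predicted notes."""
    values = [ic for melody in ic_per_melody for ic in melody]
    return sum(values) / len(values)

# Toy corpus (hypothetical, not the AMG sequences).
toy = [["C4", "E4", "G4", "C4"],
       ["C4", "E4", "G4", "E4"],
       ["C4", "E4", "D4", "C4"]]
ic = leave_one_out_ic(toy)
```

In this toy corpus, the transition C4 to E4 is frequent in the training melodies and therefore receives a lower IC than the unseen transition G4 to C4, mirroring the HP versus LP distinction.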
Melodic Stimuli for the Judgment Sessions
The melodies were interrupted after a target note, and participants were prompted to judge if the last note was correct or incorrect and surprising or not surprising. For the pretest and posttest, we used 280 melodies, terminating with “target” notes of different levels of note probability: 70 HP, 70 LP, 70 INC, and 70 random (Figure 1B). For the generalization session, we used 105 melodies: 35 HP, 35 LP, and 35 INC. The melodies for the test sessions were generated from the 12 old-grammatical sequences, whereas for the generalization session, they were generated from the six new-grammatical sequences.
To generate the melodies ending on HP and LP notes, we first identified the notes with the lowest 30% IC (extreme HP) and those with the highest 30% (extreme LP) out of all the notes of the 18 AMG sequences. The probability values of the identified HP notes ranged from 0.83 to 0.94 (M = 0.90, SD = 0.03), whereas the probability of the LP notes ranged from 0.01 to 0.37 (M = 0.21, SD = 0.10). There were 79 notes with extreme probabilities: 55 belonged to the old-grammatical sequences, and 24 belonged to the new-grammatical sequences. Of the 55 old-grammatical notes, 36 were HP and 19 were LP. To reach 70 trials per condition, 34 (randomly picked) of the 36 HP melodies were presented twice (36 + 34 = 70), whereas all 19 LP melodies were presented three times (giving 57 melodies), and 13 (randomly selected from the middle 40% of the distribution) were added (total = 70). The same approach was applied to the new-grammatical sequences. The 16 HP melodies were each presented twice (32), and three more (randomly picked) were added (35 in total). The eight LP melodies were each presented four times (32), and three more (randomly picked) were added (35).
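The trial counts in this paragraph can be verified with a few lines of arithmetic. The sketch below only restates the numbers given in the text; the variable names are hypothetical.

```python
# Old-grammatical sequences (pretest and posttest).
hp_old = 36 + 34        # 36 HP melodies, 34 of them presented a second time
lp_old = 19 * 3 + 13    # 19 LP melodies presented three times, plus 13 added

# New-grammatical sequences (generalization session).
hp_new = 16 * 2 + 3     # 16 HP melodies presented twice, plus 3 added
lp_new = 8 * 4 + 3      # 8 LP melodies presented four times, plus 3 added

extreme_notes = (36 + 19) + (16 + 8)   # old-grammatical + new-grammatical

print(hp_old, lp_old, hp_new, lp_new, extreme_notes)  # 70 70 35 35 79
```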
The INC melodies were generated by replacing the last note of the HP and LP melodies with a note that never appeared in that context in the AMG. Three different sets of INC melodies were created, one for the pretest (70), one for the posttest (70), and one for the generalization session (35). We also generated two different sets of 70 random melodies, presented in the pretest and posttest. The random melodies were matched in length to the rest of the melodies by producing five random melodies for each of the possible lengths (7–20 notes). The melodies were played through speakers located to the left and right of participants. Notes had a duration of 330 msec, with the next note beginning immediately after the end of the previous note, and were played with a piano timbre. All notes had a 100-msec fade-out time. Psychtoolbox (Brainard, 1997) was used for stimulus presentation. Examples of the stimuli are included as audio files in the supplementary materials.
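As an illustration of how a set of 70 random melodies follows from five melodies at each of the 14 possible lengths, a minimal sketch is given below. It assumes that notes are drawn uniformly and independently from the seven scale tones, which the text does not specify; the function name and the seed are hypothetical.

```python
import random

NOTES = ["C4", "D4", "E4", "F4", "G4", "A4", "B4"]

def make_random_melodies(per_length=5, min_len=7, max_len=20, seed=0):
    """Return per_length random melodies for every length in [min_len, max_len]."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    return [[rng.choice(NOTES) for _ in range(length)]  # uniform draw: an assumption
            for length in range(min_len, max_len + 1)
            for _ in range(per_length)]

random_set = make_random_melodies()
print(len(random_set))  # 70
```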
Participants came to the laboratory on 4 separate days with a maximum 2-day gap between any of the days (Figure 2A). Participants received training on the melodies generated by the AMG, through active reproduction on a keyboard with (AV group) or without (AO group) visual cues across three sessions (Days 1–3). Learning of the AMG was assessed before and after training (Days 1 and 4, respectively). Participants were presented with melodies and were prompted to judge if the final note was correct or incorrect and surprising or not surprising, while their EEG was recorded. In the last generalization session, participants were asked to judge if the final note of previously unheard sequences was surprising or not surprising. As the primary aim of our study was to determine the efficacy of visual aids in music (but not visual) learning, participants from both groups were tested only in the auditory domain (auditory without visual stimuli) in the pretest, posttest, and generalization sessions. On Days 2 and 3, after a brief (5-min) passive exposure to all the old-grammatical sequences three times (36 in total), participants were then asked to complete a short surprisal (yes/no) judgment task of melodies ending with HP or LP notes (intermediate surprisal sessions). After each training session, participants were asked to compose and perform a musical composition based on the learned materials, but this part is outside the scope of this paper.
Participants received training on a computer keyboard that was adjusted to serve as a sound keyboard. A red, an orange, a yellow, a green, a blue, a pink, and a brown sticker were put on keys A, D, G, J, L, ', and ENTER, respectively (see Figure 2C). Before the first session only, participants had some familiarization time with the keyboard. First, they listened to the whole scale ascending three times, while the visual cue corresponding to each note was simultaneously presented on screen. The cues were spatially positioned on the screen in the same configuration as the stickers on the keyboard, that is, lower notes on the left and higher notes on the right. Participants were then allowed 3 min to familiarize themselves with the keyboard. To confirm they had basic understanding of the tones, they took a short discrimination test: They listened to pairs of notes for which they were presented with only the first visual cue. They were required to identify the second note and reproduce the note pair on the keyboard. After three attempts, the solution was presented on screen. There were 42 pairs in total, covering all possible note combinations, for example, C–D, C–E, C–F, and so forth. All participants passed an arbitrary threshold of 70% correct and proceeded with the training.
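The figure of 42 pairs is consistent with all ordered pairs of two different notes from the seven-tone scale (7 × 6 = 42). The sketch below enumerates them under that assumption; the pass criterion is written as "at least 70% correct", which is also an assumption about how the threshold was applied.

```python
from itertools import permutations

NOTES = ["C", "D", "E", "F", "G", "A", "B"]

# All ordered pairs of distinct notes, e.g. ("C", "D") and ("D", "C").
pairs = list(permutations(NOTES, 2))

def passed(n_correct, n_pairs=len(pairs), threshold=0.70):
    """True if the proportion of correctly reproduced pairs reaches the threshold."""
    return n_correct / n_pairs >= threshold

print(len(pairs))  # 42
```

Under this reading, a participant needed at least 30 correct pairs out of 42 to proceed.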
Participants attended three 25-min training sessions on 3 separate days. The training proceeded as follows. Participants began by hearing a melodic sequence. Then, the first two notes of the melody were presented. Only after participants reproduced them correctly was the segment extended by one note, and so on. If they made a mistake, the melodic segment was repeated for further (maximum = 7) attempts. The difference between the AV and AO groups was that the former was presented on screen with the visual cues of all the notes, whereas the latter was given only the first cue as a reference (to indicate the first note of the sequence) and had to rely on the auditory information alone to reproduce the rest of the sequence.
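The incremental reproduction procedure can be sketched as a simple loop. This is an illustration under stated assumptions, not the experiment code: `reproduce` is a hypothetical callback standing in for the participant's key presses, the limit of seven is read as the total number of attempts per segment, and the trial is assumed to end once that limit is reached.

```python
def train_on_melody(melody, reproduce, max_attempts=7):
    """Return the length of the longest segment that was reproduced correctly.

    melody: list of note names.
    reproduce: hypothetical callback taking a segment, returning the notes played.
    """
    longest = 0
    for length in range(2, len(melody) + 1):   # start with the first two notes
        segment = melody[:length]
        for _ in range(max_attempts):
            if reproduce(segment) == segment:
                longest = length               # success: extend by one note
                break
        else:
            return longest                     # assumption: stop after repeated failure
    return longest
```

For example, a simulated participant who can hold at most four notes would obtain a score of 4 on any melody longer than four notes, which corresponds to the sequence-length measure analyzed for the training sessions.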
In the generalization session, participants were presented with unheard sequences in randomized order and were asked to judge if the last note was surprising or not surprising. There were 105 trials in total, and the session lasted around 20 min.
Passive Exposure Sessions
Following the statistical learning literature (e.g., Rohrmeier et al., 2011; Loui et al., 2010), participants attended two (Days 2 and 3) passive exposure sessions to three repetitions of the grammatical sequences in randomized order. They were instructed to listen attentively to the melodies. There were 36 sequences, and the session lasted approximately 5 min.
Intermediate Surprisal Sessions
After each exposure session (Days 2 and 3), participants were presented with sequences terminating on HP and LP notes and were asked to judge if the last note was surprising or not surprising. There were 36 trials, lasting around 7 min in total.
To assess learning, test sessions were conducted before (pretest: Day 1) and after (posttest: Day 4) training. Participants were seated in front of a computer while their EEG was recorded. In the pretest only, they were informed that they would listen to melodies of an unfamiliar music grammar governed by a set of rules. They were instructed to attend as the melodies would stop at random points and were asked to make two judgments on the last note: (1) correct or incorrect and (2) surprising or not surprising (Figure 2B). The distinction between correct and INC notes is related to the grammar rules, whereas the probability refers to the IC. Specifically, correctness refers to whether a note is allowed or disallowed by the grammar: A correct note is grammatical, whereas an INC note is ungrammatical. Within the correct notes, some have a low probability whereas others have a high probability. The surprisal ratings add to the correctness judgment, as some notes can be surprising but also correct. Therefore, the surprisal ratings were used as a measure of perceived expectedness of the stimuli, which would reflect successful internalized acquisition of the statistical rules.
Furthermore, in the two intermediate sessions, we used only HP and LP stimuli to test participants, as we did not want to expose participants to INC stimuli during the “training days” (see the Procedure section under Intermediate Surprisal Sessions). We thus needed to use surprisal judgments as a measure of learning during training as correctness judgments would not be appropriate on those sessions (both HP and LP stimuli are correct). Similar studies on implicit sequence learning of melodic sequences have used two-alternative forced-choice recognition tasks that use other ratings apart from correctness to assess learning of an artificial musical system (e.g., Loui et al., 2010; Loui & Wessel, 2008). For example, Loui and Wessel (2008) used familiarity ratings, that is, presented participants with two melodies and asked them to indicate which one is more familiar.
Three practice trials familiarized participants with the task. Across participants the presentation order of the trials was randomized. There were 280 trials in total, and each session lasted around 40 min.
Working Memory Task
Participants also completed a working memory (WM) span task (adjusted from the Wechsler Adult Intelligence Scale; Wechsler, 1955). In this task, participants were presented with sequences of random numbers from 1 to 9 and had to replicate them on a number pad. Starting from Length 3, the number of digits was increased by one every time a correct response was made; otherwise, the number of digits of the next sequence was reduced by one. This lasted 10 min, and the WM span was calculated as the mean length of the correctly reproduced sequences. Because of technical problems, data from only 31 participants remained for the WM task.
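The adaptive rule and the span measure can be sketched as follows. This is a minimal illustration, assuming a lower bound of one digit on the sequence length (not stated in the text); the function names are hypothetical.

```python
def run_staircase(outcomes, start_length=3, min_length=1):
    """Apply the one-up/one-down rule to a list of trial outcomes.

    outcomes: booleans, True if the digit sequence was reproduced correctly.
    Returns the lengths of the correctly reproduced sequences.
    """
    length = start_length
    correct_lengths = []
    for ok in outcomes:
        if ok:
            correct_lengths.append(length)
            length += 1                            # correct: one more digit
        else:
            length = max(min_length, length - 1)   # error: one digit fewer (floor assumed)
    return correct_lengths

def wm_span(correct_lengths):
    """WM span: mean length of the correctly reproduced sequences."""
    if not correct_lengths:
        return 0.0
    return sum(correct_lengths) / len(correct_lengths)

lengths = run_staircase([True, True, False, True, False, True])
print(lengths, wm_span(lengths))  # [3, 4, 4, 4] 3.75
```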
EEG Recording and Preprocessing
EEG was recorded from 64 Ag–AgCl electrodes attached to the EGI geodesic sensor net system (HydroCel GSN 64 1.0; EGI System 200, Electrical Geodesics Inc.; www.egi.com/) and amplified by an EGI Amp 300. The sampling frequency was 500 Hz. The MATLAB Toolbox EEGLAB (Delorme & Makeig, 2004) was used for data preprocessing; and FieldTrip (Oostenveld, Fries, Maris, & Schoffelen, 2011), for data analysis. Data were recorded with an online reference at the right mastoid and rereferenced to the average of the left and right mastoids. Continuous data were high-pass filtered at 0.5 Hz and then epoched from −0.2 to 0.6 sec relative to the onset of the last note. Data from electrodes with consistently poor signal quality, as observed by visual inspection and by studying the topographical maps of their power spectra, were removed and replaced by interpolating neighboring electrodes. Artifact rejection was conducted in a semiautomatic fashion: First, artifactual epochs containing movement, muscle artifacts, and saccades were removed after visual inspection, and second, independent component analysis was used to correct for eye-blink-related artifacts. Subsequently, data were detrended; that is, from each data point of the averaged ERP of each participant, we subtracted the average ERP value. The epoched data were low-pass filtered at 30 Hz and baseline corrected from −0.2 to 0 sec. Five participants were removed because of poor EEG data quality (more than 30% of the trials rejected in at least one of the test sessions), leaving 15 participants in the AO group and 18 in the AV group.
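Two of the steps above, baseline correction and the participant exclusion criterion, are simple enough to sketch in a few lines. The code below is an illustration in plain Python rather than the EEGLAB/FieldTrip pipeline that was actually used; the function names are hypothetical, and treating the baseline window as including −0.2 sec and excluding 0 sec is an assumption.

```python
def baseline_correct(epoch, times, start=-0.2, end=0.0):
    """Subtract the mean of the prestimulus window from every sample of an epoch.

    epoch: amplitudes for one channel; times: matching time stamps in seconds.
    """
    baseline = [v for v, t in zip(epoch, times) if start <= t < end]
    mean = sum(baseline) / len(baseline)
    return [v - mean for v in epoch]

def exclude_participant(rejected_per_session, trials_per_session=280,
                        max_fraction=0.30):
    """True if more than 30% of trials were rejected in at least one session."""
    return any(r / trials_per_session > max_fraction
               for r in rejected_per_session)

corrected = baseline_correct([1.0, 3.0, 5.0, 7.0], [-0.2, -0.1, 0.0, 0.1])
print(corrected)  # [-1.0, 1.0, 3.0, 5.0]
```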
Participants' learning was assessed throughout, including pretest and posttest, intermediate surprisal sessions, and the generalization session. For the pretest and posttest sessions, a response was considered correct if an HP or LP note was judged as correct and if an INC note was judged as incorrect. We performed a 2 (Session: pre, post) × 2 (Group: AV, AO) mixed factorial ANOVA on accuracy.
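The scoring rule for the correctness judgments can be written out explicitly. The sketch below is an illustration with hypothetical names; it leaves random notes out of the accuracy score, on the reading that they are neither correct nor incorrect under the grammar.

```python
def is_correct_response(note_type, judged_correct):
    """Score one trial: HP/LP notes should be judged correct, INC notes incorrect."""
    if note_type in ("HP", "LP"):
        return judged_correct
    if note_type == "INC":
        return not judged_correct
    raise ValueError(f"no scoring rule for note type {note_type!r}")

def accuracy(trials):
    """Proportion of correctly scored trials.

    trials: list of (note_type, judged_correct) tuples; random notes are skipped.
    """
    scored = [is_correct_response(kind, judged)
              for kind, judged in trials if kind != "random"]
    return sum(scored) / len(scored)

example = [("HP", True), ("LP", False), ("INC", False), ("INC", True), ("random", True)]
print(accuracy(example))  # 0.5
```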
To assess sensitivity to the statistical probabilities of the AMG, we calculated the percentage of notes judged as surprising within each note probability category in the pretest and posttest as well as in the intermediate sessions. For the pretest and posttest sessions, we conducted a 3 (Note probability: HP, LP, INC) × 2 (Session: pre, post) × 2 (Group: AV, AO) mixed ANOVA with percentage judged as surprising as the dependent variable. For the intermediate surprisal sessions, in which only HP and LP stimuli were presented, we conducted a 2 (Note probability: HP, LP) × 2 (Intermediate session: 1, 2) × 2 (Group: AV, AO) mixed ANOVA with the same dependent variable.
We evaluated sequence reproduction performance at the training sessions by calculating the mean length of correctly reproduced sequences (in number of notes). Because of technical problems with saving the results, four participants were excluded from this analysis, leaving 15 participants in the AO group and 19 in the AV group. We conducted a 3 (Training session: 1, 2, 3) × 2 (Group: AV, AO) mixed ANOVA on sequence length.
Finally, we investigated whether sequence reproduction performance predicts learning by performing a multiple linear regression with average length of reproduced sequences in the third training session and group as predictors as well as accuracy in the posttest as the dependent variable. To test whether sequence reproduction performance or learning depended on WM skills, as assessed from the digit span task, we conducted two linear regressions: Group and WM were the predictors, and (i) sequence reproduction performance and (ii) learning were the dependent variables.
The following ROIs were used for the analysis, based on previous literature (Halpern et al., 2017; Carrus, Pearce, & Bhattacharya, 2013) and visual inspection of the ERPs: N100 (80–145 msec) and P200 (150–225 msec) in frontocentral regions (E8, E6, E4, E9, E3, E7, E54, and E47 in the EGI configuration, corresponding to AFz, Fz, FCz, F1, F2, FC1, FC2, and Cz in the standard 10–20 system). For each ROI, the mean ERP amplitude, as well as the peak latencies of the N100 and P200 components, was calculated. Two 3 × 2 × 2 mixed, repeated-measures ANOVAs were performed (one for N100 and one for P200) with the following factors: Note probability (HP, LP, INC), Session (pretest, posttest), and Group (AV, AO).
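The ROI measures described above (mean amplitude in a time window and peak latency) can be sketched as follows. This is an illustration in plain Python, not the FieldTrip analysis; the function names are hypothetical, window boundaries are treated as inclusive by assumption, and the synthetic ERP exists only to exercise the functions.

```python
def roi_average(channel_erps):
    """Average ERPs sample by sample across the electrodes of an ROI."""
    n = len(channel_erps)
    return [sum(samples) / n for samples in zip(*channel_erps)]

def window_indices(times_ms, start, end):
    """Indices of samples whose time stamp lies in [start, end] (inclusive)."""
    return [i for i, t in enumerate(times_ms) if start <= t <= end]

def mean_amplitude(erp, times_ms, start, end):
    """Mean ERP amplitude within the window."""
    idx = window_indices(times_ms, start, end)
    return sum(erp[i] for i in idx) / len(idx)

def peak_latency(erp, times_ms, start, end, polarity):
    """Latency (msec) of the most negative or most positive sample in the window."""
    idx = window_indices(times_ms, start, end)
    pick = min if polarity == "negative" else max
    return times_ms[pick(idx, key=lambda i: erp[i])]

# Synthetic ERP sampled every 2 msec (500 Hz), with a single dip at 100 msec.
times_ms = list(range(-200, 600, 2))
erp = [-1.0 if t == 100 else 0.0 for t in times_ms]
n100_latency = peak_latency(erp, times_ms, 80, 145, "negative")   # 100
n100_mean = mean_amplitude(erp, times_ms, 80, 145)
```

The same functions apply to the P200 by passing the 150–225 msec window and a positive polarity.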
The ERAN was also analyzed from 0.140 to 0.220 sec (based on Koelsch, Kilches, Steinbeis, & Schelinski, 2008) at FPz. The ERAN was identified as the difference in response to LP minus HP notes and to INC minus HP notes. Two 2 (Session: pretest vs. posttest) × 2 (Group: AO vs. AV) mixed ANOVAs with LP–HP and INC–HP as the dependent variables, respectively, were performed.
Nonparametric cluster permutation.
To explore potential differences of AO versus AV training on brain responses, we further conducted a nonparametric cluster permutation test (Maris & Oostenveld, 2007). This test first performs independent t tests at each data point and then identifies clusters of electrodes that exceed a defined threshold and have the same sign. Subsequently, the cluster-level statistic is calculated as the sum of the t values of the cluster. Finally, the maximum value of the cluster-level statistic is evaluated by calculating the probability that it would be observed under the assumption that the two compared conditions are not significantly different.
Specifically, we compared AV versus AO on their brain responses to LP minus HP notes (LP − HP) within the first 500 msec after note onset. The permutation distribution was extracted from the statistic values of independent samples t tests based on 500 random permutations. The probability threshold was set at p = .05. Subsequently, we performed independent samples t tests on the average values of the identified clusters.
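The logic of the test can be illustrated with a deliberately simplified sketch. It clusters over time points only (one difference wave per participant), whereas the analysis described above also clusters over electrodes. The fixed t threshold of 2.0, the function names, and the simulated data are assumptions for illustration. Only the 500 permutations and the use of independent samples t tests come from the text.

```python
import math
import random


def t_independent(a, b):
    """Pooled-variance independent samples t statistic for two lists."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((x - ma) ** 2 for x in a)
    ssb = sum((x - mb) ** 2 for x in b)
    pooled = (ssa + ssb) / (na + nb - 2)
    if pooled == 0:
        return 0.0
    return (ma - mb) / math.sqrt(pooled * (1 / na + 1 / nb))


def max_cluster_mass(group_a, group_b, t_threshold):
    """Largest absolute sum of t values over contiguous same-sign samples."""
    best, running, sign = 0.0, 0.0, 0
    for i in range(len(group_a[0])):
        t = t_independent([w[i] for w in group_a], [w[i] for w in group_b])
        sgn = (t > t_threshold) - (t < -t_threshold)  # +1, -1, or 0
        if sgn != 0 and sgn == sign:
            running += t  # extend the current cluster
        else:
            running, sign = (t, sgn) if sgn != 0 else (0.0, 0)
        best = max(best, abs(running))
    return best


def cluster_permutation_p(group_a, group_b, t_threshold=2.0, n_perm=500, seed=0):
    """Proportion of label permutations whose maximum cluster mass is at
    least as large as the observed one."""
    rng = random.Random(seed)
    observed = max_cluster_mass(group_a, group_b, t_threshold)
    pooled = group_a + group_b
    exceed = 0
    for _ in range(n_perm):
        shuffled = pooled[:]
        rng.shuffle(shuffled)  # reassign participants to groups at random
        perm_a, perm_b = shuffled[:len(group_a)], shuffled[len(group_a):]
        if max_cluster_mass(perm_a, perm_b, t_threshold) >= observed:
            exceed += 1
    return observed, exceed / n_perm


# Simulated LP - HP difference waves (hypothetical): 12 participants per group,
# 20 time points, with a negative effect at points 5-9 in the first group only.
data_rng = random.Random(1)


def simulate(n_subjects, effect, n_points=20):
    return [
        [data_rng.gauss(0, 1) + (effect if 5 <= i < 10 else 0.0)
         for i in range(n_points)]
        for _ in range(n_subjects)
    ]


ao_waves = simulate(12, -3.0)
av_waves = simulate(12, 0.0)
observed_mass, p_value = cluster_permutation_p(ao_waves, av_waves)
```

Because only the maximum cluster statistic is compared against the permutation distribution, a single test covers all time points, which is how this approach controls for multiple comparisons.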
Pretest and Posttest Sessions
A 2 (Session: pretest, posttest) × 2 (Group: AO, AV) mixed ANOVA showed that participants successfully learned the grammar (main effect of Session: F(1, 36) = 67.751, p < .001, η2 = .653; Figure 3A). There was no effect of Group or interaction between the variables (p > .7).
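The reported η2 values appear to be consistent with partial eta squared computed from the F ratio and its degrees of freedom. This is an inference from the reported numbers rather than something stated in the text, and the helper name below is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Main effect of Session on accuracy: F(1, 36) = 67.751
session_effect = partial_eta_squared(67.751, 1, 36)
```

Rounded to three decimals, this reproduces the reported value of .653.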
As the random notes were neither correct nor incorrect based on the grammar, we did not expect to see any difference in the percentage judged as correct in the pretest versus the posttest. This was confirmed by a 2 (Session) × 2 (Group) ANOVA that showed no significant main effects or interaction between the variables (p > .2).
Furthermore, participants judged LP and INC notes as more surprising than HP notes after training, showing that learning made them more sensitive to the statistical probabilities of the grammar (Figure 3B). A 3 × 2 × 2 mixed ANOVA revealed a significant main effect of Note probability, F(2, 72) = 45.566, p < .001, η2 = .559, as well as a Note Probability × Session interaction, F(2, 72) = 28.081, p < .001, η2 = .438. Post hoc contrasts revealed that participants judged HP and LP notes as less surprising in the posttest compared to the pretest session (HP: t(37) = −3.982, p < .001, Cohen's d = −0.650; LP: t(37) = −2.841, p = .007, Cohen's d = −0.472), whereas the opposite was found for the INC notes; that is, participants judged them as surprising more often in the posttest session, t(37) = 3.331, p = .002, Cohen's d = 0.559. In both sessions, INC notes were judged as surprising significantly more often than the HP notes (pretest: t(37) = 3.913, p < .001, Cohen's d = 0.635; posttest: t(37) = 7.108, p < .001, Cohen's d = 1.159) as well as than the LP notes (pretest: t(37) = 2.623, p = .013, Cohen's d = 0.428; posttest: t(37) = 6.741, p < .001, Cohen's d = 1.102). The LP notes were more often judged to be surprising than the HP notes (pretest: t(37) = 2.362, p = .024, Cohen's d = 0.387; posttest: t(37) = 4.926, p < .001, Cohen's d = 0.801). There was no effect of Group or any other effect or interaction between the variables (p > .3). There was no difference in the percentage of random notes judged as surprising in the pretest versus the posttest (p > .2).
Participants' surprisal judgments in the two intermediate sessions were also evaluated by a 2 (Note probability: HP, LP) × 2 (Intermediate session: 1, 2) × 2 (Group: AO, AV) mixed ANOVA (Figure 3C). Results revealed significant main effects of Session, F(1, 36) = 4.860, p = .034, η2 = .119, and Note probability, F(1, 36) = 5.135, p = .030, η2 = .125. There was also a significant Note Probability × Session interaction, F(1, 36) = 49.013, p < .001, η2 = .577. Post hoc analysis revealed that participants judged HP notes as significantly less surprising in the second compared to the first session (HP: t(37) = −6.741, p < .001, Cohen's d = −1.094), whereas the opposite was found for the LP notes, that is, participants judged them as more surprising in the second session, t(37) = 3.411, p = .002, Cohen's d = 0.554. Furthermore, in the first session, HP notes were judged as surprising significantly more often than the LP notes, t(37) = −2.080, p = .045, Cohen's d = −0.340, whereas the opposite effect was observed in the second session, t(37) = 6.103, p < .001, Cohen's d = 0.996. There was no effect of Group or interaction between the variables (p > .2). Therefore, HP notes were judged as less surprising in the second compared to the first session, whereas the opposite was found for the LP notes, and this effect did not differ between groups.
A 3 (Note probability: HP, LP, INC) × 2 (Group: AO, AV) mixed ANOVA demonstrated that participants successfully differentiated between the statistical probabilities of unheard sequences (main effect of Note probability: F(2, 72) = 10.166, p < .001, η2 = .220). Planned contrasts revealed that participants judged LP notes as more surprising than HP ones, t(37) = 2.362, p = .024, Cohen's d = 0.383, and INC notes as more surprising than both HP ones, t(37) = 3.913, p < .001, Cohen's d = 0.635, and LP ones, t(37) = 2.623, p = .013, Cohen's d = 0.426. There was no effect of Group or interaction between the variables (p > .3).
Performance during training improved incrementally, and the AV group did substantially better in all sessions, as confirmed by a 3 (Session: 1, 2, 3) × 2 (Group: AO, AV) mixed ANOVA (Figure 3D). In particular, results revealed a main effect of Session, F(2, 64) = 37.676, p < .001, η2 = .541. Furthermore, the AV group was able to reproduce longer sequences overall compared to the AO group (main effect of Group: F(1, 32) = 111.335, p < .001, η2 = .777). Results also revealed a significant Session × Group interaction, F(2, 64) = 3.369, p = .041, η2 = .095. Planned contrasts showed that the AV group were significantly better in all sessions compared with the AO group (Session 1: t(32) = 8.407, p < .001, Cohen's d = 0.691; Session 2: t(32) = 9.893, p < .001, Cohen's d = 0.621; Session 3: t(32) = 9.867, p < .001, Cohen's d = 0.640). Paired t tests showed that both groups performed better in the third session compared with the first (AO: t(14) = 5.140, p < .001, Cohen's d = 1.327; AV: t(18) = 5.762, p < .001, Cohen's d = 1.322) and second (AO: t(14) = 4.392, p = .001, Cohen's d = 1.134; AV: t(18) = 4.055, p = .001, Cohen's d = 0.930) sessions and better in the second compared with the first session (AO: t(14) = 3.277, p = .006, Cohen's d = 0.846; AV: t(18) = 4.062, p = .001, Cohen's d = 0.932).
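The effect sizes reported for the within-group (paired) contrasts above appear consistent with Cohen's d computed as the t statistic divided by the square root of the group size. This is inferred from the reported numbers, not stated in the text, and the helper name is hypothetical.

```python
import math


def cohens_d_from_paired_t(t_value, n):
    """Cohen's d for a paired contrast, recovered as t / sqrt(n)."""
    return t_value / math.sqrt(n)


# Session 3 versus Session 1: AO, t(14) = 5.140 (n = 15); AV, t(18) = 5.762 (n = 19)
d_ao = cohens_d_from_paired_t(5.140, 15)
d_av = cohens_d_from_paired_t(5.762, 19)
```

Rounded to three decimals, these values match the reported d = 1.327 (AO) and d = 1.322 (AV).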
To further investigate whether one of the two groups improved more from Training Session 1 to Training Session 3, we conducted an independent samples t test comparing the improvement in training performance (i.e., the Training Session 3 minus Training Session 1 difference in the length of the reproduced sequences) between the AV group and the AO group. Results revealed that the AV group improved more (M = 1.404, SD = 1.061) than the AO group (M = 0.815, SD = 0.614), but this difference was only marginally significant, t(32) = −1.906, p = .066.
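The 32 degrees of freedom equal 15 + 19 − 2, which corresponds to a pooled-variance comparison of two separate groups. As a rough check, the t statistic can be approximated from the reported means and standard deviations, assuming the pooled-variance formula and the group sizes given in the methods (AO: 15; AV: 19). The function name is hypothetical.

```python
import math


def pooled_t_from_summary(m1, sd1, n1, m2, sd2, n2):
    """Independent samples t statistic (pooled variance) from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / standard_error, df


# Improvement scores (Session 3 minus Session 1): AO group first, AV group second
t_value, df = pooled_t_from_summary(0.815, 0.614, 15, 1.404, 1.061, 19)
```

The result is close to the reported t(32) = −1.906; small discrepancies are expected because the means and standard deviations are rounded.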
Training Predicting Learning
To investigate whether better sequence reproduction performance during training leads to better learning, a multiple linear regression analysis was performed to predict learning (accuracy in the posttest) with the predictors group (AO, AV) and sequence reproduction (average length of reproduced sequences in the third training session). Results showed that neither group nor sequence reproduction significantly predicted learning (p > .1), and the model was not significant overall, F(2, 31) = 1.506, p = .238, R2 = .089, suggesting that sequence reproduction during training does not necessarily ensure successful learning in either of the groups.
WM Predicting Sequence Reproduction and Learning
We tested whether sequence reproduction or learning depended on WM capacity, as assessed by the digit span task. We conducted two linear regression analyses with group (AO, AV) and WM performance as predictors and (i) sequence reproduction and (ii) learning as the dependent variables. In the first regression, Group was a significant predictor of sequence reproduction (p < .001), but not WM (p = .880); the model was overall significant, F(2, 20) = 20.043, p < .001, R2 = .667. In the second regression, neither Group nor WM was a significant predictor of learning (p > .1), and the model was not significant overall, F(2, 20) = 1.269, p = .303, R2 = .113.
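The overall model tests can be related to the reported R2 values through the standard F ratio for a linear regression, with the number of predictors as the numerator degrees of freedom. The sketch below is a consistency check under that assumption; the function name is hypothetical.

```python
def f_from_r_squared(r_squared, df_model, df_error):
    """Overall F ratio of a linear regression, recovered from R squared."""
    return (r_squared / df_model) / ((1 - r_squared) / df_error)


# (i) sequence reproduction: R2 = .667; (ii) learning: R2 = .113; both with df = (2, 20)
f_reproduction = f_from_r_squared(0.667, 2, 20)
f_learning = f_from_r_squared(0.113, 2, 20)
```

Both values land close to the reported F(2, 20) = 20.043 and F(2, 20) = 1.269, with the small differences attributable to rounding of R2.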
N100 Time Window (80–145 msec)
As shown in Figure 4 (A, C, D), neural sensitivity to the statistical properties of the grammar was reflected in the N100 amplitude of the AO group as the N100 was higher in response to LP than HP notes in the posttest; the AV group did not show a similar differentiation. Confirming this, a mixed ANOVA yielded a significant Session × Note Probability × Group interaction, F(2, 62) = 4.290, p = .018, η2 = .122, as well as a significant Session × Group interaction, F(1, 31) = 4.140, p = .050, η2 = .118. Planned contrasts showed that the N100 was higher in response to LP compared to HP notes in the posttest in the AO group, t(14) = −2.319, p = .036, Cohen's d = 0.599, whereas that was not significant in the pretest, t(14) = 0.979, p = .344, Cohen's d = 0.253. The contrast between LP and HP was not significant in the AV group in either session (pretest: t(17) = −0.911, p = .375, Cohen's d = 0.215; posttest: t(17) = 1.617, p = .124, Cohen's d = 0.381). Furthermore, the N100 amplitude in response to HP notes became less negative from pretest to posttest in the AO group, t(14) = 3.193, p = .007, Cohen's d = 0.866. No effect was found for the AV group (p > .1) (Figure 4B and D). There was no other significant effect or interaction between the factors (p > .4).
The enhanced neural sensitivity of the AO group to the statistical regularities of the grammar was also reflected in the latency of the N100 (Figure 4E). A mixed ANOVA revealed a significant Session × Note Probability interaction, F(2, 62) = 4.076, p = .022, η2 = .116. In the AO group, the latency of the N100 in response to INC notes was longer in the posttest compared to the pretest session, t(14) = 2.620, p = .020, Cohen's d = 0.676. Furthermore, in the pretest, the AO group showed shorter latencies to INC than LP notes, t(14) = −2.300, p = .037, Cohen's d = 0.734, whereas in the posttest, they showed longer latencies to INC than HP notes, t(14) = 2.567, p = .022, Cohen's d = 0.663. There was no difference in the posttest versus pretest latency in the AV group (p > .3). There was no other significant effect or interaction between the variables (p > .1).
P200 Time Window (150–225 msec)
There was a main effect of Session, F(1, 31) = 11.936, p = .002, η2 = .278 (Figure 4A), as the P200 amplitudes increased from the pretest (M = 1.263, SD = 2.258) to the posttest (M = 2.012, SD = 2.214) session. There was no other effect or interaction between the variables (p > .2).
There was no significant effect or interaction between the variables on the P200 latencies (p > .1).
ERAN Time Window (140–220 msec)
Two 2 (Session: pretest vs. posttest) × 2 (Group: AO vs. AV) mixed ANOVAs with LP–HP and INC–HP as the dependent variables, respectively, were performed on the ERAN (see Figure S2 in the supplementary materials). The LP–HP ANOVA did not reveal any significant main effect or interaction (p > .1). However, the INC–HP difference increased from pretest to posttest (main effect of Session: F(1, 31) = 5.836, p = .022, η2 = .158), but there was no other effect or interaction between the variables (p > .2). The analysis in detail and a figure are included in the supplementary materials.
Nonparametric Cluster Permutation Analysis
We compared brain responses to LP minus HP (LP − HP) notes between the AV group and the AO group with a nonparametric cluster permutation test. Results revealed two frontocentral clusters, the first from 100 to 200 msec (p = .023) and the second from 250 to 350 msec (p = .031; see Figure S3A for the topography in the supplementary materials). In both clusters, the AO group showed an enhanced negativity in response to LP compared to HP notes, which was not the case for the AV group (Figure S3C). This is also evident from the difference ERP plots in Figure S3B, in which there is a negative-going wave in the AO group, but not in the AV group, in both time windows. Both clusters were further statistically tested by independent samples t tests confirming the group difference (100–200 msec: t(28) = 3.484, p = .002, Cohen's d = 1.272; 250–350 msec: t(28) = 3.048, p = .005, Cohen's d = 1.113). Considering the time windows and the midfrontal topography of both clusters, they might represent the N100–P200 effects we observed and a later negative-going component resembling an N200, respectively. These findings are shown in Figure S3.
Our main goal was to investigate the effect of visual aids on music learning and on its neural correlates, as well as to examine the distinction between sequence reproduction and learning of the statistical regularities of an unfamiliar music grammar. Our study was the first, to our knowledge, to investigate statistical learning over multiple sessions conducted on separate days under different training regimes. In contrast to previous studies showing that multimodality is beneficial for learning (e.g., Tierney et al., 2008; Brünken et al., 2004; Cleary et al., 2001), we found that visual aids boosted sequence reproduction but did not improve statistical learning, suggesting that performance during training and actual learning are two distinct or relatively independent processes. This was also reflected in the neural correlates, as training without visual aids was associated with increased sensitivity to the statistical properties of the musical style.
As expected, participants who received musical training with visual aids were able to reproduce considerably longer sequences compared to those without visual aids. Previous studies have demonstrated that visual cues engage WM resources in visuospatial (e.g., Gathercole & Alloway, 2012) and arithmetic (e.g., St Clair-Thompson, Stevens, Hunt, & Bolder, 2010) domains. Thus, it might have been easier for the AV group to reproduce the sequences by relying on STM of the visual cues. However, this mechanism might only be efficient for immediate reproduction and not necessarily beneficial for longer-term acquisition of knowledge or for developing an enhanced sensitivity to the underlying rules. Future studies are needed to investigate the efficacy of visual aids in longer-term learning, over periods of weeks, months, and years, of the statistical regularities of a certain musical style.
Interestingly, the superior sequence reproduction using visual aids was not reflected in greater knowledge of the grammar, as the AV and AO groups performed equally well in the test after training. This is in line with previous studies demonstrating that performance during training and learning are distinct processes (e.g., Kantak & Winstein, 2012; Schmidt & Bjork, 1992; Lee & Genovese, 1988). According to the desirable difficulties theory (Bjork & Bjork, 2011; Bjork, 1994), learning and retention can improve with the use of more difficult and challenging materials during acquisition. That is, learning can be substantially improved by superficial changes in the presentation of the material, for example, using a letter format that is harder to read (Diemand-Yauman, Oppenheimer, & Vaughan, 2011), because of an increasing level of cognitive engagement during learning, which enhances subsequent recall. For example, when a response word associated with a stimulus word is generated rather than read by the participant, later recall is improved, attributed to strengthening of memory associating the two items (Hirshman & Bjork, 1988). Furthermore, learning and generalization of knowledge beyond specific recall improves when two sets of information are interleaved rather than grouped into separate blocks (Richland et al., 2005). Likewise, participants exhibit better understanding of paragraphs with deleted letters than paragraphs with intact letters (Maki, Foley, Kajer, Thompson, & Willert, 1990). A beneficial effect of using a hard-to-read letter font on memory recall has been also demonstrated in classroom settings (Diemand-Yauman et al., 2011). Our findings suggest that participants' reliance on the auditory information constituted an extra difficulty in the sequence reproduction task, which may have led to enhanced neural sensitivity to the statistical properties of the learned material (Craik & Tulving, 1975). 
In contrast, visual cues made the task much easier, thus requiring less cognitive engagement and providing no enhancement of subsequent retrieval.
Another explanation might lie in the modality specificity of statistical learning (see Frost, Armstrong, Siegelman, & Christiansen, 2015, for a review). In particular, it is possible that training with visual cues could have led to visual learning dominating over auditory learning, thus producing a deficit in the test sessions. Transfer of learning across modalities has been found to be limited (Tunney & Altmann, 1999; Redington & Chater, 1996), although there are qualitative learning biases among the auditory, visual, and tactile modalities (Emberson, Conway, & Christiansen, 2011; Conway & Christiansen, 2005). Furthermore, there is evidence suggesting that, sometimes, multimodality results in cross-modal competition (e.g., Robinson & Sloutsky, 2013; Sinnett, Spence, & Soto-Faraco, 2007; see Spence, 2009, for a review), where the more salient a stimulus representation, the more it dominates the competition (Rapp & Hendel, 2003). In addition, visual dominance effects have been demonstrated, where participants failed to detect a tone when it was simultaneously presented with a light (Colavita effect: Spence, 2009). Visual dominance can be modulated but not completely reversed with selective attention (Sinnett et al., 2007); however, in our study, there were no instructions on direction of attention. Contrary to our findings, previous studies have shown that combining auditory with visual information can be beneficial for sequence learning (e.g., Shams & Seitz, 2008; Tierney et al., 2008; Seitz, Kim, & Shams, 2006; Brünken et al., 2004; Pisoni & Cleary, 2004; Cleary et al., 2001; Coull, Tremblay, & Elliott, 2001). For example, participants presented with audiovisual stimuli have been found to reproduce longer sequences compared to when they were presented with audio- or visual-only cues (Simon task: Tierney et al., 2008; Cleary et al., 2001).
Likewise, in our study, the participants trained with audiovisual stimuli were able to reproduce longer sequences during training (compared to participants trained with auditory stimuli only), but this benefit did not result in better learning of the statistical regularities of the AMG.
Importantly, however, the aforementioned studies used the same stimulus modalities for the learning and testing phases (auditory-only, visual-only, or audiovisual). Here, we need to note that, because the multimodality advantage is often used as a justification for visual aids in music learning, we tested both groups without the visual cues (even if they were used during training). The reason we did not assess learning in the visual or audiovisual modality was that our primary focus was on music learning; that is, we aimed to study whether visual aids improve music learning. Nevertheless, it would be interesting to see how the AV group would perform if tested with audiovisual stimuli. Future studies are needed to investigate the effect of audiovisual training on multimodal learning versus modality-specific learning. In our study, visual cues improved sequence reproduction but not statistical learning, thus providing corroborating evidence for modality-specific learning.
Furthermore, our study differs from the aforementioned studies in additional aspects. Whereas the auditory cues of the Simon task (Tierney et al., 2008; Cleary et al., 2001) consisted of the names of the colors (i.e., participants heard the names of the colors they needed to reproduce), our participants heard tones and were required to find the corresponding keys—a much more difficult task, especially for nonmusicians. Furthermore, there were differences in the method of testing the knowledge after training. For example, Cleary and colleagues (2001) assessed performance based on the length of the sequences participants accurately reproduced during training, whereas Tierney and colleagues (2008) used familiarity ratings on a scale from 1 (least familiar) to 7 (most familiar). In contrast, we asked for grammaticality judgments of specific notes (correct or incorrect). We could speculate here that the latter requires better learning of the statistical regularities of the music grammar because it cannot rely purely on WM (reproduction) or only familiarity. This, however, did not occur for the AV group because the multimodal nature of the stimuli made the immediate reproduction task so much easier to achieve.
Participants showed generalization of their knowledge to new melodies, and there was no difference between the groups. Previous studies have demonstrated generalization effects after a brief exposure to novel music (Loui et al., 2010; Loui & Wessel, 2008). This suggests that both groups internalized the underlying rules of the new grammar and were able to extrapolate their knowledge to unheard melodies. Participants also exhibited sensitivity to notes with different levels of predictability, as they scored HP, LP, and INC notes as increasingly more surprising.
On the neural level, after training, the AO group showed enhanced N100 in response to LP compared to HP notes, but this effect was not present in the AV group, which exhibited no differences between probability types. The N100 has been linked to expectation (e.g., Daikoku, Yatomi, & Yumoto, 2015; Omigie et al., 2013; Koelsch & Jentschke, 2010). For example, Omigie and colleagues (2013) found that a similar enhanced early frontal negativity was elicited in response to unpredictable notes only in controls, but not in amusic patients, which showed impaired explicit knowledge of the music for the latter group. Abla and colleagues (2008) found that, after first exposure to a novel grammar, the N100 was increased in response to unexpected words in high-learners only. In our study, the AO group exhibited an increased sensitivity to the statistical properties of the AMG, whereas the AV group had a less robust representation of the material. Thus, the N100 in response to unexpected notes could reflect the strength of the prediction error: Better learning would lead to formation of strong predictions, which, if violated, would elicit an increased N100 amplitude. On the contrary, there is not much prediction error when the predictions are weak. On the basis of predictive coding, the brain inhibits the neural responses to predictable stimuli to achieve efficient processing (Friston, 2005). Auditory, task-relevant training might have led to the AO group forming more specific expectations of how music should unfold because of better knowledge of the statistical properties of the grammar, thus creating stronger prediction error signals when those expectations were violated. On the other hand, the AV group might have had less sensitivity to the subtle statistical regularities of the grammar because of modality-specific learning.
The P200 component was larger in the posttest compared to the pretest session, and this was not different between groups. This early positive deflection is reported to be enhanced after a prolonged training (e.g., Bosnyak, Eaton, & Roberts, 2004; Reinke, He, Wang, & Alain, 2003). Lexical processing with familiar words induces larger P200 amplitudes than unfamiliar words (Stuellein et al., 2016; Liu et al., 2006). Liu and colleagues (2006) observed larger P200 amplitudes in response to familiar compared to unfamiliar Chinese characters and English words, proposing a potential link between P200 and processing speed. In Stuellein and colleagues (2016), recently seen words were associated with larger P200 and faster RTs compared to unseen words during the experiment, suggesting quicker lexical access and semantic integration in memory. In the auditory domain, participants' P200 amplitudes showed a robust increase after training associated with learning to distinguish two synthetic speech variants of the syllable /ba/ (Tremblay & Kraus, 2002). The authors suggested that this component reflects a preattentive mechanism linked to enhanced perception as a result of learning.
The latency of the N100 was delayed for unpredictable notes compared to predictable notes in the AO group, but not in the AV group. The latency of this component has been associated with processing speed (Polich, Ellerson, & Cohen, 1996) and correlated with task difficulty (Goodin, Squires, & Starr, 1983). Therefore, in our study, the N100 latency effect could reflect that the AO group processed expected events faster and more easily than unexpected ones, whereas the AV group did not differentiate between the varying types of expectancy. As manifested by the early neural responses, the results provide evidence for successful early discrimination of subtle statistical differences and, therefore, increased neural sensitivity in the AO group, which was not apparent in the AV group.
Our results revealed no substantial modulation of the ERAN by training method or clear effects of learning. This is an unexpected finding, considering that the ERAN has been associated with violation of syntax in Western tonal music in both chord and melodic sequences (Pearce & Rohrmeier, 2018; Koelsch et al., 2000, 2008; Steinbeis & Koelsch, 2008; Loui, Grent-'t-Jong, Torpey, & Woldorff, 2005). Specifically, ERAN is increased in response to completely ungrammatical or stylistically unpredictable (but not ungrammatical) chords (Leino, Brattico, Tervaniemi, & Vuust, 2007; Steinbeis et al., 2006) but diminished in response to expected elements, after a certain context has been established (Leino et al., 2007). One explanation could be that, in our study, participants were not able to infer harmony from the presented melodies, which participants in studies using Western music are potentially doing. Koelsch et al. (2016) argued that ERAN does not necessarily reflect processing of local dependencies, as local irregularities are confounded with hierarchical structure (Kim, Kim, & Chung, 2011; Villarreal, Brattico, Leino, Østergaard, & Vuust, 2011). In our study, we tracked the statistical properties of the melodic sequences with the IDyOM computational model. This model's probability estimates are long-term (based on prior learning of the grammar) and contextual (the probabilities are conditional on the entire preceding melodic sequence). Therefore, it is unexpected that the ERAN did not robustly capture the statistical learning process, and future studies are needed to investigate this further. In contrast, the N100 is more sensitive to local expectancy violation in various modalities, such as auditory, visual, and temporal (Duzcu, Özkurt, Mapelli, & Hohenberger, 2019; Michalski, 2000).
This component has been especially reflective of statistical learning, with larger N100 in response to tones with a lower transitional probability compared to tones with a higher probability (Zioga, Harrison, Pearce, Bhattacharya, & Luft, 2020; Halpern et al., 2017; Moldwin et al., 2017; Paraskevopoulos et al., 2012b; Abla et al., 2008).
In line with the ROI analysis, a whole-head cluster permutation analysis provided corroborating evidence for the increased sensitivity of the AO group to the subtle statistical properties of the musical grammar. Besides the N100 and P200 time windows, a later, negative-going wave was revealed, which was increased in response to LP compared to HP notes in the AO group, but not in the AV group. The topography and timing of this wave resemble the N200 component observed in prediction error studies (Hajihosseini & Holroyd, 2013; Ferdinand, Mecklinger, & Kray, 2008; Oliveira, McDonald, & Goodman, 2007; Kopp & Wolff, 2000). Both the N100 and N200 are sensitive to prediction errors, with the former as a more immediate response, whereas the latter represents more top–down, later processes. Specifically, the N200 is elicited by deviant stimuli (Hoffman, 1990). This was initially identified in oddball paradigms, where a continuously presented stimulus is interrupted by infrequent stimuli (Näätänen & Picton, 1986). The N200 is evoked by prediction errors when a mismatch between an expected stimulus and the sensory input is detected (Hajihosseini & Holroyd, 2013; Ferdinand et al., 2008; Oliveira et al., 2007; Kopp & Wolff, 2000). For example, in a music performance study (Maidhof, Vavatzanidis, Prinz, Rieger, & Koelsch, 2010), pianists showed an N200 component after unexpected notes, which was larger during performance than during perception of musical sequences. Therefore, our findings suggest enhanced sensitivity to statistical regularities as evidenced from both early (unconscious) and late (conscious) neural responses after musical training with auditory-only cues.
Previous studies have demonstrated multisensory neuroplastic changes in the auditory cortex after multisensory music training (Pantev, Paraskevopoulos, Kuchenbuch, Lu, & Herholz, 2015; Paraskevopoulos, Kraneburg, Herholz, Bamidis, & Pantev, 2015; Kuchenbuch, Paraskevopoulos, Herholz, & Pantev, 2014; Paraskevopoulos, Kuchenbuch, Herholz, Foroglou, et al., 2014; Paraskevopoulos, Kuchenbuch, Herholz, & Pantev, 2014; Paraskevopoulos, Kuchenbuch, Herholz, & Pantev, 2012a). Long-term musical training is associated with enhanced multisensory, audiovisual integration and neuroplastic changes in the auditory cortex, whereas short-term training affects the processing of each modality separately (Pantev et al., 2015). In a magnetoencephalographic audio-tactile mismatch paradigm (Kuchenbuch et al., 2014), musicians showed enhanced higher-order audio-tactile integration as evidenced by their brain responses to multisensory deviant stimuli, whereas nonmusicians demonstrated only bottom–up processing driven by tactile stimuli. In an audiovisual integration study on musicians, Paraskevopoulos and colleagues (2015) showed increased connectivity in areas relying on the contribution of the left inferior frontal cortex in response to auditory pattern violations, which was interpreted as better audiovisual cortical integration. In contrast, nonmusicians had more sparse integration of visual and auditory information and relied more on the visual information. Considering that our participants were nonmusicians, it could be that training with visual cues triggered an overreliance on these cues, which then distracted them from the statistical regularities of the music.
This analysis did not reveal an effect of training with visual aids in visual processing or other posterior regions. This is in contrast with previous neuroscientific work on multisensory learning (Paraskevopoulos, Chalas, Kartsidis, Wollbrink, & Bamidis, 2018; Pantev et al., 2015; Paraskevopoulos et al., 2015). For example, in a multisensory oddball paradigm, Paraskevopoulos et al. (2018) demonstrated that deviant visual stimuli were associated with activation of middle temporal and visual association areas, and that was not different between musicians and nonmusicians. It could thus be expected that, in our study, unexpected notes would elicit a response in visual areas in the AV training group. However, this could not be directly investigated as our participants were presented with the auditory stimuli only during the posttest session, which means there was no incongruency in relation to the visual signals used for training as they were not present in the test sessions (pretest and posttest). In other words, in our experiment, the prediction error was always auditory, rather than an incongruent audiovisual stimulus pair as in the aforementioned studies (Pantev et al., 2015; Paraskevopoulos et al., 2015). Finally, a study on visual processing found that the amplitude of the N170 component varied depending on reference method, whereas latency was independent across methods (Joyce & Rossion, 2005), which might suggest that the mastoid references used here could potentially contribute to the lack of effects in visual areas.
Our experimental design is not without limitations. First, it is possible that the effects we observed are specific to the artificial grammar used, which is necessarily limited in scope and may not have provided sufficient challenge to distinguish performance between the groups. However, the behavioral findings do not suggest a ceiling effect, which speaks against this possibility. Second, because the visual modality tends to dominate when attentional resources are depleted (Robinson, Chandra, & Sinnett, 2016), it could be that visual cues disrupted learning by increasing task demands, requiring participants to make associations between the sounds, the keys, and the visual cues, whereas the AO group needed only to map the keys to the sounds. Third, it would be interesting to examine potential superadditivity effects of audiovisual integration, that is, whether multisensory stimulation elicits higher neural activation than the sum of the responses to the unisensory stimuli (Stanford & Stein, 2007). Superadditivity effects have been demonstrated in various domains, such as the audiovisual (Paraskevopoulos et al., 2018; Nichols & Grahn, 2016) and the audio-tactile (Hoefer et al., 2013). The absence of a visual-only condition, which would be needed to test for such effects, constitutes a limitation of our study and could be addressed in future studies. Finally, we acknowledge that we might have missed effects around the temporal and visual areas because of the average mastoid EEG rereferencing. Future studies are necessary to explore more appropriately how visual aids during music learning affect ERPs at temporal sites.
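To illustrate why the choice of reference can change the amplitude measured at a given site, the following minimal sketch contrasts an average-mastoid reference with a common average reference on toy data. It is an illustration only, not the analysis pipeline used in this study: the function name `rereference`, the channel labels (Fz, Oz, M1, M2), and the voltage values are hypothetical.

```python
from statistics import fmean


def rereference(data, ref_channels=None):
    """Subtract, sample by sample, the mean of the reference channels.

    data: dict mapping channel name -> list of voltages (equal lengths).
    ref_channels: channels averaged to form the reference; None means
    all channels (common average reference).
    """
    names = list(data) if ref_channels is None else list(ref_channels)
    n_samples = len(next(iter(data.values())))
    reference = [fmean(data[ch][t] for ch in names) for t in range(n_samples)]
    return {ch: [v - r for v, r in zip(values, reference)]
            for ch, values in data.items()}


# Toy recording: hypothetical values in microvolts, three samples per channel.
eeg = {
    "Fz": [4.0, 6.0, 2.0],
    "Oz": [1.0, 3.0, 5.0],
    "M1": [1.0, 2.0, 3.0],
    "M2": [3.0, 2.0, 1.0],
}

mastoid_ref = rereference(eeg, ref_channels=["M1", "M2"])  # average mastoids
average_ref = rereference(eeg)                             # common average

# The same occipital channel takes different amplitudes under each reference.
print(mastoid_ref["Oz"])
print(average_ref["Oz"])
```

Because the reference signal is subtracted from every channel, amplitudes at sites close to the reference electrodes tend to shrink, whereas the timing of peaks is less affected. This is in line with the amplitude (but not latency) dependence reported by Joyce and Rossion (2005).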
We conclude that musical training with visual aids is not necessarily beneficial for learning; rather, it might serve as a distraction from encoding the main material. In contrast, training without visual aids can lead to an enhanced understanding of the statistical subtleties of an unfamiliar music grammar, as evidenced by an increased sensitivity to statistical regularities at the neural level. Therefore, adding visual cues might give the illusion of learning because learners can reproduce long sequences; however, it might impair actual learning of the material, as indexed by neural response properties.
C. D. B. L. and I. Z. received funding from Fundação BIAL (grant number 503 323 055). P. M. C. H. is supported by a doctoral studentship from the EPSRC and AHRC Centre for Doctoral Training in Media and Arts Technology (EP/L01632X/1). I. Z. was also supported by a doctoral studentship from Queen Mary University of London. We would like to thank Khadija Rita Khatun for helping with data collection. We also thank the anonymous reviewers for their constructive comments towards the improvement of the paper.
Reprint requests should be sent to Ioanna Zioga or Caroline Di Bernardi Luft, School of Biological and Chemical Sciences, Queen Mary University of London, London E1 4NS, United Kingdom, or via e-mail: email@example.com, firstname.lastname@example.org.
Supplementary materials for this paper can be retrieved from https://github.com/JoannaZioga/Zioga-audiovisual-Supplement-JoCN.