Musicianship has been associated with auditory processing benefits. It is unclear, however, whether pitch processing experience in nonmusical contexts, namely, speaking a tone language, has comparable associations with auditory processing. Studies comparing the auditory processing of musicians and tone language speakers have shown varying degrees of between-group similarity with regard to perceptual processing benefits and, particularly, nonlinguistic pitch processing. To test whether the auditory abilities honed by musicianship or speaking a tone language differentially impact the neural networks supporting nonlinguistic pitch processing (relative to timbral processing), we employed a novel application of brain signal variability (BSV) analysis. BSV is a metric of information processing capacity and holds great potential for understanding the neural underpinnings of experience-dependent plasticity. Here, we measured BSV in electroencephalograms of musicians, tone language-speaking nonmusicians, and English-speaking nonmusicians (controls) during passive listening of music and speech sound contrasts. Although musicians showed greater BSV across the board, each group showed a unique spatiotemporal distribution in neural network engagement: Controls had greater BSV for speech than music; tone language-speaking nonmusicians showed the opposite effect; musicians showed similar BSV for both domains. Collectively, results suggest that musical and tone language pitch experience differentially affect auditory processing capacity within the cerebral cortex. However, information processing capacity is graded: More experience with pitch is associated with greater BSV when processing this cue. Higher BSV in musicians may suggest increased information integration within the brain networks subserving speech and music, which may be related to their well-documented advantages on a wide variety of speech-related tasks.
Psychophysiological evidence supports an association between music and speech such that experience in one domain is related to processing in the other (e.g., Bidelman, Gandour, & Krishnan, 2011; Koelsch, Maess, Gunter, & Friederici, 2001). Musicianship has been associated with benefits to auditory processing, such as enhanced spectral acuity for the perception of degraded speech (Zendel & Alain, 2012; Bidelman & Krishnan, 2010; Parbery-Clark, Skoe, Lam, & Kraus, 2009), lexical pitch judgments (e.g., Chandrasekaran, Krishnan, & Gandour, 2009; Schon, Magne, & Besson, 2004), and binaural sound processing (Parbery-Clark, Strait, Hittner, & Kraus, 2013). It is unclear, however, whether pitch processing experience in nonmusical contexts, namely, speaking a tone language, has comparable associations with auditory processing.
Tone languages, unlike other types of languages, use pitch phonemically (i.e., at the word level; e.g., Yip, 2002) to distinguish lexical meaning. Of all tone languages, Cantonese has one of the largest tonal inventories, comprising six tones, three of which are level and three of which are contour (Rattanasone, Attina, Kasisopa, & Burnham, 2013; Wong et al., 2012). These level pitch patterns are differentiable based on pitch height (Khouw & Ciocca, 2007; Gandour, 1981). The proximity of tones is approximately one semitone (i.e., a 6% difference in frequency, calculated from Peng, 2006), which is also the smallest distance found between pitches in music (Bidelman, Hutka, & Moreno, 2013). Note that this does not mean that Cantonese language experience is on par with musicians' auditory experience. Cantonese speakers have less pitch processing experience than musicians, who have extensive experience with 12 level tones (i.e., the number of semitones in a scale) at several octaves, the processing of pitch contours as a result of the demands of musicianship, and the perception and production of complex melodies and harmonies. Furthermore, musicians' auditory demands include processing simultaneous tones (e.g., chords) and attending to the tone quality (i.e., timbre) of their instrument and other instruments around them. In comparison, tone language speakers have lesser auditory demands, typically processing a single, sequential stream of speech, without the same emphasis as musicians on tracking timbral cues. Because of the higher auditory demands faced by musicians (relative to tone language speakers), one might predict that benefits to auditory processing would be greater in musicians than in tone language speakers. Furthermore, one might predict that benefits to auditory processing in tone language speakers would be greater than in controls without musical training or tone language experience.
However, studies comparing the auditory processing of musicians and tone language speakers have shown varying degrees of between-group similarity with regard to perceptual processing benefits and, particularly, nonlinguistic pitch processing.
Tone language experience (e.g., Mandarin: Bidelman et al., 2011; Cantonese: Bidelman, Hutka, et al., 2013) has been shown to similarly affect the neural encoding (Bidelman et al., 2011) and perception (Bidelman, Hutka, et al., 2013) of pitch. These studies imply that tone language experience may confer some benefits to spectral acuity that are comparable with those conferred by musicianship. However, behavioral studies have also revealed contradictory findings on tone language speakers' nonlinguistic pitch perception abilities, ranging from weak (Wong et al., 2012; Giuliano, Pfordresher, Stanley, Narayana, & Wicha, 2011) to no enhancements (Schellenberg & Trehub, 2008; Bent, Bradlow, & Wright, 2006; Stagray & Downs, 1993). Neuroimaging studies have also been unclear in this regard (e.g., Bidelman et al., 2011). Our group has found enhanced preattentive processing of nonlinguistic pitch in brainstem and cortical auditory evoked potentials in musicians (Hutka, Bidelman, & Moreno, 2015; Bidelman et al., 2011), as well as in tone language speakers (Bidelman et al., 2011). Yet, neural advantages do not necessarily coincide with behavioral benefits in nonlinguistic pitch discrimination and vice versa (Hutka et al., 2015).
In Bidelman et al. (2011), musicians and tone language speakers had similar, stronger brainstem representation of the defining pitches of musical sequences, as compared with controls (i.e., nonmusicians, non-tone language speakers). However, only musicians showed enhanced behavioral, musical pitch discrimination, relative to tone language speakers and controls. These findings suggest that enhanced processing at the brainstem level (i.e., preattentive stages of auditory processing) does not necessarily equate to perceptual, behavioral benefits. More puzzling is the absence of neural effects, given the presence of perceptual benefits, which was observed in Hutka et al. (2015). Hutka et al. (2015) measured the behavioral (i.e., difference limens) and automatic change detection response (i.e., MMN) to variations in pitch and timbre in musicians, Cantonese speakers, and nonmusician controls. Musicians and Cantonese speakers outperformed controls on the behavioral pitch discrimination task. Only musicians showed enhanced behavioral timbral processing, relative to tone language speakers and controls. Parallel enhancements of behavioral spectral acuity in early auditory processing were observed in musicians only. That is, tone language users' advantages in pitch discrimination that were observed behaviorally were not reflected in early cortical MMN responses to pitch changes.
If both musicianship and speaking a tone language hone a common cue (pitch), why is this not reflected in their automatic cortical responses to pitch changes (Hutka et al., 2015)? It is possible that the mean activation over a cortical patch (i.e., the ERP, MMN measures) used in Hutka et al. (2015) may not adequately represent the neural processes underlying the processing of pitch. Musicians arguably have a greater range of experience with pitch (e.g., manipulating and producing complex melodies and harmonies) than do tone language speakers. By this logic, tone language speakers should show neither neural responses to pitch nor behavioral benefits in pitch discrimination that are comparable to those of musicians. However, because such behavioral benefits were observed in Cantonese speakers in Hutka et al. (2015), it is possible that there are unique neural circuitries associated with pitch processing in these individuals that were not adequately captured in ERP measures. More generally, this discrepancy between brain and behavior raises the question of the extent to which musicianship and tone language experience similarly shape nonlinguistic pitch processing, relative to controls.
To test whether automatic, nonlinguistic pitch processing is supported by common neural network activations in musicians and Cantonese speakers, two requirements emerge. First, one would need to apply a methodology that could detect nuanced effects in the brain signal that might underlie the differences between auditory processing for Cantonese speakers and musicians. Second, one would need to apply this methodology to the existing EEG data set from Hutka et al. (2015), so that the results could be directly compared with the ERP data from that study. Both of these requirements were met by the measurement of brain signal variability (BSV) in the continuous EEG signal as a metric of information integration (Heisz, Shedden, & McIntosh, 2012; Misic, Mills, Taylor, & McIntosh, 2010; Ghosh, Rho, McIntosh, Kötter, & Jirsa, 2008; McIntosh, Kovacevic, & Itier, 2008).
Brain Signal Variability
BSV is the brain signal's transient, temporal fluctuations (McIntosh et al., 2013). There is strong evidence that BSV conveys important information about network dynamics (Deco, Jirsa, & McIntosh, 2011). The modeling of neural networks involves mapping an integration of information across widespread brain regions, via variations in correlated activity between areas across multiple timescales (Honey, Kötter, Breakspear, & Sporns, 2007; Jirsa & Kelso, 2000; see Garrett et al., 2013, for a discussion). These transient changes result in fluctuations in the dynamics of the corresponding brain signal (Garrett et al., 2013). Networks with more potential configurations produce a more variable response (Garrett et al., 2013). Therefore, BSV appears to represent the system's information processing capacity, in which higher variability reflects greater information integration (Garrett et al., 2013; McIntosh et al., 2008, 2013; Heisz et al., 2012; Misic et al., 2010; Ghosh et al., 2008). However, like any nonlinear system, there is a theoretical “sweet spot” around which too little or too much variability may compromise information processing (Deco, Jirsa, & McIntosh, 2013).
As a metric of neural network dynamics, BSV provides valuable information about these dynamics that could not be obtained through the sole measurement of mean neural activity (e.g., using ERPs; Heisz et al., 2012; Vakorin, Misic, Krakovska, & McIntosh, 2011; McIntosh et al., 2008; see also Garrett, Kovacevic, McIntosh, & Grady, 2011; Ghosh et al., 2008). Averaging across trials (i.e., as in a traditional ERP analysis) removes the variability in each trial (see Hutka, Bidelman, & Moreno, 2013, for a discussion). This variability is not noise; instead, it provides information about network dynamics (Deco et al., 2011) and reflects the activation and interactions of the entire neural network (Hutka et al., 2013).
We posit that BSV might have great potential for studying the implicit neuroplasticity afforded by experience and learning. Studies have shown that the more information available to a listener about a given stimulus, the greater the BSV in response to that stimulus (Heisz et al., 2012; Misic et al., 2010). Variability should therefore increase as a function of learning, such that the more information one acquires for a stimulus, the greater the information carried in the brain signal (Heisz et al., 2012). A study by Heisz et al. (2012) confirmed this expectation, showing that greater BSV was associated with greater knowledge representation for certain faces, a result that was not reflected in the mean ERP amplitude in the same data set. Variability increased with face familiarity, suggesting that the perception of well-known stimuli engages a broader network of brain regions, manifesting in greater spatiotemporal changes in BSV (Heisz et al., 2012). These findings suggest that BSV is a useful metric of knowledge representation, capable of conveying information above and beyond what could be learned from mean neural activity. BSV can therefore be applied to the study of experience-dependent plasticity and, in particular, pitch processing in musicians and tone language speakers (Hutka et al., 2013).
In the present study, we applied BSV analysis to the continuous EEG data set (i.e., not the ERP data) of Hutka et al. (2015), with the objective of studying the implicit impact of neuroplasticity afforded by experience and learning in the auditory processing of nonlinguistic pitch in musicians, tone language speakers, and controls. Specifically, this design tested whether pitch processing, relative to another auditory cue (timbral processing), is supported by common neural network activations in musicians and Cantonese speakers, relative to controls. Note that, throughout this manuscript, we are not seeking to make claims regarding fine-grained anatomical differences between groups or conditions. Instead, we sought to examine activation patterns of information integration during automatic processing of music (i.e., nonlinguistic pitch) and speech (linguistic timbre) in the three aforementioned groups.
If auditory experience via musicianship and tone language experience are associated with comparable information processing capacities supporting automatic music processing (i.e., pitch), then both groups would show greater BSV supporting auditory processing, as compared with controls (i.e., musicians = Cantonese speakers > controls). If the auditory expertise honed by musicianship and tone language are associated with different information processing capacities supporting automatic pitch processing, then we would predict different BSV between musicians and tone language speakers. This latter prediction would also manifest in unique spatiotemporal distributions for each group, as each group would be using a different brain network to support processing of pitch versus timbre.
Participants, stimuli, and EEG recording and preprocessing are the same as in Hutka et al. (2015).
Sixty right-handed, young adult participants were recruited from the University of Toronto and Greater Toronto Area. All participants provided written, informed consent in compliance with an experimental protocol approved by the Baycrest Centre research ethics committee and were provided financial compensation for their time. English-speaking musicians (M; n = 21, 14 women) had at least 8 years of continuous training in Western classical music on their primary instrument (μ ± σ: 15.43 ± 6.46 years), beginning formal music training at a mean age of 7.05 (±3.32 years). English-speaking nonmusicians (NM; n = 21, 14 women) had ≤3 years of formal music training on any combination of instruments throughout their lifetime (μ ± σ: 0.81 ± 1.40 years). Neither Ms nor NMs had experience with a tonal language of any kind. Native Cantonese-speaking participants (C; n = 18; 11 women) also had minimal musical training throughout their lifetime (0.78 ± 0.94 years). Importantly, NM and C did not differ in their minimal extent of music training, F(1, 37) = 0.007, p = .935. C were born and raised in mainland China or Hong Kong, started formal instruction in English at mean age of 10.27 (±5.13 years), and used Cantonese on a regular basis (>40% of daily language use).
The three groups were closely matched in age (M: 25.24 ± 4.17 years, C: 24.17 ± 4.12 years, NM: 23.38 ± 4.07 years; F(2, 57) = 1.075, p = .348) and years of formal education (M: 18.19 ± 3.25 years, C: 16.94 ± 2.46 years; NM: 16.67 ± 2.76 years; F(2, 57) = 1.670, p = .198). All groups performed comparably on a measure of general fluid intelligence (Raven's Advanced Progressive Matrices; Raven, Raven, & Court, 1998) and nonverbal, short-term visuospatial memory (Corsi blocks; Corsi, 1972), p > .05.
EEG Task Stimuli
EEGs were recorded using a passive, auditory oddball paradigm, consisting of two conditions, namely, music and speech sound contrasts presented in separate blocks (Figure 1). There were a total of 780 trials in each condition, including 90 large deviants (12% of the trials) and 90 small deviants (12% of the trials). The notes (piano timbre) consisted of middle C (C4, F0 = 261.6 Hz), middle C mistuned by an increase of 0.5 semitones (large deviant; 269.3 Hz; 2.9% increase in frequency from standard), and middle C mistuned by an increase of 0.25 semitones (small deviant; 265.4 Hz; 1.4% increase in frequency from standard). Tone durations were 300 msec, including 5 msec of rise/fall time to reduce spectral splatter in the stimuli. Note that these changes were selected because previous behavioral research has demonstrated that both Cantonese speakers and musicians can distinguish between half-semitone changes in a given melody better than controls, whereas musicians outperform Cantonese speakers and controls when detecting a quarter-semitone change (Bidelman, Hutka, et al., 2013).
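The mistuned note frequencies above follow from the equal-tempered relation f = f0 · 2^(s/12), where s is the shift in semitones. The following is a minimal sketch of that arithmetic only; the function name `shift_semitones` and the variable names are hypothetical and are not taken from the original stimulus-generation code.

```python
def shift_semitones(f0_hz: float, semitones: float) -> float:
    """Frequency reached by shifting f0_hz by `semitones` in equal temperament."""
    return f0_hz * 2.0 ** (semitones / 12.0)


STANDARD_HZ = 261.6  # middle C (C4), as stated in the text

# Large deviant: +0.5 semitones; small deviant: +0.25 semitones.
large_deviant_hz = shift_semitones(STANDARD_HZ, 0.5)
small_deviant_hz = shift_semitones(STANDARD_HZ, 0.25)

print(round(large_deviant_hz, 1))  # 269.3
print(round(small_deviant_hz, 1))  # 265.4
```

The same relation gives the roughly 6% frequency step for one full semitone, since 2^(1/12) is approximately 1.059.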
Speech stimuli consisted of three steady-state vowel sounds (Bidelman, Moreno, & Alain, 2013), namely, “oo” as in “book” [ʊ], “aw” as in “pot” [a], and “uh” as in “but” [ʌ], as the standard, large deviant, and small deviant (on the border of categorical perception between the standard and large deviant; Bidelman, Moreno, et al., 2013), respectively. The duration of each vowel was 250 msec, including 10 msec of rise/fall. Note that the speech and note stimuli durations are different, as we were interested in maintaining natural acoustic features and presenting the sound as naturally as possible (Hutka et al., 2015). The sound onset asynchrony was 1000 msec in both conditions so that the stimulus repetition rates (and thus, neural adaptation effects) were comparable for both speech and music EEG recordings.
The standard vowel had a first formant (F1) of 430 Hz, the large deviant 730 Hz (41.1% increase in frequency from standard), and the small deviant 585 Hz (26.5% increase in frequency from standard). Speech tokens contained identical voice fundamental (F0), second (F2), and third (F3) formant frequencies (F0: 100, F2: 1090, and F3: 2350 Hz, respectively), chosen to match prototypical productions from a male speaker (Peterson & Barney, 1952). The magnitude of F1 change between the standard and each speech deviant was chosen to parallel the magnitude of change in the music standard and deviants. However, it is notable that a greater magnitude of change was required to detect the standard–large deviant and standard–small deviant changes for F1 than for F0. This difference was informed by past findings showing that participants require a larger percent change between two vowel sounds (i.e., F1) to detect a difference, as compared with between two pitches (i.e., F0; Bidelman & Krishnan, 2010). Pilot testing was used in the present study to determine the specific F0 and F1 standard–deviant changes that musicians and nonmusicians could reliably detect.
EEG Recording and Preprocessing
EEG was recorded using a 76-channel ActiveTwo amplifier system (Biosemi, Amsterdam, The Netherlands) with electrodes placed around the scalp according to standard 10–20 locations (Oostenveld & Praamstra, 2001). Continuous EEG recordings were sampled at 512 Hz and bandpass filtered online between 0.01 and 50 Hz. Source estimation was performed on the EEG data at 72 ROIs defined in Talairach space (Diaconescu, Alain, & McIntosh, 2011) using sLORETA (Pascual-Marqui, 2002), as implemented in Brainstorm (Tadel, Baillet, Mosher, Pantazis, & Leahy, 2011). Source reconstruction was constrained to the cortical mantle of the standardized brain template MNI/Colin27 defined by the Montreal Neurological Institute in Brainstorm. Current density for one source orientation (X component) was mapped at 72 brain ROIs, adapting the regional map coarse parcellation scheme of the cerebral cortex developed in Kötter and Wanke (2005). Multiscale entropy (MSE) was calculated on the source waveform at each ROI for each participant.
Sample entropy quantifies the predictability of a time series by calculating the conditional probability that any two sequences of m consecutive data points that are similar to each other within a certain criterion (r) will remain similar at the next point (m + 1) in the data set (N), where N is the length of the time series (Richman & Moorman, 2000). In this study, MSE was calculated with the pattern length set to m = 5 and the similarity criterion set to r = 1.
MSE estimates were obtained for each participant as the mean across single-trial entropy measures for each timescale.
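The sample entropy and MSE computations described above can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the analysis code used in the study: the function names are hypothetical, the coarse-graining uses non-overlapping window means, and the similarity criterion r is assumed to be expressed in units of the standard deviation of the original (scale 1) series. The defaults m = 5 and r = 1 mirror the values given in the text.

```python
import numpy as np


def sample_entropy(x, m, tol):
    """SampEn = -ln(A / B): B counts template pairs of length m within tol,
    A counts pairs of length m + 1 (self-matches excluded)."""
    x = np.asarray(x, dtype=float)
    n_templates = x.size - m  # same number of templates for both lengths
    if n_templates < 2:
        return np.nan

    def count_matches(length):
        templates = np.array([x[i:i + length] for i in range(n_templates)])
        count = 0
        for i in range(n_templates - 1):
            # Chebyshev distance from template i to every later template
            dist = np.max(np.abs(templates[i + 1:] - templates[i]), axis=1)
            count += int(np.sum(dist <= tol))
        return count

    b = count_matches(m)
    a = count_matches(m + 1)
    if b == 0:
        return np.nan  # undefined: no matches even at length m
    if a == 0:
        return np.inf  # no matches survive at length m + 1
    return -np.log(a / b)


def coarse_grain(x, scale):
    """Average consecutive, non-overlapping windows of `scale` samples."""
    x = np.asarray(x, dtype=float)
    n_windows = x.size // scale
    return x[:n_windows * scale].reshape(n_windows, scale).mean(axis=1)


def multiscale_entropy(x, scales, m=5, r=1.0):
    """Sample entropy of the coarse-grained series at each timescale."""
    tol = r * np.std(x)  # assumption: r is scaled by the SD of the raw series
    return np.array([sample_entropy(coarse_grain(x, s), m, tol) for s in scales])


def mean_mse_across_trials(trials, scales, m=5, r=1.0):
    """Participant-level estimate: mean of single-trial MSE curves."""
    return np.mean([multiscale_entropy(t, scales, m, r) for t in trials], axis=0)
```

With this sketch, a regular signal such as a sine wave yields lower sample entropy than white noise of the same length, which is the sense in which the measure indexes predictability.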
Power spectral density (PSD) was also measured for all trials. This spectral analysis was conducted because previous studies suggested that changes in MSE tend to closely follow changes in spectral power, while providing unique information about the data (Misic et al., 2010; Lippe, Kovacevic, & McIntosh, 2009; McIntosh et al., 2008; Gudmundsson, Runarsson, Sigurdsson, Eiriksdottir, & Johnsen, 2007). Therefore, changes in sample entropy across sources and temporal scales were examined, as well as changes in PSD across sources and frequency bands.
Single-trial power spectra were computed using the fast Fourier transform. To capture the relative contribution from each frequency band, all time series were first normalized to mean 0 and SD 1. Given a sampling rate of 512 Hz and 614 data points per trial, the effective frequency resolution was 0.834 Hz. Hence, all spectral analyses were constrained to a bandwidth of 0.834–50 Hz.
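The frequency resolution quoted above is the sampling rate divided by the number of samples per trial (512 / 614 ≈ 0.834 Hz). A minimal sketch of a single-trial spectrum computed this way is given below; the function names and the particular power scaling are assumptions made for illustration and may differ from the original pipeline.

```python
import numpy as np

FS_HZ = 512      # sampling rate stated in the text
N_SAMPLES = 614  # data points per trial stated in the text


def single_trial_power(trial, fs=FS_HZ):
    """Power spectrum of one trial after normalizing to mean 0 and SD 1."""
    trial = np.asarray(trial, dtype=float)
    z = (trial - trial.mean()) / trial.std()
    spectrum = np.fft.rfft(z)
    power = np.abs(spectrum) ** 2 / z.size  # scaling chosen for illustration
    freqs = np.fft.rfftfreq(z.size, d=1.0 / fs)
    return freqs, power


def restrict_band(freqs, power, f_hi=50.0):
    """Drop the DC bin and everything above f_hi (first kept bin is fs / N)."""
    keep = (freqs > 0) & (freqs <= f_hi)
    return freqs[keep], power[keep]


# Example: a 10 Hz sinusoid sampled like one trial
t = np.arange(N_SAMPLES) / FS_HZ
freqs, power = single_trial_power(np.sin(2 * np.pi * 10 * t))
print(round(freqs[1], 3))  # 0.834, the spacing between frequency bins
```

Averaging the `power` arrays across trials would then give a per-participant spectrum for each source.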
Task Partial Least Squares Analysis
Task partial least squares analysis (PLS) is a multivariate statistical technique that was used to assess between- and within-subject changes in MSE during listening (McIntosh & Lobaugh, 2004; McIntosh, Bookstein, Haxby, & Grady, 1996). Similar to multivariate techniques, such as canonical correlation analysis, PLS operates on the entire data structure at once, extracting the patterns of maximal covariance between two mean-centered data matrices, in the present case either group membership or condition (i.e., task design), and MSE measures (McIntosh et al., 2013). The analysis was done to emphasize two aspects of the experiment: (1) Between-group, which emphasizes main effects by centering group means to the overall grand mean, and (2) Between-condition, which identifies potential interactions by mean-centering each group to its own grand mean, which eliminates the between-group effects.
The PLS model is constructed with a singular value decomposition applied to the mean-centered MSE or PSD matrices. Singular value decomposition identified the strongest group and/or condition differences and the corresponding scalp topography, producing a set of orthogonal latent variables (LVs), with descending order of magnitude of accounted-for covariance. Each LV consists of (1) a pattern of design scores, (2) a singular image showing the distribution across brain regions and sampling scales, and (3) a singular value representing the covariance between the design scores and the singular image (McIntosh & Lobaugh, 2004; McIntosh et al., 1996). Statistical assessment in PLS consists of two steps. First, the overall significance of each LV that related the two data matrices was assessed with permutation testing (Good, 2000), which generates an estimated null distribution of the data. An LV was considered significant if the observed pattern (i.e., its singular value) was present less than 5% of the time in random permutations (i.e., p < .05). The dot product of an individual subject's raw MSE data and the singular image from the LV produces a brain score. The brain score is similar to a factor score in factor analysis that indicates how strongly an individual participant expresses the patterns on the LV. Analysis of brain scores allowed us to estimate 95% confidence intervals for the mean effects in each group and task condition.
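The decomposition and permutation steps described above can be illustrated with a simplified between-group example. This is a sketch under stated assumptions rather than the PLS software used in the study: the function names are hypothetical, only group (not condition) structure is modeled, and the share of covariance per LV is assumed to follow the usual convention of a squared singular value divided by the sum of squared singular values.

```python
import numpy as np


def mean_centered_svd(data, labels):
    """SVD of the group-mean matrix after subtracting the grand mean."""
    groups = np.unique(labels)
    group_means = np.vstack([data[labels == g].mean(axis=0) for g in groups])
    centered = group_means - group_means.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    n_lv = len(groups) - 1  # centering removes one dimension
    return u[:, :n_lv], s[:n_lv], vt[:n_lv]


def task_pls(data, labels, n_perm=500, seed=0):
    """Mean-centered PLS with a label-permutation test on the singular values."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)

    design_scores, s, singular_images = mean_centered_svd(data, labels)
    percent_cov = 100.0 * s ** 2 / np.sum(s ** 2)
    # Brain score: dot product of each subject's data with each singular image
    brain_scores = data @ singular_images.T

    exceed = np.zeros_like(s)
    for _ in range(n_perm):
        s_perm = mean_centered_svd(data, rng.permutation(labels))[1]
        exceed += s_perm >= s
    p_values = exceed / n_perm  # share of permutations matching or beating observed

    return {
        "design_scores": design_scores,
        "singular_values": s,
        "singular_images": singular_images,
        "percent_covariance": percent_cov,
        "brain_scores": brain_scores,
        "p_values": p_values,
    }
```

Here `data` is a subjects-by-features matrix (for MSE, features would be the ROI-by-timescale values flattened into one row per subject) and `labels` gives each subject's group.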
Second, the reliability of the scalp topographies was determined using bootstrap resampling. This bootstrap resampling estimated standard error confidence intervals around the individual singular vector weights in each LV, assessing the relative contribution of particular locations and timescales and the stability of the relation with either group or condition (Efron & Tibshirani, 1986). For scalp topographies, the singular vector weights for each channel were divided by the bootstrap estimated standard error, giving bootstrap ratios. A bootstrap ratio is similar to a z score if the distribution of singular vector weights is Gaussian, but is best interpreted as approximating a confidence interval (McIntosh et al., 2013). Brain regions with a singular vector weight over standard error ratio >3.0 correspond to a 99% confidence interval and were considered to be reliable (Sampson, Streissguth, Barr, & Bookstein, 1989).
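The bootstrap ratio described above can be sketched for the first LV of a between-group analysis. This is a simplified illustration with hypothetical function names: subjects are resampled with replacement within each group, and the sign ambiguity of the SVD is handled by flipping each bootstrap vector to agree with the observed one, which is an assumption standing in for whatever alignment step the original software applies.

```python
import numpy as np


def first_singular_image(data, labels):
    """Singular image (feature weights) of LV1 from a mean-centered group SVD."""
    groups = np.unique(labels)
    means = np.vstack([data[labels == g].mean(axis=0) for g in groups])
    centered = means - means.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def bootstrap_ratios(data, labels, n_boot=500, seed=0):
    """Observed LV1 weights divided by their bootstrap standard errors."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    observed = first_singular_image(data, labels)

    boots = np.empty((n_boot, data.shape[1]))
    for b in range(n_boot):
        # Resample subjects with replacement, keeping group sizes fixed
        idx = np.concatenate([
            rng.choice(np.flatnonzero(labels == g),
                       size=int(np.sum(labels == g)), replace=True)
            for g in np.unique(labels)
        ])
        v = first_singular_image(data[idx], labels[idx])
        if v @ observed < 0:  # resolve the arbitrary sign of the SVD
            v = -v
        boots[b] = v

    standard_error = boots.std(axis=0, ddof=1)
    return observed / standard_error
```

Features whose absolute ratio exceeds 3.0 would be treated as reliable under the threshold given in the text, for example via `np.abs(ratios) > 3.0`.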
The large and small deviant conditions were combined into a single condition for all analyses, as preliminary analysis showed there were no differences in MSE or PSD between these conditions.
When comparing groups across all conditions (Figure 2; Figure 3, showing sample entropy curves for each timescale, across all conditions), the first LV (LV1) of the MSE analysis captured greater sample entropy in the musician group as compared with the Cantonese group (LV1, p = .004, singular value = 1.0856 corresponding to 43.82% of the covariance). This difference was robustly expressed across both fine and coarse timescales across all neural ROIs, particularly in the right hemisphere. The largest effects were seen across all timescales (particularly, in coarse scales) in the right inferior parietal, angular gyrus, and primary somatosensory area; medial posterior cingulate; and bilateral primary motor, medial premotor, precuneus, cuneus, and superior parietal area.
LV1 of the spectral analysis captured differences in the musician group as compared with the control and Cantonese groups (LV1, p = .012, singular value = 0.1626, corresponding to 37.01% of the covariance). This difference was robustly expressed across frequencies that were lower than 20 Hz (primarily theta/alpha band: 4–12 Hz) in a number of brain regions similar or identical to those observed in the MSE results.
Collectively, PLS analyses revealed that each group could be distinguished based on the variability (MSE) and spectral details of their EEG (particularly in the right hemisphere), when listening to speech and music stimuli. Furthermore, in the areas in which these contrasts were robustly expressed (e.g., right angular gyrus; Figure 3), musicians had the greatest sample entropy across all conditions, Cantonese speakers had the lowest sample entropy, and nonmusicians fell between these two groups.
LV1 for the MSE analysis (Figure 4) captured differences in sample entropy between the music and speech conditions for nonmusicians (p = .002, singular value = 0.2518, corresponding to 22.85% of the covariance). These differences were robustly expressed at fine timescales in several left hemisphere areas, namely, the anterior insula, centrolateral and dorsomedial pFC, frontal polar area, and secondary visual areas. Specifically, greater information integration supporting speech processing, as compared with music processing, was observed in these left hemisphere regions. Differences were also robustly expressed in the right primary and secondary visual areas and the cuneus. Namely, greater information processing capacity supporting music processing, rather than speech processing, was observed in these right hemisphere regions.
Similarly, LV1 of the spectral analysis captured spectral differences between the music and speech conditions for nonmusicians (p < .001, singular value = 0.0533, corresponding to 25.28% of the covariance). Processing of music, as compared with speech, was robustly expressed at frequencies below 10 Hz (e.g., theta, 4–7 Hz for the music condition) in multiple left hemisphere regions, namely, the left anterior insula, claustrum, centrolateral and dorsomedial pFC, frontal polar, parahippocampal cortex, thalamus, and dorsolateral and ventrolateral premotor cortex. These differences were also expressed in the midline posterior cingulate cortex, and the right cuneus, thalamus, and ventrolateral pFC. Processing of speech, as compared with music, was robustly expressed in frequencies above 12 Hz (e.g., beta, 12–18 Hz; gamma, 25–70 Hz for the speech stimuli) in multiple left hemisphere areas (left anterior insula, centrolateral and dorsomedial pFC, OFC, frontal polar, and dorsolateral premotor cortex), and the right primary motor area, precuneus, and dorsolateral pFC.
LV2 for the MSE analysis (Figure 5) captured differences in sample entropy between the music and speech conditions for Cantonese speakers (p = .052, singular value = 0.2029, corresponding to 18.41% of the covariance). Specifically, greater information processing capacity for music processing, rather than speech processing, was robustly expressed in the midline posterior cingulate and retrosplenial cingulate cortex at fine timescales and the primary visual area at coarse timescales. Greater information processing capacity for speech processing, rather than music processing, was expressed in the left medial premotor cortex and right medial premotor cortex at coarse timescales.
Similarly, LV2 of the spectral analysis captured differences between the music and speech conditions for Cantonese speakers (p = .036, singular value = 0.0382, corresponding to 18.12% of the covariance). The processing of speech, as compared with music, was robustly expressed at frequencies below 10 Hz (e.g., theta, 4–7 Hz) in the bilateral medial premotor cortex. The processing of the music condition, as compared with speech, was robustly expressed in low-frequency activity (e.g., theta, 4–7 Hz) in the left parahippocampal cortex, and right anterior insula, ventral temporal cortex, and fusiform gyrus. Processing of music was also robustly expressed at frequencies above 12 Hz (e.g., beta, 12–18 Hz; gamma, 25–70 Hz), in the midline posterior and retrosplenial cingulate cortex, left superior parietal cortex, and bilateral primary and secondary visual areas.
Interestingly, a third LV (LV3), contrasting music and speech conditions for the musician group, was not significant (MSE: p = .256; spectral analysis: p = .210). Although it is possible that this effect would become significant at a larger sample size, the bootstrap-estimated standard errors were small, suggesting that this lack of an effect was robust (i.e., a stable-zero estimate; see McIntosh & Lobaugh, 2004). The fact that we failed to detect a difference between musicians' processing of music and speech stimuli suggests that this group used a similar neural architecture to process acoustic information, regardless of the stimulus domain (i.e., music ≈ speech).
Collectively, the between-condition analyses revealed that each group processed the distinction between music and speech using a unique spatiotemporal network. LV1 showed that nonmusicians had greater sample entropy and higher-frequency activity for speech than music at several left hemisphere areas. LV2 showed that Cantonese speakers had greater sample entropy for music than speech, particularly in midline regions. The spectral analyses revealed that this contrast was also expressed across multiple frequency bands. LV3, which was not significant, suggested that musicians used similar neural networks to support the processing of both music and speech stimuli.
By examining sample entropy between groups, we sought to test whether musicianship and tone language (Cantonese) experience are associated with comparable patterns of information integration during automatic processing of music (i.e., nonlinguistic pitch) and speech (linguistic timbre). Between groups, we found that musicians had greater BSV than nonmusicians when listening to both music and speech stimuli. Cantonese speakers had the lowest entropy of all three groups for both stimulus conditions. Although this pattern of results was evident across multiple neural regions and timescales, it was particularly prominent in right hemisphere regions at coarse timescales. These data support the hypothesis that musicianship and tone language differentially impact information integration supporting both music (pitch) and speech (timbre) processing. It is notable that, although pitch cues are used extensively in both musicians' and Cantonese speakers' auditory experience, their information processing networks for pitch appear to be differentially shaped by their unique, domain-specific use and knowledge of this cue (Cantonese: linguistic pitch context; musicians: nonlinguistic pitch context).
The finding that musicians' increased BSV was most prominent in the right hemisphere corroborates the finding that this hemisphere is engaged in processing fine spectral features of auditory input, as compared with the left hemisphere, which is more specialized for temporal processing (see Zatorre, Belin, & Penhune, 2002, for a review). Similarly, expression at coarse timescales suggests that the dynamics supporting pitch and timbre processing are distributed, rather than locally based (Vakorin et al., 2011). Collectively, our findings indicate that musicians' processing of fine spectral features, for both pitch and timbre, is likely supported by a more expansive network than in Cantonese speakers and English-speaking nonmusicians. These data align with evidence that musicianship benefits a wide range of spectral processing (e.g., Parbery-Clark et al., 2013; Zendel & Alain, 2012; Bidelman & Krishnan, 2010; Chandrasekaran et al., 2009; Parbery-Clark et al., 2009; Schon et al., 2004) and, particularly, timbre (Hutka et al., 2015; Bidelman & Krishnan, 2010).
Overall, the results suggest that the extent of information integration during pitch processing is associated with whether one gained pitch experience via musicianship or speaking Cantonese. These results support our earlier prediction that, because of the higher auditory demands faced by musicians (relative to tone language speakers), the benefits to auditory processing are greater in musicians than in tone language speakers. It is interesting to contemplate whether the present differences are rooted in the relative contributions of nature to each type of pitch experience. Specifically, there is evidence that musicianship is self-selected, with factors such as genetics (e.g., Tan, McPherson, Peretz, Berkovic, & Wilson, 2014), intelligence (Schellenberg, 2011), socioeconomic status (e.g., Sergeant & Thatcher, 1974), and personality traits (e.g., Corrigall, Schellenberg, & Misura, 2013) leading certain individuals, but not others, to begin and continue music training (e.g., Schellenberg, 2015). Conversely, the networks supporting Cantonese speakers' pitch processing are subject only to nurture, as these speakers are born into a language that happens to use pitch to convey lexical meaning. Future studies could examine the link between preexisting factors and BSV related to pitch processing in musicians versus Cantonese speakers to determine the extent to which information processing capacity is shaped by nature rather than nurture.
In the between-condition results, we found that each group engaged unique spatiotemporal distributions to process the differences between music and speech. Nonmusicians had greater BSV supporting speech processing, as compared with music processing (Figure 4). This difference was primarily expressed in several left hemisphere areas at fine timescales. The lateralization of this result is consistent with reports that in musically naive listeners, speech processing is more left-lateralized than music, given the left hemisphere's specialization for temporal processing (see Zatorre et al., 2002, for a review). These findings also suggest that nonmusicians may have greater, locally based information integration supporting speech processing, as compared with music processing (see Vakorin et al., 2011). Unlike for speech, this group's processing of music was right-lateralized, aligning with evidence for right hemisphere specialization for spectral processing (Zatorre et al., 2002).
Cantonese speakers had greater sample entropy for music as compared with speech (Figure 5). This distinction was primarily expressed in the midline posterior cingulate and retrosplenial cingulate cortex at fine timescales. This finding suggests that Cantonese speakers' use of lexical pitch may manifest as greater sample entropy for this cue, as compared with timbre. This finding aligns with the idea that the more familiar one is with a stimulus, the greater the sample entropy associated with processing that stimulus (e.g., familiar vs. unfamiliar faces; Heisz et al., 2012). Finally, we did not detect a difference in musicians' BSV when processing music and speech sounds. This null result is consistent with what we would expect in musicians, as the auditory acuity honed by musicianship may enhance the information integration supporting both pitch and timbral cues in nonspeech and speech signals (i.e., music training benefitting speech processing; see Patel, 2011). Collectively, our data demonstrate that each group processes the distinction between music and speech using a different spatiotemporal network. Furthermore, the activation patterns for each group suggest a gradient of pitch processing capacity, such that the more experience one has with pitch (i.e., musicians > Cantonese > nonmusicians), the greater the sample entropy associated with processing this cue. Namely, nonmusicians had greater sample entropy for speech as compared with music; Cantonese speakers had greater sample entropy for music than speech; musicians had similar levels of sample entropy for both conditions. An analogous gradient was observed in behavioral data for a pitch memory task in Bidelman, Hutka, et al. (2013). This gradient effect suggests that musicianship hones more than just spectral acuity (unlike in Cantonese speakers and nonmusicians) and is thus associated with greater information integration supporting both pitch and timbre processing. Cantonese speakers use pitch only in a lexical context and thus have less information integration than musicians, but still more than nonmusicians.
Comparing MSE Results to the Spectral Analysis Results
The MSE analyses yielded some unique information that was not obtained in the spectral analyses, as well as data that were complementary to the spectral analysis. Between-group comparisons of sample entropy revealed that musicians had greater brain signal complexity than tone language speakers, across all conditions. In contrast, spectral analyses revealed that musicians' processing of all conditions drew more heavily upon low theta/alpha (4–12 Hz) frequencies than in the other groups. Low frequencies of the EEG have traditionally been interpreted as reflecting long-range neural integration (von Stein & Sarnthein, 2000). Both the MSE and spectral results were also observed in similar neural regions. Collectively, both types of analyses suggest long-range and more “global” processing of auditory stimuli in musicians compared with tone language speakers or nonmusicians. Indeed, the observation that entropy at longer timescales is higher whenever low frequencies predominate suggests a close relationship between MSE and the power spectral density (PSD). We have noted elsewhere, however, that MSE is more dependent on higher-order relations of the signal that are not present in measures of spectral density (McIntosh et al., 2008).
This global processing aligns with multiple neuroimaging findings in which musicians have regional anatomical differences that could facilitate interhemispheric communication, as compared with nonmusicians. For example, musicians—relative to nonmusicians—have a larger anterior corpus callosum, which is responsible for such interhemispheric communications and connecting premotor, supplementary motor, and motor cortices (Schlaug, Jancke, Huang, & Steinmetz, 1995). Numerous studies have since found differences in the corpus callosum between musicians and nonmusicians (e.g., Steele, Bailey, Zatorre, & Penhune, 2013; Hyde et al., 2009; Schlaug et al., 2009; Schlaug, Norton, Overy, & Winner, 2005), particularly in regions connecting motor areas (Schlaug et al., 2005, 2009). These differences may be honed by the bimanual coordination related to playing an instrument (Moore, Schaefer, Bastin, Roberts, & Overy, 2014).
Between-condition comparison of sample entropy revealed that each group showed unique spatiotemporal distributions in their response to processing music and speech. Nonmusicians had greater BSV for speech processing than music processing at fine timescales in several left hemisphere areas (e.g., anterior insula, centrolateral and dorsomedial pFC, frontal polar area). The spectral data revealed beta and gamma frequency activity when processing speech (as compared with music) in similar neural regions as found in the MSE analysis. High-frequency activity has been associated with local perceptual processing (von Stein & Sarnthein, 2000) and is in accordance with the fine timescale (i.e., local) activation observed in our MSE analysis (Vakorin et al., 2011).
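The correspondence between timescale and frequency described above can be illustrated with the coarse-graining step commonly used in MSE, in which a signal is averaged over consecutive, non-overlapping windows before entropy is estimated at each scale. The sketch below is an assumption-laden illustration, not the study's pipeline: the sampling rate `fs`, the scale of 10, and the two sinusoids standing in for theta and gamma activity are all hypothetical.

```python
import numpy as np

def coarse_grain(x, scale):
    """Average consecutive, non-overlapping windows of `scale` samples."""
    x = np.asarray(x, dtype=float)
    n_windows = x.size // scale
    if n_windows == 0:
        raise ValueError("scale exceeds the length of the series")
    return x[: n_windows * scale].reshape(n_windows, scale).mean(axis=1)

fs = 500                               # hypothetical sampling rate (Hz)
t = np.arange(2 * fs) / fs             # 2 s of samples
theta = np.sin(2 * np.pi * 5 * t)      # slow (5 Hz) oscillation
gamma = np.sin(2 * np.pi * 40 * t)     # fast (40 Hz) oscillation

scale = 10                             # hypothetical coarse timescale
theta_retained = coarse_grain(theta, scale).var() / theta.var()
gamma_retained = coarse_grain(gamma, scale).var() / gamma.var()
print(f"variance retained at scale {scale}: "
      f"theta={theta_retained:.2f}, gamma={gamma_retained:.2f}")
```

Because window averaging behaves like a low-pass filter, most of the slow oscillation survives at the coarse scale while most of the fast oscillation is averaged away. This is one reason fine timescales tend to track high-frequency content and coarse timescales tend to track low-frequency content.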
The spectral data characterizing the nonmusician effect differed from the MSE analyses via the results for the music condition. Specifically, low-frequency (theta) activation was associated with music processing in many of the same regions that expressed higher frequencies when processing speech. This suggests that nonmusicians may utilize longer-range neural integration to process music (von Stein & Sarnthein, 2000). However, this difference was not reflected in the MSE analysis (i.e., no increase in sample entropy at coarse timescales for the music condition), suggesting that nonmusicians do not have less information integration for music, relative to speech. This is plausible, as nonmusicians may have experience casually listening to/processing music (see the note citing Bigand & Poulin-Charronnat, 2006), but not the precise pitch experience present in musicians or Cantonese speakers.
In the MSE results for the Cantonese speakers, there was greater sample entropy for music as compared with speech, a difference that was primarily expressed at fine timescales in midline regions. Similarly, the spectral data showed that processing of music, as compared with speech, was associated with beta and gamma frequencies in similar neural regions as in the MSE results. Both the fine timescale and high-frequency activity suggest that the processing of music versus speech in Cantonese speakers relies on locally (rather than globally) distributed networks (Vakorin et al., 2011; von Stein & Sarnthein, 2000). There was also low-frequency (i.e., theta) activation associated with processing music, particularly in several right hemisphere areas (e.g., anterior insula, ventral temporal cortex, and fusiform gyrus), and with processing speech in the bilateral medial premotor cortex. This low-frequency activity may suggest that Cantonese speakers utilize long-range neural integration to process music and speech (von Stein & Sarnthein, 2000). This does not align with either the local complexity supporting pitch processing, as suggested in the MSE data, or the local communication supporting such processing, as suggested in the high-frequency spectral data. The global versus local nature of the neural networks supporting music and speech processing in Cantonese speakers could be clarified in future studies.
Comparisons to ERPs
The EEG time series analyzed here was examined in our previous ERP study (Hutka et al., 2015). MMNs (e.g., Näätänen, Paavilainen, Rinne, & Alho, 2007) were measured in these same groups to index early, automatic cortical discrimination of music and speech sounds. In that study, only musicians showed an enhanced MMN response to both music and speech, aligning with the current between-group effects. Our collective findings suggest that musicians show greater automatic processing (Hutka et al., 2015) and information integration (present study) supporting the automatic processing of both music and speech, as compared with Cantonese speakers and controls.
However, we previously failed to find a difference in MMN amplitude between music and speech stimuli for any group (Hutka et al., 2015). In the current study, between-condition differences in both sample entropy and spectral characteristics were observed in controls and Cantonese speakers. Furthermore, each group had a unique spatiotemporal distribution in response to music and speech. Despite having lower sample entropy than musicians or nonmusicians across all conditions, Cantonese speakers showed greater sample entropy for music as compared with speech. These data suggest that Cantonese speakers have greater information integration supporting pitch processing, as compared with timbral processing. In contrast, MMNs did not reveal a difference in automatic processing of music versus speech in the Cantonese group (Hutka et al., 2015). The differences between the MMN findings and the present results suggest that the nonlinear analyses applied here afforded additional, more fine-grained information about these between-condition effects (see Hutka et al., 2013, for a discussion). That is, the averaging conducted to increase the signal-to-noise ratio in ERP analyses may eliminate important signal variability that carries information about brain functioning (Hutka et al., 2013).
The present data suggest that the use of pitch for musicians versus for tone language speakers is associated with different information processing capacities supporting the automatic processing of pitch. Furthermore, each group's pitch processing was associated with a unique spatiotemporal distribution, suggesting that musicianship and tone language do not share processing resources for pitch, but instead, use different networks. This recruitment of different networks may help explain how similar behavioral pitch discrimination benefits in musicians and Cantonese speakers are not reflected in mean activations in response to pitch (i.e., Hutka et al., 2015). Collectively, these results further elaborate the discussion of music and speech processing in the context of experience-dependent plasticity. These data also serve as a proof of concept of the theoretical premise outlined in Hutka et al. (2013), namely, how applying a nonlinear approach to the study of the music–language association can advance our knowledge of each domain, as well as experience-dependent plasticity in general.
This work was supported by the Ontario Graduate Scholarship (to S. H.), a GRAMMY Foundation grant (to G. M. B.), the Ministry of Economic Development and Innovation of Ontario (to S. M.), and the Natural Sciences and Engineering Research Council (RGPIN-06196-2014 to S. M.).
Reprint requests should be sent to Stefanie Hutka, Rotman Research Institute, Baycrest Centre for Geriatric Care, 3560 Bathurst Street, Toronto, ON M6A 2E1, Canada, or via e-mail: firstname.lastname@example.org.
That is, BSV does not in itself measure changes in connectivity; it simply measures the changes in dynamics associated with different connectivity patterns that might be the product of experience-dependent plasticity.
The 72-region parcellation scheme was meant to reduce the dimensionality of the source map to something more meaningful than the 15,000 vertices generated by Brainstorm (Tadel et al., 2011). This specific parcellation scheme was used to maximize the definitional overlap of the regions with other reported regions in human and macaque data. These regions were mapped based on maximally agreed upon boundaries in the literature (see Kötter & Wanke, 2005).
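As an illustration of the dimensionality reduction described in this note, the sketch below collapses vertex-level source time series into region-level time series. It assumes that each vertex carries an integer region label and that vertices are combined by simple averaging; the note does not state how vertices were combined, so averaging is an assumption. The function `parcellate`, the random labels, and the array sizes are hypothetical and do not use Brainstorm's API.

```python
import numpy as np

def parcellate(source_ts, labels, n_regions):
    """Average vertex time series within each region.

    source_ts : (n_vertices, n_times) array of source activity
    labels    : (n_vertices,) integer region index per vertex, 0..n_regions-1
    Returns a (n_regions, n_times) array of region means.
    """
    source_ts = np.asarray(source_ts, dtype=float)
    labels = np.asarray(labels, dtype=int)
    counts = np.bincount(labels, minlength=n_regions)
    if np.any(counts == 0):
        raise ValueError("every region needs at least one vertex")
    sums = np.zeros((n_regions, source_ts.shape[1]))
    np.add.at(sums, labels, source_ts)   # accumulate each vertex into its region
    return sums / counts[:, None]

# Hypothetical sizes: 15,000 vertices reduced to 72 regions, 50 time points.
rng = np.random.default_rng(0)
labels = rng.permutation(np.arange(15000) % 72)
region_ts = parcellate(rng.standard_normal((15000, 50)), labels, 72)
print(region_ts.shape)
```

Plain averaging can cancel sources of opposite orientation within a region, so practical pipelines often use other summaries; the sketch only shows the change in dimensionality.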
Estimating sample entropy is based on nonlinear dynamics and employs a procedure called time delay embedding, for which we need to specify the embedding dimension. Time delay embedding is a critical step in the nonlinear analysis, but the choice of parameters is essentially based on heuristics. At the same time, Takens' (1981) embedding theorem states that, for reconstructing macrocharacteristics (such as entropy) of a dynamical system underlying an observed time series, the embedding dimension should be relatively large. We set the embedding dimension to m = 5 a priori, as a compromise between the requirements imposed by Takens' theorem and the fact that our time series are not only finite, but also relatively short.
The selection of this similarity criterion was guided by the simulations performed by Richman and Moorman (2000). Using a series of tests, they showed that the reconstructed values of sample entropy were close to the theoretical ones, when the tolerance parameter r was approaching 1, especially for relatively short time series.
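The roles of the embedding dimension m and the tolerance r discussed in these notes can be made concrete with a minimal sketch of sample entropy in the spirit of Richman and Moorman (2000). This is an illustrative implementation, not the code used in the study. The function name `sample_entropy`, the test signals, and the default r = 1.0 (expressed as a fraction of the signal's standard deviation) are assumptions; the exact tolerance used in the analysis is not stated in this note.

```python
import numpy as np

def sample_entropy(x, m=5, r=1.0):
    """Sample entropy: -ln(A / B).

    B counts template pairs of length m that match within the tolerance,
    A counts pairs of length m + 1 that match. The tolerance is r times
    the signal's standard deviation (r = 1.0 is an illustrative placeholder).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    if n <= m + 1:
        raise ValueError("time series is too short for this m")
    tol = r * x.std()

    def count_matches(length):
        # Use the same n - m starting points for both template lengths.
        templates = np.array([x[i:i + length] for i in range(n - m)])
        count = 0
        for i in range(len(templates) - 1):
            # Chebyshev distance to all later templates (no self-matches).
            dist = np.max(np.abs(templates[i + 1:] - templates[i]), axis=1)
            count += int(np.sum(dist <= tol))
        return count

    b = count_matches(m)
    a = count_matches(m + 1)
    if a == 0 or b == 0:
        return float("inf")   # undefined when no matches are found
    return float(-np.log(a / b))

rng = np.random.default_rng(0)
noise = rng.standard_normal(300)                    # irregular signal
sine = np.sin(2 * np.pi * np.arange(300) / 25)      # regular signal
print(sample_entropy(sine), sample_entropy(noise))
```

A regular signal such as the sinusoid should yield lower sample entropy than white noise, because templates that match over m points almost always continue to match at point m + 1. With short series and large m, few template pairs match, which is the practical tension between Takens' theorem and finite data noted above.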
Note that Bigand and Poulin-Charronnat (2006) discuss the large overlap in neural activity in musically trained and untrained listeners, in response to Western musical features (e.g., structure). On the basis of this evidence, one might predict that examining BSV in musicians versus untrained controls while listening to more complex musical excerpts might show a smaller BSV difference than one might initially anticipate.